Chaos engineering can be defined as running experiments on a distributed system at scale in order to increase confidence that the system will behave as desired and expected under undesired and unexpected conditions.
The concept was popularised initially by Netflix and its Chaos Monkey approach. As the company put it as far back as 2010: "The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage."
The foundation of chaos engineering lies in controlled experiments; a simple approach follows.
Interlude on controlled experiments with control and experimental groups
A controlled experiment is simply an experiment run under controlled conditions. Unless there is a good reason to do otherwise, change only one variable at a time; otherwise it becomes increasingly difficult to determine which change caused the difference in the results.
One type of controlled experiment is the ‘control and experimental’ group experiment. In this kind of experiment the control group is observed with no variables purposefully modified, while the experimental group has one variable modified at a time, and the output is observed after each change.
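As an illustration of the control and experimental split, here is a minimal sketch in Python. The function name `split_groups`, the `experimental_fraction` parameter and the host names are assumptions made for this example and do not come from any particular chaos tool.

```python
import random


def split_groups(instances, experimental_fraction=0.1, seed=None):
    """Randomly partition instances into (control, experimental).

    Hypothetical helper: the name and parameters are illustrative only.
    """
    instances = list(instances)
    if len(instances) < 2:
        raise ValueError("need at least two instances to form two groups")
    if not 0 < experimental_fraction < 1:
        raise ValueError("experimental_fraction must be between 0 and 1")

    rng = random.Random(seed)  # seeded so a split can be reproduced
    rng.shuffle(instances)

    # Always put at least one instance in the experimental group,
    # and always leave at least one in the control group.
    n_experimental = max(1, int(len(instances) * experimental_fraction))
    n_experimental = min(n_experimental, len(instances) - 1)
    return instances[n_experimental:], instances[:n_experimental]


hosts = [f"host-{i}" for i in range(20)]
control, experimental = split_groups(hosts, experimental_fraction=0.1, seed=1)
```

Picking the experimental group at random, rather than by hand, is one way to avoid unconsciously choosing the instances we already expect to survive.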
A simple approach
Defining a steady state: The focus is on output metrics rather than on system behaviour; the goal is to find out whether the system can continue to provide the expected service, not how it provides that service. It is useful to define thresholds that make for an easy comparison between the control group and the experimental group. Thresholds also allow the comparison to be automated, which makes comparing a large number of metrics easier.
Building the hypothesis around control and experimental groups: Chaos engineering is a mixture of science and engineering, so the experiment is built around two groups: a control group, which is left unaffected by injected events, and an experimental group, which is the target of the variable manipulation. The hypothesis is that the steady state will hold in both groups.
Introducing variables that correspond to undesired/unexpected events: Changing the state of the variables is what makes the experiment. Those variables need to be significant and within reason, and it is of utmost importance to change only one input variable at a time.
Try to disprove the hypothesis: The purpose of the experiment is not to validate the hypothesis, it is to disprove it; we must not fool ourselves, knowing that we are the easiest to fool.
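The steps above can be sketched as an automated threshold comparison between the two groups. This is only an illustration under stated assumptions: the `Threshold` class, the `steady_state_violations` function, the metric names and the numbers are all made up for the example, and a real setup would pull the metrics from whatever monitoring system is in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold: allowed relative deviation from the control group."""
    metric: str
    max_relative_deviation: float  # 0.05 means 5 %


def steady_state_violations(control, experimental, thresholds):
    """Return the metrics for which the experimental group left the steady state.

    An empty list means the experiment failed to disprove the hypothesis.
    """
    violations = []
    for threshold in thresholds:
        baseline = control[threshold.metric]
        observed = experimental[threshold.metric]
        if baseline == 0:
            # Avoid dividing by zero: any non-zero value counts as a deviation.
            deviation = 0.0 if observed == 0 else float("inf")
        else:
            deviation = abs(observed - baseline) / abs(baseline)
        if deviation > threshold.max_relative_deviation:
            violations.append(threshold.metric)
    return violations


# Illustrative numbers only, not real measurements.
control_metrics = {"success_rate": 0.999, "p99_latency_ms": 200.0}
experimental_metrics = {"success_rate": 0.998, "p99_latency_ms": 260.0}
thresholds = [
    Threshold("success_rate", 0.01),
    Threshold("p99_latency_ms", 0.10),
]
violations = steady_state_violations(control_metrics, experimental_metrics, thresholds)
```

Note that the check looks only at output metrics, in line with the steady-state definition: it says whether the service degraded, not why.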
Production means production
The only way of increasing confidence in a system running in production is to experiment on the system running in production, under live production traffic. This may seem odd at first glance, but it is absolutely necessary.
One important aspect that sometimes goes unnoticed is that we must not attack the point where we already know the system will fail. When speaking with upper management I have had answers along the lines of ‘I know that if I unplug the DB the system will break’. Well, that is not chaos engineering; that is just plain foolishness. A chaos experiment injects failure into parts of the system that we are confident will continue to provide the service. Be it by failing over, using HA, or recovering, we expect that the service to the client will not be disrupted, and we try our best to prove ourselves wrong, so we can learn from it.
It is also absolutely necessary to minimise the impact of the experiment on real traffic. Although we are looking for disruption, we are not pursuing interruption or a breach of our SLOs and SLAs (as measured by the SLIs); minimising negative impact is an engineering task.
Interlude on the blast radius
Chaos engineering, or failure injection testing, is not about causing outages; it is about learning from the system being managed. To do so, the changes injected into the system must go from small to big. Inject a small change, observe the output and what it has caused. If we have learned something, splendid; if not, we increase the change and consequently the blast radius. Rinse and repeat. Many people would argue that they know when and where the system will go down, but that is not the intention. The intention is to start small and improve the system incrementally. It is a granular approach, from small to large scale.
Automation is important everywhere, and even more so in these experiments, where it is necessary to:
- Be able to roll back fast enough, with no human interaction or only minimal human interaction
- Be able to examine a large set of output metrics at a glance
- Be able to pinpoint infrastructure weak spots visually
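The small-to-big loop with automated rollback could look roughly like the sketch below. Everything here is an assumption for illustration: `escalate_blast_radius` and its three callbacks (`inject`, `is_steady`, `rollback`) are hypothetical names, and the stand-in callbacks only simulate a system that degrades once a quarter of it is affected.

```python
def escalate_blast_radius(steps, inject, is_steady, rollback):
    """Run the experiment at increasing blast radii.

    Hypothetical driver: `inject(radius)` applies the failure, `is_steady()`
    checks the output metrics, and `rollback()` undoes the injection.
    Returns the first radius at which the steady state broke, or None.
    """
    for radius in steps:
        inject(radius)
        try:
            steady = is_steady()
        finally:
            # Roll back after every step, even if the check itself raised,
            # so that no human has to intervene.
            rollback()
        if not steady:
            return radius  # we learned something: stop escalating
    return None


# Stand-in callbacks that simulate a system, for demonstration only.
state = {"radius": 0.0, "rollbacks": 0}


def fake_inject(radius):
    state["radius"] = radius


def fake_is_steady():
    # Pretend the service degrades once 25 % of instances are affected.
    return state["radius"] < 0.25


def fake_rollback():
    state["radius"] = 0.0
    state["rollbacks"] += 1


broke_at = escalate_blast_radius([0.01, 0.05, 0.25, 0.5], fake_inject, fake_is_steady, fake_rollback)
```

Stopping at the first radius that breaks the steady state keeps the impact on real traffic as small as the experiment allows; the larger steps are never run until the weakness found has been fixed.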
Other sources and good reads
The basics: https://principlesofchaos.org/
An extended introduction: https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
A big list of resources: https://github.com/dastergon/awesome-chaos-engineering