Breaking the top five myths around chaos engineering

Mikolaj is a software engineer, author and speaker. He wrote Chaos Engineering: Site Reliability Through Controlled Disruption, as well as the open source projects Goldpinger and PowerfulSeal. He is currently leading a Kubernetes SRE team at Bloomberg.

If the only exposure you’ve had to chaos engineering is Chaos Monkey and some flashy blog headlines, it’s easy to paint the whole endeavour as reckless. After all, testing in production is an internet meme, and arguing for it makes for an attractive story. That story turns out to be a double-edged sword, however: a lot of people still associate chaos engineering with extra risk, yet the premise of the practice is, in fact, all about understanding and mitigating risk.

Myth #1: Chaos engineering is testing in production

Chaos engineering is about finding problems in your systems before they find you – through a process of rigorous scientific experimentation. It doesn’t grant you moral absolution from following the same principles that you take on board when releasing any other code. Yes – the holy grail is to be so confident in your system that it makes sense to create your own production outages to demonstrate that it handles them well. But that’s only after every earlier stage has been thoroughly tested, and in some cases (say, life-supporting devices) it might be better to leave well alone.

Myth #2: Chaos engineering is about randomly breaking things

Chaos Monkey was about randomly taking virtual machines down, but we have come a long way since. Now, if you want to verify how reliable your system really is, there is an entire spectrum of approaches available to you.

On one end, you can approach the system as a black box, without understanding its inner workings. Randomness is a useful tool in this context, because it’s easy to get started with and it can produce race conditions and situations that are sometimes very hard to predict (i.e. emergent properties of the system). In a way, it’s a bit like the practice of fuzzing – applying random (but valid) inputs in the hope of finding a bug. It’s a good starting point that typically requires little upfront investment.
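To make the black-box end of the spectrum concrete, here is a minimal sketch of one round of random failure injection. Everything in it is an assumption made for illustration: the instance names, the pick_victims and run_random_round helpers, and the "at least min_healthy survivors" check are hypothetical, and a real tool would terminate actual machines or pods rather than filter a list.

```python
import random


def pick_victims(instances, kill_probability, rng):
    """Choose instances to terminate, each independently with the given probability."""
    return [name for name in instances if rng.random() < kill_probability]


def run_random_round(instances, kill_probability, min_healthy, rng):
    """Simulate one round of random termination and report what is left.

    Nothing is really terminated here; the function only computes which
    instances a random experiment would hit and whether enough survive.
    """
    victims = pick_victims(instances, kill_probability, rng)
    survivors = [name for name in instances if name not in victims]
    return {
        "victims": victims,
        "survivors": survivors,
        "ok": len(survivors) >= min_healthy,
    }


if __name__ == "__main__":
    # Hypothetical fleet; a fixed seed keeps the "random" round repeatable.
    fleet = ["vm-1", "vm-2", "vm-3", "vm-4"]
    print(run_random_round(fleet, kill_probability=0.25, min_healthy=2,
                           rng=random.Random(42)))
```

Passing in a seeded random.Random is a deliberate choice: when a random round does uncover a problem, you want to be able to replay exactly the same sequence of failures.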

This approach, however, has its limitations. There are only so many universal things that can happen to a program – crashing, running out of resources, etc. – and setting up good observability when you’re not sure what you’re looking for often proves challenging. At the other end of the spectrum, you understand the system well and can design fine-grained experiments that target the specific weaknesses you want to test.

An effective chaos engineer works with the entire spectrum, picking the right tool for the job.

Myth #3: Chaos engineering is only for large, modern distributed systems

It’s easy to get the impression that chaos engineering only works for projects running on a massive scale, in the cloud, with all the latest bells and whistles. After all, that was the case at Netflix, where the term was coined.

But when you look a little closer, the same methodology applies equally well to any kind of system – whether it’s a computer or not (teams are systems built with people)! Even if you own a monolith or a legacy service that you don’t understand very well, experimenting on it using the chaos engineering framework can help you gain confidence that it will ultimately survive difficult conditions.

Myth #4: We don’t need more chaos – we already have plenty!

Invariably, every time I speak about chaos engineering, someone cracks this half-joke. Granted, chaos is in its name, but the goal of the entire methodology is to reduce the amount of inherent chaos, not increase it!

Chaos engineering is about injecting a controlled and well-understood failure into the system, while controlling as many other variables as possible, to confirm that the system reacts in the way that we’re expecting it to. The goal is to find issues before your clients do.
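The shape of such a controlled experiment can be sketched in a few lines. This is a toy under stated assumptions, not a real framework: the Experiment class, its steady_state, inject and rollback callables, and the result strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One controlled experiment: a hypothesis plus one well-understood failure."""
    name: str
    steady_state: Callable[[], bool]  # True while the system behaves as expected
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # undoes the failure


def run_experiment(exp: Experiment) -> str:
    # Only inject into a system that is healthy to begin with; otherwise the
    # result would say nothing about the failure being tested.
    if not exp.steady_state():
        return "aborted: not in steady state"
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        # Always undo the failure, even if the check itself raises.
        exp.rollback()
    return "hypothesis held" if held else "weakness found"


if __name__ == "__main__":
    # Hypothetical service that counts as healthy while any replica is up.
    replicas = {"a": True, "b": True}
    exp = Experiment(
        name="lose one replica",
        steady_state=lambda: any(replicas.values()),
        inject=lambda: replicas.update(a=False),
        rollback=lambda: replicas.update(a=True),
    )
    print(run_experiment(exp))
```

Checking the steady state before injecting anything and rolling back in a finally block are the two parts that keep the experiment controlled: one variable changes, and it is always changed back.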

If you’re worried that your system is fragile, running a scenario that breaks it, then analysing the findings and fixing the problem, is a really good way to worry less and sleep better.

Myth #5: Chaos engineering is only for very mature teams/products

When we speak about injecting failure, many IT teams think about their recent outages and conclude that they’re not mature enough to undertake chaos engineering. The good news is that chaos engineering can offer some of the best ROI around without a huge cost in time or resources.

The most cost-effective way is to start small and add more as your systems and teams grow. And the mindset of building systems that you know for sure will be tested with failure scenarios makes for stronger software and developers. So don’t wait until your product is perfect – start early and reap the benefits as you go.

Editor’s note: This article is in association with Conf42. Find out more about Chaos Engineering at Conf42: Cloud Native 2021 on 29th April at 5pm (GMT) – subscribe for free at https://www.conf42.com/cloud2021#register
