Four chaos engineering mistakes to avoid

According to the Principles of Chaos Engineering:

“Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

These experiments take the form of simulating real-world failures (“chaos variables”), such as hardware or network failures, and examining the impact on the running system. Popularized by Netflix and credited with helping them successfully operate a complex system at massive scale, chaos engineering has been gaining attention and acceptance in the software engineering community.

As more teams start using chaos engineering, we see a few common mistakes that are worth discussing.

Mistake 1: Starting with the tools

Developed by Netflix and almost synonymous with chaos engineering, Chaos Monkey picks servers at random and disables them, simulating real-world failures. While regular, automated chaos experiments are an important way to realize the full value of chaos engineering, they probably aren’t the best place to start.

A chaos variable can be as simple as logging into your cloud console and disabling a server, and this simple act will usually lead to learning much faster than taking the time to evaluate and deploy the many chaos engineering tools available. Regular, automated chaos experiments are an important goal but manual experiments are usually a faster way to start.
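As a minimal sketch of how little tooling a first manual experiment needs, the snippet below only chooses a target at random; the disabling itself is still done by hand in the cloud console. The server names and the `pick_target` helper are hypothetical and stand in for your own inventory.

```python
import random


def pick_target(servers, rng=None):
    """Return one server chosen at random from a non-empty list."""
    if not servers:
        raise ValueError("no servers to choose from")
    if rng is None:
        rng = random.Random()
    return rng.choice(servers)


# Hypothetical inventory; replace with names from your own environment.
servers = ["web-1", "web-2", "web-3"]
target = pick_target(servers)
print(f"Manually disable {target} in the cloud console, then observe the system.")
```

Passing a seeded `random.Random` makes the choice repeatable, which can help when the team wants to rehearse the same experiment twice.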

Mistake 2: Not limiting your blast radius

Netflix’s approach to chaos engineering has become widely known but, like Chaos Monkey, it represents a state of maturity and not a starting point.

There are many ways to limit the blast radius of your chaos experiments:

  • Run the experiment against a non-production environment

  • Only target a subset of your services

  • Run the experiment for a specific, limited time

  • Run the experiment during a period of lower usage

Each of these approaches comes with costs but limits the potential blast radius of your experiments. For example, experimenting in a non-production environment can never teach as much as working with production but cannot impact real users either.
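The four limits listed above can be written down as an explicit guard that is checked before any failure is injected. The following is a sketch under stated assumptions: the `BlastRadius` type, the `may_inject` function, and the example values are all hypothetical, and the low-usage window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class BlastRadius:
    environment: str             # the only environment the experiment may run in
    target_services: frozenset   # subset of services the experiment may touch
    max_duration: timedelta      # hard time limit for the experiment
    window_start: time           # start of the low-usage window
    window_end: time             # end of the low-usage window (same day)


def may_inject(limits, environment, service, started_at, now):
    """Return True only if every blast-radius limit is respected."""
    in_env = environment == limits.environment
    in_scope = service in limits.target_services
    in_time = now - started_at <= limits.max_duration
    in_window = limits.window_start <= now.time() <= limits.window_end
    return in_env and in_scope and in_time and in_window


# Hypothetical limits: staging only, one service, 15 minutes, 02:00 to 04:00.
limits = BlastRadius(
    environment="staging",
    target_services=frozenset({"recommendations"}),
    max_duration=timedelta(minutes=15),
    window_start=time(2, 0),
    window_end=time(4, 0),
)
start = datetime(2024, 1, 10, 2, 30)
print(may_inject(limits, "staging", "recommendations", start, start + timedelta(minutes=5)))
```

Loosening one field at a time, for example switching `environment` to production while keeping the other limits tight, is one way to grow the blast radius gradually.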

Running chaos experiments in production is a fantastic way to learn about your system and an important goal for any serious investment in chaos engineering. But starting out small, in a non-production environment, is a great way to build confidence and experience without making too many enemies by bringing down production.

Mistake 3: Not going in with a hypothesis

A chaos experiment represents a significant investment from the business: on top of the inherent risk to the system, time is needed to plan chaos variables, plan rollbacks, monitor the system, and respond to any incidents. Given the cost, one of our jobs as chaos engineering practitioners is to ensure a return on that investment, and testing hypotheses is an important way to do that.

Creating your first hypothesis could be as simple as asking your team “what are we concerned about?” or pointing at an architecture diagram and asking “what happens if this service fails?”. Try to focus your questioning on areas where uncertainty is greatest and potential impact is highest.

When you have questions without answers, simply add your expectations:

“We hypothesise that if the recommendations service becomes unavailable, customers will still be able to complete their purchases.”

“We hypothesise that if one database in our cluster becomes unavailable, customers will still experience reasonable performance (95% of requests complete in <200 ms).”

Testing these hypotheses will uncover important information about your system and identify areas for improvement. This is a great starting point for your first chaos experiment.

Mistake 4: Not investing in observability

A hypothesis isn’t very useful if we can’t validate it. To validate the hypothesis “if one database in our cluster becomes unavailable, customers will still experience reasonable performance (95% of requests complete in <200 ms)” we need visibility on all requests to our system, their response time and their success rates.
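Once per-request data is available, checking this hypothesis is simple arithmetic. The sketch below assumes each request is recorded as a hypothetical `(duration_ms, succeeded)` pair, and it treats a failed request as not having completed in time; both are illustrative assumptions rather than the output format of any particular monitoring tool.

```python
def hypothesis_holds(requests, threshold_ms=200, required_percent=95):
    """Check that enough requests succeeded in under threshold_ms.

    requests: non-empty list of (duration_ms, succeeded) tuples.
    """
    if not requests:
        raise ValueError("no requests recorded during the experiment")
    within = sum(
        1
        for duration_ms, succeeded in requests
        if succeeded and duration_ms < threshold_ms
    )
    # Integer comparison avoids floating-point rounding at the boundary.
    return within * 100 >= required_percent * len(requests)


# Hypothetical sample: 19 fast successes and 1 slow success out of 20 requests.
sample = [(120, True)] * 19 + [(450, True)]
print(hypothesis_holds(sample))
```

A boolean result only says whether the hypothesis held. Explaining why it failed still requires the richer request data described in the rest of this section.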

While a simple dashboard might help us answer some of these questions, observability is what allows us to dig into the data and answer the important questions that come from chaos experiments, like “why did response time increase during the experiment?”.

Practicing chaos engineering forces you to take the observability of your system seriously if you want to see any return on your chaos experiments. Likewise, investments in observability enable more interesting and expansive chaos experiments, so these two practices naturally co-evolve.

Conclusion

Chaos engineering is not just about breaking things; we want to learn from the experiments we carry out. As John Allspaw puts it:

“Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events.”

With chaos engineering, we have a unique opportunity to plan some of those investments. Follow the advice here to make the most of that opportunity.

