Nigel Pereira explains why organisations are exploring chaos engineering, and disrupting their own chain of operations to discover the weaknesses within

There were not many people in 2010 who would willingly disrupt their own production environment to find points of failure, but that's exactly what Netflix did, and in doing so gave birth to chaos engineering. A year earlier, in 2009, the popular OTT platform had been migrating to AWS in order to support its ever-growing audience base.

The problem, however, is that cloud infrastructure, with its containers, microservices, orchestrators such as Kubernetes, and distributed architecture, is far more complex than a physical data center. Since the expensive move to the cloud seemed only to multiply the number of ways things could fail, it was time to think outside the box.


What doesn’t kill you makes you stronger

Similar to how the military puts soldiers through extremely chaotic drills in inhospitable environments so that they learn to keep their composure on the battlefield, chaos engineering does the same thing to your production environment.

That's right, it's basically boot camp for DevOps teams: you willingly harm your live systems in order to study the effects and to build resilience. The science behind the madness is that while no single component in the cloud can guarantee 100% uptime, small doses of controlled outages help teams learn what to expect and prepare in advance.

Because cloud architecture exponentially expands the attack surface, and with it the number of places to look for problems, the only real guarantee is that services will fail. With that in mind, the next step is getting your team used to services failing randomly throughout the day so that they learn to deal with outages quickly and effectively.

Think about it: if you work in an environment that is being deliberately sabotaged on a regular basis, any real failure becomes just another day at the office. Over time this builds resilience to the point where the system carries enough redundancy to recover from failures on its own.
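The redundancy argument above can be illustrated with a toy simulation (not any particular tool, and the function name and parameters are invented for this sketch): inject random failures into a replicated service and count how often at least one replica survives.

```python
import random

def run_chaos_drill(replicas: int, kill_prob: float, rounds: int, seed: int = 42) -> int:
    """Simulate periodic failure injection against a replicated service.

    Each round, every replica independently fails with probability
    kill_prob; the service stays up as long as at least one replica
    survives. Returns the number of rounds the service stayed available.
    """
    rng = random.Random(seed)
    available_rounds = 0
    for _ in range(rounds):
        survivors = sum(1 for _ in range(replicas) if rng.random() > kill_prob)
        if survivors > 0:
            available_rounds += 1
    return available_rounds

# More redundancy means more rounds survived under the same failure rate.
single = run_chaos_drill(replicas=1, kill_prob=0.3, rounds=1000)
triple = run_chaos_drill(replicas=3, kill_prob=0.3, rounds=1000)
```

Running the drill with one replica versus three makes the payoff of redundancy concrete: the triple-replica service survives far more of the injected-failure rounds.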

Chaos Monkey


The first chaos engineering tool developed by Netflix was called Chaos Monkey. If you can imagine a monkey running around your server room randomly smashing screens and pulling out wires, that's exactly the idea. Chaos Monkey works by randomly disabling systems and terminating instances in order to test how the remaining systems respond, and to push teams to build services that tolerate such failures. Inspired by the results from Chaos Monkey, Netflix went on to develop a whole host of similar "monkeys" that came to be known as the Simian Army. These include Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and more.

As the names suggest, the Latency Monkey induces artificial delays while Conformity Monkey shuts down instances that don't adhere to best practices. Similarly, Doctor Monkey monitors the CPU load on instances and terminates "unhealthy" ones, Janitor Monkey gets rid of clutter and waste, and Security Monkey looks for violations and vulnerabilities.

The 10-18 Monkey is a bit harder to explain: its name is a nod to localization and internationalization (L10n and i18n), and it looks for configuration problems that affect users in different regions and languages across the world. There's also Chaos Gorilla, which takes down an entire AWS availability zone, and Chaos Kong, which takes out an entire AWS region.

Orchestrated Chaos


While Netflix did release Chaos Monkey under an Apache 2.0 license, there is now a range of open-source chaos engineering tools that are free to use. Chaos Mesh is a chaos engineering platform designed to orchestrate chaos across Kubernetes clusters and is very straightforward to use; it was voted in as a Cloud Native Computing Foundation (CNCF) incubating project in February 2022. It has no external dependencies, so all you need is the Chaos Mesh toolset, which can be installed as a Helm package. Built specifically for Kubernetes, Chaos Mesh uses CustomResourceDefinitions (CRDs) to define chaos experiments.
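Because Chaos Mesh experiments are just Kubernetes custom resources, they can be generated programmatically. The sketch below builds a pod-kill experiment as a plain dictionary; the top-level fields follow Chaos Mesh's documented PodChaos schema, while the experiment name, namespace, and label selector are hypothetical examples:

```python
import json

# Field names follow the Chaos Mesh PodChaos CRD; the metadata and
# selector values here are illustrative, not from a real cluster.
pod_kill_experiment = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "pod-kill-demo", "namespace": "chaos-testing"},
    "spec": {
        "action": "pod-kill",   # terminate the matching pods
        "mode": "one",          # pick a single random victim
        "selector": {
            "namespaces": ["default"],
            "labelSelectors": {"app": "web"},
        },
    },
}

# Serialise for inspection; in practice this would be applied to the
# cluster as YAML via kubectl or the Chaos Mesh dashboard.
manifest = json.dumps(pod_kill_experiment, indent=2)
```

Applying a manifest like this asks Chaos Mesh to kill one pod labelled `app: web` in the `default` namespace, then you watch whether the deployment's remaining replicas absorb the loss.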

Another chaos engineering tool in the CNCF incubator is LitmusChaos, which has a much broader scope than the Kubernetes-specific Chaos Mesh. LitmusChaos can run experiments against Kubernetes applications, virtual machines, bare-metal environments, cloud platforms, and even legacy applications. It can be installed using Helm and provides a number of chaos scenarios in the form of Litmus Workflows. Other chaos engineering tools include Gremlin, a hosted chaos engineering platform provided as a service (SaaS), and Chaos Blade, which focuses on helping DevOps engineers run chaos experiments in the cloud.

Wolff’s Law

German anatomist and surgeon Julius Wolff stated in the 19th century that the bones of a healthy animal will adapt and remodel themselves based on the loads under which they are placed.

This phenomenon is referred to as Wolff's law, and chaos engineering rests on a similar premise. If you want your environment to adapt and remodel itself so that outages and breaches don't affect the business, you need to acclimate it to every possible outage in order to build resilience.


With a background in Linux system administration, Nigel Pereira began his career with Symantec Antivirus Tech Support. He has now been a technology journalist for over 6 years and his interests lie in Cloud Computing, DevOps, AI, and enterprise technologies.


© Copyright Sify Technologies Ltd, 1998-2022. All rights reserved