Forming a Chaos Engineering team

Werner Vogels, Amazon.com CTO, said: “Everything fails all the time.” Although the cloud allows for very high levels of availability, it is our responsibility to build resilient architectures that tolerate failures, such as the outage of a service in an Availability Zone.

To do so, you can leverage the Reliability pillar of the AWS Well-Architected Framework.

To feel confident that your architecture can withstand turbulent conditions and outages, it is a good idea to simulate outages and observe how the architecture behaves. This lets you verify whether the architecture behaves correctly, mitigating the impact perceived by the user, or whether it fails and needs to be adjusted. This is what Chaos Engineering consists of.
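The idea can be illustrated with a minimal, self-contained sketch. It assumes a toy service replicated across three hypothetical zones (`az-a`, `az-b`, `az-c`); the `Zone` class, `serve_with_failover`, and `run_experiment` are illustrative names, not part of any AWS API. The experiment takes one zone offline, replays requests, and reports how many still succeed.

```python
import random
from dataclasses import dataclass


@dataclass
class Zone:
    """A hypothetical availability zone hosting one replica of a service."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{request} served by {self.name}"


def serve_with_failover(zones: list[Zone], request: str) -> str:
    """Try each zone in turn and return the first successful response."""
    for zone in zones:
        try:
            return zone.handle(request)
        except ConnectionError:
            continue  # this zone is down, try the next one
    raise RuntimeError("all zones failed")


def run_experiment(zones: list[Zone], requests: int = 100, seed: int = 0) -> float:
    """Take one randomly chosen zone offline and return the request success rate."""
    rng = random.Random(seed)
    victim = rng.choice(zones)
    victim.healthy = False  # the injected fault
    succeeded = 0
    try:
        for i in range(requests):
            try:
                serve_with_failover(zones, f"req-{i}")
                succeeded += 1
            except RuntimeError:
                pass  # the user would have seen this failure
    finally:
        victim.healthy = True  # always roll the fault back
    return succeeded / requests


if __name__ == "__main__":
    zones = [Zone("az-a"), Zone("az-b"), Zone("az-c")]
    print(f"success rate with one zone down: {run_experiment(zones):.0%}")
```

A success rate below the target tells you the architecture needs adjusting. With a single zone there is nothing to fail over to, so the same experiment reports every request as failed.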

To implement Chaos Engineering you can leverage AWS Fault Injection Simulator (since renamed AWS Fault Injection Service).
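As a rough sketch of what an experiment definition for this service might look like, the function below builds a template that stops one tagged EC2 instance and restarts it after five minutes. The field names follow the experiment template format as commonly documented, but treat them as assumptions and check them against the current API reference. The tag key `chaos-ready`, the target and action names, and the function name are hypothetical, and the role and alarm ARNs must be supplied by you.

```python
def build_stop_instances_template(role_arn: str, alarm_arn: str, tag_value: str) -> dict:
    """Build a request body for an experiment that stops one tagged EC2 instance."""
    return {
        "description": "Stop one tagged EC2 instance and observe recovery",
        "roleArn": role_arn,  # IAM role the service assumes to run the actions
        "targets": {
            "oneInstance": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": tag_value},  # hypothetical tag key
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        "actions": {
            "stopInstance": {  # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }
```

With AWS credentials configured, a dictionary like this could be passed to boto3 as `boto3.client("fis").create_experiment_template(**template)`. The stop condition is the important design choice here: it ties the experiment to an alarm so that it ends as soon as user impact is detected.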

There are also several open-source tools for Chaos Engineering, such as Netflix's Chaos Monkey.

Practice chaos engineering in test environments before simulating failures in production, to avoid causing disruption.

This is an advanced recommendation: introducing failures to test resiliency is not for everyone.
Implement this recommendation if:
– You are operating in an environment with strong availability requirements, or you have to prove to someone else that your environment can withstand a failure
– You have a skilled security team performing these tests
Understand that mistakes in this type of testing can cause failures.
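Because mistakes in these tests can cause real failures, it helps to wrap every experiment in guardrails. The following is a minimal sketch, not a definitive implementation. It assumes experiments are allowed only in environments named `test` or `staging`, that error-rate samples arrive as an iterable of floats, and that a 5% error rate is the abort threshold. All names and values here are illustrative.

```python
from typing import Callable, Iterable

ALLOWED_ENVIRONMENTS = {"test", "staging"}  # assumption: production is excluded by default


class ExperimentAborted(Exception):
    """Raised when a guardrail stops the experiment."""


def run_guarded(
    environment: str,
    inject: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    max_error_rate: float = 0.05,
) -> int:
    """Inject a fault, watch error-rate samples, and always roll back.

    Returns the number of samples observed. Raises ExperimentAborted if the
    environment is not allowed or if a sample exceeds max_error_rate.
    """
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ExperimentAborted(f"refusing to run in {environment!r}")
    inject()
    observed = 0
    try:
        for rate in error_rates:
            observed += 1
            if rate > max_error_rate:
                raise ExperimentAborted(f"error rate {rate:.1%} exceeded limit")
    finally:
        rollback()  # runs on normal completion and on abort
    return observed
```

The environment check runs before any fault is injected, and the `finally` clause ensures the fault is removed even when the experiment aborts partway through.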

re:Invent talk about chaos engineering

View re:Invent Talk: Testing resiliency using chaos engineering

re:Invent talk about AWS Fault Injection Simulator

View re:Invent Talk: AWS Fault Injection Simulator: Fully managed chaos engineering service