sysadvent: Day 19 - Into the World of Chaos Engineering

By: Julie Gunderson (@Julie_Gund)
Edited by: Kerim Satirli (@ksatirli)

Intro

I recently left my role as a DevOps Advocate at PagerDuty to join the Gremlin team as a Sr. Reliability Advocate. The past few months have been an immersive experience into the world of Chaos Engineering and all things reliability. That said, my foray into the world of Chaos Engineering started long before joining the Gremlin team.

From my time as a lab researcher, to being a single parent, to dealing with cancer, I have learned that the journey of unpredictability is everywhere. I could never have imagined in college that I would end up doing what I do now. As I reflect on the path I have taken to where I am today, I realize one thing: chaos was always right there with me. My start in tech was as a recruiter and let me tell you: there is no straight line that leads from recruiting to advocacy. I experimented in my career, tried new things, failed more than a few times, learned from my experiences and made tweaks. Being a parent is very similar: most experiences you make along the way fall in one of two camps:mistakes or learning. With cancer, there was, and is a lot of experimenting and learning– even with the brightest of minds, every person’s system handles treatments differently. Luckily, I had, and still have others who mentor me both professionally and personally, people who help me improve along the way, and I learned that chaos is a part of life that can be harnessed for positive change.

Technical systems have a lot of similarities to our life experiences, we think we know how they are going to act, but suddenly a monkey wrench gets thrown into the mix and poof, all bets are off. So what do we do? We experiment, we try new things, we follow the data, we don’t let failure stop us in our tracks, and we learn how to build resiliency in.

We can’t mitigate every possible issue out in the wild, we should be proactive in identifying potential failure modes. We need to prepare folks to handle outages in a calm and efficient manner. We need to remember that there are users on the other end of those ones and zeros. We need to keep our eye on the reliability needle. Most of all, we need to have empathy for our co-workers, and remember that we are all in this together, and that we don’t need to be afraid of failure.

Talking about Chaos in the System

When a system or provider goes down (cough, cough us-east-1), people notice, and they share their frustrations, widely. Long Twitter rants are one thing, the media’s reaction is another: – outages make great headlines, and the old adage of “all press is good press” doesn’t really hold up anymore. Brand awareness is one thing, however, great SEO numbers based off of a headline in the New York Times that calls out a company for being down might not be the best way to go about it.

What is Chaos Engineering

So what is Chaos Engineering, and more importantly: why would you want to engineer Chaos? Chaos Engineering is one of those things that is just unfortunately named. After all, the practice has evolved a lot from the time when Jesse Robbins coined the term GameDays, to the codified processes we have in place today. The word “chaos” can still unfortunately lead to anxiety across the management team(s) of a company. But, fear not, the practice of Chaos Engineering helps us all create those highly reliable systems that the world depends on, it builds a culture of learning, and teaches us all to embrace failure and improve.

Chaos Engineering is the practice of proactively injecting failure into your systems to identify weaknesses. In a world where everyone relies on digital systems in some way, shape, or form, almost all of us have a focus on reliability. After all: the cost of downtime can be astronomical!

My studies started at the University of Idaho in microbiology. I worked as a researcher and studied the effects of carbon dioxide (CO2) and short-term storage success of Chinook salmon milt (spoiler alert– there is no advantage to using CO2). That’s where I learned that effective research requires the use of the scientific method:

Observe the current state
Create a hypothesis
Run experiments in a controlled, consistent environment
Analyze the data
Repeat the experiments and data analyzation
Share the results

In the research process, we focused on one thing at a time, we didn’t introduce all the variables at once, we built on our experiments as we gathered and analyzed the data. For example, we started off with the effects of CO2 and once we had our data we introduced clove oil into the study. Once we understood the effect on Chinook we moved to Sturgeon, and so on.

Similarly, you want to take a scientific approach when identifying weakness in your systems with Chaos Engineering, a key difference is on the system that is currently under study; your technical and social technical systems, vs. CO2 and Chinook salmon milt (also, there are no cool white coats.) With Chaos Engineering you aren’t running around unplugging everything at once, or introducing 100% memory consumption on all of your containers at the same time, you take little steps, starting with a small blast radius and increasing that blast radius so you can understand where the failure had impact.

How do we get there

Metrics

At PagerDuty, I focused on best practices around reducing Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR) of incidents, and then going beyond those metrics to learning and improvement. I often spoke about Chaos Engineering and how through intentionally injecting failure into our systems, we could not only build more reliable systems, we could build a culture of blamelessness, learning, and continuous improvement.

In my time at Gremlin, I have seen a lot of folks get blocked at the very beginning when it comes to metrics such as MTTD and MTTR. Some organizations may not have right monitoring tools in place, or are just at the beginning of the journey into metric collection. It’s okay if everything isn’t perfect here, the fact is you can just pick a place to start; one or two metrics to start collecting to give you a baseline to start measuring improvement from. As far as monitoring is concerned, you can use Chaos Engineering to validate what you do have, and make improvements from there.

People

On the people-side of our systems, being prepared to handle incidents takes practice. Waking up at 2am to a barbershop quartet alert singing “The Server is on Fire” is a blood pressure raising experience, however that stress can be reduced through practice.

For folks who are on-call, it’s important to give them some time to learn the ropes before tossing them into the proverbial fire. Give folks a chance to practice incident response through Chaos Engineering, run GameDays and FireDrills, where people can learn in a safe and controlled environment what the company process looks like in action. This is also a great way to validate your alerting mechanisms and response process. At PagerDuty we had a great joint workshop with Gremlin where people could practice incident response with Chaos Engineering to learn about the different roles and responsibilities and participate in mock incident calls. As a piano player, I had to build the muscle memory needed to memorize Beethovan’s Moonlight Sonata by practicing over, and over, and over for months. Similar to learning a musical instrument, practicing incident response builds the muscle memory needed to reduce the stress on those 2am calls. If I can stress (no pun intended) anything from my experiences in life, it is that repetition and practice are essential elements to handling surprises calmly.

Building a culture of accepting failure as a learning opportunity takes bravery and doesn’t happen overnight. Culture takes practice, empathy, and patience, so make sure to take the time to thank folks for finding bugs, for making mistakes, for accepting feedback, and for the willingness to learn.

Speak the language

As I mentioned before, sometimes we just have things that are unfortunately named. Many of us have the opportunities to attend conferences, read articles and blogs, earn certifications, etc... It’s important to remember that leadership often doesn't have the time to do those things. We as individual contributors, team leaders, engineers, whatever our title may be, need to be well equipped to speak effectively to our audience; Leaders need to understand the message we are trying to convey. I have found that using the phrase “just trust me” isn’t always an effective communication tool. I had to learn how to talk to decision makers and leadership in the terms they used, such as business objectives, business outcomes, Return on Investment (ROI). By communicating the business case I was trying to solve, they could connect the dots to the ROI of adopting and sponsoring new ways of working.

It’s a Wrap

To sum it up, chaos is part of our lives from the moment we are born, from learning to walk to learning to code, and all of the messiness in between. We don’t need to be afraid of experimentation, but we should be thoughtful with our test, and be open to learning. For me personally this next year, I plan on learning to play Bohemian Rhapsody, and professionally, I plan on experimenting with AWS and building a multi-regional application to test ways to be more resilient in the face of outages. Wish me luck, I think I’ll need it on both fronts.

Happy holidays, and may the chaos be with you.

December 19, 2021

Day 19 - Into the World of Chaos Engineering