sysadvent: Day 19 - SRE Practice: Error Budgets

By: Nathen Harvey (@nathenharvey)
Edited by: Paul Welch (@pwelch

Site Reliability Engineering (SRE) is a set of principles, practices, and organizational constructs that seek to balance the reliability of a service with the need to continually deliver new features. It is highly likely that your organization utilizes many SRE principles and practices even without having an SRE team.

The Scenario

Let's look at an example of an online retailer. The customer's typical flow through the site is likely familiar to you. A customer browses the catalog, selects one or more items to add to the cart, views the shopping cart, enters payment information, and completes the transaction.

The operations team for this retailer meets regularly to review metrics, discuss incidents, plan work, and such. During a recent review, the team noticed a trend where the time it takes from submitting payment details to receiving a status update (e.g., successful payment or invalid card details) was gradually increasing. The team raised concerned that as this processing time continued to increase revenue would drop off. In short, they were concerned that customers would feel this slowdown and take their business elsewhere.

The team worked together with the development team to diagnose the reasons behind this degradation and made the necessary changes to improve the speed and consistency of the payment processing. This required the teams to take joint ownership of the issues and work together to resolve them.

Fixing the issues required some heroics from the development and operations teams, namely they worked day and night to get a fix in place. Meanwhile some new features that the product owners were pushing to launch took longer than initially anticipated. In the end, the product teams were unhappy about feature velocity and both the development and operation teams were showing some signs of burnout and had trouble understanding why the product owners were not prioritizing the work to hasten payment processing. In short, the issue was resolved but nobody was happy.

On further reflection and discussion of the scenario there were a few things that really stood out to everyone involved.

The outcome was good: payment processing was consistently fast and customers kept buying from the retailer.
The internal frustration was universal: product owners were frustrated with the pace of new development and development teams were frustrated with pressure to deliver new features while working to prevent an impending disaster.
Visibility was lacking
- The product owners did not know latency was increasing
- The work to resolve the latency issues was only visible to the people doing that work.
The product owners agreed that they would have prioritized this work if they had known about the issue (please join me in willingly suspending our belief in hindsight bias when considering this stance).

Error Budgets

Error Budgets provide teams a way to describe the reliability goals of a service, ways to spend that error budget, and the consequences of missing the reliability goals. SRE practices prescribe a data-driven approach. Error budgets are based on actual measures of the system's behavior.

Taking a bottom-up approach, we will define our Error Budget using Service Level Indicators (SLIs), Service Level Objectives (SLOs) and an Error Budget policy.

Service Level Indicators (SLIs)

We start with SLIs. An SLI is a metric that gives some indication about the level of service a customer of your service is experiencing. All systems have latent SLIs. The best way to discover the SLIs for your system is to consider the tasks a customer is trying to accomplish with the system. We might call these paths through the system Critical User Journeys (CUJs).

Working together with everyone who cares about our application, we may identify that purchasing items is a CUJ for our example application. We agree that the payment status page should load quickly. This is the first SLI we have identified. We know that customers will notice if the payment page is not loading quickly and this may lead to fewer sales. However, saying that the page should load quickly is not precise enough for our purposes, we have to answer a few more questions, such as:

How do we measure "quickly"?
When does the timer start and stop?

The best SLI measurements are taken as close to the customer as possible. For example, we may want to start the timer when the customer clicks the "buy" button and end the timer when the resulting page is fully rendered in the customer interface. However, this might require additional instrumentation in the application and consent from the customers to measure accurately. A sufficient proxy for this could be measured at the load balancer for our application. When a POST is made to a particular URL at the load balancer the timer will start, when the full response is sent back to the customer the timer will end. This will clearly miss things like requests that never make it to the load balancer, responses that never make it back to the customer, and any host of things that can go wrong between the customer and the load balancer. But there are significant benefits to using this data including the ease with which we can collect it. Starting simple and iterating for more precision is strongly recommended.

An SLI should always be expressed as a percentage so we may further refine this SLI as follows:

The proportion of HTTP POSTs to /api/pay
that send their entire response 
within X ms measured at the load balancer.

With our metric in hand, we must now agree on a goal, or Service Level Objective, that we are trying to meet.

Service Level Objectives (SLOs)

An SLO is the goal that we set for a given SLI. Looking at our SLI, notice that we did not define how fast (X ms), nor did we define how many POSTs should be considered.

Getting everyone to agree to a reasonable goal can be difficult. Ask a typical product owner "how reliable do you want this feature to be?" and the answer is often "110%, I want the service to be 110% reliable." As awesome as that sounds, we all know that even targeting 100% reliability is not a worthwhile endeavor. The investments required to go from 99% to 99.9% to 99.99% grow exponentially. Meanwhile, a typical customer's ability to notice these improvements will disappear. And that's the key: we should set a target that keeps a typical customer happy with our service.

So, let's agree on some goals for our SLI. Each SLO will be expressed as a percentage over a period of time. So we may set an SLO of 99% of POSTs in the last 28 days have a response time < 500ms.

Every request that takes less than 500ms is said to be within our SLO. Those that take longer than that are outside our SLO and consume our Error Budget.

Error Budget Policy

An error budget represents the acceptable level of unreliability. Our sample SLO gives us an error budget that allows up to 1% of all purchase requests to take longer than 500ms to process.

As a cross-functional team, before our error budget is exhausted, we must agree on a policy. This policy can help inform how to spend the error budget and what the consequences are of over spending the budget.

The current state of the error budget helps inform where engineering effort is focused. With the strictest of interpretations, without any remaining error budget, all engineering work focuses on things that make the system more reliable; no customer-facing features should be built or shipped. When budget is available we build and ship customer-facing features.

There are, of course, less drastic measures that you might agree to when there is no remaining error budget. For example, perhaps you will agree to prioritize some of the remediation items that were identified during your last audit, retrospective, or post-mortem. Maybe your team agrees to put engineering effort into better observability when there is no error budget remaining.

Having excess error budget available is not necessarily a good thing. Exceeding our targets has a number of potential downsides. First, we may inadvertently be resetting our customers' expectations, they will become dependent on this new level of service. Second, building up this excess is a signal from the system that we are not learning enough or shipping new features fast enough. We may use existing budget to focus energy on introducing risky features, paying down technical debt, or injecting latency to help validate our assumptions about what is "fast enough."

Our SLOs typically look back over a rolling time window. Our sample SLO, for example, looks at POSTs over the last 28 days. Doing this allows the recent past to have the most impact on the decisions we make.

You already do this

Applying the practice of Error Budgets, SLIs, and SLOs does not require adding anyone new to your team. In fact, you may already have these practices in place. Using these terms and language helps move the practice from implicit to explicit and allows teams to be more intentional about how to prioritize work.

Looking back at our example retailer, they had some SLIs, SLOs, and even an error budget in place. Let's look at what they were.

SLI

The time it takes to process payment.

SLO

The response times should look and feel similar to previous response times.

Error Budget Policy

When the error budget is consumed the team will work extra hours
to fix the system until it is once again meeting the service level
objectives.  This should take as long as necessary but should not 
interrupt the existing flow of new features to production.

There are a number of problems with these definitions though. They are not defined using actual metrics and data from the system and a human's intuition is required to assess if the objective is being met. The consequences of missing the objective or overspending the budget are not humane to anyone involved and are likely to introduce additional issues with both reliability and feature velocity. These also were not agreed to, discussed, or shared across the various teams responsible for the service.

Intentional, Iterative

Improving your team's practice with SLIs, SLOs, and error budgets requires an intentional, iterative approach. Gather the team of humans that care about the service and its customers, identify some CUJs, discover the SLIs, and have frank discussions about reliability goals and how the team will prioritize work when the budget is consumed and when there is budget available. Use past data as input to the discussions, agree on a future date to revisit the decisions, and build the measurements required to make data-driven decisions. Start simple using the best data you can gather now. Above all else, agree that whatever you put in place first will be wrong. With practice and experience you will identify and implement ways to improve.

Learn More

Google provides a number of freely available SRE Resources, including books and training materials, online at https://google.com/sre.

December 19, 2019

Day 19 - SRE Practice: Error Budgets