December 20, 2016

Day 20 - How to set and monitor SLAs

Written by: Emily Chang (emily@datadoghq.com)
Edited by: Ben Cotton (@funnelfiasco)

SLAs give concrete form to a worthy but amorphous goal: you should always be trying to improve the performance and reliability of your services. If you’re maintaining an SLA, collecting and monitoring the right metrics can help you set goals that are meant to improve performance, rather than simply policing it. In this post we’ll walk through the process of collecting data to define reasonable SLAs, and creating dashboards and alerts to help you monitor and maintain them over time.

The ABCs of SLAs, SLOs, and SLIs

Before we go any further, let us first define what the term SLA means within the context of this article. Throughout this post, we will refer to the terms SLA, SLO, and SLI as they are defined in Site Reliability Engineering, a book written by members of Google’s SRE team. In brief:
- SLA: Service Level Agreements are publicly stated or implied contracts with users—either external customers, or another group/team within your organization. The agreement may also outline the economic repercussions (e.g. service credits) that will occur if the service fails to meet the objectives (SLOs) it contains.
- SLO: Service Level Objectives are the target levels of service you aim to deliver, typically measured by one or more Service Level Indicators (SLIs).
- SLI: Service Level Indicators are metrics (such as latency or throughput) that indicate how well a service is performing.
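
To make these definitions concrete, here is a minimal sketch (in Python, with entirely hypothetical names and values) of how an SLO might be expressed as a target on an SLI; an SLA would then bundle one or more such objectives together with the consequences of missing them:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """A target level of service, measured by a single SLI."""
    sli_name: str      # the indicator being measured, e.g. request latency in ms
    percentile: float  # which part of the distribution we care about
    threshold: float   # the target value for that percentile
    window: str        # the evaluation window

# Example: "99 percent of requests complete in under 100 ms in any calendar month"
api_latency_slo = SLO(
    sli_name="request_latency_ms",
    percentile=99.0,
    threshold=100.0,
    window="calendar month",
)
```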

In the next section, we will explore the process of collecting and analyzing key SLI metrics that will help us define reasonable SLAs and SLOs.

Collect data to (re)define SLAs and SLOs

Infrastructure monitoring

Maintaining an SLA is difficult, if not impossible, without excellent visibility into your systems and applications. Therefore, the first step to maintaining an SLA is to deploy a monitoring platform that makes all of your systems and applications observable, with no gaps. Every one of your hosts and services should be submitting metrics to your monitoring platform so that when there is a degradation, you can spot the problem immediately and diagnose the cause quickly.

Whether you are interested in defining external, user-facing SLAs or internal SLOs, you should collect as much data as you can, analyze the data to see what standards you’re currently achieving, and set reasonable goals from there. Even if you’re not able to set your own SLAs, gathering historical performance data may help you make an argument for redefining more reasonable objectives. Generally, there are two types of data you’ll want to collect: customer-facing data (availability/uptime, service response time), and internal metrics (internal application latency).

Collect user-facing metrics to define external SLAs

Synthetic monitoring tools like Pingdom and Catchpoint are widely used to measure the availability and performance of various user-facing services (video load time, transaction response time, etc.). To supplement this data, you’ll probably also want to use an application performance monitoring (APM) tool to analyze the user-facing performance of each of your applications, broken down by its subcomponents.

When it comes to assessing performance metrics, it’s often not enough to simply look at the average values—you need to look at the entire distribution to gain more accurate insights. This is explained in more detail in Site Reliability Engineering: “Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes: a high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while using the 50th percentile (also known as the median) emphasizes the typical case.”
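
As a quick illustration, the sketch below (plain Python with NumPy and made-up latency samples) computes the median, 95th, and 99th percentiles alongside the mean; on skewed data the mean can look healthy while the tail of the distribution does not:

```python
import numpy as np

# Hypothetical request latencies in milliseconds: mostly fast, with a slow 1% tail.
latencies_ms = np.concatenate([
    np.random.normal(loc=80, scale=10, size=9_900),
    np.random.normal(loc=600, scale=100, size=100),
])

print(f"mean: {latencies_ms.mean():.1f} ms")
print(f"p50:  {np.percentile(latencies_ms, 50):.1f} ms")  # the typical case
print(f"p95:  {np.percentile(latencies_ms, 95):.1f} ms")
print(f"p99:  {np.percentile(latencies_ms, 99):.1f} ms")  # a plausible worst case
```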

For example, let’s say we want to define an SLA that contains the following SLO: in any calendar month, the user-facing API service will return 99 percent of requests in less than 100 milliseconds. To determine whether this is a reasonable objective, we used Datadog APM to track the distribution of API request latency over the past month, as shown in the screenshot below.

[Screenshot: request latency distribution]

In this example, the distribution indicates that 99 percent of requests were completed in under 161 ms over the past month. This suggests that it may be difficult to fulfill the previously stated SLO without some backend performance enhancements. But you’d probably be able to meet a 250-ms SLO, assuming this month is fairly representative. If you have the data available, it may also be a good idea to query the latency distribution and graph it over longer periods of time, to identify seasonal trends or long-term variations in performance.
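
If you can export the raw latency samples for the month, a short sketch like this one (reusing the kind of hypothetical samples from the previous example) tells you directly what fraction of requests would have met a proposed threshold, which is another way to sanity-check an SLO before committing to it:

```python
import numpy as np

def slo_compliance(latencies_ms, threshold_ms):
    """Return the percentage of requests that completed under threshold_ms."""
    return 100.0 * (np.asarray(latencies_ms) < threshold_ms).mean()

# Hypothetical month of latency samples, as in the previous sketch.
latencies_ms = np.concatenate([
    np.random.normal(80, 10, 9_900),
    np.random.normal(600, 100, 100),
])

for threshold in (100, 250):
    print(f"{slo_compliance(latencies_ms, threshold):.2f}% of requests under {threshold} ms")
```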

Analyze subcomponent metrics to define internal SLOs

Monitoring customer-facing metrics is important for obvious reasons, but collecting and assessing metrics that impact internal services can be just as crucial. Unlike a user-facing SLA, internal SLO violations do not necessarily result in external or economic repercussions. However, SLOs still serve an important purpose—for example, an SLO can help establish expectations between teams within the same organization (such as how long it takes to execute a query that another team’s service depends on).

Below, we captured and graphed an internal application’s average, 95th percentile, and maximum response times over the past month. By collecting and graphing these metrics over a substantial window of time (the past month), we can identify patterns and trends in behavior, and use this information to assess the ongoing viability of our SLOs.

[Screenshot: response time histogram metrics (average, 95th percentile, max) graphed over the past month]

For example, the graph above indicates that the application’s response time was maxing out around 1 second, while averaging about 400 ms. Assuming you have leeway to set your own SLO, looking at the full range of values should help guide you toward a more informed and supportable objective.
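
If you prefer to run the same analysis offline, here is a minimal sketch (hypothetical data, using pandas) of the idea: bucket timestamped response times by day and compute the average, 95th percentile, and maximum for each bucket to see how much headroom a candidate objective leaves.

```python
import numpy as np
import pandas as pd

# Hypothetical response times in seconds, sampled once a minute for 30 days.
idx = pd.date_range("2016-11-01", periods=30 * 24 * 60, freq="min")
response_s = pd.Series(np.random.gamma(shape=2.0, scale=0.2, size=len(idx)), index=idx)

daily = pd.DataFrame({
    "avg": response_s.resample("D").mean(),
    "p95": response_s.resample("D").quantile(0.95),
    "max": response_s.resample("D").max(),
})
print(daily.head())
```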

Internal SLOs can also serve as more stringent versions of external SLAs. In the case of the graph above, if the external SLA contains an SLO that aims to fulfill 95 percent of requests in under 2 seconds within any given month, this team might choose to set its internal SLO to 1.5 seconds. This would hopefully leave the alerted individual(s) enough time to investigate and take action before the external SLA is violated.

Create SLA-focused dashboards

Once you’ve collected metrics from internal and external services and used this data to define SLOs and SLAs, it’s time to create dashboards to visualize their performance over time. As outlined in Datadog’s Monitoring 101 series, preparing dashboards before you need them helps you detect and troubleshoot issues more quickly—ideally, before they degrade into more serious slowdowns or outages.

End user-focused dashboards

The dashboard below provides a general overview of high-level information, including the real-time status and response time of an HTTP check that pings a URL every second. Separate widgets display the average response time over the past 5 minutes, and the maximum response time over the past hour.

[Screenshot: SLA overview dashboard]
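
Stripped down to its essentials, the HTTP check behind a dashboard like this amounts to something like the following sketch (hypothetical URL; a real monitoring agent would submit the results as metrics and service checks rather than print them):

```python
import time
import requests

CHECK_URL = "https://api.example.com/health"  # hypothetical endpoint
TIMEOUT_S = 5

def run_http_check():
    """Ping the URL once a second and report status plus response time."""
    while True:
        start = time.monotonic()
        try:
            resp = requests.get(CHECK_URL, timeout=TIMEOUT_S)
            up = resp.status_code < 400
        except requests.RequestException:
            up = False
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"up={up} response_time_ms={elapsed_ms:.0f}")
        time.sleep(1)

if __name__ == "__main__":
    run_http_check()
```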

Incorporating event correlation into your graphs can also help provide additional context for troubleshooting. As Ben Maurer has explained, Facebook noticed that it recorded fewer internal SLA violations during certain time periods—specifically, when employees were not releasing code.

To see whether your SLA and SLO violations are correlated with code releases or other occurrences, you may want to overlay them as events on your metric graphs. In the screenshot above, the pink bar on the timeseries graph indicates that a code release occurred, which may have something to do with the spike in page load response time shortly thereafter. We also included a graph that compares today’s average response time to the previous day’s.

Dashboards that dive beneath the surface

While the previous example provided a general overview of user-facing performance, you should also create more comprehensive dashboards to identify and correlate potential issues across subcomponents of your applications.

In the example shown below, we can see a mixture of customer-facing data (API endpoint errors, slowest page load times), and metrics from the relevant underlying components (HAProxy response times, Gunicorn errors). In each graph, code releases have been overlaid as pink bars for correlation purposes.

[Screenshot: SLA dashboard with underlying components]

Alerting on SLAs

For many businesses, not meeting their SLAs is almost as serious a problem as downtime. So in addition to creating informative dashboards, you should set up alerts to trigger at increasing levels of severity as metrics approach internal and external SLO thresholds.

Classify your alerts by urgency

Effective alerting assigns an appropriate notification method based on the level of urgency. Datadog’s Monitoring 101 post on alerting has detailed guidelines for determining the best method of notification (record, notification, or page).

As mentioned earlier, you may want to set an internal SLO that is more aggressive than the objectives in your external SLA, and alert on that value. Any time a metric crosses a “warning” threshold (approaching the internal SLO threshold), it should trigger a lighter notification, such as an email or a chat room message. However, if the internal SLO is violated, you should page the person(s) responsible, which will ideally leave them enough time to address the situation before the external SLA is breached.
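
As a rough sketch of that escalation logic (with hypothetical thresholds and hard-coded notification methods), a warning notification fires as the metric approaches the internal SLO, and a page fires once the internal SLO itself is breached:

```python
# Hypothetical thresholds for p95 response time, in milliseconds.
EXTERNAL_SLA_MS = 2000   # what was promised to customers
INTERNAL_SLO_MS = 1500   # stricter internal objective
WARNING_MS = 1200        # "getting close" threshold

def classify(p95_ms):
    """Map a measured p95 latency to a notification method."""
    if p95_ms >= INTERNAL_SLO_MS:
        return "page"          # wake someone up before the external SLA is at risk
    if p95_ms >= WARNING_MS:
        return "notification"  # email or chat message
    return "record"            # keep it for later investigation

for sample_ms in (900, 1300, 1700):
    print(sample_ms, "->", classify(sample_ms))
```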

Watch your resources

In addition to alerting on SLO thresholds, you should also collect comprehensive metrics on underlying resources, everything from low-level components (disks, memory) to higher-level ones (databases, microservices). If these go unchecked, they can degrade into a state where they negatively impact the end-user experience.

Degradations of resource metrics (for example, database nodes running low on disk space) may not immediately impact customers, but they should be addressed before they ripple into more serious consequences. In this case, the appropriate alert would be a notification (email, chat channel), so that someone can prioritize it in the near future (unless the disk space is very low, in which case it is probably an urgent problem). Whenever a less serious resource issue (for example, an increase in database replication errors) is detected, it should be saved as a record. Even if it eventually resolves on its own, these records can be useful later on, if further investigation is required.
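
A resource check for the disk space example might look like this sketch (hypothetical mount point and thresholds, standard library only); the same pattern applies to most underlying resources:

```python
import shutil

MOUNT = "/var/lib/postgresql"  # hypothetical data volume
WARN_FREE_PCT = 20             # someone should look at this soon
CRIT_FREE_PCT = 5              # this is now urgent

def check_disk(path):
    """Return a notification method and the percentage of free space on path."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < CRIT_FREE_PCT:
        return "page", free_pct
    if free_pct < WARN_FREE_PCT:
        return "notification", free_pct
    return "record", free_pct

severity, free_pct = check_disk(MOUNT)
print(f"{MOUNT}: {free_pct:.1f}% free -> {severity}")
```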

Set informative, actionable alerts

Successful alerts should:
- clearly communicate what triggered the alert
- provide actionable steps to address the situation
- include information about how it was resolved in the past, if applicable
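
As an illustration (the content is entirely hypothetical), an alert body that follows these guidelines might be built from a template like this:

```python
ALERT_TEMPLATE = """\
[ALERT] p95 latency on '{service}' is {p95_ms} ms, above the internal SLO of {slo_ms} ms

What triggered it: p95 latency has exceeded the internal SLO for {duration}.
What to do: check the '{service}' dashboard for recent code releases and slow
            endpoints; roll back the latest release if the spike lines up with it.
Past resolutions: recent occurrences of this alert were resolved by a rollback.
"""

print(ALERT_TEMPLATE.format(
    service="user-api", p95_ms=1700, slo_ms=1500, duration="10 minutes",
))
```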

For further reading, check out this other SysAdvent article for sage advice about making alerts more actionable and eliminating unnecessary noise.

Some examples of alerts that you might create to monitor SLAs as well as their underlying resources:
- Your average service response time over the past day exceeds your internal SLO threshold (alert as page)
- An important HTTP check has been failing for the past 2 minutes (alert as page)
- 20 percent of health checks are failing across your HAProxy backends (alert as record if it resolves itself, or email/chat notification if it doesn’t)

Put your SLA strategy in action

If you don’t have a monitoring platform in place, start there. After that, you’ll be ready to set up dashboards and alerts that reflect your SLAs, as well as the key resources and services those SLAs depend on.
