December 1, 2019

Day 1 - Alerting with Prometheus

By: Julien Pivotto (@roidelapluie)
Edited By: Jeff Hoelter (@jeffhoelter)

I have been on a metrics journey for a couple of years, starting from Graphite and ending with Prometheus. When I look around, the trend is the same everywhere: everyone collects more and more metrics and numbers.

What makes metrics successful is the quality of the data. This means that in a metrics-oriented world it is important to keep a few principles in mind.

Prometheus has been part of my daily life for two and a half years, and the first version I really used in production was 2.0.0-beta5. I am always happy to share what I have achieved, contributing back to the project or helping other Prometheus users in the community.

What is surprising when you come to Prometheus as a new user is what you get out of the box. Yes, you have a powerful engine, but there are no built-in checks and not a lot of helpers. You can use Prometheus however you want, so it can become anything. In the future I hope that Prometheus will gain some maturity and give users good pointers about which alerts matter and how to get started. While I have seen attempts to fill this gap, nothing has matured to the point of readiness. Hopefully the community will come together soon to fill it.

Metadata

One core principle of observability is that metadata is important. In Prometheus, this is done via labels. Metrics should have labels that identify their origin (datacenter, cluster, instance), but also what they represent (status code). Being able to easily access this data and make comparisons between these labels is critical to making your metrics actionable and meaningful. Correct labels will enable you to quickly identify where the problems are, what kind of issues they represent, and the business impact of the behaviours you observe.

sum by(datacenter) (http_requests_total{code=~"5.."})

Whether you choose by (including labels) or without (excluding labels) will make your alerts react differently when labels change. We generally tend to use without() to exclude just the level we do not want in our alerts (e.g. without(instance)), retaining the rest.
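
As a rough illustration of the without() style, reusing the http_requests_total series from above (in practice you would usually wrap the counter in rate()), the following keeps every label except instance:

sum without(instance) (http_requests_total{code=~"5.."})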

Collect more than you think you need

One thing to know is that you cannot make up metrics you do not have. It is okay to collect more data than you actually need, because in the future you might make unexpected correlations or link some data to user behaviour. Not every metric has to be used in dashboards or in alerts, but the additional data can come in handy while investigating edge cases.

In our case, we successfully identified faulty network components by looking at the Prometheus node_exporter TCP zero window netstat metric (quite a niche metric), and could find the exact moment in the past when the error started.

Metrics and Monitoring

Metrics monitoring is not the key to everything. You need more layers to fully understand your applications, namely logs and traces. However, metrics monitoring can replace traditional monitoring stacks. There is no need to carry the burden of maintaining a metrics monitoring solution and another "classic" monitoring solution side by side. You can move all of that monitoring into your Prometheus server and it can do everything, including the alerting.

One caveat to metrics is that they can reflect complex situations. By their nature, metrics are not simply "critical" or "ok". They can take a huge range of values with many different meanings. This is a strength of metrics, but also something that can cause issues if alerts are not configured correctly.

Business monitoring is not a new topic, but it has new dimensions in the metrics world. Understanding and observing how applications behave under real end-user workloads, and comparing those values to historical ones, can quickly tell you if there is a problem.

The number of dashboards you have does not really matter if no one is looking at them. Also, having dashboards without any alerting on top of them is a missed opportunity. It is frustrating to see a critical issue on a dashboard that did not generate an alert.

Tips

When alerting in Prometheus, there are a couple of simple tips I can provide. As any metric can become an alert, you often want to limit false positives by including a for duration in your alerts, specifying how long the threshold has to be breached before the alert fires. Another option is to use the _over_time functions, which can also reduce the noise in your alerts.
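
Here is a minimal sketch of both tips; the thresholds and durations are made up for illustration, and error_ratio stands for a hypothetical recording rule:

- alert: high error rate
  # Fire only once the condition has held for 10 minutes.
  expr: sum without(instance) (rate(http_requests_total{code=~"5.."}[5m])) > 5
  for: 10m
- alert: high error rate, smoothed
  # Alternative: smooth the signal itself with an _over_time function.
  expr: avg_over_time(error_ratio[10m]) > 0.05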

In Prometheus, alerts and recording rules are computed in groups. All the rules in a group are run sequentially. This is important because it enables you to compute values, such as dynamic thresholds or exclusion lists, and use them immediately in the alerts that follow within the same group:

groups:
  - name: disk space
    rules:
      # A constant series acting as an exclusion list: one entry per
      # instance/mountpoint combination we never want to alert on.
      - record: disk_full_exclusion
        expr: vector(1)
        labels:
          instance: 127.0.0.1:9090
          mountpoint: /tmp
      # Evaluated right after the rule above, so the exclusion is
      # available immediately in the same evaluation cycle.
      - alert: a disk is full
        expr: |
          node_filesystem_avail_bytes /
          node_filesystem_size_bytes < .3
          unless on (instance,mountpoint) disk_full_exclusion
        for: 30m
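
The same mechanism also works for thresholds computed on the fly. Here is a rough sketch, assuming the up metric from regular target scraping; the job:up:count rule name and the 0.8 factor are made up for illustration:

groups:
  - name: healthy instances
    rules:
      # Total number of scraped targets per job.
      - record: job:up:count
        expr: count by(job) (up)
      # Evaluated right after the rule above, so the freshly recorded
      # count can be used as a dynamic threshold immediately.
      - alert: too many instances down
        expr: sum by(job) (up) < 0.8 * job:up:count
        for: 5m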

Regarding normal rates, we have found that comparisons with historical data are also effective.

# Assumes an existing http_requests:rate_5m recording rule
# (e.g. rate(http_requests_total[5m])).
- record: http_requests:past:rate_5m
  expr: http_requests:rate_5m offset 21d
  labels:
    when: 21d
- record: http_requests:past:rate_5m
  expr: http_requests:rate_5m offset 14d
  labels:
    when: 14d
- record: http_requests:past:rate_5m
  expr: http_requests:rate_5m offset 7d
  labels:
    when: 7d
- record: http_requests:median:rate_5m
  expr: avg without(when) (bottomk without(when) (1, topk without(when) (2, http_requests:past:rate_5m)))


In this example, we take the median of the values from the last three weeks. The result, http_requests:median:rate_5m, can now be used in the following alerting rules:

[Graph: current usage, normal usage and the alerting threshold, all in one graph.]

- alert: request rate below the norm
  expr: http_requests:rate_5m < .7*http_requests:median:rate_5m

As alerts are computed in Prometheus, they are also written to its datastore as the built-in ALERTS metric. We can use that metric to simulate hysteresis:

[Graph: this should fire just one alert, not 10.]

(avg_over_time(temperature_celcius[5m]) > 27)
or (temperature_celcius > 25 and
count without (alertstate, alertname, priority) (
  ALERTS{
    alertstate="firing",
    alertname="temperature is above threshold"
  }
))

That query in an alerting rule would fire an alert when the average temperature over the last 5 minutes is above 27 degrees, and only stop firing once the temperature drops back to 25 degrees or below, using the built-in ALERTS metric generated by Prometheus.

Alerting with meaningful data

Alerts have values. Sometimes, the value you alert on is not the most useful one. Take an alert on disk space: if you set a threshold on the percentage of free space, what you really want to know in the alert is how much space is left as an absolute value.

groups:
  - name: disk space
    rules:
      - alert: a disk is full
        expr: |
          node_filesystem_avail_bytes and
          node_filesystem_avail_bytes / node_filesystem_size_bytes < .3
          unless node_filesystem_avail_bytes > 100e9
        for: 30m
        annotations:
          description: |
            There is only {{ $value | humanize1024 }}B free in {{ $labels.mountpoint}} on {{$labels.instance}}.

In this example, we put node_filesystem_avail_bytes and in front of our query, so that the number of available bytes becomes the value of the alert, which we can then use in the annotation below.

Missing metrics

One last thing that can be painful is the absence of metrics. Metrics are really nice to have, but once they are gone, the alerts that rely on them silently stop working. Therefore, it is important to be able to see which metrics are missing, and to alert on them.

The first method is the absent() function.

absent(http_requests_total{env="prod"})

That approach has multiple issues:

  • you need to specify the labels you care about; otherwise, as long as a single app exposes one variant of the metric, the alert will not fire (see the example after this list).
  • you do not know precisely which metrics are missing.
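
For the first issue, a workaround is to write one absent() expression per label combination you really care about; the job="api" selector below is purely illustrative:

absent(http_requests_total{env="prod", job="api"})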

We have found out, for some of our critical metrics, that the following
approach was better:

http_requests_total offset 6h unless http_requests_total

That query will give us the exact metrics that were present 6h ago but are not present anymore. This is more helpful than absent(), as it will reveal all the labels you need.
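
As a minimal sketch of wrapping that query into an alerting rule (the 1h for duration and the annotation wording are arbitrary choices):

- alert: metrics have disappeared
  expr: http_requests_total offset 6h unless http_requests_total
  for: 1h
  annotations:
    description: |
      {{ $labels.instance }} was exposing http_requests_total 6 hours ago but is not anymore.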

Wrap-up

These suggestions are just the tip of the iceberg. Metrics monitoring with Prometheus is powerful and should not be seen as 'just providing performance insights'. It provides everything you need to keep just one tool for your monitoring and alerting work, and once you realize that, you can actually simplify how you monitor your applications.
