Good IT monitoring stands or falls on its precision. Monitoring must inform you at the right time when something is wrong. But just as in statistics, you have to deal with errors produced by your monitoring. In this post, I will talk about two types of errors: false positives and false negatives. And again as in statistics, you can't eliminate these errors completely from your monitoring. The best you can do is manage them and optimize for an acceptable level of errors.
In this article, I share ways to fine-tune notifications from your monitoring system so that you reduce alert noise and ideally receive only those alerts that are really relevant.
Fine-tuning notifications is one of the most important and rewarding activities when it comes to configuring your monitoring system. The impact of a well-defined notification setup is felt immediately. First and foremost, your team will benefit from better focus due to less ‘noise’. This ultimately results in better service levels and higher service level objective (SLO) attainment across the board.
In this article, I use 'alerts' and 'notifications' interchangeably. An 'alert' or 'notification' is your monitoring system letting you know that something is supposedly wrong. Depending on your setup, this may arrive via email, text message or a trouble ticket in PagerDuty.
When I talk about a 'monitoring system', I'm referring to both 'traditional' IT infrastructure and application monitoring tools such as Nagios, Zabbix, or SolarWinds Orion, as well as cloud-native monitoring solutions such as Prometheus, Datadog or Sensu.
Types of Alert Errors
Let’s start by examining two common alert errors: false positives and false negatives.
A false positive is your monitoring tool alerting about an issue when in reality the monitored system is perfectly fine (or has recovered in the meantime). It could be a server or service shown as DOWN because of a short glitch in the network connection, or because a specific service instance, for example Apache, was restarting to rotate its logs.
False negatives are when your monitoring system does not alert you, although something really is wrong. If you're running an on-prem infrastructure and your firewall is down, you want to know about it. If your monitoring system for some reason does not alert you about this, your network may be exposed to all kinds of threats, which can turn into real trouble, really quickly.
However, the costs of these two types of erroneous alerting can differ vastly. Hence, when IT Ops teams weigh an acceptable level of false positives against an acceptable level of false negatives, they will often deem false positives more acceptable: a false negative could be a mission-critical system going down without any alert, while a false positive might just be one unnecessary notification that's quickly deleted from your inbox.
This is why they will err on the side of caution and notify, which is totally understandable. The consequence, however, is that these teams get drowned in meaningless alerts, which increases the risk of overlooking a critical one.
After all, notifications only help when they produce no, or only occasional, false alarms.
In this article, I use Checkmk to show examples of minimizing false positive alerting. You can apply the same philosophy with other tools, though they may vary in implementation and functionality.
1. Don't alert
My first tip to improve monitoring and reduce the noise of false notifications is to simply not send notifications. Seriously!
In Checkmk, notifications are actually optional, and the monitoring system can still be used efficiently without them. Some large organizations have a sort of control room in which an ops team is constantly watching the Checkmk interface. As the team is alerted visually, additional notifications are unnecessary.
These are typically users that can’t risk any downtime of their IT at all, like a stock exchange for example. They use the problem dashboards in Checkmk to immediately see the issue and its detail. As the lists are mostly empty, it is pretty clear when something red pops up on a big dashboard.
But in my opinion, this is rather the exception. Most people use some way of notifying their ops and sysadmin teams, be it through email, SMS or notification plugins for ITSM tools such as ServiceNow, PagerDuty or Splunk OnCall.
2. Give it time
So if you’ve decided you don’t want to go down the ‘no notifications’ route from my previous point, you need to make sure that your notifications are finely tuned to only notify people in case of real problems.
The first thing to tell your monitoring system is: give the monitored system some time.
Some systems produce sporadic and short-lived errors. Of course, what you really should do is investigate and eliminate the reason for these sporadic problems, but you may not have the capacity to chase after all of them.
You can reduce alarms from systems like that in two ways:
- You can simply delay notifications to be sent only after a specified time AND if the system state hasn’t changed back to OK in the meantime.
- You can alert on the number of failed checks. In Checkmk this is the 'Maximum number of check attempts for service' rule set. This makes the monitoring system re-check a defined number of times before triggering a notification. By multiplying the number of check attempts by your defined check interval, you can determine how much time you want to give the system. The default Checkmk interval is 1 minute, but you can configure this differently.
The two options differ slightly in how they treat the monitored system. By counting failed checks, you can be sure that the system has really been re-checked. If you alert based only on elapsed time and you (or someone else) later lengthen the check interval, the system may not have been re-checked at all within that window, and you gain nothing. In Checkmk specifically there are some other factors as well, but those are out of scope for this article. The essential effect is: by giving a system a bit of time to 'recover', you can avoid a bunch of unnecessary notifications.
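The 'maximum number of check attempts' idea can be sketched in a few lines of Python. This is an illustrative model, not the Checkmk implementation: the function name and its parameters are my own, and a real system would track state per service rather than per call.

```python
def should_notify(results, max_attempts=3):
    """Decide whether to notify, given a chronological list of check
    outcomes (True = OK, False = failed).

    Only notify once the last `max_attempts` results are all failures,
    so a single short-lived glitch never triggers an alert.
    """
    if len(results) < max_attempts:
        return False  # not enough history to be sure yet
    # Notify only if every one of the most recent attempts failed.
    return not any(results[-max_attempts:])
```

With the default 1-minute interval and three attempts, a service gets roughly three minutes to recover before anyone is paged: a one-off failure sandwiched between OK results stays silent, while three consecutive failures raise the alarm.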
This method also works great for 'self-healing' systems that should recover on their own. For example, you wouldn't want a notification when a cloud provider kills an instance to upgrade it, while your code automatically deploys a new container instance to handle requests.
Of course, this is not an option for mission-critical, zero-downtime systems that require rapid remediation. For example, a hedge fund that monitors the network link to a derivatives marketplace can't trade if that link goes down. Every second of downtime costs them dearly.
3. On average, you don’t have a problem
Notifications are often triggered by threshold values on utilization metrics (e.g. CPU utilization) which might only exceed the threshold for a short time. As a general rule, such brief peaks are not a problem and should not immediately cause the monitoring system to start notifying people.
For this reason, many check plug-ins offer a configuration option to average their metrics over a longer period (say, 15 minutes) before the alerting thresholds are applied. The metric is first averaged over the defined time period, and only afterwards are the threshold values compared against this average, so temporary peaks are smoothed out.
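The averaging idea can be modeled with a small rolling window. This is a hedged sketch of the concept, not any particular plug-in's implementation; the class name, window size and threshold are illustrative.

```python
from collections import deque


class AveragedThreshold:
    """Apply an alerting threshold to a rolling average of samples
    instead of to each raw sample, so short spikes don't trigger alerts.
    """

    def __init__(self, window_size, threshold):
        # deque with maxlen automatically drops the oldest sample
        self.samples = deque(maxlen=window_size)
        self.threshold = threshold

    def add(self, value):
        """Record a new sample; return True if the averaged value
        now breaches the threshold."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet to judge fairly
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold
```

A single CPU spike to 100% inside a window of otherwise low readings stays below the averaged threshold, while sustained high utilization across the whole window still raises the alert.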
4. Like parents, like children
Imagine the following scenario: You are monitoring a remote data center. You have hundreds of servers in that data center working well and being monitored by your monitoring system. However, the connection to those servers goes through the DC's core switch (forget redundancy for a moment). Now that core switch goes down, and all hell breaks loose. All of a sudden, hundreds of hosts are no longer reachable by your monitoring system and are shown as DOWN. Hundreds of DOWN hosts mean a wave of hundreds of notifications…
But in reality, all those servers are (probably) doing just fine. Either way, we can't tell, because the monitoring system can't reach them while the core switch is acting up. So what do you do about it?
Configure your monitoring system so that it knows about this interdependency, making the server checks dependent on that core switch. In Checkmk you do this with 'parent-child relationships': by declaring host A the 'child' of a 'parent' host B, you tell Checkmk that A depends on B. Checkmk then pauses notifications for the children while the parent is down.
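The suppression logic behind parent-child relationships can be sketched as a simple filter over the set of unreachable hosts. This is my own illustrative model, assuming a flat child-to-parent mapping; Checkmk's actual reachability handling is more involved.

```python
def effective_alerts(down_hosts, parents):
    """Return the hosts that should actually produce notifications.

    down_hosts: set of host names currently unreachable
    parents:    dict mapping a child host to its parent host

    A child whose parent is itself down is suppressed, because its
    state is unknown rather than genuinely DOWN.
    """
    return {
        host for host in down_hosts
        if parents.get(host) not in down_hosts  # parent up (or no parent)
    }
```

In the data-center scenario above, hundreds of servers behind a dead core switch collapse into a single actionable alert for the switch itself.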
5. Avoid alerts on systems that are supposed to be down
There are hundreds of reasons why a system should be down at times. Maybe some systems need to be rebooted regularly, maybe you are doing some maintenance or simply don’t need a system at certain times. What you don’t want is your monitoring system going into panic mode during these times, alerting who-knows-whom if a system is supposed to be down. To do that, you can use ‘Scheduled Downtimes’.
Scheduled downtimes work for entire hosts, but also for individual services. But why would you send individual services into scheduled downtime? More or less for the same reason as hosts: when you know something will be going on that would trigger an unnecessary notification. You might still want your monitoring to keep an eye on the host as a whole, but you are expecting and accepting that some services might breach thresholds for some time. An example could be a nightly cron job that syncs data to long-term storage, causing the disk I/O check to spike. If everything goes back to normal once the sync is through, there's no need to lose sleep over it.
Moreover, you can extend scheduled downtimes to ‘Children’ of a ‘Parent’ host as well.
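A scheduled downtime is essentially a time window during which problem states are acknowledged but not notified. The sketch below models that for a single host with recurring daily windows; the function names and the window format are assumptions for illustration, not the Checkmk API (which also handles one-off downtimes and propagation to child hosts).

```python
from datetime import datetime, time


def in_scheduled_downtime(now, windows):
    """Return True if `now` falls inside any recurring daily downtime
    window, given as (start, end) pairs of datetime.time objects."""
    t = now.time()
    return any(start <= t < end for start, end in windows)


def should_alert(now, problem_detected, windows):
    """Notify only for problems that occur outside scheduled downtime."""
    return problem_detected and not in_scheduled_downtime(now, windows)
```

For the nightly sync example, a downtime window covering the cron job's runtime keeps the disk I/O spike silent, while the same spike at midday would still page someone.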
I hope this short overview has given you some ideas about really simple ways to cut down on the number of meaningless notifications your team is getting from your monitoring system. There are other strategies as well, but this should get you started.