December 6, 2010

Day 6 - Aggregating Monitoring Checks for Better Alerts

This article was written by Jordan Sissel (@jordansissel)

Chances are that your infrastructure has more than one machine performing the same role - frontend (webapps, etc.), backend (databases, log collectors, etc.), middleware (message brokers, proxies, etc.), support services (config management, etc.). Horizontal scaling means you scale by adding more servers, right?

Monitoring a horizontally scaled system can often lead to unintended noise. You might have 10 web servers, but 3 are broken, so you get 3 alerts. Maybe you monitor ssh health on all 50 of your servers, and someone pushed a bad config or there is a bad network connection, and now all 50 servers are alerting you about failures. Further, I believe that monitoring alerts (things that page you) should generally be business-oriented, not single-metric-and-single-server-oriented.

Even if you aren't running systems with duplicate roles, it often makes sense to collect multiple individual health checks into a single alert.

This calls for aggregating your data. Aggregation increases the value of your alerts while decreasing their noise. A noisy monitoring system trains you (and others) to unconsciously ignore it - don't let your monitoring servers cry, "Wolf!". Further, alerting on aggregates means you can do cool things like alerting only if, for example, more than 5% of things are failing - assuming that's safe for your business.
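
For example, here is a rough sketch in ruby of what that 5% rule could look like - the counts and the threshold are made up for illustration, not output from any real tool:

# Sketch only: exit CRITICAL when more than 5% of the aggregated checks are failing.
counts  = { "OK" => 96, "WARNING" => 1, "CRITICAL" => 3, "UNKNOWN" => 0 }
total   = counts.values.reduce(:+)
failing = counts["WARNING"] + counts["CRITICAL"]

if failing.to_f / total > 0.05
  exit 2   # CRITICAL, in nagios plugin exit-code terms
else
  exit 0   # OK
end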

If you could keep your existing checks, not alert on them, but add aggregate checks that do cause alerting, you could maintain your current fine-grained monitoring while improving alerting in both noise level and business relevance. Sounds like a win-win.

There are some solutions for this already, like check_multi, but I didn't want to run additional or duplicate checks. Why? Some checks may take a long time (like a complex Selenium test), may be resource-intensive to run (like verifying a large backup), or may not scale if I have to run 200 of them to answer one aggregated health check.

So, I took a different approach. Nagios already records state in a file on disk, so we can parse that file and produce an aggregate check. Lucky for me, there are already tools that parse this format, so most of the work is done (nagiosity, checkmk_livestatus, and ruby-nagios). I picked ruby-nagios, for no particular reason, and it didn't require much hacking to add what I needed. The result is nagios-manage and a tool, check_check.rb, which ships with the nagios-manage rubygem.
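
To give a feel for the idea, here is a minimal sketch of parsing the status file and tallying service states. This is not the ruby-nagios parser, and the status.dat path is an assumption - check the status_file setting in your nagios.cfg:

#!/usr/bin/env ruby
# Sketch only: tally service states from a nagios status.dat file.
STATUS_FILE = "/var/cache/nagios3/status.dat"   # assumption; see status_file in nagios.cfg
STATES = %w[OK WARNING CRITICAL UNKNOWN]

counts = Hash.new(0)
service = nil

File.foreach(STATUS_FILE) do |line|
  line = line.strip
  if line == "servicestatus {"
    service = {}                      # start of a service block
  elsif service && line == "}"
    counts[STATES[service["current_state"].to_i]] += 1
    service = nil                     # end of the block; count its state
  elsif service && line.include?("=")
    key, value = line.split("=", 2)   # e.g. current_state=2
    service[key] = value
  end
end

puts STATES.map { |s| "#{s}=#{counts[s]}" }.join(" ")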

Notes: Currently, this tool assumes a few things about what I want:

  • Skips checks that have notifications disabled
  • Skips checks that are marked in downtime
  • Uses only hard states. If a check is in a soft critical state, for example, it assumes the previous hard state.

The above constraints let me aggregate the remaining nagios state into the health signal I actually care about - if one server is down for maintenance or has a known failure, I don't want to be alerted about it.
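
Continuing the sketch above, those three rules translate roughly into a filter like this; notifications_enabled, scheduled_downtime_depth, state_type, and last_hard_state are fields found in each servicestatus block:

# Sketch only: decide which state (if any) a parsed servicestatus block contributes.
def countable_state(service)
  return nil if service["notifications_enabled"].to_i == 0    # notifications disabled: skip
  return nil if service["scheduled_downtime_depth"].to_i > 0  # in scheduled downtime: skip
  if service["state_type"].to_i == 1                          # 1 = hard state
    service["current_state"].to_i
  else
    service["last_hard_state"].to_i                           # soft state: use the last hard state
  end
end

# In the earlier loop, the direct tally would become:
#   state = countable_state(service)
#   counts[STATES[state]] += 1 if state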

Examples:

# Summarize all checks for all hosts:
% check_check.rb
OK=110 WARNING=0 CRITICAL=0 UNKNOWN=0 services=// hosts=// 

# Summarize solr checks, when one check is failing:
% check_check.rb -s solrserver
OK=27 WARNING=0 CRITICAL=1 UNKNOWN=0 services=/solrserver/ hosts=// 
Services in CRITICAL:
  frontend1.example.com => solrserver client tests

The result in nagios looks something like this:

[image: nagios check_check]

By the way, the no-argument invocation is super useful for checking the overall health of everything nagios monitors from the command line.

For the solrserver check above, I would create one single check that runs the above check_check.rb command and alerts appropriately. This saves me from having 28 potentially-failing-and-alerting checks. And if that one aggregate check does alert on a critical result, I still have the fine-grained per-server checks, so I can easily drill into what is failing.
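
A sketch of what that single check could look like as nagios object configuration - the plugin path, host name, service template, and contact group here are assumptions, not anything shipped with nagios-manage:

# Sketch only: one aggregate check that runs check_check.rb and pages on failure.
define command {
    command_name    check_check_solrserver
    command_line    /usr/local/bin/check_check.rb -s solrserver
}

define service {
    use                  generic-service
    host_name            nagios1.example.com
    service_description  aggregate: solrserver checks
    check_command        check_check_solrserver
    contact_groups       oncall-pager
}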

Summarizing: I still have monitoring doing fine-grained, individual checks - cpu usage, network activity, server behaviors, etc. - but the individual checks do not cause alerts themselves. This means setting the contacts for your individual checks to something that won't page you. Instead, I use check_check.rb to aggregate results, which reduces noise and collects the relevant data - alerting only on important failures, in a way that doesn't flood the pager and better indicates business-service problems to those on call.
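
One way to do that is to keep notifications enabled on the individual checks (so check_check.rb still counts them) but route them to a contact group that only emails. A sketch, where the contact name, email address, and group are assumptions, and the 24x7 timeperiod and notify-service-by-email / notify-host-by-email are the stock sample definitions:

# Sketch only: a contact group for individual checks that emails but never pages.
define contact {
    contact_name                   monitoring-archive
    alias                          Non-paging archive contact
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-email
    email                          monitoring-archive@example.com
}

define contactgroup {
    contactgroup_name   no-page
    alias               Contacts that never page
    members             monitoring-archive
}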

8 comments:

  1. Great post! I wasn't aware of the various tools you mentioned, but I've created aggregated Nagios checks before and find them to be incredibly useful. For example, I don't care if 1 app server is down; however, I care if >X% of app servers are down.

  2. Cool. I've been looking for something like this.
    However, I'm also looking for something that can dynamically adjust alerting thresholds. Stuff like "alert me when diskspace usage is above 80% but not if there are any critical errors in hostgroup foo and/or more than 5 criticals in servicegroup bar".

    Because the gravity of all issues depends on which other issues are going on. And the service check dependencies model is really basic in nagios. Disk usage is such a typical example where a certain threshold can be worth alerting about, but not if there is more important stuff going on. Until it goes to 99% or so of course (but then still, you have some headroom).

  3. Aggregating is what ITIL calls service component definition.
    That means a Service is a group of other things, and you can monitor it as a whole.
    For instance, the Intranet is composed of those two servers in a cluster, one mysql database, two apaches, and a mail server.

    Osmius - http://osmius.com - (Monitoring Tool) does exactly that. Osmius is completely service-oriented and can also manage SLAs.

  4. Isn't this exactly what the existing check_cluster plugin does? It's distributed with Nagios, it doesn't lead to multiple checks, and it supports aggregating both host and service checks. [ http://nagiosplugins.org/man/check_cluster ]

  5. check_cluster doesn't seem to do anything useful for me. You have to specify the 'return code' of whatever checks you are aggregating, for example:


    % check_cluster -s -d 0,1,1,2,2,2 -c @3:
    CLUSTER CRITICAL: Service cluster: 1 ok, 2 warning, 0 unknown, 3 critical

    Here I said there were 6 results - OK, WARN, WARN, CRIT, CRIT, CRIT - and 'critical >= 3' is critical.

    It doesn't parse the status file like check_check does, meaning I'd still have to parse that file myself. Am I missing something?

  6. @jordan yes, you're missing nagios on-demand macros.

  7. I think what you are looking for is the Nagios Business Processes addon. There is also the nodebrain method for more complex setups, but the BP addon seems to be easier to configure and more focused. I'm using it in our business and it performs admirably, as well as having a nice web GUI.
    http://bp-addon.monitoringexchange.org/

  8. check_check was missing options for warning and critical levels. This patch adds them.

    # diff -Nur nagios-manage-0.5.5/bin/check_check.rb nagios-manage-0.5.5/bin/check_check.rb.new
    --- nagios-manage-0.5.5/bin/check_check.rb 2012-09-17 18:13:26.350856775 +1000
    +++ nagios-manage-0.5.5/bin/check_check.rb.new 2012-09-18 03:44:11.875922212 +1000
    @@ -80,6 +80,7 @@
    def main(args)
    progname = File.basename($0)
    settings = Settings.new
    + thresholds = Hash.new { |h,k| h[k] = 0 }
    settings.nagios_cfg = "/etc/nagios3/nagios.cfg" # debian/ubuntu default

    opts = OptionParser.new do |opts|
    @@ -99,6 +100,18 @@
    "Aggregate only services from hosts matching the given pattern") do |val|
    settings.host_pattern = val
    end
    +
    + thresholds["WARNING"] = 1
    + opts.on("-w NUMBER", "--warning NUMBER",
    + "Exit with a warning state if more than x checks are in warning state (defaults to 1)") do |val|
    + thresholds["WARNING"] = val.to_i
    + end
    +
    + thresholds["CRITICAL"] = 1
    + opts.on("-c NUMBER", "--critical NUMBER",
    + "Exit with a critical state if more than x checks are in critical state (defaults to 1)") do |val|
    + thresholds["CRITICAL"] = val.to_i
    + end
    end # OptionParser.new

    opts.parse!(args)
    @@ -136,6 +149,9 @@
    # Output a summary line
    ["OK", "WARNING", "CRITICAL", "UNKNOWN"].each do | state|
    print "#{state}=#{results[state].length} "
    + if ["WARNING", "CRITICAL"].include?(state)
    + print"(threshold: #{thresholds[state]}) "
    + end
    end
    print "services=/#{settings.service_pattern}/ "
    print "hosts=/#{settings.host_pattern}/ "
    @@ -153,11 +169,11 @@

    exitcode = 0

    - if results["WARNING"].length > 0
    + if results["WARNING"].length >= thresholds["WARNING"]
    exitcode = 1
    end

    - if results["CRITICAL"].length > 0
    + if results["CRITICAL"].length >= thresholds["CRITICAL"]
    exitcode = 2
    end
    return exitcode
