sysadvent: Day 18 - Who Watches the Watcher?

The health of your infrastructure relies on you. You rely on a monitoring system (Nagios, etc) to detect problems and repair them or notify you of them, but what monitors the monitor?

Assuming for the moment that Nagios is your monitor. What happens when Nagios crashes or gets stuck? What happens if a syntax error creeps in and causes Nagios to fail at startup? It may be up, but not behaving correctly; how do you know? Moreover, if you have automated processes for generating and deploying Nagios configs and even upgrading Nagios, how do you know if things are working?

You need notification when your monitoring system is malfunctioning just like you want notification for any other important problem. It's possible you've overlooked this when configuring your monitoring systems, and solving this takes some thought, because not every solution is a complete one.

One option to consider is having Nagios check it's config file for problems on startup, and alerts you (via mail, whatever) on errors. This can be done with nagios -v <configfile>, which exits nonzero if there are config errors. Conveniently, if 'nagios' the command fails for another reason like the package was uninstalled, a dependent library is missing, etc, this also results in a nonzero exit code. This solution fails, however, to alert you when Nagios crashes or hangs.

Another option is to test behavior. One behavior to observe is the modification dates on Nagios' stored results, like status.dat (status_file in nagios.cfg) or the checkresults directory (check_result_path in nagios.cfg). If these items haven't been modified recently, it's possible Nagios is unhappy.

Both of the above solutions are incomplete because they run local to the monitoring host, and if that host fails, you don't get any notification. A solution is to run your monitor on at least two hosts and have them monitor each other. Your second monitor can just be a metamonitor (a monitor monitoring a monitor!) that does nothing else. Just remember that you should also monitor your metamonitor (meta-metamonitor?), and this can be done from your first Nagios instance.

How do we remotely monitor Nagios? The status.dat file is used as a data source when you access the Nagios web interface. The default web interface has a link "Process Info" which points to http://.../nagios/cgi-bin/extinfo.cgi?type=0. Included in the process info report is the last time an external command was run. Here's a test written in ruby that will remotely verify Nagios is healthy and the last check time is within some threshold. Example using this script :

# with nagios down (stopped safely)
% ruby nagios-last-update.rb localhost 900
extinfo.cgi returned bad data. Nagios is probably down?

# Recently started, no checks run yet:
% ruby nagios-last-update.rb localhost 900
last external command time is 'N/A'. Nagios may have just restarted.

# Working
% ruby nagios-last-update.rb localhost 900
OK. Nagios last-update 2.814607 seconds ago.

# Web server is up, but nagios isn't running:
% ruby nagios-last-update.rb localhost 900
Time of last Nagios check is older than 900.0 seconds: 1434.941687

# Web server is down
% ruby nagios-last-update.rb localhost 900
Connection refused when fetching http://localhost/nagios/cgi-bin/extinfo.cgi?type=0

# Host is unresponsive:
% ruby nagios-last-update.rb localhost 900     
Timeout (30) while fetching http://localhost/nagios/cgi-bin/extinfo.cgi?type=0

The script uses the proper exit statuses (0 == OK, 1 == warn, 2 == critical) nagios checks expect. There may be exceptions I'm not catching that I should, but uncaught ruby exceptions cause an exit code 1, which Nagios (or whatever) should interpret as a check failure.

Now we have a way to remotely verify the health of a Nagios instance that tells us if Nagios is running properly. Plug this into a monitoring instance on a different host, and you should get alerts whenever your Nagios instance is down.

December 18, 2009

Day 18 - Who Watches the Watcher?

1 comment: