December 3, 2008

Day 3 - Babysitting with Monit

Software just isn't as reliable as we want it to be. Sometimes a simple reboot (or task restart) will make a problem go away, and this kind of "fix" is so commonly tried that it made its way to the TV show mentioned in day 1.

A blind fix that restores health to a down or busted service can be valuable. If there are a known set of conditions that indicate the poor health of a service or device, and a restart can fix it, why not try it automatically? The restart probably doesn't fix the real problem, but automated health-repairs can help you debug the root cause.

Restarting a service when it dies unexpectedly seems like a no-brainer, which is why mysql comes with "mysqld_safe" for babysitting mysqld. This script is basically:

while true
  run mysqld
  if mysqld exited normally:
    break

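The same loop can be sketched in shell. This is a minimal, hypothetical babysitter, not mysqld_safe itself:

```shell
#!/bin/sh
# Minimal babysitter sketch: run a command, restarting it whenever it
# exits abnormally, and stop babysitting once it exits cleanly.
# Usage: babysit.sh command [args...]
while true; do
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    break                      # clean exit; our job is done
  fi
  echo "'$1' exited with status $status; restarting" >&2
  sleep 1                      # avoid a tight restart loop
done
```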
A process (or device) that watches and restarts another process goes by a few names: watchdog, babysitter, etc. There are a handful of free software projects that provide babysitting, including daemontools, mon, and Monit. Monit was the first tool I looked at today, so let's talk Monit.

Focusing only on the process health check features, Monit seems pretty decent. You can have it monitor things other than processes, and even send you email alerts, but that's not the focus today. Each process in Monit can have multiple health checks that, upon failure, result in a service restart or other action. Here's an example config with a health check ensuring mysql connections are working and restarting it on failure:

# Check every 5 seconds.
set daemon 5 

# monit requires each process to have a pidfile and does not create pidfiles
# for you; the start script (or mysqld itself, here) must maintain the pidfile.
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysqld start"
  stop program = "/etc/init.d/mysqld stop"
  if failed port 3306 protocol mysql then restart

This will cause mysqld to be restarted whenever the check fails, for example when mysql's max_connections limit is reached and new connections are refused.
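You can poke the same health signal by hand (assuming the mysql client tools are installed; the host and port here match the check above):

```shell
# Exits zero if mysqld answers, nonzero otherwise -- approximately the
# condition monit's 'protocol mysql' test is looking for.
mysqladmin --host=127.0.0.1 --port=3306 ping
```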

An automatic quick-fix is good, but it isn't good enough by itself. Automatic restarts can hinder your ability to debug because the restart flushes away the evidence of the problem (at least temporarily). A mysql check failed, but what caused it?

To start with, maybe we want to record who was doing what when mysql was having problems. Depending on the state of your database, some of this data may not be available (if mysql is frozen, you probably can't run 'show full processlist'). Here's a short script to do that, which we'll save as "/bin/mysql-debug.sh" (the name is arbitrary):


#!/bin/sh
# Capture mysql state into a timestamped file for post-mortem debugging.
time="$(date +%Y%m%d.%H%M%S)"
[ ! -d /var/log/debug ] && mkdir -p /var/log/debug
exec > "/var/log/debug/mysql.failure.$time"

echo "=> Status"
mysqladmin status
echo "=> Active SQL queries"
# (assumes a mysql user named 'monitor' that can run 'show full processlist')
mysql -umonitor -e 'show full processlist\G'
echo "=> Hosts connected to mysql"
lsof -i :3306

We'll also need to tell Monit to run this script whenever mysql's check fails:

check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  if failed port 3306 protocol mysql then
    exec "/bin/mysql-debug.sh"

However, now mysql doesn't get restarted when a health check fails; we only record data. I tried a few permutations to get both the data recorded and mysql restarted, and this is what worked:
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysqld start"
  stop program = "/bin/sh -c '/bin/mysql-debug.sh; /etc/init.d/mysqld stop'"
  if failed port 3306 protocol mysql then restart

Now any time monit restarts mysql, it runs the debug-gathering script and then stops mysqld. A cleaner solution is probably to combine the debug and stop invocations into a single script and point the 'stop' directive at that.
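Such a combined stop script might look like this (a sketch; the paths are assumptions matching the examples above):

```shell
#!/bin/sh
# Gather debug data first, then stop mysqld, so monit's 'stop' needs
# only one command. Both paths are assumptions from the examples above.
/bin/mysql-debug.sh
exec /etc/init.d/mysqld stop
```

Save it somewhere like /bin/mysqld-stop-with-debug.sh (any name works) and point monit's 'stop program =' at it.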

If I run monit in the foreground (monit -I), I'll see this when mysql's check fails:

MYSQL: login failed
'mysqld' failed protocol test [MYSQL] at INET[localhost:3306] via TCP
'mysqld' trying to restart
'mysqld' stop: /bin/sh
Stopping MySQL:                                            [  OK  ]
'mysqld' start: /etc/init.d/mysqld
Starting MySQL:                                            [  OK  ]
'mysqld' connection succeeded to INET[localhost:3306] via TCP

And in our debug log directory, a new file has been created with our debug output.

This kind of automation isn't a perfect solution, but it can be quite useful. How many times has a coworker accidentally crashed a development service, leaving you to go restart it? Applying the ideas presented above saves you from sshing all over the place to restart broken services, and it automatically records crash and bad-health information for you.

Further reading:

Another discussion of daemon monitoring tools
This article is old, but still makes good points about why you want your services to automatically restart when they die.

1 comment:

Anonymous said...

Great series, Jordan.

One typo in script, last line.

lsof -p :3306

I think you meant

lsof -i :3306