December 8, 2009

Day 8 - Cron Practices

Cron is just about everywhere. Its configuration and behavior are pretty similar across platforms:
  • every <scheduled time>, it runs your command as some user
  • output gets emailed to MAILTO= or $USER
Cron doesn't do everything I want by default. Here's what I want:
  • the same job never to overlap with itself.
  • emailed output only on failures.
  • all output to be logged somewhere.
  • some jobs to time out if they run too long.
  • randomized startup times to avoid resource contention.
It's easiest to first discuss each of these features individually.

For the rest of this article, we'll show various improvements to the following cron job that does a twice-daily backup of mysql.

0 0,12 * * *
The contents of our backup script are:

mysqldump ...
For simplicity, we omit the mysqldump arguments. Let's get on to addressing individual problems.

Overlapping jobs - Locks

Overlapping jobs can be prevented using locking. Last year, we covered lock file practices which can be applied to solve this. Simply pick a unique lockfile for each cronjob and wrap your cron job with flock(1) (or lockf(1) on FreeBSD).

Let's prevent two backups from running simultaneously. Additionally, we want to abort if we can't grab the lock. flock(1) defaults to waiting indefinitely, so let's set the wait time to 0 and use "/tmp/cron.backupmysql" as the lockfile:

lockfile="/tmp/cron.backupmysql"
flock -w 0 "$lockfile" mysqldump ...
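
To see the zero-wait behavior in isolation, here is a minimal sketch. It assumes the util-linux flock(1); the lock path is made up for the demo, and sleep and echo stand in for the real backup.

```shell
#!/bin/sh
# Sketch: a second job gives up immediately while the first holds the lock.
# Assumes util-linux flock(1). sleep/echo stand in for mysqldump.
lockfile="${TMPDIR:-/tmp}/cron.flockdemo.$$"   # hypothetical lock path

# The first "job" takes the lock and holds it for a few seconds.
flock -w 0 "$lockfile" sleep 3 &
sleep 1   # give the background job time to acquire the lock

# The second "job" cannot get the lock, so flock exits nonzero
# without ever running the command.
if flock -w 0 "$lockfile" echo "second job ran"; then
  second_status=0
else
  second_status=$?
fi
echo "second job exit status: $second_status"

wait      # let the first job finish and release the lock
rm -f "$lockfile"
```

Because the second invocation exits nonzero, it combines naturally with any wrapper that reports on failure.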

Emailed output only on failures

You don't necessarily need an email every time your job runs and succeeds. Personally, I only want to be contacted if there's a failure. In this case, we want to capture output somewhere and only emit the output if the exit status of the job is nonzero.

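# capture all output in a temporary file
output=$(mktemp)
mysqldump ... > "$output" 2>&1
code=$?

# only print (and thus only get emailed) when the job failed
if [ "$code" -ne 0 ] ; then
  echo "mysqldump exited with nonzero status: $code"
  cat "$output"
  rm "$output"
  exit $code
fi
rm "$output"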
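
The same pattern can be folded into a reusable function. This is a minimal sketch; quiet_unless_failed is a hypothetical helper name, not part of any existing tool.

```shell
#!/bin/sh
# Sketch: run a command, stay silent on success, print its output on failure.
# quiet_unless_failed is a hypothetical helper name.
quiet_unless_failed() {
  qf_output=$(mktemp) || return 1
  "$@" > "$qf_output" 2>&1
  qf_code=$?
  if [ "$qf_code" -ne 0 ]; then
    echo "$1 exited with nonzero status: $qf_code"
    cat "$qf_output"
  fi
  rm -f "$qf_output"
  return "$qf_code"
}

# Usage in a cron script (cron only sends mail when something is printed):
#   quiet_unless_failed mysqldump ...
```

The function returns the job's own exit status, so the calling script can still react to a failure.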

All output should be logged somewhere

Regardless of exit status, I always want the output of the job to be logged so we can audit it later. This is easily done with the logger(1) command.

# pipe all output to syslog with tag 'backupmysql'
mysqldump ...  2>&1 | logger -t "backupmysql"
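
One caveat: the exit status of a pipeline is that of its last command, so the line above reports logger's status rather than mysqldump's. In bash, the PIPESTATUS array holds the status of every stage. A minimal sketch, bash only, where log_and_keep_status is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch (bash only): send a command's output to syslog, but return the
# command's own exit status instead of logger's.
# log_and_keep_status is a hypothetical helper name.
log_and_keep_status() {
  local tag=$1
  shift
  "$@" 2>&1 | logger -t "$tag"
  # PIPESTATUS[0] is the exit status of the first command in the pipeline.
  return "${PIPESTATUS[0]}"
}

# Usage:
#   log_and_keep_status backupmysql mysqldump ...
```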

Some jobs need timeouts

Runaway cron jobs are bad. If you use locking as above to prevent overlaps, a stuck or frozen job can prevent any future jobs from running unless something causes the stuck or very long job to die. For this, we'll need a tool to interrupt execution of a program after a timeout. I don't know if there's a canonical tool for this, so I wrote one for this article.

Download alarm.rb.

You'll need ruby for alarm.rb. We can now apply this to our backup script:


alarm.rb 28800 mysqldump ...

This will abort if the mysqldump runtime exceeds 8 hours (28800 seconds). My alarm.rb will exit nonzero on timeouts, so if we use the email-on-error tip from above, we'll get notified on job timeouts.
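
If GNU coreutils is available, its timeout(1) command is an alternative that avoids the Ruby dependency. The sketch below assumes GNU timeout, which exits with status 124 when the time limit is hit; sleep stands in for a job that runs too long.

```shell
#!/bin/sh
# Sketch: GNU coreutils timeout(1) as a stand-in for alarm.rb.
# A real cron job would look like:  timeout 28800 mysqldump ...
timeout 1 sleep 5      # sleep stands in for a job that runs too long
code=$?

# GNU timeout exits with 124 when it had to kill the command.
if [ "$code" -eq 124 ]; then
  echo "job timed out"
fi
```

Since 124 is nonzero, a timeout still counts as a failure for the email-on-error approach.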

Randomized startup

If you have lots of hosts all doing backups at the same time, your backup server may get overloaded. You can hand-schedule all your similar jobs to not run simultaneously on multiple hosts, or you can take a shortcut and randomize the startup time.

To do this in a shell script, you'll need something to generate random numbers for you. Doing this explicitly in shell requires a shell that can generate random numbers: bash, Solaris ksh, and zsh support the magic variable $RANDOM, which evaluates to a random number between 0 and 32767. You'll also need something to map your random value across your sleep duration. We'll use bc(1) and bash(1) here (even though zsh's $(( )) math operations support floats, bash seems more common).


sleeptime=$(echo "scale=8; ($RANDOM / 32768) * 3600" | bc | cut -d. -f1)
echo "Sleeping for $sleeptime before starting backupmysql."
sleep $sleeptime

mysqldump ...
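
If bc isn't installed, integer arithmetic in the shell gives much the same result. A minimal sketch for bash; the max_delay variable name is made up for this example, and the sleep is commented out so the snippet returns immediately.

```shell
#!/usr/bin/env bash
# Sketch (bash): map $RANDOM onto [0, max_delay) using integer math only.
max_delay=3600   # hypothetical variable: upper bound of the delay, in seconds

# RANDOM is 0..32767, so RANDOM * max_delay / 32768 is always below max_delay.
sleeptime=$(( RANDOM * max_delay / 32768 ))
echo "Sleeping for $sleeptime before starting backupmysql."
# sleep "$sleeptime"   # uncomment in a real job
```

Multiplying before dividing matters here: dividing first would truncate to zero every time.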

Combining everything

Now let's combine all of the above into one super script. Doing all of the above cleanly and safely in bash is not the most trivial thing. Here is the result:

Using it is simple. It takes options as environment variables. Here's an example:

% TIMEOUT=5 JOBNAME=helloworld sh -c "echo hello world; sleep 10"
Job failed with status 254 (command: sh -c echo hello world; sleep 10)
hello world
/home/jls/bin/alarm.rb: Execution expired (timeout == 5.0)

# and in /var/log/messages:
Dec  8 02:58:02 snack helloworld[19565]: hello world
Dec  8 02:58:07 snack helloworld[19565]: /home/jls/bin/alarm.rb: Execution expired (timeout == 5.0)
Dec  8 02:58:07 snack helloworld[19573]: Job failed with status 254 (command: sh -c echo hello world; sleep 10)

Now, armed with this script and alarm.rb, we can modify our cron job. Let us choose an 8-hour timeout and a 1-hour random startup delay:

0 0,12 * * * JOBNAME="backupmysql" SLEEPYSTART=3600 TIMEOUT=28800
The new cron entry is now:
  • logging any output to syslog
  • only outputting to stdout when there's been a failure (and thus only emailing us on failures)
  • staggering startup across an hour
  • aborting after 8 hours if not finished
  • locking so overlapping runs are impossible
Using the tools above should help you build more reliable and less noisy cron jobs, which makes your systems more reliable and your pager quieter.
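
For illustration, here is a minimal sketch of how these pieces might fit into a single wrapper. It is not the script used above: it is written for bash, substitutes GNU timeout(1) for alarm.rb, assumes util-linux flock(1) and logger(1), and the cronwrap function name and LOCKFILE variable are made up for this example. JOBNAME, TIMEOUT and SLEEPYSTART are read from the environment as described earlier.

```shell
#!/usr/bin/env bash
# Sketch of a combined wrapper; NOT the script used in the article.
# Assumptions: util-linux flock, GNU timeout (in place of alarm.rb), logger.
# cronwrap and LOCKFILE are hypothetical names.
cronwrap() {
  local jobname="${JOBNAME:-cronjob}"
  local lockfile="${LOCKFILE:-/tmp/cron.$jobname}"
  local command_text="$*"
  local output code

  # Stagger startup by a random delay below SLEEPYSTART seconds.
  if [ "${SLEEPYSTART:-0}" -gt 0 ]; then
    sleep $(( RANDOM * SLEEPYSTART / 32768 ))
  fi

  # Prefix the job with a timeout, then with a zero-wait lock.
  if [ "${TIMEOUT:-0}" -gt 0 ]; then
    set -- timeout "$TIMEOUT" "$@"
  fi
  set -- flock -w 0 "$lockfile" "$@"

  # Capture all output so we can decide later whether to print it.
  output=$(mktemp) || return 1
  "$@" > "$output" 2>&1
  code=$?

  # Always log to syslog; only print (and so only trigger mail) on failure.
  logger -t "$jobname" < "$output"
  if [ "$code" -ne 0 ]; then
    echo "Job failed with status $code (command: $command_text)"
    cat "$output"
  fi
  rm -f "$output"
  return "$code"
}

# Usage:
#   JOBNAME=backupmysql SLEEPYSTART=3600 TIMEOUT=28800 cronwrap mysqldump ...
```

With GNU timeout, a job killed for running too long shows up as status 124, and a job that could not get the lock shows up as a nonzero status from flock.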



Anonymous said...

As good as it sounds to add these to cron, they are not functions that are appropriate to its purpose. I am happier to see additions, like this, to perform additional tasks. This keeps the system more modular and flexible.

Cian said...

There's a far better discussion on scheduling uniquely with cron here

rjbs said...

Hi! I'm also running an Advent calendar of Perl libraries that I've written. Coincidentally, one of those libraries is all about improving our experiences with cron!

Check out the article, and maybe you'll find it as useful as we have.

I'm enjoying your calendar, thanks!


Some interesting things there. Thanks!

PIPESTATUS for instance I didn't know about at all.

I wonder if the lock file should go in /var/run to be LSB?

And I wonder if that lockfile shouldn't have been created with mktemp too.

Daniel Howard said...


This article is full of good wisdom for writing effective cron. Two suggestions I would make are:

1) Since you are already doing a lock file, instead of introducing a dependency on Ruby, simply see if lockf failed to obtain a lock, and emit an error. This wouldn't be too great on a high-frequency cron, but since you're running twice a day:

(Sorry about this next mess, but Blogger is a pretty horrible blogging platform which prohibits use of the PRE tag in comments.)

0-6:46 djh@ratchet ~> ( lockf -t 0 sleep.lock sleep 60 || echo "can not sleep" ) &
[1] 92594
0-6:46 djh@ratchet ~> lockf -t 0 sleep.lock sleep 60 || echo "can not sleep"
lockf: sleep.lock: already locked
can not sleep

2) Backups are really really important. I prefer to get an e-mail summary each day confirming success or failure. If you only get the failures then you set yourself up for a scenario where your backup system runs into a chronic failure mode you didn't anticipate.

Last time I wrote a backup script comparable to this, the script compared the size of the new dump file to the one previous, and if the size had decreased, it noted that as a potential error.

And of course, yes, monthly or quarterly you should try loading from backup. ;)

Thanks for a great article.


Dave Rodenbaugh said...

Seems to me that my online cron service gets around a lot of these problems, notably the notify-only-on-failure one, with relative ease. If you don't want to spend a lot of time on this problem and can spend a little cash (and I do mean little), check out this link for free cron jobs.

Jordan Sissel said...

re: putting the lockfile in /var/run - That sounds reasonable.

Jordan Sissel said...

@rjbs - awesome! I'll update the further reading links to point there.

morr said...

I believe fcron has most of these features:

Amy said...

Instead of using a custom script to time out a command, I use a command called timeout which is available in the Debian package repository and I assume other distros. Basically it does exactly what your alarm.rb script does - terminates a command with a configurable signal after a specified timeout, but the timeout can be specified in whatever unit you give it, so you could specify 8h for 8 hours instead of however many seconds.

The basic usage is:
timeout [timeout] command arguments...

Unknown said...

A few notes on

* alarm.rb has to be in your PATH, or you should replace the call to it in the script with the absolute path to alarm.rb (and of course it must be executable)

* If you are getting permission denied when attempting to 'tee /dev/stderr', you can remove the '| tee /dev/stderr' and add the 's' flag to the logger call, which will also output the contents to stderr.

Thanks for writing this script, very useful.

Unknown said...

not to toot my own horn, but I've been working on a project to implement all of these things: