December 9, 2008

Day 9 - Lock File Practices

The script started with a simple, small idea. Some simple task like backing up a database or running rsync. You produce the script matching your requirements and throw it up in cron on some reasonable schedule.

Time passes, growth happens, and suddenly your server is croaking because 10 simultaneous rsyncs are happening. The script runtime is now longer than your interval. Being the smart person you are, you add some kind of synchronization to prevent multiple instances from running at once, and it might look like this:

#!/bin/sh

lock="/tmp/cron_rsync.lock"
if [ -f "$lock" ] ; then
  echo "Lockfile exists, aborting."
  exit 1
fi

touch $lock
rsync ...
rm $lock
You have your cron job put the output of this script into a logfile so cron doesn't email you when the lockfile's stuck.

Looks good for now. A while later, you log in and need to do some work that requires this script temporarily not run, so you disable the cron job and kill the running script. After you finish your work, you enable the cron job again.

As luck would have it, you killed the script while it was in the middle of its rsync, which meant the 'rm $lock' never ran, which means your cron job isn't running now and is periodically appending "Lockfile exists, aborting." to your logfile. It's easy to not watch logfiles, so you only notice this when something that depends on your script breaks. Realizing the edge case you forgot, you add handling for signals, just above your 'touch' statement:

trap "rm -f $lock; exit" INT TERM EXIT
Now both normal termination and signals (a clean reboot, for example) will remove your lockfile. And there was once again peace across the land ...
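Putting the pieces so far together, the whole script at this stage might look like this (same names and paths as above; a no-op stands in for the real rsync):

```shell
#!/bin/sh
# Consolidated sketch of the naive approach: check, trap, touch, work, remove.
lock="/tmp/cron_rsync.lock"

if [ -f "$lock" ] ; then
  echo "Lockfile exists, aborting."
  exit 1
fi

# Clean up the lockfile on normal exit and on INT/TERM.
trap "rm -f $lock; exit" INT TERM EXIT

touch "$lock"
:   # rsync ... (the real work goes here)
rm -f "$lock"
```

This still has the power-outage hole described next, but it handles normal exits and catchable signals.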

... until a power outage causes your server to reboot, interrupting the rsync and leaving your lockfile around. If you're lucky, your lockfile is in /tmp and your platform happens to wipe /tmp on boot, clearing your lockfile. If you aren't lucky, you'll need to fix the bug (you should fix the bug anyway), but how?

The real fix means we'll have to reliably know whether or not a process is running. Recording the pid isn't totally reliable unless you check the pid's command arguments, and it doesn't survive some kinds of updates (name change, etc). A reliable way to do it with the least amount of change is to use flock(1) for lockfile tracking. The flock(1) tool uses the flock(2) interface to lock your file. Locks are released when the program holding the lock dies or unlocks it. A small update to our script will let us use flock instead:

#!/bin/sh

lockfile="/tmp/cron_rsync.lock"
if [ -z "$flock" ] ; then
  lockopts="-w 0 $lockfile"
  exec env flock=1 flock $lockopts $0 "$@"
fi

rsync ...
This change keeps all of the locking logic in one small part of the script, which is a benefit by itself. The trick is that if '$flock' is not set, we exec flock(1), pointing it back at this script with its original arguments. The '-w 0' argument tells flock to exit immediately if the lock is already held. This solution provides locking that expires when the shell script exits under any condition (normal exit, signal, SIGKILL, power outage).
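A related sketch, assuming the util-linux flock(1), which can also lock a file descriptor the shell has already opened: taking the lock in-process instead of re-exec'ing lets the script print its own diagnostic when the lock is held.

```shell
#!/bin/sh
# Variant sketch: hold the lock on an open file descriptor for the life of
# the script. The kernel releases it when the fd closes -- that is, when the
# script exits for any reason.
lockfile="/tmp/cron_rsync.lock"

exec 9>"$lockfile"            # open (and create) the lockfile on fd 9
if ! flock -n 9 ; then        # -n: fail immediately instead of waiting
  echo "another instance is already running, aborting." >&2
  exit 1
fi

:   # rsync ... (the real work goes here)
```

The lock attaches to the open file description, so it survives for exactly as long as the script holds fd 9, with no cleanup code needed.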

You could also use something like daemontools for this. If you use daemontools, you'd be better off making a service specific to this script. To have cron start your process only once and let it die, you can use 'svc -o /service/yourservice'.
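Sketching the daemontools route under the '/service/yourservice' name above (the rsync command and cron schedule are placeholders): the run script does the work and exits, a 'down' file keeps supervise from starting it on its own, and cron pokes it with 'svc -o', which starts the service once and does not restart it when it dies.

```shell
#!/bin/sh
# /service/yourservice/run (mode 755) -- supervise runs this to start the service
exec rsync ...

# One-time setup, so supervise does not start the service by itself:
#   touch /service/yourservice/down
#
# crontab entry -- '-o' means "run once, do not restart when it exits":
#   0 * * * * svc -o /service/yourservice
```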

Whatever solution you decide, it's important that all of your periodic scripts will continue running normally if they are interrupted.

Further reading:

  • flock(2) syscall is available on solaris, freebsd, linux, and probably other platforms
  • FreeBSD port of a different flock implementation: sysutils/flock
  • daemontools homepage

14 comments:

xtat said...

see also package lockfile-progs on debian

Jordan Sissel said...

lockfile-progs requires that a lockfile be touched at least once every 5 minutes, which puts you in a horrible situation if a command (like rsync, here) runs longer than that. You're almost guaranteed to have a situation where lockfile-progs considers your lockfile to be stale when the process is still running.

This sounds worse than managing your own lockfiles with touch and rm.

Paul said...

Your lock code has a race condition.

if [[ -f "${lock}" ]]; then
...
fi
touch "${lock}"

If another copy of the script runs between the file check and the touch, you'll have two instances of the script running. touch will not return an error if the file already exists.

What you need to do is use something that errors on invocation and call it directly, like so:

if ! mkdir "${lock}" 2>/dev/null; then
echo "Error, lock file ${lock} exists" >&2
exit 1
fi

rsync .....

Rob said...

You're right Paul, but that's exactly the point Jordan was making, that the more commonly used way of locking has logic holes. Your rewritten version still fails in the 'power outage' scenario that the flock version solves.

Thanks Jordan, this was a great tip for an old sysadmin to learn, it's a beautiful technique.

furicle said...

Shouldn't you use a tool like mktemp to make good lockfile names as well?

Paul said...

Depends on what you're doing the lock for.

If it's all within the same program, fine. If it's between programs you need a standard lock name so that other processes can check it.

Jon Dowland said...

There's an example of a more seemingly atomic lock operation that makes use of a bashism at http://www.davidpashley.com/articles/writing-robust-shell-scripts.html

Jordan Sissel said...

@Jon Dowland - The locking described on the page you linked has two unchecked failure conditions. First, the 'test' and 'set' of the locking are not an atomic test-and-set.

Doing:
if [ ! -e $file ] ; then
touch $file
...
rm $file
fi

Is not safe. You could have two scripts run at the same time that both see the file doesn't exist, then touch the file, etc.

This is why you need flock, which grabs a lock on a file in the kernel.

Further:
trap "rm $file" INT TERM EXIT

Is also not safe, because if your machine loses power (or kernel panics) during this script's execution, your lockfile will remain and be stale.

flock is the way.

The Open Sourcerer said...

Thanks for this. I'd not come across flock before.

On my Ubuntu Linux I can use "-n" instead of "-w 0"

-n, --nb, --nonblock,
Fail (with an exit code of 1) rather than wait if the lock can not be immediately acquired.

Daniel said...

On linux/debian, use "lockf" (installed by default); it is a usable alternative to flock(1) on solaris and fbsd.

People in this post are confusing lockf with lockf-tools on linux. lockf and flock are safe tools to use.

Jordan Sissel said...

Daniel,

I don't know what package (In debian/ubuntu) contains lockf, I tried searching, but flock(1) comes with util-linux-ng (util-linux on ubuntu) and is generally always installed since that package also provides other basic system tools like fdisk.

Also, freebsd has lockf(1) by default, not flock(1). FreeBSD lockf and util-linux-ng flock(1) are similar tools.

Agreed about lockfile-progs, that tool is a very bad misimplementation of file locks.

miiimooo said...

re lockfile-progs: RTM you're supposed to leave a lockfile-touch running. Nothing terminates after 5 minutes. Here's what I tend to use:

LOCKFILE="/var/run/mylock"
lockfile-create $LOCKFILE
if [ $? -ne 0 ] ; then
# log this
exit 0
fi
lockfile-touch $LOCKFILE &
LOCKTOUCHPID="$!"
trap "kill $LOCKTOUCHPID;lockfile-remove $LOCKFILE; exit" INT TERM EXIT

# do stuff

I know it looks pretty overengineered but it's been working nicely in all sorts of situations...

Jordan Sissel said...

@miiimooo,

Thanks for the pointer :)

There are still conditions that would leave lockfile-touch running: for example, if you kill -9 the parent script, or if the linux OOM killer kills it (which also uses SIGKILL in most cases).

There also seems to be a race condition in lockfile-create.

The best solution is still flock since it correctly frees the lock under any terminating conditions and has no timing constraints. Additionally, it is much less code than your use of lockfile-progs.

Anonymous said...

Overall, thanks for these useful tips.
Is there a way to write a message to stdout or stderr just before flock exits, when the lock is already set?

For example :
#!/bin/sh

lockfile="/tmp/cron_rsync.lock"
if [ -z "$flock" ] ; then
lockopts="-w 0 $lockfile"
exec env flock=1 flock $lockopts $0 "$@" "echo 'This program is already launched.'"
fi

rsync ...