Time passes, growth happens, and suddenly your server is croaking because 10 simultaneous rsyncs are happening. The script runtime is now longer than your interval. Being the smart person you are, you add some kind of synchronization to prevent multiple instances from running at once, and it might look like this:
#!/bin/sh
lock="/tmp/cron_rsync.lock"
if [ -f "$lock" ] ; then
  echo "Lockfile exists, aborting."
  exit 1
fi
touch "$lock"
rsync ...
rm "$lock"

You have your cron job put the output of this script into a logfile so cron doesn't email you when the lockfile's stuck.
Looks good for now. A while later, you log in and need to do some work that requires this script temporarily not run, so you disable the cron job and kill the running script. After you finish your work, you enable the cron job again.
Due to your luck, you killed the script while it was in the rsync process, which meant the 'rm $lock' never ran, which means your cron job isn't running now and is periodically updating your logfile with "Lockfile exists, aborting." It's easy to not watch logfiles, so you only notice this when something breaks that depends on your script. Realizing the edge case you forgot, you add handling for signals, just above your 'touch' statement:
trap "rm -f $lock; exit" INT TERM EXIT

Now both normal termination and signals (safely rebooting, for example) will remove your lockfile. And there was once again peace throughout the land ...
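Put together, the script at this stage looks something like this (same names as above; the rsync line stands in for the real work):

```shell
#!/bin/sh
# Naive lockfile plus a signal trap. This cleans up on normal exit
# and on INT/TERM, but a SIGKILL or power loss still leaves the file.
lock="/tmp/cron_rsync.lock"
if [ -f "$lock" ] ; then
  echo "Lockfile exists, aborting."
  exit 1
fi
trap "rm -f $lock; exit" INT TERM EXIT
touch "$lock"
rsync ...
```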
... until a power outage causes your server to reboot, interrupting the rsync and leaving your lockfile around. If you're lucky, your lockfile is in /tmp and your platform happens to wipe /tmp on boot, clearing your lockfile. If you aren't lucky, you'll need to fix the bug (you should fix the bug anyway), but how?
The real fix means we'll have to reliably know whether or not a process is running. Recording the pid isn't totally reliable unless you check the pid's command arguments, and it doesn't survive some kinds of updates (name change, etc). A reliable way to do it with the least amount of change is to use flock(1) for lockfile tracking. The flock(1) tool uses the flock(2) interface to lock your file. Locks are released when the program holding the lock dies or unlocks it. A small update to our script will let us use flock instead:
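The automatic release is easy to see for yourself. A quick demonstration, assuming util-linux flock(1) is installed (the lockfile path is arbitrary):

```shell
#!/bin/sh
# Demonstration: even a SIGKILLed lock holder releases its flock(2) lock.
lockfile="/tmp/flock_demo.lock"
# Take the lock on fd 9, then keep that fd open in a long-running process.
( exec 9>"$lockfile"; flock 9; exec sleep 100 ) &
sleep 1               # let the background process acquire the lock
kill -9 $!            # SIGKILL: no trap or cleanup code can run
sleep 1
# The kernel released the lock when the holder died:
flock -n "$lockfile" true && echo "lock was released"
```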
#!/bin/sh
lockfile="/tmp/cron_rsync.lock"
if [ -z "$flock" ] ; then
  lockopts="-w 0 $lockfile"
  exec env flock=1 flock $lockopts $0 "$@"
fi
rsync ...

This change allows us to keep all of the locking logic in one small part of the script, which is a benefit in itself. The trick here is that if '$flock' is not set, we re-exec the script under flock with the same arguments, setting flock=1 in the environment so the second invocation skips the wrapper. The '-w 0' argument to flock tells it to exit immediately if the lock is already held. This solution provides locking that expires when the shell script exits under any condition (normal exit, signal, SIGKILL, power outage).
You could also use something like daemontools for this. If you use daemontools, you'd be better off making a service specific to this script. To have cron start your process only once and let it die, you can use 'svc -o /service/yourservice'
Whatever solution you decide on, it's important that all of your periodic scripts keep running normally after being interrupted.
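If you'd rather not touch the script at all, the same locking can live in the crontab entry itself. A sketch (the schedule and script path here are made-up):

```shell
# m h dom mon dow  command
*/5 * * * * flock -w 0 /tmp/cron_rsync.lock /usr/local/bin/rsync-job.sh
```

With '-w 0', flock exits immediately if the previous run still holds the lock, so an overlapping run is simply skipped.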
Further reading:
- flock(2) syscall is available on solaris, freebsd, linux, and probably other platforms
- FreeBSD port of a different flock implementation: sysutils/flock
- daemontools homepage
see also package lockfile-progs on debian
lockfile-progs requires that a lockfile be touched at least once every 5 minutes, which puts you in a horrible situation if a command (like rsync, here) runs longer than that. You're almost guaranteed to have a situation where lockfile-progs considers your lockfile to be stale when the process is still running.
This sounds worse than managing your own lockfiles with touch and rm.
Your lock code has a race condition.
if [[ -f "${lock}" ]]; then
...
fi
touch "${lock}"
If another copy of the script runs between the file check and the touch, you'll have two instances of the script running. touch will not return an error if the file already exists.
What you need to do is use something that errors on invocation and call it directly, like so:
if ! mkdir "${lock}" 2>/dev/null; then
  echo "Error, lock file ${lock} exists" >&2
  exit 1
fi
rsync .....
You're right Paul, but that's exactly the point Jordan was making, that the more commonly used way of locking has logic holes. Your rewritten version still fails in the 'power outage' scenario that the flock version solves.
ReplyDeleteThanks Jordan, this was a great tip for an old sysadmin to learn, it's a beautiful technique.
Shouldn't you use a tool like mktemp to make good lockfile names as well?
Depends on what you're doing the lock for.
If it's all within the same program, fine. If it's between programs you need a standard lock name so that other processes can check it.
There's an example of a more seemingly atomic lock operation that makes use of a bashism at http://www.davidpashley.com/articles/writing-robust-shell-scripts.html
@Jon Dowland - The locking described on the page you linked has two failure conditions that are unchecked. First, the 'test' and the 'set' of the lock are not an atomic test-and-set.
Doing:
if [ ! -e $file ] ; then
touch $file
...
rm $file
fi
Is not safe. You could have two scripts run at the same time that both see the file doesn't exist, then touch the file, etc.
This is why you need flock, which grabs a lock on a file in the kernel.
Further:
trap "rm $file" INT TERM EXIT
Is also not safe, because if your machine loses power (or kernel panics) during this script's execution, your lockfile will remain and be stale.
flock is the way.
Thanks for this. I'd not come across flock before.
On my Ubuntu Linux I can use "-n" instead of "-w 0"
-n, --nb, --nonblock,
Fail (with an exit code of 1) rather than wait if the lock can not be immediately acquired.
On linux/debian, use "lockf" (default in a standard install); this is a usable alternative to flock(1) on solaris and fbsd.
People in this post are confusing lockf with lockf-tools on linux. lockf and flock are safe tools to use.
Daniel,
I don't know what package (in debian/ubuntu) contains lockf; I tried searching. flock(1) comes with util-linux-ng (util-linux on ubuntu) and is generally always installed, since that package also provides other basic system tools like fdisk.
Also, freebsd has lockf(1) by default, not flock(1). FreeBSD lockf and util-linux-ng flock(1) are similar tools.
Agreed about lockfile-progs, that tool is a very bad misimplementation of file locks.
re lockfile-progs: RTM, you're supposed to leave a lockfile-touch running; nothing terminates after 5 minutes. Here's what I tend to use:
LOCKFILE="/var/run/mylock"
lockfile-create $LOCKFILE
if [ $? -ne 0 ] ; then
  # log this
  exit 0
fi
lockfile-touch $LOCKFILE &
LOCKTOUCHPID="$!"
trap "kill $LOCKTOUCHPID; lockfile-remove $LOCKFILE; exit" INT TERM EXIT
# do stuff
I know it looks pretty overengineered but it's been working nicely in all sorts of situations...
@miiimooo,
Thanks for the pointer :)
There are still conditions that would leave lockfile-touch running (for example, if you kill -9 the parent script, or if the linux OOM killer kills it, which also uses SIGKILL in most cases).
There also seems to be a race condition in lockfile-create.
The best solution is still flock since it correctly frees the lock under any terminating conditions and has no timing constraints. Additionally, it is much less code than your use of lockfile-progs.
Overall, thanks for these useful tips.
Is there a way to write a message to stdout or stderr just before exiting, when flock finds the lock already taken?
For example :
#!/bin/sh
lockfile="/tmp/cron_rsync.lock"
if [ -z "$flock" ] ; then
lockopts="-w 0 $lockfile"
exec env flock=1 flock $lockopts $0 "$@" "echo 'This program is already launched.'"
fi
rsync ...
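One way to get such a message is to take the lock inside the script on a file descriptor, instead of re-exec'ing under flock; then the script itself can report the failure (a sketch assuming util-linux flock(1)):

```shell
#!/bin/sh
lockfile="/tmp/cron_rsync.lock"
# Hold the lockfile open on fd 9 for the life of the script;
# the kernel drops the lock when the script exits, however it exits.
exec 9>"$lockfile"
if ! flock -n 9; then
    echo "This program is already launched." >&2
    exit 1
fi
rsync ...
```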