This article was written by Jordan Sissel (@jordansissel)
In past sysadvents, I've talked about babysitting services and showed how to use supervisord to achieve it. This year, Ubuntu started shipping its release with a new init system called Upstart that has babysitting built in, so let's talk about that. I'll be doing all of these examples on Ubuntu 10.04, but any upstart-using system should work.
For me, the most important two features of Upstart are babysitting and events. Upstart supports the simple runner scripts that daemontools, supervisord, and other similar-class tools support. It also lets you configure jobs to respond to arbitrary events.
Diving in, let's take a look the ssh
server configuration Ubuntu ships for
Upstart (I edited for clarity). This file lives as /etc/init/ssh.conf:
description "OpenSSH server"
# Start when we get the 'filesystem' event, presumably once the file
# systems are mounted. Stop when shutting down.
start on filesystem
stop on runlevel S
expect fork
respawn
respawn limit 10 5
umask 022
oom never
exec /usr/sbin/sshd
Some points:
respawn
- tells Upstart to restart it if sshd ever stops abnormally (which means every exit except for those caused by you telling it to stop).oom never
- Gives hints to the Out-Of-Memory killer. In this case, we say never kill this process. This is super useful as a built-in feature.exec /usr/bin/sshd
- no massive SysV init script, just one line saying what binary to run. Awesome!
Notice:
- No poorly-written 'status' commands.
- No poorly-written /bin/sh scripts
- No confusing/misunderstood restart vs reload vs stop/start semantics.
The initctl
(8) command is the main interface to upstart, but there are
shorthand commands status
, stop
, start
, and restart
. Let's query status:
% sudo initctl status ssh
ssh start/running, process 1141
# Or this works, too (/sbin/status is a symlink to /sbin/initctl):
% sudo status ssh
ssh start/running, process 1141
# Stop the ssh server
% sudo initctl stop ssh
ssh stop/waiting
# And start it again
% sudo initctl start ssh
ssh start/running, process 28919
Honestly, I'm less interested in how to be a user of upstart and more interested in running processes in upstart.
How about running nagios with upstart? Make /etc/init/nagios.conf:
description "Nagios"
start on filesystem
stop on runlevel S
respawn
# Run nagios
exec /usr/bin/nagios3 /etc/nagios3/nagios.cfg
Let's start it:
% sudo initctl start nagios
nagios start/running, process 1207
% sudo initctl start nagios
initctl: Job is already running: nagios
Most importantly, if something goes wrong and nagios crashes or otherwise dies, it should restart, right? Let's see:
% sudo initctl status nagios
nagios start/running, process 4825
% sudo kill 4825
% sudo initctl status nagios
nagios start/running, process 4904
Excellent.
Events
Upstart supports simple messages. That is, you can create messages with
'initctl emit
# /etc/init/helloworld.conf
start on helloworld
exec env | logger -t helloworld
Now send the 'helloworld' message, but also set some parameters in that message.
% sudo initctl emit helloworld foo=bar baz=fizz
And look at the logger results (writes to syslog)
2010-12-19T11:03:29.000+00:00 ops helloworld: UPSTART_INSTANCE=
2010-12-19T11:03:29.000+00:00 ops helloworld: foo=bar
2010-12-19T11:03:29.000+00:00 ops helloworld: baz=fizz
2010-12-19T11:03:29.000+00:00 ops helloworld: UPSTART_JOB=helloworld
2010-12-19T11:03:29.000+00:00 ops helloworld: TERM=linux
2010-12-19T11:03:29.000+00:00 ops helloworld: PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
2010-12-19T11:03:29.000+00:00 ops helloworld: UPSTART_EVENTS=helloworld
2010-12-19T11:03:29.000+00:00 ops helloworld: PWD=/
You can also conditionally accept events with key/value settings, too. See the init(5) manpage for more details.
Additionally, you can start jobs and pass parameters to the job with start
helloworld key1=value1 ...
Problems
Upstart has issues.
First: Debugging it sucks. Why is your pre-start script failing? There's no
built-in way to capture the output and log it. You're best doing 'exec 2>
/var/log/upstart.${UPSTART_JOB}.log
' or something similar. Your only option
for capturing output otherwise is the 'console
' setting which lets you send
output to /dev/console, but that's not useful.
Second: The common 'graceful restart' idiom (test then restart) is hard to implement directly in Upstart. I tried one way, which is to in the 'pre-start' perform a config test, and on success, copy the file to a 'good' file and running on that, but that doesn't work well for things like Nagios that can have many config files:
# Set two variables for easier maintainability:
env CONFIG_FILE=/etc/nagios3/nagios.cfg
env NAGIOS=/usr/sbin/nagios3
pre-start script
if $NAGIOS -v $CONFIG_FILE ; then
# Copy to '<config file>.test_ok'
cp $CONFIG_FILE ${CONFIG_FILE}.test_ok
else
echo "Config check failed, using old config."
fi
end script
# Use the verified 'test_ok' config
exec $NAGIOS $CONFIG_FILE.test_ok
The above solution kind of sucks. The right way to implement graceful restart
, with upstart, is to implement the 'test' yourself and only call initctl
restart nagios
on success - that is, keep it external to upstart.
Third: D-Bus (the message backend for Upstart) has very bad user documentation. The system seems to support access control, but I couldn't find any docs on the subject. Upstart doesn't seem to mention how, but you can see access control in action when you try to 'start' ssh as non-root:
initctl: Rejected send message, 1 matched rules; type="method_call",
sender=":1.328" (uid=1000 pid=29686 comm="initctl)
interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)"
requested_reply=0 destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init"))
So, there's access control, but I'm not sure anyone knows how to use it.
Fourth: There's no "died" or "exited" event to otherwise indicate that a process has exited unexpectedly, so you can't have event-driven tasks that alert you if a process is flapping or to notify you otherwise that it died.
Fifth: Again on the debugging problem, there's no way to watch events passing along to upstart. strace doesn't help very much:
% sudo strace -s1500 -p 1 |& grep com.ubuntu.Upstart
# output edited for sanity, I ran 'sudo initctl start ssh'
read(10, "BEGIN ... binary mess ... /com/ubuntu/Upstart ... GetJobByName ...ssh\0", 2048) = 127
...
Lastly, the system feels like it was built for desktops: lack of 'exited' event, confusing or missing access control, stopped state likely being lost across reboots, no slow-starts or backoff, little/no output on failures, etc.
Conclusion
Anyway, despite some problems, Upstart seems like a promising solution to the problem of babysitting your daemons. If it has no other benefit, the best benefit is that it comes with Ubuntu 10.04 and beyond, by default, so if you're an Ubuntu infrastructure, it's worth learning.
Further reading:
- Upstart's site
- Upstart on Wikipedia
- init(5) - upstart job configuration
- initctl(8) - upstart control tool
Great series, great post. Thanks.
ReplyDeleteThis sentence looks unfinished:
ReplyDelete"oom never - Gives hints to the Out-Of-Memory killer. In this case, we say never kill this process. This is super useful as a built-in feature. You can"
This year? Ubuntu started shipping upstart in 8.04.
ReplyDeleteAlthough the earlier version used a different and incompatible configuration format using files in /etc/event.d/* instead of /etc/init/*.conf.
@Scott, fixed. Thanks!
ReplyDelete@Marius: I know it shipped earlier, but prior versions I thought everything was still in sysv init scripts (most things are). As of 10.04, more things are actually upstart configs. It may have happened earlier than 10.04, but I wasn't aware ;)
Very nice article. It explains in human understandble terms how to use upstart and what it does.
ReplyDeleteHow would you configure what upstart jobs start on boot? How do you disable sshd from starting at boot for example? This is not very clear. Using the old-and-tried init.d you just deleted some link files, but how do you do this with upstart?
@Adrian,
ReplyDeleteI think the way you do this is the 'start on' event hooks. There are several events fired during startup. See the 'ssh' example and how it starts on 'filesystem' which is an event emitted by another startup task.
If you have no 'start on' events, you are never started automatically.
wow great i have read many articles about this topic and everytime i learn something new i dont think it will ever stop always new info , Thanks for all of your hard work!
ReplyDelete