December 6, 2008

Day 6 - Automating Tripwire

We need more automation-minded people writing tools. While playing with tripwire today, I saw something that made me think, how am I supposed to automate this? I don't like feeling like a useful tool can't be automated, so let's figure out how.

So far, I'd done the basics with tripwire: creating site and host keys, creating the encrypted config and policy files, and running 'tripwire --init' to get things started. After making some changes, I ran 'tripwire --check' to see what tripwire would tell me. Things were going well until I decided to update tripwire's idea of what the current system should be, with 'tripwire --update'.

The tripwire guide I was following told me what would happen, but I hadn't read that far. Tripwire launched vi and let me edit a document that started like this:

Tripwire(R) 2.3.0 Integrity Check Report

Report generated by:          root
Report created on:            Fri Dec  5 19:11:39 2008
Database last updated on:     Never
The document was full of information about what had changed on the system. I hadn't a clue what I was supposed to do, since I was only skimming documentation when I got stuck or confused, so I went back to the guide and saw:
"If any changes are found you will be presented with a "ballot-box" styled form that must be completed by placing an 'x' opposite the violations that are safe to be updated in the database."
(link to the guide this quote came from under further reading)
I have to what? Carefully hand-edit some generated output, so tripwire will know what to store back in its truth database? How the heck do you automate this? Was this a design decision implying that automation and security are mutually exclusive? I don't think they are.

The tripwire config used when running 'tripwire --init' has a variable in it, "EDITOR," which was set to /usr/bin/vi. I changed it to '/bin/cat', regenerated the encrypted config file (twadmin --create-cfgfile), and reran the update command. Instead of launching vi, tripwire simply sent the report to stdout, meaning we might be able to automate this by replacing cat with some smart script.

The data format in the file report is very clearly meant for human reading, not for computer parsing. Tripwire can parse it for its own purposes, but are you up to writing a parser? Googling for tripwire report parser doesn't show promise.

I replaced /usr/bin/nano with a shell script to see what output I should expect. Rerunning 'tripwire --check' and then 'tripwire --update', my nano change shows up like this:

[x] "/usr/bin/nano"
Leaving that box checked would mean "I know nano changed, it's ok." Writing a handler that automatically decides whether a file was knowingly modified might be simple. For example, if you upgraded a package recently, most or all of the files for that package will be reported as modified, added, or removed. You might be able to ask your packaging system if a file is valid. For instance, if a file is listed as modified, and you use RPMs, you could see if the file has changed since installing the RPM:
% rpm -Vf /usr/bin/nano
S.5....T    /usr/bin/nano
According to rpm's manpage, the first column means that the size, md5 sum, and modification time of the current nano differ from those of the nano that was installed by the rpm.
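That flag column can be decoded mechanically. Here is a minimal Python sketch, assuming the letter meanings listed in rpm's manpage; the function name `decode_rpm_verify` is made up for illustration and is not part of rpm:

```python
# Meaning of each letter rpm -V may print in its flag column.
# A "." in a position means that particular test passed.
RPM_VERIFY_FLAGS = {
    "S": "file size differs",
    "M": "mode (permissions or file type) differs",
    "5": "digest (formerly MD5 sum) differs",
    "D": "device major/minor number mismatch",
    "L": "readlink(2) path mismatch",
    "U": "user ownership differs",
    "G": "group ownership differs",
    "T": "modification time differs",
    "P": "capabilities differ",
}


def decode_rpm_verify(line):
    """Split one line of `rpm -V` output into (path, [failed tests])."""
    fields = line.split()
    flags, path = fields[0], fields[-1]
    # Decode by letter rather than by position, so both the older 8-column
    # and the newer 9-column flag layouts are handled the same way.
    failed = [RPM_VERIFY_FLAGS[c] for c in flags if c in RPM_VERIFY_FLAGS]
    return path, failed


if __name__ == "__main__":
    path, failed = decode_rpm_verify("S.5....T    /usr/bin/nano")
    print(path)
    for reason in failed:
        print("  " + reason)
```

Lines such as "missing /some/path" carry no flag letters, so they decode to an empty list; a real handler would want to treat those separately.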

I'd hate to have to answer these questions every time I did an upgrade on one of my servers. Doing it once would be annoying, but doing it across all of my servers after an upgrade (how many servers would that be for you?) would be an impossible nightmare.

Since tripwire is a useful tool, you could use the verification information from rpm to automatically answer tripwire's inquiry with a script set as your EDITOR config variable. If you're especially on-top of your sysadmin practices, your systems have automated software rollouts, and if you want to use tripwire, you'll need to automate the management process.
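As a rough illustration of that idea, here is a Python sketch of a script you might point EDITOR at. It rests on several assumptions that I have not verified against every tripwire version: that tripwire passes the report's path as the first argument (consistent with what '/bin/cat' did above), that ballot lines look exactly like the `[x] "/usr/bin/nano"` example, and that an unchecked box is written `[ ]`. The names `review_ballot` and `matches_package` are hypothetical.

```python
import re
import subprocess
import sys

# Ballot lines are assumed to look like:  [x] "/usr/bin/nano"
BALLOT = re.compile(r'^(\s*)\[x\] "(.+)"\s*$')


def matches_package(path):
    """True if rpm says this file is unchanged from its package (RPM systems only)."""
    # Files that no package owns cannot be vouched for by rpm.
    if subprocess.run(["rpm", "-qf", path], capture_output=True).returncode != 0:
        return False
    verify = subprocess.run(["rpm", "-Vf", path], capture_output=True, text=True)
    # rpm -Vf checks every file in the owning package; a line ending in our
    # path means this particular file differs from what the package installed.
    flagged = [l.split()[-1] for l in verify.stdout.splitlines() if l.split()]
    return path not in flagged


def review_ballot(text, approve):
    """Uncheck every ballot box whose path the approve() callback rejects."""
    out = []
    for line in text.splitlines(keepends=True):
        m = BALLOT.match(line)
        if m and not approve(m.group(2)):
            line = line.replace("[x]", "[ ]", 1)
        out.append(line)
    return "".join(out)


def main(report_path):
    # Rewrite the report in place, the way an interactive editor would.
    with open(report_path) as f:
        text = f.read()
    with open(report_path, "w") as f:
        f.write(review_ballot(text, matches_package))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Keeping the approval rule in a separate callback means the rpm check can be swapped for whatever your own packaging or deployment system can vouch for.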

Further reading:

Open source version of Tripwire
Intrusion Detection with Tripwire
The tripwire guide I was learning from.
www.tripwire.com

December 5, 2008

Day 5 - Capistrano

Do you store and deploy configuration files from a revision control system? You should. If you don't, yet, this article aims to show you how to make that happen with very little effort using Capistrano.

Capistrano is a ruby-powered tool that acts like make (or rake, or any build tool), but it is designed for deploying data and running commands on remote machines. You can write tasks (like make targets) and even nest them in namespaces. Hosts can be grouped together in roles and you can have a task affect any number of hosts and/or roles. Capistrano, like Rake, uses only Ruby for configuration. Capistrano files are named 'Capfile'.

Much of the documentation and buzz about Capistrano deals with deployment of Ruby on Rails, but it's not at all limited to Rails.

For a simple example, let's ask a few servers what kernel version they are running:

# in 'Capfile'
role :linux, "jls", "mywebserver"

namespace :query do
  task :kernelversion, :roles => "linux" do
    run "uname -r"
  end
end
Output:
% cap query:kernelversion
  * executing `query:kernelversion'
  * executing "uname -r"
    servers: ["jls", "mywebserver"]
    [jls] executing command
 ** [out :: jls] 2.6.25.11-97.fc9.x86_64
    [mywebserver] executing command
 ** [out :: mywebserver] 2.6.18-53.el5
    command finished
Back at the original problem being solved, we want to download configuration files for any service on any host we care about and store them in revision control. For now, let's just grab apache configs from one server.

Learning how to do this in Capistrano proved to be a great exercise in learning a boatload of Capistrano's features. The Capfile is short, but too long to paste here, so click here to view.
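For readers who want the shape of the workflow without Capistrano, here is a small Python sketch. It is not the Capfile mentioned above; it only plans the equivalent steps (copy the remote directories into a working copy, add anything new, commit) as plain commands. Using rsync for the copy and the function name `plan_pull` are my own assumptions.

```python
import shlex


def plan_pull(host, remote_dirs, workdir):
    """Return the commands (as argv lists) for one pull-and-commit run."""
    dest = f"{workdir}/{host}/"
    cmds = [["mkdir", "-p", dest]]
    for d in remote_dirs:
        # The trailing slash makes rsync copy the directory's contents
        # into dest rather than creating a nested directory.
        cmds.append(["rsync", "-a", f"{host}:{d.rstrip('/')}/", dest])
    # --force lets svn add skip files that are already under version control.
    cmds.append(["svn", "add", "--force", dest])
    cmds.append(["svn", "commit", "-m", f"config snapshot of {host}", dest])
    return cmds


if __name__ == "__main__":
    dirs = ["/etc/httpd/conf", "/etc/httpd/conf.d"]
    for cmd in plan_pull("mywebserver", dirs, "/home/configs/work"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running each planned command with `subprocess.run(cmd, check=True)` would execute it; the sketch only prints them so it is safe to try anywhere.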

If I run "cap pull:apache", Capistrano dutifully downloads my apache configs from 'mywebserver' and pushes them into a local svn repository. Here's what it looks like (I removed some output):

% cap pull:apache
    triggering start callbacks for `pull:apache'
  * executing `ensure:workdir'
At revision 8.
  * executing `pull:apache'
    triggering after callbacks for `pull:apache'
  * executing `pull:sync'
  * executing "echo -n $CAPISTRANO:HOST$"
    servers: ["mywebserver"]
    [mywebserver] executing command
    servers: ["mywebserver"]
 ** sftp download /etc/httpd/conf -> /home/configs/work/mywebserver
    [mywebserver] /etc/httpd/conf/httpd.conf
    [mywebserver] /etc/httpd/conf/magic
    [mywebserver] done
  * sftp download complete
    servers: ["mywebserver"]
 ** sftp download /etc/httpd/conf.d -> /home/configs/work/mywebserver
    [mywebserver] /etc/httpd/conf.d/README
    [mywebserver] /etc/httpd/conf.d/welcome.conf
    [mywebserver] /etc/httpd/conf.d/proxy_ajp.conf
    [mywebserver] done
  * sftp download complete
A         /home/configs/work/mywebserver/README
A         /home/configs/work/mywebserver/httpd.conf
A         /home/configs/work/mywebserver/magic
A         /home/configs/work/mywebserver/welcome.conf
A         /home/configs/work/mywebserver/proxy_ajp.conf
    command finished
Adding         configs/work/mywebserver/README
Adding         configs/work/mywebserver/httpd.conf
Adding         configs/work/mywebserver/magic
Adding         configs/work/mywebserver/proxy_ajp.conf
Adding         configs/work/mywebserver/welcome.conf
Transmitting file data .....
Committed revision 9.
If I then modify 'httpd.conf' on the webserver, and rerun 'cap pull:apache':
<output edited for content>
% cap pull:apache
 ** sftp download /etc/httpd/conf -> /home/configs/work/mywebserver
    [mywebserver] /etc/httpd/conf/httpd.conf
    [mywebserver] /etc/httpd/conf/magic
    [mywebserver] done
  * sftp download complete
Sending        configs/work/mywebserver/httpd.conf
Transmitting file data .
Committed revision 10.
Now if I want to see the diff between the latest two revisions, to see what we changed on the server:
% svn diff -r9:10 file:///home/configs/svn/mywebserver/httpd.conf
Index: httpd.conf
===================================================================
--- httpd.conf  (revision 9)
+++ httpd.conf  (revision 10)
@@ -1,3 +1,4 @@
+# Hurray for revision control!
 #
 # This is the main Apache server configuration file.  It contains the
 # configuration directives that give the server its instructions.
This kind of solution is not necessarily ideal, but it's a good and simple way to get history tracking on your config files right now, until you have the time, energy, and need to improve the way you do config management.

Capistrano might just help you with deployment and other common, remote access tasks.

Further reading:

Capistrano homepage
Capistrano cheat-sheet
RANCID
A similar idea presented here (download config files and put them in revision control) but for network gear.
Coverage for this was suggested by Jon Heise, who helpfully provided me with an intro to Capistrano. <3

December 4, 2008

Day 4 - Extending snmpd

Do you monitor your hosts with snmp? Ever wanted to add additional data sources to your snmp agent? Net-SNMP's snmpd lets you do this.

There are a few different options available to extend snmpd. The first is the most primitive, simply running a program and reporting the first line of output and the exit status. This is done with the 'exec' statement in snmpd.conf.

# Format is
# exec <name> <command> [args]
exec googleping /bin/ping -c 1 -w 1 -q www.google.com
You need to specify the full path for 'exec' commands. If you want to run your command in /bin/sh, swap in 'sh' for 'exec' and you get to avoid the full path requirement. The 'exec' and 'sh' extension commands show their results through the UCD-SNMP-MIB::extTable table:
% snmpwalk -v2c -c secret localhost UCD-SNMP-MIB::extTable
UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
UCD-SNMP-MIB::extNames.1 = STRING: googleping
UCD-SNMP-MIB::extCommand.1 = STRING: /bin/ping
UCD-SNMP-MIB::extResult.1 = INTEGER: 0
UCD-SNMP-MIB::extOutput.1 = STRING: PING www.l.google.com (74.125.19.147) 56(84) bytes of data.
UCD-SNMP-MIB::extErrFix.1 = INTEGER: noError(0)
UCD-SNMP-MIB::extErrFixCmd.1 = STRING: 
You can see that the first line of output is available in extOutput. This is nice, but the order of commands depends entirely on the order in snmpd.conf, so if you put another 'exec' above the googleping one, the googleping check becomes .2 instead of .1, which is not so stable with respect to adding new exec statements or moving them around. Boo.

The second option available is called 'extend,' and it works similarly to 'exec,' but better. The 'extend' configuration accepts multiline output from your command and is indexed on the name (e.g., "googleping") instead of an index number (e.g., 1, 2, etc). Just change 'exec' to 'extend':

extend googleping /bin/ping -c 1 -w 1 -qn www.google.com
extend mysqlstatus /usr/bin/mysqladmin status
These 'extend' commands show up in NET-SNMP-AGENT-MIB::nsExtensions. If you only want the output, you can walk NET-SNMP-EXTEND-MIB::nsExtendOutput1Table (or nsExtendOutput2Table). If you want only the exit code, you can walk nsExtendResult. If you want to view the output of walking nsExtensions (it's too long to post here), click here.

Remember the benefit of 'extend' over 'exec' was that the indexing was on the name, so let's query for only the googleping result:

% snmpget -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendResult."googleping"' 
NET-SNMP-EXTEND-MIB::nsExtendResult."googleping" = INTEGER: 0

# If I null route all www.google.com IPs, and requery:
% snmpget -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendResult."googleping"' 
NET-SNMP-EXTEND-MIB::nsExtendResult."googleping" = INTEGER: 2
Take note above that the OID is in single quotes so that "googleping" is still sent to snmpget with its double quotes; this is how snmpget knows the index is really an octet string. (See what "googleping" becomes with snmpget -On)
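If you are curious what that numeric form looks like, the encoding is simple: a string index becomes its length followed by one number per character. A minimal Python sketch (the function name `extend_index` is made up for illustration):

```python
def extend_index(name):
    """Encode an 'extend' name as the numeric suffix of its table index."""
    octets = name.encode("ascii")
    # A string index is sent as its length, then one sub-identifier per byte.
    return ".".join(str(n) for n in (len(octets), *octets))


if __name__ == "__main__":
    print(extend_index("googleping"))
    # prints 10.103.111.111.103.108.101.112.105.110.103
```

Appending that suffix to the numeric OID of a column such as nsExtendResult gives the same object that the quoted form names.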

The output and exit code of your 'extend' and 'exec' statements are cached for a short period of time. The exact time saved in cache is determined by the nsExtendCacheTime OID. If you have write access configured in snmp, you can issue a SET command to change the cache time.

# Cache the googleping results for 15 seconds
% snmpset -v2c -c secret localhost 'NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping"' i 15
NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping" = INTEGER: 15

% snmpwalk -v2c -c secret localhost NET-SNMP-EXTEND-MIB::nsExtendCacheTime
NET-SNMP-EXTEND-MIB::nsExtendCacheTime."googleping" = INTEGER: 15
NET-SNMP-EXTEND-MIB::nsExtendCacheTime."mysqlstatus" = INTEGER: 5
Lastly, you can tell snmpd to 'pass' (that's the name of the config statement) handling of an entire OID subtree to an external program, which seems like a nice feature. This lets you write a subtree handler in your language of choice rather than being required (while still an option) to write your more complex handlers using snmpd's perl support or C module support. For brevity, I'll skip coverage of that, but it works similarly to 'extend' and 'exec,' and has its own (simple) text protocol for telling your subprocess what OID it wants data on (See further reading).
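To give a feel for that text protocol, here is a hedged Python sketch of a 'pass' handler. As I understand the snmpd.conf manpage, snmpd runs the program with "-g OID" for a GET or "-n OID" for a GETNEXT and expects three lines back (OID, type, value), or no output when there is nothing to return. The subtree, the values, and the function names below are placeholders for illustration, and SET requests ("-s") are not handled.

```python
import sys

# Hypothetical subtree and values, for illustration only.
TABLE = {
    ".1.3.6.1.4.1.99999.1.1": ("string", "hello from pass"),
    ".1.3.6.1.4.1.99999.1.2": ("integer", "42"),
}


def oid_key(oid):
    """Turn a dotted OID into a tuple of ints so OIDs sort numerically."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def handle(args, table=TABLE):
    """Return the text to print for one request, or "" if there is no answer."""
    if len(args) != 2 or args[0] not in ("-g", "-n"):
        return ""
    mode, oid = args
    if mode == "-g":
        # GET: answer only for an exact match.
        match = oid if oid in table else None
    else:
        # GETNEXT: answer with the first OID that sorts after the request.
        later = sorted((o for o in table if oid_key(o) > oid_key(oid)), key=oid_key)
        match = later[0] if later else None
    if match is None:
        return ""
    kind, value = table[match]
    return f"{match}\n{kind}\n{value}\n"


if __name__ == "__main__":
    sys.stdout.write(handle(sys.argv[1:]))
```

The matching snmpd.conf line would be along the lines of `pass .1.3.6.1.4.1.99999.1 /usr/bin/python3 /path/to/handler.py`, where both paths are placeholders.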

Extending SNMP to support your own data sources is a good way to allow your existing monitoring tools (nagios, etc) to monitor remotely without having to have local access such as with ssh or nrpe.

Further reading:

Net-SNMP snmpd extension configuration and documentation
See the "EXTENDING AGENT FUNCTIONALITY" section

December 3, 2008

Day 3 - Babysitting with Monit

Software just isn't as reliable as we want it to be. Sometimes a simple reboot (or task restart) will make a problem go away, and this kind of "fix" is so commonly tried that it made its way to the TV show mentioned in day 1.

A blind fix that restores health to a down or busted service can be valuable. If there is a known set of conditions that indicate the poor health of a service or device, and a restart can fix it, why not try it automatically? The restart probably doesn't fix the real problem, but automated health-repairs can help you debug the root cause.

Restarting a service when it dies unexpectedly seems like a no-brainer, which is why mysql comes with "mysqld_safe" for babysitting mysqld. This script is basically:

while true
  run mysqld
  if mysqld exited normally:
    exit
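
The loop above translates almost directly into a runnable program. This is a minimal Python sketch of the babysitting idea only, not of what mysqld_safe actually does; the name `babysit` and the pause between restarts are my own additions.

```python
import time


def babysit(run, delay=1.0, sleep=time.sleep):
    """Call run() until it returns 0; return how many times it was started.

    run is any callable that starts the service, waits for it to exit,
    and returns its exit status.
    """
    starts = 0
    while True:
        starts += 1
        if run() == 0:
            # A clean exit means someone asked the service to stop.
            return starts
        # Brief pause so a crash loop doesn't spin the CPU.
        sleep(delay)
```

With `import subprocess`, something like `babysit(lambda: subprocess.call(["/path/to/daemon", "--foreground"]))` would keep a foreground daemon running; the path and flag there are placeholders.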

A process (or device) that watches and restarts another process seems to have a few names: watchdog, babysitter, etc. There are a handful of free software projects that provide babysitting, including daemontools, mon, and Monit. Monit was the first tool I looked at today, so let's talk Monit.

Focusing only on the process health check features, Monit seems pretty decent. You can have it monitor things other than processes, and even send you email alerts, but that's not the focus today. Each process in Monit can have multiple health checks that, upon failure, result in a service restart or other action. Here's an example config with a health check ensuring mysql connections are working and restarting it on failure:

# Check every 5 seconds.
set daemon 5 

# monit requires each process have a pidfile and does not create pidfiles for you.
# this means the start script (or mysql itself, here) must maintain the pid file.
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start "/etc/init.d/mysqld start"
  stop "/etc/init.d/mysqld stop"
  if failed port 3306 protocol mysql then restart
This will cause mysqld to be restarted whenever the check fails, such as when mysql's max connections is reached.

While I consider an automatic quick-fix to be good, this alone isn't good enough. Automatic restarts could hinder your ability to debug because the restart flushed the cause of the problem (at least temporarily). A mysql check failed, but what caused it?

To start with, maybe we want to record who was doing what when mysql was having problems. Depending on the state of your database, some of this data may not be available (if mysql is frozen, you probably can't run 'show full processlist'). Here's a short script to do that (we'll call it "get-mysql-debug-data.sh"):

#!/bin/sh

time="$(date +%Y%m%d.%H%M%S)"
[ ! -d /var/log/debug ] && mkdir -p /var/log/debug
exec > "/var/log/debug/mysql.failure.$time"

echo "=> Status"
mysqladmin status
echo
echo "=> Active SQL queries"
mysql -umonitor -e 'show full processlist\G'
echo
echo "=> Hosts connected to mysql"
lsof -i :3306
We'll also need to tell Monit to run this script whenever mysql's check fails.
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  if failed port 3306 protocol mysql then
    exec "/bin/get-mysql-debug-data.sh"
However, now mysql doesn't get restarted if a health check fails; we only record data. I tried a few permutations to get both data recorded and mysql restarted, and came up with this as working:
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start "/etc/init.d/mysqld start"
  stop "/bin/sh -c '/bin/get-mysql-debug-data.sh ; /etc/init.d/mysqld stop'"
  if failed port 3306 protocol mysql then restart
Now any time mysql is restarted by monit, we'll run the debug data script and then stop mysqld. A better solution is probably to combine the data collection and the stop invocation into a separate script that you set with 'stop "myscript.sh"'.

If I run monit in the foreground (monit -I), I'll see this when mysql's check fails:

MYSQL: login failed
'mysqld' failed protocol test [MYSQL] at INET[localhost:3306] via TCP
'mysqld' trying to restart
'mysqld' stop: /bin/sh
Stopping MySQL:                                            [  OK  ]
'mysqld' start: /etc/init.d/mysqld
Starting MySQL:                                            [  OK  ]
'mysqld' connection succeeded to INET[localhost:3306] via TCP
And in our debug log directory, a new file has been created with our debug output.

This kind of application isn't a perfect solution, but it can be quite useful. How many times has a coworker accidentally caused a development service to crash and you've needed to go restart it? Applying the ideas presented above will both keep you from sshing all over the place restarting broken services and help you automatically track crash and bad-health information.

Further reading:

daemontools
Monit
mon
Another discussion of daemon monitoring tools
This article is old, but still makes good points about why you want your services to automatically restart when they die.

December 2, 2008

Day 2 - Windows Powershell

Maybe you're a windows sysadmin. Maybe you're not. Either way, you might find the features in Powershell pretty cool.

Powershell is Windows-only and free to use. Some syntactic differences aside, it looks and feels like a unix shell language. It has standard features you might expect such as functions, recursion, variables, variable scope, objects, and a handful of built-in functionality to help you get work done, but it does many things better.

In addition to these baseline expectations, functions in powershell trivially take flag arguments by simply declaring a function argument (function foo($bar) { ... } can be invoked as foo -bar "somevalue"). You can create arbitrary objects with properties and methods defined on the fly. Exception handling, logging, trace debugging, and other goodies are packed in by default.

It supports pipes like your favorite unix shell, except instead of piping text, you pipe objects. The key word is object. When you run 'dir' (or ls, which is an alias), it outputs file objects. When you run 'ps' (which is an alias of get-process), you get process objects. When you run 'get-content' on a file, you get an array of strings.

Why is this significant? As a unix sysadmin, you quickly become intimate with piping one command to another, smoothly sandwiching filter invocations between other tools. Filter tools like awk, sed, grep, xargs, etc, all help you convert one command's output text into input text for another command. What if you didn't have to do that, or had to do it less? No more parsing the output of ls(1), stat(1), or du(1) to ask for file attributes when powershell's file object has them. What about getting process attributes?

# Yes, this is a comment in Powershell
# Show the top 3 consumers of virtual memory:
PS > get-process | sort {$_.VirtualMemorySize} | select -last 3

Handles  NPM(K)    PM(K)      WS(K) VM(M)   CPU(s)     Id ProcessName
-------  ------    -----      ----- -----   ------     -- -----------
    745      58    66648       4316   632    21.03   3564 CCC
   1058     107   230788      28384   680   600.23   5048 Steam
    446      78  1328988    1267960  1616 6,223.72   3692 firefox

# Kill firefox
PS > get-process firefox | stop-process
# Alternately
PS > get-process firefox | foreach { $_.kill() }
'select' is an alias for 'select-object' which lets you (among other things) trim an object to only selected properties. Inspection is done with 'get-member' (or 'gm' for short) and you can inspect objects output by 'ls' by doing: ls | gm, or processes with get-process | gm. You can ask an object what type it is with obj.gettype(); such as (get-item .).gettype()

But what if you want to manipulate the registry easily? The registry, filesystem, aliases, variables, environment, functions, and more are all considered "providers" in Powershell. Each provider gives you access to a certain data store using standard built-in commands. A provider can be invoked by prefixing a path with the provider name. For example, to access a registry key, you could use dir Registry::HKEY_CURRENT_USER to list keys in that part of the registry.

In addition to other neat features, you've got nice access to both COM and .NET. Want to create a tempfile?

PS > $tmp = [System.IO.Path]::GetTempFileName()
PS > ls $tmp
    Directory: Microsoft.PowerShell.Core\FileSystem::C:\Users\Jordan\AppData\Local\Temp

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         12/2/2008   1:25 AM          0 tmp55FC.tmp

PS > rm $tmp

Help is also conveniently available from the powershell prompt. Help, which can be accessed with a muscle-memory-friendly 'man,' comes in different detail levels. Try help select-object and then help select-object -detailed. There are also other useful builtins like foreach-object (like 'for' in bourne), select-object (like cut, tail, head, and uniq, but cooler), sort-object (like sort, but cooler), where-object (like grep, but cooler), measure-object (like wc, but cooler), and format-list and format-table for sanely printing object properties.

Are you still scripting in DOS batch or VBScript? Do you use Cygwin as a means of escaping to a scripting language on windows that is less frustrating or awkward? Are you suddenly facing windows administration when your background is unix? Check out Powershell.

Further reading:

Download Powershell
Powershell homepage
Hey, Scripting Guy!
A pretty good resource for practical powershell examples

December 1, 2008

Day 1 - strace and tcpdump

One of the staple quotes from the British sitcom The IT Crowd is "Have you tried turning it off and on again?" as a first response when one of the IT staff answers a call. My officemate (a fellow sysadmin) has his own generic first response when someone wanders in with a question: "Have you run tcpdump or strace?"

It's a good question partly because almost nobody answers "yes" and partly because these two tools are very useful in helping you debug.

When other tools are failing to help you when debugging a system or network problem, strace or tcpdump might just be your salvation. Strace helps you trace system calls while tcpdump helps you trace network activity. For the BSD and Solaris users, you'll find truss a similar tool for tracing system calls. On Solaris, you also get snoop, which is similar to tcpdump.

These tools generally provide you the ability to have your output with high-precision real or relative timestamps, more or less verbosity, some filtering, etc. Times are important if you have a mysterious time-related problem.

Strace lets you trace a new process (strace <command ...>) or running processes (strace -p <pid>). Is apache acting strange? Use strace to attach to all of the httpd processes:

% strace $(pgrep httpd | sed -e 's/^/-p /')
Process 12571 attached - interrupt to quit
Process 12573 attached - interrupt to quit
Process 12574 attached - interrupt to quit
Process 12575 attached - interrupt to quit
[pid 12574] accept(4,  <unfinished ...>
[pid 12573] accept(4,  <unfinished ...>
[pid 12571] select(0, NULL, NULL, NULL, {0, 216000} <unfinished ...>
[pid 12575] accept(4,  <unfinished ...>
[pid 12571] wait4(-1, 0x7fff8f7a2ba4, WNOHANG|WSTOPPED, NULL) = 0
[pid 12571] select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
(output continues, but I cut it for brevity)
Now you have a good idea what each process is doing with respect to system calls: On this idle apache server, one process appears to be in a sleep loop waiting for children to die while the rest are waiting for accept() to return on the listening http socket.

Access a page on this webserver from your workstation and check strace's output - maybe you'll learn more about what your webserver does when it serves up a page?

To see the network traffic alone, use tcpdump. tcpdump will show you traces of packets and can have the trace limited to only packets matching a query. To watch for http traffic, we would use this invocation:

% tcpdump 'port 80'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
00:57:08.167785 IP 192.168.30.89.33471 > 192.168.30.19.http: S 3860627520:3860627520(0) win 5840 
00:57:08.167994 IP 192.168.30.19.http > 192.168.30.89.33471: S 1074530775:1074530775(0) ack 3860627521 win 5792 
00:57:08.167905 IP 192.168.30.89.33471 > 192.168.30.19.http: . ack 1 win 46 
00:57:08.169271 IP 192.168.30.89.33471 > 192.168.30.19.http: P 1:94(93) ack 1 win 46 
(output continues, but I cut it for brevity)
The above output might not be totally readable, but you should at least understand some of it: source and destination address and ports, timestamps, etc. Lastly, the filter language used for selecting only certain packets is documented well in the tcpdump manpage.
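To make those fields concrete, here is a small Python sketch that pulls the timestamp, addresses, ports, and TCP flags out of lines shaped like the ones above. It assumes this exact IPv4 one-line output style; other tcpdump versions and options print differently, and the names `LINE` and `parse` are just for illustration.

```python
import re

# Matches lines such as:
# 00:57:08.167785 IP 192.168.30.89.33471 > 192.168.30.19.http: S 38...(0) win 5840
LINE = re.compile(
    r"^(?P<time>\S+) IP "
    r"(?P<src>\S+)\.(?P<sport>[^.\s]+) > "   # the port is the last dotted field
    r"(?P<dst>\S+)\.(?P<dport>[^.\s:]+): "
    r"(?P<flags>\S+)"
)


def parse(line):
    """Return a dict of fields for one packet line, or None if it doesn't match."""
    m = LINE.match(line)
    return m.groupdict() if m else None


if __name__ == "__main__":
    sample = ("00:57:08.167785 IP 192.168.30.89.33471 > 192.168.30.19.http: "
              "S 3860627520:3860627520(0) win 5840 ")
    print(parse(sample))
```

In this output style, a flags field of "S" marks a SYN, "P" a push, and "." means no flags other than possibly ACK are set.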

Keeping tcpdump, strace, and similar inspection tools close to your debugging practices should help you better debug and profile problems, and it just might save you the trip down the hall.

Further reading:

tcpdump manpage
strace manpage
DTrace (Solaris, FreeBSD, OS X) and SystemTap (Linux)
These tools are much more advanced than strace or truss. They allow you to scriptably inspect and instrument your system and processes in a wonderful range of ways beyond just system calls.
Wireshark (previously called Ethereal)
Wireshark (and tshark, the terminal version) provides much greater protocol inspection than does tcpdump or snoop. You'll find its benefits beyond tcpdump include more advanced (and easier) filtering, stream tracking, deeper protocol inspection, and more.