December 1, 2008

Day 1 - strace and tcpdump

A staple line from the British sitcom The IT Crowd is "Have you tried turning it off and on again?", the first response whenever one of the IT staff answers a call. My officemate (a fellow sysadmin) has his own generic first response when someone wanders in with a question: "Have you run tcpdump or strace?"

It's a good question partly because almost nobody answers "yes" and partly because these two tools are very useful in helping you debug.

When other tools fail to help you debug a system or network problem, strace or tcpdump might just be your salvation. strace traces system calls, while tcpdump traces network activity. BSD and Solaris users will find truss to be a similar tool for tracing system calls. Solaris also has snoop, which is similar to tcpdump.

These tools generally let you tune their output: high-precision absolute or relative timestamps, more or less verbosity, some filtering, and so on. Timestamps matter when you are chasing a mysterious time-related problem.
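As a rough sketch of putting those timestamps to work: strace's -T flag appends the time spent in each system call as <seconds> at the end of the line, -tt prefixes each line with the wall-clock time in microseconds, and -r prints relative timestamps instead. The slow_syscalls helper below is a hypothetical name of my own, not part of strace; it assumes a POSIX shell and awk, and simply filters -T output for calls slower than a threshold.

```shell
# Hypothetical helper (not part of strace): read `strace -T` output on
# stdin and print only the calls that took longer than a threshold.
# -T makes strace end each completed call with its duration, e.g. <0.000020>.
slow_syscalls() {
    threshold="${1:-0.1}"   # seconds; 0.1 is an arbitrary default
    awk -v t="$threshold" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding < > to get the duration in seconds.
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            if (secs + 0 > t + 0) print
        }'
}

# Usage (assumes strace is installed and you are allowed to trace the pid):
#   strace -tt -T -p <pid> 2>&1 | slow_syscalls 0.5
# Relative timestamps instead of wall-clock time:
#   strace -r -p <pid>
```

strace writes its trace to stderr, which is why the usage line redirects it with 2>&1 before the pipe.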

strace lets you trace a new process (strace <command ...>) or running processes (strace -p <pid>). Is Apache acting strange? Use strace to attach to all of the httpd processes:

% strace $(pgrep httpd | sed -e 's/^/-p /')
Process 12571 attached - interrupt to quit
Process 12573 attached - interrupt to quit
Process 12574 attached - interrupt to quit
Process 12575 attached - interrupt to quit
[pid 12574] accept(4,  <unfinished ...>
[pid 12573] accept(4,  <unfinished ...>
[pid 12571] select(0, NULL, NULL, NULL, {0, 216000} <unfinished ...>
[pid 12575] accept(4,  <unfinished ...>
[pid 12571] wait4(-1, 0x7fff8f7a2ba4, WNOHANG|WSTOPPED, NULL) = 0
[pid 12571] select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
(output continues, but I cut it for brevity)
Now you have a good idea of what each process is doing with respect to system calls: on this idle Apache server, one process appears to be in a sleep loop waiting for children to die, while the rest are waiting for accept() to return on the listening HTTP socket.

Access a page on this webserver from your workstation and check strace's output. You may learn more about what your webserver does when it serves up a page.
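If the full trace is too noisy for that experiment, strace can restrict itself to a class of calls with -e. Here is a minimal sketch that attaches to every process matching a name and traces only network-related system calls. The function names are hypothetical, and it assumes sh or bash (zsh does not word-split unquoted variables by default) with pgrep available.

```shell
# Turn a list of PIDs (one per line on stdin) into "-p PID" arguments,
# the same trick as the sed invocation above, joined onto one line.
pids_to_strace_args() {
    sed -e 's/^/-p /' | tr '\n' ' '
}

# Hypothetical wrapper: attach to all processes with the given name and
# show only network-related system calls, with microsecond timestamps.
trace_network_calls() {
    args=$(pgrep "$1" | pids_to_strace_args)
    if [ -z "$args" ]; then
        echo "no processes matching '$1'" >&2
        return 1
    fi
    # $args is intentionally unquoted so it splits into separate words.
    strace -tt -e trace=network $args
}

# Usage: trace_network_calls httpd
```

With this running, a single page request should show up as an accept() returning in one of the children, followed by whatever that child does with the new socket.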

To see the network traffic alone, use tcpdump. tcpdump shows you a trace of packets and can limit that trace to only the packets matching a filter expression. To watch for HTTP traffic, we would use this invocation:

% tcpdump 'port 80'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
00:57:08.167785 IP > S 3860627520:3860627520(0) win 5840 
00:57:08.167994 IP > S 1074530775:1074530775(0) ack 3860627521 win 5792 
00:57:08.167905 IP > . ack 1 win 46 
00:57:08.169271 IP > P 1:94(93) ack 1 win 46 
(output continues, but I cut it for brevity)
The above output might not be totally readable, but you should at least understand some of it: source and destination addresses and ports, timestamps, etc. Lastly, the filter language used for selecting only certain packets is documented well in the tcpdump manpage.
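To give a flavor of that filter language, here is a small sketch that combines primitives with "and". The port_filter helper is hypothetical (it only builds the filter string), and eth0, 192.0.2.10, and http.pcap are placeholders rather than values from the capture above.

```shell
# Hypothetical helper: build a tcpdump filter for TCP traffic on one
# port, optionally limited to a single host.
port_filter() {
    port="$1"
    host="${2:-}"
    if [ -n "$host" ]; then
        printf 'tcp port %s and host %s\n' "$port" "$host"
    else
        printf 'tcp port %s\n' "$port"
    fi
}

# Usage (tcpdump usually needs root; -n skips name lookups):
#   tcpdump -n -i eth0 "$(port_filter 80 192.0.2.10)"
# Capture whole packets (-s 0) to a file, then read the file back later:
#   tcpdump -n -s 0 -w http.pcap "$(port_filter 80)"
#   tcpdump -n -r http.pcap
```

Quoting the filter as one argument keeps the shell from interpreting anything in it, which matters more once you start using parentheses in filters.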

Keeping tcpdump, strace, and similar inspection tools close at hand should help you debug and profile problems, and it just might save you the trip down the hall.

Further reading:

tcpdump manpage
strace manpage
DTrace (Solaris, FreeBSD, OS X) and SystemTap (Linux)
These tools are much more advanced than strace or truss. They allow you to scriptably inspect and instrument your system and processes in a wonderful range of ways beyond just system calls.
Wireshark (previously called Ethereal)
Wireshark (and tshark, the terminal version) provides much greater protocol inspection than tcpdump or snoop. You'll find its benefits beyond tcpdump include more advanced (and easier) filtering, stream tracking, deeper protocol inspection, and more.
