December 2, 2009

Day 2 - Zen and the Art of Troubleshooting

This article was written by Joseph Kern.

Sysadmins troubleshoot using mental abstractions. Our knowledge of a system can never be complete; we are always missing information when troubleshooting. Abstractions are used to cover gaps in our knowledge, and allow us to fix problems without knowing all of the details.

Abstractions are a sysadmin's most powerful tool and potentially our biggest weakness.

A co-worker (let's call him Ralph) had been troubleshooting a user's computer for two hours when he finally called me. I walked to where Ralph was working. He looked frustrated and exhausted.

"I've tried everything I can think of, but this system will not stay on the network."

"Really? What have you tried?" I asked.

"Updating the drivers, a reboot, re-seating the NIC, loopback tests, I checked the switch config, and I even updated the BIOS."

"Really?" I questioned.

"Yeah. No dice."

I took a deep breath, and thought about what he said.

I turned around and left the room without saying a word.

I walked down to the communications closet, found the right cable and pulled it out. I bent the plastic clip back and inserted it back into the switchport with a satisfying *click*. The link-light went green.

I walked back into the room, sat down at the computer, and executed a few well-practiced keyboard maneuvers (/release and /renew). The network connection was established.

The user was very grateful (probably had an eBay auction). I mumbled something about a switch configuration error and said that we'd be sure to look into it. After we left, I told Ralph what happened.

"#&%$! How did I miss that?"

"It's simple. You took something for granted."

We can take abstractions for granted and forget to see the obvious. When we abstract away the portion of our knowledge that contains the problem, we reach a point where we cannot troubleshoot even the simplest issue. This can be embarrassing, and it can undermine our confidence in our work.

This happens to everyone, even senior sysadmins. Tom Limoncelli has an excellent List of Dumb Things To Check that's filled with simple (and some complex) things that have cost sysadmins hours of wasted time.

When troubleshooting, it's all too easy to think of every possible thing that could go wrong. We get caught up in our own abstractions and forget about reality. We must focus on the moment, and deliberately acknowledge where we've created abstractions.

This is a deliberate form of thinking, and it takes some practice. In Zen this is called 初心 (shoshin), the Beginner's Mind: seeing everything fresh, as if it were the first time you've seen it. Being in the moment. Being deliberate.

The next time a complex problem occurs, take a minute (take a deep breath), and deliberately choose your abstractions. If you don't know why you've chosen one ("Is the network cable plugged in?"), question it, observe it, and understand it.
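One way to make this habit concrete is to write your assumptions down and mark which ones you have actually checked. The Python below is only a rough sketch of that idea, not a tool from this article: the `Assumption` class, the `first_unverified` helper, and the layer numbers (1 meaning physical, loosely following the OSI model) are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    """One thing we are taking for granted while troubleshooting."""
    layer: int            # illustrative layer number, 1 = physical (an assumption)
    description: str
    verified: bool = False


def first_unverified(assumptions):
    """Return the lowest-layer assumption that has not been checked, or None."""
    unchecked = [a for a in assumptions if not a.verified]
    return min(unchecked, key=lambda a: a.layer, default=None)


# Hypothetical list loosely modeled on Ralph's two hours of work.
assumptions = [
    Assumption(2, "Switch port configuration is correct", verified=True),
    Assumption(2, "NIC driver is current", verified=True),
    Assumption(1, "Cable is latched into the switch port"),
    Assumption(3, "Host has a valid DHCP lease"),
]

suspect = first_unverified(assumptions)
if suspect is not None:
    print(f"Check first: {suspect.description} (layer {suspect.layer})")
```

The point of the sketch is the `verified` flag: anything you have not looked at with your own eyes stays unverified, no matter how obvious it seems.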

Deliberately choosing your thoughts will not only help you troubleshoot, it will bring a vitality and freshness to your work. You will see things that you haven't seen before and understand things few others do. Your work will feel more like play, and you will enjoy the simple as well as the complex problems.


5 comments :

adewale said...

The link to Tom Limoncelli's List of Dumb Things to Check doesn't work.

Shaun said...

Here's a working link:

http://whatexit.org/tal/mywritings/dumb-things-to-check.html

Joseph Kern said...

Whoops. Sorry about that. I'll forward that along.

Garp said...

One of the things they emphasised in my CCNA training was working up through the ISO/OSI layers, so starting at the physical layer first and working up to system.

That said I'd been practicing similar methods for years and just called it 'going to first principles'. Establish the basics first and only then aim for the exotic. 99 times out of 100 the problem is actually a basic one that looks exotic :)

Joseph Kern said...

Garp, that's the same way I learned too. Start at the Physical Layer and work your way up. That will be a future post: Finite Frameworks. Combining Finite State Machines and the OSI model.

I haven't really written anything on it yet. But it should be kewl.

I also have rearranged Tom's Dumb Things to Check list as OSI layers:

http://docs.google.com/View?id=dg8sr4q_19dk3xjqh6

This will probably be in the aforementioned article.