December 22, 2009

Day 22 - Lessons in Migrations

This article was written by Saint Aardvark the Carpeted

I've been through two big moves in my career. The first was about four years ago when the company I was working for moved offices. It was only across the street, but it meant shifting the whole company over. We had about forty employees at the time, and maybe a hundred workstations, test servers, and production servers.

The second move was when, earlier this year at my current job, we finally got to move into our new server room. This time the scope of the move was smaller (no workstations, and about twenty servers), but the new digs were nicer. :-)

I learned a lot from these two moves. I want to pass those lessons on to you.

Have a second set of skilled hands around

At both places, I was the only sysadmin on staff. For the first move, my company hired a consultant for a few days to help me out with the move and its aftershocks. It was great to have someone else around who could help diagnose email problems, run traceroute and generally run interference while I swore at the servers.

The second time, I thought that four volunteers, plus me, would be enough... it was only twenty servers, after all. Mainly, it would be a question of cabling, then things would just fall into place after that... right?

Well, the volunteers were excellent -- I can't say enough about them -- but a second set of skilled hands would have simplified things a lot. I often found myself switching between them as questions came up: How do these rack rails work? Which interface is eth0? Did you really mean to put 8U of servers into 4U of space?

Obviously, someone familiar with your network, OS/distro and thought patterns can help you with network testing, re-jigging Apache proxy directives, and finding your pizza coupons. Even something as simple as being familiar with rack rails helps a lot.

And if you're moving offices, don't do this without the support of your company. For the first move there were three of us -- including the CEO -- and I wouldn't want to do it with fewer bodies or less influence.

Don't underestimate how tired you'll be

In some ways, the first move was easier despite it being much more involved. We moved on a Saturday, I got machines up and running on Sunday, and on Monday, things were mostly working again. Knowing that I had the time meant that I could go home with a clear conscience.

The second move, though, was meant to be done in one day. It was gonna be simple: I had a checklist for services and boot order, the network settings were ready to go, and the new server room was quite close to our old server room. How long could it take to move stuff two blocks?

Well, the moving took the morning. De-racking machines, getting stuff on the elevator and to the truck (thank goodness for strong movers), then dropping stuff off in the server room left us in a good position for lunch.

But after lunch, little things cropped up: I'd borked some netmask settings on a couple key servers. The rack I'd planned to put the firewall in was too shallow to accept it. My placement of the in-rack switches blocked some PDU outlets. Some of the rack rails were fragile, stupidly constructed, and difficult to figure out.

Each of these things was overcome, but it took time. Before I knew it, it was 7:15pm, I'd been at it for 11 hours and I was exhausted. I had to head home and finish it the next day. Fortunately, I had the support of my boss in this.

Don't make the day any worse than it has to be

At the first move, I'd decided it would be a good idea to switch to a new phone vendor as we moved into the new building.

I avoided firing by, I later calculated, the skin of my teeth.

Your move will be long. It will be stressful. You will trip over things you didn't plan for, thought you'd planned for, and were sure someone else was planning for. Don't add to the misery by making another big change at the same time. This goes double for anything involving a complicated technology with multiple vendors (including a local monopoly that Does Not Like competition) that will leave everyone very upset if it fails to work right when they come in.

Instead, mark it carefully on your calendar for five years in the future.

Set up monitoring early

For the second move, my Nagios box was second on my list of machines to boot up. I'd set it up with new addresses ahead of time, and made sure when it did start that alerts were turned off.

As machines came up, I watched the host and service checks turn green. It was a good way to ensure that I hadn't forgotten anything...if it failed, I'd either forgotten to update the address or I had a genuine problem. Either way, I knew about it quickly, and could decide whether to tackle it right away or leave it for later.
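The same "watch it turn green" idea can be sketched without a full monitoring system. This is only a minimal illustration, not the Nagios setup described above: the service names, the 192.0.2.x addresses and the ports are all hypothetical placeholders, and a plain TCP connect is a much cruder test than a real service check.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, probe=tcp_probe):
    """Probe every (name, host, port) entry and map it to True or False."""
    return {svc: probe(svc[1], svc[2]) for svc in services}


def failures(results):
    """Return the sorted names of services whose probe failed."""
    return sorted(name for (name, _host, _port), ok in results.items() if not ok)


if __name__ == "__main__":
    # Hypothetical post-move checklist; replace with your own new addresses.
    checklist = [
        ("web", "192.0.2.10", 80),
        ("mail", "192.0.2.11", 25),
        ("nfs", "192.0.2.12", 2049),
    ]
    failed = failures(check_services(checklist))
    print("all green" if not failed else "still red: " + ", ".join(failed))
```

A red entry means the same thing it did on moving day: either the address in the list is stale or the service really is down. Passing the probe in as a parameter makes it possible to try the logic with a fake probe before the real network exists.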

Don't forget about cabling

I planned out a lot of things for my second move, and it served me well. Service checklists, boot sequences...it had taken a long time, but it was worth it. I even had a colour-coded spreadsheet showing how many rack units, watts and network cables I'd need for each server.

Unfortunately, what I missed was thinking about the cabling itself. I'd picked out where the switch in each rack would go, I'd made sure I had lots of cables of varying lengths around, and so on. But there were some things I'd missed that experience -- or a dry run -- would have caught:

  • Horizontal cable management bars blocked a couple of PDU outlets each; this was mostly, but not entirely, unavoidable.
  • PDU outlets were on the wrong side for most -- but not all -- servers, which put power cables right next to network cables.
  • The switches were right next to some PDU outlets -- and since the switch ports went all the way to the side, that meant some network cables were right next to power cables.

A dry run of the cabling would not have been easy. I didn't have a spare server to rack and check for problems, and some of these things only emerged when you had a full rack. But it would have been a lot less work than doing it all on the day of the move (let alone swearing at it and leaving it for Christmas maintenance).
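The per-server spreadsheet of rack units, watts and network cables can also be totalled and checked mechanically. The sketch below assumes made-up server names, figures and per-rack limits (none of them come from the actual move), and it only catches arithmetic problems such as too many rack units or too few switch ports. It says nothing about which side the PDU outlets are on, so it is no substitute for a physical dry run.

```python
from collections import defaultdict

FIELDS = ("units", "watts", "cables")


def rack_totals(servers):
    """Sum rack units, watts and network cables for each rack."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for server in servers:
        for field in FIELDS:
            totals[server["rack"]][field] += server[field]
    return dict(totals)


def over_budget(totals, limits):
    """List (rack, field, needed, available) for every exceeded limit."""
    problems = []
    for rack in sorted(totals):
        for field in FIELDS:
            needed = totals[rack][field]
            if needed > limits[field]:
                problems.append((rack, field, needed, limits[field]))
    return problems


if __name__ == "__main__":
    # Hypothetical plan: three servers squeezed into a rack with 4U free.
    plan = [
        {"name": "web1", "rack": "A", "units": 2, "watts": 400, "cables": 2},
        {"name": "db1", "rack": "A", "units": 4, "watts": 800, "cables": 3},
        {"name": "file1", "rack": "A", "units": 2, "watts": 500, "cables": 2},
    ]
    # Hypothetical limits: free rack units, PDU watts, free switch ports.
    limits = {"units": 4, "watts": 2000, "cables": 8}
    for rack, field, needed, available in over_budget(rack_totals(plan), limits):
        print(f"rack {rack}: needs {needed} {field}, only {available} available")
```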

Getting new equipment? Make sure it works

As part of the new server room, we got a few bells and whistles. Among them were a humidifier (necessary since we didn't have a vapour barrier) and leak detectors that sat on the floor, waiting to yell at me about floods. "Woohoo!" I thought. "We're movin' on up!"

What I didn't think about was how these things worked...or rather, how I could tell that they worked. We moved in during summer, so the humidifier wasn't really necessary. But when winter came around and the humidity dropped to 15%, I realized that I had no idea how to tell if the thing was working. And when I dug up the manual, I had no idea what it was talking about.

Same with the leak detection. I knew it was there, since the sub-contractor had pointed it out. I had assumed it was managed by the monitoring box that had been installed along with it...and since I was busy right then moving in boxes and getting NFS working, I put it on the list of stuff to do later.

When I finally did tackle it later, it turned out I was wrong: it wasn't part of the other monitoring box. The box I needed to query didn't show anything about a leak detector. And I had no idea how to test the leak detection once I did figure it out.

In both cases, I erred by assuming that I could figure things out later. Most of the time, I can -- and being handy at figuring things out goes with the job. But there are limits to our expertise, our area of familiarity, and our ability to learn whole technologies at one sitting. One of the hardest things I've had to realize is that, while I like to think I'm capable of learning just about anything I'm likely to try my hand at, it's not practical -- that there are times when you have to give up and say, "That's just something I'll have to learn in my next life."

I also erred by not asking the installer to walk me through things. I should have asked for simple steps to test whether they were working, how to check for problems, and how to reset them.

Conclusion

Moving tests things and people. You (re-)learn what you forgot about; you find out how to do without missing parts; you come to terms with the limits of being human. It's no less true for being melodramatic, but a few tricks, some obsessive planning, foolhardy volunteers, and hard work will give you the best war story of all: a boring one, where everything worked out just fine in the end.

4 comments:

Matt said...

Wow, Hugh, great article! I had visions of my own office migration from downtown Manhattan to New Jersey, and yeah, you're right. It's amazing the things that pop up.

I've been lucky when building out my data sites that I haven't had to migrate them so much as build a new one, get it running, switch over, then tear down the old one. Moving a datacenter as an atomic operation certainly sounds...much less fun. If I ever have to do it, I know who I'm calling :-D

The one thing that always caused me the most problems was the leased lines. Getting data lines installed, then the test & turnup, then relying on it, is always hair-raising the first half dozen times you do it, particularly if test & turnup needs to happen on a specific day before a specific time. Then you've got to deal with number porting, and hope that goes through. It is certainly a hassle, but when it works, it's beautiful!

Thanks again for a great article!

orev said...

Quote:
One of the hardest things I've had to realize is that, while I like to think I'm capable of learning just about anything I'm likely to try my hand at, it's not practical -- that there are times when you have to give up and say, "That's just something I'll have to learn in my next life."

This has got to be one of the most important statements in this post... so much so that an entire book could be written about it. It's not understanding this that plagues almost all IT projects, causes business to hate the IT dept, and makes IT people generally impossible to work with (I am an IT person).

If someone thinks and acts like they are so smart they can do anything, everyone they have to work with will immediately shut down when dealing with them (in addition to thinking they are immature and also knowing that they don't actually know everything).

It's not about how much you know, or even how much capacity you have to learn, it's about what you can actually do in the time you have been given.
-------
I also recently completed a data center move, and I had everything planned out, down to calculating the length of cables needed for each server. This level of planning requires a lot of effort, and even most other people in the business will not understand why you need to go through so much "trouble" to plan so much. I'm happy to say that the only snag I ran into was that the 0U PDUs blocked some of the rack holes, but luckily everything is on Dell rapid/versa rails, which don't need to be screwed in to the racks.

Unknown said...

If you're moving services with physical connections, such as data or phone lines, force your vendor to come out and do a site survey no matter what they say. At a past company we moved offices in the same building (from the 3rd floor of a wing to the 10th floor of a tower), and the upstream data provider swore the line would be ready on the move date, "it was the same building" etc etc. Turns out the wing was serviced by a different central office, and the tower wasn't going to get a line until 60 days past the move-in date. My bad for not forcing the vendor to do a site survey.

We saved the day by dropping an ethernet cable from a 10th-floor window, across the roof of the wing, and into a window on the old 3rd floor and then into the phone system, which the building owner let us keep there for a while. For 60 days we prayed nobody would lean out their window and cut the cable.

Don't trust your vendors until you get them on site at the old and new location.

Matt said...

wow, Adam, that's..uhh...heroic. Good save!