The machine that is delivering this blog to you is a standalone Dell server, running CentOS 5.2. It resides in a data center managed by Zope Corporation (ZC) Systems Administrators (SAs). I perform the majority of the software administration and maintenance myself, and they maintain all of the hardware along with some of the software.
The single most important software maintenance that they are responsible for is backing up the system. We’ll get to that in a minute.
This past Sunday, we were in our usual hotel room. I was logged on for roughly eight straight hours, and all was well on this server. I shut my laptop off at around 10:15pm. When we woke up in the morning, Lois checked her email on her Treo. It kept hanging. We thought it was a Treo/Sprint problem, but even rebooting the Treo didn’t correct the hang.
When we got into the office (at around 7am), our laptops couldn’t retrieve mail from the server either. Pinging the server worked, but I couldn’t SSH on to it. In fact, most services delivered by this server were gone (including this blog, which was unreachable). The only service (aside from ping/ICMP) that was obviously up was Zope! If you just wanted to visit our corporate site, and learn about our VC business (all of which is handled by Zope), that was running fine!
The head of the SA group at ZC was also in that early (thankfully!). After poking around a bit, I encouraged him to power off the machine remotely, and then power it back up remotely. He did. Unfortunately, the machine didn’t come back up (or at least it wasn’t responding to pings any longer, so something was wrong).
Another SA was called and directed to go to the data center rather than the office. When he got there, and hooked up a console to the server, he saw that the disk was failing. Attempts to get it started (running fsck, etc.) proved fruitless. It decided to die completely that morning.
Yet another SA was dispatched (from the office) to the data center with a fresh disk (he had other work to perform at the data center, or I would have delivered the disk myself, happily!). We knew in advance that this disk might be slightly problematic, as it had CentOS 4.6 installed on it.
When the disk was inserted into the machine, it booted up immediately. I was able to successfully SSH to the machine, but of course, nothing of mine was on there. That’s when the original SA (who also happens to be our expert in backups) started restoring the machine from our backups.
For a long time, we used to run the Amanda backup program at ZC. I don’t know why, but our experience with Amanda was never good. I am not suggesting that others don’t find it to be a perfectly acceptable solution, but for us, for whatever reasons, it wasn’t good enough.
After searching and evaluating a number of alternatives, the ZC SAs selected Bacula. We’ve been using that for a reasonable period of time now. The Bacula restore of my machine didn’t take all that long, and every file that I wanted/needed was successfully and correctly restored. In fact, the nightly incremental backup had run successfully before the disk decided to die (how polite), so even some non-critical files that I touched on Sunday were successfully restored! Whew!
That said, it was hardly a trip to the candy store. None of the applications that I compiled myself (more than you’d think, I’m a geek at heart!) worked, because of the OS (operating system) mismatch. My programs required newer versions of various libraries (possibly even the kernel itself in some cases), so starting from a 4.6 machine and restoring files that required 5.2 wasn’t as clever as we’d hoped.
Still, with some pain, theoretically, one can upgrade a 4.6 machine to 5.2 over the network. That was my plan… Well, the best laid plans, as they say…
I released the SA from the task of baby-sitting my machine, because I knew that a network upgrade would take a while, at best. After doing some network magic, the machine was in a little bit of a funky state, but there was a chance that a reboot would bring the machine up in a 5.x state, making the rest of the upgrade fairly straightforward.
Unfortunately, when I rebooted, the machine hung (wouldn’t actually go away, still pingable, but otherwise non-responsive). Again I asked the head of the SA group to remotely power down/up. Again, it powered down properly, but didn’t come back up.
In fact, it likely did come up, but because of the funky state that I left the machine in, it couldn’t be seen publicly due to network configuration issues. This time, we decided to take a more conservative approach, because opticality.com had already been down for at least 8 hours (not a happy situation for either Lois or me).
The original SA went back down to the data center. This time, he burned a CD with a network install ISO of CentOS 5.2. After installing the correct OS onto the machine, he again restored the disk with Bacula. This time, everything matched. Still, there were problems…
The biggest issue (by far!) was foolishness on my part in mapping out what I wanted backed up on the machine to begin with. Sparing you the gory details, I ended up restoring the Yum database from my backup over the actual Yum database from the installation, so the system didn’t really know what was installed and what wasn’t. Not a good thing.
I only really cared about email to begin with. I built a few things (pretty quickly) by hand, and got email running. Then I got the web stuff up pretty quickly too. Finally, IM. Those are my big three hot buttons, everything else could be dealt with later on.
I didn’t want the SA to leave until we could prove that the machine could be booted correctly remotely. That took some time as well, as a number of the services that are started automatically weren’t installed on the machine (though RPM/YUM thought they were!). We (or rather he, at the console) disabled them one by one, until the machine came up.
After I restored the more critical ones, we tested a reboot again, and it came up fine. Whew. I released him again, this time for the last time.
I cleaned up a few more things and went to bed reasonably happy (it was now close to 10pm on Monday night). Over the next two days, I spent a few hours cleaning up more things. Yesterday, I completed the cleanup.
The cleanup involved a series of shell scripts and filters, doing things like the following:
yum list | grep installed | cut -d ' ' -f 1 > /tmp/installed
I then ran each of the resulting packages (the ones the system thought were installed!) through:
rpm -V package-name
I filtered that output for lines starting with the word “missing”, removed those packages (they weren’t there anyway, so it was a database cleanup activity), and then installed them via Yum again. It wasn’t actually as painful as it sounds, but it wasn’t pain free either.
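The verify-and-filter step can be sketched roughly as follows. This is an illustration, not the actual scripts used: the function names `has_missing_files` and `list_broken_packages` and the `VERIFY_CMD` override are hypothetical, and the sketch assumes the package list is a file with one package name per line (such as /tmp/installed above). It only reports the suspect packages; removing and reinstalling them is left as a deliberate manual step.

```shell
#!/bin/sh
# Sketch: report packages whose files `rpm -V` lists as missing.

# Succeeds if the verification output on stdin has a line starting with "missing".
has_missing_files() {
    grep -q '^missing'
}

# Prints every package named in the list file ($1) whose verification
# output contains a "missing" line.
# VERIFY_CMD is a hypothetical hook so the filter can be tried without rpm;
# when it is unset, the real `rpm -V` is used.
list_broken_packages() {
    list_file=$1
    while IFS= read -r pkg; do
        # Skip blank lines in the package list.
        [ -n "$pkg" ] || continue
        if ${VERIFY_CMD:-rpm -V} "$pkg" 2>/dev/null </dev/null | has_missing_files; then
            printf '%s\n' "$pkg"
        fi
    done < "$list_file"
}
```

With the functions loaded, something like `list_broken_packages /tmp/installed > /tmp/broken` would produce a list to review by hand before removing and reinstalling anything. Reviewing the list first matters, because a package that Yum thinks it owns may share a config file with something built by hand.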
The biggest headache occurred when removing a non-existent package also moved a config file (that was being used by a package I built separately, so Yum was unaware of it). At one point yesterday, without realizing it, I killed our SMTP (email) server. Oops. We were down for about 10 minutes, before I realized it, and got it back up and running.
At this point, all is back to normal, and Yum correctly knows what’s on the system.
Here are the lessons:
- Back Up Regularly
- Have a great group of dedicated SAs behind you
- Have some spare parts close by
- Think hard before taking some stupid actions (all my fault!)
I’m truly amazed, pleased, impressed, etc. that we lost nothing. Of course, we were down for 12 hours, but Internet email is a truly resilient thing. All mail that wanted to be sent to us was deferred when our SMTP server didn’t answer the call. When we came back up, mail started to be delivered nearly instantaneously, and by morning, all mail had been correctly delivered.
Here’s hoping I don’t go through this experience again, and here’s hoping that if you do, this post might help a bit…