July 31st, 2008:

Recovering From a Server Disk Crash

The machine that is delivering this blog to you is a standalone Dell server, running CentOS 5.2. It resides in a data center managed by Zope Corporation (ZC) Systems Administrators (SAs). I perform the majority of the software administration and maintenance myself, and they maintain all of the hardware along with some of the software.

The single most important software maintenance that they are responsible for is backing up the system. We’ll get to that in a minute.

This past Sunday, we were in our usual hotel room. I was logged on for roughly eight straight hours, and all was well on this server. I shut my laptop off at around 10:15pm. When we woke up in the morning, Lois checked her email on her Treo. It kept hanging. We thought it was a Treo/Sprint problem, but even rebooting the Treo didn’t correct the hang.

When we got into the office (at around 7am), our laptops couldn’t retrieve mail from the server either. Pinging the server worked, but I couldn’t SSH on to it. In fact, most services delivered by this server were gone (including this blog, which was unreachable). The only service (aside from ping/ICMP) that was obviously up was Zope! If you just wanted to visit our corporate site, and learn about our VC business (all of which is handled by Zope), that was running fine!

The head of the SA group at ZC was also in that early (thankfully!). After poking around a bit, I encouraged him to power off the machine remotely, and then power it back up remotely. He did. Unfortunately, the machine didn’t come back up (or at least it wasn’t responding to pings any longer, so something was wrong).

Another SA was called and directed to go to the data center rather than the office. When he got there, and hooked up a console to the server, he saw that the disk was failing. Attempts to get it started (running fsck, etc.) proved fruitless. It decided to die completely that morning.

Yet another SA was dispatched (from the office) to the data center with a fresh disk (he had other work to perform at the data center, or I would have delivered the disk myself, happily!). We knew in advance that this disk might be slightly problematic, as it had CentOS 4.6 installed on it.

When the disk was inserted into the machine, it booted up immediately. I was able to successfully SSH to the machine, but of course, nothing of mine was on there. That’s when the original SA (who also happens to be our expert in backups) started restoring the machine from our backups.

For a long time, we used to run the Amanda backup program at ZC. I don’t know why, but our experience with Amanda was never good. I am not suggesting that others don’t find it to be a perfectly acceptable solution, but for us, for whatever reasons, it wasn’t good enough.

After searching and evaluating a number of alternatives, the ZC SAs selected Bacula. We’ve been using that for a reasonable period of time now. The Bacula restore of my machine didn’t take all that long, and every file that I wanted/needed was successfully and correctly restored. In fact, the nightly incremental backup had run successfully before the disk decided to die (how polite), so even some non-critical files that I touched on Sunday were successfully restored! Whew!

That said, it was hardly a trip to the candy store. All of the applications that I compiled myself (more than you’d think, I’m a geek at heart!), didn’t work because of the OS (operating system) mismatch. My programs required newer versions of various libraries (possibly even the kernel itself in some cases), so starting from a 4.6 machine and restoring files that required 5.2 wasn’t as clever as we’d hoped.

Still, with some pain, theoretically, one can upgrade a 4.6 machine to 5.2 over the network. That was my plan… Well, the best laid plans, as they say…

I released the SA from the task of baby-sitting my machine, because I knew that a network upgrade would take a while, at best. After doing some network magic, the machine was in a little bit of a funky state, but there was a chance that a reboot would bring the machine up in a 5.x state, making the rest of the upgrade fairly straightforward.

Unfortunately, when I rebooted, the machine hung (wouldn’t actually go away, still pingable, but otherwise non-responsive). Again I asked the head of the SA group to remotely power down/up. Again, it powered down properly, but didn’t come back up.

In fact, it likely did come up, but because of the funky state that I left the machine in, it couldn’t be seen publicly due to network configuration issues. This time, we decided to take a more conservative approach, because opticality.com had already been down for at least 8 hours (not a happy situation for either Lois or me).

The original SA went back down to the data center. This time, he burned a CD with a network install ISO of CentOS 5.2. After installing the correct OS onto the machine, he again restored the disk with Bacula. This time, everything matched. Still, there were problems…

The biggest issue (by far!) was foolishness on my part in mapping out what I wanted backed up on the machine to begin with. Sparing you the gory details, I ended up restoring the Yum database from my backup over the actual Yum database from the installation, so the system didn’t really know what was installed and what wasn’t. Not a good thing.

I only really cared about email to begin with. I built a few things (pretty quickly) by hand, and got email running. Then I got the web stuff up pretty quickly too. Finally, IM. Those are my big three hot buttons; everything else could be dealt with later on.

I didn’t want the SA to leave until we could prove that the machine could be booted correctly remotely. That took some time as well, as a number of the services that are started automatically weren’t installed on the machine (though RPM/YUM thought they were!). We (or rather he, at the console) disabled them one by one, until the machine came up.

After I restored the more critical ones, we tested a reboot again, and it came up fine. Whew. I released him again, this time for the last time.

I cleaned up a few more things and went to bed reasonably happy (it was now close to 10pm on Monday night). Over the next two days, I spent a few hours cleaning up more things. Yesterday, I completed the cleanup.

The cleanup involved a series of shell scripts and filters, doing things like the following:

yum list | grep installed | cut -d ' ' -f 1 > /tmp/installed

Then running the resulting packages (the ones the system thought were installed!) through:

rpm -V package-name

Filtering that output for lines starting with the word “missing”. Then removing those packages (they weren’t there anyway, so it was a database cleanup activity) and then installing them via Yum again. It wasn’t actually as painful as it sounds, but it wasn’t pain free either.
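The verification step described above can be sketched as a small shell filter. This is a minimal sketch under stated assumptions, not the exact scripts used during the cleanup: the function names `has_missing_files` and `list_broken_packages` and the `VERIFY_CMD` variable are hypothetical, introduced only for illustration. It assumes package names arrive one per line on standard input, and that `rpm -V` reports each absent file on a line beginning with the word "missing".

```shell
#!/bin/sh

# Hypothetical helper: reads verification output on stdin and
# succeeds if at least one line starts with the word "missing".
has_missing_files() {
    grep -q '^missing'
}

# Hypothetical helper: reads package names (one per line) on stdin
# and prints those whose verification reports missing files.
# VERIFY_CMD is an assumed override hook; it defaults to "rpm -V".
list_broken_packages() {
    while read -r pkg || [ -n "$pkg" ]; do
        # Skip blank lines in the input list.
        [ -n "$pkg" ] || continue
        # Keep the verify command from consuming the package list on stdin.
        if ${VERIFY_CMD:-rpm -V} "$pkg" </dev/null 2>/dev/null | has_missing_files; then
            printf '%s\n' "$pkg"
        fi
    done
}
```

With the functions loaded, something like `list_broken_packages < /tmp/installed > /tmp/broken` would leave the candidate packages in a file (here the made-up `/tmp/broken`) that can be reviewed before anything is removed and reinstalled. Reviewing that list first is worthwhile, given what happened to my config file below.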

The biggest headache occurred when removing a non-existent package also moved a config file (that was being used by a package I built separately, so Yum was unaware of it). At one point yesterday, without realizing it, I killed our SMTP (email) server. Oops. We were down for about 10 minutes, before I realized it, and got it back up and running.

At this point, all is back to normal, and Yum correctly knows what’s on the system.

Here are the lessons:

  1. Back Up Regularly
  2. Have a great group of dedicated SAs behind you
  3. Have some spare parts close by
  4. Think hard before taking some stupid actions (all my fault!)

I’m truly amazed, pleased, impressed, etc. that we lost nothing. Of course, we were down for 12 hours, but Internet email is a truly resilient thing. All mail that wanted to be sent to us was deferred when our SMTP server didn’t answer the call. When we came back up, mail started to be delivered nearly instantaneously, and by morning, all mail had been correctly delivered.

Here’s hoping I don’t go through this experience again, and here’s hoping that if you do, this post might help a bit… :-)

July 2008 Poker

I won’t be playing tonight, so I can safely report this month’s results now.

Given the past few months, this one was financially successful. That said, it was extremely unsatisfying financially, while simultaneously being extremely satisfying from a having-a-good-time perspective. I guess you can’t have it all. ;-)

So, the bottom line first, then a few details that I want to report just to get them off my chest (and make the above paragraph a little less opaque).

I finished the month with a profit of $359.02. Not too shabby.

Still, it was a financially frustrating month. If you read this space regularly, you know that I play a lot less than I used to (as in a ton less). That means I have much less time to play in cheaper qualifiers, so I pay a lot more to enter the bigger tournaments. That means one or two bad breaks a month and it becomes very hard to show a profit.

This month, I lost in all four attempts at the big Sunday tourney. That was an $860 hole that I dug for myself! In addition, they now have a nice Omaha Hi/Lo on Saturday afternoon (perfect time for me!) that costs $162 to enter. I entered that one twice as well, so that’s nearly $1,200 just to enter these six tournaments.

In the big Sunday one, I just missed this week. I came 109th out of 944. They paid the top 100. I likely could have drifted into the money, and gotten back $300 for my $215. Psychologically, I needed to do that (or rather, could have benefited from a cash), but I knew that playing my JJ was the right thing to do. Amazingly, after a very big stack called my all-in, another guy went all-in (bigger stack than me, but less than the stack that called me!) and he only had AK. He risked his entire tournament on a draw with one person all-in, and a bigger stack calling it.

The guy who called me had 66, so I was ahead of him, and the AK got lucky and hit, so I was out. A week ago, I came 164th out of 901, so I’m getting close, but just not getting there all the way…

That said, in the new Saturday Omaha Hi/Lo, I had very good success this past Saturday. I came in 4th out of 72 players. That was good enough for a $1,500 payday (bringing the month back into the black). That was my single largest cash in online poker. That’s why I say it was a frustrating month, as I needed that large a cash just to put me somewhat ahead for the month.

My other minor frustrations were in some qualifiers. I missed winning seats to the Omaha Hi/Lo and the Sunday tourney by one spot at least three times this month. In one, I was the overwhelming chip leader and when it came down to two people, the other person hit every flop five hands in a row, and there was nothing I could do but come second. The free seat only came to first place… On the flip side, after paying the full freight, that’s the one I made my big cash in, so I can’t complain too loudly…

Anyway, on to another month. :-)