Tuesday, October 9, 2007

Adventures in Disaster Recovery

This past Sunday morning, the power went off at the building where I work. A transformer blew up or something like that. CenterPoint, which was responsible for whatever caused the problem, restored power by Sunday evening, but apparently there were other problems, because the power failed again on Monday morning and stayed off until around 2:00 PM. We do have a machine room in that building, but most production boxes are located in a data center off site. Furthermore, we have a diesel-powered backup generator, and each of the machines in the machine room is plugged into one of two giant UPS boxes to keep it up while the backup generator fires up. In addition, all support personnel are well accustomed to working from home. Everything should have been okay, right?

Wrong. Virtually everyone working from home connects remotely to a desktop at work, and only two out of maybe two hundred desktops were plugged into the circuit served by the backup generator. There is a set of machines that can be accessed even in the absence of desktops, and the idea is that one can connect from these machines to whatever production boxes need to be monitored. The problem with that, though, is that the only terminal software installed on these machines speaks nothing but telnet, which has been disabled on all production boxes. That means that all support for most of Sunday and most of Monday had to be performed from one of the two desktops that happened to be on the backup generator's circuit. To top it all off, one of the UPS boxes failed on Monday morning, taking down half of the machines in the machine room and sowing chaos throughout our network. Until the UPS can be replaced, the backup generator has to be kept running to prevent power surges that could kill the boxes that were plugged into the failed UPS. Sounds like perfect disaster recovery planning, doesn't it?

1 comment:

Sara said...

sounds like you're having fun.