Having been affected by the outage at The Planet, I naturally followed it closely. Based on many telephone calls and chats with staff there, and with people I know personally in the area who have professional relationships with The Planet's staff, here is the best account I can give of what really happened.
What Happened?
At about 17:45 CST on a Saturday, a fire and explosion took down an entire data center in Houston, TX belonging to hosting provider "The Planet". The state-of-the-art facility houses more than 9,000 servers in racks on two floors. It is known for extremely high reliability and high-performance network connectivity to the internet, and as a result a large number of its customers ignored best practices and did not have hot standby servers in other data centers. The facility has standby generators capable of picking up the full load almost instantly, with enough fuel on site to run for well over a week, but they were unable to use those generators because of the nature of the incident.
The initial incident was an explosion and fire that obviously required the immediate evacuation of the entire facility. Initial reports were that a transformer exploded in the power distribution room. It may in fact have been an electrical conduit explosion, and some have suggested that the initial blast ruptured a pressurized fire-suppression system, which did much of the damage. It will be some time before we know for sure what exploded. We do know, however, that the force of the blast blew out three interior walls of the power distribution room, moving them "several feet" from their original position.
Power distribution rooms in large data centers are like the circuit breaker panel at your house, only instead of distributing 20-50 kilowatts to a dozen or so circuits, a data center can be dealing with several megawatts and hundreds or thousands of circuits. Clearly this was not the explosion of a megawatt-sized transformer, as that would have left a crater where that half of the building used to be. It could have been one of several smaller transformers, or, more likely, a conduit explosion.
The damage from the explosion destroyed the connections of nearly every circuit running from the distribution room out to the racks of equipment, the cooling equipment, and everything else. Had they started the generators, they would only have created another dangerous fire.
Even with dozens of electricians at work, and with vendors from the power company to the networking-gear suppliers already keeping offices on site, it was nearly 28 hours before the electrical connections could be made safe and enough equipment replaced for the fire marshal to allow power to be restored. Power came back first on the second floor, which houses about two-thirds of the servers. This is Houston in the summer, so the floor first had to be cooled back to safe temperatures slowly, without cooling so fast that condensation became a problem. Racks could then be started in small groups, because each rack draws its maximum load as its servers restart and its rack-mounted batteries (designed to carry the gear through a transition to generator power) all demand a full charge at once.
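To see why staged startup matters, here is a rough back-of-the-envelope sketch in Python. Every number in it is hypothetical, chosen only to illustrate the inrush problem, not taken from The Planet's facility:

# Back-of-the-envelope illustration of why racks are powered up in stages.
# All figures below are hypothetical, not actual numbers from The Planet.

STEADY_KW_PER_RACK = 4.0      # assumed steady-state draw per rack
STARTUP_FACTOR = 1.8          # assumed surge while servers boot and fans spin up
BATTERY_RECHARGE_KW = 1.5     # assumed extra draw while rack batteries recharge
RACKS = 300                   # assumed racks on the restored floor
FEED_CAPACITY_KW = 2000.0     # assumed capacity of the restored feed

peak_per_rack = STEADY_KW_PER_RACK * STARTUP_FACTOR + BATTERY_RECHARGE_KW

all_at_once = RACKS * peak_per_rack
print(f"All racks at once: {all_at_once:.0f} kW peak "
      f"(feed capacity {FEED_CAPACITY_KW:.0f} kW)")

# Staggered start: only one small group is in its startup surge at any moment,
# while racks already running have settled back to their steady-state draw.
group_size = 20
worst_case_staggered = (group_size * peak_per_rack
                        + (RACKS - group_size) * STEADY_KW_PER_RACK)
print(f"Staggered in groups of {group_size}: {worst_case_staggered:.0f} kW worst case")

With these made-up figures, starting everything at once would demand well over the feed's capacity, while groups of twenty stay comfortably under it; the real numbers differ, but the shape of the problem is the same.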
By midday on Monday, most of the second floor was powered up. The Planet had brought in teams of support people from their Dallas and other Houston data centers to help with the work. Many servers had not been restarted in months or years, and a small percentage did not come back up: drives that had been spinning all that time no longer had motors strong enough to spin back up, and configurations that had changed since the last reboot failed to start properly. All of these issues were handled as well as they could be.
The first-floor servers simply no longer had any connection to the power distribution room. Their power conduits ran through the concrete floor of the building and were no longer anywhere near the walls that used to carry their cabling into the distribution room. By Monday afternoon, temporary circuits had been created, allowing huge cables from outdoor generators to run directly into the building and connect to the first-floor circuits. That is how the floor is running now, and it will continue that way for at least a week while equipment is brought in and the production power distribution is rebuilt from scratch.
Some Lessons Reinforced --
I would say lessons learned, but anyone in our business should know these already.
1. Any data center, no matter how good, is subject to this kind of rare incident. A data center can be blown up, a plane can fall on one, a meteor could hit one, a sinkhole could swallow one, or flying monkeys could carry one away. I have little sympathy for people screaming at tech support staff about losing thousands of dollars an hour. If uptime is that critical, you should have standby in another data center -- possibly with another vendor entirely. For my part, I already have the failover machine in place and am building it out and configuring it today (a minimal sketch for watching the primary follows this list). A bit late, but this can happen again, especially since the data center is currently running on a lot of temporary patchwork, with a lot of remediation work still to be done over the coming weeks.
2. Don't delay your disaster recovery plans. I got caught with my pants down: I had been planning a hot standby server in another data center and was months overdue setting one up. I own that failure, not The Planet. They do own one mistake of their own, however. The Planet had acquired another outfit and was still using the DNS setup the older firm had used, which left them without backup DNS in a second data center for those customers. Combined with customers not arranging their own backup DNS providers (which some services offer for free), this left some customers whose servers sit in other data centers without service. I think that is the only part of this incident where The Planet is at fault. They knew about the issue but had not completed their plans to redistribute that configuration. Like me, they had put off what they knew they had to do, and they got caught. (A quick way to check how widely your own DNS is spread is sketched after this list.)
3. Don't delay implementing your disaster recovery plan in the hope that you'll be back up before you could complete the process. If your changeover takes a long time, start right away. If something brings the primary service back sooner, that's fine; in the meantime, the sooner you start, the better off you are. Often with these incidents, it is many hours before enough facts are known and verified to give an accurate time estimate.
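For lesson 1 (and lesson 3's advice to act early), here is a minimal health-check sketch in Python. The host name, health URL, and thresholds are assumptions for illustration; the script only detects and reports a prolonged failure, because the actual cut-over (DNS change, replication promotion, and so on) depends entirely on your own setup:

#!/usr/bin/env python3
"""Minimal primary-site health check.

Hypothetical sketch: the host name, health URL, and thresholds below are
assumptions for illustration, not anyone's actual configuration.
"""
import time
import urllib.error
import urllib.request

PRIMARY_HEALTH_URL = "https://www.example.com/health"  # hypothetical endpoint
CHECK_INTERVAL = 30        # seconds between probes
FAILURES_BEFORE_ALERT = 3  # avoid reacting to a single dropped request


def primary_is_healthy(url: str, timeout: float = 10.0) -> bool:
    """Return True if the primary answers its health URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def main() -> None:
    consecutive_failures = 0
    while True:
        if primary_is_healthy(PRIMARY_HEALTH_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            print(f"primary check failed ({consecutive_failures} in a row)")
            if consecutive_failures >= FAILURES_BEFORE_ALERT:
                # This is where you would page someone and begin your
                # documented fail-over procedure (lesson 3: start it now).
                print("ALERT: primary looks down; start the fail-over runbook")
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()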
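For lesson 2, one quick way to spot the single-data-center DNS problem is to check where your domain's authoritative name servers actually live. This is a rough sketch using the third-party dnspython package (2.x API); it treats a shared /24 network block as a crude stand-in for "probably the same facility or provider", and the domain is a placeholder:

# Rough check of DNS redundancy using dnspython (pip install dnspython).
# The /24 grouping below is only a crude heuristic for "probably the same
# facility or provider"; the domain name is a placeholder.
import dns.resolver

DOMAIN = "example.com"  # replace with your own zone


def nameserver_addresses(domain):
    """Map each authoritative NS host name to its IPv4 addresses."""
    addresses = {}
    for ns in dns.resolver.resolve(domain, "NS"):
        host = str(ns.target).rstrip(".")
        addresses[host] = [a.address for a in dns.resolver.resolve(host, "A")]
    return addresses


def main():
    by_prefix = {}
    for host, ips in nameserver_addresses(DOMAIN).items():
        for ip in ips:
            prefix = ".".join(ip.split(".")[:3])  # crude /24 grouping
            by_prefix.setdefault(prefix, []).append(f"{host} ({ip})")

    for prefix, servers in sorted(by_prefix.items()):
        print(f"{prefix}.0/24: {', '.join(servers)}")

    if len(by_prefix) < 2:
        print("WARNING: all name servers appear to sit in one network block")
    else:
        print("Name servers are spread across at least two network blocks")


if __name__ == "__main__":
    main()

If everything comes back in one block, your DNS is likely riding in one facility, which is exactly the trap that caught The Planet's acquired customers.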