Rackspace's really bad 36 hours: The Internet is fragile

Rackspace, a Web hosting firm, should adopt that song "I Don't Like Mondays" as its corporate motto. After all, Monday turned out to be horrendous for the company.

That song, a 1979 hit from The Boomtown Rats, sums up Rackspace's last 36 hours. First, Rackspace had a "maintenance failure" at its Dallas data center on Sunday. Then a truck driver hit a transformer feeding power to the Rackspace data center on Monday.

Techmeme has a collage of the individual accounts of the Rackspace fallout. For instance, GigaOm had troubles. The Rackspace team also posted a full account.

Add it up and you have another data center outage that harks back to the 365 Main incident earlier this year. Is the Internet really this fragile? You bet.

Much of the Rackspace debacle gets filed in the "stuff happens" category. Sunday's incident might have been handled better; Monday's was out of Rackspace's control. It turns out the power grid is just as fragile as the Internet.

Here's the chain of events outlined by the company.

The first incident happened Sunday at approximately 4:00 AM CST when a mechanical failure occurred, resulting in a number of customers experiencing intermittent service interruptions. We quickly deployed a team of more than one hundred Rackers to react, diagnose and devise a solution that would get our customers back online as quickly as possible. All affected customers were apprised of the situation and were brought back online.

Rackspace's response in the first incident was in keeping with its "fanatical support" pledge. The company, however, couldn't see the next debacle coming down the pike (literally).

In the second incident at approximately 6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center's chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim. This repeated cycling of the chillers resulted in increasing temperatures within the data center. As a precautionary measure we decided to take some customers' servers offline. These servers are now back up, as are the chillers.

What can you do in that situation? As an aside, I know a former truck driver who did the exact same thing in Virginia. He fell asleep at the wheel, took out a bunch of transformers and blacked out a town. He's lucky to be alive, and no one else was injured. He's also not driving anymore. The point: These accidents happen all the time--we just notice them more when our beloved sites go down. Or when our power disappears.

Rackspace summed up the situation well:

We cannot promise that hardware won't break, that software won't fail or that we will always be perfect. What we can promise is that if something goes wrong we will rise to the occasion, take action, resolve the issue and accept responsibility.

That's about all you can ask.