Outages beyond the recent Amazon event

The Amazon outage wasn't the first, nor will it be the last, outage of a major company's IT environment. Few internal outages ever get publicized. Should the focus of this event change?

My good friend Vinnie Mirchandani (see Deal Architect) posted a piece last week about the Amazon outage.

In his short piece, he suggested that the cries for more transparency from Amazon might be better directed at getting more transparency from many other firms, not just cloud providers. Vinnie stated:

"The other thing that may surprise folks is the last time many on-premise data centers ran a full disaster recovery drill. They have their own disasters and plenty of down time - they just are not that public or reported in blogs, newspapers or Twitter."

His timing for this post was eerie for me, as I had had lunch with the top IT exec of a manufacturer the previous day. He recounted the four days he had lost working with a security firm to get numerous desktop computers operational again. Apparently, some nasty virus had snuck into several parts of their enterprise. The IT group spent days rebuilding many machines at several locations. In short, they suffered downtime and lost worker productivity. The good news was that none of their ERP systems were impacted.

Vinnie's right - few outages actually get press coverage. When outages are made public, they are usually on shared systems, like cloud or outsourcing sites. Few companies see any upside in going public when their internal systems fail.

Moreover, from the anecdotal accounts I'm aware of, many (not all) failures involve equipment failures and not some sort of IT malfeasance or neglect. If that's true, then failures can happen to anyone. It's not a phenomenon that only affects a cloud provider like Amazon.

What is worth publicizing is the way that all companies should handle these outages. Let's see a frank assessment of how well each company worked its way through the downtime. Let's see more sharing of best practices and more discussion around the successful remedies that minimized downtime. And, finally, let's have more discussion around how to eliminate these outages altogether.