When a datacenter problem affects the services it provides and the customers it serves, finger-pointing usually follows. Large vendors like Amazon tell customers they need to buy more services to keep the vendor's internal failures from disrupting them, and small vendors blame the suppliers of specific equipment or services. There always seems to be someone else to blame.
That’s what makes Joyent’s ownership of a datacenter-wide failure last week so refreshing. Not only did they take responsibility for a failure caused by human error, they explained in detail why and how it happened and what they are doing to address the problem.
The cause of the outage, which took down the entire datacenter for over half an hour and left some customers without services for over two and a half hours, was simple. A datacenter operator manually issued a command that rebooted every server (their us-east-1 API systems and customer instances) in the datacenter. The obvious question is, of course, why an operator is able to issue a command that reboots everything simultaneously. And if such a command is possible, why aren’t there confirmation steps before the action goes ahead? (Are you sure you want to do this? Really sure? Come on, do you really want this?)
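To make the confirmation idea concrete, here is a minimal sketch of a guard that lets small operations through but demands a typed confirmation once a command would touch a large share of the fleet. It is not Joyent's tooling; the function name, the 10% threshold, and the confirmation phrase are all assumptions chosen for illustration.

```python
def may_proceed(targets, total_hosts, typed_confirmation="", max_fraction=0.10):
    """Hypothetical guard: decide whether a reboot command may run.

    Commands touching at most `max_fraction` of the fleet pass unprompted.
    Anything larger requires the operator to type the exact host count,
    which is harder to do by reflex than pressing "y".
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")

    fraction = len(targets) / total_hosts
    if fraction <= max_fraction:
        return True

    # Assumed confirmation phrase; the operator must state the blast radius.
    expected = f"reboot {len(targets)} hosts"
    return typed_confirmation == expected


fleet = [f"host{i}" for i in range(100)]

print(may_proceed(fleet[:5], len(fleet)))                   # small change, no prompt needed
print(may_proceed(fleet, len(fleet), "yes"))                # whole fleet, reflexive "yes" is refused
print(may_proceed(fleet, len(fleet), "reboot 100 hosts"))   # whole fleet, explicit confirmation
```

The design choice worth noting is that the confirmation repeats the scope of the command back to the operator, so a mistyped target list is visible before anything restarts.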
Joyent's explanation boiled down to this: they need the capability for the automation of their platform and for what they offer their customers, and they are looking at ways to prevent this type of failure in the future. They also took the additional step of explaining why these things matter to their business model and customer base.
If there is one important takeaway that operators of datacenters should consider, I think it’s this: no matter how well you design for potential failures (the Joyent facility is designed as 2F + 1), it won't matter if you turn everything off at the same time.
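The arithmetic behind that point is short. The sketch below assumes the common reading of 2F + 1: a group of 2F + 1 nodes keeps a majority as long as no more than F of them are down. Joyent's actual design is not described here, and the function name is hypothetical.

```python
def has_majority(cluster_size, nodes_down):
    """Return True if the surviving nodes still form a strict majority."""
    majority = cluster_size // 2 + 1
    return cluster_size - nodes_down >= majority


F = 2                      # failures the design is meant to tolerate (assumed value)
cluster_size = 2 * F + 1   # five nodes

print(has_majority(cluster_size, F))             # F nodes down: majority survives
print(has_majority(cluster_size, F + 1))         # one more than F: majority lost
print(has_majority(cluster_size, cluster_size))  # everything rebooted at once: nothing left
```

No choice of F helps in the last case, because a simultaneous reboot takes down all 2F + 1 nodes rather than F of them.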
It didn't matter how well Joyent's infrastructure was designed to mitigate system and service failures. One operator was able to shut down everything with a single command. Consider this when looking at the operation of your own facilities: your single point of failure may not be hardware or service related. It may be just a keystroke away.