The detailed blog post noted three issues that led to the December storage outage: some of the nodes did not have node protection turned on; the monitoring system that should have detected this kind of problem had a defect, so no alarms fired and nothing was escalated; and, on top of this, a transition to a new primary node triggered a reaction that incorrectly reformatted other nodes. Normally, according to the post, Azure should have survived the simultaneous failure of two nodes within a stamp, because the system keeps three copies of data spread across three separate fault domains.
"However, the reformatted nodes were spread across all fault domains, which, in some cases, led to all three copies of data becoming unavailable," Neil explained.
The team came up with a Plan A (restoring the data in place) and a Plan B (geo-redundant storage and failover), and decided to go with Plan A. The post also outlined several planned improvements to the Azure storage service:
Geo-Replication for Queue data for Windows Azure Storage accounts
Read-only access to a customer's storage account from the secondary location
Customer-controlled failover for Geo-Replicated Storage Accounts, allowing customers to "prioritize service availability over data durability based on their individual business needs"; this will be provided via a programming interface for triggering failover of a storage account.
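That programming interface did not exist when the post was written; Azure later shipped customer-initiated account failover. Purely as a hedged sketch of what triggering such a failover can look like from code, assuming the modern azure-mgmt-storage Python SDK (the placeholder subscription, group, and account names are mine):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "<resource-group>"    # placeholder
ACCOUNT_NAME = "<storage-account>"     # placeholder

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# begin_failover promotes the secondary region to primary. This trades
# durability for availability: writes not yet geo-replicated at failover
# time may be lost, which is the "prioritize service availability over
# data durability" decision described above.
poller = client.storage_accounts.begin_failover(RESOURCE_GROUP, ACCOUNT_NAME)
poller.wait()  # long-running operation; blocks until the failover completes
```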
In early January, Soluto sent an apology note to its users for the "62 hours of unexpected downtime," during which they were completely unable to reach the company's services.
In that note (a copy of which I saw thanks to Shaun Jennings), Soluto officials said they had chosen Azure over Amazon Web Services because they believed they could build their solution faster on top of it, even though Amazon was the more mature platform.
"In addition, we got lots of help from Microsoft by being added to their BizSpark One program: we got both great pricing and the highest level of support," the note said. Officials said the decision paid off.