EC2 outage ends, but what is the long-term impact?
Amazon is promising a detailed post-mortem following the outage of its EC2 platform - but while the incident is likely to make CIOs review their cloud backup plans, it is unlikely to have a long-term impact on the seemingly inexorable rise of cloud computing.
On 21 April, Amazon reported issues with its EC2 cloud service as several centres experienced "connectivity errors".
The problems lasted four days and had a knock-on effect on websites that rely on Amazon's infrastructure, including Reddit and Quora.
By the morning of 25 April, Amazon reported that "the vast majority of volumes have now been recovered". However, it has also conceded that it will not be able to fully recover some of the data entrusted to its EC2 service, and it is in the process of contacting those affected.
The company said it is "digging deeply into the root causes of this event" and will publish a detailed post-mortem. Still, few expect the incident to stop organisations moving to the cloud, given the potential benefits.
A spokesman for Reddit, one of the websites that suffered downtime, told silicon.com that the EC2 outage has "not fundamentally changed our strategy".
He said that Reddit is "now considering other options in hosting providers, such as a private/public hybrid where we have some dedicated machines that share a hosting facility with some shared machines that can be used for overflow."
But despite the outage, the cloud is still considered the best option by Reddit: "In this day and age, unless you are doing something very specialised, it doesn't make sense to own your own hardware - I would rather let someone else bear the cost of spare capacity."
Clive Longbottom, service director at business and IT analyst Quocirca, told silicon.com that the EC2 failures were worrying because EC2 is the major commercial cloud offering.
"If it was something that was a free service that had gone down again then it would be a case of well, you've not paid for it. But this is EC2, it is a commercial cloud offering and it's taken Amazon a long time to sort it out."
He criticised Amazon's handling of the event: "There hasn't been much in the way of statements from them other than 'we have problems and we're trying to sort them out'. No real time scales were put forward that I saw, either."
But he added: "I don't see that there's much option for any of these big companies to go 'we're going to throw all our toys out of the pram, we're going to leave Amazon and we're going to go to somebody else'," simply because "the 'somebody else' options aren't phenomenally big at the moment."
Longbottom said the lack of standardisation within cloud services means that simply moving data and service infrastructure from one provider to another is not a straightforward option.
Indeed, this lack of standardisation needs to be addressed if the cloud computing model is to deliver the flexibility it promises. Otherwise, businesses will find that they are locked in to IT services in the cloud just as they were with internal IT services, he said.
In the meantime, Longbottom believes the clients affected by the EC2 failures are more likely to go back to Amazon and renegotiate their terms of service than attempt to change providers.
Some businesses looking to adopt the cloud computing model may be put off by the problems experienced with EC2, and some will look at the outage and conclude that they are not going to go down the cloud route until they see some best practice. Even so, Longbottom believes that the shift to cloud computing is inevitable.
"But I do feel that cloud has too much going for it not to succeed," he added.
Longbottom said enterprise use of the cloud will shift away from placing all services with one cloud provider, to locating different tasks in different parts of the cloud. This will not only increase the flexibility of cloud computing, but will also allow businesses to separate critical services from non-critical services, and reduce the risks involved.
Above all, however, Longbottom stressed that businesses using cloud services need to plan for service failures themselves.
"An SLA in itself is worth nothing. When something like this happens, you can go back and say 'Well the agreement says that you now have to give us three months of service free of charge - but hang on we nearly went out of business while this was all going on'."
For many businesses, securing services in the cloud may end up costing more than was originally planned, but Longbottom argues that at some point, businesses are going to have to weigh up the risks.
The solution? "Pay the prime provider a lot of money to make sure the data is always available, and then look at failing over from their functional services to somebody else's functional services using the data from the primary provider if all else fails."
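The failover pattern Longbottom describes can be sketched in a few lines of Python. This is only a minimal illustration, not any provider's API: the function names (call_with_failover, primary_service, secondary_service) and the simulated outage are assumptions made for the example.

```python
from typing import Callable, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def call_with_failover(providers: List[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each provider in order and return (provider name, response).

    The first provider that answers wins; errors from earlier providers
    are collected so they can be reported if nothing works.
    """
    errors = []
    for name, handler in providers:
        try:
            return name, handler()
        except Exception as exc:  # any failure triggers failover in this sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for the functional services of two providers.
def primary_service() -> str:
    raise ConnectionError("primary provider unreachable")  # simulated outage


def secondary_service() -> str:
    return "served by secondary"


served_by, response = call_with_failover(
    [("primary", primary_service), ("secondary", secondary_service)]
)
print(served_by, "->", response)
```

In a real deployment the secondary service would also need access to the data held by the primary provider, which is the part Longbottom suggests paying a premium to keep available.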
Indeed, not every customer was crippled by the EC2 outage: Don MacAskill, CEO of SmugMug, wrote a blog post explaining how his photography hosting service survived it.
The key to SmugMug’s success, according to MacAskill, was that the service is "spread across availability zones and designed for failure to begin with".
MacAskill advises that other businesses working in the cloud should ensure that "each component (EC2 for instance), should be able to die without affecting the whole system as much as possible".
Furthermore, according to MacAskill, businesses that separate their systems into components that can be taken offline individually do not run the risk of having to take down a whole website to fix one small problem.
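The two ideas MacAskill describes, spreading components across availability zones and letting any single component go offline without taking down the whole site, can be illustrated with a small Python sketch. It is not SmugMug's implementation: the Replica class, the component names and the zone labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    component: str       # e.g. "web" or "thumbnails" (hypothetical names)
    zone: str            # availability zone label (hypothetical)
    healthy: bool = True


def pick_replica(replicas: List[Replica], component: str) -> Optional[Replica]:
    """Return the first healthy replica of a component, or None if none is left."""
    for replica in replicas:
        if replica.component == component and replica.healthy:
            return replica
    return None


def fail_zone(replicas: List[Replica], zone: str) -> None:
    """Simulate losing an entire availability zone."""
    for replica in replicas:
        if replica.zone == zone:
            replica.healthy = False


def take_offline(replicas: List[Replica], component: str) -> None:
    """Take one component offline everywhere, for example for maintenance."""
    for replica in replicas:
        if replica.component == component:
            replica.healthy = False


# Every component runs in two zones, so losing one zone is survivable.
replicas = [
    Replica("web", "zone-a"),
    Replica("web", "zone-b"),
    Replica("thumbnails", "zone-a"),
    Replica("thumbnails", "zone-b"),
]

fail_zone(replicas, "zone-a")
web_survivor = pick_replica(replicas, "web")  # the zone-b replica still answers

# One component can be taken down without touching the rest of the site.
take_offline(replicas, "thumbnails")
thumbnails_replica = pick_replica(replicas, "thumbnails")
web_still_up = pick_replica(replicas, "web") is not None
```

The caller is expected to handle a missing replica by degrading gracefully, for instance by showing a placeholder instead of a thumbnail, rather than failing the whole page.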
MacAskill also wrote in his blog post that SmugMug is not yet 100 per cent cloud. "The lack of performant, predictable cloud database at our scale has kept us from going [to the cloud] 100 per cent".