AWS cloud accidentally deletes customer data

Summary: A software bug caused some customers' data in Amazon Web Services' European cloud to be erased after lightning downed the service on Sunday.

After lightning downed parts of Amazon's European cloud over the weekend, a fault appeared in the company's storage software that caused the system to accidentally delete customer data.

The software bug began deleting customer data after the outage on Sunday, according to Amazon Web Services (AWS). The cloud services provider was still attempting to recover customer data held in its Elastic Block Store (EBS) on Wednesday, meaning some customers were still experiencing downtime three days after the initial problem.

AWS's rentable computers — known as 'instances' — typically use EBS to store data. The data is placed on hardware separate from that running the instance, and the data is served to the instance via a network connection. The bug lies in the part of EBS that manages stored images of EBS data pools, known as 'snapshots'.
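
For illustration, the sketch below shows that arrangement using the modern boto3 Python SDK, which postdates this outage; the instance ID, device name and sizes are hypothetical, and it sketches the general EBS workflow rather than anything AWS has published about this incident.

```python
# Illustrative only: create an EBS volume, attach it to an instance over the
# network, and take a snapshot of it. Identifiers below are made up.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# The volume lives on storage hardware separate from the instance that uses it,
# and is created in one specific availability zone.
volume = ec2.create_volume(Size=100, AvailabilityZone="eu-west-1a")
vol_id = volume["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])

# Attach the volume to a running instance; reads and writes travel over the
# network between the instance and the EBS hardware.
ec2.attach_volume(VolumeId=vol_id,
                  InstanceId="i-0123456789abcdef0",  # hypothetical instance
                  Device="/dev/sdf")

# A snapshot is a stored image of the volume's blocks; snapshots are what the
# faulty clean-up process in EU-West operated on.
ec2.create_snapshot(VolumeId=vol_id, Description="nightly backup")
```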

"Independent from the power issue in the affected availability zone, we've discovered an error in the EBS software that cleans up unused [EBS] snapshots," AWS wrote on its status page on Monday. "During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted.

"The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference-counting process. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them," it added.

Recovery snapshots

Since then, AWS has been working to create recovery snapshots for customers to help them resurrect the data volumes. This may not be a foolproof solution, as some of the data in the restored pools of data, or 'volumes', could be inconsistent, the company said. This could cause trouble for applications reliant on the data, it added.

Either way, it will take time for all the affected customers to receive their recovery snapshots, because creating them "requires [AWS] to move and process large amounts of data", Amazon said. This is "why it is taking a long time to complete, particularly for some of the larger volumes. As recovery snapshots become available, customers will see them appear in their accounts", it added.
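
In practice a customer would see those recovery snapshots alongside their own, and could rebuild a volume from one in an unaffected zone before checking the data for consistency. The sketch below uses the modern boto3 SDK with hypothetical IDs; it is not a procedure AWS published.

```python
# Illustrative only: list the snapshots visible to this account and rebuild
# a volume from one of them in a healthy availability zone.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Recovery snapshots created by AWS would show up here as they become available.
for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
    print(snap["SnapshotId"], snap["StartTime"], snap.get("Description", ""))

# Recreate a volume from a chosen snapshot in an unaffected zone; per AWS's
# warning, the restored data should be checked for consistency before use.
restored = ec2.create_volume(SnapshotId="snap-0abc1234",   # hypothetical ID
                             AvailabilityZone="eu-west-1b")
ec2.get_waiter("volume_available").wait(VolumeIds=[restored["VolumeId"]])
```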

Within Amazon's European region — EU-West — there are three availability zones. Each EBS volume is tied to a specific availability zone and is replicated across several storage devices within it. While this provides redundancy inside the zone, if the whole zone goes down, it can take EBS with it.

"I have been concerned [by the EBS problems]," Paul Armstrong, the business systems manager of AWS customer Haven Power, told ZDNet UK. "It has disrupted our service to some extent. It has been quite a long outage, I wouldn't expect that level of outage on any of our other systems."

AWS customers can also store their data in the company's Simple Storage Service (S3). This acts like a tape backup service in that it is good for storing large quantities of information, but is slower to deliver it when needed. However, S3 cannot be mounted directly by instances, and EBS is typically used as a mediator between the two.
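
Because of that, data held in S3 is usually staged onto an instance's local or EBS-backed filesystem before an application reads it. The snippet below is a minimal sketch with a hypothetical bucket, key and path.

```python
# Illustrative only: copy a backup object out of S3 onto a filesystem the
# instance can read directly (typically an attached EBS volume).
import boto3

s3 = boto3.client("s3")
s3.download_file("example-backups",                # hypothetical bucket
                 "db/dump-2011-08-07.tar.gz",      # hypothetical object key
                 "/mnt/ebs/dump-2011-08-07.tar.gz")
```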

In addition, customers can use 'ephemeral' storage, which is directly attached to the individual instance. Ephemeral data has drawbacks, compared with EBS, because it co-exists with the instance and will disappear if the instance is hit by problems.  

Troubled history

EBS has attracted criticism in the past from customers over the quality of service provided, and the service saw failures in March and April that generated sharp responses from some.

"Amazon's EBSes are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across Reddit," a former Reddit programmer wrote in March, after a cascading fail in EBS led to outages at Reddit, Quora and a host of other sites.

Critics also argue that because EBS is a shared storage environment, heavy use by one customer can degrade performance for all the others on the same server.

"I've heard complaints about EBS suffering from 'noisy neighbour syndrome' here," Colin Percival, a security researcher and Amazon cloud user, told ZDNet UK. "I don't know if this is a problem with the underlying EBS storage or if it's just the (unavoidable) problem of EC2 nodes hosting multiple EC2 instances, and the EC2 nodes having limited network bandwidth."

EBS's main problem may stem from its lack of redundancy. Ewan Leith, founder of system migration company Nutmeg Data, noted that EBS images are locked into a single availability zone. If problems occur in that zone, the image cannot be moved to a zone in another region.

"When a zone goes down, EBS is almost always the last to be recovered," Leith said.

Amazon has a history of expanding its services to run on multiple availability zones, as it did with its virtual private cloud product on Thursday. However, it has not publicly disclosed any plans to allow a single EBS pool to straddle multiple availability zones.



Jack Clark


Talkback

1 comment
  • I'm a customer of AWS who had resources within the affected availability zone. It has taken AWS three days to recover the EBS volumes for two of my servers, one of which I had already managed to recover and move to another AZ. This downtime is a real shame as the power of Amazon's EC2 platform is unparalleled and this downtime really damages the credibility of using AWS in business. Some businesses have been crippled as a result of this downtime and this will be one of their IT horror stories for years to come.

    Amazon have got A LOT of work to do to regain the trust of IT managers. There has to be a level of redundancy in EBS volumes, as right now there is none. The general consensus on the support forums is that you should build into your application the ability to survive the failure of an AZ, but if you aren't running your own applications (like me!) what do you do? All you can do is hope you have working snapshots, and that AWS gets to your EBS volumes first to recover database data.

    AWS EC2 is insanely powerful, and many problems you have with images can be fixed by yourself as an admin without the need to contact AWS, but the level of redundancy your application has is entirely down to how much you want to spend on bringing up instances in multiple Availability Zones. I can imagine Rackspace's cloud pros as well as other cloud vendors are furious as this event damages the feasibility of using the cloud in business as a whole, not just AWS. This isn't really the fault of AWS as it is an act of god, but surely there are ways of defending datacentre backup facilities from thunder?
    anonymous