After lightning downed parts of Amazon's European cloud over the weekend, a fault appeared in the company's storage software that caused the system to accidentally delete customer data.
The software bug began deleting customer data after the outage on Sunday, according to Amazon Web Services (AWS). The cloud services provider was still attempting to recover customer data held in its Elastic Block Store (EBS) on Wednesday, meaning some customers were still experiencing downtime three days after the initial problem.
AWS's rentable computers — known as 'instances' — typically use EBS to store data. The data is placed on hardware separate from that running the instance, and the data is served to the instance via a network connection. The bug lies in the part of EBS that manages stored images of EBS data pools, known as 'snapshots'.
"Independent from the power issue in the affected availability zone, we've discovered an error in the EBS software that cleans up unused [EBS] snapshots," AWS wrote on its status page on Monday. "During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted.
"The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference-counting process. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them," it added.
Since then, AWS has been working to create recovery snapshots for customers to help them resurrect the data volumes. This may not be a foolproof solution, as some of the data in the restored pools of data, or 'volumes', could be inconsistent, the company said. This could cause trouble for applications reliant on the data, it added.
Either way, it will take time for all the affected customers to receive their recovery snapshots, because creating them "requires [AWS] to move and process large amounts of data", Amazon said. This is "why it is taking a long time to complete, particularly for some of the larger volumes. As recovery snapshots become available, customers will see them appear in their accounts", it added.
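The failure AWS describes, in which snapshot references to some blocks were missed during reference counting, can be illustrated with a minimal sketch. This is not AWS's code: the snapshot and block names, the `missed` parameter and both functions are hypothetical, and the model assumes only that a block is deleted once no snapshot appears to reference it.

```python
from collections import Counter


def count_references(snapshots, missed=frozenset()):
    """Count how many snapshots reference each block.

    `missed` is a hypothetical set of (snapshot, block) pairs that the
    counter fails to see, standing in for the software error.
    """
    counts = Counter()
    for snapshot, blocks in snapshots.items():
        for block in blocks:
            if (snapshot, block) in missed:
                continue  # reference overlooked by the faulty counter
            counts[block] += 1
    return counts


def blocks_to_delete(stored_blocks, snapshots, missed=frozenset()):
    """Return the blocks that appear to be unreferenced."""
    counts = count_references(snapshots, missed)
    return {block for block in stored_blocks if counts[block] == 0}


# Illustrative data: two snapshots sharing block b2; b4 is genuinely unused.
snapshots = {"snap-a": {"b1", "b2"}, "snap-b": {"b2", "b3"}}
stored_blocks = {"b1", "b2", "b3", "b4"}

correct = blocks_to_delete(stored_blocks, snapshots)
buggy = blocks_to_delete(stored_blocks, snapshots, missed={("snap-b", "b3")})

print(sorted(correct))  # only the unused block
print(sorted(buggy))    # also includes b3, which snap-b still needs
```

In the faulty run, block b3 looks unused because its only reference was skipped, so it is deleted even though a snapshot still depends on it. A snapshot restored from the remaining blocks would then be incomplete, which is consistent with AWS's warning that recovered volumes could be inconsistent.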
Within Amazon's European region — EU-West — there are three availability zones. Each EBS volume is tied to a specific availability zone and is backed up to several storage devices. While there is redundancy within the zone, if the whole zone goes down, it can take EBS with it.
"I have been concerned [by the EBS problems]," Paul Armstrong, the business systems manager of AWS customer Haven Power, told ZDNet UK. "It has disrupted our service to some extent. It has been quite a long outage, I wouldn't expect that level of outage on any of our other systems."
AWS customers can also store their data in the company's Simple Storage Service (S3). This acts like a tape backup service in that it is good for storing large quantities of information, but is slower to deliver it when needed. However, S3 cannot directly connect to instances, and EBS is typically used as a mediator between the two.
In addition, customers can use 'ephemeral' storage, which is directly attached to the individual instance. Ephemeral data has drawbacks, compared with EBS, because it co-exists with the instance and will disappear if the instance is hit by problems.
EBS has attracted criticism in the past from customers over the quality of service provided, and the service saw failures in March and April that generated sharp responses from some.
"Amazon's EBSes are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across Reddit," a former Reddit programmer wrote in March, after a cascading fail in EBS led to outages at Reddit, Quora and a host of other sites.
Critics also argue that because EBS is a shared storage environment, heavy use by one customer can get all the others on the same server into trouble.
"I've heard complaints about EBS suffering from 'noisy neighbour syndrome' here," Colin Percival, a security researcher and Amazon cloud user, told ZDNet UK. "I don't know if this is a problem with the underlying EBS storage or if it's just the (unavoidable) problem of EC2 nodes hosting multiple EC2 instances, and the EC2 nodes having limited network bandwidth."
EBS's main problem may stem from its lack of redundancy. Ewan Leith, founder of system migration company Nutmeg Data, noted that EBS images are locked into a single availability zone. If problems occur in that zone, the image cannot be moved to a zone in another region.
"When a zone goes down, EBS is almost always the last to be recovered," Leith said.
Amazon has a history of expanding its services to run on multiple availability zones, as it did with its virtual private cloud product on Thursday. However, it has not publicly disclosed any plans to allow a single EBS pool to straddle multiple availability zones.