AWS service interruptions raise doubts over reliability

Disruptions within one of Amazon Web Services' cloud regions have hit popular aggregator Reddit, developer platform Heroku and other AWS-dependent sites

Written by Jack Clark, Contributor March 18, 2011 at 6:18 a.m. PT

Service interruptions that hit one region of Amazon's global cloud hindered the services of Reddit, Heroku and other cloud-dependent sites, and have led Reddit to question the reliability of Amazon Web Services.

The period of spotty service occurred intermittently on Thursday and continued into Friday, affecting the Elastic Compute Cloud (EC2) component of Amazon Web Services' (AWS) US east coast region.

AWS service interruptions caused downtime on Reddit. Credit: ZDNet UK

The problems began with increased latencies in Elastic Block Store (EBS) volumes in the region at 2:45am PDT (9:45am GMT). This problem was resolved at 5:20am. At 8:54pm, AWS reported that connectivity problems had hit the region again. At 11:17pm, AWS reported that these had stemmed from a "misbehaving network device" and that network connectivity had since been restored. The problems emerged again at 1:00am on Friday, with high latency and error rates for the EBS APIs. At the time of writing, AWS had not disclosed the source of this problem.

"During this time, customers may have experienced timeouts when performing these calls or when attempting to launch new EC2 instances in the US-EAST-1 region. We are continuing to monitor the service," AWS wrote on its service health dashboard.

Affected sites

Platform-as-a-service product Heroku, which runs on Amazon Web Services, began to be affected by the problems at around 7:53pm PDT on Thursday. It suffered networking problems that led to increased error rates for web requests to its platform and for use of its tools. The disruption lasted for around three hours.

"Many applications continue to function normally, but due to the intermittent nature of the problem, it's hard to state which apps are affected," Heroku wrote at 9:52pm.

Popular aggregator service Reddit, which has reported over one billion page views a month, was also affected by the AWS downtime. The latencies caused a "complete halt" in Reddit servers dependent on AWS, Reddit systems administrator Jason Harvey wrote in a blog post.

The halt led to intermittent downtime for the servers, with the problems resolved by around 4am PDT. However, problems with AWS EBS recurred at 10am, which led to a cascading series of failures within Reddit, due to faulty replication processes across its databases.

"We are still investigating as to why replication failed. All we know is that it definitely broke when the EBS disks on the masters started having issues," Harvey wrote.

A 'constant source of failure'

As a consequence of the downtime, Reddit is considering migrating its core services off AWS EBS and onto storage that is directly attached to its EC2 instances. Reddit has spent the past few weeks migrating one of its core databases, a Cassandra database, onto local storage.

"While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of [AWS] EBS," Harvey wrote.

"Even before the serious outage last night, we suffered random disks degrading multiple times a week. While we do have protections in place to mitigate latency on a small set of disks by using Raid-0 stripes, the frequency of degradation has become highly unpalatable," Harvey wrote.

Former Reddit programmer David King went further in a post about the downtime on a Reddit discussion board on Friday, saying "Amazon's EBS's are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across Reddit".

"Amazon needs to fix these now, or Reddit needs to move off of EC2. Unfortunately, moving off is such a huge project that as under-staffed as Reddit is... it's untenable in the short term to both keep the site up and run projects like migrating datacentres," King, who uses the name 'ketralnis' on the site, wrote.

ZDNet UK has contacted AWS for comment regarding this story, but had not received a response at the time of writing.
