AWS service interruptions raise doubts over reliability

Summary: Disruptions within one of Amazon Web Services' cloud regions have hit popular aggregator Reddit, developer platform Heroku and other AWS-dependent sites.

TOPICS: Cloud

Service interruptions that hit one region of Amazon's global cloud hindered the services of Reddit, Heroku and other cloud-dependent sites, and have led Reddit to question the reliability of Amazon Web Services.

The period of spotty service occurred intermittently on Thursday and continued into Friday, affecting the Elastic Compute Cloud (EC2) component of Amazon Web Services' (AWS) US east coast region.

AWS service interruptions caused downtime on Reddit. Credit: ZDNet UK

The problems began with increased latencies in Elastic Block Store (EBS) volumes in the region at 2:45am PDT (9:45am GMT) on Thursday. This problem was resolved at 5:20am. At 8:54pm, AWS reported that connectivity problems had hit the region again. At 11:17pm, AWS reported that these had stemmed from a "misbehaving network device" and that network connectivity had since been restored. The problems emerged again at 1:00am on Friday, with high latency and error rates for the EBS APIs. At the time of writing, AWS had not disclosed the source of this problem.

"During this time, customers may have experienced timeouts when performing these calls or when attempting to launch new EC2 instances in the US-EAST-1 region. We are continuing to monitor the service," AWS wrote on its service health dashboard.

Affected sites

Platform-as-a-service product Heroku, which runs on Amazon Web Services, began to be affected by the problems at around 7:53pm PDT on Thursday. It suffered networking problems that led to increased error rates for web requests to its platform and for use of its tools. The disruption lasted for around three hours.

"Many applications continue to function normally, but due to the intermittent nature of the problem, it's hard to state which apps are affected," Heroku wrote at 9:52pm.

Popular aggregator service Reddit, which has reported over one billion page views a month, was also affected by the AWS downtime. The latencies caused a "complete halt" in Reddit servers dependent on AWS, Reddit systems administrator Jason Harvey wrote in a blog post.

The halt led to intermittent downtime for the servers, with the problems resolved by around 4am PDT. However, problems recurred at 10am in AWS EBS, leading to a cascading series of failures within Reddit caused by faulty replication processes across its databases.

"We are still investigating as to why replication failed. All we know is that it definitely broke when the EBS disks on the masters started having issues," Harvey wrote.

A 'constant source of failure'

As a consequence of the downtime, Reddit is considering migrating its core services off AWS EBS and onto storage that is directly attached to its EC2 instances. Reddit has spent the past few weeks migrating one of its core databases, a Cassandra database, onto local storage.

"While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of [AWS] EBS," Harvey wrote.

"Even before the serious outage last night, we suffered random disks degrading multiple times a week. While we do have protections in place to mitigate latency on a small set of disks by using Raid-0 stripes, the frequency of degradation has become highly unpalatable," Harvey wrote.

Former Reddit programmer David King, posting about the downtime on a Reddit discussion board on Friday, said "Amazon's EBS's are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across Reddit".

"Amazon needs to fix these now, or Reddit needs to move off of EC2. Unfortunately, moving off is such a huge project that as under-staffed as Reddit is... it's untenable in the short term to both keep the site up and run projects like migrating datacentres," King, who uses the name 'ketralnis' on the site, wrote.

ZDNet UK has contacted AWS for comment regarding this story, but had not received a response at the time of writing.


Jack Clark

About Jack Clark

Currently a reporter for ZDNet UK, I previously worked as a technology researcher and reporter for a London-based news agency.

Talkback

6 comments
  • It does not look good when the "special coverage" on "Cloud" that blasts Amazon is done "In Association with Microsoft" and is surrounded by ads for Microsoft's "Cloud Power" marketing campaign. While the issue with Amazon is clearly newsworthy, such blatant conflict of interest with Microsoft as a funding source kind of taints the objectivity of the piece.
    rubyjam
  • @rubyjam
    Thanks for your comment. ZDNet UK is funded through third-party advertising, so you will see ads and sponsorship on some sections. However, those commercial deals have no bearing on editorial decision-making - within the Cloud hot topic, you will see articles covering many vendors and products, and these are editorially selected.
    Tony Hallett, Publisher, ZDNet UK
    tony.hallett@...
  • I would like to make one clarification. When we say "moving to local storage", that does not mean we are moving Cassandra/Postgres off of AWS. We are simply moving to the local storage available to the EC2 instance, instead of the remote SAN-type storage provided by EBS.
    alienth
  • @alienth
    Thanks for your comment. We have updated the story to reflect that Reddit is moving away from AWS EBS to instance-attached storage.
    Jack Clark
  • Just to chime in, our use of Amazon's EBS has been extremely reliable (at elog.com and enlyton.com). For those recommending local instance-store instead, I wonder how you would really make that work. That storage is obliterated if the machine instance goes down. It also has relatively poor performance. Do you keep in sync with S3? Please advise.
    On another note, Amazon is NOT being very open regarding their problem status at http://status.aws.amazon.com/. The problem is not just with "EBS API". We can't launch instances in any way for elog.com. We can't even launch a "micro" instance at this time (9:52am EST) and those instances, by definition, can't even utilize EBS's (for the root).
    littlearth
  • Hello littlearth,
    AWS has just confirmed what you have said. In an update to its status page, it wrote: "There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue."
    Jack Clark