AWS EC2 North Virginia outage resolves but some issues linger

UPDATE: Signal fell over, while Xero and Nest got a bit iffy, when the main AWS EC2 region had degraded performance. Amazon Web Services says all is well, but some users are still reporting trouble.

Amazon Web Services (AWS) didn't have a relaxing Sunday night before the work week ahead, with its EC2 instances in its main US-EAST-1 region struggling. And, as of Monday morning PDT, some users are still reporting trouble, although the AWS status page now reports, "The issue has been fully resolved and the service is operating normally."

It all began at 20:11 PDT, when the AWS status page announced the platform was suffering from degraded performance in its main availability zone. "Existing EC2 instances within the affected availability zone that use EBS volumes may also experience impairment due to stuck IO to the attached EBS volume(s)," a notice said 30 minutes later.

"Newly launched EC2 instances within the affected availability zone may fail to launch due to the degraded volume performance."

At 21:47 PDT, AWS said the fault lay with Amazon Elastic Block Store (EBS) being overloaded, and that customers should "fail out" to another availability zone.

"We continue to make progress in determining the root cause of the issue causing degraded performance for some EBS volumes in a single availability zone (USE1-AZ2) in the US-EAST-1 region. We have made several changes to address the increased resource contention within the subsystem responsible for coordinating storage hosts with the EBS service," the notice at 22:16 PDT said.

"While these changes have led to some improvement, we have not yet seen full recovery for the affected EBS volumes."

After a further 25 minutes, AWS said its mitigation had worked, that it was in the process of deploying it fully, and that EBS volumes should return to normal within the next hour.

In the final report, at 4:21 AM PDT, AWS reported "the issue was caused by increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts. Engineering worked to identify the root cause and resolve the issue within the affected subsystem. At 11:20 PM PDT, after deploying an update to the affected subsystem, IO performance for the affected EBS volumes began to return to normal levels. By 12:05 AM on September 27th, IO performance for the vast majority of affected EBS volumes in the USE1-AZ2 Availability Zone were operating normally. However, starting at 12:12 AM PDT, we saw recovery slow down for a smaller set of affected EBS volumes as well as seeing degraded performance for a small number of additional volumes in the USE1-AZ2 Availability Zone."

AWS continued, "Engineering investigated the root cause and put in place mitigations to restore performance for the smaller set of remaining affected EBS volumes. These mitigations slowly improved the performance for the remaining smaller set of affected EBS volumes, with full operations restored by 3:45 AM PDT. While almost all of EBS volumes have fully recovered, we continue to work on recovering a remaining small set of EBS volumes. We will communicate the recovery status of these volumes via the Personal Health Dashboard. While the majority of affected services have fully recovered, we continue to recover some services, including RDS databases and Elasticache clusters. We will also communicate the recovery status of these services via the Personal Health Dashboard." 

While AWS was struggling, services that rely on it were hit with performance problems of their own.

"Hold tight, folks! Signal is currently down, due to a hosting outage affecting parts of our service. We're working on bringing it back up," the messaging service tweeted.

Nest said its users had trouble logging in, but the situation was resolved.

At the time of writing, Xero said it was suffering from slowness.

To sum up, as Thaddeus E. Grugq snarkily tweeted, "The internet was designed to survive nuclear wars, not AWS going down."

Update at 10 AM EDT, 27 September: Added further status update.
