ACT Emergency Services site restored as AWS fixes Sydney region API errors

Amazon Web Services has restored its Sydney region services after Thursday's "increased API error rates and latencies", rolling the affected data store back to a point before the issues began.
Written by Asha Barbaschow, Contributor

Amazon Web Services (AWS) has recovered from Thursday's service interruptions, which affected its AppStream 2.0, Elastic Compute Cloud (EC2), Elastic Load Balancing (ELB), ElastiCache, Relational Database Service (RDS), WorkSpaces, and Lambda services out of the AWS Asia Pacific (Sydney) Region.

The company on Thursday said it was experiencing interruptions and that it was investigating "increased API error rates and latencies" from around 11:40am AEDT.

Consequences of the issues were felt by customers, including the ACT Emergency Services Agency (ESA), whose website is used by many in the area to stay up to date with the status of emergencies.

Currently, a state of alert is in place in the ACT with heavy smoke "coming and going in the ACT and bushfires currently in the surrounding region and in the ACT".

The agency responsible for emergency management in the ACT on Thursday afternoon said AWS' error was affecting its website.

"The ESA website hosting provider AMAZON has experienced an error which has affected the ESA website and their other clients. We are working to resolve the issue as quickly as possible. Please continue to stay up to date via ESA Facebook and local media," it said in a tweet.

At 6:30pm AEDT Thursday, ESA said its website had been restored.

"On any day, communications can be affected which is why during an emergency we use multiple platforms to share our message and encourage people to keep up to date across a variety of channels," it wrote.

At 6:45pm AEDT Thursday, AWS said that all error rates and latencies had returned to normal levels, that the issues had been resolved, and that its services were operating normally.

In a status message posted shortly after, at 7:30pm AEDT, AWS provided a summary of the issue.

"Starting at 4:07pm PST [11:07am AEDT Thursday], customers began to experience increased error rates and latencies for the network-related APIs in the AP-SOUTHEAST-2 Region," it wrote under the EC2 error log.

"Launches of new EC2 instances also experienced increased failure rates as a result of this issue. Connectivity to existing instances was not affected by this event.

"We immediately began investigating the root cause and identified that the data store used by the subsystem responsible for the Virtual Private Cloud (VPC) regional state was impaired."

See also: Amazon AWS: Complete business guide to the world's largest provider of cloud services

AWS said that while the investigation into the issue kicked off immediately, it took the cloud giant longer to understand the full extent of the issue and determine a path to recovery.

"We determined that the data store needed to be restored to a point before the issue began. We began the data store restoration process, which took a few hours and by 10:50pm PST [5:50pm AEDT Thursday], we had fully restored the primary node in the affected data store," it wrote.

"At this stage, we began to see recovery in instance launches within the AP-SOUTHEAST-2 Region, restoring many customer applications and services to a healthy state."

The company said it continued to bring the data store back to a fully operational state and that by 6:20pm AEDT all API error rates and latencies had fully recovered.

"We apologise for any inconvenience this event may have caused as we know how critical our services are to our customers," AWS said. "We are never satisfied with operational performance of our services that is anything less than perfect, and will do everything we can to learn from this event and drive improvement across our services."

