AWS Sydney region suffers API 'errors and latencies'

As of Thursday afternoon, the affected services were AppStream 2.0, EC2, ELB, ElastiCache, RDS, WorkSpaces, and Lambda.

Amazon Web Services' (AWS) Asia Pacific (Sydney) Region was experiencing interruptions on Thursday, with the company investigating "increased API error rates and latencies" from around 11:40 am AEDT.

The errors first affected the Amazon Elastic Compute Cloud (EC2) service.


According to AWS' status page, the affected services as of 3:30 pm AEDT were AppStream 2.0, EC2, Elastic Load Balancing (ELB), ElastiCache, Relational Database Service (RDS), WorkSpaces, and Lambda.

At 11:41 am AEDT, in its detail for EC2, AWS said it was investigating increased API error rates and latencies, saying connectivity to existing instances was not impacted.

Less than an hour later, AWS said it had identified the root cause of the issue and that it was "continuing to work towards resolution".

"This issue mainly affects EC2 RunInstances and VPC related API requests," AWS wrote under the detail for the EC2 error. 

"Customer[s] using the EC2 Management Console will also experience error rates for instance and network-related functions. Connectivity to existing instances remains unaffected."

By 2:00 pm AEDT, the issue had spread to seven services, with AWS noting under each that it was "continuing to work towards resolution".

"We can confirm increased API error rates in the AP-SOUTHEAST-2 Region for functions that are configured with VPC settings. Functions that are not configured with VPC settings are unaffected," AWS wrote at 2:17 pm AEDT under its Lambda service status.
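The distinction AWS draws hinges on whether a function's configuration includes VPC settings. A minimal sketch of sorting functions into affected and unaffected groups, assuming configurations shaped like the Lambda `GetFunctionConfiguration`/`ListFunctions` API response (the function names below are illustrative, not from the incident):

```python
def vpc_configured(function_config):
    """Return True if a Lambda function configuration carries VPC settings.

    Non-VPC functions either omit VpcConfig from the API response
    entirely or return it with an empty SubnetIds list.
    """
    vpc = function_config.get("VpcConfig") or {}
    return bool(vpc.get("SubnetIds"))

# Hypothetical configurations mimicking the Lambda API response shape.
functions = [
    {"FunctionName": "orders-api",
     "VpcConfig": {"SubnetIds": ["subnet-0abc"], "SecurityGroupIds": ["sg-1"]}},
    {"FunctionName": "image-resizer", "VpcConfig": {"SubnetIds": []}},
    {"FunctionName": "cron-cleanup"},
]

affected = [f["FunctionName"] for f in functions if vpc_configured(f)]
```

Under this reading, only `orders-api` would have seen the increased error rates; the other two, having no VPC settings, fall into the unaffected group AWS describes.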

At 3:48 pm AEDT, AWS offered further details on the issue causing increased API error rates and latencies.

"A data store used by a subsystem responsible for the configuration of Virtual Private Cloud (VPC) networks is currently offline and the engineering team are working to restore it. While the investigation into the issue was started immediately, it took us longer to understand the full extent of the issue and determine a path to recovery," the company wrote.

"We determined that the data store needed to be restored to a point before the issue began. In order to do this restore, we needed to disable writes."

Error rates and latencies for the networking-related APIs would continue until the restore was completed and writes re-enabled, AWS added. While it is difficult to provide an accurate ETA for issues such as these, the company said, it expected to complete the restore process within the next two hours.

"Connectivity to existing instances is not impacted. Also, launch requests that refer to regional objects like subnets that already exist will succeed at this stage, as they do not depend on the affected subsystem. If you know the subnet ID, you can use that to launch instances within the region," AWS wrote. 

"We apologise for the impact and continue to work towards full resolution."
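In practice, the workaround AWS describes amounts to passing an explicit `SubnetId` to the EC2 `RunInstances` API so the request does not depend on the affected VPC-configuration subsystem. A minimal sketch of building such a request (the AMI and subnet IDs are placeholders, and the boto3 call is shown only as a comment):

```python
def run_instances_params(ami_id, subnet_id, instance_type="t3.micro"):
    """Build parameters for an EC2 RunInstances call that pins a
    pre-existing subnet, sidestepping network-configuration lookups."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "SubnetId": subnet_id,  # existing regional object, known in advance
        "MinCount": 1,
        "MaxCount": 1,
    }

# With boto3, the dict would be passed straight through, e.g.:
#   boto3.client("ec2", region_name="ap-southeast-2").run_instances(**params)
params = run_instances_params("ami-0123456789abcdef0", "subnet-0abc123")
```

Launches that instead rely on default VPC or subnet resolution would still hit the offline data store, which is why AWS singles out requests that name existing regional objects.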

A 5:55 pm AEDT update said AWS had completed the restoration of the affected data store but was still working towards re-enabling writes.

"We have seen an improvement in successful launches over the last 20 minutes and expect that to continue as we work towards full recovery," AWS said.

More to come

Updated 6:00pm AEDT Thursday 23 January 2020: Added further status updates from AWS.
