Windows Azure suffers worldwide outage

Microsoft is trying to fix a global outage of its Windows Azure cloud.
Written by Jack Clark, Contributor

A component of Windows Azure has experienced a worldwide outage for the past eight hours, preventing customers from carrying out management operations for technology that uses the cloud management service.

The worldwide outage of the Windows Azure Service Management technology began at 1:45am GMT on Wednesday (5:45pm PT Tuesday) and, at the time of writing, Microsoft was in the process of rolling out a fix to deal with the problem.

Windows Azure Service Management lets customers manage their deployments, hosted services and storage accounts on the cloud platform-as-a-service.

The most recent update by Microsoft on the Windows Azure status page said it had broadened a hotfix patch to cover all sub-regions.

"As we proceed through the rollout, we will progressively enable service management back for customers," the company wrote at 10:30am. "Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers."

At the time of writing Microsoft had not responded to requests for further information on the outage.

Update: 12:02pm GMT/4:02am PT

After rolling out the hotfix to deal with the Windows Azure Service Management fault, Microsoft began reporting problems with Windows Azure Compute across North America.

"Incoming traffic may not go through for a subset of hosted services in this sub-region," the company wrote at 10:55am GMT (2:55am PT). It assured customers that applications will continue to run and there is no impact on storage.

Update: 12:36pm GMT/4:36am PT

ZDNet UK is now unable to access the Windows Azure Service Status dashboard.

Update: 1:54pm GMT/5:54am PT

ZDNet UK continues to have trouble accessing the service dashboard. Microsoft has acknowledged requests for further information but has not responded to questions.

Update: 2:20pm GMT/6:20am PT

The service management issue has been "mitigated" and service has been brought back for the majority of customers, Microsoft wrote on its periodically inaccessible service dashboard at 1:30pm GMT. However, "We still need to work through some issues before we can completely restore service management."

Meanwhile, the company continues to investigate the Azure Compute issue and is in the process of "verifying the most probable cause," it wrote at 1:30pm GMT (5:30am PT).

Update: 4:00pm GMT/8:00am PT

Microsoft has started trying to restore Azure Compute service in the North Europe, North Central US and South Central US regions, the company said at 3:30pm, without giving details on what caused the fault.

According to a Microsoft dashboard update at 2:30pm, around 37 percent of Azure Compute services in North Europe, 6.7 percent in North Central US and 28 percent in South Central US were affected by the problems.

Service Management is still unavailable for some customers in the North Central US, South Central US and North Europe regions, Microsoft wrote at 3:30pm.

At some point during the day the Access Control 2.0 service experienced an outage in the North Europe and South Central US regions. A dashboard post at 11:15am said customers were not able to access their ACS namespaces. The post has not been updated since then.

At the time of writing Microsoft had still not been able to raise anyone for ZDNet UK to speak to about the outages.

Update: 4:58pm GMT/8:58am PT

Microsoft has acknowledged the issue and admitted that some customers in the North Central US, South Central US and North Europe remain affected.

"On February 28th, 2012 at 5:45 PM PST Microsoft became aware of an issue impacting Windows Azure service management in a number of regions. Windows Azure engineering teams developed, validated and deployed a fix that resolved the issue for the majority of our customers," a Microsoft spokesperson said in a statement delivered to ZDNet UK at 4:55pm, over 12 hours after problems started being flagged on the service dashboard. "Some customers in 3 sub regions (North Central US, South Central and North Europe) remain affected. Engineering teams are actively working to resolve the issue as soon as possible. We will update the Service Dashboard hourly until this incident is resolved."

Update: 7:05pm GMT/9:05am PT

One ZDNet UK reader called in to give some details on how the Azure problems affected their business.

"Our live site's been down all day now, so we've been losing money. The address it's on is not resolving, you can't even ping it," Ashley Rudland, who runs a startup travel site hosted on Azure named worldcitycard.com, told ZDNet UK. "Everything I've been told is that the management portal is the only thing that's gone down, but the thing is I can go in and see my machines are running in the cloud, they all say they are ready and green and fine, but they're completely inaccessible."

Rudland said he shares an office with a major Microsoft partner that does cloud integration for the public sector and large companies. "Their biggest clients are all offline," he said. "It's ridiculous, it's crazy, you need so much confidence to use these cloud services."

Earlier in the day the Government's G-Cloud application store, CloudStore, was also down due to the Azure problems, though it was brought back online in the late afternoon.

Update: 7:50pm GMT/11am PT

Faults spread across Azure as Microsoft tries to restore service

Failures are propagating across the Azure cloud in America and Northern Europe as Microsoft tries to get its cloud back online.

As Microsoft attempts to restore Compute functionality in the North Central US, South Central US and North Europe regions, functionality has been degraded or lost on a range of Azure services. These include the Windows Azure Marketplace Datamarket in South Central US; SQL Reporting in North Europe; SQL Azure Data Sync across the East Asia, North Central US, North Europe, South Central US, Southeast Asia and West Europe regions; the Service Bus, Access Control & Caching Portal worldwide; and the Service Bus in South Central US. There are also continuing problems with Access Control 2.0 across the North Europe and South Central US regions and with Access Control in the South Central US region.

Strangely, the faults for the above services are reported on Microsoft's flaky Windows Azure Service Dashboard as all having occurred earlier in the day. ZDNet UK has been checking the dashboard since the problems first emerged and can attest to readers that Microsoft is posting fault information about Azure services retroactively, not as the faults happen.

Microsoft is keeping quiet on the entire thing. "There will not be any briefings around this issue," one of Microsoft's PR companies informed ZDNet UK on Wednesday afternoon.

This story originally appeared on ZDNet UK.
