Microsoft explains roots of this week's Office 365 downtime

Microsoft officials explain the causes of the back-to-back Lync Online and Exchange Online outages experienced by a number of Office 365 users this week.
Written by Mary Jo Foley, Senior Contributing Editor

It wasn't a good week for a number of Office 365 users in North America.


On Monday, June 23, Lync Online was down for a number of users for several hours. On Tuesday, June 24, Exchange Online issues left some users unable to sign in and/or get their email in a timely manner for most of the day.

In a June 26 blog post, Rajesh Jha, Corporate Vice President of Office 365 Engineering, apologized to customers and explained what happened in Microsoft's North America Office 365 datacenters.

Jha said the back-to-back Lync Online and Exchange Online service issues were "unrelated" to one another.

The Lync Online issue resulted in a number of users being unable to log into Microsoft's Lync Online unified communications service. Microsoft is attributing the inability to connect to "external network failures."

"Even though connectivity was restored in minutes, the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration," Jha explained.

The Exchange Online issue resulted in "prolonged email delays for externally bound email (email coming inside & going outside the company) for some customers," Jha acknowledged. In addition, for "a small subset of customers," Exchange email could not be accessed at all. At the same time, the Service Health Dashboard didn't notify all customers of the service issues, instead indicating that all was well.

"In the case of the Exchange Online issue, the trigger was an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests," Jha said.

Jha maintained that a "small set of customers" lost email access, but their loss of access was "prolonged." However, Jha noted, "the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers."

The team ended up partitioning the mail delivery system away from the failed directory partition and then addressing the root cause of the failed directory partition. Microsoft is "working on further layers of hardening for this pattern," Jha said.

Microsoft still plans to post a "Post-Incident Report" (PIR) in customers' dashboards that will contain a detailed analysis of what happened, how Microsoft responded and what the company will do in the future to prevent similar issues, Jha said.

There's no word so far on what Microsoft is planning to do, if anything, to financially compensate those subscribers affected by this week's Lync Online and Exchange Online issues. I've asked a spokesperson if there's more to come on that front. No word back yet.

Update (June 28): A Microsoft spokesperson sent me the following regarding financial compensation for the outages this week:

"Microsoft guarantees 99.9% uptime as part of the Office 365 SLA (Service Level Agreement), so if it’s determined that the service didn’t meet that bar in a particular month, we’ll work with customers to credit them appropriately. This is on a case by case basis given the impact of service issues can vary among customers."
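For a rough sense of what a 99.9% monthly figure allows, here is a back-of-the-envelope sketch. It assumes uptime is measured as simple wall-clock minutes over a 30-day month such as June; the actual SLA defines downtime and the calculation in its own terms, so treat this as illustrative arithmetic only. The function names and the eight-hour outage are hypothetical, not figures from Microsoft.

```python
def allowed_downtime_minutes(uptime_pct: float, days_in_month: int) -> float:
    """Minutes of downtime a month can absorb while still meeting uptime_pct."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - uptime_pct) / 100


def measured_uptime_pct(downtime_minutes: float, days_in_month: int) -> float:
    """Uptime percentage for a month with the given total downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # A 30-day month has 43,200 minutes; 0.1% of that is about 43 minutes.
    budget = allowed_downtime_minutes(99.9, 30)
    print(f"Downtime budget at 99.9%: {budget:.1f} minutes")

    # Hypothetical outage length, chosen only for illustration.
    hypothetical_outage = 8 * 60
    uptime = measured_uptime_pct(hypothetical_outage, 30)
    print(f"Uptime after an 8-hour outage: {uptime:.2f}%")
```

Under these assumptions, the monthly budget is roughly 43 minutes, so a customer who lost service for several hours would fall below 99.9% for the month.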
