Microsoft explains roots of this week's Office 365 downtime

Summary: Microsoft officials explain the causes of the back-to-back Lync Online and Exchange Online outages experienced by a number of Office 365 users this week.

It wasn't a good week for a number of Office 365 users in North America.

On Monday, June 23, Lync Online was down for a number of users for several hours. On Tuesday, June 24, Exchange Online issues left some users unable to sign in and/or get their email in a timely manner for most of the day.

In a June 26 blog post, Rajesh Jha, Corporate Vice President of Office 365 Engineering, apologized and explained to customers what happened in Microsoft's North American Office 365 datacenters.

Jha said the back-to-back Lync Online and Exchange Online service issues were "unrelated" to one another.

The Lync Online issue resulted in a number of users being unable to log into Microsoft's Lync Online unified communications service. Microsoft is attributing the inability to connect to "external network failures."

"Even though connectivity was restored in minutes, the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration," Jha explained.

The Exchange Online issue resulted in "prolonged email delays for externally bound email (email coming inside & going outside the company) for some customers," Jha acknowledged. Also for "a small subset of customers," Exchange email could not be accessed at all. At the same time, the Service Health Dashboard didn't notify all customers of the service issues, instead indicating that all was well.

"In the case of the Exchange Online issue, the trigger was an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests," Jha said.

Jha maintained that a "small set of customers" lost email access, but their loss of access was "prolonged." However, Jha noted, "the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers."

The team ended up partitioning the mail delivery system away from the failed directory partition and then addressing the root cause of the failure. Microsoft is "working on further layers of hardening for this pattern," Jha said.
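
The fix Jha describes, fencing off a dependency that has started failing so it cannot drag down the broader system, resembles the classic circuit-breaker pattern. A minimal sketch of that idea (hypothetical names and thresholds; this is not Microsoft's actual code):

    import time

    class CircuitBreaker:
        # Stop calling a failing dependency (e.g. one directory partition)
        # after repeated errors, and probe it again only after a cooldown,
        # so callers fail fast rather than stall on an unresponsive partition.
        def __init__(self, failure_threshold=5, cooldown=30.0):
            self.failure_threshold = failure_threshold
            self.cooldown = cooldown
            self.failures = 0
            self.opened_at = None

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.cooldown:
                    raise RuntimeError("circuit open: dependency fenced off")
                self.opened_at = None  # cooldown elapsed; probe again
                self.failures = 0
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()  # open the circuit
                raise
            self.failures = 0
            return result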

Microsoft still plans to post a "Post-Incident Report" (PIR) in customers' dashboards that will contain a detailed analysis of what happened, how Microsoft responded and what the company will do in the future to prevent similar issues, Jha said.

There's no word so far on what Microsoft is planning to do, if anything, to financially compensate those subscribers affected by this week's Lync Online and Exchange Online issues. I've asked a spokesperson if there's more to come on that front. No word back yet.

Update (June 28): A Microsoft spokesperson sent me the following regarding financial compensation for the outages this week:

"Microsoft guarantees 99.9% uptime as part of the Office 365 SLA (Service Level Agreement), so if it’s determined that the service didn’t meet that bar in a particular month, we’ll work with customers to credit them appropriately. This is on a case by case basis given the impact of service issues can vary among customers."
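
For scale, a 99.9 percent monthly uptime commitment leaves room for only about 43 minutes of downtime in a 30-day month, a budget that outages lasting "several hours" or "most of the day" would exhaust on their own. The arithmetic (illustrative only; actual credits depend on how Microsoft measures impact per customer):

    # Downtime budget implied by a 99.9% uptime SLA over a 30-day month
    minutes_per_month = 30 * 24 * 60              # 43,200 minutes
    allowed_downtime = minutes_per_month * (1 - 0.999)
    print(round(allowed_downtime, 1))             # 43.2 minutes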

Topics: Cloud, IT Priorities, Microsoft, Unified Comms, IT Policies

About

Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Talkback

  • uh huh...

    Why does "Even though connectivity was restored in minutes, the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration" sound like BS and a lot of "look over there"?

    A traffic spike by definition is short duration - and that can't cause an extended failure.
    jessepollard
    • It actually makes sense to me

      Connectivity was restored, but now there were a lot of requests in the queue and they caused a traffic spike that the network was not ready for. This caused an overload to the network, so the network components failed -- hence functionality was blocked until the network components could be fixed. (the components were probably software based, though hardware ones could also cause issues).

      In either case, the explanation is plausible.

      I didn't have any issues connecting to Microsoft services during the time frame, but locally we had a Google hangouts/email outage and some other Internet issues.
      grayknight
      • I suppose - you being a fanboi an' all -

        - that you don't think they should have a system that doesn't crash as soon as they fixed the previous crash?

        The point is, many paying customers were denied service for "an extended duration", however much they pass the buck.

        Most companies would just take it on the chin, not try to weasel out.
        Heenan73
        • Though they were just being transparent

          Weaseling out would have been blaming it on something external in my view. He apologizes at least twice in his explanation, and knows his company took a hit on the chin. In a hyper connected world, outages will happen and all you can hope for is everyone on the team following contingency and remediation plans ASAP. Cut the guy some slack.
          Tired Tech
      • An overloaded network doesn't cause "network components" to fail.

        It can't.

        Traffic congestion will slow things down. But that is a fault of poor network design.

        Components just don't "fail" because they are busy.
        jessepollard
        • actually they can

          I've seen it happen twice in the last few years on Cisco routers.

          In both situations a spike in traffic basically triggered a 'self-protection' response in the routers as they perceived the unusual traffic pattern as a potential DDoS attack.

          In both cases it also took the network engineers a couple of hours to sort the routers out.
          aesonaus
          • "self protection" is not a component failure.

            It was designed to detect DDoS attacks....

            That it improperly "detected" an attack is a different issue - like saying "power off if more than xxx amount of traffic arrives".

            Even so - it points to a design flaw in the network.
            jessepollard
    • layperson sez so

      Of course it can. That pattern (small disruption, then acute period of increased traffic to make up for the disruption... causing further problems) is the legitimate reason behind a majority of large-scale issues. If you think it can't happen, just wait until your company has more than 3 people for you to support.
      boobuhbuhloo
      • It does happen with poor design.

        Or just poor implementation.
        jessepollard
    • what's the first thing you do when you can't connect?

      you hit refresh, then try another client, then hit refresh again... connectivity problems are exacerbated by both the displaced original traffic and the extra traffic of people tapping the screen and asking 'is this thing on?'. same thing kept BlackBerry offline several extra days, because their design said 'never throw away an undelivered message'.
      mary.branscombe
  • Epic!

    Now who shall be burned at the stake for this faux pas!

    Choices are:
    1) The hapless users (always a fan favourite)
    2) Some H1B drones that did the code at 1/2 the cost and way less than that in terms of quality
    3) Somebody at a suitable level from Microsoft's product teams
    4) ZDnet luminaries such as Davidson, Cayble, Ye, Owl et al

    Truly, a burning at the stake is necessary at moments like this as both comedic relief and as "encouragement" for those left to do better from now on.

    Mind you, it could have been worse...
    ego.sum.stig
    • A "here's how we screwed up, here's what we going to do better" email...

      ...Is a good thing. It shows that they understand that things screw up (they do - about half my job is unscrewing up unforeseen screw-ups), and understand that doing a good post-mortem analysis and fessing up to their customers is the best path.

      I'd be curious to see how other companies react after things get messed up.
      Flydog57
      • Most come clean ..

        .. and say "We Messed Up".

        Some even add "we'll install systems to avoid a repetition".

        This isn't the first outage the web has ever seen; there are precedents.
        Heenan73
        • Keep on trolling

          that is all you have. scroogle and apple wouldn't even own up, they would just blame the user for using the internet wrong. scroogle (and their fanboys) would blame a vast Microsoft conspiracy to keep their FOSS products from ever working, oh wait, scroogle doesn't really have any of those do they?
          hoppmang
          • Google never has before.

            And they fix what doesn't work, and they improve the design when it is a design fault.

            MS is the one blaming anyone but themselves.
            jessepollard
          • you're right

            Google has never owned up.

            They either scrap the product a few months later saying 'it wasn't that popular anyway' or the product was in beta (Gmail for the first 5+ years) so they don't have to answer for flaws

            And Apple is - 'nothing to see here, move along'.

            At least MS is trying to be open with a response and trying to do right.

            Good luck getting answers from Amazon as well - it'd be more like - 'here's credit for our store, so you just go shopping and don't worry'
            aesonaus
          • Evidently you can't read.

            Too bad.

            Google has owned up several times over failures, and said what they planned to do to prevent it from happening. That is why Google has spread their servers over multiple sites on multiple continents.

            Doesn't mean it won't happen again - but it does mean the failure would be limited in scope.

            And canceling products has nothing to do with it. MS canceled the Zune.

            And totally destroyed Danger with incompetent backups.

            And how about those bricked RT systems? or the security patches that open additional vulnerabilities? or put systems in a permanent reboot cycle? Have they even apologized? How about announcing what they are doing to ensure they don't screw it up again?
            jessepollard
          • Your entire online life is dedicated to AMB, and this post is no different

            You have a very sad life indeed. Oh so sad :)
            toreoasesino
  • Microsoft owes more than an apology

    This is exactly why I won't go the subscription/cloud route. I'm glad that Microsoft apologized, but it's the least they should do. How about Microsoft having the affected businesses tally up their economic losses due to the downtime, and issuing checks to compensate them for their lost productivity?
    preilly2@...