Microsoft: What caused one of the worst Visual Studio Online outages ever

Summary: What caused a five-hour-plus outage in Microsoft's Visual Studio Online service earlier this month? Microsoft officials share the details.

Microsoft's Visual Studio Online service was down for more than five hours on August 14. What went wrong?

In an August 22 blog post, Technical Fellow and Product Unit Manager Brian Harry detailed the causes of what he described as "one of the worst incidents we've ever had on VS Online."

Visual Studio Online is Team Foundation Server plus a few other related services running on Azure.

On August 14, the VS Online core Shared Platform Services (SPS) databases became overwhelmed with database updates that queued up so badly that callers were blocked, explained Harry.
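
To make that failure mode concrete, here is a minimal, hypothetical Python sketch (none of this is VS Online's actual code) of how a saturated update queue ends up blocking callers rather than failing fast:

```python
import queue
import threading
import time

# Hypothetical illustration of the failure mode described above: a bounded
# update queue in front of a database. Once the queue saturates, put()
# blocks, so every new caller stalls behind the backlog.
update_queue = queue.Queue(maxsize=100)

def handle_caller(update):
    # No timeout here: when the queue is full, the caller simply waits,
    # which is how queued-up updates end up blocking callers.
    update_queue.put(update)
    return "accepted"

def db_writer():
    while True:
        update = update_queue.get()
        time.sleep(0.05)  # simulate a slow database write
        update_queue.task_done()

threading.Thread(target=db_writer, daemon=True).start()
```

Using update_queue.put(update, timeout=1) instead would raise queue.Full and surface the backlog as an error rather than an ever-growing wait.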

Harry acknowledged that Microsoft officials still aren't sure what specifically triggered the outage. Some configuration changes caused a significant increase in traffic between TFS and SPS, and some of that traffic included license-validation checks that were improperly disabled, he said. There was also a simultaneous spike in latencies and failed Service Bus message deliveries.

Harry listed a few "core causal bugs" the team discovered in its analysis of the outage, including a bug in the Azure portal extension service.

Harry said the team learned a few things from the recent VS Online outage. He admitted candidly:

"So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing. This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests."

He added that Microsoft needs to put in place some infrastructure to better measure and flag changes in end-to-end cost in both test and production to avoid similar problems in the future.
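
Harry didn't describe what that infrastructure will look like, but the general shape of a resource-cost regression test can be sketched in a few lines. Everything below (FakeDb, CountingDb, create_work_item, BASELINE_QUERIES) is hypothetical and only illustrates the idea of failing a build when an operation's database cost grows past a recorded baseline:

```python
class FakeDb:
    """Stand-in for a real database client (hypothetical)."""
    def execute(self, sql, *args):
        return []

class CountingDb:
    """Wraps a database client and counts round trips."""
    def __init__(self, real_db):
        self._real_db = real_db
        self.query_count = 0

    def execute(self, sql, *args):
        self.query_count += 1
        return self._real_db.execute(sql, *args)

def create_work_item(db, title):
    # Hypothetical operation under test; imagine it hides several layers of
    # abstraction, each issuing its own queries.
    db.execute("INSERT INTO WorkItems (Title) VALUES (?)", title)
    db.execute("SELECT LicenseState FROM Users WHERE Id = ?", 1)

BASELINE_QUERIES = 2  # cost recorded when the operation was last reviewed

def test_create_work_item_query_budget():
    db = CountingDb(FakeDb())
    create_work_item(db, title="example")
    assert db.query_count <= BASELINE_QUERIES, (
        f"create_work_item now issues {db.query_count} queries "
        f"(budget {BASELINE_QUERIES}); review the change's resource cost"
    )

test_create_work_item_query_budget()
```

In production the same idea would rest on telemetry rather than a test double, since, as Harry notes, some regressions only show up under real traffic.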

Harry said the team plans to further analyze the call patterns within SPS and between SPS and SQL so it can build alerts that catch situations like the August 14 one earlier. The team also is working on partitioning and scaling the SPS Config DB and on building the service's ability to throttle itself and recover from a slow or failed dependency, among other remedies.
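
Harry didn't publish a design for that throttling work, but the "recover from a slow or failed dependency" idea is commonly implemented as a circuit breaker. The sketch below is an assumption about the general pattern, not Microsoft's implementation:

```python
import time

class CircuitBreaker:
    """Hypothetical sketch: stop calling a dependency after repeated
    failures and retry only after a cool-down, so a slow or failed
    downstream service can't pile up blocked callers."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency unavailable")
            # Cool-down elapsed: close the circuit and try the dependency again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Each dependency call (an SPS-to-SQL query, say) would go through breaker.call(...), so repeated failures trip the breaker and later callers fail fast instead of queuing up behind a dead connection.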

Topics: Cloud, Microsoft, Software Development

About

Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Talkback

  • Microsoft layoffs

    So, about all those testers getting fired? They still think it's a good idea?
    andradedearthur
    • This has ...

      ... NOTHING to do with those testers being laid off.
      bitcrazed
      • What I meant to say is that...

        ... even with all these testers on their teams, they are still talking about the need to do more testing. When they say something like this, firing testers doesn't make much sense to me.
        andradedearthur
        • Layoffs Where

          on different teams.
          rmark@...
  • Transparency

    The MS response actually strikes me as very good. They seem like they are being relatively transparent about what they know and don't know and have some good ideas for how to prevent not just this outage, but similar outages as well. They looked at the systems in place, found them wanting and plan to make systemic changes.

    Obviously it would be better if there were never any problems, but with complex systems that seems unrealistic. If you can't be perfect, at least put good systems in place to deal with problems when they arise.
    DWAnderson
  • Here's a great take away from Brian's comments

    For anyone considering the cloud, here's a great takeaway from Brian's brutally honest comments:

    "Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations."

    That part about sufficient visibility across layers of software abstraction and not being able to measure the end-to-end resource cost of changes is huge! In the cloud, resource cost is paramount to the entire value proposition of moving to the cloud. I don't think the problem here was that the team got lazy; someone implemented a change without understanding the resource implications of that change, primarily because those details are abstracted away from them. Teams do this every day, and it typically ends in shock at your monthly billing statement, but for a high-profile service like VS Online, everyone knows your mistakes. Welcome to the cloud, folks!
    windowseat