On August 14, the VS Online core Shared Platform Services (SPS) databases became overwhelmed with database updates, which queued up so badly that callers were blocked, explained Harry.
Harry acknowledged that Microsoft officials still aren't sure what specifically triggered the outage. Configuration changes caused a significant increase in traffic between TFS and SPS, and some of that traffic included license-validation checks that were improperly disabled, he said. There was also a simultaneous spike in latencies and in failed Service Bus message deliveries.
Harry listed a few "core causal bugs" the team discovered in its analysis of the outage, including a bug in the Azure portal extension service.
Harry said the team learned a few things from the recent VS Online outage. He admitted candidly:
"So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing. This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests."
He added that Microsoft needs to put infrastructure in place to better measure and flag changes in end-to-end cost, in both test and production, to avoid similar problems in the future.
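The regression-testing idea Harry describes, flagging code changes that measurably increase the resource cost of an operation, can be sketched roughly as follows. This is an illustrative example only, not Microsoft's tooling; the operation name, the baseline values, and the `CallCounter` wrapper are all hypothetical.

```python
# Hypothetical sketch: a regression test that fails when an operation's
# resource cost (here, the number of database calls it makes) exceeds a
# recorded baseline. All names and numbers are illustrative.

BASELINES = {"validate_license": 2}  # max DB calls previously recorded per operation
TOLERANCE = 0                        # allow no silent regression

class CallCounter:
    """Stand-in for a DB connection that counts the queries made through it."""
    def __init__(self):
        self.calls = 0

    def query(self, sql):
        self.calls += 1  # a real version would wrap the actual DB driver
        return []

def validate_license(db):
    # Placeholder operation under test: makes two queries.
    db.query("SELECT ... FROM Licenses")
    db.query("SELECT ... FROM Users")

def check_resource_cost(operation_name, operation):
    """Run the operation against a counting connection and compare to baseline."""
    db = CallCounter()
    operation(db)
    budget = BASELINES[operation_name] + TOLERANCE
    assert db.calls <= budget, (
        f"{operation_name} made {db.calls} DB calls; budget is {budget}"
    )

check_resource_cost("validate_license", validate_license)  # passes at baseline
```

The same pattern can run against production telemetry instead of a test double, which is the second half of Harry's point: synthetic tests alone will never catch everything.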
Harry said the team plans to further analyze the call patterns within SPS and between SPS and SQL to build alerts that catch situations like the August 14 one earlier. The team is also working on partitioning and scaling the SPS Config DB, and possibly building into the service the ability to throttle itself and recover from a slow or failed dependency, among other remedies.
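The "throttle and recover from a slow or failed dependency" remedy resembles the well-known circuit-breaker pattern: after repeated failures calling a dependency, stop calling it for a cooldown period so callers fail fast instead of queuing up behind it. Below is a minimal sketch of that pattern; the thresholds and class are illustrative assumptions, not VS Online's actual implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fails fast after repeated dependency failures.

    Illustrative only; thresholds and behavior are assumptions, not
    the actual VS Online design.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before retrying
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, dependency, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of letting calls queue behind a dead dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = dependency(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

A slow (rather than failed) dependency is handled the same way by treating a timeout as a failure, which is what turns a queuing pile-up like the August 14 incident into fast, bounded errors.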