Skype: we "don't blame Microsoft" for this "perfect storm"

Skype's Villu Arak has just posted further clarifications on an earlier statement about the cause of the recent two-day Skype service interruption. Some had read that statement as saying the outage was precipitated by a surge of Skype reconnects after PCs worldwide rebooted automatically following the installation of Microsoft's monthly patches:

We don’t blame anyone but ourselves. The Microsoft Update patches were merely a catalyst — a trigger — for a series of events that led to the disruption of Skype, not the root cause of it. And Microsoft has been very helpful and supportive throughout.

The high number of post-update reboots affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources at the time, prompted a chain reaction that had a critical impact. The self-healing mechanisms of the P2P network upon which Skype’s software runs have worked well in the past. Simply put, every single time Skype has needed to recover from reboots that naturally accompany a routine Windows Update, there hasn’t been a problem.

Unfortunately, this time, for the first time, Skype was unable to rise to the challenge and the reasons for this were exceptional. In this instance, the day’s Skype traffic patterns, combined with the large number of reboots, revealed a previously unseen fault in the P2P network resource allocation algorithm Skype used. Consequently, the P2P network’s self-healing function didn’t work quickly enough. Skype’s peer-to-peer core was not properly tuned to cope with the load and core size changes that occurred on August 16. The reboots resulting from software patching merely served as a catalyst. This combination of factors created a situation where the self-healing needed outside intervention and assistance by our engineers.

2. What was different about this set of Microsoft update patches?

In short – there was nothing different about this set of Microsoft patches. During a joint call soon after problems were detected, Skype and Microsoft engineers went through the list of patches that had been pushed out. We ruled each one out as a possible cause for Skype’s problems. We also walked through the standard Windows Update process to understand it better and to ensure that nothing in the process had changed from the past (and nothing had). The Microsoft team was fantastic to work with, and after going through the potential causes, it appeared clearer than ever to us that our software’s P2P network management algorithm was not tuned to take into account a combination of high load and supernode rebooting.

3. How come previous Microsoft update patches didn’t cause disruption?

That’s because the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted, other factors of a “perfect storm” had not been present. That is, there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.

4. Has the bug been fixed? Should Skype users worry about future Microsoft Update patches and reboots?

Yes, the bug has been squashed. The parameters of the P2P network have been tuned to be smarter about how similar situations should be handled. Once we found the algorithmic fix to ensure continued operation in the face of high numbers of client reboots, the efforts focused squarely on stabilising the P2P core. The fix means that we’ve tuned Skype’s P2P core so that it can cope with simultaneous P2P network load and core size changes similar to those that occurred on August 16. We’d like to reassure our users across the globe that we’ve done everything we need to do to make sure this doesn’t happen again. We’ve already introduced a number of improvements to our software to ensure our users will not be similarly affected – in the unlikely possibility of this combination of events recurring.
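To make the "perfect storm" argument concrete, here is a toy model of why the same wave of reboots might heal quickly on a quiet day and stall on a busy one. It is only a sketch under invented assumptions, not Skype's actual algorithm, which has not been published. The function name `ticks_to_recover`, its parameters, and all the numbers are hypothetical. The model assumes that rebooted supernodes come back online only as fast as their logins are served, and that logins are served only from capacity left over after normal traffic.

```python
def ticks_to_recover(clients, supernodes, cap, baseline, reboot_fraction,
                     max_ticks=1000):
    """Toy model (hypothetical): ticks until the login backlog clears.

    clients          total number of clients in the network
    supernodes       how many of those clients act as supernodes
    cap              requests one supernode can handle per tick
    baseline         normal (non-login) requests per tick
    reboot_fraction  share of all machines rebooting at once

    Returns the number of ticks needed, or None if recovery stalls.
    """
    online = supernodes * (1 - reboot_fraction)   # supernodes still up
    backlog = clients * reboot_fraction           # logins waiting
    ratio = supernodes / clients                  # share of clients that are supernodes

    for tick in range(1, max_ticks + 1):
        # Only capacity left after normal traffic can serve logins.
        spare = max(0.0, online * cap - baseline)
        served = min(backlog, spare)
        if served == 0:
            return None  # no spare capacity: self-healing cannot start
        backlog -= served
        # Assumed: a matching share of served logins are returning supernodes.
        online += served * ratio
        if backlog <= 1e-9:
            return tick
    return None


if __name__ == "__main__":
    # Same reboot wave, two different traffic levels.
    print("light load:", ticks_to_recover(100_000, 1_000, 100, 10_000, 0.5))
    print("heavy load:", ticks_to_recover(100_000, 1_000, 100, 50_000, 0.5))
```

With these made-up numbers, the light-load case clears its backlog in two ticks, while the heavy-load case never recovers because the surviving supernodes have no spare capacity to readmit the rest. In the model's terms, that second case is the situation Skype describes as needing outside intervention by its engineers.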