(See update below)
Skype is blaming last week's two-day outage on millions of Windows machines restarting after the installation of Microsoft's security patches.
The massive number of reboots triggered a flood of log-in requests (by default, Skype logs in at reboot), causing "a chain reaction that had a critical impact."
In a note posted on the Skype home page, the eBay-owned company said that the peer-to-peer network that powers the Internet phone service has a self-healing component that failed because of a software bug.
[This] event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk.
The Windows Update explanation seems a bit bizarre. After all, Microsoft has been delivering automatic updates (and simultaneous reboots) every month since 2003. Something still isn't adding up.
[UPDATE: August 21, 2007 @ 10:46 AM] Skype has posted another explanation to clarify the Microsoft Patch Tuesday connection and explain why this never happened before:
2. What was different about this set of Microsoft update patches?
In short – there was nothing different about this set of Microsoft patches. During a joint call soon after problems were detected, Skype and Microsoft engineers went through the list of patches that had been pushed out. We ruled each one out as a possible cause for Skype’s problems. We also walked through the standard Windows Update process to understand it better and to ensure that nothing in the process had changed from the past (and nothing had). The Microsoft team was fantastic to work with, and after going through the potential causes, it appeared clearer than ever to us that our software’s P2P network management algorithm was not tuned to take into account a combination of high load and supernode rebooting.
3. How come previous Microsoft update patches didn’t cause disruption?
That’s because the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted, other factors of a “perfect storm” had not been present. That is, there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.