A Game of Clue: What Killed Skype

It was a server failure, with a Windows application crash, over a peer-to-peer network.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

Days after Skype, the popular Voice-over-Internet-Protocol (VoIP) service, crashed, we finally know why it was down for so long. Before launching into what blasted Skype, though, you need to know how Skype works.

You need to keep in mind that Skype is a true peer-to-peer (P2P) network application. Indeed, if you trace back Skype's ancestry you'll find that its developers first cut their teeth on the Kazaa P2P file-sharing program. What's important about that is that Skype, unlike client-server programs, relies on its client PCs to help carry voice communications.

If you're a Skype user, your PC may not just be an ordinary client; it may be working as a Super Node (SN) as well. When you log in to Skype, the odds are you're not logging directly into the Skype login servers but into an SN instead. The SN, in turn, stores your Skype name, your e-mail address, and an encrypted version of your password.

Skype automatically and constantly modifies its network as users go off and on the service. With Skype installed, your PC may be used as a SN and you'll never know it. As a SN, your PC will store the addresses of up to several hundred Skype users. If your PC isn't behind a firewall and/or NAT (Network Address Translation), it may also be used to route calls.

The program is designed so that Skype won't be using your system when you're in the middle of a big project. In addition, even if you're watching for Skype traffic, you're not likely to be able to crack it since voice traffic is encrypted with 256-bit Advanced Encryption Standard (AES, aka Rijndael).

The idea behind all this is to make Skype extremely scalable without requiring the company to maintain a large (read: expensive) server infrastructure. Technically, this worked well for Skype until December 22, 2010.

Then, as Lars Rabbe, Skype's CIO, explained, bad things started to happen: "A cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. In a version of the Skype for Windows client (version 5.0.0.152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash."

That wasn't the latest version of Skype for Windows, but "around 50% of all Skype users globally were running the 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40% of those clients to fail. These clients included 25-30% of the publicly available supernodes, [which] also failed as a result of this problem."
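To put those two percentages together, here is a quick back-of-the-envelope calculation. It uses only the figures Skype quoted; the variable names are mine, and since Skype's numbers are approximate ("around," "approximately"), so is the result.

```python
# Figures quoted from Skype's post-mortem; both are approximate.
share_on_affected_version = 0.50   # "around 50%" of all users ran 5.0.0.152
crash_rate_on_that_version = 0.40  # "approximately 40%" of those clients failed

# Share of ALL Skype clients that crashed = 50% of 40%.
crashed_share_of_all_clients = share_on_affected_version * crash_rate_on_that_version

print(f"Roughly {crashed_share_of_all_clients:.0%} of all Skype clients crashed")
```

In other words, roughly one in five Skype clients worldwide went down because of a single Windows build.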

Those of you who know how cascading failures work can already see where this is going. "Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25-30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes." How big a load on the last standing supernodes? Try, "about 100 times what would normally be expected at that time of day."
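It's worth checking how much of that load the missing supernodes alone could explain. The sketch below assumes, purely for illustration, that the same amount of traffic gets spread evenly over whatever supernodes survive. That even-spread model and the function name are my assumptions, not anything Skype described.

```python
def load_multiplier(fraction_lost: float) -> float:
    """Per-node load factor if unchanged traffic is spread evenly over the survivors.

    Hypothetical model: real P2P load balancing is not this tidy.
    """
    if not 0 <= fraction_lost < 1:
        raise ValueError("fraction_lost must be in [0, 1)")
    return 1.0 / (1.0 - fraction_lost)


# Skype's quoted range: 25-30% fewer supernodes than normal.
for lost in (0.25, 0.30):
    print(f"{lost:.0%} of supernodes lost -> "
          f"{load_multiplier(lost):.2f}x load per surviving supernode")
```

Under that assumption, losing 25-30% of the supernodes raises the load on each survivor by only about 1.33 to 1.43 times. That's nowhere near 100 times, so most of the surge must have come from extra traffic on top of the redistribution. The quotes above don't say where that traffic came from.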

Whoops.

Well, that worked about as well as you might think it would: "Regrettably, as a result of the confluence of events - server overload, a bug in Skype for Windows clients (version 5.0.0.152), and the decline in available supernodes - Skype's functionality became unavailable to many of our users for approximately 24 hours."

To combat this, Skype started adding its own "dedicated supernodes, which we nick-named 'mega-supernodes,' to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud." While Skype officially claims that the problem lasted only about 24 hours, Rabbe admitted that "The supernodes stabilized overnight on Thursday and by Friday, several tens of thousands of supernodes were supporting the P2P network. During Friday, we withdrew a significant proportion of the mega-supernodes from service, leaving some in operation to ensure stability of the P2P network over Christmas and New Year." So, the problem really lasted over 48 hours. Skype seems to be working just fine now.

To prevent this kind of thing from happening again, Skype will be working on improving its Windows client quality assurance. The company is also working on adding to its small number of core servers so that if the P2P side of Skype goes down, there will be a bit more robustness in the service's infrastructure.

I'd also ask Skype to improve its automatic software update functionality. There's no way that so many old versions of the Windows client software should have been out there when the failure hit. The latest version of Skype, version 5.0.0.156, which proved able to resist the problem, had been released a week before the crash. Almost all Windows users should have been automatically upgraded to it.
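The check an updater has to make is not complicated. Here's a minimal sketch of comparing the two version numbers mentioned above. The function names are hypothetical and this is not how Skype's updater actually works; it only shows that dotted versions should be compared number by number rather than as plain text.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '5.0.0.152' into (5, 0, 0, 152)."""
    return tuple(int(part) for part in text.split("."))


def needs_update(installed: str, latest: str) -> bool:
    """True if the installed version is older than the latest release."""
    return parse_version(installed) < parse_version(latest)


# The buggy build versus the fixed build named in this article.
print(needs_update("5.0.0.152", "5.0.0.156"))
```

Comparing the parsed numbers matters because a plain string comparison would wrongly rank a build ending in ".9" above one ending in ".10".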

I've said it before, I'll say it again: People should be forced to upgrade their systems if they're going to be on the Internet. One way to do that is to make sure applications, like Skype, which depend on the Internet, can be automatically updated. Yes, that can be a headache for system administrators, but then so is having out-of-date software on the loose that contributed to taking down an important service.

After all, what would you rather do? Play a game of Clue with what the heck just happened to a major Internet service or deal with automatic push software upgrades? I'd rather deal with the patches myself.
