A Game of Clue: What Killed Skype
Summary: It was a server failure, with a Windows application crash, over a peer-to-peer network.
Days after Skype, the popular Voice-over-Internet-Protocol (VoIP) service, crashed, we finally know why it died. Before launching into what blasted Skype, though, you need to know how Skype works.
You need to keep in mind that Skype is a true peer-to-peer (P2P) network application. Indeed, if you trace back Skype's ancestry you'll find that its developers first cut their teeth on the Kazaa P2P file-sharing program. What's important about that is that Skype, unlike client-server programs, relies on its client PCs to help carry voice communications.
If you're a Skype user, your PC may not be just an ordinary client; it may be working as a Super Node (SN) as well. When you log in to Skype, the odds are you're not logging directly into the Skype login servers but into an SN instead. The SN, in turn, stores your Skype name, your e-mail address, and an encrypted version of your password.
Skype automatically and constantly modifies its network as users go off and on the service. With Skype installed, your PC may be used as a SN and you'll never know it. As a SN, your PC will store the addresses of up to several hundred Skype users. If your PC isn't behind a firewall and/or NAT (Network Address Translation), it may also be used to route calls.
The program is designed so that Skype won't be using your system when you're in the middle of a big project. In addition, even if you're watching for Skype traffic, you're not likely to be able to crack it, since voice traffic is encrypted with 256-bit Advanced Encryption Standard (AES, aka Rijndael).
The idea behind all this is to make Skype extremely scalable without requiring the company to maintain a large (read: expensive) server infrastructure. Technically, this worked well for Skype until December 22, 2010.
Then, as Lars Rabbe, Skype's CIO, explained, bad things started to happen: "A cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. In a version of the Skype for Windows client (version 5.0.0.152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash."
That wasn't the latest version of Skype for Windows, but "around 50% of all Skype users globally were running the 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40% of those clients to fail. These clients included 25-30% of the publicly available supernodes, [which] also failed as a result of this problem."
Those of you who know how cascading failures work can already see where this is going. "Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25-30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes." How big a load on the last standing supernodes? Try "about 100 times what would normally be expected at that time of day."
Whoops.
Well, that worked about as well as you might think it would: "Regrettably, as a result of the confluence of events - server overload, a bug in Skype for Windows clients (version 5.0.0.152), and the decline in available supernodes - Skype's functionality became unavailable to many of our users for approximately 24 hours."
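The percentages Rabbe quotes can be combined in a quick back-of-the-envelope check. The sketch below multiplies the two client figures and then estimates how much extra work each surviving supernode would carry if the lost capacity were simply shared out evenly. That even-sharing model, and the function name `load_multiplier`, are assumptions made for illustration; Skype has not described its load balancing in these terms.

```python
# Back-of-the-envelope check of the figures quoted from Skype's CIO.
# The even-redistribution model is an assumption for illustration only.

share_on_affected_version = 0.50   # "around 50% of all Skype users"
crash_rate_on_that_version = 0.40  # "approximately 40% of those clients"

# Fraction of ALL Skype clients that crashed.
crashed_share = share_on_affected_version * crash_rate_on_that_version


def load_multiplier(fraction_lost: float) -> float:
    """Per-node load factor if the surviving supernodes evenly absorb
    the work of the failed ones (assumed model)."""
    if not 0 <= fraction_lost < 1:
        raise ValueError("fraction_lost must be in [0, 1)")
    return 1.0 / (1.0 - fraction_lost)


low = load_multiplier(0.25)   # 25% of supernodes lost
high = load_multiplier(0.30)  # 30% of supernodes lost

print(f"clients crashed overall: {crashed_share:.0%}")
print(f"load per surviving supernode: {low:.2f}x to {high:.2f}x")
```

Under these assumptions roughly one in five Skype clients went down, and lost capacity alone would raise the load on each survivor by only about a third. That is nowhere near the quoted factor of 100, so the surge must reflect more than missing supernodes, which fits Rabbe's word "disproportionate." The quotes here do not say exactly what drove it.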
To combat this, Skype started adding its own "dedicated supernodes, which we nick-named 'mega-supernodes,' to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud." While Skype claims that officially the problem only lasted about 24 hours, Rabbe admitted that "The supernodes stabilized overnight on Thursday and by Friday, several tens of thousands of supernodes were supporting the P2P network. During Friday, we withdrew a significant proportion of the mega-supernodes from service, leaving some in operation to ensure stability of the P2P network over Christmas and New Year." So, the problem really lasted over 48 hours. Skype seems to be working just fine now.
To prevent this kind of thing from happening again, Skype will be working on improving its Windows client quality assurance. The company is also working on adding to its small number of core servers so that if the P2P side of Skype goes down, there will be a bit more robustness in the service's infrastructure.
I'd also ask Skype to improve its software's automatic update functionality. There's no way that so many old versions of the Windows client software should have been out there when the failure hit. The latest version of Skype, version 5.0.0.156, which proved able to resist the problem, had been released a week before the crash. Almost all Windows users should have been automatically upgraded to it.
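Deciding whether a client is behind the current release comes down to comparing dotted version numbers. Here is a minimal sketch of that comparison; the function names `parse_version` and `update_available` are hypothetical and say nothing about how Skype's own updater works.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '5.0.0.152' into (5, 0, 0, 152)."""
    return tuple(int(part) for part in text.strip().split("."))


def update_available(installed: str, latest: str) -> bool:
    """True when the installed build is older than the latest release."""
    return parse_version(installed) < parse_version(latest)


print(update_available("5.0.0.152", "5.0.0.156"))
print(update_available("5.0.0.156", "5.0.0.156"))
```

Comparing tuples of integers rather than raw strings matters: as text, "5.0.0.99" sorts after "5.0.0.152", which would hide a needed update.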
I've said it before, I'll say it again: People should be forced to upgrade their systems if they're going to be on the Internet. One way to do that is to make sure applications, like Skype, which depend on the Internet, can be automatically updated. Yes, that can be a headache for system administrators, but then so is having out-of-date software on the loose that contributed to taking down an important service.
After all, what would you rather do? Play a game of Clue with what the heck just happened to a major Internet service or deal with automatic push software upgrades? I'd rather deal with the patches myself.
Talkback
Allowing automatic updates opens up the possibility of mass infection
RE: A Game of Clue: What Killed Skype
Steven
What SHOULD Happen . . .
Is that the software should automatically check for updates and then NOTIFY THE USER THAT AN UPDATE IS AVAILABLE.
I checked mine today, and I still had the old software. So I did an auto update, which failed the first time, and succeeded on the second attempt.
I don't think users should be forced to update, but they should at least be notified when an update exists, and given the opportunity to download and install it. I honestly think that most of them don't upgrade (which is what IE6-8 are, not updates) or update simply because they use the default settings, which until recently were, for the most part, set not to auto-update. This means they aren't even aware of the updates, and quite frankly, grandma and grandpa may not even realize that they need to do it.
I choose not to auto-update until I'm sure the update actually works (please reference the recent McAfee disaster, when an important Windows file was misidentified as an infected file and removed, due to an automatic update).
RE: A Game of Clue: What Killed Skype
I agree with JLHenry and masonwheeler.
The first rule of working with computers, mason, is "save early, save often." Losing work by leaving that work unsaved is the user's fault.
I cannot understand how Skype servers became overloaded
:|
You missed the bit about...
RE: A Game of Clue: What Killed Skype
But don't go down that path of blaming it on the OS. It had NOTHING to do with the OS and everything to do with their own buggy, though outdated, Windows client. In the end it still was the Linux servers that went down. "Then, as Lars Rabbe, Skype's CIO explained, bad things started to happen: 'A cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers.'" Those clients were the Skype client for Windows. Nothing to do with the Windows OS. But it all cascaded from the servers being overloaded.
RE: A Game of Clue: What Killed Skype
Compute:
Application Code <> Operating System
Hmmm...
I blamed the client, not Windows. And "in the end" it wouldn't have mattered what the servers were running, they would have been toast.
@Traxxion
Perhaps you could let mr_Spock know that and for that matter smfrazz too.
Does not compute
OK, so it's me who doesn't understand how the cascade propagates :-(
I guess the failed support nodes and clients kept taking down working nodes faster than they were recovering ...
... or is it the case that the redundancy mechanism failed and that a crashing client took down part of the working segment too?
"The latest version of Skype, version 5.0.0.156, which proved able to resist the problem, had been released a week before the crash. Almost all Windows users should have been automatically upgraded to it."
No!
Imagine what would happen after a buggy version was pushed out. (No software quality assurance programme can eliminate ALL bugs.)
A much slower release ... accompanied by monitoring of failures after the same ... would surely be preferable.
Also I am not entirely comfortable with a forced upgrade: I currently value a feature of Photoshop 3 which has been removed from subsequent versions, so I haven't upgraded. In the case of Skype I think they need to make it clear to customers in advance that their machine is part of the network and not entirely their own AND ASK PERMISSION for automatic-ish upgrades. (OK, then if you don't grant permission - you don't get the software!)
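The "much slower release ... accompanied by monitoring of failures" idea above is usually called a staged rollout. A minimal sketch, assuming each client has some stable ID and that crash reports come back as a simple rate, might look like this. All names, the 1% crash threshold, and the 10-point step are hypothetical choices for illustration, not anything Skype has published.

```python
import hashlib


def rollout_bucket(client_id: str) -> int:
    """Map a client ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_offer_update(client_id: str, rollout_percent: int) -> bool:
    """Offer the new build only to clients whose bucket falls inside
    the current rollout percentage."""
    return rollout_bucket(client_id) < rollout_percent


def next_rollout_percent(current: int, crash_rate: float,
                         max_crash_rate: float = 0.01,
                         step: int = 10) -> int:
    """Widen the rollout while the observed crash rate stays acceptable;
    drop back to 0 (halt) if the new build looks bad."""
    if crash_rate > max_crash_rate:
        return 0
    return min(100, current + step)


percent = 10
percent = next_rollout_percent(percent, crash_rate=0.002)  # healthy: widen
print(percent)
percent = next_rollout_percent(percent, crash_rate=0.05)   # buggy: halt
print(percent)
```

Hashing the ID keeps each client in the same bucket from one check to the next, so widening the percentage only ever adds clients, and a buggy build reaches a small slice of users before anyone has to pull it.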
RE: A Game of Clue: What Killed Skype
What kind of idiot company would take down a significant
The lesson here is that Skype seems to know jack sh** about cloud computing. Peer to peer is for rinky dink kids guys...
Auto-updating has to be optional
This was purely a Skype failure, and suggesting that all Internet software should silently update beyond user control to protect us from sloppy companies is not well thought out. It could just as easily have happened in reverse, where a new version was the culprit, and if silent updating had been in place the problem would have been even worse.
I only use Skype when I'm on the road so I'm sure I'm one of the ones with an older version and I have no intention of updating it until the next time I need it.
Autoupdating can really be a headache
The real story here is that Skype can crash your computer
RE: A Game of Clue: What Killed Skype
It always seemed to me there should be a third option between "optional" and "forced / NOW!" ... that being something along these lines:
"Your (insert name of application here) application requires an important and urgent update. This update will demand considerable system resources for anywhere from (min) to (max) minutes/hours, etc. This update is not optional. You do have the option of processing the update now or, if you are in the middle of something critical, delaying the update for up to XX hours (where XX is determined by the application creator)."
Given this sort of option, I could get my most urgent work done, then authorize the update to process when the red hot work was complete.
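The "not optional, but deferrable" policy described above can be sketched in a few lines. The class name `MandatoryUpdate` and the 24-hour limit are hypothetical; the limit stands in for the "XX hours" that the application's creator would choose.

```python
from datetime import datetime, timedelta


class MandatoryUpdate:
    """A non-optional update that the user may postpone for a limited time."""

    def __init__(self, announced_at: datetime, max_delay_hours: int):
        self.announced_at = announced_at
        # Latest moment the update may start, set by the application vendor.
        self.deadline = announced_at + timedelta(hours=max_delay_hours)

    def can_defer(self, now: datetime) -> bool:
        """The user may keep working only until the deadline passes."""
        return now < self.deadline

    def must_install(self, now: datetime) -> bool:
        """Once the deadline is reached, the update can no longer wait."""
        return not self.can_defer(now)


announced = datetime(2010, 12, 22, 9, 0)
update = MandatoryUpdate(announced, max_delay_hours=24)
print(update.can_defer(announced + timedelta(hours=3)))
print(update.must_install(announced + timedelta(hours=25)))
```

The user gets to finish the red hot work first, and the vendor still gets a hard upper bound on how long an old build stays on the network.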
Allow opting out of auto-update, but then disallow SN status
Example of forced update