Skype's outage: Lessons learned

Skype has issued its official response to its nearly two-day outage: a software bug was exposed by a wave of computer restarts following a Microsoft patch download.

Russell Shaw has more, but here's what Skype had to say:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
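To make the chain reaction concrete: when every client restarts and retries its log-in at the same moment, the requests arrive as a single spike. Skype's statement does not say how its clients schedule log-in attempts, so the Python sketch below is purely illustrative. It assumes a hypothetical population of clients that all restart at once, and compares logging in immediately with waiting a random delay (jitter) first. The names `login_times` and `peak_requests_per_second`, and all of the numbers, are made up for the example.

```python
import random
from collections import Counter


def login_times(num_clients, jitter_window_seconds, rng):
    """Return the second at which each client sends its log-in request.

    All clients are assumed to restart at second 0. With a window of 0
    every client logs in immediately; otherwise each one first waits a
    random whole number of seconds below the window.
    """
    if jitter_window_seconds <= 0:
        return [0] * num_clients
    return [rng.randrange(jitter_window_seconds) for _ in range(num_clients)]


def peak_requests_per_second(times):
    """Return the largest number of log-in requests landing in one second."""
    return max(Counter(times).values())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    clients = 10_000        # hypothetical number of clients restarting together

    for window in (0, 60, 600):
        peak = peak_requests_per_second(login_times(clients, window, rng))
        print(f"jitter window {window:>3}s -> peak {peak} log-ins per second")
```

With no jitter the peak equals the number of clients. With jitter it falls to roughly the client count divided by the window length, plus random variation.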

So what are the key lessons here? (Techmeme has more.)

1. Patch management is a headache. As anyone who reads Ryan Naraine knows, patch management is a pain, and a monthly one. You download the patches and applications break. The systems and processes you have in place for managing patches are critical. Ideally, patch management is automated to some degree. And it's not just Microsoft patches: the entire industry patches at about the same time. That means you had better have a strategy to ease the pain.

2. Skype's reputation as a phone replacement took a hit. Skype did a nice job of keeping people updated during the crisis, but its reputation for reliability suffered. The rub: You really have to wonder if Skype can replace your land line. Is that fair? Maybe not, but at last check my plain old telephone wasn't impacted by patches, algorithms and software bugs. The damn thing just works.

3. Peer to peer isn't perfect. Skype noted that it had self-healing functions, but it stumbled. There's a bit of a debate over whether Skype's outage reflects on P2P. Once you delve into the nitty-gritty, Skype's outage may not say much about P2P in general. But because Skype is the poster child of P2P, its outage will hurt perception.

4. Skype's goals are unclear. If Skype is supposed to be a phone service that could replace a land line, this line should probably have been edited:

"This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions."

Two-day interruptions don't fly at the corporate--or even consumer--level.

Topics: Outage, Collaboration, Social Enterprise

Talkback

6 comments
  • I'm reminded...

    ...of the stories about problems with water pressure when massive numbers of people flush their toilets at the same time during the Super Bowl. I have to wonder, though, about the applicability of your item #1. What does patch management have to do with anything? It would seem to me that there could potentially be a similar problem every morning when businesses all log in at about the same time. I guess Microsoft is becoming the convenient excuse, right up there with 'The dog ate my homework'.

    Carl Rapson
    rapson
    • Media (sk)hyping it up...

      I agree with the above poster about the Microsoft thing. I would think a company the size of Skype would have 1) non-production servers to test the patch on and 2) a disaster-recovery plan in effect (I'm sure they're thinking about that now, though!).

      Furthermore, it amazes me how the media is beating the Skype outage to a pulp. I rely on Skype every day, as do dozens of IT professionals I speak to regularly. Not once did I hear how upset they were or how this single outage will change the perception of Skype forever. In that time we were just using cell phones to communicate. Honestly, the only person who complained was my business partner, and it went something like this: "Hey, Skype's not logging on... maybe I should reboot..." "Oh hey, I read online that Skype is having an outage, so it's not going to work." "Oh, OK. That kinda sucks." However, I keep reading on these virtual newsstands that this is possibly the worst thing that has happened in years of Internet history. To top it off, as soon as I see "Windows Patch Caused Crash, Skype Says" on my RSS feed I'm like "oh brother". Let's leave it at "hey, Skype's back up" and not a massive discussion on how to patch your computers.
      rlescaille
    • I don't think they are using Microsoft as an excuse

      based on this statement, where they admit it is a software bug.
      They don't claim it is Microsoft's fault.

      "Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days."
      mrlinux
  • That is lame

    They are using the fact that the patches happen at the same time to cover up the fact that they don't have enough bandwidth or server resources to handle the number of users. That is lame. MS has nothing to do with it; it is Skype's lack of planning and adequate system upgrades that is to blame. If they weren't so set on outsourcing everyone's job, the people who have known about Patch Tuesday for years would not have been let go. It is also clear that Skype has not done proper capacity planning, probably because their business model sucks!!!
    donengene@...
  • Businesses all logging in?

    The capacity issue of business users all logging in at the same time would be somewhat less due to the staggered impact of multiple time zones.
    jcd-zdnet@...
  • Skype outage

    It seems to me that the author is suffering from the typical American NIH syndrome. If it's not created here, then it's no good.
    donesp@...