Skype blames Patch Tuesday PC reboots for outage

Summary: Skype is blaming last week's two-day outage on millions of Windows machines restarting after the installation of Microsoft's security patches.


(See update below) 

Skype is blaming last week's two-day outage on millions of Windows machines restarting after the installation of Microsoft's security patches.

The massive number of reboots caused a flood of log-in requests (the Skype default is to log in at reboot), causing "a chain reaction that had a critical impact."

In a note posted on the Skype home page, the eBay-owned company said that the peer-to-peer network that powers the Internet phone service has a self-healing component that failed because of a software bug.

[This] event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.

The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk.

The Windows Update explanation seems a bit bizarre. After all, Microsoft has been delivering automatic updates (and simultaneous reboots) every month since 2003. Something still isn't adding up.

[UPDATE: August 21, 2007 @ 10:46 AM] Skype has posted another explanation to clarify the Microsoft Patch Tuesday connection and explain why this never happened before:

2. What was different about this set of Microsoft update patches?

In short – there was nothing different about this set of Microsoft patches. During a joint call soon after problems were detected, Skype and Microsoft engineers went through the list of patches that had been pushed out. We ruled each one out as a possible cause for Skype’s problems. We also walked through the standard Windows Update process to understand it better and to ensure that nothing in the process had changed from the past (and nothing had). The Microsoft team was fantastic to work with, and after going through the potential causes, it appeared clearer than ever to us that our software’s P2P network management algorithm was not tuned to take into account a combination of high load and supernode rebooting.

3. How come previous Microsoft update patches didn’t cause disruption?

That’s because the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted, other factors of a “perfect storm” had not been present. That is, there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.
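As a rough illustration of the "perfect storm" Skype describes, here is a toy capacity model. It is only a sketch: the supernode counts, the per-node capacity, and the load figures are all made-up numbers, and the functions are hypothetical, not anything from Skype's actual network management algorithm.

```python
from dataclasses import dataclass

# Hypothetical figure: requests one supernode can serve per interval.
PER_NODE_CAPACITY = 100


@dataclass
class Scenario:
    name: str
    supernodes: int  # total supernodes in the toy network (assumed)
    rebooting: int   # supernodes temporarily offline (assumed)
    load: int        # requests arriving in the same interval (assumed)


def spare_capacity(s: Scenario) -> int:
    """Capacity left over; a negative value means the network is overloaded."""
    online = s.supernodes - s.rebooting
    return online * PER_NODE_CAPACITY - s.load


def copes(s: Scenario) -> bool:
    """True when the supernodes still online can absorb the load."""
    return spare_capacity(s) >= 0


SCENARIOS = [
    Scenario("reboots, normal load", 1000, 400, 50_000),
    Scenario("high load, no reboots", 1000, 0, 90_000),
    Scenario("reboots and high load", 1000, 400, 90_000),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        status = "copes" if copes(s) else "overloaded"
        print(f"{s.name}: spare capacity {spare_capacity(s)} ({status})")
```

In this toy model either factor alone leaves headroom, and only the combination pushes spare capacity below zero. It says nothing about why recovery took two days, which Skype attributes to the bug in its resource allocation algorithm.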

Topics: Social Enterprise, Collaboration, Outage

  • Doesn't ring true

    Whilst I could happily accept that millions of machines attempting to login to the service simultaneously may have had the effect of preventing some of those login attempts from being successful I can't buy it as an excuse for a two day outage.

    Surely if this was the simple cause then there would have been a slim chance that each individual login attempt would be processed by the server. This chance should increase with each subsequent successful attempt until service resumed as normal.

    I can't believe that the login process is so much more demanding than the polling that occurs when the app is logged in that it could take two days to clear the backlog of requests.

    I agree with you Ryan, this just doesn't add up.
    • Yes it does!

      Most computer owners are too lazy to change the default settings for auto-update. Windows updates are phased in, meaning not everyone sees the updates as available before the 3AM auto-check and reboot. Haven't changed the 3AM default? Neither have millions of others. Skype users participate as nodes on the P2P network, but only if and when they meet a reliability bar, say, up and available for a few hours. Get enough clients to reboot and none can meet the bar. Suddenly a large number of clients can't become service nodes. Failure to code for this scenario can lead to issues. A more important question to ask is why base a business model on resources you don't own and can't control?
  • I smell something rotten here.

    Ummm, log on attempts kept the entire system down for two days? Don't think so...
    • Skype is lying.

      Yea, especially given their blog instructed users to keep trying at login. Why ask for more of the problem?
    • Why not?

      I've seen it happen before. Granted, that was in the NT 4.0 days, but a flood of logon attempts can knock over a server. The question is this, though: does Skype have that crappy of a network setup? Do they have one DC for 50,000 workstations or something?
      • Umm, because it was two days later...

        Assuming we are talking about Automatic Updates then that all happened Tuesday. Skype went down Thursday.
    • I hate to say this

      but I kind of agree with you Axey. What I also have to wonder is why they didn't have any backup plans or had all of their systems set to auto login like that.
      Or if they had a backup plan it sure as heck didn't seem like it.
    • imPOSSIBLE that there was a problem related to MS

      makes perfect sense.

      MS makes trash software. Skype probably uses it!


      NATG defending MS.

      What a hoot!
  • BS!

    If everyone was patching as soon as updates were made available, there wouldn't be these ongoing security issues. Time zone variation alone is pointing to Skype not being honest about the outage.
    Uber Dweeb
  • skype better wake up fast

    It is very possible that this great system has a few bad pieces that show up under critical loads.

    Every company has minor flaws in its code that end up causing massive problems, even the greats: MS, Google, Oracle.

    In the dev cycle it is usually thought of as that one area we don't have time for and will get back to, and by the end of the development cycle there are wars going on, everybody hates one another, and there is no time left to go back and fix things that were not considered important in the first place.

    It is just that sometimes you don't want to over-engineer and make a problem more complicated than it is. For Skype the important thing is for this not to happen again. This happening once is understandable to any developer who has worked in a commercial world where schedules are tight and deadlines must be met.

    Skype has had a couple of mishaps recently, and this better not continue or they will get a reputation for being unreliable.
  • Ummm..

    Thursday outages? Patch Tuesday? Since when did Tuesday reboots cause a Thursday log-in overload?

    Something already smells funny.

    Add to that the likelihood that "millions of users" are simultaneously rebooting their systems, and they all have the same connection speed, so they are all logging back on at the same time?

    Now it's starting to straight up smell.
  • Sounds like the citrix black hole effect.

  • About that swampland in Florida you wanted to buy... (NT)

  • Skype outage

    I can appreciate their having problems with Microsoft updates. Almost every time I do an update, I have to spend days fixing all the problems. (Someone with good computer skills could probably do the fix in several hours, but I don't fit that category of skills.)
    I hate to get an update from most software houses because it causes me so much trouble. QuickBooks is another one that kills me on most updates.
    Why can't they just do it right the first time and not have so many problems? If we had to update and fix our cars or airplanes that often it would be a mess.
    • Updates

      I have been using Microsoft updates with XP for almost three years and have never had a problem, have I been lucky?
      • not lucky at all...

        First you are running XP. Okay, bad joke but true. That is by design and that is the common experience. Some people have persistent issues that cause all subsequent servicing issues; if the above user does not get a root cause analysis done, they will continue to have issues.

        As for the "get it right the first time": I would love to see MS do that, but then again I do not want to pay $30,000 for an OS. As for "getting it right the first time", my last new car had a recall on the accelerator, which caused a decent number of accidents. The correctness is proportionally related to the risk involved - my computer crashes, mere irritation and perhaps important missing data about my finances. My car crashes and my kids are dead. See, the analogy the previous poster tried to create does not scale.

        I am an open source proponent, but one of my most frustrating issues is getting fixes in a timely and inexpensive manner. I lack the skills to fix the things I need fixed much of the time and have to wait upon the whims of the unpaid labor of OSS much of the time. I know you can hire people to do the work, but the one time I tried to hire the correct group of consultants to get the work done, it was going to cost me close to $10K - more money than running Windows on all my small company's servers and clients... Granted, it was the one time I really needed the fix, and with MS I would have needed a fix each month. Not sure what my point is...

        Oh, I know...all software sucks and we have yet to discover the correct model for creating and supporting it.
  • Sounds like they are not quite ready for Prime Time

    If they want to be counted on as a production-level phone system for a business, they need to be able to stay up and running, and certainly not be down for 2 days when they do go down and there is no extenuating circumstance beyond their control that took them down. I think if they had come out and said it was the result of hackers or some other malicious activity, it would be more acceptable than saying that the login load from a bunch of users running Windows Update took them down.
  • BS!!! Use SUS dumbasses

    It is obvious that if Skype had outages, it is because they failed to implement SUS properly and test patches in the manner MCSEs are taught to. Maybe they should stop offshoring IT jobs and actually get some employees who know how to do their jobs!!!

    Don't blame Microsoft for your employees not doing what they are paid for.
    • SUS Recommendation

      Sounds like they were talking about the millions of regular users using SKYPE not SKYPE itself. 99% of SKYPE users are not inside of a corporate network where SUS makes sense. Oh by the way. Are you still installing SUS? SUS is the old version, WSUS is the new version.
  • Probably only part of the problem

    I could believe that causing (almost) the entire userbase to re-login within a 24 hour period (due to timezones), would put a tremendous stress on the system.

    Hotmail was essentially unavailable when the improved version came out recently as everyone tried to get it. These surges do make a difference - even to Microsoft.

    But, if the default behavior is to re-login at reboot, perhaps a redesign is in order. MS will continue to have patch Tuesday and will continue to need reboots to fix gaping holes in the kernel. Always had, always will.