Post mortem: Our site fail Wednesday and what went wrong

As you may have noticed during our live coverage of Apple's iPad event Wednesday, ZDNet had a few performance issues. Actually, that's a euphemism since we were pretty much dead in the water for a few hours.

Since we're an IT site and you go through these failures from time to time, we thought it would be instructive and hopefully educational to have a gander at our post mortem.

Here's a memo John Potter, our vice president of technology, sent out to our merry band of bloggers:

Everyone,

I wanted to reach out to you and give you all an explanation for the site outages we had during the Apple iPad event. Our Site Reliability team conducted a thorough post-mortem, and I’ve outlined the salient points below.

What happened?

The load balancer is a combination of software and hardware that acts as a gateway between the Internet and our servers. It helps route traffic to the appropriate servers and web applications.
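To make the routing role concrete, here's a minimal sketch in Python. It is purely illustrative: the actual load balancer is a hardware/software appliance, not application code, and the backend names are invented.

    import itertools

    # Hypothetical backend pool; the real server names are not given in the memo.
    BACKENDS = ["blog-web-01", "blog-web-02", "blog-web-03", "blog-web-04"]
    healthy = set(BACKENDS)               # servers currently in rotation
    _rotation = itertools.cycle(BACKENDS)

    def route_request(path):
        """Return the backend that should serve this request (simple round-robin)."""
        for _ in range(len(BACKENDS)):
            backend = next(_rotation)
            if backend in healthy:        # skip servers pulled from rotation
                return backend
        raise RuntimeError("no servers in rotation -- the site is effectively down")

    print(route_request("/blog/apple"))   # -> blog-web-01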

During the Apple iPad event, the load balancer repeatedly thought most of our blog servers were not responding, so it only sent traffic to one or two at a time instead of all of our servers that were available.

This disruption impacted the blogs for all our sites. ZDNet Blogs was the site that suffered the most problems. However, during our attempts at recovery, BNET Industries, SmartPlanet, MoneyWatch, TechRepublic Training and ZDNet Reviews were also impacted.

Why did it happen?

The load balancer is configured to routinely check our servers to make sure they are ready to receive traffic. This can be done in more than one way, and the method selected depends upon the nature of the web application running on the server. For blogs, the load balancer was configured to send a request to the web application on each server and to expect an "I'm up" or "I'm down" response. If it receives the latter or no response at all, then it will stop routing traffic to that server. It does this kind of check every 5 seconds.
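A rough sketch of that check loop, assuming the probe is a plain HTTP request; the status endpoint path, the per-check timeout, and the function names below are assumptions for illustration, not details from the memo.

    import time
    import urllib.request

    CHECK_INTERVAL = 5   # seconds between checks, per the memo
    CHECK_TIMEOUT = 2    # assumed: how long to wait before treating a server as unresponsive

    def server_is_up(host):
        """Probe the blog application; True keeps the server in rotation."""
        try:
            url = "http://%s/status" % host            # hypothetical status endpoint
            with urllib.request.urlopen(url, timeout=CHECK_TIMEOUT) as resp:
                return resp.read().decode().strip() == "I'm up"
        except Exception:
            return False                               # "I'm down", an error, or no response at all

    def health_check_loop(backends, healthy):
        while True:
            for host in backends:
                if server_is_up(host):
                    healthy.add(host)                  # (re)add to the rotation
                else:
                    healthy.discard(host)              # stop routing traffic to it
            time.sleep(CHECK_INTERVAL)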

When the traffic spike from the Apple iPad announcement began, the servers were inundated with requests. As a result, some of the blog servers did not return a response to the load balancer's check in a timely manner, so the load balancer stopped routing traffic to those servers. This first happened to a server at 10:08. Within a minute, it had happened to the majority of our servers, and they were removed from the rotation. That further overloaded the remaining servers, which caused the load balancer to believe that they, too, were no longer available. Meanwhile, once traffic was no longer routed to the removed servers, they would start to respond again, and the load balancer would add them back into the rotation. By this time, however, we were locked in a vicious cycle: the load balancer only thought one or a few servers were available at a time, so all traffic was routed to those servers, which promptly failed. The servers could never recover enough for the load balancer to see them all as available at once.
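The arithmetic behind that cycle is easy to reproduce. The figures below are invented for illustration only, but they show why pulling slow servers during a spike just makes the survivors slower.

    EVENT_TRAFFIC = 3000   # requests/sec during the spike (invented figure)
    CAPACITY = 300         # requests/sec one blog server can answer in time (invented figure)
    in_rotation = 8        # servers the load balancer starts with (invented figure)

    while in_rotation > 0:
        per_server = EVENT_TRAFFIC / in_rotation
        if per_server <= CAPACITY:
            print("%d servers: %.0f req/s each -- checks pass, rotation is stable" % (in_rotation, per_server))
            break
        print("%d servers: %.0f req/s each -- too slow, one more server pulled" % (in_rotation, per_server))
        in_rotation -= 1   # the balancer drops another unresponsive server
    else:
        print("0 servers in rotation -- pulled servers recover, rejoin, and the cycle repeats")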

How did we respond and recover?

My team and I were on a conference line before 10AM and already monitoring server performance. We were able to respond immediately.

All previous blog server outages had been due to web application issues, so, based on prior experience, we focused almost exclusively on the web application running on the blog servers. This led us to spend a great deal of time modifying application configuration, adding server capacity and temporarily removing components from certain pages. Our ability to diagnose the problem was also hampered by the lag in our monitoring tools, which made it look like all the servers were overloaded simultaneously.

Shortly before 11:06, one of my team members noticed that the load balancer was not configured as we had previously specified for the blog tier. I paged the Systems Administration team at 11:06 and received a response at 11:10. We requested a change to the load balancer setting that determines which algorithm it uses to route traffic. At this point, we were further misled as to the cause of the problem by the fact that the load balancer was running at 100% CPU utilization. The algorithm change was not effective, so we explored other options at the web application and load balancer levels and made futile attempts at effecting recovery.
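The memo doesn't say which routing algorithm was requested; the sketch below merely contrasts two common choices to show what that setting controls. The server names and connection counts are invented.

    import itertools

    # Invented snapshot of how busy each backend is.
    active_connections = {"blog-web-01": 180, "blog-web-02": 35, "blog-web-03": 410}
    _rr = itertools.cycle(sorted(active_connections))

    def pick_round_robin():
        """Ignore load entirely and hand requests out in a fixed rotation."""
        return next(_rr)

    def pick_least_connections():
        """Send the next request to whichever server currently has the least work."""
        return min(active_connections, key=active_connections.get)

    print(pick_round_robin())        # blog-web-01
    print(pick_least_connections())  # blog-web-02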

Further discussion with Systems Administration led to the idea of forcing the load balancer to believe all servers were available for traffic. At approximately 12:50, this change was made. The load on all servers dropped immediately and drastically. Pages began to render normally.
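In effect, the fix was to stop trusting the health check for a while and spread traffic across the whole pool again. Here is a minimal sketch of that kind of override in terms of the sketches above; the flag and function names are hypothetical, not the appliance's actual interface.

    FORCE_ALL_UP = True   # emergency override: treat every server as available

    def servers_to_route_to(backends, healthy):
        """The pool the balancer actually routes to."""
        if FORCE_ALL_UP:
            return set(backends)   # ignore check results and use the whole pool
        return set(healthy)        # normal operation: only servers passing checks

    # With the override on, traffic spreads over all servers again, per-server load
    # drops, and the health checks soon start passing on their own.
    print(servers_to_route_to(["blog-web-0%d" % i for i in range(1, 5)], healthy=set()))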

How can we prevent similar unexpected issues from having system-wide impact?

In the past, we have focused on site performance mainly at the web application. When a web application was underperforming, we would tweak it or give it more hardware to run on to improve performance. This worked, so it reinforced our tendency to address site performance problems in this manner. In many cases, this is the best approach.

Based on today's experience, we need to thoroughly review how every request to our sites is routed from a load balancer to our servers and back to a user. We need to review each load balancer setup and whether it is the most appropriate for the servers it is in front of. We need to look at each web application and decide whether it should run on different hardware or with a different configuration. Timely and frequent communication about the progress toward recovery is important to those who depend on it.

These are not quick tasks or issues to address and resolve.

The first task we will start on is a review of the load balancer setup for each of our sites. We'll determine if it's appropriate and adjust it if not. The review should be done by Monday so that any needed adjustments can proceed as soon as possible.

Another task is to review how to prevent problems on one site from impacting other sites. This will focus on which applications and sites share hardware, and on whether the applications should be partitioned further.

For times when we anticipate large spikes in traffic, I will follow up with Systems Administration about having a tech on hand to assist with any potential troubleshooting. This gives us a ready set of hands and eyes on parts of the infrastructure we don't have access to, which will help speed up response and recovery.

During incidents like yesterday's, my team is consumed with addressing the problem. However, it is important to communicate as frequently as possible, with as much detail as possible, about our progress. We will review technological and organizational options to improve this.



Talkback

  • If they dumped their crappy Windoze systems

    and went with Linux they wouldn't have had this issue.
    Ron Bergundy
    • I agree

      Linux sites seldom go down.
      Linux Geek
      • Yeah right Linux servers must be "magical" devices (NT)

        .
        themarty
    • The blogs are run on Linux

      The current set-up of the blogging platform is Linux on Apache. The issue with this particular outage occurred at the load balancer level.
      brandons
      • Oh me, oh my ... no response Linux fans?

        I am also willing to bet that the Load Balancer that ZD uses runs Linux or some other *N*X variant.

        No witty retorts or commentary you'd like to grace us with?

        Congratulations to ZD Net for sharing the detailed analysis of your site's issues during what must have been one of the highest volumes of traffic seen in quite a while.

        I've been involved with building massive-scale systems and sites for many years and have seen many an otherwise superb site brought down by load balancing issues of one kind or another.

        I recall one government site in particular that was fronted by a load balancer from a company that should have known better. Their algorithm directed traffic to TCP ports on machines that responded the quickest ... without examining the message returned by the TCP socket.

        Heavily burdened servers would respond almost instantaneously with an HTTP 503 (Service Unavailable) status. The load balancer didn't actually look at the status code, but said "hey - he was fast, have another request". And the whole site would collapse as all traffic got sent to the first server to become overloaded.

        Fun times!
        de-void-21165590650301806002836337787023
    • In this case, nothing to do with Linux vs Windows - just a load balancing

      algorithm.
      DonnieBoy
      • ... triggered by the inability of LINUX servers to respond

        in a timely manner.

        Sure, the load balancer appears to be the main culprit, but it was the failure to respond on the part of the web servers that pushed it over the edge.

        I wonder if you would have been this grandiose if the blog had been running on Windows?
        honeymonster
        • Had it been running on Windows....

          The blog would have been down with the first iPad rumor.
          storm14k
        • If they're F5 BigIP units then they're running UNIX.

          I can't recall if they're Linux or BSD based (been a while since I worked on them...I'm thinking BSD).

          Regardless, the load balancers were not the problem but the back-end servers' inability to respond to the load balancer's inquiry within the specified time.
          ye
          • Check Method

            Yes, the problem was the check method that the load balancer was set to use. It was set based on the normal constraints on application performance. For this kind of load situation, we should have set it to a simple tcp/ip check.
            JFPSF
          • tcp or more patience?

            I assume that you had set the load balancer to perform an HTTP request every 5 secs with a constraint on how long it would wait for the response?

            In that case you could also say that the load balancer had been configured with a too-short expiration for these probe requests. So, rather than performing simple TCP requests, you could also have configured a more lenient expiration. For example, the load balancer could wait 5 or 10 secs for a request to complete and consider the server "down" only if the last 5 requests had a 100% loss.
            honeymonster
          • Check Time-out

            Yes, we could have set it for a longer timeout, but I'm not sure that we could have successfully picked the right time given the tsunami of traffic that occurred at the very start of the Apple event.
            JFPSF
    • Owned

      You just got it :)


      Linux is great huh?
      The one and only, Cylon Centurion
    • RE: Post mortem: Our site fail Wednesday and what went wrong

      @Ron Bergundy lol
      cjschris
  • Yeah that sucked.

    nt
    Snooki_smoosh_smoosh
  • Giving Linux advocates a bad name...

    Let me first say that I don't have a dog in this fight. Windows or Linux, I don't care.

    But, "Geek" and "All the Way", are you sure you're not MS shills paid to make Linux supporters look foolish? If not, you should apply for the position.
    psquared007
    • I'm sure.

      If M$ gave me a check, I wouldn't cash it!
      Ron Bergundy
      • Great. You sound just like Castro

        though he actually did cash the first one...
        John Zern
  • PS...

    Thanks Larry for posting the internal e-mail.

    Very good example of the "law of unintended consequences". "No response in 5 seconds, take it out of the line-up....oops, all the servers are out of the line-up". Doh!
    psquared007
  • Thanks for the post-mortem

    It's always good to see these kinds of problems explained in detail so others can learn from them.
    Ed Burnette