Post mortem: Our site fail Wednesday and what went wrong

As you may have noticed during our live coverage of Apple's iPad event Wednesday, ZDNet had a few performance issues. Actually, that's a euphemism since we were pretty much dead in the water for a few hours.
Written by Larry Dignan, Contributor

Since we're an IT site, and you go through these failures from time to time, we thought it would be instructive to have a gander at our post mortem.

Here's a memo John Potter, our vice president of technology, sent out to our merry band of bloggers:


I wanted to reach out to you and give you all an explanation for the site outages we had during the Apple iPad event. Our Site Reliability team conducted a thorough post-mortem, and I’ve outlined the salient points below.

What happened?

The load balancer is a combination of software and hardware that acts as a gateway between the Internet and our servers. It helps route traffic to the appropriate servers and web applications.

During the Apple iPad event, the load balancer repeatedly thought most of our blog servers were not responding, so it only sent traffic to one or two at a time instead of all of our servers that were available.

This disruption impacted the blogs for all our sites. ZDNet Blogs was the site that suffered the most problems. However, during our attempts at recovery, BNET Industries, SmartPlanet, MoneyWatch, TechRepublic Training and ZDNet Reviews were also impacted.

Why did it happen?

The load balancer is configured to routinely check our servers to make sure they are ready to receive traffic. This can be done in more than one way, and the method selected depends upon the nature of the web application running on the server. For blogs, the load balancer was configured to send a request to the web application on each server and to expect an "I'm up" or "I'm down" response. If it receives the latter or no response at all, then it will stop routing traffic to that server. It does this kind of check every 5 seconds.
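The check described above can be sketched in a few lines of Python. This is only an illustration of the rule, not our actual load balancer configuration: the `probe` callables, the server names and the function names are all hypothetical stand-ins, and only the 5-second interval and the "up"/"down"/no-response behavior come from the description above.

```python
from typing import Callable, Optional

# From the description above: the check runs every 5 seconds.
CHECK_INTERVAL_SECONDS = 5

# A probe is a hypothetical stand-in for the request the load balancer sends.
# It returns "up", "down", or None when no response arrives in time.
Probe = Callable[[], Optional[str]]


def passes_check(probe: Probe) -> bool:
    """A server stays in rotation only if it explicitly answers "up"."""
    return probe() == "up"


def update_rotation(probes: dict[str, Probe]) -> list[str]:
    """Return the names of the servers that should receive traffic."""
    return [name for name, probe in probes.items() if passes_check(probe)]


# Illustrative servers: one healthy, one reporting down, one not answering.
probes = {
    "blog1": lambda: "up",
    "blog2": lambda: "down",
    "blog3": lambda: None,
}
print(update_rotation(probes))  # ['blog1']
```

Note that a slow server and a dead server look the same to this rule: both simply fail to say "up" in time.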

When the traffic spike began due to the Apple iPad announcement, the servers were inundated with requests. As a result, some of the blog servers did not return a response to the check from the load balancer in a timely manner, and the load balancer stopped routing traffic to those servers. This first happened to a server at 10:08. Within a minute, it had happened to the majority of our servers, and each of them was removed from the server rotation. That further overloaded the remaining servers, which caused the load balancer to believe that they too were no longer available. Meanwhile, once traffic was no longer routed to the removed servers, they would start to respond again, and the load balancer would add them back into the rotation. By this time, however, we were locked in a vicious cycle. The load balancer thought only one or a few servers were available at any given time, so it routed all traffic to those servers, which promptly failed. The servers never all recovered at the same moment, so the load balancer never saw them all as up.
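The cycle is easier to see in a toy simulation. The sketch below is a deliberately simplified model, not a reconstruction of the real system: the server count, the traffic figure and the per-server `check_limits` (the load above which a server misses the check) are invented for illustration, and traffic is assumed to be split evenly across the servers in rotation.

```python
def simulate_rotation(check_limits: list[float], total_traffic: float, ticks: int) -> list[int]:
    """Return how many servers are in rotation at the start of each tick.

    check_limits[i] is a hypothetical load above which server i answers
    the health check too slowly and is removed from rotation.
    """
    in_rotation = set(range(len(check_limits)))  # every server starts in rotation
    history = []
    for _ in range(ticks):
        history.append(len(in_rotation))
        # Traffic is split evenly across whatever is currently in rotation.
        load = total_traffic / len(in_rotation) if in_rotation else 0.0
        next_rotation = set()
        for server, limit in enumerate(check_limits):
            if server in in_rotation:
                # A busy server passes the check only at or below its limit.
                if load <= limit:
                    next_rotation.add(server)
            else:
                # An idle server answers promptly and is added back.
                next_rotation.add(server)
        in_rotation = next_rotation
    return history


# Eight illustrative servers; one is slightly more sensitive than the rest.
limits = [95.0] + [100.0] * 7
print(simulate_rotation(limits, total_traffic=784, ticks=6))  # [8, 7, 1, 7, 1, 7]
```

In this model a single early removal is enough: the other seven servers are pushed past their limit, the one idle server comes back and takes all the traffic alone, and the rotation keeps flipping between the two groups without ever returning to all eight.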

How did we respond and recover?

My team and I were on a conference line before 10AM and already monitoring server performance. We were able to respond immediately.

All previous blog server outages had been due to web application issues. So, based on prior experience, we focused almost exclusively on the web application running on the blog servers. This led us to spend a great deal of time modifying the application configuration, adding server capacity and temporarily removing components from certain pages. Our ability to diagnose the problem was also hampered by the lag in our monitoring tools, which made it look like all the servers were overloaded simultaneously.

Shortly before 11:06, one of my team members noticed that the load balancer was not configured as we had previously specified for the blog tier. I paged the Systems Administration team at 11:06 and received a response at 11:10. We requested a change to the load balancer setting that determines which algorithm it uses to route traffic. This change was not effective. At this point, we were further misled as to the cause of the problem by the fact that the load balancer was running at 100% CPU utilization. We explored other options at the web application and the load balancer, and made futile attempts at effecting recovery.

Further discussion with Systems Administration led to the idea to force the load balancer to believe all servers were available for traffic. At approximately 12:50, this change was made. The load on all servers dropped immediately and drastically. Pages began to render normally.
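A rough sketch of why forcing the servers up relieved the load follows. The numbers and names here are illustrative assumptions (eight servers, one passing its check, 784 requests to spread), and `force_all_up` is a hypothetical flag standing in for the change that was made on the load balancer.

```python
def servers_in_rotation(health: dict[str, bool], force_all_up: bool = False) -> list[str]:
    """Pick the servers that receive traffic.

    With force_all_up set, the health check results are ignored and
    every known server is treated as available.
    """
    if force_all_up:
        return sorted(health)
    return sorted(name for name, is_up in health.items() if is_up)


def requests_per_server(total_requests: int, rotation: list[str]) -> float:
    """Split the traffic evenly across the servers in rotation."""
    if not rotation:
        raise ValueError("no servers in rotation")
    return total_requests / len(rotation)


# Illustrative state mid-outage: only blog1 is currently passing its check.
health = {f"blog{i}": (i == 1) for i in range(1, 9)}

normal = servers_in_rotation(health)
forced = servers_in_rotation(health, force_all_up=True)

print(requests_per_server(784, normal))  # 784.0
print(requests_per_server(784, forced))  # 98.0
```

The trade-off is that a forced rotation would also keep sending traffic to a server that is genuinely down, so an override like this suits recovery better than normal operation.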

How can we prevent similar unexpected issues from having system-wide impact?

In the past, we have focused on site performance mainly at the web application. When a web application was underperforming, we would tweak it or give it more hardware to run on to improve performance. This worked, so it reinforced our tendency to address site performance problems in this manner. In many cases, this is the best approach.

Based on today's experience, we need to thoroughly review how every request to our sites is routed from a load balancer to our servers and back to a user. We need to review each load balancer setup and whether it is the most appropriate for the servers it is in front of. We need to look at each web application and decide whether it should run on different hardware or with a different configuration. Timely and frequent communication about the progress toward recovery is important to those who depend on it.

These are not quick tasks or issues to address and resolve.

The first task we will start on is a review of the load balancer setup for each of our sites. We'll determine whether each setup is appropriate and adjust it if not. The review should be done by Monday, so that any adjustments can proceed as soon as possible.

Another task is to review how to prevent problems on one site from impacting other sites. This will focus on which applications and sites share hardware. We will also look at partitioning our applications further.

For times when we anticipate large spikes in traffic, I will follow up with Systems Administration about having a tech on hand to assist with any troubleshooting. This provides a ready set of hands and eyes for the parts of the technological infrastructure we don't have access to, which will help speed up response and recovery.

During times like yesterday, my team is overwhelmed with trying to address the problem. However, it is important to communicate as frequently as possible with as much detail as possible about our progress. We will review technological and organizational options to improve this.
