Why Click Frenzies shouldn't cause web scale fail

The website for last week's Click Frenzy online sale collapsed under an unexpectedly high load. It didn't have to; techniques do exist for building highly elastic web infrastructure.
Written by Stilgherrian, Contributor


The Click Frenzy sale's organisers had planned for 1 million visitors to the site during the 24-hour sale. But, according to their hosting provider UltraServe, site traffic peaked at 2 million simultaneous "hits" — presumably simultaneous HTTP connections — as the sale kicked off.

The site was down for around three hours until engineers managed to redeploy it to Amazon's cloud.

The next morning, Click Frenzy's supporters tried to brush off this technical disaster as the inevitable result of a high-volume sale.

"I think the important thing to understand with this is that it's been running for about five, six years in overseas countries ... and for all of that period that it has been operating overseas, as recently as last year, they routinely have crashes as part of this mechanism, simply because of the unpredictable peaks and troughs that occur as part of the mechanisms," said Margie Osmond, chief executive of the Australian Retailers Association, on ABC Radio.

But as application architect Benno Rice explains on this week's Patch Monday podcast, while she might have been right about that in the past, we now have techniques for coping with those unpredictable peaks and troughs. The expertise can be found in Australia.

"If you look at, say, ticketing websites these days, they're fairly stable when, say, Radiohead goes on sale. But back in the day, they would have been crashing left and right, and it's only through a lot of understanding of how to handle traffic spikes that they actually stay online these days," he said.

"You need to make sure as much of your site as possible is cacheable, and so you're trying to avoid having to generate dynamic pages as much as you can. A good way to look at it is whether certain pages need to be generated every time a customer hits, or whether they can be valid for, say, 5 minutes, because that 5 minutes is 5 minutes where your server's not getting hit."
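The five-minute window Rice describes can be implemented with HTTP `Cache-Control` headers at a proxy, or with a simple in-process cache. A minimal sketch of the latter, with hypothetical names (`render_cached`, `render_fn` are illustrative, not from the Click Frenzy codebase):

```python
import time

CACHE_TTL = 300  # five minutes, per Rice's example

_cache = {}  # maps page key -> (rendered_html, expiry_timestamp)

def render_cached(key, render_fn):
    """Return a cached page while it is still fresh, else regenerate it.

    While an entry is fresh, repeated requests never reach render_fn,
    so the server does no page generation or database work for them.
    """
    now = time.time()
    entry = _cache.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]           # cache hit: serve without regenerating
    html = render_fn()            # cache miss: generate the page once
    _cache[key] = (html, now + CACHE_TTL)
    return html
```

During a traffic spike, every request inside the five-minute window is served from memory; only one request per window pays the cost of generating the page.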

Get this wrong, and any of several bottlenecks can become critical: the computational work of generating pages can overwhelm the processors, or the volume of database lookups needed to assemble each page can overload the database servers.

"Then there's the overall tuning of both your server, the execution environment for your application code, and the database itself," Rice said.

Once the web application is properly structured and the environment tuned, it can be deployed on elastic cloud services that automatically add more capacity when a potential overload is detected, billing you for what you use.
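The scaling decision behind such elastic services can be sketched as a threshold policy: add instances when per-instance load climbs, shed them when it falls. This is a conceptual illustration only (the function name and thresholds are assumptions, not Amazon's actual policy engine):

```python
def desired_instances(current, avg_connections_per_instance,
                      scale_up_at=800, scale_down_at=200,
                      min_instances=2, max_instances=50):
    """Threshold-based scaling decision.

    Doubles capacity when average load per instance exceeds the upper
    threshold, halves it below the lower threshold, and otherwise holds
    steady. All thresholds here are illustrative values.
    """
    if avg_connections_per_instance > scale_up_at:
        return min(current * 2, max_instances)   # scale up on overload
    if avg_connections_per_instance < scale_down_at:
        return max(current // 2, min_instances)  # scale down when idle
    return current
```

Doubling rather than adding one instance at a time is a common choice for sudden spikes like a sale launch, since load can grow faster than instances boot; the caps keep both the bill and the minimum footprint bounded.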

It seems that Click Frenzy's developers at Fontis and their hosting provider UltraServe ended up having to do all of that on the fly as the service collapsed.

Rice also discusses the potential weaknesses to look for in selecting application platforms, and provides plenty of hints on how to avoid suffering scale fail.

Click Frenzy also suffered a security glitch, leaving configuration files containing the application database's username and password open to the world. HackLabs director Chris Gatford discusses the significance of that failure.

To leave an audio comment on the program, Skype to stilgherrian, or phone Sydney +61 2 8011 3733.

Running time 45 minutes, 00 seconds
