Amazon blames outage on complicated systems

The company has blamed its complicated infrastructure for a site outage that left Amazon.com inaccessible to many US visitors on Friday.
Written by Stephen Shankland, Contributor

Amazon.com appears to be blaming its complicated infrastructure for the outage that left it inaccessible to many US visitors for more than an hour and a half on Friday.

Amazon declared itself clear of the problem on Friday afternoon. "The Amazon retail site was down for approximately two hours earlier today beginning around 10.25am. The site [is] back up," the company said in a statement following the outage. "Amazon's systems are very complex and, on rare occasions, despite our best efforts, they may experience problems. We work to minimise any disruption and to get the site back as quickly as possible." Amazon declined to comment further.

The site, which is held up as an exponent of cloud computing due to the large number and complexity of web services used by partner sites, went offline completely by 10.21am PDT on Friday. Efforts to restore it appeared to be taking effect about noon, said Keynote Systems, which monitors website responsiveness. As of 12.45pm, the site was working intermittently, with many product pages functioning but others still broken.

"At noon PDT, we started to see the site getting better," said Shawn White, director of external operations for Keynote. "We [were] seeing about 70 percent availability."

Sustained outages can be a serious problem. EBay suffered outages in 1999 that outraged users and sent the stock down; even a backup system didn't ward off more problems in 2002.

For major commerce sites, the problem can have a ripple effect. Both Amazon and eBay provide a commercial foundation used by many partners and entrepreneurs.

Based on last quarter's revenue of $4.13bn (£2.09bn) globally, a full-scale global outage would cost Amazon more than $31,000 per minute, on average. For North America, it would be more than $16,000 per minute. Those figures do not include revenue from other sources, such as search or contextual advertisements or Amazon Web Services.
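The per-minute figure can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name and the assumed quarter lengths of 90 to 92 days are not from Amazon's reporting, and the only input taken from the article is the $4.13bn global revenue figure.

```python
# Rough check of the average revenue-per-minute estimate.
# Assumption: a quarter lasts 90 to 92 days; the article does not say
# which length was used, so all three are tried.

MINUTES_PER_DAY = 24 * 60


def revenue_per_minute(quarterly_revenue_usd: float, days_in_quarter: int = 91) -> float:
    """Return average revenue per minute, spreading revenue evenly over the quarter."""
    if days_in_quarter <= 0:
        raise ValueError("days_in_quarter must be positive")
    return quarterly_revenue_usd / (days_in_quarter * MINUTES_PER_DAY)


# Global quarterly revenue quoted in the article.
GLOBAL_QUARTERLY_REVENUE_USD = 4.13e9

if __name__ == "__main__":
    for days in (90, 91, 92):
        rate = revenue_per_minute(GLOBAL_QUARTERLY_REVENUE_USD, days)
        print(f"{days}-day quarter: ${rate:,.0f} per minute")
```

Whichever quarter length is assumed, the result comes out a little over $31,000 per minute, in line with the estimate. It is a simple average, so it takes no account of how sales vary by time of day or season.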

It appeared that Amazon Web Services such as the S3 storage and EC2 computing services continued to function at least for some customers, though the Amazon Web Services page at Amazon.com wasn't working.

"S3 and EC2 continue to function for us as normal," said Don MacAskill, chief executive of photo-sharing site Smugmug. Mashery.com chief executive Oren Michels, who uses Amazon Web Services for several functions and who has several customers who use Amazon Web Services, reported no problems on Friday.

As for the explanation for the outage, the company hinted only that its complicated computing infrastructure was the culprit.

In the estimation of Keynote's White, the most likely culprit was simple human error.

"Some engineer might have made a particular change, not knowing it could cause a trickle-down effect [that eventually brought down the site]," said White.

For example, he said, somebody in charge of maintenance might have been directing internet traffic to a particular group of servers, but selected the wrong group.

"What I find still so surprising is that it happened in the middle of the day. Typically, you do that in off-peak hours," White said. "[Amazon] ranks on the top with performance and availability, consistently, time and time again."

Another possible explanation is an attack such as the distributed denial-of-service (DDoS) attack that struck Amazon and other high-profile sites in 2000. White said he thinks it unlikely, though, that a crushing load of network traffic brought Amazon down.

"These guys are experts at dealing with flash floods of users", including those that routinely arrive during peak shopping days, said White. "Usually, when you see a site going under because of traffic issues or a denial-of-service attack, you see a gradual slowdown in performance and drop in availability. Here, we saw at 10.16am that it completely dropped off — 100 percent."

Soups Ranjan, a senior member of the technical staff of network protection and management company Narus, hasn't yet found any attack evidence.

"It doesn't seem to be the result of a network-initiated attack, at least from my preliminary analysis from our probes," Ranjan said.

Human error may not sound as gripping a tale as a network attack, but there's plenty of drama for the people responsible. And it's the career-limiting variety of drama, said Illuminata analyst Gordon Haff, who hazarded a guess that Amazon's problem involved its front-end web servers.

The security group of WebSense, a website and communications protection company, also saw no evidence that Amazon's problem was security-related.

CNET News.com's Robert Vamosi contributed to this report.
