Inside Amazon

Written by Robin Harris, Contributor

You're running one of the world's busiest e-commerce sites, handling up to 4 million checkouts per day. Response time is critical. Every page is customized on the fly using over 150 network services. And the system must manage failures of any component, including entire data centers.

You are taking real money and shipping real goods. Quality of Service is your lifeblood. How does Amazon do it?

Enterprise reliability at massive scale

Like Google (see Google's three rules), Amazon:

  • Uses commodity servers
  • Embraces failure
  • Architects for scale

Yet Amazon raises the ante on Google: application developers can tune the storage infrastructure, Dynamo, to meet application needs.

Scalable architecture

Here's an illustration from a recent Amazon paper on Dynamo, their main storage infrastructure.

[Figure: illustration from Amazon's Dynamo paper]

Unique features

Highly automated. No manual intervention is required to add or remove storage nodes; the system handles discovery and data redistribution automatically.
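To make automatic redistribution concrete, here is a minimal sketch of a consistent-hash ring, the partitioning approach the Dynamo paper describes. The class and function names (`HashRing`, `ring_position`), the use of MD5, and the choice of 64 ring points per node are illustrative assumptions, not Amazon's code. The point it shows: when a node joins, only the keys that the new node takes over have to move.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the ring (a 128-bit integer from MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; every node owns several points on the ring."""

    def __init__(self, vnodes: int = 64) -> None:
        self.vnodes = vnodes   # ring points per node (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        if not self._points:
            raise LookupError("ring has no nodes")
        i = bisect.bisect_left(self._points, ring_position(key))
        return self._owner[self._points[i % len(self._points)]]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-d")  # a new storage node joins the cluster
after = {k: ring.lookup(k) for k in keys}

# Keys whose owner changed; with consistent hashing they all land on node-d.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved;",
      "all to node-d:", all(after[k] == "node-d" for k in moved))
```

A real system also has to discover the new node and copy the data; this sketch covers only the question of which keys change owner.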

99.9th percentile performance. Amazon's best customers also have the most data: recently viewed items, wish lists, long histories. Instead of measuring average performance, Amazon measures performance at the 99.9th percentile, the far end of the distribution, to ensure that all customers, not just the majority, have a good experience.
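A small worked example shows why the average hides the customers at the tail. The latency numbers below are synthetic, made up for illustration, and the `percentile` helper is a hypothetical nearest-rank implementation, not anything taken from Amazon's tooling.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds (not Amazon data):
# 9,980 fast requests and 20 slow ones for customers with a lot of data.
latencies_ms = [10.0] * 9980 + [900.0] * 20

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean   = {mean_ms:.2f} ms")
print(f"median = {median_ms:.2f} ms")
print(f"p99.9  = {p999_ms:.2f} ms")
```

With these numbers the mean stays under 12 ms while the 99.9th percentile is 900 ms, so a target written against the mean would never notice the slow requests.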

Tunable trade-offs. It isn't possible to have both high availability and strongly consistent data when networks and servers fail: the mechanisms that ensure consistency hobble availability. Amazon gives developers some knobs so they or their applications can tune the system for fast reads or fast writes, and for availability vs. cost. Availability is key for Amazon's 24x7 business model, so Dynamo is designed for eventual consistency.
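The knobs can be sketched with the three parameters the Dynamo paper names: N (replicas per key), R (replicas that must answer a read) and W (replicas that must acknowledge a write). The `QuorumConfig` class and the `worst_case_read` helper below are hypothetical names for illustration, and the model ignores details such as hinted handoff. It shows the basic rule: when R + W > N, every read set overlaps every write set, so a read sees the latest acknowledged write; smaller R and W are faster and more available but can return stale data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    n: int  # replicas per key
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write

    def __post_init__(self):
        if not (1 <= self.r <= self.n and 1 <= self.w <= self.n):
            raise ValueError("need 1 <= r <= n and 1 <= w <= n")

    def read_sees_latest_write(self) -> bool:
        """True when any R replicas must include one of the W written ones."""
        return self.r + self.w > self.n

    def tolerated_failures(self):
        """(read, write): replicas that can be down while requests still succeed."""
        return (self.n - self.r, self.n - self.w)


def worst_case_read(cfg: QuorumConfig) -> int:
    """Version returned when the read asks the replicas least likely to be fresh."""
    versions = [1] * cfg.n               # every replica starts at version 1
    for i in range(cfg.w):               # the write reaches only the first W replicas
        versions[i] = 2
    read_set = versions[cfg.n - cfg.r:]  # the read asks the last R replicas
    return max(read_set)                 # the reader keeps the newest version it saw


cfg_strict = QuorumConfig(n=3, r=2, w=2)  # overlapping read and write sets
cfg_fast = QuorumConfig(n=3, r=1, w=1)    # fastest, but reads may be stale

for cfg in (cfg_strict, cfg_fast):
    print(cfg, "overlap:", cfg.read_sees_latest_write(),
          "worst-case version read:", worst_case_read(cfg))
```

In this toy model the (3, 2, 2) setting always reads version 2, while (3, 1, 1) can still return version 1 after the write has been acknowledged. That is the eventual-consistency trade in miniature.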

Decentralization. Amazon's system is designed to withstand the loss of many components, up to and including data centers. Dynamo clusters are distributed across data centers linked by fast pipes. All data is stored in multiple data centers.
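One way to picture "all data is stored in multiple data centers" is a placement rule that prefers replicas in distinct data centers. The sketch below is an assumption-laden illustration: the `place_replicas` function, the node names and the data center labels are all hypothetical, and the article does not describe Amazon's actual placement logic.

```python
def place_replicas(candidates, n):
    """Pick n replicas from candidates, a list of (node, datacenter) pairs
    in preference order, spreading them over distinct data centers first."""
    chosen = []
    used_dcs = set()

    # First pass: take at most one node from each data center.
    for node, dc in candidates:
        if len(chosen) == n:
            break
        if dc not in used_dcs:
            chosen.append((node, dc))
            used_dcs.add(dc)

    # Second pass: if there are fewer data centers than replicas, fill up
    # with the remaining nodes in preference order.
    for node, dc in candidates:
        if len(chosen) == n:
            break
        if (node, dc) not in chosen:
            chosen.append((node, dc))

    return chosen


# Hypothetical nodes in the order a key would prefer them.
candidates = [
    ("a1", "dc-east"),
    ("a2", "dc-east"),
    ("b1", "dc-west"),
    ("c1", "dc-central"),
    ("b2", "dc-west"),
]

replicas = place_replicas(candidates, n=3)
survivors = [node for node, dc in replicas if dc != "dc-east"]

print("replicas:", replicas)
print("copies left if dc-east is lost:", len(survivors))
```

Because the three copies land in three different data centers, losing any one data center still leaves two copies available.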

Heterogeneity. Systems management is based on application performance. Powerful new systems get more work than old systems, but all applications get their work done on time.
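A simple way to give powerful machines more work is to hand each node a share of the ring proportional to its capacity; the Dynamo paper describes sizing a node's number of virtual nodes this way. The `assign_vnodes` function, the node names and the capacity figures below are illustrative assumptions.

```python
def assign_vnodes(capacity, total_vnodes):
    """Split total_vnodes across nodes in proportion to capacity,
    using integer arithmetic and the largest-remainder rule."""
    total_capacity = sum(capacity.values())
    counts = {}
    remainders = {}
    for node, cap in capacity.items():
        counts[node], remainders[node] = divmod(total_vnodes * cap, total_capacity)

    # Hand any leftover virtual nodes to the largest remainders first.
    leftover = total_vnodes - sum(counts.values())
    by_remainder = sorted(capacity, key=lambda node: remainders[node], reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    return counts


# Relative capacities (made-up): the new server is four times as powerful.
capacity = {"old-1": 1, "old-2": 1, "new-1": 4}
vnode_counts = assign_vnodes(capacity, total_vnodes=120)
print(vnode_counts)
```

Here the new server receives 80 of the 120 virtual nodes and each old server receives 20, so the work each machine gets tracks what it can handle.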

Implications for the enterprise

Your friendly $250,000-a-year storage sales rep would be shocked to learn that a loosely coupled storage system built of commodity parts can provide mission-critical availability and performance. As the Amazon team reports:

"Many Amazon internal services have used Dynamo for the past two years . . . . In particular, applications have received successful responses (without timing out) for 99.9995% of its requests and no data loss event has occurred to date."

I'd wager that matches the best that EMC, IBM and HP can do at that volume.

The Storage Bits take

Today's massive-scale Internet Data Centers (IDCs) point the way to a revolution in enterprise computing. Instead of feature-rich products that attempt to handle every eventuality, the IDCs have the scale to architect their infrastructures for the jobs they need done.

Amazon's storage doesn't use RAID, relational databases or fancy interconnects. The intelligence goes into optimizing the architecture of the software rather than attempting to build bulletproof hardware, which, at their scale, is hopeless anyway.

Given the growth of data, I predict that in 10 years most enterprises will be running at least part of their business on similar architectures.

Comments welcome, of course. Werner Vogels, Amazon's CTO, kindly sent me the link to this paper. I'll have a longer description of it on StorageMojo later this week.
