The cloud storage market is accelerating fast - despite naysayers and alarmists - and Amazon's S3 is leading the charge. Storing over 40 billion files for 400,000 customers, Amazon is the one to beat. How do they do it for pennies per GB a month? Read on.
I attended FAST '09, the best storage conference around, where Alyssa Henry, S3's GM, gave a keynote. Amazon doesn't talk much about how their technology works, so even the little Alyssa added was welcome.
As a multi-billion dollar business running one of the world's largest websites, Amazon's engineers understand the problem. Their goals reflect both technical and market requirements:
- 99.99% availability
- Support an unlimited number of web scale apps
- Use scale as an advantage - linear scalability
- Vendors won't engineer for the 1% - only the 80% - so DIY
- Straightforward APIs
- Few concepts to learn
- AWS handles partitioning - not customers
One key: Amazon writes the software and builds massive scale on commodity boxes. Reliability at low cost achieved through engineering, experience and scale.
With many components come many failures
10,000+ node clusters mean failures happen frequently - even unlikely events happen.
- Disk drives fail
- Power and cooling failure
- Corrupted packets
- Techs pull live fiber
- Bits rot
- Natural disasters
Amazon deals with failure using a few basic techniques:
Redundancy increases durability and availability - and cost and complexity. Example: plan for the catastrophic loss of an entire data center by storing data in multiple data centers.
Expensive, but once paid for, costly small-scale features like RAID aren't needed.
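The idea can be sketched in a few lines of Python. This is a toy in-memory model - the class and function names are mine, not Amazon's - but it shows why full copies make RAID-style repair unnecessary: any surviving replica can serve the read.

```python
class Replica:
    """Toy in-memory store standing in for one data center (illustrative only)."""
    def __init__(self):
        self.objects = {}
        self.alive = True

    def put(self, key, data):
        if self.alive:
            self.objects[key] = data

    def get(self, key):
        if self.alive and key in self.objects:
            return self.objects[key]
        return None  # replica down or object missing

def put_redundant(replicas, key, data):
    # Durability is bought at write time: every copy is written up front.
    for r in replicas:
        r.put(key, data)

def get_redundant(replicas, key):
    # Any surviving copy satisfies the read.
    for r in replicas:
        data = r.get(key)
        if data is not None:
            return data
    raise KeyError(key)

replicas = [Replica() for _ in range(3)]
put_redundant(replicas, "photo.jpg", b"...bytes...")
replicas[0].alive = False  # simulate losing a whole data center
assert get_redundant(replicas, "photo.jpg") == b"...bytes..."
```

Three full copies cost more raw capacity than parity schemes, but the failure-handling code path is the same trivial loop whether one disk or one data center is lost.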
Just like disk drives, it's quicker for Amazon to retry than it is for customers. Leverage the redundancy - retry from a different copy.
This is cool. An idempotent action's result doesn't change even if the action is repeated - so there's no harm in doing it twice if the response is too slow.
For example, reading a customer record can be repeated without changing the result. And so retries don't pile up, there's surge protection.
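A minimal Python sketch of the retry-because-it's-idempotent idea. The toy store and its deterministic "first two reads time out" failure model are my assumptions for illustration, not anything Amazon has published:

```python
class FlakyStore:
    """Toy key-value store whose first few reads time out (simulated failures)."""
    def __init__(self, data, failures=2):
        self.data = data
        self.failures = failures

    def get(self, key):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("slow replica")
        return self.data[key]

def get_with_retry(store, key, attempts=5):
    # A read is idempotent: repeating it cannot change the stored record,
    # so it is always safe to reissue the request after a timeout.
    for attempt in range(attempts):
        try:
            return store.get(key)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the retry budget is spent

store = FlakyStore({"customer:42": {"name": "Alice"}})
assert get_with_retry(store, "customer:42") == {"name": "Alice"}
```

Note the bounded retry budget: that's the hook where the surge protection discussed next attaches.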
Rate limiting is a bad idea - build the infrastructure to handle uncertainty. Don't burden already stressed components with retries. Don't let a few customers bring down the system.
Surge management techniques include exponential back off (like CSMA/CD) and caching TTL (time to live) extensions.
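Exponential back-off is easy to sketch. The variant below uses "full jitter" - randomizing each delay under an exponentially growing, capped ceiling - which is one common way to keep stressed components from being hit by synchronized retry waves. The base and cap values are illustrative, not Amazon's actual parameters:

```python
import random

def backoff_delays(attempts, base=0.1, cap=10.0):
    """Exponential back-off with full jitter (one common variant).

    base and cap are in seconds; both are made-up example values.
    """
    delays = []
    for attempt in range(attempts):
        # Ceiling doubles each attempt, then flattens at the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: pick a random delay under the ceiling so that many
        # clients retrying at once don't all fire again simultaneously.
        delays.append(random.uniform(0, ceiling))
    return delays

delays = backoff_delays(6)
# Each delay is bounded by an exponentially growing, capped ceiling.
```

The CSMA/CD analogy in the text is apt: Ethernet solved exactly this "everyone retransmits at once" collapse with randomized exponential back-off.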
Amazon sacrifices some consistency for availability. And sacrifices some availability for durability.
Everything fails - so every failure handling code path must work. Avoid unused/rarely used code paths since they are likely to be buggy.
Amazon routinely fails disks, servers, and data centers. For data center maintenance they just turn the data center off to exercise the recovery system.
Monocultures are risky. For software there is version diversity: they engineer systems so different versions are compatible.
Likewise with hardware. One lot of drives from a vendor all failed. A shipment arrived with faulty power cords. Correlated failures happen.
Diversity of workloads: interleave customer workloads for load balancing.
Identify corruption inbound, outbound, and at rest. Store checksums and compare them at read time - plus scan all the data at rest as a background task.
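The checksum-at-write, verify-at-read, scrub-in-the-background pattern looks roughly like this in Python. SHA-256 and the function names are my choices for the sketch; the source doesn't say what Amazon actually uses:

```python
import hashlib

def store_object(store, key, data):
    # Record a checksum alongside the data at write time.
    store[key] = (data, hashlib.sha256(data).hexdigest())

def read_object(store, key):
    # Verify on every read; a mismatch means silent corruption.
    data, checksum = store[key]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError(f"corruption detected in {key}")
    return data

def scrub(store):
    """Background task: walk all data at rest, report corrupted keys."""
    return [key for key, (data, checksum) in store.items()
            if hashlib.sha256(data).hexdigest() != checksum]

store = {}
store_object(store, "a", b"hello")
assert read_object(store, "a") == b"hello"
store["a"] = (b"hellO", store["a"][1])  # simulate a bit flip at rest
assert scrub(store) == ["a"]
```

The scrub matters because rarely-read objects would otherwise rot undetected; finding a bad copy early means it can be re-replicated from a good one while good copies still exist.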
Monitor internally and externally, in real time and historically, per host and in aggregate. When things go wrong, you need the history to see why.
Get people out of the loop.
Human processes fail. Humans are slow. If a human screws up an Amazon system, don't blame the human. It's the system.
Storage is a lasting relationship that requires trust.
The Storage Bits take
Amazon is the world leader in scale out system engineering. Google may have led the way, but the necessity to count money and ship products set a higher bar for Amazon.
Amazon Web Services will dwarf their products business within a decade. I'd like to see them open the kimono more in the future.
Comments welcome, of course. There's a longer version of this on StorageMojo. And there's the Amazon CompSci paper Dynamo: Amazon’s Highly Available Key-value Store. Not S3 specific, but close.