Megastore handles 3 billion writes and 20 billion reads daily on 8 PB of data across the globe supporting millions of users of Google Apps. How does it work?
In a paper titled Megastore: Providing Scalable, Highly Available Storage for Interactive Services (pdf) Google engineers give a technical overview.
The mission Support Internet apps such as Google's AppEngine.
- Scale to millions of users
- Responsive despite Internet latencies to impatient users
- Easy for developers
- Fault resilience from drive failures to data center loss
- Low-latency synchronous replication to distant sites
Key assumptions: 1) data for most apps can be partitioned into entities, for example by user, and 2) a subset of DB features can help developer productivity.
Entities Entities are synchronously replicated over a wide area. ACID transactions within entities are replicated, but not transactions across entities.
For cross-entity transactions, the synchronous replication requirement is relaxed. Entity group boundaries must reflect application usage and user expectations.
For example, an e-mail account is a natural entity. But defining other entities is harder.
Geographic data, for one, lacks natural granularity. So the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits while changes within entities don't.
The trick: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.
Availability and scale To achieve availability and global scale the designers implemented two key architectural features:
- For availability, an asynchronous log replicator optimized for long-distance
- For scale, data partitioned into small databases each with its own replicated log
Replication The venerable Paxos algorithm manages synchronous replication - with some Google developed optimizations:
- Fast reads. Current reads are usually from local replicas since most writes succeed on all replicas.
- Fast writes. Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.
- witness replicas . Witnesses vote in Paxos rounds and store the write-ahead log but do not store entity data or indexes to keep storage costs low. They are also tiebreakers when isn't a quorum.
- Read-only replicas are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.
Performance Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one will succeed; other writers will have to retry.
For multiuser apps with more writes developers can shard entities more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions also help handle write loads.
The Storage Bits take? As Brewer's CAP theorem showed, a distributed system can't provide consistency, availability and partition tolerance to all nodes at the same time. But Google shows that smart choices can get us darn close.
This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we've seen it work we can begin applying its principles to smaller scale problems.
Comments welcome, of course. Want more detail without reading the entire paper? See a longer version of this post on StorageMojo.