Google's Megastore: global storage for the Internet
Summary: Megastore handles 3 billion writes and 20 billion reads daily on 8 PB of data across the globe supporting millions of users of Google Apps. How does it work?
Megastore handles 3 billion writes and 20 billion reads daily on 8 PB of data across the globe supporting millions of users of Google Apps. How does it work?
In a paper titled Megastore: Providing Scalable, Highly Available Storage for Interactive Services (pdf) Google engineers give a technical overview.
The mission Support Internet apps such as Google's AppEngine.
- Scale to millions of users
- Responsive despite Internet latencies to impatient users
- Easy for developers
- Fault resilience from drive failures to data center loss
- Low-latency synchronous replication to distant sites
Key assumptions: 1) data for most apps can be partitioned into entities, for example by user, and 2) a subset of DB features can help developer productivity.
Entities Entities are synchronously replicated over a wide area. ACID transactions within entities are replicated, but not transactions across entities.
For cross-entity transactions, the synchronous replication requirement is relaxed. Entity group boundaries must reflect application usage and user expectations.
For example, an e-mail account is a natural entity. But defining other entities is harder.
Geographic data, for one, lacks natural granularity. So the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits while changes within entities don't.
The trick: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.
Availability and scale To achieve availability and global scale the designers implemented two key architectural features:
- For availability, an asynchronous log replicator optimized for long-distance
- For scale, data partitioned into small databases each with its own replicated log
Replication The venerable Paxos algorithm manages synchronous replication - with some Google developed optimizations:
- Fast reads. Current reads are usually from local replicas since most writes succeed on all replicas.
- Fast writes. Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.
- witness replicas . Witnesses vote in Paxos rounds and store the write-ahead log but do not store entity data or indexes to keep storage costs low. They are also tiebreakers when isn't a quorum.
- Read-only replicas are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.
Performance Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one will succeed; other writers will have to retry.
For multiuser apps with more writes developers can shard entities more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions also help handle write loads.
The real world
Megastore has been deployed for several years on over 100 production apps. Here's what Megastore has achieved in production:
The Storage Bits take? As Brewer's CAP theorem showed, a distributed system can't provide consistency, availability and partition tolerance to all nodes at the same time. But Google shows that smart choices can get us darn close.
This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we've seen it work we can begin applying its principles to smaller scale problems.
Comments welcome, of course. Want more detail without reading the entire paper? See a longer version of this post on StorageMojo.
Kick off your day with ZDNet's daily email newsletter. It's the freshest tech news and opinion, served hot. Get it.

Talkback
Still many unsolved problems in distributed systems
Google as yet have made no interesting or significant theoretical contributions to data management.
A very boring and conservative company.
RE: Google's Megastore: global storage for the Internet
Who do you think is exciting? IMHO Google has moved the industry forward with the Google File System, BigTable and now Megastore. Agree that the bad old days of IBM's hierachical hairballs were bad - SNA, anyone? - but until we beat the speed of light it seems unavoidable in geographically distributed systems.
Your thoughts?
Mmm...
Paragraph of the Year
Bravo!
I am sick and tired of expaining to Ed Bott and Phil Wainewright that the choice between M$ VAIL restricted to one hard disk ... and the cost of the likes of AMAZON cloud storage ... is that offered by those vendors in favour of their continued revenue streams ... and not that required by consumers, businesses and communities to minimise their costs and maximise their benefits.
O for a new architecture!
Is it not apparent that Google's design (where they pay their own bill) is different from the vendor designs (where the customer pays the bill)? Apparently not to many ZDNET bloggers.
Robin, can you please beat the crap out of your colleagues until they UNDERTSTAND your little paragraph!!
Google's approach looks fundamentally flawed
"Normalized relational schemas rely on joins at query time
to service user operations". Google for some reason seem to have forgotten (or perhaps have never heard of) the other seven relational operators.
"declarative denormalization help
eliminate the need for most joins."
The problem with this is that it is impossible to materialize the result of all possible inferenced results - the data volume for even a moderate sized database would exceed the entire storage capacity of the world's computers.
Could it be that the Google database is big not because it contains a lot of information, but because it contains the same information many times?
More seriously, unless the denormalization takes place according to well defined rules and is purely a physical (and not a logical) denormalization there is a very high chance of logical inconsistency in the data.
Google refer to "declarative denormalization", but how can this be achieved without the "expressive query language" that is specifically rejected in the paper?
Furthermore all this denormalization doesn't happen without the use of system resources. I suspect that most of the system's resources will be being used to perform expensive write operations to denormalize data, probably crippling it for any extraneous operations like user updates or queries.
I could go on much longer, but I really don't think Google have any significant contribution to make in the field of data management if they continue on their current path.
All these things have been done before and have been shown to be flawed in theory and cumbersome and inflexible in practice.
RE: Google's Megastore: global storage for the Internet
Not to point out the painfully obvious, but I just checked my Gmail. Apparently as idiotic and poorly designed as it is, it seems to work. Maybe you should send them your CV and offer to help them clean it up.
I don't think GMail uses the technology described in this paper
So, I am not sure what the relevance of your comment is.
RE: Google's Megastore: global storage for the Internet
RE: Google's Megastore: global storage for the Internet
RE: Google's Megastore: global storage for the Internet
RE: Google's Megastore: global storage for the Internet
RE: Google's Megastore: global storage for the Internet