Google's Megastore: global storage for the Internet

Summary: Megastore handles 3 billion writes and 20 billion reads daily on 8 PB of data across the globe supporting millions of users of Google Apps. How does it work?

In a paper titled "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," Google engineers give a technical overview.

The mission: Support Internet apps such as Google's App Engine.

  • Scale to millions of users
  • Responsive despite Internet latencies to impatient users
  • Easy for developers
  • Fault resilience from drive failures to data center loss
  • Low-latency synchronous replication to distant sites

Key assumptions: 1) data for most apps can be partitioned into entities, for example by user, and 2) a subset of DB features can help developer productivity.

Entities: Entities (the paper calls them entity groups) are synchronously replicated over a wide area. ACID transactions within an entity are replicated, but transactions across entities are not.

For cross-entity transactions, the synchronous replication requirement is relaxed. Entity group boundaries must reflect application usage and user expectations.

For example, an e-mail account is a natural entity. But defining other entities is harder.

Geographic data, for one, lacks natural granularity. So the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits while changes within entities don't.

The trick: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.
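
The geographic example can be sketched in a few lines. This is a minimal illustration, not Megastore's API: the grid scheme, the `CELL_DEGREES` value and both function names are assumptions made up for the example. Each coordinate maps to one non-overlapping cell (the entity), and a change needs the expensive two-phase commit only when it touches more than one cell.

```python
import math

# Hypothetical cell size in degrees. Larger cells make cross-cell changes rarer
# but concentrate more writes in each entity.
CELL_DEGREES = 10


def entity_for(lat: float, lon: float) -> tuple[int, int]:
    """Map a coordinate to the non-overlapping grid cell (entity) that contains it."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


def commit_strategy(points: list[tuple[float, float]]) -> str:
    """Pick the commit path for a change that touches the given coordinates."""
    entities = {entity_for(lat, lon) for lat, lon in points}
    if len(entities) == 1:
        return "single-entity transaction"
    return "two-phase commit"
```

With these assumed values, two points a few miles apart usually share a cell and commit cheaply, while a change spanning two continents always pays for two-phase commit. Tuning the cell size is the "large enough, small enough" trade-off described above.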

Availability and scale: To achieve availability and global scale, the designers implemented two key architectural features:

  • For availability, a synchronous, fault-tolerant log replicator optimized for long-distance links
  • For scale, data partitioned into small databases each with its own replicated log

Replication: The venerable Paxos algorithm manages synchronous replication - with some Google-developed optimizations:

  • Fast reads. Current reads are usually from local replicas since most writes succeed on all replicas.
  • Fast writes. Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.
  • Witness replicas. Witnesses vote in Paxos rounds and store the write-ahead log, but to keep storage costs low they do not store entity data or indexes. They are also tiebreakers when there isn't a quorum.
  • Read-only replicas. These are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.
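
To make the replica roles concrete, here is a small sketch of how votes might be counted. It is an illustration under stated assumptions, not Megastore's code: the `Replica` class, the role labels and `write_commits` are hypothetical names, and the rule shown is the ordinary Paxos majority of voting members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    role: str  # "full", "witness" or "read_only" (labels assumed for this sketch)

    @property
    def votes(self) -> bool:
        # Full replicas and witnesses vote; read-only replicas do not.
        return self.role in ("full", "witness")

    @property
    def stores_data(self) -> bool:
        # Witnesses keep only the log; full and read-only replicas hold the data.
        return self.role in ("full", "read_only")


def write_commits(replicas, acked_names) -> bool:
    """A write commits once a strict majority of voting replicas acknowledge it."""
    voters = [r for r in replicas if r.votes]
    acks = sum(1 for r in voters if r.name in acked_names)
    return acks > len(voters) // 2
```

With two full replicas, one witness and one read-only replica there are three voters. An acknowledgment from one full replica plus the witness is enough to commit, which is the tiebreaker role; an acknowledgment from the read-only replica counts for nothing.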

Performance: Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one will succeed; other writers will have to retry.
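
The "one wins, the rest retry" behavior can be sketched with a toy log. Everything here is an assumption for illustration: `EntityLog`, `try_append` and `write_with_retry` are hypothetical names, and the log is a single in-process list rather than a replicated structure.

```python
class EntityLog:
    """Toy stand-in for one entity group's log (single process, no network)."""

    def __init__(self):
        self.entries = []

    def try_append(self, expected_position: int, entry: str) -> bool:
        # Only the writer that targets the next free position wins.
        if expected_position != len(self.entries):
            return False
        self.entries.append(entry)
        return True


def write_with_retry(log, make_entry, max_attempts=5) -> int:
    """Re-read the log position and retry until the append succeeds."""
    for _ in range(max_attempts):
        position = len(log.entries)
        if log.try_append(position, make_entry(position)):
            return position
    raise RuntimeError("too much contention on this entity group")
```

If two writers both read position 0, the first `try_append(0, ...)` succeeds and the second returns `False`. The loser then re-reads the log, rebuilds its entry against the new state and tries again at position 1.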

For multiuser apps with more writes, developers can shard entities more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions also help handle write loads.
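
Batching is the simplest of these to picture. The helper below is a generic sketch, not anything from the paper: `batch` is a hypothetical name, and the idea is only that each yielded chunk would be committed as one transaction instead of one transaction per user operation.

```python
from itertools import islice


def batch(operations, size):
    """Yield lists of up to `size` operations, each meant to commit as one transaction."""
    iterator = iter(operations)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Seven queued operations with a batch size of three turn into three transactions against the entity rather than seven, which cuts the number of writers competing for the same log.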

The real world: Megastore has been deployed for several years on over 100 production apps.

The Storage Bits take: As Brewer's CAP theorem showed, a distributed system can't provide consistency, availability and partition tolerance to all nodes at the same time. But Google shows that smart choices can get us darn close.

This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we've seen it work we can begin applying its principles to smaller scale problems.

Comments welcome, of course. Want more detail without reading the entire paper? See a longer version of this post on StorageMojo.

About

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small.

Talkback

12 comments
  • Still many unsolved problems in distributed systems

    And the return to archaic hierarchical approaches as last seen when IBM dinosaurs like IMS ruled the earth is inflexible and riddled with potential data consistency errors.

    Google as yet have made no interesting or significant theoretical contributions to data management.

    A very boring and conservative company.
    jorwell
    • RE: Google's Megastore: global storage for the Internet

      @jorwell
      Who do you think is exciting? IMHO Google has moved the industry forward with the Google File System, BigTable and now Megastore. Agree that the bad old days of IBM's hierarchical hairballs were bad - SNA, anyone? - but until we beat the speed of light it seems unavoidable in geographically distributed systems.

      Your thoughts?
      Robin Harris
      • Mmm...

        @Robin Harris

        From the paper you link to: "Joins, when required, are implemented in application code."

        "declarative denormalization"

        The first point conjures up the picture of rooms full of programmers churning out nested loops taking days to do something that can be achieved in a few minutes with joins.

        Regarding "declarative denormalization", all well and good, provided it is purely a physical copy of the data, but presumably a modern system does this for you automatically. Indexes are an example of "declarative denormalization" of course, so nothing new there.

        And anyway Google seem to be in entirely the wrong mindset here. Saying joins are slow doesn't make any sense at all. Join is a logical operation - you may as well say "and" or plus are slow. All Google are saying here is that they have failed to do the work on methods that make join fast (and therefore avoiding a return to the miserable productivity and complexity offered by procedural methods). There is huge potential for improving the physical implementation of RDBMSs without throwing away the simplicity and elegance of the relational model.

        Those of us who don't have Google's data volumes (and that is almost everybody) should be more concerned with data quality, integrity and the flexibility offered by declarative programming; not with performance.

        The interesting work is to do with defining a better relational language than SQL, which fails in many areas to implement the model properly, leading to many potential inconsistencies, inaccuracies, misunderstandings and unnecessary complexity.

        Sure better performance is good but I want the logical level to be completely separated from the physical level (which can be implemented however you want). Google don't seem to grasp the importance of this distinction.
        jorwell
  • Paragraph of the Year

    "This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we've seen it work we can begin applying its principles to smaller scale problems."

    Bravo!

    I am sick and tired of explaining to Ed Bott and Phil Wainewright that the choice between M$ VAIL restricted to one hard disk ... and the cost of the likes of AMAZON cloud storage ... is that offered by those vendors in favour of their continued revenue streams ... and not that required by consumers, businesses and communities to minimise their costs and maximise their benefits.

    O for a new architecture!

    Is it not apparent that Google's design (where they pay their own bill) is different from the vendor designs (where the customer pays the bill)? Apparently not to many ZDNET bloggers.

    Robin, can you please beat the crap out of your colleagues until they UNDERSTAND your little paragraph!!
    johnfenjackson@...
  • Google's approach looks fundamentally flawed

    and they fail totally to reach their claimed objectives in the paper.

    "Normalized relational schemas rely on joins at query time to service user operations". Google for some reason seem to have forgotten (or perhaps have never heard of) the other seven relational operators.

    "declarative denormalization help eliminate the need for most joins."

    The problem with this is that it is impossible to materialize the result of all possible inferenced results - the data volume for even a moderate sized database would exceed the entire storage capacity of the world's computers.

    Could it be that the Google database is big not because it contains a lot of information, but because it contains the same information many times?

    More seriously, unless the denormalization takes place according to well defined rules and is purely a physical (and not a logical) denormalization there is a very high chance of logical inconsistency in the data.

    Google refer to "declarative denormalization", but how can this be achieved without the "expressive query language" that is specifically rejected in the paper?

    Furthermore all this denormalization doesn't happen without the use of system resources. I suspect that most of the system's resources will be being used to perform expensive write operations to denormalize data, probably crippling it for any extraneous operations like user updates or queries.

    I could go on much longer, but I really don't think Google have any significant contribution to make in the field of data management if they continue on their current path.

    All these things have been done before and have been shown to be flawed in theory and cumbersome and inflexible in practice.
    jorwell
    • RE: Google's Megastore: global storage for the Internet

      @jorwell
      Not to point out the painfully obvious, but I just checked my Gmail. Apparently as idiotic and poorly designed as it is, it seems to work. Maybe you should send them your CV and offer to help them clean it up.
      symbolset
      • I don't think GMail uses the technology described in this paper

        @symbolset

        So, I am not sure what the relevance of your comment is.
        jorwell