
Google's Megastore: global storage for the Internet

By Robin Harris | April 27, 2011, 7:30am PDT

Megastore handles 3 billion writes and 20 billion reads daily on 8 PB of data across the globe supporting millions of users of Google Apps. How does it work?

In a paper titled "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (pdf), Google engineers give a technical overview.

The mission
Support Internet apps such as Google’s App Engine.

  • Scale to millions of users
  • Responsive despite Internet latencies to impatient users
  • Easy for developers
  • Fault resilience from drive failures to data center loss
  • Low-latency synchronous replication to distant sites

Key assumptions: 1) data for most apps can be partitioned into entities, for example by user, and 2) a subset of DB features can help developer productivity.

Entities
Entities are synchronously replicated over a wide area. ACID transactions within entities are replicated, but not transactions across entities.

For cross-entity transactions, the synchronous replication requirement is relaxed. Entity group boundaries must reflect application usage and user expectations.

For example, an e-mail account is a natural entity. But defining other entities is harder.

Geographic data, for one, lacks natural granularity. So the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits while changes within entities don’t.

The trick: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.
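To make the trade-off concrete, here is a minimal sketch of how an application might partition geographic data into non-overlapping entities and decide which commit path a change needs. The fixed latitude/longitude grid, the `CELL_DEGREES` value, and both function names are assumptions for illustration only; the paper does not prescribe this scheme.

```python
import math

# Hypothetical cell size in degrees. Larger cells make cross-entity changes
# rarer, but concentrate more transactions in each entity.
CELL_DEGREES = 10


def entity_group_for(lat: float, lon: float,
                     cell_degrees: float = CELL_DEGREES) -> tuple[int, int]:
    """Map a coordinate to the non-overlapping grid cell that contains it."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))


def commit_strategy(points: list[tuple[float, float]]) -> str:
    """Pick the commit path for a change that touches the given points.

    A change confined to one cell stays inside a single entity; a change
    that spans cells needs the more expensive two-phase commit.
    """
    groups = {entity_group_for(lat, lon) for lat, lon in points}
    return "single-group" if len(groups) == 1 else "two-phase"


# Two points in the same cell versus two points in different cells.
print(commit_strategy([(37.4, -122.1), (37.8, -122.4)]))
print(commit_strategy([(37.4, -122.1), (40.7, -74.0)]))
```

Shrinking `CELL_DEGREES` lowers the write rate each entity must absorb but pushes more changes onto the two-phase path, which is the balance described above.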

Availability and scale
To achieve availability and global scale the designers implemented two key architectural features:

  • For availability, a synchronous, fault-tolerant log replicator optimized for long-distance links
  • For scale, data partitioned into small databases each with its own replicated log

Replication
The venerable Paxos algorithm manages synchronous replication - with some Google developed optimizations:

  • Fast reads. Current reads are usually from local replicas since most writes succeed on all replicas.
  • Fast writes. Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.
  • Witness replicas. Witnesses vote in Paxos rounds and store the write-ahead log but do not store entity data or indexes, which keeps storage costs low. They are also tiebreakers when there isn’t a quorum.
  • Read-only replicas are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.
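The replica roles can be summarized in a small sketch. This is not Megastore's implementation: the `Role`, `Replica`, and `write_accepted` names are hypothetical, and the quorum rule is the plain majority-of-voters rule from basic Paxos. It only illustrates who votes, who stores data, and how a witness can break a tie between two full replicas.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    FULL = "full"            # votes and stores entity data
    WITNESS = "witness"      # votes and keeps the log, but stores no entity data
    READ_ONLY = "read_only"  # stores data snapshots, but does not vote


@dataclass(frozen=True)
class Replica:
    name: str
    role: Role

    @property
    def votes(self) -> bool:
        return self.role in (Role.FULL, Role.WITNESS)

    @property
    def stores_data(self) -> bool:
        return self.role in (Role.FULL, Role.READ_ONLY)


def write_accepted(replicas: list[Replica], acks: set[str]) -> bool:
    """Return True if a strict majority of voting replicas acknowledged."""
    voters = [r for r in replicas if r.votes]
    voter_acks = sum(1 for r in voters if r.name in acks)
    return voter_acks > len(voters) // 2


replicas = [
    Replica("us", Role.FULL),
    Replica("eu", Role.FULL),
    Replica("witness", Role.WITNESS),
    Replica("asia", Role.READ_ONLY),
]

# The witness supplies the deciding vote when only one full replica answers.
print(write_accepted(replicas, {"us", "witness"}))
# An acknowledgement from a read-only replica does not count toward quorum.
print(write_accepted(replicas, {"us", "asia"}))
```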

Performance
Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one will succeed; other writers will have to retry.
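The "one writer wins, the rest retry" behavior can be pictured as a race for the next position in an entity group's log. The sketch below is a single-process stand-in with hypothetical names (`EntityGroupLog`, `try_append`); in Megastore the winner of each log position is decided by Paxos across replicas, not by a local length check.

```python
class EntityGroupLog:
    """Toy stand-in for one entity group's replicated write-ahead log."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def try_append(self, expected_position: int, mutation: str) -> bool:
        """Append only if no other writer has claimed this log position."""
        if expected_position != len(self.entries):
            return False  # lost the race; the caller must re-read and retry
        self.entries.append(mutation)
        return True


log = EntityGroupLog()

# Two application servers read the same log position at the same time.
position_a = len(log.entries)
position_b = len(log.entries)

first = log.try_append(position_a, "write from A")   # wins the position
second = log.try_append(position_b, "write from B")  # loses the position

# The losing writer re-reads the current position and retries.
retry = log.try_append(len(log.entries), "write from B")

print(first, second, retry)
print(log.entries)
```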

For multiuser apps with heavier write loads, developers can shard entities more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions also help handle write loads.

The real world
Megastore has been deployed for several years on over 100 production apps. Here’s what Megastore has achieved in production:

The Storage Bits take
As Brewer’s CAP theorem showed, a distributed system can’t provide consistency, availability and partition tolerance to all nodes at the same time. But Google shows that smart choices can get us darn close.

This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we’ve seen it work we can begin applying its principles to smaller scale problems.

Comments welcome, of course. Want more detail without reading the entire paper? See a longer version of this post on StorageMojo.


Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small.

Disclosure

Robin Harris

Robin Harris is president of TechnoQWAN, a consulting and analyst firm in northern Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 25-year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about. Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully-crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, t-shirts, lunches and trips as well as analyst or consulting assignments. He also invests in some technology companies. He may accept payment for services in stock as well. Robin discloses financial investments in or client relationships with companies named in Storage Bits. To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree. In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions:

  • What assumptions are implicit in the world view and judgments of the author?
  • What, if any, is the factual basis for the opinions the author expresses?
  • Is it reasonable, logical and clear?

Your critical faculties: use ‘em or lose ‘em!

Biography

Robin Harris

Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as many smaller ones. Earlier he spent 10 years marketing servers and networks. After leaving corporate life he founded TechnoQWAN, a consulting and analyst firm. He also developed StorageMojo into one of the top storage industry blogs.

Robin writes, consults, coaches and lives among the mountains of northern Arizona.

15
Comments


RE: Google's Megastore: global storage for the Internet
And the return to archaic hierarchical approaches as last seen when IBM dinosaurs like IMS ruled the earth is inflexible and riddled with potential data consistency errors.

Google as yet have made no interesting or significant theoretical contributions to data management.

A very boring and conservative company.
@jorwell
Who do you think is exciting? IMHO Google has moved the industry forward with the Google File System, BigTable and now Megastore. Agree that the bad old days of IBM's hierarchical hairballs were bad - SNA, anyone? - but until we beat the speed of light it seems unavoidable in geographically distributed systems.

Your thoughts?
Mmm...
jorwell Updated - 28th Apr 2011
@Robin Harris

From the paper you link to "Joins, when required, are implemented in application code."

"declarative denormalization"

The first point conjures up the picture of rooms full of programmers churning out nested loops taking days to do something that can be achieved in a few minutes with joins.

Regarding "declarative denormalization", all well and good, providing it is purely a physical copy of the data, but presumably a modern system does this for you automatically. Indexes are an example of "declarative denormalization" of course, so nothing new there.

And anyway Google seem to be in entirely the wrong mindset here.
Saying joins are slow doesn't make any sense at all. Join is a logical operation - you may as well say "and" or plus are slow. All Google are saying here is that they have failed to do the work on methods that make join fast (and therefore avoiding a return to the miserable productivity and complexity offered by procedural methods). There is huge potential for improving the physical implementation of RDBMSs without throwing away the simplicity and elegance of the relational model.

Those of us who don't have Google's data volumes (and that is almost everybody) should be more concerned with data quality, integrity and the flexibility offered by declarative programming; not with performance.

The interesting work is to do with defining a better relational language than SQL, which fails in many areas to implement the model properly, leading to many potential inconsistencies, inaccuracies, misunderstandings and unnecessary complexity.

Sure better performance is good but I want the logical level to be completely separated from the physical level (which can be implemented however you want). Google don't seem to grasp the importance of this distinction.
Paragraph of the Year
johnfenjackson@... 28th Apr 2011
"This demonstrates again the value of Internet scale thinking. Enterprise vendors would never have developed Megastore, but now that we've seen it work we can begin applying its principles to smaller scale problems."

Bravo!

I am sick and tired of explaining to Ed Bott and Phil Wainewright that the choice between M$ VAIL restricted to one hard disk ... and the cost of the likes of AMAZON cloud storage ... is that offered by those vendors in favour of their continued revenue streams ... and not that required by consumers, businesses and communities to minimise their costs and maximise their benefits.

O for a new architecture!

Is it not apparent that Google's design (where they pay their own bill) is different from the vendor designs (where the customer pays the bill)? Apparently not to many ZDNET bloggers.

Robin, can you please beat the crap out of your colleagues until they UNDERSTAND your little paragraph!!
and they fail totally to reach their claimed objectives in the paper.

"Normalized relational schemas rely on joins at query time to service user operations". Google for some reason seem to have forgotten (or perhaps have never heard of) the other seven relational operators.

"declarative denormalization help eliminate the need for most joins."

The problem with this is that it is impossible to materialize the result of all possible inferenced results - the data volume for even a moderate sized database would exceed the entire storage capacity of the world's computers.

Could it be that the Google database is big not because it contains a lot of information, but because it contains the same information many times?

More seriously, unless the denormalization takes place according to well defined rules and is purely a physical (and not a logical) denormalization there is a very high chance of logical inconsistency in the data.

Google refer to "declarative denormalization", but how can this be achieved without the "expressive query language" that is specifically rejected in the paper?

Furthermore all this denormalization doesn't happen without the use of system resources. I suspect that most of the system's resources will be being used to perform expensive write operations to denormalize data, probably crippling it for any extraneous operations like user updates or queries.

I could go on much longer, but I really don't think Google have any significant contribution to make in the field of data management if they continue on their current path.

All these things have been done before and have been shown to be flawed in theory and cumbersome and inflexible in practice.
@jorwell
Not to point out the painfully obvious, but I just checked my Gmail. Apparently as idiotic and poorly designed as it is, it seems to work. Maybe you should send them your CV and offer to help them clean it up.
@symbolset

So, I am not sure what the relevance of your comment is.
