On the surface, what Google is doing looks like a best practice you could transfer to your own data center, but that's not always the case. Google runs their data centers optimally for their business -- to deliver content that generates advertising revenue. What's important is that you focus on how to run your enterprise data center optimally for your business, not Google's. This comment always receives an "Amen" from data center professionals when I give talks at industry events, because it isn't fair to compare apples (Google's content delivery) and oranges (enterprise applications). As this article points out, your goals and Google's are not always aligned. While you focus on availability and reliability, Google prioritizes cost control over availability in most cases.
Without further delay, here are what I believe to be five myths about Google's data centers.
Myth #1: Google's business-critical applications and advertising systems run in PUE 1.2 content-delivery data centers.
This is probably the biggest myth out there. Google runs two types of IT systems: content delivery and critical business services. Let's take a look at the goals that define how Google runs these two types of systems.
First is content delivery, which is a homogeneous system of hardware and software that runs MapReduce over Google File System. This is where all the data for YouTube, Gmail, Google Apps, etc. lives. The content delivery system needs to be mostly available, but Google has provisioned it so that some outages can be masked with redundancy and others can be handled with an apology message. This homogeneous environment can be run at the limits, because availability is not the #1 requirement. The content delivery system is the "cost of goods sold" (COGS), or the cost of Google doing business. Minimizing cost maximizes profit. These are the very large facilities with very low PUE.
Critical business services include Google's internal systems that keep the company running day-to-day (customer management, HR, etc.) as well as their advertising system, which serves advertisements and collects money. Without these systems, Google as a company doesn't exist. These systems are heterogeneous, running different software packages across a wide array of hardware inside a conventional facility. Running these systems at their limits could jeopardize the ability to conduct business and collect revenue, so availability is paramount. These conventional facilities use best practices and likely have a more moderate PUE between 1.5 and 1.9. Google doesn't disclose information about these facilities, because they don't have a sustained power draw of 5MW or more (so you never hear about them).
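To put those PUE figures side by side, here is a minimal sketch. It relies on the standard definition of PUE (total facility power divided by IT equipment power); the 1,000 kW IT load is a hypothetical number chosen only for illustration, not a figure from Google or any real facility.

```python
def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT power (cooling, power distribution, lighting) implied by a PUE.

    PUE = total facility power / IT equipment power,
    so overhead = IT load * (PUE - 1).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * (pue - 1.0)


# Hypothetical IT load, for illustration only.
IT_LOAD_KW = 1000.0

# 1.2 is the content-delivery figure; 1.5 and 1.9 bound the conventional range.
for pue in (1.2, 1.5, 1.9):
    total_kw = IT_LOAD_KW * pue
    print(f"PUE {pue}: {overhead_kw(IT_LOAD_KW, pue):.0f} kW overhead, "
          f"{total_kw:.0f} kW total")
```

For the same IT load, a facility at PUE 1.9 spends roughly four and a half times as much power on overhead as one at PUE 1.2, which is why the two kinds of facilities should not be judged by the same number.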
Myth #2: Google uses PUE as their primary metric to manage their data centers.
While PUE is an important metric to Google, it is one metric in a family of metrics that lead to the lowest cost of content delivery. Engineers at Google tell me that, for each of their "business units" (such as YouTube, Gmail, etc.), they evaluate the profit per unit of content. Think of it as comparing the revenue generated versus the cost of delivering the content to generate that revenue. I applaud Google for this metric, but wish that they would publicly admit that it is really how they manage their IT infrastructure.
Think about it this way: if you are constantly evaluating the business metrics (not technology or infrastructure metrics), then the range of possible ways to increase business value is wider. Changing the way you do things is not limited to a particular technology or infrastructure. Instead, you can redesign the software (à la MapReduce and GFS), you can redesign the hardware (à la single DC voltage and backup batteries), and you can redesign the facility (à la containerization). All of this work is in the interest of lowering costs and increasing revenue per unit of content. Oh, and by the way, when you make all of these changes, your PUE goes down too. Why? Because the last thing you want to do is spend money on overhead in the cost of goods sold. You're paid for the IT output, so the business metrics naturally maximize for increased profit.
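The profit-per-unit-of-content idea can be sketched in a few lines. This is only an illustration under stated assumptions: the formula (revenue minus delivery cost, divided by units delivered) is one plausible reading of the metric described above, and the `BusinessUnit` class, the unit names, and every number are invented for the example. None of it comes from Google.

```python
from dataclasses import dataclass


@dataclass
class BusinessUnit:
    """Hypothetical record for one business unit over one reporting period."""
    name: str
    revenue: float          # revenue attributed to the unit
    delivery_cost: float    # cost of delivering its content (the COGS side)
    units_delivered: float  # whatever counts as a "unit of content" here

    def profit_per_unit(self) -> float:
        # Assumed formula: (revenue - delivery cost) / units of content.
        if self.units_delivered <= 0:
            raise ValueError("units_delivered must be positive")
        return (self.revenue - self.delivery_cost) / self.units_delivered


# Invented numbers, for illustration only.
units = [
    BusinessUnit("video", revenue=120_000.0, delivery_cost=90_000.0,
                 units_delivered=3_000_000.0),
    BusinessUnit("mail", revenue=50_000.0, delivery_cost=20_000.0,
                 units_delivered=1_000_000.0),
]

# Rank business units by profit per unit of content, highest first.
for unit in sorted(units, key=BusinessUnit.profit_per_unit, reverse=True):
    print(f"{unit.name}: {unit.profit_per_unit():.4f} profit per unit")
```

Note that any change that lowers `delivery_cost`, whether in software, hardware, or the facility, improves the metric, which is the point: the metric does not care which layer the savings come from.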
Myth #3: Google uses renewable energy to power their data centers.
While Google does use some renewable energy at their facilities, these sources do not currently supply any meaningful share of the power in Google's data centers. Even the most progressive solar designs (at Emerson, not Google) provide a paltry 16% of the data center's power. And solar has the added problem that there's no power when the sun goes down.
When Bloom Energy revealed the Bloom Box, they noted that Google had been testing the system for 18 months. The test was at their Mountain View headquarters, and they found the Bloom Box to be 98% reliable (available). While this is a great step forward for fuel cells in scalability and reliability, one nine of reliability simply isn't sufficient to power any data center. Many journalists, when they found out that Google was a customer, immediately jumped to the conclusion that Google must be using it for their data center. No, not true, as Data Center Knowledge quickly pointed out.
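To see why 98% falls short, it helps to convert availability into downtime. The sketch below is simple arithmetic under one assumption: unavailability is spread over a 365-day year of 8,760 hours. At 98% that works out to about 175 hours, or a little over seven days, without power each year. The other availability levels are included only for comparison and are not figures from the Bloom Box test.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of unavailability per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# 0.98 is the figure reported for the test; the rest are for comparison.
for availability in (0.98, 0.99, 0.999, 0.9999):
    hours = annual_downtime_hours(availability)
    print(f"{availability:.2%} available: {hours:.1f} hours of downtime per year")
```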
Myth #4: Google's battery-on-server technique provides a more robust power backup solution.
Google's server design for their content delivery data centers includes a full 12V system (no 3V or 5V components) with lead-acid battery backup (instead of a central UPS). The battery is said to power the system "for a few minutes" during an outage, after which the backup generators should be running and supplying power. Google said at their Data Center Efficiency Summit, "if the generators don't kick in within a few minutes, you have bigger problems and better have a fail over strategy."
Generally this is true; if your generators don't kick in within a few minutes, you are going to have bigger problems. That's why it is important to test them regularly, and familiarize yourself with their operation. Continually evaluate whether the generators are appropriately sized for today's IT load.
This gets back to availability versus efficiency; Google again chooses cost efficiency over availability, and the system-wide design of their homogeneous software architecture enables this battery design decision. Conventional UPS systems can power a data center for an hour or more, and battery systems can be extended centrally to provide more runtime. The battery-on-server system cannot be extended without replacing batteries on every piece of equipment or waiting for a refresh cycle. It does, however, provide a distributed battery backup that eliminates the single point of failure (central UPS) in conventional designs.
The batteries used in the Google design are 3.4Ah 12V sealed lead-acid units. Based on a 3.4A discharge rate (roughly 350W), the battery voltage and charge drop below a usable level after 6 to 12 minutes. The graph at the right shows the discharge rate for various current draws. Note that Google has to go with the 3.4Ah battery rather than one with higher capacity, because higher-capacity batteries are too large to fit in a 2U physical configuration. The 3.4Ah battery is 2.36" high, plus wiring and terminals, and thus fits nicely in the 3.5" 2U height.
Myth #5: You should be held to the same standard as Google when running your data center.
Let's face it, Google's content delivery data centers run a single application across a homogeneous physical infrastructure. While this is far more achievable in new builds, existing data centers have such a wide array of equipment that these types of industrial-sized efficiency techniques are infeasible. Furthermore, your data center runs ERP, CRM, HR, transactional, and web applications -- to name a few. These applications have varied architectures as well as varied service, availability, and performance requirements.
To achieve the same level of efficiency as Google in your data center is a noble goal, but ultimately you need to get the best performance for your data center. This means metrics that map to the business needs that your data center fulfills. Just as Google uses profit per unit of content served, you must identify the right guiding metrics to run a lean, mean operation.
While Google's content delivery data centers perform very well at their task, they are not apples-to-apples comparable to a business-critical enterprise operation. Manage your team and communicate to executives the metrics that make sense, because the last thing you want to do is get into a debate around "my PUE is better than yours" and "why don't you have the same PUE as Google" when the service you're providing is so vastly different from the one provided by Google.
There are more myths than just these five, of course. Let's start a dialog about how to best run an enterprise data center, not an industrial content delivery system, and develop best practices to optimize for the enterprise.
Joe Polastre is co-founder and chief technology officer at Sentilla, a company that provides enterprise software for managing power in the data center. Joe is responsible for defining and implementing the company’s global technology and product strategy.