Google: This is what caused CPU throttling at our cloud data center

Google says crushed rack wheels busted a cooling system, causing CPU performance to be throttled.
Written by Liam Tung, Contributing Writer

Google says a set of crushed wheels used for moving its server racks triggered a chain reaction that may have disrupted Search, Gmail, and other services for some users. 

A rack of servers at one of its data centers started overheating to the point where CPUs were automatically throttled, ultimately because a set of rack wheels couldn't bear the weight of Google's cloud kit.

Steve McGhee, a solutions architect at Google Cloud, says Google users "most likely" wouldn't have noticed errors caused by the rack's crushed wheels. But the chain of events resulted in enough CPU throttling to cause "user harm". 

Fortunately, the incident wasn't as serious as one from June last year, caused by a failure in Google's automation software, which took down Gmail, YouTube, and customers' applications. That incident prompted a big apology to customers and a commitment to do better in future. 

This time the company has decided to tell the story to illustrate the lengths it goes to in finding the root cause of disruptions, even when they don't noticeably impact users.

The latest event came to light after a site reliability engineer noticed a spike in errors from machines on Google's edge network, which cache content that users frequently access, and the company kicked off an investigation. The machines were immediately taken offline to stop them impacting customers, allowing other machines to take up the slack.

Google engineers noticed some Border Gateway Protocol (BGP) network errors, but their characteristics suggested issues with the machines rather than the router. Further investigation turned up kernel messages in machines on the edge network that revealed CPU clock throttling.
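For readers curious what spotting throttling in kernel messages can look like, here is a minimal sketch. It assumes log lines in the style of the Linux x86 thermal-throttle message ("cpu clock throttled"); Google has not said what its machines actually logged, and the sample lines and the `find_throttle_events` helper are hypothetical.

```python
import re

# Pattern modeled on the Linux x86 thermal-throttle kernel message, e.g.
# "CPU3: Core temperature above threshold, cpu clock throttled (total events = 7)".
# Assumption: the affected machines logged something of this shape.
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled"
)


def find_throttle_events(log_lines):
    """Return a (cpu_number, scope) tuple for each throttling message found."""
    events = []
    for line in log_lines:
        match = THROTTLE_RE.search(line)
        if match:
            events.append((int(match.group("cpu")), match.group("scope")))
    return events


# Hypothetical sample log, not taken from Google's incident.
sample_log = [
    "[ 1021.4] CPU3: Core temperature above threshold, cpu clock throttled (total events = 7)",
    "[ 1021.4] CPU3: Core temperature/speed normal",
    "[ 1030.9] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2)",
    "[ 1031.0] eth0: link up",
]

throttle_events = find_throttle_events(sample_log)
print(throttle_events)  # [(3, 'Core'), (0, 'Package')]
```

A scan like this only says that throttling happened, not why. In Google's case, the cause still had to be found by looking at the hardware.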

The engineers found that failing systems were isolated to machines on a single rack. All of this investigation was happening remotely. Unable to explain why the rack was overheating enough to cause kernel errors, the engineers then asked Google's on-site data-center workers to physically check out the problem rack.
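The step of narrowing failures down to one rack can be sketched as a simple grouping exercise. This is an illustration only: the host names, rack labels and both helper functions are hypothetical, and nothing here reflects Google's actual tooling or inventory data.

```python
from collections import Counter


def failures_by_rack(failing_hosts, host_to_rack):
    """Count failing hosts per rack, given a host -> rack mapping."""
    return Counter(host_to_rack[host] for host in failing_hosts)


def single_rack_culprit(failing_hosts, host_to_rack):
    """Return the rack name if every failing host sits in one rack, else None."""
    counts = failures_by_rack(failing_hosts, host_to_rack)
    if len(counts) == 1:
        return next(iter(counts))
    return None


# Hypothetical inventory and failure list.
host_to_rack = {
    "edge-01": "rack-A",
    "edge-02": "rack-A",
    "edge-03": "rack-B",
    "edge-04": "rack-B",
}
failing_hosts = ["edge-03", "edge-04"]

culprit = single_rack_culprit(failing_hosts, host_to_rack)
print(culprit)  # rack-B
```

When every failing machine maps to the same rack, a shared physical cause such as power, cooling or, as here, the rack itself becomes the likeliest suspect.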

Soon after, the data-center team reported back with a brief message and a picture of the rack's crushed wheels.

"Hello, we have inspected the rack. The casters on the rear wheels have failed and the machines are overheating as a consequence of being tilted," the team explained. 

"The wheels (casters) supporting the rack had been crushed under the weight of the fully loaded rack," said McGhee.

"The rack then had physically tilted forward, disrupting the flow of liquid coolant and resulting in some CPUs heating up to the point of being throttled."

It's not clear why the wheels were crushed, but Google engineers feared the problem could be more widespread, so they replaced all the racks that could be vulnerable to the same broken-wheel tilting issue.

The problem has caused Google to reconsider how it moves new racks into its data centers when they're being built.


Google's engineers discovered that casters on the rear wheels had failed, ultimately causing the machines to overheat.

Image: Google

The alarming tilt of a refrigeration unit also pointed to the underlying problem.

Image: Google