Even the biggest tech companies have major IT disasters because engineers fail to foresee that a seemingly minor event might overload a vulnerable system.
If your job involves protecting IT infrastructure, it could well be worth reading Google's new and free 500-page book detailing numerous failures affecting Google's internal systems and products like YouTube.
Importantly, the new book also reveals how Google's site-reliability engineering and security teams cooperate to protect key systems, from Android to Chrome, Gmail, Search, and Google Cloud.
Few companies in the world operate at Google's scale, but nonetheless there may be lessons to learn from Google's book, which comes as the COVID-19 coronavirus pandemic makes it more important than ever for online systems to remain reliable, available, and secure.
The book offers insights from teams that practice so-called site reliability engineering (SRE), Google's approach to coordinating the software engineers who develop its products and systems with the operations teams that keep them running.
Google, which has used SRE principles for nearly two decades, defines SRE as "what you get when you treat operations as if it's a software problem".
The new book, titled 'Building Secure and Reliable Systems', focuses on how Google brings an SRE approach to security, and security's role in software product development and operations. Google's previous books on SRE covered best practices in SRE but didn't deal with the links between reliability and security.
"For good reasons, enterprise security teams have largely focused on confidentiality. However, organizations often recognize data integrity and availability to be equally important, and address these areas with different teams and different controls," explains Royal Hansen, Google's VP of security engineering.
"The SRE function is a best-in-class approach to reliability. However, it also plays a role in the real-time detection of and response to technical issues – including security-related attacks on privileged access or sensitive data. Ultimately, while engineering teams are often organizationally separated according to specialized skillsets, they have a common goal: ensuring the quality and safety of the system or application."
The book opens with the questions "Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable?".
Google's first tale is about a cascading failure in 2012, which began after its corporate transportation team announced that the Wi-Fi password for the buses connecting its San Francisco Bay Area campuses had changed.
The resulting flood of employee requests overloaded Google's password manager and knocked it and its three replicas offline.
Restarting the system required a smart card. Google kept these in safes in multiple offices across the globe, but not in New York, where the on-call engineer was based. So the engineer reached out to a colleague in Australia for one there, which turned out to be locked in a safe with a code the colleague could not retrieve.
And where was the code saved? Of course, in the now-offline password manager. But there were even more failures as engineers fumbled to restart the password manager.
"On that day in September, the corporate transportation team emailed an announcement to thousands of employees that the WiFi password had changed. The resulting spike in traffic was far larger than the password management system – which had been developed years earlier for a small audience of system administrators – could handle.
The load caused the primary replica of the password manager to become unresponsive, so the load balancer diverted traffic to the secondary replica, which promptly failed in the same way. At this point, the system paged the on-call engineer. The engineer had no experience responding to failures of the service: the password manager was supported on a best-effort basis, and had never suffered an outage in its five years of existence. The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card.
These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager. Fortunately, another colleague in California had memorized the combination to the on-site safe and was able to retrieve a smart card.
However, even after the engineer in California inserted the card into a reader, the service still failed to restart with the cryptic error, "The password could not load any of the cards protecting this key."
At this point, the engineers in Australia decided that a brute-force approach to their safe problem was warranted and applied a power drill to the task. An hour later, the safe was open – but even the newly retrieved cards triggered the same error message.
It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended."