Date: 2020-04-09

Even the biggest tech companies have major IT disasters because engineers fail to foresee that a seemingly minor event might overload a vulnerable system.

If your job involves protecting IT infrastructure, it could well be worth reading Google's new and free 500-page book detailing numerous failures affecting Google's internal systems and products like YouTube.

Importantly, the new book also reveals how its site-reliability engineering and security teams cooperate to protect key Google systems, from Android to Chrome, Gmail, Search, and Google Cloud.

Few companies in the world operate at Google's scale, but nonetheless there may be lessons to learn from Google's book, which comes as the COVID-19 coronavirus pandemic makes it more important than ever for online systems to remain reliable, available, and secure.

The book offers insights from teams that practice so-called site reliability engineering (SRE), Google's approach to coordinating the software engineers who develop its products and systems with the operations teams that keep those products running.

Google, which has used SRE principles for nearly two decades, defines SRE as "what you get when you treat operations as if it's a software problem".

The new book, titled 'Building Secure and Reliable Systems', focuses on how Google brings an SRE approach to security, and security's role in software product development and operations. Google's previous books on SRE covered best practices in SRE but didn't deal with the links between reliability and security.

"For good reasons, enterprise security teams have largely focused on confidentiality. However, organizations often recognize data integrity and availability to be equally important, and address these areas with different teams and different controls," explains Royal Hansen, an early SRE lead for Gmail and Google's current VP of security engineering.

"The SRE function is a best-in-class approach to reliability. However, it also plays a role in the real-time detection of and response to technical issues – including security-related attacks on privileged access or sensitive data. Ultimately, while engineering teams are often organizationally separated according to specialized skillsets, they have a common goal: ensuring the quality and safety of the system or application."

The book opens with two questions: "Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable?"

Google's first tale is about a cascading failure in 2012, which began after its corporate transportation announced that the Wi-Fi password for the buses connecting its San Francisco Bay Area campuses had changed.

The flood of employees trying to retrieve the new password overloaded Google's password manager and knocked it and its three replicas offline.

Google needed a smartcard to restart the system and kept such cards in multiple offices across the globe, but couldn't access the ones in the US. So it reached out to engineers in Australia, where the local smartcard turned out to be locked in a safe with a code the engineer had forgotten.

And where was the code saved? Of course, in the now-offline password manager. But there were even more failures as engineers fumbled to restart the password manager.