Why clouds will always, eventually, fail

And every other computer will fail too. Is this any way to run a digital-first civilization?
Written by Robin Harris, Contributor

You aren't running one computer system; you're running dozens. Worse, their software is rarely updated and their bugs are masked. And the problems are orders of magnitude greater in the cloud. Is this any way to run a digital-first civilization?

Microprocessors are everywhere. Forget the x86 or ARM chips running Linux, Windows, macOS, iOS, or Android in your devices or in the cloud infrastructure that provides so many of our everyday services.

No, these microprocessors are usually called microcontrollers, but there's nothing micro about them other than price. They're more powerful than the workhorse superminicomputer of the 1980s, the DEC VAX-11/780, a six-foot-high, 32-bit system capable of supporting dozens of users.

Microcontrollers are sprinkled over computer systems like powdered sugar on a donut. Display controllers, network interfaces, disk controllers, graphics cards, audio interfaces, even disks and SSDs themselves: all use powerful microcontrollers running a real-time operating system (RTOS).

And that's just within your server, notebook, tablet, and phone. Any device you link to, be it printer, storage array, cable modem or network switch, also relies on microcontrollers. Our infrastructure is not carved in stone: it's more like a pile of gravel, made of thousands of tiny pebbles. And about as stable.

Proprietary software everywhere

Open source software, primarily Unix-based, owns a huge chunk of our cloud infrastructure and most personal devices. Software only gets better when billions of people use it, and tens of thousands of engineers can examine and improve it.

Microcontrollers? Forget that. There are over 140 RTOSs in active use today, most of which are proprietary. And the software they run is almost exclusively proprietary, and rarely evaluated outside of the few engineers who develop it.

Microcontroller software goes by the reassuring name of firmware, which is shorthand for "a pain to fix." It is just as buggy as any other software, but instead of getting analysis and fixes, it relies upon higher layers of software to retry when it breaks.

The exemplar of this strategy is TCP/IP, the protocol suite that dominates local and wide area networking. Data is split into numbered packets, and the receiving node acknowledges the packets that arrive. Anything that goes unacknowledged gets sent again. As long as everything works, you'll eventually get your data.
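To see how retries hide failures, here is a toy sketch in Python. It is not real TCP, which uses acknowledgements, timers and sliding windows; it only assumes a link that drops packets at random and a sender that keeps resending whatever is still missing. The function name `send_with_retries` and its parameters are made up for illustration.

```python
import random


def send_with_retries(payload, chunk_size=4, loss_rate=0.3, seed=0, max_rounds=100):
    """Deliver `payload` over a simulated lossy link, resending until complete.

    Returns (reassembled_data, total_transmissions). Hypothetical helper:
    a toy model of retry-until-delivered, not an implementation of TCP.
    """
    rng = random.Random(seed)  # seeded so a run is repeatable

    # Split the payload into numbered packets: {sequence_number: chunk}.
    packets = {
        seq: payload[start:start + chunk_size]
        for seq, start in enumerate(range(0, len(payload), chunk_size))
    }

    received = {}
    transmissions = 0
    rounds = 0
    while len(received) < len(packets):
        if rounds == max_rounds:
            raise TimeoutError("link too lossy, giving up")
        rounds += 1
        for seq, chunk in packets.items():
            if seq in received:
                continue  # already delivered, nothing to resend
            transmissions += 1
            # The simulated link silently drops a fraction of packets.
            if rng.random() >= loss_rate:
                received[seq] = chunk

    # Reassemble in sequence order, as the receiving node would.
    data = "".join(received[seq] for seq in sorted(received))
    return data, transmissions


if __name__ == "__main__":
    data, sent = send_with_retries("hello, flaky firmware", loss_rate=0.5, seed=1)
    print(repr(data), "delivered after", sent, "transmissions")
```

The caller gets back exactly the data it sent, no matter how many packets were dropped along the way. The only trace of trouble is the transmission count, and almost nobody looks at that.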

So any errors in the underlying infrastructure, especially in microcontroller firmware, are swept under the rug. A switch vendor may update their software, but the firmware in the NIC? Rarely.


The Storage Bits take

So, you may say, there are thousands, maybe millions of little known and nearly unfixable bugs in our infrastructures. So what? My Instagram account usually works, so I'm good.

Except that as infrastructure grows, more long-tail bugs emerge. Ten beta users will discover the most common bugs. One hundred more users will discover less common bugs. Five billion users will uncover truly exotic bugs of mind-bending complexity, often the result of one bug interacting with one or more others.
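A back-of-the-envelope sketch makes the scaling concrete. Assume, purely for illustration, that a given bug bites each user independently with the same small probability p. Then the chance that at least one of N users hits it is 1 - (1 - p)^N. The function name and the one-in-a-billion figure below are invented for the example, not measured numbers.

```python
import math


def chance_bug_is_seen(per_user_probability, users):
    """Probability that at least one of `users` hits the bug: 1 - (1 - p)^N.

    Assumes every user hits the bug independently with the same
    probability p, where 0 <= p < 1. Illustrative model only.
    """
    # log1p/expm1 keep precision when p is tiny and N is huge.
    return -math.expm1(users * math.log1p(-per_user_probability))


if __name__ == "__main__":
    one_in_a_billion = 1e-9  # hypothetical per-user chance of tripping the bug
    for users in (10, 110, 5_000_000_000):
        chance = chance_bug_is_seen(one_in_a_billion, users)
        print(f"{users:>13,} users: {chance:.9f}")
```

Under these assumptions, a one-in-a-billion bug is effectively invisible to ten beta testers or the next hundred users, yet with five billion users the chance that somebody hits it is about 0.993.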

As our infrastructures grow and become more integral to the smooth functioning of our daily lives, our vulnerability grows with them. Murphy's Law says, "If something can go wrong, it will." May I offer, in all modesty, Harris's Corollary? "If computer infrastructure absolutely, positively has to work, eventually it won't."

Humans have been building bridges for thousands of years, and yet, some still collapse every year. Cloud infrastructures are barely two decades old. We have many data collapses in our future.

And now you know why.

Comments welcome! This post was inspired by a talk given by Bryan Cantrill, the brilliant co-founder of the new Oxide Computer Company. Watch the talk to learn even more scary stuff.
