
Amazon's experience: fault tolerance and fault finding

By Robin Harris | April 26, 2011, 8:19am PDT

Fault-tolerant architectures have risks that are magnified by natural human tendencies. A new Google paper points to new risks with fault-tolerant infrastructures.

Fault tolerance breeds complacency
In 1997 Sun introduced its first fully redundant storage system, with redundant I/O channels, power supplies, cooling and data. But within weeks we got field reports of total system failures.

The problem was simple: a component failed, but because the system kept running and the error messages were ignored, a second fault brought the system down. Oops.

We fixed it not with triple redundancy but by making sure that faults were harder to ignore. Compare that solution to the 5X redundancy of computer systems on board the Apollo moon shots.
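One way to make a fault "harder to ignore" is to rate it by how much redundancy is left rather than by whether the system is still running. The sketch below is only an illustration of that idea, not a description of what Sun actually shipped: the class name RedundantGroup, its methods and the severity labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RedundantGroup:
    """Hypothetical tracker for one set of redundant components."""
    name: str
    total: int                              # components installed
    failed: set = field(default_factory=set)

    def healthy(self) -> int:
        return self.total - len(self.failed)

    def severity(self) -> str:
        # Severity reflects remaining redundancy, not current uptime.
        remaining = self.healthy()
        if remaining == self.total:
            return "OK"
        if remaining >= 2:
            return "WARNING"    # degraded, but one more fault is survivable
        if remaining == 1:
            return "CRITICAL"   # still running, but the next fault is an outage
        return "DOWN"

    def report_failure(self, component_id: str) -> str:
        self.failed.add(component_id)
        return self.severity()

    def report_repair(self, component_id: str) -> str:
        self.failed.discard(component_id)
        return self.severity()


power = RedundantGroup("power supplies", total=2)
print(power.report_failure("psu-a"))   # the system is up, yet this is CRITICAL
```

The point of the design is that a dual-redundant system with one failed part reports the same urgency as a non-redundant system, which is what it has become.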

Google’s experience
A Google engineering team commented on this in a recent paper on their Megastore system. They found that fault tolerance

. . . often hides persistent underlying problems. We have a saying in the group: “Fault tolerance is fault masking”. Too often the resilience of our system coupled with insufficient vigilance in tracking the underlying faults leads to unexpected problems: small transient errors on top of persistent uncorrected problems cause significantly larger problems.

This is a bigger issue because of the way it intersects with human psychology: we want to believe that the multibillion-dollar infrastructures we’ve built are strong and safe; we get bored looking for things that probably aren’t there; and the intersection of a transient error with an underlying fault is very difficult for engineers to imagine or guard against.

Bottom line: humans aren’t built to handle this class of problem, so we need our machines to help.

Tolerate or amputate?
The Google team makes another important observation about fault-tolerant infrastructures: a system that tolerates faulty participants can miss their larger impact on the total infrastructure.

For example, if an algorithm tolerates slow participants, and a slow system is interpreted as a fault, the algorithm can throttle the entire group as it “handles” the fault of a slow machine. A chain gang is only as fast as its slowest member.

Sometimes the right answer is not to tolerate but to amputate a “fault” and let the rest of the system get on with the work. But how do you tell the difference?
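One possible heuristic, offered as a sketch and not as anything the Megastore paper prescribes: compare each participant's latency to the group median, and drop the stragglers only if enough members remain to do the work. The function names find_stragglers and amputate, the factor of 3 and the quorum parameter are all assumptions made for illustration.

```python
from statistics import median


def find_stragglers(latencies_ms, factor=3.0, min_group=3):
    """Return names whose latency exceeds factor times the group median."""
    if len(latencies_ms) < min_group:
        return []                      # too few members to judge "typical"
    typical = median(latencies_ms.values())
    return sorted(n for n, ms in latencies_ms.items() if ms > factor * typical)


def amputate(latencies_ms, quorum, factor=3.0):
    """Drop stragglers if a quorum survives; otherwise tolerate them."""
    stragglers = find_stragglers(latencies_ms, factor)
    active = {n: ms for n, ms in latencies_ms.items() if n not in stragglers}
    if len(active) < quorum:
        return dict(latencies_ms), []  # cutting would cost the quorum: tolerate
    return active, stragglers


group = {"a": 10, "b": 12, "c": 11, "d": 400}
active, cut = amputate(group, quorum=3)
print(cut, max(active.values()))       # the group now moves at 12 ms, not 400
```

The same group with quorum=4 keeps the slow member, because removing it would leave too few participants. That is the "tolerate" branch.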

The Storage Bits take
All systems have faults. The interesting question is: what happens then?

In the case of Amazon’s outage I expect the root cause analysis will show an unexpected interaction of a transient fault with an underlying fault. And the solution will be improved instrumentation and notification to admins. And maybe an “amputation” feature in the software.

But at bottom we live in a statistical universe: a larger user base finds more bugs than a smaller user base. And Amazon’s user base is growing rapidly.

This particular problem is unlikely to recur. But other nasty bugs are out there and will bite us.

Is this a reason to run screaming from cloud technology? No. But know that there will be other failures even as the failures grow rarer each year.

Comments welcome, of course. I wrote about and linked to the Google paper on StorageMojo.



Disclosure

Robin Harris

Robin Harris is president of TechnoQWAN, a consulting and analyst firm in northern Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 25-year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about. Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, t-shirts, lunches and trips, as well as analyst or consulting assignments. He also invests in some technology companies. He may accept payment for services in stock as well. Robin discloses financial investments in or client relationships with companies named in Storage Bits. To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree. In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions:
* What assumptions are implicit in the world view and judgments of the author?
* What, if any, is the factual basis for the opinions the author expresses?
* Is it reasonable, logical and clear?
Your critical faculties: use ‘em or lose ‘em!

Biography

Robin Harris

Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as many smaller ones. Earlier he spent 10 years marketing servers and networks. After leaving corporate life he founded TechnoQWAN, a consulting and analyst firm. He also developed StorageMojo into one of the top storage industry blogs.

Robin writes, consults, coaches and lives among the mountains of northern Arizona.

Comments


RE: Amazon's experience: fault tolerance and fault finding
FAULKNE 13th Oct
Good day to confirm this comment I would appreciate your website very nice to everyone Yes, Oracle is the only one with shared-disk architecture, but that is their advantage. It means you can add or remove nodes and the database lives on. In a shared nothing architecture, if you lose a node, you lose the system. I'm sure Oracle appreciates EMC highlighting their advantage. I also desire to signal in your RSS feeds. Thank you as soon as once again and maintain up the great operate Awesome post! Thank you very much || thanks for nice content this is really benefit to me.
See, while I agree that faults should not be ignored, I do not believe that fault tolerance is fault masking. If you didn't have a fault tolerant system, the system would break as a result of a fault. What needs to occur is better fault monitoring. The idea of a faulty component (computer system in this case) being amputated is a novel idea, but the ability for the amputation to be re-grafted when the fault is fixed needs to be present as well. In the example of a slow computer, a fault tolerant system could distribute the load from the slow computer across the multiple other recipients, leaving the slow computer to do less work, thereby removing the fault. That is to say if I got your analogy correctly.
Well, we had some Stratus fault tolerant systems in our data center, and when one detected a fault it dialed home and ordered the part, and the part would show up on the sysadmin's desk to be replaced. Never had a system hardware issue cause an outage.
@mrlinux
Cool for hardware. How did that work for software faults?
@Robin Harris
For certain platforms Stratus provides a means to allow customers to configure alerts that are triggered off of software events (full disclosure: I presently work for Stratus Technologies).
not finding and fixing the root cause. It's likely still there waiting to happen again. It could take a week or two to be legitimately fixed for this instance and several months of wider datacenter hw swap out or rolling software redeploy before thousands of identical nodes are similarly fixed. I'm sure they're praying it doesn't happen again before they complete the process but it's not unlikely that it'll happen again in the meantime...
@Johnny Vegas
I'm sure the paying customers appreciate getting back online ASAP. The root cause analysis comes after that, and yes, as you note, fixing the problem will take longer. But since the problem hadn't happened in the years since EBS debuted, it is still unlikely to occur before the next fix is in.
Yea, run into this before...
Been_Done_Before 26th Apr 2011
Had a server with a HD fault, ordered a new HD a few days later because we didn't notice the alert right away.... Second HD failed, had to rebuild the SAN unit. Luckily we were fault tolerant on SAN units, but still work had to be done, more work than if the unit had been ordered in a timely manner.

Second, we went full on VMWare, moved all our DCs to the cloud. Had a cooling failure which shut down the entire vmware cluster, which we never expected to happen. Now there was no DC available to start the virtual center service, which used integrated authentication. Tense moments there, but luckily everything booted and the virtual DCs came back online so we could start the service. Now we have a physical DC at another site just in case of another failure such as this.
Love the theme 'fault tolerance is fault masking'. I'd say it applies to all lesser 'resilient' infrastructures as well. Your human nature points are right on as we see customers having a difficult time doing RCA on thorny problems - so these are often unresolved. 'Known problem' meets transient and voila! Major outage or slowdown.

There is a lot of promise in self-learning tools, however, that can uncover these underlying problems and identify the cause. You can check them out by doing a google search on 'application performance analytics'.

Now we've led the horse to water - let's see if they'll drink.