
What can we learn from the Amazon outage?

Simple planning and forethought could have saved many organizations a great deal of pain. When will organizations think about the cost of downtime as well as the cost of products and services when building an application?
Written by Dan Kusnetzky, Contributor

It is amazing how a single outage could bring out so many suppliers of hardware, software and services. I've had at least ten suppliers of management software, network virtualization software, storage virtualization hardware and software, and professional services call to explain how the proper use of their product or service would have prevented the pain that many felt when Amazon Web Services had an outage. While I agree with nearly everything they said, I think that they are missing the point to some degree.

Here are some lessons this outage offers to the industry:

  • It is clear that those acquiring web services are often not people who have been involved with their organization's own facilities or IT functions. They didn't understand:
    • The need for careful planning for a workload's environment
    • How to set up processes that deal with planned or unplanned outages
    • How to read and understand the terms and conditions offered by a supplier
    • What IT people do to create a reliable, secure and manageable environment

  • Organizations didn't have a well-developed and tested set of alternatives if something in their computing environment failed. Plans should have included what to do if any of the following happened:
    • Workloads slow down to the point that they are no longer useful
    • Networks become unavailable or unresponsive
    • Systems become unavailable or unresponsive
    • Storage becomes unavailable or unresponsive

  • Workloads needed to be developed so that they could run locally or on more than one cloud service offering. Building to just one set of APIs usually means being locked into that supplier forever. If one must accept a locked-in environment, at least consider transportability and select the highest-level platform possible.
  • Automated tools must be used to monitor workloads so that slowdowns or outages can be dealt with in real time. This could mean moving the work to the local data center, to a different data center offered by the same service provider or to an entirely different supplier's data center.
  • Data for virtual applications must live in a virtual storage world.
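
To make the monitoring point concrete, here is a minimal sketch of the kind of decision an automated tool might make: check each place a workload could run, in order of preference, and pick the first one that is both available and fast enough to be useful. Everything named here (`Target`, `choose_target`, the target labels and the one-second threshold) is hypothetical and chosen for illustration; a real tool would replace the canned measurements with actual health checks.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Target:
    """A place the workload could run (all names here are hypothetical)."""
    name: str
    max_latency_s: float  # slower than this counts as "no longer useful"


def choose_target(
    targets: Sequence[Target],
    probe: Callable[[Target], Optional[float]],
) -> Optional[Target]:
    """Return the first target, in preference order, that is healthy.

    probe(target) returns the measured response time in seconds,
    or None if the target is unavailable.
    """
    for target in targets:
        latency = probe(target)
        if latency is not None and latency <= target.max_latency_s:
            return target
    return None  # nothing is healthy: time to call a human


# Canned measurements stand in for real health checks (assumed values).
measurements = {
    "primary-cloud": None,               # unavailable
    "same-provider-other-site": 4.2,     # up, but too slow to be useful
    "other-supplier": 0.3,               # healthy
}
targets = [
    Target("primary-cloud", 1.0),
    Target("same-provider-other-site", 1.0),
    Target("other-supplier", 1.0),
]

chosen = choose_target(targets, lambda t: measurements[t.name])
print(chosen.name if chosen else "no healthy target")
```

The order of the list encodes the fallback plan, which is the part that has to be decided and tested before an outage rather than during one.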

I could go on and on (and the suppliers I spoke with would love it if I did) about the importance of careful planning and execution. I'm reminded of the quote "those failing to plan are planning to fail."
