The cloud brand has been taking a beating in the press as of late. Bloggers and journalists alike continue to question the viability of the cloud without fully understanding the potential causes of data center outages.
The most recent was the Amazon-Apple debacle, where a hacker used social engineering to overcome built-in security.
You can read Mat Honan's Wired article here: How Apple and Amazon Security Flaws Led to My Epic Hacking.
Social engineering is the dark art of manipulating people. As the now-reformed hacker Kevin Mitnick is fond of pointing out in his CSEPS Course Workbook (2004): "it is much easier to trick someone into giving a password for a system than to spend the effort to crack into the system."
This is nothing new, nor specific to cloud computing. Mitnick was doing this in the '70s at age 12 to get free bus rides.
A modern-day approach to social engineering is phishing. Here the 'attacker' sends an email to the target -- typically a group of people -- that appears to come from a legitimate business, bank, or credit card company, either requesting that the recipient verify some information or promising that, if the recipient sends the transaction cost, they can have access to a much larger amount of money.
I would be surprised if many of you had not seen an example of this.
Social engineering is exactly what happened to the Amazon Client Service Representative (CSR). Were I a phisher interested in targeted phishing, known as spear phishing, I would be tempted to send an email or two to this CSR. Why? Because the scam preys on the gullible.
There have also been several recent stories detailing high-profile data center outages and questioning the viability of the cloud. I don't believe many of these folks understand what goes into a data center, which components may fail, or how often those components actually fail without causing an outage.
The Uptime Institute outlines the following tiers for data centers.
- Tier 1: Basic Capacity: site-wide shutdowns are required for maintenance or repair work. Capacity or distribution failures will impact the site.
- Tier 2: Redundant Capacity Components: site-wide shutdowns for maintenance are still required. Capacity failures may impact the site. Distribution failures will impact the site.
- Tier 3: Concurrently Maintainable: each and every capacity component and distribution path in a site can be removed on a planned basis for maintenance or replacement without impacting operations. The site is still exposed to an equipment failure or operator error. A 'fault tolerant' site is also 'concurrently maintainable'.
- Tier 4: Fault Tolerant: an individual equipment failure or distribution path interruption will not impact operations.
The redundancy in the Uptime Institute tiering is achieved through various methods. Tiers 1 and 2 require occasional outages to continue operations, which is unacceptable at the enterprise level. Most mature organizations require a minimum of a Tier 3 data center.
The hallmark of a Tier 3 (N+1 UPS) data center is that it can support maintenance work without incurring a shutdown or interrupting business. In a Tier 3 (N+1) data center, redundancy is achieved through redundant UPS and generator-paralleling switch gear. The idea is that if there is a utility outage, the generators can run for two or three days -- and as long as fuel deliveries continue, the data center can continue operations.
The difference between Tier 3 (N+1 UPS) and Tier 4 (2N UPS) data centers lies at the subsystem level, where an additional UPS provides fault tolerance. These may or may not have static switches, depending on the architecture.
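The reliability gain from that extra UPS path can be sketched with a little probability. The snippet below is my own back-of-the-envelope illustration (not Uptime Institute methodology), and it assumes independent failures -- a simplification that real data centers violate, since shared switch gear, common-mode faults, and human error correlate failures across paths:

```python
# Illustrative only: availability of identical units in parallel,
# assuming independent failures. The 99.9% figure is a made-up
# example, not a vendor or Uptime Institute number.

def parallel_availability(unit_availability: float, units: int) -> float:
    """System availability when the system is up if at least
    one of `units` identical, independent units is up."""
    unavailability = 1.0 - unit_availability
    return 1.0 - unavailability ** units

# A single UPS path that is up 99.9% of the time:
single = parallel_availability(0.999, 1)

# Two independent UPS paths (roughly the 2N idea):
dual = parallel_availability(0.999, 2)

print(f"single path: {single:.6f}")  # ~0.999000
print(f"dual path:   {dual:.6f}")    # ~0.999999
```

The point of the sketch is that a second independent path squares the unavailability (0.1% down becomes roughly 0.0001% down), which is why 2N designs exist despite their cost -- and also why correlated failure modes such as operator error matter so much: they break the independence assumption that this math relies on.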
The things that can go wrong in either a Tier 3 or Tier 4 data center include:
- Cable(s) and tray
- Circuit overload
- Electrical Panel / Breaker
- Emergency Power Off
- Fire Suppression equipment
- Power Distribution Unit
- Power whip
- Static switch
- Switch gear
- Uninterruptible Power Supply
- Human Error
Typically, any of the above can occur at a Tier 3 or Tier 4 data center and still not cause an outage. That is: an issue occurs, is discovered, and is addressed, and an outage is averted. This happens a good deal in the day-to-day operations of most data centers.
Most outages result from human error or electrical equipment failure.
We have all heard the story of the security guard who accidentally tripped the Emergency Power Off switch to silence a wailing alarm. Well, it is not an IT urban myth. I could recount at least one such incident in 2008-2009, but I digress.
It is also no surprise that many more outages occur in April-May and September-October than in any other months, so there is a change or release management component to the cycles. I suspect the September-October spike is folks rushing to get things out ahead of the holiday freezes that typically occur here in the U.S., though I am not sure whether this is the case worldwide.
What do we get from all of this?
Well, we know redundancy works, though the cost of going from Tier 3 to Tier 4 may cause some organizations to maintain a mix of Tier 3 and Tier 4 data centers. Outages are typically associated with electrical rather than cooling distribution, with electrical equipment failure, and/or with human error.
We need not only to design, install, and maintain redundancy; we need to actively test systems to ensure their viability. We also need to manage changes and releases so that communication about releases occurs at the right levels, and so that changes include testing and back-out procedures.
Did I miss anything? Let me know your experiences with outages.