SF power outage reveals disaster recovery plans need tweaking

Summary: When a power outage that affects a mere 30,000 to 50,000 customers knocks out some of the more popular sites on the Web, you know disaster recovery plans have some holes. On Tuesday afternoon, a host of sites such as Craigslist, Red Envelope, Yelp, Technorati and ZDNet had trouble delivering pages.

When a power outage that affects a mere 30,000 to 50,000 customers knocks out some of the more popular sites on the Web, you know disaster recovery plans have some holes.

On Tuesday afternoon, a host of sites such as Craigslist, Red Envelope, Yelp, Technorati and ZDNet had trouble delivering pages. The San Francisco Chronicle detailed the outage, which affected the 365 Main data center (see Techmeme).

These outages are disturbing. In theory, these sites should have gone to backup power. Where were 365 Main's backup generators? Data Center Knowledge reports that 365 Main's generators failed too. Meanwhile, a disaster recovery site should have kicked in. Preferably these sites would not be on the same power grid as their headquarters.

One question about this outage irks me: What if it had been something worse, say a terrorist attack or another Katrina? I'll tell you what would happen: Sites would have been out of business. And when your business is a Web site, that fact is a tad alarming. I've noticed a few holes on the disaster recovery front of late. For instance, NetSuite relies on a single third-party data center facility in California to deliver its services. That means one power outage or earthquake and NetSuite customers have issues. At least NetSuite plans on adding another data center.

Perhaps I'm a little more in tune with the importance of disaster recovery since I'm based primarily in New York City. I also remember those disaster recovery tips from the likes of Cantor Fitzgerald and the New York Board of Trade, which lost buildings and/or employees on Sept. 11, 2001. Rest assured they have their plans buttoned down. In New York, your plans have to be set. Some businesses are in disaster recovery mode now after a blown steam pipe near Grand Central shut down the area.

And if you don't have the resources of the big guys, hopefully you at least have some space on layaway with a vendor like Sungard.

The basics are: have your backup site on a separate grid; test your backup plan quarterly; and keep your site running with data centers in multiple locations. The financial services folks have this drill down. They have prepped for everything from avian flu to another terrorist attack.
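
To make that last point concrete, here's a minimal sketch of the multi-location idea: poll a health check at each data center, in priority order, and serve traffic from the first one that answers. The site names and URLs are hypothetical, and in practice this logic usually lives in a DNS failover service or a global load balancer rather than a script.

```python
# Minimal failover sketch (hypothetical endpoints): check each data center's
# health URL in priority order and serve from the first healthy one.
import urllib.request

SITES = [
    ("sf-primary", "https://sf.example.com/health"),        # same grid as HQ: risky
    ("east-backup", "https://nyc.example.com/health"),      # separate grid and region
    ("dr-vendor", "https://dr-vendor.example.com/health"),  # layaway space at a DR vendor
]

def first_healthy(sites, timeout=3):
    """Return the name of the first site whose health check answers HTTP 200."""
    for name, url in sites:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return name
        except OSError:
            continue  # unreachable or timed out; try the next location
    return None  # every location failed its check

if __name__ == "__main__":
    active = first_healthy(SITES)
    print("Serve traffic from:", active or "nowhere (time to declare a disaster)")
```

Testing the plan quarterly, as suggested above, means actually pulling the primary out of rotation and confirming the backups really take the load.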

Apparently, others haven't learned the disaster recovery lessons. Some people just have to learn the hard way.

Topics: Data Management, Outage

Talkback

22 comments
  • This will become a bigger and bigger issue

    As environmentalist policies cause ever-increasing brownouts and blackouts, not just across California but across the rest of the nation.
    frgough
  • Reasons why I don't depend on web services

    I may use them occasionally, but these are reasons I don't use web services for the most important stuff. In addition, I can't guarantee I have Internet access all of the time. I don't use SF, and I don't imagine I ever will.
    CobraA1
    • That's what I was thinking as well

      Granted, the major web service providers (Google, Yahoo, Microsoft) all have more robust, redundant systems in place than a lot of smaller players. But even then, I can't count how many times I've tried to log in to Gmail to no avail.

      There are weaknesses to every system, and this is one of the biggest for the whole 'cloud' model. The cloud concept is great because it provides remote access and eases the requirements on the user's hardware. But user hardware is always improving, and it's increasingly easy for users to create their own private cloud for remote access. As always, a middle ground (using a combination of the cloud and the desktop) seems the best alternative.
      RustyShackleford
  • Paranoia vs. Pocketbook

    Why do you claim: "Meanwhile, a disaster recovery site should have kicked in?" Did 365 Main's contracts claim that such a site existed? Did ZDNet or the others pay for it?

    Feel free to blame 365 Main for its generator failure. Given that 365 Main purchased the data center at a fire sale price, I'm not terribly surprised that the plans didn't survive contact with the enemy.

    I once worked at a company that had a similar data center generator failure. The cause was traced to the (identical) batteries used to start the generators.

    But if 365 Main's clients wanted a disaster recovery site they should have paid for one, preferably from a different vendor. After all, had 365 Main gone bankrupt like the previous owner of this datacenter, the backup data center would have been of little help.
    benveniste
  • Downtime

    These companies have access to rock-solid third-party providers with multiple points of presence. Companies that are in the business of delivering 100% uptime, such as Neustar Ultra Services and Akamai, eliminate unforeseeables for thousands of companies every day. This will continue to be a problem since most websites choose to go with outdated technology or practices in search of even greater profits over reliability.
    bryant.reyes9
    • Completely Agree

      Okay, I'm biased. I write for Messaging News as well as work in downtown SF, and as I'm writing this, a story is going to print on disaster recovery and what companies still fail to do after all the lessons learned (New Orleans). Companies such as Teneros, NeverFail, Marathon and AppRiver offer a host of great services that deliver 100% uptime, as you said. There are indeed options, with managed services being the best; it's just a matter of convincing the CFO that it's a necessary spend, and that's unfortunate.

      Melisa LaBancz-Bleasdale, Palamida
      Melisa1
  • Netflix - ouch!

    I feel sorry for them...disappointing earnings (stock down 25%), 18 hours of downtime - at the same time (partially due to power outage)...my oh my!
    THEE WOLF
  • No such thing as 100% uptime

    This is a fallacy. Even the largest sites can't claim that since somewhere in the world someone is trying to access their site/service and is unable to do so.

    Most sites claim five nines, but even that is a stretch. Gmail, Hotmail, etc.: if you can't log into them, you're out of luck, and it counts as downtime.

    For example 99.999% uptime = 5 minutes 15 seconds per year of downtime. For a typical user this would mean that out of every 100K logins or page views you only have 1 error!

    That never happens! Just remember that 99.x% of anything is still 100% of something!
    THEE WOLF
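
    As a quick check on the arithmetic above, here is a small sketch that converts an uptime percentage into the downtime it allows per year; the 99.999% ("five nines") row works out to roughly 5 minutes 15 seconds, as the comment says.

    ```python
    # Convert an uptime percentage into the downtime budget it allows per year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for claim in ("99%", "99.9%", "99.99%", "99.999%"):
        uptime = float(claim.rstrip("%")) / 100
        downtime_minutes = (1 - uptime) * MINUTES_PER_YEAR
        print(f"{claim:>8} uptime -> {downtime_minutes:8.2f} minutes of downtime per year")
    ```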
  • It's not about recovery, it's about continuity

    When you think about terms like "Disaster Recovery", inevitably terms like "RTO" (recovery time objective) and "RPO" (recovery point objective) spring to mind. Historically, the IT marketplace has focused so much on how long it will take to recover your systems and/or data that we tend to forget about what matters the most - the end user.

    IT goals should not focus on recovering systems (typically measured in hours), but instead on providing continuous availability to end users so that they can remain productive and drive business performance throughout an outage - after all, every minute (or hour!) that a system isn't available is ultimately costing somebody something. If you're busy "recovering" from a major outage, you've missed the point.

    Continuous availability solutions focus on just that - providing seamless end user availability to a critical application or service no matter the cause of an IT disruption. More than "clustering", these solutions focus on predicting problems before they ever occur, protecting from downtime if an unforeseen outage does occur, and delivering continuous business performance to the people who matter the most - the end users.

    For more information about continuous availability, check out www.neverfailgroup.com.

    -John Posavatz
    -VP, Product Management
    -The Neverfail Group
    jposavatz
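
    For readers who haven't run into the acronyms, here is a minimal worked example of the difference between the two objectives. All timestamps are invented, and strictly speaking RTO and RPO are targets you set in advance; the figures below are what an actual outage would be measured against them.

    ```python
    # Worked example of RPO vs. RTO (hypothetical timestamps).
    from datetime import datetime

    last_good_backup = datetime(2007, 7, 24, 13, 0)   # most recent recoverable copy of the data
    outage_start     = datetime(2007, 7, 24, 13, 45)  # power is lost
    service_restored = datetime(2007, 7, 24, 16, 0)   # users can work again

    data_loss_window = outage_start - last_good_backup   # measured against the RPO target
    time_to_recover  = service_restored - outage_start   # measured against the RTO target

    print("Data loss window (vs. RPO):", data_loss_window)  # 0:45:00
    print("Time to recover  (vs. RTO):", time_to_recover)   # 2:15:00
    ```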
    • Funny You Said That!

      I just mentioned Neverfail without prompting in a previous post string!

      Melisa LaBancz-Bleasdale, Palamida and Messaging News
      Melisa1
    • Graceful degradation

      In a perfect world, we would have multiple upon multiple backup resources so that we could run nearly unaffected by a worst-case situation. Clearly that is never "economical" for a lot of businesses. The approach I have seen smart companies implement is "Graceful Degradation" of their services.

      For instance, during a power outage, they lose power to many of their departments but keep the essential equipment and key departments running. If the generator goes out, they may have long-term battery backup for only the most essential operations (telephone, line-of-business computer systems, etc.). If the batteries start to run down, they move the business to a tiny, overcrowded, prearranged hot site. A precise layered/fallback response can mean the survival of a business during a disaster. It is nothing new, since many biological systems implement this time-honored strategy for survival.

      Clearly this did not happen here. This was more of a sudden degradation: either everything is up or nothing is. Such an approach leaves a business vulnerable to abject failure during a disaster. Either you support everything or everything collapses.

      I hope they learned at least that lesson....
      rcsteinbach
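
      A minimal sketch of the layered fallback described above: run the least-degraded tier that the remaining power source can still support, instead of failing all at once. The tiers and kilowatt figures are invented for illustration.

      ```python
      # Graceful degradation sketch: pick the richest service tier the available
      # power can support. Tier names and kW figures are invented.
      TIERS = [
          ("full operations", 400, "all departments, HVAC, full site"),
          ("essential only",  120, "phones, line-of-business systems, core servers"),
          ("skeleton",         15, "battery-backed status page; staff move to the hot site"),
      ]

      def choose_tier(available_kw):
          """Return the least-degraded tier that fits within the available power."""
          for name, required_kw, scope in TIERS:
              if available_kw >= required_kw:
                  return name, scope
          return "dark", "nothing runs on-site; invoke the disaster recovery plan"

      for source, kw in (("utility", 500), ("generator", 150), ("batteries", 20), ("none", 0)):
          tier, scope = choose_tier(kw)
          print(f"{source:>9} ({kw:3d} kW) -> {tier}: {scope}")
      ```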
  • Report from 2nd and Howard

    Funny thing, we work in the building next door to CNET. We also have a server room that powers a 3-terabyte compliance database for our worldwide customers. We have backup power; however, as with everyone else on our block, we lost power to our phones, Internet and, more critically, coffee! The interesting thing is that when we called PG&E, our power provider, they said they had no idea what was wrong, that we were the first to report it. Frightening. The power went on and off at two-minute intervals for about half an hour and then stabilized. It was a sobering reminder that we as a company still have work to do. However, I do feel that our city, the same one that charges us our first-born children to live here, should have a better infrastructure in place to support its own financial district.

    Melisa LaBancz-Bleasdale, Palamida
    Melisa1
  • Now had they been using solar

    arrays to augment power, and then provide power to the servers during outages, this would have been a minor issue. But hey, what does a power engineer know? ]:)
    Linux User 147560
    • After dark ... not much.

      The issue wasn't one of alternative fuels ... it was that the ones already in place failed.
      Jambalaya Breath
      • Missing the point completely

        Part of the cause of the initial power loss was most likely an overload on the transmission system. It's been pretty hot here lately, with a lot of ACs working overtime, AND since most HVAC systems are NOT maintained all that well with energy efficiency in mind... it all adds up to a huge load that eventually the power transmission system can't handle.

        By using alternative power to augment and reduce loads on the main bus bringing power in, events like this go away. At night, loads are lower due to cooler temps, and most normal people go home. But hey... what do I know!? ]:) I only carry a Journeyman's license for power plant operations and power transmission (working on my masters now)...
        Linux User 147560
  • SOMEBODY SHOULD TALK TO THE FCC, THERE SHOULD BE A LAW HERE

    If you could have been there when the Internet was designed, what would you have insisted on? Would you want everybody to have their sites on their own individual computers, or in a hosting system? Or something more like international banking?
    BALTHOR
    • I don't see that as an either/or proposition.

      Frankly, except for the two-tier power grabs, I think everything is just ducky already.
      Jambalaya Breath
  • I accept downtime

    Because my website barely pays its own way. It certainly doesn't fund redundancy. But I think that, were the numbers to get significantly larger, I'd be looking to serve my pages from multiple locations full time so that the loss of one of them would go unnoticed by my customers.

    I've never been to SF. Maybe it's a nice place. But if it falls into the bay someday, I wouldn't want my business to go under with it.
    Jambalaya Breath
  • Disaster Recovery Plans

    Living, as I do, in South Florida, I deal with frequent power outages, unbelievable (almost daily) thunderstorms during the summer and the odd hurricane (three in the past two years). Clearly one is best served by a "belt and suspenders" approach to disaster recovery. This stuff is going to happen, and I have chosen to approach the problem with a combination of a battery-powered UPS, an on-site automatic propane electric generator, on-site backup and an Internet-based data storage/recovery service (ElephantDrive.com). ElephantDrive provides me with an inexpensive, secure, user-friendly service with frequent automated backup of all data on my system. I have found it to be reliable and useful, providing protection from power failure, physical damage to my system and hard drive failure. Data retrieval can be achieved from any Internet-connected computer and, as a bonus, I have been able to use the service to access individual files while traveling.
    jpfisher
  • I've worked Disaster Recovery since I was at DEC

    Yes ... that's how far back people have been thinking about this sort of thing. We had inter-plant agreements and off-site storage ... the disaster that took our system [b][i]down[/i][/b] would have been so bad nobody [b]cared[/b] about "recovery". Two decades later, after 9/11, I was fighting similar battles at Putnam - and being told "less is Good Enough", except it isn't.

    DR costs money. Even if you are planning on using excess capacity at a second or third data center, you still need to maintain that excess capacity. If you don't own your servers, you depend on the assurances of your host that you are covered for rapid recovery should their main data center lose power or functionality. This, too, costs money. If you are big enough and spread out enough, you can manage it. If your entire business lives on a single server ... disaster.

    DLMeyer - the Voice of [url=http://tinyurl.com/y4amro] [b]G.L.Horton's Stage Page[/b][/url] Pod Cast - latest episode is about Mumbet
    dlmeyer9