Once again, a datacenter “upgrade” takes down the datacenter

Summary: If you're putting lots of eggs in one basket, it's important to have a plan in case the basket gets dropped

With data communications centralized for more than 90 Oregon state agencies, it's not surprising that when a planned storage upgrade on Monday somehow screwed up network communications with the datacenter, the state called it a catastrophic failure. Oregon State Administrative Department spokesman Matt Shelby placed the blame firmly on the state's storage vendor, Hitachi, but gave no details as to the root cause of the failure. The crash affected only communications; the state says that no data was lost.

The crash did affect the normal operations of some of those 90 agencies, however. A number of state services went down, including high-profile ones such as the ODOT TripCheck road cameras. Some of the reported failures might leave you scratching your head, such as the problem the Department of Forestry faced when it responded to a fire: apparently the outage kept the department from accessing some of its database forms.

The most significant impact was on the state Employment Department. The datacenter outage delayed 70,000 unemployment compensation checks, with the longest delays hitting new filers, who get their first check by mail (later payments are made via direct deposit or a benefits ATM card).

Since the state has repeatedly confirmed that the failure was the result of communication issues and that no data was lost or compromised, it's difficult to guess what the storage vendor might have done in the normal course of a storage upgrade to cause such widespread impact on what should be unrelated systems.

Topics: Data Centers, Storage

Talkback

12 comments
  • where was it?!

    where is the redundancy? datacenters that house tech for the state cannot just 'go down'...regardless of who's to blame for the primary, there has to be a secondary site for at least critical items. Shame on you Oregon
    i_hear_u_but
    • If the network is down...

      It doesn't matter how many backup sites there may be.

      You still can't communicate with them.
      jessepollard
      • You do realize...

        ...that the reason "the internet" was invented as a DARPA project, was to ensure a network that wouldn't go down?! TCP/IP networking *should* route around an outage. In this case, what I would suspect is that there is some segment that is a single route. Should not be. Redundancy needs to be built into critical network segments, the same as for every other layer/part of an infra stack (e.g., server clusters with failover, etc.).
        Techboy_z
        • OREGON CONUS NET OUT

          190547ZJUL2013
          TO: techboy_z
          Of course you are right. Most of the "public" cannot spell DARPA much less define it. Neither can they remember the 5th Chief Directorate of the KGB before their big move. KRYPTO is unknown to them and 1A ckts are beyond their minds. We need to remind them about 1Alfa ckts!!!
          siopesi3
    • Easy to say for a non-Tax payer

      Please move to Oregon so you can pay for the redundancy!
      gajohn003123
  • wise words

    Things can go wrong with anything; it always pays to be prepared! A shame that some people haven't recognized this; it really can create a vicious cycle.
    TechIan16
  • Right, that's why I have an infinite number of internet connections

    And an infinite number of cable modems, and an infinite number of home computers with an infinite number of power supplies and hard drives and CPUs and whatnot, just to prevent that kind of thing.....

    Look, there's no way to have absolute failure proofing, and redundancy is not a synonym for failure proof. It's like the difference between waterproof and water resistant.
    Vesicant
  • Monday?

    Why was it done on a Monday instead of the weekend?
    chips@...
  • Upgrades have been common on Mainframes since the 70s without failures

    Get a mainframe.
    douglas_john_ledet@...
  • Stupid is...

    This is ridiculous! I have worked for 2 tier one suppliers and colo/hosted providers, and worked with half a dozen major hosted providers, and datacenters in hundreds of locations around the world, and I have never *ONCE* *EVER* encountered a full datacenter outage due to a stupid act by the people whose very job it is to keep the datacenter functional! Sure...power outages, flooding, nuclear reactors...yes, those have all knocked out datacenters that I access, but stupidity? Never! Why? Because Tier 1 providers don't have datacenter outages from stupidity! It just doesn't happen!
    tech_ed@...
  • I think Mr Scott said it best,

    "The more they overthink the plumbing, the easier it is to stop up the drain."
    Star Trek III: The Search for Spock
    Dr_Zinj
  • Ummm ... did I miss something?

    I didn't see that it said that their entire network was down -- just that they had a communications issue ... one that apparently prevented systems from communicating with their storage system. That could have simply been caused by a misconfiguration in their storage system or their network gear, or maybe they accidentally killed the power to the storage system. Who knows? Shouldn't happen, but I can certainly see how such things could happen.

    They obviously need a bit better planning before they attempt their upgrades ... and they might want to think about doing them over the weekend, when those systems aren't all in-use.

    But I wouldn't go indicting the network guys for something that sounds more like the storage vendor's snafu.
    imalugnut