Wikimedia's Datacenter Failures Should be Taken to Heart

Summary: Repeated downtime should be your first clue that there is a problem with your datacenter strategy

In the last six months, Wikimedia has had two datacenter failures that brought down significant portions of their worldwide information network. The first, back in March, was the result of an overheating problem in their Amsterdam datacenter, compounded by the failure of the failover process to their datacenter in the US. The second came last Monday, when a power outage took down their Florida datacenter.

The first thing to note is that neither failure was due to a technical problem relating to the amount of traffic or data that the wiki sites generate. That's the positive. The negative is that, despite the wake-up call back in March, there were still no processes in place to fail over the US datacenter to a secondary site.

Now Wikimedia doesn't own their own datacenters; given the capital expenditure required, that is unsurprising. In the US they are hosted by Equinix and Hostway; in Europe, by EvoSwitch and SARA. This means that, like many smaller businesses using colocation or hosting services, they are to a certain extent at the mercy of their providers' infrastructure and of the limited budget that gets applied to datacenter services.

Wikimedia is aware of their deficiencies in keeping their sites up and running and plans to spend over $3,000,000 in their 2010-11 budget on the addition of another datacenter. But judging from the downtime they have had so far this year, simply adding another datacenter is not the only issue that needs to be addressed.

In both cases the downtime was triggered by circumstances initially beyond their control, but it also appears that the proper procedures were not in place to restore service to the wiki servers quickly. Logic dictates that after the first failure, measures would have been implemented to ensure proper failover from a downed datacenter to a functioning one; that appears not to have been the case.

For a business for which uptime means money, these failures would be problematic, and the second one indefensible. For Wikimedia, which doesn't depend on uptime for revenue, that pressure doesn't apply. But the lesson to be learned here is that when you move your operations to a datacenter that is not under your direct control, you need to make sure the procedures and processes are in place, and tested, to keep your business running in the face of a datacenter failure.
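To make that closing advice concrete, here is a minimal sketch, in Python, of the kind of automated health check a tested failover procedure might be built around. The hostnames, thresholds, and the switch_traffic_to_secondary() stub are hypothetical, not Wikimedia's actual tooling; a real deployment would call its DNS or load-balancer API where the stub merely prints a message.

```python
#!/usr/bin/env python3
"""Minimal sketch of a datacenter health check and failover trigger.

Hypothetical hostnames, thresholds, and stub -- not Wikimedia's actual tooling.
"""
import time
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.org/healthz"  # hypothetical endpoint
FAILURES_BEFORE_FAILOVER = 3   # consecutive failed probes before acting
PROBE_INTERVAL_SECONDS = 30


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the primary datacenter answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection resets, and socket timeouts
        return False


def switch_traffic_to_secondary() -> None:
    """Stub: a real procedure would call a DNS or load-balancer API here."""
    print("FAILOVER: redirecting traffic to the secondary datacenter")


def main() -> None:
    consecutive_failures = 0
    while True:
        if primary_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            print(f"primary probe failed ({consecutive_failures}/{FAILURES_BEFORE_FAILOVER})")
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                switch_traffic_to_secondary()
                break
        time.sleep(PROBE_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

Just as important as having such a script is exercising it: deliberately failing over on a schedule is the only way to know the procedure works before an overheated server room or a power outage forces the issue.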

Topics: Data Centers, Hardware, Storage

Talkback

  • RE: Wikimedia's Datacenter Failures Should be Taken to Heart

    "to a certain extent" is absolutely correct. It is not the decision of the data center not to provide a hot failover capability. It is Wikimedias decision not to pay for that capability. Hosting services have a menu of options from which to choose, hot-failover being one of the most expensive because you are reserving (paying for) infrastructure at a second center. This stuff ain't free folks!
    DennisDCS
    • RE: Wikimedia's Datacenter Failures Should be Taken to Heart

      @DennisDCS

      True, but there is only so much money available in any budget. And if the failure is at the datacenter-wide level (for example, an entire server room overheating, as happened in Amsterdam), there is also some onus on the datacenter provider.

      Regardless, my point is more that you need to plan for these types of problems, even when you use a top-tier provider. Wikimedia is just a high-profile example.
      David Chernicoff
  • RE: Wikimedia's Datacenter Failures Should be Taken to Heart

    This is pretty much absolutely right. We run a top-5 website on pretty much *nothing*.

    (In my day job, I work for an extremely small-time publisher. Our departmental spend is about what Wikimedia spends. Our load is negligible by comparison ...)

    One of the actual problems is that the MediaWiki software doesn't have a distributed backend: several hundred web servers and proxies around the world all get their data from a few huge database servers.

    So, if there are any seriously hotshot MySQL-tweaking programmers out there with time on their hands for a charity, do please have a look around our Bugzilla ...
    DavidGerard
    • RE: Wikimedia's Datacenter Failures Should be Taken to Heart

      @DavidGerard

      The lack of a properly distributed backend obviously is an architectural issue, and a fairly common one when a business grows as fast as Wikimedia did.

      Without a compelling financial incentive, it's easy to treat it as an "it works well enough" issue; a for-profit organization would have been forced to address it.
      David Chernicoff
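To illustrate the architectural point raised in the two comments above: a distributed backend starts with splitting reads from writes, so that geographically scattered web servers can read from nearby replicas while all writes funnel to a single primary. The sketch below is a simplified, hypothetical illustration of that routing pattern, not MediaWiki's actual database layer; the host names are invented.

```python
"""Sketch of primary/replica read-write splitting for a database tier.

Hypothetical hosts and a deliberately naive routing policy -- an illustration
of the pattern, not MediaWiki's real database load balancing.
"""
import random
from dataclasses import dataclass


@dataclass
class DatabaseServer:
    host: str
    is_primary: bool = False


class ReadWriteRouter:
    """Route queries: writes go to the primary, reads to a random replica."""

    def __init__(self, servers: list[DatabaseServer]):
        self.primary = next(s for s in servers if s.is_primary)
        self.replicas = [s for s in servers if not s.is_primary]

    def server_for(self, sql: str) -> DatabaseServer:
        # Naive classification: anything that is not a SELECT goes to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas) if self.replicas else self.primary
        return self.primary


if __name__ == "__main__":
    cluster = [
        DatabaseServer("db-primary.example.org", is_primary=True),  # hypothetical hosts
        DatabaseServer("db-replica1.example.org"),
        DatabaseServer("db-replica2.example.org"),
    ]
    router = ReadWriteRouter(cluster)
    print(router.server_for("SELECT page_title FROM page WHERE page_id = 1").host)
    print(router.server_for("UPDATE page SET page_touched = NOW()").host)
```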
  • Customer-Controlled Data Centers Are Rarely Redundant

    Very few corporations have, or can afford to have, online fail-over capability. U.P.S. does. In many cases, even IBM does not; it simply isn't affordable. Cloud at least offers the possibility of lowering the costs enough to allow companies to do so.
    brookem