Wikimedia's Datacenter Failures Should be Taken to Heart

By David Chernicoff | July 7, 2010, 9:53am PDT

Summary: Repeated downtime should be your first clue that there is a problem with your datacenter strategy

In the last 6 months, Wikimedia has had two datacenter failures that brought down significant portions of their worldwide information networks. The first failure, back in March, was the result of an overheating problem in their datacenter in Amsterdam, compounded by the failure of the failover process to their datacenter in the US. The second, last Monday, was caused by a power outage at their Florida datacenter.

The first thing to note is that neither failure was due to a technical problem relating to the amount of traffic or data that the wiki sites generate. That’s the positive. The negative is that, despite the wakeup call back in March, there were no processes in place to fail over the US datacenter to a secondary site.

Now Wikimedia doesn’t own their own datacenters; given the capital expenditure necessary, that is unsurprising. In the US they are hosted by Equinix and Hostway; in Europe by EvoSwitch and SARA. This means that, like many smaller businesses using colocation or hosting services, they are, to a certain extent, at the mercy of their provider’s infrastructure and of the limited budget that gets applied to paying for the datacenter services.

Wikimedia is aware of their deficiencies in keeping their sites up and running and has plans to spend over $3,000,000 in their 2010-11 budgets on the addition of another datacenter. But from the downtime issues they have had so far this year, it would seem that the lack of another datacenter is not the only issue that needs to be addressed.

Downtime appears to have been caused, in both situations, by circumstances that were initially beyond their control, but it also seems that proper procedures were not in place to facilitate the rapid restoration of service for the Wiki servers. Logic dictates that after the first failure, measures would be implemented to ensure proper failover from a down datacenter to a functioning one; this appears not to have been the case.

For a business to which uptime means money, the failures would be problematic, with the second one being indefensible. For Wikimedia, it appears that is not the case. But the lesson that needs to be learned here is that when you move your operations to a datacenter not under your direct control, you need to make sure that the procedures and processes are in place, and tested, to keep your business up and running in the face of datacenter failure.

Disclosure

David Chernicoff

David does not invest in the technology he covers. As a freelance author and technologist he has had contract work with many vendors in the industry. Beyond the term of these short-term contracts there is no business or fiduciary arrangement with any technology vendor. David does not enter into contracts that would limit his freedom of expression in any way, nor is he remunerated for discussing any vendor. All comments in his blog writings are solely the opinions of David Chernicoff.

Biography

David Chernicoff

With more than 20 years of published writings about technology, as well as industry stints as everything from a database developer to CTO, David Chernicoff has earned the term "veteran" in the technology world. Currently the principal of an independent consulting business and an active freelance writer, David has most recently been a Senior Contributing Editor for Windows IT Pro magazine, having also been the Lab Director for Windows NT Magazine, Technical Director of PC Week Labs, the author or co-author of a number of books on different versions of Windows, a plethora of eBooks on various technology topics, and of approximately 3000 magazine articles in print and on the web.
Comments

"To a certain extent" is absolutely correct. It is not the decision of the data center not to provide a hot failover capability. It is Wikimedia's decision not to pay for that capability. Hosting services have a menu of options from which to choose, hot failover being one of the most expensive because you are reserving (paying for) infrastructure at a second center. This stuff ain't free, folks!
Contributor
@DennisDCS

True, but there is only so much money available in any budget. And if the failures are at the datacenter-wide level (for example, an entire server room overheating, such as happened in Amsterdam) there is also some onus on the datacenter provider.

Regardless, my point is more that you need to plan for these types of problems, even when you use a top-tier provider. Wikimedia is just a high-profile example.
This is pretty much absolutely right. We run a top-5 website on pretty much *nothing*.

(In my day job, I work for an extremely small-time publisher. Our departmental spend is about what Wikimedia spends. Our load is negligible by comparison ...)

One of the actual problems is that the MediaWiki software doesn't have a distributed backend - several hundred web servers and proxies around the world all get their data from a few huge database servers.

So, if there are any seriously hotshot MySQL-tweaking programmers out there with time on their hands for a charity, do please have a look around our Bugzilla ...
Contributor
@DavidGerard

The lack of a properly distributed backend obviously is an architectural issue, and a fairly common one when a business grows at a rate as fast as Wikimedia did.

Without a compelling financial incentive it's easy to treat it as an "it works well enough" issue, whereas a for-profit organization would have to address it.
Very few corporations have, or can afford to have, online fail-over capability. U.P.S. does. In many cases even IBM does not; it simply isn't affordable. Cloud at least offers the possibility of lowering the costs enough to allow companies to do so.
Great! Thanks for sharing this information with us!
