In the last six months, Wikimedia has had two datacenter failures that brought down significant portions of their worldwide information network. The first, back in March, resulted from an overheating problem in their Amsterdam datacenter, compounded by the failure of the failover process to their datacenter in the US. The second, last Monday, was caused by a power outage in their Florida datacenter.
The first thing to note is that neither failure was caused by a technical problem related to the amount of traffic or data that the wiki sites generate. That's the positive. The negative is that, despite the wakeup call back in March, there were no processes in place to fail over from the US datacenter to a secondary site.
Wikimedia doesn't own their own datacenters; given the capital expenditure necessary, that is unsurprising. In the US they are hosted by Equinix and Hostway; in Europe by EvoSwitch and SARA. This means that, like many smaller businesses using colocation or hosting services, they are to a certain extent at the mercy of their providers' infrastructure and of the limited budget available to pay for datacenter services.
Wikimedia is aware of their deficiencies in keeping their sites up and running and plans to spend over $3,000,000 in their 2010-11 budget on the addition of another datacenter. But judging from the downtime they have had so far this year, simply adding another datacenter will not address every issue.
In both cases the downtime appears to have been triggered by circumstances that were initially beyond their control, but it also seems that proper procedures were not in place to restore service to the wiki servers quickly. Logic dictates that after the first failure, measures would have been implemented to ensure a proper failover from a down datacenter to a functioning one; this appears not to have been the case.
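To make the idea of a rehearsed failover a little more concrete, here is a minimal sketch in Python. It is not Wikimedia's tooling, and every name in it (`Site`, `choose_active_site`, `run_failover_drill`, the site labels) is a hypothetical chosen for illustration. It assumes each site has a priority and that some health probe can report whether a site is up; the drill then pretends each site has failed in turn and records where traffic would go.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Site:
    name: str      # hypothetical label, e.g. "primary"
    priority: int  # lower number means more preferred


def choose_active_site(
    sites: Sequence[Site], is_healthy: Callable[[Site], bool]
) -> Optional[Site]:
    """Return the most preferred healthy site, or None if every site is down."""
    for site in sorted(sites, key=lambda s: s.priority):
        if is_healthy(site):
            return site
    return None


def run_failover_drill(
    sites: Sequence[Site], is_healthy: Callable[[Site], bool]
) -> Dict[str, Optional[str]]:
    """Simulate losing each site in turn; map the lost site to its failover target."""
    results: Dict[str, Optional[str]] = {}
    for down in sites:
        # Treat the site under test as failed, whatever the real probe says.
        def probe(site: Site, down: Site = down) -> bool:
            return site != down and is_healthy(site)

        target = choose_active_site(sites, probe)
        results[down.name] = target.name if target else None
    return results


if __name__ == "__main__":
    sites = [Site("primary", 0), Site("secondary", 1)]
    print(run_failover_drill(sites, lambda site: True))
```

Any site that maps to `None` in the drill result has nowhere to fail over to, which is exactly the gap a periodic test is meant to expose before a real outage does.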
For a business for which uptime means money, the failures would be problematic, and the second one indefensible. For Wikimedia, that appears not to be the case. But the lesson to be learned here is that when you move your operations to a datacenter not under your direct control, you need to make sure that procedures and processes are in place, and tested, to keep your business up and running in the face of a datacenter failure.