When Wikimedia first started experiencing high-profile datacenter failures, I saw an opportunity for other datacenter operators to learn from the experience. But with this most recent outage, caused by a fiber-optic cable severed outside Wikimedia’s Florida datacenter, I find that my perspective has changed.
The takeaway from the latest failure, especially when viewed in the context of the previous outages, is that a reliable datacenter is not a good candidate for crowdsourcing, at least not when the primary purpose of the datacenter is an outward-facing, professional interface to the public.
Wikimedia has developed some of its own monitoring tools and implemented commercial applications designed to minimize downtime, but it suffers from a greatly exaggerated version of a primary issue in commercial datacenters: funding. Datacenter disaster recovery and business continuity (DR/BC) solutions are expensive, and while Wikimedia has received some significant cash contributions, along with services and equipment targeted at its datacenters, day-to-day operational issues come first, and DR/BC considerations, by necessity, take a back seat.
The most recent outage is a very clear example of the problem: the network connection gets cut, and the datacenter disappears from view. A commercial operation with a DR/BC process in place would have allowed for this eventuality in a number of ways. The simplest is supporting multiple carriers in the datacenter: different providers, routed into the facility over physically separate paths, so that the simple cut that caused Wikimedia’s outage would have been addressed by internal routing and resulted in nothing more than a blip in the services the datacenter provides.
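The carrier-diversity idea can be sketched in a few lines. This is a toy illustration only, not anything Wikimedia ran: the gateway hostnames are hypothetical, and in a real facility the "pick a working uplink" decision is made by routing protocols rather than an application-level probe.

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical gateways for two independent carriers, each entering the
# facility over a physically separate path.
CARRIER_GATEWAYS = [
    ("gw-carrier-a.example", 179),  # carrier A
    ("gw-carrier-b.example", 179),  # carrier B
]

def working_uplink(gateways):
    """Return the first gateway that answers, or None if all are down."""
    for gateway in gateways:
        if reachable(*gateway):
            return gateway
    return None
```

The point of the sketch is the failure mode it avoids: a single severed fiber takes out one entry in the list, and traffic simply shifts to the other carrier instead of the whole site vanishing.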
More complete DR/BC solutions can go as far as clustering and replication across separate datacenters, with the backup datacenter coming online automatically when the primary facility goes offline. Obviously, solutions of this nature, which require duplicated hardware, are considerably more expensive, and impractical for an organization like Wikimedia given its funding concerns.
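The "backup comes online automatically" behavior usually rests on some form of heartbeat between sites. A minimal sketch, with names and timeout values of my own invention, might look like this:

```python
import time

class FailoverPair:
    """Toy model of automatic datacenter failover: the standby site
    promotes itself when heartbeats from the primary stop arriving."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout                  # seconds of silence before failover
        self.last_heartbeat = time.monotonic()
        self.active = "primary"

    def heartbeat(self):
        """Called whenever the primary site checks in."""
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        """Decide which site should be serving traffic right now."""
        now = time.monotonic() if now is None else now
        if self.active == "primary" and now - self.last_heartbeat > self.timeout:
            self.active = "standby"             # backup datacenter comes online
        return self.active
```

Real implementations layer replication, quorum, and split-brain protection on top of this, which is exactly where the expense comes from: the standby hardware must be capable of carrying the full production load the moment `check()` flips.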
Wikimedia does have a disaster recovery site, in Ashburn, VA, but there is a big difference between having a disaster recovery site, from which you can eventually restore services if there is a problem elsewhere, and a real-time, minimal-interruption DR/BC solution that minimizes (or completely removes) the potential impact on the datacenter’s users. While not meaningless to Wikimedia, this latest outage is just an example of a failure mode that every datacenter operator was already aware of and should have plans in place to remediate.