In 2010, Wikimedia had two major datacenter outages that affected the users of their information services. Both were caused by problems in the facilities that housed their distributed services, and both had a noticeable impact on users worldwide.
Because Wikimedia doesn't own their datacenters and works with an amazingly meager budget for such a large operation, there is little they can do about facility failures of this kind. Complete failover capabilities are available in the datacenter world, but deployments of that type are very expensive and, in the case of Wikimedia, not a very good use of resources.
But Wikimedia's issues aren't just those of keeping the datacenter up and running (which is the responsibility of their location provider in meeting their SLA); they also include making sure that their services are available to the users, developers, editors, and applications that make use of the information Wikipedia contains. And service availability can be impacted by things far less obvious than the loss of a major datacenter. So how does Wikimedia keep track of all their distributed services?
In the past, Wikimedia used their own internally developed monitoring tools to keep track of the availability of their services. But all of those tools shared a common flaw: running in the same datacenters as the services they were monitoring, they could only see the availability of those services within the datacenter. There was no reliable, consistent view of service availability as experienced by the end users who depended on it. And plenty of issues can have unexpected consequences for availability, from simple network outages to complex interdependencies in which one service requires several others to be available.
To get a better handle on what was going on with all of their services, Wikimedia has adopted the NimSoft WatchMouse user experience monitoring service. WatchMouse is a SaaS solution that can track and monitor application availability and response time from 60 different sites in 40 different countries, giving a worldwide view of what the response time and behavior look like for the monitored applications and services.
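The general idea of checking a service from outside its own datacenter can be illustrated with a minimal Python sketch. This is not WatchMouse's API or implementation; the names `CheckResult`, `http_status`, `check`, and `availability` are hypothetical and exist only for illustration. The sketch times one HTTP request, records whether it succeeded, and computes the fraction of successful checks.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    """Outcome of one availability check (hypothetical structure)."""
    url: str
    ok: bool                      # True if the service answered with HTTP 2xx/3xx
    elapsed_s: float              # wall-clock time spent on the request
    error: Optional[str] = None   # short description when the check failed


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return its HTTP status code; raises on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check(url: str, fetch: Callable[[str], int] = http_status) -> CheckResult:
    """Run a single check and time it. `fetch` can be swapped out for testing."""
    start = time.monotonic()
    try:
        status = fetch(url)
        ok = 200 <= status < 400
        error = None if ok else f"HTTP {status}"
    except Exception as exc:  # timeouts, DNS failures, HTTP errors, etc.
        ok, error = False, str(exc)
    return CheckResult(url, ok, time.monotonic() - start, error)


def availability(results: List[CheckResult]) -> float:
    """Fraction of checks that succeeded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.ok for r in results) / len(results)
```

Run from a machine outside the monitored datacenter, a loop such as `results = [check("http://status.wikimedia.org") for _ in range(5)]` followed by `availability(results)` gives a rough end-user view from that one location. A commercial service repeats the same kind of measurement from many locations.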
WatchMouse is a self-administered SaaS platform; NimSoft provides the service, but it is up to the client to configure how it is used, what reports are generated, and what information is made available. Wikimedia uses these services to take a more proactive approach to problem solving, replacing what used to be a word-of-mouth management system. In the past, Wikimedia relied on their user community to report problems and had to react long after those problems occurred, when there was already a clear impact on users. With the new management solution, Wikimedia's own support teams know about problems before they impact users, a major change in their management process.
As part of that monitoring, Wikimedia has made information about the current status, uptime, and availability of their services available to the public at large, simply by going to http://status.wikimedia.org.