Technorati, a major blog indexer, recently shut down its spiders to fix severe data problems. From the blog post explaining the incident (emphasis added):
[A] small percentage of recently created blogs were having their data scrambled. An example of this appears in this blog post. The spidering outages allowed us time to investigate, diagnose and make corrections that prevented further data corruption.
Technorati handles a large volume of data every day; isolating and devising remedies for these kinds of issues that affect a small percentage of the data flow is tricky. However, we think we're recovering now and the backlog of data processing is getting worked through.
Just to peek into the works a little bit, many distributed data systems rely on centrally dispensing identifiers for data elements and Technorati has such a beast. What was found were cases of blogs new to our system (from within the last 3 weeks) losing their identifiers and those identifiers getting re-associated to other new blogs. No blogs that existed in our system before Dec. 18th (the vast majority) were impacted at all. The outward manifestations visible were posts for blogs with a shared ID mingled (a mashup the authors naturally were unhappy with) and mis-associated blog claims ("And you may tell yourself, this is not my beautiful blog").
This was an unprecedented case for us; while it had been occurring in about 8% of those blogs (created on or after December 18) for about 2 days (beginning on Tuesday, January 8th), we had until that time never encountered this phenomenon. An intensive investigation was launched, reconstructing operational timelines and correlating facts. What we found was that this stemmed from a failure incident with the primary system for identifier dispensing, another failure in the secondary system that took its place, and then a corrupted data set mistakenly taking over from that one, ouch! The first two blows appeared to be handled routinely but the third time was cursed; propagation of corrupted data was not detected for about 48 hours between Tuesday when it started and Thursday when we pulled the emergency brakes on the spiders.
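Technorati hasn't published the internals of its identifier dispenser, but the failure mode they describe is easy to reproduce in miniature: if a failover brings up an allocator with stale or corrupted counter state, it re-issues IDs the primary already handed out, and records for different blogs get mingled under one identifier. The sketch below is purely illustrative; every name in it is an assumption, not Technorati's code.

```python
class IdDispenser:
    """Hands out unique blog IDs from a monotonically increasing counter.

    Illustrative only -- a stand-in for Technorati's central
    identifier-dispensing service, whose real design is not public.
    """

    def __init__(self, next_id):
        self.next_id = next_id

    def allocate(self):
        allocated = self.next_id
        self.next_id += 1
        return allocated


# Hypothetical index mapping blog ID -> list of posts.
index = {}

def register_blog(dispenser, posts):
    """Assign a fresh ID to a new blog and file its posts under it."""
    blog_id = dispenser.allocate()
    index.setdefault(blog_id, []).extend(posts)
    return blog_id


primary = IdDispenser(next_id=1000)
blog_a = register_blog(primary, ["alice: hello world"])

# Primary fails; a backup restored from a stale/corrupted snapshot
# comes up with its counter rewound behind where the primary left off.
corrupted_backup = IdDispenser(next_id=1000)
blog_b = register_blog(corrupted_backup, ["bob: first post"])

# Both blogs now share ID 1000, so their posts are mingled --
# the "mashup the authors naturally were unhappy with."
print(blog_a == blog_b)   # True
print(index[1000])        # ['alice: hello world', 'bob: first post']
```

The fix implied by the postmortem (detecting and rejecting a dispenser whose counter has moved backward) is exactly the kind of invariant check that would have caught the collision before 48 hours of corrupted data propagated.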
So we're recovering now, most of the data is being restored to its previous state and we have had a number of internal postmortem discussions about earlier fault detection and recovery.
THE PROJECT FAILURES ANALYSIS
Technical failures often have two components: the failure itself and management's subsequent handling of the incident. Although uncontrolled technical failures can occur under the best of circumstances, end-user satisfaction is usually a function of management rather than technology.
In this case, Technorati handled the incident well. The company:
- Acknowledged the full scope of the problem
- Took immediate corrective action once they realized the problem existed
- Provided context regarding why the problem was hard to solve
- Protected the company's credibility (I call this "intelligent CYA")
- Described symptoms the customer might experience, in jargon-free terms
- Presented their problem resolution strategy
- Demonstrated responsible and professional analysis
Technorati's short blog post explained an arcane problem, helped calm jumpy users, covered the company's collective butt, and showed the place is run by pros. I'm impressed.