Facebook's worst outage for four years was due to an internal configuration error, the company disclosed on Friday.
The 150-minute outage, during which the site was taken completely offline, was the result of a single incorrect setting that produced a cascade of erroneous traffic, Facebook software engineering director Robert Johnson said in a post on the site.
"Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," Johnson said.
"To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted it as an invalid value and deleted the corresponding cache key," he added. "This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover."
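The loop Johnson describes can be sketched in a few lines. The simulation below is illustrative only, not Facebook's actual code: a herd of clients all read the same poisoned cache value at once, query the database to repair it, and treat a database error as yet another invalid value, deleting the cache key again. The names (`run_round`, `DB_CAPACITY`, the `"config"` key) and the per-round capacity figure are assumptions for the sketch.

```python
# Illustrative sketch of the feedback loop (not Facebook's code).
# Clients that see a bad cached value delete the key and query the
# database; a database ERROR is also treated as an invalid value,
# so failed queries keep deleting the key and the herd never stops.

DB_CAPACITY = 2    # queries the database can absorb per round (assumed)
NUM_CLIENTS = 10   # clients all reacting to the same bad value

def run_round(cache, buggy=True):
    """One tick: every client reads the cache at the same moment, then
    each client that saw a bad value queries the database to fix it."""
    load = 0
    value_seen = cache.get("config")           # same snapshot for everyone
    queriers = NUM_CLIENTS if value_seen in (None, "INVALID") else 0
    failures = 0
    for _ in range(queriers):
        load += 1
        if load <= DB_CAPACITY:                # query succeeds
            cache["config"] = "good"           # repair the cache
        else:                                  # database overwhelmed
            failures += 1
            if buggy:
                # Error misread as an invalid value: delete the key,
                # guaranteeing another full herd of queries next round.
                cache.pop("config", None)
    return queriers, failures

cache = {"config": "INVALID"}                  # the bad config push
for tick in range(3):
    q, f = run_round(cache)
    print(f"tick {tick}: {q} queries, {f} failed, cache={cache}")
```

Each tick sends the full herd of ten queries at a database that can serve two, and the deletions by the eight failures leave the cache empty again, so the next tick repeats identically. Passing `buggy=False` (leave the key alone on error) lets the two successful queries repopulate the cache, and the query storm stops after one round, which matches Johnson's point that the loop, not the original bad value, was what kept the databases down.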
Network analysis company Arbor Networks, which collates global internet statistics from 80 ISPs, reported that Facebook traffic fell from 60Gbps to 10Gbps between 5:30pm and 6:30pm BST on 23 September. It subsequently slumped to under 5Gbps before returning to normal shortly after 9pm.
Facebook uses MySQL with the InnoDB storage engine to serve information, and the company is active in the open-source database community. On 15 September, it released OSC (Online Schema Change), a tool it developed to make rapid changes to MySQL schemas on live systems.