In a chicken-and-egg scenario, Slobodan Predolac and Nicolas Spiegelberg, members of Facebook's engineering team based in New York, admitted in a blog post that it took so long to fix the problem primarily because of how fast the world's largest social network continues to grow — especially on mobile.
"The ability to work at scale is one of the most exciting parts of engineering at Facebook. However, certain fundamental programming challenges inevitably become more difficult with scale. Debugging, for example, can prove difficult even if you can reliably reproduce the problem – and this difficulty increases when debugging a highly visible but nondeterministic [sic] issue in a rapidly changing codebase."
Facebook's monthly active user count stood at approximately 1.32 billion as of June 30, and mobile monthly active users alone jumped 31 percent year-over-year to 1.07 billion.
Predolac and Spiegelberg outlined the investigation into a myriad of crash reports, leading to "different theories about race-condition situations, architectural changes, and even false fundamental premises."
In the end, the engineering team traced the problem back to the networking stack, resolving the issue in partnership with the networking team and deploying Fishhook, an open source library for rebinding system APIs.
"It turns out that abandoning manual code analysis was a good strategy," the engineers concluded. "The bug surfaced with existing code that was exercised more as we ramped up default secure connections for all our users."