My colleague Marguerite Reardon has talked to several network administrators about the BlackBerry outage.
She's just posted an analysis of how it could have happened and which systems may have been at fault.
The blame seems to lie with problems at one of BlackBerry's Network Operations Centers (NOCs), probably the one in Waterloo, which is co-located with the company HQ shown in the picture above.
Here's what Marguerite writes in part:
While it's not known for sure what caused RIM's outage, it's not difficult to see how the very nature of RIM's network could potentially lead to a major service outage. RIM's service is centralized and it works by routing all BlackBerry e-mails through one of two main NOCs, which are essentially large data centers. One NOC is located in Canada and it primarily services the Western Hemisphere as well as parts of Asia, said analysts familiar with the company. The other data center, located in the U.K., handles e-mail traffic in Europe, Africa and the Middle East.
The BlackBerry Enterprise Server, which sits on the corporate network, receives e-mails from the company's Exchange or Lotus e-mail server and forwards those e-mails in an encrypted tunnel to one of the NOCs. The NOC then acts as an efficient delivery system that authenticates users and forwards the messages to the appropriate handheld device.
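To make the message path described above concrete, here is a minimal Python sketch of a BES handing mail to the NOC that serves a user's region, with the NOC checking the user before delivery. This is an illustration, not RIM's code: the names `NOC_BY_REGION`, `Noc`, `relay`, and `bes_forward` are hypothetical, the region labels are my own shorthand for the areas named in the article, and the encrypted tunnel is not modeled (the payload is simply treated as opaque bytes).

```python
from dataclasses import dataclass, field

# Region-to-NOC mapping as described in the article; labels are illustrative.
NOC_BY_REGION = {
    "western_hemisphere": "canada",
    "asia": "canada",  # the article says "parts of Asia"
    "europe": "uk",
    "africa": "uk",
    "middle_east": "uk",
}


@dataclass
class Noc:
    """Hypothetical stand-in for a NOC: knows its users and delivers to handhelds."""
    name: str
    registered: dict = field(default_factory=dict)  # user -> device id
    delivered: list = field(default_factory=list)   # (device id, payload) pairs

    def relay(self, user: str, payload: bytes) -> bool:
        # "Authenticate" by checking the user is known to this NOC.
        device = self.registered.get(user)
        if device is None:
            return False
        # Forward the (already encrypted, opaque) payload to the handheld.
        self.delivered.append((device, payload))
        return True


def bes_forward(nocs: dict, region: str, user: str, payload: bytes) -> bool:
    """Hypothetical BES step: pick the NOC for the region and hand off the message."""
    noc = nocs[NOC_BY_REGION[region]]
    return noc.relay(user, payload)
```

The point of the sketch is that every message for a region passes through exactly one NOC, which is what makes the architecture both tidy and fragile.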
Because user authentication is handled by RIM away from the corporate network, it protects companies from hackers who may try to obtain information through e-mail servers, which sit inside the company's firewall. RIM's approach also means that corporate IT departments don't have to juggle relationships with multiple mobile operators because RIM handles all of that for them in the NOC.
The flipside of RIM's approach is that with only two NOCs handling e-mails from 8 million subscribers, there are two major points of potential failure. And when something goes wrong in one or both of these data centers, it can result in an outage like the one that occurred Tuesday night and Wednesday morning, which technologically paralyzed users.
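The failure mode described above can be sketched in a few lines of Python. This is a rough illustration under two assumptions that the article does not confirm: that each region is served by exactly one NOC, and that there is no failover from one NOC to the other. The names `REGIONS_BY_NOC` and `affected_regions` are hypothetical.

```python
# Regions served by each NOC, as described in the article; labels are illustrative.
REGIONS_BY_NOC = {
    "canada": ["western_hemisphere", "asia"],
    "uk": ["europe", "africa", "middle_east"],
}


def affected_regions(down_nocs):
    """Return the sorted regions that lose service when the given NOCs are down.

    Assumes no failover between NOCs (an assumption, not a documented fact).
    """
    affected = []
    for noc in set(down_nocs):  # ignore duplicate NOC names
        affected.extend(REGIONS_BY_NOC[noc])
    return sorted(affected)
```

Under these assumptions, losing the Canadian NOC alone would cut off the whole Western Hemisphere and the parts of Asia it serves, while losing both would cut off every region.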
"Anytime you have a situation where traffic is flowing through a single data center, there is potential for a catastrophic outage," said Gene Signorini, vice president of enterprise research at the Yankee Group. "But that said, the RIM architecture also provides a lot of benefits to its corporate customers. It's just the nature of the beast."
Some of the most common issues that can result in an outage are power failures, failure of a critical component that takes down a larger component, software bugs, viruses and other attacks from the outside, or patches that fail. RIM hasn't identified which issue caused this particular outage, but Todd Kort, principal analyst at Gartner, said the outage may have been caused by a software bug.
"If the RIM outage is affecting other parts of the globe, this fact most likely points to some type of software bug," he said in an e-mail.
You can bet that a substantial amount of forensic analysis will be performed on this incident.