Research in Motion got around to explaining its BlackBerry outage Tuesday into Wednesday, but it may be a case of too little too late in terms of customer perception.
In a statement, RIM outlined what went wrong and started out with:
RIM's in-depth diagnostic analysis of the service interruption that occurred in North America on Tuesday night is progressing well and RIM will continue to provide further information as it's available. RIM's first priority during any service interruption is always to restore service and then establish, monitor and maintain stability. Proper analysis can take several days or longer and RIM's commitment is to provide the most accurate and complete information possible in such situations.
RIM's contention that it wanted to take its time with its response is understandable. But it had an obligation to say something, even if it was "we have no idea what's going on, but we'll fix it." Instead, RIM was inexplicably mum on the incident. Russell Shaw's post-mortem notes that there's enough blame in the RIM food chain to go around.
In fact, RIM's statement is at least partially designed to allay concerns that would really hurt RIM's business and give rivals such as Motorola's Good Technology some ammunition.
RIM has been able to definitively rule out security and capacity issues as a root cause. Further, RIM has confirmed that the incident was not caused by any hardware failure or core software infrastructure.
Translation: You can count on us for your corporate communications needs. And by the way we're reliable and this was a fluke.
So what was the culprit? A software upgrade that wasn't tested well. RIM said:
RIM has determined that the incident was triggered by the introduction of a new, non-critical system routine that was designed to provide better optimization of the system's cache. The system routine was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure, but the pre-testing of the system routine proved to be insufficient.
The new system routine produced an unexpected impact and triggered a compounding series of interaction errors between the system's operational database and cache. After isolating the resulting database problem and unsuccessfully attempting to correct it, RIM began its failover process to a backup system.
Although the backup system and failover process had been repeatedly and successfully tested previously, the failover process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue.
RIM then apologized to customers and said it's enhancing its systems so similar problems don't happen again. The big question: is that enough? A Wall Street Journal story (subscription required) raised the question of customer compensation. One analyst noted that when cable goes out, you get free HBO.
What should RIM do to make good on its outage?