The recent situation at UK banking giant Royal Bank of Scotland (RBS) will certainly go down in history as one of the most disruptive IT failures of all time. Customers are still feeling the impact a week after the computer disruption first cut off their access to funds. Unfortunately, RBS has released insufficient detail, and many questions remain.
Also read:
ZDNet: RBS Bank joins the IT failures 'Hall of Shame'
ZDNet: RBS gives more detail on IT failure train wreck
The Guardian: How NatWest's IT meltdown developed
Although RBS (and its operating units NatWest and Ulster Bank) has revealed little information about what caused the problems, new details have emerged in London newspaper The Guardian. Reports indicate that the failure occurred when RBS computer operators tried to upgrade the bank's workload automation system, which is based on a product called CA-7 from CA Technologies.
The upgrade initially appeared to work as expected, but within hours other "guardian systems" detected anomalies in the batch jobs. Although technicians performed the failed upgrade last Tuesday night, they could not complete a successful batch run until Friday. By the time operators finished a successful run, millions upon millions of customer transactions were waiting to be processed, and customers will continue to experience problems until the bank works through this massive backlog.
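To make the backlog dynamic concrete, here is a minimal sketch in Python. It is purely illustrative: the transaction volumes, batch capacity, and queue structure are my assumptions, not details of RBS's mainframe environment, and a toy like this works nothing like CA-7 itself. It only shows why several failed nightly runs let unprocessed transactions pile up faster than one successful run can drain them.

```python
from collections import deque

# Hypothetical numbers, chosen only to illustrate the backlog effect.
DAILY_TRANSACTIONS = 20_000_000   # assumed per-day transaction volume
NIGHTLY_CAPACITY   = 25_000_000   # assumed volume one batch window can clear

backlog = deque()

def end_of_day(day: int) -> None:
    """Each business day enqueues that day's unprocessed transactions."""
    backlog.append((day, DAILY_TRANSACTIONS))

def nightly_batch_run(succeeds: bool) -> None:
    """A successful run drains the backlog in arrival order, up to capacity.
    A failed run drains nothing, so the next day's volume stacks on top."""
    if not succeeds:
        return
    capacity = NIGHTLY_CAPACITY
    while backlog and capacity > 0:
        day, remaining = backlog.popleft()
        processed = min(remaining, capacity)
        capacity -= processed
        if remaining > processed:
            # Put the unfinished day back at the front: ordering matters,
            # because later transactions depend on earlier balances.
            backlog.appendleft((day, remaining - processed))

# Tuesday night's upgrade fails, and runs keep failing until Friday.
for day, ok in enumerate([False, False, False, True], start=1):
    end_of_day(day)
    nightly_batch_run(ok)
    pending = sum(n for _, n in backlog)
    print(f"day {day}: {pending:,} transactions pending")
```

Even with generous spare capacity, the first successful run only dents the pile, which is consistent with customers still seeing problems a week on.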
WHAT'S GOING ON?
We don't know, and that's the problem. RBS has released only sketchy details of what caused the problem and why it took so long to resolve. To understand the full scope of the event and its aftermath, RBS must answer questions in three areas:
Upgrade Questions
Software Recovery Questions
Business Policy Questions
THE BOTTOM LINE - DON'T BLAME THE CIO
In a case like this, it's tempting to blame the CIO; after all, he or she is responsible for systems. However, it is not the CIO's fault if bank policies dictate layoffs and offshoring that result in lost skills. These are management issues with consequences across the organization, including IT.
Also read:
Who's accountable for IT failure? (part one)
Who's accountable for IT failure? (part two)
Government regulators must question the judgment of the bank's Board and CEO, to uncover policies that created an environment in which critical IT changes could roll out without complete testing, verification, and a clear path to recovery.
Update 6/27/12: According to The Register (a sensationalist tech news site), an inexperienced computer operator in India caused the RBS failure. Apparently, a routine problem arose during the upgrade procedure; such problems are usually not serious, because administrators can roll back to a previous, stable version of the software. In this case, however, it seems the operator erroneously cleared the entire transaction job queue, kicking off a long and difficult process of reconstruction. The article adds: "A complicated legacy mainframe system at RBS and a team inexperienced in its quirks made the problem harder to fix".
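The distinction The Register draws, between rolling back the software and wiping the scheduler's job queue, is worth making concrete. The sketch below is a hypothetical toy scheduler, not CA-7 (whose internals are not public); every class and function name here is my invention. It illustrates the general point: a rollback restores a pre-upgrade snapshot with the pending work intact, while clearing the queue destroys the scheduler's only record of what still has to run, leaving slow reconstruction from external records as the remaining option.

```python
import copy

class ToyScheduler:
    """Hypothetical stand-in for a workload automation system, used only
    to contrast rollback with clearing the job queue."""

    def __init__(self, version: str, job_queue: list[str]):
        self.version = version
        self.job_queue = job_queue          # pending batch jobs, in order
        self._checkpoint = None

    def begin_upgrade(self, new_version: str) -> None:
        # Snapshot state so a failed upgrade can be undone.
        self._checkpoint = (self.version, copy.deepcopy(self.job_queue))
        self.version = new_version

    def roll_back(self) -> None:
        """The routine recovery path: restore the pre-upgrade snapshot.
        Pending jobs survive intact."""
        self.version, self.job_queue = self._checkpoint

    def clear_queue(self) -> None:
        """The reported mistake: the queue, the only record of what still
        needs to run, is emptied, and no rollback can bring it back."""
        self.job_queue.clear()
        self._checkpoint = None

def reconstruct_queue(transaction_log: list[str]) -> list[str]:
    # Rebuilding pending work from external records is possible but slow:
    # every logged transaction must be replayed and re-sequenced.
    return [f"replay:{t}" for t in transaction_log]

sched = ToyScheduler("v1", ["payroll", "standing-orders", "interbank"])
sched.begin_upgrade("v2")
# The upgrade misbehaves. The correct response would be:
#   sched.roll_back()   -> queue of three jobs preserved
# The reported response instead:
sched.clear_queue()     # -> queue empty, checkpoint discarded
print(sched.job_queue)  # []
print(reconstruct_queue(["payroll", "standing-orders", "interbank"]))
```

Under this reading, the failure was less the upgrade itself than the loss of recovery state, which turned a routine rollback into days of manual reconstruction.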