The recent situation at UK banking giant Royal Bank of Scotland (RBS) will certainly go down in history as one of the most disruptive IT failures of all time. The massive impact of this failure continues to be felt by banking customers a week after the computer disruption first prevented them from accessing their funds. Unfortunately, RBS has not released sufficient detail and many questions remain.
Although RBS (and its operating units NatWest and Ulster Bank) has revealed little information about what caused the problems, new details have emerged in the London newspaper The Guardian. Reports indicate that the failure occurred when RBS computer operators tried to upgrade the bank's workload automation system, which is based on a product called CA-7 from CA Technologies.
The upgrade initially seemed to work as expected, but within hours other "guardian systems" discovered anomalies in batch jobs following the upgrade. Although technicians performed the failed upgrade last Tuesday night, they were not able to complete a successful batch run until Friday. By the time operators finished a successful run, millions upon millions of customer transactions were waiting to be processed. Customers will continue to experience problems until the bank works through this massive backlog of transactions.
WHAT'S GOING ON?
We don't know, and that's the problem. RBS has released only sketchy details of what caused the problem and why it took so long to resolve. To understand the full scope of the event and its aftermath, RBS must answer questions like these:
- How detailed are the procedures governing patches and upgrades to production systems?
- Did the operator follow these procedures or deviate at all?
- To what extent was this upgrade tested in a production-style environment?
- Who installed the upgrade?
- How much CA-7 experience did the upgrade installer possess?
- How much RBS-specific experience did the installer possess?
Software Recovery Questions
- How did the bank first learn of the problem?
- What procedures did the operators use to solve the problem initially?
- Who performed these problem resolution procedures?
- What was their level of experience with the software, RBS processes, and with the broader RBS technology architecture and environment?
- Why did the rollback procedure take so long?
- How did the team finally isolate and solve the problem?
Business Policy Questions
- Did layoffs and outsourcing contribute to reducing the bank's knowledge and experience with its own systems?
- Why did recovery take several days?
- What are the bank's business continuity plans and how often are they tested?
- Why did this event happen in the first place? There is always a root cause -- what was it?
THE BOTTOM LINE - DON'T BLAME THE CIO
In a case like this, it's tempting to blame the CIO; after all, he or she is responsible for systems. However, it is not the CIO's fault if bank policies dictate layoffs and offshoring that result in lost skills. These are management issues with consequences across the organization, including in IT.
Government regulators must question the judgment and wisdom of decisions made by the bank's Board and CEO, to uncover the policies that created an environment in which critical IT changes could roll out without complete testing, verification, and clear paths to recovery.
Update 6/27/12: According to The Register (a sensationalist tech news site), an inexperienced computer operator in India caused the RBS failure. Apparently, a routine problem arose during the upgrade procedure, which is usually not a serious issue because administrators can roll back to a previous, stable version of the software. In this case, however, it seems the operator erroneously cleared the entire transaction job queue, kicking off a long and difficult process of reconstruction. The article adds: "A complicated legacy mainframe system at RBS and a team inexperienced in its quirks made the problem harder to fix".
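To see why clearing the queue was so damaging, here is a minimal sketch of the rollback principle the reports describe. This is purely illustrative Python, not CA-7's actual interface; the class, function, and job names are all hypothetical. The point is that a safe rollback must restore both the software version and the pending work:

```python
# Hypothetical sketch (not CA-7's actual mechanism): an upgrade routine
# that snapshots the pending job queue before changing anything, so a
# failed upgrade can restore the version AND the queued work.
from collections import deque
from copy import deepcopy

class Scheduler:
    """Toy stand-in for a workload automation system."""
    def __init__(self, version, jobs):
        self.version = version
        self.queue = deque(jobs)  # pending batch jobs

def upgrade_with_rollback(sched, new_version, upgrade_ok):
    """Attempt an upgrade; on failure, roll back version and queue."""
    saved_version = sched.version
    saved_queue = deepcopy(sched.queue)  # never discard queued transactions
    sched.version = new_version
    if not upgrade_ok:
        # Restore BOTH pieces of state. Per the reports, the fatal error
        # at RBS was clearing the job queue instead of restoring it.
        sched.version = saved_version
        sched.queue = saved_queue
    return sched

sched = Scheduler("7.1", ["payroll", "standing-orders", "card-settlement"])
upgrade_with_rollback(sched, "7.2", upgrade_ok=False)
print(sched.version)    # back on the old version
print(len(sched.queue)) # all three queued jobs preserved
```

In this toy model, losing the snapshot (or wiping the queue during rollback) would leave every pending transaction to be reconstructed by hand, which mirrors the multi-day backlog RBS reportedly faced.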