Key questions on the massive RBS / NatWest IT failure
Summary: Despite its massive IT failure, RBS has not released sufficient detail and many questions remain.
The recent situation at UK banking giant, Royal Bank of Scotland (RBS), will certainly go down in history as one of the most disruptive IT failures of all time. The massive impact of this failure continues to be felt by banking customers a week after the computer disruption first interfered with their gaining access to funds. Unfortunately, RBS has not released sufficient detail and many questions remain.
Also read:
ZDNet: RBS Bank joins the IT failures 'Hall of Shame'
ZDNet: RBS gives more detail on IT failure train wreck
The Guardian: How NatWest's IT meltdown developed
Although RBS (and its operating units NatWest and Ulster Bank) has revealed little information about what caused the problems, new details have emerged in London newspaper The Guardian. Reports indicate that the failure occurred when RBS computer operators tried to upgrade the bank's workload automation system, which is based on a product called CA-7 from CA Technologies.
The upgrade initially seemed to work as expected, but within hours other "guardian systems" discovered anomalies in batch jobs following the upgrade. Although technicians performed the failed upgrade last Tuesday night, they were not able to complete a successful batch run until Friday. By the time operators finished a successful run, millions upon millions of customer transactions were waiting to be processed. Customers will continue to experience problems until the bank works through this massive backlog of transactions.
WHAT'S GOING ON?
We don't know, and that's the problem. RBS has released only sketchy details of what caused the problem and why it took so long to resolve. To understand the full scope of the event and its aftermath, RBS must answer questions like these:
Upgrade Questions
- How detailed are the procedures governing patches and upgrades to production systems?
- Did the operator follow these procedures or deviate at all?
- To what extent was this upgrade tested in a production-style environment?
- Who installed the upgrade?
- How much CA-7 experience did the upgrade installer possess?
- How much RBS-specific experience did the installer possess?
Software Recovery Questions
- How did the bank first learn of the problem?
- What procedures did the operators use to solve the problem initially?
- Who performed these problem resolution procedures?
- What was their level of experience with the software, RBS processes, and with the broader RBS technology architecture and environment?
- Why did the rollback procedure take so long?
- How did the team finally isolate and solve the problem?
Business Policy Questions
- Did layoffs and outsourcing contribute to reducing the bank's knowledge and experience with its own systems?
- Why did recovery take several days?
- What are the bank's business continuity plans and how often are they tested?
- Why did this event happen in the first place? There is always a cause, so what was it?
THE BOTTOM LINE - DON'T BLAME THE CIO
In a case like this, it's tempting to blame the CIO; after all, he or she is responsible for systems. However, it is not the CIO's fault if bank policies dictate layoffs and offshoring that result in lost skills. These are management issues with consequences across the organization, including in IT.
Also read: Who's accountable for IT failure? (part one) Who's accountable for IT failure? (part two)
Government regulators must question the judgment and wisdom of decisions made by the bank Board and CEO, to uncover policies that created an environment in which critical IT tasks could roll out without complete testing, verification, and clear paths to recovery.
Update 6/27/12: According to The Register (a sensationalist tech news site), an inexperienced computer operator in India caused the RBS failure. Apparently, a routine problem arose during the upgrade procedure, which is usually not a serious issue because administrators can roll back to a previous, stable version of the software. In this case, however, it seems the operator erroneously cleared the entire transaction job queue, kicking off a long and difficult process of reconstruction. The article adds: "A complicated legacy mainframe system at RBS and a team inexperienced in its quirks made the problem harder to fix".
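To make the rollback idea concrete, here is a minimal sketch of an upgrade step that snapshots a job queue first and restores it if the upgrade fails. Everything in it is a hypothetical illustration: `JobQueue`, `upgrade_with_rollback`, `faulty_upgrade`, and the job names are invented for this example and say nothing about how CA-7 or the RBS systems actually work.

```python
import copy


class JobQueue:
    """Hypothetical in-memory stand-in for a batch scheduler's job queue."""

    def __init__(self, jobs):
        self.jobs = list(jobs)

    def snapshot(self):
        # Return an independent copy so later changes cannot touch the backup.
        return copy.deepcopy(self.jobs)

    def restore(self, saved):
        # Replace the current queue contents with the saved copy.
        self.jobs = copy.deepcopy(saved)


def upgrade_with_rollback(queue, upgrade):
    """Run `upgrade`; if it raises, restore the queue and report failure."""
    saved = queue.snapshot()
    try:
        upgrade(queue)
    except Exception:
        queue.restore(saved)
        return False
    return True


def faulty_upgrade(queue):
    # Assumed failure mode for illustration: the queue is wiped, then the
    # upgrade aborts.
    queue.jobs.clear()
    raise RuntimeError("upgrade failed")


queue = JobQueue(["payroll", "direct-debits", "interest"])
ok = upgrade_with_rollback(queue, faulty_upgrade)
print(ok, queue.jobs)
```

The point of the sketch is only that a rollback is as good as the state captured before the change. If the queue is cleared and no snapshot exists, the work has to be reconstructed by hand.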

Talkback
Great questions that I don't think will ever be answered...
This combined with the fact that you can't shut the systems down for business continuity reasons during the recovery process is why it can take days to fully clean up.
Should the CIO take the largest share of the blame? I believe so. Even if the business steer is to reduce cost, the CIO is accountable for choosing the vendors/partners, overseeing them, and advising the board of the associated risks.
One thing is for sure... if it was a cost cutting exercise then years of savings are likely to be wiped out in one go...
A huge lesson for RBS nonetheless
The old Tie Brigade ain't what it used to be.
Fast forward to today, and the CEO boys running anything British are friends of friends' fathers who get their peerages and knighthoods with a nudge and a wink, utterly without any merit for the job, let alone aptitude. They are being made a mockery of by johnny foreigner, who comes in and offers them shiny baubles for the British company they ran into the ground by listening only to each other, and not to those of "lower class" than themselves, despite them having vastly superior intellects, knowledge and, dare I say it, less ego.
The main failure, truth be told, is arrogance in the British Ruling Elite, still thinking they have little competition. When in fact, they have been utterly outed for the charlatans they have always been.
It's worth bearing in mind that the previous CEO of this bank actually had his knighthood taken off him. Poor little mite, how badly that must have gone down at the polo club. Fred the Shred, he was called.
The "lower classes" were forced to bail out this bank, almost completely. He walked away with billions in bonuses, and millions in pensions.
The new "krew" came in, decided they wanted to keep their bonuses, and so sacked the people who knew what they were doing.
Arrogance, ineptitude, and sheer criminal negligence.
These "chaps" should be in prison, but that would ruin Christmas lunch with the judges and all the other cronies.
Guardian article comments
He says the batch systems are largely written in IBM Assembler. Now I acknowledge that these systems were not the cause of the problem (it was the scheduler configuration files) but even so I would say this points to a lack of investment in more modern technologies that would (hopefully) be more maintainable than the existing systems.
investment in more modern technologies...
As someone who used to program in COBOL
But I would agree there would be a huge amount of work involved in replacing the existing systems, and also considerable risk.
If someone were suggesting replacing the assembler and COBOL with Java then I would agree, they would never make it.
"The Register (a sensationalist tech news site)"
Great Article