For the Reserve Bank of Australia, every minute of downtime can cost more than AU$40 million. But even with a business case on that scale, it took three attempts to get an effective systems management strategy in place.
The RBA's most visible function for many Australians is in setting interest rates as a means of controlling inflation -- people's curiosity about mortgage payments has a direct and measurable impact on its IT systems.
According to Peter Speranza, senior manager of infrastructure services at the RBA, on the first Wednesday morning of each month, when a new rates announcement is due, bandwidth utilisation on the RBA site rockets above 80 percent as borrowers log on seeking up-to-date information.
While supporting the activities of its economists and other staff members (it employs around 850 people) is important, perhaps the most critical task for the RBA is managing the Real-Time Gross Settlement (RTGS) system for interbank payments. Each day, the RTGS system handles AU$160 billion in payments.
"Forget three nines, four nines or five nines -- the system simply has to be up all the time. The country relies on it," Speranza said.
Underpinning those applications are two datacentres: one at its head office in Sydney and a disaster recovery site, comprising more than 300 servers in total. "We look after over half the infrastructure in the bank," Speranza told attendees during a presentation at CA Expo in Melbourne.
He measures the success of these systems using the "three ITYs": availability, reliability and scalability. However, evolving a systems management strategy which can handle those demands has been a long and arduous task.
The world that was
"In the 1990s, we bought two products which we thought would do enterprise systems management for us. Both times they failed," Speranza said. He didn't name either one.
The first system was widely used but complex and difficult to administer, so staff trained in it tended to leave quickly for higher-paying jobs elsewhere. The second performed somewhat better, but was maintained by external contractors, an approach with clear limitations when it came to staff commitment. "You only get out of this kind of thing what you're prepared to put into it," Speranza said.
For the third rollout, which began in 2000, the RBA wanted a system that would allow quick issue resolution without arguments over whether the root cause could be blamed on the comms team, the server group or application developers. "It took a long time to fix problems," Speranza said.
Early attempts to share problem ownership had not been successful. While there was a network monitoring system in place, network management was on an isolated terminal in a corner of the server room where few staff ever bothered to venture.
Alert sharing was also unsuccessful: staff often bulk-deleted notifications rather than investigating potential problems. Poorly defined default alerts exacerbated that apathy. "Out of the box alerts might as well be in hieroglyphics. They mean nothing," Speranza said.
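The complaint about "hieroglyphic" default alerts is essentially about translation: raw monitoring codes only get acted on when they name the affected service and the required response. A minimal sketch of that idea in Python (the alert codes and messages below are invented for illustration; the article does not describe the RBA's actual tooling):

```python
# Illustrative sketch: map raw, cryptic monitoring codes to plain-language
# messages naming the affected service and the action to take.
# All codes and messages here are invented, not the RBA's real alerts.

RAW_ALERT_MEANINGS = {
    "SNMP-LINK-DOWN-0x1F": "Settlement network link down -- page the comms on-call",
    "DISK-UTIL-HIGH": "Settlement database disk above 90 percent -- expand volume",
}

def translate_alert(raw_code: str) -> str:
    """Return a human-readable description for a raw alert code,
    falling back to an explicit 'unrecognised' message."""
    return RAW_ALERT_MEANINGS.get(raw_code, f"Unrecognised alert: {raw_code}")

print(translate_alert("DISK-UTIL-HIGH"))
```

Staff are far more likely to investigate "database disk above 90 percent" than to bulk-delete it, which is the behaviour change Speranza was after.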
A better reporting capability was also high on the agenda. "We don't need SLAs in the bank -- we have everything available all the time -- but we still need reports," he added.
Testing and planning
To ensure business-wide commitment, the infrastructure group partnered with the RBA's datacentre management team. Under the plan, infrastructure would build and test the new systems management platform, based on CA technology, but the datacentre team would manage it day-to-day.
Organising that partnership was much more complicated than actually installing the software, Speranza said. "Technology is easy; it's the process that's difficult."
Achieving effective management requires a broader view than just network level, he suggested. "We look at things as services. Forget network management -- that's easy."
A key consideration was ensuring that the system produced reliable, genuine alerts. "If people can't trust the system, they won't have any faith in it at all," Speranza said. After rollout, his team closely monitored all alerts for a month, to ensure that levels had been set correctly before handing off the system.
Automation of as much monitoring as possible was a major goal. Speranza's ultimate vision, not yet realised, is to monitor two screens of traffic-light-style dashboards, all contentedly glowing green, and with any occasional amber alerts resolving themselves before he has a chance to intervene.
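A traffic-light dashboard of the kind Speranza describes reduces to classifying each service's key metric against thresholds. A minimal sketch in Python (service names and threshold values are invented for illustration):

```python
# Illustrative sketch of a traffic-light service dashboard: classify a
# utilisation ratio into green/amber/red. Thresholds and service names
# are invented examples, not the RBA's configuration.

def traffic_light(utilisation: float, amber: float = 0.7, red: float = 0.9) -> str:
    """Classify a utilisation ratio (0.0-1.0) into a dashboard colour."""
    if utilisation >= red:
        return "red"
    if utilisation >= amber:
        return "amber"
    return "green"

services = {"settlement": 0.42, "web": 0.81, "mail": 0.95}
dashboard = {name: traffic_light(u) for name, u in services.items()}
print(dashboard)  # {'settlement': 'green', 'web': 'amber', 'mail': 'red'}
```

In Speranza's ideal, the amber entries would clear themselves as automated remediation kicked in, before an operator needed to act.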
Despite that enthusiasm for automation, Speranza advises against deploying specialised monitoring agents on anything but business-critical applications. For most standard servers, the existing built-in monitoring tools should be ample.
Staff training is also often overlooked in systems management projects, Speranza said. "You need to train all the technical people in how to use the system. They are a little bit intuitive, but there's a lot under there to discover."
He is also a firm advocate of compact, 30-minute training sessions. "Techies won't turn up for an hour-long session."
To minimise problems from unexpected new technology additions, the RBA system produces weekly reports on the overall network, highlighting any changes. In a similar vein, new additions to the management software are widely promoted. "Every time you do an upgrade, go and show people," he advised.
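The weekly change report described above amounts to diffing successive snapshots of the network inventory. A minimal sketch in Python (the hostnames are invented; the article does not say how the RBA's reports are generated):

```python
# Illustrative sketch of a weekly network change report: compare this
# week's device inventory against last week's and list additions and
# removals. Hostnames are invented examples.

last_week = {"db-01", "web-01", "mail-01"}
this_week = {"db-01", "web-01", "web-02"}

added = sorted(this_week - last_week)      # devices new this week
removed = sorted(last_week - this_week)    # devices that disappeared

print("Added:", added)      # Added: ['web-02']
print("Removed:", removed)  # Removed: ['mail-01']
```

Flagging the additions is what catches "unexpected new technology" before it becomes an unmonitored blind spot.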
Although RBA policy bars endorsing specific products, Speranza is happy to suggest that "you're going to need lots of them", and considerable tweaking to boot. "No company can give you all the products you need. They'll tell you they can, but they can't."
Future plans for the RBA include monitoring its mainframe via the same infrastructure, and automatically generating help desk tickets when a problem does occur. Desktop PC monitoring is also being considered. "At this stage, we have excluded desktops to get the system stable, but we may include it down the track," Speranza said.