The need for regulatory compliance is ever-present for Australia's banks, and it ranges across all facets of the organisation, from ledgers to fee structures, and includes the IT that underpins the bank's operations.
Speaking at Puppetize in Sydney recently, a pair of ANZ enterprise platform senior consultants, Nathan Kroenert and Boyd Adamson, detailed how the bank brought its fleet of over 7,000 Unix-based systems -- described as being in "various states of awesomeness" and running Solaris, AIX or a Linux flavour -- under control.
"We have about 22 regulatory bodies in different countries that oversee what's going on, and they will have their own sets of rules about how things have to be configured and requirements and so forth," Adamson said.
"So all of those have to get distilled down to essentially how a box, or certain requirements on a box, is configured. So some of that includes what packages need to be installed or not installed, what services have to be running or not running, various security settings in various areas."
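In Puppet terms, requirements like those distil into a handful of resources per host. A hypothetical sketch of what such a distilled rule might look like (package, service and path names are illustrative, not ANZ's actual policy; `file_line` comes from the puppetlabs-stdlib module):

```puppet
# Illustrative compliance baseline -- not ANZ's actual policy.
class profile::compliance_baseline {
  # Packages that must be installed, or must not be.
  package { 'audit':         ensure => installed }
  package { 'telnet-server': ensure => absent }

  # Services that have to be running or not running.
  service { 'auditd':
    ensure => running,
    enable => true,
  }

  # A security setting: forbid root logins over SSH.
  # file_line is provided by the puppetlabs-stdlib module.
  file_line { 'sshd_permit_root_login':
    path  => '/etc/ssh/sshd_config',
    line  => 'PermitRootLogin no',
    match => '^#?PermitRootLogin',
  }
}
```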
As in any large organisation, the team looking to introduce a new definition of sanity had to wrestle with legacy and technical debt.
Kroenert described how the team made use of Puppet as the central pillar of its automation and orchestration strategy, and started out with very broad rules, such as not allowing users to put certain files in certain places.
"Over time, you have projects that come along and say: 'Hey, we need this in that file'. So there's a bunch of that -- the human cost of dealing with that, which is: 'But we didn't have to do that before and you're going to make me do this!'," he said.
"Well, because you got away with stabbing someone once doesn't mean you should keep on stabbing them," Kroenert explained.
Prior to the move to Puppet, ANZ used a human-intensive method of determining compliance, which involved a BMC BladeLogic script that ran on demand and produced a report for someone to analyse, with tickets subsequently lodged for any non-compliance observed. Typically, the tickets were fixed by the teams who introduced the non-compliance, simply because it was an easier way of doing things.
"These terrible, terrible states of poor people having to log onto boxes, actually logging onto boxes to fix things -- and they log on 40 times to fix the same problem across 40 hosts," Kroenert said.
"That was not ideal, and worse yet ... I asked these people who were doing the wrong thing to fix that, but they were doing the wrong thing because it solved the problem for them earlier on in their development cycles, so that actually also caused some friction here and there."
According to the pair, the bank was compliant, but it was not scalable or straightforward. As for the question of whether the bank was provably compliant -- "hell no" was Kroenert's response.
"I don't want to sit down for six months with an auditor to explain to them how it is that we've gone from this system to this statement of its compliance," he said. "It's a really, really hard discussion to have, and for them to maintain the context throughout that entire discussion."
"The last thing you want to do is demonstrate that you don't know how it works, and if I'm saying: 'Well, this team passes it to this team, to this team, to this team', that becomes a really, really difficult problem for us to solve as an organisation."
Hence the decision to introduce Puppet and move from being able to report on compliance to being able to enforce compliance. But the team was not going to be able to stand up all-new infrastructure built the "right way" to make the rollout easy.
"We have 7,000-odd hosts with 6,999 different configurations across those hosts because that's what the business needed when they were built. So we were coming into that very much brownfields," Kroenert said.
"We had to deal with what we had as step one, which is kind of making it the hardest way possible, but that's where we were," he added.
Compounding the difficulty, the organisation had no central list of all its boxes, and Puppet masters needed to be installed in different security zones. But once the full extent of the fleet had been discovered and the agent pushed out, the next step was tackling low-hanging fruit such as NTP and SSHD configurations across the 7,000 servers spread between physical and virtual hardware.
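The NTP case is the classic package/file/service pattern in Puppet. A minimal sketch, assuming an RHEL-style `ntpd` and placeholder server names:

```puppet
# Minimal NTP enforcement -- server names are placeholders.
class profile::ntp {
  package { 'ntp':
    ensure => installed,
  }

  # Rewrite the config everywhere; restart the daemon when it changes.
  file { '/etc/ntp.conf':
    ensure  => file,
    content => "server ntp1.example.internal\nserver ntp2.example.internal\n",
    require => Package['ntp'],
    notify  => Service['ntpd'],
  }

  service { 'ntpd':
    ensure => running,
    enable => true,
  }
}
```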
Given the numbers involved, there was guaranteed to be some sort of breakage.
"When we're moving at speed, we have to be prepared that every now and then we're going to clip a gutter. And we did here and there, and there were cases where we'd say, 'Alright, well, we've just pushed this out. It looked great. We did a heap of testing. We tested in all of our little canary environments, and it looked great'," Kroenert said.
"But the reality is once you then hit 5,500 boxes that have all been built differently, some things might break and we had [management's] backing to say: 'Look, we're really sorry about that. We've pushed it to 5,500 nodes. We broke four, but look at the step forward that we're taking'."
"As long as we knew that we'd done it and we can put it back -- if we broke it with Puppet, we can fix it with Puppet," Adamson added. "We were okay."
The deployment was not without pushback, as teams had to come around to the idea of the automation play, and Puppet was often the whipping boy: because ANZ's environment has Puppet log to syslog, whenever an error appears in a log there is likely to be a Puppet entry just above it.
"I think my favourite was that the last log message was Puppet, but it was two weeks earlier," Adamson said.
Kroenert detailed how operations people would try to turn Puppet off -- nooping the service, or even deleting the Ruby interpreter or other binaries to stop it running -- in vain attempts to fix whatever issue they were dealing with.
"In truth, 99.9 percent -- and it is possibly higher than that -- was not Puppet at all," Kroenert said. "But it gave other parts of the organisation space to manoeuvre to say, 'Oh no, it's not us'."
The bank knew it had reached its goal when it was able to present a one-source-of-truth dashboard to its internal audit team that details the compliance of each host.
"Our internal audit guys are able to look at that as a portal and they can say: 'Hey, we need to transition this application set'. They look at the hosts. They say: 'Hey, they're all green, that looks great'. They're happy to push it straight through," Kroenert said.
"There is no discussion. There is no concern. There is no proving anything, because they've already looked in all of the code that underpins this. They know that if we say it's compliant, it's actually already compliant."
"So for us, that's removed a tremendous number of man hours -- I would say literally thousands of man hours a year of that very discussion having to happen. So for us, that is a tremendous win coming out of the compliance side."
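The pair did not describe how the dashboard is implemented, but a portal like that is typically fed from PuppetDB, which records the outcome of every agent run. Two PQL (Puppet Query Language) queries of the kind such a dashboard might issue:

```
# Hosts whose last run applied cleanly with nothing left to correct:
nodes { latest_report_status = "unchanged" }

# Hosts whose last run failed -- the red tiles on the dashboard:
nodes { latest_report_status = "failed" }
```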
Having moved to Puppet, and now controlling over 6,000 nodes from a single master, the pair recommended putting the orchestration master on physical hardware with plenty of physical CPU cores and a ton of memory.
"Lots of CPU, and real CPU -- not VM CPUs, not threads, not hyper-threaded threads," Kroenert explained. "Java, when you're dealing in hyper-threaded environments, punches each other's cache in the face all the time."
"Make sure Java has got enough memory. We scaled it up and up and up and up -- I think now we're running somewhere between 64 and 96GB," said Kroenert.
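For open-source Puppet Server, that heap is set through `JAVA_ARGS` in the service's defaults file. A sketch at the scale the pair describe, with the exact path and sizes varying by distribution and load:

```
# /etc/sysconfig/puppetserver (RHEL) or /etc/default/puppetserver (Debian)
# Fixed min/max heap so the JVM never resizes under load; sizes illustrative.
JAVA_ARGS="-Xms64g -Xmx64g"
```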