The Australian Taxation Office (ATO) has once again found itself at the centre of an investigation, following a tumultuous 18 months of IT-related incidents and system outages plaguing the agency.
While previous probes focused on its physical equipment, the Australian National Audit Office (ANAO) on Tuesday called the taxation office out for falling short on service commitments, particularly where cloud is concerned, noting that a year-old agreement with Amazon Web Services (AWS) does not include service level provisions.
"This contract exposes the ATO to contractual and operational risks in the absence of measurable service levels," ANAO wrote in its report [PDF], Unscheduled taxation system outages.
In assessing whether the ATO has effectively responded to recent unscheduled IT system outages, ANAO revealed the ATO began to set up cloud computing contracts in 2016 and now has three separate agreements: one with Macquarie Telecom for its MacGov cloud, in place since May 2016; one for Microsoft's Azure; and the AWS contract, which began in December 2016 -- days after the first outage brought the ATO's online services down.
The three cloud contracts came eight years after an ICT Sourcing Program led to contracts for three separate "bundles" of services: end-user computing, contracted to Leidos; managed network services, contracted to Optus; and centralised computing, contracted to DXC Technology -- formerly Hewlett Packard Enterprise (HPE).
In probing the ATO's IT service measures, ANAO found only the MacGov contract had assessment summaries in place -- and that was only for two of the four key elements ANAO had investigated. Where physical kit was concerned, ANAO seemed pleased with the documentation in place.
"The three major bundles of IT contracts incorporated a Performance Framework in their contractual service level agreements. Consistent with that framework, the service measures were generally well specified across the categories of: Service indicators; service monitoring, and reporting; critical system deliverables; and commercial assessments," the report reads.
The three bundle contracts are due for renewal this year, which ANAO said provides the ATO with an opportunity to reassess its IT service measurement approach, and where possible implement common approaches, at least in terms of "reflecting tolerances that align with the IT outage service standards that the ATO has committed to develop".
"Such an approach would support the ATO in its efforts to use digital technology and online services effectively and efficiently in the administration of the taxation and superannuation systems," it added.
Of the IT-related incidents plaguing the taxation office, two were significant system failures: the first occurred in December 2016, and a subsequent outage in February 2017 resulted from work to fix the fibre cabling after the first.
A report from the ATO into the outages revealed the HPE-owned and operated SAN could not handle more than one drive or cage failure thanks to a design decision taken by the tech giant. An analysis of logs from the six months before the incident showed a number of alerts indicating problems with the SAN.
"Since May 2016, at least 77 events related to components that were observed to fail in the December 2016 incident were logged in our incident resolution tool," the ATO said previously. "We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN."
The report described HPE's lack of preparation for an event of the kind experienced by the ATO in December 2016.
"Recovery procedures for applications in the event of a complete SAN outage had not been defined or tested by HPE," the ATO said.
Regarding the non-identification of SAN risks, ANAO highlighted that the system recovery tools used by the ATO to restore its data management, system monitoring, and backup/restore systems were in the same datacentre, on the affected SAN.
"The system failure meant that these tools were unavailable, and there were no backup or redundant system recovery tools available on other ICT systems to detect and analyse the incident, and to support efforts to recover and restore services," ANAO wrote.
In the second major outage, a data card was dislodged during the repair work, causing the SAN to behave in much the same manner as in the December incident. In both cases, the SAN was unable to automatically restore itself and shut down to preserve data.
In the February incident, the ATO website remained up, as it had been moved off the SAN and was hosted in a cloud environment.
As a result of the incidents, the ATO rebuilt its storage solution with a new 3PAR, and decommissioned the old one in July for forensic analysis.
"The December 2016 and February 2017 incidents highlight that the ATO did not have a sufficient level of understanding of system failure risks," ANAO's report added. "The ATO's risk management and BCM [business continuity management] processes did not include an assessment of risks associated with storage area networks, which were a potential single point of failure. Moreover, BCM processes were limited in planning for critical infrastructure and ICT system failure to the datacentres."
As a consequence, ANAO said the ATO, along with DXC and Leidos, was not prepared for the possibility of complete system failure caused by storage failure. It also found the ATO did not have a secondary enterprise system in place, other than a disaster recovery procedure.
It also reported that at that time, cloud services were considered for performance purposes but not fully implemented.
Leidos, ANAO said, also had not identified that the SANs were a single point of failure.
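The shared-fate problem ANAO describes, where the recovery tools sat on the same SAN as the systems they were meant to restore, is the kind of thing a simple inventory check can surface. The following is a minimal sketch only: the system names, platform names, and the `shared_fate` helper are hypothetical and are not drawn from the ANAO report or the ATO's environment.

```python
from collections import defaultdict

# Hypothetical inventory: each system mapped to the storage platform it runs on.
# All names are illustrative assumptions, not taken from the ANAO report.
PLACEMENT = {
    "online-services": "san-primary",
    "backup-restore": "san-primary",
    "system-monitoring": "san-primary",
    "data-management": "san-primary",
    "public-website": "cloud-hosting",
}

# Hypothetical set of systems that count as recovery tooling.
RECOVERY_TOOLS = {"backup-restore", "system-monitoring", "data-management"}


def shared_fate(placement, recovery_tools):
    """Return platforms that host production systems AND every recovery tool.

    If such a platform fails, the tools needed to diagnose and restore
    the outage fail with it.
    """
    by_platform = defaultdict(set)
    for system, platform in placement.items():
        by_platform[platform].add(system)

    # Only consider recovery tools that actually appear in the inventory.
    known_tools = recovery_tools & set(placement)

    risky = {}
    for platform, systems in by_platform.items():
        production_here = systems - known_tools
        tools_elsewhere = known_tools - systems
        if production_here and known_tools and not tools_elsewhere:
            risky[platform] = sorted(systems)
    return risky


if __name__ == "__main__":
    for platform, systems in shared_fate(PLACEMENT, RECOVERY_TOOLS).items():
        print(f"{platform}: no recovery tooling survives its failure ({systems})")
```

In this illustrative inventory, moving even one recovery tool to a separate platform clears the flag, which mirrors the report's point about the absence of backup or redundant recovery tools on other ICT systems.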
ANAO, however, said the ATO's responses to the system failures and unscheduled outages were "largely effective", despite inadequacies in business continuity management planning relating to critical infrastructure.
Making a total of three recommendations, ANAO has asked the ATO to also update its BCM, IT service continuity management (ITSCM), and risk management frameworks to "improve and better integrate the identification and treatment of risks to critical infrastructure that may lead to system failures".
The final recommendation requests the government entity "determines the level of availability of services associated with its ICT systems to include in service standard(s) and subsequently reports performance against those standard(s)".
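Reporting performance against an availability standard comes down to simple arithmetic: total time in the reporting period, minus unscheduled downtime, as a percentage. The sketch below shows one way that could be calculated. The function names, the 99.5 percent target, and the outage window are all assumptions made for illustration; the ANAO report does not prescribe a target or a formula.

```python
from datetime import datetime


def availability_percent(period_start, period_end, outages):
    """Percentage of the reporting period during which a service was up.

    outages is a list of (start, end) datetime pairs, assumed not to overlap.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


def meets_standard(availability, target_percent):
    """True if measured availability is at or above the target."""
    return availability >= target_percent


if __name__ == "__main__":
    # Hypothetical 30-day reporting period with one invented outage
    # of 7 hours 12 minutes.
    start = datetime(2018, 4, 1)
    end = datetime(2018, 5, 1)
    outages = [(datetime(2018, 4, 10, 9, 0), datetime(2018, 4, 10, 16, 12))]

    measured = availability_percent(start, end, outages)
    target = 99.5  # assumed target, not an ATO figure
    print(f"availability {measured:.2f}% against target {target}%: "
          f"{'met' if meets_standard(measured, target) else 'not met'}")
```

A single outage of just over seven hours in a 30-day month works out to 99 percent availability, which would miss the assumed 99.5 percent target. That is the sort of tolerance a published service standard would make explicit.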
Following the two major incidents, the ATO has experienced multiple outages and mainframe reboots, with the most recent outage in September affecting its online services.
Despite the HPE equipment being at the centre of the first outage and a handful of resulting issues, the ATO contracted DXC Technology for the provision of a further AU$735 million in "centralised computing" in December 2017, bringing the total value of the contract with the tech giant to AU$1.47 billion.