ATO SAN could not handle more than one drive or cage failure thanks to HPE design

The ATO's report into the December 2016 SAN failure places the blame squarely on the design of HPE's turnkey SAN product.

[Image: ato-san-screenie.png. Credit: ATO]

A series of design decisions taken by Hewlett Packard Enterprise (HPE) doomed the 3PAR 20850 SAN solution sold to the Australian Taxation Office (ATO) in the event of any failure beyond a single drive or cage, according to an ATO report [PDF] into a series of storage outages.

"The SAN was neither designed nor built to cater for greater than single drive failure or single cage failure," the report said. "This established a risk to our business due to the large number of business systems that depended on the SAN for normal operation."

While the exact root cause of the outage awaits a report from HPE due to arrive in "late 2017", the ATO's report placed blame on the degradation of a number of fibre optic cables used within the SAN.

Under the arrangement between the ATO and HPE, the SAN is owned and operated by HPE, with the ATO having no direct access to it. As noted by the report, an analysis of logs from the six months before the incident showed a number of alerts indicating problems with the SAN.

"Since May 2016, at least 77 events related to components that were observed to fail in the December 2016 incident were logged in our incident resolution tool," the ATO said.

"We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN."

According to the ATO's timeline of the incident, the December 12 outage started at 12:40am, when disks began to enter a preserved state to prevent data being deleted, leaving them effectively inaccessible to ATO applications.

By 3:35am, 455 out of 3063 drives were in a preserved state, and the firmware on the drives was preventing them from being rebooted. Three and a half hours later, HPE decided to escalate the issue to a "Priority 1 incident".
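To put those figures in proportion, the arithmetic can be sketched as follows. This is a rough illustration only: the year is taken from the December 2016 date given in the report, "three and a half hours later" is treated as exactly 3 hours 30 minutes after the 3:35am mark, and the variable names are our own.

```python
from datetime import datetime, timedelta

# Times taken from the ATO timeline as reported above.
outage_start = datetime(2016, 12, 12, 0, 40)        # 12:40am
preserved_snapshot = datetime(2016, 12, 12, 3, 35)  # 3:35am

# Assumption: "three and a half hours later" is exactly 3h30m after 3:35am.
escalation = preserved_snapshot + timedelta(hours=3, minutes=30)

preserved_drives = 455
total_drives = 3063
preserved_share = preserved_drives / total_drives

print(f"Share of drives in a preserved state: {preserved_share:.1%}")
print(f"Escalated to Priority 1 at: {escalation:%H:%M}")
print(f"Time from first preserved disks to escalation: {escalation - outage_start}")
```

On those assumptions, roughly 15 percent of the drives were in a preserved state, and the Priority 1 escalation came a little under six and a half hours after the first disks went offline.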

Work continued throughout the Christmas and New Year break, and a subsequent outage in February was a result of work to fix the fibre cabling. During that work, a data card was dislodged, causing the SAN to behave in much the same manner as in the December incident. In both cases, the SAN was unable to automatically restore itself and shut down to preserve data.

In the February incident, the ATO website remained up, as it had been moved off the SAN and was hosted in a cloud environment.

Over Easter, the cables were replaced, and the alerts ended.

The SAN solution, which consisted of a primary 3PAR SAN in Sydney and a backup 3PAR in western Sydney, was designed for manual failover of applications, and had a daisy-chained five-cage configuration that allowed errors to spread across cages during a failure.

"Full automated fail‑over for the entire suite of applications and services in the event of a complete SAN failure in Sydney was not part of the storage solution for the SAN. The cost of automatic fail‑over systems, as they exist in some areas of critical infrastructure or in large financial institutions, is very high."

Most damning, though, was HPE's lack of preparation for an event of the kind experienced by the ATO in December.

"Recovery procedures for applications in the event of a complete SAN outage had not been defined or tested by HPE," the ATO said.

As a result of the incidents, the ATO has rebuilt its storage solution with a new 3PAR. Once data from the existing 3PAR SAN has been transferred, the old SAN will be decommissioned in July for forensic analysis.

"The newly built data storage system which includes enhanced technology consists of a four part storage configuration and increased data replication, which provides the appropriate back‑up and fail‑over abilities as well as enabled monitoring and resilience features," the report said.

Last week, Commissioner of Taxation Chris Jordan said the system was designed for performance instead of stability, and a number of monitoring and resilience features were not enabled.

"This particular SAN configuration leverages a feature known as wide‑striping which is designed to significantly improve performance by reading and writing blocks of data to and from multiple drives at the same time, preventing single‑drive performance bottlenecks," the report confirmed.

"When several physical disk drives were impacted by a drive firmware issue which prevented those drives from re‑booting, the result was that a small number of drives temporarily and in some cases permanently prevented access to a significant amount of application data. This also had the effect of extending the duration and complexity of the recovery effort."
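Why a relatively small number of unavailable drives can block access to a large amount of data under wide-striping can be illustrated with a simplified probability model. The sketch below is not a description of how 3PAR lays out data: it assumes each volume is striped over a uniformly random set of drives, ignores RAID redundancy entirely, and uses hypothetical stripe widths. Only the drive counts (455 of 3063) come from the report.

```python
from math import comb


def fraction_of_volumes_touched(total_drives: int, failed_drives: int, stripe_width: int) -> float:
    """Probability that a volume striped over `stripe_width` randomly chosen
    drives includes at least one unavailable drive.

    Simplified model: uniform random placement, no RAID redundancy.
    """
    if not 0 <= failed_drives <= total_drives:
        raise ValueError("failed_drives must be between 0 and total_drives")
    if not 1 <= stripe_width <= total_drives:
        raise ValueError("stripe_width must be between 1 and total_drives")
    # Chance that every drive in the stripe is drawn from the healthy drives.
    untouched = comb(total_drives - failed_drives, stripe_width) / comb(total_drives, stripe_width)
    return 1.0 - untouched


if __name__ == "__main__":
    # Drive counts from the ATO timeline; stripe widths are illustrative only.
    for width in (1, 4, 16, 64):
        share = fraction_of_volumes_touched(3063, 455, width)
        print(f"stripe width {width:>2}: {share:.1%} of volumes touch an unavailable drive")
```

In this model, the share of volumes touching at least one unavailable drive climbs quickly as the stripe width grows. That is the trade-off the report describes: striping widely spreads load across many drives, but it also spreads each volume's exposure to any one drive becoming inaccessible.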

Jordan also admitted it took longer than it should have to restore the SAN, because the recovery tools were kept on the failed SAN.

Following the incidents, the ATO has moved its data management, monitoring, and recovery systems into a separate, independent storage area to remove the dependency on the HPE SAN.
