Horror story: Qld Health datacentre disaster

Summary: On 20 May, a brief electricity brown-out struck a Queensland Health datacentre, starting a chain of incidents that resulted in serious outages of more than 20 health applications. Read our blow-by-blow account of an event that constitutes every CIO's nightmare scenario.

(CERN Datacentre, CERN, Geneva image by Cory Doctorow, CC2.0)

The datacentre, located on the campus of Herston hospital, is believed to be one of three datacentres Queensland Health operates. It lost power for only a fraction of a second when two flooded Energex transformers failed at around 5:00pm that day, according to a source close to the incident. Uninterruptible power supplies kicked in to keep the servers up.

However, the brown-out tripped the chilled water system, cutting chilled water to the hospital campus. Because the system wasn't monitored, the datacentre support team didn't notice the loss of chilled water. A datacentre employee came on site to check that everything was running but, satisfied that nothing was wrong, left.

Only two of the datacentre's 10 air-conditioning units could fall back on refrigerant gas when chilled water wasn't available, meaning that although the rest of the units were still operating, they weren't cooling. The temperature in the datacentre began to rise.

Although people were called in to investigate the temperature rise, the chilled water problem wasn't found. Due to a DNS change made the day before the problems began, no alert messages were being sent to tell staff of server problems. Four hours after the brown-out, services began to suffer. On-call hospital staff were affected and complained. Soon after, a server shut down.

Many staff members didn't know the whereabouts of the air-conditioning specialist who had been called in, and he didn't answer his phone; it had taken the engineer three hours to arrive on site. Five hours after the brown-out, as more servers shut down with temperatures over 50 degrees, staff discovered that the chilled water pumps had not been operating. The problem was believed to be fixed.

In the face of a severe weather event, the IT staff involved were outstanding in their response to minimise the impact of this incident.

Ray Brown, acting CIO Queensland Health

Because the remote access system wasn't working, staff had to wait until they arrived at the datacentre before they could begin shutting down servers. Once there, they started to move systems over to an alternate datacentre, which in some cases caused brief user inconvenience. Some systems, however, could not be moved: their servers had no failover capability, and Queensland Health's virtual machine architecture didn't allow them to be shifted to a second datacentre.

The hospital's Cerner electronic medical record (patient administration) system was shut down by the hospital staff.

Six hours after the brown-out, the air conditioning was still not working. Although staff believed they had found the problem, more systems, including iPharmacy, shut down, until 75 per cent of applications were down and the datacentre reached 45 degrees.

Eight hours after the brown-out, the chilled water was finally brought back up. Nine hours after, the datacentre was back to normal and services could be restored. By nine o'clock the morning after the brown-out, all services had been restored.

Over the course of the problems, 12 applications suffered significant impact, with another 12 suffering minor impact. Three years ago, the datacentre was forced to shut down for the same reasons; afterwards, the team had been told it could not happen again.

When queried on the incident, Queensland Health acting CIO Ray Brown did not respond to a question on which facilities around the state the downed applications served. However, it is believed that Queensland Health's three datacentres provide services to multiple locations around the state.

He denied that there had been more than one incident over the past three years at the datacentre.

According to Brown, since several applications were relocated to the other datacentre, there was "minimal disruption" to services. "The majority of services impacted were available by 2:30am and all Queensland Health systems categorised as critical remained operational during this incident," he said.

"In the face of a severe weather event, the IT staff involved were outstanding in their response to minimise the impact of this incident. The ability of staff to physically attend the site was severely hampered by flooding in the area."

Lessons had been learned, according to Brown. Queensland Health was exploring options to remove its reliance on chilled water. It also intended to replace the remote access system by the third quarter of this year, and was undertaking a review of management tools and examining its crisis management plan.

Queensland Health has lost several chief information officers over the past several years. Long-time CIO Paul Summergreene had his contract terminated by the department in July 2008. Dr Richard Ashby filled his shoes for a short time before departing, leaving the chair vacant, with Brown currently leading the department's IT function in an acting capacity.

The news also comes as the Queensland Government flagged in the last state budget its intent to splurge hundreds of millions of dollars on health IT systems to support its e-health capability.

Suzanne Tindal


Talkback

9 comments
  • CITEC, anyone?

    Anyone fancy moving back to CITEC?
    anonymous
  • DIY Data Centres are not the future

    Their data centre is ill-conceived and reeks of DIY. Time to move forward with technology and colo it in a proper DC, or better yet get the servers hosted so somebody else can worry about it.
    anonymous
  • DR planning disaster

    What's more concerning is that they thought they had the ability to vMotion the virtual machines, but when the time came, they couldn't.

    What happened to the DR planning? Should this have been tested after the platform was rolled out?
    anonymous
  • soft skills

    don't worry, soft skills will fix this, we've got our best looking sales chick on the job
    anonymous
  • good read

    Nice article thanks Suzanne
    anonymous
  • indeed

    indeed
    anonymous
  • Very little facts......

    There were no application outages, as there were active redundant servers at the second data centre that took over when the failure occurred. Another fact is that there is no third data centre. Looks as if you should check your sources.
    anonymous
  • A Severe Weather Event?

    Two failed transformers does not equal a "Severe Weather Event", that is just bullshit.

    There are a million reasons why a transformer might fail, and when you are managing a critical datacentre it is not a question of "procedures for _if_ we lose power", but "procedures for WHEN we lose power".

    If only for gross dishonesty, Ray Brown should be sacked immediately.
    anonymous
  • Agreed.. Sack Him

    Sounds like a "Spin Doctor"; anyone in that position should have tested their DR procedure thoroughly, instead of relying on a vendor's sales pitch that it will work.

    I wouldn't let him anywhere near a Microsoft Small Business Server, let alone a Data Centre
    anonymous