Anatomy of a server-room meltdown

If you think your server is running fine just because the monitoring system has not sent any warning messages, think again. This cautionary tale shows just how disastrous a tiny fault in an old air conditioner unit can be.
Written by Matt Loney, Contributor
The following story is a cautionary tale for anyone who runs a server room.

Back in June, the UK experienced its first hot weekend of the year. One IT manager, who asked to remain anonymous in return for sharing the litany of horrors that followed that weekend--but we'll call him Bob--spent Saturday and Sunday, like most people, enjoying the sunshine. Like most IT managers Bob carries a phone, to which his monitoring systems send text messages should anything go wrong in the server room. On this particular weekend, like most others, there were no text messages warning of any problems, and Bob spent a relaxing couple of days in the sun, safe in the knowledge that the servers back at work were humming quietly away.

Bob's weekend was only spoilt slightly on Sunday evening when he tried to log onto his corporate e-mail account but couldn't connect for some reason. Never mind, he thought, a switch must have failed. It will just need a quick reboot in the morning.

How wrong he was.

"I turned up to work on Monday morning," says Bob, "to find the whole comms room had gone down. When I opened the door the temperature was about 45 degrees Celsius (113 degrees Fahrenheit)."

When the temperature in a comms room reaches that level, there is only one explanation: the air conditioner has failed. "We had two units, which we thought provided redundant air conditioning," says Bob. "But when one seized the second one was unable to cope with the load and so that one shut down too."

As if that wasn't bad enough, in the building where Bob's company is located, the main air conditioning is shut down at weekends to save money. Even in the winter, the offices can be pretty warm first thing on a Monday morning; in the summer they're stifling. So just imagine what it's like in a nicely insulated room with several dozen e-mail, Web and application servers each churning out many hundreds of watts. As Bob put it, "The trouble with comms rooms is that when you switch the air conditioner off, they stop being a cool room and turn into an oven."

Obviously Bob's first job on Monday was to bring the temperature back down. "On the Monday morning, we restarted the one working air conditioner and that began to have an effect. Then we looked for the cause of the equipment shutting down -- it turned out that the UPS had reached its critical temperature and powered down to protect itself."

There were actually two UPSes--one main one and a second, smaller one for the monitoring system. The smaller of the two should have survived at least 20 minutes after any power failure to send out text messages to support staff. This did not happen. The smaller UPS did not have a thermal shut-down--instead, it just fried.
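
Bob's monitoring was built to report a power failure, not a slow rise in temperature, so by the time anything tripped there was nothing left to send the text. A minimal sketch of the alternative is below: alert on temperature well before the UPS reaches its cut-out. The threshold values and function names are assumptions made for illustration; the story does not say what Bob's equipment was rated for or how his monitoring was configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Both values are assumed for illustration, not taken from the story.
    warn_c: float = 30.0      # early warning: room is drifting out of range
    critical_c: float = 40.0  # should sit below the UPS thermal cut-out


def classify(temp_c, thresholds=Thresholds()):
    """Map a room temperature in Celsius to 'ok', 'warning' or 'critical'."""
    if temp_c >= thresholds.critical_c:
        return "critical"
    if temp_c >= thresholds.warn_c:
        return "warning"
    return "ok"


def alerts_for(readings, thresholds=Thresholds()):
    """Return (index, level) for each change into a non-ok level.

    Alerting only on a change of level avoids sending a text message
    for every single reading while the room stays hot.
    """
    alerts = []
    previous = "ok"
    for index, temp_c in enumerate(readings):
        level = classify(temp_c, thresholds)
        if level != "ok" and level != previous:
            alerts.append((index, level))
        previous = level
    return alerts
```

Fed a weekend's worth of rising readings such as `[22, 26, 31, 35, 41, 45]`, `alerts_for` flags the third reading as a warning and the fifth as critical, which in Bob's case would have meant a text message while the UPS was still alive to power the sender.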

By the end of the day the single air conditioner unit had brought the temperature back down. The IT team thought they were OK, and that they could survive on the single unit for the short term. After all, all they needed to do was call the air conditioner engineer and everything would be hunky-dory.

But life, as most of us know, is rarely that simple. Read on to find out what went wrong next.

The first problem Bob faced was that his wasn't the only air conditioner unit that packed in on that first hot weekend of the year. The engineer wouldn't be available for another three days.

"We thought we'd be OK," says Bob, "as the second air conditioner unit seemed to be holding up on its own, so we thought we could survive the three days. Then, at 6.00pm, the building's main air conditioner shut off, increasing the load on our server room unit. The whole server room shut down again and, yet again, we got no text message from the monitoring system.

"When we came back in on Tuesday morning, the comms room was even hotter than on Monday but we managed to get a junior air conditioner engineer in." Now you'd think even a junior air conditioner engineer should be quite capable of dealing with a broken air conditioner unit, but again, life isn't that simple.

Because the server room only contained the heat exchanger, the engineer needed roof access to reach the main air conditioner unit. "The trouble was that nobody is allowed on the roof without an hour of safety instruction, a method statement from us, and 24 hours' notice. We clearly weren't going to get the broken air conditioner unit fixed that day," says Bob.

"At this point, we realized we had a major problem. What we thought were two air conditioner units running redundantly were actually required in parallel, but because nobody had switched them off since the server room was built seven years ago, nobody knew this."

So Bob hired a 6kW portable air conditioner unit and stuck it inside the server room, with a pipe taking the hot air out through the server room door--a short-term fix at best. Aside from being an obvious security risk, the open door also ruined the insulation of the server room. Nevertheless, Bob hoped it would work.

It didn't.

"On Wednesday morning we came in and the same thing had happened again; our comms room was down, and this time it was hotter still; the small air conditioner unit simply had not coped," says Bob.

"So we had a choice. We could either increase the shut-off point on the UPS, or we could switch off some of the servers. In some circumstances, servers will switch themselves off as the temperature rises, but once the room temperature gets to 45 Celsius (113 Fahrenheit) it's only going to keep rising. So we started switching off every server that we could survive without, and hired a bigger portable air conditioner unit."

After four days of crashes, this stabilized the server room, even if the door was now even wider open to accommodate the thicker tube blowing even more hot air out into the offices.
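
The whole week comes down to one sum: cooling capacity against heat load. Two units are only redundant if either one can carry the room alone, and nobody had checked that in seven years. A rough sketch of the check follows; the function name and every figure used with it are hypothetical, since the story gives neither the capacity of Bob's units nor the heat output of his servers.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """Return True if the room stays within cooling capacity after
    the largest single unit fails.

    unit_capacities_kw: cooling capacity of each unit, in kW
    heat_load_kw: heat given off by the equipment, in kW
    """
    if len(unit_capacities_kw) < 2:
        # With one unit (or none) there is nothing left to fall back on.
        return False
    # Worst case: the biggest unit is the one that seizes.
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= heat_load_kw
```

With two hypothetical 7kW units and a 10kW heat load, `survives_single_failure([7.0, 7.0], 10.0)` returns `False`: the units are running in parallel, not redundantly. Switch off enough servers to bring the load down to 6kW and the same call returns `True`, which is in effect what Bob did by hand.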

On Thursday, Bob finally managed to get a senior person from the air conditioner company in to have a look at the broken unit on the roof. He traced the problem to a seized pump, for which there was no chance of repair. But, in what looked like a change of fortune, although this old model of air conditioner unit is no longer manufactured, the engineer somehow managed to locate one.

"We paid for it, and it was due to be delivered on the Friday, but when Friday came we got a call saying they had dropped it off the back of the lorry and cracked the pressure unit, which could not be repaired. We'd have to buy a new air conditioner unit instead." More paperwork, and more people on the roof.

Now the trouble with new air conditioner units--from our beleaguered manager's point of view--is that under EU regulations they have to use a new, eco-friendly coolant that works at different pressures and therefore requires thicker pipes. Bob's server room required 120 feet of pipes to channel coolant to and from the units on the roof.

Finally--and we're half-way into the second week at this point--Bob had a stroke of luck.

"They told us we can drain the old system and continue to use that coolant with the new air conditioner unit--it's just that they would not have been able to sell us the old type of coolant," he says.

But, just when it seemed like there could be a glimmer of hope on the horizon, the gods of misfortune went to work on Bob again. The second air conditioner unit on the roof, which had been working single-handedly for the past week and a half, collapsed under the pressure. Bob had returned the rental equipment that had been jamming that door open so--you guessed it--the comms room shut down yet again.

"So we were back to square one. We had to go and get another air conditioner unit." This, after several weeks of server crashes, finally fixed the problem.

Bob believes his company is not alone in facing such issues, given that many air conditioner units servicing comms and server rooms are probably by now about seven or eight years old, and may well have been forgotten about as servers are constantly upgraded.

The heat problem is made all the more frustrating because it has overtaken space as the major limiting factor on the potential power output of Bob's comms room. Servers may be getting smaller but the space saving means nothing if they can't be kept cool.

Bob is in the process of replacing ageing 8U servers, which run on relatively cool (and slow) PIIs and have a couple of 7,000 rpm disk drives, with 2U, dual-processor servers that have Xeon processors and 15,000 rpm disk drives. But the server racks in Bob's comms room will never be full again. Although the 2U servers will do the job that is required of them, there is little prospect of taking full advantage of their small size and dramatically ramping up processing resources: the air conditioner simply won't take it.

In a final twist to the tale, Bob recently learned that even if he manages to sort the air conditioner issue, the building owners and the local power company are unable to supply any more power for any potential servers that could be slotted into the freed-up space.

If you've got a technology horror story that you want to get off your chest, contact me at: matt.loney@zdnet.co.uk
