Anatomy of a server-room meltdown

If you think your server is running fine just because the monitoring system has not sent any warning messages, think again. This cautionary tale shows just how disastrous a tiny fault in an old aircon unit can be.
Written by Matt Loney, Contributor
The following story is a cautionary tale for anyone who runs a server room.

Back in June, the UK experienced its first hot weekend of the year. One IT manager, who asked to remain anonymous in return for sharing the litany of horrors that followed (we'll call him Bob), spent Saturday and Sunday, like most people, enjoying the sunshine. Like most IT managers, Bob carries a phone to which his monitoring systems send text messages should anything go wrong in the server room. On this particular weekend, like most others, there were no text messages warning of any problems, and Bob spent a relaxing couple of days in the sun, safe in the knowledge that the servers back at work were humming quietly away.

Bob's weekend was spoilt only slightly on Sunday evening, when he tried to log on to his corporate email account but couldn't connect for some reason. Never mind, he thought, a switch must have failed. It will just need a quick reboot in the morning.

How wrong he was.

"I turned up to work on Monday morning," says Bob, "to find the whole comms room had gone down. When I opened the door the temperature was about 45 degrees (Celsius)."

When the temperature in a comms room reaches that level, there is only one explanation: the aircon has failed. "We had two units, which we thought provided redundant air conditioning," says Bob. "But when one seized, the second one was unable to cope with the load and so that one shut down too."

As if that wasn't bad enough, in the building where Bob's company is located, the main air conditioning is shut down at weekends to save money. Even in the winter, the offices can be pretty warm first thing on a Monday morning; in the summer they're stifling. So just imagine what it's like in a nicely insulated room with several dozen email, Web and application servers each churning out many hundreds of watts. As Bob put it, "The trouble with comms rooms is that when you switch the aircon off, they stop being a cool room and turn into an oven."

Obviously one of Bob's first jobs on Monday was to bring the temperature back down. The other, less obvious job (to anyone who has never had an aircon unit fail) was to start mopping up. "When the aircon switched off," says Bob, "moisture condensed in the pipes that lead to the units on the roof." As this moisture condensed, there was only one place for it to go: down the pipes, through the vents and onto the server-room floor. As for the temperature, says Bob: "On the Monday morning, we restarted the one working air conditioner and that began to have an effect. Then we looked for the cause of the equipment shutting down -- it turned out that the UPS had reached its critical temperature and powered down to protect itself."

There were actually two UPSes - one main one and a second, smaller one, for the monitoring system. The smaller of the two should survive at least 20 minutes after any power failure to send out text messages to support staff. This did not happen. The smaller UPS did not have a thermal shut-down - instead, it just fried.
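The weakness here is that silence was treated as good news: the monitoring system was expected to shout when something broke, so when it died too, nobody heard anything. One common way around this is a heartbeat check that runs somewhere outside the server room and raises the alarm when the room goes quiet. The following is only a minimal sketch of that idea, not a description of Bob's setup; the class name, the 300-second limit and the timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Hypothetical off-site watchdog: flags a site that has stopped reporting."""

    max_silence_s: float  # assumed longest acceptable gap between heartbeats
    last_seen_s: float    # timestamp (seconds) of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever the server room's monitor checks in.
        self.last_seen_s = now_s

    def is_silent(self, now_s: float) -> bool:
        # True when the gap since the last heartbeat exceeds the limit.
        return (now_s - self.last_seen_s) > self.max_silence_s


# Illustrative timestamps only.
watchdog = HeartbeatWatchdog(max_silence_s=300.0, last_seen_s=0.0)
watchdog.record_heartbeat(now_s=600.0)
on_time = watchdog.is_silent(now_s=800.0)    # 200-second gap
overdue = watchdog.is_silent(now_s=1000.0)   # 400-second gap
```

The point of the design is that the alert is triggered by the absence of a message, so it still fires if the monitoring server, its UPS or the whole room has gone down.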

By the end of the day the single aircon unit had brought the temperature back down, and the IT team thought they were OK and could survive on the single unit for the short term. After all, all they needed to do was call the aircon engineer and everything would be hunky-dory.

But life, as most of us know, is rarely that simple.

Read on to find out what went wrong next.

The first problem Bob faced was that his wasn't the only aircon unit that packed in on that first hot weekend of the year. The engineer wouldn't be available for another three days.

"We thought we'd be OK," says Bob, "as the second aircon unit seemed to be holding up on its own, so we thought we could survive the three days. Then, at 6.00pm, the building's main aircon shut off, increasing the load on our server room unit. The whole server room shut down again and, yet again, we got no text message from the monitoring system."

"When we came back in on Tuesday morning, the comms room was even hotter than on Monday but we managed to get a junior aircon engineer in." Now you'd think even a junior aircon engineer should be quite capable of dealing with a broken aircon unit, but again, life isn't that simple.

Because the server room only contained the heat exchanger, the engineer needed roof access to reach the main aircon unit. "The trouble was that nobody is allowed on the roof without an hour of safety instruction, a method statement from us, and 24 hours' notice. We clearly weren't going to get the broken aircon unit fixed that day," says Bob.

"At this point, we realised we had a major problem. What we thought were two aircon units running redundantly were actually required in parallel, but because nobody had switched them off since the server room was built seven years ago, nobody knew this."
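The difference between "two units" and "two redundant units" comes down to simple arithmetic: the cooling is only redundant if the units left standing after a failure can carry the whole heat load. A rough sketch of that check follows. The function name and the kilowatt figures are hypothetical, chosen purely to illustrate the calculation; the article gives no capacity figures for Bob's units.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """Return True if the room stays within cooling capacity after any one unit fails."""
    if len(unit_capacities_kw) < 2:
        # A single unit has nothing to fall back on.
        return False
    # The worst case is losing the largest unit.
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= heat_load_kw


# Hypothetical figures: two 8 kW units.
grown_load_ok = survives_single_failure([8.0, 8.0], heat_load_kw=12.0)
light_load_ok = survives_single_failure([8.0, 8.0], heat_load_kw=7.0)
```

With these made-up numbers, the pair copes with a 7 kW room after one failure but not with a 12 kW room, which is the trap of a load that creeps up over the years while the cooling stays the same.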

So Bob hired a 6kW portable aircon unit and stuck it inside the server room, with a pipe taking the hot air out through the server room door -- a short-term fix at best. Aside from being an obvious security risk, the open door also ruined the insulating effect of the server room. Nevertheless, Bob hoped it would work.

It didn't. "On Wednesday morning we came in and the same thing had happened again; our comms room was down, and this time it was hotter still; the small aircon unit simply had not coped," says Bob.

"So we had a choice. We could either increase the shut-off point on the UPS, or we could switch off some of the servers. In some circumstances, servers will switch themselves off as the temperature rises, but once the room temperature gets to 45 degrees it's only going to keep rising. So we started switching off every server that we could survive without, and hired a bigger portable aircon unit."
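Deciding which servers to switch off is easier if the order has been worked out in advance. Since nearly all the electrical power a server draws ends up as heat in the room, one approach is to shed the least essential machines first until the total draw fits within what the cooling can remove. The sketch below assumes exactly that; the server names, wattages, priorities and the 900 W cooling figure are invented for the example and do not come from Bob's comms room.

```python
def servers_to_switch_off(servers, cooling_capacity_w):
    """Pick servers to power down until the heat load fits the cooling capacity.

    servers: list of (name, power_draw_w, priority) tuples, where a lower
    priority number means a more essential machine (an assumed convention).
    Returns (names_to_switch_off, remaining_load_w).
    """
    load_w = sum(power for _, power, _ in servers)
    shed = []
    # Walk from least essential to most essential.
    for name, power, _ in sorted(servers, key=lambda s: s[2], reverse=True):
        if load_w <= cooling_capacity_w:
            break
        shed.append(name)
        load_w -= power
    return shed, load_w


# Hypothetical inventory.
inventory = [
    ("mail", 400, 1),
    ("web", 350, 2),
    ("archive", 250, 3),
    ("test", 300, 4),
]
shed, remaining_w = servers_to_switch_off(inventory, cooling_capacity_w=900)
```

Here the test and archive machines are switched off, leaving mail and web running at a combined 750 W.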

After four days of crashes, this stabilised the server room, even if the door was now even wider open to accommodate the thicker tube blowing even more hot air out into the offices.

On Thursday, Bob finally managed to get a senior person from the aircon company in to have a look at the broken unit on the roof. He traced the problem to a seized pump, for which there was no chance of repair. But, in what looked like a change of fortune, although this old model of aircon unit is no longer manufactured, the engineer somehow managed to locate one.

"We paid for it, and it was due to be delivered on the Friday, but when Friday came we got a call saying they had dropped it off the back of the lorry and cracked the pressure unit, which could not be repaired. We'd have to buy a new aircon unit instead." More paperwork, and more people on the roof.

Now the trouble with new aircon units - from our beleaguered manager's point of view - is that under EU regulations they have to use a new, eco-friendly coolant that works at different pressures and therefore requires thicker pipes. Bob's server room required 120 feet of pipes to channel coolant to and from the units on the roof.

Finally -- and we're half-way into the second week at this point -- Bob had a stroke of luck.

"They told us we can drain the old system and continue to use that coolant with the new aircon unit - it's just that they would not have been able to sell us the old type of coolant," he says.

But, just when it seemed like there could be a glimmer of hope on the horizon, the gods of misfortune went to work on Bob again. The second aircon unit on the roof, which had been working single-handedly for the past week and a half, collapsed under the pressure. Bob had returned the rental equipment that had been jamming that door open so - you guessed it - the comms room shut down yet again.

"So we were back to square one. We had to go and get another aircon unit." This, after several weeks of server crashes, finally fixed the problem.

Bob believes his company is not alone in facing such issues, given that many aircon units servicing comms and server rooms are probably by now about seven or eight years old, and may well have been forgotten about as servers are constantly upgraded.

The heat problem is made all the more frustrating because it has overtaken space as the major limiting factor on the potential power output of Bob's comms room. Servers may be getting smaller but the space saving means nothing if they can't be kept cool.

Bob is in the process of replacing ageing 8U servers, which run on relatively cool (and slow) PIIs and have a couple of 7,000 rpm disk drives, with 2U dual-processor servers that have Xeon processors and 15,000 rpm disk drives. But the server racks in Bob's comms room will never be full again. Although the 2U servers will do the job that is required of them, there is little prospect of taking full advantage of their small size and dramatically ramping up processing resources: the aircon simply won't take it.

In a final twist to the tale, Bob recently learned that even if he manages to sort the aircon issue, the building owners and the local power company are unable to supply any more power for any potential servers that could be slotted into the freed-up space.

If you've got a technology horror story that you want to get off your chest, contact me at:

