Server war stories: webcast

Summary: What's your most outlandish server disaster story?

How many server disaster stories (or close calls) do you think the average system administrator has under their belt?

We're guessing the answer is: a lot. Everyone who works with servers has heard such a story or lived through a breakdown first-hand. In a long career, it's impossible not to. Often, it's no one's fault; it's just a fact of life.

On September 13, ZDNet hosted a webcast in which a panel of experts reminisced about some of their best server war stories — the ones where everything suddenly went pear-shaped, until someone found out what went wrong and fixed it.

Live from Microsoft TechEd on the Gold Coast, Queensland, we were joined by IT industry analyst and strategist Sam Higgins, Microsoft senior program manager and "virtual PC guy" Ben Armstrong, South Australia Water senior IT architect Pete Calvert, and Fastlane Asia Pacific enterprise solutions manager Erdal Ozkaya, who shared their biggest disasters and close calls.

The experts talked about what caused the problem, what they did to make things right, and how they prevented it from happening again.

They also opened up about what they hate most about servers, new features they've loved or loathed, and the things they wish they could change.

Topics: Servers, Dell, Hardware, Hewlett-Packard, IBM, Microsoft

Talkback

4 comments
  • Server war stories

    It happened just before I said to my non-IT manager, "It's never a good idea to uninstall a Windows service pack, but if you order me to do it . . . I will." (All this because we had a Linux server running such an old OS that it could no longer be seen in the Windows domain once the new service pack had been put on all the Windows servers. Oh yeah, it was still running and accessible via Webmin.) Needless to say, I spent the next day or so getting the Windows server restored and moving files off the old Linux machine so folks could physically access them. Oh yeah, did I mention that they didn't want to pay for backup client software to back up open databases? Sweet! 'Nough said.
    Zaikenburg
  • War stories...

    One of the worst I dealt with had to do with an aging, badly maintained IBM RS6000 M80 with something like forty 18GB SSA hard disks. One fine day the boot disk crashed and burned, taking down several instances of PeopleSoft running on it.

    The server had NO maintenance contract and it took two days of negotiations with IBM to get them to come and replace the hard disk.

    The AIX version was something like 10 years old and there was no install media; luckily, there was a bootable tape with an OS backup, which took almost a day to install.

    There were about four Oracle database instances on the server and two instances of PeopleSoft. After we managed to start three of the databases, we found that the fourth and largest had been corrupted beyond repair. It took close to a day to retrieve the backup from tape and rebuild the database.

    One of the PeopleSoft instances was fine. The other (the one that depended on the damaged Oracle database) started properly, but we had problems with the COBOL programs that made up most of that instance's batch processes; it turned out that the COBOL license files for both the compiler and the runtime were corrupted.

    All in all, it took a week to re-establish service. At least, after that debacle, it finally dawned on management that we needed new equipment and better air conditioning, even though I had been bitching about exactly that six months before the disaster.
    Imprecator
  • War stories

    I was called in to a California county server room to assess the damage to some servers that had shut down over the 4th of July weekend in 107-degree heat. The ancient building they were housed in was not air conditioned but relied on swamp cooling, which was shut down after 5pm.
    The server room had AC, but the fuse for the AC had failed and the inside temperature was 150+ degrees. There were a number of servers, some of which I had supplied. My servers were still operating, being Xeon machines with huge heat sinks in an 8-fan 4U case. The two Cisco units had wisely shut down. The old HP was running, but its RAID array was operating on only one SCSI drive. There was a Gateway that had failed completely. All the sentencing records for the whole county were on that Gateway, which had two failed hard drives. I asked for backup tapes and was told that the tape drive had never worked and that one of the hard drives had failed about a year before. I got everything working again except for the melted Gateway. I took the controller from one hard drive and swapped it with the other, and after about three days I managed to rescue the crucial data. My attorney said that if he had known of this failure he would have gone down to the jail with writs of Habeas Corpus and sprung all his clients. The IT guy who was in charge left and went to work for the Federal Government in Washington. I tell this story to my clients who forget to back up. So endeth my story.
    yagijd
  • War stories

    For one of our clients, there were diverse teams working on Siebel, SAP and J2EE applications on Sun Solaris. The backend DB was Oracle with RAC.

    We had a big release underway. One DBA applied an Oracle patch during the release process; since our DB downtime is very rare, he wanted to make full use of the window and scheduled the patch application for it. He was not aware that the patch was not compatible with the version of J2EE that was deployed.
    Once the release was completely deployed and smoke testing started, we could not find any bugs immediately. We went live and allowed user access. After half an hour, our VP of IT got a call from the client's CTO saying their users were having a lot of issues. Well, the war started. We had to get teams from the US and India working together to resolve the issue. We first rolled back the repository changes and found that the problem still persisted. The DBA who had applied the changes overnight had taken his allowed break and was in a deep sleep, not answering his mobile.
    Our audit team used the checklist to find the changes that had been made and asked us to roll them back one by one. There we found the real issue. Hectic L1 calls were made to Oracle to undo the patch; after an hour of war with the Oracle guys, we were able to roll back the patch.
    Then we applied the new release again and things were good as gold.
    A real war... if you are applying patches without checking the compatibility end to end...

    Visit us at http://www.layer7labs.com
    Layer7 Labs