I don't really consider myself a Linux guru. I haven't spent as much time working with Linux as I'd like, given all my other responsibilities. It's a weird feeling, because I once pretty much knew all there was to know about UNIX (I wrote kernel code, was a product manager for a very well-known implementation, etc.), but that was a very long time ago.
These days, I use Linux simply as a system manager for my own servers, and I'm as much an explorer as anything else. That's why, when I talk with you about Linux, I'll be sharing what I consider a "discovery," even though a true Linux guru would probably think of it as common knowledge. Even so, there are a lot of explorers out there and I hope these tips can help.
I've got one such tip today. I co-locate my servers at Prominic.NET. Before I tell you the rest of the story, I should disclose that these guys are old friends of mine, and I'm a very happy fan of their service.
In any case, I'd been having no end of problems with one of the Linux machines I co-lo there. It's a CentOS 5.6 machine, and about once a week it would crash hard. It had been doing this for months, and we had no idea what was causing it.
In desperation, I asked for help. Eric McCartney over at Prominic took some pity on me and started looking at the system. Even after he'd run through all the usual swaps and tests, there were no obvious issues.
Finally, Eric loaded up a neat little program called sys_basher, which exercises all the elements of the system...hard. It puts the machine under a strong load and if something can't handle the load, it'll die. Quickly.
As it turned out, the problem was a hard drive. The weird thing is, we hadn't allocated the hard drive in the Linux environment yet. It was simply plugged in, awaiting configuration. But, when that hard drive was plugged in, sys_basher would take the system down almost instantly. When the hard drive was unplugged, the system would run rock solid.
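If you want to check whether your own box has a drive sitting in that same limbo (attached, but not in use), here's a rough sketch in Python. It isn't what Eric used, and the function names are my own invention. It assumes the usual layout of /proc/partitions and /proc/mounts, and it only knows about mounted filesystems, so anything used for swap, LVM, or RAID will show up as a false alarm. Treat the output as a list of things to look at, not a verdict.

```python
def attached_devices(partitions_text):
    """Return device names from text in /proc/partitions format."""
    names = []
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        # The header row and the blank line after it are skipped.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])
    return names


def mounted_devices(mounts_text):
    """Return names of /dev/... devices from text in /proc/mounts format."""
    names = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            names.add(fields[0][len("/dev/"):])
    return names


def unused_devices(partitions_text, mounts_text):
    """List attached devices that are not mounted and have no mounted partition.

    Assumption: a whole disk counts as "in use" if any mounted device
    name starts with the disk's name (sda1 marks sda as in use).
    This is a crude prefix match, not a real device-tree lookup.
    """
    mounted = mounted_devices(mounts_text)
    unused = []
    for name in attached_devices(partitions_text):
        in_use = any(m.startswith(name) for m in mounted)
        if not in_use:
            unused.append(name)
    return unused
```

To try it on a live system, read /proc/partitions and /proc/mounts into strings and pass them to unused_devices(). Keeping the parsing separate from the file reads means you can also test it against output pasted from another machine.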
Hard drives, like all system components, are not perfect. But it would have taken far longer to diagnose what was up without sys_basher bashing the system.
Since we removed the drive, we've run fourteen days of testing with sys_basher and the machine has been solid. I think we've found the problem.
So, if you have a mysterious problem with a Linux box, try using sys_basher and see if it'll help you track down the trouble.