How Linux handles hardware problems

I usually write about issues in Windows, because there are so many of them and they come up so often. I rarely run across major issues with Linux. That held until the one I will describe here, which involves a kernel lockup. Kernel lockups in Linux are very rare; in 14 years I have seen fewer than 10. They usually happen because of a hardware problem that leaves the kernel unable to keep running.

Recently I saw a 10-year-old server running Red Hat Linux 7.1 lock up completely. Yes, the OS was installed on this Dell PowerEdge 2400 back in 2001, and it has run just fine ever since, with none of the file corruption, slowdowns, or other issues we see with old Windows installations. Then the server was shut off abruptly during an extended power outage. After that, it would run for roughly a week at a time and then lock up. One time the console screen was black; another time it showed a kernel dump.

After booting the server back up, I noticed this entry in /var/log/messages:

kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
kernel: You probably have a hardware problem with your RAM chips

These log entries are humorous, but they contain useful information. With a lockup like this, the kernel appears to be having some sort of trouble with the system memory in the server. Fortunately, the very same problem happened 3 years ago and we knew the fix: reseat the power supplies. The first time it happened, I ran memory diagnostics and they passed. I ended up finding a forum post that connected the NMI errors to power issues with the system. Since the issue appeared to be exactly the same this time, we skipped the memory diagnostics and reseated the two hot-swap power supplies. That fixed the problem before, and it should fix it again this time.

A few points to take from this post. First, diagnosing problems in Linux is not as hard as it is rumored to be. /var/log/messages is usually where the kernel logs its information, and it logs very thoroughly. The kernel's entries are tagged "kernel", just like the example above. The logs are plain text, so any program that can read text can open them, unlike Windows, which stores its logs in a proprietary format that needs Microsoft tools to view.
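Because the log is plain text, a few lines of script are enough to pull out the entries that matter. The following is only a minimal sketch, not a replacement for reading the log: it assumes each kernel entry contains the tag "kernel: " followed by the message, as in the example above. The function name, the keyword list, and the third sample line are made up for illustration.

```python
import re

# Assumption: kernel entries contain the tag "kernel: " followed by the message,
# as in the two log lines quoted in this post.
KERNEL_LINE = re.compile(r"\bkernel: (?P<msg>.*)$")

# Hypothetical keyword list; adjust it to whatever you want to be alerted about.
KEYWORDS = ("NMI", "hardware problem")


def find_kernel_warnings(lines, keywords=KEYWORDS):
    """Return the kernel messages that contain any of the given keywords."""
    hits = []
    for line in lines:
        match = KERNEL_LINE.search(line.rstrip("\n"))
        if match and any(word in match.group("msg") for word in keywords):
            hits.append(match.group("msg"))
    return hits


# The first two lines are the entries from this post; the third is a made-up
# harmless entry to show that ordinary kernel messages are skipped.
SAMPLE_LINES = [
    "kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue",
    "kernel: You probably have a hardware problem with your RAM chips",
    "kernel: example of an ordinary message",
]

for message in find_kernel_warnings(SAMPLE_LINES):
    print(message)
```

To run it against a real log, pass an open file instead of the sample list, for example find_kernel_warnings(open("/var/log/messages")), which normally requires root privileges.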

Second, hardware problems cannot always be prevented, and they tend to happen when you least expect them. It's not the software's fault when a hardware problem prevents it from running. This is mostly what I see behind Linux issues. On the flip side, think about how common Windows blue screens of death (BSOD) are. Countless jokes about them circulate all the time, and even Linux screensavers contain Windows crash screens (the xscreensaver packages contain these!).

Third, bad RAM is probably the most common cause of a Linux system crash, other than a bad motherboard. The Linux kernel can continue to run as long as it can access the memory. When bad memory is suspected, I always run a copy of the free utility Memtest86, which is an excellent memory tester.

In conclusion, hardware problems are sometimes difficult to diagnose and fix. I believe we got lucky with the example here, but logs should always be examined, and any errors in them addressed. I've seen a lot of Windows administrators who do not view the error logs or take any proactive steps based on them, maybe because the logs are often difficult to decipher. There are services like MOM/SCOM available to make this easier. With Linux, I tend to prefer Logwatch to email the errors. Either way, a well-tuned system will run for many years and should provide reliable service.
