My office computer had been a tad flaky for many months. Rebooting often resulted in the system powering down rather than rebooting. I then had to wait about two minutes before I could power on the system and have it stay on. Annoying yes, horrible no.
Then my definition of flaky changed. The system started freezing. Hard. Several times in a day. No kernel panics. Just a hard freeze. The kernel magic keys did not work. Only the power switch resolved the problem. When freezing while running a desktop environment the screen would change to a strange pattern. When just running from a console I would experience a black screen of death.
Were the two problems related or unique?
Who knows, because a day later the system died.
Initially I suspected the power supply unit (PSU). A spare replacement provided no relief.
Next I suspected a bad RAM module but all four 2GB modules worked fine in the replacement computer.
The system logs revealed nothing obvious. I found one common entry of “INFO: task sync:xxxxx blocked for more than 120 seconds.” Browsing the web indicated the message was more or less normal and could be tweaked with some sysctl.conf settings. That specific message was in my logs going back a full year. The freezing started suddenly and was likely not related to that message. Full file system checks and smartctl outputs and logs indicated the freezing was unlikely to be caused by the hard drives.
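For anyone wanting to check how often that message shows up in their own logs, here is a minimal Python sketch. It is not what I ran at the time; the function name `count_hung_tasks` and the regular expression are my own illustration, built only from the message text quoted above. As far as I know, the timeout behind this warning is the `kernel.hung_task_timeout_secs` sysctl, but I make no claim that this is the setting the web posts recommended.

```python
import re
from collections import Counter

# Matches the kernel hung-task warning quoted above, for example:
#   "INFO: task sync:12345 blocked for more than 120 seconds."
# The pattern is an assumption derived from that one message format.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(lines):
    """Count hung-task warnings per task name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Passing an open log file to `count_hung_tasks` works because a file object iterates over its lines. Which file holds kernel messages varies by distribution, so the path is left to the reader.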
That was good news although part of my backup strategy includes cloning my drives every few days. As the system consistently froze I doubted I could keep the system running long enough to clone the larger data partitions.
To update my clone drives I would have to move the drives temporarily to my media center. As the media center is not a high priority, I moved my office system drives to the media center and ran the pseudo server services from there.
All doable but I really wanted to resolve the freezing.
In the meantime I had to store new data files, such as this specific blog entry, in a temporary location on my laptop. I also would be without access to my data files until I moved the drives to the media center, resolved the freezing, or bought new hardware.
I long had suspected heat being related to the quirky reboot/halt problem. I could drive the fans at full speed and, albeit inconsistently, find more success with rebooting than when the fans were controlled by sensor outputs. With the new freezing I configured the system to run the fans at full speed all the time. The freezing still occurred all the time. The freezing likely was not heat related.
I rebooted with memtest86+. I disconnected the hard drives. Every time I ran the memory test the system froze hard. Often but not always within 10 minutes. After several freezes I tried watching the output to learn whether the freeze happened during the same memory test. No luck with any such consistency. No output clues of bad RAM. Just a hard freeze. The fact that running the test often resulted in a freeze within 10 minutes was suspicious.
That the hard drives were disconnected was reassuring that they were not part of the cause.
Unlike the black screen of death while running an operating system, this freeze retained some of the memtest86+ screen output. Interestingly, the little red blip in the upper left of the screen continued blinking.
The best I could guess was that either the CPU or the ASUS M2NPV-VM main board was bad.
All fans ran (12V) and the Standby LED (5V) was lit. I pulled the battery and jumper-reset the CMOS. After the CMOS reset the system booted to the vendor boot splash but went no further. Pressing the Delete key failed to enter the BIOS setup.
Then the system failed to progress that far. The fans would spin and after about 7 seconds everything powered down.
There were no BIOS POST beeps.
Obtaining a loaner CPU revealed the problem was the main board as the system failed to boot with the new CPU.
There is a lesson in all of this inconvenience.
I use the office system as a pseudo server. For a while I had been planning to convert the machine to a full time server because my usage habits are now more mobile. When the system died I evaluated my usage habits as well as my grand plan for a central server.
The loss of my office system exposed a flaw with centralization. I keep all of my data files on that computer as well as my email. I connect to the data files from other systems using NFS. I access email from other systems using SSH notifications and VNC to access my mail client. Yet how to access mail and data files with the loss of such a system?
RAID does not avoid CPU or main board failures.
My backup strategy includes hourly rsnapshot backups of system and data files, historical weekly backups with alternating backup drives, and clone drives.
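The alternating weekly drives are the only part of that strategy with any real logic to it, so here is a small Python sketch of one way to decide which drive a given week belongs to. This is an illustration, not my actual script: the function name `weekly_backup_drive` and the labels `backup-A` and `backup-B` are hypothetical. It counts whole weeks from a fixed starting point rather than using the calendar week number, so the rotation does not repeat a drive across a year boundary.

```python
from datetime import date


def weekly_backup_drive(day, drives=("backup-A", "backup-B")):
    """Return the drive label for the Monday-to-Sunday week containing `day`.

    The drive labels are hypothetical placeholders.
    """
    # date.toordinal() counts days from 0001-01-01, which is a Monday,
    # so integer division by 7 yields a running Monday-based week index.
    week_index = (day.toordinal() - 1) // 7
    return drives[week_index % len(drives)]
```

Every day within one week maps to the same drive, and the following week maps to the other one.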
While I had a spare power supply, I had no emergency replacement main board. How many home and non-enterprise users buy two boards of the same make and model to keep one as a spare? Spare CPUs?
When considering electronic challenges such as electrolytic capacitors drying, what is the shelf life of a spare board? How often should I “breadboard” the spare main board and power supply to exercise electronic components? I also need a spare CPU for that.
I am not a fan of cloud strategies. They carry more security and privacy issues than I am willing to accept or compromise on. The only palatable cloud option is one I control inside my own network.
While moving all data to a central server still makes sense for me, the move is no panacea. This particular failure shows a need for a Plan B.