Digging Deeper With Server Issues

I was still experiencing problems with my server. I had configured GRUB to ignore a region of memory. That hack seemed to work well for a couple of weeks. Then one day my backups ended way too soon, just as they had before. This time there was no BUG: error in the logs.

I again rebuilt my backup hard drive.

I swapped the 8 GB of RAM (four 2 GB sticks) in the server with 4 GB (two 2 GB sticks) from my test system. I ran memtest86+ against each of the four 2 GB sticks individually. No failures found. I ran the test again with two sticks as a pair. No problems. I could not test all four at once because the test system's motherboard supports both DDR2 and DDR3 but has only two slots for each type.

That led me to think the problem with the server was either a memory slot or the memory controller. Or something else.

Because my previous RAM test had reported errors in the RAM region at 001d10da480 (7440.6 MB), I reinstalled three sticks in the first three slots of the server and ran memtest86+ all night.

No errors reported the next morning.

The next night I again ran the test with all four sticks.

Clean. No errors reported the next morning.

The root cause of the original crash and the RAM failing the memory test? Unknown.

Oxidation on RAM stick contacts is not unheard of. Removing and reseating the sticks several times might have resolved that, if that was the root cause. As an additional precaution I rubbed the RAM stick contacts with a pencil eraser before returning them to the server motherboard for the final time.

I was unconvinced that oxidation was the cause although I could not ignore the possibility. I thought I would wait and see how the server behaved.

I did not wait long. Some hours later while — again — performing backups, I experienced another crash. The symptoms were the same as the original and occurred — again — while backing up an rsnapshot partition. This time there was a BUG: error in the logs.

This repeating coincidence had me thinking about rsnapshot, which in my case uses a ton of hard links. I looked again at the log error:

BUG: unable to handle kernel paging request at fffffffffffffffe

The memory address seemed to indicate that the system had run out of memory. The server has 8 GB of RAM and 5 GB of swap. I wondered if the root cause might be related to a huge pages problem discussed in an article a few years ago. That article now seems dated, but the symptoms it described seemed similar. The problem might be resolved in kernel 4.14, but my system is running kernel 4.4.88 (Slackware 14.2 64-bit).
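One way to test the out-of-memory theory directly is to log memory while the backup runs. It is worth noting that fffffffffffffffe is -2 when read as a signed 64-bit number, which is the kind of address that shows up when the kernel treats a negative error code as a pointer, so the address by itself does not prove memory exhaustion. The sketch below is only an illustration of how such logging could look, not part of the backup script described here. The function names and the log path in the usage note are made up.

```shell
#!/bin/sh
# Sketch: record available memory and free swap while a backup runs.
# meminfo_kb and log_memory are hypothetical helper names.

# Print one field (in kB) from a meminfo-style file.
# $1 = field name, $2 = file to read (defaults to /proc/meminfo).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Append one timestamped line to a log file.
# $1 = log file, $2 = optional meminfo-style file (handy for testing).
log_memory() {
    printf '%s MemAvailable=%s kB SwapFree=%s kB\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(meminfo_kb MemAvailable "$2")" \
        "$(meminfo_kb SwapFree "$2")" >> "$1"
}
```

Starting something like `while sleep 60; do log_memory /var/log/backup-mem.log; done &` just before the rsync step, and killing that loop afterwards, would leave a trail showing whether MemAvailable and SwapFree were really close to zero before a crash.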

The rsync command is running whenever I see these crashes while backing up this particular partition. I use a shell script to perform this backup to a second disk. Somewhat like cloning. I read that rsync uses a lot of memory. The partition that always seemed to be in the middle of this mess is large, but those hard links had me wondering. Perhaps I had run into a corner case of just trying to copy too many files.
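To get a feel for whether the hard links are a factor, it helps to count them. If rsync is run with -H (an assumption, since the options used in the script are not shown here), it has to remember every multiply-linked file it has seen so that it can recreate the links, and that bookkeeping grows with the number of such files. The helper names below are hypothetical.

```shell
#!/bin/sh
# Sketch: gauge how many files rsync has to walk and how many are hard linked.
# File names containing newlines would inflate these counts.

# Count regular files under $1 without crossing into other file systems.
count_files() {
    find "$1" -xdev -type f | wc -l | tr -d ' '
}

# Count regular files under $1 whose link count is greater than one.
# Each name of a multiply-linked inode is counted separately.
count_hardlinked_files() {
    find "$1" -xdev -type f -links +1 | wc -l | tr -d ' '
}
```

Running `count_files /mnt/rsnapshot` and `count_hardlinked_files /mnt/rsnapshot` (a made-up mount point) every so often would show whether the partition has grown past whatever size the backup used to handle without trouble.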

Possibly another element of this mystery is that the crash does not seem to occur until my backup script unmounts the partition. The script uses umount -f -l to ensure an unmount. While the umount command is supposed to sync the disk cache, I went ahead and added the sync command to the script. If this element is related, then one obvious possibility is that an open file is preventing the unmount. Which possibly leads back to rsync.
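Because umount -l detaches the file system right away and cleans up only once it is no longer busy, it hides a busy mount rather than reporting one. A check for open files before unmounting would show whether something is still holding the partition. The usual tools for this are fuser -vm and lsof. The sketch below does a rougher version of the same thing by scanning /proc, so it is Linux-only, sees only processes the caller is allowed to inspect, and ignores working directories and memory maps. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list open file descriptors that point below a directory.
# Prints "PID PATH" per match; returns 0 if any were found, 1 if none,
# 2 if the directory cannot be entered.
open_files_under() {
    dir=$(cd "$1" && pwd -P) || return 2
    found=1
    for fd in /proc/[0-9]*/fd/*; do
        # Skip descriptors that vanish or that we may not read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$dir"|"$dir"/*)
                pid=${fd#/proc/}
                pid=${pid%%/*}
                printf '%s %s\n' "$pid" "$target"
                found=0
                ;;
        esac
    done
    return "$found"
}
```

In a backup script this could run right after sync. If `open_files_under /mnt/backup` (a made-up mount point) succeeds, the script could log the output and skip the unmount instead of forcing it.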

One option seems to be splitting that partition backup into multiple commands. Another option might be to use dd or e2image.
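Splitting the backup could look something like the sketch below, which runs one rsync per top-level directory so that each run builds a smaller file list. The rsync options, the function name, and the RSYNC override are assumptions for illustration, not the contents of the real script. One trade-off matters a lot for rsnapshot data: rsync -H can only preserve hard links between files that are part of the same run, so links between separate top-level directories would be copied as independent files and the copy could need far more disk space. The loop also skips plain files and dot-directories at the top level.

```shell
#!/bin/sh
# Sketch: back up each top-level directory of $1 into $2 with its own rsync run.
# Set RSYNC=echo to print the commands instead of running them.
backup_in_pieces() {
    src=$1
    dest=$2
    for dir in "$src"/*/; do
        # An unmatched glob stays literal, so make sure this is a directory.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # The trailing slashes copy the contents of $dir into $dest/$name/.
        ${RSYNC:-rsync} -aH --delete "$dir" "$dest/$name/" || return 1
    done
}
```

Calling `RSYNC=echo backup_in_pieces /mnt/source /mnt/backup` (both paths made up) shows what would be run before committing to it.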

Another nightly memory test with all four sticks resulted in no errors.

The Asus M3N78-EM motherboard is somewhat old, purchased in 2009. Possibly the board is going bad, or perhaps only one memory slot. Yet now I am leaning toward the problem being rsync causing the system to run out of memory. What remains a mystery is why the problem never appeared until recently, given that I have used this same script and rsync for years. The crashes started about 6 weeks ago, several weeks after I replaced the previous ASRock N68C-GS4 motherboard with the M3N78-EM. I am inclined to think that if the motherboard were directly related, the crashes would have appeared right away.

That said, I replaced the power supply unit (PSU) with a spare. The original PSU was rated at 300 watts and the replacement is rated at 450 watts.

Back to wait and see, but I plan to tinker some with my backup script.

Posted: January 08, 2018 Category: Usability Tagged: General
