Lately my server crashed a few times. The first few crashes were non-destructive to disk data, but then the inevitable happened. A crash caused a lot of inode errors, of which I was initially unaware. Not until I performed backups that night did the disk problem show. Normally the backups take about 90 minutes, but this time the backup “completed” in about 10 minutes. I knew I had a problem.
I spent a couple of hours manually running fsck several times. After several passes I could finally run fsck in automatic mode with no errors returned.
The affected area was on the second disk on the partition where rsnapshot backups are stored.
Looking through the logs, I found the following message a few times, including at the latest crash, the one that corrupted the disk:
BUG: unable to handle kernel paging request at fffffffffffffffe
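A search like the one described above can be sketched as a small shell helper. The function name and the log paths in the example comment are assumptions; where kernel messages land depends on the syslog setup.

```shell
#!/bin/sh
# Print every logged occurrence of the paging-request BUG line from the
# given log files. -h suppresses file-name prefixes so the output looks
# the same whether one file or several are searched.
find_paging_bugs() {
    grep -h "BUG: unable to handle kernel paging request" "$@"
}

# Example (log paths are an assumption; adjust for your syslog setup):
#   find_paging_bugs /var/log/messages /var/log/syslog
```

Each matching line carries its syslog timestamp, which makes it easy to line the messages up against the known crash times.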
A little digging had me suspecting a RAM module. The server has four 2 GB sticks.
Using the Slackware Live ISO I ran memtest86+ overnight, to no avail. I had chosen the SMP option, and the test froze less than 7 minutes in, after I had retired for the evening.
I tried again the next night and let the program run with the defaults. The next morning the screen was filled with red. The affected RAM region:
The next step is to run the memory test with only one stick installed at a time to conclusively find the bad stick, and then to run the bad stick in different slots to determine whether the stick, the slot, or the memory controller is at fault.
The server idles much of the time and uses little of its 8 GB of RAM. I was also feeling lazy about pulling the server from under the desk and wanted to avoid down time, so I decided to try a short-term approach: exclude the bad memory region.
The Linux kernel supports boot parameters to exclude memory regions. There are two parameter options. As the memory test reported a high region, I chose the simpler parameter, which excludes all RAM above a specified limit. I multiplied 1024 by 7, which produces 7168 MB, or 7 GB, which is below the suspect area at 7440.6 MB. The boot parameter then would be mem=7168M. Possibly I could refine the parameter to mem=7386M, but I prefer to leave the extra room.
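The arithmetic above can be sketched as a small shell helper: take the lowest failing address reported by the memory test, round it down to a whole number of gigabytes, and print the matching mem= value. The function name is hypothetical, and the rounding rule is just the conservative choice described here, not anything the kernel requires.

```shell
#!/bin/sh
# Hypothetical helper: turn the lowest failing address (in MB, as shown
# by the memory test) into a conservative mem= boot parameter.
# Note: an address below 1024 MB would yield mem=0M, which is not usable.
mem_limit_param() {
    bad_mb=${1%%.*}                 # drop the fractional part: 7440.6 -> 7440
    whole_gb=$(( bad_mb / 1024 ))   # integer division rounds down: 7440 / 1024 = 7
    echo "mem=$(( whole_gb * 1024 ))M"
}

mem_limit_param 7440.6   # prints mem=7168M
```

The printed value is appended to the kernel command line in the boot loader configuration. After rebooting, `grep MemTotal /proc/meminfo` should report somewhat under 7 GB if the limit took effect.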
As another precaution, I added a step to the hourly rsnapshot backup cron job to sync and flush the hard drive cache, the idea being that the kernel would then start using lower memory regions.
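A minimal sketch of such a flush step follows. It assumes that "flush the cache" means the kernel's drop_caches interface; the function name and the DROP_CACHES_FILE override are hypothetical, the latter only there to allow a dry run without root.

```shell
#!/bin/sh
# Sketch of a pre-backup cache flush. Needs root to write to drop_caches.
# DROP_CACHES_FILE is a hypothetical override, handy for dry runs.
flush_caches() {
    target=${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}
    sync                    # write dirty pages to disk before dropping anything
    echo 3 > "$target"      # 3 = drop the page cache plus dentries and inodes
}
```

In the cron script, a call such as `flush_caches && rsnapshot hourly` would run the flush just before the backup, where hourly assumes that is the interval name defined in rsnapshot.conf. Dropping caches frees memory but does not by itself steer which physical pages the kernel hands out next, so this is a belt-and-braces step on top of the mem= limit rather than a substitute for it.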
I don’t know if my approach is sane. I’ll see how things go over the next several days. Eventually, though, I’ll have to pull the case and accept some down time.