Stress test RAM annually

TLDR: System memory can go bad. Use `memtester` on your Linux system annually to spot any problems early.

Now the long story...

In bioinformatics, we process a lot of data and conduct a lot of analysis. We use a range of devices, from laptops and desktop workstations to remote servers and the cloud. One particular desktop workstation of mine has been showing intermittent freezing and other problems, so I spent a bit of time trying to diagnose the issue. It is based on the AMD Threadripper 2990WX 32-core CPU with 8x 16 GB DDR4 modules, and we have been using it to process thousands of RNA sequencing datasets for the DEE2 project. It has been working at maximum capacity for about 6 years. Its symptoms were sudden shutdowns and freezing.

I checked the CPU temperatures under load (generated with the `stress` command) and they were high, in the 90 °C range, which was odd given that it was water cooled with an all-in-one system. I removed the block to inspect the thermal paste and found that the block did not appear to cover the entire surface of the CPU, only a circle in the middle. I found another cooler with a properly shaped block, the Silverstone IceGem 280, which reduced the temperatures a lot, and the sudden shutdowns stopped. But the freezing symptom continued.
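
For anyone wanting to repeat this kind of check, here is a minimal sketch of reading temperatures while the machine is under load. It is not the exact procedure I used: the function names and the 85 °C threshold are made up for illustration, and it assumes a Linux kernel that exposes sensors under `/sys/class/hwmon` (which covers all sensors, not only the CPU).

```shell
#!/bin/sh
# Convert a sysfs reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Succeed (exit 0) if a temperature in degrees C is at or above a threshold.
is_too_hot() {
    [ "$1" -ge "$2" ]
}

# Print the hottest hwmon reading in degrees C.
# Prints nothing if no readable sensors exist (e.g. in a VM or container).
max_hwmon_temp_c() {
    max=""
    for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        t=$(cat "$f" 2>/dev/null) || continue
        # Skip anything that is not a plain integer.
        case $t in
            ''|*[!0-9-]*) continue ;;
        esac
        if [ -z "$max" ] || [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    if [ -n "$max" ]; then
        millideg_to_c "$max"
    fi
}
```

With `stress` installed, one could start a load in the background with `stress --cpu "$(nproc)" --timeout 120 &`, wait a minute, then run `max_hwmon_temp_c` and compare the result with `is_too_hot "$temp" 85`. The `sensors` command from lm-sensors gives a more readable view of the same data.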

I considered it could be a memory issue so I wanted to run a memory test. On this motherboard, there is no option to run a memory test before loading the OS, so I installed the `memtester` utility, which was much more effective than my custom script.

`sudo apt install memtester`

First step is to check the available RAM.

`free -h`

The system has 128 GB, with 120 free so I tested 115 GB.

`memtester 115G 5`

The number at the end is the number of times the test should be repeated. My hunch is that each pass tests different addresses rather than the same ones.
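
The "120 free, test 115" arithmetic can be scripted. The sketch below is only an illustration: the function names are hypothetical, the 5 GiB headroom simply mirrors what I left above, and it assumes a Linux `/proc/meminfo` with a `MemAvailable` field.

```shell
#!/bin/sh
# Given available memory in KiB and a headroom in MiB to leave for the OS,
# print a size argument for memtester (whole MiB with an "M" suffix).
memtester_size() {
    avail_kib=$1
    headroom_mib=$2
    size_mib=$(( avail_kib / 1024 - headroom_mib ))
    if [ "$size_mib" -le 0 ]; then
        echo "not enough available memory to leave ${headroom_mib} MiB free" >&2
        return 1
    fi
    echo "${size_mib}M"
}

# Read MemAvailable (reported in KiB) from /proc/meminfo.
mem_available_kib() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}
```

A run could then look like `sudo memtester "$(memtester_size "$(mem_available_kib)" 5120)" 5`. Running as root lets memtester lock the memory it is testing so that it cannot be swapped out.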

Here's how this system looks. [Photo of the workstation not included.]

The test gave some errors, so I proceeded to test each stick separately. The results were strange. Three of the eight sticks showed memory errors. Then I put the five good sticks back in and tested 70 GB overnight, and it still gave errors. Frustrating!

But after some research I read that some BIOS settings might cause such problems and it is worth putting the settings back to default. I didn't set this machine up myself, so I went into the BIOS and there were some CPU settings that were different to the default - probably an attempt at overclocking by the person setting up the system. I put them back to default and repeated the test. Suddenly, most of the memory problems went away! Turns out only one of the sticks was bad. After removing that one, the system passed all memtests and has been rock solid since. 

So the take home messages here are:

  • Don't ignore it when a computer system freezes or shuts down randomly; there is something wrong!
  • Check the CPU temperatures under load; there may be a problem with CPU cooling, as there was with mine.
  • Run memtester on all system memory annually, and whenever the memory configuration is changed. If there are problems, return the BIOS settings to default.
  • If it still gives you problems, then check each stick separately. You may be able to return the defective modules to the manufacturer for replacement.
  • If you want to have good uptime and longevity of the system, keep BIOS settings default and don't overclock CPU or RAM.
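
To make the annual test less of a chore, the result can be summarised automatically. This is a sketch under assumptions rather than something I run myself: the function name is made up, and the meaning of the status bits is taken from the memtester man page (0 on success, otherwise the OR of 1 for allocation, locking or invocation errors, 2 for a stuck address test error, and 4 for an error in one of the other tests).

```shell
#!/bin/sh
# Translate a memtester exit status into a one-line summary.
# Returns 0 for a clean pass and 1 for any kind of failure.
describe_memtester_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "PASS"
        return 0
    fi
    msg="FAIL:"
    # Each bit of the status flags a different class of problem.
    [ $(( status & 1 )) -ne 0 ] && msg="$msg setup-error"
    [ $(( status & 2 )) -ne 0 ] && msg="$msg stuck-address"
    [ $(( status & 4 )) -ne 0 ] && msg="$msg other-test"
    echo "$msg"
    return 1
}
```

A wrapper script could run `sudo memtester 115G 1 > memtester.log` and then call `describe_memtester_status $?`. Scheduling it once a year is possible with a crontab entry such as `0 2 1 1 * /usr/local/bin/annual-memtest.sh`, where the script path is hypothetical.
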
---------------------
Update Oct 2024:
* Suspected the power supply might not actually be providing enough juice, which was still causing the occasional random shutdowns.
* Swapped the RX580 graphics card (185W) with RX6400 (53W) and haven't had any issues since.
* Swapped the 240 mm AIO for the Arctic Cooling Freezer 4U-M and now it is much quieter, at barely 50 dB at max output (albeit with slightly slower boost frequency).
* Upgraded RAM to 256 GB and got a set of 5 new 140 mm case fans.
* Now it is an absolute beast.

