Benchmarking R-based bioinformatics speed across different computer types

As bioinformaticians, or data analysts of any kind, we should be using the right tool for the job. There are many types of compute available: laptops, desktops, pro workstations, on-prem servers, high performance computing (HPC) clusters and cloud. One consideration should be how quickly the computer completes the task, since a faster machine means less waiting around and better productivity.

What I thought I would do today is test a typical and relatively simple R-based bioinformatics pipeline across a few different computer systems I have on hand, to see how quickly they can process a containerised workflow. The workflow involves downloading gene expression count data from DEE2, then doing differential expression with DESeq2, followed by pathway enrichment by over-representation analysis and by functional class scoring with the fgsea package. The workflow itself is available from the GitHub repository. The Docker container is available from Docker Hub.

The script I used was:

time bash main.sh

where `main.sh` consisted of the two lines:

Rscript -e "rmarkdown::render('dataprep.Rmd')"

Rscript -e "rmarkdown::render('session2.Rmd')"
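A single timed run can be noisy, so it may be worth repeating the measurement a few times on each machine. The following is only a minimal sketch of how that could be done; `run_timed` is a hypothetical helper name and is not part of the original repository, and it assumes `date +%s` is available, which gives one-second resolution.

```shell
# Sketch: run a command several times and report wall-clock seconds per run.
# run_timed is a hypothetical helper, not part of the original workflow.
run_timed() {
  reps="$1"; shift
  i=1
  while [ "$i" -le "$reps" ]; do
    start=$(date +%s)                    # whole seconds since the epoch
    "$@" > /dev/null 2>&1 || return 1    # stop early if a run fails
    end=$(date +%s)
    echo "run $i: $((end - start)) s"
    i=$((i + 1))
  done
}

# Example (assumes main.sh is in the current directory):
# run_timed 3 bash main.sh
```

One-second resolution is coarse, but the runs in this post take between roughly one and three minutes, so it should be adequate here.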

For testing I included my work laptop, a basic desktop PC, two workstations with consumer grade CPUs, a Threadripper workstation, a Xeon HPC system and a cloud based AMD Epyc server. The elapsed time was recorded and included in the table below.

The results show the Ryzen (9950X3D) workstation was fastest at 51 seconds, closely followed by the Intel i9-14900 workstation at 54 seconds. The Threadripper workstation with the 5955WX took 68.1 seconds and the basic PC with the older i5-10400 part took 87.9 seconds. The HPC based Xeon Gold 6240R, running in an Apptainer container, took 104 seconds. The Dell laptop with the i7-1365U CPU managed a lacklustre 151 seconds. In last place was the cloud based EPYC-Rome server with a lousy 193 seconds.

Elapsed time for a container based R bioinformatics workflow to complete across different computers.
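To put the elapsed times reported above on a common scale, each one can be divided by the fastest time. The sketch below does this with awk; the times are copied from this post, while the shortened labels and the `relative_slowdown` function name are my own choices for illustration.

```shell
# Elapsed times in seconds, copied from the results quoted in this post.
# Labels are shortened for display.
times="Ryzen-9950X3D 51
i9-14900 54
Threadripper-5955WX 68.1
i5-10400 87.9
Xeon-Gold-6240R 104
i7-1365U 151
EPYC-Rome 193"

# Read "label seconds" lines and print each with its slowdown
# relative to the fastest system in the list.
relative_slowdown() {
  awk 'NR == 1 { best = $2 }   # seed the minimum with the first record
       { name[NR] = $1; t[NR] = $2; if ($2 < best) best = $2 }
       END { for (i = 1; i <= NR; i++)
               printf "%-22s %6.1f s  %.2fx\n", name[i], t[i], t[i] / best }'
}

printf '%s\n' "$times" | relative_slowdown
```

On these numbers the cloud server takes about 3.8 times as long as the fastest workstation (193 / 51).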

Certainly this result is contrary to the idea that the most powerful or expensive computer would give the fastest performance. To explain this, we should realise that the workflow runs mostly on a single thread, with four threads engaged only for the pathway enrichment analysis.

The desktop CPUs seem to do very well with single threaded performance. The server, cloud and HPC options have CPUs that are tuned for very efficient computing over many threads, which R cannot readily exploit in most bioinformatics workflows. The laptop is tuned for basic computing and has power and thermal restrictions, which mean it doesn't perform that well in this situation. Interestingly, the Threadripper part performed relatively poorly considering just how expensive those systems are.

Next, I calculated the performance per dollar (counting only the CPU cost), which gives some indication of the type of system you should select if the budget is tight. The results below show that midrange desktop CPUs that are a few years old are actually the most cost-effective option for this type of workflow, and that investing in server or professional workstation parts is the least cost-effective.
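The exact formula behind the performance per dollar figure is not spelled out here, so the following is only one plausible way to compute it: take the number of workflow runs per hour and divide by the CPU price. The `perf_per_dollar` name is hypothetical, and the price in the usage line is a placeholder rather than a real quote; substitute the actual CPU prices for your own systems.

```shell
# Assumed metric: workflow runs per hour, divided by CPU price.
# perf_per_dollar is a hypothetical helper; this is not necessarily
# the calculation used for the results in this post.
perf_per_dollar() {
  elapsed_s="$1"
  cpu_price="$2"
  awk -v t="$elapsed_s" -v p="$cpu_price" \
    'BEGIN { if (t <= 0 || p <= 0) exit 1   # reject non-positive inputs
             printf "%.4f\n", (3600 / t) / p }'
}

# Usage: elapsed time from this post with a PLACEHOLDER price of 100
# (not a real price for any CPU mentioned here).
perf_per_dollar 51 100
```

Because the metric divides by price, a cheap CPU that is only moderately slower can easily come out ahead of an expensive one.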


