Posts

Yes, you can use a single stick of DDR5 for bioinformatics and data analysis

INTRO

DRAM prices skyrocketed 171% in 2025 [1], and the trend looks set to continue into 2026 unless demand for GenAI hardware crashes. This leaves bioinformaticians and other data analysts in a pickle, as most of the applications we use need a lot of RAM. To keep costs low, we might consider using a single stick (aka Dual In-line Memory Module: DIMM) of RAM for a new workstation build, which has been tried with reasonable success in low-budget gaming setups [2]. So in this post we will look at whether a single stick of DDR5 DRAM causes a dramatic reduction in computational throughput compared to the normal two-stick setup. We will also examine whether the stock memory configuration (4800MT/s) is any slower than the tweaked settings (EXPO 6000MT/s with low latency and high bandwidth support).

SETUP

The tests I will use include:
- A synthetic CPU test using stress-ng
- Single end RNA-seq human (STAR)
- Single end RNA-se...
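For context on the one-stick versus two-stick comparison, here is a rough back-of-envelope sketch of theoretical peak memory bandwidth. It assumes a typical desktop platform where each populated channel is 64 bits (8 bytes) wide, so one stick gives one channel and two sticks give two. The function name `peak_bw_mb` is made up for this illustration, and real-world throughput will be lower than these ceilings.

```shell
# Theoretical peak bandwidth in MB/s = transfer rate (MT/s) x 8 bytes x channels.
# Assumption: 64-bit (8-byte) channels, one channel per populated stick.
peak_bw_mb() {
    # $1 = transfer rate in MT/s, $2 = number of populated channels
    echo $(( $1 * 8 * $2 ))
}

# Compare stock (4800) and EXPO (6000) speeds with one and two sticks.
for rate in 4800 6000; do
    for channels in 1 2; do
        echo "DDR5-$rate, $channels channel(s): $(peak_bw_mb "$rate" "$channels") MB/s"
    done
done
```

On paper, then, dropping to one stick halves the bandwidth ceiling, whereas going from 4800 to 6000MT/s raises it by 25%. Whether that matters for a given workload is exactly what benchmarks have to answer.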

DEE2 database gets HDF5

Here I’ll show you how to download and work with the new HDF5 datasets from DEE2 (dee2.io). HDF5 files provide fast random access to large and complex datasets while occupying less disk space. Overall, the bulk data files are 50% smaller than the previously used BZ2 files. HDF5 also makes selecting datasets of interest quicker and obviates the need to convert data from "long" to "wide" format, which takes a long time and lots of RAM. In short, this is a big upgrade in end-user accessibility that will power large-scale analysis of DEE2 transcriptome data. The materials here are mostly based on the rhdf5 package here. I’ll demonstrate with E. coli, but this should also work for other organisms. The first step is to load the rhdf5 library and download the h5 file.

library("rhdf5")
library("tictoc")
if ( file.exists("ecoli_se.h5") ) {
  message("HDF5 file exists")
} else {
  message("Downloading HDF5 file")
  downl...

Mitch gets upgraded - now with gene set networks

Interpreting pathway enrichment analysis results is a big challenge. An analysis may yield hundreds of statistically significant pathways, and getting to a shortlist of key mechanisms to follow up with experiments and describe in a publication is difficult. I recently got a request from a collaborator to come up with a way to visualise the key networks. I groaned... because network analysis in bioinformatics is sometimes characterised by hairballs of hundreds or thousands of meaningless interactions. Mostly it is done poorly, and the charts themselves have no explanatory function and appear to be decorative. After seeing a few well-done examples such as Figure 2 from Chappel et al (PMID: 32138627), I thought this could be something we include as a common step in the mitch workflow. For those of you unaware, mitch is the R/Bioconductor package that Dr Antony Kaspi and I published in 2020 (PMID: 32600408) with the main focus being on multi-dimensional enrichmen...

Installing pip based tools for bioinformatics

Loads of bioinformatics and data science tools are available through the `pip` package manager/installer. In days gone by, it was possible to type something like

pip install mytool --user

to get a tool installed, but this isn't possible anymore because recent distributions mark the system Python as externally managed. When you try a command like that on Ubuntu 24, you will get an error and a suggestion to use a virtual environment (or "venv"). While there is a lot of in-depth documentation on using venv, I couldn't find a quick start guide specifically for bioinformaticians. Briefly, a virtual environment is an isolated environment that pins the base Python version and the package versions, so there is little chance of problems from version mismatch. So here is a quick guide to installing the MultiQC tool, used extensively in genome sequence analysis.

Install multiqc via pip

The first step is to create a folder to put your venvs (you will likely have several if you do a lot of analysis).

mkdir ~/venv

Next...
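Since you will likely end up with several venvs, the general pattern can be wrapped in a small helper. This is a minimal sketch of the usual venv approach, not necessarily the exact commands used in the full post. It assumes `python3` with the venv module is available (on Ubuntu that may mean installing the `python3-venv` package first). The helper name `venv_install` and the `VENV_HOME` variable are my own inventions for illustration.

```shell
# Hypothetical helper: create one venv per tool under ~/venv and
# pip-install the tool into it. VENV_HOME is an assumed override variable.
venv_install() {
    tool="$1"
    venv_dir="${VENV_HOME:-$HOME/venv}/$tool"
    # Make sure the parent folder for all venvs exists.
    mkdir -p "$(dirname "$venv_dir")"
    # Create the isolated environment.
    python3 -m venv "$venv_dir" || return 1
    # Use the venv's own pip so the package lands inside the venv,
    # not in the system Python.
    "$venv_dir/bin/pip" install "$tool"
}
```

With that defined, `venv_install multiqc` should leave the executable at `~/venv/multiqc/bin/multiqc`, which can be run directly by its full path, or after `source ~/venv/multiqc/bin/activate` if you prefer to work inside the environment.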

Bioinformatics data processing power depends on CPU L3 cache A LOT!

When speccing a CPU for a new bioinformatics computer we tend to focus on threads and frequency, but do you look at the cache? Well, you definitely should. On January 30, I started jobs on two servers. The first has 2x Xeon E5-2680 v3 (48 threads total) and the second has 1x Xeon E5-2667 v3 (16 threads). The work is processing raw RNA-seq data using a pipeline of Skewer+STAR+Kallisto. Server 1 has many more threads, so I ran two parallel jobs of 12 threads each on it, while on Server 2 I ran one job using 8 cores. The cluster nodes are attached to the same network storage, so I/O capacity is the same. Since Jan 30, the two machines have collectively processed 4670 datasets, and the difference between them was a big surprise. Not only did Server 2 process more datasets overall; when normalised by number of threads, it processed 4.1 times more!

            Server 1              Server 2
CPU         2x Xeon E5-2680 v3    1x Xeon E5-2667 v3
Cores       24                    16
Threads     48                    16
Frequency ...
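To make the "normalised by number of threads" comparison concrete, here is a minimal sketch of the arithmetic. The function name `per_thread_ratio` is hypothetical, and the numbers in the example call are made-up placeholders, not the actual counts from the two servers.

```shell
# Ratio of per-thread throughput: (datasets_b / threads_b) / (datasets_a / threads_a).
# Arguments: datasets_a threads_a datasets_b threads_b
per_thread_ratio() {
    awk -v da="$1" -v ta="$2" -v db="$3" -v tb="$4" \
        'BEGIN { printf "%.2f\n", (db / tb) / (da / ta) }'
}

# Placeholder example only: both machines finish 100 datasets,
# machine A using 24 threads and machine B using 8 threads.
per_thread_ratio 100 24 100 8
```

A ratio above 1 means machine B gets more work done per thread. In the placeholder case, equal output from a third of the threads gives a ratio of 3.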