Posts

Bioinformatics data processing power depends on CPU L3 cache A LOT!

When speccing a CPU for a new bioinformatics computer, we tend to focus on threads and frequency, but do you look at the cache? Well, you definitely should. On January 30, I started jobs on two servers. The first has 2x Xeon E5-2680 v3 (48 threads total) and the second has 1x Xeon E5-2667 v3 (16 threads). The work we are doing is processing raw RNA-seq data using a pipeline of Skewer+STAR+Kallisto. Server 1 has many more threads, so I ran two parallel jobs of 12 threads each, while on Server 2 I'm running one job using 8 cores. The cluster nodes are attached to the same network storage, so I/O capacity is the same. Since Jan 30, these two machines have collectively processed 4670 datasets, and the difference between them was a big surprise. Server 2 not only processed more datasets overall; when normalised by number of threads, it processed 4.1 times more datasets!

| | Server 1 | Server 2 |
|---|---|---|
| CPU | 2x Xeon E5-2680 v3 | 1x Xeon E5-2667 v3 |
| Cores | 24 | 16 |
| Threads | 48 | 16 |

Frequency ...

DEE2 2025 updates: growth and development

DEE2 is a database service I co-founded in 2015, aiming to provide uniformly processed gene expression profiles for each and every RNA-seq dataset in NCBI's Sequence Read Archive. After some major revisions, we published the database/service journal article in 2019. Over time, DEE2 has grown dramatically in both the number of datasets and the total number of counted reads, as seen in the tables below. Here, I'll walk you through the growth of the database over time and the new features we've added.

Growth of DEE2

DEE2 continues to ingest new metadata from NCBI SRA and GEO, and the growth of this metadata set has caused us a lot of issues over time. In the early years of DEE2 we used SRAdb, but the metadata became too large for its design, so we sought different solutions. Currently we are using pySRA and only fetching quarterly. The size of each request has been a challenge, with requests for annual dumps of human and mouse metadata exceeding the 64 GB RAM of our backend workstation! S...

Two subtle problems with over-representation analysis

Over-representation analysis is a helpful and very widely used technique for understanding trends in omics data and gene lists more generally. To demonstrate this, I tabulated the citations of some of the most popular web tools and packages, which together reach a massive 191k citations! But if you have used some of these tools, you will have noticed that they yield subtly different results. We were curious about that and ran a whole bunch of investigations into the internal workings of these tools, in particular clusterProfiler. We identified two subtle problems with clusterProfiler, which we unpack in detail in our new publication, but here I will give you a quick overview.

Problem #1: The background problem

The first one we call the “background problem,” because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. This results in removing a large number of unannotated genes from the b...
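To give a feel for why the background list matters, here is a minimal Python sketch of the one-sided hypergeometric test that over-representation analysis is typically based on. It is not clusterProfiler's code; the function name `ora_pvalue` and all gene counts are hypothetical, chosen only for illustration. It compares the same overlap against a full background and against a background (and gene list) with unannotated genes removed.

```python
from math import comb


def ora_pvalue(overlap, list_size, category_size, background_size):
    """Upper-tail hypergeometric probability P(X >= overlap).

    overlap         -- genes in both the gene list and the category
    list_size       -- genes in the submitted gene list
    category_size   -- genes in the category (within the background)
    background_size -- total genes in the background
    """
    upper = min(list_size, category_size)
    numerator = sum(
        comb(category_size, i)
        * comb(background_size - category_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(background_size, list_size)


# Hypothetical numbers: 20,000 detected genes, of which 8,000 have no
# annotation; a list of 500 genes, of which 100 have no annotation.
p_full = ora_pvalue(overlap=10, list_size=500,
                    category_size=100, background_size=20000)
p_reduced = ora_pvalue(overlap=10, list_size=400,
                       category_size=100, background_size=12000)

print(f"full background:    p = {p_full:.2e}")
print(f"reduced background: p = {p_reduced:.2e}")
```

With these made-up counts, the expected overlap rises from 2.5 genes (full background) to about 3.3 genes (reduced background), so the same observed overlap of 10 yields a larger p-value once unannotated genes are dropped. The size and direction of the shift in real data depend on how many unannotated genes sit in the list versus the background.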

Screen for mycoplasma contamination in DNA-seq and RNA-seq data: Part 2

If you're culturing cells, avoiding contamination is an extremely important priority. Mycoplasma are common contaminants that can change the behaviour of cells without causing any obvious morphological changes, so routine testing is key. As bioinformaticians we should be incorporating myco screening into all our workflows, but the available tools haven't been that great. Even the shell script I wrote in a previous blog post isn't that great. It suffers from a few problems: it miscounts reads, it is awkward to use, and it leaves many files in the working directory. So it is high time for a refresh. With this refresh, the idea was to make a command line tool that folks could easily incorporate into their workflows, that would process single- and paired-end data, and that would be quick and concise in its output. The result is `contam`. To give you an idea of what it can do, here is the help page: contam is a script for the quantification of contaminant sequences in Illumina short read sequence data i...

Need to build and run a Docker image but don't have root access? No problem with Apptainer!

My group is moving towards a working paradigm where each project has a Docker image. This helps keep control of the environment for each project, because operating systems and programming languages like R and Python undergo regular upgrades, which are mostly undesirable for long-running data analytics projects. For example, in the field of biomedical genomics, some projects can take up to 10 years between initiation and final publication, so having a stable and portable computing environment is crucial. While this solution is fantastic if you have a workstation where you have free rein to use root/sudo permissions, many data analysts are restricted to using shared computing resources without root. There are some ways to mitigate these restrictions. One approach is to make a Docker image on another computer and use Singularity or Udocker to pull and convert it so that it can be run by a non-root user. But sometimes, users don't have access to another computer to build these image...

Weird multi-threaded behaviour of R/Bioconductor under Docker

As I was running some R code under Docker recently, I noticed that processes that should be single-threaded were using all available threads. This behaviour differed between R on a native Linux machine and R under Docker. A search of the forums found that this is due to the configuration of the BLAS system dependency on those Docker images, which is set to use all available threads for matrix operations. This configuration sounds like a good idea at first, because dedicating more threads to a problem should speed up the execution. But parallel processing incurs some overhead to coordinate the sub-tasks and communicate the data to and from daughter threads. This means that you rarely achieve linear speedup as you add threads. Typically, parallelisation has a sweet spot where the first 5-10 threads provide some speed-up, but beyond that there is either no improvement in speed or adding additional threads actually makes the cod...
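One common way to rein in BLAS threading is to cap it through environment variables before launching the job. The Python sketch below assumes the BLAS in use honours `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS` or `MKL_NUM_THREADS`, which depends on how the image was built and is not confirmed by this post; the helper names `capped_env` and `run_capped` are hypothetical.

```python
import os
import subprocess
import sys

# Variables read by common BLAS/OpenMP builds. Whether a given Docker
# image respects them is an assumption that should be checked.
BLAS_THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def capped_env(n_threads=1, base=None):
    """Return a copy of the environment with BLAS thread caps applied."""
    env = dict(os.environ if base is None else base)
    for var in BLAS_THREAD_VARS:
        env[var] = str(n_threads)
    return env


def run_capped(cmd, n_threads=1):
    """Run a command (list of strings) with BLAS threads capped."""
    return subprocess.run(
        cmd,
        env=capped_env(n_threads),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Demonstration: the child process sees the capped value.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(run_capped(child, n_threads=1).stdout.strip())
```

For an R job, the same idea would be `run_capped(["Rscript", "analysis.R"], n_threads=1)`, where `analysis.R` is a placeholder script name. Timing the job at a few different caps is the simplest way to find the sweet spot for a particular workload.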

Stress test RAM annually

TLDR: System memory can go bad. Use `memtester` on your Linux system annually to spot any problems early. Now the long story... In bioinformatics, we process a lot of data and conduct a lot of analysis. We use a range of devices, from laptops to desktop workstations, remote servers and the cloud. One particular desktop workstation of mine has been showing intermittent freezing and other problems, so I spent a bit of time trying to diagnose the issue. It is based on the AMD Threadripper 2990WX 32-core CPU with 8x 16GB DDR4 modules, and we have been using it to process thousands of RNA sequencing datasets for the DEE2 project. It has been working at maximum capacity for about 6 years. The symptoms were sudden shutdowns and freezing. I checked the CPU temperatures (using the `stress` command) and they were high, in the 90 °C range, which was odd given it was water cooled with an all-in-one system. I removed the block to inspect the thermal paste, and I found that the block did not appear t...