Posts

DEE2 database gets HDF5

Here I’ll show you how to download and work with the new HDF5 datasets from DEE2 (dee2.io). HDF5 files provide fast random access to large and complex datasets while occupying less disk space. Overall, the bulk data files are 50% smaller than the previously used BZ2 files. HDF5 also makes selecting datasets of interest quicker and obviates the need to convert data from "long" to "wide" format, which takes a long time and lots of RAM. In short, this is a big upgrade in end-user accessibility that will power large-scale analysis of DEE2 transcriptome data. The materials here are mostly based on the rhdf5 package here. I’ll demonstrate with E. coli, but this should also work for other organisms. The first step is to load the rhdf5 library and download the h5 file.

    library("rhdf5")
    library("tictoc")
    if (file.exists("ecoli_se.h5")) {
      message("HDF5 file exists")
    } else {
      message("Downloading HDF5 file")
      downl...

Mitch gets upgraded - now with gene set networks

Interpreting pathway enrichment analysis results is a big challenge. An analysis may return hundreds of statistically significant pathways, and narrowing these down to a shortlist of key mechanisms to follow up with experiments and describe in a publication is difficult. I recently got a request from a collaborator to come up with a way to visualise the key networks. I groaned... because network analysis in bioinformatics is sometimes characterised by hairballs of hundreds or thousands of meaningless interactions. It is mostly done poorly, and the resulting charts tend to be decorative rather than explanatory. After seeing a few well-done examples, such as Figure 2 from Chappel et al (PMID: 32138627), I thought this could be something we include as a common step in the mitch workflow. For those of you unaware, mitch is the R/Bioconductor package that Dr Antony Kaspi and I published in 2020 (PMID: 32600408) with the main focus being on multi-dimensional enrichmen...

Installing pip based tools for bioinformatics

Loads of bioinformatics and data science tools are available through the `pip` package manager/installer. In days gone by, it was possible to type something like `pip install mytool --user` to get a tool installed, but on recent systems this no longer works. When you try a command like that on Ubuntu 24, you will get an `externally-managed-environment` error and a suggestion to use a virtual environment (or "venv"). While there is a lot of in-depth documentation on using venv, I couldn't find a quick start guide specifically for bioinformaticians. In short, a virtual environment is an isolated environment that pins the base Python version and the package versions, which avoids problems caused by version mismatches. So here is a quick guide to installing the MultiQC tool, which is used extensively in genome sequence analysis.

Install multiqc via pip

The first step is to create a folder to put your venvs (you will likely have several if you do a lot of analysis). mkdir ~/venv Next...
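The workflow described above can be sketched roughly as follows. This is a minimal sketch and not necessarily the exact steps of the full post: the helper names `create_venv` and `install_in_venv` and the environment path `~/venv/multiqc` are assumptions made for illustration, and the only commands relied upon are `python3 -m venv` and the `pip` executable that a venv provides.

```shell
#!/usr/bin/env bash

# Hypothetical helper: create a venv at the given path.
# Any extra arguments are passed through to `python3 -m venv`.
create_venv() {
  local env_dir="$1"
  shift
  # Make sure the parent folder (e.g. ~/venv) exists first
  mkdir -p "$(dirname "$env_dir")"
  python3 -m venv "$@" "$env_dir"
}

# Hypothetical helper: install a package using the venv's own pip,
# so nothing is written to the system Python.
install_in_venv() {
  local env_dir="$1"
  local package="$2"
  "$env_dir/bin/pip" install "$package"
}

# Example usage (needs network access for the install step):
#   create_venv ~/venv/multiqc
#   install_in_venv ~/venv/multiqc multiqc
#   source ~/venv/multiqc/bin/activate   # multiqc is now on the PATH
#   multiqc --help
#   deactivate                           # leave the environment
```

Keeping one environment per tool under a single folder such as `~/venv` means each tool's dependencies stay separate, at the cost of some duplicated disk space.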

Bioinformatics data processing power depends on CPU L3 cache A LOT!

When speccing a CPU for a new bioinformatics computer we tend to focus on threads and frequency, but do you look at the cache? Well, you definitely should. On January 30, I started jobs on two servers. The first one has 2x Xeon E5-2680 v3 (48 threads total) and the second has 1x Xeon E5-2667 v3 (16 threads). The work we are doing is processing raw RNA-seq data using a pipeline of Skewer+STAR+Kallisto. Server 1 has many more threads, so I ran two parallel jobs of 12 threads each, while on Server 2 I'm running one job using 8 cores. The cluster nodes are attached to the same network storage, so I/O capacity is the same. Since Jan 30, these two machines have collectively processed 4670 datasets, and the difference between them was a big surprise. Not only did Server 2 process more datasets overall, but when normalised by number of threads, it processed 4.1 times more datasets!

                Server 1              Server 2
    CPU         2x Xeon E5-2680 v3    1x Xeon E5-2667 v3
    Cores       24                    16
    Threads     48                    16
    Frequency   ...

DEE2 2025 updates: growth and development

DEE2 is a database service I co-founded in 2015, aiming to provide uniformly processed gene expression profiles for each and every RNA-seq dataset in NCBI's Sequence Read Archive. After some major revisions, we published the database/service journal article in 2019. Over time, DEE2 has grown dramatically in both the number of datasets and the total number of counted reads, as seen in the tables below. Here, I'll walk you through the growth of the database over time and the new features we've added.

Growth of DEE2

DEE2 continues to ingest new metadata from NCBI SRA and GEO, and the growth of this metadata set has caused us a lot of issues over time. In the early years of DEE2 we used SRAdb, but the metadata set became too large for its design, so we sought different solutions. Currently we are using pySRA and only fetching quarterly. The size of each request has been a challenge, with human and mouse requests of annual dumps exceeding the 64 GB RAM of our backend workstation! S...