Posts

Showing posts from October, 2024

Two subtle problems with over-representation analysis

Over-representation analysis is a helpful and very frequently used technique for understanding trends in omics data and in gene lists more generally. To demonstrate this, I tabulated the number of citations of some of the most popular web tools and packages, which comes to a massive 191k citations! But if you have used some of these tools, you will have noticed that they yield subtly different results. We were curious about that and ran a series of investigations into the internal workings of these tools, in particular clusterProfiler. We identified two subtle problems with clusterProfiler, which we unpack in detail in our new publication, but here I will give you a quick overview.

Problem #1: The background problem

The first one we call the “background problem,” because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. This results in removing a large number of unannotated genes from the b
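As a rough illustration of why the choice of background matters, here is a minimal Python sketch of a one-sided hypergeometric (over-representation) test run twice on the same gene set: once against the full background and once after dropping unannotated genes. All the counts are hypothetical numbers made up for this example, and the sketch assumes unannotated genes are dropped from both the foreground and the background; it is not a description of how clusterProfiler is implemented.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the gene set."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical counts, chosen only for illustration
background_all = 20000   # all genes in the background list
background_annot = 12000 # background genes with at least one annotation
foreground_all = 500     # genes in the list of interest
foreground_annot = 400   # foreground genes with at least one annotation
set_size = 200           # genes annotated to the category being tested
overlap = 12             # foreground genes that fall in that category

# Test against the full background
p_full = hypergeom_sf(overlap, background_all, set_size, foreground_all)

# Same test after removing unannotated genes (assumed to affect both lists)
p_trimmed = hypergeom_sf(overlap, background_annot, set_size, foreground_annot)

print(f"full background:    p = {p_full:.3g}")
print(f"trimmed background: p = {p_trimmed:.3g}")
```

With these made-up counts the same overlap gets a different p-value depending only on which background is used, which is the kind of discrepancy between tools described above.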

Screen for mycoplasma contamination in DNA-seq and RNA-seq data: Part 2

If you're culturing cells, avoiding contamination is an extremely important priority. Mycoplasma are common contaminants that can change the behaviour of cells without causing any obvious morphological changes, so routine testing is key. As bioinformaticians we should be incorporating myco screening into all our workflows, but the available tools haven't been that great. Even the shell script I wrote in a previous blog post isn't that great: it miscounts reads, is awkward to use and leaves many files in the working directory. So it is high time for a refresh. With this refresh, the idea was to make a command line tool that folks could easily incorporate into their workflows, that would process single- and paired-end data, and that would be quick and concise in its output. The result is `contam`. To give you an idea of what it can do, here is the help page: contam is a script for the quantification of contaminant sequences in Illumina short read sequence data i