Two subtle problems with over-representation analysis

Over-representation analysis (ORA) is a helpful and very widely used technique for understanding trends in omics data and gene lists more generally. To demonstrate this, I tabulated the number of citations of some of the most popular web tools and packages, which comes to a massive 191k citations!


But if you have used some of these tools, you will have noticed that they yield subtly different results. We were curious about that and ran a series of investigations into the internal workings of these tools, in particular clusterProfiler. We identified two subtle problems with clusterProfiler, which we unpack in detail in our new publication, but here I will give you a quick overview.

Problem #1: The background problem

The first one we call the “background problem,” because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. Shrinking the background in this way has the effect of making the enrichment scores seem smaller than they really are. In the paper we show that this effect is more acute for smaller gene set libraries like KEGG, whereas for GO it is less of an issue.
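To make the effect concrete, here is a minimal sketch of a one-sided hypergeometric test run twice on the same gene set, once against all detected genes and once against annotated genes only. It is not clusterProfiler's code, the function and variable names are my own, and every number is made up for illustration. In these toy numbers the foreground genes are annotated at a higher rate than the background genes, which is the situation in which trimming the background lowers the enrichment score.

```python
from math import comb

def upper_tail_p(k, background, set_size, foreground):
    """P(overlap >= k) under the hypergeometric distribution."""
    total = comb(background, foreground)
    hits = sum(
        comb(set_size, i) * comb(background - set_size, foreground - i)
        for i in range(k, min(set_size, foreground) + 1)
    )
    return hits / total

def fold_enrichment(k, background, set_size, foreground):
    """Observed overlap divided by the overlap expected by chance."""
    return k / (foreground * set_size / background)

# Made-up numbers for illustration only.
detected_genes = 15000      # all genes detected in the experiment
annotated_genes = 7000      # detected genes with at least one annotation
foreground_all = 500        # genes in the user's gene list
foreground_annotated = 300  # of those, genes with at least one annotation
set_size = 100              # genes in the gene set being tested
overlap = 10                # gene list members that fall in the set

# Full background: every detected gene counts.
fold_full = fold_enrichment(overlap, detected_genes, set_size, foreground_all)
p_full = upper_tail_p(overlap, detected_genes, set_size, foreground_all)

# Reduced background: unannotated genes are silently dropped.
fold_reduced = fold_enrichment(overlap, annotated_genes, set_size, foreground_annotated)
p_reduced = upper_tail_p(overlap, annotated_genes, set_size, foreground_annotated)

print(f"full background:    fold={fold_full:.2f}  p={p_full:.2e}")
print(f"reduced background: fold={fold_reduced:.2f}  p={p_reduced:.2e}")
```

With these inputs the same overlap of 10 genes gives a lower fold enrichment and a larger p-value once the unannotated genes are dropped.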

Problem #2: The FDR problem

The second one we call the “false discovery rate problem,” because some tools underestimate the true number of parallel tests conducted. We found that clusterProfiler classifies gene sets as present or absent depending on the number of overlapping genes in the foreground, which is problematic because the mere act of looking for an overlap between two sets is a test, even if the overlap is zero. This underestimates the number of tests conducted and risks increasing the rate of false positives. We found this to be more severe when the user-defined gene list is small.
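As a rough illustration, the sketch below applies a Benjamini-Hochberg correction to the same p-values twice: once counting only the gene sets that made it through an overlap filter, and once counting every gene set in the library. The function name `bh_adjust`, the p-values and the library size are all hypothetical, and the filter is a simplification rather than clusterProfiler's exact rule.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the number of tests to correct for; it defaults to the
    number of p-values supplied. Passing a larger n_tests is equivalent
    to adding p = 1 for every gene set that was skipped.
    """
    m = len(pvalues) if n_tests is None else n_tests
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up example: the library has 50 gene sets, but only 5 passed the
# overlap filter and received a p-value.
library_size = 50
pvalues = [0.0004, 0.004, 0.02, 0.2, 0.6]

undercounted = bh_adjust(pvalues)                    # corrects for 5 tests
corrected = bh_adjust(pvalues, n_tests=library_size)  # corrects for 50 tests

n_sig_undercounted = sum(q < 0.05 for q in undercounted)
n_sig_corrected = sum(q < 0.05 for q in corrected)
print(n_sig_undercounted, n_sig_corrected)
```

In this toy case three gene sets pass FDR < 0.05 when only five tests are counted, but just one survives once all 50 sets in the library are counted.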

Impact: testing with simulations and real data

Authors of enrichment analysis papers often say that there is no good gold-standard method for simulation, which inhibits the development of better enrichment tools. The truth, however, is that very good and workable simulation frameworks are available. The one we used is one we developed back in 2020 when we benchmarked the mitch enrichment analysis package. That system is simple but clever. It relies on a single very large expression profile, with over 300 M reads allocated to different genes. We create pseudosamples by downsampling these counts and adding some random noise. We can then select particular gene sets to be given set fold changes in the "case" samples to emulate differential expression. This results in a fairly realistic set of samples, where the sequencing depth, number of replicates and degree of added noise can be tweaked to simulate real-world experiments.

After analysing these data with your favourite differential expression (DE) tool, the results can undergo enrichment analysis, and we can then assess them in terms of recall and precision. What we found is that fixing the background bug improved recall at the expense of precision, and fixing the FDR bug improved precision at the expense of recall. In these comparisons, I also ran a standard functional class scoring (FCS) technique (fgsea), and it was far superior in terms of recall.
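The pseudosample idea can be sketched in a few lines. This is a toy version under my own assumptions, not the code from the paper: the reference profile is tiny and randomly generated, downsampling is approximated by weighted sampling with replacement, the noise is a log-normal multiplier, and fold changes are applied to the sampling weights. All names (`make_pseudosample`, `noise_sd`, `fold_changes`) are hypothetical.

```python
import random
from collections import Counter

def make_pseudosample(profile, depth, noise_sd, rng, fold_changes=None):
    """Draw one pseudosample from a deep reference profile.

    profile: dict of gene -> read count in the deep profile
    depth: number of reads to draw for this pseudosample
    noise_sd: spread of the log-normal noise multiplier (0 means no noise)
    fold_changes: optional dict of gene -> fold change for this sample
    """
    fold_changes = fold_changes or {}
    genes = list(profile)
    # Assumption: noise and fold changes act on the sampling weights.
    weights = [
        profile[g] * fold_changes.get(g, 1.0) * rng.lognormvariate(0.0, noise_sd)
        for g in genes
    ]
    drawn = Counter(rng.choices(genes, weights=weights, k=depth))
    return {g: drawn.get(g, 0) for g in genes}

rng = random.Random(42)

# Toy stand-in for the deep profile (the real one has over 300 M reads).
profile = {f"gene{i}": rng.randint(100, 10000) for i in range(200)}

# One gene set is given a set fold change in the "case" samples.
up_set = [f"gene{i}" for i in range(10)]
case_fc = {g: 2.0 for g in up_set}

depth, n_reps, noise_sd = 20000, 3, 0.1
controls = [make_pseudosample(profile, depth, noise_sd, rng) for _ in range(n_reps)]
cases = [make_pseudosample(profile, depth, noise_sd, rng, case_fc) for _ in range(n_reps)]

# Reads assigned to the altered gene set, summed over replicates.
up_control = sum(s[g] for s in controls for g in up_set)
up_case = sum(s[g] for s in cases for g in up_set)
print(up_control, up_case)
```

Changing `depth`, `n_reps` and `noise_sd` corresponds to the tweakable sequencing depth, number of replicates and degree of added noise described above.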

To support these findings further, we reanalysed a public gene expression dataset, using downsampling to estimate precision and recall in sample subsets from n=2 all the way to n=30 tumor/normal datasets. We confirmed that fgsea outperformed all of the ORA methods in terms of recall, indicating that FCS methods are more sensitive overall. We were also surprised to find that fgsea had a lower false positive rate, indicating that with real data FCS is actually more precise than ORA. So the take-home message is that ORA underperforms compared with FCS, and therefore ORA shouldn't really be used anymore unless there is no alternative.
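For readers who want the two metrics spelled out, here is a small helper that scores a list of significant gene sets against a reference list. In a simulation the reference is the gene sets that were given a fold change. For real data some stand-in reference is needed, and the one used below is purely illustrative, as are the gene set names.

```python
def precision_recall(called, reference):
    """Precision and recall of called gene sets against a reference."""
    called, reference = set(called), set(reference)
    true_pos = len(called & reference)
    # Precision: fraction of called gene sets that are in the reference.
    precision = true_pos / len(called) if called else float("nan")
    # Recall: fraction of reference gene sets that were called.
    recall = true_pos / len(reference) if reference else float("nan")
    return precision, recall

# Hypothetical gene set names, for illustration only.
reference_sets = {"set_A", "set_B", "set_C", "set_D"}
subset_result = {"set_A", "set_B", "set_E"}

precision, recall = precision_recall(subset_result, reference_sets)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three called sets are in the reference (precision 0.67) and two of the four reference sets were recovered (recall 0.50).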

Conclusion: These problems are really widespread

We then surveyed whether the tools listed above also share these problems, and the results were very concerning. Among the web tools, only three didn't suffer from either of these two problems, while the other five suffered from one or both. On the command line, four of five tools suffered from these problems. This leads us to recommend only two tools for ORA: ShinyGO for the web, and fora at the command line.

Many popular ORA tools have issues

