EdgeR or DESeq2? Comparing the performance of differential expression tools

It seems like this discussion comes up a lot. Choosing a differential expression (DE) tool could change the results and conclusions of studies. How much depends on the strength of the data. Indeed this has been coveres by others already (here, here).

In this post I will compare these tools for a particular dataset that highlights the different ways these algorithms perform. So you can try it out at home, I've uploaded the code for this to GitHub here. To get it to work, clone the repository and open the Rmarkdown (.Rmd) file in Rstudio. You will need to install Bioconductor packages edgeR, DESeq2 and the CRAN package eulerr (for the nifty venn diagrams). Once that's done you can start working with the Rmd file. For an introduction to Rmd  read on here. Suffice to say that it's the most convenient way to generate a data report that includes code, results and descriptive text for background information, interpretation, references and the rest. To create the report in Rstudio, click the "knit" button and it will execute the code and you'll get a HTML report. If you're working in the R console, for example if you are connecting to a server via SSH, then you can generate the report using the command rmarkdown::render("myreport.Rmd") and you will get the output HTML file.

In case you're impatient and just want to see the result for yourself, I've hosted my output report at this link. For everyone else, here's the walkthrough. 

Multidimensional scaling analysis

Always do an MDS because it will give you an idea of the intra and inter group differences. As you can see here, there is a good measure of both that distinguishes this n=3 experiment. (The counts were derived from an ATAC-seq experiment).

DE analysis

Here is the summary of the results with the different tools:

edgeR (LRT) Up: 297 Down:589
edgeR (QL) Up:15 Down:204
DESeq2 Up:3 Down:113
Voom Up:100 Down:128

You can see the massive difference between them. In particular between edgeR LRT and DESeq. Let's take a look. The Euler package venn diagrams are quite easy to make and the areas of the shaded parts are proportional which is a nice touch.

I selected 9 genes that were significant with edgeR that were discordant with DESeq2 and created barcharts. Gene9565 in demonstrates that the algorithms deviate in how they treat outliers. 

Here are the stats:

EdgeR

              logFC   logCPM       LR       PValue        FDR dispersion

Gene9565 -0.7916072 2.291451 11.07955 0.0008728513 0.04036433   0.064547

               C1       C2       C3       T1       T2       T3

Gene9565 21.96188 14.17148 40.87624 24.08976 24.94944 20.96064


DESeq2

         baseMean log2FoldChange     lfcSE      stat    pvalue padj       C1

Gene9565  24.7313     -0.8373232 0.3698349 -2.264046 0.0235713   NA 7.203133

               C2      C3       T1       T2       T3

Gene9565 7.148011 7.06639 6.875975 6.935422 6.907739


The T2 sample is clearly a bit of an outlier here but it is being marked as statistically significant by edgeR and NA by DESeq2, because of the outlier detection

The next gene I want to highlight is Gene28045. 

EdgeR

               logFC   logCPM      LR       PValue        FDR dispersion

Gene28045 -0.3876685 5.166385 13.8639 0.0001965379 0.01514142 0.00657676

                C1       C2       C3       T1       T2       T3

Gene28045 148.1977 95.62844 275.8308 215.6399 223.3354 187.6296


DESeq2

          baseMean log2FoldChange     lfcSE      stat     pvalue      padj

Gene28045 183.5495     -0.4058583 0.1610077 -2.520739 0.01171087 0.4080693

                C1       C2       C3       T1       T2       T3

Gene28045 8.468181 8.443831 8.337265 8.230507 8.135978 8.113126


Clearly this is a problem because there's no way that these genes could be validated with qPCR and t-test and get a p-value anywhere near 0.05. Therefore I'm happy to recommend people not use edgeR LRT anymore and go with the more conservative DESeq2.

What about the other tools?

Based on twitter discussions I also assessed edger with the QL test, which gave results more similar to DESeq.
Voom-limma as well

One thing I'm curious about is how the results from DESeq2 are biased with very few upregulated genes, which makes sense because this is a knock-down experiment. However with Voom-Limma, there is roughly equal number of up and downregulated genes.

UpSet plots can be useful to visualise commonlalities between three or more lists. Let's see how it looks for the 4 methods above. Firstly for the upregulated genes:

Then for the downregulated genes:




Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?