Get the newest Reactome gene sets for pathway analysis

For first-pass pathway analysis I find Reactome to be the gene set database that biologists find easiest to interpret. For a long time I used the Reactome gene sets distributed through the GSEA/MSigDB website. Recently a colleague pointed me to the gene matrix file offered directly on the Reactome website (thanks, Dr Okabe).
The latest Reactome gene matrix transposed (GMT) file can be found at this link: https://reactome.org/download/current/ReactomePathways.gmt.zip
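If you want to script the download, something like the following should work. It is only a sketch: the function name `fetch_reactome_gmt` and the `.accessed` date stamp file are my own inventions, and it assumes `curl` and `unzip` are installed. The date is recorded because the "current" link points to whatever release is newest on the day you fetch it.

```shell
# URL as given in the post; "current" always points at the newest release.
REACTOME_GMT_URL="https://reactome.org/download/current/ReactomePathways.gmt.zip"

# Hypothetical helper: download and unpack the GMT into a directory
# (defaults to the current directory). Assumes curl and unzip exist.
fetch_reactome_gmt() {
    dest_dir=${1:-.}
    mkdir -p "$dest_dir" || return 1
    curl -fsSL -o "$dest_dir/ReactomePathways.gmt.zip" "$REACTOME_GMT_URL" || return 1
    unzip -o -q "$dest_dir/ReactomePathways.gmt.zip" -d "$dest_dir" || return 1
    # Record the access date next to the file for later reference.
    date +%Y-%m-%d > "$dest_dir/ReactomePathways.gmt.accessed"
}

# Usage (needs network access):
#   fetch_reactome_gmt reactome_gmt
```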

There are some differences. Firstly, the file from the Reactome website (accessed 2018-05-09) contains about three times as many gene sets:

$ wc -l *gmt
    674 c2.cp.reactome.v6.1.symbols.gmt
   2022 ReactomePathways.gmt
   2696 total

Secondly, there are more genes included in one or more gene sets:

$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l
6025
$ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l
10852

And overall there are nearly three times as many gene-pathway entries:

$ cut -f3- ReactomePathways.gmt | wc -w
106405
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w
37601
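The three counts above (gene sets, distinct genes, gene-pathway entries) can also be collected in one pass. This is a minimal sketch, assuming the usual GMT layout of name, description, then one gene per tab-separated field. The function name `gmt_summary` and the toy file are made up for illustration. Unlike the `sort -u` one-liner, it skips empty fields, so the gene count could differ slightly on files with trailing tabs.

```shell
# Print: gene sets <TAB> distinct genes <TAB> gene-set memberships.
# Assumes GMT layout: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_summary() {
    awk -F '\t' '
        {
            sets++                          # one gene set per line
            for (i = 3; i <= NF; i++) {
                if ($i == "") continue      # skip empty fields
                memberships++
                if (!($i in seen)) { seen[$i] = 1; genes++ }
            }
        }
        END { printf "%d\t%d\t%d\n", sets, genes, memberships }
    ' "$1"
}

# Toy two-pathway file, used only to show the call.
demo_gmt=$(mktemp)
printf 'Pathway A\tR-1\tTP53\tMDM2\tATM\n' >  "$demo_gmt"
printf 'Pathway B\tR-2\tTP53\tCHEK2\n'     >> "$demo_gmt"

gmt_summary "$demo_gmt"
```

On the real files you would call it as `gmt_summary ReactomePathways.gmt` and `gmt_summary c2.cp.reactome.v6.1.symbols.gmt`.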

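I also looked at whether the gene sets had the same names. To make the names comparable, I converted them to upper case, replaced spaces and hyphens with underscores, and removed the REACTOME_ prefix from the MSigDB names. Other differences in punctuation are not handled, so the overlap is approximate.

First I counted the names that were common: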
$ comm -12 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort )  | wc -l
414

Then I counted the names specific to the MSigDB version:
$ comm -13 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort )  | wc -l
260

Lastly I counted the names specific to the official version:
$ comm -23 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort )  | wc -l
1608
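The three one-liners above repeat the same normalisation, so it can be wrapped in a function. The following is a sketch, not the exact commands I ran: the names `gmt_names` and `compare_gmt_names` are hypothetical, it uses temporary files instead of process substitution so it also runs in plain sh, and it strips the REACTOME_ prefix from both files (anchored to the start of the name), which assumes no official name starts with that string.

```shell
# Normalise set names: upper case, spaces and hyphens to underscores,
# leading REACTOME_ removed, then sorted for comm.
gmt_names() {
    cut -f1 "$1" | tr 'a-z' 'A-Z' | tr ' -' '__' | sed 's/^REACTOME_//' | LC_ALL=C sort
}

# Report how many normalised names are shared or unique to each file.
compare_gmt_names() {
    cmp_a=$(mktemp) || return 1
    cmp_b=$(mktemp) || return 1
    gmt_names "$1" > "$cmp_a"
    gmt_names "$2" > "$cmp_b"
    printf 'common\t%d\n'      "$(LC_ALL=C comm -12 "$cmp_a" "$cmp_b" | awk 'END { print NR }')"
    printf 'only_first\t%d\n'  "$(LC_ALL=C comm -23 "$cmp_a" "$cmp_b" | awk 'END { print NR }')"
    printf 'only_second\t%d\n' "$(LC_ALL=C comm -13 "$cmp_a" "$cmp_b" | awk 'END { print NR }')"
    rm -f "$cmp_a" "$cmp_b"
}

# Toy files, made up only to show the call.
official=$(mktemp)
msigdb=$(mktemp)
printf 'Cell Cycle\tR-1\tCDK1\n'              >  "$official"
printf 'DNA Repair\tR-2\tATM\n'               >> "$official"
printf 'Cell-Cell communication\tR-3\tCDH1\n' >> "$official"
printf 'REACTOME_CELL_CYCLE\tna\tCDK1\n'      >  "$msigdb"
printf 'REACTOME_APOPTOSIS\tna\tBAX\n'        >> "$msigdb"

compare_gmt_names "$official" "$msigdb"
```

On the real data the call would be `compare_gmt_names ReactomePathways.gmt c2.cp.reactome.v6.1.symbols.gmt`.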

I have re-run a couple of GSEAs with the official Reactome gene sets and obtained much better results, so I recommend using the official Reactome release for future pathway analyses. It would be great if the GSEA/MSigDB team could update their Reactome gene sets soon.

Finally, if you include Reactome data in your publications, please cite them so they can continue their awesome work.
