Get the newest Reactome gene sets for pathway analysis
For first-pass pathway analysis, I find Reactome to be the gene set database that biologists understand most readily. For a long time I have used the Reactome gene sets deposited on the GSEA/MSigDB website. Recently a colleague pointed me to the gene set file offered directly on the Reactome website (thanks, Dr Okabe).
The latest Reactome gene set file in GMT (Gene Matrix Transposed) format can be found at this link: https://reactome.org/download/current/ReactomePathways.gmt.zip
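For scripted analyses, the download can be wrapped in a small helper. This is only a sketch: the function name `fetch_reactome_gmt` is hypothetical, it assumes `curl` and `unzip` are installed, and it assumes the archive unpacks to a file called ReactomePathways.gmt (the name used in the commands in this post).

```shell
# Hypothetical helper: download and unpack the current Reactome GMT archive.
# Assumes curl and unzip are available and that the URL is still valid.
fetch_reactome_gmt() {
  url="https://reactome.org/download/current/ReactomePathways.gmt.zip"
  dest="${1:-.}"                        # target directory, defaults to the current one
  mkdir -p "$dest" || return 1
  # -f: fail on HTTP errors, -L: follow redirects, -o: output file
  curl -fL -o "$dest/ReactomePathways.gmt.zip" "$url" || return 1
  # -o: overwrite an older copy without prompting, -d: extraction directory
  unzip -o "$dest/ReactomePathways.gmt.zip" -d "$dest" || return 1
  # Keep a dated copy so the analysis records which release was used
  # (assumes the archive contains ReactomePathways.gmt)
  cp "$dest/ReactomePathways.gmt" "$dest/ReactomePathways.$(date +%Y-%m-%d).gmt"
}
```

Calling `fetch_reactome_gmt genesets` would place the files in a `genesets` directory. Because the "current" link changes with each Reactome release, keeping the dated copy makes it easier to report the access date.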
There are some differences. Firstly, there are more gene sets in the version from the Reactome website (accessed 2018-05-09):
$ wc -l *gmt
674 c2.cp.reactome.v6.1.symbols.gmt
2022 ReactomePathways.gmt
2696 total
Secondly, there are more genes included in one or more gene sets:
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l
6025
$ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l
10852
Overall, there are about threefold more gene-pathway entries:
$ cut -f3- ReactomePathways.gmt | wc -w
106405
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w
37601
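The three comparisons above (number of gene sets, distinct genes, and gene-pathway entries) can also be collected in one pass per file. The following is a minimal sketch, not the commands I actually ran: `gmt_stats` is a hypothetical helper name, and it assumes each GMT line is tab-separated with the set name in field 1, a description or identifier in field 2, and gene symbols from field 3 onwards. The demo file is made up purely to show the output format.

```shell
# Hypothetical helper: summarise a GMT file in one pass.
# Assumes tab-separated lines: set name, description/ID, then gene symbols.
gmt_stats() {
  awk -F'\t' '
    {
      for (i = 3; i <= NF; i++) {
        if ($i == "") continue          # ignore empty trailing fields
        memberships++                   # one gene-pathway entry
        genes[$i] = 1                   # remember each distinct gene
      }
    }
    END {
      n = 0
      for (g in genes) n++              # count distinct genes portably
      printf "%s\tsets=%d\tgenes=%d\tmemberships=%d\n", FILENAME, NR, n, memberships
    }
  ' "$1"
}

# Tiny made-up GMT file, only to demonstrate the output format
demo=$(mktemp)
printf 'Set A\tdesc\tTP53\tMDM2\nSet B\tdesc\tTP53\tATM\tCHEK2\n' > "$demo"
gmt_stats "$demo"
rm -f "$demo"
```

On the real data this would be run as `gmt_stats ReactomePathways.gmt` and `gmt_stats c2.cp.reactome.v6.1.symbols.gmt`. The membership count should agree with `wc -w` as long as gene symbols contain no spaces.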
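I also checked whether the gene sets had the same names. To make the names comparable, I converted them to upper case, replaced spaces and hyphens with underscores, and removed the REACTOME_ prefix from the MSigDB names.
First I counted the names common to both files: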
$ comm -12 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort ) | wc -l
414
Then I counted the names specific to the MSigDB version:
$ comm -13 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort ) | wc -l
260
Lastly I counted the names specific to the official version:
$ comm -23 <(cut -f1 ReactomePathways.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sort ) <(cut -f1 c2.cp.reactome.v6.1.symbols.gmt | tr '[a-z]' '[A-Z]' | tr ' -' '_' | sed 's/REACTOME_//' | sort ) | wc -l
1608
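The three one-liners repeat the same name normalisation, so it can be factored into a function. This is a sketch under assumptions rather than the exact commands above: `norm_names` and `compare_names` are hypothetical names, the prefix is stripped only at the start of the name, and `LC_ALL=C` is set so that `sort` and `comm` agree on ordering. The demo files and their identifiers are invented for illustration.

```shell
# Hypothetical helper: normalised, sorted gene set names from a GMT file.
# $1 = GMT file, $2 = optional prefix to strip (e.g. REACTOME_)
norm_names() {
  cut -f1 "$1" | tr 'a-z' 'A-Z' | tr ' -' '__' | sed "s/^${2:-}//" | LC_ALL=C sort
}

# Hypothetical helper: count shared and file-specific names.
# $1 = official Reactome GMT, $2 = MSigDB Reactome GMT
compare_names() {
  tmp_a=$(mktemp); tmp_b=$(mktemp)
  norm_names "$1" > "$tmp_a"
  norm_names "$2" REACTOME_ > "$tmp_b"
  # comm -12: lines in both; -23: only in first file; -13: only in second file
  echo "shared=$(LC_ALL=C comm -12 "$tmp_a" "$tmp_b" | awk 'END { print NR }')"
  echo "official_only=$(LC_ALL=C comm -23 "$tmp_a" "$tmp_b" | awk 'END { print NR }')"
  echo "msigdb_only=$(LC_ALL=C comm -13 "$tmp_a" "$tmp_b" | awk 'END { print NR }')"
  rm -f "$tmp_a" "$tmp_b"
}

# Made-up example files (names and IDs are placeholders, not real data)
official=$(mktemp); msigdb=$(mktemp)
printf 'Cell Cycle\tID-1\tTP53\nDNA Repair\tID-2\tATM\n' > "$official"
printf 'REACTOME_CELL_CYCLE\tna\tTP53\nREACTOME_OLD_SET\tna\tMDM2\n' > "$msigdb"
compare_names "$official" "$msigdb"
rm -f "$official" "$msigdb"
```

With the real files the call would be `compare_names ReactomePathways.gmt c2.cp.reactome.v6.1.symbols.gmt`. Writing the sorted names to temporary files avoids process substitution, so the sketch should also run in shells other than bash.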
I have re-run a couple of GSEA analyses with the official Reactome gene sets and obtained MUCH better results, so I recommend using the official Reactome release for future pathway analyses. It would be great if the GSEA/MSigDB team could update their Reactome collection soon.
Finally, if you include Reactome data in your publications, please cite them so they can continue their awesome work.