Posts

Showing posts with the label GSEA

Get the newest Reactome gene sets for pathway analysis

Image
For first-pass pathway analysis I find Reactome to be the most useful database of gene sets for biologists to understand. For a long time I have been using Reactome gene sets as deposited to the GSEA/MSigDB website. Recently a colleague pointed me to the gene matrix file offered directly on the Reactome webpage (Thanks Dr Okabe).

There are some differences. Firstly there are more gene sets in the one from the Reactome webpage (accessed 2018-05-09)
$ wc -l *gmt     674 c2.cp.reactome.v6.1.symbols.gmt    2022 ReactomePathways.gmt    2696 total
Secondly, there are more genes included in one or more gene sets:
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l 6025 $ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l 10852
And overall there are about threefold more gene-pathway entries 
$ cut -f3- ReactomePathways.gmt | wc -w 106405 $ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w 37601
I also looked at whether the gen…

Introducing the ENCODE Gene Set Hub

Image
TL;DR We curated a bunch of ENCODE data into gene sets that is super useful in pathway analysis (ie GSEA).
Link to gene sets and data: https://sourceforge.net/projects/encodegenesethub/
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets, this is really cool as you can start to generate gene sets based on your own profiling work and that of others.

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated  from ENCODE transcription factor binding site (TFBS) data to understand which TFs were drivi…

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse so I thought I would share.

GO.v5.2.symbols_mouse.gmt
kegg.v5.2.symbols_mouse.gmt
msigdb.v5.2.symbols_mouse.gmt
reactome.v5.2.symbols_mouse.gmt

Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human gmt files. It requres gnu parallel which can be easily installed on Ubuntu with "sudo apt-get install parallel"


#!/bin/bash

conv(){
line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u…

How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).


Name                        logFC               logCPM            LR                PValue ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19 ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14 ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12 ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10 ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056…

Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

#!/bin/bash
RNK=$1
GMT=$2
java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPre…

Pathway analysis: DAVID versus GSEA

Image
Pathway analysis has become such a common procedure in bioinformatics, especially in studying gene expression. If you look at a survey of recent papers it seems that there is a bunch of ways that it can be done. In this post, I'll discuss the differences between two commonly used tools; DAVID and GSEA.
The concepts behind the two algorithms are very different. DAVID determines overlaps between user-supplied gene lists and the curated databases, looking for overlaps that are bigger than that expected by random chance. You can improve the accuracy of the algorithm by providing a background file that contains all genes that were considered/detected in the experiment. The user-supplied gene sets are normally generated by selecting genes that pass a significance threshold. The DAVID procedure is similar to others available such as Ingenuity, AmiGO and GeneGO. The selection of significance values is largely arbitrary, but it is common to set the threshold at FDR adjusted p<0.05. Modi…

Pathway analysis with ZGST

Image
Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are a lot of pathway analyses strategies available and I can break them down into these groups:

Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software Limma, edgeR, DESeq, etc. These include CAMERA, MRGST, WilcoxGST, Roast, etcCommercial, GUI based. Such as Ingenuity IPA or MetaCore GeneGO.Java based such as GSEA and GSAAWeb-based tools such as DAVIDWebGestalt, GO Enrichment analysis Now these each have their advantages and disadvantages. I wanted to see whether I could make a tool pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.
I wrote a pathway analysis in Bash. I know. Its crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. It…

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash
#rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
XLS=$1
#Specify the gene ID column
ID=$2
#Specify the fold change value column
FC=$3
#Specify the raw p-value column
P=$4

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \
| awk '{print $1,$2,$2*…

Data analysis step 9: Leverage ENCODE data for enhanced pathway analysis

Image
So far in this series of posts we've analysed the effect of azacitidine on AML3 cell gene expression using publicly available RNA-seq data. Our pathway analysis showed many interesting trends, but unravelling all the different mechanisms from this point can be really hard. In order to identify the major players at the chromatin level, it can be useful to integrate transcription factor binding data and see whether targets of a particular transcription factor are differentially regulated in a pathway analysis. The problem with this analysis in the past was that ChIP-seq datasets were in varying formats on GEO and processing these into a standardised format would be too time consuming. With the advent of the ENCODE Project, there is now a large body of transcription factor binding data in a uniform format (link), that is being mined in many creative ways. In our group, we used this approach extensively (here, here and here).

In this post, we will mine ENCODE transcription factor bind…

Data analysis step 8: Pathway analysis with GSEA

Image
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.

We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared a bunch of gene set analyses software with microarrays and is worth a look.
Generate a rank file The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).

A Biomarker Gene Set From ENCODE Expression Data

Image
MSigDB contains thousands of gene sets which have been mined from a range of genome wide studies and these are a valuable resource for gene ontology and pathway analysis. You probably know that many gene sets are curated by KEGG, REACTOME and BIOCARTA - as well as dry lab scientists who specialise in analysing these data sets and curating gene sets. What you may not know is that if you follow a few basic guidelines, you can start generating your own custom gene sets and these can become a valuable resource for running gene ontology and Gene Set Enrichment Analysis (as per the graphic)


For instance within our lab, we have extensively used ENCODE ChIP-Seq data to help us to analyse our mRNA-seq data and this has provided a huge leg-up in generating hypothesis and designing follow-up experiments. For this example, I want to show an overview of how I made biomarker gene sets for a bunch of cell types analysed by ENCODE. Biomarker gene sets can useful in array or mRNA-Seq analysis that you…