Data analysis step 8: Pathway analysis with GSEA

In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.

We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared a bunch of gene set analyses software with microarrays and is worth a look.

Generate a rank file

The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).

Here is what the first 10 lines in the DESeq output look like:
Top 10 lines of differential expression table generated by DESeq.

RNK=`echo $DGE | sed 's/.xls/.rnk/'`
sed 1d $DGE \
| sort -k7g \
| cut -d '_' -f2- \
| awk '!arr[$1]++' \
| awk '{OFS="\t"}
{ if ($6>0) printf "%s\t%4.3e\n", $1, 1/$7 ;
else printf "%s\t%4.3e\n", $1, -1/$7 }' \
| sort -k2gr > $RNK

And here's the top and bottom 10 lines of the rank file:

$cat DEB_DESeq.rnk | (head;tail)
AC011899.9 3.352e+64
CAMK1D 1.471e+51
HSD11B1 1.919e+48
GNG4 2.279e+43
CPLX1 3.873e+40
CGNL1 1.200e+39
APCDD1 1.237e+35
MANEAL 9.303e+32
MKX 4.441e+28
HOXA13 1.007e+26
HLA-B -2.908e+132
PRAME -3.063e+165
LYZ -9.701e+185
CPXM1 -1.169e+186
HSH2D -1.658e+186
TGFBI -8.856e+210
APOC2 -9.633e+243
APOC4-APOC2 -1.984e+245
COL1A2 -1.298e+274
SLC12A3 -5.422e+304

The output looks consistent with the table I made in a previous post.

Run GSEA using the Java GUI

Download the javaGSEA Desktop Application from the Broad website (you will need to register). Open a terminal and cd into the directory containing the jar file and write (substitute the XX for appropriate version)

java -jar gsea2-XX.jar

Hit the "Load data" tab and browse for the rnk file we just generated.

Next, select "GSEA Preranked" from the "Tools" pull-down menu.

Here is the list of parameters I used:
Gene sets database: Initially, I like to use REACTOME to see that GSEA is working as well as quickly give a broad idea of the biology going on. Later on we will run with everything in the MSigDB database.
Number of permutations: 1000 is normally OK.
Ranked List: Ensure that GSEA is looking at the newly generated rnk file.
Collapse dataset to gene symbols: False
Chip platform: We're using is the gene symbols, you can leave this blank and GSEA will download the gene symbol chip file from the Broad website.
Enrichment statistic: Classic, this behaves better when we have extreme p-values such as this case
Max gene set size: 5000
Min gene set size: 15
Collapsing mode for probe sets>1 gene: Max_probe
Normalisation mode: meandiv
Omit features with no symbol: True


From the initial REACTOME data base of 674 sets, 242 were removed because there either were less than 15 members detected or more than 5000. This left us with 432 sets to analyse. There were 220 sets with FDR p-value (called q-value here) ≤0.05. From these, 125 were up-regulated and 95 were down-regulated.
REACTOME GSEA result of azacitidine exposed AML3 cells by RNA-seq. Top 20 gene sets in either direction are shown.
Most down and up-regulated REACTOME pathways in azacitidine treated AML3 cells.
This result shows up-regulation of cell cycle and DNA repair pathways and down-regulation of innate immune response.

MSigDB Result

In the case of MSigDB, we began with 10295 gene sets and 2148 were filtered out (too few or too many), leaving 8147 sets. From these, 1613 were up-regulated (FDR p-value≤0.05) and 3692 were down-regulated.
MSigDB GSEA result of azacitidine exposed AML3 cells by RNA-seq. Top 20 gene sets in either direction are shown.
Two significant up and down-regulated MSigDB gene sets in azacitidine treated AML3 cells .
These results show the up-regulation of EWS/FLI fusion protein target genes that are linked to oncogenesis (reference). The down-regulation of genes linked to metabolic syndrome (reference) is somewhat unexpected, although many members are part of the interferon signalling pathway shown above.


  1. What do you do if the adj. p-value = 0 (equal zero)?
    The value would be than be Inf. How can I prevent it?

    1. You could try this one. Notice the sed command at the end which handles infinites. Also note that these days I'm doing most of these things in R because it is better suited to working with tables of data.

      #!/bin/bash converts a differential gene expression spreadsheet (TSV/XLS)
      #into a rank file (RNK)

      #Specify the input file
      #Specify the gene ID column
      #Specify the fold change value column
      #Specify the raw p-value column


      sed 1d $XLS | tr -d '"' \
      | cut -d '_' -f2- \
      | awk '!arr[$1]++' \
      | awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} {print $I, $F, $P}' \
      | awk '$2!="NA" && $3!="NA"' \
      | awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \
      | sort -k2gr \
      | sed 's/inf$/312/' > $RNK

    2. Thanks for the reply. I also do most of the analysis, as well as the enrichment and gsea analyses. But I have searches for a method to handle Inf-values and found your description. This is why I have tried to ask here.

      I have two follow-up questions :-)

      1. Do you prefer t use the p-values as the term for the ranking? I know that people have been discussing this point. Some prefer the log2FC, other even the adjusted p-values. Is there a reason why you use the p-values?

      2. why 312? How do you choose the value to replace the Inf into?


    3. No problem.
      1. You are free to experiment with different ranking procedures. My experience with logFC is that there are some genes that have overall very low expression and high expression in one or a few samples. When looking at the bar charts of such genes you can see this is not a truly differential expressed gene, but it gets an extreme logFC value. At least with the p-value method, the genes at the extreme ends have some degree of differential expression, its just arguable how much. The truly correct solution would do ranking based on something like the confidence interval, something which has been done by Harrison et al implemented in the topconfects package.
      2. A rank score of 312 corresponds to a p-value of 10^312 which is close to the smallest number that can be represented by the computer. As I mentioned previously this is a bit of a hack and these sorts of numerical analyses are better served in R.

  2. This comment has been removed by a blog administrator.

  3. Hello
    You said that the metric score you like to use is the sign of the fold change multiplied by the inverse of the p-value, could you attach the paper it is based on ?

    1. Details in here:

  4. Thank you for sharing such a useful article. I had a great time. This article was fantastic to read. Continue to publish more articles on

    Data Engineering Solutions 

    Data Analytics Service Provider

    Data Modernization Services

    Machine Learning Services


Post a Comment

Popular posts from this blog

Installing R-4.0 on Ubuntu 18.04 painlessly

EdgeR or DESeq2? Comparing the performance of differential expression tools