Pathway analysis with ZGST

Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are a lot of pathway analyses strategies available and I can break them down into these groups:

  1. Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software Limma, edgeR, DESeq, etc. These include CAMERA, MRGST, WilcoxGST, Roast, etc
  2. Commercial, GUI based. Such as Ingenuity IPA or MetaCore GeneGO.
  3. Java based such as GSEA and GSAA
  4. Web-based tools such as DAVIDWebGestalt, GO Enrichment analysis
Now these each have their advantages and disadvantages. I wanted to see whether I could make a tool pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.

I wrote a pathway analysis in Bash. I know. Its crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. Its called ZGST, for Ziemann's Gene Set Test. It is available on Sourceforge now.

Here's how it works:
  1. User supplies a gene set matrix in GMT format along with a differential expression spreadsheet specifying p-value and fold change.
  2. ZGST ranks scores each gene based in the signed log p-value. Most up-regulated genes by p-value at the top and most down-regulated at the bottom.
  3. Program then calculates the sum of the ranks for each gene set if the gene set is up-regulated, the value is positive, otherwise the value is negative.
  4. Next the same number of genes are selected at random from the list and the rank sum is compared to that of the gene set score. This process is repeated 1000+ times and that is used to estimate a p-value that the gene set is deregulated.
  5. False discovery rate p-value correction.
  6. Generate tables of deregulated genes, enrichment plot and volcano plot for a select number of gene sets.
Here are some features:
  1. Performs rank based (non-parametric) by default but can perform weighted analysis as an option
  2. Can perform a bootstrap sensitivity analysis by randomly discarding 10% of genes during each permutation.
  3. Runs in parallel; the config file specifies the number of threads to use.
  4. Very low memory footprint.
There are some limitations:
  1. The differential gene list needs to contain gene identifiers that match the GMT file.
  2. Currently limited to Linux, but early testing suggests it could be easily ported to MacOSX. Windows fans will need VirtualBox to run one of the above Linux distros.
  3. Dependencies: GNU parallel, gnuplot, R is only required for the weighted analysis.

Test run

I selected GSE20602, an experiment comparing 4 control kidney samples with 14 kidney samples from patients with nephrosclerosis. Then I downloaded MSigDB v5. Then ZGST can be run quite simply:

 ./ <-c configfile.config>
 ./ <-d DGE.xls,GeneIDcol,FCcol,Pvalcol> <-g gmtfile.gmt>
 ./ -h

So here's an example of my GEO2R generated analysis.

./ -d GSE20602.txt,7,6,3 -g msigdb.v5.0.symbols.gmt

Alternatively, a config file can be used to specify more options:

./ -c config1.cfg

Where an example config file looks like:



 Total gene sets: 10153
 Gene sets analysed: 9535
 Up-regulated gene sets nom 95 conf: 1467
 Down-regulated gene sets nom 95 conf: 2142
 Up-regulated gene sets FDR 95 conf: 1100
 Down-regulated gene sets FDR 95 conf: 1684

So here's the top 10 most deregulated pathways in the nephrosclerosis dataset:

Here's what the enrichment plots look like.
Example enrichment plot produced by ZGST
And it also outputs volcano plot too, with the members of this gene set highlighted in blue.

Example volcano plot produced by ZGST.


On my 8-core workstation, ZGST ran in 3m10s seconds, compared to 16m49s for GSEA making ZGST about 5x faster. GSEA also used 1.8GB RAM compared to negligible RAM for ZGST.

Then I compared the gene set results by plotting the GSEA (classic) enrichment score against the ZGST rank sum score. Expectedly, the results were very similar, with some interesting differences about the middle of the plot. The shape of the plot could be due to how GSEA calculates the ES and how the ES is never very small (<0.01).

Lastly, I looked at the overlaps of the statistically significant (FDR<0.05) gene sets, which confirms the large overlap in the gene set identification. I did notice 33 gene sets that were discordant between the two methods.
I'll discuss these differences in more detail in another post.

In conclusion, it is possible to do valid pathway analysis using a shell script program 5x faster than the current standard (GSEA). The approach used here could easily be applied in another language (ie. Perl/Python) to improve portability and speed even further.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc