Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

#!/bin/bash
RNK=$1
GMT=$2
java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPreranked \
-gmx $GMT -collapse false -mode Max_probe \
-norm meandiv -nperm 1000 -rnk $RNK -scoring_scheme classic \
-rpt_label ${RNK}.${GMT} -include_only_symbols true -make_sets true \
-plot_top_x 20 -rnd_seed timestamp -set_max 5000 -set_min 10 -zip_report false \
-out . -gui false

The script is run like this:

$ chmod +x run_gsea.sh
$ ./run_gsea.sh gene_expression.rnk msigdb.gmt

You might want to tee up a few jobs using for loops:

for RNK in *rnk ; do
  for GMT in *gmt ; do
    ./run_gsea.sh $RNK $GMT
  done
done

A better solution could be to use gnu-parallel to schedule several jobs in parallel, in this case four. Make sure that you have enough memory to complete this many jobs in parallel.

$ parallel -j4 ./run_gsea.sh ::: *rnk ::: *gmt

You could also use the R script wrapper thats supplied on the GSEA website (http://software.broadinstitute.org/gsea/downloads.jsp). You will need to register to gain access.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths