Generating a custom gmt file for gene set analysis

Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG, Reactome, GOMSigDB and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt. I'm going to take you through two examples of generating custom gene sets:

Generate gene sets from published data sets using GEO2R

Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med. Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma but this is embedded in the webpage and has a GUI which makes it very accessible for biologists.

Click this link to go directly to the GEO2R analysis page for this data set. You will see that there are four sample groups STAT1 knock-down irradiated, STAT1 knock-down untreated, Control vector irradiated and Control vector untreated. For this example, select the Control vector untreated samples, click define groups, enter in "Control" and then select the STAT1 knock-down untreated, click define groups, enter in "STAT1KD" as I have done in the below image.

Then click on the "Top250" button and this will kick off the comparison of the samples and show the top 250 differential array probes. You can then click the "save all results" button and save to a file.  You might notice that some probes have multiple associated genes, in which case we can include both/all the gene names. To get to a gmt format, we need the gene list to be organised horizontally with the two left-most columns set aside for metadata. Normally the first column is the gene set name and the second is a link to a paper or GEO accession. In our case to make it streamlined, I've included some metadata (ie, the GEO accession) in the gene set name, while leaving only "NA" in the second column.

In this example, because the adjusted p-values aren't that great, I've selected a non-adjusted p-value threshold of 0.001 which gives us a few hundred genes in either direction if you run the script below. You may also want to filter on fold change if that is part of your standard pipeline. If you don't like using command line scripts, then it is still possible to make a gmt file manually using spreadsheet filtering and transposing.

for DGE in STAT1-Knockdown_GSE15845.xls
do
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6>0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_UP\tNA\t/" \
| sed 's/$/\n/' > ${DGE}.gmt
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6<0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_DN\tNA\t/" \
| sed 's/$/\n/' >> ${DGE}.gmt
done

Generate gene sets of human microRNA-mRNA targets from StarBase

StarBase is a repository for CLIP-Seq and degradome seq and provides a database for potential microRNA-mRNA target information and is a useful resource for predicting the function of microRNAs. If you go to the StarBase website, go to the miRNA-target relationship page. Leave all the pull-down menus blank and hit the "Search" button and it will bring up the complete StarBase microRNA-mRNA target set. Hit the "export as excel" button and it will be downloaded.

You can run the below script and it will generate a gmt containing sets of genes which are targeted by specific microRNAs. You will need to check the name of the starbase csv file.

STARBASE=starBase_Human_miRNA-target_interactions.csv
for MIR in `sed -n '1!p' $STARBASE | cut -d ',' -f1 | sort -u `
do
echo $MIRgrep -w $MIR $STARBASE | cut -d ',' -f2
done | tr '\n' '\t' | sed 's/hsa-miR/>hsa-miR/g' \
| tr '>' '\n' > Starbase_Human.gmt

The result from the first few lines and columns looks good and is ready for GSEA.

hsa-let-7a      SLC45A4 CNOT4   SMCR7L  APPBP2  CFL2    KLF9
hsa-miR-1       SMAD5   VGLL4   C9orf82 NCOA3   PPP2R5C RNF40   PHF20L1 PINX1   MINK1
hsa-miR-100     RAP1B   EPC2    ICMT    MBNL1   SMARCA5 C17orf85        FZD8    MBNL1   SLC35E1
hsa-miR-101     ZNF238  ARID5B  APPBP2  AGFG1   STMN1   G3BP2   QKI     G3BP2   SUB1
hsa-miR-103     FLCN    SMAD5   USP42   KPNA1   NDEL1   AGFG1   SNTB2   NPTN    IPO5
hsa-miR-105     APPBP2  TMEM30A CREB5   MAP4K3  CDKN1A  CLIP1   ATXN1   ZCCHC3  MORF4L2
hsa-miR-106a    ZBTB9   E2F1    CNOT4   SOX4    SMAD5   RAPGEFL1        OCRL    CFL2    CHD9
hsa-miR-106b    ZBTB9   E2F1    ZNF238  CNOT7   CNOT4   TBX3    SOX4    SMAD5   RAPGEFL1
hsa-miR-107     FLCN    SPRYD3  SMAD5   USP42   KPNA1   NDEL1   AGFG1   SNTB2   NPTN
hsa-miR-10a     ACVR2A  SOX4    APPBP2  LANCL1  HSPC159 PAPD5   RAP2A   LRP12   SLC38A2
hsa-miR-10b     ACVR2A  SOX4    APPBP2  LANCL1  AIM1L   HSPC159 PAPD5   RAP2A   LRP12
hsa-miR-1178    E2F1    G3BP2   CD164   PANK3   PPP3CB  SPTLC1  KCNG3   FAM98A  PGK1
hsa-miR-1179    PPTC7   CTNNB1  TMEM30A PAN3    PLEKHA1 TSPYL1  ZNF644  ARID2   CNIH
hsa-miR-1180    MKNK2   NPM1    CCDC6   MKNK2   VEGFB   HOXA10  SESN2   UTP15   KIAA1267
hsa-miR-1184    SMAD5   AGFG1   PTP4A2  SIX4    RAB1A   MAML1   PPP4R1  NR2C2   FMNL2
hsa-miR-1200    ACTL6A  FAM178A SORT1   ZCCHC3  PAPD5   C16orf63        MAP2K6  SNRPB   RASSF

These are just two obvious examples of data mining for pathway analysis, but there are many more, such as directly searching through journal articles. If you have come up with a cool way of mining gene sets, I would love to hear about it.

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?