Generating a custom gmt file for gene set analysis
Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG, Reactome, GO, MSigDB and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt. I'm going to take you through two examples of generating custom gene sets:
do
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6>0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_UP\tNA\t/" \
| sed 's/$/\n/' > ${DGE}.gmt
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6<0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_DN\tNA\t/" \
| sed 's/$/\n/' >> ${DGE}.gmt
done
You can run the below script and it will generate a gmt containing sets of genes which are targeted by specific microRNAs. You will need to check the name of the starbase csv file.
STARBASE=starBase_Human_miRNA-target_interactions.csv
for MIR in `sed -n '1!p' $STARBASE | cut -d ',' -f1 | sort -u `
do
echo $MIRgrep -w $MIR $STARBASE | cut -d ',' -f2
done | tr '\n' '\t' | sed 's/hsa-miR/>hsa-miR/g' \
| tr '>' '\n' > Starbase_Human.gmt
The result from the first few lines and columns looks good and is ready for GSEA.
These are just two obvious examples of data mining for pathway analysis, but there are many more, such as directly searching through journal articles. If you have come up with a cool way of mining gene sets, I would love to hear about it.
Generate gene sets from published data sets using GEO2R
Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med. Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma but this is embedded in the webpage and has a GUI which makes it very accessible for biologists.
Click this link to go directly to the GEO2R analysis page for this data set. You will see that there are four sample groups STAT1 knock-down irradiated, STAT1 knock-down untreated, Control vector irradiated and Control vector untreated. For this example, select the Control vector untreated samples, click define groups, enter in "Control" and then select the STAT1 knock-down untreated, click define groups, enter in "STAT1KD" as I have done in the below image.
Then click on the "Top250" button and this will kick off the comparison of the samples and show the top 250 differential array probes. You can then click the "save all results" button and save to a file. You might notice that some probes have multiple associated genes, in which case we can include both/all the gene names. To get to a gmt format, we need the gene list to be organised horizontally with the two left-most columns set aside for metadata. Normally the first column is the gene set name and the second is a link to a paper or GEO accession. In our case to make it streamlined, I've included some metadata (ie, the GEO accession) in the gene set name, while leaving only "NA" in the second column.
In this example, because the adjusted p-values aren't that great, I've selected a non-adjusted p-value threshold of 0.001 which gives us a few hundred genes in either direction if you run the script below. You may also want to filter on fold change if that is part of your standard pipeline. If you don't like using command line scripts, then it is still possible to make a gmt file manually using spreadsheet filtering and transposing.
for DGE in STAT1-Knockdown_GSE15845.xlsIn this example, because the adjusted p-values aren't that great, I've selected a non-adjusted p-value threshold of 0.001 which gives us a few hundred genes in either direction if you run the script below. You may also want to filter on fold change if that is part of your standard pipeline. If you don't like using command line scripts, then it is still possible to make a gmt file manually using spreadsheet filtering and transposing.
do
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6>0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_UP\tNA\t/" \
| sed 's/$/\n/' > ${DGE}.gmt
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6<0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_DN\tNA\t/" \
| sed 's/$/\n/' >> ${DGE}.gmt
done
Generate gene sets of human microRNA-mRNA targets from StarBase
StarBase is a repository for CLIP-Seq and degradome seq and provides a database for potential microRNA-mRNA target information and is a useful resource for predicting the function of microRNAs. If you go to the StarBase website, go to the miRNA-target relationship page. Leave all the pull-down menus blank and hit the "Search" button and it will bring up the complete StarBase microRNA-mRNA target set. Hit the "export as excel" button and it will be downloaded.You can run the below script and it will generate a gmt containing sets of genes which are targeted by specific microRNAs. You will need to check the name of the starbase csv file.
STARBASE=starBase_Human_miRNA-target_interactions.csv
for MIR in `sed -n '1!p' $STARBASE | cut -d ',' -f1 | sort -u `
do
echo $MIRgrep -w $MIR $STARBASE | cut -d ',' -f2
done | tr '\n' '\t' | sed 's/hsa-miR/>hsa-miR/g' \
| tr '>' '\n' > Starbase_Human.gmt
The result from the first few lines and columns looks good and is ready for GSEA.
hsa-let-7a SLC45A4 CNOT4 SMCR7L APPBP2 CFL2 KLF9
hsa-miR-1 SMAD5 VGLL4 C9orf82 NCOA3 PPP2R5C RNF40 PHF20L1 PINX1 MINK1
hsa-miR-100 RAP1B EPC2 ICMT MBNL1 SMARCA5 C17orf85 FZD8 MBNL1 SLC35E1
hsa-miR-101 ZNF238 ARID5B APPBP2 AGFG1 STMN1 G3BP2 QKI G3BP2 SUB1
hsa-miR-103 FLCN SMAD5 USP42 KPNA1 NDEL1 AGFG1 SNTB2 NPTN IPO5
hsa-miR-105 APPBP2 TMEM30A CREB5 MAP4K3 CDKN1A CLIP1 ATXN1 ZCCHC3 MORF4L2
hsa-miR-106a ZBTB9 E2F1 CNOT4 SOX4 SMAD5 RAPGEFL1 OCRL CFL2 CHD9
hsa-miR-106b ZBTB9 E2F1 ZNF238 CNOT7 CNOT4 TBX3 SOX4 SMAD5 RAPGEFL1
hsa-miR-107 FLCN SPRYD3 SMAD5 USP42 KPNA1 NDEL1 AGFG1 SNTB2 NPTN
hsa-miR-10a ACVR2A SOX4 APPBP2 LANCL1 HSPC159 PAPD5 RAP2A LRP12 SLC38A2
hsa-miR-10b ACVR2A SOX4 APPBP2 LANCL1 AIM1L HSPC159 PAPD5 RAP2A LRP12
hsa-miR-1178 E2F1 G3BP2 CD164 PANK3 PPP3CB SPTLC1 KCNG3 FAM98A PGK1
hsa-miR-1179 PPTC7 CTNNB1 TMEM30A PAN3 PLEKHA1 TSPYL1 ZNF644 ARID2 CNIH
hsa-miR-1180 MKNK2 NPM1 CCDC6 MKNK2 VEGFB HOXA10 SESN2 UTP15 KIAA1267
hsa-miR-1184 SMAD5 AGFG1 PTP4A2 SIX4 RAB1A MAML1 PPP4R1 NR2C2 FMNL2
hsa-miR-1200 ACTL6A FAM178A SORT1 ZCCHC3 PAPD5 C16orf63 MAP2K6 SNRPB RASSF
These are just two obvious examples of data mining for pathway analysis, but there are many more, such as directly searching through journal articles. If you have come up with a cool way of mining gene sets, I would love to hear about it.