InterPro Protein domain based gene set library

Recently I wanted to know whether genes with methyltransferase domains were upregulated in my dataset. As far as I know, this isn't currently captured in the major gene set databases. I dug into some older files of mine and found that InterPro protein domain information is included in Ensembl BioMart, and it's relatively straightforward to convert this to GMT format for pathway analysis.

TL;DR: Here is a link to the human GMT file for use in GSEA and other pathway analysis tools.
Please cite the recent InterPro paper if you use this in your research: Mitchell et al 2019, https://www.ncbi.nlm.nih.gov/pubmed/30398656
If you are interested in learning how it was made, read on.

Method

1- Obtaining InterPro data from Ensembl BioMart

Head to https://www.ensembl.org/biomart/martview and select the human database.

Then click "Attributes". Here you can select the bits of information you want. Select only the following attributes, ticking the boxes in the order shown:

  1. Gene stable ID
  2. Interpro ID (under "External")
  3. Interpro Short Description (under "Protein domains and families")
  4. Interpro Description
  5. HGNC symbol

Then click on the "Results" button in the top left of the page. The columns should appear in the order listed above; if they are in a different order, go back and try again. Export the results as a tab-separated (TSV) file, which is saved as mart_export.txt.
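Before reformatting, it can be worth checking that the export really has five tab-separated columns on every line. The following is a minimal sketch and not part of the original workflow; the function name check_mart_columns is made up for illustration, and it assumes the export keeps empty trailing fields (for genes with no HGNC symbol) as empty tab-separated columns.

```shell
#!/bin/bash
# Hypothetical helper: print the number of lines that do NOT have
# exactly 5 tab-separated fields (0 means the layout looks right).
check_mart_columns() {
  awk -F'\t' 'NF != 5 { bad++ } END { print bad + 0 }' "$1"
}

# Usage on the real export:
#   check_mart_columns mart_export.txt
#   head -3 mart_export.txt    # eyeball the column order
```

A non-zero count suggests the wrong attributes were selected or the file was not exported as TSV.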


2- Reformatting into GMT format

Copy the mart_export.txt file into the current working directory, then run the following bash script. The script has been saved to a Gist for posterity (https://gist.github.com/markziemann/3fc0c90e59c508c66067681a6c6dc3a1)
======================

#!/bin/bash

# This script creates a GMT file of genesets classified by protein domains
# First need to obtain some data from ensembl biomart
# Go to https://www.ensembl.org/biomart/martview/
# Select human database
# Select the following attributes:
# - Gene stable ID
# - Interpro ID
# - Interpro Short Description
# - Interpro Description
# - HGNC symbol


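DAT=mart_export.txt
# Loop over every unique InterPro ID (column 2), skipping the header line
for IPR in $(cut -f2 "$DAT" | sed 1d | sort -u) ; do
  # Gene set name: InterPro description (column 4) of the first matching row
  NAME=$(grep -wm1 "$IPR" "$DAT" | cut -f4)
  # Unique, non-empty HGNC symbols (column 5), joined by tabs
  GENES=$(grep -w "$IPR" "$DAT" | cut -f5 | sed '/^$/d' | sort -u | paste -s -)
  # Write the set only if at least one gene has a symbol
  [ -n "$GENES" ] && printf '%s\t%s\t%s\n' "$NAME" "$IPR" "$GENES"
done > ipr.gmt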
======================
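The loop above rescans mart_export.txt once per InterPro ID, which can be slow on a full export. As an alternative sketch (this is not the script saved in the Gist), a single awk pass can group the symbols by InterPro ID. The function name make_ipr_gmt is hypothetical, and the code assumes the same five-column order as described in step 1.

```shell
#!/bin/bash
# Hypothetical single-pass alternative. Assumed column order:
# 1 gene ID, 2 InterPro ID, 3 short description, 4 description, 5 HGNC symbol
make_ipr_gmt() {
  awk -F'\t' '
    NR == 1 || $2 == "" || $5 == "" { next }   # skip header and incomplete rows
    !(($2, $5) in seen) {                      # keep each ID/symbol pair once
      seen[$2, $5] = 1
      name[$2] = $4
      genes[$2] = genes[$2] "\t" $5
    }
    END { for (ipr in genes) print name[ipr] "\t" ipr genes[ipr] }
  ' "$1" | sort -t "$(printf '\t')" -k2,2      # order sets by InterPro ID
}

# Usage:
#   make_ipr_gmt mart_export.txt > ipr.gmt
```

One difference from the loop: symbols within a set appear in the order they occur in the export rather than alphabetically, which should not matter for enrichment tools.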

The result is exactly what I wanted:

$ head -4 ipr.gmt 
Kringle IPR000001 F12 F2 HABP2 HGF HGFAC KREMEN1 KREMEN2 LPA MST1 PIK3IP1 PLAT PLAU PLG PRSS12 ROR1 ROR2
Retinoid X receptor/HNF4 IPR000003 HNF4A NR2E3 RXRA RXRB RXRG
Metallothionein, vertebrate IPR000006 MT1A MT1B MT1E MT1F MT1G MT1H MT1HL1 MT1M MT1X MT2A MT3 MT4
Tubby, C-terminal IPR000007 TUB TULP1 TULP2 TULP3 TULP4
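Gene set sizes vary a lot between domains, and enrichment tools often filter sets by size, so a quick size summary can be handy. This is a small sketch; gmt_set_sizes is a hypothetical name, and it assumes the layout produced above (name, InterPro ID, then symbols, all tab-separated).

```shell
#!/bin/bash
# Hypothetical helper: print "InterPro ID <tab> number of genes" per set.
# The first two columns are the name and ID, so genes = NF - 2.
gmt_set_sizes() {
  awk -F'\t' '{ print $2 "\t" (NF - 2) }' "$1"
}

# Usage, largest sets first:
#   gmt_set_sizes ipr.gmt | sort -t "$(printf '\t')" -k2,2nr | head
```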

And there's a nice list of methyltransferases for me to use in downstream analysis:

S-adenosyl-L-methionine-dependent methyltransferase IPR029063 ALKBH8 AS3MT ASMT ASMTL ATPSCKMT BCDIN3D BMT2 BUD23 CAMKMT CARM1 CARNMT1 CIAPIN1 CMTR1 CMTR2 COMT COMTD1 COQ3 COQ5 CSKMT DIMT1 DNMT1 DNMT3A DNMT3B DOT1L ECE2 EEF1AKMT1 EEF1AKMT2 EEF1AKMT3 EEF1AKNMT EEF2KMT ETFBKMT FAM173A FAM86B1 FAM86B2 FASN FBL FBLL1 FTSJ1 FTSJ3 GAMT GNMT GSTCD HEMK1 HENMT1 HNMT INMT LCMT1 LCMT2 LRTOMT MEPCE METTL1 METTL11B METTL14 METTL15 METTL16 METTL17 METTL18 METTL21A METTL21C METTL22 METTL23 METTL25 METTL26 METTL27 METTL2A METTL2B METTL3 METTL4 METTL5 METTL6 METTL7A METTL7B METTL8 METTL9 MRM2 N6AMT1 NDUFAF5 NDUFAF7 NNMT NOP2 NSUN2 NSUN3 NSUN4 NSUN5 NSUN6 NSUN7 NTMT1 PCMT1 PCMTD1 PCMTD2 PNMT PRMT1 PRMT2 PRMT3 PRMT5 PRMT6 PRMT7 PRMT8 PRMT9 RNMT RRNAD1 RRP8 SMS SRM TFB1M TFB2M TGS1 THUMPD2 THUMPD3 TPMT TRDMT1 TRMT1 TRMT11 TRMT12 TRMT1L TRMT2A TRMT2B TRMT44 TRMT5 TRMT61A TRMT61B TRMT9B VCPKMT
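To pull a single gene set like this one out of ipr.gmt, for example for a tool that expects one symbol per line, something along these lines should work. It is only a sketch: the function name gmt_genes and the output file name in the usage comment are hypothetical, and it assumes the tab-separated GMT layout produced above.

```shell
#!/bin/bash
# Hypothetical helper: print the gene symbols of one set, one per line.
# $1 = InterPro ID (matched exactly against column 2), $2 = GMT file
gmt_genes() {
  awk -F'\t' -v id="$1" '$2 == id { for (i = 3; i <= NF; i++) print $i }' "$2"
}

# Usage:
#   gmt_genes IPR029063 ipr.gmt > methyltransferases.txt
```

Matching on the ID column rather than grepping the whole line avoids accidental hits in set names or gene symbols.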

Here is a link to the human GMT file so you can try it yourself.
