A Biomarker Gene Set From ENCODE Expression Data

MSigDB contains thousands of gene sets which have been mined from a range of genome wide studies and these are a valuable resource for gene ontology and pathway analysis. You probably know that many gene sets are curated by KEGG, REACTOME and BIOCARTA - as well as dry lab scientists who specialise in analysing these data sets and curating gene sets. What you may not know is that if you follow a few basic guidelines, you can start generating your own custom gene sets and these can become a valuable resource for running gene ontology and Gene Set Enrichment Analysis (as per the graphic)


For instance within our lab, we have extensively used ENCODE ChIP-Seq data to help us to analyse our mRNA-seq data and this has provided a huge leg-up in generating hypothesis and designing follow-up experiments. For this example, I want to show an overview of how I made biomarker gene sets for a bunch of cell types analysed by ENCODE. Biomarker gene sets can useful in array or mRNA-Seq analysis that you can gauge the effect of a treatment or stimulus on a certain cell characteristic - such as "stemness", "neuronal" and the like. In our lab, we wanted to look at the effect of certain methytransferase inhibitor drugs on cell identity and endothelial cell functions specifically using mRNA-Seq.

I downloaded the transcriptome data in bam format listed here using unix wget then used bedtools multicov to extract out exon counts (Ensembl GTF) which were then aggregated with a perl script to get a gene-wise read counts for which the sample replicates were aggregated. The data were the assembled into a larger spreadsheet containing all Ensembl genes which was used to normalise for library size.  The average expression level was calculated for each gene across all cell types and at this point I omitted genes with low expression across the board, retaining about 40 thousand detected genes. Then I calculated the ratio between then cell type's expression score and the average (I call this figure the biomarker ratio). Large numbers indicate that the expression of this gene is higher in the particular cell type compared to the average across all cell types. For each cell type, I sorted the genes by their biomarker ratio and grabbed the top 1000 gene names and formatted them into a gmt format with each cell type now having 1000 biomarkers. If 1000 genes is too large a set for your liking, you can shrink the set as they are ranked in left to right order.

Check out the result here.

If you have any novel ideas for gene sets, I'd love to hear it!

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?