Posts

Showing posts with the label gene sets

Upset plots as a replacement to Venn Diagram

Image
I previously posted about different ways to obtain Venn diagrams, but what if you have more than 4 lists to intersect? These plots become messy and not easy to read. One alternative which has become popular is the upset plot. There is an excellent summary of the philosophy behind this approach in this article and academic paper here. An example plot is below:



In this post, I'll describe how to get from lists of genes in text files and present it as an UpSet plot using R. As with most R packages, you'll find that loading in the data is the hardest part, and that data import is the least documented aspect.

First I'll generate some random gene lists using a quick and dirty shell script. My complete list contains 58302 genes and looks like this:
$ head -5 Homo_sapiens.GRCh38.90.gnames.txt ENSG00000000003_TSPAN6 ENSG00000000005_TNMD ENSG00000000419_DPM1 ENSG00000000457_SCYL3 ENSG00000000460_C1orf112
This is the script which generates random subsets of genes with the suffix &quo…

Introducing the ENCODE Gene Set Hub

Image
TL;DR We curated a bunch of ENCODE data into gene sets that is super useful in pathway analysis (ie GSEA).
Link to gene sets and data: https://sourceforge.net/projects/encodegenesethub/
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets, this is really cool as you can start to generate gene sets based on your own profiling work and that of others.

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated  from ENCODE transcription factor binding site (TFBS) data to understand which TFs were drivi…

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse so I thought I would share.

GO.v5.2.symbols_mouse.gmt
kegg.v5.2.symbols_mouse.gmt
msigdb.v5.2.symbols_mouse.gmt
reactome.v5.2.symbols_mouse.gmt

Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human gmt files. It requres gnu parallel which can be easily installed on Ubuntu with "sudo apt-get install parallel"


#!/bin/bash

conv(){
line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u…

Are we ready to move beyond MSigDB and start a community-based gene set resource?

Image
Gene sets are distilled information about molecular profiling experiments and can generated based on other features shared by groups of genes such as chromosomal position, sequence, co-regulation, functional information, etc.

These are a valuable resource because they suggest similarities between different molecular profiling experiments or phenomona and lead researchers into understanding the factors that drive the trends in profiling experiments such as gene expression assays by microarray or RNA-seq.

To truly grasp the importance of quality gene sets, consider that the original paper describing the GSEA algorithm has accumulated 3144 citations since 2003, while the paper describing the software and wider applicability of GSEA has 7166 citations. The latter paper has also attracted positive comments from experts in the field on PubMed, here is one that I couldn't agree with more. In the words of Rafael Irizarry, "The idea of analyzing differential expression for groups of g…

A Biomarker Gene Set From ENCODE Expression Data

Image
MSigDB contains thousands of gene sets which have been mined from a range of genome wide studies and these are a valuable resource for gene ontology and pathway analysis. You probably know that many gene sets are curated by KEGG, REACTOME and BIOCARTA - as well as dry lab scientists who specialise in analysing these data sets and curating gene sets. What you may not know is that if you follow a few basic guidelines, you can start generating your own custom gene sets and these can become a valuable resource for running gene ontology and Gene Set Enrichment Analysis (as per the graphic)


For instance within our lab, we have extensively used ENCODE ChIP-Seq data to help us to analyse our mRNA-seq data and this has provided a huge leg-up in generating hypothesis and designing follow-up experiments. For this example, I want to show an overview of how I made biomarker gene sets for a bunch of cell types analysed by ENCODE. Biomarker gene sets can useful in array or mRNA-Seq analysis that you…