Upset plots as a replacement to Venn Diagram

I previously posted about different ways to obtain Venn diagrams, but what if you have more than 4 lists to intersect? These plots become messy and not easy to read. One alternative which has become popular is the upset plot. There is an excellent summary of the philosophy behind this approach in this article and academic paper here. An example plot is below:


Example UpSet plot (Source: http://caleydo.org/tools/upset/)

In this post, I'll describe how to get from lists of genes in text files and present it as an UpSet plot using R. As with most R packages, you'll find that loading in the data is the hardest part, and that data import is the least documented aspect.

First I'll generate some random gene lists using a quick and dirty shell script. My complete list contains 58302 genes and looks like this:

$ head -5 Homo_sapiens.GRCh38.90.gnames.txt
ENSG00000000003_TSPAN6
ENSG00000000005_TNMD
ENSG00000000419_DPM1
ENSG00000000457_SCYL3
ENSG00000000460_C1orf112

This is the script which generates random subsets of genes with the suffix "genelist.txt":

#!/bin/bash
GENELIST=Homo_sapiens.GRCh38.90.gnames.txt

for i in {1..6} ; do 
  NUM=$(echo $RANDOM | awk '{printf "%i\n", $1/10}')
  shuf $GENELIST | head -$NUM > list_${i}_genelist.txt
done

We can see that we've generated 6 lists of different sizes, which could be results of 6 experiments:

$ wc -l *list.txt
  2003 list_1_genelist.txt
  2633 list_2_genelist.txt
  2972 list_3_genelist.txt
  2952 list_4_genelist.txt
   133 list_5_genelist.txt
  3029 list_6_genelist.txt
 13722 total

Now we're ready to load into R. The text below can be pasted into R shell or even better, save to text file called "upset_script.R" and then execute with "Rscript upset_script.R"

# load libraries
library(plyr)
library(reshape2)
library(UpSetR)

# make a list of input files to be read
filelist = list.files(pattern = "*list.txt")

# make a 3 column table of listname,gene,1
res<-lapply(filelist, function(x){
data.frame(
set=x,
geneID=as.character(read.table(x)[,1]),
val=1)
})

res<-ldply(res)

# turn the 3 column long table into a wide
res1<-acast(res,geneID~set,value.var="val",fill=0) 

# force as dataframe
res1<-as.data.frame(res1)

# 1st column must be a name
res1$name=rownames(res1)

# rearrange columns
res2=res1[,c(ncol(res1),1:(ncol(res1)-1))]

pdf("gene_intersections.pdf")
#make the plot with 6 sets
upset(res2,nsets = 6)
dev.off()

The end result is excellent! We see that the majority of genes exist in just 1 set.
Example of UpSet plot of random gene lists
Thanks to Antony Kaspi for writing code for the data import.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?