Upset plots as a replacement to Venn Diagram

September 26, 2017

I previously posted about different ways to obtain Venn diagrams, but what if you have more than 4 lists to intersect? These plots become messy and not easy to read. One alternative which has become popular is the upset plot. There is an excellent summary of the philosophy behind this approach in this article and academic paper here. An example plot is below:

Example UpSet plot (Source: http://caleydo.org/tools/upset/)

In this post, I'll describe how to get from lists of genes in text files and present it as an UpSet plot using R. As with most R packages, you'll find that loading in the data is the hardest part, and that data import is the least documented aspect.

First I'll generate some random gene lists using a quick and dirty shell script. My complete list contains 58302 genes and looks like this:



$ head -5 Homo_sapiens.GRCh38.90.gnames.txt

ENSG00000000003_TSPAN6

ENSG00000000005_TNMD

ENSG00000000419_DPM1

ENSG00000000457_SCYL3

ENSG00000000460_C1orf112

This is the script which generates random subsets of genes with the suffix "genelist.txt":



#!/bin/bash

GENELIST=Homo_sapiens.GRCh38.90.gnames.txt




for i in {1..6} ; do 

  NUM=$(echo $RANDOM | awk '{printf "%i\n", $1/10}')

  shuf $GENELIST | head -$NUM > list_${i}_genelist.txt

done

We can see that we've generated 6 lists of different sizes, which could be results of 6 experiments:




$ wc -l *list.txt

  2003 list_1_genelist.txt

  2633 list_2_genelist.txt

  2972 list_3_genelist.txt

  2952 list_4_genelist.txt

   133 list_5_genelist.txt

  3029 list_6_genelist.txt

 13722 total

Now we're ready to load into R. The text below can be pasted into R shell or even better, save to text file called "upset_script.R" and then execute with "Rscript upset_script.R"



# load libraries


library(plyr)

library(reshape2)

library(UpSetR)




# make a list of input files to be read

filelist = list.files(pattern = "*list.txt")




# make a 3 column table of listname,gene,1

res<-lapply(filelist, function(x){

 data.frame(

  set=x,

  geneID=as.character(read.table(x)[,1]),

  val=1)

})




res<-ldply(res)




# turn the 3 column long table into a wide

res1<-acast(res,geneID~set,value.var="val",fill=0) 




# force as dataframe

res1<-as.data.frame(res1)




# 1st column must be a name

res1$name=rownames(res1)




# rearrange columns

res2=res1[,c(ncol(res1),1:(ncol(res1)-1))]




pdf("gene_intersections.pdf")

#make the plot with 6 sets

upset(res2,nsets = 6)

dev.off()

The end result is excellent! We see that the majority of genes exist in just 1 set.

Example of UpSet plot of random gene lists

Thanks to Antony Kaspi for writing code for the data import.

Search This Blog

Genome Spot

Upset plots as a replacement to Venn Diagram

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

My personal thoughts on gene name errors