Using the DEE2 bulk data dumps

The DEE2 project provides freely available bulk dumps of expression data for each of its nine species.

The data are organised as tab-separated values (TSV) in "long" format. This differs from a standard (wide) gene expression matrix; see below.
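To make the difference concrete, here is a minimal sketch using a made-up table. The run names, gene names and counts are hypothetical and not taken from DEE2. It shows the same four values first in long format and then as a wide matrix, built with base R's xtabs().

```r
# Toy long-format table: one row per (dataset, gene) pair.
# All values here are invented for illustration.
long <- data.frame(
  dataset = c("run1", "run1", "run2", "run2"),
  gene    = c("geneA", "geneB", "geneA", "geneB"),
  count   = c(10, 0, 7, 3)
)
print(long)

# The same information in wide format: genes as rows, runs as columns.
wide <- xtabs(count ~ gene + dataset, data = long)
print(wide)
```

In long format every row is self-describing, so rows for new runs can simply be appended without changing the table's shape.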


The long format is preferred for very large datasets because, unlike a wide matrix, it can be readily indexed and loaded into a database.

You'll notice that there are four files for each species, each with a different suffix. They are all compressed with bzip2. You can use pbzip2 to decompress them quickly.
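As a rough illustration of the database point, the sketch below loads a toy long-format table into an in-memory SQLite database, indexes it by run, and pulls out a single run. It assumes the DBI and RSQLite packages are installed. The table name, index name and data are all hypothetical, and this is just one possible approach rather than anything DEE2 itself prescribes.

```r
library(DBI)

# Toy long-format table (invented values, not real DEE2 data)
long <- data.frame(
  dataset = c("run1", "run1", "run2", "run2"),
  gene    = c("geneA", "geneB", "geneA", "geneB"),
  count   = c(10, 0, 7, 3)
)

# In-memory SQLite database; a file path could be used instead
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "se", long)

# Index on the run column so that per-run lookups do not scan every row
dbExecute(con, "CREATE INDEX idx_se_dataset ON se (dataset)")

# Fetch all counts for a single run
res <- dbGetQuery(con,
  'SELECT gene, "count" FROM se WHERE dataset = ?',
  params = list("run2"))
print(res)

dbDisconnect(con)
```

A wide matrix would need a new column for every run added, whereas the long table only ever grows by rows, which is what makes this kind of indexing straightforward.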

*accessions.tsv.bz2 This is a list of the runs included in DEE2, along with their other SRA/GEO accession numbers.

*se.tsv.bz2 These are the STAR-based gene expression counts in long-format tables with the columns 'dataset', 'gene', 'count'.

*ke.tsv.bz2 These are the Kallisto-estimated transcript expression counts, also in long format.

*qc.tsv.bz2 These are the QC metrics, also in long format, with the columns 'dataset', 'QC metric type', 'QC metric result'.

Once downloaded and decompressed, we can start working with it in R.
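If disk space is tight, decompressing first is not strictly necessary, because R can read bzip2 files through a bzfile() connection. The helper below is a small sketch of that idea. The function name read_dee2_long is made up for this post, and it assumes the file has three tab-separated columns and no header row, as described above for the *se.tsv.bz2 files.

```r
# Hypothetical helper: read a DEE2 long-format table, compressed or not.
# Assumes three tab-separated columns and no header line.
read_dee2_long <- function(path) {
  # Use a bzip2 connection when the file name ends in .bz2
  con <- if (grepl("\\.bz2$", path)) bzfile(path) else file(path)
  # read.table opens the connection and closes it again when finished
  read.table(con, sep = "\t", header = FALSE,
             quote = "", comment.char = "",
             col.names = c("dataset", "gene", "count"),
             stringsAsFactors = FALSE)
}

# Example call (needs the file to be present in the working directory):
# x <- read_dee2_long("ecoli_se.tsv.bz2")
```

Note that this names the columns dataset, gene and count, whereas the walkthrough that follows uses R's default names V1, V2 and V3. Reading through bzfile() is single-threaded, so for the biggest files decompressing once with pbzip2 may still be quicker overall.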


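# Load the long-format table (tab separated, no header row)
# Columns: V1 = dataset (run), V2 = gene, V3 = count
x <- read.table("ecoli_se.tsv", sep="\t", header=FALSE, stringsAsFactors=FALSE)

# Randomly select nine run names
runs <- sample(unique(x$V1), 9)

# Subset the table to the selected runs
y <- x[x$V1 %in% runs, ]

# Use acast to transform the data from long format to wide
# (genes as rows, runs as columns); acast already returns a matrix
library(reshape2)
z <- acast(y, V2 ~ V1, value.var="V3")

head(z)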
      ERR862937 SRR1793913 SRR2086965 SRR2099922 SRR4434025 SRR5115646
b0001       200         91         40       6292          8        838
b0002       921        167        822       1474        641       3745
b0003       446         64        216        294        287       2437
b0004       832         65        333        957        462       2962
b0005       105          6         84         23          1        324
b0006      2415         38        212        387          3       1443
      SRR5130628 SRR6001772 SRR6020052
b0001         21       5613       2352
b0002          4       5709       1618
b0003          0       1985        839
b0004          0       4250       1440
b0005          1        727         25
b0006          0       1435       1797
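The example above loads the whole table before subsetting. If the full table is too large to load at once, one option is to stream the file in chunks and keep only the rows for the runs of interest. The sketch below shows that approach. The function name filter_runs and the chunk size are arbitrary choices made for this post, and it again assumes three tab-separated columns with no header.

```r
# Hypothetical helper: stream a long-format file and keep selected runs only.
filter_runs <- function(path, runs, chunk_lines = 100000) {
  # Open the file for reading, via bzip2 if the name ends in .bz2
  con <- if (grepl("\\.bz2$", path)) bzfile(path, "r") else file(path, "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    # Read the next block of lines; an empty result means end of file
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break
    chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                        quote = "", comment.char = "",
                        col.names = c("dataset", "gene", "count"),
                        stringsAsFactors = FALSE)
    # Keep only the rows belonging to the requested runs
    kept[[length(kept) + 1]] <- chunk[chunk$dataset %in% runs, ]
  }
  do.call(rbind, kept)
}

# Example call (needs the file to be present in the working directory):
# y <- filter_runs("ecoli_se.tsv.bz2", c("SRR1793913", "SRR2086965"))
```

The result has the same long layout as the subset y above, so it can be passed to acast in the same way, using the column names dataset, gene and count in place of V1, V2 and V3.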
