Using the DEE2 bulk data dumps
The DEE2 project provides freely available bulk dumps of expression data for each of its nine species.
The data are organised as tab-separated values (TSV) in "long" format, which differs from a standard gene expression matrix (see below).
The long format is preferred for very large datasets because, unlike the wide matrix format, it can be readily indexed and loaded into a database.
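To make the difference between the two layouts concrete, here is a minimal sketch in base R. The run and gene names are hypothetical and chosen only for illustration; the column names follow the 'dataset', 'gene', 'count' layout described on this page.

```r
# Toy long-format table: one row per (dataset, gene) pair.
# Run and gene names are hypothetical, for illustration only.
long <- data.frame(
  dataset = c("run1", "run1", "run2", "run2"),
  gene    = c("geneA", "geneB", "geneA", "geneB"),
  count   = c(10, 0, 7, 3),
  stringsAsFactors = FALSE
)

# Wide format: genes in rows, datasets in columns.
# tapply() groups the counts by gene and dataset; each cell holds one value here.
wide <- tapply(long$count, list(long$gene, long$dataset), sum)
print(wide)
```

In the long table, adding a new run only appends rows, whereas the wide matrix gains a column, which is part of why the long layout suits database storage.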
You'll notice that there are 4 files for each species, each with a different suffix. They are all compressed with bzip2, and you can use pbzip2 to decompress them quickly.
*se.tsv.bz2: STAR-based gene expression counts in long-format tables with the columns 'dataset', 'gene', 'count'.
*ke.tsv.bz2: Kallisto-estimated transcript expression counts, also in long format.
*qc.tsv.bz2: QC metrics, also in long format, with the columns 'dataset', 'QC metric type', 'QC metric result'.
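As an alternative to decompressing on disk, R can read bzip2-compressed text through a bzfile() connection. The sketch below writes a small toy file first so that it runs on its own; the file name and contents are hypothetical stand-ins for a real dump. For the full-size files this route may be slower than decompressing with pbzip2 beforehand.

```r
# Build a small toy table (hypothetical stand-in for a DEE2 dump).
toy <- data.frame(
  dataset = c("run1", "run1", "run2"),
  gene    = c("geneA", "geneB", "geneA"),
  count   = c(10, 0, 7)
)

# Write it as a bzip2-compressed, headerless TSV in a temporary directory.
path <- file.path(tempdir(), "toy_se.tsv.bz2")
con <- bzfile(path, open = "w")
write.table(toy, con, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
close(con)

# Read it back directly from the compressed file.
# read.table() opens and closes the bzfile() connection itself.
x <- read.table(bzfile(path), sep = "\t", header = FALSE,
                stringsAsFactors = FALSE)
print(x)
```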
Once downloaded and decompressed, we can start working with it in R.
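# Load the reshape2 package, which provides acast()
library(reshape2)
# Load the decompressed table (no header row: V1 = dataset, V2 = gene, V3 = count)
x <- read.table("ecoli_se.tsv", sep = "\t", header = FALSE, stringsAsFactors = FALSE)
# Randomly select nine run names (your selection will differ from the output shown)
runs <- sample(unique(x$V1), 9)
# Subset the table to the selected runs
y <- x[x$V1 %in% runs, ]
# Use acast to transform the data from long format to wide (genes in rows, runs in columns)
z <- acast(y, V2 ~ V1, value.var = "V3")
head(z)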
      ERR862937 SRR1793913 SRR2086965 SRR2099922 SRR4434025 SRR5115646
b0001       200         91         40       6292          8        838
b0002       921        167        822       1474        641       3745
b0003       446         64        216        294        287       2437
b0004       832         65        333        957        462       2962
b0005       105          6         84         23          1        324
b0006      2415         38        212        387          3       1443
      SRR5130628 SRR6001772 SRR6020052
b0001         21       5613       2352
b0002          4       5709       1618
b0003          0       1985        839
b0004          0       4250       1440
b0005          1        727         25
b0006          0       1435       1797
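The example above loads the whole table into memory before subsetting, which may not be practical for the larger species. One possible workaround, sketched here under assumptions, is to read the file in chunks and keep only the rows for the runs of interest. The helper read_runs() and its chunk_size parameter are hypothetical names introduced for this illustration, and the toy file stands in for a real decompressed dump.

```r
# Hypothetical helper: read a headerless long-format TSV in chunks and
# keep only the rows whose first column (dataset) is in `runs`.
read_runs <- function(path, runs, chunk_size = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                        quote = "", comment.char = "",
                        colClasses = c("character", "character", "numeric"))
    kept[[length(kept) + 1]] <- chunk[chunk$V1 %in% runs, ]
  }
  do.call(rbind, kept)
}

# Toy input file with hypothetical run and gene names.
toy_path <- file.path(tempdir(), "toy_se.tsv")
writeLines(c("run1\tgeneA\t10", "run1\tgeneB\t0",
             "run2\tgeneA\t7",  "run2\tgeneB\t3",
             "run3\tgeneA\t1"), toy_path)

# A tiny chunk size is used here only to exercise the loop.
y <- read_runs(toy_path, runs = c("run1", "run3"), chunk_size = 2)
print(y)
```

The resulting data frame has the same V1, V2, V3 columns as before, so it can be passed to acast() in the same way.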