Using the DEE2 bulk data dumps

November 26, 2018

The DEE2 project makes bulk dumps of expression data freely available for each of its nine species.

The data is organised as tab-separated values (tsv) in "long" format, with one row per combination of dataset and gene. This is different to a standard gene expression matrix, which has one row per gene and one column per dataset - see below.

The long format is preferred for really huge datasets because it can be readily indexed and converted to a database, which is harder to do with a wide matrix.
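To make the difference between the two layouts concrete, here is a minimal sketch in base R. The run names, gene names and counts are made up for illustration and do not come from DEE2.

```r
# Toy long-format table: one row per dataset/gene pair (hypothetical values)
long <- data.frame(
  dataset = c("run1", "run1", "run2", "run2"),
  gene    = c("geneA", "geneB", "geneA", "geneB"),
  count   = c(10, 0, 7, 3),
  stringsAsFactors = FALSE
)

# The same information as a wide matrix: genes as rows, runs as columns.
# Each dataset/gene pair occurs once, so sum() just passes the value through.
wide <- tapply(long$count, list(long$gene, long$dataset), sum)

print(long)
print(wide)
```

The long table grows by rows as runs are added, while the wide matrix grows by columns, which is why the long layout is easier to append to and to index.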

You'll notice that there are four files for each species, each with a different suffix. They are all compressed with bz2. You can use pbzip2 to decompress them quickly.

*accessions.tsv.bz2: A list of the runs included in DEE2, along with other SRA/GEO accession numbers.

*se.tsv.bz2: The STAR-based gene expression counts in long format, with the columns 'dataset', 'gene', 'count'.

*ke.tsv.bz2: The Kallisto-estimated transcript expression counts, also in long format.

*qc.tsv.bz2: The QC metrics, also in long format, with the columns 'dataset', 'QC metric type', 'QC metric result'.
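As an alternative to decompressing first, R's file connections detect bzip2 compression when reading, so read.table() can be pointed at a .bz2 file directly. The sketch below is only illustrative: read_dee2_long is a hypothetical helper name, the column names follow the description of the se files above, and it assumes the file has no header row. The demonstration uses a tiny made-up file rather than a real DEE2 download.

```r
# Hypothetical helper: read a DEE2 long-format gene expression table.
# Assumes three tab-separated columns and no header row.
read_dee2_long <- function(path) {
  read.table(path, sep = "\t", header = FALSE,
             col.names = c("dataset", "gene", "count"),
             stringsAsFactors = FALSE)
}

# Write a tiny made-up bz2-compressed file to a temporary location
tmp <- tempfile(fileext = ".tsv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("run1\tgeneA\t10", "run1\tgeneB\t0", "run2\tgeneA\t7"), con)
close(con)

# Read it back without decompressing it on disk
se <- read_dee2_long(tmp)
str(se)
```

For the full-size files, decompressing once with pbzip2 is likely to be faster than having R decompress on every read.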

Once the files are downloaded and decompressed, we can start working with the data in R.


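# Load the data. The file has no header row, so the columns are
# named V1 (dataset), V2 (gene) and V3 (count)
x <- read.table("ecoli_se.tsv", sep="\t", stringsAsFactors=FALSE)

# Randomly select nine run names
runs <- sample(unique(x$V1), 9)

# Subset based on the run names
y <- x[x$V1 %in% runs, ]

# Use acast to transform the data from long format to wide,
# with genes as rows and runs as columns
library(reshape2)
z <- acast(y, V2 ~ V1, value.var="V3")

head(z)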

      ERR862937 SRR1793913 SRR2086965 SRR2099922 SRR4434025 SRR5115646
b0001       200         91         40       6292          8        838
b0002       921        167        822       1474        641       3745
b0003       446         64        216        294        287       2437
b0004       832         65        333        957        462       2962
b0005       105          6         84         23          1        324
b0006      2415         38        212        387          3       1443
      SRR5130628 SRR6001772 SRR6020052
b0001         21       5613       2352
b0002          4       5709       1618
b0003          0       1985        839
b0004          0       4250       1440
b0005          1        727         25
b0006          0       1435       1797
