Showing posts from November, 2018

Using the DEE2 bulk data dumps

The DEE2 project provides freely available bulk dumps of expression data for each of its nine species.
The data is organised as tab-separated values (tsv) in "long" format. This is different from a standard gene expression matrix - see below.

The long format is preferred for really huge datasets because, compared with a wide matrix, it can be readily indexed and converted to a database.
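To make the point about indexing concrete, here is a minimal Python sketch that loads long-format rows into an indexed SQLite table and then pivots a couple of datasets back into a wide matrix. It assumes three tab-separated columns ('dataset', 'gene', 'count') and no header row; the accessions, gene names and counts in the sample are made up for illustration, and the table and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows in the long layout (dataset, gene, count).
# The accessions, gene names and counts are invented for illustration.
LONG_TSV = (
    "SRR0000001\tgeneA\t10\n"
    "SRR0000001\tgeneB\t0\n"
    "SRR0000002\tgeneA\t7\n"
    "SRR0000002\tgeneB\t3\n"
)


def load_long_table(handle, conn):
    """Load long-format rows (assumed headerless) into an indexed table."""
    conn.execute("CREATE TABLE se (dataset TEXT, gene TEXT, count INTEGER)")
    rows = csv.reader(handle, delimiter="\t")
    conn.executemany("INSERT INTO se VALUES (?, ?, ?)", rows)
    # The index lets us pull out one dataset without scanning every row.
    conn.execute("CREATE INDEX se_dataset ON se (dataset)")
    conn.commit()


def wide_matrix(conn, datasets):
    """Pivot the chosen datasets into {gene: [count for each dataset]}."""
    matrix = {}
    query = "SELECT gene, count FROM se WHERE dataset = ?"
    for col, dataset in enumerate(datasets):
        for gene, count in conn.execute(query, (dataset,)):
            matrix.setdefault(gene, [0] * len(datasets))[col] = count
    return matrix


conn = sqlite3.connect(":memory:")
load_long_table(io.StringIO(LONG_TSV), conn)
matrix = wide_matrix(conn, ["SRR0000001", "SRR0000002"])
for gene, counts in sorted(matrix.items()):
    print(gene, counts)
```

For a real dump, the `io.StringIO` handle could be swapped for a file opened with `bz2.open(path, "rt")`, so the rows stream straight from the compressed file, and an on-disk database would be a better fit than `:memory:`.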
You'll notice that there are four files for each species, each with a different suffix. They are all compressed with bzip2. You can use pbzip2 to decompress them quickly.
*accessions.tsv.bz2: This is a list of runs included in DEE2, along with other SRA/GEO accession numbers.
*se.tsv.bz2: These are the STAR based gene expression counts in 'long format' tables with the columns: 'dataset', 'gene', 'count'.
*ke.tsv.bz2: These are the Kallisto estimated transcript expression counts, also in long format.
*qc.tsv.bz2: The QC metrics are also available in long …