Showing posts with the label data mining

Using the DEE2 bulk data dumps

The DEE2 project makes bulk dumps of expression data freely available for each of its nine species.
The data is organised as tab-separated values (tsv) in "long" format, with one row per measurement. This is different to a standard gene expression matrix - see below.

The long format is preferred for really huge datasets because, compared to the wide matrix format, it can be readily indexed and converted to a database.
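To illustrate why long format suits a database, here is a minimal sketch that streams a bz2-compressed long-format file into an indexed SQLite table using only the Python standard library. It assumes the three columns 'dataset', 'gene', 'count' described for the *se.tsv.bz2 files below, and it assumes the file has no header row; the function name, table name and index name are my own and are not part of DEE2.

```python
import bz2
import csv
import sqlite3


def load_long_tsv(tsv_bz2_path, db_path=":memory:"):
    """Load a bz2-compressed long-format TSV into an indexed SQLite table.

    Assumes three tab-separated columns (dataset, gene, count) and no header.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS expr (dataset TEXT, gene TEXT, count REAL)"
    )
    # bz2.open in text mode decompresses on the fly, so the dump never
    # needs to be fully expanded on disk or held in memory.
    with bz2.open(tsv_bz2_path, "rt", newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        con.executemany("INSERT INTO expr VALUES (?, ?, ?)", rows)
    # Index on the run accession so that per-run lookups avoid a full scan.
    con.execute("CREATE INDEX IF NOT EXISTS idx_dataset ON expr (dataset)")
    con.commit()
    return con


# Usage sketch (file name and accession are placeholders):
# con = load_long_tsv("species_se.tsv.bz2", "dee2.sqlite")
# con.execute("SELECT gene, count FROM expr WHERE dataset = ?",
#             ("SRR0000000",)).fetchall()
```

Each row is self-describing, so new runs can be appended without reshaping the table, which is the main reason a wide matrix is harder to manage at this scale.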
You'll notice that there are four files for each species, each with a different suffix. They are all compressed with bz2; you can use pbzip2 to decompress them quickly.

*accessions.tsv.bz2: A list of runs included in DEE2, along with other SRA/GEO accession numbers.
*se.tsv.bz2: The STAR-based gene expression counts in 'long format' tables with the columns 'dataset', 'gene', 'count'.
*ke.tsv.bz2: The Kallisto-estimated transcript expression counts, also in long format.
*qc.tsv.bz2: The QC metrics, also available in long …
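If you only need a handful of runs as a conventional gene-by-run matrix, the long format can be pivoted while streaming the compressed file. The sketch below is one way to do that with the Python standard library. It assumes the 'dataset', 'gene', 'count' column order given above for the se files and no header row; the function names and the "NA" fill value are my own choices.

```python
import bz2
import csv
from collections import defaultdict


def long_to_wide(tsv_bz2_path, wanted_datasets):
    """Collect counts for the selected runs as {gene: {dataset: count}}."""
    wanted = set(wanted_datasets)
    matrix = defaultdict(dict)
    with bz2.open(tsv_bz2_path, "rt", newline="") as handle:
        for dataset, gene, count in csv.reader(handle, delimiter="\t"):
            # Skip every run that was not asked for, so memory use
            # scales with the selection rather than the whole dump.
            if dataset in wanted:
                matrix[gene][dataset] = count
    return matrix


def write_wide(matrix, datasets, out_path):
    """Write a TSV with one row per gene and one column per run."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene"] + list(datasets))
        for gene in sorted(matrix):
            # Gene/run pairs absent from the dump are written as "NA".
            writer.writerow([gene] + [matrix[gene].get(d, "NA") for d in datasets])


# Usage sketch (file name and accessions are placeholders):
# runs = ["SRR0000001", "SRR0000002"]
# write_wide(long_to_wide("species_se.tsv.bz2", runs), runs, "wide.tsv")
```

Counts are kept as strings because they are only copied from one file to another; convert them to numbers if you want to do arithmetic on them in Python.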

SRA toolkit tips and workarounds

The Sequence Read Archive (SRA, formerly the Short Read Archive) is the main repository for raw next-generation sequencing (NGS) data. Considering the sheer, and accelerating, rate at which NGS data is generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:

SRA, however, doesn't directly provide the fastq files that we commonly work with; it prefers the .sra archive format, which requires specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. In this post, I'll share some of the best sra-toolkit tips that I've found.

Get the right version of the software and configure it

When downloading, make sure you download the newest version from the NCBI website (link). Don't download it from GitHub or from the Ubuntu Software Centre (or apt-get), as it will probably be an older version. In the binary directory (looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin…