Posts

Showing posts with the label DEE

Incorporate dee2 data into your R-based RNA-seq workflow

Dee2.io is a portal for accessing gene expression data derived from public RNA-seq datasets. So far there are over 400k available datasets and its growing every day. While there are existing databases of such as Expression Atlas, Recount2 and ARCHS4, dee2.io offers a number of unique benefits. For instance, dee2 includes gene-wise counts fron STAR as well as transcript-wise quantifications from Kallisto. There are a few ways you can access these data. Firstly, there is a nice web interface that is mobile friendly. Secondly, there are data dumps available if you are running a large scale analysis.  But the purpose of this post is to demonstrate the improved R interface in action together with SRAdbv2 and statistics with edgeR and DESeq. The official documentation is available on GitHub.
Getting started This tutorial provides a walkthrough for how to work with dee2 expression data, starting with dataset searches, obtaining the data from dee2.io and then performing a differential analysi…

Update on DEE2 project for Jan 2018

Image
Today I'd like to share some updates on the DEE2 project, which I wrote about  in an earlier post. The  project code can be viewed on github here. Pipeline and images As the pipeline was recently finalised, I was able to roll out the working docker image. To facilitate users without root access, this image was ported to singularity. This took a lot of effort and some expertise from our local HPC team to get things working (many thanks to the Massive/M3 team). The singularity image is available from the webserver (link) and instructions for running it are available on github here. I have started testing a heavyweight singularity image, which includes the genome indexes, which will be more efficient for running jobs with large genomes and will make it available once testing is complete. Queue management It may sound simple to write a script to determine which datasets have been completed and add new datasets to the queue but when taking about tens of thousands of datasets it's …

Introducing "Digital Expression Explorer"

Image
RNA-seq has been a blessing for molecular biologists, not only does RNA-seq provide unbiased transcriptome wide expression analysis, it can be mined for a variety of other information like splicing, SNV identification, RNA editing, TSS usage, etc. As the cost of RNA-seq declines, more and more labs are using it hence more and more data is being deposited at databases such as SRA and GEO.

But there is a growing problem.

The problem is a lack of uniformity of processed data on GEO. Processed data has assorted reference genomes, gene annotation sets, accession numbers, software pipelines, statistical analyses and output formats, that in most cases makes comparison of two experiments hard if not impossible, let alone three or more experiments. This is a burden on researchers who want to quickly extract expression information from public RNA-seq data. Many researchers then resort to downloading the raw data in SRA format then processing it with QC/alignment/quantification/statistical pipel…