Update on the DEE project Dec 2017

Back in 2015, our group described DEE, a user-friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and that included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where the project is heading.

Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used DEE and sent suggestions, comments and bug reports! However, our attempt to have the repository published wasn't so successful, due to reviewer niggles over points that I consider minor but that were hard to implement quickly. The main points raised by reviewers were:

  1. Is it reasonable to treat all data sets as if they were single end? For this one, the reviewers were split: one said it was OK and the other was adamant that it was unacceptable, despite my showing that overall PE vs SE didn't make much of a difference. This was problematic in that including the 2nd read in the pair was not something that could be done quickly; remember the data set took 6 months of compute time on a cluster of 3 servers. Also, I felt that omitting the 2nd read improved the speed of data processing and made the data more uniform between experiments.
  2. What level of quality control is appropriate? DEE included QC information in the form of brief reports and warnings that were designed to give the end user discretion in whether or not to include the data in their analysis. It was the reviewers' opinion that any data set that did not "pass" QC thresholds should not be included in the database at all. It was (and still is) my philosophy that data should be included alongside QC warnings. If we start having hard threshold criteria for inclusion/exclusion, then we will find that studies containing many samples will have drop-outs, which is not desirable.
With these points, we were unable to persuade the reviewers. It didn't help that the manuscript submission requirements were very rigid with respect to word limit, so it was difficult to convey that on such projects there is a compromise between the breadth of data to be included and the detail at which each data set is interrogated. So the paper was rejected and the project went on the back burner for a bit while our group transitioned from Baker Institute to Monash Uni, which complicated things as Baker IT eventually killed off the DEE webserver. Meanwhile, if you are looking for DEE data sets, these are available via Google Drive here in the form of bulk data dumps.

The plan moving forward is to revamp the app by including several new features, re-process the data and unveil DEE2 when it's ready. Before I go into detail about the new features, I want to reiterate what stays the same in DEE2:
  • The goal is to democratise RNA-seq gene expression analysis and make these data available to anybody connected to the internet using nearly any computer.
  • DEE2 just provides processed RNA-seq data from SRA in the form of summary expression data, QC information and comprehensive QC logs. It does not provide visualisation or statistical analysis of the data.
  • We do not keep any fastq or alignment files.
I also want to highlight some of the project changes and new features:
  • Open development on github. Visit the repo here. I'm hoping that open development will foster community input into its development.
  • Use of a docker image. This helps to ensure that all prerequisites are present and functional and that the code runs identically on different systems such as standard servers, PCs, cloud servers, etc. The docker image is currently available on dockerhub and you can download and run it. We are also looking to add support for HPC via a singularity image soon.
  • Scalable processing power. The above points allow the DEE2 workload to be distributed over many systems. A central server exists only to delegate SRR accessions to worker nodes and receive completed data sets from worker nodes. Community members wanting to contribute to the database will be able to dedicate compute time to processing SRA data sets.
  • More flexible input data. If a data set of interest is not present in the DEE2 database, you can run the pipeline yourself and it will be immediately available to you. The processed data is also transmitted to the central server for validation before being made publicly available. You can also direct a worker node to process data sets from one species, or let the pipeline decide what species/data set it will work on based on the specs of the machine and the length of the job queue. We have also included the ability to process your own fastq files that are not on SRA. This allows end users to process their own data in the same fashion as that used for public DEE2 data. Your own data files remain private and are not shared with the central server.
  • Pipeline enhancements.
    • Smart adaptor detection and clipping. Using the minion tool (from EBI), 3' adaptor sequences are detected and clipped above a threshold.
    • Detection and clipping of non-reference 5' bases (i.e. UMIs).
    • Strandedness detection. Using a simple heuristic, the strandedness of a library can be determined accurately.
    • Parallel assignment of reads to genes and transcripts with STAR and kallisto.
    • Support for paired end data. If read 1 and 2 map above a certain threshold then both reads are included in STAR and kallisto mappings.
    • Speed improvements. As BAM files are never written, there are significant I/O savings.
  • Transcript-level data. In addition to gene-level counts provided by STAR, kallisto quantification is provided as RPM or estimated counts.
  • Minor webpage enhancements. This work is ongoing and will include a better search interface and appearance.
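
The strandedness heuristic isn't spelled out in this post, so the following is only a rough sketch of how such a check could work, not the DEE2 implementation. It assumes gene counts in the four-column layout that STAR writes with `--quantMode GeneCounts` (gene ID, unstranded counts, counts for reads on the gene's strand, counts for reads on the opposite strand). The function name `classify_strandedness` and the `min_reads` and `ratio` cutoffs are made up for illustration.

```python
import io


def classify_strandedness(reads_per_gene, min_reads=1000, ratio=5.0):
    """Guess library strandedness from STAR ReadsPerGene-style lines.

    Hypothetical helper: the cutoffs are illustrative assumptions,
    not values taken from the DEE2 pipeline.
    """
    sense = antisense = 0
    for line in reads_per_gene:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the N_unmapped/N_noFeature summary rows.
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        sense += int(fields[2])      # column 3: read strand matches gene strand
        antisense += int(fields[3])  # column 4: read strand opposite to gene
    if sense + antisense < min_reads:
        return "undetermined"        # too few assigned reads to call
    if sense >= ratio * antisense:
        return "forward"
    if antisense >= ratio * sense:
        return "reverse"
    return "unstranded"


# Toy input: most reads land in column 4, as in a reverse-stranded library.
example = (
    "N_unmapped\t100\t100\t100\n"
    "GENE1\t900\t20\t880\n"
    "GENE2\t600\t15\t585\n"
)
print(classify_strandedness(io.StringIO(example)))  # prints "reverse"
```

A ratio test like this keeps the call cheap because it only needs the count table, which fits with never writing BAM files. A real pipeline would probably need to tune the cutoffs per species and library type.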
Most of the above have been implemented. In a few weeks we'll start large scale processing of SRA, so we'd love to have your comments, suggestions and feature requests. Please comment below or raise an issue over at the github repo. Thank you.


Note: Since DEE was described, two similar projects have been reported, namely Recount2 and ARCHS4, which are centred upon human and mouse Illumina RNA-seq data sets.
