DEE2 project update June 2019

It has been wonderful to receive so much feedback on DEE2, including ways it can be improved and directions for future development of open omics data. Here I'll summarise some of these points.

Recent developments

  • iDEP integration. The iDEP service now provides DEE2 counts on its R Shiny-powered page, which makes it very easy to download and analyse DEE2 data using the iDEP in-browser bioinformatics platform: http://bioinformatics.sdstate.edu/reads/

  • Degust integration. When using DEE2 in the browser you can now choose to send count matrices to Degust for differential analysis. You can send the STAR gene counts, the Kallisto transcript (Tx) counts, or the Kallisto Tx counts aggregated to gene counts (recommended). Many thanks to David Powell's team at Monash, especially Andrew Perry, for working with me on adding this feature.

  • The R package getDEE2 now works on Windows systems. This required a modification to the download.file() options to account for the default behaviour on Windows. Thanks to Steven Ge for the bug report.
  • Each DEE2 download now contains a README to help end users understand the contents of the folder. Thanks to Peter Karp for the suggestion.
  • Backend dee_pipeline.R issue: I initially thought there was a bug in the backend R script dee_pipeline.R, but it turned out to be excessive memory use in a call to mclapply; reducing the number of parallel threads made the issue go away :) In addition, load() overwrites any existing functions and variables (such as CORES) that share a name with objects in the RData file, so in future I'll load individual objects from the RData file instead.
  • Data updates: With the backend script issue pinned down, about 200,000 mouse and human runs have been added since last week. If data were missing from your project, it would be a good idea to try again.
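
To illustrate the load() point above, here is a minimal sketch of one way to avoid clobbering workspace variables: load the RData file into its own environment and pull out only the object that is needed. This is not the actual dee_pipeline.R code. CORES is the variable named above, while the file, the counts object and all values are made up for the example.

```r
# Workspace setting we want to protect (the value is hypothetical)
CORES <- 2

# Build a throwaway RData file that happens to contain its own CORES
src <- new.env()
assign("CORES", 16, envir = src)
assign("counts", matrix(1:6, nrow = 2), envir = src)
rdata <- tempfile(fileext = ".RData")
save(list = c("CORES", "counts"), file = rdata, envir = src)

# A plain load(rdata) here would overwrite CORES in the workspace.
# Loading into a separate environment leaves the workspace untouched.
e <- new.env()
load(rdata, envir = e)

# Take only the object that is actually needed
counts <- get("counts", envir = e)

CORES     # still 2
e$CORES   # 16, but isolated from the workspace
```

Another option is to write single objects with saveRDS() and read them back with readRDS(), which returns the object rather than assigning it under a stored name.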

What we're working on now

  • Filling in missing datasets: You might have seen that some datasets are missing for particular runs in a project. We have scripts running to search for these and process them.
  • SRAdbV2 is sadly not working anymore. This means that although SRA is growing rapidly, the database of corresponding metadata is static. We are looking to replace our dependence on SRAdbV2 but it will take a bit of time to integrate new tools. It could be that https://github.com/seandavi/OmicIDXR/ is the replacement.
  • More integration: Given the success with iDEP and Degust, we are looking to integrate with other tools for downstream analysis, especially pathway-level tools.

Future directions

  • Suggestions for more species are welcome, especially if those suggesting can contribute resources (e.g. computational, financial, etc.).
  • Incorporating NGS data from sources outside SRA, such as GSA (China)
  • Transcript annotations: We've found that Ensembl cDNA used by Kallisto doesn't include ncRNA, which means most lncRNAs won't be properly represented. In future we may opt for the Gencode annotation as suggested by Alex Predeus. This can't be done immediately as it would require reanalysis of the entire dataset but is on the radar for future versions. 
  • Single cell data remains a tricky situation. Some submitted datasets appear to be processed very well by the DEE2 pipeline, especially when the SRR runs correspond to individual cells, which is typical of Fluidigm preparations. However, pooled runs from Drop-Seq or 10X seem to provide only an average of all reads that make it through the pipeline. I think scRNA-seq will require a greater degree of manual curation because the many varieties of library preparation strongly affect the suitability of any analysis pipeline. Doing this properly will be a longer-term direction for future development.
