DEE2 project update Jan 2020

Happy New Year! Here I'll go through a few updates on the DEE2 project. In case you don't know about it, you can read the paper here.

1) Data processing

One of the most obvious changes since the paper came out is that the number of datasets hosted by DEE2 has increased substantially, from 581,094 to 805,385 runs. This was achieved using a combination of a mini-cluster at Deakin University, several Nectar Cloud servers and the MASSIVE HPC located at Monash University. The queue is still growing, so I'll need to further expand compute resources in the future.

2) Getting a new source of metadata

DEE2 was reliant on the SRAdbV2 package for SRA metadata. This was critical for providing the list of accession numbers for the database to process, as well as sample information so that end users can search for datasets of interest. Unfortunately, SRAdbV2 was deprecated only a few months after the paper was published, possibly because the expanding size of the SRA metadata made it infeasible to distribute as a single file. OmicIDXR is slated to be a replacement source of metadata, however I have doubts about its sustainability over the next few years. Therefore, I have taken the approach of obtaining metadata directly from the SRA website through the Run Selector app. This has the drawbacks that it needs to be done "manually" and has a batch limit of about 40,000 runs, but as it is based on the SRA website, it is likely to be a better long-term solution. It is also relatively straightforward to download metadata for recently added studies, for example every month. After incorporating new metadata from SRA last week, the queue has grown substantially.
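The monthly metadata update boils down to merging the newly downloaded table with the existing one, keeping only runs not seen before. Here is a minimal Python sketch of that idea; the field name `run_accession` and the example records are hypothetical, not DEE2's actual schema or pipeline code.

```python
# Illustrative sketch: merge freshly downloaded SRA metadata with the
# existing table, keeping only previously unseen runs.
# Field names and values are hypothetical examples.

def merge_new_metadata(existing, new):
    """existing, new: lists of dicts, each with a 'run_accession' key."""
    seen = {row["run_accession"] for row in existing}
    added = [row for row in new if row["run_accession"] not in seen]
    return existing + added

old = [{"run_accession": "SRR100", "organism": "hsapiens"}]
monthly = [
    {"run_accession": "SRR100", "organism": "hsapiens"},  # already known
    {"run_accession": "SRR200", "organism": "hsapiens"},  # new this month
]
merged = merge_new_metadata(old, monthly)
# merged holds SRR100 and SRR200, with the duplicate dropped
```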

3) The problem of studies missing runs

In the next few weeks I will be working to improve some of the backend scripts. One current drawback of DEE2 is that some projects are missing some of their runs. I will be working toward ensuring that only wholly completed studies are added to the database.
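The "wholly completed" check amounts to comparing each study's run list against the set of processed runs. A small Python sketch of that logic, with hypothetical accessions for illustration:

```python
# Illustrative sketch: keep only studies where every run has been
# processed. The mapping and accessions below are made-up examples.

def complete_studies(study_runs, processed):
    """study_runs: dict mapping study accession -> set of its run accessions.
    processed: set of run accessions processed so far."""
    return {s for s, runs in study_runs.items() if runs <= processed}

studies = {
    "SRP001": {"SRR1", "SRR2"},
    "SRP002": {"SRR3", "SRR4", "SRR5"},
}
done = {"SRR1", "SRR2", "SRR3"}
print(complete_studies(studies, done))  # {'SRP001'}
```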

4) Increasing queue efficiency

Related to this are some proposed changes to data-processing queue management. The queue should prioritise these missing runs. As the webserver relays datasets from the nodes to the central data server, it can remove from the queue any datasets that have already passed basic checks.
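One way to express both ideas: drop already-validated runs from the queue, then sort so that runs belonging to partially finished studies come first. This Python sketch is purely illustrative; the function and accession names are hypothetical.

```python
# Illustrative sketch: prune runs that passed basic checks, then put
# runs that would complete a partially processed study at the front.
# All names and accessions are hypothetical.

def prioritise_queue(queue, study_of, processed):
    """queue: list of run accessions awaiting processing.
    study_of: dict mapping run accession -> study accession.
    processed: set of runs already validated on the central server."""
    pending = [r for r in queue if r not in processed]
    started = {study_of[r] for r in processed if r in study_of}
    # False sorts before True, so runs from started studies come first.
    return sorted(pending, key=lambda r: study_of.get(r) not in started)

queue = ["SRR10", "SRR11", "SRR12"]
study_of = {"SRR10": "SRP9", "SRR11": "SRP7", "SRR12": "SRP9"}
processed = {"SRR10"}
print(prioritise_queue(queue, study_of, processed))  # ['SRR12', 'SRR11']
```

SRR12 jumps ahead because its study SRP9 already has a processed run, while SRR10 is dropped because it has already passed the checks.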

5) Making the R package more efficient

There are also some improvements that can be made to the R interface, the getDEE2 package. In particular, the handling of the metadata can be improved by downloading it only once and reusing it thereafter. Searching the metadata could also be made easier with a search-engine-style function. I'm also considering submitting the package to Bioconductor.
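The "download once" idea is a simple cache-to-disk pattern. getDEE2 itself is an R package, so this Python sketch is only an illustration of the pattern; the function names and the `fetch` callable are hypothetical.

```python
# Illustrative sketch of metadata caching: download on the first call,
# then reuse the local copy. Names here are hypothetical, not getDEE2's
# actual API (which is R).
import json
import os

def get_metadata(species, fetch, cache_dir="."):
    """fetch: callable that downloads and returns metadata for a species."""
    path = os.path.join(cache_dir, f"{species}_metadata.json")
    if os.path.exists(path):          # reuse the cached copy
        with open(path) as f:
            return json.load(f)
    data = fetch(species)             # download only on the first call
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```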

6) Website improvements

Other improvements on the horizon include facilitating quicker metadata searches on the DEE2 webpage with Elasticsearch or a similar search engine. I have tried this before but could not pull it off. If you have any good resources on implementing a search engine for a TSV file, please share them in the comments.
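Short of running Elasticsearch, one lightweight option is a small inverted index over the TSV rows. A minimal Python sketch, with a made-up two-column TSV for illustration:

```python
# Illustrative sketch: keyword search over a TSV via an inverted index
# (word -> set of row numbers). The column layout is hypothetical.
import csv
import io
from collections import defaultdict

def build_index(tsv_text):
    index = defaultdict(set)
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    for i, row in enumerate(rows):
        for field in row:
            for word in field.lower().split():
                index[word].add(i)
    return rows, index

def search(rows, index, query):
    # A row matches only if it contains every query word.
    hits = set.intersection(*(index.get(w.lower(), set())
                              for w in query.split()))
    return [rows[i] for i in sorted(hits)]

tsv = "SRR1\thuman liver RNA-seq\nSRR2\tmouse brain RNA-seq\n"
rows, index = build_index(tsv)
print(search(rows, index, "liver"))  # [['SRR1', 'human liver RNA-seq']]
```

This scales fine for thousands of rows; a dedicated search engine mainly adds ranking, fuzzy matching and incremental updates.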

7) Call for compute resources

Finally, if you are a frequent DEE2 user, have used the bulk datasets already, or would like to contribute to the project, we are calling for contributors to volunteer their server or HPC compute resources. There is a Docker-based pipeline for servers/cloud and a Singularity one for HPC. Docker nodes can be initiated using the command:
docker run mziemann/tallyup hsapiens
substituting your favourite organism. If you would like to process a specific dataset, you can provide the SRA run accessions like this:
docker run mziemann/tallyup hsapiens SRR123,SRR124,SRR125
You will be able to access the data immediately from the Docker container, as well as contribute it to the public database.
Although I will be applying for grants to expand DEE2, your contribution will help a lot.
