Showing posts from November, 2017

Update on the DEE project Dec 2017

Back in 2015, our group described DEE, a user-friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and that included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where this project is going in the future.

Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used it and sent suggestions, comments and bug reports! However, our attempt to have the repository published wasn't so successful, due to reviewer niggles over points that I consider minor but that are hard to implement quickly. The main points raised by reviewers were:

Is it reasonable to treat all data sets as if they were single end? On this one the reviewers were split: one said it was OK and the other was adamant that it was unacceptable despite my …

Diagnosing PCR duplicates from cluster duplicates

NovaSeq, HiSeqX and HiSeq4000 Illumina sequencers use patterned flowcells, whose chemistry differs from that of randomly clustered flowcell systems (HiSeq2500 and MiSeq) and is known to cause duplicates during the clustering process. For some background on the issue, see these previous blog posts:

QC Fail blog by Steve Wingett
Enseqlopedia blog by James Hadfield

In my recent whole genome bisulfite sequencing experiment using TruSeq methylation library prep kits and NovaSeq, I noticed a high proportion of duplicate reads and wanted to investigate whether these were "cluster" duplicates, i.e. generated during the clustering process due to ExAmp chemistry, or duplicates generated during the PCR step. Generally, cluster duplicates occur in close proximity to each other on the flowcell surface, whereas PCR duplicates are expected to be spread uniformly across the flowcell surface.
To diagnose this, I used the diagnose-dups tool by Dave Larson, which can be found on GitHub here. I wr…
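
To illustrate the proximity idea, here is a minimal Python sketch. It is not how diagnose-dups works internally; it only assumes that read names follow the usual Illumina layout (instrument:run:flowcell:lane:tile:x:y). The function names and the `max_distance` threshold are hypothetical and chosen for illustration only. Given the read names in one duplicate group, it counts how many pairs sit close together on the same tile (consistent with cluster duplicates) and how many are far apart or on different tiles (consistent with PCR duplicates).

```python
import math
from itertools import combinations


def parse_position(read_name):
    """Return (flowcell, lane, tile, x, y) from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    prefixed with '@' and followed by a whitespace-separated comment.
    """
    fields = read_name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {read_name!r}")
    flowcell, lane, tile, x, y = fields[2:7]
    return flowcell, int(lane), int(tile), int(x), int(y)


def classify_pair(name_a, name_b, max_distance=2500):
    """Label a duplicate pair 'nearby' or 'distant'.

    max_distance is an illustrative threshold in flowcell coordinate
    units, not a recommended value.
    """
    fc_a, lane_a, tile_a, xa, ya = parse_position(name_a)
    fc_b, lane_b, tile_b, xb, yb = parse_position(name_b)
    if (fc_a, lane_a, tile_a) != (fc_b, lane_b, tile_b):
        return "distant"  # different tile, lane or flowcell
    distance = math.hypot(xa - xb, ya - yb)
    return "nearby" if distance <= max_distance else "distant"


def summarise_group(read_names, max_distance=2500):
    """Count nearby and distant pairs within one duplicate group."""
    counts = {"nearby": 0, "distant": 0}
    for a, b in combinations(read_names, 2):
        counts[classify_pair(a, b, max_distance)] += 1
    return counts
```

A high share of "nearby" pairs across duplicate groups would point towards cluster duplicates, while mostly "distant" pairs would point towards PCR duplicates.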

Considerations in performing whole human genome bisulfite sequencing on the Illumina NovaSeq system

Today at the NGS workshop at WEHI, Melbourne, I presented some findings related to a pilot study of 12 methylomes studied with whole genome bisulfite sequencing. Two of those libraries were also sequenced on the HiSeq4000 platform to similar depth, so we could see some subtle but interesting differences between the systems. What we found was that the actual sequence coverage obtained was substantially less than projected, due to two problems. Firstly, the insert size was too small, which looks like it could be due to the inner workings of the Illumina TruSeq methylation kit. Secondly, there was a high proportion of duplicate reads, that is, reads with the same strand and coordinates, which are likely not independent observations. I will need to look in further detail at whether these are PCR duplicates or "cluster" duplicates. Perhaps the library prep or clustering protocols need some tweaking for bisulfite sequencing.
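
To make the coverage shortfall concrete, here is a rough back-of-the-envelope sketch in Python. It is a simplification under stated assumptions, not the calculation used in the slides: it ignores mapping rate, adapter and quality trimming, and any bisulfite-specific losses, and the function and parameter names are my own. The idea is that when the insert is shorter than twice the read length, the two mates overlap and the overlapping bases count only once, and that duplicate pairs add no independent coverage.

```python
def usable_bases_per_pair(read_length, insert_size):
    """Unique bases covered by one read pair.

    If the insert is shorter than two read lengths, the mates overlap
    and the pair covers only insert_size distinct bases.
    """
    return min(2 * read_length, insert_size)


def effective_coverage(n_pairs, read_length, insert_size,
                       duplicate_fraction, genome_size):
    """Approximate fold coverage after removing duplicates and overlap."""
    unique_pairs = n_pairs * (1 - duplicate_fraction)
    bases = unique_pairs * usable_bases_per_pair(read_length, insert_size)
    return bases / genome_size


def projected_coverage(n_pairs, read_length, genome_size):
    """Naive projection: every base of every read counts."""
    return n_pairs * 2 * read_length / genome_size
```

Comparing `projected_coverage` with `effective_coverage` for a given run shows how much of the shortfall comes from short inserts and how much from duplicates; any input values plugged in here would be hypothetical and are not taken from the pilot study.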

So as promised, here is the link to the slides.