Diagnosing PCR duplicates from cluster duplicates

NovaSeq, HiSeqX and HiSeq4000 Illumina sequencers have patterned flowcells which have a different chemistry as compared to random clustered flowcell systems (Hiseq2500 & MiSeq) which is known to cause duplicates during the clustering process. For some background on the issue, see these previous blog posts:

In my recent whole genome bisulfite sequencing experiment using TruSeq methylation library prep kits and NovaSeq, I noticed a high proportion of duplicate reads and wanted to investigate whether these were "cluster" duplicates, ie generated during the clustering process due to ExAmp chemistry or were duplicates generated during the PCR step. Generally cluster duplicates occur in the immediate proximity on the flowcell surface and PCR duplicates are expected to occur uniformly throughout the flowcell surface.

To diagnose this, I used the diagnose-dups tool by Dave Larson which can be found on Github here. I wrote a wrapper script to process a folder of BAM files and save the summary statistics in a neat table (below), which can be readily graphed using the R script also below. The result indicate the majority of duplicate reads originate from PCR copies, not from cluster duplicates. 


This is useful information that we can use to now focus on optimising the library preparation protocol. Further reading on the subject revealed that the over representation of duplicates is common in TruSeq kits compared to other ones (Raine et al, 2017) and that the selection of library prep kit has a major impact on coverage and methylation calls obained from WGBS (Olova et al, 2017).

References

Amanda Raine, Ulrika Liljedahl, Jessica Nordlund. Data quality of Whole Genome Bisulfite Sequencing on Illumina platforms. bioRxiv 188797; doi: https://doi.org/10.1101/188797

Nelly Olova, Felix Krueger, Simon Andrews, David Oxley, Rebecca V Berrens, Miguel R Branco, Wolf Reik. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data. bioRxiv 165449; doi: https://doi.org/10.1101/165449

Shell script
#!/bin/bash
echo "BAMfile total_fragments total_duplicated_fragments total_duplicate_fragments total_flowcell_duplicates duplicate_on_same_strand(pairs) duplicate_on_different_strand(pairs) subtile_dup_rate_stdev dup_rate estimated_library_dup_rate" | tee output.txt
dups(){
BAM=$1
RPT=$BAM.txt
diagnose-dups -t -i $BAM -o $RPT
grep -A1 summary $RPT | tail -1 | tr -d '{};' | tr ',' '\n' \

| awk '{print $2}' | tr '\n' '\t' | sed 's/\t$/\n/' \

| sed "s/^/${BAM}\t/"
}
export -f dups
parallel dups ::: *bam | tee -a output.txt


R script
x<-read.table("output.txt",header=T,row.names=1)
x$uniq_rate=1-x$dup_rate
x$est_cluster_dup_rate=x$dup_rate-x$estimated_library_dup_rate
y<-data.frame(x$uniq_rate,x$estimated_library_dup_rate,x$est_cluster_dup_rate)
rownames(y)=rownames(x)
colnames(y)=gsub("x.","",colnames(y))
png("duplicate_chart.png")
par(las=2) ; par(mar=c(5,8,8,2)) ; barplot(t(as.matrix(y)),horiz=T,xlab="Proportion of reads",legend=colnames(y),args.legend=list(x = 1,y=16.5))
dev.off()


Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths