DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

The sequence name is chromosome-start-end. Next I mutated the reads with msbar, a neat little EMBOSS tool which incorporates mutations (SNP/insertions/deletions) at a set rate into sequences. Then I aligned these reads back to the genome. Here are the aligners and versions I used. I used all default values.

BWA aln 0.7.5a-r405
BWA mem 0.7.5a-r405
Bowtie2 2.1.0
Soap 2.21
Subread 1.4.5-p1
STAR 2.4.0d

I included STAR here out of curiosity because it was the most accurate RNA-seq aligner in our last comparison. To determine the number of correctly and incorrectly assigned reads, I used samtools and awk to check the sequence header matched the mapping location. I set a mapQ threshold of 10 which is one that is commonly used. First I looked at how these aligners fared with unmutated, perfect reads.
Table 1: Aligning perfect reads to the Arabidopsis genome.
This result shows that all of the aligners had very low incorrect assignment rates, except for SubRead. BWA mem had the lowest correct assignment rate at all read lengths. Soap and Bowtie2 had the highest correct assignment rate at 50 bp; BWA aln and Soap had the highest correct assignment rate at 100 bp; and SubRead had the highest correct assignment rate at 200 bp. Even though STAR is a splice aware RNA-seq aligner, it performed pretty well on this dataset.

Then I analysed the mutated reads. I wasn't able to run all SubRead analyses as it was just taking too long.

Figure 1: Alignment of mutated 50 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 50 bp length, STAR was most tolerant to SNPs, followed by Soap, BWA aln and BWA mem. Bowtie2 and SubRead were least SNP-tolerant. BWA mem was most tolerant to indels, followed by STAR, BWA aln and Bowtie2; while Soap was the least indel-tolerant. BWA mem showed the lowest incorrect mapping rates as a proportion of correctly mapped reads.

Figure 2: Alignment of mutated 100 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 100 bp, STAR and BWA mem were the most SNP tolerant, followed by SubRead, Bowtie2, BWA aln and lastly Soap. BWA mem also showed the best tolerance to indels, followed by STAR, Bowtie2, SubRead, BWA aln and lastly Soap.

Figure 3: Alignment of mutated 200 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 200 bp just like 100 bp, BWA mem and STAR were the most SNP tolerant, followed by SubRead, Bowtie2, BWA aln and lastly Soap. BWA mem also showed the best tolerance to indels, followed by STAR, Bowtie2, BWA aln and lastly Soap. STAR performed well in terms of correctly assigned reads, but showed a comparatively high false assignment rate.

Limitations of this analysis

I used Arabidopsis, so this might not translate 100% over to mammalian-sized genomes. Also I used indels of only 1 bp so this doesn't really resolve what happens with larger indel sizes. Real sequence data may contain masked bases, as well as mixtures of indels and SNPs which weren't considered here.

Bottom line

For aligning 50 bp reads to the genome BWA aln, Bowtie2 and BWA mem would be recommended. At 100 and 200 bp BWA mem was clearly the best. Soap is not recommended due to indel intolerance. SubRead showed a comparatively high false assignment rate, even with perfect reads making it a poor choice.

Here is the full table:
Table 2: Correct and incorrect read assignment of mutated reads with different aligners at 50, 100 and 200 bp lengths to the Arabidopsis genome. * Denotes abandoned test due to slow running times.

Simulate DNA sequence reads

#!/bin/bash
CHR=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.g
GENOMEFA=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa

READLENGTHRANGE='50 100 200'
STEP=10
PFX=simDNA_

for RDLEN in $READLENGTHRANGE
do
echo $RDLEN
bedtools makewindows -w $RDLEN -s $STEP -g $CHR \
| awk -v L=$RDLEN '($3-$2)==L' \
| bedtools getfasta -fi $GENOMEFA -bed /dev/stdin -fo /dev/stdout \
| tr ':' '-' | pigz > ${PFX}${RDLEN}bp.fasta.gz  &
done
wait

Mutate reads

#!/bin/bash
for FQZ in simDNA_50bp.fasta.gz simDNA_100bp.fasta.gz simDNA_200bp.fasta.gz
do
 for COUNT in 1 2 4 8 16
 do

 BASE=`echo $FQZ | sed 's/.fasta.gz//'`

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -count $COUNT \
 -point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}snp.fasta.gz &

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -count $COUNT \
 -point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}ins.fasta.gz &

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -point 3 -count $COUNT \
 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}del.fasta.gz &

 wait
 done
done


Alignment

==> soap.sh <==
#!/bin/bash
REF=/path/to/soap/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
PWD=`pwd`
HEADER=header.txt
for FQZ in *fasta.gz
do
pigz -dc $FQZ \
| perl /usr/local/bin/fasta_to_fastq.pl - - \
| soap -p 8 -a /dev/stdin -D $REF -o /dev/stdout \
| soap2sam.pl -p - \
| cat $HEADER - \
| samtools view -uSh - \
| samtools sort - ${FQZ}_soap.sort
done
for BAM in *soap.sort.bam ; do samtools index $BAM & done
for BAM in *soap.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bwamem.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa mem -t 8 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
done
for BAM in *_bwamem.sort.bam ; do samtools index $BAM & done
for BAM in *_bwamem.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bwaaln.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa aln -t 8 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwaaln.sort
done
for BAM in *_bwaaln.sort.bam ; do samtools index $BAM & done
for BAM in *_bwaaln.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bowtie2.sh <==
#!/bin/bash
REF=/path/to/bowtie2/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ \
| bowtie2 -p8 -f -x $REF -U /dev/stdin -S ${FQ}.sam
( samtools view -uSh ${FQ}.sam \
| samtools sort - ${FQ}_bowtie2.sort
rm ${FQ}.sam ) &
done
wait
for BAM in *_bowtie2.sort.bam ; do samtools index $BAM & done
for BAM in *_bowtie2.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> subread.sh <==
#!/bin/bash
REF=/path/to/subread/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
subread-align -T 8 --gzFASTQinput --BAMoutput -i $SUBREADREF \
-r $FQZ -o ${FQ}.subread_unsorted.bam
samtools sort ${FQ}.subread_unsorted.bam ${FQ}.subread &
done
wait
rm *unsorted.bam
for BAM in *subread.bam ; do samtools index $BAM & done
for BAM in *subread.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> star.sh <==
STARREF=/path/to/star/index/
for FQZ in *.fasta.gz
do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -fdk $FQZ > $FQ
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $STARREF --runThreadN 8 \
wait
rm $FQ
mv Aligned.out.sam ${FQ}.STAR_unsorted.sam
samtools view -uSh ${FQ}.STAR_unsorted.sam \
| samtools sort - ${FQ}.STAR
rm ${FQ}.STAR_unsorted.sam
done
for BAM in *STAR.bam ; do samtools index $BAM & done
for BAM in *STAR.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

Determine correct and incorrect reads

#!/bin/bash
for TAG in subread STAR bowtie bwaaln bwamem soap
do
        rm ${TAG}_alignment.txt
        for BAM in `ls *${TAG}*bam | grep -v '\.00'`
        do
        CORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
        | awk '$1==$5 && $6<($3+100) && $6>($2-100)' | wc -l`

        INCORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
        | awk '$1!=$5 || $6>($3+100) || $6<($2-100)' | wc -l`

        echo $BAM $CORRECT $INCORRECT >> ${TAG}_alignment.txt
        done &
done
wait

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?