DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.
The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).
Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.
>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC
The sequence name is chromosome-start-end. Next I mutated the reads with msbar, a neat little EMBOSS tool which incorporates mutations (SNP/insertions/deletions) at a set rate into sequences. Then I aligned these reads back to the genome. Here are the aligners and versions I used. I used all default values.
BWA aln 0.7.5a-r405
BWA mem 0.7.5a-r405
Bowtie2 2.1.0
Soap 2.21
Subread 1.4.5-p1
STAR 2.4.0d
I included STAR here out of curiosity because it was the most accurate RNA-seq aligner in our last comparison. To determine the number of correctly and incorrectly assigned reads, I used samtools and awk to check the sequence header matched the mapping location. I set a mapQ threshold of 10 which is one that is commonly used. First I looked at how these aligners fared with unmutated, perfect reads.
This result shows that all of the aligners had very low incorrect assignment rates, except for SubRead. BWA mem had the lowest correct assignment rate at all read lengths. Soap and Bowtie2 had the highest correct assignment rate at 50 bp; BWA aln and Soap had the highest correct assignment rate at 100 bp; and SubRead had the highest correct assignment rate at 200 bp. Even though STAR is a splice aware RNA-seq aligner, it performed pretty well on this dataset.
Then I analysed the mutated reads. I wasn't able to run all SubRead analyses as it was just taking too long.
At 200 bp just like 100 bp, BWA mem and STAR were the most SNP tolerant, followed by SubRead, Bowtie2, BWA aln and lastly Soap. BWA mem also showed the best tolerance to indels, followed by STAR, Bowtie2, BWA aln and lastly Soap. STAR performed well in terms of correctly assigned reads, but showed a comparatively high false assignment rate.
Here is the full table:
GENOMEFA=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa
READLENGTHRANGE='50 100 200'
STEP=10
PFX=simDNA_
for RDLEN in $READLENGTHRANGE
do
echo $RDLEN
bedtools makewindows -w $RDLEN -s $STEP -g $CHR \
| awk -v L=$RDLEN '($3-$2)==L' \
| bedtools getfasta -fi $GENOMEFA -bed /dev/stdin -fo /dev/stdout \
| tr ':' '-' | pigz > ${PFX}${RDLEN}bp.fasta.gz &
done
wait
for FQZ in simDNA_50bp.fasta.gz simDNA_100bp.fasta.gz simDNA_200bp.fasta.gz
do
for COUNT in 1 2 4 8 16
do
BASE=`echo $FQZ | sed 's/.fasta.gz//'`
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}snp.fasta.gz &
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}ins.fasta.gz &
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -point 3 -count $COUNT \
-block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}del.fasta.gz &
wait
done
done
#!/bin/bash
REF=/path/to/soap/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
PWD=`pwd`
HEADER=header.txt
for FQZ in *fasta.gz
do
pigz -dc $FQZ \
| perl /usr/local/bin/fasta_to_fastq.pl - - \
| soap -p 8 -a /dev/stdin -D $REF -o /dev/stdout \
| soap2sam.pl -p - \
| cat $HEADER - \
| samtools view -uSh - \
| samtools sort - ${FQZ}_soap.sort
done
for BAM in *soap.sort.bam ; do samtools index $BAM & done
for BAM in *soap.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bwamem.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa mem -t 8 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
done
for BAM in *_bwamem.sort.bam ; do samtools index $BAM & done
for BAM in *_bwamem.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bwaaln.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa aln -t 8 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwaaln.sort
done
for BAM in *_bwaaln.sort.bam ; do samtools index $BAM & done
for BAM in *_bwaaln.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bowtie2.sh <==
#!/bin/bash
REF=/path/to/bowtie2/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ \
| bowtie2 -p8 -f -x $REF -U /dev/stdin -S ${FQ}.sam
( samtools view -uSh ${FQ}.sam \
| samtools sort - ${FQ}_bowtie2.sort
rm ${FQ}.sam ) &
done
wait
for BAM in *_bowtie2.sort.bam ; do samtools index $BAM & done
for BAM in *_bowtie2.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> subread.sh <==
#!/bin/bash
REF=/path/to/subread/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
subread-align -T 8 --gzFASTQinput --BAMoutput -i $SUBREADREF \
-r $FQZ -o ${FQ}.subread_unsorted.bam
samtools sort ${FQ}.subread_unsorted.bam ${FQ}.subread &
done
wait
rm *unsorted.bam
for BAM in *subread.bam ; do samtools index $BAM & done
for BAM in *subread.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> star.sh <==
STARREF=/path/to/star/index/
for FQZ in *.fasta.gz
do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -fdk $FQZ > $FQ
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $STARREF --runThreadN 8 \
wait
rm $FQ
mv Aligned.out.sam ${FQ}.STAR_unsorted.sam
samtools view -uSh ${FQ}.STAR_unsorted.sam \
| samtools sort - ${FQ}.STAR
rm ${FQ}.STAR_unsorted.sam
done
for BAM in *STAR.bam ; do samtools index $BAM & done
for BAM in *STAR.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
for TAG in subread STAR bowtie bwaaln bwamem soap
do
rm ${TAG}_alignment.txt
for BAM in `ls *${TAG}*bam | grep -v '\.00'`
do
CORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
| awk '$1==$5 && $6<($3+100) && $6>($2-100)' | wc -l`
INCORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
| awk '$1!=$5 || $6>($3+100) || $6<($2-100)' | wc -l`
echo $BAM $CORRECT $INCORRECT >> ${TAG}_alignment.txt
done &
done
wait
The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).
Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.
>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC
The sequence name is chromosome-start-end. Next I mutated the reads with msbar, a neat little EMBOSS tool which incorporates mutations (SNP/insertions/deletions) at a set rate into sequences. Then I aligned these reads back to the genome. Here are the aligners and versions I used. I used all default values.
BWA aln 0.7.5a-r405
BWA mem 0.7.5a-r405
Bowtie2 2.1.0
Soap 2.21
Subread 1.4.5-p1
STAR 2.4.0d
I included STAR here out of curiosity because it was the most accurate RNA-seq aligner in our last comparison. To determine the number of correctly and incorrectly assigned reads, I used samtools and awk to check the sequence header matched the mapping location. I set a mapQ threshold of 10 which is one that is commonly used. First I looked at how these aligners fared with unmutated, perfect reads.
Table 1: Aligning perfect reads to the Arabidopsis genome. |
Then I analysed the mutated reads. I wasn't able to run all SubRead analyses as it was just taking too long.
|
Limitations of this analysis
I used Arabidopsis, so this might not translate 100% over to mammalian-sized genomes. Also I used indels of only 1 bp so this doesn't really resolve what happens with larger indel sizes. Real sequence data may contain masked bases, as well as mixtures of indels and SNPs which weren't considered here.
Bottom line
For aligning 50 bp reads to the genome BWA aln, Bowtie2 and BWA mem would be recommended. At 100 and 200 bp BWA mem was clearly the best. Soap is not recommended due to indel intolerance. SubRead showed a comparatively high false assignment rate, even with perfect reads making it a poor choice.Here is the full table:
Table 2: Correct and incorrect read assignment of mutated reads with different aligners at 50, 100 and 200 bp lengths to the Arabidopsis genome. * Denotes abandoned test due to slow running times. |
Simulate DNA sequence reads
#!/bin/bash
CHR=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.g
GENOMEFA=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa
READLENGTHRANGE='50 100 200'
STEP=10
PFX=simDNA_
for RDLEN in $READLENGTHRANGE
do
echo $RDLEN
bedtools makewindows -w $RDLEN -s $STEP -g $CHR \
| awk -v L=$RDLEN '($3-$2)==L' \
| bedtools getfasta -fi $GENOMEFA -bed /dev/stdin -fo /dev/stdout \
| tr ':' '-' | pigz > ${PFX}${RDLEN}bp.fasta.gz &
done
wait
Mutate reads
#!/bin/bashfor FQZ in simDNA_50bp.fasta.gz simDNA_100bp.fasta.gz simDNA_200bp.fasta.gz
do
for COUNT in 1 2 4 8 16
do
BASE=`echo $FQZ | sed 's/.fasta.gz//'`
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}snp.fasta.gz &
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}ins.fasta.gz &
pigz -dc ${FQZ} \
| msbar -sequence /dev/stdin -point 3 -count $COUNT \
-block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
| sed 's/ /\n/' | pigz > ${BASE}_${COUNT}del.fasta.gz &
wait
done
done
Alignment
==> soap.sh <==#!/bin/bash
REF=/path/to/soap/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
PWD=`pwd`
HEADER=header.txt
for FQZ in *fasta.gz
do
pigz -dc $FQZ \
| perl /usr/local/bin/fasta_to_fastq.pl - - \
| soap -p 8 -a /dev/stdin -D $REF -o /dev/stdout \
| soap2sam.pl -p - \
| cat $HEADER - \
| samtools view -uSh - \
| samtools sort - ${FQZ}_soap.sort
done
for BAM in *soap.sort.bam ; do samtools index $BAM & done
for BAM in *soap.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bwamem.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa mem -t 8 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
done
for BAM in *_bwamem.sort.bam ; do samtools index $BAM & done
for BAM in *_bwamem.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bwaaln.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa aln -t 8 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwaaln.sort
done
for BAM in *_bwaaln.sort.bam ; do samtools index $BAM & done
for BAM in *_bwaaln.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> bowtie2.sh <==
#!/bin/bash
REF=/path/to/bowtie2/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ \
| bowtie2 -p8 -f -x $REF -U /dev/stdin -S ${FQ}.sam
( samtools view -uSh ${FQ}.sam \
| samtools sort - ${FQ}_bowtie2.sort
rm ${FQ}.sam ) &
done
wait
for BAM in *_bowtie2.sort.bam ; do samtools index $BAM & done
for BAM in *_bowtie2.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> subread.sh <==
#!/bin/bash
REF=/path/to/subread/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
subread-align -T 8 --gzFASTQinput --BAMoutput -i $SUBREADREF \
-r $FQZ -o ${FQ}.subread_unsorted.bam
samtools sort ${FQ}.subread_unsorted.bam ${FQ}.subread &
done
wait
rm *unsorted.bam
for BAM in *subread.bam ; do samtools index $BAM & done
for BAM in *subread.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
==> star.sh <==
STARREF=/path/to/star/index/
for FQZ in *.fasta.gz
do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -fdk $FQZ > $FQ
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $STARREF --runThreadN 8 \
wait
rm $FQ
mv Aligned.out.sam ${FQ}.STAR_unsorted.sam
samtools view -uSh ${FQ}.STAR_unsorted.sam \
| samtools sort - ${FQ}.STAR
rm ${FQ}.STAR_unsorted.sam
done
for BAM in *STAR.bam ; do samtools index $BAM & done
for BAM in *STAR.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait
Determine correct and incorrect reads
#!/bin/bashfor TAG in subread STAR bowtie bwaaln bwamem soap
do
rm ${TAG}_alignment.txt
for BAM in `ls *${TAG}*bam | grep -v '\.00'`
do
CORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
| awk '$1==$5 && $6<($3+100) && $6>($2-100)' | wc -l`
INCORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
| awk '$1!=$5 || $6>($3+100) || $6<($2-100)' | wc -l`
echo $BAM $CORRECT $INCORRECT >> ${TAG}_alignment.txt
done &
done
wait