RNA-seq aligner accuracy tested with simulated reads

In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTCGAGCTTATTATTGGTCTGTCGAT

As you can see in the example above, the name of the transcript (AT2G01210.1) and gene (AT2G01210) is provided in the fasta heading to allow the identification of correctly and incorrectly mapped reads. The first step was to generate read sets with a reads length of 50, 100 and 200 bp in length, align and count the mapped reads with featureCounts using a mapQ threshold of 10.


The results show that STAR produced the highest correct mapping rates at 50 and 100 bp; while at 200bp, SubJunc showed slightly better correct mapping rates. SubRead showed the higher incorrect mapping rate potentially as it is not optimised for spliced read alignment, but SubJunc was not much better. STAR showed the lowest incorrect mapping rates across all read lengths.

The next step was to incorporate mutations into the sequences, which I did with msbar utility, which is part of the EMBOSS suite.

Fig1. 50 bp reads with single nucleotide changes (upper panels), deletions (centre panels) and insertions (lower panels). Left hand side shows correctly mapped reads and Right hand side panels show the rate of incorrect read mapping at mapQ=10.
Fig2. 100 bp reads with single nucleotide changes (upper panels), deletions (centre panels) and insertions (lower panels). Left hand side shows correctly mapped reads and Right hand side panels show the rate of incorrect read mapping at mapQ=10.
These data show that when single nucleotide changes are incorporated, STAR aligner clearly performs better in identifying the most correct alignments while limiting incorrect read alignments to low levels. STAR, SubRead and SubJunc show similar proportions of correctly mapped reads with increasing insertions, however SubRead and SubJunc showed the highest proportion of incorrectly mapped reads.

Bottom line

STAR showed the best correct mapping rate and lowest error rate with exact reads and reads containing single base mismatches. STAR also performed robustly when dealing with indel-containing reads. Thus I would recommend STAR for all 50-200 bp single end RNA-seq analyses. SubJunc and SubRead showed relatively high incorrect assignment rates, and I would suggest using more strict mapQ thresholds for these aligners if you need to use them.

RNA-seq Read simulator script

#!/bin/bash
THISSCRIPT=Simrd4.sh
CDNA=Arabidopsis_thaliana.TAIR10.23.cdna.all.fa
RDLEN=200
STEP=5
OUTFILE=simreads.fasta
EXE=exe.sh
EXPL=explode

rm -rf $EXPL 2>/dev/null
mkdir $EXPL
cd $EXPL
grep -A20 '^##' ../$THISSCRIPT | sed 1d | sed "s/HOLDPLACE/${RDLEN}/" > $EXE
chmod +x $EXE

perl ../unwrap_fasta.pl ../$CDNA - | cut -d ' ' -f1,4 | tr ' ' ':' \
| tr -d '\()/"' | paste - - | tr '\t' '!' > tmp1
split --additional-suffix=.split -l 1000 tmp1 tmp.
ls tmp*.split > tmp2
parallel -j 8 ./$EXE < tmp2 > ../$OUTFILE
cd ..
#rm -rf $EXPL
exit
############
#!/bin/bash
RDLEN=HOLDPLACE
STEP=5
for CDNA in `cat $1`
do
NAME=`echo $CDNA | cut -d '!' -f1`
for OFFSET in `seq 1 $STEP $RDLEN`
do
#echo $CDNA $OFFSET
echo $CDNA | cut -d '!' -f2 \
| cut -c${OFFSET}- \
| sed "s/\(.\{$RDLEN\}\)/\1\n${NAME}\t/g"
done
done | awk -v L=$RDLEN 'length($2)==L {print $0}' | nl -n ln \
| awk '{print $2"_"$1"\t"$3}'| tr '\t' '\n'

Mutate sequences

#!/bin/bash
for FQ in simreads_50bp simreads_100bp simreads_200bp
do
for COUNT in 1 2 4 8 16
do
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}snp.fasta.gz &
done; wait

pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}ins.fasta.gz &

pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 3 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}del.fasta.gz &
done ; wait
done
wait

Align reads to genome

See previous post

Determine corrrectly mapped reads

#!/bin/bash
GTF=/path/to/arabidopsis.gtf
featureCounts -R -Q 20 -s 0 -T 8 -a $GTF -o simreads.fcount.txt *.bam

for i in *featureCounts
do
CNT=`sed 's/./\t/' $i | grep -w Assigned | awk '$1==$4 | wc -l`
echo $i $CNT
done

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?