RNA-seq aligner accuracy tested with simulated reads
In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.
I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like
As you can see in the example above, the name of the transcript (AT2G01210.1) and gene (AT2G01210) is provided in the fasta heading to allow the identification of correctly and incorrectly mapped reads. The first step was to generate read sets with a reads length of 50, 100 and 200 bp in length, align and count the mapped reads with featureCounts using a mapQ threshold of 10.
The results show that STAR produced the highest correct mapping rates at 50 and 100 bp; while at 200bp, SubJunc showed slightly better correct mapping rates. SubRead showed the higher incorrect mapping rate potentially as it is not optimised for spliced read alignment, but SubJunc was not much better. STAR showed the lowest incorrect mapping rates across all read lengths.
The next step was to incorporate mutations into the sequences, which I did with msbar utility, which is part of the EMBOSS suite.
THISSCRIPT=Simrd4.sh
CDNA=Arabidopsis_thaliana.TAIR10.23.cdna.all.fa
RDLEN=200
STEP=5
OUTFILE=simreads.fasta
EXE=exe.sh
EXPL=explode
rm -rf $EXPL 2>/dev/null
mkdir $EXPL
cd $EXPL
grep -A20 '^##' ../$THISSCRIPT | sed 1d | sed "s/HOLDPLACE/${RDLEN}/" > $EXE
chmod +x $EXE
perl ../unwrap_fasta.pl ../$CDNA - | cut -d ' ' -f1,4 | tr ' ' ':' \
| tr -d '\()/"' | paste - - | tr '\t' '!' > tmp1
split --additional-suffix=.split -l 1000 tmp1 tmp.
ls tmp*.split > tmp2
parallel -j 8 ./$EXE < tmp2 > ../$OUTFILE
cd ..
#rm -rf $EXPL
exit
############
#!/bin/bash
RDLEN=HOLDPLACE
STEP=5
for CDNA in `cat $1`
do
NAME=`echo $CDNA | cut -d '!' -f1`
for OFFSET in `seq 1 $STEP $RDLEN`
do
#echo $CDNA $OFFSET
echo $CDNA | cut -d '!' -f2 \
| cut -c${OFFSET}- \
| sed "s/\(.\{$RDLEN\}\)/\1\n${NAME}\t/g"
done
done | awk -v L=$RDLEN 'length($2)==L {print $0}' | nl -n ln \
| awk '{print $2"_"$1"\t"$3}'| tr '\t' '\n'
I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like
>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTCGAGCTTATTATTGGTCTGTCGAT
As you can see in the example above, the name of the transcript (AT2G01210.1) and gene (AT2G01210) is provided in the fasta heading to allow the identification of correctly and incorrectly mapped reads. The first step was to generate read sets with a reads length of 50, 100 and 200 bp in length, align and count the mapped reads with featureCounts using a mapQ threshold of 10.
The next step was to incorporate mutations into the sequences, which I did with msbar utility, which is part of the EMBOSS suite.
Bottom line
STAR showed the best correct mapping rate and lowest error rate with exact reads and reads containing single base mismatches. STAR also performed robustly when dealing with indel-containing reads. Thus I would recommend STAR for all 50-200 bp single end RNA-seq analyses. SubJunc and SubRead showed relatively high incorrect assignment rates, and I would suggest using more strict mapQ thresholds for these aligners if you need to use them.
RNA-seq Read simulator script
#!/bin/bashTHISSCRIPT=Simrd4.sh
CDNA=Arabidopsis_thaliana.TAIR10.23.cdna.all.fa
RDLEN=200
STEP=5
OUTFILE=simreads.fasta
EXE=exe.sh
EXPL=explode
rm -rf $EXPL 2>/dev/null
mkdir $EXPL
cd $EXPL
grep -A20 '^##' ../$THISSCRIPT | sed 1d | sed "s/HOLDPLACE/${RDLEN}/" > $EXE
chmod +x $EXE
perl ../unwrap_fasta.pl ../$CDNA - | cut -d ' ' -f1,4 | tr ' ' ':' \
| tr -d '\()/"' | paste - - | tr '\t' '!' > tmp1
split --additional-suffix=.split -l 1000 tmp1 tmp.
ls tmp*.split > tmp2
parallel -j 8 ./$EXE < tmp2 > ../$OUTFILE
cd ..
#rm -rf $EXPL
exit
############
#!/bin/bash
RDLEN=HOLDPLACE
STEP=5
for CDNA in `cat $1`
do
NAME=`echo $CDNA | cut -d '!' -f1`
for OFFSET in `seq 1 $STEP $RDLEN`
do
#echo $CDNA $OFFSET
echo $CDNA | cut -d '!' -f2 \
| cut -c${OFFSET}- \
| sed "s/\(.\{$RDLEN\}\)/\1\n${NAME}\t/g"
done
done | awk -v L=$RDLEN 'length($2)==L {print $0}' | nl -n ln \
| awk '{print $2"_"$1"\t"$3}'| tr '\t' '\n'
Mutate sequences
#!/bin/bash
for FQ in simreads_50bp simreads_100bp simreads_200bp
do
for COUNT in 1 2 4 8 16
do
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}snp.fasta.gz &
done; wait
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}ins.fasta.gz &
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 3 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}del.fasta.gz &
done ; wait
done
wait
for FQ in simreads_50bp simreads_100bp simreads_200bp
do
for COUNT in 1 2 4 8 16
do
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}snp.fasta.gz &
done; wait
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}ins.fasta.gz &
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 3 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}del.fasta.gz &
done ; wait
done
wait
Align reads to genome
See previous post
GTF=/path/to/arabidopsis.gtf
featureCounts -R -Q 20 -s 0 -T 8 -a $GTF -o simreads.fcount.txt *.bam
for i in *featureCounts
do
CNT=`sed 's/./\t/' $i | grep -w Assigned | awk '$1==$4 | wc -l`
echo $i $CNT
done
Determine corrrectly mapped reads
#!/bin/bashGTF=/path/to/arabidopsis.gtf
featureCounts -R -Q 20 -s 0 -T 8 -a $GTF -o simreads.fcount.txt *.bam
for i in *featureCounts
do
CNT=`sed 's/./\t/' $i | grep -w Assigned | awk '$1==$4 | wc -l`
echo $i $CNT
done