HISAT vs STAR vs TopHat2 vs Olego vs SubJunc


HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

Table 1. Alignment of simulated 100 bp cDNA sequences to the Arabidopsis genome.
HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

Figure 1. Accuracy of mapping mutated 100bp reads. Left hand side graphs show the correct mapping rates and right hand side shows incorrect mapping rates. Top panels show effect of single nucleotide changes, middle panels show effect of single base insertions and lower panels show single base deletions.
A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorrectly assigned reads, but are quite slow. HISAT will still be very useful when speed and memory footprint are a concern.

Table 2. Results of the mapping mutated 100bp reads to the Arabidopsis genome (data shown in Fig1)


==> test_hisat.sh <==

#!/bin/bash
X=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa
for FQZ in *fasta.gz ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dk $FQZ
hisat -p 8 -f -x $X -U $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_hisat.sort
rm $FQ
done

==> test_olego.sh <==
#!/bin/bash
IDX=Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQ in *.fasta.gz
do
pigz -dc $FQ \
|olego -t 8 $IDX $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_olego.sort
done


==> test_STAR.sh <==
IDX=arabidopsis/
for FQZ in *.fasta.gz
do
pigz -fdk $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $IDX --runThreadN 8 \
mv Aligned.out.sam ${FQ}.STAR.sam
(SAM=${FQ}.STAR.sam
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort
rm $SAM )&
rm $FQ
done
wait

==> test_subjunc.sh <==
#!/bin/bash
IDX=Arabidopsis_thaliana.TAIR10.23.dna_sm.genome
for FQZ in *.fasta.gz
do
pigz -fdk $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
subjunc -T 8 -i $IDX \
-r $FQ -o ${FQ}.subjunc.sam
(SAM=${FQ}.subjunc.sam
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort
rm $FQ ${FQ}.subjunc.sam )&
done
wait

==> test_tophat.sh <==
#!/bin/bash
IDX=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt
for FQZ in *fasta.gz ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dk $FQZ
tophat2 -p 8 -o ${FQ}.tophat $IDX $FQ
rm $FQ
done

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths