RNA-seq aligners: Subread, STAR, HPG aligner and Olego

In this post I compare the speed of Subread/Subjunc, STAR, HPG aligner and Olego. I will use real RNA-seq data from GEO accession GSE42968 and align to the Arabidopsis thaliana genome. All code used is found at the bottom of this post. The benchmarking was performed on a standard 8 core workstation with 8 GB RAM.

Index the Arabidopsis thaliana genome

The first step in any RNA-seq analysis is to prepare the genome for alignment. Subread was the quickest.
Time required to index the Arabidopsis genome (119.67 Mbp).

Align RNA-seq reads

I down sampled the sequence (SRA accesion:SRR634969) file to generate files containing 1M, 2M, 5M, 10M and 14M reads selected randomly. Fastq quality trimmer was used to trim low quality bases from the 3' end and filter low quality reads. Then I used Olego, Subread, Subjunc, STAR and HPG aligner to align the reads. (I wanted to test HPG aligner in both the BWT and SA modes but didn't have the 17 GB RAM to test SA algorithm sadly.) In all the below cases the sequence files were uncompressed fastq (101 bp reads) and output was uncompressed/unsorted SAM format. I monitored the execution time, peak memory and CPU utilisation using a benchmarking script. The below figures are averages of 3 replicates.

As anticipated,  STAR was fast but also very RAM hungry. Olego was the slowest, but required the least RAM. Memory usage of HPG aligner increased with the number of reads, suggesting that it loads in large chunks of reads into RAM before alignment.

Execution time for RNA-seq alignment for fastq files of varying sizes to the Arabidopsis genome. Read length is 101 bp. 

Peak memory usage for RNA-seq alignment to the Arabidopsis genome with fastq files of varying sizes.

CPU usage for RNA-seq alignment to the Arabidopsis genome with fastq files of varying sizes.

Performance summary

From the above execution time figures, I could calculate the time taken to load the index into memory as well as the time taken to process 1M reads. These data show that STAR takes the longest to read the index into memory, but STAR has a much faster performance once the index is loaded.


Bottom line

STAR was the stand-out performer in terms of speed, at least 3 times faster than its nearest competitor. Given that the user can specify that the index remain in memory after an alignment job means that the modest time taken to load the index into memory can be further reduced. If limited RAM prevent the use of STAR, then Subread and HPG aligners (bwt mode) offer marginal gains in terms of RNA-seq alignment speed compared to Olego. The difference between Subread and Subjunc speed was relatively small. We will assess HPG aligner in the SA mode in another post.

Code used

Download the RNA-seq data

$ cat url.txt
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634969/SRR634969.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634970/SRR634970.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634971/SRR634971.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634972/SRR634972.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634973/SRR634973.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634974/SRR634974.sra



#!/bin/bash
URLFILE=url.txt
NUM=8
for URL in `cat $URLFILE`
do axel --num-connections=$NUM $URL
done


Unpack SRA archives

#!/bin/bash
for SRA in SRR634969.sra
do
fastq-dump --gzip --split-3 $SRA
done

Check quality scores 

pigz -dc SRR634969_1.fastq.gz \
| fastx_quality_stats > SRR634969_1_quals.txt

Perform quality trimming and subsample

pigz -dc ../SRR634969_1.fastq.gz \
| fastq_quality_trimmer -t 20 -l 25 -Q33 \
| paste - - - - | shuf \
| pigz > SRR634969_1.tab.gz

for AMT in 1 2 5 10 14
do
NAME=`echo SRR634969_1.tab.gz | sed "s/tab.gz/${AMT}.fastq.gz/"`
pigz -dc SRR634969_1.tab.gz \
| head -${AMT}000000 \
| tr '\t' '\n' > $NAME
done

Download the Arabidopsis thaliana genome and annotation:

wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa.gz

wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.23.gtf.gz

Index the Arabidopsis thaliana genome:

Olego:
olegoindex -a bwtsw Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa

Subread:
subread-buildindex -o Arabidopsis_thaliana.TAIR10.23.dna_sm.genome Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa

STAR:
STAR --runMode genomeGenerate \
--genomeDir star/ \
--genomeFastaFiles star/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
--runThreadN 8

HPG aligner*:
hpg-aligner build-bwt-index \
-g Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa \
-i hpgaligner-arabidopsis/ -r 8

Run alignments

Olego:
olego -t 8 /olegoindex/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa SRR634969_1.1.fastq > SRR634969_1.1.fastq.olego.sam

Subread:
subread-align -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam

Subjunc:
subjunc -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam

STAR:
STAR --genomeDir star/ --readFilesIn SRR634969_1.1.fastq --runThreadN 8 --sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf

HPG-aligner:
hpg-aligner rna -f SRR634969_1.1.fastq -i hpgaligner-arabidopsis/ 
--report-n-best 1 -o ./SRR634969_1.1.hpg/ --cpu-threads=8

*Notes

HPG aligner needs dev versions of curl and gsl
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev
sudo apt-get install libgsl0-dev
HPG does not appear to tolerate wobble bases, this took a long time to figure out. use linux "tr" or other tool to replace wobble bases (ie: DKMRSWY) with N's.

Versions

subread-align and subjunc version 1.4.5-p1
OLego: version 1.1.5
STAR_VERSION "STAR_2.4.0d"
HPG-aligner v2.0 (2014)

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths