RNA-seq aligners: Subread, STAR, HPG aligner and Olego
In this post I compare the speed of Subread/Subjunc, STAR, HPG aligner and Olego. I will use real RNA-seq data from GEO accession GSE42968 and align to the Arabidopsis thaliana genome. All code used is found at the bottom of this post. The benchmarking was performed on a standard 8 core workstation with 8 GB RAM.
As anticipated, STAR was fast but also very RAM hungry. Olego was the slowest, but required the least RAM. Memory usage of HPG aligner increased with the number of reads, suggesting that it loads in large chunks of reads into RAM before alignment.
for SRA in SRR634969.sra
do
fastq-dump --gzip --split-3 $SRA
done
| fastx_quality_stats > SRR634969_1_quals.txt
| fastq_quality_trimmer -t 20 -l 25 -Q33 \
| paste - - - - | shuf \
| pigz > SRR634969_1.tab.gz
for AMT in 1 2 5 10 14
do
NAME=`echo SRR634969_1.tab.gz | sed "s/tab.gz/${AMT}.fastq.gz/"`
pigz -dc SRR634969_1.tab.gz \
| head -${AMT}000000 \
| tr '\t' '\n' > $NAME
done
wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.23.gtf.gz
olegoindex -a bwtsw Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
Subread:
subread-buildindex -o Arabidopsis_thaliana.TAIR10.23.dna_sm.genome Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
STAR:
STAR --runMode genomeGenerate \
--genomeDir star/ \
--genomeFastaFiles star/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
--runThreadN 8
HPG aligner*:
hpg-aligner build-bwt-index \
-g Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa \
-i hpgaligner-arabidopsis/ -r 8
Subread:
Subjunc:
subjunc -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam
STAR:
STAR --genomeDir star/ --readFilesIn SRR634969_1.1.fastq --runThreadN 8 --sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf
HPG-aligner:
hpg-aligner rna -f SRR634969_1.1.fastq -i hpgaligner-arabidopsis/
--report-n-best 1 -o ./SRR634969_1.1.hpg/ --cpu-threads=8
Index the Arabidopsis thaliana genome
The first step in any RNA-seq analysis is to prepare the genome for alignment. Subread was the quickest.
Time required to index the Arabidopsis genome (119.67 Mbp). |
Align RNA-seq reads
I down sampled the sequence (SRA accesion:SRR634969) file to generate files containing 1M, 2M, 5M, 10M and 14M reads selected randomly. Fastq quality trimmer was used to trim low quality bases from the 3' end and filter low quality reads. Then I used Olego, Subread, Subjunc, STAR and HPG aligner to align the reads. (I wanted to test HPG aligner in both the BWT and SA modes but didn't have the 17 GB RAM to test SA algorithm sadly.) In all the below cases the sequence files were uncompressed fastq (101 bp reads) and output was uncompressed/unsorted SAM format. I monitored the execution time, peak memory and CPU utilisation using a benchmarking script. The below figures are averages of 3 replicates.As anticipated, STAR was fast but also very RAM hungry. Olego was the slowest, but required the least RAM. Memory usage of HPG aligner increased with the number of reads, suggesting that it loads in large chunks of reads into RAM before alignment.
Execution time for RNA-seq alignment for fastq files of varying sizes to the Arabidopsis genome. Read length is 101 bp. |
Peak memory usage for RNA-seq alignment to the Arabidopsis genome with fastq files of varying sizes. |
|
Performance summary
From the above execution time figures, I could calculate the time taken to load the index into memory as well as the time taken to process 1M reads. These data show that STAR takes the longest to read the index into memory, but STAR has a much faster performance once the index is loaded.Bottom line
STAR was the stand-out performer in terms of speed, at least 3 times faster than its nearest competitor. Given that the user can specify that the index remain in memory after an alignment job means that the modest time taken to load the index into memory can be further reduced. If limited RAM prevent the use of STAR, then Subread and HPG aligners (bwt mode) offer marginal gains in terms of RNA-seq alignment speed compared to Olego. The difference between Subread and Subjunc speed was relatively small. We will assess HPG aligner in the SA mode in another post.
Code used
Download the RNA-seq data
$ cat url.txt
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634969/SRR634969.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634970/SRR634970.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634971/SRR634971.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634972/SRR634972.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634973/SRR634973.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634974/SRR634974.sra
#!/bin/bash
URLFILE=url.txt
NUM=8
for URL in `cat $URLFILE`
do axel --num-connections=$NUM $URL
done
Unpack SRA archives
#!/bin/bashfor SRA in SRR634969.sra
do
fastq-dump --gzip --split-3 $SRA
done
Check quality scores
pigz -dc SRR634969_1.fastq.gz \| fastx_quality_stats > SRR634969_1_quals.txt
Perform quality trimming and subsample
pigz -dc ../SRR634969_1.fastq.gz \| fastq_quality_trimmer -t 20 -l 25 -Q33 \
| paste - - - - | shuf \
| pigz > SRR634969_1.tab.gz
for AMT in 1 2 5 10 14
do
NAME=`echo SRR634969_1.tab.gz | sed "s/tab.gz/${AMT}.fastq.gz/"`
pigz -dc SRR634969_1.tab.gz \
| head -${AMT}000000 \
| tr '\t' '\n' > $NAME
done
Download the Arabidopsis thaliana genome and annotation:
wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa.gzwget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.23.gtf.gz
Index the Arabidopsis thaliana genome:
Olego:olegoindex -a bwtsw Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
subread-buildindex -o Arabidopsis_thaliana.TAIR10.23.dna_sm.genome Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
STAR --runMode genomeGenerate \
--genomeDir star/ \
--genomeFastaFiles star/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
--runThreadN 8
HPG aligner*:
hpg-aligner build-bwt-index \
-g Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa \
-i hpgaligner-arabidopsis/ -r 8
Run alignments
Olego:
olego -t 8 /olegoindex/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa SRR634969_1.1.fastq > SRR634969_1.1.fastq.olego.sam
subread-align -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam
subjunc -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam
STAR:
STAR --genomeDir star/ --readFilesIn SRR634969_1.1.fastq --runThreadN 8 --sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf
HPG-aligner:
hpg-aligner rna -f SRR634969_1.1.fastq -i hpgaligner-arabidopsis/
--report-n-best 1 -o ./SRR634969_1.1.hpg/ --cpu-threads=8
*Notes
HPG aligner needs dev versions of curl and gsl
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev
sudo apt-get install libgsl0-dev
HPG does not appear to tolerate wobble bases, this took a long time to figure out. use linux "tr" or other tool to replace wobble bases (ie: DKMRSWY) with N's.
HPG does not appear to tolerate wobble bases, this took a long time to figure out. use linux "tr" or other tool to replace wobble bases (ie: DKMRSWY) with N's.
Versions
subread-align and subjunc version 1.4.5-p1
OLego: version 1.1.5
STAR_VERSION "STAR_2.4.0d"
HPG-aligner v2.0 (2014)