RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2
In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).
One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.
The first thing to notice from the table is that the BAM files do not contain the same number of lines. HPG, OLego, SubJunc and SubRead contain all the QC passed reads which is the normal procedure. STAR and BWA MEM on the other report more entries in the bam file than starting sequences. It would appear that BWA MEM outputs multiple alignment records if two parts of the query sequence align at different parts of the genome. This is potentially bad because it effectively double counts 15% of the reads which could distort the results. STAR on the other hand seems to output only mapped reads. STAR also outputs reads that align to >1 location (4.5% of reads) but assigns a sensible low map quality score to those reads.
HPG aligner showed the highest proportion of reads with map quality scores ≥10 (98.7%), but SubJunc had the greatest proportion of assigned reads (95.3%). Thus SubJunc gave the highest amount of usable data, albeit only slightly better than STAR and HPG.
Next, I loaded the data into R, scaled and log transformed it and generated an MDS plot.
Also, I generated a pairs plot to visualise the data correlation.
Both the MDS and pairs plot shows highly similar results for HPG, OLego, STAR, SubJunc and SubRead.
Next, I want to identify how many genes are significantly differently represented by the different aligners in one sample (SRR634969). To do this, I ran DESeq in no replicate mode. Genes with differential representation FDR≤0.05 are shown in the graph below.
So we find that there is a difference in the gene expression values depending on the choice of aligners (for a few genes at least). But does the choice of aligner impact the differentially expressed genes (DEGs) considering all samples in the experiment? I did just that, using default DESeq settings (manual section 2 and 3.1). Here is the number of DEGs for the different aligners:
BWA MEM 381
HPG 362
OLego 351
STAR 358
SubJunc 363
SubRead 365
As you can see, BWA MEM has the most, but clearly the double-counting issue likely means that there are a number of false positives there. Generally, the number of DEGs tracks the number of assigned reads pretty well; with HPG, STAR, SubJunc and SubRead showing more DEGs than OLego. I analysed the overlap between these differentially expressed gene sets using Venn SuperSelector at Bio-Analytic Resource.
The result shows that all the aligners suited to RNA-seq data agree, showing an overlap ratio of >95%. BWA MEM, despite not designed for RNA-seq, was still able to identify 94.8% of DGEs. So while SubJunc/SubRead gives slightly better mapping, this has only a minor effect on the number of DEGs identified.
One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.
Read mapping statistics for SRR634969 with different aligners. |
HPG aligner showed the highest proportion of reads with map quality scores ≥10 (98.7%), but SubJunc had the greatest proportion of assigned reads (95.3%). Thus SubJunc gave the highest amount of usable data, albeit only slightly better than STAR and HPG.
Next, I loaded the data into R, scaled and log transformed it and generated an MDS plot.
MDS plot for SRR634969 when using different aligners. Data is scaled and log transformed. |
Pairwise comparison of transcript representation computed by different alignment software. |
Both the MDS and pairs plot shows highly similar results for HPG, OLego, STAR, SubJunc and SubRead.
Next, I want to identify how many genes are significantly differently represented by the different aligners in one sample (SRR634969). To do this, I ran DESeq in no replicate mode. Genes with differential representation FDR≤0.05 are shown in the graph below.
So we find that there is a difference in the gene expression values depending on the choice of aligners (for a few genes at least). But does the choice of aligner impact the differentially expressed genes (DEGs) considering all samples in the experiment? I did just that, using default DESeq settings (manual section 2 and 3.1). Here is the number of DEGs for the different aligners:
BWA MEM 381
HPG 362
OLego 351
STAR 358
SubJunc 363
SubRead 365
As you can see, BWA MEM has the most, but clearly the double-counting issue likely means that there are a number of false positives there. Generally, the number of DEGs tracks the number of assigned reads pretty well; with HPG, STAR, SubJunc and SubRead showing more DEGs than OLego. I analysed the overlap between these differentially expressed gene sets using Venn SuperSelector at Bio-Analytic Resource.
Overlap of DGE gene sets derived from alignments by different software. Overlap analysis performed by Bio-Analytic Resource Venn SuperSelector. |
Alignment script
$ cat RNA-seq-aligners.sh
#!/bin/bash
for FQZ in `ls *_1.fastq.gz `
do
FQ=`echo $FQZ | sed 's/fastq.gz/trim.fastq/'`
pigz -dc $FQZ | fastq_quality_trimmer -t 20 -l 25 -Q33 > $FQ &
done
wait
for FQ in *trim.fastq
do
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir star/ \
--runThreadN 8 \
--sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf
wait
mv Aligned.out.sam ${FQ}.STAR.sam
done
for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
for FQ in *trim.fastq
do
hpg-aligner rna -f $FQ \
-i hpgaligner-arabidopsis/ \
--report-n-hits=1 --report-n-best=1 \
-o ./${FQ}_hpg/ --cpu-threads=8 --log-level=10
done
for BAM in *_hpgalignments.bam
do
samtools sort $BAM ${BAM}.sort &
done
wait
rm *_hpgalignments.bam
for FQ in *trim.fastq
do
subread-align -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subread.sam
done
for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
for FQ in *trim.fastq
do
subjunc -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subjunc.sam
done
for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
for FQ in *trim.fastq
do
olego -t 8 \
Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.olego.sam
done
for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
for FQ in *trim.fastq
do
bwa mem -t 8 Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.bwamem.sam
done
for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
rm *fastq
for BAM in *bam ; do samtools index $BAM ; done
for BAM in *bam ; do samtools flagstat $BAM > ${BAM}.stats ; done
featureCounts -Q 10 -s 0 -T 8 -a Arabidopsis_thaliana.TAIR10.23.gtf \
-o ALL_cnt.xls *bam
Pairs plot
#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
x<-read.table("SRR634969_cnt", header=T, row.names=1)
png(filename="pairs_SRR634969.png")
chart.Correlation(x,histogram = TRUE, method = c("pearson"))
dev.off()
MDS plot
x<-log2(scale(read.table("CountMatrix.xls", row.names=1, header=TRUE)))
pdf("MDSplot.pdf")
plot(cmdscale(dist(t(x))), xlab="Coordinate 1", ylab="Coordinate 2", type = "n") ; text(cmdscale(dist(t(x))), labels=colnames(x), )
dev.off()
pdf("MDSplot.pdf")
plot(cmdscale(dist(t(x))), xlab="Coordinate 1", ylab="Coordinate 2", type = "n") ; text(cmdscale(dist(t(x))), labels=colnames(x), )
dev.off()
DESeq analysis of aligners
Analysis of SRR634969 in no replicate mode:
#source("http://bioconductor.org/biocLite.R")
#biocLite("DESeq")
library( "DESeq" )
cnt<-read.table( "SRR634969_cnts_fmt_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("BM","HP","OL","ST","SJ","SR"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <-estimateDispersions(cds,method="blind",sharingMode="fit-only",fitType="local")
res <- nbinomTest( cds, "BM", "HP")
write.table(res, "BMvHP.xls" , quote = FALSE , sep = "\t", row.names = FALSE)
#repeat to compare the other aligners
# Run comparisons of aligners
#source("http://bioconductor.org/biocLite.R")
#biocLite("DESeq")
library( "DESeq" )
cnt<-read.table( "SRR634969_cnts_fmt_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("BM","HP","OL","ST","SJ","SR"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <-estimateDispersions(cds,method="blind",sharingMode="fit-only",fitType="local")
res <- nbinomTest( cds, "BM", "HP")
write.table(res, "BMvHP.xls" , quote = FALSE , sep = "\t", row.names = FALSE)
#repeat to compare the other aligners
Analysis of all 6 samples in the experiment using default DESeq.
cnt<-read.table( "bwamem_cnt.xls_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("c","c","c","t","t","t"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
res<-nbinomTest(cds,"c","t")
write.table(res, "bwamem_deseq.xls" , quote = FALSE , sep = "\t", row.names = FALSE)
#repeat to compare the other aligners