RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

Read mapping statistics for SRR634969 with different aligners.
The first thing to notice from the table is that the BAM files do not contain the same number of lines. HPG, OLego, SubJunc and SubRead contain all the QC passed reads which is the normal procedure. STAR and BWA MEM on the other report more entries in the bam file than starting sequences. It would appear that BWA MEM outputs multiple alignment records if two parts of the query sequence align at different parts of the genome. This is potentially bad because it effectively double counts 15% of the reads which could distort the results. STAR on the other hand seems to output only mapped reads. STAR also outputs reads that align to >1 location (4.5% of reads) but assigns a sensible low map quality score to those reads.

HPG aligner showed the highest proportion of reads with map quality scores ≥10 (98.7%), but SubJunc had the greatest proportion of assigned reads (95.3%). Thus SubJunc gave the highest amount of usable data, albeit only slightly better than STAR and HPG.

Next, I loaded the data into R, scaled and log transformed it and generated an MDS plot.
MDS plot for SRR634969 when using different aligners. Data is scaled and log transformed.

Also, I generated a pairs plot to visualise the data correlation.
Pairwise comparison of transcript representation computed by different alignment software.

Both the MDS and pairs plot shows highly similar results for HPG, OLego, STAR, SubJunc and SubRead.

Next, I want to identify how many genes are significantly differently represented by the different aligners in one sample (SRR634969). To do this, I ran DESeq in no replicate mode. Genes with differential representation FDR≤0.05 are shown in the graph below.

Differentially represented genes in one data set (SRR634969) depending on alignment software (FDR≤0.05). Genes in the upper diagonal are higher in the software listed at the top, while genes in the lower diagonal are higher in the software listed on the LHS. Differential analysis was done with DESeq.
The table shows that results from OLego and STAR are quite similar. SubJunc and SubRead are also similar to each other. There were some differences between SubJunc/SubRead and OLego/STAR. Here are the top 20 lines of the comparison of STAR and SubJunc.

Differentially represented genes in SRR634969 dataset aligned with STAR or SubJunc. Positive values for fold change suggests higher representation in SubJunc data. Differential analysis was done with DESeq.
I was quite surprised to see such large changes between the two aligners. SubJunc seems to be picking up expression of genes that are not detected at all by STAR. The gene ATCG01180, that encodes a chloroplast 23S rRNA, seems to be picked up best by SubJunc/SubRead followed by HPG and to a lesser extent by OLego/STAR/BWA MEM. This suggests that SubJunc is better able to distinguish between two highly similar potential origins of RNA-seq reads.

So we find that there is a difference in the gene expression values depending on the choice of aligners (for a few genes at least). But does the choice of aligner impact the differentially expressed genes (DEGs) considering all samples in the experiment? I did just that, using default DESeq settings (manual section 2 and 3.1). Here is the number of DEGs for the different aligners:

BWA MEM 381
HPG 362
OLego 351
STAR 358
SubJunc 363
SubRead 365

As you can see, BWA MEM has the most, but clearly the double-counting issue likely means that there are a number of false positives there. Generally, the number of DEGs tracks the number of assigned reads pretty well; with HPG, STAR, SubJunc and SubRead showing more DEGs than OLego. I analysed the overlap between these differentially expressed gene sets using Venn SuperSelector at Bio-Analytic Resource.
Overlap of DGE gene sets derived from alignments by different software. Overlap analysis performed by Bio-Analytic Resource Venn SuperSelector.
The result shows that all the aligners suited to RNA-seq data agree, showing an overlap ratio of >95%. BWA MEM, despite not designed for RNA-seq, was still able to identify 94.8% of DGEs. So while SubJunc/SubRead gives slightly better mapping, this has only a minor effect on the number of DEGs identified.

Alignment script

$ cat RNA-seq-aligners.sh
#!/bin/bash
for FQZ in `ls *_1.fastq.gz `
do
FQ=`echo $FQZ | sed 's/fastq.gz/trim.fastq/'`
pigz -dc $FQZ | fastq_quality_trimmer -t 20 -l 25 -Q33 > $FQ &
done
wait

for FQ in *trim.fastq
do
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir star/ \
--runThreadN 8 \
--sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf
wait
mv Aligned.out.sam ${FQ}.STAR.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
hpg-aligner rna -f $FQ \
-i hpgaligner-arabidopsis/ \
--report-n-hits=1  --report-n-best=1 \

-o ./${FQ}_hpg/ --cpu-threads=8 --log-level=10
done


for BAM in *_hpgalignments.bam
do
samtools sort $BAM ${BAM}.sort &
done
wait
rm *_hpgalignments.bam

for FQ in *trim.fastq
do
subread-align -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subread.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
subjunc -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subjunc.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
olego -t 8 \
Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.olego.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
bwa mem -t 8 Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.bwamem.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
rm *fastq

for BAM in *bam ; do samtools index $BAM ; done
for BAM in *bam ; do samtools flagstat $BAM > ${BAM}.stats ; done

featureCounts -Q 10 -s 0 -T 8 -a Arabidopsis_thaliana.TAIR10.23.gtf \

-o ALL_cnt.xls *bam

Pairs plot

#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
x<-read.table("SRR634969_cnt", header=T, row.names=1)
png(filename="pairs_SRR634969.png")
chart.Correlation(x,histogram = TRUE, method = c("pearson"))
dev.off()


MDS plot

x<-log2(scale(read.table("CountMatrix.xls", row.names=1, header=TRUE)))
pdf("MDSplot.pdf") 
plot(cmdscale(dist(t(x))), xlab="Coordinate 1", ylab="Coordinate 2", type = "n") ; text(cmdscale(dist(t(x))), labels=colnames(x), ) 
dev.off()

DESeq analysis of aligners

Analysis of SRR634969 in no replicate mode:
# Run comparisons of aligners
#source("http://bioconductor.org/biocLite.R")
#biocLite("DESeq")
library( "DESeq" )
cnt<-read.table( "SRR634969_cnts_fmt_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("BM","HP","OL","ST","SJ","SR"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <-estimateDispersions(cds,method="blind",sharingMode="fit-only",fitType="local")

res <- nbinomTest( cds, "BM", "HP")
write.table(res, "BMvHP.xls" , quote = FALSE , sep = "\t", row.names = FALSE)
#repeat to compare the other aligners

Analysis of all 6 samples in the experiment using default DESeq.

cnt<-read.table( "bwamem_cnt.xls_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("c","c","c","t","t","t"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
res<-nbinomTest(cds,"c","t")
write.table(res, "bwamem_deseq.xls" , quote = FALSE , sep = "\t", row.names = FALSE)

#repeat to compare the other aligners

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?