Weighing the benefits of RNA-seq error correction

Sequencing data contains two major types of errors, ones that are incorporated during library preparation and ones incorporated during sequence reading. While errors of the former are difficult to correct as they occur without clues from the base quality scores, the latter type do correlate with lower base quality scores and so it is possible to identify these ambiguous base calls and compare them to a library of high confidence k-mers from the same sequencing run. Error correction of this type has been show to improve de novo assemblies and SNP detection.

A recent report posted in bioRxiv shows error correction gives a drastic improvement in transcriptome assembly, with the author benchmarking the performance of five different correction software packages (Bless, Lighter, SGA, SEECER & BFC). The author reports that these software options vary in the aggressiveness of error correction, with SGA being the most aggressive, Lighter the least aggressive and BFC and SEECER performing well over a range of parameters and read depths. Marcel Shulz (developer of SEECER) has put together a great slideshow to provide an overview of error correction for RNA-seq.

So the question remains whether error correction is worthwhile for standard RNA-seq expression profiling? Ie, cases where there is a good reference genome and the end-point is up- or down-regulation of genes and pathways.

The potential of error correction in RNA-seq is a difficult topic to examine, firstly because the normal procedure of using simulated reads in these type of analyses doesn't produce the same error profile of a real sequencing run. One way to approach the problem is to use the proportion of concordant and discordant paired-end read mapping as a measure of "correctness". Theoretically, if a read contains many errors, it may not map to its true gene of origin because it maps to another location or doesn't map at all.

In this evaluation, I'm using BFC (Pubmed). The selection is arbitrary, but the author (Heng Li) has a good reputation for producing accurate, robust and efficient software. In order to test the effectiveness of BFC error correction versus the standard procedure (quality trimming prior to mapping), I took a paired-end Illumina HiSeq2500 mRNA-seq dataset and performed different types of processing. The dataset I selected (GSE63887) that was recently published in Cell Reports and is relevant to my current field of work.

These different processing procedures reflect the different ways BFC can be run (from the manpage):

  • Simplest case, correct each dataset in isolation (bfc-i)
bfc reads_1.fq > corrected_reads_1.fq
bfc reads_2.fq > corrected_reads_2.fq
  • Next case, correct each pairs of reads (bfc-p)
cat reads_1.fq reads_2.fq > reads_1+2.fq
bfc reads_1+2.fq reads_1.fq > corrected_reads_1.fq
bfc reads_1+2.fq reads_2.fq > corrected_reads_2.fq
  • Last case, correct reads based on the k-mer profile of the entire experiment (bfc-e) 
cat *reads*.fq > pooled-reads.fq
bfc pooled-reads.fq reads_1.fq > corrected_reads_1.fq
bfc pooled-reads.fq reads_2.fq > corrected_reads_2.fq

I also included a control data-set that was left without error correction (no-bfc) and in addition, I performed standard base quality trimming and filtering using Skewer that I described previously prior (default settings) to any error correction procedure. Which brings the tally up to 8 different pipelines. Then I performed alignment of the reads to the human genome with STAR (single-end mode), followed by FeatureCounts read summarisation (also single end mode), with the nifty "-R" option to obtain the read summarisation result for each read. I used a custom shell script to count the read pairs that were concordant or discordant for each sample.

The quality profiles show a typical drop in base quality towards  the end of the reads, with the reverse read being lower quality compared to the forward read. There was also some differences between the samples with regard to quality profiles, with the best and worst datasets shown below (FastQC).
SRR1693847_1 best quality forward readset
SRR1693847_2 best quality reverse readset

SRR1693849_1 worst quality forward readset
SRR1693849_1 worst quality reverse readset
And here are the base quality averages for a subsample (2%) of reads using Fastx-quality-stats (Fastx-Toolkit) as well as total number of reads.
GSE63887 dataset reads and mean base quality (over full read length)
After mapping these reads with STAR in single end mode followed by featureCounts, I was able to count the concordant and discordant read pairs with a shell script. I could then compare the different data processing pipelines.
Quantification of concordant mRNA-seq paired end reads (both reads mapQ≥20) 

Quantification of discordant mRNA-seq paired end reads (both reads mapQ≥20) 

So you can see that unprocessed data yields 65.6% concordant pairs which increases marginally to 66.1% in the BFC corrected data. In contrast, using Skewer trimmed data yielded 67.7% concordance. Using BFC correction after Skewer trimming gave 67.8% concordant mapping. The proportion of discordant pairs was highest in the unprocessed data 1.15% and was lowest in the data that had undergone Skewer trimming prior to BFC correction.

Then I looked at the mismatch rate, to see whether error correction or quality trimming was most beneficial. I looked at the logs produced by the STAR aligner, which calculates a mismatch rate for all uniquely mapped reads found.

Mismatch rate per base in uniquely mapped reads (STAR output)

As you can see, the highest error rates are present in the unprocessed data, and the lowest error rates are found in the datasets that have undergone both quality trimming and error correction. Error correction alone had a greater impact on mismatch rate than quality trimming alone.

I checked the error profile with RSeqQC, which gives detail about mismatch types over the read length for (dataset SRR1693849_1) 

As you can see, Skewer quality trimming reduces the rate at which mismatches occur toward the end of the read. BFC lowers the overall mismatch rate, but seems to incorporate higher than expected numbers of G2A, T2C, A2G and C2T mismatches.

So to see whether the hit rate of calling variant in RNA-seq is improved by quality trimming or error correction, I extracted all the sites that have at least three non-consensus mismatch base-calls (using samtools mpileup) and checked these positions against dbSNP (common_all_20150603.vcf.gz). I used the SRR1693849_1 dataset and only looked at chromosome 1. 

We see that both bfc correction and skewer quality trimming have a positive impact on identification of common variants present in dbSNP.

Bottom line

The increase of alignment performance after BFC error correction was only minimal compared to standard quality trimming procedures. Thus the benefit of error correction for differential expression analysis would thus be minimal, considering how computationally intensive it is. On the other hand, error correction does appear to improve mismatch error rates in uniquely aligned reads, which could be really handy for de novo transcriptome assembly and calling variants from RNA-seq data and allele-specific gene expression analysis.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc