Screen for rRNA contamination in RNA-seq data

August 06, 2015

Ribosomal RNA (rRNA) is very abundant in cells (~80% of total RNA), so it is useful to deplete rRNA when doing genomewide assays to have sufficient coverage of other genes including protein coding and non-protein coding genes.

There are two major strategies for achieving rRNA removal, being (1) poly A enrichment and (2) rRNA depletion. Poly A enrichment uses an oligo dT coupled magnetic bead to "pull-out" RNA molecules with a polyA tag, a common feature of protein coding transcripts. rRNA depletion can be achieved using kits such as Ribo-Zero (Illumina), Ribo-Minus (LifeTech) and NEBNext® rRNA Depletion Kit (NEB). These kits contain oligonucleotide probes that either hybridize and immobilise the rRNA or hybridize and degrade the unwanted rRNA.

Once you have some sequence data, you will need to check whether the rRNA depletion has worked. This is somewhat different to a genome-wide analysis I've mentioned in earlier posts because rRNA genes exist in multiple copies and reads mapped there are normally discarded in the map quality filtering step. So we need to quantify more directly. You can use Broad institutes RNA-seq-QC package, but it requires so many obscure input file formats, I'm sceptical that anyone outside of the Broad has actually gotten it to work. An easy and quick QC for rRNA contamination can be done with BWA and samtools. These can both be installed with apt-get in Ubuntu Linux.

First you need the sequences of the rRNAs, so I collected them from NCBI and post them here for your convenience (human & mouse). For this example we will use human. Start with indexing the fasta reference sequence with BWA:


bwa index human_rRNA.fa

Then run a bwa alignment on the rRNA reference. The following script only looks at the first 1 million reads in the file and uses fastq_quality_trimmer to remove low quality bases from the 3' ends.

#!/bin/bash

REF=/data/projects/mziemann/rna-seq-qc/human_rRNA.fa

for FQZ in *fastq.gz ; do

echo $FQZ

FQ=`echo $FQZ | sed 's/.gz//'`

zcat $FQZ | head -4000000 | fastq_quality_trimmer -t 30 -l 20 -Q33 \

| tee $FQ | bwa aln -t 8 $REF - | bwa samse $REF - $FQ \

| samtools view -uSh - \

| samtools sort - ${FQ}.sort

done

wait

for i in *bam ; do samtools index $i & done ; wait

for i in *bam ; do samtools flagstat $i > ${i}.stats & done ; wait

The samtools flagstats command outputs a simple report of total and mapped reads

$ head -3 *.fastq.sort.bam.stats

==> WT1-1681_C6K35ANXX_ATCACG_L005_R1.fastq.sort.bam.stats <==

998121 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

156636 + 0 mapped (15.69%:-nan%)

==> WT2-1684_C6K35ANXX_CGATGT_L005_R1.fastq.sort.bam.stats <==

997640 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

153092 + 0 mapped (15.35%:-nan%)

==> WT3-1691_C6K35ANXX_TTAGGC_L005_R1.fastq.sort.bam.stats <==

998021 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

238082 + 0 mapped (23.86%:-nan%)

So you can see that from the 1 million reads we started with, we lost a few with quality trimming and then had 15-24% of reads mapping to the rRNAs (irrespective of mapQ). You can also dig a bit deeper and see which rRNA accounts for the most reads. Column 1 is the contig name, column 2 is the contiog length and column 3 is the number of mapped reads, column 4 is the number of unmapped reads.

$ samtools idxstats WT2-1684_C6K35ANXX_CGATGT_L005_R1.fastq.sort.bam
gi|120444900|ref|NR_003279.1| 4730 144029 0
gi|374088232|ref|NR_003278.3| 1870 8997 0
gi|374093199|ref|NR_003280.2| 157 57 0
gi|374088139|ref|NR_046144.1| 121 9 0
* 0 0 844548

It looks like the larger 28S fragment accounts for the most reads, followed by 18S.

In my most recent RNA-seq experiment, we used NEB polyA enrichment for human primary cell culture from 1ug of starting material as well as NEB rRNA depletion of RNA from mouse tissue for which we could only obtain 50ng of RNA. I used the NEB Ultra Directional RNA-seq library prep kit which worked for 45/45 samples. The polyA data contained 9.1% rRNA and the rRNA depleted data contained 21.4% rRNA. Overall I'm happy with that, because the mouse study would not have been possible with the polyA method due to low starting amounts.

EDIT: If using longer reads (>70bp) or another sequencing platform such as Ion Torrent, BWA MEM will actually work better. Below is the code for running it.

#!/bin/bash
REF=/data/projects/mziemann/rna-seq-qc/human_rRNA.fa

for FQZ in *fastq.gz ; do
echo $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
zcat $FQZ | head -4000000 | fastq_quality_trimmer -t 30 -l 20 -Q33 \
| bwa mem -t 8 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
done
wait
for i in *bam ; do samtools index $i & done ; wait
for i in *bam ; do samtools flagstat $i > ${i}.stats & done ; wait

Search This Blog

Genome Spot

Screen for rRNA contamination in RNA-seq data

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts