Analyzing repeat rich plant smRNA-seq data with ShortStack

Small RNA expression is difficult to analyse. They're small molecules anywhere from 18 -25 nt for miRNAs, they occur as identical or near identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ10).
Table 1. Mapping of rice small RNA reads to the reference genome with BWA aln.
So as you can see, the proportion of reads that are assigned and form the count matrix contains less than 20% of the original reads. Clearly this approach will only capture about a fifth of whats going on.

There is an alternative to help with multimapped smRNA reads. The ShortStack pipeline uses uniquely placed reads in the vicinity of multimapped reads to estimate the number of tags that are originate from different multimapped regions.
Table 2. ShortStack pipeline resolves multimapping smRNA reads with local unique alignment information.
As you can see in the lower table, the local mapping information has helped to place 43.8% of reads that were previously ignored due to low mapQ. ShortStack also has an in-built miRNA gene identification tool that uses RNA folding info and read mapping positions to infer genes (Figure 1).
Figure 1. Example of a miRNA gene identified from rice smRNA-seq data with ShortStack.
From the 309,946 rice contigs identified, 94 were identified as small RNAs. Of the 94M mapped reads, MiRNA genes accounted for just 339,4147 reads (3.61%). If the majority of reads aren't miRNA, then what are they? I took a look at the top 10 contigs by assigned reads (Figure 2). 
Figure 2. Highest expressed rice contigs in smRNA-seq data.
Just these top 10 contigs accounted for 36.6M reads (39% of mapped reads) and as you can see uniquely mapped reads represent just 39% of this figure. The counts determined by ShortStack could be used for differential expression analysis with GFold or DESeq

What I found really interesting was despite the very high expression of these contigs, their function remains more or less unexplored. There are few genes in these genomic regions that have any functional data ascribed, and there is no hint from the databases that these encode any type of non-coding RNA.

These data show that naive "uniquely" mapped read analysis is inappropriate for plants like rice that have a large proportion of multimapped smRNA reads. ShortStack was able to account for a much greater proportion of reads and was able to determine genomic regions containing highly expressed genes that likely have important but unexplored roles.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc