Quantifying repeat sequences from short read next gen sequencing data

Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any sequence read that is shorter than the repeat unit is going to fail at being uniquely placed in the genome. At a given read length, genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and ignore reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? Why is that even relevant?

The "Why"

About 50% of the human genome is comprised of repetitive DNA (Ref). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers (Ref). Endogenous retroviruses have been domesticated to boost antiviral response in B cells after exposure to T cell–independent type 2 (TI-2) antigens (Ref). Repeat elements can be found embedded in fusion transcripts (Ref) or expressed alone in somatic tissues (Ref). These elements are normally repressed by DNA methylation and other chromatin marks (Ref,Ref). Quantification of these elements in MeDIP-seq, ChIP-seq, mRNA-seq could lead to discoveries in how epigenetic processes regulate repeat expression. Quantification of these elements in genome resequencing projects could reveal differences in the repeat contents of individuals.

The "How"

Traditional method for identification of repeat DNA is using RepeatMasker (RM), which utilises a BLAST-based search engine (RM-BLAST) as default. This is normally fine for a relatively small number of relatively long sequences, but is not very well suited to millions of short reads that are generated from deep sequencing (or NGS) platforms these days. So in this post, I'll compare the accuracy of RM and NGS aligner BWA for quantifying repeat sequences from short reads.

Test data set

I generated simulated reads from the entire RM repeat library using a method I used previously. In addition, I generated a bunch of simulated reads from the E. coli genome as a type of background set. In total I had 16558094 reads but this would take too long to analyse with RM so I used "shuf -n" to select only a million reads for analysis. Here is what the reads look like, you can see the header can  be used to match text to contig names in the alignment results.


Simulated reads results

Then I used some bash tools to check the mapping of reads to the correct class of repeats.

So it looks like BWA and Bowtie2 yield a greater proportion of correctly mapped reads compared to RM with the default settings. The number of incorrectly mapped reads remained relatively low for BWA and Bowtie2 given how challenging repeat DNA is to map. Then I looked in a bit more detail at the main repeat elements in human (LINE-1, Alu and LTR).

This data shows that the identification of repeat sequences was consistent between these three approaches.

Effect of DNMT inhibitor azacitidine on repeat abundance in RNA-seq data

So now I want to see how highly these repeat DNA classes are detected in RNA-seq data.The data set I chose is one that I've used before, an mRNA-seq experiment looking at the effect of the DNA methylation inhibitor 5-azacitidine (Ctrl Vs Aza). I used fastq quality trimmer to remove any unreliable reads and then aligned these reads to the human specific repeat library with BWA. I aggregated the reads based on their class, performed reads-per-million normalisation and performed some rudimentary differential analysis.
Detection of repeat sequences in RNA-seq data using BWA alignment to human-specific repeat library.

These data show that there is a trend of decreasing SVA retrotransposon and increasing MIR and TcMar Mariner abundance due to azacitidine treatment. To check this more thoroughly I will need to append this repeat-centric count matrix to the gene-wise count matrix and analyze it with edgeR or DESeq. I was really expecting a more drastic change in repeat content from azacitidine because my previous analysis found >6000 differentially expressed genes with edgeR. Specifically, I was looking for an increase in the abundance of LTR and LINE-1 elements, as that would be consistent with a loss of methylation due to azacitidine mediated DNMT inhibition. 


Despite the lack of a drastic change in repeat abundance, we show that RepeatMasker and BWA are both valid methods for detecting and quantifying repeats from reads as short as 50 bp. Given BWA and other NGS aligners are an order of magnitude faster than RepeatMasker, these should be useful for routine fast screening of repeat sequences in RNA-seq, ChIP-seq and other data.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc