Posts

Showing posts from November, 2014

microRNA aligners compared

Image
Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing ( adenine-to-inosine conversion , non-templated base addition ) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome.  I downloaded the Arabidopsis genome from  Ensembl plant  and the latest  miRbase release version 21 . I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10.  Table 1. Performance of several aligne

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

Image
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads. The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site ( link ). Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example. >1-0-50 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA >1-10-60 TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT >1-20-70 ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT