Posts

Showing posts with the label Alignment

Accuracy, speed and error tolerance of short DNA sequence aligners

Image
Over the past few years at GenomeSpot, I've evaluated alignment software accuracy for DNA-seq, RNA-seq and small RNA-seq. After discussions with colleagues and followers, I thought it was time to develop aspects of this work to a point where it would be publishable in journals. Over the past year I've put together a paper that comprehensively evaluated accuracy of small RNA aligners. After three review and revision cycles I'm happy to say it has been accepted for publication in RNA. It will likely appear in the August issue.

Secondly, I've extended upon work from a previous post where I evaluated the accuracy of several DNA-seq mappers with error containing reads, and did further work like: varying the read length from 50 nt to 480 ntperforming parallel analysis with Arabidopsis and humanusing simulators to generate Illumina and Ion Torrent read setstested the speed (throughput) of aligners with Illumina reads This work was just made public today on bioaRxiv. Have a rea…

Genome methylation analysis with Bismark

Image
Bismark is currently the de facto standard for primary analysis of high throughput bisulfite sequencing data. Bismark can align the reads to the genome and perform methylation calling. In this post, I'll go through Illumina whole genome bisulfite sequence (WGBS) alignment and methylation calling using Bismark. First I want to mention that this post is just a summary, not meant to be a user manual or thorough troubleshooting guide. Fortunately, Bismark has some of the best documentation for any bioinformatics suite and is mandatory reading. The Bismark crew are very proactive with responding to user queries on various forums as well.

First step in getting Bismark to work is to index the genome, in this case with Bowtie2:

bismark_genome_preparation --bowtie2 /pathto/refgenome/

Conventionally, multiplexed libraries will be sequenced over a number of lanes. Resist concatenating or merging the smaller fastq files for each patient/sample until after the alignment, as the concatenated fil…

Mapping NGS data: which genome version to use?

Image
Aligning reads to the genome is a key step in nearly all NGS data pipelines, the quality of an alignment will dictate the quality of the final results. So for beginners in this space, the options available can be a bit overwhelming.

Which options are available? Depending on what species you are working on, you will have either a limited number of choices or a vast number of choices. These include NCBI, Ensembl, UCSC as well as the consortia that generate these genome builds, such as the Human Genome Reference Consortium for human and TAIR for Arabidopsis. My recommendation at this point is Ensembl, for a number of reasons: It is clear to see what genome build and version just from the file names. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for EnsemblFrom the Ensembl file name you can tell whether its masked, and whether its "primary assembly" or "toplevel".The website is intuitive, ftp downloads are fast…

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

Image
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…

microRNA aligners compared

Image
Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing (adenine-to-inosine conversion, non-templated base addition) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome. 
I downloaded the Arabidopsis genome from Ensembl plant and the latest miRbase release version 21. I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10. 
Firstly note that there were no alignment with either …

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

Image
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

T…

RNA-seq aligner accuracy tested with simulated reads

Image
In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTC…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

Image
In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

The first thing…