Posts

Showing posts with the label TopHat

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

Image
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…

Data analysis step 4: Count aligned reads and create count matrix

Image
In this RNA-seq analysis series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded, done QC and aligned the reads. In this post, we will count the reads aligning to exons and generate a count matrix for statistical analysis.

My overall strategy is to use bedtools to count reads over exons. To do that, I'll need a bed file containing all exons. I could use UCSC genes, but I'd rather use the latest Ensembl version.

Create exon bed file  To begin, we need a list of known exons. I will use the latest Ensembl human gene annotation gtf file from their ftp server. Then extract the exon entries with grep and then rearrange the gene accession and symbol information into a temporary bed file. The temporary bed file is exploded on gene name and then overlapping exons are merged with bedtools.

#!/bin/bash
wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr &#…

Data analysis step 3: Align paired end RNA-seq with Tophat

Image
In this series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded and inspected sequence quality. In this post, we will align the paired end data to the human genome with Tophat.

Part 1 is to get a suitable reference genome sequence. I chose download the most recent human genome sequence from the Ensembl ftp site (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa).

axel --num-connections=6 ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz


Once downloaded, the genome needs to be uncompressed and indexed with bowtie2. This will take a few hours.

REF=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
gunzip ${REF}.gz
bowtie2-build $REF $REF

Next step is to use Tophat to align the reads in paired-end mode. Tophat is smart enough to recognise the bz2 compression. The script also indexes the bam files and generates some familiar samtools statistics.

#!/bin/bash

REF=/path/to/genomes/Homo_sapiens.GRCh38.d…

Align RNA-seq reads to genome with TopHat

Image
RNA-seq data analysis poses some interesting bioinformatic challenges. One of the most important considerations is the type of reference sequence to be used. If there is a whole genome sequence (WGS), then the WGS is the most appropriate reference, even if it is not annotated entirely. If there is no WGS, then you may need to use a curated cDNA library such as RefSeq, Unigene, or organism specific collection. There are a few reasons why aligning to the genome is the most correct method
MapQ scores mean something - cDNA collections like RefSeq often have splice variants entered as different entries. This means that some variants will have common exons. Reads aligning to those exons will be assigned a mapQ of zero!Discovery - many of the cDNA sequence collections are incomplete, including RefSeq. Compare the number of entries from GENCODE to RefSeq and you will see that GENCODE is far richer in terms of number of known transcripts. Moreover, if you are aligning to a genome, you can perf…