Posts

Showing posts with the label featureCounts

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

Image
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…

Count ChIP-seq reads across a promoter with featureCounts

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation especially for histone marks that are normally associated with gene activation such as H3K9 acetylation or H3K4 tri-methylation.

FeatureCounts has emerged as a competitor to HTSeq and BedTools MultiCov for counting reads across features (ie, exons, genes, promoters). FeatureCounts is great for RNA-seq because it can natively read GTF annotation files, but can't read BED format (that we use a lot in ChIP-seq analysis).

In order to make featureCounts work, we need to extract the TSS coordinates and convert to a BED-like format that it can read (SAF format). In the below script, I extract the positions of the TSSs using a grep search followed by stripping only the neccessary information from the GTF, then using awk and BedTools to expand the region  around the TSS by 3kbp.

#!/bin/bash
#Generate an SAF file for TSS intervals from a GTF

#Specify some parameters
GTF=Homo_sapiens.GRCh38.78.gtf

Generate an RNA-seq count matrix with featureCounts

Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov.

FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files.

Here is a script to generate a gene-wise matrix from all BAM files in a directory.
#!/bin/bash
#Generate RNA-seq matrix
#Set parameters
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
GENEMX=${EXPTNAME}_genes.mx

#Make the gene-wise matrix
featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \
| cut -f1,7- | sed 1d > $GENEMX

The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc).

Consider attaching the gene name to give the data more relevance. To do that, first make a table of Ensembl IDs and ge…

RNA-seq aligner accuracy tested with simulated reads

Image
In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTC…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

Image
In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

The first thing…

Read counting with featureCounts, BedTools and HTSeq

Image
Counting the number of reads that align to certain genomic features is a key element of many next gen sequencing analysis pipelines. For RNA-seq, this is commonly used to count reads aligning to exons, while for ChIP-seq this is used to count reads over a promoter or other region of interest. There are several available tools for performing this task and in this post I will compare the three of the most commonly used:
bedtools multicovhtseq-countfeatureCounts I took one of the bam files from the recent RNA-seq series of posts and subsampled it using samtools and shuf into file sizes of 1M, 2M, 5M, 10M reads, as well as the bam file containing 25.4M reads.

I  then used the benchmarking script described in the previous post to record execution time, CPU usage and peak memory for read counting to generate a gene-wise matrix. I used featureCounts in the single thread mode as well as the parallel mode (maximum of 8 cores).
The results show that featureCounts is about 10 times faster than Be…