Read counting with featureCounts, BedTools and HTSeq

Counting the number of reads that align to certain genomic features is a key element of many next gen sequencing analysis pipelines. For RNA-seq, this is commonly used to count reads aligning to exons, while for ChIP-seq this is used to count reads over a promoter or other region of interest. There are several available tools for performing this task and in this post I will compare the three of the most commonly used:
I took one of the bam files from the recent RNA-seq series of posts and subsampled it using samtools and shuf into file sizes of 1M, 2M, 5M, 10M reads, as well as the bam file containing 25.4M reads.

I  then used the benchmarking script described in the previous post to record execution time, CPU usage and peak memory for read counting to generate a gene-wise matrix. I used featureCounts in the single thread mode as well as the parallel mode (maximum of 8 cores).
Execution time for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
Peak memory usage for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
CPU utilisation for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
The results show that featureCounts is about 10 times faster than BedTools Multicov and about 18 times faster than HTSeq-count when using a single thread, and when allowing parallel processing, this became 20 times and 37 times respectively.

Then I broke down the overall times into that invested in (1) reading in the annotation file and (2) processing the reads.
Time taken for featureCounts, BedTools Multicov and HTSeq to read in the annotation feature (upper panel) file and process 1 million reads (lower panel).

This result shows that featureCounts is considerably faster at both tasks than its competitors.


Exact commands used:
htseq-count -q -f bam --stranded=no --minaqual=10 --type=gene --mode=union test.bam Homo_sapiens.GRCh38.76.gtf > test.bam.htseq.count
bedtools multicov -D -q 10 -bams test.bam -bed Homo_sapiens.GRCh38_exon.bed > test.bam.bedtools.cnt
featureCounts -Q 10 -M -s 0 -T 1 -a Homo_sapiens.GRCh38.76.gtf -o test.bam.featureCount.cnt test.bam
featureCounts -Q 10 -M -s 0 -T 8 -a Homo_sapiens.GRCh38.76.gtf -o test.bam.featureCount.cnt test.bam

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?