Sambamba vs Samtools

Samtools has been one of the main tools for reading and writing aligned NGS data since the SAM alignment format was first proposed. With the amount of NGS data being generated increasing at a staggering rate, informatics bottlenecks are beginning to appear, and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools gained multithreading only recently, and there is a new competitor called Sambamba, so I ran a few quick tests to see how it stacks up against Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have a 1.8 GB BAM file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert it to an unsorted SAM file: save the header, shuffle the alignment records, then put the header back on top.
$ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam
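
The header-plus-shuffle pattern can be tried on a toy file first, without samtools. This is a minimal sketch under stated assumptions: the file names and the four fake records are made up for illustration, and grep stands in for `samtools view -H` by picking out the header lines, which start with `@`.

```shell
# Toy stand-in for a SAM file: two header lines and four records (made-up data).
tmp=$(mktemp -d)
printf '@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:1000\n' > "$tmp/toy.sam"
printf 'r1\t0\tchr1\t10\nr2\t0\tchr1\t20\nr3\t0\tchr1\t30\nr4\t0\tchr1\t40\n' >> "$tmp/toy.sam"

# Header lines start with '@'; everything else is an alignment record.
grep '^@' "$tmp/toy.sam" > "$tmp/header.sam"
grep -v '^@' "$tmp/toy.sam" | shuf | cat "$tmp/header.sam" - > "$tmp/toy_shuf.sam"

# The shuffled file should keep the header on top and lose no records.
head_ok=$(head -n 2 "$tmp/toy_shuf.sam" | grep -c '^@')
n_in=$(grep -vc '^@' "$tmp/toy.sam")
n_out=$(grep -vc '^@' "$tmp/toy_shuf.sam")
echo "header lines on top: $head_ok, records in: $n_in, records out: $n_out"
```

If the record counts differ, or the header is no longer on top, the shuffled file is not a valid input for the timing tests.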

The SAM file is 9.9 GB. Let's try single-threaded SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s
sys 1m5.952s

Now multithreaded with Samtools, using 16 threads for both view and sort.
$ time samtools view -Shb -@16 Sequence_shuf.sam \
| samtools sort -@16 - Sequence_samtools_prl.test

real 6m24.779s
user 20m36.528s
sys 2m31.423s

Now Sambamba, which is natively parallel.
$ time sambamba view -S -f bam Sequence_shuf.sam \
| sambamba sort -o Sequence_sambamba.bam /dev/stdin

real 7m9.870s
user 17m42.158s
sys 2m9.758s
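
To compare the three runs more easily, the "real" times can be converted to seconds and expressed as speed-ups over the single-threaded run. This is a rough sketch: `to_seconds` and `speedup` are hypothetical helper names, and the only inputs are the wall-clock times reported above.

```shell
# Convert a bash `time` value such as 18m52.374s into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two wall-clock times: how many times faster the second run was.
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

# "real" times copied from the three runs above.
single=18m52.374s
samtools_mt=6m24.779s
sambamba=7m9.870s

echo "samtools -@16 vs single-threaded: $(speedup "$single" "$samtools_mt")x"
echo "sambamba vs single-threaded:      $(speedup "$single" "$sambamba")x"
```

On these numbers, multithreaded Samtools comes out at about 2.9x and Sambamba at about 2.6x faster than single-threaded Samtools.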

Now to index the same BAM file with each tool.
$ time samtools index Sequence_samtools.bam

real 0m37.755s
user 0m36.321s
sys 0m1.351s

$ time sambamba index Sequence_samtools.bam

real 0m15.180s
user 0m48.981s
sys 0m6.354s

When converting SAM to sorted BAM, Sambamba was quicker than single-threaded Samtools, but not quicker than multithreaded Samtools in this test. I did notice that multithreaded Samtools used a lot of RAM, probably because samtools sort allocates its -m memory buffer per thread, so Sambamba may be useful where RAM is limiting. Disk I/O looks like a potential bottleneck, so it would be interesting to repeat the test on an SSD. As advertised, Sambamba was much quicker at indexing BAM files (15 s versus 38 s). Sambamba also offers a few interesting new features compared to Samtools and is CRAM-compatible.
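
One rough way to look at the I/O question is to divide CPU time (user + sys) by wall-clock time, which gives the average number of cores kept busy. This is a back-of-the-envelope sketch using only the timings reported above; `to_seconds` and `busy_cores` are hypothetical helper names. A low figure is consistent with waiting on disk, but it could equally reflect serial stages in the pipeline.

```shell
# Convert a bash `time` value such as 18m52.374s into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Average number of busy cores: (user + sys) / real.
# Arguments: real, user, sys (in that order).
busy_cores() {
  awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v s="$(to_seconds "$3")" \
    'BEGIN { printf "%.1f\n", (u + s) / r }'
}

echo "samtools, 1 thread: $(busy_cores 18m52.374s 18m30.619s 1m5.952s)"
echo "samtools, -@16:     $(busy_cores 6m24.779s 20m36.528s 2m31.423s)"
echo "sambamba:           $(busy_cores 7m9.870s 17m42.158s 2m9.758s)"
```

Even with 16 threads requested, Samtools averaged fewer than four busy cores on this 32-core machine, and Sambamba fewer than three.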

Have you used Sambamba? Did it speed up your work?
