Sambamba vs Samtools
Samtools has been one of the main tools for reading and writing aligned NGS data since the SAM alignment format was first proposed. With the amount of NGS data being generated increasing at a staggering rate, informatics bottlenecks are beginning to appear, and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor called Sambamba, so I'll do a few quick tests to see how it stacks up against Samtools on our 32-core server with 128 GB RAM (standard HDD).
We have a 1.8 GB BAM file to work with.
$ du -sh *
1.8G Sequence.bam
Now convert to unsorted sam format.
$ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam
The SAM file is 9.9 GB. Let's try single-threaded SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test
real 18m52.374s
user 18m30.619s
sys 1m5.952s
Now multithreaded with Samtools
$ time samtools view -Shb -@16 Sequence_shuf.sam \
| samtools sort -@16 - Sequence_samtools_prl.test
real 6m24.779s
user 20m36.528s
sys 2m31.423s
Now Sambamba, which is natively parallel
$ time sambamba view -S -f bam Sequence_shuf.sam \
| sambamba sort -o Sequence_sambamba.bam /dev/stdin
real 7m9.870s
user 17m42.158s
sys 2m9.758s
Now to index the BAM files
$ time samtools index Sequence_samtools.bam
real 0m37.755s
user 0m36.321s
sys 0m1.351s
$ time sambamba index Sequence_samtools.bam
real 0m15.180s
user 0m48.981s
sys 0m6.354s
When converting SAM to sorted BAM, Sambamba (7m10s) was quicker than single-threaded Samtools (18m52s) but not quicker than Samtools with 16 threads (6m25s) in this test. What I did notice was that multithreaded Samtools used a lot of RAM, so Sambamba may be useful where RAM is the limiting factor. Disk I/O is potentially a limiting resource here, so it would be interesting to test again with an SSD. As advertised, Sambamba was much quicker at indexing BAM files (15s versus 38s). Sambamba also offers a few interesting new features compared to Samtools and is also CRAM-file compatible.
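To put numbers on the comparison, here is a rough sketch that turns the wall-clock ("real") times reported above into speedup ratios. The helper names `to_seconds` and `speedup` are made up for this post and are not part of Samtools or Sambamba; the only inputs are the timings already shown.

```shell
# Convert a `time`-style duration such as 18m52.374s into seconds.
# (Hypothetical helper; assumes the XmY.YYYs format printed by bash's `time`.)
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Print how many times faster the second run was than the first,
# based on wall-clock time only.
speedup() {
    awk -v base="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f", base / new }'
}

# "real" times copied from the runs in this post.
echo "Samtools, 16 threads vs 1 thread:  $(speedup 18m52.374s 6m24.779s)x"
echo "Sambamba vs Samtools, 1 thread:    $(speedup 18m52.374s 7m9.870s)x"
echo "Sambamba index vs Samtools index:  $(speedup 0m37.755s 0m15.180s)x"
```

This works out to roughly 2.9x, 2.6x and 2.5x respectively. These are single runs on one machine, so treat the ratios as indicative rather than as a proper benchmark.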
Have you used Sambamba? Did it speed up your work?