Test-driving parallel compression software

Next-generation sequencing (NGS) is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place immense demands on informatics infrastructure, including disk space, memory and CPU usage. Bioinformaticians need to be aware of how the different compression tools perform in order to make the most of their hardware.

In this post, we'll test the performance of several compression tools in terms of their disk-saving ability as well as their compression and decompression speed.

The file we're working with is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here are the first 10 lines:

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
GAACAATCCAACGCTTGGTGAATTCTGCTTCACAAT
+SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
DBBE?BDFDEGGGEG@GGBGDGGGGHHHHHGFHFBG
@SRR1171523.3 Y40-ILLUMINA:15:FC:5:1:3346:1218 length=36
TTCCACTTCCCAAGTAAACAGTGCAGTGAATCCTGT
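These counts are simple to verify from the shell, since each fastq record spans exactly four lines. A minimal sketch, assuming the uncompressed file is named SRR1171523.fastq (a hypothetical filename based on the accession above):

# show the first 10 lines (two and a half fastq records)
head -n 10 SRR1171523.fastq

# total line count; dividing by 4 gives the number of reads
wc -l < SRR1171523.fastq
echo $(( $(wc -l < SRR1171523.fastq) / 4 ))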

The system performing the benchmark is a pretty standard desktop PC with 8 threads.
[Figure: The system used to benchmark the compression software.]
I performed compression using the default settings for each tool and determined the compression ratio. bzip2 compressed better than gzip, but not as well as plzip, pxz or lrzip.
[Figure: Data compression characteristics. Compressed fastq file size compared to the uncompressed file (top); compression ratio of the various compression tools (bottom).]
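As a rough sketch of how one of these data points could be generated (the filename is again hypothetical, and GNU stat is assumed):

# compress at default settings; gzip -c leaves the original file in place
gzip -c SRR1171523.fastq > SRR1171523.fastq.gz

# plzip at default settings; -k keeps the input, output is SRR1171523.fastq.lz
plzip -k SRR1171523.fastq

# compression ratio = uncompressed size / compressed size
echo "scale=2; $(stat -c%s SRR1171523.fastq) / $(stat -c%s SRR1171523.fastq.gz)" | bc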
Next, I determined the time taken to read each compressed file, sending the output to /dev/null, and then the time required to compress the file, both to /dev/null and with a write to disk. These tests show that parallel gzip (pigz) is definitely the fastest way to read or compress a file on this system. plzip was quick at decompression but pretty slow at compression, while pxz and lrzip were slow at both. Their CPU and memory requirements may also be worth considering if other jobs will be running concurrently.
[Figure: Comparison of the compression tools with respect to time (left), peak memory (centre) and CPU utilisation (right) for decompression (top), compression to /dev/null (middle) and compression with write to disk (bottom).]
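The sketch below shows roughly how one such measurement could be taken, using GNU time to capture wall time, peak memory and CPU utilisation, and dropping the Linux page cache between runs so that earlier reads don't flatter the timings. The exact procedure is in the benchmarking post linked below, and the filename here is again hypothetical.

# flush cached file data so reads come from disk, not RAM (requires root)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# decompression to /dev/null: -v reports wall time, peak RSS and %CPU
/usr/bin/time -v pigz -dc SRR1171523.fastq.gz > /dev/null

# compression with a write to disk
/usr/bin/time -v pigz -c SRR1171523.fastq > SRR1171523.fastq.gz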

The pigz program was certainly the standout in terms of performance and is the one I'd recommend for files that will be read regularly. Reading and writing compressed data with pigz proved considerably faster than using cat on the uncompressed file, probably because there is less data to transfer to and from the disk. Where disk space is limiting, pbzip2 and plzip may be more suitable for data archiving.
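Because pigz reads and writes the standard gzip format, it also slots straight into pipelines. A couple of hedged examples; the commands on the other side of each pipe are placeholders:

# decompress on all available cores and stream into a downstream tool
pigz -dc SRR1171523.fastq.gz | wc -l    # 'wc -l' stands in for your real command

# compress output on the fly as it is written to disk
cat SRR1171523.fastq | pigz > SRR1171523.fastq.gz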

See my post on how the benchmarking was done.

