Test-driving parallel compression software
Next generation sequencing is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure, including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of the different compression software available to make the most of their hardware.
In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.
The file we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads over 103.0 million lines. Here are the first 10 lines (a quick sketch of how these figures can be checked follows the listing).
@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
GAACAATCCAACGCTTGGTGAATTCTGCTTCACAAT
+SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
DBBE?BDFDEGGGEG@GGBGDGGGGHHHHHGFHFBG
@SRR1171523.3 Y40-ILLUMINA:15:FC:5:1:3346:1218 length=36
TTCCACTTCCCAAGTAAACAGTGCAGTGAATCCTGT
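As a rough sketch, assuming the file is named SRR1171523.fastq (the actual filename isn't given in this post), these figures can be checked with standard command-line tools:

ls -lh SRR1171523.fastq                        # file size on disk
wc -l SRR1171523.fastq                         # total number of lines
echo $(( $(wc -l < SRR1171523.fastq) / 4 ))    # reads = lines / 4 in a standard fastq
head SRR1171523.fastq                          # first 10 lines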
The system performing the benchmark is a pretty standard desktop PC with 8 threads.
I compressed the file with each tool at its default settings and determined the compression ratio. bzip2 was better than gzip, but not as good as plzip, pxz or lrzip.
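A minimal sketch of how such a comparison might be run; the exact invocations used here aren't shown, and the filename and loop below are assumptions:

#!/usr/bin/env bash
# Compress a copy of the file with each tool at default settings and
# report the compression ratio (original size / compressed size).
FASTQ=SRR1171523.fastq
ORIG=$(stat -c %s "$FASTQ")
for CMD in gzip bzip2 pigz pbzip2 plzip pxz lrzip ; do
    cp "$FASTQ" test.fastq
    $CMD test.fastq                  # most tools replace test.fastq with test.fastq.<ext>
    OUT=$(ls test.fastq.*)
    SIZE=$(stat -c %s "$OUT")
    awk -v c="$CMD" -v o="$ORIG" -v s="$SIZE" 'BEGIN { printf "%s ratio: %.2f\n", c, o/s }'
    rm -f test.fastq "$OUT"
done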
Next I determined the time taken to read a compressed file and write the output to /dev/null, and then the time required to compress the file and write it to disk. These tests show that parallel gzip (pigz) is definitely the fastest way to read or compress a file on this system. plzip was quick for decompression but pretty slow for compression. pxz and lrzip were slow for both decompression and compression; their CPU and memory requirements may also be worth considering if other jobs will be running concurrently.
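The full benchmarking procedure is covered in a separate post (linked below); as a hedged sketch, the timing approach with the page cache cleared between runs might look like this, with pigz standing in for each tool and the filename assumed:

# clear the filesystem cache so the next run reads from disk (needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# decompression time: read the compressed file and discard the output
time pigz -dc SRR1171523.fastq.gz > /dev/null

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# compression time: read the uncompressed file and write the archive to disk
time pigz -c SRR1171523.fastq > SRR1171523.fastq.gz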
The pigz program was certainly the standout in terms of performance and would be recommended for files that might be read regularly. Reading and writing compressed data with pigz proved to be considerably faster than using cat, probably because after compression there is less data to read from and write to disk. Where disk space is limiting, pbzip2 and plzip might be more appropriate for data archiving.
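For example, a pigz-compressed fastq can be streamed straight into downstream commands without writing the uncompressed data back to disk (filenames assumed):

pigz -dc SRR1171523.fastq.gz | wc -l            # decompress on the fly into a pipeline
pigz < SRR1171523.fastq > SRR1171523.fastq.gz   # recompress using all available cores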
See my post on how the benchmarking was done.
Sources:
http://unix.stackexchange.com/questions/52313/how-to-get-execution-time-of-a-script-effectively
http://www.ubuntugeek.com/how-to-clear-cached-memory-on-ubuntu.html
http://www.thegeekstuff.com/2012/01/time-command-examples/
http://ss64.com/bash/time.html