Benchmark scripts and programs

September 21, 2014

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and benchmarking the available tools on it.
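One way to take such a fraction is sketched below. It assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequences). The name all_reads.fastq and the read count are hypothetical, and the sketch builds a tiny synthetic file first so that it runs on its own.

```shell
#!/bin/bash
# Build a tiny synthetic FASTQ (1000 records) so the sketch is self-contained.
# In practice all_reads.fastq (hypothetical name) would be your real data.
for i in $(seq 1000); do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done > all_reads.fastq

# Keep the first N_READS records; each record is assumed to span 4 lines.
N_READS=100
head -n $((N_READS * 4)) all_reads.fastq > test.fastq

# Report how many records ended up in the subset
echo $(( $(wc -l < test.fastq) / 4 ))
```

Taking the first reads is quick, but they are not necessarily representative of the whole run, so a random subsample may be a better choice when read quality or content varies through the file.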

Benchmarking could be as simple as using time:


time ./script1.sh
time ./script2.sh

But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt"

Here is a list of commands that I used in a previous post:


$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script

The following script executes each line of "codes.txt" and uses /usr/bin/time to measure time, memory and CPU. /usr/bin/time is more flexible than the shell's built-in "time" because its output format can be customised. Importantly, the script needs sudo permissions to clear the page cache before each task, so that a file cached by an earlier run does not make a later one look faster. You may find sudo su more convenient here, since it avoids repeated sudo password prompts, but be CAREFUL not to delete system files or data while working as root. The script runs each test several times (REPS=3) so you can check that the measurements are reproducible.


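$ cat benchmark_time.sh
#!/bin/bash
# Run every command in codes.txt REPS times and record time, memory and CPU.
CODES=codes.txt
LEN=$(wc -l < "$CODES")
EX=exe.sh
TMP=tmp.txt
RES=results.txt
REPS=3

for REP in $(seq "$REPS")
do
  for LINE in $(seq "$LEN")
  do
    # Write the current command into a small executable script
    CODE=$(sed -n "${LINE}p" "$CODES")
    echo "$CODE" > "$EX"
    chmod +x "$EX"
    # Flush dirty pages and drop the page cache so every run starts cold
    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    # %e = elapsed seconds, %M = peak resident memory (KB), %P = CPU percentage
    /usr/bin/time --format "%e\t%M\t%P" --output="$TMP" ./"$EX"
    # Join the command and its measurements into one tab-separated line
    echo "$CODE" | cat - "$TMP" | tr '\n' '\t' | sed 's/$/\n/'
  done
done | tee "$RES"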
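To see what the format string does in isolation, here is a minimal sketch that times a single command and labels each field. It assumes GNU time is installed at /usr/bin/time (the bash built-in does not accept --format). The file name time_demo.txt and the field labels are made up for illustration.

```shell
#!/bin/bash
# Only run the demo if /usr/bin/time is the GNU implementation (assumption).
if /usr/bin/time --version 2>&1 | grep -qi gnu; then
  # %e = elapsed wall-clock seconds
  # %M = maximum resident set size in KB
  # %P = CPU percentage, i.e. (user + system) / elapsed
  /usr/bin/time --format "elapsed_s=%e\tmax_rss_kb=%M\tcpu=%P" \
    --output=time_demo.txt sleep 1
  cat time_demo.txt
else
  echo "GNU time not found at /usr/bin/time" >&2
fi
```

Writing the measurements to a file with --output keeps them separate from anything the timed command itself prints to stderr, which is why the benchmarking script uses the same option.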

3. Execute the script

chmod +x benchmark_time.sh
sudo su
./benchmark_time.sh

The output should look something like this:


cat test.fastq > /dev/null              30.06   728     6%
zcat test.fastq.gz > /dev/null          23.56   1448    93%
bzcat test.fastq.bz2 > /dev/null        95.39   4148    98%
pigz -dc test.fastq.gz > /dev/null      11.99   1020    142%
pbzip2 -dc test.fastq.bz2 > /dev/null   32.96   43124   765%
plzip -dc test.fastq.lz > /dev/null     13.45   225424  751%

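Each line starts with the command. The first number after it is the elapsed (wall-clock) time in seconds, the second is the peak memory (maximum resident set size, in KB) and the third is the CPU utilisation. Values above 100% mean the program kept more than one core busy.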
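Because each command is run several times, it can help to collapse the replicates into one line per command. The following is a rough sketch, not part of the original script: summarise_results is a hypothetical helper that assumes results.txt holds tab-separated lines in the order command, seconds, peak memory (KB) and CPU%.

```shell
#!/bin/bash
# summarise_results FILE (hypothetical helper)
# Prints, per command: mean elapsed seconds, highest peak memory (KB)
# and the number of replicates, sorted from fastest to slowest.
summarise_results() {
  awk -F'\t' '
    NF >= 4 {
      n[$1]++                                  # replicates seen for this command
      t[$1] += $2                              # running sum of elapsed seconds
      if ($3 + 0 > m[$1] + 0) m[$1] = $3 + 0   # highest peak memory so far
    }
    END {
      for (c in n) printf "%s\t%.2f\t%d\t%d\n", c, t[c] / n[c], m[c], n[c]
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2n
}

# Summarise the benchmark output if it is present in the current directory
if [ -f results.txt ]; then
  summarise_results results.txt
fi
```

The CPU column is left out of the summary because averaging percentages across runs of different lengths can be misleading. Comparing the individual replicate times is also worthwhile, since a large spread suggests the measurements are being disturbed by other activity on the machine.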

If you need more detailed reports, consider dstat.

Genome Spot