Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and benchmarking the available tools on it.
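One simple way to get a small test set is to take the first few reads of a FASTQ file. This is only a sketch: it assumes standard four-line FASTQ records (no wrapped sequence lines), the function name subsample_fastq and the file names are hypothetical, and the first reads in a file are not necessarily representative of the whole run.

```shell
#!/bin/bash
# Hypothetical helper: copy the first N reads of a FASTQ file to a new file.
# Assumes each record is exactly 4 lines (header, sequence, "+", qualities).
subsample_fastq() {
    local in="$1" n_reads="$2" out="$3"
    head -n "$(( n_reads * 4 ))" "$in" > "$out"
}

# Example usage (file names are placeholders):
# subsample_fastq test.fastq 100000 subset.fastq
```

For compressed input, the same idea works by piping the decompressed stream into head instead of reading a file.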

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh


If you need a little more detail, the approach below also captures peak memory usage and average CPU utilisation.

1. Set up a list of commands/scripts in a file called "codes.txt"

Here is a list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script

The following script executes each line of "codes.txt" and uses /usr/bin/time to measure time, memory and CPU usage. /usr/bin/time is more flexible than the shell built-in "time" because its output format can be customised. Importantly, the script needs sudo permissions to clear the page cache before each task, so that files cached by an earlier run do not speed up later ones. Running it from a root shell (sudo su) avoids repeated sudo password prompts, but be CAREFUL not to delete system files or data. Each test is repeated (three times by default, set by REPS) to check that the results are reproducible.

$ cat benchmark_time.sh

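#!/bin/bash
# Run every command in codes.txt REPS times and record, for each run,
# elapsed time (s), peak memory (KB) and CPU utilisation (%).
CODES=codes.txt
LEN=$(wc -l < "$CODES")
EX=exe.sh
TMP=tmp.txt
RES=results.txt
REPS=3

for REP in $(seq "$REPS")
do
  for LINE in $(seq "$LEN")
  do
    # Fetch one command and wrap it in a small script so that
    # redirections and pipes inside the command still work
    CODE=$(sed -n "${LINE}p" "$CODES")
    printf '%s\n' "$CODE" > "$EX"
    chmod +x "$EX"
    # Flush pending writes, then drop the page cache so every run reads from disk
    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    # %e = elapsed seconds, %M = peak resident memory (KB), %P = CPU percentage
    /usr/bin/time --format "%e\t%M\t%P" --output="$TMP" ./"$EX"
    # Join the command and its measurements into one tab-separated line
    printf '%s\n' "$CODE" | cat - "$TMP" | tr '\n' '\t' | sed 's/$/\n/'
  done
done | tee "$RES"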

3. Execute the script

chmod +x benchmark_time.sh

sudo su

./benchmark_time.sh

The output should look something like this:

cat test.fastq > /dev/null 30.06 728 6%
zcat test.fastq.gz > /dev/null 23.56 1448 93%
bzcat test.fastq.bz2 > /dev/null 95.39 4148 98%
pigz -dc test.fastq.gz > /dev/null 11.99 1020 142%
pbzip2 -dc test.fastq.bz2 > /dev/null 32.96 43124 765%
plzip -dc test.fastq.lz > /dev/null 13.45 225424 751%

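Each line starts with the command, followed by three measurements: the elapsed (wall-clock) time in seconds, the peak memory (maximum resident set size) in kilobytes, and the CPU utilisation. CPU values above 100% mean the program used more than one core.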
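Because each command is run several times, it can help to condense the replicates into one line per command. The following is a minimal sketch, assuming the tab-separated layout written to results.txt by the script above (command, seconds, KB, CPU%); the function name summarise_results is hypothetical. It reports the mean elapsed time and the largest peak memory seen for each command.

```shell
#!/bin/bash
# Hypothetical helper: summarise replicate runs from a results file.
# Input columns (tab-separated): command, seconds, peak KB, CPU%
# Output columns (tab-separated): command, mean seconds, max peak KB
summarise_results() {
    awk -F'\t' '
        {
            n[$1]++                              # number of replicates
            t[$1] += $2                          # total elapsed seconds
            if ($3 + 0 > m[$1]) m[$1] = $3 + 0   # largest peak memory
        }
        END {
            for (c in n) printf "%s\t%.2f\t%d\n", c, t[c] / n[c], m[c]
        }
    ' "$1" | sort
}

# Example usage:
# summarise_results results.txt
```

The output is sorted by command text only because awk does not guarantee the order of its array traversal; sorting on the second column instead would rank the commands by speed.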

If you need more detailed reports, consider dstat.
