Showing posts with the label data processing

Functions and GNU parallel for effective cluster load management

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops, but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and then starts more jobs as others complete. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you already have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. Here is a simple bash example that declares a function consisting of a chain of piped commands and then runs 4 jobs in parallel until all of *files.txt have been processed.

#!/bin/bash
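my_func() {
# First argument: the input file to process
INPUT=$1
VAR1=bar
cmd1 "$INPUT" "$VAR1" | cmd2 | cmd3 > "${INPUT}.out"
}
# Export the function so the shells started by parallel can see it
export -f my_func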
parallel -j4…
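
To make the pattern concrete, here is a minimal runnable sketch. It is not the command from the post above: the function count_lines, the sample file names and the temporary directory are all made up for illustration, and the sequential fallback loop is only there so the script still runs on a machine without GNU parallel.

```shell
#!/bin/bash
# Illustrative sketch only: count_lines and the sample files are hypothetical.

# Create three small input files in a scratch directory.
workdir=$(mktemp -d)
for i in 1 2 3; do
    seq "$i" > "$workdir/sample${i}_files.txt"
done

# The function to run once per input file: write the line count to <input>.out
count_lines() {
    local input=$1
    wc -l < "$input" | tr -d ' ' > "${input}.out"
}
# Exported functions are inherited by bash child processes.
export -f count_lines

if [[ $(parallel --version 2>/dev/null) == *"GNU parallel"* ]]; then
    # GNU parallel runs jobs with $SHELL, so make sure that is bash.
    export SHELL=$(type -p bash)
    # Run at most 4 jobs at a time, one per file listed after :::
    parallel -j4 count_lines ::: "$workdir"/*_files.txt
else
    # Fallback (sequential) so the sketch runs without GNU parallel.
    for f in "$workdir"/*_files.txt; do
        count_lines "$f"
    done
fi

cat "$workdir/sample3_files.txt.out"
```

The version check is there because some distributions ship a different program that is also called parallel and does not understand the ::: syntax.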

Using Named Pipes and Process Substitution

Little known UNIX features to avoid writing temporary files in your data pipelines explained by Vince Buffalo in his digital notebook. Introducing named pipes and process substitution.
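
As a rough illustration of the two features (my own sketch, not taken from Vince Buffalo's notebook), the script below finds the lines shared by two unsorted files without writing sorted temporary copies, then passes data between two processes through a named pipe. The file names left.txt, right.txt and stream.fifo are hypothetical.

```shell
#!/bin/bash
# Illustrative sketch only: all file names here are made up.
workdir=$(mktemp -d)
printf 'b\na\nc\n' > "$workdir/left.txt"
printf 'c\nb\nd\n' > "$workdir/right.txt"

# Process substitution: each <(...) is handed to comm as if it were a file,
# so the sorted intermediates never touch the disk.
common=$(comm -12 <(sort "$workdir/left.txt") <(sort "$workdir/right.txt"))
echo "$common"

# Named pipe: a FIFO is a file system entry that connects a writer to a reader.
fifo="$workdir/stream.fifo"
mkfifo "$fifo"
# The writer runs in the background and blocks until a reader opens the FIFO.
sort "$workdir/left.txt" > "$fifo" &
# The reader consumes the sorted stream directly from the FIFO.
sorted=$(cat "$fifo")
wait
echo "$sorted"

rm -rf "$workdir"
```

Process substitution is the more convenient of the two for one-off commands. A named pipe is handy when the program at either end insists on being given a real file path.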

Sambamba vs Samtools

Samtools has been one of the main tools for reading and writing aligned NGS data since the SAM alignment format was initially proposed. With the amount of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor to Samtools called Sambamba, so I'll just do a few quick tests to see how it stacks up to Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have a 1.8 GB bam file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert to unsorted sam format.
$ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam

The sam file is 9.9 GB. Let's try 1-thread SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s

Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and benchmarking the available tools on it.

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh
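
If you want the timing as a number you can log or compare, rather than reading it off the terminal, bash's time keyword can be captured. The following is a small sketch, assuming bash; the sleep command is just a stand-in for whatever script you are timing.

```shell
#!/bin/bash
# Illustrative sketch only: "sleep 0.2" stands in for the script being timed.

# TIMEFORMAT controls what bash's time keyword prints; %R is elapsed (real) seconds.
TIMEFORMAT='%R'

# time reports on the shell's stderr, so group the command and redirect the
# group's stderr into the command substitution. The timed command's own
# output is discarded so that only the timing is captured.
elapsed=$( { time sleep 0.2 >/dev/null 2>&1; } 2>&1 )

echo "elapsed seconds: $elapsed"
```

Other TIMEFORMAT fields include %U and %S for user and system CPU seconds.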

But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt". Here is a list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script. Use the following script to …