Posts

Showing posts with the label Parallelization

Functions and GNU parallel for effective cluster load management

Image
I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and then submits more jobs as others are completed. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. In bash, here is a simple example, which declares a function consisting of a chain of piped commands, and then executes 4 jobs in parallel, until all of *files.txt have been processed.

#!/bin/bash
my_func2() {
INPUT=$1
VAR1=bar
cmd1 $INPUT $VAR1 | cmd2 | cmd3 > ${1}.out
}
export -f my_func
parallel -j4…

Paired-end fastq quality control with Skewer

Image
Quality trimming and adapter clipping paired end reads is a tricky business. For paired-end alignments to work, the order of the sequences in both files (forward and reverse) needs to be retained. While there are plenty of read trimmers available open-source (think FASTX-toolkit, SeqTK, CutAdapt, etc), I haven't found many that:

Retain pairing infoRun parallel and are fastAre easy to set up and runHave good docs
Then I fell in love with Skewer. It smashed through 14.5 million gzipped read pairs, doing adapter clipping and quality trimming in two minutes on my 8-core workstation. It auto-detects quality encoding so you can safely analyse any Illumina data. Awesome!

$ skewer -t 8 -q 20 SRR634969.sra_1.fastq.gz SRR634969.sra_2.fastq.gz
Parameters used:
-- 3' end adapter sequence (-x):AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y):AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- maximum error ratio allowed (-r):0.100
-- maximum indel error ratio allowed (-d):0.030
-- e…

Sambamba vs Samtools

Samtools has been the one main tools for reading and writing aligned NGS data since the SAM alignment format was initially proposed. With the amounts of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor to Samtools called Sambamba, so I'll just do a few quick tests to see how it stacks up to Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have an 1.8 GB bam file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert to unsorted sam format. $ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam

The sam file is 9.9 GB. Lets try 1-thread SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s

Test-driving parallel compression software

Image
Next generation sequencing is driving genomics into realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of different compression software available to make the most out of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.

The file that we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here is the top 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR117…

UNIX utility: split

Image
Even the most simple types of analysis (ie: count lines) can take a really long time with the colossal data sets  coming from sequencing experiments these days. Parallelization of jobs over multiple processors/cores/threads can make things faster, but the standard unix based tools we commonly use don't generally have the in-built capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs.

One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines, another option is to use the split command which will divide the file by the specified number of lines.

Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:
split …

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of collossal data sets, here's something which you'll find useful, a parallel zip archiver called PBZIP2. OK so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2  -p8 FC001_sequence.fq To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:
pbzip2  -dk -p10 FC001_sequence.fq.bz2 To test the integrity of a compressed file:
pbzip2  -t FC001_sequence.fq.bz2
How to compress an entire directory:

tar cv directory | pbzip2 > directory.tar.bz2


Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [http://compression.ca]
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filen…