
Showing posts with the label Utilities

Shell aliases for bioinformatics

Using the shell lets us take advantage of some nice features to make our bioinformatics lives a little easier for things we do very frequently. In Ubuntu, the ~/.bashrc file is run each time a new terminal window is opened, to customise the shell. Here are a few of my favourite general shortcuts. Let me know your favourites in the comments section below!

#shorten ls forms
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'


#shorten file viewing
alias h='head'
alias t='tail'
alias n='nano -S'
#the -S option to nano makes scrolling smoother
alias nano='nano -S'

#easy update
alias update='sudo apt-get update && sudo apt-get upgrade -y'

#search through history
alias hgrep='history | grep'

#Get col headers of tab delim file
ch(){
tr '\t' '\n' < "$1" | nl -n ln
}
export -f ch
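For example (with a made-up file name), the function turns a tab-delimited header row into a numbered list of column names:

```shell
# Same ch function as above: one numbered line per tab-delimited field
ch(){
tr '\t' '\n' < "$1" | nl -n ln
}

# Toy tab-delimited file with four column headers
printf 'gene\tchr\tstart\tend\n' > example.tsv
ch example.tsv
# numbers each column name: gene, chr, start, end
```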

#login with ssh where IP is constant (X is the IP address)
alias login1='ssh -Y username@X.X.X.X' #scp can be done as above

#login with …

Functions and GNU parallel for effective cluster load management

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops, but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and submits more jobs as others complete. This means you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. Here is a simple bash example, which declares a function consisting of a chain of piped commands, and then runs 4 jobs at a time until all of *files.txt have been processed.

#!/bin/bash
my_func2() {
INPUT=$1
VAR1=bar
cmd1 "$INPUT" "$VAR1" | cmd2 | cmd3 > "${1}.out"
}
export -f my_func2
parallel -j4…

Using Named Pipes and Process Substitution

Little-known UNIX features for avoiding temporary files in your data pipelines, explained by Vince Buffalo in his digital notebook: named pipes and process substitution.
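As a quick taste (my own sketch, not Buffalo's example): bash's `<(cmd)` process substitution presents a command's output as a file, so you can feed sorted gene lists to comm without writing any temporary files:

```shell
# Two toy, unsorted gene lists (hypothetical names)
printf 'geneB\ngeneA\n' > list1.txt
printf 'geneC\ngeneA\n' > list2.txt

# comm needs sorted input; <(sort ...) sorts on the fly, no temp files
comm -12 <(sort list1.txt) <(sort list2.txt)
# prints geneA, the only gene common to both lists
```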

Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file, such as a genome build or cDNA library; how do you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry. First index, then you can extract almost instantaneously:
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real	0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real	0m0.003s

Use bedtools getfasta to extract portions of a FASTA entry. This requires the regions of interest in BED format:
$ head Homo_sapiens.GRCh38_CpG_Islands.bed
1	10413	11207
1	28687	29634
1	51546	51882
1	121270	121549
The sequences of interest are extracted like this:
$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta
Make a BLAST database to access bunches of sequences quickly. Note: you will need to download and install the BLAST+ package from NCBI or via the Ubuntu software centre. It is compatible with protein and…

Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and performing some benchmarking of the available tools.

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh

But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt". Here is a list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script. Use the following script to …
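The post's actual script is truncated here. Purely as a sketch of the idea, bash's built-in time (via TIMEFORMAT) can report wall-clock seconds and CPU utilisation for each command in codes.txt; capturing peak memory as well would need GNU time's -v flag, which this sketch omits:

```shell
#!/bin/bash
# Sketch of a benchmarking loop. TIMEFORMAT makes bash's built-in `time`
# report wall-clock seconds (%R) and CPU utilisation (%P) for each command.
TIMEFORMAT='elapsed=%Rs CPU=%P%%'

# Toy codes.txt standing in for the real command list from step 1
printf 'echo hello > /dev/null\n' > codes.txt

while read -r cmd; do
    echo "== $cmd"
    time bash -c "$cmd"
done < codes.txt
```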

Test-driving parallel compression software

Next generation sequencing is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure, including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of the different compression software available to make the most of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.

The file we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here are the top 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR117…

UNIX utility: split

Even the simplest types of analysis (e.g. counting lines) can take a really long time with the colossal data sets coming from sequencing experiments these days. Parallelisation of jobs over multiple processors/cores/threads can make things faster, but the standard UNIX-based tools we commonly use don't generally have the built-in capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs.

One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel, and then concatenate/merge the final product. To split a file you can use a tool such as sed to selectively output lines; another option is the split command, which divides the file into pieces of a specified number of lines.

Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:
split …
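Here is a sketch of the full split → process → merge cycle on a toy file (file names are made up; note that for FASTQ the fragment size must be a multiple of 4 lines, which 1,000,000 is, so reads aren't broken across fragments):

```shell
# Toy stand-in for a real sequence file
printf 'line1\nline2\nline3\n' > sequence.fq

# Split into 1,000,000-line fragments; default names are xaa, xab, ...
split -l 1000000 sequence.fq

# Process each fragment in the background (here: count its lines), then wait
for f in xa?; do
    wc -l < "$f" > "$f.count" &
done
wait

# Merge the per-fragment results
cat xa?.count
# prints 3 for this 3-line toy file (a single fragment, xaa)
```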

Venn diagrams for genes

One of the easiest ways to compare datasets is to identify common elements between them and show the overlaps in a Venn diagram.
So when you're working with many data points such as thousands of genes, what is the best/fastest way to do it?

Venny - Intersect up to 4 lists and count the overlaps. See and download the overlaps between gene lists.  Easy to use. Basic graphical output with overlaps not to scale and basic colours.
Gene Venn - Intersect 2 or 3 lists. Overlap output in one text file. Overlaps not to scale, but graphics are somewhat more attractive than Venny.

BioVenn - Intersect 2 or 3 lists. High quality graphics in a range of formats (SVG/PNG), with extensive customisation which makes it perfect for publications. Overlaps are to scale. Optionally can map gene IDs to actual RefSeq/Ensembl/Affy accession numbers.


Venn SuperSelector - Intersect as many lists as you like. The output will be a matrix and the number of entries in the overlap. The first matrix is the pair-wise ov…

Fastq to tabular - filter reads on length

Fastq sequence format is a little strange in that it has a structure of 4 lines per data point (sequence). This format is not ideal for analysis for one reason: it is quite difficult to manipulate with unix tools, which are optimised for tab-delimited data with one entry per line. One way to overcome this is to "flatten" the file so that each sequence read occupies just one line, with the read name, sequence and quality information separated by tabs. To do this, there is a trick you can try with the unix paste command:

paste - - - - < sequence.fq | head

If we then wanted to do something like filter reads based on sequence length, we can add a perl (or awk, for that matter) one-liner:

paste - - - - < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'

which will output the length of the sequence string in the first field.

If we wanted to filter reads based on the length of the se…
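A sketch of that filtering step in awk (toy read names and a 20 nt cutoff, both made up for the example): flatten, keep reads whose sequence is long enough, then restore the 4-line FASTQ layout:

```shell
# Toy FASTQ: one 24 nt read (@r1) and one 4 nt read (@r2)
printf '@r1\nACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n' >  sequence.fq
printf '@r2\nACGT\n+\nIIII\n' >> sequence.fq

# Flatten to one tab-separated line per read, filter on field 2
# (the sequence), then unflatten back to FASTQ
paste - - - - < sequence.fq \
  | awk -F'\t' 'length($2) >= 20' \
  | tr '\t' '\n'
# only @r1 survives the 20 nt cutoff
```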

Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers.

To get the average length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -m

To get the shortest length
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort -n | head -1

To get the longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' …
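numaverage comes from the num-utils package; if it isn't installed, plain awk gives the mean directly. A sketch on a toy list of sequence strings, standing in for the samtools view | cut -f10 stream:

```shell
# Two toy sequence strings, lengths 4 and 8
printf 'ACGT\nACGTACGT\n' > seqs.txt

# Mean length with awk alone (no numutils required)
awk '{ sum += length($0); n++ } END { print sum / n }' < seqs.txt
# prints 6, the mean of 4 and 8
```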

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of colossal data sets, here's something you'll find useful: a parallel zip archiver called PBZIP2. OK, so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2 -p8 FC001_sequence.fq

To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:

pbzip2 -dk -p10 FC001_sequence.fq.bz2

To test the integrity of a compressed file:

pbzip2 -t FC001_sequence.fq.bz2
How to compress an entire directory:

tar cvf - directory | pbzip2 > directory.tar.bz2


Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [http://compression.ca]
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filen…