Align Next Gen Sequencing Data Like A Boss

We know that sequencing platforms are generating more and more output, and more of this data is finding itself in online databanks. The volume of data is a blessing because it is so information rich, but it is also increasingly becoming a problem - we need more machine time to process it.

I was recently aligning dozens of mouse encode RNA-seq data sets while we had a server fault. I was left with a mess of half finished jobs and intermediate files and basically had to start the whole process again. This prompted me to make an alignment pipeline that would be more resistant to unexpected interruptions. To do this, I took inspiration from a post at SEQanswers which piped the output from the BWA aligner and generated bam files without any intermediate files.

But I wanted to take this a bit further, by going from compressed fastq to bam file without writing any intermediate files, doing quality trimming on the fly and saving a lot of storage in the process. So here are two examples, one for BWA, which we are using for ChIP-Seq and other DNA-Seq applications; and Olego which we are using for mRNA-Seq. You will need the unix parallel program to do the bam indexing. If you are working with gzipped fastq, you will need to use "zcat" instead of "pbzip -dc". Make sure you specify the correct suffix for the fastq/fq files and correctly specify the path to the indexed genome in fasta format. Watch out for the -t switch (number of threads to use), so that you don't overload the machine you're working on.

Also note that I'm using fastq_quality_trimmer with the -Q33 switch so it can understand the base quality scores for the latest version of Fastq coming off the Illumina HiSeq.
Firtsly for BWA:

#!/bin/bash
REF=/path/to/genome.fa
for FQZ in *fastq.bz2
do
FQ=`basename $FQZ .bz2`
pbzip2 -dc $FQZ | fastq_quality_trimmer -t 30 -l 20 -Q33 \
| tee $FQ | bwa aln -t 30 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort
done
parallel samtools index -- *bam

Now for Olego:

#!/bin/bash
REF=/path/to/genome.fa
for FQZ in *fastq.bz2
do
FQ=`basename $FQZ .bz2`
pbzip2 -dc $FQZ | fastq_quality_trimmer -t 30 -l 20 -Q33 \
| olego -t 20 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort
done
parallel samtools index -- *bam

So now you won't have lots of intermediate files getting in the way. Yay!

EDIT: here is the code for bwa alignment using the new "mem" algorithm designed for longer sequence alignments (70bp to 1Mbp).

#!/bin/bash
REF=/path/to/genome.fa
for FQZ in *fastq.bz2
do
FQ=`basename $FQZ .bz2`
pbzip2 -dc $FQZ | fastq_quality_trimmer -t 30 -l 20 -Q33\
| bwa mem -t 30 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort
done
parallel samtools index -- *bam

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?