BWA alignment to a genome

BWA alignment to a genome - single ends

November 19, 2012

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing, quality trimming and adapter clipping. I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here), but currently the most popular choices are BWA and Bowtie2. BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with Samtools. The things you will need in the otherwise empty directory are:

Reference genome in fasta format
Sequence data in fastq format (.fq)

First thing to do is give the experiment a name and index the genome with BWA. The index step could take about 3 hours for a human size genome.

EXID=Experimentname
REF=Hg19.fa
bwa index -a bwtsw $REF

Next, use BWA to align. All of our fastq data sets have file names that begin with the prefix "FC" and end in the suffix "fq", so I use that to specify the input sequences. Also note that I've specified 4 threads, so if you are working on a big cluster, you can use more cores by increasing the number of threads per alignment or by sending the jobs to the background with ampersand "&" and "wait" command.

for i in `ls FC*.fq | sed 's/.fq//' `
do
bwa aln -t 4 $REF ${i}.fq > ${i}.sai
done

Next, we need to get the alignment into sam format using the samse command. This step can't be multi-threaded, but as you can see below, I have used the ampersand and "wait" command to send the samse jobs to the background and then wait until they're finished before proceeding to the next step.

for i in `ls FC*.fq | sed 's/.fq//' `
do
bwa samse $REF ${i}.sai ${i}.fq > ${i}.sam &
done
wait

Next, we will convert the format to binary sam (bam) format with Samtools.

for l in `ls *.sam | sed 's/.sam//'`
do
samtools view -bhS ${l}.sam > ${l}.unsorted.bam &
done
wait

Now, putting the bam file into sorted order, and discarding the unsorted files.

for m in `ls *.unsorted.bam | sed 's/.unsorted.bam//'`
do
samtools sort ${m}.unsorted.bam ${m} &
done
wait
rm *unsorted*

Now that you have an alignment in the bam file, check the number of reads which have successfully mapped to the reference.

for o in `ls *.bam | sed 's/.bam//'`
do
samtools flagstat ${o}.bam | grep -v ^0 > ${o}.stats &
done
wait

Once you have confirmed that the alignment has worked, clean up some of the intermediate files.

rm *.sai *.sam

That wasn't too hard was it?

Have a look at some other approaches here.

Search This Blog

Genome Spot

BWA alignment to a genome - single ends

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

My personal thoughts on gene name errors