Bowtie alignment to a genome

Bowtie alignment to a genome - single end

November 21, 2012

Recently I posted about BWA alignment of NGS data, so today I will demonstrate the use of Bowtie2 for aligning single end sequence data, for more information on advanced options, take a thorough look at the manpage. One of major strengths of Bowtie2, is that it's more tolerant for gaps than Bowtie1 and BWA. Again you will need to have the reference genome and the raw fastq files in the otherwise empty directory.

The first thing we do is stipulate the name of the experiment and the name of the reference sequence.

EXID=Experimentname
REF=Hg19.fa

Now index the genome with bowtie2-build:

bowtie2-build Hg19.fa Hg19.fa

Now to actually start the alignment for all the files with prefix "FC" and suffix "fq":

for FQ in `ls FC*fq | sed 's/.fq//'`
do
bowtie2 -x $REF -U ${FQ}.fq -S ${FQ}.sam -p 4 --met-file ${FQ}.met 2> ${FQ}.stats &
done
wait

The script starts aligning all fq files to the human genome in parallel and each job uses 4 threads as specified by the "-p" option. Alignment output format is sam (not sai like BWA). It also outputs a file containing some alignment metrics as specified by the "--met-file" parameter. Once the sam files are generated, you can follow the SamTools procedure here to convert the data to bam format. You'll also notice that I've redirected the standard error (using "2>") to the stats file, which is generated automatically by bowtie once the alignment is complete. For more info on i/o redirection on Linux, there heaps of good tutorials out there (here, here, here). Here is an example of what the stats file looks like coming from bowtie2 for an mRNA-seq data set:

20213469 reads; of these:
20213469 (100.00%) were unpaired; of these:
1714839 (8.48%) aligned 0 times
12652875 (62.60%) aligned exactly 1 time
5845755 (28.92%) aligned >1 times

Search This Blog

Genome Spot

Bowtie alignment to a genome - single end

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts