Genome methylation analysis with Bismark


Bismark is currently the de facto standard for primary analysis of high throughput bisulfite sequencing data. Bismark can align the reads to the genome and perform methylation calling. In this post, I'll go through Illumina whole genome bisulfite sequence (WGBS) alignment and methylation calling using Bismark. First I want to mention that this post is just a summary, not meant to be a user manual or thorough troubleshooting guide. Fortunately, Bismark has some of the best documentation for any bioinformatics suite and is mandatory reading. The Bismark crew are very proactive with responding to user queries on various forums as well.

First step in getting Bismark to work is to index the genome, in this case with Bowtie2:

bismark_genome_preparation --bowtie2 /pathto/refgenome/

Conventionally, multiplexed libraries will be sequenced over a number of lanes. Resist concatenating or merging the smaller fastq files for each patient/sample until after the alignment, as the concatenated files will be huge for deeply sequenced samples (30x). Using smaller chunks of less than 100 million read pairs has the benefit of requiring smaller disk space requirements for intermediate files and server failure is likely to have a less detrimental impact.

After quality trimming of paired end reads with Skewer, the files have names such as this (yours may have a different naming convention and the script will need to be modified accordingly):

Sample1_lane01_1.fq-trimmed-pair1.fastq.gz
Sample1_lane01_1.fq-trimmed-pair2.fastq.gz

The following script cycles through the files with the "pair1.fastq.gz" suffix and finds the corresponding read2, uncompresses the gzipped fastq, and uses bowtie2 alignment after C>T/A>G conversion. The script uses a non-directional setting but this will depend on the exact library preparation method. If unsure, just use a million or so read pairs to test directionality of the seq data before running these long jobs. The script will generate gzipped sam files for each input read pair.

It is really important to note that while I set parallelism (number of threads) to 4 for bowtie2 alignment, Bismark actually runs 4 instances of Bowtie2 in parallel (4x4=16), so its important not to overload your system with too many jobs in parallel. The script below will run a maximum of 2 bismark jobs on any server (total of 32 cores). This will need to be adjusted for your circumstance

#!/bin/bash
#How to prepare a genome for bismark alignment
#bismark_genome_preparation --bowtie2 $REF

wgbsaln(){
##Specify a reference genome directory
REF=/pathto/refgenome/

###################################################
# Start pipeline
###################################################
FQZ1=$1
#find read2 - this needs customisation for filenames
FQZ2=`echo $FQZ1 | sed 's/pair1.fastq.gz/pair2.fastq.gz/'`
#unzip read 1 and 2
pigz -dfkp 6 $FQZ1 & pigz -dfkp 6 $FQZ2
FQ1=`echo $FQZ1 | sed 's/.gz$//'`
FQ2=`echo $FQZ2 | sed 's/.gz$//'`
wait

###################################################
# Run Bismark with bowtie2 alignment
###################################################
bismark --non_directional --bowtie2 -gzip \
-p 4 $REF -1 $FQ1 -2 $FQ2
rm $FQ1 $FQ2
rm ${FQ1}_C_to_T.fastq ${FQ2}_G_to_A.fastq
pigz ${FQ1}_bismark_bt2_pe.sam
}
export -f wgbsaln
parallel -u -j2 wgbsaln ::: *pair1.fastq.gz

Jobs can also be run over several servers with the following setup (further reading here). 

parallel -u -j2 -S server1,server2,server3 wgbsaln ::: *pair1.fastq.gz

The alignment will take quite a bit longer than a standard gDNA alignment. When it's finished, the compressed sam files can be concatenated (without uncompressing):

cat Sample1_*.sam.gz > Sample1.sam.gz

These compressed sam files can be used directly in methylation calling, but before we start, take a moment to look at the M-bias graphs, as suggested in the manual to determine whether the first and last few bases of read 1 and 2 show any strong biases. If so, you'll need to ignore those bases in the methylation calling. You might also want to move the concatenated sam.gz alignments to a new directory to perform methylation calling with the following script.

#!/bin/bash
#################################################
# Extract methylation scores
#################################################
methextr(){
REF=/pathto/refgenome/
SAMGZ=$1
SAM=`echo $SAMGZ | sed 's/.gz//'`
bismark_methylation_extractor -p --bedgraph --counts \
 --buffer_size 20G --cytosine_report --genome_folder \
$REF --multicore 10 --gzip --ignore 3 --ignore_r2 3 $SAMGZ
}
export -f methextr

parallel -u -j3 methextr ::: Sample1.sam.gz Sample2.sam.gz SampleN.sam.gz

The script is based upon an example in the manual that generates a genome wide CpG report in addition to the Bedgraph output that look like this.

==> Sample1.sam.bismark.cov <==
20 60179 60179 34.4827586206897 10 19
20 60180 60180 46.6666666666667 7 8
20 60358 60358 0 0 1
20 60426 60426 92.3076923076923 12 1
20 60427 60427 92 23 2
20 60432 60432 84.6153846153846 11 2
20 60433 60433 72 18 7
20 60551 60551 38.8888888888889 7 11
20 60552 60552 46.6666666666667 14 16
20 60578 60578 47.3684210526316 9 10

==> Sample1.sam.CpG_report.txt <==
20 60179 + 10 19 CG CGT
20 60180 - 7 8 CG CGT
20 60426 + 12 1 CG CGA
20 60427 - 23 2 CG CGG
20 60432 + 11 2 CG CGA
20 60433 - 18 7 CG CGG
20 60551 + 7 11 CG CGA
20 60552 - 14 16 CG CGT
20 60578 + 9 10 CG CGT
20 60579 - 25 15 CG CGC



In future posts, I'll go through the process of calling differentially methylated cytosines and regions.

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?