Genome methylation analysis with Bismark
First step in getting Bismark to work is to index the genome, in this case with Bowtie2:
bismark_genome_preparation --bowtie2 /pathto/refgenome/
Conventionally, multiplexed libraries will be sequenced over a number of lanes. Resist concatenating or merging the smaller fastq files for each patient/sample until after the alignment, as the concatenated files will be huge for deeply sequenced samples (30x). Using smaller chunks of less than 100 million read pairs has the benefit of requiring smaller disk space requirements for intermediate files and server failure is likely to have a less detrimental impact.
After quality trimming of paired end reads with Skewer, the files have names such as this (yours may have a different naming convention and the script will need to be modified accordingly):
Sample1_lane01_1.fq-trimmed-pair1.fastq.gz
Sample1_lane01_1.fq-trimmed-pair2.fastq.gz
It is really important to note that while I set parallelism (number of threads) to 4 for bowtie2 alignment, Bismark actually runs 4 instances of Bowtie2 in parallel (4x4=16), so its important not to overload your system with too many jobs in parallel. The script below will run a maximum of 2 bismark jobs on any server (total of 32 cores). This will need to be adjusted for your circumstance
#!/bin/bash
#How to prepare a genome for bismark alignment
#bismark_genome_preparation --bowtie2 $REF
wgbsaln(){
##Specify a reference genome directory
REF=/pathto/refgenome/
###################################################
# Start pipeline
###################################################
FQZ1=$1
#find read2 - this needs customisation for filenames
FQZ2=`echo $FQZ1 | sed 's/pair1.fastq.gz/pair2.fastq.gz/'`
#unzip read 1 and 2
pigz -dfkp 6 $FQZ1 & pigz -dfkp 6 $FQZ2
FQ1=`echo $FQZ1 | sed 's/.gz$//'`
FQ2=`echo $FQZ2 | sed 's/.gz$//'`
wait
###################################################
# Run Bismark with bowtie2 alignment
###################################################
bismark --non_directional --bowtie2 -gzip \
-p 4 $REF -1 $FQ1 -2 $FQ2
rm $FQ1 $FQ2
rm ${FQ1}_C_to_T.fastq ${FQ2}_G_to_A.fastq
pigz ${FQ1}_bismark_bt2_pe.sam
}
export -f wgbsaln
parallel -u -j2 wgbsaln ::: *pair1.fastq.gz
Jobs can also be run over several servers with the following setup (further reading here).
parallel -u -j2 -S server1,server2,server3 wgbsaln ::: *pair1.fastq.gz
The alignment will take quite a bit longer than a standard gDNA alignment. When it's finished, the compressed sam files can be concatenated (without uncompressing):
cat Sample1_*.sam.gz > Sample1.sam.gz
These compressed sam files can be used directly in methylation calling, but before we start, take a moment to look at the M-bias graphs, as suggested in the manual to determine whether the first and last few bases of read 1 and 2 show any strong biases. If so, you'll need to ignore those bases in the methylation calling. You might also want to move the concatenated sam.gz alignments to a new directory to perform methylation calling with the following script.
#!/bin/bash
#################################################
# Extract methylation scores
#################################################
methextr(){
REF=/pathto/refgenome/
SAMGZ=$1
SAM=`echo $SAMGZ | sed 's/.gz//'`bismark_methylation_extractor -p --bedgraph --counts \
--buffer_size 20G --cytosine_report --genome_folder \
$REF --multicore 10 --gzip --ignore 3 --ignore_r2 3 $SAMGZ
}
export -f methextr
parallel -u -j3 methextr ::: Sample1.sam.gz Sample2.sam.gz SampleN.sam.gz
The script is based upon an example in the manual that generates a genome wide CpG report in addition to the Bedgraph output that look like this.
==> Sample1.sam.bismark.cov <==
20 60179 60179 34.4827586206897 10 19
20 60180 60180 46.6666666666667 7 8
20 60358 60358 0 0 1
20 60426 60426 92.3076923076923 12 1
20 60427 60427 92 23 2
20 60432 60432 84.6153846153846 11 2
20 60433 60433 72 18 7
20 60551 60551 38.8888888888889 7 11
20 60552 60552 46.6666666666667 14 16
20 60578 60578 47.3684210526316 9 10
==> Sample1.sam.CpG_report.txt <==
20 60179 + 10 19 CG CGT
20 60180 - 7 8 CG CGT
20 60426 + 12 1 CG CGA
20 60427 - 23 2 CG CGG
20 60432 + 11 2 CG CGA
20 60433 - 18 7 CG CGG
20 60551 + 7 11 CG CGA
20 60552 - 14 16 CG CGT
20 60578 + 9 10 CG CGT
20 60579 - 25 15 CG CGC
In future posts, I'll go through the process of calling differentially methylated cytosines and regions.