What is a CpG shore and how to I get them all?
CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development (see reference).
To identify CpG shores, you first need a list of coordinates of CpG islands. If you have a linux computer with Emboss tools installed, you can determine CpG island positions with the cpgplot command. (if you don't have Emboss installed you can get CpG island coordinates from the UCSC table browser just be wary of chromosome naming ie, "chr1" vs "1")
Will produce a file like this one:
Now we need to make it bed file by just cutting the chromosome, start and end position.
Simultaneous CpG island and shore detection for the human genome took 5 mins.
Image courtesy of 7-themes.com |
cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout genome.fa | head
Will produce a file like this one:
##gff-version 3
##sequence-region 1 1 59999934
#!Date 2015-04-10
#!Type DNA
#!Source-version EMBOSS 6.6.0.0
1 cpgreport sequence_feature 10469 11240 1299 + . ID=1.1
1 cpgreport sequence_feature 11284 11301 37 + . ID=1.2
1 cpgreport sequence_feature 11343 11347 32 + . ID=1.3
1 cpgreport sequence_feature 11404 11405 17 + . ID=1.4
1 cpgreport sequence_feature 11434 11435 17 + . ID=1.5
cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout test.fa | grep -v '#' | cut -f1,4,5 > genome_cpg_islands.bed
Next step is to use Bedtools slop to extend the coordinates of the CpG islands by 2000 in each direction.
The catch is that Bedtools requires a file which contains the lengths of each chromosome so that it doesn't create coordinates outside of the possible range. To do that is relatively easy:
samtools faidx genome.fa
Will generate a file containing this data with a ".fai" suffix. The first 2 columns of the .fai file contain the chromosome name and length. Send that text to a new file.
cut -f-2 genome.fa.fai > genome.fa.g
Now we can generate the shores:
slopBed -i cpg_islands.bed -g genome.fa.g -b 2000 | mergeBed \
| subtractBed -a - -b cpg_islands.bed > cpg_shores.bed
This should work if you have a high memory system like a server but you might struggle on your workstation, so here's the same commands but this time processing chromosomes one-by-one.
Its east to run as well.
$cat cgs.sh
#!/bin/bash
#bash script to generate BED file of CpG islands and shores across whole genome
#depends on samtools and bedtools.
FA=$1
RPT=$FA.cpgreport
CGI=`echo $FA | sed 's/.fa/_cgi.bed/'`
CGS=`echo $FA | sed 's/.fa/_cgs.bed/'`
samtools faidx $FA
echo indexed
for CHR in `grep '>' $FA | tr '\t' ' ' | cut -d ' ' -f1 | tr -d '>'`
do
samtools faidx $FA $CHR \
| cpgreport -score 17 -outfile $RPT -outfeat /dev/stdout /dev/stdin \
| grep -v '#' | cut -f1,4,5
done > $CGI
cut -f-2 ${FA}.fai > ${FA}.g
slopBed -i $CGI -g ${FA}.g -b 2000 | mergeBed \
| subtractBed -a - -b $CGI > $CGS
./cgs.sh Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
Simultaneous CpG island and shore detection for the human genome took 5 mins.
real 4m59.186s
user 4m29.325s
sys 0m13.124s