What is a CpG shore and how to I get them all?

CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development (see reference).
Image courtesy of 7-themes.com
To identify CpG shores, you first need a list of coordinates of CpG islands. If you have a linux computer with Emboss tools installed, you can determine CpG island positions with the cpgplot command. (if you don't have Emboss installed you can get CpG island coordinates from the UCSC table browser just be wary of chromosome naming ie, "chr1" vs "1")

cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout genome.fa | head

Will produce a file like this one:

##gff-version 3
##sequence-region 1 1 59999934
#!Date 2015-04-10
#!Type DNA
#!Source-version EMBOSS 6.6.0.0
1 cpgreport sequence_feature 10469 11240 1299 + . ID=1.1
1 cpgreport sequence_feature 11284 11301 37 + . ID=1.2
1 cpgreport sequence_feature 11343 11347 32 + . ID=1.3
1 cpgreport sequence_feature 11404 11405 17 + . ID=1.4
1 cpgreport sequence_feature 11434 11435 17 + . ID=1.5

Now we need to make it bed file by just cutting the chromosome, start and end position.


cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout test.fa | grep -v '#' | cut -f1,4,5 > genome_cpg_islands.bed


Next step is to use Bedtools slop to extend the coordinates of the CpG islands by 2000 in each direction.

The catch is that Bedtools requires a file which contains the lengths of each chromosome so that it doesn't create coordinates outside of the possible range. To do that is relatively easy:

samtools faidx genome.fa

Will generate a file containing this data with a ".fai" suffix. The first 2 columns of the .fai file contain the chromosome name and length. Send that text to a new file.

cut -f-2 genome.fa.fai > genome.fa.g

Now we can generate the shores:

slopBed -i cpg_islands.bed -g genome.fa.g -b 2000 | mergeBed \
| subtractBed -a - -b cpg_islands.bed > cpg_shores.bed

This should work if you have a high memory system like a server but you might struggle on your workstation, so here's the same commands but this time processing chromosomes one-by-one.

$cat cgs.sh
#!/bin/bash

#bash script to generate BED file of CpG islands and shores across whole genome
#depends on samtools and bedtools.

FA=$1
RPT=$FA.cpgreport
CGI=`echo $FA | sed 's/.fa/_cgi.bed/'`
CGS=`echo $FA | sed 's/.fa/_cgs.bed/'`

samtools faidx $FA
echo indexed
for CHR in `grep '>' $FA | tr '\t' ' ' | cut -d ' ' -f1 | tr -d '>'`

do
samtools faidx $FA $CHR \
| cpgreport -score 17 -outfile $RPT -outfeat /dev/stdout /dev/stdin \
| grep -v '#' | cut -f1,4,5
done > $CGI

cut -f-2 ${FA}.fai > ${FA}.g

slopBed -i $CGI -g ${FA}.g -b 2000 | mergeBed \
| subtractBed -a - -b $CGI > $CGS

Its east to run as well.

./cgs.sh Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

Simultaneous CpG island and shore detection for the human genome took 5 mins.

real 4m59.186s
user 4m29.325s
sys 0m13.124s


Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?