Count ChIP-seq reads across a promoter with featureCounts

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation especially for histone marks that are normally associated with gene activation such as H3K9 acetylation or H3K4 tri-methylation.

FeatureCounts has emerged as a competitor to HTSeq and BedTools MultiCov for counting reads across features (ie, exons, genes, promoters). FeatureCounts is great for RNA-seq because it can natively read GTF annotation files, but can't read BED format (that we use a lot in ChIP-seq analysis).

In order to make featureCounts work, we need to extract the TSS coordinates and convert to a BED-like format that it can read (SAF format). In the below script, I extract the positions of the TSSs using a grep search followed by stripping only the neccessary information from the GTF, then using awk and BedTools to expand the region  around the TSS by 3kbp.

#!/bin/bash
#Generate an SAF file for TSS intervals from a GTF

#Specify some parameters
GTF=Homo_sapiens.GRCh38.78.gtf
FA=Homo_sapiens.GRCh38.dna.primary_assembly.fa
CHR=Homo_sapiens.GRCh38.dna.primary_assembly.fa.g
TSS=3000
FWD=tmp+
REV=tmp-
SAF=Homo_sapiens.GRCh38.78_3kbTSS.saf
COMPL=compl
SAFCOMPL=Homo_sapiens.GRCh38.78_3kbTSS+compl.saf

#Extract TSS coordinates for fwd strand genes
grep -w transcript $GTF | awk '$7=="+"' \
| cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$2,$2+1,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"+"}' > $FWD

#Extract TSS coordinates for reverse strand genes
grep -w transcript $GTF | awk '$7!="+"' \
|  cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$3-1,$3,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"-"}' > $REV

cat $FWD $REV > $SAF
rm $FWD $REV

#Account for the non-promoter regions
cut -f2-4 $SAF | sortBed | mergeBed -i - \
| complementBed -i - -g $CHR \
| awk '{OFS="\t"} {print "NA_"$1":"$2"-"$3,$0,"+"}' \
| cat $SAF - > $SAFCOMPL


The last part of the script generates SAF coordinates for complement regions, ie., the parts of the genome that are not located near TSSs. This is useful because sometimes the promoters only contain a small fraction of the original data. For my H3K9ac ChIP-seq, only 10-20% of reads are placed near promoters. When including complement regions, I'm able to include over 80% of reads.

Here's how I use featureCounts to count reads in meta-features (if a gene has many TSSs, then the counts are aggregated). You'll notice that I run featureCounts twice once without the complement regions and once with the complement regions.

#!/bin/bash
SAF=Homo_sapiens.GRCh38.78_3kbTSS.saf
OUT=Exercise_chip_3kbpTSS.txt
featureCounts -Q 20 -T 20 -F SAF -a $SAF -o $OUT *bam

SAFC=Homo_sapiens.GRCh38.78_3kbTSS+compl.saf
OUTC=Exercise_chip_3kbpTSS+compl.txt
featureCounts -Q 20 -T 20 -F SAF -a $SAFC -o $OUTC *bam

Here's the result of the promoter only counts:
Status Sequence1.bam
Assigned 6610765
Unassigned_Ambiguity 1162622
Unassigned_MultiMapping 0
Unassigned_NoFeatures 40595847
Unassigned_Unmapped 1783674
Unassigned_MappingQuality 9157142
Unassigned_FragementLength 0
Unassigned_Chimera 0

Now the result of the promoter & complement counts:
Status Sequence1.bam
Assigned 47112044
Unassigned_Ambiguity 1257190
Unassigned_MultiMapping 0
Unassigned_NoFeatures 0
Unassigned_Unmapped 1783674
Unassigned_MappingQuality 9157142
Unassigned_FragementLength 0
Unassigned_Chimera 0

FeatureCounts can also be run at a feature-level to output counts for each TSS of each gene by adding the "-f" switch.

Now the tabulated reads are ready for differential enrichment analysis with edgeR or similar.

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?