Update your gene names when doing pathway analysis of array data!

If you are analysing microarray data such as Infinium methylation arrays, the genomic annotations you're using might be several years old.

The EPIC methylation chip was released in 2016 and the R/Bioconductor annotation set hasn't been updated much since.

So we expect that some gene names have changed, which will reduce the performance of the downstream pathway analysis.

The gene symbols you're using can be updated using the HGNChelper R package on CRAN.

Let's say we want to make a table that maps probe IDs to gene names. The following code can be used.


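# Load the EPIC annotation package; getAnnotation() comes from minfi
library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")

anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

# Keep the gene and CpG island columns
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])

gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]

# Split the semicolon-separated gene names and label each entry with its probe ID
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)

# Drop gene symbols that are repeated within a probe
gp2 <- lapply(gp2,unique)

# Convert the list into a long table with one row per gene-probe pair
gt <- stack(gp2)
colnames(gt) <- c("gene","probe")

# Inspect the result
dim(gt)
str(gt)
head(gt)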
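To see what the split/unique/stack steps do without loading the full annotation, here is a minimal sketch on a toy table. The probe IDs are hypothetical and the gene strings only mimic the layout of the UCSC_RefGene_Name column.

```r
# Toy annotation: hypothetical probe IDs, gene strings in the UCSC_RefGene_Name layout
toy <- data.frame(
  UCSC_RefGene_Name = c("TP53;TP53;WRAP53", "", "BRCA1"),
  row.names = c("cg_toy_01", "cg_toy_02", "cg_toy_03"),
  stringsAsFactors = FALSE
)

# One character vector of gene symbols per probe
genes <- strsplit(toy$UCSC_RefGene_Name, ";")
names(genes) <- rownames(toy)

# Remove symbols listed more than once for the same probe
genes <- lapply(genes, unique)

# Long format: one row per gene-probe pair
toy_map <- stack(genes)
colnames(toy_map) <- c("gene", "probe")
toy_map
```

Note that a probe with an empty gene field contributes no rows, so the final table only contains probes that are annotated to at least one gene.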


Now the gene symbols can be updated. Note that the gene symbol map bundled with HGNChelper is from 2019, but you can get current gene symbols using the following script.

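library("HGNChelper")

# Fetch the current symbol map from HGNC (needs an internet connection)
new.hgnc.table <- getCurrentHumanMap()

# Check every symbol in the table against the current map
fix <- checkGeneSymbols(gt$gene,map=new.hgnc.table)

# Rows where the suggested symbol differs from the original one
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]

# Number of distinct genes that have a new name
length(unique(fix2$x))

# Replace the old symbols; by default, symbols that cannot be mapped become NA
gt$gene <- fix$Suggested.Symbol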
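One thing to watch for: with its default settings, checkGeneSymbols() returns NA as the suggestion for symbols it cannot map, so overwriting the gene column can introduce missing values. The sketch below shows one possible way to keep the original symbol in those cases. The toy_fix table is hand-written to mimic the column names of the checkGeneSymbols() output and is not real HGNChelper output.

```r
# Hand-written stand-in for checkGeneSymbols() output (same column names);
# the rows are illustrative only
toy_fix <- data.frame(
  x = c("TP53", "SEPT9", "NOTAGENE"),
  Approved = c(TRUE, FALSE, FALSE),
  Suggested.Symbol = c("TP53", "SEPTIN9", NA),
  stringsAsFactors = FALSE
)

# How many symbols could not be mapped at all?
n_unmapped <- sum(is.na(toy_fix$Suggested.Symbol))
n_unmapped

# Use the suggestion where there is one, otherwise fall back to the original symbol
updated <- ifelse(is.na(toy_fix$Suggested.Symbol),
                  toy_fix$x,
                  toy_fix$Suggested.Symbol)
updated
```

Calling checkGeneSymbols() with unmapped.as.na=FALSE should give a similar result directly. Whether to keep or drop unmapped symbols depends on your analysis, so treat this as an option rather than a recommendation.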



If you run this code, you will see that 3253 genes in our table have new names. That is about 14% of all genes on the chip!

So please remember to update your gene symbols to get the best results out of your pathway analysis!

