Interspecies gene name conversion

In this post, I'll provide a step-by-step guide to interspecies gene name conversion of gene expression data. This step is needed when comparing profiling data from experiments in different species (here, human and mouse). It also lets us use the extensive human-centric gene set libraries in MSigDB when analysing non-human mammalian profiling data, such as mouse.

I performed GEO2R analysis of mouse expression data (GSE30192) to analyse the effect of azacitidine on mouse C2C12 myoblasts. The data looks like this:

"ID" "adj.P.Val" "P.Value" "t" "B" "logFC" "Gene.symbol" "Gene.title"
"1420647_a_at" "0.000346" "2.24e-08" "56.073665" "8.699524" "6.9755573" "Krt8" "keratin 8"
"1423327_at" "0.000346" "2.32e-08" "55.685912" "8.686447" "3.8096523" "Rpl39l" "ribosomal protein L39-like"
"1433438_x_at" "0.000378" "4.92e-08" "48.124512" "8.381385" "3.8895855" "Mela" "melanoma antigen"
"1460256_at" "0.000378" "6.75e-08" "45.243998" "8.234349" "3.5680721" "Car3" "carbonic anhydrase 3"

Strategy

To convert the mouse gene names to human, we will use Ensembl BioMart to download mouse and human gene information separately, join these files together, and then attach them to the original profiling data, as depicted in the schematic below. (Other homology resources include HomoloGene, MGI and Phytozome.)

Use Ensembl biomart to download a list of mouse gene accession numbers and Associated Gene Names, as in the screenshot below.
The file should look a bit like this:
Ensembl Gene ID Associated Gene Name
ENSMUSG00000064372 mt-Tp
ENSMUSG00000064371 mt-Tt
ENSMUSG00000064370 mt-Cytb
ENSMUSG00000064369 mt-Te

Now do the same for human gene names. Click the "Homologs" button to also select mouse gene accession numbers. It should look like this:
Ensembl Gene ID Mouse Ensembl Gene ID Associated Gene Name
ENSG00000180383 ENSMUSG00000074678 DEFB124
ENSG00000162444 ENSMUSG00000028996 RBP7
ENSG00000165583 ENSMUSG00000079704 SSX5
ENSG00000165583 ENSMUSG00000079701 SSX5
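
Before joining, it can help to check how many human genes in the download actually have a mouse homolog. The sketch below assumes the BioMart export is tab-separated with a header line, and uses a tiny stand-in file rather than the real download. The middle row (ENSG00000000001, FAKE1) is made up purely to illustrate a gene with no homolog.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for Hum2Mus_ID.txt (assumed tab-separated, header on line 1).
# The FAKE1 row is invented and has an empty mouse accession field.
workdir=$(mktemp -d)
{
  printf 'Ensembl Gene ID\tMouse Ensembl Gene ID\tAssociated Gene Name\n'
  printf 'ENSG00000180383\tENSMUSG00000074678\tDEFB124\n'
  printf 'ENSG00000000001\t\tFAKE1\n'
  printf 'ENSG00000165583\tENSMUSG00000079704\tSSX5\n'
} > "$workdir/Hum2Mus_ID.txt"

# Skip the header, then tally rows with and without a mouse accession.
summary=$(sed 1d "$workdir/Hum2Mus_ID.txt" \
  | awk -F'\t' '{ if ($2 == "") none++; else has++ }
      END { printf "with_homolog=%d without_homolog=%d", has, none }')
echo "$summary"
```

Rows without a homolog are the ones that the NF==3 filter in the next section discards.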

Joining files

Now we need to join these two files on the mouse gene accession number. The join command expects both inputs to be sorted on the join field, so each file first needs to be "cleaned" and sorted on that field. The same needs to be done with the profiling data.
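
To see what sort and join do here, this is a minimal toy run with two genes, using accessions from the example output further down. The file names mouse.txt and human.txt are placeholders, and setting LC_ALL=C is my assumption to make sure sort and join agree on ordering.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # assumption: a fixed locale keeps sort and join consistent

workdir=$(mktemp -d)

# Mouse file: accession, mouse symbol
printf 'ENSMUSG00000049382 Krt8\nENSMUSG00000039209 Rpl39l\n' > "$workdir/mouse.txt"
# Human file: human accession, mouse accession, human symbol
printf 'ENSG00000170421 ENSMUSG00000049382 KRT8\nENSG00000163923 ENSMUSG00000039209 RPL39L\n' > "$workdir/human.txt"

# Sort each file on the field that holds the mouse accession
sort -k 1b,1 "$workdir/mouse.txt" > "$workdir/mouse_s.txt"
sort -k 2b,2 "$workdir/human.txt" > "$workdir/human_s.txt"

# Join field 1 of the mouse file with field 2 of the human file.
# Output order: join field, rest of file 1, rest of file 2.
joined=$(join -1 1 -2 2 "$workdir/mouse_s.txt" "$workdir/human_s.txt")
echo "$joined"
```

Each output line holds the mouse accession, mouse symbol, human accession and human symbol, which is the layout the real commands below produce.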

#Mouse file: remove the top line and sort on accession number
sed 1d MusENSG2symbol.txt | sort -u \
| sort -k 1b,1 > MusENSG2symbol_s.txt

#Human file: remove top line, select only genes that have mouse 
#homologs and sort on mouse accession number
sed 1d Hum2Mus_ID.txt | awk 'NF==3' | sort -u \
| sort -k 2b,2 > Hum2Mus_ID_s.txt

#Use the UNIX join command to join the two files on the mouse 
#accession number and then sort the output on the mouse gene symbol 
#(column 2)
join -1 1 -2 2 MusENSG2symbol_s.txt Hum2Mus_ID_s.txt \
| sort -k 2b,2 > Hum2Mus_ID_genenames.txt

#Clean and sort the profiling data
#Extract only the mouse gene name, fold change, p-value and 
#adj p-value from the GEO2R result
sed 1d GSE30192_GEO2R.xls | tr -d '"' \
| awk '{OFS="\t"} {print $7, $6, $3, $2}' \
| awk 'NF==4' | grep -wv NA \
| sort -k 1b,1 > GSE30192_GEO2R_s.xls

#Join the gene name/ID key file to the profiling data, sort on 
#p-values then remove redundant human entries
join -1 2 -2 1 Hum2Mus_ID_genenames.txt GSE30192_GEO2R_s.xls \
| sort -k6g | awk '!arr[$4]++'  > GSE30192_GEO2R_converted.xls
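
The last step (sort on p-value, then keep the first row per human gene) is easiest to see on a toy example. In this sketch the first Krt8 row, with fold change 1.20 and p-value 1.5e-03, is invented to stand in for a second, weaker probe of the same gene. The other two rows come from the real output.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # assumption: fixed locale for predictable sorting

# Columns: mouse symbol, mouse accession, human accession, human symbol,
# fold change, p-value, adjusted p-value. The first row is made up.
rows='Krt8 ENSMUSG00000049382 ENSG00000170421 KRT8 1.20 1.5e-03 0.04
Rpl39l ENSMUSG00000039209 ENSG00000163923 RPL39L 3.8096523 2.32e-08 0.000346
Krt8 ENSMUSG00000049382 ENSG00000170421 KRT8 6.9755573 2.24e-08 0.000346'

# -g sorts numerically and understands scientific notation such as 2.24e-08.
# awk then prints a row only the first time its human symbol ($4) is seen,
# so each human gene keeps its most significant row.
deduped=$(printf '%s\n' "$rows" | sort -k6,6g | awk '!seen[$4]++')
echo "$deduped"
```

Because the rows arrive sorted by p-value, the weaker Krt8 row is the one dropped.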

The data is now ready for comparison with data from human experiments:
MusGeneSymbol MouseAccession HumanAccession HumanGeneSymbol FoldChange P-value adjP-value
Krt8 ENSMUSG00000049382 ENSG00000170421 KRT8 6.9755573 2.24e-08 0.000346
Rpl39l ENSMUSG00000039209 ENSG00000163923 RPL39L 3.8096523 2.32e-08 0.000346
Sprr1a ENSMUSG00000050359 ENSG00000169469 SPRR1B -3.441516 6.46e-08 0.000378
Sprr1a ENSMUSG00000050359 ENSG00000169474 SPRR1A -3.441516 6.46e-08 0.000378

Some thoughts on interspecies gene name conversions

Keep in mind that these types of conversions have limitations. While human and mouse are fairly closely related, many genes are not conserved between them. There can also be gene copies (paralogs) in one of the species, which makes the conversion one-to-many. In that case the paralogs may share the same function, or they may have diverged. Because of these limitations, many genes are lost along the way. Of the 45101 probes in the original dataset, 39315 were assigned to a mouse gene name. Those probes corresponded to 21477 unique gene names, and after assigning these to human homologs we are left with just 16420 genes.
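
One way to gauge the one-to-many problem is to count mouse genes that map to more than one human gene symbol. This is a rough sketch on three rows taken from the example output above, laid out like Hum2Mus_ID_genenames.txt (mouse accession, mouse symbol, human accession, human symbol). On real data you would pipe that file in instead of the toy variable.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # assumption: fixed locale for predictable sorting

key='ENSMUSG00000049382 Krt8 ENSG00000170421 KRT8
ENSMUSG00000050359 Sprr1a ENSG00000169469 SPRR1B
ENSMUSG00000050359 Sprr1a ENSG00000169474 SPRR1A'

# Keep unique (mouse symbol, human symbol) pairs, count human symbols per
# mouse symbol, and report only mouse genes with more than one.
one_to_many=$(printf '%s\n' "$key" \
  | awk '{print $2, $4}' | sort -u \
  | awk '{n[$1]++} END {for (g in n) if (n[g] > 1) print g, n[g]}' \
  | sort)
echo "$one_to_many"
```

Here Sprr1a is reported with two human counterparts (SPRR1A and SPRR1B), while Krt8 maps one-to-one and is not listed.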
