How to generate a rank file from gene expression data in R
Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.
Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.
The expression data for this demo looks like this (the first 5 columns).
Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.
The expression data for this demo looks like this (the first 5 columns).
Name logFC logCPM LR PValue
ENSG00000134294.9_SLC38A2 0.365464972841137 8.35504063447063 80.9697378748286 2.29200821312451e-19
ENSG00000164692.13_COL1A2 0.369440969815233 6.9371167845581 59.239238358843 1.39621819298599e-14
ENSG00000117152.9_RGS4 0.417512339007825 6.27805181231213 48.690887963755 2.99655418444362e-12
ENSG00000120738.7_EGR1 -0.448872068345368 5.72373823077344 40.5159113813129 1.95021420597484e-10
ENSG00000236552.1_RPL13AP5 -0.263118776572713 7.87437056751926 40.4331725003112 2.03457218801857e-10
ENSG00000104332.6_SFRP1 0.493423326062289 4.80359815087205 38.5863976773987 5.23827228439336e-10
ENSG00000187957.7_DNER 0.470592983554025 4.33515164897497 37.0721600878008 1.13837471974503e-09
ENSG00000111335.8_OAS2 0.506694725778856 4.64035138417216 36.5587612062587 1.48132665393875e-09
ENSG00000137809.12_ITGA11 0.358068410134363 6.32535717802246 35.7230925508279 2.27451845482865e-09
Here is the code for reading a tsv and calculating the rank metric you will need to replace column names with the ones in your dataset.
x<-read.table("expression_edgeR.xls",sep="\t",header=T)
attach(x)
x$fcSign=sign(logFC)
x$logP=-log10(PValue)
x$metric=logP/fcSign
y<-x[,c("Name", "metric")]
write.table(y,file="expression.rnk",quote=F,sep="\t",row.names=F)
Read table command loads the data as a data frame called "x", recognising the data is tab delimited and contains a header line.
Attach command allows R to recognise columns by their name rather than by their index ($1,$2,etc), this just makes things a bt easier.
"x$fcSign=sign(logFC)" generates a new column that is the sign of fold change.
"x$logP=-log10(PValue)" generates a new column that is the negative log10 of the p-value.
"x$metric=logP/fcSign" generates a new column that is the metric score that we'll use for ranking and pathway analysis
"y<-x[,c("Name", "metric")]" makes a new table called "y" that consists only of gene names and metric scores.
Write.table command outputs the gene name and rank metric data to a new file called "expression.rnk" gene
In the case above, the gene identifier and gene name are combined and will need to be split before pathway analysis. GSEA can use either Ensembl accession number or gene names. If you use gene names, then take note that many gene names are not unique and will need to be collapsed somehow.