How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).

Name                        logFC               logCPM            LR                PValue
ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19
ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14
ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12
ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10
ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056751926  40.4331725003112  2.03457218801857e-10
ENSG00000104332.6_SFRP1     0.493423326062289   4.80359815087205  38.5863976773987  5.23827228439336e-10
ENSG00000187957.7_DNER      0.470592983554025   4.33515164897497  37.0721600878008  1.13837471974503e-09
ENSG00000111335.8_OAS2      0.506694725778856   4.64035138417216  36.5587612062587  1.48132665393875e-09
ENSG00000137809.12_ITGA11   0.358068410134363   6.32535717802246  35.7230925508279  2.27451845482865e-09

Here is the code for reading a tsv and calculating the rank metric you will need to replace column names with the ones in your dataset.

y<-x[,c("Name", "metric")]

Read table command loads the data as a data frame called "x", recognising the data is tab delimited and contains a header line.

Attach command allows R to recognise columns by their name rather than by their index ($1,$2,etc), this just makes things a bt easier.

"x$fcSign=sign(logFC)" generates a new column that is the sign of fold change. 

"x$logP=-log10(PValue)" generates a new column that is the negative log10 of the p-value.

"x$metric=logP/fcSign" generates a new column that is the metric score that we'll use for ranking and pathway analysis

"y<-x[,c("Name", "metric")]" makes a new table called "y" that consists only of gene names and metric scores. 

Write.table command outputs the gene name and rank metric data to a new file called "expression.rnk" gene

In the case above, the gene identifier and gene name are combined and will need to be split before pathway analysis. GSEA can use either Ensembl accession number or gene names. If you use gene names, then take note that many gene names are not unique and will need to be collapsed somehow.


  1. Hello. Is there some other reason for using PValues instead of FDR or adjusted P values?

    1. As we are ranking only, it is valid to use the nominal p-value. In effect, the result in most cases will be the same, except in the case when most FDR values are 1. If you want to detect subtle pathway level differences then it would be advisable to use the nominal p-value. In short, if we use the FDR, then the overwhelming majority of genes will be bunched together in the middle, which might be undesirable.

  2. Mine is not expression dataset i have done machine learning approaches and i have predicted gene name and svm score but i have .csv file and i want to convert .rnk file or .gmt file

    1. Hi Tamizh, it should be possible to convert to a rank file if you have a p-value or effect size that you can make into a metric. Similarly if you set a cut-off, then you can obtain the genes satisfying this and turn them into sets in GMT format.

  3. Hi i would really like to use this method analyzing some data for my PhD. How is the best way for me to reference this, do you have a paper or a paper you recommend?

    1. Hi Bridget, thanks for the feedback :)
      This apprach was described by my former group in a paper:

  4. Can you please explain how to prepare that log Fc and log p value of given genes.

    1. Hello Jagdish- to get the logFC and p-values, a tool like DESeq2 or edgeR should be used.

  5. For running GSEA, would you recommend to rank the whole results from Deseq2 (DEG) or subset them first by log2FC, pvalue, etc? Thanks a lot!!

    1. Whole results from Deseq2. The original GSEA articles describe why this is important.


Post a Comment

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Installing R-4.0 on Ubuntu 18.04 painlessly

EdgeR or DESeq2? Comparing the performance of differential expression tools