How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash
#rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
XLS=$1
#Specify the gene ID column
ID=$2
#Specify the fold change value column
FC=$3
#Specify the raw p-value column
P=$4

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \

| awk '{print $1,$2,$2*$2}' | sort -k3gr \
| awk '{OFS="\t"} !arr[$1]++ {print $1,$2}' \
| sort -k2gr > ${XLS}.rnk

After you save the script to a file, allow it to be executed


chmod +x rnkgen.sh

The usage of the script is as follows:
./rnkgen.sh /path/to/spreadsheet.xls GeneID_column FoldChange_column P-value_column

So for a spreadsheet which looks like this one generated by Degust previously:
gene c aza FDR C1 C2 C3 A1 A2 A3
LYZ 0 -1.65809730597384 4.70051409587021e-16 17393 14675 16300 4951 5064 4841
IFI27 0 -3.40698193025161 1.01987358298544e-15 5695 5552 4630 475 532 442
SLC12A3 0 -3.46002052417193 6.78905193176251e-14 1129 1051 991 79 109 92
HLA-B 0 -1.41354726616356 6.78905193176251e-14 6678 5597 5567 2167 2281 2042
IFI6 0 -2.76026110717513 7.38087480096726e-14 5884 5044 4171 764 764 624
OAS3 0 -2.64899272132832 1.17759867975735e-13 5196 5812 4621 830 824 753
HLA-C 0 -1.39473306959746 1.18784239161695e-13 3715 3324 3382 1305 1350 1193
IGFBP5 0 -5.1043039727239 1.61487772234777e-13 1706 1282 1643 39 39 52
MX1 0 -2.84129882209861 1.61487772234777e-13 2161 2245 1909 277 339 242

So in this case we run it as follows:
./rnkgen.sh degust.xls 1 3 4 

And have a look at the top and bottom 5 lines of output:
CAMK1D 9.80546
AC011899.9 9.76077
CGNL1 9.56363
GNG4 9.42442
MKX 8.93011

IFI6 -13.1319
HLA-B -13.1682
SLC12A3 -13.1682
IFI27 -14.9915
LYZ -15.3279

We see that the result is just what we're after.

EDIT: Mac users take note that you should try the R solution.

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?