Geneclouds: unconventional genetics data visualisation

You have probably seen word clouds before, but have you tried making one with gene expression data?

I used the bash script at the end of this post to process a DESeq spreadsheet from a previous RNA-seq post. The script extracts the gene name and p-value of each differentially expressed gene, and uses awk to separate the up- and down-regulated genes into different files. The score that sets the font size is the exponent of the p-value with its minus sign dropped, so a p-value of 1e-54 gives a score of 54. This works best when many genes have very small p-values. The data looks like this:

AC011899.9:60
CAMK1D:54
GNG4:44
CGNL1:41
APCDD1:37
HSD11B1:33
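
To make the scoring concrete, here is a minimal sketch of how a single p-value becomes a score. The function name `pval_to_score` is hypothetical and not part of my original script; it only reuses the same printf, sed and cut steps on one value.

```shell
# Hypothetical helper (not in the original script): convert one p-value
# into the integer score used for the font size.
pval_to_score() {
    # %e prints scientific notation, e.g. 3.200e-60.
    # Replace the first "e-" with "@" and keep everything after it.
    printf '%4.3e\n' "$1" | sed 's/e-/@/' | cut -d '@' -f2-
}

pval_to_score 3.2e-60   # prints 60
pval_to_score 0.00012   # prints 04
```

Note that exponents below 10 come out with a leading zero, and a p-value of 1 or more has no "e-" at all, so values like that would need separate handling.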

Now go to the "advanced" tab of the Wordle page and paste in the data. Experiment with the colours and layouts, and be sure to increase the "maximum words" setting to get a real appreciation for the number of changes in your experiment. Here is an example I made using both the up- and down-regulated gene sets, showing the effect of azacytidine on AML3 cells. The result is pretty amazing.
Differential gene expression word cloud of AML3 cells treated with azacytidine. The left panel shows the up-regulated genes and the right panel shows the down-regulated genes. Font size is proportional to the exponent of the p-value.

Script used to process data

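# Up-regulated genes: column 6 > 0 and column 8 (p-value) < 0.05.
# Print the p-value in scientific notation, keep only the digits after "e-",
# then write one "gene:score" pair per line.
awk '$6>0 && $8<0.05 {print $1,$8}' DESeq.xls \
| awk '{printf "%4.3e\t%s\n", $2 , $1}' \
| sed 's/e-/@/' | cut -d '@' -f2- | awk '{print $2":"$1}' > ups.txt

# Down-regulated genes: same steps with column 6 < 0.
awk '$6<0 && $8<0.05 {print $1,$8}' DESeq.xls \
| awk '{printf "%4.3e\t%s\n", $2 , $1}' \
| sed 's/e-/@/' | cut -d '@' -f2- | awk '{print $2":"$1}' > dns.txt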
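
If you want to check the logic before running it on real data, the same filtering can be sketched as a single awk pass over a toy file. This is an alternative under stated assumptions, not the script I used for the figure. It assumes whitespace-separated columns with the gene name in column 1, the fold-change direction in column 6 and the p-value in column 8. The file names, gene names and values below are all made up for illustration.

```shell
#!/bin/sh
# Toy input with made-up values; the header names are illustrative only.
cat > toy_DESeq.xls <<'EOF'
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
GENE_A 100 50 150 3.0 1.58 1e-62 3.2e-60
GENE_B 100 150 50 0.33 -1.58 1e-12 4.1e-10
GENE_C 100 100 100 1.0 0.01 0.5 0.9
EOF

# One pass: keep significant rows, take the digits after "e-" as the score,
# and route each gene to the up or down file by the sign of column 6.
awk '$8 < 0.05 && ($6 > 0 || $6 < 0) {
    n = split(sprintf("%.3e", $8), part, "e-")
    if (n < 2) next                      # no negative exponent, skip the row
    out = ($6 > 0) ? "toy_ups.txt" : "toy_dns.txt"
    print ($1 ":" part[2]) > out
}' toy_DESeq.xls

cat toy_ups.txt   # GENE_A:60
cat toy_dns.txt   # GENE_B:10
```

The header row and GENE_C are dropped because their column 8 does not pass the 0.05 cutoff, so only the two significant toy genes reach the output files.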


