Posts

Showing posts with the label Degust

Introducing "Digital Expression Explorer"

Image
RNA-seq has been a blessing for molecular biologists, not only does RNA-seq provide unbiased transcriptome wide expression analysis, it can be mined for a variety of other information like splicing, SNV identification, RNA editing, TSS usage, etc. As the cost of RNA-seq declines, more and more labs are using it hence more and more data is being deposited at databases such as SRA and GEO.

But there is a growing problem.

The problem is a lack of uniformity of processed data on GEO. Processed data has assorted reference genomes, gene annotation sets, accession numbers, software pipelines, statistical analyses and output formats, that in most cases makes comparison of two experiments hard if not impossible, let alone three or more experiments. This is a burden on researchers who want to quickly extract expression information from public RNA-seq data. Many researchers then resort to downloading the raw data in SRA format then processing it with QC/alignment/quantification/statistical pipel…

Comparing expression profiles

Image
One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies:
Compare all data points - using a correlation analysisCompare sets of up and down-regulated genes - using a binomial or Fisher exact testCompare sets of genes within a profile - such as GSEA test In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes.

Merging data sets No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray.

The human data is currently in CSV format from Degust  and looks like this:

gene,c,aza,FDR,C1,C2,C3,A1,A2,A…

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash
#rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
XLS=$1
#Specify the gene ID column
ID=$2
#Specify the fold change value column
FC=$3
#Specify the raw p-value column
P=$4

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \
| awk '{print $1,$2,$2*…

User friendly RNA-seq differential expression analysis with Degust

Image
There is a need to make bioinformatics tools more user friendly and accessible to a wider audience. We have seen that Galaxy, GEO2RGenevestigator and GenePattern have each developed a huge following in the molecular biology community, and this trend will continue with introduction of new RNA-seq analysis tools. Previously, I posted about differential gene expression analysis of RNA-seq performed by the DEB online tool. In this post, I introduce Degust, an online app to analyse gene expression count data and determine which genes are differentially expressed. Degust was written by David R. Powell (@d_r_powell) and was Supported by Victorian Bioinformatics Consortium, Monash University and VLSCI's Life Sciences Computation Centre.

In this test, I'll be using the azacitidine mRNA-seq data set that I have previously analysed. To make the count matrix, I used featureCounts.

First step in the process is to your RNA-seq count data. It can be done in tab or comma separated formats. …