Posts

Showing posts with the label Gene set analysis

Pathway analysis: DAVID versus GSEA

Pathway analysis has become a common procedure in bioinformatics, especially in the study of gene expression. A survey of recent papers shows that there are many ways to do it. In this post, I'll discuss the differences between two commonly used tools: DAVID and GSEA.
The concepts behind the two algorithms are very different. DAVID determines overlaps between user-supplied gene lists and curated databases, looking for overlaps that are bigger than expected by random chance. You can improve the accuracy of the algorithm by providing a background file that contains all genes that were considered/detected in the experiment. The user-supplied gene lists are normally generated by selecting genes that pass a significance threshold. The DAVID procedure is similar to others available, such as Ingenuity, AmiGO and GeneGO. The choice of significance threshold is largely arbitrary, but it is common to set it at FDR-adjusted p<0.05. Modi…
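The overlap idea described above can be sketched with a plain hypergeometric test. This is only an illustration of the general over-representation approach, not DAVID's own scoring method, and the function name and the toy numbers are assumptions made for the example.

```python
from math import comb


def overlap_p_value(n_background, n_in_set, n_selected, n_overlap):
    """Probability of seeing at least n_overlap gene-set members when
    n_selected genes are drawn at random from n_background genes,
    n_in_set of which belong to the gene set."""
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    # Sum the hypergeometric probabilities for every overlap >= n_overlap.
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 4 in the pathway,
# 3 genes pass the significance threshold, 2 of those are in the pathway.
p = overlap_p_value(n_background=10, n_in_set=4, n_selected=3, n_overlap=2)
print(f"p = {p:.4f}")
```

Note how the background size enters the calculation directly, which is why supplying the list of detected genes changes the result.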

Pathway analysis with ZGST

Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are a lot of pathway analysis strategies available, and I can break them down into these groups:

Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software (Limma, edgeR, DESeq, etc.). These include CAMERA, MRGST, WilcoxGST, Roast, etc.
Commercial, GUI-based: such as Ingenuity IPA or MetaCore GeneGO.
Java-based: such as GSEA and GSAA.
Web-based tools: such as DAVID, WebGestalt and GO Enrichment analysis.

Each of these has its advantages and disadvantages. I wanted to see whether I could make a pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.
I wrote a pathway analysis tool in Bash. I know. It's crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. It…

Are we ready to move beyond MSigDB and start a community-based gene set resource?

Gene sets are distilled information about molecular profiling experiments, and can also be generated based on other features shared by groups of genes such as chromosomal position, sequence, co-regulation, functional information, etc.

These are a valuable resource because they suggest similarities between different molecular profiling experiments or phenomena, and help researchers understand the factors that drive the trends in profiling experiments such as gene expression assays by microarray or RNA-seq.

To truly grasp the importance of quality gene sets, consider that the original paper describing the GSEA algorithm has accumulated 3144 citations since 2003, while the paper describing the software and wider applicability of GSEA has 7166 citations. The latter paper has also attracted positive comments from experts in the field on PubMed; here is one that I couldn't agree with more. In the words of Rafael Irizarry, "The idea of analyzing differential expression for groups of g…

Data analysis step 9: Leverage ENCODE data for enhanced pathway analysis

So far in this series of posts we've analysed the effect of azacitidine on AML3 cell gene expression using publicly available RNA-seq data. Our pathway analysis showed many interesting trends, but unravelling all the different mechanisms from this point can be really hard. To identify the major players at the chromatin level, it can be useful to integrate transcription factor binding data and see whether targets of a particular transcription factor are differentially regulated in a pathway analysis. The problem with this analysis in the past was that ChIP-seq datasets were in varying formats on GEO, and processing these into a standardised format was too time-consuming. With the advent of the ENCODE Project, there is now a large body of transcription factor binding data in a uniform format (link) that is being mined in many creative ways. In our group, we have used this approach extensively (here, here and here).

In this post, we will mine ENCODE transcription factor bind…

Data analysis step 8: Pathway analysis with GSEA

In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.

We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some other methods. A 2013 paper compared several gene set analysis tools on microarray data and is worth a look.

Generate a rank file
The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation, and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).
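As a minimal sketch of the metric described above (sign of the fold change multiplied by the inverse of the p-value), the Python below scores and sorts a small table of results. The gene names and values are made up, the function names are hypothetical, and the `min_p` floor is an assumption added here so that a p-value of zero does not cause a division error.

```python
def rank_score(log_fc, p_value, min_p=1e-300):
    """Sign of the fold change multiplied by 1/p.
    min_p is an assumed floor that guards against p-values of exactly 0."""
    sign = (log_fc > 0) - (log_fc < 0)  # +1, -1, or 0 for no change
    return sign / max(p_value, min_p)


def make_rank_lines(results):
    """results: iterable of (gene, log fold change, p-value) tuples.
    Returns tab-separated 'gene<TAB>score' lines, most up-regulated first."""
    scored = [(gene, rank_score(lfc, p)) for gene, lfc, p in results]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [f"{gene}\t{score:g}" for gene, score in scored]


# Made-up differential expression results for illustration only.
results = [
    ("GENE_A", 2.1, 0.001),    # strongly up-regulated
    ("GENE_B", -1.5, 0.0001),  # strongly down-regulated
    ("GENE_C", 0.1, 0.8),      # essentially unchanged
]

for line in make_rank_lines(results):
    print(line)
```

The sorted lines can be written to a two-column text file, which is the layout the rank file described above uses: up-regulated genes at the top, unchanged genes in the middle and down-regulated genes at the bottom.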

Generating a custom gmt file for gene set analysis

Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG, Reactome, GO, MSigDB and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt. I'm going to take you through two examples of generating custom gene sets:

Generate gene sets from published data sets using GEO2R
Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med. Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma. Because it is embedded in the webpage and has a GUI, it is very accessible for biologists.

Click this link to go directly t…
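However a gene list is obtained, turning it into a gmt entry is straightforward: each line holds a set name, a description and then the member genes, all separated by tabs. The sketch below assumes that layout; the set name, description, gene symbols and function name are placeholders, not results from the dataset mentioned above.

```python
def to_gmt_line(set_name, description, genes):
    """Build one gmt line: name, description, then genes, tab-separated.
    Duplicate genes are dropped while keeping their original order."""
    unique_genes = list(dict.fromkeys(genes))
    return "\t".join([set_name, description] + unique_genes)


# Placeholder gene set for illustration only.
line = to_gmt_line(
    "EXAMPLE_KNOCKDOWN_UP",
    "hypothetical genes up-regulated after knockdown",
    ["GENE1", "GENE2", "GENE1", "GENE3"],
)
print(line)

# To build a gmt file, write one such line per gene set, for example:
# with open("custom_sets.gmt", "w") as handle:
#     handle.write(line + "\n")
```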