Mitch gets upgraded - now with gene set networks

Interpreting pathway enrichment analysis results is a big challenge. There may be hundreds of statistically significant pathways from an analysis and getting to a shortlist of key mechanisms to follow up with experiments and describe in a publication is difficult. I recently got a request from a collaborator to come up with a way to visualise the key networks. I groaned... because network analysis in bioinformatics is sometimes characterised by showing hairballs of hundreds/thousands of meaningless interactions. Mostly it is done poorly and the charts themselves do not have any explanatory function and mostly appear to be decorative. After seeing a few well done examples such as Figure 2 from Chappel et al (PMID: 32138627), I thought this could be something we include as a common step in the mitch workflow. For those of you unaware, mitch is the R/bioconductor package that Dr Antony Kaspi and I published in 2020 (PMID: 32600408) with the main focus being on multi-dimensional enrichment analysis.

In this post we'll look at some data comparing control and SARS-COV2 infected cells, that is publicly available on NCBI SRA. The full workflow (R Markdown script and html report) is uploaded here including fetching the expression counts and differential expression (DE). The DE table looks like this:

Gene IDbaseMeanlog2FoldChangelfcSEstatpvalue padj
ENSG000001137393523.7344.0045930.101256839.5488800
ENSG000001761538851.309-2.7853120.0820808-33.9337800
ENSG000000580855767.1113.6990410.114833432.2122300
ENSG000001697106917.854-2.2591580.0703230-32.1254700
ENSG0000017042141561.919-2.2334820.0742019-30.1000700
ENSG000001002972476.082-2.3643830.0802583-29.4596500

There are some strong gene expression changes as shown by the volcano plot.

Volcano plot showing SARS-COV-2 causes over 8000 differentially expressed genes.

Next, we can use mitch to run the pathway enrichment analysis.

Step 1 is to get the pathways from Reactome loaded in. Reactome has a download page with the GMT file.

genesets <- gmt_import("ReactomePathways.gmt")

Next, use mitch import to score the de table and add the gene symbols according to a mapping table called gt.

m <- mitch_import(de,DEtype = "DEseq2", geneTable = gt)

This generates a dataframe with one column that shows the gene score. In this case the DEseq2 test statistic.

Next, run mitch enrichment tests. Note that I'm using prioritisation by effect size. This is the preferred way as it will prioritise relatively specific pathways with large changes in expression. On the other hand if you want to prioritise by significance, you will likely get large gene sets that show relatively subtle expression changes that I find uninteresting.

mres <- mitch_calc(m,genesets = genesets,priority="effect",cores=8)
This resulted in 101 up-regulated and 339 down-regulated pathways using the FDR<0.05 threshold. Then we can show the pathways with the biggest effect sizes after discarding pathways with FDR>0.05. The x-axis here shows the enrichment score (in mitch results this is called s.dist).
Barplot showing up and downregulated pathways after infection with SARS-COV-2

While this is very interesting to see some highly relevant pathways showing strong changes in response to SARS-COV-2 infection. Additional plots can be generated in PDF format using the mitch_plots() function, including a volcano plot of pathways and enrichment plots of the top 50 pathways.

To generate the network plot, use the following command:

networkplot(res,FDR=0.05,n_sets=20)
It will generate two network charts. One for up-regulated (pink-red) and one for down-regulated pathways (light blue-blue). The purpose is to show the similarity in driver genes between these sets. Only genes in the top tertile are retained. The intensity of the node relates to the enrichment score. The circle sizes depict the number of driver genes. The line width indicates the Jaccard similarity between driver gene sets. The circle size and line width legends are approximate only.
Pathway network chart focusing on those upregulated by SARS-COV-2 infection


Pathway network chart focusing on those down-regulated by SARS-COV-2 infection
From these charts we can appreciate that infection causes up-regulation of FLT3, EGFR and interleukin-20 and -15 pathways that share some common members. Also there's a distinct group of two pathways related to nutrient starvation, and another group showing up-regulation of chemokine and interleukin-10 signaling. In terms of down-regulated gene sets, we see two clusters of sets related to DNA replication and repair.

The FDR threshold and number of pathways to show in each chart are customisable.

This visualisation works best with an extra-wide aspect, for example 14 wide and 9 high. This helps to prevent the longer pathway names from being cut off. If you are using R Markdown you can use the chunk header option "fig.height=9,fig.width=14". My other recommendation is to set the output device a SVG, which will allow you to move around the pathway labels to make presentation clearer. To do that, simply include the following chunk somewhere above your networkplot() call.

knitr::opts_chunk$set(dev = 'svg') 

There is some aspect of randomness when it comes to generating these charts, so if you get one that looks to have too many ugly overlapping labels, you can generate a few more and pick the most aesthetically appealing one. It uses the R/CRAN package "network" to generate the network.

When it comes to investigating the key driver genes underpinning these charts, use the network_genes() function. It will show all the driver genes for each pathway. This will help you when deciding which genes should be investigated further.

network_genes(mres,FDR=0.05,n_sets=20)

networkplot() and network_genes() functions have been added to the development version of mitch, which is available from Bioconductor and from GitHub. I hope you find it useful!

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts