Data analysis step 5: Differential analysis of RNA-seq

So far in this RNA-seq analysis series of posts, we've done a whole bunch of primary analysis on GSE55125 and now we are at the stage where we can now perform a statistical analysis of the count matrix we generated in the last post and look at the genes expression differences caused by Azacitidine.

For this type of analysis we could load our data into R and perform the analysis ourselves, but for a simple experiment design with 2 sample groups in triplicate without batch effects or sample pairing I want to share with you an easy solution. DEB is a online service provided by the  Interdisciplinary Center for Biotechnology Research (ICBR) University of Florida that will analyse the count matrix for you with either DESeq, edgeR or baySeq. Their Bioinformation paper is also worth a look.

As with all aspects of bioinformatics, format is critical. You need to follow the specified format exactly. Here is what the head of my count matrix looks like:

gene UNTR1 UNTR2 UNTR3 AZA1 AZA2 AZA3
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60



I then uploaded the count matrix to DEB to perform differential gene expression analysis using all three statistical algorithms with a significance threshold of 5% after false discovery rate (FDR) correction.
Number of differentially expressed genes (FDR≤0.05) using three popular algorithms for RNA-seq analysis.
Then I wanted to know what the overlap was between the three algorithms, so I drew Venn diagrams with Venny.
Venn analysis of differentially expressed genes (FDR≤0.05) using three popular algorithms for RNA-seq analysis. (A) Up-regulated by Azacitidine. (B) Down-regulated by Azacitidine.


This shows that overall, the three algorithms were very similar, although edgeR was the least conservative and baySeq was the most conservative.

Here is the smear plot of the effect of azacitidine using DESeq. The version generated by edgeR was virtually identical.
Smear plot from DESeq for the comparison of azacitidine treated cells versus the control cells (GEO link). The x-axis shows the average expression level and the y-axis shows the fold change. Both axes are on the log scale. DEB server was used for statistical analysis.
As you can see, the range of fold changes is quite striking, with several genes with log2 fold changes less than -6, and only 2 genes with fold change greater than +2.

Below is the top 20 up- and down-expressed genes by azacitidine.
Top 20 up- and down-expressed genes by azacitidine determined by DESeq.

In the next post, we will draw a heatmap.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?