Showing posts with the label DESeq

Incorporate dee2 data into your R-based RNA-seq workflow

Dee2.io is a portal for accessing gene expression data derived from public RNA-seq datasets. So far there are over 400k datasets available, and the number is growing every day. While there are existing databases such as Expression Atlas, Recount2 and ARCHS4, dee2.io offers a number of unique benefits. For instance, dee2 includes gene-wise counts from STAR as well as transcript-wise quantifications from Kallisto. There are a few ways you can access these data. Firstly, there is a nice web interface that is mobile friendly. Secondly, there are data dumps available if you are running a large-scale analysis. But the purpose of this post is to demonstrate the improved R interface in action together with SRAdbv2 and statistics with edgeR and DESeq. The official documentation is available on GitHub.
Getting started
This tutorial provides a walkthrough for how to work with dee2 expression data, starting with dataset searches, obtaining the data from dee2.io and then performing a differential analysi…

User friendly RNA-seq differential expression analysis with Degust

There is a need to make bioinformatics tools more user friendly and accessible to a wider audience. We have seen that Galaxy, GEO2R, Genevestigator and GenePattern have each developed a huge following in the molecular biology community, and this trend will continue with the introduction of new RNA-seq analysis tools. Previously, I posted about differential gene expression analysis of RNA-seq performed by the DEB online tool. In this post, I introduce Degust, an online app to analyse gene expression count data and determine which genes are differentially expressed. Degust was written by David R. Powell (@d_r_powell) and was supported by the Victorian Bioinformatics Consortium, Monash University and VLSCI's Life Sciences Computation Centre.

In this test, I'll be using the azacitidine mRNA-seq data set that I have previously analysed. To make the count matrix, I used featureCounts.

The first step in the process is to upload your RNA-seq count data. It can be in tab- or comma-separated format. …

Geneclouds: unconventional genetics data visualisation

You have probably seen word clouds before - but have you tried with gene expression data?

I used the following bash script to process a DESeq spreadsheet from a previous RNA-seq post. The script extracts the gene name and p-value of the genes with differential expression. I used awk to separate the up- and down-regulated genes into different files. The score used to set the font size is the magnitude of the exponent of the p-value (so a p-value on the order of 1e-60 gives a score of 60). This works best when there are a lot of genes with highly significant p-values. The data looks like this:

AC011899.9:60
CAMK1D:54
GNG4:44
CGNL1:41
APCDD1:37
HSD11B1:33
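
As a rough illustration of this kind of processing, here is a minimal bash/awk sketch. It is not the script from the post: the input file name, the column order (gene name, log2 fold change, p-value), the 0.05 cutoff and the example rows are all assumptions made up for the demonstration. It writes up- and down-regulated genes to separate files in the gene:score format, taking the score from the exponent of the p-value in scientific notation.

```shell
#!/bin/bash
# Build a tiny made-up DESeq-style table (tab-separated: gene, log2 fold change, p-value).
# The gene names, values, column order and file names are all hypothetical.
printf 'gene\tlog2FoldChange\tpval\n'  > deseq_example.tsv
printf 'GENE_A\t2.1\t3.2e-54\n'       >> deseq_example.tsv
printf 'GENE_B\t-1.8\t1.0e-44\n'      >> deseq_example.tsv
printf 'GENE_C\t0.1\t0.73\n'          >> deseq_example.tsv

# Start from empty output files so repeated runs do not mix results.
: > up.txt
: > down.txt

# Skip the header, keep only genes with 0 < p < 0.05 (assumed cutoff).
awk -F'\t' 'NR > 1 && ($3 + 0) > 0 && ($3 + 0) < 0.05 {
    # Format the p-value in scientific notation, e.g. "3.200000e-54",
    # and take the digits after "e-" as the font-size score.
    split(sprintf("%e", $3 + 0), parts, "e-")
    score = parts[2] + 0
    # Positive fold change goes to up.txt, negative to down.txt.
    out = ($2 + 0 > 0) ? "up.txt" : "down.txt"
    print $1 ":" score >> out
}' deseq_example.tsv

# Show the word-cloud input for both gene sets.
cat up.txt down.txt
```

The contents of up.txt and down.txt can then be pasted into the word cloud tool, either separately or together.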

Now go to the "advanced" tab of the Wordle page and paste in the data. Experiment with the colours and layouts, and be sure to increase the "maximum words" to get a real appreciation for the number of changes in your experiment. Here is an example I made using both the up- and down-regulated gene sets showing the effect of azacytidine on AML3 cells. The result is pretty amazing.
Scrip…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners: Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base quality threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here are the data for the first sample in the series, SRR634969, which contained 14.5 million reads.

The first thing…

Data analysis step 6: Draw a heatmap from RNA-seq data using R

In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.

It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:

1. Read the count matrix and DESeq table into R and merge into one table
2. Sort based on p-value with most significant genes on top
3. Select the columns containing gene name and raw counts
4. Scale the data per row
5. Select the top 100 genes by significance
6. Generate the heatmap with mostly default values

Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use buil…

Data analysis step 5: Differential analysis of RNA-seq

So far in this RNA-seq analysis series of posts, we've done a whole bunch of primary analysis on GSE55125, and we are now at the stage where we can perform a statistical analysis of the count matrix we generated in the last post and look at the gene expression differences caused by azacitidine.

For this type of analysis we could load our data into R and perform the analysis ourselves, but for a simple experiment design with 2 sample groups in triplicate, without batch effects or sample pairing, I want to share with you an easy solution. DEB is an online service provided by the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida that will analyse the count matrix for you with either DESeq, edgeR or baySeq. Their Bioinformation paper is also worth a look.

As with all aspects of bioinformatics, format is critical. You need to follow the specified format exactly. Here is what the head of my count matrix looks like:

gene UNTR1 UNTR2 UNTR3 AZA1 AZA2 AZA3
ENSG000…