Little known UNIX features to avoid writing temporary files in your data pipelines explained by Vince Buffalo in his digital notebook. Introducing named pipes and process substitution.
Back in 2015, our group described DEE, a user friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where this project is going in future.
Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used, gave suggestions, comments, bug reports, etc! However our attempt to have the repository published wasn't so successful due to reviewer niggles over what I consider minor points but hard to implement quickly. The main points raised by reviewers were:
Is it reasonable to treat all data sets as if they were single end? For this one, the reviewers were split, one said it was OK and the other was adamant that it was unacceptable despite my …
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.
We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared a bunch of gene set analyses software with microarrays and is worth a look.
Generate a rank file
The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).
In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.
It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:
Read the count matrix and DESeq table into R and merge into one tableSort based on p-value with most significant genes on topSelect the columns containing gene name and raw countsScale the data per rowSelect the top 100 genes by significanceGenerate the heatmap with mostly default values
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use buil…
Comments
Post a Comment