Mark Ziemann, Atul Kamboj, Runyararo M Hove, Shanon Loveridge, Assam El-Osta, Mrinal Bhave
Acta Physiologiae Plantarum
Abstract: Salinity is a threat to crops in many parts of the world, and together with drought, it is predicted to be a serious constraint to food security. However, understanding the impact of this stressor on plants is a major challenge due to the involvement of numerous genes and regulatory pathways. While transcriptomic analyses of barley (Hordeum vulgare L.) under salt stress have been reported with microarrays, there are no reports as yet of the use of mRNA-Seq. We demonstrate the utility of mRNA-Seq by analysing cDNA libraries derived from acutely salt-stressed and unstressed leaf material of H. vulgare cv. Hindmarsh. The data yielded >50 million sequence tags which aligned to 26,944 sequences in the Unigene reference database. To gain maximum information, we performed de novo assembly of unaligned reads and discovered >3,800 contigs, termed novel tentative consensus sequences, which are either new, or significant improvements on current databases. Differential gene expression screening found 48 significantly up-regulated and 62 significantly down-regulated transcripts. The work provides comprehensive insights into genome-wide effects of salinity and is a new resource for the study of gene regulation in barley and wheat. Further, the bioinformatics workflow may be applicable to other non-model plants to establish their transcriptomes and identify unique sequences.
Mark Ziemann, Antony Kaspi, Ross Lazarus, Assam El-Osta
Abstract: Reliable identification of cis regulatory elements influencing transcription remains a challenging problem in molecular bioinformatics. This is especially true for enhancer elements which are often located hundreds of kilobases from the gene promoter. High resolution DNase hypersensitivity and connectivity profiling by the ENCODE consortium provides evidence of millions of interacting cis-acting elements in the human genome. This prior knowledge can be incorporated into genome-wide expression analyses, in the form of gene sets sharing regulatory sequence motifs in known DNase hypersensitivity peak regions. High proportions of enrichment among the most extreme differentially transcribed genes from controlled biological experiments may suggest novel hypotheses about signalling pathways. The utility of this approach is demonstrated with the reanalysis of a microarray-derived gene expression data set through the Gene Set Enrichment Analysis pipeline, uncovering new putative distal cis elements in the context of innate immunity. The DNase Hypersensitivity Connectivity informed Motif Enrichment in Gene Expression (DHC-MEGE) method described here has the advantage of identifying distal elements such as enhancers, which are often overlooked with standard promoter motif analysis.
Abstract: Plant and animal genomes are replete with large gene families, making the task of ortholog identification difficult and labor intensive. OrthoRBH is an automated reciprocal blast pipeline tool enabling the rapid identification of specific gene families of interest in related species, streamlining the collection of homologs prior to downstream molecular evolutionary analysis. The efficacy of OrthoRBH is demonstrated with the identification of the 13-member PYR/PYL/RCAR gene family in Hordeum vulgare using Oryza sativa query sequences. OrthoRBH runs on the Linux command line and is freely available at SourceForge.
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.
We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some other methods. A 2013 paper compared a range of gene set analysis tools on microarray data and is worth a look.
Generate a rank file
The rank file is a list of detected genes, each with a rank metric score. Genes with the "strongest" up-regulation sit at the top of the list, genes with the "strongest" down-regulation sit at the bottom, and genes that are not changing fall in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).
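As a rough illustration of that metric, here is a minimal Python sketch. It is not the script used in this series: the function names, the example genes and values, and the `min_p` floor (added only to avoid dividing by zero) are all assumptions made for illustration. The output is written as gene and score separated by a tab, which is the two-column layout commonly used for pre-ranked GSEA input.

```python
def rank_score(log2_fold_change, p_value, min_p=1e-300):
    """Sign of the fold change multiplied by the inverse of the p-value."""
    p = max(p_value, min_p)  # hypothetical floor so p = 0 cannot divide by zero
    sign = (log2_fold_change > 0) - (log2_fold_change < 0)  # -1, 0 or 1
    return sign / p


def make_rank_lines(results):
    """results: iterable of (gene, log2 fold change, p-value) tuples."""
    scored = [(gene, rank_score(lfc, p)) for gene, lfc, p in results]
    # Strongest up-regulation first, strongest down-regulation last.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [f"{gene}\t{score}" for gene, score in scored]


# Made-up values purely to show the ordering.
example = [
    ("GENE_A", 2.5, 1e-8),    # strongly up
    ("GENE_B", -1.8, 1e-6),   # strongly down
    ("GENE_C", 0.1, 0.9),     # not changing
    ("GENE_D", -0.05, 0.8),   # not changing
]
rank_lines = make_rank_lines(example)
print("\n".join(rank_lines))
```

With these values GENE_A lands at the top, GENE_B at the bottom, and the two unchanged genes in between, which is the ordering described above.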
In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.
It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:
1. Read the count matrix and DESeq table into R and merge into one table
2. Sort based on p-value with most significant genes on top
3. Select the columns containing gene name and raw counts
4. Scale the data per row
5. Select the top 100 genes by significance
6. Generate the heatmap with mostly default values
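To make the data preparation concrete, the steps before the plotting call can be sketched in a few lines of plain Python. This is not the R script from the post; the function name, the dictionary inputs and the example numbers are hypothetical. Because the scaling is done per row, taking the top genes before or after scaling gives the same result, so the sketch slices first.

```python
from statistics import mean, stdev


def top_scaled_rows(counts, pvalues, top_n=100):
    """Merge counts with p-values, sort, keep the top genes, scale each row.

    counts:  {gene: [count per sample, ...]}  (hypothetical input layout)
    pvalues: {gene: p-value}
    """
    # Merge: keep only genes present in both tables.
    merged = [(gene, pvalues[gene], counts[gene])
              for gene in counts if gene in pvalues]
    # Sort with the most significant genes on top.
    merged.sort(key=lambda row: row[1])
    scaled = {}
    for gene, _p, values in merged[:top_n]:
        mu, sd = mean(values), stdev(values)
        # Row-wise z-score; a constant row is set to zeros rather than dividing by 0.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


# Made-up counts for four genes across four samples.
counts = {
    "G1": [10, 12, 30, 34],
    "G2": [5, 5, 5, 5],
    "G3": [100, 90, 20, 25],
    "G4": [1, 2, 3, 4],
}
pvalues = {"G1": 1e-5, "G2": 0.7, "G3": 1e-9, "G4": 0.04}

scaled = top_scaled_rows(counts, pvalues, top_n=2)
for gene, row in scaled.items():
    print(gene, [round(v, 2) for v in row])
```

The resulting rows are what would be handed to a heatmap function; the plotting itself is left out here.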
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use buil…
Aligning reads to the genome is a key step in nearly all NGS data pipelines, and the quality of an alignment will dictate the quality of the final results. For beginners in this space, the options available can be a bit overwhelming.
Which options are available?
Depending on what species you are working on, you will have either a limited number of choices or a vast number of choices. These include NCBI, Ensembl, UCSC as well as the consortia that generate these genome builds, such as the Genome Reference Consortium for human and TAIR for Arabidopsis. My recommendation at this point is Ensembl, for a number of reasons:
- The genome build and version are clear from the file names alone. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for Ensembl.
- From the Ensembl file name you can tell whether it's masked, and whether it's "primary assembly" or "toplevel".
- The website is intuitive, ftp downloads are fast…
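To show how much an Ensembl file name tells you, here is a small Python sketch that splits one into its parts. The function name and the returned keys are hypothetical, and the token meanings are my reading of the Ensembl naming convention ("dna" unmasked, "dna_sm" soft-masked, "dna_rm" hard-masked), so check them against the README in the download directory.

```python
# Sequence-type tokens as I understand the Ensembl naming convention.
MASKING = {
    "dna": "unmasked",
    "dna_sm": "soft-masked",
    "dna_rm": "hard-masked",
}


def describe_ensembl_fasta(filename):
    """Split an Ensembl genome FASTA file name into its labelled parts."""
    name = filename
    for suffix in (".gz", ".fa"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    parts = name.split(".")
    # Locate the sequence-type token; build names may themselves contain dots.
    for idx, token in enumerate(parts):
        if token in MASKING and idx > 0:
            return {
                "species": parts[0],
                "build": ".".join(parts[1:idx]),
                "masking": MASKING[token],
                "region": ".".join(parts[idx + 1:]),
            }
    raise ValueError(f"no sequence-type token found in {filename!r}")


info = describe_ensembl_fasta(
    "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
)
print(info)
```

By contrast, "hg38.fa.gz" carries none of these tokens, so the same function raises a ValueError for it.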