Posts

Showing posts with the label azacitidine

Quantifying repeat sequences from short read next gen sequencing data

Image
Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any sequence read that is shorter than the repeat unit is going to fail at being uniquely placed in the genome. At a given read length, genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and ignore reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? Why is that even relevant?

The "Why" About 50% of the human genome is comprised of repetitive DNA (Ref). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers (Ref). Endogenous retroviruses have been domesticated to boost antiviral response in B cells after exposure to T cel…

Screening RNA-seq data for novel transcripts with samtools mpileup and UCSC genome browser

Image
The unbiased nature of deep transcript sequencing makes it the ideal technology to discover novel uncharacterised genes. Lets screen our favourite RNA-seq experiment (azacitidine-treated AML3 human cells, GSE55123) for novel expressed genes. I use Ensembl gene annotations.

We'll start with preparing bed files of exons and gene bodies

grep -w exon Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 | uniq > Homo_sapiens.GRCh38.76_exons.bed

grep -w gene Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 > Homo_sapiens.GRCh38.76_genes.bed

Now for convenience, I'll merge the data from the three replicates with samtools.

samtools view -H SRR1171523_1.fastq.sort.bam > header.txt

samtools merge -h header.txt Ctrl.bam SRR1171523_1.bam SRR1171523_2.bam SRR1171524_1.bam SRR1171524_2.bam SRR1171525_1.bam SRR1171525_2.bam

samtools merge -h header.txt Aza.bam SRR1171526_1.bam SRR1171526_2.bam SRR1171527_1.bam SRR1171527_2.bam S…

Geneclouds: unconventional genetics data visualisation

Image
You have probably seen word clouds before - but have you tried with gene expression data?

I used the following bash script to process a DESeq spreadsheet from a a previous RNA-seq post. The script extracts the gene name and p-value of the genes with differential expression. I used awk to separate the up and down regulated genes into different files. The score used to inform the font size is the exponent of the p-value. So this works best when there are a lot of statistically significant genes p-values. The data looks like this:

AC011899.9:60
CAMK1D:54
GNG4:44
CGNL1:41
APCDD1:37
HSD11B1:33

Now go to the "advanced" tab of the Wordle page and paste in the data. Experiment, with the colours, layouts and be sure to increase the "maximum words" to get a real appreciation for the number of changes in your experiment. Here is an example I made using both the up and down regulated gene sets showing the effect of azacytidine on AML3 cells. The result is pretty amazing.
Scrip…

Data analysis step 9: Leverage ENCODE data for enhanced pathway analysis

Image
So far in this series of posts we've analysed the effect of azacitidine on AML3 cell gene expression using publicly available RNA-seq data. Our pathway analysis showed many interesting trends, but unravelling all the different mechanisms from this point can be really hard. In order to identify the major players at the chromatin level, it can be useful to integrate transcription factor binding data and see whether targets of a particular transcription factor are differentially regulated in a pathway analysis. The problem with this analysis in the past was that ChIP-seq datasets were in varying formats on GEO and processing these into a standardised format would be too time consuming. With the advent of the ENCODE Project, there is now a large body of transcription factor binding data in a uniform format (link), that is being mined in many creative ways. In our group, we used this approach extensively (here, here and here).

In this post, we will mine ENCODE transcription factor bind…

Data analysis step 8: Pathway analysis with GSEA

Image
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.

We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared a bunch of gene set analyses software with microarrays and is worth a look.
Generate a rank file The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).

Data analysis step 6: Draw a heatmap from RNA-seq data using R

Image
In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.

It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:

Read the count matrix and DESeq table into R and merge into one tableSort based on p-value with most significant genes on topSelect the columns containing gene name and raw countsScale the data per rowSelect the top 100 genes by significanceGenerate the heatmap with mostly default values Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use buil…

Data analysis step 5: Differential analysis of RNA-seq

Image
So far in this RNA-seq analysis series of posts, we've done a whole bunch of primary analysis on GSE55125 and now we are at the stage where we can now perform a statistical analysis of the count matrix we generated in the last post and look at the genes expression differences caused by Azacitidine.

For this type of analysis we could load our data into R and perform the analysis ourselves, but for a simple experiment design with 2 sample groups in triplicate without batch effects or sample pairing I want to share with you an easy solution. DEB is a online service provided by the  Interdisciplinary Center for Biotechnology Research (ICBR) University of Florida that will analyse the count matrix for you with either DESeq, edgeR or baySeq. Their Bioinformation paper is also worth a look.

As with all aspects of bioinformatics, format is critical. You need to follow the specified format exactly. Here is what the head of my count matrix looks like:

geneUNTR1UNTR2UNTR3AZA1AZA2AZA3
ENSG000…

RNA-seq analysis step-by-step

In this series of posts, we're going to go step-by-step into analysing RNA-seq data. I found a nice data set on GEO containing RNA-seq and bisulfite sequencing data from AML3 cells treated with the drug Azacitidine (GSE55125). This drug is known to block DNA methylation, so it will be interesting to see how this effects gene expression and whether we can learn anything extra about the mechanisms of this potential anticancer drug.

Many thanks to the data contributors at the Beatson Institute for Cancer Research, University of Glasgow.

Step 1: Download from GEO and convert to fastq
Step 2: Quality control of RNA-seq data
Step 3: Align paired end RNA-seq with Tophat
Step 4: Count aligned reads and create count matrix
Step 5: Differential analysis of RNA-seq
Step 6: Draw a heatmap of gene expression
Step 7: MDS plot
Step 8: Pathway analysis with GSEA
Step 9: Integration of ENCODE transcription factor binding data