Showing posts from August, 2014

Data analysis step 1: download from GEO and convert to fastq

In this post we will be downloading human RNA-seq data from GEO accession  GSE55123 . Now you would have thought that this would be easy, but you have to understand that the data we download from GEO is in NCBI's short read archive format (SRA). To unpack the original sequence files can be a bit tricky at first, even the size of the SRA toolkit manual is enough to make you cringe. So start (in linux) by making a text file containing all the SRA file links fron the NCBI ftp site. Let's call it "url.txt". http:

RNA-seq analysis step-by-step

In this series of posts, we're going to go step-by-step into analysing RNA-seq data. I found a nice data set on GEO containing RNA-seq and bisulfite sequencing data from AML3 cells treated with the drug Azacitidine ( GSE55125 ). This drug is known to block DNA methylation, so it will be interesting to see how this effects gene expression and whether we can learn anything extra about the mechanisms of this potential anticancer drug. Many thanks to the data contributors at the Beatson Institute for Cancer Research, University of Glasgow. Step 1: Download from GEO and convert to fastq Step 2: Quality control of RNA-seq data Step 3: Align paired end RNA-seq with Tophat Step 4: Count aligned reads and create count matrix Step 5: Differential analysis of RNA-seq Step 6: Draw a heatmap of gene expression Step 7: MDS plot Step 8: Pathway analysis with GSEA Step 9: Integration of ENCODE transcription factor binding data