Showing posts from June, 2013

Generating a custom gmt file for gene set analysis

Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG , Reactome , GO ,  MSigDB  and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt . I'm going to take you through two examples of generating custom gene sets: Generate gene sets from published data sets using GEO2R Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med.  Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma but this is embedded in the webpage and has a GUI which makes it very accessible for biologists. Click this link

Generate an RNA-seq count matrix: Part 2 - count over exons

In the previous post , I mentioned that in order to do counting, we need some regions (exons) to count over. We generated a BED file based on GENCODE exons which looks like this: chr9 33248310 33248419 ENSG00000107262.11_BAG1 chr9 33252477 33255306 ENSG00000107262.11_BAG1 chr9 33255862 33255925 ENSG00000107262.11_BAG1 Now we can actually count how many reads from each of our samples maps to these exons. To do that there are a few different programs: htseq-count samtools view GATK countreads Bams2mx BedTools  multicov For this demo, I will outline use of BedTools for RNA-Seq. I think BedTools nulticov is a good option for 3 main reasons: It can handle multiple bam files at the same time, producing a count table/matrix which is pretty much ready for statistical analysis. It is aware of mapQ scores of alignments, meaning that if the program comes across non-unique alignments, you can set it to ignore or include those reads. The BedTools package is well docu

Generate an RNA-seq count matrix: Part 1 - GTF2BED

Previously we took a look at how to align mRNA-Seq reads to a genome with BWA and Tophat . The next step in the pipeline is to count reads which align on exons and then to get them into a count matrix that we can submit for statistical analysis with EdgeR or another program. At this point in the process, we need to decide which annotation set to use. For human, we commonly use RefSeq and we are beginning to use GENCODE annotation because of its very comprehensive nature. Most of the programs which count reads require a BED file to indicate which genomic regions should be counted. For this blog post, I will show you how we generate an exon BED file for GENCODE genes. The script below will download the GTF file and generate a temp file list of all exons present in the GTF in the form of a BED file (chromosome, start, end) with the gene accession number and gene name pasted into one field. chr12 38710380 38710866 ENSG00000175548.4_ALG10B chr12 38710564 38710866 ENSG00000