Generate an RNA-seq count matrix: Part 1 - GTF2BED

Previously, we looked at how to align mRNA-Seq reads to a genome with BWA and TopHat. The next step in the pipeline is to count the reads that align to exons and assemble those counts into a matrix that we can submit for statistical analysis with edgeR or another program.

At this point in the process, we need to decide which annotation set to use. For human, we commonly use RefSeq, and we are beginning to use the GENCODE annotation because it is very comprehensive. Most programs that count reads require a BED file to indicate which genomic regions should be counted.

For this blog post, I will show you how we generate an exon BED file for GENCODE genes. The script below will download the GTF file and write a temporary file listing every exon present in the GTF in BED format (chromosome, start, end), with the gene accession number and gene name joined into one field.

chr12	38710380	38710866	ENSG00000175548.4_ALG10B
chr12	38710564	38710866	ENSG00000175548.4_ALG10B
chr12	387105…
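
As a rough illustration of the GTF-to-BED step described above, here is a minimal Python sketch. It is not the script used for this post. The function name `gtf_exons_to_bed` and the sample record are made up for the example, and it assumes the GTF attribute column carries `gene_id` and `gene_name` as quoted values. It also assumes you want the GTF 1-based start shifted to the 0-based start that BED uses, which may or may not match what your counting program expects.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gtf_exons_to_bed(lines):
    """Yield BED lines (chrom, start, end, geneid_genename) for exon records."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only exon features
        attrs = dict(ATTR_RE.findall(fields[8]))
        name = f'{attrs.get("gene_id", "NA")}_{attrs.get("gene_name", "NA")}'
        # Assumption: GTF starts are 1-based inclusive, BED starts are 0-based.
        start = int(fields[3]) - 1
        end = int(fields[4])
        yield f"{fields[0]}\t{start}\t{end}\t{name}"


if __name__ == "__main__":
    # Hypothetical record, not taken from the GENCODE file.
    demo = [
        'chrTest\tHAVANA\texon\t101\t200\t.\t+\t.\t'
        'gene_id "GENE0001.1"; transcript_id "TX1"; gene_name "DEMO"; level 2;\n'
    ]
    for bed_line in gtf_exons_to_bed(demo):
        print(bed_line)
```

To run it on a real annotation, pass an open file handle for the decompressed GTF to `gtf_exons_to_bed` and write each yielded line to the output BED file.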