Using GTF tools to get gene lengths

Sometime you need to normalise gene expression by gene length (eg: FPKM). To do that you need to calculate gene length. But which length to use? One could simply get a total of all exon lengths but if the most abundant isoform is the shortest, this will be terribly inaccurrate. Clearly to do this accurately, analysis at the level of transcripts would be the best approach as the length of each transcript is unambiguous, and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file or human: Homo_sapiens.GRCh38.90.gtf

GTF tools calculates the gene length a few different ways (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged.

The command I used looks like this:


gtftools.py  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf

And the output looks like this:

head Homo_sapiens.GRCh38.90.gtf.genelength
gene mean median longest_isoform merged
ENSG00000282222 757 757 757 757
ENSG00000282221 556 556 556 556
ENSG00000187223 609 609 609 609
ENSG00000110514 4930 5744 6005 8366
ENSG00000086015 3405 4047 5738 7877
ENSG00000115808 4529 2254 8168 8345
ENSG00000211769 48 48 48 48
ENSG00000211768 50 50 50 50
ENSG00000211767 49 49 49 49


But there's a hitch, the program doesn't accept genes that are not located on canonical autosomes, in human these are chromosome 1 to 22. This is a problem, as many interesting genes are located on sex chromosomes, mitochondrial genome or haven't been mapped to a chromosome yet, meaning many genes will be missing from the gene length table. This limitation is circumvented easily by converting all chromosomes to "1", then suddenly the lengths of all genes are included. Here's how it's done:

#remove any comment lines from the GTF file and use awk to convert column 1 to "1"
grep -v '#' $HSA_GTF | awk '{OFS="\t"} $1=1' >> $HSA_GTF.tmp\

#now run GTF tools
gtftools.py  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf.tmp

See the GTFtools website to get the code and documentation: http://www.genemine.org/gtftools.php

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?