Using GTF tools to get gene lengths

January 07, 2019

Sometime you need to normalise gene expression by gene length (eg: FPKM). To do that you need to calculate gene length. But which length to use? One could simply get a total of all exon lengths but if the most abundant isoform is the shortest, this will be terribly inaccurrate. Clearly to do this accurately, analysis at the level of transcripts would be the best approach as the length of each transcript is unambiguous, and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file or human: Homo_sapiens.GRCh38.90.gtf

GTF tools calculates the gene length a few different ways (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged.

The command I used looks like this:

gtftools.py -l Homo_sapiens.GRCh38.90.gtf.genelength Homo_sapiens.GRCh38.90.gtf

And the output looks like this:

head Homo_sapiens.GRCh38.90.gtf.genelength

gene mean median longest_isoform merged

ENSG00000282222 757 757 757 757

ENSG00000282221 556 556 556 556

ENSG00000187223 609 609 609 609

ENSG00000110514 4930 5744 6005 8366

ENSG00000086015 3405 4047 5738 7877

ENSG00000115808 4529 2254 8168 8345

ENSG00000211769 48 48 48 48

ENSG00000211768 50 50 50 50

ENSG00000211767 49 49 49 49

But there's a hitch, the program doesn't accept genes that are not located on canonical autosomes, in human these are chromosome 1 to 22. This is a problem, as many interesting genes are located on sex chromosomes, mitochondrial genome or haven't been mapped to a chromosome yet, meaning many genes will be missing from the gene length table. This limitation is circumvented easily by converting all chromosomes to "1", then suddenly the lengths of all genes are included. Here's how it's done:

#remove any comment lines from the GTF file and use awk to convert column 1 to "1"
grep -v '#' $HSA_GTF | awk '{OFS="\t"} $1=1' >> $HSA_GTF.tmp\

#now run GTF tools

gtftools.py -l Homo_sapiens.GRCh38.90.gtf.genelength Homo_sapiens.GRCh38.90.gtf.tmp

See the GTFtools website to get the code and documentation: http://www.genemine.org/gtftools.php

Search This Blog

Genome Spot

Using GTF tools to get gene lengths

Popular posts from this blog

Uploading data to GEO - which method is faster?

Data analysis step 8: Pathway analysis with GSEA

My personal thoughts on gene name errors