Using GTF tools to get gene lengths

Sometime you need to normalise gene expression by gene length (eg: FPKM). To do that you need to calculate gene length. But which length to use? One could simply get a total of all exon lengths but if the most abundant isoform is the shortest, this will be terribly inaccurrate. Clearly to do this accurately, analysis at the level of transcripts would be the best approach as the length of each transcript is unambiguous, and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file or human: Homo_sapiens.GRCh38.90.gtf

GTF tools calculates the gene length a few different ways (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged.

The command I used looks like this:


gtftools.py  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf

And the output looks like this:

head Homo_sapiens.GRCh38.90.gtf.genelength
gene mean median longest_isoform merged
ENSG00000282222 757 757 757 757
ENSG00000282221 556 556 556 556
ENSG00000187223 609 609 609 609
ENSG00000110514 4930 5744 6005 8366
ENSG00000086015 3405 4047 5738 7877
ENSG00000115808 4529 2254 8168 8345
ENSG00000211769 48 48 48 48
ENSG00000211768 50 50 50 50
ENSG00000211767 49 49 49 49


But there's a hitch, the program doesn't accept genes that are not located on canonical autosomes, in human these are chromosome 1 to 22. This is a problem, as many interesting genes are located on sex chromosomes, mitochondrial genome or haven't been mapped to a chromosome yet, meaning many genes will be missing from the gene length table. This limitation is circumvented easily by converting all chromosomes to "1", then suddenly the lengths of all genes are included. Here's how it's done:

#remove any comment lines from the GTF file and use awk to convert column 1 to "1"
grep -v '#' $HSA_GTF | awk '{OFS="\t"} $1=1' >> $HSA_GTF.tmp\

#now run GTF tools
gtftools.py  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf.tmp

See the GTFtools website to get the code and documentation: http://www.genemine.org/gtftools.php

Comments

  1. This saved me a lot of difficulty, thank you! Particularly useful for roughly annotated transcriptomes.

    ReplyDelete
  2. how to do thin on bash system not on python?

    ReplyDelete
  3. I could not manage to make this work with the demo GTF file provided by GTFtools. After substituting all spaces for "\t" and replacing all chromosome names to 1, GTFtools no longer identifies the file format. It then changes the file's format to .ensembl. I later run this to through the GTFtools gene length calculator and it gave no results.

    Anyways I circumvented the issue by not doing any modifications to the original GTF file. I ran through the gene length calculator as is but also included the option -c with a list of all chromosomes as a parameter. I included X and Y in the list. Unmapped genes can be added by adding the label unmapped where given in that genome. In some genomes it is chromosome "UNKN". I also did this with a poorly annotated genome and worked perfectly.

    I used GTFtools_0.6.9 and python 2.7. Anyways thanks for the post because even though I did not use your UNIX command the concept of including all chromosomes came from reading it!

    ReplyDelete

Post a Comment

Popular posts from this blog

EdgeR or DESeq2? Comparing the performance of differential expression tools

Data analysis step 8: Pathway analysis with GSEA

Installing R-4.0 on Ubuntu 18.04 painlessly