Using GTFtools to get gene lengths
Sometimes you need to normalise gene expression by gene length (e.g. FPKM). To do that you need to calculate gene length. But which length to use? One could simply sum all exon lengths, but if the most abundant isoform is the shortest, this will be terribly inaccurate. Clearly, to do this accurately, analysis at the level of transcripts is the best approach, as the length of each transcript is unambiguous and the effective gene length can be estimated from the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone, without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file for human: Homo_sapiens.GRCh38.90.gtf

GTFtools calculates the gene length a few different ways: (i) mean of the isoform lengths, (ii) median of the isoform lengths, (iii) longest single isoform, and (iv) all exons merged. The command I used looks like this:

gtftools.py -l Homo_sapiens.GRCh38.90.gtf.genelength Homo_sapiens.GRCh38.90.gtf...
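To make the four length measures concrete, here is a minimal Python sketch that computes them from the exon records of a GTF file. It is not GTFtools' own code, and its numbers may differ from GTFtools output in edge cases. The function names `merged_length` and `gene_lengths` are made up for illustration, and the sketch assumes standard tab-separated GTF lines with `gene_id` and `transcript_id` attributes and 1-based inclusive coordinates.

```python
import re
from collections import defaultdict
from statistics import mean, median


def merged_length(intervals):
    """Count bases covered by 1-based inclusive intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def gene_lengths(gtf_lines):
    """Return {gene_id: {"mean", "median", "longest", "merged"}} from exon lines."""
    # gene_id -> transcript_id -> list of (start, end) exon coordinates
    exons = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = re.search(r'gene_id "([^"]+)"', fields[8])
        tx = re.search(r'transcript_id "([^"]+)"', fields[8])
        if gene and tx:
            exons[gene.group(1)][tx.group(1)].append((int(fields[3]), int(fields[4])))

    result = {}
    for gene_id, transcripts in exons.items():
        # Length of each isoform is the sum of its exon lengths.
        tx_lengths = [sum(e - s + 1 for s, e in ex) for ex in transcripts.values()]
        all_exons = [iv for ex in transcripts.values() for iv in ex]
        result[gene_id] = {
            "mean": mean(tx_lengths),
            "median": median(tx_lengths),
            "longest": max(tx_lengths),
            "merged": merged_length(all_exons),
        }
    return result
```

For a toy gene with one isoform of two 100 bp exons (1-100 and 201-300) and a second isoform with a single exon at 51-150, the longest isoform is 200 bp but the merged exon length is 250 bp, which shows how the merged figure can exceed the length of any real transcript. To run it on a real file, pass an open file handle, e.g. `gene_lengths(open("Homo_sapiens.GRCh38.90.gtf"))`.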