Using GTF tools to get gene lengths

Sometime you need to normalise gene expression by gene length (eg: FPKM). To do that you need to calculate gene length. But which length to use? One could simply get a total of all exon lengths but if the most abundant isoform is the shortest, this will be terribly inaccurrate. Clearly to do this accurately, analysis at the level of transcripts would be the best approach as the length of each transcript is unambiguous, and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file or human: Homo_sapiens.GRCh38.90.gtf

GTF tools calculates the gene length a few different ways (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged.

The command I used looks like this:  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf
And the output l…

Benchmarking an AMD Ryzen Threadripper 2990WX 64 thread system

I recently received access to a new workstation built around the AMD Ryzen Threadripper 2990WX 32-Core Processor. The main use of the system will be to process genome sequencing data. It has 64 threads and the clock speed is 3.2GHz. It should get those data processing jobs done a lot quicker than my other system. Nevertheless I wanted to run some benchmarks to see how much faster it is as well as give it a stress test to see how well it can cope with high loads. This info could also be useful for comparisons in case degradation of the system in the future.

Here are the specs of the 3 systems being benchmarked:

AMD16: AMD® Ryzen threadripper 1900x 8-core processor - 16 threads@3.8GHz  - 2.4GHz RAMIntel32: Intel® Xeon® CPU E5-2660 16-core processor - 32 threads@3.0GHz - 1.3GHz RAM AMD64: AMD Ryzen Threadripper 2990WX 32-Core Processor - 64 threads@3.2GHZ - 2.9GHz RAM
The command being executed is a pbzip2 of a 4 GB file containing random data using some benchmarking scripts I wrote previ…

Using the DEE2 bulk data dumps

The DEE2 project makes freely available bulk data dumps of expression for each of the nine species. 
The data is organised as tab separated values (tsv) in "long" format. This is different to a standard gene expression matrix - see below.

The long format is prefered for really huge datasets because it can be readily indexed and converted to a database, as compared to wide matrix format. 
You'll notice that there are 4 files for each species, with each having a different suffix. They are all compressed with bz2. You can use pbzip2 to de-compress these quickly.
*accessions.tsv.bz2This is a list of runs included in DEE2, along with other SRA/GEO accession numbers. 

*se.tsv.bz2These are the STAR based gene expression counts in 'long format' tables with the columns: 'dataset', 'gene', 'count'.
*ke.tsv.bz2These are the Kallisto estimated transcript expression counts also in long format
*qc.tsv.bz2These are the QC metrics are also available in long …

Extract data from a spreadsheet file on the linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If this isnt installed on your linux machine then you most likely can get it from the package repository.

$ sudo apt install ssconvert

Then if you want to extract a spreadsheet file into a tsv it can be done like this:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv

You will notice that all the sheets are output to separate tsv files. This approach is nice as it can accommodate high throughput screening, as I implemented in my Gene Name Errors paper a while back.

Here is an example of obtaining some data from GEO.

$ #first download
$ curl '' > GSE80251.xlsx

$ #now extract $ ssconvert -S --export-type Gnumeric_stf:stf_assistant …

Incorporate dee2 data into your R-based RNA-seq workflow is a portal for accessing gene expression data derived from public RNA-seq datasets. So far there are over 400k available datasets and its growing every day. While there are existing databases of such as Expression Atlas, Recount2 and ARCHS4, offers a number of unique benefits. For instance, dee2 includes gene-wise counts fron STAR as well as transcript-wise quantifications from Kallisto. There are a few ways you can access these data. Firstly, there is a nice web interface that is mobile friendly. Secondly, there are data dumps available if you are running a large scale analysis.  But the purpose of this post is to demonstrate the improved R interface in action together with SRAdbv2 and statistics with edgeR and DESeq. The official documentation is available on GitHub.
Getting started This tutorial provides a walkthrough for how to work with dee2 expression data, starting with dataset searches, obtaining the data from and then performing a differential analysi…

Update on DEE2 project for Sept 2018

A few updates for DEE2 i would like to share. 

I switched over to NameCheap domain name service which appears to be working much nicer than the previous one (HostPapa). The domain name sever change broke the docker image so it was slightly modified and rebuilt. I've integrated with SRAdbV2, an now there are many more datasets in the queue. I think many of these are small ones related to single cell RNA-seq. I am using as many computers as possible to clear up the backlog. I've noticed a lot of SRA project with one or a few datasets missing, so I have have written a script to identify these and queue them with priority. The R interface hs undergone several improvements and should be more robust now. A whole bunch of new documentation has been added, including a complete walkthrough starting with SRAdbV2 query, fetching DEE2 data, and differential analysis with edgeR and DESeq.Also bulk data dumps are again available via http. Dat turned out to be too slow and unreliable for file…

Get the newest Reactome gene sets for pathway analysis

For first-pass pathway analysis I find Reactome to be the most useful database of gene sets for biologists to understand. For a long time I have been using Reactome gene sets as deposited to the GSEA/MSigDB website. Recently a colleague pointed me to the gene matrix file offered directly on the Reactome webpage (Thanks Dr Okabe).

There are some differences. Firstly there are more gene sets in the one from the Reactome webpage (accessed 2018-05-09)
$ wc -l *gmt     674 c2.cp.reactome.v6.1.symbols.gmt    2022 ReactomePathways.gmt    2696 total
Secondly, there are more genes included in one or more gene sets:
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l 6025 $ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l 10852
And overall there are about threefold more gene-pathway entries 
$ cut -f3- ReactomePathways.gmt | wc -w 106405 $ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w 37601
I also looked at whether the gen…