Posts

Extract TSS regions from a GTF file with GTFtools

Since stumbling upon GTFtools recently, I found that it has other another interesting use - to generate coordinate sets around transcriptional start sites (TSSs). This is really important for ChIP-seq analysis when we want to compare for example the strength of enrichment of histone modificaions at TSSs and compare it to RNA expression. Using GTFtools, it is a one line command to extract these positions:

gtftools.py -t Homo_sapiens.GRCh38.94.gtf.tss.bed -w 1000  Homo_sapiens.GRCh38.94.gtf

Where "-t" is the output file flag, "-w" is the desired TSS distance to cover, in this case +/- 1000 bp, and the last argument is the input gtf file which needs to be Ensembl or Gencode (other ones don't work due to differences in formatting) 
If I had to do this without GTFtools, it would end up being more complicated, as TSS positions (exon 1 starts) would need to be extracted from the GTF file separately for the top and bottom strands and then merged.

Enabling bioinformatics training in a Windows based computer lab with Docker+Dugong

While Linux remains the OS for developers, data scientists and bioinformaticians, uni classrooms remain stubbornly dependent on windows based applications. Yes, on individual PCs you can install Ubuntu command line apps but ask any IT dept about doing this for an entire classroom and you will undoubtedly receive an emphatic "NO".

So how does one do bioinformatics training when students cannot even access the simplest foundation, the OS?

Good question, and none of the potential answers are optimal to be honest. But it's what we need to deal with until Unis realize that open source software is actually good enough to run entire enterprises.

My first thought was to get students to use Putty to log in to a bioinformatics server with SSH. This would be OK, but would be a bit of a headache to manage all the accounts on the server. Also my feeling is that much of the skills learned by the students would be forgotten again as soon as access to the server is revoked.

Therefore, I…

DEE2 gets published

The dee2.io project has been a labor of love since 2013/2014, has undergone a major overhaul and has finally been published online in GigaScience. The great thing about this journal is not only are the articles open access, but also the reviewer's comments. We had great suggestions and they improved the resource tremendously.

It's great that it has been published finally, but publication is not the end goal of the project. The goal is to democratize omics data to a point where it can be done by biologists without any coding experience, undergrad students, high school students, practically anyone with a smart phone and an internet connection. So instead of being the end of the project, this is really the end of the beginning. Not only will we be keeping up with new SRA submissions over the next year of so, we will be incorporating new features, new species and perhaps some new data types.

If you have suggestions, feedback of comments I would be very grateful!

Using GTF tools to get gene lengths

Sometime you need to normalise gene expression by gene length (eg: FPKM). To do that you need to calculate gene length. But which length to use? One could simply get a total of all exon lengths but if the most abundant isoform is the shortest, this will be terribly inaccurrate. Clearly to do this accurately, analysis at the level of transcripts would be the best approach as the length of each transcript is unambiguous, and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file or human: Homo_sapiens.GRCh38.90.gtf

GTF tools calculates the gene length a few different ways (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged.

The command I used looks like this:


gtftools.py  -l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf
And the output l…

Benchmarking an AMD Ryzen Threadripper 2990WX 64 thread system

Image
I recently received access to a new workstation built around the AMD Ryzen Threadripper 2990WX 32-Core Processor. The main use of the system will be to process genome sequencing data. It has 64 threads and the clock speed is 3.2GHz. It should get those data processing jobs done a lot quicker than my other system. Nevertheless I wanted to run some benchmarks to see how much faster it is as well as give it a stress test to see how well it can cope with high loads. This info could also be useful for comparisons in case degradation of the system in the future.

Here are the specs of the 3 systems being benchmarked:

AMD16: AMD® Ryzen threadripper 1900x 8-core processor - 16 threads@3.8GHz  - 2.4GHz RAMIntel32: Intel® Xeon® CPU E5-2660 16-core processor - 32 threads@3.0GHz - 1.3GHz RAM AMD64: AMD Ryzen Threadripper 2990WX 32-Core Processor - 64 threads@3.2GHZ - 2.9GHz RAM
The command being executed is a pbzip2 of a 4 GB file containing random data using some benchmarking scripts I wrote previ…

Using the DEE2 bulk data dumps

Image
The DEE2 project makes freely available bulk data dumps of expression for each of the nine species. 
The data is organised as tab separated values (tsv) in "long" format. This is different to a standard gene expression matrix - see below.

The long format is prefered for really huge datasets because it can be readily indexed and converted to a database, as compared to wide matrix format. 
You'll notice that there are 4 files for each species, with each having a different suffix. They are all compressed with bz2. You can use pbzip2 to de-compress these quickly.
*accessions.tsv.bz2This is a list of runs included in DEE2, along with other SRA/GEO accession numbers. 

*se.tsv.bz2These are the STAR based gene expression counts in 'long format' tables with the columns: 'dataset', 'gene', 'count'.
*ke.tsv.bz2These are the Kallisto estimated transcript expression counts also in long format
*qc.tsv.bz2These are the QC metrics are also available in long …

Extract data from a spreadsheet file on the linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If this isnt installed on your linux machine then you most likely can get it from the package repository.

$ sudo apt install ssconvert

Then if you want to extract a spreadsheet file into a tsv it can be done like this:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv


You will notice that all the sheets are output to separate tsv files. This approach is nice as it can accommodate high throughput screening, as I implemented in my Gene Name Errors paper a while back.

Here is an example of obtaining some data from GEO.

$ #first download
$ curl 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE80251&format=file&file=GSE80251%5Fprocessed%5FRNA%5Fexpression%5Fmnfyap%2Exlsx' > GSE80251.xlsx


$ #now extract $ ssconvert -S --export-type Gnumeric_stf:stf_assistant …