Posts

Minitalk: on Excel Gene Name Errors

It was great to visit the Monash Clayton Bioinformatics team led by David Powell today to introduce myself and speak about a topic very close to my heart!

Slides below:

Also, let me know what you think of the blog's new theme in the comments below. BTW, I just realised this is my 100th post! Yay for me! Thanks for reading!

How NGS is transforming medicine

Last month, I gave a talk at our departmental meeting, describing in general terms how high-throughput sequencing technology is having a real impact on medicine and human health, as well as some emerging trends to watch out for in the coming years.

Here's the link


Introducing the ENCODE Gene Set Hub

TL;DR We curated a bunch of ENCODE data into gene sets that are super useful in pathway analysis (e.g. GSEA).
Link to gene sets and data: https://sourceforge.net/projects/encodegenesethub/
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway-level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets, which is really cool because you can start to generate gene sets based on your own profiling work and that of others.

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated from ENCODE transcription factor binding site (TFBS) data to understand which TFs were drivi…

2016 wrap-up

What a rollercoaster. On the good side, we've seen advances in sequencing methods, including improvements in Nanopore sequencing, and single cell methods are becoming more common. We also saw 10x Genomics come to the party with an emulsion PCR approach that can be applied to produce synthetic long reads or single cell barcoding. There were major results from larger cohort studies, exemplified by ExAC, which is revealing more about human genetic variation, while Blueprint and GTEx are revealing more about the determinants of gene regulation. The #openaccess movement is growing rapidly in the bioinformatics community, along with the popularity of preprints, which I hope are adopted more widely in the biomedical sciences.

On the not so good side, Illumina has made no new instrument announcements nor any substantial updates to existing sequencing systems. There were a few papers (example) describing the methylation EPIC array announced in 2015. Prices for Illumina reagents continue to increase,…

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse, so I thought I would share them.

GO.v5.2.symbols_mouse.gmt
kegg.v5.2.symbols_mouse.gmt
msigdb.v5.2.symbols_mouse.gmt
reactome.v5.2.symbols_mouse.gmt

Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human GMT files. It requires GNU parallel, which can be easily installed on Ubuntu with "sudo apt-get install parallel".


#!/bin/bash

conv(){
line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u…
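
To illustrate the general idea of mapping one gene set through an ortholog table, here is a minimal sketch. It is not the script above: the function name gmt_line_to_mouse and the two-column lookup file (human symbol, tab, mouse symbol) are assumptions made for the example, and it keeps only one mouse symbol per human symbol, so one-to-many orthologs are not handled.

```bash
#!/bin/bash
# Hypothetical helper (not the original script): map the genes of one
# tab-separated GMT line through a two-column human<TAB>mouse table.
# Assumption: at most one mouse symbol per human symbol (last one wins).
gmt_line_to_mouse() {
  local line=$1 table=$2
  printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' -v table="$table" '
    BEGIN {
      # Load the lookup table, skipping rows with an empty column
      while ((getline row < table) > 0) {
        split(row, f, "\t")
        if (f[1] != "" && f[2] != "") map[f[1]] = f[2]
      }
    }
    {
      # Fields 1 and 2 are the set name and description; keep them as-is
      out = $1 OFS $2
      for (i = 3; i <= NF; i++) {
        if ($i in map) {
          m = map[$i]
          # Emit each mouse symbol once, in order of first appearance
          if (!(m in seen)) { seen[m] = 1; out = out OFS m }
        }
      }
      print out
      split("", seen)   # portable way to clear the array
    }'
}
```

Called as gmt_line_to_mouse "$line" orthologs.tsv, it prints the set name and description followed by the mapped mouse symbols; human symbols with no entry in the table are dropped.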

Gene name error scanner webservice

Over the past few weeks, we've had a lot of feedback about our paper describing the sorry state of Excel auto-correct errors in supplementary spreadsheet files.

In our group, we've discussed a number of ways that these errors could be minimised in future. One suggestion was to publish a webservice which permits reviewers and editors to upload and scan spreadsheets for the presence of gene name errors. So that's what I did. I took some basic file upload code in PHP and customised it so that it runs the shell script described in the paper. You can access the webservice here. We've been testing it for a few days and it seems to work fine, except for the auto-generated email, which I presume is being blocked by our IT group.
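
For a rough feel of what such a scan involves, here is a minimal sketch. It is not the script from the paper: the function name find_date_like_cells is made up for this example, it assumes the spreadsheet has already been exported to a tab-separated file, and it only looks for day-month strings of the kind Excel produces from gene symbols (SEPT2 becomes 2-Sep, MARCH1 becomes 1-Mar).

```bash
#!/bin/bash
# Hypothetical scanner (not the paper's script): list the distinct cells in a
# TSV file that look like Excel date conversions of gene symbols.
find_date_like_cells() {
  # Put every cell on its own line, then keep whole-cell matches such as
  # "2-Sep" or "1-Mar" (case-insensitive month abbreviations)
  tr '\t' '\n' < "$1" \
    | grep -E -i -x '[0-9]{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)' \
    | sort -u
}
```

Running find_date_like_cells on an exported sheet prints one suspicious value per line, or nothing if no cell matches. A real scan would also need to cover other conversions and decide which columns actually hold gene names.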

Upload spreadsheets and have them scanned for gene name errors.
The code for the webservice is up at GitHub, so you can modify it and host another instance if you want. The code should run on Ubuntu machines that can run Apache2, php and other dependenci…

My personal thoughts on gene name errors

Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

Most tech-savvy people hate Excel
People over-rely on Excel, when there are better alternatives for analytics
Everyone has experienced an auto-correct fail and can relate
People love "bloopers"
People are interested when scientists get it wrong (especially other scientists)

In this post I want to share a few things:

Some responses to journalist questions
List of media coverage and whether they are reporting things accurately
A look into the scripts used themselves
Future directions

I'll also be answering your questions, so pop them in the comments section.

Some responses to journalist questions
Why did you do this?

We saw that the problem was first described in 2004, but these errors were still present in files from papers in high-ranking journals. We made …