Posts

2016 wrap-up

What a rollercoaster. On the good side, we've seen advances in sequencing methods, including improvements in Nanopore sequencing and single-cell methods becoming more common. We also saw 10x Genomics come to the party with an emulsion PCR approach that can be applied to produce synthetic long reads or single-cell barcoding. There were major results from larger cohort studies, exemplified by ExAC, which is revealing more about human genetic variation, while Blueprint and GTEx are revealing more about determinants of gene regulation. The #openaccess movement is growing rapidly in the bioinformatics community, along with the popularity of preprints, which I hope are adopted more widely in the biomedical sciences.

On the not so good side, Illumina has made no new instrument announcements nor any substantial updates to existing sequencing systems. There were a few papers (example) describing the methylation EPIC array announced in 2015. Prices for Illumina reagents continue to increase,…

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse so I thought I would share.

GO.v5.2.symbols_mouse.gmt
kegg.v5.2.symbols_mouse.gmt
msigdb.v5.2.symbols_mouse.gmt
reactome.v5.2.symbols_mouse.gmt

Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human GMT files. It requires GNU parallel, which can be easily installed on Ubuntu with "sudo apt-get install parallel".


#!/bin/bash

conv(){
line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u…

Gene name error scanner webservice

Over the past few weeks, we've had a lot of feedback about our paper describing the sorry state of Excel auto-correct errors in supplemental spreadsheet files.

In our group, we've discussed a number of ways that these errors could be minimised in future. One suggestion was to publish a webservice which permits reviewers and editors to upload and scan spreadsheets for the presence of gene name errors. So that's what I did. I took some basic file upload code in PHP and customised it so that it runs the shell script described in the paper. You can access the webservice here. We've been testing it for a few days and it seems to work fine, except for the auto-generated email, which I presume is being blocked by our IT group.
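To give a flavour of what such a scan involves, here is a minimal sketch. It is not the script from the paper: the function name scan_gene_col is hypothetical, and it assumes the spreadsheet has already been exported to tab-separated text with gene names in the first column. It flags two typical symptoms of Excel auto-correct: gene symbols turned into dates (for example SEPT2 becoming 2-Sep) and RIKEN identifiers turned into floating-point numbers (for example 2310009E13 becoming 2.31E+13).

```shell
#!/bin/bash
# Hypothetical scanner: print first-column values that look like
# Excel-mangled gene names.
# Assumption: $1 is a tab-separated text export of one spreadsheet sheet.
scan_gene_col() {
  cut -f1 "$1" | grep -E -i -x \
    -e '[0-9]{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)' \
    -e '[0-9]\.[0-9]+E\+[0-9]+'
}
```

Calling "scan_gene_col sheet.tsv" (a placeholder file name) prints each suspicious value on its own line. Because grep returns a non-zero exit status when nothing matches, the function's exit status can also be used to tell a clean file from an affected one.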

Upload spreadsheets and have them scanned for gene name errors.
The code for the webservice is up on GitHub, so you can modify it and host another instance if you want. The code should run on Ubuntu machines that can run Apache2, PHP and other dependenci…

My personal thoughts on gene name errors

Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

Most tech-savvy people hate Excel
People over-rely on Excel, when there are better alternatives for analytics
Everyone has experienced an auto-correct fail and can relate
People love "bloopers"
People are interested when scientists get it wrong (especially other scientists)

In this post I want to share a few things:

Some responses to journalist questions
List of media coverage and whether they are reporting things accurately
A look into the scripts used themselves
Future directions

I'll also be answering your questions, so pop them in the comments section.

Some responses to journalist questions
Why did you do this?

We saw that the problem was first described in 2004, but these errors were present in files from papers in high-ranking journals. We made …

Analyzing repeat rich plant smRNA-seq data with ShortStack

Small RNA expression is difficult to analyse. Small RNAs are short molecules, with miRNAs ranging from 18 to 25 nt; they occur as identical or near-identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc.) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads, which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) consisting of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off, mapped the reads to the genome with BWA aln, then counted reads mapped uniquely (mapQ≥10) to exonic regions.
So as you can see, the proportion of reads that are assigned …
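The mapQ cutoff used for the unique-read count can be sketched in a few lines of awk. This is an illustration only, not the pipeline used for the post: the function name mapq_fraction is hypothetical, it reads a plain-text SAM file, and it ignores the exon overlap step entirely. It simply reports how many mapped reads reach the cutoff.

```shell
#!/bin/bash
# Hypothetical helper: report what fraction of mapped reads pass a MAPQ cutoff.
# Assumption: $1 is an uncompressed SAM file; $2 is the cutoff (default 10).
mapq_fraction() {
  local sam=$1 min_q=${2:-10}
  awk -F'\t' -v q="$min_q" '
    /^@/ { next }                      # skip SAM header lines
    int($2 / 4) % 2 == 1 { next }      # FLAG bit 0x4 set: read is unmapped
    { mapped++; if ($5 >= q) pass++ }  # column 5 is MAPQ
    END {
      # Output: reads passing cutoff, mapped reads, fraction passing
      printf "%d\t%d\t%.3f\n", pass, mapped, (mapped ? pass / mapped : 0)
    }
  ' "$sam"
}
```

Running "mapq_fraction sample.sam 10" (a placeholder file name) prints three tab-separated values: the number of reads at or above the cutoff, the number of mapped reads, and the ratio of the two. A low ratio is the signal that most reads are multimapped.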

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

If you work in a lab dealing with mammalian cell cultures, then you've probably heard of Mycoplasma. These are obligate intracellular parasitic bacteria that not only cause human infection and disease, but are also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results, which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments, like streptomycin and penicillin. Mycoplasma detection is also not straightforward, as these bugs are not visible with the light microscopes used by most researchers. PCR and ELISA tests can be used, but there are many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination. The study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…

Accuracy, speed and error tolerance of short DNA sequence aligners

Over the past few years at GenomeSpot, I've evaluated alignment software accuracy for DNA-seq, RNA-seq and small RNA-seq. After discussions with colleagues and followers, I thought it was time to develop aspects of this work to a point where it would be publishable in journals. Over the past year I've put together a paper that comprehensively evaluated accuracy of small RNA aligners. After three review and revision cycles I'm happy to say it has been accepted for publication in RNA. It will likely appear in the August issue.

Secondly, I've extended upon work from a previous post where I evaluated the accuracy of several DNA-seq mappers with error-containing reads, and did further work like:

varying the read length from 50 nt to 480 nt
performing parallel analysis with Arabidopsis and human
using simulators to generate Illumina and Ion Torrent read sets
testing the speed (throughput) of aligners with Illumina reads

This work was just made public today on bioRxiv. Have a rea…