Posts

Gene name error scanner webservice

Image
Over the past few weeks, we've had a lot of feedback about our paper describing the sorry state of Excel auto-correct errors in supplemental files in spreadsheets.

In our group, we've discussed a number of ways that these errors could be minimised in future. One suggestion was to publish a webservice which permits reviewers and editors to upload and scan spreadsheets for the presence of gene name errors. So that's what I did. I took some basic file upload code in php and customised it so that it runs the shell script described in the paper. You can access the webservice here. We've been testing it for a few days and seems to work fine, except for the auto-generated email which I presume is being blocked by our IT group.

Upload spreadsheets and have them scanned for gene name errors.
The code for the webservice is up at GitHub, so you can modify it and host another instance if you want. The code should run on Ubuntu machines that can run Apache2, php and other dependenci…

My personal thoughts on gene name errors

Image
Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

Most tech-savvy people hate ExcelPeople over-rely on Excel, when there are better alternatives for analyticsEveryone has experienced an auto-correct fail and can relatePeople love "bloopers"People are interested when scientists get it wrong (especially other scientists)
 In this post I want to share a few things:

Some responses to journalist questionsList of media coverage and whether they are reporting things accuratelyA look into the scripts used themselvesFuture directions I'll also be answering your questions, so pop them in the comments section. 

Some responses to journalist questions
Why did you do this?

We saw that the problem was first described in 2004, but these errors were present in files from papers in high-ranking journals. We made …

Analyzing repeat rich plant smRNA-seq data with ShortStack

Image
Small RNA expression is difficult to analyse. They're small molecules anywhere from 18 -25 nt for miRNAs, they occur as identical or near identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ≥10).
So as you can see, the proportion of reads that are assigned …

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

Image
If you work in a lab dealing with mammalian cell cultures, then you've probabaly heard of Mycoplasma, these are obligate intracellular parasitic bacteria that not only cause human infection and disease, but also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments like streptomycin and penicillin. Mycoplasma detection is also not straight-forward, as these bugs are not visible with the light microscopes used by most researchers. PCR and Elisa tests can be used but there many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination, furthermore the study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…

Accuracy, speed and error tolerance of short DNA sequence aligners

Image
Over the past few years at GenomeSpot, I've evaluated alignment software accuracy for DNA-seq, RNA-seq and small RNA-seq. After discussions with colleagues and followers, I thought it was time to develop aspects of this work to a point where it would be publishable in journals. Over the past year I've put together a paper that comprehensively evaluated accuracy of small RNA aligners. After three review and revision cycles I'm happy to say it has been accepted for publication in RNA. It will likely appear in the August issue.

Secondly, I've extended upon work from a previous post where I evaluated the accuracy of several DNA-seq mappers with error containing reads, and did further work like: varying the read length from 50 nt to 480 ntperforming parallel analysis with Arabidopsis and humanusing simulators to generate Illumina and Ion Torrent read setstested the speed (throughput) of aligners with Illumina reads This work was just made public today on bioaRxiv. Have a rea…

How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).


Name                        logFC               logCPM            LR                PValue ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19 ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14 ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12 ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10 ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056…

Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

#!/bin/bash
RNK=$1
GMT=$2
java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPre…