Posts

My personal thoughts on gene name errors

Image
Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

Most tech-savvy people hate ExcelPeople over-rely on Excel, when there are better alternatives for analyticsEveryone has experienced an auto-correct fail and can relatePeople love "bloopers"People are interested when scientists get it wrong (especially other scientists)
 In this post I want to share a few things:

Some responses to journalist questionsList of media coverage and whether they are reporting things accuratelyA look into the scripts used themselvesFuture directions I'll also be answering your questions, so pop them in the comments section. 

Some responses to journalist questions
Why did you do this?

We saw that the problem was first described in 2004, but these errors were present in files from papers in high-ranking journals. We made …

Analyzing repeat rich plant smRNA-seq data with ShortStack

Image
Small RNA expression is difficult to analyse. They're small molecules anywhere from 18 -25 nt for miRNAs, they occur as identical or near identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ≥10).
So as you can see, the proportion of reads that are assigned …

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

Image
If you work in a lab dealing with mammalian cell cultures, then you've probabaly heard of Mycoplasma, these are obligate intracellular parasitic bacteria that not only cause human infection and disease, but also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments like streptomycin and penicillin. Mycoplasma detection is also not straight-forward, as these bugs are not visible with the light microscopes used by most researchers. PCR and Elisa tests can be used but there many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination, furthermore the study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…

Accuracy, speed and error tolerance of short DNA sequence aligners

Image
Over the past few years at GenomeSpot, I've evaluated alignment software accuracy for DNA-seq, RNA-seq and small RNA-seq. After discussions with colleagues and followers, I thought it was time to develop aspects of this work to a point where it would be publishable in journals. Over the past year I've put together a paper that comprehensively evaluated accuracy of small RNA aligners. After three review and revision cycles I'm happy to say it has been accepted for publication in RNA. It will likely appear in the August issue.

Secondly, I've extended upon work from a previous post where I evaluated the accuracy of several DNA-seq mappers with error containing reads, and did further work like: varying the read length from 50 nt to 480 ntperforming parallel analysis with Arabidopsis and humanusing simulators to generate Illumina and Ion Torrent read setstested the speed (throughput) of aligners with Illumina reads This work was just made public today on bioaRxiv. Have a rea…

How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).


Name                        logFC               logCPM            LR                PValue ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19 ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14 ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12 ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10 ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056…

Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

#!/bin/bash
RNK=$1
GMT=$2
java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPre…

Pathway analysis: DAVID versus GSEA

Image
Pathway analysis has become such a common procedure in bioinformatics, especially in studying gene expression. If you look at a survey of recent papers it seems that there is a bunch of ways that it can be done. In this post, I'll discuss the differences between two commonly used tools; DAVID and GSEA.
The concepts behind the two algorithms are very different. DAVID determines overlaps between user-supplied gene lists and the curated databases, looking for overlaps that are bigger than that expected by random chance. You can improve the accuracy of the algorithm by providing a background file that contains all genes that were considered/detected in the experiment. The user-supplied gene sets are normally generated by selecting genes that pass a significance threshold. The DAVID procedure is similar to others available such as Ingenuity, AmiGO and GeneGO. The selection of significance values is largely arbitrary, but it is common to set the threshold at FDR adjusted p<0.05. Modi…