Showing posts from 2016

2016 wrap-up

What a rollercoaster. On the good side, we've seen advances in sequencing methods including improvements in Nanopore sequencing and single cell methods becoming more common. We also saw 10x genomics come to the party with an emulsion PCR approach that can be applied to produce synthetic long reads or single cell barcoding. There were major results from larger cohort studies exemplified by ExAC, which is revealing more about human genetic variation, while Blueprint and GTEx are revealing more about determinants of gene regulation. The #openaccess is growing rapidly in the bioinformatics community, along with the growth of preprint popularity which I hope is adopted more widely in the biomedical sciences.

On the not so good side, Illumina has made made no new instrument announcements nor any substantial updates to existing sequencing systems. There were a few papers (example) describing the methylation EPIC array announced in 2015. Prices for Illumina reagents continue to increase,…

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse so I thought I would share.


Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human gmt files. It requres gnu parallel which can be easily installed on Ubuntu with "sudo apt-get install parallel"


  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u…

Gene name error scanner webservice

Over the past few weeks, we've had a lot of feedback about our paper describing the sorry state of Excel auto-correct errors in supplemental files in spreadsheets.

In our group, we've discussed a number of ways that these errors could be minimised in future. One suggestion was to publish a webservice which permits reviewers and editors to upload and scan spreadsheets for the presence of gene name errors. So that's what I did. I took some basic file upload code in php and customised it so that it runs the shell script described in the paper. You can access the webservice here. We've been testing it for a few days and seems to work fine, except for the auto-generated email which I presume is being blocked by our IT group.

Upload spreadsheets and have them scanned for gene name errors.
The code for the webservice is up at GitHub, so you can modify it and host another instance if you want. The code should run on Ubuntu machines that can run Apache2, php and other dependenci…

My personal thoughts on gene name errors

Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

Most tech-savvy people hate ExcelPeople over-rely on Excel, when there are better alternatives for analyticsEveryone has experienced an auto-correct fail and can relatePeople love "bloopers"People are interested when scientists get it wrong (especially other scientists)
 In this post I want to share a few things:

Some responses to journalist questionsList of media coverage and whether they are reporting things accuratelyA look into the scripts used themselvesFuture directions I'll also be answering your questions, so pop them in the comments section. 

Some responses to journalist questions
Why did you do this?

We saw that the problem was first described in 2004, but these errors were present in files from papers in high-ranking journals. We made …

Analyzing repeat rich plant smRNA-seq data with ShortStack

Small RNA expression is difficult to analyse. They're small molecules anywhere from 18 -25 nt for miRNAs, they occur as identical or near identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ≥10).
So as you can see, the proportion of reads that are assigned …

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

If you work in a lab dealing with mammalian cell cultures, then you've probabaly heard of Mycoplasma, these are obligate intracellular parasitic bacteria that not only cause human infection and disease, but also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments like streptomycin and penicillin. Mycoplasma detection is also not straight-forward, as these bugs are not visible with the light microscopes used by most researchers. PCR and Elisa tests can be used but there many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination, furthermore the study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…

Accuracy, speed and error tolerance of short DNA sequence aligners

Over the past few years at GenomeSpot, I've evaluated alignment software accuracy for DNA-seq, RNA-seq and small RNA-seq. After discussions with colleagues and followers, I thought it was time to develop aspects of this work to a point where it would be publishable in journals. Over the past year I've put together a paper that comprehensively evaluated accuracy of small RNA aligners. After three review and revision cycles I'm happy to say it has been accepted for publication in RNA. It will likely appear in the August issue.

Secondly, I've extended upon work from a previous post where I evaluated the accuracy of several DNA-seq mappers with error containing reads, and did further work like: varying the read length from 50 nt to 480 ntperforming parallel analysis with Arabidopsis and humanusing simulators to generate Illumina and Ion Torrent read setstested the speed (throughput) of aligners with Illumina reads This work was just made public today on bioaRxiv. Have a rea…

How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).

Name                        logFC               logCPM            LR                PValue ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19 ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14 ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12 ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10 ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056…

Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPre…

Pathway analysis: DAVID versus GSEA

Pathway analysis has become such a common procedure in bioinformatics, especially in studying gene expression. If you look at a survey of recent papers it seems that there is a bunch of ways that it can be done. In this post, I'll discuss the differences between two commonly used tools; DAVID and GSEA.
The concepts behind the two algorithms are very different. DAVID determines overlaps between user-supplied gene lists and the curated databases, looking for overlaps that are bigger than that expected by random chance. You can improve the accuracy of the algorithm by providing a background file that contains all genes that were considered/detected in the experiment. The user-supplied gene sets are normally generated by selecting genes that pass a significance threshold. The DAVID procedure is similar to others available such as Ingenuity, AmiGO and GeneGO. The selection of significance values is largely arbitrary, but it is common to set the threshold at FDR adjusted p<0.05. Modi…

Is paired end RNA-seq better than single-end for gene-wise gene expression analysis?

Something I've wondered about is whether for RNA-seq it's worth forking out the extra cost of sequencing both ends as opposed to single end.

To test this, I went back to a paired end data set present in GEO (GSE55123, 2x 36bp), cleaned the data with Skewer, then mapped the reads with STAR in either paired-end mode or single-end mode (using just read 1).

I then used featureCounts to quantify number of tags aligned to each gene. I excluded genes with fewer than 10 reads per sample on average. Then I ran edgeR at Degust to identify differentially expressed genes (DEGs@FDR<0.05). I used a shell script to quantify the overlap in DEGs. Then I ranked them based on the p-value from most up-regulated to most down-regulated and compared their positions in the rank.

Here's the result of the overlap analysis. You can see that PE fastq detected more genes but identified fewer DGEs than SE.

Detected in PE:15919
Detected in SE:15275
Detected in both:14750
Detected in either:16444
Detected …

2015 Wrap-Up

Its that time of year again where we can reflect on the year that was, hit the reset button and focus on the trend that will dominate 2016.
Sequencing hardware This time last year, we welcomed the NextSeq500, NeoPrep and updates to HiSeq2500. Throughout the year there were further announcements from Illumina on the HiSeq4000 and the technology appears to be improving incrementally from here on. Indeed while cluster numbers have improved, there have only been modest improvements in read length and pricing in Australian dollars has not decreased as substantially as we may have predicted. I'm really excited about the developments in 3rd gen technology coming from Oxford Nanopore and Pacific Biosciences in that the read lengths and accuracy are starting to improve. The growing user base is definitely spurring improved basecalling and error correction algorithms that will further increase the user base. Metagenomics has really been the main beneficiary of 3rd gen long read seq and that …