Posts

Two-dimensional filled contour plots in R

Also called kernel density plots, these are two-dimensional contour heatmaps that are a useful replacement for scatterplots when the number of datapoints is so large that it risks overplotting. I've used this sort of plot for many years, most notably in the mitch Bioconductor package, where it is used to map the relationship of differential expression between two contrasts. There are solutions in ggplot, but I thought I'd begin with a base R approach. In the example below, some random data is generated and plotted. Just substitute your own data and give it a try.

    xvals <- rnorm(100,10,100)
    yvals <- rnorm(100,10,200)
    mx <- cbind(xvals,yvals)
    palette <- colorRampPalette(c("white", "yellow", "orange", "red",
      "darkred", "black"))
    k <- MASS::kde2d(mx[,1], mx[,2])
    X_AXIS = "x axis label"
    Y_AXIS = "y axis label"
    filled.contour(k, co
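
The excerpt's code is cut off mid-call, so here is a minimal self-contained sketch of the same base R idea. It is not necessarily what the full post does: the seed, the sample size of 1000, the grid size `n = 50`, the variable name `pal` and the `filled.contour` arguments are all my own assumptions.

```r
# Minimal sketch; MASS ships with standard R installations
set.seed(42)                       # assumption: seed added for reproducibility
xvals <- rnorm(1000, 10, 100)      # assumption: 1000 points instead of 100
yvals <- rnorm(1000, 10, 200)

# Colour ramp from low density (white) to high density (black)
pal <- colorRampPalette(c("white", "yellow", "orange", "red",
                          "darkred", "black"))

# Two-dimensional kernel density estimate on a 50 x 50 grid
k <- MASS::kde2d(xvals, yvals, n = 50)

# filled.contour accepts the list(x, y, z) returned by kde2d;
# color.palette takes a function that returns n colours
filled.contour(k, color.palette = pal,
               plot.title = title(main = "2D kernel density",
                                  xlab = "x axis label",
                                  ylab = "y axis label"))
```

Swapping `xvals` and `yvals` for two numeric vectors of equal length should be enough to try it on real data.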

A Docker image for Infinium methylation analysis

Performing a differential methylation analysis of Infinium array data requires an impressively large number of R packages, such as `minfi`, `missMethyl`, `limma`, `GenomicRanges`, `DMRcate`, `bumphunter` and many others. Each of these is in turn considered a heavy package, as it requires many dependencies. This means it can take up to an hour to go from a vanilla R installation to one with all the needed packages installed. If you are using multiple computers you might find that these have slightly different versions of R, Bioconductor and this large stack of dependencies, which could lead to different results. You may also find that it is difficult to install this large set of dependencies on shared systems, as some dependencies might require system libraries that need admin permissions to install. The way I've tried to alleviate this problem is to install all my needed packages into a Docker image which can then be downloaded and run in a few minutes on a ne

Beeswarm chart for categorical data

Biomedical journal articles are full of categorical data, showing data for control and case groups using barplots with whiskers. While these are popular, that type of chart can hide the underlying data patterns and is discouraged by statisticians. There are alternatives such as boxplots, violin charts and strip charts, but I've become a big fan of beeswarm charts lately. The reason is that beeswarms show the distribution just like violin plots but have the benefit of showing the individual points, which is helpful if sample size varies between categories. The way I like to employ beeswarm charts is to first create a boxplot and then overlay the beeswarm. With that approach, the median and interquartile ranges are shown, along with the actual datapoints. Here's the result. Some notes on how to make this chart: First step is to collect the data into a list of vectors, see the example code below for the iris dataset. Then make the boxplot. And finally the beeswarm plot with a
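
The three steps described in this excerpt (list of vectors, boxplot, beeswarm overlay) could look roughly like the sketch below. It assumes the CRAN `beeswarm` package is installed; the choice of `Sepal.Length`, the variable name `sepal_length` and all plotting parameters are my own assumptions rather than the post's actual code.

```r
# Requires the beeswarm package from CRAN: install.packages("beeswarm")
library(beeswarm)

# Step 1: collect the data into a named list of vectors, one per category
sepal_length <- split(iris$Sepal.Length, iris$Species)

# Step 2: draw the boxplot; outlier symbols are hidden because the
# beeswarm overlay will show every observation anyway
boxplot(sepal_length, outline = FALSE, col = "white",
        ylim = range(iris$Sepal.Length),
        ylab = "Sepal length (cm)")

# Step 3: overlay the individual points on the existing plot
beeswarm(sepal_length, add = TRUE, pch = 19, cex = 0.8, col = "darkgray")
```

Setting `ylim` to the full data range keeps points that fall beyond the whiskers inside the plotting area.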

Mass download from google drive using R

Google Drive is great for sharing documents and other small files but it's definitely not suited to moving many large files around. For example, I just received 170 fastq files that are each about 200 MB in size. If you use the browser to download the whole folder, the web app will zip the contents for you, which will take a LOOONG time. Alternatively you can download each and every one of those files one by one, which is annoying and prone to human error. You can insist that your collaborators transfer the data a different way, but there are not that many user-friendly and economical approaches. Your biologist collaborators probably won't be able to use rsync to get the data to you safely. And fast, convenient tools for moving around large files like Hightail cost a lot of money. A good solution to this problem is to use the R package googledrive, which enables command line automation of tasks that might take a long time manually. The package vignette has a good overview of the main comma
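
As a rough illustration of the kind of automation meant here, the sketch below loops over a shared folder and downloads each file. It assumes the `googledrive` package is installed and that the folder holds only files (no subfolders); the helper name `download_drive_folder`, its arguments and the placeholder URL are hypothetical, and this is not necessarily the approach taken in the full post.

```r
# Requires the googledrive package from CRAN: install.packages("googledrive")

# Hypothetical helper: download every file in a shared Drive folder
download_drive_folder <- function(folder_url, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # List the folder contents; the first call prompts for authentication
  files <- googledrive::drive_ls(googledrive::as_id(folder_url))

  for (i in seq_len(nrow(files))) {
    target <- file.path(dest_dir, files$name[i])
    # Skip files already present locally (delete partial files before rerunning)
    if (file.exists(target)) next
    googledrive::drive_download(googledrive::as_id(files$id[i]), path = target)
  }
  invisible(files$name)
}

# Usage (placeholder URL, not a real folder):
# download_drive_folder("https://drive.google.com/drive/folders/<FOLDER_ID>", "fastq")
```

Checking file sizes or checksums against what the collaborator reports is still worthwhile after a bulk transfer like this.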

Uploading data to GEO - which method is faster?

If you have had to upload omics data to GEO before, you'll know it's a bit of a hassle and takes a long time. There are a few methods suggested by the GEO team if you are using the Unix command line:

Using 'ncftp'

    ncftp
    set passive on
    set so-bufsize 33554432
    open ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    put -R Folder_with_submission_files

Using 'lftp'

    lftp ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    mirror -R Folder_with_submission_files

Using 'sftp' (expect slower transfer speeds since this method encrypts on-the-fly)

    sftp geoftp@sftp-private.ncbi.nlm.nih.gov
    password: yourpasscode
    cd uploads/your@mail.com_yourfolder
    mkdir new_geo_submission
    cd new_geo_submission
    put file_name

Using 'ncftpput' (transfers from the command-line without entering an interactive shell)

Usage example:

    ncftpput -F -R -z -u geoftp -p "yourpasscode&q

Urgent need for minimum standards for reproducible functional enrichment analysis

Our preprint, “Guidelines for reliable and reproducible functional enrichment analysis”, is now online, so I thought I’d give you an overview. Enrichment analysis is widely used for exploration and interpretation of omics data, but I’ve noticed sloppy work is becoming more common. Over the past few years I’ve been asked to review several manuscripts where the enrichment analysis was poorly conducted and reported. Examples of this include lack of FDR control, incorrect background gene list specification and lack of essential methodological details. In the article we show how common these problems are and what the impact could be on the results. We did two screens of articles from PubMed Central. Firstly, we selected 200 articles with terms in the abstract related to enrichment analysis. These were examined against a checklist of reporting and methodological issues, and were cross-checked for accuracy. As some articles describe >1 analysis, the total numbe

Gene name errors: Redux

Our latest article "Gene name errors: Lessons not learned" (previously @biorxiv_genomic) has just been published in PLoS Computational Biology (link here). In this post I'll walk you through why it is so important to how computational biology is done. If you are a regular GenomeSpot reader you are probably well aware of how Excel mangles gene names. So why the need for a 2021 update? Well, we thought that the broader genomics community would know about it by now. It has been nearly 20 years since Zeeberg et al's paper on the issue, and five years since our article in Genome Biology, which got picked up in a few media outlets and led to SEPT, MARCH and DEC genes being renamed. So with Mandhri Abeysooriya, an outstanding Masters student at Deakin University, we set out to see whether gene name errors in supplementary data files were still an issue. We also had massive assistance from Megan Soria and Mary Kasu. When designing this new work, we wanted to make it bi