Beeswarm chart for categorical data

Biomedical journal articles are full of categorical data, typically showing control and case groups as barplots with whiskers. While popular, this type of chart can hide the underlying data patterns and is discouraged by statisticians. There are alternatives such as boxplots, violin charts and strip charts, but I've become a big fan of beeswarm charts lately. The reason is that beeswarms show the distribution just like violin plots but have the benefit of showing the individual points, which is helpful if sample size varies between categories. The way I like to employ beeswarm charts is to first create a boxplot and then overlay the beeswarm. With that approach, the median and interquartile range are shown, along with the actual datapoints. Here's the result. Some notes on how to make this chart: The first step is to collect the data into a list of vectors; see the example code below for the iris dataset. Then make the boxplot. And finally the beeswarm plot with a
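The three steps described above (a list of vectors, then the boxplot, then the overlaid beeswarm) can be sketched roughly as follows. This is a minimal sketch rather than the post's original code: it assumes the beeswarm package from CRAN is installed, and the variable name, the choice of sepal length and the styling options are illustrative only.

```r
# Step 1: collect the data into a named list of vectors, one per category.
sepal_by_species <- split(iris$Sepal.Length, iris$Species)

# Step 2: draw the boxplot. Outlier symbols are switched off because the
# beeswarm will draw every point anyway; ylim is set from the full data so
# that no point falls outside the plotting region.
boxplot(sepal_by_species,
        outline = FALSE,
        ylim = range(unlist(sepal_by_species)),
        col = "white",
        ylab = "Sepal length (cm)")

# Step 3: overlay the individual points. add = TRUE draws onto the
# existing boxplot instead of starting a new plot.
beeswarm::beeswarm(sepal_by_species,
                   add = TRUE,
                   pch = 19,
                   cex = 0.6,
                   col = "gray30")
```

Drawing the boxplot first keeps the summary statistics in the background, so the points stay readable even when one category has many more samples than another.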

Mass download from Google Drive using R

Google Drive is great for sharing documents and other small files, but it's definitely not suited to moving many large files around. For example, I just received 170 fastq files of about 200 MB each. If you use the browser to download the whole folder, the web app will zip the contents for you, which will take a LOOONG time. Alternatively you can download each and every one of those files one by one, which is annoying and prone to human error. You can ask your collaborators to transfer the data a different way, but there are not that many user-friendly and economical approaches. Your biologist collaborators probably won't be able to use rsync to get the data to you safely. And fast, convenient tools for moving around large files, like Hightail, cost a lot of money. A good solution to this problem is to use the R package googledrive, which enables command line automation of tasks that might take a long time manually. The package vignette has a good overview of the main comma
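As a rough illustration of that kind of automation, the sketch below lists the files in a shared folder and downloads them one at a time. It is not the post's original code: the helper name download_drive_folder, its arguments and the placeholder folder URL are hypothetical, and it assumes the googledrive package is installed and that you can authenticate interactively the first time it runs.

```r
# Hypothetical helper: download every file in a shared Google Drive folder.
# folder_url is the folder's sharing link (or its ID); dest_dir is a local
# directory that is created if it does not exist.
download_drive_folder <- function(folder_url, dest_dir = "drive_downloads") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # List the folder contents; googledrive prompts for authentication
  # on first use.
  files <- googledrive::drive_ls(googledrive::as_id(folder_url))

  for (i in seq_len(nrow(files))) {
    dest <- file.path(dest_dir, files$name[i])
    # Skip files that are already present so an interrupted run can resume.
    if (file.exists(dest)) next
    googledrive::drive_download(files[i, ], path = dest)
  }

  invisible(files$name)
}

# Example call (placeholder URL, not a real folder):
# download_drive_folder("https://drive.google.com/drive/folders/<folder-id>")
```

Downloading file by file avoids the slow server-side zipping step, and skipping files that already exist means a dropped connection only costs the file that was in flight.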

Uploading data to GEO - which method is faster?

If you have had to upload omics data to GEO before, you'll know it's a bit of a hassle and takes a long time. There are a few methods suggested by the GEO team if you are using the Unix command line:

Using 'ncftp':
    ncftp
    set passive on
    set so-bufsize 33554432
    open
    cd uploads/your @mail.com_ yourfolder
    put -R Folder_with_submission_files

Using 'lftp':
    lftp
    cd uploads/ your _ yourfolder
    mirror -R Folder_with_submission_files

Using 'sftp' (expect slower transfer speeds since this method encrypts on-the-fly):
    sftp geoftp @s
    password: yourpasscode
    cd uploads/ your _ yourfolder
    mkdir new_geo_submission
    cd new_geo_submission
    put file_name

Using 'ncftpput' (transfers from the command-line without entering an interactive shell). Usage example:
    ncftpput -F -R -z -u geoftp -p "yourpasscode"

Urgent need for minimum standards for reproducible functional enrichment analysis

Our preprint, “Guidelines for reliable and reproducible functional enrichment analysis”, is now online, so I thought I’d give you an overview. Enrichment analysis is widely used for exploration and interpretation of omics data, but I’ve noticed sloppy work is becoming more common. Over the past few years I’ve been asked to review several manuscripts where the enrichment analysis was poorly conducted and reported. Examples of this include lack of FDR control, incorrect background gene list specification and lack of essential methodological details. In the article we show how common these problems are and what the impact could be on the results. We did two screens of articles from PubMed Central. First, we selected 200 articles with terms in the abstract related to enrichment analysis. These were examined against a checklist of reporting and methodological issues, and were cross-checked for accuracy. As some articles describe >1 analysis, the total numbe

Gene name errors: Redux

Our latest article, "Gene name errors: Lessons not learned" (previously on @biorxiv_genomic), has just been published in PLoS Computational Biology (link here). In this post I'll walk you through why it is so important to how computational biology is done. If you are a regular GenomeSpot reader you are probably well aware of how Excel mangles gene names. So why the need for a 2021 update? Well, we thought that the broader genomics community would know about it by now. It has been nearly 20 years since Zeeberg et al.'s paper on the issue, and five years since our article in Genome Biology, which got picked up in a few media outlets and led to the SEPT, MARCH and DEC genes being renamed. So with Mandhri Abeysooriya, an outstanding Masters student at Deakin University, we set out to see whether gene name errors in supplementary data files were still an issue. We also had massive assistance from Megan Soria and Mary Kasu. When designing this new work, we wanted to make it bi

Understanding pathway-level regulation of chromatin marks with the "mitch" Bioconductor package (Epigenetics 2021 Conference Presentation)

Presented 18th February 2021.

Abstract: Gene expression is governed by numerous chromatin modifications. Understanding these dynamics is critical to understanding human health and disease, but there are few software options for researchers looking to integrate multi-omics data at the level of pathways. To address this, we developed mitch, an R package for multi-contrast gene set enrichment analysis. It uses a rank-MANOVA statistical approach to identify sets of genes that exhibit joint enrichment across multiple contrasts. In this talk I will demonstrate using mitch and showcase its advanced visualisation features to explore the regulation of signaling and biochemical pathways at the chromatin level.

10 quick tips for genomics data management

I get asked a lot about the best ways to store sequence data because the files are massive and researchers have various levels of knowledge of the hardware and software. Here I'll run through some best practices for genomics research data management based on my 10 years of experience in the space.
1. Always work on servers, not remote machines or laptops
On-prem machines and cloud servers are preferred because you can log into them from anywhere using ssh or another protocol. These machines are better suited to heavy loads and are less likely to break down thanks to institutional tech support and maintenance. Institutional data transfer speeds will be far superior to your home network. Never do computational work on a laptop. Avoid storing data on your own portable hard drives or flash drives. If you don't have a server, ask for access at your institution or research cloud provider (we use Nectar in Australia).
2. Download the data to the place where you will be working on i