Posts

Mass download from Google Drive using R

Google Drive is great for sharing documents and other small files, but it's definitely not suited to moving many large files around. For example, I just received 170 fastq files that are each about 200 MB in size. If you use the browser to download the whole folder, the web app will zip the contents for you, which takes a LOOONG time. Alternatively, you can download each and every one of those files one by one, which is tedious and prone to human error. You could insist that your collaborators transfer the data a different way, but there are not many user-friendly and economical alternatives. Your biologist collaborators probably won't be able to use rsync to get the data to you safely, and fast, convenient tools for moving large files around, like Hightail, cost a lot of money. A good solution to this problem is the R package googledrive, which enables command-line automation of tasks that would take a long time manually. The package vignette has a good overview of the main commands…
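As a rough sketch of what that automation can look like (the folder URL below is a made-up placeholder, not a real share), listing a shared folder and downloading each file with googledrive might go something like this:

    # List and download everything in a shared Google Drive folder.
    # The folder URL is a hypothetical placeholder.
    library(googledrive)
    drive_auth()  # opens a browser window to authorise access

    folder <- as_id("https://drive.google.com/drive/folders/XXXXXXXXXXXX")
    files <- drive_ls(folder)

    # Download each file to the working directory, keeping the original names
    for (i in seq_len(nrow(files))) {
      drive_download(as_id(files$id[i]), path = files$name[i], overwrite = TRUE)
    }

A loop like this can be left running unattended, which is the whole point: no zipping, no clicking through 170 download dialogs.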

Uploading data to GEO - which method is faster?

If you have had to upload omics data to GEO before, you'll know it's a bit of a hassle and takes a long time. There are a few methods suggested by the GEO team if you are using the Unix command line:

Using 'ncftp':

    ncftp
    set passive on
    set so-bufsize 33554432
    open ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    put -R Folder_with_submission_files

Using 'lftp':

    lftp ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    mirror -R Folder_with_submission_files

Using 'sftp' (expect slower transfer speeds since this method encrypts on-the-fly):

    sftp geoftp@sftp-private.ncbi.nlm.nih.gov
    password: yourpasscode
    cd uploads/your@mail.com_yourfolder
    mkdir new_geo_submission
    cd new_geo_submission
    put file_name

Using 'ncftpput' (transfers from the command line without entering an interactive shell). Usage example:

    ncftpput -F -R -z -u geoftp -p "yourpasscode" …

Urgent need for minimum standards for reproducible functional enrichment analysis

Our preprint, "Guidelines for reliable and reproducible functional enrichment analysis", is now online, so I thought I'd give you an overview. Enrichment analysis is widely used for exploration and interpretation of omics data, but I've noticed that sloppy work is becoming more common. Over the past few years I've been asked to review several manuscripts where the enrichment analysis was poorly conducted and reported. Examples of this include a lack of FDR control, incorrect background gene list specification, and missing essential methodological details. In the article we show how common these problems are and what their impact on the results could be. We did two screens of articles from PubMed Central. First, we selected 200 articles with terms in the abstract related to enrichment analysis. These were examined against a checklist of reporting and methodological issues, and were cross-checked for accuracy. As some articles describe >1 analysis, the total number…
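To make the background gene list point concrete, here is a minimal, hypothetical sketch using clusterProfiler's enricher (object names de_genes, detected_genes and pathway2gene are made up) showing an explicit background passed via the universe argument, rather than silently defaulting to the whole genome:

    # Hypothetical sketch: over-representation test with an explicit background.
    # 'de_genes' are the significant genes; 'detected_genes' is the background,
    # i.e. all genes actually detected in the assay.
    library(clusterProfiler)
    res <- enricher(gene          = de_genes,
                    universe      = detected_genes,  # explicit background list
                    TERM2GENE     = pathway2gene,    # two-column data frame: term, gene
                    pAdjustMethod = "BH")            # FDR control
    head(as.data.frame(res))

Using the detected genes rather than all annotated genes as the background is one of the simplest ways to avoid the inflated p-values described in the preprint.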

Gene name errors: Redux

Our latest article, "Gene name errors: Lessons not learned", previously on @biorxiv_genomic, has just been published in PLoS Computational Biology (link here). In this post I'll walk you through why it matters so much for how computational biology is done. If you are a regular GenomeSpot reader you are probably well aware of how Excel mangles gene names. So why the need for a 2021 update? Well, we thought the broader genomics community would know about it by now. It has been nearly 20 years since Zeeberg et al.'s paper on the issue, and five years since our article in Genome Biology, which got picked up by a few media outlets and led to the SEPT, MARCH and DEC genes being renamed. So with Mandhri Abeysooriya, an outstanding Masters student at Deakin University, we set out to see whether gene name errors in supplementary data files were still an issue. We also had massive assistance from Megan Soria and Mary Kasu. When designing this new work, we wanted to make it bi…
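For readers new to the problem: when a file is opened in Excel with default settings, symbols like SEPT2 are silently converted to dates. A rough, hypothetical R check for date-like values in a gene symbol column (the example vector and regular expressions are illustrative only) could look like:

    # Rough sketch: flag values that look like Excel date conversions.
    # Patterns cover common mangled forms such as "2-Sep" and "2006-09-02".
    genes <- c("TP53", "2-Sep", "MARCH1", "2006-03-01", "BRCA1")
    date_like <- grepl("^[0-9]{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", genes) |
                 grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", genes)
    genes[date_like]  # "2-Sep" "2006-03-01"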

Understanding pathway-level regulation of chromatin marks with the "mitch" Bioconductor package (Epigenetics 2021 Conference Presentation)

Presented 18th February 2021

Abstract: Gene expression is governed by numerous chromatin modifications. Understanding these dynamics is critical to understanding human health and disease, but there are few software options for researchers looking to integrate multi-omics data at the level of pathways. To address this, we developed mitch, an R package for multi-contrast gene set enrichment analysis. It uses a rank-MANOVA statistical approach to identify sets of genes that exhibit joint enrichment across multiple contrasts. In this talk I will demonstrate using mitch and showcase its advanced visualisation features to explore the regulation of signaling and biochemical pathways at the chromatin level.
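A minimal sketch of a typical multi-contrast mitch workflow (the input objects and GMT file name here are hypothetical placeholders):

    # Sketch of a multi-contrast mitch analysis; inputs are hypothetical.
    library(mitch)
    genesets <- gmt_import("reactome.gmt")  # gene sets in GMT format

    # 'h3k4me3' and 'h3k27me3' stand in for differential results (e.g. from
    # DESeq2), one per chromatin mark, imported into a single ranked profile
    x <- mitch_import(list(H3K4me3 = h3k4me3, H3K27me3 = h3k27me3),
                      DEtype = "deseq2")

    res <- mitch_calc(x, genesets, priority = "effect")
    mitch_report(res, "chromatin_pathways.html")  # HTML report with visualisations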

10 quick tips for genomics data management

I get asked a lot about the best ways to store sequence data, because the files are massive and researchers have varying levels of knowledge of the hardware and software. Here I'll run through some best practices for genomics research data management, based on my 10 years of experience in the space.

1. Always work on servers, not local machines or laptops. On-prem machines and cloud servers are preferred because you can log into them from anywhere using ssh or another protocol. These machines are better suited to heavy loads and are less likely to break down, thanks to institutional tech support and maintenance. Institutional data transfer speeds will also be far superior to your home network. Never do computational work on a laptop, and avoid storing data on your own portable hard drives or flash drives. If you don't have a server, ask for access at your institution or a research cloud provider (we use Nectar in Australia).

2. Download the data to the place where you will be working on it…
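On tip 2, the point is to pull data straight down to the analysis server rather than routing it through your laptop. A trivial, hypothetical example in R (the URL is a placeholder):

    # Fetch a data file directly onto the server you will analyse it on.
    # The URL is a hypothetical placeholder.
    url <- "https://example.org/data/sample1.fastq.gz"
    download.file(url, destfile = "sample1.fastq.gz", mode = "wb")
    file.size("sample1.fastq.gz")  # sanity-check that the transfer completed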

DEE2 projects on demand

We have noted that the lag between new datasets appearing on SRA and being processed by DEE2 has been about 3 to 6 months. Our dream is to shrink this down to two weeks, but we simply do not have access to that much compute power at the moment. To address this, we have devised an "on-demand" feature so that you can request certain datasets to be processed rapidly. We think this is a great feature because it serves the main mission of the DEE2 project: to make all RNA-seq data freely available to everyone. Here's how to use it:

1. Visit http://dee2.io/request.html and you will be greeted with a web form. Select the organism of interest.

2. Provide the SRA project accession number of the dataset. These numbers begin with SRP/ERP/DRP. If you have a different type of accession, such as a GEO Series (GSE) or BioProject (PRJNA) ID, you will need to navigate NCBI to find the SRP number.

3. Check that the SRP number is in the standard DEE2 queue. To do that, follow the l…
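Once a requested project has been processed, the data can be pulled straight into R. A hedged sketch using the getDEE2 Bioconductor package (the run accessions below are placeholders):

    # Sketch: fetch processed expression data from DEE2 into R.
    # Species and run accessions are placeholders.
    library(getDEE2)
    mdat <- getDEE2Metadata("hsapiens")  # metadata for all human runs
    x <- getDEE2("hsapiens", c("SRR0000001", "SRR0000002"),
                 metadata = mdat, legacy = TRUE)
    head(x$GeneCounts)  # gene-level count matrix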