Showing posts from 2021

Uploading data to GEO - which method is faster?

If you have had to upload omics data to GEO before, you'll know it's a bit of a hassle and takes a long time. There are a few methods suggested by the GEO team if you are using the Unix command line:

Using 'ncftp'

    ncftp
    set passive on
    set so-bufsize 33554432
    open
    cd uploads/your@mail.com_yourfolder
    put -R Folder_with_submission_files

Using 'lftp'

    lftp
    cd uploads/ your _ yourfolder
    mirror -R Folder_with_submission_files

Using 'sftp' (expect slower transfer speeds since this method encrypts on-the-fly)

    sftp  geoftp @s
    password: yourpasscode
    cd uploads/ your _ yourfolder
    mkdir new_geo_submission
    cd new_geo_submission
    put file_name

Using 'ncftpput' (transfers from the command-line without entering an interactive shell)

Usage example:

    ncftpput -F -R -z -u  geoftp  -p "yourpasscode"

Urgent need for minimum standards for reproducible functional enrichment analysis

Our preprint, “Guidelines for reliable and reproducible functional enrichment analysis”, is now online, so I thought I’d give you an overview. Enrichment analysis is widely used for exploration and interpretation of omics data, but I’ve noticed sloppy work is becoming more common. Over the past few years I’ve been asked to review several manuscripts where the enrichment analysis was poorly conducted and reported. Examples of this include lack of FDR control, incorrect background gene list specification and lack of essential methodological details. In the article we show how common these problems are and what the impact could be on the results. We did two screens of articles from PubMed Central. First, we selected 200 articles with terms in the abstract related to enrichment analysis. These were examined against a checklist of reporting and methodological issues, and were cross-checked for accuracy. As some articles describe >1 analysis, the total numbe
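To make two of the issues named above concrete (background gene list specification and FDR control), here is a minimal sketch of an over-representation test. It is an illustration only, not the method used in the preprint: the function names and the toy gene identifiers are hypothetical, and the statistics are a plain hypergeometric test followed by Benjamini-Hochberg adjustment.

```python
from math import comb


def hypergeom_pvalue(k, M, n, N):
    """P(X >= k) when N genes are drawn from a background of M genes,
    n of which belong to the gene set."""
    total = comb(M, N)
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrichment(foreground, background, gene_sets):
    """Test each gene set against an explicit background list.

    Genes absent from the background are dropped from both the
    foreground and the gene sets before testing.
    Returns {set_name: (p_value, fdr_adjusted_p_value)}.
    """
    background = set(background)
    foreground = set(foreground) & background
    names = list(gene_sets)
    pvals = []
    for name in names:
        members = set(gene_sets[name]) & background
        overlap = len(foreground & members)
        pvals.append(
            hypergeom_pvalue(overlap, len(background), len(members), len(foreground))
        )
    return {n: (p, q) for n, p, q in zip(names, pvals, benjamini_hochberg(pvals))}


if __name__ == "__main__":
    # Toy data: 20 detected genes, 5 of them differentially expressed.
    detected = [f"g{i}" for i in range(20)]
    hits = detected[:5]
    sets = {"A": detected[:4] + ["g10"], "B": detected[15:]}
    for name, (p, q) in enrichment(hits, detected, sets).items():
        print(name, p, q)
```

The design point is that the background is an explicit argument (for example, the genes actually detected in the experiment) rather than an implicit whole-genome default, and that adjusted p-values are reported alongside the raw ones.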

Gene name errors: Redux

Our latest article "Gene name errors: Lessons not learned", previously @biorxiv_genomic, has just been published in PLoS Computational Biology (link here). In this post I'll walk you through why it is so important to how computational biology is done. If you are a regular GenomeSpot reader you are probably well aware of how Excel mangles gene names. So why the need for a 2021 update? Well, we thought that the broader genomics community would know about it by now. It has been nearly 20 years since Zeeberg et al.'s paper on the issue, and five years since our article in Genome Biology, which got picked up in a few media outlets and led to SEPT, MARCH and DEC genes being renamed. So with Mandhri Abeysooriya, an outstanding Masters student at Deakin University, we set out to see whether gene name errors in supplementary data files were still an issue. We also had massive assistance from Megan Soria and Mary Kasu. When designing this new work, we wanted to make it bi
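For readers who have not seen the problem first hand, the following is a small sketch of how date-like gene names could be flagged in a column of identifiers. It is not the screening pipeline from the article: the pattern and the function name `flag_date_like` are assumptions made for illustration, and the pattern only covers the day-month form (such as "1-Sep") that the SEPT, MARCH and DEC gene families are commonly converted to.

```python
import re

# Assumed pattern: one or two digits, a hyphen, then Sep/Mar/Dec.
# Other date renderings (e.g. full ISO dates) are not covered here.
DATE_LIKE = re.compile(r"^\d{1,2}-(sep|mar|dec)$", re.IGNORECASE)


def flag_date_like(names):
    """Return the entries that look like dates rather than gene symbols."""
    return [name for name in names if DATE_LIKE.match(name.strip())]


if __name__ == "__main__":
    column = ["SEPT1", "1-Sep", "TP53", "2-Mar", "10-DEC", "MARCH1"]
    print(flag_date_like(column))
```

A check like this is cheap to run on a gene list before it is shared as a supplementary file.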

Understanding pathway-level regulation of chromatin marks with the "mitch" Bioconductor package (Epigenetics 2021 Conference Presentation)

Presented 18th February 2021

Abstract

Gene expression is governed by numerous chromatin modifications. Understanding these dynamics is critical to understanding human health and disease, but there are few software options for researchers looking to integrate multi-omics data at the level of pathways. To address this, we developed mitch, an R package for multi-contrast gene set enrichment analysis. It uses a rank-MANOVA statistical approach to identify sets of genes that exhibit joint enrichment across multiple contrasts. In this talk I will demonstrate using mitch and showcase its advanced visualisation features to explore the regulation of signaling and biochemical pathways at the chromatin level.

10 quick tips for genomics data management

I get asked a lot about the best ways to store sequence data because the files are massive and researchers have various levels of knowledge of the hardware and software. Here I'll run through some best practices for genomics research data management based on my 10 years of experience in the space.

1. Always work on servers, not remote machines or laptops

On-prem machines and cloud servers are preferred because you can log into them from anywhere using ssh or another protocol. These machines are better suited to heavy loads and are less likely to break down because of the institutional tech support and maintenance. Institutional data transfer speeds will be far superior to your home network. Never do computational work on a laptop. Avoid storing data on your own portable hard drives or flash drives. If you don't have a server, ask for access at your institution or research cloud provider (we use Nectar in Australia).

2. Download the data to the place where you will be working on i

DEE2 projects on demand

We have noted that the time between new datasets appearing on SRA and being processed by DEE2 has been about 3 to 6 months. Our dream is to shrink this down to two weeks, but we simply do not have access to that much compute power at the moment. To address this we have devised an "on-demand" feature so that you can request certain datasets to be processed rapidly. We think this is a great feature because it serves the main mission of the DEE2 project, which is to make all RNA-seq data freely available to everyone. Here's how to use it:

1. Visit  and you will be greeted with a webform. Select the organism of interest.

2. Provide the SRA project accession number of the dataset. These numbers begin with SRP, ERP or DRP. If you have a different type of accession, such as GEO Series (GSE) or Bioproject (PRJNA), then you will need to navigate NCBI to find the SRP number.

3. Check that the SRP number is in the standard DEE2 queue. To do that, follow the l
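As a small aside on step 2, the accession prefix can be checked before filling in the webform. This is only a sketch: the helper name `is_sra_project` is hypothetical, it is not part of DEE2, and it assumes a project accession is one of the three prefixes named above followed by digits only.

```python
import re

# Assumption: SRA project accessions are SRP/ERP/DRP followed by digits.
SRA_PROJECT = re.compile(r"^[SED]RP\d+$")


def is_sra_project(accession):
    """Return True if the string looks like an SRP/ERP/DRP accession."""
    return bool(SRA_PROJECT.match(accession.strip().upper()))


if __name__ == "__main__":
    for acc in ["SRP123456", "ERP000001", "GSE12345", "PRJNA123456"]:
        print(acc, is_sra_project(acc))
```

A GSE or PRJNA identifier fails this check, which is the cue to look up the matching SRP number at NCBI first.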

Effect of COVID-19 on genomics publications in 2020

COVID-19 was and remains a major crisis in many countries, disrupting general life as well as scientific research. But how has it impacted scientific output in genomics? To evaluate this, I investigated the number of papers published in PubMed Central (PMC) in the period from 2016 through 2020. I counted the total number of papers as well as those matching the genomics search term, using the approach below (shown here for 2020):

(genom*[Abstract]) AND ("2020"[Publication Date] : "2020"[Publication Date])

Note: the number of papers is only a lagging proxy measure of aggregate activity in a field; it does not relate to scientific quality.

Here are the numbers of total papers and genomics papers published annually over this period. What you can see is that genomics experienced a major fall in the number of papers appearing in PMC in 2020, while total papers did not. Indeed, 2020 was the only year since 2000 in which the number of published genomics papers went down compared to the previous year
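If you want to repeat the count for other years, the search term can be generated programmatically. The sketch below is one way this might be done and is not necessarily how the numbers in this post were obtained: the helper names are hypothetical, and `fetch_count` assumes the NCBI E-utilities ESearch endpoint with JSON output, which needs network access and should be used within NCBI's usage limits.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def year_filter(year):
    """Date-range clause restricting results to a single publication year."""
    return f'("{year}"[Publication Date] : "{year}"[Publication Date])'


def genomics_term(year):
    """The genomics search term from the post, for an arbitrary year."""
    return f"(genom*[Abstract]) AND {year_filter(year)}"


def esearch_url(term, db="pmc"):
    """Build an ESearch URL that asks only for the hit count."""
    query = urlencode({"db": db, "term": term, "retmode": "json", "retmax": 0})
    return f"{EUTILS}?{query}"


def fetch_count(term):
    """Query ESearch and return the number of matching records (needs network)."""
    with urlopen(esearch_url(term)) as response:
        return int(json.load(response)["esearchresult"]["count"])


if __name__ == "__main__":
    # Print the terms only; call fetch_count(term) to retrieve the numbers.
    for year in range(2016, 2021):
        print(year, genomics_term(year))
```

Calling `fetch_count(year_filter(y))` and `fetch_count(genomics_term(y))` for each year would give the total and genomics counts respectively.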