Adding a Docker image to your R bioinformatics project to improve reproducibility

I have gotten into the habit of making Docker images for each of my projects, which is helpful when projects are large and span multiple years. Long-running projects are a headache for bioinformaticians because software like R has significant updates each year, and we want to keep our machines running the most up-to-date software with recent bug fixes, features and new packages. Docker images also help make the research more reproducible for others who are interested in taking a deeper dive into the research materials. This helps auditability and could even improve research quality through better transparency. That said, it is still a bit tricky to achieve, so I thought I would write a walk-through of how I do it with an R workflow.
Step 0: Install Docker on a Linux system
On Ubuntu you can use: sudo apt install
Other OSs will vary.
Step 1: Write your R Markdown script
You probably have a workflow that you've been working on which works for your cur

The power of enrichment analysis for epigenetics: Application to schizophrenia

As regular readers of GenomeSpot will know, we have adopted the mitch package to enable enrichment analysis of methylation array data, as shown in our recent pre-print. You might be thinking that enrichment analysis of methylation data has existed for a long time, so why do we need a new approach? Well, the answer is that current methods suffer from one of two critical problems: (1) they use over-representation analysis (ORA), which has very low sensitivity, or (2) they do not distinguish between up- and down- directions of methylation change. (1) is a problem because ORA relies on a binary classification of probes or genes into differential or not, whereas in reality probes and genes are spread along a distribution of differential-ness. This makes ORA easy to compute, but leads to low sensitivity. In serious cases, the (arbitrary) threshold used for selection of differential probes means that very few probes are present in the foreground. (2) is a problem because genes and pathways are t

Yes, mitch can be used for pathway analysis of Methylation array data

In 2020, Dr Antony Kaspi and I published a method called "mitch" [1], which is like GSEA but was specifically designed for multi-contrast analysis and is based on rank-MANOVA statistics inspired by a 2012 paper by Cox and Mann. Mitch worked well for various types of omics data downstream of commonly used differential abundance tools like DESeq2, edgeR, DiffBind, etc, but we didn't consider at the time how mitch could be applied to microarray data. You might think that microarrays are outdated, but they are still used extensively for epigenome-wide association studies (EWASs), which are frequently used to understand disease processes and to identify biomarkers of disease. To demonstrate, there are 1536 publicly available methylation array studies on NCBI GEO, and probably thousands more that are restricted-access. The tools available for pathway analysis of methylation array data are a bit limited. There's an over-representation method that takes into consideration

Don't use KEGG!!!

KEGG is the Kyoto Encyclopedia of Genes and Genomes, a compendium of gene functional annotations which are commonly used for pathway enrichment analysis. KEGG has been around since 1997 (PMID: 9390290), but in 2000, as the draft human genome sequence was released, KEGG became a key database involved in the curation of literature data to catalog the function of all human genes, many of which were newly sequenced and previously uncharacterised. This invigorated interest in KEGG is demonstrated by one of their articles from 2000 (PMID: 10592173) accruing 24,236 citations according to Dimensions, 1270-fold higher than other articles from that time. A PubMed search using the "KEGG" keyword returns 27,033 results from matches in abstracts alone, and the trajectory is increasing at a rapid rate. Despite the popularity of this tool, I'm urging you not to use it for your research. Here I will lay out the reasons.
1. It isn't comprehensive
I did an analysis of KEGG and other pathway se

Example blast workflow (nucleotide)

BLAST is a stalwart in the bioinformatics space. I've used it in multiple contexts and it is a good way to introduce students to bioinformatics principles and the process of pipeline development. Although there is a web interface, it is still good to use it locally if you need to run a large number of queries. As my group needed to use BLAST again this week, I thought I'd share a small example script which I shared with my Masters research students just getting into bioinformatics. The script (shown below) downloads the E. coli gene coding sequences and then extracts a few individual sequences at random. These undergo random mutagenesis, and then we can use the mutated sequences as queries to find the original gene with BLAST. The output format is tabular, which suits downstream large-scale data analysis. The script also includes steps for generating the BLAST index. The script is uploaded as a gist here. It requires these prerequisites: sudo apt install ncbi-blast+ emboss unwrap_fas

Update your gene names when doing pathway analysis of array data!

If you are doing analysis of microarray data such as Infinium methylation arrays, then the genomic annotations you're using might be several years old. The EPIC methylation chip was released in 2016 and the R Bioconductor annotation set hasn't been updated much since. So we expect that some gene names have changed, which will reduce the performance of the downstream pathway analysis. The gene symbols you're using can be updated using the HGNChelper R package on CRAN. Let's say we want to make a table that maps probe IDs to gene names; the following code can be used.
library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names

Reflections on 2023 and outlook

It has been an amazing year of research. I've been at Burnet Institute since August 2023 as Head of Bioinformatics and I've really enjoyed the challenge of serving the many and varied 'omics projects at the Institute and loved discussing new project ideas with everyone here. I'm still active at Deakin in student supervision and project collaborations and this is ongoing. In terms of research directions, my group has been focused on our three themes:
1. Bioinformatics collaborative analysis
2. Building better software tools for omics analysis
3. Reproducibility and research rigour
Some of the long-running projects have been completed, including methylation analysis of type-1 diabetes complications, which has been about 10 years in the making [1]. The number of collaborative projects has dipped, which is normal when changing institutes, and I hope this will lift in the coming years as Burnet work gets completed. In terms of research directions for 2024, there are many. One