Posts

Don't use KEGG!!!

KEGG is the Kyoto Encyclopedia of Genes and Genomes, a compendium of gene functional annotations commonly used for pathway enrichment analysis. KEGG has been around since 1997 (PMID: 9390290), but in 2000, as the draft human genome sequence was released, KEGG became a key database involved in the curation of literature data to catalog the function of all human genes, many of which were newly sequenced and previously uncharacterised. This renewed interest in KEGG is demonstrated by one of their articles from 2000 (PMID: 10592173) accruing 24,236 citations according to Dimensions, 1,270-fold higher than other articles from that time. A PubMed search for the keyword "KEGG" shows 27,033 results, with matches in abstracts alone, and the trajectory is increasing at a rapid rate. Despite the popularity of this tool, I'm urging you not to use it for your research. Here I will lay out the reasons.

1. It isn't comprehensive

I did an analysis of KEGG and other pathway se

Example blast workflow (nucleotide)

BLAST is a stalwart in the bioinformatics space. I've used it in multiple contexts and it is a good way to introduce students to bioinformatics principles and the process of pipeline development. Although there is a web interface, it is still good to run it locally if you need to run a large number of queries. As my group needed to use BLAST again this week, I thought I'd share a small example script which I shared with my Master's research students just getting into bioinformatics. The script (shown below) downloads the E. coli gene coding sequences and then randomly extracts a few individual sequences. These undergo random mutagenesis, and then we can use the mutated sequences as a query to find the original gene with BLAST. The output format is tabular, which suits downstream large-scale data analysis. The script also includes steps for generating the BLAST index. The script is uploaded as a gist here. It requires prerequisites: sudo apt install ncbi-blast+ emboss unwrap_fas

Update your gene names when doing pathway analysis of array data!

If you are doing analysis of microarray data such as Infinium methylation arrays, then the genomic annotations you're using might be several years old. The EPIC methylation chip was released in 2016 and the R Bioconductor annotation set hasn't been updated much since. So we expect that some gene names have changed, which will reduce the performance of the downstream pathway analysis. The gene symbols you're using can be updated with the HGNChelper R package on CRAN. Let's say we want to make a table that maps probe IDs to gene names; the following code can be used.

library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names

Reflections on 2023 and outlook

It has been an amazing year of research. I've been at Burnet Institute since August 2023 as Head of Bioinformatics and I've really enjoyed the challenge of serving the many and varied 'omics projects at the Institute, and loved discussing new project ideas with everyone here. I'm still active at Deakin in student supervision and project collaborations, and this is ongoing. In terms of research directions, my group has been focused on our three themes:

1. Bioinformatics collaborative analysis
2. Building better software tools for omics analysis
3. Reproducibility and research rigour

Some of the long-running projects have been completed, including methylation analysis of type 1 diabetes complications, which has been about 10 years in the making [1]. The number of collaborative projects has dipped, which is normal when changing institutes, and I hope this will lift in the coming years as Burnet work gets completed. In terms of research directions for 2024, there are many. One

Energy expenditure of computational research in context

There have been a few papers recently on “green computing” and the changes we can make to ensure our work is more sustainable. The most important aspect of this is the overuse of energy in conducting our analysis, especially if the energy is derived from burning fossil fuels. What I want to do here is put into context the energy expenditure of research computing systems by comparing it to other energy expenditures such as travel and transport.

Energy consumption of a workstation or small server

In order to quantify the energy expenditure of a compute system, we need to make some assumptions. We will assume that for bioinformatics the CPU is the main consumer of power and is working at 50% of capacity.

AMD Ryzen 9 5950X (16 cores / 32 threads): maximum power consumption = 142 W / 2 = 71 W (source)
High-end motherboard: 75 W
RAM: 3 W * 8 sticks = 24 W
Mid-range graphics card, idle: 12 W
HDD storage: 9 W * 2 = 18 W
Total: 71 + 75 + 24 + 12 + 18 = 200 W

So a workstation, or small server with 32 th
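To sanity-check the arithmetic above and convert it into more familiar units, here is a quick shell calculation. The daily and yearly figures are my own back-of-envelope extension, assuming the machine runs around the clock at this load:

```shell
# Sum the per-component wattages from the list above
total_w=$(( 71 + 75 + 24 + 12 + 18 ))   # CPU + motherboard + RAM + GPU + HDDs

# Convert power to energy: kWh per day and per year at constant draw
kwh_day=$(awk -v w="$total_w" 'BEGIN{printf "%.1f", w * 24 / 1000}')
kwh_year=$(awk -v w="$total_w" 'BEGIN{printf "%.0f", w * 24 * 365 / 1000}')

echo "${total_w} W -> ${kwh_day} kWh/day, ${kwh_year} kWh/year"
# prints: 200 W -> 4.8 kWh/day, 1752 kWh/year
```

Roughly 1,750 kWh per year is the kind of figure we can then hold up against travel and transport.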

"Dev" and "Prod" for bioinformatics

I’ve been thinking a lot about best practices lately. I even co-wrote a best practices article just last month. As I have been working with students and colleagues and reflecting on my own practices, I have come to the conclusion that we researchers need to align our work more closely with software developers rather than other researchers. In software development there is a strong differentiation between “development” and “production” work environments. Production is the live app after release to consumers; it is critical that the app functions as expected, is reliable and useful, and provides a good user experience. This is, after all, where these tech companies make their money. Any downtime is going to be embarrassing and will cost the company money and customers. The development environment, on the other hand, is the place where software developers can experiment with creating new features, prototype, and refine ideas. As things are built, the software code contains lots of bugs, and this

The five pillars of computational reproducibility: Bioinformatics and beyond

I've been working on a new project to follow up our paper last year on the problems with pathway enrichment analysis. That article turned out to be a bleak and depressing look into how frequently used tools in genomics are misused. It is not an exaggeration to say that most articles showing some type of enrichment analysis are doing it wrong, and no doubt this is severely impacting the literature. However, I think it isn't helpful to only focus on the negative aspects of bioinformatics and computational research. We also need to lead the way towards resolving these issues. The best way to do this, in my view, is to provide step-by-step guides and tutorials for common routines. So this is what we are in the process of doing: making a protocol for pathway enrichment analysis that is "extremely reproducible". By this, I mean that the analysis could be reproduced independently in the future with a minimum of fuss and time. As we were writing this we also recognised that the