Showing posts from 2019

Extract properly paired reads from a BAM file using SamTools

For a few different types of sequence analysis I need to extract read pairs that are properly mapped and satisfy some mapping quality filter, but I always forget the command line. So here it is.

To extract read pairs that map with mapQ≥30, writing BAM output (-b) and including the BAM file header (-h):

$ samtools view -q 30 -f 0x2 -b -h in.bam > out.bam

The -f 0x2 option corresponds to the bitwise flag that specifies that reads need to be properly paired. Proper pairing means reads are in Read1 forward, Read2 reverse orientation or Read1 reverse, Read2 forward orientation.

To extract single-end reads that map with mapQ≥30, again writing BAM output (-b) and including the header (-h):

$ samtools view -q 30 -b -h in.bam > out.bam
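To make the filter logic concrete, here is a minimal Python sketch of the per-read test that -q 30 -f 0x2 amounts to: every bit given to -f must be set in the read's FLAG, and MAPQ must be at least the -q threshold. The function name and the example (flag, mapQ) values are made up for illustration; this is not samtools code.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit for "read mapped in proper pair"


def passes_filter(flag, mapq, min_mapq=30, required_flags=PROPER_PAIR):
    """Return True if all required FLAG bits are set and MAPQ >= min_mapq."""
    return (flag & required_flags) == required_flags and mapq >= min_mapq


# Hypothetical (FLAG, MAPQ) values, for illustration only.
examples = [(99, 60), (147, 42), (65, 60), (99, 10)]
for flag, mapq in examples:
    print(flag, mapq, passes_filter(flag, mapq))
```

Flags 99 and 147 both have the 0x2 bit set, so they pass when mapQ is high enough; flag 65 is paired but not properly paired, so it fails regardless of mapQ.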

InterPro Protein domain based gene set library

Recently I wanted to know whether genes with methyltransferase domains were upregulated in my dataset. This isn't currently captured in the major gene set databases as far as I know. I dug into some older files of mine and found that the InterPro protein domain information is actually included in Ensembl BioMart and it's relatively straightforward to convert this to GMT format for pathway analysis.

TL;DR: Here is a link to the human GMT file for use in GSEA and other pathway analysis. Please cite the recent InterPro paper if you use this in your research: Mitchell et al. 2019. If you are interested in learning how it was made, read on.

Method 1 - Obtaining InterPro data from Ensembl BioMart

Head to Ensembl BioMart and select the human database. Then click "Attributes". Here you can select the bits of information you want. Select only the following attributes and click the boxes in the or
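As a rough illustration of the conversion to GMT mentioned above, the Python sketch below groups genes by InterPro ID and writes one tab-separated line per domain (set name, description, then member genes). The column names "Gene name", "Interpro ID" and "Interpro Short Description", the function name and the toy rows are all assumptions for illustration; a real BioMart export should be checked for its actual headers.

```python
import csv
import io
from collections import defaultdict


def biomart_to_gmt(tsv_text):
    """Convert a BioMart-style TSV (assumed column names) to GMT text."""
    genes_by_domain = defaultdict(set)
    descriptions = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene = (row.get("Gene name") or "").strip()
        domain = (row.get("Interpro ID") or "").strip()
        if not gene or not domain:
            continue  # skip genes without a name or without a domain annotation
        genes_by_domain[domain].add(gene)
        descriptions[domain] = (row.get("Interpro Short Description") or "").strip()
    lines = []
    for domain in sorted(genes_by_domain):
        members = sorted(genes_by_domain[domain])
        lines.append("\t".join([domain, descriptions[domain]] + members))
    return "\n".join(lines)


# Toy input with made-up gene and domain identifiers.
toy_tsv = "\n".join([
    "Gene name\tInterpro ID\tInterpro Short Description",
    "GENE_B\tIPR_TOY1\tToy domain one",
    "GENE_A\tIPR_TOY1\tToy domain one",
    "GENE_A\tIPR_TOY2\tToy domain two",
    "GENE_C\t\t",
])
print(biomart_to_gmt(toy_tsv))
```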

DEE2 project update June 2019

It has been wonderful to be getting great feedback on DEE2, ways it can be improved and directions for future development of open omics data. Here I'll summarise some of these points.

Recent developments

iDEP integration. The iDEP service has provided DEE2 counts on its R Shiny powered page, which makes it very easy to download and analyse DEE2 data using the iDEP in-browser bioinformatics platform.

Degust integration. When using DEE2 in the browser you can now choose to send count matrices to Degust for differential analysis. You can send either the STAR gene counts, Kallisto Tx counts or Kallisto Tx counts aggregated to gene counts (recommended). Many thanks to David Powell's team at Monash, especially Andrew Perry, for working with me on adding this feature.

The R package getDEE2 now works for Windows systems. This required a modification to the download.file() options to account for default behaviour on t

Extract TSS regions from a GTF file with GTFtools

Since stumbling upon GTFtools recently, I found that it has another interesting use: generating coordinate sets around transcriptional start sites (TSSs). This is really important for ChIP-seq analysis when we want to compare, for example, the strength of histone modification enrichment at TSSs with RNA expression. Using GTFtools, it is a one-line command to extract these positions:

-t Homo_sapiens.GRCh38.94.gtf.tss.bed -w 1000  Homo_sapiens.GRCh38.94.gtf

Here "-t" is the output file flag, "-w" is the desired TSS distance to cover (in this case +/- 1000 bp), and the last argument is the input GTF file, which needs to be from Ensembl or Gencode (other ones don't work due to differences in formatting).

If I had to do this without GTFtools, it would end up being more complicated, as TSS positions (exon 1 starts) would need to be extracted from the GTF file separately for the top and bottom strands and then merged.
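To show what the manual, strand-aware approach would involve, here is a minimal Python sketch. It assumes the GTF contains "transcript" feature rows (as Ensembl GTF files do), takes the start coordinate as the TSS on the + strand and the end coordinate on the - strand, and emits BED-style 0-based half-open windows. The function name and toy records are hypothetical, and the output is not guaranteed to match what GTFtools produces.

```python
import re


def tss_windows(gtf_lines, w=1000):
    """Return BED-style tuples covering +/- w bp around each transcript TSS."""
    bed = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])  # GTF is 1-based, inclusive
        tss = start if strand == "+" else end
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        name = match.group(1) if match else "."
        # Convert the 1-based TSS to a 0-based half-open window, clipped at 0.
        bed.append((chrom, max(0, tss - 1 - w), tss + w, name, ".", strand))
    return bed


# Toy GTF records with made-up identifiers.
toy_gtf = [
    "#!comment line",
    '1\ttoy\ttranscript\t5000\t9000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\ttoy\texon\t5000\t5200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\ttoy\ttranscript\t20000\t25000\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
for record in tss_windows(toy_gtf):
    print("\t".join(str(x) for x in record))
```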

Enabling bioinformatics training in a Windows based computer lab with Docker+Dugong

While Linux remains the OS for developers, data scientists and bioinformaticians, uni classrooms remain stubbornly dependent on Windows-based applications. Yes, on individual PCs you can install Ubuntu command line apps, but ask any IT dept about doing this for an entire classroom and you will undoubtedly receive an emphatic "NO". So how does one do bioinformatics training when students cannot even access the simplest foundation, the OS? Good question, and none of the potential answers are optimal, to be honest. But it's what we need to deal with until unis realize that open source software is actually good enough to run entire enterprises.

My first thought was to get students to use PuTTY to log in to a bioinformatics server with SSH. This would be OK, but it would be a bit of a headache to manage all the accounts on the server. Also, my feeling is that many of the skills learned by the students would be forgotten again as soon as access to the server is revoked. There

DEE2 gets published

The project has been a labor of love since 2013/2014, has undergone a major overhaul and has finally been published online in GigaScience. The great thing about this journal is that not only are the articles open access, but so are the reviewers' comments. We had great suggestions and they improved the resource tremendously. It's great that it has finally been published, but publication is not the end goal of the project. The goal is to democratize omics data to the point where analysis can be done by biologists without any coding experience, undergrad students, high school students, practically anyone with a smartphone and an internet connection. So instead of being the end of the project, this is really the end of the beginning. Not only will we be keeping up with new SRA submissions over the next year or so, we will be incorporating new features, new species and perhaps some new data types. If you have suggestions, feedback or comments I would be very grateful!

Using GTF tools to get gene lengths

Sometimes you need to normalise gene expression by gene length (e.g. FPKM). To do that you need to calculate gene length. But which length to use? One could simply total all exon lengths, but if the most abundant isoform is the shortest, this will be terribly inaccurate. Clearly, to do this accurately, analysis at the level of transcripts would be the best approach, as the length of each transcript is unambiguous and the effective gene length can be estimated based on the abundance of each isoform. But if we really want to calculate gene length from a GTF file alone without any isoform quantification, then GTFtools can do it.

For this demo, I'm using the Ensembl GTF file for human: Homo_sapiens.GRCh38.90.gtf

GTFtools calculates the gene length a few different ways: (i) mean, (ii) median, (iii) longest single isoform, and (iv) all exons merged. The command I used looks like this:

-l Homo_sapiens.GRCh38.90.gtf.genelength  Homo_sapiens.GRCh38.90.gtf

A
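To illustrate how the four length definitions differ, here is a small Python sketch that computes them for a single toy gene. The function names, the input structure (transcript ID mapped to a list of 1-based inclusive exon coordinates) and the coordinates themselves are made up for illustration; this is not how GTFtools is implemented.

```python
from statistics import mean, median


def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals after merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def gene_lengths(transcripts):
    """Return mean, median, longest-isoform and merged-exon lengths for one gene."""
    isoform_lengths = [
        sum(end - start + 1 for start, end in exons)
        for exons in transcripts.values()
    ]
    all_exons = [exon for exons in transcripts.values() for exon in exons]
    return {
        "mean": mean(isoform_lengths),
        "median": median(isoform_lengths),
        "longest": max(isoform_lengths),
        "merged": merged_length(all_exons),
    }


# Toy gene with two hypothetical isoforms.
toy_gene = {
    "tx1": [(100, 199), (300, 399)],  # 200 bp
    "tx2": [(150, 249), (300, 349)],  # 150 bp
}
print(gene_lengths(toy_gene))
```

In this toy case the merged-exon length (250 bp) is longer than either isoform, which is why the choice of definition matters when normalising by length.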