Posts

SRA toolkit tips and workarounds

Image
The Short Read Archive (SRA)  is the main repository for next generation sequencing (NGS) raw data. Considering the sheer rate at which NGS is generated (and accelerating), the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA: Growth of SRA data (http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png) SRA however doesn't provide directly the fastq files that we commonly work with, they prefer the .sra archive that require specialised software ( sra-toolkit ) to extract. Sra-toolkit has been described as buggy and painful; and I've had my frustrations with it. In this post, I'll share some of my best tips sra-toolkit tips that I've found. Get the right version of the software and configure it When downloading, make sure you download the newest version from the NCBI website ( link ). Don't download it from GitHub or from Ubuntu software centre (or apt-get), as it will probably be an older...

Generate an RNA-seq count matrix with featureCounts

Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov. FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site . Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files. Here is a script to generate a gene-wise matrix from all BAM files in a directory. #!/bin/bash #Generate RNA-seq matrix #Set parameters GTF=/path/to/Mus_musculus.GRCm38.78.gtf EXPTNAME=mouse_rna CPUS=8 MAPQ=10 GENEMX=${EXPTNAME }_genes.mx #Make the gene-wise matrix featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \ | cut -f1,7- | sed 1d > $GENEMX The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc). Consider attaching the gene name to give the data more relevance. To do that, first ...

Comparing expression profiles

Image
One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies: Compare all data points - using a correlation analysis Compare sets of up and down-regulated genes - using a binomial or Fisher exact test Compare sets of genes within a profile - such as GSEA test In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes. Merging data sets No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray. The human data is currently in CSV format from Degust  and looks like this: gene,c,aza,F...

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis ( example here ). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value. #!/bin/bash #rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK) #Specify the input file XLS=$1 #Specify the gene ID column ID=$2 #Specify the fold change value column FC=$3 #Specify the raw p-value column P=$4 sed 1d $XLS | tr -d '"' \ | awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \ | awk '$2!="NA" && $3!="NA"' \ | awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \ | awk ...

2014 Wrap-Up

Image
The year has gone so fast! Lets go through some of the major points of 2014 and predict what 2015 has in store. Sequencing hardware It began with announcements from Illumina of the HiSeqX10 as well as the release of the NextSeq500. Later, X-10 technology was included in V4 HiSeq2500 kits resulting in a substantial increase in sequence output from that instrument. There was also the announcement of the NeoPrep automated library preparation system that will be officially released in the 1st half of 2015. NeoPrep looks like an attempt to use microfluidics to perform generate libraries; if this is successful, it would be a major advance in reducing bottlenecks in sample preparation. Microfluidics are also able to reduce the volumes of reagents required, making the process cheaper. In addition, labour costs per library will be drastically reduced given that automation will be commonplace for routine protocols. Regarding 3rd gen platforms, we saw some mixed reviews from Oxford Nanopore ...

Interspecies gene name conversion

Image
In this post, I'll provide a step-by-step guide to perform interspecies gene name conversion of gene expression data. This is a necessary step in the comparison of profiling data from two different experiments with different species (human and mouse), and allows us to use extensive human-centric gene set libraries in MSigDB when analysing non-human mammalian profiling data (such as mouse). I performed GEO2R analysis of mouse expression data ( GSE30192 ) to analyse the effect of azacitidine on mouse C2C12 myoblasts. The data looks like this: "ID" "adj.P.Val" "P.Value" "t" "B" "logFC" "Gene.symbol" "Gene.title" "1420647_a_at" "0.000346" "2.24e-08" "56.073665" "8.699524" "6.9755573" "Krt8" "keratin 8" "1423327_at" "0.000346" "2.32e-08" "55.685912" "8.686447" ...

User friendly RNA-seq differential expression analysis with Degust

Image
There is a need to make bioinformatics tools more user friendly and accessible to a wider audience. We have seen that Galaxy , GEO2R ,  Genevestigator and GenePattern  have each developed a huge following in the molecular biology community, and this trend will continue with introduction of new RNA-seq analysis tools. Previously, I posted about differential gene expression analysis of RNA-seq performed by the DEB  online tool. In this post, I introduce Degust , an online app to analyse gene expression count data and determine which genes are differentially expressed. Degust was written by David R. Powell ( @d_r_powell ) and was Supported by Victorian Bioinformatics Consortium, Monash University and VLSCI's Life Sciences Computation Centre . In this test, I'll be using the azacitidine mRNA-seq data set that I have previously analysed.  To make the count matrix, I used featureCounts. First step in the process is to your RNA-seq count data. It can be done in tab ...