Posts

Screening RNA-seq data for novel transcripts with samtools mpileup and UCSC genome browser

Image
The unbiased nature of deep transcript sequencing makes it the ideal technology to discover novel uncharacterised genes. Lets screen our favourite RNA-seq experiment (azacitidine-treated AML3 human cells,  GSE55123 ) for novel expressed genes. I use Ensembl gene annotations. We'll start with preparing bed files of exons and gene bodies grep -w exon Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \ | cut -f1,4,5,11 | uniq > Homo_sapiens.GRCh38.76_exons.bed grep -w gene Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \ | cut -f1,4,5,11 > Homo_sapiens.GRCh38.76_genes.bed Now for convenience, I'll merge the data from the three replicates with samtools. samtools view -H SRR1171523_1.fastq.sort.bam > header.txt samtools merge -h header.txt Ctrl.bam SRR1171523_1.bam SRR1171523_2.bam SRR1171524_1.bam SRR1171524_2.bam SRR1171525_1.bam SRR1171525_2.bam samtools merge -h header.txt Aza.bam SRR1171526_1.bam SRR1171526_2.bam SRR1171527...

ENCODE Users Meeting 2015

Just announced from the ENCODE mailing list is the workshop to analyse and make the most of ENCODE data. Attend if you can! FW: [encode-announce] [Meeting announcement] ENCODE User's Meeting Dear colleagues, We would like to announce the first ENCODE User’s Meeting, which will be a 3-day workshop to learn how to navigate, analyze, use, and integrate ENCODE data. The NHGRI-sponsored meeting will be held from June 29 - July 1, 2015 at the Bolger Center in Potomac, MD. The ENCODE project has generated 3500 datasets in human, more than 1000 datasets each in model organisms including mouse, fly and worm. It provides an extremely valuable resource of potential functional annotations of the human genome and represents an unprecedented opportunity for applications in a variety of biomedical problems. This meeting will have both scientific presentations and hands-on tutorial sessions with the goal of learning how ENCODE data has been used by the scientific community and providing o...

SRA toolkit tips and workarounds

Image
The Short Read Archive (SRA)  is the main repository for next generation sequencing (NGS) raw data. Considering the sheer rate at which NGS is generated (and accelerating), the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA: Growth of SRA data (http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png) SRA however doesn't provide directly the fastq files that we commonly work with, they prefer the .sra archive that require specialised software ( sra-toolkit ) to extract. Sra-toolkit has been described as buggy and painful; and I've had my frustrations with it. In this post, I'll share some of my best tips sra-toolkit tips that I've found. Get the right version of the software and configure it When downloading, make sure you download the newest version from the NCBI website ( link ). Don't download it from GitHub or from Ubuntu software centre (or apt-get), as it will probably be an older...

Generate an RNA-seq count matrix with featureCounts

Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov. FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site . Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files. Here is a script to generate a gene-wise matrix from all BAM files in a directory. #!/bin/bash #Generate RNA-seq matrix #Set parameters GTF=/path/to/Mus_musculus.GRCm38.78.gtf EXPTNAME=mouse_rna CPUS=8 MAPQ=10 GENEMX=${EXPTNAME }_genes.mx #Make the gene-wise matrix featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \ | cut -f1,7- | sed 1d > $GENEMX The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc). Consider attaching the gene name to give the data more relevance. To do that, first ...

Comparing expression profiles

Image
One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies: Compare all data points - using a correlation analysis Compare sets of up and down-regulated genes - using a binomial or Fisher exact test Compare sets of genes within a profile - such as GSEA test In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes. Merging data sets No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray. The human data is currently in CSV format from Degust  and looks like this: gene,c,aza,F...

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis ( example here ). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value. #!/bin/bash #rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK) #Specify the input file XLS=$1 #Specify the gene ID column ID=$2 #Specify the fold change value column FC=$3 #Specify the raw p-value column P=$4 sed 1d $XLS | tr -d '"' \ | awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \ | awk '$2!="NA" && $3!="NA"' \ | awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \ | awk ...

2014 Wrap-Up

Image
The year has gone so fast! Lets go through some of the major points of 2014 and predict what 2015 has in store. Sequencing hardware It began with announcements from Illumina of the HiSeqX10 as well as the release of the NextSeq500. Later, X-10 technology was included in V4 HiSeq2500 kits resulting in a substantial increase in sequence output from that instrument. There was also the announcement of the NeoPrep automated library preparation system that will be officially released in the 1st half of 2015. NeoPrep looks like an attempt to use microfluidics to perform generate libraries; if this is successful, it would be a major advance in reducing bottlenecks in sample preparation. Microfluidics are also able to reduce the volumes of reagents required, making the process cheaper. In addition, labour costs per library will be drastically reduced given that automation will be commonplace for routine protocols. Regarding 3rd gen platforms, we saw some mixed reviews from Oxford Nanopore ...