Showing posts from December, 2012

Genome biology highlights for 2012

We are coming to the end of a big year in the field of genome science, one which was dominated by the swarm of papers from the ENCODE consortium. (The number of ENCODE articles, mostly in Nature , has reached 92! and has no sign of slowing down.) If you haven't had time an read them all, do so here . Apart from ENCODE, there has been a quite astonishing pace in genomic studies. Here are just a few highlights from me: The 1000 genomes project published it's main findings in Nature ( link ). Non-invasive prenatal genome sequencing was described in PNAS ( link ), triggering a debate about the ethics of genomic studies of the unborn. Single cell sequencing techniques improve with better methods of amplification, even allowing sequencing of 99 individual sperm ( link ) giving new insights into patterns of recombination, and opening new avenues for IVF testing. Metagenomics analysis takes off. Whether it's environmental samples from the sea, hot springs or soil ( JGI w

Paper of the week - Cooperative epigenetic effect of TETs and OGT

There have been a number of high profile profile articles in recent times discussing the function of TET proteins, mostly in the conversion of methylated cytosine (5mC) into hydroxymethylated cytosine (5hmC), the 5th base. Hydroxymethylcytosine is much rarer than methylcytosine and is thought to be an intermediate towards demethylation of cytosine, a mechanism which remains incompletely resolved. A paper last year showed that TET proteins also convert 5hmc to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), termed the 6th and 7th bases. OGT on the other hand is a fairly unique protein because it is the only known known O-GlcNAc transferase in mammals. What is GlcNAc you say? It stands for N-acetylglucosamine, a hexosamine. There has been a series of papers ( here , here , here ) discussing OGT as a nutrient sensor, transferring GlcNAc during period of surplus nutrient supply. GlcNAc can be transferred to the same amino acids as phosphorylation, so there is a suggested crosstalk

List of NGS aligners

The folks at EBI have produced a summary of all the high throughput sequence aligners commonly used ( link ), and there is a companion paper in the current issue of Bioinformatics. The timeline is a fascinating info-graphic. There is also a table describing the features of each. One thing I noticed is that there are relatively few aligners for RNA data, despite RNA-seq being arguably the most important application in NGS right now. (I will cover the use of RNA specific aligners in future posts.) The paper discusses some of the generalities of selecting an appropriate alignment tool for the job but doesn't go so far as to suggest which one is "best".

Illumina iGenomes

In an effort to make NGS data analysis more accessible, Illumina have provided packages of pre-indexed genomes and annotation files in one bundle. For instance you can get the Bowtie/TopHat indexes here for a range of organisms. The iGenomes are a collection of sequence and annotation files for commonly analyzed genomes. Each iGenome contains data for one species, downloaded from one source (UCSC, NCBI, or Ensembl), for one genomic build . Read here for more info.

Align RNA-seq reads to genome with TopHat

RNA-seq data analysis poses some interesting bioinformatic challenges. One of the most important considerations is the type of reference sequence to be used. If there is a whole genome sequence (WGS), then the WGS is the most appropriate reference, even if it is not annotated entirely. If there is no WGS, then you may need to use a curated cDNA library such as RefSeq , Unigene , or organism specific collection. There are a few reasons why aligning to the genome is the most correct method MapQ scores mean something - cDNA collections like RefSeq often have splice variants entered as different entries. This means that some variants will have common exons. Reads aligning to those exons will be assigned a mapQ of zero! Discovery - many of the cDNA sequence collections are incomplete, including RefSeq. Compare the number of entries from GENCODE to RefSeq  and you will see that GENCODE is far richer in terms of number of known transcripts. Moreover, if you are aligning to a genome,

UNIX utility: split

Even the most simple types of analysis (ie: count lines) can take a really long time with the colossal data sets  coming from sequencing experiments these days. Parallelization of jobs over multiple processors/cores/threads can make things faster, but the standard unix based tools we commonly use don't generally have the in-built capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs. One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines, another option is to use the split command which will divide the file by the specified number of lines. Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:

Venn diagrams for genes

One of the easiest ways to compare datasets is to identify common elements between them and show the overlaps in a venn diagram. So when you're working with many data points such as thousands of genes, what is the best/fastest way to do it? Venny - Intersect up to 4 lists and count the overlaps. See and download the overlaps between gene lists.  Easy to use. Basic graphical output with overlaps not to scale and basic colours. Gene Venn - Intersect 2 or 3 lists. Overlap output in one text file. Overlaps not to scale, but graphics are somewhat more attractive than Venny. BioVenn - Intersect 2 or 3 lists. High quality graphics in a range of formats (SVG/PNG), with extensive customisation which makes it perfect for publications. Overlaps are to scale. Optionally can map gene IDs to actual RefSeq/Ensembl/Affy accession numbers. Venn SuperSelector  - Intersect as many lists as you like. The output will be a matrix and the number of entries in the overlap. The first matr

Fastq to tabular - filter reads on length

Fastq sequence format is a little bit strange as it has a structure of 4 lines per data point (sequence). This format is not ideal for analysis for one reason - it is quite difficult to manipulate with unix tools because they are optimized for tab delimited data with one entry per line. One way to overcome this is to "flatten" the file so that each sequence read occupies just one line. The read name, sequence, quality information are separated by a tab on one line. To do this there is a trick you can try with unix  paste  command. Try it with this command: paste - - - - < sequence.fq | head If we then wanted to do something like filter reads based on sequence length, we can add a perl (or awk for that matter) one liner.  paste - - - -  < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'  Which will output the length of the sequence string in the first field. If we wanted to filter reads based on the length

Paper of the week - Role of Selfish DNA in Evolution

Its no secret that repeat DNA makes up the majority of the genome in a majority of "higher" eukaryotes. For a long time this repeat DNA has been considered "selfish" or "parasitic" DNA, its only feature was its prolific self-propagation as it hitchhiked a ride throughout evolutionary history at the energetic expense of the host. Nina Fedoroff argues in a recent review in Science that these transposable elements are not "junk" at all, and in fact they play "a profoundly generative role in genome evolution" where the transposons provide novel mechanisms to generate genetic diversity. Nina explores the relationship between genome size and epigenetic complexity in the comparison of eukaryotes and prokaryotes, postulating that the innovation of epigenetic elaborate control mechanisms in bacteria paved the way for increases in genome size and complexity we see in eukaryotes. So rather than epigenetic mechanisms evolving to keep "parasi

Library preparation for RNA-seq

It is often said that RNA-seq is overtaking microarray as the gold-standard for transcriptome wide gene expression analysis, but what most people don't understand is, that this "gold standard" is far from standard. There are an ever increasing number of methods available to apply high throughput sequencing to gene expression analysis. Thus "RNA-Seq" is actually very generic term describing a range of techniques which aim to use sequencing to profile transcripts. Let's go through some of the more common applications  and work our way to the more niche applications. mRNA-Seq This method aims to profile the pool of transcripts which encode for proteins. The idea is to enrich the RNA mixture for mRNAs by depleting the concentration of rRNAs and other abundant ncRNAs. Usually, this is achieved by polyA enrichment using hybridization to magnetic beads decorated with oligo dT strands. After the enrichment the resulting RNA is fragmented by heat exposure in the