Showing posts from 2012

Genome biology highlights for 2012

We are coming to the end of a big year in the field of genome science, one which was dominated by the swarm of papers from the ENCODE consortium. (The number of ENCODE articles, mostly in Nature , has reached 92! and has no sign of slowing down.) If you haven't had time an read them all, do so here . Apart from ENCODE, there has been a quite astonishing pace in genomic studies. Here are just a few highlights from me: The 1000 genomes project published it's main findings in Nature ( link ). Non-invasive prenatal genome sequencing was described in PNAS ( link ), triggering a debate about the ethics of genomic studies of the unborn. Single cell sequencing techniques improve with better methods of amplification, even allowing sequencing of 99 individual sperm ( link ) giving new insights into patterns of recombination, and opening new avenues for IVF testing. Metagenomics analysis takes off. Whether it's environmental samples from the sea, hot springs or soil ( JGI w

Paper of the week - Cooperative epigenetic effect of TETs and OGT

There have been a number of high profile profile articles in recent times discussing the function of TET proteins, mostly in the conversion of methylated cytosine (5mC) into hydroxymethylated cytosine (5hmC), the 5th base. Hydroxymethylcytosine is much rarer than methylcytosine and is thought to be an intermediate towards demethylation of cytosine, a mechanism which remains incompletely resolved. A paper last year showed that TET proteins also convert 5hmc to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), termed the 6th and 7th bases. OGT on the other hand is a fairly unique protein because it is the only known known O-GlcNAc transferase in mammals. What is GlcNAc you say? It stands for N-acetylglucosamine, a hexosamine. There has been a series of papers ( here , here , here ) discussing OGT as a nutrient sensor, transferring GlcNAc during period of surplus nutrient supply. GlcNAc can be transferred to the same amino acids as phosphorylation, so there is a suggested crosstalk

List of NGS aligners

The folks at EBI have produced a summary of all the high throughput sequence aligners commonly used ( link ), and there is a companion paper in the current issue of Bioinformatics. The timeline is a fascinating info-graphic. There is also a table describing the features of each. One thing I noticed is that there are relatively few aligners for RNA data, despite RNA-seq being arguably the most important application in NGS right now. (I will cover the use of RNA specific aligners in future posts.) The paper discusses some of the generalities of selecting an appropriate alignment tool for the job but doesn't go so far as to suggest which one is "best".

Illumina iGenomes

In an effort to make NGS data analysis more accessible, Illumina have provided packages of pre-indexed genomes and annotation files in one bundle. For instance you can get the Bowtie/TopHat indexes here for a range of organisms. The iGenomes are a collection of sequence and annotation files for commonly analyzed genomes. Each iGenome contains data for one species, downloaded from one source (UCSC, NCBI, or Ensembl), for one genomic build . Read here for more info.

Align RNA-seq reads to genome with TopHat

RNA-seq data analysis poses some interesting bioinformatic challenges. One of the most important considerations is the type of reference sequence to be used. If there is a whole genome sequence (WGS), then the WGS is the most appropriate reference, even if it is not annotated entirely. If there is no WGS, then you may need to use a curated cDNA library such as RefSeq , Unigene , or organism specific collection. There are a few reasons why aligning to the genome is the most correct method MapQ scores mean something - cDNA collections like RefSeq often have splice variants entered as different entries. This means that some variants will have common exons. Reads aligning to those exons will be assigned a mapQ of zero! Discovery - many of the cDNA sequence collections are incomplete, including RefSeq. Compare the number of entries from GENCODE to RefSeq  and you will see that GENCODE is far richer in terms of number of known transcripts. Moreover, if you are aligning to a genome,

UNIX utility: split

Even the most simple types of analysis (ie: count lines) can take a really long time with the colossal data sets  coming from sequencing experiments these days. Parallelization of jobs over multiple processors/cores/threads can make things faster, but the standard unix based tools we commonly use don't generally have the in-built capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs. One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines, another option is to use the split command which will divide the file by the specified number of lines. Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:

Venn diagrams for genes

One of the easiest ways to compare datasets is to identify common elements between them and show the overlaps in a venn diagram. So when you're working with many data points such as thousands of genes, what is the best/fastest way to do it? Venny - Intersect up to 4 lists and count the overlaps. See and download the overlaps between gene lists.  Easy to use. Basic graphical output with overlaps not to scale and basic colours. Gene Venn - Intersect 2 or 3 lists. Overlap output in one text file. Overlaps not to scale, but graphics are somewhat more attractive than Venny. BioVenn - Intersect 2 or 3 lists. High quality graphics in a range of formats (SVG/PNG), with extensive customisation which makes it perfect for publications. Overlaps are to scale. Optionally can map gene IDs to actual RefSeq/Ensembl/Affy accession numbers. Venn SuperSelector  - Intersect as many lists as you like. The output will be a matrix and the number of entries in the overlap. The first matr

Fastq to tabular - filter reads on length

Fastq sequence format is a little bit strange as it has a structure of 4 lines per data point (sequence). This format is not ideal for analysis for one reason - it is quite difficult to manipulate with unix tools because they are optimized for tab delimited data with one entry per line. One way to overcome this is to "flatten" the file so that each sequence read occupies just one line. The read name, sequence, quality information are separated by a tab on one line. To do this there is a trick you can try with unix  paste  command. Try it with this command: paste - - - - < sequence.fq | head If we then wanted to do something like filter reads based on sequence length, we can add a perl (or awk for that matter) one liner.  paste - - - -  < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'  Which will output the length of the sequence string in the first field. If we wanted to filter reads based on the length

Paper of the week - Role of Selfish DNA in Evolution

Its no secret that repeat DNA makes up the majority of the genome in a majority of "higher" eukaryotes. For a long time this repeat DNA has been considered "selfish" or "parasitic" DNA, its only feature was its prolific self-propagation as it hitchhiked a ride throughout evolutionary history at the energetic expense of the host. Nina Fedoroff argues in a recent review in Science that these transposable elements are not "junk" at all, and in fact they play "a profoundly generative role in genome evolution" where the transposons provide novel mechanisms to generate genetic diversity. Nina explores the relationship between genome size and epigenetic complexity in the comparison of eukaryotes and prokaryotes, postulating that the innovation of epigenetic elaborate control mechanisms in bacteria paved the way for increases in genome size and complexity we see in eukaryotes. So rather than epigenetic mechanisms evolving to keep "parasi

Library preparation for RNA-seq

It is often said that RNA-seq is overtaking microarray as the gold-standard for transcriptome wide gene expression analysis, but what most people don't understand is, that this "gold standard" is far from standard. There are an ever increasing number of methods available to apply high throughput sequencing to gene expression analysis. Thus "RNA-Seq" is actually very generic term describing a range of techniques which aim to use sequencing to profile transcripts. Let's go through some of the more common applications  and work our way to the more niche applications. mRNA-Seq This method aims to profile the pool of transcripts which encode for proteins. The idea is to enrich the RNA mixture for mRNAs by depleting the concentration of rRNAs and other abundant ncRNAs. Usually, this is achieved by polyA enrichment using hybridization to magnetic beads decorated with oligo dT strands. After the enrichment the resulting RNA is fragmented by heat exposure in the

Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers. To get the average length samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage To get the median length samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M To get the mode length samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m To get the shortest length samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1 To get the longest samtools view FC095_1.bam | cut -f10 | awk '{ print

Paper of the week - Guthrie card methylomics

Nearly every baby born in Australia since 1970 has had a few drops of blood taken and stored on a so-called Guthrie card , and this practise is widely adopted in the developed world. As DNA analysis technologies become ever more sensitive and economical, these cards will become ever more important in diagnosis of genetic disease and also in identifying genetic and epigenetic variations which contribute to complex disease. The paper I showcase today from Beyan et al, describes the development of genome-wide assays for DNA methylation using methylation microarrays and methylcytosine immunoprecipitation followed by Illumina sequencing (MeDIP-Seq). Authors find differential methylation regions which are stable from birth to 3 years of age. The methodology is fairly novel, but the conclusions are a bit vague and it would have been best to apply Guthrie card analysis for a specific disease. It would be really neat if they analysed material from discordant twins for a complex disease i.e; j

Bowtie alignment to a genome - single end

Recently I posted about  BWA alignment of NGS data, so today I will demonstrate the use of Bowtie2  for aligning single end sequence data, for more information on advanced options, take a thorough look at the manpage . One of major strengths of Bowtie2, is that it's more tolerant for gaps than Bowtie1 and  BWA. Again you will need to have the reference genome and the raw fastq files in the otherwise empty directory. The first thing we do is stipulate the name of the experiment and the name of the reference sequence. EXID=Experimentname REF=Hg19.fa Now index the genome with bowtie2-build: bowtie2-build Hg19.fa Hg19.fa Now to actually start the alignment for all the files with prefix "FC" and suffix "fq": for FQ in `ls FC*fq | sed 's/.fq//'` do bowtie2 -x $REF -U ${FQ}.fq -S ${FQ}.sam -p 4 --met-file ${FQ}.met 2> ${FQ}.stats & done wait The script starts aligning all fq files to the human genome in parallel and each job uses 4 thre

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of collossal data sets, here's something which you'll find useful, a parallel zip archiver called PBZIP2 . OK so it's not going to improve the quality of your sequencing data, but it could save you a bit of time. Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads: pbzip2  -p8 FC001_sequence.fq To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads: pbzip2  -dk -p10 FC001_sequence.fq.bz2 To test the integrity of a compressed file: pbzip2  -t FC001_sequence.fq.bz2 How to compress an entire directory: tar cv directory | pbzip2 > directory.tar.bz2 Here is the help page with more examples: Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [] [Jul. 16, 2011]               (uses libbzip2 by Julian Seward) Major contributions: Yavor Nikolov <> Usage: pbzip2 [-1 .. -9] [-b#cdfhk

BWA alignment to a genome - single ends

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing , quality trimming and adapter clipping . I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here ), but currently the most popular choices are BWA and Bowtie2 . BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with  Samtools . The things you will need in the otherwise empty directory are: Reference genome in fasta format Sequence data in fastq format (.fq) First thing to do is give the experiment a name and index the genome with BWA. The index step could take about 3 hours for a human size genome. EXID=Experimentname REF=Hg19.fa bwa index -a bwtsw $REF Next, use BWA to align. All of

Paper of the week - Next gen sequencing of ancient DNA

This paper was featured in the 12 October issue of Science, so really isn't from this week, nevertheless I thought it would be relevant given the previous post on library preparation. Sequencing ancient DNA is a hugely challenging task. Not only is it very difficult to get any sort of yield of DNA from bones tens of thousands of years old, but the DNA itself is normally degraded to such an extent that conventional library preparation is highly inefficient. On top of this, there is the challenge to eliminate environmental contamination. To avoid much of this, the team, lead by Matthias Meyer at the Max Planck Institute for in Leipzig came up with a simple but efficient method to generate sequencing libraries from single stranded DNA. The basic steps in the library prep are: Dephosphorylate Heat denature Ligate single stranded biotinylated adaptors Immobilize on streptavidin beads Generate the second strand with DNA polymerase Ligate a second adaptor by blunt end ligation

Data set curation - quality trimming and adapter clipping

After demultiplexing (covered in the last post ), you'll need to perform a few other steps before aligning Illumina sequence data to the genome reference. Primarily, these are quality filtering and adapter clipping. These may not be very important for short read data, but are pretty important when working with long reads, where the quality starts to drop off and the read might contain some adapter sequence. Quality filtering can be done a few ways, by filtering out entire reads which have poor base quality scores, by converting poor quality base-calls to "N" or hard trimming reads to a certain length before the Q scores start to rapidly decline. I'd much rather use a quality based trimming tool which starts at the 3' end of the read and removes bases below a certain Q threshold. This can be done using fastq_quality_trimmer on the command line or in galaxy. You set the threshold you want to use, in this case Q30, as well as the minimum length of sequence to keep,

Demultiplexing Illumina Sequence data

Demultiplexing is a key step in many sequencing based applications, but it isn't always necessary, as the newer Illumina pipeline software provides demultiplexed data as a standard. But if you need to do this yourself, here is an example using fastx_toolkit designed for sequence data with a 6nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (insert sequence): FC123_1_1_sequence.txt And for the barcode/index read: FC123_1_2_sequence.txt So here goes: #Enter dataset parameters   FC='FC123 FC124' LANES='1 2 3 4 5 6 7 8' #Create the bcfile echo 'BC01_ATCACG     ATCACG BC02_CGATGT     CGATGT BC03_TTAGGC     TTAGGC BC04_TGACCA     TGACCA BC05_ACAGTG     ACAGTG BC06_GCCAAT     GCCAAT BC07_CAGATC     CAGATC BC08_ACTTGA     ACTTGA BC09_GATCAG     GATCAG BC10_TAGCTT     TAGCTT BC11_GGCTAC     GGCTAC BC12_CTTGTA     CTTGTA' > bcfile.txt for flowcell in $FC do for l

We're not alone: the genomics bioinformatics blogosphere

Go forth and explore! /

Quality control of sequence datasets

Before investing a lot of time in analysis of a sequencing run, it is ALWAYS a good idea to ensure that your sequence data quality is up to scratch. If you have a high proportion of low quality bases in your dataset, you're likely to have many spurious alignments. These can cause problems for  virtually all NGS applications from mRNA-seq through to SNP detection and de novo assembly. There are two main types of QC analysis for sequencing runs. The first type, which only describes the quality of the fastq file, and the second type, which describes the quality of the alignment (sam/bam file format). Lets begin with the simple fastq file analysis and we'll cover the alignment QC at a later stage. The fastq file format  has become the standard raw data format for high throughput sequence datasets. Each base is given a Phred quality score represented as an ASCII character. The higher the Qscore, the more confidence you can have in the identity of the base. But there are other t

Paper of the week - explaining the stability of ncRNAs in the cell

3' Polyadenylation is a key mechanism whereby mRNAs are stabilised and made ready for protein translation. One of the mysteries of molecular biology of late is how long non coding RNA (lncRNA) stability is achieved given that these molecules don't have a polyadenylation signal. Wilusz et al published a paper in Genes and Development predicting that MALAT1 is protected from 3' to 5' exonuclease activity by an RNA triple helix structure. The researchers used molecular modeling to resolve that the 3' terminus is neatly tied into a triple helix and thus likely protected from degradation. This was confirmed by mutagenesis, showing that altering bases in these regions led to a reduction in transcript stability. MALAT1 is transcribed to form a ~6.7 kb lncRNA which is abundant in the nucleus, and also producing a small tRNA like transcript which is processed into a mature 61 nt hairpin localised to the cytosol. Both transcripts are dependant on RNase P , a ribozyme which