Showing posts from November, 2012

Evaluate length of sequence strings

If you work with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program that evaluates the average, median and mode of a list of numbers.
To get the average length:
samtools view sequence.bam | cut -f10 | awk '{ print length }' | numaverage
To get the median length:
samtools view sequence.bam | cut -f10 | awk '{ print length }' | numaverage -M
To get the mode length:
samtools view sequence.bam | cut -f10 | awk '{ print length }' | numaverage -m
To get the shortest length (note the numeric sort, otherwise 100 would sort before 36):
samtools view FC095_1.bam | cut -f10 | awk '{ print length }' | sort -n | head -1
To get the longest:
samtools view FC095_1.bam | cut -f10 | awk '{ print
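If numaverage is not installed, the same kind of summary can be sketched in a single awk pass. This is only an illustration: the function name length_summary is made up here and is not part of samtools or numutils, and it assumes one sequence string per input line, as produced by the cut -f10 step.

```shell
# Read one sequence per line on stdin and print the number of reads
# plus the minimum, maximum and mean sequence length.
length_summary() {
    awk '
        { n = length($0); sum += n }
        NR == 1 || n < min { min = n }
        NR == 1 || n > max { max = n }
        END {
            if (NR > 0)
                printf "n=%d min=%d max=%d mean=%.2f\n", NR, min, max, sum / NR
        }
    '
}
```

It would be used in place of the last step of the pipeline, for example: samtools view sequence.bam | cut -f10 | length_summary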

Paper of the week - Guthrie card methylomics

Nearly every baby born in Australia since 1970 has had a few drops of blood taken and stored on a so-called Guthrie card, and this practice is widely adopted in the developed world. As DNA analysis technologies become ever more sensitive and economical, these cards will become ever more important in the diagnosis of genetic disease and also in identifying genetic and epigenetic variations which contribute to complex disease. The paper I showcase today, from Beyan et al., describes the development of genome-wide assays for DNA methylation using methylation microarrays and methylcytosine immunoprecipitation followed by Illumina sequencing (MeDIP-Seq). The authors find differentially methylated regions which are stable from birth to 3 years of age. The methodology is fairly novel, but the conclusions are a bit vague and it would have been best to apply Guthrie card analysis to a specific disease. It would be really neat if they analysed material from discordant twins for a complex disease i.e; j

Bowtie alignment to a genome - single end

Recently I posted about BWA alignment of NGS data, so today I will demonstrate the use of Bowtie2 for aligning single end sequence data. For more information on advanced options, take a thorough look at the manpage. One of the major strengths of Bowtie2 is that it is more tolerant of gaps than Bowtie1 and BWA. Again you will need to have the reference genome and the raw fastq files in an otherwise empty directory. The first thing we do is stipulate the name of the experiment and the name of the reference sequence.
EXID=Experimentname
REF=Hg19.fa
Now index the genome with bowtie2-build:
bowtie2-build Hg19.fa Hg19.fa
Now to actually start the alignment for all the files with prefix "FC" and suffix "fq":
for FQ in `ls FC*fq | sed 's/\.fq$//'`
do
bowtie2 -x $REF -U ${FQ}.fq -S ${FQ}.sam -p 4 --met-file ${FQ}.met 2> ${FQ}.stats &
done
wait
The script starts aligning all fq files to the human genome in parallel and each job uses 4 thre
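Because the loop above launches every alignment at once, it can help to look at the exact commands before running them. The following is a rough sketch of a dry run, not part of Bowtie2: the function name print_bowtie2_jobs is hypothetical, it assumes file names without spaces, and it only prints the commands using the same flags as the loop in this post.

```shell
# Print one bowtie2 command per FC*.fq file in the current directory.
# $1 is the index base name (Hg19.fa in this post). Nothing is executed.
print_bowtie2_jobs() {
    ref=$1
    for fq in FC*.fq; do
        [ -e "$fq" ] || continue    # the glob matched nothing
        base=${fq%.fq}              # strip the .fq suffix without calling sed
        printf 'bowtie2 -x %s -U %s -S %s.sam -p 4 --met-file %s.met 2> %s.stats\n' \
            "$ref" "$fq" "$base" "$base" "$base"
    done
}
```

Running print_bowtie2_jobs Hg19.fa lists the jobs for inspection, and piping that output to sh would run them one after another rather than all at once, which may be gentler on a machine with few cores.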

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of colossal data sets, here's something which you'll find useful: a parallel implementation of the bzip2 compressor called PBZIP2. OK so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.
Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2 -p8 FC001_sequence.fq
To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:
pbzip2 -dk -p10 FC001_sequence.fq.bz2
To test the integrity of a compressed file:
pbzip2 -t FC001_sequence.fq.bz2
How to compress an entire directory:
tar cv directory | pbzip2 > directory.tar.bz2
Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [Jul. 16, 2011] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov
Usage: pbzip2 [-1 .. -9] [-b#cdfhk

BWA alignment to a genome - single ends

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing, quality trimming and adapter clipping. I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here), but currently the most popular choices are BWA and Bowtie2. BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with Samtools. The things you will need in the otherwise empty directory are a reference genome in fasta format and sequence data in fastq format (.fq). First thing to do is give the experiment a name and index the genome with BWA. The index step could take about 3 hours for a human size genome.
EXID=Experimentname
REF=Hg19.fa
bwa index -a bwtsw $REF
Next, use BWA to align. All of

Paper of the week - Next gen sequencing of ancient DNA

This paper was featured in the 12 October issue of Science, so it really isn't from this week. Nevertheless, I thought it would be relevant given the previous post on library preparation. Sequencing ancient DNA is a hugely challenging task. Not only is it very difficult to get any sort of yield of DNA from bones tens of thousands of years old, but the DNA itself is normally degraded to such an extent that conventional library preparation is highly inefficient. On top of this, there is the challenge of eliminating environmental contamination. To avoid much of this, the team, led by Matthias Meyer at the Max Planck Institute for Evolutionary Anthropology in Leipzig, came up with a simple but efficient method to generate sequencing libraries from single stranded DNA. The basic steps in the library prep are: dephosphorylate; heat denature; ligate single stranded biotinylated adaptors; immobilize on streptavidin beads; generate the second strand with DNA polymerase; ligate a second adaptor by blunt end ligation

Data set curation - quality trimming and adapter clipping

After demultiplexing (covered in the last post), you'll need to perform a few other steps before aligning Illumina sequence data to the genome reference. Primarily, these are quality filtering and adapter clipping. These may not be very important for short read data, but are pretty important when working with long reads, where the quality starts to drop off and the read might contain some adapter sequence. Quality filtering can be done a few ways: by filtering out entire reads which have poor base quality scores, by converting poor quality base-calls to "N", or by hard trimming reads to a certain length before the Q scores start to rapidly decline. I'd much rather use a quality based trimming tool which starts at the 3' end of the read and removes bases below a certain Q threshold. This can be done using fastq_quality_trimmer on the command line or in Galaxy. You set the threshold you want to use, in this case Q30, as well as the minimum length of sequence to keep,
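To make the idea of 3' quality trimming concrete, here is a small awk sketch of the general technique. It is not the code of fastq_quality_trimmer, and the function name trim_3prime is invented for illustration. It assumes 4-line fastq records and takes the ASCII offset of the quality encoding as an argument (33 is assumed by default; some older Illumina pipelines used 64).

```shell
# Trim low quality bases from the three-prime end of every read.
# Arguments: Q threshold, minimum length to keep, ASCII quality offset.
trim_3prime() {
    awk -v q="${1:-30}" -v minlen="${2:-20}" -v offset="${3:-33}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            keep = length($0)
            # walk back from the end while the base is below the threshold
            while (keep > 0 && ord[substr($0, keep, 1)] - offset < q)
                keep--
            # reads shorter than minlen after trimming are discarded
            if (keep >= minlen)
                printf "%s\n%s\n%s\n%s\n", header, substr(seq, 1, keep), plus, substr($0, 1, keep)
        }
    '
}
```

For example, trim_3prime 30 20 33 < reads.fq > trimmed.fq would trim at Q30 and keep reads of at least 20 bases (reads.fq and trimmed.fq are placeholder file names).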

Demultiplexing Illumina Sequence data

Demultiplexing is a key step in many sequencing based applications, but it isn't always necessary, as the newer Illumina pipeline software provides demultiplexed data as standard. But if you need to do this yourself, here is an example using fastx_toolkit designed for sequence data with a 6nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (insert sequence):
FC123_1_1_sequence.txt
And for the barcode/index read:
FC123_1_2_sequence.txt
So here goes:
#Enter dataset parameters
FC='FC123 FC124'
LANES='1 2 3 4 5 6 7 8'
#Create the bcfile
echo 'BC01_ATCACG     ATCACG
BC02_CGATGT     CGATGT
BC03_TTAGGC     TTAGGC
BC04_TGACCA     TGACCA
BC05_ACAGTG     ACAGTG
BC06_GCCAAT     GCCAAT
BC07_CAGATC     CAGATC
BC08_ACTTGA     ACTTGA
BC09_GATCAG     GATCAG
BC10_TAGCTT     TAGCTT
BC11_GGCTAC     GGCTAC
BC12_CTTGTA     CTTGTA' > bcfile.txt
for flowcell in $FC
do
for l
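Typing the barcode file by hand is error prone, so the bcfile step can also be sketched as a small function that numbers the barcodes for you. The function name make_bcfile is made up for this example, and it assumes the name and barcode should be separated by a tab.

```shell
# Print one "BCnn_BARCODE<TAB>BARCODE" line per barcode given as an argument.
make_bcfile() {
    i=0
    for bc in "$@"; do
        i=$((i + 1))
        printf 'BC%02d_%s\t%s\n' "$i" "$bc" "$bc"
    done
}
```

Calling make_bcfile ATCACG CGATGT TTAGGC > bcfile.txt would write the first three entries of the list above.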

We're not alone: the genomics bioinformatics blogosphere

Go forth and explore!

Quality control of sequence datasets

Before investing a lot of time in analysis of a sequencing run, it is ALWAYS a good idea to ensure that your sequence data quality is up to scratch. If you have a high proportion of low quality bases in your dataset, you're likely to have many spurious alignments. These can cause problems for virtually all NGS applications from mRNA-seq through to SNP detection and de novo assembly. There are two main types of QC analysis for sequencing runs. The first type describes only the quality of the fastq file; the second type describes the quality of the alignment (sam/bam file format). Let's begin with the simple fastq file analysis and we'll cover the alignment QC at a later stage. The fastq file format has become the standard raw data format for high throughput sequence datasets. Each base is given a Phred quality score represented as an ASCII character. The higher the Qscore, the more confidence you can have in the identity of the base. But there are other t
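To see how a quality string maps to Phred scores, here is a short awk sketch. The function name phred_scores is hypothetical, and the ASCII offset is an assumption you have to supply: 33 is the default here, while some older Illumina pipelines used 64.

```shell
# Read quality strings on stdin (one per line) and print the Phred score
# of each character, separated by spaces. $1 is the ASCII offset.
phred_scores() {
    awk -v offset="${1:-33}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            line = ""
            for (j = 1; j <= length($0); j++)
                line = line (j > 1 ? " " : "") (ord[substr($0, j, 1)] - offset)
            print line
        }
    '
}
```

For instance, with an offset of 33 the character I decodes to Q40 and # decodes to Q2.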