Showing posts from March, 2015

Quantifying repeat sequences from short read next gen sequencing data

Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any sequence read that is shorter than the repeat unit is going to fail at being uniquely placed in the genome. At a given read length, genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and ignore reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? Why is that even relevant? The "Why" About 50% of the human genome is comprised of repetitive DNA ( Ref ). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers ( Ref ). Endogenous retroviruses have been domesticated to boost antiviral response in B cells after exposure t

Paired-end fastq quality control with Skewer

Quality trimming and adapter clipping paired end reads is a tricky business. For paired-end alignments to work, the order of the sequences in both files (forward and reverse) needs to be retained. While there are plenty of read trimmers available open-source (think FASTX-toolkit, SeqTK, CutAdapt, etc), I haven't found many that: Retain pairing info Run parallel and are fast Are easy to set up and run Have good docs Then I fell in love with Skewer . It smashed through 14.5 million gzipped read pairs, doing adapter clipping and quality trimming in two minutes on my 8-core workstation. It auto-detects quality encoding so you can safely analyse any Illumina data. Awesome! $ skewer -t 8 -q 20 SRR634969.sra_1.fastq.gz SRR634969.sra_2.fastq.gz Parameters used: -- 3' end adapter sequence (-x): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -- maximum error ratio allowed (-r): 0.100 -- maximum indel er

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look. So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners. I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar . I then aligned these reads to the Arabidopsis genome with default settings and counted ( featureCounts ) the number of correctly and incorrectly assigned reads with a mapping quality of 20. Here is the result for unmutated (perfect) 100 bp reads. Table 1. Alignment of simulated 100 bp cDNA sequences to the Arabidopsis genome. HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR. Now here are the results of the mutation experiment. Figure 1. Accuracy of mapping mutated 100bp reads. Left hand side graphs show the correct mapping rates and right hand side shows incorrect mapping rates. Top panels

Count ChIP-seq reads across a promoter with featureCounts

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation especially for histone marks that are normally associated with gene activation such as H3K9 acetylation or H3K4 tri-methylation. FeatureCounts has emerged as a competitor to HTSeq and BedTools MultiCov for counting reads across features (ie, exons, genes, promoters). FeatureCounts is great for RNA-seq because it can natively read GTF annotation files, but can't read BED format (that we use a lot in ChIP-seq analysis). In order to make featureCounts work, we need to extract the TSS coordinates and convert to a BED-like format that it can read ( SAF format ). In the below script, I extract the positions of the TSSs using a grep search followed by stripping only the neccessary information from the GTF, then using awk and BedTools to expand the region  around the TSS by 3kbp. #!/bin/bash #Generate an SAF file for TSS intervals from a GTF #Specify some parameters GTF=Homo_sapien

Using Named Pipes and Process Substitution

Little known UNIX features to avoid writing temporary files in your data pipelines explained by Vince Buffalo in his digital notebook . Introducing named pipes and process substitution.

Are we ready to move beyond MSigDB and start a community-based gene set resource?

Gene sets are distilled information about molecular profiling experiments and can generated based on other features shared by groups of genes such as chromosomal position, sequence, co-regulation, functional information, etc. These are a valuable resource because they suggest similarities between different molecular profiling experiments or phenomona and lead researchers into understanding the factors that drive the trends in profiling experiments such as gene expression assays by microarray or RNA-seq. To truly grasp the importance of quality gene sets, consider that the original paper describing the GSEA algorithm has accumulated 3144 citations since 2003, while the paper describing the software and wider applicability of GSEA has 7166 citations. The latter paper has also attracted positive comments from experts in the field on PubMed, here is one that I couldn't agree with more. In the words of Rafael Irizarry, " The idea of analyzing differential expression for gro

Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences? Use samtools faidx to extract a single FASTA entry  first index, then you can extract almost instantaneously. $ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa real 0m37.422s $ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT real 0m0.003s Use bedtools getfasta extract portions of a FASTA entry Requires the regions of interest be in BED format. $ head Homo_sapiens.GRCh38_CpG_Islands.bed 1 10413 11207 1 28687 29634 1 51546 51882 1 121270 121549 The sequences of interest are extracted like this: $ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta Make a blast database to access bunches of sequences quickly Note: you will need to download and install the BLAST+ package fro

Sambamba vs Samtools

Samtools has been the one main tools for reading and writing aligned NGS data since the SAM alignment format was initially proposed. With the amounts of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor to Samtools called Sambamba , so I'll just do a few quick tests to see how it stacks up to Samtools on our 32-core server with 128 GB RAM (standard HDD). We have an 1.8 GB bam file to work with. $ du -sh * 1.8G  Sequence.bam Now convert to unsorted sam format. $ samtools view -H Sequence.bam > header.sam $ samtools view Sequence.bam | shuf \ | cat header.sam - > Sequence_shuf.sam The sam file is 9.9 GB. Lets try 1-thread SAM-to-BAM conversion and sorting with Samtools. $ time samtools view -Shb Sequence_shuf.sam \ | samtools sort - Sequence_samtools.test real