Showing posts from 2015

Introducing "Digital Expression Explorer"

RNA-seq has been a blessing for molecular biologists, not only does RNA-seq provide unbiased transcriptome wide expression analysis, it can be mined for a variety of other information like splicing, SNV identification, RNA editing, TSS usage, etc. As the cost of RNA-seq declines, more and more labs are using it hence more and more data is being deposited at databases such as SRA and GEO.

But there is a growing problem.

The problem is a lack of uniformity of processed data on GEO. Processed data has assorted reference genomes, gene annotation sets, accession numbers, software pipelines, statistical analyses and output formats, that in most cases makes comparison of two experiments hard if not impossible, let alone three or more experiments. This is a burden on researchers who want to quickly extract expression information from public RNA-seq data. Many researchers then resort to downloading the raw data in SRA format then processing it with QC/alignment/quantification/statistical pipel…

Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post.
Download from SRA and convert to fastq formatCheck my older post. __________________________________________________________________

Run skewer to quality trim the data.
$ cat #!/bin/bash for FQZ1 in *_1.fastq.gz ; do FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'` skewer -t 8 -q 20 $FQZ1 $FQZ2 done __________________________________________________________________

Run bfc to correct errors.
$ cat
for FQ in *.fastq.gz ; do
OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'`
bfc -t 8 $FQ | pigz > $OUT &
wait __________________________________________________________________
Run STAR aligner to align reads to human genome in single end mode.

$ cat
for FQZ in `ls *gz` ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dkf …

How accurate is Kallisto?

One of the most interesting developments in RNA-seq informatics in the past year or so is the evolution of so-called "lightweight" analysis. Instead of trying to map a whole read to the reference exons, it may be quicker and just as accurate to simply compare the k-mer content of the read and reference. If the read and reference share enough unique kmers, then the aligner can stop analysing that read and can move on the the next one. It turns out that the speed-up with this approach is huge, when compared with previous methods. Kallisto especially appears to have some speed advantages over competitors such as Sailfish, Salmon, eXpress, etc. For those interested in learning more, please refer to the Kallisto pre-print in arXiv, as well as the following blog posts.
For this post, I wanted to compare the accuracy of Kallisto with STAR/featureCounts (mapQ20 threshold), the pipeline that I'm currently using in mo…

Quantification and equimolar pooling of NGS libraries

Library preparation for next generation sequencing is becoming easier with the quality of kits and protocols improving substantially in the past few years. With the price of NGS decreasing, we are finding that our throughput is increasing, both in terms of the number of experiments as well as the average size of these experiments. With this in mind, the ability to accurately pool barcoded libraries in equimolar ratios (also called "balancing") is even more critical. Accurate quantification is thus vital. There are several ways to quantify library DNA: Qubit fluorometer. Gives very accurate concentrations in nanogram per microlitre, but agnostic of fragment size distribution.Bioanalyzer. Gives accurate readings of fragment size, but is only fairly accurate in terms of concentrations of each fragmentQuantitative PCR. Most accurate, but expensive and most time consuming.NanoDrop UV-Spec. Easiest method but least accurate. Not recommended. So I'll lead you through my favouri…

Screen for rRNA contamination in RNA-seq data

Ribosomal RNA (rRNA) is very abundant in cells (~80% of total RNA), so it is useful to deplete rRNA when doing genomewide assays to have sufficient coverage of other genes including protein coding and non-protein coding genes.

There are two major strategies for achieving rRNA removal, being (1) poly A enrichment and (2) rRNA depletion. Poly A enrichment uses an oligo dT coupled magnetic bead to "pull-out" RNA molecules with a polyA tag, a common feature of protein coding transcripts. rRNA depletion can be achieved using kits such as Ribo-Zero (Illumina), Ribo-Minus (LifeTech) and NEBNext® rRNA Depletion Kit (NEB). These kits contain oligonucleotide probes that either hybridize and immobilise the rRNA or hybridize and degrade the unwanted rRNA.

Once you have some sequence data, you will need to check whether the rRNA depletion has worked. This is somewhat different to a genome-wide analysis I've mentioned in earlier posts because rRNA genes exist in multiple copies and r…

Genome methylation analysis with Bismark

Bismark is currently the de facto standard for primary analysis of high throughput bisulfite sequencing data. Bismark can align the reads to the genome and perform methylation calling. In this post, I'll go through Illumina whole genome bisulfite sequence (WGBS) alignment and methylation calling using Bismark. First I want to mention that this post is just a summary, not meant to be a user manual or thorough troubleshooting guide. Fortunately, Bismark has some of the best documentation for any bioinformatics suite and is mandatory reading. The Bismark crew are very proactive with responding to user queries on various forums as well.

First step in getting Bismark to work is to index the genome, in this case with Bowtie2:

bismark_genome_preparation --bowtie2 /pathto/refgenome/

Conventionally, multiplexed libraries will be sequenced over a number of lanes. Resist concatenating or merging the smaller fastq files for each patient/sample until after the alignment, as the concatenated fil…

Weighing the benefits of RNA-seq error correction

Sequencing data contains two major types of errors, ones that are incorporated during library preparation and ones incorporated during sequence reading. While errors of the former are difficult to correct as they occur without clues from the base quality scores, the latter type do correlate with lower base quality scores and so it is possible to identify these ambiguous base calls and compare them to a library of high confidence k-mers from the same sequencing run. Error correction of this type has been show to improve de novo assemblies and SNP detection.

A recent report posted in bioRxiv shows error correction gives a drastic improvement in transcriptome assembly, with the author benchmarking the performance of five different correction software packages (Bless, Lighter, SGA, SEECER & BFC). The author reports that these software options vary in the aggressiveness of error correction, with SGA being the most aggressive, Lighter the least aggressive and BFC and SEECER performing w…

Mapping NGS data: which genome version to use?

Aligning reads to the genome is a key step in nearly all NGS data pipelines, the quality of an alignment will dictate the quality of the final results. So for beginners in this space, the options available can be a bit overwhelming.

Which options are available? Depending on what species you are working on, you will have either a limited number of choices or a vast number of choices. These include NCBI, Ensembl, UCSC as well as the consortia that generate these genome builds, such as the Human Genome Reference Consortium for human and TAIR for Arabidopsis. My recommendation at this point is Ensembl, for a number of reasons: It is clear to see what genome build and version just from the file names. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for EnsemblFrom the Ensembl file name you can tell whether its masked, and whether its "primary assembly" or "toplevel".The website is intuitive, ftp downloads are fast…

Run Linux from USB

Linux is definitely the favorite OS for bioinformatics, but if you ask most university or research institute IT departments they will likely be MS Windows-centric. Even 53% of visitors to this blog run Windows. Many IT departments that I've interacted with lock down their PCs so no software can be installed, leaving employees and students unable to run software to get their work done.

One option is to run virtualisation software such as VirtualBox or VMware to run Linux inside Windows, but that comes with reduced performance. Another, better option is to run Linux from a USB flash drive. Just as virtually all Linux distros can be booted off CD/DVD, they can also be booted off USB. The benefits are that you can run a "pure" Linux OS without modifying the existing host Windows OS. You'll also be able to take it and all the installed software wherever you go, and run it off any machine. Some Linux distros are specifically designed for running off USB (or SD) flash driv…