Posts

Showing posts from August, 2015

Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post . Download from SRA and convert to fastq format Check my older post. __________________________________________________________________ Run skewer to quality trim the data. $ cat run_skewer.sh #!/bin/bash for FQZ1 in *_1.fastq.gz ; do FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'` skewer -t 8 -q 20 $FQZ1 $FQZ2 done __________________________________________________________________ Run bfc to correct errors. $ cat run_bfc.sh #!/bin/bash for FQ in *.fastq.gz ; do OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'` bfc -t 8 $FQ | pigz > $OUT & done wait __________________________________________________________________ Run STAR aligner to align reads to human genome in single end mode. $ cat run_star.sh #!/bin/bash DIR=/refgenome_hsapiens/ GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf f

How accurate is Kallisto?

Image
One of the most interesting developments in RNA-seq informatics in the past year or so is the evolution of so-called "lightweight" analysis. Instead of trying to map a whole read to the reference exons, it may be quicker and just as accurate to simply compare the k-mer content of the read and reference. If the read and reference share enough unique kmers, then the aligner can stop analysing that read and can move on the the next one. It turns out that the speed-up with this approach is huge, when compared with previous methods. Kallisto especially appears to have some speed advantages over competitors such as Sailfish, Salmon, eXpress, etc. For those interested in learning more, please refer to the Kallisto pre-print in arXiv, as well as the following blog posts. nextgenseek.com robpatro.com liorpachter.wordpress.com sjcockell.me For this post, I wanted to compare the accuracy of Kallisto with STAR/featureCounts (mapQ20 threshold), the pipeline that I'm currentl

Quantification and equimolar pooling of NGS libraries

Image
Library preparation for next generation sequencing is becoming easier with the quality of kits and protocols improving substantially in the past few years. With the price of NGS decreasing, we are finding that our throughput is increasing, both in terms of the number of experiments as well as the average size of these experiments. With this in mind, the ability to accurately pool barcoded libraries in equimolar ratios (also called "balancing") is even more critical. Accurate quantification is thus vital. There are several ways to quantify library DNA: Qubit fluorometer. Gives very accurate concentrations in nanogram per microlitre, but agnostic of fragment size distribution. Bioanalyzer. Gives accurate readings of fragment size, but is only fairly accurate in terms of concentrations of each fragment Quantitative PCR. Most accurate, but expensive and most time consuming. NanoDrop UV-Spec. Easiest method but least accurate. Not recommended. So I'll lead you throug

Screen for rRNA contamination in RNA-seq data

Image
Ribosomal RNA (rRNA) is very abundant in cells (~80% of total RNA), so it is useful to deplete rRNA when doing genomewide assays to have sufficient coverage of other genes including protein coding and non-protein coding genes. There are two major strategies for achieving rRNA removal, being (1) poly A enrichment and (2) rRNA depletion. Poly A enrichment uses an oligo dT coupled magnetic bead to "pull-out" RNA molecules with a polyA tag, a common feature of protein coding transcripts. rRNA depletion can be achieved using kits such as Ribo-Zero (Illumina), Ribo-Minus (LifeTech) and NEBNext® rRNA Depletion Kit (NEB). These kits contain oligonucleotide probes that either hybridize and immobilise the rRNA or hybridize and degrade the unwanted rRNA. Once you have some sequence data, you will need to check whether the rRNA depletion has worked. This is somewhat different to a genome-wide analysis I've mentioned in earlier posts because rRNA genes exist in multiple copies