
Showing posts with the label next generation sequencing

How NGS is transforming medicine

Last month, I gave a talk at our departmental meeting, describing in general terms how high-throughput sequencing technology is having real impacts in medicine and human health, as well as some emerging trends to watch for in coming years.

Here's the link


Quantification and equimolar pooling of NGS libraries

Library preparation for next generation sequencing is becoming easier, with the quality of kits and protocols improving substantially in the past few years. With the price of NGS decreasing, we are finding that our throughput is increasing, both in terms of the number of experiments and the average size of those experiments. With this in mind, the ability to accurately pool barcoded libraries in equimolar ratios (also called "balancing") is even more critical, and accurate quantification is thus vital. There are several ways to quantify library DNA:

- Qubit fluorometer. Gives very accurate concentrations in nanograms per microlitre, but is agnostic of fragment size distribution.
- Bioanalyzer. Gives accurate readings of fragment size, but is only fairly accurate for the concentration of each fragment.
- Quantitative PCR. Most accurate, but expensive and the most time consuming.
- NanoDrop UV-Spec. Easiest method, but least accurate. Not recommended.

So I'll lead you through my favouri…
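Equimolar pooling hinges on combining two of the measurements above: the mass concentration (from the Qubit) and the mean fragment size (from the Bioanalyzer) give molarity. A minimal sketch of that conversion, assuming the usual average mass of ~660 g/mol per double-stranded base pair (the function name is hypothetical):

```shell
# Hypothetical helper: convert a Qubit concentration (ng/ul) and a mean
# fragment size (bp) into library molarity (nM).
# molarity (nM) = conc (ng/ul) * 1e6 / (660 g/mol/bp * size in bp)
molarity_nM () {
  # $1 = concentration in ng/ul, $2 = mean fragment size in bp
  awk -v c="$1" -v s="$2" 'BEGIN { printf "%.1f\n", c * 1e6 / (660 * s) }'
}

molarity_nM 10 400   # a 10 ng/ul library averaging 400 bp -> ~37.9 nM
```

Libraries can then be diluted to a common molarity and combined in equal volumes.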

SRA toolkit tips and workarounds

The Sequence Read Archive (SRA) is the main repository for next generation sequencing (NGS) raw data. Considering the sheer (and still accelerating) rate at which NGS data are generated, the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:

SRA, however, doesn't directly provide the fastq files that we commonly work with; instead it serves .sra archives that require specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful, and I've had my frustrations with it. In this post, I'll share some of the best sra-toolkit tips that I've found.

Get the right version of the software and configure it. When downloading, make sure you get the newest version from the NCBI website (link). Don't download it from GitHub or from the Ubuntu software centre (or apt-get), as those will probably be older versions. In the binary directory (looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin…
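A related workaround is to fetch the .sra archive directly over FTP and extract it locally with the toolkit. A sketch, assuming the ByRun path layout NCBI used at the time of writing (the helper name is hypothetical; verify the layout against current NCBI documentation before relying on it):

```shell
# Hypothetical helper: build the legacy FTP path of a .sra archive from its
# run accession. The ByRun directory layout (first 3 chars / first 6 chars /
# accession) is an assumption about how SRA organised files at the time.
sra_url () {
  acc=$1
  p3=$(printf '%s' "$acc" | cut -c1-3)
  p6=$(printf '%s' "$acc" | cut -c1-6)
  printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$p3" "$p6" "$acc" "$acc"
}

sra_url SRR1553607
# then, once downloaded: fastq-dump --split-3 --gzip SRR1553607.sra
```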

Read counting with featureCounts, BedTools and HTSeq

Counting the number of reads that align to certain genomic features is a key element of many next gen sequencing analysis pipelines. For RNA-seq, this is commonly used to count reads aligning to exons, while for ChIP-seq it is used to count reads over a promoter or other region of interest. There are several available tools for performing this task, and in this post I will compare three of the most commonly used:

- bedtools multicov
- htseq-count
- featureCounts

I took one of the bam files from the recent RNA-seq series of posts and subsampled it using samtools and shuf into files of 1M, 2M, 5M and 10M reads, as well as keeping the full bam file containing 25.4M reads.

I then used the benchmarking script described in the previous post to record execution time, CPU usage and peak memory for read counting to generate a gene-wise count matrix. I ran featureCounts in single-thread mode as well as in parallel mode (maximum of 8 cores).
The results show that featureCounts is about 10 times faster than Be…

Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and performing some benchmarking of the available tools.

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh

But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt". Here is the list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script. Use the following script to …
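Since the excerpt cuts off here, a minimal sketch of such a benchmarking loop (function name hypothetical): run every command listed in codes.txt and report its wall-clock time. For peak memory and average CPU utilisation you would additionally wrap each command in GNU time (/usr/bin/time -v) where it is installed; plain wall time is shown here to stay portable.

```shell
# Hypothetical sketch of a benchmarking loop: run each command listed in a
# file (one command per line) and print its wall-clock time in seconds.
bench_file () {
  while IFS= read -r cmd; do
    start=$(date +%s%N)                 # nanoseconds since epoch (GNU date)
    bash -c "$cmd" > /dev/null 2>&1     # run the command, discard its output
    end=$(date +%s%N)
    awk -v c="$cmd" -v s="$start" -v e="$end" \
      'BEGIN { printf "%s\t%.3f s\n", c, (e - s) / 1e9 }'
  done < "$1"
}

# usage: bench_file codes.txt
```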