Posts

Showing posts with the label benchmarking

Benchmarking an AMD Ryzen Threadripper 2990WX 64 thread system

Image
I recently received access to a new workstation built around the AMD Ryzen Threadripper 2990WX 32-Core Processor. The main use of the system will be to process genome sequencing data. It has 64 threads and the clock speed is 3.2GHz. It should get those data processing jobs done a lot quicker than my other system. Nevertheless I wanted to run some benchmarks to see how much faster it is as well as give it a stress test to see how well it can cope with high loads. This info could also be useful for comparisons in case degradation of the system in the future.

Here are the specs of the 3 systems being benchmarked:

AMD16: AMD® Ryzen threadripper 1900x 8-core processor - 16 threads@3.8GHz  - 2.4GHz RAMIntel32: Intel® Xeon® CPU E5-2660 16-core processor - 32 threads@3.0GHz - 1.3GHz RAM AMD64: AMD Ryzen Threadripper 2990WX 32-Core Processor - 64 threads@3.2GHZ - 2.9GHz RAM
The command being executed is a pbzip2 of a 4 GB file containing random data using some benchmarking scripts I wrote previ…

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

Image
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…

Sambamba vs Samtools

Samtools has been the one main tools for reading and writing aligned NGS data since the SAM alignment format was initially proposed. With the amounts of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor to Samtools called Sambamba, so I'll just do a few quick tests to see how it stacks up to Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have an 1.8 GB bam file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert to unsorted sam format. $ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam

The sam file is 9.9 GB. Lets try 1-thread SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s

microRNA aligners compared

Image
Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing (adenine-to-inosine conversion, non-templated base addition) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome. 
I downloaded the Arabidopsis genome from Ensembl plant and the latest miRbase release version 21. I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10. 
Firstly note that there were no alignment with either …

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

Image
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

T…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego

Image
In this post I compare the speed of Subread/Subjunc, STAR, HPG aligner and Olego. I will use real RNA-seq data from GEO accession GSE42968 and align to the Arabidopsis thaliana genome. All code used is found at the bottom of this post. The benchmarking was performed on a standard 8 core workstation with 8 GB RAM.

Index the Arabidopsis thaliana genome The first step in any RNA-seq analysis is to prepare the genome for alignment. Subread was the quickest.
Align RNA-seq reads I down sampled the sequence (SRA accesion:SRR634969) file to generate files containing 1M, 2M, 5M, 10M and 14M reads selected randomly. Fastq quality trimmer was used to trim low quality bases from the 3' end and filter low quality reads. Then I used Olego, Subread, Subjunc, STAR and HPG aligner to align the reads. (I wanted to test HPG aligner in both the BWT and SA modes but didn't have the 17 GB RAM to test SA algorithm sadly.) In all the below cases the sequence files were uncompressed fastq (101 bp read…

Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and perform some benchmarking of the available tools.

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh

But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt" Here is a list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Setup the benchmarking script Use the following script to …

Test-driving parallel compression software

Image
Next generation sequencing is driving genomics into realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of different compression software available to make the most out of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.

The file that we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here is the top 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR117…