Posts

Showing posts with the label STAR

Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post.
Download from SRA and convert to fastq formatCheck my older post. __________________________________________________________________

Run skewer to quality trim the data.
$ cat run_skewer.sh #!/bin/bash for FQZ1 in *_1.fastq.gz ; do FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'` skewer -t 8 -q 20 $FQZ1 $FQZ2 done __________________________________________________________________

Run bfc to correct errors.
$ cat run_bfc.sh
#!/bin/bash
for FQ in *.fastq.gz ; do
OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'`
bfc -t 8 $FQ | pigz > $OUT &
done
wait __________________________________________________________________
Run STAR aligner to align reads to human genome in single end mode.

$ cat run_star.sh
#!/bin/bash
DIR=/refgenome_hsapiens/
GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
for FQZ in `ls *gz` ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dkf …

How accurate is Kallisto?

Image
One of the most interesting developments in RNA-seq informatics in the past year or so is the evolution of so-called "lightweight" analysis. Instead of trying to map a whole read to the reference exons, it may be quicker and just as accurate to simply compare the k-mer content of the read and reference. If the read and reference share enough unique kmers, then the aligner can stop analysing that read and can move on the the next one. It turns out that the speed-up with this approach is huge, when compared with previous methods. Kallisto especially appears to have some speed advantages over competitors such as Sailfish, Salmon, eXpress, etc. For those interested in learning more, please refer to the Kallisto pre-print in arXiv, as well as the following blog posts.
nextgenseek.comrobpatro.comliorpachter.wordpress.comsjcockell.me
For this post, I wanted to compare the accuracy of Kallisto with STAR/featureCounts (mapQ20 threshold), the pipeline that I'm currently using in mo…

Weighing the benefits of RNA-seq error correction

Image
Sequencing data contains two major types of errors, ones that are incorporated during library preparation and ones incorporated during sequence reading. While errors of the former are difficult to correct as they occur without clues from the base quality scores, the latter type do correlate with lower base quality scores and so it is possible to identify these ambiguous base calls and compare them to a library of high confidence k-mers from the same sequencing run. Error correction of this type has been show to improve de novo assemblies and SNP detection.

A recent report posted in bioRxiv shows error correction gives a drastic improvement in transcriptome assembly, with the author benchmarking the performance of five different correction software packages (Bless, Lighter, SGA, SEECER & BFC). The author reports that these software options vary in the aggressiveness of error correction, with SGA being the most aggressive, Lighter the least aggressive and BFC and SEECER performing w…

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

Image
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.

I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.

Here is the result for unmutated (perfect) 100 bp reads.

HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…

microRNA aligners compared

Image
Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing (adenine-to-inosine conversion, non-templated base addition) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome. 
I downloaded the Arabidopsis genome from Ensembl plant and the latest miRbase release version 21. I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10. 
Firstly note that there were no alignment with either …

RNA-seq aligner accuracy tested with simulated reads

Image
In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTC…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

Image
In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

The first thing…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego

Image
In this post I compare the speed of Subread/Subjunc, STAR, HPG aligner and Olego. I will use real RNA-seq data from GEO accession GSE42968 and align to the Arabidopsis thaliana genome. All code used is found at the bottom of this post. The benchmarking was performed on a standard 8 core workstation with 8 GB RAM.

Index the Arabidopsis thaliana genome The first step in any RNA-seq analysis is to prepare the genome for alignment. Subread was the quickest.
Align RNA-seq reads I down sampled the sequence (SRA accesion:SRR634969) file to generate files containing 1M, 2M, 5M, 10M and 14M reads selected randomly. Fastq quality trimmer was used to trim low quality bases from the 3' end and filter low quality reads. Then I used Olego, Subread, Subjunc, STAR and HPG aligner to align the reads. (I wanted to test HPG aligner in both the BWT and SA modes but didn't have the 17 GB RAM to test SA algorithm sadly.) In all the below cases the sequence files were uncompressed fastq (101 bp read…