Posts

Showing posts with the label BWA

Analyzing repeat rich plant smRNA-seq data with ShortStack

Image
Small RNA expression is difficult to analyse. They're small molecules anywhere from 18 -25 nt for miRNAs, they occur as identical or near identical family members and are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ≥10).
So as you can see, the proportion of reads that are assigned …

Screen for mycoplasma contamination in DNA-seq and RNA-seq data

Image
If you work in a lab dealing with mammalian cell cultures, then you've probabaly heard of Mycoplasma, these are obligate intracellular parasitic bacteria that not only cause human infection and disease, but also common contaminants in cell culture experiments. Mycoplasma infection can cause many changes to cell biology that can invalidate experimental results which is alarming. These bacteria are also resistant to most antibiotics used in culture experiments like streptomycin and penicillin. Mycoplasma detection is also not straight-forward, as these bugs are not visible with the light microscopes used by most researchers. PCR and Elisa tests can be used but there many researchers out there who simply don't perform these tests. Last year, a study published in NAR showed that about 11% of RNA-seq studies were affected by Mycoplasma contamination, furthermore the study also identified a panel of 61 host genes that were strongly correlated with the presence of Mycoplasma.

In thi…

Functions and GNU parallel for effective cluster load management

Image
I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and then submits more jobs as others are completed. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. In bash, here is a simple example, which declares a function consisting of a chain of piped commands, and then executes 4 jobs in parallel, until all of *files.txt have been processed.

#!/bin/bash
my_func2() {
INPUT=$1
VAR1=bar
cmd1 $INPUT $VAR1 | cmd2 | cmd3 > ${1}.out
}
export -f my_func
parallel -j4…

Quantifying repeat sequences from short read next gen sequencing data

Image
Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any sequence read that is shorter than the repeat unit is going to fail at being uniquely placed in the genome. At a given read length, genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and ignore reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? Why is that even relevant?

The "Why" About 50% of the human genome is comprised of repetitive DNA (Ref). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers (Ref). Endogenous retroviruses have been domesticated to boost antiviral response in B cells after exposure to T cel…

microRNA aligners compared

Image
Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing (adenine-to-inosine conversion, non-templated base addition) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome. 
I downloaded the Arabidopsis genome from Ensembl plant and the latest miRbase release version 21. I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10. 
Firstly note that there were no alignment with either …

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

Image
In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

T…

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

Image
In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

The first thing…

Quick alignment of microRNA-Seq data to a reference

In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc).

MicroRNA datasets are rather a bit different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the below script that several steps are done without writing a single intermediate file:

Adapter clippingSequence collapsingBWA alignmentBam conversionSorting bam file
for FQ in *fq
do
fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
| fastx_collapser \
| bwa aln -t 30 $REF - \
| bwa samse $REF - ${FQ} \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
done
wait
for bam in *bam ; do samtools index $bam & done ; wait
for bam in *bam ; do samtools flagstat $bam > ${bam}.stats & done ; wait
Using thi…

Align Next Gen Sequencing Data Like A Boss

We know that sequencing platforms are generating more and more output, and more of this data is finding itself in online databanks. The volume of data is a blessing because it is so information rich, but it is also increasingly becoming a problem - we need more machine time to process it.

I was recently aligning dozens of mouse encode RNA-seq data sets while we had a server fault. I was left with a mess of half finished jobs and intermediate files and basically had to start the whole process again. This prompted me to make an alignment pipeline that would be more resistant to unexpected interruptions. To do this, I took inspiration from a post at SEQanswers which piped the output from the BWA aligner and generated bam files without any intermediate files.

But I wanted to take this a bit further, by going from compressed fastq to bam file without writing any intermediate files, doing quality trimming on the fly and saving a lot of storage in the process. So here are two examples, one f…

BWA alignment to a genome - single ends

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing, quality trimming and adapter clipping. I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here), but currently the most popular choices are BWA and Bowtie2. BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with Samtools. The things you will need in the otherwise empty directory are:

Reference genome in fasta formatSequence data in fastq format (.fq) First thing to do is give the experiment a name and index the genome with BWA. The index step could take about 3 hours for a human size genome.
EXID=Experimentname
REF=Hg19.fa
bwa index -a bwtsw $REF Next, use BWA to align. All of our fastq data sets hav…