Quick alignment of microRNA-Seq data to a reference
In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc).
MicroRNA datasets are rather a bit different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the below script that several steps are done without writing a single intermediate file:
Using this method, the fastq is collapsed, meaning the sequences are now in a format which looks like this:
MicroRNA datasets are rather a bit different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the below script that several steps are done without writing a single intermediate file:
- Adapter clipping
- Sequence collapsing
- BWA alignment
- Bam conversion
- Sorting bam file
for FQ in *fq
do
fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
| fastx_collapser \
| bwa aln -t 30 $REF - \
| bwa samse $REF - ${FQ} \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
done
wait
for bam in *bam ; do samtools index $bam & done ; wait
for bam in *bam ; do samtools flagstat $bam > ${bam}.stats & done ; wait
Using this method, the fastq is collapsed, meaning the sequences are now in a format which looks like this:
>1-14625
TGTAAACATCCTCGACTGGAAGCT
>2-11786
TGAGGTAGTAGGTTGTATAGTT
>3-8897
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCTTGG
>4-8495
TTTGGCAATGGTAGAACTCACACT
>5-7084
CAACGGAATCCCAAAAGCAGCTG
Where the sequence name now consists of a single number (unique identifier) as well as the number of times that exact sequence string was identified. Using this technique, I was able to align 12 data sets totaling 32 million reads in about 3 minutes. This is how the bam file looks:
36725-1 0 chr2 59549323 0 33M * 0 0 CCATACCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:3364660-1 0 chr2 59549328 0 28M * 0 0 CCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:286064-8 0 chr2 59549329 0 24M * 0 0 CACCCTGAACGCGCCCGATCTCGT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:2411925-3 0 chr2 59549331 0 28M * 0 0 CCCTGAACGCGCCCGATCTCGTCTGATC * XT:A:R NM:i:0 X0:i:31 XM:i:0 XO:i:0 XG:i:0 MD:Z:2823045-1 16 chr2 59549331 0 24M * 0 0 CCCTGAACGCGCCCGATCTCGTCT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:24
I will cover the differential analysis aspect in another post. If you have tips for microRNA-Seq analysis I'd be glad to hear.