Data analysis step 3: Align paired end RNA-seq with Tophat

In this series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded and inspected sequence quality. In this post, we will align the paired end data to the human genome with Tophat.

Part 1 is to get a suitable reference genome sequence. I chose download the most recent human genome sequence from the Ensembl ftp site (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa).

axel --num-connections=6 ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz


Once downloaded, the genome needs to be uncompressed and indexed with bowtie2. This will take a few hours.

REF=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
gunzip ${REF}.gz
bowtie2-build $REF $REF

Next step is to use Tophat to align the reads in paired-end mode. Tophat is smart enough to recognise the bz2 compression. The script also indexes the bam files and generates some familiar samtools statistics.

#!/bin/bash

REF=/path/to/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

for FQZ1 in SR*_1.fastq.bz2
do
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam \
> ${BASE}_tophat/accepted_hits.bam.stats
done

A scan through the align_summary.txt shows that the alignment stats looks reasonable.
Tophat alignment statistics for paired end RNA-seq data (GSE55123)
If you are working on a big multicore server with sufficient memory, you can speed up the alignment by running multiple alignment jobs in the background.

#!/bin/bash

REF=/path/to/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

for FQZ1 in SR*_1.fastq.bz2
do
(
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam \
> ${BASE}_tophat/accepted_hits.bam.stats ) &
done ; wait

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?