Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

August 24, 2015

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post.

Download from SRA and convert to fastq format

__________________________________________________________________

Run skewer to quality trim the data.


$ cat run_skewer.sh

#!/bin/bash

for FQZ1 in *_1.fastq.gz ; do

FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'`

skewer -t 8 -q 20 $FQZ1 $FQZ2

done

__________________________________________________________________

Run bfc to correct errors.


$ cat run_bfc.sh

#!/bin/bash
for FQ in *.fastq.gz ; do
OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'`
bfc -t 8 $FQ | pigz > $OUT &
done
wait

__________________________________________________________________

Run STAR aligner to align reads to human genome in single end mode.

$ cat run_star.sh
#!/bin/bash
DIR=/refgenome_hsapiens/
GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
for FQZ in `ls *gz` ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dkf $FQZ
STAR genomeLoad LoadAndKeep --genomeDir $DIR --readFilesIn $FQ --runThreadN 8 \
--sjdbGTFfile $GTF --outSAMattributes NH HI NM MD
rm $FQ
mv Aligned.out.sam ${FQ}.STAR.sam
mv Log.final.out ${FQ}_starlog.txt
( samtools view -uSh ${FQ}.STAR.sam | samtools sort - ${FQ}.STAR
rm ${FQ}.STAR.sam
samtools index ${FQ}.STAR.bam
samtools flagstat ${FQ}.STAR.bam > ${FQ}.STAR.bam.stats ) &
done
STAR genomeLoad Remove --genomeDir $DIR
wait
__________________________________________________________________

Run featureCounts to generate read-gene mapping info. Note mapQ threshold set at 20.

$ cat run_fcount.sh

#!/bin/bash

#DIR=refgenome_hsapiens/

GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf

EXPT=skewered



MX=${EXPT}.mx

MXF=${EXPT}_f.mx

featureCounts -Q 20 -R -T 8 -a $GTF -o $MX *bam

sed 1d $MX | cut -f1,7- > $MXF

__________________________________________________________________

Analyze featureCounts read-gene mapping data, generating true positive and false positive read pairs.

$ cat correctpairs_fcounts.sh

#!/bin/bash

for i in *featureCounts ; do

grep Assigned $i | sort -T . -k 1b,1 | cut -f1,3 > ${i}.s &

done



wait




for FC1 in *pair1.fastq.STAR.bam.featureCounts.s ; do

 ( FC2=`echo $FC1 | sed 's/pair1.fastq/pair2.fastq/'`

 join -1 1 -2 1 $FC1 $FC2 > ${FC1}.jn

 rm $FC1 $FC2 ) &

done

wait







for JN in *jn ; do

 (

 LN=`wc -l < $JN`

 COR=`awk '$2==$3' $JN | wc -l`

 INC=`awk '$2!=$3' $JN | wc -l`

 echo $JN $LN $COR $INC ) &

done | tee result.txt

wait

__________________________________________________________________

Prepare custom transcriptome reference sequence for use with Kallisto.

$ cat run_kal_ref_prep.sh

#!/bin/bash

wget http://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

zcat Homo_sapie.GRCh38.cdna.all.fa.gz | cut -d ' ' -f1,4 \

| sed 's/ gene:/_/' > Homo_sapiens.GRC38.78.format.cdna.all.fa



__________________________________________________________________

Run Kallisto in single end mode, generating a bam file output.

$ cat run_kal.sh

#!/bin/bash

IDX=Homo_sapiens.GRC38.78.format.cdna.all.fa.idx

for FQZ in *fastq.gz ; do

NAME=`echo $FQZ | sed 's/.fastq.gz//'`



kallisto quant -t 7 --pseudobam --single -i $IDX -o ${NAME}.output -b 100 $FQZ \

| samtools view -Sb - > ${NAME}.bam

done

__________________________________________________________________

Analyze Kallisto read pair information. If multimapping has occurred, select one hit only.

$ cat correctpairs_kal.sh

#!/bin/bash

for BAM in *bam ; do

samtools view $BAM | cut -f1,3 | grep ENSG | tr '_' '\t' | cut -f1,3 \

| awk '!arr[$1]++' | sort -k 1b,1 > $BAM.counts &



done

wait




for C1 in *pair1.bam.counts ; do

C2=`echo $C1 | sed 's/pair1.bam.counts/pair2.bam.counts/'`

ls $C1 $C2

join -1 1 -2 1 $C1 $C2 > $C1.jn &

rm $C1 $C2

done

wait







for JN in *jn ; do

 (

 LN=`wc -l < $JN`

 COR=`awk '$2==$3' $JN | wc -l`

 INC=`awk '$2!=$3' $JN | wc -l`

 echo $JN $LN $COR $INC ) &

done | tee result.txt

wait

Search This Blog

Genome Spot