Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post.

Download from SRA and convert to fastq format

__________________________________________________________________

Run skewer to quality trim the data.


$ cat run_skewer.sh
#!/bin/bash
for FQZ1 in *_1.fastq.gz ; do
FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'`
skewer -t 8 -q 20 $FQZ1 $FQZ2
done
__________________________________________________________________

Run bfc to correct errors.


$ cat run_bfc.sh
#!/bin/bash
for FQ in *.fastq.gz ; do
OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'`
bfc -t 8 $FQ | pigz > $OUT &
done
wait
__________________________________________________________________

Run STAR aligner to align reads to human genome in single end mode.



$ cat run_star.sh
#!/bin/bash
DIR=/refgenome_hsapiens/
GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
for FQZ in `ls *gz` ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dkf $FQZ
STAR genomeLoad LoadAndKeep --genomeDir $DIR --readFilesIn $FQ --runThreadN 8 \
--sjdbGTFfile $GTF --outSAMattributes NH HI NM MD
rm $FQ
mv Aligned.out.sam ${FQ}.STAR.sam
mv Log.final.out ${FQ}_starlog.txt
( samtools view -uSh ${FQ}.STAR.sam | samtools sort - ${FQ}.STAR
rm ${FQ}.STAR.sam
samtools index ${FQ}.STAR.bam
samtools flagstat ${FQ}.STAR.bam > ${FQ}.STAR.bam.stats ) &
done
STAR genomeLoad Remove --genomeDir $DIR
wait
__________________________________________________________________

Run featureCounts to generate read-gene mapping info. Note mapQ threshold set at 20.



$ cat run_fcount.sh

#!/bin/bash

#DIR=refgenome_hsapiens/

GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
EXPT=skewered
MX=${EXPT}.mx
MXF=${EXPT}_f.mx
featureCounts -Q 20 -R -T 8 -a $GTF -o $MX *bam
sed 1d $MX | cut -f1,7- > $MXF

__________________________________________________________________

Analyze featureCounts read-gene mapping data, generating true positive and false positive read pairs.



$ cat correctpairs_fcounts.sh

#!/bin/bash

for i in *featureCounts ; do

grep Assigned $i | sort -T . -k 1b,1 | cut -f1,3 > ${i}.s &
done
wait

for FC1 in *pair1.fastq.STAR.bam.featureCounts.s ; do
( FC2=`echo $FC1 | sed 's/pair1.fastq/pair2.fastq/'`
join -1 1 -2 1 $FC1 $FC2 > ${FC1}.jn
rm $FC1 $FC2 ) &
done
wait


for JN in *jn ; do
(
LN=`wc -l < $JN`
COR=`awk '$2==$3' $JN | wc -l`
INC=`awk '$2!=$3' $JN | wc -l`
echo $JN $LN $COR $INC ) &
done | tee result.txt
wait

__________________________________________________________________

Prepare custom transcriptome reference sequence for use with Kallisto.



$ cat run_kal_ref_prep.sh

#!/bin/bash

wget http://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

zcat Homo_sapie.GRCh38.cdna.all.fa.gz | cut -d ' ' -f1,4 \
| sed 's/ gene:/_/' > Homo_sapiens.GRC38.78.format.cdna.all.fa
__________________________________________________________________


Run Kallisto in single end mode, generating a bam file output.



$ cat run_kal.sh

#!/bin/bash

IDX=Homo_sapiens.GRC38.78.format.cdna.all.fa.idx

for FQZ in *fastq.gz ; do
NAME=`echo $FQZ | sed 's/.fastq.gz//'`
kallisto quant -t 7 --pseudobam --single -i $IDX -o ${NAME}.output -b 100 $FQZ \
| samtools view -Sb - > ${NAME}.bam
done

__________________________________________________________________

Analyze Kallisto read pair information. If multimapping has occurred, select one hit only.



$ cat correctpairs_kal.sh

#!/bin/bash

for BAM in *bam ; do

samtools view $BAM | cut -f1,3 | grep ENSG | tr '_' '\t' | cut -f1,3 \
| awk '!arr[$1]++' | sort -k 1b,1 > $BAM.counts &
done
wait

for C1 in *pair1.bam.counts ; do
C2=`echo $C1 | sed 's/pair1.bam.counts/pair2.bam.counts/'`
ls $C1 $C2
join -1 1 -2 1 $C1 $C2 > $C1.jn &
rm $C1 $C2
done
wait


for JN in *jn ; do
(
LN=`wc -l < $JN`
COR=`awk '$2==$3' $JN | wc -l`
INC=`awk '$2!=$3' $JN | wc -l`
echo $JN $LN $COR $INC ) &
done | tee result.txt
wait


Popular posts from this blog

A selection of useful bash one-liners

Data analysis step 8: Pathway analysis with GSEA

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc