Data analysis step 4: Count aligned reads and create count matrix

In this series of posts, we're working through an RNA-seq analysis workflow. So far, we have downloaded the raw data, performed quality control and aligned the reads. In this post, we will count the reads aligning to exons and generate a count matrix for the statistical analysis.

My overall strategy is to use bedtools to count reads over exons. To do that, I'll need a BED file containing all exons. I could use the UCSC gene annotation, but I'd rather use the latest Ensembl version.

Create exon BED file

To begin, we need a list of known exons. I will download the latest Ensembl human gene annotation GTF file from their FTP server, extract the exon entries with grep, and rearrange the gene accession and symbol information into a temporary BED file. The temporary BED file is then split into one file per gene, and overlapping exons within each gene are merged with bedtools.

#!/bin/bash
# Download the Ensembl release 76 human annotation (GRCh38)
wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz

# Keep exon entries only, split the attribute field on quotes, then keep
# chromosome, start, end, gene accession and gene symbol; the sed joins
# the accession and symbol into one field with an underscore
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr '"' '\t' \
| cut -f1,4,5,10,16 | sed 's/\t/_/4' \
| sort -k4 > tmp.bed

# Split the temporary BED file into one file per gene
mkdir Homo_sapiens.GRCh38_explode
cd Homo_sapiens.GRCh38_explode
awk '{print > $4}' ../tmp.bed

# Merge overlapping exons within each gene, then sort the combined output
for i in ENS*
do cat $i | sortBed | mergeBed -nms \
| cut -d ';' -f1
done | sortBed > ../Homo_sapiens.GRCh38_exon.bed
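
One caveat: the -nms flag was removed from mergeBed in later bedtools releases (around v2.20). If your version complains about it, the newer column-operation syntax should give the same result; this substitution is mine rather than part of the original pipeline:

mergeBed -c 4 -o collapse -delim ';'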

Count reads with bedtools

In this part, we ask bedtools to count the reads aligned to the exons with the multicov tool. We use the option "-D" to include duplicate reads and "-q 10" to only count reads with a mapping quality of at least 10, i.e. reads that mapped uniquely. The counts are piped into a small Perl script called "agg.pl" that aggregates the exon-level counts into gene-wise totals. The temporary count files from all the samples are then merged with awk to make a data matrix.

#!/bin/bash
BED=/path/to/Homo_sapiens.GRCh38_exon.bed
EXPT=Aza_AML3_tophat
MX=${EXPT}.xls

# Count reads over exons for each BAM in parallel, then aggregate the
# exon counts into gene totals with agg.pl
for BAM in *bam
do
bedtools multicov -D -q 10 -bams $BAM -bed $BED \
| cut -f4- | tr '\t' ',' \
| perl agg.pl > ${BAM}.cnt &
done
wait

# Build a header line of sample names from the BAM file names
ls *bam | tr '\n' '\t' \
| sed 's/^/\t/;s/$/\n/;s/.fq.sort.bam//g' > header.tmp

# Join the per-sample gene counts into one matrix
awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *cnt \
| tr -s ' ' | tr ' ' '\t' | cat header.tmp - > $MX

rm header.tmp

Here is the agg.pl script that you will need:

#!/usr/bin/perl
# agg.pl: sum the per-exon counts for each gene.
# Input lines look like "gene,count"; output is "gene<TAB>total".
use warnings;
use strict;
use integer;

my %a;
while (<>) {
    my ($gene, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$gene} += $n if defined $n;
}
print "$_\t$a{$_}\n" for sort keys %a;
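
To see what it does, you can feed it a couple of toy lines (the gene name below is made up, just for illustration):

$ printf 'ENSG00000000000_GENE1,3\nENSG00000000000_GENE1,4\n' | perl agg.pl
ENSG00000000000_GENE1	7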

Here is what the matrix file will look like:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000260835_RP11-468I15.1 0 0 0 0 0 0
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000185339_TCN2 1 2 1 0 1 1
ENSG00000255506_RP11-203F8.1 0 0 0 0 0 0
ENSG00000229899_AC084290.2 0 0 0 0 0 0
ENSG00000269933_RP3-333A15.2 0 0 0 0 0 0
ENSG00000216069_MIR875 0 0 0 0 0 0
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417

You will notice that there are a lot of genes with few or no reads. I generally like to remove genes that average 10 reads per sample or fewer. I use a small awk script called filt.sh:

#!/bin/bash
FILE=$1
# Keep the header line aside; the filtering step below would discard it
sed -n 1p $FILE > header.txt

# Prepend the mean count per sample to each row (the counts sit in
# columns 2 to NF), keep rows with a mean above 10, then drop the mean
# column and put the header back on top
awk 'BEGIN {FS=OFS="\t"}
{
sum=0; n=0
for(i=2;i<=NF;i++)
     {sum+=$i; ++n}
     print sum/n,$0
}' $FILE \
| awk '$1>10' | cut -f2- | cat header.txt - > ${FILE}_filt.xls
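
If you've just created filt.sh, remember to make it executable first (a generic shell step, not specific to this pipeline):

$ chmod +x filt.sh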

Here is how you use the filt.sh script:

$ ./filt.sh CountMatrix.xls

Then have a look at the filtered matrix file:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60

At this stage, we can check how many genes were discarded from the analysis by comparing line counts with "wc":

$ wc -l Aza_AML3_tophat*
  19919 Aza_AML3_tophat_filt.xls
  62758 Aza_AML3_tophat.xls

So this shows that 42,839 Ensembl genes did not pass our detection threshold. By omitting these silent and lowly expressed genes, we remove some noise and also get the added bonus of a less severe false discovery rate correction when we get to the statistical analysis: with fewer genes tested, the multiple-testing penalty applied to each p-value is smaller.

A peek at the mapping characteristics

In the figure below, I look at the mapping quality for each sample, keeping in mind that reads with mapQ < 10 are not considered reliably mapped (I extracted the values with samtools view). I also look at where the reads land, i.e. whether they originate from exons or elsewhere (by analysing the .cnt files). It shows that 50-60% of the reads were mapped with mapQ of 10 or greater, and that about 93% of those reads align to known exons.
Basic analysis of mapped reads. (A) Proportion of mapped reads with map quality of 10 or greater determined with Samtools. (B) Proportion of mapQ10+ reads mapped to exons compared to other regions.
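
In case you want to generate those proportions yourself, here is a minimal sketch of how it could be done; it assumes the *.bam and matching .cnt files produced above, and the exonic fraction is only approximate because multicov can count a read more than once when it overlaps several intervals:

#!/bin/bash
# For each sample: fraction of mapped reads with mapQ >= 10 (samtools),
# and the fraction of those reads that fall in known exons (.cnt totals)
for BAM in *bam
do
  MAPPED=$(samtools view -c -F 4 $BAM)       # all mapped reads
  Q10=$(samtools view -c -F 4 -q 10 $BAM)    # mapped reads with mapQ >= 10
  EXONIC=$(awk '{sum+=$2} END{print sum}' ${BAM}.cnt)  # reads assigned to exons
  awk -v b=$BAM -v m=$MAPPED -v q=$Q10 -v e=$EXONIC \
    'BEGIN{printf "%s\tmapQ>=10: %.1f%%\texonic: %.1f%%\n", b, 100*q/m, 100*e/q}'
done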

In the next post I'll go through some differential expression analysis.
