Data analysis step 4: Count aligned reads and create count matrix

In this RNA-seq analysis series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded, done QC and aligned the reads. In this post, we will count the reads aligning to exons and generate a count matrix for statistical analysis.

My overall strategy is to use bedtools to count reads over exons. To do that, I'll need a bed file containing all exons. I could use UCSC genes, but I'd rather use the latest Ensembl version.

Create exon bed file 

To begin, we need a list of known exons. I will use the latest Ensembl human gene annotation gtf file from their ftp server. Then extract the exon entries with grep and then rearrange the gene accession and symbol information into a temporary bed file. The temporary bed file is exploded on gene name and then overlapping exons are merged with bedtools.

#!/bin/bash
wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr '"' '\t' \
| cut -f1,4,5,10,16 | sed 's/\t/_/4' \
| sort -k4 > tmp.bed

mkdir Homo_sapiens.GRCh38_explode
cd Homo_sapiens.GRCh38_explode
awk '{print > $4}' ../tmp.bed

for i in `ls ENS*`
do cat $i | sortBed | mergeBed -nms \
| cut -d ';' -f1
done | sortBed > ../Homo_sapiens.GRCh38_exon.bed

Count reads with bedtools

In this part, we ask bedtools to count the reads aligned to the exons with multicov tool. We use the options "-D" to allow duplicate reads  and "-q 10" to only count reads that were uniquely mapped. The script then parses the data to a perl script called "agg.pl" that aggregates the exons data to generate a final gene-wise count. The temporary count files are then then merged from all the samples using awk to make a data matrix.

#!/bin/bash
BED=/path/to/Homo_sapiens.GRCh38_exon.bed
EXPT=Aza_AML3_tophat
MX=`echo $EXPT | sed 's/$/.xls/'`

for BAM in *bam
do
bedtools multicov -D -q 10 -bams $BAM -bed $BED \
| cut -f4- | tr '\t' ',' \
| perl agg.pl > ${BAM}.cnt &
done
wait

ls *bam | cat | tr '\n' '\t' \
| sed 's/^/\t/;s/$/\n/;s/.fq.sort.bam//g' > header.tmp

awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *cnt \
| tr -s ' ' | tr ' ' '\t' | cat header.tmp - > $MX

rm header.tmp

Here is the agg.pl script that you will need:

#!/usr/bin/perl
use warnings;
use strict;
use integer;

my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_\t${a{$_}}\n" for sort keys %a;

Here is what the matrix file will look like:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000260835_RP11-468I15.1 0 0 0 0 0 0
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000185339_TCN2 1 2 1 0 1 1
ENSG00000255506_RP11-203F8.1 0 0 0 0 0 0
ENSG00000229899_AC084290.2 0 0 0 0 0 0
ENSG00000269933_RP3-333A15.2 0 0 0 0 0 0
ENSG00000216069_MIR875 0 0 0 0 0 0
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417

You will notice that there are a lot of genes with few or no reads. I generally like to remove genes with fewer than 10 reads per sample on average. I use a small awk script called filt.sh

#!/bin/bash
FILE=$1
sed -n 1p $FILE > header.txt
awk 'BEGIN {FS=OFS="\t"}
{
sum=0; n=0
for(i=3;i<=NF;i++)
     {sum+=$i; ++n}
     print sum/n,$0
}' $FILE \
| awk '$1>10' | cut -f2- | cat header.txt - > ${FILE}_filt.xls

Here is how you use the filt.sh script:

$ ./filt.sh CountMatrix.xls

Then have a look at the filtered matrix file:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60

At this stage, we can compare the number of genes that were discarded from the analysis with "wc":

$ wc -l Aza_AML3_tophat*
  19919 Aza_AML3_tophat_filt.xls
  62758 Aza_AML3_tophat.xls

So this shows that 42,839 Ensembl genes did not pass our detection threshold. By omitting these silent and lowly expressed genes, we will remove some noise and also get the added bonus of making the false discovery rate correction of p-values less severe when we get to the statistical analysis.

A peek at the mapping characteristics

In the below figure I look at the map quality for each sample keeping in mind that mapQ<10 is not considered reliable (I used samtools view). I also look at the location of reads in terms of whether they originate from exons or elsewhere (analysing the .cnt files). It shows that 50-60% of the reads are mapped with mapQ greater than or equal to 10 and that about 93% of those reads align to known exons.
Basic analysis of mapped reads. (A) Proportion of mapped reads with map quality of 10 or greater determined with Samtools. (B) Proportion of mapQ10+ reads mapped to exons compared to other regions.

In the next post I'll go through some differential expression analysis.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths