Generate an RNA-seq count matrix: Part 2 - count over exons

In the previous post, I mentioned that in order to do counting, we need some regions (exons) to count over. We generated a BED file based on GENCODE exons which looks like this:

chr9 33248310 33248419 ENSG00000107262.11_BAG1
chr9 33252477 33255306 ENSG00000107262.11_BAG1
chr9 33255862 33255925 ENSG00000107262.11_BAG1
Now we can actually count how many reads from each of our samples maps to these exons. To do that there are a few different programs:
For this demo, I will outline use of BedTools for RNA-Seq. I think BedTools nulticov is a good option for 3 main reasons:
  1. It can handle multiple bam files at the same time, producing a count table/matrix which is pretty much ready for statistical analysis.
  2. It is aware of mapQ scores of alignments, meaning that if the program comes across non-unique alignments, you can set it to ignore or include those reads.
  3. The BedTools package is well documented, widely used and the code under constant maintenance.
The first stage in the count is to count over each exon. While multicov allows for reading of multiple files at the same time, this is not done in parallel, so I generally prefer to make a count file for each bam individually and join them at the end. 

for bam in `ls *.bam | sed 's/.bam//' `
bedtools multicov -D -q 10 -bams ${bam}.bam -bed $BED \
| cut -f4,5 > ${bam}.tmpcnt &
So now we have a temporary exon-wise count file which looks like this:

ENSG00000001167.10_NFYA 19
ENSG00000001167.10_NFYA 48
ENSG00000001167.10_NFYA 19
ENSG00000001167.10_NFYA 31
So you can see that the exon coordinates have been removed and we're now looking at counts for each exon of the gene. The next step is to aggregate the data, for which we can use a little perl script:

use warnings;
use strict;
use integer;
my %a;
while (<>) {
   my ($gene, $n) = /^\s*(\S+)\s*\t\s*(\S+)/;
      $a{$gene} += $n if defined $n;
print "$_\t${a{$_}}\n" for sort keys %a;

Put the above script into a text file, make it executable with chmod and then it can be used as follows to generate a gene-wise count file

for bam in `ls *.bam | sed 's/.bam//'`dosort ${bam}.tmpcnt | perl > ${bam}_count.txt &donewait

Ok so now we are ready to join these files together to make the matrix. This can also be done a few ways, but I've chosen to use an awk one liner.

awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *count.txt \| tr ' '  '\t' | cut -f1,3- > matrix.tbl
You will then need to attach the sample names to the appropriate columns and now you have a full count matrix which is basically ready for statistical analysis:
Gene_Name                 Ctrl1 Ctrl2 Trt1 Trt2
ENSG00000266033.1_AL606519.1 0 0 0 0
ENSG00000165195.9_PIGA         39 12 43 33
ENSG00000250813.1_SERF1AP1 0 0 0 0
ENSG00000066813.10_ACSM2B 0 0 0 1
ENSG00000249719.1_RP3-486L4.3 12 48 16 8
ENSG00000266336.1_AL356019.1 0 0 0 0
ENSG00000228142.2_RP11-180I4.2 0 0 0 0
ENSG00000146966.8_DENND2A 105 97 77 99
ENSG00000254295.1_CTC-308K20.2 5 0 5 0

Although you may want to remove genes which do not have any reads, or remove genes which have fewer than a threshold number. We commonly remove genes which have fewer than 10 reads in each sample on average. This is straightforward for a simple 2 way comparison, but gets more complicated when dealing with more sample groups - for instance 2 treatment types in 2 cell lines.

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

A selection of useful bash one-liners

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc