Generate an RNA-seq count matrix with featureCounts

Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov.

FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files.

Here is a script to generate a gene-wise matrix from all BAM files in a directory.
#!/bin/bash
#Generate RNA-seq matrix
#Set parameters
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
GENEMX=${EXPTNAME}_genes.mx

#Make the gene-wise matrix
featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \
| cut -f1,7- | sed 1d > $GENEMX

The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc).

Consider attaching the gene name to give the data more relevance. To do that, first make a table of Ensembl IDs and gene names from the GTF file.
grep -w gene Mus_musculus.GRCm38.78.gtf | head -1000 \
| cut -d '"' -f2,6 | tr '"' '\t' | sort -k 1b,1 > ENS2genename.txt

Next, use unix join (or other method) to attach the gene name.
head -1 RNA-seq.mx > header.txt
sed 1d RNA-seq.mx | sort -k 1b,1 \
| join -1 1 -2 1 ENS2genename.txt - \
| tr ' ' '\t' | sed 's/\t/_/' \
| cat header.txt - > RNA-seq_genenames.mx

Data matrix should now show the gene name.
Geneid c1_RNA c2_RNA c3_RNA
ENSMUSG00000000001_Gnai3 674 517 115
ENSMUSG00000000003_Pbsn 0 0 0
ENSMUSG00000000028_Cdc45 52 49 58
ENSMUSG00000000031_H19 175 514 772
ENSMUSG00000000037_Scml2 44 19 7
ENSMUSG00000000049_Apoh 0 8 1
ENSMUSG00000000056_Narf 1570 1396 1027
ENSMUSG00000000058_Cav2 2013 1751 254
ENSMUSG00000000078_Klf6 3822 2470 883

If you're interested in exon specific expression and alternative splicing, include the "-f" switch that will output counts for each exon feature.
#Make the exon-wise matrix
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
EXONMX=${EXPTNAME}_exons.mx
featureCounts -Q $MAPQ -T $CPUS -f -a $GTF -o /dev/stdout/ *bam \
| cut -f-4,7- | sed 1d > $EXONMX

Comments

  1. Hi, great blog!

    I think you have a small typo / copy error. I believe a pipe "|" is missing before the cut command.

    #Make the gene-wise matrix
    featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout/ *bam \
    cut -f1,7- | sed 1d > $GENEMX

    ReplyDelete
  2. Thanks for the heads-up Ruben. All fixed now.

    ReplyDelete
  3. I want gene length in my file as well.. how can I do that ?

    ReplyDelete
    Replies
    1. Hi Talha, the gene coordinates are present in the GTF file if you run this command you can extract them (for mouse at least):
      grep -w gene Mus_musculus.GRCm38.78.gtf

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Thank you for your sharing, Mark!
    I'm new in bioinformatics and I have a question about your script.

    So in this script:

    #!/bin/bash
    #Generate RNA-seq matrix
    #Set parameters
    GTF=/path/to/Mus_musculus.GRCm38.78.gtf
    EXPTNAME=mouse_rna
    CPUS=8
    MAPQ=10
    GENEMX=${EXPTNAME}_genes.mx

    Which line is pointing to the directory where the bam files are stored in?

    ReplyDelete
    Replies
    1. This line:
      featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam

      It is basically assuming that all the bam files in the current directory should be included.

      Delete

Post a Comment

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Installing R-4.0 on Ubuntu 18.04 painlessly

EdgeR or DESeq2? Comparing the performance of differential expression tools