Generate an RNA-seq count matrix with featureCounts
Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov.
FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files.
Here is a script to generate a gene-wise matrix from all BAM files in a directory.
#!/bin/bash
#Generate RNA-seq matrix
#Set parameters
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
GENEMX=${EXPTNAME}_genes.mx
#Make the gene-wise matrix
featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \
| cut -f1,7- | sed 1d > $GENEMX
The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc).
Consider attaching the gene name to give the data more relevance. To do that, first make a table of Ensembl IDs and gene names from the GTF file.
grep -w gene Mus_musculus.GRCm38.78.gtf | head -1000 \
| cut -d '"' -f2,6 | tr '"' '\t' | sort -k 1b,1 > ENS2genename.txt
Next, use unix join (or other method) to attach the gene name.
head -1 RNA-seq.mx > header.txt
sed 1d RNA-seq.mx | sort -k 1b,1 \
| join -1 1 -2 1 ENS2genename.txt - \
| tr ' ' '\t' | sed 's/\t/_/' \
| cat header.txt - > RNA-seq_genenames.mx
Data matrix should now show the gene name.
Geneid c1_RNA c2_RNA c3_RNA
ENSMUSG00000000001_Gnai3 674 517 115
ENSMUSG00000000003_Pbsn 0 0 0
ENSMUSG00000000028_Cdc45 52 49 58
ENSMUSG00000000031_H19 175 514 772
ENSMUSG00000000037_Scml2 44 19 7
ENSMUSG00000000049_Apoh 0 8 1
ENSMUSG00000000056_Narf 1570 1396 1027
ENSMUSG00000000058_Cav2 2013 1751 254
ENSMUSG00000000078_Klf6 3822 2470 883
If you're interested in exon specific expression and alternative splicing, include the "-f" switch that will output counts for each exon feature.
#Make the exon-wise matrix
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
EXONMX=${EXPTNAME}_exons.mx
featureCounts -Q $MAPQ -T $CPUS -f -a $GTF -o /dev/stdout/ *bam \
| cut -f-4,7- | sed 1d > $EXONMX
FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files.
Here is a script to generate a gene-wise matrix from all BAM files in a directory.
#!/bin/bash
#Generate RNA-seq matrix
#Set parameters
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
GENEMX=${EXPTNAME}_genes.mx
#Make the gene-wise matrix
featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \
| cut -f1,7- | sed 1d > $GENEMX
The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc).
Consider attaching the gene name to give the data more relevance. To do that, first make a table of Ensembl IDs and gene names from the GTF file.
grep -w gene Mus_musculus.GRCm38.78.gtf | head -1000 \
| cut -d '"' -f2,6 | tr '"' '\t' | sort -k 1b,1 > ENS2genename.txt
Next, use unix join (or other method) to attach the gene name.
head -1 RNA-seq.mx > header.txt
sed 1d RNA-seq.mx | sort -k 1b,1 \
| join -1 1 -2 1 ENS2genename.txt - \
| tr ' ' '\t' | sed 's/\t/_/' \
| cat header.txt - > RNA-seq_genenames.mx
Data matrix should now show the gene name.
Geneid c1_RNA c2_RNA c3_RNA
ENSMUSG00000000001_Gnai3 674 517 115
ENSMUSG00000000003_Pbsn 0 0 0
ENSMUSG00000000028_Cdc45 52 49 58
ENSMUSG00000000031_H19 175 514 772
ENSMUSG00000000037_Scml2 44 19 7
ENSMUSG00000000049_Apoh 0 8 1
ENSMUSG00000000056_Narf 1570 1396 1027
ENSMUSG00000000058_Cav2 2013 1751 254
ENSMUSG00000000078_Klf6 3822 2470 883
#Make the exon-wise matrix
GTF=/path/to/Mus_musculus.GRCm38.78.gtf
EXPTNAME=mouse_rna
CPUS=8
MAPQ=10
EXONMX=${EXPTNAME}_exons.mx
featureCounts -Q $MAPQ -T $CPUS -f -a $GTF -o /dev/stdout/ *bam \
| cut -f-4,7- | sed 1d > $EXONMX