Screening RNA-seq data for novel transcripts with samtools mpileup and UCSC genome browser

The unbiased nature of deep transcript sequencing makes it the ideal technology to discover novel uncharacterised genes. Lets screen our favourite RNA-seq experiment (azacitidine-treated AML3 human cells, GSE55123) for novel expressed genes. I use Ensembl gene annotations.

We'll start with preparing bed files of exons and gene bodies

grep -w exon Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 | uniq > Homo_sapiens.GRCh38.76_exons.bed

grep -w gene Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 > Homo_sapiens.GRCh38.76_genes.bed

Now for convenience, I'll merge the data from the three replicates with samtools.

samtools view -H SRR1171523_1.fastq.sort.bam > header.txt

samtools merge -h header.txt Ctrl.bam SRR1171523_1.bam SRR1171523_2.bam SRR1171524_1.bam SRR1171524_2.bam SRR1171525_1.bam SRR1171525_2.bam

samtools merge -h header.txt Aza.bam SRR1171526_1.bam SRR1171526_2.bam SRR1171527_1.bam SRR1171527_2.bam SRR1171528_1.bam SRR1171528_2.bam

Now we'll use samtools mpileup to screen transcript coverage across the whole genome using a relatively high mapping quality threshold of 30 to eliminate mapping artifacts. I'm using awk to identify regions with >60x coverage, then slopBed to expand these by 20 bp in either direction, then intersectBed to determine whether these regions are exonic and then closestBed to determine the distance to the closest gene (zero if inside a known gene).

Before we start, we better generate the chromosome lengths file and specify the bed files we just generated.

FA=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
FAI=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fai
CHR=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.g
EXONBED=Homo_sapiens.GRCh38.76_exons.bed
GENEBED=Homo_sapiens.GRCh38.76_genes.bed

samtools faidx $FA
cut -f-2 $FAI > $CHR

samtools mpileup -q 30 -f $FA Ctrl.bam Aza.bam -r $chr \
| awk '$4+$7>60' | cut -f-2 | awk '{OFS="\t"} {print $1,$2,$2}' \
| bedtools slop -b 20 -i - -g $CHR | sortBed | mergeBed -i - \
| intersectBed -wao -a - -b $EXONBED | cut -f-3,7 | uniq \
| sed 's/\t/:/' | sed 's/\t/-/' | awk '!arr[$1]++' \
| tr ': -' '\t' | sed 's/\t$/\t\./' | cut -f-4 \
| closestBed -d -a - -b $GENEBED | cut -f-4,8- > AML3_Aza_loci.bed

Now we an have a look at the regions identified
Overlapping known exons 84,242  ( awk '$4!="."' AML3_Aza_loci.bed | wc -l)
Intronic incl. antisense 10,079  (awk '$4=="." && $6==0' AML3_Aza_loci.bed | wc -l)
Outside known genes 1,031 (awk '$4=="." && $6>0' AML3_Aza_loci.bed | wc -l)

So lets take a look at a few of the highest expressed novel genes. In the awk filtering $4=="." filters for expressed islands not overlapping known exons and $6>0 filters for islands outside of known genes. Here I use bedtools multicov to count read coverage in the two conditions and estimate a fold change.

awk '{OFS="\t"} $4=="." && $6>0 {print $1,$2,$3,$4"_"$5"_"$6"_"$3-$2}' AML3_Aza_loci.bed | bedtools multicov -q 30 -bed - -bams Ctrl.bam Aza.bam | awk '{print $0,$6/$5}' | sort -k6gr | head -50
*Fold change is an estimate as these figures are not library size normalised by they are roughly
similar (70M in ctrl and 68M in Aza-treated)

Some of those look closeby on (chr6:156776871-170165539) so lets use UCSC to visualise them. First, I'll use mpileup to calculate coverage at each position. The catch is that introns are considered coverage by mpileup (these are defined by < and  > chars) and we need to subtract those from actual exonic coverage. I'll also generate intron coverage to visualise stacks of informative spliced reads.

R=6:156776871-170165539

samtools mpileup -q 10 -f $FA Ctrl.bam -r $R \
| cut -f1,2,4,5 | tr '<>' '@' \
| awk -F'|' '{print $1,$2,$3, gsub(/@/,"")}' \
| awk '{OFS="\t"} $3-$5>0 {print "chr"$1,$2,$2+1,$3-$5}' > Ctrl.wig

samtools mpileup -q 10 -f $FA Aza.bam -r $R \
| cut -f1,2,4,5 | tr '<>' '@' \
| awk -F'|' '{print $1,$2,$3, gsub(/@/,"")}' \
| awk '{OFS="\t"} $3-$5>0 {print "chr"$1,$2,$2+1,$3-$5}' > Aza.wig

Here is an example of the mpileup output

##Now intron spanning depth
samtools mpileup -q 10 -f $FA Aza.bam -r $R \
| cut -f1,2,4,5 | tr '<>' '@' \
| awk -F'|' '{print $1,$2,$3, gsub(/@/,"")}' \
| awk '{OFS="\t"} $5>0 {print "chr"$1,$2,$2+1,$5}' > Aza_intron.wig

samtools mpileup -q 10 -f $FA Ctrl.bam -r $R \
| cut -f1,2,4,5 | tr '<>' '@' \
| awk -F'|' '{print $1,$2,$3, gsub(/@/,"")}' \
| awk '{OFS="\t"} $5>0 {print "chr"$1,$2,$2+1,$5}' > Ctrl_intron.wig

For this to work at the UCSC browser, we need to add the header to the wig file just like I have below (using nano, gedit or other editor), save and gzip the wig.

track type=wiggle_0 name=Aza
variableStep chrom=chr6
chr6 156776876 156776877 1
chr6 156776877 156776878 1
chr6 156776878 156776879 2
chr6 156776879 156776880 2
chr6 156776880 156776881 3
chr6 156776881 156776882 3
chr6 156776882 156776883 3
chr6 156776883 156776884 3

gzip Ctrl.wig Aza.wig

Go to UCSC genome browser and select the species and be sure to select correct build version. Now I've spotted an interesting region with ~100x coverage that's downstream of the
QKI gene.

A closer look confirms that it is outside of described lncRNA RP1-230L10.1 and not overlapping nearby GenBank ESTs.

Taking a look at the profile of splices, we can see that this novel region does not share splice sites with QKI, rather many appear to be shared with the nearby lncRNA.

This method can be also used to identify novel transcripts that are differentially expressed, just by filtering the output from bedtools multicov. Keep in mind that this is exploratory and that a robust DGE analysis should involve specific software (edgeR/DESEq) and include all other genes and each sample individually rather than pooled.

awk '{OFS="\t"} $4=="." && $6>0 {print $1,$2,$3,$4"_"$5"_"$6"_"$3-$2}' AML3_Aza_loci.bed \
| bedtools multicov -q 30 -bed - -bams Ctrl.bam Aza.bam \
| awk '{OFS="\t"} $6>50 && $6/$5>2 {print $0,$6/$5}'


In summary, samtools mpileup and bedtools can identify transcribed regions not annotated as known exons. The output of mpileup can be used to make coverage plots when introns are correctly handled and then these data can be viewed on the UCSC browser. This isn't an exhaustive gene hunting exercise, I can imaging gene annotation requiring several criteria such as detection in multiple cell lines/tissues/experments, chromatin marks that characterise promoters and potentially sequence conservation. The relatively new "directional" or "stranded" RNA-seq methods could also distinguish between antisense transcripts effectively. Also note that there are a range of other genome browser type software out there, most notably IGV, but we'll cover that another time.

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?