Posts

Showing posts with the label Samtools

What is a CpG shore and how to I get them all?

Image
CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development (see reference).
To identify CpG shores, you first need a list of coordinates of CpG islands. If you have a linux computer with Emboss tools installed, you can determine CpG island positions with the cpgplot command. (if you don't have Emboss installed you can get CpG island coordinates from the UCSC table browser just be wary of chromosome naming ie, "chr1" vs "1")

cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout genome.fa | head

Will produce a file like this one:

##gff-version 3
##sequence-region 1 1 59999934
#!Date 2015-04-10
#!Type DNA
#!Source-version EMBOSS 6.6.0.0
1cpgreportsequence_feature10469112401299+.ID=1.1
1cpgreportsequence_feature112841130137+.ID=1.2
1cpgreportsequence_feature113431134732+.ID=1.3
1cpgreportsequence_feature114041140517+.ID=1.4
1…

Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry  first index, then you can extract almost instantaneously.
$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real0m0.003s

Use bedtools getfasta extract portions of a FASTA entry Requires the regions of interest be in BED format. $ head Homo_sapiens.GRCh38_CpG_Islands.bed 11041311207 12868729634 15154651882 1121270121549
The sequences of interest are extracted like this: $ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta
Make a blast database to access bunches of sequences quickly Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre. It is compatible with protein and…

Sambamba vs Samtools

Samtools has been the one main tools for reading and writing aligned NGS data since the SAM alignment format was initially proposed. With the amounts of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor to Samtools called Sambamba, so I'll just do a few quick tests to see how it stacks up to Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have an 1.8 GB bam file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert to unsorted sam format. $ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam

The sam file is 9.9 GB. Lets try 1-thread SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s

Screening RNA-seq data for novel transcripts with samtools mpileup and UCSC genome browser

Image
The unbiased nature of deep transcript sequencing makes it the ideal technology to discover novel uncharacterised genes. Lets screen our favourite RNA-seq experiment (azacitidine-treated AML3 human cells, GSE55123) for novel expressed genes. I use Ensembl gene annotations.

We'll start with preparing bed files of exons and gene bodies

grep -w exon Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 | uniq > Homo_sapiens.GRCh38.76_exons.bed

grep -w gene Homo_sapiens.GRCh38.76.gtf | tr '" ' '\t' \
| cut -f1,4,5,11 > Homo_sapiens.GRCh38.76_genes.bed

Now for convenience, I'll merge the data from the three replicates with samtools.

samtools view -H SRR1171523_1.fastq.sort.bam > header.txt

samtools merge -h header.txt Ctrl.bam SRR1171523_1.bam SRR1171523_2.bam SRR1171524_1.bam SRR1171524_2.bam SRR1171525_1.bam SRR1171525_2.bam

samtools merge -h header.txt Aza.bam SRR1171526_1.bam SRR1171526_2.bam SRR1171527_1.bam SRR1171527_2.bam S…

Data analysis step 4: Count aligned reads and create count matrix

Image
In this RNA-seq analysis series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded, done QC and aligned the reads. In this post, we will count the reads aligning to exons and generate a count matrix for statistical analysis.

My overall strategy is to use bedtools to count reads over exons. To do that, I'll need a bed file containing all exons. I could use UCSC genes, but I'd rather use the latest Ensembl version.

Create exon bed file  To begin, we need a list of known exons. I will use the latest Ensembl human gene annotation gtf file from their ftp server. Then extract the exon entries with grep and then rearrange the gene accession and symbol information into a temporary bed file. The temporary bed file is exploded on gene name and then overlapping exons are merged with bedtools.

#!/bin/bash
wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr &#…

Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers.

To get the average length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m

To get the shortest length
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1

To get the longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' …

BWA alignment to a genome - single ends

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing, quality trimming and adapter clipping. I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here), but currently the most popular choices are BWA and Bowtie2. BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with Samtools. The things you will need in the otherwise empty directory are:

Reference genome in fasta formatSequence data in fastq format (.fq) First thing to do is give the experiment a name and index the genome with BWA. The index step could take about 3 hours for a human size genome.
EXID=Experimentname
REF=Hg19.fa
bwa index -a bwtsw $REF Next, use BWA to align. All of our fastq data sets hav…