Posts

Showing posts with the label BLAST

Quantifying repeat sequences from short read next gen sequencing data

Image
Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any sequence read that is shorter than the repeat unit is going to fail at being uniquely placed in the genome. At a given read length, genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and ignore reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? Why is that even relevant?

The "Why" About 50% of the human genome is comprised of repetitive DNA (Ref). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers (Ref). Endogenous retroviruses have been domesticated to boost antiviral response in B cells after exposure to T cel…

Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry  first index, then you can extract almost instantaneously.
$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real0m0.003s

Use bedtools getfasta extract portions of a FASTA entry Requires the regions of interest be in BED format. $ head Homo_sapiens.GRCh38_CpG_Islands.bed 11041311207 12868729634 15154651882 1121270121549
The sequences of interest are extracted like this: $ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta
Make a blast database to access bunches of sequences quickly Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre. It is compatible with protein and…