Extracting specific sequences from a big FASTA file

March 03, 2015

Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry

first index, then you can extract almost instantaneously.
$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s

Use bedtools getfasta extract portions of a FASTA entry

Requires the regions of interest be in BED format.

$ head Homo_sapiens.GRCh38_CpG_Islands.bed

1 10413 11207

1 28687 29634

1 51546 51882

1 121270 121549

The sequences of interest are extracted like this:

$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta

Make a blast database to access bunches of sequences quickly

Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre.

It is compatible with protein and DNA sequences, but you'll need to specify that accordingly.

$ makeblastdb -in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -input_type fasta -parse_seqids

Adding sequences from FASTA; added 194 sequences in 72.3594 seconds.

You can extract sequences singly:

makeblastdb

$ blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -entry MT

>lcl|MT dna_sm:chromosome chromosome:GRCh38:MT:1:16569:1 REF

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGC

GATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTA

Or in a batch using a file like this.

$ cat Contigs.txt

Then call the file like this.

$ time blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assbly.fa -dbtype nucl -entry_batch Contigs.txt

real 0m0.149s

Related info

http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/seqret.html
https://www.biostars.org/p/49820/
https://www.biostars.org/p/1195/
https://www.biostars.org/p/2822/

Search This Blog

Genome Spot