Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry 

first index, then you can extract almost instantaneously.
$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s

Use bedtools getfasta extract portions of a FASTA entry

Requires the regions of interest be in BED format.
$ head Homo_sapiens.GRCh38_CpG_Islands.bed
1 10413 11207
1 28687 29634
1 51546 51882
1 121270 121549

The sequences of interest are extracted like this:
$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta

Make a blast database to access bunches of sequences quickly

Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre.
It is compatible with protein and DNA sequences, but you'll need to specify that accordingly.
$ makeblastdb -in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -input_type fasta -parse_seqids
Adding sequences from FASTA; added 194 sequences in 72.3594 seconds.

You can extract sequences singly:
makeblastdb
$ blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -entry MT
>lcl|MT dna_sm:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGC
GATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTA

Or in a batch using a file like this.
$ cat Contigs.txt 
21
22
MT
Y

Then call the file like this.
$ time blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assbly.fa -dbtype nucl -entry_batch  Contigs.txt

real 0m0.149s

Related info
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/seqret.html
https://www.biostars.org/p/49820/
https://www.biostars.org/p/1195/
https://www.biostars.org/p/2822/

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?