Extracting specific sequences from a big FASTA file
Say you have a huge FASTA file such as genome build or cDNA library, how to you quickly extract just one or a few desired sequences?
$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s
https://www.biostars.org/p/49820/
https://www.biostars.org/p/1195/
https://www.biostars.org/p/2822/
Use samtools faidx to extract a single FASTA entry
first index, then you can extract almost instantaneously.$ samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s
Use bedtools getfasta extract portions of a FASTA entry
Requires the regions of interest be in BED format.
$ head Homo_sapiens.GRCh38_CpG_Islands.bed
1 10413 11207
1 28687 29634
1 51546 51882
1 121270 121549
The sequences of interest are extracted like this:
$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta
Make a blast database to access bunches of sequences quickly
Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre.
It is compatible with protein and DNA sequences, but you'll need to specify that accordingly.
$ makeblastdb -in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -input_type fasta -parse_seqids
Adding sequences from FASTA; added 194 sequences in 72.3594 seconds.
You can extract sequences singly:
makeblastdb
$ blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -entry MT
>lcl|MT dna_sm:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGC
GATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTA
Or in a batch using a file like this.
$ cat Contigs.txt
21
22
MT
Y
Then call the file like this.
$ time blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assbly.fa -dbtype nucl -entry_batch Contigs.txt
real 0m0.149s
Related info
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/seqret.htmlhttps://www.biostars.org/p/49820/
https://www.biostars.org/p/1195/
https://www.biostars.org/p/2822/