Extracting specific sequences from a big FASTA file
Say you have a huge FASTA file, such as a genome build or cDNA library. How do you quickly extract just one or a few desired sequences?
Use samtools faidx to extract a single FASTA entry
First index the file; after that, extraction is almost instantaneous.
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s
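To see what the index buys you, here is a minimal sketch of the unindexed alternative: an awk scan that walks the whole file looking for one header. The function name extract_entry and the toy file demo.fa are made up for illustration. On a genome-sized file this reads every line on every lookup, which is the cost samtools faidx avoids once the index exists.

```shell
# Hypothetical helper: print one FASTA entry by name using a linear scan.
# $1 = FASTA file, $2 = entry name (first word of the header, without '>').
extract_entry() {
    awk -v id="$2" '
        /^>/ { name = substr($1, 2); keep = (name == id) }  # new header: decide whether to print
        keep                                                # print lines of the wanted entry
    ' "$1"
}

# Toy input so the sketch can be run anywhere.
printf '>1 dna\nACGT\nGGCC\n>MT dna\nTTAA\nCCGG\n' > demo.fa

extract_entry demo.fa MT
```

This is fine for small files or one-off lookups, but it does not replace indexing for repeated queries against a large reference.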
Use bedtools getfasta to extract portions of a FASTA entry
This requires the regions of interest to be in BED format.
$ head Homo_sapiens.GRCh38_CpG_Islands.bed
1	10413	11207
1	28687	29634
1	51546	51882
1	121270	121549
The sequences of interest are extracted like this:
$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta
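One thing worth keeping in mind when preparing the BED file: BED intervals are zero-based and half-open, so the start column is the 0-based position of the first base and the end column is not included. The sketch below mimics that arithmetic with awk on a toy sequence string. It is only an illustration of the coordinate convention, not how bedtools is implemented, and the function name bed_slice is hypothetical.

```shell
# Hypothetical helper: slice a sequence string with BED-style coordinates
# (0-based start, end-exclusive), the convention bedtools getfasta follows.
# $1 = sequence, $2 = BED start, $3 = BED end
bed_slice() {
    awk -v seq="$1" -v s="$2" -v e="$3" \
        'BEGIN { print substr(seq, s + 1, e - s) }'  # awk strings are 1-based
}

bed_slice ACGTACGTAC 0 4    # first four bases
bed_slice ACGTACGTAC 4 10   # bases 5 through 10
```

So an interval with start 0 and end 4 yields four bases, and the length of any extracted region is simply end minus start.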
Make a BLAST database to access bunches of sequences quickly
Note: you will need to download and install the BLAST+ package from NCBI or via the Ubuntu Software Centre. It is compatible with protein and…