Example blast workflow (nucleotide)

February 02, 2024

BLAST is a stalwart in the bioinformatics space. I've used it in multiple contexts and it is a good way to introduce students to bioinformatics principles and the process of pipeline development. Although there is a web interface, it is still good to use it locally if you need to run a large number of queries. As my group needed to use BLAST again this week, I thought I'd share a small example script which I shared with my Masters research students just getting into bioinformatics.

The script (shown below) downloads the E. coli gene coding sequences and then extracts by random a few individual sequences. These undergo random mutagenesis and then we can use the mutated sequences as a query to find the original gene with BLAST. The output format is tabular which suits downstream large scale data analysis. The script also includes steps for generating the blast index. The script is uploaded as a gist here. It requires prerequesites:

sudo apt install ncbi-blast+ emboss

unwrap_fasta.pl is available here.

---

#!/bin/bash

# Download

URL="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz"

# unzip

if [[ ! -r $FA ]] ; then

wget -N $URL

gunzip -kf $FA.gz

# extract a few sequences

# requires unwrap_fasta.pl

cut -d ' ' -f1 $FA \

| perl unwrap_fasta.pl - - \

| grep -v '>' | nl -n ln | sed 's/^/>/' | tr '\t' '\n' > sample.fa

# in case we need to reindex the db

if [[ ! -r $FA.ndb ]] ; then

formatdb -p F -o T -i $FA

# incorporate some mismatches

# it may generate some error output but actually works (check the output)

msbar -sequence sample.fa -count 100 -point 4 -block 0 -codon 0 -outseq sample_mutated.fa

# run the blastn

blastn -outfmt 6 -evalue 0.001 -db $FA -query sample_mutated.fa > blast_results.tsv

Search This Blog

Genome Spot

Example blast workflow (nucleotide)

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Generate an RNA-seq count matrix with featureCounts