Example blast workflow (nucleotide)

BLAST is a stalwart in the bioinformatics space. I've used it in multiple contexts and it is a good way to introduce students to bioinformatics principles and the process of pipeline development. Although there is a web interface, it is still good to use it locally if you need to run a large number of queries. As my group needed to use BLAST again this week, I thought I'd share a small example script which I shared with my Masters research students just getting into bioinformatics.

The script (shown below) downloads the E. coli gene coding sequences and then extracts by random a few individual sequences. These undergo random mutagenesis and then we can use the mutated sequences as a query to find the original gene with BLAST. The output format is tabular which suits downstream large scale data analysis. The script also includes steps for generating the blast index. The script is uploaded as a gist here. It requires prerequesites:

sudo apt install  ncbi-blast+ emboss

unwrap_fasta.pl is available here

---

#!/bin/bash
# Download
URL="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz"
# unzip
if [[ ! -r  $FA ]] ; then
  wget -N $URL
  gunzip -kf $FA.gz
fi
# extract a few sequences
# requires unwrap_fasta.pl
cut -d ' ' -f1 $FA \
| perl unwrap_fasta.pl - - \
| paste - - | shuf | head -100 | tr '\t' '\n' | tee sample_named.fa \
| grep -v '>' | nl -n ln | sed 's/^/>/' | tr '\t' '\n' > sample.fa
# in case we need to reindex the db
if [[ ! -r $FA.ndb ]] ; then
  formatdb -p F -o T -i $FA
fi
# incorporate some mismatches
# it may generate some error output but actually works (check the output)
msbar -sequence sample.fa -count 100 -point 4 -block 0   -codon 0  -outseq sample_mutated.fa
# run the blastn
blastn -outfmt 6 -evalue 0.001 -db $FA -query sample_mutated.fa > blast_results.tsv

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Two subtle problems with over-representation analysis

Uploading data to GEO - which method is faster?