Demultiplexing Illumina Sequence data

Demultiplexing is a key step in many sequencing-based applications, but it isn't always necessary, as the newer Illumina pipeline software provides demultiplexed data as standard. If you need to do it yourself, here is an example using fastx_toolkit, designed for sequence data with a 6nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (insert sequence):
FC123_1_1_sequence.txt
And for the barcode/index read:
FC123_1_2_sequence.txt
So here goes:
#Enter dataset parameters 
FC='FC123 FC124'
LANES='1 2 3 4 5 6 7 8'
#Create the bcfile
echo 'BC01_ATCACG     ATCACG
BC02_CGATGT     CGATGT
BC03_TTAGGC     TTAGGC
BC04_TGACCA     TGACCA
BC05_ACAGTG     ACAGTG
BC06_GCCAAT     GCCAAT
BC07_CAGATC     CAGATC
BC08_ACTTGA     ACTTGA
BC09_GATCAG     GATCAG
BC10_TAGCTT     TAGCTT
BC11_GGCTAC     GGCTAC
BC12_CTTGTA     CTTGTA' > bcfile.txt
for flowcell in $FC
do
for lane in $LANES
do
paste ${flowcell}_${lane}_1_sequence.txt ${flowcell}_${lane}_2_sequence.txt \
| tr -d '\t' \
| fastx_barcode_splitter.pl --bcfile bcfile.txt --prefix ${flowcell}_${lane}_ --suffix .txt --eol &
done
done
wait
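If you want to see what the paste and tr step actually produces before running it on a whole flow cell, here is a small sketch on made-up data. The file names, read name and sequences are invented for illustration, and the line-count check is my own addition rather than part of the script above.

```shell
# Toy FASTQ records in the old Illumina style (made-up data, one read each).
tmp=$(mktemp -d)
printf '@r1/1\nACGTACGTAA\n+r1/1\nhhhhhhhhhh\n' > "$tmp/read1.txt"
printf '@r1/2\nATCACG\n+r1/2\nhhhhhh\n' > "$tmp/index.txt"

# Both files must have the same number of lines, or paste pairs the wrong rows.
n1=$(($(wc -l < "$tmp/read1.txt")))
n2=$(($(wc -l < "$tmp/index.txt")))
if [ "$n1" -ne "$n2" ]
then
    echo "line counts differ: $n1 vs $n2" >&2
fi

# Join the files line by line, then delete the tab that paste inserts.
joined=$(paste "$tmp/read1.txt" "$tmp/index.txt" | tr -d '\t')
printf '%s\n' "$joined"
rm -r "$tmp"
```

The second line of the output is ACGTACGTAAATCACG: the 10 nt read followed by the 6 nt barcode. That is why the splitter is run with --eol, and why the barcode later has to be trimmed from the end of the read.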
So you can see that we start by pasting read 1 and the index read side by side, removing the tab so the barcode sits at the end of each read, and piping that straight into the fastx_barcode_splitter script, which demultiplexes the datasets by the 12 barcodes specified in the bcfile. If there are any lines missing from either read 1 or the index read, the two files will fall out of step and reads will be paired with the wrong barcodes. Each lane runs in parallel, so be sure to use a computer with plenty of processors. For example, in the above script I've specified all 8 lanes on 2 flow cells, so I will be using 16 processors. OK, so we've demultiplexed, and now we need to trim off the 6nt barcode.
for dataset in `ls FC*BC*.txt | sed 's/\.txt$//'`
do
fastx_trimmer -t 6 -i ${dataset}.txt -o ${dataset}_trim.txt &
done
wait
The fastx_trimmer will remove the last 6 nt (the barcode) from the end of each read and write the output to a file with the suffix "_trim.txt". The loop trims all the files which start with FC, contain BC and end with .txt, which is all the unambiguously demultiplexed datasets. Use caution with the "&": it sends every job into the background at once (up to 192 here, for 16 lanes with 12 barcodes each), so if you're not working on a big server you might crash the computer.
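If you don't have that many processors, one option is to run the trimming jobs in batches. This is only a minimal sketch and not part of the workflow above: trim_all is a made-up helper name, and the batch size of 4 in the example call is just an assumption, so set it to whatever your machine can handle.

```shell
# trim_all is a hypothetical helper: trim_all MAX_JOBS file1.txt file2.txt ...
# It starts at most MAX_JOBS fastx_trimmer jobs, waits for that batch, then continues.
trim_all() {
    max_jobs=$1
    shift
    running=0
    for f in "$@"
    do
        fastx_trimmer -t 6 -i "$f" -o "${f%.txt}_trim.txt" &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]
        then
            wait          # block until the current batch has finished
            running=0
        fi
    done
    wait                  # catch the last, possibly partial, batch
}

# Example call (assumed batch size of 4):
# trim_all 4 FC*BC*.txt
```

Batching like this is simple but a little wasteful, because each batch waits for its slowest job before the next one starts. For files of similar size that shouldn't matter much.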

