Demultiplexing Illumina Sequence data

November 08, 2012

Demultiplexing is a key step in many sequencing-based applications, but you don't always have to do it yourself, as the newer Illumina pipeline software provides demultiplexed data as standard. If you do need to do it yourself, here is an example using fastx_toolkit, designed for sequence data with a 6 nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (insert sequence):

FC123_1_1_sequence.txt

And for the barcode/index read:

FC123_1_2_sequence.txt

So here goes:

#Enter dataset parameters

FC='FC123 FC124'

LANES='1 2 3 4 5 6 7 8'

#Create the bcfile
echo 'BC01_ATCACG ATCACG
BC02_CGATGT CGATGT
BC03_TTAGGC TTAGGC
BC04_TGACCA TGACCA
BC05_ACAGTG ACAGTG
BC06_GCCAAT GCCAAT
BC07_CAGATC CAGATC
BC08_ACTTGA ACTTGA
BC09_GATCAG GATCAG
BC10_TAGCTT TAGCTT
BC11_GGCTAC GGCTAC
BC12_CTTGTA CTTGTA' > bcfile.txt

for flowcell in $FC
do
for lane in $LANES
do
paste ${flowcell}_${lane}_1_sequence.txt ${flowcell}_${lane}_2_sequence.txt \
| tr -d '\t' \
| fastx_barcode_splitter.pl --bcfile bcfile.txt --prefix ${flowcell}_${lane}_ --suffix .txt --eol &

done

done
wait
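
To see what the paste and tr step in the loop above actually produces, here is a minimal sketch on a single toy record. The file names, read names and sequences are made up for illustration, and the line-count check is my own addition rather than part of the script above.

```shell
# Toy read 1 and index read files (hypothetical names and contents)
tmp=$(mktemp -d)
printf '@read1/1\nACGTACGTAC\n+\nhhhhhhhhhh\n' > "$tmp/toy_1_sequence.txt"
printf '@read1/2\nATCACG\n+\ngggggg\n' > "$tmp/toy_2_sequence.txt"

# paste pairs lines purely by position, so both files need the same line count
n1=$(wc -l < "$tmp/toy_1_sequence.txt" | tr -d ' ')
n2=$(wc -l < "$tmp/toy_2_sequence.txt" | tr -d ' ')
if [ "$n1" -ne "$n2" ]; then
    echo "line counts differ: $n1 vs $n2" >&2
fi

# Join the two files side by side and delete the tab that paste inserts
merged=$(paste "$tmp/toy_1_sequence.txt" "$tmp/toy_2_sequence.txt" | tr -d '\t')
printf '%s\n' "$merged"
# prints:
# @read1/1@read1/2
# ACGTACGTACATCACG
# ++
# hhhhhhhhhhgggggg
```

The barcode ATCACG ends up at the end of the merged sequence line, with its quality values at the end of the quality line.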

So you can see that we start by pasting read 1 and the index read side by side, strip the tab that paste inserts, and pipe the result straight into the fastx_barcode_splitter.pl script, which demultiplexes the datasets by the 12 barcodes specified in the bcfile. The --eol option tells it to look for the barcode at the end of each sequence. If there are any lines missing from either read 1 or the index read, paste will pair up the wrong lines, so make sure both files have the same number of lines. The script runs each lane in parallel, so be sure to use a computer with plenty of processors. For example, in the above script, I've specified all 8 lanes on 2 flow cells, so I will be using 16 processors. OK, so we've demultiplexed, and now we need to trim off the 6 nt barcode.

for dataset in `ls FC*BC*.txt | sed 's/\.txt$//'`
do
fastx_trimmer -t 6 -i ${dataset}.txt -o ${dataset}_trim.txt &
done
wait

The fastx_trimmer will remove the 6 nt barcode from the end of each sequence and write the output to a file with the suffix "_trim.txt". The loop trims every file whose name starts with FC, contains BC and ends with .txt, which covers all the unambiguously demultiplexed datasets. Use caution with the "&": it sends every job into the background at once, so if you're not working on a big server, you might crash the computer.
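
If you are on a smaller machine, one way to avoid flooding it is to cap the number of simultaneous jobs instead of backgrounding everything with "&". The following is a rough sketch, not the method used above: it swaps the for loop for xargs with the -P option (available in GNU and BSD xargs, but not required by POSIX). The function name trim_all and the TRIMMER variable are hypothetical; TRIMMER only exists so that the trimming command can be swapped out for a dry run.

```shell
# trim_all MAXJOBS FILE...
# Trim 6 nt from the end of each FILE, running at most MAXJOBS jobs at once.
# TRIMMER is a hypothetical override; it defaults to fastx_trimmer.
# Assumes simple file names without spaces or quote characters.
trim_all() {
    maxjobs=$1
    shift
    printf '%s\n' "$@" \
        | sed 's/\.txt$//' \
        | xargs -P "$maxjobs" -I{} "${TRIMMER:-fastx_trimmer}" -t 6 -i {}.txt -o {}_trim.txt
}

# Usage (commented out so that pasting this block runs nothing):
# trim_all 4 FC*BC*.txt
```

xargs returns only once all jobs have finished, so no separate wait is needed here.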
