Data set curation - quality trimming and adapter clipping

After demultiplexing (covered in the last post), a few more steps are needed before aligning Illumina sequence data to the reference genome, chiefly quality filtering and adapter clipping. These matter less for short reads, but they become important with longer reads, where base quality starts to drop off and the read may run into adapter sequence.

Quality filtering can be done in a few ways: discarding entire reads with poor base quality scores, converting poor-quality base calls to "N", or hard trimming every read to a fixed length, before the Q scores start to decline rapidly. I'd much rather use a quality-based trimming tool that starts at the 3' end of the read and removes bases below a certain Q threshold. This can be done with fastq_quality_trimmer from the FASTX-Toolkit, either on the command line or in Galaxy. You set the quality threshold (-t, here Q30) and the minimum length of sequence to keep (-l, here 37 nt); reads shorter than that after trimming are discarded.


fastq_quality_trimmer -t 30 -l 37 -i dataset.txt -o dataset_Q30.txt 
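To make the trimming logic concrete, here is a minimal Python sketch of 3' quality trimming for a single read. It reflects my understanding of the behaviour described above and is not the tool's actual code. The function name trim_3prime and its defaults are made up for illustration, and the default offset of 64 assumes older Illumina-style quality characters (pass offset=33 for Sanger-style files).

```python
def trim_3prime(seq, qual, threshold=30, min_length=37, offset=64):
    """Trim low-quality bases from the 3' end of one read.

    Hypothetical helper for illustration only.
    Returns (seq, qual) after trimming, or None if the trimmed read
    is shorter than min_length and should be discarded.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], qual[:end]


# With offset 64, "h" encodes Q40 and "B" encodes Q2.
print(trim_3prime("ACGTACGTAC", "hhhhhhhBBB", min_length=5))
```

Note that only the 3' tail is removed: a single poor base in the middle of an otherwise good read is left alone.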

To remove adapter sequence there are several software options (Trimmomatic and cutadapt, among others), but we again chose the FASTX-Toolkit for this example. Keep in mind that the adapter below is for the Illumina TruSeq genomic DNA kit; other library prep kits and sequencing platforms use different adapter sequences. The "-l 37" parameter is the length of the shortest read to keep, so any read shorter than this after clipping is discarded. You can tailor this to your specific needs.

fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATC -l 37 -i dataset_Q30.txt -o dataset_Q30_clip.txt
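As a rough illustration of what adapter clipping does to a read, here is a simplified Python sketch. It uses exact matching only, which is an assumption made for brevity; real clippers typically tolerate some mismatches. The names find_adapter and clip_adapter and the min_overlap setting are hypothetical.

```python
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATC"


def find_adapter(seq, adapter=ADAPTER, min_overlap=5):
    """Return the position where the adapter starts in seq, or None.

    Matches either the full adapter inside the read, or a prefix of the
    adapter (at least min_overlap long) running off the 3' end of the read.
    """
    for i in range(len(seq) - min_overlap + 1):
        tail = seq[i:i + len(adapter)]
        if adapter.startswith(tail):
            return i
    return None


def clip_adapter(seq, qual, adapter=ADAPTER, min_length=37):
    """Clip the adapter and everything after it; discard short reads."""
    pos = find_adapter(seq, adapter)
    if pos is not None:
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_length:
        return None  # too short to keep after clipping
    return seq, qual


insert = "ACGT" * 10                 # 40 nt of made-up genomic sequence
read = insert + ADAPTER[:10]         # read runs 10 nt into the adapter
print(clip_adapter(read, "h" * len(read)))
```

The min_overlap guard is a design choice: without it, a read that merely ends in "G" would be clipped, since that single base matches the start of the adapter.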


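One thing I need to mention is that the commands above work well for FASTQ files in the older Illumina format (quality scores encoded with an ASCII offset of 64), but may fail on Sanger-format FASTQ, which encodes quality scores with an offset of 33 and so uses different quality characters. This incompatibility has tripped up a lot of people in the forums, and it can be solved by adding -Q33 as an option to each command.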
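If you are not sure which encoding a file uses, a quick heuristic is to look at the lowest quality character in the first few thousand reads. The Python sketch below is one way to do that. The function names are hypothetical, it assumes plain 4-line FASTQ records, and it is only a guess: a file containing nothing but very high quality scores could be misclassified.

```python
from itertools import islice


def read_quality_lines(path, max_reads=1000):
    """Yield the quality string of the first max_reads FASTQ records."""
    with open(path) as handle:
        # Quality strings are the 4th line of each 4-line FASTQ record.
        for line in islice(handle, 3, max_reads * 4, 4):
            yield line.rstrip("\n")


def guess_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from observed quality characters."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Characters below ";" (ASCII 59) cannot occur in offset-64 files.
    return 33 if lowest < 59 else 64


print(guess_offset(["IIII#5", "FFFF"]))   # contains "#", so offset 33
print(guess_offset(["hhhhB", "ffff"]))    # nothing below "B", so offset 64
```

For a real file you would call guess_offset(read_quality_lines("dataset.txt")), and add -Q33 to the commands above if it returns 33.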

Now that the sequence data has been trimmed for bad bases and adapter contamination, we can start the analysis!
