Paired-end fastq quality control with Skewer

March 20, 2015

Quality trimming and adapter clipping of paired-end reads is a tricky business. For paired-end alignment to work, the order of the sequences in both files (forward and reverse) needs to be retained. While there are plenty of open-source read trimmers available (think FASTX-toolkit, SeqTK, CutAdapt, etc.), I haven't found many that:

Retain pairing info
Run parallel and are fast
Are easy to set up and run
Have good docs
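
To make the pairing requirement concrete, here is a minimal Python sketch that checks whether two FASTQ files still list the same reads in the same order. It is only an illustration and is not taken from any of the tools above. The function names are hypothetical, and it assumes well-formed four-line records whose headers look like @ID/1 and @ID/2, or @ID followed by a space and a comment.

```python
import gzip
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read ID from each 4-line FASTQ record in a text handle."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # Drop the leading "@" and anything after the first whitespace.
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            # Older headers mark the mate with a /1 or /2 suffix; strip it.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def pairs_in_sync(forward, reverse):
    """Return True if both handles list the same read IDs in the same order."""
    # zip_longest pads the shorter file with None, so a length mismatch fails.
    for fwd_id, rev_id in zip_longest(read_ids(forward), read_ids(reverse)):
        if fwd_id != rev_id:
            return False
    return True


def open_fastq(path):
    """Open a plain or gzipped FASTQ file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Tiny in-memory demo: the second reverse file has lost read r1.
fwd = "@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n"
rev_ok = "@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n"
rev_broken = "@r2/2\nGGGG\n+\nIIII\n"
print(pairs_in_sync(io.StringIO(fwd), io.StringIO(rev_ok)))
print(pairs_in_sync(io.StringIO(fwd), io.StringIO(rev_broken)))
```

On real data you would pass the two handles returned by open_fastq. A trimmer that filters each file independently tends to fail this check as soon as one mate of a pair is discarded.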

Then I fell in love with Skewer. It smashed through 14.5 million gzipped read pairs, doing both adapter clipping and quality trimming, in just over two minutes on my 8-core workstation. It auto-detects the quality encoding, so you can safely analyse any Illumina data. Awesome!
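
Quality-encoding detection can be approximated by looking at the range of ASCII codes in the quality lines. The Python sketch below uses a common heuristic and is not taken from Skewer's source. The function name is hypothetical, and the cut-offs assume that +33 data tops out around "J" and that +64 data never goes below ";".

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from an iterable of quality strings.

    Returns None when the observed range does not settle the question.
    """
    lowest, highest = 255, 0
    for qual in quality_strings:
        codes = qual.encode("ascii")
        if codes:
            lowest = min(lowest, min(codes))
            highest = max(highest, max(codes))
    if lowest > highest:
        return None  # no quality characters were seen at all
    if lowest < 59:
        return 33  # codes below ';' cannot occur with a +64 offset
    if lowest >= 64 and highest > 74:
        return 64  # nothing low, and codes above 'J' are unusual for +33
    return None  # ambiguous range, e.g. only a few high-quality reads sampled


print(guess_phred_offset(["IIIIHHG#", "IIIIIIII"]))
print(guess_phred_offset(["hhhhggfB", "hhhhhhhh"]))
```

Sampling a few thousand reads is usually enough for this kind of guess, but a small or unusually clean sample can land in the ambiguous range, which is why the sketch returns None rather than picking a side.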

$ skewer -t 8 -q 20 SRR634969.sra_1.fastq.gz SRR634969.sra_2.fastq.gz
Parameters used:
-- 3' end adapter sequence (-x): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- maximum error ratio allowed (-r): 0.100
-- maximum indel error ratio allowed (-d): 0.030
-- end quality threshold (-q): 20
-- minimum read length allowed after trimming (-l): 18
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- number of concurrent threads (-t): 8
Thu Mar 19 22:19:26 2015 >> started
|=================================================>| (100.00%)
Thu Mar 19 22:21:31 2015 >> done (124.590s)
14555046 read pairs processed; of these:
35892 ( 0.25%) short read pairs filtered out after trimming by size control
21534 ( 0.15%) empty read pairs filtered out after trimming by size control
14497620 (99.61%) read pairs available; of these:
5122437 (35.33%) trimmed read pairs available after processing
9375183 (64.67%) untrimmed read pairs available after processing
log has been saved to "SRR634969.sra_1.fastq-trimmed.log".
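
As a sanity check on the summary above, the counts and percentages can be recomputed from the numbers Skewer printed. This is a small Python sketch with the values copied by hand from the log. The variable names are my own, and the throughput figure is simply total pairs divided by the reported wall-clock time.

```python
# Values copied from the Skewer summary above.
total_pairs = 14555046
short_pairs = 35892
empty_pairs = 21534
available_pairs = 14497620
trimmed_pairs = 5122437
untrimmed_pairs = 9375183
elapsed_seconds = 124.590


def percent(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# The filtered pairs plus the available pairs should account for every input pair.
counts_add_up = short_pairs + empty_pairs + available_pairs == total_pairs
# Trimmed and untrimmed pairs together should equal the available pairs.
split_adds_up = trimmed_pairs + untrimmed_pairs == available_pairs

pairs_per_second = total_pairs / elapsed_seconds

print("counts add up:", counts_add_up and split_adds_up)
print("available (%):", percent(available_pairs, total_pairs))
print("trimmed (% of available):", percent(trimmed_pairs, available_pairs))
print("throughput (pairs/s):", round(pairs_per_second))
```

The trimmed and untrimmed percentages are relative to the pairs that survived filtering, not to the original input, which is easy to miss when reading the log.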
