Fastq to tabular - filter reads on length

The fastq sequence format is a little awkward because each record spans four lines (read name, sequence, separator, quality). This structure is not ideal for analysis with standard unix tools, which are geared towards tab-delimited data with one entry per line. One way around this is to "flatten" the file so that each read occupies a single line, with the read name, sequence, separator and quality string separated by tabs. A neat trick with the unix paste command does exactly this. Try it with this command:

paste - - - - < sequence.fq | head
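To see what paste is doing, here is a minimal sketch using a made-up single-read record (the read name, sequence and quality values are invented for illustration); the four lines of the record come out as four tab-separated columns:

printf '@read1\nACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIII\n' | paste - - - -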
If we then want to do something like filter reads based on sequence length, the first step is to prepend the length as an extra field with a perl (or awk, for that matter) one-liner.
paste - - - - < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'
This outputs the length of the sequence string as the first field of each line.
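For completeness, a roughly equivalent awk version (an alternative sketch, not part of the original recipe) splits on tabs so the sequence is always the second field, even if the read name contains spaces:

paste - - - - < sequence.fq | awk -F'\t' -v OFS='\t' '{print length($2), $0}'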

If we want to extract only reads that are 18 nt or longer, we can filter on this length field with awk and then remove the field again with cut.
paste - - - - < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' | awk '$1 >= 18' | cut -f2-
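If the length field is only needed for filtering, the perl step can be skipped altogether; a single awk filter on the flattened file does the same job (a minimal sketch assuming the same sequence.fq input, writing to a hypothetical filtered.fq and restoring the four-line format with tr):

paste - - - - < sequence.fq | awk -F'\t' 'length($2) >= 18' | tr '\t' '\n' > filtered.fq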
Another example, originally by Torsten Seemann, uses the same trick to sort a fastq file by sequence read length (link); the data is then "unflattened" back into fastq format with tr.
paste - - - - < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' | awk '$1 >= 18' | sort -n | cut -f2- | tr '\t' '\n'
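To reproduce just the sorting behaviour without the length cutoff, drop the awk filter and make the numeric sort key explicit; this sketch assumes the output goes to a hypothetical sorted.fq:

paste - - - - < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' | sort -k1,1n | cut -f2- | tr '\t' '\n' > sorted.fq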
Thanks to the Genome Factory!
