### Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers.

To get the average length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m

To get the shortest length

samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1

To get the longest

samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | tail -1

To get the shortest and longest

To get the average length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length

samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m

To get the shortest length

samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1

To get the longest

samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | tail -1

To get the shortest and longest

samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | sed -n '1p;$p'

To get a full distribution

samtools view FC095_1.bam | cut -f10 | head -100 | awk '{ print length}' | sort | uniq -c | sed 's/^[ \t]*//' | tr ' ' '\t'

Further reading