Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers.

To get the average length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m

To get the shortest length
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1

To get the longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | tail -1

To get the shortest and longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | sed -n '1p;$p'

To get a full distribution
samtools view FC095_1.bam | cut -f10 | head -100 | awk '{ print length}' | sort | uniq -c | sed 's/^[ \t]*//' | tr ' ' '\t'

Further reading

Popular posts from this blog

Two subtle problems with over-representation analysis

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?