Showing posts from August, 2013

A selection of useful bash one-liners

Here's a selection of one-liners that you might find useful in data analysis. I'll update this post regularly and hope you can share some of your own gems.

Compress an archive:
    tar cf backup.tar.bz2 --use-compress-prog=pbzip2 directory

Sync a folder of data to a backup:
    rsync -azhv /scratch/project1 /backup/project1/

Extract a tar archive:
    tar xvf backup.tar

Head a compressed file:
    bzcat file.bz2 | head
    pbzip2 -dc file.bz2 | head
    zcat file.gz | head

Uniq on one column (field 2):
    awk '!arr[$2]++' file.txt

Explode a file on the first field:
    awk '{print > $1}' file1.txt

Sum a column:
    awk '{ sum+=$1} END {print sum}' file.txt

Put spaces between every three characters:
    sed 's/\(.\{3\}\)/\1 /g' file.txt

Join 2 files based on field 1. Both files need to be properly sorted (use sort -k 1b,1):
    join -1 1 -2 1 file1.txt file2.txt

Join a bunch of files by field 1. Individual files don't need to be sorted but the
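
To make a few of the text-processing one-liners above concrete, here is a minimal sketch that runs them on tiny made-up inputs. The file names, the three-column table and the gene-like keys (g1, g2) are all hypothetical and exist only for illustration; only the awk, sed, sort and join commands themselves come from the list above.

```shell
# Hypothetical sample data: a made-up three-column table, not from the post.
tmp=$(mktemp -d)
printf '3 a x\n5 b y\n2 a z\n' > "$tmp/file.txt"

# Uniq on field 2: keep only the first line seen for each value of field 2.
uniq_out=$(awk '!arr[$2]++' "$tmp/file.txt")

# Sum column 1.
sum_out=$(awk '{ sum+=$1 } END { print sum }' "$tmp/file.txt")

# Insert a space after every three characters.
sed_out=$(printf 'ACGTACGTA\n' | sed 's/\(.\{3\}\)/\1 /g')

# Join on field 1 after sorting both inputs the way join expects.
printf 'g2 10\ng1 20\n' | sort -k 1b,1 > "$tmp/left.txt"
printf 'g1 up\ng2 down\n' | sort -k 1b,1 > "$tmp/right.txt"
join_out=$(join -1 1 -2 1 "$tmp/left.txt" "$tmp/right.txt")

printf '%s\n' "$uniq_out" "$sum_out" "$sed_out" "$join_out"
rm -rf "$tmp"
```

Note that the sed expression also appends a space after the final group when the line length is a multiple of three, so trailing whitespace may need trimming afterwards.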

Quick alignment of microRNA-Seq data to a reference

In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc). MicroRNA datasets are somewhat different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the script below that several steps are done without writing a single intermediate file:

- Adapter clipping
- Sequence collapsing
- BWA alignment
- Bam conversion
- Sorting bam file

    for FQ in *fq
    do
      fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
      | fastx_collapser \
      | bwa aln -t 30 $REF - \
      | bwa samse $REF - ${FQ} \
      | samtools view -uSh - \
      | samtools sort - ${FQ}.sort &
    done
    wait
    for bam in *bam ; do samtools index $bam & done ; wait
    for bam in *bam ; do samtools flagstat $bam > ${bam}.sta
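
The idea behind the collapsing step can be illustrated with standard tools. The sketch below is not fastx_collapser itself: it swaps in a plain sort and uniq -c pipeline on five made-up reads, and the rank-count naming of the output records is an assumption chosen for this example. It only shows how many identical reads reduce to a handful of distinct sequences, each carrying its count, before alignment.

```shell
# Hypothetical reads: five made-up sequences, three of them identical.
reads=$(printf '%s\n' \
  TGAGGTAGTAGGTTGTATAGTT \
  TGAGGTAGTAGGTTGTATAGTT \
  TAGCTTATCAGACTGATGTTGA \
  TGAGGTAGTAGGTTGTATAGTT \
  TAGCTTATCAGACTGATGTTGA)

# Count each distinct sequence, order by abundance (highest first), and
# emit FASTA records whose IDs are rank-count (an assumed naming scheme).
collapsed=$(printf '%s\n' "$reads" \
  | sort \
  | uniq -c \
  | sort -k1,1nr \
  | awk '{ printf ">%d-%d\n%s\n", NR, $1, $2 }')

printf '%s\n' "$collapsed"
```

Here five input reads become two records, so the aligner only has to place two sequences; the counts kept in the record IDs let the read abundance be recovered afterwards.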