A selection of useful bash one-liners

August 25, 2013

Here's a selection of one-liners that you might find useful in data analysis. I'll update this post regularly, and I hope you can share some of your own gems.

Compress an archive:

tar cf backup.tar.bz2 --use-compress-prog=pbzip2 directory

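Sync a folder of data to a backup (the trailing slash on the source copies its contents into the destination instead of nesting a second project1 folder inside it):

rsync -azhv /scratch/project1/ /backup/project1/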

Extract a tar archive:

tar xvf backup.tar

Head a compressed file:

bzcat file.bz2 | head

pbzip2 -dc file.bz2 | head

zcat file.gz | head

Uniq on one column (field 2):

awk '!arr[$2]++' file.txt
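
To see what this does, here is a minimal sketch on made-up data. The sample lines and the variable names (tmp, uniq_out) are hypothetical and only there for illustration.

```shell
# Made-up sample: field 2 holds "a", "b", "b"
tmp=$(mktemp)
printf 'x1 a\nx2 b\nx3 b\n' > "$tmp"

# arr[$2]++ evaluates to 0 the first time a value appears, so the negation
# is true only for the first line carrying each field-2 value
uniq_out=$(awk '!arr[$2]++' "$tmp")
printf '%s\n' "$uniq_out"   # prints "x1 a" then "x2 b"
rm -f "$tmp"
```

Only the first line for each value survives, so the order of the input decides which line is kept.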

Explode a file on the first field:

awk '{print > $1}' file1.txt
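
A small sketch of what "explode" means here, using invented data in a temporary directory (the file names and variables are hypothetical). Each line is appended to a file named after its first field, so this is best run in a scratch folder.

```shell
# Work in a throwaway directory because awk creates one file per distinct field-1 value
tmpdir=$(mktemp -d)
printf 'chr1 100\nchr2 200\nchr1 300\n' > "$tmpdir/input.txt"

# The redirection target is the value of field 1, so lines land in files "chr1" and "chr2"
(cd "$tmpdir" && awk '{print > $1}' input.txt)

explode_chr1=$(cat "$tmpdir/chr1")
explode_chr2=$(cat "$tmpdir/chr2")
printf '%s\n' "$explode_chr1"   # the two chr1 lines
rm -rf "$tmpdir"
```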

Sum a column:

awk '{ sum+=$1} END {print sum}' file.txt

Put a space after every three characters:

sed 's/\(.\{3\}\)/\1 /g' file.txt
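
As a quick check, here is a sketch that assumes the intended pattern is a captured group of three characters written back with a space after it. The input string and the variable name spaced are made up.

```shell
# \(.\{3\}\) captures three characters; \1 writes them back followed by a space
spaced=$(printf 'ACGTACGT\n' | sed 's/\(.\{3\}\)/\1 /g')
printf '%s\n' "$spaced"   # prints "ACG TAC GT"
```

A leftover group shorter than three characters is left alone, while a line whose length is a multiple of three ends with a trailing space.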

Join 2 files based on field 1. Both files need to be properly sorted (use sort -k 1b,1)

join -1 1 -2 1 file1.txt file2.txt
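
The sort-then-join sequence might look like the sketch below. The two input files, their contents and the variable names are invented for illustration.

```shell
# Two made-up files keyed on field 1, deliberately written out of order
tmpdir=$(mktemp -d)
printf 'geneB 5\ngeneA 3\n' > "$tmpdir/f1.txt"
printf 'geneA chr1\ngeneB chr2\n' > "$tmpdir/f2.txt"

# Sort both files on field 1 (ignoring leading blanks) so join sees matching order
sort -k 1b,1 "$tmpdir/f1.txt" > "$tmpdir/f1.sorted"
sort -k 1b,1 "$tmpdir/f2.txt" > "$tmpdir/f2.sorted"

# Join on field 1 of each file
join_out=$(join -1 1 -2 1 "$tmpdir/f1.sorted" "$tmpdir/f2.sorted")
printf '%s\n' "$join_out"   # prints "geneA 3 chr1" then "geneB 5 chr2"
rm -rf "$tmpdir"
```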

Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sorted:

awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' file1.txt file2.txt ... fileN.txt

Find the lines shared by 2 files (assumes no line is repeated within either file; append | wc -l to count them):

sort file1 file2 | uniq -d

Alternative method to find the common lines (files need to be pre-sorted):

comm -12 file1 file2
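
Here is a small sketch comparing the two approaches on invented data; the file contents and variable names are hypothetical. It assumes neither file repeats a line internally, since uniq -d would otherwise report that repeat as shared.

```shell
# Two made-up unsorted files that share the lines "b" and "c"
tmpdir=$(mktemp -d)
printf 'c\na\nb\n' > "$tmpdir/file1"
printf 'b\nd\nc\n' > "$tmpdir/file2"

# Method 1: pool both files and keep lines that occur more than once
common_uniq=$(sort "$tmpdir/file1" "$tmpdir/file2" | uniq -d)

# Method 2: comm needs each input sorted first; -12 hides lines unique to either file
sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort "$tmpdir/file2" > "$tmpdir/file2.sorted"
common_comm=$(comm -12 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# Count the shared lines
shared_count=$(printf '%s\n' "$common_comm" | wc -l)
printf '%s\n' "$common_comm"
rm -rf "$tmpdir"
```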

Add a header to a file

sed -e '1i\HeaderGoesHere' originalFile

Extract every 4th line starting at the second line (extract the sequence from fastq)

sed -n '2~4p' file.txt
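
The first~step address is a GNU sed extension, so the sketch below also shows an awk equivalent for systems without GNU sed. The two-read fastq snippet and the variable names are made up for illustration.

```shell
# A made-up fastq snippet with two reads (4 lines per read)
tmp=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmp"

# GNU sed: start at line 2 and print every 4th line after it
seqs_sed=$(sed -n '2~4p' "$tmp")

# Portable alternative: print lines whose number leaves remainder 2 when divided by 4
seqs_awk=$(awk 'NR % 4 == 2' "$tmp")

printf '%s\n' "$seqs_sed"   # prints "ACGT" then "TTGA"
rm -f "$tmp"
```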

Find the most common strings in column 2:

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
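
A minimal sketch of the pipeline on an invented tab-separated file (the data and variable names are hypothetical). The final awk step is my addition, only there to strip the padding that uniq -c puts in front of the counts.

```shell
# Made-up tab-separated data; column 2 holds apple x3, pear x2, fig x1
tmp=$(mktemp)
printf 'id1\tapple\nid2\tpear\nid3\tapple\nid4\tapple\nid5\tpear\nid6\tfig\n' > "$tmp"

# cut picks column 2, sort groups identical strings, uniq -c counts each group,
# and sort -k1nr orders the counts from highest to lowest
top_strings=$(cut -f2 "$tmp" | sort | uniq -c | sort -k1nr | head | awk '{print $1, $2}')
printf '%s\n' "$top_strings"   # prints "3 apple", "2 pear", "1 fig"
rm -f "$tmp"
```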

Randomise lines in a file

shuf file.txt

Generate a list of random numbers (integers)

for i in {1..50} ; do echo $RANDOM ; done

Find the lines in file2 that contain any of the strings listed in file1:

grep -Ff file1 file2

Print lines which contain string1 or string2

grep -E 'string1|string2|stringN' file.txt

Count the number of "X" characters per line:

n=0; while IFS= read -r line; do n=$((n + 1)); printf '%s ' "$n"; printf '%s' "$line" | tr -cd "X" | wc -c; done < file.txt

Count the length of strings in field 2:

awk '{print length($2)}' file1

Print all possible 3mer DNA sequence combinations

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}

Filter reads with SamTools

samtools view -f 4 file.bam > unmapped.sam

samtools view -F 4 file.bam > mapped.sam

samtools view -f 2 file.bam > mappedPairs.sam

Compress a bunch of folders full of data

for DIR in */ ; do DIR=${DIR%/} ; zip -r "$DIR.zip" "$DIR/" ; done

Get a year-month-day:hour:minute:second timestamp from Unix "date":

DATE=$(date +%Y-%m-%d:%H:%M:%S)

GNU Parallel has many interesting uses.

Compute md5 checksums of fastq files (to compare against the expected values):

ls *fastq.gz | parallel md5sum {} > checksums.txt

Index bam files

parallel samtools index {} ::: *bam

Count the number of lines in each bed file:

parallel "wc -l {}" ::: *bed

Fix Bedtools "Error: malformed BED entry at line 1. Start was greater than end. Exiting." by reorganising bed coordinates

awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}' file.bed > file_fixed.bed
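
To illustrate, here is a sketch on a made-up two-line BED file where the first entry has its start and end reversed. The coordinates and variable names are invented.

```shell
# Made-up BED lines: the first has start (200) greater than end (100)
tmp=$(mktemp)
printf 'chr1\t200\t100\nchr2\t50\t80\n' > "$tmp"

# Swap columns 2 and 3 when end < start; other lines pass through unchanged.
# Note that a swapped line is rebuilt from its first three columns only.
fixed_bed=$(awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}' "$tmp")
printf '%s\n' "$fixed_bed"
rm -f "$tmp"
```

If the BED file carries extra columns such as name, score or strand, they are dropped from the swapped lines, so this suits three-column files best.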

Fix a MACS peak BED file that contains negative coordinates

awk 'BEGIN{OFS="\t"} {if ($2<1) print $1,1,$3 ; else print $0 }' macs_peaks.bed > macs_peaks_fixed.bed

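Recursively replace spaces with underscores in file and directory names (uses the Perl rename; -depth renames the contents of a directory before the directory itself):

find . -depth -name "* *" -execdir rename 's/ /_/g' {} +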

Generate md5 checksums for directory of files in parallel

parallel md5sum ::: * > checksums.md5

Aggregate: Sum column 2 values based on column 1 string

awk '{array[$1]+=$2} END { for (i in array) {print i, array[i]}}' file.tsv
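
A small sketch of the aggregation on invented data (the gene names, values and variable names are hypothetical). Since awk's for-in loop returns keys in no guaranteed order, the sketch pipes the result through sort to make it reproducible.

```shell
# Made-up input: geneA appears twice (1 and 3), geneB once (2)
tmp=$(mktemp)
printf 'geneA 1\ngeneB 2\ngeneA 3\n' > "$tmp"

# array[$1] accumulates the column-2 values for each distinct column-1 string
agg_out=$(awk '{array[$1]+=$2} END { for (i in array) {print i, array[i]}}' "$tmp" | sort)
printf '%s\n' "$agg_out"   # prints "geneA 4" then "geneB 2"
rm -f "$tmp"
```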

Validate files by comparing checksums in parallel

cat checksums.md5 | parallel --pipe -N1 md5sum -c

Find your own public IP address:

dig +short myip.opendns.com @resolver1.opendns.com

Find the IP address of a webpage:

dig +short www.example.com

Find other computers on the local network connected by ethernet:

arp-scan --interface=eth0 --localnet

Find your own device's MAC address:

ifconfig -a | grep -Po 'HWaddr \K.*$'
