A selection of useful bash one-liners

Here's a selection of one-liners that you might find useful in data analysis. I'll update this post regularly, and I hope you can share some of your own gems.

Create a compressed archive (pbzip2 compresses in parallel):
tar cf backup.tar.bz2 --use-compress-program=pbzip2 directory
Extract a tar archive:
tar xvf backup.tar
Head a compressed file:
bzcat file.bz2 | head
pbzip2 -dc file.bz2 | head
zcat file.gz | head
Uniq on one column (field 2):
awk '!arr[$2]++' file.txt
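To see what the `!arr[$2]++` idiom keeps, here is a small sketch on made-up data (the file name demo_uniq.txt and its contents are invented for illustration). A line is printed only the first time its field 2 value is seen, so the input order is preserved and no sorting is needed.

```shell
# Hypothetical sample data: three lines, two of which share "apple" in field 2.
tmpdir=$(mktemp -d)
printf 'r1 apple\nr2 pear\nr3 apple\n' > "$tmpdir/demo_uniq.txt"

# arr[$2]++ evaluates to 0 the first time a value is seen, so !0 is true and the line prints.
uniq_out=$(awk '!arr[$2]++' "$tmpdir/demo_uniq.txt")
echo "$uniq_out"

rm -r "$tmpdir"
```

Only the r1 and r2 lines should survive, because r3 repeats a field 2 value that was already seen.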
Explode a file on the first field:
awk '{print > $1}' file1.txt
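A quick sketch of what "exploding" produces, using an invented input file (demo_explode.txt) in a scratch directory. Each distinct value of field 1 becomes an output file name, so it is worth running this somewhere that stray files won't matter.

```shell
# Hypothetical sample data; field 1 takes the values "x" and "y".
tmpdir=$(mktemp -d)
printf 'x 1\ny 2\nx 3\n' > "$tmpdir/demo_explode.txt"

# Run inside the scratch directory, because the output files are named after field 1.
(cd "$tmpdir" && awk '{print > $1}' demo_explode.txt)

explode_x=$(cat "$tmpdir/x")   # lines whose first field is "x"
explode_y=$(cat "$tmpdir/y")   # lines whose first field is "y"
echo "$explode_x"
echo "$explode_y"

rm -r "$tmpdir"
```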
Sum a column:
awk '{ sum+=$1} END {print sum}' file.txt
Insert a space after every three characters:
sed 's/\(.\{3\}\)/\1 /g' file.txt
Join 2 files based on field 1. Both files need to be sorted on that field first (use sort -k 1b,1):
join -1 1 -2 1 file1.txt file2.txt
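The sorting requirement is the part that usually trips people up, so here is a minimal sketch of the whole sort-then-join sequence. The file names (left.txt, right.txt) and their contents are made up for the example.

```shell
# Hypothetical unsorted inputs sharing keys "a" and "c" in field 1.
tmpdir=$(mktemp -d)
printf 'b 2\na 1\nc 3\n' > "$tmpdir/left.txt"
printf 'c z\na x\n' > "$tmpdir/right.txt"

# Sort both files on field 1 (ignoring leading blanks) before joining.
sort -k 1b,1 "$tmpdir/left.txt" > "$tmpdir/left.sorted.txt"
sort -k 1b,1 "$tmpdir/right.txt" > "$tmpdir/right.sorted.txt"

# Keys present in only one file (here "b") are dropped by default.
join_out=$(join -1 1 -2 1 "$tmpdir/left.sorted.txt" "$tmpdir/right.sorted.txt")
echo "$join_out"

rm -r "$tmpdir"
```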
Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sorted:
awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' file1.txt file2.txt ... fileN.txt
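As a rough illustration of the multi-file version, the sketch below runs it on two invented files (s1.txt and s2.txt) and pipes the result through sort, since awk's `for (i in arr)` order is not defined. Note that lines containing `#` are skipped, and that the output has two spaces after the key because each value is appended with a leading space.

```shell
# Hypothetical inputs: one has a comment line that should be ignored.
tmpdir=$(mktemp -d)
printf '# sample 1\ngeneA 10\ngeneB 20\n' > "$tmpdir/s1.txt"
printf 'geneA 30\ngeneB 40\n' > "$tmpdir/s2.txt"

# Collect field 2 from every file under its field 1 key, then sort the unordered output.
multi_join=$(awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' \
  "$tmpdir/s1.txt" "$tmpdir/s2.txt" | sort)
echo "$multi_join"

rm -r "$tmpdir"
```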
Find lines shared by 2 files (assumes neither file contains duplicate lines; append | wc -l to count them):
sort file1 file2 | uniq -d
Alternative method to find the common lines (files need to be pre-sorted):
comm -12 file1 file2
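Because comm expects sorted input, a complete run looks something like the sketch below. The two word lists (one.txt, two.txt) are invented for the example.

```shell
# Hypothetical unsorted inputs with "apple" and "bird" in common.
tmpdir=$(mktemp -d)
printf 'cat\napple\nbird\n' > "$tmpdir/one.txt"
printf 'bird\ndog\napple\n' > "$tmpdir/two.txt"

# comm needs both inputs sorted in the same collation order.
sort "$tmpdir/one.txt" > "$tmpdir/one.sorted.txt"
sort "$tmpdir/two.txt" > "$tmpdir/two.sorted.txt"

# -12 suppresses the columns unique to file 1 and file 2, leaving only shared lines.
common=$(comm -12 "$tmpdir/one.sorted.txt" "$tmpdir/two.sorted.txt")
echo "$common"

rm -r "$tmpdir"
```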
Add a header to a file
sed -e '1i\HeaderGoesHere' originalFile
Extract every 4th line starting at the second line (extract the sequence from fastq)
sed -n '2~4p' file.txt
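The `first~step` address is a GNU sed extension. On systems without GNU sed, awk's record counter should give the same selection. A sketch on an invented two-read FASTQ file (demo.fastq):

```shell
# Hypothetical FASTQ file with two 4-line records.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmpdir/demo.fastq"

# NR % 4 == 2 is true for lines 2, 6, 10, ... which are the sequence lines.
fastq_seqs=$(awk 'NR % 4 == 2' "$tmpdir/demo.fastq")
echo "$fastq_seqs"

rm -r "$tmpdir"
```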
Find the most common strings in column 2:
cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
Randomise lines in a file
shuf file.txt
Generate a list of random numbers (integers)
for i in {1..50} ; do echo $RANDOM ; done
Find the lines of file2 that contain any of the strings listed in file1:
grep -Ff file1 file2
Print lines which contain string1, string2 or stringN:
grep -E '(string1|string2|stringN)' file.txt
Count the number of "X" characters per line:
n=0; while IFS= read -r line; do n=$((n + 1)); printf '%s ' "$n"; printf '%s\n' "$line" | tr -cd 'X' | wc -c; done < file.txt
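The same per-line count can also be obtained with awk alone, which avoids spawning tr and wc for every line. This is an alternative sketch rather than the original method: it relies on gsub returning the number of replacements it made. The input file (demo_x.txt) is made up.

```shell
# Hypothetical input: lines containing 2, 0 and 3 "X" characters.
tmpdir=$(mktemp -d)
printf 'XaXb\nabc\nXXX\n' > "$tmpdir/demo_x.txt"

# gsub returns how many matches it replaced, so it doubles as a counter; NR is the line number.
x_counts=$(awk '{print NR, gsub(/X/, "")}' "$tmpdir/demo_x.txt")
echo "$x_counts"

rm -r "$tmpdir"
```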
Count the length of strings in field 2:
awk '{print length($2)}' file1
Print all possible 3mer DNA sequence combinations
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
Filter reads with SamTools
samtools view -f 4 file.bam > unmapped.sam
samtools view -F 4 file.bam > mapped.sam
samtools view -f 2 file.bam > mappedPairs.sam
Compress a bunch of folders full of data
for DIR in */ ; do DIR=${DIR%/} ; zip -r "$DIR.zip" "$DIR/" ; done
Get a year-month-day:hour:minute:second timestamp from Unix "date"
DATE=$(date +%Y-%m-%d:%H:%M:%S)
GNU Parallel has many interesting uses.

Confirm md5sum of fastq files
ls *fastq.gz | parallel  md5sum {} > checksums.txt
Index bam files
parallel samtools index {} ::: *bam
Fix Bedtools "Error: malformed BED entry at line 1. Start was greater than end. Exiting." by reorganising bed coordinates
awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}' file.bed > file_fixed.bed 
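Here is a small check of the coordinate swap on an invented BED file (demo.bed), where the first interval has start and end reversed. Note that swapped lines are rewritten with only the first three columns, so any extra BED fields on those lines are dropped.

```shell
# Hypothetical BED file: the first line has start (500) greater than end (100).
tmpdir=$(mktemp -d)
printf 'chr1\t500\t100\nchr2\t10\t20\n' > "$tmpdir/demo.bed"

# Swap columns 2 and 3 when end < start; otherwise pass the line through unchanged.
bed_fixed=$(awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}' "$tmpdir/demo.bed")
echo "$bed_fixed"

rm -r "$tmpdir"
```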
Fix a MACS peak BED file that contains negative coordinates
awk 'BEGIN{OFS="\t"} {if ($2<1) print $1,1,$3 ; else print $0 }' macs_peaks.bed > macs_peaks_fixed.bed
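Recursively replace spaces with underscores in directory names across many subdirectories (Perl rename; -depth handles the deepest directories first, and the regex only touches the last path component):
find . -depth -name "* *" -type d | rename 's/ (?=[^\/]*$)/_/g'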
Generate md5 checksums for directory of files in parallel

parallel md5sum ::: * > checksums.md5
Aggregate: Sum column 2 values based on column 1 string

awk  '{array[$1]+=$2} END { for (i in array) {print i, array[i]}}' file.tsv
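A worked example of the aggregation on invented data (demo.tsv). The output is piped through sort because awk does not guarantee the order in which `for (i in array)` visits keys.

```shell
# Hypothetical two-column table; geneA appears twice.
tmpdir=$(mktemp -d)
printf 'geneA\t2\ngeneB\t5\ngeneA\t3\n' > "$tmpdir/demo.tsv"

# Accumulate column 2 per column 1 key, then sort for a stable order.
agg=$(awk '{array[$1]+=$2} END { for (i in array) {print i, array[i]}}' "$tmpdir/demo.tsv" | sort)
echo "$agg"

rm -r "$tmpdir"
```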

Validate files by comparing checksums in parallel
cat checksums.md5 | parallel --pipe -N1 md5sum -c
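For a handful of files the serial round trip is enough, and it shows what the parallel version is distributing. The sketch below assumes GNU coreutils md5sum and uses invented files (a.txt, b.txt). The checksum file stores relative paths, so verification has to run from the same directory.

```shell
# Hypothetical files to checksum.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a.txt"
printf 'world\n' > "$tmpdir/b.txt"

# Record the checksums, then verify them; LC_ALL=C keeps the "OK" messages untranslated.
md5_report=$(cd "$tmpdir" && md5sum a.txt b.txt > checksums.md5 && LC_ALL=C md5sum -c checksums.md5)
md5_status=$?   # 0 only if every file matched its recorded checksum
echo "$md5_report"

rm -r "$tmpdir"
```

A non-zero exit status from `md5sum -c` means at least one file failed verification, which makes it easy to use in scripts.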