Posts

Showing posts with the label bash

Shell aliases for bioinformatics

Using shell allows us to take advantage of some nice features to make our bioinformatics lives a little easier for things we do very frequently. In Ubuntu, the ~/.bashrc file is run as a new terminal window is opened to customise the shell. Here are a few of my favourite general shortcuts. Let me know your favourites in the comments section below!

#shorten ls forms
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'


#shorten file viewing
alias h='head'
alias t='tail'
alias n='nano -S'
#the -S option to nano makes scrolling smoother
alias nano='nano -S'

#easy update
alias update='sudo apt-get update && sudo apt-get upgrade -y'

#search through history
alias hgrep='history | grep'

#Get col headers of tab delim file
ch(){
cat $1 | tr '\t' '\n' | nl -n ln
}
export -f ch

#login with ssh where IP is constant (X is the IP address)
alias login1='ssh -Y username@X.X.X.X' #scp can be done as above

#login with …

Running many GSEA analyses using a simple Bash wrapper script

In my previous posts, I've recommended using GSEA pre-ranked for performing pathway analysis through the graphical user interface. If you have a more elaborate experiment with several sample groups and a few different gene set libraries to analyse, then you'll find that process quite tedious.

So in this post, I'd like to share with you my GSEA wrapper written in bash to make this process a bit easier. The wrapper script simply accepts two user defined arguments, the rank file and the gmt file. Take note that this works only when the gene IDs are already gene symbols. If you have probe IDs or some other type of accession number, then you'll need to convert them to gene symbols or use a "chip file". The rank files and GMT files need to be in the current directory. You may need to adjust the memory up or down depending on your machine spec and the memory required for your jobs.

#!/bin/bash
RNK=$1
GMT=$2
java -Xmx4096m -cp /path/to/gsea2-2.2.1.jar xtools.gsea.GseaPre…

Comparing expression profiles

Image
One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies:
Compare all data points - using a correlation analysisCompare sets of up and down-regulated genes - using a binomial or Fisher exact testCompare sets of genes within a profile - such as GSEA test In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes.

Merging data sets No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray.

The human data is currently in CSV format from Degust  and looks like this:

gene,c,aza,FDR,C1,C2,C3,A1,A2,A…

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash
#rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
XLS=$1
#Specify the gene ID column
ID=$2
#Specify the fold change value column
FC=$3
#Specify the raw p-value column
P=$4

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \
| awk '{print $1,$2,$2*…

RNA-seq aligner accuracy tested with simulated reads

Image
In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTC…

A selection of useful bash one-liners

Here's  a selection of one liners that you might find useful in data analysis. I'll update this post regularly and hope you can share some of your own gems.

Compress an archive:
tar cf backup.tar.bz2 --use-compress-prog=pbzip2  directory Sync a folder of data to a backup
rsync -azhv /scratch/project1 /backup/project1/ Extract a tar archive:
tar xvf backup.tar   Head a compressed file:
bzcat file.bz2 | head   pbzip2 -dc file.bz2 | head zcat file.gz | head 
Uniq on one column (field 2):
awk '!arr[$2]++' file.txt Explode a file on the first field:
awk '{print > $1}' file1.txt Sum a column:
awk '{ sum+=$1} END {print sum}' file.txt Put spaces between every three characters
sed 's/\(.\{3\}\)/\1 /g' file.txt Join 2 files based on field 1. Both files need to be properly sorted (use sort -k 1b,1)
 join -1 1 -2 1 file1.txt file2.txt Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sor…

UNIX utility: split

Image
Even the most simple types of analysis (ie: count lines) can take a really long time with the colossal data sets  coming from sequencing experiments these days. Parallelization of jobs over multiple processors/cores/threads can make things faster, but the standard unix based tools we commonly use don't generally have the in-built capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs.

One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines, another option is to use the split command which will divide the file by the specified number of lines.

Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:
split …