Posts

Showing posts with the label awk

Comparing expression profiles

Image
One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies:
Compare all data points - using a correlation analysisCompare sets of up and down-regulated genes - using a binomial or Fisher exact testCompare sets of genes within a profile - such as GSEA test In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes.

Merging data sets No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray.

The human data is currently in CSV format from Degust  and looks like this:

gene,c,aza,FDR,C1,C2,C3,A1,A2,A…

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called rnkgen.sh that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash
#rnkgen.sh converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
XLS=$1
#Specify the gene ID column
ID=$2
#Specify the fold change value column
FC=$3
#Specify the raw p-value column
P=$4

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \
| awk '{print $1,$2,$2*…

Interspecies gene name conversion

Image
In this post, I'll provide a step-by-step guide to perform interspecies gene name conversion of gene expression data. This is a necessary step in the comparison of profiling data from two different experiments with different species (human and mouse), and allows us to use extensive human-centric gene set libraries in MSigDB when analysing non-human mammalian profiling data (such as mouse).

I performed GEO2R analysis of mouse expression data (GSE30192) to analyse the effect of azacitidine on mouse C2C12 myoblasts. The data looks like this:
"ID""adj.P.Val""P.Value""t""B""logFC""Gene.symbol""Gene.title"
"1420647_a_at" "0.000346" "2.24e-08" "56.073665" "8.699524" "6.9755573" "Krt8" "keratin 8"
"1423327_at" "0.000346" "2.32e-08" "55.685912" "8.686447" "3.8096523" "Rpl…

A selection of useful bash one-liners

Here's  a selection of one liners that you might find useful in data analysis. I'll update this post regularly and hope you can share some of your own gems.

Compress an archive:
tar cf backup.tar.bz2 --use-compress-prog=pbzip2  directory Sync a folder of data to a backup
rsync -azhv /scratch/project1 /backup/project1/ Extract a tar archive:
tar xvf backup.tar   Head a compressed file:
bzcat file.bz2 | head   pbzip2 -dc file.bz2 | head zcat file.gz | head 
Uniq on one column (field 2):
awk '!arr[$2]++' file.txt Explode a file on the first field:
awk '{print > $1}' file1.txt Sum a column:
awk '{ sum+=$1} END {print sum}' file.txt Put spaces between every three characters
sed 's/\(.\{3\}\)/\1 /g' file.txt Join 2 files based on field 1. Both files need to be properly sorted (use sort -k 1b,1)
 join -1 1 -2 1 file1.txt file2.txt Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sor…

Evaluate length of sequence strings

If you are working with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. From a bam file this can be done starting with samtools view, then cutting the sequence string out and then using either perl or awk to count the length of the sequence. The list of integers can then be piped to numaverage, a numutils program to evaluate the average, median and mode of a list of numbers.

To get the average length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage- m

To get the shortest length
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort | head -1

To get the longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' …