Showing posts with the label UNIX

Extract data from a spreadsheet file on the linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If it isn't installed on your Linux machine, you can most likely get it from the package repository. On Debian and Ubuntu, ssconvert ships as part of the gnumeric package.

$ sudo apt install gnumeric

To extract a spreadsheet file into TSV, run:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv


You will notice that each sheet is written to a separate TSV file. This approach is nice because it scales to high-throughput screening of many spreadsheets, as I implemented in my Gene Name Errors paper a while back.
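For screening many spreadsheets at once, the same ssconvert call can be wrapped in a loop. The following is a minimal sketch rather than the code used in the paper: the function name convert_all and its directory argument are hypothetical, and it assumes ssconvert is on the PATH.

```shell
#!/bin/bash
# Hypothetical helper: convert every .xlsx file in a directory
# (default: current directory) to per-sheet TSV files.
convert_all() {
  local dir=${1:-.} xlsx
  for xlsx in "$dir"/*.xlsx; do
    [ -e "$xlsx" ] || continue          # skip if the glob matched nothing
    # Same ssconvert invocation as above, applied to each workbook in turn
    ssconvert -S --export-type Gnumeric_stf:stf_assistant \
      -O 'separator="'$'\t''"' "$xlsx" "$xlsx.tsv" \
      || echo "failed: $xlsx" >&2       # report the failure and keep going
  done
}
```

Calling convert_all with a directory of workbooks leaves the TSV output next to each input file, and a workbook that fails to convert is reported on stderr without stopping the rest of the batch.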

Here is an example of obtaining some data from GEO.

$ #first download
$ curl 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE80251&format=file&file=GSE80251%5Fprocessed%5FRNA%5Fexpression%5Fmnfyap%2Exlsx' > GSE80251.xlsx


$ #now extract
$ ssconvert -S --export-type Gnumeric_stf:stf_assistant …

Pathway analysis with ZGST

Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are a lot of pathway analysis strategies available, and I can break them down into these groups:

1. Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software Limma, edgeR, DESeq, etc. These include CAMERA, MRGST, WilcoxGST, Roast, etc.
2. Commercial, GUI-based, such as Ingenuity IPA or MetaCore GeneGO.
3. Java-based, such as GSEA and GSAA.
4. Web-based tools such as DAVID, WebGestalt and GO Enrichment analysis.

Now these each have their advantages and disadvantages. I wanted to see whether I could make a pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.
I wrote a pathway analysis tool in Bash. I know. It's crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. It…

Functions and GNU parallel for effective cluster load management

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops, but over time I've grown to love it. The beauty of GNU parallel is that it runs a specified number of jobs in parallel and starts a new job as soon as another completes. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you already have Sun Grid Engine or another job scheduler in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. Here is a simple example in bash, which declares a function consisting of a chain of piped commands and then runs up to 4 jobs at a time until every file matching *files.txt has been processed.

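#!/bin/bash
# Worker function: runs one chain of piped commands on a single input file
my_func() {
INPUT=$1
VAR1=bar
cmd1 "$INPUT" "$VAR1" | cmd2 | cmd3 > "${INPUT}.out"
}
# Export the function so that the jobs started by parallel can see it
export -f my_func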
parallel -j4…
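To make the pattern concrete, here is a minimal self-contained sketch of the same idea. The function names summarise and run_all are hypothetical, and the sort/uniq pipeline simply stands in for whatever chain of commands you need; it assumes GNU parallel is installed.

```shell
#!/bin/bash
# Hypothetical worker: count how often each distinct line occurs in one
# input file and write the counts, most frequent first, to <input>.out
summarise() {
  local input=$1
  sort "$input" | uniq -c | sort -k1,1nr > "${input}.out"
}
# Export the function so the shells started by GNU parallel can call it
export -f summarise

# Run the worker on every file given as an argument, at most 4 jobs at once;
# ":::" separates the command from the list of inputs
run_all() {
  parallel -j4 summarise ::: "$@"
}
```

Running run_all *files.txt then processes every matching file, starting a new job whenever one of the four slots frees up.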

Using Named Pipes and Process Substitution

Named pipes and process substitution are little-known UNIX features that let you avoid writing temporary files in your data pipelines. Vince Buffalo explains both in his digital notebook.
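As a rough illustration of both features, here is a minimal bash sketch. It is not taken from the notebook, and the function names common_lines and count_unique_via_fifo are hypothetical.

```shell
#!/bin/bash
# Process substitution: <(cmd) hands a command's output to another program
# as if it were a file, so no sorted temporary copies are written to disk.
common_lines() {
  comm -12 <(sort "$1") <(sort "$2")   # print lines present in both files
}

# Named pipe: mkfifo creates a special file that one process writes to
# while another reads from it; the data is streamed rather than stored.
count_unique_via_fifo() {
  local dir fifo
  dir=$(mktemp -d)
  fifo="$dir/pipe"
  mkfifo "$fifo"
  sort -u "$1" > "$fifo" &             # writer waits until a reader opens the pipe
  wc -l < "$fifo" | tr -d ' '          # reader: count the unique lines
  wait                                 # make sure the writer has finished
  rm -r "$dir"
}
```

Process substitution is the more convenient of the two inside a single command line, while a named pipe is useful when the writer and the reader are started separately.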