Showing posts from May, 2015

Run Linux from USB

Linux is definitely the favorite OS for bioinformatics, but most university and research institute IT departments are MS Windows-centric. Even 53% of visitors to this blog run Windows. Many IT departments that I've interacted with lock down their PCs so that no software can be installed, leaving employees and students unable to run the software they need to get their work done.

One option is to use virtualisation software such as VirtualBox or VMware to run Linux inside Windows, but that comes with reduced performance. Another, better option is to run Linux from a USB flash drive. Just as virtually all Linux distros can be booted off CD/DVD, they can also be booted off USB. The benefit is that you can run a "pure" Linux OS without modifying the existing host Windows OS. You'll also be able to take it and all the installed software wherever you go, and run it on any machine. Some Linux distros are specifically designed for running off USB (or SD) flash driv…

Download SRA data with Aspera command line utility

Asperaconnect is NCBI's recommended data transfer client for large datasets (>1 GB). It uses the FASP protocol; here's a description from the NCBI guide:
"The FASP protocol from Aspera uses UDP, eliminating the latency issues seen with TCP, and provides bandwidth up to 1 gigabit per second (Gbps) to transfer data. It has a restart capability if data transfer is interrupted midstream and is well behaved, so if there is other data traffic on your network connections, it will back off in order to avoid starving other protocols. We have seen effective throughput up to 800 megabits per second (Mbps) to a single site.
The fasp protocol uses UDP port 33001-33009 for data transfer and you may need to contact your IT security staff if this port is not open to NCBI through your institutional firewalls.
NCBI is implementing Aspera for two use cases, occasional users who download files for direct use (Aspe…
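
To put the quoted throughput figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic: the dataset sizes are made-up examples, the function name `transfer_seconds` is my own, and it assumes decimal units (1 GB = 8000 megabits) with no protocol overhead, disk limits or network contention, so real downloads will be slower.

```python
def transfer_seconds(size_gb: float, rate_mbps: float) -> float:
    """Return the ideal transfer time in seconds.

    Assumes decimal units: 1 GB = 1000 MB = 8000 megabits.
    Ignores protocol overhead, disk speed and network contention.
    """
    if rate_mbps <= 0:
        raise ValueError("rate_mbps must be positive")
    megabits = size_gb * 8000
    return megabits / rate_mbps


if __name__ == "__main__":
    # Hypothetical dataset sizes; 800 and 1000 Mbps are the figures
    # quoted from the NCBI guide above.
    for size_gb in (1, 10, 100):
        for rate_mbps in (800, 1000):
            secs = transfer_seconds(size_gb, rate_mbps)
            print(f"{size_gb:>4} GB at {rate_mbps} Mbps: {secs:.0f} s")
```

So under these idealised assumptions, a 1 GB file at the 800 Mbps that NCBI reports would take about 10 seconds.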

Pathway analysis with ZGST

Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are many pathway analysis strategies available, and I can break them down into these groups:

- Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software (Limma, edgeR, DESeq, etc.). These include CAMERA, MRGST, WilcoxGST, Roast, etc.
- Commercial, GUI-based, such as Ingenuity IPA or MetaCore GeneGO.
- Java-based, such as GSEA and GSAA.
- Web-based tools such as DAVID, WebGestalt and GO Enrichment analysis.

Now these each have their advantages and disadvantages. I wanted to see whether I could make a pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.
I wrote a pathway analysis tool in Bash. I know. It's crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. It…