In his digital notebook, Vince Buffalo explains two little-known UNIX features that let you avoid writing temporary files in your data pipelines: named pipes and process substitution.
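To make the idea concrete, here is a minimal bash sketch (not taken from the notebook itself) that finds the lines shared by two unsorted files without writing the sorted intermediates to disk, first with process substitution and then with named pipes. The file names and demo data are made up for illustration, and process substitution needs bash or a similar shell rather than plain sh.

```shell
#!/usr/bin/env bash
# Hypothetical demo data; the file names are placeholders.
workdir=$(mktemp -d)
printf 'b\na\nc\n' > "$workdir/x.txt"
printf 'c\nb\nd\n' > "$workdir/y.txt"

# Process substitution: each <(...) is handed to comm as a readable file,
# so the sorted intermediates never touch the disk.
shared_ps=$(comm -12 <(sort "$workdir/x.txt") <(sort "$workdir/y.txt"))

# Named pipes: the same idea with explicit FIFOs created by mkfifo.
mkfifo "$workdir/x.fifo" "$workdir/y.fifo"
sort "$workdir/x.txt" > "$workdir/x.fifo" &   # blocks until comm opens the FIFO
sort "$workdir/y.txt" > "$workdir/y.fifo" &
shared_fifo=$(comm -12 "$workdir/x.fifo" "$workdir/y.fifo")
wait                                          # let both background sorts finish

printf 'process substitution:\n%s\n' "$shared_ps"
printf 'named pipes:\n%s\n' "$shared_fifo"
rm -rf "$workdir"
```

Both variants should report the same shared lines. Process substitution is the shorter of the two because the shell creates and cleans up the pipes for you.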
Google Drive is great for sharing documents and other small files, but it is not suited to moving many large files around. For example, I just received 170 fastq files of about 200 MB each. If you use the browser to download the whole folder, the web app zips the contents for you, which takes a LOOONG time. Alternatively, you can download the files one by one, which is tedious and prone to human error. You can ask your collaborators to transfer the data some other way, but there are not many approaches that are both user-friendly and inexpensive. Your biologist collaborators probably won't be able to use rsync to get the data to you safely, and fast, convenient tools for moving large files, like Hightail, cost a lot of money. A good solution is the R package googledrive, which lets you automate from the command line tasks that would take a long time to do manually. The package vignette has a good overview of the main comma
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. To understand the biology underlying the differential gene expression profile, we need to perform pathway analysis. We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some other methods. A 2013 paper that compared a range of gene set analysis software on microarray data is worth a look.

Generate a rank file

The rank file is a list of detected genes, each with a rank metric score. At the top of the list are the genes with the "strongest" up-regulation, at the bottom are the genes with the "strongest" down-regulation, and the genes that are not changing sit in the middle. The metric score I like to use is the sign of the (log) fold change multiplied by the inverse of the p-value, although there may be better methods out there
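As a rough sketch of that metric, the snippet below computes sign(log fold change) × 1/p with awk and sorts the genes from most up-regulated to most down-regulated. The input layout (tab-separated gene, log2 fold change, p-value, no header), the gene names and the numbers are all invented for illustration, and skipping rows with a p-value of zero is my own assumption, since the inverse is undefined there.

```shell
# Hypothetical input: tab-separated gene, log2 fold change, p-value (no header).
de_table=$(mktemp)
printf 'GENE_A\t2.5\t0.001\nGENE_B\t-1.8\t0.0001\nGENE_C\t0.1\t0.5\n' > "$de_table"

tab=$(printf '\t')

# score = sign(log2FC) * (1 / p); rows with p <= 0 are skipped (assumption).
# Fixed-point output keeps the scores sortable with a plain numeric sort.
ranked=$(awk -F'\t' '$3 > 0 {
    sign = ($2 > 0) - ($2 < 0)
    printf "%s\t%.6f\n", $1, sign / $3
}' "$de_table" | LC_ALL=C sort -t "$tab" -k2,2nr)

printf '%s\n' "$ranked"
rm -f "$de_table"
```

With the toy data, the strongly up-regulated gene ends up first, the barely changing gene in the middle, and the strongly down-regulated gene last, which is the ordering the rank file needs.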
For a few different types of sequence analysis I need to extract read pairs that are properly mapped and pass a mapping quality filter, but I always forget the command line. So here it is.

To extract read pairs that map with mapQ≥30 from a bam file, writing bam output (-b) and including the header (-h):

$ samtools view -q 30 -f 0x2 -b -h in.bam > out.bam

The -f 0x2 option keeps only reads whose bitwise flag has the "properly paired" bit set. Proper pairing means the reads are in Read1 forward, Read2 reverse orientation or Read1 reverse, Read2 forward orientation.

To extract single-end reads that map with mapQ≥30 from a bam file, again writing bam output (-b) and including the header (-h):

$ samtools view -q 30 -b -h in.bam > out.bam
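If the bitwise flag is the part you keep forgetting, this small sketch shows the test that -f 0x2 applies to each read's FLAG value. The helper name is_proper_pair is made up for this example and is not part of samtools; the flag values are ordinary SAM flags chosen for illustration.

```shell
# Succeeds if the SAM FLAG passed as $1 has the 0x2 (properly paired) bit set.
# is_proper_pair is a hypothetical helper, not a samtools command.
is_proper_pair() {
    [ $(( $1 & 0x2 )) -ne 0 ]
}

# 99  = 1 + 2 + 32 + 64  (paired, proper pair, mate reverse, first in pair)
# 147 = 1 + 2 + 16 + 128 (paired, proper pair, read reverse, second in pair)
# 65  = 1 + 64           (paired, first in pair, but not a proper pair)
for flag in 99 147 65; do
    if is_proper_pair "$flag"; then
        echo "$flag: kept by -f 0x2"
    else
        echo "$flag: dropped by -f 0x2"
    fi
done
```

Flags 99 and 147 are the typical values for a properly paired forward/reverse pair, so both pass the filter, while 65 does not.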