Little-known UNIX features that let you avoid writing temporary files in your data pipelines, explained by Vince Buffalo in his digital notebook: an introduction to named pipes and process substitution.
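As a minimal sketch of the two features named here (the file names and contents are made up for illustration, and bash is assumed because process substitution is not part of plain POSIX sh), the following finds the lines common to two unsorted files without writing the sorted intermediates to disk:

```shell
#!/bin/bash
# Illustrative inputs only; x.txt and y.txt are hypothetical files.
workdir=$(mktemp -d)
printf 'b\na\nc\n' > "$workdir/x.txt"
printf 'c\nb\nd\n' > "$workdir/y.txt"

# Process substitution: each <(...) is handed to comm as a readable
# file name backed by a pipe, so no sorted temporary file is created.
common_ps=$(comm -12 <(sort "$workdir/x.txt") <(sort "$workdir/y.txt"))

# Named pipe: the same idea with an explicit FIFO created by mkfifo.
mkfifo "$workdir/x_sorted.fifo"
# The background writer blocks until a reader opens the FIFO.
sort "$workdir/x.txt" > "$workdir/x_sorted.fifo" &
common_fifo=$(comm -12 "$workdir/x_sorted.fifo" <(sort "$workdir/y.txt"))
wait   # reap the background sort

printf '%s\n' "$common_ps"
rm -rf "$workdir"
```

Both approaches should give the same result; process substitution is usually the more convenient of the two because the shell creates and cleans up the pipe itself.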
In our RNA-seq series so far, we've performed differential analysis and generated some pretty graphs showing thousands of differentially expressed genes after azacitidine treatment. To understand the biology underlying the differential gene expression profile, we need to perform pathway analysis. We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared several gene set analysis tools on microarray data and is worth a look.

Generate a rank file

The rank file is a list of detected genes, each with a rank metric score. At the top of the list are the genes with the "strongest" up-regulation, at the bottom are the genes with the "strongest" down-regulation, and the genes that are not changing sit in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there...
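A rough sketch of that rank metric, assuming a tab-separated table with three columns (gene, log2 fold change, p-value). The column layout, the gene names, the values and the `make_rank` helper are all hypothetical, not taken from the original post. Rows with a p-value of zero are skipped here to avoid dividing by zero, which is one possible choice rather than an established convention:

```shell
#!/bin/bash
# Hypothetical differential expression table: gene, log2FC, p-value.
de_table=$(printf 'GeneA\t2.5\t0.001\nGeneB\t-1.8\t0.0001\nGeneC\t0.1\t0.5\n')

# make_rank: reads the table on stdin and prints gene<TAB>score,
# where score = sign(log2FC) * (1 / p-value), sorted from highest to lowest.
make_rank() {
  awk -F'\t' -v OFS='\t' '
    $3 > 0 {
      sign = ($2 > 0) - ($2 < 0)   # 1, 0 or -1
      print $1, sign / $3
    }' | sort -k2,2gr               # general numeric sort, descending
}

rank=$(printf '%s\n' "$de_table" | make_rank)
printf '%s\n' "$rank"
```

With this toy input, the strongly up-regulated gene ends up at the top, the strongly down-regulated gene at the bottom, and the barely changing gene in between.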
If you have had to upload omics data to GEO before, you'll know it's a bit of a hassle and takes a long time. The GEO team suggests a few methods if you are using the Unix command line:

Using 'ncftp'

    ncftp
    set passive on
    set so-bufsize 33554432
    open ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    put -R Folder_with_submission_files

Using 'lftp'

    lftp ftp://geoftp:yourpasscode@ftp-private.ncbi.nlm.nih.gov
    cd uploads/your@mail.com_yourfolder
    mirror -R Folder_with_submission_files

Using 'sftp' (expect slower transfer speeds since this method encrypts on-the-fly)

    sftp geoftp@sftp-private.ncbi.nlm.nih.gov
    password: yourpasscode
    cd uploads/your@mail.com_yourfolder
    mkdir new_geo_submission
    cd new_geo_submission
    put file_name

Using 'ncftpput' (transfers from the command-line without entering an interactive shell)

Usage example:

    ncftpput -F -R -z -u geoftp -p "yourpasscode...
FeatureCounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov. FeatureCounts takes GTF files as an annotation. These can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts is also smart enough to recognise and correctly process SAM and BAM alignment files. Here is a script to generate a gene-wise matrix from all BAM files in a directory.

    #!/bin/bash
    # Generate RNA-seq matrix

    # Set parameters
    GTF=/path/to/Mus_musculus.GRCm38.78.gtf
    EXPTNAME=mouse_rna
    CPUS=8
    MAPQ=10
    GENEMX=${EXPTNAME}_genes.mx

    # Make the gene-wise matrix:
    # keep the gene ID (column 1) and the per-BAM counts (columns 7 onwards),
    # then drop the leading comment line
    featureCounts -Q "$MAPQ" -T "$CPUS" -a "$GTF" -o /dev/stdout *bam \
      | cut -f1,7- | sed 1d > "$GENEMX"

The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc). Consider attaching the gene name to give the data more relevance. To do that, first ...
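To see what the `cut -f1,7- | sed 1d` step of the script does, here is a small sketch run on mock data instead of real featureCounts output. The gene IDs, coordinates and counts are invented, and `mock_fc_output` is a hypothetical stand-in that only mimics the general layout: a leading comment line, then a header row with six annotation columns followed by one count column per BAM file.

```shell
#!/bin/bash
# Mock stand-in for featureCounts output; all values are invented.
mock_fc_output() {
  printf '# Program:featureCounts (mock comment line)\n'
  printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\ta.bam\tb.bam\n'
  printf 'geneX\t1\t100\t200\t+\t101\t5\t7\n'
  printf 'geneY\t2\t300\t450\t-\t151\t0\t12\n'
}

# cut keeps column 1 (gene ID) and columns 7 onwards (counts); the comment
# line has no tabs, so cut passes it through unchanged and sed 1d removes it.
matrix=$(mock_fc_output | cut -f1,7- | sed 1d)
printf '%s\n' "$matrix"
```

The result keeps the header row with the BAM file names, so each count column stays labelled with its sample.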