Showing posts from 2018

Extract data from a spreadsheet file on the Linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If this isn't installed on your Linux machine, you can most likely get it from the package repository. On Debian and Ubuntu, ssconvert ships as part of the gnumeric package.

$ sudo apt install gnumeric

Then if you want to extract a spreadsheet file into a tsv it can be done like this:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv

You will notice that all the sheets are output to separate tsv files. This approach is nice as it can accommodate high throughput screening, as I implemented in my Gene Name Errors paper a while back.
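To give a feel for the high throughput use case, here is a minimal sketch that loops the same ssconvert call over every .xlsx file in a directory. The function name convert_dir and the directory name spreadsheets are hypothetical, and this is an illustration only, not the script used in the paper. The tab separator is built with printf rather than the $'\t' form so that it also works in plain sh.

```shell
# Sketch: convert every .xlsx file in a directory to per-sheet TSV files.
# convert_dir and the "spreadsheets" directory are hypothetical names.
convert_dir() {
  dir="$1"
  tab=$(printf '\t')   # literal tab character for the separator option
  for f in "$dir"/*.xlsx; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    ssconvert -S --export-type Gnumeric_stf:stf_assistant \
      -O "separator=\"$tab\"" "$f" "$f.tsv"
  done
}

# Run only if ssconvert is installed and the directory exists
if command -v ssconvert >/dev/null 2>&1 && [ -d spreadsheets ]; then
  convert_dir spreadsheets
fi
```

Each workbook is handled independently, so one unreadable file does not stop the remaining files from being attempted.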

Here is an example of obtaining some data from GEO.

$ #first download
$ curl '' > GSE80251.xlsx

$ #now extract
$ ssconvert -S --export-type Gnumeric_stf:stf_assistant …

Incorporate dee2 data into your R-based RNA-seq workflow

DEE2 is a portal for accessing gene expression data derived from public RNA-seq datasets. So far there are over 400k datasets available, and the number is growing every day. While there are existing databases such as Expression Atlas, Recount2 and ARCHS4, DEE2 offers a number of unique benefits. For instance, dee2 includes gene-wise counts from STAR as well as transcript-wise quantifications from Kallisto. There are a few ways you can access these data. Firstly, there is a nice web interface that is mobile friendly. Secondly, there are data dumps available if you are running a large-scale analysis. But the purpose of this post is to demonstrate the improved R interface in action together with SRAdbV2 and statistics with edgeR and DESeq. The official documentation is available on GitHub.

Getting started

This tutorial provides a walkthrough of how to work with dee2 expression data, starting with dataset searches, obtaining the data and then performing a differential analysi…

Update on DEE2 project for Sept 2018

A few updates for DEE2 I would like to share.

I switched over to NameCheap domain name service, which appears to be working much better than the previous one (HostPapa). The domain name server change broke the docker image, so it was slightly modified and rebuilt. I've integrated with SRAdbV2, and now there are many more datasets in the queue. I think many of these are small ones related to single cell RNA-seq. I am using as many computers as possible to clear the backlog. I've noticed a lot of SRA projects with one or a few datasets missing, so I have written a script to identify these and queue them with priority. The R interface has undergone several improvements and should be more robust now. A whole bunch of new documentation has been added, including a complete walkthrough starting with an SRAdbV2 query, fetching DEE2 data, and differential analysis with edgeR and DESeq. Also, bulk data dumps are again available via http. Dat turned out to be too slow and unreliable for file…

Get the newest Reactome gene sets for pathway analysis

For first-pass pathway analysis I find Reactome to be the most useful database of gene sets for biologists to understand. For a long time I have been using Reactome gene sets as deposited to the GSEA/MSigDB website. Recently a colleague pointed me to the gene matrix file offered directly on the Reactome webpage (Thanks Dr Okabe).

There are some differences. Firstly, there are more gene sets in the one from the Reactome webpage (accessed 2018-05-09):
$ wc -l *gmt
  674 c2.cp.reactome.v6.1.symbols.gmt
 2022 ReactomePathways.gmt
 2696 total
Secondly, there are more genes included in one or more gene sets:
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | tr '\t' '\n' | sort -u | wc -l
6025
$ cut -f3- ReactomePathways.gmt | tr '\t' '\n' | sort -u | wc -l
10852
And overall there are about threefold more gene-pathway entries:
$ cut -f3- ReactomePathways.gmt | wc -w
106405
$ cut -f3- c2.cp.reactome.v6.1.symbols.gmt | wc -w
37601
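To make it clearer what those three counts measure, here is a small sketch that runs the same commands on a toy GMT file. The two pathways and their genes are made up for illustration. It assumes the usual GMT layout, where column 1 is the set name, column 2 is a description, and the genes start at column 3.

```shell
# Build a tiny two-pathway GMT file (hypothetical content, for illustration only)
gmt=$(mktemp)
printf 'PATHWAY_A\tdescA\tTP53\tBRCA1\tMYC\n' > "$gmt"
printf 'PATHWAY_B\tdescB\tMYC\tEGFR\n' >> "$gmt"

# Number of gene sets: one line per set
n_sets=$(wc -l < "$gmt" | tr -d ' ')

# Number of distinct genes: drop the first two columns, one gene per line, deduplicate
n_genes=$(cut -f3- "$gmt" | tr '\t' '\n' | sort -u | wc -l | tr -d ' ')

# Number of gene-pathway entries: every gene occurrence counts, duplicates included
n_entries=$(cut -f3- "$gmt" | wc -w | tr -d ' ')

echo "gene sets: $n_sets"
echo "unique genes: $n_genes"
echo "gene-pathway entries: $n_entries"
```

MYC appears in both toy pathways, so it counts once towards the unique genes but twice towards the gene-pathway entries.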
I also looked at whether the gen…

Publishing datasets on the dat network - benefits and pitfalls

As I mentioned in an earlier post, Dat is a new data sharing tool that uses concepts of bittorrent and git to enable peer-to-peer sharing of versioned data. This is cool for sharing datasets that change over time, because when you sync the dataset, only the changes are retrieved, sort of like git. As it uses peer-to-peer technology, it is fairly resilient to node failures as the datasets are mirrored between peers. The "dat publish" command registers the repository on, meaning that the files can be retrieved by anyone via a normal browser.

To demonstrate, I have released the bulk data dumps from my RNA-seq data processing project, DEE2, which consists of 158 GB of gene expression data. These data are freely available via a browser at or by using the dat command-line tool.

If you're after a single file, then you can use the following syntax to retrieve over https:

wget<long dat address>/<file…

Has RNA-seq overtaken microarrays?

We know RNA-seq has a number of advantages over array based analyses, but is RNA-seq taking over in terms of number of datasets published? I got curious and thought I'd investigate with some PubMed searches. I searched for "RNA-seq" and "microarray" and downloaded the CSV file which summarises the number of citations per year. As a type of control, I also searched "gene expression".

I divided the yearly "RNA-seq" and "Microarray" citation counts by the "Gene expression" counts then multiplied by 1000 to give the numbers seen below.
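The normalisation itself is simple arithmetic, and it can be sketched with awk as follows. The yearly counts here are invented placeholders, not the real PubMed numbers, and the tab-separated layout (year, RNA-seq, microarray, gene expression) is an assumption made for the example rather than the format of the PubMed CSV.

```shell
# Hypothetical yearly citation counts: year, RNA-seq, microarray, gene expression
counts=$(mktemp)
printf '2016\t500\t2000\t100000\n' > "$counts"
printf '2017\t900\t1800\t90000\n' >> "$counts"

# Divide each count by the "gene expression" count and multiply by 1000
normalised=$(awk -F'\t' '{ printf "%s\t%.1f\t%.1f\n", $1, $2 / $4 * 1000, $3 / $4 * 1000 }' "$counts")

echo "$normalised"
```

Scaling by the "gene expression" counts corrects for overall growth in the literature, so the two trends are comparable from year to year.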

You can see that microarray is still more frequent in PubMed than RNA-seq, but the gap is getting much narrower and the crossover will likely occur in the next two years.

Next, I will look at the rate of GEO data deposition. (Updates soon)

Share and backup data sets with Dat

If you work in genomics, you'll know that sharing large data sets is hard. For instance, our group has shared data with our collaborators in a number of ways:

DVDs, hard drives and flash drives
FTP
Hightail
Google Drive links
Amazon links
SCP/PSCP
rsync

But none of these are ideal, as we know data sets change over time and none of the above methods are suited to updating a file tree with changes. If changes occur, then it quickly becomes a mess of files that are either redundant or missing entirely. Copied files could become corrupted. What we need is a type of version control for data sets. That's the goal of dat.

So now I'll take you through a simple example of sharing a data set using dat.

#Install instructions for Ubuntu 16.04
$ sudo npm cache clean -f
$ sudo npm install -g n
$ sudo n stable
$ sudo npm install -g dat

# Files I'm sharing on PC 1: DGE table and 3 genelists (3.4 MB)
$ tree
├── Aza_DESeq_wCounts.tsv
└── list
    ├── Aza_DESeq_wCounts_bg.txt
    ├── Aza_DESeq_wCounts_dn.…

Update on DEE2 project for Jan 2018

Today I'd like to share some updates on the DEE2 project, which I wrote about in an earlier post. The project code can be viewed on GitHub here.

Pipeline and images

As the pipeline was recently finalised, I was able to roll out the working docker image. To facilitate users without root access, this image was ported to singularity. This took a lot of effort and some expertise from our local HPC team to get things working (many thanks to the Massive/M3 team). The singularity image is available from the webserver (link) and instructions for running it are available on GitHub here. I have started testing a heavyweight singularity image that includes the genome indexes. It will be more efficient for running jobs with large genomes, and I will make it available once testing is complete.

Queue management

It may sound simple to write a script to determine which datasets have been completed and add new datasets to the queue, but when talking about tens of thousands of datasets it's …