Posts

Showing posts from October, 2018

Extract data from a spreadsheet file on the linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If this isnt installed on your linux machine then you most likely can get it from the package repository.

$ sudo apt install ssconvert

Then if you want to extract a spreadsheet file into a tsv it can be done like this:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv


You will notice that all the sheets are output to separate tsv files. This approach is nice as it can accommodate high throughput screening, as I implemented in my Gene Name Errors paper a while back.

Here is an example of obtaining some data from GEO.

$ #first download
$ curl 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE80251&format=file&file=GSE80251%5Fprocessed%5FRNA%5Fexpression%5Fmnfyap%2Exlsx' > GSE80251.xlsx


$ #now extract $ ssconvert -S --export-type Gnumeric_stf:stf_assistant …