Extract data from a spreadsheet file on the linux command line

Sometimes we need to extract data from an Excel spreadsheet for analysis. Here is one approach using the ssconvert tool.

If this isnt installed on your linux machine then you most likely can get it from the package repository.

$ sudo apt install ssconvert

Then if you want to extract a spreadsheet file into a tsv it can be done like this:

$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' SomeData.xlsx SomeData.xlsx.tsv


You will notice that all the sheets are output to separate tsv files. This approach is nice as it can accommodate high throughput screening, as I implemented in my Gene Name Errors paper a while back.

Here is an example of obtaining some data from GEO.

$ #first download
$ curl 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE80251&format=file&file=GSE80251%5Fprocessed%5FRNA%5Fexpression%5Fmnfyap%2Exlsx' > GSE80251.xlsx


$ #now extract
$ ssconvert -S --export-type Gnumeric_stf:stf_assistant -O 'separator="'$'\t''"' GSE80251.xlsx GSE80251.xlsx.txt

$ head -5 GSE80251.xlsx.txt.0 | cut -f-4
gene Untreated.AVE Erythromycin.AVE Clindamycin.AVE
aaaD 464.2738789413636 216.166053873777 908.8802442142787
aaaE 864.2227500897734 561.77470127908 662.8782261191445
aaeA 4380.496862018132 2618.7642171454263 5816.7951770285545
aaeB 7846.3239697416175 7792.316476955105 10312.333492435688


Further reading: ReadXL is an R package designed to import Excel data into R
https://readxl.tidyverse.org/

Popular posts from this blog

Data analysis step 8: Pathway analysis with GSEA

Uploading data to GEO - which method is faster?

Using GTF tools to get gene lengths