Publishing datasets on the dat network - benefits and pitfalls
As I mentioned in an earlier post, Dat is a new data sharing tool that borrows concepts from BitTorrent and Git to enable peer-to-peer sharing of versioned data. This is great for sharing datasets that change over time, because when you sync a dataset, only the changes are retrieved, much like git. As it uses peer-to-peer technology, it is fairly resilient to node failures, since datasets are mirrored between peers. The "dat publish" command registers the repository on datbase.org, meaning that the files can be retrieved by anyone via a normal browser.
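For context, publishing a dataset of your own looks roughly like this (a sketch based on my usage; the directory name here is hypothetical, and registering on datbase.org requires an account there):

cd my_dataset   # directory containing the files to share
dat share       # index the files and start seeding on the dat network
dat publish     # register the archive on datbase.org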
To demonstrate, I have released the bulk data dumps from my RNA-seq data processing project, DEE2, which consist of 158 GB of gene expression data. These data are freely available via a browser at https://datbase.org/dee2/bulk or by using the dat command-line tool.
If you're after a single file, you can use the following syntax to retrieve it over HTTPS:
wget https://datbase.org/download/<long dat address>/<file name>
So to get the STAR counts for E. coli, use the following command (visit https://datbase.org/dee2/bulk first to obtain the unique dat address):
wget https://datbase.org/download/d2d865cfbb829e15f19814c5f14c19ad1c32dd5b8707264419d00d9c1942c2df/ecoli_se.tsv.bz2
If you want the whole repo, use the "clone" command, which downloads the complete latest version. Visit https://datbase.org/dee2/bulk to get the current dat address, then clone the repo:
dat clone dat://d2d865cfbb829e15f19814c5f14c19ad1c32dd5b8707264419d00d9c1942c2df
Later, if you want to get any updates, "cd" into the directory and run "dat pull". This will download any changes to the dataset and exit once finished.
If you would like to support the network by mirroring the repo and re-sharing it, use the "dat sync" command. The more syncing participants there are, the faster the data transfers will be.
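For example, from inside the cloned repo:

cd <cloned repo directory>
dat sync   # stay online, fetching updates and re-sharing them to other peers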
From my home PC, I found that the data transfer rate was pretty slow compared to standard HTTP or FTP. I also noticed that dat share would frequently drop out. I'm not sure whether this is a firewall issue or something else.
**Edit. It looks like there is a problem with dat itself, as it gives an error that looks to be critical, preventing the main data source from sharing updated files. I've logged an issue on GitHub in the hope it might get fixed.
$ dat share
Error: Could not satisfy length
at onread (/usr/local/lib/node_modules/dat/node_modules/random-access-file/index.js:75:36)
at FSReqWrap.wrapper [as oncomplete] (fs.js:658:17)
***Extra edit. To get around this issue, I had to delete the .dat folder, which contains the metadata, and then run dat share again to re-index the updated dataset. This has the unfortunate side effect of changing the dat address, so the one quoted above will only work until the next update. Once indexing finished, I ran dat publish again and it seemed to work OK. While this workaround does the job, it means that the dataset is unavailable while indexing is happening, and that mirrors will need to download the entire dataset from scratch. Until this bug is fixed and data transfer rates improve, I can't yet recommend dat for dataset sharing.
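For reference, the workaround boiled down to these steps, run from the dataset directory (note that this discards the old dat address and generates a new one):

rm -rf .dat    # delete the metadata folder, throwing away the old dat address
dat share      # re-index the dataset from scratch and start seeding under a new address
dat publish    # re-register the new address on datbase.org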