Share and back up data sets with Dat

If you work in genomics, you'll know that sharing large data sets is hard. For instance, our group has shared data with collaborators in a number of ways:

  • DVDs, hard drives and flash drives
  • FTP
  • Hightail
  • Google Drive links
  • Amazon links
  • SCP/PSCP
  • rsync

But none of these is ideal. Data sets change over time, and none of the above methods is well suited to updating a file tree with changes. When changes occur, the shared copy quickly becomes a mess of files that are either redundant or missing entirely, and copied files can become corrupted. What we need is a type of version control for data sets. That's the goal of dat.
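
To make the corruption problem concrete: with the methods listed above, checking that a copied file tree arrived intact means building and comparing checksums by hand. Here is a minimal sketch of that manual check, assuming sha256sum from GNU coreutils is available. The directory and file names are made-up placeholders, and this illustrates the manual workflow only, not how dat works internally.

```shell
# Sketch: verify a copied data set by hand with a checksum manifest.
# All paths and file names here are hypothetical placeholders.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/original" "$workdir/copy"

# Stand-in for a real data set.
printf 'gene\tlog2FC\nA\t1.5\n' > "$workdir/original/table.tsv"
printf 'A\n' > "$workdir/original/up.txt"

# 1. Record a checksum for every file in the original tree.
(cd "$workdir/original" && find . -type f -exec sha256sum {} +) > "$workdir/manifest.sha256"

# 2. "Transfer" the data (a plain copy stands in for FTP, SCP, etc.).
cp -r "$workdir/original/." "$workdir/copy/"

# 3. Verify the fresh copy against the manifest.
if (cd "$workdir/copy" && sha256sum --quiet -c "$workdir/manifest.sha256"); then
    intact_status="ok"
else
    intact_status="corrupt"
fi
echo "fresh copy: $intact_status"

# 4. Simulate corruption in transit and verify again.
printf 'garbage' >> "$workdir/copy/table.tsv"
if (cd "$workdir/copy" && sha256sum --quiet -c "$workdir/manifest.sha256" >/dev/null 2>&1); then
    damaged_status="ok"
else
    damaged_status="corrupt"
fi
echo "damaged copy: $damaged_status"

rm -rf "$workdir"
```

Doing this for every update of a large, changing data set gets tedious quickly, which is the gap a version-controlled sharing tool is meant to fill.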

So now I'll take you through a simple example of sharing a data set using dat.

#Install instructions for Ubuntu 16.04
$ sudo npm cache clean -f
$ sudo npm install -g n
$ sudo n stable
$ sudo npm install -g dat

# Files I'm sharing on PC 1: DGE table and 3 genelists (3.4 MB)
$ tree
.
├── Aza_DESeq_wCounts.tsv
└── list
    ├── Aza_DESeq_wCounts_bg.txt
    ├── Aza_DESeq_wCounts_dn.txt
    └── Aza_DESeq_wCounts_up.txt

1 directory, 4 files

#share current directory (PC 1)
$ dat share .
dat v13.10.0
dat://a40ce74db29e7e785b3a71511a566f3d30bce1fcda10ba3c4d79ce1fac62611c
Sharing dat: 4 files (3.4 MB)
0 connections | Download 0 B/s Upload 0 B/s
Watching for file updates
Ctrl+C to Exit

#Important! Need to keep dat share process active on PC 1 to enable later retrieval
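
One way to keep the sharing process alive after you close the terminal on PC 1 is to start it in the background with nohup (running it inside screen or tmux is another common option). This is a minimal sketch, assuming dat is on your PATH. The helper name share_in_background, the DAT_BIN override and the log file name dat_share.log are my own hypothetical choices, not part of dat.

```shell
# Sketch: start "dat share" so that it keeps running after logout.
# share_in_background, DAT_BIN and dat_share.log are hypothetical names.
share_in_background() {
    dir=${1:-.}
    dat_bin=${DAT_BIN:-dat}   # override only if dat is not on your PATH
    if ! command -v "$dat_bin" >/dev/null 2>&1; then
        echo "$dat_bin not found; install dat first" >&2
        return 1
    fi
    (
        cd "$dir" || exit 1
        # nohup makes the process ignore the hangup signal sent at logout;
        # everything dat prints is redirected to the log file.
        nohup "$dat_bin" share . > dat_share.log 2>&1 &
        echo "dat share started with PID $!; output is in $dir/dat_share.log"
    )
}
```

On PC 1 you would call share_in_background . from the directory being shared, then look in dat_share.log for the dat:// address to pass on. How dat formats its output when it is not attached to a terminal may differ from the interactive display shown above.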

#Now clone the data on PC 2
$ dat clone dat://a40ce74db29e7e785b3a71511a566f3d30bce1fcda10ba3c4d79ce1fac62611c
dat v13.10.0
Created new dat in
/scratch/mz/tmp/a40ce74db29e7e785b3a71511a566f3d30bce1fcda10ba3c4d79ce1fac62611c/.dat
Cloning: 4 files (3.4 MB)
1 connection | Download 217 KB/s Upload 0 B/s
dat sync complete.
Version 4
Exiting the Dat program...

Boom, the data is shared. It is downloaded to a folder named after the dat address (a40ce74...).

On PC 1 the data can be modified.

# shuffle the tsv and make a new tsv, then share
$ shuf Aza_DESeq_wCounts.tsv > Aza_DESeq_wCounts_shuf.tsv 
$ dat sync .

The updated data is then available to PC 2:
$ cd /scratch/mz/tmp/a40ce74db29e7e785b3a71511a566f3d30bce1fcda10ba3c4d79ce1fac62611c
$ dat pull .
dat v13.10.0
Downloading dat: 5 files (6.7 MB)
1 connection | Download 236 KB/s Upload 0 B/s
dat sync complete.
Version 7
Exiting the Dat program...

#check contents, see the new file was downloaded :)
$ tree
.
├── Aza_DESeq_wCounts_shuf.tsv
├── Aza_DESeq_wCounts.tsv
└── list
    ├── Aza_DESeq_wCounts_bg.txt
    ├── Aza_DESeq_wCounts_dn.txt
    └── Aza_DESeq_wCounts_up.txt

1 directory, 5 files

Dat will find many use cases, but it will be of immediate benefit to data analysts looking to share huge data sets that change over time. It does have limits: I tested dat with a folder of approximately 190,000 files and found that it started to encounter problems after about 25,000 files and wasn't able to complete archiving. Nevertheless, it will be great for data sets with fewer files, and even for things like multimedia files. Because dat is a P2P platform, shared data should be more resilient than with equivalent centralised services like FTP, Google Drive, etc., at least once more than one peer holds a copy. Also consider that this could be a useful way to back up big (valuable) data sets offsite.
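
Given the file-count problem described above, it may be worth counting the files in a directory before sharing it. Here is a minimal sketch. The default of 25,000 is simply the point where problems appeared in my test, not a documented dat limit, and count_files and check_file_count are hypothetical helper names.

```shell
# Sketch: count the files in a tree before sharing it with dat.
# count_files and check_file_count are hypothetical helper names.
count_files() {
    # Count regular files below the given directory (default: current),
    # skipping dat's own .dat metadata folder if one already exists.
    find "${1:-.}" -type f ! -path '*/.dat/*' | wc -l | tr -d ' '
}

check_file_count() {
    dir=${1:-.}
    limit=${2:-25000}   # rough threshold from the test described above
    n=$(count_files "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "$dir: $n files, above $limit; dat may struggle to archive this"
        return 1
    fi
    echo "$dir: $n files, within $limit"
}
```

Running check_file_count . in the directory you plan to share prints the count and returns a non-zero status if it exceeds the limit, so it can also be used as a guard in a script.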

Dat also offers the ability to publicly release data on the Dat registry using the "dat publish" command, which allows users to find your dat with a shorter URL rather than the 64-character address. Published dats have a neat web interface that allows end users to download individual files of interest over HTTP.

These features have been used to develop Beaker Browser, a P2P browser with built-in tools for hosting content directly.

