
Share and backup data sets with Dat

If you work in genomics, you'll know that sharing large data sets is hard. For instance, our group has shared data with our collaborators in a number of ways:

- DVDs, hard drives and flash drives
- FTP
- Hightail
- Google Drive links
- Amazon links
- SCP/PSCP
- rsync
But none of these are ideal: data sets change over time, and none of the above methods are suited to updating a file tree with changes. When changes occur, it quickly becomes a mess of files that are either redundant or missing entirely, and copied files can become corrupted. What we need is a kind of version control for data sets. That's the goal of dat.

So now I'll take you through a simple example of sharing a data set using dat.

# Install instructions for Ubuntu 16.04
$ sudo npm cache clean -f
$ sudo npm install -g n
$ sudo n stable
$ sudo npm install -g dat

# Files I'm sharing on PC 1: DGE table and 3 genelists (3.4 MB)
$ tree
.
├── Aza_DESeq_wCounts.tsv
└── list
    ├── Aza_DESeq_wCounts_bg.txt
    ├── Aza_DESeq_wCounts_dn.…
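
With dat installed, sharing and cloning a data set boils down to a handful of commands. Here is a rough sketch of the typical workflow with the dat command-line tool (the dat:// key is a placeholder, not the actual link for this data set):

# On PC 1: share the directory (prints a dat:// link and keeps seeding)
$ cd /path/to/data
$ dat share

# On PC 2: clone the data set using that link
$ dat clone dat://<64-character-key> data

# After files change on PC 1, re-sync
$ dat sync    # PC 1: advertise the updated version
$ dat pull    # PC 2: fetch only the changes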

Test-driving parallel compression software

Next-generation sequencing is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure, including disk space, memory and CPU. Bioinformaticians need to be aware of the performance of the different compression software available in order to make the most of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.

The file we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads across 103.0 million lines. Here are the first 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR117…
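
To give an idea of how this kind of comparison can be scripted, here is a minimal sketch assuming gzip, pigz, bzip2 and pbzip2 are installed; the filename and thread counts are illustrative rather than the exact commands used for the results in this post.

#!/bin/bash
# Minimal compression benchmark sketch: time (de)compression and record size on disk
FQ=SRR1171523.fastq    # hypothetical filename for the 5.0 GB fastq

for CMD in "gzip -k" "pigz -k -p8" "bzip2 -k" "pbzip2 -k -p8" ; do
  echo "== $CMD =="
  cp "$FQ" test.fastq
  time $CMD test.fastq         # compression time (wall clock)
  du -b test.fastq.*           # compressed size on disk
  rm test.fastq
  time $CMD -d test.fastq.*    # decompression time
  rm -f test.fastq test.fastq.*
done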

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of colossal data sets, here's something you'll find useful: a parallel bzip2 compressor called PBZIP2. OK, so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:

pbzip2 -p8 FC001_sequence.fq

To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:

pbzip2 -dk -p10 FC001_sequence.fq.bz2

To test the integrity of a compressed file:

pbzip2 -t FC001_sequence.fq.bz2
How to compress an entire directory:

tar cvf - directory | pbzip2 -c > directory.tar.bz2
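
And to unpack that archive again, decompress to stdout and pipe back into tar (a sketch using the same file name):

pbzip2 -dc directory.tar.bz2 | tar xvf -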


Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [http://compression.ca]
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filen…