10 quick tips for genomics data management
I get asked a lot about the best ways to store sequence data, because the files are massive and researchers have varying levels of familiarity with the hardware and software. Here I'll run through some best practices for genomics research data management, based on my 10 years of experience in the field.
1. Always work on servers, not desktop machines or laptops
2. Download the data to the place where you will be working on it the most.
The raw sequencing data should be downloaded into a project subfolder called "fastq" or similar.
I recommend using a command-line tool, because these cope better with very large files. Your genomics service provider will probably give you an ftp or rsync command to use. Run it on the server where you will be doing most of the work.
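As a rough sketch of that layout, the snippet below creates a project folder with a "fastq" subfolder. The project name, host and remote path are placeholders I made up for illustration; use the exact command your provider sends you. The rsync line is commented out because it only works with real credentials.

```shell
# Hypothetical project name; pick your own.
project="my_project"

# Create the project folder with a subfolder for the raw reads.
mkdir -p "$project/fastq"

# Placeholder download command (assumed host and path, not a real provider).
# -a keeps timestamps and permissions, -v lists files as they transfer,
# --partial keeps half-finished files so an interrupted transfer can resume.
# rsync -av --partial user@provider.example.org:/path/to/run/ "$project/fastq/"
```

Rerunning the same rsync command after a dropped connection picks up where it left off, which matters when a single run can be hundreds of gigabytes.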
3. Write a project README
4. Check that the files are complete and not corrupted
The service provider will also give you a checksum for each file, which acts like a digital fingerprint. MD5 is the most commonly used method. You should verify the checksums of the files you've downloaded. If it's just one file, run md5sum on it and compare the output with the value from the provider.
If you have lots of checksums in a file called checksums.md5, you can check them en masse with parallel processing (link for more Unix one-liners here):
cat checksums.md5 | parallel --pipe -N1 md5sum -c
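The one-liner above needs GNU parallel. If it isn't installed on your server, a sequential check with plain md5sum does the same job, just more slowly. The sketch below assumes GNU md5sum and builds a tiny made-up data set (the folder demo_fastq and its files are hypothetical) so that it runs as-is; in practice checksums.md5 comes from your provider and you would not generate it yourself.

```shell
# Build a tiny stand-in data set (hypothetical names) so the sketch runs as-is.
mkdir -p demo_fastq
printf 'ACGT\n' > demo_fastq/a.fastq
printf 'TTGA\n' > demo_fastq/b.fastq

# Normally the provider supplies this file; here we generate it for the demo.
(cd demo_fastq && md5sum a.fastq b.fastq > checksums.md5)

# Sequential check: prints one OK or FAILED line per file listed.
(cd demo_fastq && md5sum -c checksums.md5)

# With --quiet only problem files are printed, and the exit status is
# non-zero if any file fails, so the message appears only on full success.
(cd demo_fastq && md5sum -c --quiet checksums.md5) && echo "all files verified"
```

Run the check from the folder that holds the files, because the file names in checksums.md5 are resolved relative to the current directory.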