10 quick tips for genomics data management

I get asked a lot about the best ways to store sequence data, because the files are massive and researchers have varying levels of knowledge of the hardware and software involved. Here I'll run through some best practices for genomics research data management, based on my 10 years of experience in the space.

1. Always work on servers, not personal machines or laptops

On-prem machines and cloud servers are preferred because you can log into them from anywhere using SSH or another remote-access protocol. These machines are better suited to heavy workloads and are less likely to break down, thanks to institutional tech support and maintenance. Institutional data transfer speeds will also be far superior to your home network. Never do computational work on a laptop, and avoid storing data on your own portable hard drives or flash drives. If you don't have a server, ask for access at your institution or a research cloud provider (we use Nectar in Australia).
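Logging in from any machine is a one-liner (the username and hostname here are placeholders for your own):

ssh jane@bioserver.example.edu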

2. Download the data to the place where you will be working on it the most

The raw sequencing data should be downloaded into a project subfolder called "fastq" or similar.

I recommend using a command line tool, because these are better suited to very large files than browser downloads. Your genomics service provider will probably give you an FTP or rsync command to use. Download to the server where you will be working the most.
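As a rough sketch, a resumable rsync pull into the fastq subfolder might look like this (the host and remote path are placeholders; your provider will supply the real ones):

rsync -avP provider@transfer.example.com:/runs/run123/ fastq/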

3. Write a project README

It should contain an overview of the purpose of the experiment and the sample groups. Include metadata for each file, such as sample descriptions and any comparisons that need to be made.
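A minimal skeleton can be written straight from the shell (the project title, section names and sample labels below are just illustrative suggestions):

cat > README.md << 'EOF'
# Mouse liver RNA-seq, 2024

## Purpose
Test the effect of drug X on liver gene expression.

## Sample groups
control: S1, S2, S3
treated: S4, S5, S6

## Files
S1_R1.fastq.gz / S1_R2.fastq.gz  control, replicate 1
(one line per sample)

## Planned comparisons
treated vs control
EOF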

4. Check that the files are complete and not corrupted 

The service provider will also give you a checksum for each file, which acts like a digital fingerprint. The md5 method is the most widely used. You should verify the checksums of the files you've downloaded against the provider's values. For a single file it can be done like this:

md5sum mysample.fastq.gz

If you have lots of checksums in a file called checksums.md5, you can check them en masse with parallel processing:

cat checksums.md5 | parallel --pipe -N1 md5sum -c

5. Immediately copy data to institutional research data store (RDS)

Nearly every university and research institute will have a data store where you can deposit the raw data and metadata (sample descriptions). It is really important to do this in case your server's disks fail. It also protects you against events like natural disasters, because research data stores have geographically independent redundancy built in; for example, RDS data is snapshotted regularly and stored at other locations. You will need to repeat the md5sum checks on the RDS copy. From time to time you should also check that the data in the RDS is recoverable, so put an annual reminder in your diary.
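A sketch of the copy-and-verify step, assuming the RDS is mounted at /rds/myproject (adjust the path for your site):

rsync -avP fastq/ /rds/myproject/fastq/
cd /rds/myproject/fastq/ && md5sum -c checksums.md5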

6. Do not rely on the service provider to keep a copy for you

This includes people who are keeping their data on Illumina BaseSpace. You still need to follow tip #5 above, because genomics service providers will normally delete the data after a few months.

7. Do your preliminary analysis carefully

Once downloaded and checked, it is a good idea to set the files to read-only using the immutable flag so they won't accidentally be modified or deleted by you or someone else:

chattr +i foo/bar.fastq.gz
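Note that setting the immutable flag requires root privileges and a filesystem that supports it (e.g. ext4). A rootless fallback is to simply remove write permission:

chmod a-w foo/bar.fastq.gz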

Run a fastq file validator (e.g. https://genome.sph.umich.edu/wiki/FastQValidator) to check the integrity of each file, including whether it is complete and whether paired files have the same number of reads.
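With fastQValidator, for instance, validation is one command per file, and you can compare the read counts it reports between R1 and R2 (file names here are placeholders):

fastQValidator --file fastq/S1_R1.fastq.gz
fastQValidator --file fastq/S1_R2.fastq.gz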
Run fastqc and multiqc to assess the overall quality of the data.
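A typical invocation, assuming the raw files live in fastq/ and reports should go to qc/:

mkdir -p qc
fastqc --threads 8 -o qc fastq/*.fastq.gz
multiqc -o qc qc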

8. Consider uploading to SRA/GEO early

Once you've assessed the quality of the dataset and it represents a complete experiment, it is a good idea to upload it to an archiving repository. My work is mostly epigenomics and RNA-seq, so I submit to NCBI GEO. If you are doing other applications such as genome sequencing, you could submit to NCBI SRA. You can keep the submission private for 12 months, or however long you expect it will need to be embargoed. The good thing about this is that if there is some future disaster with the lab group, institution or yourself, the data will at least be available in future.

The added benefit of SRA/GEO upload is that you can then delete the raw files from your working server to make space for new projects. 

9. Formalise your data management plan in writing

Draft an SOP and discuss it with your lab group. Share it with your project collaborators so they are aware of how the data is managed. Show your lab head and other members how to recover the data. 

10. Consider file formats carefully

The standard for genomics is compressed fastq, which will have a .fastq.gz or .fq.gz suffix. If your files are not compressed, use gzip or pigz to compress them, especially before long-term storage. While there are alternative compression tools available, the gains over gzip are modest. An alternative storage format is BAM or CRAM, which is most useful for large-scale genome resequencing data, like the 1000 Genomes Project. Whatever you do, make sure that the format is something that will still be readable in the future.
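A sketch of parallel compression with pigz (the thread count is a placeholder; match it to your machine). pigz writes standard gzip output, so the result is an ordinary .fastq.gz:

pigz -p 8 fastq/*.fastq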
