Mapping NGS data: which genome version to use?

Aligning reads to the genome is a key step in nearly all NGS data pipelines, the quality of an alignment will dictate the quality of the final results. So for beginners in this space, the options available can be a bit overwhelming.

Which options are available?

Depending on what species you are working on, you will have either a limited number of choices or a vast number of choices. These include NCBI, Ensembl, UCSC as well as the consortia that generate these genome builds, such as the Human Genome Reference Consortium for human and TAIR for Arabidopsis. My recommendation at this point is Ensembl, for a number of reasons:
  • It is clear to see what genome build and version just from the file names. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for Ensembl
  • From the Ensembl file name you can tell whether its masked, and whether its "primary assembly" or "toplevel".
  • The website is intuitive, ftp downloads are fast and reliable.
  • Ensembl data is available through Biomart and in R.
  • Ensembl cover many animal, plant, microbe, other genomes.
  • Ensembl-based gene annotations are superior (IMHO) to UCSC, RefSeq and others; especially if you're interested in ncRNA.

Which Ensembl fasta to use?

All fine and well to choose Ensembl, but for human there are 957 different fasta files there to choose from. Don't panic. The choice is simple. Keep these rules in mind:

Repeat Masking?

The short answer is that for the purpose of mapping NGS reads, you don't want to use a RepeatMasked genome. Repeat-Masking is a process to identify all repetitive parts of the genome such as LINE-1 and Alu and replace these sequences with "N". While this may sound like a good idea, it's not because mapping software are specifically designed to handle reads from these repeats and have sophisticated heuristics to either discard those alignments or find all possible alignments. So select a build that has no masking and avoid any with "rm" in the name (ie: Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz). The files with "sm" in the name are "soft masked" which means instead of the repeat regions being given Ns, they are converted to lower-case. These are OK to use because most major aligners recognise lower-case letters as valid bases.

Primary assembly vs Top level?

The short answer is to choose the "Primary assembly" (i.e. Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz) as it does not contain the haplotype information. The top level file contains additional sequences that are relatively common variants to the reference. Most mappers available now don't specifically handle these haplo sequences and as such they will appear as simply another contig, therefore complicating the alignment. Perhaps in future, mappers might be better able to handle these hypervariable regions.

Cool, I can see chromosomes 1 to Y but what the hell are these other ones??

These aren't new chromosomes, they're just genomic regions that have been sequenced but we don't know exactly they should be placed in the chromosomal build (extrachromosomal scaffolds). You have to remember that the first draft of a genome generally consists of thousands (or hundreds of thousands) of "contigs" from an assembly process that could be flanked by large repeat elements, meaning we don't know the adjacent genes. These extrachromosomal scaffolds require direct laboratory verification. You might notice that the number of these has substantially decreased between early and recent builds as more regions have been given places in chromosomes. You should resist the urge to remove these extrachromosomal scaffolds from your reference because they are real and they contain 372 genes according to the current annotation file (Homo_sapiens.GRCh38.80.gtf.gz).

Should I use the latest version?

Short answer: Generally, yes. But of course this depends. If a genome build is very new, then it's likely that there will be limited resources for that build as it takes some time for Ensembl, UCSC, etc, to integrate gene annotations, functional information, other genomic data on to the new build. For instance, in some projects we are still using the older version of the human genome (GRCh37/hg19) because the aim of those projects is to integrate with a bunch of ENCODE data which is not yet available for the new genome build.

Gotchas

Its important to perform a couple of checks before jumping into your analysis pipeline:
  • Check that the download finished completely by checking that all chromosomes are present and that md5 sums match between local files and the checksums posted on the Ensembl webpage.
  • Download the genome build and annotation information at the same time (and same provider) and keep these in a "reference_genome" directory in each of your project directories.
  • Check whether the genome contains non A, C, G, T & N bases. Some mappers might not handle wobble bases like K, Y, M, R, etc and could cause an error (Looking at you HPG).
  • Often, you can get pre-indexed genomes from the groups providing the mapping software, but why take the risk when it only takes an hour or so to index a human genome yourself? If the genome was indexed with a version different to your own then it could lead to serious hair-pulling.

Further reading:

http://www.bio-itworld.com/2014/1/27/getting-know-new-reference-genome-assembly.html
https://groups.google.com/forum/#!msg/rna-star/1ngCYlgAbow/g5B5g83Vim8J
https://groups.google.com/forum/#!topic/rna-star/5mazzwVEoJ4

Popular posts from this blog

Mass download from google drive using R

Data analysis step 8: Pathway analysis with GSEA

Installing R-4.0 on Ubuntu 18.04 painlessly