Showing posts from June, 2015

Mapping NGS data: which genome version to use?

Aligning reads to the genome is a key step in nearly all NGS data pipelines, and the quality of the alignment will dictate the quality of the final results. For beginners in this space, the options available can be a bit overwhelming.

Which sources are available? Depending on the species you work on, you will have either a handful of choices or a great many. They include NCBI, Ensembl and UCSC, as well as the consortia that generate the genome builds, such as the Genome Reference Consortium for human and TAIR for Arabidopsis.

My recommendation at this point is Ensembl, for a number of reasons: The genome build and version are clear from the file names alone. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for Ensembl. From the Ensembl file name you can tell whether the sequence is masked ("dna_sm" means soft-masked), and whether it is the "primary assembly" or "toplevel". The website is intuitive, ftp do