Posts

Showing posts with the label Ensembl

Mapping NGS data: which genome version to use?

Image
Aligning reads to the genome is a key step in nearly all NGS data pipelines, the quality of an alignment will dictate the quality of the final results. So for beginners in this space, the options available can be a bit overwhelming.

Which options are available? Depending on what species you are working on, you will have either a limited number of choices or a vast number of choices. These include NCBI, Ensembl, UCSC as well as the consortia that generate these genome builds, such as the Human Genome Reference Consortium for human and TAIR for Arabidopsis. My recommendation at this point is Ensembl, for a number of reasons: It is clear to see what genome build and version just from the file names. Contrast "hg38.fa.gz" for UCSC vs "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for EnsemblFrom the Ensembl file name you can tell whether its masked, and whether its "primary assembly" or "toplevel".The website is intuitive, ftp downloads are fast…

Interspecies gene name conversion

Image
In this post, I'll provide a step-by-step guide to perform interspecies gene name conversion of gene expression data. This is a necessary step in the comparison of profiling data from two different experiments with different species (human and mouse), and allows us to use extensive human-centric gene set libraries in MSigDB when analysing non-human mammalian profiling data (such as mouse).

I performed GEO2R analysis of mouse expression data (GSE30192) to analyse the effect of azacitidine on mouse C2C12 myoblasts. The data looks like this:
"ID""adj.P.Val""P.Value""t""B""logFC""Gene.symbol""Gene.title"
"1420647_a_at" "0.000346" "2.24e-08" "56.073665" "8.699524" "6.9755573" "Krt8" "keratin 8"
"1423327_at" "0.000346" "2.32e-08" "55.685912" "8.686447" "3.8096523" "Rpl…