Publications by authors named "Mark Diekhans"

67 Publications

A high-quality bonobo genome refines the analysis of hominid evolution.

Nature 2021 06 5;594(7861):77-81. Epub 2021 May 5.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

Source
DOI: http://dx.doi.org/10.1038/s41586-021-03519-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8172381
June 2021

Towards complete and error-free genome assemblies of all vertebrate species.

Nature 2021 Apr 28;592(7856):737-746. Epub 2021 Apr 28.

UQ Genomics, University of Queensland, Brisbane, Queensland, Australia.

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Source
DOI: http://dx.doi.org/10.1038/s41586-021-03451-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8081667
April 2021

Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility.

Science 2020 12;370(6523)

Department of Biology, University of Bari 'Aldo Moro', 70125 Bari, Italy.

The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.

Source
DOI: http://dx.doi.org/10.1126/science.abc6617
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7818670
December 2020

GENCODE 2021.

Nucleic Acids Res 2021 01;49(D1):D916-D923

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1087
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778937
January 2021

The UCSC Genome Browser database: 2021 update.

Nucleic Acids Res 2021 01;49(D1):D1046-D1057

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

For more than two decades, the UCSC Genome Browser database (https://genome.ucsc.edu) has provided high-quality genomics data visualization and genome annotations to the research community. As the field of genomics grows and more data become available, new modes of display are required to accommodate new technologies. New features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements. Striving to keep data up-to-date, new updates to gene annotations include GENCODE Genes, NCBI RefSeq Genes, and Ensembl Genes. New data tracks added for human and mouse genomes include the ENCODE registry of candidate cis-regulatory elements, promoters from the Eukaryotic Promoter Database, and NCBI RefSeq Select and Matched Annotation from NCBI and EMBL-EBI (MANE). Within weeks of learning about the outbreak of coronavirus, UCSC released a genome browser, with detailed annotation tracks, for the SARS-CoV-2 RNA reference assembly.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1070
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7779060
January 2021

Dense sampling of bird diversity increases power of comparative genomics.

Nature 2020 11 11;587(7833):252-257. Epub 2020 Nov 11.

Centre for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg, Denmark.

Whole-genome sequencing projects are increasingly populating the tree of life and characterizing biodiversity. Sparse taxon sampling has previously been proposed to confound phylogenetic inference, and captures only a fraction of the genomic diversity. Here we report a substantial step towards the dense representation of avian phylogenetic and molecular diversity, by analysing 363 genomes from 92.4% of bird families, including 267 newly sequenced genomes produced for phase II of the Bird 10,000 Genomes (B10K) Project. We use this comparative genome dataset in combination with a pipeline that leverages a reference-free whole-genome alignment to identify orthologous regions in greater numbers than has previously been possible and to recognize genomic novelties in particular bird lineages. The densely sampled alignment provides a single-base-pair map of selection, has more than doubled the fraction of bases that are confidently predicted to be under conservation and reveals extensive patterns of weak selection in predominantly non-coding DNA. Our results demonstrate that increasing the diversity of genomes used in comparative studies can reveal more shared and lineage-specific variation, and improve the investigation of genomic characteristics. We anticipate that this genomic resource will offer new perspectives on evolutionary processes in cross-species comparative analyses and assist in efforts to conserve species.

Source
DOI: http://dx.doi.org/10.1038/s41586-020-2873-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7759463
November 2020

Progressive Cactus is a multiple-genome aligner for the thousand-genome era.

Nature 2020 11 11;587(7833):246-251. Epub 2020 Nov 11.

UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA.

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

Source
DOI: http://dx.doi.org/10.1038/s41586-020-2871-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673649
November 2020

Transcriptional activity and strain-specific history of mouse pseudogenes.

Nat Commun 2020 07 29;11(1):3695. Epub 2020 Jul 29.

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.

Pseudogenes are ideal markers of genome remodelling. In turn, the mouse is an ideal platform for studying them, particularly with the recent availability of strain-sequencing and transcriptional data. Here, combining both manual curation and automatic pipelines, we present a genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains (available via the mouse.pseudogene.org resource). We also annotate 165 unitary pseudogenes in mouse and 303 in human. The overall pseudogene repertoire in mouse is similar to that in human in terms of size, biotype distribution, and family composition (e.g. with GAPDH and ribosomal proteins being the largest families). Notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of all pseudogenes are unique, reflecting strain-specific evolution. Finally, we find that ~15% of the mouse pseudogenes are transcribed, and that highly transcribed parent genes tend to give rise to many processed pseudogenes.

Source
DOI: http://dx.doi.org/10.1038/s41467-020-17157-w
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7392758
July 2020

halSynteny: a fast, easy-to-use conserved synteny block construction method for multiple whole-genome alignments.

Gigascience 2020 06;9(6)

Computer Technologies Laboratory, School of Translational Information Technologies, ITMO University, 49 Kronverkskiy Pr., St. Petersburg 197101, St. Petersburg, Russian Federation.

Background: Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequence stored in a binary format demand an accurate and efficient computational approach for synteny block production.

Findings: halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred-way, reference-free vertebrate alignments built with the Cactus system.

Conclusions: halSynteny enables an accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.

Source
DOI: http://dx.doi.org/10.1093/gigascience/giaa047
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7254927
June 2020

AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature.

Sci Transl Med 2020 05;12(544)

Department of Computer Science, Stanford University, Stanford, CA 94305, USA.

The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.

Source
DOI: http://dx.doi.org/10.1126/scitranslmed.aau9113
May 2020

Re-annotation of 191 developmental and epileptic encephalopathy-associated genes unmasks de novo variants in SCN1A.

NPJ Genom Med 2019 2;4:31. Epub 2019 Dec 2.

Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, CB2 0XY UK.

The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional 'footprint' of these genes by over 674 kb. Using SCN1A as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo SCN1A variants that now, through improved gene annotation, are ascribed to residing among our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously classified intronic Dravet syndrome-associated variant that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.

Source
DOI: http://dx.doi.org/10.1038/s41525-019-0106-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6889285
December 2019

UCSC Genome Browser enters 20th year.

Nucleic Acids Res 2020 01;48(D1):D756-D761

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

The University of California Santa Cruz Genome Browser website (https://genome.ucsc.edu) enters its 20th year of providing high-quality genomics data visualization and genome annotations to the research community. In the past year, we have added a new option to our web BLAT tool that allows search against all genomes, a single-cell expression viewer (https://cells.ucsc.edu), a 'lollipop' plot display mode for high-density variation data, a RESTful API for data extraction and a custom-track backup feature. New datasets include Tabula Muris single-cell expression data, GeneHancer regulatory annotations, The Cancer Genome Atlas Pan-Cancer variants, Genome Reference Consortium Patch sequences, new ENCODE transcription factor binding site peaks and clusters, the Database of Genomic Variants Gold Standard Variants, Genomenon Mastermind variants and three new multi-species alignment tracks.

Source
DOI: http://dx.doi.org/10.1093/nar/gkz1012
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145642
January 2020

The UCSC Genome Browser database: 2019 update.

Nucleic Acids Res 2019 01;47(D1):D853-D858

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

The UCSC Genome Browser (https://genome.ucsc.edu) is a graphical viewer for exploring genome annotations. For almost two decades, the Browser has provided visualization tools for genetics and molecular biology and continues to add new data and features. This year, we added a new tool that lets users interactively arrange existing graphing tracks into new groups. Other software additions include new formats for chromosome interactions, a ChIP-Seq peak display for track hubs and improved support for HGVS. On the annotation side, we have added gnomAD, TCGA expression, RefSeq Functional elements, GTEx eQTLs, CRISPR Guides, SNPpedia and created a 30-way primate alignment on the human genome. Nine assemblies now have RefSeq-mapped gene models.

Source
DOI: http://dx.doi.org/10.1093/nar/gky1095
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323953
January 2019

Whole-Genome Alignment and Comparative Annotation.

Annu Rev Anim Biosci 2019 02 31;7:41-64. Epub 2018 Oct 31.

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA.

Rapidly improving sequencing technology coupled with computational developments in sequence assembly are making reference-quality genome assembly economical. Hundreds of vertebrate genome assemblies are now publicly available, and projects are being proposed to sequence thousands of additional species in the next few years. Such dense sampling of the tree of life should give an unprecedented new understanding of evolution and allow a detailed determination of the events that led to the wealth of biodiversity around us. To gain this knowledge, these new genomes must be compared through genome alignment (at the sequence level) and comparative annotation (at the gene level). However, different alignment and annotation methods have different characteristics; before starting a comparative genomics analysis, it is important to understand the nature of, and biases and limitations inherent in, the chosen methods. This review is intended to act as a technical but high-level overview of the field that should provide this understanding. We briefly survey the state of the genome alignment and comparative annotation fields and potential future directions for these fields in a new, large-scale era of comparative genomics.

Source
DOI: http://dx.doi.org/10.1146/annurev-animal-020518-115005
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6450745
February 2019

GENCODE reference annotation for the human and mouse genomes.

Nucleic Acids Res 2019 01;47(D1):D766-D773

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

Source
DOI: http://dx.doi.org/10.1093/nar/gky955
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323946
January 2019

Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.

Nat Genet 2018 11 1;50(11):1574-1583. Epub 2018 Oct 1.

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.

Source
DOI: http://dx.doi.org/10.1038/s41588-018-0223-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6205630
November 2018

Comparative Annotation Toolkit (CAT): simultaneous clade and personal genome annotation.

Genome Res 2018 07 8;28(7):1029-1038. Epub 2018 Jun 8.

Genomics Institute, University of California Santa Cruz and Howard Hughes Medical Institute, Santa Cruz, California 95064, USA.

The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultracontiguous genome assemblies. To compare these genomes, we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms, and structural variants (even in genomes as well studied as rat and the great apes) and how these annotations improve cross-species RNA expression experiments.

Source
DOI: http://dx.doi.org/10.1101/gr.233460.117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6028123
July 2018

High-resolution comparative analysis of great ape genomes.

Science 2018 06;360(6393)

Bionano Genomics, San Diego, CA 92121, USA.

Genetic studies of human evolution require high-quality contiguous ape genome assemblies that are not guided by the human reference. We coupled long-read sequence assembly and full-length complementary DNA sequencing with a multiplatform scaffolding approach to produce ab initio chimpanzee and orangutan genome assemblies. By comparing these with two long-read de novo human genome assemblies and a gorilla genome assembly, we characterized lineage-specific and shared great ape genetic variation ranging from single- to mega-base pair-sized variants. We identified ~17,000 fixed human-specific structural variants identifying genic and putative regulatory changes that have emerged in humans since divergence from nonhuman apes. Interestingly, these variants are enriched near genes that are down-regulated in human compared to chimpanzee cerebral organoids, particularly in cells analogous to radial glial neural progenitors.

Source
DOI: http://dx.doi.org/10.1126/science.aar6343
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6178954
June 2018

Evaluating recovery potential of the northern white rhinoceros from cryopreserved somatic cells.

Genome Res 2018 06 24;28(6):780-788. Epub 2018 May 24.

San Diego Zoo Institute for Conservation Research, Escondido, California 92027, USA.

The critically endangered northern white rhinoceros is believed to be extinct in the wild, with the recent death of the last male leaving only two remaining individuals in captivity. Its extinction would appear inevitable, but the development of advanced cell and reproductive technologies such as cloning by nuclear transfer and the artificial production of gametes via stem cell differentiation offers a second chance for its survival. In this work, we analyzed genome-wide levels of genetic diversity, inbreeding, population history, and demography of the white rhinoceros sequenced from cryopreserved somatic cells, with the goal of informing how genetically valuable individuals could be used in future efforts toward the genetic rescue of the northern white rhinoceros. We present the first sequenced genomes of the northern white rhinoceros, which show relatively high levels of heterozygosity and an average genetic divergence of 0.1% compared with the southern subspecies. The two white rhinoceros subspecies appear to be closely related, with low genetic admixture and a divergence time <80,000 yr ago. Inbreeding, as measured by runs of homozygosity, appears slightly higher in the southern than the northern white rhinoceros. This work demonstrates the value of the northern white rhinoceros cryopreserved genetic material as a potential gene pool for saving this subspecies from extinction.

Source
DOI: http://dx.doi.org/10.1101/gr.227603.117
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5991516
June 2018

Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

Nucleic Acids Res 2018 01;46(D1):D221-D228

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.

Source
DOI: http://dx.doi.org/10.1093/nar/gkx1031
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753299
January 2018

The UCSC Genome Browser database: 2018 update.

Nucleic Acids Res 2018 01;46(D1):D762-D769

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

The UCSC Genome Browser (https://genome.ucsc.edu) provides a web interface for exploring annotated genome assemblies. The assemblies and annotation tracks are updated on an ongoing basis: 12 assemblies and more than 28 tracks were added in the past year. Two recent additions are a display of CRISPR/Cas9 guide sequences and an interactive navigator for gene interactions. Other upgrades from the past year include a command-line version of the Variant Annotation Integrator, support for Human Genome Variation Society variant nomenclature input and output, and a revised highlighting tool that now supports multiple simultaneous regions and colors.

Source
DOI: http://dx.doi.org/10.1093/nar/gkx1020
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753355
January 2018

Matching phenotypes to whole genomes: Lessons learned from four iterations of the personal genome project community challenges.

Hum Mutat 2017 09 19;38(9):1266-1276. Epub 2017 Jun 19.

PersonalGenomes.org, Boston, Massachusetts.

The advent of next-generation sequencing has dramatically decreased the cost for whole-genome sequencing and increased the viability for its application in research and clinical care. The Personal Genome Project (PGP) provides unrestricted access to genomes of individuals and their associated phenotypes. This resource enabled the Critical Assessment of Genome Interpretation (CAGI) to create a community challenge to assess the bioinformatics community's ability to predict traits from whole genomes. In the CAGI PGP challenge, researchers were asked to predict whether an individual had a particular trait or profile based on their whole genome. Several approaches were used to assess submissions, including ROC AUC (area under receiver operating characteristic curve), probability rankings, the number of correct predictions, and statistical significance simulations. Overall, we found that prediction of individual traits is difficult, relying on a strong knowledge of trait frequency within the general population, whereas matching genomes to trait profiles relies heavily upon a small number of common traits including ancestry, blood type, and eye color. When a rare genetic disorder is present, profiles can be matched when one or more pathogenic variants are identified. Prediction accuracy has improved substantially over the last 6 years due to improved methodology and a better understanding of features.

Source
DOI: http://dx.doi.org/10.1002/humu.23265
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645203
September 2017

The UCSC Genome Browser database: 2017 update.

Nucleic Acids Res 2017 01 29;45(D1):D626-D634. Epub 2016 Nov 29.

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Since its 2001 debut, the University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/) team has provided continuous support to the international genomics and biomedical communities through a web-based, open source platform designed for the fast, scalable display of sequence alignments and annotations landscaped against a vast collection of quality reference genome assemblies. The browser's publicly accessible databases are the backbone of a rich, integrated bioinformatics tool suite that includes a graphical interface for data queries and downloads, alignment programs, command-line utilities and more. This year's highlights include newly designed home and gateway pages; a new 'multi-region' track display configuration for exon-only, gene-only and custom regions visualization; new genome browsers for three species (brown kiwi, crab-eating macaque and Malayan flying lemur); eight updated genome assemblies; extended support for new data types such as CRAM, RNA-seq expression data and long-range chromatin interaction pairs; and the unveiling of a new supported mirror site in Japan.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw1134
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210591
January 2017

Long-read sequence assembly of the gorilla genome.

Science 2016 Apr;352(6281):aae0344

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA. Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA.

Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The approach provides a path forward for the routine assembly of mammalian genomes at a level approaching that of the current quality of the human genome.

Source
DOI: http://dx.doi.org/10.1126/science.aae0344
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4920363
April 2016

Genomic legacy of the African cheetah, Acinonyx jubatus.

Genome Biol 2015 Dec 10;16:277. Epub 2015 Dec 10.

Institut de Biologia Evolutiva (CSIC/UPF), Dr. Aiguader, 88, Barcelona, 08003, Spain.

Background: Patterns of genetic and genomic variance are informative in inferring population history for human, model species and endangered populations.

Results: Here the genome sequence of wild-born African cheetahs reveals extreme genomic depletion in SNV incidence, SNV density, SNVs of coding genes, MHC class I and II genes, and mitochondrial DNA SNVs. Cheetah genomes are on average 95 % homozygous compared to the genomes of the outbred domestic cat (24.08 % homozygous), Virunga Mountain Gorilla (78.12 %), inbred Abyssinian cat (62.63 %), Tasmanian devil, domestic dog and other mammalian species. Demographic estimators impute two ancestral population bottlenecks: one >100,000 years ago coincident with cheetah migrations out of the Americas and into Eurasia and Africa, and a second 11,084-12,589 years ago in Africa coincident with late Pleistocene large mammal extinctions. MHC class I gene loss and dramatic reduction in functional diversity of MHC genes would explain why cheetahs ablate skin graft rejection among unrelated individuals. Significant excess of non-synonymous mutations in AKAP4 (p<0.02), a gene mediating spermatozoon development, indicates cheetah fixation of five function-damaging amino acid variants distinct from AKAP4 homologues of other Felidae or mammals; AKAP4 dysfunction may cause the cheetah's extremely high (>80 %) pleiomorphic sperm.

Conclusions: The study provides an unprecedented genomic perspective for the rare cheetah, with potential relevance to the species' natural history, physiological adaptations and unique reproductive disposition.

Source
DOI: http://dx.doi.org/10.1186/s13059-015-0837-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4676127
December 2015

The UCSC Genome Browser database: 2016 update.

Nucleic Acids Res 2016 Jan 20;44(D1):D717-25. Epub 2015 Nov 20.

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

Source
DOI: http://dx.doi.org/10.1093/nar/gkv1275
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702902
January 2016

The NIH BD2K center for big data in translational genomics.

J Am Med Inform Assoc 2015 Nov 13;22(6):1143-7. Epub 2015 Jul 13.

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA. Howard Hughes Medical Institute, Bethesda, MD, USA.

The world's genomics data will never be stored in a single repository; rather, it will be distributed among many sites in many countries. No one site will have enough data to explain genotype to phenotype relationships in rare diseases; therefore, sites must share data. To accomplish this, the genetics community must forge common standards and protocols to make sharing and computing data among many sites a seamless activity. Through the Global Alliance for Genomics and Health, we are pioneering the development of shared application programming interfaces (APIs) to connect the world's genome repositories. In parallel, we are developing an open source software stack (ADAM) that uses these APIs. This combination will create a cohesive genome informatics ecosystem. Using containers, we are facilitating the deployment of this software in a diverse array of environments. Through benchmarking efforts and big data driver projects, we are ensuring ADAM's performance and utility.

Source
DOI: http://dx.doi.org/10.1093/jamia/ocv047
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009913
November 2015

The UCSC Genome Browser database: 2015 update.

Nucleic Acids Res 2015 Jan 26;43(Database issue):D670-81. Epub 2014 Nov 26.

Center for Biomolecular Science and Engineering, CBSE, UC Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA.

Launched in 2001 to showcase the draft human genome assembly, the UCSC Genome Browser database (http://genome.ucsc.edu) and associated tools continue to grow, providing a comprehensive resource of genome assemblies and annotations to scientists and students worldwide. Highlights of the past year include the release of a browser for the first new human genome reference assembly in 4 years in December 2013 (GRCh38, UCSC hg38), a watershed comparative genomics annotation (100-species multiple alignment and conservation) and a novel distribution mechanism for the browser (GBiB: Genome Browser in a Box). We created browsers for new species (Chinese hamster, elephant shark, minke whale), 'mined the web' for DNA sequences and expanded the browser display with stacked color graphs and region highlighting. As our user community increasingly adopts the UCSC track hub and assembly hub representations for sharing large-scale genomic annotation data sets and genome sequencing projects, our menu of public data hubs has tripled.

Source
DOI: http://dx.doi.org/10.1093/nar/gku1177
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383971
January 2015

The UCSC Cancer Genomics Browser: update 2015.

Nucleic Acids Res 2015 Jan 11;43(Database issue):D812-7. Epub 2014 Nov 11.

Center for Biomolecular Science and Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064, USA.

The UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu/) is a web-based application that integrates relevant data, analysis and visualization, allowing users to easily discover and share their research observations. Users can explore the relationship between genomic alterations and phenotypes by visualizing various -omic data alongside clinical and phenotypic features, such as age, subtype classifications and genomic biomarkers. The Cancer Genomics Browser currently hosts 575 public datasets from genome-wide analyses of over 227,000 samples, including datasets from TCGA, CCLE, Connectivity Map and TARGET. Users can download and upload clinical data, generate Kaplan-Meier plots dynamically, export data directly to Galaxy for analysis, plus generate URL bookmarks of specific views of the data to share with others.

Source
DOI: http://dx.doi.org/10.1093/nar/gku1073
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383911
January 2015