Publications by authors named "Benedict Paten"

93 Publications

A high-quality bonobo genome refines the analysis of hominid evolution.

Nature 2021 May 5. Epub 2021 May 5.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

DOI: http://dx.doi.org/10.1038/s41586-021-03519-x
May 2021

Towards complete and error-free genome assemblies of all vertebrate species.

Nature 2021 Apr 28;592(7856):737-746. Epub 2021 Apr 28.

UQ Genomics, University of Queensland, Brisbane, Queensland, Australia.

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

DOI: http://dx.doi.org/10.1038/s41586-021-03451-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8081667
April 2021

Pervasive cis effects of variation in copy number of large tandem repeats on local DNA methylation and gene expression.

Am J Hum Genet 2021 May 31;108(5):809-824. Epub 2021 Mar 31.

Department of Genetics and Genomic Sciences and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Variable number tandem repeats (VNTRs) are composed of large tandemly repeated motifs, many of which are highly polymorphic in copy number. However, because of their large size and repetitive nature, they remain poorly studied. To investigate the regulatory potential of VNTRs, we used read-depth data from Illumina whole-genome sequencing to perform association analysis between copy number of ∼70,000 VNTRs (motif size ≥ 10 bp) with both gene expression (404 samples in 48 tissues) and DNA methylation (235 samples in peripheral blood), identifying thousands of VNTRs that are associated with local gene expression (eVNTRs) and DNA methylation levels (mVNTRs). Using an independent cohort, we validated 73%-80% of signals observed in the two discovery cohorts, while allelic analysis of VNTR length and CpG methylation in 30 Oxford Nanopore genomes gave additional support for mVNTR loci, thus providing robust evidence to support that these represent genuine associations. Further, conditional analysis indicated that many eVNTRs and mVNTRs act as QTLs independently of other local variation. We also observed strong enrichments of eVNTRs and mVNTRs for regulatory features such as enhancers and promoters. Using the Human Genome Diversity Panel, we define sets of VNTRs that show highly divergent copy numbers among human populations and show that these are enriched for regulatory effects and preferentially associate with genes that have been linked with human phenotypes through GWASs. Our study provides strong evidence supporting functional variation at thousands of VNTRs and defines candidate sets of VNTRs, copy number variation of which potentially plays a role in numerous human phenotypes.

DOI: http://dx.doi.org/10.1016/j.ajhg.2021.03.016
May 2021

Real-Time Culture-Independent Microbial Profiling Onboard the International Space Station Using Nanopore Sequencing.

Genes (Basel) 2021 Jan 16;12(1). Epub 2021 Jan 16.

Biomedical Research and Environmental Sciences Division, NASA Johnson Space Center, Houston, TX 77058, USA.

For the past two decades, microbial monitoring of the International Space Station (ISS) has relied on culture-dependent methods that require return to Earth for analysis. This has a number of limitations, with the most significant being bias towards the detection of culturable organisms and the inherent delay between sample collection and ground-based analysis. In recent years, portable and easy-to-use molecular-based tools, such as Oxford Nanopore Technologies' MinION™ sequencer and miniPCR bio's miniPCR™ thermal cycler, have been validated onboard the ISS. Here, we report on the development, validation, and implementation of a swab-to-sequencer method that provides a culture-independent solution to real-time microbial profiling onboard the ISS. Method development focused on analysis of swabs collected in a low-biomass environment with limited facility resources and stringent controls on allowed processes and reagents. ISS-optimized procedures included enzymatic DNA extraction from a swab tip, bead-based purifications, altered buffers, and the use of miniPCR and the MinION. Validation was conducted through extensive ground-based assessments comparing current standard culture-dependent and newly developed culture-independent methods. Similar microbial distributions were observed between the two methods; however, as expected, the culture-independent data revealed microbial profiles with greater diversity. Protocol optimization and verification was established during NASA Extreme Environment Mission Operations (NEEMO) analog missions 21 and 22, respectively. Unique microbial profiles obtained from analog testing validated the swab-to-sequencer method in an extreme environment. Finally, four independent swab-to-sequencer experiments were conducted onboard the ISS by two crewmembers. Microorganisms identified from ISS swabs were consistent with historical culture-based data, and primarily consisted of commonly observed human-associated microbes. This simplified method has been streamlined for high ease-of-use for a non-trained crew to complete in an extreme environment, thereby enabling environmental and human health diagnostics in real-time as future missions take us beyond low-Earth orbit.

DOI: http://dx.doi.org/10.3390/genes12010106
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7830261
January 2021

Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility.

Science 2020 12;370(6523)

Department of Biology, University of Bari 'Aldo Moro', 70125 Bari, Italy.

The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.

DOI: http://dx.doi.org/10.1126/science.abc6617
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7818670
December 2020

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.

Nat Biotechnol 2021 03 7;39(3):302-308. Epub 2020 Dec 7.

Heinrich Heine University Düsseldorf, Medical Faculty, Institute for Medical Biometry and Bioinformatics, Düsseldorf, Germany.

Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
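The contiguity figure quoted above (contig N50) can be illustrated with a short calculation. This is a minimal sketch, not code from the paper: the function name `n50` and the toy contig lengths are assumptions made for illustration. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is reported alongside accuracy and switch error rate.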

DOI: http://dx.doi.org/10.1038/s41587-020-0719-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7954704
March 2021

GENCODE 2021.

Nucleic Acids Res 2021 01;49(D1):D916-D923

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

DOI: http://dx.doi.org/10.1093/nar/gkaa1087
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778937
January 2021

ProTECT-Prediction of T-Cell Epitopes for Cancer Therapy.

Front Immunol 2020 10;11:483296. Epub 2020 Nov 10.

Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States.

Somatic mutations in cancers affecting protein coding genes can give rise to potentially therapeutic neoepitopes. These neoepitopes can guide Adoptive Cell Therapies and Peptide- and RNA-based Neoepitope Vaccines to selectively target tumor cells using autologous patient cytotoxic T-cells. Currently, researchers have to independently align their data, call somatic mutations and haplotype the patient's HLA to use existing neoepitope prediction tools. We present ProTECT, a fully automated, reproducible, scalable, and efficient end-to-end analysis pipeline to identify and rank therapeutically relevant tumor neoepitopes in terms of potential immunogenicity starting directly from raw patient sequencing data, or from pre-processed data. The ProTECT pipeline encompasses alignment, HLA haplotyping, mutation calling (single nucleotide variants, short insertions and deletions, and gene fusions), peptide:MHC binding prediction, and ranking of final candidates. We demonstrate the scalability, efficiency, and utility of ProTECT on 326 samples from the TCGA Prostate Adenocarcinoma cohort, identifying recurrent potential neoepitopes from TMPRSS2-ERG fusions, and from SNVs in SPOP. We also compare ProTECT with results from published tools. ProTECT can be run on a standalone computer, a local cluster, or on a compute cloud using a Mesos backend. ProTECT is highly scalable and can process TCGA data in under 30 min per sample (on average) when run in large batches. ProTECT is freely available at https://www.github.com/BD2KGenomics/protect.

DOI: http://dx.doi.org/10.3389/fimmu.2020.483296
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7683782
April 2021

Dense sampling of bird diversity increases power of comparative genomics.

Nature 2020 11 11;587(7833):252-257. Epub 2020 Nov 11.

Centre for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg, Denmark.

Whole-genome sequencing projects are increasingly populating the tree of life and characterizing biodiversity. Sparse taxon sampling has previously been proposed to confound phylogenetic inference, and captures only a fraction of the genomic diversity. Here we report a substantial step towards the dense representation of avian phylogenetic and molecular diversity, by analysing 363 genomes from 92.4% of bird families-including 267 newly sequenced genomes produced for phase II of the Bird 10,000 Genomes (B10K) Project. We use this comparative genome dataset in combination with a pipeline that leverages a reference-free whole-genome alignment to identify orthologous regions in greater numbers than has previously been possible and to recognize genomic novelties in particular bird lineages. The densely sampled alignment provides a single-base-pair map of selection, has more than doubled the fraction of bases that are confidently predicted to be under conservation and reveals extensive patterns of weak selection in predominantly non-coding DNA. Our results demonstrate that increasing the diversity of genomes used in comparative studies can reveal more shared and lineage-specific variation, and improve the investigation of genomic characteristics. We anticipate that this genomic resource will offer new perspectives on evolutionary processes in cross-species comparative analyses and assist in efforts to conserve species.

DOI: http://dx.doi.org/10.1038/s41586-020-2873-9
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7759463
November 2020

Progressive Cactus is a multiple-genome aligner for the thousand-genome era.

Nature 2020 11 11;587(7833):246-251. Epub 2020 Nov 11.

UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA.

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

DOI: http://dx.doi.org/10.1038/s41586-020-2871-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673649
November 2020

Efficient dynamic variation graphs.

Bioinformatics 2020 Jul 16. Epub 2020 Jul 16.

Genomics Institute.

Motivation: Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic datasets can be challenging to work at. These challenges have impeded progress in this field.

Results: Here, we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the VG toolkit, which has now transitioned to using libbdsg's implementations.

Availability And Implementation: libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.
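The bidirected sequence graph data model mentioned in this abstract can be sketched in a few lines. The class below is a toy written for illustration only: it is not the libbdsg or libhandlegraph API, and every name in it (`ToyBidirectedGraph`, `flip`, `follow`) is hypothetical. It assumes a "handle" is a pair of node ID and orientation, and that each edge can be read on either strand.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flip(handle):
    """Return the same node visited in the opposite orientation."""
    node_id, is_reverse = handle
    return (node_id, not is_reverse)


class ToyBidirectedGraph:
    """Toy bidirected sequence graph; handles are (node_id, is_reverse)."""

    def __init__(self):
        self.seqs = {}
        self.successors = defaultdict(set)

    def add_node(self, node_id, seq):
        self.seqs[node_id] = seq

    def add_edge(self, left, right):
        # Reading `left` then `right` is allowed ...
        self.successors[left].add(right)
        # ... and so is the same edge read on the other strand.
        self.successors[flip(right)].add(flip(left))

    def sequence(self, handle):
        node_id, is_reverse = handle
        seq = self.seqs[node_id]
        return revcomp(seq) if is_reverse else seq

    def follow(self, handle):
        """Handles reachable in one step from `handle`, in sorted order."""
        return sorted(self.successors.get(handle, ()))


graph = ToyBidirectedGraph()
graph.add_node(1, "GAT")
graph.add_node(2, "TACA")
graph.add_edge((1, False), (2, False))

forward = graph.sequence((1, False)) + graph.sequence((2, False))
reverse = graph.sequence((2, True)) + graph.sequence((1, True))
print(forward, reverse)
```

Storing both readings of every edge is one simple way to keep the two strands consistent; a production library would use a far more compact representation.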

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa640
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850124
July 2020

Genus-Wide Characterization of Bumblebee Genomes Provides Insights into Their Evolution and Variation in Ecological and Behavioral Traits.

Mol Biol Evol 2021 01;38(2):486-501

Institute of Apicultural Research, Chinese Academy of Agricultural Sciences, Beijing, China.

Bumblebees are a diverse group of globally important pollinators in natural ecosystems and for agricultural food production. With both eusocial and solitary life-cycle phases, and some social parasite species, they are especially interesting models to understand social evolution, behavior, and ecology. Reports of many species in decline point to pathogen transmission, habitat loss, pesticide usage, and global climate change, as interconnected causes. These threats to bumblebee diversity make our reliance on a handful of well-studied species for agricultural pollination particularly precarious. To broadly sample bumblebee genomic and phenotypic diversity, we de novo sequenced and assembled the genomes of 17 species, representing all 15 subgenera, producing the first genus-wide quantification of genetic and genomic variation potentially underlying key ecological and behavioral traits. The species phylogeny resolves subgenera relationships, whereas incomplete lineage sorting likely drives high levels of gene tree discordance. Five chromosome-level assemblies show a stable 18-chromosome karyotype, with major rearrangements creating 25 chromosomes in social parasites. Differential transposable element activity drives changes in genome sizes, with putative domestications of repetitive sequences influencing gene coding and regulatory potential. Dynamically evolving gene families and signatures of positive selection point to genus-wide variation in processes linked to foraging, diet and metabolism, immunity and detoxification, as well as adaptations for life at high altitudes. Our study reveals how bumblebee genes and genomes have evolved across the Bombus phylogeny and identifies variations potentially linked to key ecological and behavioral traits of these important pollinators.

DOI: http://dx.doi.org/10.1093/molbev/msaa240
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7826183
January 2021

A Survey of Rare Epigenetic Variation in 23,116 Human Genomes Identifies Disease-Relevant Epivariations and CGG Expansions.

Am J Hum Genet 2020 10 15;107(4):654-669. Epub 2020 Sep 15.

Department of Genetics and Genomic Sciences and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, Hess Center for Science and Medicine, New York, NY 10029, USA.

There is growing recognition that epivariations, most often recognized as promoter hypermethylation events that lead to gene silencing, are associated with a number of human diseases. However, little information exists on the prevalence and distribution of rare epigenetic variation in the human population. In order to address this, we performed a survey of methylation profiles from 23,116 individuals using the Illumina 450k array. Using a robust outlier approach, we identified 4,452 unique autosomal epivariations, including potentially inactivating promoter methylation events at 384 genes linked to human disease. For example, we observed promoter hypermethylation of BRCA1 and LDLR at population frequencies of ∼1 in 3,000 and ∼1 in 6,000, respectively, suggesting that epivariations may underlie a fraction of human disease which would be missed by purely sequence-based approaches. Using expression data, we confirmed that many epivariations are associated with outlier gene expression. Analysis of variation data and monozygous twin pairs suggests that approximately two-thirds of epivariations segregate in the population secondary to underlying sequence mutations, while one-third are likely sporadic events that occur post-zygotically. We identified 25 loci where rare hypermethylation coincided with the presence of an unstable CGG tandem repeat, validated the presence of CGG expansions at several loci, and identified the putative molecular defect underlying most of the known folate-sensitive fragile sites in the genome. Our study provides a catalog of rare epigenetic changes in the human genome, gives insight into the underlying origins and consequences of epivariations, and identifies many hypermethylated CGG repeat expansions.

DOI: http://dx.doi.org/10.1016/j.ajhg.2020.08.019
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536611
October 2020

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.

Nat Biotechnol 2020 09 4;38(9):1044-1053. Epub 2020 May 4.

Chan Zuckerberg Initiative, Redwood City, CA, USA.

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
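The pairing of "99.9% identity" with "QV = 30" in this abstract follows from the standard Phred scale, QV = -10 · log10(error rate). The conversion can be sketched as below; the function names are illustrative assumptions and are not part of Shasta, MarginPolish, or HELEN.

```python
import math


def identity_to_qv(identity):
    """Convert a fractional identity (e.g. 0.999) to a Phred-scaled QV."""
    error = 1.0 - identity
    if not 0.0 < error <= 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(error)


def qv_to_identity(qv):
    """Convert a Phred-scaled QV back to a fractional identity."""
    return 1.0 - 10.0 ** (-qv / 10.0)


for identity in (0.99, 0.999, 0.9999):
    print(f"identity {identity:.2%} -> QV {identity_to_qv(identity):.1f}")
```

On this scale each additional 10 QV points corresponds to a tenfold lower error rate, so a QV above 40 implies fewer than one error per 10,000 bases.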

DOI: http://dx.doi.org/10.1038/s41587-020-0503-6
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7483855
September 2020

Telomere-to-telomere assembly of a complete human X chromosome.

Nature 2020 09 14;585(7823):79-84. Epub 2020 Jul 14.

Arima Genomics, San Diego, CA, USA.

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist. Here we present a human genome assembly that surpasses the continuity of GRCh38, along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.

DOI: http://dx.doi.org/10.1038/s41586-020-2547-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7484160
September 2020

Distance indexing and seed clustering in sequence graphs.

Bioinformatics 2020 07;36(Suppl_1):i146-i153

Department of Biomolecular Engineering, University of California Santa Cruz Genomics Institute, Santa Cruz, CA 95060, USA.

Motivation: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping.

Results: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs.

Availability And Implementation: Our algorithms have been implemented as part of the vg toolkit and are available at https://github.com/vgteam/vg.
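To make concrete why distance is harder in a graph than in a linear genome, the sketch below computes the minimum number of bases between two positions by brute-force shortest-path search. It is a baseline under simplifying assumptions, not the minimum distance index described in the abstract and not code from vg: the graph is treated as directed (a single strand only), positions are (node, offset) pairs, and the function name and toy graph are hypothetical.

```python
import heapq


def min_distance(seqs, edges, start, end):
    """Fewest bases between two (node, offset) positions, or None.

    seqs maps node ID -> sequence; edges maps node ID -> successor IDs.
    Plain Dijkstra over node starts; bidirected edges are ignored here.
    """
    (u, i), (v, j) = start, end
    best = None
    if u == v and j >= i:
        best = j - i  # both positions on the same node, in order
    # Each heap entry is (bases from start to first base of node w, w).
    heap = [(len(seqs[u]) - i, w) for w in edges.get(u, [])]
    heapq.heapify(heap)
    settled = set()
    while heap:
        d, w = heapq.heappop(heap)
        if w in settled:
            continue
        settled.add(w)
        if w == v:
            candidate = d + j
            if best is None or candidate < best:
                best = candidate
            break  # Dijkstra pops the closest copy of v first
        for x in edges.get(w, []):
            if x not in settled:
                heapq.heappush(heap, (d + len(seqs[w]), x))
    return best


# Toy graph: node 2 is an optional insertion between nodes 1 and 3.
seqs = {1: "GATT", 2: "ACA", 3: "CA"}
edges = {1: [2, 3], 2: [3]}
print(min_distance(seqs, edges, (1, 1), (3, 1)))
```

Running a search like this for every pair of seeds during read mapping would be slow on large graphs, which is the cost a precomputed index is meant to avoid.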

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa446
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355256
July 2020

Gaussian mixture model-based unsupervised nucleotide modification number detection using nanopore-sequencing readouts.

Bioinformatics 2020 12;36(19):4928-4934

Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Motivation: Nucleotide modification status can be decoded from the Oxford Nanopore Technologies nanopore-sequencing ionic current signals. Although various algorithms have been developed for nanopore-sequencing-based modification analysis, more detailed characterizations, such as modification numbers, corresponding signal levels and proportions are still lacking.

Results: We present a framework for the unsupervised determination of the number of nucleotide modifications from nanopore-sequencing readouts. We demonstrate the approach can effectively recapitulate the number of modifications, the corresponding ionic current signal levels, as well as mixing proportions under both DNA and RNA contexts. We further show, by integrating information from multiple detected modification regions, that the modification status of DNA and RNA molecules can be inferred. This method forms a key step of de novo characterization of nucleotide modifications, shedding light on the interpretation of various biological questions.

Availability And Implementation: Modified nanopolish: https://github.com/adbailey4/nanopolish/tree/cigar_output. All other codes used to reproduce the results: https://github.com/hd2326/ModificationNumber.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa601
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7723331
December 2020

halSynteny: a fast, easy-to-use conserved synteny block construction method for multiple whole-genome alignments.

Gigascience 2020 06;9(6)

Computer Technologies Laboratory, School of Translational Information Technologies, ITMO University, 49 Kronverkskiy Pr., St. Petersburg 197101, St. Petersburg, Russian Federation.

Background: Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequence stored in a binary format demand an accurate and efficient computational approach for synteny block production.

Findings: halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred-way, reference-free vertebrate alignments built with the Cactus system.

Conclusions: halSynteny enables an accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.

Source
DOI: http://dx.doi.org/10.1093/gigascience/giaa047
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7254927
June 2020

Pangenome Graphs.

Annu Rev Genomics Hum Genet 2020 08 26;21:139-162. Epub 2020 May 26.

Genomics Institute, University of California, Santa Cruz, California 95064, USA.

Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linear reference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.

Source
DOI: http://dx.doi.org/10.1146/annurev-genom-120219-080406
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8006571
August 2020

BioHackathon 2015: Semantics of data for life sciences and reproducible research.

F1000Res 2020 24;9:136. Epub 2020 Feb 24.

St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Darlinghurst, Australia.

We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.

Source
DOI: http://dx.doi.org/10.12688/f1000research.18236.1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7141167
February 2021

Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples.

JCO Clin Cancer Inform 2020 02;4:160-170

Computational Genomics Laboratory, University of California, Santa Cruz, Santa Cruz, CA.

Purpose: Many antineoplastics are designed to target upregulated genes, but quantifying upregulation in a single patient sample requires an appropriate set of samples for comparison. In cancer, the most natural comparison set is unaffected samples from the matching tissue, but there are often too few available unaffected samples to overcome high intersample variance. Moreover, some cancer samples have misidentified tissues of origin or even composite-tissue phenotypes. Even if an appropriate comparison set can be identified, most differential expression tools are not designed to accommodate comparisons to a single patient sample.

Methods: We propose a Bayesian statistical framework for gene expression outlier detection in single samples. Our method uses all available data to produce a consensus background distribution for each gene of interest without requiring the researcher to manually select a comparison set. The consensus distribution can then be used to quantify over- and underexpression.

Results: We demonstrate this method on both simulated and real gene expression data. We show that it can robustly quantify overexpression, even when the set of comparison samples lacks ideally matched tissue samples. Furthermore, our results show that the method can identify appropriate comparison sets from samples of mixed lineage and rediscover numerous known gene-cancer expression patterns.

Conclusion: This exploratory method is suitable for identifying expression outliers from comparative RNA sequencing (RNA-seq) analysis for individual samples, and Treehouse, a pediatric precision medicine group that leverages RNA-seq to identify potential therapeutic leads for patients, plans to explore this method for processing its pediatric cohort.
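
As a rough illustration of quantifying overexpression against a consensus background, the sketch below computes the upper-tail probability of an observed expression value under a weighted mixture of per-tissue normal distributions. This is a simplified stand-in, not the published model: the tissue names, the means and standard deviations, and especially the fixed weights are hypothetical, whereas the method described above derives the consensus distribution from the data rather than taking weights as input.

```python
from statistics import NormalDist


def overexpression_tail(observed, backgrounds, weights):
    """P(replicate >= observed) under a weighted mixture of normal backgrounds.

    backgrounds maps a (hypothetical) background-set name to (mean, sd);
    weights maps the same names to non-negative weights, normalised here.
    """
    total = sum(weights.values())
    tail = 0.0
    for name, (mean, sd) in backgrounds.items():
        upper = 1.0 - NormalDist(mean, sd).cdf(observed)
        tail += (weights[name] / total) * upper
    return tail


# Hypothetical log-expression backgrounds for one gene in two tissues.
backgrounds = {"tissue_a": (5.0, 1.0), "tissue_b": (8.0, 1.0)}
# Assumed weights: the sample resembles tissue_a far more than tissue_b.
weights = {"tissue_a": 0.9, "tissue_b": 0.1}

p_typical = overexpression_tail(5.0, backgrounds, weights)
p_high = overexpression_tail(9.0, backgrounds, weights)
```

A small tail probability (here p_high) flags the gene as a candidate overexpression outlier for that sample, while a value near one half (p_typical) indicates expression typical of the weighted background.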

Source
DOI: http://dx.doi.org/10.1200/CCI.19.00095
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7053807
February 2020

Genotyping structural variants in pangenome graphs using the vg toolkit.

Genome Biol 2020 02 12;21(1):35. Epub 2020 Feb 12.

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA.

Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmark vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.

Source
DOI: http://dx.doi.org/10.1186/s13059-020-1941-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7017486
February 2020

Off Earth Identification of Bacterial Populations Using 16S rDNA Nanopore Sequencing.

Genes (Basel) 2020 01 9;11(1). Epub 2020 Jan 9.

Biomedical Research and Environmental Sciences Division, NASA Johnson Space Center, Houston, TX 77058, USA.

The MinION sequencer has made in situ sequencing feasible in remote locations. Following our initial demonstration of its high performance off planet with Earth-prepared samples, we developed and tested an end-to-end, sample-to-sequencer process that could be conducted entirely aboard the International Space Station (ISS). Initial experiments demonstrated the process with a microbial mock community standard. The DNA was successfully amplified, primers were degraded, and libraries prepared and sequenced. The median percent identities for both datasets were 84%, as assessed from alignment of the mock community. The ability to correctly identify the organisms in the mock community standard was comparable for the sequencing data obtained in flight and on the ground. To validate the process on microbes collected from and cultured aboard the ISS, bacterial cells were selected from a NASA Environmental Health Systems Surface Sample Kit contact slide. The locations of bacterial colonies chosen for identification were labeled, and a small number of cells were directly added as input into the sequencing workflow. Prepared DNA was sequenced, and the data were downlinked to Earth. Return of the contact slide to the ground allowed for standard laboratory processing for bacterial identification. The identifications obtained aboard the ISS, and , matched those determined on the ground down to the species level. This marks the first ever identification of microbes entirely off Earth, and this validated process could be used for in-flight microbial identification, diagnosis of infectious disease in a crewmember, and as a research platform for investigators around the world.

Source
DOI: http://dx.doi.org/10.3390/genes11010076
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7016637
January 2020

Nanopore native RNA sequencing of a human poly(A) transcriptome.

Nat Methods 2019 12 18;16(12):1297-1305. Epub 2019 Nov 18.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

High-throughput complementary DNA sequencing technologies have advanced our understanding of transcriptome complexity and regulation. However, these methods lose information contained in biological RNA because the copied reads are often short and modifications are not retained. We address these limitations using a native poly(A) RNA sequencing strategy developed by Oxford Nanopore Technologies. Our study generated 9.9 million aligned sequence reads for the human cell line GM12878, using thirty MinION flow cells at six institutions. These native RNA reads had a median length of 771 bases, and a maximum aligned length of over 21,000 bases. Mitochondrial poly(A) reads provided an internal measure of read-length quality. We combined these long nanopore reads with higher accuracy short-reads and annotated GM12878 promoter regions to identify 33,984 plausible RNA isoforms. We describe strategies for assessing 3' poly(A) tail length, base modifications and transcript haplotypes.

Source
DOI: http://dx.doi.org/10.1038/s41592-019-0617-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7768885
December 2019

Haplotype-aware graph indexes.

Bioinformatics 2020 01;36(2):400-407

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability And Implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.

Supplementary Information: Supplementary data are available at Bioinformatics online.
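
For readers unfamiliar with the positional Burrows-Wheeler transform, here is a minimal sketch of the ordinary (linear, non-graph) PBWT on a small panel of binary haplotypes. It is not the graph extension implemented in the software above and uses none of its APIs; the function names and the toy panel are illustrative assumptions.

```python
def prefix_orders(haplotypes):
    """Return the haplotype order before each site and after the last one.

    orders[k] lists haplotype indices sorted by their reversed prefix
    (alleles at sites k-1, k-2, ..., 0), so haplotypes that agree on a long
    stretch ending just before site k sit next to each other.
    """
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    orders = [order[:]]
    for k in range(n_sites):
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        orders.append(order[:])
    return orders


def pbwt_columns(haplotypes):
    """Alleles at each site, listed in the prefix order computed before that site."""
    orders = prefix_orders(haplotypes)
    n_sites = len(haplotypes[0])
    return [[haplotypes[h][k] for h in orders[k]] for k in range(n_sites)]


# Toy panel: five haplotypes over four biallelic sites (hypothetical data).
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]
orders = prefix_orders(panel)
columns = pbwt_columns(panel)
```

Because haplotypes sharing a recent history become adjacent in each order, the permuted columns tend to contain long runs of identical alleles, which is what makes the representation compressible and searchable at scale.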

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btz575
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7223266
January 2020

Sequence tube maps: making graph genomes intuitive to commuters.

Bioinformatics 2019 12;35(24):5318-5320

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.

Motivation: Compared to traditional haploid reference genomes, graph genomes are an efficient and compact data structure for storing multiple genomic sequences, for storing polymorphisms or for mapping sequencing reads with greater sensitivity. Further, graphs are well-studied computer science objects that can be efficiently analyzed. However, their adoption in genomic research is slow, in part because of the cognitive difficulty in interpreting graphs.

Results: We present an intuitive graphical representation for graph genomes that re-uses well-honed techniques developed to display public transport networks, and demonstrate it as a web tool.

Availability And Implementation: Code: https://github.com/vgteam/sequenceTubeMap.

Demonstration: https://vgteam.github.io/sequenceTubeMap/.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btz597
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6954646
December 2019

Haplotype-aware diplotyping from noisy long reads.

Genome Biol 2019 06 3;20(1):116. Epub 2019 Jun 3.

UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA.

Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.

Source
DOI: http://dx.doi.org/10.1186/s13059-019-1709-0
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547545
June 2019

The Genome of C57BL/6J "Eve", the Mother of the Laboratory Mouse Genome Reference Strain.

G3 (Bethesda) 2019 06 5;9(6):1795-1805. Epub 2019 Jun 5.

The Jackson Laboratory for Mammalian Genetics, Bar Harbor, ME.

Isogenic laboratory mouse strains enhance reproducibility because individual animals are genetically identical. For the most widely used isogenic strain, C57BL/6, there exists a wealth of genetic, phenotypic, and genomic data, including a high-quality reference genome (GRCm38.p6). Now 20 years after the first release of the mouse reference genome, C57BL/6J mice are at least 26 inbreeding generations removed from GRCm38 and the strain is now maintained with periodic reintroduction of cryorecovered mice derived from a single breeder pair, aptly named Adam and Eve. To provide an update to the mouse reference genome that more accurately represents the genome of today's C57BL/6J mice, we took advantage of long read, short read, and optical mapping technologies to generate an assembly of the C57BL/6J Eve genome (B6Eve). Using these data, we have addressed recurring variants observed in previous mouse genomic studies. We have also identified structural variations, closed gaps in the mouse reference assembly, and revealed previously unannotated coding sequences. This B6Eve assembly explains discrepant observations that have been associated with GRCm38-based analyses, and will inform a reference genome that is more representative of the C57BL/6J mice that are in use today.

Source
DOI: http://dx.doi.org/10.1534/g3.119.400071
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553538
June 2019