Publications by authors named "Steven L Salzberg"

207 Publications

Balrog: A universal protein model for prokaryotic gene prediction.

PLoS Comput Biol 2021 Feb 26;17(2):e1008727. Epub 2021 Feb 26.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

http://dx.doi.org/10.1371/journal.pcbi.1008727
February 2021

Guillain-Barré Syndrome Outbreak in Peru 2019 Associated With Campylobacter jejuni Infection.

Neurol Neuroimmunol Neuroinflamm 2021 03 5;8(2). Epub 2021 Feb 5.

From the Departamento de Medicina (A.P.R., M.A.C., C.C.C., J.A.D., M.A.T., J.T.A., H.F.U.), Servicio de Neurología y Neuropsiquiatría, Hospital Cayetano Heredia, Lima, Perú; Department of Neurology (S.E.L.) and Department of Neurology and Department of Immunology (B.C.J.), Erasmus MC, University Medical Center Rotterdam, Netherlands; Institute of Infection, Immunity and Inflammation (S.K.H., D.G., H.J.W.), University of Glasgow, United Kingdom; Departamento de Enfermedades Infecciosas Tropicales y Dermatológicas (A.L.), Hospital Cayetano Heredia, Lima, Perú; U.S. Naval Medical Research Unit-6 (M.G., M.R., J.D.R., R.M.), Lima, Peru; Center for Computational Biology (D.P., R.M.S., S.L.S.), Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD; and Department of Pathology (P.J.S.), Department of Neurology (D.R.C.), and Department of Neurology and Department of Pathology (C.A.P.), Johns Hopkins University School of Medicine, Baltimore, MD.

Objective: To identify the clinical phenotypes and infectious triggers in the 2019 Peruvian Guillain-Barré syndrome (GBS) outbreak.

Methods: We prospectively collected clinical and neurophysiologic data of patients with GBS admitted to a tertiary hospital in Lima, Peru, between May and August 2019. Molecular, immunologic, and microbiological methods were used to identify causative infectious agents. Sera from 41 controls were compared with cases for antibodies to Campylobacter jejuni and gangliosides. Genomic analysis was performed on 4 C. jejuni isolates.

Results: The 49 included patients had a median age of 44 years (interquartile range [IQR] 30-54 years), and 28 (57%) were male. Thirty-two (65%) had symptoms of a preceding infection: 24 (49%) diarrhea and 13 (27%) upper respiratory tract infection. The median time between infectious and neurologic symptoms was 3 days (IQR 2-9 days). Eighty percent had a pure motor form of GBS, 21 (43%) had the axonal electrophysiologic subtype, and 18% the demyelinating subtype. Evidence of recent C. jejuni infection was found in 28/43 (65%). No evidence of recent arbovirus infection was found. Twenty-three cases vs 11 controls (odds ratio [OR] 3.3, 95% confidence interval [CI] 1.2-9.2, p < 0.01) had IgM and/or IgA antibodies against C. jejuni. Anti-GM1:phosphatidylserine and/or anti-GT1a:GM1 heteromeric complex antibodies were strongly positive in cases (92.9% sensitivity and 68.3% specificity). Genomic analysis showed that the C. jejuni strains were closely related and had the Asn51 polymorphism at the cstII gene.

Conclusions: Our study indicates that the 2019 Peruvian GBS outbreak was associated with C. jejuni infection and that the C. jejuni strains linked to GBS circulate widely in different parts of the world.

http://dx.doi.org/10.1212/NXI.0000000000000952
March 2021

Dissecting the Polygenic Basis of Cold Adaptation Using Genome-Wide Association of Traits and Environmental Data in Douglas-fir.

Genes (Basel) 2021 Jan 18;12(1). Epub 2021 Jan 18.

Department of Plant Sciences, University of California-Davis, One Shields Avenue, Davis, CA 95616, USA.

Understanding the genomic and environmental basis of cold adaptation is key to understanding how plants survive and adapt to different environmental conditions across their natural range. Univariate and multivariate genome-wide association (GWAS) and genotype-environment association (GEA) analyses were used to test associations among genome-wide SNPs obtained from whole-genome resequencing, measures of growth, phenology, emergence, cold hardiness, and range-wide environmental variation in coastal Douglas-fir (Pseudotsuga menziesii var. menziesii). Results suggest a complex genomic architecture of cold adaptation, in which traits are either highly polygenic or controlled by both large and small effect genes. Newly discovered associations for cold adaptation in Douglas-fir included 130 genes involved in many important biological functions such as primary and secondary metabolism, growth and reproductive development, transcription regulation, stress and signaling, and DNA processes. These genes were related to growth, phenology and cold hardiness and strongly depend on variation in environmental variables such as degree days below 0 °C, precipitation, elevation and distance from the coast. This study is a step forward in our understanding of the complex interconnection between environment and genomics and their role in cold-associated trait variation in boreal tree species, providing a baseline for the species' predictions under climate change.

http://dx.doi.org/10.3390/genes12010110
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7831106
January 2021

Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments.

Genome Res 2020 Dec 23. Epub 2020 Dec 23.

Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA.

RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.

http://dx.doi.org/10.1101/gr.266213.120
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7849408
December 2020

Liftoff: accurate mapping of gene annotations.

Bioinformatics 2020 Dec 15. Epub 2020 Dec 15.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD.

Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated.

Results: One strategy to annotate new or improved genome assemblies is to map or 'lift over' the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity.

Availability And Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff.

Supplementary Information: Supplementary data are available at Bioinformatics online.

http://dx.doi.org/10.1093/bioinformatics/btaa1016
December 2020

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes.

PLoS Comput Biol 2020 12 4;16(12):e1008439. Epub 2020 Dec 4.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States.

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI's RefSeq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app (https://jenniferlu717.shinyapps.io/SkewIT/) that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the RefSeq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.
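
For readers unfamiliar with GC skew, the following is a minimal sketch of the conventional windowed statistic, (G - C) / (G + C), and its running sum. It is not the SkewI metric from the paper; the function names, the window size, and the toy sequence are assumptions made for illustration only.

```python
from __future__ import annotations


def gc_skew(seq: str, window: int = 1000) -> list[float]:
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 by convention here.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_skew(skews: list[float]) -> list[float]:
    """Running sum of window skews; its extremes often mark a change in strand bias."""
    total = 0.0
    out = []
    for s in skews:
        total += s
        out.append(total)
    return out


# Hypothetical toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGC" * 5 + "CCCG" * 5
window_skews = gc_skew(toy, window=20)
running = cumulative_skew(window_skews)
```

On the toy sequence the sign of the skew flips between the two halves, which is the kind of strand-dependent pattern the abstract describes.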

http://dx.doi.org/10.1371/journal.pcbi.1008439
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7717575
December 2020

The genome of the American groundhog, Marmota monax.

F1000Res 2020 16;9:1137. Epub 2020 Sep 16.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21211, USA.

We sequenced the genome of the North American groundhog, Marmota monax, also known as the woodchuck. Our sequencing strategy included a combination of short, high-quality Illumina reads plus long reads generated by both Pacific Biosciences and Oxford Nanopore instruments. Assembly of the combined data produced a genome of 2.74 Gbp in total length, with an N50 contig size of 1,094,236 bp. To annotate the genome, we mapped the genes from another M. monax genome and from the closely related Alpine marmot, Marmota marmota, onto our assembly, resulting in 20,559 annotated protein-coding genes and 28,135 transcripts. The genome assembly and annotation are available in GenBank under BioProject PRJNA587092.
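
The N50 contig size quoted above is a standard contiguity statistic: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly. A minimal sketch follows; the function name and the example lengths are hypothetical and are not taken from the groundhog assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half the assembly defines N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (total 290, so the halfway point is 145).
example_n50 = n50([100, 80, 50, 30, 20, 10])
```

Here the two longest contigs (100 + 80) are the first to cover half of the total, so the N50 is 80.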

http://dx.doi.org/10.12688/f1000research.25970.1
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7682491
September 2020

Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie.

bioRxiv 2020 Sep 21. Epub 2020 Sep 21.

The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, previous methods for detecting recombination and reassortment events cannot handle the computational requirements of analyzing tens of thousands of genomes, a scenario that has now emerged in the effort to track the spread of the SARS-CoV-2 virus. Furthermore, the low divergence of near-identical genomes sequenced in short periods of time presents a statistical challenge not addressed by available methods. In this work we present Bolotie, an efficient method designed to detect recombination and reassortment events between clades of viral genomes. We applied our method to a large collection of SARS-CoV-2 genomes and discovered hundreds of isolates that are likely of a recombinant origin. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. Our findings further show that several recombinants appear to have persisted in the population.

http://dx.doi.org/10.1101/2020.09.21.300913
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523100
September 2020

A Reference Genome Sequence for Giant Sequoia.

G3 (Bethesda) 2020 Nov 5;10(11):3907-3919. Epub 2020 Nov 5.

Department of Plant Sciences, University of California, Davis, CA 95616.

The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into eleven chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain, distributed unevenly across all 11 chromosomes and in 63 orthogroups, over 900 complete or partial predicted NLR genes, of which 375 are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.

http://dx.doi.org/10.1534/g3.120.401612
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7642918
November 2020

Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2.

Microbiome 2020 08 28;8(1):124. Epub 2020 Aug 28.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

Background: For decades, 16S ribosomal RNA sequencing has been the primary means for identifying the bacterial species present in a sample with unknown composition. One of the most widely used tools for this purpose today is the QIIME (Quantitative Insights Into Microbial Ecology) package. Recent results have shown that the newest release, QIIME 2, has higher accuracy than QIIME, MAPseq, and mothur when classifying bacterial genera from simulated human gut, ocean, and soil metagenomes, although QIIME 2 also proved to be the most computationally expensive. Kraken, first released in 2014, has been shown to provide exceptionally fast and accurate classification for shotgun metagenomics sequencing projects. Bracken, released in 2016, then provided users with the ability to accurately estimate species or genus relative abundances using Kraken classification results. Kraken 2, which matches the accuracy and speed of Kraken 1, now supports 16S rRNA databases, allowing for direct comparisons to QIIME and similar systems.

Methods: For a comprehensive assessment of each tool, we compare the computational resources and speed of QIIME 2's q2-feature-classifier, Kraken 2, and Bracken in generating the three main 16S rRNA databases: Greengenes, SILVA, and RDP. For an evaluation of accuracy, we evaluated each tool using the same simulated 16S rRNA reads from human gut, ocean, and soil metagenomes that were previously used to compare QIIME, MAPseq, mothur, and QIIME 2. We evaluated accuracy based on the accuracy of the final genera read counts assigned by each tool. Finally, as Kraken 2 is the only tool providing per-read taxonomic assignments, we evaluate the sensitivity and precision of Kraken 2's per-read classifications.

Results: For both the Greengenes and SILVA databases, Kraken 2 and Bracken are up to 100 times faster at database generation. For classification, using the same data as previous studies, Kraken 2 and Bracken are up to 300 times faster, use 100x less RAM, and generate results that are more accurate at 16S rRNA profiling than QIIME 2's q2-feature-classifier.

Conclusion: Kraken 2 and Bracken provide a very fast, efficient, and accurate solution for 16S rRNA metataxonomic data analysis.
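
The Methods above mention the sensitivity and precision of per-read classifications. The sketch below shows one common way such figures are computed from simulated reads with known labels. The exact definitions used in the paper are not stated in the abstract, so the definitions here (sensitivity as correct calls over all reads, precision as correct calls over classified reads), the function name, and the toy labels are all assumptions.

```python
def per_read_metrics(truth, predicted):
    """Compute (sensitivity, precision) for per-read genus assignments.

    truth:     dict mapping read ID -> true genus
    predicted: dict mapping read ID -> assigned genus, or None if unclassified
    """
    if not truth:
        raise ValueError("truth must contain at least one read")
    correct = sum(1 for read, genus in truth.items() if predicted.get(read) == genus)
    classified = sum(1 for read in truth if predicted.get(read) is not None)
    sensitivity = correct / len(truth)                       # correct / all reads
    precision = correct / classified if classified else 0.0  # correct / classified reads
    return sensitivity, precision


# Hypothetical simulated reads: two correct calls, one wrong call, one unclassified read.
truth = {"r1": "Bacteroides", "r2": "Bacteroides", "r3": "Prevotella", "r4": "Vibrio"}
predicted = {"r1": "Bacteroides", "r2": "Bacteroides", "r3": "Vibrio", "r4": None}
sensitivity, precision = per_read_metrics(truth, predicted)
```

In this toy case the unclassified read lowers sensitivity (2 of 4) but not precision (2 of 3), which is why the two numbers are usually reported together.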

http://dx.doi.org/10.1186/s40168-020-00900-2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7455996
August 2020

Chromosome-Scale Assembly of the Bread Wheat Genome Reveals Thousands of Additional Gene Copies.

Genetics 2020 Oct 12;216(2):599-608. Epub 2020 Aug 12.

Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the photoperiod response locus.

http://dx.doi.org/10.1534/genetics.120.303501
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536849
October 2020

The tuatara genome reveals ancient features of amniote evolution.

Nature 2020 08 5;584(7821):403-409. Epub 2020 Aug 5.

Ngatiwai Trust Board, Whangarei, New Zealand.

The tuatara (Sphenodon punctatus), the only living member of the reptilian order Rhynchocephalia (Sphenodontia), which was once widespread across Gondwana, is an iconic species that is endemic to New Zealand. A key link to the now-extinct stem reptiles (from which dinosaurs, modern reptiles, birds and mammals evolved), the tuatara provides key insights into the ancestral amniotes. Here we analyse the genome of the tuatara, which, at approximately 5 Gb, is among the largest of the vertebrate genomes yet assembled. Our analyses of this genome, along with comparisons with other vertebrate genomes, reinforce the uniqueness of the tuatara. Phylogenetic analyses indicate that the tuatara lineage diverged from that of snakes and lizards around 250 million years ago. This lineage also shows moderate rates of molecular evolution, with instances of punctuated evolution. Our genome sequence analysis identifies expansions of proteins, non-protein-coding RNA families and repeat elements, the latter of which show an amalgam of reptilian and mammalian features. The sequencing of the tuatara genome provides a valuable resource for deep comparative analyses of tetrapods, as well as for tuatara biology and conservation. Our study also provides important insights into both the technical challenges and the cultural obligations that are associated with genome sequencing.

http://dx.doi.org/10.1038/s41586-020-2561-9
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7116210
August 2020

Genomic basis of white pine blister rust quantitative disease resistance and its relationship with qualitative resistance.

Plant J 2020 Oct 28;104(2):365-376. Epub 2020 Jul 28.

School of Forestry, Northern Arizona University, 200 E. Pine Knoll, Flagstaff, AZ, 86011, USA.

The genomic architecture and molecular mechanisms controlling variation in quantitative disease resistance loci are not well understood in plant species and have been barely studied in long-generation trees. Quantitative trait loci mapping and genome-wide association studies were combined to test a large single nucleotide polymorphism (SNP) set for association with quantitative and qualitative white pine blister rust resistance in sugar pine. In the absence of a chromosome-scale reference genome, a high-density consensus linkage map was generated to obtain locations for associated SNPs. Newly discovered associations for white pine blister rust quantitative disease resistance included 453 SNPs involved in wide biological functions, including genes associated with disease resistance and others involved in morphological and developmental processes. In addition, NBS-LRR pathogen recognition genes were found to be involved in quantitative disease resistance, suggesting these newly reported genes are qualitative genes with partial resistance, they are the result of defeated qualitative resistance due to avirulent races, or they have epistatic effects on qualitative disease resistance genes. This study is a step forward in our understanding of the complex genomic architecture of quantitative disease resistance in long-generation trees, and constitutes the first step towards marker-assisted disease resistance breeding in white pine species.

http://dx.doi.org/10.1111/tpj.14928
October 2020

The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies.

PLoS Comput Biol 2020 06 26;16(6):e1007981. Epub 2020 Jun 26.

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.

The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.

http://dx.doi.org/10.1371/journal.pcbi.1007981
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7347232
June 2020

Assembly and annotation of an Ashkenazi human reference genome.

Genome Biol 2020 06 2;21(1):129. Epub 2020 Jun 2.

Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.

Background: Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases.

Results: Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.

Conclusions: The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.

http://dx.doi.org/10.1186/s13059-020-02047-7
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7265644
June 2020

High-quality chromosome-scale assembly of the walnut (Juglans regia L.) reference genome.

Gigascience 2020 05;9(5)

Department of Plant Sciences, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA.

Background: The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic, and proteomic information to fully elucidate walnut biological processes.

Findings: Here, we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0) obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size, with the 16 chromosomal pseudomolecules assembled and representing 95% of its total length. Using full-length transcripts from single-molecule real-time sequencing, we predicted 37,554 gene models, with a mean gene length higher than the previous gene annotations. Most of the new protein-coding genes (90%) present both start and stop codons, which represents a significant improvement compared with Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during male flower development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between Western and Eastern walnut cultivars.

Conclusion: Overall, Chandler v2.0 will serve as a valuable resource to better understand and explore walnut biology.

http://dx.doi.org/10.1093/gigascience/giaa050
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7238675
May 2020

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank.

Genome Biol 2020 05 12;21(1):115. Epub 2020 May 12.

Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA.

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to "complete" model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator.

http://dx.doi.org/10.1186/s13059-020-02023-1
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7218494
May 2020

Microbial Diagnostics for Cancer: A Step Forward but Not Prime Time Yet.

Cancer Cell 2020 05;37(5):625-627

Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD 21211, USA.

Translational microbiome science in humans has not yet fully realized its clinical potentials. The analyses by Poore et al. in Nature offer a strong foundation on which to begin to build microbial diagnostics to detect cancer.

http://dx.doi.org/10.1016/j.ccell.2020.04.010
May 2020

Pan-genomics in the human genome era.

Nat Rev Genet 2020 04 7;21(4):243-254. Epub 2020 Feb 7.

Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.

Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.
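
As a rough illustration of the pan-genome idea described above, the sketch below works at the level of gene presence or absence rather than raw DNA sequence, which is a simplification of the definition given in the abstract. The function name, the genome labels, and the gene identifiers are hypothetical.

```python
def pan_genome(gene_sets):
    """Split genes across genomes into pan, core, and accessory sets.

    gene_sets: dict mapping genome name -> iterable of gene identifiers
    Returns (pan, core, accessory), where pan is the union over all genomes,
    core is the set shared by every genome, and accessory is pan minus core.
    """
    sets = [set(genes) for genes in gene_sets.values()]
    if not sets:
        return set(), set(), set()
    pan = set().union(*sets)
    core = set.intersection(*sets)
    accessory = pan - core
    return pan, core, accessory


# Hypothetical gene content for three individuals of one species.
genomes = {
    "A": ["g1", "g2", "g3"],
    "B": ["g1", "g2", "g4"],
    "C": ["g1", "g5"],
}
pan, core, accessory = pan_genome(genomes)
```

Only g1 is present in every genome here, so it forms the core, while the remaining four genes are accessory. Adding more individuals can only grow the pan set and shrink or preserve the core set.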

Source
DOI: http://dx.doi.org/10.1038/s41576-020-0210-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7752153
April 2020

Genome assembly and characterization of a complex zfBED-NLR gene-containing disease resistance locus in Carolina Gold Select rice with Nanopore sequencing.

PLoS Genet 2020 01 27;16(1):e1008571. Epub 2020 Jan 27.

Plant Pathology and Plant Microbe Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY, United States of America.

Long-read sequencing facilitates assembly of complex genomic regions. In plants, loci containing nucleotide-binding, leucine-rich repeat (NLR) disease resistance genes are an important example of such regions. NLR genes constitute one of the largest gene families in plants and are often clustered, evolving via duplication, contraction, and transposition. We recently mapped the Xo1 locus for resistance to bacterial blight and bacterial leaf streak, found in the American heirloom rice variety Carolina Gold Select, to a region that in the Nipponbare reference genome is NLR gene-rich. Here, toward identification of the Xo1 gene, we combined Nanopore and Illumina reads and generated a high-quality Carolina Gold Select genome assembly. We identified 529 complete or partial NLR genes and discovered, relative to Nipponbare, an expansion of NLR genes at the Xo1 locus. One of these has high sequence similarity to the cloned, functionally similar Xa1 gene. Both harbor an integrated zfBED domain, and the repeats within each protein are nearly perfect. Across diverse Oryzeae, we identified two sub-clades of NLR genes with these features, varying in the presence of the zfBED domain and the number of repeats. The Carolina Gold Select genome assembly also uncovered at the Xo1 locus a rice blast resistance gene and a gene encoding a polyphenol oxidase (PPO). PPO activity has been used as a marker for blast resistance at the locus in some varieties; however, the Carolina Gold Select sequence revealed a loss-of-function mutation in the PPO gene that breaks this association. Our results demonstrate that whole genome sequencing combining Nanopore and Illumina reads effectively resolves NLR gene loci. Our identification of an Xo1 candidate is an important step toward mechanistic characterization, including the role(s) of the zfBED domain. Finally, the Carolina Gold Select genome assembly will facilitate identification of other useful traits in this historically important variety.

Source
DOI: http://dx.doi.org/10.1371/journal.pgen.1008571
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7004385
January 2020

Transcriptome assembly from long-read RNA-seq alignments with StringTie2.

Genome Biol 2019 12 16;20(1):278. Epub 2019 Dec 16.

Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21205, USA.

RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

Source
DOI: http://dx.doi.org/10.1186/s13059-019-1910-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6912988
December 2019

Pavian: interactive analysis of metagenomics data for microbiome studies and pathogen identification.

Bioinformatics 2020 02;36(4):1303-1304

Departments of Biomedical Engineering, Computer Science and Biostatistics, Johns Hopkins University, Baltimore, MD, USA.

Summary: Pavian is a web application for exploring classification results from metagenomics experiments. With Pavian, researchers can analyze, visualize and transform results from various classifiers, such as Kraken, Centrifuge and MetaPhlAn, using interactive data tables, heatmaps and Sankey flow diagrams. An interactive alignment coverage viewer can help in the validation of matches to a particular genome, which can be crucial when using metagenomics experiments for pathogen detection.

Availability and implementation: Pavian is implemented in the R language as a modular Shiny web app and is freely available under GPL-3 from http://github.com/fbreitwieser/pavian.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btz715
February 2020

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype.

Nat Biotechnol 2019 08 2;37(8):907-915. Epub 2019 Aug 2.

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, School of Medicine, Johns Hopkins University, Baltimore, MD, USA.

The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina-Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.

Source
DOI: http://dx.doi.org/10.1038/s41587-019-0201-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7605509
August 2019

Cretaceous dinosaur bone contains recent organic material and provides an environment conducive to microbial communities.

Elife 2019 06 18;8. Epub 2019 Jun 18.

Department of Geosciences, Princeton University, Princeton, United States.

Fossils were thought to lack original organic molecules, but chemical analyses show that some can survive. Dinosaur bone has been proposed to preserve collagen, osteocytes, and blood vessels. However, proteins and labile lipids are diagenetically unstable, and bone is a porous open system, allowing microbial/molecular flux. These 'soft tissues' have been reinterpreted as biofilms. Organic preservation versus contamination of dinosaur bone was examined by freshly excavating, with aseptic protocols, fossils and sedimentary matrix, and chemically/biologically analyzing them. Fossil 'soft tissues' differed from collagen chemically and structurally; while degradation would be expected, the patterns observed did not support this. 16S rRNA amplicon sequencing revealed that dinosaur bone hosted an abundant microbial community different from the less abundant communities of surrounding sediment. Subsurface dinosaur bone is a relatively fertile habitat, attracting microbes that likely utilize inorganic nutrients and complicate identification of original organic material. There exist potential post-burial taphonomic roles for subsurface microorganisms.

Source
DOI: http://dx.doi.org/10.7554/eLife.46205
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581507
June 2019

Next-generation genome annotation: we still struggle to get it right.

Genome Biol 2019 05 16;20(1):92. Epub 2019 May 16.

Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA.

While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?

Source
DOI: http://dx.doi.org/10.1186/s13059-019-1715-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6521345
May 2019

Human contamination in bacterial genomes has created thousands of spurious proteins.

Genome Res 2019 06 7;29(6):954-960. Epub 2019 May 7.

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA.

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
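
The mitigation suggested at the end of this abstract, dropping small contigs from a draft assembly, can be sketched in a few lines of Python. This is only an illustration and not the authors' pipeline: the helper names `read_fasta` and `filter_small_contigs` are hypothetical, and the `min_length` default of 1000 bp is an assumed placeholder, since the abstract does not state a cutoff.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    lines: Iterable[str], min_length: int = 1000
) -> List[Tuple[str, str]]:
    """Keep only contigs whose sequence is at least min_length bases long.

    min_length is a hypothetical threshold chosen for illustration; the
    abstract does not specify one.
    """
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


if __name__ == "__main__":
    # Toy assembly: contig "b" is shorter than the cutoff and is dropped.
    demo = [">a", "ACGT", "ACGT", ">b", "AC", ">c", "ACGTA"]
    for name, seq in filter_small_contigs(demo, min_length=5):
        print(name, len(seq))
```

For a real assembly the same functions could be fed an open file handle, for example `filter_small_contigs(open(path), min_length=...)`, with the cutoff chosen from the contig length distribution of that assembly.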

Source
DOI: http://dx.doi.org/10.1101/gr.245373.118
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581058
June 2019

Author Correction: Assembly of a pan-genome from deep sequencing of 910 humans of African descent.

Nat Genet 2019 02;51(2):364

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.

In the version of this article initially published, the statement "there are no pan-genomes for any other animal or plant species" was incorrect. The statement has been corrected to "there are no reported pan-genomes for any other animal species, to our knowledge." We thank David Edwards for bringing this error to our attention. The error has been corrected in the HTML and PDF versions of the article.

Source
DOI: http://dx.doi.org/10.1038/s41588-018-0335-1
February 2019

CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise.

Genome Biol 2018 11 28;19(1):208. Epub 2018 Nov 28.

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.

We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.

Source
DOI: http://dx.doi.org/10.1186/s13059-018-1590-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6260756
November 2018

Assembly of a pan-genome from deep sequencing of 910 humans of African descent.

Nat Genet 2019 01 19;51(1):30-35. Epub 2018 Nov 19.

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
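
The headline figures in this abstract can be checked with simple arithmetic. The sketch below uses the two counts reported above (296,485,284 bp in 125,715 contigs); the reference length of about 3.1 billion bp is an assumed round figure for GRCh38 that does not appear in the abstract, so the percentage is indicative only.

```python
# Figures reported in the abstract above.
NOVEL_BP = 296_485_284
NOVEL_CONTIGS = 125_715

# ASSUMPTION: approximate total length of GRCh38 in base pairs.
# This round figure is not stated in the abstract.
ASSUMED_REFERENCE_BP = 3_100_000_000


def percent_extra(novel_bp: int, reference_bp: int) -> float:
    """Novel sequence expressed as a percentage of the reference length."""
    return 100.0 * novel_bp / reference_bp


def mean_contig_length(total_bp: int, n_contigs: int) -> float:
    """Average contig length in base pairs."""
    return total_bp / n_contigs


pct = percent_extra(NOVEL_BP, ASSUMED_REFERENCE_BP)
mean_len = mean_contig_length(NOVEL_BP, NOVEL_CONTIGS)
print(f"{pct:.1f}% additional sequence")      # 9.6% under the assumed length
print(f"{mean_len:.0f} bp mean contig length")  # 2358 bp
```

Under that assumed reference length the novel sequence comes to roughly 9.6%, consistent with the "~10%" quoted in the abstract, and the contigs average a little under 2.4 kbp each.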

Source
DOI: http://dx.doi.org/10.1038/s41588-018-0273-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6309586
January 2019