Publications by authors named "David Haussler"

208 Publications

Positive selection in noncoding genomic regions of vocal learning birds is associated with genes implicated in vocal learning and speech functions in humans.

Genome Res 2021 Oct 19. Epub 2021 Oct 19.

Laboratory of Neurogenetics of Language, The Rockefeller University, New York, New York 10065, USA.

Vocal learning, the ability to imitate sounds from conspecifics and the environment, is a key component of human spoken language and learned song in three independently evolved avian groups-oscine songbirds, parrots, and hummingbirds. Humans and each of these three bird clades exhibit specialized behavioral, neuroanatomical, and brain gene expression convergence related to vocal learning, speech, and song. To understand the evolutionary basis of vocal learning gene specializations and convergence, we searched for and identified accelerated genomic regions (ARs), a marker of positive selection, specific to vocal learning birds. We found avian vocal learner-specific ARs, and they were enriched in noncoding regions near genes with known speech functions or brain gene expression specializations in humans and vocal learning birds, including , , , and , and near genes with major neurodevelopmental functions, including , , and We also found enrichment near the SFARI class S genes associated with syndromic vocal communication forms of autism spectrum disorders. These findings reveal strong candidate noncoding regions near genes for the evolutionary adaptations that distinguish vocal learning species from their close vocal nonlearning relatives and provide further evidence of molecular convergence between birdsong and human spoken language.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/gr.275989.121DOI Listing
October 2021

Light-weight electrophysiology hardware and software platform for cloud-based neural recording experiments.

J Neural Eng 2021 Oct 19. Epub 2021 Oct 19.

Department of Biomolecular Engineering, University of California Santa Cruz, 1156 High St, Santa Cruz, California, 95064-1077, UNITED STATES.

Objective: Neural activity represents a functional readout of neurons that is increasingly important to monitor in a wide range of experiments. Extracellular recordings have emerged as a powerful technique for measuring neural activity because these methods do not lead to the destruction or degradation of the cells being measured. Current approaches to electrophysiology have a low throughput of experiments due to manual supervision and expensive equipment. This bottleneck limits broader inferences that can be achieved with numerous long-term recorded samples.

Approach: We developed Piphys, an inexpensive open source neurophysiological recording platform that consists of both hardware and software. It is easily accessed and controlled via a standard web interface through Internet of Things (IoT) protocols.

Main Results: We used a Raspberry Pi as the primary processing device along with an Intan bioamplifier. We designed a hardware expansion circuit board and software to enable voltage sampling and user interaction. This standalone system was validated with primary human neurons, showing reliability in collecting neural activity in near real-time.

Significance: The hardware modules and cloud software allow for remote control of neural recording experiments as well as horizontal scalability, enabling long-term observations of development, organization, and neural activity at scale.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1088/1741-2552/ac310aDOI Listing
October 2021

A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees.

Mol Biol Evol 2021 Sep 1. Epub 2021 Sep 1.

Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, 95064, USA.

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently-proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils-a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/molbev/msab264DOI Listing
September 2021

Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic.

Nat Genet 2021 06 10;53(6):809-816. Epub 2021 May 10.

Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA.

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of 'genomic contact tracing'-that is, using viral genomes to trace local transmission dynamics. However, because the viral phylogeny is already so large-and will undoubtedly grow many fold-placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach greatly improves the speed of phylogenetic placement of new samples and data visualization, making it possible to complete the placements under the constraints of real-time contact tracing. Thus, our method addresses an important need for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the University of California Santa Cruz SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for SARS-CoV-2 specifically for laboratories worldwide.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41588-021-00862-7DOI Listing
June 2021

Cancer microenvironment and genomics: evolution in process.

Clin Exp Metastasis 2021 May 10. Epub 2021 May 10.

Department of Surgery, Neurosurgery and Pediatrics, University of Arizona College of Medicine-Tucson, Tucson, AZ, USA.

Cancer heterogeneity is a result of genetic mutations within the cancer cells. Their proliferation is not only driven by autocrine functions but also under the influence of cancer microenvironment, which consists of normal stromal cells such as infiltrating immune cells, cancer-associated fibroblasts, endothelial cells, pericytes, vascular and lymphatic channels. The relationship between cancer cells and cancer microenvironment is a critical one and we are just on the verge to understand it on a molecular level. Cancer microenvironment may serve as a selective force to modulate cancer cells to allow them to evolve into more aggressive clones with ability to invade the lymphatic or vascular channels to spread to regional lymph nodes and distant sites. It is important to understand these steps of cancer evolution within the cancer microenvironment towards invasion so that therapeutic strategies can be developed to control or stop these processes.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1007/s10585-021-10097-9DOI Listing
May 2021

Towards complete and error-free genome assemblies of all vertebrate species.

Nature 2021 Apr 28;592(7856):737-746. Epub 2021 Apr 28.

UQ Genomics, University of Queensland, Brisbane, Queensland, Australia.

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41586-021-03451-0DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8081667PMC
April 2021

A new SARS-CoV-2 lineage that shares mutations with known Variants of Concern is rejected by automated sequence repository quality control.

bioRxiv 2021 Apr 6. Epub 2021 Apr 6.

We report a SARS-CoV-2 lineage that shares N501Y, P681H, and other mutations with known variants of concern, such as B.1.1.7. This lineage, which we refer to as B.1.x (COG-UK sometimes references similar samples as B.1.324.1), is present in at least 20 states across the USA and in at least six countries. However, a large deletion causes the sequence to be automatically rejected from repositories, suggesting that the frequency of this new lineage is underestimated using public data. Recent dynamics based on 339 samples obtained in Santa Cruz County, CA, USA suggest that B.1.x may be increasing in frequency at a rate similar to that of B.1.1.7 in Southern California. At present the functional differences between this variant B.1.x and other circulating SARS-CoV-2 variants are unknown, and further studies on secondary attack rates, viral loads, immune evasion and/or disease severity are needed to determine if it poses a public health concern. Nonetheless, given what is known from well-studied circulating variants of concern, it seems unlikely that the lineage could pose larger concerns for human health than many already globally distributed lineages. Our work highlights a need for rapid turnaround time from sequence generation to submission and improved sequence quality control that removes submission bias. We identify promising paths toward this goal.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/2021.04.05.438352DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8043452PMC
April 2021

A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees.

bioRxiv 2021 Jul 13. Epub 2021 Jul 13.

Department of Biomolecular Engineering, University of California Santa Cruz. Santa Cruz, CA 95064, USA.

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently-proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils - a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/2021.04.03.438321DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8020970PMC
July 2021

The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets.

Gigascience 2021 Mar;10(3)

UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.

Background: The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis.

Findings: In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]).

Conclusions: Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/gigascience/giab011DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7955155PMC
March 2021

Identification of a differentiation stall in epithelial mesenchymal transition in histone H3-mutant diffuse midline glioma.

Gigascience 2020 12;9(12)

University of California Santa Cruz Genomics Institute, University of California Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA.

Background: Diffuse midline gliomas with histone H3 K27M (H3K27M) mutations occur in early childhood and are marked by an invasive phenotype and global decrease in H3K27me3, an epigenetic mark that regulates differentiation and development. H3K27M mutation timing and effect on early embryonic brain development are not fully characterized.

Results: We analyzed multiple publicly available RNA sequencing datasets to identify differentially expressed genes between H3K27M and non-K27M pediatric gliomas. We found that genes involved in the epithelial-mesenchymal transition (EMT) were significantly overrepresented among differentially expressed genes. Overall, the expression of pre-EMT genes was increased in the H3K27M tumors as compared to non-K27M tumors, while the expression of post-EMT genes was decreased. We hypothesized that H3K27M may contribute to gliomagenesis by stalling an EMT required for early brain development, and evaluated this hypothesis by using another publicly available dataset of single-cell and bulk RNA sequencing data from developing cerebral organoids. This analysis revealed similarities between H3K27M tumors and pre-EMT normal brain cells. Finally, a previously published single-cell RNA sequencing dataset of H3K27M and non-K27M gliomas revealed subgroups of cells at different stages of EMT. In particular, H3.1K27M tumors resemble a later EMT stage compared to H3.3K27M tumors.

Conclusions: Our data analyses indicate that this mutation may be associated with a differentiation stall evident from the failure to proceed through the EMT-like developmental processes, and that H3K27M cells preferentially exist in a pre-EMT cell phenotype. This study demonstrates how novel biological insights could be derived from combined analysis of several previously published datasets, highlighting the importance of making genomic data available to the community in a timely manner.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/gigascience/giaa136DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7736793PMC
December 2020

ProTECT-Prediction of T-Cell Epitopes for Cancer Therapy.

Front Immunol 2020 10;11:483296. Epub 2020 Nov 10.

Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States.

Somatic mutations in cancers affecting protein coding genes can give rise to potentially therapeutic neoepitopes. These neoepitopes can guide Adoptive Cell Therapies and Peptide- and RNA-based Neoepitope Vaccines to selectively target tumor cells using autologous patient cytotoxic T-cells. Currently, researchers have to independently align their data, call somatic mutations and haplotype the patient's HLA to use existing neoepitope prediction tools. We present ProTECT, a fully automated, reproducible, scalable, and efficient end-to-end analysis pipeline to identify and rank therapeutically relevant tumor neoepitopes in terms of potential immunogenicity starting directly from raw patient sequencing data, or from pre-processed data. The ProTECT pipeline encompasses alignment, HLA haplotyping, mutation calling (single nucleotide variants, short insertions and deletions, and gene fusions), peptide:MHC binding prediction, and ranking of final candidates. We demonstrate the scalability, efficiency, and utility of ProTECT on 326 samples from the TCGA Prostate Adenocarcinoma cohort, identifying recurrent potential neoepitopes from TMPRSS2-ERG fusions, and from SNVs in SPOP. We also compare ProTECT with results from published tools. ProTECT can be run on a standalone computer, a local cluster, or on a compute cloud using a Mesos backend. ProTECT is highly scalable and can process TCGA data in under 30 min per sample (on average) when run in large batches. ProTECT is freely available at https://www.github.com/BD2KGenomics/protect.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.3389/fimmu.2020.483296DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7683782PMC
April 2021

The UCSC Genome Browser database: 2021 update.

Nucleic Acids Res 2021 01;49(D1):D1046-D1057

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

For more than two decades, the UCSC Genome Browser database (https://genome.ucsc.edu) has provided high-quality genomics data visualization and genome annotations to the research community. As the field of genomics grows and more data become available, new modes of display are required to accommodate new technologies. New features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements. Striving to keep data up-to-date, new updates to gene annotations include GENCODE Genes, NCBI RefSeq Genes, and Ensembl Genes. New data tracks added for human and mouse genomes include the ENCODE registry of candidate cis-regulatory elements, promoters from the Eukaryotic Promoter Database, and NCBI RefSeq Select and Matched Annotation from NCBI and EMBL-EBI (MANE). Within weeks of learning about the outbreak of coronavirus, UCSC released a genome browser, with detailed annotation tracks, for the SARS-CoV-2 RNA reference assembly.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkaa1070DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7779060PMC
January 2021

Stability of SARS-CoV-2 phylogenies.

PLoS Genet 2020 11 18;16(11):e1009175. Epub 2020 Nov 18.

Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, United States of America.

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab-or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1371/journal.pgen.1009175DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7721162PMC
November 2020

Dense sampling of bird diversity increases power of comparative genomics.

Nature 2020 11 11;587(7833):252-257. Epub 2020 Nov 11.

Centre for Zoo and Wild Animal Health, Copenhagen Zoo, Frederiksberg, Denmark.

Whole-genome sequencing projects are increasingly populating the tree of life and characterizing biodiversity. Sparse taxon sampling has previously been proposed to confound phylogenetic inference, and captures only a fraction of the genomic diversity. Here we report a substantial step towards the dense representation of avian phylogenetic and molecular diversity, by analysing 363 genomes from 92.4% of bird families-including 267 newly sequenced genomes produced for phase II of the Bird 10,000 Genomes (B10K) Project. We use this comparative genome dataset in combination with a pipeline that leverages a reference-free whole-genome alignment to identify orthologous regions in greater numbers than has previously been possible and to recognize genomic novelties in particular bird lineages. The densely sampled alignment provides a single-base-pair map of selection, has more than doubled the fraction of bases that are confidently predicted to be under conservation and reveals extensive patterns of weak selection in predominantly non-coding DNA. Our results demonstrate that increasing the diversity of genomes used in comparative studies can reveal more shared and lineage-specific variation, and improve the investigation of genomic characteristics. We anticipate that this genomic resource will offer new perspectives on evolutionary processes in cross-species comparative analyses and assist in efforts to conserve species.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41586-020-2873-9DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7759463PMC
November 2020

Progressive Cactus is a multiple-genome aligner for the thousand-genome era.

Nature 2020 11 11;587(7833):246-251. Epub 2020 Nov 11.

UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA.

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41586-020-2871-yDOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673649PMC
November 2020

Spiking neural state machine for gait frequency entrainment in a flexible modular robot.

PLoS One 2020 21;15(10):e0240267. Epub 2020 Oct 21.

Department of Electrical and Computer Engineering, University of California, Santa Cruz, Santa Cruz, California, United States of America.

We propose a modular architecture for neuromorphic closed-loop control based on bistable relaxation oscillator modules consisting of three spiking neurons each. Like its biological prototypes, this basic component is robust to parameter variation but can be modulated by external inputs. By combining these modules, we can construct a neural state machine capable of generating the cyclic or repetitive behaviors necessary for legged locomotion. A concrete case study for the approach is provided by a modular robot constructed from flexible plastic volumetric pixels, in which we produce a forward crawling gait entrained to the natural frequency of the robot by a minimal system of twelve neurons organized into four modules.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0240267PLOS
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7577446PMC
December 2020

Ultrafast Sample Placement on Existing Trees (UShER) Empowers Real-Time Phylogenetics for the SARS-CoV-2 Pandemic.

bioRxiv 2020 Sep 28. Epub 2020 Sep 28.

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering a new era of "genomic contact tracing" - that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large - and will undoubtedly grow many fold - placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient, tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach improves the speed of phylogenetic placement of new samples and data visualization by orders of magnitude, making it possible to complete the placements under real-time constraints. Our method also provides the key ingredient for maintaining a fully-updated reference phylogeny. We make these tools available to the research community through the UCSC SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for laboratories worldwide.

Software Availability: USHER is available to users through the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgPhyloPlace . The source code and detailed instructions on how to compile and run UShER are available from https://github.com/yatisht/usher .
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1101/2020.09.26.314971DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536873PMC
September 2020

Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.

Nat Biotechnol 2020 09 4;38(9):1044-1053. Epub 2020 May 4.

Chan Zuckerberg Initiative, Redwood City, CA, USA.

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41587-020-0503-6DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7483855PMC
September 2020

A user guide for the online exploration and visualization of PCAWG data.

Nat Commun 2020 07 7;11(1):3400. Epub 2020 Jul 7.

European Molecular Biology Laboratory, 69117, Heidelberg, Germany.

The Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we provide a user's guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. We detail use cases and analyses for each tool, show how they incorporate outside resources from the larger genomics ecosystem, and demonstrate how the tools can be used together to understand the biology of cancers more deeply. Together, the tools enable researchers to query the complex genomic PCAWG data dynamically and integrate external information, enabling and enhancing interpretation.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1038/s41467-020-16785-6DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7340791PMC
July 2020

Hydra: A mixture modeling framework for subtyping pediatric cancer cohorts using multimodal gene expression signatures.

PLoS Comput Biol 2020 04 10;16(4):e1007753. Epub 2020 Apr 10.

Genomics Institute, University of California, Santa Cruz, Santa Cruz, California, United States of America.

Precision oncology has primarily relied on coding mutations as biomarkers of response to therapies. While transcriptome analysis can provide valuable information, incorporation into workflows has been difficult. For example, the relative rather than absolute gene expression level needs to be considered, requiring differential expression analysis across samples. However, expression programs related to the cell-of-origin and tumor microenvironment effects confound the search for cancer-specific expression changes. To address these challenges, we developed an unsupervised clustering approach for discovering differential pathway expression within cancer cohorts using gene expression measurements. The hydra approach uses a Dirichlet process mixture model to automatically detect multimodally distributed genes and expression signatures without the need for matched normal tissue. We demonstrate that the hydra approach is more sensitive than widely-used gene set enrichment approaches for detecting multimodal expression signatures. Application of the hydra analysis framework to small blue round cell tumors (including rhabdomyosarcoma, synovial sarcoma, neuroblastoma, Ewing sarcoma, and osteosarcoma) identified expression signatures associated with changes in the tumor microenvironment. The hydra approach also identified an association between ATRX deletions and elevated immune marker expression in high-risk neuroblastoma. Notably, hydra analysis of all small blue round cell tumors revealed similar subtypes, characterized by changes to infiltrating immune and stromal expression signatures.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1371/journal.pcbi.1007753DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7176284PMC
April 2020

The UCSC repeat browser allows discovery and visualization of evolutionary conflict across repeat families.

Mob DNA 2020 31;11:13. Epub 2020 Mar 31.

1Genomics Institute, University of California, Santa Cruz, USA.

Background: Nearly half the human genome consists of repeat elements, most of which are retrotransposons, and many of which play important biological roles. However repeat elements pose several unique challenges to current bioinformatic analyses and visualization tools, as short repeat sequences can map to multiple genomic loci resulting in their misclassification and misinterpretation. In fact, sequence data mapping to repeat elements are often discarded from analysis pipelines. Therefore, there is a continued need for standardized tools and techniques to interpret genomic data of repeats.

Results: We present the UCSC Repeat Browser, which consists of a complete set of human repeat reference sequences derived from annotations made by the commonly used program RepeatMasker. The UCSC Repeat Browser also provides an alignment from the human genome to these references, uses it to map the standard human genome annotation tracks, and presents all of them as a comprehensive interface to facilitate work with repetitive elements. It also provides processed tracks of multiple publicly available datasets of particular interest to the repeat community, including ChIP-seq datasets for KRAB Zinc Finger Proteins (KZNFs) - a family of proteins known to bind and repress certain classes of repeats. We used the UCSC Repeat Browser in combination with these datasets, as well as RepeatMasker annotations in several non-human primates, to trace the independent trajectories of species-specific evolutionary battles between LINE 1 retroelements and their repressors. Furthermore, we document at https://repeatbrowser.ucsc.edu how researchers can map their own human genome annotations to these reference repeat sequences.

Conclusions: The UCSC Repeat Browser allows easy and intuitive visualization of genomic data on consensus repeat elements, circumventing the problem of multi-mapping, in which sequencing reads of repeat elements map to multiple locations on the human genome. By developing a reference consensus, multiple datasets and annotation tracks can easily be overlaid to reveal complex evolutionary histories of repeats in a single interactive window. Specifically, we use this approach to retrace the history of several primate specific LINE-1 families across apes, and discover several species-specific routes of evolution that correlate with the emergence and binding of KZNFs.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/s13100-020-00208-wDOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7110667PMC
March 2020

Blockchain-Authenticated Sharing of Genomic and Clinical Outcomes Data of Patients With Cancer: A Prospective Cohort Study.

J Med Internet Res 2020 03 20;22(3):e16810. Epub 2020 Mar 20.

Division of Hematology and Oncology, Department of Medicine, University of California, San Francisco, CA, United States.

Background: Efficiently sharing health data produced during standard care could dramatically accelerate progress in cancer treatments, but various barriers make this difficult. Not sharing these data to ensure patient privacy is at the cost of little to no learning from real-world data produced during cancer care. Furthermore, recent research has demonstrated a willingness of patients with cancer to share their treatment experiences to fuel research, despite potential risks to privacy.

Objective: The objective of this study was to design, pilot, and release a decentralized, scalable, efficient, economical, and secure strategy for the dissemination of deidentified clinical and genomic data with a focus on late-stage cancer.

Methods: We created and piloted a blockchain-authenticated system to enable secure sharing of deidentified patient data derived from standard of care imaging, genomic testing, and electronic health records (EHRs), called the Cancer Gene Trust (CGT). We prospectively consented and collected data for a pilot cohort (N=18), which we uploaded to the CGT. EHR data were extracted from both a hospital cancer registry and a common data model (CDM) format to identify optimal data extraction and dissemination practices. Specifically, we scored and compared the level of completeness between two EHR data extraction formats against the gold standard source documentation for patients with available data (n=17).

Results: Although the total completeness scores were greater for the registry reports than those for the CDM, this difference was not statistically significant. We did find that some specific data fields, such as histology site, were better captured using the registry reports, which can be used to improve the continually adapting CDM. In terms of the overall pilot study, we found that CGT enables rapid integration of real-world data of patients with cancer in a more clinically useful time frame. We also developed an open-source Web application to allow users to seamlessly search, browse, explore, and download CGT data.

Conclusions: Our pilot demonstrates the willingness of patients with cancer to participate in data sharing and how blockchain-enabled structures can maintain relationships between individual data elements while preserving patient privacy, empowering findings by third-party researchers and clinicians. We demonstrate the feasibility of CGT as a framework to share health data trapped in silos to further cancer research. Further studies to optimize data representation, stream, and integrity are required.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.2196/16810DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7125440PMC
March 2020

Genomic Profiling of Childhood Tumor Patient-Derived Xenograft Models to Enable Rational Clinical Trial Design.

Cell Rep 2019 11;29(6):1675-1689.e9

Preclinical Neurooncology Research Program, Texas Children's Cancer Research Center, Texas Children's Hospital, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA.

Accelerating cures for children with cancer remains an immediate challenge as a result of extensive oncogenic heterogeneity between and within histologies, distinct molecular mechanisms evolving between diagnosis and relapsed disease, and limited therapeutic options. To systematically prioritize and rationally test novel agents in preclinical murine models, researchers within the Pediatric Preclinical Testing Consortium are continuously developing patient-derived xenografts (PDXs)-many of which are refractory to current standard-of-care treatments-from high-risk childhood cancers. Here, we genomically characterize 261 PDX models from 37 unique pediatric cancers; demonstrate faithful recapitulation of histologies and subtypes; and refine our understanding of relapsed disease. In addition, we use expression signatures to classify tumors for TP53 and NF1 pathway inactivation. We anticipate that these data will serve as a resource for pediatric oncology drug development and will guide rational clinical trial design for children with cancer.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.celrep.2019.09.071DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6880934PMC
November 2019

UCSC Genome Browser enters 20th year.

Nucleic Acids Res 2020 01;48(D1):D756-D761

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

The University of California Santa Cruz Genome Browser website (https://genome.ucsc.edu) enters its 20th year of providing high-quality genomics data visualization and genome annotations to the research community. In the past year, we have added a new option to our web BLAT tool that allows search against all genomes, a single-cell expression viewer (https://cells.ucsc.edu), a 'lollipop' plot display mode for high-density variation data, a RESTful API for data extraction and a custom-track backup feature. New datasets include Tabula Muris single-cell expression data, GeneHancer regulatory annotations, The Cancer Genome Atlas Pan-Cancer variants, Genome Reference Consortium Patch sequences, new ENCODE transcription factor binding site peaks and clusters, the Database of Genomic Variants Gold Standard Variants, Genomenon Mastermind variants and three new multi-species alignment tracks.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkz1012DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7145642PMC
January 2020

Comparative Tumor RNA Sequencing Analysis for Difficult-to-Treat Pediatric and Young Adult Patients With Cancer.

JAMA Netw Open 2019 10 2;2(10):e1913968. Epub 2019 Oct 2.

CHOC Children's Hospital, Hyundai Cancer Institute, Orange, California.

Importance: Pediatric cancers are epigenetic diseases; therefore, considering tumor gene expression information is necessary for a complete understanding of the tumorigenic processes.

Objective: To evaluate the feasibility and utility of incorporating comparative gene expression information into the precision medicine framework for difficult-to-treat pediatric and young adult patients with cancer.

Design, Setting, And Participants: This cohort study was conducted as a consortium between the University of California, Santa Cruz (UCSC) Treehouse Childhood Cancer Initiative and clinical genomic trials. RNA sequencing (RNA-Seq) data were obtained from the following 4 clinical sites and analyzed at UCSC: British Columbia Children's Hospital (n = 31), Lucile Packard Children's Hospital at Stanford University (n = 80), CHOC Children's Hospital and Hyundai Cancer Institute (n = 46), and the Pacific Pediatric Neuro-Oncology Consortium (n = 24). The study dates were January 1, 2016, to March 22, 2017.

Exposures: Participants underwent tumor RNA-Seq profiling as part of 4 separate clinical trials at partner hospitals. The UCSC either downloaded RNA-Seq data from a partner institution for analysis in the cloud or provided a Docker pipeline that performed the same analysis at a partner institution. The UCSC then compared each participant's tumor RNA-Seq profile with more than 11 000 uniformly analyzed tumor profiles from pediatric and young adult patients with cancer, downloaded from public data repositories. These comparisons were used to identify genes and pathways that are significantly overexpressed in each patient's tumor. Results of the UCSC analysis were presented to clinical partners.

Main Outcomes And Measures: Feasibility of a third-party institution (UCSC Treehouse Childhood Cancer Initiative) to obtain tumor RNA-Seq data from patients, conduct comparative analysis, and present analysis results to clinicians; and proportion of patients for whom comparative tumor gene expression analysis provided useful clinical and biological information.

Results: Among 144 samples from children and young adults (median age at diagnosis, 9 years; range, 0-26 years; 72 of 118 [61.0%] male [26 patients sex unknown]) with a relapsed, refractory, or rare cancer treated on precision medicine protocols, RNA-Seq-derived gene expression was potentially useful for 99 of 144 samples (68.8%) compared with DNA mutation information that was potentially useful for only 34 of 74 samples (45.9%).

Conclusions And Relevance: This study's findings suggest that tumor RNA-Seq comparisons may be feasible and highlight the potential clinical utility of incorporating such comparisons into the clinical genomic interpretation framework for difficult-to-treat pediatric and young adult patients with cancer. The study also highlights for the first time to date the potential clinical utility of harmonized publicly available genomic data sets.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1001/jamanetworkopen.2019.13968DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822083PMC
October 2019
-->