Publications by authors named "Paul D Thomas"

59 Publications

Ten years of collaborative progress in the Quest for Orthologs.

Mol Biol Evol 2021 Apr 2. Epub 2021 Apr 2.

Department of Theoretical Biology, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, 444-8585, Japan.

Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology - evolutionary relatedness - is in many cases readily determined based on sequence similarity analysis. By contrast, determining whether two genes directly descended from a common ancestor by a speciation event (orthologs) or duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) consortium. The 6th QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardisation and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit, from domains to networks. This meeting brought to light several challenges to come: leveraging orthology computations to get the most out of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.

Source
DOI: http://dx.doi.org/10.1093/molbev/msab098
April 2021

Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees.

PLoS Comput Biol 2021 Feb 18;17(2):e1007948. Epub 2021 Feb 18.

Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, California, United States of America.

Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1007948
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924801
February 2021

PhyloGenes: An online phylogenetics and functional genomics resource for plant gene function inference.

Plant Direct 2020 Dec 30;4(12):e00293. Epub 2020 Dec 30.

Phoenix Bioinformatics, Fremont, CA, USA.

We aim to enable the accurate and efficient transfer of knowledge about gene function gained from Arabidopsis thaliana and other model organisms to other plant species. This knowledge transfer is frequently challenging in plants due to duplications of individual genes and whole genomes in plant lineages. Such duplications result in complex evolutionary relationships between related genes, which may have similar sequences but highly divergent functions. In such cases, functional inference requires more than a simple sequence similarity calculation. We have developed an online resource, PhyloGenes (phylogenes.org), that displays precomputed phylogenetic trees for plant gene families along with experimentally validated function information for individual genes within the families. A total of 40 plant genomes and 10 non-plant model organisms are represented in over 8,000 gene families. Evolutionary events such as speciation and duplication are clearly labeled on gene trees to distinguish orthologs from paralogs. Nearly 6,000 families have at least one member with an experimentally supported annotation to a Gene Ontology (GO) molecular function or biological process term. By displaying experimentally validated gene functions associated with individual genes within a tree, PhyloGenes enables functional inference for genes of uncharacterized function, based on their evolutionary relationships to experimentally studied genes, in a visually traceable manner. For the many families containing genes that have evolved to perform different functions, PhyloGenes facilitates the use of evolutionary history to determine the most likely function of genes that have not been experimentally characterized. Future work will enrich the resource by incorporating additional gene function datasets such as plant gene expression atlas data.
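
The role of labeled speciation and duplication nodes can be illustrated with a small sketch. It is not taken from PhyloGenes; the `Node` class, the gene names, and the toy tree are hypothetical. It assumes the usual convention that two genes are orthologs when their last common ancestor in the gene tree is a speciation node and paralogs when it is a duplication node.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Node:
    """Gene-tree node: leaves carry a gene name, internal nodes an event label."""
    event: str = "leaf"  # "speciation", "duplication", or "leaf"
    gene: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Collect the gene names below a node."""
    if not node.children:
        return [node.gene]
    return [g for child in node.children for g in leaves(child)]


def classify_pairs(node: Node, out: Optional[Dict[FrozenSet[str], str]] = None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    # Genes drawn from two different child subtrees have this node as their LCA.
    for left, right in combinations(node.children, 2):
        relation = "ortholog" if node.event == "speciation" else "paralog"
        for a in leaves(left):
            for b in leaves(right):
                out[frozenset((a, b))] = relation
    for child in node.children:
        classify_pairs(child, out)
    return out


# Hypothetical family: one duplication, then a speciation in each copy.
tree = Node("duplication", children=[
    Node("speciation", children=[Node(gene="AtA1"), Node(gene="OsA1")]),
    Node("speciation", children=[Node(gene="AtA2"), Node(gene="OsA2")]),
])
pairs = classify_pairs(tree)
for pair, relation in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), relation)
```

In this toy tree the two genes within each copy are orthologs, while any pair spanning the duplication is paralogous, even when the two genes come from different species.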

Source
DOI: http://dx.doi.org/10.1002/pld3.293
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7773024
December 2020

PEREGRINE: A genome-wide prediction of enhancer to gene relationships supported by experimental evidence.

PLoS One 2020 Dec 15;15(12):e0243791. Epub 2020 Dec 15.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, United States of America.

Enhancers are powerful and versatile agents of cell-type specific gene regulation, which are thought to play key roles in human disease. Enhancers are short DNA elements that function primarily as clusters of transcription factor binding sites that are spatially coordinated to regulate expression of one or more specific target genes. These regulatory connections between enhancers and target genes can therefore be characterized as enhancer-gene links that can affect development, disease, and homeostatic cellular processes. Despite their implication in disease and the establishment of cell identity during development, most enhancer-gene links remain unknown. Here we introduce a new, publicly accessible database of predicted enhancer-gene links, PEREGRINE. The PEREGRINE human enhancer-gene links interactive web interface incorporates publicly available experimental data from ChIA-PET, eQTL, and Hi-C assays across 78 cell and tissue types to link 449,627 enhancers to 17,643 protein-coding genes. These enhancer-gene links are made available through the new Enhancer module of the PANTHER database and website where the user may easily access the evidence for each enhancer-gene link, as well as query by target gene and enhancer location.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243791
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7737992
January 2021

PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API.

Nucleic Acids Res 2021 01;49(D1):D394-D403

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.

PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1106
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778891
January 2021

The InterPro protein families and domains database: 20 years on.

Nucleic Acids Res 2021 01;49(D1):D344-D354

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa977
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778928
January 2021

The Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST).

Bioinformatics 2021 Apr;36(24):5712-5718

Department of Biology, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway.

Motivation: A large variety of molecular interactions occurs between biomolecular components in cells. When a molecular interaction results in a regulatory effect, exerted by one component onto a downstream component, a so-called 'causal interaction' takes place. Causal interactions constitute the building blocks in our understanding of larger regulatory networks in cells. These causal interactions and the biological processes they enable (e.g. gene regulation) need to be described with a careful appreciation of the underlying molecular reactions. A proper description of this information enables archiving, sharing and reuse by humans and for automated computational processing. Various representations of causal relationships between biological components are currently used in a variety of resources.

Results: Here, we propose a checklist that accommodates current representations, called the Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST). This checklist defines both the required core information, as well as a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. The MI2CAST checklist can be used as reporting guidelines when annotating and curating causal statements, while fostering uniformity and interoperability of the data across resources.

Availability And Implementation: The checklist together with examples is accessible at https://github.com/MI2CAST/MI2CAST.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btaa622
April 2021

The Quest for Orthologs benchmark service and consensus calls in 2020.

Nucleic Acids Res 2020 07;48(W1):W538-W545

SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.

The identification of orthologs (genes in different species which descended from the same gene in their last common ancestor) is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources have been conceived to address this problem, but benchmarking and interpreting them is fraught with difficulties (need to compare them on a common input dataset, absence of ground truth, computational cost of calling orthologs). To address this, the Quest for Orthologs consortium maintains a reference set of proteomes and provides a web server for continuous orthology benchmarking (http://orthology.benchmarkservice.org). Furthermore, consensus ortholog calls derived from public benchmark submissions are provided on the Alliance of Genome Resources website, the joint portal of NIH-funded model organism databases.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa308
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7319555
July 2020

Advances and Applications in the Quest for Orthologs.

Mol Biol Evol 2019 10;36(10):2157-2164

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA.

Gene families evolve by the processes of speciation (creating orthologs), gene duplication (paralogs), and horizontal gene transfer (xenologs), in addition to sequence divergence and gene loss. Orthologs in particular play an essential role in comparative genomics and phylogenomic analyses. With the continued sequencing of organisms across the tree of life, the data are available to reconstruct the unique evolutionary histories of tens of thousands of gene families. Accurate reconstruction of these histories, however, is a challenging computational problem, and the focus of the Quest for Orthologs Consortium. We review the recent advances and outstanding challenges in this field, as revealed at a symposium and meeting held at the University of Southern California in 2017. Key advances have been made both at the level of orthology algorithm development and with respect to coordination across the community of algorithm developers and orthology end-users. Applications spanned a broad range, including gene function prediction, phylostratigraphy, genome evolution, and phylogenomics. The meetings highlighted the increasing use of meta-analyses integrating results from multiple different algorithms, and discussed ongoing challenges in orthology inference as well as the next steps toward improvement and integration of orthology resources.

Source
DOI: http://dx.doi.org/10.1093/molbev/msz150
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6759064
October 2019

SynGO: An Evidence-Based, Expert-Curated Knowledge Base for the Synapse.

Neuron 2019 07 3;103(2):217-234.e4. Epub 2019 Jun 3.

Department of Functional Genomics, CNCR, VU University and UMC Amsterdam, 1081 HV Amsterdam, the Netherlands.

Synapses are fundamental information-processing units of the brain, and synaptic dysregulation is central to many brain disorders ("synaptopathies"). However, systematic annotation of synaptic genes and ontology of synaptic processes are currently lacking. We established SynGO, an interactive knowledge base that accumulates available research about synapse biology using Gene Ontology (GO) annotations to novel ontology terms: 87 synaptic locations and 179 synaptic processes. SynGO annotations are exclusively based on published, expert-curated evidence. Using 2,922 annotations for 1,112 genes, we show that synaptic genes are exceptionally well conserved and less tolerant to mutations than other genes. Many SynGO terms are significantly overrepresented among gene variations associated with intelligence, educational attainment, ADHD, autism, and bipolar disorder and among de novo variants associated with neurodevelopmental disorders, including schizophrenia. SynGO is a public, universal reference for synapse research and an online analysis platform for interpretation of large-scale -omics data (https://syngoportal.org and http://geneontology.org).

Source
DOI: http://dx.doi.org/10.1016/j.neuron.2019.05.002
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6764089
July 2019

GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness.

BMC Bioinformatics 2019 Mar 27;20(1):155. Epub 2019 Mar 27.

Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, USA.

Background: Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene "functional similarity" clustering and the related problem of functional ontology structure inference, but it is not known how different similarity measures or clustering methods perform on this task, and how the clusters are affected by annotation completeness.

Results: We developed representations of both "complete" and "incomplete" GO annotation datasets based on experimentally-supported annotations from the GO database (specifically designed to model the incompleteness of human gene annotations) and computed semantic similarities for each set using a variety of different published measures. We then assessed the clusters derived from these measures using two different clustering methods: hierarchical clustering, and the CliXO algorithm. We find the CliXO algorithm, combined with appropriate measures, performs better than hierarchical clustering in reconstructing GO both when the data are complete and when they are incomplete. Some measures, particularly those that create a pairwise gene similarity by averaging over all pairwise annotation similarities, had consistently poor performance, and a few measures, such as Lin best-matched average and Relevance maximum, were generally among the best performers for a broad range in annotation completeness and types of GO classes. Finally, we show that for semantic similarity-based clustering, the multicellular organism process branch of the GO biological process ontology is more challenging to represent than the cellular process branch.

Conclusions: We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores. Our results suggest combinations of semantic similarity measures, gene-level scoring methods and clustering methods that perform best for functional gene clustering using annotation sets of varying completeness. Overall, our results underscore the importance of increasing the completeness of GO annotations for supporting computational analyses of gene function.
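
The contrast between averaging over all pairwise annotation similarities and a best-match average can be sketched as follows. This is an illustration only, not the study's code: the term names and similarity values are hypothetical, and the best-match average is taken here as the mean of the row-wise and column-wise best matches, which is one of several definitions in use.

```python
def similarity_matrix(terms_a, terms_b, sim):
    """Term-by-term similarity matrix for two genes' annotation sets."""
    return [[sim(a, b) for b in terms_b] for a in terms_a]


def average_all_pairs(terms_a, terms_b, sim):
    """Gene similarity as the mean over every term pair."""
    matrix = similarity_matrix(terms_a, terms_b, sim)
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def best_match_average(terms_a, terms_b, sim):
    """Gene similarity as the mean of each term's best match in the other gene."""
    matrix = similarity_matrix(terms_a, terms_b, sim)
    row_best = [max(row) for row in matrix]        # best match for each term of A
    col_best = [max(col) for col in zip(*matrix)]  # best match for each term of B
    return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2


def toy_sim(a, b):
    """Hypothetical term similarity: identical terms score 1, unrelated terms 0."""
    return 1.0 if a == b else 0.0


# Two genes with identical annotations to two unrelated functions.
gene_a = ["term_kinase", "term_transport"]
gene_b = ["term_kinase", "term_transport"]
print(average_all_pairs(gene_a, gene_b, toy_sim))   # 0.5
print(best_match_average(gene_a, gene_b, toy_sim))  # 1.0
```

Averaging over all pairs scores two identically annotated genes at only 0.5 because the cross-function pairs dilute the result, which is one plausible reason such measures can behave poorly for genes with several distinct functions.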

Source
DOI: http://dx.doi.org/10.1186/s12859-019-2752-2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6437941
March 2019

Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0).

Nat Protoc 2019 03 25;14(3):703-721. Epub 2019 Feb 25.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

The PANTHER classification system (http://www.pantherdb.org) is a comprehensive system that combines genomes, gene function classifications, pathways and statistical analysis tools to enable biologists to analyze large-scale genome-wide experimental data. The current system (PANTHER v.14.0) covers 131 complete genomes organized into gene families and subfamilies; evolutionary relationships between genes are represented in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models (HMMs)). The families and subfamilies are annotated with Gene Ontology (GO) terms, and sequences are assigned to PANTHER pathways. A suite of tools has been built to allow users to browse and query gene functions and analyze large-scale experimental data with a number of statistical tests. PANTHER is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. Since the protocol for using this tool (v.8.0) was originally published in 2013, there have been substantial improvements and updates in the areas of data quality, data coverage, statistical algorithms and user experience. This Protocol Update provides detailed instructions on how to analyze genome-wide experimental data in the PANTHER classification system.

Source
DOI: http://dx.doi.org/10.1038/s41596-019-0128-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6519457
March 2019

PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools.

Nucleic Acids Res 2019 01;47(D1):D419-D426

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90033, USA.

PANTHER (Protein Analysis Through Evolutionary Relationships, http://pantherdb.org) is a resource for the evolutionary and functional classification of genes from organisms across the tree of life. We report the improvements we have made to the resource during the past two years. For evolutionary classifications, we have added more prokaryotic and plant genomes to the phylogenetic gene trees, expanding the representation of gene evolution in these lineages. We have refined many protein family boundaries, and have aligned PANTHER with the MEROPS resource for protease and protease inhibitor families. For functional classifications, we have developed an entirely new PANTHER GO-slim, containing over four times as many Gene Ontology terms as our previous GO-slim, as well as curated associations of genes to these terms. Lastly, we have made substantial improvements to the enrichment analysis tools available on the PANTHER website: users can now analyze over 900 different genomes, using updated statistical tests with false discovery rate corrections for multiple testing. The overrepresentation test is also available as a web service, for easy addition to third-party sites.
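
An overrepresentation test with false discovery rate correction can be sketched in a few lines. The abstract does not name the specific statistics, so this sketch assumes a one-sided hypergeometric test and the Benjamini-Hochberg procedure; the function names and the counts are hypothetical and this is not the PANTHER implementation or its web service.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (genes in list with term, genes in genome with term).
genome_size, list_size = 20000, 200
terms = {"term_A": (30, 400), "term_B": (5, 450), "term_C": (12, 300)}

names = list(terms)
raw = [hypergeom_upper_tail(terms[t][0], list_size, terms[t][1], genome_size)
       for t in names]
for name, p, q in zip(names, raw, benjamini_hochberg(raw)):
    print(f"{name}: p={p:.3g} FDR={q:.3g}")
```

Exact integer binomials from `math.comb` keep the tail probability free of floating-point underflow for gene lists of this size, at the cost of speed on very large inputs.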

Source
DOI: http://dx.doi.org/10.1093/nar/gky1038
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323939
January 2019

InterPro in 2019: improving coverage, classification and access to protein sequence annotations.

Nucleic Acids Res 2019 01;47(D1):D351-D360

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.

Source
DOI: http://dx.doi.org/10.1093/nar/gky1100
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323941
January 2019

Ancestral Genomes: a resource for reconstructed ancestral genes and genomes across the tree of life.

Nucleic Acids Res 2019 01;47(D1):D271-D279

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90033, USA.

A growing number of whole genome sequencing projects, in combination with development of phylogenetic methods for reconstructing gene evolution, have provided us with a window into genomes that existed millions, and even billions, of years ago. Ancestral Genomes (http://ancestralgenomes.org) is a resource for comprehensive reconstructions of these 'fossil genomes'. Comprehensive sets of protein-coding genes have been reconstructed for 78 genomes of now-extinct species that were the common ancestors of extant species from across the tree of life. The reconstructed genes are based on the extensive library of over 15 000 gene family trees from the PANTHER database, and are updated on a yearly basis. For each ancestral gene, we assign a stable identifier, and provide additional information designed to facilitate analysis: an inferred name, a reconstructed protein sequence, a set of inferred Gene Ontology (GO) annotations, and a 'proxy gene' for each ancestral gene, defined as the least-diverged descendant of the ancestral gene in a given extant genome. On the Ancestral Genomes website, users can browse the Ancestral Genomes by selecting nodes in a species tree, and can compare an extant genome with any of its reconstructed ancestors to understand how the genome evolved.
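
The 'proxy gene' idea can be illustrated with a short sketch. It assumes that "least-diverged" is measured by summed branch length from the ancestral node to each descendant leaf, which the abstract does not specify; the tree, gene names, and branch lengths are hypothetical.

```python
def proxy_gene(node, children, species_of, target_species):
    """Return (distance, leaf) for the closest descendant leaf in target_species.

    children maps a node to a list of (child, branch_length) pairs;
    species_of maps each leaf to its genome. Returns None if no leaf matches.
    """
    best = None
    stack = [(node, 0.0)]
    while stack:
        current, dist = stack.pop()
        kids = children.get(current, [])
        if not kids and species_of.get(current) == target_species:
            if best is None or dist < best[0]:
                best = (dist, current)
        for child, length in kids:
            stack.append((child, dist + length))
    return best


# Hypothetical ancestral gene with a duplication leading to two human copies.
children = {
    "anc": [("dupA", 0.1), ("dupB", 0.4)],
    "dupA": [("HUMAN_g1", 0.2), ("MOUSE_g1", 0.3)],
    "dupB": [("HUMAN_g2", 0.5)],
}
species_of = {"HUMAN_g1": "human", "MOUSE_g1": "mouse", "HUMAN_g2": "human"}

result = proxy_gene("anc", children, species_of, "human")
print(result[1])  # HUMAN_g1
```

Here `HUMAN_g1` is chosen over `HUMAN_g2` because its path from the ancestral node is shorter, so it serves as the stand-in for the ancestral gene in the human genome.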

Source
DOI: http://dx.doi.org/10.1093/nar/gky1009
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323951
January 2019

TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations.

Bioinformatics 2019 02;35(3):518-520

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA.

Summary: TreeGrafter is a new software tool for annotating protein sequences using pre-annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology (GO) terms, and PANTHER family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequence, finds the best matching homologous family in a library of pre-calculated, pre-annotated gene trees, and then grafts it to the best location in the tree. It then annotates the sequence by propagating annotations from ancestral nodes in the reference tree. We show that TreeGrafter outperforms subfamily HMM scoring for correctly assigning subfamily membership, and that it produces highly specific annotations of GO terms based on annotated reference phylogenetic trees. This method will be further integrated into InterProScan, enabling an even broader user community.

Availability And Implementation: TreeGrafter is freely available on the web at https://github.com/pantherdb/TreeGrafter, including as a Docker image.

Supplementary Information: Supplementary data are available at Bioinformatics online.
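
The propagation step described in the Summary can be sketched as a walk from the graft point to the root. This is a deliberate simplification and not TreeGrafter's code: node identifiers and the annotation table are hypothetical, and any handling of function loss along a lineage is ignored.

```python
def inherited_annotations(graft_node, parent, annotations):
    """Collect annotations from the graft point and all of its ancestors."""
    terms = set()
    node = graft_node
    while node is not None:
        terms |= annotations.get(node, set())
        node = parent.get(node)  # the root has no parent entry, so the loop ends
    return terms


# Hypothetical reference tree: AN0 is the root, AN3 sits below AN1.
parent = {"AN1": "AN0", "AN2": "AN0", "AN3": "AN1"}
annotations = {
    "AN0": {"GO:0005515"},  # illustrative GO identifiers
    "AN1": {"GO:0016301"},
    "AN2": {"GO:0003677"},
}

# A query sequence grafted at AN3 inherits from AN3, AN1 and AN0, but not AN2.
print(sorted(inherited_annotations("AN3", parent, annotations)))
```

Because only ancestors of the graft point contribute, annotations made on a sister lineage (`AN2` here) are never transferred to the query.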

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bty625
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6361231
February 2019

The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database.

Nucleic Acids Res 2018 01;46(D1):D624-D632

EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.

The MEROPS database (http://www.ebi.ac.uk/merops/) is an integrated source of information about peptidases, their substrates and inhibitors. The hierarchical classification is: protein-species, family, clan, with an identifier at each level. The MEROPS website moved to the EMBL-EBI in 2017, requiring refactoring of the code-base and services provided. The interface to sequence searching has changed and the MEROPS protein sequence libraries can be searched at the EMBL-EBI with HMMER, FastA and BLASTP. Cross-references have been established between MEROPS and the PANTHER database at both the family and protein-species level, which will help to improve curation and coverage between the resources. Because of the increasing size of the MEROPS sequence collection, in future only sequences of characterized proteins, and from completely sequenced genomes of organisms of evolutionary, medical or commercial significance will be added. As an example, peptidase homologues in four proteomes from the Asgard superphylum of Archaea have been identified and compared to other archaean, bacterial and eukaryote proteomes. This has given insights into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity. Novel structures for proteasome complexes in archaea are postulated.

Source
DOI: http://dx.doi.org/10.1093/nar/gkx1134
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753285
January 2018

Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

Database (Oxford) 2016 Dec 26;2016. Epub 2016 Dec 26.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA.

We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo.

Source
http://dx.doi.org/10.1093/database/baw155
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5199145
November 2017

InterPro in 2017-beyond protein family and domain annotations.

Nucleic Acids Res 2017 01 29;45(D1):D190-D199. Epub 2016 Nov 29.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.

Source
http://dx.doi.org/10.1093/nar/gkw1107
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210578
January 2017

PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.

Nucleic Acids Res 2017 01 29;45(D1):D183-D189. Epub 2016 Nov 29.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90033, USA

The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution.

Source
http://dx.doi.org/10.1093/nar/gkw1138
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210595
January 2017

The Gene Ontology and the Meaning of Biological Function.

Authors:
Paul D Thomas

Methods Mol Biol 2017 ;1446:15-24

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, NRT 2502, 1450 Biggy Street, Los Angeles, CA, 90033, USA.

The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene products from all organisms. It is specifically designed for supporting the computational representation of biological systems. A GO annotation is an association between a specific gene product and a GO concept, together making a statement pertinent to the function of that gene. However, the meaning of the term "function" is not as straightforward as it might seem, and has been discussed at length in both philosophical and biological circles. Here, I first review these discussions. I then present an explicit formulation of the biological model that underlies the GO and annotations, and discuss how this model relates to the broader debates on the meaning of biological function.

Source
http://dx.doi.org/10.1007/978-1-4939-3743-1_2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6438694
December 2017

Tools for Predicting the Functional Impact of Nonsynonymous Genetic Variation.

Genetics 2016 06;203(2):635-47

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033

As personal genome sequencing becomes a reality, understanding the effects of genetic variants on phenotype, particularly the impact of germline variants on disease risk and the impact of somatic variants on cancer development and treatment, continues to increase in importance. Because of their clear potential for affecting phenotype, nonsynonymous genetic variants (variants that cause a change in the amino acid sequence of a protein encoded by a gene) have long been the target of efforts to predict the effects of genetic variation. Whole-genome sequencing is identifying large numbers of nonsynonymous variants in each genome, intensifying the need for computational methods that accurately predict which of these are likely to impact disease phenotypes. This review focuses on nonsynonymous variant prediction with two aims in mind: (1) to review the prioritization methods that have been developed to date and the principles on which they are based and (2) to discuss the challenges to further improving these methods.

Source
http://dx.doi.org/10.1534/genetics.116.190033
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4896183
June 2016

PANTHER-PSEP: predicting disease-causing genetic variants using position-specific evolutionary preservation.

Bioinformatics 2016 07 18;32(14):2230-2. Epub 2016 May 18.

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA.

PANTHER-PSEP is a new software tool for predicting non-synonymous genetic variants that may play a causal role in human disease. Several previous variant pathogenicity prediction methods have been proposed that quantify evolutionary conservation among homologous proteins from different organisms. PANTHER-PSEP employs a related but distinct metric based on 'evolutionary preservation': homologous proteins are used to reconstruct the likely sequences of ancestral proteins at nodes in a phylogenetic tree, and the history of each amino acid can be traced back in time from its current state to estimate how long that state has been preserved in its ancestors. Here, we describe the PSEP tool, and assess its performance on standard benchmarks for distinguishing disease-associated from neutral variation in humans. On these benchmarks, PSEP outperforms not only previous tools that utilize evolutionary conservation, but also several highly used tools that include multiple other sources of information as well. For predicting pathogenic human variants, the trace back of course starts with a human 'reference' protein sequence, but the PSEP tool can also be applied to predicting deleterious or pathogenic variants in reference proteins from any of the ∼100 other species in the PANTHER database.

Availability And Implementation: PANTHER-PSEP is freely available on the web at http://pantherdb.org/tools/csnpScoreForm.jsp. Users can also download the command-line based tool at ftp://ftp.pantherdb.org/cSNP_analysis/PSEP/.

Contact: pdthomas@usc.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.
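
The 'evolutionary preservation' idea in this abstract can be illustrated with a deliberately simplified sketch: starting from the current amino acid at one position, walk back through the reconstructed ancestors and add up branch lengths until the state first differs. The sketch assumes a single most likely reconstructed state per ancestor and branch lengths in millions of years; the function name `preservation_time` and the example lineage are hypothetical, and this is not the PANTHER-PSEP implementation.

```python
from typing import List, Tuple


def preservation_time(current_state: str, ancestors: List[Tuple[str, float]]) -> float:
    """Estimate how long the current amino acid has been preserved.

    `ancestors` lists (reconstructed amino acid, branch length) pairs,
    ordered from the most recent ancestor back toward the root. Branch
    lengths are summed until the first ancestor whose state differs.
    """
    total = 0.0
    for state, branch_length in ancestors:
        if state != current_state:
            break  # the state changed on this branch, so stop tracing back
        total += branch_length
    return total


# Hypothetical lineage for one position: leucine (L) today, valine (V) further back.
lineage = [("L", 10.0), ("L", 30.0), ("V", 50.0), ("L", 100.0)]
print(preservation_time("L", lineage))  # 40.0
```

Under this toy model, a position preserved for a long time would be treated as more likely to be sensitive to substitution than one that changed recently.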

Source
http://dx.doi.org/10.1093/bioinformatics/btw222
July 2016

Standardized benchmarking in the quest for orthologs.

Nat Methods 2016 05 4;13(5):425-30. Epub 2016 Apr 4.

Department of Genetics, Evolution, and Environment, University College London, London, UK.

Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.

Source
http://dx.doi.org/10.1038/nmeth.3830
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4827703
May 2016

Kat3 coactivators in somatic stem cells and cancer stem cells: biological roles, evolution, and pharmacologic manipulation.

Cell Biol Toxicol 2016 Feb 23;32(1):61-81. Epub 2016 Mar 23.

USC Center for Molecular Pathways and Drug Discovery, USC Norris Comprehensive Cancer Center, 1450 Biggy Street, NRT 4501, Los Angeles, CA, 90033, USA.

Long-lived somatic stem cells regenerate adult tissues throughout our lifetime. However, with aging, there is a significant deterioration in the function of stem and progenitor cells, which contribute to diseases of aging. The decision for a long-lived somatic stem cell to become activated and subsequently to undergo either a symmetric or an asymmetric division is a critical cellular decision process. The decision to preferentially divide symmetrically or asymmetrically may be the major fundamental intrinsic difference between normal somatic stem cells and cancer stem cells. Based upon work done primarily in our laboratory over the past 15 years, this article provides a perspective on the critical role of somatic stem cells in aging. In particular, we discuss the importance of symmetric versus asymmetric divisions in somatic stem cells and the role of the differential usage of the highly similar Kat3 coactivators, CREB-binding protein (CBP) and p300, in stem cells. We describe and propose a more complete model for the biological mechanism and roles of these two coactivators, their evolution, and unique roles and importance in stem cell biology. Finally, we discuss the potential to pharmacologically manipulate Kat3 coactivator interactions in endogenous stem cells (both normal and cancer stem cells) to potentially ameliorate the aging process and common diseases of aging.

Source
http://dx.doi.org/10.1007/s10565-016-9318-0
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7458431
February 2016

PANTHER version 10: expanded protein families and functions, and analysis tools.

Nucleic Acids Res 2016 Jan 17;44(D1):D336-42. Epub 2015 Nov 17.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90089, USA

PANTHER (Protein Analysis THrough Evolutionary Relationships, http://pantherdb.org) is a widely used online resource for comprehensive protein evolutionary and functional classification, and includes tools for large-scale biological data analysis. Recent development has been focused in three main areas: genome coverage, functional information ('annotation') coverage and accuracy, and improved genomic data analysis tools. The latest version of PANTHER, 10.0, includes almost 5000 new protein families (for a total of over 12 000 families), each with a reference phylogenetic tree including protein-coding genes from 104 fully sequenced genomes spanning all kingdoms of life. Phylogenetic trees now include inference of horizontal transfer events in addition to speciation and gene duplication events. Functional annotations are regularly updated using the models generated by the Gene Ontology Phylogenetic Annotation Project. For the data analysis tools, PANTHER has expanded the number of different 'functional annotation sets' available for functional enrichment testing, allowing analyses to access all Gene Ontology annotations--updated monthly from the Gene Ontology database--in addition to the annotations that have been inferred through evolutionary relationships. The Prowler (data browser) has been updated to enable users to more efficiently browse the entire database, and to create custom gene lists using the multiple axes of classification in PANTHER.

Source
http://dx.doi.org/10.1093/nar/gkv1194
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702852
January 2016

The InterPro protein families database: the classification resource after 15 years.

Nucleic Acids Res 2015 Jan 26;43(Database issue):D213-21. Epub 2014 Nov 26.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36,766 member database signatures integrated into 26,238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.

Source
http://dx.doi.org/10.1093/nar/gku1243
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383996
January 2015

Big data and other challenges in the quest for orthologs.

Bioinformatics 2014 Nov 26;30(21):2993-8. Epub 2014 Jul 26.

Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK.

Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking.

Availability And Implementation: All such materials are available at http://questfororthologs.org.
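
To make the quadratic-scaling remark in this abstract concrete, the short sketch below counts the proteome pairs an all-against-all comparison would have to process. It assumes that pairwise comparison dominates the cost, which is an illustrative simplification rather than a description of any particular orthology method; the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n_proteomes: int) -> int:
    """Number of unordered proteome pairs: n * (n - 1) / 2."""
    return n_proteomes * (n_proteomes - 1) // 2


for n in (100, 200, 400):
    print(n, pairwise_comparisons(n))
# 100 4950
# 200 19900
# 400 79800
```

Each doubling of the number of proteomes roughly quadruples the number of pairs, which is why growth in complete proteomes strains methods built on all-against-all comparison.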

Source
http://dx.doi.org/10.1093/bioinformatics/btu492
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201156
November 2014

PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools.

Nucleic Acids Res 2014 Jan 26;42(Database issue):D677-84. Epub 2013 Nov 26.

Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77843, USA, Department of Genetics, Stanford University, Stanford, CA 94305, USA, Department of Biology, Texas A&M University, College Station, TX, 77843, USA, Artificial Intelligence Center, SRI International, Menlo Park, CA 94025, USA and Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA.

PortEco (http://porteco.org) aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a 'virtual' model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.

Source
http://dx.doi.org/10.1093/nar/gkt1203
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965092
January 2014