Publications by authors named "Huaiyu Mi"

42 Publications

Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees.

PLoS Comput Biol 2021 Feb 18;17(2):e1007948. Epub 2021 Feb 18.

Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, California, United States of America.

Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1007948
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924801
February 2021
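
The abstract above mentions Markov Chain Monte Carlo for parameter estimation. As a minimal sketch of that general idea only, and not the authors' phylogenetic model, the following toy example uses a random-walk Metropolis sampler to estimate a single probability of function change from hypothetical counts. The function names, the flat prior, and the example counts are all assumptions made for illustration.

```python
import math
import random


def log_posterior(p, changes, trials):
    """Binomial log-likelihood with a flat prior on (0, 1); constants dropped."""
    return changes * math.log(p) + (trials - changes) * math.log(1.0 - p)


def metropolis_posterior(changes, trials, n_iter=20000, burn_in=2000,
                         step=0.1, seed=0):
    """Draw posterior samples of a change probability (toy, hypothetical setup)."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, changes, trials)
    samples = []
    for i in range(n_iter):
        proposal = p + rng.gauss(0.0, step)
        # Proposals outside (0, 1) have zero posterior density and are rejected.
        if 0.0 < proposal < 1.0:
            candidate = log_posterior(proposal, changes, trials)
            # Accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, candidate - current)):
                p, current = proposal, candidate
        if i >= burn_in:
            samples.append(p)
    return samples


if __name__ == "__main__":
    # Hypothetical data: 30 function changes observed in 100 duplication events.
    draws = metropolis_posterior(changes=30, trials=100)
    print(sum(draws) / len(draws))
```

With a flat prior the exact posterior here is Beta(31, 71), so the sample mean should land near 31/102; the real method estimates several parameters jointly across many trees.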

PhyloGenes: An online phylogenetics and functional genomics resource for plant gene function inference.

Plant Direct 2020 Dec 30;4(12):e00293. Epub 2020 Dec 30.

Phoenix Bioinformatics Fremont CA USA.

We aim to enable the accurate and efficient transfer of knowledge about gene function gained from Arabidopsis thaliana and other model organisms to other plant species. This knowledge transfer is frequently challenging in plants due to duplications of individual genes and whole genomes in plant lineages. Such duplications result in complex evolutionary relationships between related genes, which may have similar sequences but highly divergent functions. In such cases, functional inference requires more than a simple sequence similarity calculation. We have developed an online resource, PhyloGenes (phylogenes.org), that displays precomputed phylogenetic trees for plant gene families along with experimentally validated function information for individual genes within the families. A total of 40 plant genomes and 10 non-plant model organisms are represented in over 8,000 gene families. Evolutionary events such as speciation and duplication are clearly labeled on gene trees to distinguish orthologs from paralogs. Nearly 6,000 families have at least one member with an experimentally supported annotation to a Gene Ontology (GO) molecular function or biological process term. By displaying experimentally validated gene functions associated with individual genes within a tree, PhyloGenes enables functional inference for genes of uncharacterized function, based on their evolutionary relationships to experimentally studied genes, in a visually traceable manner. For the many families containing genes that have evolved to perform different functions, PhyloGenes facilitates the use of evolutionary history to determine the most likely function of genes that have not been experimentally characterized. Future work will enrich the resource by incorporating additional gene function datasets such as plant gene expression atlas data.

Source
DOI: http://dx.doi.org/10.1002/pld3.293
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7773024
December 2020

PEREGRINE: A genome-wide prediction of enhancer to gene relationships supported by experimental evidence.

PLoS One 2020 Dec 15;15(12):e0243791. Epub 2020 Dec 15.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, United States of America.

Enhancers are powerful and versatile agents of cell-type specific gene regulation, which are thought to play key roles in human disease. Enhancers are short DNA elements that function primarily as clusters of transcription factor binding sites that are spatially coordinated to regulate expression of one or more specific target genes. These regulatory connections between enhancers and target genes can therefore be characterized as enhancer-gene links that can affect development, disease, and homeostatic cellular processes. Despite their implication in disease and the establishment of cell identity during development, most enhancer-gene links remain unknown. Here we introduce a new, publicly accessible database of predicted enhancer-gene links, PEREGRINE. The PEREGRINE human enhancer-gene links interactive web interface incorporates publicly available experimental data from ChIA-PET, eQTL, and Hi-C assays across 78 cell and tissue types to link 449,627 enhancers to 17,643 protein-coding genes. These enhancer-gene links are made available through the new Enhancer module of the PANTHER database and website where the user may easily access the evidence for each enhancer-gene link, as well as query by target gene and enhancer location.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243791
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7737992
January 2021

PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API.

Nucleic Acids Res 2021 01;49(D1):D394-D403

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.

PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa1106
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778891
January 2021

The InterPro protein families and domains database: 20 years on.

Nucleic Acids Res 2021 01;49(D1):D344-D354

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Source
DOI: http://dx.doi.org/10.1093/nar/gkaa977
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778928
January 2021

Systems Biology Graphical Notation: Process Description language Level 1 Version 2.0.

J Integr Bioinform 2019 Jun 13;16(2). Epub 2019 Jun 13.

cBio Center, Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA.

The Systems Biology Graphical Notation (SBGN) is an international community effort that aims to standardise the visualisation of pathways and networks for readers with diverse scientific backgrounds as well as to support an efficient and accurate exchange of biological knowledge between disparate research communities, industry, and other players in systems biology. SBGN comprises the three languages Entity Relationship, Activity Flow, and Process Description (PD) to cover biological and biochemical systems at distinct levels of detail. PD is closest to metabolic and regulatory pathways found in biological literature and textbooks. Its well-defined semantics offer a superior precision in expressing biological knowledge. PD represents mechanistic and temporal dependencies of biological interactions and transformations as a graph. Its different types of nodes include entity pools (e.g. metabolites, proteins, genes and complexes) and processes (e.g. reactions, associations and influences). The edges describe relationships between the nodes (e.g. consumption, production, stimulation and inhibition). This document details Level 1 Version 2.0 of the PD specification, including several improvements, in particular: 1) the addition of the equivalence operator, subunit, and annotation glyphs, 2) modification to the usage of submaps, and 3) updates to clarify the use of various glyphs (i.e. multimer, empty set, and state variable).

Source
DOI: http://dx.doi.org/10.1515/jib-2019-0022
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6798820
June 2019
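
The graph structure described in the abstract above (entity pools and processes as nodes, typed edges between them) can be sketched as a small data structure. This is a simplified illustration and not an implementation of the SBGN specification: the class name `PDMap`, its methods, and the example reaction are hypothetical, and only the four edge types named in the abstract are modelled.

```python
from dataclasses import dataclass, field

ENTITY, PROCESS = "entity pool", "process"

# Arc type -> (required source node kind, required target node kind).
ARC_RULES = {
    "consumption": (ENTITY, PROCESS),
    "production": (PROCESS, ENTITY),
    "stimulation": (ENTITY, PROCESS),
    "inhibition": (ENTITY, PROCESS),
}


@dataclass
class PDMap:
    """Toy Process Description map (hypothetical, heavily simplified)."""
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    arcs: list = field(default_factory=list)   # (arc type, source, target)

    def add_node(self, node_id, kind):
        if kind not in (ENTITY, PROCESS):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_arc(self, arc_type, source, target):
        src_kind, tgt_kind = ARC_RULES[arc_type]
        # Reject arcs whose endpoints have the wrong node kind.
        if self.nodes[source] != src_kind or self.nodes[target] != tgt_kind:
            raise ValueError(f"{arc_type} must go from {src_kind} to {tgt_kind}")
        self.arcs.append((arc_type, source, target))

    def participants(self, process_id, arc_type):
        """Entity pools attached to a process by the given arc type."""
        found = []
        for kind, source, target in self.arcs:
            if kind == arc_type and process_id in (source, target):
                found.append(target if source == process_id else source)
        return found


if __name__ == "__main__":
    pd_map = PDMap()
    for name in ("glucose", "ATP", "glucose-6-phosphate", "ADP", "hexokinase"):
        pd_map.add_node(name, ENTITY)
    pd_map.add_node("phosphorylation", PROCESS)
    pd_map.add_arc("consumption", "glucose", "phosphorylation")
    pd_map.add_arc("consumption", "ATP", "phosphorylation")
    pd_map.add_arc("production", "phosphorylation", "glucose-6-phosphate")
    pd_map.add_arc("production", "phosphorylation", "ADP")
    pd_map.add_arc("stimulation", "hexokinase", "phosphorylation")
    print(pd_map.participants("phosphorylation", "consumption"))
```

Keeping the direction rules in one table makes the constraint explicit: consumption, stimulation and inhibition point from an entity pool into a process, while production points out of it.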

SynGO: An Evidence-Based, Expert-Curated Knowledge Base for the Synapse.

Neuron 2019 07 3;103(2):217-234.e4. Epub 2019 Jun 3.

Department of Functional Genomics, CNCR, VU University and UMC Amsterdam, 1081 HV Amsterdam, the Netherlands.

Synapses are fundamental information-processing units of the brain, and synaptic dysregulation is central to many brain disorders ("synaptopathies"). However, systematic annotation of synaptic genes and ontology of synaptic processes are currently lacking. We established SynGO, an interactive knowledge base that accumulates available research about synapse biology using Gene Ontology (GO) annotations to novel ontology terms: 87 synaptic locations and 179 synaptic processes. SynGO annotations are exclusively based on published, expert-curated evidence. Using 2,922 annotations for 1,112 genes, we show that synaptic genes are exceptionally well conserved and less tolerant to mutations than other genes. Many SynGO terms are significantly overrepresented among gene variations associated with intelligence, educational attainment, ADHD, autism, and bipolar disorder and among de novo variants associated with neurodevelopmental disorders, including schizophrenia. SynGO is a public, universal reference for synapse research and an online analysis platform for interpretation of large-scale -omics data (https://syngoportal.org and http://geneontology.org).

Source
DOI: http://dx.doi.org/10.1016/j.neuron.2019.05.002
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6764089
July 2019

Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0).

Nat Protoc 2019 03 25;14(3):703-721. Epub 2019 Feb 25.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

The PANTHER classification system (http://www.pantherdb.org) is a comprehensive system that combines genomes, gene function classifications, pathways and statistical analysis tools to enable biologists to analyze large-scale genome-wide experimental data. The current system (PANTHER v.14.0) covers 131 complete genomes organized into gene families and subfamilies; evolutionary relationships between genes are represented in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models (HMMs)). The families and subfamilies are annotated with Gene Ontology (GO) terms, and sequences are assigned to PANTHER pathways. A suite of tools has been built to allow users to browse and query gene functions and analyze large-scale experimental data with a number of statistical tests. PANTHER is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. Since the protocol for using this tool (v.8.0) was originally published in 2013, there have been substantial improvements and updates in the areas of data quality, data coverage, statistical algorithms and user experience. This Protocol Update provides detailed instructions on how to analyze genome-wide experimental data in the PANTHER classification system.

Source
DOI: http://dx.doi.org/10.1038/s41596-019-0128-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6519457
March 2019

PANTHER version 14: more genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools.

Nucleic Acids Res 2019 01;47(D1):D419-D426

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90033, USA.

PANTHER (Protein Analysis Through Evolutionary Relationships, http://pantherdb.org) is a resource for the evolutionary and functional classification of genes from organisms across the tree of life. We report the improvements we have made to the resource during the past two years. For evolutionary classifications, we have added more prokaryotic and plant genomes to the phylogenetic gene trees, expanding the representation of gene evolution in these lineages. We have refined many protein family boundaries, and have aligned PANTHER with the MEROPS resource for protease and protease inhibitor families. For functional classifications, we have developed an entirely new PANTHER GO-slim, containing over four times as many Gene Ontology terms as our previous GO-slim, as well as curated associations of genes to these terms. Lastly, we have made substantial improvements to the enrichment analysis tools available on the PANTHER website: users can now analyze over 900 different genomes, using updated statistical tests with false discovery rate corrections for multiple testing. The overrepresentation test is also available as a web service, for easy addition to third-party sites.

Source
DOI: http://dx.doi.org/10.1093/nar/gky1038
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323939
January 2019
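
The abstract above refers to overrepresentation testing with false discovery rate corrections. As a rough sketch of how such a test can work in general, the example below computes a one-sided hypergeometric p-value per term and applies the Benjamini-Hochberg adjustment. It is not PANTHER's implementation or API; the function names, the input layout (a dict mapping each term to a set of gene identifiers), and the example data are assumptions made for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def overrepresentation(gene_list, annotations, background_size):
    """Map each term to (raw p-value, adjusted p-value) for the gene list."""
    genes = set(gene_list)
    terms = sorted(annotations)
    pvalues = [
        hypergeom_upper_tail(len(genes & annotations[t]), len(annotations[t]),
                             len(genes), background_size)
        for t in terms
    ]
    adjusted = benjamini_hochberg(pvalues)
    return {t: (p, q) for t, p, q in zip(terms, pvalues, adjusted)}


if __name__ == "__main__":
    # Hypothetical background of 20 genes and two made-up terms.
    annotations = {
        "term_A": {"g1", "g2", "g3", "g4", "g5"},
        "term_B": {f"g{i}" for i in range(10, 20)},
    }
    for term, (p, q) in overrepresentation(["g1", "g2", "g3", "g4"],
                                           annotations, 20).items():
        print(term, p, q)
```

The background size should match the reference gene set actually used, since the p-values depend directly on it.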

InterPro in 2019: improving coverage, classification and access to protein sequence annotations.

Nucleic Acids Res 2019 01;47(D1):D351-D360

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.

Source
DOI: http://dx.doi.org/10.1093/nar/gky1100
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323941
January 2019

Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

Database (Oxford) 2016 Dec 26;2016. Epub 2016 Dec 26.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA.

We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo.

Source
DOI: http://dx.doi.org/10.1093/database/baw155
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5199145
November 2017

InterPro in 2017-beyond protein family and domain annotations.

Nucleic Acids Res 2017 01 29;45(D1):D190-D199. Epub 2016 Nov 29.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw1107
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210578
January 2017

PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.

Nucleic Acids Res 2017 01 29;45(D1):D183-D189. Epub 2016 Nov 29.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90033, USA.

The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution.

Source
DOI: http://dx.doi.org/10.1093/nar/gkw1138
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210595
January 2017

PANTHER version 10: expanded protein families and functions, and analysis tools.

Nucleic Acids Res 2016 Jan 17;44(D1):D336-42. Epub 2015 Nov 17.

Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA 90089, USA.

PANTHER (Protein Analysis THrough Evolutionary Relationships, http://pantherdb.org) is a widely used online resource for comprehensive protein evolutionary and functional classification, and includes tools for large-scale biological data analysis. Recent development has been focused in three main areas: genome coverage, functional information ('annotation') coverage and accuracy, and improved genomic data analysis tools. The latest version of PANTHER, 10.0, includes almost 5000 new protein families (for a total of over 12 000 families), each with a reference phylogenetic tree including protein-coding genes from 104 fully sequenced genomes spanning all kingdoms of life. Phylogenetic trees now include inference of horizontal transfer events in addition to speciation and gene duplication events. Functional annotations are regularly updated using the models generated by the Gene Ontology Phylogenetic Annotation Project. For the data analysis tools, PANTHER has expanded the number of different 'functional annotation sets' available for functional enrichment testing, allowing analyses to access all Gene Ontology annotations--updated monthly from the Gene Ontology database--in addition to the annotations that have been inferred through evolutionary relationships. The Prowler (data browser) has been updated to enable users to more efficiently browse the entire database, and to create custom gene lists using the multiple axes of classification in PANTHER.

Source
DOI: http://dx.doi.org/10.1093/nar/gkv1194
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702852
January 2016

Systems Biology Graphical Notation: Activity Flow language Level 1 Version 1.2.

J Integr Bioinform 2015 Sep 4;12(2):265. Epub 2015 Sep 4.

The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Activity Flow language represents the influences of activities among various entities within a network. Unlike SBGN PD and ER that focus on the entities and their relationships with others, SBGN AF puts the emphasis on the functions (or activities) performed by the entities, and their effects on the functions of the same or other entities. The nodes (elements) describe the biological activities of the entities, such as protein kinase activity, binding activity or receptor activity, which can be easily mapped to Gene Ontology molecular function terms. The edges (connections) provide descriptions of relationships (or influences) between the activities, e.g., positive influence and negative influence. Among all three languages of SBGN, AF is the closest to signaling pathways in biological literature and textbooks, but its well-defined semantics offer a superior precision in expressing biological knowledge.

Source
DOI: http://dx.doi.org/10.2390/biecoll-jib-2015-265
September 2015

Systems Biology Graphical Notation: Entity Relationship language Level 1 Version 2.

J Integr Bioinform 2015 Sep 4;12(2):264. Epub 2015 Sep 4.

The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Entity Relationship language (ER) represents biological entities and their interactions and relationships within a network. SBGN ER focuses on all potential relationships between entities without considering temporal aspects. The nodes (elements) describe biological entities, such as proteins and complexes. The edges (connections) provide descriptions of interactions and relationships (or influences), e.g., complex formation, stimulation and inhibition. Among all three languages of SBGN, ER is the closest to protein interaction networks in biological literature and textbooks, but its well-defined semantics offer a superior precision in expressing biological knowledge.

Source
DOI: http://dx.doi.org/10.2390/biecoll-jib-2015-264
September 2015

Systems Biology Graphical Notation: Process Description language Level 1 Version 1.3.

J Integr Bioinform 2015 Sep 4;12(2):263. Epub 2015 Sep 4.

The Systems Biology Graphical Notation (SBGN) is an international community effort for standardized graphical representations of biological pathways and networks. The goal of SBGN is to provide unambiguous pathway and network maps for readers with different scientific backgrounds as well as to support efficient and accurate exchange of biological knowledge between different research communities, industry, and other players in systems biology. Three SBGN languages, Process Description (PD), Entity Relationship (ER) and Activity Flow (AF), allow for the representation of different aspects of biological and biochemical systems at different levels of detail. The SBGN Process Description language represents biological entities and processes between these entities within a network. SBGN PD focuses on the mechanistic description and temporal dependencies of biological interactions and transformations. The nodes (elements) are split into entity nodes describing, e.g., metabolites, proteins, genes and complexes, and process nodes describing, e.g., reactions and associations. The edges (connections) provide descriptions of relationships (or influences) between the nodes, such as consumption, production, stimulation and inhibition. Among all three languages of SBGN, PD is the closest to metabolic and regulatory pathways in biological literature and textbooks, but its well-defined semantics offer a superior precision in expressing biological knowledge.

Source
DOI: http://dx.doi.org/10.2390/biecoll-jib-2015-263
September 2015

The InterPro protein families database: the classification resource after 15 years.

Nucleic Acids Res 2015 Jan 26;43(Database issue):D213-21. Epub 2014 Nov 26.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36,766 member database signatures integrated into 26,238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.

Source
DOI: http://dx.doi.org/10.1093/nar/gku1243
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383996
January 2015

PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools.

Nucleic Acids Res 2014 Jan 26;42(Database issue):D677-84. Epub 2013 Nov 26.

Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77843, USA, Department of Genetics, Stanford University, Stanford, CA 94305, USA, Department of Biology, Texas A&M University, College Station, TX, 77843, USA, Artificial Intelligence Center, SRI International, Menlo Park, CA 94025, USA and Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA.

PortEco (http://porteco.org) aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a 'virtual' model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.

Source
DOI: http://dx.doi.org/10.1093/nar/gkt1203
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965092
January 2014

Large-scale gene function analysis with the PANTHER classification system.

Nat Protoc 2013 Aug 18;8(8):1551-66. Epub 2013 Jul 18.

Department of Preventive Medicine, Division of Bioinformatics, Keck School of Medicine, University of Southern California, Los Angeles, California, USA.

The PANTHER (protein annotation through evolutionary relationship) classification system (http://www.pantherdb.org/) is a comprehensive system that combines gene function, ontology, pathways and statistical analysis tools that enable biologists to analyze large-scale, genome-wide data from sequencing, proteomics or gene expression experiments. The system is built with 82 complete genomes organized into gene families and subfamilies, and their evolutionary relationships are captured in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models or HMMs). Genes are classified according to their function in several different ways: families and subfamilies are annotated with ontology terms (Gene Ontology (GO) and PANTHER protein class), and sequences are assigned to PANTHER pathways. The PANTHER website includes a suite of tools that enable users to browse and query gene functions, and to analyze large-scale experimental data with a number of statistical tests. It is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. In the 2013 release of PANTHER (v.8.0), in addition to an update of the data content, we redesigned the website interface to improve both user experience and the system's analytical capability. This protocol provides a detailed description of how to analyze genome-wide experimental data with the PANTHER classification system.

Source
DOI: http://dx.doi.org/10.1038/nprot.2013.092
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6519453
August 2013

PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees.

Nucleic Acids Res 2013 Jan 27;41(Database issue):D377-86. Epub 2012 Nov 27.

Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA.

The data and tools in PANTHER, a comprehensive, curated database of protein families, trees, subfamilies and functions available at http://pantherdb.org, have undergone continual, extensive improvement for over a decade. Here, we describe the current PANTHER process as a whole, as well as the website tools for analysis of user-uploaded data. The main goals of PANTHER remain essentially unchanged: the accurate inference (and practical application) of gene and protein function over large sequence databases, using phylogenetic trees to extrapolate from the relatively sparse experimental information from a few model organisms. Yet the focus of PANTHER has continually shifted toward more accurate and detailed representations of evolutionary events in gene family histories. The trees are now designed to represent gene family evolution, including inference of evolutionary events, such as speciation and gene duplication. Subfamilies are still curated and used to define HMMs, but gene ontology functional annotations can now be made at any node in the tree, and are designed to represent gain and loss of function by ancestral genes during evolution. Finally, PANTHER now includes stable database identifiers for inferred ancestral genes, which are used to associate inferred gene attributes with particular genes in the common ancestral genomes of extant species.

Source
DOI: http://dx.doi.org/10.1093/nar/gks1118
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531194
January 2013

Software support for SBGN maps: SBGN-ML and LibSBGN.

Bioinformatics 2012 Aug 10;28(15):2016-21. Epub 2012 May 10.

EMBL European Bioinformatics Institute, Hinxton, UK.

Motivation: LibSBGN is a software library for reading, writing and manipulating Systems Biology Graphical Notation (SBGN) maps stored using the recently developed SBGN-ML file format. The library (available in C++ and Java) makes it easy for developers to add SBGN support to their tools, whereas the file format facilitates the exchange of maps between compatible software applications. The library also supports validation of maps, which simplifies the task of ensuring compliance with the detailed SBGN specifications. With this effort we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

Availability And Implementation: Milestone 2 was released in December 2011. Source code, example files and binaries are freely available under the terms of either the LGPL v2.1+ or Apache v2.0 open source licenses from http://libsbgn.sourceforge.net.

Contact: sbgn-libsbgn@lists.sourceforge.net.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bts270
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3400951
August 2012

InterPro in 2011: new developments in the family and domain prediction database.

Nucleic Acids Res 2012 Jan 16;40(Database issue):D306-12. Epub 2011 Nov 16.

EMBL Outstation European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD Cambridge, UK.

InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.

Source
DOI: http://dx.doi.org/10.1093/nar/gkr948
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245097
January 2012

BioPAX support in CellDesigner.

Bioinformatics 2011 Dec 21;27(24):3437-8. Epub 2011 Oct 21.

Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA.

Motivation: BioPAX is a standard language for representing and exchanging models of biological processes at the molecular and cellular levels. It is widely used by different pathway databases and genomics data analysis software. Currently, the primary source of BioPAX data is direct exports from the curated pathway databases. It is still uncommon for wet-lab biologists to share and exchange pathway knowledge using BioPAX. Instead, pathways are usually represented as informal diagrams in the literature. In order to encourage formal representation of pathways, we describe a software package that allows users to create pathway diagrams using CellDesigner, a user-friendly graphical pathway-editing tool, and save the pathway data in BioPAX Level 3 format.

Availability: The plug-in is freely available and can be downloaded at ftp://ftp.pantherdb.org/CellDesigner/plugins/BioPAX/

Contact: huaiyumi@usc.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btr586
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232372
December 2011

Ontologies and standards in bioscience research: for machine or for human.

Front Physiol 2011 21;2. Epub 2011 Feb 21.

SRI International, Menlo Park CA, USA.

Ontologies and standards are very important parts of today's bioscience research. With the rapid increase of biological knowledge, they provide mechanisms to better store and represent data in a controlled and structured way, so that scientists can share the data, and utilize a wide variety of software and tools to manage and analyze the data. Most of these standards are initially designed for computers to access large amounts of data that are difficult for human biologists to handle, and it is important to keep in mind that ultimately biologists are going to produce and interpret the data. While ontologies and standards must follow strict semantic rules that may not be familiar to biologists, effort must be spent to lower the learning barrier by involving biologists in the process of development, and by providing software and tool support. A standard will not succeed without support from the wider bioscience research community. Thus, it is crucial that these standards be designed not only for machines to read, but also to be scientifically accurate and intuitive to human biologists.

Source
DOI: http://dx.doi.org/10.3389/fphys.2011.00005
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3081276
July 2011

The BioPAX community standard for pathway data sharing.

Nat Biotechnol 2010 Sep 9;28(9):935-42. Epub 2010 Sep 9.

Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, New York, USA.

Biological Pathway Exchange (BioPAX) is a standard language to represent biological pathways at the molecular and cellular level and to facilitate the exchange of pathway data. The rapid growth of the volume of pathway data has spurred the development of databases and computational tools to aid interpretation; however, use of these data is hampered by the current fragmentation of pathway information across many databases with incompatible formats. BioPAX, which was created through a community process, solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. Using BioPAX, millions of interactions, organized into thousands of pathways, from many organisms are available from a growing number of databases. This large amount of pathway data in a computable form will support visualization, analysis and biological discovery.

Source
DOI: http://dx.doi.org/10.1038/nbt.1666
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3001121
September 2010

PharmGKB summary: dopamine receptor D2.

Pharmacogenet Genomics 2011 Jun;21(6):350-6

Evolutionary Systems Biology, Artificial Intelligence Center, SRI International, Menlo Park, California, USA.

Source
DOI: http://dx.doi.org/10.1097/FPC.0b013e32833ee605
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3091980
June 2011

PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium.

Nucleic Acids Res 2010 Jan 16;38(Database issue):D204-10. Epub 2009 Dec 16.

Evolutionary Systems Biology Group, SRI International, Lawrence Berkeley National Laboratory, USA.

Protein Analysis THrough Evolutionary Relationships (PANTHER) is a comprehensive software system for inferring the functions of genes based on their evolutionary relationships. Phylogenetic trees of gene families form the basis for PANTHER and these trees are annotated with ontology terms describing the evolution of gene function from ancestral to modern day genes. One of the main applications of PANTHER is in accurate prediction of the functions of uncharacterized genes, based on their evolutionary relationships to genes with functions known from experiment. The PANTHER website, freely available at http://www.pantherdb.org, also includes software tools for analyzing genomic data relative to known and inferred gene functions. Since 2007, there have been several new developments to PANTHER: (i) improved phylogenetic trees, explicitly representing speciation and gene duplication events, (ii) identification of gene orthologs, including least diverged orthologs (best one-to-one pairs), (iii) coverage of more genomes (48 genomes, up to 87% of genes in each genome; see http://www.pantherdb.org/panther/summaryStats.jsp), (iv) improved support for alternative database identifiers for genes, proteins and microarray probes and (v) adoption of the SBGN standard for display of biological pathways. In addition, PANTHER trees are being annotated with gene function as part of the Gene Ontology Reference Genome project, resulting in an increasing number of curated functional annotations.

Source
DOI: http://dx.doi.org/10.1093/nar/gkp1019
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808919
January 2010