Publications by authors named "Vasileios Hatzivassiloglou"

7 Publications

  • Page 1 of 1

Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles.

J Biomed Inform 2007 Apr 7;40(2):150-9. Epub 2006 Jun 7.

Department of Health Sciences, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA.

Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefits from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, it would be important to map them to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a semi-supervised method that applies MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated from the MEDLINE abstracts a dictionary of abbreviation-full pairs based on a rule-based system that maps abbreviations to full forms when full forms are defined in the abstracts. We then trained on the MEDLINE abstracts and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in a semi-supervised fashion. We report up to 92% prediction precision and up to 91% coverage.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jbi.2006.06.001DOI Listing
April 2007

GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data.

J Biomed Inform 2004 Feb;37(1):43-53

Columbia Genome Center, Columbia University, New York, NY 10032, USA.

The immense growth in the volume of research literature and experimental data in the field of molecular biology calls for efficient automatic methods to capture and store information. In recent years, several groups have worked on specific problems in this area, such as automated selection of articles pertinent to molecular biology, or automated extraction of information using natural-language processing, information visualization, and generation of specialized knowledge bases for molecular biology. GeneWays is an integrated system that combines several such subtasks. It analyzes interactions between molecular substances, drawing on multiple sources of information to infer a consensus view of molecular networks. GeneWays is designed as an open platform, allowing researchers to query, review, and critique stored information.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jbi.2003.10.001DOI Listing
February 2004

Probabilistic inference of molecular networks from noisy data sources.

Bioinformatics 2004 May 10;20(8):1205-13. Epub 2004 Feb 10.

Department of Medical Informatics, Columbia University, New York, NY 10032, USA.

Information on molecular networks, such as networks of interacting proteins, comes from diverse sources that contain remarkable differences in distribution and quantity of errors. Here, we introduce a probabilistic model useful for predicting protein interactions from heterogeneous data sources. The model describes stochastic generation of protein-protein interaction networks with real-world properties, as well as generation of two heterogeneous sources of protein-interaction information: research results automatically extracted from the literature and yeast two-hybrid experiments. Based on the domain composition of proteins, we use the model to predict protein interactions for pairs of proteins for which no experimental data are available. We further explore the prediction limits, given experimental data that cover only part of the underlying protein networks. This approach can be extended naturally to include other types of biological data sources.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/bth061DOI Listing
May 2004

Automatically identifying gene/protein terms in MEDLINE abstracts.

J Biomed Inform 2002 Oct-Dec;35(5-6):322-30

Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA.

Motivation: Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As a part of marking up process, we also generated automatically a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore our methods may also be used for automatic lexicon generation.

Results: GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.

Availability: A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/. Contact. [email protected] Voice: 212-939-7028; fax: 212-666-0140.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/s1532-0464(03)00032-7DOI Listing
October 2003

Automatic extraction of gene and protein synonyms from MEDLINE and journal articles.

Proc AMIA Symp 2002 :919-23

Department of Medical Informatics, Columbia University, New York, NY 10032, USA.

Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefits from identifying these synonymous names. We have developed a method to extract automatically synonymous gene and protein names from MEDLINE and journal articles. We first identified patterns authors use to list synonymous gene and protein names. We developed SGPE (for synonym extraction of gene and protein names), a software program that recognizes the patterns and extracts from MEDLINE abstracts and full-text journal articles candidate synonymous terms. SGPE then applies a sequence of filters that automatically screen out those terms that are not gene and protein names. We evaluated our method to have an overall precision of 71% on both MEDLINE and journal articles, and 90% precision on the more suitable full-text articles alone
View Article and Find Full Text PDF

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2244511PMC
April 2003

Learning anchor verbs for biological interaction patterns from published text articles.

Int J Med Inform 2002 Dec;67(1-3):19-32

Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027 USA.

Much of knowledge modeling in the molecular biology domain involves interactions between proteins, genes, various forms of RNA, small molecules, etc. Interactions between these substances are typically extracted and codified manually, increasing the cost and time for modeling and substantially limiting the coverage of the resulting knowledge base. In this paper, we describe an automatic system that learns from text interaction verbs; these verbs can then form the core of automatically retrieved patterns which model classes of biological interactions. We investigate text features relating verbs with genes and proteins, and apply statistical tests and a logistic regression statistical model to determine whether a given verb belongs to the class of interaction verbs. Our system, AVAD, achieves over 87% precision and 82% recall when tested on an 11 million word corpus of journal articles. In addition, we compare the automatically obtained results with a manually constructed database of interaction verbs and show that the automatic approach can significantly enrich the manual list by detecting rarer interaction verbs that were omitted from the database.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/s1386-5056(02)00054-0DOI Listing
December 2002

Of truth and pathways: chasing bits of information through myriads of articles.

Bioinformatics 2002 ;18 Suppl 1:S249-57

Department of Medical Informatics, Columbia University, New York, NY 10032, USA.

Knowledge on interactions between molecules in living cells is indispensable for theoretical analysis and practical applications in modern genomics and molecular biology. Building such networks relies on the assumption that the correct molecular interactions are known or can be identified by reading a few research articles. However, this assumption does not necessarily hold, as truth is rather an emerging property based on many potentially conflicting facts. This paper explores the processes of knowledge generation and publishing in the molecular biology literature using modelling and analysis of real molecular interaction data. The data analysed in this article were automatically extracted from 50000 research articles in molecular biology using a computer system called GeneWays containing a natural language processing module. The paper indicates that truthfulness of statements is associated in the minds of scientists with the relative importance (connectedness) of substances under study, revealing a potential selection bias in the reporting of research results. Aiming at understanding the statistical properties of the life cycle of biological facts reported in research articles, we formulate a stochastic model describing generation and propagation of knowledge about molecular interactions through scientific publications. We hope that in the future such a model can be useful for automatically producing consensus views of molecular interaction data.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/18.suppl_1.s249DOI Listing
October 2004