Publications by authors named "W John Wilbur"

109 Publications

Better synonyms for enriching biomedical search.

J Am Med Inform Assoc 2020 12;27(12):1894-1902

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.

Objective: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be.

Materials And Methods: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms.

Results: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods.

Conclusions: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/jamia/ocaa151DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727334PMC
December 2020

Better synonyms for enriching biomedical search.

J Am Med Inform Assoc 2020 12;27(12):1894-1902

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.

Objective: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be.

Materials And Methods: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms.

Results: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods.

Conclusions: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/jamia/ocaa151DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727334PMC
December 2020

Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.

BMC Med Inform Decis Mak 2020 04 30;20(Suppl 1):73. Epub 2020 Apr 30.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.

Background: Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.

Methods: We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.

Results: The official results demonstrated our best submission was the ensemble of eight models. It achieved a Person correlation coefficient of 0.8328 - the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~ 13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~ 0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528.

Conclusions: Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1186/s12911-020-1044-0DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7191680PMC
April 2020

PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database.

Database (Oxford) 2019 01;2019

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to specifically identify the referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans the newly published literature and proposes related articles of relevance to the existing CDD records. Our approach to address these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature, which were given to curators for review. Through this exercise, we discovered that we were able to map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset that was reviewed by curators, we were able to successfully provide references for 86% of the curator statements. In addition, we suggested new articles for curator review, which were accepted by curators to be added into the reference list at an acceptance rate of 50%. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitute the CDD text similarity corpus. This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high) doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/database/baz064DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6606757PMC
January 2019

LitSense: making sense of biomedical literature at sentence level.

Nucleic Acids Res 2019 07;47(W1):W594-W599

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Literature search is a routine practice for scientific studies as new discoveries build on knowledge from the past. Current tools (e.g. PubMed, PubMed Central), however, generally require significant effort in query formulation and optimization (especially in searching the full-length articles) and do not allow direct retrieval of specific statements, which is key for tasks such as comparing/validating new findings with previous knowledge and performing evidence attribution in biocuration. Thus, we introduce LitSense, which is the first web-based system that specializes in sentence retrieval for biomedical literature. LitSense provides unified access to PubMed and PMC content with over a half-billion sentences in total. Given a query, LitSense returns best-matching sentences using both a traditional term-weighting approach that up-weights sentences that contain more of the rare terms in the user query as well as a novel neural embedding approach that enables the retrieval of semantically relevant results without explicit keyword match. LitSense provides a user-friendly interface that assists its users to quickly browse the returned sentences in context and/or further filter search results by section or publication date. LitSense also employs PubTator to highlight biomedical entities (e.g. gene/proteins) in the sentences for better result visualization. LitSense is freely available at https://www.ncbi.nlm.nih.gov/research/litsense.
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/nar/gkz289DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6602490PMC
July 2019