Publications by authors named "Ramakanth Kavuluru"

57 Publications

Twitter discourse on nicotine as potential prophylactic or therapeutic for COVID-19.

Int J Drug Policy 2021 Sep 20;99:103470. Epub 2021 Sep 20.

Assistant Professor, Center for Health Equity Transformation and Department of Behavioral Science, College of Medicine, Lexington, KY, USA.

Background: An unproven "nicotine hypothesis" that indicates nicotine's therapeutic potential for COVID-19 has been proposed in recent literature. This study is about Twitter posts that misinterpret this hypothesis to make baseless claims about benefits of smoking and vaping in the context of COVID-19. We quantify the presence of such misinformation and characterize the tweeters who post such messages.

Methods: Twitter premium API was used to download tweets (n = 17,533) that match terms indicating (a) nicotine or vaping themes, (b) a prophylactic or therapeutic effect, and (c) COVID-19 (January-July 2020) as a conjunctive query. A constraint on the length of the span of text containing the terms in the tweets allowed us to focus on those that convey the therapeutic intent. We hand-annotated these filtered tweets and built a classifier that identifies tweets that extrapolate the nicotine hypothesis to smoking/vaping with a positive predictive value of 85%. We analyzed the frequently used terms in author bios, top Web links, and hashtags of such tweets.
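
For readers who want a concrete picture of the conjunctive filtering described above, here is a minimal sketch that checks whether a tweet mentions a term from each of the three groups within a limited character window. The term lists, window size, and function names are illustrative assumptions, not the study's actual query.

```python
import re

# Illustrative term groups; the study's actual query terms are not reproduced here.
NICOTINE_TERMS = ["nicotine", "vape", "vaping", "e-cig", "juul", "smoking"]
THERAPY_TERMS = ["cure", "treat", "treats", "protect", "protects", "prevent", "prevents"]
COVID_TERMS = ["covid", "coronavirus", "sars-cov-2"]

def group_positions(text, terms):
    """Return character offsets where any term from the group occurs."""
    return [m.start() for t in terms for m in re.finditer(re.escape(t), text)]

def matches_conjunctive_query(tweet, max_span=120):
    """True if a term from every group co-occurs within a max_span-character window."""
    text = tweet.lower()
    groups = [group_positions(text, g) for g in (NICOTINE_TERMS, THERAPY_TERMS, COVID_TERMS)]
    if not all(groups):
        return False
    # Check every combination of one hit per group for a sufficiently tight span.
    for a in groups[0]:
        for b in groups[1]:
            for c in groups[2]:
                if max(a, b, c) - min(a, b, c) <= max_span:
                    return True
    return False

print(matches_conjunctive_query("Nicotine patches may protect against covid, study hints"))
```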

Results: 21% of our filtered COVID-19 tweets indicate a vaping or smoking-based prevention/treatment narrative. Qualitative analyses show a variety of ways therapeutic claims are being made and tweeter bios reveal pre-existing notions of positive stances toward vaping.

Conclusion: The social media landscape is a double-edged sword in tobacco communication. Although it increases information reach, consumers can also be subject to confirmation bias when exposed to inadvertent or deliberate framing of scientific discourse that may border on misinformation. This calls for circumspection and additional planning in countering such narratives as the COVID-19 pandemic continues to ravage our world. Our results also serve as a cautionary tale in how social media can be leveraged to spread misleading information about tobacco products in the wake of pandemics.
Source
http://dx.doi.org/10.1016/j.drugpo.2021.103470
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8450069
September 2021

Literature Retrieval for Precision Medicine with Neural Matching and Faceted Summarization.

Proc Conf Empir Methods Nat Lang Process 2020 Nov;2020:3389-3399

Division of Biomedical Informatics, University of Kentucky, Kentucky, USA.

Information retrieval (IR) for precision medicine (PM) often involves looking for multiple pieces of evidence that characterize a patient case. This typically includes at least the name of a condition and a genetic variation that applies to the patient. Other factors such as demographic attributes, comorbidities, and social determinants may also be pertinent. As such, the retrieval problem is often formulated as search but with multiple facets (e.g., disease, mutation) that may need to be incorporated. In this paper, we present a document reranking approach that combines neural query-document matching and text summarization toward such retrieval scenarios. Our architecture builds on the basic BERT model with three specific components for reranking: (a). document-query matching (b). keyword extraction and (c). facet-conditioned abstractive summarization. The outcomes of (b) and (c) are used to essentially transform a candidate document into a concise summary that can be compared with the query at hand to compute a relevance score. Component (a) directly generates a matching score of a candidate document for a query. The full architecture benefits from the complementary potential of document-query matching and the novel document transformation approach based on summarization along PM facets. Evaluations using NIST's TREC-PM track datasets (2017-2019) show that our model achieves state-of-the-art performance. To foster reproducibility, our code is made available here: https://github.com/bionlproc/text-summ-for-doc-retrieval.
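
To make the query-document matching component more concrete, the sketch below scores a candidate document against a query with a Hugging Face cross-encoder. The checkpoint and scoring head are placeholders, and this is only one of the three components described above, not the paper's full architecture (which also uses keyword extraction and facet-conditioned summarization).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; any BERT-style sequence-pair model could be substituted.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def relevance_score(query: str, document: str) -> float:
    """Encode the (query, document) pair jointly and return a scalar matching score."""
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits.item()

# Rerank candidate documents by descending score.
query = "melanoma BRAF V600E targeted therapy"
candidates = ["Abstract about BRAF inhibitors in melanoma ...",
              "Abstract about an unrelated cardiology topic ..."]
ranked = sorted(candidates, key=lambda d: relevance_score(query, d), reverse=True)
```
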
Source
http://dx.doi.org/10.18653/v1/2020.findings-emnlp.304
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8444997
November 2020

Assigning ICD-O-3 Codes to Pathology Reports using Neural Multi-Task Training with Hierarchical Regularization.

ACM BCB 2021 Aug;2021

Division of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington, Kentucky, USA.

Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of the information is stored as unstructured data in pathology reports. Thus, to process the information, we require either automated extraction techniques or manual curation. Moreover, many cancer-related concepts appear infrequently in real-world training datasets, which makes automated extraction difficult. This study introduces a novel technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to 4% absolute improvement for histology codes.
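
The abstract does not spell out the exact form of the hierarchical regularizer, but one common way to encode a code hierarchy, shown below as an assumption-laden sketch, is to penalize the distance between a child code's classifier weights and its parent's. The tensor shapes, penalty form, and toy hierarchy are illustrative only.

```python
import torch

def hierarchical_penalty(weights: torch.Tensor, parent_of: dict, strength: float = 1e-3) -> torch.Tensor:
    """L2 penalty pulling each child code's weight vector toward its parent's.

    weights: (num_codes, hidden_dim) classification weight matrix.
    parent_of: maps a child code index to its parent code index in the code tree.
    """
    penalty = weights.new_zeros(())
    for child, parent in parent_of.items():
        penalty = penalty + torch.sum((weights[child] - weights[parent]) ** 2)
    return strength * penalty

# Toy example: 5 codes, where codes 1-4 are children of code 0.
W = torch.randn(5, 128, requires_grad=True)
loss = hierarchical_penalty(W, {1: 0, 2: 0, 3: 0, 4: 0})
loss.backward()  # gradients nudge sibling codes toward their shared parent
```
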
Source
http://dx.doi.org/10.1145/3459930.3469541
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8445227
August 2021

Attention-Gated Graph Convolutions for Extracting Drug Interaction Information from Drug Labels.

ACM Trans Comput Healthc 2021 Mar;2(2)

National Library of Medicine, United States.

Preventable adverse events as a result of medical errors present a growing concern in the healthcare system. As drug-drug interactions (DDIs) may lead to preventable adverse events, being able to extract DDIs from drug labels into a machine-processable form is an important step toward effective dissemination of drug safety information. Herein, we tackle the problem of jointly extracting mentions of drugs and their interactions from drug labels. Our deep learning approach entails composing various intermediate representations, including graph-based context derived using graph convolutions (GCs) with a novel attention-based gating mechanism (holistically called GCA), which are combined in meaningful ways to predict on all subtasks jointly. Our model is trained and evaluated on the 2018 TAC DDI corpus. Our GCA model in conjunction with transfer learning performs at 39.20% F1 and 26.09% F1 on entity recognition (ER) and relation extraction (RE), respectively, on the first official test set and at 45.30% F1 and 27.87% F1 on ER and RE, respectively, on the second official test set. These updated results lead to improvements over our prior best by up to 6 absolute F1 points. After controlling for available training data, the proposed model exhibits state-of-the-art performance for this task.
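
As a rough sketch of what an attention-gated graph convolution can look like, the layer below modulates each node's aggregated neighborhood message with a learned sigmoid gate. The gating form, dimensions, and toy adjacency are assumptions for illustration; the paper defines the exact GCA layer.

```python
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    """Toy graph convolution whose neighborhood messages pass through an attention-style gate.

    This mirrors the general idea of gated graph convolutions; it is not the paper's exact GCA layer.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)    # transforms aggregated neighbor features
        self.gate = nn.Linear(2 * dim, dim)   # computes a per-dimension gate from node + message

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim); adj: (num_nodes, num_nodes) row-normalized adjacency
        msg = self.message(adj @ x)                             # aggregate neighbor information
        g = torch.sigmoid(self.gate(torch.cat([x, msg], -1)))   # attention-like gate in [0, 1]
        return x + g * msg                                      # gated residual update

nodes = torch.randn(6, 32)
adj = torch.eye(6)  # toy graph; in practice derived from a dependency parse of the sentence
out = GatedGraphConv(32)(nodes, adj)
```
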
Source
http://dx.doi.org/10.1145/3423209
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8445229
March 2021

International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study.

J Med Internet Res 2021 10 11;23(10):e31400. Epub 2021 Oct 11.

Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, United States.

Background: Many countries have experienced 2 predominant waves of COVID-19-related hospitalizations. Comparing the clinical trajectories of patients hospitalized in separate waves of the pandemic enables further understanding of the evolving epidemiology, pathophysiology, and health care dynamics of the COVID-19 pandemic.

Objective: In this retrospective cohort study, we analyzed electronic health record (EHR) data from patients with SARS-CoV-2 infections hospitalized in participating health care systems representing 315 hospitals across 6 countries. We compared hospitalization rates, severe COVID-19 risk, and mean laboratory values between patients hospitalized during the first and second waves of the pandemic.

Methods: Using a federated approach, each participating health care system extracted patient-level clinical data on their first and second wave cohorts and submitted aggregated data to the central site. Data quality control steps were adopted at the central site to correct for implausible values and harmonize units. Statistical analyses were performed by computing individual health care system effect sizes and synthesizing these using random effect meta-analyses to account for heterogeneity. We focused the laboratory analysis on C-reactive protein (CRP), ferritin, fibrinogen, procalcitonin, D-dimer, and creatinine based on their reported associations with severe COVID-19.
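
To make the synthesis step concrete, the snippet below runs a minimal DerSimonian-Laird random-effects meta-analysis over per-site effect sizes. The numbers are made up and the consortium's actual federated pipeline is considerably more involved.

```python
import numpy as np

def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate from per-site effects and variances."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w_fixed = 1.0 / variances
    pooled_fixed = np.sum(w_fixed * effects) / np.sum(w_fixed)
    q = np.sum(w_fixed * (effects - pooled_fixed) ** 2)       # Cochran's Q heterogeneity statistic
    df = len(effects) - 1
    c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - df) / c)                              # between-site heterogeneity variance
    w = 1.0 / (variances + tau2)
    pooled = np.sum(w * effects) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

# Hypothetical per-site risk differences (second wave vs. first) with their variances.
print(random_effects_meta([-0.08, -0.11, -0.10, -0.09], [0.0004, 0.0006, 0.0005, 0.0007]))
```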

Results: Data were available for 79,613 patients, of which 32,467 were hospitalized in the first wave and 47,146 in the second wave. The prevalence of male patients and patients aged 50 to 69 years decreased significantly between the first and second waves. Patients hospitalized in the second wave had a 9.9% reduction in the risk of severe COVID-19 compared to patients hospitalized in the first wave (95% CI 8.5%-11.3%). Demographic subgroup analyses indicated that patients aged 26 to 49 years and 50 to 69 years; male and female patients; and black patients had significantly lower risk for severe disease in the second wave than in the first wave. At admission, the mean values of CRP were significantly lower in the second wave than in the first wave. On the seventh hospital day, the mean values of CRP, ferritin, fibrinogen, and procalcitonin were significantly lower in the second wave than in the first wave. In general, countries exhibited variable changes in laboratory testing rates from the first to the second wave. At admission, there was a significantly higher testing rate for D-dimer in France, Germany, and Spain.

Conclusions: Patients hospitalized in the second wave were at significantly lower risk for severe COVID-19. This corresponded to mean laboratory values in the second wave that were more likely to be in typical physiological ranges on the seventh hospital day compared to the first wave. Our federated approach demonstrated the feasibility and power of harmonizing heterogeneous EHR data from multiple international health care systems to rapidly conduct large-scale studies to characterize how COVID-19 clinical trajectories evolve.
Source
http://dx.doi.org/10.2196/31400
October 2021

Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation.

ACM BCB 2021 Aug 1;2021. Epub 2021 Aug 1.

Division of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington, Kentucky, USA.

Named entity recognition (NER) and normalization (EN) form an indispensable first step to many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). A second strategy is to use additional counterfactual training examples to handle the issue of models learning spurious correlations between surrounding context and normalized concepts in training data. We conduct elaborate experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that our first strategy performs better in entity normalization when compared with the standard coding scheme. The second data augmentation strategy uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent when evaluating in zero-shot settings, for concepts that have never been encountered during training.
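
The first strategy (decoupling the encoding tags) is easy to picture; the sketch below splits a standard tag into its positional and type parts. It is a schematic of the idea only, not the paper's full tagging model.

```python
def decouple_tag(tag: str):
    """Split an entity encoding tag like 'B-Drug' into a positional tag and a type tag.

    'B-Drug' -> ('B', 'Drug'); the outside tag 'O' carries no entity type.
    """
    if tag == "O":
        return "O", "O"
    position, entity_type = tag.split("-", 1)
    return position, entity_type

sequence = ["O", "B-Drug", "I-Drug", "O", "B-Disease"]
positional = [decouple_tag(t)[0] for t in sequence]   # ['O', 'B', 'I', 'O', 'B']
types = [decouple_tag(t)[1] for t in sequence]        # ['O', 'Drug', 'Drug', 'O', 'Disease']
print(positional, types)
```
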
Source
http://dx.doi.org/10.1145/3459930.3469533
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8425398
August 2021

Twitter, Telepractice, and the COVID-19 Pandemic: A Social Media Content Analysis.

Am J Speech Lang Pathol 2021 Sep 9:1-11. Epub 2021 Sep 9.

University of Kentucky, Lexington.

Purpose: Telepractice was extensively utilized during the COVID-19 pandemic. Little is known about issues experienced during the wide-scale rollout of a service delivery model that was novel to many. Social media research is a way to unobtrusively analyze public communication, including during a health crisis. We investigated the characteristics of tweets about telepractice through the lens of an established health technology implementation framework. Results can help guide efforts to support and sustain telehealth beyond the pandemic context.

Method: We retrieved a historical Twitter data set containing tweets about telepractice from the early months of the pandemic. Tweets were analyzed using a concurrent mixed-methods content analysis design informed by the nonadoption, abandonment, scale-up, spread, and sustainability (NASSS) framework.

Results: Approximately 2,200 Twitter posts were retrieved, and 820 original tweets were analyzed qualitatively. The volume of tweets about telepractice increased in the early months of the pandemic. The largest group of Twitter users tweeting about telepractice was clinical professionals. Tweet content reflected many, but not all, domains of the NASSS framework.

Conclusions: Twitter posting about telepractice increased during the pandemic. Although many tweets represented topics expected in technology implementation, some represented phenomena potentially unique to speech-language pathology. Certain technology implementation topics, notably sustainability, were not found in the data. Implications for future telepractice implementation and further research are discussed.
Source
http://dx.doi.org/10.1044/2021_AJSLP-21-00034
September 2021

Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging.

IEEE J Biomed Health Inform 2021 Sep 8;PP. Epub 2021 Sep 8.

A key challenge in training neural networks for a given medical imaging task is often the difficulty of obtaining a sufficient number of manually labeled examples. In contrast, textual imaging reports, which are often readily available in medical records, contain rich but unstructured interpretations written by experts as part of standard clinical practice. We propose using these textual reports as a form of weak supervision to improve the image interpretation performance of a neural network without requiring additional manually labeled examples. We use an image-text matching task to train a feature extractor and then fine-tune it in a transfer learning setting for a supervised task using a small labeled dataset. The end result is a neural network that automatically interprets imagery without requiring textual reports during inference. This approach can be applied to any task for which text-image pairs are readily available. We evaluate our method on three classification tasks and find consistent performance improvements, reducing the need for labeled data by 67%-98%.
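
A minimal sketch of an image-text matching pretraining objective is shown below, assuming paired image and report embeddings. The encoders, the loss form (a simple in-batch contrastive loss), and the dimensions are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def matching_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """In-batch contrastive image-text matching: each image should match its own report.

    image_emb, text_emb: (batch, dim) outputs of an image encoder and a text encoder.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(logits.size(0))            # diagonal pairs are the true matches
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

img = torch.randn(8, 256, requires_grad=True)   # stand-ins for image-encoder outputs
txt = torch.randn(8, 256, requires_grad=True)   # stand-ins for text-encoder outputs
matching_loss(img, txt).backward()              # gradients would flow into both encoders
```
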
Source
http://dx.doi.org/10.1109/JBHI.2021.3110805
September 2021

Improved biomedical word embeddings in the transformer era.

J Biomed Inform 2021 08 18;120:103867. Epub 2021 Jul 18.

Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States of America; Department of Computer Science, University of Kentucky, United States of America. Electronic address:

Background: Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GLoVE) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings.

Objective: Despite advances in contextualized language model based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications.

Methods: We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts.
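
As a sketch of the two-sentence classification setup described above, the snippet below packs a pair of MeSH terms into BERT's sentence-pair input for a binary co-occurrence decision. The checkpoint and the way terms are verbalized are assumptions for illustration; the repository linked in the conclusion holds the actual implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def cooccurrence_logits(mesh_a: str, mesh_b: str) -> torch.Tensor:
    """Score whether two MeSH concepts co-occur, using BERT's [CLS] A [SEP] B [SEP] input."""
    inputs = tokenizer(mesh_a, mesh_b, return_tensors="pt")
    return model(**inputs).logits  # (1, 2): [does not co-occur, co-occurs]

print(cooccurrence_logits("Myocardial Infarction", "Aspirin"))
# During fine-tuning, a cross-entropy loss on observed vs. negatively sampled MeSH pairs
# would update the input embeddings, which are the static vectors being refined.
```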

Results: Both in qualitative and quantitative evaluations we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board.

Conclusion: We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
Source
http://dx.doi.org/10.1016/j.jbi.2021.103867
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8373296
August 2021

Clinical Characterization and Prediction of Clinical Severity of SARS-CoV-2 Infection Among US Adults Using Data From the US National COVID Cohort Collaborative.

JAMA Netw Open 2021 07 1;4(7):e2116901. Epub 2021 Jul 1.

North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill.

Importance: The National COVID Cohort Collaborative (N3C) is a centralized, harmonized, high-granularity electronic health record repository that is the largest, most representative COVID-19 cohort to date. This multicenter data set can support robust evidence-based development of predictive and diagnostic tools and inform clinical care and policy.

Objectives: To evaluate COVID-19 severity and risk factors over time and assess the use of machine learning to predict clinical severity.

Design, Setting, And Participants: In a retrospective cohort study of 1 926 526 US adults with SARS-CoV-2 infection (polymerase chain reaction >99% or antigen <1%) and adult patients without SARS-CoV-2 infection who served as controls from 34 medical centers nationwide between January 1, 2020, and December 7, 2020, patients were stratified using a World Health Organization COVID-19 severity scale and demographic characteristics. Differences between groups over time were evaluated using multivariable logistic regression. Random forest and XGBoost models were used to predict severe clinical course (death, discharge to hospice, invasive ventilatory support, or extracorporeal membrane oxygenation).

Main Outcomes And Measures: Patient demographic characteristics and COVID-19 severity using the World Health Organization COVID-19 severity scale and differences between groups over time using multivariable logistic regression.

Results: The cohort included 174 568 adults who tested positive for SARS-CoV-2 (mean [SD] age, 44.4 [18.6] years; 53.2% female) and 1 133 848 adult controls who tested negative for SARS-CoV-2 (mean [SD] age, 49.5 [19.2] years; 57.1% female). Of the 174 568 adults with SARS-CoV-2, 32 472 (18.6%) were hospitalized, and 6565 (20.2%) of those had a severe clinical course (invasive ventilatory support, extracorporeal membrane oxygenation, death, or discharge to hospice). Of the hospitalized patients, mortality was 11.6% overall and decreased from 16.4% in March to April 2020 to 8.6% in September to October 2020 (P = .002 for monthly trend). Using 64 inputs available on the first hospital day, this study predicted a severe clinical course using random forest and XGBoost models (area under the receiver operating curve = 0.87 for both) that were stable over time. The factor most strongly associated with clinical severity was pH; this result was consistent across machine learning methods. In a separate multivariable logistic regression model built for inference, age (odds ratio [OR], 1.03 per year; 95% CI, 1.03-1.04), male sex (OR, 1.60; 95% CI, 1.51-1.69), liver disease (OR, 1.20; 95% CI, 1.08-1.34), dementia (OR, 1.26; 95% CI, 1.13-1.41), African American (OR, 1.12; 95% CI, 1.05-1.20) and Asian (OR, 1.33; 95% CI, 1.12-1.57) race, and obesity (OR, 1.36; 95% CI, 1.27-1.46) were independently associated with higher clinical severity.

Conclusions And Relevance: This cohort study found that COVID-19 mortality decreased over time during 2020 and that patient demographic characteristics and comorbidities were associated with higher clinical severity. The machine learning models accurately predicted ultimate clinical severity using commonly collected clinical data from the first 24 hours of a hospital admission.
Source
http://dx.doi.org/10.1001/jamanetworkopen.2021.16901
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8278272
July 2021

The National COVID Cohort Collaborative: Clinical Characterization and Early Severity Prediction.

medRxiv 2021 Jan 23. Epub 2021 Jan 23.

Background: The majority of U.S. reports of COVID-19 clinical characteristics, disease course, and treatments are from single health systems or focused on one domain. Here we report the creation of the National COVID Cohort Collaborative (N3C), a centralized, harmonized, high-granularity electronic health record repository that is the largest, most representative U.S. cohort of COVID-19 cases and controls to date. This multi-center dataset supports robust evidence-based development of predictive and diagnostic tools and informs critical care and policy.

Methods And Findings: In a retrospective cohort study of 1,926,526 patients from 34 medical centers nationwide, we stratified patients using a World Health Organization COVID-19 severity scale and demographics; we then evaluated differences between groups over time using multivariable logistic regression. We established vital signs and laboratory values among COVID-19 patients with different severities, providing the foundation for predictive analytics. The cohort included 174,568 adults with severe acute respiratory syndrome associated with SARS-CoV-2 (PCR >99% or antigen <1%) as well as 1,133,848 adult patients that served as lab-negative controls. Among 32,472 hospitalized patients, mortality was 11.6% overall and decreased from 16.4% in March/April 2020 to 8.6% in September/October 2020 (p = 0.002 monthly trend). In a multivariable logistic regression model, age, male sex, liver disease, dementia, African-American and Asian race, and obesity were independently associated with higher clinical severity. To demonstrate the utility of the N3C cohort for analytics, we used machine learning (ML) to predict clinical severity and risk factors over time. Using 64 inputs available on the first hospital day, we predicted a severe clinical course (death, discharge to hospice, invasive ventilation, or extracorporeal membrane oxygenation) using random forest and XGBoost models (AUROC 0.86 and 0.87 respectively) that were stable over time. The most powerful predictors in these models are patient age and widely available vital sign and laboratory values. The established expected trajectories for many vital signs and laboratory values among patients with different clinical severities validates observations from smaller studies, and provides comprehensive insight into COVID-19 characterization in U.S. patients.

Conclusions: This is the first description of an ongoing longitudinal observational study of patients seen in diverse clinical settings and geographical regions and is the largest COVID-19 cohort in the United States. Such data are the foundation for ML models that can be the basis for generalizable clinical decision support tools. The N3C Data Enclave is unique in providing transparent, reproducible, easily shared, versioned, and fully auditable data and analytic provenance for national-scale patient-level EHR data. The N3C is built for intensive ML analyses by academic, industry, and citizen scientists internationally. Many observational correlations can inform trial designs and care guidelines for this new disease.
Source
http://dx.doi.org/10.1101/2021.01.12.21249511
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7814838
January 2021

Twitter Discourse on Nicotine as Potential Prophylactic or Therapeutic for COVID-19.

medRxiv 2021 Jan 6. Epub 2021 Jan 6.

Center for Health Equity Transformation and Department of Behavioral Science, College of Medicine, Lexington, KY.

Objective: The low observed prevalence of smokers among hospitalized COVID-19 patients in certain cohorts has led to a hypothesis regarding nicotine's therapeutic role in COVID-19 prevention and treatment. As new scientific evidence surfaces, premature conclusions about nicotine are rife in social media, especially unwarranted leaps of such associations to vaping and smoking. This study reports on the prevalence of such leaps and the nature of authors who are making them.

Methods: We used a Twitter API subscription service to download tweets (n = 17,533) that match terms indicating nicotine or vaping themes, in addition to those that point to a prophylactic or therapeutic effect and COVID-19 (January-July 2020). Using a windowing approach, we focused on tweets that are more likely to convey the therapeutic intent. We hand-annotated these filtered tweets and built a classifier that identifies tweets that extrapolate a nicotine link to vaping/smoking. We analyzed the frequently used terms in author bios, top Web links, and hashtags of such tweets.

Results: 21% of our filtered tweets indicate a vaping/smoking-based prevention/treatment narrative. Our classifier was able to spot tweets that make unproven claims about vaping/smoking and COVID-19 with a positive predictive value of 85%. Qualitative analyses show a variety of ways therapeutic claims are being made and user bios reveal pre-existing notions of positive stances toward vaping.

Conclusion: The social media landscape is a double-edged sword in tobacco communication. Although it increases information reach, consumers can also be subject to confirmation bias when exposed to inadvertent or deliberate framing of scientific discourse that may border on misinformation. This calls for circumspection and additional planning in countering such narratives as the COVID-19 pandemic continues to ravage our world.
Source
http://dx.doi.org/10.1101/2021.01.05.21249284
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7805473
January 2021

Identifying current Juul users among emerging adults through Twitter feeds.

Int J Med Inform 2021 02 10;146:104350. Epub 2020 Dec 10.

Department of Computer Science University of Kentucky, Lexington, USA; Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, USA. Electronic address:

Introduction: Juul is the most popular electronic cigarette on the market. Amid concerns around uptake of e-cigarettes by never smokers, can we detect whether someone uses Juul based on their social media activities? This is the central premise of the effort reported in this paper. Several recent social media-related studies on Juul use tend to focus on the characterization of Juul-related messages on social media. In this study, we assess the potential in using machine learning methods to automatically identify Juul users (past 30-day usage) based on their Twitter data.

Methods: We obtained a collection of 588 instances, for training and testing, of Juul use patterns (along with associated Twitter handles) via survey responses of college students. With this data, we built and tested supervised machine learning models based on linear and deep learning algorithms with textual, social network (friends and followers), and other hand-crafted features.
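
A minimal baseline in the spirit of the linear model described above might combine TF-IDF text features with logistic regression. The toy timelines, labels, and feature choices below are illustrative only; the study additionally used follower-network and other hand-crafted features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: concatenated tweet text per user with a survey-derived past-30-day-use label.
user_timelines = [
    "loving my juul today best flavor ever",
    "midterms are rough, coffee is life",
    "pod ran out again, hitting the vape shop",
    "game day! go cats",
]
labels = [1, 0, 1, 0]  # 1 = current Juul user, 0 = non-user (hypothetical)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                      LogisticRegression(class_weight="balanced"))
model.fit(user_timelines, labels)
print(model.predict_proba(["just copped a new juul skin"])[:, 1])  # predicted probability of use
```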

Results: The linear model with textual and follower network features performed best with a precision-recall trade-off such that precision (PPV) is 57 % at 24 % recall (sensitivity). Hence, at least every other college-attending Twitter user flagged by our model is expected to be a Juul user. Additionally, our results indicate that social network features tend to have a large impact (positive) on classification performance.

Conclusion: There are enough latent signals from social feeds for supervised modeling of Juul use, even with limited training data, implying that such models are highly beneficial to very focused intervention campaigns. This initial success indicates potential for more involved automated surveillance of Juul use based on social media data, including Juul usage patterns, nicotine dependence, and risk awareness.
Source
http://dx.doi.org/10.1016/j.ijmedinf.2020.104350
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7855996
February 2021

Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence.

JAMIA Open 2020 Apr 22;3(1):9-15. Epub 2020 Jan 22.

Department of Biomedical Engineering, Department of Systems and Industrial Engineering, The University of Arizona, Tucson, Arizona, USA.

Effective implementation of artificial intelligence in behavioral healthcare delivery depends on overcoming challenges that are pronounced in this domain. Self and social stigma contribute to under-reported symptoms, and under-coding worsens ascertainment. Health disparities contribute to algorithmic bias. Lack of reliable biological and clinical markers hinders model development, and model explainability challenges impede trust among users. In this perspective, we describe these challenges and discuss design and implementation recommendations to overcome them in intelligent systems for behavioral and mental health.
Source
http://dx.doi.org/10.1093/jamiaopen/ooz054
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7309258
April 2020

Non-Negative Matrix Factorization for Drug Repositioning: Experiments with the repoDB Dataset.

AMIA Annu Symp Proc 2019 4;2019:238-247. Epub 2020 Mar 4.

University of Kentucky, Lexington, KY.

Computational methods for drug repositioning are gaining mainstream attention with the availability of experimental gene expression datasets and manually curated relational information in knowledge bases. When building repurposing tools, a fundamental limitation is the lack of gold standard datasets that contain realistic true negative examples of drug-disease pairs that were shown to be non-indications. To address this gap, the repoDB dataset was created in 2017 as a first-of-its-kind realistic resource to benchmark drug repositioning methods - its positive examples are drawn from FDA approved indications and negative examples are derived from failed clinical trials. In this paper, we present the first effort for repositioning that directly tests against repoDB instances. By using hand-curated drug-disease indications from the UMLS Metathesaurus and automatically extracted relations from the SemMedDB database, we employ non-negative matrix factorization (NMF) methods to recover repoDB positive indications. Among recoverable approved indications, our NMF methods achieve 96% recall with 80% precision, providing further evidence that hand-curated knowledge and matrix completion methods can be exploited for hypothesis generation.
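
A toy illustration of matrix completion with NMF on a drug-disease indication matrix is given below. The matrix, rank, and threshold are fabricated and far smaller than anything derived from the UMLS or SemMedDB; it only shows the mechanics of scoring unobserved cells.

```python
import numpy as np
from sklearn.decomposition import NMF

# Rows are drugs, columns are diseases; 1 marks a known indication, 0 is unknown.
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(R)   # drug factors
H = nmf.components_        # disease factors
scores = W @ H             # reconstructed matrix: high scores in zero cells are candidates

candidates = [(d, s) for d in range(R.shape[0]) for s in range(R.shape[1])
              if R[d, s] == 0 and scores[d, s] > 0.5]
print(candidates)  # hypothesized new drug-disease indications to check against a benchmark
```
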
Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7153111
July 2020

Social media surveillance for perceived therapeutic effects of cannabidiol (CBD) products.

Int J Drug Policy 2020 03 21;77:102688. Epub 2020 Feb 21.

Division of Biomedical Informatics, Department of Internal Medicine, Department of Computer Science, University of Kentucky, USA. Electronic address:

Background: CBD products have risen in popularity given CBD's therapeutic potential and lack of legal oversight, despite lacking conclusive scientific evidence for widespread over-the-counter usage for many of its perceived benefits. While medical evidence is being generated, social media surveillance offers a fast and inexpensive alternative to traditional surveys in ascertaining perceived therapeutic purposes and modes of consumption for CBD products.

Methods: We collected all comments from the CBD subreddit posted between January 1 and April 30, 2019 as well as comments submitted to the FDA regarding regulation of cannabis-derived products and analyzed them using a rule-based language processing method. A relative ranking of popular therapeutic uses and product groups for CBD is obtained based on frequency of pattern matches including precise queries that entail identifying mentions of the condition, a CBD product, and some "trigger" phrase indicating therapeutic use. We validated the social media-based findings using a similar analysis on comments to the U.S. Food and Drug Administration's (FDA) 2019 request-for-comments on cannabis-derived products.
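
The rule-based matching can be pictured with a small regex sketch that requires a condition mention, a CBD product mention, and a "trigger" phrase in the same comment. The word lists and trigger phrases here are invented stand-ins for the study's actual patterns.

```python
import re

CONDITIONS = r"(anxiety|pain|insomnia|nausea|stress)"
PRODUCTS = r"(cbd oil|tincture|gummies|cbd vape|capsules)"
TRIGGERS = r"(helps with|works for|relieves|treats|took .{0,20} for)"

def therapeutic_claim(comment: str):
    """Return (condition, product) if the comment pairs a CBD product with a therapeutic claim."""
    text = comment.lower()
    cond = re.search(CONDITIONS, text)
    prod = re.search(PRODUCTS, text)
    trig = re.search(TRIGGERS, text)
    if cond and prod and trig:
        return cond.group(1), prod.group(1)
    return None

print(therapeutic_claim("This cbd oil really helps with my anxiety before exams"))
```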

Results: CBD is mostly discussed as a remedy for anxiety disorders and pain and this is consistent across both comment sources. Of comments posted to the CBD subreddit during the monitored time span, 6.19% mentioned anxiety at least once with at least 6.02% of these comments specifically mentioning CBD as a treatment for anxiety (i.e., 0.37% of total comments). The most popular CBD product group is oil and tinctures.

Conclusion: Social media surveillance of CBD usage has the potential to surface new therapeutic use-cases as they are posted. Contemporary social media data indicate, for example, that stress and nausea are frequently mentioned as therapeutic use cases for CBD without corresponding evidence in the research literature that affirms or denies these claims. However, the abundance of anecdotal claims warrants serious scientific exploration moving forward. Meanwhile, as the FDA ponders regulation, our effort demonstrates that social data offers a convenient affordance to surveil for CBD usage patterns in a way that is fast and inexpensive and can inform conventional electronic surveys.
Source
http://dx.doi.org/10.1016/j.drugpo.2020.102688
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7153970
March 2020

Enhancing timeliness of drug overdose mortality surveillance: A machine learning approach.

PLoS One 2019 16;14(10):e0223318. Epub 2019 Oct 16.

Department of Computer Science, College of Engineering, University of Kentucky, Lexington, Kentucky, United States of America.

Background: Timely data is key to effective public health responses to epidemics. Drug overdose deaths are identified in surveillance systems through ICD-10 codes present on death certificates. ICD-10 coding takes time, but free-text information is available on death certificates prior to ICD-10 coding. The objective of this study was to develop a machine learning method to classify free-text death certificates as drug overdoses to provide faster drug overdose mortality surveillance.

Methods: Using 2017-2018 Kentucky death certificate data, free-text fields were tokenized and features were created from these tokens using natural language processing (NLP). Word, bigram, and trigram features were created as well as features indicating the part-of-speech of each word. These features were then used to train machine learning classifiers on 2017 data. The resulting models were tested on 2018 Kentucky data and compared to a simple rule-based classification approach. Documented code for this method is available for reuse and extensions: https://github.com/pjward5656/dcnlp.
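
The feature construction described above maps naturally onto a scikit-learn pipeline. The sketch below uses word uni/bi/trigram counts with a linear classifier; the example texts are hypothetical, the part-of-speech features are omitted, and the repository linked above remains the authoritative implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical free-text cause-of-death fields with overdose labels.
texts = [
    "acute fentanyl and heroin toxicity",
    "atherosclerotic cardiovascular disease",
    "acute combined drug toxicity oxycodone alprazolam",
    "blunt force injuries motor vehicle collision",
]
is_overdose = [1, 0, 1, 0]

pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),   # word, bigram, and trigram features
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, is_overdose)
print(pipeline.predict(["mixed drug intoxication methamphetamine"]))
```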

Results: The top scoring machine learning model achieved 0.96 positive predictive value (PPV) and 0.98 sensitivity for an F-score of 0.97 in identification of fatal drug overdoses on test data. This machine learning model achieved significantly higher performance for sensitivity (p<0.001) than the rule-based approach. Additional feature engineering may improve the model's prediction. This model can be deployed on death certificates as soon as the free-text is available, eliminating the time needed to code the death certificates.

Conclusion: Machine learning using natural language processing is a relatively new approach in the context of surveillance of health conditions. This method presents an accessible application of machine learning that improves the timeliness of drug overdose mortality surveillance. As such, it can be employed to inform public health responses to the drug overdose epidemic in near-real time as opposed to several weeks following events.
Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0223318
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6795484
March 2020

Distant supervision for treatment relation extraction by leveraging MeSH subheadings.

Artif Intell Med 2019 07 7;98:18-26. Epub 2019 Jun 7.

Department of Computer Science, University of Kentucky, Lexington, KY, United States; Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, United States. Electronic address:

The growing body of knowledge in biomedicine is too vast for human consumption. Hence there is a need for automated systems able to navigate and distill the emerging wealth of information. One fundamental task to that end is relation extraction, whereby linguistic expressions of semantic relationships between biomedical entities are recognized and extracted. In this study, we propose a novel distant supervision approach for relation extraction of binary treatment relationships such that high quality positive/negative training examples are generated from PubMed abstracts by leveraging associated MeSH subheadings. The quality of generated examples is assessed based on the quality of supervised models they induce; that is, the mean performance of trained models (derived via bootstrapped ensembling) on a gold standard test set is used as a proxy for data quality. We show that our approach is preferable to traditional distant supervision for treatment relations and is closer to human crowd annotations in terms of annotation quality. For treatment relations, our generated training data performs at 81.38%, compared to traditional distant supervision at 64.33% and crowd-sourced annotations at 90.57% on the model-wide PR-AUC metric. We also demonstrate that examples generated using our method can be used to augment crowd-sourced datasets. Augmented models improve over non-augmented models by more than two absolute points on the more established F1 metric. We lastly demonstrate that performance can be further improved by implementing a classification loss that is resistant to label noise.
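
One way to picture the distant supervision step is a labeling function that marks a drug-disease pair as a positive treatment example when the citation's MeSH descriptors carry treatment-indicating subheadings. The subheading choices and data structures below are simplified assumptions, not the paper's exact labeling rules.

```python
# Simplified distant labeling of (drug, disease) pairs from a PubMed citation's MeSH terms.
# "drug therapy" and "therapeutic use" are used here only as plausible examples of
# treatment-indicating qualifiers; the paper's subheading logic is more nuanced.

def label_pair(drug: str, disease: str, mesh_annotations: dict) -> int:
    """Return 1 (treats), 0 (weak negative), or -1 (skip) for a candidate pair in one citation.

    mesh_annotations maps a MeSH descriptor to its set of subheadings for the citation.
    """
    disease_quals = mesh_annotations.get(disease, set())
    drug_quals = mesh_annotations.get(drug, set())
    if "drug therapy" in disease_quals and "therapeutic use" in drug_quals:
        return 1    # subheadings jointly indicate a treatment relation
    if disease in mesh_annotations and drug in mesh_annotations:
        return 0    # both mentioned but without treatment qualifiers: weak negative
    return -1       # not enough evidence; leave the pair unlabeled

citation = {"Hypertension": {"drug therapy"}, "Lisinopril": {"therapeutic use"}}
print(label_pair("Lisinopril", "Hypertension", citation))  # -> 1
```
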
Source
http://dx.doi.org/10.1016/j.artmed.2019.06.002
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6748648
July 2019

Cross-registry neural domain adaptation to extract mutational test results from pathology reports.

J Biomed Inform 2019 09 8;97:103267. Epub 2019 Aug 8.

Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, USA; Computer Science Department, University of Kentucky, USA. Electronic address:

Objective: We study the performance of machine learning (ML) methods, including neural networks (NNs), to extract mutational test results from pathology reports collected by cancer registries. Given the lack of hand-labeled datasets for mutational test result extraction, we focus on the particular use-case of extracting Epidermal Growth Factor Receptor mutation results in non-small cell lung cancers. We explore the generalization of NNs across different registries where our goals are twofold: (1) to assess how well models trained on a registry's data port to test data from a different registry and (2) to assess whether and to what extent such models can be improved using state-of-the-art neural domain adaptation techniques under different assumptions about what is available (labeled vs unlabeled data) at the target registry site.

Materials And Methods: We collected data from two registries: the Kentucky Cancer Registry (KCR) and the Fred Hutchinson Cancer Research Center (FH) Cancer Surveillance System. We combine NNs with adversarial domain adaptation to improve cross-registry performance. We compare to other classifiers in the standard supervised classification, unsupervised domain adaptation, and supervised domain adaptation scenarios.
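
Adversarial domain adaptation is commonly implemented with a gradient reversal layer feeding a domain classifier. The sketch below shows that generic mechanism in PyTorch as an illustration under that assumption; it is not the paper's exact adversarial CNN.

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity on the forward pass; multiplies gradients by -lambda on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

features = torch.randn(4, 128, requires_grad=True)   # e.g., CNN features of pathology reports
domain_head = nn.Linear(128, 2)                       # predicts source vs. target registry
domain_logits = domain_head(GradReverse.apply(features, 1.0))
loss = nn.CrossEntropyLoss()(domain_logits, torch.tensor([0, 0, 1, 1]))
loss.backward()  # the reversed gradient pushes the feature extractor toward domain-invariance
```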

Results: The performance of ML methods varied between registries. To extract positive results, the basic convolutional neural network (CNN) had an F1 of 71.5% on the KCR dataset and 95.7% on the FH dataset. For the KCR dataset, the CNN F1 results were low when trained on FH data (Positive F1: 23%). Using our proposed adversarial CNN, without any labeled data, we match the F1 of the models trained directly on each target registry's data. The adversarial CNN F1 improved when trained on FH and applied to KCR dataset (Positive F1: 70.8%). We found similar performance improvements when we trained on KCR and tested on FH reports (Positive F1: 45% to 96%).

Conclusion: Adversarial domain adaptation improves the performance of NNs applied to pathology reports. In the unsupervised domain adaptation setting, we match the performance of models that are trained directly on target registry's data by using source registry's labeled data and unlabeled examples from the target registry.
Source
http://dx.doi.org/10.1016/j.jbi.2019.103267
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736690
September 2019

Neural transfer learning for assigning diagnosis codes to EMRs.

Artif Intell Med 2019 05 12;96:116-122. Epub 2019 Apr 12.

Department of Computer Science, University of Kentucky, Lexington, KY, United States; Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, Lexington, KY, United States. Electronic address:

Objective: Electronic medical records (EMRs) are manually annotated by healthcare professionals and specialized medical coders with a standardized set of alphanumeric diagnosis and procedure codes, specifically from the International Classification of Diseases (ICD). Annotating EMRs with ICD codes is important for medical billing and downstream epidemiological studies. However, manually annotating EMRs is both time-consuming and error prone. In this paper, we explore the use of convolutional neural networks (CNNs) for automatic ICD coding. Because many codes occur infrequently, CNN performance is inhibited. Therefore, we propose supplementing EMR data with PubMed indexed biomedical research abstracts through neural transfer learning.

Materials And Methods: Transfer learning is the process of "transferring" knowledge acquired from one task (the source task) to a different (target) task. For the source task, we train a CNN to predict medical subject headings (MeSH) using 1.6 million PubMed indexed biomedical abstracts. For the target task, we train a CNN on 71,463 real-world EMRs collected from the University of Kentucky (UKY) medical center to predict ICD diagnosis codes. We introduce a simple, yet effective, transfer learning methodology which avoids forgetting knowledge gained from the source task.
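
At a high level, the transfer step can be pictured as initializing the target CNN's text-encoding layers from the source (MeSH-prediction) model and attaching a fresh ICD output layer. The module names and this simple weight-copying scheme are illustrative assumptions; they do not reproduce the paper's specific method for avoiding forgetting.

```python
import torch
from torch import nn

class TextCNN(nn.Module):
    """Minimal text CNN: embeddings -> 1-D convolution -> max pool -> label scores."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, filters, kernel_size=3, padding=1)
        self.out = nn.Linear(filters, num_labels)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x)).max(dim=2).values   # max-pooled filter activations
        return self.out(h)

source = TextCNN(vocab_size=50000, num_labels=30000)   # pretrained to predict MeSH terms (not shown)
target = TextCNN(vocab_size=50000, num_labels=9000)    # will predict ICD diagnosis codes

# Transfer: copy the shared text-encoding layers; the ICD output layer stays freshly initialized.
target.embed.load_state_dict(source.embed.state_dict())
target.conv.load_state_dict(source.conv.state_dict())
scores = target(torch.randint(0, 50000, (2, 40)))      # (2, 9000) ICD code scores
```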

Results: Compared to our prior work using EMRs from the UKY medical center, we improve both the micro and macro F-scores by more than 8%. Likewise, compared to other transfer learning methods, our approach results in nearly 2% improvement in macro F-score.

Conclusion: We show that transfer learning can improve CNN performance for EMR coding in the presence of data sparsity issues. Furthermore, we find that our proposed transfer learning approach outperforms other methods with respect to macro F-score. Finally, we analyze how transfer learning impacts codes with respect to code frequency. We find that we achieve greater improvement on infrequent codes compared to improvements in most frequent codes.
Source
http://dx.doi.org/10.1016/j.artmed.2019.04.002
May 2019

Validity of Natural Language Processing for Ascertainment of EGFR and ALK Test Results in SEER Cases of Stage IV Non-Small-Cell Lung Cancer.

JCO Clin Cancer Inform 2019 05;3:1-15

Fred Hutchinson Cancer Research Center, Seattle, WA.

Purpose: SEER registries do not report results of epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK) mutation tests. To facilitate population-based research in molecularly defined subgroups of non-small-cell lung cancer (NSCLC), we assessed the validity of natural language processing (NLP) for the ascertainment of EGFR and ALK testing from electronic pathology (e-path) reports of NSCLC cases included in two SEER registries: the Cancer Surveillance System (CSS) and the Kentucky Cancer Registry (KCR).

Methods: We obtained 4,278 e-path reports from 1,634 patients who were diagnosed with stage IV nonsquamous NSCLC from September 1, 2011, to December 31, 2013, included in CSS. We used 855 CSS reports to train NLP systems for the ascertainment of EGFR and ALK test status (reported vs. not reported) and test results (positive vs. negative). We assessed sensitivity, specificity, and positive and negative predictive values in an internal validation sample of 3,423 CSS e-path reports and repeated the analysis in an external sample of 1,041 e-path reports from 565 KCR patients. Two oncologists manually reviewed all e-path reports to generate gold-standard data sets.

Results: NLP systems yielded internal validity metrics that ranged from 0.95 to 1.00 for EGFR and ALK test status and results in CSS e-path reports. NLP showed high internal accuracy for the ascertainment of EGFR and ALK in CSS patients, with F-scores of 0.95 and 0.96, respectively. In the external validation analysis, NLP yielded metrics that ranged from 0.02 to 0.96 in KCR reports and F-scores of 0.70 and 0.72, respectively, in KCR patients.

Conclusion: NLP is an internally valid method for the ascertainment of and test information from e-path reports available in SEER registries, but future work is necessary to increase NLP external validity.
Source
http://dx.doi.org/10.1200/CCI.18.00098
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6874053
May 2019

Prevalence and reasons for Juul use among college students.

J Am Coll Health 2020 07 26;68(5):455-459. Epub 2019 Mar 26.

Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY, USA.

Examine Juul use patterns, sociodemographic and personal factors associated with Juul use, and reasons for Juul initiation and current use, among college students. Convenience sample of 371 undergraduates at a large university in the southeast; recruited April 2018. Cross-sectional design using an online survey. Logistic regression identified the personal risk factors for current use. Over 80% of participants recognized Juul; 36% reported ever use and 21% past 30-day use. Significant risk factors for current Juul use were: male, White/non-Hispanic, lower undergraduate, and current cigarette smoker. Current Juul users chose ease of use and lack of a bad smell as reasons for use. Ever Juul users most commonly endorsed curiosity and use by friends as reasons for trying Juul. Given the propensity for nicotine addiction among youth and young adults, rates of Juul use are alarming and warrant immediate intervention.
Source
http://dx.doi.org/10.1080/07448481.2019.1577867
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6763357
July 2020

Few-Shot and Zero-Shot Multi-Label Learning for Structured Label Spaces.

Proc Conf Empir Methods Nat Lang Process 2018 Oct-Nov;2018:3132-3142

Division of Biomedical Informatics, University of Kentucky, Lexington, KY.

Large multi-label datasets contain labels that occur thousands of times (frequent group), those that occur only a few times (few-shot group), and labels that never appear in the training dataset (zero-shot group). Multi-label few- and zero-shot label prediction is mostly unexplored on datasets with large label spaces, especially for text classification. In this paper, we perform a fine-grained evaluation to understand how state-of-the-art methods perform on infrequent labels. Furthermore, we develop few- and zero-shot methods for multi-label text classification when there is a known structure over the label space, and evaluate them on two publicly available medical text datasets: MIMIC II and MIMIC III. For few-shot labels we achieve improvements of 6.2% and 4.8% in R@10 for MIMIC II and MIMIC III, respectively, over prior efforts; the corresponding R@10 improvements for zero-shot labels are 17.3% and 19%.
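
Zero-shot scoring over a structured label space can be illustrated by building each label's vector from its textual description, smoothing it with its parent's vector to inject hierarchy, and scoring documents by dot product. This is a simplified stand-in for the paper's method; the vectors, label names, and smoothing scheme below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=50) for w in
             "chronic kidney disease stage hypertension unspecified essential".split()}

def embed(text):
    """Average the word vectors of a label description (or a document)."""
    vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0)

def label_vector(description, parent_description=None):
    """Build a label vector from its description, optionally smoothed with its parent's."""
    v = embed(description)
    if parent_description is not None:
        v = (v + embed(parent_description)) / 2   # hierarchy helps labels never seen in training
    return v

doc = embed("hypertension chronic kidney disease stage")
zero_shot_label = label_vector("hypertension unspecified", parent_description="essential hypertension")
print(float(doc @ zero_shot_label))   # higher score -> assign the (unseen) label
```
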
Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6375489
February 2019

Document Retrieval for Biomedical Question Answering with Neural Sentence Matching.

Proc Int Conf Mach Learn Appl 2018 Dec 17;2018:194-201. Epub 2019 Jan 17.

Div. of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington KY.

Document retrieval (DR) forms an important component in end-to-end question-answering (QA) systems where particular answers are sought for well-formed questions. DR in the QA scenario is also useful by itself even without a more involved natural language processing component to extract exact answers from the retrieved documents. This latter step may simply be done by humans, as in traditional search engines, provided the retrieved documents contain the answer. In this paper, we take advantage of datasets made available through the BioASQ end-to-end QA shared task series and build an effective biomedical DR system that relies on relevant answer snippets in the BioASQ training datasets. At the core of our approach is a question-answer sentence matching neural network that learns a measure of relevance of a sentence to an input question in the form of a matching score. In addition to this matching score feature, we also exploit two auxiliary features for scoring document relevance: the name of the journal in which a document is published and the presence/absence of semantic relations (subject-predicate-object triples) in a candidate answer sentence connecting entities mentioned in the question. We rerank our baseline sequential dependence model scores using these three additional features weighted via adaptive random search and other learning-to-rank methods. Our full system placed 2nd in the final batch of Phase A (DR) of task B (QA) in BioASQ 2018. Our ablation experiments highlight the significance of the neural matching network component in the full system.
Source
http://dx.doi.org/10.1109/ICMLA.2018.00036
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6353660
December 2018

Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.

Database (Oxford) 2019 01 1;2019. Epub 2019 Jan 1.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations, and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPIs affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Source
http://dx.doi.org/10.1093/database/bay147
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6348314
January 2019

Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task.

J Am Med Inform Assoc 2018 10;25(10):1274-1283

Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data.

Materials And Methods: We organized 3 independent subtasks: automatic classification of self-reports of (1) adverse drug reactions (ADRs) and (2) medication consumption from medication-mentioning tweets, and (3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3), and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. Following the shared tasks, we evaluated the performance of classes of methods and of ensembles of system combinations.

Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems.

Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1).

Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
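As a minimal illustration of the ensembling idea reported above, the sketch below combines the per-tweet labels of several independent classifiers. The abstract does not spell out the combination scheme, so simple majority voting is used here as an assumption; the system names and predictions are invented and this is not the shared-task organizers' code.

from collections import Counter

def majority_vote(predictions_per_system):
    """predictions_per_system: list of lists, one label sequence per system."""
    ensembled = []
    for labels_for_tweet in zip(*predictions_per_system):
        ensembled.append(Counter(labels_for_tweet).most_common(1)[0][0])
    return ensembled

if __name__ == "__main__":
    svm_preds = [1, 0, 0, 1]   # hypothetical SVM outputs per tweet
    cnn_preds = [1, 1, 0, 0]   # hypothetical CNN outputs
    rnn_preds = [1, 0, 0, 0]   # hypothetical third system
    print(majority_vote([svm_preds, cnn_preds, rnn_preds]))  # -> [1, 0, 0, 0]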
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/jamia/ocy114DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6188524PMC
October 2018

An end-to-end deep learning architecture for extracting protein-protein interactions affected by genetic mutations.

Database (Oxford) 2018 01 1;2018:1-13. Epub 2018 Jan 1.

Department of Computer Science, University of Kentucky, Lexington, KY, USA.

The BioCreative VI Track IV (mining protein interactions and mutations for precision medicine) challenge was organized in 2017 with the goal of applying biomedical text mining methods to support advancements in precision medicine approaches. As part of the challenge, a new dataset was introduced for the purpose of building a supervised relation extraction model capable of taking a test article and returning a list of interacting protein pairs identified by their Entrez Gene IDs. Specifically, such pairs represent proteins participating in a binary protein-protein interaction relation where the interaction is additionally affected by a genetic mutation (referred to as a PPIm relation). In this study, we explore an end-to-end approach for PPIm relation extraction by deploying a three-component pipeline involving deep learning-based named-entity recognition and relation classification models along with a knowledge-based approach for gene normalization. We propose several recall-focused improvements to our original challenge entry, which placed second when matching on Entrez Gene ID (exact matching) and on HomoloGene ID. On exact matching, the improved system achieved competitive new test results of 37.78% micro-F1 with a precision of 38.22% and a recall of 37.34%, which corresponds to an improvement of approximately three micro-F1 points over the prior best system. When matching on HomoloGene IDs, we report similarly competitive test results of 46.17% micro-F1 with a precision of 46.67% and a recall of 45.59%, corresponding to an improvement of more than eight micro-F1 points over the prior best result. The code for our deep learning system is made publicly available at https://github.com/bionlproc/biocppi_extraction.
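The following is a structural sketch of such a three-stage pipeline (mention recognition, relation classification, gene normalization). The stage implementations are deliberately trivial stand-ins for the paper's neural and knowledge-based components, and the function names, toy lexicon and heuristic are our own illustrative choices; only the Entrez Gene IDs shown are real.

from itertools import combinations

def recognize_gene_mentions(text):
    """Stand-in for the neural NER model: return gene/protein mention strings."""
    toy_lexicon = {"BRCA1", "TP53", "EGFR"}          # hypothetical mention list
    return [tok.strip(".,") for tok in text.split() if tok.strip(".,") in toy_lexicon]

def classify_relation(text, mention_a, mention_b):
    """Stand-in for the relation classifier: does the text assert a mutation-affected PPI?"""
    return "interact" in text and "mutation" in text  # trivially simple heuristic

def normalize_to_entrez(mention):
    """Stand-in for knowledge-based gene normalization to Entrez Gene IDs."""
    toy_mapping = {"BRCA1": "672", "TP53": "7157", "EGFR": "1956"}
    return toy_mapping.get(mention)

def extract_ppim_pairs(text):
    """Run the three stages in sequence and return normalized interacting pairs."""
    mentions = set(recognize_gene_mentions(text))
    pairs = set()
    for a, b in combinations(sorted(mentions), 2):
        if classify_relation(text, a, b):
            ids = (normalize_to_entrez(a), normalize_to_entrez(b))
            if all(ids):
                pairs.add(tuple(sorted(ids)))
    return pairs

if __name__ == "__main__":
    print(extract_ppim_pairs("A mutation in BRCA1 alters how BRCA1 and TP53 interact."))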
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/database/bay092DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6146129PMC
January 2018

EMR Coding with Semi-Parametric Multi-Head Matching Networks.

Proc Conf 2018 Jun;2018:2081-2091

Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA.

Coding EMRs with diagnosis and procedure codes is an indispensable task for billing, secondary data analyses, and monitoring health trends. Both speed and accuracy of coding are critical. While coding errors can lead to greater patient-side financial burden and misinterpretation of a patient's well-being, timely coding is also needed to avoid backlogs and additional costs for the healthcare facility. In this paper, we present a new neural network architecture that combines ideas from few-shot learning matching networks, multi-label loss functions, and convolutional neural networks for text classification to significantly outperform other state-of-the-art models. Our evaluations are conducted on a well-known deidentified EMR dataset (MIMIC) using a variety of multi-label performance measures.
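For readers unfamiliar with the multi-label setup, the PyTorch sketch below shows a plain CNN text encoder with one sigmoid output per code, trained with a multi-label binary cross-entropy loss. It illustrates only the multi-label CNN ingredient, not the paper's semi-parametric multi-head matching network, and the vocabulary size, dimensions and random batch are placeholders.

import torch
import torch.nn as nn

class MultiLabelCNNCoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, num_filters=128,
                 kernel_size=5, num_codes=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=kernel_size // 2)
        self.out = nn.Linear(num_filters, num_codes)   # one logit per candidate code

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                   # convolve over the note text
        x = x.max(dim=2).values                        # max-pool over time
        return self.out(x)                             # raw logits, one per code

if __name__ == "__main__":
    model = MultiLabelCNNCoder()
    tokens = torch.randint(1, 5000, (4, 200))          # a fake batch of 4 notes
    labels = torch.randint(0, 2, (4, 50)).float()      # fake multi-hot code labels
    logits = model(tokens)
    loss = nn.BCEWithLogitsLoss()(logits, labels)      # multi-label loss
    predicted_codes = torch.sigmoid(logits) > 0.5      # independent threshold per code
    print(loss.item(), predicted_codes.shape)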
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.18653/v1/N18-1189DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6105925PMC
June 2018

Extracting chemical-protein relations with ensembles of SVM and deep learning models.

Database (Oxford) 2018 01;2018

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Mining relations between chemicals and proteins from the biomedical literature is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect chemical-protein relations in running text (PubMed abstracts). This work describes our CHEMPROT track entry, an ensemble of three systems: a support vector machine, a convolutional neural network, and a recurrent neural network. Their outputs are combined using majority voting or stacking to produce the final predictions. Our CHEMPROT system obtained a precision of 0.7266 and a recall of 0.5735 for an F-score of 0.6410, demonstrating the effectiveness of machine learning-based approaches for automatic relation extraction from the biomedical literature and achieving the highest performance in the task during the 2017 challenge. Database URL: http://www.biocreative.org/tasks/biocreative-vi/track-5/.
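As an illustration of the stacking variant mentioned above (distinct from simple majority voting), the sketch below fits a logistic regression meta-classifier on the probability outputs of three base systems. The random scores standing in for the SVM, CNN and RNN outputs, and the derived labels, are fabricated for the example and do not reflect the authors' actual system.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_dev = 200                                   # development instances for stacking

# Columns: hypothetical P(relation) from the SVM, CNN and RNN base systems.
base_scores_dev = rng.random((n_dev, 3))
gold_dev = (base_scores_dev.mean(axis=1) > 0.5).astype(int)  # fake gold labels

meta = LogisticRegression()
meta.fit(base_scores_dev, gold_dev)           # learn how to weight the base systems

base_scores_test = rng.random((5, 3))         # scores for five unseen instances
print(meta.predict(base_scores_test))         # final ensemble decisions
print(meta.coef_)                             # learned weight per base system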
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1093/database/bay073DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6051439PMC
January 2018

Exploiting semantic patterns over biomedical knowledge graphs for predicting treatment and causative relations.

J Biomed Inform 2018 06 12;82:189-199. Epub 2018 May 12.

Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States; Department of Computer Science, University of Kentucky, United States.

Background: Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested in animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying different causal relations between biomedical entities is also critical to understanding biomedical processes. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between any given pair of entities using the distant supervision approach.

Objective: To build high-accuracy supervised predictive models that predict previously unknown treatment and causative relations between biomedical entities based only on semantic graph pattern features extracted from biomedical knowledge graphs.

Methods: We used 7000 hand-curated treats relations and 2918 hand-curated causes relations from the UMLS Metathesaurus to train and test our models. Our graph pattern features are extracted from simple paths connecting biomedical entities in the SemMedDB graph (based on the well-known SemMedDB database made available by the U.S. National Library of Medicine). Using these graph patterns connecting biomedical entities as features of logistic regression and decision tree models, we computed mean performance measures (precision, recall, F-score) over 100 distinct 80%-20% train-test splits of the datasets. For all experiments, we used a positive:negative class imbalance of 1:10 in the test set to model relatively more realistic scenarios.

Results: Our models predict treats and causes relations with high F-scores of 99% and 90%, respectively. Logistic regression model coefficients also help us identify highly discriminative patterns that have an intuitive interpretation. Working with two physician co-authors, we are also able to identify new plausible relations among the false positives that our models scored highly. Finally, our decision tree models are able to retrieve over 50% of the treatment relations from a recently created external dataset.

Conclusions: We employed semantic graph patterns connecting pairs of candidate biomedical entities in a knowledge graph as features to predict treatment/causative relations between them. We provide what we believe is the first evidence of direct prediction of biomedical relations based on graph features. Our work complements lexical pattern-based approaches in that the graph patterns can be used as additional features for weakly supervised relation prediction.
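A small sketch of the graph-pattern feature idea is given below: predicate sequences along simple paths between an entity pair become binary features for a logistic regression model. The toy graph, predicates, entity pairs and labels are invented for illustration and are not drawn from SemMedDB or the UMLS.

import networkx as nx
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy semantic graph: nodes are biomedical concepts, edges carry a predicate.
G = nx.DiGraph()
G.add_edge("aspirin", "inflammation", predicate="INHIBITS")
G.add_edge("inflammation", "arthritis", predicate="CAUSES")
G.add_edge("smoking", "lung_cancer", predicate="PREDISPOSES")
G.add_edge("smoking", "nicotine", predicate="CONTAINS")

def path_pattern_features(graph, source, target, cutoff=3):
    """One binary feature per predicate sequence on a simple path source -> target."""
    feats = {}
    for path in nx.all_simple_paths(graph, source, target, cutoff=cutoff):
        pattern = "->".join(graph[u][v]["predicate"] for u, v in zip(path, path[1:]))
        feats[pattern] = 1
    return feats

# Hypothetical labeled pairs: 1 = a treats relation holds, 0 = it does not.
pairs = [("aspirin", "arthritis", 1), ("smoking", "lung_cancer", 0)]
X_dicts = [path_pattern_features(G, s, t) for s, t, _ in pairs]
y = [label for _, _, label in pairs]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_dicts), y)
for pattern, idx in vec.vocabulary_.items():
    print(pattern, clf.coef_[0][idx])        # learned weight per graph pattern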
View Article and Find Full Text PDF

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.jbi.2018.05.003DOI Listing
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6070294PMC
June 2018