Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction.

Authors:
Beatriz Lopez
Center for Health Studies
Largo | United States
Jose Manuel Fernandez-Real
University of Cambridge
United Kingdom

Artif Intell Med 2018 Apr 22;85:43-49. Epub 2017 Sep 22.

Biomedical Research Institute of Girona, Avda. de França, s/n, 17007 Girona, Spain; CIBERobn Pathophysiology of Obesity and Nutrition, Instituto de Salud Carlos III, Madrid, Spain.

Objective: Using artificial intelligence techniques to find out which Single Nucleotide Polymorphisms (SNPs) promote the development of a disease is a recurring theme in medical research, since such techniques may aid early diagnosis and help in prescribing preventive measures. In particular, the aim is to help physicians identify the SNPs relevant to Type 2 diabetes and to build a decision-support tool for risk prediction.

Methods: We use the Random Forest (RF) technique to search for the most important attributes (SNPs) related to diabetes, assigning each attribute a weight (degree of importance) between 0 and 1. Support Vector Machines and Logistic Regression, two other machine learning techniques that are well established in the health community, have also been used, and their performance has been compared with that achieved by RF. Furthermore, the attribute relevance obtained with RF has been used to perform predictions with the k-Nearest Neighbour method, weighting each attribute in the similarity measure according to its RF relevance.
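
The relevance-weighted k-Nearest Neighbour step described in the Methods can be illustrated with a minimal sketch. It assumes that SNP genotypes are coded as 0/1/2, that the dissimilarity between two subjects is the sum of the relevance weights of the SNPs on which they differ, and that the weights come from an earlier RF importance estimate. The encoding, the distance, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_distance(a, b, weights):
    """Sum of relevance weights over the SNPs where two genotypes differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def knn_predict(train_X, train_y, query, weights, k=3):
    """Majority vote among the k training subjects closest to `query`."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: weighted_distance(pair[0], query, weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 3 SNPs per subject, genotypes coded 0/1/2.
train_X = [(0, 1, 2), (0, 2, 2), (0, 0, 1), (2, 1, 0), (2, 0, 0), (2, 2, 1)]
train_y = [0, 0, 0, 1, 1, 1]  # 1 = case, 0 = control

# Hypothetical RF relevances in [0, 1]; the first SNP dominates.
weights = [0.8, 0.1, 0.1]

query = (2, 1, 2)
prediction = knn_predict(train_X, train_y, query, weights, k=3)
unweighted = knn_predict(train_X, train_y, query, [1, 1, 1], k=3)
```

In this toy example the weighted vote follows the highly relevant first SNP, whereas the unweighted vote is swayed by the two low-relevance SNPs, which is the motivation for putting the relevance into the similarity measure.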

Results: Testing is performed on a set of 677 subjects. RF is able to handle the complexity of feature interactions, overfitting, and unknown attribute values, providing the SNPs' relevance while achieving an area under the ROC curve of up to 0.89 for risk prediction. RF outperforms all the other tested machine learning techniques in terms of both prediction accuracy and the stability of the estimated attribute relevance.
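
For readers unfamiliar with the reported metric, the area under the ROC curve equals the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The sketch below computes it that way on made-up scores; the function name and the data are illustrative assumptions and do not reproduce the study's results.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # A case scored above a control counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for three cases (1) and three controls (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.25, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 7 of the 9 case/control pairs are ordered correctly
```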

Conclusions: The Random Forest is a useful method for learning predictive models and the relevance of SNPs without any underlying assumption.

Source
http://dx.doi.org/10.1016/j.artmed.2017.09.005
April 2018

Similar Publications

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

BMC Genomics 2015 21;16 Suppl 2:S5. Epub 2015 Jan 21.

Background: Single-nucleotide polymorphisms (SNPs) selection and identification are the most important tasks in Genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in Genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases.

December 2015

Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers.

Proteins 2008 Jun;71(4):1930-9

Department of Bioinformatics and Computational Biology, George Mason University, Manassas, Virginia 20110, USA.

There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes.

June 2008

Machine learning models in breast cancer survival prediction.

Technol Health Care 2016;24(1):31-42

Health Services Management Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran.

Background: Breast cancer is one of the most common cancers with a high mortality rate among women. With the early diagnosis of breast cancer, survival will increase from 56% to more than 86%. Therefore, an accurate and reliable system is necessary for the early diagnosis of this cancer.

January 2017

Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: a prospective case-control cohort analysis.

BMC Nephrol 2013 Jul 23;14:162. Epub 2013 Jul 23.

Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Hong Kong, SAR, China.

Background: Multi-causality and heterogeneity of phenotypes and genotypes characterize complex diseases. In a database with comprehensive collection of phenotypes and genotypes, we compared the performance of common machine learning methods to generate mathematical models to predict diabetic kidney disease (DKD).

Methods: In a prospective cohort of type 2 diabetic patients, we selected 119 subjects with DKD and 554 without DKD at enrolment and after a median follow-up period of 7.

July 2013