Machine Learning-Based Models Able To Predict Diabetes Over Various Ethnicities, Study Finds

A recent study, featured in eClinicalMedicine, presents a significant stride in the prediction of diabetes mellitus type 2 (T2D) incidence and prevalence, with a particular focus on diverse ethnic groups.

BACKGROUND for the STUDY

The research explores the application of questionnaire-based models, leveraging the potential of machine learning to improve early detection and management of T2D in non-white populations who face unique risk factors leading to early onset and related consequences.

Early screening and prediction technologies hold the key to identifying and addressing T2D, especially within a non-invasive screening approach, enabling preliminary assessments and referrals, ultimately reducing healthcare costs and enhancing public health.

ABOUT the STUDY

The study aimed to create prediction models for T2D incidence and prevalence based on questionnaires. These models were developed using the data from the United Kingdom Biobank (UKBB) for training and then applied to the Lifelines study data for validation, considering both white and non-white individuals.

Questionnaire-based algorithms were initially trained using data from UKBB’s white population. Their clinical value was compared with that of two other models incorporating additional variables, such as physical measurements and biological markers, and with gold-standard clinical risk assessment models for predicting T2D occurrence. The study employed logistic regression modeling to predict T2D incidence and prevalence.
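As a rough illustration of how a logistic regression model turns questionnaire answers into a risk estimate, the sketch below applies the logistic function to a weighted sum of three predictors the article mentions (BMI, number of medications, TV time). The intercept, the coefficients, and the names `INTERCEPT`, `COEFFICIENTS`, and `predict_risk` are hypothetical and chosen only for illustration; they are not the study's fitted values.

```python
import math

# Hypothetical values for illustration only; NOT the study's fitted model.
INTERCEPT = -6.0
COEFFICIENTS = {
    "bmi": 0.12,               # body mass index, kg/m^2
    "n_medications": 0.30,     # number of medications used
    "tv_hours_per_day": 0.15,  # time spent watching television
}


def predict_risk(answers: dict) -> float:
    """Return a probability in (0, 1) from a logistic regression model."""
    linear_score = INTERCEPT + sum(
        weight * answers[name] for name, weight in COEFFICIENTS.items()
    )
    # The logistic (sigmoid) function maps any real score to a probability.
    return 1.0 / (1.0 + math.exp(-linear_score))


lower_risk = predict_risk({"bmi": 22, "n_medications": 0, "tv_hours_per_day": 1})
higher_risk = predict_risk({"bmi": 34, "n_medications": 5, "tv_hours_per_day": 4})
print(f"lower-risk profile: {lower_risk:.3f}, higher-risk profile: {higher_risk:.3f}")
```

In a real model the coefficients would be estimated from training data rather than set by hand.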

The training dataset consisted of white UKBB participants: 472,696 individuals aged between 37 and 73 years, with data collected from 2006 to 2010. The models were validated on five non-white ethnicities, comprising 29,811 individuals, and externally validated with Lifelines data. The Lifelines dataset included 168,205 individuals aged 0 to 93 years, with data collected from 2006 to 2013.

Feature selection was conducted for model development, and the accuracy of predictions was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC). Sensitivity analyses were also conducted to assess the clinical value of the models. A reclassification analysis was performed, comparing the performance of questionnaire-based prediction models against models that included biomarkers, physical measurements, and clinical T2D risk tools.
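For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen person with T2D receives a higher predicted risk than a randomly chosen person without it. The sketch below computes it directly from that definition; the function name `roc_auc` and the six labels and scores are made up for illustration and are not study data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = has T2D, 0 = does not.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.70, 0.90]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")  # 8 of 9 pairs ranked correctly -> 0.89
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the study's reported values should be read.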

T2D diagnosis in the training cohort relied on self-reported data, clinical T2D diagnoses, or hospital records using the International Classification of Diseases, ninth revision (ICD-9) diagnostic codes. Validation cohort participants were categorized based on self-reports as having incident or prevalent type 2 diabetes.

The study adhered to the National Institute for Health and Care Excellence (NICE) recommendations for defining “potentially undiagnosed” T2D. Individuals with blood glucose levels above 7.0 mmol/L or glycated hemoglobin (HbA1c) levels exceeding 48 mmol/mol in the training and validation datasets were considered potentially undiagnosed and excluded to mitigate bias in prevalence studies. The researchers also excluded incident T2D cases diagnosed more than eight years after baseline, as well as individuals who had not developed T2D but did not return for assessment after eight years.
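The exclusion rule for potentially undiagnosed cases can be sketched as a simple filter using the two thresholds quoted above. The `Participant` record, its field names, and the sample values are hypothetical. Restricting the rule to people without a recorded diagnosis is an assumption made here to match the word "undiagnosed", not a detail stated in the article.

```python
from dataclasses import dataclass

GLUCOSE_THRESHOLD_MMOL_L = 7.0   # blood glucose threshold quoted in the article
HBA1C_THRESHOLD_MMOL_MOL = 48.0  # HbA1c threshold quoted in the article


@dataclass
class Participant:
    participant_id: str       # hypothetical identifier
    glucose_mmol_l: float
    hba1c_mmol_mol: float
    diagnosed_t2d: bool


def is_potentially_undiagnosed(p: Participant) -> bool:
    """Flag people with elevated markers but no recorded T2D diagnosis."""
    if p.diagnosed_t2d:
        return False  # assumption: a recorded diagnosis is not "undiagnosed"
    return (
        p.glucose_mmol_l > GLUCOSE_THRESHOLD_MMOL_L
        or p.hba1c_mmol_mol > HBA1C_THRESHOLD_MMOL_MOL
    )


# Hypothetical cohort, for illustration only.
cohort = [
    Participant("A", glucose_mmol_l=5.2, hba1c_mmol_mol=36.0, diagnosed_t2d=False),
    Participant("B", glucose_mmol_l=7.4, hba1c_mmol_mol=41.0, diagnosed_t2d=False),
    Participant("C", glucose_mmol_l=6.1, hba1c_mmol_mol=52.0, diagnosed_t2d=False),
    Participant("D", glucose_mmol_l=8.0, hba1c_mmol_mol=55.0, diagnosed_t2d=True),
]
retained = [p for p in cohort if not is_potentially_undiagnosed(p)]
print([p.participant_id for p in retained])  # ['A', 'D']
```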

The study extended its validation efforts to two established clinical tools: the concise, non-laboratory Finnish Diabetes Risk Score (FINDRISC) and the Australian T2D Risk Assessment Tool (AUSDRISK). These tools employ nine and 13 features, respectively, spanning medical history, demographics, lifestyle, and anthropometrics to predict incident T2D.

RESULTS

Results from the study revealed that a total of 67,083 and 631,748 individuals were assessed for T2D incidence and prevalence, respectively. Notably, the prevalence and incidence rates of T2D differed significantly between non-white and white populations. Compared with the white UKBB population, which recorded rates of 6.00% for prevalence and 2.80% for incidence, non-white groups showed prevalence of 12% to 23% (up to roughly 4.0-fold higher) and incidence of 1.4% to 8.2% (0.5 to 3.0 times the white UKBB rate).
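The fold differences can be checked against the quoted percentages with simple division. This sketch assumes a fold difference is the ratio of a group's rate to the white UKBB reference rate; the helper name `fold_difference` is introduced here for illustration.

```python
WHITE_UKBB_PREVALENCE_PCT = 6.00  # reference prevalence quoted in the article
WHITE_UKBB_INCIDENCE_PCT = 2.80   # reference incidence quoted in the article


def fold_difference(rate_pct: float, reference_pct: float) -> float:
    """Ratio of a group's rate to the reference rate."""
    return rate_pct / reference_pct


# Lower and upper ends of the ranges reported for non-white groups.
prevalence_folds = (
    fold_difference(12.0, WHITE_UKBB_PREVALENCE_PCT),
    fold_difference(23.0, WHITE_UKBB_PREVALENCE_PCT),
)
incidence_folds = (
    fold_difference(1.4, WHITE_UKBB_INCIDENCE_PCT),
    fold_difference(8.2, WHITE_UKBB_INCIDENCE_PCT),
)

print("prevalence: {:.2f} to {:.2f}".format(*prevalence_folds))  # 2.00 to 3.83
print("incidence:  {:.2f} to {:.2f}".format(*incidence_folds))   # 0.50 to 2.93
```

On these figures, the 4.0-fold prevalence difference corresponds to the upper end of the 12% to 23% range, and an incidence ratio of 0.5 means that the lowest-incidence group had half the white UKBB rate.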

In contrast, Lifelines demonstrated lower T2D prevalence (two percent) and incidence (two percent) compared to the white UKBB population, partially attributed to age variations in the two cohorts.

In the white UKBB sample, the machine learning-based algorithms effectively predicted T2D prevalence (AUC of 0.9) and eight-year incidence (AUC of 0.9). Furthermore, the models showed consistent performance during the Lifelines external validation, with AUC values of 0.8 for incidence and 0.9 for prevalence.

The machine learning models demonstrated robust performance across diverse ethnic groups, yielding AUC values ranging from 0.86 to 0.89 for prevalence and 0.82 to 0.88 for incidence prediction. The models outperformed clinically validated non-laboratory methods, accurately reclassifying almost 3,000 additional cases. Incorporating biological markers, though not physical measurements, enhanced model performance.

Key indicators contributing to the prediction of T2D prevalence and incidence included body mass index (BMI) and the number of medications used, both ranking among the top three features for both models. The incidence model also factored in sedentary behavior, specifically time spent watching television (TV).

In terms of forecasting T2D prevalence and incidence in diverse populations, the questionnaire-based machine-learning models from Lifelines outperformed FINDRISC and AUSDRISK. The models based solely on questionnaires exhibited an excellent balance between sensitivity and specificity, positive predictive value (PPV), and negative predictive value (NPV) across all populations. This balance improved in models that incorporated biomarkers, resulting in higher PPV values across various demographic groups.
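The four screening metrics named above all derive from the same confusion matrix. The sketch below computes them from counts of true and false positives and negatives. The function name `screening_metrics` and the counts are hypothetical, not study results, and the code assumes every denominator is non-zero.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
        "ppv": tp / (tp + fp),          # share of flagged people who are true cases
        "npv": tn / (tn + fn),          # share of cleared people who are non-cases
    }


# Hypothetical screening of 1,000 people, 100 of whom have T2D.
metrics = screening_metrics(tp=80, fp=90, tn=810, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity is 0.8 and specificity is 0.9, yet PPV is below 0.5 because non-cases far outnumber cases. This is one reason PPV is reported alongside sensitivity and specificity when comparing screening tools.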

Significantly, the machine learning models accurately classified more cases than clinically validated prediction techniques for white, Caribbean, other, and South Asian populations. In almost all instances, biomarker-based models outperformed clinical methods.

In summary, the study’s findings highlight the effectiveness of machine learning models developed using UK Biobank data in predicting T2D prevalence and incidence across diverse ethnicities, including non-white populations. These models outperformed current approaches, providing a precise, scalable, and cost-effective means to detect positive cases and predict risk.