Session

Big Data and Machine Learning for Health Research

Lynn Agre, MPH, PhD and Pratha Sah, PhD, Practice Research Network, Optum, Boston, MA

APHA 2024 Annual Meeting and Expo

Abstract

Impact of social determinants of health on diabetes outcomes: Insights from a quality improvement intervention among adult patients

Owusua Yamoah, PhD, MA1, Dove-Anna Johnson1, Yashashvi Raghuwanshi2, Tyler Barnett, MHSA2, Yumiko Tsushima, MD1, Julia Blanchette, PhD, RN, BC-ADM, CDCES3 and Betul Hatipoglu, MD3
(1)Case Western Reserve University School of Medicine, Cleveland, OH, (2)University Hospitals, Cleveland, OH, (3)Case Western Reserve University, Cleveland, OH

Introduction: Approximately 80% of the modifiable factors that shape population health outcomes are attributed to social determinants of health (SDOH). Among individuals with diabetes, SDOH factors are crucial in shaping both physical and psychosocial outcomes. This study examines the influence of SDOH factors on diabetes outcomes among adults participating in a quality improvement (QI) program.

Methods: Adults with elevated HbA1c levels (>8.5%) were enrolled in the QI program. Participants were categorized as responders (HbA1c reduction ≥ 0.5%) or non-responders (HbA1c reduction < 0.5%) at follow-up. A total of 41 predictive features were identified from LexisNexis socioeconomic health attributes. ANOVA and Fisher's Exact tests were employed to compare responders versus non-responders. Three machine learning techniques yielded the most efficient features for inclusion in the predictive model.
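The responder rule and the Fisher's Exact comparison described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code; the function names and the example counts are hypothetical.

```python
from math import comb

def is_responder(pre_hba1c, post_hba1c, threshold=0.5):
    """Responder rule from the abstract: HbA1c fell by at least 0.5 percentage points."""
    return (pre_hba1c - post_hba1c) >= threshold

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are responders / non-responders,
# columns are stable / unstable address history.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

With real data, each row of the table would be built by applying `is_responder` to every participant and cross-tabulating the result against a categorical SDOH attribute.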

Results: The study included 475 individuals from the diabetes program. Mean values were: age 55.5 years, pre-intervention HbA1c 9.0%, post-intervention HbA1c 8.3%, and change in HbA1c -0.72%. Significant differences between groups were observed in crime and burglary indices and in address stability (p < 0.05). Final selected features included household characteristics (age and income) and neighborhood-level attributes (income, home values, and crime). Logistic regression demonstrated improved accuracy (0.55), while Support Vector Regression recorded the highest mean squared error (MSE) of 0.32.

Discussion: A clinically significant decrease in HbA1c is closely associated with patients’ SDOH factors, collectively accounting for over 30% of intervention outcomes. Future interventions must include social support navigators to identify and address patients’ neighborhood stressors and socioeconomic barriers that may impact intervention outcomes.

Biostatistics, economics; Chronic disease management and prevention; Public health or related research

Abstract

Missing data in high dimensional multilevel data: A hierarchical machine learning approach applied to a national behavioral health study

Niloofar Ramezani, PhD1, Jennifer Johnson, PhD2 and Faye Taxman, Ph.D.3
(1)Virginia Commonwealth University, Richmond, VA, (2)Michigan State University College of Human Medicine, Flint, MI, (3)George Mason University, Fairfax, VA

Missing data are common in large-scale health and biomedical studies. Existing methods for handling missing values in high-dimensional data are either slow or difficult for applied researchers and practitioners to implement. In a national study conducted in 2020-2023, a higher percentage of missing data was encountered because the surveys were administered during the COVID-19 era. The first round of survey responses, collected from 761 behavioral health and criminal justice practitioners in 504 counties across the U.S. and covering more than 2,000 recorded variables, revealed different types and degrees of missing data. Using these data, a new methodology is developed and tested by carefully combining statistical methodology with machine learning (ML) computing power, and is compared to multiple imputation and other traditional missing data methods.

Supervised ML techniques, including tree-based and trained random forest models, are used to tackle missing data by selecting and using available data features to estimate imputed values. Cross-validation is used to simultaneously select the appropriate prediction model, train the model, and validate the accuracy of the predicted imputed values. An additional layer of clustering is added to these ML-based methods to obtain accurate predictions of missing values, which will be useful to biomedical and healthcare researchers encountering missing data in their studies. Accounting for different characteristics of the data, this strategy accurately imputes missing data in a hierarchical setting and provides a toolkit for public health researchers who need a robust and easy-to-use tool for accurately predicting and imputing missing values.
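The idea of imputing within clusters and checking accuracy on held-out values can be sketched as below. The abstract does not specify the models or the clustering step, so this sketch makes a loud simplification: a within-cluster mean stands in for the tree-based and random forest predictors, and the field names and toy records are hypothetical.

```python
from statistics import mean

def impute_by_cluster(records, cluster_key, target):
    """Fill missing `target` values (None) with the mean of the record's cluster,
    falling back to the overall mean when the cluster has no observed values."""
    overall = mean(r[target] for r in records if r[target] is not None)
    by_cluster = {}
    for r in records:
        if r[target] is not None:
            by_cluster.setdefault(r[cluster_key], []).append(r[target])
    filled = []
    for r in records:
        value = r[target]
        if value is None:
            group = by_cluster.get(r[cluster_key])
            value = mean(group) if group else overall
        filled.append({**r, target: value})
    return filled

def masked_error(records, cluster_key, target, holdout_index):
    """Hide one observed value, impute it, and return the absolute error."""
    truth = records[holdout_index][target]
    masked = [dict(r) for r in records]
    masked[holdout_index][target] = None
    imputed = impute_by_cluster(masked, cluster_key, target)
    return abs(imputed[holdout_index][target] - truth)

# Hypothetical toy data: respondents nested in counties.
survey = [
    {"county": "A", "score": 4.0},
    {"county": "A", "score": 6.0},
    {"county": "A", "score": None},
    {"county": "B", "score": 10.0},
    {"county": "C", "score": None},
]
completed = impute_by_cluster(survey, "county", "score")
```

Repeating `masked_error` over many held-out positions gives a cross-validation-style estimate of imputation accuracy, which is the role cross-validation plays in the approach described above.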

Biostatistics, economics; Social and behavioral sciences

Abstract

Small area estimation methods for county level obesity prevalence estimation: Generalized linear mixed models vs. machine learning techniques

Zhen Zhang, PhD1, Lei Zhang, PhD, MBA2 and Marinelle Payton, M.D, Ph.D, M.S., M.P.H1
(1)Jackson State University, Jackson, MS, (2)University of Mississippi Medical Center, Jackson, MS

Background

Small Area Estimation (SAE) is a cost-effective methodology that combines multiple data sources to enhance the survey estimator for small geographic areas or subpopulations. As machine learning (ML) continues to gain momentum in data science, more ML applications are appearing in the SAE landscape. The objective of this study was to compare classic generalized linear mixed models (GLMMs) with ML algorithms for SAE of county-level obesity prevalence in Mississippi.

Methods

The 2022 Mississippi Behavioral Risk Factor Surveillance System (BRFSS) data were obtained for this study. The 2020 US census data at the county level were incorporated as auxiliary data to “borrow strength.” GLMMs were constructed and validated using the survey packages in R to account for the weights and the complex sample design of the BRFSS; tree-based ML and neural network models were trained and validated using IBM SPSS Modeler v18.4.0.
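To illustrate what "borrowing strength" means, the sketch below shrinks a noisy direct survey estimate toward a model-based estimate built from auxiliary data. It uses the standard area-level composite form, which is an assumption chosen for illustration and not the GLMM specification used in this study; all numbers are hypothetical.

```python
def composite_estimate(direct, direct_var, synthetic, between_area_var):
    """Shrink a direct survey estimate toward a model-based (synthetic) estimate.

    The weight on the direct estimate grows as its sampling variance shrinks,
    so well-sampled counties rely mostly on their own data.
    """
    weight = between_area_var / (between_area_var + direct_var)
    return weight * direct + (1 - weight) * synthetic

# Hypothetical county with a small sample: the direct estimate is noisy.
small_county = composite_estimate(direct=0.40, direct_var=0.01,
                                  synthetic=0.30, between_area_var=0.01)

# Hypothetical county with a large sample: the direct estimate dominates.
large_county = composite_estimate(direct=0.40, direct_var=0.0001,
                                  synthetic=0.30, between_area_var=0.01)
```

In this toy case the small county lands halfway between the two inputs, while the large county stays close to its own direct estimate.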

Results

The Mississippi county-level obesity prevalence estimated by GLMMs was validated using three-fold cross-validation. The GLMM estimates showed an adequate range of variation among counties and satisfactory precision, as indicated by their standard errors. Gradient-boosted decision trees (GBDTs) generated point estimates similar to those of GLMMs, with slightly wider confidence intervals. Neural network models appeared to be the least accurate and least robust of the three techniques applied.

Conclusions

The classic GLMMs are among the top choices as an SAE tool. IBM SPSS Modeler makes it intuitive and easy to train and validate GBDTs, which makes it a suitable choice for less experienced analysts starting out with SAE.

Public health or related research

Abstract

Mathematical modeling of emerging and reemerging infectious disease outbreaks to predict ED visit rate: A novel SEITIRD model

Olumide Arigbede, MPH1, Sarah Buxbaum, Ph.D.2, John Luque, Ph.D.1, Tammie M Johnson, DrPH1, Clyde Perry Brown, DrPH3, Lekan Latinwo, Ph.D.1 and Edward Lockhart, Ph.D.4
(1)Florida A&M University, Tallahassee, FL, (2)Tallahassee, FL, (3)Florida Agricultural and Mechanical University, Tallahassee, FL, (4)Centers for Disease Control and Prevention, Atlanta, GA

Objectives: To develop a novel compartmental mathematical model (SEITIRD) based on the existing SEIR model to predict the rate of emergency department (ED) visits of detected cases of emerging and reemerging infectious diseases (ERID), using COVID-19 and influenza as case studies for model prediction.

Methods: Mathematical modeling, AI modeling, and agent-based modeling were combined to improve the accuracy of predictions and the robustness of the six-compartment S-E-I-TI-R-D model. Influenza and COVID-19 datasets from the NSSP Platform (01/20/2020-02/29/2024) were used. Stability was assessed using the Routh-Hurwitz criterion and eigenvalue analysis of the Jacobian matrix, while the Runge-Kutta method was used to solve the ordinary differential equations.
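The abstract does not give the SEITIRD equations, so the sketch below shows only the classic four-compartment SEIR model that it extends, integrated with a fourth-order Runge-Kutta step. The parameter values and initial conditions are hypothetical and are not taken from the study.

```python
def seir_derivatives(state, beta, sigma, gamma):
    """Right-hand side of the classic SEIR system (no births or deaths)."""
    s, e, i, r = state
    n = s + e + i + r
    new_infections = beta * s * i / n
    return (-new_infections,
            new_infections - sigma * e,
            sigma * e - gamma * i,
            gamma * i)

def rk4_step(state, dt, beta, sigma, gamma):
    """Advance the state by one fourth-order Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(x + h * k for x, k in zip(base, slope))

    k1 = seir_derivatives(state, beta, sigma, gamma)
    k2 = seir_derivatives(shifted(state, k1, dt / 2), beta, sigma, gamma)
    k3 = seir_derivatives(shifted(state, k2, dt / 2), beta, sigma, gamma)
    k4 = seir_derivatives(shifted(state, k3, dt), beta, sigma, gamma)
    return tuple(x + dt / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(initial, days, dt, beta, sigma, gamma):
    """Return the list of (S, E, I, R) states from day 0 to `days`."""
    state = initial
    trajectory = [state]
    for _ in range(int(round(days / dt))):
        state = rk4_step(state, dt, beta, sigma, gamma)
        trajectory.append(state)
    return trajectory

# Hypothetical parameters: transmission, incubation, and recovery rates per day.
trajectory = simulate(initial=(9990.0, 0.0, 10.0, 0.0), days=60, dt=0.5,
                      beta=0.5, sigma=0.2, gamma=0.25)
```

Extending this to the six compartments named above would mean adding the TI and D states and their transition rates to `seir_derivatives`; those rates are not stated in the abstract and are left out here.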

Results: The model accurately predicted Rc and Ri at 2.31 (0.61) and 1.48 (0.23), respectively, with a combined weekly ED visit percentage of 5.95 (1.38)%. Weekly ED visits for COVID-19 peaked at 115,005 per week during the pandemic, while influenza visits were lower at 42,784 per week. However, from October 2023 to February 2024, influenza visits surpassed COVID-19 visits by 59.6%. In the geographical analysis, the East Coast had higher ED visits for both diseases than the West Coast, with 12.7% of the population projected to visit the ED within 60 days after mitigation strategies were relaxed.

Conclusion: To date, this is the first study that uses an integrated approach to predict the impact of the rate of spread of combined outbreaks of infectious diseases on ED visits. The study addresses the significance of these approaches in global public health and their roles in preparedness for future challenges.

Biostatistics, economics; Epidemiology; Protection of the public in relation to communicable diseases including prevention or control; Public health or related research