Online Program

Mortality Rates for Areas Smaller than US Counties: Predictions based on Machine Learning

Tuesday, November 3, 2015

Anna Belova, Environment and Resources Division, Abt Associates Inc., Bethesda, MD
Jacqueline Haskell, Environment and Resources Division, Abt Associates Inc., Bethesda, MD
Mark Corrales, MPP, Office of the Administrator, Office of Policy, US Environmental Protection Agency, Washington, DC
Background: CDC provides nationwide death statistics at the county level. Regression models of CDC data and data collected by other agencies can identify basic patterns in mortality rates, while machine learning methods can reveal more complex, non-linear patterns. A model based on these methods and data can predict mortality rates for populations or areas for which CDC does not report mortality statistics.

Methods: We used publicly available county-level data with national coverage from CDC, US Census Bureau, USEPA, USDA, USDOT, FBI, USDHHS, and CMS to build a predictive model of 2010 mortality rates, stratified by age and sex. The dataset included 109,514 observations and 214 predictors. We assumed a Poisson distribution for death counts and used Gradient Boosted Regression Trees (GBRT, a machine learning method) for fitting. The model was evaluated against traditional regression models for count data. It was validated on two types of data not used for estimation: race/ethnicity-specific death statistics and ZIP-code-level death rates for two urban areas.
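
The fitting approach described in Methods can be illustrated with a toy sketch. This is not the authors' implementation: it uses depth-1 trees (stumps), made-up data for four hypothetical areas, and hypothetical function names (`fit_stump`, `fit_poisson_gbrt`, `predict_rate`). It also assumes a log link with population as the exposure term, which the abstract does not specify.

```python
import math

def fit_stump(X, residuals):
    """Pick the (feature, threshold) split that minimises squared error on residuals."""
    best = None  # (sse, feature index, threshold)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - mean_l) ** 2 for r in left)
                   + sum((r - mean_r) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1], best[2]

def fit_poisson_gbrt(X, deaths, population, n_rounds=50, learning_rate=0.3):
    """Boost stumps on the Poisson log-likelihood; log(expected deaths) = log(population) + F(x)."""
    f0 = math.log(sum(deaths) / sum(population))  # start from the pooled log rate
    F = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        mu = [pop * math.exp(f) for pop, f in zip(population, F)]
        grad = [d - m for d, m in zip(deaths, mu)]  # negative gradient of the Poisson loss
        j, t = fit_stump(X, grad)
        leaf = {}
        for is_left in (True, False):
            idx = [i for i, row in enumerate(X) if (row[j] <= t) == is_left]
            # Damped Newton step for the leaf: sum(y - mu) / sum(mu)
            leaf[is_left] = (learning_rate * sum(grad[i] for i in idx)
                             / sum(mu[i] for i in idx))
        stumps.append((j, t, leaf[True], leaf[False]))
        F = [f + leaf[row[j] <= t] for f, row in zip(F, X)]
    return f0, stumps

def predict_rate(model, row):
    """Predicted deaths per person for one feature vector."""
    f0, stumps = model
    return math.exp(f0 + sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))

# Made-up data: features are [older age group (0/1), high traffic exposure (0/1)].
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
population = [10000, 20000, 10000, 20000]
deaths = [10, 30, 100, 300]

model = fit_poisson_gbrt(X, deaths, population, n_rounds=200)
print(predict_rate(model, [1, 1]))  # predicted rate for an older, high-traffic population
```

Because the fitted model maps predictors to a rate rather than to a county, the same `predict_rate` call can in principle be applied to any area or subpopulation for which the predictors are available, which is the idea behind the sub-county predictions described in this abstract.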

Results: The GBRT-based model significantly outperformed traditional regression models in terms of fit and validation performance. The most important predictors of mortality were traffic volume and proximity, age, and state of residence; others with some influence were cost of care, tobacco use, rurality, air pollution, social support, nativity, commute, morbidity, and income.
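
The abstract does not state which fit metric was used for the comparison. As one illustration, mean Poisson deviance is a common choice for comparing count models; the sketch below assumes that metric, and the function name `mean_poisson_deviance` and all numbers are hypothetical.

```python
import math

def mean_poisson_deviance(observed, predicted):
    """Average of 2 * [y * log(y / mu) - (y - mu)] over observations; lower is better."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        if mu <= 0:
            raise ValueError("predicted counts must be positive")
        term = mu - y
        if y > 0:  # the y * log(y / mu) term is taken as 0 when y == 0
            term += y * math.log(y / mu)
        total += 2.0 * term
    return total / len(observed)

# Made-up death counts and two sets of made-up predictions.
observed = [10, 30, 100, 300]
population = [10000, 20000, 10000, 20000]

pooled_rate = sum(observed) / sum(population)
baseline_pred = [pooled_rate * p for p in population]  # one rate for all areas
candidate_pred = [12.0, 28.0, 95.0, 310.0]             # hypothetical model output

print(mean_poisson_deviance(observed, baseline_pred))
print(mean_poisson_deviance(observed, candidate_pred))
```

A model that tracks the observed counts more closely yields a smaller deviance, and a perfect prediction yields zero.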

Conclusions: Public health data can be successfully mined for patterns using machine learning methods. Models built with these methods can be applied to predict mortality rates at high spatial resolution or for special populations.

Learning Areas:

Biostatistics, economics
Public health or related research

Learning Objectives:
Evaluate advantages of using machine learning methods, versus traditional regression models, in analyzing public health statistics
Identify the most influential predictors of county-level patterns in the US mortality rates based on a large collection of publicly available data
Demonstrate the feasibility of predicting US mortality rates with high spatial resolution and/or for special populations

Keyword(s): Mortality, Statistics

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I have designed and implemented the analyses that supported the content.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.