275055 “All models are wrong” is wrong: Estimates of associations in a model-free world

Monday, October 29, 2012: 2:30 PM - 2:50 PM

Alan Hubbard, PhD, School of Public Health, University of California, Berkeley, Berkeley, CA
The logistic regression model has become a panacea for estimating associations in the context of high-dimensional (i.e., many) confounders. It is commonly known that associations estimated in a misspecified model are biased, and one would be hard pressed to argue that there is any theoretical justification for assuming, for instance, a simple additive logistic regression model (one that includes only main effects and no interactions or more complicated functional forms, e.g., polynomials). Thus, the implicit assumption behind using such models has been articulated as follows (Box, 1979): “all models are wrong, but some are useful”. However, recent advances in model selection, and in defining estimates that do not rely on a parametric model, are challenging this philosophy. Typically, the real statistical model, as defined by the constraints one really knows about the mechanism generating the data, is a so-called semi-parametric model (that is, there are nearly no constraints one can impose from previous knowledge). Fortunately, there is theory and available software for estimating a regression in the absence of a known model. In addition, research in causal inference provides interesting and relevant parameters that do not rely on specifying a parametric model and that typically have a more meaningful public health interpretation than, say, an adjusted odds ratio. These parameters include the average treatment (or exposure) effect and so-called causal additive interaction. In this talk, we discuss these developments, which utilize so-called ensemble learning (SuperLearner) and define parameters in a non-parametric structural equation model (NPSEM). The result is a methodology that removes the “art” from model selection and provides parameter estimates targeted to the scientific/public health question. We illustrate the methodology in a data analysis estimating the impacts of alcohol outlet density on alcohol-related harm.
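As a rough illustration of a parameter that is defined without a parametric model, the sketch below computes a plug-in (G-computation) estimate of the average treatment effect, averaging E[Y | A=1, W] - E[Y | A=0, W] over the observed confounder distribution. It is not the analysis from the talk: the function name `g_computation_ate` and the ten records are hypothetical, and simple stratum means stand in for the data-adaptive regression that an ensemble learner such as SuperLearner would supply when W is high-dimensional.

```python
from collections import defaultdict


def g_computation_ate(records):
    """Plug-in (G-computation) estimate of the average treatment effect.

    records: iterable of (w, a, y), where w is a hashable confounder
    stratum, a is the exposure (0 or 1), and y is a numeric outcome.
    """
    records = list(records)

    # Nonparametric outcome regression: empirical mean of Y in each (a, w) cell.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for w, a, y in records:
        sums[(a, w)] += y
        counts[(a, w)] += 1

    def q(a, w):
        # Estimated E[Y | A=a, W=w]; undefined if the cell is empty.
        if counts[(a, w)] == 0:
            raise ValueError(f"no observations with a={a}, w={w!r}")
        return sums[(a, w)] / counts[(a, w)]

    # Average the predicted exposed-vs-unexposed difference over observed W.
    return sum(q(1, w) - q(0, w) for w, _, _ in records) / len(records)


# Hypothetical toy data: (confounder stratum, exposure, outcome).
data = [
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
]

ate = g_computation_ate(data)
print(round(ate, 4))
```

On these toy records the adjusted estimate is 0.35, whereas the crude difference in outcome means between exposed and unexposed is 0.25. The gap arises because the exposure is distributed differently across the two confounder strata.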

Learning Areas:
Biostatistics, economics
Epidemiology
Systems thinking models (conceptual and theoretical models), applications related to public health

Learning Objectives:
Define the concepts of parameters generated from an NPSEM.
Analyze practical implementation of model selection using automated routines.
Discuss estimating “causal” parameters resulting from these machine learning techniques.

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I have been involved both as a co-PI or PI on NIH-funded research and as a professor within a Division of Biostatistics on methods for controlling confounding in high-dimensional data, including over 100 published papers on both the development and implementation of methodology related to such applications.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.