229527 Prediction Model Building using Data Mining Approaches: An Example of Prostate Cancer Aggressiveness

Monday, November 8, 2010

Hui-Yi Lin, Biostatistics Department, Moffitt Cancer Center & Research Institute, Tampa, FL
Jong Park, Department of Risk Assessment, Detection & Intervention, Moffitt Cancer Center & Research Institute, Tampa, FL
A growing body of evidence shows that single nucleotide polymorphisms (SNPs) may play an important role in complex diseases. Currently, most studies focus primarily on the effects of individual SNPs. The combined effects of multiple SNPs (interactions) on clinical outcomes are not widely addressed when building prediction models. The objective of this study is to demonstrate how data mining approaches can be superior to conventional statistical methods in building a prediction model of prostate cancer aggressiveness. We used the Cancer Genetic Markers of Susceptibility (CGEMS) prostate cancer genome-wide data set, an initiative of the National Cancer Institute (NCI). The data comprise 659 aggressive prostate cancer cases and 492 non-aggressive cases, with 151 SNPs in seven estrogen receptor-related genes. Both single-locus main effects and SNP-SNP interactions associated with prostate cancer aggressiveness were evaluated. For a binary clinical outcome, the conventional approach is logistic regression with automatic variable selection. However, data mining methods, which can automatically transform covariates based on model improvement and can handle a large number of variables, are recommended for analyzing high-dimensional genetic data. Previous studies have shown that prediction accuracy improves substantially when modeling is restricted to a small number of candidate SNPs. We therefore applied a two-step approach combining two data mining methods: Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS). In the first step, RF was used to screen a small subset of SNPs for further evaluation. In the second step, MARS was used to explore SNP main effects and interactions within that subset. Finally, the proposed prediction model was evaluated in terms of calibration, discrimination, and overall performance.
The results showed that the combined RF-MARS method is an effective and efficient way to build prediction models for prostate cancer aggressiveness.
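
The three evaluation criteria named above (discrimination, overall performance, calibration) can be illustrated with a minimal sketch. The abstract does not state which specific measures were used, so the choices here are assumptions: the area under the ROC curve (AUC) for discrimination, the Brier score for overall performance, and calibration-in-the-large (mean predicted risk minus observed event rate) for calibration. The function names and the toy data are hypothetical and are not taken from the CGEMS analysis.

```python
def auc(y_true, y_prob):
    """Discrimination: chance that a random case gets a higher predicted
    risk than a random non-case (ties count as one half)."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                wins += 1.0
            elif pc == pn:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def brier_score(y_true, y_prob):
    """Overall performance: mean squared difference between the predicted
    risk and the observed 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_in_the_large(y_true, y_prob):
    """Calibration: mean predicted risk minus observed event rate
    (0 means the model is right on average)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Toy data (hypothetical): 1 = aggressive, 0 = non-aggressive.
outcomes = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [0.9, 0.8, 0.6, 0.4, 0.3, 0.35, 0.7, 0.2]

print("AUC:", auc(outcomes, predicted))
print("Brier score:", brier_score(outcomes, predicted))
print("Calibration-in-the-large:", calibration_in_the_large(outcomes, predicted))
```

In practice these measures would be computed on predictions from the fitted model for held-out subjects rather than on the data used to build it.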

Learning Areas:
Biostatistics, economics
Epidemiology

Learning Objectives:
Demonstrate how data mining approaches can be superior to the conventional statistical methods in building a prediction model of prostate cancer aggressiveness.

Keywords: Genetics, Statistics

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I conducted this study.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.