203511 Data mining methods used for identifying SNPs in complex diseases

Tuesday, November 10, 2009

Hui-Yi Lin, Biostatistics Department, Moffitt Cancer Center & Research Institute, Tampa, FL
Tung-Sung Tseng, DrPH, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA
Single nucleotide polymorphisms (SNPs) play a critical role in complex diseases, and studies show that SNP interactions are more important than single genetic factors. Several statistical methods have been proposed to analyze such high-dimensional data. However, no single method has been shown to be superior to the others. Data mining methods, which can automatically transform covariates based on model improvement and can handle a large number of variables, are commonly used to analyze this kind of genetic data. In this study, we evaluate two data mining methods for detecting SNP interactions in complex diseases. Multivariate Adaptive Regression Splines (MARS), which combines the advantages of recursive partitioning and spline fitting, can automatically categorize a three-level categorical SNP into different inheritance modes and can automatically detect interaction patterns. These features can effectively reduce the number of terms in a model. The Random Forests (RF) method is a collection of classification trees grown on bootstrap samples. RF generates variable importance measures, which take into account interactions among variables. Four hundred subjects (200 cases and 200 controls) with non-missing genotypes for 10 SNPs were generated. Four different 2-way interaction models and one 3-way interaction model were applied. We evaluated the strengths and drawbacks of these two methods using the simulated data. Results showed that MARS is more powerful than RF for detecting interactions under some conditions. In addition, we will discuss how these methods can be combined to address genetic interaction issues.
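The simulation design described above (200 cases and 200 controls, 10 SNPs, a 2-way interaction) can be illustrated with a minimal sketch. The abstract does not specify the generating models, so the minor allele frequency, the effect sizes, the dominant-by-dominant form of the interaction, and all function names below are assumptions made purely for illustration, not the authors' setup. The `dominant` and `recessive` helpers show the kind of inheritance-mode recoding of a three-level SNP that MARS is said to select automatically.

```python
import math
import random

N_SNPS = 10       # from the abstract
N_CASES = 200     # from the abstract
N_CONTROLS = 200  # from the abstract


def draw_genotype(rng, maf):
    """Minor-allele count (0, 1, or 2), assuming Hardy-Weinberg equilibrium."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def dominant(g):
    """Dominant coding: 1 if at least one minor allele is carried."""
    return 1 if g >= 1 else 0


def recessive(g):
    """Recessive coding: 1 only if two minor alleles are carried."""
    return 1 if g == 2 else 0


def disease_probability(genotypes, intercept=-2.0, beta=2.0):
    """Logistic risk driven only by a 2-way interaction of SNP 0 and SNP 1.

    The interaction form and the coefficients are illustrative assumptions.
    """
    interaction = dominant(genotypes[0]) * dominant(genotypes[1])
    return 1.0 / (1.0 + math.exp(-(intercept + beta * interaction)))


def simulate(seed=0, maf=0.3):
    """Draw subjects until 200 cases and 200 controls have been collected."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < N_CASES or len(controls) < N_CONTROLS:
        g = [draw_genotype(rng, maf) for _ in range(N_SNPS)]
        is_case = rng.random() < disease_probability(g)
        bucket = cases if is_case else controls
        limit = N_CASES if is_case else N_CONTROLS
        if len(bucket) < limit:
            bucket.append(g)
    return cases, controls


def joint_carrier_fraction(subjects, i, j):
    """Fraction of subjects carrying a minor allele at both SNP i and SNP j."""
    return sum(dominant(s[i]) * dominant(s[j]) for s in subjects) / len(subjects)


cases, controls = simulate()
print("joint carriers (SNP 0 and SNP 1), cases:   ",
      joint_carrier_fraction(cases, 0, 1))
print("joint carriers (SNP 0 and SNP 1), controls:",
      joint_carrier_fraction(controls, 0, 1))
```

Under these assumed settings, joint carriers of both minor alleles should be noticeably more common among cases than among controls, which is the kind of signal an interaction-detection method such as MARS or RF would need to recover from the 10 candidate SNPs.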

Learning Objectives:
The objective is to evaluate two data mining methods (Multivariate Adaptive Regression Splines and Random Forests) for detecting gene-gene interactions.

Keywords: Statistics, Genetics

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I am a statistician. My research interest is in statistical methodology in gene-gene interactions.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.

See more of: Statistics Section Poster Session