202675 A Multi-layered Approach to Race Imputation using Medical Claims, Name Analyses, Inferred-Familial Connections and a Geographic Information System

Monday, November 9, 2009: 12:45 PM

Stephen Jones, MS, Medical Informatics - Accreditation Analytics, BlueCross BlueShield of Tennessee, Chattanooga, TN
Patty Howard, RN, BSN, Clinical Improvement, BlueCross BlueShield of Tennessee, Chattanooga, TN
Soyal Momin, MS, MBA, Medical Informatics, BlueCross BlueShield of Tennessee, Chattanooga, TN
Objective. To develop an algorithm that estimates race for members whose race is unknown, using a combination of medical claims, name analyses, inferred-familial connections and geographic imputation from census data.

Data Sources. An internally maintained employee database (n=5,439) and Medicaid members with a known single race enrolled as of September 2008 (n=348,994) in a large southeastern managed care organization.

Study Design. A convenience sample approximating an 85/15 split of Medicaid members enrolled as of September 2008 was created for modeling purposes: 300,000 members were used for development, and a hold-out (test) sample (n=48,994) was created through simple random sampling. The 4-level imputation algorithm was created using 1) race-name associations from the Medicaid development dataset, the U.S. Census Bureau and web resources; 2) geo-imputation based on census block population information; 3) race-biased medical claims diagnoses (e.g., Sickle-Cell Disease for African-Americans); and 4) an inferred-familial connection applied after each imputation stage, in which members not yet assigned a race were assigned the race of a member in the same household if one was known. To assess accuracy, imputed race was compared to a pooled response (n=54,433) of members' known race from the test dataset and the internally maintained employee database. We tested for age and gender effects on false-positive race predictions using a backwards-elimination algorithm within a logistic regression model.
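
The layered design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `impute_race`, the member fields (`id`, `household`, `surname`, `block`, `dx`), the lookup tables and the placeholder labels `group_a`, `group_b`, `group_c` are all hypothetical. The sketch also assumes that the layers run in the order listed (name, geography, claims) and that the household step takes the first assigned race found in a household; the abstract does not specify either detail.

```python
from typing import Any, Callable, Dict, List, Optional

Member = Dict[str, Any]
Layer = Callable[[Member], Optional[str]]


def impute_race(members: List[Member], layers: List[Layer]) -> Dict[str, str]:
    """Apply each layer in turn, then fill gaps from household members."""
    assigned: Dict[str, str] = {}
    for layer in layers:
        # Primary imputation for members still lacking a race.
        for m in members:
            if m["id"] not in assigned:
                guess = layer(m)
                if guess is not None:
                    assigned[m["id"]] = guess
        # Inferred-familial step, run after every stage (assumption:
        # the first assigned race seen in a household is the one reused).
        household_race: Dict[str, str] = {}
        for m in members:
            if m["id"] in assigned:
                household_race.setdefault(m["household"], assigned[m["id"]])
        for m in members:
            if m["id"] not in assigned and m["household"] in household_race:
                assigned[m["id"]] = household_race[m["household"]]
    return assigned


# Hypothetical lookup tables with placeholder keys and labels.
NAME_TABLE = {"ALPHA": "group_a"}
BLOCK_TABLE = {"b1": "group_b"}
DX_TABLE = {"dx_code_1": "group_c"}


def by_name(m: Member) -> Optional[str]:
    return NAME_TABLE.get(m["surname"])


def by_block(m: Member) -> Optional[str]:
    return BLOCK_TABLE.get(m["block"])


def by_claims(m: Member) -> Optional[str]:
    # Return the label of the first diagnosis code found in the table.
    return next((DX_TABLE[d] for d in m["dx"] if d in DX_TABLE), None)


members = [
    {"id": "m1", "household": "h1", "surname": "ALPHA", "block": "b9", "dx": []},
    {"id": "m2", "household": "h1", "surname": "OTHER", "block": "b9", "dx": []},
    {"id": "m3", "household": "h2", "surname": "OTHER", "block": "b1", "dx": []},
    {"id": "m4", "household": "h3", "surname": "OTHER", "block": "b9", "dx": ["dx_code_1"]},
    {"id": "m5", "household": "h4", "surname": "OTHER", "block": "b9", "dx": []},
]
result = impute_race(members, [by_name, by_block, by_claims])
```

In this toy run, `m2` receives a race only through the household step, and `m5` stays unassigned because no layer matches. This mirrors the abstract's observation that some members remain without an imputed race.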

Principal Findings. We assigned race to 51,184 (94.0%) of the 54,433 members in the pooled dataset, with an overall accuracy of 85.8%. Accuracy per race was: Asian=43.4%, African-American=70.0%, Hispanic=83.9%, Native American=7.6%, White=90.4%. The medical claims diagnoses and inferred-familial connection methods assigned race to approximately 1% and 5% of members, respectively, who would otherwise have remained unassigned. Age (P=0.001; OR 1.002, 95% CI 1.001-1.003) and gender (P=0.047; OR 0.955, 95% CI 0.913-0.999) were significant in the model: race was marginally more likely to be predicted incorrectly for older members and for male members.
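
The assignment rate reported above follows directly from the stated sample sizes, and the accuracy figures can be computed from pairs of known and imputed race. The sketch below is illustrative only: the function name `summarize` and the toy labels are hypothetical. It assumes that accuracy is measured among assigned members only and that per-race accuracy is grouped by known race; the abstract does not state which definition was used.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def summarize(
    pairs: List[Tuple[str, Optional[str]]],
) -> Tuple[float, float, Dict[str, float]]:
    """Return (assignment rate, overall accuracy, per-race accuracy).

    Each pair is (known race, imputed race or None if unassigned).
    """
    assigned = [(known, pred) for known, pred in pairs if pred is not None]
    if not pairs or not assigned:
        raise ValueError("need at least one assigned member")
    coverage = len(assigned) / len(pairs)
    overall = sum(known == pred for known, pred in assigned) / len(assigned)
    totals = Counter(known for known, _ in assigned)
    hits = Counter(known for known, pred in assigned if known == pred)
    per_race = {race: hits[race] / totals[race] for race in totals}
    return coverage, overall, per_race


# Figures taken from the abstract: test sample plus employee database.
pooled_n = 48_994 + 5_439
assignment_rate = 51_184 / pooled_n

# Toy data with placeholder labels, for illustration only.
toy = [("a", "a"), ("a", "b"), ("b", "b"), ("b", None)]
coverage, overall, per_race = summarize(toy)
```

Here `pooled_n` equals 54,433 and `assignment_rate` rounds to 94.0%, matching the reported values.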

Conclusions. We successfully developed a multi-layered approach to race imputation. Healthcare claims data and familial inference were valuable additions to traditional methods that rely only on surname analysis and geo-imputation. Race information is commonly unavailable to commercial health insurance carriers because collecting these data may imply misuse in premium calculations and raise other concerns about discrimination. However, racial disparities exist in the quality and quantity of health care received by minority groups. Imputing race can help plans address racial disparities, identify specific risk factors associated with race and mitigate them through proactive care management strategies.

Learning Objectives:
Describe the development of a reliable algorithm to estimate race for members whose race is unknown, using a combination of medical claims, name analyses, inferred-familial connections and geographic imputation from census data.

Keywords: Ethnicity, Managed Care

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: Researcher in related field since 1996, multiple submissions and presentations of projects to conferences, received Lundy award for excellence in related work in 2008
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.