175423 Impact of including low probability matches on probabilistic linkage results: A comparison of high probability vs. imputed match sets

Monday, October 27, 2008

Larry Cook, MStat, Intermountain Injury Control Center, University of Utah, Salt Lake City, UT
Stacey Knight, MStat, Biomedical Informatics, University of Utah, Salt Lake City, UT
Lenora Olson, MA, PhD, Intermountain Injury Control Center, University of Utah, Salt Lake City, UT

Introduction: Probabilistic linkage has been used to combine databases for numerous purposes in injury control. The usefulness of probabilistic linkage depends on the identifiers used to conduct the linkages. Objective: To determine and quantify the biases introduced by relying on the ‘best’ matches, and to explore the usefulness of imputation for incorporating lower-weighted pairs. Methods: Synthetic databases of 100,000 records each were created. Each database contained 10,000 records that matched uniquely to records in the other database and 90,000 records that had no corresponding record. The databases were linked multiple times, using a different set of identifiers for each linkage. For each linkage, a set of best matches (those with a probability > 0.9) and five imputed sets (incorporating matches with probabilities lower than 0.9) were created. Results: High probability matches always had good specificity (range: 0.913 – 0.999). However, when matching information was sparse, the sensitivity for high probability matches was as low as 0.226. When sensitivity was low, the high probability matches were biased on several linkage variables, typically making rare values appear more common. Using imputation to include lower-weighted pairs decreased this bias. Additionally, the sensitivity for the worst case increased to 0.616. Unfortunately, specificity declined significantly with imputation, ranging from 0.536 – 0.991. Implications: Performing probabilistic linkage without adequate identifiers can lead to biased results. Incorporating imputation to include lower-weighted pairs can reduce this bias. The loss in specificity in the imputed sets, however, limits their usefulness for specific case studies.
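
The comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state how the imputed sets were drawn or how sensitivity and specificity were computed, so the sketch assumes that pairs with a probability above 0.9 are always kept, that each lower-probability pair is sampled with a chance equal to its match probability, and that both measures are computed over candidate pairs. The function names and the six-pair toy data are hypothetical.

```python
import random


def sensitivity_specificity(selected, is_true_match):
    """Sensitivity and specificity of a selected match set over candidate pairs."""
    pairs = list(zip(selected, is_true_match))
    tp = sum(1 for s, t in pairs if s and t)          # true matches kept
    fn = sum(1 for s, t in pairs if not s and t)      # true matches dropped
    tn = sum(1 for s, t in pairs if not s and not t)  # non-matches dropped
    fp = sum(1 for s, t in pairs if s and not t)      # non-matches kept
    return tp / (tp + fn), tn / (tn + fp)


def high_probability_set(probs, threshold=0.9):
    """Keep only pairs whose match probability exceeds the threshold."""
    return [p > threshold for p in probs]


def imputed_set(probs, rng, threshold=0.9):
    """Assumed imputation rule: keep high probability pairs, and sample each
    lower-probability pair with a chance equal to its match probability."""
    return [p > threshold or rng.random() < p for p in probs]


# Hypothetical toy data: match probabilities and true match status per pair.
probs = [0.99, 0.95, 0.60, 0.40, 0.20, 0.05]
truth = [True, True, True, False, True, False]

high_sens, high_spec = sensitivity_specificity(high_probability_set(probs), truth)
print(f"high probability set: sensitivity={high_sens:.3f}, specificity={high_spec:.3f}")

# Five imputed sets, mirroring the number used in the abstract.
imputed_results = []
for seed in range(5):
    rng = random.Random(seed)
    sens, spec = sensitivity_specificity(imputed_set(probs, rng), truth)
    imputed_results.append((sens, spec))
    print(f"imputed set {seed + 1}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Under this assumed rule every imputed set contains the high probability set, so its sensitivity can only stay the same or rise and its specificity can only stay the same or fall. That is the same direction of trade-off reported in the Results.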

Learning Objectives:
1. Understand the difference between high and low probability matches.
2. Describe potential pitfalls of limiting analyses to only high probability matches.

Keywords: Methodology, Data/Surveillance

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I have been involved in the study concept and analysis of the data.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.