309739
A comparison of the “Fuzzy Dwyer” matching algorithm and a probabilistic matching algorithm for data linkage
Objective: To explore methods of data linkage using data from the Florida State Fire College registry and data from the Florida State Bureau of Workers Compensation for 2005 to 2012.
Methods: A roster of active firefighters was obtained from the Florida State Fire College Department of Insurance and Continuing Education (FC-DICE) and linked to claims from the Florida State Bureau of Workers Compensation. The Fuzzy Dwyer matching algorithm was employed, which makes three assumptions about the data: 1) Social Security numbers (SSNs) contain data errors, 2) first and last names contain inconsistencies, and 3) first and last name fields may be transposed. Each FC-DICE SSN was expanded into every possible single-point and consecutive two-point mutation, and these variants were matched to SSNs in the Workers Compensation data. Matches were confirmed by first and last name using the Jaro-Winkler distance measure with a threshold of 0.80. This method was compared to a probabilistic matching algorithm that first matches directly on SSN and then on first and last name using weighted probabilities.
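The fuzzy linkage steps described in the Methods can be sketched in Python. This is an illustrative reading of the abstract, not the authors' implementation. It makes several assumptions: a "mutation" is taken to mean digit substitution; "consecutive two-point" is taken to mean two adjacent digits both changed; the 0.80 threshold is applied to Jaro-Winkler similarity; both name fields must clear the threshold; and exact SSN matches are also kept as candidates. The record layout (dicts with "ssn", "first", "last") and all function names are hypothetical.

```python
from itertools import product

DIGITS = "0123456789"

def ssn_mutations(ssn):
    """All SSNs differing from `ssn` in one digit, or in two adjacent digits."""
    out = set()
    for i in range(len(ssn)):                      # single-point substitutions
        for d in DIGITS:
            if d != ssn[i]:
                out.add(ssn[:i] + d + ssn[i + 1:])
    for i in range(len(ssn) - 1):                  # adjacent two-point substitutions
        for a, b in product(DIGITS, repeat=2):
            if a != ssn[i] and b != ssn[i + 1]:
                out.add(ssn[:i] + a + b + ssn[i + 2:])
    return out

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = out_of_order = 0
    for i in range(len(s1)):                       # count matched chars out of order
        if hit1[i]:
            while not hit2[k]:
                k += 1
            out_of_order += s1[i] != s2[k]
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def names_agree(first_a, last_a, first_b, last_b, threshold=0.80):
    """Confirm names, also trying transposed fields (assumption: both must pass)."""
    fa, la, fb, lb = (s.strip().upper() for s in (first_a, last_a, first_b, last_b))
    straight = min(jaro_winkler(fa, fb), jaro_winkler(la, lb))
    swapped = min(jaro_winkler(fa, lb), jaro_winkler(la, fb))
    return max(straight, swapped) >= threshold

def link(roster, claims):
    """Pair roster records with claims via exact or mutated SSN plus name check."""
    by_ssn = {}
    for c in claims:
        by_ssn.setdefault(c["ssn"], []).append(c)
    pairs = []
    for r in roster:
        for ssn in {r["ssn"]} | ssn_mutations(r["ssn"]):
            for c in by_ssn.get(ssn, []):
                if names_agree(r["first"], r["last"], c["first"], c["last"]):
                    pairs.append((r, c))
    return pairs
```

Under these assumptions, each nine-digit SSN expands to 729 variants, so indexing the claims by SSN and looking up each variant avoids comparing every roster record against every claim.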
Results: Compared to the probabilistic matching algorithm, the Fuzzy Dwyer method matched more records and had higher fidelity.
Conclusion: As public health researchers rely more on big data for research, the development of reliable, high-fidelity data linkage algorithms has become increasingly important. Fuzzy matching algorithms are less labor intensive, have a higher hit rate, and provide higher fidelity matches than probabilistic methods.
Learning Areas:
Biostatistics, economics
Epidemiology
Occupational health and safety
Other professions or practice related to public health
Learning Objectives:
Discuss key problems with linking data between data sources.
Describe methods for data linkage.
Conduct data linkage using fuzzy and probabilistic methods.
Keyword(s): Data Collection and Surveillance, Methodology
Qualified on the content I am responsible for because: I have spent several years working as a data manager for a number of institutions, including Weill Cornell Medical College, Columbia University, and the New York City Department of Health and Mental Hygiene. I am finishing my PhD in epidemiology at Drexel University and have been focusing on data linkage methods.
Any relevant financial relationships? No
I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.