A comparison of the “Fuzzy Dwyer” matching algorithm and a probabilistic matching algorithm for data linkage
Objective: To explore methods of data linkage using data from the Florida State Fire College registry and data from the Florida State Bureau of Workers Compensation from 2005-2012.
Methods: A roster of active firefighters was obtained from the Florida State Fire College Department of Insurance and Continuing Education (FC-DICE) and linked to claims from the Florida State Bureau of Workers Compensation. The Fuzzy Dwyer matching algorithm was employed, which makes three assumptions about the data: 1) SSNs contain data errors, 2) first and last names contain inconsistencies, and 3) first and last name fields may be transposed. Each FC-DICE SSN was expanded into every possible single-point and consecutive two-point mutation, and the resulting variants were matched to SSNs from the Workers Compensation data. Candidate matches were then confirmed on first and last name using the Jaro-Winkler distance measure with a threshold of 0.80. This method was compared to a probabilistic matching algorithm that first matches directly on SSN and then on first and last name using weighted probabilities.
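The SSN mutation and name confirmation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a "point mutation" is read as a digit substitution, the 0.80 threshold is applied to Jaro-Winkler similarity (higher means more alike), both first and last name must clear the threshold, and the record layout and the function names (`ssn_variants`, `names_agree`, `fuzzy_link`) are hypothetical.

```python
from itertools import product

DIGITS = "0123456789"

def ssn_variants(ssn):
    """Return the SSN plus every single-digit and consecutive two-digit substitution."""
    variants = {ssn}
    for i in range(len(ssn)):
        for d in DIGITS:
            variants.add(ssn[:i] + d + ssn[i + 1:])
    for i in range(len(ssn) - 1):
        for d1, d2 in product(DIGITS, repeat=2):
            variants.add(ssn[:i] + d1 + d2 + ssn[i + 2:])
    return variants

def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def names_agree(first_a, last_a, first_b, last_b, threshold=0.80):
    """Confirm names, also trying the transposed first/last order.

    Assumption: both name fields must reach the threshold.
    """
    fa, la, fb, lb = (s.strip().upper() for s in (first_a, last_a, first_b, last_b))
    straight = min(jaro_winkler(fa, fb), jaro_winkler(la, lb))
    swapped = min(jaro_winkler(fa, lb), jaro_winkler(la, fb))
    return max(straight, swapped) >= threshold

def fuzzy_link(roster, claims, threshold=0.80):
    """Link roster records to claims via SSN variants, then confirm by name."""
    claims_by_ssn = {}
    for claim in claims:
        claims_by_ssn.setdefault(claim["ssn"], []).append(claim)
    links = []
    for person in roster:
        for variant in sorted(ssn_variants(person["ssn"])):
            for claim in claims_by_ssn.get(variant, []):
                if names_agree(person["first"], person["last"],
                               claim["first"], claim["last"], threshold):
                    links.append((person, claim))
    return links
```

Under these assumptions a nine-digit SSN expands to 730 candidate values (the original, 81 single-digit substitutions, and 648 consecutive two-digit substitutions), so indexing the claims by SSN and looking up each variant avoids comparing every roster record against every claim.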
Results: Compared to the probabilistic matching algorithm, the Fuzzy Dwyer method matched more records and had higher fidelity.
Conclusion: As public health researchers rely more on big data for research, the development of reliable, high-fidelity data linkage algorithms has become increasingly important. Fuzzy matching algorithms are less labor intensive, have a higher hit rate, and provide higher fidelity matches than probabilistic methods.
Learning Areas: Biostatistics, economics
Occupational health and safety
Other professions or practice related to public health
Learning Objectives: Discuss key problems with linking records across data sources. Describe methods for data linkage. Conduct data linkage using fuzzy and probabilistic methods.
Keyword(s): Data Collection and Surveillance, Methodology
Qualified on the content I am responsible for because: I have spent several years working as a data manager for a number of institutions, including Weill Cornell Medical College, Columbia University, and the New York City Department of Health and Mental Hygiene. I am finishing my PhD in epidemiology at Drexel University, where I have been focusing on data linkage methods.
Any relevant financial relationships? No
I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.