142nd APHA Annual Meeting and Exposition


A comparison of the “Fuzzy Dwyer” matching algorithm and a probabilistic matching algorithm for data linkage

142nd APHA Annual Meeting and Exposition (November 15 - November 19, 2014): http://www.apha.org/events-and-meetings/annual
Monday, November 17, 2014

Michael LeVasseur, MPH, PhD(c), Department of Environmental and Occupational Health, Drexel University School of Public Health, Philadelphia, PA
Frank Dwyer, Florida Division of Workers Compensation, Tallahassee, FL
Jennifer Taylor, PhD, MPH, Environmental and Occupational Health, Drexel University, Philadelphia, PA
Background: The rise of big data has led to new opportunities for public health research, but as with any new opportunity, new challenges have arisen. Identifying methods for linking data sources together has been a continual challenge. As part of a larger project exploring the feasibility of using state-level data to create a national surveillance system for firefighter injury, data from several state-level departments in Florida were linked together using two different matching algorithms.

Objective: To explore methods of data linkage using data from the Florida State Fire College registry and data from the Florida State Bureau of Workers Compensation from 2005 to 2012.

Methods: A roster of active firefighters was obtained from the Florida State Fire College Department of Insurance and Continuing Education (FC-DICE). Roster records were linked to claims from the Florida State Bureau of Workers Compensation. The Fuzzy Dwyer matching algorithm was employed, which made three assumptions about the data: 1) SSNs contain data errors, 2) first and last names contain inconsistencies, and 3) first and last name fields may be transposed. Each SSN from FC-DICE was expanded into every possible single-point and consecutive two-point mutation, and these variants were matched to SSNs from the Workers Compensation data. Candidate matches were confirmed by first and last name using the Jaro-Winkler distance measure with a threshold of 0.80. This method was compared to a probabilistic matching algorithm that first matches directly on SSN and then on first and last name using weighted probabilities.
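
The fuzzy linkage steps described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation, and it rests on several assumptions: a "mutation" is taken to mean substituting a digit, a "consecutive two-point mutation" is taken to mean substituting two adjacent digits, both name fields must reach the 0.80 threshold in either the straight or the transposed orientation, and records are simple (ssn, first, last) tuples. The function names ssn_variants, jaro_winkler, names_agree, and link are hypothetical.

```python
from itertools import product

DIGITS = "0123456789"

def ssn_variants(ssn):
    """SSN plus every single-digit and adjacent two-digit substitution (assumed meaning of 'mutation')."""
    variants = {ssn}
    for i in range(len(ssn)):
        for d in DIGITS:
            variants.add(ssn[:i] + d + ssn[i + 1:])
    for i in range(len(ssn) - 1):
        for a, b in product(DIGITS, repeat=2):
            variants.add(ssn[:i] + a + b + ssn[i + 2:])
    return variants

def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def names_agree(first_a, last_a, first_b, last_b, threshold=0.80):
    """Assumed rule: both names pass the threshold, straight or transposed."""
    straight = min(jaro_winkler(first_a, first_b), jaro_winkler(last_a, last_b))
    swapped = min(jaro_winkler(first_a, last_b), jaro_winkler(last_a, first_b))
    return max(straight, swapped) >= threshold

def link(roster, claims):
    """Pair roster and claim records whose SSNs are within the mutation set and whose names agree."""
    claims_by_ssn = {}
    for claim in claims:
        claims_by_ssn.setdefault(claim[0], []).append(claim)
    pairs = []
    for ssn, first, last in roster:
        for variant in ssn_variants(ssn):
            for claim in claims_by_ssn.get(variant, []):
                if names_agree(first.upper(), last.upper(), claim[1].upper(), claim[2].upper()):
                    pairs.append(((ssn, first, last), claim))
    return pairs

# Made-up illustrative records, not study data.
roster = [("123456789", "John", "Smith")]
claims = [("123456798", "Smith", "Jon"), ("987654321", "John", "Smith")]
matches = link(roster, claims)
```

In this toy example the first claim links despite two adjacent SSN digits being off and the name fields being transposed, while the second claim, with a perfect name match but an unrelated SSN, does not. Indexing the claims by SSN and looking up each variant keeps the search to a fixed number of dictionary lookups per roster record, rather than comparing every roster record against every claim.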

Results: Compared to the probabilistic matching algorithm, the Fuzzy Dwyer method matched more records and had higher fidelity.

Conclusion: As public health researchers rely more on big data for research, the development of reliable, high-fidelity data linkage algorithms has become increasingly important. Fuzzy matching algorithms are less labor-intensive, have a higher hit rate, and provide higher-fidelity matches than probabilistic methods.

Learning Areas:

Biostatistics, economics
Occupational health and safety
Other professions or practice related to public health

Learning Objectives:
Discuss key problems with linking data between data sources. Describe methods for data linkage. Conduct data linkage using fuzzy and probabilistic methods.

Keyword(s): Data Collection and Surveillance, Methodology

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I have spent several years working as a data manager for a number of institutions, including Weill Cornell Medical College, Columbia University, and the New York City Department of Health and Mental Hygiene. I am finishing my PhD in epidemiology at Drexel University and have been focusing on data linkage methods.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.