159156 Reducing the resource burden of narrative text classifications for large administrative databases

Wednesday, November 7, 2007: 3:10 PM

Helen M. Wellman, MS, Quantitative Analysis Unit, Liberty Mutual Research Institute for Safety, Hopkinton, MA
Helen L. Corns, MS, Quantitative Analysis Unit, Liberty Mutual Research Institute for Safety, Hopkinton, MA
Mark R. Lehto, School of Industrial Engineering, Purdue University, West Lafayette, IN

Objectives: Improve the sensitivity and specificity of computer-assigned codes for narrative text from large administrative databases by using a filtering routine.

Methods: Narrative text contained in large administrative databases is underutilized because manual classification is resource-intensive. A previously developed method classified Workers' Compensation claims narratives using a machine-learning tool based on Fuzzy Bayes logic (Wellman et al., 2004), with overall sensitivity and specificity of 0.71 and 0.97, respectively. Narratives were classified into one of 40 possible Bureau of Labor Statistics two-digit Occupational Injury and Illness Classification System event codes. The tool output a confidence value (or probability strength) for each computer-assigned classification. To identify the narratives most likely to be misclassified, optimum threshold filter values were selected from Receiver Operating Characteristic (ROC) curves at the point where the improvement in sensitivity (number of hits) outweighed the decrement in specificity (increased number of false alarms). A filtering routine was then applied to extract narratives with strengths below threshold, which would benefit the most from manual review.
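The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Prediction` record, the function names, the example event labels other than 'bodily reaction', the threshold values, and the idea of one threshold per predicted code with a default fallback are all hypothetical. In the study, thresholds were chosen from ROC curves; here they are simply supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    narrative_id: int
    true_code: str       # manually assigned (reference) event code
    predicted_code: str  # computer-assigned event code
    strength: float      # confidence value output by the classifier


def sensitivity_specificity(preds, code):
    """One-vs-rest sensitivity and specificity for a single event code."""
    tp = sum(p.true_code == code and p.predicted_code == code for p in preds)
    fn = sum(p.true_code == code and p.predicted_code != code for p in preds)
    fp = sum(p.true_code != code and p.predicted_code == code for p in preds)
    tn = sum(p.true_code != code and p.predicted_code != code for p in preds)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def split_by_threshold(preds, thresholds, default=0.5):
    """Split predictions into (auto_accepted, needs_review).

    A prediction is sent to manual review when its strength falls below
    the threshold for its predicted code (hypothetical per-code cutoffs).
    """
    auto, review = [], []
    for p in preds:
        cutoff = thresholds.get(p.predicted_code, default)
        (review if p.strength < cutoff else auto).append(p)
    return auto, review


# Toy data: labels and strengths are invented for illustration only.
preds = [
    Prediction(1, "fall", "fall", 0.95),
    Prediction(2, "fall", "overexertion", 0.40),
    Prediction(3, "overexertion", "overexertion", 0.90),
    Prediction(4, "bodily reaction", "overexertion", 0.35),
    Prediction(5, "bodily reaction", "bodily reaction", 0.80),
]
thresholds = {"overexertion": 0.6}  # assumed value, not taken from the study

auto, review = split_by_threshold(preds, thresholds)
review_ids = [p.narrative_id for p in review]
fall_sens, fall_spec = sensitivity_specificity(preds, "fall")
```

In this toy data the two low-strength 'overexertion' assignments are the ones routed to manual review, and both are in fact misclassified, which is the behavior the filter is meant to exploit.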

Results: The overall sensitivity increased from 0.71 to 0.81, with the largest improvement in the 'bodily reaction' category (23% increase). All categories with more than 10 narratives improved by at least 4%. This improvement required manual review of approximately 15% (n=3000) of the narratives.

Conclusion: The results suggest that using a filtering routine to assist machine-assigned classification can improve accuracy substantially, and that the computer can classify many narratives with high confidence, freeing up valuable resources for manual coding.

Learning Objectives:
Understand methods to incorporate machine coding of large administrative database narratives

Keywords: Epidemiology, Methodology

Presenting author's disclosure statement:

Any relevant financial relationships? No
Any institutionally-contracted trials related to this submission?

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.