Abstract
Methods of Named-Entity Recognition and Machine Learning to transform and analyse unstructured free-text information on current medication use among participants from the UK Myalgic Encephalomyelitis/Chronic Fatigue Syndrome Biobank: A case study
APHA's 2019 Annual Meeting and Expo (Nov. 2 - Nov. 6)
Free-text data are a vital source of information in research, but specialized processing is required to extract structured information from them. This case study aims to demonstrate how gazetteer-based Named-Entity Recognition (NER) and Machine Learning can be used to capture, standardize and analyse data on medication/supplement use entered as free text in questionnaires by UK ME/CFS Biobank (UKMEB) participants.
Methods
Preprocessing steps, including removing missing/unusable entries, detecting and transforming abbreviations, and case-folding, were undertaken prior to analysis. Gazetteer-based NER was employed to identify interventions, to generate intervention categories, and to convert the free text into structured data. The Apriori, Random Forest and XGBoost machine learning algorithms were employed to identify associations with symptom clusters. Chi-squared tests were used to compare symptom presence between patients taking and not taking an intervention.
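The preprocessing and gazetteer lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the gazetteer entries, the category labels, the list of unusable markers, and the function names `preprocess` and `tag_interventions` are all hypothetical. Abbreviations are handled here simply by listing them as extra surface forms in the dictionary.

```python
import re

# Hypothetical gazetteer: surface form -> (standardized name, category).
# Entries are illustrative only, not the dictionary used in the study.
GAZETTEER = {
    "vitamin d": ("vitamin D", "supplements"),
    "vit d": ("vitamin D", "supplements"),  # abbreviation mapped to the same entity
    "magnesium": ("magnesium", "supplements"),
    "paracetamol": ("paracetamol", "analgesics"),
    "amitriptyline": ("amitriptyline", "CNS medications"),
}

# Hypothetical markers for missing or unusable entries.
UNUSABLE = {"", "none", "n/a", "na", "-"}


def preprocess(text):
    """Case-fold and collapse whitespace; return None for unusable entries."""
    cleaned = re.sub(r"\s+", " ", text.casefold()).strip()
    return None if cleaned in UNUSABLE else cleaned


def tag_interventions(text):
    """Return sorted (standardized name, category) pairs found in one entry."""
    cleaned = preprocess(text)
    if cleaned is None:
        return []
    found = set()
    for surface, entity in GAZETTEER.items():
        # Word-boundary match so short terms do not fire inside longer words.
        if re.search(r"\b" + re.escape(surface) + r"\b", cleaned):
            found.add(entity)
    return sorted(found)


if __name__ == "__main__":
    print(tag_interventions("Vit D,  Paracetamol 500mg"))
```

Each free-text answer becomes a list of standardized entities with categories, which can then be turned into one indicator column per intervention or category for the downstream analyses.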
Results
Of 607 UKMEB participants, 383 had usable records. A total of 95 different interventions and 14 intervention categories were identified. The most common intervention was supplements (n=172). Taking only supplements was the most common combination of interventions (n=55). The strongest association was between sleep and CNS medications (Lift=2.7; Confidence=1.0; Support=15.1%). No significant patterns of association were found with symptom clusters. Of ME/CFS cases who reported taking analgesics (n=104), 82% (n=85) also reported pain (P=0.04) and 82% (n=85) also reported post-exertional malaise symptoms (P=0.01).
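For readers unfamiliar with the association-rule measures quoted above, the sketch below computes support, confidence and lift for a single rule using their standard definitions. The toy records and the function name `rule_metrics` are hypothetical and do not reproduce the UKMEB data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    transactions: list of sets of items (one set per participant).
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # records containing antecedent
    n_c = sum(1 for t in transactions if c <= t)         # records containing consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # records containing both
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ac / n            # share of all records containing both sides
    confidence = n_ac / n_a       # P(consequent | antecedent)
    lift = confidence / (n_c / n) # confidence relative to the consequent's base rate
    return support, confidence, lift


if __name__ == "__main__":
    # Toy records, illustrative only.
    records = [
        {"sleep medications", "CNS medications"},
        {"sleep medications", "CNS medications", "supplements"},
        {"CNS medications"},
        {"supplements"},
        {"supplements", "analgesics"},
    ]
    print(rule_metrics(records, {"sleep medications"}, {"CNS medications"}))
```

Because lift is confidence divided by the consequent's base rate, a rule with Confidence=1.0 and Lift=2.7 would imply, under these definitions, that the consequent appears in roughly 1/2.7, or about 37%, of the records analysed.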
Discussion
Using UKMEB participants as a case study, we have shown how NER methods combined with Machine Learning provide an accurate, automated way to extract, standardize, and analyse free-text data. This approach could prove invaluable for research across disciplines.
Biostatistics, economics; Communication and informatics; Epidemiology