Data-driven and expert-based informatics approaches to identifying HIV-associated common data elements (CDE) in empirically generated and knowledge-based resources

William Brown III, PhD, DrPH, MA1, Timothy Frasca, MPH2, Chunhua Weng, PhD3, David Vawdrey, PhD4, Alex Carballo-Diéguez, PhD3 and Suzanne Bakken, RN, PhD, FAAN, FACMI5
(1)HIV Center for Clinical and Behavioral Studies, Columbia University, New York, NY, (2)HIV Center for Clinical and Behavioral Studies at the New York State Psychiatric Institute and Columbia University, New York, NY, (3)Columbia University, New York, NY, (4)NewYork-Presbyterian Hospital, New York, NY, (5)Columbia University School of Nursing, New York, NY

APHA 2016 Annual Meeting & Expo (Oct. 29 - Nov. 2, 2016)

Common data elements (CDEs) facilitate semantic interoperability and the integration of heterogeneous data sources from healthcare delivery or research. Challenges to identifying CDEs include finding relevant data element (DE) resources, achieving high-throughput CDE discovery, and reaching agreement on DE commonality among semantically heterogeneous sources. We used both data-driven and expert-based informatics approaches to mitigate these challenges while identifying CDEs in the HIV research domain. We collected DEs from empirically generated resources (HIV journal articles and HIV-associated datasets) and knowledge-based resources (AIDSinfo HIV/AIDS Glossary and Drug Database, LOINC, SNOMED, and RxNorm). Data-driven approaches to identifying resources included Google Search to find the HIV/AIDS Glossary and Drug Database; ClinicalTrials.gov to identify HIV-associated research datasets, so that study principal investigators could be recruited to provide study DEs; and BioPortal's ontology recommender to identify HIV-relevant ontologies (i.e., LOINC, SNOMED, and RxNorm). Data-driven approaches to identifying HIV CDEs within these resources included string metrics in R and the BioPortal Annotator. In the expert-based approach, two HIV experts manually reviewed DEs from the journal articles and data dictionaries to confirm DE commonality and resolved semantic discrepancies through discussion. We have identified 2,179 CDEs to date. Data-driven approaches identified 2,055 (94%): 999 from the HIV/AIDS Glossary, 398 from the Drug Database, 91 from journal articles, and a cumulative 567 from LOINC, SNOMED, and RxNorm (ongoing). Expert-based approaches identified 124 (6%) from data dictionaries (ongoing) and confirmed the 91 CDEs from journal articles. Data-driven approaches can facilitate relatively high-throughput identification of relevant CDE sources and CDEs. However, data-driven methods are often challenged by semantic heterogeneity, especially with empirically generated DEs. Expert-based approaches can complement data-driven approaches and help resolve semantic discrepancies with more certainty than data-driven methods alone. This research provides the foundation for informatics tools to facilitate semantic interoperability and data integration.
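
The abstract does not say which string metrics, thresholds, or preprocessing steps were used in R, so the following is only an illustrative sketch of the general idea, not the authors' method. It is written in Python and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the 0.85 threshold, and the sample data element names are all hypothetical. It flags pairs of DE names from different sources whose normalized labels are highly similar, producing candidates that an expert could then confirm or reject.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase a DE name and turn punctuation/underscores into single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized DE names (stand-in metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_cdes(sources: dict, threshold: float = 0.85) -> list:
    """Return cross-source DE pairs whose similarity meets the threshold.

    `sources` maps a source label to a list of DE names. The threshold is an
    assumed value for illustration, not one reported in the abstract.
    """
    matches = []
    for (src_a, elems_a), (src_b, elems_b) in combinations(sources.items(), 2):
        for a in elems_a:
            for b in elems_b:
                score = similarity(a, b)
                if score >= threshold:
                    matches.append((src_a, a, src_b, b, score))
    # Highest-scoring candidates first, for expert review.
    return sorted(matches, key=lambda m: m[4], reverse=True)


# Hypothetical DE names from two hypothetical study data dictionaries.
sources = {
    "study_A": ["CD4 count", "Viral load", "Date of birth"],
    "study_B": ["cd4_count", "HIV viral load", "Condom use"],
}

for src_a, a, src_b, b, score in candidate_cdes(sources):
    print(f"{src_a}:{a!r} ~ {src_b}:{b!r} (score={score:.2f})")
```

With these sample inputs, only the "CD4 count" / "cd4_count" pair clears the 0.85 threshold. "Viral load" and "HIV viral load" fall just below it, which shows the kind of semantic discrepancy that string similarity alone can miss and that expert review is meant to resolve.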

Communication and informatics; Public health or related research