Abstract
Machine Learning Classifiers for Socio-Demographics of Social Media Users: Limitations and Possibilities
APHA's 2019 Annual Meeting and Expo (Nov. 2 - Nov. 6)
Methods: We survey several socio-demographic classifiers that work with Twitter and Reddit data. They draw on features such as users' language patterns, follower behaviors, and choice of names to predict labels such as gender, race, and ethnicity, or to filter out social media accounts run by organizations.
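As a rough illustration of how a name-based classifier of this kind can work, the sketch below trains a small naive Bayes model on character trigrams of display names to separate organization accounts from individual ones. It is not one of the surveyed classifiers: the model choice, the helper `char_ngrams`, the class `NaiveBayes`, and the toy names and labels are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a lowercased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()               # label -> number of examples
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical display names and labels, invented for illustration only.
train_names = [
    "City Health Department",
    "County Public Health Agency",
    "State Health Department News",
    "maria lopez",
    "john smith",
    "Jordan Lee",
]
train_labels = ["org", "org", "org", "person", "person", "person"]

model = NaiveBayes().fit(train_names, train_labels)
print(model.predict("Regional Health Department"))
```

A classifier trained on so few examples is only a demonstration of the mechanics; in practice, the quality and representativeness of the labeled data determine what such a model can be trusted to do.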
Results: We explain how the data for these classifiers is collected, how the classification models are trained, and how they could be applied to public health research. In particular, we discuss the classifiers' limitations, including possible methodological bias introduced by the challenges of collecting social media users' demographic information at scale.
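One way such bias might show up is as uneven accuracy across subgroups of users. The sketch below disaggregates accuracy by group and reports the largest gap. The abstract does not specify an evaluation procedure, so the function `accuracy_by_group`, the group names, and the records are hypothetical and serve only to illustrate the check.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records, invented for illustration only.
records = [
    ("group_a", "x", "x"),
    ("group_a", "y", "y"),
    ("group_a", "x", "x"),
    ("group_a", "y", "x"),
    ("group_b", "x", "y"),
    ("group_b", "y", "y"),
]

per_group = accuracy_by_group(records)
# Difference between the best-served and worst-served group.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

A large gap would suggest that the classifier's errors are concentrated in particular groups, which could distort downstream public health estimates even when overall accuracy looks acceptable.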
Discussion: Health behaviors vary with socio-demographic factors, which are challenging to measure on social media platforms. Machine learning classification of socio-demographics is possible, but requires interdisciplinary considerations.
Public health or related research; Social and behavioral sciences