Online Program

Using a novel software program to improve Twitter data collection methods

Tuesday, November 3, 2015

Amy Leader, DrPH, MPH, Department of Medical Oncology, Division of Population Science, Thomas Jefferson University, Philadelphia, PA
Philip Massey, PhD, MPH, Department of Community Health and Prevention, Drexel University School of Public Health, Philadelphia, PA
Alan Black, College of Computing and Informatics, Drexel University, Philadelphia, PA
Alexandra Budenz, MA, DrPH(c), School of Public Health, Drexel University, Philadelphia, PA
Kara Fisher, MPH, School of Public Health, Drexel University, Philadelphia, PA
Elizabeth DeArmas, Jefferson School of Population Health, Thomas Jefferson University, Philadelphia, PA
Ann C. Klassen, PhD, Department of Community Health and Prevention, Drexel University School of Public Health, Philadelphia, PA

Background: Social media is a growing platform for collecting and analyzing public health data.  On contested topics, such as human papillomavirus (HPV) vaccination, messages posted on Twitter can mirror public sentiment in real time.  The Twitter Application Programming Interface (API), the most popular method cited in the literature for collecting tweets, prospectively captures only 1% of all available tweets, limiting the functionality of the software and the generalizability of the results.

Purpose: The purpose of this study is to describe the results of using a novel computer software program to improve the collection and analysis of Twitter data. 

Methods: Each week over 6 months, we collected all content-relevant tweets worldwide using Personal Zombie, a Twitter data collection software program developed at Drexel University.  The software gathers data generated from a series of searches that are executed at regular intervals through a cloud computing platform.  We used 13 key words related to HPV and the vaccine (for example: “HPV”, “cervical cancer”, “HPV vaccine”, “#HPV”) to collect all tweets within our search criteria.
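
The collection pattern described above (a fixed list of search terms, each executed at regular intervals) can be sketched as follows. This is an illustrative sketch only, not the Personal Zombie implementation: the function `run_scheduled_searches`, the `search_fn` callable, and the tweet record shape are assumptions made for the example, and only the search terms named in this abstract are listed because the full set of 13 is not given.

```python
import time
from typing import Callable, Dict, Iterable, List

# Only the search terms named in the abstract; the full list of 13 is not given.
SEARCH_TERMS = ["HPV", "cervical cancer", "HPV vaccine", "#HPV", "#HPVshot"]

# Hypothetical record shape for one tweet: {"id": ..., "text": ...}
Tweet = Dict[str, str]


def run_scheduled_searches(
    terms: Iterable[str],
    search_fn: Callable[[str], List[Tweet]],
    rounds: int,
    interval_seconds: float = 0.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> Dict[str, List[Tweet]]:
    """Run every search term once per round, pausing between rounds.

    search_fn is a stand-in for whatever actually queries Twitter;
    results are kept separately for each term.
    """
    terms = list(terms)
    collected: Dict[str, List[Tweet]] = {term: [] for term in terms}
    for round_index in range(rounds):
        for term in terms:
            collected[term].extend(search_fn(term))
        # Wait before the next round, but not after the final one.
        if round_index < rounds - 1:
            sleep_fn(interval_seconds)
    return collected
```

Passing the search and sleep functions in as parameters keeps the sketch independent of any particular Twitter client and lets it be exercised with a stub in place of a live service.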

Results: Personal Zombie mined all tweets for each of the 13 search terms separately.  After 6 months of data collection, we had collected 511,464 tweets that contained at least one of the 13 search terms. The top three search terms by volume of tweets were HPV (282,354 tweets; 55% of all tweets), cervical cancer (101,171 tweets; 20% of all tweets), and HPV vaccine (40,629 tweets; 8% of all tweets). The fewest tweets were collected for #HPVshot (9 tweets). After merging and de-duplicating the tweets collected for each of the 13 search terms, a total of 396,112 unique tweets remained.
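
The merge and de-duplication step, and the percentages reported above, can be illustrated with a short sketch. The abstract does not say how duplicates were identified, so matching on a tweet `id` field is an assumption, as are the function names `merge_unique` and `share_percent`. The counts are taken directly from the results above.

```python
from typing import Dict, List

# Hypothetical record shape for one tweet: {"id": ..., "text": ...}
Tweet = Dict[str, str]


def merge_unique(per_term: Dict[str, List[Tweet]]) -> Dict[str, Tweet]:
    """Merge per-term results, keeping one copy of each tweet id (assumed key)."""
    unique: Dict[str, Tweet] = {}
    for tweets in per_term.values():
        for tweet in tweets:
            # A tweet matching several search terms is stored only once.
            unique.setdefault(tweet["id"], tweet)
    return unique


def share_percent(count: int, total: int) -> int:
    """Share of the total, rounded to a whole percent."""
    return round(100 * count / total)


# Counts reported in the results.
TOTAL_COLLECTED = 511_464
UNIQUE_AFTER_MERGE = 396_112

print(share_percent(282_354, TOTAL_COLLECTED))  # HPV: 55
print(share_percent(101_171, TOTAL_COLLECTED))  # cervical cancer: 20
print(share_percent(40_629, TOTAL_COLLECTED))   # HPV vaccine: 8
print(TOTAL_COLLECTED - UNIQUE_AFTER_MERGE)     # 115352 tweets dropped by the merge
```

The difference between the two totals (115,352) is the number of tweets that were captured by more than one search term and therefore counted more than once before the merge.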

Conclusion: Findings suggest that using an improved software program and a comprehensive list of search terms can yield a pool of tweets that approaches the true population of relevant tweets.

Learning Areas:

Communication and informatics
Public health or related research

Learning Objectives:
Explain the utility of analyzing social media messages and their application to public health practice
Compare the benefits and limitations of different computer software programs that are available to analyze Twitter data
Describe how to select key word terms for mining Twitter data so as to maximize the results of the search

Keyword(s): Social Media, Data Collection and Surveillance

Presenting author's disclosure statement:

Qualified on the content I am responsible for because: I am a Co-Investigator on the research study and am intimately involved in the study design, data analysis, and interpretation of results. I wrote the abstract.
Any relevant financial relationships? No

I agree to comply with the American Public Health Association Conflict of Interest and Commercial Support Guidelines, and to disclose to the participants any off-label or experimental uses of a commercial product or service discussed in my presentation.