An open source inventory to evaluate public health surveillance systems

2017 ◽  
Vol 9 (1) ◽  
Author(s):  
KC Decker ◽  
Catherine Ordun ◽  
Dimitrious Koutsonanos

Objective: The objective of this project is to advance the science of biosurveillance by providing a user-curated cataloging system, to be used by health departments and other users, that advances daily surveillance operations by better characterizing three key issues in available surveillance systems: duplication in biosurveillance activities; differing perspectives and analyses of the same data; and inadequate information sharing.

Introduction: A variety of government reports have cited challenges in coordinating national biosurveillance efforts at strategic and tactical levels. The Government Accountability Office (GAO), an independent nonpartisan agency that investigates how the federal government spends funds and performs analyses at the request of congressional committees or under public mandate, has published 64 reports on biosurveillance since 2005. The aim of this project is to better characterize these issues by collecting and analyzing a sample of publicly documented biosurveillance systems and making our data and results available for the public health community to review and evaluate. This study openly publishes the data files of information collected (i.e., CSV, XLS), the Python NLP scripts, and a freely available web-based application developed in R Shiny that filters against the 227 biosurveillance systems and activities to promote a more transparent understanding of how public health practitioners conduct surveillance activities.

Methods: We collected and reviewed data on 424 systems, of which 227 systems and activities met our criteria; implemented a new approach to developing a standard framework for data collection using natural language processing (NLP); openly published all data files on GitHub and developed an online analytics application; and convened a workshop of experts from federal, state, not-for-profit, academic, and commercial entities in November 2015 in Washington, D.C., to review the methodology and results of this study.

Results: The results of this project include a fully functional web application and code (available through GitHub) for the continued expansion, categorization, and analysis of surveillance systems. Unique findings currently rendered through the 227 surveillance systems include: 20 of the 227 systems were established in 2006 alone, with an increase in systems established after 1990; 68% of all systems cataloged are focused solely on human surveillance; 45% of all cataloged systems used statistical analysis, and only 4% use natural language processing; and 43% of all biosurveillance systems in our inventory reported using "health department" data as a data source.

Conclusions: We believe this project is a first step toward enabling public health practitioners and researchers to contribute to a transparent inventory of systems and activities. Results provide meaningful metadata on an overemphasis on human surveillance, over-reliance on a single data source (health departments), and a lack of advanced data science practices applied to systems in the field. The value of this project is that it 1) provides a starting point for the development of a standard framework of categories for cataloging biosurveillance systems, 2) offers openly available data and code on GitHub [3] for others to integrate into their research, and 3) introduces a set of methodological issues to consider in a biosurveillance inventorying exercise.
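The filtering functionality the R Shiny application provides can be sketched in a few lines of Python. The column names and example systems below are illustrative assumptions, not the inventory's actual schema; the published CSV files on GitHub define the real fields.

```python
import csv
import io

# Hypothetical subset of the inventory's columns; real field names may differ.
SAMPLE_CSV = """name,year_established,focus,data_source
BioSense,2003,human,health department
ArboNET,2000,animal,laboratory
ESSENCE,2001,human,health department
"""

def filter_systems(csv_text, **criteria):
    """Return inventory rows whose fields match all given criteria."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows
            if all(r.get(k) == v for k, v in criteria.items())]

human_systems = filter_systems(SAMPLE_CSV, focus="human")
print([s["name"] for s in human_systems])  # ['BioSense', 'ESSENCE']
```

Because the inventory is a flat CSV, the same filter can be rebuilt in any environment without the web application.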

2020 ◽  
Author(s):  
Patrick James Ward ◽  
April M Young

BACKGROUND Public health surveillance is critical to detecting emerging population health threats and improvements. Surveillance data have increased in size and complexity, posing challenges to data management and analysis. Natural language processing (NLP) and machine learning (ML) are valuable tools for analyzing unstructured free-text data and have been used in innovative ways to examine a variety of health outcomes. OBJECTIVE Given the cross-disciplinary applications of NLP and ML, research on their applications in surveillance has been disseminated in a variety of outlets. As such, the aim of this narrative review was to describe the current state of NLP and ML use in surveillance science and to identify directions for future research. METHODS Information was abstracted from articles describing the use of natural language processing and machine learning in public health surveillance identified through a PubMed search. RESULTS Twenty-two articles met review criteria: 12 involving traditional surveillance data sources and 10 involving online media sources for surveillance. Traditional surveillance sources analyzed with NLP and ML consisted primarily of death certificates (n=6) and hospital data (n=5), while online media surveillance drew primarily on social media (e.g., Twitter) (n=8). CONCLUSIONS The reviewed articles demonstrate the potential of NLP and ML to enhance surveillance data by improving the timeliness of surveillance, identifying cases in the absence of standardized case definitions, and enabling mining of social media for public health surveillance.


2020 ◽  
Author(s):  
Lu Tang ◽  
Wenlin Liu ◽  
Benjamin Thomas ◽  
Hong Thoai Nga Tran ◽  
Wenxue Zou ◽  
...  

BACKGROUND The ongoing COVID-19 pandemic is characterized by differing morbidity and mortality rates across states, cities, rural areas, and diverse neighborhoods. The absence of a national strategy for battling the pandemic also leaves state and local governments responsible for creating their own response strategies and policies. OBJECTIVE This study examines the content of COVID-19–related tweets posted by public health agencies in Texas and how content characteristics predict the level of public engagement. METHODS All COVID-19–related tweets (N=7269) posted by Texas public agencies during the first 6 months of 2020 were classified, using natural language processing, in terms of each tweet’s function (whether the tweet provides information, promotes action, or builds community), the preventive measures mentioned, and the health beliefs discussed. Hierarchical linear regressions were conducted to explore how tweet content predicted public engagement. RESULTS The information function was the most prominent, followed by the action and community functions. Beliefs regarding susceptibility, severity, and benefits were the most frequently covered health beliefs. Tweets that served the information or action functions were more likely to be retweeted, while tweets that served the action and community functions were more likely to be liked. Tweets that provided susceptibility information drew the most public engagement in terms of the number of retweets and likes. CONCLUSIONS Public health agencies should continue to use Twitter to disseminate information, promote action, and build communities. They need to improve their strategies for designing social media messages about the benefits of disease prevention behaviors and audiences’ self-efficacy.
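A tweet-function classifier in the spirit of the one described can be sketched with a simple keyword heuristic. The study's actual classifier was NLP-based; the keyword lists and the default-to-information rule below are illustrative assumptions only.

```python
import re

# Illustrative keyword lists; the study's real classifier was NLP-based.
FUNCTION_KEYWORDS = {
    "action": {"wear", "wash", "avoid", "distance", "vaccinate"},
    "community": {"together", "thank", "thanks", "community"},
}

def classify_tweet(text):
    """Label a tweet as promoting action, building community,
    or (by default) providing information."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    for label, keywords in FUNCTION_KEYWORDS.items():
        if tokens & keywords:
            return label
    return "information"
```

Treating "information" as the default matches the finding that it was the most prominent function: any tweet without an explicit action or community cue falls back to it.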


Author(s):  
Anton Ninkov ◽  
Kamran Sedig

This paper reports on and describes VINCENT, a visual analytics system designed to help public health stakeholders (i.e., users) make sense of data from websites involved in the online debate about vaccines. VINCENT allows users to explore visualizations of data from a group of 37 vaccine-focused websites. These websites differ in their position on vaccines, topics of focus, geographic location, and sentiment toward the efficacy and morality of vaccines, both specific and general. By integrating webometrics, natural language processing of website text, data visualization, and human-data interaction, VINCENT helps users explore complex data that would be difficult to understand, and, if possible at all, to analyze without the aid of computational tools. The objectives of this paper are to explore A) the feasibility of developing a visual analytics system that integrates webometrics, natural language processing of website text, data visualization, and human-data interaction in a seamless manner; B) how a visual analytics system can help with the investigation of the online vaccine debate; and C) what needs to be taken into consideration when developing such a system. This paper demonstrates that visual analytics systems can integrate different computational techniques; that such systems can help with the exploration of public health online debates that are distributed across a set of websites; and that care should go into the design of the different components of such systems.


2021 ◽  
Author(s):  
Jillian Ryan ◽  
Hamza Sellak ◽  
Emily Brindal

BACKGROUND Natural language processing is a machine learning technique that uses intelligent computer algorithms to detect patterns and themes in unstructured datasets, commonly those containing text data. Machine learning can aid in understanding the impacts of novel and disruptive events, and therefore offers myriad public health applications. OBJECTIVE This study aims to explore community sentiment towards COVID-19 and the nature of the impacts that COVID-19 has had on people, using natural language processing on a linked research dataset. METHODS Stanford CoreNLP was used to analyse and detect sentiment in qualitative COVID-19 impact stories from 3,483 Australian adults. Common themes were categorised according to the Theoretical Life Domains framework, and a multinomial regression analysis was conducted to identify psychological and demographic predictors of sentiment. RESULTS About one-third of participants (33%) expressed negative sentiment towards COVID-19, while a further 44% expressed neutral sentiment and 23% expressed positive sentiment. Of the Theoretical Life Domains, behavioural regulation was by far the most commonly impacted life domain, followed by environmental context and resources, emotion, and social influences. Negative sentiment was predicted by financial stress and lower subjective wellbeing. CONCLUSIONS COVID-19 and its containment measures have had dramatic impacts on Australian adults. The ability to regulate health and social behaviours was among the most common impacts, and this raises concerns about the effects of public health crises on chronic health and mental health conditions. Positive effects of COVID-19, related to greater flexibility in working arrangements and reductions in life ‘busyness’, were also documented.
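Sentiment detection of the kind applied to these impact stories can be illustrated with a toy lexicon-based scorer. The study itself used Stanford CoreNLP's sentiment model; the word lists below are purely illustrative assumptions.

```python
# A toy lexicon-based sentiment scorer. The study used Stanford CoreNLP;
# these word lists are illustrative only.
POSITIVE = {"grateful", "flexible", "calmer", "slower", "closer"}
NEGATIVE = {"stress", "stressed", "lonely", "anxious", "worried"}

def sentiment(story):
    """Classify an impact story as positive, negative, or neutral
    by counting lexicon hits."""
    words = set(story.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A real sentiment model scores word order and negation as well; a bag-of-words lexicon is only the simplest baseline against which such models are compared.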


2019 ◽  
Vol 18 ◽  
pp. 160940691988702 ◽  
Author(s):  
William Leeson ◽  
Adam Resnick ◽  
Daniel Alexander ◽  
John Rovers

Qualitative data-analysis methods provide thick, rich descriptions of subjects’ thoughts, feelings, and lived experiences but may be time-consuming, labor-intensive, or prone to bias. Natural language processing (NLP) is a machine learning technique from computer science that uses algorithms to analyze textual data. NLP allows processing of large amounts of data almost instantaneously. As researchers become conversant with NLP, it is being employed more frequently outside of computer science and shows promise as a tool for analyzing qualitative data in public health. This is a proof-of-concept paper evaluating the potential of NLP to analyze qualitative data. Specifically, we ask whether NLP can support conventional qualitative analysis, and if so, what its role is. We compared a qualitative method of open coding with two forms of NLP, topic modeling and Word2Vec, to analyze transcripts from interviews conducted in rural Belize querying men about their health needs. All three methods returned a series of terms that captured ideas and concepts in subjects’ responses to interview questions. Open coding returned 5–10 words or short phrases for each question. Topic modeling returned a series of word-probability pairs that quantified how well a word captured the topic of a response. Word2Vec returned a list of words for each interview question, ordered by which words were predicted to best capture the meaning of the passage. For most interview questions, all three methods returned conceptually similar results. NLP may be a useful adjunct to qualitative analysis. NLP may be performed after data have undergone open coding as a check on the accuracy of the codes. Alternatively, researchers can perform NLP prior to open coding and use the results to guide the creation of their codebook.
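The word-probability pairs that topic modeling produces can be mimicked with simple relative frequencies. This stdlib sketch is a crude stand-in (real topic modeling, e.g. LDA, infers latent topics rather than counting words), and the stopword list and example responses are assumptions.

```python
import re
from collections import Counter

# Illustrative stopword list; real analyses use much larger ones.
STOPWORDS = {"the", "a", "and", "to", "of", "in", "we", "is", "are", "for", "our", "need"}

def top_terms(responses, k=3):
    """Rank content words across interview responses by relative
    frequency -- a crude analogue of topic-model word-probability pairs."""
    counts = Counter(
        w for text in responses
        for w in re.findall(r"[a-z']+", text.lower())
        if w not in STOPWORDS
    )
    total = sum(counts.values())
    return [(w, round(c / total, 3)) for w, c in counts.most_common(k)]
```

Comparing such term rankings to open codes question by question is essentially the cross-check the paper performs, with the NLP side replaced here by a frequency count.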


2020 ◽  
Vol 16 (11) ◽  
pp. e1008277
Author(s):  
Auss Abbood ◽  
Alexander Ullrich ◽  
Rüdiger Busche ◽  
Stéphane Ghozzi

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped sources relevant for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, first suggested many candidates for each key entity; we then extracted the key country and disease using a heuristic, with good results, and trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI’s EBS database as labels, which performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: an article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers using bag-of-words, document embeddings, and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi, where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.
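The bag-of-words relevance-scoring step can be illustrated with a from-scratch naive Bayes classifier over word counts. Note that EventEpi's best relevance model was a logistic regression; naive Bayes is shown here because it is compact enough to write in full, and the tiny training corpus is invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesRelevance:
    """Tiny bag-of-words naive Bayes with Laplace smoothing, illustrating
    relevance scoring; the training data here is illustrative."""

    def fit(self, texts, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        def log_prob(c):
            # log P(c) + sum of log P(w|c) with add-one smoothing
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            lp = math.log(self.priors[c] / sum(self.priors.values()))
            for w in tokenize(text):
                lp += math.log((self.word_counts[c][w] + 1) / total)
            return lp
        return max(self.classes, key=log_prob)
```

With articles already in the EBS database labeled relevant and the rest irrelevant, as in the paper, training data for such a classifier comes for free from routine screening work.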


2016 ◽  
Vol 8 (1) ◽  
Author(s):  
Dino P. Rumoro ◽  
Gillian S. Gibbs ◽  
Shital C. Shah ◽  
Marilyn M. Hallock ◽  
Gordon M. Trenholme ◽  
...  

Processing free-text clinical information in an electronic medical record may enhance surveillance systems for early identification of influenza-like illness outbreaks. However, processing clinical text using natural language processing (NLP) poses a challenge in preserving the semantics of the original information recorded. In this study, we discuss several NLP and technical issues as well as potential solutions for implementation in syndromic surveillance systems.
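One such semantic pitfall is negation: a naive keyword match counts "denies myalgia" as a myalgia case. A minimal sketch of negation-aware symptom extraction follows; the symptom terms, negation cues, and three-word context window are illustrative assumptions, not a production lexicon.

```python
import re

# Illustrative symptom terms and negation cues; real syndromic systems
# use far richer lexicons and full NLP pipelines.
ILI_SYMPTOMS = {"fever", "cough", "myalgia", "sore throat"}
NEGATIONS = {"no", "denies", "without", "negative"}

def extract_symptoms(note):
    """Return ILI symptoms asserted in a clinical note, skipping mentions
    preceded by a negation cue within a three-word window."""
    lowered = note.lower()
    found = set()
    for symptom in ILI_SYMPTOMS:
        for m in re.finditer(re.escape(symptom), lowered):
            window = lowered[:m.start()].split()[-3:]
            if not NEGATIONS & set(window):
                found.add(symptom)
    return found
```

Even this small window heuristic misfires on notes like "no fever, mild cough", which is exactly the kind of semantics-preservation issue the study discusses.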


10.2196/20794 ◽  
2020 ◽  
Vol 6 (3) ◽  
pp. e20794
Author(s):  
Tim Ken Mackey ◽  
Jiawei Li ◽  
Vidya Purushothaman ◽  
Matthew Nali ◽  
Neal Shah ◽  
...  

Background The coronavirus disease (COVID-19) pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel “infodemic,” including the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products including testing kits, treatments, and other questionable “cures.” Enabling the proliferation of this content is the growing ubiquity of internet-based technologies, including popular social media platforms that now have billions of global users. Objective This study aims to collect, analyze, identify, and enable reporting of suspected fake, counterfeit, and unapproved COVID-19–related health care products from Twitter and Instagram. Methods This study is conducted in two phases beginning with the collection of COVID-19–related Twitter and Instagram posts using a combination of web scraping on Instagram and filtering the public streaming Twitter application programming interface for keywords associated with suspect marketing and sale of COVID-19 products. The second phase involved data analysis using natural language processing (NLP) and deep learning to identify potential sellers that were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence. Results We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products from March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1271 tweets and 596 Instagram posts associated with questionable sales of COVID-19–related products. Generally, product introduction came in two waves, with the first consisting of questionable immunity-boosting treatments and a second involving suspect testing kits. 
We also detected a low volume of pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included products offered in different languages, various claims of product credibility, completely unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods. Conclusions Results from this study provide initial insight into one front of the “infodemic” fight against COVID-19 by characterizing what types of health products, selling claims, and types of sellers were active on two popular social media platforms at earlier stages of the pandemic. This cybercrime challenge is likely to continue as the pandemic progresses and more people seek access to COVID-19 testing and treatment. This data intelligence can help public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms better remove and prevent this content from harming the public.
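The keyword-filtering stage of such a pipeline can be sketched as flagging posts that pair a COVID-19 product term with a selling cue. The term lists below are illustrative assumptions; the study's actual pipeline layered NLP and deep learning on top of keyword filtering before manual annotation.

```python
import re

# Illustrative term lists; the study's real keyword set was more extensive.
PRODUCT_TERMS = {"test kit", "testing kit", "cure", "treatment", "immunity booster"}
SALE_TERMS = {"buy", "order", "dm", "shipping", "price", "sale"}

def is_suspect_post(text):
    """Flag a post that mentions both a COVID-19 product term
    and a selling cue."""
    lowered = text.lower()
    has_product = any(term in lowered for term in PRODUCT_TERMS)
    tokens = set(re.findall(r"[a-z']+", lowered))
    return has_product and bool(tokens & SALE_TERMS)
```

Requiring both a product term and a selling cue is what separates suspect seller posts from the far larger volume of posts that merely discuss testing or treatment.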


2021 ◽  
Vol 9 ◽  
Author(s):  
Emily Rodriguez Weno ◽  
Peg Allen ◽  
Stephanie Mazzucca ◽  
Louise Farah Saliba ◽  
Margaret Padek ◽  
...  

Background: Public health agencies are increasingly concerned with ensuring they are maximizing limited resources by delivering evidence-based programs to enhance population-level chronic disease outcomes. Yet, there is little guidance on how to end ineffective programs that continue in communities. The purpose of this analysis is to identify what strategies public health practitioners perceive to be effective in de-implementing, or reducing the use of, ineffective programs. Methods: From March to July 2019, eight states were selected to participate in qualitative interviews from our previous national survey of US state health department (SHD) chronic disease practitioners on program decision making. This analysis examined responses to a question about “…advice for others who want to end an ineffective program.” Forty-five SHD employees were interviewed via phone. Interviews were audio-recorded, and the conversations were transcribed verbatim. All transcripts were consensus coded, and themes were identified and summarized. Results: Participants were program managers or section directors who had on average worked 11 years at their agency and 15 years in public health. SHD employees provided several strategies they perceived as effective for de-implementation. The major themes were: (1) collect and rely on evaluation data; (2) consider if any of the programs can be saved; (3) transparently communicate and discuss program adjustments; (4) be tactful and respectful of partner relationships; (5) communicate in a way that is meaningful to your audience. Conclusions: This analysis provides insight into how experienced SHD practitioners recommend ending ineffective programs which may be useful for others working at public health agencies. As de-implementation research is limited in public health settings, this work provides a guiding point for future researchers to systematically assess these strategies and their effects on public health programming.

