A machine learning approach to open public comments for policymaking

2020 ◽  
Vol 25 (4) ◽  
pp. 433-448 ◽  
Author(s):  
Alex Ingrams

In this paper, the author argues that the conflict between the copious amount of digital data processed by public organisations and the need for policy-relevant insights to aid public participation constitutes a ‘public information paradox’. Machine learning (ML) approaches may offer one solution to this paradox through algorithms that transparently collect and use statistical modelling to provide insights for policymakers. Such an approach is tested in this paper. The test involves applying an unsupervised machine learning approach with latent Dirichlet allocation (LDA) analysis of thousands of public comments submitted to the United States Transportation Security Administration (TSA) on a 2013 proposed regulation for the use of new full body imaging scanners in airport security terminals. The analysis results in salient topic clusters that could be used by policymakers to understand large amounts of text such as in an open public comments process. The results are compared with the actual final proposed TSA rule, and the author reflects on new questions raised for transparency by the implementation of ML in open rule-making processes.
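To make the LDA step concrete, the following is a minimal sketch of topic modelling by collapsed Gibbs sampling, written in dependency-free Python. It is not the paper's implementation: the toy comments, the number of topics, the hyperparameters `alpha` and `beta`, and the function names `lda_gibbs` and `top_words` are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic-word distributions; each row sums to 1.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    return vocab, phi


def top_words(vocab, phi, n=3):
    """Return the n most probable words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in phi
    ]


# Hypothetical, already-cleaned comments; real comments would need
# tokenization, stop-word removal, and far more data.
comments = [
    "scanner privacy image body privacy",
    "privacy image scanner body",
    "radiation health risk scanner",
    "health radiation risk exposure",
    "cost delay airport line",
    "delay line cost airport",
]
tokens = [c.split() for c in comments]
vocab, phi = lda_gibbs(tokens, num_topics=3)
for topic_index, words in enumerate(top_words(vocab, phi)):
    print(topic_index, words)
```

On a corpus this small the recovered clusters depend heavily on the seed and hyperparameters, so the output should be read as a demonstration of the mechanics rather than a finding.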

10.2196/20550 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e20550
Author(s):  
Jia Xue ◽  
Junxiang Chen ◽  
Ran Hu ◽  
Chen Chen ◽  
Chengda Zheng ◽  
...  

Background It is important to measure the public response to the COVID-19 pandemic. Twitter is an important data source for infodemiology studies involving public response monitoring. Objective The objective of this study is to examine COVID-19–related discussions, concerns, and sentiments using tweets posted by Twitter users. Methods We analyzed 4 million Twitter messages related to the COVID-19 pandemic using a list of 20 hashtags (eg, “coronavirus,” “COVID-19,” “quarantine”) from March 7 to April 21, 2020. We used a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams and bigrams, salient topics and themes, and sentiments in the collected tweets. Results Popular unigrams included “virus,” “lockdown,” and “quarantine.” Popular bigrams included “COVID-19,” “stay home,” “corona virus,” “social distancing,” and “new cases.” We identified 13 discussion topics and categorized them into 5 different themes: (1) public health measures to slow the spread of COVID-19, (2) social stigma associated with COVID-19, (3) COVID-19 news, cases, and deaths, (4) COVID-19 in the United States, and (5) COVID-19 in the rest of the world. Across all identified topics, the dominant sentiments for the spread of COVID-19 were anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear related to different topics. The public tweets revealed a significant feeling of fear when people discussed new COVID-19 cases and deaths compared to other topics. Conclusions This study showed that Twitter data and machine learning approaches can be leveraged for an infodemiology study, enabling research into evolving public discussions and sentiments during the COVID-19 pandemic. As the situation rapidly evolves, several topics are consistently dominant on Twitter, such as confirmed cases and death rates, preventive measures, health authorities and government policies, COVID-19 stigma, and negative psychological reactions (eg, fear). 
Real-time monitoring and assessment of Twitter discussions and concerns could provide useful data for public health emergency responses and planning. Pandemic-related fear, stigma, and mental health concerns are already evident and may continue to influence public trust when a second wave of COVID-19 occurs or there is a new surge of the current pandemic.
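The unigram and bigram counts reported in the results can be illustrated with a short frequency count. This is a minimal sketch in plain Python; the three tweets, the tokenizer regular expression, and the helper names `tokenize` and `ngram_counts` are assumptions for illustration, not the study's pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, keeping alphanumeric runs and internal hyphens ("covid-19").
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical tweets.
tweets = [
    "Stay home and keep social distancing",
    "Social distancing works, stay home",
    "New cases reported during lockdown",
]
unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
print(unigrams.most_common(4))
print(bigrams.most_common(2))
```

A real analysis would normally remove stop words before counting so that function words such as "and" do not dominate the lists.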


2020 ◽  
Vol 25 (4) ◽  
pp. 174-189 ◽  
Author(s):  
Guillaume Palacios ◽  
Arnaud Noreña ◽  
Alain Londero

Introduction: Subjective tinnitus (ST) and hyperacusis (HA) are common auditory symptoms that may become incapacitating in a subgroup of patients, who then seek medical advice. Both conditions can result from many different mechanisms, and as a consequence, patients may report a vast repertoire of associated symptoms and comorbidities that can dramatically reduce quality of life and even lead to suicide attempts in the most severe cases. The present exploratory study aims to investigate patients’ symptoms and complaints using an in-depth statistical analysis of patients’ natural narratives in a real-life environment in which, thanks to the anonymization of contributions and the peer-to-peer interaction, the wording is assumed to be free of self-limitation and self-censorship. Methods: We applied a purely statistical, unsupervised machine learning approach to the analysis of patients’ verbatim posts exchanged on an Internet forum. After automated data extraction, the dataset was preprocessed to make it suitable for statistical analysis. We used a variant of the Latent Dirichlet Allocation (LDA) algorithm to reveal clusters of symptoms and complaints of HA patients (topics). Each topic is uniquely characterized by its probability distribution over words. The log-likelihood of the LDA model converged after 2,000 iterations. Several statistical parameters were tested for topic modeling and for the word relevance factor within each topic. Results: Despite a rather small dataset, this exploratory study demonstrates that patients’ free-text accounts available on the Internet constitute valuable material for machine learning and statistical analysis aimed at categorizing ST/HA complaints. The LDA model with K = 15 topics seems to be the most relevant in terms of relative weights and correlations, and is able to identify distinct subgroups of patients displaying specific characteristics. 
The study of the relevance factor may be useful to unveil weak but important signals that are present in patients’ narratives. Discussion/Conclusion: We claim that the unsupervised LDA approach makes it possible to gain knowledge of the patterns of ST- and HA-related complaints and of patient-centered domains of interest. The merits and limitations of the LDA algorithms are compared with other natural language processing methods and with more conventional methods of qualitative analysis of patients’ output. Future directions and research topics emerging from this innovative algorithmic analysis are proposed.
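The abstract does not define its word relevance factor. One widely used definition, assumed here and not confirmed by the source, is the relevance measure popularized by the LDAvis tool: relevance(w, k) = λ · log p(w | k) + (1 − λ) · log(p(w | k) / p(w)), with λ between 0 and 1. The sketch below uses hypothetical probabilities to show how lowering λ promotes rare but topic-specific words, which is one way "weak but important signals" could surface.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic (assumed LDAvis-style definition)."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities for three words in one topic and in the corpus.
topic_probs = {"noise": 0.20, "ear": 0.15, "earplugs": 0.05}
corpus_probs = {"noise": 0.10, "ear": 0.12, "earplugs": 0.005}


def rank(lam):
    """Order the words of the topic by decreasing relevance."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


print(rank(1.0))  # ranking by within-topic probability only
print(rank(0.0))  # ranking by lift only
```

With λ = 1 the ranking follows raw within-topic probability; with λ = 0 it follows lift, so a word that is rare overall but concentrated in the topic moves to the top.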


Author(s):  
Dahye Lee ◽  
Jeffery Warner ◽  
Curtis Morgan

According to the Federal Railroad Administration (FRA) Highway-Rail Grade Crossing Accident/Incident database, more than 12,000 accidents occurred between 2012 and 2017 in the United States, with around 3,900 casualties. Despite repeated efforts to fully understand the risk factors that contribute to highway-rail grade crossing collisions, many uncertainties remain. A machine learning approach is proposed in this paper to identify significant factors, along with their individual impacts on crash severity at grade crossings. One of the most efficient and accurate machine learning algorithms, extreme gradient boosting (XGB or XGBoost), is applied to analyze 21 different accident- and crossing-related characteristics by driver injury severity. The XGB model has been shown in previous studies across many research areas in transportation to outperform other machine learning-based methods and statistical classification methods, such as the multinomial logit model, multiple additive regression trees, decision trees, and random forests, especially in prediction accuracy. Applying the algorithm is therefore expected to provide highly reliable results for identifying important factors that affect injury severity at grade crossings. Such an application will further help identify crossings where these significant factors are present. The FRA’s Highway-Rail Grade Crossing Accident/Incident database from 2012 to 2017 is fused with the FRA Highway-Rail Crossing Inventory database for the analysis. Observations with missing information were removed from the original database. Crossings that pass under or over the railroad, and accidents involving pedestrians or other types of highway users, were also excluded since they were not of specific interest in this study. After cleaning, the combined database was reduced from the 12,630 retrieved accidents to a total of 1,250. 
The results show that adjacent highway traffic volume and train speed are the most significant factors contributing to accidents and injury severity. They are followed by the driver’s age and the estimated vehicle speed. The results also indicate that truck-involved accidents, crossings equipped with a combination of gates, flashing lights, and other types of warning devices, and male highway users are associated with a higher injury rate. This study can provide guidance to decision-makers in recognizing possible risks at grade crossings that may cause driver casualties.
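The boosting idea behind the model can be sketched without external libraries. The code below is plain gradient boosting on squared error with one-split decision stumps, not XGBoost itself (which adds regularization, second-order gradient information, and deeper trees). The six data rows, the severity scores, the feature names, and the crude split-count importance are all hypothetical and chosen only to show the mechanics.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            err = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def boost(X, y, rounds=20, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, preds


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical rows: [highway traffic volume, train speed]; y is a made-up severity score.
feature_names = ["traffic_volume", "train_speed"]
X = [[1000, 20], [5000, 25], [12000, 40], [15000, 55], [3000, 60], [20000, 30]]
y = [0, 0, 1, 2, 2, 1]

base, stumps, preds = boost(X, y)

# Crude importance proxy: how often each feature was chosen for a split.
split_counts = {name: 0 for name in feature_names}
for j, _, _, _ in stumps:
    split_counts[feature_names[j]] += 1
print(split_counts, round(mse(y, preds), 4))
```

Each round fits a stump to what the current ensemble still gets wrong, which is why the training error falls as rounds are added; production libraries report gain-based importances rather than raw split counts.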


2019 ◽  
Vol 142 (1) ◽  
Author(s):  
Feng Zhou ◽  
Jackie Ayoub ◽  
Qianli Xu ◽  
X. Jessie Yang

Creating product ecosystems has been one of the strategic ways to enhance user experience and business advantages. Among the many tasks involved in creating a successful product ecosystem, customer needs analysis is one of the most challenging, from the perspectives of both marketing research and product development. In this paper, we propose a machine-learning approach to customer needs analysis for product ecosystems by examining a large amount of online user-generated product reviews within a product ecosystem. First, we separated uninformative reviews from informative ones using a fastText technique. Then, we extracted a variety of topics with regard to customer needs using a topic modeling technique named latent Dirichlet allocation. In addition, we applied a rule-based sentiment analysis method to predict not only the sentiment of the reviews but also their sentiment intensity values. Finally, we categorized customer needs related to the different extracted topics using an analytic Kano model based on the dissatisfaction-satisfaction pair from the sentiment analysis. A case example of the Amazon product ecosystem was used to illustrate the potential and feasibility of the proposed method.
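A rule-based sentiment scorer of the kind mentioned above can be sketched in a few lines. Everything here is an assumption for illustration: the tiny lexicon, the intensifier and negation rules, and in particular the `satisfaction_pair` helper, which shows one plausible way to form a (satisfaction, dissatisfaction) pair for a topic and is not the paper's definition or its Kano model.

```python
# Hypothetical lexicon: word -> signed intensity.
LEXICON = {"great": 2.0, "love": 3.0, "good": 1.5,
           "slow": -1.5, "broken": -2.5, "terrible": -3.0}
INTENSIFIERS = {"very": 1.5, "really": 1.3}
NEGATIONS = {"not", "never", "no"}


def sentiment_intensity(review):
    """Sum lexicon scores, applying simple intensifier and negation rules."""
    tokens = review.lower().replace(".", " ").replace(",", " ").split()
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        prev = tokens[i - 1] if i > 0 else ""
        prev2 = tokens[i - 2] if i > 1 else ""
        if prev in INTENSIFIERS:
            value *= INTENSIFIERS[prev]      # "very good" is stronger than "good"
        if prev in NEGATIONS or (prev in INTENSIFIERS and prev2 in NEGATIONS):
            value = -value                   # "not good", "not very good"
        score += value
    return score


def satisfaction_pair(reviews):
    """Mean positive intensity and mean negative magnitude for one topic (assumed)."""
    scores = [sentiment_intensity(r) for r in reviews]
    positive = [s for s in scores if s > 0]
    negative = [-s for s in scores if s < 0]
    satisfaction = sum(positive) / len(positive) if positive else 0.0
    dissatisfaction = sum(negative) / len(negative) if negative else 0.0
    return satisfaction, dissatisfaction


# Hypothetical reviews assigned to the same topic.
topic_reviews = ["The speaker is very good", "Setup is not good", "The app is slow and broken"]
print([sentiment_intensity(r) for r in topic_reviews])
print(satisfaction_pair(topic_reviews))
```

Keeping the positive and negative sides separate, rather than averaging them into one number, preserves the asymmetry that a Kano-style categorization relies on.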


10.2196/19509 ◽  
2020 ◽  
Vol 6 (2) ◽  
pp. e19509 ◽  
Author(s):  
Tim Mackey ◽  
Vidya Purushothaman ◽  
Jiawei Li ◽  
Neal Shah ◽  
Matthew Nali ◽  
...  

Background The coronavirus disease (COVID-19) pandemic is a global health emergency with over 6 million cases worldwide as of the beginning of June 2020. The pandemic is historic in scope and precedent given its emergence in an increasingly digital era. Importantly, there have been concerns about the accuracy of COVID-19 case counts due to issues such as lack of access to testing and difficulty in measuring recoveries. Objective The aims of this study were to detect and characterize user-generated conversations that could be associated with COVID-19-related symptoms, experiences with access to testing, and mentions of disease recovery using an unsupervised machine learning approach. Methods Tweets were collected from the Twitter public streaming application programming interface from March 3-20, 2020, filtered for general COVID-19-related keywords and then further filtered for terms that could be related to COVID-19 symptoms as self-reported by users. Tweets were analyzed using an unsupervised machine learning approach called the biterm topic model (BTM), where groups of tweets containing the same word-related themes were separated into topic clusters that included conversations about symptoms, testing, and recovery. Tweets in these clusters were then extracted and manually annotated for content analysis and assessed for their statistical and geographic characteristics. Results A total of 4,492,954 tweets were collected that contained terms that could be related to COVID-19 symptoms. After using BTM to identify relevant topic clusters and removing duplicate tweets, we identified a total of 3465 (<1%) tweets that included user-generated conversations about experiences that users associated with possible COVID-19 symptoms and other disease experiences. 
These tweets were grouped into five main categories including first- and secondhand reports of symptoms, symptom reporting concurrent with lack of testing, discussion of recovery, confirmation of negative COVID-19 diagnosis after receiving testing, and users recalling symptoms and questioning whether they might have been previously infected with COVID-19. The co-occurrence of tweets for these themes was statistically significant for users reporting symptoms with a lack of testing and with a discussion of recovery. A total of 63% (n=1112) of the geotagged tweets were located in the United States. Conclusions This study used unsupervised machine learning for the purposes of characterizing self-reporting of symptoms, experiences with testing, and mentions of recovery related to COVID-19. Many users reported symptoms they thought were related to COVID-19, but they were not able to get tested to confirm their concerns. In the absence of testing availability and confirmation, accurate case estimations for this period of the outbreak may never be known. Future studies should continue to explore the utility of infoveillance approaches to estimate COVID-19 disease severity.
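The biterm topic model works on biterms, meaning unordered pairs of words that co-occur in the same short text, and models their counts across the whole corpus instead of per-document word counts. The sketch below shows only that extraction step, not the topic inference; the tweets, the stop-word list, and the helper name `biterms` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"i", "a", "the", "and", "but", "to", "have", "get", "cannot", "my", "is"}


def biterms(text):
    """Return every unordered word pair in a short text, stop words removed."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    # Sorting each pair makes ("fever", "cough") and ("cough", "fever") identical.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical tweets.
tweets = [
    "I have a fever and cough",
    "fever cough but cannot get tested",
    "recovered but cough is still bad",
]
biterm_counts = Counter(b for tweet in tweets for b in biterms(tweet))
print(biterm_counts.most_common(3))
```

Pooling co-occurrences over the corpus is what lets this family of models cope with tweets, where a single document contains too few words for per-document topic proportions to be estimated reliably.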

