Mining Hidden Knowledge About Illegal Compensation for Occupational Injury: Topic Model Approach (Preprint)

Mapping Intimacies ◽

10.2196/preprints.14763 ◽

2019 ◽

Author(s):

Jin-Young Min ◽

Sung-Hee Song ◽

HyeJin Kim ◽

Kyoung-Bok Min

Keyword(s):

Social Media ◽

South Korea ◽

Topic Modeling ◽

Occupational Injury ◽

Topic Model ◽

Insurance Claims ◽

Efficient Operation ◽

Social Media Data ◽

Hidden Knowledge ◽

Media Data

BACKGROUND Although injured employees are legally covered by workers’ compensation insurance in South Korea, some employers make agreements to prevent injured employees from claiming compensation. This leads to underreporting in occupational injury statistics. Illegal compensation (called gong-sang in Korean) is a critical method used to underreport or cover up occupational injuries. However, gong-sang is not counted in the official occupational injury statistics; therefore, gong-sang–related issues cannot be identified from those statistics.

OBJECTIVE This study aimed to analyze social media data using topic modeling to explore hidden knowledge about illegal compensation (gong-sang) for occupational injury in South Korea.

METHODS We collected 2210 documents from social media data by filtering on the keyword gong-sang. The study period was between January 1, 2006, and December 31, 2017. After processing the Korean text with a morphological analyzer, we performed topic modeling using latent Dirichlet allocation (LDA) in the Python library Gensim. A 10-topic model was selected and fitted with 3000 Gibbs sampling iterations.

RESULTS The LDA model was used to classify gong-sang–related documents into 4 categories from a total of 10 topics. Topic 1 was the greatest concern (60.5%): its keywords concerned the choice between illegal compensation and legal insurance claims, suggesting that workers who suffered industrial accidents were worried about which to pursue. In topic 2, keywords were associated with claims for industrial accident insurance benefits. Topics 3 and 4, the second greatest concern (19%), contained keywords implying monetary compensation for gong-sang. Topics 5 to 10 included keywords on jobs (ie, workers in the construction and defense industry, delivery riders, and foreign workers) and body parts (ie, injuries to the hands, face, teeth, lower limbs, and back) vulnerable to gong-sang.

CONCLUSIONS We explored hidden knowledge to identify the salient issues surrounding gong-sang using the LDA model. These topics may provide valuable information to ensure the more efficient operation of South Korea’s occupational health and safety administration and protect vulnerable workers from illegal gong-sang compensation practices.
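
The abstract above mentions fitting a 10-topic LDA model with Gibbs sampling. As a rough illustration of what that involves, here is a minimal collapsed Gibbs sampler in pure Python. It is a sketch under stated assumptions, not the authors' Gensim pipeline: the toy English tokens, the hyperparameters `alpha` and `beta`, and the helper names `lda_gibbs`, `top_words`, and `dominant_topic_shares` are all hypothetical, and whether the reported topic percentages were computed as dominant-topic shares is an assumption.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on already-tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: per-document topic counts, per-topic word counts, topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """The n most frequent words assigned to each topic."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


def dominant_topic_shares(doc_topic):
    """Fraction of documents whose most frequent topic is k, for each k."""
    num_topics = len(doc_topic[0])
    winners = Counter(
        max(range(num_topics), key=row.__getitem__) for row in doc_topic
    )
    return [winners[k] / len(doc_topic) for k in range(num_topics)]


# Hypothetical toy corpus (made-up tokens, not data from the study).
docs = [
    ["claim", "insurance", "benefit", "claim"],
    ["insurance", "claim", "accident", "benefit"],
    ["settlement", "money", "employer", "money"],
    ["employer", "settlement", "money", "accident"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, iterations=50)
shares = dominant_topic_shares(doc_topic)
print(top_words(topic_word), shares)
```

Each sweep reassigns every token in turn, so the count tables always sum to the number of tokens in the corpus. On a real corpus the tokens would come from a morphological analyzer rather than a hand-written list.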

Mining Hidden Knowledge About Illegal Compensation for Occupational Injury: Topic Model Approach

JMIR Medical Informatics ◽

10.2196/14763 ◽

2019 ◽

Vol 7 (3) ◽

pp. e14763 ◽

Cited By ~ 2

Author(s):

Jin-Young Min ◽

Sung-Hee Song ◽

HyeJin Kim ◽

Kyoung-Bok Min

Keyword(s):

Social Media ◽

South Korea ◽

Topic Modeling ◽

Occupational Injury ◽

Topic Model ◽

Insurance Claims ◽

Efficient Operation ◽

Social Media Data ◽

Hidden Knowledge ◽

Media Data

Topic modeling to mind illegal compensation for occupational injuries

European Journal of Public Health ◽

10.1093/eurpub/ckz186.317 ◽

2019 ◽

Vol 29 (Supplement_4) ◽

Author(s):

S H Song ◽

J Y Min ◽

H J Kim ◽

K B Min

Keyword(s):

Social Media ◽

Topic Modeling ◽

Social Insurance ◽

Latent Dirichlet Allocation ◽

Occupational Injuries ◽

Workplace Safety ◽

Body Parts ◽

Insurance Claims ◽

Social Media Data ◽

Media Data

Background Accurate reports of occupational injuries are important to monitor workplace safety and health initiatives. In South Korea, media reports, experts, and workers have constantly raised the issue of underreporting. Presumably this is because employers have strong market “incentives” to underreport their employees’ injuries. A critical way to underreport or cover up injuries is illegal compensation (called “gong-sang” in Korean). Unfortunately, “gong-sang” is not counted in official occupational injury statistics. The aim of this study was to analyze social media data using topic modeling and to explore issues surrounding “gong-sang”.

Methods We used web scraping technology and collected 2,210 social media documents through Web search engines. We transformed the unstructured textual documents into structured data using Python and applied latent Dirichlet allocation (LDA) in the Python library Gensim for topic modeling.

Results Applying LDA to the “gong-sang”-related documents, we identified 10 topics. Topic 1 was the greatest concern (60.5%), with keywords implying the choice between illegal compensation (“gong-sang”) and legal insurance claims. The next concern was topic 2, which included keywords associated with claims for industrial accident insurance benefits. The remaining topics (topics 3-10) covered monetary issues, precarious employment, and body parts vulnerable to “gong-sang”.

Conclusions We explored web-based data and identified the salient issues surrounding “gong-sang”. LDA topics may help ensure an efficient occupational health and safety scheme that protects vulnerable employees from “gong-sang” practices.

Key messages The topics formulated by LDA included queries about legal insurance claims. They covered legal insurance claims, including private or social insurance, as well as monetary compensation, injured body parts, and the types of jobs vulnerable to “gong-sang”.

Crash tags: Topic modeling social media data after fatal automated vehicle crashes

Proceedings of the Human Factors and Ergonomics Society Annual Meeting ◽

10.1177/1071181320641460 ◽

2020 ◽

Vol 64 (1) ◽

pp. 1909-1910

Author(s):

Ran Wei ◽

Hananeh Alambeigi ◽

Anthony McDonald

Keyword(s):

Social Media ◽

Topic Modeling ◽

Social Media Data ◽

Media Data

Perceiving Residents’ Festival Activities Based on Social Media Data: A Case Study in Beijing, China

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10070474 ◽

2021 ◽

Vol 10 (7) ◽

pp. 474

Author(s):

Bingqing Wang ◽

Bin Meng ◽

Juan Wang ◽

Siyu Chen ◽

Jian Liu

Keyword(s):

Social Media ◽

Language Processing ◽

Topic Model ◽

Central Area ◽

Classification Model ◽

Social Media Data ◽

Ring Road ◽

Different Types ◽

Spatial Differences ◽

Media Data

Social media data contain information expressed in real time, including text and geographical location. As a new data source for crowd behavior research in the era of big data, they can reflect some aspects of residents’ behavior. In this study, a text classification model based on BERT and the Transformers framework was constructed and used to classify and extract more than 210,000 residents’ festival activities from 1.13 million Sina Weibo (Chinese “Twitter”) records collected in Beijing in 2019. On this basis, word frequency statistics, part-of-speech analysis, topic modeling, sentiment analysis, and other methods were used to perceive different types of festival activities and quantitatively analyze the spatial differences among different types of festivals. The results show that traditional culture significantly influences residents’ festivals, reflecting residents’ motivation to participate in festivals and how they participate and express their emotions. There are apparent spatial differences in residents’ participation in festival activities: the main festival activities are concentrated in the central area within the Fifth Ring Road in Beijing, whereas posts expressing feelings during the festivals are mainly distributed outside the Fifth Ring Road. The research integrates natural language processing, topic model analysis, spatial statistical analysis, and other techniques. It also broadens the application of social media data, especially text data, providing a new research paradigm for studying residents’ festival activities and their perception of festivals. The research results provide a basis for the design and management of the Chinese festival system.

A Topic Modeling based Approach for Mining Online Social Media Data

2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) ◽

10.1109/icicict46008.2019.8993231 ◽

2019 ◽

Author(s):

Nimisha S. Fal Dessai ◽

J. A. Laxminarayanan

Keyword(s):

Social Media ◽

Topic Modeling ◽

Social Media Data ◽

Online Social Media ◽

Media Data

Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study

JMIR Public Health and Surveillance ◽

10.2196/14986 ◽

2020 ◽

Vol 6 (2) ◽

pp. e14986 ◽

Cited By ~ 2

Author(s):

Ashlynn R Daughton ◽

Rumi Chunara ◽

Michael J Paul

Keyword(s):

Infectious Disease ◽

Social Media ◽

Random Sample ◽

Topic Model ◽

Ground Truth ◽

Ground Truth Data ◽

Social Media Data ◽

Individual Level ◽

Small Effect Size ◽

Media Data

Background Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored, as this requires ground truth data for study.

Objective This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset.

Methods This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals to understand the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets.

Results Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19).

Conclusions To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.
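
To make the AUC values reported above easier to read, the sketch below computes the area under the ROC curve directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the labels and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = symptoms reported) and model scores.
informative = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
print(informative, uninformative)
```

A score that carries no information about the label yields 0.5, which is why values such as 0.51 or 0.53 indicate performance close to chance.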

Crowd Detection in Mass Gatherings Based on Social Media Data: A Case Study of the 2014 Shanghai New Year’s Eve Stampede

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph17228640 ◽

2020 ◽

Vol 17 (22) ◽

pp. 8640

Author(s):

Jiexiong Duan ◽

Weixin Zhai ◽

Chengqi Cheng

Keyword(s):

Social Media ◽

Topic Modeling ◽

Public Attitudes ◽

Three Dimensions ◽

Social Media Data ◽

Sina Weibo ◽

Mass Gatherings ◽

New Year’s Eve ◽

Media Data

The Shanghai New Year’s Eve stampede on 31 December 2014 caused 36 deaths and 47 injuries, drawing attention from around the world. This research aims to explore crowd aggregation from the perspective of Sina Weibo check-in data and evaluate the potential of crowd detection based on social media data. We develop a framework using Weibo check-in data in three dimensions: the aggregation level of check-in data, the topic changes in posts, and the sentiment fluctuations of citizens. The results show that the number of check-ins across Shanghai on New Year’s Eve was twice that of other days and that Moran’s I reached a peak on this date, indicating spatial autocorrelation. Additionally, the results of topic modeling indicate that 72.4% of the posts were related to the stampede, reflecting public attitudes and views on this incident from multiple angles. Moreover, sentiment analysis based on Weibo posts illustrates that the proportion of negative posts increased both when the stampede occurred (40.95%) and a few hours afterwards (44.33%). This study demonstrates the potential of using geotagged social media data to analyze population spatiotemporal activities, especially in emergencies.
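
The abstract above uses Moran’s I as its measure of spatial autocorrelation. As a minimal sketch, the statistic can be computed as I = (N / W) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where W is the sum of all weights. The cell values, the chain-shaped neighbor structure, and the helper names `morans_i` and `chain_weights` are hypothetical; the abstract does not state which spatial weights the study used.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations between neighboring cells.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * cross / variance_sum


def chain_weights(n):
    """Binary weights for n cells in a row, each adjacent to its neighbors."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1
    return w


# Hypothetical check-in indicators for four cells in a row.
w = chain_weights(4)
clustered = morans_i([1, 1, 0, 0], w)    # similar values sit next to each other
alternating = morans_i([1, 0, 1, 0], w)  # neighbors always differ
print(clustered, alternating)
```

Positive values indicate that similar values cluster in space, and negative values indicate that neighbors tend to differ.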

Detecting information requirements for crisis communication from social media data: An interactive topic modeling approach

International Journal of Disaster Risk Reduction ◽

10.1016/j.ijdrr.2020.101692 ◽

2020 ◽

Vol 50 ◽

pp. 101692

Author(s):

Qing Deng ◽

Yang Gao ◽

Chenyang Wang ◽

Hui Zhang

Keyword(s):

Social Media ◽

Crisis Communication ◽

Topic Modeling ◽

Information Requirements ◽

Social Media Data ◽

Modeling Approach ◽

Media Data

textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data

10.5220/0010559000002993 ◽

2021 ◽

Author(s):

Rob Churchill ◽

Lisa Singh

Keyword(s):

Social Media ◽

Topic Modeling ◽

Social Media Data ◽

Text Preprocessing ◽

Media Data

Tracking geographical locations using a geo-aware topic model for analyzing social media data

Decision Support Systems ◽

10.1016/j.dss.2017.05.006 ◽

2017 ◽

Vol 99 ◽

pp. 18-29 ◽

Cited By ~ 14

Author(s):

Marianela García Lozano ◽

Jonah Schreiber ◽

Joel Brynielsson

Keyword(s):

Social Media ◽

Topic Model ◽

Social Media Data ◽

Media Data ◽

Geographical Locations
