Topic modeling for large-scale text data

BACKGROUND The COVID-19 pandemic has affected nearly all aspects of life and poses significant threats to international health and the economy. Given how rapidly the pandemic is unfolding, there is an urgent need to streamline synthesis of the growing scientific literature in order to identify targeted solutions. Traditional systematic literature reviews provide valuable insights, but they have limitations: they analyze a limited number of papers, are subject to various biases, are time-consuming and labor-intensive, focus on a few topics, cannot support trend analysis, and lack data-driven tools. OBJECTIVE This study addresses these limitations by using text mining methods to analyze two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, in a corpus of COVID-19 research papers and to find associations between the two concepts. METHODS We collected COVID-19 pre-prints and peer-reviewed papers published in 2020. We used frequency analysis to find the most frequent manifestations and therapeutic chemicals, treating frequency as a measure of importance. We also applied topic modeling to find relationships between the two biomedical concepts. RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestation terms included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terms included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling yielded 25 categories showing relationships between the two overarching concepts. These categories represent statistically significant associations between multiple aspects of each concept, some of which were novel and not previously identified by the scientific community. CONCLUSIONS Understanding this context is vital given the lack of systematic large-scale literature review surveys and the importance of fast literature review for developing treatments during the current COVID-19 pandemic. This study helps researchers obtain a macro-level picture of the literature, educators understand its scope, journals explore the most discussed disease symptoms and pharmaceutical targets, and policymakers and funding agencies create strategic scientific plans regarding COVID-19.
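The frequency-analysis step in the abstract above can be illustrated with a minimal sketch. The term lists, the sample sentences, and the function names are hypothetical and are not taken from the study. Plain document co-occurrence counting is used here as a simple stand-in for the association step, which the study performed with topic modeling.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical term lists; the study's actual vocabularies are not given in the abstract.
DISEASE_TERMS = {"pneumonia", "fever", "cough"}
CHEMICAL_TERMS = {"remdesivir", "chloroquine", "oxygen"}


def tokenize(text):
    """Return the set of lowercase word tokens in a document."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def document_frequencies(docs, terms):
    """Count how many documents mention each term at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc) & terms)
    return counts


def cooccurrence(docs, left, right):
    """Count documents in which a term from `left` appears with a term from `right`."""
    pairs = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        pairs.update(product(sorted(tokens & left), sorted(tokens & right)))
    return pairs


# Made-up example sentences standing in for paper abstracts.
docs = [
    "Remdesivir shortened recovery in patients with pneumonia and fever.",
    "Chloroquine was evaluated in patients presenting with fever and cough.",
    "Supplemental oxygen is standard supportive care for severe pneumonia.",
]
disease_df = document_frequencies(docs, DISEASE_TERMS)
chemical_df = document_frequencies(docs, CHEMICAL_TERMS)
pairs = cooccurrence(docs, DISEASE_TERMS, CHEMICAL_TERMS)
```

Counting each term once per document keeps long papers from dominating the ranking. A real pipeline would also need entity normalization (synonyms, brand names, multi-word terms), which this sketch omits.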


A Framework for Diagnosing Urban Rail Train Turn-Back Faults Based on Rules and Algorithms

Applied Sciences ◽

10.3390/app11083347 ◽

2021 ◽

Vol 11 (8) ◽

pp. 3347

Author(s):

Siqi Ma ◽

Xin Wang ◽

Xiaochen Wang ◽

Hanyu Liu ◽

Runtong Zhang

Keyword(s):

Topic Modeling ◽

Traffic Accidents ◽

Internal Communication ◽

Urban Rail Transit ◽

Rail Transit ◽

Rule Generation ◽

Text Data ◽

Common Cause ◽

Qualitative And Quantitative ◽

Urban Rail

Although urban rail transit provides significant daily assistance to users, traffic risk remains. Turn-back faults are a common cause of traffic accidents. To address turn-back faults, machines are able to learn the complicated and detailed rules of the train’s internal communication codes, and engineers must understand simple external features for quick judgment. Focusing on turn-back faults in urban rail, in this study we took advantage of related accumulated data to improve algorithmic and human diagnosis of this kind of fault. In detail, we first designed a novel framework combining rules and algorithms to help humans and machines understand the fault characteristics and collaborate in fault diagnosis, including determining the category to which the turn-back fault belongs, and identifying the simple and complicated judgment rules involved. Then, we established a dataset including tabular and text data for real application scenarios and carried out corresponding analysis of fault rule generation, diagnostic classification, and topic modeling. Finally, we present the fault characteristics under the proposed framework. Qualitative and quantitative experiments were performed to evaluate the proposed method, and the experimental results show that (1) the framework is helpful in understanding the faults of trains that occur in three types of turn-back: automatic turn-back (ATB), automatic end change (AEC), and point mode end change (PEC); (2) our proposed framework can assist in diagnosing turn-back faults.


Modeling Human Factors Topics in Aviation Reports

Proceedings of the Human Factors and Ergonomics Society Annual Meeting ◽

10.1177/1071181319631095 ◽

2019 ◽

Vol 63 (1) ◽

pp. 126-130

Author(s):

Beth Lyall-Wilson ◽

Nicolas Kim ◽

Elizabeth Hohman

Keyword(s):

Human Factors ◽

Topic Modeling ◽

Domain Knowledge ◽

Aviation Safety ◽

Subject Matter Experts ◽

Text Data ◽

Modeling Approach ◽

Modeling Process ◽

Manual Review ◽

Initial Extraction

This paper describes the development and new application of a text modeling process for identifying human factors topics, such as fatigue, workload, and distraction, in aviation safety reports. Current approaches to identifying human factors topic representations in text data rely on manual review by subject matter experts. A semi-supervised text modeling method removes the need for lengthy manual review through an initial extraction of pre-defined human factors topics, freeing time to focus on analyzing the information. This modeling approach allows analysts to use keywords to define topics of interest up front and to steer the convergence of the model toward a result that reflects them, which is an advantage over classic topic modeling approaches, where domain knowledge is not integrated into the generation of derived topics. This paper includes a description of the modeling approach and rationale, data used, evaluation methods, challenges, and suggestions for future applications.
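The idea of defining topics up front with keywords can be sketched as a simple seeded tagging pass. This is only an illustration of the initial extraction step under assumed inputs, not the semi-supervised model described in the paper. The seed lists, the sample report, and the function names are hypothetical.

```python
import re

# Hypothetical seed keywords; the paper's actual seed lists are not given in the abstract.
SEED_TOPICS = {
    "fatigue": {"fatigue", "tired", "sleepy"},
    "workload": {"workload", "busy", "overloaded"},
    "distraction": {"distracted", "distraction", "interrupted"},
}


def seed_topic_scores(report, seeds=SEED_TOPICS):
    """Count seed-keyword hits per pre-defined topic in one report."""
    tokens = re.findall(r"[a-z]+", report.lower())
    return {
        topic: sum(token in words for token in tokens)
        for topic, words in seeds.items()
    }


def initial_extraction(report, seeds=SEED_TOPICS):
    """Return the pre-defined topics with at least one seed hit, sorted by name."""
    scores = seed_topic_scores(report, seeds)
    return sorted(topic for topic, hits in scores.items() if hits > 0)


# Made-up report text for illustration only.
report = (
    "The first officer was tired after a long duty day "
    "and became distracted during the approach."
)
labels = initial_extraction(report)
```

In a semi-supervised setting, seed hits like these would typically act as priors or anchors that bias the topic model, so reports that never use a seed word can still be assigned to a seeded topic through co-occurring vocabulary.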


Topic Modeling of Large Scale Social Text

DEStech Transactions on Computer Science and Engineering ◽

10.12783/dtcse/cimns2017/17424 ◽

2018 ◽

Author(s):

JIA-WEN WANG ◽

QUN YANG

Keyword(s):

Topic Modeling ◽

Large Scale


Distributional Semantics Meets Multi-Label Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013747 ◽

2019 ◽

Vol 33 ◽

pp. 3747-3754 ◽

Cited By ~ 2

Author(s):

Vivek Gupta ◽

Rahul Wadbude ◽

Nagarajan Natarajan ◽

Harish Karnick ◽

Prateek Jain ◽

...

Keyword(s):

Large Scale ◽

Learning Algorithm ◽

Auxiliary Information ◽

Distributional Semantics ◽

Text Data ◽

Joint Learning ◽

Benchmark Datasets ◽

Gradient Based ◽

Label Correlations ◽

Embedding Methods

We present a label embedding based approach to large-scale multi-label learning, drawing inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings. Besides leading to a highly scalable model for multi-label learning, our approach highlights interesting connections between label embedding methods commonly used for multi-label learning and paragraph embedding methods commonly used for learning representations of text data. The framework easily extends to incorporating auxiliary information such as label-label correlations; this is crucial especially when many training instances are only partially annotated. To facilitate end-to-end learning, we develop a joint learning algorithm that can learn the embeddings as well as a regression model that predicts these embeddings for the new input to be annotated, via efficient gradient based methods. We demonstrate the effectiveness of our approach through an extensive set of experiments on a variety of benchmark datasets, and show that the proposed models perform favorably as compared to state-of-the-art methods for large-scale multi-label learning.
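For readers unfamiliar with Skip Gram Negative Sampling, the per-pair loss it minimizes can be sketched as follows. This is the generic SGNS loss for one positive pair and a set of sampled negatives, written with plain Python lists. It is not the paper's joint learning algorithm, and the vectors and function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(target, positive, negatives):
    """Negative SGNS objective for one (target, positive) pair.

    The loss is low when the target embedding aligns with the observed
    (positive) embedding and points away from the sampled negatives.
    """
    loss = -math.log(sigmoid(dot(target, positive)))
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(target, negative)))
    return loss


# Illustrative 2-d embeddings: a well-aligned case and a misaligned case.
good = sgns_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = sgns_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

In a label embedding setting, the target and positive vectors would correspond to labels that co-occur on the same training instance, and the negatives to labels sampled from the rest of the label set.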


Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing

Computational Collective Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-319-24069-5_35 ◽

2015 ◽

pp. 371-379

Author(s):

Ling Wang ◽

Wei Ding ◽

Tie Hua Zhou ◽

Keun Ho Ryu

Keyword(s):

Data Processing ◽

Large Scale ◽

High Dimensional ◽

Analysis Method ◽

Text Data ◽

Relevance Analysis


Framing fracking

Journal of Argumentation in Context ◽

10.1075/jaic.18016.mus ◽

2019 ◽

Vol 8 (1) ◽

pp. 112-135 ◽

Cited By ~ 4

Author(s):

Elena Musi ◽

Mark Aakhus

Keyword(s):

Large Scale ◽

Frame Analysis ◽

Environmental Communication ◽

Text Data ◽

Large Scale Analysis ◽

Automatic Retrieval ◽

Core Elements ◽

Semantic Frames ◽

The Many ◽

Communicative Context

This article offers a first large-scale analysis of argumentative polylogues in the fracking controversy. It provides an empirical methodology (macroscope) that identifies, from large quantities of text data through semantic frame analysis, the many players, positions and places presumed relevant to argumentation in a controversy. It goes beyond the usual study of framing in communication research because it considers that a controversy’s communicative context is shaped by, and in turn conditions, the making and defending of standpoints. To achieve these novel aims, theoretical insights from frame semantics, knowledge-driven argument mining, and argumentative polylogues are combined. The macroscope is implemented by using the Semafor parser to retrieve all the semantic frames present in a large corpus about fracking and then observing the distribution of those frames that semantically presuppose argumentative features of polylogue (meta-argumentative indicators). The prominent indicators are Taking_sides (an indicator of “having an argument”) and Evidence and Reasoning (indicators of “making an argument”). The automatic retrieval of the words associated with the core elements of the semantic frame enables the mapping of how different players, positions, and discussion venues are assembled around what is treated as disagreeable in the controversy. This knowledge-driven approach to argument mining reveals prototypical traits of polylogues related to environmental issues. Moreover, it addresses a problem in conventional frame analysis, common in environmental communication, which focuses on the way individual arguments are presented without effective consideration of the argumentative relevance of the semantics and pragmatics of certain frames operating across discourses.


Topic modeling and improvement of image representation for large-scale image retrieval

Information Sciences ◽

10.1016/j.ins.2016.05.029 ◽

2016 ◽

Vol 366 ◽

pp. 99-120 ◽

Cited By ~ 11

Author(s):

Nguyen Anh Tu ◽

Dong-Luong Dinh ◽

Mostofa Kamal Rasel ◽

Young-Koo Lee

Keyword(s):

Image Retrieval ◽

Topic Modeling ◽

Large Scale ◽

Image Representation ◽

Large Scale Image Retrieval
