Topic modeling for large-scale text data

2015 ◽  
Vol 16 (6) ◽  
pp. 457-465 ◽  
Author(s):  
Xi-ming Li ◽  
Ji-hong Ouyang ◽  
You Lu
Author(s):  
Zhiqiang Cai ◽  
Amanda Siebert-Evenstone ◽  
Brendan Eagan ◽  
David Williamson Shaffer

2020 ◽  
Author(s):  
Amir Karami ◽  
Brandon Bookstaver ◽  
Melissa Nolan

BACKGROUND The COVID-19 pandemic has affected nearly all aspects of life and poses significant threats to international health and the economy. Given the rapidly unfolding nature of the pandemic, there is an urgent need to streamline synthesis of the growing scientific literature to identify targeted solutions. While traditional systematic literature reviews provide valuable insights, they have limitations: they analyze a limited number of papers, are subject to various biases, are time-consuming and labor-intensive, focus on a few topics, cannot support trend analysis, and lack data-driven tools.
OBJECTIVE This study addresses these limitations by analyzing two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, with text mining methods in a corpus of COVID-19 research papers, and by finding associations between the two concepts.
METHODS We collected COVID-19 pre-prints and peer-reviewed research papers published in 2020. We used frequency analysis to find the most frequent manifestations and therapeutic chemicals, taking frequency as an indicator of the importance of the two biomedical concepts. We also applied topic modeling to find relationships between the two concepts.
RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent terms for clinical manifestations of disease included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terms included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some of which were novel and not previously identified by the scientific community.
CONCLUSIONS Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of the literature, to educators for knowing the scope of the literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.
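
The frequency-analysis step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that "frequency" means the number of papers mentioning a term at least once; the abstract does not say which counting scheme was used. The function name `document_frequency`, the term lists, and the three toy sentences are all hypothetical and are not the study's data or code.

```python
import re
from collections import Counter


def document_frequency(papers, lexicon):
    """Count, for each lexicon term, how many papers mention it at least once."""
    counts = Counter()
    for text in papers:
        # Lowercase and split into simple word tokens (single-word terms only).
        tokens = set(re.findall(r"[a-z0-9\-]+", text.lower()))
        for term in lexicon:
            if term.lower() in tokens:
                counts[term] += 1
    return counts


# Toy inputs invented for illustration only (not the study's corpus).
papers = [
    "Patients presented with fever and cough; remdesivir was administered.",
    "Fever was common. Chloroquine showed no benefit.",
    "Pneumonia and cough were reported alongside remdesivir use.",
]
disease_terms = ["fever", "cough", "pneumonia"]
chemical_terms = ["remdesivir", "chloroquine"]

disease_counts = document_frequency(papers, disease_terms)
chemical_counts = document_frequency(papers, chemical_terms)
print(disease_counts.most_common())
print(chemical_counts.most_common())
```

Counting each paper once per term keeps a single long paper from dominating the ranking; raw occurrence counts would be an equally plausible reading of the abstract.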


2021 ◽  
Vol 11 (8) ◽  
pp. 3347
Author(s):  
Siqi Ma ◽  
Xin Wang ◽  
Xiaochen Wang ◽  
Hanyu Liu ◽  
Runtong Zhang

Although urban rail transit provides significant daily assistance to users, traffic risk remains. Turn-back faults are a common cause of traffic accidents. To address turn-back faults, machines can learn the complicated and detailed rules of the train’s internal communication codes, while engineers need to understand simple external features for quick judgment. Focusing on turn-back faults in urban rail, in this study we took advantage of related accumulated data to improve algorithmic and human diagnosis of this kind of fault. In detail, we first designed a novel framework combining rules and algorithms to help humans and machines understand the fault characteristics and collaborate in fault diagnosis, including determining the category to which the turn-back fault belongs, and identifying the simple and complicated judgment rules involved. Then, we established a dataset including tabular and text data for real application scenarios and carried out corresponding analysis of fault rule generation, diagnostic classification, and topic modeling. Finally, we presented the fault characteristics under the proposed framework. Qualitative and quantitative experiments were performed to evaluate the proposed method, and the experimental results show that (1) the framework is helpful in understanding the faults of trains that occur in three types of turn-back: automatic turn-back (ATB), automatic end change (AEC), and point mode end change (PEC); and (2) the framework can assist in diagnosing turn-back faults.


Author(s):  
Beth Lyall-Wilson ◽  
Nicolas Kim ◽  
Elizabeth Hohman

This paper describes the development and new application of a text modeling process for identifying human factors topics, such as fatigue, workload, and distraction, in aviation safety reports. Current approaches to identifying human factors topic representations in text data rely on manual review by subject matter experts. The implementation of a semi-supervised text modeling method overcomes the need for lengthy manual review through an initial extraction of pre-defined human factors topics, freeing time to focus on analyzing the information. This modeling approach allows analysts to use keywords to define topics of interest up front and influence the convergence of the model toward a result that reflects them, which provides an advantage over classic topic modeling approaches where domain knowledge is not integrated into the generation of derived topics. This paper includes a description of the modeling approach and rationale, data used, evaluation methods, challenges, and suggestions for future applications.
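
As a rough illustration of defining topics up front with keywords, the sketch below tags each report with every topic whose seed keywords appear in it. This covers only a keyword-matching first pass; it is not the semi-supervised model the abstract describes, which would use such seeds to steer model convergence. The function `tag_reports`, the seed lists, and the sample reports are hypothetical.

```python
import re


def tag_reports(reports, seed_keywords):
    """Return, for each report, the sorted list of topics whose seeds it contains."""
    tagged = []
    for text in reports:
        words = set(re.findall(r"[a-z]+", text.lower()))
        topics = sorted(
            topic for topic, seeds in seed_keywords.items() if words & set(seeds)
        )
        tagged.append(topics)
    return tagged


# Hypothetical seed keywords an analyst might supply for each topic.
seed_keywords = {
    "fatigue": ["fatigue", "tired", "sleepy"],
    "workload": ["workload", "task", "saturated"],
    "distraction": ["distraction", "distracted"],
}

# Toy reports invented for illustration only.
reports = [
    "Crew reported fatigue after a long duty day.",
    "High workload during approach led to a missed callout.",
    "Routine flight with no issues.",
]

tags = tag_reports(reports, seed_keywords)
print(tags)
```

Reports left untagged by the seeds are the ones a seeded topic model would still need to place, which is where the semi-supervised step goes beyond plain keyword matching.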


Author(s):  
Vivek Gupta ◽  
Rahul Wadbude ◽  
Nagarajan Natarajan ◽  
Harish Karnick ◽  
Prateek Jain ◽  
...  

We present a label embedding based approach to large-scale multi-label learning, drawing inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings. Besides leading to a highly scalable model for multi-label learning, our approach highlights interesting connections between label embedding methods commonly used for multi-label learning and paragraph embedding methods commonly used for learning representations of text data. The framework easily extends to incorporating auxiliary information such as label-label correlations; this is crucial especially when many training instances are only partially annotated. To facilitate end-to-end learning, we develop a joint learning algorithm that can learn the embeddings as well as a regression model that predicts these embeddings for the new input to be annotated, via efficient gradient based methods. We demonstrate the effectiveness of our approach through an extensive set of experiments on a variety of benchmark datasets, and show that the proposed models perform favorably as compared to state-of-the-art methods for large-scale multi-label learning.
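
For readers unfamiliar with SGNS, the sketch below computes the standard skip-gram negative-sampling loss for one (target, positive) pair and a set of sampled negatives, as used for word embeddings. It is not the paper's joint learning algorithm; how labels and instances map onto targets and contexts there is not stated in the abstract. The name `sgns_pair_loss` and the example vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_pair_loss(target, positive, negatives):
    """Standard SGNS loss for one training pair.

    loss = -log sigmoid(t . p) - sum over negatives n of log sigmoid(-(t . n))
    """
    loss = -math.log(sigmoid(dot(target, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(target, neg)))
    return loss


# Illustrative vectors only: the positive is aligned with the target,
# the negative points the opposite way, so the loss is small.
aligned = sgns_pair_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
# Swapping the roles makes the positive anti-aligned, so the loss grows.
misaligned = sgns_pair_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(aligned, misaligned)
```

Minimizing this loss pulls co-occurring pairs together and pushes sampled negatives apart; with a zero target vector every term equals log 2, a handy sanity check.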


2019 ◽  
Vol 8 (1) ◽  
pp. 112-135 ◽  
Author(s):  
Elena Musi ◽  
Mark Aakhus

This article offers a first large-scale analysis of argumentative polylogues in the fracking controversy. It provides an empirical methodology (macroscope) that identifies, from large quantities of text data through semantic frame analysis, the many players, positions and places presumed relevant to argumentation in a controversy. It goes beyond the usual study of framing in communication research because it considers that a controversy’s communicative context is shaped by, and in turn conditions, the making and defending of standpoints. To achieve these novel aims, theoretical insights from frame semantics, knowledge-driven argument mining, and argumentative polylogues are combined. The macroscope is implemented by using the Semafor parser to retrieve all the semantic frames present in a large corpus about fracking and then observing the distribution of those frames that semantically presuppose argumentative features of polylogue (meta-argumentative indicators). The prominent indicators are Taking_sides (indicator of “having an argument”), Evidence and Reasoning (indicators of “making an argument”). The automatic retrieval of the words associated with the core elements of the semantic frame enables the mapping of how different players, positions, and discussion venues are assembled around what is treated as disagreeable in the controversy. This knowledge-driven approach to argument mining reveals prototypical traits of polylogues related to environmental issues. Moreover, it addresses a problem in conventional frame analysis, common in environmental communication, that focuses on the way individual arguments are presented without effective consideration of the argumentative relevance of the semantics and pragmatics of certain frames operating across discourses.


2016 ◽  
Vol 366 ◽  
pp. 99-120 ◽  
Author(s):  
Nguyen Anh Tu ◽  
Dong-Luong Dinh ◽  
Mostofa Kamal Rasel ◽  
Young-Koo Lee
