Automated Seeded Latent Dirichlet Allocation for Social Media Based Event Detection and Mapping

Cornelia Ferner; Clemens Havas; Elisabeth Birnbacher; Stefan Wegenkittl; Bernd Resch

doi:10.3390/info11080376

Information ◽

10.3390/info11080376 ◽

2020 ◽

Vol 11 (8) ◽

pp. 376 ◽

Cited By ~ 2

Author(s):

Cornelia Ferner ◽

Clemens Havas ◽

Elisabeth Birnbacher ◽

Stefan Wegenkittl ◽

Bernd Resch

Keyword(s):

Event Detection ◽

Disaster Response ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Geographic Area ◽

Relevant Information ◽

Suggested Approach ◽

Napa Valley ◽

Source Of Information

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages, and for supporting disaster management. Topic modeling can help in detecting disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words makes it possible to identify the corresponding disaster topic directly. To enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and Hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.
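The idea of deriving seed words from older Tweets of the same area can be illustrated with a minimal sketch. The abstract does not state how the seed words are scored, so the smoothed log-ratio of relative frequencies used here is an assumption, and the function names `tokenize` and `generate_seed_words` as well as the toy Tweets are hypothetical.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase, split on whitespace, and keep alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def generate_seed_words(event_tweets, baseline_tweets, top_n=3, smoothing=1.0):
    """Rank words by a smoothed log-ratio of relative frequencies.

    Assumed scoring rule (not taken from the paper): words that are much more
    frequent in the event-period Tweets than in older Tweets rank highest.
    """
    event_counts = Counter(tok for t in event_tweets for tok in tokenize(t))
    base_counts = Counter(tok for t in baseline_tweets for tok in tokenize(t))
    vocab = set(event_counts) | set(base_counts)
    event_total = sum(event_counts.values()) + smoothing * len(vocab)
    base_total = sum(base_counts.values()) + smoothing * len(vocab)

    def score(word):
        p_event = (event_counts[word] + smoothing) / event_total
        p_base = (base_counts[word] + smoothing) / base_total
        return math.log(p_event / p_base)

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda w: (-score(w), w))
    return ranked[:top_n]


# Made-up Tweets for illustration only.
baseline = ["coffee in napa this morning", "wine tasting in napa today"]
event = ["earthquake shook napa this morning", "huge earthquake damage in napa"]
seeds = generate_seed_words(event, baseline, top_n=1)
```

Words common to both periods, such as the place name, receive a score near zero, so they would not be picked as seeds under this rule.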

Mining Open Government Data for Business Intelligence Using Data Visualization: A Two-Industry Case Study

Journal of theoretical and applied electronic commerce research ◽

10.3390/jtaer16040059 ◽

2021 ◽

Vol 16 (4) ◽

pp. 1042-1065

Author(s):

Anne Gottfried ◽

Caroline Hartmann ◽

Donald Yates

Keyword(s):

Data Visualization ◽

Business Intelligence ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Open Government ◽

Open Government Data ◽

Market Opportunities ◽

Government Data ◽

Source Of Information

The business intelligence (BI) market has grown at a tremendous rate in the past decade due to technological advancements, big data and the availability of open source content. Despite this growth, the use of open government data (OGD) as a source of information is very limited in the private sector due to a lack of knowledge of its benefits. Scant evidence on the use of OGD by private organizations suggests that it can lead to the creation of innovative ideas as well as assist in making better-informed decisions. Given the benefits but lack of use of OGD to generate business intelligence, we extend research in this area by exploring how OGD can be used to generate business intelligence for the identification of market opportunities and strategy formulation, an area of research that is still in its infancy. Using a two-industry case study approach (footwear and lumber), we use latent Dirichlet allocation (LDA) topic modeling to extract emerging topics in these two industries from OGD, and a data visualization tool (pyLDAvis) to visualize the topics in order to interpret and transform the data into business intelligence. Additionally, we perform environmental scanning for the two industries to validate the usability of the information obtained. The results provide evidence that OGD can be a valuable source of information for generating business intelligence and demonstrate how topic modeling and visualization tools can assist organizations in extracting and analyzing information for the identification of market opportunities.

Analyzing the startup ecosystem of India: a Twitter analytics perspective

Journal of Advances in Management Research ◽

10.1108/jamr-08-2019-0164 ◽

2019 ◽

Vol 17 (2) ◽

pp. 262-281 ◽

Cited By ~ 12

Author(s):

Shiwangi Singh ◽

Akshay Chauhan ◽

Sanjay Dhir

Keyword(s):

Social Media ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Descriptive Analysis ◽

Relevant Information ◽

Media Analysis ◽

Social Media Analytics ◽

Content Type ◽

Twitter Analytics ◽

Bayes Algorithm

Purpose: The purpose of this paper is to use Twitter analytics for analyzing the startup ecosystem of India.

Design/methodology/approach: The paper uses descriptive analysis and content analytics techniques of social media analytics to examine 53,115 tweets from 15 Indian startups across different industries. The study also employs techniques such as the Naïve Bayes algorithm for sentiment analysis and the latent Dirichlet allocation algorithm for topic modeling of Twitter feeds to generate insights for the startup ecosystem in India.

Findings: The Indian startup ecosystem is inclined toward digital technologies, concerned with people, planet and profit, with resource availability and information as the key to success. The study categorizes the emotions of tweets as positive, neutral and negative. It was found that the Indian startup ecosystem has more positive sentiments than negative sentiments. Topic modeling enables the categorization of the identified keywords into clusters. Also, the study concludes on the note that the future of the Indian startup ecosystem is Digital India.

Research limitations/implications: The analysis provides a methodology that future researchers can use to extract relevant information from Twitter to investigate any issue.

Originality/value: Any attempt to analyze the startup ecosystem of India through social media analysis is limited. This research aims to bridge such a gap and tries to analyze the startup ecosystem of India from the lens of social media platforms like Twitter.

Topic Modeling in Embedding Spaces

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00325 ◽

2020 ◽

Vol 8 ◽

pp. 439-453 ◽

Cited By ~ 2

Author(s):

Adji B. Dieng ◽

Francisco J. R. Ruiz ◽

David M. Blei

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Predictive Performance ◽

Inner Product ◽

Natural Parameter ◽

Document Models ◽

Heavy Tailed ◽

Categorical Distribution

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
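The per-topic word distribution described above can be sketched in a few lines: the inner product of each word embedding with the topic embedding gives the natural parameter, and a softmax turns these into categorical probabilities. This is only an illustration of that one formula, not the fitted model or its inference algorithm; the function names and the toy two-dimensional embeddings are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def etm_word_distribution(word_embeddings, topic_embedding):
    """Categorical distribution over the vocabulary for one topic.

    The natural parameter of each word is the inner product between that
    word's embedding and the topic embedding.
    """
    logits = [
        sum(r * a for r, a in zip(rho_v, topic_embedding))
        for rho_v in word_embeddings
    ]
    return softmax(logits)


# Toy, made-up 2-d embeddings for a 3-word vocabulary.
vocab = ["flood", "rain", "stock"]
rho = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0]]
alpha_weather = [2.0, 0.0]
probs = etm_word_distribution(rho, alpha_weather)
```

Words whose embeddings point in the same direction as the topic embedding receive higher probability, which is how a topic can cover rare words that sit close to it in the embedding space.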

Incorporating Biterm Correlation Knowledge into Topic Modeling for Short Texts

The Computer Journal ◽

10.1093/comjnl/bxaa079 ◽

2020 ◽

Author(s):

Kai Zhang ◽

Yuan Zhou ◽

Zheng Chen ◽

Yufei Liu ◽

Zhuo Tang ◽

...

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Semantic Knowledge ◽

Superior Performance ◽

Knowledge Based ◽

Modeling Process ◽

Proposed Model ◽

Benchmark Datasets ◽

Latent Topic

The prevalence of short texts on the Web has made mining the latent topic structures of short texts a critical and fundamental task for many applications. However, due to the lack of word co-occurrence information induced by the content sparsity of short texts, it is challenging for traditional topic models like latent Dirichlet allocation (LDA) to extract coherent topic structures from short texts. Incorporating external semantic knowledge into the topic modeling process is an effective strategy to improve the coherence of inferred topics. In this paper, we develop a novel topic model, called the biterm correlation knowledge-based topic model (BCK-TM), to infer latent topics from short texts. Specifically, the proposed model mines biterm correlation knowledge automatically based on recent progress in word embedding, which can represent semantic information of words in a continuous vector space. To incorporate external knowledge, a knowledge incorporation mechanism is designed over the latent topic layer to regularize the topic assignment of each biterm during the topic sampling process. Experimental results on three public benchmark datasets illustrate the superior performance of the proposed approach over several state-of-the-art baseline models.

Ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x1801800107 ◽

2018 ◽

Vol 18 (1) ◽

pp. 101-117 ◽

Cited By ~ 10

Author(s):

Carlo Schwarz

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Text Documents ◽

Text Data ◽

Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
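The two kinds of distributions mentioned above (a topic distribution per document and a word distribution per topic) can be made concrete with a small sketch of the LDA generative story. This is not the `ldagibbs` command or its Gibbs sampler; it is a plain Python illustration, and the function names, toy vocabulary, and topic-word probabilities are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate one document under the LDA generative story."""
    # The document is represented by a probability distribution over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[k])[0]
        words.append(w)
    return theta, words


# Toy setup: two topics over a four-word vocabulary (made-up numbers).
vocab = ["rain", "flood", "stock", "price"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "weather" topic
    [0.0, 0.0, 0.5, 0.5],  # a "finance" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, [1.0, 1.0], vocab, 5, rng)
```

Fitting LDA runs this story in reverse: given only the words, it estimates `theta` for each document and the rows of `topic_word`.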

Topic Modeling Based Image Clustering by Events in Social Media

Scientific Programming ◽

10.1155/2016/5283471 ◽

2016 ◽

Vol 2016 ◽

pp. 1-7

Author(s):

Bin Xu ◽

Guoliang Fan ◽

Dan Yang

Keyword(s):

Event Detection ◽

Topic Modeling ◽

Joint Distribution ◽

Topic Model ◽

Geographic Information ◽

Image Clustering ◽

Social Event ◽

Textual Information ◽

Photo Collections ◽

Social Event Detection

Social event detection in large photo collections is very challenging, and multimodal clustering is an effective methodology for dealing with the problem. Geographic information is important in event detection. This paper proposes a topic-model-based approach to estimate the missing geographic information for photos. The approach utilizes a supervised multimodal topic model to estimate the joint distribution of time, geographic, content, and attached textual information. We then annotate the photos that lack geographic information with a predicted geographic coordinate. Experimental results indicate that the clustering performance is improved by the annotated geographic information.

CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background

Complexity ◽

10.1155/2018/2503816 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Lirong Qiu ◽

Jia Yu

Keyword(s):

Big Data ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

User Interest ◽

Text Data ◽

Data Set ◽

Data Sparsity ◽

Short Text ◽

Text Filtering

In the current big data context, effectively extracting useful information is a central problem. The purpose of this study is to construct a more effective method of mining the interest preferences of users in a particular field. We mainly study a large volume of user text data from microblogs. LDA is an effective method of text mining, but it does not perform well when applied directly to a large number of short microblog texts. In current topic modeling practice, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining the user’s interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the potential topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist in learning short texts. Short texts are then reused to filter the long texts to improve mining accuracy, so that long and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.
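The aggregation step mentioned above, in which short texts are pooled into longer pseudo-documents before topic modeling, can be sketched as follows. The abstract does not say which key is used for pooling, so grouping by user is an assumption, and `aggregate_by_user` and the sample posts are hypothetical. The sketch covers only the pooling, not the CLDA model itself.

```python
from collections import defaultdict


def aggregate_by_user(posts):
    """Concatenate each user's short posts into one long pseudo-document.

    `posts` is an iterable of (user, text) pairs. Pooling by user is an
    assumed choice; other keys such as hashtag or time window are possible.
    """
    grouped = defaultdict(list)
    for user, text in posts:
        grouped[user].append(text)
    return {user: " ".join(texts) for user, texts in grouped.items()}


# Made-up microblog posts for illustration only.
posts = [
    ("alice", "new phone camera test"),
    ("bob", "marathon training day"),
    ("alice", "phone battery review"),
]
long_texts = aggregate_by_user(posts)
```

Pooling gives each pseudo-document more word co-occurrences, which is what reduces sparsity, but it also mixes unrelated posts together, which is the noise problem the abstract describes.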

Stability of Topic Modeling via Modality Regularization

Computational Linguistics and Intellectual Technologies ◽

10.28995/2075-7182-2020-19-198-210 ◽

2020 ◽

Author(s):

R. Derbanosov ◽

M. Bakhanova

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Topic Model ◽

Side Information ◽

Auxiliary Information ◽

Discrete Distributions ◽

Probabilistic Latent Semantic Analysis ◽

Probabilistic Topic Modeling ◽

Random Initialization

Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, which means that different initial points produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant loss of model quality.
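One simple way to quantify sensitivity to random initialization is to compare the top words of the topics obtained from two runs. The abstract does not name the stability measure used in the paper, so the Jaccard-based score below is an assumption chosen for illustration, and the function names and toy topics are hypothetical. It does not use BigARTM.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def run_stability(topics_a, topics_b):
    """Average, over topics in run A, of the best Jaccard match in run B.

    Each run is a list of topics and each topic is a list of its top words.
    """
    best = [max(jaccard(ta, tb) for tb in topics_b) for ta in topics_a]
    return sum(best) / len(best)


# Made-up top words from two runs with different random seeds.
run1 = [["flood", "rain", "water"], ["stock", "market", "price"]]
run2 = [["market", "stock", "price"], ["rain", "flood", "storm"]]
stability = run_stability(run1, run2)
```

A score of 1.0 means every topic of the first run reappears with the same top words in the second run, regardless of topic order; lower values indicate that the two initializations led to different solutions.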

TAGGS: Grouping Tweets to Improve Global Geotagging for Disaster Response

10.5194/nhess-2017-203 ◽

2017 ◽

Cited By ~ 1

Author(s):

Jens de Bruijn ◽

Hans de Moel ◽

Brenden Jongman ◽

Jurjen Wagemaker ◽

Jeroen C. J. H. Aerts

Keyword(s):

Event Detection ◽

Disaster Response ◽

Spatial Information ◽

Geographical Information ◽

Future Application ◽

Accurate Information ◽

Global Event ◽

Social Media Platforms ◽

Relief Organizations ◽

Source Of Information

The availability of timely and accurate information about ongoing events is important for relief organizations seeking to effectively respond to disasters. Recently, social media platforms, and in particular Twitter, have gained traction as a novel source of information on disaster events. Unfortunately, geographical information is rarely attached to tweets, which hinders the use of Twitter for geographical applications. As a solution, analyses of a tweet’s text, combined with an evaluation of its metadata, can help to increase the number of geo-located tweets. This paper describes a new algorithm (TAGGS) that georeferences tweets by using the spatial information of groups of tweets mentioning the same location. This technique results in a roughly twofold increase in the number of geo-located tweets as compared to existing methods. We applied this approach to 35.1 million flood-related tweets in 12 languages, collected over 2.5 years. In the dataset, we found 11.6 million tweets mentioning one or more flood locations, which can be towns (6.9 million), provinces (3.3 million), or countries (2.2 million). Validation demonstrated that TAGGS correctly located about 65–75% of the tweets. As a future application, TAGGS could form the basis for a global event detection and monitoring system.

Topic modeling for untargeted substructure exploration in metabolomics

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1608041113 ◽

2016 ◽

Vol 113 (48) ◽

pp. 13738-13743 ◽

Cited By ~ 106

Author(s):

Justin Johan Jozias van der Hooft ◽

Joe Wandy ◽

Michael P. Barrett ◽

Karl E. V. Burgess ◽

Simon Rogers

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

De Novo ◽

Life Sciences ◽

Relevant Information ◽

Biological Behavior ◽

Molecular Fragments ◽

Structural Annotation ◽

Computational Tools ◽

Biochemical Processes

The potential of untargeted metabolomics to answer important questions across the life sciences is hindered because of a paucity of computational tools that enable extraction of key biochemically relevant information. Available tools focus on using mass spectrometry fragmentation spectra to identify molecules whose behavior suggests they are relevant to the system under study. Unfortunately, fragmentation spectra cannot identify molecules in isolation but require authentic standards or databases of known fragmented molecules. Fragmentation spectra are, however, replete with information pertaining to the biochemical processes present, much of which is currently neglected. Here, we present an analytical workflow that exploits all fragmentation data from a given experiment to extract biochemically relevant features in an unsupervised manner. We demonstrate that an algorithm originally used for text mining, latent Dirichlet allocation, can be adapted to handle metabolomics datasets. Our approach extracts biochemically relevant molecular substructures (“Mass2Motifs”) from spectra as sets of co-occurring molecular fragments and neutral losses. The analysis allows us to isolate molecular substructures, whose presence allows molecules to be grouped based on shared substructures regardless of classical spectral similarity. These substructures, in turn, support putative de novo structural annotation of molecules. Combining this spectral connectivity to orthogonal correlations (e.g., common abundance changes under system perturbation) significantly enhances our ability to provide mechanistic explanations for biological behavior.
