Text Mining in Cybersecurity

Luciano Ignaczak; Guilherme Goldschmidt; Cristiano André Da Costa; Rodrigo Da Rosa Righi

doi:10.1145/3462477

Thesaurus-Based Automatic Indexing

Handbook of Research on Text and Web Mining Technologies ◽

10.4018/978-1-59904-990-8.ch020 ◽

2011 ◽

pp. 331-345

Author(s):

Luis M. de Campos

Keyword(s):

Text Mining ◽

Real World ◽

Future Research ◽

Automatic Indexing ◽

Distinguishing Features

In this chapter, we present a thesaurus application in the field of text mining and more specifically automatic indexing on the set of descriptors defined by a thesaurus. We begin by presenting various definitions and a mathematical thesaurus model, and also describe various examples of real world thesauri which are used in official institutions. We then explore the problem of thesaurus-based automatic indexing by describing its difficulties and distinguishing features and reviewing previous work in this area. Finally, we propose various lines of future research.
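As a rough illustration of what indexing a text on the descriptors of a thesaurus can mean, the sketch below assigns descriptors by counting occurrences of their entry terms. It is not the chapter's model: the toy thesaurus, the function name index_document, and the plain term-counting rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus (an assumption for illustration): each
# descriptor (preferred term) maps to the entry terms that lead to it.
THESAURUS = {
    "text mining": ["text mining", "text analytics"],
    "automatic indexing": ["automatic indexing", "machine indexing"],
    "thesauri": ["thesaurus", "thesauri", "controlled vocabulary"],
}


def index_document(text, thesaurus=THESAURUS):
    """Return (descriptor, count) pairs, most frequent first.

    A descriptor's count is the number of whole-word occurrences of any
    of its entry terms in the lowercased, punctuation-free text.
    """
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    counts = Counter()
    for descriptor, entry_terms in thesaurus.items():
        for term in entry_terms:
            pattern = rf"\b{re.escape(term)}\b"
            counts[descriptor] += len(re.findall(pattern, normalized))
    # Keep only descriptors that were actually matched.
    return [(d, c) for d, c in counts.most_common() if c > 0]


sample = ("Text analytics relies on a controlled vocabulary; "
          "a thesaurus supports machine indexing.")
ranked = index_document(sample)
```

Exact matching of this kind misses inflected forms, ambiguous terms, and descriptors that are implied but never named, which is part of what makes the problem hard.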


Crowdsourcing: a systematic review of the literature using text mining

Industrial Management & Data Systems ◽

10.1108/imds-08-2020-0474 ◽

2020 ◽

Vol 120 (11) ◽

pp. 2041-2065

Author(s):

Ioanna Pavlidou ◽

Savvas Papagiannidis ◽

Eric Tsui

Keyword(s):

Text Mining ◽

Literature Review ◽

Systematic Literature Review ◽

Relevant Literature ◽

Reliability And Validity ◽

Future Research ◽

Content Type ◽

Main Research ◽

Critical Elements ◽

A Value

Purpose: This study is a systematic literature review of crowdsourcing that aims to present the research evidence so far regarding the extent to which it can contribute to organisational performance and produce innovations and provide insights on how organisations can operationalise it successfully. Design/methodology/approach: The systematic literature review revolved around a text mining methodology analysing 106 papers. Findings: The themes identified are performance, innovation, operational aspects and motivations. The review revealed a few potential directions for future research in each of the themes considered. Practical implications: This study helps researchers to consider the recent themes on crowdsourcing and identify potential areas for research. At the same time, it provides practitioners with an understanding of the usefulness and process of crowdsourcing and insights on what the critical elements are in order to organise a successful crowdsourcing project. Originality/value: This study employed quantitative content analysis in order to identify the main research themes with higher reliability and validity. It is also the first review on crowdsourcing that incorporates the relevant literature on crowdfunding as a value-creation tool.


Big Data and Digital Aesthetic, Arts, and Cultural Education: Hot Spots of Current Quantitative Research

Social Science Computer Review ◽

10.1177/0894439319888455 ◽

2019 ◽

pp. 089443931988845 ◽

Cited By ~ 1

Author(s):

Alexander Christ ◽

Marcus Penthin ◽

Stephan Kröner

Keyword(s):

Big Data ◽

Text Mining ◽

Predictive Modeling ◽

Hot Spots ◽

Quantitative Research ◽

Hot Spot ◽

Future Research ◽

Original Research ◽

Cultural Education ◽

Data Volume

Systematic reviews are the method of choice to synthesize research evidence. To identify main topics (so-called hot spots) relevant to large corpora of original publications in need of a synthesis, one must address the “three Vs” of big data (volume, velocity, and variety), especially in loosely defined or fragmented disciplines. For this purpose, text mining and predictive modeling are very helpful. Thus, we applied these methods to a compilation of documents related to digitalization in aesthetic, arts, and cultural education, as a prototypical, loosely defined, fragmented discipline, and particularly to quantitative research within it (QRD-ACE). By broadly querying the abstract and citation database Scopus with terms indicative of QRD-ACE, we identified a corpus of N = 55,553 publications for the years 2013–2017. As the result of an iterative approach of text mining, priority screening, and predictive modeling, we identified n = 8,304 potentially relevant publications of which n = 1,666 were included after priority screening. Analysis of the subject distribution of the included publications revealed video games as a first hot spot of QRD-ACE. Topic modeling resulted in aesthetics and cultural activities on social media as a second hot spot, related to 4 of k = 8 identified topics. This way, we were able to identify current hot spots of QRD-ACE by screening less than 15% of the corpus. We discuss implications for harnessing text mining, predictive modeling, and priority screening in future research syntheses and avenues for future original research on QRD-ACE.


Better Prediction of Mutation Score

10.36227/techrxiv.14905032 ◽

2021 ◽

Author(s):

Yossi Gil ◽

Dor Ma’ayan

Keyword(s):

Neural Networks ◽

Real World ◽

State Of The Art ◽

Future Research ◽

World Systems ◽

Reliable Measurement ◽

Mutation Score ◽

Class Level ◽

Effectiveness Prediction ◽

Better Than

Mutation score is widely accepted to be a reliable measurement for the effectiveness of software tests. Recent studies, however, show that mutation analysis is extremely costly and hard to use in practice. We present a novel direct prediction model of mutation score using neural networks. Relying solely on static code features that do not require generation of mutants or execution of the tests, we predict mutation score with an accuracy better than a quintile. When we include statement coverage as a feature, our accuracy rises to about a decile. Using a similar approach, we also improve the state-of-the-art results for binary test effectiveness prediction and introduce an intuitive, easy-to-calculate set of features superior to previously studied sets. We also publish the largest dataset of test-class level mutation score and static code features data to date, for future research. Finally, we discuss how our approach could be integrated into real-world systems, IDEs, CI tools, and testing frameworks.
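For readers unfamiliar with the quantity being predicted, a minimal sketch of the conventional mutation score follows: the fraction of non-equivalent mutants that the test suite kills. The function name and the optional equivalent-mutant parameter are assumptions for this example; the abstract does not state which variant of the definition the authors use, and this is the costly measured value, not their prediction model.

```python
def mutation_score(killed: int, total: int, equivalent: int = 0) -> float:
    """Fraction of non-equivalent mutants killed by the test suite.

    killed     -- mutants for which at least one test fails
    total      -- all generated mutants
    equivalent -- mutants known to behave identically to the original
                  program (excluded from the denominator)
    """
    effective = total - equivalent
    if effective <= 0:
        raise ValueError("need at least one non-equivalent mutant")
    if not 0 <= killed <= effective:
        raise ValueError("killed must lie between 0 and the non-equivalent count")
    return killed / effective


score = mutation_score(killed=45, total=60)
```

Obtaining the killed count normally means generating every mutant and running the tests against each, which is the expense a direct prediction from static features avoids.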


Graph Neural Networks: Taxonomy, Advances, and Trends

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/3495161 ◽

2022 ◽

Vol 13 (1) ◽

pp. 1-54

Author(s):

Yu Zhou ◽

Haixia Zheng ◽

Xin Huang ◽

Shufeng Hao ◽

Dengao Li ◽

...

Keyword(s):

Neural Networks ◽

Real World ◽

Research Community ◽

Future Research ◽

Research Directions ◽

Comprehensive Review ◽

Future Research Directions ◽

Graph Neural Networks ◽

Low Dimensional

Graph neural networks provide a powerful toolkit for embedding real-world graphs into low-dimensional spaces according to specific tasks. Several surveys on this topic already exist. However, each emphasizes a different angle, so readers cannot see a panorama of graph neural networks. This survey aims to overcome this limitation and provide a systematic and comprehensive review of graph neural networks. First, we provide a novel taxonomy for graph neural networks, and then refer to up to 327 relevant publications to show the panorama of the field. All of them are classified into the corresponding categories. To drive graph neural networks into a new stage, we summarize four future research directions that address the challenges faced. We expect that more scholars will come to understand and exploit graph neural networks and use them in their research communities.


2D Convolutional Neural Markov Models for Spatiotemporal Sequence Forecasting

Sensors ◽

10.3390/s20154195 ◽

2020 ◽

Vol 20 (15) ◽

pp. 4195

Author(s):

Calvin Janitra Halim ◽

Kazuhiko Kawamoto

Keyword(s):

Neural Networks ◽

Real World ◽

Short Term Memory ◽

Markov Models ◽

Synthetic Data ◽

Future Research ◽

Target Sequence ◽

Temporal Models ◽

Spatial Aspect ◽

Future Work

Recent approaches to time series forecasting, especially forecasting spatiotemporal sequences, have leveraged the approximation power of deep neural networks to model the complexity of such sequences, specifically approaches that are based on recurrent neural networks. Still, as spatiotemporal sequences that arise in the real world are noisy and chaotic, modeling approaches that utilize probabilistic temporal models, such as deep Markov models (DMMs), are favorable because of their ability to model uncertainty, increasing their robustness to noise. However, approaches based on DMMs do not maintain the spatial characteristics of spatiotemporal sequences, with most of the approaches converting the observed input into 1D data halfway through the model. To solve this, we propose a model that retains the spatial aspect of the target sequence with a DMM that consists of 2D convolutional neural networks. We then show the robustness of our method to data with large variance compared with naive forecast, vanilla DMM, and convolutional long short-term memory (LSTM) using synthetic data, even outperforming the DNN models over a longer forecast period. We also point out the limitations of our model when forecasting real-world precipitation data and the possible future work that can be done to address these limitations, along with additional future research potential.


An Analysis on Text Mining Techniques for Smart Literature Review

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2021/1121022021 ◽

2021 ◽

Vol 10 (2) ◽

pp. 1284-1288

Keyword(s):

Text Mining ◽

Review Paper ◽

Evaluation Criteria ◽

Unstructured Data ◽

Future Research ◽

Literature Analysis ◽

Research Papers ◽

Text Data ◽

Web Technologies ◽

Analysis Process

With the development of web technologies, databases, social networks, and similar platforms, a large amount of text data is generated each day. Most of the data on the internet is in unstructured form. This unstructured data can provide valuable knowledge, and text mining techniques are widely used to extract it. Each day, large numbers of research papers are published in journals and conferences. These papers are valuable for future research and investigations and act as a source for future innovations. Researchers write review papers to give updated knowledge about a specific field, but review papers draw on a limited number of papers and involve manually reading each one. Due to the large volume of research papers published each day, it is not possible for researchers to go through each paper to find the updated knowledge about their field of interest. To automate the literature analysis process, different text mining techniques have been used. This paper provides a review of text mining techniques used in automatic literature analysis. We collected papers in which previous literature is analyzed with text mining techniques to extract valuable knowledge. This review paper presents an overview of text mining techniques, their evaluation criteria, their limitations, and the challenges of exploring literature to find research trends.


Data Preparation for Big Data Analytics

Advances in Business Information Systems and Analytics - Enterprise Big Data Engineering, Analytics, and Management ◽

10.4018/978-1-5225-0293-7.ch010 ◽

2016 ◽

pp. 157-170 ◽

Cited By ~ 3

Author(s):

Andreas Schmidt ◽

Martin Atzmueller ◽

Martin Hollender

Keyword(s):

Big Data ◽

Real World ◽

Data Analytics ◽

Big Data Analytics ◽

Unstructured Data ◽

Future Research ◽

Data Preparation ◽

Chemical Production ◽

Specific Project

This chapter provides an overview of methods for preprocessing structured and unstructured data in the scope of Big Data. Specifically, it summarizes the corresponding methods in the context of a real-world dataset in a petrochemical production setting. The chapter describes state-of-the-art methods for data preparation for Big Data Analytics. Furthermore, it discusses experiences and first insights from a specific project setting with respect to a real-world case study. Finally, interesting directions for future research are outlined.


Sentiment Mining Approaches for Big Data Classification and Clustering

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2805-0.ch002 ◽

2018 ◽

pp. 34-63

Author(s):

Ashok Kumar J ◽

Abirami S ◽

Tina Esther Trueman

Keyword(s):

Big Data ◽

Text Mining ◽

Sentiment Analysis ◽

Real World ◽

Unstructured Data ◽

Real World Problem ◽

Sentiment Mining ◽

Big Data Classification ◽

The Web ◽

Classification And Clustering

Sentiment analysis is one of the most important applications in the field of text mining. It analyzes people's opinions, comments, posts, reviews, evaluations, and emotions expressed about products, sales, services, individuals, organizations, etc. Nowadays, large amounts of structured and unstructured data are being produced on the web. Categorizing and grouping these data has become a real-world problem. In this chapter, the authors address the current research in this field, its open issues, and the problem of sentiment analysis on Big Data for classification and clustering. The chapter suggests new methods, applications, algorithm extensions of classification and clustering, and software tools in the field of sentiment analysis.


Better Prediction of Mutation Score

10.36227/techrxiv.14905032.v1 ◽

2021 ◽

Author(s):

Yossi Gil ◽

Dor Ma’ayan

Keyword(s):

Neural Networks ◽

Real World ◽

State Of The Art ◽

Future Research ◽

World Systems ◽

Reliable Measurement ◽

Mutation Score ◽

Class Level ◽

Effectiveness Prediction ◽

Better Than

Mutation score is widely accepted to be a reliable measurement for the effectiveness of software tests. Recent studies, however, show that mutation analysis is extremely costly and hard to use in practice. We present a novel direct prediction model of mutation score using neural networks. Relying solely on static code features that do not require generation of mutants or execution of the tests, we predict mutation score with an accuracy better than a quintile. When we include statement coverage as a feature, our accuracy rises to about a decile. Using a similar approach, we also improve the state-of-the-art results for binary test effectiveness prediction and introduce an intuitive, easy-to-calculate set of features superior to previously studied sets. We also publish the largest dataset of test-class level mutation score and static code features data to date, for future research. Finally, we discuss how our approach could be integrated into real-world systems, IDEs, CI tools, and testing frameworks.
