A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Han-Sub Shin; Hyuk-Yoon Kwon; Seung-Jin Ryu

doi:10.3390/electronics9091527

A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Electronics ◽

10.3390/electronics9091527 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1527 ◽

Cited By ~ 1

Author(s):

Han-Sub Shin ◽

Hyuk-Yoon Kwon ◽

Seung-Jin Ryu

Keyword(s):

Deep Learning ◽

Text Classification ◽

Area Under The Curve ◽

Word Embedding ◽

Classification Model ◽

Data Set ◽

Feature Vectors ◽

Model Based ◽

Proposed Model ◽

The Difference

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and -negative tweets from a collection of tweets. For this, we propose a novel word embedding model, called contrastive word embedding, that enables to maximize the difference between base embedding models. First, we define CSI-positive and -negative corpora, which are used for constructing embedding models. Here, to supplement the imbalance of tweet data sets, we additionally employ the background knowledge for each tweet corpus: (1) CVE data set for CSI-positive corpus and (2) Wikitext data set for CSI-negative corpus. Second, we adopt the deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with CSI-positive corpus only and (2) another model with CSI-negative corpus only. As a result, we indicate that the proposed model shows high accuracy, i.e., 0.934 of F1-score and 0.935 of area under the curve (AUC), which improves the baseline models by 1.76∼6.74% of F1-score and by 1.64∼6.98% of AUC.

Download Full-text

Text Classification Model Based on BERT-Capsule with Integrated Deep Learning

10.1109/iciea51954.2021.9516041 ◽

2021 ◽

Author(s):

Yuwei Tian ◽

Zhi Zhang

Keyword(s):

Deep Learning ◽

Text Classification ◽

Classification Model ◽

Model Based

Download Full-text

A multi-label text classification model based on ELMo and attention

MATEC Web of Conferences ◽

10.1051/matecconf/202030903015 ◽

2020 ◽

Vol 309 ◽

pp. 03015

Author(s):

Wenbin Liu ◽

Bojian Wen ◽

Shang Gao ◽

Jiesheng Zheng ◽

Yinlong Zheng

Keyword(s):

Language Processing ◽

Text Classification ◽

Attention Mechanism ◽

Sentiment Classification ◽

Classification Model ◽

Data Set ◽

Model Based ◽

Related Information ◽

Related Text ◽

Common Application

Text classification is a common application in natural language processing. We proposed a multi-label text classification model based on ELMo and attention mechanism which help solve the problem for the sentiment classification task that there is no grammar or writing convention in power supply related text and the sentiment related information disperses in the text. Firstly, we use pre-trained word embedding vector to extract the feature of text from the Internet. Secondly, the analyzed deep information features are weighted according to the attention mechanism. Finally, an improved ELMo model in which we replace the LSTM module with GRU module is used to characterize the text and information is classified. The experimental results on Kaggle’s toxic comment classification data set show that the accuracy of sentiment classification is as high as 98%.

Download Full-text

Intelligent phishing detection scheme using deep learning algorithms

Journal of Enterprise Information Management ◽

10.1108/jeim-01-2020-0036 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Moruf Akin Adebowale ◽

Khin T. Lwin ◽

M. A. Hossain

Keyword(s):

Deep Learning ◽

Large Data ◽

Classification Model ◽

High Tech ◽

Data Set ◽

Detection Scheme ◽

Content Type ◽

Phishing Attacks ◽

Proposed Model ◽

Phishing Detection

PurposePhishing attacks have evolved in recent years due to high-tech-enabled economic growth worldwide. The rise in all types of fraud loss in 2019 has been attributed to the increase in deception scams and impersonation, as well as to sophisticated online attacks such as phishing. The global impact of phishing attacks will continue to intensify, and thus, a more efficient phishing detection method is required to protect online user activities. To address this need, this study focussed on the design and development of a deep learning-based phishing detection solution that leveraged the universal resource locator and website content such as images, text and frames.Design/methodology/approachDeep learning techniques are efficient for natural language and image classification. In this study, the convolutional neural network (CNN) and the long short-term memory (LSTM) algorithm were used to build a hybrid classification model named the intelligent phishing detection system (IPDS). To build the proposed model, the CNN and LSTM classifier were trained by using 1m universal resource locators and over 10,000 images. Then, the sensitivity of the proposed model was determined by considering various factors such as the type of feature, number of misclassifications and split issues.FindingsAn extensive experimental analysis was conducted to evaluate and compare the effectiveness of the IPDS in detecting phishing web pages and phishing attacks when applied to large data sets. The results showed that the model achieved an accuracy rate of 93.28% and an average detection time of 25 s.Originality/valueThe hybrid approach using deep learning algorithm of both the CNN and LSTM methods was used in this research work. On the one hand, the combination of both CNN and LSTM was used to resolve the problem of a large data set and higher classifier prediction performance. Hence, combining the two methods leads to a better result with less training time for LSTM and CNN architecture, while using the image, frame and text features as a hybrid for our model detection. The hybrid features and IPDS classifier for phishing detection were the novelty of this study to the best of the authors' knowledge.

Download Full-text

An Analysis Method for Interpretability of CNN Text Classification Model

Future Internet ◽

10.3390/fi12120228 ◽

2020 ◽

Vol 12 (12) ◽

pp. 228

Author(s):

Peng Ce ◽

Bao Tie

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Text Classification ◽

Sequence Data ◽

Classification Model ◽

Learning Technology ◽

Analysis Method ◽

Text Data ◽

Data Set ◽

Backtracking Analysis

With continuous development of artificial intelligence, text classification has gradually changed from a knowledge-based method to a method based on statistics and machine learning. Among them, it is a very important and efficient way to classify text based on the convolutional neural network (CNN) model. Text data are a kind of sequence data, while time sequentiality of the general text data is relatively weak, so text classification is usually less relevant to the sequential structure of the full text. Therefore, CNN-based text classification has gradually become a research hotspot when dealing with issues of text classification. For machine learning, especially deep learning, model interpretability has increasingly become the focus of academic research and industrial applications, and also become a key issue for further development and application of deep learning technology. Therefore, we recommend using the backtracking analysis method to conduct in-depth research on deep learning models. This paper proposes an analysis method for interpretability of a CNN text classification model. The method proposed by us can perform multi-angle analysis on the discriminant results of multi-classified text and multi-label classification tasks through backtracking analysis on model prediction results. Finally, the analysis results of the model can be displayed using visualization technology from multiple dimensions based on interpretability. The representative data set IMDB (Internet Movie Database) in text classification is verified by examples, and the results show that the model can be effectively analyzed when using our method.

Download Full-text

Human Activity Recognition using Fourier Transform Inspired Deep Learning Combination Model

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327908666180727123657 ◽

2019 ◽

Vol 9 (1) ◽

pp. 16-31

Author(s):

Kyungkoo Jun

Keyword(s):

Fourier Transform ◽

Deep Learning ◽

Short Term Memory ◽

Window Size ◽

Sensor Data ◽

Data Sets ◽

Data Set ◽

Proposed Model ◽

Testing Data ◽

Labeling Scheme

Background & Objective: This paper proposes a Fourier transform inspired method to classify human activities from time series sensor data. Methods: Our method begins by decomposing 1D input signal into 2D patterns, which is motivated by the Fourier conversion. The decomposition is helped by Long Short-Term Memory (LSTM) which captures the temporal dependency from the signal and then produces encoded sequences. The sequences, once arranged into the 2D array, can represent the fingerprints of the signals. The benefit of such transformation is that we can exploit the recent advances of the deep learning models for the image classification such as Convolutional Neural Network (CNN). Results: The proposed model, as a result, is the combination of LSTM and CNN. We evaluate the model over two data sets. For the first data set, which is more standardized than the other, our model outperforms previous works or at least equal. In the case of the second data set, we devise the schemes to generate training and testing data by changing the parameters of the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy is over 95% for some cases. We also analyze the effect of the parameters on the performance.

Download Full-text

Hybrid Chinese text classification model based on pretraining model

Journal of Physics Conference Series ◽

10.1088/1742-6596/1961/1/012002 ◽

2021 ◽

Vol 1961 (1) ◽

pp. 012002

Author(s):

Xing Zhaoye ◽

Liu Xiaoqun ◽

Sun Peijie

Keyword(s):

Chinese Text ◽

Text Classification ◽

Classification Model ◽

Chinese Text Classification ◽

Model Based

Download Full-text

Enhancing the performance of cancer text classification model based on cancer hallmarks

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i2.pp316-323 ◽

2021 ◽

Vol 10 (2) ◽

pp. 316

Author(s):

Noha Ali ◽

Ahmed H. AbuEl-Atta ◽

Hala H. Zayed

Keyword(s):

Neural Network ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Classification Model ◽

Biomedical Text ◽

Cancer Hallmarks ◽

Embedding Technique ◽

Proposed Model ◽

Biomedical Text Classification

<span id="docs-internal-guid-cb130a3a-7fff-3e11-ae3d-ad2310e265f8"><span>Deep learning (DL) algorithms achieved state-of-the-art performance in computer vision, speech recognition, and natural language processing (NLP). In this paper, we enhance the convolutional neural network (CNN) algorithm to classify cancer articles according to cancer hallmarks. The model implements a recent word embedding technique in the embedding layer. This technique uses the concept of distributed phrase representation and multi-word phrases embedding. The proposed model enhances the performance of the existing model used for biomedical text classification. The result of the proposed model overcomes the previous model by achieving an F-score equal to 83.87% using an unsupervised technique that trained on PubMed abstracts called PMC vectors (PMCVec) embedding. Also, we made another experiment on the same dataset using the recurrent neural network (RNN) algorithm with two different word embeddings Google news and PMCVec which achieving F-score equal to 74.9% and 76.26%, respectively.</span></span>

Download Full-text

A multi-label classification model for full slice brain computerised tomography image

BMC Bioinformatics ◽

10.1186/s12859-020-3503-0 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Jianqiang Li ◽

Guanghui Fu ◽

Yueda Chen ◽

Pengzhi Li ◽

Bo Liu ◽

...

Keyword(s):

Deep Learning ◽

Ct Images ◽

Image Features ◽

Classification Model ◽

Brain Diseases ◽

Ct Scans ◽

Computerised Tomography ◽

Brain Ct ◽

Proposed Model ◽

The Brain

Abstract Background Screening of the brain computerised tomography (CT) images is a primary method currently used for initial detection of patients with brain trauma or other conditions. In recent years, deep learning technique has shown remarkable advantages in the clinical practice. Researchers have attempted to use deep learning methods to detect brain diseases from CT images. Methods often used to detect diseases choose images with visible lesions from full-slice brain CT scans, which need to be labelled by doctors. This is an inaccurate method because doctors detect brain disease from a full sequence scan of CT images and one patient may have multiple concurrent conditions in practice. The method cannot take into account the dependencies between the slices and the causal relationships among various brain diseases. Moreover, labelling images slice by slice spends much time and expense. Detecting multiple diseases from full slice brain CT images is, therefore, an important research subject with practical implications. Results In this paper, we propose a model called the slice dependencies learning model (SDLM). It learns image features from a series of variable length brain CT images and slice dependencies between different slices in a set of images to predict abnormalities. The model is necessary to only label the disease reflected in the full-slice brain scan. We use the CQ500 dataset to evaluate our proposed model, which contains 1194 full sets of CT scans from a total of 491 subjects. Each set of data from one subject contains scans with one to eight different slice thicknesses and various diseases that are captured in a range of 30 to 396 slices in a set. The evaluation results present that the precision is 67.57%, the recall is 61.04%, the F1 score is 0.6412, and the areas under the receiver operating characteristic curves (AUCs) is 0.8934. Conclusion The proposed model is a new architecture that uses a full-slice brain CT scan for multi-label classification, unlike the traditional methods which only classify the brain images at the slice level. It has great potential for application to multi-label detection problems, especially with regard to the brain CT images.

Download Full-text

Chinese Text Classification Model Based on Deep Learning

Future Internet ◽

10.3390/fi10110113 ◽

2018 ◽

Vol 10 (11) ◽

pp. 113 ◽

Cited By ~ 17

Author(s):

Yue Li ◽

Xutao Wang ◽

Pengjian Xu

Keyword(s):

Neural Network ◽

Deep Learning ◽

Language Processing ◽

Chinese Text ◽

Text Classification ◽

Short Term Memory ◽

Classification Model ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory

Text classification is of importance in natural language processing, as the massive text information containing huge amounts of value needs to be classified into different categories for further use. In order to better classify text, our paper tries to build a deep learning model which achieves better classification results in Chinese text than those of other researchers’ models. After comparing different methods, long short-term memory (LSTM) and convolutional neural network (CNN) methods were selected as deep learning methods to classify Chinese text. LSTM is a special kind of recurrent neural network (RNN), which is capable of processing serialized information through its recurrent structure. By contrast, CNN has shown its ability to extract features from visual imagery. Therefore, two layers of LSTM and one layer of CNN were integrated to our new model: the BLSTM-C model (BLSTM stands for bi-directional long short-term memory while C stands for CNN.) LSTM was responsible for obtaining a sequence output based on past and future contexts, which was then input to the convolutional layer for extracting features. In our experiments, the proposed BLSTM-C model was evaluated in several ways. In the results, the model exhibited remarkable performance in text classification, especially in Chinese texts.

Download Full-text

The Assisted Positioning Technology for High Speed Train Based on Deep Learning

Applied Sciences ◽

10.3390/app10238625 ◽

2020 ◽

Vol 10 (23) ◽

pp. 8625

Author(s):

Yali Song ◽

Yinghong Wen

Keyword(s):

Deep Learning ◽

High Speed ◽

Detection Method ◽

Generative Adversarial Networks ◽

Single Shot ◽

High Speed Train ◽

Data Set ◽

Detection Model ◽

Model Based ◽

Cumulative Error

In the positioning process of a high-speed train, cumulative error may result in a reduction in the positioning accuracy. The assisted positioning technology based on kilometer posts can be used as an effective method to correct the cumulative error. However, the traditional detection method of kilometer posts is time-consuming and complex, which greatly affects the correction efficiency. Therefore, in this paper, a kilometer post detection model based on deep learning is proposed. Firstly, the Deep Convolutional Generative Adversarial Networks (DCGAN) algorithm is introduced to construct an effective kilometer post data set. This greatly reduces the cost of real data acquisition and provides a prerequisite for the construction of the detection model. Then, by using the existing optimization as a reference and further simplifying the design of the Single Shot multibox Detector (SSD) model according to the specific application scenario of this paper, the kilometer post detection model based on an improved SSD algorithm is established. Finally, from the analysis of the experimental results, we know that the detection model established in this paper ensures both detection accuracy and efficiency. The accuracy of our model reached 98.92%, while the detection time was only 35.43 ms. Thus, our model realizes the rapid and accurate detection of kilometer posts and improves the assisted positioning technology based on kilometer posts by optimizing the detection method.

Download Full-text