Data Mining Technology Application in False Text Information Recognition

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Jie Wan ◽  
Xue Cao ◽  
Kun Yao ◽  
Donghui Yang ◽  
E. Peng ◽  
...  

False information on the Internet is increasingly regarded as a serious social harm. To recognize false text information, this paper proposes an effective method for mining text features, applied to false drug advertisements. First, false and real drug advertisements were collected from official websites to build a database. Second, features were extracted from the advertisement text to build a characteristic matrix from the effective features, and each feature vector in the matrix was labeled positive or negative according to whether it came from a false drug advertisement. Third, several classifiers were trained and tested, the model that best identified false drug advertisements was selected, and the key characteristics that determine the classification were identified. Finally, the best-performing model was used to predict new false drug advertisements collected from Sina Weibo. For identifying false drug advertisements, the support vector machine (SVM) classifier built on the feature set after feature selection performed best. The findings offer the government an effective method for identifying and combating false advertisements, and the study serves as a reference for using text data mining to identify and detect information fraud.
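The second step above, turning advertisement text into a labeled characteristic (feature) matrix, can be sketched as follows. This is only an illustration: the abstract does not say which text features were extracted, so a plain bag-of-words count is assumed here, and the function name `build_feature_matrix`, the `min_df` parameter, and the sample texts are all hypothetical.

```python
from collections import Counter

def build_feature_matrix(texts, labels, min_df=1):
    """Return (vocabulary, count matrix, labels) for a bag-of-words model.

    Assumption: whitespace tokenization and raw term counts; the paper's
    actual feature extraction is not described in the abstract.
    """
    tokenized = [t.lower().split() for t in texts]
    # Document frequency: in how many texts each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vocabulary = sorted(tok for tok, n in df.items() if n >= min_df)
    index = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok in toks:
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix, list(labels)

# Made-up snippets: 1 = false advertisement, 0 = genuine advertisement.
texts = [
    "cures all disease guaranteed",
    "take as directed by physician",
    "guaranteed miracle cure",
]
labels = [1, 0, 1]
vocab, X, y = build_feature_matrix(texts, labels)
```

Each row of `X` is one advertisement and each column one vocabulary term, so `X` and `y` together are the kind of input a classifier such as an SVM would be trained on.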

2019 ◽  
Vol 123 (1267) ◽  
pp. 1415-1436 ◽  
Author(s):  
A. B. A. Anderson ◽  
A. J. Sanjeev Kumar ◽  
A. B. Arockia Christopher

Data mining is the process of collecting and analysing large amounts of data in a database to discover correlations, patterns or relationships. Flight delay creates significant problems in the present aviation system. Data mining techniques are needed to analyse how micro-level causes propagate into system-level patterns of delay. Analysing flight delays is very difficult, both from a historical view and when estimating delays with forecast demand. This paper proposes using Decision Tree (DT), Support Vector Machine (SVM), Naive Bayesian (NB), K-nearest neighbour (KNN) and Artificial Neural Network (ANN) classifiers to study and analyse delays among aircraft. The performance of these classifiers is assessed on different regions of the updated datasets. The results show significant variation in performance across data mining methods and feature selectors for this problem. This paper aims to show how data mining techniques can be used to understand complex system delays in aviation. Our aim is to develop a classification model for studying and reducing delay using different data mining methods and, in this manner, to show that DT has a greater classification accuracy. Different feature selectors are used to reduce the number of initial attributes. Our results clearly demonstrate the value of DT for analysing and visualising how system-level effects arise from subsystem-level causes.


Author(s):  
Mohammad M. Masud ◽  
Latifur Khan ◽  
Bhavani Thuraisingham

This chapter applies data mining techniques to detect email worms. Email messages contain a number of different features, such as the total number of words in the message body or subject, the presence or absence of binary attachments, the type of attachments, and so on. The goal is to obtain an efficient classification model based on these features. The solution consists of several steps. First, the number of features is reduced using two different approaches: feature selection and dimension reduction. This step is necessary to reduce noise and redundancy in the data. The feature-selection technique is called Two-phase Selection (TPS), a novel combination of a decision tree and a greedy selection algorithm. The dimension reduction is performed by Principal Component Analysis. Second, the reduced data is used to train a classifier. Different classification techniques have been used, such as Support Vector Machine (SVM), Naïve Bayes and their combination. Finally, the trained classifiers are tested on a dataset containing both known and unknown types of worms. These results have been compared with published results. It is found that the proposed TPS selection along with SVM classification achieves the best accuracy in detecting both known and unknown types of worms.


2011 ◽  
Vol 480-481 ◽  
pp. 1144-1149
Author(s):  
Ya Qin Wang ◽  
Yu Ming Song

Currently, monitoring customs declarations with only limited examination of imported goods, given scarce resources, poses a considerable challenge to customs authorities worldwide. Identifying the factors that limit customs clearance, proposing improved solutions, and raising the efficiency of customs clearance in port logistics have a positive impact on a country's international trade and foreign investment. This paper presents a classification model, a form of data mining technology, to analyze the risk of commodities passing through customs clearance, and builds a customs inspection classifier as a reference for customs inspection and monitoring. A classification model based on a BP neural network is established and evaluated through experiments, which show that a classification data mining method can be used for risk evaluation in customs clearance to improve customs inspection and monitoring.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Li Zhao ◽  
Wenjing Qi ◽  
Meihong Zhu

How to choose suppliers scientifically is an important part of the strategic decision-making management of enterprises. Expert evaluation is subjective and uncontrollable; sometimes evaluations are biased, which leads to controversial or unfair results in supplier selection. To tackle this problem, this paper proposes a novel method that employs machine learning to learn the credibility of experts from historical data, which is then converted to weights in the evaluation process. We first use the Support Vector Machine (SVM) classifier to classify the historical evaluation data of experts and calculate the experts’ evaluation credibility, then determine the weights of the evaluation experts, and finally aggregate the weighted evaluation results to obtain a preference order for choosing suppliers. The main contribution of this method is that it overcomes the shortcomings of multiple conversions and large loss of evaluation information, preserves the initial evaluation information to the maximum extent, and improves the credibility of evaluation results and the fairness and scientific rigor of supplier selection. The results show that it is feasible to classify the past evaluation data of the evaluation experts with the SVM classification model, and that the expert weights determined on the basis of the experts’ evaluation credibility are adjustable.
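The last two steps described above, converting credibility into expert weights and aggregating weighted evaluations into a preference order, can be illustrated with a small sketch. The abstract does not give the conversion formula, so proportional normalization is assumed here; the credibility values, the scores, and the function names `credibility_weights` and `rank_suppliers` are hypothetical, and the SVM step that would produce the credibility values is not shown.

```python
def credibility_weights(credibility):
    """Normalize credibility scores so the expert weights sum to 1.

    Assumption: weights are proportional to credibility.
    """
    total = sum(credibility.values())
    if total <= 0:
        raise ValueError("credibility scores must sum to a positive value")
    return {expert: c / total for expert, c in credibility.items()}

def rank_suppliers(scores, weights):
    """Aggregate scores[expert][supplier] with expert weights.

    Returns (suppliers ordered best first, weighted totals per supplier).
    """
    totals = {}
    for expert, per_supplier in scores.items():
        for supplier, score in per_supplier.items():
            totals[supplier] = totals.get(supplier, 0.0) + weights[expert] * score
    order = sorted(totals, key=totals.get, reverse=True)
    return order, totals

# Made-up credibility values, e.g. the share of each expert's past
# evaluations that a classifier judged reliable.
credibility = {"E1": 0.9, "E2": 0.6, "E3": 0.5}
scores = {
    "E1": {"A": 8, "B": 6},
    "E2": {"A": 5, "B": 9},
    "E3": {"A": 7, "B": 7},
}
weights = credibility_weights(credibility)
order, totals = rank_suppliers(scores, weights)
```

Because the original scores are only multiplied by a weight and summed, no intermediate conversion of the evaluation data is needed, which is in the spirit of the information-preserving goal stated in the abstract.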


2019 ◽  
Vol 2019 ◽  
pp. 1-11
Author(s):  
Yawen Liu ◽  
Haijun Niu ◽  
Jianming Zhu ◽  
Pengfei Zhao ◽  
Hongxia Yin ◽  
...  

According to previous studies, many neuroanatomical alterations have been detected in patients with tinnitus. However, the results of these studies have been inconsistent. The objective of this study was to explore the cortical/subcortical morphological neuroimaging biomarkers that may characterize idiopathic tinnitus using machine learning methods. Forty-six patients with idiopathic tinnitus and fifty-six healthy subjects were included in this study. For each subject, the gray matter volume of 61 brain regions was extracted as an original feature pool. From this feature pool, a hybrid feature selection algorithm combining the F-score and sequential forward floating selection (SFFS) methods was performed to select features. Then, the selected features were used to train a support vector machine (SVM) model. The area under the curve (AUC) and accuracy were used to assess the performance of the classification model. As a result, a combination of 13 cortical/subcortical brain regions was found to have the highest classification accuracy for effectively differentiating patients with tinnitus from healthy subjects. These brain regions include the bilateral hypothalamus, right insula, bilateral superior temporal gyrus, left rostral middle frontal gyrus, bilateral inferior temporal gyrus, right inferior parietal lobule, right transverse temporal gyrus, right middle temporal gyrus, right cingulate gyrus, and left superior frontal gyrus. The accuracy in the training and test datasets was 80.49% and 80.00%, respectively, and the AUC was 0.8586. To the best of our knowledge, this is the first study to elucidate brain morphological changes in patients with tinnitus by applying an SVM classifier. This study provides validated cortical/subcortical morphological neuroimaging biomarkers to differentiate patients with tinnitus from healthy subjects and contributes to the understanding of neuroanatomical alterations in patients with tinnitus.
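As a rough illustration of the filter stage of such a hybrid selection, the sketch below ranks features by an F-score. The abstract does not define the F-score it uses, so the two-class form often paired with SVMs is assumed (between-class separation of the means divided by the sum of within-class sample variances). The function names and the toy values are hypothetical, and the SFFS wrapper stage is not shown.

```python
def f_score(pos, neg):
    """F-score of a single feature from its values in two classes.

    Assumed definition: squared deviations of each class mean from the
    overall mean, divided by the sum of the within-class sample variances.
    Each class needs at least two samples and non-zero total variance.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    return between / (var_pos + var_neg)

def rank_features(X_pos, X_neg):
    """Return (feature indices sorted by descending F-score, scores)."""
    n_features = len(X_pos[0])
    scores = [
        f_score([row[j] for row in X_pos], [row[j] for row in X_neg])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores

# Made-up values: rows are subjects, columns are two regional volumes.
X_patients = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
X_controls = [[7.0, 5.0], [8.0, 5.2], [9.0, 4.8]]
order, scores = rank_features(X_patients, X_controls)
```

In this toy data the first feature separates the two groups and the second does not, so the first ranks higher. A wrapper method such as SFFS would then search over subsets of the top-ranked features using classifier performance as the criterion.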


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-17
Author(s):  
Dandan Tang ◽  
Man Zhang ◽  
Jiabo Xu ◽  
Xueliang Zhang ◽  
Fang Yang ◽  
...  

Objective. Urumqi is one of the key areas of HIV/AIDS infection in Xinjiang and in China. The AIDS epidemic is spreading from high-risk groups to the general population, and the situation is still very serious. The goal of this study was to use four data mining algorithms to establish identification models of HIV infection and compare their predictive performance. Method. The data came from sentinel monitoring of three high-risk groups (injecting drug users (IDU), men who have sex with men (MSM), and female sex workers (FSW)) in Urumqi from 2009 to 2015 and included demographic characteristics, sex behavior, and serological detection results. We used age, marital status, education level, and other variables as input variables and HIV infection status as the output variable to establish four prediction models for the three datasets. We also used the confusion matrix, accuracy, sensitivity, specificity, precision, recall, and the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate classification performance and analyzed the importance of predictive variables. Results. The final experimental results show that the random forests algorithm obtained the best results, with a diagnostic accuracy of 94.4821% on the MSM dataset, 97.5136% on the FSW dataset, and 94.6375% on the IDU dataset. The k-nearest neighbors algorithm came out second, with 91.5258% diagnostic accuracy on the MSM dataset, 96.3083% on the FSW dataset, and 90.8287% on the IDU dataset, followed by the support vector machine (94.0182%, 98.0369%, and 91.3571%). The decision tree algorithm was the poorest among the four algorithms, with 79.1761% diagnostic accuracy on the MSM dataset, 87.0283% on the FSW dataset, and 74.3879% on the IDU dataset. Conclusions. Data mining technology, as a new method of assisting disease screening and diagnosis, can help medical personnel to screen and diagnose AIDS rapidly from a large amount of information.
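For readers unfamiliar with the evaluation measures listed in the Method section, the sketch below derives them from a binary confusion matrix. The labels are made up for illustration and are not taken from the study, and `binary_metrics` is a hypothetical helper name. Note that sensitivity and recall are the same quantity.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and common metrics (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Made-up labels: 1 = infected, 0 = not infected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

The sketch assumes both classes appear in the true labels and that at least one positive prediction is made; otherwise the corresponding ratio is undefined and the division would fail.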


Author(s):  
HENG ZHANG ◽  
DA-HAN WANG ◽  
CHENG-LIN LIU ◽  
HORST BUNKE

In this paper, we propose a method for text-query-based keyword spotting from online Chinese handwritten documents using a character classification model. The similarity between the query word and the handwriting is obtained by combining the character classification scores. The classifier is trained with a one-versus-all strategy so that it gives high similarity to the target class and low scores to the others. Using character classification-based word similarity also helps overcome the out-of-vocabulary (OOV) problem. We use a character-synchronous dynamic search algorithm to efficiently spot the query word in a large database. The retrieval performance is further improved by using competing character confusion and writer-adaptive thresholds. Our experimental results on the large handwriting database CASIA-OLHWDB justify the superiority of one-versus-all trained classifiers and the benefits of confidence transformation, character confusion and adaptive thresholds. In particular, a one-versus-all trained prototype classifier performs as well as a linear support vector machine (SVM) classifier but requires much less storage for the index file. The experimental comparison with keyword spotting based on handwritten text recognition also demonstrates the effectiveness of the proposed method.
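The idea of scoring a query word by combining per-character classification scores can be sketched in a heavily simplified form. The abstract does not state the combination rule, so a plain average is assumed; the handwriting is assumed to be already segmented into characters, which sidesteps the character-synchronous dynamic search the paper uses; and the scores, threshold, and function name `spot_keyword` are hypothetical.

```python
def spot_keyword(query, segments, threshold):
    """Slide the query over a sequence of character segments.

    segments: list of dicts mapping a character class to the classifier's
    score for that segment (classes not listed are treated as score 0.0).
    Returns (start index, word similarity) for windows at or above threshold.
    Assumption: word similarity is the mean of the per-character scores.
    """
    hits = []
    n = len(query)
    for start in range(len(segments) - n + 1):
        scores = [segments[start + k].get(ch, 0.0) for k, ch in enumerate(query)]
        similarity = sum(scores) / n
        if similarity >= threshold:
            hits.append((start, similarity))
    return hits

# Made-up classifier scores for four handwritten character segments.
segments = [
    {"数": 0.9, "教": 0.3},
    {"据": 0.8},
    {"挖": 0.7, "控": 0.4},
    {"掘": 0.85},
]
hits = spot_keyword("挖掘", segments, threshold=0.6)
```

Because the word score is built from character-level scores, any query that can be spelled with known character classes can be searched for, which is why this kind of approach avoids the out-of-vocabulary problem.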


2021 ◽  
Author(s):  
SANTI BEHERA ◽  
PRABIRA SETHY

The skin is the body's largest organ, weighing approximately 8 pounds in the average adult. It insulates us and shields our bodies from hazards. However, the skin is also vulnerable to damage and to changes from its original appearance, such as brown, black, or blue discoloration or combinations of those colors, known as pigmented skin lesions. These common pigmented skin lesions (CPSL) are the leading cause of skin cancer. In the healthcare sector, the categorization of CPSL remains a major problem because of inaccurate outputs, overfitting, and high computational costs. Hence, we propose a classification model based on multi-deep features and a support vector machine (SVM) for the classification of CPSL. The proposed system comprises two phases. First, the performance of 11 CNN models is evaluated in a deep feature extraction approach with SVM. Then, the deep features of the three top-performing CNN models are concatenated and passed to an SVM to categorize the CPSL. In the second step, 8192 and 12288 features are obtained by combining two and three networks, respectively, each contributing 4096 features from the top-performing CNN models. These features are also given to the SVM classifiers. The SVM results are also evaluated with the principal component analysis (PCA) algorithm applied to the combined features of 8192 and 12288 dimensions. The highest results are obtained with 12288 features. In the experiments, the combination of the deep features of AlexNet, VGG16, and VGG19 achieved the highest accuracy of 91.7% using the SVM classifier. The results show that the proposed method is a useful tool for CPSL classification.
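The feature counts quoted above follow directly from concatenating 4096-dimensional vectors end to end, as the small sketch below shows. The vectors are placeholders rather than real CNN activations, the helper name `concatenate_features` is hypothetical, and which two networks the paper paired for the 8192-feature case is not stated in the abstract, so the pairing here is an assumption.

```python
def concatenate_features(*vectors):
    """Join per-network feature vectors end to end into one vector."""
    combined = []
    for vec in vectors:
        combined.extend(vec)
    return combined

FEATURES_PER_NETWORK = 4096  # per-network size stated in the abstract

# Placeholder vectors standing in for deep features of one image.
alexnet = [0.1] * FEATURES_PER_NETWORK
vgg16 = [0.2] * FEATURES_PER_NETWORK
vgg19 = [0.3] * FEATURES_PER_NETWORK

pair = concatenate_features(alexnet, vgg16)            # 2 x 4096 features
triple = concatenate_features(alexnet, vgg16, vgg19)   # 3 x 4096 features
print(len(pair), len(triple))  # prints "8192 12288"
```

The combined vector for each image would then be the input row for the SVM, optionally after a dimensionality reduction step such as PCA.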

