Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms

2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Francesco Scalfati

Monitoring specific features of enterprises, for example the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly due to the amount of personnel involved. Automatic detection of this information would allow considerable savings. This can actually be performed by relying on computer engineering, since in general this information is publicly available online through corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website and the class is the presence or absence of e-commerce facilities. The automatic generation of such data records requires several Text Mining phases; in particular, we compare six strategies based on the selection of best words and best n-grams. We then classify the obtained dataset by means of four classification algorithms: Support Vector Machines; Random Forest; Statistical and Logical Analysis of Data; and the Logistic Classifier. This turns out to be a difficult classification problem; however, after careful design and set-up of the whole procedure, the results on a practical case of Italian enterprises are encouraging.
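A minimal sketch of the pipeline described above, using scikit-learn as an illustrative toolkit (the paper does not specify an implementation): website text is turned into best-word/n-gram features and classified with an SVM. The example texts and labels are invented stand-ins for scraped corporate websites.

```python
# Hypothetical data: website text -> n-gram features -> SVM, mirroring the
# "best n-grams + Support Vector Machines" strategy described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for text scraped from corporate websites (not the paper's data).
sites = [
    "add to cart checkout secure payment shipping",
    "our company history mission contact us",
    "buy now online shop basket credit card",
    "press releases careers about the team",
]
labels = [1, 0, 1, 0]  # 1 = e-commerce facilities present, 0 = absent

# Keep the best word uni/bi-grams, weighted with TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100)
X = vectorizer.fit_transform(sites)

clf = LinearSVC()
clf.fit(X, labels)
pred = clf.predict(vectorizer.transform(["checkout cart payment"]))
```

In practice each record would be built from one crawled website, and the six word/n-gram selection strategies compared by swapping the vectorizer configuration.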

2018 ◽  
Vol 11 (1) ◽  
pp. 2 ◽  
Author(s):  
Tao Zhang ◽  
Hong Tang

Detailed information about built-up areas is valuable for mapping complex urban environments. Although a large number of classification algorithms for such areas have been developed, they are rarely tested from the perspective of feature engineering and feature learning. Therefore, we launched an investigation to provide a full test of Operational Land Imager (OLI) imagery for 15-m resolution built-up area classification in 2015, in Beijing, China. Training a classifier requires many sample points, so we propose a method based on the European Space Agency's (ESA) 38-m global built-up area data of 2014, OpenStreetMap, and MOD13Q1-NDVI to achieve the rapid and automatic generation of a large number of sample points. Our aim was to examine the influence of a single pixel and an image patch under traditional feature engineering and modern feature learning strategies. In feature engineering, we consider spectra, shape, and texture as the input features, and support vector machine (SVM), random forest (RF), and AdaBoost as the classification algorithms. In feature learning, the convolutional neural network (CNN) is used as the classification algorithm. In total, 26 built-up land cover maps were produced. The experimental results show the following: (1) The approaches based on feature learning are generally better than those based on feature engineering in terms of classification accuracy, and the performance of ensemble classifiers (e.g., RF) is comparable to that of the CNN. The two-dimensional CNN and the 7-neighborhood RF achieve the highest classification accuracies, at nearly 91%; (2) Overall, the classification effect and accuracy based on image patches are better than those based on single pixels. Features that highlight the information of the target category (e.g., PanTex (texture-derived built-up presence index) and the enhanced morphological building index (EMBI)) can help improve classification accuracy.
The code and experimental results are available at https://github.com/zhangtao151820/CompareMethod.
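The patch-versus-pixel comparison above hinges on giving a per-pixel classifier access to neighbourhood context. A toy sketch of that step (the function name and array are illustrative, not from the paper's repository): extract one flattened 7x7 patch per pixel, suitable as input rows for a Random Forest, as in the "7-neighborhood RF" variant.

```python
# Toy sketch: one 7x7 edge-padded patch per pixel, so a pixel-wise classifier
# such as Random Forest can use neighbourhood context.
import numpy as np

def extract_patches(image, size=7):
    """Return one flattened size x size patch per pixel (edge-padded)."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    patches = np.empty((h * w, size * size), dtype=image.dtype)
    k = 0
    for i in range(h):
        for j in range(w):
            patches[k] = padded[i:i + size, j:j + size].ravel()
            k += 1
    return patches

img = np.arange(100, dtype=np.float32).reshape(10, 10)  # stand-in for one OLI band
X = extract_patches(img)  # one feature row per pixel, ready for an RF
```

The centre element of each row is the original pixel value, so the single-pixel baseline is the special case `size=1`.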


2019 ◽  
Vol 12 (2) ◽  
pp. 32-38
Author(s):  
Iin Ernawati

This study applies text-based data mining, often called text mining, using two commonly used classification methods: the Naïve Bayes classifier (NBC) and the support vector machine (SVM). The classification is focused on Indonesian-language documents, while the relationship between documents is measured by probabilities that can be cross-checked with other classification algorithms. This is evident from the Naïve Bayes classifier (NBC) result that the word "party" appears in both the economic and the political documents, and from the support vector machine (SVM) result that the words "price" and "kpk" are contained in both the economic and the political documents.
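A hedged sketch of the Naïve Bayes side of the comparison, with invented toy documents (not the study's corpus): the classifier learns per-class word evidence of the kind discussed above for words such as "party", "price", and "kpk".

```python
# Toy Indonesian-style documents; labels are the two topics from the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "harga pasar naik ekonomi inflasi",   # economics
    "partai pemilu politik koalisi",      # politics
    "harga bahan pokok ekonomi turun",    # economics
    "kpk politik korupsi partai",         # politics
]
y = ["ekonomi", "politik", "ekonomi", "politik"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, y)  # learns P(word | class) with smoothing

pred = nb.predict(vec.transform(["harga ekonomi"]))
```

The learned `feature_log_prob_` attribute exposes the per-class word probabilities that the abstract refers to.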


2019 ◽  
Author(s):  
Nazmul Kazi ◽  
Indika Kahanda

Current health care systems require clinicians to spend a substantial amount of time digitally documenting their interactions with their patients through the use of electronic health records (EHRs), limiting the time spent on face-to-face patient care. Moreover, the use of EHRs is known to be highly inefficient due to the additional time it takes for completion, which also leads to clinician burnout. In this project, we explore the feasibility of developing an automated case notes system for psychiatrists using text mining techniques that will listen to doctor-patient conversations, generate digital transcripts using speech-to-text conversion, classify information from the transcripts into relevant categories, and automatically generate structured case notes. In our preliminary work, we develop a human-powered doctor-patient conversation transcript annotator and obtain a gold standard dataset through the National Alliance on Mental Illness (NAMI) Montana. We model the task of classifying parts of conversations into six broad categories, such as medical and family history, as a supervised classification problem and apply several popular machine learning algorithms. According to our preliminary experimental results obtained through 5-fold cross-validation, Support Vector Machines are able to classify an unseen transcript with an average AUROC (area under the receiver operating characteristic curve) score of 89%. Finally, using part-of-speech (POS) tagging, the grammatical rules of the English language, and verb conjugation, we generate written versions of the pieces of text belonging to the different categories. These formal texts are aggregated to fill the different sections of the EHR forms.
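The category-classification step above can be sketched as follows, with synthetic utterances and a binary simplification of the six categories (category names and texts are made up for the example; the paper's data comes from NAMI Montana). The AUROC scoring mirrors the evaluation described.

```python
# Synthetic transcript segments -> SVM -> AUROC, echoing the pipeline above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

utterances = [
    "my father also had depression", "my mother was treated for anxiety",
    "i take sertraline every morning", "the doctor changed my medication",
    "my grandmother had similar symptoms", "i started a new prescription",
]
y = [1, 1, 0, 0, 1, 0]  # 1 = family history, 0 = medication (binary for brevity)

model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(utterances, y)

# Margin distances serve as ranking scores for the AUROC computation.
scores = model.decision_function(utterances)
auroc = roc_auc_score(y, scores)
```

With the full six categories this becomes a multi-class problem evaluated with a per-class (one-vs-rest) AUROC, averaged as in the reported 89%.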


2012 ◽  
Vol 11 (01) ◽  
pp. 1250006 ◽  
Author(s):  
Fadi Thabtah ◽  
Omar Gharaibeh ◽  
Rashid Al-Zubaidy

A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. The text classification arena has recently attracted many researchers because of the massive amounts of online documents and text archives, which hold essential information for decision-making processes. In this field, most research focuses on classifying English documents, while there are limited studies on other languages such as Arabic. In this respect, the paper investigates the problem of Arabic text classification comprehensively. More specifically, the study measures the performance of different rule-based classification approaches adopted from machine learning and data mining on the problem of Arabic text classification. In particular, four rule-based classification approaches, decision trees (C4.5), rule induction (RIPPER), hybrid (PART), and simple rule (OneR), are evaluated against the published Corpus of Contemporary Arabic text collection. The experimentation is carried out using a modified version of the WEKA business intelligence tool. By analysing the results produced by the experimentation, we determine the most suitable classification algorithms for classifying Arabic texts.


2020 ◽  
Vol 6 (2) ◽  
pp. 4-11
Author(s):  
Silvija Vlah Jerić

The main objective of this analysis is to evaluate and compare various classification algorithms for the automatic identification of favourable days for intraday trading, using data on the Croatian stock index CROBEX. Intraday trading refers to the acquisition and sale of financial instruments on the same trading day. If the increase between the opening price and the closing price of the same day is substantial enough to earn a profit by purchasing at the opening price and selling at the closing price, the day is considered favourable for intraday trading. The goal is to discover the relation between selected financial indicators on a given day and the market situation on the following day, i.e. to determine whether a day is favourable for day trading or not. The problem is modelled as a binary classification problem. The idea is to test different algorithms and to give greater attention to those that are used more rarely than traditional statistical methods. Thus, the following algorithms are used: a neural network, a support vector machine, and a random forest, as well as k-nearest neighbours and the naïve Bayes classifier as more common methods. The work is an extension of the authors' previous work, in which the algorithms were compared on resamples resulting from tuning; here, each derived model is used to make predictions on new data. The results should add to the increasing corpus of stock market prediction research and help fill some gaps in this field for the Croatian market, in particular by using machine learning algorithms.
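The labelling rule above (close sufficiently higher than open) and the next-day prediction set-up can be sketched as follows. Prices are synthetic, and the 0.5% profit threshold and single indicator are illustrative assumptions, not the paper's choices.

```python
# Synthetic prices: label a day favourable when close beats open by a margin,
# then predict tomorrow's label from today's indicator, as a binary problem.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
opens = 100 + rng.normal(0, 1, n).cumsum()
closes = opens + rng.normal(0, 1, n)

threshold = 0.005  # assumed: a 0.5% intraday gain counts as favourable
favourable = ((closes - opens) / opens > threshold).astype(int)

# Today's indicator (here simply today's open-to-close return) predicts
# whether tomorrow is favourable; real features would be financial indicators.
X = ((closes - opens) / opens)[:-1].reshape(-1, 1)
y = favourable[1:]
clf = RandomForestClassifier(random_state=0).fit(X, y)
```

The one-day shift between features and labels is the essential detail: the classifier only ever sees information available before the day being predicted.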



Author(s):  
Ritam Guha ◽  
Manosij Ghosh ◽  
Pawan Kumar Singh ◽  
Ram Sarkar ◽  
Mita Nasipuri

In any multi-script environment, handwritten script classification is an unavoidable pre-requisite before the document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, this complex pattern classification problem has been tackled by researchers proposing various feature vectors, mostly of large dimension, thereby increasing the computational complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step to reduce the size of the feature vectors by restricting them to the essential and relevant features. In the present work, we address this issue by introducing a new FS algorithm, called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied to three feature vectors introduced recently in the literature: the Distance-Hough Transform (DHT), the Histogram of Oriented Gradients (HOG), and the Modified log-Gabor (MLG) filter transform. Three state-of-the-art classifiers, namely Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, consisting of 12 officially recognized Indic scripts, are prepared for experimentation. An average improvement in the range of 2-5% is achieved in the classification accuracy by utilizing only about 75-80% of the original feature vectors on all three datasets. The proposed method also shows better performance when compared to some popularly used FS models. The code used for implementing HSGFS can be found at the following GitHub link: https://github.com/Ritam-Guha/HSGFS.
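A sketch of the core of any wrapper-style FS scheme such as HSGFS: a binary mask selects a feature subset, and a classifier's cross-validated accuracy scores that mask. The data is synthetic and the search itself (the swarm/gravitation dynamics of the actual method) is not reproduced; only the fitness evaluation is shown.

```python
# Wrapper FS fitness: score a feature-subset mask by the accuracy of a
# classifier (here KNN, one of the three evaluators named above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

def fitness(mask, X, y):
    """Mean CV accuracy of KNN restricted to features where mask == 1."""
    if mask.sum() == 0:
        return 0.0  # empty subsets are worthless
    return cross_val_score(KNeighborsClassifier(), X[:, mask == 1], y, cv=3).mean()

full = fitness(np.ones(20, dtype=int), X, y)               # all features
subset = fitness((np.arange(20) < 10).astype(int), X, y)   # first 10 only
```

An optimizer then searches over masks to maximize this fitness, typically with a penalty on the number of selected features, which is how a 75-80% subset can match or beat the full vector.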


Plants ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 95
Author(s):  
Heba Kurdi ◽  
Amal Al-Aldawsari ◽  
Isra Al-Turaiki ◽  
Abdulrahman S. Aldawood

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting RPW infestation is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestation with RPW can be predicted with an accuracy of up to 93%, a precision above 87%, a recall of 100%, and an F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
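The evaluation loop described above (several classifiers scored on accuracy, precision, recall, and F-measure) can be sketched as below. Synthetic data stands in for the per-tree measurements, and only three of the ten classifiers are shown; the study itself does not specify this implementation.

```python
# Compare classifiers on tabular data with the four metrics named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in for per-tree records (e.g., circumference, height, temperature).
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}
for name, clf in [("NB", GaussianNB()),
                  ("RF", RandomForestClassifier(random_state=1)),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, pred), precision_score(y_te, pred),
                     recall_score(y_te, pred), f1_score(y_te, pred))
```

Feature importances from the fitted random forest (`clf.feature_importances_`) are one common way to back a claim like "temperature and circumference matter most".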


2013 ◽  
Vol 694-697 ◽  
pp. 1987-1992 ◽  
Author(s):  
Xing Gang Wu ◽  
Cong Guo

We propose an approach to identify vehicles under variation in image size, illumination, and viewing angle across different cameras, using a Support Vector Machine with weighted random trees (WRT-SVM). By quantizing the scale-invariant features of image pairs with the weighted random trees, the identification problem is formulated as a same-different classification problem. Results show the efficiency of building the randomized trees owing to the sample weights and the control of the false-positive rate of the identification system.
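A rough sketch of the trees-then-SVM structure, with random vectors standing in for scale-invariant (SIFT-style) descriptors: randomized trees quantize the raw pair features into sparse codes, and an SVM decides same versus different. The weighting scheme of the actual WRT method is not reproduced here.

```python
# Same-different classification: randomized-tree quantization + linear SVM.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
desc = rng.normal(size=(40, 16))  # one toy descriptor per image

# Same-vehicle pairs are near-duplicates; different pairs are shuffled.
same = np.hstack([desc, desc + rng.normal(0, 0.01, desc.shape)])
diff = np.hstack([desc, rng.permutation(desc)])
X = np.vstack([same, diff])
y = np.array([1] * len(same) + [0] * len(diff))  # 1 = same, 0 = different

# Unsupervised randomized trees map each pair to a sparse leaf-index code.
codes = RandomTreesEmbedding(n_estimators=10, random_state=0).fit_transform(X)
clf = LinearSVC().fit(codes, y)
```

In the paper, per-sample weights steer the tree construction and the decision threshold is tuned to bound the false-positive rate; both refinements sit on top of this basic structure.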

