Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms

2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Francesco Scalfati

Monitoring specific features of enterprises, for example the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly due to the amount of personnel involved. Automatic detection of this information would allow considerable savings. This can actually be performed by relying on computer engineering, since in general this information is publicly available online through corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website and the class is the presence or absence of e-commerce facilities. The automatic generation of such data records requires several Text Mining phases; in particular, we compare six strategies based on the selection of best words and best n-grams. We then classify the obtained dataset by means of four classification algorithms: Support Vector Machines; Random Forest; Statistical and Logical Analysis of Data; and the Logistic Classifier. This turns out to be a difficult classification problem; however, after careful design and set-up of the whole procedure, the results on a practical case of Italian enterprises are encouraging.
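A minimal sketch of the pipeline described above, using scikit-learn as an illustrative toolkit (the paper does not specify an implementation): website text is turned into best-word/n-gram features and classified with an SVM. The example texts and labels are invented stand-ins for scraped corporate websites.

```python
# Hypothetical data: website text -> n-gram features -> SVM, mirroring the
# "best n-grams + Support Vector Machines" strategy described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for text scraped from corporate websites (not the paper's data).
sites = [
    "add to cart checkout secure payment shipping",
    "our company history mission contact us",
    "buy now online shop basket credit card",
    "press releases careers about the team",
]
labels = [1, 0, 1, 0]  # 1 = e-commerce facilities present, 0 = absent

# Keep the best word uni/bi-grams, weighted with TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100)
X = vectorizer.fit_transform(sites)

clf = LinearSVC()
clf.fit(X, labels)
pred = clf.predict(vectorizer.transform(["checkout cart payment"]))
```

In practice each record would be built from one crawled website, and the six word/n-gram selection strategies compared by swapping the vectorizer configuration.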

2018 ◽  
Vol 11 (1) ◽  
pp. 2 ◽  
Author(s):  
Tao Zhang ◽  
Hong Tang

Detailed information about built-up areas is valuable for mapping complex urban environments. Although a large number of classification algorithms for such areas have been developed, they are rarely tested from the perspective of feature engineering and feature learning. Therefore, we launched an investigation to provide a full test of Operational Land Imager (OLI) imagery for 15-m resolution built-up area classification in 2015, in Beijing, China. Training a classifier requires many sample points, so we propose a method based on the European Space Agency's (ESA) 38-m global built-up area data of 2014, OpenStreetMap, and MOD13Q1-NDVI to achieve the rapid and automatic generation of a large number of sample points. Our aim was to examine the influence of a single pixel and an image patch under traditional feature engineering and modern feature learning strategies. In feature engineering, we consider spectra, shape, and texture as the input features, and support vector machine (SVM), random forest (RF), and AdaBoost as the classification algorithms. In feature learning, the convolutional neural network (CNN) is used as the classification algorithm. In total, 26 built-up land cover maps were produced. The experimental results show the following: (1) The approaches based on feature learning are generally better than those based on feature engineering in terms of classification accuracy, and the performance of ensemble classifiers (e.g., RF) is comparable to that of the CNN. The two-dimensional CNN and the 7-neighborhood RF achieve the highest classification accuracies, at nearly 91%; (2) Overall, the classification effect and accuracy based on image patches are better than those based on single pixels. Features that highlight the information of the target category (e.g., PanTex (texture-derived built-up presence index) and the enhanced morphological building index (EMBI)) can help improve classification accuracy.
The code and experimental results are available at https://github.com/zhangtao151820/CompareMethod.
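The patch-versus-pixel comparison above hinges on giving a per-pixel classifier access to neighbourhood context. A toy sketch of that step (the function name and array are illustrative, not from the paper's repository): extract one flattened 7x7 patch per pixel, suitable as input rows for a Random Forest, as in the "7-neighborhood RF" variant.

```python
# Toy sketch: one 7x7 edge-padded patch per pixel, so a pixel-wise classifier
# such as Random Forest can use neighbourhood context.
import numpy as np

def extract_patches(image, size=7):
    """Return one flattened size x size patch per pixel (edge-padded)."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    patches = np.empty((h * w, size * size), dtype=image.dtype)
    k = 0
    for i in range(h):
        for j in range(w):
            patches[k] = padded[i:i + size, j:j + size].ravel()
            k += 1
    return patches

img = np.arange(100, dtype=np.float32).reshape(10, 10)  # stand-in for one OLI band
X = extract_patches(img)  # one feature row per pixel, ready for an RF
```

The centre element of each row is the original pixel value, so the single-pixel baseline is the special case `size=1`.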


2019 ◽  
Vol 12 (2) ◽  
pp. 32-38
Author(s):  
Iin Ernawati

This study applies text-based data mining, often called text mining, using two commonly used classification methods: the Naïve Bayes classifier (NBC) and the support vector machine (SVM). The classification is focused on Indonesian-language documents, while the relationship between documents is measured by probabilities that can be cross-checked with other classification algorithms. This is evident from the Naïve Bayes classifier (NBC) result that the word "party" appears in both the economic and the political documents, and from the support vector machine (SVM) result that the words "price" and "kpk" are contained in both the economic and the political documents.
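A hedged sketch of the Naïve Bayes side of the comparison, with invented toy documents (not the study's corpus): the classifier learns per-class word evidence of the kind discussed above for words such as "party", "price", and "kpk".

```python
# Toy Indonesian-style documents; labels are the two topics from the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "harga pasar naik ekonomi inflasi",   # economics
    "partai pemilu politik koalisi",      # politics
    "harga bahan pokok ekonomi turun",    # economics
    "kpk politik korupsi partai",         # politics
]
y = ["ekonomi", "politik", "ekonomi", "politik"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, y)  # learns P(word | class) with smoothing

pred = nb.predict(vec.transform(["harga ekonomi"]))
```

The learned `feature_log_prob_` attribute exposes the per-class word probabilities that the abstract refers to.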


2019 ◽  
Author(s):  
Nazmul Kazi ◽  
Indika Kahanda

Current health care systems require clinicians to spend a substantial amount of time digitally documenting their interactions with their patients through the use of electronic health records (EHRs), limiting the time spent on face-to-face patient care. Moreover, the use of EHRs is known to be highly inefficient due to the additional time it takes for completion, which also leads to clinician burnout. In this project, we explore the feasibility of developing an automated case notes system for psychiatrists using text mining techniques that will listen to doctor-patient conversations, generate digital transcripts using speech-to-text conversion, classify information from the transcripts into relevant categories, and automatically generate structured case notes. In our preliminary work, we develop a human-powered doctor-patient conversation transcript annotator and obtain a gold standard dataset through the National Alliance on Mental Illness (NAMI) Montana. We model the task of classifying parts of conversations into six broad categories, such as medical and family history, as a supervised classification problem and apply several popular machine learning algorithms. According to our preliminary experimental results obtained through 5-fold cross-validation, Support Vector Machines are able to classify an unseen transcript with an average AUROC (area under the receiver operating characteristic curve) score of 89%. Finally, using part-of-speech (POS) tagging, the grammatical rules of the English language, and verb conjugation, we generate written versions of the pieces of text belonging to the different categories. These formal texts are aggregated to fill the different sections of the EHR forms.
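The category-classification step above can be sketched as follows, with synthetic utterances and a binary simplification of the six categories (category names and texts are made up for the example; the paper's data comes from NAMI Montana). The AUROC scoring mirrors the evaluation described.

```python
# Synthetic transcript segments -> SVM -> AUROC, echoing the pipeline above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

utterances = [
    "my father also had depression", "my mother was treated for anxiety",
    "i take sertraline every morning", "the doctor changed my medication",
    "my grandmother had similar symptoms", "i started a new prescription",
]
y = [1, 1, 0, 0, 1, 0]  # 1 = family history, 0 = medication (binary for brevity)

model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(utterances, y)

# Margin distances serve as ranking scores for the AUROC computation.
scores = model.decision_function(utterances)
auroc = roc_auc_score(y, scores)
```

With the full six categories this becomes a multi-class problem evaluated with a per-class (one-vs-rest) AUROC, averaged as in the reported 89%.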


2012 ◽  
Vol 11 (01) ◽  
pp. 1250006 ◽  
Author(s):  
Fadi Thabtah ◽  
Omar Gharaibeh ◽  
Rashid Al-Zubaidy

A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. The text classification arena has recently attracted many researchers because of the massive amounts of online documents and text archives, which hold essential information for decision-making processes. In this field, most research focuses on classifying English documents, while there are limited studies on other languages such as Arabic. In this respect, the paper investigates the problem of Arabic text classification comprehensively. More specifically, the study measures the performance of different rule-based classification approaches adopted from machine learning and data mining on the problem of Arabic text classification. In particular, four rule-based classification approaches, decision trees (C4.5), rule induction (RIPPER), hybrid (PART), and simple rule (OneR), are evaluated against the published Corpus of Contemporary Arabic text collection. The experimentation is carried out using a modified version of the WEKA business intelligence tool. By analysing the results produced by the experimentation, we determine the most suitable classification algorithms for classifying Arabic texts.


2020 ◽  
Vol 6 (2) ◽  
pp. 4-11
Author(s):  
Silvija Vlah Jerić

The main objective of this analysis is to evaluate and compare various classification algorithms for the automatic identification of favourable days for intraday trading, using data on the Croatian stock index CROBEX. Intraday trading refers to the acquisition and sale of financial instruments on the same trading day. If the increase between the opening price and the closing price of the same day is substantial enough to earn a profit by purchasing at the opening price and selling at the closing price, the day is considered favourable for intraday trading. The goal is to discover the relation between selected financial indicators on a given day and the market situation on the following day, i.e. to determine whether a day is favourable for day trading or not. The problem is modelled as a binary classification problem. The idea is to test different algorithms and to give greater attention to those that are used more rarely than traditional statistical methods. Thus, the following algorithms are used: a neural network, a support vector machine, and a random forest, as well as k-nearest neighbours and the naïve Bayes classifier as more common methods. The work is an extension of the authors' previous work, in which the algorithms were compared on resamples resulting from tuning; here, each derived model is used to make predictions on new data. The results should add to the increasing corpus of stock market prediction research and help fill some gaps in this field for the Croatian market, in particular by using machine learning algorithms.
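The labelling rule above (close sufficiently higher than open) and the next-day prediction set-up can be sketched as follows. Prices are synthetic, and the 0.5% profit threshold and single indicator are illustrative assumptions, not the paper's choices.

```python
# Synthetic prices: label a day favourable when close beats open by a margin,
# then predict tomorrow's label from today's indicator, as a binary problem.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
opens = 100 + rng.normal(0, 1, n).cumsum()
closes = opens + rng.normal(0, 1, n)

threshold = 0.005  # assumed: a 0.5% intraday gain counts as favourable
favourable = ((closes - opens) / opens > threshold).astype(int)

# Today's indicator (here simply today's open-to-close return) predicts
# whether tomorrow is favourable; real features would be financial indicators.
X = ((closes - opens) / opens)[:-1].reshape(-1, 1)
y = favourable[1:]
clf = RandomForestClassifier(random_state=0).fit(X, y)
```

The one-day shift between features and labels is the essential detail: the classifier only ever sees information available before the day being predicted.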



Author(s):  
Ritam Guha ◽  
Manosij Ghosh ◽  
Pawan Kumar Singh ◽  
Ram Sarkar ◽  
Mita Nasipuri

In any multi-script environment, handwritten script classification is an unavoidable pre-requisite before the document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, this complex pattern classification problem has been tackled by researchers proposing various feature vectors, mostly of large dimension, thereby increasing the computational complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step to reduce the size of the feature vectors by restricting them to the essential and relevant features. In the present work, we address this issue by introducing a new FS algorithm, called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied to three feature vectors introduced recently in the literature: the Distance-Hough Transform (DHT), the Histogram of Oriented Gradients (HOG), and the Modified log-Gabor (MLG) filter transform. Three state-of-the-art classifiers, namely Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, consisting of 12 officially recognized Indic scripts, are prepared for experimentation. An average improvement in the range of 2-5% is achieved in the classification accuracy by utilizing only about 75-80% of the original feature vectors on all three datasets. The proposed method also shows better performance when compared to some popularly used FS models. The code used for implementing HSGFS can be found at the following GitHub link: https://github.com/Ritam-Guha/HSGFS.
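A sketch of the core of any wrapper-style FS scheme such as HSGFS: a binary mask selects a feature subset, and a classifier's cross-validated accuracy scores that mask. The data is synthetic and the search itself (the swarm/gravitation dynamics of the actual method) is not reproduced; only the fitness evaluation is shown.

```python
# Wrapper FS fitness: score a feature-subset mask by the accuracy of a
# classifier (here KNN, one of the three evaluators named above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

def fitness(mask, X, y):
    """Mean CV accuracy of KNN restricted to features where mask == 1."""
    if mask.sum() == 0:
        return 0.0  # empty subsets are worthless
    return cross_val_score(KNeighborsClassifier(), X[:, mask == 1], y, cv=3).mean()

full = fitness(np.ones(20, dtype=int), X, y)               # all features
subset = fitness((np.arange(20) < 10).astype(int), X, y)   # first 10 only
```

An optimizer then searches over masks to maximize this fitness, typically with a penalty on the number of selected features, which is how a 75-80% subset can match or beat the full vector.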


Plants ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 95
Author(s):  
Heba Kurdi ◽  
Amal Al-Aldawsari ◽  
Isra Al-Turaiki ◽  
Abdulrahman S. Aldawood

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting RPW infestation is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestation with RPW can be predicted with an accuracy of up to 93%, a precision above 87%, a recall of 100%, and an F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
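The evaluation loop described above (several classifiers scored on accuracy, precision, recall, and F-measure) can be sketched as below. Synthetic data stands in for the per-tree measurements, and only three of the ten classifiers are shown; the study itself does not specify this implementation.

```python
# Compare classifiers on tabular data with the four metrics named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in for per-tree records (e.g., circumference, height, temperature).
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}
for name, clf in [("NB", GaussianNB()),
                  ("RF", RandomForestClassifier(random_state=1)),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, pred), precision_score(y_te, pred),
                     recall_score(y_te, pred), f1_score(y_te, pred))
```

Feature importances from the fitted random forest (`clf.feature_importances_`) are one common way to back a claim like "temperature and circumference matter most".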


2013 ◽  
Vol 694-697 ◽  
pp. 1987-1992 ◽  
Author(s):  
Xing Gang Wu ◽  
Cong Guo

We propose an approach to identify vehicles under variation in image size, illumination, and viewing angle across different cameras, using a Support Vector Machine with weighted random trees (WRT-SVM). By quantizing the scale-invariant features of image pairs with the weighted random trees, the identification problem is formulated as a same-different classification problem. Results show the efficiency of building the randomized trees owing to the sample weights and the control of the false-positive rate of the identification system.
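A rough sketch of the trees-then-SVM structure, with random vectors standing in for scale-invariant (SIFT-style) descriptors: randomized trees quantize the raw pair features into sparse codes, and an SVM decides same versus different. The weighting scheme of the actual WRT method is not reproduced here.

```python
# Same-different classification: randomized-tree quantization + linear SVM.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
desc = rng.normal(size=(40, 16))  # one toy descriptor per image

# Same-vehicle pairs are near-duplicates; different pairs are shuffled.
same = np.hstack([desc, desc + rng.normal(0, 0.01, desc.shape)])
diff = np.hstack([desc, rng.permutation(desc)])
X = np.vstack([same, diff])
y = np.array([1] * len(same) + [0] * len(diff))  # 1 = same, 0 = different

# Unsupervised randomized trees map each pair to a sparse leaf-index code.
codes = RandomTreesEmbedding(n_estimators=10, random_state=0).fit_transform(X)
clf = LinearSVC().fit(codes, y)
```

In the paper, per-sample weights steer the tree construction and the decision threshold is tuned to bound the false-positive rate; both refinements sit on top of this basic structure.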

