Building Naïve Bayes Classifiers with High-Dimensional and Small-Sized Data Sets

SISFORMA ◽  
2018 ◽  
Vol 5 (1) ◽  
pp. 22
Author(s):  
Eka Angga Laksana ◽  
Ase Suryana ◽  
Heri Heryono

Sentiment analysis, as part of the text mining research domain, has gained recognition due to its successful application in social media analysis. Sentiment analysis methods can intelligently classify texts as negative or positive; the classified texts summarize the overall user response and describe opinion polarity about a particular topic. Based on this idea, this research took the opinions of e-learning users as the object to be measured through sentiment analysis, and the results can be used to evaluate the e-learning activity. The research was carried out at Widyatama University, which has run e-learning activities for several years. Previously, the e-learning system was evaluated qualitatively by giving users a questionnaire and gathering the feedback; however, a questionnaire alone does not capture the conclusion about the whole body of opinion. Hence, a method is needed to identify opinion polarity among e-learning members. The e-learning opinion data sets were gathered from questionnaires filled in by e-learning members, with both students and lecturers as participants. The participants reviewed their learning outcomes after taking part in e-learning activities. Their opinions were needed to describe the current situation of the e-learning activity, so that the conclusions could be used to make improvements and to describe achievements of the e-learning system. The data sets were trained with a Naïve Bayes classifier to group each user response as negative or positive. The classification results were also evaluated with evaluation metrics commonly used in data mining to show classifier performance, such as accuracy, precision, and recall.
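
For illustration, a minimal sketch of the pipeline this abstract describes, assuming a scikit-learn implementation; the opinion texts below are hypothetical and the authors' actual data and code are not reproduced here:

```python
# Sketch: train a Naive Bayes classifier on labelled opinion texts and
# report accuracy, precision, and recall (hypothetical toy reviews).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

texts = [
    "the e-learning platform helped me understand the material",
    "video lectures were clear and easy to follow",
    "uploading assignments failed repeatedly",
    "the discussion forum was confusing and slow",
]
labels = [1, 1, 0, 0]  # 1 = positive opinion, 0 = negative opinion

# Bag-of-words features, as is typical for Naive Bayes text classification.
X = CountVectorizer().fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=42, stratify=labels
)

clf = MultinomialNB().fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall   :", recall_score(y_test, pred, zero_division=0))
```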


Author(s):  
Ankit Srivastava ◽  
Vijendra Singh ◽  
Gurdeep Singh Drall

Over the past few years, the novel appeal and increasing popularity of social networks as a medium for users to express their opinions and views have created an accumulation of a massive amount of data, commonly termed Big Data. Accordingly, one area in which the application of new data mining techniques has significant potential to achieve more precise classification of hidden knowledge in Big Data is sentiment analysis (also known as opinion mining). A hybrid approach using Naïve Bayes and Random Forest for mining Twitter datasets is presented here as an extension of previous work. Briefly, relevant data sets are collected from Twitter using the Twitter API; the hybrid methodology is then illustrated and evaluated against an approach using only the Naïve Bayes classifier. Results show better accuracy and efficiency in sentiment classification for the hybrid approach.
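
The abstract does not specify how the two classifiers are combined; a minimal sketch of one plausible combination, a soft-voting ensemble over TF-IDF features built with scikit-learn, follows. The voting scheme and the tweet texts are assumptions for illustration only:

```python
# Sketch: combine Naive Bayes and Random Forest by averaging their
# predicted class probabilities (soft voting) over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

tweets = [
    "loving the new update, works great",
    "best service I have used all year",
    "this app keeps crashing, terrible",
    "worst customer support ever",
]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative sentiment

X = TfidfVectorizer().fit_transform(tweets)

hybrid = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",  # average class probabilities from both models
)
hybrid.fit(X, labels)
print(hybrid.predict(X))
```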


2010 ◽  
Vol 25 (4) ◽  
pp. 421-449 ◽  
Author(s):  
Marcin J. Mizianty ◽  
Lukasz A. Kurgan ◽  
Marek R. Ogiela

Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear complexity classifiers that include one of the top 10 voted data mining methods, Naïve Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since by design they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM), class-attribute interdependency maximization (CAIM), and information entropy maximization (IEM), provide discretization schemes with the highest overall quality. We show that discretization improves the classification accuracy when compared against the two classical methods, NB and Flexible Naïve Bayes (FNB), executed on the raw data. The choice of the discretization algorithm impacts the significance of the improvements. The MODL, FCAIM, and CAIM methods provide statistically significant improvements, while the IEM, class-attribute contingency coefficient (CACC), and Khiops discretizers provide moderate improvements. The most accurate classification models are generated by the averaged one-dependence estimators (AODEsr) classifier, followed by AODE and HNB (Hidden Naïve Bayes). AODEsr, run on data discretized with MODL, FCAIM, and CAIM, provides statistically significantly better accuracies than both classical NB methods. The worst results are obtained with the NB, FNB, and LBR (Lazy Bayes Rule) classifiers. We show that although the time to build the discretization scheme can be longer than the time to train the classifier, the completion of the entire process (to discretize data, compute the classifier, and predict test instances) is often faster than the NB-based classification of the continuous instances. This is because the time to classify test instances is an important factor that is positively influenced by discretization. The biggest positive influence, both on the accuracy and the classification time, is associated with the MODL, FCAIM, and CAIM algorithms.
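
A minimal sketch of the front-end discretization idea, using scikit-learn's unsupervised KBinsDiscretizer as a stand-in; the supervised discretizers studied in the paper (CAIM, FCAIM, IEM, MODL, and others) are not part of scikit-learn and would need to be plugged in from separate implementations:

```python
# Sketch: compare classical NB on raw continuous features with a
# discretize-then-classify pipeline, mirroring the setup described above.
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Classical NB on the raw continuous features (Gaussian assumption).
raw_nb = GaussianNB()
print("raw features :", cross_val_score(raw_nb, X, y, cv=5).mean())

# Discretize first, then run NB on the resulting nominal features.
disc_nb = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    CategoricalNB(min_categories=5),  # bins 0..4 become categories
)
print("discretized  :", cross_val_score(disc_nb, X, y, cv=5).mean())
```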


Author(s):  
Joko Suntoro ◽  
Febrian Wahyu Christanto ◽  
Henny Indriyawati

One of the most important tasks in software engineering is software defect prediction, defined as the process of predicting errors, failures, and faults in a software system. Researchers use machine learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software metric datasets that researchers use to predict software defects. The NASA MDP datasets contain imbalanced classes and high-dimensional data, which lowers the classification evaluation results. In this research, the imbalanced classes are handled with the AdaCost method and the high-dimensional data are handled with the Average Weight Information Gain (AWEIG) method, while the Naïve Bayes algorithm is used as the classifier. The proposed method is named AWEIG + AdaCost Bayesian. In this experiment, the AWEIG + AdaCost Bayesian algorithm is compared to the Naïve Bayes algorithm. The results show that the AWEIG + AdaCost Bayesian algorithm yields a better mean Area Under the Curve (AUC) than the Naïve Bayes algorithm alone, with mean AUC values of 0.752 and 0.696, respectively.
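
A minimal sketch under stated assumptions, not the AWEIG + AdaCost Bayesian implementation itself: mutual-information feature selection stands in for AWEIG's information-gain weighting, and per-class sample weights stand in for AdaCost's cost-sensitive boosting. The synthetic data only loosely mimics the imbalanced, high-dimensional character of the NASA MDP datasets:

```python
# Sketch: feature selection + cost-sensitive instance weights + Naive Bayes,
# evaluated with AUC as in the study described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Imbalanced, high-dimensional synthetic data (class 1 = "defective").
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Reduce dimensionality with an information-gain style criterion.
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Up-weight the rare defective class when fitting Naive Bayes.
class_weight = {0: 1.0, 1: (y_tr == 0).sum() / (y_tr == 1).sum()}
sample_weight = np.vectorize(class_weight.get)(y_tr)

nb = GaussianNB().fit(X_tr_sel, y_tr, sample_weight=sample_weight)
auc = roc_auc_score(y_te, nb.predict_proba(X_te_sel)[:, 1])
print("AUC:", round(auc, 3))
```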


2012 ◽  
Vol 21 (01) ◽  
pp. 1250007 ◽  
Author(s):  
LIANGXIAO JIANG ◽  
DIANHONG WANG ◽  
ZHIHUA CAI

Many approaches have been proposed to improve naive Bayes by weakening its conditional independence assumption. In this paper, we work on the instance-weighting approach and propose an improved naive Bayes algorithm based on discriminative instance weighting, which we call Discriminatively Weighted Naive Bayes. In each iteration, different training instances are discriminatively assigned different weights according to the estimated conditional probability loss. Experimental results on a large number of UCI data sets validate its effectiveness in terms of classification accuracy and AUC. Besides, the experimental results on running time show that Discriminatively Weighted Naive Bayes performs almost as efficiently as the state-of-the-art Discriminative Frequency Estimate learning method, and significantly more efficiently than Boosted Naive Bayes. Finally, we apply the idea of discriminatively weighted learning to some state-of-the-art naive Bayes text classifiers, such as multinomial naive Bayes, complement naive Bayes, and the one-versus-all-but-one model, and achieve remarkable improvements.
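
A minimal sketch of the instance-weighting idea, not the authors' exact algorithm: in each iteration every training instance's weight is increased according to its current conditional probability loss, 1 - P(true class | x), and naive Bayes is refit with the updated weights. The update rule, the Gaussian NB variant, and the iteration count are illustrative assumptions:

```python
# Sketch: iteratively reweight training instances by their conditional
# probability loss and refit Naive Bayes with the accumulated weights.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

weights = np.ones(len(y))            # start from uniform instance weights
for _ in range(10):                  # a fixed number of weighting iterations
    nb = GaussianNB().fit(X, y, sample_weight=weights)
    proba_true = nb.predict_proba(X)[np.arange(len(y)), y]
    weights = weights + (1.0 - proba_true)   # larger loss -> larger weight

print("training accuracy:", accuracy_score(y, nb.predict(X)))
```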

