A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

2014, Vol 2014, pp. 1-10
Author(s): Subhajit Dey Sarkar, Saptarsi Goswami, Aman Agarwal, Javed Aktar

With the proliferation of unstructured data, text classification (or text categorization) has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. Many classification algorithms are available, and naïve Bayes remains one of the oldest and most popular: it is simple to implement and requires relatively little training data. However, the literature reports that naïve Bayes performs poorly compared to other classifiers in text classification, which makes it unattractive in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method: a univariate feature selection step to reduce the search space, followed by feature clustering to select relatively independent feature sets. We demonstrate the effectiveness of our method by a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is also shown to outperform traditional methods such as greedy-search-based wrappers and CFS.
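A minimal sketch of such a two-step pipeline, assuming scikit-learn (>= 1.2) with a chi-squared univariate filter and correlation-based clustering of the surviving features; the measures, dataset, and cluster count are illustrative assumptions, not the paper's exact choices:

# Sketch: univariate filter -> feature clustering -> naive Bayes.
# chi2, the cluster count, and the corpus are illustrative assumptions.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.cluster import AgglomerativeClustering
from sklearn.naive_bayes import MultinomialNB

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer(max_features=20000).fit_transform(train.data)
y = train.target

# Step 1: univariate selection shrinks the search space.
selector = SelectKBest(chi2, k=2000).fit(X, y)
X_sel = selector.transform(X).toarray()

# Step 2: cluster correlated features, keep one representative per cluster
# (scikit-learn >= 1.2 uses `metric`; older versions call it `affinity`).
corr = np.corrcoef(X_sel, rowvar=False)
dist = 1.0 - np.abs(np.nan_to_num(corr))
labels = AgglomerativeClustering(n_clusters=300, metric="precomputed",
                                 linkage="average").fit_predict(dist)
scores = chi2(X_sel, y)[0]
reps = [np.where(labels == c)[0][np.argmax(scores[labels == c])]
        for c in np.unique(labels)]

clf = MultinomialNB().fit(X_sel[:, reps], y)   # relatively independent features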

Author(s): I Guna Adi Socrates, Afrizal Laksita Akbar, Mohammad Sonhaji Akbar, Agus Zainal Arifin, Darlis Herumurti

Naïve Bayes is one of the data mining methods commonly used in text-based document classification. Its advantage is a simple algorithm with low computational complexity. However, the method has a weakness: the assumed independence of its features does not always hold, which affects the accuracy of the results. Naïve Bayes can therefore be improved by weighting its features with Gain Ratio. However, weighting the features causes problems when calculating the probability of each document, because many features in a document do not represent the tested class, so weighted Naïve Bayes alone is still not optimal. This paper proposes an optimization of the Naïve Bayes method that combines Gain Ratio weighting with a feature selection method for text classification. The results of this study show that Naïve Bayes optimized with feature selection and weighting achieves an accuracy of 94%.
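A hedged sketch of Gain Ratio feature weighting for naïve Bayes: features are binarized (term present or absent) to compute information gain and split information, and the resulting weights scale the count matrix before training. This is one plausible reading of gain-ratio weighting, not necessarily the authors' exact formulation:

# Sketch: Gain Ratio weights applied to a multinomial naive Bayes.
# The toy documents and binarization scheme are assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(x_bin, y):
    h_y = entropy(np.bincount(y) / len(y))
    gr = np.zeros(x_bin.shape[1])
    for j in range(x_bin.shape[1]):
        col, ig, si = x_bin[:, j], h_y, 0.0
        for v in (0, 1):
            mask = col == v
            frac = mask.mean()
            if frac > 0:
                ig -= frac * entropy(np.bincount(y[mask]) / mask.sum())
                si -= frac * np.log2(frac)        # split information
        gr[j] = ig / si if si > 0 else 0.0
    return gr

docs = ["cheap pills online", "meeting at noon", "buy cheap meds", "lunch at noon"]
y = np.array([1, 0, 1, 0])
X = CountVectorizer().fit_transform(docs).toarray()
w = gain_ratio((X > 0).astype(int), y)
clf = MultinomialNB().fit(X * w, y)               # gain-ratio-weighted counts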


2019, Vol 2 (2), pp. 58
Author(s): Utomo Pujianto, Asa Luki Setiawan, Harits Ar Rosyid, Ali M. Mohammad Salah

Diabetes is a metabolic disorder in which the pancreas does not produce enough insulin or the body cannot use the insulin it produces effectively. The HbA1c examination, which measures a patient's average glucose level over the last 2-3 months, has become an important step in determining the condition of diabetic patients. Knowledge of the patient's condition can help medical staff predict patient readmissions, i.e., cases in which a patient requires hospitalization again. The ability to predict readmissions ultimately helps the hospital measure and manage the quality of patient care. This study compares the performance of the Naïve Bayes method and the C4.5 decision tree in predicting readmissions of diabetic patients, especially patients who have undergone an HbA1c examination. As part of this study we also compare the performance of classification models across scenarios that combine preprocessing methods, namely the Synthetic Minority Over-Sampling Technique (SMOTE) and a wrapper feature selection method, with both classification techniques. The scenario combining C4.5 with SMOTE and feature selection produces the best performance in classifying readmissions of diabetic patients, with an accuracy of 82.74%, precision of 87.1%, and recall of 82.7%.
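A minimal sketch of the winning scenario, assuming scikit-learn and imbalanced-learn, with an entropy-criterion decision tree standing in for C4.5 and a synthetic imbalanced dataset standing in for the diabetes records:

# Sketch: SMOTE oversampling -> wrapper feature selection -> C4.5-style tree.
# The dataset is a synthetic placeholder for the diabetic-patient records.
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)  # imbalanced
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance classes

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # C4.5 stand-in
wrapper = SequentialFeatureSelector(tree, n_features_to_select=8).fit(X_bal, y_bal)

model = tree.fit(wrapper.transform(X_bal), y_bal)
print("accuracy:", model.score(wrapper.transform(X_te), y_te))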


Telematika, 2017, Vol 13 (2), pp. 123
Author(s): Ristu Saptono, Meianto Eko Sulistyo, Nur Shobriana Trihabsari

Document classification is a growing interest in text mining research. Classification can be done based on topics, languages, and so on. This study was conducted to determine how Naive Bayes Updateable performs in classifying SBMPTN exam questions by theme. Naive Bayes Updateable, an incremental variant of the Naive Bayes classifier often used in text classification, can learn from new data introduced to the system even after the classifier has been built from existing data. It classifies the exam questions by field-of-study theme by analyzing the keywords that appear in each question. DF-Thresholding, a feature selection method, is implemented to improve classification performance. Evaluation of the classification with the Naive Bayes classifier algorithm yields 84.61% accuracy.
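A sketch of the incremental setup, assuming scikit-learn's MultinomialNB.partial_fit as an analogue of Weka's NaiveBayesUpdateable and CountVectorizer's min_df as the DF-threshold; the themes and question texts are made up:

# Sketch: incremental naive Bayes over a fixed vocabulary, with
# DF-thresholding via min_df. Themes/texts are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

initial_docs = ["velocity force acceleration mass", "cell membrane nucleus",
                "force mass motion", "nucleus dna cell dna"]
initial_y = np.array([0, 1, 0, 1])               # 0 = physics, 1 = biology

# DF-thresholding: drop terms appearing in fewer than 2 documents.
vec = CountVectorizer(min_df=2).fit(initial_docs)
clf = MultinomialNB()
clf.partial_fit(vec.transform(initial_docs), initial_y, classes=[0, 1])

# New exam questions arrive later; the model updates without retraining.
clf.partial_fit(vec.transform(["dna replication in the cell"]), np.array([1]))
print(clf.predict(vec.transform(["mass and acceleration"])))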


2021, Vol 25 (1), pp. 21-34
Author(s): Rafael B. Pereira, Alexandre Plastino, Bianca Zadrozny, Luiz H.C. Merschmann

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification, and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to identify relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with multi-label feature selection techniques currently used in the literature, and is clearly more scalable in a scenario where the amount of data keeps increasing.
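A hedged sketch of the lazy paradigm in a multi-label setting: feature selection is deferred to prediction time and restricted to the features actually present in each test instance. The scoring measure, base classifier, and synthetic data are all assumptions made for illustration, not the authors' method:

# Sketch: per-instance (lazy) feature selection for multi-label data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))           # binary features
Y = rng.integers(0, 2, size=(200, 3))            # 3 labels (multi-label)

# Precompute one relevance score per (label, feature) pair.
scores = np.stack([mutual_info_classif(X, Y[:, l], discrete_features=True)
                   for l in range(Y.shape[1])])

def lazy_predict(x, k=10):
    active = np.flatnonzero(x)                   # only this instance's features
    preds = []
    for l in range(Y.shape[1]):
        top = active[np.argsort(scores[l, active])[-k:]]  # best k per label
        knn = KNeighborsClassifier(n_neighbors=5).fit(X[:, top], Y[:, l])
        preds.append(knn.predict(x[top].reshape(1, -1))[0])
    return preds

print(lazy_predict(X[0]))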


Author(s): E. MONTAÑÉS, J. R. QUEVEDO, E. F. COMBARRO, I. DÍAZ, J. RANILLA

Feature selection is an important task within text categorization, where irrelevant or noisy features are usually present, causing a loss in classifier performance. Feature selection in text categorization has usually been performed with a filtering approach: selecting the features with the highest scores according to certain measures drawn from the information retrieval, information theory, and machine learning fields. Wrapper approaches are known to outperform filtering approaches in feature selection, but they are time-consuming and sometimes infeasible, especially in text domains. However, a wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function can overcome these difficulties. The wrapper presented in this paper satisfies both properties. Since exploring a reduced number of subsets could mean settling on less promising ones, a hybrid approach that combines the wrapper method with scoring measures makes it possible to explore more promising feature subsets. A comparison among scoring measures, the wrapper method, and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
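A minimal sketch of the hybrid idea, assuming scikit-learn: a scoring measure (chi-squared here) orders the features, and the wrapper explores only the nested top-k subsets, using fast naïve Bayes cross-validation as its evaluation function. The measure, step sizes, and corpus are illustrative assumptions:

# Sketch: scoring measure narrows the search; wrapper picks the best subset.
import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups_vectorized(subset="train")
X, y = data.data, data.target

order = np.argsort(chi2(X, y)[0])[::-1]          # scoring measure ranks features

best_k, best_acc = None, 0.0
for k in (500, 1000, 2000, 5000):                # reduced subset exploration
    acc = cross_val_score(MultinomialNB(), X[:, order[:k]], y, cv=3).mean()
    if acc > best_acc:                           # fast evaluation function
        best_k, best_acc = k, acc
print(best_k, round(best_acc, 3))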


2016, Vol 78 (8-2)
Author(s): Jafreezal Jaafar, Zul Indra, Nurshuhaini Zamin

Text classification (TC) provides a better way to organize information, since it allows better understanding and interpretation of the content. It deals with the assignment of labels to groups of similar textual documents. However, TC research on Asian-language documents is relatively limited compared to English, and even more so for news articles. Apart from that, TC research on classifying documents in languages with similar morphology, such as Indonesian and Malay, is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm that identifies the language and then classifies the category of a given news document. Furthermore, a top-n feature selection method is used to improve TC performance and to overcome the challenges of classifying online news corpora: the rapid growth of online news documents and the high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014-2015. The classification method is shown to produce good results, with accuracy of up to 95.63% for language identification and 97.5% for category classification. The category classifier works optimally at n = 60%, with an average computational time of 35 seconds. This highlights that the integrated generic TC approach has an advantage over manual classification and is suitable for Indonesian and Malay news classification.
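A toy sketch of such an integrated pipeline, assuming scikit-learn: a character n-gram model first identifies the language, then a category classifier with top-n% chi-squared feature selection assigns the topic. The example texts are placeholders, not the study's corpus; only the n = 60% setting is taken from the abstract:

# Sketch: language identification followed by category classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

lang_docs = ["berita olahraga hari ini", "kerajaan umum dasar baharu"]
lang_y = ["indonesian", "malay"]
lang_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    MultinomialNB()).fit(lang_docs, lang_y)

cat_docs = ["tim sepak bola menang", "harga saham naik",
            "pemain mencetak gol", "pasar modal turun"]
cat_y = ["sport", "economy", "sport", "economy"]
cat_clf = make_pipeline(
    TfidfVectorizer(),
    SelectPercentile(chi2, percentile=60),        # top-n selection, n = 60%
    MultinomialNB()).fit(cat_docs, cat_y)

doc = "bola masuk gawang"
print(lang_clf.predict([doc]), cat_clf.predict([doc]))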


2020, Vol 2020, pp. 1-14
Author(s): Yong Liu, Shenggen Ju, Junfeng Wang, Chong Su

Feature selection methods select representative feature subsets from the original feature set through different evaluations of feature relevance, aiming to reduce the dimensionality of the features while maintaining the predictive accuracy of a classifier. In this study, we propose a feature selection method for text classification based on independent feature space search. First, a relative document-term frequency difference (RDTFD) method is proposed to divide the features of all text documents into two independent feature sets according to the features' ability to discriminate positive from negative samples. This serves two purposes: it increases the class correlation of the features while reducing the correlation among them, and it narrows the search range of the feature space while maintaining appropriate feature redundancy. Second, a feature search strategy is used to find the optimal feature subset within each independent feature space, which improves the performance of text classification. Finally, we conduct experiments on six benchmark corpora; the results show that the RDTFD method based on independent feature space search is more robust than the other feature selection methods.
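A hedged sketch of an RDTFD-style split: each term is scored by its document-frequency rate in the positive class minus its rate in the negative class, and the sign of the score partitions the vocabulary into two independent feature sets that can then be searched separately. This is an illustrative reading of the abstract, not the authors' exact formula:

# Sketch: partition the vocabulary into two independent feature sets.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great plot and acting", "terrible boring film",
        "great film great cast", "boring plot terrible pacing"]
y = np.array([1, 0, 1, 0])                       # 1 = positive review

vec = CountVectorizer()
X = (vec.fit_transform(docs).toarray() > 0).astype(float)  # binary DF matrix

rate_pos = X[y == 1].mean(axis=0)                # DF rate in positive docs
rate_neg = X[y == 0].mean(axis=0)                # DF rate in negative docs
rdtfd = rate_pos - rate_neg                      # assumed RDTFD-style score

terms = np.array(vec.get_feature_names_out())
pos_set = terms[rdtfd > 0]                       # discriminates positives
neg_set = terms[rdtfd < 0]                       # discriminates negatives
print(pos_set, neg_set)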

