Term-frequency Based Feature Selection Methods for Text Categorization

Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection

BioMed Research International ◽

10.1155/2015/751646 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Yifei Chen ◽

Yuxing Sun ◽

Bing-Qing Han

Keyword(s):

Feature Selection ◽

Protein Interaction ◽

Text Classification ◽

Protein Interactions ◽

Reduction Rate ◽

Importance Measure ◽

Context Information ◽

Selection Methods ◽

Term Frequency ◽

Context Similarity

Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of theF1measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

Download Full-text

Data-driven global-ranking local feature selection methods for text categorization

Expert Systems with Applications ◽

10.1016/j.eswa.2014.10.011 ◽

2015 ◽

Vol 42 (4) ◽

pp. 1941-1949 ◽

Cited By ~ 35

Author(s):

Roberto H.W. Pinheiro ◽

George D.C. Cavalcanti ◽

Tsang Ing Ren

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Local Feature ◽

Data Driven ◽

Selection Methods ◽

Global Ranking

Download Full-text

t-Test feature selection approach based on term frequency for text categorization

Pattern Recognition Letters ◽

10.1016/j.patrec.2014.02.013 ◽

2014 ◽

Vol 45 ◽

pp. 1-10 ◽

Cited By ~ 52

Author(s):

Deqing Wang ◽

Hui Zhang ◽

Rui Liu ◽

Weifeng Lv ◽

Datao Wang

Keyword(s):

Feature Selection ◽

Text Categorization ◽

T Test ◽

Term Frequency ◽

Selection Approach ◽

Feature Selection Approach

Download Full-text

New feature selection methods based on context similarity for text categorization

2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) ◽

10.1109/fskd.2014.6980902 ◽

2014 ◽

Cited By ~ 1

Author(s):

Yifei Chen ◽

Bingqing Han ◽

Ping Hou

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Selection Methods ◽

Context Similarity ◽

New Feature

Download Full-text

Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

The Scientific World JOURNAL ◽

10.1155/2014/625342 ◽

2014 ◽

Vol 2014 ◽

pp. 1-17 ◽

Cited By ~ 9

Author(s):

Jieming Yang ◽

Zhaoyang Qu ◽

Zhiying Liu

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Information Gain ◽

Feature Selection Method ◽

Support Vector ◽

Selection Methods ◽

Document Collections ◽

Imbalance Problem ◽

Important Approach ◽

Selection Algorithms

The filtering feature-selection algorithm is a kind of important approach to dimensionality reduction in the field of the text categorization. Most of filtering feature-selection algorithms evaluate the significance of a feature for category based on balanced dataset and do not consider the imbalance factor of dataset. In this paper, a new scheme was proposed, which can weaken the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme can significantly enhance the performance of the feature-selection methods.

Download Full-text

Enhanced Filter Feature Selection Methods for Arabic Text Categorization

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2018040101 ◽

2018 ◽

Vol 8 (2) ◽

pp. 1-24 ◽

Cited By ~ 1

Author(s):

Abdullah Saeed Ghareb ◽

Azuraliza Abu Bakara ◽

Qasem A. Al-Radaideh ◽

Abdul Razak Hamdan

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Selection Process ◽

High Dimensional Data ◽

Relevant Information ◽

High Dimensional ◽

Arabic Text ◽

Relevant Feature ◽

Associative Classification ◽

Selection Methods

The filtering of a large amount of data is an important process in data mining tasks, particularly for the categorization of unstructured high dimensional data. Therefore, a feature selection process is desired to reduce the space of high dimensional data into small relevant subset dimensions that represent the best features for text categorization. In this article, three enhanced filter feature selection methods, Category Relevant Feature Measure, Modified Category Discriminated Measure, and Odd Ratio2, are proposed. These methods combine the relevant information about features in both the inter- and intra-category. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated by traditional measures of text categorization, namely, macro-averaging of precision, recall, and F-measure. Experiments are conducted on three Arabic text datasets used for text categorization. The experimental results showed that the proposed methods are able to achieve better and comparable results when compared to 12 well known traditional methods.

Download Full-text

A Comparative Study on Feature Selection of Text Categorization for Hidden Markov Models

Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI ◽

10.29173/cais341 ◽

2013 ◽

Author(s):

Kwan Yi ◽

Jamshid Beheshti

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Markov Models ◽

Hidden Markov ◽

Model Performance ◽

Document Representation ◽

Selection Methods ◽

Learning Models ◽

Text Feature ◽

Selection Of

In document representation for digitalized text, feature selection refers to the selection of the terms of representing a document and of distinguishing it from other documents. This study probes different feature selection methods for HMM learning models to explore how they affect the model performance, which is experimented in the context of text categorization task.Dans la représentation documentaire des textes numérisés, la sélection des caractéristiques se fonde sur la sélection des termes représentant et distinguant un document des autres documents. Cette étude examine différents modèles de sélection de caractéristiques pour les modèles d’apprentissage MMC, afin d’explorer comment ils affectent la performance du modèle, qui est observé dans le contexte de la tâche de catégorisation textuelle.

Download Full-text

Integrating Feature and Instance Selection Techniques in Opinion Mining

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2020070109 ◽

2020 ◽

Vol 16 (3) ◽

pp. 168-182

Author(s):

Zi-Hung You ◽

Ya-Han Hu ◽

Chih-Fong Tsai ◽

Yen-Ming Kuo

Keyword(s):

Feature Selection ◽

Opinion Mining ◽

Classification Performance ◽

Problem Instance ◽

Instance Selection ◽

Selection Methods ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Text Features

Opinion mining focuses on extracting polarity information from texts. For textual term representation, different feature selection methods, e.g. term frequency (TF) or term frequency–inverse document frequency (TF–IDF), can yield diverse numbers of text features. In text classification, however, a selected training set may contain noisy documents (or outliers), which can degrade the classification performance. To solve this problem, instance selection can be adopted to filter out unrepresentative training documents. Therefore, this article investigates the opinion mining performance associated with feature and instance selection steps simultaneously. Two combination processes based on performing feature selection and instance selection in different orders, were compared. Specifically, two feature selection methods, namely TF and TF–IDF, and two instance selection methods, namely DROP3 and IB3, were employed for comparison. The experimental results by using three Twitter datasets to develop sentiment classifiers showed that TF–IDF followed by DROP3 performs the best.

Download Full-text

TTC-3600: A new benchmark dataset for Turkish text categorization

Journal of Information Science ◽

10.1177/0165551515620551 ◽

2015 ◽

Vol 43 (2) ◽

pp. 174-185 ◽

Cited By ~ 23

Author(s):

Deniz Kılınç ◽

Akın Özçift ◽

Fatma Bozyigit ◽

Pelin Yıldırım ◽

Fatih Yücalar ◽

...

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Learning Task ◽

Selection Method ◽

Random Forest Classifier ◽

Experimental Results ◽

Selection Methods ◽

File Formats ◽

Accuracy Criterion

Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.

Download Full-text

Combining Multiple Feature Selection Methods for Text Categorization by Using Rank-Score Characteristics

2009 21st IEEE International Conference on Tools with Artificial Intelligence ◽

10.1109/ictai.2009.129 ◽

2009 ◽

Cited By ~ 11

Author(s):

Yanjun Li ◽

D. Frank Hsu ◽

Soon M. Chung

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Selection Methods ◽

Multiple Feature ◽

Rank Score

Download Full-text