A New Robust Classifier on Noise Domains: Bagging of Credal C4.5 Trees

Complexity ◽

10.1155/2017/9023970 ◽

2017 ◽

Vol 2017 ◽

pp. 1-17 ◽

Cited By ~ 2

Author(s):

Joaquín Abellán ◽

Javier G. Castellano ◽

Carlos J. Mantas

Keyword(s):

Classification Tree ◽

Mining Area ◽

Complex Problem ◽

Imprecise Probabilities ◽

Data Sets ◽

C4.5 Algorithm ◽

Tree Models ◽

Noisy Domains ◽

Robust To Noise ◽

Bagging Ensemble

Extracting knowledge from data with noise or outliers is a complex problem in data mining. Normally, it is not easy to eliminate the problematic instances. To obtain information from this type of data, robust classifiers are the best option. One such option is to apply a bagging scheme to weak single classifiers. The Credal C4.5 (CC4.5) model is a new classification tree procedure based on the classical C4.5 algorithm and imprecise probabilities. It is one of the so-called credal trees. CC4.5 has been proven to be more robust to noise than the C4.5 method and even than earlier credal tree models. In this paper, the performance of the CC4.5 model in bagging schemes on noisy domains is shown. An experimental study on data sets with added noise compares the results of bagging schemes applied to credal trees and to the C4.5 procedure. The well-known Random Forest (RF) classification method is also used as a benchmark. It is shown that the bagging ensemble using pruned credal trees outperforms the successful bagging C4.5 and RF when data sets with medium-to-high noise levels are classified.
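
The bagging scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the base learner here is a hypothetical one-feature threshold stump rather than a Credal C4.5 tree, and `fit_stump`, `bagging` and `add_label_noise` are names invented for this example. It shows only the general idea of training each model on a bootstrap resample of noisy data and combining predictions by majority vote.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier (stand-in base learner).

    points is a list of (x, label) pairs with labels 0 or 1.
    """
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left_label in (0, 1):
            # Count training errors for this threshold and orientation.
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagging(points, n_models, rng):
    """Train n_models stumps on bootstrap resamples; predict by majority vote."""
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) instances with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


def add_label_noise(points, rate, rng):
    """Flip each binary label independently with probability `rate`."""
    return [(x, 1 - y if rng.random() < rate else y) for x, y in points]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [(x, 1 if x >= 50 else 0) for x in range(100)]
    noisy = add_label_noise(clean, 0.1, rng)
    predict = bagging(noisy, 25, rng)
    print(predict(10), predict(90))
```

An odd number of models avoids tied votes in the binary case. The noise rate and ensemble size above are illustrative values, not those of the paper's experiments.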

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Tree Algorithms ◽

R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree, and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages' prediction performance based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score, and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
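
For readers unfamiliar with the metrics named in this abstract, the following is a small sketch of how Precision, Recall, F1-score and AUC are commonly computed for a binary problem. The binary setting and the function names are assumptions made for illustration; the abstract does not say how the authors averaged the metrics over cell types or folds.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise AUC above is quadratic in the number of samples, which is fine for a sketch but would usually be replaced by a rank-based computation on large data sets.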

Application of genetic algorithm and greedy stepwise to select input variables in classification tree models for the prediction of habitat requirements of Azolla filiculoides (Lam.) in Anzali wetland, Iran

Ecological Modelling ◽

10.1016/j.ecolmodel.2012.12.010 ◽

2013 ◽

Vol 251 ◽

pp. 44-53 ◽

Cited By ~ 22

Author(s):

Roghayeh Sadeghi ◽

Rahmat Zarkami ◽

Karim Sabetraftar ◽

Patrick Van Damme

Keyword(s):

Genetic Algorithm ◽

Classification Tree ◽

Habitat Requirements ◽

Anzali Wetland ◽

Azolla Filiculoides ◽

Tree Models ◽

Input Variables

Effects of sample survey design on the accuracy of classification tree models in species distribution models

Ecological Modelling ◽

10.1016/j.ecolmodel.2006.05.016 ◽

2006 ◽

Vol 199 (2) ◽

pp. 132-141 ◽

Cited By ~ 91

Author(s):

Thomas C. Edwards ◽

D. Richard Cutler ◽

Niklaus E. Zimmermann ◽

Linda Geiser ◽

Gretchen G. Moisen

Keyword(s):

Species Distribution ◽

Species Distribution Models ◽

Survey Design ◽

Classification Tree ◽

Sample Survey ◽

Distribution Models ◽

Tree Models

Application of the C4.5 Algorithm to Predict the Types of Disease in Pigs Based on Android

JELIKU (Jurnal Elektronik Ilmu Komputer Udayana) ◽

10.24843/jlk.2021.v10.i01.p14 ◽

2021 ◽

Vol 10 (1) ◽

pp. 105

Author(s):

I Gusti Ayu Purnami Indryaswari ◽

Ida Bagus Made Mahendra

Keyword(s):

Programming Language ◽

Test Data ◽

Training Data ◽

Data Sets ◽

Android Application ◽

C4.5 Algorithm ◽

Sqlite Database

Many Indonesians, especially in Bali, raise pigs as livestock. Pigs are susceptible to various types of diseases, and there have been many cases of pig deaths due to disease that cause losses to breeders. The authors therefore set out to create an Android-based application that predicts the type of disease in pigs by applying the C4.5 algorithm. The C4.5 algorithm classifies data in order to obtain rules that are then used for prediction. In this study, 50 training data sets were used, covering 8 types of pig disease and 31 disease symptoms. These data are entered into the system and processed so that the Android application can predict the type of disease in pigs. Testing on 15 test data sets produced an accuracy of 86.7%. The application features, built using the Kotlin programming language and an SQLite database, ran as expected in testing.
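
As background on how C4.5 derives its rules, the sketch below computes the gain ratio, the criterion C4.5 uses to choose which attribute to split on. It is a generic illustration, not the application's code: the attribute names (`fever`, `cough`) and disease labels are hypothetical and do not come from the study's data.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(rows, labels, attribute):
    """Gain ratio of splitting on `attribute`.

    rows is a list of dicts mapping attribute names to discrete values;
    labels is the list of class labels, aligned with rows.
    """
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    rows = [
        {"fever": "yes", "cough": "yes"},
        {"fever": "yes", "cough": "no"},
        {"fever": "no", "cough": "yes"},
        {"fever": "no", "cough": "no"},
    ]
    labels = ["disease_a", "disease_a", "disease_b", "disease_b"]
    print(gain_ratio(rows, labels, "fever"), gain_ratio(rows, labels, "cough"))
```

In this toy data, `fever` separates the two labels perfectly while `cough` carries no information, so a C4.5-style learner would split on `fever` first.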

Identifying the Most Important Factors in the At-Fault Probability of Motorcyclists by Data Mining, Based on Classification Tree Models

International Journal of Civil Engineering ◽

10.1007/s40999-017-0180-0 ◽

2017 ◽

Vol 15 (4) ◽

pp. 653-662 ◽

Cited By ~ 7

Author(s):

Mohammad Bagher Anvari ◽

Ali Tavakoli Kashani ◽

Rahim Rabieyan

Keyword(s):

Data Mining ◽

Classification Tree ◽

Fault Probability ◽

Tree Models

Study and Ranking of Determinants of Taenia solium Infections by Classification Tree Models

American Journal of Tropical Medicine and Hygiene ◽

10.4269/ajtmh.13-0593 ◽

2015 ◽

Vol 92 (1) ◽

pp. 56-63 ◽

Cited By ~ 6

Author(s):

Kabemba E. Mwape ◽

Nicolas Praet ◽

Gideon Zulu ◽

Sarah Gabriël ◽

Pierre Dorny ◽

...

Keyword(s):

Classification Tree ◽

Taenia Solium ◽

Tree Models

Computer prediction model for health status using classification tree models and Big data digital image

10.1109/iccasit53235.2021.9633485 ◽

2021 ◽

Author(s):

Jing Wang ◽

Gongli Li

Keyword(s):

Big Data ◽

Health Status ◽

Prediction Model ◽

Digital Image ◽

Classification Tree ◽

Computer Prediction ◽

Tree Models

Using Classification Tree Models to Determine Course Placement

Educational Measurement Issues and Practice ◽

10.1111/emip.12470 ◽

2021 ◽

Author(s):

Chansoon Lee

Keyword(s):

Classification Tree ◽

Course Placement ◽

Tree Models

A Robust Dynamic Classifier Selection Approach for Hyperspectral Images with Imprecise Label Information

Sensors ◽

10.3390/s20185262 ◽

2020 ◽

Vol 20 (18) ◽

pp. 5262

Author(s):

Meizhu Li ◽

Shaoguang Huang ◽

Jasper De Bock ◽

Gert de Cooman ◽

Aleksandra Pižurica

Keyword(s):

Hyperspectral Image ◽

Probability Distributions ◽

Imprecise Probabilities ◽

Data Sets ◽

Classifier Selection ◽

Training Samples ◽

Proposed Model ◽

Label Information ◽

Dynamic Classifier Selection ◽

The Individual

Supervised hyperspectral image (HSI) classification relies on accurate label information. However, it is not always possible to collect perfectly accurate labels for training samples. This motivates the development of classifiers that are sufficiently robust to a reasonable amount of error in data labels. Despite its growing importance, this aspect has not yet been sufficiently studied in the literature. In this paper, we analyze the effect of erroneous sample labels on the probability distributions of the principal components of HSIs, and in this way provide a statistical analysis of the resulting uncertainty in classifiers. Building on the theory of imprecise probabilities, we develop a novel robust dynamic classifier selection (R-DCS) model for data classification with erroneous labels. In particular, spectral and spatial features are extracted from HSIs to construct two individual classifiers, one per feature type, for the dynamic selection. The proposed R-DCS model is based on the robustness of the classifiers' predictions: the extent to which a classifier can be altered without changing its prediction. We provide three possible selection strategies for the proposed model with different computational complexities and apply them to three benchmark data sets. Experimental results demonstrate that the proposed model outperforms the individual classifiers it selects from and is more robust to label errors than widely adopted approaches.

Deterministic Linkage as a Preceding Filter for Other Record Linkage Methods

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622015500108 ◽

2015 ◽

Vol 14 (03) ◽

pp. 521-533

Author(s):

M. Sariyar ◽

A. Borg

Keyword(s):

Record Linkage ◽

Classification Tree ◽

Real Data ◽

Training Data ◽

Data Sets ◽

Empirical Comparison ◽

Linkage Methods ◽

Data Pair ◽

Tree Methods ◽

Almost All

Deterministic record linkage (RL) is frequently regarded as a rival to more sophisticated strategies like probabilistic RL. We investigate the effect of combining deterministic linkage with other linkage techniques. For this task, we use a simple deterministic linkage strategy as a preceding filter: a data pair is classified as 'match' if all values of the attributes considered agree exactly, otherwise as 'nonmatch'. This strategy is separately combined with two probabilistic RL methods based on the Fellegi–Sunter model and with two classification tree methods (CART and Bagging). An empirical comparison was conducted on two real data sets. We used four different partitions into training data and test data to increase the validity of the results. In almost all cases, applying deterministic linkage as a preceding filter leads to better results than omitting such a pre-filter, and overall, classification trees exhibited the best results. On all data sets, probabilistic RL only profited from deterministic linkage when the underlying probabilities were estimated before applying deterministic linkage. When a pre-filter is used to subtract definite cases, the underlying population of data pairs changes. It is crucial to take this into account for model-based probabilistic RL.
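
The pre-filter idea in this abstract can be sketched as follows. This is one plausible reading, not the authors' implementation: pairs whose considered attributes all agree exactly are labelled 'match' up front, and only the remaining pairs are handed to a second classifier. The function names and the `agreement_fallback` rule are hypothetical stand-ins for the probabilistic or tree-based methods used in the paper.

```python
def exact_match(record_a, record_b, attributes):
    """Deterministic rule: every considered attribute agrees exactly."""
    return all(record_a[name] == record_b[name] for name in attributes)


def link_with_prefilter(pairs, attributes, fallback):
    """Label exact agreements as 'match'; defer all other pairs to `fallback`.

    `fallback` is any callable taking (record_a, record_b) and returning
    'match' or 'nonmatch'. Here it stands in for a downstream linkage method.
    """
    labels = []
    for record_a, record_b in pairs:
        if exact_match(record_a, record_b, attributes):
            labels.append("match")
        else:
            labels.append(fallback(record_a, record_b))
    return labels


def agreement_fallback(record_a, record_b):
    """Hypothetical downstream rule: match if at least 2 of 3 fields agree."""
    fields = ("first_name", "last_name", "birth_year")
    agreeing = sum(record_a[f] == record_b[f] for f in fields)
    return "match" if agreeing >= 2 else "nonmatch"
```

Because the pre-filter removes the definite cases, any model fitted only on the remaining pairs sees a different population, which is the caveat the abstract raises for model-based probabilistic RL.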
