Component-wise AdaBoost algorithms for high-dimensional binary classification and class probability prediction

High-dimensional Unbalanced Binary Classification by Genetic Programming with Multi-criterion Fitness Evaluation and Selection

Evolutionary Computation ◽

10.1162/evco_a_00304 ◽

2021 ◽

pp. 1-26

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Lin Shang ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Fitness Function ◽

Binary Classification ◽

Class Imbalance ◽

Area Under The Curve ◽

High Dimensional ◽

Genetic Operators ◽

Minority Class ◽

Fitness Evaluation ◽

Unbalanced Classification

Abstract High-dimensional unbalanced classification is challenging because of the joint effects of high dimensionality and class imbalance. Genetic programming (GP) has the potential benefits for use in high-dimensional classification due to its built-in capability to select informative features. However, once data is not evenly distributed, GP tends to develop biased classifiers which achieve a high accuracy on the majority class but a low accuracy on the minority class. Unfortunately, the minority class is often at least as important as the majority class. It is of importance to investigate how GP can be effectively utilized for high-dimensional unbalanced classification. In this paper, to address the performance bias issue of GP, a new two-criterion fitness function is developed, which considers two criteria, i.e. the approximation of area under the curve (AUC) and the classification clarity (i.e. how well a program can separate two classes). The obtained values on the two criteria are combined in pairs, instead of summing them together. Furthermore, this paper designs a three-criterion tournament selection to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. The experimental results show that the proposed method achieves better classification performance than other compared methods.

Download Full-text

High dimensional model representation of log likelihood ratio: binary classification with SNP data

BMC Medical Genomics ◽

10.1186/s12920-020-00774-1 ◽

2020 ◽

Vol 13 (S9) ◽

Author(s):

Ali Foroughi pour ◽

Maciej Pietrzak ◽

Lara E. Sucheston-Campbell ◽

Ezgi Karaesmen ◽

Lori A. Dalton ◽

...

Keyword(s):

Prediction Accuracy ◽

Binary Classification ◽

Small Sample ◽

High Dimensional ◽

High Dimensional Model Representation ◽

Dimensional Model ◽

Feature Interactions ◽

Pairwise Interactions ◽

Snp Data ◽

Log Likelihood

Abstract Background Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. Small-sample, high-dimensional nature of SNP data, weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values which may be best understood as ordinal or categorical variables, but are treated as continuous ones by many algorithms. Methods We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions. Results We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small sample GWAS datasets when bounding false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset FTP declares a larger portion of SNP pairs used by LABS-HDMR-CO as significant. Conclusion LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.

Download Full-text

Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification

Entropy ◽

10.3390/e22050543 ◽

2020 ◽

Vol 22 (5) ◽

pp. 543 ◽

Cited By ~ 2

Author(s):

Konrad Furmańczyk ◽

Wojciech Rejchel

Keyword(s):

Logistic Regression ◽

Variable Selection ◽

Logistic Model ◽

Binary Classification ◽

Model Misspecification ◽

High Dimensional ◽

Classification Models ◽

Computationally Efficient ◽

Class Labels ◽

Penalized Logistic Regression

In this paper, we consider prediction and variable selection in the misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification, which are computationally efficient, but lead to model misspecification. The first one is to apply penalized logistic regression to the classification data, which possibly do not follow the logistic model. The second method is even more radical: we just treat class labels of objects as they were numbers and apply penalized linear regression. In this paper, we investigate thoroughly these two approaches and provide conditions, which guarantee that they are successful in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper is completed by the experimental results.

Download Full-text

Sensing structure in learning-based binary classification of high-dimensional data

2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton) ◽

10.1109/allerton.2011.6120348 ◽

2011 ◽

Author(s):

B. Orten ◽

P. Ishwar ◽

W. C. Karl ◽

V. Saligrama

Keyword(s):

Binary Classification ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Consistent Screening Procedures in High-dimensional Binary Classification

Statistica Sinica ◽

10.5705/ss.202020.0088 ◽

2022 ◽

Author(s):

Hangjin Jiang ◽

Xingqiu Zhao ◽

Ronald C.W. Ma ◽

Xiaodan Fan

Keyword(s):

Binary Classification ◽

High Dimensional ◽

Screening Procedures

Download Full-text

1D embedding multi-category classification methods

International Journal of Wavelets Multiresolution and Information Processing ◽

10.1142/s0219691316400063 ◽

2016 ◽

Vol 14 (02) ◽

pp. 1640006 ◽

Cited By ~ 3

Author(s):

Luoqing Li ◽

Chuanwu Yang ◽

Qiwei Xie

Keyword(s):

Binary Classification ◽

High Dimensional Data ◽

Classification Performance ◽

Classification Method ◽

High Dimensional ◽

Classification Methods ◽

One Dimensional ◽

Interpolation Technique ◽

Facial Images

In this paper, we propose a novel semi-supervised multi-category classification method based on one-dimensional (1D) multi-embedding. Based on the multiple 1D embedding based interpolation technique, we embed the high-dimensional data into several different 1D manifolds and perform binary classification firstly. Then we construct the multi-category classifiers by means of one-versus-rest and one-versus-one strategies separately. A weight strategy is employed in our algorithm for improving the classification performance. The proposed method shows promising results in the classification of handwritten digits and facial images.

Download Full-text

Impact of class-imbalance on multi-class high-dimensional class prediction

Advances in Methodology and Statistics ◽

10.51936/grxm1445 ◽

2012 ◽

Vol 9 (1) ◽

Author(s):

Rok Blagus ◽

Lara Lusa

Keyword(s):

Sample Size ◽

Binary Classification ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

High Dimensional ◽

Class Imbalance Problem ◽

Class Prediction ◽

Linear Discriminant ◽

Imbalance Problem

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare its performance with multi-class DLDA, using simulated and real data. Our results show that the class-imbalance has a significant effect on the classification results: the classification is biased towards the majority class as in the two-class problems and the problem is magnified when the number of variables is large. The amount of the bias depends also, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Also variable selection plays an important role in the class-imbalance problem and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the least sensible classifiers to class-imbalance and its use is recommended also for multi-class problems. Whenever possible the experiments should be planned using balanced data in order to avoid the class-imbalance problem.

Download Full-text

Multivariate binary classification of imbalanced datasets-A case study based on high-dimensional multiplex autoimmune assay data

Biometrical Journal ◽

10.1002/bimj.201600207 ◽

2017 ◽

Vol 59 (5) ◽

pp. 948-966 ◽

Cited By ~ 1

Author(s):

Laura Schlieker ◽

Anna Telaar ◽

Angelika Lueking ◽

Peter Schulz-Knappe ◽

Carmen Theek ◽

...

Keyword(s):

Binary Classification ◽

High Dimensional ◽

Imbalanced Datasets

Download Full-text

LEARNING DECISION TREES WITH LOG CONDITIONAL LIKELIHOOD

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001410007877 ◽

2010 ◽

Vol 24 (01) ◽

pp. 117-151 ◽

Cited By ~ 2

Author(s):

HAN LIANG ◽

YUHONG YAN ◽

HARRY ZHANG

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Naive Bayes ◽

Naïve Bayes ◽

Probability Estimation ◽

Conditional Likelihood ◽

Learning Models ◽

Class Probability Estimation ◽

Probability Prediction ◽

Class Probability

In machine learning and data mining, traditional learning models aim for high classification accuracy. However, accurate class probability prediction is more desirable than classification accuracy in many practical applications, such as medical diagnosis. Although it is known that decision trees can be adapted to be class probability estimators in a variety of approaches, and the resulting models are uniformly called Probability Estimation Trees (PETs), the performances of these PETs in class probability estimation, have not yet been investigated. We begin our research by empirically studying PETs in terms of class probability estimation, measured by Log Conditional Likelihood (LCL). We also compare a PET called C4.4 with other representative models, including Naïve Bayes, Naïve Bayes Tree, Bayesian Network, KNN and SVM, in LCL. From our experiments, we draw several valuable conclusions. First, among various tree-based models, C4.4 is the best in yielding precise class probability prediction measured by LCL. We provide an explanation for this and reveal the nature of LCL. Second, compared with non tree-based models, C4.4 also performs best. Finally, LCL does not dominate another well-established relevant metric — AUC, which suggests that different decision-tree learning models should be used for different objectives. Our experiments are conducted on the basis of 36 UCI sample sets. We run all the models within a machine learning platform — Weka. We also explore an approach to improve the class probability estimation of Naïve Bayes Tree. We propose a greedy and recursive learning algorithm, where at each step, LCL is used as the scoring function to expand the decision tree. The algorithm uses Naïve Bayes created at leaves to estimate class probabilities of test samples. The whole tree encodes the posterior class probability in its structure. One benefit of improving class probability estimation is that both classification accuracy and AUC can be possibly scaled up. We call the new model LCL Tree (LCLT). Our experiments on 33 UCI sample sets show that LCLT outperforms all state-of-the-art learning models, such as Naïve Bayes Tree, significantly in accurate class probability prediction measured by LCL, as well as in classification accuracy and AUC.

Download Full-text

High dimensional model representation of log-likelihood ratio: binary classification with expression data

BMC Bioinformatics ◽

10.1186/s12859-020-3486-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Ali Foroughi pour ◽

Maciej Pietrzak ◽

Lori A Dalton ◽

Grzegorz A. Rempała

Keyword(s):

Likelihood Ratio ◽

Binary Classification ◽

Model Representation ◽

High Dimensional ◽

High Dimensional Model Representation ◽

Expression Data ◽

Dimensional Model ◽

Log Likelihood ◽

Log Likelihood Ratio

Download Full-text