Novel Approach to Choosing Principal Components Number in Logistic Regression

2021 ◽  
Vol 7 (1) ◽  
pp. 1-12
Author(s):  
Borislava Vrigazova

The conventional approach to choosing the number of principal components for prediction models is to examine the contribution of each principal component to the total variance of the target variable. A combination of potentially important principal components can then be chosen to explain a large share of the variance in the target, and several combinations sometimes need to be explored to reach the highest classification accuracy. This research proposes a novel automatic way of deciding how many principal components should be retained to improve classification accuracy. We do so by combining principal components with ANOVA selection, and we apply a bootstrap procedure for model selection to improve the accuracy of the automatic approach. We call this procedure the Bootstrapped-ANOVA PCA selection. Our results suggest that this procedure can automate the selection of principal components and improve the accuracy of classification models, in our example the logistic regression.
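The Bootstrapped-ANOVA PCA idea described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: each principal component's scores are ranked by a one-way ANOVA F-statistic against the class labels, and a component is retained if it clears the threshold in a majority of bootstrap resamples. The `f_threshold` value and the majority-vote rule are assumptions.

```python
import numpy as np

def anova_f(scores, y):
    # one-way ANOVA F-statistic of one component's scores against the class labels
    classes = np.unique(y)
    grand = scores.mean()
    ssb = sum(len(scores[y == c]) * (scores[y == c].mean() - grand) ** 2 for c in classes)
    ssw = sum(((scores[y == c] - scores[y == c].mean()) ** 2).sum() for c in classes)
    return (ssb / (len(classes) - 1)) / (ssw / (len(scores) - len(classes)))

def bootstrapped_anova_pca(X, y, n_boot=100, f_threshold=4.0, seed=0):
    # keep the components whose F-statistic clears the threshold
    # in a majority of bootstrap resamples of the rows
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                       # principal-component scores
    votes = np.zeros(scores.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        for j in range(scores.shape[1]):
            if anova_f(scores[idx, j], y[idx]) > f_threshold:
                votes[j] += 1
    return np.where(votes > n_boot / 2)[0]   # indices of retained components
```

The retained component scores would then feed a logistic regression as its reduced feature set.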

2015 ◽  
Vol 54 (06) ◽  
pp. 560-567 ◽  
Author(s):  
K. Zhu ◽  
Z. Lou ◽  
J. Zhou ◽  
N. Ballester ◽  
P. Parikh ◽  
...  

Summary
Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Big Data and Analytics in Healthcare”.
Background: Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand.
Objectives: Explore the use of conditional logistic regression to increase prediction accuracy.
Methods: We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics, clinical, and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had an inpatient stay during an 11-month period. We corrected the data imbalance issue with under-sampling. We first applied standard logistic regression and a decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original dataset accordingly and applied logistic regression to each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models.
Results: The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved classification accuracy by nearly 20% over the straightforward logistic regression model. Furthermore, the developed CLR models achieved better sensitivity, by more than 10%, than the standard classification models, which translates to correctly labeling an additional 400–500 readmissions of heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location at discharge, the number of chronic conditions, and the number of acute procedures.
Conclusions: It is beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide cohort stratification, and potentially beneficial to explore pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed here offers insights for future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise awareness of collecting data on additional markers and developing the database infrastructure necessary for larger-scale exploratory studies on readmission risk prediction.
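The stratify-then-fit scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: a plain gradient-ascent logistic regression is fitted separately within each stratum, where in the paper the strata come from decision-tree rules. On data whose relationship between predictors and readmission flips across strata, the per-stratum models beat a single pooled model.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, n_iter=500):
    # plain gradient-ascent logistic regression with an intercept term
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)   # log-likelihood gradient step
    return w

def predict_logreg(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (Xb @ w >= 0).astype(int)        # sigmoid >= 0.5 iff linear score >= 0

def conditional_logreg(X, y, strata):
    # one logistic model per stratum; the strata would come from decision-tree rules
    return {s: fit_logreg(X[strata == s], y[strata == s]) for s in np.unique(strata)}
```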


2019 ◽  
Vol 1 (2) ◽  
pp. 19-31
Author(s):  
Kalaivani S ◽  
Shalini Dhiman ◽  
Rajagopal T.K.P.

Emergency Department (ED) boarding – the inability to transfer emergency patients to inpatient beds – is a key factor contributing to ED overcrowding. This paper presents a novel approach to improving hospital operational efficiency and, therefore, to decreasing ED boarding. Using the historic data of 15,000 patients, admission results and patient information are correlated in order to identify important admission predictors. For example, the type of radiology exams ordered by the ED physician is identified as among the most important predictors of admission. Based on these factors, a real-time prediction model is developed that correctly predicts the admission result of four out of every five ED patients. The proposed admission model can be used by inpatient units to estimate the likelihood of ED patients' admission and, consequently, the number of patients arriving from the ED in the near future. Using similar prediction models, hospitals can evaluate their short-term needs for inpatient care more accurately.
We use three algorithms to build the predictive models: (1) logistic regression, (2) decision trees, and (3) … The best-performing model (accuracy = 80.31%, AUC-ROC = 0.859) outperformed the decision tree (accuracy = 80.06%, AUC-ROC = 0.824) and the logistic regression model (accuracy = 79.94%, AUC-ROC = 0.849). Drawing on logistic regression, we identify several factors related to hospital admissions, including hospital site, age, arrival mode, triage category, care group, previous admission in the past month, and previous admission in the past year.

From a different perspective, the research focuses on mobility data instead of personal data in general, using Structural Equation Modelling as the analysis method. Based on this research finding, we identified a previously unexplored factor that can be used to predict the intention to disclose mobility data, and the results also confirmed the role of context aspects such as demographics and different personal data categories.
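The AUC-ROC figures quoted above can be computed directly from prediction scores without any library. A minimal rank-based sketch (AUC as the probability that a random positive case outranks a random negative one):

```python
def auc_roc(y_true, scores):
    # rank-based AUC: fraction of positive/negative pairs ranked correctly,
    # with ties counted as half a win
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates 4 pairs, 3 of them ranked correctly, giving 0.75.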


2019 ◽  
Vol 11 (10) ◽  
pp. 1219 ◽  
Author(s):  
Lan Zhang ◽  
Hongjun Su ◽  
Jingwei Shen

Dimensionality reduction (DR) is an important preprocessing step in hyperspectral image applications. In this paper, a superpixelwise kernel principal component analysis (SuperKPCA) method for DR is proposed that performs kernel principal component analysis (KPCA) on each homogeneous region, to fully exploit KPCA's ability to acquire nonlinear features. Moreover, for the proposed method, the differences in the DR results obtained from different fundamental images (the first principal components obtained by principal component analysis (PCA), KPCA, and minimum noise fraction (MNF)) are compared. Extensive experiments show that when 5, 10, 20, and 30 samples from each class are selected for the Indian Pines, Pavia University, and Salinas datasets: (1) when the most suitable fundamental image is selected, the classification accuracy obtained by SuperKPCA can be increased by 0.06%–0.74%, 3.88%–4.37%, and 0.39%–4.85%, respectively, when compared with SuperPCA, which performs PCA on each homogeneous region; (2) the DR results obtained from different first principal components are different and complementary. By fusing the multiscale classification results obtained from different first principal components, the classification accuracy can be increased by 0.54%–2.68%, 0.12%–1.10%, and 0.01%–0.08%, respectively, when compared with the method based only on the most suitable fundamental image.
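The superpixelwise idea can be sketched as follows: run RBF-kernel PCA independently inside each homogeneous region. This is an illustrative sketch, not the paper's pipeline; it assumes segment labels are already available (e.g., from any superpixel algorithm), and the kernel width `gamma` is a placeholder.

```python
import numpy as np

def kpca_scores(X, n_components=2, gamma=1.0):
    # RBF-kernel PCA on the pixels of one homogeneous region
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                # centre the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components] # largest eigenvalues first
    alphas = vecs[:, order] / np.sqrt(np.abs(vals[order]))
    return Kc @ alphas                            # nonlinear component scores

def superkpca(pixels, segments, n_components=2, gamma=1.0):
    # apply KPCA independently within each superpixel label
    out = np.zeros((len(pixels), n_components))
    for s in np.unique(segments):
        m = segments == s
        out[m] = kpca_scores(pixels[m], n_components, gamma)
    return out
```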


2010 ◽  
Vol 10 (03) ◽  
pp. 343-363
Author(s):  
ULRIK SÖDERSTRÖM ◽  
HAIBO LI

In this paper, we examine how much information is needed to represent the facial mimic, based on Paul Ekman's assumption that the facial mimic can be represented with a few basic emotions. Principal component analysis is used to compactly represent the important facial expressions. Theoretical bounds for facial mimic representation are presented both for a certain number of principal components and for a certain number of bits. When 10 principal components are used to reconstruct color video at a resolution of 240 × 176 pixels, the representation bound is on average 36.8 dB, measured in peak signal-to-noise ratio. Practical confirmation of the theoretical bounds is demonstrated. Quantization of the projection coefficients affects the representation, but quantization with approximately 7-8 bits is found to match an exact representation, measured in mean square error.
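The PSNR-versus-components trade-off can be illustrated with a small reconstruction experiment. This is a sketch, not the authors' setup: each video frame is a row vector, the frames are reconstructed from a truncated PCA basis, and per-frame PSNR is averaged (assuming an 8-bit peak of 255).

```python
import numpy as np

def pca_psnr(frames, n_components, peak=255.0):
    # reconstruct frames from a truncated PCA basis and report mean PSNR in dB
    mean = frames.mean(axis=0)
    Xc = frames - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    recon = mean + (U[:, :n_components] * S[:n_components]) @ Vt[:n_components]
    mse = ((frames - recon) ** 2).mean(axis=1)          # per-frame squared error
    return float((10 * np.log10(peak ** 2 / np.maximum(mse, 1e-12))).mean())
```

Adding components can only raise the reconstruction PSNR, which is the monotone behaviour behind the paper's bounds.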


Author(s):  
Jayalath Ekanayake ◽  

Reported bugs of software systems are classified into different severity levels before being fixed. The number of bug reports may not be equally distributed across the severity levels. However, most severity prediction models in the literature assume that the underlying data distribution is balanced, which is not correct in all instances; hence, the aim of this study is to develop bug classification models from unevenly distributed datasets and test them accordingly. To that end, the topics or keywords of the developer descriptions in bug reports are first extracted using the Rapid Automatic Keyword Extraction (RAKE) algorithm and then transformed into numerical attributes, which, combined with the severity levels, constitute the datasets. These datasets are used to build classification models with the Naïve Bayes, Logistic Regression, and Decision Tree Learner algorithms. The models' prediction quality is measured using the Area Under the Receiver Operating Characteristic Curve (AUC), as the models learn from skewed environments. According to the results, the prediction quality of the Logistic Regression model is 0.65 AUC, whereas the other two models record at most 0.60 AUC. Though the datasets contain comparatively few instances of the high-severity classes, Blocking and High, the Logistic Regression model predicts the two classes with a decent AUC value of 0.65. Hence, this project shows that models can be trained from highly skewed datasets so that their prediction quality is equally good over all classes, regardless of the number of instances representing each class. Further, this project emphasizes that models should be evaluated using appropriate metrics when they are trained in imbalanced learning environments. This work also uncovers that the Logistic Regression model is as capable of classifying documents as Naïve Bayes, which is well known for this task.
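The RAKE step can be sketched in a few lines. This is a minimal illustration of the algorithm's core idea, not the study's implementation: candidate phrases are stopword-delimited word runs, each word scores degree/frequency, and a phrase scores the sum of its word scores. The stopword list here is a toy assumption.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "is", "in", "and", "when", "on"}

def rake_keywords(text, top_k=3):
    # minimal RAKE sketch: split on stopwords into candidate phrases,
    # score words by degree/frequency, score phrases by summing word scores
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = {}, {}
    for p in phrases:
        for w in p:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(p)   # co-occurrence degree
    scored = [(sum(degree[w] / freq[w] for w in p), " ".join(p)) for p in phrases]
    return [p for _, p in sorted(scored, reverse=True)[:top_k]]
```

The extracted phrases would then be turned into numerical attributes (e.g., presence/absence features) for the classifiers.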


Author(s):  
Avani Ahuja

In the current era of ‘big data’, scientists can quickly amass enormous amounts of data in a limited number of experiments. Investigators then try to hypothesize about the root cause based on the observed trends in the predictors and the response variable. This involves identifying the discriminatory predictors that are most responsible for explaining variation in the response variable. In the current work, we investigated three related multivariate techniques: Principal Component Regression (PCR), Partial Least Squares or Projections to Latent Structures (PLS), and Orthogonal Partial Least Squares (OPLS). To perform a comparative analysis, we used a publicly available dataset for Parkinson's disease patients. We first performed the analysis using a cross-validated number of principal components for each of the aforementioned techniques. Our results demonstrated that PLS and OPLS were better suited than PCR for identifying the discriminatory predictors. Since the X data did not exhibit strong correlation, we also performed Multiple Linear Regression (MLR) on the dataset. A comparison of the top five discriminatory predictors identified by the four techniques showed a substantial overlap between the results obtained by PLS, OPLS, and MLR, while these three techniques diverged significantly from the variables identified by PCR. A further investigation of the data revealed that PCR could successfully identify the discriminatory variables if the number of principal components in the regression model was increased. In summary, we recommend using PLS or OPLS for hypothesis generation, and systemizing the selection of principal components when using PCR.
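The PCR-versus-PLS contrast above can be demonstrated numerically. This is an illustration of the general techniques, not the paper's analysis: PCR picks latent directions by X-variance alone, while PLS1 picks them by covariance with y, so a low-variance but discriminatory predictor is found by PLS with one component, but by PCR only when more components are added.

```python
import numpy as np

def pcr_coefs(X, y, n_components):
    # principal component regression: regress y on the top PCA scores,
    # then map the coefficients back to the original predictors
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]        # PC scores
    b = np.linalg.lstsq(T, yc, rcond=None)[0]
    return Vt[:n_components].T @ b                    # back in predictor space

def pls1_coefs(X, y, n_components):
    # PLS1 via the NIPALS deflation loop: directions chosen by covariance with y
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    Xd, yd = Xc.copy(), yc.copy()
    for _ in range(n_components):
        w = Xd.T @ yd                                 # covariance direction
        w /= np.linalg.norm(w)
        t = Xd @ w
        p = Xd.T @ t / (t @ t)
        qk = yd @ t / (t @ t)
        Xd -= np.outer(t, p)                          # deflate X and y
        yd = yd - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)
```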


Author(s):  
Yong-Jin Jung ◽  
Kyoung-Woo Cho ◽  
Jong-Sung Lee ◽  
Chang-Heon Oh

With the increasing requirement of high accuracy in particulate matter prediction, various attempts have been made to improve prediction accuracy by applying machine learning algorithms. However, the characteristics of particulate matter and the uneven occurrence rate across concentration levels make prediction models difficult to train, resulting in poor predictions. To solve this problem, in this paper we propose multiple classification models that predict particulate matter concentrations by dividing them into AQI-based classes. We designed classification models using logistic regression, decision tree, SVM, and ensemble methods from among the various machine learning algorithms. Comparing the performance of the four classification models through error matrices confirmed an f-score of 0.82 or higher for all models other than the logistic regression model.
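The AQI-based class division reduces to a simple binning step, and the f-scores come from the error-matrix counts. The sketch below is illustrative: the breakpoints follow the common four-level Korean AQI bands for PM2.5 (good/moderate/bad/very bad) and are an assumption, not values from the paper, and the per-class F1 is computed straight from confusion counts.

```python
def aqi_class(pm25):
    # map a PM2.5 concentration (µg/m³) to a coarse AQI-style class;
    # the breakpoints are assumed four-level Korean AQI bands
    for bound, label in [(15, "good"), (35, "moderate"), (75, "bad")]:
        if pm25 <= bound:
            return label
    return "very bad"

def f1_scores(y_true, y_pred, labels):
    # per-class F1 from the error-matrix (confusion) counts
    out = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        out[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return out
```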


2018 ◽  
Vol 7 (4.5) ◽  
pp. 248 ◽  
Author(s):  
Syed Muzamil Basha ◽  
Dharmendra Singh Rajput ◽  
Ravi Kumar Poluru ◽  
S. Bharath Bhushan ◽  
Shaik Abdul Khalandar Basha

The classification task is to predict the value of the target variable from the values of the input variables; when a target is provided as part of the dataset, classification is a supervised task. It is important to analyze the performance of supervised classification models before using them in a classification task. In our research we propose a novel way to evaluate the performance of supervised classification models such as Decision Tree and Naïve Bayes using the KNIME Analytics platform. Experiments are conducted on a multivariate dataset of 58,000 instances and 9 columns intended for classification, collected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/statlog+(shuttle)), and the performance of both models is compared in terms of Classification Accuracy (CA) and Error Rate. Finally, both models are validated using the metrics precision, recall, and F-measure. We find that the Decision Tree achieves a CA of 99.465% whereas Naïve Bayes attains a CA of 90.358%; the F-measure of the Decision Tree is 0.984, whereas Naïve Bayes achieves 0.7045.
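For readers without KNIME at hand, the Naïve Bayes side of the comparison can be sketched directly. This is a minimal Gaussian Naïve Bayes (an illustration of the algorithm, not the KNIME node): per-class feature means, variances, and priors are estimated, and prediction maximizes the log-posterior under the independence assumption.

```python
import numpy as np

def fit_gnb(X, y):
    # Gaussian Naive Bayes: per-class feature means/variances plus class priors
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(0), Xc.var(0) + 1e-9, len(Xc) / len(y))
    return stats

def predict_gnb(stats, X):
    # pick the class with the highest log-posterior under feature independence
    def logpost(x, mu, var, prior):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return np.array([max(stats, key=lambda c: logpost(x, *stats[c])) for x in X])
```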


Author(s):  
Robert Beinert ◽  
Gabriele Steidl

Abstract
Principal component analysis (PCA) is known to be sensitive to outliers, and various robust PCA variants have been proposed in the literature. A recent model, called reaper, aims to find the principal components by solving a convex optimization problem. Usually the number of principal components must be determined in advance, and the minimization is performed over symmetric positive semi-definite matrices having the size of the data, although the number of principal components is substantially smaller. This prohibits its use when the dimension of the data is large, which is often the case in image processing. In this paper, we propose a regularized version of reaper which enforces sparsity in the number of principal components by penalizing the nuclear norm of the corresponding orthogonal projector. If only an upper bound on the number of principal components is available, our approach can be combined with the L-curve method to reconstruct the appropriate subspace. Our second contribution is a matrix-free algorithm for finding a minimizer of the regularized reaper which is also suited to high-dimensional data. The algorithm couples a primal-dual minimization approach with a thick-restarted Lanczos process. This appears to be the first efficient convex variational method for robust PCA that can handle high-dimensional data. As a side result, we discuss the topic of bias in robust PCA. Numerical examples demonstrate the performance of our algorithm.
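The building block behind a nuclear-norm penalty is its proximal map, which soft-thresholds the singular values; small singular values are driven exactly to zero, which is how the penalty enforces a low number of principal components. A minimal sketch of just this step (the paper's full primal-dual/Lanczos machinery is far beyond a few lines):

```python
import numpy as np

def prox_nuclear(M, tau):
    # proximal map of tau * ||.||_nuclear: soft-threshold the singular values,
    # zeroing out those below tau and hence reducing the rank
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(S - tau, 0.0)) @ Vt
```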


2021 ◽  
Vol 4 (1) ◽  
pp. 44-45
Author(s):  
Hesti Budiwati ◽  
Ainun Jariah

The study aims to build bankruptcy prediction models for rural banks in Indonesia at time horizons of 1 quarter (MP1), 2 quarters (MP2), 4 quarters (MP4), and 8 quarters (MP8) before bankruptcy. The quality of productive assets serves as the predictor variable, measured by the CEA, CEAEA, and NPL ratios; the condition of the rural bank (bankrupt or non-bankrupt) is the dependent variable. The analytical method used is logistic regression, followed by testing of model accuracy. The population of this study is rural banks in Indonesia; the sample comprises 241 rural banks, of which 41 went bankrupt and 200 did not. The data used are quarterly financial statements from 2006 to 2019. The results show that of the four prediction models successfully built, the 1-quarter model (MP1), formed by the CEAEA and NPL ratios, is the most feasible and accurate bankruptcy prediction model for rural banks in Indonesia. MP1 has a classification accuracy of 93.8% at the modelling stage with a cut-off point of 0.29, and a classification accuracy of 83.93% at the validation stage with a cut-off point of 0.12. Based on these advantages, MP1 was chosen as the model able to predict the bankruptcy of rural banks in Indonesia.
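The role of the cut-off point can be sketched directly (with made-up probabilities, not the study's data): lowering the threshold from the default 0.5 to a value such as 0.29 trades precision for sensitivity on the rare bankrupt class, which matters when only 41 of 241 banks are bankrupt.

```python
def classify(probs, cutoff):
    # flag a bank as bankrupt when its predicted probability meets the cut-off
    return [int(p >= cutoff) for p in probs]

def sensitivity(y_true, y_pred):
    # share of actual bankruptcies that the model catches
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / sum(y_true)
```

With hypothetical probabilities `[0.10, 0.30, 0.35, 0.60]` for three bankrupt banks and one healthy one, the 0.5 cut-off catches one bankruptcy in three, while a 0.29 cut-off catches all three.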

