Improvement of Prediction Ability of Multicomponent Regression Model by a Method Based on Data Mining in Chemometrics

One way to enhance the likelihood that more university students will graduate within the specific major that they begin with is to attract the type of students who have typically (historically) done well in that field of study. This paper expands upon a study that utilizes data mining techniques to analyze the characteristics of students who enroll as actuarial students and then either drop out of the major or graduate as actuarial students. Several predictive models including logistic regression, neural networks and decision trees are obtained using input variables describing academic attributes of the students. The models are then compared and the best fitting model is determined. The regression model turns out to be the best predictor. Since this is a very well understood method, it can easily be explained. The decision tree, although its underpinnings are somewhat difficult to explain, gives a clear and well understood output. In addition, the non-predictive method of cluster analysis is applied in order to group these students into distinct classifications based on the values of the input variables. Finally, a new approach to modeling in SAS®, called Rapid Predictive Modeler (RPM), is described and utilized. The results of the RPM also select the regression model as the best predictor.


Improving Geospatial Agreement by Hybrid Optimization in Logistic Regression-Based Landslide Susceptibility Modelling

Frontiers in Earth Science, 2021, Vol 9. DOI: 10.3389/feart.2021.713803

Author(s): Deliang Sun, Haijia Wen, Jiahui Xu, Yalan Zhang, Danzhou Wang, ...

Keyword(s): Logistic Regression, Regression Model, Landslide Susceptibility, Logistic Regression Model, Cross Validation, Dominant Factor, Mountainous Area, Prediction Ability, Before And After, Fold Cross Validation

This study develops a logistic regression model of landslide susceptibility that uses GeoDetector for dominant-factor screening and 10-fold cross validation for training sample optimization. First, Fengjie County, a typical mountainous area, was selected as the study area because it experienced 1,522 landslides from 2001 to 2016. Second, 22 factors were selected as the initial conditioning factors, and a geospatial database was established with a grid of 30 m precision. Factor detection from the geographic detector and the stepwise regression method included in logistic regression were used to screen the dominant factors from the database. Then, based on a sample dataset with a 1:10 ratio of landslides to nonlandslides, 10-fold cross validation was used to select the optimized sample for training the logistic regression model of landslide susceptibility in the study area. Finally, the accuracy and efficiency of the two models, before and after screening for the dominant factors, were evaluated and compared. The results showed that the total accuracy of both models exceeded 0.9 and the area under the receiver operating characteristic curve exceeded 0.8, indicating that the models before and after factor screening both had high reliability and good prediction ability. In addition, the screened factors played a leading role in the geospatial distribution of historical landslides, indicating that the screened dominant factors are individually reasonable. By improving the geospatial agreement between landslide susceptibility and actual landslide-prone areas through the screening of dominant factors and the optimization of training samples, a simple, efficient, and reliable logistic regression-based landslide susceptibility model can be constructed.
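
The 10-fold cross validation mentioned in this abstract can be sketched as follows. This is a minimal illustration only: the function name `k_fold_indices`, the sample count of 110, and the fixed seed are assumptions made for the example, and the abstract does not say how the folds were used to pick the optimized training sample.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Illustrative sample count (assumed): 110 samples split into 10 folds.
splits = list(k_fold_indices(110, k=10))
for train_idx, test_idx in splits:
    # Each sample is held out exactly once across the 10 rounds.
    assert not set(train_idx) & set(test_idx)
```

In each round a model would be fitted on `train_idx` and scored on `test_idx`; one plausible reading of "select the optimized sample" is keeping the training fold set with the best held-out score, but that is an assumption.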


Data Mining CMMSs: How to Convert Data into Knowledge

Biomedical Instrumentation & Technology, 2018, Vol 52 (s2), pp. 28-33. DOI: 10.2345/0899-8205-52.s2.28. Cited By ~ 1

Author(s): Larry Fennigkoh, D. Courtney Nanney

Keyword(s): Data Mining, Regression Analysis, Regression Model, Multiple Regression Analysis, Preventive Maintenance, Statistical Significance, Research Question, Inferential Statistics, Proper Interpretation

Although the healthcare technology management (HTM) community has decades of accumulated medical device–related maintenance data, little knowledge has been gleaned from these data. Finding and extracting such knowledge requires inferential statistics, which are well established but admittedly somewhat foreign to HTM. This article provides a basic background on inferential statistics and describes a case study of their application, limitations, and proper interpretation. The research question associated with this case study involved examining the effects of ventilator preventive maintenance (PM) labor hours, age, and manufacturer on needed unscheduled corrective maintenance (CM) labor hours. The study sample included more than 21,000 combined PM inspections and CM work orders on 2,045 ventilators from 26 manufacturers during a five-year period (2012–16). A multiple regression analysis revealed that device age, manufacturer, and accumulated PM inspection labor hours all influenced the amount of CM labor significantly (P < 0.001). In essence, CM labor hours increased with increasing PM labor. However, despite the statistical significance of these predictors, the regression analysis also indicated that ventilator age, manufacturer, and PM labor hours explained only approximately 16% of all variability in CM labor, with the remainder (84%) attributable to other factors that were not included in the study. As such, the regression model obtained here is not suitable for predicting ventilator CM labor hours.
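
The "approximately 16% of all variability" figure corresponds to the coefficient of determination, R². A minimal sketch of how R² is computed from observed and predicted values is given below; the function name `r_squared` and the toy numbers are assumptions for illustration and are not the study's data.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions y_pred."""
    mean_y = sum(y_true) / len(y_true)
    # Total variability around the mean.
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    # Variability left unexplained by the model.
    ss_residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_residual / ss_total


# Toy values (assumed): a perfect fit versus always predicting the mean.
perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

An R² near 0.16 means the fitted values sit much closer to the mean-only case than to a perfect fit, which is why predictors can be statistically significant while the model remains a poor predictor.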


A regression-based algorithm for frequent itemsets mining

Data Technologies and Applications, 2019, Vol 54 (3), pp. 259-273. DOI: 10.1108/dta-03-2019-0037

Author(s): Zirui Jia, Zengli Wang

Keyword(s): Data Mining, Regression Model, Multiple Linear Regression Model, Mining Area, Frequent Itemset, Continuous Data, Data Sets, Content Type, Existing Problems, Frequent Itemsets Mining

Purpose. Frequent itemset mining (FIM) is a basic topic in data mining. Most FIM methods build an itemset database containing all possible itemsets and use predefined thresholds to determine whether an itemset is frequent. However, this approach has some deficiencies. It is better suited to discrete data than to ordinal/continuous data, which may result in computational redundancy, and some of the results are difficult to interpret. The purpose of this paper is to address this gap by proposing a new data mining method. Design/methodology/approach. A regression pattern (RP) model is introduced, in which the regression model and the FIM method are combined to solve the existing problems. Using survey data from a computer technology and software professional qualification examination, the multiple linear regression model is selected to mine associations between items. Findings. The proposed algorithm mined some interesting associations, and the results show that the proposed method can be applied to ordinal/continuous data mining. The experiment with the RP model shows that, compared to FIM, the computational redundancy decreased and the results contain more information. Research limitations/implications. The proposed algorithm is designed for ordinal/continuous data and is expected to provide inspiration for data stream mining and unstructured data mining. Practical implications. Compared to FIM, which mines associations between discrete items, the RP model can mine associations between ordinal/continuous data sets. Importantly, the RP model performs well in saving computational resources and mining meaningful associations. Originality/value. The proposed algorithm provides a novel view for defining and mining associations.


Logistic Regression Model Using Scheimpflug-Placido Cornea Topographer Parameters to Diagnose Keratoconus

Journal of Ophthalmology, 2021, Vol 2021, pp. 1-7. DOI: 10.1155/2021/5528927

Author(s): Emre Altinkurt, Ozkan Avci, Orkun Muftuoglu, Adem Ugurlu, Zafer Cebeci, ...

Keyword(s): Logistic Regression, Regression Model, Roc Curve, Logistic Regression Model, Corneal Thickness, Computer Algorithms, Roc Curve Analysis, Data Set, Prediction Ability, Keratorefractive Surgery

Purpose. To diagnose keratoconus by establishing an effective logistic regression model from the data obtained with a Scheimpflug-Placido cornea topographer. Methods. Topographical parameters of 125 eyes of 70 patients diagnosed with keratoconus by clinical or topographical findings were compared with 120 eyes of 63 patients who were defined as keratorefractive surgery candidates. Receiver operating characteristic (ROC) curve analysis was performed to determine the diagnostic ability of the topographic parameters. The data set of parameters with an AUROC (area under the ROC curve) value greater than 0.9 was analyzed with logistic regression analysis (LRA) to determine the most predictive model that could diagnose keratoconus. A logit formula of the model was built, and the logit values of every eye in the study were calculated according to this formula. Then, an ROC analysis of the logit values was done. Results. Baiocchi Calossi Versaci front index (BCVf) had the highest AUROC value (0.976) in the study. The LRA model, which had the highest prediction ability, had 97.5% accuracy, 96.8% sensitivity, and 99.2% specificity. The most significant parameters were found to be BCVf (p = 0.001), BCVb (Baiocchi Calossi Versaci back) (p = 0.002), posterior rf (apical radius of the flattest meridian of the aspherotoric surface in 4.5 mm diameter of the cornea) (p = 0.005), central corneal thickness (p = 0.072), and minimum corneal thickness (p = 0.494). Conclusions. The LRA model can distinguish keratoconus corneas from normal ones with high accuracy without the need for complex computer algorithms.
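
The two computations named in the Methods, turning a logit value into a probability and scoring a set of values with the area under the ROC curve, can be sketched as below. The function names and the toy scores are assumptions for illustration; the paper's logit formula and data are not reproduced here. The AUROC is computed with the pairwise (Mann-Whitney) formulation, which may differ from the software the authors used.

```python
import math


def logit_to_probability(logit):
    """Map a logit value to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-logit))


def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case; ties count as one half.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy logit values (assumed): diseased eyes versus control eyes.
diseased = [2.1, 0.4, 3.3]
controls = [-1.5, 0.4, -0.2]
toy_auroc = auroc(diseased, controls)
```

Because the logistic function is monotonic, ranking eyes by logit or by probability gives the same AUROC.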


A Data Mining Approach on Lorry Drivers Overloading in Tehran Urban Roads

Journal of Advanced Transportation, 2020, Vol 2020, pp. 1-10. DOI: 10.1155/2020/6895407

Author(s): Ehsan Ayazi, Abdolreza Sheikholeslami

Keyword(s): Data Mining, Regression Model, Truck Drivers, Commercial Vehicles, Binary Regression, Pickup Truck, Data Mining Approach, Factors Influencing, Other Information, Construction Loads

The aim of this study is to identify the important factors influencing overloading of commercial vehicles on Tehran’s urban roads. The weight information of commercial freight vehicles was collected using a pair of portable scales, and the other information needed, including driver information, vehicle features, load, and travel details, was collected by questionnaire. The results showed that the highest probability of overloading is for construction loads. Further, the analysis by lorry type shows that overloading is least likely among pickup truck drivers, with a likelihood about one-third of that among Nissan and small truck drivers. Also, the results of modeling the type of route showed that the highest likelihood of overloading is for internal loads (origin and destination inside Tehran), and the lowest probability of overloading is for suburban trips (origin and destination outside of Tehran). Considering the type of load packing as a variable, the results of the binary regression model analysis showed that the highest probability of overloading occurs for packed (boxed) loads. Finally, it was concluded that drivers are 18 times more likely to commit overloading on weekends than on weekdays.
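
In a binary (logistic) regression model, a statement such as "18 times more likely" is commonly reported as an odds ratio, obtained by exponentiating a fitted coefficient. The sketch below shows that relationship only; the abstract does not state whether its figure is an odds ratio, so the coefficient here is back-calculated from the number 18 purely for illustration, and the function names are hypothetical.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def coefficient_for(ratio):
    """Coefficient that would produce a given odds ratio."""
    return math.log(ratio)


# Assumed for illustration: treat the reported factor of 18 as an odds ratio.
beta_weekend = coefficient_for(18.0)
implied_ratio = odds_ratio(beta_weekend)
```

An odds ratio compares odds, not probabilities, so a factor of 18 in the odds implies a smaller factor in the probability itself whenever the weekday probability is not close to zero.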


PLS Regression on Coal Infrared Spectrum with Wavelet Pre-Processing

Applied Mechanics and Materials, 2011, Vol 80-81, pp. 279-283. DOI: 10.4028/www.scientific.net/amm.80-81.279

Author(s): Yan Ming Wang, Gou Qing Shi, Xiao Xing Zhong, De Ming Wang

Keyword(s): Infrared Spectrum, Regression Model, Multivariate Calibration, Computational Time, Discrete Wavelet, Data Set, Prediction Ability, Data Regression, Preprocessing Technique, Compressed Data

A study of multivariate calibration for the infrared spectrum of coal is presented. The discrete wavelet transform was used as a pre-processing tool to decompose the infrared spectrum and compress the data set. The compressed data regression model was applied to the simultaneous multi-component determination of coal contents. Compression performance with several wavelet functions at different resolution scales was studied, and the prediction ability of the compressed regression model was investigated. Numerical experiment results show that the wavelet transform is an effective compression preprocessing technique in multivariate calibration and enhances characteristic extraction from the coal infrared spectrum. Using the compressed data regression model, the reconstructed spectrum is almost identical to the original spectrum, and the data set is reduced to about 5% of its original size while the computational time needed decreases significantly.
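
The idea of wavelet compression as a pre-processing step can be sketched with a single level of the Haar transform. This is an assumption-laden illustration: the abstract does not name the wavelet functions or resolution scales used, and the function names, threshold, and toy signal below are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from approximation and detail coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def compress(signal, threshold):
    """Zero out detail coefficients whose magnitude is below the threshold."""
    approx, detail = haar_forward(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return approx, kept


# Toy "spectrum" (assumed values): smooth regions yield small details.
spectrum = [1.0, 1.1, 4.0, 4.1, 2.0, 2.0, 9.0, 3.0]
approx, kept = compress(spectrum, threshold=0.5)
reconstructed = haar_inverse(approx, kept)
```

Only the non-zero coefficients need to be stored or passed to the regression model; applying the transform repeatedly to the approximation coefficients gives coarser resolution scales and higher compression.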
