Improvement of Prediction Ability of Multicomponent Regression Model by a Method Based on Data Mining in Chemometrics

Author(s):  
Ling Gao ◽  
Shouxin Ren
2016 ◽  
Vol 3 (1) ◽  
pp. 22-44 ◽  
Author(s):  
Alan Olinsky ◽  
Phyllis Schumacher ◽  
John Quinn

One way to enhance the likelihood that more university students will graduate within the specific major that they begin with is to attract the type of students who have typically (historically) done well in that field of study. This paper expands upon a study that utilizes data mining techniques to analyze the characteristics of students who enroll as actuarial students and then either drop out of the major or graduate as actuarial students. Several predictive models including logistic regression, neural networks and decision trees are obtained using input variables describing academic attributes of the students. The models are then compared and the best fitting model is determined. The regression model turns out to be the best predictor. Since this is a very well understood method, it can easily be explained. The decision tree, although its underpinnings are somewhat difficult to explain, gives a clear and well understood output. In addition, the non-predictive method of cluster analysis is applied in order to group these students into distinct classifications based on the values of the input variables. Finally, a new approach to modeling in SAS®, called Rapid Predictive Modeler (RPM), is described and utilized. The results of the RPM also select the regression model as the best predictor.


2021 ◽  
Vol 9 ◽  
Author(s):  
Deliang Sun ◽  
Haijia Wen ◽  
Jiahui Xu ◽  
Yalan Zhang ◽  
Danzhou Wang ◽  
...  

This study aims to develop a logistic regression model of landslide susceptibility, using GeoDetector for dominant-factor screening and 10-fold cross validation for training sample optimization. First, Fengjie county, a typical mountainous area, was selected as the study area since it experienced 1,522 landslides from 2001 to 2016. Second, 22 factors were selected as the initial conditioning factors, and a geospatial database was established with a grid resolution of 30 m. Factor detection of the geographic detector and the stepwise regression method included in logistic regression were used to screen out the dominant factors from the database. Then, based on the sample dataset with a 1:10 ratio of landslides to nonlandslides, 10-fold cross validation was used to select the optimized sample to train the logistic regression model of landslide susceptibility in the study area. Finally, the accuracy and efficiency of the two models, before and after screening out the dominant factors, were evaluated and compared. The results showed that the total accuracy of both models was more than 0.9, and the area under the receiver operating characteristic curve was more than 0.8, indicating that the models before and after factor screening both had high reliability and good prediction ability. In addition, the screened factors had an active leading role in the geospatial distribution of the historical landslides, indicating that the screened dominant factors have individual rationality. By screening the dominant factors and optimizing the training samples to improve the geospatial agreement between landslide susceptibility and actual landslide-prone areas, a simple, efficient, and reliable logistic-regression–based landslide susceptibility model can be constructed.
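As a rough illustration of the workflow this abstract describes, the sketch below runs a plain 10-fold cross validation of a logistic regression classifier. It is not the authors' procedure: the data are synthetic (one made-up feature, roughly balanced classes rather than the 1:10 ratio), the model is fitted by simple batch gradient descent, and every function name is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, xi):
    # Classify as 1 when the predicted probability is at least 0.5.
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if sigmoid(z) >= 0.5 else 0


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=10):
    """Return the held-out accuracy of each of the k folds."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        w = fit_logistic([X[j] for j in train], [y[j] for j in train])
        correct = sum(predict(w, X[j]) == y[j] for j in test)
        scores.append(correct / len(test))
    return scores


# Synthetic, illustrative data: one feature, two well-separated classes.
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    X.append([rng.gauss(2.0 if label else -2.0, 1.0)])
    y.append(label)

scores = cross_validate(X, y, k=10)
mean_accuracy = sum(scores) / len(scores)
print(f"mean 10-fold accuracy: {mean_accuracy:.3f}")
```

Each fold is held out once while the model is trained on the other nine, so every sample contributes to exactly one test score.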


2018 ◽  
Vol 52 (s2) ◽  
pp. 28-33 ◽  
Author(s):  
Larry Fennigkoh ◽  
D. Courtney Nanney

Although the healthcare technology management (HTM) community has decades of accumulated medical device–related maintenance data, little knowledge has been gleaned from these data. Finding and extracting such knowledge requires the use of the well-established, but admittedly somewhat foreign to HTM, application of inferential statistics. This article sought to provide a basic background on inferential statistics and describe a case study of their application, limitations, and proper interpretation. The research question associated with this case study involved examining the effects of ventilator preventive maintenance (PM) labor hours, age, and manufacturer on needed unscheduled corrective maintenance (CM) labor hours. The study sample included more than 21,000 combined PM inspections and CM work orders on 2,045 ventilators from 26 manufacturers during a five-year period (2012–16). A multiple regression analysis revealed that device age, manufacturer, and accumulated PM inspection labor hours all influenced the amount of CM labor significantly (P < 0.001). In essence, CM labor hours increased with increasing PM labor. However, and despite the statistical significance of these predictors, the regression analysis also indicated that ventilator age, manufacturer, and PM labor hours only explained approximately 16% of all variability in CM labor, with the remainder (84%) caused by other factors that were not included in the study. As such, the regression model obtained here is not suitable for predicting ventilator CM labor hours.
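For readers less familiar with the statistic, the 16%/84% split can be read as a coefficient of determination. The expression below uses standard textbook notation that is not taken from the article (y_i for observed CM labor hours, \hat{y}_i for fitted values, \bar{y} for their mean) and assumes the reported figure is the ordinary, unadjusted R^2:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \approx 0.16,
\qquad
1 - R^2 \approx 0.84
```

Under that reading, a predictor can be highly significant (P < 0.001) while the model as a whole still leaves most of the variance unexplained, which is why the authors judge it unsuitable for prediction.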


2019 ◽  
Vol 54 (3) ◽  
pp. 259-273
Author(s):  
Zirui Jia ◽  
Zengli Wang

Purpose. Frequent itemset mining (FIM) is a basic topic in data mining. Most FIM methods build an itemset database containing all possible itemsets and use predefined thresholds to determine whether an itemset is frequent. However, this approach has some deficiencies. It is better suited to discrete data than to ordinal/continuous data, which may result in computational redundancy, and some of the results are difficult to interpret. The purpose of this paper is to address this gap by proposing a new data mining method. Design/methodology/approach. A regression pattern (RP) model is introduced, in which the regression model and the FIM method are combined to solve the existing problems. Using survey data from a computer technology and software professional qualification examination, the multiple linear regression model is selected to mine associations between items. Findings. Some interesting associations were mined by the proposed algorithm, and the results show that the proposed method can be applied to ordinal/continuous data mining. The experiment with the RP model shows that, compared to FIM, the computational redundancy decreased and the results contain more information. Research limitations/implications. The proposed algorithm is designed for ordinal/continuous data and is expected to provide inspiration for data stream mining and unstructured data mining. Practical implications. Compared to FIM, which mines associations between discrete items, the RP model can mine associations between ordinal/continuous data sets. Importantly, the RP model performs well in saving computational resources and mining meaningful associations. Originality/value. The proposed algorithms provide a novel view for defining and mining associations.
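To make the baseline concrete, here is a minimal sketch of classical threshold-based FIM as the abstract characterizes it: enumerate candidate itemsets and keep those whose support meets a predefined threshold. It illustrates the conventional approach only, not the proposed RP model; the transactions and the function name are invented for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            cset = set(candidate)
            # Support = fraction of transactions containing the whole candidate.
            support = sum(1 for t in transactions if cset <= t) / n
            if support >= min_support:
                result[candidate] = support
                found_any = True
        if not found_any:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result


# Hypothetical discrete transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in frequent.items():
    print(itemset, support)
```

The exhaustive enumeration over candidate itemsets is the kind of redundancy the abstract refers to, and the items here are discrete labels rather than ordinal or continuous values.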


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Emre Altinkurt ◽  
Ozkan Avci ◽  
Orkun Muftuoglu ◽  
Adem Ugurlu ◽  
Zafer Cebeci ◽  
...  

Purpose. To diagnose keratoconus by establishing an effective logistic regression model from the data obtained with a Scheimpflug-Placido cornea topographer. Methods. Topographical parameters of 125 eyes of 70 patients diagnosed with keratoconus by clinical or topographical findings were compared with 120 eyes of 63 patients who were defined as keratorefractive surgery candidates. Receiver operating characteristic (ROC) curve analysis was performed to determine the diagnostic ability of the topographic parameters. The data set of parameters with an AUROC (area under the ROC curve) value greater than 0.9 was analyzed with logistic regression analysis (LRA) to determine the most predictive model that could diagnose keratoconus. A logit formula of the model was built, and the logit values of every eye in the study were calculated according to this formula. Then, an ROC analysis of the logit values was done. Results. Baiocchi Calossi Versaci front index (BCVf) had the highest AUROC value (0.976) in the study. The LRA model, which had the highest prediction ability, had 97.5% accuracy, 96.8% sensitivity, and 99.2% specificity. The most significant parameters were found to be BCVf (p = 0.001), BCVb (Baiocchi Calossi Versaci back) (p = 0.002), posterior rf (apical radius of the flattest meridian of the aspherotoric surface in 4.5 mm diameter of the cornea) (p = 0.005), central corneal thickness (p = 0.072), and minimum corneal thickness (p = 0.494). Conclusions. The LRA model can distinguish keratoconus corneas from normal ones with high accuracy without the need for complex computer algorithms.
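The evaluation step described here (an ROC analysis of logit values, plus accuracy, sensitivity, and specificity) can be sketched as follows. The logit values are invented for illustration and are unrelated to the study's data; the zero cut-off and both function names are assumptions of this sketch.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive case outranks a negative one.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def confusion_metrics(pos_scores, neg_scores, threshold=0.0):
    """Accuracy, sensitivity, and specificity at a given logit cut-off."""
    tp = sum(s >= threshold for s in pos_scores)  # diseased eyes flagged
    tn = sum(s < threshold for s in neg_scores)   # healthy eyes cleared
    sensitivity = tp / len(pos_scores)
    specificity = tn / len(neg_scores)
    accuracy = (tp + tn) / (len(pos_scores) + len(neg_scores))
    return accuracy, sensitivity, specificity


# Hypothetical logit values, not taken from the paper.
keratoconus_logits = [2.1, 0.8, 3.4, -0.3, 1.7]
control_logits = [-2.5, -1.1, 0.4, -3.0, -0.9]

auc = auroc(keratoconus_logits, control_logits)
accuracy, sensitivity, specificity = confusion_metrics(
    keratoconus_logits, control_logits, threshold=0.0
)
print(auc, accuracy, sensitivity, specificity)
```

A logit of zero corresponds to a predicted probability of 0.5, which is why it is used as the default cut-off here; the study may have chosen its cut-off differently.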


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Ehsan Ayazi ◽  
Abdolreza Sheikholeslami

The aim of this study is to identify the important factors influencing overloading of commercial vehicles on Tehran’s urban roads. The weights of commercial freight vehicles were measured using a pair of portable scales, and the other information needed, including driver information, vehicle features, load, and travel details, was collected by questionnaire. The results showed that the highest probability of overloading is for construction loads. Further, the analysis by lorry type shows that overloading is least likely among pickup truck drivers: the likelihood within this group was one-third of that among Nissan and small truck drivers. The results of modeling the type of route showed that the highest likelihood of overloading is for internal loads (origin and destination inside Tehran), and the lowest is for suburban trips (origin and destination outside of Tehran). With the type of load packing considered as a variable, the binary regression model showed that the highest probability of overloading occurs for packed (boxed) loads. Finally, it was concluded that drivers are 18 times more likely to commit overloading on weekends than on weekdays.


2011 ◽  
Vol 80-81 ◽  
pp. 279-283
Author(s):  
Yan Ming Wang ◽  
Gou Qing Shi ◽  
Xiao Xing Zhong ◽  
De Ming Wang

A study on multivariate calibration for the infrared spectrum of coal is presented. The discrete wavelet transform was used as a pre-processing tool to decompose the infrared spectrum and compress the data set. The compressed data regression model was applied to the simultaneous multi-component determination of coal contents. Compression performance with several wavelet functions at different resolution scales was studied, and the prediction ability of the compressed regression model was investigated. Numerical experiments show that the wavelet transform is an effective compression preprocessing technique in multivariate calibration and enhances characteristic extraction from the coal infrared spectrum. Using the compressed data regression model, the reconstructed results are almost identical to the original spectrum, and the data set has been reduced to about 5% of its original size, while the computational time needed decreases significantly.
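The compression idea can be illustrated with a small sketch: transform a signal with a discrete wavelet transform, keep only the largest coefficients, and reconstruct. The abstract does not say which wavelet functions were used, so the Haar wavelet below is an assumption made for simplicity, and the "spectrum" is a synthetic curve rather than coal data.

```python
import math


def haar_forward(signal):
    """Full multi-level orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail  # approximations first, details after
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def compress(coeffs, keep_fraction):
    """Zero all but the largest-magnitude coefficients (ties may keep a few extra)."""
    k = max(1, int(len(coeffs) * keep_fraction))
    threshold = sorted((abs(c) for c in coeffs), reverse=True)[k - 1]
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


# Synthetic "spectrum": two smooth absorption-like peaks on 64 points.
spectrum = [
    math.exp(-((i - 20) / 5.0) ** 2) + 0.6 * math.exp(-((i - 45) / 8.0) ** 2)
    for i in range(64)
]

coeffs = haar_forward(spectrum)
kept = compress(coeffs, keep_fraction=0.05)  # retain roughly 5% of coefficients
reconstructed = haar_inverse(kept)
max_error = max(abs(a - b) for a, b in zip(spectrum, reconstructed))
print(f"nonzero coefficients: {sum(c != 0.0 for c in kept)}, max error: {max_error:.4f}")
```

Because the transform is orthonormal, discarding small coefficients removes only the energy they carry, so retaining more coefficients can never increase the reconstruction error.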

