How well does your model fit the data?

2001 ◽  
Vol 3 (1) ◽  
pp. 49-55 ◽  
Author(s):  
M. J. Hall

Despite almost five decades of activity on the computer modelling of input–output relationships, little general agreement has emerged on appropriate indices for the goodness-of-fit of a model to a set of observations of the pertinent variables. The coefficient of efficiency, which is closely allied in form to the coefficient of determination, has been widely adopted in many data mining and modelling exercises. Values of this coefficient close to unity are taken as evidence of good matching between observed and computed flows. However, studies using synthetic data have demonstrated that negative values of the coefficient of efficiency can occur both in the presence of bias in computed outputs, and when the computed volume of flow greatly exceeds the observed volume of flow. In contrast, the coefficient of efficiency lacks discrimination for cases close to perfect reproduction. In the latter case, a coefficient based upon the first differences of the data proves to be more helpful.
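The abstract does not state a formula. The sketch below assumes the usual Nash-Sutcliffe form of the coefficient of efficiency, one minus the ratio of the squared-error sum to the variance of the observations about their mean. The flow values are invented for illustration. It shows how a constant bias or a large excess in computed volume can drive the coefficient below zero.

```python
def coefficient_of_efficiency(observed, computed):
    """1 - (sum of squared errors) / (sum of squared deviations of observations from their mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - c) ** 2 for o, c in zip(observed, computed))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical observed flows (illustrative values only).
observed = [2.0, 4.0, 6.0, 4.0, 2.0]

# Perfect reproduction gives exactly 1.
e_perfect = coefficient_of_efficiency(observed, observed)

# A constant bias of +3 in the computed flows.
biased = [o + 3.0 for o in observed]
e_biased = coefficient_of_efficiency(observed, biased)

# Computed volume three times the observed volume.
inflated = [3.0 * o for o in observed]
e_inflated = coefficient_of_efficiency(observed, inflated)

print(e_perfect, e_biased, e_inflated)
```

Both the biased and the inflated series follow the shape of the observations exactly, yet the coefficient is negative in each case because the squared errors exceed the spread of the observations about their mean.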

2015 ◽  
Author(s):  
Joshua M Diamond

The conserved nature of sleep in Drosophila has allowed the fruit fly to emerge in the last decade as a powerful model organism in which to study sleep. Recent sleep studies in Drosophila have focused on the discovery and characterization of hyposomnolent mutants. One common feature of these animals is a change in sleep architecture: sleep bout count tends to be greater, and sleep bout length lower, in hyposomnolent mutants. I propose a mathematical model, produced by least-squares nonlinear regression to fit the form Y = aX^b, which can explain sleep behavior in the healthy animal as well as previously-reported changes in total sleep and sleep architecture in hyposomnolent mutants. This model, fit to sleep data, yields coefficient of determination R squared, which describes goodness of fit. R squared is lower in hyposomnolent mutant insomniac as compared to control, indicating a poorer fit of the model to the data in insomniac. R squared also tends to be lower in daytime sleep as compared to nighttime sleep. My findings raise the possibility that low R squared is a feature of all hyposomnolent mutants, not just insomniac. If this were the case, R squared could emerge as a novel means by which sleep researchers might assess sleep dysfunction.
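As a rough illustration of fitting the form Y = aX^b by least squares and reading off R squared, the sketch below uses a small Gauss-Newton loop in plain Python. The abstract does not say which optimizer was used or what X and Y represent, so the solver, the starting guess (a log-log line fit), the function names, and the data are all assumptions made for this example.

```python
import math


def fit_power_law(xs, ys, iterations=50):
    """Least-squares fit of y = a * x**b by Gauss-Newton (requires x > 0 and y > 0)."""
    n = len(xs)
    # Starting guess: straight-line regression of log(y) on log(x).
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    for _ in range(iterations):
        # Residuals and the two Jacobian columns (d/da and d/db of the model).
        r = [y - a * x ** b for x, y in zip(xs, ys)]
        ja = [x ** b for x in xs]
        jb = [a * x ** b * math.log(x) for x in xs]
        # Solve the 2x2 normal equations (J^T J) delta = J^T r.
        saa = sum(p * p for p in ja)
        sab = sum(p * q for p, q in zip(ja, jb))
        sbb = sum(q * q for q in jb)
        ga = sum(p * e for p, e in zip(ja, r))
        gb = sum(q * e for q, e in zip(jb, r))
        det = saa * sbb - sab * sab
        if det == 0:
            break
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted curve on the original scale."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - a * x ** b) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented data: an exact power law, and the same curve with multiplicative noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_clean = [2.0 * x ** -0.5 for x in xs]
ys_noisy = [y * f for y, f in zip(ys_clean, [1.1, 0.9, 1.1, 0.9, 1.05])]

a_clean, b_clean = fit_power_law(xs, ys_clean)
r2_clean = r_squared(xs, ys_clean, a_clean, b_clean)
a_noisy, b_noisy = fit_power_law(xs, ys_noisy)
r2_noisy = r_squared(xs, ys_noisy, a_noisy, b_noisy)
print(r2_clean, r2_noisy)
```

The noisier series yields the lower R squared, which is the kind of comparison the abstract draws between mutant and control animals.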


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1533 ◽  
Author(s):  
Joshua M. Diamond

The conserved nature of sleep in Drosophila has allowed the fruit fly to emerge in the last decade as a powerful model organism in which to study sleep. Recent sleep studies in Drosophila have focused on the discovery and characterization of hyposomnolent mutants. One common feature of these animals is a change in sleep architecture: sleep bout count tends to be greater, and sleep bout length lower, in hyposomnolent mutants. I propose a mathematical model, produced by least-squares nonlinear regression to fit the form Y = aX^b, which can explain sleep behavior in the healthy animal as well as previously-reported changes in total sleep and sleep architecture in hyposomnolent mutants. This model, fit to sleep data, yields coefficient of determination R squared, which describes goodness of fit. R squared is lower, as compared to control, in hyposomnolent mutants insomniac and fumin. My findings raise the possibility that low R squared is a feature of all hyposomnolent mutants, not just insomniac and fumin. If this were the case, R squared could emerge as a novel means by which sleep researchers might assess sleep dysfunction.


2020 ◽  
Author(s):  
Charles Onyutha

Abstract. Modelers tend to focus more on advancing methods of statistical and mathematical modeling than on developing novel techniques for comparing modeled results with observations or establishing metrics for model performance assessment. Perhaps the most extensively applied "goodness-of-fit" measure, especially for assessing the performance of regression models, is the coefficient of determination R². Normally, a high R² tends to be associated with an efficient model. Nevertheless, R² has been cited as having no importance in the classical model of regression. Even in descriptive statistics, its justification is known to be questionable. R² is inadequate for assessing model performance because it gives no information on the model residuals. Furthermore, R² can be low for an effective model. Conversely, a very poor model fit can yield a high R². Regressing X on Y yields the same R² as regressing Y on X, thereby invalidating its use as a coefficient of determination. Taking into account these drawbacks, this paper introduces the coefficient of model accuracy (CMA), the derivation of which comprises an analogy to R². However, instead of simply squaring an ordinary Pearson's product-moment correlation coefficient to obtain R², CMA comprises the product of nonparametric sample correlation and model bias. The acceptability of the introduced method is demonstrated through comparison of results from simulations by hydrological models calibrated using CMA and other existing objective functions.
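Two of the drawbacks listed in this abstract can be checked numerically. The sketch below, which uses invented numbers and a hypothetical helper `ols_r_squared`, computes R² from an ordinary least-squares line. It shows that swapping the roles of X and Y leaves R² unchanged, and that simulated values carrying a large bias can still score an R² of 1. It does not implement CMA, whose exact definition is not given here.

```python
def ols_r_squared(xs, ys):
    """R^2 of an ordinary least-squares line predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Symmetry: regressing Y on X and X on Y give the same R^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_y_on_x = ols_r_squared(x, y)
r2_x_on_y = ols_r_squared(y, x)

# Bias blindness: simulated values that are badly off still give R^2 of 1,
# because they are an exact linear function of the observations.
observed = [3.0, 5.0, 4.0, 6.0, 8.0]
simulated = [2.0 * o + 10.0 for o in observed]
r2_biased = ols_r_squared(simulated, observed)
mean_bias = sum(s - o for s, o in zip(simulated, observed)) / len(observed)

print(r2_y_on_x, r2_x_on_y, r2_biased, mean_bias)
```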


Author(s):  
Raul E. Avelar ◽  
Karen Dixon ◽  
Boniphace Kutela ◽  
Sam Klump ◽  
Beth Wemple ◽  
...  

The calibration of safety performance functions (SPFs) is a mechanism included in the Highway Safety Manual (HSM) to adjust SPFs in the HSM for use in intended jurisdictions. Critically, the quality of the calibration procedure must be assessed before using the calibrated SPFs. Multiple resources to aid practitioners in calibrating SPFs have been developed in the years following the publication of the HSM 1st edition. Similarly, the literature suggests multiple ways to assess the goodness-of-fit (GOF) of a calibrated SPF to a data set from a given jurisdiction. This paper uses the calibration results of multiple intersection SPFs to a large Mississippi safety database to examine the relations between multiple GOF metrics. The goal is to develop a sensible single index that leverages the joint information from multiple GOF metrics to assess overall quality of calibration. A factor analysis applied to the calibration results revealed three underlying factors explaining 76% of the variability in the data. From these results, the authors developed an index and performed a sensitivity analysis. The key metrics were found to be, in descending order: the deviation of the cumulative residual (CURE) plot from the 95% confidence area, the mean absolute deviation, the modified R-squared, and the value of the calibration factor. This paper also presents comparisons between the index and alternative scoring strategies, as well as an effort to verify the results using synthetic data. The developed index is recommended to comprehensively assess the quality of the calibrated intersection SPFs.


Agronomy ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 1207
Author(s):  
Gonçalo C. Rodrigues ◽  
Ricardo P. Braga

This study aims to evaluate NASA POWER reanalysis products for daily surface maximum (Tmax) and minimum (Tmin) temperatures, solar radiation (Rs), relative humidity (RH) and wind speed (Ws) when compared with observed data from 14 weather stations distributed across the Alentejo Region, Southern Portugal, which has a hot-summer Mediterranean climate. Results showed good agreement between NASA POWER reanalysis and observed data for all parameters except wind speed, with a coefficient of determination (R²) higher than 0.82, a normalized root mean square error (NRMSE) varying from 8 to 20%, and a normalized mean bias error (NMBE) ranging from –9 to 26% for those variables. Based on these results, and in order to improve the accuracy of the NASA POWER dataset, two bias corrections were applied to all weather variables: one for the Alentejo Region as a whole, and another for each location individually. Results improved significantly, especially when a local bias correction was performed, with Tmax and Tmin presenting an improvement of the mean NRMSE of 6.6 °C (from 8.0 °C) and 16.1 °C (from 20.5 °C), respectively, while the mean NMBE decreased from 10.65 to 0.2%. Rs results also show a very high goodness of fit, with a mean NRMSE of 11.2% and a mean NMBE equal to 0.1%. Additionally, bias-corrected RH data performed acceptably, with an NRMSE lower than 12.1% and an NMBE below 2.1%. However, even when a bias correction is performed, Ws lacks the performance shown by the remaining weather variables, with an NRMSE never lower than 19.6%. Results show that NASA POWER can be useful for generating weather data sets where ground weather station data are missing or unavailable.
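The abstract reports NRMSE and NMBE without giving their formulas. The sketch below assumes one common convention, in which both are normalized by the mean of the observations and expressed in percent. The additive correction shown (subtracting the mean bias) is only a stand-in for whatever correction the authors applied, and the temperature values are invented.

```python
import math


def nrmse_percent(observed, simulated):
    """Root mean square error divided by the mean observation, in percent (assumed convention)."""
    n = len(observed)
    rmse = math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n)
    return 100.0 * rmse / (sum(observed) / n)


def nmbe_percent(observed, simulated):
    """Mean of (simulated - observed) divided by the mean observation, in percent (assumed convention)."""
    n = len(observed)
    mean_bias = sum(s - o for o, s in zip(observed, simulated)) / n
    return 100.0 * mean_bias / (sum(observed) / n)


# Invented daily maximum temperatures: station observations and reanalysis values.
observed = [30.0, 32.0, 28.0, 35.0, 31.0]
reanalysis = [32.5, 33.0, 30.5, 36.5, 33.5]

nrmse_before = nrmse_percent(observed, reanalysis)
nmbe_before = nmbe_percent(observed, reanalysis)

# Hypothetical additive bias correction: subtract the mean bias from every value.
mean_bias = sum(s - o for o, s in zip(observed, reanalysis)) / len(observed)
corrected = [s - mean_bias for s in reanalysis]

nrmse_after = nrmse_percent(observed, corrected)
nmbe_after = nmbe_percent(observed, corrected)

print(nrmse_before, nmbe_before, nrmse_after, nmbe_after)
```

Removing the mean bias brings NMBE to zero by construction and lowers NRMSE, but the scatter that remains is untouched, which is consistent with a variable such as wind speed keeping a high NRMSE after correction.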


BioResources ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 4891-4904
Author(s):  
Selahattin Bardak ◽  
Timucin Bardak ◽  
Hüseyin Peker ◽  
Eser Sözen ◽  
Yildiz Çabuk

Wood materials have been used in many products such as furniture, stairs, windows, and doors for centuries. There are differences in the methods used to adapt wood to ambient conditions. Impregnation is a widely used method of wood preservation. In terms of efficiency, it is critical to optimize the parameters for impregnation. Through accurate prediction, data mining techniques can reduce much of the cost and many of the operational challenges in the wood industry. In this study, three data-mining algorithms were applied to predict bending strength in impregnated wood materials (Pinus sylvestris L. and Millettia laurentii). Models were created from real experimental data to examine the relationship between bending strength, diffusion time, vacuum duration, and wood type, based on decision trees (DT), random forest (RF), and Gaussian process (GP) algorithms. The highest bending strength was achieved with wenge (Millettia laurentii) wood under a 10 bar vacuum with a diffusion time of 25 min. The results showed that all algorithms are suitable for predicting bending strength. The goodness of fit for the testing phase was determined as 0.994, 0.986, and 0.989 in the DT, RF, and GP algorithms, respectively. Moreover, the importance of attributes was determined in the algorithms.


1976 ◽  
Vol 159 (1) ◽  
pp. 105-120 ◽  
Author(s):  
J D Allen ◽  
J A Thoma

We have developed a depolymerase computer model that uses a minimization routine. The model is designed so that, given experimental bond-cleavage frequencies for oligomeric substrates and experimental Michaelis parameters as a function of substrate chain length, the optimum subsite map is generated. The minimized sum of the weighted-squared residuals of the experimental and calculated data is used as a criterion of the goodness-of-fit for the optimized subsite map. The application of the minimization procedure to subsite mapping is explored through the use of simulated data. A procedure is developed whereby the minimization model can be used to determine the number of subsites in the enzymic binding region and to locate the position of the catalytic amino acids among these subsites. The degree of propagation of experimental variance into the subsite-binding energies is estimated. The question of whether hydrolytic rate coefficients are constant or a function of the number of filled subsites is examined.


10.2196/11125 ◽  
2019 ◽  
Vol 21 (11) ◽  
pp. e11125
Author(s):  
Elizabeth Sillence ◽  
John Matthew Blythe ◽  
Pam Briggs ◽  
Mark Moss

Background: The internet continues to offer new forms of support for health decision making. Government, charity, and commercial websites increasingly offer a platform for shared personal health experiences, and these are just some of the opportunities that have arisen in a largely unregulated arena. Understanding how people trust and act on this information has always been an important issue and remains so, particularly as the design practices of health websites continue to evolve and raise further concerns regarding their trustworthiness. Objective: The aim of this study was to identify the key factors influencing US and UK citizens’ trust and intention to act on advice found on health websites and to understand the role of patient experiences. Methods: A total of 1123 users took part in an online survey (625 from the United States and 498 from the United Kingdom). They were asked to recall their previous visit to a health website. The online survey consisted of an updated general Web trust questionnaire to account for personal experiences plus questions assessing key factors associated with trust in health websites (information corroboration and coping perception) and intention to act. We performed principal component analysis (PCA), then explored the relationship between the factor structure and outcomes by testing the fit to the sampled data using structural equation modeling (SEM). We also explored the model fit across US and UK populations. Results: PCA of the general Web trust questionnaire revealed 4 trust factors: (1) personal experiences, (2) credibility and impartiality, (3) privacy, and (4) familiarity. In the final SEM model, trust was found to have a significant direct effect on intention to act (beta=.59; P<.001), and of the trust factors, only credibility and impartiality had a significant direct effect on trust (beta=.79; P<.001). The impact of personal experiences on trust was mediated through information corroboration (beta=.06; P=.04). Variables specific to electronic health (eHealth; information corroboration and coping) were found to substantially improve the model fit, and differences in information corroboration were found between US and UK samples. The final model accounting for all factors achieved a good fit (goodness-of-fit index [0.95], adjusted goodness-of-fit index [0.93], root mean square error of approximation [0.50], and comparative fit index [0.98]) and explained 65% of the variance in trust and 41% of the variance in intention to act. Conclusions: Credibility and impartiality continue to be key predictors of trust in eHealth websites. Websites with patient experiences can positively influence trust but only if users first corroborate the information through other sources. The need for corroboration was weaker in the United Kingdom, where website familiarity reduced the need to check information elsewhere. These findings are discussed in relation to existing trust models, patient experiences, and health literacy.


Data mining is the method of extracting useful information from various repositories such as relational databases, transaction databases, spatial databases, temporal and time-series databases, data warehouses, and the World Wide Web. The functionalities of data mining include characterization and discrimination, classification and prediction, association rule mining, cluster analysis, and evolutionary analysis. Association rule mining is one of the most important techniques of data mining and aims at extracting interesting relationships within the data. In this paper we study various association rule mining algorithms, compare them using synthetic data sets, and provide the results obtained from the experimental analysis.

