Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab

Author(s):  
Tim van der Zee
Jordan Anaya
Nicholas J L Brown

We present the initial results of a reanalysis of four articles from the Cornell Food and Brand Lab based on data collected from diners at an Italian restaurant buffet. At first glance, we noticed a number of apparent inconsistencies in the summary statistics; a thorough reading of the articles and a careful reanalysis of the results revealed additional problems. The sample sizes for the number of diners in each condition are incongruous both within and between the four articles. In some cases, the degrees of freedom of between-participant test statistics are larger than the sample size, which is impossible. Many of the computed F and t statistics are inconsistent with the reported means and standard deviations. In some cases, the number of possible inconsistencies for a single statistic was such that we were unable to determine which of the components of that statistic were incorrect. We contacted the authors of the four articles, but they have thus far not agreed to share their data. The attached Appendix reports approximately 150 inconsistencies in these four articles, which we were able to identify from the reported statistics alone. We hope that our analysis will encourage readers, using and extending the simple methods that we describe, to undertake their own efforts to verify published results, and that such initiatives will improve the accuracy and reproducibility of the scientific literature.
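Checks of this kind can be run from the reported statistics alone. Below is a minimal sketch in Python (our own illustration; the function name and all numbers are hypothetical, not taken from the articles) of how a reported independent-samples t statistic can be recomputed from published means, standard deviations, and group sizes:

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Recompute an independent-samples t statistic (pooled variance)
    from reported means, standard deviations, and group sizes."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df  # pooled variance
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2)), df

# Hypothetical reported result: t(58) = 2.40 for the values below.
t, df = pooled_t(m1=5.1, sd1=1.9, n1=30, m2=4.2, sd2=2.1, n2=30)
print(f"recomputed t({df}) = {t:.2f}")  # 1.74 here, so the report would be flagged
# A between-participant df exceeding n1 + n2 - 2 is impossible outright,
# one of the inconsistency types catalogued in the Appendix.
```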



2012
Vol 2012
pp. 1-8
Author(s):  
Louis M. Houston

We derive a general equation for the probability that a measurement falls within a range of n standard deviations from an estimate of the mean. This provides a format compatible with a confidence interval centered about the mean that is naturally independent of the sample size. The equation is derived by interpolating between theoretical results for extreme sample sizes; its intermediate values are confirmed with a computational test.
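The paper's closed-form equation is not reproduced here, but a computational test in its spirit is easy to sketch. The simulation below (our own construction, not the author's code) estimates the probability that a fresh measurement falls within n sample standard deviations of a sample mean, for several sample sizes, and compares against the large-sample normal-theory limit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def coverage(n_sd, sample_size, trials=100_000):
    """Monte Carlo estimate of the probability that a fresh measurement
    falls within n_sd sample standard deviations of the sample mean."""
    x = rng.standard_normal((trials, sample_size))
    m, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    new = rng.standard_normal(trials)  # one new measurement per trial
    return np.mean(np.abs(new - m) <= n_sd * s)

for N in (2, 5, 30, 1000):
    print(N, coverage(2, N))
print("large-sample limit:", norm.cdf(2) - norm.cdf(-2))  # ~0.954
```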


2017
Author(s):
Malte Elson
Andrew K Przybylski

Editorial of the Journal of Media Psychology special issue on "Technology & Human Behavior", and a meta-analysis of the empirical research published in JMP since 2008.

DATA AVAILABILITY: We were not able to identify a single publication reporting a link to research data in a public repository or the journal's supplementary materials.

STATISTICAL REPORTING ERRORS: We extracted a total of 1036 NHSTs reported in 98 articles. 129 tests were flagged as inconsistent (i.e., the reported test statistics and degrees of freedom do not match the reported p-values), of which 23 were grossly inconsistent (the reported p-value is <.05 while the recomputed p-value is >.05, or vice versa). 41 publications reported at least one inconsistent NHST, and 16 reported at least one grossly inconsistent NHST. Thus, a substantial proportion of publications in JMP appear to contain inaccurately reported statistical analyses, some of which might affect the conclusions drawn from them.

STATISTICAL POWER: As in other fields, surveys tend to have healthy sample sizes, apt to reliably detect medium to large relationships between variables. The median sample size for survey studies is 327, allowing researchers to detect small bivariate correlations of r = .1 at 44% power (r = .3 and r = .5 both at >99%). For (quasi-)experiments the outlook is different, with a median sample size of 107. Across all types of designs, the median condition size is 30.67. Thus, the average power of experiments published in JMP to detect small differences between conditions (d = .20) is 12% (d = .50 at 49%, d = .80 at 87%).
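Both the inconsistency screen and the power figures can be reproduced with a short script. The sketch below is ours, not the editorial's; the F value, degrees of freedom, and reported p are hypothetical, while the Fisher z power approximation recovers the 44% figure for r = .1 at n = 327:

```python
from math import atanh, sqrt
from scipy import stats

def check_F(F, df1, df2, reported_p, alpha=0.05):
    """statcheck-style screen: recompute the p-value implied by a reported
    F statistic and flag a gross inconsistency if significance flips."""
    p = stats.f.sf(F, df1, df2)
    return p, (p < alpha) != (reported_p < alpha)

print(check_F(F=4.21, df1=1, df2=105, reported_p=0.12))  # hypothetical report

def power_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect a correlation r (Fisher z)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = atanh(r) * sqrt(n - 3)
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

print(power_correlation(0.1, 327))  # ~0.44, matching the 44% figure above
```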


1999
Vol 45 (6)
pp. 882-894
Author(s):  
Kristian Linnet

Abstract Background: In method comparison studies, it is important to assure that a difference of medical importance will be detected if present. For a given difference, the necessary number of samples depends on the range of values and the analytical standard deviations of the methods involved. For typical examples, the present study evaluates the statistical power of least-squares and Deming regression analyses applied to method comparison data. Methods: Theoretical calculations and simulations were used to consider the statistical power for detection of slope deviations from unity and intercept deviations from zero. For situations with proportional analytical standard deviations, weighted forms of regression analysis were evaluated. Results: In general, the sample sizes of 40–100 conventionally used in method comparison studies often must be reconsidered. A main factor is the range of values, which should be as wide as possible for the given analyte. For a range ratio (maximum value divided by minimum value) of 2, 544 samples are required to detect one standardized slope deviation; the number of required samples decreases to 64 at a range ratio of 10 (proportional analytical error). For electrolytes, which have very narrow ranges of values, very large sample sizes usually are necessary. In the case of proportional analytical error, application of a weighted approach is important to assure an efficient analysis; e.g., for a range ratio of 10, the weighted approach reduces the required number of samples by >50%. Conclusions: Estimation of the necessary sample size for a method comparison study assures a valid result: either no difference is found, or the existence of a relevant difference is confirmed.
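The dependence of power on the range ratio is easy to explore by simulation. The sketch below is our own illustration with assumed parameter values, using unweighted least-squares rather than the paper's Deming and weighted analyses; it estimates the power to detect a 5% proportional bias between two methods with proportional analytical error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def slope_power(n, range_ratio, slope=1.05, cv=0.03, sims=2000, alpha=0.05):
    """Simulated power of an unweighted least-squares comparison to detect
    a slope deviating from unity. True values are spread uniformly over
    [1, range_ratio]; both methods have proportional error with
    coefficient of variation cv."""
    hits = 0
    for _ in range(sims):
        true = rng.uniform(1.0, range_ratio, n)
        x = true * (1 + cv * rng.standard_normal(n))          # comparison method
        y = slope * true * (1 + cv * rng.standard_normal(n))  # test method
        res = stats.linregress(x, y)
        t = (res.slope - 1.0) / res.stderr
        hits += abs(t) > stats.t.ppf(1 - alpha / 2, n - 2)
    return hits / sims

print(slope_power(n=100, range_ratio=2))   # narrow range: low power
print(slope_power(n=100, range_ratio=10))  # wide range: much higher power
```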


1978
Vol 100 (4)
pp. 607-612
Author(s):
D. Kececioglu
G. Lamarre

Charts are presented relating the lower one-sided confidence limit on the reliability, R_L1, to the effective sample size, n_e, calculated from the sample sizes used to estimate the failure-governing stress and strength distributions, or f(s) and f(S) respectively, and a factor K which is a function of the estimated means and standard deviations of f(s) and f(S). These charts cover an n_e range of 5 to 1000; confidence levels of 0.80, 0.90, 0.95, and 0.99; and lower one-sided limits on the reliability of 0.85 to 0.9145. The equations used to develop these charts are derived, and two examples of their application are given.
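For normally distributed stress f(s) and strength f(S), the K factor referred to above is commonly computed as K = (μ_S − μ_s) / sqrt(σ_s² + σ_S²), with the point estimate of reliability R = Φ(K); the charts then supply the lower one-sided limit R_L1 for a given n_e and confidence level. A minimal sketch with illustrative (hypothetical) values:

```python
from math import sqrt
from scipy.stats import norm

def k_factor(mu_s, sd_s, mu_S, sd_S):
    """K factor for normal stress f(s) and strength f(S); the point
    estimate of reliability is then R = Phi(K)."""
    return (mu_S - mu_s) / sqrt(sd_s**2 + sd_S**2)

# Illustrative values in psi, not from the paper's examples.
K = k_factor(mu_s=40_000, sd_s=4_000, mu_S=60_000, sd_S=5_000)
print(f"K = {K:.3f}, point reliability = {norm.cdf(K):.4f}")
```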


2021
Vol 13 (3)
pp. 368
Author(s):  
Christopher A. Ramezan
Timothy A. Warner
Aaron E. Maxwell
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU, however, required longer processing times. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
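A scaled-down version of this experiment is straightforward to sketch. The code below is our own illustration on synthetic data (not the authors' imagery, feature set, or class scheme): it trains RF and SVM classifiers on progressively smaller training subsets and reports overall accuracy on a fixed held-out test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a GEOBIA feature table (object geometry, texture,
# spectral statistics).
X, y = make_classification(n_samples=12_000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2_000,
                                                  random_state=0)

rng = np.random.default_rng(0)
for n_train in (10_000, 1_000, 315, 40):
    idx = rng.choice(len(X_pool), size=n_train, replace=False)
    for name, clf in (("RF", RandomForestClassifier(random_state=0)),
                      ("SVM", SVC())):
        clf.fit(X_pool[idx], y_pool[idx])
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"n={n_train:>6}  {name:>3}: overall accuracy = {acc:.3f}")
```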


2013
Vol 113 (1)
pp. 221-224
Author(s):  
David R. Johnson
Lauren K. Bachan

In a recent article, Regan, Lakhanpal, and Anguiano (2012) highlighted the lack of evidence for different relationship outcomes between arranged and love-based marriages. Yet the sample size (n = 58) used in the study is insufficient for making such inferences. This reply discusses and demonstrates how small sample sizes reduce the utility of this research.
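The power problem is easy to quantify. As a sketch (the effect size d = 0.5 is our illustrative choice, not a value from the reply), the power of a two-group comparison with 29 couples per group can be computed as follows:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-group comparison with 29 per group (n = 58 total)
# to detect a medium standardized difference of d = 0.5.
power = TTestIndPower().power(effect_size=0.5, nobs1=29, alpha=0.05, ratio=1)
print(f"power = {power:.2f}")  # well below the conventional 0.80 target
```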



2015
Author(s):  
Dominic Holland
Yunpeng Wang
Wesley K Thompson
Andrew Schork
Chi-Hua Chen
...  

Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding of the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities that does not require raw genotype data, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype -- the proportion of SNPs (after uniform pruning, so that large LD blocks are not over-represented) likely to be in strong LD with causal/mechanistically associated SNPs -- and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to a recent GWAS of schizophrenia (N = 82,315) and additionally, for purposes of illustration, putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We estimate the degree to which effect sizes are over-estimated when based on linear regression association coefficients. We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
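A stripped-down version of the mixture idea can be sketched in a few lines. The code below is our own two-component illustration with the null fixed at N(0, 1); the paper's four-parameter model additionally accounts for LD-induced non-null effects. It simulates GWAS-like z-scores and recovers the proportion of non-null SNPs by EM:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulate GWAS-like z-scores: a fraction pi1 of SNPs carry real effects
# (variance inflated above 1 by sample and effect size); the rest are noise.
pi1_true, inflated_var, n_snps = 0.01, 10.0, 100_000
causal = rng.random(n_snps) < pi1_true
z = np.where(causal,
             rng.standard_normal(n_snps) * np.sqrt(inflated_var),
             rng.standard_normal(n_snps))

# EM for a two-component mixture with the null fixed at N(0, 1):
# estimates pi1 (the polygenicity) and the non-null variance.
pi1, var1 = 0.5, 2.0
for _ in range(200):
    f0 = norm.pdf(z)                        # null density
    f1 = norm.pdf(z, scale=np.sqrt(var1))   # non-null density
    resp = pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)  # posterior non-null prob.
    pi1 = resp.mean()
    var1 = np.sum(resp * z**2) / np.sum(resp)

print(f"estimated pi1 = {pi1:.4f} (true {pi1_true})")
```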

