Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab

Author(s):  
Tim van der Zee
Jordan Anaya
Nicholas J L Brown

We present the initial results of a reanalysis of four articles from the Cornell Food and Brand Lab based on data collected from diners at an Italian restaurant buffet. At first glance, we noticed a number of apparent inconsistencies in the summary statistics; a thorough reading of the articles and a careful reanalysis of the results revealed additional problems. The sample sizes for the number of diners in each condition are incongruous both within and between the four articles. In some cases, the degrees of freedom of between-participant test statistics are larger than the sample size, which is impossible. Many of the computed F and t statistics are inconsistent with the reported means and standard deviations. In some cases, the number of possible inconsistencies for a single statistic was such that we were unable to determine which of the components of that statistic were incorrect. We contacted the authors of the four articles, but they have thus far not agreed to share their data. The attached Appendix reports approximately 150 inconsistencies in these four articles, which we were able to identify from the reported statistics alone. We hope that our analysis will encourage readers, using and extending the simple methods that we describe, to undertake their own efforts to verify published results, and that such initiatives will improve the accuracy and reproducibility of the scientific literature.
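Checks of this kind can be run from the reported statistics alone. Below is a minimal sketch in Python (our own illustration; the function name and all numbers are hypothetical, not taken from the articles) of how a reported independent-samples t statistic can be recomputed from published means, standard deviations, and group sizes:

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Recompute an independent-samples t statistic (pooled variance)
    from reported means, standard deviations, and group sizes."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df  # pooled variance
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2)), df

# Hypothetical reported result: t(58) = 2.40 for the values below.
t, df = pooled_t(m1=5.1, sd1=1.9, n1=30, m2=4.2, sd2=2.1, n2=30)
print(f"recomputed t({df}) = {t:.2f}")  # 1.74 here, so the report would be flagged
# A between-participant df exceeding n1 + n2 - 2 is impossible outright,
# one of the inconsistency types catalogued in the Appendix.
```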



2012
Vol 2012
pp. 1-8
Author(s):  
Louis M. Houston

We derive a general equation for the probability that a measurement falls within a range of n standard deviations from an estimate of the mean. This provides a format compatible with a confidence interval centered about the mean that is naturally independent of the sample size. The equation is derived by interpolating between theoretical results for extreme sample sizes; its intermediate values are confirmed with a computational test.
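The paper's closed-form equation is not reproduced here, but a computational test in its spirit is easy to sketch. The simulation below (our own construction, not the author's code) estimates the probability that a fresh measurement falls within n sample standard deviations of a sample mean, for several sample sizes, and compares against the large-sample normal-theory limit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def coverage(n_sd, sample_size, trials=100_000):
    """Monte Carlo estimate of the probability that a fresh measurement
    falls within n_sd sample standard deviations of the sample mean."""
    x = rng.standard_normal((trials, sample_size))
    m, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    new = rng.standard_normal(trials)  # one new measurement per trial
    return np.mean(np.abs(new - m) <= n_sd * s)

for N in (2, 5, 30, 1000):
    print(N, coverage(2, N))
print("large-sample limit:", norm.cdf(2) - norm.cdf(-2))  # ~0.954
```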


2017
Author(s):
Malte Elson
Andrew K Przybylski

Editorial of the Journal of Media Psychology special issue on "Technology & Human Behavior", and a meta-analysis of the empirical research published in JMP since 2008.

DATA AVAILABILITY: We were not able to identify a single publication reporting a link to research data in a public repository or the journal's supplementary materials.

STATISTICAL REPORTING ERRORS: We extracted a total of 1036 NHSTs reported in 98 articles. 129 tests were flagged as inconsistent (i.e., the reported test statistics and degrees of freedom do not match the reported p-values), of which 23 were grossly inconsistent (the reported p-value is <.05 while the recomputed p-value is >.05, or vice versa). 41 publications reported at least one inconsistent NHST, and 16 reported at least one grossly inconsistent NHST. Thus, a substantial proportion of publications in JMP appear to contain inaccurately reported statistical analyses, some of which might affect the conclusions drawn from them.

STATISTICAL POWER: As in other fields, surveys tend to have healthy sample sizes, apt to reliably detect medium to large relationships between variables. The median sample size for survey studies is 327, allowing researchers to detect small bivariate correlations of r = .1 at 44% power (r = .3 and r = .5 both at >99%). For (quasi-)experiments the outlook is different, with a median sample size of 107. Across all types of designs, the median condition size is 30.67. Thus, the average power of experiments published in JMP to detect small differences between conditions (d = .20) is 12% (d = .50 at 49%, d = .80 at 87%).
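Both the inconsistency screen and the power figures can be reproduced with a short script. The sketch below is ours, not the editorial's; the F value, degrees of freedom, and reported p are hypothetical, while the Fisher z power approximation recovers the 44% figure for r = .1 at n = 327:

```python
from math import atanh, sqrt
from scipy import stats

def check_F(F, df1, df2, reported_p, alpha=0.05):
    """statcheck-style screen: recompute the p-value implied by a reported
    F statistic and flag a gross inconsistency if significance flips."""
    p = stats.f.sf(F, df1, df2)
    return p, (p < alpha) != (reported_p < alpha)

print(check_F(F=4.21, df1=1, df2=105, reported_p=0.12))  # hypothetical report

def power_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect a correlation r (Fisher z)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = atanh(r) * sqrt(n - 3)
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

print(power_correlation(0.1, 327))  # ~0.44, matching the 44% figure above
```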


1999
Vol 45 (6)
pp. 882-894
Author(s):  
Kristian Linnet

Abstract Background: In method comparison studies, it is important to assure that a difference of medical importance will be detected if present. For a given difference, the necessary number of samples depends on the range of values and the analytical standard deviations of the methods involved. For typical examples, the present study evaluates the statistical power of least-squares and Deming regression analyses applied to method comparison data. Methods: Theoretical calculations and simulations were used to consider the statistical power for detection of slope deviations from unity and intercept deviations from zero. For situations with proportional analytical standard deviations, weighted forms of regression analysis were evaluated. Results: In general, the sample sizes of 40–100 conventionally used in method comparison studies often must be reconsidered. A main factor is the range of values, which should be as wide as possible for the given analyte. For a range ratio (maximum value divided by minimum value) of 2, 544 samples are required to detect one standardized slope deviation; the number of required samples decreases to 64 at a range ratio of 10 (proportional analytical error). For electrolytes, which have very narrow ranges of values, very large sample sizes usually are necessary. In the case of proportional analytical error, application of a weighted approach is important to assure an efficient analysis; e.g., for a range ratio of 10, the weighted approach reduces the required number of samples by >50%. Conclusions: Estimation of the necessary sample size for a method comparison study assures a valid result: either no difference is found, or the existence of a relevant difference is confirmed.
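The dependence of power on the range ratio is easy to explore by simulation. The sketch below is our own illustration with assumed parameter values, using unweighted least-squares rather than the paper's Deming and weighted analyses; it estimates the power to detect a 5% proportional bias between two methods with proportional analytical error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def slope_power(n, range_ratio, slope=1.05, cv=0.03, sims=2000, alpha=0.05):
    """Simulated power of an unweighted least-squares comparison to detect
    a slope deviating from unity. True values are spread uniformly over
    [1, range_ratio]; both methods have proportional error with
    coefficient of variation cv."""
    hits = 0
    for _ in range(sims):
        true = rng.uniform(1.0, range_ratio, n)
        x = true * (1 + cv * rng.standard_normal(n))          # comparison method
        y = slope * true * (1 + cv * rng.standard_normal(n))  # test method
        res = stats.linregress(x, y)
        t = (res.slope - 1.0) / res.stderr
        hits += abs(t) > stats.t.ppf(1 - alpha / 2, n - 2)
    return hits / sims

print(slope_power(n=100, range_ratio=2))   # narrow range: low power
print(slope_power(n=100, range_ratio=10))  # wide range: much higher power
```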


1978
Vol 100 (4)
pp. 607-612
Author(s):
D. Kececioglu
G. Lamarre

Charts are presented relating the lower one-sided confidence limit on the reliability, R_L1, to the effective sample size, n_e, calculated from the sample sizes used to estimate the failure-governing stress and strength distributions, or f(s) and f(S) respectively, and a factor K which is a function of the estimated means and standard deviations of f(s) and f(S). These charts cover an n_e range of 5 to 1000; confidence levels of 0.80, 0.90, 0.95, and 0.99; and lower one-sided limits on the reliability of 0.85 to 0.9145. The equations used to develop these charts are derived, and two examples of their application are given.
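For normally distributed stress f(s) and strength f(S), the K factor referred to above is commonly computed as K = (μ_S − μ_s) / sqrt(σ_s² + σ_S²), with the point estimate of reliability R = Φ(K); the charts then supply the lower one-sided limit R_L1 for a given n_e and confidence level. A minimal sketch with illustrative (hypothetical) values:

```python
from math import sqrt
from scipy.stats import norm

def k_factor(mu_s, sd_s, mu_S, sd_S):
    """K factor for normal stress f(s) and strength f(S); the point
    estimate of reliability is then R = Phi(K)."""
    return (mu_S - mu_s) / sqrt(sd_s**2 + sd_S**2)

# Illustrative values in psi, not from the paper's examples.
K = k_factor(mu_s=40_000, sd_s=4_000, mu_S=60_000, sd_S=5_000)
print(f"K = {K:.3f}, point reliability = {norm.cdf(K):.4f}")
```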


2021
Vol 13 (3)
pp. 368
Author(s):  
Christopher A. Ramezan
Timothy A. Warner
Aaron E. Maxwell
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU, however, required longer processing times. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
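A scaled-down version of this experiment is straightforward to sketch. The code below is our own illustration on synthetic data (not the authors' imagery, feature set, or class scheme): it trains RF and SVM classifiers on progressively smaller training subsets and reports overall accuracy on a fixed held-out test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a GEOBIA feature table (object geometry, texture,
# spectral statistics).
X, y = make_classification(n_samples=12_000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2_000,
                                                  random_state=0)

rng = np.random.default_rng(0)
for n_train in (10_000, 1_000, 315, 40):
    idx = rng.choice(len(X_pool), size=n_train, replace=False)
    for name, clf in (("RF", RandomForestClassifier(random_state=0)),
                      ("SVM", SVC())):
        clf.fit(X_pool[idx], y_pool[idx])
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"n={n_train:>6}  {name:>3}: overall accuracy = {acc:.3f}")
```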


2013
Vol 113 (1)
pp. 221-224
Author(s):  
David R. Johnson
Lauren K. Bachan

In a recent article, Regan, Lakhanpal, and Anguiano (2012) highlighted the lack of evidence for different relationship outcomes between arranged and love-based marriages. Yet the sample size (n = 58) used in the study is insufficient for making such inferences. This reply discusses and demonstrates how small sample sizes reduce the utility of this research.
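The power problem is easy to quantify. As a sketch (the effect size d = 0.5 is our illustrative choice, not a value from the reply), the power of a two-group comparison with 29 couples per group can be computed as follows:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-group comparison with 29 per group (n = 58 total)
# to detect a medium standardized difference of d = 0.5.
power = TTestIndPower().power(effect_size=0.5, nobs1=29, alpha=0.05, ratio=1)
print(f"power = {power:.2f}")  # well below the conventional 0.80 target
```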



2015
Author(s):  
Dominic Holland
Yunpeng Wang
Wesley K Thompson
Andrew Schork
Chi-Hua Chen
...  

Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding of the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities that does not require raw genotype data, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype -- the proportion of SNPs (after uniform pruning, so that large LD blocks are not over-represented) likely to be in strong LD with causal/mechanistically associated SNPs -- and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to a recent GWAS of schizophrenia (N = 82,315) and additionally, for purposes of illustration, putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We estimate the degree to which effect sizes are over-estimated when based on linear regression association coefficients. We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
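A stripped-down version of the mixture idea can be sketched in a few lines. The code below is our own two-component illustration with the null fixed at N(0, 1); the paper's four-parameter model additionally accounts for LD-induced non-null effects. It simulates GWAS-like z-scores and recovers the proportion of non-null SNPs by EM:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulate GWAS-like z-scores: a fraction pi1 of SNPs carry real effects
# (variance inflated above 1 by sample and effect size); the rest are noise.
pi1_true, inflated_var, n_snps = 0.01, 10.0, 100_000
causal = rng.random(n_snps) < pi1_true
z = np.where(causal,
             rng.standard_normal(n_snps) * np.sqrt(inflated_var),
             rng.standard_normal(n_snps))

# EM for a two-component mixture with the null fixed at N(0, 1):
# estimates pi1 (the polygenicity) and the non-null variance.
pi1, var1 = 0.5, 2.0
for _ in range(200):
    f0 = norm.pdf(z)                        # null density
    f1 = norm.pdf(z, scale=np.sqrt(var1))   # non-null density
    resp = pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)  # posterior non-null prob.
    pi1 = resp.mean()
    var1 = np.sum(resp * z**2) / np.sum(resp)

print(f"estimated pi1 = {pi1:.4f} (true {pi1_true})")
```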

