Do Female and Male Judges Assign the Same Ratings to the Same Wines? Large Sample Results

2018 ◽  
Vol 13 (4) ◽  
pp. 403-408 ◽  
Author(s):  
Jeff Bodington ◽  
Manuel Malfeito-Ferreira

Abstract. Much research shows that women and men have different taste acuities and preferences. If female and male judges tend to assign different ratings to the same wines, then the gender balances of the judge panels will bias awards. Existing research supports the null hypothesis; however, that finding is based on small sample sizes. This article presents the results for a large sample: 260 wines and 1,736 wine-score observations. Subject to the strong qualification that non-gender-related variation is material, the results affirm that female and male judges do assign about the same ratings to the same wines. The expected value of the difference in their mean ratings is zero. (JEL Classifications: A10, C00, C10, C12, D12)

2018 ◽  
Author(s):  
Christopher Chabris ◽  
Patrick Ryan Heck ◽  
Jaclyn Mandart ◽  
Daniel Jacob Benjamin ◽  
Daniel J. Simons

Williams and Bargh (2008) reported that holding a hot cup of coffee caused participants to judge a person’s personality as warmer, and that holding a therapeutic heat pad caused participants to choose rewards for other people rather than for themselves. These experiments featured large effects (r = .28 and .31), small sample sizes (41 and 53 participants), and barely statistically significant results. We attempted to replicate both experiments in field settings with more than triple the sample sizes (128 and 177) and double-blind procedures, but found near-zero effects (r = –.03 and .02). In both cases, Bayesian analyses suggest there is substantially more evidence for the null hypothesis of no effect than for the original physical warmth priming hypothesis.


2019 ◽  
Vol 50 (2) ◽  
pp. 127-132 ◽  
Author(s):  
Christopher F. Chabris ◽  
Patrick R. Heck ◽  
Jaclyn Mandart ◽  
Daniel J. Benjamin ◽  
Daniel J. Simons

Abstract. Williams and Bargh (2008) reported that holding a hot cup of coffee caused participants to judge a person’s personality as warmer and that holding a therapeutic heat pad caused participants to choose rewards for other people rather than for themselves. These experiments featured large effects (r = .28 and .31), small sample sizes (41 and 53 participants), and barely statistically significant results. We attempted to replicate both experiments in field settings with more than triple the sample sizes (128 and 177) and double-blind procedures, but found near-zero effects (r = −.03 and .02). In both cases, Bayesian analyses suggest there is substantially more evidence for the null hypothesis of no effect than for the original physical warmth priming hypothesis.


2019 ◽  
Vol 147 (2) ◽  
pp. 763-769 ◽  
Author(s):  
D. S. Wilks

Abstract. Quantitative evaluation of the flatness of the verification rank histogram can be approached through formal hypothesis testing. Traditionally, the familiar χ² test has been used for this purpose. Recently, two alternatives, the reliability index (RI) and an entropy statistic (Ω), have been suggested in the literature. This paper presents approximations to the sampling distributions of these latter two rank histogram flatness metrics, and compares the statistical power of tests based on the three statistics, in a controlled setting. The χ² test is generally most powerful (i.e., most sensitive to violations of the null hypothesis of rank uniformity), although for overdispersed ensembles and small sample sizes, the test based on the entropy statistic Ω is more powerful. The RI-based test is preferred only for unbiased forecasts with small ensembles and very small sample sizes.
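
As a rough illustration of the χ² flatness test the abstract above refers to, the sketch below computes the Pearson χ² statistic for a rank histogram against the uniform expectation. It is not taken from the paper: the function name and the bin counts are hypothetical, and the RI and Ω statistics are not shown. Under the null hypothesis of rank uniformity, the statistic is approximately χ²-distributed with (number of bins − 1) degrees of freedom.

```python
def rank_histogram_chi2(counts):
    """Pearson chi-square statistic for flatness of a rank histogram.

    counts: number of verifications falling in each rank bin.
    Returns (statistic, degrees_of_freedom).
    """
    n_bins = len(counts)
    total = sum(counts)
    if n_bins < 2 or total == 0:
        raise ValueError("need at least two bins and one observation")
    expected = total / n_bins  # a flat histogram puts total/n_bins in each bin
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, n_bins - 1


# Hypothetical counts for a 4-member ensemble (5 rank bins), 100 forecasts.
flat_stat, flat_df = rank_histogram_chi2([20, 20, 20, 20, 20])
u_stat, u_df = rank_histogram_chi2([35, 15, 10, 15, 25])
print(flat_stat, flat_df)
print(u_stat, u_df)
```

A larger statistic indicates a stronger departure from flatness; it would be compared with a χ² critical value at the chosen significance level.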


2020 ◽  
Vol 117 (32) ◽  
pp. 19151-19158 ◽  
Author(s):  
M.-A. C. Bind ◽  
D. B. Rubin

In randomized experiments, Fisher-exact P values are available and should be used to help evaluate results rather than the more commonly reported asymptotic P values. One reason is that using the latter can effectively alter the question being addressed by including irrelevant distributional assumptions. The Fisherian statistical framework, proposed in 1925, calculates a P value in a randomized experiment by using the actual randomization procedure that led to the observed data. Here, we illustrate this Fisherian framework in a crossover randomized experiment. First, we consider the first period of the experiment and analyze its data as a completely randomized experiment, ignoring the second period; then, we consider both periods. For each analysis, we focus on 10 outcomes that illustrate important differences between the asymptotic and Fisher tests for the null hypothesis of no ozone effect. For some outcomes, the traditional P value based on the approximating asymptotic Student’s t distribution substantially subceeded the minimum attainable Fisher-exact P value. For the other outcomes, the Fisher-exact null randomization distribution substantially differed from the bell-shaped one assumed by the asymptotic t test. Our conclusions: When researchers choose to report P values in randomized experiments, 1) Fisher-exact P values should be used, especially in studies with small sample sizes, and 2) the shape of the actual null randomization distribution should be examined for the recondite scientific insights it may reveal.
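
To make the idea of a Fisher-exact P value concrete, here is a minimal sketch for a completely randomized experiment. It is not the authors' analysis: the outcomes, the assignment, the function name, and the choice of the absolute difference in means as the test statistic are all assumptions made for illustration. Under the sharp null of no effect, the outcomes are fixed regardless of assignment, so the P value is the share of all possible assignments whose statistic is at least as extreme as the one observed.

```python
from itertools import combinations


def fisher_randomization_p(outcomes, treated_idx):
    """Exact two-sided randomization P value for a difference in means.

    Enumerates every way of assigning len(treated_idx) of the units to
    treatment, which is only feasible for small experiments.
    """
    n = len(outcomes)
    k = len(treated_idx)

    def abs_mean_diff(idx):
        idx = set(idx)
        treated = [outcomes[i] for i in idx]
        control = [outcomes[i] for i in range(n) if i not in idx]
        return abs(sum(treated) / len(treated) - sum(control) / len(control))

    observed = abs_mean_diff(treated_idx)
    diffs = [abs_mean_diff(a) for a in combinations(range(n), k)]
    # Small tolerance guards against floating-point ties.
    extreme = sum(d >= observed - 1e-12 for d in diffs)
    return extreme / len(diffs)


# Hypothetical data: 6 units, the last 3 assigned to treatment.
outcomes = [1, 2, 3, 4, 5, 6]
p_value = fisher_randomization_p(outcomes, treated_idx=(3, 4, 5))
print(p_value)
```

With 6 units and 3 treated there are only 20 possible assignments, so no two-sided P value from this test can fall below 2/20 = 0.1, however large the observed difference. That floor is the kind of minimum attainable P value the abstract contrasts with asymptotic t-based values.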


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S556-S556
Author(s):  
Judy Poey ◽  
Laci Cornelison

Abstract. Outcomes related to person-centered care in nursing homes have been difficult to ascertain. Much of the extant literature has suffered from differing definitions of what it means to be person-centered, variation in the levels of implementation of person-centered care that an organization has achieved, and small sample sizes. The PEAK program provides a unique opportunity to control for these variables across a large sample of nursing homes throughout the state of Kansas. This presentation will discuss the methodological advantages of evaluating the PEAK program and the findings from an evaluation of resident satisfaction in nursing homes at varying levels of implementation of person-centeredness.


2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Weitong Cui ◽  
Huaru Xue ◽  
Lei Wei ◽  
Jinghua Jin ◽  
Xuewen Tian ◽  
...  

Abstract. Background: RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results: Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions: High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability, and DE results should be interpreted cautiously unless soundly validated.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract. In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and is able to deal with small samples. We evaluated the performances of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates and relationships between covariates, exposure statuses, and outcomes. We have also illustrated the application of these methods, in which they were used to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of G-computation, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner was a performant method for drawing causal inferences, even from small sample sizes.


2013 ◽  
Vol 113 (1) ◽  
pp. 221-224 ◽  
Author(s):  
David R. Johnson ◽  
Lauren K. Bachan

In a recent article, Regan, Lakhanpal, and Anguiano (2012) highlighted the lack of evidence for different relationship outcomes between arranged and love-based marriages. Yet the sample size (n = 58) used in the study is insufficient for making such inferences. This reply discusses and demonstrates how small sample sizes reduce the utility of this research.

