Detection of Item Preknowledge Using Likelihood Ratio Test and Score Test

2016 ◽  
Vol 42 (1) ◽  
pp. 46-68 ◽  
Author(s):  
Sandip Sinharay

An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include either or both of dichotomous and polytomous items. Each new statistic has an asymptotic standard normal null distribution. Detailed simulation studies demonstrate that the Type I error rates of the new statistics are close to the nominal level and that their power is larger than that of an existing statistic for addressing the same problem.
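As a rough illustration of the signed likelihood ratio idea behind such statistics, the sketch below compares a single-ability fit against separate abilities on the compromised and non-compromised item subsets under a simple Rasch model. The setup and all names are hypothetical and deliberately simpler than the article's statistics, which also cover adaptive tests and polytomous items.

```python
# Sketch of a signed likelihood-ratio statistic for item preknowledge
# under a simple Rasch model: compare the fit of one common ability
# against separate abilities on the compromised and non-compromised
# subsets. An illustration only, not the article's exact statistics.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def neg_loglik(theta, responses, difficulties):
    """Negative Rasch log-likelihood of 0/1 responses."""
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

def mle(responses, difficulties):
    res = minimize_scalar(neg_loglik, bounds=(-6, 6), method="bounded",
                          args=(responses, difficulties))
    return res.x, -res.fun

def signed_lrt(resp, diff, compromised):
    """z is asymptotically N(0,1) under the null of no preknowledge."""
    c = np.asarray(compromised, dtype=bool)
    _, l_all = mle(resp, diff)                    # one common ability
    th_c, l_c = mle(resp[c], diff[c])             # ability on compromised items
    th_n, l_n = mle(resp[~c], diff[~c])           # ability on the rest
    lr = max(0.0, 2.0 * (l_c + l_n - l_all))      # likelihood-ratio statistic
    z = np.sign(th_c - th_n) * np.sqrt(lr)        # signed square root
    return z, norm.sf(z)                          # one-sided p-value

# Toy example: an examinee who does suspiciously well on 'compromised' items.
rng = np.random.default_rng(1)
diff = rng.normal(0, 1, 40)
comp = np.arange(40) < 10
p_true = 1 / (1 + np.exp(-(0.0 - diff)))          # true ability is 0
resp = (rng.random(40) < p_true).astype(float)
resp[comp] = (rng.random(10) < 0.9).astype(float) # inflated success rate
print(signed_lrt(resp, diff, comp))
```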

2015 ◽  
Vol 46 (3) ◽  
pp. 586-603 ◽  
Author(s):  
Ma Dolores Hidalgo ◽  
Isabel Benítez ◽  
Jose-Luis Padilla ◽  
Juana Gómez-Benito

The growing use of scales in survey questionnaires warrants attention to how polytomous differential item functioning (DIF) affects observed scale score comparisons. The aim of this study is to investigate the impact of DIF on the type I error and effect size of the independent samples t-test on the observed total scale scores. A simulation study was conducted, focusing on potential variables related to DIF in polytomous items, such as DIF pattern, sample size, DIF magnitude, and percentage of DIF items. The results showed that DIF patterns and the number of DIF items affected the type I error rates and effect size of t-test values. The results highlight the need to analyze DIF before making comparative group interpretations.
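A minimal Monte Carlo sketch of the phenomenon the study investigates: when the groups' latent distributions are identical but a few items carry uniform DIF, the t-test on observed total scores rejects far more often than the nominal 5%. The crude discretized-latent data generation below is an assumption of this sketch, not the study's design.

```python
# Monte Carlo sketch: uniform DIF in a few polytomous items inflates
# the Type I error of a t-test on observed total scores even though
# the groups' latent means are identical (H0 is true at the latent level).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_rep, n_per_group, n_items, n_dif = 2000, 200, 20, 4
dif_shift = 0.4                 # uniform DIF: a constant advantage on 4 items

def total_scores(theta, dif):
    # item score in 0..4: discretized latent value + item noise,
    # shifted on the DIF items (a deliberately crude generator)
    raw = theta[:, None] + rng.normal(0, 1, (theta.size, n_items))
    raw[:, :n_dif] += dif
    return np.clip(np.round(raw + 2), 0, 4).sum(axis=1)

rejections = 0
for _ in range(n_rep):
    theta_ref = rng.normal(0, 1, n_per_group)   # same latent distribution
    theta_foc = rng.normal(0, 1, n_per_group)   # in both groups
    t, p = ttest_ind(total_scores(theta_ref, dif_shift),
                     total_scores(theta_foc, 0.0))
    rejections += (p < 0.05)

print(f"empirical Type I error: {rejections / n_rep:.3f}  (nominal 0.05)")
```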


2016 ◽  
Vol 77 (1) ◽  
pp. 54-81 ◽  
Author(s):  
Sandip Sinharay ◽  
Matthew S. Johnson

In a pioneering research article, Wollack and colleagues suggested the “erasure detection index” (EDI) to detect test tampering. The EDI can be used with or without a continuity correction and is assumed to follow the standard normal distribution under the null hypothesis of no test tampering. When used without a continuity correction, the EDI often has inflated Type I error rates. When used with a continuity correction, the EDI has satisfactory Type I error rates but lower power than the EDI without a continuity correction. This article suggests three methods for detecting test tampering that do not rely on the assumption of a standard normal distribution under the null hypothesis. A detailed simulation study demonstrates that each suggested method performs slightly better than the EDI. The EDI and the suggested methods were applied to a real data set. The suggested methods, although more computationally intensive than the EDI, seem promising for detecting test tampering.
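For intuition, the sketch below is a deliberately simplified, binomial caricature of a continuity-corrected erasure index; Wollack and colleagues' EDI aggregates model-based per-item erasure probabilities, whereas this toy fixes a single null rate p0.

```python
# Simplified caricature of a continuity-corrected erasure index:
# compare an observed count of wrong-to-right (WR) erasures with its
# null expectation and refer the standardized value to N(0,1).
from math import sqrt
from scipy.stats import norm

def erasure_z(x_wr, n_erasures, p0, continuity=True):
    """z-index for x_wr wrong-to-right erasures out of n_erasures."""
    mean = n_erasures * p0
    sd = sqrt(n_erasures * p0 * (1 - p0))
    cc = 0.5 if continuity else 0.0          # continuity correction
    return (x_wr - mean - cc) / sd

# 14 of 20 erasures were wrong-to-right; hypothetical null rate 1/3
z = erasure_z(14, 20, 1 / 3)
print(z, norm.sf(z))   # one-sided p-value for 'too many' WR erasures
```

Dropping the correction (continuity=False) makes z larger, which mirrors the trade-off in the abstract: more power, but inflated Type I error.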


2017 ◽  
Author(s):  
Rounak Dey ◽  
Ellen M. Schmidt ◽  
Goncalo R. Abecasis ◽  
Seunggeun Lee

The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses of thousands of traits and has great potential to identify novel genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Since PheWAS can test thousands of binary phenotypes, and most of them have unbalanced (case:control = 1:10) or often extremely unbalanced (case:control = 1:600) case-control ratios, existing methods cannot provide an accurate and scalable way to test for associations. Here we propose a computationally fast score-test-based method that estimates the distribution of the test statistic using the saddlepoint approximation. Our method is much faster than the state-of-the-art Firth’s test (∼100 times). It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls.
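The core of the proposal can be sketched with the classic Lugannani-Rice saddlepoint tail formula applied to a single-variant score statistic; the version below (with hypothetical toy data) omits the paper's additional speed-ups, such as falling back to a normal approximation near the mean and exploiting genotype sparsity.

```python
# Sketch of a saddlepoint-approximated p-value for a score statistic
# S = sum_j g_j (X_j - mu_j), with X_j ~ Bernoulli(mu_j) under the null
# (mu_j would come from a null logistic model; here they are fixed).
# Uses the Lugannani-Rice tail formula; an illustration, not fastSPA.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def K(t, g, mu):   # cumulant generating function of S
    return np.sum(np.log(1 - mu + mu * np.exp(g * t)) - g * mu * t)

def K1(t, g, mu):  # first derivative of K
    e = mu * np.exp(g * t)
    return np.sum(g * e / (1 - mu + e) - g * mu)

def K2(t, g, mu):  # second derivative of K
    e = mu * np.exp(g * t)
    return np.sum(g**2 * e * (1 - mu) / (1 - mu + e) ** 2)

def spa_pvalue(s, g, mu):
    """Two-sided saddlepoint p-value for the observed statistic s.
    Real implementations switch to a normal approximation near the mean,
    where this formula is numerically unstable."""
    if abs(s) < 1e-8:
        return 1.0
    t_hat = brentq(lambda t: K1(t, g, mu) - s, -50, 50)   # saddlepoint
    w = np.sign(t_hat) * np.sqrt(2 * (t_hat * s - K(t_hat, g, mu)))
    v = t_hat * np.sqrt(K2(t_hat, g, mu))
    z = w + np.log(v / w) / w
    tail = norm.sf(z) if t_hat > 0 else norm.cdf(z)
    return min(1.0, 2 * tail)

# Toy example: extremely unbalanced trait, modestly rare variant
rng = np.random.default_rng(0)
n = 5000
mu = np.full(n, 0.01)                    # null case probabilities
geno = rng.binomial(2, 0.05, n).astype(float)
g = geno - geno.mean()                   # centered genotype
y = rng.binomial(1, mu)
s = np.sum(g * (y - mu))                 # observed score statistic
print(spa_pvalue(s, g, mu))
```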


2019 ◽  
Author(s):  
Melissa Angelina Rodgers ◽  
James E Pustejovsky

Selective reporting of results based on their statistical significance threatens the validity of meta-analytic findings. A variety of techniques for detecting selective reporting, publication bias, or small-study effects are available and are routinely used in research syntheses. Most such techniques are univariate, in that they assume that each study contributes a single, independent effect size estimate to the meta-analysis. In practice, however, studies often contribute multiple, statistically dependent effect size estimates, such as for multiple measures of a common outcome construct. Many methods are available for meta-analyzing dependent effect sizes, but methods for investigating selective reporting while also handling effect size dependencies require further investigation. Using Monte Carlo simulations, we evaluate three available univariate tests for small-study effects or selective reporting, including the Trim & Fill test, Egger's regression test, and a likelihood ratio test from a three-parameter selection model (3PSM), when dependence is ignored or handled using ad hoc techniques. We also examine two variants of Egger's regression test that incorporate robust variance estimation (RVE) or multilevel meta-analysis (MLMA) to handle dependence. Simulation results demonstrate that ignoring dependence inflates Type I error rates for all univariate tests. Variants of Egger's regression maintain Type I error rates when dependent effect sizes are handled by sampling a single estimate per study or by using RVE or MLMA. The 3PSM likelihood ratio test does not fully control Type I error rates. With the exception of the 3PSM, all methods have limited power to detect selection bias except under strong selection for statistically significant effects.
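As a sketch of the simplest of these tools, the code below runs Egger's regression (effect estimates regressed on their standard errors with inverse-variance weights) twice on simulated dependent estimates: once naively and once with cluster-robust standard errors in the spirit of the RVE variant. The data-generation and the cluster-robust shortcut are assumptions of this sketch, not the authors' implementation.

```python
# Egger's regression test for small-study effects: a nonzero slope on
# the standard error signals funnel-plot asymmetry. The cluster-robust
# fit gestures at handling dependent estimates within studies.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_studies, per_study = 30, 3
study = np.repeat(np.arange(n_studies), per_study)   # dependent estimates
se = rng.uniform(0.1, 0.5, n_studies * per_study)
u = rng.normal(0, 0.1, n_studies)[study]             # shared study effect
y = 0.2 + u + rng.normal(0, se)                      # true effect 0.2, no bias

X = sm.add_constant(se)                              # [intercept, se]
wls = sm.WLS(y, X, weights=1 / se**2)

naive = wls.fit()                                    # ignores dependence
robust = wls.fit(cov_type="cluster", cov_kwds={"groups": study})

print("naive  slope p-value:", naive.pvalues[1])
print("robust slope p-value:", robust.pvalues[1])
```

Over repeated simulations, the naive p-values reject too often under this null, while the cluster-robust version stays near the nominal rate, which is the pattern the abstract reports.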


2011 ◽  
Vol 72 (3) ◽  
pp. 469-492 ◽  
Author(s):  
Eun Sook Kim ◽  
Myeongsun Yoon ◽  
Taehun Lee

Multiple-indicators multiple-causes (MIMIC) modeling is often used to test a latent group mean difference while assuming the equivalence of factor loadings and intercepts over groups. However, this study demonstrated that MIMIC was insensitive to the presence of factor loading noninvariance, which implies that factor loading invariance should be tested through other measurement invariance testing techniques. MIMIC modeling is also used for measurement invariance testing by allowing a direct path from a grouping covariate to each observed variable. This simulation study with both continuous and categorical variables investigated the performance of MIMIC in detecting noninvariant variables under various study conditions and showed that the likelihood ratio test of MIMIC with Oort adjustment not only controlled Type I error rates below the nominal level but also maintained high power across study conditions.


Methodology ◽  
2012 ◽  
Vol 8 (4) ◽  
pp. 134-145 ◽  
Author(s):  
Fabiola González-Betanzos ◽  
Francisco J. Abad

The current research compares the effects of several strategies for establishing the anchor subtest when testing for differential item functioning (DIF) using the IRT likelihood ratio test in one- and two-stage procedures. Two one-stage strategies were examined: (1) “one item” and (2) “all other items” used as the anchor. Additionally, two two-stage strategies were tested: (3) “one anchor item with posterior anchor test augmentation” and (4) “all other items with purification.” The strategies were compared in a simulation study in which sample size, DIF size, type of DIF, and software implementation (MULTILOG vs. IRTLRDIF) were manipulated. Results indicated that Procedure (1) was more efficient than (2). Purification was found to improve Type I error rates substantially with the “all other items” strategy, while “posterior anchor test augmentation” did not yield a significant improvement. Regarding the software used, MULTILOG generally offered better results than IRTLRDIF.


2018 ◽  
Vol 35 (15) ◽  
pp. 2545-2554 ◽  
Author(s):  
Joseph Mingrone ◽  
Edward Susko ◽  
Joseph P Bielawski

Motivation: Likelihood ratio tests are commonly used to test for positive selection acting on proteins. They are usually applied with thresholds for declaring a protein under positive selection determined from a chi-square or mixture of chi-square distributions. Although it is known that such distributions are not strictly justified due to the statistical irregularity of the problem, the hope has been that the resulting tests are conservative and do not lose much power in comparison with the same test using the unknown, correct threshold. We show that commonly used thresholds need not yield conservative tests, but instead give larger than expected Type I error rates. Statistical regularity can be restored by using a modified likelihood ratio test.

Results: We give theoretical results to prove that, if the number of sites is not too small, the modified likelihood ratio test gives approximately correct Type I error probabilities regardless of the parameter settings of the underlying null hypothesis. Simulations show that modification gives Type I error rates closer to those stated without a loss of power. The simulations also show that parameter estimation for mixture models of codon evolution can be challenging in certain data-generation settings, with very different mixing distributions giving nearly identical site pattern distributions unless the number of taxa and tree length are large. Because mixture models are widely used for a variety of problems in molecular evolution, the challenges and general approaches to solving them presented here are applicable in a broader context.

Availability and implementation: https://github.com/jehops/codeml_modl

Supplementary information: Supplementary data are available at Bioinformatics online.
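For reference, the conventional threshold the paper critiques corresponds to referring the LRT statistic to a 50:50 mixture of a point mass at zero and a chi-square distribution, as sketched below; the modified likelihood ratio test changes the test itself rather than just this reference distribution.

```python
# Conventional mixture-of-chi-square p-value for a boundary LRT:
# a 50:50 mixture of a point mass at zero and chi-square with df
# degrees of freedom. Per the article, this reference distribution
# is not always conservative in the positive-selection setting.
from scipy.stats import chi2

def mixture_pvalue(lr_stat, df=1):
    """P-value under the 0.5*chi2_0 + 0.5*chi2_df mixture."""
    if lr_stat <= 0:
        return 1.0
    return 0.5 * chi2.sf(lr_stat, df)

print(mixture_pvalue(3.2))   # e.g. LR = 2*(lnL_alt - lnL_null) = 3.2
```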


2017 ◽  
Vol 41 (8) ◽  
pp. 632-644 ◽ 
Author(s):  
Jie Xu ◽  
Insu Paek ◽  
Yan Xia

It is widely known that the Type I error rates of goodness-of-fit tests using full-information test statistics, such as Pearson’s test statistic χ2 and the likelihood ratio test statistic G2, are problematic when data are sparse. Under such conditions, the limited-information goodness-of-fit test statistic M2 is recommended for assessing the fit of models for binary response data. A simulation study was conducted to investigate the power and Type I error rate of M2 in fitting unidimensional models to many different types of multidimensional data. As an additional interest, the behavior of RMSEA2 was also examined, which is the root mean square error of approximation (RMSEA) based on M2. Findings from the current study showed that M2 and RMSEA2 are sensitive in detecting misfit due to varying slope parameters, a bifactor structure, and a partially (or completely) simple structure in multidimensional data, but not misfit due to within-item multidimensional structures.
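RMSEA2 is the usual RMSEA transformation applied to M2. A small sketch with hypothetical numbers follows; note that conventions differ on whether N or N-1 appears in the denominator, and this sketch uses N.

```python
# Sketch: exact-fit p-value and RMSEA based on the limited-information
# statistic M2 (which is referred to a chi-square distribution).
import math
from scipy.stats import chi2

def rmsea_from_m2(m2, df, n):
    """Generic RMSEA transformation of a chi-square fit statistic."""
    return math.sqrt(max(m2 - df, 0.0) / (df * n))

m2, df, n = 412.3, 350, 1000          # hypothetical values
print("p(M2) :", chi2.sf(m2, df))     # exact-fit test
print("RMSEA2:", rmsea_from_m2(m2, df, n))
```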


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242722 ◽ 
Author(s):  
Zhiming Li ◽  
Changxing Ma ◽  
Mingyao Ai

This paper proposes asymptotic and exact methods for testing the equality of correlations for multiple bilateral data under Dallal’s model. Three asymptotic test statistics are derived for large samples. Since these are not applicable to small samples, several conditional and unconditional exact methods are proposed based on the three statistics. Numerical studies are conducted to compare all these methods with regard to type I error rates (TIEs) and power. The results show that the asymptotic score test is the most robust, and two exact tests have satisfactory TIEs and power. Some real examples are provided to illustrate the effectiveness of these tests.


2020 ◽  
Vol 44 (5) ◽  
pp. 376-392 ◽ 
Author(s):  
Sandip Sinharay

Benefiting from item preknowledge is a major type of fraudulent behavior during educational assessments. This article suggests a new statistic that can be used to detect examinees who may have benefited from item preknowledge using their response times. The statistic quantifies the difference in speed between the compromised and the non-compromised items of an examinee. The distribution of the statistic under the null hypothesis of no preknowledge is proved to be the standard normal distribution. A simulation study is used to evaluate the Type I error rate and power of the suggested statistic. A real data example demonstrates the usefulness of the new statistic, which is found to provide information that is not provided by statistics based only on item scores.
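A caricature of the idea under van der Linden's lognormal response-time model: estimate the examinee's speed separately from the compromised and non-compromised items and standardize the difference. The sketch below treats item parameters as known and is a hypothetical simplification, not the article's exact statistic.

```python
# Speed-difference sketch under a lognormal response-time model,
# log T_ij ~ N(beta_i - tau_j, alpha_i^-2): estimate the examinee's
# speed tau separately on each item subset and standardize the gap.
import numpy as np
from scipy.stats import norm

def speed_z(log_t, alpha, beta, compromised):
    c = np.asarray(compromised, dtype=bool)
    def tau_hat(idx):
        w = alpha[idx] ** 2                      # information weights
        est = np.sum(w * (beta[idx] - log_t[idx])) / np.sum(w)
        return est, 1.0 / np.sum(w)              # (estimate, variance)
    tau_c, var_c = tau_hat(c)
    tau_n, var_n = tau_hat(~c)
    z = (tau_c - tau_n) / np.sqrt(var_c + var_n) # N(0,1) if no preknowledge
    return z, norm.sf(z)                         # faster on compromised items?

# Toy data: examinee is faster than expected on the compromised items
rng = np.random.default_rng(3)
n_items = 30
alpha = rng.uniform(1.5, 2.5, n_items)           # time discriminations
beta = rng.normal(4.0, 0.3, n_items)             # time intensities
comp = np.arange(n_items) < 8
log_t = rng.normal(beta, 1 / alpha)              # true speed tau = 0
log_t[comp] -= 0.5                               # preknowledge speed-up
print(speed_z(log_t, alpha, beta, comp))
```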

