IRT Models for Expert-Coded Panel Data

2018 ◽  
Vol 26 (4) ◽  
pp. 431-456 ◽  
Author(s):  
Kyle L. Marquardt ◽  
Daniel Pemstein

Data sets quantifying phenomena of social-scientific interest often use multiple experts to code latent concepts. While it remains standard practice to report the average score across experts, experts likely vary in both their expertise and their interpretation of question scales. As a result, the mean may be an inaccurate statistic. Item-response theory (IRT) models provide an intuitive method for taking these forms of expert disagreement into account when aggregating ordinal ratings produced by experts, but they have rarely been applied to cross-national expert-coded panel data. We investigate the utility of IRT models for aggregating expert-coded data by comparing the performance of various IRT models to the standard practice of reporting average expert codes, using both data from the V-Dem data set and ecologically motivated simulated data. We find that IRT approaches outperform simple averages when experts vary in reliability and exhibit differential item functioning (DIF). IRT models are also generally robust even in the absence of simulated DIF or varying expert reliability. Our findings suggest that producers of cross-national data sets should adopt IRT techniques to aggregate expert-coded data measuring latent concepts.
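The intuition behind down-weighting unreliable experts can be sketched without a full IRT model. The toy simulation below is not the authors' V-Dem models: the expert error variances are assumed known and inverse-variance weighting stands in for the latent-variable estimation, purely to show why a simple average suffers when expert reliability varies.

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: 50 countries with true latent scores, each rated by
# 5 experts whose error variances differ (i.e., varying "reliability").
true_scores = [random.gauss(0, 1) for _ in range(50)]
expert_sd = [0.2, 0.3, 0.5, 1.5, 2.0]  # the last two experts are noisy

ratings = [[mu + random.gauss(0, sd) for sd in expert_sd] for mu in true_scores]

def precision_weighted(row, sds):
    # Inverse-variance weighting: down-weights unreliable experts.
    weights = [1 / sd ** 2 for sd in sds]
    return sum(w * r for w, r in zip(weights, row)) / sum(weights)

def rmse(estimates):
    return (sum((e - t) ** 2 for e, t in zip(estimates, true_scores))
            / len(true_scores)) ** 0.5

err_mean = rmse([statistics.mean(r) for r in ratings])
err_weighted = rmse([precision_weighted(r, expert_sd) for r in ratings])
print(err_weighted < err_mean)  # weighting by reliability reduces error
```

A full IRT treatment would additionally estimate the reliabilities and DIF thresholds from the ordinal codes rather than assume them known.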

2021 ◽  
pp. gr.273631.120
Author(s):  
Xinhao Liu ◽  
Huw A Ogilvie ◽  
Luay Nakhleh

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is the family of coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and, through a divide-and-conquer approach, with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method matches or exceeds the accuracy of previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and because it derives transition rates by simulation, it is flexible enough to support future implementations of all kinds of population models.
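The machinery underlying any coalHMM is the standard HMM forward recursion. A minimal generic sketch follows; it is not the authors' implementation — in their setting the hidden states would be local genealogies and the transition matrix would come from the simulation-derived rates.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.

    pi[s]: initial state probabilities; A[r][s]: transition probabilities;
    B[s][o]: emission probabilities. Returns log P(obs).
    """
    states = range(len(pi))
    alpha = [pi[s] * B[s][obs[0]] for s in states]
    c = sum(alpha)
    loglik = math.log(c)
    alpha = [a / c for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states]
        c = sum(alpha)  # rescale each step to avoid underflow on long alignments
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
    return loglik

# Two hidden states emitting two symbols; a genomic alignment would be
# processed the same way, column by column.
ll = forward_loglik([0, 1, 0, 0],
                    pi=[0.5, 0.5],
                    A=[[0.9, 0.1], [0.1, 0.9]],
                    B=[[0.8, 0.2], [0.3, 0.7]])
```

Variational inference would then optimize the entries of `A` and `B` (or the parameters generating them) against this likelihood.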


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S195-S196
Author(s):  
Deborah Carr

The Journal of Gerontology: Social Sciences aims to publish the highest quality social scientific research on aging and the life course in the U.S. and worldwide. The disciplinary scope is broad, encompassing scholarship from demography, economics, psychology, public health, and sociology. A key substantive focus is identifying the social, economic, and cultural contexts that shape aging experiences worldwide. In the coming decade, social gerontology research is poised to present many opportunities for cross-national and cross-cultural scholarship – driven in part by the proliferation of large parallel data sets from many nations in Europe, Latin America, and Asia. I will discuss the role that peer-reviewed cross-national scholarship can play in disseminating knowledge that informs gerontological research, policy, and practice internationally. I will also identify under-researched areas that will be of great interest to scholars in the coming decade, including LGBT older adults, aging in the Global South, reconfigured families, and centenarians.


Author(s):  
Victor H Aguiar ◽  
Nail Kashaev

A long-standing question about consumer behaviour is whether individuals' observed purchase decisions satisfy the revealed preference (RP) axioms of utility maximization theory (UMT). Researchers using survey or experimental panel data sets on prices and consumption to answer this question face the well-known problem of measurement error. We show that ignoring measurement error in the RP approach may lead to overrejection of the UMT. To solve this problem, we propose a new statistical RP framework for consumption panel data sets that allows for testing the UMT in the presence of measurement error. Our test is applicable to all consumer models that can be characterized by their first-order conditions. Our approach is non-parametric, allows for unrestricted heterogeneity in preferences, and requires only a centring condition on measurement error. We develop two applications that provide new evidence about the UMT. First, we find support in a survey data set for the dynamic and time-consistent UMT in single-individual households, in the presence of nonclassical measurement error in consumption. In the second application, we cannot reject the static UMT in a widely used experimental data set in which measurement error in prices is assumed to be the result of price misperception due to the experimental design. The first finding stands in contrast to the conclusions drawn from the deterministic RP test of Browning (1989, International Economic Review, 30, 979–992). The second finding reverses the conclusions drawn from the deterministic RP test of Afriat (1967, International Economic Review, 8, 67–77) and Varian (1982, Econometrica, 50, 945–973).
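The deterministic RP tests referred to reduce to a mechanical check of the Generalized Axiom of Revealed Preference (GARP). A minimal sketch of that check follows; the data are illustrative, and this is the classical deterministic test, not the authors' statistical framework, which additionally models measurement error.

```python
def garp_violated(prices, quantities):
    """Return True if the observed (price, bundle) pairs violate GARP,
    i.e., cannot be rationalized by any utility function."""
    n = len(prices)
    dot = lambda p, x: sum(pi * xi for pi, xi in zip(p, x))
    # i R j: bundle j was affordable when bundle i was chosen.
    R = [[dot(prices[i], quantities[i]) >= dot(prices[i], quantities[j])
          for j in range(n)] for i in range(n)]
    # i P j: bundle j was strictly cheaper than the chosen bundle i.
    P = [[dot(prices[i], quantities[i]) > dot(prices[i], quantities[j])
          for j in range(n)] for i in range(n)]
    # Transitive closure of R (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                R[i][j] = R[i][j] or (R[i][k] and R[k][j])
    # GARP fails if i is indirectly revealed preferred to j while j is
    # strictly directly revealed preferred to i.
    return any(R[i][j] and P[j][i] for i in range(n) for j in range(n))

# Consistent choices: each bundle is the cheap one at its own prices.
consistent = garp_violated([(1, 2), (2, 1)], [(4, 1), (1, 4)])  # False
# A violation: each chosen bundle was strictly dearer than the other.
violated = garp_violated([(2, 1), (1, 2)], [(1, 0), (0, 1)])    # True
```

Measurement error matters precisely because small perturbations of the observed prices or quantities can flip these inequalities and manufacture spurious violations.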


2016 ◽  
Vol 2016 ◽  
pp. 1-7
Author(s):  
Zhizheng Liang

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature and then adopts a generalized performance measure to learn the optimal scaling factors. Notably, the proposed model can be transformed into a convex optimization problem, namely second-order cone programming (SOCP), so the learned scaling factors are globally optimal in this sense. Several experiments on simulated data, UCI data sets, and a gene data set demonstrate that the proposed method is more effective than previous methods.
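The idea of feature scaling for selection — a nonnegative factor per feature, with a factor driven to zero effectively removing that feature — can be illustrated without the SOCP machinery. Below is a toy sketch on hypothetical data; a crude grid search stands in for the paper's convex program, and leave-one-out 1-NN accuracy stands in for its generalized performance measure.

```python
import random

random.seed(3)

# Hypothetical data: feature 0 separates the two classes, feature 1 is noise.
X = [(random.gauss(cls * 3, 1), random.gauss(0, 5))
     for cls in (0, 1) for _ in range(30)]
y = [cls for cls in (0, 1) for _ in range(30)]

def loo_accuracy(scales):
    """Leave-one-out 1-NN accuracy under per-feature nonnegative scaling."""
    hits = 0
    for i, xi in enumerate(X):
        d = lambda xj: sum(s * (a - b) ** 2 for s, a, b in zip(scales, xi, xj))
        j = min((k for k in range(len(X)) if k != i), key=lambda k: d(X[k]))
        hits += y[j] == y[i]
    return hits / len(X)

unscaled = loo_accuracy((1.0, 1.0))
# Grid search over nonnegative scales; a zero scale drops the feature.
best = max(((s0, s1) for s0 in (0.0, 0.5, 1.0) for s1 in (0.0, 0.5, 1.0)),
           key=loo_accuracy)
print(loo_accuracy(best) >= unscaled)
```

The SOCP formulation replaces this exhaustive search with a convex program whose solution is globally optimal rather than grid-limited.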


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a latent trait variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can handle not only data with item- and time-level missing observations, but also data whose time points are freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: a first-order autoregressive error structure for the trait residuals and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


2018 ◽  
Author(s):  
Xi Chen ◽  
Jianhua Xuan

In this paper, we propose a novel approach, MSIGNET, to identify subnetworks with significantly expressed genes by integrating context-specific gene expression and protein-protein interaction (PPI) data. Specifically, we integrate the differential expression of each gene and the mutual information of gene pairs in a Bayesian framework and use Metropolis sampling to identify functional interactions. During the sampling process, a conditional probability is calculated given a randomly selected gene to control the network state transition. Our method provides global statistics of all genes and their interactions and ultimately achieves a globally optimal sub-network. We apply MSIGNET to simulated data and demonstrate its superior performance over comparable network identification tools. Using a validated Parkinson's disease data set, we show that the network identified using MSIGNET is consistent with previously reported results but provides a more biologically meaningful interpretation of Parkinson's disease. Finally, to study networks related to ovarian cancer recurrence, we investigate two patient data sets. The networks identified from the independent data sets show functional consistency, and the common genes and interactions are well supported by current biological knowledge.
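The Metropolis sampling at the core of such methods follows the standard accept/reject rule: propose a change of state, accept it with probability min(1, ratio of target densities). A generic sketch on a toy continuous target (not MSIGNET's network-state sampler, where a proposal would toggle a gene's membership in the subnetwork):

```python
import math
import random

def metropolis(log_target, x0, n_steps, step=1.0, seed=1):
    """Minimal Metropolis sampler with a symmetric uniform proposal."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        # Accept with probability min(1, target(proposal) / target(x)).
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)  # on rejection, the current state is repeated
    return samples

# Sample from a standard normal (log-density up to an additive constant).
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

The empirical mean and variance of the chain approach 0 and 1, the moments of the target.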


2019 ◽  
Author(s):  
Attila Lengyel ◽  
David W. Roberts ◽  
Zoltán Botta-Dukát

Aims: To introduce REMOS, a new iterative reallocation method (with two variants) for vegetation classification, and to compare its performance with OPTSIL. We test (1) how effectively REMOS and OPTSIL maximize mean silhouette width and minimize the number of negative silhouette widths when run on classifications with different structure; (2) how the methods differ in runtime with different sample sizes; and (3) whether classifications produced by the reallocation methods differ in the number of diagnostic species, a surrogate for interpretability.

Study area: Simulation; example data sets from grasslands in Hungary and forests in Wyoming and Utah, USA.

Methods: We classified random subsets of simulated data with the flexible-beta algorithm for different values of beta. These classifications were subsequently optimized by REMOS and OPTSIL and compared for mean silhouette width and the proportion of negative silhouette widths. Then, we classified three vegetation data sets of different sizes into two to ten clusters, optimized them with the reallocation methods, and compared their runtimes, mean silhouette widths, numbers of negative silhouette widths, and numbers of diagnostic species.

Results: In terms of mean silhouette width, OPTSIL performed best when the initial classifications already had high mean silhouette width. The REMOS algorithms reached slightly lower mean silhouette widths than the maximum achievable with OPTSIL, but their efficiency was consistent across different initial classifications; thus, REMOS was significantly superior to OPTSIL when the initial classification had low mean silhouette width. REMOS resulted in zero or a negligible number of negative silhouette widths across all classifications. OPTSIL performed similarly when the initial classification was effective but could not reach as low a proportion of misclassified objects when the initial classification was inefficient. The REMOS algorithms were typically more than an order of magnitude faster than OPTSIL. There was no clear difference between REMOS and OPTSIL in the number of diagnostic species.

Conclusions: The REMOS algorithms may be preferable to OPTSIL when (1) the primary objective is to reduce or eliminate negative silhouette widths in a classification, (2) the initial classification has low mean silhouette width, or (3) time efficiency matters because of the size of the data set or a high number of clusters.
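Both reallocation methods optimize silhouette widths, which are cheap to compute. A minimal sketch of the quantity being optimized follows (Euclidean distance and the toy points are illustrative; REMOS and OPTSIL themselves are not reproduced here):

```python
def silhouette_widths(points, labels):
    """Silhouette width per object: (b - a) / max(a, b), where a is the mean
    distance to the object's own cluster and b is the smallest mean distance
    to any other cluster. Negative values flag likely misclassified objects,
    which reallocation methods try to move to a better-fitting cluster."""
    dist = lambda p, q: sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    widths = []
    for p, l in zip(points, labels):
        own = [dist(p, q) for q in clusters[l] if q is not p]
        if not own:  # singleton cluster: silhouette conventionally 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(dist(p, q) for q in members) / len(members)
                for m, members in clusters.items() if m != l)
        widths.append((b - a) / max(a, b))
    return widths

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 0.5)]
labels = [0, 0, 1, 1, 1]  # the last point is clearly in the wrong cluster
w = silhouette_widths(pts, labels)
print(w[-1] < 0)  # negative silhouette width: a candidate for reallocation
```

Mean silhouette width is simply the average of these per-object values, and reallocating the flagged point to cluster 0 would raise it.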


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Zsigmond Benkő ◽  
Tamás Bábel ◽  
Zoltán Somogyvári

Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called the "unicorn," or unique event, and present a new model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF), which measures the uniqueness of events in continuous data sets from dynamical systems. The concept of unique events differs significantly from that of traditional outliers in many respects: while repeated outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall outside the distribution of normal activity. The performance of our algorithm was examined on different types of simulated data sets with anomalies and compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF outperformed LOF and discord detection even in recognizing traditional outliers, and it also detected unique events that those methods did not. The benefits of the unicorn concept and the new detection method are illustrated with example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in cases where they were already known, such as the gravitational waves of a binary black hole merger in LIGO detector data and the signs of respiratory failure in an ECG data series. Furthermore, unique events were found in the LIBOR data set of the last 30 years.
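The intuition behind TOF can be sketched in a few lines. This is a simplified value-space variant, an assumption on our part — the published TOF operates on time-delay state-space embeddings — but the principle is the same: a uniquely shaped event resembles only its own temporal neighbourhood, so the temporal distances of its nearest neighbours in state space are small.

```python
import math

def temporal_outlier_factor(series, k=3):
    """TOF-style score: for each sample, take its k nearest neighbours in
    value space and average their distance in *time*. Low scores mark
    segments whose values occur nowhere else in the record."""
    n = len(series)
    scores = []
    for i in range(n):
        # k nearest neighbours of sample i by value, excluding itself.
        nn = sorted((abs(series[j] - series[i]), j)
                    for j in range(n) if j != i)[:k]
        scores.append(sum(abs(j - i) for _, j in nn) / k)
    return scores

# A repetitive signal with one unique excursion around index 50.
signal = [math.sin(0.5 * t) for t in range(100)]
for t in range(48, 53):
    signal[t] += 5.0  # the "unicorn": values seen nowhere else

tof = temporal_outlier_factor(signal)
unicorn_idx = min(range(len(tof)), key=lambda i: tof[i])
print(unicorn_idx)  # falls inside the injected event
```

For the repetitive sine samples the nearest value matches lie whole periods away, giving large scores; only the injected excursion has all its value neighbours within a few steps in time.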


2021 ◽  
Author(s):  
Gah-Yi Ban ◽  
N. Bora Keskin

We consider a seller who can dynamically adjust the price of a product at the individual customer level, by utilizing information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized demand model, parameters of which depend on s out of the d features. The seller initially does not know the relationship between the customer features and the product demand but learns this through sales observations over a selling horizon of T periods. We prove that the seller’s expected regret, that is, the revenue loss against a clairvoyant who knows the underlying demand relationship, is at least of order [Formula: see text] under any admissible policy. We then design a near-optimal pricing policy for a semiclairvoyant seller (who knows which s of the d features are in the demand model) who achieves an expected regret of order [Formula: see text]. We extend this policy to a more realistic setting, where the seller does not know the true demand predictors, and show that this policy has an expected regret of order [Formula: see text], which is also near-optimal. Finally, we test our theory on simulated data and on a data set from an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is superior to intuitive and/or widely-practiced customized pricing methods, such as myopic pricing and segment-then-optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in expected revenue over a six-month period. This paper was accepted by Noah Gans, stochastic models and simulation.
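The gap between myopic and experimentation-based pricing can be illustrated with a toy linear demand model. All numbers here are hypothetical, and the sketch assumes a single known-form demand curve, whereas the paper's policy handles unknown, sparse, feature-dependent parameters.

```python
import random

# Hypothetical linear demand D(p) = 10 - p + noise, so the clairvoyant
# price is a/(2b) = 5 with expected revenue 25.
def demand(p, rng):
    return 10 - p + rng.gauss(0, 0.5)

rng = random.Random(7)

# Experimentation phase: try two prices, fit the demand line.
p_lo, p_hi, n = 3.0, 7.0, 200
d_lo = sum(demand(p_lo, rng) for _ in range(n)) / n
d_hi = sum(demand(p_hi, rng) for _ in range(n)) / n
b_hat = (d_lo - d_hi) / (p_hi - p_lo)
a_hat = d_lo + b_hat * p_lo
p_star = a_hat / (2 * b_hat)  # exploit the fitted model

# Expected revenue (noise averages out): learned price vs. myopic p = 3.
rev_learned = p_star * (10 - p_star)
rev_myopic = p_lo * (10 - p_lo)
print(rev_learned > rev_myopic)
```

The regret analysis in the paper quantifies exactly this trade-off: how much revenue must be sacrificed to price experimentation before the learned price converges on the clairvoyant's.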


2011 ◽  
Vol 76 (3) ◽  
pp. 547-572 ◽  
Author(s):  
Charles Perreault

I examine how our capacity to produce accurate culture-historical reconstructions changes as more archaeological sites are discovered, dated, and added to a data set. More precisely, I describe, using simulated data sets, how increases in the number of known sites impact the accuracy and precision of our estimations of (1) the earliest and (2) latest date of a cultural tradition, (3) the date and (4) magnitude of its peak popularity, as well as (5) its rate of spread and (6) disappearance in a population. I show that the accuracy and precision of inferences about these six historical processes are not affected in the same fashion by changes in the number of known sites. I also consider the impact of two simple taphonomic site destruction scenarios on the results. Overall, the results presented in this paper indicate that unless we are in possession of near-total samples of sites, and can be certain that there are no taphonomic biases in the universe of sites to be sampled, we will make inferences of varying precision and accuracy depending on the aspect of a cultural trait’s history in question.
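The first of the six estimation targets — the earliest date of a tradition — illustrates the general problem: the earliest date among known sites is biased toward later dates, and the bias shrinks only as the number of discovered sites grows. A toy simulation with hypothetical dates and no taphonomic loss:

```python
import random

random.seed(42)

# Hypothetical tradition spanning calendar years 1000-1500; each site in
# the universe contributes one date drawn uniformly from that span.
universe = [random.uniform(1000, 1500) for _ in range(10000)]

def earliest_estimate(n_sites):
    """Estimate the tradition's start date from n discovered sites."""
    return min(random.sample(universe, n_sites))

# The sample minimum systematically lands later than the true start date,
# but the bias shrinks as more sites are discovered and dated.
small = sum(earliest_estimate(10) for _ in range(200)) / 200
large = sum(earliest_estimate(1000) for _ in range(200)) / 200
print(small > large > 1000)
```

With 10 known sites the estimated start is on average several decades too late; with 1,000 sites it is within a year or so, which is why near-total samples matter for this particular inference.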

