The most under-used statistical method in corpus linguistics: multi-level (and mixed-effects) models

Corpora ◽  
2015 ◽  
Vol 10 (1) ◽  
pp. 95-125 ◽  
Author(s):  
Stefan Th. Gries

Much statistical analysis of psycholinguistic data is now being done with so-called mixed-effects regression models. This development was spearheaded by a few highly influential introductory articles that (i) showed how these regression models are superior to what was the previous gold standard and, perhaps even more importantly, (ii) showed how these models are used practically. Corpus linguistics can benefit from mixed-effects/multi-level models for the same reason that psycholinguistics can – because, for example, speaker-specific and lexically specific idiosyncrasies can be accounted for elegantly; but, in fact, corpus linguistics needs them even more because (i) corpus-linguistic data are observational and, thus, usually unbalanced and messy/noisy, and (ii) most widely used corpora come with a hierarchical structure that corpus linguists routinely fail to consider. Unlike nearly all overviews of mixed-effects/multi-level modelling, this paper is specifically written for corpus linguists to get more of them to start using these techniques more. After a short methodological history, I provide a non-technical introduction to mixed-effects models and then discuss in detail one example – particle placement in English – to show how mixed-effects/multi-level modelling results can be obtained and how they are far superior to those of traditional regression modelling.

2010 ◽  
Vol 27 (5) ◽  
pp. 633-640 ◽  
Author(s):  
Ryung S. Kim ◽  
Juan Lin

NeuroImage ◽  
2013 ◽  
Vol 66 ◽  
pp. 249-260 ◽  
Author(s):  
Jorge L. Bernal-Rusiel ◽  
Douglas N. Greve ◽  
Martin Reuter ◽  
Bruce Fischl ◽  
Mert R. Sabuncu

2015 ◽  
Vol 7 (1) ◽  
Author(s):  
Roger Morbey ◽  
Helen Hughes ◽  
Alex Elliot ◽  
Neville Verlander ◽  
Nick Andrews ◽  
...  

This paper describes the design and application of a new statistical method for real-time syndromic surveillance, used by Public Health England. The Rising Activity, Multi-level Mixed effects, Indicator Emphasis (RAMMIE) statistical method was developed and tested alongside existing methods before being applied to a suite of syndromic surveillance in operation in England. The RAMMIE method has proved to be a reliable, effective method for generating automated alarms for syndromic surveillance. The multi-level models have enabled local models to be created for the first time across all systems and models have proved themselves to be robust across all the signals.


2018 ◽  
Author(s):  
Dale Barr ◽  
Roger Philip Levy ◽  
Christoph Scheepers ◽  
Harry Tily

Linear mixed-effects models (LMEMs) have become increasingly prominent in psycholinguistics and related areas. However, many researchers do not seem to appreciate how random effects structures affect the generalizability of an analysis. Here, we argue that researchers using LMEMs for confirmatory hypothesis testing should minimally adhere to the standards that have been in place for many decades. Through theoretical arguments and Monte Carlo simulation, we show that LMEMs generalize best when they include the maximal random effects structure justified by the design. The generalization performance of LMEMs including data-driven random effects structures strongly depends upon modeling criteria and sample size, yielding reasonable results on moderately-sized samples when conservative criteria are used, but with little or no power advantage over maximal models. Finally, random-intercepts-only LMEMs used on within-subjects and/or within-items data from populations where subjects and/or items vary in their sensitivity to experimental manipulations always generalize worse than separate F1 and F2 tests, and in many cases, even worse than F1 alone. Maximal LMEMs should be the ‘gold standard’ for confirmatory hypothesis testing in psycholinguistics and beyond.
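For orientation, the contrast between a random-intercepts-only model and a maximal model can be written out for a simple design with one within-subject, within-item predictor. This is an illustrative sketch only: the symbols (subjects s, items i, predictor X, and the random terms S and I) are assumed notation and are not taken from the abstract.

```latex
% Random-intercepts-only LMEM: subjects (s) and items (i) shift the baseline only
Y_{si} = \beta_0 + S_{0s} + I_{0i} + \beta_1 X_{si} + e_{si}

% Maximal LMEM for the same design: subjects and items may also differ in
% their sensitivity to the manipulation (by-subject and by-item random slopes)
Y_{si} = \beta_0 + S_{0s} + I_{0i} + (\beta_1 + S_{1s} + I_{1i}) X_{si} + e_{si}

% Random effects are assumed to be drawn from zero-mean normal distributions,
% with intercept-slope covariances estimated within each grouping factor
(S_{0s}, S_{1s}) \sim N(0, \Sigma_S), \quad
(I_{0i}, I_{1i}) \sim N(0, \Sigma_I), \quad
e_{si} \sim N(0, \sigma^2)
```

Under this notation, the random-intercepts-only model is the special case in which the slope terms S_{1s} and I_{1i} are fixed at zero.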


Forests ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 1111 ◽  
Author(s):  
Tao Wang ◽  
Longfei Xie ◽  
Zheng Miao ◽  
Faris Rafi Almay Widagdo ◽  
Lihu Dong ◽  
...  

The relative growth rate (RGRnv) is the standardized measurement of forest growth, whereby excluding the size differences between individuals allows their performance to be compared equally. The RGRnv model was developed using the National Forest Inventory (NFI) data on the Daxing’an Mountains, in Northeast China, which contain Dahurian larch (Larix gmelinii Rupr.), white birch (Betula platyphylla Suk.), and mixed coniferous–broadleaf forests. The study used four predictor variables, namely quadratic mean diameter (Dq), stand basal area (G), average tree height (Ha), and altitude (A), and four different methods: the nonlinear mixed-effects models (NLME), three nonlinear quantile regression (NQR3), five nonlinear quantile regression (NQR5), and nine nonlinear quantile regression (NQR9) models. All the models were validated using the leave-one-out method. The results showed that (1) the mixed coniferous–broadleaf forest presented the highest RGRnv; (2) the RGRnv was negatively correlated with the four predictors, and the heteroscedasticity reduced significantly after the weighting function was integrated into the models; and (3) the quantile regression models performed better than NLME, and NQR9 outperformed both NQR3 and NQR5. To make more accurate predictions, parameters of the adjusted mixed-effects and quantile regression models should be recalculated and localized using sampled RGRnv in each region and then applied to predict the RGRnv of all the other plots. The mean absolute percentage error (MAPE%) was stable when the number of samples was greater than or equal to six across the three forest types, which gave relatively accurate predictions at the lowest cost.
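The mean absolute percentage error referred to above is conventionally the average of |observed - predicted| / |observed|, expressed as a percentage. The following is a minimal sketch of that conventional definition, not the authors' code; the function name `mape_percent` and the sample values are hypothetical.

```python
def mape_percent(observed, predicted):
    """Mean absolute percentage error, in percent (conventional definition).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    # Sum the absolute relative errors, then average and scale to percent.
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


# Made-up values, not data from the study.
example = mape_percent([2.0, 4.0], [1.0, 5.0])
print(example)
```

Because each error is divided by the observed value, the measure is undefined for zero observations, which is why the sketch rejects them explicitly.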

