How to Distinguish Languages and Dialects

Computational Linguistics ◽

10.1162/coli_a_00366 ◽

2020 ◽

Vol 45 (4) ◽

pp. 823-831 ◽

Cited By ~ 2

Author(s):

Søren Wichmann

Keyword(s):

Temporal Distance ◽

Strong Tendency ◽

Data Sets ◽

Language Group ◽

Lexical Information ◽

Normal Distributions ◽

Mixture Of Normal Distributions ◽

The Mean ◽

One To One

The terms “language” and “dialect” are ingrained, but linguists nevertheless tend to agree that it is impossible to apply a non-arbitrary distinction such that two speech varieties can be identified as either distinct languages or two dialects of one and the same language. A database of lexical information for more than 7,500 speech varieties, however, unveils a strong tendency for linguistic distances to be bimodally distributed. For a given language group, the linguistic distances pertaining to either cluster can be teased apart by identifying a mixture of normal distributions within the data and then separating them by fitting curves and finding the point where they cross. The thresholds identified are remarkably consistent across data sets, qualifying their mean as a universal criterion for distinguishing between language and dialect pairs. The mean of the thresholds identified translates into a temporal distance of around one to one-and-a-half millennia (1,075–1,635 years).
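
The "point where they cross" idea in the abstract above can be illustrated with a minimal sketch. It assumes the two mixture components have already been fitted, and it solves for the points where the two weighted normal densities are equal by taking logarithms and solving the resulting quadratic. The function names and parameter values are hypothetical illustrations, not taken from the paper, and this is not necessarily the procedure the author used.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def crossing_points(w1, m1, s1, w2, m2, s2):
    """Return the sorted x values where w1*N(m1, s1) equals w2*N(m2, s2).

    Equating the log densities gives a*x^2 + b*x + c = 0.
    All arguments are assumed fitted values (weights, means, standard deviations).
    """
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2 * s2 ** 2) - m1 ** 2 / (2 * s1 ** 2)
         + math.log(w1 / s1) - math.log(w2 / s2))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear and has at most one root.
        if abs(b) < 1e-12:
            return []
        return [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


def threshold_between_means(w1, m1, s1, w2, m2, s2):
    """Keep only the crossing points that lie between the two component means."""
    lo, hi = sorted((m1, m2))
    return [x for x in crossing_points(w1, m1, s1, w2, m2, s2) if lo <= x <= hi]
```

With equal weights and equal standard deviations, the single crossing falls at the midpoint of the two means. With unequal variances the quadratic generally has two roots, and the one lying between the means is the natural candidate for a threshold.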

Hedge Funds: The Good, the Bad, and the Lucky

Journal of Financial and Quantitative Analysis ◽

10.1017/s0022109017000217 ◽

2017 ◽

Vol 52 (3) ◽

pp. 1081-1109 ◽

Cited By ~ 17

Author(s):

Yong Chen ◽

Michael Cliff ◽

Haibei Zhao

Keyword(s):

Hedge Funds ◽

Expectation Maximization ◽

Investment Strategy ◽

Performance Measure ◽

Cross Sectional ◽

Normal Distributions ◽

Out Of Sample ◽

Mixture Of Normal Distributions ◽

Skill Groups ◽

The Mean

We develop an estimation approach based on a modified expectation-maximization (EM) algorithm and a mixture of normal distributions associated with skill groups to assess performance in hedge funds. By allowing luck to affect both skilled and unskilled funds, we estimate the number of skill groups, the fraction of funds from each group, and the mean and variability of skill within each group. For each individual fund, we propose a performance measure combining the fund’s estimated alpha with the cross-sectional distribution of fund skill. In out-of-sample tests, an investment strategy using our performance measure outperforms those using estimated alpha and t-statistic.
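
For orientation, the standard (unmodified) EM algorithm for a two-component mixture of normal distributions can be sketched as follows. This is a textbook version under simplifying assumptions (one dimension, exactly two components, a fixed number of iterations, quartile-based starting values); it is not the modified algorithm of the paper, which also estimates the number of groups. All names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def em_two_normals(data, iterations=200):
    """Fit a two-component normal mixture with plain EM.

    Returns (weights, means, sds), each a list of length two.
    """
    xs = sorted(data)
    n = len(xs)
    # Assumed starting values: quartiles as means, overall spread as both sds.
    means = [xs[n // 4], xs[(3 * n) // 4]]
    overall = sum(xs) / n
    sd0 = math.sqrt(sum((x - overall) ** 2 for x in xs) / n) or 1.0
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and sd from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has collapsed; leave its parameters unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # floor guards against zero variance
    return weights, means, sds
```

On data with two well-separated clusters, the returned means settle near the cluster centres and the weights near the cluster proportions. Plain EM can stall in poor local optima on harder data, so results should be checked against several starting values.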

Evaluating hypotheses of instar-grouping in arthropods: a maximum likelihood approach

Paleobiology ◽

10.1666/0094-8373(2001)027<0466:ehoigi>2.0.co;2 ◽

2001 ◽

Vol 27 (3) ◽

pp. 466-484 ◽

Cited By ~ 35

Author(s):

Gene Hunt ◽

Ralph E. Chapman

Keyword(s):

Maximum Likelihood ◽

Data Sets ◽

Size Distributions ◽

Normal Distributions ◽

Maximum Likelihood Approach ◽

Statistical Framework ◽

Growth Increments ◽

Likelihood Analysis ◽

Mixture Of Normal Distributions ◽

Likelihood Approach

The ontogeny of arthropod exoskeletons is punctuated by short periods of growth following each molt, separated by longer stages of unchanging morphology called instars. The recognition of instar clusters in size distributions has been important in understanding the growth and evolution of fossil arthropods. Generally, these clusters have been identified by inspection, but this approach has been criticized for its subjectivity. In this paper, we describe a statistical framework for evaluating hypotheses of clustering based on maximum likelihood analysis of mixture models. The approach assumes that individuals are normally distributed within instars; thus an arthropod size distribution can be considered a mixture of normal distributions. This methodology provides an objective framework to compare various plausible hypotheses of grouping, including the possibility that there is no significant grouping at all. We apply this method to evaluate clustering in two trilobite species, Ampyxina bellatula and Piochaspis sellata. Both of these data sets show statistically significant evidence of clustering, a phenomenon rarely documented for holaspid-stage trilobites. After consideration of alternative causes of clustering, we argue that the observed groupings are best explained as instar groups. In these two species, growth increments between molts were similar throughout the observed portion of ontogeny, although subtle yet significant variation can be seen within the ontogeny of Ampyxina bellatula.

ESTIMATION OF THE FROCINI CRITERIA AND OMEGA SQUARE CRITERIA STATISTICS BY THE STATISTICAL TESTS METHOD FOR A MIXTURE OF NORMAL DISTRIBUTIONS

Siberian Journal of Science and Technology ◽

10.31772/2587-6066-2019-20-1-28-34 ◽

2019 ◽

Vol 20 (1) ◽

pp. 28-34

Author(s):

S. V. Ushanov ◽

D. A. Ogurtsov

Keyword(s):

Statistical Tests ◽

Normal Distributions ◽

Mixture Of Normal Distributions

Normal approximation for mixtures of normal distributions and the evolution of phenotypic traits

Advances in Applied Probability ◽

10.1017/apr.2020.53 ◽

2021 ◽

Vol 53 (1) ◽

pp. 162-188

Author(s):

Krzysztof Bartoszek ◽

Torkel Erhardsson

Keyword(s):

Normal Distribution ◽

Phenotypic Trait ◽

Phenotypic Traits ◽

Normal Distributions ◽

Average Value ◽

Conditional Moments ◽

Mixture Of Normal Distributions ◽

Explicit Bounds ◽

Mixtures Of Normal Distributions ◽

Parameter Values

Explicit bounds are given for the Kolmogorov and Wasserstein distances between a mixture of normal distributions, by which we mean that the conditional distribution given some $\sigma$-algebra is normal, and a normal distribution with properly chosen parameter values. The bounds depend only on the first two moments of the first two conditional moments given the $\sigma$-algebra. The proof is based on Stein’s method. As an application, we consider the Yule–Ornstein–Uhlenbeck model, used in the field of phylogenetic comparative methods. We obtain bounds for both distances between the distribution of the average value of a phenotypic trait over n related species, and a normal distribution. The bounds imply and extend earlier limit theorems by Bartoszek and Sagitov.

The Age of Nonsynonymous and Synonymous Mutations in Animal mtDNA and Implications for the Mildly Deleterious Theory

Genetics ◽

10.1093/genetics/153.1.497 ◽

1999 ◽

Vol 153 (1) ◽

pp. 497-506 ◽

Cited By ~ 4

Author(s):

Rasmus Nielsen ◽

Daniel M Weinreich

Keyword(s):

Dna Sequences ◽

Purifying Selection ◽

Data Sets ◽

Deleterious Mutations ◽

Synonymous Mutations ◽

Weak Evidence ◽

Mitochondrial Data ◽

The Mean ◽

Excess Number ◽

Neutral Mutations

McDonald/Kreitman tests performed on animal mtDNA consistently reveal significant deviations from strict neutrality in the direction of an excess number of polymorphic nonsynonymous sites, which is consistent with purifying selection acting on nonsynonymous sites. We show that under models of recurrent neutral and deleterious mutations, the mean age of segregating neutral mutations is greater than the mean age of segregating selected mutations, even in the absence of recombination. We develop a test of the hypothesis that the mean age of segregating synonymous mutations equals the mean age of segregating nonsynonymous mutations in a sample of DNA sequences. The power of this age-of-mutation test and the power of the McDonald/Kreitman test are explored by computer simulations. We apply the new test to 25 previously published mitochondrial data sets and find weak evidence for selection against nonsynonymous mutations.

POSTPARTUM AMENORRHOEA IN RURAL EASTERN UTTAR PRADESH, INDIA

Journal of Biosocial Science ◽

10.1017/s0021932098002272 ◽

1998 ◽

Vol 30 (2) ◽

pp. 227-243

Author(s):

K. N. S. YADAVA ◽

S. K. JAIN

Keyword(s):

Higher Education ◽

Uttar Pradesh ◽

Current Status Data ◽

North India ◽

Current Status ◽

Data Sets ◽

Survival Status ◽

The Mean ◽

The Difference ◽

Using Data

This paper calculates the mean duration of postpartum amenorrhoea (PPA) and examines its demographic and socioeconomic correlates in rural north India, using data collected through 'retrospective' (last but one child) as well as 'current status' (last child) reporting of the duration of PPA. The mean duration of PPA was higher in the current status than in the retrospective data, the difference being statistically significant. However, for the same mothers who gave PPA information in both the data sets, the difference in mean duration of PPA was not statistically significant. The correlates were identical in both the data sets. The current status data were more complete in terms of the coverage, and perhaps less distorted by reporting errors caused by recall lapse. A positive relationship of the mean duration of PPA was found with longer breast-feeding, higher parity and age of mother at the birth of the child, and the survival status of the child. An inverse relationship was found with higher education of a woman, higher education of her husband and higher socioeconomic status of her household, these variables possibly acting as proxies for women's better nutritional status.

Consistency of long-term marketable yield of carrot and onion cultivars in muck (organic) soil in relation to seasonal weather

Canadian Journal of Plant Science ◽

10.4141/cjps09175 ◽

2010 ◽

Vol 90 (5) ◽

pp. 755-765 ◽

Cited By ~ 12

Author(s):

M. T. Tesfaendrias ◽

M. R. McDonald ◽

J. Warland

Keyword(s):

Allium Cepa ◽

Daucus Carota ◽

Organic Soil ◽

Data Sets ◽

Fresh Market ◽

Marketable Yield ◽

Daucus Carota L ◽

Weather Patterns ◽

The Mean

To identify carrot and onion cultivars that provide consistent marketable yields, we tracked the yields of five fresh market carrot [Daucus carota L. subsp. sativus (Hoffm.) Arcang.] and six onion (Allium cepa L.) cultivars for at least 13 yr. Relationships between long-term weather variables and marketable yields were also investigated. The effects of cultivar, year and cultivar × year interactions on yield of carrots and onions were assessed. Cultivar and year had significant effects on carrot and onion yields, while the interaction was significant in only one of four data sets of carrot yield. Carrot cv. Cellobunch (95.4 t ha⁻¹) and onion cv. Corona (74.1 t ha⁻¹) had the highest mean marketable yields over the years studied. There was a slight positive correlation between mean yield of the assessed carrots and maximum temperatures in September (r = 0.44). Mean carrot yield was also somewhat negatively correlated with total rainfall in July (r = –0.43) and with number of days with rain in August (r = –0.43) and September (r = –0.44). Most onion cultivars showed stronger relationships between marketable yield and various weather patterns. Marketable yield of onions increased with an increase in the number of days with rainfall in June (r = 0.57). The mean marketable yield of the six onion cultivars decreased in relation to temperatures ≥30°C in June (r = –0.55) and August (r = –0.53). The mean yield of all the onions in the trials was negatively correlated (r = –0.78) with growing degree days (base 5°C, May to August). The results indicated that the data from long-term cultivar trials can be used to identify cultivars that yield well despite seasonal variations in weather. Key words: Daucus carota, Allium cepa, temperature, rainfall

Using Mixture of Normal Distributions to Detect Treatment Effects when the Frequentist Method Fails

RAS Oncology & Therapy ◽

10.51520/2766-2586-9 ◽

2021 ◽

Vol 2 (1) ◽

Author(s):

Anthony Orlando

Keyword(s):

Dose Response ◽

Linear Models ◽

Controlled Trial ◽

Response To Treatment ◽

Double Blind ◽

Underlying Assumption ◽

Normal Distributions ◽

Efficacy Measure ◽

Baseline Weight ◽

Mixture Of Normal Distributions

Background: Results from a clinical trial can either support the efficacy and safety of a new compound or fail to provide such evidence. One reason for a ‘non-positive’ result is the underlying assumption of normality and homogeneity of variances, which is quite often violated when analyzing data from clinical trials, despite randomization. A question of interest is whether we can obtain more informative results when using a mixture of normal distributions or linear models (MLMs) in such cases. Introduction: MLMs can be used when traditional methods fail. MLMs “search” within the variability in data to identify components or subgroups of individuals (also known as latent classes) who have common intercepts and common slopes of change in a variable/endpoint of interest but whose intercepts and slopes are different from other subsets of patients. Thus, MLMs can be used to identify subgroups of patients exhibiting differential response to treatment within each treatment arm. The purpose of our study was to examine the usefulness of using MLM in such circumstances. Methods: Data from 155 subjects in a multicenter, randomized, double-blind, placebo-controlled trial that evaluated the efficacy of Cpn10, administered twice weekly subcutaneously to treat rheumatoid arthritis (RA), were used to evaluate the usefulness of MLM. The primary efficacy measure ACR20 was analyzed using a 3-step process: first, MLM was used to estimate RA duration using a 3-component model. The second step took the results of the first step to inform the logistic model and its analyses. The model was fitted with an intercept, MLM components, treatment arm, RA duration (linear and quadratic), dose response (modeled as an interaction effect), age and baseline weight. LOCF was used to impute missing data. Data were analyzed using MLM and SAS v 9.0. Results: The model was a good fit to the data with a likelihood ratio significant at p=0.026, and a significant increase in the -2log L. We also observed low p-values for those variables that were non-normal. Overall and for the 75 mg dose, Cpn10 was efficacious relative to placebo, p<0.050. We also observed that dose response was significant at p<0.15. Conclusion: The use of MLM adds value because it can be used to understand the disease experience or the value of treatment when traditional statistical methods cannot. Key words: Mixture of linear models, normality, entropy.

Velocity inversion by global optimization using finite-offset common-reflection-surface stacking applied to synthetic and Tacutu Basin seismic data

Geophysics ◽

10.1190/geo2017-0117.1 ◽

2019 ◽

Vol 84 (2) ◽

pp. R165-R174 ◽

Cited By ~ 1

Author(s):

Marcelo Jorge Luz Mesquita ◽

João Carlos Ribeiro Cruz ◽

German Garabito Callapino

Keyword(s):

Objective Function ◽

Real Data ◽

Velocity Model ◽

Data Sets ◽

Layer By Layer ◽

Velocity Inversion ◽

Very Fast Simulated Annealing ◽

Target Layer ◽

Common Reflection Surface ◽

The Mean

Estimation of an accurate velocity macromodel is an important step in seismic imaging. We have developed an approach based on coherence measurements and finite-offset (FO) beam stacking. The algorithm is an FO common-reflection-surface tomography, which aims to determine the best layered depth-velocity model by finding the model that maximizes a semblance objective function calculated from the amplitudes in common-midpoint (CMP) gathers stacked over a predetermined aperture. We develop the subsurface velocity model with a stack of layers separated by smooth interfaces. The algorithm is applied layer by layer from the top downward in four steps per layer. First, by automatic or manual picking, we estimate the reflection times of events that describe the interfaces in a time-migrated section. Second, we convert these times to depth using the velocity model via application of Dix’s formula and the image rays to the events. Third, by using ray tracing, we calculate kinematic parameters along the central ray and build a paraxial FO traveltime approximation for the FO common-reflection-surface method. Finally, starting from CMP gathers, we calculate the semblance of the selected events using this paraxial traveltime approximation. After repeating this algorithm for all selected CMP gathers, we use the mean semblance values as an objective function for the target layer. When this coherence measure is maximized, the model is accepted and the process is completed. Otherwise, the process restarts from step two with the updated velocity model. Because the inverse problem we are solving is nonlinear, we use very fast simulated annealing to search the velocity parameters in the target layers. We test the method on synthetic and real data sets to study its use and advantages.

The failure rate properties of a bimodal mixture of normal distributions in an unequal variance case

Statistics & Probability Letters ◽

10.1016/j.spl.2008.01.083 ◽

2008 ◽

Vol 78 (14) ◽

pp. 2006-2009 ◽

Cited By ~ 1

Author(s):

Fuxiang Liu ◽

Yanyan Liu

Keyword(s):

Failure Rate ◽

Unequal Variance ◽

Normal Distributions ◽

Mixture Of Normal Distributions
