Word2vec Skip-Gram Dimensionality Selection via Sequential Normalized Maximum Likelihood

Entropy, 2021, Vol. 23 (8), pp. 997
Author(s): Pham Thuc Hung, Kenji Yamanishi

In this paper, we propose a novel information-criterion-based approach for selecting the dimensionality of the word2vec Skip-gram (SG) model. From the perspective of probability theory, SG can be viewed as an implicit estimator of a probability distribution, under the assumption that a true contextual distribution among words exists. We therefore apply information criteria with the aim of selecting the dimensionality whose corresponding model is as close as possible to the true distribution. We examine the following information criteria for the dimensionality selection problem: Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length principle. The proposed approach is applied to both the original SG model and the SG Negative Sampling model to clarify the idea of using information criteria. Additionally, as the original SNML suffers from computational disadvantages, we introduce novel heuristics for its efficient computation. Moreover, we empirically demonstrate that SNML outperforms both BIC and AIC: in comparison with other evaluation methods for word embedding, the dimensionality selected by SNML is significantly closer to the optimal dimensionality obtained by word analogy or word similarity tasks.
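Computing SNML for the SG model itself requires the paper's heuristics, but the criterion is easy to state on a toy model. The sketch below is a hypothetical illustration, not the authors' implementation: it computes the SNML codelength of a binary sequence under a Bernoulli model, where each symbol's predictive probability is proportional to the maximized likelihood of the sequence extended by that symbol.

```python
import math

def ml_value(k, t):
    """Maximized Bernoulli likelihood of a length-t sequence containing k ones."""
    if k == 0 or k == t:
        return 1.0
    p = k / t
    return p ** k * (1 - p) ** (t - k)

def snml_codelength(bits):
    """Total SNML codelength (in nats) of a binary sequence.
    At step t, p(x_t) is proportional to the ML of the extended prefix."""
    total, ones = 0.0, 0
    for t, x in enumerate(bits, start=1):
        num_one = ml_value(ones + 1, t)   # likelihood if the next symbol is 1
        num_zero = ml_value(ones, t)      # likelihood if the next symbol is 0
        p = (num_one if x else num_zero) / (num_one + num_zero)
        total += -math.log(p)
        ones += x
    return total
```

As MDL demands, a regular sequence compresses better: an all-ones stream gets a shorter codelength than an alternating one of the same length.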

Economies, 2020, Vol. 8 (2), pp. 49
Author(s): Waqar Badshah, Mehmet Bulut

The Bounds test of cointegration uses only unstructured, single-path model selection techniques, i.e., information criteria. The aim of this paper was twofold: first, to evaluate the performance of five routinely used information criteria (Akaike Information Criterion (AIC), corrected Akaike Information Criterion (AICC), Schwarz/Bayesian Information Criterion (SIC/BIC), corrected Schwarz/Bayesian Information Criterion (SICC/BICC), and Hannan-Quinn Information Criterion (HQC)) and three structured approaches (Forward Selection, Backward Elimination, and Stepwise) by assessing their size and power properties at different sample sizes in Monte Carlo simulations; and second, to assess the same on real economic data. The second aim was achieved by evaluating the long-run relationship between three pairs of macroeconomic variables, i.e., Energy Consumption and GDP, Oil Price and GDP, and Broad Money and GDP, for the BRICS (Brazil, Russia, India, China and South Africa) countries using the Bounds cointegration test. It was found that the information criteria and the structured procedures have the same power for sample sizes of 50 or greater; however, BICC and Stepwise perform better at small sample sizes. In light of the simulation and real-data results, a modified Bounds test with the Stepwise model selection procedure may be used, as it is strongly supported theoretically and avoids noise in the model selection process.
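For reference, most of the criteria compared above have standard penalized-likelihood forms computable from a model's maximized log-likelihood, parameter count k, and sample size n. A minimal sketch (the corrected-BIC variants differ across authors, so only the four standard forms are shown):

```python
import math

def info_criteria(loglik, k, n):
    """Standard information criteria from the maximized log-likelihood,
    number of free parameters k, and sample size n."""
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)      # small-sample correction
    bic = -2 * loglik + k * math.log(n)
    hqc = -2 * loglik + 2 * k * math.log(math.log(n))
    return {"AIC": aic, "AICC": aicc, "BIC": bic, "HQC": hqc}
```

The model minimizing a criterion is selected. Note that BIC's log(n) penalty exceeds AIC's constant 2 once n > e^2 ≈ 7.4, which is one reason BIC tends to pick smaller models at the sample sizes studied here.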


2021, Vol. 20 (3), pp. 450-461
Author(s): Stanley L. Sclove

Abstract: The use of information criteria, especially AIC (Akaike’s information criterion) and BIC (Bayesian information criterion), for choosing an adequate number of principal components is illustrated.
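One concrete way to score a candidate number of components q with AIC/BIC is through the probabilistic-PCA log-likelihood evaluated at its maximum, which is a function of the sample covariance eigenvalues. The sketch below is an illustrative reading of this idea, not necessarily Sclove's exact formulation; the parameter count follows the usual probabilistic-PCA accounting.

```python
import numpy as np

def ppca_ic(eigvals, n, q):
    """AIC and BIC for a probabilistic-PCA model with q components,
    given the sample-covariance eigenvalues in descending order."""
    p = len(eigvals)
    sigma2 = eigvals[q:].mean()                      # pooled residual variance
    loglik = -0.5 * n * (np.log(eigvals[:q]).sum()
                         + (p - q) * np.log(sigma2)
                         + p * np.log(2 * np.pi) + p)
    k = p * q - q * (q - 1) / 2 + 1 + p              # loadings + noise var + mean
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

# Synthetic demo: two strong components plus isotropic unit noise.
rng = np.random.default_rng(0)
n, p = 500, 6
Z = rng.normal(size=(n, 2))
W = 3.0 * rng.normal(size=(p, 2))
X = Z @ W.T + rng.normal(size=(n, p))
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
scores = {q: ppca_ic(eigvals, n, q) for q in range(1, p)}
q_aic = min(scores, key=lambda q: scores[q][0])
q_bic = min(scores, key=lambda q: scores[q][1])
```

With a signal this strong both criteria recover at least the two real components, and BIC, penalizing harder, never retains more components than AIC.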


2019, Vol. 21 (2), pp. 553-565
Author(s): John J Dziak, Donna L Coffman, Stephanie T Lanza, Runze Li, Lars S Jermiin

Abstract: Information criteria (ICs) based on penalized likelihood, such as Akaike’s information criterion (AIC), the Bayesian information criterion (BIC) and sample-size-adjusted versions of them, are widely used for model selection in health and biological research. However, different criteria sometimes support different models, leading to discussions about which is the most trustworthy. Some researchers and fields of study habitually use one or the other, often without a clearly stated justification. They may not realize that the criteria may disagree. Others try to compare models using multiple criteria but encounter ambiguity when different criteria lead to substantively different answers, leading to questions about which criterion is best. In this paper we present an alternative perspective on these criteria that can help in interpreting their practical implications. Specifically, in some cases the comparison of two models using ICs can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels and BIC being a more conservative test than AIC. This perspective may lead to insights about how to interpret the ICs in more complex situations. For example, AIC or BIC could be preferable, depending on the relative importance one assigns to sensitivity versus specificity. Understanding the differences and similarities among the ICs can make it easier to compare their results and to use them to make informed decisions.
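The equivalence described here can be made concrete. For nested models differing by one parameter, preferring the larger model under an IC amounts to a likelihood ratio test whose critical value is the IC's penalty difference, so each criterion implies a Type-I error rate. A minimal sketch for the one-degree-of-freedom case, using the closed-form chi-square tail:

```python
import math

def implied_alpha_df1(critical_value):
    """P(chi-square_1 > critical_value): the alpha level implied by using an
    IC penalty difference as an LRT critical value, for one extra parameter."""
    return math.erfc(math.sqrt(critical_value / 2))

def ic_alphas(n):
    """Alpha levels implied by AIC (penalty 2) and BIC (penalty log n)
    when the candidate models differ by a single parameter."""
    return implied_alpha_df1(2.0), implied_alpha_df1(math.log(n))
```

At n = 100, AIC implicitly tests at roughly alpha = 0.157 while BIC tests at about 0.032, and BIC's implied alpha shrinks further as n grows, matching the "BIC is a more conservative test than AIC" reading above.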


2004, Vol. 16 (9), pp. 1763-1768
Author(s): Daniel J. Navarro

An applied problem is discussed in which two nested psychological models of retention are compared using minimum description length (MDL). The standard Fisher information approximation to the normalized maximum likelihood is calculated for these two models, with the result that the full model is assigned a smaller complexity, even for moderately large samples. A geometric interpretation for this behavior is considered, along with its practical implications.
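The behavior reported here can be reproduced numerically from the Fisher information approximation (FIA) itself: its complexity term is (k/2) ln(n / 2π) + ln ∫ √det I(θ) dθ, so when the full model's geometric-complexity integral is small enough, its larger dimensional term is outweighed until n becomes very large. The integral values below are hypothetical, chosen only to illustrate that crossover.

```python
import math

def fia_complexity(k, n, geom_integral):
    """FIA complexity term: (k/2) * ln(n / 2*pi) plus the log of the
    geometric-complexity integral (integral of sqrt(det I(theta)))."""
    return 0.5 * k * math.log(n / (2 * math.pi)) + math.log(geom_integral)

# Hypothetical geometric terms for a nested model (k=1) and a full model (k=2).
nested = lambda n: fia_complexity(1, n, 10.0)
full = lambda n: fia_complexity(2, n, 0.5)
```

With these (made-up) constants the full model carries the smaller complexity for every n below roughly 2,500, exactly the kind of moderately-large-sample inversion the abstract describes; only at much larger n does the dimensional term dominate.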



2019
Author(s): David Kellen, Karl Christoph Klauer

The modeling of multinomial data has seen tremendous progress since Riefer and Batchelder’s (1988) seminal paper. One recurring challenge, however, concerns the availability of relative performance measures that strike an ideal balance between goodness of fit and functional flexibility. One approach to the problem of model selection is Normalized Maximum Likelihood (NML), a solution derived from the Minimum Description Length principle. In the present work we provide an R implementation of a Gibbs sampler that can be used to compute NML for models of joint multinomial data. We discuss the application of NML in different examples, compare NML with Bayes Factors, and show how it constitutes an important addition to researchers’ toolboxes.
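For very small problems, the quantity the Gibbs sampler approximates can be computed exactly by brute force, which makes NML concrete. The sketch below (an illustration, not the authors' R implementation) enumerates all count vectors of a saturated multinomial model to obtain the normalizing sum:

```python
import math

def ml_prob(counts):
    """Maximized multinomial probability of a count vector (without the coefficient)."""
    n = sum(counts)
    p = 1.0
    for k in counts:
        if k:
            p *= (k / n) ** k
    return p

def compositions(n, m):
    """All ways to split n trials among m categories."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest

def multinomial_coef(counts):
    coef = math.factorial(sum(counts))
    for k in counts:
        coef //= math.factorial(k)
    return coef

def nml_normalizer(n, m):
    """Sum of maximized likelihoods over all possible count vectors."""
    return sum(multinomial_coef(c) * ml_prob(c) for c in compositions(n, m))

def nml_codelength(counts):
    """Stochastic complexity (nats) of observed counts under exact NML."""
    n, m = sum(counts), len(counts)
    return -math.log(multinomial_coef(counts) * ml_prob(counts)
                     / nml_normalizer(n, m))
```

Enumeration is exponential in the number of trials and categories, which is precisely why a sampler is needed for realistic joint multinomial models.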


Author(s): Yanxue Wang, Jiawei Xiang, Jiang Zhansi, Yang Lianfa, Zhengjia He

Vibration signals are usually affected by noise, which is in turn related to the measurement and data-processing procedures. This paper presents a new subband adaptive denoising method for detecting impulsive signatures, based on the minimum description length principle with an improved normalized maximum likelihood density model. The threshold of the proposed denoising method is determined automatically, without the need to estimate the noise variance. The effectiveness of the proposed method over the VisuShrink, BayesShrink and minimum description length denoising methods is demonstrated through simulations and practical applications.
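The paper's NML-based threshold is specific to its improved density model, but the generic MDL denoising recipe it refines can be sketched: keep the k largest-magnitude coefficients, with k chosen by a classical two-part codelength of the Saito type, (3/2) k log n + (n/2) log RSS. This is a hedged illustration of that general idea, not the proposed subband method.

```python
import math
import numpy as np

def mdl_denoise(coeffs):
    """Keep the k largest-magnitude coefficients, with k chosen by a two-part
    MDL cost (3/2)*k*log(n) + (n/2)*log(RSS). k is capped at n//2 because the
    codelength degenerates as RSS -> 0 when nearly everything is kept."""
    c = np.asarray(coeffs, dtype=float)
    n = c.size
    order = np.argsort(-np.abs(c))           # indices by decreasing magnitude
    sq = c[order] ** 2
    # rss[k]: residual sum of squares after keeping the k largest coefficients
    rss = np.concatenate(([sq.sum()], sq.sum() - np.cumsum(sq)))
    kmax = n // 2
    ks = np.arange(kmax + 1)
    cost = 1.5 * ks * math.log(n) + 0.5 * n * np.log(rss[: kmax + 1])
    best_k = int(np.argmin(cost))
    out = np.zeros_like(c)
    out[order[:best_k]] = c[order[:best_k]]
    return out

# Demo on hypothetical data: three strong impulses in low-level noise.
rng = np.random.default_rng(1)
coeffs = rng.normal(0.0, 0.1, 64)
coeffs[:3] = [5.0, -4.0, 3.0]
denoised = mdl_denoise(coeffs)
```

As in the abstract, the threshold falls out of the codelength minimization itself; no separate noise-variance estimate is supplied.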



