Approximate Fisher Information Matrix to Characterize the Training of Deep Neural Networks

Author(s):  
Zhibin Liao ◽  
Tom Drummond ◽  
Ian Reid ◽  
Gustavo Carneiro


2021 ◽
pp. 1-34
Author(s):  
Ryo Karakida ◽  
Shotaro Akaho ◽  
Shun-ichi Amari

Abstract: The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs: one for regression with linear output and one for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size, while the others are much smaller. This implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of the FIM: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to a more quantitative understanding of learning in large-scale DNNs.
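The spectral claim above is easy to probe numerically. The sketch below (our illustration, not the authors' code) builds a random one-hidden-layer tanh network with scalar linear output, forms the per-sample gradient matrix J, and inspects the eigenvalues of the dual Gram matrix J J^T / N, which shares its nonzero spectrum with the empirical FIM J^T J / N. The width M, input dimension D, sample size N, and NTK-style weight scaling are illustrative assumptions.

```python
# A minimal numerical sketch (our illustration, not the authors' code) of the
# pathological FIM spectrum described above. For a random one-hidden-layer
# tanh network with scalar linear output, the empirical FIM is
#   F = (1/N) * J^T J,  where J[n, :] = grad_theta f(x_n),
# and its nonzero eigenvalues coincide with those of the N x N dual Gram
# matrix (1/N) * J J^T, which is far cheaper to diagonalize.
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 30, 1000, 100                         # input dim, width, sample size

X = rng.standard_normal((N, D))                 # random inputs
W1 = rng.standard_normal((M, D)) / np.sqrt(D)   # hidden weights
w2 = rng.standard_normal(M) / np.sqrt(M)        # output weights

H = np.tanh(X @ W1.T)                           # hidden activations, (N, M)
dH = 1.0 - H**2                                 # tanh' at the preactivations

# Per-sample gradient of f(x) = w2 . tanh(W1 x) w.r.t. all parameters:
#   df/dw2      = tanh(W1 x)                    (M entries)
#   df/dW1[i,:] = w2[i] * tanh'(.)_i * x        (M*D entries)
J_w2 = H                                        # (N, M)
J_W1 = (dH * w2)[:, :, None] * X[:, None, :]    # (N, M, D)
J = np.concatenate([J_w2, J_W1.reshape(N, -1)], axis=1)

F_dual = (J @ J.T) / N                          # same nonzero spectrum as F
eigs = np.sort(np.linalg.eigvalsh(F_dual))[::-1]

print("3 largest eigenvalues:", eigs[:3])
print("median eigenvalue    :", np.median(eigs))
```

With these settings, a few eigenvalues sit far above the median, consistent with the "sharp in a few directions, almost flat elsewhere" geometry the abstract describes.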


2012 ◽  
Vol 51 (1) ◽  
pp. 115-130
Author(s):  
Sergei Leonov ◽  
Alexander Aliev

Abstract: We provide some details of the implementation of the optimal design algorithm in the PkStaMp library, which is intended for constructing optimal sampling schemes for pharmacokinetic (PK) and pharmacodynamic (PD) studies. We discuss different types of approximation of the individual Fisher information matrix and describe a user-defined option of the library.
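As a concrete illustration of the object such design algorithms optimize, here is a minimal sketch (unrelated to PkStaMp's actual implementation) of an individual FIM for a one-compartment oral-absorption PK model with additive error, compared across two candidate sampling schemes via the D-optimality criterion det(FIM). The model, parameter values, error variance, and sampling schemes are all assumed for illustration.

```python
# A hedged illustration (not PkStaMp's code) of the individual Fisher
# information matrix used in optimal sampling design. For a response model
# C(t; theta) with additive error of variance sigma^2, the individual FIM
# over sampling times t_1..t_k is
#   M(theta, xi) = (1/sigma^2) * sum_j g(t_j) g(t_j)^T,  g(t) = dC/dtheta.
# Sensitivities are taken by central finite differences.
import numpy as np

def conc(t, theta, dose=100.0):
    """One-compartment model with first-order absorption, C(t; ka, ke, V)."""
    ka, ke, V = theta
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def sensitivities(t, theta, h=1e-6):
    """Central finite-difference gradient dC/dtheta, shape (len(t), 3)."""
    g = np.empty((t.size, theta.size))
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = h
        g[:, i] = (conc(t, theta + d) - conc(t, theta - d)) / (2 * h)
    return g

def individual_fim(times, theta, sigma=0.5):
    g = sensitivities(np.asarray(times, dtype=float), theta)
    return g.T @ g / sigma**2

theta = np.array([1.0, 0.2, 10.0])     # ka [1/h], ke [1/h], V [L]; illustrative

# Compare two candidate sampling schemes by D-optimality, det(FIM).
scheme_a = [0.5, 1.0, 2.0, 8.0]        # spread over absorption and elimination
scheme_b = [0.25, 0.5, 0.75, 1.0]      # clustered early
for name, times in [("A", scheme_a), ("B", scheme_b)]:
    M = individual_fim(times, theta)
    print(f"scheme {name}: det(FIM) = {np.linalg.det(M):.4g}")
```

A design algorithm of the kind the abstract describes would search over such schemes (and their weights) to maximize a functional of this matrix; the sketch only evaluates the criterion for two fixed candidates.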


2006 ◽  
Vol 18 (5) ◽  
pp. 1007-1065 ◽  
Author(s):  
Shun-ichi Amari ◽  
Hyeyoung Park ◽  
Tomoko Ozeki

Abstract: The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, ARMA time-series models, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problems caused by singularities, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates, that model selection criteria such as AIC, BIC, and MDL fail to hold in these models, that a smooth Bayesian prior becomes singular in such models, and that the trajectories of dynamics of learning are strongly affected by the singularity, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.
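The degeneration of the Fisher information matrix at singularities can be checked on a toy model. The sketch below (ours, not the article's) estimates the FIM of f(x) = v1*tanh(w1*x) + v2*tanh(w2*x) by Monte Carlo at a regular point and at a singular point where the two hidden units coincide; at the singular point the gradient components become linearly dependent and the FIM loses rank. All parameter values are illustrative.

```python
# A toy numerical check (ours, not the article's) of FIM degeneration at a
# singularity. For the gaussian regression model y = f(x) + noise with
#   f(x) = v1*tanh(w1*x) + v2*tanh(w2*x),
# the FIM is F = E[ grad_theta f  grad_theta f^T ] (unit noise variance).
# When w1 == w2, the two hidden units coincide, the gradient components
# become linearly dependent, and F drops rank.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)        # Monte Carlo sample of inputs

def fim(w1, w2, v1, v2):
    """Empirical FIM over theta = (w1, w2, v1, v2) for a scalar-input MLP."""
    t1, t2 = np.tanh(w1 * x), np.tanh(w2 * x)
    # grad_theta f = (v1*(1-t1^2)*x, v2*(1-t2^2)*x, t1, t2)
    G = np.stack([v1 * (1 - t1**2) * x,
                  v2 * (1 - t2**2) * x,
                  t1, t2], axis=1)       # (N, 4)
    return G.T @ G / x.size

for label, w2 in [("regular point (w1 != w2) ", 2.0),
                  ("singular point (w1 == w2)", 1.0)]:
    F = fim(w1=1.0, w2=w2, v1=0.5, v2=-0.3)
    print(label, "-> eigenvalues:", np.round(np.linalg.eigvalsh(F), 6))
```

At the singular point the last two columns of G are identical and the first two are proportional, so two eigenvalues collapse to (numerical) zero, which is exactly the degeneration that breaks the Cramér-Rao paradigm described above.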

