Approximate Fisher Information Matrix to Characterize the Training of Deep Neural Networks

Author(s):  
Zhibin Liao ◽  
Tom Drummond ◽  
Ian Reid ◽  
Gustavo Carneiro


2021 ◽
pp. 1-34
Author(s):  
Ryo Karakida ◽  
Shotaro Akaho ◽  
Shun-ichi Amari

Abstract: The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs: one for regression with linear output and one for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size, while the others are much smaller. This implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of the FIM: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to a more quantitative understanding of learning in large-scale DNNs.
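The spectral claim above is easy to probe numerically. The sketch below (our illustration, not the authors' code) builds a random one-hidden-layer tanh network with scalar linear output, forms the per-sample gradient matrix J, and inspects the eigenvalues of the dual Gram matrix J J^T / N, which shares its nonzero spectrum with the empirical FIM J^T J / N. The width M, input dimension D, sample size N, and NTK-style weight scaling are illustrative assumptions.

```python
# A minimal numerical sketch (our illustration, not the authors' code) of the
# pathological FIM spectrum described above. For a random one-hidden-layer
# tanh network with scalar linear output, the empirical FIM is
#   F = (1/N) * J^T J,  where J[n, :] = grad_theta f(x_n),
# and its nonzero eigenvalues coincide with those of the N x N dual Gram
# matrix (1/N) * J J^T, which is far cheaper to diagonalize.
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 30, 1000, 100                         # input dim, width, sample size

X = rng.standard_normal((N, D))                 # random inputs
W1 = rng.standard_normal((M, D)) / np.sqrt(D)   # hidden weights
w2 = rng.standard_normal(M) / np.sqrt(M)        # output weights

H = np.tanh(X @ W1.T)                           # hidden activations, (N, M)
dH = 1.0 - H**2                                 # tanh' at the preactivations

# Per-sample gradient of f(x) = w2 . tanh(W1 x) w.r.t. all parameters:
#   df/dw2      = tanh(W1 x)                    (M entries)
#   df/dW1[i,:] = w2[i] * tanh'(.)_i * x        (M*D entries)
J_w2 = H                                        # (N, M)
J_W1 = (dH * w2)[:, :, None] * X[:, None, :]    # (N, M, D)
J = np.concatenate([J_w2, J_W1.reshape(N, -1)], axis=1)

F_dual = (J @ J.T) / N                          # same nonzero spectrum as F
eigs = np.sort(np.linalg.eigvalsh(F_dual))[::-1]

print("3 largest eigenvalues:", eigs[:3])
print("median eigenvalue    :", np.median(eigs))
```

With these settings, a few eigenvalues sit far above the median, consistent with the "sharp in a few directions, almost flat elsewhere" geometry the abstract describes.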


2012 ◽  
Vol 51 (1) ◽  
pp. 115-130
Author(s):  
Sergei Leonov ◽  
Alexander Aliev

Abstract: We provide some details of the implementation of the optimal design algorithm in the PkStaMp library, which is intended for constructing optimal sampling schemes for pharmacokinetic (PK) and pharmacodynamic (PD) studies. We discuss different types of approximation of the individual Fisher information matrix and describe a user-defined option of the library.
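As a concrete illustration of the object such design algorithms optimize, here is a minimal sketch (unrelated to PkStaMp's actual implementation) of an individual FIM for a one-compartment oral-absorption PK model with additive error, compared across two candidate sampling schemes via the D-optimality criterion det(FIM). The model, parameter values, error variance, and sampling schemes are all assumed for illustration.

```python
# A hedged illustration (not PkStaMp's code) of the individual Fisher
# information matrix used in optimal sampling design. For a response model
# C(t; theta) with additive error of variance sigma^2, the individual FIM
# over sampling times t_1..t_k is
#   M(theta, xi) = (1/sigma^2) * sum_j g(t_j) g(t_j)^T,  g(t) = dC/dtheta.
# Sensitivities are taken by central finite differences.
import numpy as np

def conc(t, theta, dose=100.0):
    """One-compartment model with first-order absorption, C(t; ka, ke, V)."""
    ka, ke, V = theta
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def sensitivities(t, theta, h=1e-6):
    """Central finite-difference gradient dC/dtheta, shape (len(t), 3)."""
    g = np.empty((t.size, theta.size))
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = h
        g[:, i] = (conc(t, theta + d) - conc(t, theta - d)) / (2 * h)
    return g

def individual_fim(times, theta, sigma=0.5):
    g = sensitivities(np.asarray(times, dtype=float), theta)
    return g.T @ g / sigma**2

theta = np.array([1.0, 0.2, 10.0])     # ka [1/h], ke [1/h], V [L]; illustrative

# Compare two candidate sampling schemes by D-optimality, det(FIM).
scheme_a = [0.5, 1.0, 2.0, 8.0]        # spread over absorption and elimination
scheme_b = [0.25, 0.5, 0.75, 1.0]      # clustered early
for name, times in [("A", scheme_a), ("B", scheme_b)]:
    M = individual_fim(times, theta)
    print(f"scheme {name}: det(FIM) = {np.linalg.det(M):.4g}")
```

A design algorithm of the kind the abstract describes would search over such schemes (and their weights) to maximize a functional of this matrix; the sketch only evaluates the criterion for two fixed candidates.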


2006 ◽  
Vol 18 (5) ◽  
pp. 1007-1065 ◽  
Author(s):  
Shun-ichi Amari ◽  
Hyeyoung Park ◽  
Tomoko Ozeki

Abstract: The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, ARMA time-series models, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problems caused by singularities, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates, that model selection criteria such as AIC, BIC, and MDL fail to hold in these models, that a smooth Bayesian prior becomes singular in such models, and that the trajectories of dynamics of learning are strongly affected by the singularity, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.
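The degeneration of the Fisher information matrix at singularities can be checked on a toy model. The sketch below (ours, not the article's) estimates the FIM of f(x) = v1*tanh(w1*x) + v2*tanh(w2*x) by Monte Carlo at a regular point and at a singular point where the two hidden units coincide; at the singular point the gradient components become linearly dependent and the FIM loses rank. All parameter values are illustrative.

```python
# A toy numerical check (ours, not the article's) of FIM degeneration at a
# singularity. For the gaussian regression model y = f(x) + noise with
#   f(x) = v1*tanh(w1*x) + v2*tanh(w2*x),
# the FIM is F = E[ grad_theta f  grad_theta f^T ] (unit noise variance).
# When w1 == w2, the two hidden units coincide, the gradient components
# become linearly dependent, and F drops rank.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)        # Monte Carlo sample of inputs

def fim(w1, w2, v1, v2):
    """Empirical FIM over theta = (w1, w2, v1, v2) for a scalar-input MLP."""
    t1, t2 = np.tanh(w1 * x), np.tanh(w2 * x)
    # grad_theta f = (v1*(1-t1^2)*x, v2*(1-t2^2)*x, t1, t2)
    G = np.stack([v1 * (1 - t1**2) * x,
                  v2 * (1 - t2**2) * x,
                  t1, t2], axis=1)       # (N, 4)
    return G.T @ G / x.size

for label, w2 in [("regular point (w1 != w2) ", 2.0),
                  ("singular point (w1 == w2)", 1.0)]:
    F = fim(w1=1.0, w2=w2, v1=0.5, v2=-0.3)
    print(label, "-> eigenvalues:", np.round(np.linalg.eigvalsh(F), 6))
```

At the singular point the last two columns of G are identical and the first two are proportional, so two eigenvalues collapse to (numerical) zero, which is exactly the degeneration that breaks the Cramér-Rao paradigm described above.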

