Gaussian Mean Field Regularizes by Limiting Learned Information

Entropy ◽ 2019 ◽ Vol 21 (8) ◽ pp. 758
Author(s): Julius Kunze ◽ Louis Kirsch ◽ Hippolyt Ritter ◽ David Barber

Variational inference with a factorized Gaussian posterior estimate is a widely used approach for learning parameters and hidden variables. Empirically, this approach has a regularizing effect that is poorly understood. In this work, we show how mean field inference improves generalization by limiting mutual information between learned parameters and the data through noise. We quantify a maximum capacity when the posterior variance is either fixed or learned and connect it to generalization error, even when the KL-divergence in the objective is scaled by a constant. Our experiments suggest that bounding information between parameters and data effectively regularizes neural networks on both supervised and unsupervised tasks.
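As a concrete sketch of the objective this abstract analyzes, the snippet below shows a scaled-KL mean-field loss: a factorized Gaussian posterior sampled via the reparameterization trick, with the KL term multiplied by a constant beta. This is a minimal illustration assuming a standard normal prior; names like `beta` and `log_lik_fn` are placeholders, not the authors' code.

```python
import numpy as np

def gaussian_kl(mu, log_var, prior_var=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_var) ), summed over parameters."""
    var = np.exp(log_var)
    return 0.5 * np.sum(var / prior_var + mu**2 / prior_var
                        - 1.0 - log_var + np.log(prior_var))

def sample_weights(mu, log_var, rng):
    """Reparameterized sample w = mu + sigma * eps; the injected noise is
    what limits mutual information between parameters and data."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def neg_elbo(mu, log_var, log_lik_fn, beta, rng):
    """Negative ELBO with the KL scaled by a constant beta, matching the
    scaled-KL setting the abstract mentions."""
    w = sample_weights(mu, log_var, rng)
    return -log_lik_fn(w) + beta * gaussian_kl(mu, log_var)
```

Minimizing this over (mu, log_var) with any gradient method gives the usual mean-field training loop; beta = 1 recovers the standard ELBO.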

2021 ◽ Vol 12 (1)
Author(s): Abdulkadir Canatar ◽ Blake Bordelon ◽ Cengiz Pehlevan

A theoretical understanding of generalization remains an open problem for many machine learning models, including deep networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. Here, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also describes certain infinitely overparameterized neural networks. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel and data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with simple functions, characterize whether a kernel is compatible with a learning task, and show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks.
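The abstract's analytical theory is beyond a snippet, but the setting it describes is easy to probe empirically: sweep the training set size for kernel ridge regression on noisy data and record test error. The sketch below is an assumed setup (RBF kernel, sine target, illustrative noise level and ridge), not the authors' derivation, and a simple sweep like this need not reproduce the multi-peaked curves the theory predicts.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale**2))

def krr_test_error(X_tr, y_tr, X_te, y_te, ridge=1e-6):
    """Kernel ridge regression: f(x) = k(x, X) (K + ridge*I)^-1 y."""
    K = rbf_kernel(X_tr, X_tr)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_tr)), y_tr)
    pred = rbf_kernel(X_te, X_tr) @ alpha
    return np.mean((pred - y_te)**2)

rng = np.random.default_rng(0)
target = lambda X: np.sin(3 * X[:, 0])
X_te = rng.uniform(-1, 1, (500, 1))
y_te = target(X_te)
for n in [10, 30, 100, 300]:          # learning-curve sweep
    X_tr = rng.uniform(-1, 1, (n, 1))
    y_tr = target(X_tr) + 0.3 * rng.standard_normal(n)  # noisy labels
    print(n, krr_test_error(X_tr, y_tr, X_te, y_te))
```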


Author(s): David Barber

Finding clusters of well-connected nodes in a graph is a problem common to many domains, including social networks, the Internet and bioinformatics. From a computational viewpoint, finding these clusters or graph communities is a difficult problem. We use a clique matrix decomposition based on a statistical description that encourages clusters to be well connected and few in number. The formal intractability of inferring the clusters is addressed using a variational approximation inspired by mean-field theories in statistical mechanics. Clique matrices also play a natural role in parametrizing positive definite matrices under zero constraints on elements of the matrix. We show that clique matrices can parametrize all positive definite matrices restricted according to a decomposable graph and form a structured factor analysis approximation in the non-decomposable case. Extensions to conjugate Bayesian covariance priors and more general non-Gaussian independence models are briefly discussed.
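The two roles of a clique matrix described above can be illustrated on a toy graph: its off-diagonal support reproduces the adjacency structure, and summing PSD blocks confined to each clique yields a positive definite matrix whose zero pattern respects the graph. This is a hedged sketch of the idea on an assumed 4-node example, not the paper's variational decomposition algorithm.

```python
import numpy as np

# Toy graph on 4 nodes with edges (0,1), (0,2), (1,2), (2,3).
# Cliques: {0,1,2} and {2,3}. Clique matrix Z is nodes x cliques.
Z = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])

# Off-diagonal support of Z Z^T recovers the adjacency structure.
A = (Z @ Z.T > 0).astype(int)
np.fill_diagonal(A, 0)
print(A)

# A positive definite matrix with zeros exactly where the graph has no
# edge: sum random PSD blocks over cliques, plus diagonal jitter.
rng = np.random.default_rng(0)
S = np.zeros((4, 4))
for c in range(Z.shape[1]):
    members = np.flatnonzero(Z[:, c])
    v = rng.standard_normal((len(members), len(members)))
    S[np.ix_(members, members)] += v @ v.T  # PSD block on one clique
S += 1e-3 * np.eye(4)                       # strict positive definiteness

print(np.linalg.eigvalsh(S).min() > 0)      # True: positive definite
print((np.abs(S) > 1e-12).astype(int))      # zero pattern follows the graph
```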


2013 ◽ Vol 25 (7) ◽ pp. 1768-1806
Author(s): N. Alex Cayco-Gajic ◽ Eric Shea-Brown

Recent experimental and computational evidence suggests that several dynamical properties may characterize the operating point of functioning neural networks: critical branching, neutral stability, and production of a wide range of firing patterns. We seek the simplest setting in which these properties emerge, clarifying their origin and relationship in random, feedforward networks of McCulloch-Pitts neurons. Two key parameters are the thresholds at which neurons fire spikes and the overall level of feedforward connectivity. When neurons have low thresholds, we show that there is always a connectivity for which the properties in question all occur, that is, these networks preserve overall firing rates from layer to layer and produce broad distributions of activity in each layer. This fails to occur, however, when neurons have high thresholds. A key tool in explaining this difference is the eigenstructure of the resulting mean-field Markov chain, as this reveals which activity modes will be preserved from layer to layer. We extend our analysis from purely excitatory networks to more complex models that include inhibition and local noise, and find that both of these features extend the parameter ranges over which networks produce the properties of interest.
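The mean-field Markov chain invoked here can be written down directly: track only the number k of active neurons in a layer, let each next-layer neuron receive each active spike with connection probability c, and fire when at least theta inputs arrive. The sketch below builds that transition matrix and inspects its eigenvalues; parameter values (N, c, theta) are illustrative assumptions, not the paper's.

```python
import numpy as np
from scipy.stats import binom

N, c, theta = 50, 0.1, 2   # layer size, connection prob, firing threshold

# A next-layer neuron fires if >= theta of the k active neurons connect
# to it; connections are independent with probability c.
p_fire = np.array([binom.sf(theta - 1, k, c) for k in range(N + 1)])

# Transition matrix T[k, k']: probability of k' next-layer spikes given
# k current spikes (all N neurons fire independently with prob p_fire[k]).
T = np.array([binom.pmf(np.arange(N + 1), N, p) for p in p_fire])

# The eigenstructure reveals which activity modes survive layer to layer.
eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
print(eigvals[:3])  # leading eigenvalue 1 (absorbing silent state k = 0),
                    # then the decay rates of the remaining modes
```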


2013 ◽ Vol 2013 ◽ pp. 1-10
Author(s): Benjamin W. Y. Lo ◽ R. Loch Macdonald ◽ Andrew Baker ◽ Mitchell A. H. Levine

Objective. A novel clinical prediction approach combining Bayesian neural networks with fuzzy logic inferences is developed and applied to derive prognostic decision rules in aneurysmal subarachnoid hemorrhage (aSAH). Methods. The approach was applied to data from five trials of Tirilazad for aneurysmal subarachnoid hemorrhage (3551 patients). Results. Bayesian meta-analyses of observational studies on aSAH prognostic factors gave generalizable posterior distributions of population mean log odds ratios (ORs). Similar trends were noted in Bayesian and linear regression ORs. Significant outcome predictors included normal motor response, cerebral infarction, history of myocardial infarction, cerebral edema, history of diabetes mellitus, fever on day 8, prior subarachnoid hemorrhage, admission angiographic vasospasm, neurological grade, intraventricular hemorrhage, ruptured aneurysm size, history of hypertension, vasospasm day, age, and mean arterial pressure. Heteroscedasticity was present in the nontransformed dataset. Artificial neural networks found nonlinear relationships using a multilayer perceptron model with 11 hidden variables in one layer. Fuzzy logic decision rules (centroid defuzzification technique) denoted cut-off points for poor prognosis at greater than 2.5 clusters. Discussion. This aSAH prognostic system makes use of existing knowledge, recognizes unknown areas, incorporates the clinician's reasoning, and compensates for uncertainty in prognostication.
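The centroid (center-of-gravity) defuzzification step mentioned in the Results is a standard fuzzy-logic operation, sketched below on a hypothetical output variable; the membership function, its range, and the comparison against the 2.5 cut-off are illustrative assumptions, not the study's actual rules or data.

```python
import numpy as np

def centroid_defuzzify(x, mu):
    """Center-of-gravity defuzzification on a uniform grid:
    crisp value = sum(x * mu(x)) / sum(mu(x))."""
    return float((x * mu).sum() / mu.sum())

# Hypothetical 'poor-prognosis cluster' output scored on [0, 5] with a
# triangular membership function peaking at 3 (values illustrative only).
x = np.linspace(0.0, 5.0, 501)
mu = np.clip(1.0 - np.abs(x - 3.0) / 1.5, 0.0, 1.0)
score = centroid_defuzzify(x, mu)
print(score, score > 2.5)  # compared against the 2.5 cut-off noted above
```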


1997 ◽ Vol 9 (1) ◽ pp. 1-42
Author(s): Sepp Hochreiter ◽ Jürgen Schmidhuber

We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a “flat” minimum of the error function: a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require Gaussian assumptions and does not depend on a “good” weight prior; instead, we place a prior over input-output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity, and it automatically and effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backpropagation, weight decay, and “optimal brain surgeon”/“optimal brain damage.”
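The notion of flatness at the core of this abstract can be probed crudely without the paper's machinery: sample random weight perturbations inside a ball and average the resulting loss increase, which is small at flat minima and large at sharp ones. This is a hedged sketch of the concept, not the authors' MDL-based flat minimum search; all names and parameter values are illustrative.

```python
import numpy as np

def flatness_score(loss_fn, w, radius=0.05, n_probes=20, rng=None):
    """Average loss increase under random perturbations within a ball of
    the given radius; flat minima (large regions of near-constant error)
    score low, sharp minima score high."""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(w)
    deltas = []
    for _ in range(n_probes):
        d = rng.standard_normal(w.shape)
        d *= radius / np.linalg.norm(d)   # project onto the sphere
        deltas.append(loss_fn(w + d) - base)
    return float(np.mean(deltas))

# Two toy 1-D landscapes, both minimized at w = 0, differing in curvature.
sharp = lambda w: 50.0 * float(w @ w)
flat = lambda w: 0.1 * float(w @ w)
w0 = np.zeros(10)
print(flatness_score(sharp, w0), flatness_score(flat, w0))
```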

