Is SGD a Bayesian sampler? Well, almost
Journal of Machine Learning Research 22 (2021) 79
Abstract:
Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S. Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S), and that P_B(f|S) is strongly biased towards low-error, low-complexity functions. These results imply that the strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed light on how variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice affect DNN performance.
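The quantity P_SGD(f|S) described above is, in essence, an empirical frequency: train the same architecture many times from random initialisations and count how often each function f (identified by its outputs on held-out inputs) is reached. A minimal toy sketch of that counting procedure, assuming a tiny linear-threshold unit and an invented boolean dataset (this is an illustration of the estimation idea, not the paper's code or models):

```python
# Toy sketch of estimating P_SGD(f|S): repeatedly train with SGD from
# random initialisations and count how often each function f, defined
# by its predictions on held-out inputs, is reached. The model, data,
# and hyperparameters below are illustrative assumptions only.
import random
from collections import Counter

def predict(w, x):
    # linear-threshold unit with a bias input fixed at 1
    return 1 if sum(wi * xi for wi, xi in zip(w, x + (1,))) > 0 else 0

def train_sgd(train, lr=0.5, epochs=50, rng=None):
    # perceptron-style SGD updates from a random initialisation
    w = [rng.uniform(-1, 1) for _ in range(4)]
    for _ in range(epochs):
        for x, y in train:
            err = y - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x + (1,))]
    return w

def func_on(w, inputs):
    # identify the learned function f by its outputs on probe inputs
    return tuple(predict(w, x) for x in inputs)

rng = random.Random(0)
train = [((0, 0, 0), 0), ((1, 1, 1), 1), ((1, 0, 1), 1)]  # training set S
held_out = [(0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 0)]   # probe inputs

runs = 200
counts = Counter()
for _ in range(runs):
    w = train_sgd(train, rng=rng)
    # only count functions consistent with the training set S
    if func_on(w, [x for x, _ in train]) == tuple(y for _, y in train):
        counts[func_on(w, held_out)] += 1

p_sgd = {f: c / runs for f, c in counts.items()}  # empirical P_SGD(f|S)
```

Estimating P_B(f|S) works analogously but samples the parameters at random (in the paper, via a Gaussian-process approximation) instead of running SGD, again counting only functions consistent with S.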
The structure of genotype-phenotype maps makes fitness landscapes navigable
(2021)
Measuring internal forces in single-stranded DNA: application to a DNA force clamp
Journal of Chemical Theory and Computation (American Chemical Society) 16:12 (2020) 7764-7775