[1] Neal, Radford M. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, 2012.
[Figure 1] Shridhar, Kumar, Felix Laumann, and Marcus Liwicki. "A comprehensive guide to bayesian convolutional neural network with variational inference." arXiv preprint arXiv:1901.02731 (2019).
[Further "reading"] DeepBayes2018 Workshop - Max Welling:
Advanced methods of variational inference: https://youtu.be/mCBnid-1slI
[1] The huge number of parameters in an NN, as well as its functional form, does not lend itself to exact integration
[1] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.
[Image] Roffo, Giorgio. "Ranking to learn and learning to rank: On the role of ranking in pattern recognition applications." arXiv preprint arXiv:1706.05933 (2017).
"In fact, we shall see that we can we can get uncertainty information from existing deep learning models for free"
We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process.
[1] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
[2] Gal, Yarin. "Uncertainty in deep learning." University of Cambridge 1.3 (2016).
[3] Gal, Yarin, and Zoubin Ghahramani. "Bayesian convolutional neural networks with Bernoulli approximate variational inference." arXiv preprint arXiv:1506.02158 (2015).
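Below is a minimal sketch of the MC dropout recipe from [1-3] as it is commonly applied at test time (assuming PyTorch; the architecture, dropout rate and number of samples are illustrative choices, not taken from the papers): dropout is kept active during inference, and the mean and variance over several stochastic forward passes serve as prediction and epistemic uncertainty estimate.

```python
import torch
import torch.nn as nn

# Illustrative regression net with dropout (architecture is an assumption).
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout active at prediction time (MC dropout)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Mean over stochastic passes = prediction; variance = epistemic uncertainty estimate
    return preds.mean(dim=0), preds.var(dim=0)
```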
Ian Osband (2018):
Recent work has sought to understand dropout through a Bayesian lens, highlighting the connection to variational inference and arguing that the resultant dropout distribution approximates a Bayesian posterior. This narrative has proved popular despite the fact that [the] dropout distribution can be a poor approximation to most reasonable Bayesian posteriors.
[1] Osband, Ian, John Aslanides, and Albin Cassirer. "Randomized prior functions for deep reinforcement learning." Advances in Neural Information Processing Systems. 2018.
Consequence:
No agent employing dropout for posterior approximation can tell the difference between observing a set of data once and observing it N times. This can lead to arbitrarily poor decision making [...]
[*] This could explain why dropout failed at out-of-distribution detection using epistemic uncertainty, as evaluated in:
A. Sedlmeier, et al. "Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning," in 12th International Conference on Agents and Artificial Intelligence (ICAART 2020), 2020.
[**] Further discussion: What is the current state of dropout as Bayesian approximation?
https://web.archive.org/web/20190327225938if_/https://www.reddit.com/r/MachineLearning/comments/7bm4b2/d_what_is_the_current_state_of_dropout_as/
[1] Gal, Yarin, Jiri Hron, and Alex Kendall. "Concrete dropout." Advances in neural information processing systems. 2017.
"The relative accuracy of variational inference and MCMC is still unknown. We do know that variational inference generally underestimates the variance of the posterior density; this is a consequence of its objective function"
[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." Journal of the American statistical Association 112.518 (2017): 859-877.
Possible solution to Problem 2: Use alpha-divergences as alternative to VI's KL objective
This avoids VI's uncertainty underestimation
Hernandez-Lobato: Black-box alpha divergence
Yingzhen and Gal: Dropout inference in Bayesian neural networks with alpha-divergences
[2] Hernandez-Lobato, Jose, et al. "Black-box alpha divergence minimization." International Conference on Machine Learning. PMLR, 2016.
[3] Li, Yingzhen, and Yarin Gal. "Dropout inference in Bayesian neural networks with alpha-divergences." arXiv preprint arXiv:1703.02914 (2017).
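For orientation, one common parameterization of the alpha-divergence family is Rényi's (the exact parameterization used in [2, 3] differs in details):

$$
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx
$$

As $\alpha \to 1$ this recovers the KL divergence $\mathrm{KL}(p \,\|\, q)$; other values of $\alpha$ trade off mode-seeking versus mass-covering behaviour, which the cited works exploit to avoid VI's variance underestimation.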
What are Gaussian Processes (GPs)?
Recent scalability developments:
Leveraging uncertainty information from deep neural networks for disease detection: https://www.nature.com/articles/s41598-017-17876-z
Damianou, Andreas, and Neil Lawrence. "Deep gaussian processes." Artificial Intelligence and Statistics. 2013.
- Rasmussen, C. E. & Williams, C. K. I. Gaussian processes for machine learning, vol. 1 (MIT press Cambridge, 2006).
- Gaussian Processes are Not So Fancy: https://planspace.org/20181226-gaussian_processes_are_not_so_fancy/
- Gaussian Process, not quite for dummies: https://yugeten.github.io/posts/2019/09/GP/
- A Visual Exploration of Gaussian Processes: https://distill.pub/2019/visual-exploration-gaussian-processes/
Simple and scalable predictive uncertainty estimation using deep ensembles
- Lakshminarayanan, Pritzel, Blundell (2017)
One of the first works to apply ensemble ideas to deep NNs in order to investigate predictive uncertainty performance:
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017): 6402-6413.
[1] The authors enforce the positivity constraint on the variance by passing the second output through the softplus function log(1 + exp(·)), and add a minimum variance for numerical stability
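A sketch of such a two-output (mean-variance) head together with the Gaussian NLL loss used to train it (assuming PyTorch; layer sizes and the eps floor value are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceHead(nn.Module):
    # Two outputs per input: predicted mean and variance; softplus keeps the variance positive.
    def __init__(self, in_dim=1, hidden=64, eps=1e-6):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.var = nn.Linear(hidden, 1)
        self.eps = eps  # illustrative minimum variance for numerical stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), F.softplus(self.var(h)) + self.eps

def gaussian_nll(mu, var, y):
    # Negative log-likelihood of y under N(mu, var), up to a constant.
    return 0.5 * (torch.log(var) + (y - mu) ** 2 / var).mean()
```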
Problem: Ground-truth uncertainty estimates are generally not available, so evaluating the quality of predictive uncertainty is a challenging task.
Well-calibrated predictions that are robust to model misspecification and dataset shift. - Lakshminarayanan (2016)
Kendall:
"To form calibration plots for classification models, we discretize our model’s predicted probabilities into a number of bins, for all classes and all pixels in the test set. We then plot the frequency of correctly predicted labels for each bin of probability values. Better performing uncertainty estimates should correlate more accurately with the line in the calibration plots."
Beluch:
"To assess calibration quality we determine whether the expected fraction of correct classifications (as predicted by the model confidence, i.e.the uncertainty over predictions) matches the observed fraction of correct classifications. When plotting both values against each other, a well-calibrated model lies close to the diagonal."
[1] Kendall, Alex, and Yarin Gal. "What uncertainties do we need in bayesian deep learning for computer vision?." Advances in neural information processing systems. 2017.
[2] Beluch, William H., et al. "The power of ensembles for active learning in image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[Further reading] Zadrozny, Bianca, and Charles Elkan. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002.
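A minimal sketch of the binning procedure described in the two quotes above (assuming NumPy; this follows the confidence-vs-accuracy variant, with illustrative function and variable names):

```python
import numpy as np

def calibration_curve(probs, labels, n_bins=10):
    # Bin predicted confidences and compute the observed fraction of correct
    # predictions per bin; a well-calibrated model lies close to the diagonal.
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_conf, bin_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence >= lo) & ((confidence < hi) | (hi == 1.0))
        if mask.any():
            bin_conf.append(confidence[mask].mean())
            bin_acc.append(correct[mask].mean())
    return bin_conf, bin_acc  # plot bin_acc against bin_conf
```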
[1] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017): 6402-6413.
(Ensemble + AT: Adversarially trained ensemble)
As confidence is a continuous variable, it appears the authors binned the values using a bin size of 0.1 (the paper does not state this clearly).
Quote: "We filter out test examples, corresponding to a particular confidence threshold and plot the accuracy for this threshold."
The quality of predictive uncertainty obtained using Bayesian NNs crucially depends on (i) the degree of approximation due to computational constraints and (ii) if the prior distribution is ‘correct’, as priors of convenience can lead to unreasonable predictive uncertainties.
[...]
Interestingly, dropout may also be interpreted as ensemble model combination where the predictions are averaged over an ensemble of NNs (with parameter sharing). The ensemble interpretation seems more plausible particularly in the scenario where the dropout rates are not tuned based on the training data, since any sensible approximation to the true Bayesian posterior distribution has to depend on the training data.
[1] Osband, Ian, et al. "Deep exploration via bootstrapped DQN." Advances in neural information processing systems 29 (2016): 4026-4034.
[1] Osband, Ian, John Aslanides, and Albin Cassirer. "Randomized prior functions for deep reinforcement learning." Advances in Neural Information Processing Systems. 2018.
[Figure 13] See the supplementary material of the above paper
Interesting aspects:
Combine aleatoric and epistemic uncertainty modelling into single model:
MVE (they call it MAP inference) for aleatoric + MC Dropout for epistemic uncertainty
Loss uses L1 distance (Laplacian prior) instead of L2 distance (Gaussian prior); see the loss sketch after this list
Modelling uncertainty increases performance (works as loss attenuation)
Modelling aleatoric uncertainty increases performance more than epistemic
Combining both results in best performance
Uncertainties behave as expected: precision is lower when the image contains more points that the model is uncertain about
Kendall, Alex, and Yarin Gal. "What uncertainties do we need in bayesian deep learning for computer vision?." Advances in neural information processing systems. 2017.
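A sketch of the loss-attenuation idea from the list above (assuming PyTorch; the network is assumed to output a point prediction together with a log-variance s, and the L1 form mirrors the Laplacian-prior choice mentioned above; constants differ slightly in the paper):

```python
import torch

def attenuated_l1_loss(y_pred, s, y_true):
    # Aleatoric loss attenuation: residuals are down-weighted by the predicted
    # uncertainty exp(-s), while the + s term prevents the model from simply
    # predicting high uncertainty everywhere.
    return (torch.exp(-s) * torch.abs(y_true - y_pred) + s).mean()
```

Epistemic uncertainty is then added on top via MC dropout: several stochastic forward passes give the variance of the point predictions (epistemic), combined with the mean of the predicted aleatoric variances.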
CVPR 2018 (Authors from Bosch Center for Artificial Intelligence)
Results:
"We find that the difference in active learning performance can be explained by a combination of decreased model capacity and lower diversity of MC Dropout ensembles"
Beluch, William H., et al. "The power of ensembles for active learning in image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
CVPR 2020 (ETH Zürich & Uppsala University)
Gustafsson, Fredrik K., Martin Danelljan, and Thomas B. Schon. "Evaluating scalable bayesian deep learning methods for robust computer vision." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.
[1] Kahn, Gregory, et al. "Uncertainty-aware reinforcement learning for collision avoidance." arXiv preprint arXiv:1702.01182 (2017).
[2] Osband, Ian, et al. "Deep exploration via bootstrapped DQN." Advances in neural information processing systems 29 (2016): 4026-4034.
[3] Sedlmeier, Andreas, et al. "Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning." In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020), 2020.
Note: Using the markdown-it-container plugin, which does not work in VSCode Marp.
Install locally using: npm install markdown-it-container --save-dev && npm install @marp-team/marp-core --save-dev
Then use marp-cli to compile: marp --engine ./engine.js uq.md
Or auto-compile on save: while inotifywait -e close_write uq.md; do marp --engine ./engine.js uq.md; done
# Modelling Uncertainty for Learning Systems
> An Overview of Basics, Techniques and Performance Results

Andreas Sedlmeier | Institut für Informatik | LMU München

---
[TODO] Remove, or use a different source?

## Sources of Uncertainty (ML Pipeline View)
1) Collection and selection of training data
2) Completeness and accuracy of training data
3) Model limitations
4) Uncertainty based on operational data

---
Epistemic uncertainty, modelled by a (bootstrap) ensemble of NNs. The variance of the individual point predictions is interpreted as the epistemic uncertainty.
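A minimal sketch of this ensemble-based epistemic uncertainty estimate (assuming PyTorch; `models` is a list of independently / bootstrap-trained networks, names are illustrative):

```python
import torch

def ensemble_predict(models, x):
    # Epistemic uncertainty = variance of the individual point predictions
    # across the (bootstrap) ensemble members.
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    return preds.mean(dim=0), preds.var(dim=0)
```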
img src: https://www.researchgate.net/publication/221523889_Efficient_Viterbi_Algorithms_for_Lexical_Tree_Based_Models/figures?lo=1
[TODO] I need different sources here, e.g. https://arxiv.org/abs/1901.02731.pdf
-> Minimizing the KL divergence is equivalent to maximizing the ELBO
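The identity behind this equivalence (data $\mathcal{D}$, weights $w$, approximate posterior $q(w)$):

$$
\log p(\mathcal{D}) = \underbrace{\mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right]}_{\text{ELBO}} + \mathrm{KL}\!\left(q(w) \,\|\, p(w \mid \mathcal{D})\right)
$$

Since $\log p(\mathcal{D})$ does not depend on $q$, minimizing the KL term is the same as maximizing the ELBO.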
- [TODO] Graphic: Fig. 7, VI axes
- [TODO] How are the dropout loss function and L2 regularization related?
---
#### MCMC
[TODO] Remove?

---
#### Special application: Autoencoder
[TODO] Remove?

Variational Autoencoder (VAE)
- An autoencoder is a DL variant that consists of two components: (i) an encoder and (ii) a decoder. The encoder maps a high-dimensional input sample x to a low-dimensional latent variable z, while the decoder reproduces the original sample x from the latent variable z. The latent variables are compelled to conform to a given prior distribution P(z).
- VAEs cast learning representations for high-dimensional distributions as a VI problem

---
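A minimal VAE sketch illustrating the VI view described above (assuming PyTorch; layer sizes are illustrative, not from the slides): the encoder outputs mean and log-variance of q(z|x), the reparameterization trick draws z = mu + sigma * eps, and the loss is the negative ELBO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```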
---
- Flipout here?

::: note
$^{[1]}$ Wen, Yeming, et al. "Flipout: Efficient pseudo-independent weight perturbations on mini-batches." arXiv preprint arXiv:1803.04386 (2018).
:::

---
#### Laplace Approximations
- Mentioned for completeness; I have not worked with this myself so far
## Evaluation: Accuracy as a function of confidence
- Considered task: the model is evaluated only if its confidence is above a user-specified threshold
- Confidence is defined as $p(y=\hat{y}|x) = \max_k p(y=k|x)$
- If the confidence is well-calibrated, one can trust the predictions when the reported confidence is high
- One could then resort to a different solution (e.g. human in the loop) when the model is not confident

---
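A sketch of this evaluation (assuming NumPy; `probs` holds per-class predicted probabilities, `labels` the true classes, both illustrative names):

```python
import numpy as np

def accuracy_vs_confidence(probs, labels, thresholds=np.arange(0.0, 1.0, 0.1)):
    confidence = probs.max(axis=1)          # p(y = y_hat | x) = max_k p(y = k | x)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels)
    results = []
    for t in thresholds:
        keep = confidence >= t              # evaluate only examples above the threshold
        acc = correct[keep].mean() if keep.any() else np.nan
        results.append((t, acc, keep.sum()))
    return results  # list of (threshold, accuracy, #examples kept)
```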
### Further reading:

### Entry level / overview:
- Inovex Blog Post: uncertainty-quantification-deep-learning https://www.inovex.de/blog/uncertainty-quantification-deep-learning/

---