[1] Neal, Radford M. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, 2012.
[Figure 1] Shridhar, Kumar, Felix Laumann, and Marcus Liwicki. "A comprehensive guide to bayesian convolutional neural network with variational inference." arXiv preprint arXiv:1901.02731 (2019).
[Further "reading"] DeepBayes2018 Workshop - Max Welling:
Advanced methods of variational inference: https://youtu.be/mCBnid-1slI
[1] The huge number of parameters in an NN, as well as its functional form, does not lend itself to exact integration
[1] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.
[Image] Roffo, Giorgio. "Ranking to learn and learning to rank: On the role of ranking in pattern recognition applications." arXiv preprint arXiv:1706.05933 (2017).
"In fact, we shall see that we can we can get uncertainty information from existing deep learning models for free"
We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process.
[1] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
[2] Gal, Yarin. "Uncertainty in deep learning." University of Cambridge 1.3 (2016).
[3] Gal, Yarin, and Zoubin Ghahramani. "Bayesian convolutional neural networks with Bernoulli approximate variational inference." arXiv preprint arXiv:1506.02158 (2015).
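Below is a minimal sketch of the MC dropout recipe from [1-3] as it is commonly applied at test time (assuming PyTorch; the architecture, dropout rate and number of samples are illustrative choices, not taken from the papers): dropout is kept active during inference, and the mean and variance over several stochastic forward passes serve as prediction and epistemic uncertainty estimate.

```python
import torch
import torch.nn as nn

# Illustrative regression net with dropout (architecture is an assumption).
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout active at prediction time (MC dropout)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Mean over stochastic passes = prediction; variance = epistemic uncertainty estimate
    return preds.mean(dim=0), preds.var(dim=0)
```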
Ian Osband (2018):
Recent work has sought to understand dropout through a Bayesian lens, highlighting the connection to variational inference and arguing that the resultant dropout distribution approximates a Bayesian posterior. This narrative has proved popular despite the fact that [the] dropout distribution can be a poor approximation to most reasonable Bayesian posteriors.
[1] Osband, Ian, John Aslanides, and Albin Cassirer. "Randomized prior functions for deep reinforcement learning." Advances in Neural Information Processing Systems. 2018.
Consequence:
No agent employing dropout for posterior approximation can tell the difference between observing a set of data once and observing it N times. This can lead to arbitrarily poor decision making [...]
[*] This could explain why dropout failed at out-of-distribution detection using epistemic uncertainty, as evaluated in:
A. Sedlmeier, et al. "Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning," in 12th International Conference on Agents and Artificial Intelligence (ICAART 2020), 2020.
[**] Further discussion: What is the current state of dropout as Bayesian approximation?
https://web.archive.org/web/20190327225938if_/https://www.reddit.com/r/MachineLearning/comments/7bm4b2/d_what_is_the_current_state_of_dropout_as/
[1] Gal, Yarin, Jiri Hron, and Alex Kendall. "Concrete dropout." Advances in neural information processing systems. 2017.
"The relative accuracy of variational inference and MCMC is still unknown. We do know that variational inference generally underestimates the variance of the posterior density; this is a consequence of its objective function"
[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." Journal of the American statistical Association 112.518 (2017): 859-877.
Possible solution to Problem 2: Use alpha-divergences as alternative to VI's KL objective
This avoids VI's uncertainty underestimation
Hernandez-Lobato: Black-box alpha divergence
Yingzhen and Gal: Dropout inference in Bayesian neural networks with alpha-divergences
[2] Hernandez-Lobato, Jose, et al. "Black-box alpha divergence minimization." International Conference on Machine Learning. PMLR, 2016.
[3] Li, Yingzhen, and Yarin Gal. "Dropout inference in Bayesian neural networks with alpha-divergences." arXiv preprint arXiv:1703.02914 (2017).
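For orientation, one common parameterization of the alpha-divergence family is Rényi's (the exact parameterization used in [2, 3] differs in details):

$$
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx
$$

As $\alpha \to 1$ this recovers the KL divergence $\mathrm{KL}(p \,\|\, q)$; other values of $\alpha$ trade off mode-seeking versus mass-covering behaviour, which the cited works exploit to avoid VI's variance underestimation.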
What are Gaussian Processes (GPs)?
Recent scalability developments:
Leveraging uncertainty information from deep neural networks for disease detection: https://www.nature.com/articles/s41598-017-17876-z
Damianou, Andreas, and Neil Lawrence. "Deep gaussian processes." Artificial Intelligence and Statistics. 2013.
- Rasmussen, C. E. & Williams, C. K. I. Gaussian processes for machine learning, vol. 1 (MIT press Cambridge, 2006).
- Gaussian Processes are Not So Fancy: https://planspace.org/20181226-gaussian_processes_are_not_so_fancy/
- Gaussian Process, not quite for dummies: https://yugeten.github.io/posts/2019/09/GP/
- A Visual Exploration of Gaussian Processes: https://distill.pub/2019/visual-exploration-gaussian-processes/
Simple and scalable predictive uncertainty estimation using deep ensembles
- Lakshminarayanan, Pritzel, Blundell (2017)
One of the first works to apply ensemble ideas to deep NNs in order to investigate predictive uncertainty performance:
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017): 6402-6413.
[1] The authors enforce the positivity constraint on the variance by passing the second output through the softplus function log(1 + exp(·)), and add a minimum variance for numerical stability
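A sketch of such a two-output (mean-variance) head together with the Gaussian NLL loss used to train it (assuming PyTorch; layer sizes and the eps floor value are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceHead(nn.Module):
    # Two outputs per input: predicted mean and variance; softplus keeps the variance positive.
    def __init__(self, in_dim=1, hidden=64, eps=1e-6):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.var = nn.Linear(hidden, 1)
        self.eps = eps  # illustrative minimum variance for numerical stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), F.softplus(self.var(h)) + self.eps

def gaussian_nll(mu, var, y):
    # Negative log-likelihood of y under N(mu, var), up to a constant.
    return 0.5 * (torch.log(var) + (y - mu) ** 2 / var).mean()
```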
Problem: Ground-truth uncertainty estimates are generally not available, so evaluating the quality of predictive uncertainty is a challenging task.
Well-calibrated predictions that are robust to model misspecification and dataset shift. - Lakshminarayanan (2016)
Kendall:
"To form calibration plots for classification models, we discretize our model’s predicted probabilities into a number of bins, for all classes and all pixels in the test set. We then plot the frequency of correctly predicted labels for each bin of probability values. Better performing uncertainty estimates should correlate more accurately with the line in the calibration plots."
Beluch:
"To assess calibration quality we determine whether the expected fraction of correct classifications (as predicted by the model confidence, i.e.the uncertainty over predictions) matches the observed fraction of correct classifications. When plotting both values against each other, a well-calibrated model lies close to the diagonal."
[1] Kendall, Alex, and Yarin Gal. "What uncertainties do we need in bayesian deep learning for computer vision?." Advances in neural information processing systems. 2017.
[2] Beluch, William H., et al. "The power of ensembles for active learning in image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[Further reading] Zadrozny, Bianca, and Charles Elkan. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002.
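A minimal sketch of the binning procedure described in the two quotes above (assuming NumPy; this follows the confidence-vs-accuracy variant, with illustrative function and variable names):

```python
import numpy as np

def calibration_curve(probs, labels, n_bins=10):
    # Bin predicted confidences and compute the observed fraction of correct
    # predictions per bin; a well-calibrated model lies close to the diagonal.
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_conf, bin_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence >= lo) & ((confidence < hi) | (hi == 1.0))
        if mask.any():
            bin_conf.append(confidence[mask].mean())
            bin_acc.append(correct[mask].mean())
    return bin_conf, bin_acc  # plot bin_acc against bin_conf
```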
[1] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017): 6402-6413.
(Ensemble + AT: Adversarially trained ensemble)
As confidence is a continuous variable, it appears the authors binned the values using a bin size of 0.1 (the paper does not state this clearly).
Quote: "We filter out test examples, corresponding to a particular confidence threshold and plot the accuracy for this threshold."
The quality of predictive uncertainty obtained using Bayesian NNs crucially depends on (i) the degree of approximation due to computational constraints and (ii) if the prior distribution is ‘correct’, as priors of convenience can lead to unreasonable predictive uncertainties.
[...]
Interestingly, dropout may also be interpreted as ensemble model combination where the predictions are averaged over an ensemble of NNs (with parameter sharing). The ensemble interpretation seems more plausible particularly in the scenario where the dropout rates are not tuned based on the training data, since any sensible approximation to the true Bayesian posterior distribution has to depend on the training data.
[1] Osband, Ian, et al. "Deep exploration via bootstrapped DQN." Advances in neural information processing systems 29 (2016): 4026-4034.
[1] Osband, Ian, John Aslanides, and Albin Cassirer. "Randomized prior functions for deep reinforcement learning." Advances in Neural Information Processing Systems. 2018.
[Figure 13] See the supplementary material of the above paper
Interesting aspects:
Combine aleatoric and epistemic uncertainty modelling into single model:
MVE (they call it MAP inference) for aleatoric + MC Dropout for epistemic uncertainty
Loss uses L1 distance (Laplacian prior) instead of L2 distance (Gaussian prior); see the loss sketch after this list
Modelling uncertainty increases performance (works as loss attenuation)
Modelling aleatoric uncertainty increases performance more than epistemic
Combining both results in best performance
Uncertainties behave as expected: precision is lower when the image contains more points that the model is uncertain about
Kendall, Alex, and Yarin Gal. "What uncertainties do we need in bayesian deep learning for computer vision?." Advances in neural information processing systems. 2017.
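A sketch of the loss-attenuation idea from the list above (assuming PyTorch; the network is assumed to output a point prediction together with a log-variance s, and the L1 form mirrors the Laplacian-prior choice mentioned above; constants differ slightly in the paper):

```python
import torch

def attenuated_l1_loss(y_pred, s, y_true):
    # Aleatoric loss attenuation: residuals are down-weighted by the predicted
    # uncertainty exp(-s), while the + s term prevents the model from simply
    # predicting high uncertainty everywhere.
    return (torch.exp(-s) * torch.abs(y_true - y_pred) + s).mean()
```

Epistemic uncertainty is then added on top via MC dropout: several stochastic forward passes give the variance of the point predictions (epistemic), combined with the mean of the predicted aleatoric variances.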
CVPR 2018 (Authors from Bosch Center for Artificial Intelligence)
Results:
"We find that the difference in active learning performance can be explained by a combination of decreased model capacity and lower diversity of MC Dropout ensembles"
Beluch, William H., et al. "The power of ensembles for active learning in image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
CVPR 2020 (ETH Zürich & Uppsala University)
Gustafsson, Fredrik K., Martin Danelljan, and Thomas B. Schon. "Evaluating scalable bayesian deep learning methods for robust computer vision." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.
[1] Kahn, Gregory, et al. "Uncertainty-aware reinforcement learning for collision avoidance." arXiv preprint arXiv:1702.01182 (2017).
[2] Osband, Ian, et al. "Deep exploration via bootstrapped DQN." Advances in neural information processing systems 29 (2016): 4026-4034.
[3] Sedlmeier, Andreas, et al. "Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning." In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020), 2020.
Note: Using the markdown-it-container plugin, which does not work in VSCode Marp.
Install locally using: npm install markdown-it-container --save-dev && npm install @marp-team/marp-core --save-dev
Then use marp-cli to compile: marp --engine ./engine.js uq.md
Or auto-compile on save: while inotifywait -e close_write uq.md; do marp --engine ./engine.js uq.md; done
# Modelling Uncertainty for Learning Systems
> An Overview of Basics, Techniques and Performance Results

Andreas Sedlmeier | Institut für Informatik | LMU München

---
[TODO] Remove, or use a different source?

## Sources of Uncertainty (ML Pipeline View)
1) Collection and selection of training data
2) Completeness and accuracy of training data
3) Model limitations
4) Uncertainty based on operational data

---
Epistemic uncertainty, modelled by a (bootstrap) ensemble of NNs. The variance of the individual point predictions is interpreted as the epistemic uncertainty.
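A minimal sketch of this ensemble-based epistemic uncertainty estimate (assuming PyTorch; `models` is a list of independently / bootstrap-trained networks, names are illustrative):

```python
import torch

def ensemble_predict(models, x):
    # Epistemic uncertainty = variance of the individual point predictions
    # across the (bootstrap) ensemble members.
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    return preds.mean(dim=0), preds.var(dim=0)
```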
img src: https://www.researchgate.net/publication/221523889_Efficient_Viterbi_Algorithms_for_Lexical_Tree_Based_Models/figures?lo=1
[TODO] I need different sources here, e.g. https://arxiv.org/abs/1901.02731.pdf
-> Minimizing the KL divergence is equivalent to maximizing the ELBO
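The identity behind this equivalence (data $\mathcal{D}$, weights $w$, approximate posterior $q(w)$):

$$
\log p(\mathcal{D}) = \underbrace{\mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right]}_{\text{ELBO}} + \mathrm{KL}\!\left(q(w) \,\|\, p(w \mid \mathcal{D})\right)
$$

Since $\log p(\mathcal{D})$ does not depend on $q$, minimizing the KL term is the same as maximizing the ELBO.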
- [TODO] Graphic: Fig. 7, VI axes
- [TODO] How are the dropout loss function and L2 regularization related?
---
#### MCMC
[TODO] Remove?

---
#### Special application: Autoencoder
[TODO] Remove?

Variational Autoencoder (VAE)
- An autoencoder is a DL variant that consists of two components: (i) an encoder and (ii) a decoder. The encoder maps a high-dimensional input sample x to a low-dimensional latent variable z, while the decoder reproduces the original sample x from the latent variable z. The latent variables are compelled to conform to a given prior distribution P(z).
- VAEs cast learning representations for high-dimensional distributions as a VI problem

---
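A minimal VAE sketch illustrating the VI view described above (assuming PyTorch; layer sizes are illustrative, not from the slides): the encoder outputs mean and log-variance of q(z|x), the reparameterization trick draws z = mu + sigma * eps, and the loss is the negative ELBO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```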
---
- Flipout here?

::: note
$^{[1]}$ Wen, Yeming, et al. "Flipout: Efficient pseudo-independent weight perturbations on mini-batches." arXiv preprint arXiv:1803.04386 (2018).
:::

---
#### Laplace Approximations
- Mentioned for completeness; I have not worked with this myself so far
## Evaluation: Accuracy as a function of confidence
- Considered task: the model is evaluated only if its confidence is above a user-specified threshold
- Confidence is defined as $p(y=\hat{y}|x) = \max_k p(y=k|x)$
- If the confidence is well-calibrated, one can trust the predictions when the reported confidence is high
- One could then resort to a different solution (e.g. human in the loop) when the model is not confident

---
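A sketch of this evaluation (assuming NumPy; `probs` holds per-class predicted probabilities, `labels` the true classes, both illustrative names):

```python
import numpy as np

def accuracy_vs_confidence(probs, labels, thresholds=np.arange(0.0, 1.0, 0.1)):
    confidence = probs.max(axis=1)          # p(y = y_hat | x) = max_k p(y = k | x)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels)
    results = []
    for t in thresholds:
        keep = confidence >= t              # evaluate only examples above the threshold
        acc = correct[keep].mean() if keep.any() else np.nan
        results.append((t, acc, keep.sum()))
    return results  # list of (threshold, accuracy, #examples kept)
```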
### Further reading:

### Entry level / overview:
- Inovex Blog Post: uncertainty-quantification-deep-learning https://www.inovex.de/blog/uncertainty-quantification-deep-learning/

---