For the third course, we will go back a little to machine learning (the slides are still online on the GitHub repository). The starting point will be loss functions and risk.
Loss functions and risk
A general definition of a loss is that it is nonnegative, and zero when the two arguments coincide, i.e. \ell(y,y)=0. As we will discuss further, it is neither a distance nor a dissimilarity measure
Then, define the empirical risk (and the associated empirical risk minimization principle, as coined in Vapnik (1991))
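For reference (standard definitions, with notation that may differ slightly from the slides), for a model m and a sample (y_i,\boldsymbol{x}_i), the risk and the empirical risk are

R(m) = \mathbb{E}\big[\ell(Y,m(\boldsymbol{X}))\big]\qquad\text{and}\qquad \widehat{R}_n(m) = \frac{1}{n}\sum_{i=1}^n \ell(y_i,m(\boldsymbol{x}_i)),

and the empirical risk minimization principle suggests to select \widehat{m}\in\underset{m\in\mathcal{M}}{\text{argmin}}\,\big\{\widehat{R}_n(m)\big\} within some class of models \mathcal{M}.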
Given a loss \ell and some probability space, we define the optimal decision, also called the Bayes decision rule
And instead of the risk of a model, we can define the excess risk.
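With the notation above (again, possibly slightly different from the slides), those two objects are

m^\star\in\underset{m}{\text{argmin}}\,\big\{\mathbb{E}\big[\ell(Y,m(\boldsymbol{X}))\big]\big\}\qquad\text{and}\qquad \mathcal{E}(m)=R(m)-R(m^\star)\geq 0.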
A classical loss for a classifier is \ell_{0/1},
In that case, the Bayes decision rule is
m^\star(\boldsymbol{x}) = \boldsymbol{1}(\mu(\boldsymbol{x})>1/2) = \begin{cases}1 & \text{if }\mu(\boldsymbol{x})>1/2\\ 0 & \text{if }\mu(\boldsymbol{x})\leq 1/2\end{cases}
where (of course) one needs to know \mu, otherwise we can consider some plug-in estimator based on \widehat\mu. For a continuous variable y, consider the quadratic loss \ell_2,
In that case, the Bayes decision rule (the optimal model) is the conditional expectation
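As a small numerical illustration (a sketch of my own, with arbitrary simulated data, not taken from the slides), we can check that the best constant prediction under the quadratic loss is the empirical mean, and that with the \ell_{0/1} loss, thresholding the true \mu(\boldsymbol{x}) at 1/2 minimizes the empirical misclassification rate,

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic loss: the best constant prediction is the (empirical) mean
y = rng.gamma(shape=2.0, scale=1.5, size=10_000)
grid = np.linspace(y.min(), y.max(), 501)
risk2 = [np.mean((y - c) ** 2) for c in grid]
print(grid[np.argmin(risk2)], y.mean())   # both values should be close

# 0/1 loss: when mu(x) = P(Y=1|X=x) is known, thresholding at 1/2 is optimal
x = rng.uniform(-3, 3, size=50_000)
mu = 1 / (1 + np.exp(-2 * x))             # a purely illustrative regression function
yb = rng.binomial(1, mu)
thresholds = np.linspace(0.05, 0.95, 19)
risk01 = [np.mean((mu > t).astype(int) != yb) for t in thresholds]
print(thresholds[np.argmin(risk01)])      # should be (close to) 0.5
```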
Observe that we can also define the quantile loss (or the expectile loss)
Observe that this loss is not symmetric.
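For completeness, with the standard conventions (the normalization may differ from the slides), the quantile (or pinball) loss and the expectile loss at level \tau\in(0,1) are

\ell_\tau^{\text{quantile}}(y,m)=\big(\tau-\boldsymbol{1}(y<m)\big)(y-m)\qquad\text{and}\qquad \ell_\tau^{\text{expectile}}(y,m)=\big|\tau-\boldsymbol{1}(y<m)\big|\,(y-m)^2,

whose minimizers (over constant predictions) are respectively the quantile and the expectile of level \tau, and which reduce (up to a factor) to the absolute and quadratic losses when \tau=1/2.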
From loss functions to distances
Let us discuss a bit more the fact that losses are not distances. As mentioned, a loss is neither necessarily symmetric nor separable,
Furthermore, it has no reason to satisfy the triangle inequality. Actually, if d is a distance, it is very likely that d^2 is not (since squaring is not a subadditive transformation); on the real line, for instance, d^2(0,2)=4 > d^2(0,1)+d^2(1,2)=2.
Another related concept could be the concept of similarity, or dissimilarity.
Another one is the concept of divergence, which we will use much more. For instance, the Bregman divergence is
which satisfies desirable properties.
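For reference, the standard writing of the Bregman divergence associated with a strictly convex and differentiable function \varphi (which should match the one on the slides) is

D_\varphi(x,y)=\varphi(x)-\varphi(y)-\langle\nabla\varphi(y),x-y\rangle\geq 0,

with, as classical examples, \varphi(x)=\|x\|_2^2 giving the squared Euclidean distance, and \varphi(x)=\sum_i x_i\log x_i giving the Kullback–Leibler divergence (on the probability simplex).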
Interestingly, it is possible to define “projections” even if we have neither an orthogonal projection (there is no notion of orthogonality, since there is no inner product) nor a distance. But still,
One can use a nice algorithm to compute that quantity, if the convex set can be expressed simply.
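As a toy numerical illustration (a sketch of my own, not necessarily the algorithm discussed in the course), we can compute the Bregman projection, for the negative-entropy generator, of a positive vector onto the probability simplex with a generic solver, and check that it coincides with the known closed form, a simple renormalization,

```python
import numpy as np
from scipy.optimize import minimize

# Bregman projection (negative-entropy generator) of y > 0 onto the simplex
y = np.array([0.2, 1.0, 3.0, 0.5])

def gen_kl(x, y):
    # generalized KL divergence, i.e. the Bregman divergence of x -> sum x*log(x)
    return np.sum(x * np.log(x / y) - x + y)

res = minimize(
    gen_kl, x0=np.full(len(y), 1 / len(y)), args=(y,),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * len(y),
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
)
print(res.x)        # numerical Bregman projection onto the simplex
print(y / y.sum())  # closed form: a simple renormalization
```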
When considering “distances” between distributions, instead of between y's, among other interesting properties in statistics, we can mention that of unbiased gradients,
and Müller (1997) defined integral probability metrics
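As a reminder, the definition in Müller (1997) is, for a class \mathcal{F} of measurable test functions,

d_{\mathcal{F}}(P,Q)=\sup_{f\in\mathcal{F}}\Big|\int f\,\mathrm{d}P-\int f\,\mathrm{d}Q\Big|,

classical choices of \mathcal{F} giving back (up to constants) the total variation distance, the Kolmogorov distance, or the Wasserstein distance (via Kantorovich–Rubinstein duality).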
Standard “distances” between distributions
The first one will be the Hellinger distance
which can lead to simple expressions for standard parametric distributions, such as Beta distributions,
or (multivariate) Gaussian ones
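As a quick numerical check (a sketch of my own, with arbitrary parameters), the squared Hellinger distance, with the convention H^2(P,Q)=\frac{1}{2}\int(\sqrt{p}-\sqrt{q})^2 (the normalization may differ from the slides), can be computed by numerical integration,

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta, norm

def hellinger2(p_pdf, q_pdf, lower, upper):
    """Squared Hellinger distance, H^2 = 1/2 * int (sqrt(p) - sqrt(q))^2,
    computed by numerical integration over (lower, upper)."""
    integrand = lambda x: (np.sqrt(p_pdf(x)) - np.sqrt(q_pdf(x))) ** 2
    return 0.5 * quad(integrand, lower, upper)[0]

# two Beta distributions (parameters chosen arbitrarily, for illustration)
print(hellinger2(beta(2, 5).pdf, beta(4, 3).pdf, 0, 1))

# two (univariate) Gaussian distributions
print(hellinger2(norm(0, 1).pdf, norm(1, 2).pdf, -25, 25))
```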
We can also mention Pearson divergence
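which can be written, for distributions P and Q with densities p and q (with the usual convention, to be checked against the slides),

\chi^2(P\|Q)=\int\frac{(p-q)^2}{q}.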
More interesting (and popular in probability theory) is the total variation distance,
There are several ways to express that distance.
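Among the classical equivalent formulations (standard results, with the convention where the total variation distance takes values in [0,1]),

\mathrm{TV}(P,Q)=\sup_{\mathcal{A}}\big|P(\mathcal{A})-Q(\mathcal{A})\big|=\frac{1}{2}\int|p-q|=1-\int\min\{p,q\}.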
If, instead of general sets \mathcal{A}, we consider half-lines (-\infty,t], we obtain the Kolmogorov distance (or Kolmogorov–Smirnov distance)
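that is, in terms of the cumulative distribution functions,

d_{K}(P,Q)=\sup_{t\in\mathbb{R}}\big|F_P(t)-F_Q(t)\big|.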
Another important one in statistics is Kullback–Leibler divergence
For instance, with Gaussian vectors
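using the well-known closed-form expression for two multivariate Gaussian distributions, which can be sketched numerically as follows (the parameters below are arbitrary, just for illustration),

```python
import numpy as np

def kl_gaussians(m0, S0, m1, S1):
    """KL( N(m0,S0) || N(m1,S1) ), closed form for d-dimensional Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (
        np.trace(S1_inv @ S0)
        + diff @ S1_inv @ diff
        - d
        + np.log(np.linalg.det(S1) / np.linalg.det(S0))
    )

# arbitrary parameters, just for illustration
m0, S0 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
m1, S1 = np.array([1.0, -1.0]), np.array([[2.0, 0.0], [0.0, 0.5]])
print(kl_gaussians(m0, S0, m1, S1))
print(kl_gaussians(m1, S1, m0, S0))  # not symmetric, as observed below
```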
Observe that the measure is actually a dissimilarity measure
If we want a symmetric version, we can consider Jeffreys divergence
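which is simply the symmetrized version (sometimes defined with an extra factor 1/2),

J(P,Q)=\mathrm{KL}(P\|Q)+\mathrm{KL}(Q\|P).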
Finally, we will mention f-divergence
and Rényi divergence
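As a reminder (standard definitions, with the usual conventions), for a convex function f with f(1)=0, and for \alpha>0, \alpha\neq 1,

D_f(P\|Q)=\int f\!\left(\frac{p}{q}\right)q\qquad\text{and}\qquad D_\alpha(P\|Q)=\frac{1}{\alpha-1}\log\int p^\alpha q^{1-\alpha},

the Kullback–Leibler divergence being recovered with f(t)=t\log t, and as the limit of D_\alpha when \alpha\to 1.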
We will discuss those “distances” a little bit more (yes, I usually use that term, abusively), and next week, we will present the most interesting one, the Wasserstein distance.