For the third course, we will go back a little to machine learning (the slides are still online on the GitHub repository). The starting point will be loss functions and risk.
Loss functions and risk
A general definition of a loss is that it is nonnegative, and zero when the two arguments coincide, i.e. \ell(y,y)=0. As we will discuss further, it is neither a distance nor a dissimilarity measure
Then, define the empirical risk (and the associated empirical risk minimization principle, as coined in Vapnik (1991))
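For reference (standard definitions, with notation that may differ slightly from the slides), for a model m and a sample (y_i,\boldsymbol{x}_i), the risk and the empirical risk are

R(m) = \mathbb{E}\big[\ell(Y,m(\boldsymbol{X}))\big]\qquad\text{and}\qquad \widehat{R}_n(m) = \frac{1}{n}\sum_{i=1}^n \ell(y_i,m(\boldsymbol{x}_i)),

and the empirical risk minimization principle suggests to select \widehat{m}\in\underset{m\in\mathcal{M}}{\text{argmin}}\,\big\{\widehat{R}_n(m)\big\} within some class of models \mathcal{M}.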
Given a loss \ell and some probability space, we define the optimal decision, also called the Bayes decision rule
And instead of the risk of a model, we can define the excess risk.
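With the notation above (again, possibly slightly different from the slides), those two objects are

m^\star\in\underset{m}{\text{argmin}}\,\big\{\mathbb{E}\big[\ell(Y,m(\boldsymbol{X}))\big]\big\}\qquad\text{and}\qquad \mathcal{E}(m)=R(m)-R(m^\star)\geq 0.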
A classical loss for a classifier is \ell_{0/1},
In that case, the Bayes decision rule is
m^\star(\boldsymbol{x}) = \boldsymbol{1}(\mu(\boldsymbol{x})>1/2) = \begin{cases}1 & \text{if }\mu(\boldsymbol{x})>1/2\\ 0 & \text{if }\mu(\boldsymbol{x})\leq 1/2\end{cases}
where (of course) one needs to know \mu, otherwise we can consider some plug-in estimator based on \widehat\mu. For a continuous variable y, consider the quadratic loss \ell_2,
In that case, the Bayes decision rule (the optimal model) is the conditional expectation
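As a small numerical illustration (a sketch of my own, with arbitrary simulated data, not taken from the slides), we can check that the best constant prediction under the quadratic loss is the empirical mean, and that with the \ell_{0/1} loss, thresholding the true \mu(\boldsymbol{x}) at 1/2 minimizes the empirical misclassification rate,

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic loss: the best constant prediction is the (empirical) mean
y = rng.gamma(shape=2.0, scale=1.5, size=10_000)
grid = np.linspace(y.min(), y.max(), 501)
risk2 = [np.mean((y - c) ** 2) for c in grid]
print(grid[np.argmin(risk2)], y.mean())   # both values should be close

# 0/1 loss: when mu(x) = P(Y=1|X=x) is known, thresholding at 1/2 is optimal
x = rng.uniform(-3, 3, size=50_000)
mu = 1 / (1 + np.exp(-2 * x))             # a purely illustrative regression function
yb = rng.binomial(1, mu)
thresholds = np.linspace(0.05, 0.95, 19)
risk01 = [np.mean((mu > t).astype(int) != yb) for t in thresholds]
print(thresholds[np.argmin(risk01)])      # should be (close to) 0.5
```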
Observe that we can also define the quantile loss (or the expectile loss)
Observe that this loss is not symmetric.
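For completeness, with the standard conventions (the normalization may differ from the slides), the quantile (or pinball) loss and the expectile loss at level \tau\in(0,1) are

\ell_\tau^{\text{quantile}}(y,m)=\big(\tau-\boldsymbol{1}(y<m)\big)(y-m)\qquad\text{and}\qquad \ell_\tau^{\text{expectile}}(y,m)=\big|\tau-\boldsymbol{1}(y<m)\big|\,(y-m)^2,

whose minimizers (over constant predictions) are respectively the quantile and the expectile of level \tau, and which reduce (up to a factor) to the absolute and quadratic losses when \tau=1/2.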
From loss functions to distances
Let us discuss a bit more the fact that losses are not distances. As mentioned, a loss is neither necessarily symmetric nor separable,
Furthermore, it has no reason to satisfy the triangle inequality. Actually, if d is a distance, it is very likely that d^2 is not (since squaring is not a subadditive transformation); on the real line, for instance, d^2(0,2)=4 > d^2(0,1)+d^2(1,2)=2.
Another related concept could be the concept of similarity, or dissimilarity.
Another one is the concept of divergence, which we will use much more. For instance, the Bregman divergence is
which satisfies desirable properties.
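For reference, the standard writing of the Bregman divergence associated with a strictly convex and differentiable function \varphi (which should match the one on the slides) is

D_\varphi(x,y)=\varphi(x)-\varphi(y)-\langle\nabla\varphi(y),x-y\rangle\geq 0,

with, as classical examples, \varphi(x)=\|x\|_2^2 giving the squared Euclidean distance, and \varphi(x)=\sum_i x_i\log x_i giving the Kullback–Leibler divergence (on the probability simplex).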
Interestingly, it is possible to define “projections” even if we have neither an orthogonal projection (there is no notion of orthogonality, since there is no inner product) nor a distance. But still,
One can use a nice algorithm to compute that quantity, if the convex set can be expressed simply.
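As a toy numerical illustration (a sketch of my own, not necessarily the algorithm discussed in the course), we can compute the Bregman projection, for the negative-entropy generator, of a positive vector onto the probability simplex with a generic solver, and check that it coincides with the known closed form, a simple renormalization,

```python
import numpy as np
from scipy.optimize import minimize

# Bregman projection (negative-entropy generator) of y > 0 onto the simplex
y = np.array([0.2, 1.0, 3.0, 0.5])

def gen_kl(x, y):
    # generalized KL divergence, i.e. the Bregman divergence of x -> sum x*log(x)
    return np.sum(x * np.log(x / y) - x + y)

res = minimize(
    gen_kl, x0=np.full(len(y), 1 / len(y)), args=(y,),
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * len(y),
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
)
print(res.x)        # numerical Bregman projection onto the simplex
print(y / y.sum())  # closed form: a simple renormalization
```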
When considering “distances” between distributions, instead of between y's, among other interesting properties in statistics, we can mention that of unbiased gradients,
and Müller (1997) defined integral probability metrics
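As a reminder, the definition in Müller (1997) is, for a class \mathcal{F} of measurable test functions,

d_{\mathcal{F}}(P,Q)=\sup_{f\in\mathcal{F}}\Big|\int f\,\mathrm{d}P-\int f\,\mathrm{d}Q\Big|,

classical choices of \mathcal{F} giving back (up to constants) the total variation distance, the Kolmogorov distance, or the Wasserstein distance (via Kantorovich–Rubinstein duality).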
Standard “distances” between distributions
The first one will be the Hellinger distance
which can lead to simple expressions for standard parametric distributions, such as Beta distributions,
or (multivariate) Gaussian ones
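As a quick numerical check (a sketch of my own, with arbitrary parameters), the squared Hellinger distance, with the convention H^2(P,Q)=\frac{1}{2}\int(\sqrt{p}-\sqrt{q})^2 (the normalization may differ from the slides), can be computed by numerical integration,

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta, norm

def hellinger2(p_pdf, q_pdf, lower, upper):
    """Squared Hellinger distance, H^2 = 1/2 * int (sqrt(p) - sqrt(q))^2,
    computed by numerical integration over (lower, upper)."""
    integrand = lambda x: (np.sqrt(p_pdf(x)) - np.sqrt(q_pdf(x))) ** 2
    return 0.5 * quad(integrand, lower, upper)[0]

# two Beta distributions (parameters chosen arbitrarily, for illustration)
print(hellinger2(beta(2, 5).pdf, beta(4, 3).pdf, 0, 1))

# two (univariate) Gaussian distributions
print(hellinger2(norm(0, 1).pdf, norm(1, 2).pdf, -25, 25))
```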
We can also mention Pearson divergence
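which can be written, for distributions P and Q with densities p and q (with the usual convention, to be checked against the slides),

\chi^2(P\|Q)=\int\frac{(p-q)^2}{q}.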
More interesting (and popular in probability theory) is the total variation distance,
There are several ways to express that distance.
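Among the classical equivalent formulations (standard results, with the convention where the total variation distance takes values in [0,1]),

\mathrm{TV}(P,Q)=\sup_{\mathcal{A}}\big|P(\mathcal{A})-Q(\mathcal{A})\big|=\frac{1}{2}\int|p-q|=1-\int\min\{p,q\}.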
If, instead of general sets \mathcal{A}, we consider half-lines (-\infty,t], we obtain the Kolmogorov distance (or Kolmogorov–Smirnov distance)
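that is, in terms of the cumulative distribution functions,

d_{K}(P,Q)=\sup_{t\in\mathbb{R}}\big|F_P(t)-F_Q(t)\big|.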
Another important one in statistics is Kullback–Leibler divergence
For instance, with Gaussian vectors
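using the well-known closed-form expression for two multivariate Gaussian distributions, which can be sketched numerically as follows (the parameters below are arbitrary, just for illustration),

```python
import numpy as np

def kl_gaussians(m0, S0, m1, S1):
    """KL( N(m0,S0) || N(m1,S1) ), closed form for d-dimensional Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (
        np.trace(S1_inv @ S0)
        + diff @ S1_inv @ diff
        - d
        + np.log(np.linalg.det(S1) / np.linalg.det(S0))
    )

# arbitrary parameters, just for illustration
m0, S0 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
m1, S1 = np.array([1.0, -1.0]), np.array([[2.0, 0.0], [0.0, 0.5]])
print(kl_gaussians(m0, S0, m1, S1))
print(kl_gaussians(m1, S1, m0, S0))  # not symmetric, as observed below
```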
Observe that the measure is actually a dissimilarity measure
If we want a symmetric version, we can consider Jeffreys divergence
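which is simply the symmetrized version (sometimes defined with an extra factor 1/2),

J(P,Q)=\mathrm{KL}(P\|Q)+\mathrm{KL}(Q\|P).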
Finally, we will mention f-divergence
and Rényi divergence
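As a reminder (standard definitions, with the usual conventions), for a convex function f with f(1)=0, and for \alpha>0, \alpha\neq 1,

D_f(P\|Q)=\int f\!\left(\frac{p}{q}\right)q\qquad\text{and}\qquad D_\alpha(P\|Q)=\frac{1}{\alpha-1}\log\int p^\alpha q^{1-\alpha},

the Kullback–Leibler divergence being recovered with f(t)=t\log t, and as the limit of D_\alpha when \alpha\to 1.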
We will discuss those “distances” a little bit more (yes, I usually use that term, abusively), and next week, we will present the most interesting one, the Wasserstein distance.