
standardization – Mathematics behind standardizing the data points in machine learning algorithms (e.g., K-means clustering)


I assume that by “standardising the data points” you mean “standardising the variables/features”.

It is common practice but not always good. Suppose you have two clusters in two dimensions, one distributed as $N(\mu_1,I)$ and the other as $N(\mu_2,I)$, where $I$ is the unit matrix, $\mu_1=(0,0)'$, and $\mu_2=(2,0)'$. Then all the information about the clustering is in the first dimension, and the first dimension also has the larger variance/standard deviation (its marginal variance includes the spread between the two cluster means). Standardising then means that you lower the weight of the first dimension in the distance computation (as you divide by the larger standard deviation), and this is not good for recovering the clusters.
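To see the re-weighting concretely, here is a minimal simulation sketch of this example (sample size, seed, and the use of NumPy/scikit-learn are my own choices, not part of the original argument):

```python
# Minimal sketch of the example above: two spherical Gaussian clusters,
# N((0,0)', I) and N((2,0)', I), so only dimension 1 carries cluster
# information, and dimension 1 also has the larger marginal variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2)),  # cluster 1
    rng.normal(loc=[2.0, 0.0], scale=1.0, size=(n, 2)),  # cluster 2
])

print("marginal sd before:", X.std(axis=0))  # ~[1.41, 1.0]: within-sd 1 plus between-cluster spread in dim 1
Z = StandardScaler().fit_transform(X)
print("marginal sd after: ", Z.std(axis=0))  # ~[1.0, 1.0]

# A fixed difference in dimension 1 now contributes about half as much
# squared distance as before (it was divided by sd ~ sqrt(2)), while the
# pure-noise dimension 2 keeps its full weight: the down-weighting of the
# informative dimension described in the text.
```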

Standardisation (for K-means) is usually recommended if the different variables have different measurement units and/or the measured values are such that the sizes of differences in one variable cannot be meaningfully compared with differences in another. The Euclidean distance, on which K-means is based, aggregates squared variable-wise differences, so these should be comparable for aggregation. The rationale is that after standardisation, differences in the different variables can be meaningfully compared (as relative to the standard deviation, which is 1 after standardisation) and therefore meaningfully aggregated.
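Spelled out (standard definitions, not given in the original answer): with $p$ variables, K-means works with the squared Euclidean distance, and the standardisation referred to here is the z-score transformation,

$$d^2(x,y)=\sum_{j=1}^{p}(x_j-y_j)^2, \qquad z_{ij}=\frac{x_{ij}-\bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$. After standardisation every variable has standard deviation 1, so a difference of, say, 0.5 means half a standard deviation in every variable, and the squared differences are aggregated on a common scale.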

This, however, can go wrong, as demonstrated above, because of the way K-means aggregates the within-cluster sum of squares over clusters: it implicitly takes the within-cluster variation of all clusters to be the same, or, put more plainly, it does not adjust for different within-cluster variation. Some theory shows that K-means provides a maximum-likelihood estimator for a model in which all clusters have normal distributions with different means and equal spherical covariance matrices.
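For reference, the objective behind this statement (standard material, not spelled out in the answer): K-means minimises the pooled within-cluster sum of squares

$$\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i-\mu_k\rVert^2,$$

where $C_k$ is the set of observations assigned to cluster $k$ and $\mu_k$ is its mean. Treating the assignments as parameters, this is, up to constants, the negative log-likelihood of a model in which every cluster is $N(\mu_k,\sigma^2 I)$ with one common $\sigma^2$: exactly the equal spherical covariance assumption mentioned above.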

This means that, optimally, we should standardise in such a way that all within-cluster variances become the same, but this is not possible, because before we run K-means we don't know what the clusters actually are. In the example given above, the within-cluster variances are already equal before standardisation, and standardisation destroys this.
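An oracle sketch of this point (my construction; the oracle labels are of course unavailable in practice): if the true clusters were known, we could standardise by the pooled within-cluster standard deviation instead of the overall one. In the running example that within-cluster sd is about 1 in both dimensions, so the "ideal" standardisation changes nothing, while ordinary standardisation shrinks the informative dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (n, 2)),   # cluster 1
               rng.normal([2.0, 0.0], 1.0, (n, 2))])  # cluster 2
labels = np.repeat([0, 1], n)  # true memberships: unknown in practice

# Pooled within-cluster variance per dimension, using the oracle labels.
within_var = np.mean([X[labels == k].var(axis=0) for k in (0, 1)], axis=0)
print("pooled within-cluster sd:", np.sqrt(within_var))  # ~[1, 1]: already equal
print("overall sd:              ", X.std(axis=0))        # ~[1.41, 1]: unequal
```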

“Standard” standardisation is basically a tentative substitute for a better thing that we cannot do for lack of information. It may improve matters (because it makes the contributions of potentially wildly different measurements more uniform), but it may also do harm (because the variance that is “standardised away” may actually be helpful for clustering if it arises from the difference between clusters). Note that there are also alternative methods of standardisation, such as scaling to minimum 0 and maximum 1, or using robust statistics (e.g., median and interquartile range) instead of the mean and standard deviation; see the sketch below. I have seen literature arguing that for clustering the min 0, max 1 scaling is often better than classical standardisation; in my experience this is sometimes but not always true, so once more there is no general mathematical argument.
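As a sketch of these variants (the methods are named above; scikit-learn as the implementation and the toy data are my choices):

```python
# Three standardisation variants on a toy matrix with an outlier in column 0.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[ 1.0, 200.0],
              [ 2.0, 400.0],
              [ 3.0, 600.0],
              [50.0, 800.0]])  # the 50.0 is an outlier

print(StandardScaler().fit_transform(X))  # classical: (x - mean) / sd
print(MinMaxScaler().fit_transform(X))    # min 0, max 1: (x - min) / (max - min)
print(RobustScaler().fit_transform(X))    # robust: (x - median) / IQR, less outlier-driven
```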

You will therefore not find a general theory that says “we should always standardise”. Furthermore, since you mention ML techniques more generally: the effect of standardisation, with its advantages and disadvantages, may play out differently for different algorithms. (There are also “scale equivariant” algorithms for which standardisation makes no difference, as they implicitly do it themselves.)
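A toy check of that last remark, with a decision tree as my example of such an algorithm (the answer does not name one): tree splits depend only on the ordering of each variable, which per-variable rescaling preserves, so the fitted model is effectively unchanged.

```python
# A decision tree fitted to raw and to standardised features should give
# identical predictions: its axis-aligned splits depend only on each
# feature's ordering, which standardisation preserves.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # wildly different scales
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

pred_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
Z = StandardScaler().fit_transform(X)
pred_std = DecisionTreeClassifier(random_state=0).fit(Z, y).predict(Z)
print("identical predictions:", np.array_equal(pred_raw, pred_std))  # expected: True
```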

A rough guideline is always to ask yourself:

(a) Are the measurement units/values of the different variables such that differences across variables are not comparable and potentially systematically vastly different?

(b) Does this create a problem for the algorithm? If the answer to both is yes, that is an indication in favour of standardisation.

(c) On the other hand, are there reasons to believe that a larger variance of a variable means that the variable carries more information about what you are interested in (e.g., the clustering)? That is a reason not to standardise (at least as long as the algorithm is sensitive to such differences in variance).


