Which data subset should be used for interpretable machine learning (IML)?



Of the options presented in the question, the last is the most appropriate for IML: “the best model trained on the full dataset (training + test data) with IML analyzing the full dataset (training + test data)”.

My answer here differs from a well-known answer to a similar question, which makes arguments for using either the test set or the training set for IML. I argue instead that both should be used together.

In statistical analysis, this kind of question does not typically come up: because datasets are usually too small to be split for validation, techniques such as bootstrapping are used to validly carry out all analysis on the entire dataset. In machine learning, by contrast, the data is typically split into a training set (which serves as the validation set for experimentation) and an independent test set, so that the analyst can run all the experimental analyses they like on the validation data and still have a held-out test set on which to evaluate the performance of the resulting models. However, that is only part of the process. It is important to distinguish three key stages of model development in machine learning:

  • Hyperparameter optimization (HPO) of algorithms
  • Selection of the best tuned algorithm
  • Deployment of the production model

In machine learning, the goal is not merely to get “the best model”; the goal is to get a model that can be used in “real life”. That is model deployment. Here’s how the three stages work towards obtaining that real-life model.

First, although we talk about “tuning models”, we must be clear that the purpose of HPO is not to produce a model that is tuned on any particular subset of data. Its goal is to find the optimal set of hyperparameters (HPs, algorithm settings) that produces the best performance on a given dataset. Suppose we have three candidate algorithms A, B, and C (e.g., a neural network, a gradient boosted tree, and a generalized additive model). Each of these has specific HPs that control its performance. We use the validation set to try various HPs for each algorithm to see which HPs work best for that data. (Here, the validation set is typically further subdivided with cross-validation or a further tuning/evaluation split, but let’s not get into that.) Although HPO trains and evaluates many models along the way, its useful result is not any of these models; its result is the hyperparameter settings that work best for each algorithm on the data.
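
To make this concrete, here is a minimal sketch in Python, assuming scikit-learn, a generic classification task, and two illustrative candidate algorithms. The algorithm choices, grids, and dataset are my assumptions for illustration, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative dataset; split off the independent test set up front.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidate algorithms, each with its own hyperparameter grid.
candidates = {
    "gbt": (GradientBoostingClassifier(random_state=0),
            {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
    "logreg": (LogisticRegression(max_iter=1000),
               {"C": [0.1, 1.0, 10.0]}),
}

# HPO: cross-validate each grid on the training/validation data only.
# The useful output is the best hyperparameters, not any fitted model.
best_hps = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    best_hps[name] = search.best_params_
```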

Second, once we have algorithms A, B, and C, each with its optimal HPs, we train each algorithm on the validation set. The goal of this second step is still not to produce our final model. Its goal is to estimate the performance that we expect each algorithm to deliver on future, unseen data. By evaluating that performance on the test set, which is independent of the validation set on which the models were trained, we obtain a minimally biased estimate of each algorithm’s expected future performance. Based on these results, using whichever performance metrics we consider appropriate, we select the best-performing algorithm with its optimal HPs. (This step is usually called “model selection”, but that is not quite accurate: we are selecting an algorithm plus its HPs, not a specific trained model.)
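
Continuing the sketch above (same assumed variable names), the selection step might look like this:

```python
from sklearn.base import clone
from sklearn.metrics import accuracy_score

# Train each tuned algorithm on the training/validation set and score it
# on the held-out test set to estimate expected future performance.
test_scores = {}
for name, (estimator, _) in candidates.items():
    model = clone(estimator).set_params(**best_hps[name])
    model.fit(X_train, y_train)
    test_scores[name] = accuracy_score(y_test, model.predict(X_test))

# What we select is the algorithm plus its HPs, not a fitted model.
best_name = max(test_scores, key=test_scores.get)
```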

Third, once we have selected the best algorithm with its HPs, we are ready to deploy the model for real-life operation. For this, we want the best possible model we can get. So, we must train the algorithm on the ENTIRE dataset: training set, test set, everything. Only a model trained on all the available data can give us the best possible performance. This deployed model is the best representation of our data that our analysis can produce.
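
In the same sketch, deployment then amounts to refitting the winning algorithm on everything:

```python
import numpy as np
from sklearn.base import clone

# Deployment: retrain the selected algorithm, with its optimal HPs,
# on ALL available data (training/validation + test).
X_full = np.vstack([X_train, X_test])
y_full = np.concatenate([y_train, y_test])

estimator, _ = candidates[best_name]
deployment_model = clone(estimator).set_params(**best_hps[best_name])
deployment_model.fit(X_full, y_full)
```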

For IML, then, we use this best model, trained on the full dataset, and have IML analyze that same full dataset. IML tries to characterize our data as accurately as possible: the deployment model is the best available model of the data, and the full dataset is the most complete representation of the data (both past data and plausible future data).
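
Any IML method can be applied to the deployment model over the full dataset at this point. As one illustrative choice (my assumption, not a method prescribed by the post), permutation feature importance from scikit-learn:

```python
from sklearn.inspection import permutation_importance

# IML on the deployment model, analyzing the FULL dataset
# (X_full, y_full from the deployment step above).
result = permutation_importance(
    deployment_model, X_full, y_full, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean importance per feature
```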


