So, I’ve been searching for an answer to this for weeks now…
I have an ML.NET binary classification model; I trained it and evaluated it, which gives me a CalibratedBinaryClassificationMetrics object with lots of metrics, but none of them helpful for this.
I want to analyze the predictions used to evaluate the model so that, for any new prediction, I can give a prediction interval: with x% certainty, the real value is between lowerBound and higherBound.
I’ve tried a lot of things, but the most promising one came up today.
I’m not even 100% sure this is valid for regression models, but I want something that works for binary classification models too, so I adapted what I would do for a regression model:
//Real outcomes (1 for true, 0 for false)
float[] actualValues;
//Predictions
float[] predictedValues;
//Number of samples used
int sampleCount = actualValues.Length;
//Number of Independent variables
int nPredictors;
//PRESS - Predicted Residual Sum of Squares
double press = actualValues.Zip(predictedValues, (actual, predicted) => Math.Pow(actual - predicted, 2)).Sum();
//PSE - Predicted Standard Error
double pse = Math.Sqrt(press / (sampleCount - nPredictors));
Then, for 95% confidence:
double predictedValue = model.Predict(obj);
const double zScore95 = 1.96;
double lowerBound = predictedValue - zScore95 * pse;
double higherBound = predictedValue + zScore95 * pse;
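For reference, here is a minimal, self-contained sketch of what I’m doing; actualValues and predictedValues are assumed to come from scoring the evaluation set, and the class/method names (PseInterval, ComputePse, Interval95) are just made up for illustration:
using System;
using System.Linq;

public static class PseInterval
{
    //PSE from observed labels (0/1) and predicted probabilities;
    //nPredictors is the number of independent variables
    public static double ComputePse(float[] actualValues, float[] predictedValues, int nPredictors)
    {
        //PRESS - Predicted Residual Sum of Squares
        double press = actualValues
            .Zip(predictedValues, (actual, predicted) => Math.Pow(actual - predicted, 2))
            .Sum();

        int sampleCount = actualValues.Length;
        return Math.Sqrt(press / (sampleCount - nPredictors));
    }

    //95% interval around a single predicted probability (z = 1.96)
    public static (double lowerBound, double higherBound) Interval95(double predictedValue, double pse)
    {
        const double zScore95 = 1.96;
        return (predictedValue - zScore95 * pse, predictedValue + zScore95 * pse);
    }
}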
Is this a valid approach?
I don’t think it is, for the following reason.
Let’s say we know an event happens 50% of the time, and we have n = 45000 samples with 80 predictors:
int nSamples = 45000;
int nPredictors = 80;
//if real outcome is true: (1 - 0.5)^2 = 0.25
//if real outcome is false: (0 - 0.5)^2 = 0.25
//so, for 45000 samples, we can say
double press = nSamples * 0.25; //= 11250
double pse = Math.Sqrt(press / (nSamples - nPredictors)); //≈ 0.5004450
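Just to sanity-check that arithmetic, feeding synthetic 50/50 data through the ComputePse helper sketched above (again, that helper is only illustrative) gives the same number:
//45000 samples where the event happens half the time, model always predicting 0.5
float[] actual = Enumerable.Range(0, nSamples).Select(i => i % 2 == 0 ? 1f : 0f).ToArray();
float[] predicted = Enumerable.Repeat(0.5f, nSamples).ToArray();
double pseCheck = PseInterval.ComputePse(actual, predicted, nPredictors); //≈ 0.500445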
This seems OK, considering that the outcome is either true or false, and a prediction of 0.5 is always 0.5 away from the real value.
So, for a prediction of 0.5, with zScore95 = 1.96, I should say:
double lowerBound = 0.5 - 1.96 * 0.5004450;  //≈ -0.48087
double higherBound = 0.5 + 1.96 * 0.5004450; //≈ 1.48087
I guess it is technically true to say the real probability lies between -0.48087 and 1.48087, but that’s not what I want to know.
What I want is an interval such that the model is 95% certain the true probability of the event lies inside it.
Should I consider the variability of the model?
Something like the Brier score?
In this case, the Brier score would be:
double brierScore = actualValues.Zip(predictedValues, (actual, predicted) => Math.Pow(predicted - actual, 2)).Sum() / nSamples;
//in this case, 11250 / 45000 = 0.25
That feels more like the “model variability” I’m after, but I don’t know how to build a prediction interval from a Brier score…
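For what it’s worth, writing it out, the Brier score and the PSE carry essentially the same information given the formulas above, which is probably why I keep going in circles:
//pse^2      = press / (nSamples - nPredictors)
//brierScore = press / nSamples
//so with nPredictors << nSamples, brierScore ≈ pse * pse (here 0.25 ≈ 0.5004^2)
double brierFromPse = pse * pse * (nSamples - nPredictors) / nSamples; //= 0.25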
How do I even calculate something like that?
Can you help me? The last few weeks have been frustrating…