Assessing the feasibility of applying machine learning to diagnosing non-effusive feline infectious peritonitis


Dataset and data preparation

Cases submitted to the Veterinary Diagnostic Services (VDS) laboratory between 2001 and 2021 with a suspicion of non-effusive FIP were considered for enrolment in this study. A set of laboratory parameters and case metadata provided by submitting clinicians was abstracted from the VDS Laboratory Information Management System (LIMS). The laboratory data included the following variables measured on blood: anti-FCoV antibody titre, alpha-1-glycoprotein (AGP), total protein, albumin, globulin, albumin:globulin ratio, haemoglobin, red blood cell count, haematocrit, mean corpuscular volume, mean cell haemoglobin, mean cell haemoglobin concentration, total white cell count, band neutrophils, neutrophils, lymphocytes, monocytes, eosinophils, basophils and normoblasts. Demographic data (age, sex and pedigree) and clinical notes, denoted as "reason" on the LIMS, were also collected. Retrospective diagnostic disease classifications, based on expert clinical interpretations, were also collected from the LIMS alongside the laboratory data. A minimum of three clinicians, including at least one clinical pathologist, were involved in the decision-making process, and classifications were based on consensus opinion. The personnel providing the interpretations differed across the years. These interpretations were used as 'ground truth' for classifying whether samples represented FIP cases or not in the training, validation and expert opinion test datasets.

Cases in which expert clinical opinion (based on the signalment, clinical history and laboratory results), following the ABCD FIP diagnostic guidelines15 current at the time of interpretation, indicated an extremely high suspicion of non-effusive FIP were included in the analysis, as were cases where non-effusive FIP was not considered a differential diagnosis. The cases included in the study were ill cats with a wide range of clinical presentations, where FIP was considered a differential diagnosis by the referring clinician and the FIP profile was performed as part of a diagnostic workup. Cases were designated as "high suspicion of FIP" where sufficient criteria within the ABCD guidelines were met to warrant this classification, based on a combination of signalment, history, clinical signs and laboratory data. Similarly, a set of cases was identified where there was a strong suspicion that the cat did not have FIP. The creation of strict case definitions based on a defined set of clinical and laboratory features was avoided, as this is not achievable in line with the balanced interpretation of the ABCD guidelines15 and, importantly, it prevented the creation of artefactual models overfitted to subsets of parameters within the case-defining criteria. Cases in which the interpretation was equivocal were excluded; these included cases where FIP was still considered a differential, but there was insufficient evidence to strongly support or refute an FIP diagnosis. Similarly, cases with incomplete records, where only a subset of tests had been performed (for example, because insufficient material was submitted or an incorrect sample type was provided), were excluded, as some ML algorithms are unable to cope with missing data. Only data from cases representing initial FIP diagnostic submissions were included; re-test and follow-up test data were excluded, as were suspect cases where treatment with antivirals, namely GS-441524, remdesivir or equivalents, had already commenced. Cases where treatment had commenced with palliative drugs, such as steroidal or non-steroidal anti-inflammatories, or with antibiotics were not excluded from the study.

The dataset, comprising cases with expert clinical interpretation, was randomly partitioned into three smaller datasets for modelling purposes, namely "training", "validation" and the "expert opinion test set", comprising 40%, 40% and 20% of the total records, respectively. An additional set of 80 reference cases, with a diagnosis confirmed by histology and/or immunohistochemistry (IHC) and/or PCR, or with an alternative non-FIP diagnosis, was used as a "gold standard" dataset to evaluate the effectiveness of the models. A summary of the signalment and laboratory variables included in the model development and evaluation datasets is detailed in Table 1.
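For illustration, the 40/40/20 split can be reproduced with a simple random partition. The following is a minimal sketch in R, assuming a data frame fip_data holding one row per case; the object names and seed are illustrative, not taken from the study code.

```r
# Minimal sketch of the 40/40/20 random partition (names illustrative).
set.seed(42)                                   # illustrative seed for reproducibility
n   <- nrow(fip_data)
idx <- sample(seq_len(n))                      # shuffle the row indices

training    <- fip_data[idx[1:floor(0.4 * n)], ]
validation  <- fip_data[idx[(floor(0.4 * n) + 1):floor(0.8 * n)], ]
expert_test <- fip_data[idx[(floor(0.8 * n) + 1):n], ]
```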

Table 1 Description of variables used in the training, validation, expert opinion test datasets and definitive diagnosis cases.

Feature selection

Statistical analysis of all variables, comprising correlation and covariance, was undertaken to identify redundancy across the variables within the dataset. In addition, expert guidance was sought from clinical pathologists regarding variables that were potentially redundant from a clinical perspective. Together, this allowed the number of features used in the modelling exercise to be reduced. Model-based feature importance was also examined throughout the modelling process, during both the training and validation stages, using the base learner models: Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), randomForest (rF) and Extreme Gradient Boosting (XGBoost). Where the models allowed, the Gini index was used to assess feature importance. As the Naïve Bayes and SVM model types are not compatible with calculating the Gini index, feature importance for these models was instead assessed by iteratively removing features and measuring the effect on model accuracy.
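As a hedged illustration of both importance strategies, the sketch below derives Gini-based importance from a random forest and outlines a leave-one-feature-out accuracy comparison for an SVM. The outcome column fip and the data frame training are assumed names; the actual tuning settings used in the study are described under "Model selection and building".

```r
library(randomForest)
library(caret)

# Gini-based importance from a random forest (fip is assumed to be a
# two-level factor outcome; 'training' is the training data frame).
rf_fit <- randomForest(fip ~ ., data = training, importance = TRUE)
importance(rf_fit, type = 2)           # type = 2: mean decrease in Gini

# For model types without a Gini measure (e.g. SVM), drop one feature
# at a time and compare resampled accuracy against the full model.
full_fit <- train(fip ~ ., data = training, method = "svmRadial")
for (f in setdiff(names(training), "fip")) {
  reduced_fit <- train(fip ~ ., method = "svmRadial",
                       data = training[, setdiff(names(training), f)])
  delta <- max(reduced_fit$results$Accuracy) - max(full_fit$results$Accuracy)
  message(f, ": accuracy change when removed = ", round(delta, 4))
}
```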

Model selection and building

A range of algorithms, each exhibiting a different underlying mathematical or statistical methodology, was selected for incorporation into the models. This approach sought to test our hypothesis in a comprehensive manner, without bias towards any particular algorithm. Five types of classification algorithm were implemented, namely Logistic Regression, Naïve Bayes, Support Vector Machine, randomForest and Extreme Gradient Boosting.

Binary classification models were trained using the predictor variables listed in Table 1; these were either numerical variables or dummy binary variables (numeric type) coded "0" or "1" ('one-hot encoding') for a specific group. The response variables for the models were also coded as binary variables (numeric type): "0" for cases classified as not-FIP and "1" for cases classified as FIP. All variables, both predictor and response, were coded as numeric data because some algorithms required a numerical matrix as input.
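A minimal sketch of this coding step, using caret's dummyVars for the one-hot encoding; the column names (sex, pedigree, diagnosis) and the numeric_cols vector are illustrative, not from the study code.

```r
library(caret)

# One-hot encode the categorical predictors into 0/1 numeric columns
# (column names here are illustrative).
dv      <- dummyVars(~ sex + pedigree, data = fip_data)
dummies <- predict(dv, newdata = fip_data)

# Combine with the numeric laboratory variables into a numeric matrix,
# and code the response as 1 = FIP, 0 = not-FIP.
X <- as.matrix(cbind(fip_data[, numeric_cols], dummies))
y <- ifelse(fip_data$diagnosis == "FIP", 1, 0)
```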

Models were built in the statistical programming language R (version 4.1.2)34. The caret package (version 6.0-85)35 was used to access the data pre-processing and algorithm functions. Figure 1 illustrates the workflow from data collection through model building and evaluation to final predictions.

Figure 1. Data handling, processing and model building steps. CV, cross-validation.

Two predictive binary classification ensemble models were built employing the algorithms listed previously. The training dataset, used to train all models (both the base learners and the logistic regression models), was pre-processed using caret pre-processing functions. Data were centred and scaled, and the "downSample" function was used to randomly down-sample the majority class so that both outcome classes occurred at the frequency of the minority class (FIP cases in this instance). Data used as input for validation and evaluation were similarly centred and scaled through the model function; however, there was no requirement to down-sample these datasets.
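A hedged sketch of this pre-processing, using the caret functions named above; predictor_cols and the fip outcome column are assumed names.

```r
library(caret)

# Centre and scale the predictors using parameters learned on the
# training data only; the same transform is later applied to the
# validation and test data.
pp           <- preProcess(training[, predictor_cols],
                           method = c("center", "scale"))
train_scaled <- predict(pp, training[, predictor_cols])

# Down-sample the majority class so both classes match the frequency
# of the minority class (FIP cases).
balanced <- downSample(x = train_scaled,
                       y = training$fip,      # two-level factor outcome
                       yname = "fip")
```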

The first approach, the XGBoost ensemble, used one hundred XGBoost base learner models, the predictions of which were aggregated into an input array for a final stacked randomForest predictive model. The second, mixed ensemble comprised one hundred base learner models in total: 25 randomForest, 25 Naïve Bayes, 25 logistic regression and 25 SVM models. As with the XGBoost ensemble, the output predictions of the base learners were aggregated into an array and used as input for a final randomForest ensemble model. The function "caretStack" (from caretEnsemble) was used to build each ensemble random forest model (meta-model), with ten-fold cross-validation repeated ten times; the number of randomly drawn candidate variables at each split (the "mtry" hyperparameter of the random forest algorithm) was forced to 40 to ensure that a representative selection of the base learner predictions was evaluated.
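The stacking step might look like the following hedged sketch, where base_learners is a fitted caretList (see the sketch after the next paragraph) and mtry is forced to 40 as described above; the object names are illustrative.

```r
library(caret)
library(caretEnsemble)

# Stack the base-learner predictions with a random forest meta-model,
# using repeated ten-fold cross-validation.
stack_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

meta_model <- caretStack(base_learners,
                         method    = "rf",
                         trControl = stack_ctrl,
                         tuneGrid  = data.frame(mtry = 40))
```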

Each base learner was trained using ten-fold cross-validation, repeated ten times. The "caretList" function (from caretEnsemble) was used to build the base learners, and a grid of candidate tuning parameters was evaluated for each base learner model using the validation dataset. The model building function automatically selected the best tuning parameters from the parameter grid, and consequently the optimal model, at each iteration; selection was based on the accuracy and kappa of the cross-validation hold-out data. Tuning parameters varied for each base learner; the optimised tuning grids are provided in Supplementary Table S1. An additional mixed ensemble model was built as above but without FCoV titre or AGP as predictive variables; all other parameters remained the same.
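A hedged sketch of the base-learner training with caretList; the methodList entries are caret's identifiers for the five algorithm types named earlier, and the actual tuning grids are those in Supplementary Table S1.

```r
library(caret)
library(caretEnsemble)

# Repeated ten-fold cross-validation; final hold-out predictions are
# retained so the meta-model can be stacked on top.
cv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                        savePredictions = "final")

base_learners <- caretList(fip ~ ., data = balanced,
                           trControl  = cv_ctrl,
                           methodList = c("glm", "nb", "svmRadial",
                                          "rf", "xgbTree"))
```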

Basic logistic regression models

We built two standalone basic logistic regression models as comparators to the more complex ensemble models. Like the logistic regression models included in our ensembles, these did not undergo hyperparameter tuning. One was trained with the same set of predictors as the ensembles and the other with FCoV serology as a single predictor, with ten-fold cross-validation used in each.
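These comparators reduce to plain caret "glm" fits; a minimal sketch follows, assuming fcov_titre is the serology column name (illustrative).

```r
library(caret)

lr_ctrl <- trainControl(method = "cv", number = 10)

# Full-predictor logistic regression (same predictors as the ensembles).
lr_full  <- train(fip ~ ., data = training, method = "glm",
                  family = binomial, trControl = lr_ctrl)

# Single-predictor model using FCoV serology alone.
lr_titre <- train(fip ~ fcov_titre, data = training, method = "glm",
                  family = binomial, trControl = lr_ctrl)
```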

Model evaluation and statistical analysis

Ten-fold cross-validation was undertaken, with nine folds used for training and the remaining fold used to evaluate accuracy during training. Additional evaluation of model performance was undertaken using three subsequent datasets. The validation data were used to fine-tune the base learners and evaluate tuning parameters; these data were therefore not used to evaluate the final models, to avoid information leakage. The final testing was performed using the "expert opinion" test dataset (a partition of expert-interpreted cases) and a group of reference cases ("gold standard") where pathology, histopathology, IHC, PCR or a combination thereof was used to determine a definitive diagnosis of FIP, or an alternative diagnosis was determined.

Models were assessed as though they were a new diagnostic tool; confusion matrices were generated using the model predictions, with the expert-predicted outcome used as the reference. Model performance metrics including accuracy, sensitivity, specificity and inter-rater agreement (Cohen's kappa, κ) were used to compare each model's predictions with the actual outcome. The area under the receiver operating characteristic curve (AUC) was also calculated. The basic logistic regression models were also evaluated in this way. A corrected McNemar test36,37 was used to assess the statistical significance of differences between the sensitivity and specificity of different models.
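These metrics map directly onto caret, pROC and base R calls. The sketch below is illustrative only: object names are assumed, and the 2 × 2 table for McNemar's test pairs each model's per-case correctness.

```r
library(caret)
library(pROC)

# Confusion matrix against the expert classification.
preds <- predict(meta_model, newdata = expert_test)
cm    <- confusionMatrix(data = factor(preds, levels = levels(expert_test$fip)),
                         reference = expert_test$fip,
                         positive  = "FIP")
cm$overall[c("Accuracy", "Kappa")]
cm$byClass[c("Sensitivity", "Specificity")]

# AUC from the predicted class probabilities. Depending on the
# caretEnsemble version, predict(type = "prob") returns a vector of
# positive-class probabilities or a per-class data frame.
probs <- predict(meta_model, newdata = expert_test, type = "prob")
p_fip <- if (is.data.frame(probs)) probs[["FIP"]] else probs
auc(roc(expert_test$fip, p_fip))

# Continuity-corrected McNemar test on paired per-case correctness of
# two models (correct_a and correct_b are logical vectors).
mcnemar.test(table(correct_a, correct_b), correct = TRUE)
```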

All statistical analyses were performed using the statistical programming language R (version 4.1.2)34 with base R and the packages "stats" (version 4.1.2) and "pROC" (version 1.18.0)38. "tidyverse" (version 1.3.1) was used for data preparation, cleaning and graphical output of results. "caret" (version 6.0.90)35 and "caretEnsemble" (version 2.0.1) were used to build models, to predict from trained models and to produce confusion matrices. "caret" is a wrapper for several other packages and has several package dependencies from which the algorithms themselves originate; the dependencies utilised for the algorithms are listed in Supplementary Table S1. Additional packages used throughout are listed in Supplementary Fig. S1. P values < 0.05 were considered statistically significant in all analyses. The RStudio IDE (version 2021.09.1) was used to facilitate the use of the R language.

Ethical approval

This study was authorised by the School of Veterinary Medicine Ethics Committee, University of Glasgow, application number EA46/21.


