
Machine learning based outcome prediction of microsurgically treated unruptured intracranial aneurysms



Ethics board approval was obtained from the local ethics committee (JKU-Ethikkommission, EC-No.: 1255/2019) prior to data acquisition. All patients or their legal representatives gave informed consent to the surgical procedures and to the study, which was conducted in accordance with the Declaration of Helsinki.

Every unruptured intracranial aneurysm (UIA) of the anterior circulation that was microsurgically treated between January 2002 and December 2020 at the Department of Neurosurgery, Kepler University Hospital Linz, was added to the retrospectively collected registry.

The microsurgical operations were all performed using standard approaches and a compilation of the technical intraoperative parameters is shown in Table 1.

Table 1. Intraoperative parameters. mRS = modified Rankin Scale; pnND = permanent new neurological deficit; GOS = Glasgow Outcome Scale; tnND = transient new neurological deficit; mRS-Diff > 1 = mRS difference > 1 (preoperative vs. postoperative).

Preoperative parameters

Preoperative parameters were divided into patient- and aneurysm-specific parameters and constituted the input variables for the ML algorithms. Patient-specific parameters consisted of basic demographic parameters (age and sex), parameters concerning personal medical history (earlier SAH, hypertension, diabetes mellitus, body mass index (BMI), autosomal dominant polycystic kidney disease, chronic obstructive pulmonary disease, previous stroke, psychiatric disorder, smoking, alcohol abuse, familial frequency of aneurysms), and preoperative scores (PHASES-Score17, ASA-Score18 (American Society of Anesthesiologists), and modified Rankin Scale (mRS)19).

Aneurysm-specific parameters included aneurysm location, calcification, neck diameter, maximum diameter, side, size of the parent vessel, morphology, and the occurrence of multiple aneurysms. Preoperative aneurysm-related symptoms such as cranial nerve deficits, epileptic seizures, or aneurysm-related thromboembolic events were also recorded.

Outcome parameters

Prediction models were calculated for the postoperative outcome parameters. Digital subtraction angiography was performed in every patient to assess complete aneurysm occlusion. New postoperative neurological deficits (nND) were surveyed and divided into transient and permanent nND; a permanent nND persisted after hospital discharge. Functional outcome was assessed using the Glasgow outcome scale (GOS)20, the mRS19, and the difference between preoperative and postoperative mRS. An mRS score of > 2 or a GOS of < 5 was defined as a poor outcome20,21. A worsening in mRS of more than one point (postoperative compared to preoperative) was regarded as functional deterioration.
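The binary outcome labels follow directly from these definitions. A minimal sketch, assuming a pandas DataFrame with hypothetical column names mrs_pre, mrs_post, and gos_post (not the registry's actual field names):

```python
# Hypothetical label construction for the two functional outcome targets.
import pandas as pd

def outcome_labels(df: pd.DataFrame) -> pd.DataFrame:
    labels = pd.DataFrame(index=df.index)
    # Poor outcome: mRS > 2 or GOS < 5 postoperatively.
    labels["poor_outcome"] = (df["mrs_post"] > 2) | (df["gos_post"] < 5)
    # Functional deterioration: mRS worsens by more than one point.
    labels["functional_deterioration"] = (df["mrs_post"] - df["mrs_pre"]) > 1
    return labels
```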

Statistical analysis

Statistical analysis included a univariate descriptive analysis of the collected input and output variables. In addition, an unpenalized logistic regression (LR) model was trained on all available features as a simple baseline to quantify the benefit of sophisticated hyperparameter tuning and complex model classes22.
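Such a baseline can be expressed in a few lines of scikit-learn. This is a sketch assuming already preprocessed feature matrices X_train/X_test and binary labels y_train, not the authors' exact code; with scikit-learn 0.24, penalty="none" disables regularization.

```python
# Unpenalized logistic-regression baseline on all available features.
from sklearn.linear_model import LogisticRegression

baseline = LogisticRegression(penalty="none", max_iter=1000)
baseline.fit(X_train, y_train)                           # all features, no tuning
baseline_scores = baseline.predict_proba(X_test)[:, 1]   # probabilities for ROC-AUC
```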

Train-test split

The data were split into training and testing sets. To simulate prospective validation and obtain reliable estimates of the predictive performance for future patients, we opted for a temporal split, in which the training set consisted of all data up to and including the year 2018 and the test set consisted of all remaining data from 2019 and 2020. This led to a train-test ratio of 81:19, or 380 vs. 86 samples. Although a single patient can occur multiple times in the data with different aneurysms, ensuring that all of a patient's samples fall entirely within either the training set or the test set was not considered necessary, because these samples can safely be assumed to be independent of each other.
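A minimal sketch of such a temporal split, assuming a pandas DataFrame df with a hypothetical surgery_year column:

```python
# Temporal train-test split: training data up to and including 2018,
# test data from 2019 and 2020.
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str = "surgery_year", cutoff: int = 2018):
    train = df[df[year_col] <= cutoff]
    test = df[df[year_col] > cutoff]
    return train, test

# train_df, test_df = temporal_split(df)   # here: 380 vs. 86 samples (81:19)
```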

Machine learning algorithms and model selection

A range of ML models was trained on the training set and evaluated on the test set, including extreme gradient boosting estimators (XGB), random forests (RF), extremely randomized trees (ET), support vector machines (SVM), k-nearest neighbor classifiers (KNN), generalized additive models (GAM), multilayer perceptrons (MLP), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) models. This diverse set of algorithms was chosen to maximize the chance of finding the best-performing algorithm for each outcome. Tree-based algorithms such as random forests are known to work well on tabular data, but simpler algorithms were also included to reduce the risk of overfitting on the small data set.
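A sketch of the candidate model families using scikit-learn and xgboost is shown below; the default settings are for illustration only, since the actual hyperparameters were tuned as described in the next paragraph, and GAMs (available, e.g., via the pygam package) are omitted from this sketch.

```python
# Candidate classifiers evaluated for each outcome (illustrative defaults).
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from xgboost import XGBClassifier

candidates = {
    "XGB": XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "SVM": SVC(probability=True),      # probability estimates needed for ROC-AUC
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
```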

The hyperparameters of these models were optimized using recent techniques of Bayesian optimization and meta-learning, as implemented in the auto-sklearn package for Python23. Hyperparameter optimization not only included finding an optimal model instance but also selecting the optimal preprocessing steps, particularly the class balancing strategy (balancing with respect to class frequencies vs. no balancing), imputation strategy (mean vs. median imputation for numerical features, most frequent for categorical features), and feature selection. The area under the receiver operating characteristic curve (ROC-AUC) served as the optimization objective because this metric is widely used to illustrate the discriminative power of a binary classifier. Preliminary experiments suggested that optimizing the average precision (AP) did not lead to better overall results. The ROC-AUC was calculated on five predefined train-validation splits of the original training data, where the validation sets were not pairwise disjoint and were biased towards more recent samples from 2017 and 2018, to account for the temporal train-test split. Preliminary experiments suggested that this form of validation was superior to standard k-fold cross-validation.
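The validation scheme can be illustrated independently of auto-sklearn's Bayesian optimizer. The following sketch only shows the idea of scoring ROC-AUC on five predefined, possibly overlapping, recency-biased validation sets; the function name, the bias factor, and the use of plain scikit-learn scoring are assumptions for illustration and not the authors' implementation.

```python
# Five predefined (train, validation) index splits whose validation sets
# oversample the most recent training years (2017/2018) and may overlap.
import numpy as np
from sklearn.model_selection import cross_val_score

def recency_biased_splits(years: np.ndarray, n_splits: int = 5, seed: int = 0):
    rng = np.random.default_rng(seed)
    idx = np.arange(len(years))
    weights = np.where(years >= 2017, 3.0, 1.0)   # assumed oversampling factor
    weights /= weights.sum()
    for _ in range(n_splits):
        val_idx = rng.choice(idx, size=len(idx) // 5, replace=False, p=weights)
        yield np.setdiff1d(idx, val_idx), val_idx

# ROC-AUC on the predefined splits (cv accepts an iterable of index pairs):
# aucs = cross_val_score(model, X_train, y_train,
#                        cv=list(recency_biased_splits(train_years)),
#                        scoring="roc_auc")
```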

In addition to ROC-AUC and AP, we also reported threshold-dependent performance metrics (such as accuracy and sensitivity) on the test set. Analogous to Staartjes et al., the decision thresholds were chosen according to the closest-to-(0, 1) criterion on the training set15,24. These metrics were included only for completeness: because they depend strongly on the chosen decision threshold, and many different threshold selection strategies exist, comparing them between studies requires caution. The ROC-AUC is more robust in this respect and was therefore chosen as the main performance metric.
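The closest-to-(0, 1) criterion selects the threshold whose ROC point (FPR, TPR) lies nearest to the ideal corner (0, 1). A minimal sketch, assuming training labels and predicted probabilities are available:

```python
# Threshold selection on the training set via the closest-to-(0, 1) criterion.
import numpy as np
from sklearn.metrics import roc_curve

def closest_to_01_threshold(y_true, y_score) -> float:
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    distance = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)   # distance to the (0, 1) corner
    return thresholds[np.argmin(distance)]

# threshold = closest_to_01_threshold(y_train, model.predict_proba(X_train)[:, 1])
# y_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
```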

To estimate the variance of the performance metrics, we fixed the hyperparameters, trained models on 100 bootstrap resamples of the original training set, and evaluated each of them on the test set25. The decision threshold was calculated for each of these models individually.
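A sketch of this bootstrap procedure, assuming numpy arrays and a scikit-learn-compatible model with fixed hyperparameters (not the CaTabRa implementation):

```python
# Refit the fixed-hyperparameter model on 100 bootstrap resamples of the training
# set and evaluate each refit on the unchanged test set.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def bootstrap_test_scores(model, X_train, y_train, X_test, y_test,
                          n_boot: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = len(y_train)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        refit = clone(model).fit(X_train[idx], y_train[idx])
        # (the decision threshold would be recomputed for each refit, as described above)
        scores.append(roc_auc_score(y_test, refit.predict_proba(X_test)[:, 1]))
    return np.asarray(scores)
```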

Python 3.9.726 with scikit-learn 0.24.227, xgboost 1.5.028, pandas 1.4.129, and auto-sklearn 0.14.623 was used for all analyses, accessed through the open-source CaTabRa framework30. ML models were compared to LR models using the Mann–Whitney U-test.
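The comparison itself reduces to a two-sided Mann–Whitney U-test on the bootstrap score distributions; a sketch using SciPy (an additional dependency assumed here for illustration):

```python
# Compare the bootstrap ROC-AUC distributions of an ML model and the LR baseline.
from scipy.stats import mannwhitneyu

# ml_scores and lr_scores: arrays of bootstrap test-set ROC-AUC values.
stat, p_value = mannwhitneyu(ml_scores, lr_scores, alternative="two-sided")
```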

Feature importance

The SHapley Additive exPlanations (SHAP) framework was used to determine the relevance of individual features to each model and thereby gain insights into the inner workings of otherwise opaque prediction models31. In contrast to simpler explanation techniques, such as permutation importance, SHAP also considers interactions between multiple features.
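A hedged sketch of such an analysis for one of the tree-based models (model and data names are placeholders; other model classes would need, e.g., shap.KernelExplainer):

```python
# Global SHAP feature importance for a fitted tree-based model.
import shap

explainer = shap.TreeExplainer(fitted_tree_model)    # e.g. an XGB or RF model
shap_values = explainer.shap_values(X_test)          # per-sample, per-feature attributions
shap.summary_plot(shap_values, X_test)               # aggregate importance overview
```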

External validation

We evaluated our models on a retrospectively collected registry from the Department of Neurosurgery of the University Medical Centre Hamburg-Eppendorf, Germany. The registry covered the years 2016 to 2020 and contained the same pre- and postoperative parameters as our internal data set, except for new neurological deficits. A statistical analysis was performed to identify differences in the distributions of the two data sets, focusing on parameters that were deemed important by the SHAP feature importance analysis. The variance of the performance metrics was estimated using the same models that were used for estimating the variance on the internal test set.


