
Decision analysis framework for predicting no-shows to appointments using machine learning algorithms


Our results are now discussed in light of the no-show prediction literature, focusing on key aspects of our analytical framework: dataset stratification and folding for model performance assessment, resampling for dataset balancing, the set of features available for analysis and those significant for no-show prediction, and practical implications.

Cross-validation strategies and stratification

Ideally, prediction models are generalizable, i.e., when classifying observations not previously used for training and testing, they should yield performance similar to that observed in the testing sets. We conducted a comprehensive review of 62 studies on no-shows, as detailed in Table S1. To obtain generalizable no-show models, cross-validation strategies are reported in the literature [28, 37, 38, 40, 48, 57,58,59, 69, 70], mostly the random division of datasets into calibration and validation portions, followed by cross-validation of the calibration portion, in which train and test subsets are divided into 5 or 10 folds. In those studies, the validation set remained the same during model training, and a single performance result was obtained for the best model, i.e., no measure of performance dispersion became available. Aladeemy et al. [31] performed cross-validation as in the ten studies above but repeated the process three times on the validation portion, thereby obtaining more than one validation sample. Krishnan and Sangar [21] was the only study reporting folding of the validation set (4 × [5 folds]) using the entire dataset. However, folds were obtained randomly, which may lead to different class frequencies in the calibration and validation sets.

Some authors [3, 26, 29, 34, 35, 49, 71,72,73] divided the complete dataset into train and validation portions using 10-fold partitions. However, considering the results in the “Datasets” section, only 10 simulations may be insufficient to capture the generalizability and stability of the models obtained in the train portion. The SR/RUS model, for example, presented results with variability ranging from 0 to 100% over the 100 replicates, as shown in the sensitivity boxplots in Figs. 2 and 4. If only 10 validation runs were considered, results could be biased for better or worse. By contrast, employing cross-validation in two stages enables 100 simulations across distinct folds, reducing repetition and strengthening the robustness of results. This approach directly addresses generalization and stability issues in predictive models, especially when class frequencies vary substantially. Thus, our strategy represents a meaningful advance in the evaluation of predictive models, supporting their reliability and adaptability in real-world applications.
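For illustration, the sketch below (Python with scikit-learn, on synthetic data) shows one possible two-stage stratified scheme, assumed here as 10 outer validation folds combined with 10 inner calibration folds, yielding 100 sensitivity estimates whose dispersion can be reported. The splitter, classifier, and fold counts are illustrative assumptions, not the exact configuration of our framework.

```python
# Sketch of a two-stage stratified cross-validation scheme
# (assumed here as 10 outer x 10 inner stratified folds = 100 evaluations).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
sensitivities = []
for cal_idx, val_idx in outer.split(X, y):                        # stage 1: calibration/validation
    inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, _ in inner.split(X[cal_idx], y[cal_idx]):      # stage 2: folds within calibration
        model = RandomForestClassifier(random_state=0)
        model.fit(X[cal_idx][train_idx], y[cal_idx][train_idx])
        y_pred = model.predict(X[val_idx])
        sensitivities.append(recall_score(y[val_idx], y_pred))    # sensitivity on held-out fold

# 100 estimates allow dispersion (e.g., boxplots) to be reported, not a single value.
print(np.mean(sensitivities), np.std(sensitivities))
```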

While this technique demands additional computational resources due to the larger number of simulations, technological advancements and parallel processing capabilities can streamline this phase. Moreover, strengthening model generalization and stability is particularly important in sensitive fields such as medicine, finance, and complex event forecasting, where unreliable results could lead to adverse consequences. The method’s adaptability across diverse datasets and contexts further underscores its efficacy, paving the way for versatile and dependable deployment in real-world scenarios.

Finally, some authors [3, 28, 29, 37, 38, 40, 57, 69] reported the stratified partitioning of datasets to ensure that classes in the train, test, and validation portions displayed the same proportions as in the entire dataset. However, no treatment of class imbalance was reported. Implementing proportional class sampling (stratification by class) when splitting the dataset into cross-validation folds, such that the incidence of each class in the folds reflects that of the entire dataset, also leads to more reliable results [37, 38, 40].
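As a minimal illustration of why class stratification matters, the sketch below (synthetic data, scikit-learn) compares the per-fold no-show incidence under a purely random split and a stratified split; the sample size and imbalance ratio are assumptions chosen only for demonstration.

```python
# Sketch contrasting random and class-stratified fold splits; StratifiedKFold
# keeps each fold's no-show incidence close to that of the full dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.93, 0.07], random_state=0)
print("overall no-show rate:", y.mean())

for name, splitter in [("random", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("stratified", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    rates = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, np.round(rates, 3))   # stratified rates stay near the overall rate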

Resampling techniques

The magnitude of imbalance between classes may undermine the predictive power of machine learning algorithms, as the learning stage becomes difficult and favors the adoption of naive approaches to minimize the loss function during the classification process, leading to models that cannot successfully differentiate between classes [21]. The problem escalates as the dataset imbalance increases, with prediction models that generate biased results (i.e., false alarms) that are not usable in practice [21, 28], since the model will tend to classify new cases as belonging to the majority class [21]. Resampling techniques are an alternative to address this problem.
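The sketch below illustrates this failure mode on a synthetic imbalanced dataset: a classifier that always predicts the majority class attains high accuracy yet zero sensitivity for the minority (no-show) class. The 93%/7% class ratio is an assumption chosen only to mimic a typical no-show dataset.

```python
# Sketch of the naive-majority failure mode on an imbalanced dataset:
# accuracy looks high while sensitivity for the no-show (minority) class is zero.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, y_pred))      # ~0.93, looks good
print("sensitivity:", recall_score(y_te, y_pred))     # 0.0, no no-shows detected
```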

In the studies surveyed in Table S1, the no-show rate was lower than the attendance rate in most medical specialties, except in Alshammari et al. [74] and Bhavsar et al. [48]. Resampling techniques, applied before the algorithm’s learning process, are an alternative for dealing with such imbalanced datasets [21]; examples are minority-class oversampling, majority-class undersampling, or combinations of both [35]. Resampling techniques perform differently according to the dataset at hand; the best technique is the one that captures the disparities between classes, resulting in the best prediction performance [35].
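The sketch below, assuming the imbalanced-learn package and a synthetic dataset, shows one example from each of the three families just mentioned (oversampling, undersampling, and a combination); the specific samplers are illustrative and not necessarily those evaluated in our study.

```python
# Sketch of the three resampling families, using imbalanced-learn.
from collections import Counter
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07], random_state=0)
print("original:", Counter(y))

for name, sampler in [("oversampling (SMOTE)", SMOTE(random_state=0)),
                      ("undersampling (RUS)", RandomUnderSampler(random_state=0)),
                      ("combined (SMOTE+Tomek)", SMOTETomek(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))   # class counts after rebalancing
```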

To overcome class imbalance, resampling techniques (mostly undersampling) were applied by eleven of the authors listed in Table S1 [21, 26, 28, 29, 31, 32, 34, 35, 58, 59, 73]. All eleven studies that used resampling techniques displayed sensitivity results of at least 64%. Of the studies that did not apply resampling, 18 reported sensitivity results: five were between 20 and 55% [2, 41, 45,46,47], reflecting a low probability of correctly identifying no-show cases; eleven others reported sensitivity results larger than 64% [1, 20, 23, 27, 39, 69, 74,75,76,77,78], although calculated on a single fold of the validation set, which is likely to yield biased results.

In our proposed framework, we recommend using resampling techniques to minimize imbalance in no-show datasets, followed by controlled stratification and cross-validation to obtain generalizable models. Two other studies adopted a similar strategy. AlMuhaideb et al. [29] divided the complete dataset using 10-fold cross-validation, but the number of folds used for calibration and validation was not reported; random undersampling was the resampling technique adopted. As mentioned earlier, only 10 simulations may be insufficient to produce stable and generalizable models. Nasir et al. [28] randomly divided the dataset into calibration (20%) and validation (80%) portions, and 5-fold cross-validation was applied only to the calibration set. The 20%/80% split may not be the most adequate: according to Srinivas and Salah [57], models obtained using larger training sets are more generalizable. In addition, validation was performed on a single fold.
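A minimal sketch of this recommendation is given below, assuming the imbalanced-learn pipeline interface so that resampling is fitted only on each training fold and never on the held-out fold; the sampler, classifier, and 10 × 10 repeated stratified scheme are illustrative assumptions rather than the exact configuration of our framework.

```python
# Sketch of the recommended pattern: resample only inside each training fold
# (via an imbalanced-learn pipeline) and evaluate on untouched stratified folds.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07], random_state=0)

pipe = Pipeline([("resample", RandomUnderSampler(random_state=0)),
                 ("clf", GradientBoostingClassifier(random_state=0))])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 100 folds
scores = cross_val_score(pipe, X, y, scoring="recall", cv=cv)            # sensitivity
print(scores.mean(), scores.std())
```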

Although stratification strategies were not reported, the combination of resampling and cross-validation techniques was also applied in nine other studies [21, 26, 31, 32, 34, 35, 58, 59, 73]. Considering those in which no-show was the minority class and the AUC was larger than 0.60, the best sensitivity results (0.89) were obtained by Starnes et al. [76] and Joseph et al. [78]. In our study, the combination of IHT and SR yielded sensitivity results greater than 0.94 in the two analyzed datasets, which, to the best of our knowledge, represents the most favorable outcome reported in the literature. It should be noted that, although our study shows this advantage in absolute terms, no specific statistical analyses were performed to confirm significant differences. The presence of overlap and imbalance among classes complicates classification. We believe that the outstanding performance of IHT lies in its ability to identify these challenging instances, allowing for their removal during model training. This results in significant improvements in the separation between classes, directly impacting classification results.
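As an illustration of the IHT mechanism, the sketch below applies imbalanced-learn’s InstanceHardnessThreshold to synthetic overlapping classes; the probabilistic estimator used to score instance hardness is an assumption for demonstration and is not the SR model of our framework.

```python
# Sketch of Instance Hardness Threshold (IHT) undersampling, which removes
# majority-class instances that a probabilistic estimator finds hard to classify.
from collections import Counter
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07],
                           class_sep=0.8, random_state=0)   # overlapping classes

iht = InstanceHardnessThreshold(estimator=LogisticRegression(max_iter=1000),
                                random_state=0)
X_res, y_res = iht.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))      # hard majority cases removed
```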

Significant predictors

We now analyze the set of predictors most frequently selected in modeling the two datasets in our study. The variable present in both datasets and most frequently selected was ‘day of the month’. Variables ‘month of the year’ and ‘age’, frequently selected when using dataset 1, were also selected when using dataset 2, but with lower frequency. On the other hand, variables ‘waiting days’, ‘previous appointments’, and ‘percentage of previous no-shows’, present in both datasets, were selected more frequently only when using dataset 2. Datasets reflect specific cases and, therefore, present different information. For example, frequently selected variables ‘season of the year’, ‘distance to the clinic’, and ‘number of exams with no-show in the previous year’ were available in dataset 1 but not in dataset 2; similarly, the frequently selected variable ‘number of days since previous appointment’ was only available in dataset 2. Prediction quality is dependent on the volume and diversity of the information available in the dataset [10, 26, 35]. As in our work, many authors [3, 5, 22, 26, 27, 33, 79] reported the non-availability of information as a limiting factor for accurate no-show predictions. Furthermore, most predictors displayed importance levels that varied depending on the classification algorithm being tested. According to Nasir et al. [28], that is due to the different processing strategies performed by the algorithms.
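The sketch below illustrates, on synthetic data, how a model-agnostic measure such as permutation importance can rank the same predictors differently under two algorithms; the feature names and models are hypothetical and do not reproduce the variable-selection procedure used in our study.

```python
# Sketch of how predictor importance can differ across algorithms,
# using permutation importance on the same held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

feature_names = ["age", "waiting_days", "previous_no_show_pct", "day_of_month"]  # hypothetical
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    model.fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, scoring="recall",
                                 n_repeats=10, random_state=0)
    ranking = np.argsort(imp.importances_mean)[::-1]
    print(type(model).__name__, [feature_names[i] for i in ranking])  # order may differ
```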

The most frequently retained variables found in our study were consistent with the results found in the literature. For example, age [1,2,3,4,5,6,7,8, 10, 18,19,20, 22,23,24, 26,27,28,29, 31, 34, 37, 38, 40, 45,46,47,48, 57, 75, 76, 80,81,82,83,84,85,86], day of the month [26, 33, 35, 40], month of the year [1,2,3, 7, 18, 26, 48, 80, 82], season of the year [6, 19, 22, 25, 33, 45], distance to the clinic [1,2,3, 5, 6, 8, 10, 18, 23,24,25, 30, 75, 81, 82], previous no-shows [1, 2, 4, 7, 10, 18, 20, 23, 26,27,28, 35, 47, 57, 80,81,82, 85, 87], previous appointments [1,2,3, 10, 19, 20, 40, 81], number of days since previous appointment [26, 28, 32, 33, 35, 40, 48] and waiting days [1, 26,27,28, 31, 32, 35, 37, 38, 40, 48, 49, 79, 80, 83] were predictors associated with no-show in previous studies. The predictor ‘number of exams with no-show in the previous year’, selected in our analysis, was not available in other datasets reported in the literature.

Although different variables were identified as significant predictors of no-show in our and previous studies, results are not always generalizable since no-show is a case-specific phenomenon affected by internal and external factors which may be exclusive to each medical service. For example, gender appears as a significant no-show predictor in the works of Mander et al. [5], who found a higher no-show rate in male individuals, and AlRowaili et al. [30], who found the opposite.

Despite not being entirely generalizable, studies that identify significant no-show predictors in each socioeconomic context might help managers devise compensating strategies to reduce its effects. For example, in Glover et al. [24], no-show was associated with transport barriers faced by low-income patients. To overcome that, patients were directed to more geographically accessible clinics for consultation, partnerships with public and private transport managers were created, and a free transportation program was proposed for the most vulnerable patients.

Practical implications

The annual costs of CT scan examinations for the 557 (6.65%) no-show cases in Dataset 1 translated into an annual financial loss ranging between US$ 12,574.40 and US$ 21,149.18; financial information for Dataset 2 is unavailable. Patient reminders and overbooking are strategies commonly discussed in the literature to minimize the negative impacts of no-shows. Patient reminders, e.g., phone calls, text messages, and e-mails, are used to prevent patients from forgetting their appointments. Robotic auto calls are a low-cost alternative, although not as effective as resource-demanding personalized reminders [22]. Overbooking is a strategy in which more than one patient is scheduled for the same time slot. It potentially increases the system’s revenues by reducing idle time, but it may also lead to problems such as overcrowding and longer patient waiting times [3, 22, 81]. Our study has practical implications, since knowledge of the patients most likely to no-show allows strategies such as reminders and overbooking to be directed at them, optimizing the use of resources.
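As a back-of-the-envelope illustration (not a figure reported in the study), the reported annual loss bounds imply a per-exam cost of roughly US$ 22.58 to US$ 37.97 for the 557 missed CT scans:

```python
# Illustrative check: implied per-exam cost range for the 557 no-show CT scans
# in Dataset 1, derived from the reported annual loss bounds.
no_shows = 557
loss_low, loss_high = 12574.40, 21149.18            # US$, reported annual range
print(round(loss_low / no_shows, 2),                # ~22.58 US$ per missed exam
      round(loss_high / no_shows, 2))               # ~37.97 US$ per missed exam
```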


