
“Determining the efficacy of a machine learning model for measuring periodontal bone loss”


The model proposed in this study aimed to automate the diagnosis of radiographic Periodontal Bone Loss (PBL) using a Deep Convolutional Neural Network (DCNN), a task it accomplished with acceptable performance and real-time diagnosis.

The dataset used for this study was particularly small and below the recommended size for training models similar to ours [19]. Study design decisions were made to ensure the best possible performance in the context of limited data access. The first decision was the use of two distinct populations. While the Chilean population of periodontal patients was the original intended subject of study, a second population, in the form of a publicly available dataset from Tufts University, was included due to the difficulty of obtaining a sufficiently large dataset from the original population alone. This decision might have consequences for the model’s performance, as the relevant literature proposes that the dataset should come from the population in which an AI is intended to be used [20]. We propose that any detrimental effect on the model’s capability to generalize within the Chilean population is offset by the fact that including a second population enables the model to exist in the first place. It is also possible to fine-tune the weights as additional Chilean data becomes available.

A clear limitation is that neither of the populations used in this study was representative in nature. The Chilean population originated from a subset of periodontal patients at a specialized care center. Additionally, the demographic characteristics of both populations remained unclear. This is relevant because the dataset largely determines a model’s learned patterns. Given the non-representative nature of the data, the model’s performance retains validity only in relation to these two specific populations, while its performance on external populations remains undetermined. On the same note, considering that the Chilean population comes from a subset of periodontal patients, there is a considerable chance of it being skewed towards over-representing periodontally compromised patients over healthy ones. In the context of automating the diagnosis of a radiographic pathology, achieving a balanced dataset in which the full range of a given disease is uniformly represented is of paramount importance.

The number of radiographs included in the study (500) represented the maximum number that the researchers managed to manually label within the study’s allocated time frame. The size of the dataset constitutes an important limitation on both model training and evaluation, given that, all other conditions being equal, ML models perform better with more extensive datasets [17].

The second significant decision was using panoramic radiographs and molars as the units of study. It was determined in advance that a limited amount of data should be as standardized as possible to promote effective learning by the model. Panoramic radiographs were chosen as the imaging technique due to their inherently standardized capture method, which involves taking images around a fixed occlusal block [21]. Similarly, anterior teeth were excluded to regularize anatomical features across data points, which limits performance and validity in teeth other than molars [21].

Another limitation was the use of a single observer during data labelling, which manifested as Obstructive Factors [Table 1], where the observer could not be sure of the precise location of certain radiographic points. Having two or more observers responsible for data labelling would address this issue and reduce bias. When faced with visually challenging points, multiple observers could make collective decisions, offering two advantages: firstly, no radiographic points would be excluded, utilizing the data more efficiently; secondly, the model’s performance would be enhanced by providing information about precisely those visually demanding points. The use of a sole observer helps explain the performance observed. Nonetheless, the obtained ICC value (0.91) argues for a well-instructed and calibrated undergraduate student.

Still, the main limitation regarding this model’s clinical validity is that it cannot measure PBL on unprocessed panoramic radiographs. As previously reported, the model can only measure PBL on processed cuts containing molar images taken from a panoramic radiograph. Although an attempt was made to perform this task on unprocessed radiographs using another CNN, it was unsuccessful, and the manual pre-processing component served as a workaround for this issue. Considering that this pre-processing phase required between 6 and 8 hours of computing, this limitation would have to be resolved to create a truly automated model. Therefore, a clear direction for future work would be to train a CNN to automatically segment molar cuts from a panoramic radiograph, allowing the existing model to then measure the amount of PBL in those cuts. Such an approach would aid in creating a clinically useful model.
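The proposed two-stage design can be sketched as a simple composition: a segmentation step yields molar cuts, and the existing model measures PBL on each cut. The sketch below is purely illustrative; `segment_molar_cuts` and `measure_pbl` are hypothetical placeholders (here implemented with dummy logic so the sketch runs), not the study’s actual networks.

```python
import numpy as np

def segment_molar_cuts(panoramic):
    """Placeholder for the proposed stage-1 segmentation CNN: return a
    list of molar crops from a full panoramic radiograph. Here we simply
    tile the image into quarters to keep the sketch runnable."""
    h, w = panoramic.shape
    step = w // 4
    return [panoramic[:, i:i + step] for i in range(0, w, step)]

def measure_pbl(cut):
    """Placeholder for the existing PBL-measurement model: return a
    PBL stage label for one molar cut (dummy constant here)."""
    return "1"

def automated_pipeline(panoramic):
    """Chain segmentation and measurement: one stage label per cut."""
    return [measure_pbl(cut) for cut in segment_molar_cuts(panoramic)]

radiograph = np.zeros((128, 256))      # stands in for a panoramic image
print(automated_pipeline(radiograph))  # one PBL stage label per molar cut
```

The key design point is that the expensive manual pre-processing step is replaced by `segment_molar_cuts`, so the existing measurement model needs no retraining.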

Another investigative line worth considering would be to create a model that automates PBL measurement using regression equations, which could then be trained and compared to dentists on a binning basis instead of a categorical-stage basis. This, in turn, would allow further study of the similarity between the model and a human standard in a percentage-by-percentage manner, something our proposed method cannot perform.

Regarding the standardized comparison test, 40 images were drawn at random. This number was chosen as it appeared representative enough of the study population, yet not high enough to impose fatigue on the human participants. As mentioned, this random sampling resulted in an unbalanced testing set, with the expert criteria of both radiologists identifying most instances as light to moderate PBL (Stage 1 = 57.14%, Stage 2 = 35.71%) and only a small minority as severe PBL (Stage 3/4 = 7.14%). We suspect this is because the training dataset did not contain many instances of PBL stage 3/4 to begin with.

The data was analyzed as a binary classification system. This means that even though the model was tested on a question with a multi-class answer (i.e., there is PBL, and it is one of three possible stages: 1, 2, or 3/4), each possible answer class was treated as a separate and independent test (i.e., there are three tests, one per PBL class, and each can either be identified successfully or not). This was done so that the diagnostic indices of true positive, true negative, false positive, and false negative could be calculated in the first place, given that they can only be expressed in binary form. This, in turn, produced a set of diagnostic indices for each PBL class for each participant [Table 3].
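This per-class binarization (often called one-vs-rest) can be sketched as follows; the stage labels and counts below are illustrative only, not the study’s data.

```python
def one_vs_rest_counts(truths, preds, cls):
    """Treat multi-class stage labels as a binary test for one class:
    'cls' counts as positive, every other stage counts as negative."""
    tp = sum(1 for t, p in zip(truths, preds) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(truths, preds) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(truths, preds) if t == cls and p != cls)
    tn = sum(1 for t, p in zip(truths, preds) if t != cls and p != cls)
    return tp, fp, fn, tn

# Illustrative labels only (PBL stages "1", "2", and "3/4")
truth = ["1", "1", "2", "2", "3/4", "1"]
pred  = ["1", "2", "2", "2", "1",   "1"]

# One binary confusion count per PBL class, as described above
for stage in ("1", "2", "3/4"):
    tp, fp, fn, tn = one_vs_rest_counts(truth, pred, stage)
    print(f"Stage {stage}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Note that every prediction contributes to all three binary tests, once as a positive candidate and twice as a negative one, which is why the per-class counts always sum to the number of images.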

As previously stated, it was assumed a priori that the two radiologists would collectively represent the ground truths of the study over the other participants (the periodontists and the general dentist). This assumption arose from the need to determine a set of answers to be treated as known true data points, or ground truths. Given that the studied variable was radiographic PBL measurement, and considering Radiology to be the specialty most closely related to the analysis of radiographic images, the radiologists’ answers were taken as ground truths over those of the periodontists and the general dentist.

Afterwards, performance metrics were calculated (sensitivity, specificity, precision, recall, and F1-score). Sensitivity (equivalent to recall) and precision both express information about the success rate on positive answers (i.e., the capacity to identify the presence of disease); by contrast, specificity informs about the success rate on negative answers (i.e., the capacity to identify the absence of disease), and the F1-score is the harmonic mean of sensitivity and precision. Moreover, given the unbalanced testing set, class averages were further calculated. Given the data, the weighted- and micro-averages stand as the most informative indices, as they take into consideration the observed frequency of each answer class and therefore describe an unbalanced dataset more precisely. By contrast, the macro-average treats all classes as equally prevalent.
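The relationship between these indices and the three averaging schemes can be made concrete with a small sketch. The per-class (TP, FP, FN, TN) counts below are hypothetical, chosen only to show how a rare, undetected class drags the macro-average down while barely moving the weighted- and micro-averages.

```python
def metrics(tp, fp, fn, tn):
    """Binary diagnostic indices from one class' confusion counts."""
    sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity == recall
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return sens, spec, prec, f1

# Hypothetical per-class (TP, FP, FN, TN) counts for three PBL stages;
# stage "3/4" is rare and never detected, mirroring the imbalance above.
counts = {"1": (20, 6, 4, 10), "2": (10, 5, 4, 21), "3/4": (0, 0, 3, 37)}

support = {c: tp + fn for c, (tp, fp, fn, tn) in counts.items()}
total = sum(support.values())

# Macro-average: unweighted mean of per-class sensitivities
macro_sens = sum(metrics(*counts[c])[0] for c in counts) / len(counts)

# Weighted average: per-class sensitivity weighted by class prevalence
weighted_sens = sum(metrics(*counts[c])[0] * support[c] for c in counts) / total

# Micro-average: pool all counts first, then compute once
pooled_tp = sum(c[0] for c in counts.values())
pooled_fn = sum(c[2] for c in counts.values())
micro_sens = pooled_tp / (pooled_tp + pooled_fn)

print(f"macro={macro_sens:.3f} weighted={weighted_sens:.3f} micro={micro_sens:.3f}")
```

With these counts the macro-average (≈0.52) is pulled down by the zero-sensitivity rare class, while the weighted- and micro-averages (≈0.73) are not, which is exactly why the text flags the model’s stage 3/4 failure as understated by those averages.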

Regarding the model’s performance on the time variable, one conclusion is clear: the model was vastly faster than the human participants, achieving real-time diagnosis in practical terms by diagnosing 40 teeth in 0.93 s, or roughly 0.02 s per tooth. The implications of this result are discussed further below.

Regarding the model’s performance for measuring PBL, it obtained acceptable performance for detecting PBL stages 1 and 2 (light to moderate), with values of 0.23 and 0.29 for weighted sensitivity and F1-score, respectively. It showed a slight tendency towards over-diagnosing PBL stage 2, obtaining a precision of 0.17 for that stage. Lastly, it proved incapable of detecting PBL stage 3/4 (severe PBL), obtaining values of 0 for both sensitivity and F1-score in this stage.

Compared to the human controls, the model obtained a similar overall performance to the General Dentist and Periodontist 1, given the comparable values obtained by the model on the weighted and micro averages of sensitivity, precision, and F1-score [Table 6]. However, it performed worse than Periodontist 2 and both Radiologists, given the lower values obtained by the model on those same averages. Among the human controls (periodontists and general dentist), Periodontist 2 performed noticeably better across all performance indices, which can be explained by this participant’s considerably longer trajectory both as a DDS and as a periodontist (Supplemental files).

As previously noted, weighted- and micro-average values are affected in direct proportion to an observed class’ prevalence. Therefore, the model’s overall scores were not significantly affected by its inability to measure PBL stage 3/4, as this class had very low prevalence. However, the fact remains: not being able to detect the most severe cases of a pathology constitutes a serious limitation. Another direction for future research would be to create a new dataset composed exclusively of instances of PBL stage 3/4 to strengthen the model’s performance on this stage.

The results of this study allow for multiple implications. Firstly, we have demonstrated that the method used here is, overall, data efficient. In the context of Machine Learning algorithms like Deep Convolutional Neural Networks, a dataset of 500 original images (panoramic radiographs) is a low number, which can nevertheless work thanks to data augmentation techniques and study design decisions that increase the number of useful data points [17].
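Data augmentation of this kind multiplies each labelled image into several training examples via label-preserving transforms. The NumPy-only sketch below is illustrative: the specific transforms (mirroring, rotations, a brightness shift) are assumptions for demonstration, not a restatement of the study’s actual augmentation pipeline.

```python
import numpy as np

def augment(image):
    """Yield simple geometric and intensity variants of one radiograph
    crop. Illustrative transforms only; a real pipeline for radiographs
    would be chosen to preserve diagnostic validity."""
    yield image                          # the original crop
    yield np.fliplr(image)               # horizontal mirror
    for k in (1, 2, 3):
        yield np.rot90(image, k)         # 90-degree rotations
    yield np.clip(image * 1.1, 0, 255)   # mild brightness change

# A small synthetic grayscale array stands in for one molar cut
crop = np.arange(64, dtype=np.float64).reshape(8, 8)
augmented = list(augment(crop))
print(len(augmented))  # 1 original crop -> 6 training examples
```

Even this minimal set of transforms turns 500 original images into several times as many training points, which is the sense in which augmentation offsets a small dataset.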

A second implication is the demonstrated capacity to automate the measurement of PBL in molars from panoramic radiographs using Machine Learning. Considering Periodontitis’ prevalence, not only in Chile but also globally, coupled with the known significance of early diagnosis and periodic monitoring for at-risk and affected populations [3], the value of this model becomes evident: it could help automate the radiographic analysis of a massively prevalent condition, reducing both the time and human resources required by current workflows.

Moreover, the automation of other radiographic diagnoses has already been reported, such as interproximal caries, peri-implantitis, and even tumours and cysts of the jaw [9, 10, 22]. All these conditions rely heavily on early diagnosis and maintenance to facilitate disease prevention and arrest progression, both aimed at conserving as much healthy tissue as possible. Looking ahead, the development of ML models capable of simultaneously automating the diagnosis of multiple pathologies shows excellent promise. Such models would significantly enhance clinical workflows, enabling practitioners to work more swiftly and precisely, diminishing fatigue and, as research shows, increasing sensitivity to certain pathologies [10]. Further studies are needed to determine the nature and extent of these models’ effects on healthcare workflows and services.

This is not to say that there are no obstacles to work through. As proposed, there are multiple challenges to be addressed, mainly: 1. ensuring data protection and security, given the need for datasets composed of sensitive patient information; 2. gathering and producing sufficiently large and standardized datasets, as they are needed for creating differentiated AI models; and 3. creating and deploying clinically useful models that are transparent, reliable, and unbiased [23]. Resolving these challenges will ensure the long-term success of healthcare-applied AI models that make dental care faster, better, and more widely available.

One final future direction would be to apply this architecture to different features found in and around the dental structure in panoramic radiographs, such as caries, dental restorations, periapical lesions, type of bone defect present in PBL (i.e., vertical or horizontal defects), and presence of furcation lesions.


