
Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT



1. Introduction

Artificial intelligence (AI) is revolutionizing healthcare by improving clinical diagnosis, administration, and public health infrastructure. AI applications in healthcare include disease diagnosis, drug discovery, assisted surgeries, and patient care. AI can enhance healthcare outcomes, reduce costs, and optimize treatment planning [1]. However, challenges remain, including ensuring ethical boundaries, addressing bias in AI algorithms, and maintaining diversity, transparency, and accountability in algorithm development. AI is not meant to replace doctors and healthcare providers but to complement their skills through human-AI collaboration. The human-in-the-loop approach, in which AI systems are guided and supervised by human expertise, ensures safety and quality in healthcare services.

The rise of large language models (LLMs) in AI has garnered significant attention and investment from companies such as Google, Amazon, Facebook, Tesla, and Apple. LLMs, such as OpenAI’s GPT series and ChatGPT, have shown remarkable progress in tasks like text generation, language translation, and question answering. These models are trained on massive amounts of data and have the potential to display capabilities beyond their primary task of predicting the next word in a text.

LLMs have the potential to revolutionize healthcare by assisting medical professionals with administrative tasks, improving diagnostic accuracy, and engaging patients [2]. LLMs, such as GPT-4 and Bard, can be implemented in healthcare settings to facilitate clinical documentation, obtain insurance pre-authorization, summarize research papers, and answer patient questions [3]. They can generate personalized treatment recommendations, laboratory test suggestions, and medication prompts based on patient information [4]. It is essential to ensure LLMs’ responsible and ethical use in medicine and healthcare, considering privacy, security, and the potential for perpetuating harmful, inaccurate, or race-based content [5]. LLMs such as ChatGPT can also accelerate the creation of clinical practice guidelines by quickly searching and selecting evidence from numerous databases [6].

KakaoBrain AI for Radiology Assistant Chest X-ray (KARA-CXR) is a new medical technology that assists with radiological diagnosis. Developed by leveraging the cutting-edge capabilities of large-scale artificial intelligence and advanced language models, this cloud-based tool represents a significant leap in medical imaging analysis. The core functionality of KARA-CXR lies in its ability to generate detailed radiological reports that include findings and conclusions. This process is facilitated by its sophisticated AI, which has been trained on vast datasets of chest X-ray images. By interpreting these images, KARA-CXR can provide accurate and swift diagnostic insights that are essential in clinical decision-making.

Based on the GPT-4V architecture, ChatGPT has potential in the medical field, especially for interpreting chest X-ray images. This language model can analyze medical images, including chest X-ray data, to generate human-like reading reports. Although not yet available for clinical use, ChatGPT has the potential to improve the diagnostic process by providing a general interpretation of chest X-rays, especially in settings with limited access to radiology expertise [7]. In this study, we analyze the diagnostic accuracy and utility of KARA-CXR and ChatGPT and discuss their potential for use in clinical settings.
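
As a concrete illustration of how a chest X-ray image can be submitted to a GPT-4V-class model, the following is a minimal sketch using the OpenAI Python SDK. The model name, file name, and prompt wording are assumptions for illustration only; they are not the prompts or workflow used in this study.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local chest X-ray image (illustrative file name) as base64.
with open("chest_xray.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Send the image together with a text prompt to a vision-capable model.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the notable findings in this chest X-ray image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```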

4. Discussion

AI is being increasingly used in the field of chest X-ray reading. It has various applications, including lung cancer risk estimation, detection, and diagnosis; reducing reading time; and serving as a second ‘reader’ during screening interpretation [9]. Doctors in a single hospital reported positive experiences and perceptions of using AI-based software for chest radiographs, finding it useful in the emergency room and for detecting pneumothorax [10]. A model for automatic diagnosis of different diseases based on chest radiographs using machine learning algorithms has been proposed [11]. In a multicenter study, AI was used as a chest X-ray screening tool and achieved good performance in detecting normal and abnormal chest X-rays, reducing turnaround time, and assisting radiologists in assessing pathology [12]. AI solutions for chest X-ray evaluation have been demonstrated to be practical, perform well, and provide benefits in clinical settings [13].
However, conventional labeling-based chest X-ray reading AI has limitations in terms of accuracy and efficiency. The manual labeling of large datasets is expensive and time-consuming. Automatic label extraction from radiology reports is challenging due to semantically similar words and missing annotated data [14]. In a multicenter evaluation, an AI algorithm for chest X-ray analysis showed lower sensitivity and specificity during prospective validation than during retrospective evaluation [15]. Even so, the AI model performed at the same level as, or slightly worse than, human radiologists in most regions of the ROC curve [15]. A method for standardized automated labeling based on similarity to a previously validated, explainable AI model-derived atlas has been proposed to overcome these limitations. Fine-tuning the original model using automatically labeled exams can preserve or improve performance, resulting in a highly accurate and more generalized model.
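
To illustrate why automatic label extraction from free-text radiology reports is brittle, the toy sketch below uses simple keyword and negation rules. It is illustrative only; published labelers (e.g., the CheXpert labeler) rely on far richer rule sets with uncertainty handling, and this is not the method of any work cited here.

```python
import re

# Toy keyword rules for a few findings; label names are illustrative.
RULES = {
    "pneumothorax": r"\bpneumothorax\b",
    "pleural_effusion": r"\bpleural effusion\b",
    "cardiomegaly": r"\b(cardiomegaly|enlarged cardiac silhouette)\b",
}
# A finding preceded by a negation cue within the same sentence is treated as absent.
NEGATION = r"\b(no|without|negative for)\b[^.]*"

def extract_labels(report: str) -> dict:
    """Assign 1/0 labels to a free-text report using keyword matching."""
    text = report.lower()
    labels = {}
    for label, pattern in RULES.items():
        mentioned = re.search(pattern, text) is not None
        negated = re.search(NEGATION + pattern, text) is not None
        labels[label] = int(mentioned and not negated)
    return labels

print(extract_labels("No pneumothorax. Small left pleural effusion is noted."))
# {'pneumothorax': 0, 'pleural_effusion': 1, 'cardiomegaly': 0}
```

Even on this tiny example, the negation rule only works because the negated finding sits in the same sentence; paraphrases such as "lungs are clear" or hedged phrasing such as "cannot exclude effusion" would already defeat these rules, which is exactly the kind of semantic variability that makes automatic label extraction difficult.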
The effectiveness of deep learning-based computer-aided diagnosis has been demonstrated in disease detection [16]. However, one of the major challenges in training deep learning models for medical purposes is the need for extensive, high-quality clinical annotation, which is time-consuming and costly. Recently, CLIP [17] and ALIGN [18] have shown the ability to perform vision tasks without task-specific supervision. However, vision-language pre-training (VLP) in the CXR domain still lacks sufficient image-text datasets because many public datasets consist of image-label pairs with differing class compositions.
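
For readers unfamiliar with vision-language pre-training, the following is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) objective used by CLIP-style models. The image and text encoders that would produce the embeddings are omitted, and the batch here is random data for illustration only.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i is text i (diagonal targets).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image
    return (loss_i2t + loss_t2i) / 2

# Example with a random batch of 8 paired 512-dimensional embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```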
The use of large language models for medical image reading has gained significant attention in recent research [19]. Language models have been explored to improve various tasks in medical imaging, such as image captioning, report generation, report classification, finding extraction, visual question answering, and interpretable diagnosis. Researchers have highlighted the potential benefits of accurate and efficient language models in medical imaging analysis, including improving clinical workflow efficiency, reducing diagnostic errors, and assisting healthcare professionals in providing timely and accurate diagnoses [20].

KARA-CXR is an innovative cloud-based medical technology that utilizes artificial intelligence and advanced language models to revolutionize radiological diagnostics. It operates over the web and offers a user-friendly interface for healthcare professionals. KARA-CXR generates detailed radiological reports with findings and conclusions by analyzing chest X-ray images uploaded in DICOM format. This is made possible by its sophisticated AI, which has been trained on vast datasets of chest X-ray images. The technology provides accurate and swift diagnostic insights, aiding radiologists in ensuring precise diagnoses and reducing report generation time. KARA-CXR is particularly valuable in high-volume or resource-limited settings where radiologist expertise may be scarce or overburdened.
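
KARA-CXR’s internal pipeline is not publicly documented, so the sketch below only illustrates the kind of DICOM-to-image preprocessing a reader might perform before uploading or reviewing a chest radiograph, using the pydicom library; the file names are placeholders.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dicom_path: str, png_path: str) -> None:
    """Read a chest X-ray DICOM and save an 8-bit grayscale PNG."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)

    # Rescale the raw detector values to the full 0-255 range.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    img = (pixels * 255).astype(np.uint8)

    # Some X-ray detectors store inverted grayscale (MONOCHROME1).
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        img = 255 - img

    Image.fromarray(img).save(png_path)

dicom_to_png("example_cxr.dcm", "example_cxr.png")
```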

In this study, ChatGPT, based on the GPT-4V architecture, showed some potential in interpreting chest X-ray images but also revealed some limitations. ChatGPT can generate human-like diagnostic reports from chest X-ray data owing to the extensive training, including reinforcement learning, on the medical text and imaging data included during its development. However, because that training relies largely on information openly available on the internet, we must recognize that the outputs generated by ChatGPT do not guarantee medical expertise. In conclusion, it is essential to note that ChatGPT is not a substitute for professional medical advice, diagnosis, or treatment [21].
In our study, detailed observation of the reports indicates that KARA-CXR generally outperforms ChatGPT across various categories of diagnostic accuracy, with consistently higher percentages and interobserver agreement rates. The data suggest a significant discrepancy between the two systems, with KARA-CXR displaying more reliable and accurate performance according to the observers’ evaluations. Particularly in terms of hallucination, KARA-CXR demonstrated superior performance compared to ChatGPT. ChatGPT sometimes produced incorrect interpretations, including hallucinations, even in cases with clinically significant and obvious abnormalities such as pneumothorax (Figure 8).
In our comparative analysis between KARA-CXR and ChatGPT, a striking advantage of KARA-CXR was observed with respect to hallucinations. Notably, KARA-CXR demonstrated a significantly higher non-hallucination rate of 75%, compared with only 38% for ChatGPT, as agreed upon by both observers. This substantial difference underscores the superior capability of KARA-CXR in providing reliable and accurate interpretations in chest X-ray diagnostics, a crucial aspect of medical imaging, where the precision of diagnosis can significantly impact patient outcomes. The propensity of ChatGPT to generate more hallucinations in medical contexts can be attributed to its foundational design and training methodology. As a large language model, ChatGPT is trained on a vast corpus of text from diverse sources that is not specifically tailored for medical diagnostics. This generalist approach, while versatile, can lead to inaccuracies and hallucinations, especially in highly specialized fields like medical imaging [22]. Despite its potential, the accuracy and reliability of ChatGPT responses should be carefully assessed, and its limitations in understanding medical terminology and context should be addressed [23]. In contrast, KARA-CXR, designed explicitly for medical image analysis, benefits from a more focused training regime, enabling it to discern nuanced details in medical images more effectively and reducing the likelihood of generating erroneous interpretations.
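
For clarity on how such rates can be derived, the sketch below computes the observed agreement, the proportion of cases both observers rated as non-hallucinated, and Cohen’s kappa from two observers’ per-case ratings. The ratings shown are invented placeholder values, not the study data, and the study reports percentages rather than this exact script.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-case ratings (1 = no hallucination, 0 = hallucination);
# these values are illustrative only.
observer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
observer_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

agreed = sum(a == b for a, b in zip(observer_1, observer_2))
both_non_hallucination = sum(a == 1 and b == 1 for a, b in zip(observer_1, observer_2))

print(f"Observed agreement: {agreed / len(observer_1):.0%}")
print(f"Non-hallucination rate (agreed by both observers): {both_non_hallucination / len(observer_1):.0%}")
print(f"Cohen's kappa: {cohen_kappa_score(observer_1, observer_2):.2f}")
```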
In our exploration of ChatGPT’s application to medical imaging, particularly chest X-ray interpretation, a notable limitation emerged that merits explicit mention. ChatGPT, in its current design, is programmed to refuse direct requests for the professional interpretation of medical images, such as X-rays [8]. This usage policy and ethical boundary, built into ChatGPT to avoid the non-professional practice of medicine, significantly impacts its clinical application in this context. In the initial phase of our study, we observed that direct prompts requesting chest X-ray interpretation were consistently declined by ChatGPT, in line with its programming to avoid assuming the role of a radiologist or other medical professional. This limitation is critical to understand for any future research utilizing ChatGPT or similar language models in medical image interpretation.

Despite the impressive capabilities of AI in healthcare, exemplified by KARA-CXR and ChatGPT, hallucinations can cause serious problems in real-world clinical applications of AI. Such hallucinations may be of minimal consequence in casual conversation or other contexts but pose significant risks in the healthcare sector, where accuracy and reliability are of paramount importance. Misinformation in the medical domain can lead to severe consequences for patient care and outcomes. The accuracy and reliability of the information provided by language models can be a matter of life or death, posing real risks because it could affect healthcare decisions, diagnoses, and treatment plans. Hence, the development of methods to evaluate and mitigate such hallucinations is not just of academic interest but of practical importance.

While promising, the integration of software as a medical device (SaMD), including KARA-CXR and ChatGPT, into medical diagnostics faces several challenges that must be addressed in future research. A primary concern is the diversity of the data used to train these AI models. AI systems are often trained on datasets that adequately represent only some population groups, leading to potential biases and inaccuracies in diagnostics, particularly for underrepresented demographics. Moreover, the “black box” nature of these AI systems poses a significant challenge. The internal mechanisms by which they analyze and interpret chest X-ray images are only partially transparent, making it difficult for healthcare professionals to understand, and thus trust, the conclusions drawn by these technologies.

Another notable limitation is the integration of these AI tools into clinical practice. Healthcare professionals may be hesitant to depend on AI for critical diagnostic tasks due to concerns about the accuracy and reliability of these systems, as well as potential legal and ethical implications. Building trust in AI technologies is essential for their successful adoption in medical settings [24]. In addition to these concerns, it is essential to keep in mind that even if an AI-powered diagnostic solution is highly accurate, the final judgment should still be made by a medical professional, that is, a doctor.
To overcome these challenges, future research should focus on enhancing the diversity of training datasets, including a broader range of demographic data, to ensure that AI models can deliver accurate diagnostics across different populations [25]. It is also crucial to improve the transparency and explainability of AI algorithms, developing methods to demystify the decision-making process and increase acceptability and trustworthiness among medical practitioners [26].

Although this paper evaluates the diagnostic accuracy of two potential SaMDs, ChatGPT and KARA-CXR, one limitation is that there is currently no clear rationale or recommendation, either legal or academic, for evaluating or approving such software. The difficulty in approving software for medical use stems from the absence of a clear definition of SaMD, which makes it difficult to create standards and regulations for its development and implementation [27]. Without clear boundaries, there are risks to patient safety because not all components potentially impacting SaMD are covered by regulations [27]. This lack of clarity also affects innovation and design in the field of SaMD, as new technology applications that support healthcare monitoring and service delivery may not be regulated effectively [28]. We believe that, alongside software development, factors and regulations defining the clinical accuracy and safety of these SaMDs will gradually need to be established. Extensive clinical validation studies are necessary to establish the reliability and accuracy of AI-based diagnostic tools, adhering to high ethical standards and regulatory compliance [29]. These studies should also address patient privacy, data security, and the potential ramifications of misdiagnoses [29]. By focusing on these areas, the potential of AI in medical diagnostics can be more fully realized, leading to enhanced patient care and more efficient healthcare delivery.

This study has several limitations, beginning with the fact that it was conducted at a single institution. One of the primary constraints was the limited number of images that could be analyzed, owing to the restricted number of researchers involved in the study. This could impact the generalizability of our findings to a broader range of images and diverse clinical settings. Another significant limitation was the lack of a reference standard for the chest X-ray interpretations. Although we analyzed the interobserver agreement between the readers, the absence of a definitive standard means that even interpretations by experienced readers cannot be considered definitive answers. This could affect the reliability and validity of the diagnostic conclusions drawn in our study. Additionally, the ethical safeguards programmed into ChatGPT led to the refusal of direct requests for chest image interpretation, necessitating the use of indirect prompts to obtain diagnostic interpretations. This workaround might have influenced the quality and accuracy of the results derived from ChatGPT, and we acknowledge that the possibility of obtaining more accurate results from ChatGPT cannot be entirely ruled out if direct requests for chest X-ray interpretation were permissible.


