Article In Brief
A study found that two artificial intelligence programs—ChatGPT-3.5 and ChatGPT-4—scored better than humans on tests similar to neurology board examinations. However, both models consistently used confident language, even when providing incorrect answers, and the test questions didn’t reflect the complexity of most neurology cases, experts said.
A newer large-language model (LLM) achieved significantly better scores than human test takers when presented with questions similar to those on neurology board examinations, according to a study published Dec. 7 in JAMA Network Open.
The study was not done to see if artificial intelligence (AI) could replace doctors—or take their tests for them—but rather to determine whether LLMs might be useful in real-world clinical settings as well as in medical education.
The answer that emerged from the study seems to be a cautious yes, though more research is needed to fine-tune the application of LLMs specifically for neurology, according to the study investigators, based at University Hospital Heidelberg in Germany.
The study, which evaluated two versions of the model—ChatGPT-3.5 and ChatGPT-4—found that the newer version scored higher on board-style exams than the average human test taker. Both versions performed better on lower-order questions—multiple-choice questions, for example, on which drug to prescribe for a given condition—than on higher-order ones, which required analysis and synthesis of information.
The study caught the attention of neurologists involved in medical education, but not enough to elicit concern. “Bottom line: I’m really not worried at all about my job security, or those of my residents,” said Jeremy Moeller, MD, FAAN, associate professor of neurology and neurology residency program director at Yale School of Medicine. “Good test takers are not necessarily the best neurologists, by any stretch.”
The study came amid a growing sense that AI (of which LLMs are one form) is becoming increasingly sophisticated and rapidly expanding in all kinds of fields, from entertainment to business to medicine. According to background information in the study, tools using deep-learning algorithms have already shown some promise in helping with neurologic diagnosis, prognosis, and treatment decisions; however, “the role and potential application of large-language models have been unexplored” in neurology.
LLMs in general are getting more powerful and accurate as the datasets that underpin them continue to grow. Previous research has tested the ability of LLMs to take medical exams in ophthalmology, neurosurgery, and radiology, with varying degrees of success when compared with human performance.
The researchers on the current study, led by Varun Venkataramani, MD, PhD, wrote that neurology exams could prove tougher for LLMs because “the questions in neurology board examinations often present complex narratives with subtle diagnostic clues that require a nuanced understanding of neuroanatomy, neuropathology, and neurophysiology,” requiring the test taker to sift through the scenario, extract relevant data, and synthesize information in order to arrive at a diagnosis and therapy decision.
Lower-order, Higher-order Questions
The researchers designed a cross-sectional study to evaluate the performance of GPT-3.5 and GPT-4 on testing similar to neurology board exams. The bank of nearly 2,000 questions, which did not include questions with video or images, was approved by the American Board of Psychiatry and Neurology and validated with a small question cohort by the European Board of Neurology. The research team categorized questions as lower-order (recall, understanding) or higher-order (application, analysis, and synthesis) based on Bloom’s taxonomy for learning and assessment.
“Notably, LLM2’s (GPT-4) performance was greater than the mean human score of 73.8 percent, effectively achieving near-passing and passing grades in the neurology board examination,” the study reported.
“LLM2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in six categories.”
The newer model correctly answered 85 percent of the questions, compared with 66.8 percent for the older version. The results, equivalent to passing the exam, were achieved “despite the absence of neurology-specific training,” the study noted.
The study concluded that the findings could have significant applications in clinical neurology and health care.
One concern was that both models consistently used confident language, even when providing incorrect answers, the study found. However, “when prompted with the correct answer after an incorrect response, both models responded by apologizing and agreeing with the provided correct answers in all cases.”
The researchers said neurologists using LLMs should be aware of “the models’ tendencies to phrase inaccurate responses confidently and should be cautious regarding its usage in practice or education.”
The researchers noted other limitations of the study as well, including the fact that the questions were not identical to those used on official board exams. Also, no questions involved images or videos, which often are part of clinical practice and medical education.
The study also noted that since LLMs “are trained to identify patterns and relationships among words in their training data, they can struggle in situations requiring a deeper understanding of context or specialized technical language.”
While the LLMs, particularly the newer GPT-4 version, performed well, it’s hard to do a direct comparison with human test takers, in part because the study did not define the level of medical education associated with the human scores to which the LLMs were compared.
Dr. Moeller, of Yale, was not surprised that the LLMs did well on neurology board-style exams, but he noted that “even the best multiple-choice questions are highly artificial and don’t reflect what happens in real life.” He said traditional tests and board exams are just one facet of evaluating a medical student or resident, and in the end it’s all about whether they “are good at taking care of patients.”
“The ACGME and other organizations have identified core competencies for physicians that go far beyond medical knowledge, and include things like communication skills, professionalism, the ability to navigate the health care system, and behaviors toward lifelong learning and improvement. Multiple choice tests aren’t the best way to assess most of these competencies,” he said.
Dr. Moeller recently discussed the future of AI in medicine with a patient’s family, telling them that while he didn’t anticipate being replaced by an AI robot, “I hope it will free up our brains as physicians so we can engage more often with patients on a human level.” He likes to think of LLMs as tools that, when used in combination with traditional tools of medicine such as patient history and the clinical exam, “could help us arrive at the right diagnosis and treatment.”
“Each patient encounter is different and never exactly like you read in a textbook,” he said.
Teaching to the Uncertainty in Medicine
Ralph Jozefowicz, MD, FAAN, professor of neurology and medicine and associate chair for education in the department of neurology at the University of Rochester Medical Center, doubts that even the most refined LLMs and other AI tools can eliminate the uncertainty inherent in medicine, a reality he said is evident during rounds on any neurology ward.
He teaches his students and residents that the best way to deal with uncertainty in making a diagnosis is to follow the patient carefully over time. Important information, he said, often emerges not from the first question asked but from follow-up questions or the give-and-take between patient and doctor, which may be hard to replicate with AI. A doctor’s degree of experience, he added, is a key factor that can’t be minimized.
“I am all for advances in medicine but also for the realization that there is no such thing as a magic answer,” said Dr. Jozefowicz, who has practiced medicine for 44 years.
Roy Strowd, MD, MS, MEd, FAAN, associate professor of neurology and vice dean for undergraduate medical education at Wake Forest University School of Medicine, said the one finding from the study that particularly warrants concern is that “AI is confident even when it’s incorrect,” a pattern that has emerged from other studies of AI as well. In many areas of medical training, he noted, confidence tends to lag behind correctness.
“As you get more and more correct, you are gaining confidence,” Dr. Strowd said. “In other words, confidence tends to follow competence.”
An important message conveyed throughout medical education and training is that “it’s important to know what you know and know what you don’t know,” he said. “It is one of the most important lines to draw as a physician.”
Dr. Strowd said today’s students and residents grew up on social media and tend to be very savvy about technology, but they also tend to have a healthy skepticism.
“They are extremely cognizant of the risks associated with the technology,” he said. Among the challenges he sees in incorporating AI into medical education, training, and practice are how to protect patient privacy and safety as well as the privacy of students and doctors who use the tools and input their work into the databases.
Dr. Strowd said he hopes the finding that LLMs can perform as well as or better than human test takers on neurology board-style exams will draw attention to the need to keep evolving the methods used to assess medical students and residents. Medical education is already moving away from an over-reliance on multiple-choice test questions, he said, though memorizing certain facts will always be important. The ability to reasonably assess whether a source of information is accurate and reliable will be increasingly important for doctors, he added.
Dr. Strowd said AI might be used in the future to augment student assessment, perhaps “in helping to assist in the test question writing, answer explanations, or summarizing relevant literature that explains the answer to a multiple choice question.”
“There is going to be skepticism about AI,” Dr. Strowd said, “but now we are going to figure out where it’s going to be most helpful and use it as a tool to help patients.”
Disclosures
Dr. Moeller had no disclosures. Dr. Strowd has received fees for consulting from Novocure, Monteris Medical, and Alexion and for lectures from Lecturio and Kaplan Medical; he has received book royalties from Elsevier.