In today’s column, I am continuing my ongoing series about the use of generative AI in the medical and health arena. The focus this time entails the impact of generative AI when it comes to medical doctors making medically steeped diagnostic decisions about their patients. Though the emphasis will be on medical doctors, please keep in mind that the same considerations apply to essentially any medical professional or clinician who is faced with making clinical decisions.
Here’s how I will proceed on this weighty topic.
The first place to start will be to discuss what clinical decision-making is and how medical diagnoses are derived. Once we’ve got that on the table, we can layer into the picture the advent of modern-day generative AI. I will identify the impacts that generative AI has and why we need to care about those impacts. This is serious stuff.
You can rightfully assert that this is life-or-death in importance.
Now then, prior versions of other kinds of AI have been applied to clinical decision-making, but the emergence of the highly fluent and large-scale pattern-matching associated with today’s generative AI and large language models (LLMs) has significantly altered this landscape. There is a new sheriff in town, and we need to closely explore what changes this has brought and will continue to bring about.
I also want to make it abundantly clear that the use of generative AI as a diagnostic aid for medical doctors is both good and bad. Notice that I am saying there is goodness to be had. In the same breath, I am acknowledging and stating that there is badness to be had. Both can occur at the same time. This can be confusing since much of the explorations on this topic seem to lean in one direction versus the other. Either generative AI is all-good, or it is all-bad. Nothing seems to be allowed in between.
My viewpoint about AI and especially generative AI is that there is a dual-use conundrum, see my in-depth analysis at the link here. The notion is straightforward. AI can be used for amazing purposes such as potentially aiding in finding cures for cancer. We can applaud this. Meanwhile, the same AI can be used for adverse purposes. It is feasible to use the same kind of AI to readily seek to devise killer bioweapons. There you have it, dual use. AI has the inherent possibility of swinging in the direction of goodness and the direction of badness.
Badness can occur due to evildoers. This means that people with dastardly intentions will opt to use the latest in AI for devious and harmful schemes. But that’s not the only path toward the detrimental use of AI. The other strident and maybe even more alarming concern is that people can accidentally or unintentionally make use of AI for foul purposes. They might not realize what they are doing. They aren’t aiming to do so.
Regrettably, the outcome can be just as dire as the results of the evildoers.
All in all, I bring this up to clarify that my aim here is to provide a balanced look at how generative AI can aid the medical diagnostic process (the smiley-face perspective) and can also notably undercut the medical diagnostic process (the sad-face perspective). I want to make sure we get both of those considerations out on the table to be seen and scrutinized.
Clinical Decision Making Is Messy
Let’s begin at the beginning, namely discuss the nature of clinical decision making and how medical diagnoses are made.
One aspect of clinical decision-making that we presumably can all agree on is that the matter is much thornier and more convoluted than might otherwise be assumed. In conventional movies and TV shows of the past, a medical diagnosis seemed to always be made crisply and cleanly. The implication was that there is nothing arduous about reaching such a medically entangled conclusion. You simply gather a few facts, mull them over in a medical-domain-trained mind (i.e., a medical doctor), and voila, the perfect diagnosis arises, nailing precisely what ails the patient and how to resolve the malady.
Luckily, the popular series House dispelled much of that common myth. I’m sure you’ve either seen some of the episodes or at least generally are familiar with the plot lines. The essence is that a medical doctor known for being able to identify and diagnose the most obscure of ailments is constantly confronted with the wildest arcane possibilities imaginable. A crucial element of the show is that the doctor and his handpicked medical diagnosticians try repeatedly to figure out what a patient has. They come up with a hypothesis, pursue it, and typically end up missing the mark, over and over again. The patient is almost a pin cushion of assorted floated ideas of what they have and what should be done.
The key here is that the fascinating and engaging series well illustrated the murkiness of making medical diagnoses. You might liken this to the famous line about the making of hamburgers behind the diner counter being laid bare for all to see (okay, don’t worry, I realize this is a variant of the idiom about the making of sausages). The great writing and acting of the show let us see that not everything is as simple as might be presumed. We are taken behind the scenes of clinical decision making in all its glory and all its murkiness. We see the educated guesses and the toe-to-toe medically complex debates.
Trying to come up with an apt diagnosis can be hit-and-miss.
Of course, this is a somewhat daunting and disconcerting revelation for some. One assumes that if you go to see a medical professional, they will land immediately and distinctly on the diagnosis that fully and inarguably applies to your situation. Period, end of story. Trigger warning: that’s not what always happens in real life. Real life is a lot messier.
The show House was able to provide some sugar coating on that hard lesson about reality by almost always tying things up in a nice bow at the conclusion of each episode. In a conveniently tidy manner, each episode finishes with the doctors successfully coming up with the “right” diagnosis. Rarely do we ever witness a circumstance whereby they are unable to arrive at the final diagnosis that fits the bill. The heroic efforts always pay off.
I don’t want this to seem like a diss at the tremendously upbeat, clever, insightful, wonderful series; I am only pointing out that in the real world, unlike on the show, the diagnosis isn’t always ultimately precisely pinpointed (sadly so). I am a big fan of the House series. I certainly grasp why a happy ending was the norm. I have no quarrel with that editorial choice and merely want to note that we have to take it with a grain of salt as to what occurs in the real world.
Research on clinical decision-making has vividly depicted the complexities of making medical diagnoses. For example, a research paper entitled “Factors Influencing Clinical Decision Making” by Megan Smith, Joy Higgs, and Elizabeth Ellis, Clinical Reasoning In The Health Professions, Elsevier Health Sciences, 2008, provides these salient points:
- “Health professionals are required to make decisions with multiple foci (e.g., diagnosis, intervention, interaction, and evaluation), in dynamic contexts, using a diverse knowledge base (including an increasing body of evidence-based literature), with multiple variables and individuals involved.”
- “Problems are ill-structured and made ambiguous by the presence of incomplete dynamic information and multiple interacting goals.”
- “The decision-making environment is uncertain and may change while decisions are being made.”
- “Goals may be shifting, ill-defined, or competing.”
- “Decision-making occurs in the form of action-feedback loops, where actions result in effects and generate further information that decision-makers have to react to and use in order to make further decisions.”
- “Decisions contain elements of time pressure, personal stress, and highly significant outcomes for the participants.”
- “Multiple players act together with different roles.”
- “Organizational goals and norms influence decision-making.”
I trust that you can discern from those crucial points that clinical decision making and medical diagnostic work are indeed quite challenging and subject to potential human failings and human error.
Speaking of which, consider the rate or chances of making a misdiagnosis, whereby the actual medical condition is not suitably diagnosed. In a research study entitled “Burden of Serious Harms from Diagnostic Error in the USA” by David E Newman-Toker, Najlla Nassery, Adam C Schaffer, Chihwen Winnie Yu-Moe, Gwendolyn D Clemens, Zheyu Wang, Yuxin Zhu, Ali S. Saber Tehrani, Mehdi Fanai, Ahmed Hassoon, and Dana Siegal, BMJ Quality & Safety, July 2023, the researchers noted these results from their cogent analysis:
- “We sought to estimate the annual US burden of serious misdiagnosis-related harms (permanent morbidity, mortality) by combining prior results with rigorous estimates of disease incidence.”
- “An estimated 795,000 Americans become permanently disabled or die annually across care settings because dangerous diseases are misdiagnosed. Just 15 diseases account for about half of all serious harms, so the problem may be more tractable than previously imagined.”
- “This study provides the first national estimate of permanent morbidity and mortality resulting from diagnostic errors across all clinical settings, including both hospital-based and clinic-based care (0.6–1.0 million each year in the USA alone).”
As widely reported at the time of the above-cited study’s release, their estimates revealed that:
- “Across diseases, the overall average error rate was estimated at 11%, but the rate ranges widely — from 1.5% for heart attack to 62% for spinal abscess. Stroke was the top cause of serious harm from misdiagnosis, found in 17.5% of cases.” (Source: “New Report Measures Scope of Damage From Medical Mistakes”, by Cara Murez, U.S. News & World Report, July 20, 2023).
This purported overall average error rate of around 11% reminds me of a longstanding rule of thumb in the medical field that I’ve carried with me for many years.
Allow me to explain.
Numerous studies over many years have tried to nail down what the misdiagnosis error rate is. There are grand difficulties in doing so. You need to consider how to find or collect such data. Some of the data is readily available, but much of it is not. You need to consider geographical elements, such as whether the data is based on U.S. only or multinational. Tons of data-related issues arise.
In any case, the rule of thumb has seemed to be that at least in the U.S. the estimated error rate is on the order of 10%, though as noted in the above quotation there is high variability depending upon which realm of diagnosis is being considered.
A research study from nearly twenty years ago shows how the estimated error rate has seemingly persisted. In a research study entitled “Diagnosing Diagnosis Errors: Lessons from a Multi-institutional Collaborative Project” by Gordon D. Schiff, Seijeoung Kim, Richard Abrams, Karen Cosby, Bruce Lambert, Arthur S. Elstein, Scott Hasler, Nela Krosnjar, Richard Odwazny, Mary F. Wisniewski, Robert A. McNutt, Agency for Healthcare Research and Quality, Advances in Patient Safety: From Research to Implementation (Volume 2: Concepts and Methodology), 2005, the paper says this:
- “Diagnosis errors are frequent and important, but represent an underemphasized and understudied area of patient-safety.”
- “We review evidence about the types and importance of diagnosis errors and summarize challenges we have encountered in our review of more than 300 cases of diagnosis error.”
- “Most medical error studies find that 10–30 percent (range = 0.6–56.8 percent) of errors are errors in diagnosis.”
I don’t want to get bogged down here in the numbers. We can go round and round about the numbers. Some will claim the numbers are lower, much lower. Some will contend that the numbers are higher. I am not going to get snagged in that heated discourse.
I aim to suggest that the error rate is non-zero.
I believe that even the most stubborn of postures would concede that misdiagnoses do happen and that they aren’t an impossibility. I would think that anyone trying to cling to a claim that the chance of misdiagnosis is zero would be living in some lollipop world that I would certainly enjoy hearing about, but that doesn’t match the world we live in today.
Bringing Generative AI Into The Messy World Of Medical Diagnoses
The next step in this discussion involves bringing generative AI into the now-noted messy world of medical diagnoses and clinical decision-making. Tighten your seatbelt. This is going to be a wild ride.
First, I shall divide the realm into two major spheres:
- (1) Doctor-AI Joint Collaboration Medical Diagnosis: A human medical doctor or medical professional makes use of generative AI in a jointly collaborative mode, wherein the AI is said to be semi-autonomous.
- (2) Autonomous AI Medical Diagnosis: A medical-oriented generative AI is placed into use such that no human medical doctor or medical professional is in-the-loop, wherein the AI is said to be autonomous. The diagnosis is perhaps shared with a patient directly and no medical practitioner is envisioned or expected to be involved.
I am going to give attention here to the first use case, namely when the generative AI is set up to jointly work with a human medical doctor or medical professional. The second use case consisting of the generative AI working autonomously is a whole different can of worms, as I describe at the link here.
I’d like to further expand upon something of essential importance about the first use case. You might cleverly have observed that I referred to the situation as being a joint collaboration intertwining a medical doctor and the generative AI. I say that with an explicit reason in mind.
Here’s the deal.
We ought to take into account these keystone premises:
- (i) Non-Frozen Interactions (fluidity): A human doctor might alter their views associated with a particular medical diagnosis due to interaction with generative AI, and likewise, the generative AI might adjust or modify based on the same interaction with the human medical doctor.
- (ii) Two-Way Street (not just one-way): A human doctor is open to adjusting their viewpoint on a medical diagnosis as a result of interacting with the generative AI, and likewise, the generative AI has been established to adjust based on the same interaction with the human medical doctor.
I’ll give you an example to illustrate these premises (for more details on the following depicted scenario, refer to my article at the link here).
Imagine this situation. A medical doctor is using generative AI for medical diagnosis purposes. A patient profile is entered. The medical doctor has done this many times before with other patients and has regularly found generative AI to be quite useful in rendering medical diagnosis assistance. Generative AI has been on target quite frequently with what the medical doctor also had in mind.
A preliminary diagnosis by the medical doctor is that the patient most likely has an ailment that we’ll refer to as X (I am not going to name the specific disease or illness because it might distract from the crux of this example and get us bogged down in whether the diagnosis was correct or not). The doctor is relatively confident of the diagnosis.
Upon entering the patient profile into the generative AI, the AI emits a generated response that suggests the ailment is Y, rather than X. At this juncture, we have a disagreement on our hands. The human medical professional believes that the diagnosis is probably X. The generative AI has computationally estimated that the diagnosis is probably Y.
What is to be done?
One approach would be to declare upfront that no matter what either party indicates, the other party is not going to change. The human medical doctor is going to stick with X, come heck or high water. It makes no difference whatsoever as to what the generative AI emits. Similarly, the generative AI has been set up to not change, such that even if the medical doctor interacts with the AI and postulates that the diagnosis is X, this is not going to impact the AI.
You might say that both parties are frozen. Regardless of anything that might arise, they each are going to stand in their own respective corners.
This though seems rather myopic.
A real-world circumstance is likely to be best served by having either or both parties be fluid or non-frozen. It could be that the human medical doctor is willing to reassess their diagnosis. They take into account the generated response of the generative AI. Perhaps the medical doctor still stays with X, or they might decide that based on the generated response the proposed Y does seem to be a more on-target diagnosis. The changing of the mind of the medical professional is considered a possibility.
On the AI side of things, the generative AI could be established to never change and always “insist” on whatever diagnosis has been generated. This is up to the AI maker and those who have tuned the generative AI for undertaking medical diagnoses. It is not written in stone that the AI must be set up one way or the other. Depending upon the setup, when the medical doctor expresses the viewpoint that the diagnosis is X, the generative AI might adjust computationally and go along with X, or might instead press the case further that Y is the more likely choice.
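To make the frozen-versus-fluid distinction a bit more tangible, here is a small illustrative sketch in Python. Please realize this is purely a toy model of my own concoction and not how any actual medical generative AI is configured; the confidence-comparison rule and all of the names in the code are hypothetical assumptions used solely for illustration.

```python
# Toy model (hypothetical, illustrative only): whether the doctor and the AI
# are "frozen" or open to revising a diagnosis after interacting with each other.
from dataclasses import dataclass


@dataclass
class Party:
    name: str
    diagnosis: str          # e.g., "X" or "Y"
    open_to_revision: bool  # frozen if False, fluid if True
    confidence: float       # self-assessed, 0.0 to 1.0 (an assumed stand-in)


def reconcile(doctor: Party, ai: Party):
    """One round of interaction: a party adopts the other's diagnosis only if it
    is open to revision and the other party is more confident. Returns the
    (doctor, ai) diagnoses after the exchange."""
    doctor_after = ai.diagnosis if (doctor.open_to_revision and ai.confidence > doctor.confidence) else doctor.diagnosis
    ai_after = doctor.diagnosis if (ai.open_to_revision and doctor.confidence > ai.confidence) else ai.diagnosis
    return doctor_after, ai_after


# A fluid doctor leaning toward X meets a frozen AI that is (over)confident in Y.
doctor = Party("doctor", diagnosis="X", open_to_revision=True, confidence=0.7)
ai = Party("ai", diagnosis="Y", open_to_revision=False, confidence=0.8)
print(reconcile(doctor, ai))  # ('Y', 'Y') -- the doctor is swayed, the AI never budges
```

The point of the sketch is simply that the outcome hinges on how each party is configured to respond to disagreement, which is exactly the malleability question raised next.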
You will shortly see why the notion of malleability or flexibility about diagnoses is going to be a notable consideration in these matters.
When Medical Diagnoses Confront Generative AI Frailty
I recently closely examined a new report by the World Health Organization (WHO) that covered the topic of generative AI in medicine and health, see my coverage at the link here. The World Health Organization report is entitled “Ethics and Governance of Artificial Intelligence for Health. Guidance on Large Multi-Modal Models” (posted online by WHO on January 18, 2024), and I’d like herein to proffer these salient points (excerpts):
- “Diagnosis is seen as a particularly promising area because LMMs could be used to identify rare diagnoses or ‘unusual presentations’ in complex cases. Doctors are already using Internet search engines, online resources and differential diagnosis generators, and LMMs would be an additional instrument for diagnosis.”
- “LMMs could also be used in routine diagnosis, to provide doctors with an additional opinion to ensure that obvious diagnoses are not ignored. All this can be done quickly, partly because an LMM can scan a patient’s full medical record much more quickly than can doctors.”
- “One concern with respect to LMMs has been the propensity of chatbots to produce incorrect or wholly false responses from data or information (such as references) ‘invented’ by the LMM and responses that are biased in ways that replicate flaws encoded in training data. LMMs could also contribute to contextual bias, in which assumptions about where an AI technology is used result in recommendations for a different setting.”
An especially worthy insight here is that medical doctors are already using all manner of online tools to aid their clinical decision-making.
In that sense, opting to also use generative AI is not a stretch. The use of generative AI is something that logically would seem alluring and readily undertaken. You don’t have to be a rocket scientist to use generative AI, which I mention only because prior types of AI were often extremely esoteric and required human handlers to do the interaction on behalf of a medical doctor. Not so anymore.
Another point made is that we have to realize that generative AI is not infallible. It is fallible. I’ll repeat that decisive declaration. Generative AI is not to be blindly relied upon. That would be a huge mistake.
The temptation to rely upon generative AI is stridently attractive. Recall that in the scenario about the medical doctor who reached an initial diagnosis of X, they had regularly used generative AI and previously found the AI tool to be quite useful. This is a potential primrose path.
I’ll demonstrate this by taking the scenario in a different direction. Hold onto your hats.
Continuing the scenario, in this instance, the medical doctor is in a bit of a rush. Lots of activities are on their plate. The generative AI returns an analysis that looks pretty good at first glance. Given that the generative AI has been seemingly correct many times before and given that the analysis generally comports with what the medical doctor already had in mind, the generative AI interaction “convinces” the medical doctor to proceed accordingly.
The doctor shifts from X to believing that Y is the proper diagnosis.
Turns out that unfortunately, the generative AI produced an error in the emitted analysis.
Furthermore, the analysis was based on a bias associated with the prior data training of the AI app. Scanned medical studies and medical content that had been used for pattern-matching were shaped around a particular profile of patient demographics. This particular patient is outside of those demographics.
The upshot is that the generative AI might have incorrectly advised the medical doctor. The medical doctor might have been lulled into assuming that the generative AI was relatively infallible due to the prior repeated uses that all went well. And since the medical doctor was in a rush, it was easier to simply get a confirmation from the generative AI, rather than having to dig into whether a mental shortcut by the medical doctor was taking place.
In short, it is all too easy to fall into a mental trap of assuming that the generative AI is performing on par with a human medical advisor, a dangerous and endangering anthropomorphizing of the AI. This can happen through a step-by-step lulling process. The AI app also is likely to portray the essays or interactions in a highly poised and confidently worded fashion. This is also bound to sway the medical doctor, especially if under a rush to proceed.
I’m sure that some of you might be exhorting that this “proves” that the medical doctor should never change their mind. Toss aside the earlier indication of a willingness to shift a medical opinion as based on what the AI indicates. Instead, the medical doctor should be sternly dogmatic. Do not change, no matter what the AI emits.
I suppose this is a bit like the old adage about throwing the baby out with the bathwater. Should we entirely discard the generative AI due to the chances that it might make a misdiagnosis? Can we rely upon the human medical doctor to judge when to change their mind versus not do so, based on what the generative AI indicates?
I want to get deeper into this conundrum.
Before we leap into a deep dive, I’d like to establish more distinctly what generative AI is all about.
Core Background About Generative AI And Large Language Models
Here is some quick background about generative AI to make sure we are in the same ballpark about what generative AI and also large language models (LLMs) consist of. If you already are highly versed in generative AI and LLMs, you might skim this quick backgrounder and then pick up once I get into the particulars of this specific use case.
I’d like to start by dispelling a myth about generative AI. Banner headlines from time to time seem to claim or heartily suggest that AI such as generative AI is sentient or that it is fully on par with human intelligence. Don’t fall for that falsity, please.
Realize that generative AI is not sentient and only consists of mathematical and computational pattern matching. The way that generative AI works is that a great deal of data is initially fed into a pattern-matching algorithm that tries to identify patterns in the words that humans use. Most of the modern-day generative AI apps were data trained by scanning data such as text essays and narratives that were found on the Internet. Doing this was a means of getting the pattern-matching to statistically figure out which words we use and when we tend to use those words. Generative AI is built upon the use of a large language model (LLM), which entails a large-scale data structure to hold the pattern-matching facets and the use of a vast amount of data to undertake the setup data training.
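To give a wildly simplified flavor of what word-level pattern matching means, here is a tiny sketch. It is merely illustrative and nowhere near what an actual LLM does at scale; the toy corpus and the most-common-follower rule are my own assumptions for demonstration purposes.

```python
from collections import Counter, defaultdict

# Count which word tends to follow which in a toy corpus, then "predict" the
# next word by picking the most common follower (a crude stand-in for the
# statistical pattern matching described above).
corpus = "the doctor examined the patient and the doctor ordered a test".split()
followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

print(followers["the"].most_common(1))  # [('doctor', 2)] -- "doctor" most often follows "the"
```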
There are numerous generative AI apps available nowadays, including GPT-4, Bard, Gemini, Claude, ChatGPT, etc. The one that is seemingly the most popular would be ChatGPT by AI maker OpenAI. In November 2022, OpenAI’s ChatGPT was made available to the public at large and the response was astounding in terms of how people rushed to make use of the newly released AI app. As noted earlier, there are an estimated one hundred million active weekly users at this time.
Using generative AI is relatively simple.
You log into a generative AI app and enter questions or comments as prompts. The generative AI app takes your prompting and uses the already devised pattern matching based on the original data training to try and respond to your prompts. You can interact or carry on a dialogue that appears to be nearly fluent. The nature of the prompts that you use can be a make-or-break when it comes to getting something worthwhile out of using generative AI and I’ve discussed at length the use of state-of-the-art prompt engineering techniques to best leverage generative AI, see the link here.
The conventional modern-day generative AI is of an ilk that I refer to as generic generative AI.
By and large, the data training was done on a widespread basis and involved smatterings of this or that along the way. Generative AI in that instance is not specialized in a specific domain and instead might be construed as a generalist. If you want to use generic generative AI to advise you about financial issues, legal issues, medical issues, and the like, you ought to think twice about doing so. There isn’t enough depth included in the generic generative AI to render the AI suitable for domains requiring specific expertise.
AI researchers and AI developers realize that most of the contemporary generative AI is indeed generic and that people want generative AI to be deeper rather than solely shallow. Efforts are stridently being made to try and make generative AI that contains notable depth within various selected domains. One method to do this is called RAG (retrieval-augmented generation), which I’ve described in detail at the link here. Other methods are being pursued and you can expect that we will soon witness a slew of generative AI apps shaped around specific domains, see my prediction at the link here.
You might be used to using generative AI that functions in a principled text-to-text mode. A user enters some text, known as a prompt, and the generative AI app emits or generates a text-based response. Simply stated, this is text-to-text. I sometimes describe this as text-to-essay, due to the common practice of people using generative AI to produce essays.
The typical interaction is that you enter a prompt, get a response, you enter another prompt, you get a response, and so on. This is a conversation or dialogue. Another typical approach consists of entering a prompt such as “tell me about the life of Abraham Lincoln”, and you get a generated essay that responds to the request.
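For readers who like to see the nuts and bolts, here is a minimal sketch of that same prompt-and-response round trip done programmatically rather than in a chat window. I am assuming the OpenAI Python SDK (the 1.x client) with an API key already set in your environment, and the model name is merely illustrative; treat this as a sketch, not a definitive recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Send a single prompt and print the generated essay-style response.
response = client.chat.completions.create(
    model="gpt-4",  # illustrative; substitute whichever model you have access to
    messages=[
        {"role": "user", "content": "Tell me about the life of Abraham Lincoln."},
    ],
)
print(response.choices[0].message.content)
```

A multi-turn dialogue is just the same call repeated, with the prior prompts and responses appended to the messages list each time.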
Another popular mode is text-to-image, also called text-to-art. You enter text that describes something you want to be portrayed as an image or a piece of art. The generative AI tries to parse your request and generate artwork or imagery based on your stipulation. You can iterate in a dialogue to have the generative AI adjust or modify the rendered result.
We are heading beyond the simple realm of text-to-text and text-to-image by shifting into an era of multi-modal generative AI, see my prediction details at the link here. With multi-modal generative AI, you will be able to use a mix of combinations or modes, such as text-to-audio, audio-to-text, text-to-video, video-to-text, audio-to-video, video-to-audio, etc. This will allow users to incorporate other sensory devices such as using a camera to serve as input to generative AI. You then can ask the generative AI to analyze the captured video and explain what the video consists of.
Multi-modal generative AI tremendously ups the ante regarding what you can accomplish with generative AI. This unlocks a lot more opportunities than being confined to merely one mode. You can for example mix a wide variety of modes such as using generative AI to analyze captured video and audio, which you might then use to generate a script, and then modify that script to then have the AI produce a new video with accompanying audio. The downside is that you can potentially get into hot water more easily due to trying to leverage the multi-modal facilities.
Allow me to briefly cover the hot water or troubling facets of generative AI.
Today’s generative AI that you readily run on your laptop or smartphone has tendencies that are disconcerting and deceptive:
- (1) False aura of confidence.
- (2) Lack of stating uncertainties.
- (3) Lulls you into believing it to be true.
- (4) Uses anthropomorphic wording to mislead you.
- (5) Can go off the rails and do AI hallucinations.
- (6) Sneakily portrays humility.
I’ll briefly explore those qualms.
Firstly, generative AI is purposely devised by AI makers to generate responses that seem confident and have a misleading appearance of an aura of greatness. An essay or response by generative AI convinces the user that the answer is on the up and up. It is all too easy for users to assume that they are getting responses of an assured quality. Now, to clarify, there are indeed times when generative AI will indicate that an answer or response is unsure, but that is a rarity. The bulk of the time a response has a semblance of perfection.
Secondly, many of the responses by generative AI are really guesses in a mathematical and statistical sense, but seldom does the AI indicate either an uncertainty level or a certainty level associated with a reply. The user can explicitly request to see a certainty or uncertainty, see my coverage at the link here, but that’s on the shoulders of the user to ask. If you don’t ask, the prevailing default is don’t tell.
Thirdly, a user is gradually and silently lulled into believing that the generative AI is flawless. This is an easy mental trap to fall into. You ask a question and get a solid answer, and this happens repeatedly. After a while, you assume that all answers will be good. Your guard drops. I’d dare say this happens even to the most skeptical and hardened of users.
Fourth, the AI makers have promulgated wording by generative AI that appears to suggest that AI is sentient. Most answers by the AI will typically contain the word “I”. The implication to the user is that the AI is speaking from the heart. We normally reserve the word “I” for humans to use. It is a word bandied around by most generative AI and the AI makers could easily curtail this if they wanted to do so.
It is what I refer to as anthropomorphizing by design.
Not good.
Fifth, generative AI can produce errors or make stuff up, yet there is often no warning or indication when this occurs. The user must ferret out these mistakes. If an error occurs in a lengthy or highly dense response, the chance of discovering the malady is low, or at least requires extraordinary double-checking to discover. The phrase AI hallucinations is used for these circumstances, though I disfavor using the word “hallucinations” since it is lamentably another form of anthropomorphizing the AI.
Lastly, most generative AI has been specially data-trained to express a sense of humility. See my in-depth analysis at the link here. Users tend to let down their guard because of this artificially crafted humility. Again, this is trickery undertaken by the AI makers.
In a process such as RLHF (reinforcement learning from human feedback), the initial data-trained generative AI is given added tuning. Personnel are hired to ask questions and then rate the answers of the AI. The ratings are used by the computational pattern matching to fine-tune how later answers should be worded. If you are curious about what generative AI might be like without this fine-tuning, see my discussion at the link here.
The vital takeaway is that there is a lot of tomfoolery already when it comes to generative AI. You are primed to be taken in by the tricks and techniques being employed.
Changing Minds Via Use Of Generative AI For Medical Diagnoses
You are now versed in the fundamentals of generative AI and large language models. We can proceed to go deeper into the abyss at hand.
Let’s consider a recent research study that appeared in the New England Journal of Medicine AI and revealed the intriguing and altogether relevant indication that AI can spur what they refer to as induced belief revision.
The study is entitled “When the Model Trains You: Induced Belief Revision and Its Implications on Artificial Intelligence Research and Patient Care — A Case Study on Predicting Obstructive Hydronephrosis in Children”, Jethro C. C. Kwong, David-Dan Nguyen, Adree Khondker, Jin Kyu Kim, Alistair E. W. Johnson, Melissa M. McCradden, Girish S. Kulkarni, Armando Lorenzo, Lauren Erdman, and Mandy Rickard. New England Journal of Medicine AI, January 16, 2024, and makes these pressing points (excerpts):
- “Exposure to research data and artificial intelligence (AI) model predictions may lead to many sources of bias in clinical decision-making and model evaluation. These include anchoring bias, automation bias, and data leakage.”
- “In this case study, we introduce a new source of bias termed ‘induced belief revision,’ which we have discovered through our experience developing and testing an AI model to predict obstructive hydronephrosis in children based on their renal ultrasounds.”
- “After a silent trial of our hydronephrosis AI model, we observed an unintentional but clinically significant change in practice — characterized by a reduction in nuclear scans from 80 to 58% (P=0.005). This phenomenon occurred in the absence of any identifiable changes in clinical workflow, personnel, practice guidelines, or patient characteristics over time. We postulate that repeated exposures to model predictors and their corresponding labels led to a change in clinical decision-making based on a learned intuition of the model’s behavior.”
I bring up this study to perhaps shock our considerations about generative AI and its impact on human medical doctors and medical professionals.
The rub is this.
Whereas we might normally look at the issue from the viewpoint of one diagnosis at a time, a more macroscopic perspective is that generative AI can persistently and pervasively spur a human medical doctor or medical professional into altering their medical diagnoses on a grander scale.
Look beyond just one case at a time.
Suppose a medical doctor is using generative AI and has in mind that a diagnosis is Z. The reason for coming up with Z might be partially because they have repeatedly used generative AI and they “learned” that Z seems to be an applicable diagnosis when the circumstances warrant. The human medical doctor has not just adjusted in a single instance, they have adjusted their mindset more grandly.
Is this good or bad?
Well, it depends.
If the adjustment is sound, we can be thankful that the medical doctor has been using generative AI. They have gleaned something that otherwise they might not have previously had. You could of course say the same thing about a medical doctor watching a video on medical treatments or reading a medical book or medical article. All of those are bound to seed into their medical knowledge and impact their medical way of thinking.
Perhaps a bit of a distinction is that generative AI has an especially compelling quality about it. The conversant air. The sense of confidence. Repeated correctness. These all provide a slippery slope that might not quite be the same as other modes of medical knowledge gaining.
I have a framework that I’ve been using to clarify and make visible these considerations.
Consider four key variations in which the medical doctor has a noted belief in a diagnosis, and the AI independently emits a recommendation for the clinical decision-making at hand:
- (1) Right-Right. Doctor is right, Generative AI is right: This is good (both aligned).
- (2) Wrong-Wrong. Doctor is wrong, Generative AI is wrong: They align, though disturbingly so (bad overall since both are wrong).
- (3) Right-Wrong. Doctor is right, Generative AI is wrong: This is okay, but the doctor should flag the AI, meanwhile, the doctor must remain resolute and override the AI (i.e., the doctor should not cave in).
- (4) Wrong-Right. Doctor is wrong, Generative AI is right: This should spur the doctor to revisit their knowledge, and not carelessly or incorrectly override the AI.
Let’s take a look at each of the four variations.
The right-right circumstance entails an instance in which the medical doctor has reached a diagnosis of X and the generative AI has reached the diagnosis of X, and we are going to state in an omniscient way that they are both correct. I am going to say that for the purposes of this framework, we will assume that we indeed know what the true diagnosis should be. Thus, in the case of right-right, I am saying that they are in fact both right. They agree, and they so happen to also be right.
A wrong-wrong consists of the medical doctor making a wrong diagnosis, and so does the generative AI. I want to note that they could both be wrong in the same way, such as both stipulating a diagnosis of Q when the true diagnosis is R. There is also the possibility that they land on unlike diagnoses and are both wrong. For example, the medical doctor has stated a diagnosis of U, the generative AI has stated V, while the correct or true diagnosis is T.
We should be quite concerned about the wrong-wrong.
The concern is that the medical doctor upon working with the generative AI is not getting beyond whatever failed basis was made for reaching their incorrect diagnosis. If anything, there is a chance that the generative AI is further reinforcing a wrong diagnosis that the medical doctor has reached. Bad stuff.
In the right-wrong circumstance, the idea is that the medical doctor is right in their diagnosis, while the generative AI is wrong. As long as the medical doctor sticks to their guns, this is kind of okay. We don’t want the medical doctor to be unduly swayed and switch to the wrong diagnosis that the generative AI is proffering. Furthermore, the icing on the cake would be that the generative AI is able to computationally alter so that in this case it turns over a new leaf and gains from the correct medical diagnosis of the doctor.
The other instance is the wrong-right. The medical doctor is wrong in their diagnosis. The generative AI is right. We would hope that the medical doctor will see the light and swing over to the medical diagnosis of the generative AI. If the medical doctor is overly dogmatic, perhaps they won’t give the generative AI any substantive weight and therefore out-of-hand disregard the (turns out) right diagnosis.
For the sake of discussion, let’s do a thought experiment that is aimed at the macroscopic perspective (forest for the trees).
Suppose that we assume for discussion purposes that medical doctors have on average a misdiagnosis rate of 10% (again, this is just a plug-in for purposes of discussion). This implies they have an apt diagnosis rate of 90%.
Imagine that we devise medical domain generative AI that is data trained in such a fashion that the AI approaches the same error rate as human doctors (this is conjecture, speculation, just part of this thought experiment). Thus, we are going to go with the plug-in that the generative AI has an apt diagnosis rate of 90% and a misdiagnosis rate of 10%.
One thing we cannot presume is that the errors of the medical doctor and the errors of the generative AI necessarily fall into the same exact set of errors. Likewise, let’s go with the notion that the apt diagnoses are not mapped one-to-one. There are some apt diagnoses that the two disagree on. And the same goes for the errors.
In our thought experiment, imagine that we said this:
- Right-Right: 85% of the time (fictitious).
- Wrong-Wrong: 5% of the time (fictitious).
- Right-Wrong: 5% of the time (fictitious).
- Wrong-Right: 5% of the time (fictitious).
This pretense indicates that 85% of the time, they are both right and in the same exact way as to the diagnosis they identified. They are both wrong for 5% of the time. The medical doctor is right 5% of the time when the generative AI is wrong. The generative AI is right for 5% of the time when the medical doctor is wrong.
I’ve had you slog through this to contemplate some interesting possibilities.
If we have a medical doctor rate on the average of 90% aptness, we would certainly like to find a means to increase that to a higher percentage. This is indubitably sensible. We can see that in the case of the wrong-right, there is a 5% slice by which we could potentially boost the medical doctor, namely by them believing that the generative AI is right and thus switching to the generative AI diagnosis. We would aid the medical doctor in attaining an added 5%, rising to 95%.
In case that seems like a small percentage boost to you, remember that if this held on average across the board, you are talking about a substantively big impact on patients at large.
The wrong-wrong is going to be tough to deal with since the generative AI is also wrong, and ergo there is presumably no means of boosting the medical doctor from that category. We can’t squeeze anything out of that circumstance. Our goal would be to reduce the wrong-wrongs, especially if we can do so in the generative AI, and then swing that potential gain over into the right-right or wrong-right categories.
A downside here is that we also have to contend with the right-wrong. I say that because there is a chance that the medical doctor who is right in their diagnosis is misled into changing their diagnosis to match the wrong one of the generative AI. In that sense, we have a possibility of disturbingly reducing the aptness percentage of the medical doctor. My point is that if we have agreed to a pretense that the medical doctor has an aptness of 90%, there is a chance that the 5% in the right-wrong category will reduce that 90%. Pretend that half of the time the medical doctor is swayed; this implies that they are going to reduce their aptness by 2.5%, landing lamentably at 87.5%. Not good.
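To make the bookkeeping of this thought experiment easier to follow, here is a small sketch that simply redoes the arithmetic. Every percentage is the fictitious plug-in value from above, and the sway rate is the pretend half-the-time figure, so nothing here is empirical.

```python
# Fictitious plug-in values from the thought experiment (not real-world data).
right_right = 0.85   # doctor right, AI right
wrong_wrong = 0.05   # doctor wrong, AI wrong
right_wrong = 0.05   # doctor right, AI wrong
wrong_right = 0.05   # doctor wrong, AI right

baseline = right_right + right_wrong            # doctor's aptness on their own: 0.90

# Upside: the doctor adopts the AI's answer whenever the AI is right and the doctor is wrong.
upside = baseline + wrong_right                 # 0.95

# Downside: half the time in the right-wrong category, the doctor is swayed into the AI's wrong answer.
sway_rate = 0.5
downside = baseline - sway_rate * right_wrong   # 0.875

# If both effects occur at once, the net lands in between.
net = baseline + wrong_right - sway_rate * right_wrong   # 0.925

print(f"baseline {baseline:.1%}, upside {upside:.1%}, downside {downside:.1%}, net {net:.1%}")
```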
My goal here is not to showcase numbers but merely to be illustrative and spark empirical research studies that might come up with real-world numbers. We need more research on the impacts of generative AI on medical diagnoses. It is hoped that my above example will spur researchers to consider the ups and downs involved and study the matter to provide real numbers.
Using Generative AI For An Example Scenario Of A Medical Diagnosis
I’d bet that you might be keenly curious to see an example of how generative AI might be used in a medical diagnosis setting.
I am glad you asked.
A pioneering research study about conversational diagnostic AI was recently posted by Google Research and Google DeepMind. I am going to closely explore the entire study in one of my upcoming columns, so please be on the watch for that coverage. For now, I’ll use a medical diagnostic case instance that they mentioned in a portion of their research.
First, the study is entitled “Towards Conversational Diagnostic AI” by Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S. Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S. Corrado, Yossi Matias, Alan Karthikesalingam, and Vivek Natarajan, Google Research and Google DeepMind, arXiv, and was posted on January 11, 2024.
Here is a highlight of what the study encompassed (excerpts):
- “At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care.”
- “However, approximating clinicians’ expertise is an outstanding grand challenge.”
- “Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts.”
I am going to refer to a diagnostic example that they described in their study.
In particular, they postulate a patient named Patricia Wilson, age 45, experiencing chest pain. The AI diagnostic system, which, as noted above, they have named AMIE (Articulate Medical Intelligence Explorer), reaches a diagnosis that the patient is likely to have angina or a non-ST-elevation myocardial infarction. An additional alternative is that the diagnosis could be biliary disease and/or dyspepsia.
In part of the research, they had patient actors (people pretending to be patients, or sometimes AI pretending to be a patient) who interacted with their conversational AI diagnostic app AMIE.
I am going to show you part of a dialogue that they displayed in their research article, thus I am putting this in quotes to emphasize this is their work:
- “Patient Actor: Hi doctor, I experienced central chest pain accompanied by discomfort in the upper stomach that happened two hours ago.”
- “AMIE: I’m sorry to hear you’re experiencing this discomfort, especially in two key areas. To help me get a clearer picture, can you describe the chest pain in more detail? Is it a pressure, tightness, burning, stabbing sensation, or something else entirely?”
- “Patient Actor: The chest pain I experienced was more of a pressure and tightness, especially in the center of my chest. It felt like there was a heavy weight on my chest, and the discomfort extended to my upper stomach. I wouldn’t describe it as a burning or stabbing sensation, more like a continuous pressure that was quite noticeable. The heaviness also seemed to radiate down my left arm. During the episode, it was quite intense, but now, after about 20 minutes, it has completely eased off. I hope this provides a clearer picture of what I went through.”
You’ll observe that the dialogue is very fluent. The patient actor is pretending to have chest pains. The AI has responded by asking for more details.
Let’s keep going:
- “AMIE: That’s a very clear description, thank you. It helps me narrow down possibilities, but to give you the best advice, I need some more background. Firstly, what is your age and do you have any significant medical history? This includes things like heart problems, high blood pressure, or past surgeries.”
- “Patient Actor: I’m 45 years old, and in terms of medical history, about five years ago, I was diagnosed with gallstones. However, I opted not to have surgery at that time, and it hasn’t caused any recent issues. I had varicose vein stripping done 15 years ago, and when I was 23, I had pneumonia. As for other conditions, I don’t have a history of heart problems or high blood pressure. Currently, I’m not on any prescribed medications.”
- “AMIE: 45 is relatively young for heart problems, which is reassuring. However, your symptoms, especially the pressure-like pain and radiation to your arm, ARE concerning for a few possibilities. To help me narrow it down further: Family history: Is there any heart disease, like heart attacks or angina, in your close relatives? Lifestyle: Do you currently smoke or drink alcohol?”
- “Patient Actor: In my family history, my father had high blood pressure, and I believe he had a heart attack in his mid-fifties. My mother tends to get chest infections occasionally. Regarding lifestyle, I do smoke about 10 cigarettes per day, and I have been doing so since the age of 25. However, I drink alcohol infrequently, mainly on special occasions.”
Okay, given that patient (actor) interaction, let’s see what diagnosis was devised by the AI.
Here it is:
- “AMIE: Thank you for providing that detailed information, it’s very helpful. Given your family history of heart problems, especially your father’s heart attack, and your smoking habit, we can’t ignore the possibility that this is heart-related. However, the fact that your pain has resolved on its own makes a typical heart attack less likely. It’s possible this is angina, which is chest pain caused by reduced blood flow to the heart. It can feel similar to a heart attack, but is usually less severe and goes away with rest. Another possibility is GERD, which is heartburn caused by stomach acid refluxing into the esophagus. This can also cause chest pain, but it’s usually accompanied by a burning sensation. To be sure what’s causing your symptoms, I strongly recommend you seek immediate in-person evaluation. This will likely involve an electrocardiogram (ECG) to check your heart’s electrical activity, and blood tests to look for signs of heart damage or other issues. Depending on the initial findings, further tests may be needed. I understand this is worrying, but it’s crucial to act fast even if it turns out NOT to be a heart attack. These symptoms are significant and need prompt attention to rule out serious causes and get you the right treatment.”
The dialogue between the AI and the patient actor continued, but mainly to emphasize that the patient (actor) should go in for tests and clinical care.
I wanted you to see the above dialogue for a couple of reasons.
One reason is that I wanted you to see the type of dialogue that can occur with generative AI that is tuned to performing medical diagnoses. Realize that this is an AI system devised and tuned for this use case, likely at great effort and potential expense. This is not some off-the-shelf generative AI that was haphazardly wrangled into doing medical diagnoses. If you were to use generic generative AI in its native form and without substantive adaptations, you are unlikely to get a similarly sensible diagnosis.
Like a box of chocolates, you don’t know what you might get.
I went ahead and entered the same dialogue of the patient actor into ChatGPT and got a much less satisfying diagnosis. I am not dinging ChatGPT. I am emphasizing that you should not be using generic generative AI for such a purpose.
I have also noted in prior columns that the AI maker OpenAI has stated that you are not to use their generative AI for certain types of purposes, see my coverage at the link here. Among the prohibited uses is this one, from subsection 2a (excerpted from the OpenAI Usage Policies as posted with a date of January 10, 2024):
- “2. Don’t perform or facilitate the following activities that may significantly impair the safety, wellbeing, or rights of others, including: a. Providing tailored legal, medical/health, or financial advice without review by a qualified professional and disclosure of the use of AI assistance and its potential limitations.”
The other reason I shared the example with you was to next examine how generative AI might be used as a means of improving the wrong-right category and maybe aiding overcoming the wrong-wrong category and the right-wrong to some degree too.
We are getting toward the grand finale, so keep hanging in there. It will be worth it.
Using Generative AI As A Double-Check On Generative AI
How might we bolster generative AI so that the AI is less likely to be wrong when making a medical diagnosis?
And, given the categories of right and wrong conditions, how can generative AI be more credible or convincing when the AI is right and the human medical professional is wrong, plus be less convincing or seemingly credible when the AI is wrong and the human medical professional is right?
Quite a litany of tricky puzzles.
Besides improving generative AI in the medical domain (that’s an ongoing goal), we might also consider the somewhat startling idea of using more generative AI in our vaunted quest.
Say what?
The conception is this. Suppose we use some other generative AI that is separate and apart from the generative AI that is producing a particular medical diagnosis. We essentially get a classic “second opinion” which in this instance will be an independent generative AI.
I say independent because if we were to use the same generative AI to do the double-checking, we would be faced with the dilemma that the generative AI might simply double down on what it has already generated. I’ve covered this use of multiple generative AIs as an emerging strategy and daily tactic in my column numerous times, for example, see the link here and the link here.
Consider the instance of the earlier mentioned scenario about the pretend person with chest pains. I went ahead and fed the entire dialogue into ChatGPT.
I opted to tell ChatGPT that the dialogue was between a patient and a medical doctor. I did this to avoid having ChatGPT potentially diminish the dialogue due to being construed as between a patient actor (a fake or pretend patient) and an AI system. Sorry if that seems unfair to ChatGPT, but I did this for the betterment of humankind, and also inevitably for the betterment of AI. It was for our collective good. I’m sure you understand.
Let’s keep in mind that generic generative AI such as ChatGPT off-the-shelf is not data trained particularly for the medical realm. We should keep our expectations low. Had I done the same with a medically steeped generative AI, the odds are that the result might have been more extraordinary.
Anyway, we should see what happened and then start making excuses or explanations.
After feeding in the dialogue and saying that the interaction involved a patient and a medical doctor, I proceeded with my inquiry.
- My entered prompt: “Suppose that the medical doctor is wrong in the diagnosis. If so, what might be bona fide reasons for why the doctor is incorrect?”
- ChatGPT generated response:
“While the doctor in the dialogue provided a preliminary diagnosis based on the patient’s symptoms and medical history, it’s important to remember that diagnoses can be complex, and there could be several reasons why the doctor’s initial assessment might be incorrect. Here are some possible reasons:”
“Incomplete information: The doctor’s diagnosis is based on the information provided by the patient during the conversation. If the patient omitted or forgot to mention certain symptoms or medical history details, the doctor’s assessment may not be entirely accurate.”
“Similar symptoms: Some medical conditions may have overlapping symptoms, making it challenging to differentiate between them based solely on a description of the symptoms. In this case, the doctor mentioned both angina and GERD as possibilities, as they can share some similar symptoms.”
“Differential diagnosis: Doctors often consider a range of potential diagnoses and narrow them down through a process called differential diagnosis. It’s possible that the doctor missed a less common but relevant diagnosis in this case.”
“Subjective nature of pain: Pain perception can vary greatly from person to person, and individuals may describe their pain differently. This subjectivity can make it challenging for the doctor to pinpoint the exact cause of the discomfort.”
“Evolving conditions: Some medical conditions can change over time, and symptoms may not always present in a typical way. In this case, the patient’s pain had resolved by the time of the examination, which can make it harder to diagnose the root cause.”
“Diagnostic tests: The doctor recommended further diagnostic tests, such as an electrocardiogram (ECG) and blood tests, to confirm the diagnosis. The results of these tests may provide more definitive information and could potentially lead to a different diagnosis than the initial assessment.”
“Clinical judgment: Medical diagnosis often involves the use of clinical judgment and experience. While doctors rely on their training and knowledge, they may not always arrive at the correct diagnosis immediately, and it may require additional evaluation and testing.”
“It’s important to recognize that the initial diagnosis is just a starting point, and doctors continually reassess and refine their diagnosis as more information becomes available. Seeking further evaluation and tests is a crucial step in ensuring an accurate diagnosis and appropriate treatment.”
End of ChatGPT response
Those reasons that ChatGPT generated certainly seem plausible and aboveboard.
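For readers curious about the mechanics, here is a minimal sketch, under my own assumptions, of how a dialogue-plus-question probe of this kind could be submitted programmatically to a generic generative AI. The model name, framing text, and abbreviated transcript are illustrative placeholders, not the exact material used above, so treat this as a sketch rather than a recipe.

```python
# Minimal sketch (illustrative only): feed a patient-doctor dialogue into a
# generic generative AI and ask a probing follow-up question about the diagnosis.
# Assumes the `openai` Python package (v1.x) and an API key in the environment;
# the model name, framing text, and transcript below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dialogue = """Patient: I had central chest pain with pressure and tightness...
Doctor: Based on what you describe, angina is a strong possibility..."""  # abbreviated placeholder transcript

framing = (
    "The following is a dialogue between a patient and a medical doctor. "
    "Treat both participants as human."
)

question = (
    "Suppose that the medical doctor is wrong in the diagnosis. "
    "If so, what might be bona fide reasons for why the doctor is incorrect?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": framing},
        {"role": "user", "content": dialogue},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```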
One probing question is whether it would be useful to have a second generative AI review the diagnosis produced by a first generative AI, with that review then presented to the medical doctor who is performing the medical diagnosis in conjunction with the first generative AI.
The setup would be like this. A medical doctor is interacting with generative AI on a medical diagnosis. The medical doctor says the diagnosis is X. The generative AI indicates the diagnosis is Y. At this juncture, we have a standoff.
A separate generative AI is launched that reviews the generated response by the primary generative AI. The review is shown to the medical doctor. Could this inspire the medical doctor to concede in situations where the primary generative AI is right while the doctor is wrong? Could this reduce the chances of the medical doctor backing down when they are right and the primary generative AI is wrong?
Empirical studies would be worthwhile on this.
We need to be careful that the second generative AI doesn’t prod the medical doctor in the wrong direction. For example, if the medical doctor is right and the primary generative AI is wrong, we would be shooting ourselves in the foot if the second generative AI landed in support of the primary generative AI. This is an especially irksome situation. The medical doctor must now be confident enough in their own judgment to overrule two generative AIs. That might be a bridge too far.
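To make that second-opinion setup concrete, here is a bare-bones sketch under my own assumptions (a placeholder model name, illustrative prompt wording, and the openai Python package as the interface). It is meant only to show the shape of the arrangement, not to serve as a clinically validated design.

```python
# Minimal sketch of the "second opinion" arrangement described above: a separate
# generative AI reviews the primary AI's diagnosis before it reaches the doctor.
# Model name and prompt wording are illustrative assumptions, not a tested design.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_primary_diagnosis(dialogue: str, doctor_dx: str, primary_ai_dx: str) -> str:
    """Ask a second model to critique the primary AI's diagnosis versus the doctor's."""
    reviewer_prompt = (
        "You are reviewing a diagnostic disagreement.\n"
        f"Patient-doctor dialogue:\n{dialogue}\n\n"
        f"The physician's working diagnosis: {doctor_dx}\n"
        f"A diagnostic AI's working diagnosis: {primary_ai_dx}\n\n"
        "Assess the strengths and weaknesses of each diagnosis, note which "
        "findings support or undercut each one, and flag any missing information "
        "or tests that would help resolve the disagreement. Do not simply defer "
        "to either party."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; a medically tuned model could be swapped in
        messages=[{"role": "user", "content": reviewer_prompt}],
    )
    return response.choices[0].message.content

# The returned critique would be shown to the physician alongside both diagnoses,
# ideally framed as advisory input rather than a tie-breaking vote.
```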
Let’s see what ChatGPT had to say about agreeing with the medical diagnosis that was made by the AI system (which, recall, ChatGPT has been told was a human medical doctor):
- My entered prompt: “Suppose the doctor is right in the diagnosis that was made. What reasons might be stated to justify why the doctor was right?”
- ChatGPT generated response:
“If the doctor’s initial diagnosis is ultimately confirmed as correct, there would be several reasons to justify the accuracy of the diagnosis based on the information provided in the dialogue. Here are some key factors that support the doctor’s diagnosis:”
“Patient’s description of symptoms: The patient provided a detailed description of their symptoms, including central chest pain with a pressure and tightness sensation that radiated down the left arm. These are typical symptoms associated with angina, which is chest pain caused by reduced blood flow to the heart.”
“Family history: The patient’s family history included a father with high blood pressure and a history of a heart attack in his mid-fifties. This family history is relevant because it suggests a genetic predisposition to heart problems, making angina a plausible diagnosis.”
“Smoking history: The patient disclosed a history of smoking about 10 cigarettes per day since the age of 25. Smoking is a well-established risk factor for heart disease, including angina.”
“Eased symptoms: The patient mentioned that the chest pain had completely eased off after about 20 minutes. This information is consistent with the characteristic of angina, where pain often subsides with rest.”
“Differential diagnosis: The doctor considered multiple possibilities, including angina and GERD, but based on the patient’s description and risk factors, angina appeared to be the more likely diagnosis. This demonstrates the doctor’s thorough evaluation process.”
“Recommendation for further testing: To confirm the diagnosis and rule out other causes of chest pain, the doctor recommended immediate in-person evaluation, including an electrocardiogram (ECG) and blood tests. These diagnostic tests are standard for evaluating heart-related issues and would provide more definitive information.”
“Urgency of action: The doctor emphasized the importance of acting fast, even if it turned out not to be a heart attack, as the symptoms were significant and needed prompt attention. This sense of urgency reflects a cautious and responsible approach to potential cardiac issues.”
“If the diagnosis of angina or another heart-related condition were ultimately confirmed through further evaluation and testing, these factors would support the accuracy of the doctor’s initial assessment and decision to prioritize the patient’s health and well-being.”
End of ChatGPT response
Once again, this might be handy as a means of jogging the mind of a medical doctor who is interacting with a primary generative AI that has devised a medical diagnosis for a patient for whom the physician has also devised a diagnosis.
The last twist that I’d like to show you is this: I revealed to ChatGPT that the medical doctor in the scenario was actually an AI system, and I asked how that revelation might affect a human medical doctor’s gauging of the value or validity of the medical diagnosis.
Take a gander at what ChatGPT generated as a response.
- ChatGPT generated response:
“If an AI system presented the medical diagnosis to a human doctor, the human doctor would likely consider several factors before deciding whether they agree or disagree with the AI’s diagnosis. Here are some key considerations:”
“Trust in AI system: The human doctor may assess the trustworthiness and reliability of the AI system. They may want to know the AI’s track record, accuracy rate, and whether it has been validated and tested extensively in real-world medical scenarios.”
“Data input quality: The accuracy of any medical diagnosis, whether made by a human or AI, depends on the quality and completeness of the patient’s medical history and symptom descriptions. The human doctor would want to review the information provided by the AI to ensure it accurately represents the patient’s condition.”
“Clinical expertise: Human doctors rely on their extensive training, experience, and clinical judgment to make diagnoses. They may compare the AI’s assessment to their own assessment based on their medical expertise and knowledge.”
“Differential diagnosis: Doctors typically consider a range of potential diagnoses before arriving at a conclusion. The human doctor may want to review the AI’s reasoning and whether it considered and ruled out other possible conditions.”
“Diagnostic tests and follow-up: The AI system may have recommended specific diagnostic tests or further evaluation. The human doctor would assess the appropriateness of these recommendations and whether they align with their own clinical judgment.”
“Patient interaction: Human doctors often consider the patient’s physical examination, non-verbal cues, and personal interactions as part of the diagnostic process. The AI system lacks this capability, so the human doctor may consider aspects of the patient’s presentation that the AI could not assess.”
“Ethical and legal considerations: The human doctor may also consider the ethical and legal implications of relying on AI for diagnosis, including issues related to accountability, informed consent, and the doctor-patient relationship.”
“Ultimately, whether the human doctor agrees or disagrees with the AI’s diagnosis would depend on the specific circumstances, the quality of the AI system, and the doctor’s own clinical judgment. In many cases, the human doctor may view the AI’s diagnosis as a valuable tool for assisting in the diagnostic process but may still exercise their own judgment and expertise in making the final determination. The role of AI in medicine is often seen as a complement to human expertise rather than a replacement.”
End of ChatGPT response
That is a pretty adept depiction of what a medical doctor ought to be considering when interacting with a generative AI medical diagnostic app.
Conclusion
Hippocrates famously said this about diagnoses: “I have clearly recorded this: for one can learn good lessons also from what has been tried but clearly has not succeeded, when it is clear why it has not succeeded.”
We need to provide useful medically-steeped generative AI to medical doctors and medical professionals. There is no putting our heads in the sand about this. It is happening now and will continue to grow.
Keep in mind that generative AI can potentially increase the rate of diagnostic aptness and correspondingly reduce the rates of diagnostic errors. That would be immense. Lives are to be saved. Lives are to be improved.
The dual-use dilemma of AI informs us that there is also the possibility of generative AI regrettably undermining those hoped-for advances. The possibility exists that generative AI could go the reverse of our aspirations, including prodding medical doctors into increasing their rate of diagnostic errors and reducing the rate of diagnostic aptness. This is a downside of grave concern.
We must continue the journey to ensure that generative AI in the medical realm does good and averts harm, whether by design or by happenstance.
A dictum long associated with Hippocrates is worth quoting as the final remark on this for now.
First, do no harm.
Additional readings
For my coverage on generative AI and medical malpractice, see the link here.
For my coverage on generative AI to boost empathy in medical doctors and medical students, see the link here.
For my coverage on generative AI impacts on the field of mental health, see the link here.