Affiliations
1 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA.
2 Medical Scientist Training Program, University of Washington, Seattle, WA, USA.
3 Program for Clinical Research and Technology, Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA.
4 Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA. [email protected].
5 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA. [email protected].
6 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. [email protected].
Alex J DeGrave et al.
Nat Biomed Eng.
doi: 10.1038/s41551-023-01160-9.
Online ahead of print.
Abstract
The inferences of most machine-learning models powering medical artificial intelligence are difficult to interpret. Here we report a general framework for model auditing that combines insights from medical experts with a highly expressive form of explainable artificial intelligence. Specifically, we leveraged the expertise of dermatologists for the clinical task of differentiating melanomas from melanoma ‘lookalikes’ on the basis of dermoscopic and clinical images of the skin, and the power of generative models to render ‘counterfactual’ images to understand the ‘reasoning’ processes of five medical-image classifiers. By altering image attributes to produce analogous images that elicit a different prediction by the classifiers, and by asking physicians to identify medically meaningful features in the images, the counterfactual images revealed that the classifiers rely both on features used by human dermatologists, such as lesional pigmentation patterns, and on undesirable features, such as background skin texture and colour balance. The framework can be applied to any specialized medical domain to make the powerful inference processes of machine-learning models medically understandable.
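The core counterfactual idea in the abstract (minimally alter an input's attributes until the classifier's prediction flips, then inspect which attributes changed) can be illustrated with a toy sketch. Everything below is an assumption for illustration: a two-feature logistic classifier stands in for the image classifiers, and a simple gradient-based proximity-penalised search stands in for the paper's generative-model approach, which operates on images rather than hand-crafted features.

```python
import numpy as np

# Toy stand-in for a medical-image classifier: a logistic model over two
# hypothetical features (names are illustrative, not from the paper):
#   x[0] ~ lesional pigmentation irregularity (a medically meaningful cue)
#   x[1] ~ background skin colour balance (an undesirable shortcut cue)
w = np.array([2.0, 3.0])  # the shortcut feature carries the larger weight
b = -2.5

def predict_proba(x):
    """Probability that the toy classifier calls the lesion 'melanoma'."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x, target=0.0, lr=0.1, lam=0.5, steps=500):
    """Gradient-descent search for a minimal perturbation that flips the
    prediction: minimise (p - target)^2 + lam * ||x' - x||^2."""
    xc = x.copy()
    for _ in range(steps):
        p = predict_proba(xc)
        # gradient of the squared loss through the sigmoid, plus proximity term
        grad = 2 * (p - target) * p * (1 - p) * w + 2 * lam * (xc - x)
        xc = xc - lr * grad
    return xc

x = np.array([0.8, 0.9])  # input classified as 'melanoma' (p > 0.5)
xc = counterfactual(x)    # analogous input classified as 'benign' (p < 0.5)
delta = xc - x
print(predict_proba(x), predict_proba(xc))
# The larger change lands on the shortcut feature x[1], revealing that the
# classifier leans on it more heavily than on the pigmentation cue.
print(delta)
```

Auditing then amounts to asking a domain expert whether the attributes that moved the most (here, the background-colour feature) are medically meaningful or undesirable shortcuts.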