Affiliations
1 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA.
2 Medical Scientist Training Program, University of Washington, Seattle, WA, USA.
3 Program for Clinical Research and Technology, Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA.
4 Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA. [email protected].
5 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA. [email protected].
6 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. [email protected].
Alex J DeGrave et al.
Nat Biomed Eng.
doi: 10.1038/s41551-023-01160-9.
Online ahead of print.
Abstract
The inferences of most machine-learning models powering medical artificial intelligence are difficult to interpret. Here we report a general framework for model auditing that combines insights from medical experts with a highly expressive form of explainable artificial intelligence. Specifically, we leveraged the expertise of dermatologists for the clinical task of differentiating melanomas from melanoma ‘lookalikes’ on the basis of dermoscopic and clinical images of the skin, and the power of generative models to render ‘counterfactual’ images to understand the ‘reasoning’ processes of five medical-image classifiers. By altering image attributes to produce analogous images that elicit a different prediction by the classifiers, and by asking physicians to identify medically meaningful features in the images, the counterfactual images revealed that the classifiers rely both on features used by human dermatologists, such as lesional pigmentation patterns, and on undesirable features, such as background skin texture and colour balance. The framework can be applied to any specialized medical domain to make the powerful inference processes of machine-learning models medically understandable.
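The core counterfactual idea in the abstract (minimally alter an input's attributes until the classifier's prediction flips, then inspect which attributes changed) can be illustrated with a toy sketch. Everything below is an assumption for illustration: a two-feature logistic classifier stands in for the image classifiers, and a simple gradient-based proximity-penalised search stands in for the paper's generative-model approach, which operates on images rather than hand-crafted features.

```python
import numpy as np

# Toy stand-in for a medical-image classifier: a logistic model over two
# hypothetical features (names are illustrative, not from the paper):
#   x[0] ~ lesional pigmentation irregularity (a medically meaningful cue)
#   x[1] ~ background skin colour balance (an undesirable shortcut cue)
w = np.array([2.0, 3.0])  # the shortcut feature carries the larger weight
b = -2.5

def predict_proba(x):
    """Probability that the toy classifier calls the lesion 'melanoma'."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x, target=0.0, lr=0.1, lam=0.5, steps=500):
    """Gradient-descent search for a minimal perturbation that flips the
    prediction: minimise (p - target)^2 + lam * ||x' - x||^2."""
    xc = x.copy()
    for _ in range(steps):
        p = predict_proba(xc)
        # gradient of the squared loss through the sigmoid, plus proximity term
        grad = 2 * (p - target) * p * (1 - p) * w + 2 * lam * (xc - x)
        xc = xc - lr * grad
    return xc

x = np.array([0.8, 0.9])  # input classified as 'melanoma' (p > 0.5)
xc = counterfactual(x)    # analogous input classified as 'benign' (p < 0.5)
delta = xc - x
print(predict_proba(x), predict_proba(xc))
# The larger change lands on the shortcut feature x[1], revealing that the
# classifier leans on it more heavily than on the pigmentation cue.
print(delta)
```

Auditing then amounts to asking a domain expert whether the attributes that moved the most (here, the background-colour feature) are medically meaningful or undesirable shortcuts.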