doi: 10.1111/bjh.19200.
Online ahead of print.
Br J Haematol.
Abstract
In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. We evaluated not only Generative Pre-trained Transformer 4 (GPT-4) but also other artificial intelligence models, namely PaLM 2 and Llama-2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs with haematology residents. We found that residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and limitations in deciphering complex haematological clinical scenarios.
Keywords:
GPT; HSC transplantation; artificial intelligence; interrater agreement; transplant.
© 2023 The Authors. British Journal of Haematology published by British Society for Haematology and John Wiley & Sons Ltd.