The shaky foundations of large language models and foundation models for electronic health records

Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at arXiv: 2108.07258 (2021).

Brown, T. B. et al. Language models are few-shot learners. Preprint at arXiv:2005.14165 (2020).

Esser, P., Chiu, J., Atighehchian, P., Granskog, J. & Germanidis, A. Structure and content-guided video synthesis with diffusion models. Preprint at arXiv: 2302.03011 (2023).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar

Jiang, Y. et al. VIMA: general robot manipulation with multimodal prompts. Preprint at arXiv: 2210.03094 (2022).

Eysenbach, G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: a conversation with ChatGPT and a call for papers. JMIR Med Educ. 9, e46885 (2023).

Article
PubMed
PubMed Central

Google Scholar

Wei, J. et al. Emergent abilities of large language models. Preprint at arXiv: 2206.07682 (2022).

Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).

Article
PubMed
PubMed Central

Google Scholar

Gilson, A. et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. (2023)

Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? Preprint at arXiv: :2207.08143 (2022).

Nori, H., King, N., Mc Kinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv: 2303.13375 (2023).

Jeblick, K. et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Preprint at arXiv: 2212.14882 (2022).

Macdonald, C., Adeloye, D., Sheikh, A. & Rudan, I. Can ChatGPT draft a research article? An example of population-level vaccine effectiveness analysis. J. Glob. Health 13, 01003 (2023).

Article
PubMed
PubMed Central

Google Scholar

Pang, C. et al. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. Machine Learning for Health. PMLR (2021)

Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. Preprint at arXiv: 1511.05942 (2015).

Prakash, P. K. S., Chilukuri, S., Ranade, N. & Viswanathan, S. RareBERT: transformer architecture for rare disease patient identification using administrative claims. AAAI 35, 453–460 (2021).

Article

Google Scholar

Cascella, M., Montomoli, J., Bellini, V. & Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47, 33 (2023).

Article
PubMed
PubMed Central

Google Scholar

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307, 230163 (2023).

Wójcik, M. A. Foundation models in healthcare: opportunities, biases and regulatory prospects in Europe. In Electronic Government and the Information Systems Perspective: 11th International Conference, EGOVIS 2022 Proceedings 32–46 (Springer-Verlag, 2022).

Blagec, K., Kraiger, J., Frühwirt, W. & Samwald, M. Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals. J. Biomed. Inform. 137, 104274 (2023).

Article
PubMed

Google Scholar

Donoho, D. 50 years of data science. J. Comput. Graph. Stat. 26, 745–766 (2017).

Article

Google Scholar

Topol, E. When M.D. is a machine doctor. https://erictopol.substack.com/p/when-md-is-a-machine-doctor (2023).

Robert, P. 5 Ways ChatGPT will change healthcare forever, for better. Forbes Magazine (13 February 2023).

Liang, P. et al. Holistic evaluation of language models. Preprint at arXiv [cs.CL] (2022).

Mohsen, F., Ali, H., El Hajj, N. & Shah, Z. Artificial intelligence-based methods for fusion of electronic health records and imaging data. Sci. Rep. 12, 17981 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

BigScience Workshop, et al. BLOOM: a 176B-Parameter open-access multilingual language model. Preprint at arXiv [cs.CL] (2022).

Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at arXiv [cs.CL] (2023).

Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. Preprint at arXiv [cs.CL] (2022).

Singhal, K. et al. Large language models encode clinical knowledge. Preprint at arXiv [cs.CL] (2022).

Chintagunta, B., Katariya, N., Amatriain, X. & Kannan, A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proc. Second Workshop on Natural Language Processing for Medical Conversations 66–76 (Association for Computational Linguistics, 2021).

Huang, K. et al. Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation. Proceedings of the 3rd Clinical Natural Language Processing Workshop (2020).

Lehman, E. et al. Do we still need clinical language models? Preprint at arXiv [cs.CL] (2023).

Moradi, M., Blagec, K., Haberl, F. & Samwald, M. GPT-3 models are poor few-shot learners in the biomedical domain. Preprint at arXiv [cs.CL] (2021).

Steinberg, E. et al. Language models are an effective representation learning technique for electronic health record data. J. Biomed. Inform. 113, 103637 (2021).

Article
PubMed

Google Scholar

Guo, L. L. et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 13, 3767 (2022).

Fei, N. et al. Towards artificial general intelligence via a multimodal foundation model. Nat. Commun. 13, 3094 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Si, Y. et al. Deep representation learning of patient data from Electronic Health Records (EHR): a systematic review. J. Biomed. Inform. 115, 103671 (2021).

Article
PubMed

Google Scholar

Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

Article
CAS
PubMed

Google Scholar

Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J. Am. Med. Inform. Assoc. 25, 1419–1428 (2018).

Article
PubMed
PubMed Central

Google Scholar

Davenport, T. & Kalakota, R. The potential for artificial intelligence in healthcare. Future Health. J. 6, 94–98 (2019).

Article

Google Scholar

Bohr, A. & Memarzadeh, K. The rise of artificial intelligence in healthcare applications. Artif. Intell. Healthcare 25 (2020).

Howard, J. & Sebastian, R. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (2018).

Chen, L. et al. HAPI: a large-scale longitudinal dataset of commercial ML API predictions. Preprint at arXiv [cs.SE] (2022).

Huge ‘foundation models’ are turbo-charging AI progress. The Economist (15 June 2022).

Canes, D. The time-saving magic of Chat GPT for doctors. https://tillthecavalryarrive.substack.com/p/the-time-saving-magic-of-chat-gpt?utm_campaign=auto_share (2022).

Steinberg, E., Xu, Y., Fries, J. & Shah, N. Self-supervised time-to-event modeling with structured medical records. Preprint at arXiv [cs.LG] (2023).

Kline, A. et al. Multimodal machine learning in precision health: a scoping review. NPJ Digit. Med. 5, 171 (2022).

Article
PubMed
PubMed Central

Google Scholar

Baevski, A. et al. Data2vec: A general framework for self-supervised learning in speech, vision and language. International Conference on Machine Learning. PMLR (2022).

Girdhar, R. et al. ImageBind: one embedding space to bind them all. Preprint at arXiv [cs.CV] (2023).

Boecking, B. et al. Making the most of text semantics to improve biomedical vision–language processing. Preprint at arXiv [cs.CV] (2022).

Radford, A. et al. Learning transferable visual models from natural language supervision. Preprint at arXiv [cs.CV] (2021).

Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit. Med. 3, 136 (2020).

Article
PubMed
PubMed Central

Google Scholar

Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022).

Article
CAS
PubMed

Google Scholar

Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (2022).

Lee, S., Da Young, L., Im, S., Kim, N. H. & Park, S.-M. Clinical decision transformer: intended treatment recommendation through goal prompting. Preprint at arXiv [cs.AI] (2023).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

Article
CAS
PubMed Central

Google Scholar

Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 (2020).

Sushil, M., Ludwig, D., Butte, A. J. & Rudrapatna, V. A. Developing a general-purpose clinical language inference model from a large corpus of clinical notes. Preprint at arXiv [cs.CL] (2022).

Li, F. et al. Fine-tuning bidirectional encoder representations from transformers (BERT)-based models on large-scale electronic health record notes: an empirical study. JMIR Med. Inf. 7, e14830 (2019).

Article

Google Scholar

Yang, X. et al. GatorTron: a large clinical language model to unlock patient information from unstructured electronic health records. Preprint at bioRxiv https://doi.org/10.1101/2022.02.27.22271257 (2022).

Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 5, 180178 (2018).

Article
PubMed
PubMed Central

Google Scholar

Li, Y. et al. Hi-BEHRT: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J. Biomed. Health Inform. 27 (2022).

Zeltzer, D. et al. Prediction accuracy with electronic medical records versus administrative claims. Med. Care 57, 551–559 (2019).

Article
PubMed

Google Scholar

Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 86 (2021).

Article
PubMed
PubMed Central

Google Scholar

Zeng, X., Linwood, S. L. & Liu, C. Pretrained transformer framework on pediatric claims data for population specific tasks. Sci. Rep. 12, 3651 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Hur, K. et al. Unifying heterogeneous electronic health records systems via text-based code embedding. Conference on Health, Inference, and Learning, PMLR (2022).

Tang, P. C., Ralston, M., Arrigotti, M. F., Qureshi, L. & Graham, J. Comparison of methodologies for calculating quality measures based on administrative data versus clinical data from an electronic health record system: implications for performance measures. J. Am. Med. Inform. Assoc. 14, 10–15 (2007).

Article
CAS
PubMed
PubMed Central

Google Scholar

Wei, W.-Q. et al. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J. Am. Med. Inform. Assoc. 23, e20–e27 (2016).

Article
PubMed

Google Scholar

Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 1, 1–10 (2018).

Article

Google Scholar

Lee, D., Jiang, X. & Yu, H. Harmonized representation learning on dynamic EHR graphs. J. Biomed. Inform. 106, 103426 (2020).

Article
PubMed

Google Scholar

Ateev, H. R. B. A. ChatGPT-assisted diagnosis: is the future suddenly here? https://www.statnews.com/2023/02/13/chatgpt-assisted-diagnosis/ (2023).

Raths, D. How UCSF physician execs are thinking about ChatGPT. Healthcare Innovation (17 February 2023).

Fries, J. et al. Bigbio: a framework for data-centric biomedical natural language processing. Advances in Neural Information Processing Systems 35 (2022).

Gao, Y. et al. A scoping review of publicly available language tasks in clinical natural language processing. J. Am. Med. Inform. Assoc. 29, 1797–1806 (2022).

Article
PubMed

Google Scholar

Leaman, R., Khare, R. & Lu, Z. Challenges in clinical natural language processing for automated disorder normalization. J. Biomed. Inform. 57, 28–37 (2015).

Article
PubMed
PubMed Central

Google Scholar

Spasic, I. & Nenadic, G. Clinical text data in machine learning: systematic review. JMIR Med. Inf. 8, e17984 (2020).

Article

Google Scholar

Yue, X., Jimenez Gutierrez, B. & Sun, H. Clinical reading comprehension: a thorough analysis of the emrQA dataset. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 4474–4486 (Association for Computational Linguistics, 2020).

McDermott, M. et al. A comprehensive EHR timeseries pre-training benchmark. In Proc. Conference on Health, Inference, and Learning 257–278 (Association for Computing Machinery, 2021).

Shah, N. Making machine learning models clinically useful. JAMA 322, 1351 (2019).

Article

Google Scholar

Wornow, M., Gyang Ross, E., Callahan, A. & Shah, N. H. APLUS: a Python library for usefulness simulations of machine learning models in healthcare. J. Biomed. Inform. 139, 104319 (2023).

Article
PubMed
PubMed Central

Google Scholar

Tamm, Y.-M., Damdinov, R. & Vasilev, A. Quality metrics in recommender systems: Do we calculate metrics consistently? Proceedings of the 15th ACM Conference on Recommender Systems (2021).

Dash, D. et al. Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint at arXiv [cs.AI] (2023).

Reiter, E. A structured review of the validity of BLEU. Comput. Linguist. 44, 393–401 (2018).

Article

Google Scholar

Hu, X. et al. Correlating automated and human evaluation of code documentation generation quality. ACM Trans. Softw. Eng. Methodol. 31, 1–28 (2022).

Google Scholar

Liu, Y. et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. Preprint at arXiv [cs.CL] (2023).

Thomas, R. & Uminsky, D. The problem with metrics is a fundamental problem for AI. Preprint at arXiv [cs.CY] (2020).

Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at arXiv [cs.CL] (2022).

Gao, T., Fisch, A. & Chen, D. Making pre-trained language models better few-shot learners. Preprint at arXiv [cs.CL] (2020).

Kaufmann, J. Foundation models are the new public cloud. ScaleVP https://www.scalevp.com/blog/foundation-models-are-the-new-public-cloud (2022).

Kashyap, S., Morse, K. E., Patel, B. & Shah, N. H. A survey of extant organizational and computational setups for deploying predictive models in health systems. J. Am. Med. Inform. Assoc. 28, 2445–2450 (2021).

Article
PubMed
PubMed Central

Google Scholar

Abdullah, I. S., Loganathan, A., Lee, R. W. ChatGPT & doctors: the Medical Dream Team. URGENT Matters (2023).

Lee, P., Goldberg, C. & Kohane, I. The AI Revolution in Medicine: GPT-4 and Beyond. (Pearson, 2023).

Fleming, S. L. et al. Assessing the potential of USMLE-like exam questions generated by GPT-4. Preprint at medRxiv https://doi.org/10.1101/2023.04.25.23288588 (2023).

Husmann, S., Yèche, H., Rätsch, G. & Kuznetsova, R. On the importance of clinical notes in multi-modal learning for EHR data. Preprint at arXiv [cs.LG] (2022).

Soenksen, L. R. et al. Integrated multimodal artificial intelligence framework for healthcare applications. NPJ Digit. Med. 5, 149 (2022).

Article
PubMed
PubMed Central

Google Scholar

Peng, S., Kalliamvakou, E., Cihon, P. & Demirer, M. The impact of AI on developer productivity: evidence from GitHub copilot. Preprint at arXiv [cs.SE] (2023).

Noy, S. et al. Experimental evidence on the productivity effects of generative artificial intelligence. Science https://economics.mit.edu/sites/default/files/inline-files/Noy_Zhang_1.pdf (2023).

Perry, N., Srivastava, M., Kumar, D. & Boneh, D. Do users write more insecure code with AI assistants? Preprint at arXiv [cs.CR] (2022).

Zhang, X., Zhou, Z., Chen, D. & Wang, Y. E. AutoDistill: an end-to-end framework to explore and distill hardware-efficient language models. Preprint at arXiv [cs.LG] (2022).

El-Mhamdi, E.-M. et al. SoK: on the impossible security of very large foundation models. Preprint at arXiv [cs.LG] (2022).

Carlini, N. et al. Quantifying memorization across neural language models. Preprint at arXiv [cs.LG] (2022).

Mitchell, E., Lin, C., Bosselut, A., Manning, C. D. & Finn, C. Memory-based model editing at scale. Preprint at arXiv [cs.AI] (2022).

Sharir, O., Peleg, B. & Shoham, Y. The cost of training NLP models: a concise overview. Preprint at arXiv [cs.CL] (2020).

Yaeger, K. A., Martini, M., Yaniv, G., Oermann, E. K. & Costa, A. B. United States regulatory approval of medical devices and software applications enhanced by artificial intelligence. Health Policy Technol. 8, 192–197 (2019).

Article

Google Scholar

DeCamp, M. & Lindvall, C. Latent bias and the implementation of artificial intelligence in medicine. J. Am. Med. Inform. Assoc. 27, 2020–2023 (2020).

Article
PubMed
PubMed Central

Google Scholar

Wickens, C. D., Clegg, B. A., Vieane, A. Z. & Sebok, A. L. Complacency and automation bias in the use of imperfect automation. Hum. Factors 57, 728–739 (2015).

Article
PubMed

Google Scholar

Source link