The shaky foundations of large language models and foundation models for electronic health records



  • Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at arXiv: 2108.07258 (2021).

  • Brown, T. B. et al. Language models are few-shot learners. Preprint at arXiv:2005.14165 (2020).

  • Esser, P., Chiu, J., Atighehchian, P., Granskog, J. & Germanidis, A. Structure and content-guided video synthesis with diffusion models. Preprint at arXiv: 2302.03011 (2023).

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  • Jiang, Y. et al. VIMA: general robot manipulation with multimodal prompts. Preprint at arXiv: 2210.03094 (2022).

  • Eysenbach, G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: a conversation with ChatGPT and a call for papers. JMIR Med. Educ. 9, e46885 (2023).

  • Wei, J. et al. Emergent abilities of large language models. Preprint at arXiv: 2206.07682 (2022).

  • Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).

  • Gilson, A. et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. (2023).

  • Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? Preprint at arXiv: 2207.08143 (2022).

  • Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv: 2303.13375 (2023).

  • Jeblick, K. et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Preprint at arXiv: 2212.14882 (2022).

  • Macdonald, C., Adeloye, D., Sheikh, A. & Rudan, I. Can ChatGPT draft a research article? An example of population-level vaccine effectiveness analysis. J. Glob. Health 13, 01003 (2023).

  • Pang, C. et al. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. Machine Learning for Health. PMLR (2021).

  • Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. Preprint at arXiv: 1511.05942 (2015).

  • Prakash, P. K. S., Chilukuri, S., Ranade, N. & Viswanathan, S. RareBERT: transformer architecture for rare disease patient identification using administrative claims. AAAI 35, 453–460 (2021).

  • Cascella, M., Montomoli, J., Bellini, V. & Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J. Med. Syst. 47, 33 (2023).

  • Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307, 230163 (2023).

  • Wójcik, M. A. Foundation models in healthcare: opportunities, biases and regulatory prospects in Europe. In Electronic Government and the Information Systems Perspective: 11th International Conference, EGOVIS 2022 Proceedings 32–46 (Springer-Verlag, 2022).

  • Blagec, K., Kraiger, J., Frühwirt, W. & Samwald, M. Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals. J. Biomed. Inform. 137, 104274 (2023).

  • Donoho, D. 50 years of data science. J. Comput. Graph. Stat. 26, 745–766 (2017).

  • Topol, E. When M.D. is a machine doctor. https://erictopol.substack.com/p/when-md-is-a-machine-doctor (2023).

  • Robert, P. 5 Ways ChatGPT will change healthcare forever, for better. Forbes Magazine (13 February 2023).

  • Liang, P. et al. Holistic evaluation of language models. Preprint at arXiv [cs.CL] (2022).

  • Mohsen, F., Ali, H., El Hajj, N. & Shah, Z. Artificial intelligence-based methods for fusion of electronic health records and imaging data. Sci. Rep. 12, 17981 (2022).

  • BigScience Workshop, et al. BLOOM: a 176B-Parameter open-access multilingual language model. Preprint at arXiv [cs.CL] (2022).

  • Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at arXiv [cs.CL] (2023).

  • Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. Preprint at arXiv [cs.CL] (2022).

  • Singhal, K. et al. Large language models encode clinical knowledge. Preprint at arXiv [cs.CL] (2022).

  • Chintagunta, B., Katariya, N., Amatriain, X. & Kannan, A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proc. Second Workshop on Natural Language Processing for Medical Conversations 66–76 (Association for Computational Linguistics, 2021).

  • Huang, K. et al. Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation. Proceedings of the 3rd Clinical Natural Language Processing Workshop (2020).

  • Lehman, E. et al. Do we still need clinical language models? Preprint at arXiv [cs.CL] (2023).

  • Moradi, M., Blagec, K., Haberl, F. & Samwald, M. GPT-3 models are poor few-shot learners in the biomedical domain. Preprint at arXiv [cs.CL] (2021).

  • Steinberg, E. et al. Language models are an effective representation learning technique for electronic health record data. J. Biomed. Inform. 113, 103637 (2021).

  • Guo, L. L. et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 13, 3767 (2023).

  • Fei, N. et al. Towards artificial general intelligence via a multimodal foundation model. Nat. Commun. 13, 3094 (2022).

  • Si, Y. et al. Deep representation learning of patient data from Electronic Health Records (EHR): a systematic review. J. Biomed. Inform. 115, 103671 (2021).

  • Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

  • Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J. Am. Med. Inform. Assoc. 25, 1419–1428 (2018).

  • Davenport, T. & Kalakota, R. The potential for artificial intelligence in healthcare. Future Health. J. 6, 94–98 (2019).

  • Bohr, A. & Memarzadeh, K. The rise of artificial intelligence in healthcare applications. Artif. Intell. Healthcare 25 (2020).

  • Howard, J. & Ruder, S. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (2018).

  • Chen, L. et al. HAPI: a large-scale longitudinal dataset of commercial ML API predictions. Preprint at arXiv [cs.SE] (2022).

  • Huge ‘foundation models’ are turbo-charging AI progress. The Economist (15 June 2022).

  • Canes, D. The time-saving magic of Chat GPT for doctors. https://tillthecavalryarrive.substack.com/p/the-time-saving-magic-of-chat-gpt?utm_campaign=auto_share (2022).

  • Steinberg, E., Xu, Y., Fries, J. & Shah, N. Self-supervised time-to-event modeling with structured medical records. Preprint at arXiv [cs.LG] (2023).

  • Kline, A. et al. Multimodal machine learning in precision health: a scoping review. npj Digit. Med. 5, 171 (2022).

  • Baevski, A. et al. Data2vec: A general framework for self-supervised learning in speech, vision and language. International Conference on Machine Learning. PMLR (2022).

  • Girdhar, R. et al. ImageBind: one embedding space to bind them all. Preprint at arXiv [cs.CV] (2023).

  • Boecking, B. et al. Making the most of text semantics to improve biomedical vision–language processing. Preprint at arXiv [cs.CV] (2022).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. Preprint at arXiv [cs.CV] (2021).

  • Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digit. Med. 3, 136 (2020).

  • Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (2022).

  • Lee, S., Da Young, L., Im, S., Kim, N. H. & Park, S.-M. Clinical decision transformer: intended treatment recommendation through goal prompting. Preprint at arXiv [cs.AI] (2023).

  • Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

  • Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 (2020).

  • Sushil, M., Ludwig, D., Butte, A. J. & Rudrapatna, V. A. Developing a general-purpose clinical language inference model from a large corpus of clinical notes. Preprint at arXiv [cs.CL] (2022).

  • Li, F. et al. Fine-tuning bidirectional encoder representations from transformers (BERT)-based models on large-scale electronic health record notes: an empirical study. JMIR Med. Inf. 7, e14830 (2019).

  • Yang, X. et al. GatorTron: a large clinical language model to unlock patient information from unstructured electronic health records. Preprint at bioRxiv https://doi.org/10.1101/2022.02.27.22271257 (2022).

  • Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 5, 180178 (2018).

  • Li, Y. et al. Hi-BEHRT: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J. Biomed. Health Inform. 27 (2022).

  • Zeltzer, D. et al. Prediction accuracy with electronic medical records versus administrative claims. Med. Care 57, 551–559 (2019).

  • Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 86 (2021).

  • Zeng, X., Linwood, S. L. & Liu, C. Pretrained transformer framework on pediatric claims data for population specific tasks. Sci. Rep. 12, 3651 (2022).

  • Hur, K. et al. Unifying heterogeneous electronic health records systems via text-based code embedding. Conference on Health, Inference, and Learning, PMLR (2022).

  • Tang, P. C., Ralston, M., Arrigotti, M. F., Qureshi, L. & Graham, J. Comparison of methodologies for calculating quality measures based on administrative data versus clinical data from an electronic health record system: implications for performance measures. J. Am. Med. Inform. Assoc. 14, 10–15 (2007).

  • Wei, W.-Q. et al. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J. Am. Med. Inform. Assoc. 23, e20–e27 (2016).

  • Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 1, 1–10 (2018).

  • Lee, D., Jiang, X. & Yu, H. Harmonized representation learning on dynamic EHR graphs. J. Biomed. Inform. 106, 103426 (2020).

  • Ateev, H. R. B. A. ChatGPT-assisted diagnosis: is the future suddenly here? https://www.statnews.com/2023/02/13/chatgpt-assisted-diagnosis/ (2023).

  • Raths, D. How UCSF physician execs are thinking about ChatGPT. Healthcare Innovation (17 February 2023).

  • Fries, J. et al. Bigbio: a framework for data-centric biomedical natural language processing. Advances in Neural Information Processing Systems 35 (2022).

  • Gao, Y. et al. A scoping review of publicly available language tasks in clinical natural language processing. J. Am. Med. Inform. Assoc. 29, 1797–1806 (2022).

  • Leaman, R., Khare, R. & Lu, Z. Challenges in clinical natural language processing for automated disorder normalization. J. Biomed. Inform. 57, 28–37 (2015).

  • Spasic, I. & Nenadic, G. Clinical text data in machine learning: systematic review. JMIR Med. Inf. 8, e17984 (2020).

  • Yue, X., Jimenez Gutierrez, B. & Sun, H. Clinical reading comprehension: a thorough analysis of the emrQA dataset. In Proc. 58th Annual Meeting of the Association for Computational Linguistics 4474–4486 (Association for Computational Linguistics, 2020).

  • McDermott, M. et al. A comprehensive EHR timeseries pre-training benchmark. In Proc. Conference on Health, Inference, and Learning 257–278 (Association for Computing Machinery, 2021).

  • Shah, N. Making machine learning models clinically useful. JAMA 322, 1351 (2019).

  • Wornow, M., Gyang Ross, E., Callahan, A. & Shah, N. H. APLUS: a Python library for usefulness simulations of machine learning models in healthcare. J. Biomed. Inform. 139, 104319 (2023).

  • Tamm, Y.-M., Damdinov, R. & Vasilev, A. Quality metrics in recommender systems: Do we calculate metrics consistently? Proceedings of the 15th ACM Conference on Recommender Systems (2021).

  • Dash, D. et al. Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint at arXiv [cs.AI] (2023).

  • Reiter, E. A structured review of the validity of BLEU. Comput. Linguist. 44, 393–401 (2018).

  • Hu, X. et al. Correlating automated and human evaluation of code documentation generation quality. ACM Trans. Softw. Eng. Methodol. 31, 1–28 (2022).


  • Liu, Y. et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. Preprint at arXiv [cs.CL] (2023).

  • Thomas, R. & Uminsky, D. The problem with metrics is a fundamental problem for AI. Preprint at arXiv [cs.CY] (2020).

  • Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at arXiv [cs.CL] (2022).

  • Gao, T., Fisch, A. & Chen, D. Making pre-trained language models better few-shot learners. Preprint at arXiv [cs.CL] (2020).

  • Kaufmann, J. Foundation models are the new public cloud. ScaleVP https://www.scalevp.com/blog/foundation-models-are-the-new-public-cloud (2022).

  • Kashyap, S., Morse, K. E., Patel, B. & Shah, N. H. A survey of extant organizational and computational setups for deploying predictive models in health systems. J. Am. Med. Inform. Assoc. 28, 2445–2450 (2021).

  • Abdullah, I. S., Loganathan, A. & Lee, R. W. ChatGPT & doctors: the Medical Dream Team. URGENT Matters (2023).

  • Lee, P., Goldberg, C. & Kohane, I. The AI Revolution in Medicine: GPT-4 and Beyond. (Pearson, 2023).

  • Fleming, S. L. et al. Assessing the potential of USMLE-like exam questions generated by GPT-4. Preprint at medRxiv https://doi.org/10.1101/2023.04.25.23288588 (2023).

  • Husmann, S., Yèche, H., Rätsch, G. & Kuznetsova, R. On the importance of clinical notes in multi-modal learning for EHR data. Preprint at arXiv [cs.LG] (2022).

  • Soenksen, L. R. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022).

  • Peng, S., Kalliamvakou, E., Cihon, P. & Demirer, M. The impact of AI on developer productivity: evidence from GitHub copilot. Preprint at arXiv [cs.SE] (2023).

  • Noy, S. et al. Experimental evidence on the productivity effects of generative artificial intelligence. Science https://economics.mit.edu/sites/default/files/inline-files/Noy_Zhang_1.pdf (2023).

  • Perry, N., Srivastava, M., Kumar, D. & Boneh, D. Do users write more insecure code with AI assistants? Preprint at arXiv [cs.CR] (2022).

  • Zhang, X., Zhou, Z., Chen, D. & Wang, Y. E. AutoDistill: an end-to-end framework to explore and distill hardware-efficient language models. Preprint at arXiv [cs.LG] (2022).

  • El-Mhamdi, E.-M. et al. SoK: on the impossible security of very large foundation models. Preprint at arXiv [cs.LG] (2022).

  • Carlini, N. et al. Quantifying memorization across neural language models. Preprint at arXiv [cs.LG] (2022).

  • Mitchell, E., Lin, C., Bosselut, A., Manning, C. D. & Finn, C. Memory-based model editing at scale. Preprint at arXiv [cs.AI] (2022).

  • Sharir, O., Peleg, B. & Shoham, Y. The cost of training NLP models: a concise overview. Preprint at arXiv [cs.CL] (2020).

  • Yaeger, K. A., Martini, M., Yaniv, G., Oermann, E. K. & Costa, A. B. United States regulatory approval of medical devices and software applications enhanced by artificial intelligence. Health Policy Technol. 8, 192–197 (2019).

  • DeCamp, M. & Lindvall, C. Latent bias and the implementation of artificial intelligence in medicine. J. Am. Med. Inform. Assoc. 27, 2020–2023 (2020).

  • Wickens, C. D., Clegg, B. A., Vieane, A. Z. & Sebok, A. L. Complacency and automation bias in the use of imperfect automation. Hum. Factors 57, 728–739 (2015).
