Leveraging large language models for data analysis automation

Abstract

Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data interpretation is vital for understanding complex biological processes and for developing new treatments and diagnostics. To address this, we developed mergen, an R package that leverages Large Language Models (LLMs) to generate and execute data analysis code. Our primary goal is to enable users to carry out data analysis by describing, in plain text, their objectives and the analyses they want for a given dataset. Our approach improves code generation through specialized prompt engineering and error feedback mechanisms. In addition, the system can execute the workflows prescribed by the LLM and return the results for human review. We evaluated the system on a range of data analysis tasks. The evaluation showed that while LLMs effectively generate code for some tasks, challenges remain in generating executable code, especially for complex tasks. Our study contributes to a better understanding of LLM capabilities and limitations, and provides software infrastructure and practical insights for integrating LLMs effectively into data analysis workflows.
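
To make the error feedback idea concrete, the following is a minimal, language-agnostic sketch written in Python rather than R. It is not mergen's implementation or API: the function `run_with_error_feedback`, the stub `fake_llm`, and the prompt wording are all hypothetical, and the stub stands in for a real LLM call. The loop asks for code, tries to execute it, and on failure sends the error message back with the original task so that the next attempt can correct it.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_with_error_feedback(
    task: str,
    generate_code: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[Optional[dict], int]:
    """Request code for `task`, execute it, and retry with the error on failure.

    Returns (namespace, attempts_used); namespace is None if every attempt failed.
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        code = generate_code(prompt)
        namespace: dict = {}
        try:
            # Execute the generated code in an isolated namespace.
            # Assumption: the code is trusted enough to run locally.
            exec(code, namespace)
            return namespace, attempt
        except Exception:
            # Feed the failing code and its error back into the next prompt.
            error = traceback.format_exc(limit=1)
            prompt = (
                f"{task}\n\nThe previous code:\n{code}\n"
                f"failed with:\n{error}\nPlease return corrected code."
            )
    return None, max_attempts


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: buggy first reply, fixed after feedback."""
    if "failed with" in prompt:
        return "result = sum([1, 2, 3]) / 3"
    return "result = sum([1, 2, 3]) / n"  # raises NameError: n is undefined


namespace, attempts = run_with_error_feedback(
    "Compute the mean of 1, 2, 3.", fake_llm
)
```

With the stub above, the first attempt fails on the undefined name and the second succeeds once the error is included in the prompt. Capping the number of attempts keeps the loop from running indefinitely on tasks the model cannot solve, which matches the observation that executable code generation remains difficult for complex tasks.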