Large language models overcome the challenges of unstructured text data in ecology

Abstract

The vast volume of unstructured textual data, such as that found in research papers, news outlets, and technical reports, holds largely untapped potential for ecological research. However, processing such data manually is labour-intensive, which presents a considerable challenge. In this work, we explore the application of three state-of-the-art Large Language Models (LLMs), namely ChatGPT 3.5, ChatGPT 4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focus specifically on species distribution data, using two challenging sources of these data: news outlets and research papers. We assess the LLMs on four key tasks: identifying documents that provide species distribution data, identifying the regions where species observations are mentioned, generating geographical coordinates for these regions, and providing results in a structured format. Our results show that ChatGPT 4 consistently outperforms the other models, demonstrating a high capacity to interpret textual narratives and to extract relevant information, with a percentage of correct outputs often exceeding 90%. However, performance also appears to depend on the type of data source and the task tested: results are better for news texts, and for the tasks of identifying regions where species were observed and presenting structured output. ChatGPT 3.5, the predecessor of ChatGPT 4, delivers moderately lower accuracy across tasks and data sources, while LLaMA-2-70B performs worse still. The integration of LLMs into ecological data assimilation workflows appears not only imminent, but also essential to meet the growing challenge of efficiently processing an increasing volume of textual data.