Large language models overcome the challenges of unstructured text data in ecology

Task Force 1 will be taking a deep dive into the topic of artificial intelligence (AI) through Workstream 3: learning events, commencing this autumn.

Ahead of this, we thought you might be interested in this paper by Andry Castro et al., published in July 2024, which explores the potential of large language models (LLMs) to overcome the challenges posed by unstructured text data in ecology, such as research papers and technical reports.

The paper notes that manual processing of such data is labour-intensive and poses a significant challenge. In this study, Castro and team used three prompt-based LLMs (GPT 3.5, GPT 4 and LLaMA-2-70B) to automate the identification, interpretation, extraction and structuring of relevant ecological information from unstructured text sources. The study found that GPT 4 consistently outperformed the other models, often exceeding 90% accuracy (averaging 87-100%).

The results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.

Download a copy of the paper here.

Castro, A., Pinto, J., Reino, L., Pipek, P., Capinha, C. (2024) Large language models overcome the challenges of unstructured text data in ecology. Universidade de Lisboa, Portugal.

Shared under Creative Commons license CC-BY-ND 4.0.