CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation
2508.03489v1
cs.CL, cs.AI
2025-08-06
Authors:
Kaiwen Zhao, Bharathan Balaji, Stephen Lee
Summary
The authors study the task of answering questions about the carbon footprint of products from unstructured PDF reports. They introduce the open dataset CarbonPDF-QA: 1735 documents and 1000+ annotated questions. GPT-4o is shown to handle non-standard tables and text poorly. The proposed solution is CarbonPDF: a fine-tuned Llama-3 8B with a RAG module that accounts for table structure and context. Experiments: +15% accuracy compared with SOTA baselines (TableLlama, GPT-4o). The method is open and applicable to automated assessment of product sustainability.
Abstract
Product sustainability reports provide valuable insights into the
environmental impacts of a product and are often distributed in PDF format.
These reports typically combine tables and text, which complicates
their analysis. The lack of standardization and the variability in reporting
formats further exacerbate the difficulty of extracting and interpreting
relevant information from large volumes of documents. In this paper, we tackle
the challenge of answering questions related to carbon footprints within
sustainability reports available in PDF format. Unlike previous approaches, our
focus is on addressing the difficulties posed by the unstructured and
inconsistent nature of text extracted from PDF parsing. To facilitate this
analysis, we introduce CarbonPDF-QA, an open-source dataset containing
question-answer pairs for 1735 product report documents, along with
human-annotated answers. Our analysis shows that GPT-4o struggles to answer
questions with data inconsistencies. To address this limitation, we propose
CarbonPDF, an LLM-based technique specifically designed to answer carbon
footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama
3 with our training data. Our results show that our technique outperforms
current state-of-the-art techniques, including question-answering (QA) systems
fine-tuned on table and text data.