A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

arXiv:2510.20782v1 [cs.CL, cs.AI; I.2.7], 2025-10-25

Authors:

Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, Vanessa Murdock

Abstract

Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
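The parameterization described in the abstract can be pictured as a cross product of the three axes it names. The following is a minimal sketch under stated assumptions, not the authors' construction: every attribute value, adjective, category, feature list, and the prompt wording is a hypothetical placeholder, and the function name `build_prompts` is invented for illustration.

```python
from itertools import product

# All values below are hypothetical placeholders, not taken from the dataset.
FAIRNESS_ATTRIBUTES = ["women", "men", "older adults"]
GENDERED_ADJECTIVES = ["elegant", "rugged"]
PRODUCT_CATEGORIES = {
    "backpack": ["water-resistant", "20 L capacity"],
    "wristwatch": ["stainless steel", "sapphire glass"],
}


def build_prompts():
    """Cross every attribute, adjective, and category into one labeled prompt."""
    prompts = []
    # Iterating over the dict yields its keys (the category names).
    for attribute, adjective, category in product(
        FAIRNESS_ATTRIBUTES, GENDERED_ADJECTIVES, PRODUCT_CATEGORIES
    ):
        features = [adjective, f"for {attribute}", *PRODUCT_CATEGORIES[category]]
        prompts.append({
            # Labels record which cell of the grid the prompt came from,
            # so model outputs can later be grouped and compared per label.
            "labels": {
                "attribute": attribute,
                "adjective": adjective,
                "category": category,
            },
            "prompt": (
                f"Write a plain-text product description for a {category} "
                "with these features: " + ", ".join(features) + "."
            ),
        })
    return prompts


prompts = build_prompts()
print(len(prompts))  # 3 attributes * 2 adjectives * 2 categories = 12
```

Keeping the labels next to each prompt is the design choice that matters here: it lets any downstream quality, veracity, safety, or fairness score be sliced by attribute, adjective, or category without re-parsing the prompt text.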
