Study on LLMs for Promptagator-Style Dense Retriever Training

2510.02241v1 cs.IR, cs.CL 2025-10-04

Authors:

Daniel Gwon, Nour Jedidi, Jimmy Lin

Abstract

Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs that users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales ($\leq$14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will offer practitioners reliable alternatives for synthetic data generation and give insights into maximizing fine-tuning results for domain-specific applications.
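
To make the idea of few-shot query generation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple "Document: / Query:" prompt template and a hypothetical `generate` callable standing in for any LLM (for example, a locally hosted open-source model); `build_few_shot_prompt`, `make_training_pairs`, and `stub_generate` are illustrative names that do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (document, query) demonstration pair


def build_few_shot_prompt(examples: Sequence[Example], document: str) -> str:
    """Format the demonstrations, then the target document with an open query slot.

    The "Document:/Query:" template is an assumption made for illustration.
    """
    parts = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    parts.append(f"Document: {document}\nQuery:")
    return "\n\n".join(parts)


def make_training_pairs(
    documents: Sequence[str],
    examples: Sequence[Example],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Create synthetic (query, document) pairs for fine-tuning a dense retriever.

    `generate` is a hypothetical stand-in for an LLM call: prompt in, text out.
    """
    pairs = []
    for doc in documents:
        query = generate(build_few_shot_prompt(examples, doc)).strip()
        if query:  # skip documents for which the model returned nothing
            pairs.append((query, doc))
    return pairs


def stub_generate(prompt: str) -> str:
    """Toy generator so the sketch runs without a model; replace with a real LLM."""
    last_doc = prompt.rsplit("Document: ", 1)[1].split("\nQuery:")[0]
    return " what is " + " ".join(last_doc.split()[:3]).lower() + " "


examples = [("Aspirin reduces fever.", "what does aspirin do")]
documents = ["Dense retrievers encode text into vectors."]
pairs = make_training_pairs(documents, examples, stub_generate)
```

The resulting `pairs` would then serve as positive training examples for a dense retriever. Swapping `stub_generate` for a call to an actual model leaves the rest of the sketch unchanged.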

Related articles

Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Gen...

QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation

Music Recommendation with Large Language Models: Challenges, Opportunities, and ...

CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Shor...

BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives
