Label Forensics: Interpreting Hard Labels in Black-Box Text Classifier

arXiv:2512.01514v1 [cs.LG] 2025-12-04

Authors:

Mengyao Du, Gang Yang, Han Fang, Quanjun Yin, Ee-chien Chang

Abstract

The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. Concretely, we represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We believe this distribution should maintain two key properties: precise, with samples consistently classified into the target label, and general, covering the label's broad semantic space. To realize this, we design a semantic neighborhood sampler and an iterative optimization procedure to select representative seed sentences that jointly maximize label consistency and distributional coverage. The final output, an optimized seed sentence set combined with the sampler, constitutes the empirical distribution representing the label's semantics. Experiments on multiple black-box classifiers achieve an average label consistency of around 92.24 percent, demonstrating that the embedding regions accurately capture each classifier's label semantics. We further validate our framework on an undocumented HuggingFace classifier, enabling fine-grained label interpretation and supporting responsible AI auditing.
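
The abstract describes the selection procedure only at a high level, so the following is a minimal toy sketch of the general idea and not the authors' algorithm. It assumes a 2-D stand-in for sentence embeddings, a hypothetical `black_box` classifier that returns only hard labels, a Gaussian perturbation as the neighborhood sampler, and a greedy loop in place of the paper's iterative optimization. All function names, thresholds, and radii are illustrative assumptions.

```python
import math
import random

def black_box(point):
    """Stand-in hard-label classifier (assumption): label 1 inside the unit disc, else 0."""
    x, y = point
    return 1 if x * x + y * y <= 1.0 else 0

def sample_neighbors(seed, rng, n=50, sigma=0.15):
    """Hypothetical neighborhood sampler: Gaussian perturbations around a seed embedding."""
    return [(rng.gauss(seed[0], sigma), rng.gauss(seed[1], sigma)) for _ in range(n)]

def consistency(seed, target, rng):
    """Fraction of sampled neighbors that the classifier assigns to the target label."""
    neighbors = sample_neighbors(seed, rng)
    return sum(black_box(p) == target for p in neighbors) / len(neighbors)

def coverage(seeds, pool, radius=0.5):
    """Fraction of pool points lying within `radius` of at least one seed."""
    if not pool:
        return 0.0
    covered = sum(any(math.dist(p, s) <= radius for s in seeds) for p in pool)
    return covered / len(pool)

def select_seeds(candidates, target, k=5, min_consistency=0.9, seed=0):
    """Greedy sketch: filter for precision first, then add seeds that widen coverage."""
    rng = random.Random(seed)
    # Candidates that the classifier itself assigns to the target label.
    pool = [c for c in candidates if black_box(c) == target]
    # Precision: keep candidates whose sampled neighborhoods stay on-label.
    precise = [c for c in pool if consistency(c, target, rng) >= min_consistency]
    chosen = []
    for _ in range(k):
        remaining = [c for c in precise if c not in chosen]
        if not remaining:
            break
        # Generality: pick the candidate that gives the largest coverage of the pool.
        best = max(remaining, key=lambda c: coverage(chosen + [c], pool))
        chosen.append(best)
    return chosen, coverage(chosen, pool)

rng = random.Random(42)
candidates = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(300)]
seeds, cov = select_seeds(candidates, target=1)
print(f"selected {len(seeds)} seeds, coverage of on-label pool: {cov:.2f}")
```

In this sketch the selected seeds together with `sample_neighbors` play the role of the empirical distribution for the label: drawing a seed and then a neighbor yields a point that should be classified into the target label. With a real classifier, the 2-D points would be replaced by sentence embeddings and the sampler would need a way to map sampled embeddings back to text the classifier can score.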
