The Model's Language Matters: A Comparative Privacy Analysis of LLMs
2510.08813v1
cs.CL, cs.CR
2025-10-14
Authors:
Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
Abstract
Large Language Models (LLMs) are increasingly deployed across multilingual
applications that handle sensitive data, yet their scale and linguistic
variability introduce major privacy risks. These risks have mostly been
evaluated for English; this paper investigates how language structure affects
privacy leakage in LLMs trained on English, Spanish, French, and Italian
medical corpora. We quantify
six linguistic indicators and evaluate three attack vectors: extraction,
counterfactual memorization, and membership inference. Results show that
privacy vulnerability scales with linguistic redundancy and tokenization
granularity: Italian exhibits the strongest leakage, while English shows higher
membership separability. In contrast, French and Spanish display greater
resilience due to higher morphological complexity. Overall, our findings
provide the first quantitative evidence that language matters in privacy
leakage, underscoring the need for language-aware privacy-preserving mechanisms
in LLM deployments.