📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 Bag of Tricks for Subverting Reasoning-based Safety Guardrails

2025-10-15

Авторы:

Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails help the models to assess the safety of user inputs before generating final responses. The powerful reasoning ability can analyze the intention of the input query and will refuse to assist once it detects the harmful intent hidden by the jailbreak methods. Such guardrails have shown...

ID: 2510.11570v1 cs.CR, cs.CL

arXiv PDF

📄 Exploiting Web Search Tools of AI Agents for Data Exfiltration

2025-10-14

Авторы:

Dennis Rall, Bernhard Bauer, Mohit Mittal, Thomas Fraunholz

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. As LLMs increasingly interact with external data sources, indirect prompt injection emerges as a critical and evolving attack vector, enabling ...

ID: 2510.09093v1 cs.CR, cs.CL, 68T50, 68T0, F.2.2; I.2.7; K.6.5

arXiv PDF

📄 PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

2025-10-11

Авторы:

Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Based on a comprehensive study using circuit discovery to identify the computational circuits responsible PII leakage in LMs, we hypothesize that specific PII leakage circuits in LMs should be responsible for this behavior. Therefore...

ID: 2510.07452v1 cs.CR, cs.CL

arXiv PDF

📄 Exposing Citation Vulnerabilities in Generative Engines

2025-10-10

Авторы:

Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We analyze answers generated by generative engines (GEs) from the perspectives of citation publishers and the content-injection barrier, defined as the difficulty for attackers to manipulate answers to user prompts by placing malicious content on the web. GEs integrate two functions: web search and answer generation that cites web pages using large language models. Because anyone can publish information on the web, GEs are vulnerable to poisoning attacks. Existing studies of citation evaluation ...

ID: 2510.06823v1 cs.CR, cs.CL, cs.IR

arXiv PDF

📄 RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

2025-10-10

Авторы:

Artur Horal, Daniel Pina, Henrique Paz, Iago Paulo, João Soares, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, João Magalhães, David Semedo

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework, to audit the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak conversational strat...

ID: 2510.06994v1 cs.CR, cs.CL

arXiv PDF

📄 Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)

2025-10-10

Авторы:

Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior ...

ID: 2510.06719v1 cs.CR, cs.CL, cs.LG

arXiv PDF

📄 Proactive defense against LLM Jailbreak

2025-10-08

Авторы:

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea...

ID: 2510.05052v1 cs.CR, cs.CL

arXiv PDF

📄 Fingerprinting LLMs via Prompt Injection

2025-10-02

Авторы:

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processin...

ID: 2509.25448v2 cs.CR, cs.CL

arXiv PDF

📄 MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction

2025-10-01

Авторы:

Sepideh Abedini, Shubhankar Mohapatra, D. B. Emerson, Masoumeh Shafieinejad, Jesse C. Cresswell, Xi He

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Large language models (LLMs) have shown promising performance on tasks that require reasoning, such as text-to-SQL, code generation, and debugging. However, regulatory frameworks with strict privacy requirements constrain their integration into sensitive systems. State-of-the-art LLMs are also proprietary, costly, and resource-intensive, making local deployment impractical. Consequently, utilizing such LLMs often requires sharing data with third-party providers, raising privacy concerns and risk...

ID: 2509.23459v2 cs.CR, cs.CL

arXiv PDF

📄 SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models

2025-09-30

Авторы:

Jingkai Guo, Chaitali Chakrabarti, Deliang Fan

#### Контекст Large Language Models (LLMs) становятся все более популярными благодаря своим возможностям в области текстового понимания и генерации. Однако их со временем становится все чаще целью атак на безопасность. Одной из таких угроз является Bit-Flip Attack (BFA), способ атаки, в котором действующий бит в памяти модели меняется на ноль. Ранее проводились исследования, показавшие, что даже небольшое количество таких битовых ошибок может стать причиной серьезного ухудшения качества работы моделей, достигая уровня случайного генерирования. В этом работе мы исследуем применение BFA к самым современным LLMs и продемонстрируем, что даже один бит может испортить работу модели. #### Метод Мы предлагаем Single Sneaky Bit Flip Attack (SBFA), новый атакующий алгоритм, который разработан для LLMs. Этот метод основывается на итерационной оценке и рейтинге параметров модели с помощью ImpactScore, метрики, которая учитывает градиентную чувствительность и ограничение переменных в разумных границах нормальных значений весов модели. Для повышения эффективности, мы применяем новую легковесную SKIP-методику, которая существенно сокращает сложность поиска. Это позволяет выполнить поиск в течение нескольких минут для современных моделей LLM. Мы применяем SBFA к моделям Qwen, LLaMA и Gemma, чтобы продемонстрировать свою эффективность. #### Результаты Мы проводили эксперименты с LLMs в разных условиях, включая BF16 и INT8 данные. Наши результаты показывают, что SBFA способен серьезно испортить работу моделей, ниже уровня случайного угадывания, с помощью только одного бита из миллиардов параметров. Это отмечается как на Qwen, так и на LLaMA и Gemma. Эти результаты показывают, что даже один небольшой битовый сбой может стать причиной катастрофической заваливания модели. #### Значимость Эти результаты являются важной новостью для развития безопасности моделей LLMs. Мы показываем, что уязвимость LLMs к таким атакам может быть использована для нанесения вреда, даже при минимальных вмешательствах. Это открывает новые пути для развития методов защиты LLMs и повышения их надежности в реальном мире. #### Выводы Наши результаты демонстрируют, что SBFA является эффективным инструментом для проведения BFA на современных LLMs. Мы показываем, что даже один бит может стать причиной серьезного недостатка модели. Будущие исследования будут направлены на развитие методов защиты от таких атак и расширение понимания уязвимостей LLMs.

Annotation:

Model integrity of Large language models (LLMs) has become a pressing security concern with their massive online deployment. Prior Bit-Flip Attacks (BFAs) -- a class of popular AI weight memory fault-injection techniques -- can severely compromise Deep Neural Networks (DNNs): as few as tens of bit flips can degrade accuracy toward random guessing. Recent studies extend BFAs to LLMs and reveal that, despite the intuition of better robustness from modularity and redundancy, only a handful of adver...

ID: 2509.21843v1 cs.CR, cs.CL, cs.LG

arXiv PDF

Показано 31 - 40 из 58 записей