📊 Статистика дайджестов
Всего дайджестов: 34022 Добавлено сегодня: 82
Последнее обновление: сегодня
Авторы:
Shaked Zychlinski, Yuval Kainan
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Large Language Models (LLMs) are susceptible to jailbreak attacks where
malicious prompts are disguised using ciphers and character-level encodings to
bypass safety guardrails. While these guardrails often fail to interpret the
encoded content, the underlying models can still process the harmful
instructions. We introduce CPT-Filtering, a novel, model-agnostic with
negligible-costs and near-perfect accuracy guardrail technique that aims to
mitigate these attacks by leveraging the intrinsic behav...