📊 Статистика дайджестов
Всего дайджестов: 34022 Добавлено сегодня: 82
Последнее обновление: сегодня
Авторы:
Yinxi Li, Yuntian Deng, Pengyu Nie
Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']
Annotation:
Large language models (LLMs) for code rely on subword tokenizers, such as
byte-pair encoding (BPE), learned from mixed natural language text and
programming language code but driven by statistics rather than grammar. As a
result, semantically identical code snippets can be tokenized differently
depending on superficial factors such as whitespace or identifier naming. To
measure the impact of this misalignment, we introduce TokDrift, a framework
that applies semantic-preserving rewrite rules to c...