How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
2511.03825v1
cs.AI, cs.CL, cs.CR, cs.LG
2025-11-08
Авторы:
Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder
Abstract
Tokenization is fundamental in assembly code analysis, impacting intrinsic
characteristics like vocabulary size, semantic coverage, and extrinsic
performance in downstream tasks. Despite its significance, tokenization in the
context of assembly code remains an underexplored area. This study aims to
address this gap by evaluating the intrinsic properties of Natural Language
Processing (NLP) tokenization models and parameter choices, such as vocabulary
size. We explore preprocessing customization options and pre-tokenization rules
tailored to the unique characteristics of assembly code. Additionally, we
assess their impact on downstream tasks like function signature prediction -- a
critical problem in binary code analysis.
To this end, we conduct a thorough study on various tokenization models,
systematically analyzing their efficiency in encoding assembly instructions and
capturing semantic nuances. Through intrinsic evaluations, we compare
tokenizers based on tokenization efficiency, vocabulary compression, and
representational fidelity for assembly code. Using state-of-the-art pre-trained
models such as the decoder-only Large Language Model (LLM) Llama 3.2, the
encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate
the effectiveness of these tokenizers across multiple performance metrics.
Preliminary findings indicate that tokenizer choice significantly influences
downstream performance, with intrinsic metrics providing partial but incomplete
predictability of extrinsic evaluation outcomes. These results reveal complex
trade-offs between intrinsic tokenizer properties and their utility in
practical assembly code tasks. Ultimately, this study provides valuable
insights into optimizing tokenization models for low-level code analysis,
contributing to the robustness and scalability of Natural Language Model
(NLM)-based binary analysis workflows.