SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment
2510.19979v1
cs.CR, cs.LG, cs.SE
2025-10-25
Авторы:
Tushar Nayan, Ziqi Zhang, Ruimin Sun
Abstract
With the increasing deployment of Large Language Models (LLMs) on mobile and
edge platforms, securing them against model extraction attacks has become a
pressing concern. However, protecting model privacy without sacrificing the
performance benefits of untrusted AI accelerators, such as GPUs, presents a
challenging trade-off. In this paper, we initiate the study of high-performance
execution on LLMs and present SecureInfer, a hybrid framework that leverages a
heterogeneous Trusted Execution Environments (TEEs)-GPU architecture to isolate
privacy-critical components while offloading compute-intensive operations to
untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts
an information-theoretic and threat-informed partitioning strategy:
security-sensitive components, including non-linear layers, projection of
attention head, FNN transformations, and LoRA adapters, are executed inside an
SGX enclave, while other linear operations (matrix multiplication) are
performed on the GPU after encryption and are securely restored within the
enclave. We implement a prototype of SecureInfer using the LLaMA-2 model and
evaluate it across performance and security metrics. Our results show that
SecureInfer offers strong security guarantees with reasonable performance,
offering a practical solution for secure on-device model inference.
Ссылки и действия
Дополнительные ресурсы: