Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

2509.05925v1 cs.CV, cs.IT, math.IT 2025-09-10

Authors:

Ruiqi Shen, Haotian Wu, Wenjing Zhang, Jiangjing Hu, Deniz Gunduz

Summary

## Context

Modern deep learning-based image compression methods achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Multimodal foundation models, with their zero-shot and representational capabilities, offer a promising direction for addressing them.

## Method

We propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we compress the CLIP feature embeddings into as few bits as possible while preserving semantic information for a range of tasks. This allows information to be represented efficiently with minimal resource consumption. The approach aims to preserve high semantic integrity and to support decoding under varied conditions.

## Results

Experiments on benchmark datasets show that the method preserves semantic integrity even under extreme compression. The overall bit rate is approximately $2 \times 10^{-3}$ to $3 \times 10^{-3}$ bits per pixel, which is less than 5% of the bitrate required by mainstream image compression methods for comparable performance. Thanks to its zero-shot robustness, the method remains stable across different data distributions and downstream tasks, even under extreme compression.

## Significance

The proposed approach is broadly applicable wherever semantic information is of primary importance, such as computer vision, mobile devices, and the Internet of Things. It offers a substantial bitrate reduction without loss of semantic integrity, which may influence the development of new applications.

## Conclusions

The proposed method demonstrates high semantic integrity under extreme compression and delivers robust performance across varied conditions. Future work will focus on further improving the method, including its use in real-time applications and its refinement for various downstream tasks.
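To make the idea of compressing an embedding rather than pixels more concrete, here is a minimal sketch. It is not the authors' method: the summary does not describe the actual codec, so a plain uniform scalar quantizer is swapped in for illustration. The random unit-norm vector is only a stand-in for a real CLIP embedding, and the dimension of 512, the bit depths, and all function names are assumptions made for this example.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def quantize(vec, bits):
    """Map each component in [-1, 1] to an integer index in [0, 2**bits - 1]."""
    levels = 2 ** bits
    step = 2.0 / levels
    # Clamp so that x == 1.0 falls into the last bin instead of overflowing.
    return [min(levels - 1, int((x + 1.0) / step)) for x in vec]


def dequantize(indices, bits):
    """Reconstruct each component as the midpoint of its quantization bin."""
    step = 2.0 / (2 ** bits)
    return [-1.0 + (i + 0.5) * step for i in indices]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, used here as a semantic proxy."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def demo(dim=512, bits=8, seed=0):
    """Quantize a stand-in embedding; return (total bits, cosine similarity)."""
    rng = random.Random(seed)
    # Hypothetical stand-in for a CLIP image embedding (assumption, not real data).
    embedding = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    indices = quantize(embedding, bits)
    restored = dequantize(indices, bits)
    return dim * bits, cosine_similarity(embedding, restored)


if __name__ == "__main__":
    for bits in (2, 4, 8):
        total, similarity = demo(bits=bits)
        print(f"{bits} bits/component -> {total} bits total, cosine = {similarity:.4f}")
```

The fixed range of [-1, 1] is deliberately crude: components of a high-dimensional unit vector cluster near zero, so most quantization levels go unused at low bit depths. A practical codec would presumably adapt the quantizer to the data and add entropy coding, but those details are outside what the summary states.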

Abstract

Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately $2 \times 10^{-3}$ to $3 \times 10^{-3}$ bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
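The bits-per-pixel figures in the abstract can be turned into absolute sizes with simple arithmetic. The sketch below does this for a hypothetical 512 x 512 image; the resolution is an assumption chosen for illustration, not a value from the paper, and the function names are invented for this example. It also derives the lower bound on the baseline bitrate that the "less than 5%" statement implies.

```python
def total_bits(width, height, bits_per_pixel):
    """Total payload size in bits for an image coded at a given rate."""
    return width * height * bits_per_pixel


def min_baseline_bpp(rate_bpp, fraction):
    """If rate_bpp is less than `fraction` of a baseline rate,
    the baseline must exceed rate_bpp / fraction."""
    return rate_bpp / fraction


if __name__ == "__main__":
    width, height = 512, 512  # hypothetical resolution (assumption)
    for rate in (2e-3, 3e-3):  # range quoted in the abstract
        bits = total_bits(width, height, rate)
        bound = min_baseline_bpp(rate, 0.05)
        print(f"{rate:.0e} bpp -> {bits:.1f} bits ({bits / 8:.1f} bytes); "
              f"implied baseline rate above {bound:.2f} bpp")
```

Under this assumed resolution, the quoted range works out to roughly 524 to 786 bits per image, which is under 100 bytes. The 5% statement correspondingly implies that the compared baselines need more than about 0.04 to 0.06 bits per pixel for comparable performance.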

