Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

2511.14210v2 cs.CV, cs.AI, cs.LG 2025-11-21

Авторы:

N Dinesh Reddy, Dylan Snyder, Lona Kiragu, Mirajul Mohin, Shahrear Bin Amin, Sudeep Pillai

Abstract

We introduce Orion, a visual agent that integrates vision-based reasoning with tool-augmented execution to achieve powerful, precise, multi-step visual intelligence across images, video, and documents. Unlike traditional vision-language models that generate descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance across MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic VLM capabilities to production-grade visual intelligence. Through its agentic, tool-augmented approach, Orion enables autonomous visual reasoning that bridges neural perception with symbolic execution, marking the transition from passive visual understanding to active, tool-driven visual intelligence. Try Orion for free at: https://chat.vlm.run Learn more at: https://www.vlm.run/orion

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

Авторы:

Abstract

Ссылки и действия

Связанные статьи

PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispec...

ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fra...

GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy...

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video...

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Навигация