Medical Vision Language Models as Policies for Robotic Surgery
2510.06064v1
cs.CV, cs.LG
2025-10-09
Authors:
Akshay Muppidi, Martin Radfar
Abstract
Vision-based Proximal Policy Optimization (PPO) struggles with visual
observation-based robotic laparoscopic surgical tasks due to the
high-dimensional nature of visual input, the sparsity of rewards in surgical
environments, and the difficulty of extracting task-relevant features from raw
visual data. We introduce a simple approach integrating MedFlamingo, a medical
domain-specific Vision-Language Model, with PPO. Our method is evaluated on
five diverse laparoscopic surgery task environments in LapGym, using only
endoscopic visual observations. MedFlamingo PPO outperforms both the standard
vision-based PPO and OpenFlamingo PPO baselines and converges faster than
either, achieving task success rates exceeding 70% across all environments,
with relative improvements ranging from 66.67% to 1114.29% over the baseline. By
processing task observations and instructions once per episode to generate
high-level planning tokens, our method efficiently combines medical expertise
with real-time visual feedback. Our results highlight the value of specialized
medical knowledge in robotic surgical planning and decision-making.
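
The once-per-episode planning idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `StubPlanner` is a hypothetical stand-in for a vision-language model and does not reflect MedFlamingo's actual interface, the four-value "plan embedding" is arbitrary, and concatenating the plan with per-step frame features is only one plausible way to condition a PPO policy. The abstract does not specify how the planning tokens reach the policy.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class StubPlanner:
    """Hypothetical stand-in for a VLM; NOT MedFlamingo's real API."""
    calls: int = 0  # counts how often the (expensive) planner is queried

    def plan(self, first_frame: List[float], instruction: str) -> List[float]:
        # A real VLM would emit high-level planning tokens. Here we derive a
        # deterministic pseudo-embedding from the inputs (illustrative only).
        self.calls += 1
        seed = len(instruction) + int(sum(first_frame) * 1000)
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(4)]


def policy_input(frame_features: List[float], plan: List[float]) -> List[float]:
    # Assumed conditioning scheme: concatenate per-step visual features with
    # the fixed episode-level plan before passing them to the policy network.
    return list(frame_features) + list(plan)


def run_episode(planner: StubPlanner,
                frames: List[List[float]],
                instruction: str) -> List[List[float]]:
    # Query the planner once, at the start of the episode.
    plan = planner.plan(frames[0], instruction)
    # Reuse the same plan at every step; only the visual features change.
    return [policy_input(frame, plan) for frame in frames]


if __name__ == "__main__":
    planner = StubPlanner()
    # Toy 2-dimensional "frame features" and a made-up instruction string.
    frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    inputs = run_episode(planner, frames, "grasp the tissue")
    print("planner calls:", planner.calls)
    print("policy input size:", len(inputs[0]))
```

In this sketch the planner cost is paid once per episode regardless of episode length, while the policy still receives fresh visual features at every step. In a real setup, the returned vectors would be the observations fed to a PPO actor-critic.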