MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
2510.04363v1
cs.SE, cs.AI, cs.CL
2025-10-08
Авторы:
Hyunjun Kim, Sejong Kim
Abstract
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs
can synthesize reusable browser automation programs from natural language goals
by reading HTML/DOM and emitting Python with Selenium. MacroBench instantiates
seven self-hosted sites: Airbnb-like, TikTok-like, Reddit-like, Instagram-like,
Facebook-like, Discord-like, and Threads-like, covering 681 tasks across
interaction complexity and targeting difficulty. Our end-to-end protocol
validates generated code via static checks, sandboxed execution, and outcome
verification including DOM assertions and database snapshots, and includes a
safety suite for scraping, spam/abuse, and credential/privacy prompts. Across
2636 model-task runs, we observe stratified success: GPT-4o-Mini achieves 96.8
percent, GPT-4.1 achieves 95.3 percent, Gemini-2.5-Pro achieves 89.0 percent,
and DeepSeek-V3.1 achieves 83.4 percent. Models handle simple tasks reliably at
91.7 percent but fail on complex workflows at 0.0 percent, and none meet
production-quality coding practices despite functional completion. We release
our complete benchmark pipeline, evaluation framework, and experimental results
to enable reproducible assessment of macro synthesis for web automation.
Ссылки и действия
Дополнительные ресурсы: