MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
2510.04363v2
cs.SE, cs.AI, cs.CL
2025-10-10
Authors:
Hyunjun Kim, Sejong Kim
Abstract
We introduce MacroBench, a code-first benchmark that evaluates whether LLMs
can synthesize reusable browser-automation programs (macros) from
natural-language goals by reading HTML/DOM and emitting Selenium code. MacroBench
instantiates seven self-hosted sites covering 681 tasks across interaction
complexity and targeting difficulty. Our end-to-end protocol validates
generated code via static checks, sandboxed execution, and outcome verification
(DOM assertions, database snapshots), and includes a safety suite for scraping,
spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we
observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini
(89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail
on complex workflows (0.0%), and none meet production-quality coding practices
despite functional completion. We release our complete benchmark pipeline,
evaluation framework, and experimental results at
https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of
macro synthesis for web automation.
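
The static-check stage of the validation protocol can be illustrated with a minimal sketch. It assumes the generated macros are Python Selenium scripts. The rule set (`REQUIRED_ROOT`, `FORBIDDEN_CALLS`), the function name `static_check`, and the sample macro with its localhost URL are hypothetical, chosen for illustration only; they are not taken from the MacroBench pipeline. The sketch only parses the code and never runs it, so neither Selenium nor a browser is needed.

```python
import ast

# Hypothetical rules for illustration; not the MacroBench rule set.
REQUIRED_ROOT = "selenium"
FORBIDDEN_CALLS = {"eval", "exec"}


def static_check(source: str) -> list[str]:
    """Return the problems found in generated macro code (empty list = pass)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    imported_roots = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import selenium.webdriver" -> root package "selenium"
            imported_roots.update(a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported_roots.add(node.module.split(".")[0])
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                problems.append(
                    f"forbidden call: {node.func.id} (line {node.lineno})"
                )

    if REQUIRED_ROOT not in imported_roots:
        problems.append("macro does not import selenium")
    return problems


# Hypothetical generated macro; it is only parsed here, never executed.
SAMPLE_MACRO = '''
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://localhost:8000/login")
driver.find_element(By.ID, "username").send_keys("demo")
driver.quit()
'''

if __name__ == "__main__":
    print(static_check(SAMPLE_MACRO))
```

A check of this kind only rejects code that cannot be parsed or that breaks a simple rule. Whether a macro actually achieves its goal is a separate question, which the protocol described above addresses through sandboxed execution and outcome verification.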