MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

arXiv:2510.04363v2 [cs.SE, cs.AI, cs.CL], 2025-10-10

Authors:

Hyunjun Kim, Sejong Kim

Abstract

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.
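
The first stage of the validation protocol described above, static checks on generated code, might look roughly like the sketch below. This is an illustration only, not the benchmark's actual implementation: the function name `static_check`, the deny-list `FORBIDDEN_CALLS`, and the rule that a macro must import `selenium` are all assumptions made here for the example. It uses only Python's standard `ast` module and never executes the generated code.

```python
import ast

# Hypothetical deny-list; the abstract does not say which calls the real checks reject.
FORBIDDEN_CALLS = {"eval", "exec", "system"}


def static_check(source: str) -> list[str]:
    """Return a list of problems found in a generated macro (empty list means pass)."""
    try:
        tree = ast.parse(source)  # parses only; the macro is not run
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]

    problems = []
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import selenium.webdriver" counts as importing "selenium"
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
        elif isinstance(node, ast.Call):
            func = node.func
            # Plain names (eval) and attribute calls (os.system) are both inspected
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in FORBIDDEN_CALLS:
                problems.append(f"forbidden call: {name}")

    # Assumed rule: a browser-automation macro should import Selenium
    if "selenium" not in imported:
        problems.append("macro does not import selenium")
    return problems
```

A macro that passes such a check would then move on to sandboxed execution and outcome verification; those later stages depend on the self-hosted sites and are not sketched here.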
