MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

arXiv:2510.04363v1 [cs.SE, cs.AI, cs.CL], 2025-10-08

Authors:

Hyunjun Kim, Sejong Kim

Abstract

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser automation programs from natural language goals by reading HTML/DOM and emitting Python with Selenium. MacroBench instantiates seven self-hosted sites (Airbnb-like, TikTok-like, Reddit-like, Instagram-like, Facebook-like, Discord-like, and Threads-like) covering 681 tasks that span levels of interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification, including DOM assertions and database snapshots, and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2636 model-task runs, we observe stratified success: GPT-4o-Mini achieves 96.8 percent, GPT-4.1 achieves 95.3 percent, Gemini-2.5-Pro achieves 89.0 percent, and DeepSeek-V3.1 achieves 83.4 percent. Models handle simple tasks reliably, with 91.7 percent success, but fail on complex workflows, with 0.0 percent success, and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results to enable reproducible assessment of macro synthesis for web automation.
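The abstract names static checks as the first validation stage but does not specify them. As a rough illustration only, the sketch below shows what such a stage could look like for a generated Python script: it confirms the code parses, that it imports Selenium, and that it avoids a small denylist of modules. The function name `static_check`, the denylist contents, and the exact rules are assumptions made for this example, not the benchmark's actual implementation.

```python
from __future__ import annotations

import ast

# Hypothetical denylist; the real benchmark's rules are not given in the abstract.
FORBIDDEN_MODULES = {"subprocess", "socket"}


def static_check(source: str) -> list[str]:
    """Return a list of problems found in a generated script (empty means pass)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # A script that does not parse cannot proceed to sandboxed execution.
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    # Collect the top-level package name of every import in the script.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    problems: list[str] = []
    if "selenium" not in imported:
        problems.append("script does not import selenium")
    for name in sorted(imported & FORBIDDEN_MODULES):
        problems.append(f"forbidden module imported: {name}")
    return problems
```

A script that passes would return an empty list, for example `static_check("from selenium import webdriver\n")`. A check of this kind only filters out obviously unusable output; whether the script achieves the task goal would still have to be established by the sandboxed execution and outcome verification stages the abstract describes.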
