缰绳设计:用于长时间运行的应用程序开发 | 中英双语版
Harness Design for Long-Running Application Development
来源: Anthropic Engineering Blog
发布日期: 2026年3月24日
作者: Prithvi Rajasekaran(Anthropic Labs 团队)
中文翻译: 西安铂傲智能科技有限公司
Abstract | 摘要
Harness design is key to performance at the frontier of agentic coding. Here’s how we pushed Claude further in frontend design and long-running autonomous software engineering.
缰绳设计是智能体编码前沿性能的关键。以下是我们如何将 Claude 推向前端设计和长时间自主软件工程极限的做法。
1. Introduction | 1. 引言
Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention.
在过去的数月中,我一直致力于两个相互关联的问题:让 Claude 生成高质量的前端设计,以及让它无需人工干预即可构建完整的应用程序。
This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.
这项工作源于我们早期在前端设计技能和长时间编码智能体缰绳方面的努力。通过提示工程和缰绳设计,我与同事成功将 Claude 的性能提升至远高于基线的水平——但两者最终都遇到了瓶颈。
To break through, I sought out novel AI engineering approaches… Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent.
为了突破瓶颈,我探索了新型 AI 工程方法…从生成对抗网络(GANs)中获得灵感,我设计了一个包含生成器和评估器的多智能体结构。
The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.
最终成果是一个三智能体架构——规划器、生成器和评估器——在数小时的自主编码会话中生产出功能丰富的全栈应用程序。
2. Why Naive Implementations Fall Short | 2. 为何简单实现无法胜任
2.1 Context Anxiety | 2.1 上下文焦虑
First is that models tend to lose coherence on lengthy tasks as the context window fills. Some models also exhibit “context anxiety,” in which they begin wrapping up work prematurely as they approach what they believe is their context limit.
首先,随着上下文窗口填满,模型在处理长任务时往往会失去连贯性。部分模型还表现出“上下文焦虑”——即它们在接近自认为的上下文极限时,过早开始收尾工作。
Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent’s state and the next steps—addresses both these issues.
上下文重置——完全清空上下文窗口并启动全新智能体,结合携带上一智能体状态与后续步骤的结构化交接——可以同时解决这两个问题。
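The reset mechanic can be sketched as follows; `run_agent_step`, the message budget, and the handoff schema are illustrative stand-ins, not the article's actual implementation.
下面是上下文重置机制的一个最小示意;其中 `run_agent_step`、消息预算与交接格式均为示意性假设,并非文中的实际实现。

```python
# Minimal sketch of a context-reset loop with a structured handoff.
# `run_agent_step` stands in for a real model call; names are illustrative.
import json

CONTEXT_BUDGET = 8  # pretend the agent can only hold 8 messages in context

def run_agent_step(messages):
    # Placeholder for an LLM call; here it just reports progress.
    return f"step-{len(messages)}"

def make_handoff(messages):
    """Compress the old context into structured state plus next steps."""
    return json.dumps({
        "completed": messages[-3:],   # recent-work summary carried forward
        "next_steps": ["continue from the last completed item"],
    })

def run_with_resets(task, total_steps):
    messages = [task]
    for _ in range(total_steps):
        if len(messages) >= CONTEXT_BUDGET:   # nearing the "context limit"
            handoff = make_handoff(messages)
            messages = [task, handoff]        # fresh agent: task + handoff only
        messages.append(run_agent_step(messages))
    return messages

final = run_with_resets("build the app", total_steps=20)
```

Because the fresh agent starts from only the task and the handoff, the context size stays bounded no matter how many steps the run takes.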
2.2 Self-Evaluation Bias | 2.2 自我评估偏差
When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre.
当被要求评估自己产出的工作时,智能体往往自信地肯定该工作——即使对于人类观察者而言,质量明显平庸。
Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue.
将执行工作的智能体与评判工作的智能体分离,被证明是解决这一问题的有力杠杆。
3. Frontend Design: Making Subjective Quality Gradable | 3. 前端设计:让主观质量可评分
3.1 The Four Grading Criteria | 3.1 四项评分标准
1. Design quality(设计质量):
Does the design feel like a coherent whole rather than a collection of parts?
设计是否像一个有机整体而非零散部件的集合?
2. Originality(原创性):
Is there evidence of custom decisions, or is it template layouts, library defaults, and AI-generated patterns?
是否有自定义决策的证据,还是只是模板布局、库默认设置和 AI 生成模式?
3. Craft(工艺):
Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios.
技术执行:排版层次、间距一致性、色彩和谐、对比度。
4. Functionality(功能性):
Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?
独立于美学的可用性。用户能否理解界面的作用,找到主要操作,并在不试错的情况下完成任务?
3.4 The Generator-Evaluator Loop | 3.4 生成器-评估器循环
I built the loop on the Claude Agent SDK… A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique.
我在 Claude Agent SDK 上构建了这个循环…生成器智能体首先根据用户提示创建 HTML/CSS/JS 前端。我为评估器提供了 Playwright MCP,使其能够先直接与运行中的页面交互,再对每项标准打分并撰写详细评论。
I ran 5 to 15 iterations per generation… Full runs stretched up to four hours.
我在每次生成中运行 5 到 15 次迭代…完整运行可长达四个小时。
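A minimal sketch of such a generator-evaluator loop, with both model calls stubbed out; the scoring logic and stopping threshold are assumptions for illustration.
下面是生成器-评估器循环的一个最小示意,两个模型调用均为桩实现;评分逻辑与停止阈值为示意性假设。

```python
# Sketch of a generator-evaluator loop over the four grading criteria.
# Both "agents" are stubs; a real evaluator would drive the live page
# (e.g. via browser automation) before scoring.
CRITERIA = ["design_quality", "originality", "craft", "functionality"]

def generate(prompt, critique=None):
    # Stand-in for the generator agent producing an HTML/CSS/JS frontend.
    return {"html": f"<div>{prompt}</div>", "revision": critique or "initial"}

def evaluate(artifact):
    # Stand-in for the evaluator agent: score each criterion, write a critique.
    scores = {c: min(10, 5 + len(artifact["revision"]) % 6) for c in CRITERIA}
    critique = "tighten spacing; unify palette"
    return scores, critique

def refine(prompt, max_iters=15, target=8):
    artifact = generate(prompt)
    for i in range(max_iters):
        scores, critique = evaluate(artifact)
        if min(scores.values()) >= target:      # all criteria meet the bar
            return artifact, scores, i + 1
        artifact = generate(prompt, critique)   # regenerate against the critique
    return artifact, scores, max_iters
```

The loop caps out at `max_iters` (5 to 15 in the article's runs) and exits early only when every criterion clears the target score.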
3.6 A Notable Creative Leap | 3.6 一个值得注意的创意飞跃
In one notable example… By the ninth iteration, it had produced a clean, dark-themed landing page… Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective…
在一个值得注意的案例中…到第九次迭代时,它为一家虚构博物馆制作了一个简洁的深色主题着陆页…然后,在第十个周期,它完全抛弃了这种方法,将网站重新想象为一种空间体验:一个用 CSS 透视渲染的棋盘格地板 3D 房间…
It was the kind of creative leap that I hadn’t seen before from a single-pass generation.
这是我从未在单次生成中见过的创意飞跃。
4. Scaling to Full-Stack Coding | 4. 扩展到全栈编码
4.2 The Three Agent Personas | 4.2 三种智能体角色
Planner(规划器):
Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec.
我们之前长时间运行的缰绳要求用户提前提供详细规格。我想将这一步自动化,因此创建了一个规划器智能体,它接受简单的 1-4 句提示并将其扩展为完整的产品规格。
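A sketch of the planner step under assumed names; `call_model` and the spec schema are hypothetical, not the harness's real interface.
以下是规划器步骤的示意;`call_model` 与规格 schema 均为假设名称,并非缰绳的真实接口。

```python
# Hypothetical planner: expand a 1-4 sentence prompt into a structured
# product spec that downstream agents can consume.
def call_model(system, user):
    # Stand-in for a real LLM call returning structured output.
    return {"name": "RetroForge", "features": ["level editor", "sprite editor"]}

def plan(prompt: str) -> dict:
    spec = call_model(
        system="Expand the user's idea into a full product spec with "
               "a name, a feature list, and acceptance criteria.",
        user=prompt,
    )
    # Ensure every feature has something the evaluator can later check.
    spec.setdefault(
        "acceptance_criteria",
        [f"{f} works end to end" for f in spec["features"]],
    )
    return spec

spec = plan("Build a web-based retro game maker.")
```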
Generator(生成器):
The one-feature-at-a-time approach… instructing the generator to work in sprints, picking up one feature at a time from the spec.
一次一个功能的方法…指示生成器以冲刺方式工作,每次从规格中挑选一个功能。
Evaluator(评估器):
Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them.
早期缰绳的应用程序通常看起来令人印象深刻,但实际尝试使用时仍有真正的 bug。
4.3 Sprint Contracts | 4.3 冲刺契约
Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what “done” looked like for that chunk of work before any code was written.
在每个冲刺之前,生成器和评估器协商一份冲刺契约:在编写任何代码之前,就该工作块的“完成”定义达成一致。
Communication was handled via files… The generator then built against the agreed-upon contract before handing the work off to QA.
通信通过文件处理…然后生成器根据协商好的契约进行构建,再将工作交给 QA。
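One way such file-based contract negotiation could look, assuming a simple JSON handshake; the file layout is an illustration, not the article's exact format.
文件式契约协商的一种可能形态如下,假设采用简单的 JSON 握手;文件布局为示意,并非文中的确切格式。

```python
# Sketch of a file-based sprint contract: the generator proposes, the
# evaluator tightens the definition of done, both read the same file.
import json
import pathlib
import tempfile

def propose_contract(feature):
    # Generator's opening proposal for what "done" means this sprint.
    return {"feature": feature,
            "done_criteria": [f"{feature}: UI renders",
                              f"{feature}: API persists state"]}

def review_contract(contract):
    # Evaluator adds its own bar before any code is written.
    contract["done_criteria"].append(
        f"{contract['feature']}: passes a hands-on QA check")
    return contract

workdir = pathlib.Path(tempfile.mkdtemp())
contract_file = workdir / "sprint_contract.json"

contract_file.write_text(json.dumps(propose_contract("tile editor")))
agreed = review_contract(json.loads(contract_file.read_text()))
contract_file.write_text(json.dumps(agreed))
```

Writing the agreed contract back to disk gives both agents a shared, durable definition of done that survives context resets.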
5. Running the Harness | 5. 运行缰绳
5.5 Results Comparison | 5.5 运行结果对比
| Harness | Duration | Cost |
|---|---|---|
| Solo(单独运行) | 20 min(20分钟) | $9 |
| Full harness(完整缰绳) | 6 hr(6小时) | $200 |
The harness was over 20x more expensive, but the difference in output quality was immediately apparent.
缰绳成本高出 20 多倍,但产出质量的差异立竿见影。
The solo run's core feature (play mode) simply didn't work. The full-harness application, by contrast, had a richer sprite editor, a play mode that was actually playable, and a working physics engine.
Solo run 产出的应用其核心功能(play mode)根本无法使用;而 Full harness 产出的应用:Sprite editor 更丰富、play mode 可正常游玩、物理引擎也正常工作。
The table below shows several examples of issues our evaluator identified:
下表显示了我们评估器识别的几个问题示例:
| Contract criterion(契约标准) | Evaluator finding(评估器发现) |
|---|---|
| Rectangle fill tool… | FAIL — Tool only places tiles at drag start/end points instead of filling the region… |
| User can select and delete… | FAIL — Delete key handler requires both selection and selectedEntityId… |
| User can reorder animation frames via API | FAIL — PUT /frames/reorder route defined after /{frame_id} routes… |
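The last finding reflects a general first-match routing rule: a literal path declared after a parameterized one is shadowed. The toy router below reproduces the bug and the fix; it stands in for whatever framework the application used.
最后一条发现反映了先到先匹配路由的一般规则:字面路径若声明在参数化路径之后就会被遮蔽。下面的玩具路由器重现了该问题与修复方式;它仅代指应用实际使用的任意框架。

```python
# First-match router: routes are tried in declaration order, so a literal
# path registered after a parameterized one is unreachable.
import re

class Router:
    def __init__(self):
        self.routes = []

    def add(self, pattern, handler):
        # Convert "/frames/{frame_id}" into a regex segment matcher.
        regex = re.sub(r"\{\w+\}", r"[^/]+", pattern)
        self.routes.append((re.compile(f"^{regex}$"), handler))

    def dispatch(self, path):
        for regex, handler in self.routes:
            if regex.match(path):
                return handler
        return None

buggy = Router()
buggy.add("/frames/{frame_id}", "update_frame")  # parameterized route first...
buggy.add("/frames/reorder", "reorder_frames")   # ...shadows the literal route

fixed = Router()
fixed.add("/frames/reorder", "reorder_frames")   # literal route declared first
fixed.add("/frames/{frame_id}", "update_frame")
```

In the buggy ordering, `/frames/reorder` is captured by `/frames/{frame_id}` with `frame_id="reorder"`, which is exactly the class of failure the evaluator flagged.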
6. Iterating on the Harness | 6. 缰绳的迭代优化
6.1 Removing the Sprint Construct | 6.1 移除冲刺结构
I started by removing the sprint construct entirely. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.
我首先完全移除了冲刺结构。鉴于 Opus 4.6 的改进,有充分理由相信模型无需这类任务分解也能原生胜任这项工作。
I kept both the planner and evaluator, as each continued to add obvious value.
我保留了规划器和评估器,因为两者仍在持续带来明显的价值。
Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.
没有规划器,生成器会把范围定得过小:面对原始提示,它会在未先制定规格的情况下直接开始构建,最终做出的应用在功能丰富度上不及有规划器时的版本。
6.2 Results from the Updated Harness | 6.2 更新后缰绳的结果
To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW)…
为了测试更新后的缰绳,我使用以下提示来生成一个数字音频工作站(DAW)…
Build a fully featured DAW in the browser using the Web Audio API.
| Agent & Phase(智能体与阶段) | Duration(时长) | Cost(成本) |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| Total V2 Harness | 3 hr 50 min | $124.70 |
7. What Comes Next | 7. 下一步
From this work, my conviction is that the space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.
从这项工作中,我的信念是:有趣的缰绳组合空间不会随着模型的改进而缩小,而是在不断迁移;对 AI 工程师而言,有趣的工作就是持续寻找下一个新颖的组合。
Acknowledgements | 致谢
Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.
特别感谢 Mike Krieger、Michael Agaby、Justin Young、Jeremy Hadfield、David Hershey、Julius Tarng、Xiaoyi Zhang、Barry Zhang、Orowa Sidker、Michael Tingley、Ibrahim Madha、Martina Long 和 Canyon Robbins 对这项工作的贡献。
Appendix: Example Plan | 附录:规划器生成的计划示例
RetroForge - 2D Retro Game Maker
RetroForge - 2D 复古游戏制作工具
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools.
RetroForge 是一个基于网络的创意工作室,用于设计和构建 2D 复古风格视频游戏。它将经典 8 位和 16 位游戏美学的怀旧魅力与现代、直观的编辑工具相结合。
The platform provides four integrated creative modules: a tile-based Level Editor, a pixel-art Sprite Editor, a visual Entity Behavior system, and an instant Playable Test Mode.
该平台提供四个集成的创意模块:基于瓦片的关卡编辑器、像素艺术精灵编辑器、可视化实体行为系统,以及即时可玩测试模式。