> Agent Harness Engineering — Synthesis

Budding
planted Mar 3, 2026 · tended May 4, 2026

#research #ai #agents #claude-code #harness #engineering

How to make AI coding agents work reliably: the emerging discipline of designing the system around the model.

The core thesis

The competitive advantage in AI-assisted engineering is not the model — it's the harness: the structured system of context, tools, guardrails, and feedback loops engineered around the agent.

"The new hierarchy won't be based on who codes the fastest — it will be based on who can orchestrate uncertainty without losing authority." — Karpathy

This synthesis pulls from five distinct sources at very different scales — from "what one engineer does in one repo" to "what MercadoLibre does across 20,000 developers" — and finds they're all converging on the same primitives.

Sources synthesized

| Source | Scale | Key contribution |
|--------|-------|------------------|
| Karpathy's "behind the new skill tree" | Theory | 4-level skill tree: Conditioning → Authority → Workflows → Compounding |
| Claude Code Setup Hook + Justfile pattern | Single repo | Deterministic scripts + agentic prompts for onboarding |
| Boris Cherny's thread-based framework | Single engineer | Thread taxonomy (P, C, F, B, L, Z) for scaling agent work |
| Julián de Angelis (MercadoLibre) | 20,000 devs | Custom rules, MCPs, skills, SDD, feedback loops at org scale |
| OpenAI Codex team | 3–7 engineers | AGENTS.md as table of contents, docs/ as encyclopedia, garbage-collection agents |

The four levers (Julián / MercadoLibre)

1. Custom rules (CLAUDE.md, AGENTS.md, .cursor/rules)

The most accessible lever. Living documents that get injected into the agent's context.

What belongs:

  • Tech stack, architecture patterns, naming conventions
  • Testing philosophy, common pitfalls, anti-patterns
  • Commands (build, test, lint, deploy)

What doesn't belong:

  • Entire API docs (wastes context)
  • Obvious instructions ("write clean code")
  • Contradictory rules

Best practices:

  • Keep under 500 lines (context rot starts at ~60% window utilization)
  • Make them modular (split by concern)
  • Use few-shot examples over abstract instructions
  • Don't make everything always-on — use conditional loading
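A minimal skeleton that follows these practices might look like the following. The stack, commands, and paths are invented for illustration, not a standard layout:

```markdown
# CLAUDE.md

## Stack
- TypeScript, Node 20, pnpm workspaces

## Commands
- Build: `pnpm build`
- Test: `pnpm test` (run before claiming a task is done)
- Lint: `pnpm lint --fix`

## Conventions
- Services live in `src/services/`, one file per domain concept
- Few-shot over abstract rules:
  - Good: `getUserById(id: UserId)` — branded ID types
  - Bad: `getUser(id: string)`

## Pitfalls
- Never import from `src/legacy/` in new code
```

Note what's absent: no API docs, no "write clean code", nothing the linter already enforces.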

2. MCP servers (Model Context Protocol)

Extend the agent beyond file read/write:

  • Database queries, internal docs, API contracts
  • CI/CD pipeline interaction, design specs
  • Validation and testing of agent output
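In Claude Code, project-scoped MCP servers are declared in a `.mcp.json` at the repo root; the server name and connection string below are placeholders:

```json
{
  "mcpServers": {
    "postgres-readonly": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://localhost/appdb"
      ]
    }
  }
}
```

Checking this file in gives every engineer (and every agent session) the same tool surface.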

3. Skills

On-demand context injection + executable logic. Only a short description stays in context; full content loads when invoked.

Two flavors:

  • Reference skills — inject knowledge (conventions, patterns, domain context)
  • Task skills — step-by-step instructions for specific actions

Can bundle scripts, run in isolated subagents, compose into pipelines.
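A hypothetical task skill, as a sketch: a `SKILL.md` with YAML frontmatter, where only the `description` stays resident in context and the body loads on invocation:

```markdown
---
name: review-migration
description: Checklist for reviewing database migration PRs. Use when a PR touches migrations/.
---

1. Confirm the migration is reversible (has a down step).
2. Check for long-running locks on large tables.
3. Verify the schema change matches the spec in plans/.
```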

4. Spec-driven development (SDD)

The spec becomes the harness — it engineers the entire context window in one shot. Consolidates custom rules, step-by-step guidance, and acceptance criteria into a single artifact.

Use the agent to write the specs too.
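One plausible shape for such a spec artifact (the feature and section layout are invented, not a standard):

```markdown
# Spec: rate-limit the public API

## Context
Applies conventions from CLAUDE.md; see docs/architecture.md for the gateway layout.

## Requirements
- 100 req/min per API key, sliding window
- 429 responses include a Retry-After header

## Plan
1. Add middleware in the gateway
2. Back the counter with Redis
3. Add integration tests for the 429 path

## Acceptance criteria
- [ ] Load test shows enforcement at the limit
- [ ] Existing endpoints unaffected (full test suite green)
```

Rules, guidance, and acceptance criteria in one artifact — the entire context window engineered up front.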

OpenAI's additions

AGENTS.md as table of contents (~100 lines)

Don't make AGENTS.md the encyclopedia. Make it the map that points to deeper sources:

AGENTS.md (100 lines) → docs/architecture.md
                       → docs/patterns/
                       → docs/decisions/

The agent reads the map always, reads the detail on demand.

Mechanical enforcement

Don't just suggest architectural constraints — enforce them:

  • Custom linters that catch violations
  • Structural tests validating dependency layers
  • CI validation preventing architectural decay
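A structural test can be a few lines of stdlib Python. This sketch (the layer names and paths are invented) fails CI whenever a lower layer imports from a higher one:

```python
import ast
from pathlib import Path

# Hypothetical layering rule: domain code must not import from the api layer.
FORBIDDEN = {"src/domain": ("api",)}

def layer_violations(root: str = ".") -> list[str]:
    """Scan each layer's .py files and report imports of banned modules."""
    violations = []
    for layer, banned in FORBIDDEN.items():
        for path in Path(root, layer).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                names = []
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                for name in names:
                    if any(name == b or name.startswith(b + ".") for b in banned):
                        violations.append(f"{path}: imports {name}")
    return violations
```

Wired into CI as `assert not layer_violations()`, the constraint stops being a suggestion in a rules file and becomes a gate the agent cannot talk its way past.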

Garbage-collection agents

Agents that run periodically to find:

  • Stale documentation
  • Violated architectural constraints
  • Inconsistencies between docs and code
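The deterministic half of such a pass doesn't need an agent at all. A sketch (threshold and glob are arbitrary) that flags docs untouched for N days, which an agent can then triage:

```python
import time
from pathlib import Path

def stale_docs(docs_dir: str, max_age_days: int = 90) -> list[str]:
    """Return markdown files whose mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        str(p) for p in Path(docs_dir).rglob("*.md")
        if p.stat().st_mtime < cutoff
    )
```

The script finds candidates cheaply; the agent's job is the judgment call — is this doc stale, or just stable?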

Karpathy's skill tree (4 levels)

Level 1 — Conditioning (steering)

| Skill | What it means |
|-------|---------------|
| Intent specification | Tight problem contracts (purpose, audience, constraints) |
| Context engineering | What goes in / out of context window, ordering, summarization |
| Constraint design | Output formats, schemas, rubrics, tool access, budgets |

Level 2 — Authority (ownership without authorship)

| Skill | What it means |
|-------|---------------|
| Verification design | How does truth enter the loop? (deterministic checks, human review) |
| Provenance | Sources, citations, traceability as first-class objects |
| Permissions | Least privilege, deterministic boundaries, audit trails |

Level 3 — Workflows (scaling intelligence)

| Skill | What it means |
|-------|---------------|
| Pipeline decomposition | Intermediate artifacts, checkpoints, local vs global failures |
| Failure-mode taxonomy | Context missing? Retrieval wrong? Tool fail? Hallucination? |
| Observability | Tool-call traces, inputs, documents retrieved, timing, cost |

Level 4 — Compounding (durable leverage)

| Skill | What it means |
|-------|---------------|
| Evaluation harnesses | Golden sets, regression tests, scorecards, thresholds |
| Feedback loops | Draft → critique → revise → recheck → ship |
| Drift management | Versioning, auditability, treating work as production infrastructure |

Thread types (Boris Cherny framework)

| Thread | Name | Pattern | Best for |
|--------|------|---------|----------|
| Base | Single | Prompt → Work → Review | Simple tasks |
| P | Parallel | N prompts simultaneously | Independent subtasks |
| C | Chained | Work → Review → Continue → Work | High-risk, migrations |
| F | Fusion | Same prompt → N agents → pick best | Prototyping, confidence |
| B | Big | Agent → subagents → combined result | Complex multi-file tasks |
| L | Long | High autonomy, hours duration | Background work |
| Z | Zero touch | No review needed | Maximum trust, fully validated |

"Tool calls roughly equal impact. Increase tool calls to increase output."

Scale by: more threads, longer threads, thicker threads (agents calling agents), fewer checkpoints.

The feedback loop

Tests, linters, type checkers, build scripts — every tool that produces a pass/fail signal becomes a feedback mechanism for self-correction.

Stop hooks are the most powerful mechanism: the agent cannot finish until checks pass. Not a suggestion — an enforced gate.
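Assuming Claude Code's hooks schema in `.claude/settings.json`, a Stop hook might look like the sketch below; the commands are placeholders, and the `|| exit 2` matters because in that scheme exit code 2 blocks the agent from finishing and feeds stderr back to it:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "pnpm test && pnpm lint || exit 2"
          }
        ]
      }
    ]
  }
}
```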

The Ralph Wiggum pattern: Loop an agent with deterministic validation. Agents + code beats agents alone.

Practical implementation — what we built at Fuul

| Lever | Implementation |
|-------|----------------|
| Custom rules | CLAUDE.md per repo (6 repos), AGENTS.md for cross-tool (4 repos), shared template |
| MCPs | Slack, Telegram, Pipedrive, Notion, Linear, Granola |
| Skills | 54 skills across 9 departments + 4 new engineering skills |
| SDD | /feature-spec skill + plans/ workflow in every repo |
| Feedback loop | Validation Loops in CLAUDE.md + PR review template |
| Compounding | Compound learning guide + docs/solutions/ + /incident-to-claude-md |
| Garbage collection | /audit-claude-md, /sync-claude-md (manual trigger) |

Architecture

fuul-agents-workspace (shared brain)
├── .claude/context/engineering/
│   ├── claude-md-template.md       ← shared standard
│   ├── agents-md-standard.md       ← cross-tool guide
│   ├── pr-review-templates/        ← parameterized review
│   └── compound-learning-guide.md  ← learning loop
├── .claude/skills/engineering/
│   ├── audit-claude-md/            ← garbage collection
│   ├── improve-claude-md/          ← auto-improvement
│   ├── sync-claude-md/             ← cross-repo consistency
│   └── incident-to-claude-md/      ← incident → prevention

Each code repo:
├── CLAUDE.md                       ← Claude-specific (conditioning)
├── AGENTS.md                       ← tool-agnostic (cross-tool)
├── docs/solutions/                 ← compound learning
└── .cursor/rules/                  ← Cursor-specific

Key principles

  1. Separate generation from decisioning — the model generates, the workflow / system / human decides.
  2. Context is finite — every token wasted on irrelevant rules is a token not available for code.
  3. Rules are living documents — every agent mistake is a chance to improve the harness.
  4. Mechanical enforcement > suggestions — linters and tests beat anti-pattern lists.
  5. On-demand > always-on — load detailed context only when relevant (skills, docs/).
  6. The harness compounds — incident → learning → prevention → better harness.

Open questions

  • When does AGENTS.md maintenance burden exceed its cross-tool value for small teams?
  • How to implement setup hooks across repos without over-engineering for 2–5 engineers?
  • Should heavy CLAUDE.md patterns (600+ lines) be split into docs/ with CLAUDE.md as map?
  • How to schedule garbage-collection agents (audit/sync) without manual triggers?

Connection points

  • The harness pattern is the foundation agent-orchestrator is built on — every agent it spawns is a CLAUDE.md harness template plus a tool surface plus an eval gate.
  • Pairs with Karpathy Autoresearch — Deep Research Report — the harness gives you reliability for one shot; autoresearch gives you reliability over hundreds of shots.
  • The eval-platforms research is the L4 (compounding) end of this stack: harnesses that measure themselves are what stop the harness from rotting.