> Agent Harness Engineering — Synthesis

Budding
planted Mar 3, 2026 · tended May 4, 2026

#research #ai #agents #claude-code #harness #engineering

How to make AI coding agents work reliably: the emerging discipline of designing the system around the model.

The core thesis

The competitive advantage in AI-assisted engineering is not the model — it's the harness: the structured system of context, tools, guardrails, and feedback loops engineered around the agent.

"The new hierarchy won't be based on who codes the fastest — it will be based on who can orchestrate uncertainty without losing authority." — Karpathy

This synthesis pulls from five distinct sources at very different scales — from "what one engineer does in one repo" to "what MercadoLibre does across 20,000 developers" — and finds they're all converging on the same primitives.

Sources synthesized

| Source | Scale | Key contribution |
|--------|-------|------------------|
| Karpathy's "behind the new skill tree" | Theory | 4-level skill tree: Conditioning → Authority → Workflows → Compounding |
| Claude Code Setup Hook + Justfile pattern | Single repo | Deterministic scripts + agentic prompts for onboarding |
| Boris Cherny's thread-based framework | Single engineer | Thread taxonomy (P, C, F, B, L, Z) for scaling agent work |
| Julián de Angelis (MercadoLibre) | 20,000 devs | Custom rules, MCPs, skills, SDD, feedback loops at org scale |
| OpenAI Codex team | 3–7 engineers | AGENTS.md as table of contents, docs/ as encyclopedia, garbage-collection agents |

The four levers (Julián / MercadoLibre)

1. Custom rules (CLAUDE.md, AGENTS.md, .cursor/rules)

The most accessible lever. Living documents that get injected into the agent's context.

What belongs:

  • Tech stack, architecture patterns, naming conventions
  • Testing philosophy, common pitfalls, anti-patterns
  • Commands (build, test, lint, deploy)

What doesn't belong:

  • Entire API docs (wastes context)
  • Obvious instructions ("write clean code")
  • Contradictory rules

Best practices:

  • Keep under 500 lines (context rot starts at ~60% window utilization)
  • Make them modular (split by concern)
  • Use few-shot examples over abstract instructions
  • Don't make everything always-on — use conditional loading
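A minimal skeleton that follows these practices might look like the following. The stack, commands, and paths are invented for illustration, not a standard layout:

```markdown
# CLAUDE.md

## Stack
- TypeScript, Node 20, pnpm workspaces

## Commands
- Build: `pnpm build`
- Test: `pnpm test` (run before claiming a task is done)
- Lint: `pnpm lint --fix`

## Conventions
- Services live in `src/services/`, one file per domain concept
- Few-shot over abstract rules:
  - Good: `getUserById(id: UserId)` — branded ID types
  - Bad: `getUser(id: string)`

## Pitfalls
- Never import from `src/legacy/` in new code
```

Note what's absent: no API docs, no "write clean code", nothing the linter already enforces.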

2. MCP servers (Model Context Protocol)

Extend the agent beyond file read/write:

  • Database queries, internal docs, API contracts
  • CI/CD pipeline interaction, design specs
  • Validation and testing of agent output
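In Claude Code, project-scoped MCP servers are declared in a `.mcp.json` at the repo root; the server name and connection string below are placeholders:

```json
{
  "mcpServers": {
    "postgres-readonly": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://localhost/appdb"
      ]
    }
  }
}
```

Checking this file in gives every engineer (and every agent session) the same tool surface.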

3. Skills

On-demand context injection + executable logic. Only a short description stays in context; full content loads when invoked.

Two flavors:

  • Reference skills — inject knowledge (conventions, patterns, domain context)
  • Task skills — step-by-step instructions for specific actions

Can bundle scripts, run in isolated subagents, compose into pipelines.
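A hypothetical task skill, as a sketch: a `SKILL.md` with YAML frontmatter, where only the `description` stays resident in context and the body loads on invocation:

```markdown
---
name: review-migration
description: Checklist for reviewing database migration PRs. Use when a PR touches migrations/.
---

1. Confirm the migration is reversible (has a down step).
2. Check for long-running locks on large tables.
3. Verify the schema change matches the spec in plans/.
```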

4. Spec-driven development (SDD)

The spec becomes the harness — it engineers the entire context window in one shot. Consolidates custom rules, step-by-step guidance, and acceptance criteria into a single artifact.

Use the agent to write the specs too.
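One plausible shape for such a spec artifact (the feature and section layout are invented, not a standard):

```markdown
# Spec: rate-limit the public API

## Context
Applies conventions from CLAUDE.md; see docs/architecture.md for the gateway layout.

## Requirements
- 100 req/min per API key, sliding window
- 429 responses include a Retry-After header

## Plan
1. Add middleware in the gateway
2. Back the counter with Redis
3. Add integration tests for the 429 path

## Acceptance criteria
- [ ] Load test shows enforcement at the limit
- [ ] Existing endpoints unaffected (full test suite green)
```

Rules, guidance, and acceptance criteria in one artifact — the entire context window engineered up front.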

OpenAI's additions

AGENTS.md as table of contents (~100 lines)

Don't make AGENTS.md the encyclopedia. Make it the map that points to deeper sources:

AGENTS.md (100 lines) → docs/architecture.md
                       → docs/patterns/
                       → docs/decisions/

The agent reads the map always, reads the detail on demand.

Mechanical enforcement

Don't just suggest architectural constraints — enforce them:

  • Custom linters that catch violations
  • Structural tests validating dependency layers
  • CI validation preventing architectural decay
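A structural test can be a few lines of stdlib Python. This sketch (the layer names and paths are invented) fails CI whenever a lower layer imports from a higher one:

```python
import ast
from pathlib import Path

# Hypothetical layering rule: domain code must not import from the api layer.
FORBIDDEN = {"src/domain": ("api",)}

def layer_violations(root: str = ".") -> list[str]:
    """Scan each layer's .py files and report imports of banned modules."""
    violations = []
    for layer, banned in FORBIDDEN.items():
        for path in Path(root, layer).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                names = []
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                for name in names:
                    if any(name == b or name.startswith(b + ".") for b in banned):
                        violations.append(f"{path}: imports {name}")
    return violations
```

Wired into CI as `assert not layer_violations()`, the constraint stops being a suggestion in a rules file and becomes a gate the agent cannot talk its way past.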

Garbage-collection agents

Agents that run periodically to find:

  • Stale documentation
  • Violated architectural constraints
  • Inconsistencies between docs and code
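The deterministic half of such a pass doesn't need an agent at all. A sketch (threshold and glob are arbitrary) that flags docs untouched for N days, which an agent can then triage:

```python
import time
from pathlib import Path

def stale_docs(docs_dir: str, max_age_days: int = 90) -> list[str]:
    """Return markdown files whose mtime is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        str(p) for p in Path(docs_dir).rglob("*.md")
        if p.stat().st_mtime < cutoff
    )
```

The script finds candidates cheaply; the agent's job is the judgment call — is this doc stale, or just stable?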

Karpathy's skill tree (4 levels)

Level 1 — Conditioning (steering)

| Skill | What it means |
|-------|---------------|
| Intent specification | Tight problem contracts (purpose, audience, constraints) |
| Context engineering | What goes in / out of context window, ordering, summarization |
| Constraint design | Output formats, schemas, rubrics, tool access, budgets |

Level 2 — Authority (ownership without authorship)

| Skill | What it means |
|-------|---------------|
| Verification design | How does truth enter the loop? (deterministic checks, human review) |
| Provenance | Sources, citations, traceability as first-class objects |
| Permissions | Least privilege, deterministic boundaries, audit trails |

Level 3 — Workflows (scaling intelligence)

| Skill | What it means |
|-------|---------------|
| Pipeline decomposition | Intermediate artifacts, checkpoints, local vs global failures |
| Failure-mode taxonomy | Context missing? Retrieval wrong? Tool fail? Hallucination? |
| Observability | Tool-call traces, inputs, documents retrieved, timing, cost |

Level 4 — Compounding (durable leverage)

| Skill | What it means |
|-------|---------------|
| Evaluation harnesses | Golden sets, regression tests, scorecards, thresholds |
| Feedback loops | Draft → critique → revise → recheck → ship |
| Drift management | Versioning, auditability, treating work as production infrastructure |

Thread types (Boris Cherny framework)

| Thread | Name | Pattern | Best for |
|--------|------|---------|----------|
| Base | Single | Prompt → Work → Review | Simple tasks |
| P | Parallel | N prompts simultaneously | Independent subtasks |
| C | Chained | Work → Review → Continue → Work | High-risk, migrations |
| F | Fusion | Same prompt → N agents → pick best | Prototyping, confidence |
| B | Big | Agent → subagents → combined result | Complex multi-file tasks |
| L | Long | High autonomy, hours duration | Background work |
| Z | Zero touch | No review needed | Maximum trust, fully validated |

"Tool calls roughly equal impact. Increase tool calls to increase output."

Scale by: more threads, longer threads, thicker threads (agents calling agents), fewer checkpoints.

The feedback loop

Tests, linters, type checkers, build scripts — every tool that produces a pass/fail signal becomes a feedback mechanism for self-correction.

Stop hooks are the most powerful mechanism: the agent cannot finish until checks pass. Not a suggestion — an enforced gate.
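Assuming Claude Code's hooks schema in `.claude/settings.json`, a Stop hook might look like the sketch below; the commands are placeholders, and the `|| exit 2` matters because in that scheme exit code 2 blocks the agent from finishing and feeds stderr back to it:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "pnpm test && pnpm lint || exit 2"
          }
        ]
      }
    ]
  }
}
```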

The Ralph Wiggum pattern: Loop an agent with deterministic validation. Agents + code beats agents alone.

Practical implementation — what we built at Fuul

| Lever | Implementation |
|-------|----------------|
| Custom rules | CLAUDE.md per repo (6 repos), AGENTS.md for cross-tool (4 repos), shared template |
| MCPs | Slack, Telegram, Pipedrive, Notion, Linear, Granola |
| Skills | 54 skills across 9 departments + 4 new engineering skills |
| SDD | /feature-spec skill + plans/ workflow in every repo |
| Feedback loop | Validation Loops in CLAUDE.md + PR review template |
| Compounding | Compound learning guide + docs/solutions/ + /incident-to-claude-md |
| Garbage collection | /audit-claude-md, /sync-claude-md (manual trigger) |

Architecture

fuul-agents-workspace (shared brain)
├── .claude/context/engineering/
│   ├── claude-md-template.md       ← shared standard
│   ├── agents-md-standard.md       ← cross-tool guide
│   ├── pr-review-templates/        ← parameterized review
│   └── compound-learning-guide.md  ← learning loop
├── .claude/skills/engineering/
│   ├── audit-claude-md/            ← garbage collection
│   ├── improve-claude-md/          ← auto-improvement
│   ├── sync-claude-md/             ← cross-repo consistency
│   └── incident-to-claude-md/      ← incident → prevention

Each code repo:
├── CLAUDE.md                       ← Claude-specific (conditioning)
├── AGENTS.md                       ← tool-agnostic (cross-tool)
├── docs/solutions/                 ← compound learning
└── .cursor/rules/                  ← Cursor-specific

Key principles

  1. Separate generation from decisioning — the model generates, the workflow / system / human decides.
  2. Context is finite — every token wasted on irrelevant rules is a token not available for code.
  3. Rules are living documents — every agent mistake is a chance to improve the harness.
  4. Mechanical enforcement > suggestions — linters and tests beat anti-pattern lists.
  5. On-demand > always-on — load detailed context only when relevant (skills, docs/).
  6. The harness compounds — incident → learning → prevention → better harness.

Open questions

  • When does AGENTS.md maintenance burden exceed its cross-tool value for small teams?
  • How to implement setup hooks across repos without over-engineering for 2–5 engineers?
  • Should heavy CLAUDE.md patterns (600+ lines) be split into docs/ with CLAUDE.md as map?
  • How to schedule garbage-collection agents (audit/sync) without manual triggers?

Connection points

  • The harness pattern is the foundation agent-orchestrator is built on — every agent it spawns is a CLAUDE.md harness template plus a tool surface plus an eval gate.
  • Pairs with Karpathy Autoresearch — Deep Research Report — the harness gives you reliability for one shot; autoresearch gives you reliability over hundreds of shots.
  • The eval-platforms research is the L4 (compounding) end of this stack: harnesses that measure themselves are what stop the harness from rotting.