
Budding
planted Apr 28, 2026 · tended Apr 28, 2026
#ai-agents #evals #llm-observability #agent-quality #research #braintrust #langfuse #langsmith #clickhouse #opentelemetry

Production LLM Eval Platforms – Full Research Report

🌿 Budding note – long-form research synthesis. Read it as a reference, not a blog post.

What this is. A 30k-word synthesis of state-of-the-art across eight interlocking areas of production LLM agent evaluation, generated via research-orchestrator (5 parallel agents × 2 rounds of shared-memory research, ~40 min, judge score 9.8/10).

Why it exists. Kicked off by Phil Hetzel's "Why building eval platforms is hard" (Braintrust, AI Engineer 2026). The talk argues the easy part is the UI; the hard part is the data layer. This report unpacks that claim across the broader ecosystem.

How to read it. Skim the Executive Summary first – every bullet is the load-bearing claim of a downstream section. Then jump to the section you care about. Each section is independently readable.

Caveats. Some quantified vendor benchmarks (e.g., Brainstore 23.9× claim) are explicitly flagged inline as vendor self-reported. A few academic citations use 2026 arXiv IDs that should be verified before propagation. See "Claims kept unverified or directional only" inside.

Related notes: Agent Evaluation and Testing · AI Agents Fundamentals · AI Agents MOC


Synthesis of state-of-the-art across eight interlocking areas of production LLM agent evaluation: the eval ↔ observability flywheel; trace data-layer architecture; failure-mode discovery; headless / agent-driven evals; SME playgrounds; AI-gateway-based tracing; topic modeling and unknown-unknowns; and RBAC / governance.


Executive Summary

  • The five-phase flywheel is settled industry doctrine. Every major vendor (Braintrust, Langfuse, LangSmith, Arize Phoenix, Helicone, Datadog, MLflow) and every named-team case study (Notion, Cursor, Sierra, Decagon, Klarna, Replit, Cognition/Devin, Anthropic, OpenAI) describes the same loop: production traces → online scorers + human review → failure-mode taxonomy → dataset curation → offline evals → ship → monitor. The contested territory is implementation, not shape. Notion explicitly devotes 80% of its 70-engineer AI-org's time to "evaluating from feedback and traces in Braintrust"; Sierra runs daily evals against production conversations; Cursor uses prod traces as the supervision signal for retraining its retrieval model.
  • The trace data layer has converged on a "ClickHouse + S3 + Postgres + Redis" pile (Langfuse v3, LangSmith, Helicone, SigNoz). Events land in S3 first for durability, then are async-written to ClickHouse via ReplicatedReplacingMergeTree(event_ts, is_deleted) partitioned by toYYYYMM(timestamp). Brainstore (Braintrust's bespoke Tantivy-on-S3 store) is the single notable divergence and exists because full-text search across multi-MB JSON spans is a primary product feature there. Hot/warm/cold tiering (CH-SSD 0–7d → S3 8–90d → Glacier/Iceberg 91d+) is the de facto pattern; ClickHouse's published internal compression is 100 PB → 5.6 PB (~18×) with a 15–50× columnar advantage over row stores.
  • OpenTelemetry GenAI semantic conventions are the wire-format lingua franca, but multi-agent / handoff spans are still pre-standard (OTel SemConv issues #1530 and #2664 still open as of April 2026). Four agent frameworks have stabilized incompatible handoff models: OpenAI Agents SDK (handoff_span()), OpenInference (graph.node.{id,parent_id,name}), OWASP AOS (agent.run → turn → step hierarchy with RequestContext), AG2/AutoGen (group-chat speaker-selection spans). Production teams emit OTel GenAI core + OpenInference graph attrs to keep optionality across Datadog, Phoenix, Langfuse, Braintrust, Honeycomb.
  • Failure-mode discovery has bifurcated into two complementary tracks: qualitative coding (Hamel Husain's open-coding → axial-coding → theoretical-saturation method, ~50–100 traces per cycle, NurtureBoss canonical case where 3 categories → 60% of failures, date-handling alone → 66%) and BERTopic-family automatic clustering (embeddings → UMAP → HDBSCAN → c-TF-IDF / LLM labels – used by Braintrust Topics, PostHog LLM analytics, Phoenix, Langfuse, with HDBSCAN's cluster -1 noise treated as a first-class unknown-unknowns surface). The academic underpinnings are Shankar et al.'s EvalGen / criteria drift (UIST 2024) and SPADE (VLDB 2024).
  • Headless / agent-driven evals work today via "read in MCP, write in CLI" (Braintrust's explicit pattern). The published Braintrust customer-support walkthrough – agent runs bt sql for low-factuality traces, edits retrieval config, re-runs bt eval, score 0.3 → 0.9 in one session – is the canonical end-to-end example. MCP elicitation (elicitation/create, added 2025-06-18) is the protocol-level home for "agent proposes change → human approves," with Pinterest the first named-team production deployment. Anthropic's skill-creator runs a claude -p "...autonomously" loop with binary evals as the safety lock.
  • Two-surface convergence is real: every leading platform now ships (a) a coding-agent surface (MCP server / bt-style CLI) and (b) an SME playground with side-by-side prompt diffs, annotation queues, and active-learning calibration loops. Langfuse side-by-side playground (28 Jul 2025) and LangSmith Align Evals (29 Jul 2025) launched in the same week – a strong signal the category is hardening at once. Eugene Yan's recommendation, codified by Align Evals, is to calibrate to inter-human agreement, not a fixed 80% number, and to prefer pairwise / binary outputs over Likert scales (LLM-judges exhibit central-tendency bias, clustering at 6–7 on 1–10).
  • AI gateways and SDK instrumentation are complements, not substitutes. A 2025 survey of 550 IT leaders found >50% of agent-deploying orgs already use a gateway. LiteLLM is the OSS default (100+ providers, OTel callback, Presidio integration with pre_call/post_call/logging_only modes); Portkey is the managed SOC2/HIPAA/GDPR option; Cloudflare AI Gateway is the edge story (OTLP JSON-only – incompatible with Datadog protobuf). Helicone is now de-recommended: Mintlify acquired Helicone on 3 March 2026 and explicitly tells customers to "plan a smooth migration to another platform"; Helicone is in maintenance mode.
  • Governance is the build-vs-buy forcing function. SOC 2 Type II is table stakes; HIPAA BAA, SCIM, audit logs, data masking, and configurable retention are uniformly Enterprise-only across Braintrust, Langfuse, LangSmith, Arize AX. There is no medium-budget compliance path on managed cloud. Cloud BYOK / CMEK does not exist at any mainstream LLM observability vendor as of April 2026 – self-hosted or hybrid (Braintrust's true control-plane / data-plane split) are the only paths. The EU AI Act (Articles 13/19/26, ≥6-month log retention, deployer-interpretable logs, penalties up to 7% of global turnover) elevates trace storage from optimization concern to compliance artifact – and breaks the cheaper Braintrust/Langfuse Pro tiers, whose default retention is 14–30 days.

Detailed Findings

1. The Eval + Observability Flywheel

The canonical five-phase loop (Hamel Husain, Shreya Shankar, Langfuse, LangChain, DoorDash, Datadog, Maxim AI):

  1. Observability – capture every production trace (inputs, outputs, intermediate reasoning, tool calls, retrieval, latency, cost, feedback). Langfuse: "Set this up early; everything else depends on it."
  2. Error analysis – Husain's grounded-theory method: open coding → axial coding → theoretical saturation. Sample ~50–100 diverse traces; free-form annotate the first failure (upstream errors cascade); cluster annotations into a taxonomy; stop when ~20 new traces reveal no new categories.
  3. Targeted scorers – for each failure category, build a deterministic check (regex, schema validation) or an LLM-judge with a single-criterion rubric (see the sketch after this list). Generic "helpfulness" / "hallucination" scorers are an explicit anti-pattern (Husain, Shankar, Langfuse).
  4. Dataset curation – convert problematic traces to regression rows. Trace-to-dataset is "one-click" across Braintrust, LangSmith, Langfuse – without this primitive the loop has unacceptable friction. DoorDash layers a simulation-evaluation flywheel on top: synthetic adversarial traffic generated from production seeds.
  5. Continuous validation – re-run offline evals after every prompt/model/code change; ship; monitor that the category does not recur.
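
A minimal sketch of what step 3 produces – one deterministic check and one single-criterion, binary-output LLM judge. The client setup, prompt wording, and the date-handling example are illustrative assumptions, not any vendor's scorer API:

```python
# One deterministic scorer and one single-criterion LLM-judge scorer.
# Names, prompt, and the date-handling example are illustrative assumptions;
# assumes a configured `openai` client.
import re
from openai import OpenAI

client = OpenAI()

def valid_date_format(output: str) -> bool:
    """Deterministic check for a NurtureBoss-style date-handling category."""
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", output))

JUDGE_PROMPT = """You are grading one criterion only.
Criterion: the answer states the requested date explicitly and unambiguously.
Answer with exactly PASS or FAIL.

Answer to grade:
{output}"""

def date_clarity_judge(output: str, model: str = "gpt-4o-mini") -> bool:
    """Single-criterion LLM judge with a binary output (no Likert scale)."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=output)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```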

Quantified Pareto pattern: Husain's NurtureBoss case – 3 failure categories → 60% of all failures; date-handling alone → 66%. Husain reports spending 60–80% of development time on this cycle.

Online vs offline interaction (the practical model):

  • Offline – deterministic baseline on golden datasets in CI/CD; ground-truth required; blocks regressions before deploy. Block thresholds typically 2–5% accuracy drop or 1σ on a curated metric.
  • Online – async LLM-judge sampling on production; reference-free; drift detection; alerts.
  • Bridge – annotation queues route low-scoring online traces to SMEs; SME labels become offline regression tests AND recalibrate the online judge.

Three-loop nested structure (synthesized across named-team case studies):

  1. Inner loop, sub-second – deterministic guardrails on the user-critical path (regex, classifier, schema, allow-list redaction). This is policy, not eval.
  2. Middle loop, minutes – async LLM-judge over sampled production traces; daily-or-faster aggregation; failure-mode mining; SME annotation queue. Sierra's daily cadence is the lower bound; most teams are weekly.
  3. Outer loop, weeks – regression and frontier evals in CI; A/B test gating; synthetic adversarial generation. Notion's "rebuild every 6 months" is the outermost loop – wholesale architecture refresh.

Named-team production flywheels:

  • Notion (Braintrust + Brainstore, 70-engineer AI-org): explicit two-track split – regression evals and frontier/headroom evals tuned to ~30% pass for headroom signal. 80% of AI-team time is "evaluating from feedback and traces in Braintrust." Rebuilds the agent system every ~6 months. Ships new frontier models within <24 hours. Canonical "needle-in-a-haystack" win was multilingual workspace adherence (APAC users) – not a generic benchmark, a customer-segment scorer.
  • Cursor (LangSmith + homegrown harness): two-loop architecture is now public. Production agent session traces are training data for the retrieval model – fed to an LLM-judge that ranks "what content should have been retrieved at step N"; that ranking is the supervision signal for a custom embedding/retrieval model. Public stance: "Adoption is the ultimate metric" – offline benchmarks for rapid iteration, online A/B for impact.
  • Sierra: ships τ-bench, τ²-bench, and τ³-bench (Apr 2026; expanded to knowledge retrieval and voice). Internal flywheel: "measuring performance daily against production conversations and feeding those signals back." τ-bench's design choice – scoring on goal database state after the conversation, not the conversation itself – removes LLM-judge non-determinism from the benchmark loop.
  • Decagon: two-phase framework – offline (model accuracy, F1, human-annotated preference labels) → online A/B (resolution rate, latency, CSAT). Production has Watchtower (always-on QA on live conversations) and Ask AI (NL analytics across conversation traces). Multi-vendor models (OpenAI, Anthropic, Gemini) plus internal fine-tunes – the eval engine is "judge-of-judges."
  • Klarna: 2.5M conversations handled (first month), 700-FTE-equivalent workload, 80% reduction in resolution time (11 min → 2 min), $40M annual savings claim. LangSmith + LLM-as-judge + prompt iteration. Hidden lesson: optimized resolution speed and cost-per-ticket but had no metric for trust erosion / brand-promise abandonment – eventually re-hired humans. The eval blind spot was not a missing scorer; it was a missing dimension entirely. Canonical case study against single-axis optimization.
  • Replit Agent 3: separate "tester subagent" with its own context window, runs Playwright-injected JS in a sandboxed REPL. Median $0.20 per testing session; up to 200+ minutes of autonomous run-time per task (vs ~20 min on Agent 2). LangSmith for trace storage. Explicitly targets Potemkin-interface failures (UI looks fine, no event handler wired, mocked data, broken backend) – output-only judges can't see these.
  • Cognition / Devin: proprietary cognition-golden benchmark with train/test split. Train split = autonomous self-improvement environment. Devin scores 74.2% held-out. PR merge rate 34% → 67% YoY (2024 → 2025), 4× problem-solving speed, 2× resource efficiency. Explicit rejection of SWE-bench: "too sanitized for end-to-end agent evaluation." Argues for qualitative customer-outcome metrics over uniform leaderboards: "Agents don't fit conventional engineering competency frameworks."
  • OpenAI Platform = trace-grading-as-a-service. First-class Trace Grading: end-to-end record of model + tool + guardrail + handoff calls scored by structured graders. Canonical research-agent grader stack: 3-grader composition – groundedness × coverage × source quality. AgentKit (Oct 2025) wires datasets + trace grading + automated prompt optimization + third-party model support.
  • Anthropic Bloom (alignment.anthropic.com): four-stage automated behavioral-eval pipeline – Understanding → Ideation → Rollout → Judgment. Generates scenarios from a seed (vs fixed-prompt eval suites). Open-sourced. Targets propensities and open-ended behaviors, not task-success.

Frontier-eval calibration as instrument design: Notion deliberately tunes frontier evals to ~30% pass – the information-rich zone, far from ceiling and floor – so each new model release moves the score visibly. Set the bar at 80% pass and frontier models all score 90% (no signal); at 5%, all score 5% (no signal). Frontier-eval calibration is a measurement-design problem, separate from regression-eval design where you want >95% pass.

Cost economics of online LLM-as-judge (April 2026 list prices):

| Judge | Per-eval (5k in / 200 out) | Per 1M evals | Per 100M traces sampled @ 5% |
|---|---|---|---|
| GPT-4o-mini ($0.15/$0.60 per M tokens) | $0.000870 | $870 | $4,350 |
| Claude 3.5 Haiku ($1.00/$5.00 per M tokens) | $0.006 | $6,000 | $30,000 |

GPT-4o-mini is the workhorse online judge in 2025–26 (~7× cheaper than Haiku). Haiku appears for tasks where Claude's instruction-following beats it; Sonnet/Opus only in pre-deployment offline gold-standardizing. Batch APIs cut another 50%. Practitioner heuristic: eval cost should be 1–5% of the underlying LLM bill; 10× was an explicit outlier signaling mis-sized judges or excessive sampling.
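
The per-eval numbers reduce to simple token arithmetic; a quick sanity check of the table above (prices are the quoted April 2026 list prices):

```python
# Back-of-envelope check of the per-eval figures in the table above.
def per_eval_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

gpt4o_mini = per_eval_cost(5_000, 200, 0.15, 0.60)   # ~ $0.00087
haiku_35   = per_eval_cost(5_000, 200, 1.00, 5.00)   # ~ $0.006

print(f"GPT-4o-mini: ${gpt4o_mini:.6f}/eval, ${gpt4o_mini * 1e6:,.0f} per 1M evals")
print(f"Haiku 3.5:   ${haiku_35:.6f}/eval, ${haiku_35 * 1e6:,.0f} per 1M evals")
# 100M traces sampled at 5% = 5M evals
print(f"100M traces @ 5%: ${gpt4o_mini * 5e6:,.0f} vs ${haiku_35 * 5e6:,.0f}")
```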

Latency budget – sync vs async:

  • LLM-as-judge inline ≈ 8 seconds typical – universally cited as "never put on the user-critical path."
  • Critical path (sync, <100 ms): regex, schema check, small classifier (Llama Guard, Galileo Luna-2 sub-200 ms), embedding-based safety classifier.
  • Off-path (async, 1–60 s): LLM-judge sampled at 1–10%, plus 100% on errors / low-score / high-latency tail.
  • Production budget: p95 ≤ 200 ms warning, p99 ≤ 500 ms critical-alert + autoscale. LLM judges fundamentally can't live inside that envelope.

Sampling is converging on a multi-tier strategy:

| Pattern | Rate | Source |
|---|---|---|
| Braintrust online (high-volume) | 1–10% | Braintrust docs |
| Braintrust online (low-volume / critical) | 50–100% | Braintrust docs |
| LLM-judge daily drift-detection default | ~5% | VentureBeat practitioner heuristic |
| Tail-based / error-biased | 100% with errors, low scores, high latency, thumbs-down | Datadog, OpenObserve |
| Stratified by intent/segment | uniform-within-strata | 2025 SIGIR-adjacent paper |

Defensible default: 100% capture into trace store + 5–10% LLM-judge online + 100% on tail + stratified human queues per intent cluster. Notion uses segment-conditional sampling (multilingual, enterprise tagged segments at higher rates); Klarna uses confidence-bucketed routing (>90% proceeds, <90% triggers verification); Sierra effectively regenerates a new offline gold-set daily from production, blurring the offline/online line.
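
A sketch of that default expressed as a sampling policy – tail traces always judged, everything else sampled per segment. The Trace shape and the per-segment rates are illustrative assumptions:

```python
# Multi-tier sampling policy: 100% capture happens upstream; this only decides
# which captured traces get an online LLM-judge pass. Rates are assumptions.
import random
from dataclasses import dataclass

@dataclass
class Trace:
    segment: str          # e.g. intent cluster or customer tier
    error: bool
    score: float | None   # online heuristic score if already computed
    latency_ms: float
    user_thumbs_down: bool

JUDGE_RATE = {"default": 0.05, "multilingual": 0.25, "enterprise": 0.25}

def should_judge(t: Trace) -> bool:
    # Tail-based: always judge errors, low scores, slow traces, thumbs-down.
    if t.error or t.user_thumbs_down or t.latency_ms > 10_000:
        return True
    if t.score is not None and t.score < 0.5:
        return True
    # Stratified: uniform sampling within each segment at its own rate.
    return random.random() < JUDGE_RATE.get(t.segment, JUDGE_RATE["default"])
```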

Anti-patterns called out across sources:

  • Manual annotation forever: doesn't scale. Automate scorers once human↔LLM agreement reaches threshold.
  • Static metrics: Shankar – "metrics must evolve as APIs change and new failure modes emerge."
  • Holistic Likert-scale labeling: too noisy; prefer binary True/False per dimension.
  • Trajectory blindness: scoring only final outputs misses tool-selection bugs, hallucinated policies, mid-trace reasoning errors.
  • Generic ground-truth-first thinking in production: ground truth often doesn't exist; use reference-free LLM judges, reserve ground-truth for offline.
  • Klarna mode (single-axis optimization): hit named metric, miss the dimension that matters. Eval taxonomy needs anti-metrics – categories the team commits not to regress on. None of the surveyed vendors model "anti-metrics" as first-class.
  • Replit mode (output-blindness): judge sees final answer not trajectory; misses Potemkin failures.
  • Notion mode (global average masks segment failures): needle-in-a-haystack failures diluted by average. Fix is segment-conditional scorers.

Pricing tiers (April 2026, public list):

| Vendor | Entry | Mid | Top | Model |
|---|---|---|---|---|
| Braintrust | Free (1M spans) | Pro $249/mo | Enterprise (custom) | Per-org, no per-seat |
| Langfuse Cloud | Hobby free (50k units, 30-day retention, 2 seats) | Core $29/mo; Pro $199/mo | Enterprise from $2,499/mo | Unit-based; combines traces + observations + scores |
| LangSmith | Developer free (5k traces, 1 seat) | Plus $39/seat/mo + per-trace overage | Enterprise (custom; EU instance) | Per-seat – punitive at team scale |

A 5-engineer team on Langfuse Core pays $29/mo total; the same team on LangSmith Plus pays $195/mo before any per-trace overage – a 6.7× per-seat-only multiplier.

Cost breakpoints (build-vs-buy):

  • <1M traces/month: Langfuse Cloud Core ($29) + GPT-4o-mini judge runs the full flywheel for <$200/mo all-in. Don't self-host below this.
  • 1M–10M/month: Pro tiers become reality; judge cost (5% sampling → 50k–500k evals × ~$0.001/eval = $50–500/mo) still under self-host eng cost.
  • 10M–100M/month: judge cost crosses $500–5,000/mo, storage cost meaningful. Band where self-host-with-vendor-eval-surface dominates. Run Langfuse self-hosted on your own ClickHouse, but keep Braintrust/LangSmith for the eval workflow.
  • 100M+/month: vendor egress fees and per-trace overages punitive (LangSmith especially). Self-host both trace store and eval surface.

2. Trace Data Layer Architecture

Workload that breaks "normal" APM (Brainstore engineering data):

  • p95 trace size 500 KB → ~3 MB in months.
  • Individual spans regularly >1 MB; p90 in tens of MB; full traces in tens of GB.
  • Datadog LLM Observability has a hard 1 MB payload cap that silently drops oversize spans (issue #13260) – a concrete example of how traditional APM ingest was built for a different size class.
  • Three structural properties: payload size dwarfs APM (KB → MB+); schema is semi-structured and dynamic ("filter on output.steps[1] = 'router' without registering it as an indexed column"); full-text search across millions of large rows is required, not nice-to-have.

The "ClickHouse + S3 + Postgres + Redis" pile (Langfuse v3, LangSmith, Helicone, SigNoz):

| Component | Role |
|---|---|
| ClickHouse | OLAP traces / observations / scores. ReplicatedReplacingMergeTree(event_ts, is_deleted), partition by toYYYYMM(timestamp), ordered by (project_id, toDate(timestamp), id), ZSTD on input/output, bloom-filter skip indexes. Read-side dedup: ORDER BY event_ts LIMIT 1 BY ... (avoids the cost of FINAL). |
| Postgres | OLTP – users, projects, prompt definitions, settings. |
| S3 | Raw events first (durability), multi-modal attachments, batch exports. Events land in S3 before the DB write so ingest is durable independent of CH. Worker pulls events from S3 → writes to CH. Reference (only) is queued in Redis to keep memory pressure down. |
| Redis/Valkey | Event queue (references, not bodies), caching, prompt cache. |

Why this shape works: Langfuse migrated to it because every row update in ClickHouse is "immensely expensive"; updates are modeled as new inserts with a higher event_ts. p99 of the prompts API improved from 7 s → 100 ms. Helicone migrated from Postgres after dashboard aggregations were taking 30+ s; queries that took 100s now run in 0.5s. They open-sourced pgv2cht for dual-insert migration.
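
A minimal sketch of the insert-as-update plus read-side-dedup pattern using the clickhouse-connect client; the table and column names follow the Langfuse v3 description above but are otherwise assumptions:

```python
# "Updates are new inserts with a higher event_ts"; reads pick the latest
# version per id without paying for FINAL. Table/column names are assumptions.
import time
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")

# An "update" to an observation is just a new row with a later event_ts.
ch.insert(
    "observations",
    [["obs-123", "proj-1", time.time_ns(), 0, '{"status": "completed"}']],
    column_names=["id", "project_id", "event_ts", "is_deleted", "payload"],
)

# Read-side dedup: latest row per id wins.
rows = ch.query("""
    SELECT *
    FROM observations
    WHERE project_id = 'proj-1'
    ORDER BY event_ts DESC
    LIMIT 1 BY id
""").result_rows
```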

Phoenix is the outlier among open-source vendors: SQLite (dev) / Postgres (prod) only – not designed for the 10M-spans/day regime. Spans stored in OpenInference/OTLP format with input.value/output.value as JSON strings. Fine for local/dev, falls over at fleet-wide scale.

ClickHouse schema constraints (Langfuse v3):

"ClickHouse 24.3 (the version Langfuse v3 maintains backwards compatibility with) has no JSON column support, meaning an introduction of a new column with type JSON would require a major release of Langfuse."

So Langfuse stores semi-structured payloads in Map(LowCardinality(String), String) columns, not JSON. ZSTD on payloads gives ~70–80% compression on typical JSON/text payloads. ClickHouse 26.x has GA'd a real JSON type – for new builds in 2026 the JSON-vs-Map decision is a live one.

Hot/warm/cold tiering (becoming standard):

  • Days 0–7 hot in ClickHouse SSD (~17.5 GB compressed @ 10M spans/day).
  • Days 8–90 warm (S3 / Postgres aggregates).
  • Day 91+ Glacier (~$0.02/GB retrieval) or Iceberg cold (~500 GB/yr @ 10M/day).
  • Native CH storage policies + Iceberg cold tier (Altinity Antalya, advertised ~10× cheaper); ClickStack Aug 2025 update added inverted index for observability data.

Cost economics (newly published 2026 numbers):

| Metric | Number |
|---|---|
| ClickStack ingest+retain unit cost | <1 cent / GB for high-cardinality OTel |
| ClickHouse internal observability compression | 100 PB → 5.6 PB (~18×) |
| Columnar vs row compression | 15–50× |
| Glacier retrieval | $0.02 / GB |
| Untiered storage cost at growth | $50K/mo before compute/network |
| Production scale example | Respan: 50M daily events on CH Cloud |

Cost dominator at scale is not compute or networking – it is retention duration × payload size after compression. EU AI Act forces 6-month minimum + spans routinely exceed 1 MB → 3-tier strategy is no longer optional for regulated teams.
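
A rough sketch of that product (days retained × GB/day) across the three tiers; the per-GB prices and compressed span size are order-of-magnitude assumptions, not vendor quotes:

```python
# Rough retention-cost model for the tiering decision above (assumptions only).
SPANS_PER_DAY = 10_000_000
AVG_SPAN_KB_COMPRESSED = 5          # after ZSTD; assumption
GB_PER_DAY = SPANS_PER_DAY * AVG_SPAN_KB_COMPRESSED / 1_048_576   # ~48 GB/day

PRICE_PER_GB_MONTH = {"hot_ssd": 0.20, "warm_s3": 0.023, "cold_glacier": 0.004}
TIER_DAYS = {"hot_ssd": 7, "warm_s3": 83, "cold_glacier": 90}     # 0-7 / 8-90 / 91-180

def monthly_cost():
    # Steady state: each tier holds (days in tier) x (daily volume).
    return {tier: round(GB_PER_DAY * days * PRICE_PER_GB_MONTH[tier], 2)
            for tier, days in TIER_DAYS.items()}

print(round(GB_PER_DAY, 1), monthly_cost())
# The dominant term is days-retained x GB/day - exactly the
# "retention duration x payload size" product called out above.
```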

Span chunking and large-payload offload – dominant pattern is "don't store the big body; store a pointer":

  1. Reference-token rewrite (Langfuse): SDK auto-detects base64 → uploads to S3 via presigned URL → replaces inline with @@@langfuseMedia:type=image/png|id=...|source=base64_data_uri@@@. Files dedup on (project, MIME, SHA-256). Configurable per-request size cap via LANGFUSE_INGESTION_MAX_REQUEST_BODY_SIZE_MB. Separate buckets for events / multi-modal / batch exports. (A sketch of this pattern follows the list.)
  2. Truncation on ingest (Helicone-style): "truncate prompt text at 4K characters and response text at 8K characters" to keep ClickHouse hot tier bounded. Lossy.
  3. External-content references (OTel official): spec recommends "Upload content separately and reference it on spans" for production.
  4. Events instead of attributes for message content: OTel models prompt/completion content as log events rather than span attributes. Events can be dropped at the collector without changing instrumentation, keeping span attribute size bounded. Gated by OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true.
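
A minimal sketch of pattern 1, the reference-token rewrite, under stated assumptions: the bucket name, the inline-size threshold, and using the SHA-256 digest as the media id are illustrative, and the token format mimics the Langfuse one quoted above rather than reproducing the SDK's actual behavior:

```python
# "Don't store the big body; store a pointer" - offload large base64 payloads
# to S3 under a content-addressed key and keep a reference token in the span.
import base64
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "trace-media"          # assumption
MAX_INLINE_BYTES = 256 * 1024   # assumption: offload anything bigger

def offload_if_large(project_id: str, mime: str, data_uri: str) -> str:
    """Replace a base64 data URI with a reference token when it is too large."""
    _header, b64 = data_uri.split(",", 1)
    raw = base64.b64decode(b64)
    if len(raw) <= MAX_INLINE_BYTES:
        return data_uri
    digest = hashlib.sha256(raw).hexdigest()
    # Dedup on (project, MIME, SHA-256), as described above.
    key = f"{project_id}/{mime.replace('/', '_')}/{digest}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=raw, ContentType=mime)
    return f"@@@langfuseMedia:type={mime}|id={digest}|source=base64_data_uri@@@"
```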

Brainstore (Braintrust's bespoke store) is the notable divergence: Tantivy (Lucene-in-Rust) embedded inside segment format on object storage. WAL "roughly one file per request" on S3; compaction folds WAL entries into "an inverted index, row store, columnstore, vectors, and bloom filters" – five index shapes serving five query patterns over the same data. All spans for the same trace are guaranteed to land in the same segment, so the planner can read one segment to reconstruct a trace. Reported: 401 ms full-text search vs ~9.6 s competitor on 3.9M traces (a 23.9× claim, vendor-published vs unnamed baseline); "<50 ms hot, <500 ms cold" target; write 6.98 s vs 17.78 s; span load 346 ms vs 1.29 s; ~100k spans/sec write.

Full-text search at trace scale:

  • ClickHouse-native (Langfuse, Helicone, SigNoz, ClickStack): Bloom-filter skip indexes on metadata keys/values. ClickHouse 26.2 (Aug 2025) GA'd inverted text indexes that work on JSON sub-paths via JSONAllPaths/JSONAllValues. Performance competitive for substring/term search; phrase / fuzzy lag dedicated engines.
  • Tantivy-on-object-storage (Brainstore): vendor numbers above.
  • Hybrid (Laminar): ClickHouse for analytics + Postgres for storage + Qdrant for semantic vector search over trace content.

OTel GenAI vs OpenInference schemas:

  • OTel GenAI (spec v1.37+ in 2026): span name is {gen_ai.operation.name} {gen_ai.request.model}. Operations: chat, embeddings, execute_tool, text_completion. Required attrs: gen_ai.operation.name, gen_ai.provider.name. Common: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. Content goes in log events so it can be dropped/sampled separately. Datadog supports natively as of v1.37.
  • OpenInference (Arize, used by Phoenix and Comet Opik): input.value (JSON string), input.mime_type, output.value, output.mime_type, llm.model_name, llm.invocation_parameters, llm.input_messages.0.message.role (zero-indexed flattened lists), llm.token_count.prompt.

OTel GenAI is winning the standards war; OpenInference still has a meaningful install base via Phoenix and the LangChain ecosystem. Deliberately compatible – OpenInference is "complementary to OpenTelemetry" – so dual-emitting is feasible.

Multi-agent / sub-agent / handoff trace patterns (still pre-standard; OTel SemConv issues #1530, #2664 open as of April 2026):

| Framework | Handoff representation | Conversation grouping |
|---|---|---|
| OpenAI Agents SDK | First-class handoff_span(), child of originating agent_span(). Auto-instrumented. | group_id on outer trace(); sensitive-data toggle via RunConfig.trace_include_sensitive_data or OPENAI_AGENTS_TRACE_INCLUDE_SENSITIVE_DATA. |
| OpenInference (Arize) | Execution-graph attributes on AGENT span – graph.node.id, graph.node.parent_id, graph.node.name. Cleanest "handoff edge" model and framework-agnostic. | session.id, user.id, agent.name. |
| OWASP AOS | Hierarchical agent.run (root) → turn spans (turnId) → step spans (stepId). Step types include steps/agentTrigger and explicit toolCallRequest/toolCallResult pairs. | RequestContext 4-tuple (agent, session, turnId, stepId). |
| AG2 / AutoGen (Microsoft) | OTel-native: every conversation, agent turn, LLM call, tool execution, and group-chat speaker selection is its own span. | Shared trace_id. |

Pragmatic mapping for vendor-portable greenfield: emit OTel GenAI workflow.run at level 0; invoke_agent (INTERNAL) at level 1 with agent.name and gen_ai.conversation.id; gen_ai.execute_tool or chat/text_completion spans at level 2. For handoffs, emit a synthetic handoff span with gen_ai.handoff.from_agent.name / gen_ai.handoff.to_agent.name plus graph.node.parent_id (OpenInference) so Phoenix and Arize render the edge.
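
A sketch of that mapping with the OpenTelemetry Python API – OTel GenAI core attributes plus OpenInference graph.node.* on a synthetic handoff span. The gen_ai.handoff.* names are the pre-standard naming proposed above, not a ratified convention:

```python
# Dual-emit a synthetic handoff span: OTel GenAI core attributes plus the
# OpenInference execution-graph attributes so Phoenix/Arize can render the edge.
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def record_handoff(from_agent: str, to_agent: str, conversation_id: str,
                   from_node_id: str, to_node_id: str) -> None:
    with tracer.start_as_current_span("handoff") as span:
        # OTel GenAI core
        span.set_attribute("gen_ai.operation.name", "invoke_agent")
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        # Pre-standard handoff attributes (see open OTel SemConv issues #1530/#2664)
        span.set_attribute("gen_ai.handoff.from_agent.name", from_agent)
        span.set_attribute("gen_ai.handoff.to_agent.name", to_agent)
        # OpenInference execution graph - encodes the agent-call DAG explicitly
        span.set_attribute("graph.node.id", to_node_id)
        span.set_attribute("graph.node.parent_id", from_node_id)
        span.set_attribute("graph.node.name", to_agent)
```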

OpenInference's graph.node.* is underrated as a cross-framework handoff schema: encodes the agent-call DAG explicitly, separate from span parent/child tree (which encodes call stack); survives across processes and async hops where span context propagation gets lost. For multi-agent systems where the same agent is invoked twice from different parents (common LangGraph/CrewAI pattern), graph.node.parent_id disambiguates lineage in a way span hierarchy alone cannot.

Workflow status is database-authoritative, traces are historical-authoritative (nNode AI synthesis Jan 2026; Temporal/Restate-style durable-execution pattern):

QUEUED → RUNNING → WAITING_FOR_APPROVAL → SUCCEEDED | FAILED | CANCELED | UNKNOWN | STUCK

Stable business identifiers every span/log/metric should carry: tenant.id, workflow.id, workflow.run_id, workflow.step_id, workflow.step_name, workflow.idempotency_key. Operational rules: "DB wins for current state. Trace wins for historical reality." And "budget per step, not per workflow" – agents accumulate cost on retried/failed sub-spans that disappear if you only roll up at the workflow root.

Voice / multimodal trace layer (Arize cookbook + OpenAI Realtime + Anthropic computer-use):

Storage pattern (cross-vendor consensus): buffer PCM16 chunks during turn → on response.done serialize to WAV → upload to GCS/S3 with content-addressed key → record only URL on span; delete local file.
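
A minimal sketch of that turn-finalization path; the bucket name and sample rate are assumptions:

```python
# Buffer PCM16 chunks during the turn, serialize to WAV on response.done,
# upload under a content-addressed key, and keep only the URL for the span.
import hashlib
import io
import wave
import boto3

s3 = boto3.client("s3")
BUCKET = "voice-turns"   # assumption
SAMPLE_RATE = 16_000     # PCM16, 16 kHz mono

def finalize_turn(pcm_chunks: list[bytes]) -> str:
    pcm = b"".join(pcm_chunks)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    key = f"turns/{hashlib.sha256(pcm).hexdigest()}.wav"   # content-addressed
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue(),
                  ContentType="audio/wav")
    return f"s3://{BUCKET}/{key}"   # record this as input.audio.url on the span
```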

Arize audio span attribute schema: input.audio.url (GCS/S3 ref), input.audio.mime_type, input.audio.transcript (auto-derived; enables text evals on voice), output.audio.url, output.audio.mime_type, output.audio.transcript.

OpenAI Agents SDK voice span types: transcription_span(), speech_span(), speech_group_span() (parent for related audio spans). Audio data is base64-PCM by default; disable via VoicePipelineConfig.trace_include_sensitive_audio_data – contradicts OTel GenAI's "privacy-by-default" stance.

Realtime API event-to-span mapping (LiveKit MultimodalAgent + LangSmith):

| Realtime API event | LangSmith run_type | Span semantics |
|---|---|---|
| input_audio_buffer.speech_started/speech_stopped | start/end markers | VAD latency |
| user_speech_committed | prompt | User turn finalized |
| response.audio_transcript.delta … response.done | llm | Generation lifecycle |
| response.audio.delta | (delta) | Time-to-first-byte |
| agent_speech_committed / agent_speech_interrupted | llm | Turn-taking & interruptions |
| function_calls_finished | tool | Tool execution completion |
| metrics_collected | chain | Aggregated turn metrics |

Computer-use telemetry is intentionally client-side only (Anthropic): all screenshots, mouse/keyboard inputs, session files captured/stored in customer environment; Anthropic processes in-flight, doesn't retain. ZDR-eligible. If you want screenshots for failure-mode analysis, you are responsible for offload, retention, and PII redaction over screenshots. Honeycomb's anthropic-usage-receiver pulls usage and cost – not screenshots – confirming the metadata-tier vs content-tier observability split.

Voice payload growth (storage tier stress-test): a typical 30-second user utterance is ~1 MB of raw PCM16 at 16 kHz mono (960,000 bytes). Plus output. Plus interim transcription deltas. A single voice turn easily exceeds 1 MB. A 1M-call/day voice agent → ~2.7 TB/day raw audio.
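
The arithmetic, for reference (the per-day figure depends on an assumed seconds of audio per call, shown as an explicit assumption):

```python
# PCM16 = 2 bytes/sample, mono, 16 kHz.
seconds, sample_rate, bytes_per_sample = 30, 16_000, 2
utterance_bytes = seconds * sample_rate * bytes_per_sample        # 960,000 bytes per 30 s turn

calls_per_day, audio_sec_per_call = 1_000_000, 90                 # audio_sec_per_call is an assumption
tib_per_day = calls_per_day * audio_sec_per_call * sample_rate * bytes_per_sample / 1024**4
print(utterance_bytes, round(tib_per_day, 2))                     # ~2.6 TiB/day under these assumptions
```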

Storage choice – sweet spots:

| Pattern | Sweet spot | Limit |
|---|---|---|
| Pure ClickHouse (Langfuse v3, Helicone, SigNoz) | <30M spans/day, sub-second hot queries, full-text + bloom skip indexes | 90+ day retention starts to hurt; cold tier needs S3 offload |
| ClickHouse hot + Iceberg cold (Altinity Antalya) | High retention (EU AI Act 6-month min), >50M/day, lakehouse-shared analytics | Cross-tier joins slower |
| Pure columnar warehouse (BigQuery, Snowflake, Databricks) | Org already runs the warehouse; analytics > observability | Single-trace lookup too slow for live debugging UI |
| Hybrid Postgres + S3 (SaaS scale-up baseline) | Early-stage, <1M spans/day | Falls over at 10M+; Phoenix is canonical example |
| Tantivy/Rust on S3 (Brainstore) | Full-text dominant workload | Proprietary; lock-in unless Quickwit |
| Postgres + ClickHouse + Qdrant (Laminar) | Vector + structured + scalar simultaneously | Three systems; consistency is a write-side problem |

Greenfield default: start ClickHouse-only, plan for Iceberg cold tier at the 6-month-retention point, defer a vector store until you actually need semantic search over traces (most teams reach for it for failure-mode discovery, not hot-path retrieval – at which point Qdrant or pgvector are both fine).


3. Failure-Mode Discovery from Production Traces

Two complementary tracks, both required:

Track 1: Qualitative coding (human-led, scaffolded by LLMs) – Husain / Shankar / Langfuse canonical workflow:

  1. Gather 50–100 diverse traces from production (or synthetic), prioritizing variety of intents over random sampling. Annotation queues (Langfuse, LangSmith, Phoenix) are the canonical surface.
  2. Open coding: reviewer reads each trace end-to-end, writes (a) binary pass/fail and (b) free-text describing the first point of failure. The "first failure" rule matters because "a single upstream error, like incorrect document retrieval, often causes multiple downstream issues."
  3. Axial coding – sort open-coded notes into named buckets. LLM-assisted; Langfuse prompt: "organize open-ended annotations into coherent failure categories, providing a concise descriptive title and definition for each, only clustering based on issues in the annotations without inventing new failure types." The LLM is constrained to surface what's in the data, not invent it (a sketch of this step follows the list).
  4. Quantification – re-label dataset with structured taxonomy. NurtureBoss canonical: 3 categories → 60% of failures.
  5. Convert categories → scoring functions: each top category becomes either a Python assertion (deterministic, e.g. "valid date format") or an LLM-judge with a single-criterion rubric (subjective). Same scorer runs offline against regression dataset and online against sampled production traffic so pre/post-launch scores are directly comparable.
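
A minimal sketch of the LLM-assisted axial-coding step (step 3), constrained to cluster only what appears in the annotations. The prompt paraphrases the Langfuse wording quoted above; the client setup and JSON output shape are assumptions:

```python
# LLM-assisted axial coding: cluster open-coded annotations into named failure
# categories without inventing new ones. Prompt and output shape are assumptions.
import json
from openai import OpenAI

client = OpenAI()

AXIAL_PROMPT = """You will receive free-text failure annotations from open coding.
Organize them into coherent failure categories. For each category give a concise
title and a one-sentence definition. Only cluster based on issues that appear in
the annotations - do not invent new failure types.
Return JSON: [{"title": ..., "definition": ..., "annotation_ids": [...]}]

Annotations:
{annotations}"""

def axial_code(annotations: dict[str, str], model: str = "gpt-4o-mini") -> list[dict]:
    payload = "\n".join(f"{ann_id}: {note}" for ann_id, note in annotations.items())
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": AXIAL_PROMPT.replace("{annotations}", payload)}],
    )
    # A production version would validate this against the taxonomy schema.
    return json.loads(resp.choices[0].message.content)
```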

Track 2: Automatic clustering (BERTopic-family pipeline):

| Vendor | Embeddings | UMAP | HDBSCAN | Labeling |
|---|---|---|---|---|
| Braintrust Topics | (unstated) | UMAP | HDBSCAN | c-TF-IDF + facets (Task / Issues / Sentiment) |
| PostHog LLM analytics | text-embedding-3-large (3072-D) | 3072→100 (min_dist=0.0) for clustering, 3072→2 (min_dist=0.1) for viz | cluster_selection_method="eom" | LangGraph ReAct agent w/ 8 tools |
| Phoenix | OpenInference span embeddings | 3-D for visual | HDBSCAN | – |
| Langfuse intent cookbook | all-mpnet-base-v2 | – | min cluster size 10 | GPT-4o-mini, sample 50 messages/cluster, snake_case labels (15k+ messages demoed) |
| LangSmith Insights Agent | undisclosed | – | hierarchical (top → second-level → run) | "discovers patterns you didn't know to look for" |

HDBSCAN cluster -1 (noise) is a first-class signal – it's where unknown-unknowns live. PostHog: "edge cases, unusual workflows, or bugs that don't fit any pattern." This is one of the few principled handles teams have on unknown-unknowns from traces.
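
A minimal sketch of the embeddings → UMAP → HDBSCAN pipeline from the table above, keeping cluster -1 as the unknown-unknowns bucket. The embedding model and min cluster size mirror the Langfuse cookbook row; the UMAP parameters are assumptions:

```python
# BERTopic-family clustering over trace text, with the HDBSCAN noise cluster
# (-1) preserved as the unknown-unknowns surface rather than discarded.
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

def cluster_traces(texts: list[str]):
    emb = SentenceTransformer("all-mpnet-base-v2").encode(texts)
    reduced = umap.UMAP(n_components=50, min_dist=0.0, metric="cosine").fit_transform(emb)
    labels = hdbscan.HDBSCAN(min_cluster_size=10,
                             cluster_selection_method="eom").fit_predict(reduced)
    clusters: dict[int, list[str]] = {}
    for text, label in zip(texts, labels):
        clusters.setdefault(int(label), []).append(text)
    unknown_unknowns = clusters.pop(-1, [])   # noise cluster: triage by hand
    return clusters, unknown_unknowns
```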

LangSmith Insights gap analysis (external): clusters are not lifecycle objects – no "active/resolved/regressed" state, no automatic eval generation from a cluster. Categorization is "surprisingly good" but the lifecycle modeling is missing.

Academic foundations:

  • EvalGen / criteria drift (Shankar et al., UIST 2024, arXiv:2404.12272). Key finding from the qualitative study: users who started with a list of evaluation criteria changed those criteria as they graded more outputs, and sometimes returned to revise earlier grades. "It is impossible to completely determine evaluation criteria prior to human judging of LLM outputs." EvalGen operationalizes with three entry points (auto-suggested, manual, grading-induced); for each criterion, synthesizes multiple Python or LLM-judge candidates and selects whichever best agrees with human verdicts. Operational implications: eval rubrics must be versioned alongside the prompts they grade; any "auto-rubric" feature that generates rubrics without human grading first is theoretically suspect.
  • SPADE (Shankar et al., VLDB 2024). Diffs prompt versions, classifies each delta against a refinement-pattern taxonomy learned from LangSmith history, generates candidate boolean assertions, filters via subsumption-aware selection. Reported on ~75-input pipelines: 14% assertion reduction, 21% false-failure reduction. The architectural lesson: failure-mode discovery is not only "look at traces" – the prompt history is a high-signal failure-mode corpus too.
  • Recursive Rubric Decomposition (RRD) (2026): coarse rubrics → fine-grained discriminative criteria → correlation-aware pruning.
  • RIFT (RubrIc Failure mode Taxonomy): argues current rubric quality measurement conflates rubric quality with judge behavior and task formulation.
  • Amazon Nova rubric judge (SageMaker, Apr 2026): auto-generates per-prompt rubrics; not yet evaluated against EvalGen baselines.

Synthetic adversarial generation has split into three layers:

| Layer | Tools | What's generated | Used for |
|---|---|---|---|
| Model-level red-team | Garak, Inspect probes | Static + adaptive probes for known attack classes | Pre-deploy gating: "is the base model safe enough?" |
| Application-level red-team | Promptfoo redteam generate, Inspect custom tasks | Adversarial inputs conditioned on your system's purpose | Pre-deploy: "is my app, with my system prompt and tools, safe?" |
| Behavioral auditing | Anthropic Petri | Auditor-agent-vs-target-agent multi-turn scenarios | Hypothesis-driven safety research, system-card audits |

Promptfoo redteam generate is the most concrete OSS implementation: ~70 plugin categories (Aegis, ToxicChat, UnsafeBench, BeaverTails, HarmBench, CyberSecEval, DoNotAnswer; regulatory frames: NIST AI RMF, OWASP LLM Top 10, MITRE ATLAS) and 17 strategy transformations (encoding: Base64/ROT13/Leetspeak/Homoglyph/Morse/emoji-smuggling/audio/image/video; jailbreak templates: DAN, Skeleton Key, Likert framing; multi-turn: Crescendo/GOAT/Hydra/Mischievous User; gradient-style: GCG, Tree-of-Attacks-with-Pruning). Synthetic test cases are generated by an attacker LLM (default GPT-5) conditioned on a purpose field – the purpose statement is the load-bearing artifact.

Petri (Anthropic, 2025) – canonical "automated auditor agent" pattern. Built on UK AISI's Inspect framework. Auditor agent (Claude) interacts with a target agent across realistic multi-turn scenarios, exploring hypotheses (situational awareness, scheming, self-preservation). Used in the Claude 4 and Claude Sonnet 4.5 System Cards.

Garak is the closest LLM analogue to nmap/Metasploit; full vulnerability scanner. Static + dynamic + adaptive probes across hallucination, data leakage, prompt injection, misinformation, toxicity, jailbreaks. NVIDIA + community maintained. Distinct from Promptfoo: Garak is model-level (does this LLM leak training data?); Promptfoo is application-level (does your app expose vulnerabilities given the system prompt and tools you wired in?). A serious eval program runs both.

DSPy GEPA (ICLR 2026 Oral) is explicitly a "trace reflection" optimizer, not a numeric optimizer. Samples full trajectories – reasoning, tool calls, tool outputs – for a candidate program; uses an LLM to reflect in natural language on what went wrong; proposes a textual prompt mutation; tests it. The Genetic-Pareto piece keeps the Pareto frontier of attempts so complementary lessons can be combined. Reported: +6% avg / +20% peak vs GRPO using up to 35× fewer rollouts; >+10% vs MIPROv2 (e.g. +12% on AIME-2025). The input shape is exactly what eval platforms already store.

MIPROv2's three-stage flow is bootstrapping → grounded proposal → discrete search; stage 1 is literally trace-driven (filters traces to those that appear in highly-scored trajectories). No OSS eval vendor ships a built-in DSPy compiler operator – this is still glue code teams write; DSPy BootstrapFewShotWithRandomSearch + MLflow auto-tracing is the closest off-the-shelf trace→optimize loop.
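
A sketch of that closest off-the-shelf loop, assuming a recent DSPy release and an MLflow version with DSPy autologging; the trainset rows stand in for examples promoted from curated production traces:

```python
# Trace -> optimize glue: DSPy BootstrapFewShotWithRandomSearch with MLflow
# auto-tracing. Everything here is a minimal, assumption-laden example.
import dspy
import mlflow
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

mlflow.dspy.autolog()                                   # traces optimizer rollouts
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

program = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="...", answer="...").with_inputs("question"),
    # ...rows promoted from low-scoring production traces
]

def metric(example, pred, trace=None):
    # Binary, single-criterion check - same shape as the scorers in this section.
    return example.answer.lower() in pred.answer.lower()

optimizer = BootstrapFewShotWithRandomSearch(metric=metric, num_candidate_programs=8)
compiled = optimizer.compile(program, trainset=trainset)
```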

NVIDIA Data Flywheel Blueprint is the most prescriptive end-to-end pipeline from production trace → distilled smaller model:

  • Storage: Elasticsearch 8.12.2 (logs), MongoDB 7.0 (job metadata), Redis 7.2 (queue). Schema requires timestamp, workload_id, client_id, plus OpenAI-format request/response. Stratified split is class-aware.
  • Three experiment types per candidate: Base (raw prompts replayed), ICL (few-shot via semantic similarity or uniform tool distribution), Customized (LoRA-fine-tuned).
  • Numbers: Llama-3.2-1B-Instruct fine-tune ≈ 98% of 70B baseline; one workload Qwen-2.5-32B-Coder matched 70B without fine-tuning. Inference cost cut up to 98.6%. Operating cost: ≥6× H100/A100 GPUs (self-hosted judge), or 2× with remote judge.
  • Hard rule: "Think of the Flywheel as a flashlight, not an autopilot. Promotion to production – as well as any deeper evaluation or dataset curation – remains a human decision." Explicit anti-pattern call against full closed-loop self-modification.

Replay debugging (Phoenix Span Replay = canonical OSS): loads the exact span (inputs, variables, function calls) from production directly into the Prompt Playground, where you can edit prompt/parameters/model and re-run on the same real input. Arize AX "Test prompts on spans" is the commercial equivalent. Braintrust's canonical play is per-row regression colour-coding – you see which slice regressed, not just whether the average dipped.

Canonical cluster → scoring-function workflow:

  1. Cluster all production traces.
  2. Triage clusters by volume × severity. SMEs review exemplars in annotation queue.
  3. Promote each cluster into a labeled Score config / rubric (binary or categorical).
  4. Backfill rubric over historical dataset (LLM-judge or human re-annotation); gold-standardize ~100–200 examples per category.
  5. Calibrate LLM-judge against gold (LangSmith Align Evals: target >85% agreement; mitigate central-tendency bias by using binary or 1–5 scales).
  6. Deploy dual-mode: same scorer offline (regression) + online (sampled production), tracked agreement over time.
  7. Lifecycle: track each failure mode like a bug – open / resolved / regressed. Most vendors don't model this first-class today; teams build it themselves on dataset versions and Score configs.

4. Headless / Agent-Driven Evals

The pattern collapses three previously separate roles – eval engineer, prompt engineer, scorer designer – into one autonomous loop run by a coding agent.

Tight loop (bt eval --watch Braintrust pattern): re-runs the eval whenever code changes; Claude Code edits a prompt → reads the new score JSON → keeps or backs out. Anthropic's skill-creator follows the same shape with binary pass/fail assertions and a claude -p "...run autonomously" loop.

Read in MCP, write in CLI (Braintrust's explicit design):

  • Braintrust MCP: 7 tools – search_docs, resolve_object, list_recent_objects, infer_schema, sql_query, summarize_experiment, generate_permalink. Read-only.
  • Phoenix MCP: broadest writeable surface in OSS – projects, traces, spans, sessions, annotation configs, prompts, datasets, experiments. Prompts and dataset examples writable.
  • Langfuse MCP: 5 prompt-management tools – getPrompt, listPrompts, createTextPrompt, createChatPrompt, updatePromptLabels.
  • Datadog LLM Observability MCP + Pydantic Logfire MCP: observability-side, less prompt-management oriented.

Braintrust's published rationale: "Coding agents like Claude Code can call bt commands directly, which means you get the same reliability whether you or your agent is driving" – the same shell command works for humans and agents; writes have the audit trail of shell history rather than an MCP tool call.

Real production walkthrough (Braintrust customer-support, the cleanest published example): "the agent independently queried failing cases using bt sql to filter for low factuality scores, pulled a specific trace to diagnose the root cause (deprecated documentation in retrieval), updated the retrieval configuration to exclude outdated sources, reran the eval and verified scores jumped from 0.3 to 0.9. You asked one question. The coding agent ran the SQL query, inspected the trace, made the code change, and verified the result, all in one session."

Other implementations:

  • Braintrust + Claude Code plugin (braintrustdata/braintrust-claude-plugin): two commands –
    • trace-claude-code (auto-traces Claude Code sessions into Braintrust hierarchies)
    • braintrust (lets agent query logs, fetch experiment results, log new examples in-terminal). Bidirectional flow is explicit: "Most observability integrations only send data out, but agent development requires moving in both directions."
  • Anthropic skill-creator (v2): internal eval pipeline with 4 parallel sub-agents (executor + critics) and 4 modes (Create, Eval, Improve, Benchmark). Eval results surfaced on each skill's registry page and pinned to versions.
  • Arize/Phoenix Prompt Learning: cookbook for "Optimizing Coding Agent Prompts" using their Prompt Learning optimizer to generate better rule-files for coding agents based on telemetry.
  • DSPy MIPROv2 / GEPA: optimizer-driven equivalent of the same loop.

CI/CD plumbing – concrete pieces:

  • Braintrust eval-action@v1: deliberately narrow surface – api_key, runtime (node/python), root, paths, package_manager, use_proxy (defaults true, sets OPENAI_BASE_URL to the Braintrust proxy so LLM calls cache between PR runs), terminate_on_failure (defaults false). Requires pull-requests: write permission. Does NOT itself ship "fail PR if score < X" – threshold gating is the user's job in a custom step.
  • Promptfoo's three-tier threshold model (most explicit):
    • Per-assertion: type(threshold):value (e.g. similar(0.8):reference text).
    • Per-test (assertion-set): threshold: 0.5 means 50% of weighted assertion scores must pass.
    • GitHub-Action-level: fail-if-percent-below (0–100).
    • Cache pattern uses actions/cache@v4 keyed on ~/.cache/promptfoo – cuts cost and flakiness because identical prompt+input pairs hit the cache.
  • LangSmith reference repos: langchain-samples/cicd-pipeline-example (full deployment pipeline with offline+online eval gates and a Control-Plane API for staging→prod promotion); langchain-samples/evals-cicd (minimal multi-agent example: PR opens → Action runs supervisor/specialist agents over a dataset → LLM-as-judge scores → results posted as artifacts → a second report job queries LangSmith for experiment metrics and posts markdown on the PR).
  • LangSmith pipeline shape: Unit → Integration → E2E → Offline evals → Push to staging → Online evals on live data → quality threshold check → if-fail fan-out (annotation queue + webhook to Slack/PagerDuty + add trace to golden dataset) → if-pass promote to prod. The annotation-queue/golden-dataset hooks are the literal CI fan-outs that close the flywheel.

The CI gate is a 3-tier gate, not one decision:

| Tier | Mechanism | Threshold style |
|---|---|---|
| Per-assertion | Promptfoo type(threshold):value; Braintrust autoeval scorers | Hard pass/fail per row |
| Per-test / per-row | Promptfoo threshold: 0.5 over weighted assertion set | Soft (weighted score ≥ X) |
| Per-experiment | Promptfoo fail-if-percent-below; LangSmith ">=0.85" aggregate; Braintrust delta-vs-base-experiment | Aggregate pass-rate or score delta |

Tier-3 is where flakiness bites. Best-practice is delta-vs-base-experiment with N-run aggregation: run new + baseline 3–5 times each; fail only if delta exceeds 2σ. None of the OSS GitHub Actions ship this out-of-the-box.
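
A sketch of that gate: N runs of baseline and candidate, failing only when the mean delta exceeds a 2σ noise band. run_eval is a stand-in for whatever harness returns one aggregate score per run:

```python
# Tier-3 CI gate with N-run aggregation and a variance band instead of a raw
# delta threshold. run_eval("baseline"/"candidate") is a hypothetical harness hook.
import statistics

def gate(run_eval, n_runs: int = 5, sigmas: float = 2.0) -> bool:
    base = [run_eval("baseline") for _ in range(n_runs)]
    cand = [run_eval("candidate") for _ in range(n_runs)]
    delta = statistics.mean(base) - statistics.mean(cand)   # positive = regression
    # Pool the two run-to-run standard deviations as the flakiness estimate.
    noise = (statistics.stdev(base) ** 2 + statistics.stdev(cand) ** 2) ** 0.5
    regressed = delta > 0 and delta > sigmas * noise
    return not regressed   # True = PR may merge
```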

Eval flakiness – what the gate has to fight:

  • Determinism is not solvable purely with seed=42 and temperature=0. Root cause is non-associativity of FP arithmetic – (a+b)+c ≠ a+(b+c) once you're rounding at finite precision. Same model + same input + different batch size, GPU count, or GPU SKU can move accuracy ~9% and (worst case) response length by 9,000 tokens.
  • Precision format dominates: FP32 ≈ deterministic, FP16 = moderate variance, BF16 (the production default for many hosts) = significant variance.
  • Practitioners report eval flakiness routinely moves single-judge scores ±3–5% run-over-run on the same dataset → a naive ">5% drop fails PR" gate produces false fails ~weekly without aggregation. The gate needs score-delta + variance-band, not raw delta.
  • Mitigations: run LLM-judge N times (3–5) per row, aggregate (median or mean) before threshold; fix FP precision and batch size on the eval runner; use Braintrust / promptfoo proxy caching so retries hit cache.

Data leakage / contamination – operational rules:

  • Build a proprietary test set of 100–500 production-derived examples, hand-labeled by SMEs. By construction it can't be in the model's training corpus.
  • Rotate the test set every ~6 months – not because models leak, but because the team implicitly overfits when the same set is used as the perpetual gate.
  • Detection knobs (DCR, n-gram, member inference) are useful for vendor-benchmark assessment but operationally heavy; the 100–500 set is the practical answer.

Security model:

  • Indirect prompt injection through trace content is the central risk. Trace payloads contain unfiltered user input. Canonical incident – Supabase Cursor (mid-2025): attacker input embedded SQL exfiltration commands; a privileged service-role MCP tool executed them.
  • Tool poisoning / line jumping: malicious tool descriptions in MCP server registrations can hijack the model before any legitimate tool fires (Palo Alto Unit 42, Microsoft DevBlogs).

Mitigations seen in practice:

  • Read-only MCPs (Braintrust default) keep blast radius small; writes go through bt CLI which the human shell vets.
  • MCP gateway pattern (MintMCP, Stytch, Prompt Security MCP gateway).
  • Scoped API keys per project (Phoenix --apiKey, Braintrust regional endpoints).
  • Allowlists in CLAUDE.md: "Which file to modify (only the skill file). Which files to never touch (evals.py, harness.py)" – the agent cannot circumvent tests by editing the rubric.

MCP write-back governance – two converging primitives:

  • MCP elicitation/create was added to the spec 2025-06-18. The server pauses execution and sends a structured request back through the client to the user with a JSON schema. The user can accept, decline, or cancel. Critical security carve-out: "Servers MUST NOT use form mode elicitation to request sensitive information such as passwords, API keys, access tokens, or payment credentials." For those, URL-mode elicitation redirects to an out-of-band browser flow. Pinterest is the first named-team production deployment – "mandates human-in-the-loop approval for sensitive operations."
  • MCP gateway category – what's actually shipping:
    • MintMCP: STDIO-to-managed conversion (wraps any local MCP server with OAuth/SSO + audit + monitoring without code change). OAuth 2.0 + SAML + SSO; tool-level RBAC/ABAC; Virtual MCP Servers that expose only the minimum capabilities per role (a least-privilege primitive). SOC 2 Type II; PII auto-redaction + secrets-leakage scanning.
    • Stytch: opinionated identity layer for MCP servers. OAuth 2.1, Dynamic Client Registration (RFC 7591), Authorization Server Metadata, custom scopes per tool/resource. MCP server hosts /.well-known/oauth-protected-resource returning {resource, authorization_servers, scopes_supported}. manage:* is the convention for write/admin scopes.
    • Prompt Security MCP gateway: reverse proxy redirecting all MCP requests through inspection. Full audit; allow/block by user/server/action; continuous static + dynamic analysis of upstream MCP server codebases producing a per-server "risk score" used for allowlist decisions.
  • Autogenesis Protocol (AGP): research blueprint with two layers:
    • RSPL (Resource Substrate Protocol Layer) – models five resource classes as versioned registered resources: Prompts, Agents, Tools, Environments, Memory.
    • SEPL (Self-Evolution Protocol Layer) – closed-loop operator: propose → assess → commit, with auditable lineage and rollback.
    • RSPL+SEPL is the write schema; MCP elicitation + gateway scopes are the transport.

The emerging agent-write pattern: the agent reads in MCP → drafts a patch → calls the MCP write-tool with manage:prompts scope → the MCP gateway checks scope + policy → the server emits elicitation/create with URL-mode pointing to a PR or annotation queue → a human approves/rejects in a real UI (GitHub PR review or Phoenix annotation queue) → on approve, the write commits with auditable lineage.
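
An illustrative-only sketch of that control flow. The scope check stands in for an MCP gateway policy, and the approval step stands in for a URL-mode elicitation routed to a PR or annotation queue; none of the helpers correspond to a real SDK:

```python
# Hypothetical write-back approval gate - plain Python, no real MCP SDK calls.
from dataclasses import dataclass

@dataclass
class WriteRequest:
    agent_id: str
    scope: str              # e.g. "manage:prompts"
    target: str             # e.g. "prompt:support-triage"
    patch: str              # proposed new prompt text / diff

def open_review(req: WriteRequest) -> str:       # hypothetical: opens a PR / queue item
    return f"https://example.com/review/{req.target}"

def await_human_approval(review_url: str) -> bool:  # hypothetical: blocks until approve/decline
    return False                                     # default-deny in this sketch

def commit_with_lineage(req: WriteRequest) -> None:  # hypothetical: auditable write + rollback point
    pass

def handle_write(req: WriteRequest, granted_scopes: set[str]) -> str:
    if req.scope not in granted_scopes:
        return "denied: missing scope"               # gateway policy check
    review_url = open_review(req)
    if not await_human_approval(review_url):         # human-in-the-loop approval
        return "declined by reviewer"
    commit_with_lineage(req)
    return "committed"
```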

Open governance gap: no mainstream eval platform yet ships built-in policy enforcement on agent write-back end-to-end. The category exists; the integration doesn't.


5. Playground / Sandbox Patterns for SMEs

Shared UX vocabulary across Braintrust, Langfuse, Helicone, Phoenix, LangSmith, Galileo, Maxim:

  • Side-by-side prompt variants: each variant carries own model, parameters, system message, tools, template variables. Run all simultaneously or focus on one.
  • Diff toggle: highlights textual output differences + side-channel deltas (score, latency, tokens, cost). Critical for "PM review workflows – instead of scanning two walls of text, you see exactly what changed."
  • Score column: auto-evals (built-in scorers, custom code, LLM-as-judge) render a numeric score per row. SMEs sort by score and focus on regressions.
  • Annotation form: inline rubric – numeric / categorical / free-text / thumbs reactions, with hotkey support (Phoenix). Annotations attach to trace, span, or experiment-row level.
  • Convert selection → dataset: filter to "spans where the LLM judge said 'good' but the human said 'bad'" → export as calibration dataset with one click.
  • Save → Experiment: playground variant becomes immutable experiment snapshot, runnable against dataset matrix.

Notable launches (two of them in the same week – a signal the category is hardening): Braintrust playground GA 27 May 2025; Langfuse side-by-side playground 28 Jul 2025; LangSmith Align Evals 29 Jul 2025.

Platform-specific notes:

  • Braintrust playground: live editor for tasks, scorers, datasets; side-by-side trace comparisons with diff toggle; annotation features (thumbs/comments) gated to Pro/Enterprise; converts to Experiment with single + Experiment button. Customer reference: Ambience Healthcare's clinical AI team "reduced evaluation time by 50% through instant custom scorer editing and tripled dataset capabilities."
  • Langfuse: side-by-side LLM Playground (28 Jul 2025), each variant with own LLM settings/variables/tools/placeholders, parallel execution, save variant directly to Prompt Management as new version. Native Dataset Runs (Prompt Experiments) plus Experiments-via-UI route to compare prompt versions side-by-side with optional LLM-as-Judge scoring.
  • Helicone Playground / Prompt Experiments: "the only solution that enables testing with real production data" (vendor claim – Braintrust and Langfuse also support this); side-by-side comparison with consistent metrics; uses real production user queries for variant evaluation.
  • Phoenix Prompt Playground: + Compare button to spawn duplicate variants; replays traced LLM calls; experiments auto-recorded; shareable links so non-technical reviewers can view results without UI access.
  • Arize AX / Phoenix Annotations: three annotator types (Human, LLM, Code); APP and API annotation interfaces; hotkey-driven UI; labeling queues for distributing review work; annotated samples export to datasets.
  • LangSmith Annotation Queues: predefined rubric per queue; integrates with Pairwise Review for relative judgments. Inline trace annotation lets SMEs add scores and notes without leaving the trace view.
  • Galileo: explicitly markets to vertical SMEs ("lawyers, doctors, accountants, PMs"); Luna-2 evaluators ship at sub-200ms.
  • Maxim AI: cross-functional from day one; PMs can define, run, and analyze evals via no-code UI.
  • Label Studio / SuperAnnotate: specialist annotation tools; richer rubric/escalation features but disconnected from CI/CD and tracing.

Active-learning loop – LangSmith Align Evals (canonical, 29 Jul 2025), conceptually grounded in Eugene Yan's 2024 Evaluating the Effectiveness of LLM-Evaluators:

  1. Select evaluation criteria.
  2. Select representative data (balanced positive/negative).
  3. SMEs grade golden set per criterion in annotation queue.
  4. Write the LLM-judge prompt → click "Start Alignment" → see the alignment score (% judge-human agreement) → inspect misaligned cases sorted by disagreement size.
  5. Iterate the prompt – add instructions covering observed failures, embed few-shot examples, or simplify to binary output.
  6. Save baseline; lock in alignment scores per evaluator version.

Industry alignment target: ~80% (LangChain: "strong LLM judges reach 80% agreement with human evaluators"). Eugene Yan's framing is more cautious: calibrate to inter-human agreement, not a fixed number. Yan's recommendations:

  • Pairwise > direct scoring for subjective tasks ("more stable results").
  • Binary outputs > Likert scales (LLM-judges exhibit central-tendency bias clustering around 6–7 on 1–10 scales).
  • CoT + n-shot for reliability, but n-shot calibration is brittle (order, count, sequence are unstable knobs).
  • Brittleness warning: every model "performed poorly on some datasets, suggesting that they're not reliable enough to systematically replace human judgments" β€” keep humans in the loop for alignment verification.

Statistical framing: The Alternative Annotator Test for LLM-as-a-Judge (Calderon et al., ACL 2025) β€” small "alt-test" annotation subset is enough to justify whether an LLM annotator can replace humans for a given task. Useful as the formal stopping criterion for an Align-Evals-style loop.

Three patterns for SME labels β†’ automated scorer:

  1. Few-shot calibration: misaligned examples become judge prompt few-shot exemplars. Cheapest, most common, brittle to scale (see the sketch after this list).
  2. Failure-mode-as-rule: patterns in misaligned examples become explicit instructions ("treat answers without unit attribution as incorrect").
  3. Fine-tuned classifier or reward model: higher cost, lower latency. Galileo Luna-2 (sub-200ms) is the productized version. Yan recommends for production guardrails.
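
A sketch of pattern 1 under stated assumptions: misaligned rows from the alignment loop are folded into the judge prompt as few-shot exemplars, with the SME's label as the ground truth. The prompt wording and the row fields (input, output, human_label, human_note) are illustrative, not any vendor's template:

```python
def build_judge_prompt(criterion: str, misaligned: list[dict], max_exemplars: int = 5) -> str:
    """Fold SME-corrected cases into an LLM-judge prompt as few-shot exemplars."""
    exemplars = []
    for row in misaligned[:max_exemplars]:
        note = f" (reviewer note: {row['human_note']})" if row.get("human_note") else ""
        exemplars.append(
            f"Input: {row['input']}\n"
            f"Output: {row['output']}\n"
            f"Correct verdict: {row['human_label']}{note}"
        )
    return (
        f"You are grading responses against this criterion: {criterion}\n"
        "Answer with exactly 'pass' or 'fail' (binary output, no numeric scale).\n\n"
        "Calibration examples graded by domain experts:\n\n"
        + "\n\n".join(exemplars)
        + "\n\nNow grade the next case the same way."
    )
```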

Inter-annotator agreement (IAA) β€” the missing playground primitive:

| Metric | Use case |
|---|---|
| Cohen's κ | Two raters, nominal labels, no missing data |
| Fleiss' κ | N raters (still nominal). Useful when ≥3 reviewers per row to bootstrap a gold set |
| Krippendorff's α | Any number of raters, missing labels, mixed nominal/ordinal/interval/ratio. The metric production annotation queues actually need because reviewers cover overlapping subsets, not the full set |

Critical extension to the LLM-judge case: validating an LLM-judge against humans is just another IAA problem. Same statistical rigor β€” compute ΞΊ or Ξ± between the judge and a panel of humans, not "judge vs gold answer." This reframing makes Yan's "calibrate to inter-human agreement" advice operational: measure inter-human Ξ± first, then judge-to-human Ξ± should match it within noise.
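
A sketch of that reframing, assuming the scikit-learn and krippendorff packages are installed and labels are coded as integers; the toy data is illustrative:

```python
import numpy as np
import krippendorff                       # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Nominal labels coded as integers: 1 = pass, 0 = fail.
sme_a     = [1, 0, 1, 1, 0, 1]
sme_b     = [1, 0, 0, 1, 0, 1]
llm_judge = [1, 0, 1, 1, 1, 1]

# Inter-human baseline: two raters, nominal labels, no missing data -> Cohen's kappa.
human_kappa = cohen_kappa_score(sme_a, sme_b)

# Judge vs the human panel: raters x items matrix; use np.nan where a rater
# skipped an item, which is the case Krippendorff's alpha handles natively.
reliability = np.array([sme_a, sme_b, llm_judge], dtype=float)
panel_alpha = krippendorff.alpha(reliability_data=reliability,
                                 level_of_measurement="nominal")

# Decision rule from the text: judge-to-human agreement should sit within noise
# of the inter-human baseline, not chase a fixed 80% target.
print(f"inter-human kappa: {human_kappa:.2f}  judge+panel alpha: {panel_alpha:.2f}")
```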

Adoption: Label Studio is the OSS exception — ships per-task agreement, project-level Krippendorff's α, and reviewer-disagreement views as first-class UI. Phoenix, Braintrust, LangSmith expose annotations and let you query agreement, but none compute α/κ on the playground page itself. Practical implication for greenfield: if you need IAA reporting on day one, route SME annotations through Label Studio (or similar) and import labels back to your eval platform.

Cross-cutting tensions:

  • MCP write surface vs CLI: read-only MCP + write-only CLI is Braintrust's conservative default. Phoenix and Langfuse expose more writes through MCP β€” more ergonomic for agents but pushes prompt-injection risk to the platform.
  • Playground in obs platform vs stand-alone: Label Studio has richer rubric/escalation features but loses trace-context advantage. Bundled playgrounds have weaker review-ops but tighter loops with experiments and CI.
  • What an SME owns: (a) SMEs own rubric and golden set, engineers own judge prompt (Anthropic framing); (b) SMEs own end-to-end including judge prompt (Maxim, Galileo's pitch). The second is more operationally efficient; the first is safer when SMEs aren't prompt-engineering literate.
  • Agent-driven vs agent-assisted: agent-driven implies autonomous overnight loops editing skills until binary evals pass. Most platforms today are de facto agent-assisted; agent-driven only works when the eval suite is high-quality enough that gaming the rubric isn't cheaper than fixing the code.

6. AI Proxy / Gateway Based Automatic Tracing

The pattern: route all LLM traffic through a centralized proxy/gateway that (a) handles routing/failover/caching, (b) attaches authentication and budget controls, (c) emits OTel-compatible traces with GenAI semantic-convention attributes, (d) propagates trace context (W3C traceparent) so gateway spans nest under upstream agent spans.
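
A sketch of point (d), assuming the OpenTelemetry Python SDK and an HTTP-reachable gateway; the gateway URL and payload are placeholders. inject() writes the W3C traceparent header from the current span context so the gateway's span nests under the agent's:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("agent")
GATEWAY_URL = "https://llm-gateway.internal/v1/chat/completions"  # placeholder

def call_via_gateway(payload: dict) -> dict:
    # Agent-side span: retrieval, templating, tool selection happen under this.
    with tracer.start_as_current_span("agent.llm_call") as span:
        span.set_attribute("gen_ai.request.model", payload.get("model", "unknown"))
        headers = {"Content-Type": "application/json"}
        inject(headers)  # adds the W3C `traceparent` header for the current context
        resp = requests.post(GATEWAY_URL, json=payload, headers=headers, timeout=60)
        resp.raise_for_status()
        return resp.json()
```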

Concrete products and observability story:

| Gateway | Tracing | Compliance | Notable |
|---|---|---|---|
| LiteLLM | callbacks: ["otel"] + OTEL_EXPORTER_OTLP_ENDPOINT; native callbacks to Helicone, Langfuse, Prometheus | OSS; no SOC2/HIPAA out of the box | 100+ providers, gRPC/HTTP OTLP. First-class Presidio integration with pre_call/post_call/during_call/logging_only modes. |
| Portkey | OTel + native dashboard, custom metadata, alerting | SOC2 / HIPAA / GDPR | SDK-first DX; 50+ guardrails. |
| Helicone AI Gateway | Rust-based; OTel + native dashboards | SOC2 / HIPAA / GDPR + SSO + audit | De-recommended for new deployments — Mintlify acquired Helicone 3 March 2026, maintenance mode, customer migration explicitly suggested. |
| Cloudflare AI Gateway | OTLP JSON only (no protobuf — Datadog-incompatible). Trace context via cf-aig-otel-trace-id + cf-aig-otel-parent-span-id headers; arbitrary business metadata via cf-aig-metadata. Tested w/ Honeycomb, Braintrust, Langfuse | GDPR / HIPAA / PCI framing for DLP | Edge-deployed; no infra to run. |
| Braintrust Gateway | Native nested span trees with per-span cost/latency/tokens/errors; one-click prod-trace → eval | SOC2 / HIPAA / GDPR | Tightest eval-loop integration; OpenTelemetry-compatible. |

Where gateways win:

  • Zero-code-change instrumentation across heterogeneous services and languages.
  • Single chokepoint for cost attribution, RBAC, virtual keys, budget caps.
  • Centralized prompt management at runtime (Portkey, Braintrust) β€” prompt changes ship without app deploys; every request is correlated to a prompt version.
  • Governance: SOC2/HIPAA boundary, audit logs, SSO, prompt-injection guardrails applied uniformly.

Where gateways are blind (SDK instrumentation wins):

  • Local tool execution, vector retrieval, prompt templating that happens before the API call.
  • Internal control flow / loops in agentic frameworks. Braintrust: "proxies only see requests sent to APIs and responses received. They're blind to internal reasoning."
  • Reasoning-heavy agentic systems, RAG with non-trivial pre-processing, multi-step tool use β€” SDK spans needed to attribute hallucinations to retrieval failures.

Practical synthesis: production stacks run both. Gateway = governance + cost + guaranteed bottom-floor trace; SDK instrumentation (OpenLLMetry, OpenInference, vendor SDKs) = high-resolution agent-internal spans. W3C traceparent propagation ensures gateway span nests cleanly under agent root span.

OpenTelemetry GenAI semconv — see §2 for full attribute taxonomy. Key points:

  • The OTel GenAI SIG (chartered April 2024, Nir Gazit/Traceloop chairs) is the consolidation point.
  • Spec deliberately punts on large payloads: three options for messages β€” don't capture (default), capture inline (with truncation), or upload externally and store reference. Production guidance is "external storage" because of "high storage costs, regulatory requirements."
  • gen_ai.input.messages / gen_ai.output.messages / gen_ai.system_instructions gated behind OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true.
  • Stability transitions managed via OTEL_SEMCONV_STABILITY_OPT_IN env var.
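
A minimal sketch of the two capture/stability gates above, set before any GenAI instrumentation initializes; the stability value shown is what current Python GenAI instrumentations document, so check your instrumentation's docs rather than treating it as canonical:

```python
import os

# Opt in to capturing prompt/completion content; the spec default is not to capture.
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

# Pin which generation of the GenAI semantic conventions gets emitted while the
# conventions stabilize; accepted values are defined by the instrumentation package.
os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")

# ...initialize the OTel SDK / GenAI instrumentor only after these are set.
```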

OpenLLMetry vs OpenInference vs OTel:

  • OpenLLMetry (Traceloop): pragmatic SDK; predates OTel GenAI conventions; now drives the SIG.
  • OpenInference (Arize): complementary conventions for LLM apps; explicit support across Python/JS/Java; instruments OpenAI, Claude Agent SDK, LangChain, LlamaIndex, DSPy, Bedrock, Anthropic. Includes a span processor that ingests OpenLLMetry spans β€” positioning OpenInference as a normalizer rather than competitor.
  • Both produce OTel-compatible spans that all major backends ingest; vendor lock-in is less load-bearing than 12 months ago.

LiteLLM + Presidio gateway pattern β€” production configuration nuances:

  • Modes: pre_call (mask input before model), post_call (mask model output before user), during_call (rare), logging_only (mask only what's logged, not what's sent β€” not HIPAA-compliant by itself but useful as defense-in-depth when the LLM provider has a BAA and observability vendor doesn't).
  • presidio_filter_scope: input / output / both.
  • Production-critical: litellm.redact_messages_in_exceptions = True. Without this, exceptions can carry PII into Sentry / Datadog error reports β€” common silent leak path.
  • Mask vs Block modes: _OPTIONAL_PresidioPIIMasking can replace tokens with entity type (<PHONE_NUMBER>) or hard-block the request. Block mode is the right default for HIPAA where degraded responses beat a leak.
  • Tag-based policy attachment: requests tagged healthcare pick up healthcare-specific Presidio policies — multi-tenant gateway serves regulated and unregulated traffic without doubling infra.
  • Known bugs (open as of search date): GitHub issues #6247 (Presidio guardrail output parsing) and #8359 ("Presidio guardrail doesn't scrub effective request and response") β€” gateway-side Presidio integration has documented edge cases where masking is not actually applied to the canonical request/response object that gets logged. A team relying on LiteLLM+Presidio for HIPAA needs adversarial tests, not config-as-sufficient.
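
Given those open issues, a hedged sketch of the adversarial test the last bullet calls for: send known PII through the gateway path you actually log, then scan the persisted payload with Presidio directly instead of trusting the guardrail config. fetch_logged_payload is a placeholder for however your stack retrieves what was actually written to the trace store; presidio-analyzer and its spaCy model are assumed installed:

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# Canary inputs that must never appear verbatim in persisted traces.
CANARIES = [
    "My phone number is 555-867-5309",
    "Patient John Smith, MRN 00493827, DOB 04/12/1971",
]

def assert_no_pii_in_logs(fetch_logged_payload):
    """fetch_logged_payload(canary) -> str: placeholder that sends the canary
    through the gateway and returns the payload exactly as it was logged."""
    for canary in CANARIES:
        logged = fetch_logged_payload(canary)
        findings = [f for f in analyzer.analyze(text=logged, language="en")
                    if f.score >= 0.5]
        assert not findings, f"PII leaked into logs for canary {canary!r}: {findings}"
```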

Cloudflare AI Gateway DLP: framing is "GDPR/HIPAA/PCI-shaped"; OTLP JSON only (no protobuf), trace context via custom header cf-aig-otel-trace-id. Pairs cleanly with Honeycomb, Braintrust, Langfuse β€” not Datadog LLM Observability.

A 2025 survey of 550 IT leaders reported >50% of organizations rolling out autonomous agents already rely on an AI gateway (figure is directional β€” vendor blog snippet, underlying survey not fetched).

Governance benefits of the gateway chokepoint: RBAC + virtual keys, prompt management at the gateway (versioned, tagged, approval-gated, with rollback; every request correlated to prompt version), PII handling (redaction policies, opt-in vs opt-out content capture, tenant isolation), audit (who called what model with which prompt at what cost, in one log stream), compliance certifications, consistent telemetry. Strongest non-negotiable argument for the gateway pattern in regulated industries β€” enforced consistency, not just convenience.


7. Topic Modeling and Unknown-Unknowns

(See §3 for full failure-mode discovery treatment; this section focuses on unknown-unknowns specifically.)

Two complementary primitives:

  1. HDBSCAN noise points + outlier embeddings — semantic distance from centroid; surfaced in Phoenix, PostHog, Braintrust. PostHog explicitly treats cluster -1 as a feature ("edge cases, unusual workflows, or bugs that don't fit any pattern"). A minimal clustering sketch appears below.
  2. Topic novelty / weak-signal detection β€” BERTrend (Boutaleb et al. 2024, ACL FuturED workshop). Extends BERTopic to streaming corpora; classifies topics as noise / weak signal / strong signal based on document count and update frequency over a configurable retrospective window. Weak signals that accelerate become candidates for emergent failure categories. Open-sourced as bertrend (RTE-France).

Of the two, BERTrend is the closest published academic primitive for surfacing emergent failure categories engineers had not anticipated.
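
A sketch of primitive 1 under stated assumptions: trace embeddings are already computed as a numpy array, and the hdbscan package is installed. HDBSCAN labels points it cannot assign to any cluster as -1, which is exactly the surface this section cares about:

```python
import numpy as np
import hdbscan  # pip install hdbscan

def surface_outliers(embeddings: np.ndarray, trace_ids: list[str],
                     min_cluster_size: int = 15):
    """Cluster trace embeddings and return (labels, noise trace_ids),
    with noise ranked most-anomalous-first for human review."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(embeddings)
    noise_idx = np.where(labels == -1)[0]
    # outlier_scores_: higher means further from any cluster's density.
    ranked = sorted(noise_idx, key=lambda i: clusterer.outlier_scores_[i], reverse=True)
    return labels, [trace_ids[i] for i in ranked]
```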

Drift detection on user-intent distributions:

  • Embedding drift (Phoenix, Arize): Euclidean distance between baseline and current centroids + 2-sample Kolmogorov-Smirnov test on sampled distances (sketched after this list). Arize blog explicitly recommends KS over PSI/KL/JS for unstructured embeddings (the latter assume structured-feature distributions).
  • Topic-popularity drift (BERTrend, above).
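
A sketch of the centroid-distance + two-sample KS recipe from the first bullet, assuming baseline and current windows of intent embeddings as numpy arrays; it mirrors the published recipe, not any vendor's exact implementation:

```python
import numpy as np
from scipy.stats import ks_2samp

def embedding_drift(baseline: np.ndarray, current: np.ndarray,
                    sample: int = 2000, seed: int = 0):
    """Return (centroid shift, KS statistic, p-value) between two embedding windows."""
    rng = np.random.default_rng(seed)
    centroid = baseline.mean(axis=0)
    centroid_shift = float(np.linalg.norm(current.mean(axis=0) - centroid))

    def sampled_dists(x: np.ndarray) -> np.ndarray:
        idx = rng.choice(len(x), size=min(sample, len(x)), replace=False)
        return np.linalg.norm(x[idx] - centroid, axis=1)

    ks = ks_2samp(sampled_dists(baseline), sampled_dists(current))
    return centroid_shift, ks.statistic, ks.pvalue
```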

Safety benchmark blind spot: "How should AI Safety Benchmarks Benchmark Safety?" (2026, arXiv:2601.23112) found 81% of surveyed safety benchmarks evaluate only predefined known risks β€” explicit argument that benchmarks alone cannot find unknown-unknowns and that trace mining is the only practical surface.

Vendor implementations of automatic failure clustering converge on the BERTopic pattern (see §3 for the table). Notable observations:

  • Three pre-built facets ship in Braintrust Topics: Task, Issues, Sentiment.
  • Key product loop in Braintrust: run Topics over production and over your eval set; failures present in production but missing from eval reveal coverage gaps.
  • LangSmith Insights Agent's specific clustering algorithm is not publicly disclosed; described as hierarchical (top-level β†’ second-level β†’ individual run).

Cluster lifecycle gap: most vendors don't model "cluster as issue" first-class (active / resolved / regressed states). Teams build it themselves on top of dataset versions and Score configs.


8. RBAC, Data Masking, and Governance

Redaction has no single right place β€” mature setups layer it:

| Layer | Pros | Cons | Vendor patterns |
|---|---|---|---|
| Application / SDK pre-attribute | Sensitive data never enters pipeline; cheapest latency | Must be re-implemented per service | Arize AX OPENINFERENCE_HIDE_* env vars; Langfuse SDK mask |
| AI gateway (proxy) | One chokepoint per model call; can mask request or logging-only | Doesn't see in-process agent spans (planning, retries, tool calls) | LiteLLM Presidio; Cloudflare AI Gateway DLP; Portkey 50+ guardrails |
| OTel collector (OTTL transform / redaction processor) | Language-agnostic, central config | Limited to regex; can't use NLP NER | OneUptime example using replace_pattern for emails |
| Eval-platform backend masking callback | Last line of defense; can block ingest | Vendor-specific; only protects that backend | Langfuse LANGFUSE_INGESTION_MASKING_CALLBACK_URL (Enterprise; OTel endpoint only — legacy /api/public/ingestion bypasses it) |

Key technical gotchas:

  • OpenTelemetry GenAI conventions deliberately do NOT capture prompts/completions by default. gen_ai.prompt / gen_ai.completion are opt-in attributes; "instrumentations SHOULD NOT capture by default." This pushes responsibility onto vendors and SDK authors β€” structural reason every eval platform ships its own masking layer.
  • OpenTelemetry's official guidance: prefer an allow-list redaction processor (delete everything not on the allow list) over a block-list; a minimal allow-list sketch follows this list. Hashing reversible identifiers (numeric user IDs, short strings) provides only weak privacy — not a real protection.
  • Presidio recall on non-US PII is poor out of the box: ~30 recognizers, US-centric. NER-based detection improves precision but reduces recall in formal evaluations. Don't treat Presidio as a HIPAA control by itself.
  • Vertical-PII update (April 2026): MedicalNERRecognizer ships in Presidio for clinical entity detection; Azure Health Data Services (AHDS) surrogate anonymization operator generates realistic PHI surrogates rather than redacting (important for downstream eval data quality β€” fully redacted clinical text breaks LLM-judge calibration); 50+ built-in recognizers as of 2025; GPU acceleration and ONNX Runtime support added; IHE De-Identification Implementation Guide based on FHIR R4 4.0.1 is in draft as of February 2026. HIPAA Safe Harbor lists 18 PHI identifiers; Presidio out-of-box handles ~10 reliably; the rest (MRN format varies by EHR, "any other unique identifier", device serials, full-face images, biometrics) need custom recognizers.
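
A sketch of the allow-list mindset from the second bullet above, applied at the application/SDK layer before attributes ever reach an exporter. The attribute names are OTel GenAI ones already cited in this report; the helper itself is illustrative:

```python
# Attributes allowed to leave the process; everything else is dropped, which is the
# allow-list inversion of "scrub the bad stuff later".
ATTRIBUTE_ALLOW_LIST = {
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.conversation.id",
    # deliberately NOT gen_ai.input.messages / gen_ai.output.messages:
    # payloads go to external storage by reference, not onto the span.
}

def redact_attributes(attributes: dict) -> dict:
    """Allow-list redaction: keep only known-safe span attributes."""
    return {k: v for k, v in attributes.items() if k in ATTRIBUTE_ALLOW_LIST}
```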

Tenant isolation patterns:

  1. Logical multi-tenant DB (Langfuse Cloud, LangSmith, Braintrust Cloud): every row carries projectId; RBAC enforces filtering. Pen-tested annually (Langfuse). Acceptable for most; usually insufficient for defense/healthcare without contractual controls.
  2. Region-pinned multi-tenancy (LangSmith EU, Langfuse EU/US/HIPAA-US, Braintrust EU SaaS): same logical model + infrastructure-region commitments. LangSmith explicitly notes you cannot have org-spanning workspaces across regions or migrate between them β€” real friction for multinationals.
  3. Customer-managed data plane (Braintrust hybrid; Langfuse self-hosted; Phoenix self-hosted; LangSmith self-hosted Enterprise; Helicone self-hosted): true infrastructure separation. Braintrust is the outlier on data-plane separation: keeps SaaS control plane (project metadata + Clerk auth), SDKs/UI talk to customer-owned data plane via CORS.

RBAC feature matrix:

| Vendor | Built-in roles | Custom | Scopes | SSO/SAML | SCIM | Audit logs |
|---|---|---|---|---|---|---|
| Langfuse OSS | Owner / Admin / Member / Viewer / None | No | Org + Project (project overrides) | OIDC (cloud); SAML self-hosted Enterprise | Enterprise self-hosted | Enterprise only (~33 resource types, before/after JSON) |
| LangSmith | Built-in + custom | Yes (workspaces, custom roles) | Workspace | Enterprise | Yes | Yes |
| Braintrust | Owners / Engineers / Viewers + custom | Yes | Org / Project / Object (experiment, dataset, prompt) | Okta / Entra / Google | Yes | API key activity logs; trust center |
| Arize AX | Built-in + custom (recent) | Yes (RBAC + SAML role mapping) | Space / Workspace | Okta / Entra w/ SSO enforcement | Yes | Yes |
| Phoenix OSS | Minimal documented | — | — | Not in OSS README | — | — |
| Helicone | Limited | Limited | Org | OSS lacks documented self-hosted auth | No | Frequently cited as weakness |

Compliance certifications and DPAs:

| Vendor | SOC 2 Type II | HIPAA BAA | EU residency | Other |
|---|---|---|---|---|
| Langfuse | Yes | Yes (HIPAA-US zone, BAA on Enterprise) | EU Cloud + EU self-hosted; ISO 27001 | TLS 1.2+, AES-256; no BYOK/CMEK on Cloud |
| LangSmith | Yes | Yes (Enterprise BAA) | EU instance (eu.smith.langchain.com) | — |
| Braintrust | Yes | Yes (BAA) | EU SaaS + BYOC; AES-256 with unique 256-bit keys per secret | Code execution in quarantined VPCs |
| Arize AX | Yes | Yes | Yes; PCI DSS noted | — |
| Helicone | Yes (advertised) | Yes (advertised) | Self-hosted available | RBAC + audit gaps; acquired by Mintlify Mar 2026, maintenance mode |

Subtle gotcha: the LLM provider is itself a sub-processor. A SOC 2 attestation from your eval vendor does not magically include OpenAI / Anthropic / Bedrock. Teams need DPA + sub-processor list + a mechanism to opt out of vendor training (now standard from frontier providers but must be configured per project). A team buying observability for HIPAA needs to verify the entire path is BAA-covered, not just the eval vendor — if the gateway is in the chain before the eval platform receives the trace, the gateway must also be BAA-covered.

BYOK / CMEK matrix (April 2026):

| Vendor | Cloud BYOK / CMEK | Self-hosted BYOK |
|---|---|---|
| Langfuse Cloud | No BYOK (TLS 1.2+, AES-256 at rest) | Yes — your infra, your keys |
| Braintrust Cloud | No explicit BYOK; secrets encrypted with unique 256-bit keys per secret + nonce | Yes via AWS KMS / Azure Key Vault / GCP KMS — hybrid mode is the answer to "we need BYOK" |
| LangSmith Cloud | Not publicly documented in any tier | Self-hosted Enterprise add-on in customer Kubernetes |
| Datadog LLM Obs | CMEK for some products; not confirmed for LLM Obs | n/a |
| Arize AX | Not surfaced in available docs | Self-hosted available |
| Phoenix OSS | n/a | Yes — no HIPAA/SOC 2 cert on Phoenix server itself; self-attest |

Conclusion: BYOK is still self-hosted-only or hybrid-only across the major LLM observability vendors as of April 2026. No mainstream eval-platform SaaS offers true cloud BYOK. For teams where BYOK is required (defense, parts of healthcare, some EU banks), choices collapse to Braintrust hybrid, LangSmith self-hosted Enterprise, Langfuse self-hosted Enterprise, or Phoenix self-hosted with own compliance attestation.

LangSmith multi-geo data residency has a sharp edge:

  • "Customers with multi-geo data residency requirements need multiple separate LangSmith instances, each with its own organization." No cross-region single-tenant view.
  • The official langsmith-data-migration-tool (Python CLI) does not migrate trace data β€” only datasets, experiments, annotation queues, prompts, charts. If your governance team later forces an EU re-host, you lose your historical trace corpus (the asset the eval flywheel runs on). This is a strong argument to start with the gateway-emits-OTel-to-S3-first pattern β€” your raw archive is the only thing that survives a vendor migration.
  • Self-hosted is an "add-on to the Enterprise plan" β€” paying both Enterprise + the self-hosted add-on, which is why third-party sources note $100K+ minimums.

LangSmith → EU AI Act article-by-article mapping (cleanest "interpretable observability is a regulated artifact" framing in any vendor doc):

| EU AI Act article | LangSmith feature |
|---|---|
| Art. 9 (risk management) | Online monitoring, custom evaluators, alert thresholds |
| Art. 10 (data governance) | Bias and fairness evaluators |
| Art. 12 (event logging) | Automatic event logging across system lifetime |
| Art. 13 (transparency / interpretability of logs) | Full execution tracing |
| Art. 14 (human oversight) | LangGraph interrupts + annotation queues |
| Art. 15 (accuracy metrics) | Correctness and adversarial evaluators |
| Art. 72 (post-market monitoring) | Online evaluation + drift detection |

Default trace retention is 14 days base / 400 days extended, then bulk export to customer-controlled archive. The 400-day extended tier exceeds the EU AI Act 6-month minimum, but again retention is a paid feature.

Langfuse pricing β€” what's gated where:

| Tier | Price | Governance |
|---|---|---|
| Hobby | $0 | Nothing for production governance |
| Core | $29/mo | Basic |
| Pro | $199/mo | 3-year retention, SOC 2 + ISO 27001 |
| Enterprise | from $2,499/mo | RBAC org+project, SCIM, audit logs, data masking for PII, configurable retention with auto-delete, HIPAA + BAA, GDPR + DPA, 99.9% SLA, annual pen test |

The gap between $199 and $2,499 is the entire governance stack. There is no middle tier for "we want SCIM and audit logs but not the SLA." Sharper cliff than Braintrust ($249 β†’ custom).

Braintrust pricing:

| Tier | Cost | Processed data | Scores | Retention | Governance |
|---|---|---|---|---|---|
| Starter | $0 | 1 GB then $4/GB | 10k then $2.50/1k | 14 days | 1 human-review score / project |
| Pro | $249/mo | 5 GB then $3/GB | 50k then $1.50/1k | 30 days | Unlimited human-review scores, custom topics, charts |
| Enterprise | custom | custom | custom | custom | BAA (HIPAA), SAML SSO, RBAC, S3 export, on-prem/hybrid, uptime SLA |

Two non-obvious gates: (1) HIPAA BAA, SAML SSO, RBAC, and S3 raw export are all Enterprise-only; (2) retention is contract-defined, not configurable on Starter/Pro, so the EU AI Act's ≥6-month requirement makes Pro alone non-compliant.

EU AI Act β€” the regulatory forcing function:

  • High-risk AI systems must keep automatically generated logs for at least 6 months (Articles 19, 26).
  • Article 13 β€” deployer must be able to interpret the logs (observability is part of the regulated artifact).
  • Penalties up to 7% of global annual turnover (higher than GDPR).
  • Practical effect: aggressive cost-driven sampling now fights compliance β€” trace data layer architecture (cheap cold storage, columnar, span chunking) is a compliance question, not just a cost question.

How "buy" actually breaks down for a regulated team in April 2026:

  1. HIPAA BAA is Enterprise-only at every vendor checked (Braintrust, Langfuse, LangSmith, Arize AX). No "buy on credit card and have HIPAA coverage" path. A team that says "we'll start on Pro and Enterprise-up later" will not have BAA coverage during the Pro window — any PHI in traces during that window is a contractual breach if challenged.
  2. EU residency is a sharper constraint than HIPAA: LangSmith's "multi-region = multiple orgs, no migration" architecture means an EU-customer onboarding event is a re-procurement, not a setting toggle. Braintrust's "EU SaaS + BYOC" and Langfuse's EU Cloud + EU self-hosted are the only paths that don't fragment the org.
  3. Audit-log and SCIM gates: Langfuse audit logs Enterprise-only; SCIM Enterprise across the board. For a team subject to SOC 2 itself, these are not optional β†’ Enterprise on day one or OSS self-hosted.

The "buy the eval workflow, host the data" pattern:

  • Host trace storage yourself (ClickHouse + S3 + Glacier on your own cloud, with your own KMS β€” DIY BYOK).
  • Buy the eval-workflow surface (Braintrust hybrid, Langfuse self-hosted Enterprise, or LangSmith BYOC).
  • Federate: workflow tool reads aggregates from your ClickHouse via SQL/API; doesn't need to own the raw store.

This decouples storage cost economics (favors self-host) and workflow cost economics (favors buy because non-engineer eval UX is genuinely hard).


Key Insights & Implications

Cross-cutting themes β€” where the eight areas reinforce each other

  1. Eval/observability flywheel ↔ trace data layer. You can't run cheap aggregate queries to surface failure modes if your spans live in a row-store optimized for single-trace lookup. Teams converge on dual-tier: hot single-trace lookup + columnar cold (ClickHouse, BigQuery, Iceberg/Parquet exports). Braintrust's hybrid deploy and Langfuse's ClickHouse migration both pull in this direction.
  2. Failure-mode discovery ↔ topic modeling ↔ headless agents. Once traces are queryable in cold storage, you can run BERTopic / LLM-as-judge over them; once those scorers exist, a coding agent reading aggregate eval data via SQL can self-heal the prompts (see the query sketch after this list). The eval platform becomes the integration point.
  3. Gateway-based tracing ↔ governance. A gateway is the cleanest place to enforce PII masking, audit logging, model allow-lists, and tenant routing all at once β€” but only sees the outer call. SDK instrumentation is required for the agent's internal spans. Treating these as competitors is the wrong frame; the right frame is "gateway is the policy/observability boundary, SDK is the introspection layer."
  4. Playground/SME workflows ↔ RBAC. The moment you bring domain experts (clinicians, lawyers) into the eval loop, you need scoped access (they should see traces from their domain only, not raw production traffic), structured score annotation, and an audit trail. Exactly what enterprise RBAC + audit logs are for β€” explains why eval vendors (not gateway vendors) push these features hardest.
  5. Governance ↔ build-vs-buy. PII redaction, BAAs, audit logs, and SCIM are the single largest reason teams move from "we'll build a quick logging table" to a vendor β€” compliance work is qualitatively harder to staff than infra work.
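
A sketch of theme 2's "agent reads aggregates via SQL" step, assuming a ClickHouse trace store reachable via clickhouse-connect; the table and column names are hypothetical, not any vendor's schema:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="clickhouse.internal")  # placeholder host

# Hypothetical schema: spans(trace_id, scorer_name, score, topic, timestamp)
FAILING_TOPICS_SQL = """
SELECT
    topic,
    count()              AS n_traces,
    avg(score)           AS avg_score,
    countIf(score < 0.5) AS n_failing
FROM spans
WHERE scorer_name = 'answer_correctness'
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY topic
HAVING n_failing > 20
ORDER BY n_failing DESC
LIMIT 25
"""

for topic, n, avg_score, n_failing in client.query(FAILING_TOPICS_SQL).result_rows:
    print(f"{topic}: {n_failing}/{n} failing, avg score {avg_score:.2f}")
```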

Where consensus exists

  • The five-phase flywheel (observe → analyze → score → curate → re-evaluate).
  • Trace-to-dataset is mandatory primitive.
  • Annotation queues with full span context, not stripped completions.
  • Async online LLM-judge scoring with sampling (1–10% high-volume; ~100% on errors).
  • OTel GenAI semantic conventions as the lingua franca.
  • Both gateway + SDK instrumentation in serious production stacks.
  • ClickHouse + S3 + Postgres + Redis trace pile (OSS).
  • BERTopic-family clustering pipeline.
  • Open-coding → axial-coding → quantification → scorer human workflow.
  • SOC 2 Type II as table stakes.
  • App-level masking + allow-list mindset > scrub-everything-later.
  • Read-in-MCP, write-in-CLI as the conservative agent-write pattern.
  • Pairwise/binary outputs + inter-human-agreement calibration > Likert + fixed thresholds.

Where opinion still diverges

  • Storage: ClickHouse vs columnar warehouse vs hybrid; Langfuse "standards-first" vs Braintrust proprietary-but-OTel-accepting layer (Brainstore).
  • Sampling rate: 1% vs 5% vs 10% on online evals; right answer depends on traffic volume and judge cost.
  • Trajectory vs output evaluation: LangChain pushes hard on trajectory; many teams still over-rely on output-only.
  • Generic vs domain-specific scorers: Husain/Shankar/Langfuse warn against generic; vendors keep shipping generic ones.
  • Auto-rubric induction (Amazon Nova) vs human-aligned EvalGen-style.
  • Cluster lifecycle modeling (issue-tracker semantics with active/resolved/regressed) — most vendors don't model this; teams want it.
  • Embedding model: text-embedding-3-large (PostHog) vs all-mpnet-base-v2 (Langfuse).
  • MCP write surface: read-only (Braintrust) vs write-capable (Phoenix, Langfuse).
  • Whether gateway or eval platform owns guardrails / PII masking.
  • Self-hosted vs hybrid (Braintrust-style) for regulated industries.
  • Whether to capture full prompts/completions in production by default β€” OTel says no; vendors default to "yes, but masked" because the eval flywheel is useless without payloads.
  • BAA / DPA scope: if the gateway is in the chain before the eval platform, the gateway must also be BAA-covered (most LiteLLM deployments are self-hosted; Portkey ships a BAA; Cloudflare AI Gateway is HIPAA-eligible but the workflow is non-trivial).

Greenfield team β€” concrete recipe

Build (the integration shape, not the infra):

  • Adopt OTel GenAI semconv from day one (vendor-lock-in insurance).
  • Thin AI gateway in front of every model call. LiteLLM is OSS default; Portkey if you need managed SOC2/HIPAA; Cloudflare AI Gateway if edge; do NOT pick Helicone for new work.
  • App-level allow-list redaction.

Adopt OSS:

  • Langfuse self-hosted if you need RBAC, audit logs (Enterprise), SAML/SCIM, and a mature trace+prompt+eval UI in one place.
  • Arize Phoenix if you want Apache/OTel-native, MIT-friendly licensing for the SDK side, and you're fine running enterprise governance off-platform (note: Phoenix server is ELv2-licensed, not OSI-open).

Buy when forced:

  • Need a managed eval workflow for non-engineers (Braintrust, LangSmith are the most polished).
  • Need HIPAA BAA / EU residency contractually (Braintrust hybrid is cleanest "SaaS UX, customer data plane"; LangSmith EU instance simplest if no BYOC needed).
  • Past ~10M traces/month and self-hosted CH+search infra cost dominates eng cost of vendor migration.

Don't build:

  • Trace UI (vendors are years ahead).
  • Eval orchestration / prompt versioning surface.
  • PII detection from scratch (use Presidio + custom recognizers).
  • HIPAA-compliant infra from scratch (audit work outweighs vendor cost).

Sequencing for 0-to-1:

  • Wk 1–2: gateway + OTel GenAI emission + raw logs to S3.
  • Wk 3–4: stand up Langfuse self-hosted (or Phoenix), wire OTel, get UI live.
  • Mo 2–3: add Presidio masking at gateway + allow-list redactor in SDK; pick eval taxonomy (rubric-based LLM judge + a few deterministic scorers).
  • Mo 3–4: only now evaluate buying β€” by now you'll know whether bottleneck is "non-engineer eval workflow" (buy) or "trace volume + governance certs" (still self-host w/ vendor for eval surface).
  • Mo 6+: failure-mode discovery (BERTopic over failed traces; LLM-judge over clusters); headless coding-agent loop reading aggregates via SQL/API.

Single biggest mistake: deferring governance β€” building the flywheel first, bolting PII redaction / RBAC / audit logs on after accumulating unredacted production traces. Order should be inverted: get the envelope (gateway + OTel + masking + allow-list) right before accumulating data, then iterate the eval surface on top.

Anti-patterns called out across sources

  • Manual annotation forever (doesn't scale).
  • Static metrics β€” Shankar: "must evolve as APIs change and new failure modes emerge."
  • Holistic Likert-scale labeling (too noisy; prefer binary).
  • Trajectory blindness (output-only judges miss tool-selection bugs, hallucinated policies, mid-trace reasoning errors, Potemkin failures).
  • Generic ground-truth-first thinking in production (use reference-free judges; reserve ground-truth for offline).
  • Sampling traces uniformly at random (misses tail failures; stratify by intent).
  • Letting an LLM invent failure categories (constrain it to summarize what humans annotated).
  • High-precision (1–10) numeric judges (central-tendency bias).
  • One-shot rubric design (criteria drift).
  • Single-axis optimization (Klarna mode β€” hit the named metric, miss the dimension that matters).
  • Global-average masking of segment failures (Notion mode β€” needle-in-a-haystack failures get diluted).
  • Treating Presidio as a HIPAA control by itself (US-centric recall, ~10/18 HIPAA identifiers reliably).
  • Hashing reversible identifiers as a "privacy" measure (OTel: weak protection).
  • Naive ">5% drop fails PR" gate without N-run aggregation (eval flakiness routinely ±3–5%).

Contradictions & Open Questions

Contradictions found

  • Brainstore framing vs reality: Braintrust frames ClickHouse and Postgres as inadequate for AI workloads ("complex queries could return incorrect results"), while Langfuse, LangSmith, Helicone, and SigNoz have all built production-scale systems on ClickHouse and explicitly say it works. The truth is product-shape-dependent: if multi-MB JSON full-text search is the primary product feature, ClickHouse text indexes are newer (GA Aug 2025) and Brainstore's specialized stack outruns them; if not, ClickHouse is more than enough.
  • OTel "external storage" recommendation vs Datadog 1 MB silent-drop: spec recommends external storage for messages, but Datadog's native GenAI ingest has a 1 MB payload cap that silently drops oversize spans (issue #13260). Even a vendor following the standard has practical limits the spec doesn't enforce. Teams instrumenting via Datadog must independently chunk/offload large payloads.
  • "Generic scorers are an anti-pattern" vs vendors shipping pre-built generic scorers: not really a contradiction β€” vendors ship them as a starting baseline; the anti-pattern is stopping there.
  • Anthropic's eval guide says "no self-improvement loops" vs Anthropic ships skill-creator with autonomous improvement loops. Reconciliation: Anthropic distinguishes agent-driven optimization of skills (acceptable; binary evals lock the loop) from agent self-modification of production behavior (not endorsed without human gates).
  • Read-only-MCP vs write-capable-MCP: Braintrust advocates for read-only; Phoenix and Langfuse expose more writes. Both ship today; right answer likely depends on the gateway / policy enforcement layer the org has independently.
  • OpenAI Agents SDK voice ships sensitive-audio-by-default (VoicePipelineConfig.trace_include_sensitive_audio_data defaults to including sensitive audio) vs OTel GenAI's privacy-by-default stance. The Agents SDK contradicts the upstream OTel guidance for voice.
  • LangSmith Insights blog: clustering is automatic and accurate; outside analyses: clusters aren't lifecycle objects and don't auto-generate evals — both can be true (good clustering ≠ good lifecycle modeling).
  • Some vendor docs (Arize "LLM as a Judge"): advise high-precision numeric scoring; broader literature (Husain, evidently.ai, Yan): advise binary or 1–5 due to central-tendency bias. Treat numeric scales as a known-anti-pattern when calibration is weak.
  • Helicone "PII detection": createaiagent.net summary says yes; truefoundry.com / DEV.to comparisons say "lacks advanced PII redaction." Likely reconciled by: Helicone has some PII features but they're not Portkey-grade.

Open / unresolved questions

  • Vercel v0, Lindy.ai, Perplexity, Linear AI β€” no public flywheel descriptions; likely private.
  • OpenAI ChatGPT internal eval pipeline (vs the platform-side Trace Grading product) β€” unpublished.
  • Anthropic Claude.ai consumer-product eval pipeline β€” unpublished beyond Bloom (alignment-eval) and skill-creator (developer-tool).
  • Sub-100ms LLM-judge as a viable critical-path pattern — Galileo Luna-2 sub-200ms exists, but evidence of fully-online sync-judge deployments at p95 ≤ 100ms is absent.
  • Concrete eval-cost-as-percent-of-LLM-bill data points β€” the 1–5% heuristic is plausible but supported only by guidance posts; Klarna, Notion, Sierra, Decagon don't publish the ratio.
  • OpenTelemetry SemConv issue #1530 resolution timeline for multi-agent / handoff spans β€” discussion active, no published target.
  • A2A (Agent-to-Agent) protocol trace propagation β€” Google's A2A spec is recent; trace-context propagation across A2A boundaries undocumented. Likely the next OTel GenAI battleground.
  • gen_ai.conversation.id usage in true multi-agent handoffs where conversation ID needs to flow through tool boundaries that don't carry trace context.
  • Span chunking thresholds β€” at what payload size do Langfuse / Braintrust / Helicone offload vs inline? Datadog's 1 MB ceiling is the only published number.
  • Concrete cost numbers for voice trace storage β€” PCM-WAV at 16 kHz mono = ~32 KB/s; a 1M-call/day voice agent generates ~2.7 TB/day raw audio. Object-store cost is published per byte, but no vendor-published "voice trace TCO."
  • Whether NVIDIA's flywheel works for agents with tool-use, or only for instruction-following classifiers. Blueprint examples lean toward classification/tool-routing; whether a fine-tuned 1B can sustain a 10-step reasoning chain is not addressed.
  • DSPy compiler operators β€” none documented in any vendor; either exists internally and not surfaced, or teams genuinely write the glue.
  • Whether vendors compute Cohen's κ / Krippendorff's α in their UIs — Label Studio is the OSS exception; Braintrust/LangSmith/Phoenix expose annotations but not built-in agreement metrics.
  • Real-world MCP elicitation prompt-write-back walkthroughs β€” Pinterest is named but no specific eval-prompt-edit story confirmed.
  • AGP (Autogenesis Protocol) production deployments β€” paper cites the protocol; no production case study running RSPL/SEPL end-to-end.
  • Datadog LLM Observability specific BAA inclusion β€” HIPAA-eligible-services list not on the public page.
  • Non-US PII custom recognizers (German tax ID, French INSEE, IBAN, UK NI) — a community-maintained EU-PII recognizer pack with the depth of Presidio's US recognizers has not surfaced.
  • Privacy-Enhancing Computation patterns (TEE / confidential VMs) for eval workloads handling cross-tenant data — none of the major vendors advertise SGX/SEV-SNP or Nitro Enclaves for the eval data plane as of April 2026.
  • Whether Mintlify will keep Helicone OSS development active beyond maintenance β€” load-bearing unknown for OSS adopters.

Claims kept "[unverified]" or directional only

  • The "5% daily LLM-judge sampling" figure as a universal default — VentureBeat-summarized search snippet, not a fetched primary source.
  • The "550 IT leaders, >50% rely on AI gateway" survey β€” vendor blog snippet; underlying survey not fetched. Treat as directional.
  • "Strong LLM judges reach 80% agreement with human evaluators" β€” LangChain phrasing is loose; Yan's framing (calibrate to inter-human) is principled.
  • Brainstore's "23.9× full-text and 2.55× write" benchmark vs an unnamed competitor — vendor-published; directionally true, not portable.
  • Specific in-production retention policies at LangSmith and OpenAI/Anthropic- internal eval stacks β€” not published.
  • Self-attributed adoption metrics for SME playgrounds (Galileo, Maxim) — vendor-attributed, not independently verifiable.
  • Specific sampling percentages used in production by named companies (DoorDash, OpenAI, Anthropic) β€” vendors document recommended ranges; actual ratios not publicly documented.

Sources

Deduplicated across all five agents and both rounds. Organized by topic; each source was fetched and content-confirmed in the original research session.

Eval + observability flywheel β€” primary

Named-team production case studies

Cost economics, pricing, and latency

Trace data layer & storage

OTel GenAI semantic conventions

Failure-mode discovery, topic modeling, and rubric induction

Headless / agent-driven evals & MCP

AI gateways

RBAC, governance, compliance


End of report.
