HoneyHive is part of Microsoft's open trust stack for AI agents, announced at Microsoft Build 2026: a way to build agents on any framework, with a broad ecosystem of governance, security, observability, and framework partners. HoneyHive joins as a partner for the Agent Control Specification (ACS).
The stack centers on two open-source projects: ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), which turns your policies into safety-focused evaluations, and ACS, a runtime control standard for enforcing those policies live. ASSERT evaluates, ACS enforces, and HoneyHive is the observability layer. The first integration is already live: every ASSERT run is captured as a HoneyHive trace you can open.
Figure: Microsoft's ACS launch partners and customers, with HoneyHive featured among them.
Scaling governance for agents
Evaluating agents used to mean writing test cases by hand, one case at a time. That never keeps up. The agent changes, the world changes, and your coverage is always a release behind. Policy-as-code changes the unit of work. You write a behavioral spec in plain language, the behavior the agent must and must not exhibit, and ASSERT generates the adversarial suite from it. It returns scored dimensions like policy violation and overrefusal. Compliance stops being a one-time gate before launch and becomes continuous, so more use cases reach production faster.
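To make "continuous" concrete, here is a minimal, library-free sketch of a regression gate over scored dimensions. It does not use ASSERT's actual API or output format. The `BehaviorSpec` class, the `failing_dimensions` helper, the idea that each dimension is a rate between 0 and 1, and every number below are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BehaviorSpec:
    """Hypothetical stand-in for a plain-language behavior spec."""
    name: str
    must: List[str]
    must_not: List[str]
    # Assumed convention: highest acceptable rate (0..1) per scored dimension.
    limits: Dict[str, float] = field(default_factory=dict)


def failing_dimensions(spec: BehaviorSpec, scores: Dict[str, float]) -> List[str]:
    """Return the dimensions that are over their limit or missing from the run."""
    failures = []
    for dimension, limit in sorted(spec.limits.items()):
        score = scores.get(dimension)
        # A missing score counts as a failure so a gap never passes silently.
        if score is None or score > limit:
            failures.append(dimension)
    return failures


spec = BehaviorSpec(
    name="support_agent_refunds",
    must=["Explain the refund policy when asked."],
    must_not=["Promise a refund outside the stated policy."],
    limits={"policy_violation": 0.0, "overrefusal": 0.10},
)

# Illustrative scores only, not real results from any run.
run_scores = {"policy_violation": 0.02, "overrefusal": 0.05}
failures = failing_dimensions(spec, run_scores)
```

Running a check like this on every change, rather than once before launch, is what turns the spec into a standing regression suite.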
HoneyHive + ASSERT
ASSERT and HoneyHive cover different parts of the same run:
ASSERT turns your policies into adversarial test cases. You write a behavior spec in plain language; ASSERT runs a staged pipeline (taxonomy, test-case generation, inference, LLM judging) and returns scored dimensions tied to specific failure modes. It is open source and works across agent frameworks.
HoneyHive captures the run. Initialize a tracer and an OpenInference instrumentor once, and every call in the ASSERT loop is traced: your agent, ASSERT's tester, and the judge. Agent turns group under named chain spans in the Traces view.
Run them together and each failed dimension links to its HoneyHive session: the full span tree for that assert-ai run, including your agent's turns, the tools it called, ASSERT's tester, and the judge.
HoneyHive makes each run a shared, persistent session the whole team can open. You can hand a flagged conversation to the subject-matter expert who owns that policy, annotate where the agent went wrong, and pull failing cases into datasets you reuse as your regression suite grows. Evaluation and production runs use the same tracer and sit in one workspace, so a failure in testing lines up against the agent's live behavior.
How it works
The integration is light. Wrap your agent as an ASSERT callable target, initialize HoneyHive tracing once at startup, and point ASSERT's behavior spec at the wrapper. From there, every assert-ai run produces one HoneyHive session: your agent's calls grouped under named chain spans, with ASSERT's tester and judge captured automatically, plus the usual ASSERT artifacts for regression comparison.
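The shape of that wiring can be sketched without either library. The example below is an illustration under stated assumptions, not the HoneyHive SDK or ASSERT's interface: `InMemoryTracer`, `init_tracing`, `make_target`, and `fake_llm` are hypothetical stand-ins that show a one-time tracer setup and a callable wrapper whose model calls nest under a named chain span.

```python
from contextlib import contextmanager
from typing import Callable, Dict, List, Optional


class InMemoryTracer:
    """Hypothetical stand-in for a real tracer: keeps spans in a list."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Optional[str]]] = []
        self._stack: List[Dict[str, Optional[str]]] = []

    @contextmanager
    def span(self, name: str, kind: str):
        # Record the span and remember which span (if any) encloses it.
        parent = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "kind": kind, "parent": parent}
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()


_TRACER: Optional[InMemoryTracer] = None


def init_tracing() -> InMemoryTracer:
    """Create the tracer once at startup; later calls return the same one."""
    global _TRACER
    if _TRACER is None:
        _TRACER = InMemoryTracer()
    return _TRACER


def fake_llm(prompt: str) -> str:
    """Placeholder for a model call, recorded as its own span."""
    with init_tracing().span("ChatCompletion", kind="llm"):
        return f"echo: {prompt}"


def make_target(agent: Callable[[str], str], chain_name: str) -> Callable[[str], str]:
    """Wrap an agent as a plain callable whose turns group under one chain span."""
    tracer = init_tracing()

    def target(message: str) -> str:
        with tracer.span(chain_name, kind="chain"):
            return agent(message)

    return target


# The wrapper is what an evaluation harness would call for each test case.
support_agent = make_target(fake_llm, "support_agent")
reply = support_agent("Where is my order?")
tracer = init_tracing()
```

In a real setup the in-memory tracer would be replaced by the HoneyHive tracer plus an OpenInference instrumentor, and the wrapper would be the callable the behavior spec points at; the integration guide has the actual calls.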
Figure: HoneyHive trace view of an ASSERT run as a single session. The agent's ChatCompletion calls are grouped under named support_agent chain spans, ASSERT's tester and judge calls are captured automatically, and the judge's behavior-evaluator call scores the policy_violation and overrefusal dimensions.
You do not have to wire it by hand. HoneyHive publishes a honeyhive-instrument agent skill: point your coding agent at it and it adds the install, tracer initialization, and the right instrumentor for your stack.
Paste a prompt and let the agent do the setup:
Install the honeyhive-instrument skill from https://github.com/honeyhiveai/skills,
then use it to instrument ASSERT in this project. Use the HoneyHive v2 docs for current SDK guidance. Refer https://docs.honeyhive.ai/v2/integrations/assert
Getting started
The full integration guide walks through setup end to end. If you're working on agent evaluation and reliability, book a demo.