1 Comment

Chirdeep Chhabra

The "agents deliver work, not workflows" framing is compelling, and the verifier examples (Sierra's simulation process, rubric design, dataset management) address the output-quality dimension well.

There is a second evaluation dimension that surfaces when the agent's "work" includes actions on production systems: execution quality. When an agent updates a CRM record, triggers a deployment, or modifies a billing configuration, "was the output correct?" and "was the execution safe?" are two separate questions. Did the agent act within its authorized scope? Did it execute in the right order relative to other agents acting on the same system? Is there a record of what it did, when, and under what policy?
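To make the scope question concrete, here is a minimal sketch of a pre-execution authorization check. All names here (`AgentPolicy`, `allowed_actions`, the action strings) are hypothetical, not from any particular framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy: the set of actions an agent is authorized to take."""
    agent_id: str
    allowed_actions: frozenset  # e.g. {"crm.update", "billing.read"}

def authorize(policy: AgentPolicy, action: str) -> bool:
    """Gate every production action on the agent's authorized scope."""
    return action in policy.allowed_actions

policy = AgentPolicy("support-agent-1", frozenset({"crm.update", "billing.read"}))
assert authorize(policy, "crm.update")          # within scope: allowed
assert not authorize(policy, "billing.modify")  # out of scope: blocked before execution
```

The point of the sketch is that this check runs in the infrastructure layer, before the action happens, independent of whether the agent's output was correct.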

Building verifiers for output quality is hard but tractable: compare against golden datasets and human judgment. Building verifiers for execution quality is structurally different because the quality signal lives in the infrastructure layer, not in the agent's output. You need the execution record (what was done, in what order, by which agent, under what authority) to even define what "correct" means. Most teams I see are building the first type of verifier and discovering the second as a separate, harder problem once their agents start acting on shared production systems.
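A minimal sketch of what such an execution record might look like, and how a verifier would consume it. The field names and actions are illustrative assumptions, not a real system's schema:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionRecord:
    """One entry in the execution log: what was done, in what order,
    by which agent, under what authority."""
    seq: int        # global ordering relative to other agents on the same system
    agent_id: str
    action: str
    policy_id: str  # which policy authorized the action
    timestamp: float

log: list = []

def record(seq: int, agent_id: str, action: str, policy_id: str) -> None:
    log.append(ExecutionRecord(seq, agent_id, action, policy_id, time.time()))

record(1, "agent-a", "crm.update:acct-42", "policy-7")
record(2, "agent-b", "billing.adjust:acct-42", "policy-9")

# An execution-quality verifier checks the log, not the agent's output:
assert [r.seq for r in log] == sorted(r.seq for r in log)  # ordering preserved
assert all(r.policy_id for r in log)                       # every action attributable to a policy
```

Without a record like this, "was the execution safe?" has no ground truth to verify against, which is why the signal lives in the infrastructure layer rather than in the agent's output.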
