What Evaluations Do Not Do

  • guarantee production safety on their own - they surface comparison results, contract checks, and diagnostics, but still require review
  • replace monitoring - there is no built-in continuous evaluation loop; you schedule runs yourself
  • create a family-agnostic run-detail API - evaluation results can reference the underlying run IDs, but to inspect a run you still use the documented agent-run and skill-run route families
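
Because there is no built-in continuous loop, recurring evaluation is something you wire up yourself. The following is a minimal sketch of one way to do that, not the product's API: `run_evaluation` is a hypothetical callable standing in for however you trigger an evaluation, and the `passed` key on its result is an assumed shape used only for illustration.

```python
import time
from typing import Callable, List


def run_on_interval(
    run_evaluation: Callable[[], dict],
    interval_seconds: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call run_evaluation max_runs times, pausing between consecutive calls.

    run_evaluation is a placeholder (assumption): supply your own function
    that starts an evaluation and returns its result as a dict.
    """
    results: List[dict] = []
    for i in range(max_runs):
        results.append(run_evaluation())
        # Pause only between runs, not after the final one.
        if i < max_runs - 1:
            sleep(interval_seconds)
    return results


def needs_review(result: dict) -> bool:
    """Flag a result for review.

    Assumes a hypothetical result shape with a boolean "passed" key;
    a missing key is treated as a failure so it is never silently skipped.
    """
    return not result.get("passed", False)


if __name__ == "__main__":
    # Stand-in evaluation so the sketch runs without any external service.
    def fake_evaluation() -> dict:
        return {"passed": True, "run_id": "example"}

    results = run_on_interval(fake_evaluation, interval_seconds=0.0, max_runs=3)
    flagged = [r for r in results if needs_review(r)]
    print(f"{len(results)} runs, {len(flagged)} flagged for review")
```

In practice a cron job or CI schedule would usually replace the in-process loop. Either way, flagged results still go to a reviewer, since a passing evaluation does not by itself guarantee production safety.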
