What Evaluations Do Not Do
- guarantee production safety on their own - they surface comparison results, contract checks, and diagnostics, but those results still require review
- replace monitoring - there is no built-in continuous evaluation loop; you schedule runs yourself
- create a family-agnostic run-detail API - evaluation results can reference underlying run IDs, but you inspect those runs through the documented agent-run and skill-run route families
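
Because there is no built-in continuous evaluation loop, recurring runs have to be driven from outside. The sketch below shows one minimal way to do that in Python. It assumes a caller-supplied `trigger_run` callable, a hypothetical stand-in for however you start an evaluation run in your setup; it is not part of any documented API. The function name `run_on_schedule` and its parameters are illustrative as well.

```python
import time
from typing import Callable, List


def run_on_schedule(
    trigger_run: Callable[[], str],
    interval_seconds: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call trigger_run max_runs times, waiting interval_seconds between calls.

    trigger_run is a hypothetical placeholder: it should start one evaluation
    run and return whatever identifier you want to keep for later review.
    """
    results: List[str] = []
    for i in range(max_runs):
        results.append(trigger_run())
        # Wait between runs, but not after the final one.
        if i < max_runs - 1:
            sleep(interval_seconds)
    return results
```

In practice a cron job or an existing job scheduler may be a better fit than a long-lived loop. Either way, the collected results still need review, and the underlying runs are inspected through the agent-run and skill-run routes.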