What Evaluations Do Not Do

guarantee production safety on their own - they surface comparison results, contract checks, and diagnostics, but those results still require review
replace monitoring - there is no built-in continuous evaluation loop, so you schedule runs yourself
create a family-agnostic run-detail API - evaluation results can reference underlying run IDs, but public inspection of those runs remains on the documented agent-run and skill-run route families
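
Because there is no built-in continuous evaluation loop, recurring runs have to be driven from your own code or scheduler. The sketch below is one minimal way to do that in Python. It is not part of the product: `run_on_schedule` and the `run_evaluation` callable are hypothetical names, and `run_evaluation` stands in for whatever call you already use to start an evaluation and collect its result.

```python
import time
from typing import Callable, List


def run_on_schedule(
    run_evaluation: Callable[[], dict],
    interval_seconds: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call a caller-supplied evaluation function a fixed number of times.

    `run_evaluation` is a hypothetical placeholder: wrap your own
    evaluation call in it. This helper only handles the timing.
    """
    results = []
    for i in range(max_runs):
        # Trigger one evaluation and keep whatever it returns for later review.
        results.append(run_evaluation())
        # Wait between runs, but not after the final one.
        if i < max_runs - 1:
            sleep(interval_seconds)
    return results
```

In practice a cron job or CI schedule that triggers a single run may be a better fit than a long-lived loop. Either way, the collected results still need review.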

See also