Whether the judge approves the tool for registration at its initial tier.
false means the forge request is rejected and no tool is registered.
Overall confidence the judge has in its verdict, in the range [0, 1]. Low confidence may trigger a second judge pass or human review.
Safety score in the range [0, 1]. Assesses whether the tool's implementation could cause unintended harm, data exfiltration, or resource exhaustion.
Correctness score in the range [0, 1]. Measures how well the tool's outputs match the expected outputs in the declared test cases.
Determinism score in the range [0, 1]. Gauges whether repeated invocations with identical inputs produce consistent outputs. Lower scores flag non-deterministic behaviour.
Bounded execution score in the range [0, 1]. Indicates whether the tool reliably completes within its declared resource limits (memory, time). Scores are derived from sandbox telemetry.
Free-text explanation of the verdict, including any failure reasons, flagged patterns, or suggestions for improvement.
Evaluation verdict produced by the LLM-as-judge after a tool is forged.
The judge runs the tool against its declared test cases and scores it across five evaluation dimensions. A tool is registered only when approved is true.
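The verdict described above could be modelled roughly as follows. This is a minimal sketch, not the actual schema: every field name other than approved is hypothetical (the source gives only the descriptions), and the 0.5 review threshold is an assumed placeholder, since the source says only that low confidence may trigger a second judge pass or human review.

```python
from dataclasses import dataclass

# Hypothetical names for the [0, 1] fields described in the text.
SCORE_FIELDS = (
    "confidence",
    "safety",
    "correctness",
    "determinism",
    "bounded_execution",
)


@dataclass(frozen=True)
class JudgeVerdict:
    approved: bool            # False: the forge request is rejected
    confidence: float         # judge's confidence in its own verdict
    safety: float
    correctness: float
    determinism: float
    bounded_execution: float
    reasoning: str = ""       # free-text explanation of the verdict

    def __post_init__(self) -> None:
        # Enforce the documented [0, 1] range on every score field.
        for name in SCORE_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def needs_review(self, threshold: float = 0.5) -> bool:
        # Assumed policy: confidence below the threshold is escalated.
        return self.confidence < threshold


def should_register(verdict: JudgeVerdict) -> bool:
    # Registration depends only on the approved flag.
    return verdict.approved
```

In this sketch, registration is gated on approved alone, while the individual scores and the explanation are kept for auditing and for deciding whether to escalate.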