Interface CreationVerdict

Evaluation verdict produced by the LLM-as-judge after a tool is forged.

The judge runs the tool against its declared test cases and scores it on four dimensions (safety, correctness, determinism, and bounded execution), alongside an overall confidence value. A tool is registered only when approved is true.

interface CreationVerdict {
    approved: boolean;
    confidence: number;
    safety: number;
    correctness: number;
    determinism: number;
    bounded: number;
    reasoning: string;
}
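As an illustration of how a consumer might handle a verdict, the sketch below checks that every numeric field lies in [0, 1] and gates registration on approved. The helpers outOfRangeFields and shouldRegister and the sample values are hypothetical and not part of the documented API; the interface is repeated only to keep the example self-contained.

```typescript
// Local copy of the documented interface so the sketch is self-contained.
interface CreationVerdict {
    approved: boolean;
    confidence: number;
    safety: number;
    correctness: number;
    determinism: number;
    bounded: number;
    reasoning: string;
}

// Hypothetical helper: returns the names of numeric fields outside [0, 1].
function outOfRangeFields(verdict: CreationVerdict): string[] {
    const numericFields = [
        "confidence",
        "safety",
        "correctness",
        "determinism",
        "bounded",
    ] as const;
    return numericFields.filter((name) => {
        const value = verdict[name];
        return !Number.isFinite(value) || value < 0 || value > 1;
    });
}

// Hypothetical gate: register only when the judge approved the tool
// and the verdict itself is well-formed.
function shouldRegister(verdict: CreationVerdict): boolean {
    return verdict.approved && outOfRangeFields(verdict).length === 0;
}

// Illustrative values only; not real judge output.
const exampleVerdict: CreationVerdict = {
    approved: true,
    confidence: 0.9,
    safety: 0.95,
    correctness: 1,
    determinism: 0.85,
    bounded: 0.9,
    reasoning: "All declared test cases passed within resource limits.",
};

console.log(shouldRegister(exampleVerdict));
```

Treating a malformed verdict as a rejection is a design choice made for this sketch; a caller might instead prefer to raise an error.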

Properties

approved: boolean

Whether the judge approves the tool for registration at its initial tier. When false, the forge request is rejected and no tool is registered.

confidence: number

Overall confidence the judge has in its verdict, in the range [0, 1]. Low confidence may trigger a second judge pass or human review.
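One way the follow-up on low confidence could be routed is sketched below. The threshold values, the NextStep names, and the routeByConfidence function are all assumptions made for illustration; the source does not say which thresholds apply or in what order a second judge pass and human review are tried.

```typescript
// Hypothetical outcomes for a confidence check.
type NextStep = "accept_verdict" | "second_judge_pass" | "human_review";

// Hypothetical thresholds; the documentation does not specify any values.
const REJUDGE_BELOW = 0.7;
const HUMAN_REVIEW_BELOW = 0.4;

// Maps a confidence value in [0, 1] to a follow-up action.
function routeByConfidence(confidence: number): NextStep {
    if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
        throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
    }
    if (confidence < HUMAN_REVIEW_BELOW) {
        return "human_review";
    }
    if (confidence < REJUDGE_BELOW) {
        return "second_judge_pass";
    }
    return "accept_verdict";
}

console.log(routeByConfidence(0.55));
```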

safety: number

Safety score in the range [0, 1]. Assesses whether the tool's implementation could cause unintended harm, data exfiltration, or resource exhaustion.

correctness: number

Correctness score in the range [0, 1]. Measures how well the tool's outputs match the expected outputs in the declared test cases.

determinism: number

Determinism score in the range [0, 1]. Gauges whether repeated invocations with identical inputs produce consistent outputs. Lower scores flag non-deterministic behaviour.

bounded: number

Bounded-execution score in the range [0, 1]. Indicates how reliably the tool completes within its declared resource limits (memory, time). Scores are derived from sandbox telemetry.

reasoning: string

Free-text explanation of the verdict, including any failure reasons, flagged patterns, or suggestions for improvement.
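When surfacing a rejection, it can help to pair the free-text reasoning with the lowest-scoring dimension. The following is a minimal sketch under that assumption; DimensionScores, weakestDimension, summarizeRejection, and the message format are hypothetical names and conventions, not part of the documented API.

```typescript
// Hypothetical subset of the verdict holding only the dimension scores.
interface DimensionScores {
    safety: number;
    correctness: number;
    determinism: number;
    bounded: number;
}

// Returns the dimension with the lowest score.
// On ties, the dimension listed first wins.
function weakestDimension(scores: DimensionScores): keyof DimensionScores {
    const names: (keyof DimensionScores)[] = [
        "safety",
        "correctness",
        "determinism",
        "bounded",
    ];
    let weakest: keyof DimensionScores = "safety";
    for (const name of names) {
        if (scores[name] < scores[weakest]) {
            weakest = name;
        }
    }
    return weakest;
}

// Builds a one-line rejection summary from the scores and the reasoning text.
function summarizeRejection(scores: DimensionScores, reasoning: string): string {
    const name = weakestDimension(scores);
    return `Rejected (weakest: ${name} = ${scores[name].toFixed(2)}): ${reasoning}`;
}

// Illustrative values only; not real judge output.
console.log(
    summarizeRejection(
        { safety: 0.9, correctness: 0.4, determinism: 0.8, bounded: 0.7 },
        "Two declared test cases returned unexpected output.",
    ),
);
```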