Evaluates forged tools for safety, correctness, and quality using LLM-as-judge.

Three evaluation modes, each scaled to the risk level of the operation:

Mode LLM calls When used
reviewCreation 1 Newly forged tool — full code audit + test validation
validateReuse 0 Every invocation — pure programmatic schema check
reviewPromotion 2 Tier promotion — dual-judge safety + correctness panel

Example

const judge = new EmergentJudge({
judgeModel: 'gpt-4o-mini',
promotionModel: 'gpt-4o',
generateText: async (model, prompt) => callLlm(model, prompt),
});

// Creation review
const verdict = await judge.reviewCreation(candidate);
if (verdict.approved) { registry.register(tool, 'session'); }

// Reuse validation (no LLM call)
const reuse = judge.validateReuse('tool-1', output, outputSchema);
if (!reuse.valid) { throw new Error(reuse.schemaErrors.join(', ')); }

// Promotion panel
const promotion = await judge.reviewPromotion(tool);
if (promotion.approved) { registry.promote(tool.id, 'agent'); }

Constructors

Methods

  • Full code + test review for a newly forged tool.

    Builds a structured prompt from the candidate's details (name, description, schemas, source code, sandbox allowlist, test results) and asks the LLM to evaluate four dimensions: SAFETY, CORRECTNESS, DETERMINISM, BOUNDED.

    The tool is approved only if both safety.passed AND correctness.passed are true in the LLM response.

    If the LLM returns malformed JSON that cannot be parsed, a rejected verdict is returned with confidence 0 and a reasoning string explaining the parse failure. This prevents bad LLM output from accidentally approving a tool.

    Parameters

    • candidate: ToolCandidate

      The tool candidate to evaluate. Must include source code and at least one test result.

    Returns Promise<CreationVerdict>

    A CreationVerdict indicating approval or rejection with per-dimension scores and reasoning.

  • Pure schema validation on each reuse — no LLM call.

    Validates that output conforms to the declared schema using basic type checking. This runs on every tool invocation so it must be fast — no LLM calls, no network I/O, no async operations.

    Checks performed:

    • If schema declares type: 'object', verify output is a non-null object.
    • If schema declares properties, verify each declared property key exists on the output object.
    • If schema declares required, verify each required property key exists.
    • If schema declares type: 'string', verify output is a string.
    • If schema declares type: 'number' or type: 'integer', verify output is a number.
    • If schema declares type: 'boolean', verify output is a boolean.
    • If schema declares type: 'array', verify output is an array.

    Parameters

    • _toolId: string

      The ID of the tool being reused (reserved for future anomaly detection; currently unused).

    • output: unknown

      The actual output value produced by the tool invocation.

    • schema: JSONSchemaObject

      The tool's declared output JSON Schema.

    Returns ReuseVerdict

    A ReuseVerdict with valid: true if the output conforms, or valid: false with a schemaErrors array describing each mismatch.

  • Two-judge panel for tier promotion. Both must approve.

    Sends two independent LLM calls in parallel using the promotion model:

    1. Safety auditor: Reviews the tool's source code and usage history for security concerns (data exfiltration, resource exhaustion, API abuse).
    2. Correctness reviewer: Reviews the tool's source code and all historical outputs for correctness issues (schema violations, edge case failures).

    Both reviewers must return approved: true for the promotion to pass. If either reviewer's response fails to parse as JSON, the promotion is rejected.

    Parameters

    • tool: EmergentTool

      The emergent tool to evaluate for promotion. Must have usage stats and judge verdicts from prior reviews.

    Returns Promise<PromotionVerdict>

    A PromotionVerdict containing both sub-verdicts and the combined approval decision.