Create a new EmergentJudge instance.
Judge configuration specifying models and the LLM callback.
The generateText function is called for creation reviews and promotion
panels but never for reuse validation (which is purely programmatic).
Full code + test review for a newly forged tool.
Builds a structured prompt from the candidate's details (name, description, schemas, source code, sandbox allowlist, test results) and asks the LLM to evaluate four dimensions: SAFETY, CORRECTNESS, DETERMINISM, BOUNDED.
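The prompt construction described above might be sketched as follows. This is an illustrative approximation, not the library's actual implementation; the `ToolCandidate` field names (`inputSchema`, `outputSchema`, `sandboxAllowlist`, etc.) are assumptions.

```typescript
// Hypothetical candidate shape — field names are illustrative assumptions.
interface ToolCandidate {
  name: string;
  description: string;
  inputSchema: object;
  outputSchema: object;
  sourceCode: string;
  sandboxAllowlist: string[];
  testResults: { name: string; passed: boolean }[];
}

// Assemble the candidate's details into a single structured review prompt.
function buildCreationPrompt(c: ToolCandidate): string {
  return [
    "Evaluate this candidate tool on four dimensions: SAFETY, CORRECTNESS, DETERMINISM, BOUNDED.",
    `Name: ${c.name}`,
    `Description: ${c.description}`,
    `Input schema: ${JSON.stringify(c.inputSchema)}`,
    `Output schema: ${JSON.stringify(c.outputSchema)}`,
    `Sandbox allowlist: ${c.sandboxAllowlist.join(", ")}`,
    `Test results: ${c.testResults.map((t) => `${t.name}=${t.passed ? "pass" : "fail"}`).join(", ")}`,
    `Source code:\n${c.sourceCode}`,
  ].join("\n");
}
```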
The tool is approved only if both safety.passed AND correctness.passed
are true in the LLM response.
If the LLM returns malformed JSON that cannot be parsed, a rejected verdict is returned with confidence 0 and a reasoning string explaining the parse failure. This prevents bad LLM output from accidentally approving a tool.
The tool candidate to evaluate. Must include source code and at least one test result.
A CreationVerdict indicating approval or rejection with per-dimension scores and reasoning.
Pure schema validation on each reuse — no LLM call.
Validates that output conforms to the declared schema using basic type
checking. This runs on every tool invocation so it must be fast — no LLM
calls, no network I/O, no async operations.
Checks performed:
- type: 'object': verify output is a non-null object.
- properties: verify each declared property key exists on the output object.
- required: verify each required property key exists.
- type: 'string': verify output is a string.
- type: 'number' or type: 'integer': verify output is a number.
- type: 'boolean': verify output is a boolean.
- type: 'array': verify output is an array.

The ID of the tool being reused (reserved for future anomaly detection; currently unused).
The actual output value produced by the tool invocation.
The tool's declared output JSON Schema.
A ReuseVerdict with valid: true if the output conforms,
or valid: false with a schemaErrors array describing each mismatch.
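A minimal sketch of the checks listed above, assuming the `ReuseVerdict` shape described here. The function name `validateAgainstSchema` is illustrative; the real method also takes the tool ID, omitted here since it is currently unused.

```typescript
interface ReuseVerdict { valid: boolean; schemaErrors: string[] }

// Pure, synchronous schema check — no LLM call, no I/O, no async.
function validateAgainstSchema(output: unknown, schema: any): ReuseVerdict {
  const errors: string[] = [];
  switch (schema?.type) {
    case "object":
      if (typeof output !== "object" || output === null || Array.isArray(output)) {
        errors.push(`expected object, got ${output === null ? "null" : typeof output}`);
        break;
      }
      // Each declared property key must exist on the output object.
      for (const key of Object.keys(schema.properties ?? {})) {
        if (!(key in (output as Record<string, unknown>))) errors.push(`missing property: ${key}`);
      }
      // Each required property key must exist.
      for (const key of schema.required ?? []) {
        if (!(key in (output as Record<string, unknown>))) errors.push(`missing required property: ${key}`);
      }
      break;
    case "string":
      if (typeof output !== "string") errors.push(`expected string, got ${typeof output}`);
      break;
    case "number":
    case "integer":
      if (typeof output !== "number") errors.push(`expected number, got ${typeof output}`);
      break;
    case "boolean":
      if (typeof output !== "boolean") errors.push(`expected boolean, got ${typeof output}`);
      break;
    case "array":
      if (!Array.isArray(output)) errors.push(`expected array, got ${typeof output}`);
      break;
  }
  return { valid: errors.length === 0, schemaErrors: errors };
}
```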
Two-judge panel for tier promotion. Both must approve.
Sends two independent LLM calls in parallel using the promotion model, one per reviewer.
Both reviewers must return approved: true for the promotion to pass. If
either reviewer's response fails to parse as JSON, the promotion is rejected.
The emergent tool to evaluate for promotion. Must have usage stats and judge verdicts from prior reviews.
A PromotionVerdict containing both sub-verdicts and the combined approval decision.
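The two-judge flow can be sketched as below. The function and type names are illustrative (the real method takes the tool, not a raw prompt); the point shown is the parallel calls, the unanimous-approval rule, and the fail-closed handling of unparseable reviewer output.

```typescript
type GenerateText = (prompt: string) => Promise<string>;
interface SubVerdict { approved: boolean; reasoning: string }
interface PromotionVerdict { approved: boolean; reviews: [SubVerdict, SubVerdict] }

// A reviewer response that fails to parse counts as a rejection.
function parseSubVerdict(raw: string): SubVerdict {
  try {
    const parsed = JSON.parse(raw);
    return { approved: parsed.approved === true, reasoning: String(parsed.reasoning ?? "") };
  } catch {
    return { approved: false, reasoning: "reviewer response failed to parse as JSON" };
  }
}

// Two independent reviews run in parallel; both must approve.
async function runPromotionPanel(generateText: GenerateText, prompt: string): Promise<PromotionVerdict> {
  const [a, b] = await Promise.all([generateText(prompt), generateText(prompt)]);
  const reviews: [SubVerdict, SubVerdict] = [parseSubVerdict(a), parseSubVerdict(b)];
  return { approved: reviews[0].approved && reviews[1].approved, reviews };
}
```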
Evaluates forged tools for safety, correctness, and quality using LLM-as-judge.
Three evaluation modes, each scaled to the risk level of the operation:
- reviewCreation: full code + test review for a newly forged tool
- validateReuse: pure schema validation on each reuse (no LLM call)
- reviewPromotion: two-judge promotion panel