Evaluation¶
WorldForge ships five built-in suites:
generation: prompt-only and image-conditioned video generation checksphysics: deterministic object stability and action-response checksplanning: relocation, neighbor placement, swap, and spawn execution validation over the predict-driven plannerreasoning: scene-count and scene-identity checks for providers that implementreason()transfer: prompt-guided and reference-guided video transfer checks
Python¶
from worldforge.evaluation import EvaluationSuite
suite = EvaluationSuite.from_builtin("planning")
report = suite.run_report(["mock"], forge=forge)
print(report.results[0].passed)
print(report.to_json())
CLI¶
uv run worldforge eval --suite generation --provider mock
uv run worldforge eval --suite physics --provider mock
uv run worldforge eval --suite planning --provider mock --format json
uv run worldforge eval --suite reasoning --provider mock --format csv
uv run worldforge eval --suite transfer --provider mock
Repeat --provider to compare multiple registered providers in one report.
Use --run-workspace when an evaluation should leave a manifest-backed evidence bundle:
The run workspace stores run_manifest.json, JSON/Markdown/CSV reports, and a result summary under
.worldforge/runs/<run-id>/.
The same built-in suites are available from TheWorldHarness. Launch
uv run --extra harness worldforge-harness --flow eval, pick a suite and provider, and the TUI
writes the canonical JSON report under .worldforge/reports/ before opening the Run Inspector.
Capability mismatches remain WorldForgeError failures; the TUI surfaces the message instead of
silently skipping the suite.
Report formats¶
- Markdown: provider summary table plus scenario-level detail table
- JSON:
suite_id,suite,claim_boundary,metric_semantics,provider_summaries, and scenarioresults - CSV: one row per provider/scenario pair with serialized metrics payloads
Every JSON and Markdown report carries an explicit claim boundary. Built-in suites are deterministic adapter contract checks; their scores are not physical-fidelity, media-quality, safety, or real robot performance claims.
Capability checks¶
Each suite declares the provider capabilities it needs. For example:
generationrequiresgeneratephysicsandplanningrequirepredictreasoningrequiresreasontransferrequirestransfer
WorldForge raises WorldForgeError when a caller asks a provider to run a suite it cannot satisfy.