# Claim-To-Evidence Map
- Issue: #140
- Status: active public evidence map
This page maps public WorldForge claims to the evidence class that supports them. Use it when citing README claims in issues, release evidence, or provider promotion reviews. It does not add new benchmark numbers, broaden provider capabilities, or turn deterministic checks into physical fidelity claims.
## Evidence Classes

| Class | Meaning | Expected evidence |
|---|---|---|
| `checkout-tested` | Runs from a clean checkout without credentials, network, GPUs, or optional model runtimes. | Local pytest, CLI, docs, and package commands. |
| `fixture-tested` | Covered by synthetic JSON fixtures or recorded parser fixtures that stay in the repository. | `tests/fixtures/`, `worldforge.testing` fixtures, provider parser tests, or contract helpers. |
| `prepared-host smoke-tested` | Requires host-owned credentials, checkpoints, optional runtimes, or robot/model assets. | A documented command plus a sanitized `run_manifest.json` or live-smoke registry row. |
| `release-gated` | Part of release evidence or CI quality gates. | Coverage, package contract, benchmark preset, docs build, or release evidence report. |
| `deferred` | A design or scaffold exists, but executable public behavior is intentionally withheld. | Explicit blocker, revisit trigger, or fail-closed scaffold docs. |
| `unsupported` | WorldForge does not claim or own this behavior. | Public non-claim and first routing step to the host or upstream owner. |
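A row may cite more than one class, with a qualifier after the class name (for example, "prepared-host smoke-tested for real runtimes"). The normalization can be sketched as below; `parse_evidence_classes` is a hypothetical illustration for issue tooling, not a WorldForge API.

```python
# The six canonical evidence classes used by this map.
EVIDENCE_CLASSES = (
    "checkout-tested",
    "fixture-tested",
    "prepared-host smoke-tested",
    "release-gated",
    "deferred",
    "unsupported",
)


def parse_evidence_classes(annotation: str) -> list[str]:
    """Normalize a row annotation such as
    'fixture-tested; prepared-host smoke-tested for real runtimes'
    into canonical class names, dropping qualifier text."""
    classes = []
    for part in annotation.split(";"):
        part = part.strip()
        for cls in EVIDENCE_CLASSES:
            if part.startswith(cls):
                classes.append(cls)
                break
        else:
            # Fail loudly on a class this map does not define.
            raise ValueError(f"unknown evidence class: {part!r}")
    return classes
```

Failing closed on unknown annotations mirrors the map's intent: a claim either carries a defined evidence class or it is flagged for review.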
## Capability Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
|---|---|---|---|---|
| `predict` is a provider capability for state rollout. | checkout-tested | `tests/test_world_lifecycle.py`, `tests/test_provider_contracts.py`, `tests/test_capability_fixtures.py` | `uv run worldforge world predict <world-id> --object-id <object-id> --x 0.4 --y 0.5 --z 0` | Built-in deterministic checks do not prove physical fidelity. |
| `score` ranks action candidates for score-model workflows. | fixture-tested; prepared-host smoke-tested for real runtimes | `tests/test_leworldmodel_provider.py`, `tests/test_jepa_provider.py`, `tests/test_jepa_wms_provider.py`, `tests/fixtures/providers/*score*` | `uv run worldforge-demo-leworldmodel`; live-smoke registry rows for `leworldmodel`, `jepa`, and `jepa-wms` | Tensors, checkpoints, preprocessing, and devices stay host-owned. |
| `policy` returns embodiment-specific action chunks. | fixture-tested; prepared-host smoke-tested for real runtimes | `tests/test_lerobot_provider.py`, `tests/test_gr00t_provider.py`, `tests/test_provider_contracts.py` | `scripts/robotics-showcase --json-only --no-tui --no-rerun`; uploaded `run_manifest.json` in live robotics CI | WorldForge preserves raw actions and requires host-owned translators before executable actions. |
| `generate` produces media artifacts. | checkout-tested; fixture-tested; prepared-host smoke-tested for remote APIs | `tests/test_cosmos_provider.py`, `tests/test_runway_provider.py`, `tests/test_remote_video_providers.py` | `uv run worldforge benchmark --preset remote-media-dryrun` on a configured host | Returned media is an artifact contract, not a quality or safety claim. |
| `transfer` transforms a media artifact. | checkout-tested; fixture-tested | `tests/test_remote_video_providers.py`, `tests/test_provider_contracts.py`, `src/worldforge/testing/fixtures/transfer/` | `uv run worldforge benchmark --provider mock --operation transfer --input-file examples/benchmark-inputs.json` | Remote transfer requires provider credentials and artifact retention by the host. |
| `reason` and `embed` are narrow mock-supported capability surfaces. | checkout-tested | `tests/test_provider_contracts.py`, `tests/test_capability_fixtures.py`, `tests/test_benchmark.py` | `uv run worldforge benchmark --preset parser-overhead` | They are contract and adapter-path checks, not general-purpose LLM or embedding quality claims. |
| `plan` is a WorldForge facade over composed surfaces. | checkout-tested | `tests/test_evaluation_and_planning.py`, `tests/test_capability_dual_routing.py` | `uv run worldforge eval --suite planning --provider mock --format json` | `plan` is not advertised as a provider-owned capability by default. |
## Provider And Runtime Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
|---|---|---|---|---|
| The mock provider is stable and deterministic. | checkout-tested; release-gated | `tests/test_provider_contracts.py`, `tests/test_benchmark_presets.py` | `uv run worldforge benchmark --preset mock-smoke` | Synthetic provider behavior is not runtime fidelity evidence. |
| Cosmos and Runway are remote media adapters. | fixture-tested; prepared-host smoke-tested when configured | `tests/fixtures/providers/cosmos_*.json`, `tests/fixtures/providers/runway_*.json`, provider docs | `uv run worldforge benchmark --preset remote-media-dryrun` | Credentials, upstream availability, returned artifact retention, and paid usage stay host-owned. |
| LeWorldModel exposes `score`. | fixture-tested; prepared-host smoke-tested | `tests/test_leworldmodel_provider.py`, `tests/test_lerobot_leworldmodel_smoke_script.py`, live-smoke registry | `scripts/robotics-showcase --json-only --no-tui --no-rerun` | `stable-worldmodel`, `torch`, checkpoints, tensors, and device behavior are optional runtime concerns. |
| LeRobot and GR00T expose `policy`. | fixture-tested; prepared-host smoke-tested | `tests/test_lerobot_provider.py`, `tests/test_gr00t_provider.py`, live-smoke registry | `scripts/robotics-showcase` for LeRobot; `scripts/smoke_gr00t_policy.py --help` for GR00T setup | Robot controllers, safety checks, and action translators are host-owned. |
| JEPA is experimental and score-only. | fixture-tested; prepared-host smoke-tested only when host evidence exists | `tests/test_jepa_provider.py`, `tests/test_jepa_wms_provider.py`, runtime manifest docs | `uv run worldforge-smoke-jepa-wms --help` | Torch-hub runtime, weights, preprocessing, and license review stay host-owned. |
| Genie is a scaffold reservation. | deferred | `tests/test_remote_scaffold_providers.py`, `docs/src/providers/genie.md` | Revisit trigger in the Genie provider docs | No public automation API contract is claimed; scaffold behavior remains fail-closed. |
| Nano World Model is a candidate, not a provider surface. | deferred | `docs/src/provider-cohort-selection.md` | Follow the assigned candidate issue before any catalog claim | No `nanowm` provider is exported or auto-registered. |
## Workflow And Artifact Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
|---|---|---|---|---|
| Evaluation reports carry provenance and claim boundaries. | checkout-tested; release-gated | `tests/test_provenance.py`, `tests/test_evaluation_and_planning.py`, `docs/src/evaluation.md` | `uv run worldforge eval --suite planning --provider mock --format json` | Scores are deterministic contract signals, not physical or media-quality metrics. |
| Failed evaluation reports include issue-ready failure galleries. | checkout-tested; release-gated | `tests/test_evaluation_failure_gallery.py`, `docs/src/evaluation.md`, `docs/src/api/python.md` | `uv run worldforge eval --suite planning --provider mock --format json` plus `report.artifacts()["failure_gallery.json"]` | Galleries are sanitized deterministic contract triage, not provider ranking or fidelity evidence. |
| Benchmark reports carry provenance, budgets, and preset gates. | checkout-tested; release-gated | `tests/test_benchmark.py`, `tests/test_benchmark_presets.py`, `docs/src/benchmarking.md` | `uv run worldforge benchmark --preset release-evidence --format json --run-workspace .worldforge` | Timings are process-local adapter-path measurements, not machine-independent performance claims. |
| Benchmark budget changes have a preserved baseline review path. | checkout-tested; release-gated | `tests/test_benchmark_budget_calibration.py`, `scripts/calibrate_benchmark_budgets.py`, `docs/src/benchmarking.md` | `uv run python scripts/calibrate_benchmark_budgets.py --report .worldforge/reports/benchmark-<timestamp>-<run-id>.json --current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json` | Candidate budgets are review artifacts; they do not automatically weaken release gates. |
| Preserved run comparisons compare compatible provider runs without creating leaderboards. | checkout-tested; release-gated | `tests/test_harness_report_compare.py`, `tests/test_harness_workspace.py`, `docs/src/benchmarking.md` | `uv run worldforge runs compare .worldforge/runs/<baseline-run-id> .worldforge/runs/<candidate-run-id> --format markdown` | Comparisons fail on mismatched capability, operation, fixture digest, budget, or suite version; rows are artifact deltas, not rankings across tasks. |
| Evaluation evidence bundles package preserved runs. | checkout-tested; release-gated | `tests/test_evidence_bundle.py`, `scripts/generate_evidence_bundle.py`, `docs/src/evaluation.md` | `uv run python scripts/generate_evidence_bundle.py --workspace-dir .worldforge` | Unsafe, local-only, signed, or binary artifacts are excluded or marked; the bundle does not upload anything. |
| Live-smoke evidence is indexed in a publishable registry. | prepared-host smoke-tested; release-gated | `tests/test_live_smoke_evidence.py`, `docs/src/live-smoke-evidence.json` | `uv run python scripts/generate_release_evidence.py --live-smoke-registry docs/src/live-smoke-evidence.json` | Missing optional runtimes or credentials are explicit skip states, not silent omissions. |
| Rerun records sanitized events and artifacts when the extra is installed. | checkout-tested; prepared-host smoke-tested for robotics showcase | `tests/test_rerun_integration.py`, `tests/test_robotics_showcase.py` | `uv run --extra rerun worldforge-demo-rerun`; `/tmp/worldforge-robotics-showcase/real-run.rrd` | Rerun is optional observability, not a provider capability or base dependency. |
| TheWorldHarness is optional and Textual-isolated. | checkout-tested; release-gated | `tests/test_harness_flows.py`, `tests/test_harness_cli.py`, `tests/test_harness_tui.py` | `uv run --extra harness worldforge-harness` | Non-TUI flow logic stays independent from Textual imports. |
| Local JSON persistence is the authoritative built-in store. | checkout-tested | `tests/test_world_lifecycle.py`, `tests/test_cli_world_commands.py`, persistence ADR | `uv run worldforge world export <world-id> --output world.json` | It is not a concurrent multi-writer database or service-grade durable store. |
| Quality gates run on Python 3.13 with coverage, docs, package, and lint checks. | release-gated | `.github/workflows/ci.yml`, `scripts/test_package.sh`, `docs/src/quality.md` | `uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90` | Passing gates does not expand runtime capability claims. |
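The run-comparison gate described above refuses to compare runs whose identifying metadata differs. A minimal sketch of that rule, assuming hypothetical field names (`capability`, `operation`, `fixture_digest`, `budget`, `suite_version`) rather than WorldForge's actual run schema:

```python
# Keys that must match exactly before two preserved runs are comparable.
# The key names here are illustrative assumptions, not the real schema.
COMPARE_KEYS = ("capability", "operation", "fixture_digest", "budget", "suite_version")


def comparable(baseline: dict, candidate: dict) -> tuple[bool, list[str]]:
    """Return (ok, mismatched_keys); the comparison must fail closed
    whenever any identifying key differs between the two runs."""
    mismatched = [k for k in COMPARE_KEYS if baseline.get(k) != candidate.get(k)]
    return (not mismatched, mismatched)
```

Reporting the mismatched keys, rather than a bare boolean, keeps the failure issue-ready: the triage note can name exactly which dimension (fixture digest, budget, and so on) made the runs incomparable.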
## Unsupported Or Non-Claims

| Non-claim | Evidence class | First routing step |
|---|---|---|
| Physical fidelity, media quality, robot safety certification, or real-world control safety. | unsupported | Treat the WorldForge report as adapter evidence only; use task-specific host evaluation and safety review. |
| Upstream provider SLA, paid API availability, rate limits, credential management, or artifact retention. | unsupported | Check the provider's upstream status, credentials, and host retention policy. |
| Training LeWorldModel, JEPA, NanoWM, GR00T, LeRobot, or any other model inside WorldForge. | unsupported | Use the upstream training repository and keep artifacts out of the WorldForge base package. |
| Service-grade persistence, database migrations, lock files, or multi-writer storage. | unsupported | Follow the persistence adapter boundary ADR before introducing a host-owned store. |
| Hosted dashboard, telemetry service, queue, deployment, auth, or alerting. | unsupported | Implement these in the host application and attach sanitized WorldForge run artifacts. |
## Usage
For issue or release notes, cite the row by claim and include the matching command or artifact. Expected success signals are the explicit test pass count, a benchmark gate with `Status: passed`, or a validated `run_manifest.json`. If the command fails, the first triage step is to open the linked docs page for that row and check whether the provider is checkout-safe, fixture-backed, credentialed, prepared-host, deferred, or unsupported.
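The benchmark-gate success signal can be checked mechanically against a JSON report. This is a hedged sketch: the top-level `status` field name is an assumption for illustration, not the documented WorldForge report schema.

```python
import json


def gate_passed(report_json: str) -> bool:
    """Return True when a benchmark report (as a JSON string) records a
    passing gate. The 'status' field name is an illustrative assumption."""
    report = json.loads(report_json)
    return report.get("status") == "passed"
```

Any value other than an explicit `"passed"` (including a missing field) counts as not passed, matching the fail-closed reading this map applies to evidence.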