Claim-To-Evidence Map

Issue: #140

Status: active public evidence map.

This page maps public WorldForge claims to the evidence class that supports them. Use it when citing README claims in issues, release evidence, or provider promotion reviews. It does not add new benchmark numbers, broaden provider capabilities, or turn deterministic checks into physical fidelity claims.

Evidence Classes

| Class | Meaning | Expected evidence |
| --- | --- | --- |
| checkout-tested | Runs from a clean checkout without credentials, network, GPUs, or optional model runtimes. | Local pytest, CLI, docs, and package commands. |
| fixture-tested | Covered by synthetic JSON fixtures or recorded parser fixtures that stay in the repository. | `tests/fixtures/`, `worldforge.testing` fixtures, provider parser tests, or contract helpers. |
| prepared-host smoke-tested | Requires host-owned credentials, checkpoints, optional runtimes, or robot/model assets. | A documented command plus sanitized `run_manifest.json` or live-smoke registry row. |
| release-gated | Part of release evidence or CI quality gates. | Coverage, package contract, benchmark preset, docs build, or release evidence report. |
| deferred | A design or scaffold exists, but executable public behavior is intentionally withheld. | Explicit blocker, revisit trigger, or fail-closed scaffold docs. |
| unsupported | WorldForge does not claim or own this behavior. | Public non-claim and first routing step to host or upstream owner. |
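
The evidence-class cells in the tables on this page sometimes combine several classes with a semicolon and a trailing qualifier, for example "fixture-tested; prepared-host smoke-tested for real runtimes". The following is a minimal sketch of how such a cell could be parsed when scripting against this page. `EvidenceClass` and `parse_evidence_cell` are hypothetical names for illustration and are not part of the WorldForge API.

```python
from enum import Enum


class EvidenceClass(Enum):
    """Hypothetical enum mirroring the six evidence classes listed on this page."""

    CHECKOUT_TESTED = "checkout-tested"
    FIXTURE_TESTED = "fixture-tested"
    PREPARED_HOST_SMOKE_TESTED = "prepared-host smoke-tested"
    RELEASE_GATED = "release-gated"
    DEFERRED = "deferred"
    UNSUPPORTED = "unsupported"


def parse_evidence_cell(cell: str) -> list[EvidenceClass]:
    """Split an evidence-class cell on ';' and map each entry to a class.

    A trailing qualifier after the class name (such as 'for real runtimes')
    is ignored. Unknown entries raise ValueError instead of being dropped.
    """
    classes = []
    for part in cell.split(";"):
        text = part.strip()
        found = next((c for c in EvidenceClass if text.startswith(c.value)), None)
        if found is None:
            raise ValueError(f"unknown evidence class: {text!r}")
        classes.append(found)
    return classes


def needs_prepared_host(cell: str) -> bool:
    """True when any entry in the cell depends on host-owned assets."""
    return EvidenceClass.PREPARED_HOST_SMOKE_TESTED in parse_evidence_cell(cell)
```

Raising on unknown entries keeps a typo in a table cell from silently weakening a claim.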

Capability Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
| --- | --- | --- | --- | --- |
| `predict` is a provider capability for state rollout. | checkout-tested | `tests/test_provider_contracts.py`, `tests/test_capability_fixtures.py`, `tests/test_capability_dual_routing.py` | `uv run worldforge predict kitchen --provider mock --x 0.3 --y 0.8 --z 0.0 --steps 2` | Built-in deterministic checks do not prove physical fidelity. |
| `score` ranks action candidates for score-model workflows. | fixture-tested; prepared-host smoke-tested for real runtimes | `tests/test_leworldmodel_provider.py`, `tests/test_jepa_provider.py`, `tests/test_jepa_wms_provider.py`, `tests/fixtures/providers/*score*` | `uv run worldforge-demo-leworldmodel`; live-smoke registry rows for `leworldmodel`, `jepa`, and `jepa-wms` | Tensors, checkpoints, preprocessing, and devices stay host-owned. |
| `policy` returns embodiment-specific action chunks. | fixture-tested; prepared-host smoke-tested for real runtimes | `tests/test_lerobot_provider.py`, `tests/test_gr00t_provider.py`, `tests/test_provider_contracts.py` | `scripts/robotics-showcase --json-only --no-tui --no-rerun`; uploaded `run_manifest.json` in live robotics CI | WorldForge preserves raw actions and requires host-owned translators before executable actions. |
| `embed` is a narrow mock-supported capability surface. | checkout-tested | `tests/test_provider_contracts.py`, `tests/test_capability_fixtures.py`, `tests/test_benchmark.py` | `uv run worldforge benchmark --preset parser-overhead` | It is a contract and adapter-path check, not a general-purpose embedding quality claim. |
| `plan` is a WorldForge facade over composed surfaces. | checkout-tested | `tests/test_evaluation_and_planning.py`, `tests/test_capability_dual_routing.py` | `uv run worldforge eval --suite planning --provider mock --format json` | `plan` is not advertised as a provider-owned capability by default. |

Provider And Runtime Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
| --- | --- | --- | --- | --- |
| The mock provider is stable and deterministic. | checkout-tested; release-gated | `tests/test_provider_contracts.py`, `tests/test_benchmark_presets.py` | `uv run worldforge benchmark --preset mock-smoke` | Synthetic provider behavior is not runtime fidelity evidence. |
| LeWorldModel exposes `score`. | fixture-tested; prepared-host smoke-tested | `tests/test_leworldmodel_provider.py`, `tests/test_lerobot_leworldmodel_smoke_script.py`, live-smoke registry | `scripts/robotics-showcase --json-only --no-tui --no-rerun` | `stable-worldmodel`, `torch`, checkpoints, tensors, and device behavior are optional runtime concerns. |
| LeRobot and GR00T expose `policy`. | fixture-tested; prepared-host smoke-tested | `tests/test_lerobot_provider.py`, `tests/test_gr00t_provider.py`, live-smoke registry | `scripts/robotics-showcase` for LeRobot; `scripts/smoke_gr00t_policy.py --help` for GR00T setup | Robot controllers, safety checks, and action translators are host-owned. |
| JEPA is experimental and score-only. | fixture-tested; prepared-host smoke-tested only when host evidence exists | `tests/test_jepa_provider.py`, `tests/test_jepa_wms_provider.py`, runtime manifest docs | `uv run worldforge-smoke-jepa-wms --help` | Torch-hub runtime, weights, preprocessing, and license review stay host-owned. |
| Genie is a scaffold reservation. | deferred | `tests/test_remote_scaffold_providers.py`, `docs/src/providers/genie.md` | Revisit trigger in the Genie provider docs | No public automation API contract is claimed; scaffold behavior remains fail-closed. |
| Nano World Model is a candidate, not a provider surface. | deferred | `docs/src/provider-cohort-selection.md` | Follow the assigned candidate issue before any catalog claim | No `nanowm` provider is exported or auto-registered. |
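
To make "fail-closed" concrete for the deferred rows above, here is a minimal sketch of a scaffold that reserves a provider name, advertises no capabilities, and refuses every operation with an explicit error. `DeferredScaffold` and `DeferredProviderError` are hypothetical names chosen for illustration. They are not taken from the WorldForge source, and the real scaffold may be structured differently.

```python
class DeferredProviderError(RuntimeError):
    """Hypothetical error type for this sketch; not a WorldForge exception."""


class DeferredScaffold:
    """Hypothetical placeholder for a reserved provider with no executable behavior."""

    def __init__(self, name: str, revisit_trigger: str) -> None:
        self.name = name
        self.revisit_trigger = revisit_trigger

    def capabilities(self) -> frozenset[str]:
        # A deferred provider advertises nothing, so routing never selects it.
        return frozenset()

    def _refuse(self, operation: str):
        # Fail closed: raise instead of returning a placeholder result.
        raise DeferredProviderError(
            f"{self.name}.{operation} is deferred; revisit when: {self.revisit_trigger}"
        )

    def predict(self, *args, **kwargs):
        self._refuse("predict")

    def score(self, *args, **kwargs):
        self._refuse("score")

    def policy(self, *args, **kwargs):
        self._refuse("policy")
```

The design choice is that a caller who reaches a deferred provider gets an error naming the blocker, never a silently empty or synthetic result.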

Workflow And Artifact Claims

| Public claim | Evidence class | Evidence | Command or artifact | Boundary |
| --- | --- | --- | --- | --- |
| Evaluation reports carry provenance and claim boundaries. | checkout-tested; release-gated | `tests/test_provenance.py`, `tests/test_evaluation_and_planning.py`, `docs/src/evaluation.md` | `uv run worldforge eval --suite planning --provider mock --format json` | Scores are deterministic contract signals, not physical or media-quality metrics. |
| Evaluation reports can cite dataset manifests without embedding datasets. | checkout-tested; release-gated | `tests/test_evaluation_suites.py`, `tests/test_evidence_bundle.py`, `docs/src/evaluation.md` | `uv run worldforge eval --suite planning --provider mock --dataset-manifest examples/dataset-manifests/mock-evaluation-fixtures.json --format json` | Manifests record provenance, license, privacy, safety, checksums, and host-owned acquisition steps; datasets and prepared assets stay outside the repository. |
| Failed evaluation reports include issue-ready failure galleries. | checkout-tested; release-gated | `tests/test_evaluation_failure_gallery.py`, `docs/src/evaluation.md`, `docs/src/api/python.md` | `uv run worldforge eval --suite planning --provider mock --format json` plus `report.artifacts()["failure_gallery.json"]` | Galleries are sanitized deterministic contract triage, not provider ranking or fidelity evidence. |
| Benchmark reports carry provenance, budgets, and preset gates. | checkout-tested; release-gated | `tests/test_benchmark.py`, `tests/test_benchmark_presets.py`, `docs/src/benchmarking.md` | `uv run worldforge benchmark --preset release-evidence --format json --run-workspace .worldforge` | Timings are process-local adapter-path measurements, not machine-independent performance claims. |
| Benchmark budget changes have a preserved baseline review path. | checkout-tested; release-gated | `tests/test_benchmark_budget_calibration.py`, `scripts/calibrate_benchmark_budgets.py`, `docs/src/benchmarking.md` | `uv run python scripts/calibrate_benchmark_budgets.py --report .worldforge/reports/benchmark-<timestamp>-<run-id>.json --current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json` | Candidate budgets are review artifacts; they do not automatically weaken release gates. |
| Preserved run comparisons compare compatible provider runs without creating leaderboards. | checkout-tested; release-gated | `tests/test_harness_report_compare.py`, `tests/test_harness_workspace.py`, `docs/src/benchmarking.md` | `uv run worldforge runs compare .worldforge/runs/<baseline-run-id> .worldforge/runs/<candidate-run-id> --format markdown` | Comparisons fail on mismatched capability, operation, fixture digest, budget, or suite version; rows are artifact deltas, not rankings across tasks. |
| Regression comparisons review a candidate run against a preserved baseline. | checkout-tested; release-gated | `tests/test_harness_report_compare.py`, `docs/src/benchmarking.md`, `docs/src/html-reports.md` | `uv run worldforge runs compare .worldforge/runs/<baseline-run-id> .worldforge/runs/<candidate-run-id> --mode regression --format markdown` | Reports metric deltas, budget status changes, new and removed failures, safe artifact drift, provenance differences, and unsafe artifact exclusions; they do not update baselines automatically. |
| Evaluation evidence bundles package preserved runs. | checkout-tested; release-gated | `tests/test_evidence_bundle.py`, `scripts/generate_evidence_bundle.py`, `docs/src/evaluation.md` | `uv run python scripts/generate_evidence_bundle.py --workspace-dir .worldforge` | Unsafe, local-only, signed, or binary artifacts are excluded or marked; the bundle does not upload anything. |
| Live-smoke evidence is indexed in a publishable registry. | prepared-host smoke-tested; release-gated | `tests/test_live_smoke_evidence.py`, `docs/src/live-smoke-evidence.json` | `uv run python scripts/generate_release_evidence.py --live-smoke-registry docs/src/live-smoke-evidence.json` | Missing optional runtimes or credentials are explicit skip states, not silent omissions. |
| Rerun records sanitized events and artifacts when the extra is installed. | checkout-tested; prepared-host smoke-tested for robotics showcase | `tests/test_rerun_integration.py`, `tests/test_robotics_showcase.py` | `uv run --extra rerun worldforge-demo-rerun`; `/tmp/worldforge-robotics-showcase/real-run.rrd` | Rerun is optional observability, not a provider capability or base dependency. |
| Robotics showcase TUI is optional and Textual-isolated. | checkout-tested; release-gated | `tests/test_robotics_showcase.py`, `tests/test_robotics_showcase_ci.py`, `tests/test_import_boundaries.py` | `scripts/robotics-showcase` | Non-TUI robotics evidence remains available without importing Textual. |
| Planning runs over plain world-state dicts via `LatentMPCController`. | checkout-tested | `tests/test_latent_mpc_controller.py`, `tests/test_capability_dual_routing.py`, persistence ADR | `uv run python examples/basic_prediction.py` | There is no built-in world store; durable persistence stays host-owned. |
| Quality gates run on Python 3.13 with coverage, docs, package, and lint checks. | release-gated | `.github/workflows/ci.yml`, `scripts/test_package.sh`, `docs/src/quality.md` | `uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90` | Passing gates does not expand runtime capability claims. |
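
The run-comparison rows above state that comparisons fail on mismatched capability, operation, fixture digest, budget, or suite version. The sketch below illustrates that rule on plain dicts. The key names and both functions are assumptions made for illustration. They do not describe the actual layout of a preserved run or the internals of `worldforge runs compare`.

```python
# Hypothetical field names for the five attributes that must match.
COMPARABLE_KEYS = ("capability", "operation", "fixture_digest", "budget", "suite_version")


def mismatched_keys(baseline: dict, candidate: dict) -> list[str]:
    """Return the comparability attributes on which the two runs differ."""
    return [key for key in COMPARABLE_KEYS if baseline.get(key) != candidate.get(key)]


def assert_comparable(baseline: dict, candidate: dict) -> None:
    """Refuse to compare runs that differ on any comparability attribute."""
    mismatches = mismatched_keys(baseline, candidate)
    if mismatches:
        raise ValueError("runs are not comparable; mismatched: " + ", ".join(mismatches))


baseline = {
    "capability": "predict",
    "operation": "predict",
    "fixture_digest": "abc123",
    "budget": "release-evidence",
    "suite_version": "1",
}
candidate = dict(baseline, suite_version="2")
print(mismatched_keys(baseline, candidate))
```

Failing before any delta is computed is what keeps the output an artifact comparison between like-for-like runs instead of a ranking across tasks.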

Unsupported Or Non-Claims

| Non-claim | Evidence class | First routing step |
| --- | --- | --- |
| Physical fidelity, media quality, robot safety certification, or real-world control safety. | unsupported | Treat the WorldForge report as adapter evidence only; use task-specific host evaluation and safety review. |
| Upstream provider SLA, paid API availability, rate limits, credential management, or artifact retention. | unsupported | Check the provider's upstream status, credentials, and host retention policy. |
| Training LeWorldModel, JEPA, NanoWM, GR00T, LeRobot, or any other model inside WorldForge. | unsupported | Use the upstream training repository and keep artifacts out of the WorldForge base package. |
| Service-grade persistence, database migrations, lock files, or multi-writer storage. | unsupported | Follow the persistence adapter boundary ADR before introducing a host-owned store. |
| Hosted dashboard, telemetry service, queue, deployment, auth, or alerting. | unsupported | Implement these in the host application and attach sanitized WorldForge run artifacts. |

Usage

When citing a row in an issue or in release notes, quote the claim and include the matching command or artifact. Expected success signals are an explicit test pass count, a benchmark gate reporting Status: passed, or a validated run_manifest.json. If the command fails, first open the docs page linked from that row and check whether the provider is checkout-safe, fixture-backed, credentialed, prepared-host, deferred, or unsupported.