Provider And Platform Roadmap¶
This proposal converts the current WorldForge direction into issue-ready workstreams for real provider integrations, production-grade harnesses, host applications, and operational hardening. It is written as a planning source for GitHub issues, not as a promise that every item is already implemented.
Last reviewed: 2026-05-01.
Completion Snapshot¶
The first production track is complete. The GitHub tracker issues remain useful as an audit trail, but new provider and platform work should start from a fresh selection record or a narrowly scoped follow-up issue rather than reopening these implementation slices.
| Tracker | Scope | Completion signal |
|---|---|---|
| #47 | Provider platform foundation | Promotion gates, runtime manifests, conformance helpers, live-smoke artifacts, and optional runtime profiles are implemented. |
| #48 | Real provider implementations | LeWorldModel, LeRobot, GR00T, Cosmos, Runway, JEPA, JEPA-WMS, and Genie defer decisions are documented against executable capability claims. |
| #49 | Production harness | Run workspaces, connector readiness, live inspection, report comparison, and provider workbench flows preserve safe artifacts. |
| #50 | Reference host applications | Batch eval, service, and robotics operator host examples show host-owned integration boundaries. |
| #51 | Observability monitoring and logging | Provider event schema, logging, metrics, OpenTelemetry, readiness, and incident runbooks are available without changing the base dependency boundary. |
| #52 | Reliability security and release gates | Credential hardening, request budgets, persistence ADR, and release evidence generation are in place. |
Use the release gate below before tagging a release that claims this roadmap milestone.
Planning Rules¶
Use these rules when turning this roadmap into issues:
- Keep the base package lightweight. Do not add torch, robot runtimes, checkpoint managers, web frameworks, dashboards, or telemetry exporters to the default dependency set.
- Promote a provider only when it calls a real upstream runtime or API, validates the boundary, and has fixture-backed failure coverage.
- Keep optional runtimes host-owned. WorldForge can provide adapters, launchers, conformance tests, and reference host applications; the host owns credentials, devices, checkpoints, datasets, durable stores, dashboards, robot controllers, and safety policy.
- Treat provider events as security-sensitive records. Every new observable field must pass redaction tests before it reaches logs, metrics, traces, reports, or exports.
- Keep deterministic checkout tests separate from live runtime smokes. Live smokes should be explicit, skip cleanly without credentials/runtime packages, and preserve enough artifacts to debug failures.
- Every issue should name the exact capability surface it changes:
predict,generate,transfer,reason,embed,score, orpolicy.
Desired End State¶
The roadmap is complete when WorldForge has moved from "credible framework and demos" to a production-shaped provider platform with these properties:
| Dimension | End state |
|---|---|
| Provider trust | Every advertised provider capability has a real callable implementation, conformance coverage, generated docs, and a prepared-host smoke path. |
| Provider selection | New provider work starts from a written selection record that scores user value, upstream maturity, runtime cost, validation feasibility, and maintenance burden. |
| Runtime boundary | Heavy model packages, GPUs, checkpoints, robot stacks, hosted dashboards, and durable stores remain opt-in host concerns. |
| Harness | TheWorldHarness becomes the local operations workspace for provider readiness, run execution, artifact inspection, eval/benchmark comparison, and provider development. |
| Host apps | Reference hosts show how to embed WorldForge in batch jobs, services, and robotics operator workflows without redefining the base package boundary. |
| Operations | Provider calls produce sanitized events, optional traces/metrics/log files, run manifests, readiness states, and incident runbooks. |
| Release evidence | Releases include reproducible local gates and explicit optional-provider evidence or skip reasons. |
Non-negotiable quality bar:
- A provider is not "real" because its name exists in the catalog. It is real only when a host can install the upstream runtime or configure the upstream API, run a documented smoke command, and inspect preserved evidence.
- A harness feature is not production-grade because it renders a screen. It is production-grade when it preserves state, handles cancellation/failure, exports safe artifacts, and can be tested without live credentials.
- An observability feature is not production-grade if it leaks prompts, credentials, signed URL query strings, raw tensors, unbounded labels, or robot-specific secrets.
Current Baseline¶
| Area | Current state | Constraint |
|---|---|---|
| Base package | httpx runtime dependency, Python 3.13 only, hatchling/uv packaging |
Keep provider and host extras optional. |
| Provider catalog | mock, cosmos, runway, leworldmodel, gr00t, lerobot, experimental score-only jepa, plus scaffold genie |
Do not advertise scaffold providers as real integrations. |
| Candidate provider | jepa-wms direct-construction score candidate, not exported or auto-registered |
Promote only after real upstream limits and runtime behavior are validated. |
| Planning | World.plan(...) composes predict, score, policy, and policy+score flows |
Do not treat plan as a provider badge unless a provider implements planning directly. |
| Harness | Optional Textual app with worlds, providers, eval, benchmark, run inspector, and robotics report surfaces | Textual remains isolated to worldforge.harness.tui. |
| Observability | ProviderEvent, JsonLoggerSink, InMemoryRecorderSink, ProviderMetricsSink, handler fanout |
Host owns full telemetry export and alerting unless optional integrations are added. |
| Persistence | Validated single-writer local JSON state and report artifacts | Durable multi-writer persistence remains a host concern until a separate adapter design lands. |
Target Architecture¶
The roadmap should preserve this dependency direction:
host application / CLI / harness
-> WorldForge facade
-> provider registry + capability registry
-> provider adapter or protocol implementation
-> upstream runtime/API/checkpoint
The reverse direction is forbidden: provider adapters must not import host applications, the base package must not import optional robotics/model runtimes at module import time, and telemetry exporters must not own provider behavior.
Target module boundaries:
| Boundary | Owns | Must not own |
|---|---|---|
worldforge.models |
JSON-native public contracts, validation, public errors | upstream package types, host telemetry clients |
worldforge.providers.* |
adapter configuration, capability methods, upstream parsing, provider events | durable storage, dashboards, robot-controller safety policy |
worldforge.testing |
reusable capability conformance helpers and failure assertions | live credentials, network calls by default |
worldforge.benchmark / worldforge.evaluation |
deterministic contract signals, preserved reports, budget gates | claims of real physical fidelity without external evidence |
worldforge.harness.models / flows |
structured runs and reports without Textual dependency | TUI widgets, optional model packages |
worldforge.harness.tui |
Textual screens over public APIs and structured run data | provider-specific business logic |
examples/hosts/* |
optional host integration patterns | new required base dependencies |
Core data flow for real provider runs:
input fixture or host payload
-> capability-specific validation
-> provider runtime call
-> typed result validation
-> sanitized ProviderEvent stream
-> run_manifest.json
-> harness/report/export surfaces
Every issue should state which part of this flow it changes.
Milestone Sequence¶
| Milestone | Goal | Exit signal |
|---|---|---|
| M0: Issue-ready contracts | Convert this roadmap into labeled GitHub issues with dependencies and owners | Issues exist with acceptance criteria and validation commands. |
| M1: Provider foundation | Shared provider promotion gates, runtime manifests, conformance tests, and live-smoke conventions | New providers can be added without inventing process per adapter. |
| M2: Real provider promotions | Promote the highest-value current adapters and replace scaffold reservations only when real runtime contracts exist | Provider docs/catalog match executable behavior and live-smoke artifacts. |
| M3: Production harness | Turn TheWorldHarness into the canonical local operations workspace for providers, worlds, runs, artifacts, and diagnostics | Operators can configure, run, inspect, compare, export, and recover flows locally. |
| M4: Reference host applications | Provide optional host apps that show how to embed WorldForge in services, batch jobs, and robotics labs | Hosts have runnable templates without changing the base package. |
| M5: Observability and operations | Add optional telemetry exporters, service probes, run manifests, redaction gates, and incident runbooks | Production hosts can integrate with standard monitoring without leaking secrets. |
| M6: Release hardening | Make live-provider releases reproducible, auditable, and explicitly scoped | Release gates include docs, package contract, coverage, provider contract checks, and optional smoke evidence. |
Dependency Graph¶
Use this graph when assigning issues. Work in the same row can usually proceed in parallel after its dependencies are satisfied.
| Wave | Dependencies | Issues | Why this order |
|---|---|---|---|
| 0 | none | WF-PROV-001, WF-PROVIDER-SELECT-001 | Lock provider standards and avoid adding low-value adapter work. |
| 1 | WF-PROV-001 | WF-PROV-002, WF-PROV-003 | Runtime manifests and conformance checks are the reusable substrate for real providers. |
| 2 | WF-PROV-002, WF-PROV-003 | WF-PROV-004, WF-PROV-005 | Live smokes and markers need manifests plus contract helpers. |
| 3 | Wave 2 | WF-LWM-001, WF-LEROBOT-001, WF-COSMOS-001, WF-RUNWAY-001 | Promote existing high-value real paths before expanding the catalog. |
| 4 | WF-LEROBOT-001 | WF-LEROBOT-002, WF-GROOT-001 | Robotics policy work needs translator contracts before serious host workflows. |
| 5 | WF-PROV-004 | WF-HARNESS-001, WF-OBS-001 | Runs and events need a shared artifact/correlation model. |
| 6 | WF-HARNESS-001, WF-OBS-001 | WF-HARNESS-002 through WF-HARNESS-005, WF-OBS-002 through WF-OBS-005 | UI, exports, traces, logs, and readiness should share run IDs and event semantics. |
| 7 | Waves 3-6 | WF-HOST-001, WF-HOST-002, WF-HOST-003, WF-OPS-004 | Host apps and release evidence are useful after provider/run/observability contracts exist. |
| design-only | varies | WF-OPS-003, WF-JEPA-001, WF-GENIE-001 | Do not implement until the design or upstream runtime contract is credible. |
Issue Template¶
Use this shape when creating issues from the sections below:
Title:
Type:
Labels:
Depends on:
Problem:
Scope:
- In:
- Out:
Implementation notes:
Acceptance criteria:
- [ ]
- [ ]
- [ ]
Validation:
- command:
- expected signal:
Docs:
- pages to update:
Provider issues should also fill the existing provider adapter template fields: provider/runtime, implemented capabilities, runtime ownership, validation plan, and failure modes.
Issue Sizing Rules¶
Split an issue when it crosses any of these boundaries:
- Adds or changes more than one provider capability.
- Changes both provider runtime code and host application code.
- Adds optional dependencies and changes base package behavior.
- Changes public API contracts and docs in the same large patch.
- Requires live credentials or GPU validation and deterministic checkout validation.
- Touches both durable persistence design and local JSON behavior.
Preferred issue sizes:
| Size | Shape | Expected validation |
|---|---|---|
| Small | One parser, one validator, one docs fix, one fixture family | focused pytest plus docs if public |
| Medium | One capability contract, one provider hardening slice, one harness screen | focused pytest, docs, generated catalog if provider-facing |
| Large | Provider promotion, run workspace, event schema, host app | split implementation issues plus one tracking issue |
Every issue should include an "out of scope" section. The most common out-of-scope items should be base dependency expansion, real robot execution, durable persistence, and default CI jobs requiring credentials.
Provider Prioritization Rubric¶
Score each new provider proposal before implementation. Use the total to rank issue order, not to override safety gates.
| Criterion | 0 | 1 | 2 |
|---|---|---|---|
| User value | Niche or unclear workflow | Useful to one existing workflow | Unlocks a common provider, model family, or robotics workflow |
| Capability clarity | No stable callable surface | Callable surface exists but maps awkwardly | Maps cleanly to one WorldForge capability |
| Upstream maturity | Unreleased or unstable | Works but docs/API are moving | Stable API/package/checkpoint path |
| Runtime feasibility | Requires unavailable hardware or huge setup | Prepared-host only with clear dependencies | Can run locally or through a reachable hosted API |
| Fixture strategy | No fixtureable contract | Partial fixture coverage | Success and failure fixtures cover the boundary |
| Smoke feasibility | No smoke path | Manual smoke only | Documented smoke command with preserved manifest |
| Maintenance burden | High churn or unclear license | Moderate churn | Low churn and clear license/ownership |
| Safety/secret risk | High risk without mitigation | Manageable with redaction/tests | Low risk or already covered by event sanitization |
Interpretation:
12-16: candidate for the next implementation batch.8-11: write an RFC or keep as experimental/direct-construction only.<8: defer. Do not add catalog noise.
Current high-priority candidates:
| Candidate | Reason | First issue |
|---|---|---|
| LeWorldModel score | Existing real score path and robotics showcase value | WF-LWM-001 |
| LeRobot policy | Existing real policy path and policy+score planning value | WF-LEROBOT-001 |
| Runway/Cosmos media | Existing remote adapters with production parser/artifact concerns | WF-RUNWAY-001, WF-COSMOS-001 |
| GR00T PolicyClient | Valuable robotics policy path, but prepared-host complexity is higher | WF-GROOT-001 |
| JEPA-WMS | Research-value score candidate, but should remain direct-construction until upstream limits are known | WF-JEPAWMS-001 |
The next expansion batch is recorded in the Next Provider Selection RFC. Keep provider README/catalog wording unchanged until the selected implementation issue starts.
Provider Promotion Matrix¶
Use this matrix before changing implementation_status.
| Status | Allowed public claim | Required evidence | Forbidden |
|---|---|---|---|
scaffold |
Reserved name, planned contract, no executable public capability | docs state it is not real; methods fail closed or are test-only | claiming integration, auto-registering as usable, advertising capability flags |
experimental |
Real path exists, contract may change | injected-runtime tests, failure docs, optional smoke command or documented blocker | treating outputs as stable, hiding missing runtime limits |
beta |
Usable by prepared hosts with documented limits | conformance helper coverage, runtime manifest, live-smoke manifest, generated provider docs | base dependency expansion, undocumented env vars or failure modes |
stable |
Recommended provider path for its capability | repeated smoke evidence, release evidence, parser/validation coverage, incident runbook, compatibility policy | unpinned upstream assumptions, unbounded metadata/log fields |
Promotion blockers:
- The provider cannot run without undocumented credentials or local files.
- Returned metadata is not JSON-native.
- Score direction or candidate cardinality is ambiguous.
- Policy output cannot be translated into executable
Actionvalues without hidden assumptions. - Health checks cannot distinguish missing config from missing dependency or upstream failure.
- Provider events can leak secrets or signed URL query strings.
Artifact Contracts¶
These shapes are intentionally small. Exact schemas can be implemented later, but issues should preserve these fields unless there is a better reviewed design.
Runtime Manifest¶
{
"manifest_version": 1,
"provider": "leworldmodel",
"capabilities": ["score"],
"runtime": {
"kind": "local-python",
"optional_packages": ["stable-worldmodel", "torch"],
"python": ">=3.13,<3.14",
"devices": ["cpu", "cuda"]
},
"configuration": {
"required_env": ["LEWORLDMODEL_POLICY"],
"optional_env": ["LEWORLDMODEL_CACHE_DIR", "LEWORLDMODEL_DEVICE"]
},
"artifacts": {
"checkpoint": "host-owned",
"datasets": "host-owned"
},
"smoke": {
"command": "scripts/lewm-real --checkpoint ~/.stable-wm/pusht/lewm_object.ckpt --device cpu --json",
"success_signal": "JSON summary with provider=leworldmodel and capability=score"
}
}
Run Manifest¶
{
"manifest_version": 1,
"run_id": "20260430T120000Z-leworldmodel-score",
"worldforge_version": "0.5.0",
"command": ["scripts/lewm-real", "--device", "cpu", "--json"],
"provider": "leworldmodel",
"capability": "score",
"runtime_manifest": "leworldmodel",
"status": "passed",
"input_digest": "sha256:<hex>",
"result_digest": "sha256:<hex>",
"event_count": 3,
"artifacts": [
{"kind": "json", "path": "summary.json", "safe_to_attach": true}
],
"skip_reason": null
}
Provider Event Correlation¶
Provider events should be joinable to run artifacts without exposing sensitive payloads:
{
"provider": "runway",
"operation": "generate",
"phase": "success",
"attempt": 1,
"max_attempts": 1,
"duration_ms": 1200.0,
"run_id": "20260430T120000Z-runway-generate",
"request_id": "host-request-id",
"target": "https://api.example.test/v1/tasks",
"metadata": {
"artifact_id": "artifact-local-id",
"input_digest": "sha256:<hex>"
}
}
metadata must remain JSON-native and sanitized. Do not store raw prompts, raw tensors, bearer
tokens, signed URL query strings, robot serial numbers, or host-specific paths unless a host
explicitly marks them safe for local-only use.
Workstream A: Provider Platform Foundation¶
Track status: complete for #47.
Completion signals:
- Provider promotion rules cover
scaffold,experimental,beta, andstablein the Provider Authoring Guide, including profile metadata, generated catalog updates, validation commands, and the current provider classification table. - Runtime manifests are packaged under
src/worldforge/providers/runtime_manifests/, validated by tests, linked from provider docs, and used by health/configuration summaries without installing optional runtimes. - Capability conformance helpers exist for
predict,generate,transfer,reason,embed,score, andpolicy, plus provider-event redaction checks for safe fixture and injected-runtime coverage. - Optional live-smoke commands can write sanitized
run_manifest.jsonevidence with command, package, provider profile, capability, runtime manifest, input digest, event count, result digest, and artifact-path summaries. - Runtime pytest profiles keep default checkout validation deterministic while prepared hosts can
opt into
live,network,credentialed,gpu,robotics, and provider-specific test runs.
WF-PROV-001: Provider Promotion Gate¶
Type: provider platform
Labels: provider, quality, documentation
Depends on: none
Problem: Provider maturity is visible through profiles, but there is no explicit promotion
checklist for moving from scaffold to experimental, beta, or stable.
Scope:
- Define required evidence for each status: upstream contract, runtime availability, fixtures, live smoke, docs, failure modes, and benchmark inputs.
- Add a provider-promotion checklist to the provider authoring guide or a dedicated provider governance page.
- Require status-specific docs wording. Example:
stablemay document expected operator use;experimentalmust document known gaps;scaffoldmust advertise no executable capabilities.
Acceptance criteria:
- [x] Promotion rules cover all current statuses:
scaffold,experimental,beta,stable. - [x] Rules explain when to change provider profile metadata and generated catalog docs.
- [x] Rules include exact local validation commands.
- [x] Existing providers are classified against the new checklist without changing behavior.
Validation:
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
uv run pytest tests/test_provider_catalog_docs.py
WF-PROV-002: Provider Runtime Manifest Schema¶
Type: provider platform
Labels: provider, operations, documentation
Depends on: WF-PROV-001
Problem: Optional runtime docs are scattered across provider pages, scripts, and operations docs. Real provider work needs a machine-readable manifest for runtime requirements and smoke evidence.
Scope:
- Introduce a small JSON or TOML manifest schema for each real provider runtime.
- Capture optional packages, env vars, default model/checkpoint, device support, host-owned artifacts, minimum smoke command, and expected success signal.
- Keep manifests in-repo and dependency-free.
- Do not auto-install large runtimes from manifests in the base package.
Acceptance criteria:
- [x] Manifest schema is documented and validated by tests.
- [x] Manifests exist for
leworldmodel,lerobot,gr00t,cosmos, andrunway. - [x] Missing optional dependencies produce actionable health messages using manifest data.
- [x] Docs link from provider pages to the relevant manifest.
Validation:
WF-PROV-003: Provider Conformance Suite v2¶
Type: provider platform
Labels: provider, testing, quality
Depends on: WF-PROV-001
Problem: Contract helpers exist, but real providers need capability-specific conformance suites that can be reused for deterministic fakes, injected runtimes, HTTP fixtures, and live smokes.
Scope:
- Expand
src/worldforge/testing/into capability-specific checks forscore,policy,generate,transfer,predict,reason, andembed. - Verify finite numeric outputs, JSON-native metadata, redacted events, failure typing, health behavior, and docs/profile consistency.
- Keep helpers explicit; they should raise useful
AssertionErrormessages.
Acceptance criteria:
- [x] Each capability has a reusable conformance helper.
- [x] Current provider tests call the helpers where applicable.
- [x] The helpers can run against injected runtimes without credentials.
- [x] The helpers do not use bare Python
assertstatements.
Validation:
uv run pytest tests/test_*provider*.py tests/test_observability.py
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
WF-PROV-004: Live Smoke Artifact Contract¶
Type: provider platform
Labels: provider, operations, benchmark
Depends on: WF-PROV-002
Problem: Live provider runs are useful only if they preserve the command, environment summary, input payload, provider profile, event stream, output summary, and artifact paths.
Scope:
- Define a
run_manifest.jsonshape for optional live smokes. - Include command argv, package version, provider profile, capability, sanitized env summary, runtime manifest id, input fixture digest, event count, result digest, and artifact paths.
- Write manifests from optional smoke commands and robotics showcase paths.
- Keep credential values and signed URL query strings out of manifests.
Acceptance criteria:
- [x] Live smoke commands can emit
run_manifest.json. - [x] Manifest validation rejects secret-like metadata.
- [x] Robotics showcase manifest links policy, score, replay, and report artifacts.
- [x] Docs explain which artifacts are safe to attach to GitHub issues.
Validation:
uv run pytest tests/test_smoke_run_manifest.py tests/test_observability.py
uv run mkdocs build --strict
WF-PROV-005: Optional Runtime Test Profiles¶
Type: provider platform
Labels: provider, ci, testing
Depends on: WF-PROV-002, WF-PROV-004
Problem: CI should keep checkout tests deterministic while still making live provider validation repeatable on prepared hosts.
Scope:
- Add pytest markers and documented commands for
live,gpu,network,robotics, andcredentialedtests. - Keep default
uv run pytestfree of live runtime requirements. - Provide local commands for prepared hosts to run one provider profile at a time.
- Do not add live provider jobs to default CI unless credentials and billing policy are explicit.
Acceptance criteria:
- [x] Live tests skip with clear reasons when runtime/env is missing.
- [x] Prepared-host commands are documented for each real provider.
- [x] Default CI remains deterministic and does not require credentials, network calls, GPUs, or downloaded checkpoints.
- [x] Live-smoke evidence can be attached to release notes or provider issues.
Validation:
Workstream B: Real Provider Implementations¶
Track status: complete for #48.
Completion signals:
leworldmodelis a stable score provider overstable_worldmodel.policy.AutoCostModel, with CPU fallback, score direction metadata, runtime manifests, real-checkpoint smoke commands, and the PushT bridge registry preserving shape summaries in run manifests.lerobotis a stable policy provider with explicit loader validation, bounded raw-action previews, translator-contract evidence, and host-owned checkpoint, runtime, and controller boundaries.gr00tis a beta remote PolicyClient provider with sanitized target metadata, timeout/auth handling, unreachable-server health signals, and docs that keep CUDA, TensorRT, Isaac-GR00T, and policy-server operation host-owned.cosmosandrunwaykeep remote media generation/transfer production-shaped through parser fixtures, retry/timeout event metadata, optional live smoke manifests, benchmark inputs, and artifact-retention guidance that excludes signed URL query strings from observable records.jepanow advertises only the selected experimentalscoresurface through thefacebookresearch/jepa-wmstorch-hub contract, whilejepa-wmsremains a direct-construction candidate with prepared-host smoke evidence rather than auto-registration.genieremains a fail-closed scaffold with a documented defer decision until a credible upstream automation API or runtime contract exists.- Provider docs, provider-selection records, robotics showcase docs, runtime manifests, fixtures, and conformance tests now agree on which provider capabilities are executable and which runtime responsibilities remain with hosts.
WF-LWM-001: LeWorldModel Stable Score Provider¶
Type: provider promotion
Labels: provider, score, robotics
Depends on: WF-PROV-001, WF-PROV-002, WF-PROV-003
Problem: leworldmodel already wraps the real stable_worldmodel.policy.AutoCostModel path, but
stable promotion needs a stronger runtime matrix, artifact contract, and task bridge evidence.
Scope:
- Pin the exact upstream import boundary in docs and runtime manifest.
- Validate checkpoint loading, CPU fallback, score tensor shapes, candidate cardinality, finite costs, score direction, and malformed-output errors.
- Preserve a small deterministic injected-runtime test path for checkout CI.
- Add prepared-host smoke evidence for at least one real checkpoint path.
Acceptance criteria:
- [ ]
ProviderProfilestatus and provider docs reflect the promotion decision. - [ ] Real checkpoint smoke writes a manifest with score payload summary and event stream.
- [ ] Failure docs cover missing package, missing checkpoint, bad tensor shape, non-finite score, mismatched candidate count, and device fallback.
- [ ] No LeWorldModel runtime dependency is added to the base package.
Validation:
uv run pytest tests/test_leworldmodel_provider.py tests/test_provider_contracts.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
Prepared-host smoke:
WF-LWM-002: LeWorldModel Task Bridge Pack¶
Type: provider extension
Labels: provider, score, examples
Depends on: WF-LWM-001, WF-PROV-004
Problem: LeWorldModel scoring depends on task-native tensor preprocessing. WorldForge should provide clear bridge examples without pretending it can infer every task representation.
Scope:
- Keep PushT as the first complete bridge.
- Add a bridge registry pattern for examples and smoke commands.
- Document required tensors, shapes, and task ownership.
- Add at least one negative test showing that mismatched action spaces fail instead of being padded or projected silently.
Acceptance criteria:
- [ ] Bridge docs distinguish WorldForge validation from host preprocessing.
- [ ] Bridge code emits shape summaries into run manifests.
- [ ] Mismatched action dims fail before planning.
- [ ] Existing robotics showcase continues to use the same public provider surfaces.
Validation:
uv run pytest tests/test_robotics_showcase.py tests/test_leworldmodel_provider.py
uv run mkdocs build --strict
WF-LEROBOT-001: LeRobot Stable Policy Provider¶
Type: provider promotion
Labels: provider, policy, robotics
Depends on: WF-PROV-001, WF-PROV-002, WF-PROV-003
Problem: lerobot exposes a real policy adapter, but production use needs a stronger loader
contract, action-output normalization, translator docs, and prepared-host smoke evidence.
Scope:
- Validate supported policy loading modes and default device behavior.
- Normalize raw policy outputs into JSON-native previews without losing shape/type evidence.
- Keep executable WorldForge
Actioncreation behind explicit host-supplied translators. - Add health behavior for missing package, missing checkpoint, unsupported policy type, and translator absence.
Acceptance criteria:
- [ ] Provider status is updated only if all promotion gates pass.
- [ ] Provider docs name the supported loader path and unsupported cases.
- [ ] Raw policy action previews are safe for logs and reports.
- [ ] No LeRobot, torch, robot checkpoints, or simulation packages are added to the base package.
Validation:
uv run pytest tests/test_lerobot_provider.py tests/test_provider_contracts.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
Prepared-host smoke:
WF-LEROBOT-002: Embodiment Translator Contract¶
Type: provider foundation
Labels: provider, policy, robotics, testing
Depends on: WF-LEROBOT-001
Problem: Policy providers return embodiment-specific actions. Translators need a formal contract so host applications can prove a raw policy output became executable WorldForge actions safely.
Scope:
- Define translator input/output shape, metadata, failure behavior, and provenance fields.
- Add reusable tests for translator cardinality, finite values, JSON-native metadata, and reversible preview summaries.
- Document how translators map to real robot controllers without WorldForge owning hardware safety.
Acceptance criteria:
- [ ] Translators fail loudly on unknown embodiment tags and shape mismatches.
- [ ] Policy provider docs reference the translator contract.
- [ ] Robotics showcase uses the contract without changing its public behavior.
- [ ] Host-owned safety interlocks remain out of core WorldForge.
Validation:
uv run pytest tests/test_lerobot_provider.py tests/test_robotics_showcase.py
uv run mkdocs build --strict
WF-GROOT-001: GR00T PolicyClient Beta Provider¶
Type: provider promotion
Labels: provider, policy, robotics, operations
Depends on: WF-PROV-001, WF-PROV-002, WF-LEROBOT-002
Problem: gr00t is experimental. Beta promotion requires a prepared-host story for remote
PolicyClient operation, auth, timeout behavior, action previews, and failure triage.
Scope:
- Document supported remote PolicyClient configuration and unsupported local server assumptions.
- Validate strict and non-strict modes, timeout settings, auth token handling, observation validation, translator absence, and server-unreachable errors.
- Add prepared-host smoke instructions for connecting to an existing policy server.
- Keep CUDA, TensorRT, Isaac-GR00T, and robot dependencies host-owned.
Acceptance criteria:
- [ ] Health checks distinguish missing config, missing dependency, and unreachable policy server.
- [ ] Provider events include operation, attempt, sanitized target, and duration.
- [ ] Token-like values cannot appear in events or run manifests.
- [ ] Docs state that unsupported hosts should connect to a remote server instead of starting one.
Validation:
uv run pytest tests/test_gr00t_provider.py tests/test_observability.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
Prepared-host smoke:
WF-COSMOS-001: Cosmos Generate Provider Production Hardening¶
Type: provider promotion
Labels: provider, generate, operations
Depends on: WF-PROV-003, WF-PROV-004
Problem: cosmos is a real HTTP adapter. Production hardening should lock API-version handling,
parser fixtures, retry policy, artifact metadata, and docs around reachable deployments.
Scope:
- Audit request/response parser coverage against current supported Cosmos deployment shape.
- Preserve fixture coverage for success, malformed payloads, auth failures, timeout, polling, failed tasks, and unsupported artifacts.
- Emit run manifests for live generate smoke runs.
- Keep endpoint ownership with the host.
Acceptance criteria:
- [ ] Provider docs state required endpoint/auth configuration and first triage step.
- [ ] Parser tests cover every documented failure mode.
- [ ] Retry and timeout behavior is visible in provider events.
- [ ] Live smoke can be run manually without changing default CI.
Validation:
uv run pytest tests/test_cosmos_provider.py tests/test_remote_video_providers.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
WF-RUNWAY-001: Runway Generate/Transfer Production Hardening¶
Type: provider promotion
Labels: provider, generate, transfer, operations
Depends on: WF-PROV-003, WF-PROV-004
Problem: runway supports generate and transfer through a remote API. Production hardening needs
artifact retention policy, expired URL handling, and parser coverage that matches operator docs.
Scope:
- Verify create, polling, download, content-type, and expired-artifact error paths.
- Write live smoke manifests that preserve artifact metadata without storing signed URLs.
- Document host responsibility for persisting downloaded media immediately after task completion.
- Keep legacy
RUNWAY_API_SECRETalias tested while preferringRUNWAYML_API_SECRET.
Acceptance criteria:
- [ ] Signed URL query strings never appear in events, logs, manifests, or reports.
- [ ] Docs cover artifact expiration and first recovery step.
- [ ] Transfer and generate have separate benchmark inputs and capability tests.
- [ ] Provider profile notes model/version limits where known.
Validation:
uv run pytest tests/test_runway_provider.py tests/test_remote_video_providers.py tests/test_observability.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
WF-JEPAWMS-001: JEPA-WMS Candidate Promotion¶
Type: provider promotion
Labels: provider, score, research
Depends on: WF-PROV-001, WF-PROV-002, WF-PROV-003
Problem: jepa-wms is a direct-construction score candidate. It should stay unregistered until
real upstream runtime limits, model loading, and score semantics are validated.
Scope:
- Verify the selected upstream runtime path, required dependencies, model names, and device behavior on a prepared host.
- Define exact score input tensors, action candidate shape, score direction, and unsupported cases.
- Decide whether auto-registration is appropriate or whether direct construction remains the right boundary.
- Keep torch and JEPA-WMS dependencies host-owned.
Acceptance criteria:
- [x] Provider status is not upgraded without real runtime smoke evidence.
- [x] Docs state direct-construction versus auto-registration decision.
- [x] Fixture and injected-runtime tests cover parser and validation behavior.
- [x] Prepared-host smoke writes a manifest with runtime version and score summary.
Validation:
uv run pytest tests/test_jepa_wms_provider.py tests/test_provider_contracts.py
uv run mkdocs build --strict
WF-JEPA-001: Replace JEPA Scaffold With A Real Adapter¶
Type: provider implementation
Labels: provider, predict, score, research
Depends on: WF-JEPAWMS-001 or a separate provider-selection RFC
Problem: The catalog contains a fail-closed jepa reservation. Replacing it requires choosing a
real upstream JEPA runtime and one honest capability surface.
Scope:
- Write a short provider-selection RFC before implementation.
- Choose whether the first real JEPA surface is
score,predict, orembed; do not expose all three by default. - Remove or rename scaffold behavior only when the real adapter has tests, docs, and live smoke evidence.
- Preserve a migration note for users who configured the scaffold env var.
Acceptance criteria:
- [ ] The selected upstream package/API/checkpoint is named in docs.
- [ ] The adapter advertises exactly the implemented capability.
- [ ] Scaffold-only surrogate behavior is removed or kept behind a clearly separate test path.
- [ ] Provider catalog no longer implies a fake JEPA integration.
Validation:
uv run pytest tests/test_jepa_provider.py tests/test_provider_catalog_docs.py
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
WF-GENIE-001: Replace Genie Scaffold Only After Runtime Contract Exists¶
Type: provider implementation
Labels: provider, generate, research
Depends on: WF-PROV-001
Problem: genie is currently a fail-closed reservation. A real implementation should wait until
there is a concrete upstream runtime/API with stable callable behavior and acceptable host-owned
dependency boundaries.
Scope:
- Create a provider-selection RFC before implementation.
- Decide whether the first surface is
generate,predict, or a new typed scene/world surface. - Keep the scaffold reservation if no credible upstream contract exists.
- Do not present deterministic local surrogate behavior as a real Genie integration.
Acceptance criteria:
- [ ] Issue closes with either an implementation plan or an explicit decision to defer.
- [ ] If implemented, provider docs include upstream contract, limits, env vars, artifacts, and failure modes.
- [ ] If deferred, docs clearly state why it remains a scaffold.
Validation:
WF-PROVIDER-SELECT-001: Next Provider Selection RFC¶
Type: provider planning
Labels: provider, research, roadmap
Depends on: WF-PROV-001
Problem: Adding more provider names without real integrations creates noise. The next provider batch should be chosen by callable surface, user value, runtime feasibility, and validation cost.
Scope:
- Evaluate candidate classes: embodied policy adapters, latent score/predict models, remote media generation/transfer APIs, simulator bridges, and spatial/3D world model runtimes.
- For each candidate, record capability surface, package/API maturity, host dependency weight, smoke feasibility, licensing, fixture strategy, and expected users.
- Choose at most three providers for the next implementation batch.
Acceptance criteria:
- [ ] RFC recommends no more than three provider additions.
- [ ] Each recommended provider has an issue outline with capability, owner, and validation path.
- [ ] Deferred candidates include the blocking reason.
- [ ] README/provider docs are not changed until implementation starts.
Validation:
Workstream C: Production Harness¶
Harness work should be treated as product engineering, not demo polish. The harness is the reference implementation of a local operator workspace over WorldForge APIs.
Production harness requirements:
| Requirement | Meaning |
|---|---|
| Deterministic checkout path | Every new screen or flow has a mock or fixture-backed path that runs without credentials. |
| Prepared-host path | Real provider flows can run when optional runtime packages, credentials, and artifacts are present. |
| Persistent run model | Runs are stored under a stable run directory with manifest, events, summaries, and exported reports. |
| Failure visibility | Failed, cancelled, skipped, misconfigured, and unhealthy states are visually distinct and exported. |
| Safe exports | Report exports are sanitized and suitable for GitHub issue attachments unless marked local-only. |
| Keyboard-first operation | Common provider, run, report, and world actions are reachable from command palette or shortcuts. |
| Testable without Textual import leakage | Flow logic remains testable outside worldforge.harness.tui. |
Harness anti-goals:
- Do not turn TheWorldHarness into a hosted dashboard.
- Do not hide provider setup behind implicit installation of heavy optional runtimes.
- Do not execute robot-controller actions from a default harness flow.
- Do not render raw secret-bearing provider metadata.
Track status: complete for #49.
Completion signals:
- Shared run workspaces use
.worldforge/runs/<run-id>/with sanitized manifests, provider events, result summaries, report exports, logs, sortable file-safe run IDs, retention commands, and issue-safe artifact paths. - Provider connector readiness is exposed through both TheWorldHarness and
worldforge harness --connectors, separating configured, missing credentials, missing optional dependencies, unhealthy, and scaffold states without printing secret values. - Live run inspection persists
results/inspector.json, sanitizedlogs/provider-events.jsonl, failed-run manifests, validation errors, and final artifact links for successful and failed flows. - Preserved eval and benchmark runs can be compared with
worldforge runs compareand exported as Markdown, JSON, or CSV with provenance references. worldforge provider workbench <provider>gives adapter authors a checkout-safe loop for capability conformance helpers, fixture validation, health checks, docs/catalog drift, and redaction-safe event checks.- TheWorldHarness docs describe the matching non-TUI commands and artifact layout so production harness behavior is testable without importing Textual.
WF-HARNESS-001: Harness Run Workspace¶
Type: harness
Labels: harness, operations, artifacts
Depends on: WF-PROV-004
Problem: The harness can display runs and reports, but production-grade local operation needs a clear workspace model for inputs, outputs, manifests, events, logs, and exports.
Scope:
- Define
.worldforge/runs/<run-id>/layout for harness and CLI flows. - Store sanitized run manifest, provider events, input summaries, result summaries, generated reports, and exported media paths.
- Add retention and cleanup commands.
- Keep local workspace single-user unless a durable store adapter is explicitly designed.
Acceptance criteria:
- [x] Harness and CLI flows write the same run layout.
- [x] Run IDs are file-safe and sortable.
- [x] Exported artifacts can be attached to issues without leaking secrets.
- [x] Docs include cleanup and recovery steps.
Validation:
uv run pytest tests/test_harness_flows.py tests/test_cli_reports.py
uv run --extra harness pytest tests/test_harness_tui.py
uv run mkdocs build --strict
WF-HARNESS-002: Provider Connector Workspace¶
Type: harness
Labels: harness, provider, operations
Depends on: WF-PROV-002, WF-HARNESS-001
Problem: Operators need a local way to inspect provider readiness, missing env vars, optional runtime health, and first smoke commands before running workflows.
Scope:
- Add a connector screen or panel that reads provider profiles and runtime manifests.
- Show configured, missing, unhealthy, and scaffold providers separately.
- Surface exact next commands without printing secrets.
- Support copyable smoke commands and first triage steps.
Acceptance criteria:
- [x]
mock,cosmos,runway,leworldmodel,gr00t,lerobot, experimentaljepa, and scaffoldgenierender with distinct status. - [x] Missing credentials and missing optional dependencies are visibly different.
- [x] Textual remains isolated to
worldforge.harness.tui. - [x] Non-TUI metadata command exposes the same provider readiness data as JSON.
Validation:
uv run pytest tests/test_harness_flows.py tests/test_harness_cli.py
uv run --extra harness pytest tests/test_harness_tui.py
WF-HARNESS-003: Live Run Inspector¶
Type: harness
Labels: harness, observability, provider
Depends on: WF-HARNESS-001, WF-OBS-001
Problem: Operators should be able to watch provider events, timing, retries, failures, selected actions, scores, and artifact creation as a run executes.
Scope:
- Stream sanitized provider events into the run inspector.
- Show timeline, event table, result summary, artifact paths, and validation errors.
- Support cancellation for long-running provider calls where the underlying operation can be stopped safely.
- Persist the final view from the same run manifest.
Acceptance criteria:
- [x] Success, retry, failure, and cancellation states are distinct.
- [x] Provider event metadata is redacted in TUI and exported reports.
- [x] A failed run still writes enough manifest data to reproduce the command.
- [x] Tests cover failure rendering without real provider credentials.
Validation:
uv run pytest tests/test_harness_flows.py tests/test_observability.py
uv run --extra harness pytest tests/test_harness_tui.py
WF-HARNESS-004: Report Compare And Export¶
Type: harness
Labels: harness, benchmark, evaluation
Depends on: WF-HARNESS-001
Problem: Provider evaluation and benchmark work needs first-class comparison across preserved runs, not just one-off reports.
Scope:
- Add report comparison for benchmark latency/throughput, evaluation scores, provider health, and event counts.
- Export Markdown, JSON, and CSV summaries from preserved run directories.
- Include provenance links to input fixtures and budget files.
- Keep benchmark claims tied to preserved artifacts.
Acceptance criteria:
- [x] Comparison refuses incompatible report types with a clear error.
- [x] Exported Markdown includes command, provider, operation, date, and artifact references.
- [x] CSV and JSON exports are stable enough for issue attachments.
- [x] Docs explain how to cite benchmark artifacts.
Validation:
uv run pytest tests/test_benchmark.py tests/test_evaluation_suites.py tests/test_harness_flows.py
uv run mkdocs build --strict
WF-HARNESS-005: Provider Development Workbench¶
Type: harness
Labels: harness, provider, developer-experience
Depends on: WF-PROV-003, WF-HARNESS-002
Problem: Adapter authors need a tight loop for fixture playback, health checks, capability tests, event inspection, and docs drift before opening a provider PR.
Scope:
- Add a workbench flow that runs provider conformance checks against a selected provider.
- Support fixture playback for HTTP adapters and injected runtimes for local adapters.
- Show missing docs/catalog updates when profile metadata changes.
- Do not run live provider calls unless explicitly selected.
Acceptance criteria:
- [x] Workbench can run against
mockin a clean checkout. - [x] Workbench lists required tests for each advertised capability.
- [x] Workbench links to provider authoring docs and generated catalog check.
- [x] Failures are actionable enough to paste into GitHub issues.
Validation:
uv run pytest tests/test_provider_contracts.py tests/test_harness_flows.py
uv run --extra harness pytest tests/test_harness_tui.py
Workstream D: Reference Host Applications¶
Reference host apps are examples of ownership boundaries. They should be complete enough to copy from, but they should not become required runtime paths for the library.
Track status: complete for #50.
Completion signals:
examples/hosts/batch-eval/runs clean-checkout eval and benchmark jobs with preserved.worldforge/batch-eval/runs/<run-id>/artifacts, copied benchmark inputs and budgets, and non-zero exits for budget violations.examples/hosts/service/exposes stdlib HTTP liveness, readiness, provider diagnostics, typed public error payloads, request-id correlation, and JSON provider-event logging without moving deployment or alerting into WorldForge.examples/hosts/robotics-operator/keeps operator review non-mutating by default, requires an explicit action translator, records approval/replay/evidence artifacts, and leaves controller execution behind a host-supplied hook.- Examples docs present all three host apps as optional references and repeat the host-owned boundaries for scheduling, durable storage, telemetry, credentials, controller integration, interlocks, and safety certification.
Host app packaging rules:
- Put reference hosts under
examples/hosts/<name>/. - Keep host-specific dependencies outside the base package.
- Prefer
uv run --with ...instructions or a dedicated optional extra only when the dependency set is broadly useful. - Include
.env.examplefragments only when they do not introduce real secrets. - Each host app must state whether it is checkout-safe, prepared-host, credentialed, GPU-bound, or robotics-lab-only.
Host app acceptance standard:
| Host type | Must demonstrate | Must not imply |
|---|---|---|
| Batch evaluation | CLI/job entrypoint, input fixtures, budgets, run artifacts, non-zero failure exits | hosted scheduling, long-term storage, empirical claims beyond artifacts |
| Service | readiness checks, typed errors, request IDs, logs/metrics hooks, timeout policy | WorldForge owning uptime, auth, routing, or dashboards |
| Robotics operator | dry-run review, policy/score evidence, action translator, approval record | automatic robot safety, controller ownership, certified execution |
WF-HOST-001: Batch Evaluation Host¶
Type: host application
Labels: examples, evaluation, operations
Depends on: WF-HARNESS-001, WF-PROV-004
Problem: Users need a production-shaped example for running provider evals and benchmarks as batch jobs with artifacts, budgets, and exit codes.
Scope:
- Add an optional example host under
examples/hosts/batch-eval/. - Use WorldForge APIs, benchmark input files, budget files, run manifests, and report exports.
- Keep dependencies optional and installable through
uv run --with ...or a documented extra. - Do not add a scheduler or queue to the base package.
Acceptance criteria:
- [x] Host can run
mockeval and benchmark jobs in a clean checkout. - [x] Host writes run workspace artifacts and exits non-zero on budget violations.
- [x] Docs explain how to swap in a real provider on a prepared host.
- [x] Package contract remains base-dependency clean.
Validation:
uv run pytest tests/test_batch_eval_host.py
bash scripts/test_package.sh
uv run mkdocs build --strict
WF-HOST-002: Service Host Reference¶
Type: host application
Labels: examples, operations, observability
Depends on: WF-OBS-001, WF-OBS-002, WF-OPS-002
Problem: Services embedding WorldForge need a reference for request IDs, readiness checks, provider calls, structured logs, metrics export, timeout budgets, and clean failure responses.
Scope:
- Add an optional service host example under
examples/hosts/service/. - Expose health/readiness, provider list, one safe mock workflow, and one configurable provider workflow.
- Keep the web framework optional and outside the base package.
- Demonstrate redacted logs, request IDs, and provider event correlation.
Acceptance criteria:
- [x] Service host runs with only optional example dependencies.
- [x] Health/readiness distinguish framework alive, provider configured, and provider healthy.
- [x] Provider errors return typed public error payloads without internal secrets.
- [x] Docs state this is a reference host, not the WorldForge product boundary.
Validation:
WF-HOST-003: Robotics Lab Operator Host¶
Type: host application
Labels: examples, robotics, operations, safety
Depends on: WF-LEROBOT-002, WF-GROOT-001, WF-HARNESS-003
Problem: Robotics users need an example that composes policy, score, replay, and operator approval without implying WorldForge controls robot hardware.
Scope:
- Add an optional host example for offline operator review of policy+score runs.
- Require explicit action translator, safety checklist, and dry-run approval before any host-owned controller integration point.
- Export selected action chunks, score rationale, event stream, and replay artifact.
- Leave real robot execution as an integration hook with documented safety ownership.
Acceptance criteria:
- [x] The default mode is non-mutating and does not talk to robot controllers.
- [x] Controller execution hook is disabled unless the host supplies an explicit implementation.
- [x] Operator approval and dry-run artifacts are recorded.
- [x] Docs state what WorldForge does and does not certify.
Validation:
uv run pytest tests/test_robotics_operator_host.py tests/test_robotics_showcase.py
uv run mkdocs build --strict
Workstream E: Observability, Monitoring, And Logging¶
Observability work must keep the core callback model simple while giving production hosts clean integration points.
Track status: complete for #51.
Completion signals:
ProviderEventcarries sanitized correlation fields and normalized phases for run, request, trace, span, artifact, and input-digest joins.JsonLoggerSink,RunJsonLogSink,ProviderMetricsSink,ProviderMetricsExporterSink,OpenTelemetryProviderEventSink, andInMemoryRecorderSinkshare the same event model without adding base runtime dependencies.- Run logs and manifests preserve issue-safe provider evidence under
.worldforge/runs/<run-id>/. - The service host exposes process liveness separately from
ready,provider_unconfigured, andprovider_unhealthyreadiness states. - Operations docs and playbooks map
worldforge doctor, provider health, provider events, and incident runbooks to host-owned escalation paths.
Telemetry layering:
ProviderEvent
-> in-process sinks
-> optional run log / run manifest
-> optional exporter adapters
-> host collector / dashboard / alerting
Required semantics:
| Signal | Source | Notes |
|---|---|---|
| request attempt | provider event per upstream call or model-boundary call | Retries are attempts, not separate logical user workflows. |
| retry | provider event with phase=retry |
Include attempt and sanitized target when available. |
| failure | provider event with phase=failure plus typed exception |
Preserve error family without raw secret payloads. |
| latency | event duration_ms |
Use histograms for exported metrics; do not compute from wall-clock UI render times. |
| run status | run manifest | Values should include passed, failed, skipped, cancelled, and not configured. |
| readiness | host app or CLI preflight | Distinguish configured, dependency available, upstream reachable, and workflow-ready. |
Exporter rules:
- Exporters are optional adapters over existing events; they must not change provider execution.
- Metrics labels must be bounded. Use provider, operation, capability, phase, and status class.
- Do not use prompts, targets with query strings, world IDs, run IDs, object IDs, or metadata keys as metrics labels.
- Trace/span attributes must be sanitized before export.
- Log files and issue bundles must be safe by default; local-only artifacts need explicit marking.
WF-OBS-001: Provider Event Schema v2¶
Type: observability
Labels: observability, provider, security
Depends on: WF-PROV-004
Problem: ProviderEvent is already useful, but production hosts need stable correlation fields,
run IDs, request IDs, normalized phases, and stronger schema documentation.
Scope:
- Add optional
run_id,request_id,trace_id,span_id,artifact_id, andinput_digestfields if they can remain JSON-native and redacted. - Preserve backward compatibility where possible.
- Document event phases and field semantics.
- Expand redaction tests for new fields.
Acceptance criteria:
- [x] Existing sinks keep working or have documented migration behavior.
- [x] Event fields are JSON-native and sanitized before sink consumption.
- [x] Provider events can be correlated with run manifests.
- [x] Docs include sample JSON log records.
Validation:
uv run pytest tests/test_observability.py tests/test_remote_video_providers.py
uv run mkdocs build --strict
WF-OBS-002: Optional OpenTelemetry Exporter¶
Type: observability
Labels: observability, operations, optional-dependency
Depends on: WF-OBS-001
Problem: Production hosts often use OpenTelemetry, but WorldForge should not force it into the base package.
Scope:
- Add an optional exporter module or extra for mapping provider events to spans and attributes.
- Keep exporter disabled by default.
- Document semantic attribute names, redaction guarantees, and host responsibility for collector configuration.
- Tests should not require a real collector.
Acceptance criteria:
- [x] Importing
worldforgedoes not import OpenTelemetry. - [x] Exporter maps provider, operation, phase, status, duration, attempt, and sanitized target.
- [x] Secret-like metadata is redacted before span attributes are created.
- [x] Docs show minimal host wiring.
Validation:
uv run pytest tests/test_observability_opentelemetry.py tests/test_observability.py
bash scripts/test_package.sh
uv run mkdocs build --strict
WF-OBS-003: Optional Metrics Exporter¶
Type: observability
Labels: observability, operations, optional-dependency
Depends on: WF-OBS-001
Problem: ProviderMetricsSink aggregates in memory. Hosts need an optional bridge to standard
metrics systems without changing core event semantics.
Scope:
- Add optional exporter hooks for counters, histograms, retries, errors, and latency.
- Keep label cardinality bounded: provider, operation, phase/status class, and capability only.
- Document why raw target URLs, prompts, metadata keys, or world IDs must not be metrics labels.
- Provide tests with an in-memory registry or fake exporter.
Acceptance criteria:
- [x] Metrics bridge is optional and has bounded labels.
- [x] Retry events increment retry metrics distinctly from logical operation count.
- [x] Docs explain metric meanings and alert examples.
- [x] Base package dependency set remains unchanged.
Validation:
uv run pytest tests/test_observability_metrics_export.py tests/test_observability.py
uv run mkdocs build --strict
WF-OBS-004: Logging Configuration And Run Logs¶
Type: observability
Labels: observability, operations, logging
Depends on: WF-OBS-001, WF-HARNESS-001
Problem: JsonLoggerSink exists, but host apps need a consistent recipe for file logs, JSON logs,
run-scoped logs, and operator-safe log exports.
Scope:
- Add a logging playbook for CLI, batch host, service host, and harness runs.
- Provide a helper for run-scoped JSON log files if it stays dependency-free.
- Include redaction tests for exported log files.
- Do not override host logging configuration globally.
Acceptance criteria:
- [x] Logs can be correlated to run manifests by
run_id. - [x] Exported logs contain no bearer tokens, API keys, signatures, or signed URL query strings.
- [x] Docs include first triage queries for provider failures and retries.
- [x] Host apps demonstrate logger injection rather than global logging side effects.
Validation:
WF-OBS-005: Health, Readiness, And Incident Runbooks¶
Type: operations
Labels: operations, observability, documentation
Depends on: WF-PROV-002, WF-HOST-002
Problem: Production hosts need a clear distinction between process liveness, provider configured, provider healthy, upstream degraded, and workflow failing.
Scope:
- Document a standard health/readiness model for host applications.
- Add reference host endpoints or commands that expose those states.
- Add incident runbooks for remote provider failures, optional runtime missing dependencies, artifact expiration, malformed world state, and benchmark budget failures.
- Keep on-call routing and alert channels host-owned.
Acceptance criteria:
- [x] Runbooks include symptom, likely cause, first command, expected signal, and escalation point.
- [x] Host reference app uses the model.
- [x] Existing
worldforge doctorand provider health outputs are mapped to readiness states. - [x] Docs avoid claiming WorldForge owns upstream SLAs.
Validation:
Workstream F: Reliability, Security, And Release Gates¶
Reliability work should make failures explicit and diagnosable. It should not hide instability by retrying indefinitely, weakening validation, or moving host responsibilities into the framework.
Reliability policy:
| Area | Default | Exception path |
|---|---|---|
| Remote create/mutation | single attempt | Host may configure retries only when idempotency is clear. |
| Health checks | cheap and bounded | Deep checks belong in explicit smoke commands. |
| Downloads | retry with bounded timeout where configured | Persist artifacts immediately when upstream URLs expire. |
| Optional runtime import | lazy and failure-explicit | Never import optional model packages from worldforge.__init__. |
| Live tests | opt-in markers | Default CI stays deterministic. |
| Coverage gate | keep current gate | Add focused tests for new branches instead of lowering thresholds. |
Completion signal for #52: this workstream is
implemented as baseline reliability, security, and release infrastructure. WF-OPS-001 added
per-operation provider budgets and budget-exceeded events. WF-OPS-002 hardened credential
configuration summaries and redaction expectations across safe artifacts. WF-OPS-003 recorded the
host-owned persistence adapter boundary in ADR 0001. WF-OPS-004 added the release evidence bundle
and release-gate checklist. Future work should start from a narrow follow-up issue when it changes a
specific provider, host, artifact, or release workflow.
WF-OPS-001: Provider Budget And Circuit-Breaker Policy¶
Type: operations
Labels: operations, provider, reliability
Depends on: WF-OBS-001
Problem: Remote and live model calls need explicit budgets for timeout, attempts, retry delay, max run duration, and failure thresholds.
Scope:
- Extend or document
ProviderRequestPolicyfor workflow-level budgets where appropriate. - Add host-owned circuit-breaker examples without forcing a service dependency.
- Ensure budget violations surface as typed provider/workflow errors and provider events.
- Keep create/mutation single-attempt defaults unless explicitly configured.
Acceptance criteria:
- [ ] Budgets can be set per provider operation in host examples.
- [ ] Budget failures are observable and testable.
- [ ] Docs distinguish request retry policy from workflow run budget.
- [ ] No silent retry loop can hide repeated provider failure.
Validation:
uv run pytest tests/test_provider_request_policy.py tests/test_remote_video_providers.py
uv run mkdocs build --strict
WF-OPS-002: Credential And Configuration Hardening¶
Type: security
Labels: security, operations, provider
Depends on: WF-PROV-002
Problem: Real providers increase the chance of leaking credentials through config, logs, manifests, events, docs, or issue attachments.
Scope:
- Audit env var loading and docs for every provider.
- Add config summaries that expose presence, source, and validation status without exposing values.
- Add secret-pattern tests for events, manifests, logs, reports, and docs examples.
- Keep
.env.exampletracked and.envfiles ignored.
Acceptance criteria:
- [ ] Every provider has documented env vars and redacted config summaries.
- [ ] Tests catch bearer/API/signature/password-like leaks in new observable surfaces.
- [ ] Issue docs explain what artifacts are safe to attach.
- [ ] Provider errors do not include raw credentials or signed URL query strings.
Validation:
uv run pytest tests/test_observability.py tests/test_provider_config.py tests/test_docs_site.py
uv run mkdocs build --strict
WF-OPS-003: Durable Persistence Adapter Design¶
Type: architecture
Labels: operations, persistence, design
Depends on: WF-HOST-001
Problem: Local JSON state is intentionally single-writer. Some hosts will need durable multi-writer persistence, but adding it to core without design would blur WorldForge's boundary.
Scope:
- Write an ADR for a persistence adapter interface before implementation.
- Define locking, migrations, backup/restore, retention, schema versioning, and failure recovery.
- Decide whether the first implementation belongs in core, an optional extra, or only reference host apps.
- Keep current local JSON behavior unchanged unless the ADR explicitly changes it.
Acceptance criteria:
- [ ] ADR names the adapter boundary and rejected alternatives.
- [ ] No database dependency is added before the ADR is accepted.
- [ ] Existing local JSON tests remain authoritative for default behavior.
- [ ] Host-owned responsibility remains clear in docs.
Validation:
WF-OPS-004: Release Evidence Bundle¶
Type: release
Labels: release, quality, provider, documentation
Depends on: WF-PROV-004, WF-HARNESS-004
Problem: Provider-heavy releases need evidence bundles that say what was tested locally, what live smokes ran, what was skipped, and which claims are supported.
Scope:
- Define a release evidence directory or generated Markdown report.
- Include validation commands, coverage, docs build, package contract, provider catalog drift, benchmark artifacts, live-smoke manifests, and known limitations.
- Keep live-provider evidence optional but explicit.
Acceptance criteria:
- [ ] Release evidence can be generated without credentials.
- [ ] Live-provider sections say skipped, passed, failed, or not configured.
- [ ] Evidence report links to preserved artifacts.
- [ ] Release checklist references the evidence bundle.
Validation:
uv lock --check
uv run ruff check src tests examples scripts
uv run ruff format --check src tests examples scripts
uv run python scripts/generate_provider_docs.py --check
uv run mkdocs build --strict
uv run pytest
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
bash scripts/test_package.sh
uv build --out-dir dist --clear --no-build-logs
Risk Register¶
Track these risks in issues and release evidence.
| Risk | Impact | Early warning | Mitigation |
|---|---|---|---|
| Capability inflation | Users trust a provider for work it cannot do | Provider docs use broad model labels instead of exact capability names | Promotion gate, generated catalog checks, conformance tests |
| Base dependency creep | Package becomes hard to install and test | Optional model package appears in dependencies |
Keep model/runtime packages in wrappers, extras, or host docs |
| Secret leakage | Logs, manifests, reports, or issues expose credentials | New event metadata includes targets, headers, env, or raw provider messages | Redaction tests for every observable surface |
| Live smoke fragility | Provider validation cannot be reproduced | Smoke command depends on undocumented local files | Runtime manifests, prepared-host commands, skip reasons |
| Harness becoming demo-only | UI looks useful but cannot support real triage | Failed runs do not preserve artifacts | Run workspace, failure states, export tests |
| Host boundary drift | WorldForge starts owning service or robot responsibilities | Docs promise uptime, controller safety, or durable storage | Host app boundary statements and ADRs before implementation |
| Benchmark misuse | Deterministic contract numbers become inflated claims | Reports lack input provenance or claim boundaries | Benchmark input fixtures, budgets, evidence bundles |
| Provider API churn | External provider changes break adapters silently | Fixtures only cover happy paths | Parser fixtures, version/runtime manifests, live smoke manifests |
| Coverage floor erosion | New guard branches lower quality silently | Harness coverage hovers near threshold | Add focused tests with every provider/harness branch |
Implementation Waves¶
Wave 0: Turn Roadmap Into Trackers¶
Goal: create the GitHub issue backbone.
Open one tracking issue per workstream plus the first ten implementation issues listed below. Each tracking issue should link child issues, dependency order, labels, and release evidence required.
Validation for this wave:
Wave 1: Provider Standards Before Provider Growth¶
Goal: prevent future provider work from becoming ad hoc.
Implement:
- WF-PROV-001: Provider Promotion Gate
- WF-PROV-002: Provider Runtime Manifest Schema
- WF-PROV-003: Provider Conformance Suite v2
Exit criteria:
- New provider issue template can reference promotion status, runtime manifest, and conformance helper expectations.
- Existing providers can be audited against the matrix without behavior changes.
- Default tests remain credential-free.
Wave 2: Evidence-Carrying Live Smokes¶
Goal: make live provider validation reproducible and issue-attachable.
Implement:
- WF-PROV-004: Live Smoke Artifact Contract
- WF-PROV-005: Optional Runtime Test Profiles
- WF-OBS-001: Provider Event Schema v2, if run correlation fields are needed first
Exit criteria:
- Optional live smokes can write
run_manifest.json. - Missing runtime, missing credentials, and unsupported host states skip with clear reasons.
- Sanitized evidence can be attached to a GitHub issue.
Wave 3: Promote Existing Real Providers¶
Goal: make the strongest current provider surfaces dependable before adding new provider names.
Implement:
- WF-LWM-001: LeWorldModel Stable Score Provider
- WF-LEROBOT-001: LeRobot Stable Policy Provider
- WF-COSMOS-001: Cosmos Generate Provider Production Hardening
- WF-RUNWAY-001: Runway Generate/Transfer Production Hardening
Exit criteria:
- Provider docs, generated catalog, conformance helpers, and smoke manifests agree.
- Every advertised capability has fixture-backed failure tests.
- No provider promotion adds heavy runtime packages to the base dependency set.
Wave 4: Robotics Composition¶
Goal: make policy+score workflows serious enough for host-owned robotics labs.
Implement:
- WF-LEROBOT-002: Embodiment Translator Contract
- WF-GROOT-001: GR00T PolicyClient Beta Provider
- WF-LWM-002: LeWorldModel Task Bridge Pack
- WF-HOST-003: Robotics Lab Operator Host, only after translator and run evidence contracts exist
Exit criteria:
- Policy raw actions, score candidates, translated
Actionvalues, selected plans, and replay artifacts are traceable through one run manifest. - Real robot execution remains disabled unless supplied by a host application.
Wave 5: Harness As Operations Workspace¶
Goal: make TheWorldHarness useful for provider development and local operations.
Implement:
- WF-HARNESS-001: Harness Run Workspace
- WF-HARNESS-002: Provider Connector Workspace
- WF-HARNESS-003: Live Run Inspector
- WF-HARNESS-004: Report Compare And Export
- WF-HARNESS-005: Provider Development Workbench
Exit criteria:
- Harness features call the same APIs and read the same run artifacts as CLI/host examples.
- Failed runs preserve enough data for issue filing.
- Textual remains optional and isolated.
Wave 6: Reference Hosts And Production Operations¶
Goal: give users copyable production shapes without turning WorldForge into a hosted platform.
Implement:
- WF-HOST-001: Batch Evaluation Host
- WF-HOST-002: Service Host Reference
- WF-OBS-002: Optional OpenTelemetry Exporter
- WF-OBS-003: Optional Metrics Exporter
- WF-OBS-004: Logging Configuration And Run Logs
- WF-OBS-005: Health, Readiness, And Incident Runbooks
- WF-OPS-001 through WF-OPS-004 as needed
Exit criteria:
- Host examples demonstrate request IDs, readiness, typed errors, budgets, sanitized logs, and artifact exports.
- Release evidence can say which optional providers were tested, skipped, or not configured.
Release Gate For Roadmap Milestones¶
Use this gate before marking a milestone complete:
| Gate | Required signal |
|---|---|
| Docs | uv run mkdocs build --strict passes with nav/SUMMARY synchronized. |
| Generated provider docs | uv run python scripts/generate_provider_docs.py --check passes when provider metadata changes. |
| Unit tests | Focused tests for changed modules pass. |
| Full checkout tests | uv run pytest passes for deterministic paths. |
| Harness coverage | uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90 passes when harness or optional-provider branches change. |
| Package contract | bash scripts/test_package.sh passes when package surface, optional imports, examples, or tests change. |
| Live evidence | Optional provider work includes smoke manifest or explicit skip reason. |
| Security | Observable surfaces have redaction coverage for new fields. |
First Issue Batch¶
Open these first; they unblock most later work.
| Order | Issue | Why first |
|---|---|---|
| 1 | WF-PROV-001: Provider Promotion Gate | Prevents noisy provider additions and status drift. |
| 2 | WF-PROV-002: Provider Runtime Manifest Schema | Creates a shared runtime contract for real providers and host apps. |
| 3 | WF-PROV-003: Provider Conformance Suite v2 | Gives every provider implementation a reusable quality bar. |
| 4 | WF-PROV-004: Live Smoke Artifact Contract | Makes live provider validation useful and debuggable. |
| 5 | WF-LWM-001: LeWorldModel Stable Score Provider | Highest-value existing real score path. |
| 6 | WF-LEROBOT-001: LeRobot Stable Policy Provider | Highest-value existing real policy path. |
| 7 | WF-LEROBOT-002: Embodiment Translator Contract | Required before serious robotics host applications. |
| 8 | WF-HARNESS-001: Harness Run Workspace | Common artifact substrate for harness, CLI, and host apps. |
| 9 | WF-OBS-001: Provider Event Schema v2 | Common observability substrate for monitoring and run manifests. |
| 10 | WF-PROVIDER-SELECT-001: Next Provider Selection RFC | Keeps the next provider batch evidence-driven. |
Deferred Until Design Exists¶
- Built-in multi-writer persistence.
- Hosted dashboard as a default product surface.
- Automatic robot-controller execution.
- Bundled torch/CUDA/robotics runtime dependencies.
- Claims that deterministic eval suites measure physical fidelity.
- Provider
plancapability for providers that only exposescoreorpolicy.
Issue Creation Checklist¶
Before opening an issue from this roadmap:
- [ ] Confirm the referenced files still exist.
- [ ] Copy the issue's scope, out-of-scope notes, acceptance criteria, and validation commands.
- [ ] Add dependencies in the issue body.
- [ ] Use existing templates where possible: provider adapter, eval/benchmark, docs, or bug.
- [ ] Attach only sanitized artifacts.
- [ ] If the issue changes public behavior, include docs and generated provider catalog updates in the acceptance criteria.