Operations¶
WorldForge is a Python library plus CLI. Operational responsibility lives in the host application that imports it. This page documents the runtime assumptions and minimum runbook for developers using WorldForge in services, jobs, or provider-evaluation pipelines.
For task-specific runbooks, use User And Operator Playbooks. That page covers clean checkout validation, provider availability, adapter promotion, persistence recovery, remote artifact handling, optional runtime smokes, benchmarks, and release gates.
Operational Modes¶
| Mode | Suitable use | Boundary |
|---|---|---|
| local development | examples, unit tests, adapter prototyping, deterministic demos | mock provider and local JSON state |
| provider evaluation job | fixture-backed provider checks, benchmarks, optional runtime smokes | host owns credentials, checkpoints, outputs, and run artifacts |
| embedded service/library use | application calls WorldForge APIs inside a larger system | host owns request IDs, telemetry export, persistence, retries around jobs, and alerts |
| real robot or simulator loop | host supplies policy observations and action translators | host owns safety interlocks, controller semantics, and embodiment-specific execution |
Minimum startup preflight for a host process:
uv run worldforge doctor --registered-only
uv run worldforge provider health
uv run worldforge provider info <provider> --format json
The Reference Host Deployment Recipes cover the stdlib service host, batch evaluation host, and robotics operator host with env templates, process commands, readiness checks, smoke commands, logging commands, evidence export commands, expected success signals, first triage or rollback steps, and owned boundaries. The recipes distinguish checkout-safe, prepared-host, credentialed, GPU-bound, and robotics-lab paths without moving deployment, auth, queueing, durable storage, controller integration, alerting, uptime, or safety certification into WorldForge.
Operator failure drills are available from worldforge drills. They rehearse missing credentials,
missing optional dependencies, malformed provider output, budget violations, corrupted local world
state, expired artifacts, and unsafe event metadata with mock providers or fixtures. Drill runs
write manifests under .worldforge/drills/runs/<run-id>/, can export an issue bundle with
--bundle, and keep generated state inside the requested temporary or documented workspace.
Health And Readiness¶
Host applications should expose liveness separately from readiness. Liveness answers whether the service process can handle an HTTP request. Readiness answers whether the specific provider-backed workflow should receive traffic.
The stdlib reference host in examples/hosts/service/app.py uses this model:
| State | Source | Meaning | Typical HTTP endpoint |
|---|---|---|---|
| process live | service handler returns {"status": "live"} | process and web stack are running | GET /healthz |
| framework alive | WorldForge(...) can be constructed and doctor() can run | library import, local state path, and provider registry are usable | GET /readyz |
| provider configured | provider appears in forge.providers() | required env vars or host injection registered the provider | GET /readyz |
| provider lifecycle ready | forge.provider_lifecycle_status(name).ready is true | provider-owned preflight is no-op or ready; skipped and failed hooks stay visible in diagnostics | GET /readyz |
| provider healthy | forge.provider_health(name).healthy is true | provider's cheap health check passed | GET /readyz |
| workflow failing | provider is configured and health may pass, but a workflow returns a typed error | request input, upstream response, budget, or artifact handling failed | workflow response body |
The reference host returns one of these readiness statuses from GET /readyz:
| /readyz status | Traffic decision | How to interpret it |
|---|---|---|
| ready | accept | framework is alive, the selected provider is registered, lifecycle preflight is ready or no-op, and provider health passed. |
| provider_unconfigured | drain | framework is alive, but the selected provider is not registered in this process. |
| provider_unhealthy | drain | provider is registered, but its health check reports missing optional runtime, bad credentials, unreachable upstream, or another provider-owned failure detail. |
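The readiness table can be read as a small decision function. The sketch below is illustrative only: readiness_status and traffic_decision are hypothetical helper names, not WorldForge APIs, and their boolean inputs stand in for the registry, lifecycle, and health checks a host would query. Reporting a non-ready lifecycle preflight as provider_unhealthy is an assumption, because the table does not say how the reference host labels that case.

```python
"""Sketch: derive a /readyz status and traffic decision from provider checks."""

# Status -> traffic decision, as listed in the readiness table.
TRAFFIC_DECISION = {
    "ready": "accept",
    "provider_unconfigured": "drain",
    "provider_unhealthy": "drain",
}


def readiness_status(registered: bool, lifecycle_ready: bool, healthy: bool) -> str:
    """Return the readiness status for one selected provider."""
    if not registered:
        return "provider_unconfigured"
    # Assumption: a lifecycle preflight that is neither ready nor no-op is
    # reported the same way as a failed health check.
    if not (lifecycle_ready and healthy):
        return "provider_unhealthy"
    return "ready"


def traffic_decision(status: str) -> str:
    """Map a readiness status to accept/drain; unknown statuses drain."""
    return TRAFFIC_DECISION.get(status, "drain")
```

Draining on any unknown status is a conservative default: a host that later adds a new status keeps traffic away until the mapping is updated.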
Map CLI diagnostics the same way during incidents:
| Command | Readiness signal |
|---|---|
| uv run worldforge doctor --registered-only | registered provider count, health count, and local configuration issues. |
| uv run worldforge doctor --capability <capability> | whether any known provider can satisfy the requested surface. |
| uv run worldforge provider health <name> | provider-specific configured/healthy details. |
| uv run worldforge provider info <name> | redacted config summary plus profile, capability, lifecycle, and health. |
Lifecycle diagnostics use typed hook statuses: no-op, ready, skipped, failed, and
teardown-failed. skipped is the expected result when a prepared-host provider is missing
required env vars or host-owned optional dependencies; it is a skip reason, not a hidden install or
credential-provisioning attempt.
WorldForge reports local provider state and adapter errors. It does not own upstream provider SLAs, deployment load balancers, alert channels, retry orchestration outside one provider call, or credential rotation.
Configuration¶
Configuration comes from constructor arguments and environment variables documented in
.env.example. The generated Provider Configuration Index
is the canonical cross-provider table for required inputs, optional inputs, host-owned packages,
prepared-host assets, default request timeouts, first diagnostic commands, and smoke commands.
- COSMOS_POLICY_BASE_URL enables the optional Cosmos-Policy embodied-policy adapter.
- COSMOS_POLICY_API_TOKEN, COSMOS_POLICY_TIMEOUT_SECONDS, COSMOS_POLICY_EMBODIMENT_TAG, COSMOS_POLICY_MODEL, COSMOS_POLICY_RETURN_ALL_QUERY_RESULTS, COSMOS_POLICY_ALLOW_LOCAL_BASE_URL, and COSMOS_POLICY_ALLOWED_HOSTS are optional Cosmos-Policy /act settings.
- LEWORLDMODEL_POLICY or LEWM_POLICY enables the optional LeWorldModel adapter.
- LEWORLDMODEL_CACHE_DIR overrides the LeWorldModel checkpoint root.
- LEWORLDMODEL_REVISION pins the Hugging Face LeWM commit used when the showcase auto-builds a missing object checkpoint.
- LEWORLDMODEL_ASSET_CACHE_DIR overrides the checkpoint builder's Hugging Face config/weights cache directory.
- LEWORLDMODEL_DEVICE selects the optional torch device for LeWorldModel scoring.
- GROOT_POLICY_HOST enables the optional GR00T embodied-policy adapter.
- GROOT_POLICY_PORT defaults to 5555.
- GROOT_POLICY_TIMEOUT_MS defaults to 15000.
- GROOT_POLICY_API_TOKEN, GROOT_POLICY_STRICT, and GROOT_EMBODIMENT_TAG are optional GR00T PolicyClient settings.
- LEROBOT_POLICY_PATH or LEROBOT_POLICY enables the optional LeRobot embodied-policy adapter.
- LEROBOT_POLICY_TYPE, LEROBOT_DEVICE, LEROBOT_CACHE_DIR, and LEROBOT_EMBODIMENT_TAG are optional LeRobot settings.
- JEPA_MODEL_NAME enables the experimental score-only JEPA adapter backed by the upstream facebookresearch/jepa-wms torch-hub path.
- JEPA_MODEL_PATH is legacy scaffold metadata only; it does not make the provider runnable.
- GENIE_API_KEY only registers a capability-closed scaffold reservation.
- WORLDFORGE_ENABLE_SCAFFOLD_SURROGATES=1 is for local scaffold adapter tests only; it does not make Genie a real provider integration.
Validate configuration when the host process starts:
Runtime Asset Manifests And Cache Policy¶
Prepared-host optional runtimes often depend on checkpoint files, policy repos, object files, server-side model assets, and cache directories. WorldForge records those as runtime asset manifests and safe run-manifest references; it does not download, retain, or upload the assets.
Runtime asset manifests use RUNTIME_ASSET_MANIFEST_SCHEMA_VERSION. Full local manifests may record
path, cache_root, source, revision, checksum, size_bytes, local_only, exists, and
rebuild_command. Run manifests only include runtime_assets as safe-to-attach references. For
local_only: true assets, references omit path and cache_root so host-local checkpoint and
cache locations do not leave the machine.
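The local_only rule can be illustrated with a few lines of plain Python. This is a minimal sketch, not the library's implementation: to_safe_reference is a hypothetical helper, the example field values are invented, and passing non-local assets through unchanged is an assumption (the real reference schema may trim more fields).

```python
"""Sketch: turn a full local runtime asset manifest into a safe reference."""

# Fields that reveal host-local locations, as named in the text above.
LOCAL_ONLY_HIDDEN_FIELDS = ("path", "cache_root")


def to_safe_reference(asset: dict) -> dict:
    """Return a copy of ``asset`` that is safe to place in a run manifest.

    Assumption: assets without ``local_only: true`` pass through unchanged.
    """
    reference = dict(asset)  # never mutate the full local manifest
    if reference.get("local_only") is True:
        for field in LOCAL_ONLY_HIDDEN_FIELDS:
            reference.pop(field, None)
    return reference


# Illustrative values only; they are not taken from a real manifest.
full_manifest = {
    "path": "/home/operator/.stable-wm/pusht/lewm_object.ckpt",
    "cache_root": "/home/operator/.stable-wm",
    "revision": "<pinned-sha>",
    "local_only": True,
    "exists": True,
}
print(to_safe_reference(full_manifest))
```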
| Runtime family | Typical assets | Cache policy | Rebuild or triage command | Evidence boundary |
|---|---|---|---|---|
| LeWorldModel | *_object.ckpt, Hugging Face config.json and weights.pt, STABLEWM_HOME cache | Pin LEWORLDMODEL_REVISION; keep builder downloads under LEWORLDMODEL_ASSET_CACHE_DIR; keep object checkpoints under STABLEWM_HOME or LEWORLDMODEL_CACHE_DIR. | worldforge-build-leworldmodel-checkpoint --policy pusht/lewm --revision <pinned-sha> | Run manifests cite the checkpoint manifest reference and rebuild command, never checkpoint bytes or host-local cache roots. |
| LeRobot | policy repo id or local checkpoint directory, Hugging Face cache, embodiment translator | Keep LEROBOT_CACHE_DIR host-owned and aligned with the policy version used for the smoke. | scripts/smoke_lerobot_policy.py --policy-path <repo-or-checkpoint> --device cpu | Run manifests cite a policy-checkpoint reference; raw policy weights and local policy paths stay local-only. |
| GR00T | remote policy server, model checkpoint, CUDA/TensorRT runtime, embodiment assets | Keep server-side caches on the GPU host; WorldForge should connect to the server rather than sync checkpoint directories. | uv run python scripts/smoke_gr00t_policy.py --health-only --run-manifest <path> | Evidence records server reachability and provider events, not checkpoint files, GPU logs, or robot-controller state. |
| Cosmos-Policy | ALOHA /act server, Docker/CUDA runtime, model checkpoints, observation builder, translator | Keep Docker layers, checkpoints, and gated-model tokens on the prepared GPU host. | uv run worldforge-smoke-cosmos-policy --health-only --run-manifest <path> | Evidence records /act configuration and sanitized run status; checkpoints, raw observations, and tokens remain host-owned. |
| Future provider candidates | candidate-specific checkpoints, fixtures, caches, and server assets | Add a runtime asset manifest before promoting live smoke evidence. | Document the rebuild or reacquisition command in the provider page. | Attach references and checksums when available; do not attach local-only paths or generated assets. |
Cleanup is also host-owned. Use normal cache tools for the runtime in question, then rerun the
provider health command and the smoke with --run-manifest to preserve fresh evidence. If a cache
path appears in a JSON artifact, treat it as a bug: run manifests should contain runtime_assets
references, not local cache paths.
Non-Secret Configuration Profiles¶
Configuration profiles are optional JSON or TOML files for repeatable non-secret CLI defaults. They are intended for provider choices, operation lists, output format, state directories, run-workspace directories, timeout/retry preset names, and safe relative optional-runtime cache roots.
{
  "schema_version": 1,
  "name": "local-mock",
  "providers": ["mock"],
  "operations": ["predict"],
  "run_workspace": ".worldforge/profiled-runs",
  "state_dir": ".worldforge/worlds",
  "output_format": "json",
  "timeout_preset": "checkout-safe",
  "retry_preset": "none",
  "runtime_cache_roots": {
    "leworldmodel": ".worldforge/cache/leworldmodel"
  }
}
Use a profile only as an explicit CLI opt-in:
uv run worldforge benchmark --profile profiles/local-mock.json --iterations 1
uv run worldforge eval --profile profiles/local-mock.toml --suite planning
Profiles must not contain credentials, bearer tokens, API keys, signed URLs, .env paths, absolute
host-local paths, or .. traversal. Keep secrets in environment variables or a host-owned secret
store. Preserved eval and benchmark run manifests include a config_profile provenance block with
the profile name, safe relative source label, SHA-256 digest, and validated non-secret defaults; the
profile file itself is not copied into run evidence.
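A host that generates or reviews profiles may want a pre-check before handing them to the CLI. The sketch below is an assumption-laden illustration, not WorldForge's validator: find_profile_problems, is_safe_relative_path, profile_digest, the secret-key hint list, and the canonical JSON form used for the digest are all hypothetical.

```python
"""Sketch: reject unsafe values in a non-secret configuration profile."""
import hashlib
import json
from pathlib import PurePosixPath, PureWindowsPath

# Assumption: key-name fragments that suggest a secret; not WorldForge's list.
SECRET_KEY_HINTS = ("token", "api_key", "password", "secret", "credential")


def is_safe_relative_path(value: str) -> bool:
    """True for relative paths with no '..' segment and no .env component."""
    if PurePosixPath(value).is_absolute() or PureWindowsPath(value).is_absolute():
        return False
    parts = PurePosixPath(value.replace("\\", "/")).parts
    return ".." not in parts and ".env" not in parts


def find_profile_problems(profile: dict, path_keys=("run_workspace", "state_dir")) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for key in profile:
        if any(hint in key.lower() for hint in SECRET_KEY_HINTS):
            problems.append(f"secret-like key: {key}")
    paths = {key: profile[key] for key in path_keys if key in profile}
    paths.update(profile.get("runtime_cache_roots", {}))
    for key, value in paths.items():
        if not is_safe_relative_path(str(value)):
            problems.append(f"unsafe path for {key}: {value}")
    return problems


def profile_digest(profile: dict) -> str:
    """SHA-256 over a canonical JSON encoding (an assumed canonical form)."""
    encoded = json.dumps(profile, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(encoded).hexdigest()
```

Key-name matching only catches obvious mistakes; it does not inspect values, so secrets still belong in environment variables or a secret store.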
World State¶
WorldForge no longer ships a symbolic World runtime or a built-in JSON world store. Planning runs
over plain, JSON-serializable world-state dicts: a predict provider rolls a world_state dict
forward one action at a time, and LatentMPCController plans by scoring action candidates with a
score provider over that state.
from worldforge import Action, WorldForge
forge = WorldForge()
world_state = {"step": 0, "scene": {"objects": {}}}
payload = forge.predict(world_state, Action.move_to(0.3, 0.8, 0.0), steps=2, provider="mock")
next_state = payload.state
Persistence is explicitly host-owned. The reason is boundary clarity: host applications own deployment topology, durability, locking semantics, backup policy, and retention requirements. WorldForge should not imply production durability guarantees. Services that need durable world state should serialize these plain dicts into their own database and apply their own locking, backup, and retention policy.
ADR 0001, Persistence Adapter Boundary, records the
future WorldPersistenceAdapter boundary and the acceptance bar for any durable store.
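As one illustration of host-owned persistence, the sketch below stores plain world-state dicts in SQLite with an optimistic version check. Everything here is hypothetical host code: the table layout, the function names, and the choice of SQLite are assumptions, and this is not the future WorldPersistenceAdapter.

```python
"""Sketch: host-owned persistence of plain world-state dicts in SQLite."""
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) host table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS world_states ("
        "world_id TEXT PRIMARY KEY, version INTEGER NOT NULL, state_json TEXT NOT NULL)"
    )
    return conn


def save_state(conn, world_id: str, state: dict, expected_version: int) -> int:
    """Optimistic-locking write: fails if another writer bumped the version."""
    payload = json.dumps(state, sort_keys=True)  # TypeError if not JSON-serializable
    with conn:  # commit on success, roll back on error
        if expected_version == 0:
            conn.execute("INSERT INTO world_states VALUES (?, 1, ?)", (world_id, payload))
            return 1
        cursor = conn.execute(
            "UPDATE world_states SET version = version + 1, state_json = ? "
            "WHERE world_id = ? AND version = ?",
            (payload, world_id, expected_version),
        )
        if cursor.rowcount != 1:
            raise RuntimeError(f"stale write for {world_id!r}")
        return expected_version + 1


def load_state(conn, world_id: str):
    """Return (state, version) or raise KeyError for an unknown world."""
    row = conn.execute(
        "SELECT state_json, version FROM world_states WHERE world_id = ?", (world_id,)
    ).fetchone()
    if row is None:
        raise KeyError(world_id)
    return json.loads(row[0]), row[1]
```

The version column is one simple way to get the locking semantics the host owns; backup and retention would sit on top of whatever database the service already runs.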
Supported invariants:
- Public inputs are validated before any outbound provider call: invalid actions, non-positive step counts, and malformed candidate payloads raise WorldForgeError.
- Malformed persisted or provider world-state payloads raise WorldStateError at the boundary rather than being silently coerced.
- Any future built-in persistence backend must be introduced as an explicit adapter with its own locking, migration, and recovery documentation.
Run Workspaces¶
Evaluation, benchmark, and harness jobs can preserve checkout-safe run evidence under
.worldforge/runs/<run-id>/. This is separate from host-owned world-state persistence: run workspaces are
operator evidence bundles, not a database.
uv run worldforge eval --suite planning --provider mock --run-workspace .worldforge
uv run worldforge benchmark --provider mock --operation predict --run-workspace .worldforge
uv run worldforge runs list
uv run worldforge runs index --provider mock --status failed --artifact-type json
uv run worldforge runs bundle <run-id>
uv run worldforge runs cleanup --keep 20
Each run directory contains run_manifest.json, inputs/, results/, reports/, artifacts/,
and logs/. The manifest stores a sortable file-safe run ID, command, provider, operation, status,
input summary, result summary, event count, and relative artifact paths.
For public issues, run worldforge runs bundle <run-id> --workspace-dir .worldforge first. The
command writes .worldforge/issue-bundles/<run-id>/evidence_manifest.json, summary.md, and
issue.md, then prints the issue template. Success signal: safe_to_attach is true or the
manifest clearly lists excluded/local-only files with a reason. First triage step after export:
open evidence_manifest.json; if anything is excluded or local-only, remove or replace the unsafe
artifact before attaching the bundle.
For repeated local operations, use worldforge runs list before opening individual artifacts. It
reads preserved manifests without optional model runtimes, filters by provider, capability, status,
created date, and safe artifact type, and prints sanitized rerun, comparison, and issue-bundle
commands. Failed, skipped, and cancelled rows surface the worldforge runs bundle <run-id>
recovery command first.
Retention is host-owned. worldforge runs cleanup --keep <n> keeps the newest run IDs and removes
older directories; use --dry-run before deleting evidence attached to an incident or release gate.
Do not attach raw host-created artifacts that contain private paths, prompts, credentials, signed
URLs, or provider-native payloads.
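The keep-newest rule is easy to preview before any directory is deleted. The following is a minimal sketch of that selection, not the CLI's implementation: select_runs_to_remove is a hypothetical helper, and it assumes run IDs sort chronologically as plain strings, in line with the sortable file-safe run ID described above.

```python
"""Sketch: choose which run directories a keep-newest retention pass removes."""


def select_runs_to_remove(run_ids: list, keep: int) -> list:
    """Return the run IDs that fall outside the newest ``keep`` entries."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(run_ids, reverse=True)
    return sorted(newest_first[keep:])


run_ids = [
    "20260430T120000Z-batch-eval",
    "20260428T090000Z-batch-eval",
    "20260429T101500Z-batch-eval",
]
# Dry run: report what would be deleted before touching any directory.
print(select_runs_to_remove(run_ids, keep=2))  # ['20260428T090000Z-batch-eval']
```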
Candidate benchmark budgets must be generated from preserved benchmark reports and reviewed before they replace release budget files:
uv run python scripts/calibrate_benchmark_budgets.py \
  --report .worldforge/reports/benchmark-<timestamp>-<run-id>.json \
  --current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json
The success signal is a budget-calibration.md review report plus a loadable
candidate-budgets.json; the command does not weaken existing release gates automatically.
Threshold loosening is allowed only with preserved report digests, machine-class context, observed
baseline values, and reviewer rationale.
Observability¶
Attach a provider event handler at WorldForge(event_handler=...) or provider construction time.
Use compose_event_handlers(...) to fan out events to:
- JsonLoggerSink for structured JSON logs.
- RunJsonLogSink for newline-delimited JSON files tied to one run id.
- ProviderMetricsSink for request, retry, error, and latency aggregates.
- ProviderMetricsExporterSink for optional host-owned counters and latency histograms.
- OpenTelemetryProviderEventSink for optional host-owned tracing spans.
- InMemoryRecorderSink for tests and local debugging.
- RerunEventSink for optional Rerun recordings of provider events.
ProviderEvent sanitizes observable fields before they reach these sinks: HTTP targets keep
scheme, host, port, and path but drop userinfo, query strings, and fragments; message and metadata
fields redact obvious bearer tokens, API keys, signatures, passwords, and signed URLs. Host
applications should still avoid placing raw credentials in provider exception messages or custom
metadata.
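Hosts that build their own metadata can apply the same target rule before a value ever reaches an event. The sketch below shows the described behavior with the standard library; sanitize_target is a hypothetical helper and not the function ProviderEvent uses internally.

```python
"""Sketch: sanitize an HTTP target the way the text describes."""
from urllib.parse import urlsplit, urlunsplit


def sanitize_target(url: str) -> str:
    """Keep scheme, host, port, and path; drop userinfo, query, and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


print(sanitize_target("https://user:secret@policy.example.test:8443/act?sig=abc#frag"))
# https://policy.example.test:8443/act
```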
Composed workflows can also emit WorkflowTrace artifacts. A trace is JSON-native,
schema-versioned, and safe to attach by default; it records step IDs, operations,
provider/capability slots, input/output artifact references, status, optional duration, sanitized
error summaries, and parent-child relationships. The top-level trace status is derived from the
step statuses and cannot contradict failed, running, pending, or skipped steps. Planning stores a trace under
Plan.metadata["workflow_trace"]; evaluation reports export workflow_trace.json and
workflow_trace.md; RerunArtifactLogger.log_workflow_trace(...) can add the same trace to an
optional Rerun recording. Traces do not capture raw prompts, tensors, credentials, controller
telemetry, or distributed tracing backend state.
Host services can attach correlation IDs directly to a ProviderEvent when the provider adapter
knows them, or through JsonLoggerSink(extra_fields=...) when the host owns them outside the
adapter. Optional event fields are run_id, request_id, trace_id, span_id, artifact_id,
and input_digest; they are strings, omitted when unset, and sanitized before sink consumption.
The event phase is normalized to lowercase so hosts can filter stable success, failure,
retry, and budget_exceeded values.
OpenTelemetry export is optional. Importing worldforge does not import OpenTelemetry, and the
base package does not install a collector, SDK, or exporter. Production hosts either install
opentelemetry-api and let OpenTelemetryProviderEventSink() resolve the current tracer lazily, or
inject their already configured tracer:
from worldforge import WorldForge
from worldforge.observability import OpenTelemetryProviderEventSink

host_tracer = ...  # the host's already configured tracer

forge = WorldForge(
    event_handler=OpenTelemetryProviderEventSink(
        tracer=host_tracer,
        extra_attributes={"service": "batch-eval"},
    )
)
Each provider event becomes one span named
worldforge.provider.<provider>.<operation>.<phase>. Span attributes are bounded to provider,
operation, phase, attempt, max attempts, optional duration, optional correlation IDs, HTTP method,
HTTP status code, sanitized target, status class, capability, redacted message, and redacted
metadata JSON. Hosts should not add raw prompts, world IDs, target URLs with query strings, or
high-cardinality business metadata as trace attributes.
Metrics export is also optional and dependency-free. ProviderMetricsExporterSink accepts any
host exporter with increment_counter(...) and observe_histogram(...) methods, so production
services can bridge provider events to Prometheus, OpenTelemetry Metrics, StatsD, or an internal
collector without adding dependencies to the base package.
from worldforge import WorldForge
from worldforge.observability import ProviderMetricsExporterSink, compose_event_handlers

host_metrics_exporter = ...  # supplied by your service

forge = WorldForge(
    event_handler=compose_event_handlers(
        ProviderMetricsExporterSink(host_metrics_exporter),
    )
)
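For local experiments, the host exporter can be as small as an in-memory object exposing the two method names mentioned above. This is a sketch under a loud assumption: the (name, value, labels) call shape is illustrative, and the arguments ProviderMetricsExporterSink really passes should be checked against its documentation. InMemoryMetricsExporter is a hypothetical class.

```python
"""Sketch: a dependency-free host exporter with the two methods named above.

Assumption: the (name, value, labels) call shape is illustrative only.
"""
from collections import defaultdict
from typing import Optional


class InMemoryMetricsExporter:
    """Collects counters and histogram observations in plain dicts."""

    def __init__(self) -> None:
        self.counters = defaultdict(float)
        self.histograms = defaultdict(list)

    @staticmethod
    def _key(name: str, labels: Optional[dict]) -> tuple:
        # Sort labels so the same label set always maps to the same series.
        return (name, tuple(sorted((labels or {}).items())))

    def increment_counter(self, name: str, value: float = 1.0, labels: Optional[dict] = None) -> None:
        self.counters[self._key(name, labels)] += value

    def observe_histogram(self, name: str, value: float, labels: Optional[dict] = None) -> None:
        self.histograms[self._key(name, labels)].append(value)


exporter = InMemoryMetricsExporter()
labels = {"provider": "mock", "operation": "predict", "phase": "success"}
exporter.increment_counter("worldforge_provider_events_total", labels=labels)
exporter.observe_histogram("worldforge_provider_latency_ms", 812.4, labels=labels)
```

A production bridge would forward the same two calls to Prometheus, StatsD, or another collector instead of storing values in memory.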
The sink emits:
| Metric | Meaning |
|---|---|
| worldforge_provider_events_total | Every provider event, including retries. |
| worldforge_provider_operations_total | Logical non-retry outcomes such as success, failure, and budget_exceeded. |
| worldforge_provider_retries_total | Retry events only, separate from logical operation totals. |
| worldforge_provider_errors_total | Failed or budget-exceeded operation outcomes. |
| worldforge_provider_latency_ms | Event duration_ms values when providers include them. |
Labels are bounded to provider, operation, phase, status_class, and capability.
capability is exported only when it matches a known WorldForge capability; otherwise it becomes
unknown. Do not add raw target URLs, prompts, metadata keys, world IDs, artifact IDs, request IDs,
or user/business identifiers as metric labels. Those values have high cardinality, and some can
carry secrets. Good first alerts are retry-rate or error-rate thresholds by provider/operation, and
latency percentile alerts on worldforge_provider_latency_ms grouped by provider/operation.
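A host that adds its own metrics next to these can enforce the same label bound with an allowlist. The sketch below is illustrative: bounded_labels is a hypothetical helper, and KNOWN_CAPABILITIES is a partial, assumed set built from capabilities named on this page, not the library's authoritative list.

```python
"""Sketch: keep metric labels bounded before handing them to an exporter."""

ALLOWED_LABELS = ("provider", "operation", "phase", "status_class", "capability")
# Assumption: a partial, illustrative capability set; use the library's own list.
KNOWN_CAPABILITIES = frozenset({"predict", "score", "policy"})


def bounded_labels(raw: dict) -> dict:
    """Drop every label outside the allowlist and normalize capability."""
    labels = {key: str(raw[key]) for key in ALLOWED_LABELS if key in raw}
    if "capability" in labels and labels["capability"] not in KNOWN_CAPABILITIES:
        labels["capability"] = "unknown"
    return labels
```

Dropping unknown keys rather than rejecting the event keeps metrics flowing while still preventing request IDs or URLs from creating unbounded series.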
Example JSON log record:
{
  "artifact_id": "artifact-local-id",
  "attempt": 1,
  "duration_ms": 812.4,
  "event_type": "provider_event",
  "input_digest": "sha256:9fd7...",
  "max_attempts": 3,
  "message": "",
  "metadata": {"status": "submitted"},
  "method": "POST",
  "operation": "task create",
  "phase": "success",
  "provider": "cosmos-policy",
  "request_id": "host-request-id",
  "run_id": "20260430T120000Z-batch-eval",
  "span_id": "span-456",
  "status_code": 200,
  "target": "https://policy.example.test/act",
  "trace_id": "trace-123"
}
For batch jobs, harness runs, and release evidence, attach a file sink owned by the host process:
from pathlib import Path

from worldforge import WorldForge
from worldforge.observability import JsonLoggerSink, RunJsonLogSink, compose_event_handlers

run_id = "20260430T120000Z-batch-eval"

forge = WorldForge(
    event_handler=compose_event_handlers(
        JsonLoggerSink(extra_fields={"run_id": run_id}),
        RunJsonLogSink(
            Path(".worldforge") / "runs" / run_id / "provider-events.jsonl",
            run_id=run_id,
            extra_fields={"host": "batch-eval"},
        ),
    )
)
The file sink creates the parent directory and appends one JSON object per provider event. Its
configured run_id wins over any run_id supplied by extra fields or adapter events so every line
in the file joins to the same host run manifest. Operator bundles can then correlate
run_manifest.json, provider-events.jsonl, benchmark reports, and preserved artifacts without
relying on timestamps. Extra fields are validated as JSON and redacted with the same observable
secret rules as provider event messages and metadata.
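The run_id precedence and append-per-event behavior can be mimicked in a few lines, which is handy for tests that should not depend on the real sink. JsonlRunLog below is a hypothetical stand-in, not WorldForge code; it treats events as plain dicts and skips the validation and redaction the real sink applies.

```python
"""Sketch: append-only JSONL event file whose configured run_id always wins."""
import json
import tempfile
from pathlib import Path
from typing import Optional


class JsonlRunLog:
    """Hypothetical stand-in for a per-run file sink; events are plain dicts."""

    def __init__(self, path: Path, run_id: str, extra_fields: Optional[dict] = None) -> None:
        self.path = Path(path)
        self.run_id = run_id
        self.extra_fields = dict(extra_fields or {})
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def __call__(self, event: dict) -> None:
        # Later keys override earlier ones, so the configured run_id goes last.
        record = {**self.extra_fields, **event, "run_id": self.run_id}
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


with tempfile.TemporaryDirectory() as workspace:
    log = JsonlRunLog(
        Path(workspace) / "runs" / "r1" / "provider-events.jsonl",
        run_id="r1",
        extra_fields={"host": "batch-eval", "run_id": "from-extra-fields"},
    )
    log({"provider": "mock", "phase": "success", "run_id": "from-adapter"})
    log({"provider": "mock", "phase": "retry"})
    records = [json.loads(line) for line in log.path.read_text(encoding="utf-8").splitlines()]
```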
Optional live smoke commands can also write a sanitized run_manifest.json:
scripts/robotics-showcase \
  --json-output /tmp/worldforge-robotics-showcase/real-run.json \
  --run-manifest /tmp/worldforge-robotics-showcase/run_manifest.json
The manifest records command argv, package version, provider profile, capability, value-free
environment presence, runtime manifest id when available, input fixture digest, event count, result
digest, status (passed, failed, or skipped), and artifact paths. Validation rejects unknown
statuses, raw secret-like fields, and unsanitized signed URLs; artifact URLs are stored without
query strings or fragments.
For local run inspection, install the optional rerun extra and stream events plus artifacts into
a Rerun recording:
Expected success signal: .worldforge/rerun/worldforge-rerun-showcase.rrd exists, the command
prints a byte count, and the recording opens in the Rerun viewer. First triage step: run
uv run --extra rerun python -c "import rerun; print(rerun.__version__)".
Robotics Operator Review¶
The checkout host at examples/hosts/robotics-operator/app.py is an offline review loop for
policy-plus-score robotics runs:
By default it does not talk to robot controllers. It requires an explicit action translator in the
host process, records checklist and dry-run approval state, and preserves selected action chunks,
score rationale, provider events, and a replay artifact under
.worldforge/robotics-operator/runs/<run-id>/.
WorldForge only certifies that its typed provider, event, replay, and manifest artifacts satisfy the framework contracts. The lab host remains responsible for embodiment translators, controller hooks, workspace safety, operator approval policy, emergency stops, hardware behavior, deployment readiness, and safety certification.
Failure Modes¶
- Invalid caller input raises WorldForgeError.
- Malformed persisted or provider-supplied state raises WorldStateError.
- Provider runtime, transport, credential, and upstream failures raise ProviderError.
- Missing remote credentials leave the provider unregistered; it is then visible only through doctor().
- Remote create-style requests are single-attempt by default; health checks, polling, and downloads retry according to ProviderRequestPolicy.
- Provider request budgets are per operation. timeout_seconds limits one HTTP attempt; optional max_elapsed_seconds limits the whole operation including retries, backoff, and task polling. Budget violations raise ProviderBudgetExceededError and emit a budget_exceeded provider event when an event handler is attached.
- Circuit breakers stay host-owned. A service can count recent failure, retry, and budget_exceeded events from ProviderMetricsSink, stop routing new work to a degraded provider, and continue serving cached/read-only paths without WorldForge owning alert channels or upstream SLAs.
- LeWorldModel scoring fails explicitly when optional dependencies are unavailable, the checkpoint cannot load, required pixels/goal/action fields are missing, action candidates do not have shape (batch=1, samples, horizon, action_dim), returned score count does not match candidate samples, or returned scores are not finite.
- GR00T policy selection fails explicitly when the PolicyClient dependency is unavailable, the policy server is unreachable, observations are malformed, raw actions are not JSON-compatible, or no host-owned action translator is provided.
- LeRobot policy selection fails explicitly when the LeRobot dependency is unavailable, policy loading fails, observations are malformed, raw actions are not JSON-compatible, or no host-owned action translator is provided.
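The host-owned circuit breaker mentioned in the list can be sketched as a sliding-window counter over event phases. This is hypothetical host code: ProviderCircuitBreaker, its thresholds, and feeding it phase strings directly (rather than reading aggregates from ProviderMetricsSink) are all assumptions made to keep the example self-contained.

```python
"""Sketch: host-owned circuit breaker fed by provider event phases."""
import time
from collections import deque

DEGRADED_PHASES = frozenset({"failure", "retry", "budget_exceeded"})


class ProviderCircuitBreaker:
    """Opens when too many degraded events arrive inside a sliding window."""

    def __init__(self, max_degraded: int = 5, window_seconds: float = 60.0, clock=time.monotonic):
        self.max_degraded = max_degraded
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._degraded_at = deque()

    def record(self, phase: str) -> None:
        """Call once per provider event with its lowercase phase."""
        if phase in DEGRADED_PHASES:
            self._degraded_at.append(self._clock())

    def allow_request(self) -> bool:
        """False while the recent degraded-event count is at or above the limit."""
        cutoff = self._clock() - self.window_seconds
        while self._degraded_at and self._degraded_at[0] < cutoff:
            self._degraded_at.popleft()
        return len(self._degraded_at) < self.max_degraded
```

While allow_request() returns False, the service can keep serving cached or read-only paths and stop routing new provider work, which matches the boundary described above.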
Recovery¶
- For local state corruption, restore from the host application's backup of exported world JSON.
- For missing credentials, fix the environment and restart the host process so provider auto-registration runs again.
- For transient remote failures, inspect emitted ProviderEvent records for operation, phase, status_code, attempt, and sanitized target.
- For LeWorldModel failures, run worldforge provider health leworldmodel, verify stable-worldmodel, torch, opencv-python, and imageio are installed in the host environment, then confirm the configured policy exists under $STABLEWM_HOME or LEWORLDMODEL_CACHE_DIR.
- To smoke-test a real LeWorldModel checkpoint, run scripts/lewm-real --checkpoint ~/.stable-wm/pusht/lewm_object.ckpt --device cpu. This requires host-owned upstream dependencies and an extracted object checkpoint.
- If you have Hugging Face LeWM config.json and weights.pt assets rather than an extracted *_object.ckpt archive, build the object checkpoint first with the command below:
uv run --python 3.13 \
  --with "stable-worldmodel @ git+https://github.com/galilai-group/stable-worldmodel.git" \
  --with "datasets>=2.21" \
  --with huggingface_hub \
  --with hydra-core \
  --with omegaconf \
  --with transformers \
  --with matplotlib \
  --with "opencv-python" \
  --with "imageio" \
  worldforge-build-leworldmodel-checkpoint \
  --stablewm-home ~/.stable-wm \
  --policy pusht/lewm \
  --revision 22b330c28c27ead4bfd1888615af1340e3fe9052
hydra-core, omegaconf, and transformers are required to instantiate the official LeWM
PushT config. Before Hydra is allowed to instantiate anything, the builder validates the
downloaded config against the known official PushT LeWM target allowlist, rejects any
interpolated _target_ value, and rejects nested targets outside that allowlist. The default
revision is the pinned commit
22b330c28c27ead4bfd1888615af1340e3fe9052; pass --revision <40-char-commit-sha> or set
LEWORLDMODEL_REVISION to another audited immutable Hugging Face commit.
The builder loads downloaded weights.pt with torch.load(..., weights_only=True) by default;
--allow-unsafe-pickle exists only for trusted legacy weights and older torch environments. The
builder downloads assets to ~/.cache/worldforge/leworldmodel by default, or to
LEWORLDMODEL_ASSET_CACHE_DIR / --asset-cache-dir when set, and writes the object checkpoint
under $STABLEWM_HOME.
- To demonstrate the LeWorldModel planning flow without optional dependencies, run
uv run worldforge-demo-leworldmodel. It uses the real LeWorldModelProvider interface
with an injected deterministic cost runtime and exercises score planning, execution,
persistence, and reload. It is not a real upstream-checkpoint inference run; use
lewm-real or worldforge-smoke-leworldmodel for that path. The demo should report
uses_leworldmodel_provider: true, uses_worldforge_score_planning: true, and
uses_real_upstream_checkpoint: false.
- To demonstrate LeRobot policy-plus-score planning without optional dependencies, run
uv run worldforge-demo-lerobot. It uses the real LeRobotPolicyProvider interface with an
injected deterministic policy runtime and exercises policy selection, score ranking, execution,
persistence, and reload. It is not a real LeRobot checkpoint inference run.
- To run the real LeRobot plus real LeWorldModel showcase, use scripts/robotics-showcase. It
launches the packaged PushT policy-plus-score bridge, opens the Textual report by default, and
writes /tmp/worldforge-robotics-showcase/real-run.rrd unless --no-rerun is passed. For the
full walkthrough, see Robotics Replay Showcase.
- To run the same path in CI, use .github/workflows/robotics-showcase.yml. It runs
scripts/robotics-showcase --json-only --no-tui --no-rerun on every pull request update and on
pushes to main, validates real policy/score events, caches Hugging Face assets and the
LeWorldModel object checkpoint with actions/cache, and uploads the JSON summary plus
run_manifest.json as evidence. Checkpoint artifacts are not uploaded.
- To smoke-test a real GR00T policy server, install or check out NVIDIA Isaac-GR00T on a prepared
NVIDIA/Linux host, prepare a host-specific observation fixture and action translator, then run
GROOT_POLICY_HOST=127.0.0.1 GROOT_POLICY_PORT=5555 uv run --with msgpack --with pyzmq --with numpy python scripts/smoke_gr00t_policy.py --health-only --run-manifest .worldforge/runs/gr00t-health/run_manifest.json.
Expected success for --health-only: the process exits 0 and the run manifest records
capability=policy with status=skipped.
For a full policy request, run
GROOT_POLICY_HOST=127.0.0.1 GROOT_POLICY_PORT=5555 uv run --with msgpack --with pyzmq --with numpy python scripts/smoke_gr00t_policy.py --policy-info-json /path/to/policy_info.json --translator /path/to/translator.py:translate_actions --allow-translator-code --run-manifest .worldforge/runs/gr00t-live/run_manifest.json.
Expected success: the process exits 0 and the run manifest records capability=policy with
status=passed. First triage: run uv run worldforge provider health gr00t to confirm the
client can reach the remote PolicyClient server, then recheck the observation fixture and
translator path.
- Starting the upstream GR00T server requires a compatible NVIDIA/Linux runtime for its CUDA and
TensorRT dependencies. On unsupported hosts, point WorldForge at an already running remote GR00T
policy server. Prefer an SSH tunnel such as ssh -N -L 5555:127.0.0.1:5555 ubuntu@<gpu-host>
or restrict the server port to the operator IP or VPN. Hibernate or terminate remote GPU
instances when the smoke is done.
- To smoke-test a real Cosmos-Policy ALOHA server, run the upstream server on a compatible
Linux/NVIDIA host, prepare ALOHA policy info and an action translator, then run
COSMOS_POLICY_BASE_URL=http://127.0.0.1:8777 COSMOS_POLICY_ALLOW_LOCAL_BASE_URL=1 uv run worldforge-smoke-cosmos-policy --policy-info-json /path/to/policy_info.json --translator /path/to/translator.py:translate_actions --allow-translator-code --run-manifest .worldforge/runs/cosmos-policy-live/run_manifest.json.
For the configuration-only path, run
COSMOS_POLICY_BASE_URL=http://127.0.0.1:8777 COSMOS_POLICY_ALLOW_LOCAL_BASE_URL=1 uv run worldforge-smoke-cosmos-policy --health-only --run-manifest .worldforge/runs/cosmos-policy-health/run_manifest.json.
--health-only validates WorldForge configuration only and does not call /act, because the
targeted upstream server has no non-mutating health endpoint.
Expected success for --health-only: the process exits 0 and the run manifest records
capability=policy with status=skipped.
Expected success for the full policy request: the process exits 0 and the run manifest records
capability=policy with status=passed. First triage: run uv run worldforge provider health
cosmos-policy to confirm configuration, then run the smoke command to verify the host can reach /act.
For rented or lab GPUs, follow the
Cosmos-Policy remote GPU runbook: use a
prepared 48 GB or larger Linux/NVIDIA host when required by the upstream model, prefer an SSH
tunnel to port 8777, restrict direct firewall exposure to the operator IP or VPN CIDR, preserve
only sanitized manifests/replay artifacts, and hibernate or terminate the GPU host when done.
- Pytest live runtime coverage is opt-in. Use uv run pytest or uv run pytest -m "not live" for
deterministic checkout validation. Prepared hosts can select one live provider profile at a time
with markers such as live, network, credentialed, gpu, robotics, and
provider_profile, plus the matching --run-* flags and --provider-profile <name>. See
Run Optional Runtime Smokes for provider-specific
commands.
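The smoke runs above share one success contract: exit code 0 plus a run manifest that records capability=policy with the expected status. A host script might check the manifest half of that contract as sketched below. This is a minimal illustration, not part of WorldForge: it assumes the manifest is a JSON object with top-level `capability` and `status` keys, which is a guess at the layout rather than the documented schema, and `load_manifest` and `manifest_records` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a run manifest from disk and return the parsed JSON value."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def manifest_records(manifest, capability, status):
    """Return True when the manifest records the expected capability/status pair.

    Assumption: the manifest is a JSON object with top-level "capability"
    and "status" keys. Adjust the lookups to the real manifest layout.
    """
    if not isinstance(manifest, dict):
        return False
    return manifest.get("capability") == capability and manifest.get("status") == status
```

For a --health-only run, the check would be `manifest_records(load_manifest(".worldforge/runs/gr00t-health/run_manifest.json"), "policy", "skipped")`; for a full policy request, the expected status is `"passed"`.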
Release Checklist¶
Before publishing a release:
uv sync --group dev
uv lock --check
uv run ruff check src tests examples scripts
uv run ruff format --check src tests examples scripts
uv run python scripts/generate_provider_docs.py --check
uv run python scripts/check_docs_commands.py
uv run python scripts/check_docs_snippets.py
uv run python scripts/check_wrapper_portability.py
uv run python scripts/check_optional_import_boundaries.py
uv run python scripts/check_core_performance.py
uv run mkdocs build --strict
uv run pytest
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
bash scripts/test_package.sh
uv build --out-dir dist --clear --no-build-logs
shasum -a 256 dist/worldforge_ai-*.whl dist/worldforge_ai-*.tar.gz
The artifact integrity contract is documented in Artifact Integrity. It covers package hashes, current package/evidence checks, unsafe artifact exclusions, and future SBOM, provenance, and attestation work that is not claimed today.
Then generate locked dependency-audit evidence:
The wrapper runs the documented uv export --frozen --all-groups --no-emit-project --no-hashes
plus uvx --from pip-audit pip-audit ... --format json flow using a temporary requirements file
that is removed after the audit. It writes .worldforge/dependency-audit/dependency-audit.json
and .worldforge/dependency-audit/dependency-audit.md, records tool versions, dependency-set
digest, vulnerability summary, explicit --ignore-advisory ADVISORY=RATIONALE rows, command
output tails, and a first triage step. Raw-detail keys and values are sanitized before JSON or
Markdown rendering, with deterministic suffixes for redacted-key collisions. Success signal:
status is passed; findings, tool-unavailable, and failed states still leave safe-to-attach
evidence. First triage step for findings: inspect the Markdown advisory row, upgrade or document
the dependency decision, then rerun the audit.
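The collision rule matters because redaction can map several distinct keys onto the same placeholder, and a plain dictionary would then silently keep only one entry. The sketch below illustrates the idea under stated assumptions; it is not the wrapper's implementation. The `[redacted-key]` placeholder, the `#2` suffix format, and the secret-hint pattern are all hypothetical.

```python
import re

# Assumption: illustrative hint list, not the audit wrapper's real rules.
SECRET_HINT = re.compile(r"(token|secret|password|api[_-]?key)", re.IGNORECASE)
PLACEHOLDER = "[redacted-key]"


def sanitize_keys(raw):
    """Redact secret-like keys while keeping every entry.

    Keys are visited in sorted order, so colliding placeholders always
    receive the same numeric suffix for the same input.
    """
    sanitized = {}
    redacted_count = 0
    for key in sorted(raw):
        if SECRET_HINT.search(key):
            redacted_count += 1
            suffix = "" if redacted_count == 1 else f"#{redacted_count}"
            # The value is redacted too, since the key suggests it is sensitive.
            sanitized[PLACEHOLDER + suffix] = "[redacted]"
        else:
            sanitized[key] = raw[key]
    return sanitized
```

Sorting before assigning suffixes is the design choice that makes the output deterministic: two runs over the same detail mapping produce byte-identical evidence.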
Generate the release-readiness evidence after local gates and optional smokes finish. The command
writes both Markdown and JSON summaries by default; use --run-gates when the evidence run itself
should execute the checkout-safe gates instead of recording them as skipped.
uv run python scripts/generate_release_evidence.py \
--run-gates \
--live-smoke-registry docs/src/live-smoke-evidence.json \
--run-manifest .worldforge/runs/<run-id>/run_manifest.json \
--artifact .worldforge/dependency-audit/dependency-audit.json \
--benchmark-artifact .worldforge/reports/benchmark-<timestamp>-<run-id>.json \
--artifact dist/worldforge_ai-<version>-py3-none-any.whl
The report defaults to .worldforge/release-evidence/release-evidence.md and the JSON summary
defaults to .worldforge/release-evidence/release-evidence.json. Gate rows are explicit
passed, failed, or skipped; each row includes the command, exit code when available, and first
triage step. Optional live provider evidence is host-owned unless a prepared-host
run_manifest.json is linked. Attach the Markdown report, JSON summary, and linked artifacts when a
release note or provider promotion claims live-provider coverage.
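The gate-row contract described above (explicit status, command, exit code, first triage step) can be pictured with a small sketch. The row field names and the `run_gate` and `skipped_gate` helpers are hypothetical; they illustrate the shape of a row, not the output format of scripts/generate_release_evidence.py.

```python
import shlex
import subprocess
import sys


def run_gate(name, command, first_triage):
    """Run one gate command and return an explicit passed/failed row."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return {
        "name": name,
        "command": shlex.join(command),
        "status": "passed" if completed.returncode == 0 else "failed",
        "exit_code": completed.returncode,
        "first_triage": first_triage,
    }


def skipped_gate(name, command, first_triage):
    """Record a gate that was not executed, with no exit code."""
    return {
        "name": name,
        "command": shlex.join(command),
        "status": "skipped",
        "exit_code": None,
        "first_triage": first_triage,
    }


if __name__ == "__main__":
    row = run_gate("python-version", [sys.executable, "--version"], "check the interpreter on PATH")
    print(row["status"], row["exit_code"])
```

Keeping skipped rows in the same shape as executed rows means a reviewer can tell a gate that never ran from one that ran and passed.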
Before a release review, rehearse the evidence path with the checkout-safe drill:
uv run python scripts/release_readiness_drill.py \
--workspace-dir .worldforge/release-readiness-drill
The drill writes .worldforge/release-readiness-drill/release-readiness-drill.json, a Markdown
report, and clean-pass and controlled-failure release-evidence fixtures. It explains the first failed
gate and first triage command, records host-owned optional-runtime skips, and never publishes,
tags, signs, creates a GitHub release, or approves a real release. Use it to verify operator
workflow understanding; use scripts/generate_release_evidence.py --run-gates for current release
approval evidence.
Generate the local quality dashboard when you need one at-a-glance page for the branch:
The dashboard defaults to .worldforge/quality-dashboard/quality-dashboard.json and
.worldforge/quality-dashboard/quality-dashboard.md. It reads existing release evidence,
dependency-audit evidence, and core-performance JSON; it does not execute gates. Status rows use
passed, failed, warning, skipped, and not-run, preserve sanitized raw failure output
tails, list skipped host-owned provider checks, and name the first failed gate. Use it as a local
quality index. Raw-detail keys and values are sanitized before JSON or Markdown rendering, and
colliding redacted keys keep deterministic suffixes instead of dropping entries. Core-performance
status is derived from both the top-level artifact status and row-level budget results, so a stale
passed: true flag cannot hide failed budget rows. Release evidence remains the artifact for
release claims, artifact hashes, linked run_manifest.json files, and known limitations.
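The stale-flag rule for core-performance status can be sketched as follows. The top-level `passed` flag is taken from the text above; the `results` list and the per-row `name` and `passed` fields are assumptions about the artifact layout, and `derive_core_performance_status` is a hypothetical helper rather than the dashboard's code.

```python
def derive_core_performance_status(artifact):
    """Combine the top-level flag with row-level budget results.

    Assumed shape (hypothetical):
    {"passed": bool, "results": [{"name": str, "passed": bool}, ...]}

    Returns a (status, failed_row_names) tuple. Any failed row forces
    "failed", even when the top-level flag still says passed.
    """
    rows = artifact.get("results", [])
    failed_rows = [row.get("name", "<unnamed>") for row in rows if not row.get("passed", False)]
    if failed_rows or not artifact.get("passed", False):
        return "failed", failed_rows
    return "passed", failed_rows
```

A missing flag or a missing row result is treated as a failure here, on the reasoning that absent evidence should never read as a pass.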
After evidence exists, draft release notes for maintainer editing:
mkdir -p .worldforge/release-notes
gh issue list --state closed --limit 200 \
--json number,title,url,labels,closedAt,state \
> .worldforge/release-notes/closed-issues.json
uv run python scripts/generate_release_notes.py \
--release-evidence .worldforge/release-evidence/release-evidence.json \
--issues-json .worldforge/release-notes/closed-issues.json \
--known-caveat "No prepared-host live smoke was run for <provider>."
The release-notes command writes .worldforge/release-notes/release-notes-draft.md. It is a draft
artifact only: maintainers must edit it before publishing, and the command never creates a tag,
GitHub release, signature, or trusted-publishing artifact. Success signal: the draft contains
added, changed, fixed, docs, validation, compatibility, caveat, and host-owned optional-runtime
sections. Draft status is derived from both validation_summary and row-level
validation_gates; any failed gate row keeps the draft at needs-validation-review even if a
stale summary reports zero failures. First triage step when validation is missing: run
uv run python scripts/generate_release_evidence.py --run-gates and regenerate the draft. Use
--require-validation-evidence in release scripts when a missing or invalid evidence JSON should
fail the command. Changelog entries, closed issue metadata, release-evidence text, and
--known-caveat values are sanitized before Markdown rendering so token assignments, bearer
headers, signed URLs, and host-local paths stay out of draft release notes.
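To make the four sanitized categories concrete, here is a minimal regex-based sketch. The patterns and replacement markers are illustrative assumptions, deliberately simple, and not the rules that scripts/generate_release_notes.py applies; `sanitize_text` is a hypothetical helper.

```python
import re

# Assumption: illustrative patterns only. A real sanitizer needs broader coverage.
REDACTIONS = [
    # Bearer headers such as "Authorization: Bearer abc.def".
    (re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE), "Bearer [redacted]"),
    # Token-style assignments such as "API_TOKEN=abc123".
    (
        re.compile(
            r"\b([A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|API_KEY)[A-Z0-9_]*)\s*=\s*\S+",
            re.IGNORECASE,
        ),
        r"\1=[redacted]",
    ),
    # URL query strings, where signed-URL signatures usually live.
    (re.compile(r"(https?://[^\s?]+)\?\S+"), r"\1?[redacted]"),
    # Host-local absolute paths under common home roots (not inside URLs).
    (re.compile(r"(?<![\w./-])(?:/home|/Users)/[^\s/]+(?:/\S*)?"), "[host-path]"),
]


def sanitize_text(text):
    """Apply each redaction in order and return the sanitized text."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

The order matters in this sketch: token assignments are redacted before URL query strings so that a `token=...` parameter is never left partially visible.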
uv run python scripts/check_core_performance.py writes a checkout-safe JSON report for world
persistence, benchmark fixture loading, provider diagnostics, evidence-bundle creation, and report
rendering. Success signal: passed is true and each result row has a preserved artifact path when
--workspace-dir <path> is supplied. First triage step: inspect the failing row's measured path and
fix the regression before changing a budget. These budgets are local regression guards, not
cross-machine or optional-runtime performance claims.
uv run python scripts/check_wrapper_portability.py checks shell wrappers and optional-runtime
smoke commands without installing host-owned runtimes. Success signal: the report passes for
scripts/robotics-showcase, scripts/lewm-real, scripts/lewm-lerobot-real, GR00T and LeRobot
smoke helpers, and scripts/test_package.sh. First triage step: fix the named script's shebang,
executable bit, documented command, or Python 3.13 uv invocation.
uv run python scripts/check_optional_import_boundaries.py checks optional-runtime import
boundaries without installing host-owned runtimes. Success signal: base package imports, CLI
startup, worldforge.rerun, and non-TUI harness modules do not load Textual, Rerun, torch,
stable-worldmodel, LeRobot, GR00T, or Cosmos-Policy packages, and static source checks only find
optional imports inside their allowed provider, smoke, Rerun, or harness.tui modules. First
triage step: move the named import behind the allowed lazy boundary in the report.
uv run python scripts/check_docs_snippets.py executes selected Python snippets and parses selected
JSON snippets from the public docs. Success signal: the report passes with no snippet failures, and
any host-owned, credentialed, or illustrative snippets are explicitly skipped. First triage step:
fix the file, heading, and line named in the failure before changing surrounding docs.
When release or issue triage needs the underlying evaluation and benchmark artifacts, generate a separate evidence bundle first:
uv run worldforge eval --suite planning --provider mock --run-workspace .worldforge
uv run worldforge benchmark --preset mock-smoke --run-workspace .worldforge
uv run python scripts/generate_evidence_bundle.py \
--workspace-dir .worldforge \
--output .worldforge/evidence-bundles/mock-planning
uv run python scripts/generate_release_evidence.py \
--artifact .worldforge/evidence-bundles/mock-planning/evidence_manifest.json
Success signal: the bundle writes evidence_manifest.json and summary.md, every included file
has a sha256:<hex> digest, and excluded files carry a reason such as unsupported suffix,
host-local path, or secret-like material. First triage step on failure: inspect the run's
run_manifest.json, then remove unsafe artifacts or mark them local-only before regenerating the bundle.
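The include/exclude contract in the success signal can be sketched as follows. This is an illustration under stated assumptions, not the logic of scripts/generate_evidence_bundle.py: the suffix allow-list, the secret markers, the entry field names, and the `classify` and `build_manifest` helpers are all hypothetical. Only the `sha256:<hex>` digest format and the exclusion reasons come from the text above.

```python
import hashlib
from pathlib import Path, PurePosixPath

ALLOWED_SUFFIXES = {".json", ".md"}            # assumption: illustrative allow-list
SECRET_MARKERS = (b"PRIVATE KEY", b"Bearer ")  # assumption: illustrative markers


def classify(relative_path, read_payload):
    """Return an included entry with a digest, or an excluded entry with a reason.

    `read_payload` is a zero-argument callable, so files with an
    unsupported suffix (for example large checkpoints) are never read.
    """
    if PurePosixPath(relative_path).suffix not in ALLOWED_SUFFIXES:
        return {"path": relative_path, "excluded": "unsupported suffix"}
    payload = read_payload()
    if any(marker in payload for marker in SECRET_MARKERS):
        return {"path": relative_path, "excluded": "secret-like material"}
    return {"path": relative_path, "digest": "sha256:" + hashlib.sha256(payload).hexdigest()}


def build_manifest(workspace):
    """Classify every file under `workspace` in sorted order."""
    root = Path(workspace)
    files = sorted(path for path in root.rglob("*") if path.is_file())
    return [classify(path.relative_to(root).as_posix(), path.read_bytes) for path in files]
```

Every file ends up in the manifest either with a digest or with a reason, which is what lets a reviewer confirm that nothing was dropped silently.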
The tag-triggered release workflow repeats the full quality gate before building distributions or publishing release artifacts.
Also update CHANGELOG.md, the README, and provider documentation for any public behavior change.
Provider Hardening Criteria¶
- Remote provider non-happy-path tests cover transport retries, malformed JSON, bad action payloads, missing optional runtimes, redaction, and provider limits.
- Persistence remains documented as host-owned unless a dedicated persistence adapter is designed.
- API documentation lists the public exception families and provider workflow failure modes.
- Remaining work is tracked with measurable exit criteria before provider capabilities are advertised as complete.