User And Operator Playbooks
These playbooks are for people running WorldForge from a checkout, embedding it in a job or service, or maintaining provider adapters. Each playbook says when to use it, what to run, what success looks like, and where to look when it fails.
WorldForge is still a library. It does not own deployment, credential storage, robot safety, multi-writer persistence, dashboards, or artifact retention. Those remain host responsibilities.
1. Bootstrap A Clean Checkout
Use this before changing code, reviewing a provider branch, or reproducing a reported issue.
uv sync --group dev
uv lock --check
uv run worldforge doctor
uv run worldforge examples
uv run python scripts/generate_provider_docs.py --check
uv run python scripts/check_docs_commands.py
uv run python scripts/check_docs_snippets.py
uv run python scripts/check_wrapper_portability.py
uv run python scripts/check_optional_import_boundaries.py
uv run mkdocs build --strict
uv run pytest tests/test_cli_help_snapshots.py tests/test_provider_catalog_docs.py
Success signal:
- `doctor` shows `mock` registered and reports optional providers as missing or unregistered only when their environment variables are absent.
- provider docs are already up to date.
- the MkDocs Material site builds without warnings.
- focused tests pass without live credentials.
If it fails:
| Symptom | First check | Likely owner |
|---|---|---|
| `uv lock --check` fails | dependency files changed without lock refresh | contributor |
| provider docs drift | run `uv run python scripts/generate_provider_docs.py` and inspect diff | contributor |
| optional provider appears registered unexpectedly | check local `.env` and shell environment | operator |
| tests try to reach live services | replace with fixture, fake transport, or injected runtime | contributor |
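The bootstrap sequence above can also be driven from a small script that stops at the first failing step, which makes it easier to match a failure to a row in the table. This is a minimal sketch, not part of WorldForge: `first_failure` and its injectable `run` parameter are hypothetical names, and the command strings are copied from the list above.

```python
import shlex
import subprocess

# Commands copied from the bootstrap list above, in the same order.
BOOTSTRAP_COMMANDS = [
    "uv sync --group dev",
    "uv lock --check",
    "uv run worldforge doctor",
    "uv run worldforge examples",
    "uv run python scripts/generate_provider_docs.py --check",
    "uv run python scripts/check_docs_commands.py",
    "uv run python scripts/check_docs_snippets.py",
    "uv run python scripts/check_wrapper_portability.py",
    "uv run python scripts/check_optional_import_boundaries.py",
    "uv run mkdocs build --strict",
    "uv run pytest tests/test_cli_help_snapshots.py tests/test_provider_catalog_docs.py",
]


def first_failure(commands, run=subprocess.run):
    """Run each command in order; return the first failing command, or None."""
    for command in commands:
        completed = run(shlex.split(command))
        if completed.returncode != 0:
            return command
    return None
```

Calling `first_failure(BOOTSTRAP_COMMANDS)` from the checkout root returns `None` on a clean checkout. The `run` parameter exists only so the loop can be exercised with a fake runner.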
1a. Preserve Checkout-Safe Demo Evidence
Use this when you need a reproducible demo run for a bug report, roadmap issue, release note, or maintainer review without requiring credentials or optional runtimes.
uv run python scripts/demo_showcases.py list
uv run python scripts/demo_showcases.py run all --workspace-dir .worldforge/demo-showcases --format json --overwrite
Success signal: the top-level status is passed; individual workflows are either passed or an
intentional skipped for a missing optional extra. Each workflow writes
.worldforge/demo-showcases/<workflow>/workflow-result.json and a preserved
.worldforge/demo-showcases/<workflow>/runs/<run-id>/run_manifest.json.
If it fails:
| Symptom | First check | Likely owner |
|---|---|---|
| `first-run` fails | run `uv run worldforge doctor --registered-only` and inspect the demo's exported world-state JSON under the run workspace | contributor |
| diagnostics bundle is not safe to attach | open `issue-bundle/evidence_manifest.json` and inspect excluded files | reporter |
| robotics replay fails | run `uv run worldforge-demo-lerobot` and inspect provider event phases | contributor |
| provider-event redaction dry run leaks a query string | inspect `provider-event-redaction-events.json` and the provider-event redaction corpus | contributor |
| batch benchmark status changes | inspect copied budget and benchmark report before editing thresholds | performance maintainer |
The runner does not install LeRobot, LeWorldModel, GR00T, torch, Rerun, checkpoints, simulators, or provider credentials. It does not make paid API calls, control hardware, or claim physical fidelity.
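To confirm the success signal across every workflow at once, a host script can scan the per-workflow result files. The sketch below assumes each `workflow-result.json` carries a top-level `status` string; the file layout comes from the text above, but the field name and both helper functions are assumptions.

```python
import json
from pathlib import Path


def workflow_statuses(workspace):
    """Map each workflow directory name to the status in its workflow-result.json."""
    statuses = {}
    for result_path in sorted(Path(workspace).glob("*/workflow-result.json")):
        result = json.loads(result_path.read_text(encoding="utf-8"))
        # Assumption: the result file has a top-level "status" string.
        statuses[result_path.parent.name] = result.get("status", "missing")
    return statuses


def unexpected(statuses):
    """Return workflows whose status is neither passed nor skipped."""
    allowed = {"passed", "skipped"}
    return {name: status for name, status in statuses.items() if status not in allowed}
```

An empty result from `unexpected(workflow_statuses(".worldforge/demo-showcases"))` matches the success signal described above.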
2. Choose The Right Provider Surface
Use this before writing application code or adding an adapter. Start from the operation, not the provider name.
| Need | Capability | First command |
|---|---|---|
| roll a world state forward from an action | `predict` | `uv run worldforge doctor --capability predict` |
| rank action candidates | `score` | `uv run worldforge doctor --capability score` |
| select embodied action chunks | `policy` | `uv run worldforge doctor --capability policy` |
| embed text | `embed` | `uv run worldforge doctor --capability embed` |
Then inspect the provider profile:
uv run worldforge provider list
uv run worldforge provider info mock
uv run worldforge provider docs
Success signal: the provider advertises exactly the method your workflow calls. If the workflow needs policy plus scoring, configure one policy provider and one score provider rather than stretching either adapter into a false capability.
Python integrations can use either a registered full provider or a narrow capability protocol:
from worldforge import ActionScoreResult, WorldForge


# Narrow capability implementation: only the score surface is provided.
class LocalCost:
    name = "local-cost"
    profile = None

    def score_actions(self, *, info, action_candidates):
        return ActionScoreResult(provider=self.name, scores=[0.1], best_index=0)


# Skip remote auto-registration so only the injected provider is active.
forge = WorldForge(auto_register_remote=False)
forge.register_cost(LocalCost())
assert forge.doctor(capability="score", registered_only=True).provider_count == 1
3. Add Or Promote A Provider Adapter
Use this for new provider work and for promoting a scaffold to a real adapter.
uv run python scripts/scaffold_provider.py "Acme WM" \
--taxonomy "JEPA latent predictive world model" \
--implementation-status scaffold \
--planned-capability score \
--remote \
--env-var ACME_WM_API_KEY
Before setting any capability flag to True, prove the full contract:
- caller inputs are validated before network or model calls where possible.
- upstream outputs are parsed through explicit helpers and malformed fixtures fail.
- supported methods return `PredictionPayload`, `ActionScoreResult`, `ActionPolicyResult`, or `EmbeddingResult` as appropriate.
- unsupported methods inherit the `BaseProvider` `ProviderError` behavior.
- `health()` is cheap and reports missing credentials or optional dependencies clearly.
- docs state configuration, runtime ownership, input shape, output schema, limits, failure modes, and smoke path.
If the integration is one narrow local surface, prefer a capability protocol implementation instead
of a mostly-empty BaseProvider subclass. Register it with register_cost, register_policy,
register_predictor, or register_embedder; it will still appear in providers(),
provider_profile(...), doctor(...), planning, and benchmark routing.
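The requirement that upstream outputs pass through explicit helpers, and that malformed fixtures fail, can be illustrated with a small parser. This is a sketch under stated assumptions: the upstream body shape `{"scores": [...]}`, the `MalformedUpstreamResponse` class, and the rule that the lowest score wins are all hypothetical. A real adapter would raise the framework's provider error and return an `ActionScoreResult`.

```python
import math


class MalformedUpstreamResponse(ValueError):
    """Hypothetical error type standing in for the framework's provider error."""


def parse_scores(payload):
    """Validate a hypothetical upstream body shaped like {"scores": [number, ...]}."""
    if not isinstance(payload, dict):
        raise MalformedUpstreamResponse("response must be a JSON object")
    scores = payload.get("scores")
    if not isinstance(scores, list) or not scores:
        raise MalformedUpstreamResponse("'scores' must be a non-empty list")
    for index, value in enumerate(scores):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise MalformedUpstreamResponse(f"scores[{index}] is not a number")
        if not math.isfinite(value):
            raise MalformedUpstreamResponse(f"scores[{index}] is not finite")
    # Assumption: scores are costs, so the lowest value is the best candidate.
    best_index = min(range(len(scores)), key=lambda i: scores[i])
    return {"scores": [float(value) for value in scores], "best_index": best_index}
```

Each rejected branch is a candidate for one malformed fixture in the adapter's tests.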
Validation:
uv sync --group dev
uv run ruff check src tests examples scripts
uv run ruff format --check src tests examples scripts
uv run python scripts/generate_provider_docs.py --check
uv run pytest tests/test_provider_contracts.py tests/test_provider_catalog_docs.py
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
Success signal: the provider contract helper passes for supported surfaces, every documented failure mode has a fixture or fake-runtime test, and the generated provider catalog has no drift.
4. Diagnose Provider Availability
Use this when a workflow says a provider is not registered, unhealthy, or missing a capability.
uv run worldforge doctor
uv run worldforge doctor --registered-only
uv run worldforge doctor --capability score
uv run worldforge provider health
uv run worldforge provider info leworldmodel
Read the result this way:
- `doctor` includes known optional providers by default so missing config is visible.
- `--registered-only` shows only providers active in the current process.
- capability filters are strict. A typo such as `generation` raises a framework error rather than returning an empty list.
- remote providers usually register from environment variables; injected providers register from host code.
If it fails:
| Symptom | Likely cause | Action |
|---|---|---|
| provider is unknown | adapter is not in the catalog or was not manually registered | check src/worldforge/providers/catalog.py or host registration |
| provider is known but unregistered | required env vars are missing | load .env, export vars, restart process |
| provider is registered but unhealthy | optional dependency, endpoint, or credential is invalid | run provider-specific docs and health command |
| provider lacks capability | capability flag is truthful and workflow picked the wrong provider | choose another provider or implement the capability end to end |
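The table is easiest to apply top to bottom, because each row presupposes that the rows above it passed. A hedged sketch of that ordering follows; the function and its boolean inputs are hypothetical and would be filled in by hand from `doctor` and `provider health` output.

```python
def availability_action(*, known, registered, healthy, has_capability):
    """Return the first applicable action from the table, checked top to bottom."""
    if not known:
        return "check the provider catalog or host registration"
    if not registered:
        return "load .env, export vars, restart process"
    if not healthy:
        return "run provider-specific docs and health command"
    if not has_capability:
        return "choose another provider or implement the capability end to end"
    return "provider is available for this capability"
```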
4a. Troubleshoot Error Families
Use this matrix when an issue, run manifest, or operator log reports a public WorldForge exception. Do not catch these errors and continue with coerced state; each family points to a specific owner and first artifact.
| Error family | Common symptom | Likely owner | First command | Expected artifact or signal | First triage step |
|---|---|---|---|---|---|
| `WorldForgeError` | invalid public input, unknown capability, unsupported output format, non-finite number, unsafe artifact reference | caller or contributor | `uv run worldforge doctor --registered-only` | JSON diagnostics with framework and provider configuration state | fix the caller input or add a regression test for the rejected public boundary |
| `WorldStateError` | malformed persisted or provider world-state payload, incoherent scene object bounding box | host operator for host-owned state; adapter maintainer if a provider returned it | inspect the offending world-state dict against a known-good seeded state | a sanitized `WorldStateError` naming the invalid field plus its first triage step | restore the world-state JSON from the host backup only after reviewing the error |
| `ProviderError` | missing credentials, missing optional dependency, malformed upstream response, unsupported provider capability, expired artifact URL | host runtime owner first; adapter maintainer if parser or docs are wrong | `uv run worldforge provider info <provider>` | redacted config summary, provider profile, capability flags, lifecycle status, health, and typed provider details | attach a sanitized run manifest or issue bundle; never paste raw credentials or signed URLs |
| `AssertionError` from `worldforge.testing` | provider conformance helper reports a contract failure | adapter contributor | `uv run pytest tests/test_provider_contracts.py -q` | explicit helper failure naming the missing capability behavior | fix the adapter or its fixtures; do not replace helper checks with bare `assert` |
| non-zero benchmark budget exit | benchmark gate failed against a JSON budget | release or performance maintainer | `uv run worldforge benchmark --preset mock-smoke --run-workspace .worldforge --format json` | preserved run workspace with `run_manifest.json`, report JSON, and budget status | inspect the budget row and preserved inputs before changing thresholds |
| MkDocs strict warning | bad link, nav drift, or generated docs mismatch | docs contributor | `uv run mkdocs build --strict` | strict build output naming the page or link | sync `mkdocs.yml`, `docs/src/SUMMARY.md`, and source page links |
Public issues should attach worldforge runs bundle <run-id> or
scripts/generate_evidence_bundle.py output when available. Security-sensitive failures still go
through the private Security tab.
4b. Map Health And Readiness During Incidents
Use this when a service host, batch job, or operator dashboard needs to decide whether to send traffic to a provider-backed workflow. Keep process liveness separate from provider readiness.
| State | Symptom | Likely cause | First command | Expected signal | Escalation point |
|---|---|---|---|---|---|
| process live | `GET /healthz` succeeds, but provider workflows may still fail | HTTP process is running; provider path has not been proven | `curl -fsS http://127.0.0.1:8080/healthz` | JSON status is `live`; no provider fields are implied | host service owner if the process is down |
| provider unconfigured | `/readyz` returns `provider_unconfigured` or `doctor` shows the provider absent from registered providers | missing env vars, missing host injection, or wrong provider name | `uv run worldforge doctor --registered-only` | selected provider is absent; `registered_provider_count` and `issues` explain the local process | host deployment or credential owner |
| provider unhealthy | `/readyz` returns `provider_unhealthy` or provider health reports `healthy=false` | optional runtime missing, bad credentials, unreachable upstream, or failed provider health parsing | `uv run worldforge provider health <name>` | health details name the missing env var, dependency, endpoint, or sanitized upstream error | host runtime owner first; provider adapter maintainer if details are wrong or unsafe |
| upstream degraded | provider health is intermittently false, provider events show retries, 5xx, 429, or budget exhaustion | remote provider outage, throttling, expired credentials, or host budget too tight | `jq 'select(.phase=="retry" or .phase=="budget_exceeded") \| {provider, operation, status_code, target, message}' .worldforge/runs/<run-id>/provider-events.jsonl` | sanitized targets, retry counts, status class, and `budget_exceeded` events identify the failing operation | upstream provider support or host SRE; WorldForge does not own upstream SLA |
| workflow failing | `/readyz` stays `ready`, but one request returns a typed error | malformed world state, unsupported capability, invalid input, parser failure, or expired artifact | `uv run worldforge provider info <name>` | profile, capability flags, redacted config summary, lifecycle status, and health show whether the request matched the provider contract | application owner for bad input; adapter maintainer for parser/contract bugs |
The stdlib service reference uses the same model: /healthz is process-only liveness, while
/readyz returns ready, provider_unconfigured, or provider_unhealthy plus a traffic
decision of accept or drain. Alert routing, paging policy, retry orchestration outside a single
provider call, and upstream SLA ownership remain host responsibilities.
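A host-side gate over this model can fail closed: accept traffic only when the readiness body says so, and drain otherwise. The sketch below assumes the `/readyz` body is a JSON object with `status` and `traffic` keys, as in the reference host output. `fetch_json` and `traffic_decision` are hypothetical helper names, not WorldForge APIs.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, timeout=2.0):
    """GET a JSON body, also reading it from non-2xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # Readiness endpoints often answer 503 with a JSON body; keep that body.
        return json.loads(error.read().decode("utf-8"))


def traffic_decision(readyz_payload):
    """Return "accept" only for status=ready and traffic=accept; otherwise drain."""
    ready = readyz_payload.get("status") == "ready"
    accept = readyz_payload.get("traffic") == "accept"
    return "accept" if ready and accept else "drain"
```

A connection error from `fetch_json` is a liveness problem rather than a readiness one, so it is left to propagate to the host's process supervisor.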
4c. Deploy Reference Hosts
Use this before handing the stdlib service host, batch evaluation host, or robotics operator host to another host process owner.
uv run python examples/hosts/service/app.py --provider mock --port 8080
uv run python examples/hosts/batch-eval/app.py benchmark --provider mock --operation predict --iterations 1
uv run python examples/hosts/robotics-operator/app.py review --sample-translator --approve-dry-run \
--check workspace_clear --check emergency_stop_available --check operator_present --check controller_isolated
Success signal: the service host returns /readyz with status: ready and traffic: accept; the
batch host prints status: passed and a run_manifest; the robotics operator host prints
status: passed, a run_manifest, and controller_executed: false.
If it fails: use the Reference Host Deployment Recipes first. They include env templates, process commands, readiness commands, smoke commands, logging commands, evidence export commands, and first rollback or triage steps for checkout-safe, prepared-host, credentialed, GPU-bound, and robotics-lab paths. Deployment, auth, queueing, durable storage, controller integration, alerting, uptime, and safety certification stay host-owned.
4d. Rehearse Operator Failure Drills
Use this before relying on an incident runbook. The drills are deterministic and checkout-safe; each one writes a preserved run manifest under a temporary or documented workspace and records the expected failure plus recovery command.
uv run worldforge drills list
uv run worldforge drills run missing-credentials --workspace-dir .worldforge/drills
uv run worldforge drills run unsafe-event-metadata --workspace-dir .worldforge/drills --bundle
uv run worldforge drills run all --workspace-dir .worldforge/drills
| Drill | Expected failure | Recovery command |
|---|---|---|
| `missing-credentials` | required provider credentials are absent in a value-free config summary | load the required env var, then run `uv run worldforge provider health leworldmodel` |
| `missing-optional-dependency` | optional runtime import is missing | install the provider optional runtime on a prepared host, then rerun its smoke command |
| `malformed-provider-output` | a provider parser rejects a malformed fixture | attach the sanitized fixture and fix the parser or upstream contract |
| `budget-violation` | a mock benchmark violates an intentionally impossible latency budget | inspect the run bundle, then rerun `uv run worldforge benchmark --provider mock --operation predict --iterations 1` |
| `corrupted-world-state` | a malformed local world JSON file raises `WorldStateError` | export diagnostics, quarantine the bad file, then recreate or import a valid world |
| `expired-artifact` | an artifact descriptor has an expiry timestamp in the past | rerun the provider workflow to refresh the artifact, then export a new issue bundle |
| `unsafe-event-metadata` | non-JSON-native event metadata is rejected and secret-shaped fields are redacted | remove object or tuple metadata, keep JSON-native fields only, then rerun the workflow |
Success signal: the drill command exits 0, prints status: passed, and writes a run manifest
whose status is failed because the rehearsed incident was observed. With --bundle, the command
also writes .worldforge/drills/issue-bundles/<run-id>/issue.md.
If it fails unexpectedly: inspect <workspace>/runs/<run-id>/results/drill.json first, then run
uv run worldforge runs bundle <run-id> --workspace-dir <workspace> before changing fixtures.
Drills must not mutate user worlds; corrupted-state input files are written under the drill run's
workspace.
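The drill success signal is easy to misread because two statuses point in opposite directions: the command reports `passed` while the preserved manifest reports `failed`. A minimal sketch of that check follows. It assumes the manifest sits at `<workspace>/runs/<run-id>/run_manifest.json` with a top-level `status` field, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(workspace, run_id):
    """Read the preserved manifest; the runs/<run-id>/ location is an assumption."""
    path = Path(workspace) / "runs" / run_id / "run_manifest.json"
    return json.loads(path.read_text(encoding="utf-8"))


def drill_succeeded(exit_code, printed_status, manifest):
    """A drill succeeds when the CLI passes and the manifest records the failure."""
    return (
        exit_code == 0
        and printed_status == "passed"
        and manifest.get("status") == "failed"
    )
```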
5. Drive World State Through The Capability Surface
Use this for local jobs, demos, tests, and single-writer workflows. There is no symbolic World
runtime or built-in JSON world store; planning runs over plain world-state dicts.
CLI:
Python:
from worldforge import Action, WorldForge
forge = WorldForge()
world_state = {"step": 0, "scene": {"objects": {}}}
payload = forge.predict(world_state, Action.move_to(0.3, 0.8, 0.0), steps=2, provider="mock")
next_state = payload.state
Success signal:
- `predict` returns a `PredictionPayload` whose `state` is the provider-updated world-state dict.
- invalid public inputs (bad actions, non-positive step counts, malformed candidates) raise `WorldForgeError` before any outbound provider call.
- malformed persisted or provider world-state payloads raise `WorldStateError` at the boundary.
Recovery guidance:
- if a host-owned persisted world-state JSON is corrupted, restore from the host application's backup; WorldForge does not silently repair malformed state.
- if `worldforge runs` reports stale run workspaces or unsafe artifact paths, export a run bundle when a manifest is valid; otherwise quarantine the run directory after preserving its manifest.
- if retention pressure is the only issue, run `uv run worldforge runs cleanup --workspace-dir .worldforge --keep 20 --dry-run` and remove evidence only after incident or release references no longer need it.
- if multiple workers need writes, move persistence into host-owned storage with locking, migrations, backups, and recovery drills.
- do not add a lock file, SQLite store, or service adapter to WorldForge without the persistence adapter ADR.
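Because persistence is host-owned, a single-writer host needs its own way to keep a restorable backup. The sketch below is one possible host-side pattern, not a WorldForge feature: it writes the world-state dict to a temporary file, replaces the target atomically, and keeps the previous version as `.bak`. The function names and the `.bak` convention are assumptions, and it does nothing for multi-writer cases.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def save_world_state(path, world_state):
    """Atomically replace path with world_state, keeping the previous file as .bak."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        shutil.copy2(path, path.with_suffix(path.suffix + ".bak"))
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(world_state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise


def load_world_state(path):
    """Load the state; corrupt JSON raises instead of being silently repaired."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Restoring from the `.bak` file stays a deliberate operator action after reviewing the error, in line with the recovery guidance above.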
5b. Capture Run-Scoped Provider Logs
Use this when a CLI job, batch host, service request, or robotics showcase run needs provider events that can be attached to an issue, release bundle, or incident note.
For one failed preserved run, export the issue-ready bundle before posting:
uv run worldforge runs list --status failed --artifact-type json
uv run worldforge runs bundle <run-id> \
--workspace-dir .worldforge \
--output .worldforge/issue-bundles/<run-id>
Success signal: worldforge runs index --status failed --artifact-type json lists the failed run
with a sanitized rerun command and the same worldforge runs bundle <run-id> recovery command. The
bundle command writes evidence_manifest.json, summary.md, and issue.md, then prints a short
issue template with the command, expected signal, observed failure, artifact list, safe_to_attach
status, and first triage step. If safe_to_attach is false, inspect the manifest's excluded and
local_only entries before attaching anything.
from pathlib import Path

from worldforge import WorldForge
from worldforge.observability import JsonLoggerSink, RunJsonLogSink, compose_event_handlers

run_id = "20260430T120000Z-provider-event"
log_path = Path(".worldforge") / "runs" / run_id / "provider-events.jsonl"
forge = WorldForge(
    event_handler=compose_event_handlers(
        JsonLoggerSink(extra_fields={"run_id": run_id, "host": "service"}),
        RunJsonLogSink(log_path, run_id=run_id, extra_fields={"host": "service"}),
    )
)
Success signal:
- each line in `provider-events.jsonl` is a complete JSON object.
- every record has `event_type=provider_event` and the same `run_id` as the host run manifest.
- `target` values keep only route-level context; URL query strings and fragments are removed.
- `message`, `metadata`, and sink `extra_fields` redact bearer tokens, API keys, signatures, passwords, signed URLs, and token-like assignments.
- host applications inject sinks into `WorldForge(event_handler=...)` or provider constructors instead of changing global logging configuration inside WorldForge.
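The first three success signals can be checked mechanically before a log is attached to an issue. The sketch below is a hypothetical stdlib-only checker, not a WorldForge utility. It relies only on the `event_type`, `run_id`, and `target` fields named above and does not attempt to detect unredacted secrets.

```python
import json
from urllib.parse import urlsplit


def check_provider_events(lines, expected_run_id):
    """Return a list of problems found in provider-events.jsonl lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        if record.get("event_type") != "provider_event":
            problems.append(f"line {number}: unexpected event_type")
        if record.get("run_id") != expected_run_id:
            problems.append(f"line {number}: run_id mismatch")
        target = record.get("target")
        if isinstance(target, str):
            parts = urlsplit(target)
            if parts.query or parts.fragment:
                problems.append(f"line {number}: target keeps a query string or fragment")
    return problems
```

Passing `open(log_path, encoding="utf-8")` as `lines` works because a trailing newline is valid JSON whitespace; an empty list means the log matches the signals this sketch covers.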
First triage queries:
jq 'select(.phase=="failure") | {provider, operation, status_code, target, message}' \
.worldforge/runs/<run-id>/provider-events.jsonl
jq 'select(.phase=="retry") | {provider, operation, attempt, max_attempts, status_code, target}' \
.worldforge/runs/<run-id>/provider-events.jsonl
jq -s 'group_by(.provider,.operation)[] | {provider: .[0].provider, operation: .[0].operation, events: length}' \
.worldforge/runs/<run-id>/provider-events.jsonl
For optional live smokes, preserve the manifest beside the event log:
scripts/robotics-showcase \
--json-output .worldforge/runs/<run-id>/real-run.json \
--run-manifest .worldforge/runs/<run-id>/run_manifest.json
jq '{run_id, provider_profile, capability, status, event_count, artifact_paths}' \
.worldforge/runs/<run-id>/run_manifest.json
If it fails:
| Symptom | First check | Likely owner |
|---|---|---|
| no file was written | confirm the host passed `RunJsonLogSink` into the active event handler | host app |
| records have different run IDs | compare sink construction with the run manifest writer | host app |
| raw credential appears in an exported log | remove the raw value from custom metadata or exception text and add a regression test | contributor |
| failures have no status code | inspect provider-specific docs; local dependency failures may not have HTTP status | operator |
6. Run Evaluation And Benchmarks
Use evaluation for deterministic behavior checks and benchmarks for adapter latency and event shape. Do not treat either as a physical-fidelity claim.
uv run worldforge eval --suite planning --provider mock --format markdown
uv run worldforge eval --suite physics --provider mock --format json
uv run worldforge eval --suite planning --provider mock \
--dataset-manifest examples/dataset-manifests/mock-evaluation-fixtures.json \
--format json
uv run worldforge benchmark --provider mock --iterations 5 --format markdown
uv run worldforge benchmark --provider mock --iterations 5 --format json
uv run worldforge benchmark --provider mock --operation embed --input-file examples/benchmark-inputs.json
uv run worldforge benchmark --provider mock --operation predict --budget-file examples/benchmark-budget.json
Success signal:
- suites skip or fail explicitly when a provider does not support the required capability.
- evaluation dataset manifests are cited by compact provenance references; license, privacy, safety, checksums, and host-owned acquisition steps are recorded without copying datasets.
- benchmark reports identify provider, operation, pass/fail status, latency, retry counts, and exported artifact format for direct provider surfaces such as `predict`, `score`, `policy`, and `embed`.
- benchmark budget files fail non-zero when success rate, error count, retry count, latency, or throughput thresholds regress.
- `--input-file` fixtures reproduce benchmark inputs for prediction, embedding, score, and policy runs. The checked-in fixture is checkout-safe for `mock` `predict` and `embed`; its score and policy fields are provider-specific inputs for providers that advertise those capabilities.
- benchmark input files and result JSON are saved by the host when they are used for release or paper claims.
If a score changes, first check provider capability, test fixture changes, input data, and retry events. Do not rewrite claims around a one-off run without preserving the run artifact.
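The budget gate described above amounts to comparing observed metrics against floors and ceilings and exiting non-zero on any regression. The sketch below shows that logic only. Every key name in it (`min_success_rate`, `max_latency_ms`, and so on) is hypothetical and is not the schema of `examples/benchmark-budget.json` or of the benchmark report.

```python
# Hypothetical key names; the real budget and report schemas may differ.
FLOORS = {
    "min_success_rate": "success_rate",
    "min_throughput_per_second": "throughput_per_second",
}
CEILINGS = {
    "max_error_count": "error_count",
    "max_retry_count": "retry_count",
    "max_latency_ms": "latency_ms",
}


def budget_violations(report, budget):
    """List every metric that regressed past its budget threshold."""
    violations = []
    for limit, metric in FLOORS.items():
        if limit in budget and report[metric] < budget[limit]:
            violations.append(f"{metric} {report[metric]} is below {budget[limit]}")
    for limit, metric in CEILINGS.items():
        if limit in budget and report[metric] > budget[limit]:
            violations.append(f"{metric} {report[metric]} is above {budget[limit]}")
    return violations


def exit_code(report, budget):
    """Mirror the gate: zero when the budget holds, non-zero otherwise."""
    return 1 if budget_violations(report, budget) else 0
```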
When a preserved benchmark baseline justifies a budget update, generate review artifacts instead of editing release budgets directly:
uv run python scripts/calibrate_benchmark_budgets.py \
--report .worldforge/reports/benchmark-<timestamp>-<run-id>.json \
--current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json \
--output .worldforge/benchmark-calibration/release-evidence-candidate
Success signal: candidate-budgets.json loads through the benchmark budget parser, and
budget-calibration.md shows source report digests, machine class, old threshold, candidate
threshold, observed baseline, and rationale. First triage step on an unexpected candidate is to
rerun the preserved benchmark command on the same machine class and compare report digests before
loosening any release gate.
6a. Preserve Evaluation And Benchmark Reports
The eval and benchmark CLIs preserve completed reports under the active run workspace:
.worldforge/reports/eval-<suite>-<timestamp>-<run-id>.json
.worldforge/reports/benchmark-<timestamp>-<run-id>.json
The JSON is written through the same renderer used by the worldforge eval and worldforge
benchmark commands. Use the preserved report path whenever a benchmark or evaluation result is
cited in a PR, release note, paper, or slide.
First triage step for a surprising number: open the saved JSON, confirm the provider and operation/suite, then rerun the matching CLI command with the same provider and operation.
6b. Record A Rerun Inspection Artifact
Use Rerun when you need a visual, time-indexed inspection artifact for provider events, world state, plans, and benchmark metrics:
uv run --extra rerun worldforge-demo-rerun
uv run --extra rerun rerun .worldforge/rerun/worldforge-rerun-showcase.rrd
Success signal: the .rrd file contains provider event text logs, world snapshots, plan payloads,
3D object/target markers, and mock benchmark metrics. First triage step: verify the optional SDK
is available with uv run --extra rerun python -c "import rerun; print(rerun.__version__)".
For the real PushT policy+score showcase, the wrapper writes a Rerun artifact by default:
scripts/robotics-showcase
uvx --from "rerun-sdk>=0.24,<0.32" rerun /tmp/worldforge-robotics-showcase/real-run.rrd
Success signal: the recording contains candidate target points, selected replay lines, score bars,
latency bars, provider events, world snapshots, and the plan payload. Use --no-rerun for runs
where only the TUI/JSON artifact is needed. In the robotics TUI, press o to open the persisted
Rerun recording directly.
7. Handle Provider Artifacts
Use this for provider outputs and event evidence that may be attached to an issue or release note.
Preflight:
uv run worldforge doctor --capability score
uv run worldforge provider info leworldmodel
uv run worldforge provider health leworldmodel
Operational rules:
- create-style requests are single-attempt unless the provider contract is idempotent.
- health, polling, and downloads can retry through `ProviderRequestPolicy`.
- `timeout_seconds` is a per-attempt request timeout; `max_elapsed_seconds` is the host's workflow budget for the operation, including retries, backoff, and poll intervals.
- budget failures raise `ProviderBudgetExceededError` and emit `phase=="budget_exceeded"` so alerts can distinguish an exhausted host budget from an upstream HTTP failure.
- provider-returned artifact references are treated as untrusted input: no embedded credentials, no local/private/link-local destinations by default, and no raw signed URLs in attachable evidence.
- provider errors should include operation and provider context without leaking credentials, bearer tokens, or signed URLs.
- provider event `target` values are sanitized for logs: use them to identify the endpoint or artifact path, not to recover a full signed URL.
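The difference between a per-attempt timeout and an overall workflow budget in the rules above is easiest to see in a retry loop. The following is an illustrative sketch only: it does not reproduce `ProviderRequestPolicy`, and `call_with_budget`, `BudgetExceeded`, and the injectable `clock` and `sleep` parameters are hypothetical.

```python
import time


class BudgetExceeded(RuntimeError):
    """Stand-in for illustration; not the WorldForge exception class."""


def call_with_budget(attempt, *, timeout_seconds, max_elapsed_seconds,
                     max_attempts=3, backoff_seconds=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry attempt(timeout) until it succeeds or the workflow budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    started = clock()
    last_error = None
    for number in range(1, max_attempts + 1):
        remaining = max_elapsed_seconds - (clock() - started)
        if remaining <= 0:
            raise BudgetExceeded(f"budget exhausted before attempt {number}") from last_error
        try:
            # Each attempt gets the per-attempt timeout, capped by what is left.
            return attempt(min(timeout_seconds, remaining))
        except TimeoutError as error:
            last_error = error
        if number < max_attempts:
            # Backoff also counts against the budget, so never sleep past it.
            left = max(0.0, max_elapsed_seconds - (clock() - started))
            sleep(min(backoff_seconds, left))
    raise last_error
```

Raising a distinct budget error, rather than the last timeout, is what lets an alert separate an exhausted host budget from an upstream failure.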
If artifact export fails, inspect provider events for operation, phase, status_code,
attempt, and sanitized target, then rerun the local workflow after fixing the provider input or
host configuration.
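The operational rule that provider-returned artifact references are untrusted input can be sketched with standard-library URL and IP parsing. This is a hedged illustration rather than the checks WorldForge performs. `artifact_reference_problems` is a hypothetical helper, it does not resolve DNS names, and it does not detect signed URLs.

```python
import ipaddress
from urllib.parse import urlsplit


def artifact_reference_problems(url):
    """Return reasons to reject a provider-returned artifact URL (empty if none)."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in {"http", "https"}:
        problems.append("unsupported scheme")
    if parts.username or parts.password:
        problems.append("embedded credentials")
    host = parts.hostname or ""
    if not host:
        problems.append("missing host")
    elif host == "localhost" or host.endswith(".localhost"):
        problems.append("local destination")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # a DNS name or empty host; resolution is out of scope here
    if address is not None and (
        address.is_loopback or address.is_private or address.is_link_local
    ):
        problems.append("local, private, or link-local destination")
    return problems
```

A host that deliberately allows a loopback endpoint, as the prepared-host smokes do with an explicit opt-in variable, would bypass the address check behind an equally explicit setting.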
8. Run Optional Runtime Smokes
Use checkout-safe demos first. Use real runtime smokes only in a host environment that has the model, checkpoint, CUDA or robot stack, and task-specific preprocessing.
Checkout-safe:
uv run pytest
uv run pytest -m "not live"
uv run worldforge-demo-leworldmodel
uv run worldforge-demo-lerobot
Runtime pytest profiles are opt-in. Mark live provider tests with the smallest truthful set of
markers, for example @pytest.mark.live, @pytest.mark.network,
@pytest.mark.credentialed, @pytest.mark.gpu, @pytest.mark.robotics, and
@pytest.mark.provider_profile("leworldmodel"). Default uv run pytest skips marked tests before they
can reach live endpoints, GPUs, robot stacks, credentials, or downloaded checkpoints.
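Opt-in marker gating of this kind is usually implemented in `conftest.py` with pytest's collection hooks. The sketch below shows one way it could work and is not the repository's actual `conftest.py`. The flag names appear in the commands in this playbook, but the flag-to-marker mapping and the `skip_reason` helper are assumptions, and the `credentialed` and `provider_profile` markers are left out.

```python
# conftest.py sketch: the flag-to-marker mapping below is an assumption.
FLAG_BY_MARKER = {
    "live": "--run-live",
    "network": "--run-network",
    "gpu": "--run-gpu",
    "robotics": "--run-robotics",
}


def skip_reason(marker_names, enabled_flags):
    """Return a skip reason naming the first missing opt-in flag, else None."""
    for marker, flag in FLAG_BY_MARKER.items():
        if marker in marker_names and flag not in enabled_flags:
            return f"needs {flag} (test is marked {marker})"
    return None


def pytest_addoption(parser):
    """Register one boolean opt-in flag per runtime marker."""
    for flag in FLAG_BY_MARKER.values():
        parser.addoption(flag, action="store_true", default=False)


def pytest_collection_modifyitems(config, items):
    """Skip marked tests at collection time, before any test body can run."""
    import pytest  # imported lazily so skip_reason stays importable without pytest

    enabled = {flag for flag in FLAG_BY_MARKER.values() if config.getoption(flag)}
    for item in items:
        reason = skip_reason({mark.name for mark in item.iter_markers()}, enabled)
        if reason:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Deciding at collection time is what keeps a default `uv run pytest` from touching endpoints, GPUs, or credentials.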
Prepared-host provider profiles:
# Cosmos-Policy: requires COSMOS_POLICY_BASE_URL and a reachable ALOHA /act server.
COSMOS_POLICY_BASE_URL=http://127.0.0.1:8777 \
COSMOS_POLICY_ALLOW_LOCAL_BASE_URL=1 \
uv run pytest -m "live and network and robotics and provider_profile" \
--run-live --run-network --run-robotics --provider-profile cosmos-policy
# Expected success: pytest completes the selected live profile without failures.
# First triage: run `uv run worldforge provider health cosmos-policy` to confirm
# configuration only; use the smoke command below to verify `/act` reachability.
COSMOS_POLICY_BASE_URL=http://127.0.0.1:8777 \
COSMOS_POLICY_ALLOW_LOCAL_BASE_URL=1 \
uv run worldforge-smoke-cosmos-policy \
--health-only \
--run-manifest .worldforge/runs/cosmos-policy-health/run_manifest.json
# Expected success: run_manifest.json records capability=policy with status=skipped.
# First triage: run `uv run worldforge provider health cosmos-policy` to confirm
# configuration only, then use the full `/act` smoke below for endpoint behavior.
COSMOS_POLICY_BASE_URL=http://127.0.0.1:8777 \
COSMOS_POLICY_ALLOW_LOCAL_BASE_URL=1 \
uv run worldforge-smoke-cosmos-policy \
--policy-info-json /path/to/policy_info.json \
--translator /path/to/translator.py:translate_actions \
--allow-translator-code \
--run-manifest .worldforge/runs/cosmos-policy-live/run_manifest.json
# Expected success: run_manifest.json records capability=policy with status=passed.
# First triage: verify the `/act` server is reachable and recheck the translator path plus
# policy_info.json shape.
# Cosmos-Policy remote GPU checklist:
#
# 1. Use a prepared Linux/NVIDIA host that can run the upstream Cosmos-Policy server.
# Current ALOHA Predict2 smokes should start from a 48 GB or larger GPU memory class unless
# the upstream requirements are stricter.
# 2. Keep checkpoints, model approvals, CUDA, Docker, and Hugging Face/NVIDIA tokens on the host.
# WorldForge should only see the `/act` endpoint and sanitized run evidence.
# 3. Prefer an SSH tunnel to `127.0.0.1:8777`. If the server is exposed directly, restrict
# inbound TCP 8777 to the operator IP or VPN CIDR and set `COSMOS_POLICY_ALLOWED_HOSTS`.
# 4. Run health-only first. It validates WorldForge configuration only; it does not prove `/act`
# inference because the targeted upstream server has no non-mutating health endpoint.
# 5. Run the full smoke with prepared ALOHA policy info and a trusted translator. Expected success
# is `run_manifest.json` with capability=policy, status=passed, and a non-empty action shape
# such as 50 x 14.
# 6. Preserve only sanitized evidence. Do not commit raw images, tokens, checkpoints, Docker
# layers, or GPU logs with secrets. Hibernate or terminate the GPU host when finished.
# LeWorldModel: requires LEWORLDMODEL_POLICY or LEWM_POLICY and host-owned runtime deps.
LEWORLDMODEL_POLICY=pusht/lewm \
uv run pytest -m "live and gpu and provider_profile" \
--run-live --run-gpu --provider-profile leworldmodel
# GR00T: requires GROOT_POLICY_HOST and a reachable policy server.
GROOT_POLICY_HOST=127.0.0.1 \
uv run pytest -m "live and network and robotics and provider_profile" \
--run-live --run-network --run-robotics --provider-profile gr00t
# Expected success: pytest completes the selected live profile without failures.
# First triage: run `uv run worldforge provider health gr00t` to confirm client
# configuration and server reachability.
GROOT_POLICY_HOST=127.0.0.1 \
GROOT_POLICY_PORT=5555 \
uv run --with msgpack --with pyzmq --with numpy python scripts/smoke_gr00t_policy.py \
--health-only \
--run-manifest .worldforge/runs/gr00t-health/run_manifest.json
# Expected success: run_manifest.json records capability=policy with status=skipped.
# First triage: confirm the remote PolicyClient server is reachable before sending
# observation data.
GROOT_POLICY_HOST=127.0.0.1 \
GROOT_POLICY_PORT=5555 \
uv run --with msgpack --with pyzmq --with numpy python scripts/smoke_gr00t_policy.py \
--policy-info-json /path/to/policy_info.json \
--translator /path/to/translator.py:translate_actions \
--allow-translator-code \
--run-manifest .worldforge/runs/gr00t-live/run_manifest.json
# Expected success: run_manifest.json records capability=policy with status=passed.
# First triage: recheck the observation shape, translator import path, and remote
# server logs.
# LeRobot: requires LEROBOT_POLICY_PATH or LEROBOT_POLICY and host-owned policy deps.
LEROBOT_POLICY_PATH=lerobot/diffusion_pusht \
uv run pytest -m "live and robotics and provider_profile" \
--run-live --run-robotics --provider-profile lerobot
When a test is selected without the matching opt-in flag or provider environment, pytest reports a skip reason naming the missing flag or environment variable. Save stdout/stderr, JSON summaries, and provider-event logs from prepared-host runs when the result is used as release or issue evidence.
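When smoke output is kept as evidence, the expected outcome stated in the comments above (for example capability=policy with status=passed) can be checked mechanically. The following is a minimal sketch, not part of WorldForge: check_run_manifest is a hypothetical helper, and it assumes that capability and status are top-level keys of run_manifest.json, which this playbook does not confirm.

```python
import json
import tempfile
from pathlib import Path


def check_run_manifest(path, capability, expected_status):
    """Return a list of mismatches; an empty list means the manifest matches."""
    manifest = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []
    # Assumption: "capability" and "status" are top-level manifest keys.
    if manifest.get("capability") != capability:
        problems.append(
            f"capability is {manifest.get('capability')!r}, expected {capability!r}"
        )
    if manifest.get("status") != expected_status:
        problems.append(
            f"status is {manifest.get('status')!r}, expected {expected_status!r}"
        )
    return problems


# Demonstration with a stand-in manifest, not real smoke output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "run_manifest.json"
    demo.write_text(
        json.dumps({"capability": "policy", "status": "skipped"}), encoding="utf-8"
    )
    print(check_run_manifest(demo, "policy", "passed"))
```

A health-only run would be checked against status=skipped and a full smoke against status=passed, matching the expectations listed with each command.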
Real LeWorldModel checkpoint:
Equivalent explicit uv command:
uv run --python 3.13 \
--with "stable-worldmodel @ git+https://github.com/galilai-group/stable-worldmodel.git" \
--with "datasets>=2.21" \
--with "opencv-python" \
--with "imageio" \
lewm-real \
--checkpoint ~/.stable-wm/pusht/lewm_object.ckpt \
--device cpu
The wrapper runs uv run --python 3.13 with the upstream stable-worldmodel, datasets, OpenCV,
and imageio runtime requirements, then invokes the packaged lewm-real alias.
stable-worldmodel is the official LeWorldModel loading/evaluation runtime used by lucas-maes/le-wm;
LeWorldModelProvider loads the LeWM object checkpoint through
stable_worldmodel.policy.AutoCostModel. The live smoke prints what the run demonstrates, a visual
pipeline, tensor shapes, latency metrics, provider events, and a ranked candidate cost landscape.
It exits non-zero before inference if the checkpoint, optional runtime, or provider health check is
missing. Use --json-only for the machine-readable result payload, or --json-output
lewm-real-summary.json to write the same run data while keeping the visual output.
The live smoke uses deterministic synthetic PushT-shaped tensors. It proves the checkpoint loads and scores candidates through the WorldForge provider contract; it does not prove task-specific preprocessing or robot execution.
LeRobot policy plus LeWorldModel checkpoint scoring replay:
The showcase wrapper installs the host-owned optional runtime set for this process and runs the
packaged PushT bridge. It opens a Textual visual report showing the policy-to-score pipeline,
runtime bars, tensor metrics, staged reveal messages, an illustrative animated robot-arm replay,
full-width candidate ranking, provider events, and a tabletop replay map, then writes the full JSON
summary to /tmp/worldforge-robotics-showcase/real-run.json. Pass --tui-stage-delay <seconds> to
tune the reveal pace, --no-tui-animation to disable sleeps and arm motion, --no-tui for the plain
terminal report, --json-only for automation, or --health-only for a dependency preflight. The
wrapper requests lerobot[transformers-dep]==0.5.1 so the Python 3.13 policy import path is stable
while the LeWorldModel runtime is installed. It filters common macOS native-library duplicate class
warnings from the user-facing output while leaving runtime device fallback warnings visible; set
WORLDFORGE_SHOW_RUNTIME_WARNINGS=1 to see raw third-party stderr. The --health-only path does not
auto-build or download missing LeWorldModel checkpoints; it reports whether the checkpoint is
present and exits before inference.
Prepared-host CI uses .github/workflows/robotics-showcase.yml for this path on every pull
request update and on pushes to main. That workflow keeps the run non-interactive with
--json-only --no-tui --no-rerun, writes a JSON summary plus run_manifest.json, and verifies the
real lerobot.policy and leworldmodel.score success events. Use actions/cache for Hugging
Face policy downloads, LeWorldModel config/weights assets, and the built LeWorldModel object
checkpoint. Checkpoint artifacts are not uploaded; normal CI artifacts should be run evidence, not
reusable checkpoint storage.
Use the lower-level runner when replacing the task observation, score tensors, translator, or candidate bridge:
scripts/lewm-lerobot-real \
--policy-path lerobot/diffusion_pusht \
--policy-type diffusion \
--checkpoint ~/.stable-wm/pusht/lewm_object.ckpt \
--device cpu \
--mode select_action \
--observation-module /path/to/pusht_obs.py:build_observation \
--score-info-npz /path/to/lewm_score_tensors.npz \
--translator worldforge.smoke.lerobot_leworldmodel:translate_pusht_xy_actions \
--candidate-builder /path/to/pusht_lewm_bridge.py:build_action_candidates \
--expected-action-dim 10 \
--expected-horizon 4
Equivalent explicit uv command for the showcase wrapper:
uv run --python 3.13 \
--with "stable-worldmodel @ git+https://github.com/galilai-group/stable-worldmodel.git" \
--with "datasets>=2.21" \
--with "huggingface_hub" \
--with "hydra-core" \
--with "omegaconf" \
--with "matplotlib" \
--with "transformers" \
--with "lerobot[transformers-dep]==0.5.1" \
--with "textual>=8.2,<9" \
--with "pygame" \
--with "opencv-python" \
--with "imageio" \
--with "pymunk" \
--with "gymnasium" \
--with "shapely" \
worldforge-robotics-showcase --tui
Equivalent explicit uv command for the lower-level runner:
uv run --python 3.13 \
--with "stable-worldmodel @ git+https://github.com/galilai-group/stable-worldmodel.git" \
--with "datasets>=2.21" \
--with "opencv-python" \
--with "imageio" \
--with "lerobot[transformers-dep]==0.5.1" \
lewm-lerobot-real \
--policy-path lerobot/diffusion_pusht \
--policy-type diffusion \
--checkpoint ~/.stable-wm/pusht/lewm_object.ckpt \
--device cpu \
--mode select_action \
--observation-module /path/to/pusht_obs.py:build_observation \
--score-info-npz /path/to/lewm_score_tensors.npz \
--translator worldforge.smoke.lerobot_leworldmodel:translate_pusht_xy_actions \
--candidate-builder /path/to/pusht_lewm_bridge.py:build_action_candidates
This flow demonstrates robotics-builder composition: LeRobot proposes policy action candidates,
LeWorldModel ranks checkpoint-native candidate tensors, and WorldForge selects and mock-executes the
lowest-cost chunk through World.plan(..., planning_mode="policy+score"). The packaged
scripts/robotics-showcase command owns the PushT demonstration bridge; any other task still needs
a host-owned observation builder and candidate bridge. If the LeRobot raw action dimension or horizon
does not match the LeWorldModel checkpoint contract, provide a task-specific bridge instead of
padding or projecting actions.
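The shape rule and the lowest-cost selection described above can be illustrated in plain Python. This is a sketch under stated assumptions, not WorldForge's implementation: validate_candidates and select_lowest_cost are hypothetical helpers, candidates are assumed to be nested lists laid out as candidate x horizon x action dim, and the real selection happens inside World.plan.

```python
def validate_candidates(candidates, expected_horizon, expected_action_dim):
    """Raise ValueError unless every candidate has the expected horizon and action dim."""
    for index, candidate in enumerate(candidates):
        if len(candidate) != expected_horizon:
            raise ValueError(
                f"candidate {index}: horizon {len(candidate)} != {expected_horizon}; "
                "supply a task-specific bridge instead of padding"
            )
        for step, action in enumerate(candidate):
            if len(action) != expected_action_dim:
                raise ValueError(
                    f"candidate {index} step {step}: action dim {len(action)} "
                    f"!= {expected_action_dim}; supply a task-specific bridge "
                    "instead of projecting"
                )


def select_lowest_cost(candidates, costs):
    """Return (index, candidate) for the candidate with the smallest cost."""
    if not candidates or len(candidates) != len(costs):
        raise ValueError("need at least one candidate and one cost per candidate")
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, candidates[best]


# Stand-in data: two candidates with horizon 4 and action dim 10, the values
# passed to --expected-horizon and --expected-action-dim in the runner example.
candidates = [[[0.0] * 10 for _ in range(4)] for _ in range(2)]
validate_candidates(candidates, expected_horizon=4, expected_action_dim=10)
index, _ = select_lowest_cost(candidates, costs=[0.8, 0.3])
print(index)
```

Failing loudly on a mismatch, rather than reshaping, keeps the checkpoint contract visible to whoever owns the bridge.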
GR00T and LeRobot live smokes:
uv run python scripts/smoke_gr00t_policy.py --help
uv run python scripts/smoke_lerobot_policy.py --help
Success signal: the demo or smoke states whether it used injected deterministic runtime, real checkpoint inference, remote policy server, provider events, persistence, and reload. Do not describe an injected demo as real neural inference.
9. Prepare A Release Or Public Branch¶
Use this before publishing a package, merging provider work, or pushing a milestone.
uv sync --group dev
uv lock --check
uv run ruff check src tests examples scripts
uv run ruff format --check src tests examples scripts
uv run python scripts/generate_provider_docs.py --check
uv run python scripts/check_docs_commands.py
uv run python scripts/check_docs_snippets.py
uv run python scripts/check_wrapper_portability.py
uv run python scripts/check_optional_import_boundaries.py
uv run python scripts/check_core_performance.py
uv run mkdocs build --strict
uv run pytest
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
bash scripts/test_package.sh
uv build --out-dir dist --clear --no-build-logs
shasum -a 256 dist/worldforge_ai-*.whl dist/worldforge_ai-*.tar.gz
The package contract checks both distribution artifacts: the wheel must contain only runtime package
files, the py.typed marker, capability protocols, observable capability wrapper, and console
scripts; the sdist must contain docs, tests, examples, scripts, and release metadata needed to
rebuild and audit the source package.
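On hosts without shasum, the same SHA-256 digests can be produced with the Python standard library. This is a minimal sketch: sha256_file and dist_checksums are hypothetical helper names, and the glob patterns simply mirror the shasum command above.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dist_checksums(dist_dir="dist"):
    """Map each wheel and sdist file name in dist_dir to its SHA-256 hex digest."""
    root = Path(dist_dir)
    if not root.is_dir():
        return {}
    artifacts = sorted(root.glob("worldforge_ai-*.whl"))
    artifacts += sorted(root.glob("worldforge_ai-*.tar.gz"))
    return {artifact.name: sha256_file(artifact) for artifact in artifacts}


# Prints nothing when dist/ has not been built yet.
for name, digest in dist_checksums().items():
    print(f"{digest}  {name}")
```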
Run the core performance budget gate with a preserved workspace before claiming release readiness:
uv run python scripts/check_core_performance.py \
--workspace-dir .worldforge/core-performance \
--output .worldforge/core-performance/core-performance.json
Success signal: core-performance.json has passed: true and result rows for world persistence,
benchmark fixture loading, provider catalog diagnostics, evidence-bundle creation, and report
rendering. First triage step: inspect the failing row's artifact path and rerun the single changed
code path before loosening a budget. The report is a local regression guard, not a public
performance claim.
Run the wrapper portability checker whenever shell wrappers, optional-runtime smoke commands, or documented host commands change:
uv run python scripts/check_wrapper_portability.py
Success signal: every wrapper row passes, including scripts/robotics-showcase, LeWorldModel
wrappers, GR00T and LeRobot smoke helpers, and scripts/test_package.sh. First triage step: repair
the exact script named in the failure before editing docs around it.
Run the docs snippet gate whenever Python or JSON examples change:
uv run python scripts/check_docs_snippets.py
Success signal: marked Python snippets execute in a temp workspace, marked JSON snippets parse and schema-check where supported, and host-owned, credentialed, or illustrative examples use explicit skip markers. First triage step: fix the file, heading, and line named in the report or apply the correct skip marker.
Run the optional import boundary audit whenever base imports, CLI startup, non-TUI harness modules, or optional provider modules change:
uv run python scripts/check_optional_import_boundaries.py
Success signal: the static audit reports no direct optional-runtime imports outside allowed
modules, and the import-time audit loads worldforge, worldforge.cli, provider modules, and
non-TUI harness modules without importing Textual, Rerun, torch, stable-worldmodel, LeRobot,
GR00T, or Cosmos-Policy packages. First triage step: move the named import behind the provider,
smoke, worldforge.rerun, or harness.tui boundary identified by the report.
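The idea behind an import-time audit can be sketched as follows: import the target module in a fresh interpreter, then report which forbidden top-level packages ended up in sys.modules. This is an illustration only, not the logic of scripts/check_optional_import_boundaries.py. leaked_imports is a hypothetical helper, and the forbidden names shown are assumed import names for some of the packages listed above.

```python
import subprocess
import sys

# Runs in a child interpreter so the parent's already-imported modules do not
# pollute the result.
PROBE = """
import importlib, sys
importlib.import_module(sys.argv[1])
loaded = {name.partition(".")[0] for name in sys.modules}
for name in sorted(loaded.intersection(sys.argv[2:])):
    print(name)
"""


def leaked_imports(module, forbidden):
    """Return the forbidden top-level packages loaded by importing `module`."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE, module, *forbidden],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Demonstrated against a stdlib module so the sketch runs without WorldForge
# installed; a real audit would pass worldforge, worldforge.cli, and so on.
forbidden = ["textual", "rerun", "torch", "stable_worldmodel", "lerobot"]
print(leaked_imports("json", forbidden))
```

A non-empty result names the packages that crossed the boundary, which is the import to move behind a provider, smoke, or TUI module.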
Then generate locked dependency-audit evidence:
Success signal: .worldforge/dependency-audit/dependency-audit.json and
.worldforge/dependency-audit/dependency-audit.md exist, the status is passed, and the
requirements summary states that the temporary requirements file was not preserved. The wrapper
uses uv export --frozen --all-groups --no-emit-project --no-hashes plus
uvx --from pip-audit pip-audit ... --format json; use --ignore-advisory ADVISORY=RATIONALE
only for explicit release-reviewed exceptions. Raw-detail keys and values are sanitized before JSON
or Markdown rendering. First triage step for findings: inspect the Markdown advisory table,
upgrade or document the dependency decision, then rerun.
Finally generate the release-readiness evidence. This command writes
.worldforge/release-evidence/release-evidence.md and
.worldforge/release-evidence/release-evidence.json; add --run-gates when the evidence run
should execute the checkout-safe gates itself.
uv run python scripts/generate_evidence_bundle.py \
--workspace-dir .worldforge \
--output .worldforge/evidence-bundles/release-candidate
uv run python scripts/generate_release_evidence.py \
--run-gates \
--live-smoke-registry docs/src/live-smoke-evidence.json \
--run-manifest .worldforge/runs/<run-id>/run_manifest.json \
--artifact .worldforge/dependency-audit/dependency-audit.json \
--artifact .worldforge/evidence-bundles/release-candidate/evidence_manifest.json \
--benchmark-artifact .worldforge/reports/benchmark-<timestamp>-<run-id>.json \
--artifact dist/worldforge_ai-<version>-py3-none-any.whl
The generator does not require provider credentials; optional live smokes without linked manifests
are listed explicitly as host-owned, the registry records missing-runtime and missing-credential
skips, and passed/failed/skipped live runs link back to their preserved run_manifest.json files
and artifact summaries. Gate rows are explicit passed, failed, or skipped and include the
first triage step. Use --known-limitation for release-scoped caveats that should travel with the
bundle.
Generate a local quality dashboard after the evidence artifacts exist:
Success signal: .worldforge/quality-dashboard/quality-dashboard.json and
.worldforge/quality-dashboard/quality-dashboard.md exist, the table distinguishes failed,
warning, skipped, and not-run rows, and the first failed gate points back to sanitized raw
details for the underlying output. The dashboard reads existing gate outputs rather than running
them, and it sanitizes both raw-detail keys and values before JSON or Markdown rendering. It is an
at-a-glance local review index; release evidence remains the release artifact for hashes, linked
run manifests, optional runtime claim boundaries, and known limitations.
Then create a maintainer-editable release notes draft from the changelog, evidence JSON, and optional closed-issue metadata:
mkdir -p .worldforge/release-notes
gh issue list --state closed --limit 200 \
--json number,title,url,labels,closedAt,state \
> .worldforge/release-notes/closed-issues.json
uv run python scripts/generate_release_notes.py \
--release-evidence .worldforge/release-evidence/release-evidence.json \
--issues-json .worldforge/release-notes/closed-issues.json \
--known-caveat "No prepared-host live smoke was run for <provider>."
Success signal: .worldforge/release-notes/release-notes-draft.md includes Added, Changed,
Fixed, Docs, Validation, Compatibility Notes, Known Caveats, and Host-Owned Optional
Runtime Evidence sections. The draft is source material for release editing, not a publish step:
it never creates a GitHub release or tag, signs artifacts, or edits trusted publishing. Its status
checks both validation_summary and individual validation_gates, so a failed gate row keeps the
draft in needs-validation-review until release evidence is regenerated from passing gates. If the
draft says validation evidence is missing, run
uv run python scripts/generate_release_evidence.py --run-gates first. Use
--require-validation-evidence when release automation should fail on missing evidence. The draft
sanitizes changelog text, closed issue metadata, release-evidence text, and --known-caveat
values before Markdown rendering.
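The status rule described above, where either a non-passing summary or any failed gate row holds the draft in needs-validation-review, can be sketched as a small function. The field shapes are assumptions: validation_summary is taken to be an object with a status key, validation_gates a list of objects with name and status keys, and ready-for-editing is a hypothetical label for the passing case.

```python
def draft_status(evidence):
    """Return (status, failed_gate_names) for a release-evidence payload."""
    # Assumed shapes; the real release-evidence.json schema may differ.
    summary = evidence.get("validation_summary") or {}
    gates = evidence.get("validation_gates") or []
    failed = [
        gate.get("name", "<unnamed>")
        for gate in gates
        if gate.get("status") == "failed"
    ]
    # Both checks must pass: a green summary does not override a failed row.
    if summary.get("status") != "passed" or failed:
        return "needs-validation-review", failed
    return "ready-for-editing", failed


# Stand-in payload: the summary claims success but one gate row failed.
evidence = {
    "validation_summary": {"status": "passed"},
    "validation_gates": [
        {"name": "pytest", "status": "passed"},
        {"name": "mkdocs build --strict", "status": "failed"},
    ],
}
print(draft_status(evidence))
```

Checking the rows as well as the summary guards against a stale or hand-edited summary masking a failed gate.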
Success signal:
- validation passes from a clean checkout.
- generated provider docs have no drift and the Pages site builds in strict mode.
- release evidence links validation expectations, optional live-smoke manifests, benchmark artifacts, generated evidence bundles, dependency-audit evidence, distribution artifacts, JSON summaries, first triage steps, and known limitations.
- release notes draft links changelog entries, closed issues by label, validation evidence, compatibility notes, caveats, and host-owned optional-runtime evidence for maintainer review.
- README, docs, changelog, and AGENTS.md reflect public behavior.
- no optional runtime dependency, checkpoint, credential, generated artifact, or .env file is committed accidentally.
10. Triage Incidents And Regressions¶
Use this as the first stop when a user reports a failure. For provider-specific parser, retry, credential, optional-runtime, scaffold, and unsafe-artifact examples, cross-check the Provider Failure Mode Gallery before attaching evidence.
| Reported failure | First command | Evidence to capture | Usual fix path |
|---|---|---|---|
| provider missing | `uv run worldforge doctor` | registered providers, required env vars | environment or catalog registration |
| provider unhealthy | `uv run worldforge provider health <name>` | health details, optional dependency versions | host runtime setup or provider health code |
| unsupported capability | `uv run worldforge doctor --capability <capability>` | provider profile and workflow call | choose correct provider or implement capability |
| persistence load failed | reproduce `load_world` with saved JSON | failing JSON, world ID, state dir | restore from backup or fix importer validation |
| provider artifact export failed | provider events and provider-specific docs | status code, attempt, sanitized target | parser, retry policy, artifact handling, or host credentials |
| optional runtime smoke failed | smoke command and `--help` output | host OS, dependency path, checkpoint path | host runtime setup; do not add heavy deps to base package |
| coverage failed | `uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90` | missing lines and changed files | add behavior tests, especially error paths |
Do not paper over a failure by widening docs or loosening a capability. Fix the contract or make the limitation explicit.