Python API
For compatibility tiers, deprecation expectations, and artifact-schema migration rules, see Public API Stability.
Entry points
from worldforge import (
    Action,
    ActionPolicyResult,
    ActionScoreResult,
    BenchmarkBudget,
    Cost,
    Policy,
    load_benchmark_inputs,
    RunnableModel,
    WorldForge,
)
WorldForge
Top-level framework object responsible for:
- provider registration
- world creation and persistence
- prediction, embedding, action-scoring, and action-policy helpers
- provider profiles and environment diagnostics
Common inspection helpers:
from worldforge import Action, WorldForge
forge = WorldForge()
profiles = forge.builtin_provider_profiles()
doctor = forge.doctor()
print(profiles[0].supported_tasks)
print(doctor.issues)
Provider capability filters are strict. Valid capability names are predict, embed, plan,
score, and policy; an unknown name raises WorldForgeError rather than silently returning an
empty result because of a typo.
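The strict-filter behaviour can be sketched in plain Python. This is only an illustration: `CapabilityNameError`, `filter_by_capability`, and the `profiles` mapping are hypothetical stand-ins (the library raises WorldForgeError and has its own profile objects); only the five capability names come from the text above.

```python
# Illustrative sketch only; not the WorldForge implementation.
VALID_CAPABILITIES = frozenset({"predict", "embed", "plan", "score", "policy"})


class CapabilityNameError(ValueError):
    """Hypothetical stand-in for WorldForgeError."""


def filter_by_capability(profiles, capability):
    """Return provider names supporting `capability`; reject unknown names."""
    if capability not in VALID_CAPABILITIES:
        # Fail loudly so a typo cannot masquerade as "no provider matches".
        raise CapabilityNameError(
            f"unknown capability {capability!r}; "
            f"expected one of {sorted(VALID_CAPABILITIES)}"
        )
    return [name for name, tasks in profiles.items() if capability in tasks]


# Made-up data: provider name -> set of supported capabilities.
profiles = {"mock": {"predict", "embed"}, "local-cost": {"score"}}
print(filter_by_capability(profiles, "score"))

try:
    filter_by_capability(profiles, "scor")  # typo
except CapabilityNameError as exc:
    print(exc)
```

A valid name with no matching provider still returns an empty list, so an empty result always means "nothing matches" and never "the name was misspelled".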
Capability protocols
Provider adapters can still subclass BaseProvider, but small integrations can also register a
single capability object. The object declares a non-empty name, optional ProviderProfileSpec,
and the method for exactly the capability it implements.
from worldforge import ActionScoreResult, WorldForge
from worldforge.providers import ProviderProfileSpec
class LocalCost:
    name = "local-cost"
    profile = ProviderProfileSpec(description="Local score model")
    def score_actions(self, *, info, action_candidates):
        return ActionScoreResult(provider=self.name, scores=[0.2, 0.7], best_index=0)
forge = WorldForge()
forge.register_cost(LocalCost())
result = forge.score_actions(cost="local-cost", info={}, action_candidates=[{}, {}])
print(result.best_index)
The same pattern is available through register_policy, register_predictor, and
register_embedder.
forge.register(...) dispatches a pure object by protocol membership, and
RunnableModel(...) can group several capability implementations. Registered protocol
implementations appear in providers(), provider_profile(...), doctor(...), planning, and the
benchmark harness.
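Dispatch by protocol membership can be approximated with `typing.Protocol` and `runtime_checkable`. The sketch below is an assumption about the general technique, not WorldForge's code: `SupportsScore`, `SupportsPolicy`, `Registry`, and the `select_actions(self, *, info)` signature are hypothetical names chosen for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsScore(Protocol):
    def score_actions(self, *, info, action_candidates): ...


@runtime_checkable
class SupportsPolicy(Protocol):
    def select_actions(self, *, info): ...  # signature is an assumption


class Registry:
    """Hypothetical registry that routes plain objects by the methods they expose."""

    def __init__(self):
        self.costs = {}
        self.policies = {}

    def register(self, obj):
        name = getattr(obj, "name", "")
        if not isinstance(name, str) or not name:
            raise TypeError("capability objects must declare a non-empty name")
        matched = []
        # isinstance() on a runtime_checkable Protocol checks method presence only.
        if isinstance(obj, SupportsScore):
            self.costs[name] = obj
            matched.append("score")
        if isinstance(obj, SupportsPolicy):
            self.policies[name] = obj
            matched.append("policy")
        if not matched:
            raise TypeError(f"{name!r} implements no known capability protocol")
        return matched


class LocalCost:
    name = "local-cost"

    def score_actions(self, *, info, action_candidates):
        return [0.2, 0.7]


registry = Registry()
print(registry.register(LocalCost()))
```

Runtime protocol checks verify only that the methods exist, not their signatures, so a real registry would still need to validate results at call time.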
For a runnable policy-plus-score example that registers plain in-process objects, see Capability Protocols Quickstart.
Capability protocol implementations and BaseProvider subclasses may also expose provider-owned
preflight, warmup, and teardown hooks that return ProviderLifecycleResult. Diagnostics
surface the aggregate ProviderLifecycleStatus through doctor() and
provider_lifecycle_status(...) without changing the capability method contract.
World State
There is no symbolic World runtime or built-in JSON world store. Planning runs over plain,
JSON-serializable world-state dicts: forge.predict rolls a world_state dict forward one action
at a time.
from worldforge import Action, WorldForge
forge = WorldForge()
world_state = {"step": 0, "scene": {"objects": {}}}
payload = forge.predict(world_state, Action.move_to(0.2, 0.5, 0.0), steps=1, provider="mock")
print(payload.metadata["provider"], payload.physics_score)
next_state = payload.state
Invalid public inputs raise WorldForgeError, and malformed persisted or provider world-state
payloads raise WorldStateError at the boundary. Durable persistence is host-owned: serialize the
plain world-state dict into your own store.
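Because persistence is host-owned, one minimal approach is to round-trip the plain dict through the standard `json` module. The helper names and file layout below are hypothetical; the sketch does not touch WorldForge at all.

```python
import json
import tempfile
from pathlib import Path


def save_world_state(path, world_state):
    # allow_nan=False makes json.dumps raise ValueError on NaN/Infinity,
    # which are not valid JSON and would not survive a strict reader.
    text = json.dumps(world_state, allow_nan=False, sort_keys=True)
    path.write_text(text, encoding="utf-8")


def load_world_state(path):
    return json.loads(path.read_text(encoding="utf-8"))


world_state = {"step": 0, "scene": {"objects": {}}}

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "world-state.json"  # hypothetical host-chosen location
    save_world_state(store, world_state)
    restored = load_world_state(store)

print(restored == world_state)
```

A production store would likely add atomic writes and versioning; those choices stay with the host application.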
Observability
import logging
from pathlib import Path
from worldforge import Action, WorldForge
from worldforge.workflow_trace import workflow_trace_from_provider_events
from worldforge.observability import (
    JsonLoggerSink,
    ProviderMetricsExporterSink,
    OpenTelemetryProviderEventSink,
    ProviderMetricsSink,
    RunJsonLogSink,
    compose_event_handlers,
)
from worldforge.rerun import RerunArtifactLogger, RerunEventSink, RerunRecordingConfig, RerunSession
run_id = "demo-run"
metrics = ProviderMetricsSink()
host_metrics_exporter = ... # supplied by your service
forge = WorldForge(
    event_handler=compose_event_handlers(
        JsonLoggerSink(logger=logging.getLogger("demo.worldforge"), extra_fields={"run_id": run_id}),
        RunJsonLogSink(Path(".worldforge") / "runs" / run_id / "provider-events.jsonl", run_id),
        ProviderMetricsExporterSink(host_metrics_exporter),
        metrics,
    )
)
world_state = {"step": 0, "scene": {"objects": {}}}
forge.predict(world_state, Action.move_to(0.2, 0.5, 0.0), steps=1, provider="mock")
print(metrics.get("mock", "predict").to_dict())
Provider events are log-safe by default. The target field keeps endpoint or artifact path context
but strips URL userinfo, query strings, and fragments; message, metadata, and sink extra fields
redact obvious bearer tokens, API keys, signatures, passwords, and signed URLs. RunJsonLogSink
appends one JSON object per line and stamps every record with run_id for manifest correlation.
OpenTelemetryProviderEventSink is optional and accepts an injected host tracer, so the base
package does not import OpenTelemetry or configure collectors.
ProviderMetricsExporterSink is also optional and accepts a host-owned metrics exporter with
bounded labels for provider, operation, phase, status class, and capability.
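The target sanitization described for provider events (keep endpoint context; drop userinfo, query string, and fragment) can be illustrated with `urllib.parse`. `sanitize_target` is a hypothetical helper written for this sketch, not the library's function, and it does not cover the token redaction applied to messages and metadata.

```python
from urllib.parse import urlsplit, urlunsplit


def sanitize_target(target: str) -> str:
    """Strip userinfo, query, and fragment from URL-shaped targets."""
    parts = urlsplit(target)
    if not parts.scheme or not parts.netloc:
        # Not URL-shaped (for example a relative artifact path): leave unchanged.
        return target
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Rebuild with empty query and fragment; userinfo is dropped with the old netloc.
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


print(sanitize_target("https://user:secret@api.example.com:8443/v1/predict?sig=abc#frag"))
print(sanitize_target(".worldforge/runs/demo-run/provider-events.jsonl"))
```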
Composed operations can emit safe workflow trace artifacts. Evaluation reports export
workflow_trace.json and workflow_trace.md, and workflow_trace_from_provider_events(...) can
compact emitted ProviderEvent records into a schema-versioned trace without storing raw prompts,
tensors, credentials, or controller telemetry.
Rerun is available as an optional observability and artifact layer:
session = RerunSession(RerunRecordingConfig(save_path=".worldforge/rerun/run.rrd"))
rerun_events = RerunEventSink(session=session)
artifacts = RerunArtifactLogger(session=session)
forge = WorldForge(event_handler=rerun_events)
world_state = {"step": 0, "scene": {"objects": {}}}
payload = forge.predict(world_state, Action.move_to(0.2, 0.5, 0.0), steps=1, provider="mock")
artifacts.log_json("worlds/initial", world_state)
artifacts.log_json("worlds/predicted", payload.state)
session.close()
Install with worldforge-ai[rerun]. Rerun is not a provider and does not advertise WorldForge
capabilities.
Action Scoring
Providers that expose the score capability can rank candidate action sequences without claiming
prediction or policy support. LeWorldModel uses this path because its upstream runtime is a JEPA
cost model.
from worldforge import WorldForge
forge = WorldForge()
result = forge.score_actions(
    "leworldmodel",
    info={
        "pixels": [[[0.0, 0.1, 0.2]]],
        "goal": [[[0.8, 0.9, 1.0]]],
        "action": [[[0.0, 0.0, 0.0]]],
    },
    action_candidates=[
        [
            [[0.0], [0.1], [0.2]],
            [[0.3], [0.2], [0.1]],
        ]
    ],
)
print(result.best_index, result.best_score)
ActionScoreResult validates finite scores, exposes best_index and best_score, and requires
best_index to match lower_is_better so callers do not have to infer score direction from
provider-specific docs.
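A rough sketch of the kind of check this implies is shown below. It assumes that `lower_is_better` defaults to true and that "match" means `best_index` must point at the minimum score (or the maximum when `lower_is_better` is false); both are assumptions, and `validate_score_result` is a hypothetical helper rather than the library's validator.

```python
import math


def validate_score_result(scores, best_index, lower_is_better=True):
    """Return the best score after checking finiteness and index direction."""
    if not scores:
        raise ValueError("scores must be non-empty")
    for score in scores:
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not math.isfinite(score):
            raise ValueError(f"score {score!r} is not a finite number")
    if not 0 <= best_index < len(scores):
        raise ValueError("best_index is out of range")
    pick = min if lower_is_better else max
    expected = pick(scores)
    # Compare values rather than indices so ties are accepted.
    if scores[best_index] != expected:
        raise ValueError(
            f"best_index={best_index} does not select the "
            f"{'lowest' if lower_is_better else 'highest'} score"
        )
    return scores[best_index]


print(validate_score_result([0.2, 0.7], best_index=0))
print(validate_score_result([0.2, 0.7], best_index=1, lower_is_better=False))
```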
Metadata must be JSON-native: dict keys are strings, numbers are finite, and object instances or
tuples are rejected instead of being coerced silently.
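The JSON-native rule can be sketched as a recursive check. `ensure_json_native` is a hypothetical helper, and raising `ValueError` is an assumption made for the sketch; the text above does not state the exact exception type for this case.

```python
import math


def ensure_json_native(value, path="metadata"):
    """Raise ValueError unless `value` contains only JSON-native data."""
    if value is None or isinstance(value, (bool, str, int)):
        return
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError(f"{path}: non-finite number {value!r}")
        return
    if isinstance(value, list):
        for index, item in enumerate(value):
            ensure_json_native(item, f"{path}[{index}]")
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise ValueError(f"{path}: dict key {key!r} is not a string")
            ensure_json_native(item, f"{path}.{key}")
        return
    # Tuples, sets, and arbitrary objects land here instead of being coerced.
    raise ValueError(f"{path}: {type(value).__name__} is not JSON-native")


ensure_json_native({"provider": "mock", "latency_ms": 1.5, "tags": ["a", "b"]})

try:
    ensure_json_native({"shape": (1, 2)})
except ValueError as exc:
    print(exc)
```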
Candidate helpers are provider-agnostic and return validated Action sequences. Use
cartesian_offset_candidates(...) for relative move candidates, object_near_candidates(...) for
reference-relative placements, swap_action_candidates(...) for two-object swaps, and
bounded_move_grid_candidates(...) for inclusive Cartesian grids. By default WorldForge serializes
candidate actions with action_candidates_to_score_payload(...) before calling
forge.score_actions(...); pass score_action_candidates explicitly when a score provider needs a
task-specific tensor instead of serialized Action payloads. These helpers
do not preprocess images, do not infer provider-native tensors, and
do not reinterpret robot action spaces.
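To make "inclusive Cartesian grid" concrete, the sketch below enumerates every combination of evenly spaced values per axis with both bounds included. `inclusive_axis` and `inclusive_grid` are hypothetical helpers; the signature and return type of the real bounded_move_grid_candidates(...) are not shown in this document and are not assumed here.

```python
from itertools import product


def inclusive_axis(low, high, steps):
    """Evenly spaced values from low to high, both endpoints included."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if low > high:
        raise ValueError("low must not exceed high")
    if steps == 1:
        return [low]
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]


def inclusive_grid(bounds, steps):
    """Cartesian product of per-axis values as a list of parameter dicts."""
    axes = [inclusive_axis(low, high, steps) for low, high in bounds.values()]
    return [dict(zip(bounds, point)) for point in product(*axes)]


grid = inclusive_grid({"x": (0.0, 1.0), "y": (-1.0, 1.0)}, steps=3)
print(len(grid), grid[0], grid[-1])
```

The grid size grows as `steps ** len(bounds)`, so a host would normally keep both small before scoring every candidate.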
For a checkout-safe policy+score candidate lab that preserves raw policy actions, ranks generated candidates, and shows invalid-bounds plus missing-translator failures, run:
uv run python scripts/demo_showcases.py run policy-score-candidate-lab --workspace-dir .worldforge/demo-showcases --overwrite
Latent Planning
LatentMPCController owns the receding-horizon optimizer: it proposes action candidates, ranks them
with a score provider as a cost oracle, and returns the lowest-cost chunk. There is no symbolic
World runtime; the controller operates over observation_info/goal_info payloads and the
score capability.
from worldforge import LatentMPCController, PlannerConfig, WorldForge
forge = WorldForge()
controller = LatentMPCController(
    forge=forge,
    score_provider="mock",
    config=PlannerConfig(
        horizon=1,
        num_samples=16,
        num_iterations=2,
        num_elites=4,
        action_kind="latent_action",
        action_parameter_bounds={"x": (-1.0, 1.0), "y": (-1.0, 1.0), "z": (-1.0, 1.0)},
        seed=0,
    ),
)
plan = controller.plan_step(
    observation_info={"point": [0.0, 0.5, 0.0]},
    goal_info={"target": [0.55, 0.5, 0.0]},
)
print(len(plan.actions), plan.best_score, plan.candidate_count)
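The num_samples, num_iterations, and num_elites parameters resemble a cross-entropy-method (CEM) loop, although this document does not confirm which optimizer LatentMPCController uses. Under that assumption, a single-step planner over bounded parameters might look like the sketch below; `cem_plan` and `squared_distance` are hypothetical, and a plain Python function stands in for the score provider.

```python
import random
import statistics


def cem_plan(cost_fn, bounds, *, num_samples=16, num_iterations=2, num_elites=4, seed=0):
    """Sample actions, keep the lowest-cost elites, refit, and repeat."""
    if not 1 <= num_elites <= num_samples:
        raise ValueError("num_elites must be between 1 and num_samples")
    rng = random.Random(seed)
    keys = list(bounds)
    mean = {k: (bounds[k][0] + bounds[k][1]) / 2 for k in keys}
    std = {k: (bounds[k][1] - bounds[k][0]) / 2 for k in keys}
    best_cost, best_action = float("inf"), None
    for _ in range(num_iterations):
        scored = []
        for _ in range(num_samples):
            # Draw from the current Gaussian and clip into the allowed bounds.
            action = {
                k: min(max(rng.gauss(mean[k], std[k]), bounds[k][0]), bounds[k][1])
                for k in keys
            }
            scored.append((cost_fn(action), action))
        scored.sort(key=lambda pair: pair[0])  # lower cost is better
        if scored[0][0] < best_cost:
            best_cost, best_action = scored[0]
        elites = [action for _, action in scored[:num_elites]]
        for k in keys:
            values = [action[k] for action in elites]
            mean[k] = statistics.fmean(values)
            std[k] = statistics.pstdev(values) + 1e-6  # keep sampling non-degenerate
    return best_action, best_cost


bounds = {"x": (-1.0, 1.0), "y": (-1.0, 1.0), "z": (-1.0, 1.0)}
target = {"x": 0.55, "y": 0.5, "z": 0.0}


def squared_distance(action):
    return sum((action[k] - target[k]) ** 2 for k in target)


action, cost = cem_plan(squared_distance, bounds)
print(action, cost)
```

A fixed seed makes the sampled candidates reproducible, which is presumably why the configuration above exposes one.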
Action Policy
Providers that expose the policy capability select executable action chunks from observations.
NVIDIA Isaac GR00T uses this surface because it is an embodied VLA policy, not a predictive world
model.
# video_array and state_array are host-supplied observation arrays (not defined here).
result = forge.select_actions(
    "gr00t",
    info={
        "observation": {
            "video": {"front": video_array},
            "state": {"eef": state_array},
            "language": {"task": [["pick up the cube"]]},
        },
        "embodiment_tag": "LIBERO_PANDA",
        "action_horizon": 16,
    },
)
print(result.actions, result.raw_actions)
ActionPolicyResult validates that the provider returned at least one WorldForge Action,
preserves provider-native raw actions for debugging, and can carry multiple candidate action
chunks for downstream scoring. Preserved raw actions and metadata must be JSON-native so run
artifacts can be serialized without hidden encoder behavior.
Policy plus score planning composes the two surfaces directly: forge.select_actions(...) proposes
candidate chunks, forge.score_actions(...) ranks them, and the caller selects the lowest-cost
chunk by best_index. The host owns embodiment-specific action translation and any model-native
mapping.
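The propose, score, select flow can be sketched without any provider. Everything below is a stand-in: `propose_chunks` plays the role of a policy, `score_chunks` the role of a score provider, and the one-dimensional `dx` action is invented purely for illustration.

```python
def propose_chunks(observation):
    """Hypothetical policy: return several candidate action chunks."""
    return [[{"dx": 0.1}], [{"dx": 0.5}], [{"dx": 0.9}]]


def score_chunks(chunks, goal):
    """Hypothetical cost model: distance between total displacement and the goal."""
    scores = [abs(goal - sum(step["dx"] for step in chunk)) for chunk in chunks]
    best_index = min(range(len(scores)), key=scores.__getitem__)  # lower is better
    return {"scores": scores, "best_index": best_index}


def plan_with_policy_and_score(observation, goal):
    chunks = propose_chunks(observation)          # 1. the policy proposes
    result = score_chunks(chunks, goal)           # 2. the score model ranks
    return chunks[result["best_index"]], result   # 3. the caller selects


chunk, result = plan_with_policy_and_score(observation={}, goal=0.55)
print(result["best_index"], chunk)
```

Translating the selected chunk into robot commands is deliberately absent, since the text above leaves embodiment-specific mapping to the host.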
Evaluation
from worldforge.evaluation import EvaluationSuite
print(EvaluationSuite.builtin_names())
suite = EvaluationSuite.from_builtin("planning")
report = suite.run_report(["mock"], forge=forge)
print(report.results[0].passed)
print(report.to_markdown())
gallery = report.failure_gallery()
print(gallery.to_json())
Failed reports expose representative, sanitized gallery cases through failure_gallery() and
through report.artifacts()["failure_gallery.json"] / ["failure_gallery.md"]. The gallery is
for deterministic contract triage; it does not rank providers or claim physical fidelity.
Custom deterministic suites use the same public report path:
from worldforge.evaluation import EvaluationContext, EvaluationScenario, EvaluationSuite
def check_world(context: EvaluationContext):
    return context.outcome(
        score=1.0,
        passed=context.world.object_count == 0,
        metrics={"object_count": context.world.object_count},
    )
custom = EvaluationSuite.custom(
    suite_id="custom-empty-world",
    name="Custom Empty World Evaluation",
    suite_version="custom-empty-world:1",
    claim_boundary="Checkout-safe custom example; not a model-quality claim.",
    scenarios=[
        EvaluationScenario.from_callable(
            name="empty-world-readable",
            description="Checks that a new world can be inspected.",
            evaluator=check_world,
        )
    ],
)
report = custom.run_report("mock", forge=forge)
print(report.provenance.suite_version)
EvaluationSuite.register(...) and EvaluationSuite.from_registered(...) provide a process-local
registry for host applications that want to name custom suites.
Benchmarking
from worldforge import BenchmarkBudget, ProviderBenchmarkHarness, load_benchmark_inputs
harness = ProviderBenchmarkHarness(forge=forge)
inputs = load_benchmark_inputs(
    {
        "embedding_text": "benchmark cube state",
    }
)
report = harness.run(
    ["mock"],
    operations=["predict", "embed"],
    iterations=5,
    inputs=inputs,
)
print(report.to_json())
budget = BenchmarkBudget.from_dict({"max_p95_latency_ms": 25.0})
print(report.evaluate_budgets([budget]).passed)
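As an illustration of what a max_p95_latency_ms budget measures, the sketch below computes a nearest-rank 95th percentile and compares it with the limit. The percentile method WorldForge actually uses is not stated in this document, so nearest-rank is an assumption, and `p95` and `within_budget` are hypothetical helpers.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(latencies_ms, max_p95_latency_ms):
    return p95(latencies_ms) <= max_p95_latency_ms


samples = [3.0, 4.0, 5.0, 30.0, 6.0]  # made-up latencies in milliseconds
print(p95(samples), within_budget(samples, max_p95_latency_ms=25.0))
```

With only five iterations the nearest-rank p95 is simply the slowest sample, so a single outlier fails the budget; larger iteration counts give a steadier estimate.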
Provider contract testing
from worldforge.providers import MockProvider
from worldforge.testing import assert_provider_contract
report = assert_provider_contract(MockProvider())
print(report.to_dict())
For score providers, pass provider-specific score payloads so the helper can exercise
score_actions(...):
report = assert_provider_contract(
    provider,
    score_info={"observation": [[0.0]], "goal": [[1.0]]},
    score_action_candidates=[[[[0.0]]]],
)
For policy providers, pass provider-specific policy observations in the same way.
Public failure modes
WorldForge uses three public exception families for runtime workflows:
- WorldForgeError: invalid caller input, invalid model values, unsupported formats, and invalid local configuration values.
- WorldStateError: malformed persisted state or provider-supplied world state that cannot be safely restored or applied, including invalid scene-object maps.
- ProviderError: provider credentials, transport failures, unsupported provider operations, malformed upstream responses, provider-specific input limits, optional dependency failures, and malformed model score or policy outputs.
Provider-facing workflows fail before returning partial results:
from worldforge import Action, WorldForge
from worldforge.providers import ProviderError
forge = WorldForge()
world_state = {"step": 0, "scene": {"objects": {}}}
try:
    prediction = forge.predict(
        world_state,
        Action.move_to(0.1, 0.5, 0.0),
        steps=1,
        provider="mock",
    )
except ProviderError as exc:
    # Inspect emitted ProviderEvent records for transport status and attempts.
    raise
Important boundary checks:
- Position, Rotation, request policies, provider events, embeddings, score results, policy results, and prediction payload metrics reject non-finite numbers.
- Action.parameters, SceneObject.metadata, provider-event metadata, score metadata, policy raw actions, policy metadata, and prediction payload state/metadata reject non-JSON-native values rather than accepting object instances that only fail at persistence time.
- Evaluation and benchmark result objects validate finite metrics, score ranges, coherent counts, and JSON-native metrics before JSON, Markdown, or CSV artifacts are rendered.
- Provider-supplied world state rejects scene-object keys that disagree with embedded object IDs.
- LeWorldModel scoring requires pixels, goal, and action info fields, action candidates shaped as (batch=1, samples, horizon, action_dim), optional stable_worldmodel and torch runtime dependencies, one returned score per candidate sample, and finite model scores.
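The candidate-shape requirement in the last item can be checked on nested lists before calling a provider. The sketch below is a hypothetical pre-flight helper, not the provider's own validation; it only confirms the four-level (batch=1, samples, horizon, action_dim) layout and, optionally, one score per sample.

```python
def nested_shape(value):
    """Shape of a rectangular nested list; raises ValueError if ragged."""
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    inner = {nested_shape(item) for item in value}
    if len(inner) != 1:
        raise ValueError("nested list is ragged")
    return (len(value), *inner.pop())


def check_candidates(action_candidates, scores=None):
    """Validate the (1, samples, horizon, action_dim) layout; return the last three."""
    shape = nested_shape(action_candidates)
    if len(shape) != 4 or shape[0] != 1 or 0 in shape:
        raise ValueError(f"expected (1, samples, horizon, action_dim), got {shape}")
    _, samples, horizon, action_dim = shape
    if scores is not None and len(scores) != samples:
        raise ValueError(f"expected {samples} scores, got {len(scores)}")
    return samples, horizon, action_dim


# Same candidates as the Action Scoring example: two samples, horizon 3, action_dim 1.
candidates = [[[[0.0], [0.1], [0.2]], [[0.3], [0.2], [0.1]]]]
print(check_candidates(candidates, scores=[0.4, 0.9]))
```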