TheWorldHarness

TheWorldHarness is an optional Textual TUI for running WorldForge integration flows as visible, inspectable traces. It is the default integration reference for how provider surfaces, planning, execution, persistence, diagnostics, benchmarks, and event inspection fit together.

It is a local tool. It does not require optional ML runtimes unless a selected flow explicitly does, and the current flows use deterministic checkout-safe paths.

The real robotics showcase also uses the Textual surface, but it is launched through scripts/robotics-showcase rather than the checkout-safe harness flow. That command first runs the real LeRobot policy and the real LeWorldModel checkpoint path, then opens a standalone report covering the pipeline, runtime metrics, staged reveal, illustrative robot-arm animation, candidate cost landscape, provider events, and tabletop replay. Pass --tui-stage-delay <seconds> to tune the reveal pace, or --no-tui to keep the plain terminal report.

Install Boundary

Textual is optional. The base package keeps httpx as its only runtime dependency.

uv run --extra harness worldforge-harness
uv run --extra harness worldforge-harness --flow lerobot
uv run --extra harness worldforge-harness --flow diagnostics
uv run --extra harness worldforge-harness --flow eval
uv run --extra harness worldforge-harness --flow benchmark
uv run worldforge harness --list
uv run worldforge harness --list --format json
uv run worldforge harness --connectors
uv run worldforge harness --connectors --format json

Installed package:

pip install "worldforge-ai[harness]"
worldforge-harness

Without the harness extra, metadata commands still work:

uv run worldforge harness --list
uv run worldforge harness --list --format json
uv run worldforge harness --connectors --format json
uv run worldforge provider workbench mock
uv run worldforge provider workbench runway --format json

Launching the TUI without Textual installed exits with an install hint instead of importing optional dependencies at package import time.
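
A minimal sketch of that guard, assuming the launcher function and message wording (only the worldforge.harness.tui module path comes from the interface contract below):

def launch_tui() -> int:
    """Launch TheWorldHarness, failing with a hint when Textual is absent."""
    try:
        # the only Textual-dependent module; deferred until launch time
        from worldforge.harness.tui import HarnessApp  # class name assumed
    except ImportError:
        print('TheWorldHarness needs the optional extra:')
        print('    pip install "worldforge-ai[harness]"')
        return 1
    HarnessApp().run()
    return 0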

Current Flows

| Flow | Provider surface | What it visualizes |
| --- | --- | --- |
| leworldmodel | score | Deterministic LeWorldModel-shaped cost runtime, candidate scoring, score planning, execution, persistence, reload, provider events. |
| lerobot | policy plus score provider | Deterministic LeRobot-shaped policy, action translation, policy candidate ranking, execution, persistence, reload, provider events. |
| diagnostics | provider catalog plus benchmark harness | doctor() provider scan, registered/unregistered provider status, mock benchmark matrix across predict/reason/generate/transfer/embed, latency/throughput comparison, provider events. |

Provider Connector Workspace

The Providers screen and worldforge harness --connectors --format json use the same Textual-free readiness model. Each known provider is grouped as configured, missing_credentials, missing_dependency, unhealthy, or scaffold, with value-free required environment names, optional runtime dependency names, a first smoke command, and triage steps.

This surface intentionally reports presence and status only. It does not print environment values, tokens, endpoints, checkpoint paths, or constructor-provided secrets.
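
As a rough illustration of the readiness model, the record below shows the kind of per-provider entry the JSON output groups. Every key name and the environment-variable name are assumptions; the five status values and the names-only rule come from the paragraphs above:

connector = {
    "status": "missing_credentials",  # configured | missing_credentials |
                                      # missing_dependency | unhealthy | scaffold
    "required_env": ["RUNWAY_API_KEY"],  # hypothetical name; names only, never values
    "optional_dependencies": [],
    "first_smoke": "uv run worldforge provider workbench runway",
    "triage": ["set the required environment variable", "re-run the smoke command"],
}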

What The Interface Shows

The harness now exposes the main WorldForge surfaces directly:

  • Home: jump cards plus recent worlds and preserved reports.
  • Worlds: create, edit, save, fork, delete, and preview local JSON worlds through WorldForge.
  • Providers: registered-provider capability matrix, health details, and cancellable mock.predict.
  • Eval: built-in deterministic suites with capability errors surfaced as hard toasts.
  • Benchmark: provider-operation latency, retry, and throughput runs with live samples.
  • Run Inspector: timeline, metrics, sanitized provider-event table, validation errors, transcript, and export preview for flows and reports.

Flow and report views are rendered from the same structured HarnessRun object used in tests. The provider, eval, and benchmark screens call the same Python APIs as the CLI; report artifacts use the canonical JSON / Markdown / CSV renderers.

Every harness flow preserves the final inspector state under .worldforge/runs/<run-id>/results/inspector.json, writes sanitized provider events to logs/provider-events.jsonl, and links both artifacts from run_manifest.json. If a flow fails before provider work completes, the manifest status is failed rather than left as running; the inspector still records the command, redacted validation error, and failure event needed to reproduce the run.
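
Because run IDs sort lexicographically by UTC time, the newest preserved run can be inspected offline with only the standard library. The file paths below are the documented ones; the manifest key name is an assumption:

import json
from pathlib import Path

latest = sorted(Path(".worldforge/runs").iterdir())[-1]  # run IDs are UTC-sortable

manifest = json.loads((latest / "run_manifest.json").read_text())
print(manifest.get("status"))  # "failed" rather than "running" after a crash

inspector = json.loads((latest / "results" / "inspector.json").read_text())
events = [
    json.loads(line)  # sanitized provider events, one JSON object per line
    for line in (latest / "logs" / "provider-events.jsonl").read_text().splitlines()
]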

The diagnostics, eval, and benchmark screens map directly to non-TUI commands:

uv run worldforge doctor --registered-only
uv run worldforge provider list
uv run worldforge provider workbench mock
uv run worldforge benchmark --provider mock --iterations 2 --format json
uv run worldforge eval --suite planning --provider mock --format json

Provider Workbench

worldforge provider workbench <provider> is the checkout-safe adapter author loop behind the harness provider development workflow. It does not import Textual and does not make live provider calls unless --live is passed explicitly. The default report is designed to paste into GitHub issues: provider profile, required capability conformance helpers, fixture JSON status, docs/catalog drift hints, redaction-safe provider event status, and exact follow-up commands.

uv run worldforge provider workbench mock
uv run worldforge provider workbench runway --format json
uv run worldforge provider workbench runway --live

For deterministic local providers such as mock, the workbench invokes the advertised capability helpers. For HTTP adapters it validates matching tests/fixtures/providers/<provider>_*.json playback files and lists the capability helpers that the provider test module must cover. For host-owned local runtimes such as LeRobot and LeWorldModel, the default path inspects profile, health, docs, and fixtures while leaving injected-runtime/live smoke execution to prepared hosts. Run uv run python scripts/generate_provider_docs.py --check before opening a provider PR so profile metadata and generated catalog tables stay in sync.
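
For the HTTP-adapter path, the playback check reduces to a filename-convention scan. A sketch built on the documented tests/fixtures/providers/<provider>_*.json pattern (the helper itself is hypothetical):

from pathlib import Path

def playback_fixtures(provider: str) -> list[Path]:
    """Recorded playback files the workbench validates for an HTTP adapter."""
    return sorted(Path("tests/fixtures/providers").glob(f"{provider}_*.json"))

if not playback_fixtures("runway"):
    print("workbench would report: no playback fixtures found for 'runway'")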

Completed checkout-safe flows also preserve a sanitized run workspace:

.worldforge/runs/<run-id>/
|-- run_manifest.json
|-- inputs/
|-- results/
|-- reports/
|-- artifacts/
`-- logs/

Run IDs are UTC-sortable and file-safe (YYYYMMDDTHHMMSSZ-xxxxxxxx). The manifest records the command, provider surface, status, input summary, result summary, event count, and relative artifact paths. It intentionally stores summaries and report renderings, not credentials, raw signed URLs, or provider-owned private data.
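
A sketch of producing an ID in that format; the random-suffix source is an assumption:

import secrets
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
run_id = f"{stamp}-{secrets.token_hex(4)}"  # e.g. 20260215T091230Z-9f3a1c2e
# Lexicographic order equals chronological order, and every character is file-safe.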

The CLI can write the same layout for evaluation and benchmark runs:

uv run worldforge eval --suite planning --provider mock --run-workspace .worldforge
uv run worldforge benchmark --provider mock --operation predict --run-workspace .worldforge
uv run worldforge runs list
uv run worldforge runs compare .worldforge/runs/<run-a> .worldforge/runs/<run-b>
uv run worldforge runs cleanup --keep 20

Completed eval and benchmark TUI runs still write JSON under .worldforge/reports/ (relative to the active state directory) so the Home screen and the Ctrl+P recent-report index can find them. Use the run workspace when a full issue attachment needs the manifest, reports, logs, and result summaries together. Use runs compare --format json|markdown|csv to export attachment-safe comparisons across preserved eval or benchmark runs.

Interface Contract

The TUI is intentionally separated from the rest of the project:

| Module | Dependency boundary |
| --- | --- |
| worldforge.harness.models | Dataclasses only; no Textual import. |
| worldforge.harness.flows | Runs packaged demos and builds timeline, metrics, and transcript data; no Textual import. |
| worldforge.harness.cli | Lists flows without Textual; imports the TUI only when launching it. |
| worldforge.harness.tui | The only Textual-dependent module. |
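
That boundary is easy to pin down in a test. A pytest-style sketch (the test name is hypothetical; the module paths are from the table above):

import sys

def test_core_harness_modules_do_not_import_textual():
    # assumes textual has not already been imported by the test session
    for module in (
        "worldforge.harness.models",
        "worldforge.harness.flows",
        "worldforge.harness.cli",
    ):
        __import__(module)
    assert "textual" not in sys.modules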

The harness does not replace the Python APIs or command-line demos. It makes the same flows observable: selected candidates, costs, action paths, saved world ids, final object positions, provider health, benchmark latency, benchmark throughput, and provider event phases.

Interaction Model

  • r: run the selected flow.
  • 1: select LeWorldModel score planning.
  • 2: select LeRobot policy-plus-score planning.
  • 3: select provider diagnostics and benchmark comparison.
  • g w: jump to Worlds.
  • g p: jump to Providers.
  • g e: jump to Eval.
  • g b: jump to Benchmark.
  • Ctrl+P: search static commands plus worlds, providers, and recent report files.
  • Ctrl+T: cycle worldforge-dark, worldforge-light, and worldforge-high-contrast.
  • q: quit.

Each run reveals stages through a timeline, then fills the inspector and transcript panes from the same structured HarnessRun data used by tests.

Themes

The harness registers three themes:

  • worldforge-dark: default dark workspace.
  • worldforge-light: light-terminal variant.
  • worldforge-high-contrast: higher-contrast variant for dense screens and reduced-colour terminals.

Widget CSS uses semantic tokens only; raw colour values live in worldforge.harness.theme.
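
A sketch of how such a module can be laid out with Textual's built-in Theme type; the colour values are placeholders, not the project's palette:

from textual.theme import Theme

WORLDFORGE_DARK = Theme(  # raw colour values live here, nowhere else
    name="worldforge-dark",
    primary="#4f8cc9",
    background="#101418",
    dark=True,
)
# Widget CSS then refers to semantic variables such as $primary, and an app
# registers the theme via self.register_theme(WORLDFORGE_DARK) at startup.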

Screenshot Refresh

The README image is regenerated from a deterministic harness state. The tracked refresh command is:

scripts/regen-harness-screenshot.sh

It seeds a local screenshot state directory, drives the providers screen through Textual's test harness, exports SVG, and renders the README PNG with rsvg-convert.
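
Condensed, the capture step looks roughly like this; the app class name is an assumption, while run_test and save_screenshot are standard Textual APIs:

import asyncio
from worldforge.harness.tui import HarnessApp  # assumed entry point

async def capture() -> None:
    app = HarnessApp()
    async with app.run_test(size=(120, 40)) as pilot:
        await pilot.press("g", "p")         # jump to Providers (documented binding)
        app.save_screenshot("harness.svg")  # Textual exports SVG; rsvg-convert renders the PNG

asyncio.run(capture())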

Roadmap

TheWorldHarness is evolving from the read-only demo viewer described above into the project's front-door interactive workspace — keyboard-first, command-palette-driven, and the canonical example of how to compose WorldForge from Python. The work is broken into six milestones (M0–M5), each with a published spec triad (spec.md + plan.md + tasks.md) under specs/:

| Milestone | What it adds |
| --- | --- |
| M0 — Theme + chrome reset | Registered light/dark themes, semantic CSS variables, header clock and breadcrumb. |
| M1 — Screen architecture | App split into named Screens, push_screen navigation, ? help overlay, Ctrl+P system commands. |
| M2 — Worlds CRUD | Create / edit / save / fork / delete worlds entirely from the TUI through the public WorldForge API. |
| M3 — Live providers | ProvidersScreen with capability matrix and one real provider call streamed through a worker; Esc cancels. |
| M4 — Eval + Benchmark | EvalScreen and BenchmarkScreen; capability mismatch as a hard toast; reports preserved to disk and exportable. |
| M5 — Polish + showcase | High-contrast theme, dynamic command-palette provider, recent items, screenshot export matrix, README screenshot refresh. |

The intent and design language behind these milestones are summarized in the public roadmap.