geno_lewm.provenance.hashing

hashing

Content-addressing primitives for GenoLeWM artifacts (RFC-0011 §3.1).

Two groups of functions:

  • canonical_json_sha256: SHA-256 of the canonical JSON serialization of a value. Canonical JSON follows RFC-0011 §3.7: keys sorted lexicographically, no whitespace, UTF-8 encoded, NaN / Infinity rejected. The output is byte-stable across platforms and Python releases.
  • sha256_file / sha256_bytes: streaming file hashing and in-memory hashing, used to compute the per-artifact hash fields in Manifest.

All digests are returned as "sha256:<hex>" strings to match the on-disk manifest convention.

canonical_json_bytes

canonical_json_bytes(value: Any) -> bytes

Return the canonical-JSON byte string of value.

Canonical form (RFC-0011 §3.7, similar to RFC 8785):

  • Keys are sorted lexicographically at every level.
  • No whitespace (compact separators=(",", ":")).
  • UTF-8 encoded.
  • NaN / Infinity rejected.
  • bytes rejected (must be hashed separately and embedded by reference).

Source code in geno_lewm/provenance/hashing.py
def canonical_json_bytes(value: Any) -> bytes:
    """Return the canonical-JSON byte string of ``value``.

    Canonical form (RFC-0011 §3.7, similar to RFC 8785):

    - Keys are sorted lexicographically at every level.
    - No whitespace (compact ``separators=(",", ":")``).
    - UTF-8 encoded.
    - NaN / Infinity rejected.
    - ``bytes`` rejected (must be hashed separately and embedded by
      reference).
    """
    _check_floats(value)
    text = json.dumps(
        value,
        ensure_ascii=False,
        sort_keys=True,
        separators=(",", ":"),
        allow_nan=False,
        default=_canonical_default,
    )
    return text.encode("utf-8")

canonical_json_sha256

canonical_json_sha256(value: Any) -> str

Return "sha256:<hex>" for the canonical JSON of value.

Source code in geno_lewm/provenance/hashing.py
def canonical_json_sha256(value: Any) -> str:
    """Return ``"sha256:<hex>"`` for the canonical JSON of ``value``."""
    return _PREFIX + hashlib.sha256(canonical_json_bytes(value)).hexdigest()
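
Why canonicalization matters for content addressing can be seen by comparing a naive serialization with a canonical one. The following is a minimal sketch under stated assumptions: `canonical_sha256_sketch` is a hypothetical stand-in, `PREFIX` mirrors the documented "sha256:" output convention, and the dictionary keys are made up for illustration.

```python
import hashlib
import json

PREFIX = "sha256:"  # assumed to match the module's private _PREFIX


def canonical_sha256_sketch(value):
    """Hypothetical stand-in for canonical_json_sha256 (stdlib only)."""
    data = json.dumps(
        value,
        ensure_ascii=False,
        sort_keys=True,
        separators=(",", ":"),
        allow_nan=False,
    ).encode("utf-8")
    return PREFIX + hashlib.sha256(data).hexdigest()


# Same logical value, different key insertion order (illustrative keys).
v1 = {"model": "m", "seed": 7}
v2 = {"seed": 7, "model": "m"}

# Default json.dumps preserves insertion order, so the digests diverge.
naive1 = hashlib.sha256(json.dumps(v1).encode("utf-8")).hexdigest()
naive2 = hashlib.sha256(json.dumps(v2).encode("utf-8")).hexdigest()

# Canonical serialization yields one digest for both.
d1 = canonical_sha256_sketch(v1)
d2 = canonical_sha256_sketch(v2)
```

The resulting string is the 7-character prefix followed by 64 lowercase hex digits.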

sha256_bytes

sha256_bytes(data: bytes | bytearray | memoryview) -> str

Return "sha256:<hex>" for data.

Source code in geno_lewm/provenance/hashing.py
def sha256_bytes(data: bytes | bytearray | memoryview) -> str:
    """Return ``"sha256:<hex>"`` for ``data``."""
    return _PREFIX + hashlib.sha256(bytes(data)).hexdigest()
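
Since canonical JSON rejects raw bytes, binary payloads are hashed separately and referenced by digest. A short sketch of that pattern follows; `sha256_bytes_sketch` is a hypothetical stand-in, and the `name` / `hash` field names are illustrative rather than taken from the Manifest schema.

```python
import hashlib
import json


def sha256_bytes_sketch(data):
    """Hypothetical stand-in for sha256_bytes (stdlib only)."""
    return "sha256:" + hashlib.sha256(bytes(data)).hexdigest()


payload = b"GenoLeWM"

# bytes, bytearray and memoryview over the same content hash identically.
digests = {
    sha256_bytes_sketch(payload),
    sha256_bytes_sketch(bytearray(payload)),
    sha256_bytes_sketch(memoryview(payload)),
}

# Well-known SHA-256 digest of the empty input.
empty = sha256_bytes_sketch(b"")

# Embed the payload by reference: the JSON carries the digest string,
# not the bytes themselves (field names are illustrative).
entry = {"name": "blob.bin", "hash": sha256_bytes_sketch(payload)}
entry_json = json.dumps(entry, sort_keys=True, separators=(",", ":"))
```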

sha256_file

sha256_file(path: str | Path) -> str

Return "sha256:<hex>" for the file at path.

Streams the file in 1 MiB chunks; safe for arbitrarily large artifacts (weights files can be multi-GB).

Source code in geno_lewm/provenance/hashing.py
def sha256_file(path: str | Path) -> str:
    """Return ``"sha256:<hex>"`` for the file at ``path``.

    Streams the file in 1 MiB chunks; safe for arbitrarily large
    artifacts (weights files can be multi-GB).
    """
    p = Path(path)
    h = hashlib.sha256()
    with p.open("rb") as f:
        while True:
            chunk = f.read(_CHUNK)
            if not chunk:
                break
            h.update(chunk)
    return _PREFIX + h.hexdigest()
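
Chunked hashing gives the same digest as hashing the whole content at once, whatever the chunk size, which is what makes streaming safe. The sketch below checks this on a temporary file slightly larger than one chunk. `sha256_file_sketch` and its `chunk_size` parameter are hypothetical; the real function takes only `path` and reads its chunk size from the private `_CHUNK` constant, assumed here to be 1 MiB as documented.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB, assumed to match the module's private _CHUNK


def sha256_file_sketch(path, chunk_size=CHUNK):
    """Hypothetical stand-in for sha256_file with a configurable chunk size."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


# 1,280,000 bytes of deterministic data: spans more than one 1 MiB chunk.
data = bytes(range(256)) * 5000

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifact.bin"
    path.write_bytes(data)
    streamed = sha256_file_sketch(path)
    streamed_small = sha256_file_sketch(path, chunk_size=4096)

one_shot = "sha256:" + hashlib.sha256(data).hexdigest()
```

Peak memory is bounded by the chunk size rather than the file size, since only one chunk is held at a time.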

looks_like_sha256

looks_like_sha256(s: str) -> bool

Return True iff s matches the "sha256:<64hex>" shape.

Source code in geno_lewm/provenance/hashing.py
def looks_like_sha256(s: str) -> bool:
    """Return True iff ``s`` matches the ``"sha256:<64hex>"`` shape."""
    if not isinstance(s, str) or not s.startswith(_PREFIX):
        return False
    rest = s[len(_PREFIX) :]
    if len(rest) != _HASH_HEX_LEN:
        return False
    return all(c in "0123456789abcdef" for c in rest)
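
The check is purely syntactic: it validates the shape of the string and says nothing about whether the digest corresponds to any content. As an illustration, the same rule can be written as a regular expression. This is an alternative formulation, not the module's code, and `looks_like_sha256_sketch` is a hypothetical name.

```python
import re

# Literal prefix followed by exactly 64 lowercase hex digits.
_SHAPE = re.compile(r"sha256:[0-9a-f]{64}")


def looks_like_sha256_sketch(s):
    """Hypothetical regex-based equivalent of looks_like_sha256."""
    return isinstance(s, str) and _SHAPE.fullmatch(s) is not None


ok = looks_like_sha256_sketch("sha256:" + "a" * 64)
uppercase = looks_like_sha256_sketch("sha256:" + "A" * 64)   # uppercase hex is rejected
too_short = looks_like_sha256_sketch("sha256:" + "a" * 63)   # wrong length
no_prefix = looks_like_sha256_sketch("a" * 64)               # missing "sha256:" prefix
not_a_str = looks_like_sha256_sketch(None)                   # non-string input
```

Only the first call returns True. Note that uppercase hex digits are rejected, so digests obtained from other tools may need lowercasing before they pass this check.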