Checkpoints
A checkpoint is a single relocatable.ckpt file (a zip) that captures a run
at a moment in time: world state plus agent state. Save and load follow the
PyTorch idiom — construct first, then load state into it.
Archive layout
state_dict
without the model class).
Tiers
- world — the snapshot document. Exact restore is the world’s
responsibility (see
world-interface.md). - workspace — agent-authored files. Backend-agnostic, the unit of
generational inheritance, and the shareable artifact. Reconstructable
artifacts (
.venv, caches) are excluded. - session — provider continuation state (conversation transcripts),
selected by a per-backend allowlist (
claude-code-v1,codex-v1). Needed only to resume the exact conversation.workspace_only=Trueomits it — that is the safe default for sharing.
Safety model
State loads, policy doesn’t:- Credentials and capability config (
.credentials.json,.claude.json,settings.json, permission files) are never written to an archive — the session tier is an allowlist, so unknown files stay out — and never extracted from one, even if a malicious archive contains them. - A secret scan runs over every file at save time and refuses to produce an
archive containing credential-shaped content (
CheckpointSecretError). - Extraction rejects absolute paths and
..traversal. - Loading a foreign session tier still means adopting a conversation you did
not audit. Prefer
workspace_only=Truefor checkpoints you did not create.
Invisibility
Checkpoints must be invisible to agents — telling an agent it is being checkpointed is like telling a network it is beingtorch.saved:
- Nothing is sent to the agent at save time; agent state is archived from disk, preferring an idle moment (transcripts are append-only, so a mid-turn snapshot resumes from the last completed message).
- On resume, the world restores the same session ids, the agent’s original system prompt still applies, and the resume nudge uses the standard auto-continue prompt — indistinguishable from a routine idle nudge.
- Worlds expose sim time only (never host wall time) in agent-visible
payloads. Known residual channel: the agent can compare its own
datewith pre-crash memory; closing that fully requires VM-level time virtualization and is out of scope here.
Generation chains
run_agent_generations.py (research harness) integrates checkpoints:
CHECKPOINT_EVERY(seconds, default 900,0disables) checkpoints each running generation inruns/<run_id>/checkpoints/, with the latest pointer in the experiment’scheckpoint.json.--resume-from <ckpt>resumes a generation that crashed mid-flight: the remaining generation duration is computed from the checkpoint’selapsed_seconds, and the chain continues into subsequent generations normally. The chain CSV recordstemplate_inascheckpoint:<path>.- Failed run directories that contain checkpoints are preserved (renamed
*.failed-<stamp>), never deleted by the retry loop.