Slot lifecycle
Every slot in hal0 is in exactly one state at any moment. The set
of states is fixed; the transitions between them are validated against
LEGAL_TRANSITIONS; every transition is persisted atomically to
state.json and streamed over SSE to clients. The dashboard reflects
real state, not just systemctl is-active snapshots.
The canonical enum lives in src/hal0/slots/state.py — a StrEnum, wire-stable across versions.
States
| State | Meaning |
|---|---|
| offline | No systemd unit active. |
| pulling | Model files downloading / verifying. |
| starting | systemd unit up; container booting. |
| warming | Container live; model loading into VRAM / GTT. |
| ready | Passed health probe (non-empty /v1/models + sentinel completion). |
| serving | Inference request in-flight. |
| idle | Ready but no traffic past the idle timeout — unload candidate. |
| unloading | Graceful stop in progress. |
| error | Failed; details in state.json + journald. |
error is a sideband — it’s reachable from most other states
when something goes wrong, and the slot returns to offline from
there once the failure is acknowledged.
Transition diagram
The canonical happy-path flow:

```
┌─────────┐  pull  ┌─────────┐  done  ┌──────────┐  spawn  ┌─────────┐
│ offline │───────▶│ pulling │───────▶│ starting │────────▶│ warming │
└─────────┘        └─────────┘        └──────────┘         └────┬────┘
     ▲                                                          │
     │                                        health probe pass │
     │                                                          ▼
     │                                                     ┌─────────┐
     │                                                     │  ready  │
     │                    idle timer fires                 └────┬────┘
     │        ┌─────────────────────────────────────────────────┤
     │        ▼                                                 │
     │   ┌────────┐                                             │
     │   │  idle  │                          inference request  │
     │   └───┬────┘                                             ▼
     │       │                                             ┌─────────┐
     │       ▼                                             │ serving │
     │  ┌───────────┐               done                   └────┬────┘
     │  │ unloading │◀──────────────────────────────────────────┘
     │  └─────┬─────┘
     └────────┘

error sideband — reachable from pulling / starting / warming / ready /
serving / unloading; returns to offline when ack'd
```

Legal transitions (text table)
| From | Allowed to |
|---|---|
| offline | pulling, starting, error |
| pulling | starting, error, offline |
| starting | warming, error |
| warming | ready, error |
| ready | serving, idle, unloading, error |
| serving | ready, error, unloading |
| idle | serving, unloading, ready |
| unloading | offline, error |
| error | offline |
Any transition not listed is rejected by the manager with a
slot.invalid_transition error. This is a hard invariant — there is
no escape hatch.
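The validation the manager performs can be sketched as a lookup against the table above. This is an illustrative reconstruction, not hal0's actual code — the real LEGAL_TRANSITIONS mapping lives in the source tree, and the exception name here is hypothetical:

```python
# Reconstruction of the legal-transition table above; the real mapping
# ships with hal0 alongside the state enum.
LEGAL_TRANSITIONS: dict[str, set[str]] = {
    "offline":   {"pulling", "starting", "error"},
    "pulling":   {"starting", "error", "offline"},
    "starting":  {"warming", "error"},
    "warming":   {"ready", "error"},
    "ready":     {"serving", "idle", "unloading", "error"},
    "serving":   {"ready", "error", "unloading"},
    "idle":      {"serving", "unloading", "ready"},
    "unloading": {"offline", "error"},
    "error":     {"offline"},
}


class InvalidTransition(Exception):
    """Hypothetical stand-in for the slot.invalid_transition error."""


def transition(current: str, target: str) -> str:
    """Reject any edge not listed in the table — no escape hatch."""
    if target not in LEGAL_TRANSITIONS.get(current, set()):
        raise InvalidTransition(
            f"slot.invalid_transition: {current} -> {target}"
        )
    return target
```

For example, `transition("offline", "pulling")` is accepted, while `transition("offline", "ready")` raises, because cold slots must pass through the pull/start/warm pipeline first.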
Persistence & streaming
- Every transition writes /var/lib/hal0/slots/<name>/state.json through the atomic-write path (NamedTemporaryFile → fsync → os.replace()).
- The slot manager emits one SSE event per transition on GET /api/slots/events.
- The dashboard’s Slots view subscribes to that stream — what you see in the UI is the same wire format the daemon writes to disk.
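The atomic-write path described above can be sketched in a few lines of standard-library Python. The function name and payload shape are illustrative, not hal0's actual API; the NamedTemporaryFile → fsync → os.replace() sequence is the part the docs specify:

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomically(state_path: Path, payload: dict) -> None:
    """Write payload as JSON so readers never observe a torn file.

    The temp file lives in the same directory as the target so that
    os.replace() is a same-filesystem rename, which is atomic on POSIX.
    """
    state_path.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", dir=state_path.parent, suffix=".tmp", delete=False
    ) as tmp:
        json.dump(payload, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # durable before the rename
    os.replace(tmp.name, state_path)  # atomic swap over state.json
```

A reader doing `json.loads(state_path.read_text())` sees either the old state or the new one, never a partial write — which is what lets the SSE stream and the on-disk file carry the same wire format safely.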
What each state implies for API behaviour
| Slot state | API behaviour |
|---|---|
| offline | Requests fail fast with slot.not_loaded. |
| pulling | Requests fail with slot.pulling and a progress hint. |
| starting | Requests block briefly; if the slot doesn’t reach warming quickly, they return 503. |
| warming | Requests block on the adaptive cold-boot probe; they succeed if the slot reaches ready within the request deadline. |
| ready | Requests are served; the slot transitions to serving for the duration. |
| serving | Concurrent requests stack on the slot’s queue. |
| idle | Requests are served as in ready; the first request resets the idle timer. |
| unloading | Requests fail with slot.unloading. |
| error | Requests fail with the structured error envelope from the failure. |
Why this matters
A naive setup conflates “the systemd unit is up” with “the model is ready to serve”. They aren’t the same — the model still has to load into VRAM, the backend has to pass its first health probe, and the sentinel completion has to succeed. hal0’s state machine encodes that distinction explicitly, which is why the dashboard can show “warming — 12s elapsed” instead of “service is up” while the request still 503s.
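The readiness decision itself is small once the HTTP plumbing is stripped away. This sketch shows only the two-part check the docs describe — a non-empty /v1/models listing plus a successful sentinel completion — with the transport omitted and the function name invented for illustration:

```python
def probe_passed(model_ids: list[str], sentinel_completion: str) -> bool:
    """Decide warming -> ready: the backend must list at least one
    model AND have returned a non-empty sentinel completion.

    A unit that is merely "up" fails both checks until the model has
    actually loaded, which is exactly the gap the state machine makes
    visible as the warming state.
    """
    return bool(model_ids) and bool(sentinel_completion.strip())
```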