
Streaming

hal0 streams /v1/chat/completions and /v1/completions responses as Server-Sent Events, exactly matching the OpenAI streaming protocol. Any OpenAI SDK that handles streaming today works against hal0 unmodified.

Add "stream": true to the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Count to five."}
    ]
  }'

Each chunk is a data: … line, JSON-encoded, terminated by a blank line. The stream ends with data: [DONE]. Same shape OpenAI ships:

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"One"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", two"}}]}
data: [DONE]
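
The same request through the OpenAI Python SDK prints the text as it arrives:
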
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

stream = client.chat.completions.create(
    model="primary",
    stream=True,
    messages=[{"role": "user", "content": "Count to five."}],
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
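
If you prefer to consume the raw SSE stream without the SDK, here is a minimal sketch. The endpoint and payload are the ones from the curl example above; httpx is an assumption, and any HTTP client that can stream a response body works the same way.

import json
import httpx  # assumption: any streaming-capable HTTP client will do

payload = {
    "model": "primary",
    "stream": True,
    "messages": [{"role": "user", "content": "Count to five."}],
}

with httpx.stream("POST", "http://localhost:8080/v1/chat/completions", json=payload) as response:
    for line in response.iter_lines():
        # Each event arrives as a "data: ..." line; blank lines separate events.
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # sentinel that closes the stream
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)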

Streaming flows through the same dispatcher that handles non-streaming requests, so you get:

  • Single-flight prefetch — if two clients open identical streams on a cold slot, the slot fires one upstream call and fans the token stream to both.
  • Adaptive cold-boot — the first request after a slot reaches ready keeps the connection open while the model finishes warming; you don’t get a 503 on a request that’s about to work.
  • Structured errors mid-stream — if the slot transitions to error part-way through, the stream emits one final SSE event with the structured error envelope before closing (see the sketch after this list).
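
A hedged sketch of handling that final event when you read the raw stream yourself, as in the loop above. The exact fields of the error envelope are an assumption here, modeled on the OpenAI-style {"error": {...}} shape:

import json

def handle_sse_data(data: str) -> str:
    """Return the text delta carried by one data: payload, raising if it is an error event."""
    if data == "[DONE]":
        return ""
    event = json.loads(data)
    if "error" in event:
        # Assumed envelope fields; message/type are illustrative, not confirmed by hal0's docs.
        err = event["error"]
        raise RuntimeError(f"stream aborted mid-flight: {err.get('type')}: {err.get('message')}")
    return event["choices"][0]["delta"].get("content") or ""

Drop this into the iter_lines loop above in place of the inline json.loads handling.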