
Streaming

hal0 streams /v1/chat/completions and /v1/completions responses as Server-Sent Events, exactly matching the OpenAI streaming protocol. Any OpenAI SDK that handles streaming today works against hal0 unmodified.

Add "stream": true to the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Count to five."}
    ]
  }'

Each chunk is a data: … line, JSON-encoded, terminated by a blank line. The stream ends with data: [DONE]. Same shape OpenAI ships:

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"One"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", two"}}]}
data: [DONE]
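
The same request through the OpenAI Python SDK prints the text as it arrives:
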
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

stream = client.chat.completions.create(
    model="primary",
    stream=True,
    messages=[{"role": "user", "content": "Count to five."}],
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
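
If you prefer to consume the raw SSE stream without the SDK, here is a minimal sketch. The endpoint and payload are the ones from the curl example above; httpx is an assumption, and any HTTP client that can stream a response body works the same way.

import json
import httpx  # assumption: any streaming-capable HTTP client will do

payload = {
    "model": "primary",
    "stream": True,
    "messages": [{"role": "user", "content": "Count to five."}],
}

with httpx.stream("POST", "http://localhost:8080/v1/chat/completions", json=payload) as response:
    for line in response.iter_lines():
        # Each event arrives as a "data: ..." line; blank lines separate events.
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # sentinel that closes the stream
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)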

Streaming flows through the same dispatcher that handles non-streaming requests, so you get:

  • Single-flight prefetch — if two clients open identical streams on a cold slot, the slot fires one upstream call and fans the token stream to both.
  • Adaptive cold-boot — the first request after a slot reaches ready keeps the connection open while the model finishes warming; you don’t get a 503 on a request that’s about to work.
  • Structured errors mid-stream — if the slot transitions to error part-way through, the stream emits one final SSE event with the structured error envelope before closing (see the sketch after this list).
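
A hedged sketch of handling that final event when you read the raw stream yourself, as in the loop above. The exact fields of the error envelope are an assumption here, modeled on the OpenAI-style {"error": {...}} shape:

import json

def handle_sse_data(data: str) -> str:
    """Return the text delta carried by one data: payload, raising if it is an error event."""
    if data == "[DONE]":
        return ""
    event = json.loads(data)
    if "error" in event:
        # Assumed envelope fields; message/type are illustrative, not confirmed by hal0's docs.
        err = event["error"]
        raise RuntimeError(f"stream aborted mid-flight: {err.get('type')}: {err.get('message')}")
    return event["choices"][0]["delta"].get("content") or ""

Drop this into the iter_lines loop above in place of the inline json.loads handling.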