v1 pre-alpha · Apache-2.0

Local AI for your home. Strix Halo native.

hal0 turns a Ryzen AI Max+ 395 with 128 GB of unified memory into a polished, OpenAI-compatible inference box. Slots, dispatcher, prewired chat — one Linux command installs the lot.

install on any modern Linux box
sh
curl -fsSL https://hal0.dev/install | bash

Why hal0

A polished home AI box, not another llama-server wrapper.

Strix Halo native

UMA-aware probe, FLM provider for the XDNA NPU, unified-memory slot-fit warnings sized to the real pool — not the BAR carve-out. 128 GB Ryzen AI Max+ 395 is the reference deployment, not a hopeful port.

Concurrent chat + embed + voice

Five built-in slot classes — chat, embed, STT, TTS, image — each a real systemd-managed process on its own port. Run them at once: primary + embed running concurrently on Strix Halo measures ~258 tok/s with <200 ms dispatch.
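Each slot being a real unit means you can inspect them with stock systemd tooling. The hal0-slot@ template name below is an assumption for illustration; check the unit names the installer actually drops:

sh
# list every slot instance spawned from the template units (name assumed)
systemctl list-units 'hal0-slot@*'

# follow one slot's logs straight from journald
journalctl -u 'hal0-slot@chat' -f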

Image gen, day one

POST /v1/images/generations served by ComfyUI on ROCm, inside the same slot lifecycle as everything else. Curated SDXL Turbo, SD 1.5, and Flux Schnell ship pre-pinned by sha256. Not a side-quest — a first-class slot.

Dispatcher with single-flight

Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI). Cold-cache prefetch with request coalescing — a thundering herd of identical prefetches becomes one HTTP call. Every routing decision logged as structured breadcrumbs.

What ships in v1

A complete home inference platform.

Slot state machine

Every workload has typed states (offline → pulling → warming → ready → serving ↔ idle → unloading) with atomic transitions, persisted to state.json and streamed over SSE.
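Both surfaces are easy to watch from a shell. A minimal sketch, assuming state.json sits under /var/lib/hal0 and the SSE stream hangs off the API port (both paths are assumptions, not documented routes):

sh
# snapshot the persisted slot states (file location assumed)
jq '.' /var/lib/hal0/state.json

# follow live transitions over SSE; -N disables curl's output buffering
curl -N http://localhost:8080/events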

OpenAI /v1/* surface

Chat, completions, embeddings, rerank, audio transcriptions, audio speech, models. Same shapes any OpenAI SDK already speaks.
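Embeddings, for instance, take the stock OpenAI request body. The model id below is illustrative; use whatever the embed slot is serving:

sh
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text-v2-moe-q4_k_m",
    "input": "hal0 turns a Strix Halo box into an inference server"
  }'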

Dispatcher + single-flight

Registry-aware routing, cold-cache prefetch, request coalescing, decision logging. A thundering herd of identical prefetches becomes one HTTP call.
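The coalescing is observable from a shell: burst identical requests at a cold model and the dispatcher should collapse the prefetch into one upstream pull. A rough sketch using the quickstart model:

sh
# 8 identical requests against a cold slot; single-flight means the
# model gets pulled and warmed once, not eight times
for i in $(seq 8); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen2.5-0.5b-instruct-q4_k_m","messages":[{"role":"user","content":"ping"}]}' >/dev/null &
done
wait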

Hardware-aware probe

Detects GPU / NPU / unified memory, writes /etc/hal0/hardware.json, surfaces VRAM/RAM fit warnings inline in the slot form before you try to load.
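The probe output is plain JSON at the documented path, so checking what was detected is one command:

sh
# what did the probe decide about this box?
jq '.' /etc/hal0/hardware.json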

Prewired OpenWebUI

Chat UI on :3001, zero config — the installer points it at the local hal0 API. Dashboard is for operating the box, not chatting.

Dashboard (Vue 3)

Views for Slots, Models, Hardware, Logs, Settings, Providers, and First-run, plus an error shell. Dark by default. SSE-backed live status and log tail.

Auth + HTTPS, one flag

Off by default for trusted-LAN installs. --auth=basic brings up Caddy with basic_auth at the edge, bearer tokens for the OpenAI API, and automatic HTTPS — internal CA for .local, Let's Encrypt for real domains. Zero certbot.
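End to end that looks like one flag at install time and a token on API calls afterwards. A sketch, assuming the flag is forwarded through the curl pipe via bash -s; the domain and token are placeholders:

sh
# install with auth enabled at the edge
curl -fsSL https://hal0.dev/install | bash -s -- --auth=basic

# OpenAI API calls then carry a bearer token
curl https://hal0.example.com/v1/models \
  -H "Authorization: Bearer $HAL0_TOKEN"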

Atomic self-update

hal0 update --channel stable|nightly. Cosign-verified tarballs swap a current symlink; --rollback reverts. Slot units survive API restarts.
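Day to day that is two commands; the rollback spelling below follows the flag named above:

sh
# fetch, verify, and atomically swap in the latest stable build
hal0 update --channel stable

# something broke? point the current symlink back at the previous build
hal0 update --rollback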

One-line install

Linux + systemd, idempotent, non-interactive. Pre-flight checks, systemd template units, working slot defaults dropped on disk — no manual yaml.

Image generation

POST /v1/images/generations served by ComfyUI on ROCm — same OpenAI shape your SDK already speaks. Curated SDXL Turbo / SD 1.5 / Flux Schnell with license badges. img is a built-in slot with the same lifecycle as chat, embed, and audio.
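A minimal request against that endpoint, following the standard OpenAI images shape (the model id is illustrative; any of the curated checkpoints works):

sh
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "a lighthouse at dusk, watercolor",
    "size": "1024x1024",
    "n": 1
  }'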

Hardware

Built for Strix Halo. Honest about everything else.

Linux + systemd is the only hard requirement (installer/install.sh:86). The probe picks the right provider; you don't pin a backend by hand.

Provider matrix — picked automatically by the hardware probe

Hardware | Vendor | Unified / VRAM | Support | Notes
AMD Ryzen AI Max+ 395 | AMD · "Strix Halo" | Unified 128 GB | first-class | Reference deployment. iGPU + XDNA NPU + UMA-aware probe. Vulkan default; FLM for NPU.
AMD Ryzen AI Max 385 / 390 | AMD · "Strix Halo" | Unified 64 GB | first-class | Same path as the 128 GB SKU; small + mid tiers fit comfortably, 70B Q4 with shorter context.
NVIDIA RTX 30/40/50 | NVIDIA | 10–32 GB | supported | CUDA-backed llama.cpp. Same slot lifecycle, dedicated VRAM instead of UMA.
AMD Radeon RX 7000 / discrete | AMD | 16–24 GB | supported | Vulkan path today; ROCm toolbox image on the build list for opt-in.
CPU-only x86_64 | Intel / AMD | System RAM | experimental | Vulkan-CPU fallback. Small models only — CI runs Qwen 0.5B here. Not the headline experience.

macOS and Windows are not in scope for v1. NPU benchmarks land once the FLM toolbox image is published to ghcr.io/hal0ai/.

Recommended loadouts

Three starting points. Mix, match, swap.

Curated picks for the 128 GB Strix Halo — refreshed to the latest open-weight releases as of May 2026. Swap a different model into any slot whenever you change your mind. See the full hardware page for discrete-GPU and CPU loadouts.

Coding · mid

~19 GB
  • primary · Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
  • embed · nomic-embed-text-v2-moe-Q4_K_M

MoE with 3B active params — runs near 3B speeds, reasons like a 30B. Pairs with a 140 MB embed for repo-aware search.

Voice mode

~3 GB
  • primary · Qwen3-4B-Instruct-2507-Q4_K_M
  • stt · Moonshine base
  • tts · Kokoro-82M v1.0

Low-latency reply, edge-built STT, 54-voice TTS. 128 GB leaves the entire rest of the budget free for a second chat model warm in another slot.

Maxed-out · long context

~50 GB
  • primary · Llama-4-Scout-17B-16E-Instruct-Q4_K_M
  • embed · bge-m3 (8192-token context)

10M-token context, MoE with 17B active. The biggest realistic single-model loadout that still leaves room for STT/TTS slots warm on 128 GB unified.

Loadouts are starting points. Every real install ends up tweaked. Sizes are published GGUF file sizes (Hugging Face, May 2026); no tok/s numbers on this page.
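After a swap, the documented models endpoint is a quick way to confirm what is actually live across slots; the response follows the stock OpenAI list shape:

sh
# ids of every model currently served, across all slots
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'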

Quickstart

From zero to a live /v1/chat in two commands.

The installer is idempotent and non-interactive. It probes the hardware, writes /etc/hal0/hardware.json, drops working slot defaults, and brings the API up on :8080.

1 · install

install.sh
sh
# install on any modern Linux box with systemd
curl -fsSL https://hal0.dev/install | bash

# optional overrides
# HAL0_PORT=9090 HAL0_PREFIX=/opt/hal0 curl … | bash

2 · chat

/v1/chat/completions
sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct-q4_k_m",
    "messages": [{"role":"user","content":"Hello!"}]
  }'

Prefer a chat UI? OpenWebUI ships prewired on :3001, pointed at the local hal0 API out of the box.

Comparison

Why not just ollama, LM Studio, or raw llama.cpp?

hal0 isn't an inference engine — it's the orchestration, lifecycle, and multi-modal surface around llama.cpp, FLM, Moonshine, and Kokoro. Honest take, in one table.

Feature | hal0 | ollama | LM Studio | raw llama.cpp | Cloud API
OpenAI /v1/* surface | chat, embed, rerank, STT, TTS | chat-only subset | chat-only | raw HTTP | full
systemd-managed lifecycle | ✓ | partial | desktop app | DIY | n/a
Hardware probe + fit warnings | ✓ | ✗ | ✗ | ✗ | n/a
Headless one-line install | ✓ | ✓ | GUI installer | manual | n/a
Multi-model concurrent slots | ✓ | partial | ✗ | DIY | ✓
Bundled chat UI | OpenWebUI prewired | ✗ | built-in | ✗ | varies
Signed self-update + rollback | cosign | manual | desktop updater | manual | n/a
Data stays on your box | ✓ | ✓ | ✓ | ✓ | ✗

vs. ollama — systemd-managed slots survive hal0-api restarts; the OpenAI surface includes embeddings, rerank, and STT/TTS, not just chat. Hardware probe + slot fit warnings are first-class.

vs. LM Studio — Linux-first, headless-first, one-line install, no GUI required. Prewired OpenWebUI handles chat; the dashboard is for operating the box.

vs. raw llama.cpp — hal0 owns the lifecycle: health probes, atomic env writes, cold-boot grace, single-flight prefetch, structured errors, signed self-update with rollback.

vs. cloud APIs — your hardware, your data, your models. External upstreams (OpenRouter, Anthropic, OpenAI, custom) can be configured as fallbacks behind the same /v1/* surface — mix local + remote per-model in one config.
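From the client side nothing changes: a remote-routed model is just another id behind the same endpoint. The upstream-prefixed id below is hypothetical, not a documented naming scheme:

sh
# the dispatcher picks local vs. upstream per model (id format assumed)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openrouter/meta-llama/llama-3.3-70b","messages":[{"role":"user","content":"hi"}]}'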

Roadmap

v1 shipped. v0.2 is where it gets interesting.

Phase 1 landed on 2026-05-15 — 353 unit tests passing, integration tier on Vulkan-CPU + Qwen 0.5B. Here's what's next.

Full roadmap →

Now · v1

  • Slot lifecycle + dispatcher (shipped): State machine, single-flight, adaptive cold-boot, atomic TOML config, structured error envelopes.
  • Five providers wired (shipped): llama.cpp (Vulkan), FLM (XDNA NPU), Moonshine (STT), Kokoro (TTS), ComfyUI (image gen on ROCm).
  • Image generation (shipped): POST /v1/images/generations served by ComfyUI on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell, slot named img.
  • Caddy + auth + HTTPS (shipped): install.sh --auth=basic brings up Caddy with basic_auth + Bearer tokens + automatic HTTPS, with a post-install round-trip self-test.

Next · v0.2

  • Memory subsystem (next): Conversation memory store wired into the dispatcher — opt-in per slot.
  • MCP support (next): Model Context Protocol surface so hal0 can host MCP servers alongside slots.
  • hal0.local mDNS (next): Zero-config LAN discovery so the box is reachable as hal0.local without DNS edits.

Packaging · v0.2

  • AUR + Ubuntu PPA (next): Distro-native packages alongside the curl installer.
  • Benchmarks + Presets UI (next): In-dashboard benchmark runner and one-click loadout presets.

Trust

Boring guarantees, in writing.

License

Apache-2.0

Patent grant included. Bundle, fork, ship.

Telemetry

Off by default

Opt-in reporting surfaces hardware class, version, and slot count. No model names. No IPs. No config.

Releases

cosign-signed

hal0 update verifies GitHub-OIDC signer identity before unpacking.
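hal0 update runs this check itself, but it can be reproduced with cosign's keyless verification. Filenames and the identity regexp are illustrative; the real values come from the release manifests:

sh
cosign verify-blob \
  --certificate hal0.tar.gz.pem \
  --signature hal0.tar.gz.sig \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'github\.com/hal0ai/hal0' \
  hal0.tar.gz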

Source

github.com/hal0ai/hal0 →

Issues, discussions, release manifests. Pre-alpha — see CONTRIBUTING for contribution status.

One Linux box. Your AI.

Install hal0 in under a minute.

Linux + systemd is the only hard requirement. The probe picks the right provider — Vulkan on Strix Halo, CUDA on NVIDIA, ROCm on AMD discrete, CPU as a fallback.