CPU-only
hal0’s CPU-only path is the fallback tier. It exists so the installer works on a fresh VM, so CI can smoke-test slot lifecycle on boxes without GPUs, and so anyone trying hal0 for the first time can do so before deciding whether to commit hardware to it.
It is not the headline experience. The streaming chat feel that makes local AI worth using needs at least an iGPU.
What works
The hardware probe detects “no GPU” and writes that to `/etc/hal0/hardware.json`. The installer then picks the Vulkan-CPU path: llama.cpp compiled with the Vulkan backend, running against lavapipe (Mesa’s software Vulkan implementation). This is the same path hal0’s CI uses to smoke-test the slot lifecycle with the Qwen 0.5B model.
All four built-in slots can theoretically run:
- `primary` — small Q4 chat (4B and under)
- `embed` — embeddings + rerank
- `stt` — Moonshine (CPU-capable but latency-sensitive)
- `tts` — Kokoro (CPU-capable but latency-sensitive)
The OpenAI-compatible `/v1/*` API, the slot lifecycle state machine, the dispatcher, and OpenWebUI all behave identically to a GPU box.
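Because the surface is the standard OpenAI-compatible HTTP API, any OpenAI client works unchanged. A minimal curl smoke test, where the base URL (and port) and the `primary` model id are assumptions — substitute your install’s values:

```shell
# Point any OpenAI-compatible client at the box; base URL is an assumption.
HAL0=${HAL0:-http://localhost:8080}
BODY='{"model":"primary","messages":[{"role":"user","content":"One sentence: why run models locally?"}],"max_tokens":64}'
# "|| true" keeps the snippet harmless when no server is listening yet.
curl -s "$HAL0/v1/models" || true
curl -s "$HAL0/v1/chat/completions" -H 'Content-Type: application/json' -d "$BODY" || true
```

On a CPU-only box the completion streams back slowly but through exactly the same endpoints a GPU install serves.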
What to expect
The honest answer:
- Chat: a few tokens per second on a 4B Q4 model with a modest context. Fine for occasional Q&A, painful for long conversations.
- Streaming voice: not realistic. Moonshine STT and Kokoro TTS run on CPU, but the round-trip latency for streaming audio isn’t what the slots were designed for. You can sanity-check the path; you can’t run voice mode comfortably.
- Embeddings: fine. `nomic-embed-text-v2-moe-Q4_K_M` at 140 MB is small enough to run at usable speeds on any modern CPU, and the embed slot doesn’t need streaming.
Recommended loadout (CPU-only, 32–64 GB RAM, no GPU)
- `primary`: `gemma-3-1b-it-Q4_K_M` (~0.7 GB), or `Qwen3-4B-Instruct-2507-Q4_K_M` (~2.5 GB) for a snappier feel. (Fallback: `Phi-3-mini-4k-instruct-q4.gguf`, ~2.4 GB, the curated default.)
- `embed`: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) — runs fine on CPU.
- No `stt`/`tts` slots — leave them in the `offline` state.
This is also the smallest viable hal0 install. The whole runtime, with a model loaded, fits comfortably under 3 GB of RSS.
Use cases that make sense
- Smoke-testing the install on a VM before committing hardware.
- Development: running the API + dashboard against a tiny model while you build something against `/v1/*`.
- CI: hal0’s own integration tier uses Vulkan-CPU + Qwen 0.5B — same path you’d hit here.
- A box that’s already running for some other reason where you’d like an occasional local Q&A endpoint.
Use cases that don’t
- A daily-driver chat box. Get an iGPU.
- Anything voice-mode. Get an iGPU.
- Any model larger than Q4 4B unless you are deeply patient.
- Anything where the model is larger than your memory bandwidth can feed. Above 8B Q4 on CPU you start hitting wall-clock limits that no amount of patience fixes.
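The bandwidth limit can be sanity-checked with a roofline estimate: each generated token streams roughly the whole model through memory, so tokens per second is bounded above by bandwidth divided by model size. A sketch, where the 50 GB/s figure is an assumed dual-channel DDR5 bandwidth:

```shell
# Upper bound on decode speed: memory bandwidth / bytes read per token.
# 50 GB/s is an assumed dual-channel DDR5 figure; real CPU inference
# lands well below this ceiling because it is also compute-bound.
ceilings=$(awk 'BEGIN {
  bw = 50.0                                  # GB/s, assumption
  printf "%.0f %.0f", bw / 5.0, bw / 15.0    # 8B Q4 (~5 GB), 14B Q8 (~15 GB)
}')
echo "tok/s ceilings (8B Q4, 14B Q8): $ceilings"
```

Even the theoretical ceiling for a 14B Q8 is a handful of tokens per second, and actual throughput sits well below it — that is the wall clock no amount of patience fixes.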
Installation notes
The standard installer from the install page detects no-GPU correctly and picks Vulkan-CPU:

```shell
curl -fsSL https://hal0.dev/install | bash
```

A few things to check:
- The Vulkan loader must be installed (`apt install libvulkan1` / `pacman -S vulkan-icd-loader`).
- Mesa’s lavapipe (`mesa-vulkan-drivers` / `vulkan-swrast`) provides the software Vulkan implementation. `vulkaninfo --summary` should show `llvmpipe` as a device.
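Both checks can be rolled into one snippet (`vulkaninfo` ships in the `vulkan-tools` package on the major distros):

```shell
# Verify the software Vulkan stack: loader present, llvmpipe enumerated.
if command -v vulkaninfo >/dev/null 2>&1; then
  if vulkaninfo --summary 2>/dev/null | grep -qi llvmpipe; then
    msg="lavapipe OK: llvmpipe device present"
  else
    msg="no llvmpipe device: install mesa-vulkan-drivers / vulkan-swrast"
  fi
else
  msg="vulkaninfo not found: install vulkan-tools"
fi
echo "$msg"
```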
Troubleshooting
vulkaninfo shows no devices on a no-GPU box. Install Mesa’s software rasterizer Vulkan driver:
- Debian/Ubuntu: `apt install mesa-vulkan-drivers`
- Arch/CachyOS: `pacman -S vulkan-swrast`
- Fedora: `dnf install vulkan-loader mesa-vulkan-drivers`
Slot starts but inference is extremely slow. Expected on CPU. Confirm the model is the size you think it is (a Q8 14B is dramatically slower than a Q4 4B), and shorten context windows where possible.
OOM on slot start. A Q4 model needs roughly its file size in RAM plus headroom for KV cache. A 7 GB model on an 8 GB box won’t fit; swap matters more here than it does on GPU paths.
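The fit check is simple arithmetic. A sketch, where the 20% KV-cache headroom factor is an assumption — long context windows need more:

```shell
# Will a model fit? File size plus KV-cache headroom vs. free RAM.
# The 1.2x headroom factor is an assumption; long contexts need more.
model_gb=7
need=$(awk -v m="$model_gb" 'BEGIN { printf "%.1f", m * 1.2 }')
echo "a ${model_gb} GB model wants ~${need} GB of RAM"
```

A 7 GB model thus wants roughly 8.4 GB before the OS takes its share, which is why it won’t fit on an 8 GB box.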
When to graduate from CPU-only
If you’re doing anything more than smoke-testing, the cheapest meaningful upgrade is any modern AMD APU with RDNA-class graphics. Even a 780M-class iGPU is dramatically faster than CPU-only Vulkan on chat workloads. The full Strix Halo experience is the top end; the floor is “any iGPU at all.”