# NVIDIA discrete GPU
hal0 supports NVIDIA discrete GPUs, with a CUDA-backed llama.cpp toolbox as the intended backend. NVIDIA is on the v1 supported list — the path works today, with one caveat: the dedicated CUDA toolbox image hasn’t been published to ghcr.io/hal0ai/ yet, so most users in v1 fall back to the Vulkan path on NVIDIA hardware.
## Where this fits

NVIDIA discrete is a supported target, not the reference platform. For dedicated VRAM with mature drivers, NVIDIA is the obvious choice; for big-model headroom in one box, Strix Halo remains the headline target.
## Tier-1 cards

- RTX 5090 (32 GB GDDR7)
- RTX 4090 / 3090 (24 GB GDDR6X)
- RTX 4080 / 4080 Super (16 GB GDDR6X)
## Tier-2 cards

- RTX 3080 / 3080 Ti (10–12 GB GDDR6X)
- Older 20-series and below: technically supported by llama.cpp; not a v1 focus.
Cards below 10 GB run very small models only (Q4 4B and under).
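As a rough fit check (our rule of thumb, not a published hal0 heuristic): Q4_K_M weights land near 0.6 GB per billion parameters, plus a couple of GB for KV cache and runtime overhead. As a throwaway shell calculation:

```bash
# Back-of-envelope Q4_K_M sizing. A rule of thumb, not a hal0 tool:
# ~0.6 GB per billion parameters for weights, plus ~2 GB KV/overhead.
fits_q4() {
  local params_b=$1 vram_gb=$2
  echo "${params_b}B Q4 needs ~$(echo "$params_b * 0.6 + 2" | bc) GB; card has ${vram_gb} GB"
}
fits_q4 30 24   # ~20 GB needed: fits a 24 GB card with modest context
fits_q4 70 24   # ~44 GB needed: offload territory
```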
## Vulkan vs CUDA

Out of the box in v1, NVIDIA users typically run on the Vulkan toolbox — the same image Strix Halo and AMD discrete use. It works, and it’s the path the installer picks when it detects an NVIDIA GPU without a published CUDA toolbox.
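A quick way to confirm the card is actually visible to the Vulkan path (`vulkaninfo` ships with the standard Vulkan tools package on most distros):

```bash
# The card should show up as a Vulkan physical device via the
# proprietary driver's ICD. No output here means the Vulkan toolbox
# can't see the GPU either.
vulkaninfo --summary | grep -i nvidia
```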
The CUDA toolbox is on the build list. When it lands, expect a non-trivial throughput improvement on chat workloads and substantially better large-context handling. We’ll publish before-and-after numbers on this page once the CUDA toolbox ships.
## Recommended loadouts

These mirror the discrete-card section of the Strix Halo loadouts.
### RTX 5090 (32 GB VRAM)

- **primary**: `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M` (~18.6 GB) or any Q4 ~30B chat — comfortable with a 16–32k context.
- **embed**: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) co-resident.
- Q4 70B (`Hermes-4-70B` / `Llama-3.3-70B`) is feasible but tight with partial CPU offload; expect lower tok/s than VRAM-resident inference (see the offload sketch below).
- Trade vs Strix Halo: no headroom for a hot STT/TTS slot alongside a 30B primary.
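Partial CPU offload in llama.cpp is controlled by `-ngl` / `--n-gpu-layers`: only that many transformer layers live in VRAM, and the rest run on the CPU out of system RAM. This page doesn’t document how the toolbox exposes that knob, so the sketch below assumes you are invoking `llama-server` by hand inside the toolbox; the model path and layer count are illustrative.

```bash
# Illustrative only: not a documented hal0 invocation.
# -ngl caps how many layers are resident in VRAM; tune it down until
# the model loads without an out-of-memory error. A Q4 70B on a 32 GB
# card typically keeps most, but not all, of its ~80 layers on-GPU.
llama-server \
  -m /models/Hermes-4-70B-Q4_K_M.gguf \
  -ngl 60 \
  -c 8192
```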
### RTX 4090 / 3090 (24 GB VRAM)

- **primary**: `Qwen3-30B-A3B-Instruct-2507-Q4_K_M` (~18.6 GB) fits with shorter context, or `gemma-3-12b-it-Q4_K_M` (~6.6 GB) for a longer window.
- **embed**: small Q4 embed only (`nomic-embed-text-v2-moe`, ~140 MB).
- Q4 70B requires partial CPU offload — works, but drops well below VRAM-resident speeds.
- Trade vs 5090: tighter context budgets at the same model size.
### RTX 4080 / 4080 Super (16 GB VRAM)

- **primary**: `gemma-3-12b-it-Q4_K_M` (~6.6 GB) or `Hermes-4-14B-Q4_K_M` (~9 GB).
- **embed**: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) leaves several GB for a ~16k context.
- Q4 32B class (`Qwen3-30B-A3B`) is offload-only here — workable occasionally, not as a daily driver.
- Trade vs 24 GB cards: keep the primary at ~13B class for a smooth experience.
### RTX 3080 / 3080 Ti (10–12 GB VRAM)

- **primary**: a 4–14B Q4 — `Hermes-4-14B-Q4_K_M`, `gemma-3-12b-it-Q4_K_M`, or `Qwen3-4B-Instruct-2507-Q4_K_M` (~2.5 GB) for low latency.
- **embed**: skip on 10–12 GB cards.
- One slot at a time is the norm.
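Whichever loadout you land on, it’s worth checking that the resident footprint matches these numbers. Assuming the toolbox exposes the GPU to the host normally, watch VRAM settle while the slot warms up:

```bash
# memory.used should settle near model size plus KV cache. If it's
# pinned at the card's ceiling, the context window is set too large
# for the model you picked.
watch -n 2 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'
```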
## Installation notes

The standard one-liner from the install page handles the Vulkan path on NVIDIA:

```bash
curl -fsSL https://hal0.dev/install | bash
```

You’ll want:
- Recent NVIDIA proprietary drivers installed via your distro (575+ series recommended for newest cards).
- Vulkan runtime present (`vulkaninfo --summary` returns devices).
- The service user with `/dev/nvidia*` access — usually handled by the driver install (all three checks are bundled in the sketch below).
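A small preflight script covering those three checks. The `hal0` service user name is an assumption here; substitute whichever user your install runs the toolbox as:

```bash
#!/usr/bin/env bash
# Preflight for the NVIDIA Vulkan path. The "hal0" user is assumed.

echo "== driver =="
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

echo "== vulkan =="
vulkaninfo --summary | grep -i nvidia || echo "no NVIDIA Vulkan device"

echo "== device nodes =="
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes"
sudo -u hal0 test -r /dev/nvidia0 \
  && echo "service user can read /dev/nvidia0" \
  || echo "grant access (often the 'video' group)"
```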
The hardware probe detects the GPU and writes VRAM size to `/etc/hal0/hardware.json` so slot-fit warnings are accurate.
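The probe’s schema isn’t documented on this page, so the excerpt below only illustrates the kind of record it writes; the field names are guesses, not a contract:

```bash
cat /etc/hal0/hardware.json
# Hypothetical output; field names are illustrative, not a schema:
# {
#   "gpu": { "vendor": "nvidia", "name": "RTX 4090", "vram_mb": 24564 },
#   "backend": "vulkan"
# }
```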
## Troubleshooting

**Probe doesn’t list the GPU.** Run `nvidia-smi` to confirm the driver sees the card. If that’s empty, fix the host driver install first.
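Two quick host-side checks before touching hal0 at all (both are standard tools, nothing hal0-specific):

```bash
# Is the card visible on the PCI bus at all?
lspci | grep -i nvidia

# Is the kernel module loaded? Empty output usually means the driver
# install didn't finish (for example, a reboot pending after DKMS).
lsmod | grep '^nvidia'
```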
**Slot fails to start with CUDA / library errors.** v1 doesn’t bundle a CUDA toolbox — make sure the slot is on the Vulkan provider:

```bash
hal0 slot list --json | grep provider
```

If a slot is set to a CUDA provider, swap it back to llama-cpp until the CUDA toolbox publishes:

```bash
hal0 slot swap primary --provider llama-cpp
```

**Lower than expected throughput.** Confirm the card isn’t power-limited or PCIe-mode-limited. NVIDIA’s Vulkan implementation is solid but won’t hit the throughput of a native CUDA llama.cpp build until the dedicated toolbox ships.
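Both conditions are visible from `nvidia-smi` query fields:

```bash
# Power: the enforced limit vs. what the board allows.
nvidia-smi --query-gpu=power.draw,power.limit,power.max_limit --format=csv

# PCIe: current vs. maximum link generation and width. Links train
# down at idle, so check under load before concluding anything.
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
```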