Recommended loadouts

Curated starting points, not gospel. Mix, match, tweak — the slot system lets you load a different model into any slot whenever you change your mind. Sizes are the published GGUF Q4_K_M / Q8_0 file sizes (verified on Hugging Face, May 2026); no tok/s numbers here. Picks are refreshed to the latest open-weight releases as of 2026-05-15, and each entry keeps a previous-generation fallback for users on existing setups.

Headline target: the 128 GB Strix Halo SKU


Ryzen AI Max+ 395 with 128 GB LPDDR5X-8000 unified memory. The iGPU carveout is BIOS-tunable up to ~96 GB, and some configs report ~110 GB usable for the GPU when paged through GTT. Q4 70B fits with massive headroom; Q4 MoE 100B+ on a 17B–22B active path becomes feasible; mid-class loadouts leave 100+ GB free for context, KV cache, embed + audio slots, and multi-tab usage.

64 GB Strix Halo SKUs (Ryzen AI Max 385 / 390) are still well-served by every small + mid tier below, plus a tight Q4 70B / Llama-4 Scout with shorter context windows.
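If you want to sanity-check a custom mix before downloading, the arithmetic is simple: sum the GGUF file sizes, add an allowance for KV cache and runtime overhead, and compare against the iGPU carveout. A minimal Python sketch; the KV-cache and overhead figures are rough assumptions, not measured values:

```python
# Rough fit check for a loadout against the Strix Halo unified pool.
# File sizes are the published GGUF numbers from the tables below; the
# KV-cache and overhead allowances are coarse assumptions (quantized KV
# and shorter contexts shrink them considerably).

def fits(model_files_gb, kv_cache_gb=8.0, overhead_gb=4.0, pool_gb=96.0):
    """Return (used_gb, free_gb) for a set of co-resident GGUF files."""
    used = sum(model_files_gb) + kv_cache_gb + overhead_gb
    return used, pool_gb - used

# Example: coding loadout with the 30B-A3B coder, a 70B reasoner, and the
# nomic embed model all hot at once.
used, free = fits([18.6, 42.5, 0.14])
print(f"used ~{used:.0f} GB, free ~{free:.0f} GB of the ~96 GB carveout")
```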

| Tier | Size | primary | embed | Notes |
|---|---|---|---|---|
| Small | ~5 GB | Qwen2.5-Coder-7B-Instruct-Q4_K_M | | No Qwen3-Coder small variant has shipped yet; 7B Qwen2.5 stays the best small dedicated coder. |
| Mid | ~19 GB | Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB, MoE 3B active — runs near 3B speeds, reasons like 30B) | nomic-embed-text-v2-moe-Q4_K_M (~140 MB) for repo-aware search | fallback: Qwen2.5-Coder-32B-Instruct-Q4_K_M (~20 GB). |
| Large | ~42 GB | Hermes-4-70B-Q4_K_M (~42.5 GB) for hybrid reasoning + tool-friendly coding | | Alt: Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context). |

No dedicated 70B+ coder exists in GGUF, so the convention is to fall back on a top-tier general / reasoning model for hard problems. 128 GB headroom keeps both the 30B-A3B coder and a 70B reasoning model hot in separate slots. Large fallback: Llama-3.3-70B-Instruct-Q4_K_M.
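In practice the two hot slots split the work by request. A sketch assuming hal0 exposes an OpenAI-compatible /v1/chat/completions endpoint; the base URL and model identifiers are illustrative, so substitute whatever names your install registers:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed local hal0 endpoint; adjust for your install

def ask(model: str, prompt: str) -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Day-to-day completions go to the MoE coder; hard problems escalate to the 70B.
draft = ask("qwen3-coder-30b-a3b-instruct-q4_k_m",
            "Write a pytest fixture that provisions a temporary SQLite database")
review = ask("hermes-4-70b-q4_k_m", f"Review this fixture for correctness:\n{draft}")
print(review)
```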

| Tier | Size | primary | Notes |
|---|---|---|---|
| Small | ~2.5 GB | Qwen3-4B-Instruct-2507-Q4_K_M (Aug 2025 release, 1M-token context) | Snappy on any modern box. fallback: Llama-3.2-3B-Instruct-Q4_K_M. |
| Mid | ~19 GB | Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB, MoE 3B active) | Smaller-RAM alt: gemma-3-12b-it-Q4_K_M (~6.6 GB). fallback: Meta-Llama-3.1-8B-Instruct-Q4_K_M. |
| Large | ~50 GB | Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context) | On 128 GB you also get the embed slot hot at the same time and headroom for STT/TTS. |

Large fallback: Llama-3.3-70B-Instruct-Q4_K_M (~42 GB) or Qwen2.5-72B-Instruct-Q4_K_M (~47 GB). 64 GB SKUs don’t comfortably run a large primary + embed + audio simultaneously.

| Slot | Pick | Notes |
|---|---|---|
| primary | Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) — low-latency reply | fallback: Llama-3.2-3B-Instruct-Q4_K_M. |
| stt | Moonshine base (~190 MB) via the moonshine toolbox — built for edge real-time | Higher-accuracy alt: whisper-large-v3-turbo (~1.6 GB). 2025 SOTA: Canary-Qwen-2.5B (Open ASR leaderboard, 5.63% WER). |
| tts | Kokoro-82M v1.0 (~330 MB, 8 languages / 54 voices, Jan 2025) via the kokoro toolbox | Voice-cloning alt: F5-TTS. |

128 GB leaves the entire rest of the budget free for a large embed or a second chat model warm in another slot.
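A voice round-trip, assuming the stt and tts slots are exposed through OpenAI-style /v1/audio routes; the endpoint names, model identifiers, and voice name below are illustrative rather than confirmed API:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed local endpoint

# STT: assumed transcription route backed by the Moonshine stt slot.
with open("question.wav", "rb") as f:
    text = requests.post(f"{BASE}/audio/transcriptions",
                         files={"file": f},
                         data={"model": "moonshine-base"}).json()["text"]

# Chat: the low-latency primary answers the transcript.
reply = requests.post(f"{BASE}/chat/completions", json={
    "model": "qwen3-4b-instruct-2507-q4_k_m",
    "messages": [{"role": "user", "content": text}],
}).json()["choices"][0]["message"]["content"]

# TTS: assumed speech route backed by the Kokoro tts slot ("af_heart" is one
# of Kokoro's stock voices; pick any of the 54 available).
audio = requests.post(f"{BASE}/audio/speech",
                      json={"model": "kokoro-82m", "input": reply, "voice": "af_heart"})
with open("reply.wav", "wb") as out:
    out.write(audio.content)
```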

  • primary: Hermes-4-70B-Q4_K_M (~42.5 GB, Aug 2025 — hybrid-mode reasoning + creative strength).
  • Lighter alt: Hermes-4-14B-Q4_K_M (~9 GB, Qwen-3-14B base).
  • fallback: Mistral-Small-24B-Instruct-2501-Q4_K_M (~14 GB).
  • primary: gemma-3-1b-it-Q4_K_M (~0.7 GB) — text-only, March 2025.
  • fallback: Phi-3-mini-4k-instruct-q4.gguf (~2.4 GB, the curated default); or Qwen2.5-0.5B-Instruct-Q4_K_M (~400 MB, the CI smoke model).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB, multilingual MoE — 137M params).

Runs comfortably on CPU-only fallback boxes; smallest viable hal0 install.

  • primary: Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB) for synthesis. fallback: Qwen2.5-14B-Instruct-Q4_K_M (~9 GB).
  • embed: bge-m3 (~600 MB Q8 — multilingual, multi-vector, 8192-token context, top retrieval R@1 in 2026 benchmarks). Lower-footprint alt: nomic-embed-text-v2-moe (~140 MB). fallback: bge-large-en-v1.5-Q8_0 (~670 MB).

The embed slot also serves rerank via /v1/rerankings. On 128 GB the leftover budget gives huge room for KV cache, so long-context retrieval (64k+) runs without paging.
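A retrieval sketch against the embed slot. /v1/rerankings is the route named above; the /v1/embeddings route and the exact request and response shapes are assumptions modeled on common OpenAI-compatible servers, so check your install:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed local endpoint

docs = ["GTT paging on Strix Halo",
        "Kokoro voice list",
        "Slot lifecycle overview"]

# Embed the corpus with the embed slot (bge-m3, per the pick above).
emb = requests.post(f"{BASE}/embeddings",
                    json={"model": "bge-m3", "input": docs}).json()
vectors = [item["embedding"] for item in emb["data"]]  # assumed OpenAI-style response

# Rerank retrieved candidates against the query via the same slot.
reranked = requests.post(f"{BASE}/rerankings", json={
    "model": "bge-m3",
    "query": "How much of the unified pool can the iGPU use?",
    "documents": docs,
}).json()
print(reranked)
```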

  • primary: Hermes-4-70B-Q4_K_M (~42.5 GB) — Nous’s hybrid-reasoning model is explicitly tuned for tool-call faithfulness and format adherence.
  • Lighter alt: Hermes-4-14B-Q4_K_M (~9 GB).
  • fallback: Qwen2.5-32B-Instruct-Q4_K_M (~20 GB).
  • embed: bge-m3 (~600 MB) or nomic-embed-text-v2-moe (~140 MB) for retrieval-augmented routing.

Lines up with the v0.2 agents / MCP roadmap.
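An illustrative tool-call request using the OpenAI-style tools schema. Whether the agent layer accepts exactly this shape (versus routing through MCP) is an assumption; the point is that a faithful tool-caller like Hermes-4 returns a structured tool_calls entry instead of guessing file contents:

```python
import requests

BASE = "http://localhost:8080/v1"  # assumed local endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool, defined only for this example
        "description": "Read a UTF-8 text file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = requests.post(f"{BASE}/chat/completions", json={
    "model": "hermes-4-70b-q4_k_m",
    "messages": [{"role": "user", "content": "Summarize README.md"}],
    "tools": tools,
}).json()

print(resp["choices"][0]["message"].get("tool_calls"))
```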

The biggest realistic single-model loadout that still fits a 128 GB Strix Halo with room to breathe. Pick one:

  • primary: Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB) — 10M context, MoE 17B active. The current best balance of size and capability.
  • primary: Hermes-4-70B-Q8_0 (~75 GB) — 70B at Q8 instead of Q4, trading size for quant headroom (rough size math after this list).
  • primary: Mistral-Large-Instruct-2411-Q4_K_M (123B, ~73 GB) — older but still excellent for raw single-model quality.
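The quoted file sizes fall out of parameter count times bits-per-weight. A rough estimate only; the bpw figures are approximate, and real GGUF files vary slightly because embeddings and some layers use different quant types:

```python
# Approximate bits-per-weight for common GGUF quants.
BPW = {"Q4_K_M": 4.85, "Q8_0": 8.5}

def gguf_gb(params_billion: float, quant: str) -> float:
    """Estimated file size in GB (1e9 bytes) for a given quant."""
    return params_billion * BPW[quant] / 8

for name, params, quant in [("Hermes-4-70B", 70, "Q4_K_M"),
                            ("Hermes-4-70B", 70, "Q8_0"),
                            ("Mistral-Large-2411", 123, "Q4_K_M")]:
    print(f"{name} {quant}: ~{gguf_gb(params, quant):.0f} GB")
```

The estimates land within a couple of GB of the published sizes quoted above.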

For NVIDIA the path is CUDA-backed llama.cpp; for AMD discrete it’s the ROCm toolbox image. Both go through the same slot lifecycle as Strix Halo — what changes is dedicated VRAM vs the unified pool.

  • primary: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB) or any Q4 ~30B chat — comfortable with a 16–32k context.
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) co-resident.
  • Q4 70B (Hermes-4-70B / Llama-3.3-70B) is feasible but tight with partial CPU offload; expect lower tok/s than VRAM-resident inference.
  • Trade vs Strix Halo: no headroom for a hot STT/TTS slot alongside a 30B primary.
  • primary: Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB) fits with shorter context, or gemma-3-12b-it-Q4_K_M (~6.6 GB) for a longer window.
  • embed: small Q4 embed only (nomic-embed-text-v2-moe ~140 MB).
  • Q4 70B requires partial CPU offload — works, but drops well below VRAM-resident speeds.
  • Trade vs 5090: tighter context budgets at the same model size.
  • primary: gemma-3-12b-it-Q4_K_M (~6.6 GB) or Hermes-4-14B-Q4_K_M (~9 GB).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) leaves several GB for a ~16k context.
  • Q4 32B class (Qwen3-30B-A3B) is offload-only here — workable occasionally, not as a daily driver (offload sketch after this list).
  • Trade vs 24 GB cards: keep the primary at ~13B class for a smooth experience.
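What partial offload looks like in practice, sketched with llama-cpp-python rather than hal0's own tooling (the toolboxes drive llama.cpp underneath, but the exact knobs they set are not shown here); the layer count and path are illustrative and need tuning per card:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=28,   # tune downward until the model loads without OOM on 16 GB
    n_ctx=16384,       # a smaller context window also buys back VRAM
)

out = llm("Explain GTT paging in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```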

NVIDIA RTX 3080 / AMD RX 7900 XT / XTX (10–24 GB VRAM)

  • primary: a 4–14B Q4 — Hermes-4-14B-Q4_K_M, gemma-3-12b-it-Q4_K_M, or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for low-latency.
  • embed: small Q4 embed if the card has 16 GB+; skip on 10–12 GB cards.
  • AMD route is hal0-toolbox-rocm; NVIDIA stays on the CUDA llama.cpp build.
  • Trade: one slot at a time is the norm — no simultaneous primary + embed + audio.
  • primary: gemma-3-1b-it-Q4_K_M (~0.7 GB) or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for a snappier feel. fallback: Phi-3-mini-4k-instruct-q4.gguf (~2.4 GB, the curated default).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) — runs fine on CPU.
  • No stt / tts slots — Moonshine and Kokoro are technically CPU-capable but streaming audio at usable latency wants at least an iGPU.
  • Expect: a few tok/s on chat; fine for occasional Q&A and dev smoke tests, but not a streaming voice experience.

Strix Halo’s unified pool is what unlocks the 70B Q4 and large / agentic tiers; on discrete cards you trade ceiling for raw tok/s on smaller models. hal0 picks the right provider automatically from the hardware probe (hal0/hardware/probe.py → /etc/hal0/hardware.json → slot defaults).
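A sketch of that idea, assuming a hypothetical schema for /etc/hal0/hardware.json; the real field names and thresholds live in the probe code:

```python
import json

# Field name and thresholds below are illustrative assumptions, not the
# actual probe schema.
with open("/etc/hal0/hardware.json") as f:
    hw = json.load(f)

vram_gb = hw.get("gpu_memory_gb", 0)  # hypothetical field

if vram_gb >= 96:      # Strix Halo-class unified pool
    primary = "Hermes-4-70B-Q4_K_M"
elif vram_gb >= 24:
    primary = "Qwen3-30B-A3B-Instruct-2507-Q4_K_M"
elif vram_gb >= 10:
    primary = "gemma-3-12b-it-Q4_K_M"
else:                  # CPU-only fallback box
    primary = "Qwen3-4B-Instruct-2507-Q4_K_M"

print("default primary slot:", primary)
```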

Loadouts are starting points. Every real install ends up tweaked.