Crown Citadel

AMD Ryzen AI Max+ 395 128 GiB local LLM benchmarks

Host ciru • NixOS • Linux 6.19.9 • llama.cpp Vulkan • 16C / 32T • 64 / 64 split

Latest • 128 GiB UMA • llama.cpp Vulkan • -ngl 999 -fa 1

Size / Speed Frontier

Model size on the x-axis, generation speed on the y-axis, bubble area scaled by VRAM use. This is the main “what actually wins here?” chart.

Comparison context: sets the active context referenced by the charts below.

Context Retention

Prompt-processing retention versus each model’s own 4096 baseline. Brighter is more stable as context grows.
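
For reference, the retention metric is just each context's prompt-processing throughput divided by the same model's own 4096-token result. A minimal sketch; the numbers in the example are illustrative, not measurements from this table:

```python
def pp_retention(pp_by_ctx: dict[int, float], baseline_ctx: int = 4096) -> dict[int, float]:
    """Prompt-processing retention relative to the model's own baseline context."""
    baseline = pp_by_ctx[baseline_ctx]
    return {ctx: pp / baseline for ctx, pp in sorted(pp_by_ctx.items())}

# Illustrative numbers only:
print(pp_retention({4096: 900.0, 16384: 720.0, 32768: 540.0}))
# {4096: 1.0, 16384: 0.8, 32768: 0.6}
```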

Memory Envelope

VRAM, system RAM, and GTT at the active context, sorted by VRAM pressure.

Prompt vs Generate

Raw throughput view at the active context. Symbol size tracks model size, not memory.

Scaling Curves

Prompt-processing throughput across all tested contexts, one line per model family entry.

Setup and caveats
Machine details, runtime settings, and the benchmark notes that meaningfully affect interpretation.

System

  • Host ciru, NixOS, Linux 6.19.9
  • AMD Ryzen AI Max+ 395, 16 cores / 32 threads
  • AMD Radeon 8060S-class Strix Halo iGPU
  • 128 GiB total RAM, currently split 64 / 64
  • Reported pools: 64 GiB VRAM, 31.2 GiB GTT, about 62 GiB system RAM (see the query sketch below)
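
If you want to check these pools on your own Strix Halo box, the amdgpu driver exposes them through sysfs. A minimal sketch, assuming the iGPU enumerates as card0 (adjust the path otherwise):

```python
from pathlib import Path

GIB = 1024 ** 3
DEV = Path("/sys/class/drm/card0/device")  # assumption: the iGPU is card0

def sysfs_bytes(name: str) -> int:
    """Read one of amdgpu's mem_info_* counters (reported in bytes)."""
    return int((DEV / name).read_text())

def mem_total_bytes() -> int:
    """MemTotal from /proc/meminfo, converted from KiB to bytes."""
    for line in Path("/proc/meminfo").read_text().splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found")

print(f"VRAM total: {sysfs_bytes('mem_info_vram_total') / GIB:.1f} GiB")
print(f"GTT total:  {sysfs_bytes('mem_info_gtt_total') / GIB:.1f} GiB")
print(f"System RAM: {mem_total_bytes() / GIB:.1f} GiB")
```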

Runtime

  • llama.cpp with Vulkan via llama-cpp-vulkan
  • Pinned commit 08f21453aec846867b39878500d725a05bd32683
  • Common benchmark flags: -ngl 999 -fa 1 -n 128 -r 2 -o md
  • Guard reserves (memory kept free as a safety margin): 8 GiB RAM, 2 GiB VRAM, 4 GiB GTT
  • Typical contexts: 4096, 16384, 32768, with long checks to 100000+ (see the sweep sketch below)
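
A minimal sweep driver in the spirit of the flags above, assuming the llama-bench tool from the pinned build and that context size is swept through its -p (prompt tokens) flag; the model path is a placeholder, and guard-reserve checks are omitted:

```python
import subprocess

MODEL = "model.gguf"  # placeholder, not a path from this repo
CONTEXTS = [4096, 16384, 32768]
COMMON = ["-ngl", "999", "-fa", "1", "-n", "128", "-r", "2", "-o", "md"]

for ctx in CONTEXTS:
    # One llama-bench run per context size, with the common flags above.
    cmd = ["llama-bench", "-m", MODEL, "-p", str(ctx), *COMMON]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)
```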

Important notes

  • Gemma 4 required a newer local Vulkan-enabled build because the older system build lacked gemma4 support.
  • Gemma 4 31B Dense was manually stopped at 32768, so it is excluded from charted comparison points beyond its measured contexts.
  • Qwen3-Coder-Next is plotted as 80B-class on the size axis to reflect the benchmark notes in this repo.
Raw result matrix
Every benchmark point used in the curated charts, including the larger long-context runs.
| Date | Model | Family | Size | Context | PP t/s | TG t/s | VRAM | RAM | GTT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |