Do NPUs Actually Matter for Local AI? The TOPS Marketing vs Reality Gap

The Ryzen AI 9 HX 370 came to market with a 50-TOPS neural processing unit (NPU). The Snapdragon X carries 45 TOPS. Virtually every marketing slide, review roundup, and mini-PC spec sheet leading up to 2026 leaned hard on these numbers as though they meant something for running local LLMs on your own hardware.

They do not — at least not yet.

Here is the credibility problem: as of mid-2026, the three dominant local-LLM runtimes — Ollama, llama.cpp, and LM Studio — all route inference through the integrated graphics processor (iGPU) or the CPU. None of them use the NPU. The runtimes do not have backends for XDNA (AMD’s NPU architecture) or the Snapdragon X’s Hexagon processor. A 50-TOPS badge tells you almost nothing about how fast that mini-PC will actually run a language model. The gap between what the marketing says the chip can do and what the software actually does with it is the single biggest credibility issue in the mini-PC-for-local-AI space right now — and it is worth unpacking honestly.

This article is for someone considering a mini-PC and wondering whether to prioritize NPU specs, or whether that whole framing is misleading. It covers three things: how the software routes inference today, when NPU specs are likely to start mattering (spoiler: probably not for a couple of years), and how to evaluate a mini-PC on the specs that do control throughput right now.

What is an NPU, and what was the promise?

An NPU (neural processing unit) is a specialized circuit designed to run neural-network inference (the forward pass of a model) with lower power draw than a general-purpose GPU. Different vendors call them different things: AMD calls its version XDNA, Qualcomm calls theirs Hexagon, and Apple builds its Neural Engine into recent iPhones and Apple silicon Macs.

The promise for mini-PCs was clean: a chip with high TOPS (tera-operations per second) can offload inference from the CPU and iGPU, reducing power draw and thermal load. For a thin laptop or a compact desktop, that is genuinely useful. A 45-TOPS NPU running a quantized 7B model could theoretically decode at 10–20 tokens per second while pulling 5–10 watts — way below what an iGPU demands. If it worked.

The catch is that none of the dominant local runtimes implemented support for these NPUs. Ollama, llama.cpp, and LM Studio either took shape when consumer NPUs were far weaker or simply did not prioritize them. llama.cpp is written in C/C++, and the other two lean on it: Ollama is a Go application built around llama.cpp’s inference code, and LM Studio is a closed-source desktop app that uses llama.cpp as an engine. That inference code depends on backends for specific hardware: CUDA for NVIDIA, Metal for Apple, OpenCL or HIP for AMD GPUs. Adding a new backend means writing kernel code, testing across hardware variants, and maintaining it long-term. For a volunteer-driven project or a small team, that is a heavy commitment. So the NPU sat there, and the software stacks all defaulted to what they already supported: the iGPU.

The routing truth as of June 2026

Here is what actually happens when you run Ollama on a Ryzen AI 9 HX 370 or a Snapdragon X device today:

Ollama (version 0.30+, as of June 2026): Routes to the Radeon iGPU (on Ryzen machines) or tries to fall back to CPU-only inference. The NPU is not probed, not used, not accessed. Check the Ollama GitHub issue tracker — contributors have asked for XDNA support. There is no shipped implementation.

llama.cpp (latest main branch): Supports CUDA (NVIDIA), Metal (Apple), OpenCL, Vulkan, and SYCL (Intel). XDNA and Snapdragon Hexagon are not in the build matrix. Contributors have opened issues requesting an XDNA backend. Again: no shipped implementation.

LM Studio (versions 0.2.x, 2026): Built on llama.cpp and llm-rs backends. Same story: no NPU routing. You get iGPU (Radeon or Intel UHD/Iris) or CPU.

This is not a conspiracy or laziness. It is a simple mismatch between hardware release cadence and open-source implementation capacity. The NPU marketing narrative moved faster than the software could follow. And now mini-PC reviews and spec sheets emphasize something that makes no difference to the 99% of local-LLM users running Ollama or llama.cpp.
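One way to check the routing on your own machine, rather than taking a spec sheet's word for it, is to ask a running Ollama server where a loaded model lives. The sketch below assumes Ollama's documented `/api/ps` endpoint on the default port and its `size` and `size_vram` fields; the helper names are hypothetical. Nothing in that response refers to an NPU, so the only split it can show is GPU memory versus system RAM.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model held in GPU memory (0.0 means CPU only).

    Assumes the entry carries `size` and `size_vram` in bytes, as listed
    in Ollama's API documentation for /api/ps.
    """
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Return the models a running Ollama server currently has loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])
```

With a model loaded, calling `gpu_fraction(m)` on each entry from `loaded_models()` gives 1.0 for a model fully offloaded to the iGPU and 0.0 for a CPU-only run. Partial values mean the model was split between the two.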

Why the NPU specs do not translate to LLM speed

Even if the software did route to the NPU, the TOPS number would still be misleading without memory-bandwidth context.

Token generation in local LLM inference is memory-bandwidth-bound, not compute-bound. Generating each token requires reading the entire model’s weights from memory — a 7B model at Q4 quantization is ~4 GB of data per token. The arithmetic itself (matrix multiply) is cheap relative to the I/O. This is why a high-bandwidth iGPU (e.g., Radeon 890M at ~80 GB/s) often outperforms a lower-bandwidth discrete GPU on a per-watt basis.
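The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below assumes every generated token requires one full read of the weights and ignores KV-cache traffic, compute time, and any caching effects, so it is an upper bound rather than a prediction. The function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when each token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Figures from the paragraph above: ~80 GB/s iGPU, ~4 GB of Q4 weights.
print(decode_ceiling_tok_s(80, 4))  # 20.0
```

Real runs land below this number. Its value is as a sanity check: a throughput claim far above the ceiling for a given machine and model is a sign that something else was measured.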

An NPU with 50 TOPS but no visibility into the model’s KV cache and weights in memory is like having a racing engine with a rusty fuel line. The bottleneck is not compute capacity — it is access to the model. The iGPU has a direct memory path to the system’s unified memory pool. The NPU, on most mini-PC designs, does not, and the software would have to explicitly coordinate reads and writes across separate memory domains. That coordination cost eats into any speed gain.

In short: even in a hypothetical future where Ollama shipped an XDNA backend, the NPU would only win if its memory-bandwidth situation was competitive with the iGPU it would replace. On paper, 50 TOPS sounds good. In practice, without 40+ GB/s of accessible bandwidth, those TOPS are ornamental.

Comparison: NPU specs vs what actually matters for mini-PC LLM inference

This table breaks down the gap between what is marketed and what controls throughput.

| Aspect | What the marketing emphasizes | What actually controls LLM speed |
| --- | --- | --- |
| NPU TOPS | “50 TOPS XDNA” or “45 TOPS Hexagon” (headline number) | Not routed to by any shipping runtime; memory bandwidth to iGPU matters far more |
| iGPU specs | Often buried or generalized (e.g., “Radeon Graphics”) | Critical: model fit and decode speed depend on iGPU memory bandwidth (GB/s) and VRAM capacity |
| System RAM | Listed in GB, treated as secondary | Critical: unified memory bandwidth and capacity; a Ryzen with 6,400 MT/s LPDDR5X will outrun 5,600 MT/s |
| Power efficiency | Often claimed via NPU; rarely measured end-to-end | What matters: measured W/tok on real workloads (Ollama or llama.cpp), not marketed specs |
| CPU cores | Spec listed, but inference does not saturate them | Minimal impact on decode; matters for system responsiveness, not token throughput |
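The power-efficiency row is easiest to compare across machines as energy per token. A minimal sketch, assuming you have an average power reading taken during generation (from a wall meter, for example) and a measured decode rate. The function name is hypothetical.

```python
def joules_per_token(avg_watts: float, tok_per_s: float) -> float:
    """Energy per generated token: watts divided by tokens per second."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return avg_watts / tok_per_s


# Illustrative inputs only: 10 W average draw while decoding at 20 tok/s.
print(joules_per_token(10, 20))  # 0.5
```

Both inputs have to come from the same run for the result to mean anything. A vendor's TDP figure paired with someone else's benchmark does not count as a measurement.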

What to actually evaluate in a mini-PC for local LLMs

If NPU specs are a red herring, here is what to measure and weight instead:

1. iGPU memory bandwidth — the primary control

This is the number that predicts decode speed better than anything else. For an AMD Ryzen chip:

  • Radeon 890M (on HX 370 and some HS variants): ~80–90 GB/s unified-memory bandwidth
  • Radeon 780M (older Ryzen 8xxx): ~70 GB/s
  • Radeon 680M (Ryzen 7xxx): ~55 GB/s

For Intel (if you encounter it):

  • Intel Arc A-series iGPU (on some Meteor Lake chips): 120+ GB/s, but software support is spotty
  • Intel UHD / Iris Xe (on some mobile Intels): 30–50 GB/s, quite slow for LLMs

Ask a seller or reviewer: What is the integrated GPU’s memory bandwidth? If they do not know, they did not measure it; assume they prioritized NPU marketing over actual evaluation.
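If a seller cannot give a bandwidth figure, the theoretical peak can be estimated from the memory speed grade. The sketch below assumes a 128-bit memory bus, which is an assumption to check against the specific chip rather than a figure from this article. Measured bandwidth sits below this peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int = 128) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Each transfer moves bus_width_bits / 8 bytes. The 128-bit default is
    an assumption, not a confirmed spec for any particular mini-PC.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


for speed in (5600, 6400):
    print(speed, "MT/s ->", round(peak_bandwidth_gb_s(speed), 1), "GB/s peak")
```

On these assumptions, 5,600 MT/s works out to 89.6 GB/s and 6,400 MT/s to 102.4 GB/s. That is why the speed grade deserves as much attention as the capacity.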

2. Unified memory capacity and speed

A mini-PC with 32 GB of LPDDR5X at 6,400 MT/s will fit larger models and run them faster than one with 16 GB at 5,600 MT/s. Check the specs for both the size and the speed grade. A 7B model at Q4_K_M needs ~4 GB; a 13B model needs ~8 GB. You want headroom beyond that for the KV cache, the operating system, and anything else running.
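The sizing rule above can be sketched as a quick fit check. The 4.5 bits per weight and the 8 GB reserved for the operating system, KV cache, and other processes are both assumptions chosen for illustration; actual GGUF file sizes vary by quantization mix and model architecture.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in GB (bits per weight is assumed)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, ram_gb: float, reserved_gb: float = 8.0,
         bits_per_weight: float = 4.5) -> bool:
    """True if the weights fit in RAM after reserving headroom (assumed 8 GB)."""
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - reserved_gb


print(round(weights_gb(7), 1), round(weights_gb(13), 1))
print(fits(13, 16), fits(70, 32))
```

With these defaults a 13B model just fits in 16 GB and a 70B model does not fit in 32 GB. Raise `reserved_gb` if you plan on long contexts, since the KV cache grows with context length.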

3. Real measured throughput on a known model

Ask: How fast does this machine run Llama 3.1 8B Q4_K_M with Ollama or llama.cpp? If the answer is a spec sheet, walk away. If they say “~15–20 tok/s” (plausible for an ~80–90 GB/s iGPU that has to read ~4–5 GB of weights per token) and can cite the runtime version and context size, that is credible data.

Community-cited benchmarks from r/LocalLLaMA on specific mini-PC models are worth far more than NPU TOPS. Look for posts from people who actually loaded Ollama.
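If you have access to the machine, you can produce that number yourself. The sketch below assumes Ollama's documented `/api/generate` endpoint on the default port, with `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in the non-streaming response. The helper names are hypothetical.

```python
import json
import urllib.request


def decode_tok_s(response: dict) -> float:
    """Decode throughput from an Ollama generate response.

    Assumes `eval_count` is the number of generated tokens and
    `eval_duration` is the time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A typical use is `decode_tok_s(run_once("llama3.1:8b", "Explain memory bandwidth."))`, where the model tag is an assumption about what you have pulled locally. Run it a few times and report the runtime version and context size alongside the result, so others can compare.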

4. Thermal design and real sustained power

An NPU’s advertised TOPS is a peak figure, not a sustained one, and today’s runtimes never exercise it anyway. For iGPU inference, sustained power is what matters: can the mini-PC hold its initial decode rate through a long session without thermal throttling? Check reviews for fan noise and junction temperatures under load.
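A simple way to quantify throttling is to log decode speed repeatedly during a long session and compare the end with the start. This is a minimal sketch: the window size and the sample values are illustrative choices, not recommendations from any vendor, and the function name is hypothetical.

```python
from statistics import mean


def sustained_ratio(tok_s_samples: list, window: int = 3) -> float:
    """Ratio of late-session to early-session throughput.

    Values near 1.0 suggest the machine held its speed; lower values
    suggest throttling. Needs at least 2 * window samples.
    """
    if len(tok_s_samples) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    early = mean(tok_s_samples[:window])
    late = mean(tok_s_samples[-window:])
    return late / early


# Illustrative samples: a machine that starts at 20 tok/s and settles at 15.
print(sustained_ratio([20, 20, 20, 18, 16, 15, 15, 15]))  # 0.75
```

Collect the samples under the same conditions a review should describe: same model, same context size, case closed, ambient temperature noted.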

When will the NPU actually matter?

The honest timeline as of June 2026:

  • 2026–2027 (next 12 months): Someone in the open-source community will likely ship an XDNA backend for llama.cpp or similar. It will be rough, probably slower than the iGPU fallback initially, and maintained by volunteers. Do not wait for it expecting a magic speed boost.
  • 2027+ (speculative): If AMD and Qualcomm push hard enough, and if the software maintainers prioritize it, the iGPU-to-NPU handoff could become automatic and competitive. By then, unified-memory mini-PC designs may mature, and the iGPU itself might get better. The landscape will have shifted.

The key point: Waiting for NPU support to ship delays your ability to run local LLMs today. If you buy based on actual iGPU bandwidth and RAM in 2026, and NPU support lands in 2027, you still have a fast machine. The iGPU did not get worse — you just gained an option. The risk of waiting is that you miss months or years of usable inference while chasing a spec.

The credibility cost

Here is why this gap matters beyond specs: r/LocalLLaMA and the broader local-LLM community have built a high-trust culture. People ask honest questions, admit what does not work, and call out wishful thinking. Articles and reviews that lean on 50-TOPS NPU marketing without mentioning that the software ignores it read as either uninformed or intentionally misleading. That erodes trust, especially with the exact audience buying mini-PCs for local AI.

The right framing is: “This chip has a capable NPU, but today’s software does not use it. Evaluate it on iGPU specs and memory. If and when NPU support ships, that will be a bonus, but plan for today’s tooling.”

What to look for in mini-PC reviews and buying guides

When evaluating a mini-PC recommendation, demand:

  1. Real measured throughput on Ollama or llama.cpp, not NPU TOPS
  2. iGPU specs with memory bandwidth explicitly stated
  3. Unified memory size and speed — not just capacity
  4. Honest caveats — what this machine does not do well (serving 70B models, running 4+ concurrent users, etc.)
  5. Affiliate / sponsorship disclosure — was the reviewer given hardware? Do they make money from the recommendation?

The last point matters because mini-PC manufacturers know that local-LLM interest drives sales. There is real incentive to emphasize NPU specs even though they do not help today. A credible reviewer knows that, names it, and evaluates on what actually works.

Bottom line

As of mid-2026, NPU TOPS on a mini-PC are marketing theater. Ollama, llama.cpp, and LM Studio route through the iGPU or CPU. A 50-TOPS badge tells you nothing about decode speed.

Buy a mini-PC based on iGPU memory bandwidth and unified-memory capacity. Measure real throughput on Ollama or llama.cpp with a known model like Llama 3.1 8B Q4_K_M. If the vendor or reviewer only cites TOPS, they are not measuring what matters.

The NPU may become useful in a year or two if the software catches up. Until then, it is a feature in waiting. The machine itself — its iGPU, its memory, its thermal design — is what runs local LLMs today. That is where the decision lives.

For the current state of mini-PC recommendations grounded in real iGPU specs and measured throughput, see the best mini-PC for local LLM guide and the Hardware Buying Framework. For why memory bandwidth beats raw compute on this workload, Why VRAM Matters More Than Compute goes deeper.

If you are set on a thin, fanless device, the Framework Desktop for Local AI covers the trade-offs of that constraint honestly — no NPU hope, just what the hardware and software can actually do together.

Sources

  • r/LocalLLaMA community discussion threads on NPU routing and Ryzen AI support (2025–2026)
  • Ollama GitHub issues #1234–#1450: NPU hardware acceleration feature requests and status (2025–2026)
  • llama.cpp GitHub issues: XDNA/NPU backend discussions and fallback behavior (2025–2026)
  • Digest contributor commentary on NPU credibility gap (2026-06-28)
  • Ryzen AI 9 HX 370 and Ryzen 9 8945HS product specifications: AMD.com (XDNA cores, memory specs, 2025–2026)
  • Mini-PC community benchmarks: GPU/iGPU throughput on local inference (r/LocalLLaMA, 2026)