Can a Framework Desktop with Strix Halo 128GB actually run Llama 3.3 70B locally?

Community reports from 2026 say yes at Q8_0 quantization, though LocalRig has not independently verified this. The 128GB of unified memory is large enough; bandwidth is the constraint, so expect slower token generation than on an NVIDIA RTX 4090 running a model that fits in its VRAM. The real question is whether that speed is acceptable for your workload.

How does Strix Halo unified memory compare to RTX 4090 GDDR6X for local LLM inference?

Unified memory (LPDDR5X, ~200–300 GB/s) has more capacity but less bandwidth than the RTX 4090's GDDR6X (~1,008 GB/s; the RTX 3090 is ~936 GB/s). You can run bigger models, but they decode more slowly. It's the inverse of the multi-GPU tradeoff: capacity over speed.

Is ROCm mature enough for local LLM inference on Strix Halo?

Community reports from late 2025 and early 2026 show ROCm + llama.cpp support for Strix Halo is functional but not yet as polished as CUDA on NVIDIA cards. Expect rougher edges, fewer examples, and sometimes slower performance than equivalent NVIDIA paths. The situation is improving monthly.

Why would I buy a Framework Desktop for AI over a used RTX 3090 build?

If serviceability, repairability, true x86 Linux support, or the ability to upgrade modularly matter to you, and you can accept slower token generation in exchange for bigger models, Framework wins. If raw speed-per-dollar is the only constraint, a used RTX 3090 is cheaper and faster on models that fit in 24GB, and an RTX 4090 is faster still.

Framework Desktop as a Local AI Server: Strix Halo, ROCm, and the 128GB Question

The Framework Desktop with AMD Strix Halo is the enthusiast’s answer to the question: “What if I could repair my AI server?” It ships with up to 128GB of unified memory, runs native x86-64 Linux, is built around a standard Mini-ITX mainboard with replaceable storage, cooling, and power supply (the APU and memory are soldered to that board), and costs well under what a Mac with that capacity does. The hype around Strix Halo has been real: Hacker News threads and r/LocalLLaMA are full of first-time builders marveling at something that actually works the way they want it to.

But unified memory is not magic. It is larger than any consumer GPU's VRAM, but it is also slower than GDDR6X. The real question underneath the hype is not “can I run a 70B model?” (the community says you can) but “what speed am I trading away, and is it worth it for repairability?”

This guide is for someone considering a Framework Desktop specifically for local LLM inference, and who cares enough about repairability and Linux-native serviceability to accept that tradeoff. If you want the fastest single setup for the least money, this is not that page — the answer is still a used RTX 3090 or a new RTX 4090. But if you are building a machine you want to keep running, understand, and fix yourself for the next five years, Framework changes the constraint logic.

The core constraint: unified memory scales capacity, not bandwidth

Two specs separate the Strix Halo story from every gaming GPU guide you have read.

Unified memory means the CPU and GPU share one memory pool, with no copying of data between system RAM and VRAM. On a PC with an RTX 4090, the system RAM lives on the motherboard (dual-channel DDR5, roughly 60–100 GB/s), the GPU VRAM lives on the card (GDDR6X, ~1,008 GB/s), and anything you want to compute on the GPU has to move across PCIe. On the Framework, the APU’s CPU cores and GPU cores both read from the same LPDDR5X pool. That is why capacity scales freely: a model that would not fit on any 24GB GPU can simply load into 64GB or 128GB of unified memory.

The caveat, which every honest take on unified memory includes, is bandwidth. LPDDR5X, the memory standard the Ryzen AI Max+ uses, offers roughly 200–300 GB/s depending on configuration and thermal state. That is roughly a fifth to a third of RTX 4090 bandwidth. So you can run bigger models, but you will decode more slowly. This is not a secret; it is the tradeoff you are choosing when you pick unified memory.

For the full picture of how memory bandwidth drives token generation speed, the GPU guide digs into the principle that rules local inference: bandwidth, not FLOPS. The math is the same here. A 70B model at Q8_0 quantization will load into 128GB, and it will run. The decode speed will be lower than it would be on GDDR6X-class bandwidth, if a single consumer card had the capacity to hold it. Whether that slowness matters depends on your use case: interactive single-user chat only has to keep pace with reading speed, while a production batch workload needs all the throughput it can get.
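The bandwidth-not-FLOPS rule can be turned into a back-of-envelope ceiling. The sketch below is a rough estimate, not a benchmark. It assumes decode is purely memory-bound and that every weight is read once per generated token (dense model, batch size 1, no speculative decoding). The bits-per-weight values (about 8.5 for Q8_0, about 4.85 for Q4_K_M), the 256 GB/s Strix Halo figure, and the function names are assumptions introduced for illustration.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: bytes/s available / bytes read per token."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params_billion, bits_per_weight)


# Assumed approximate values, not measurements.
Q8_0, Q4_K_M = 8.5, 4.85  # effective bits per weight
BANDWIDTH_GB_S = {
    "Strix Halo (LPDDR5X)": 256,
    "RTX 3090 (GDDR6X)": 936,
    "RTX 4090 (GDDR6X)": 1008,
}

for name, bw in BANDWIDTH_GB_S.items():
    small = decode_ceiling_tok_s(7, Q4_K_M, bw)
    large = decode_ceiling_tok_s(70, Q8_0, bw)
    # The 70B column ignores whether the weights actually fit on the device.
    print(f"{name:22s} 7B Q4_K_M <= {small:6.1f} tok/s   70B Q8_0 <= {large:5.1f} tok/s")
```

Real throughput lands below this ceiling. A reported figure above it would need another explanation, such as a lighter quantization, a mixture-of-experts model, or speculative decoding.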

Strix Halo specs: the three tiers

Framework ships the Desktop with two APU options across three memory configurations.

Ryzen AI Max 385 (32GB unified): The entry tier. 32GB of unified memory is enough for a 13B model at Q4_K_M with room for context, a 7B at Q8_0, or multiple smaller models. It draws less power than the Max+, and it is the obvious entry point if 70B is not on your roadmap. The price sits between a used RTX 3090 build and a new M3 Max Mac.
Ryzen AI Max+ 395 (64GB unified): 64GB opens up 32B-class models, and larger ones at lower-bit quantization. Community reports show this tier handles Llama 3.1 70B at Q4_K_M comfortably, and it is the inflection point for “bigger model” workloads.
Ryzen AI Max+ 395 (128GB unified): The flagship capacity tier, and the one the HN threads and r/LocalLLaMA obsess over. 128GB makes this the first non-Mac desktop where Llama 3.3 70B fits at Q8_0, the high-quality quantization, with room left for a long context window. Early users report that the model loads and runs (community-reported, not independently verified by LocalRig).

All three tiers are equally repairable and come with the same Framework modularity — you are not buying repairability as a separate tier, you are buying it as a baseline.
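Whether a model fits a given tier comes down to weights plus KV cache. The sketch below is a rough estimate under several assumptions: the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, approximate bits-per-weight values, and an arbitrary 8 GB of headroom for the OS and runtime. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    params_billion: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_gb(shape: ModelShape, bits_per_weight: float) -> float:
    # (params in billions) * (bits / 8) gives decimal gigabytes directly.
    return shape.params_billion * bits_per_weight / 8


def kv_cache_gb(shape: ModelShape, context_tokens: int, bytes_per_value: int = 2) -> float:
    # K and V tensors, per layer, per KV head, per head dimension, per token.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens / 1e9


def fits(shape: ModelShape, bits_per_weight: float, context_tokens: int,
         pool_gb: float, headroom_gb: float = 8.0):
    """Return (fits?, GB needed for weights + KV cache)."""
    need = weights_gb(shape, bits_per_weight) + kv_cache_gb(shape, context_tokens)
    return need + headroom_gb <= pool_gb, need


LLAMA_70B = ModelShape(params_billion=70.6, n_layers=80, n_kv_heads=8, head_dim=128)

for pool in (32, 64, 128):
    for label, bpw in (("Q4_K_M", 4.85), ("Q8_0", 8.5)):
        ok, need = fits(LLAMA_70B, bpw, context_tokens=4096, pool_gb=pool)
        print(f"{pool:>3} GB tier, 70B {label}: needs ~{need:.1f} GB -> {'fits' if ok else 'does not fit'}")
```

In practice the GPU-addressable share of the pool depends on firmware and driver settings, so the headroom value is the first thing to adjust.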

The question underneath: ROCm first impressions, bandwidth reality

When the Framework Desktop launched in Q3 2025, the immediate developer enthusiasm met an honest bottleneck: ROCm and llama.cpp on Strix Halo are young. CUDA on NVIDIA cards has been optimized for almost a decade; AMD’s toolchain for consumer APUs is still being shaken out. Community reports from late 2025 through early 2026 paint a picture of something that works, but with friction:

Performance is slower than equivalent NVIDIA. Community-reported (r/LocalLLaMA, 2025–2026): the same 7B Q4_K_M model that runs at 80–110 tok/s on a used RTX 3090 is reported at ~30–50 tok/s on Strix Halo. That is not a bug; it is the bandwidth difference. The surprise is how much that difference matters.
Documentation and examples lag CUDA. If you want to run Ollama or llama.cpp on Strix Halo, the path exists, but the community threads are full of “I had to patch X” or “the ROCm docs do not mention Y.” This is not a blocker for someone comfortable reading source code. It is friction for someone who expects “download and click.”
Optimization is ongoing. Monthly ROCm releases have been landing improvements specifically for Strix Halo. If you are considering one, the software stack you run in 2026 should be noticeably faster than what launch units shipped with in 2025, and faster still in 2027. This is a fast-moving platform, not a settled one.

For the full technical take, ServeTheHome and Tom’s Hardware both posted first-impressions deep dives on Strix Halo unified memory and the ROCm ecosystem — worth reading if you are serious about a purchase.
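Because the numbers in this section are community-reported, it is worth measuring on your own hardware. The sketch below wraps llama.cpp's `llama-bench` tool. It assumes the binary is on your PATH, that it was built with a working GPU backend, and that stdout contains only the JSON array when `-o json` is used. The helper names are hypothetical.

```python
import json
import subprocess


def bench_command(model_path, n_prompt=512, n_gen=128, n_gpu_layers=99,
                  binary="llama-bench"):
    """Build a llama-bench command line that emits JSON results."""
    return [binary, "-m", model_path,
            "-p", str(n_prompt),       # prompt-processing test size
            "-n", str(n_gen),          # token-generation test size
            "-ngl", str(n_gpu_layers), # layers to offload to the GPU
            "-o", "json"]


def summarize(bench_json: str):
    """Return (test kind, mean tok/s, stddev tok/s) for each result entry."""
    rows = []
    for entry in json.loads(bench_json):
        kind = "prompt" if entry.get("n_gen", 0) == 0 else "decode"
        rows.append((kind, entry["avg_ts"], entry["stddev_ts"]))
    return rows


def run_bench(model_path, **kwargs):
    """Run the benchmark and summarize it; raises if llama-bench fails."""
    result = subprocess.run(bench_command(model_path, **kwargs),
                            check=True, capture_output=True, text=True)
    return summarize(result.stdout)
```

Calling `run_bench` with the path to a local GGUF file returns one row for prompt processing and one for decode. The decode row is the figure to compare against the tok/s ranges quoted here.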

The 128GB case: Llama 3.3 70B and the bandwidth tradeoff

Here is where the 128GB tier stops being theoretical and becomes a concrete use case: running Llama 3.3 70B locally without quantizing it to oblivion.

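Community report (r/LocalLLaMA, 2026, not verified by LocalRig): A Strix Halo 128GB holds Llama 3.3 70B at Q8_0 quantization (roughly 75 GB of weights) entirely in unified memory, with room for a 4K-token context window. Token generation is reported at ~15–25 tok/s. Treat that figure with caution: reading ~75 GB of weights per token over ~200–300 GB/s caps a dense model at roughly 3–4 tok/s, so the reported range is hard to square with the bandwidth math. Either way, keeping the whole model in one pool is much faster than spilling layers from a 24GB GPU into system RAM, and it avoids dropping to Q4_K_M, which trades quality for speed.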

For comparison: an RTX 4090 has roughly four times the bandwidth, but its 24GB cannot hold 70B weights at Q8_0 (~75 GB), or even at Q4_K_M (~40 GB), without offloading layers to system RAM. You would need a multi-GPU RTX build or a Mac M3 Max 128GB to match the capacity. A used 3090 hits the same 24GB wall.

The honest caveat is the speed-bandwidth relationship. If you need Llama 3.3 70B to decode at 40+ tok/s, a Strix Halo 128GB is not the machine. If the speed it does deliver is usable for your workload, and you want a machine that is repairable, serviceable, and fully under your control, it is the obvious choice.

Comparison: Strix Halo vs the homelab alternatives

Where does Strix Halo sit against other local-inference paths? This is not a strict ranking — the winner depends on your constraint.

System	Memory	7B Q4_K_M decode (tok/s, approx.)	Repairability	Price range	Best for
Framework Desktop Strix Halo 128GB	128 GB unified	~30–50	Full modular repair	~$2,000–$2,500	70B models, serviceability, x86 Linux
RTX 4090 build	24 GB GDDR6X	~120–160	GPU swap only	~$2,500–$3,500 new	Maximum single-card speed
Used RTX 3090	24 GB GDDR6X	~80–110	GPU swap only	~$500–$800	Budget VRAM-per-dollar champion
Mac M3 Max 128GB	128 GB unified	~50–65	Not user-serviceable	Mac-tier pricing	Maximum unified memory, Apple ecosystem
2× RTX 3090 (48GB)	48 GB GDDR6X	~80–110 (single stream)	GPU swap only	~$1,200–$1,600 used	32B/70B capacity, no repairability

The Framework row reflects community-reported numbers as of early 2026. If ROCm keeps maturing, tok/s for a unit bought in late 2026 or 2027 will likely be higher.
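The table's two competing constraints, capacity and speed, can be put on a per-dollar footing. The sketch below uses only the ranges from the table above, takes midpoints as a simplifying assumption, and skips the Mac row because the table gives no dollar figure for it. The function names are illustrative.

```python
# (system, memory in GB, 7B tok/s range, price range in USD), copied from the table.
ROWS = [
    ("Framework Desktop Strix Halo 128GB", 128, (30, 50), (2000, 2500)),
    ("RTX 4090 build", 24, (120, 160), (2500, 3500)),
    ("Used RTX 3090", 24, (80, 110), (500, 800)),
    ("2x RTX 3090 (48GB)", 48, (80, 110), (1200, 1600)),
]


def midpoint(value_range):
    low, high = value_range
    return (low + high) / 2


def value_metrics(rows):
    """Return (system, GB per $1,000, 7B tok/s per $1,000) using range midpoints."""
    out = []
    for name, memory_gb, tok_s, price in rows:
        per_thousand = midpoint(price) / 1000
        out.append((name, memory_gb / per_thousand, midpoint(tok_s) / per_thousand))
    return out


for name, gb_per_k, tok_per_k in value_metrics(ROWS):
    print(f"{name:36s} {gb_per_k:5.1f} GB per $1k   {tok_per_k:6.1f} tok/s per $1k")
```

On these midpoints the Framework row leads on memory per dollar and the used RTX 3090 leads on small-model speed per dollar, which is the same tradeoff the table describes in words.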

Who this is for (and who it is not)

You should consider Strix Halo if:

You want to run models larger than 24GB — 32B or 70B class — locally without a Mac.
Repairability and modularity matter to you more than maximum speed.
You are comfortable reading GitHub issues and occasional ROCm documentation when something is not working.
You want true x86-64 Linux support and full control over your inference stack (not locked to Apple’s decisions).
You can accept decode speeds well below an RTX 4090's (roughly a quarter to a third on small models, per the table above) in exchange for running models that only fit on Strix Halo or a Mac.

You should not buy Strix Halo if:

You need raw inference speed for a 7B or 13B model: a used RTX 3090 is cheaper and faster, and a new RTX 4090 is faster still.
You are not comfortable with “first-generation” developer tooling and occasional friction in the software stack.
You want “set it and forget it”: the NVIDIA CUDA path for LLMs is far more polished.
The price point (around $2,000–$2,500 for the 128GB tier) rules you out. A used 3090 costs roughly a quarter to a third of that.

The bottom line: capacity with an asterisk on speed

The Framework Desktop with Strix Halo 128GB is among the first x86-based, user-serviceable machines that can hold Llama 3.3 70B at Q8_0 in a single memory pool and run it locally. That is real, and it is not trivial: it means local AI inference without being locked into Apple’s product lifecycle or sacrificing the ability to open your machine and fix it.

The asterisk is honest: unified memory is not GDDR6X, and you pay for that in token generation speed. On bandwidth alone, a 70B model on Strix Halo decodes at roughly a quarter to a third of the speed the same model would reach on an RTX 4090 (if the 4090 could hold it, which it cannot). That slowness is not a bug or a sign the machine is broken. It is the physics of bandwidth, and it is what you get when you trade bandwidth-per-dollar for capacity-per-dollar.

For someone building a homelab inference server that has to last five years, be debuggable when something breaks, and run models that only fit on high-capacity unified-memory systems, Strix Halo is becoming the obvious choice. For someone whose constraint is speed on 7B–13B models, or whose budget is tight, a used RTX 3090 makes more sense, and a Mac Studio or M3 Max is worth exploring.

ROCm is getting better every release. The performance numbers will improve. Serviceability and modularity will not change — that is a Framework promise. If you buy one, you are betting on AMD’s toolchain improving while you use a machine you can actually repair.