Apple Silicon Inference

Mac Studio M3 Ultra for Local LLMs: 800GB/s, 256GB, and the Prefill Problem

The Mac Studio M3 Ultra is Apple’s flagship local-inference machine, built for models that do not fit anywhere else: 70B+ LLMs, long-context work, and anything that needs a large unified memory pool. It carries 800GB/s of memory bandwidth, in the same class as high-end consumer discrete GPUs, and up to 256GB of unified memory in the current configuration. But there is a catch buried in the specifications, and a discontinuation that changes the buying math. This guide separates the hype from the practical reality: where the M3 Ultra genuinely wins, where it loses, and why the March 2026 config change matters more than Apple’s press release suggests.

The M3 Ultra’s split personality: decode excellence, prefill pain

To understand the M3 Ultra, you must hold two facts in tension:

Decode (token generation after the prefill) is class-leading for large models. Decode is bound by memory bandwidth, and 800GB/s paired with a memory pool big enough to hold the whole model makes the M3 Ultra faster at generating tokens than any single consumer discrete GPU once the model no longer fits in that GPU’s VRAM. On DeepSeek V3 at 4-bit quantization, community benchmarks cite ~20–21 tok/s (Hardware Corner, June 2026, not independently verified by LocalRig). On DeepSeek R1 full-weight, the same sources report ~17–18 tok/s. These are respectable numbers for models of that size: generating each token means streaming the active weights out of memory, so bandwidth rather than raw compute sets the pace.

Prefill (prompt processing before token generation begins) is punitive. This is the hidden tax that separates M3 Ultra ownership from M3 Ultra enjoyment. When you paste a 32,000-token context and hit Enter, the chip must process every token in the prompt before generating the first output token. The 800GB/s bandwidth still applies, but prefill is limited by compute rather than bandwidth, which makes it the slowest part of the pipeline, to the point where long contexts can take 10–20 minutes of dead time before decoding begins. The Hardware Corner headline “14-minute wait on M3 Ultra prefill” is not hyperbole; it is a documented reality for 8K+ context inputs. This is where the M3 Ultra loses to discrete GPUs: an RTX 4090 also has to process the prompt first, but its prefill is measured in seconds, not minutes. The difference matters if your workload is long-context document summarization, RAG, or anything with a large, variable-size context.

This is not a defect. It is an architectural trade-off. Unified memory is a feature, not a bug: it lets you load up to 256GB of model weights, which is impossible on any single consumer discrete card. But prefill is compute-bound, and the M3 Ultra’s GPU delivers less matrix-math throughput than the tensor cores in NVIDIA’s discrete cards. You are trading prefill speed for capacity. Knowing which trade-off you are making is the difference between a sensible purchase and years of staring at a spinner.
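The decode/prefill split can be put into rough numbers. The sketch below is a back-of-envelope estimate, not a benchmark: it assumes a dense model whose weights are read once per generated token (mixture-of-experts models read only their active experts, so this bound does not apply to them directly), and the prefill rates passed in are hypothetical placeholders rather than measurements.

```python
def decode_tok_per_s_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on decode speed for a dense model.

    Every generated token streams all weights from memory once, so the
    rate cannot exceed bandwidth divided by model size.
    """
    return bandwidth_gb_s / weights_gb


def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to first token if the prompt is processed at a steady rate."""
    return prompt_tokens / prefill_tok_per_s


# 800 GB/s and a ~40 GB 70B Q4_K_M model: ceiling of 20 tok/s.
ceiling = decode_tok_per_s_upper_bound(800.0, 40.0)

# Hypothetical prefill rates, chosen only to show the shape of the gap.
slow_wait = prefill_seconds(32_000, 100.0)    # 320 seconds
fast_wait = prefill_seconds(32_000, 2_000.0)  # 16 seconds
```

The ceiling lines up with the 15–22 tok/s range quoted for 70B Q4 decode, while the wait before the first token scales with prompt length and a compute-bound rate that bandwidth does nothing to improve.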

The discontinuation: 512GB is gone, capped at 256GB

In March 2026, Apple quietly discontinued the 512GB unified-memory configuration for Mac Studio. New machines max out at 256GB. This is a footnote on the spec sheet and a tectonic shift for anyone running the largest open models.

The impact: machines for models that need 512GB (DeepSeek V3, full-weight Llama 3.1 405B, newer research models, or fine-tuned giants) can no longer be purchased new. If you need more than 256GB, the only path is the used market. Used 512GB Macs are available but shrinking in inventory, command a premium, and carry the usual risk of buying secondhand consumer hardware. The person who bought a 512GB Studio for $16,000 in 2025 can now sell it used for a much smaller loss. The person who wants to enter the 512GB arena in 2026 must scour eBay and pray the previous owner did not abuse it.

For most people, 256GB is enough: a 70B model at Q4_K_M quantization fits comfortably (~35–40 GB), leaving room for context and runtime overhead. But if your models or your growth trajectory demand more, you must plan around the used-market reality and the risk that inventory continues to shrink.
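The ~35–40 GB figure can be reproduced with simple arithmetic. This is a sizing sketch under stated assumptions: roughly 4.5 bits per weight for Q4_K_M (the effective figure varies by model and quantizer), a 16-bit KV cache, and a Llama-3-70B-style shape of 80 layers with 8 KV heads of dimension 128. Runtime overhead is not modeled.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight footprint in GB: parameters times bits per weight, over 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache footprint in GB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


model = weights_gb(70, 4.5)                # about 39.4 GB
cache = kv_cache_gb(80, 8, 128, 32_000)    # about 10.5 GB at 32K context
total = model + cache                      # about 49.9 GB, well under 256 GB
```

Swapping in a larger parameter count or a longer context shows quickly when a model stops fitting in 256GB and the used 512GB market becomes the only option.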

Mac Studio M3 Ultra vs. the alternatives

| Hardware | Max VRAM | ~70B Q4 decode | Prefill (TTFT) | Price | Use case |
|---|---|---|---|---|---|
| Mac Studio M3 Ultra 256GB | 256 GB | 15–22 tok/s | slow (30–60+ sec on 4K context) | ~$10–14k | 70B+ models, long-context, unified memory priority |
| 2× RTX 4090 (PCIe; the 4090 has no NVLink) | 48 GB | 12–18 tok/s | fast (1–3 sec) | ~$5–7k | Multi-GPU scaling, 48GB capacity ceiling |
| RTX 5090 (single) | 32 GB | 18–24 tok/s | fast (1–2 sec) | new retail | Maximum single-card speed, 32GB ceiling |
| Used Mac Studio 512GB | 512 GB | 15–22 tok/s | slow (same tax) | ~$8–12k used | Frontier models, largest open-weights |
| Mac mini M4 Pro 36GB | 36 GB | 6–10 tok/s | slow (same prefill tax) | ~$2k | Budget large-model option, poor decode speed |

This table is for decoding a 70B model at Q4_K_M. Prefill times are illustrative and vary wildly with context size. The key insight: M3 Ultra excels at decode speed, loses at prefill speed, and wins decisively at capacity. If your constraint is “I need to run giant models and can tolerate slow prefill,” it is the best consumer option. If your constraint is “I need both speed and capacity,” it is a hard trade-off.

Who the M3 Ultra is for (and who it is not)

Buy the M3 Ultra if:

  • You run models that exceed 24GB regularly. A 70B model at Q4_K_M, or anything in the 32B–100B range at usable quantization. Unified memory means it simply loads.
  • Your workload is primarily decode-bound. Chat continuations, code completion, smaller batch inference. Once the prefill is over, the experience is excellent.
  • You can absorb the prefill latency or mitigate it. Schedule your prefill into windows where the wait is acceptable (e.g., “I will process context Tuesday night, then chat the rest of the week”). Or run smaller models on a separate device for low-latency tasks.
  • You need a single integrated machine. No PCIe coordination, no multi-card cabling, no two separate power budgets. Mac Studio is a compact, thermally sound machine that runs other work (UI design, video, development) at the same time. Its unified memory cannot be upgraded after purchase, so size it up front.

Skip the M3 Ultra if:

  • Your models fit in 24GB or less. A used RTX 3090 ($500–800) or RTX 4090 ($1600+) will decode faster, with a far smaller prefill penalty, at a fraction of the price. The unified-memory advantage only matters when the model exceeds 24GB.
  • Prefill latency is a user-facing constraint. If you are building a chat application where users paste a 16K context and expect a response in seconds (not minutes), discrete GPUs are faster. The architectural difference is not subtle.
  • You are price-sensitive. $10k+ for hardware is a major investment. If you have the budget flexibility, it is defensible; if not, reconsider the constraint. The same $10k buys two RTX 4090s, a custom workstation, and a lot of cloud inference quota.
  • You want to expand beyond 256GB on new hardware. If your models may grow beyond the 256GB cap, buying new is a trap — you will max out and immediately need the used market. Plan accordingly or wait for the next generation.
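The two lists above reduce to a small decision rule. The sketch below encodes only the thresholds stated in this guide (24GB, 256GB, and whether prefill latency is user-facing); the function name and return strings are illustrative, and price sensitivity is deliberately left out because it has no numeric cutoff.

```python
def recommend(model_gb: float, prefill_is_user_facing: bool) -> str:
    """Map a memory requirement (weights plus context) to the guide's advice."""
    if model_gb <= 24:
        # Fits a single consumer card: cheaper and faster there.
        return "discrete GPU (used RTX 3090 or RTX 4090)"
    if model_gb > 256:
        # Above the cap on new Mac Studio configurations.
        return "used Mac Studio 512GB"
    if prefill_is_user_facing:
        # Users waiting on long prompts will feel the prefill tax.
        return "skip the M3 Ultra: prefill latency is the constraint"
    return "Mac Studio M3 Ultra 256GB"


print(recommend(40, prefill_is_user_facing=False))
```

Treat the output as a starting point for the trade-off, not a verdict; the 24GB and 256GB boundaries are hard, but the prefill question is a matter of how much waiting your workload tolerates.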

The path to M3 Ultra ownership: Mac Studio, not Mac mini

If you decide the M3 Ultra is your hardware, do not confuse it with the M3 Max. The M3 Max (up to 128GB) is integrated into MacBook Pro; the M3 Ultra is only available in Mac Studio. Mac Studio is a desktop machine: silent fans, upgradeable storage, Thunderbolt ports, and a design that lasts. It is not a laptop and it is not a portable workstation. You are buying a semi-permanent installation.

For more on the Mac mini vs. Mac Studio trade-off for local LLM work, and the broader Apple Silicon landscape, see the full Apple Silicon comparison guide.

Buying the used 512GB: if you need it, hunt early

If you have determined that 512GB is your hard requirement (frontier models, fine-tuning on giant weights, or research that demands the full model), a new Mac Studio cannot deliver it. The used market is where you will shop. A few protective habits:

  • Verify the configuration before bidding. Confirm it is 512GB unified memory, not 256GB. The difference is obvious in the price, and scams exist; ask for a Geekbench submission URL or a system report screenshot.
  • Check thermal history if possible. Ask whether the machine has been run continuously in a data-center role, a desktop role, or mothballed. Desktop use is least concerning; 24/7 server use is a red flag for accelerated aging.
  • Budget for AppleCare retroactively if available. Used Macs do not carry warranties. Some sellers still have active AppleCare+ contracts; if so, negotiate the cost into the purchase.
  • Test on arrival. Run Geekbench, confirm the reported memory size, load a large model, and check that the prefill and decode figures match community reports. Measured memory bandwidth will land somewhat below the 800 GB/s theoretical peak, so compare against other owners’ results rather than the spec sheet. If your numbers are significantly slower, thermal throttling or some other issue may be present.
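For the arrival test, it helps to record time to first token and decode rate separately, since one blended tok/s number hides the prefill wait. The helper below is a minimal sketch that works on any iterable yielding tokens as they arrive; how you obtain that stream from your inference engine is left out, and the function name is hypothetical.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, decode_tok_per_s, token_count)."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # prefill ends when the first token arrives
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Decode rate excludes the first token, whose latency is prefill.
    elapsed = last - first
    decode_rate = (count - 1) / elapsed if count > 1 and elapsed > 0 else 0.0
    return ttft, decode_rate, count
```

Run it once with a short prompt and once with a long one: decode rate should stay roughly constant while time to first token grows with prompt length. A machine whose decode rate sags over a long generation is a candidate for the throttling check.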

Browse used Mac Studio 512GB on eBay →

The prefill problem is solvable, but not free

Long-context prefill on Apple Silicon is slow, and software alone cannot remove that cost without different hardware underneath. But there are mitigations:

  • Use speculative decoding if your inference engine supports it. A smaller draft model proposes several tokens and the large model verifies them in a single pass. This speeds up decode rather than prefill, so it shortens total response time without reducing time to first token. Not a silver bullet, but meaningful on the larger models.
  • Batch or pipeline your prefill. Process context when you have a block of time, cache the KV state, then do interactive decode at low latency later. This works for summarization, document Q&A, and research workflows.
  • Run smaller models for latency-sensitive tasks. An 8B model on M3 Ultra has far lower prefill overhead and excellent decode. Keep the 70B work separate.
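The batching mitigation pays off because the prefill cost is paid once per document instead of once per question. The sketch below works through that amortization; it is a simplification that ignores the small prefill for each new question, and the 840-second figure simply reuses the 14-minute wait quoted earlier.

```python
def average_wait_s(prefill_s: float, decode_s_per_answer: float,
                   questions: int, kv_cached: bool) -> float:
    """Mean wall-clock cost per question asked against one long document."""
    if questions < 1:
        raise ValueError("questions must be at least 1")
    # With a cached KV state the document is processed once;
    # without it, every question repeats the full prefill.
    prefill_total = prefill_s if kv_cached else prefill_s * questions
    return prefill_total / questions + decode_s_per_answer


with_cache = average_wait_s(840.0, 20.0, questions=10, kv_cached=True)      # 104.0
without_cache = average_wait_s(840.0, 20.0, questions=10, kv_cached=False)  # 860.0
```

If the one-time prefill runs unattended, the interactive wait is closer to the decode time alone, which is why this pattern suits document Q&A and research workflows.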

For the deeper mechanics of why prompt processing is slow on Apple Silicon, including the arithmetic-density story, see the architectural guide.

Comparison with frontier alternatives: Strix Halo and the GPU moat

As of June 2026, AMD’s Strix Halo (up to 128 GB of unified memory) is a new entrant in the large-unified-memory space. For a detailed Strix Halo vs. Mac Studio comparison, see the homelab analysis. The short version: Strix Halo is cheaper and faster at prefill, but has less unified memory and does not yet have production-grade AI inference stacks. Mac Studio is the proven path; Strix Halo is the option to watch in 2027.

To check whether you can run specific giant models like GLM 5.2 locally, the VRAM calculator and model-specific sizing guides have the exact byte counts.

Bottom line

The M3 Ultra is a masterful chip for one specific constraint: running giant models that do not fit on any single consumer discrete GPU, with excellent decode speed and a willingness to wait during prefill. If that matches your workload, it is the best consumer option available. The 256GB ceiling on new hardware is a real constraint; the 512GB used market exists but is shrinking. The prefill tax is not marketing hyperbole. It is architectural reality and the honest trade-off for unified memory.

For models under 24GB, the price-to-performance is poor and you should start with a used RTX 3090 or new RTX 4090. For models 24–70GB, M3 Ultra is competitive if you can absorb the prefill wait or mitigate it with batching. For models beyond 70GB, it is the pragmatic choice, and once you pass 256GB you will be hunting for a used 512GB machine, not new hardware.

It is a powerful machine in the right context, and an expensive mistake in the wrong one. The decision hinges on the condition “if your models actually need more memory than a discrete GPU offers.” If they do not, close this tab and look at discrete GPUs.

Sources

  • Hardware Corner community benchmark threads — M3 Ultra results with DeepSeek V3 4-bit and R1, June 2026 (community-cited, not independently verified by LocalRig)
  • Apple Mac Studio M3 Ultra specifications: apple.com (800 GB/s bandwidth, 256GB unified memory max, March 2026 config discontinuation)
  • LocalRig first-party benchmark: base Apple M4 16GB — 18.4 tok/s (llama.cpp b9820), Llama 3.1 8B Q4_K_M, 2026-06-27
  • r/LocalLLaMA community feedback — M3 Ultra prefill latency experiences (TTFT vs decode tok/s distinction)