Apple Silicon Inference

How Much Unified Memory Do You Need for Local LLMs? 48GB to 256GB, Mapped to Models

The two costliest Apple Silicon mistakes happen at opposite ends of the spectrum: buying 48GB and never needing more than what fits in 24GB, or buying 64GB when 128GB would barely cost more and saves years of frustration. Between them sits a clearer decision than Apple’s marketing suggests — because capacity and speed are not the same thing, and the unified memory bandwidth matters as much as the size.

Unlike NVIDIA discrete GPUs — where 24GB is the practical ceiling and the choice is whether to buy one, two, or switch platforms — Apple’s unified memory scales from 24GB to 256GB on the same machine. That sounds like pure abundance. In practice, it is a choice set, and choosing wrong at the tier level (48GB vs. 64GB vs. 128GB) locks you in at purchase with no upgrade path. A Mac’s unified memory is soldered to the chip. Once bought, it stays.

This guide maps each memory tier to the models it can actually run — not just fit, but run at speeds that make local inference worthwhile. It calls out the bandwidth constraint at every tier, because the silence around that is where the frustration starts.

The sizing heuristic: ~2 GB per billion parameters, then quarter it for Q4

Hold this formula in your head and the capacity question becomes straightforward:

Model size in GB ≈ (Parameters in billions × 2 GB/1B) × quantization factor

At FP16 (full precision), a 7B model needs ~14 GB, a 13B model ~26 GB, a 70B model ~140 GB. That is the theoretical floor before adding OS overhead, context window, and runtime slack.

Quantization shrinks this sharply:

  • Q4_K_M (the quality/size sweet spot): roughly 0.25× the FP16 size, so a 7B model ≈ ~3.5–4 GB, a 13B ≈ ~6.5–7.5 GB, a 70B ≈ ~35–40 GB.
  • Q8_0 (minimal quality loss): roughly 0.5× the FP16 size, so a 7B model ≈ ~7–8 GB, a 13B ≈ ~13–15 GB, a 70B ≈ ~70–80 GB.

Real-world unified memory consumption adds 2–4 GB for the OS and runtime overhead, plus context window (a 4,096-token context adds ~100–200 MB depending on the model). So a 7B Q4_K_M model on a Mac consumes roughly 5–6 GB total, leaving the rest for context, multiple models, or safety margin.

For the deeper math on why quantization works and which level is worth the quality trade-off, see What Is Quantization. The short version: Q4_K_M is almost always the right default. It is the quantization the community cites in throughput reports, and it is what sits behind all the model-class recommendations below.

The memory tiers: capacity, bandwidth, and the real cost of the jump

Here is the painful truth that Apple does not advertise: upgrading unified memory is expensive, and the price jumps are steep. A 16GB Mac Studio M4 starts at around $1,999–$2,100. Jumping to 64GB costs ~$600–$800 extra. Jumping to 128GB costs ~$1,200–$1,600 extra. The per-GB cost is not linear — your first 48GB costs far more per gigabyte than your next 48GB.

This matters because the mistake is not choosing 128GB when you need 64GB (you will be fine, just slower if the bandwidth is lower). The mistake is buying 48GB when you suspect you will want 70B models, paying $400 extra, and then six months later buying a new Mac because the unified memory is soldered and non-upgradeable.

Here is the tier map, with the models each tier actually runs well:

Memory TierModel Capacity (Q4_K_M)Primary Use CaseM4 Pro BandwidthM4 Max BandwidthPrice Premium (vs. 24GB)
48 GB7B comfortably, 13B tightSingle 7B + context, or 13B Q4 no room to spare~100 GB/s~120 GB/s~$400–$600
64 GB13B comfortably, 32B Q413B + large context, small 32B models at Q4~100 GB/s~120 GB/s~$600–$800
128 GB32B full precision, 70B Q470B Q4_K_M, 32B without quantization, 2–3 smaller models at once~100–120 GB/s~120 GB/s~$1,200–$1,600
256 GB70B full precision, multi-tenantFuture-proofing, multi-user chat servers, multi-model hosting~100–120 GB/s~120 GB/s~$2,200–$2,800

The bandwidth figures are Apple’s spec (M4 Pro/Max unified memory bandwidth). Note that they are roughly 10–12× lower than a discrete RTX 4090’s 1,456 GB/s. That is not a flaw — it is the arithmetic of shared memory. It means that a 70B model will generate tokens on a 128GB Mac, but at lower absolute throughput than the same model on a 24GB RTX 4090.

The picks, by buyer constraint

For chat on a single 7B model: 48 GB

If you know the model you want (Llama 3.1 8B, Mistral 7B, whatever fits under ~6 GB after quantization) and you do not plan to experiment with larger models, 48GB is the minimum that stops constant memory-pressure compromises. You get a 7B Q4 model with room for an 8K context window without spilling to swap. If you want a 13B model, you can fit it at Q4_K_M in 48GB, but you are at the ceiling — another model or a larger context window forces a trade-off.

Bandwidth caveat: Community reports (r/LocalLLaMA, 2025) cite ~30–40 tok/s on 7B Q4 models from M4 Pro base unified memory (100 GB/s). That is usable for chat and coding assistance. It is well below a used RTX 3090 (80–110 tok/s), which costs 1/5 the price. The advantage is the complete system — no PCIe, no discrete card wattage, one machine. The cost is throughput.

Macs with 48GB: M4 Pro 48GB, M3 Max 36GB (lower tier). M3 Max 36GB has higher bandwidth than M4 Pro, so it will feel noticeably faster on the same workload.

Browse M4 Pro 48GB Macs on Amazon →

For 13B models and small 32B runs: 64 GB

64GB is the tier where a 13B Q4_K_M model sits comfortably with room to spare. It is also the smallest tier where a 32B Q4 model (roughly 15–18 GB) loads without tight margin. The jump from 48GB to 64GB is typically ~$200 extra in total system cost; the jump from 64GB to 128GB is typically 2–3× that.

If you are choosing between 48GB and 64GB, ask: “Will I ever load a 13B model, or is 7B truly my ceiling?” If the answer is “yes, maybe 13B,” spend the $200. If the answer is “7B, maybe 7B larger, definitely not 13B,” 48GB works.

Bandwidth caveat: Same as 48GB — you are still on ~100 GB/s unified memory bandwidth (M4 Pro base). A 13B Q4 model will run noticeably slower than on discrete NVIDIA hardware. See the best Mac for local LLM guide for the full throughput breakdown by Mac tier.

Macs with 64GB: M4 Pro 64GB, M4 Max 64GB (base tier).

For 32B models and 70B Q4: 128 GB

This is the tier where “local LLM” stops being a novelty and becomes a serious capability. A 70B model at Q4_K_M quantization (roughly 35–40 GB) loads with room for context and runtime slack. You can also run a 32B model at full FP16 precision (roughly 65 GB), or keep two or three smaller models in memory at once.

128GB is also where the cost-per-gigabyte ratio improves relative to 64GB. The jump is steep in absolute dollars (~$800–$1,200), but you are paying less per GB than the jump to 64GB was.

Bandwidth caveat: Here the Mac tier matters much more. An M4 Max at 128GB has 120 GB/s unified memory bandwidth. An M4 Pro at 128GB has ~100 GB/s. On a 70B Q4 model, community reports (r/LocalLLaMA, 2025, not independently verified) cite roughly 15–25 tok/s from an M4 Max, and 12–18 tok/s from an M4 Pro. That is slow compared to a 24GB RTX 4090 (~120–160 tok/s), but it is honest inference that finishes in the time your brain can process. For agents, retrieval, and batch work, it is more than adequate.

Macs with 128GB: M4 Max 128GB, M3 Max 128GB.

Browse M4 Max 128GB Macs on Amazon →

For future-proofing or multi-model serving: 256 GB

256GB is not required for any single consumer use case today. It is the tier you buy when:

  • You plan to keep the Mac for 5+ years and want no constraint on model size.
  • You are running a small multi-user chat server and want to keep multiple large models in memory.
  • You want the absolute minimal compromise on speed and capacity, and the budget allows it.

At 256GB, almost every model class — 7B, 13B, 32B, 70B, even 405B at Q4 — loads. You are constrained only by bandwidth, not capacity.

Real cost: A Mac Studio M4 Max with 256GB runs ~$7,500–$8,500. A used RTX 3090 24GB costs ~$600–$800. Two of them cost ~$1,200–$1,600. The Mac is a complete system with lower power consumption and no PCIe complexity. The two 3090s are raw local VRAM at 1/6 the price. If you are stretched on budget, skip 256GB. If budget is not the binding constraint, it removes the last question mark from the system.

Macs with 256GB: M4 Max 256GB only (currently).

Who This Is NOT For

This guide optimizes for choosing the right unified memory tier for local LLM inference on Apple Silicon. It is the wrong guide if:

  • You are price-optimizing VRAM per dollar. A used RTX 3090 24GB (~$600–$800) delivers more VRAM-per-dollar than any Mac. Buy discrete GPU hardware instead.
  • You are chasing absolute throughput. An RTX 4090 is faster on any model that fits on both platforms. Macs win on power, form factor, and having no discrete GPU at all — not on raw tokens per second.
  • You have not sized your model yet. Before choosing a Mac tier, read What Is Quantization and run the VRAM calculator with your target model name and quantization. Guess wrong and you waste thousands on memory you do not need (or buy too little and regret it).
  • You are comparing a specific Mac tier to discrete hardware. A Mac is a system choice, not a GPU choice. For the full system comparison, see Mac Mini vs. Mac Studio for Local LLM.

The math you can verify yourself

The quantization heuristic (2 GB per 1B parameters at FP16, 0.25× for Q4_K_M) is conservative. Real weights can be 5–10% smaller depending on the model architecture. Overhead (OS, runtime, context window) is typically 2–4 GB on a Mac running a single model. So:

  • 7B Q4_K_M: 7 × 2 × 0.25 = 3.5 GB model + 3 GB overhead = 6.5 GB total (commonly 5–7 GB observed).
  • 13B Q4_K_M: 13 × 2 × 0.25 = 6.5 GB model + 3 GB overhead = 9.5 GB total (commonly 8–11 GB observed).
  • 70B Q4_K_M: 70 × 2 × 0.25 = 35 GB model + 3 GB overhead = 38 GB total (commonly 35–42 GB observed).

These numbers match community reports and the LocalRig benchmark on base Apple M4 (measured at 18.4 tok/s, llama.cpp b9820, Llama 3.1 8B Q4_K_M, 2026-06-27). Your actual consumption will vary with the runtime (Ollama, llama.cpp, MLX), the model card (Llama, Mistral, Qwen), and your context window size. Always have 5–10 GB of headroom to avoid swap.

Bottom Line

The memory tier matters because unified memory on Apple is not upgradeable, and the price jumps are steep. Choose the smallest tier that handles your target model plus one size up, then add the Mac tier (Pro vs. Max) that gives you acceptable bandwidth for that model class. A 128GB M4 Max can run 70B models, but slower than a 24GB RTX 4090 at 1/6 the cost. A 48GB M4 Pro runs 7B models fast enough for chat, but slower than a used RTX 3090 at 1/8 the cost. Neither is wrong — they are systems choices, and the bandwidth trade-off is real. Know it going in.

Sources

All throughput figures from the r/LocalLLaMA community are not independently verified by LocalRig except the base Apple M4 benchmark (18.4 tok/s llama.cpp, 19.5 tok/s Ollama, Llama 3.1 8B Q4_K_M, 2026-06-27, methodology in the 7B/8B hardware guide). Memory bandwidth and unified memory specifications from apple.com (M4 Pro, M4 Max, M3 Max technical specs, 2025–2026).

Sources

  • Apple M3 Max, M4 Max, M4 Pro unified memory specifications and memory bandwidth (apple.com, 2025–2026)
  • LocalRig first-party benchmark: base Apple M4, 16 GB — llama.cpp b9820 (18.4 tok/s) and Ollama 0.30.11 (19.5 tok/s), Llama 3.1 8B Q4_K_M, 2026-06-27
  • r/LocalLLaMA community reports: M3 Max, M4 Max token throughput on 7B, 13B, 70B models (2025–2026), not independently verified by LocalRig
  • LocalRig quantization guide: ~2 GB per billion parameters at FP16, scaled to Q4_K_M and Q8_0