
Why Prompt Processing Is Slow on Apple Silicon (and What Actually Helps)

The Mac in front of you can generate text at 80+ tokens per second. Your prompt takes 12 minutes to process. Neither of those numbers is a lie; together, they are the most important nuance in Apple Silicon inference: decode and prefill are entirely different operations, and Macs dominate one while struggling with the other.

If you have watched a Mac Studio seem to freeze while “thinking” about a prompt, or seen community threads with titles like “$10K Mac Studio Crawls with DeepSeek 671B — 14-Minute Wait,” this is why. The headline throughput (tok/s) reflects token generation — the fast path. The freeze reflects prompt processing — the bottleneck. Understanding this gap is the difference between a Mac that feels snappy and one that feels unusable for your actual workload.

The two-operation model: prefill and decode

Most discussions of LLM inference speed collapse two distinct problems into one number: tokens per second. That number is almost always measuring decode speed, and it is almost always wrong for your real workflow.

Prefill (prompt processing): The model reads and processes your entire input prompt and builds up the KV cache — the cached key-value pairs that let it generate efficiently. This is a one-time cost per prompt and it involves large matrix multiplications over the full context length. Prefill is compute-bound: it benefits from raw arithmetic throughput, from lower latency, from compute density.

Decode (token generation): Once the KV cache is built, the model generates one token at a time, using that cached KV state. Each token requires re-reading the model weights but not recomputing the input. Decode is bandwidth-bound: it lives and dies by how fast you can pull model weights out of memory.

On NVIDIA GPUs, both operations are fast because the hardware combines high compute density with fast dedicated VRAM. On Apple Silicon, the balance is skewed: the unified memory architecture gives the GPU high-bandwidth access to a large memory pool with no copy over PCIe, but absolute arithmetic throughput is lower. This makes Apple Silicon a decode machine and a prefill compromise.
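
As a rough sketch of the two regimes, the estimate below treats prefill as limited by arithmetic throughput and decode as limited by memory bandwidth. The hardware and model figures in the example are placeholder assumptions for illustration, not measurements, and the rule of about 2 FLOPs per parameter per token is a common approximation that ignores attention cost.

```python
def prefill_seconds(n_params: float, prompt_tokens: int, tflops: float) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * n_params * prompt_tokens
    return flops / (tflops * 1e12)


def decode_tokens_per_second(model_bytes: float, bandwidth_gbs: float) -> float:
    """Bandwidth-bound estimate: every generated token re-reads all weights once."""
    return (bandwidth_gbs * 1e9) / model_bytes


if __name__ == "__main__":
    # Hypothetical machine: 10 TFLOPS sustained, 400 GB/s memory bandwidth.
    # Hypothetical model: 7e9 parameters stored in about 4 GB.
    print(f"prefill, 4K tokens: {prefill_seconds(7e9, 4096, tflops=10.0):.1f} s")
    print(f"decode upper bound: {decode_tokens_per_second(4e9, 400.0):.0f} tok/s")
```

Both functions are upper-bound style estimates: real runtimes land below them, but the ratio between the two regimes is what matters here.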

Why Mac prefill is slow: compute density and the matrix multiply

Apple Silicon chips have roughly 50–60% of the compute throughput of an equivalent-tier NVIDIA GPU, measured in matrix-multiply operations. On decode, this does not matter much — you are reading weights, not doing massive matmuls — so the high bandwidth compensates. On prefill, you are doing matmuls over the full prompt context, and that compute gap becomes a wall.

Take a concrete example: a 13B model at Q4_K_M quantization, with a 4,096-token context.

During prefill, each weight projection is a matrix multiplication of shape (context_length × hidden_dim) × (hidden_dim × hidden_dim): roughly 4,096 × 5,120 × 5,120 multiply-accumulates per projection per layer. With 40 layers in a 13B model, that is roughly 4 trillion operations for a single projection per layer, and a full forward pass runs several such projections plus the MLP in every layer. A common rule of thumb of about 2 FLOPs per parameter per token puts the whole prompt at roughly 10^14 FLOPs. On an M3 Max, that is compute-bound: you are waiting for the arithmetic to finish, not for memory to arrive.

During decode, you are multiplying (1 × hidden_dim) × (hidden_dim × hidden_dim): one token, same hidden dimension. That is about 4,096 times fewer operations per step, and the cost is now dominated by reading the weights. Macs read weights fast and finish large matmuls slowly, so decode is where they win.
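
A quick count makes the gap concrete. The sketch assumes a hidden dimension of 5,120 and 40 layers (typical of a 13B Llama-style model) and counts multiply-accumulates for a single hidden × hidden projection per layer. A real layer has several projections plus an MLP, so the totals are a lower bound.

```python
HIDDEN_DIM = 5120   # assumed hidden size for a 13B Llama-style model
N_LAYERS = 40       # assumed layer count


def projection_macs(n_tokens: int, hidden_dim: int = HIDDEN_DIM) -> int:
    """Multiply-accumulates for (n_tokens x hidden) @ (hidden x hidden)."""
    return n_tokens * hidden_dim * hidden_dim


prefill_macs = projection_macs(4096) * N_LAYERS   # whole prompt in one pass
decode_macs = projection_macs(1) * N_LAYERS       # one newly generated token

print(f"prefill: {prefill_macs:.3e} MACs")
print(f"decode:  {decode_macs:.3e} MACs per token")
print(f"ratio:   {prefill_macs // decode_macs}x")
```

The ratio equals the prompt length: prefill does the work of thousands of decode steps in a single pass, which is why it stresses arithmetic throughput rather than memory bandwidth.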

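Longer contexts make this worse. A 32K context on a 7B model is still manageable on decode (you have bandwidth to spare), but prefill now covers 8× the tokens of a 4K prompt: the weight matmuls grow 8×, and the attention computation, which scales with the square of the context length, grows 64×. A 671B model with a 4K context is so much arithmetic that even fast dispatch cannot hide the latency gap.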
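
The scaling can be sketched with two terms: weight matmuls grow linearly with prompt length, while the attention score computation grows with its square. The function below is a simplified model under that assumption and ignores constant factors.

```python
def prefill_growth(base_tokens: int, new_tokens: int) -> dict:
    """Relative growth in prefill work when the prompt gets longer."""
    ratio = new_tokens / base_tokens
    return {
        "weight_matmuls": ratio,          # linear in prompt length
        "attention_scores": ratio ** 2,   # quadratic in prompt length
    }


growth = prefill_growth(base_tokens=4096, new_tokens=32768)
print(growth)
```

At short contexts the linear term dominates total cost; the quadratic term matters more as the prompt grows.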

Time to first token (TTFT) vs. tokens per second: the hidden metric

This is where the “$10K Mac Studio Crawls” thread becomes clear. The Mac Studio posts 65–80 tok/s on a 7B model. But when you send a 671B model with an 8K context, TTFT is 8+ minutes. That headline tok/s is real; TTFT is the real bottleneck.

TTFT should be reported separately from tok/s. Any honest benchmark splits them:

  • TTFT: time from “hit enter” until the first token appears. Pure prefill cost.
  • tok/s: speed of generating tokens 2, 3, 4, … N. Pure decode cost.
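
A minimal sketch of how a benchmark might split the two numbers, assuming the runtime exposes generated tokens as an iterator. `fake_stream` is a hypothetical stand-in for a real streaming API and exists only to make the example runnable.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tok_per_s, n_tokens) for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        count += 1
        if first is None:
            first = time.perf_counter()   # first token arrived: prefill is done
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = end - first
    # Tokens 2..N are decode; guard against a one-token or instantaneous stream.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, rate, count


def fake_stream(prefill_s: float, n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a runtime's streaming API."""
    time.sleep(prefill_s)                 # simulated prompt processing
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)       # simulated per-token decode
        yield f"tok{i}"


ttft, rate, n = measure_stream(fake_stream(0.2, 11, 0.01))
print(f"TTFT {ttft:.2f} s, decode {rate:.0f} tok/s over {n} tokens")
```

Excluding the first token from the rate keeps prefill time out of the decode figure, which is the separation the two bullets describe.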

A Mac that reports 80 tok/s with a 12-minute TTFT is not 80 tok/s in practice — the user waits 12 minutes to start. If you are doing chat with short contexts (< 2K tokens), that TTFT is a few hundred milliseconds and invisible. If you are doing RAG, document analysis, or batch inference with 8K+ contexts, TTFT is the cost of the entire workflow, and it is not hidden.
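
The effect on perceived speed is simple arithmetic. The sketch below uses the 80 tok/s and 12-minute figures from the text; the 500-token response length and the 0.3-second short-prompt TTFT are assumptions chosen for illustration.

```python
def end_to_end_tok_per_s(ttft_s: float, decode_tok_per_s: float, output_tokens: int) -> float:
    """Tokens per second as the user experiences it, TTFT included."""
    total_s = ttft_s + output_tokens / decode_tok_per_s
    return output_tokens / total_s


long_prompt = end_to_end_tok_per_s(ttft_s=12 * 60, decode_tok_per_s=80, output_tokens=500)
short_prompt = end_to_end_tok_per_s(ttft_s=0.3, decode_tok_per_s=80, output_tokens=500)
print(f"long prompt:  {long_prompt:.2f} tok/s end to end")
print(f"short prompt: {short_prompt:.1f} tok/s end to end")
```

With a short prompt the end-to-end figure stays close to the decode rate; with a 12-minute prefill it drops below one token per second.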

The community discovered this the hard way. LM Studio and Ollama bug trackers logged numerous “Mac is frozen / not responsive” reports that resolved to “yes, prefill on large contexts is slow, by design.” This is not a bug. It is a hardware constraint that the runtimes cannot fix.

What actually helps: prompt caching, KV reuse, and runtime tuning

Knowing the problem, the solutions are targeted:

1. Prompt caching and KV-cache reuse

If you query the same large context repeatedly (common in RAG and document Q&A workflows), reusing the KV cache across requests collapses the prefill cost for the shared prefix to nearly zero on the second and later queries. Ollama and llama.cpp both support this; enable it if your workload allows. The cost is one-time memory overhead; the gain is skipping most of the prefill.
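
The idea can be illustrated with a toy model of prefix-based reuse. This is not how Ollama or llama.cpp implement their caches; `PrefixKVCache` is a hypothetical class that tracks token IDs only, to show how much of a prompt still needs prefill when it shares a prefix with the previous request.

```python
from typing import List, Sequence


class PrefixKVCache:
    """Toy model of prefix-based KV reuse; it tracks token IDs, not tensors."""

    def __init__(self) -> None:
        self._cached: List[int] = []

    def tokens_to_prefill(self, prompt: Sequence[int]) -> int:
        """Return how many prompt tokens still need prefill, then update the cache."""
        shared = 0
        for cached_tok, new_tok in zip(self._cached, prompt):
            if cached_tok != new_tok:
                break
            shared += 1
        self._cached = list(prompt)
        return len(prompt) - shared


cache = PrefixKVCache()
document = list(range(8000))                                 # stand-in for an 8,000-token document
first = cache.tokens_to_prefill(document + [9001, 9002])     # document + question 1
second = cache.tokens_to_prefill(document + [9003, 9004])    # same document, question 2
print(first, second)
```

The practical consequence is prompt layout: put the large, stable part (the document, the system prompt) first and the part that changes last, because everything after the first differing token is recomputed.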

2. Shorter contexts and windowing

The simplest fix: if your model is processing an 8K-token document, extract the relevant 2K-token window and send that instead of the whole document. Many RAG systems do this already. Shorter context = proportionally shorter prefill. At 2K tokens, prefill on a 7B model becomes one-second-scale and nearly invisible.
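
A minimal sketch of windowing, assuming the document is already split into chunks. Relevance is scored by plain word overlap and token counts are approximated by whitespace-separated words; both are simplifying assumptions, and a real RAG pipeline would typically use embeddings and the model's tokenizer.

```python
from typing import List


def select_window(chunks: List[str], query: str, token_budget: int) -> List[str]:
    """Pick the chunks sharing the most words with the query, within a budget."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = len(chunks[i].split())     # approximate token count
        if score(chunks[i]) > 0 and used + cost <= token_budget:
            chosen.append(i)
            used += cost
    return [chunks[i] for i in sorted(chosen)]   # restore document order


chunks = [
    "Installation steps for the runtime",
    "Prefill cost grows with prompt length",
    "Decode speed depends on memory bandwidth",
    "License and contribution guidelines",
]
window = select_window(chunks, "prefill prompt length", token_budget=8)
print(window)
```

Only the selected window is sent to the model, so prefill cost tracks the budget rather than the size of the full document.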

3. MLX and native Apple Silicon runtimes

MLX (Apple’s array framework, built for Apple Silicon) and native-ARM Ollama builds can be marginally faster than generic llama.cpp for prefill because they expose operations more directly to the Metal backend. The gain is usually 10–20% on prefill, not a wholesale fix, but measurable if you are already on Mac. See Ollama and MLX on Apple Silicon for setup and benchmarks.

4. M5 Neural Accelerator (vendor claims, not yet verified)

Apple’s M5 chips (announced for 2025–2026 delivery) include dedicated Neural Accelerators in the GPU cores, which Apple claims deliver a 3.3–4.0× prefill speedup on certain models. These numbers are vendor-provided and have not been independently verified by the community yet. The Neural Accelerator is optimized for specific operations (attention, some matmul patterns); not all model architectures benefit equally. Wait for independent benchmarks before treating this as a hard win. If the claims hold in practice, the M5 could narrow the prefill gap, but it is unlikely to close it against a current-gen NVIDIA RTX 5090.

5. Batch-size awareness and quantization tuning

Lower-bit quantization (Q3 or Q4 instead of Q5 or Q8) shrinks the memory footprint and the number of bytes read per weight. It does not reduce the number of multiply-accumulates, so the speedup shows up mostly in decode; for prefill, the larger lever is a model with fewer parameters. The trade-off is quality. If you can tolerate Q4_K_M or lower and your workload is latency-sensitive, this is the easiest lever to try. See What Is Quantization for the quality–size trade-off.
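
The memory side of this trade-off is easy to estimate. The bits-per-weight values below are rough, assumed averages for common GGUF quantization types; exact figures vary by model and llama.cpp version.

```python
# Approximate bits per weight (assumed averages, not exact values).
BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}


def weight_gigabytes(n_params: float, quant: str) -> float:
    """Estimated size of the quantized weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"13B at {quant}: ~{weight_gigabytes(13e9, quant):.1f} GB")
```

The estimate covers weights only; the KV cache and runtime buffers add to it and grow with context length.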

Comparison: Mac prefill vs. NVIDIA GPU prefill

Here is the hard truth in a table. All figures are community-cited or from first-party sources (noted where they appear). Prefill time is per prompt, not per token — a longer prompt takes longer, proportionally.

| Operation | Apple M3 Max 128GB | Apple M4 16GB | RTX 4090 24GB | Notes |
|---|---|---|---|---|
| 7B Q4_K_M, 4K context: TTFT | ~800ms–1.2s | ~1.5–2.2s | ~150–250ms | M4 measured; M3 extrapolated from bandwidth; NVIDIA community-cited |
| 13B Q4_K_M, 8K context: TTFT | ~3–5s | ~6–10s | ~400–600ms | Prefill scales with context; Mac disadvantage widens |
| 70B Q4_K_M, 4K context: TTFT | ~45–90s | N/A (does not fit) | ~2–3s | Mac can fit; NVIDIA is 20–30× faster |
| Tok/s (generation), 7B–8B Q4_K_M | ~50–65 | ~18–20 | ~120–160 | Decode: Macs are competitive if the model fits |
| Power draw | ~35–45W | ~15–20W | ~450W | Macs win on efficiency; NVIDIA wins on speed |
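
The ratios implied by the table can be checked directly. The sketch below copies the M3 Max and RTX 4090 TTFT ranges from the rows above and compares range midpoints; using midpoints is a simplification, since the underlying figures are themselves approximate.

```python
# TTFT ranges in seconds, copied from the table: (low, high).
TTFT = {
    "7B Q4_K_M, 4K": {"M3 Max": (0.8, 1.2), "RTX 4090": (0.15, 0.25)},
    "13B Q4_K_M, 8K": {"M3 Max": (3.0, 5.0), "RTX 4090": (0.4, 0.6)},
    "70B Q4_K_M, 4K": {"M3 Max": (45.0, 90.0), "RTX 4090": (2.0, 3.0)},
}


def midpoint(bounds):
    low, high = bounds
    return (low + high) / 2


def slowdown(row: str) -> float:
    """How many times longer the M3 Max midpoint is than the RTX 4090 midpoint."""
    return midpoint(TTFT[row]["M3 Max"]) / midpoint(TTFT[row]["RTX 4090"])


for row in TTFT:
    print(f"{row}: M3 Max TTFT is ~{slowdown(row):.0f}x the RTX 4090 figure")
```

The gap widens with model size and context length, which matches the notes column.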

The M4 baseline (LocalRig first-party, 2026-06-27) was 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11) on Llama 3.1 8B Q4_K_M. That is the real-world floor for base M4 chips; M3 Max and higher perform better.

When to stay on Mac, when to switch

Keep your Mac if:

  • Your workflow is interactive chat with contexts under 4K tokens. TTFT is a second or two at most on 7B-class models, and tok/s is the measure that matters.
  • You value portability, low power (15–50W vs. 450W), and quiet operation over raw speed.
  • Your models are in the 7B–13B range at Q4_K_M or lower.
  • You are doing inference on a Mac Studio or M3/M4 Max with 64GB+ unified memory and can live with 2–5 minute TTFTs on large models.

Switch to NVIDIA (or cloud GPU) if:

  • Your workload is prefill-heavy: RAG on 8K+ document chunks, batch inference, prompt-heavy multi-turn chat with long context windows, or analysis of large texts.
  • You are running models 30B+ in size and expect practical TTFT (under 10 seconds).
  • You are evaluating multiple prompts in parallel or serving multiple users; NVIDIA’s higher throughput pays dividends in aggregate.
  • You need consistent performance: cloud GPUs (RunPod, Vast.ai, DigitalOcean, Vultr) offer fixed hardware; Mac thermal throttling and background system load add variance.
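
The two lists can be condensed into a small helper. This is a sketch of the rules of thumb above, not a benchmark-derived model; the function name and parameters are hypothetical, and the thresholds are the ones quoted in the bullets.

```python
def suggest_platform(context_tokens: int, model_billions: float,
                     max_ttft_s: float, concurrent_users: int = 1) -> str:
    """Return "mac" or "nvidia" using the thresholds quoted in the text."""
    if concurrent_users > 1:
        return "nvidia"    # parallel prompts or multiple users: aggregate throughput
    if context_tokens >= 8000:
        return "nvidia"    # prefill-heavy workload (8K+ contexts)
    if model_billions >= 30 and max_ttft_s <= 10:
        return "nvidia"    # 30B+ model with a TTFT target under 10 seconds
    return "mac"           # short-context chat, or long TTFT is acceptable


print(suggest_platform(context_tokens=2000, model_billions=8, max_ttft_s=5))
print(suggest_platform(context_tokens=16000, model_billions=8, max_ttft_s=5))
print(suggest_platform(context_tokens=2000, model_billions=70, max_ttft_s=300))
```

Portability, noise, and power draw are deliberately left out; they are preferences rather than thresholds, and they can outweigh the result.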

For the full decision frame and the trade-offs between local hardware tiers, see The Local-AI Hardware Buying Framework and Mac Studio vs. RTX 5090 for Local AI.

If you are set on NVIDIA and exploring options, RTX 5090 for Local LLM covers the current consumer tier, and Best GPU for Local LLM sizes the full landscape including used alternatives.

Bottom line

Apple Silicon is not slow at LLM inference. It is slow at prompt processing on long contexts, and incredibly fast at token generation. The headline tok/s number is real; the 10-minute wait on your prompt is equally real. Both come from the same hardware: high bandwidth (great for decode), lower compute density (bad for prefill).

If you know this going in, you can choose the machine that fits your actual workflow, not the throughput figure that sounds fastest. For many people, especially those doing interactive chat with shorter contexts, a Mac is the right call. For anyone doing RAG, document work, or large models, an NVIDIA GPU is the more honest choice. The difference is not marketing; it is math.



Sources

  • LocalRig first-party benchmark: base Apple M4, 16 GB — llama.cpp b9820 (18.4 tok/s) and Ollama 0.30.11 (19.5 tok/s), Llama 3.1 8B Q4_K_M, 2026-06-27
  • '$10K Mac Studio Crawls with DeepSeek 671B — 14-Minute Wait' contrarian thread — r/LocalLLaMA / MacRumors community forums (2025)
  • LM Studio GitHub issue tracker — slow prompt evaluation on Apple Silicon (2024–2025)
  • Apple ML Research notes on M5 Neural Accelerator prefill performance (vendor claims, 3.3–4.0x prefill speedup)
  • llama.cpp GitHub discussions on prefill vs. decode performance (token-processing modes, matmul vs. rope bottlenecks)