Runtimes

How to Run LLMs Locally: Which Inference Engine for Your Rig (2026)
A decision guide that picks the right local inference engine based on your hardware, not hype: llama.cpp for CPU and portability, MLX on Apple Silicon, and vLLM for CUDA serving. Also explains why we don't recommend Ollama.
How to Run llama.cpp on an RTX 3090 (CUDA, Step by Step)
A step-by-step guide to building llama.cpp with CUDA and running a GGUF model on an RTX 3090. Covers driver and toolkit prerequisites, the CUDA build, full GPU offload with -ngl, a throughput check, and an OpenAI-compatible server.