Software & Runtimes

How to Run llama.cpp on an RTX 3090 (CUDA, Step by Step)

This guide is for someone who already owns (or just bought used) an RTX 3090 and wants llama.cpp running with CUDA — talking to the GPU, not crawling along on the CPU. By the end you will have a CUDA-accelerated llama.cpp build, a GGUF model downloaded, a working chat command, a throughput number from your own card, and an optional OpenAI-compatible server you can point app code at.

The RTX 3090 is the value pick for local inference for one reason: 24 GB of GDDR6X at 936 GB/s of memory bandwidth, available used for a few hundred dollars. That capacity comfortably holds an 8B model at Q8 with room for a large context window, and it fits a 13B model at Q4 with headroom. For the full price-and-throughput breakdown across cards, see the 7B/8B hardware guide and the best GPU for local LLM comparison. This page is the practical setup that turns that card into a running model.

If you are still deciding which engine to run at all, start with the pillar — how to run LLMs locally — which maps hardware to engine. For a single consumer NVIDIA card, llama.cpp is the portable default, and that is what this guide installs.

Prerequisites

This walkthrough assumes Linux or Windows with WSL2 (Ubuntu). The same llama.cpp build steps work on native Windows with the Visual Studio toolchain, but the commands below are written for a Unix-style shell.

Before you start, have these in place:

  • An NVIDIA driver new enough for your CUDA toolkit. Check it with nvidia-smi — if that command prints your RTX 3090 and a driver version, the driver is loaded. If it errors, install or repair the driver first; nothing else will work until nvidia-smi sees the card.
  • The CUDA Toolkit (the compiler nvcc, not just the runtime). Confirm with nvcc --version. On WSL2, install the WSL-specific CUDA toolkit from NVIDIA — do not install a Linux display driver inside WSL; the GPU is passed through from the Windows host driver.
  • Build tooling: git, cmake (3.18 or newer), and a C/C++ compiler (build-essential on Ubuntu).
  • A GGUF model file. GGUF is llama.cpp’s native weight format. We use Llama 3.1 8B Instruct at Q4_K_M below; the quantization glossary explains what Q4_K_M, Q8_0, and the rest actually cost you in memory and quality.
  • Disk space: roughly 5 GB for an 8B Q4_K_M file, plus a couple of GB for the build.

A quick sizing note before you download: the 3090’s 24 GB fits an 8B model at Q8_0 (about 8 GB of weights) very comfortably, and a 13B model at Q4_K_M (about 8 GB) with room left for context. You do not need to start at the smallest quant — but Q4_K_M is the standard, well-tested default, so this guide uses it.

Step 1 — Clone and build llama.cpp with CUDA

Clone the repository and build it with the CUDA backend enabled:

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The flag that matters is -DGGML_CUDA=ON. Without it, CMake produces a CPU-only build and your 3090 sits idle no matter what other flags you pass at runtime. The -j on the final line builds in parallel across your CPU cores; the CUDA compile is the slow part and can take several minutes.

When the build finishes, the binaries land in build/bin/llama-cli, llama-bench, and llama-server are the three you will use here. If the build fails, the cause is almost always a CUDA toolkit that CMake could not find (check nvcc --version) or a toolkit/driver version mismatch (see Troubleshooting).

Step 2 — Download a GGUF model

Download a GGUF file from Hugging Face. Llama 3.1 8B Instruct at Q4_K_M is a good first model — small enough to load instantly on a 3090, good enough to be genuinely useful. You can download it through the Hugging Face web UI and drop the file in a models/ folder, or pull it from the command line with the Hugging Face CLI:

pip3 install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir models

Exact repository names on Hugging Face change over time; search for “Llama 3.1 8B Instruct GGUF” and pick a well-downloaded quant repository. What you want is a single .gguf file ending in Q4_K_M.gguf. Note the path you saved it to — the commands below assume models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf.

Step 3 — Run it on the GPU

Run a prompt through llama-cli, offloading all layers to the GPU:

./build/bin/llama-cli \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -p "Explain what a KV cache is in two sentences."

The critical flag is -ngl 99 (number of GPU layers). It tells llama.cpp to offload up to 99 layers to the GPU — more than any 8B model has, which is the idiomatic way to say “put the whole model on the card.” An 8B model at Q4_K_M is about 5 GB, so all of it fits in the 3090’s 24 GB with enormous headroom.

Watch the startup log. You want to see lines reporting layers assigned to the CUDA device and a VRAM allocation for the model. If the log says layers are on the CPU, or if nvidia-smi shows no memory used by the process while it runs, the model is not actually on the GPU — jump to Troubleshooting.

Step 4 — Check throughput

To get a real tokens-per-second number from your own card, use llama-bench rather than eyeballing the chat:

./build/bin/llama-bench \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99

llama-bench reports prompt-processing (prefill) and token-generation (decode) throughput separately, which is the honest way to measure — the two phases stress different parts of the card. For a sense of where you should land: community benchmark threads put an RTX 3090 running Llama 3.1 8B Q4_K_M via llama.cpp (CUDA) at roughly 80–110 tok/s of generation throughput. That figure is community-cited (r/LocalLLaMA / llama.cpp benchmark threads, 2024–2025), not independently verified by LocalRig — treat it as a planning range, not a guarantee. Your own number will move with your CUDA version, driver, PCIe configuration, and thermals, which is exactly why running llama-bench on your card beats trusting anyone’s published figure.

Step 5 — (Optional) Run an OpenAI-compatible server

If you want to call the model from application code instead of the terminal, start the server:

./build/bin/llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99

llama-server exposes an OpenAI-compatible HTTP API (and a small built-in web UI) on port 8080 by default. Most code written against the OpenAI Python or JavaScript SDK works unchanged — point the client’s base URL at your server’s address and port. If you want to reach it from other machines on your network, bind it to 0.0.0.0 with --host 0.0.0.0; if it is only for this machine, the default bind is fine. Keep -ngl 99 here too — the server obeys the same offload flag as the CLI.

Troubleshooting

The model runs on the CPU (slow, GPU idle). Two usual causes. First, you forgot -ngl 99 (or passed -ngl 0) at runtime — add it. Second, the binary is a CPU-only build because -DGGML_CUDA=ON was missing or CUDA was not found at configure time. Confirm by re-running the CMake configure step and watching for a line that says CUDA was found; if it was not, fix your toolkit and rebuild from a clean build/ directory.

Out of memory (CUDA error during load or generation). On a 3090 with an 8B model this is rare, but it happens with larger models or very long contexts. Lower -ngl so some layers stay on the CPU (e.g. -ngl 30), shrink the context with -c (for example -c 4096), or step down to a smaller quant — Q4_K_M instead of Q8_0. The quantization glossary has the memory math for each level.

Driver / toolkit mismatch at build or runtime. If the build complains it cannot find CUDA, or you hit a runtime error about an unsupported driver, your installed driver is older than the CUDA toolkit you built against. Either update the NVIDIA driver to one that supports your toolkit version, or install a toolkit version your current driver supports. On WSL2 specifically, the driver lives on the Windows host — update it there, not inside the Linux environment.

Why llama.cpp here and not Ollama

Ollama is a wrapper around llama.cpp — it embeds the same kernels, so it is not a faster engine, just a more opaque one. Running llama.cpp directly gives you pinned builds you can reproduce and explicit control over the flags (-ngl, -c, quant choice) that Ollama hides behind its defaults. The full rationale, including first-party measurements showing the two land within about a token per second of each other, is on the pillar: how to run LLMs locally.

Who This Is NOT For

  • People who want zero setup. If building from source and managing CUDA toolkit versions sounds like more than you want to deal with, a convenience wrapper will feel easier on day one. This guide trades that for control and reproducibility.
  • People serving many concurrent users. llama.cpp is excellent for a single rig and a few requests, but it is not a production fleet server. If you need continuous batching, real scheduling, and multi-user throughput, look at vLLM — see the engine map on the pillar guide.
  • People who want to run models far bigger than 24 GB. A single 3090 holds 8B comfortably and 13B at Q4. A 70B model does not fit on one card at usable quants; you would need more VRAM (and a different architecture conversation) than this guide covers.
  • People on hardware that is not NVIDIA. This is the CUDA path. On Apple Silicon use the Metal build (or MLX); on AMD use the HIP/Vulkan backend. The build flags differ.

Sources

  • llama.cpp project documentation and CUDA build instructions: github.com/ggml-org/llama.cpp (accessed 2026-06-28).
  • Hugging Face GGUF model hosting — Llama 3.1 8B Instruct GGUF (accessed 2026-06-28).
  • NVIDIA CUDA Toolkit and driver documentation: developer.nvidia.com (accessed 2026-06-28).
  • RTX 3090 throughput, Llama 3.1 8B Q4_K_M via llama.cpp (CUDA): ~80–110 tok/s — community-cited (r/LocalLLaMA / llama.cpp benchmark threads, 2024–2025), not independently verified by LocalRig. See the 7B/8B hardware guide for the aggregated community data and methodology.

Getting the card: the RTX 3090 is sold used only — it was discontinued, so there is no manufacturer warranty on used units. Check seller feedback and ask for photos of the heatsink and ports. Browse used RTX 3090 24 GB listings on eBay (sorted by price) or check RTX 3090 24 GB listings on Amazon for refurbished and third-party stock.

Sources

  • llama.cpp project documentation and CUDA build instructions: github.com/ggml-org/llama.cpp (accessed 2026-06-28)
  • Hugging Face GGUF model hosting (Llama 3.1 8B Instruct GGUF): huggingface.co (accessed 2026-06-28)
  • NVIDIA CUDA Toolkit and driver documentation: developer.nvidia.com (accessed 2026-06-28)
  • r/LocalLLaMA and llama.cpp benchmark threads, RTX 3090 throughput (2024–2025) — community-cited
  • LocalRig 7B/8B hardware guide (community-cited RTX 3090 figures): /can-i-run/hardware-to-run-a-7b-model-locally/