TL;DR — The honest 2026 picks
Want to test what you already own first? Run the 9bench live LLM test — real tokens/s in your browser, no install.

Running an LLM locally is the most practical AI hardware decision of 2026. ChatGPT-class quality now fits on a consumer GPU (quantised builds of Llama 3.3 70B, Qwen 2.5 72B, Mistral Large), and inference is fast enough that running locally is no longer a "tinkerer's project" — it's a normal way to use AI without paying $20/month per seat or sending your code to OpenAI.

But which GPU? Marketing pages all claim "AI-ready". Subreddits give you 200 contradictory opinions. This article gives you calibrated tokens/second numbers for the GPUs people actually buy — sourced from real llama.cpp / Ollama benchmarks plus the 9bench reference database.

📐 How these numbers were measured
All tokens/second figures below are for Llama 7B Q4_K_M (4-bit quantised), short prompt (~30 tokens in), 200 tokens out, no batching, default sampler. Sources: public llama.cpp benchmarks (GitHub issue #4167+), TechPowerUp GPU compute tests, Tom's Hardware AI workload tests, and the 9bench GPU-class lookup table (50+ calibrated entries). Numbers are medians, not peaks — your PSU, thermals, and CPU can swing this ±15%.
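If you want to reproduce the methodology on your own machine, here is a minimal sketch against Ollama's local REST API. It assumes Ollama is already running on the default port 11434 and that a 4-bit 7B model has been pulled (the model tag below is only an example); it reads the eval_count and eval_duration fields Ollama reports for the generation phase, which exclude prompt processing.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama2:7b-chat-q4_K_M"  # example tag; substitute whichever 4-bit model you have pulled

def measure_tokens_per_second(prompt: str, n_tokens: int = 200) -> float:
    """Request one non-streamed completion from Ollama and compute generation t/s."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,          # short (~30-token) prompt to match the methodology above
            "stream": False,
            "options": {"num_predict": n_tokens},
        },
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated, eval_duration = nanoseconds spent generating them
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    tps = measure_tokens_per_second("Explain in one short paragraph why memory bandwidth matters for LLM inference.")
    print(f"{tps:.1f} tokens/s (generation only)")
```

Run it a few times and take the median to smooth out warm-up effects.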

Tier 1: Beast (90+ tokens/sec, $1500+)

These are the GPUs you buy if you're running 70B models, fine-tuning, or hosting an LLM as a personal API for multiple users. Overkill for 7B chat. Necessary for serious work.

| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 130-200 | 70B Q4 (tight) | $1999-2499 |
| RTX 4090 | 24 GB | 100-160 | 30B Q4 / 70B Q3 | $1700-2000 used |
| Apple M3 Ultra | up to 192 GB unified | 80-130 | 120B+ (via unified memory) | $4000+ (Mac Studio) |
| Apple M2 Ultra | up to 128 GB unified | 70-110 | 70B Q4 comfortably | $3500+ used |
| RTX 5080 | 16 GB | 90-130 | 13B Q4 | $999-1199 |
| RX 7900 XTX | 24 GB | 80-130 | 30B Q4 | $899-1099 |

Verdict — Tier 1

The RTX 4090 is still the best LLM GPU in 2026. The RTX 5090 is faster, but you pay 25-50% more for roughly a 30% gain. The 4090's 24 GB of VRAM lets you run 30B Q4 comfortably and 70B Q3 in a pinch, and you get CUDA, Tensor Cores, and an ML ecosystem that has spent years optimising for this card. If you can find a used 4090 around $1700, that's the price/performance sweet spot at the top.

Apple M3 Ultra is the wildcard. Unified memory means up to 192 GB acts as "VRAM" — no consumer NVIDIA GPU comes close. You can run a quantised 120B model that would need two H100s on a PC. Token generation is competitive (~80-130 t/s on 7B). The catch: prompt processing is 3-5× slower than on NVIDIA, and CUDA-only tools (some fine-tuning frameworks, some research code) simply won't work. For large-model inference with minimal fuss, the M-series is unmatched.

Tier 2: Workstation (50-90 t/s, $700-1500)

The pragmatic tier. You can run 7B models faster than you can read, 13B with comfort, and 30B if you're patient. This is where most serious solo developers and AI hobbyists live in 2026.

| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 4080 Super | 16 GB | 75-110 | 13B Q4 / 30B Q3 | $899-1099 |
| RTX 5070 Ti | 16 GB | 70-100 | 13B Q4 | $749-849 |
| RTX 3090 (used) | 24 GB | 60-100 | 30B Q4 ⭐ | $700-900 used |
| Apple M3 Max | up to 128 GB unified | 50-80 | 30B-70B (with 64-128 GB) | MacBook Pro $3500+ |
| RTX 4070 Super | 12 GB | 50-75 | 13B Q4 | $599-699 |
| RX 7900 XT | 20 GB | 65-105 | 13B Q4 / 30B Q3 | $649-799 |

The used RTX 3090 deal that nobody tells you about

If you only care about LLM inference, the used RTX 3090 ($700-900) is the best-value GPU in 2026. Here's why: it has 24 GB of VRAM, the same as the legendary RTX 4090 and only 8 GB short of the brand-new $2000 RTX 5090. That 24 GB lets you run 30B Q4 models, something no new NVIDIA card under $1500 can do.

Yes, the 4070 Super is two generations newer, with comparable raw FP32 compute. But LLM inference is memory-bandwidth-bound, not compute-bound. The 3090's 936 GB/s of memory bandwidth is in the same league as the 4090 (1008 GB/s) and crushes the 4070 Super (504 GB/s). Translation: a 3090 is faster than a 4070 Super on Llama 7B because tokens/s scales with bandwidth, not TFLOPS.

Buy from r/hardwareswap or eBay, and insist on photos of the card powered on and running. Be wary of ex-mining cards: worn thermal pads and cracked solder from constant thermal cycling are real risks. A clean 3090 will outlive most consumer GPUs.

⚠️ Apple unified memory caveat
On Apple Silicon, "VRAM" comes out of system RAM via the unified memory architecture. By default macOS caps GPU-addressable memory at roughly 75% of total RAM. To run 70B models you need to raise that limit manually with sudo sysctl iogpu.wired_limit_mb=<value>. Most LLM tools (LM Studio, Ollama, MLX) document this. It's not "plug and play 192 GB of VRAM" — it's "plug and play once you know the trick".
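If you want a starting value to pass, here is a tiny sketch (an illustration, not an Apple-documented recipe). It reads total RAM via the standard hw.memsize sysctl and prints a candidate command that lets the GPU wire an assumed ~90% of RAM, leaving the rest for macOS; adjust the fraction to taste, and note the setting resets on reboot.

```python
import subprocess

def wired_limit_command(gpu_fraction: float = 0.90) -> str:
    """Build a sysctl command that raises the GPU wired-memory limit on Apple Silicon.

    gpu_fraction is an assumed share of RAM to hand to the GPU, not an
    Apple-documented value: leave enough for macOS and whatever else you keep open.
    """
    # Total physical RAM in bytes, from the standard macOS sysctl key.
    total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
    limit_mb = int(total_bytes * gpu_fraction / (1024 * 1024))
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"

if __name__ == "__main__":
    print(wired_limit_command())  # e.g. ~117964 on a 128 GB machine at 90%
```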

Tier 3: Mainstream (25-50 t/s, $300-700)

The "I want to try local LLM but don't want to spend a kidney" tier. Every GPU here runs Llama 7B Q4 faster than human reading speed. None of them comfortably run 13B+, but for a personal coding/writing/chat assistant, 7B is enough.

| GPU | VRAM | Llama 7B Q4 (t/s) | 13B feasible? | Price (2026) |
|---|---|---|---|---|
| RTX 4070 | 12 GB | 45-65 | Yes (Q4) | $549-649 |
| RTX 3080 (used) | 10 GB | 50-75 | Tight (Q3 only) | $400-550 used |
| RTX 4060 Ti | 8 GB / 16 GB | 35-55 | Only the 16 GB version | $399-499 |
| RX 7800 XT | 16 GB | 50-80 | Yes (Q4) | $499-599 |
| RX 6800 (used) | 16 GB | 28-45 | Yes (Q4) | $300-400 used |
| Apple M3 Pro | 18-36 GB unified | 35-60 | Yes (with 36 GB) | MacBook Pro $2000+ |
| RTX 4060 | 8 GB | 30-50 | No (VRAM limit) | $299-349 |

The 8 GB VRAM trap

Multiple GPUs in this tier ship with 8 GB VRAM. For Llama 7B Q4, that's just enough — the weights are ~4.5 GB, leaving ~3 GB for KV cache and context. You will not run 13B on 8 GB, no matter what marketing pages claim. Q3 quantisation can push 13B to ~6 GB, but quality drops noticeably.
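To see why 8 GB is "just enough", here is a back-of-envelope VRAM estimate. The layer and head dimensions below are Llama-2-7B-style assumptions and the bits-per-weight figure for Q4_K_M is approximate, so treat the output as a sanity check rather than an exact number.

```python
def estimate_vram_gb(
    n_params_b: float = 6.7,        # parameters in billions (Llama-2-7B-class)
    bits_per_weight: float = 4.8,   # Q4_K_M averages roughly 4.5-5 bits/weight (approximate)
    n_layers: int = 32,
    n_kv_heads: int = 32,           # Llama-2 7B uses full multi-head attention
    head_dim: int = 128,
    context_tokens: int = 4096,
    kv_bytes: int = 2,              # fp16 KV cache
) -> float:
    """Rough single-GPU footprint: quantised weights plus KV cache."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # K and V per layer, per token: 2 * n_kv_heads * head_dim * kv_bytes
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_tokens / 1e9
    return weights_gb + kv_cache_gb

print(f"7B Q4 at 4k context: ~{estimate_vram_gb():.1f} GB")   # ≈ 6.2 GB, fits in 8 GB
print(f"13B Q4 at 4k context: ~{estimate_vram_gb(13, n_layers=40, n_kv_heads=40):.1f} GB")  # ≈ 11.2 GB, does not
```

Runtime overhead, compute buffers, and your desktop's own VRAM use eat the remaining margin, which is why 8 GB cards feel tight even on 7B.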

If you're buying new in this price range, the RX 7800 XT (16 GB) is the LLM pick over RTX 4060/4060 Ti 8 GB. AMD's ROCm support on Linux is now decent (2026 is the year ROCm finally works on consumer cards). On Windows + Vulkan via llama.cpp it just works.

Tier 4: Budget / "what I already own" (8-25 t/s, under $300)

You don't need to upgrade to try local LLM. If you have one of these GPUs, you can run Llama 7B Q4 at usable speeds today.

| GPU | VRAM | Llama 7B Q4 (t/s) | Verdict |
|---|---|---|---|
| GTX 1080 Ti | 11 GB | 22-36 | Still solid in 2026 — 11 GB is enough for 7B |
| RTX 2060 / 2070 | 6-8 GB | 15-30 | Works for 7B Q4, tight on context |
| RTX 3060 12 GB | 12 GB | 25-40 | Underrated for LLM — 12 GB at ~$200 used |
| Apple M1 / M2 | 8-16 GB unified | 14-35 | Fine for 7B if you have 16 GB+ |
| Radeon 780M (APU) | shared | 8-18 | Surprisingly usable on Ryzen 8845HS-class laptops |
| Iris Xe / UHD | shared | 2-12 | Slow but works — readable in 30-60 s |

Even an Intel Iris Xe iGPU manages ~8 t/s on Llama 7B Q4. That's faster than human reading speed. The point: local LLM is no longer "needs an RTX 4090". Every laptop sold since 2022 can run it at usable speeds.

Should you buy AMD for LLM in 2026?

Short answer: maybe, if you're on Linux. Long answer is more nuanced.

The case for AMD: RX 7900 XTX (24 GB, ~$900) gives you 4090-tier VRAM at half the price. On llama.cpp via Vulkan or ROCm 6.0+, it generates ~80-130 t/s on Llama 7B Q4 — slower than a 4090 (100-160 t/s) but not dramatically. Per-dollar, AMD is the better deal.

The case against AMD: ecosystem fragility. CUDA-only frameworks (some fine-tuning libraries, some research code, some quantisation tools) don't have ROCm equivalents. ROCm on Windows is still flaky. ROCm on Linux works but you're often a generation behind (RDNA 4 / RX 9000 cards may launch with broken ROCm support and take months to stabilise). NVIDIA is "boot Ubuntu, install Ollama, it works". AMD is "research compatibility for your model and tools first".

If you only run inference and you're on Linux: AMD is fine and saves money. If you're on Windows, fine-tune models, or follow research papers that ship CUDA code: pay the NVIDIA tax.

Browser vs native — the reality check

Quick PSA: when you see tokens/second numbers on benchmark sites, check whether they're native or browser. Browser-based LLM runtimes (transformers.js, WebLLM) go through WebAssembly or WebGPU and are typically 5-10× slower than llama.cpp / Ollama on the same hardware. We covered this in detail in Browser vs Native LLM Performance.

9bench's live LLM test in your browser will report ~5-15 tokens/s on a 4090. That doesn't mean a 4090 is slow — it means WebAssembly + WebGPU has real overhead. Native llama.cpp on the same 4090 generates 100-160 t/s. The browser number is useful for relative ranking between machines, not for absolute claims.

💡 Test your GPU before you buy a new one
Run 9bench.com, scroll to the AI Capabilities section, and click Run live AI test. It'll download a small Phi-3 model (~1.7 GB, cached after first use) and measure real tokens/second on your existing GPU. Even though browser numbers are lower than native, the relative ranking is honest. If your current GPU lands in Tier 3+, you probably don't need to upgrade for LLM work.

Memory bandwidth, not TFLOPS, is the LLM bottleneck

Most "best GPU for AI" articles list raw TFLOPS as if it's the deciding factor. For LLM inference, it isn't.

Token generation is a sequential process: predict next token, append, predict next, append. Each prediction requires reading the entire model weights from VRAM. The bottleneck is how fast you can stream those weights through the compute units — i.e. memory bandwidth.

Quick sanity check on the principle: an RTX 4070 Super has ~35 TFLOPS FP32 and 504 GB/s of memory bandwidth. An RTX 3090 has ~36 TFLOPS FP32 and 936 GB/s. Nearly identical compute, almost 2× the bandwidth. On Llama 7B Q4, the 3090 generates noticeably more tokens/second (60-100 vs 50-75 in the Tier 2 table). The TFLOPS number is misleading; bandwidth is the real story.
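A quick way to internalise this: time per token is bounded below by the bytes of weights streamed per token divided by memory bandwidth. The sketch below applies that to three cards from the tables. The ~4 GB Q4 weight figure and the efficiency factor are assumptions, so read the outputs as ceilings rather than predictions.

```python
def tokens_per_second_ceiling(bandwidth_gb_s: float, weights_gb: float = 4.0,
                              efficiency: float = 0.6) -> float:
    """Bandwidth-limited ceiling on generation speed.

    Each generated token requires streaming (roughly) all quantised weights from
    VRAM, so t/s cannot exceed bandwidth / weight size. `efficiency` is an assumed
    fudge factor covering KV-cache reads, kernel overhead, and imperfect reuse.
    """
    return efficiency * bandwidth_gb_s / weights_gb

for name, bw in [("RTX 4070 Super", 504), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: ~{tokens_per_second_ceiling(bw):.0f} t/s ceiling on a 7B Q4 model")
```

The ordering (and rough magnitude) lines up with the measured ranges in the tier tables, which is exactly what a bandwidth-bound workload predicts.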

Practical takeaway: when comparing GPUs for LLM, sort by VRAM bandwidth, not TFLOPS. AMD GPUs and Apple Silicon both look better under this metric than raw compute would suggest.

What about training / fine-tuning?

This article focuses on inference (running models), not training (creating or fine-tuning them). If fine-tuning is on your roadmap:

NVIDIA + CUDA is the safe path. AMD ROCm fine-tuning works, but expect to debug frequently. Apple MLX is great, but research bleeding-edge code rarely targets it first.

The "don't buy" list

GPUs that look attractive on paper but disappoint for LLM:

Decision tree: which GPU should you actually buy?

  1. Already have a GPU? Test it first. The 9bench live LLM test tells you actual tokens/s in 60 seconds. If you're getting 5+ t/s in the browser, you'll see roughly 25-50+ t/s native. That's already usable.
  2. Want to run 70B+ models? Two paths: (a) Apple M3 Ultra Mac Studio with 96+ GB unified memory, or (b) two used RTX 3090s + a motherboard with two PCIe 4.0 x8 slots. The Mac is more elegant. The dual-3090 is faster on smaller models.
  3. Want fastest 7B/13B and willing to spend? RTX 4090 (used ~$1700) or RTX 5090 (~$2000-2500) if you want PCIe 5.0 and 32 GB. The 4090 is still a great purchase in 2026.
  4. Best price/perf for serious work? Used RTX 3090 ($700-900). 24 GB VRAM at this price is genuinely unique. Every solo developer doing LLM work should consider this.
  5. Mid-range new GPU? RTX 4070 Super 12 GB ($600) for NVIDIA-ecosystem comfort, RX 7800 XT 16 GB ($499) if you run Linux and want more VRAM per dollar.
  6. Mac user? M3 Pro 36 GB or M3 Max 48 GB+ are excellent for LLM. The 8 GB and 16 GB base configs are too tight for 13B+ models.
  7. Budget under $300? Used RTX 3060 12 GB. The unsung hero for LLM under $250. 12 GB is enough for 13B Q4.

Why this list will be obsolete in 2027

Three things will change the GPU-for-LLM landscape over the next 12 months:

Translation: don't buy aspirational hardware for "future LLM workloads". Buy what runs the models you need today, well. The ecosystem will get faster around your existing card.

Common questions

"What about CPU-only inference?" Modern Ryzen 7000+ / Intel 13th-gen+ CPUs can do 4-12 t/s on Llama 7B Q4 with no GPU at all. Slow but usable. DDR5 helps a lot — bandwidth-bound, again.

"Is 24 GB enough for 70B?" 70B Q4 weights are ~40 GB. So no, not in single-GPU mode. You'd need Q3 or Q2 quantisation (quality drop is real), or split across two 24 GB GPUs (works but adds latency), or use unified memory (Apple).

"Should I buy an H100/A100 used?" Almost certainly no. H100s are ~$25k+ even used. They're optimised for training and large-batch inference. For single-user inference, a $2000 RTX 4090 is faster on a per-token basis. Datacenter cards make sense only if you're serving many concurrent users.

"Will Stable Diffusion work on these?" Yes — every GPU on the Tier 1-3 tables runs SDXL fine. We have a separate article: Test PC for Stable Diffusion XL in 15 Seconds.

Test before you buy

Don't take any GPU recommendation on faith — including ours. Run 9bench.com on your current machine, click Run live AI test, and see real tokens/second on your hardware in your browser. No download, no signup, no upload. Then compare against the tier tables above to decide if you actually need to upgrade.

Test my GPU's LLM speed →