- Best overall: RTX 4090 (24GB, 100-160 t/s on Llama 7B Q4) — sweet spot of speed + VRAM
- Best price/perf: Used RTX 3090 ($700-900, 24GB, 60-100 t/s) — handles 30B models
- Best new mid-range: RTX 4070 Super ($600, 12GB, 50-75 t/s) — comfortable for 7B/13B
- Best for 70B+: Apple M3 Ultra (192GB unified) or M3 Max 64GB+ — only consumer option that fits
- Best AMD: RX 7900 XTX (24GB, 80-130 t/s) — if you tolerate ROCm setup pain
- Don't buy: RTX 4060 Ti 8GB at MSRP, RX 7600 8GB — VRAM-starved for anything beyond 7B
Running LLMs locally is the most practical AI hardware decision you can make in 2026. ChatGPT-class quality now fits on high-end consumer hardware (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large), and inference is fast enough that running locally is no longer a tinkerer's project: it's a normal way to use AI without paying $20/month per seat or sending your code to OpenAI.
But which GPU? Marketing pages all claim "AI-ready". Subreddits give you 200 contradictory opinions. This article gives you calibrated tokens/second numbers for the GPUs people actually buy — sourced from real llama.cpp / Ollama benchmarks plus the 9bench reference database.
Tier 1: Beast (90+ tokens/sec, $1500+)
These are the GPUs you buy if you're running 70B models, fine-tuning, or hosting an LLM as a personal API for multiple users. Overkill for 7B chat. Necessary for serious work.
| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 130-200 | 70B Q4 (tight) | $1999-2499 |
| RTX 4090 | 24 GB | 100-160 | 30B Q4 / 70B Q3 | $1700-2000 used |
| Apple M3 Ultra | up to 192 GB unified | 80-130 | 120B+ (via unified memory) | $4000+ (Mac Studio) |
| Apple M2 Ultra | up to 128 GB unified | 70-110 | 70B Q4 comfortably | $3500+ used |
| RTX 5080 | 16 GB | 90-130 | 13B Q4 | $999-1199 |
| RX 7900 XTX | 24 GB | 80-130 | 30B Q4 | $899-1099 |
Verdict — Tier 1
The RTX 4090 is still the best LLM GPU in 2026. The RTX 5090 is faster, but you pay 25-50% more for roughly a 30% gain. The 4090's 24 GB of VRAM runs 30B Q4 comfortably and 70B Q3 in a pinch, and it comes with CUDA, Tensor Cores, and an ML ecosystem that is optimised around it. If you can find a used 4090 around $1700, that's the price/perf sweet spot at the top.
The Apple M3 Ultra is the wildcard. Unified memory means up to 192 GB acts as "VRAM"; no consumer NVIDIA GPU comes close. You can run a quantised 120B model that would need two H100s on a PC. Token generation is competitive (~80-130 t/s on 7B). The catch: prompt processing is 3-5× slower than on NVIDIA, and CUDA-only tools (some fine-tuning frameworks, some research code) simply won't work. For low-fuss inference of very large models, the M-series is unmatched.
Tier 2: Workstation (50-90 t/s, $700-1500)
The pragmatic tier. You can run 7B models faster than you can read, 13B with comfort, and 30B if you're patient. This is where most serious solo developers and AI hobbyists live in 2026.
| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 4080 Super | 16 GB | 75-110 | 13B Q4 / 30B Q3 | $899-1099 |
| RTX 5070 Ti | 16 GB | 70-100 | 13B Q4 | $749-849 |
| RTX 3090 (used) | 24 GB | 60-100 | 30B Q4 ⭐ | $700-900 used |
| Apple M3 Max | up to 64 GB unified | 50-80 | 30B-70B (with 64-128GB) | MacBook Pro $3500+ |
| RTX 4070 Super | 12 GB | 50-75 | 13B Q4 | $599-699 |
| RX 7900 XT | 20 GB | 65-105 | 13B Q4 / 30B Q3 | $649-799 |
The used RTX 3090 deal that nobody tells you about
If you only care about LLM inference, the used RTX 3090 ($700-900) is the best-value GPU in 2026. Here's why: it has 24 GB of VRAM, the same as the brand-new $2000 RTX 5090 and the legendary 4090. That 24 GB lets you run 30B Q4 models, something no new NVIDIA GPU under $1500 offers (the RX 7900 XTX is the lone exception, on the AMD side).
Yes, raw FP32 compute on a 4070 Super is technically higher. But LLM inference is memory-bandwidth-bound, not compute-bound. The 3090's 936 GB/s memory bandwidth is in the same league as the 4090 (1008 GB/s) and crushes the 4070 Super (504 GB/s). Translation: a 3090 is faster than a 4070 Super on Llama 7B because tokens/s scales with bandwidth, not TFLOPS.
Buy from r/hardwareswap or eBay with photos showing the GPU running. Avoid mining cards with replaced thermal pads (cracked solder is a real risk). A clean 3090 will outlive most consumer GPUs.
A note on Apple unified memory: by default, macOS caps how much unified memory the GPU can wire (roughly two-thirds to three-quarters of total RAM). You can raise the cap with `sudo sysctl iogpu.wired_limit_mb=<value>`. Most LLM tools (LM Studio, Ollama, MLX) document this. It's not "plug and play 192 GB of VRAM"; it's "plug and play once you know the trick".
Tier 3: Mainstream (25-50 t/s, $300-700)
The "I want to try local LLM but don't want to spend a kidney" tier. Every GPU here runs Llama 7B Q4 faster than human reading speed. None of them comfortably run 13B+, but for a personal coding/writing/chat assistant, 7B is enough.
| GPU | VRAM | Llama 7B Q4 (t/s) | 13B feasible? | Price (2026) |
|---|---|---|---|---|
| RTX 4070 | 12 GB | 45-65 | Yes (Q4) | $549-649 |
| RTX 3080 (used) | 10 GB | 50-75 | Tight (Q3 only) | $400-550 used |
| RTX 4060 Ti | 8 GB / 16 GB | 35-55 | Only 16 GB version | $399-499 |
| RX 7800 XT | 16 GB | 50-80 | Yes (Q4) | $499-599 |
| RX 6800 (used) | 16 GB | 28-45 | Yes (Q4) | $300-400 used |
| Apple M3 Pro | 18-36 GB unified | 35-60 | Yes (with 36GB) | MacBook Pro $2000+ |
| RTX 4060 | 8 GB | 30-50 | No (VRAM limit) | $299-349 |
The 8 GB VRAM trap
Multiple GPUs in this tier ship with 8 GB VRAM. For Llama 7B Q4, that's just enough — the weights are ~4.5 GB, leaving ~3 GB for KV cache and context. You will not run 13B on 8 GB, no matter what marketing pages claim. Q3 quantisation can push 13B to ~6 GB, but quality drops noticeably.
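To see why 8 GB is the cliff edge, here's a back-of-envelope estimator. This is a sketch: the KV-cache and overhead constants are illustrative assumptions, and real usage varies with runtime and context length.

```python
# Rough VRAM needed to run a quantised model: weights + KV cache + overhead.
# kv_cache_gb and overhead_gb are assumed constants, not measured values.

def est_vram_gb(params_b: float, eff_bits: float = 4.5,
                kv_cache_gb: float = 1.5, overhead_gb: float = 1.0) -> float:
    weights_gb = params_b * eff_bits / 8  # Q4 is ~4.5 effective bits/weight
    return weights_gb + kv_cache_gb + overhead_gb

print(round(est_vram_gb(7), 1))   # ~6.4 GB: squeezes into 8 GB
print(round(est_vram_gb(13), 1))  # ~9.8 GB: needs a 12 GB card
```

Plug in any model size and quantisation level to sanity-check a card before buying; the 13B-on-8-GB case fails by more than a gigabyte even before context grows.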
If you're buying new in this price range, the RX 7800 XT (16 GB) is the LLM pick over RTX 4060/4060 Ti 8 GB. AMD's ROCm support on Linux is now decent (2026 is the year ROCm finally works on consumer cards). On Windows + Vulkan via llama.cpp it just works.
Tier 4: Budget / "what I already own" (8-25 t/s, under $300)
You don't need to upgrade to try local LLM. If you have one of these GPUs, you can run Llama 7B Q4 at usable speeds today.
| GPU | VRAM | Llama 7B Q4 (t/s) | Verdict |
|---|---|---|---|
| GTX 1080 Ti | 11 GB | 22-36 | Still solid in 2026 — 11 GB is enough for 7B |
| RTX 2060 / 2070 | 6-8 GB | 15-30 | Works for 7B Q4, tight on context |
| RTX 3060 12 GB | 12 GB | 25-40 | Underrated for LLM — 12 GB at $200 used |
| Apple M1 / M2 | 8-16 GB unified | 14-35 | Fine for 7B if you have 16 GB+ |
| Radeon 780M (APU) | shared | 8-18 | Surprisingly usable on Ryzen 8845HS-class laptops |
| Iris Xe / UHD | shared | 2-12 | Slow but works — readable in 30-60s |
Even an Intel Iris Xe iGPU manages ~8 t/s on Llama 7B Q4, which is roughly human reading speed. The point: local LLM no longer means "needs an RTX 4090". Nearly every laptop sold since 2022 can run one at usable speeds.
Should you buy AMD for LLM in 2026?
Short answer: maybe, if you're on Linux. Long answer is more nuanced.
The case for AMD: RX 7900 XTX (24 GB, ~$900) gives you 4090-tier VRAM at half the price. On llama.cpp via Vulkan or ROCm 6.0+, it generates ~80-130 t/s on Llama 7B Q4 — slower than a 4090 (100-160 t/s) but not dramatically. Per-dollar, AMD is the better deal.
The case against AMD: ecosystem fragility. CUDA-only frameworks (some fine-tuning libraries, some research code, some quantisation tools) don't have ROCm equivalents. ROCm on Windows is still flaky. ROCm on Linux works but you're often a generation behind (RDNA 4 / RX 9000 cards may launch with broken ROCm support and take months to stabilise). NVIDIA is "boot Ubuntu, install Ollama, it works". AMD is "research compatibility for your model and tools first".
If you only run inference and you're on Linux: AMD is fine and saves money. If you're on Windows, fine-tune models, or follow research papers that ship CUDA code: pay the NVIDIA tax.
Browser vs native — the reality check
Quick PSA: when you see tokens/second numbers on benchmark sites, check whether they're native or browser. Browser-based LLM runtimes (transformers.js, WebLLM) go through WebAssembly and WebGPU and are typically 5-10× slower than llama.cpp / Ollama on the same hardware. We covered this in detail in Browser vs Native LLM Performance.
9bench's live LLM test in your browser will report ~5-15 tokens/s on a 4090. That doesn't mean a 4090 is slow — it means WebAssembly + WebGPU has real overhead. Native llama.cpp on the same 4090 generates 100-160 t/s. The browser number is useful for relative ranking between machines, not for absolute claims.
Memory bandwidth, not TFLOPS, is the LLM bottleneck
Most "best GPU for AI" articles list raw TFLOPS as if it's the deciding factor. For LLM inference, it isn't.
Token generation is a sequential process: predict next token, append, predict next, append. Each prediction requires reading the entire model weights from VRAM. The bottleneck is how fast you can stream those weights through the compute units — i.e. memory bandwidth.
Quick sanity check on the principle: an RTX 4070 Super has ~35 TFLOPS FP32 and 504 GB/s of memory bandwidth. An RTX 3090 has ~36 TFLOPS FP32 and 936 GB/s. Nearly identical compute, almost 2× the bandwidth. On Llama 7B Q4, the 3090 generates tokens significantly faster than the 4070 Super. The TFLOPS number is misleading; bandwidth is the real story.
Practical takeaway: when comparing GPUs for LLM, sort by VRAM bandwidth, not TFLOPS. AMD GPUs and Apple Silicon both look better under this metric than raw compute would suggest.
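The bandwidth argument reduces to one line of arithmetic: each generated token streams the full weight file once, so bandwidth divided by weight size is a hard ceiling on tokens/second. A minimal sketch (real runs land well under the ceiling because of attention, KV-cache traffic, and kernel overhead):

```python
# Upper bound on decode speed: every token reads all weights from VRAM once,
# so tokens/s can never exceed memory bandwidth / weight size.

def ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

WEIGHTS_7B_Q4 = 4.5  # GB, approximate Llama 7B Q4 file size

for name, bw in [("RTX 4090", 1008), ("RTX 3090", 936), ("RTX 4070 Super", 504)]:
    print(f"{name}: <= {ceiling_tps(bw, WEIGHTS_7B_Q4):.0f} t/s theoretical")
```

The measured ranges in the tables above (100-160, 60-100, 50-75 t/s) sit comfortably under these ceilings, and in the same rank order, which is exactly what a bandwidth-bound workload predicts.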
What about training / fine-tuning?
This article focuses on inference (running models), not training (creating or fine-tuning them). For fine-tuning:
- LoRA fine-tuning of 7B models: any 12 GB+ GPU works. RTX 3060 12 GB to RTX 4090.
- QLoRA fine-tuning of 13B-30B: 24 GB minimum (RTX 3090, 4090, A6000).
- Full fine-tuning: not a consumer-GPU activity in 2026. Rent A100/H100s on RunPod or Together AI.
- Apple Silicon fine-tuning: MLX framework (Apple's PyTorch alternative) makes M-series viable for LoRA. Not all research code ports over.
For fine-tuning, NVIDIA + CUDA is the safe path. AMD ROCm fine-tuning works but you'll debug frequently. Apple MLX is great but research-bleeding-edge code rarely targets it first.
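As a rough guide to which card fits which fine-tune, QLoRA memory is dominated by the 4-bit base weights plus headroom for activations and the small adapter states. This is a sketch with an assumed headroom constant; actual usage depends on batch size, sequence length, and LoRA rank.

```python
# Very rough QLoRA VRAM estimate: 4-bit base weights + headroom for
# activations, LoRA adapter gradients, and optimiser state.
# activation_headroom_gb is an assumed constant, not a measured value.

def qlora_vram_gb(params_b: float, activation_headroom_gb: float = 6.0) -> float:
    base_4bit_gb = params_b * 0.5  # 4 bits/weight = 0.5 bytes/weight
    return base_4bit_gb + activation_headroom_gb

print(qlora_vram_gb(7))   # 9.5  -> fits a 12 GB card
print(qlora_vram_gb(30))  # 21.0 -> wants a 24 GB card
```

This matches the rules of thumb above: 7B fine-tunes fit 12 GB cards, while 13B-30B pushes you into 24 GB territory.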
The "don't buy" list
GPUs that look attractive on paper but disappoint for LLM:
- RTX 4060 Ti 8 GB — VRAM-starved. The 16 GB version is fine, but the 8 GB version cannot run 13B and is bandwidth-limited.
- RX 7600 8 GB — Same VRAM issue. RX 7800 XT 16 GB is much better LLM value at $499.
- RTX 4080 (non-Super) — Was overpriced at launch. RTX 4080 Super at $999 makes the original 4080 obsolete.
- Used RTX 3050 / 3060 8 GB — 8 GB cap kills 13B aspirations. Save up for 3060 12 GB instead.
- Any GPU with shared memory only (laptops with 2-4 GB dedicated VRAM) — bandwidth is too low. Use the iGPU + system RAM via APU instead, or the CPU directly.
Decision tree: which GPU should you actually buy?
- Already have a GPU? Test it first. The 9bench live LLM test tells you actual tokens/s in 60 seconds. If you get 5+ t/s in the browser, you can expect 25-50+ t/s natively, which is already usable.
- Want to run 70B+ models? Two paths: (a) Apple M3 Ultra Mac Studio with 96+ GB unified memory, or (b) two used RTX 3090s + a motherboard with two PCIe 4.0 x8 slots. The Mac is more elegant. The dual-3090 is faster on smaller models.
- Want fastest 7B/13B and willing to spend? RTX 4090 (used ~$1700) or RTX 5090 (~$2000-2500) if you want PCIe 5.0 and 32 GB. The 4090 is still a great purchase in 2026.
- Best price/perf for serious work? Used RTX 3090 ($700-900). 24 GB VRAM at this price is genuinely unique. Every solo developer doing LLM work should consider this.
- Mid-range new GPU? RTX 4070 Super 12 GB ($600) for NVIDIA-ecosystem comfort, RX 7800 XT 16 GB ($499) if you run Linux and want more VRAM per dollar.
- Mac user? M3 Pro 36 GB or M3 Max 48 GB+ are excellent for LLM. The 8 GB and 16 GB base configs are too tight for 13B+ models.
- Budget under $300? Used RTX 3060 12 GB. The unsung hero for LLM under $250. 12 GB is enough for 13B Q4.
Why this list will be obsolete in 2027
Three things will change the GPU-for-LLM landscape over the next 12 months:
- Speculative decoding becomes standard. Already shipped in vLLM and llama.cpp, not yet in Ollama by default. When it lands universally, expect 1.5-2× tokens/s on the same hardware. This makes mid-range GPUs more competitive with Tier 1.
- Better quantisation (improved 2-3-bit quants, AWQ, GPTQ refinements). 70B models that currently need ~40 GB will fit in 24 GB at acceptable quality. RTX 3090/4090 owners win.
- NPU acceleration. Intel Lunar Lake / Arrow Lake-H, AMD Strix Halo, Snapdragon X Elite all ship dedicated NPUs in 2026 laptops. Currently underused — software is catching up. Expect noticeable speedups for inference on integrated chips by mid-2026.
Translation: don't buy aspirational hardware for "future LLM workloads". Buy what runs the models you need today, well. The ecosystem will get faster around your existing card.
Common questions
"What about CPU-only inference?" Modern Ryzen 7000+ / Intel 13th-gen+ CPUs can do 4-12 t/s on Llama 7B Q4 with no GPU at all. Slow but usable. DDR5 helps a lot — bandwidth-bound, again.
"Is 24 GB enough for 70B?" 70B Q4 weights are ~40 GB. So no, not in single-GPU mode. You'd need Q3 or Q2 quantisation (quality drop is real), or split across two 24 GB GPUs (works but adds latency), or use unified memory (Apple).
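The arithmetic behind that answer is simple: quantised weight size scales linearly with bits per weight. The effective-bits values below are assumptions that fold in quantisation metadata.

```python
# Quantised weight size scales linearly with effective bits per weight.

def weight_gb(params_b: float, eff_bits: float) -> float:
    return params_b * eff_bits / 8

print(f"{weight_gb(70, 4.5):.0f} GB")  # Q4: ~39 GB, far over 24 GB
print(f"{weight_gb(70, 2.6):.0f} GB")  # ~Q2-class: ~23 GB, just squeezes in
```

So a single 24 GB card only reaches 70B by dropping to very aggressive 2-bit-class quantisation, with the quality cost noted above.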
"Should I buy an H100/A100 used?" Almost certainly no. H100s are ~$25k+ even used. They're optimised for training and large-batch inference. For single-user inference, a $2000 RTX 4090 is faster on a per-token basis. Datacenter cards make sense only if you're serving many concurrent users.
"Will Stable Diffusion work on these?" Yes — every GPU on the Tier 1-3 tables runs SDXL fine. We have a separate article: Test PC for Stable Diffusion XL in 15 Seconds.
Test before you buy
Don't take any GPU recommendation on faith — including ours. Run 9bench.com on your current machine, click Run live AI test, and see real tokens/second on your hardware in your browser. No download, no signup, no upload. Then compare against the tier tables above to decide if you actually need to upgrade.
Test my GPU's LLM speed →