- Best overall: RTX 4090 (24GB, 100-160 t/s on Llama 7B Q4) — sweet spot of speed + VRAM
- Best price/perf: Used RTX 3090 ($700-900, 24GB, 60-100 t/s) — handles 30B models
- Best new mid-range: RTX 4070 Super ($600, 12GB, 50-75 t/s) — comfortable for 7B/13B
- Best for 70B+: Apple M3 Ultra (192GB unified) or M3 Max 64GB+ — only consumer option that fits
- Best AMD: RX 7900 XTX (24GB, 80-130 t/s) — if you tolerate ROCm setup pain
- Don't buy: RTX 4060 Ti 8GB at MSRP, RX 7600 8GB — VRAM-starved for anything beyond 7B
Running LLMs locally is the most practical AI hardware decision you can make in 2026. ChatGPT-class quality now fits on high-end consumer hardware (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large), and inference is fast enough that running locally is no longer a tinkerer's project: it's a normal way to use AI without paying $20/month per seat or sending your code to OpenAI.
But which GPU? Marketing pages all claim "AI-ready". Subreddits give you 200 contradictory opinions. This article gives you calibrated tokens/second numbers for the GPUs people actually buy — sourced from real llama.cpp / Ollama benchmarks plus the 9bench reference database.
Tier 1: Beast (90+ tokens/sec, $1500+)
These are the GPUs you buy if you're running 70B models, fine-tuning, or hosting an LLM as a personal API for multiple users. Overkill for 7B chat. Necessary for serious work.
| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 130-200 | 70B Q4 (tight) | $1999-2499 |
| RTX 4090 | 24 GB | 100-160 | 30B Q4 / 70B Q3 | $1700-2000 used |
| Apple M3 Ultra | up to 192 GB unified | 80-130 | 120B+ (via unified memory) | $4000+ (Mac Studio) |
| Apple M2 Ultra | up to 128 GB unified | 70-110 | 70B Q4 comfortably | $3500+ used |
| RTX 5080 | 16 GB | 90-130 | 13B Q4 | $999-1199 |
| RX 7900 XTX | 24 GB | 80-130 | 30B Q4 | $899-1099 |
Verdict — Tier 1
The RTX 4090 is still the best LLM GPU in 2026. The RTX 5090 is faster, but you pay 25-50% more for roughly a 30% gain. The 4090's 24 GB of VRAM runs 30B Q4 comfortably and 70B Q3 in a pinch, and it comes with CUDA, Tensor Cores, and an ML ecosystem that is optimised around it. If you can find a used 4090 around $1700, that's the price/perf sweet spot at the top.
The Apple M3 Ultra is the wildcard. Unified memory means up to 192 GB acts as "VRAM"; no consumer NVIDIA GPU comes close. You can run a quantised 120B model that would need two H100s on a PC. Token generation is competitive (~80-130 t/s on 7B). The catch: prompt processing is 3-5× slower than on NVIDIA, and CUDA-only tools (some fine-tuning frameworks, some research code) simply won't work. For low-fuss inference of very large models, the M-series is unmatched.
Tier 2: Workstation (50-90 t/s, $700-1500)
The pragmatic tier. You can run 7B models faster than you can read, 13B with comfort, and 30B if you're patient. This is where most serious solo developers and AI hobbyists live in 2026.
| GPU | VRAM | Llama 7B Q4 (t/s) | Max model | Price (2026) |
|---|---|---|---|---|
| RTX 4080 Super | 16 GB | 75-110 | 13B Q4 / 30B Q3 | $899-1099 |
| RTX 5070 Ti | 16 GB | 70-100 | 13B Q4 | $749-849 |
| RTX 3090 (used) | 24 GB | 60-100 | 30B Q4 ⭐ | $700-900 used |
| Apple M3 Max | up to 64 GB unified | 50-80 | 30B-70B (with 64-128GB) | MacBook Pro $3500+ |
| RTX 4070 Super | 12 GB | 50-75 | 13B Q4 | $599-699 |
| RX 7900 XT | 20 GB | 65-105 | 13B Q4 / 30B Q3 | $649-799 |
The used RTX 3090 deal that nobody tells you about
If you only care about LLM inference, the used RTX 3090 ($700-900) is the best-value GPU in 2026. Here's why: it has 24 GB of VRAM, the same as the brand-new $2000 RTX 5090 and the legendary 4090. That 24 GB lets you run 30B Q4 models, something no new NVIDIA GPU under $1500 offers (the RX 7900 XTX is the lone exception, on the AMD side).
Yes, raw FP32 compute on a 4070 Super is technically higher. But LLM inference is memory-bandwidth-bound, not compute-bound. The 3090's 936 GB/s memory bandwidth is in the same league as the 4090 (1008 GB/s) and crushes the 4070 Super (504 GB/s). Translation: a 3090 is faster than a 4070 Super on Llama 7B because tokens/s scales with bandwidth, not TFLOPS.
Buy from r/hardwareswap or eBay with photos showing the GPU running. Avoid mining cards with replaced thermal pads (cracked solder is a real risk). A clean 3090 will outlive most consumer GPUs.
A note on Apple unified memory: by default, macOS caps how much unified memory the GPU can wire (roughly two-thirds to three-quarters of total RAM). You can raise the cap with `sudo sysctl iogpu.wired_limit_mb=<value>`. Most LLM tools (LM Studio, Ollama, MLX) document this. It's not "plug and play 192 GB of VRAM"; it's "plug and play once you know the trick".
Tier 3: Mainstream (25-50 t/s, $300-700)
The "I want to try local LLM but don't want to spend a kidney" tier. Every GPU here runs Llama 7B Q4 faster than human reading speed. None of them comfortably run 13B+, but for a personal coding/writing/chat assistant, 7B is enough.
| GPU | VRAM | Llama 7B Q4 (t/s) | 13B feasible? | Price (2026) |
|---|---|---|---|---|
| RTX 4070 | 12 GB | 45-65 | Yes (Q4) | $549-649 |
| RTX 3080 (used) | 10 GB | 50-75 | Tight (Q3 only) | $400-550 used |
| RTX 4060 Ti | 8 GB / 16 GB | 35-55 | Only 16 GB version | $399-499 |
| RX 7800 XT | 16 GB | 50-80 | Yes (Q4) | $499-599 |
| RX 6800 (used) | 16 GB | 28-45 | Yes (Q4) | $300-400 used |
| Apple M3 Pro | 18-36 GB unified | 35-60 | Yes (with 36GB) | MacBook Pro $2000+ |
| RTX 4060 | 8 GB | 30-50 | No (VRAM limit) | $299-349 |
The 8 GB VRAM trap
Multiple GPUs in this tier ship with 8 GB VRAM. For Llama 7B Q4, that's just enough — the weights are ~4.5 GB, leaving ~3 GB for KV cache and context. You will not run 13B on 8 GB, no matter what marketing pages claim. Q3 quantisation can push 13B to ~6 GB, but quality drops noticeably.
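To see why 8 GB is the cliff edge, here's a back-of-envelope estimator. This is a sketch: the KV-cache and overhead constants are illustrative assumptions, and real usage varies with runtime and context length.

```python
# Rough VRAM needed to run a quantised model: weights + KV cache + overhead.
# kv_cache_gb and overhead_gb are assumed constants, not measured values.

def est_vram_gb(params_b: float, eff_bits: float = 4.5,
                kv_cache_gb: float = 1.5, overhead_gb: float = 1.0) -> float:
    weights_gb = params_b * eff_bits / 8  # Q4 is ~4.5 effective bits/weight
    return weights_gb + kv_cache_gb + overhead_gb

print(round(est_vram_gb(7), 1))   # ~6.4 GB: squeezes into 8 GB
print(round(est_vram_gb(13), 1))  # ~9.8 GB: needs a 12 GB card
```

Plug in any model size and quantisation level to sanity-check a card before buying; the 13B-on-8-GB case fails by more than a gigabyte even before context grows.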
If you're buying new in this price range, the RX 7800 XT (16 GB) is the LLM pick over RTX 4060/4060 Ti 8 GB. AMD's ROCm support on Linux is now decent (2026 is the year ROCm finally works on consumer cards). On Windows + Vulkan via llama.cpp it just works.
Tier 4: Budget / "what I already own" (8-25 t/s, under $300)
You don't need to upgrade to try local LLM. If you have one of these GPUs, you can run Llama 7B Q4 at usable speeds today.
| GPU | VRAM | Llama 7B Q4 (t/s) | Verdict |
|---|---|---|---|
| GTX 1080 Ti | 11 GB | 22-36 | Still solid in 2026 — 11 GB is enough for 7B |
| RTX 2060 / 2070 | 6-8 GB | 15-30 | Works for 7B Q4, tight on context |
| RTX 3060 12 GB | 12 GB | 25-40 | Underrated for LLM — 12 GB at $200 used |
| Apple M1 / M2 | 8-16 GB unified | 14-35 | Fine for 7B if you have 16 GB+ |
| Radeon 780M (APU) | shared | 8-18 | Surprisingly usable on Ryzen 8845HS-class laptops |
| Iris Xe / UHD | shared | 2-12 | Slow but works — readable in 30-60s |
Even an Intel Iris Xe iGPU manages ~8 t/s on Llama 7B Q4, which is roughly human reading speed. The point: local LLM no longer means "needs an RTX 4090". Nearly every laptop sold since 2022 can run one at usable speeds.
Should you buy AMD for LLM in 2026?
Short answer: maybe, if you're on Linux. Long answer is more nuanced.
The case for AMD: RX 7900 XTX (24 GB, ~$900) gives you 4090-tier VRAM at half the price. On llama.cpp via Vulkan or ROCm 6.0+, it generates ~80-130 t/s on Llama 7B Q4 — slower than a 4090 (100-160 t/s) but not dramatically. Per-dollar, AMD is the better deal.
The case against AMD: ecosystem fragility. CUDA-only frameworks (some fine-tuning libraries, some research code, some quantisation tools) don't have ROCm equivalents. ROCm on Windows is still flaky. ROCm on Linux works but you're often a generation behind (RDNA 4 / RX 9000 cards may launch with broken ROCm support and take months to stabilise). NVIDIA is "boot Ubuntu, install Ollama, it works". AMD is "research compatibility for your model and tools first".
If you only run inference and you're on Linux: AMD is fine and saves money. If you're on Windows, fine-tune models, or follow research papers that ship CUDA code: pay the NVIDIA tax.
Browser vs native — the reality check
Quick PSA: when you see tokens/second numbers on benchmark sites, check whether they're native or browser. Browser-based LLM runtimes (transformers.js, WebLLM) go through WebAssembly and WebGPU and are typically 5-10× slower than llama.cpp / Ollama on the same hardware. We covered this in detail in Browser vs Native LLM Performance.
9bench's live LLM test in your browser will report ~5-15 tokens/s on a 4090. That doesn't mean a 4090 is slow — it means WebAssembly + WebGPU has real overhead. Native llama.cpp on the same 4090 generates 100-160 t/s. The browser number is useful for relative ranking between machines, not for absolute claims.
Memory bandwidth, not TFLOPS, is the LLM bottleneck
Most "best GPU for AI" articles list raw TFLOPS as if it's the deciding factor. For LLM inference, it isn't.
Token generation is a sequential process: predict next token, append, predict next, append. Each prediction requires reading the entire model weights from VRAM. The bottleneck is how fast you can stream those weights through the compute units — i.e. memory bandwidth.
Quick sanity check on the principle: an RTX 4070 Super has ~35 TFLOPS FP32 and 504 GB/s of memory bandwidth. An RTX 3090 has ~36 TFLOPS FP32 and 936 GB/s. Nearly identical compute, almost 2× the bandwidth. On Llama 7B Q4, the 3090 generates tokens significantly faster than the 4070 Super. The TFLOPS number is misleading; bandwidth is the real story.
Practical takeaway: when comparing GPUs for LLM, sort by VRAM bandwidth, not TFLOPS. AMD GPUs and Apple Silicon both look better under this metric than raw compute would suggest.
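The bandwidth argument reduces to one line of arithmetic: each generated token streams the full weight file once, so bandwidth divided by weight size is a hard ceiling on tokens/second. A minimal sketch (real runs land well under the ceiling because of attention, KV-cache traffic, and kernel overhead):

```python
# Upper bound on decode speed: every token reads all weights from VRAM once,
# so tokens/s can never exceed memory bandwidth / weight size.

def ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

WEIGHTS_7B_Q4 = 4.5  # GB, approximate Llama 7B Q4 file size

for name, bw in [("RTX 4090", 1008), ("RTX 3090", 936), ("RTX 4070 Super", 504)]:
    print(f"{name}: <= {ceiling_tps(bw, WEIGHTS_7B_Q4):.0f} t/s theoretical")
```

The measured ranges in the tables above (100-160, 60-100, 50-75 t/s) sit comfortably under these ceilings, and in the same rank order, which is exactly what a bandwidth-bound workload predicts.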
What about training / fine-tuning?
This article focuses on inference (running models), not training (creating or fine-tuning them). For fine-tuning:
- LoRA fine-tuning of 7B models: any 12 GB+ GPU works. RTX 3060 12 GB to RTX 4090.
- QLoRA fine-tuning of 13B-30B: 24 GB minimum (RTX 3090, 4090, A6000).
- Full fine-tuning: not a consumer-GPU activity in 2026. Rent A100/H100s on RunPod or Together AI.
- Apple Silicon fine-tuning: MLX framework (Apple's PyTorch alternative) makes M-series viable for LoRA. Not all research code ports over.
For fine-tuning, NVIDIA + CUDA is the safe path. AMD ROCm fine-tuning works but you'll debug frequently. Apple MLX is great but research-bleeding-edge code rarely targets it first.
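As a rough guide to which card fits which fine-tune, QLoRA memory is dominated by the 4-bit base weights plus headroom for activations and the small adapter states. This is a sketch with an assumed headroom constant; actual usage depends on batch size, sequence length, and LoRA rank.

```python
# Very rough QLoRA VRAM estimate: 4-bit base weights + headroom for
# activations, LoRA adapter gradients, and optimiser state.
# activation_headroom_gb is an assumed constant, not a measured value.

def qlora_vram_gb(params_b: float, activation_headroom_gb: float = 6.0) -> float:
    base_4bit_gb = params_b * 0.5  # 4 bits/weight = 0.5 bytes/weight
    return base_4bit_gb + activation_headroom_gb

print(qlora_vram_gb(7))   # 9.5  -> fits a 12 GB card
print(qlora_vram_gb(30))  # 21.0 -> wants a 24 GB card
```

This matches the rules of thumb above: 7B fine-tunes fit 12 GB cards, while 13B-30B pushes you into 24 GB territory.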
The "don't buy" list
GPUs that look attractive on paper but disappoint for LLM:
- RTX 4060 Ti 8 GB — VRAM-starved. The 16 GB version is fine, but the 8 GB version cannot run 13B and is bandwidth-limited.
- RX 7600 8 GB — Same VRAM issue. RX 7800 XT 16 GB is much better LLM value at $499.
- RTX 4080 (non-Super) — Was overpriced at launch. RTX 4080 Super at $999 makes the original 4080 obsolete.
- Used RTX 3050 / 3060 8 GB — 8 GB cap kills 13B aspirations. Save up for 3060 12 GB instead.
- Any GPU with shared memory only (laptops with 2-4 GB dedicated VRAM) — bandwidth is too low. Use the iGPU + system RAM via APU instead, or the CPU directly.
Decision tree: which GPU should you actually buy?
- Already have a GPU? Test it first. The 9bench live LLM test tells you actual tokens/s in 60 seconds. If you get 5+ t/s in the browser, you can expect 25-50+ t/s natively, which is already usable.
- Want to run 70B+ models? Two paths: (a) Apple M3 Ultra Mac Studio with 96+ GB unified memory, or (b) two used RTX 3090s + a motherboard with two PCIe 4.0 x8 slots. The Mac is more elegant. The dual-3090 is faster on smaller models.
- Want fastest 7B/13B and willing to spend? RTX 4090 (used ~$1700) or RTX 5090 (~$2000-2500) if you want PCIe 5.0 and 32 GB. The 4090 is still a great purchase in 2026.
- Best price/perf for serious work? Used RTX 3090 ($700-900). 24 GB VRAM at this price is genuinely unique. Every solo developer doing LLM work should consider this.
- Mid-range new GPU? RTX 4070 Super 12 GB ($600) for NVIDIA-ecosystem comfort, RX 7800 XT 16 GB ($499) if you run Linux and want more VRAM per dollar.
- Mac user? M3 Pro 36 GB or M3 Max 48 GB+ are excellent for LLM. The 8 GB and 16 GB base configs are too tight for 13B+ models.
- Budget under $300? Used RTX 3060 12 GB. The unsung hero for LLM under $250. 12 GB is enough for 13B Q4.
Why this list will be obsolete in 2027
Three things will change the GPU-for-LLM landscape over the next 12 months:
- Speculative decoding becomes standard. Already shipped in vLLM and llama.cpp, not yet in Ollama by default. When it lands universally, expect 1.5-2× tokens/s on the same hardware. This makes mid-range GPUs more competitive with Tier 1.
- Better quantisation (improved 2-3-bit quants, AWQ, GPTQ refinements). 70B models that currently need ~40 GB will fit in 24 GB at acceptable quality. RTX 3090/4090 owners win.
- NPU acceleration. Intel Lunar Lake / Arrow Lake-H, AMD Strix Halo, Snapdragon X Elite all ship dedicated NPUs in 2026 laptops. Currently underused — software is catching up. Expect noticeable speedups for inference on integrated chips by mid-2026.
Translation: don't buy aspirational hardware for "future LLM workloads". Buy what runs the models you need today, well. The ecosystem will get faster around your existing card.
Common questions
"What about CPU-only inference?" Modern Ryzen 7000+ / Intel 13th-gen+ CPUs can do 4-12 t/s on Llama 7B Q4 with no GPU at all. Slow but usable. DDR5 helps a lot — bandwidth-bound, again.
"Is 24 GB enough for 70B?" 70B Q4 weights are ~40 GB. So no, not in single-GPU mode. You'd need Q3 or Q2 quantisation (quality drop is real), or split across two 24 GB GPUs (works but adds latency), or use unified memory (Apple).
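The arithmetic behind that answer is simple: quantised weight size scales linearly with bits per weight. The effective-bits values below are assumptions that fold in quantisation metadata.

```python
# Quantised weight size scales linearly with effective bits per weight.

def weight_gb(params_b: float, eff_bits: float) -> float:
    return params_b * eff_bits / 8

print(f"{weight_gb(70, 4.5):.0f} GB")  # Q4: ~39 GB, far over 24 GB
print(f"{weight_gb(70, 2.6):.0f} GB")  # ~Q2-class: ~23 GB, just squeezes in
```

So a single 24 GB card only reaches 70B by dropping to very aggressive 2-bit-class quantisation, with the quality cost noted above.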
"Should I buy an H100/A100 used?" Almost certainly no. H100s are ~$25k+ even used. They're optimised for training and large-batch inference. For single-user inference, a $2000 RTX 4090 is faster on a per-token basis. Datacenter cards make sense only if you're serving many concurrent users.
"Will Stable Diffusion work on these?" Yes — every GPU on the Tier 1-3 tables runs SDXL fine. We have a separate article: Test PC for Stable Diffusion XL in 15 Seconds.
Test before you buy
Don't take any GPU recommendation on faith — including ours. Run 9bench.com on your current machine, click Run live AI test, and see real tokens/second on your hardware in your browser. No download, no signup, no upload. Then compare against the tier tables above to decide if you actually need to upgrade.
Test my GPU's LLM speed →