TL;DR — The browser-vs-native LLM gap, explained
On identical hardware, browser-based LLM inference (transformers.js, WebLLM) generates tokens 5-10× slower than native llama.cpp / Ollama. RTX 4090: ~150 t/s native vs ~15-25 t/s in the browser. Main causes: WebAssembly overhead (1.5-2×), WebGPU kernel restrictions (no tensor-core access), and browser memory caps. The browser is still the right choice when you need install-free deployment, guaranteed privacy, or cross-platform reach. Test it live: 9bench browser LLM test.

9bench includes a "live LLM test" that runs a small open-weight LLM in your browser (Phi-3-mini on machines with ~3.5 GB+ browser memory headroom; Qwen 0.5B or DistilGPT-2 on memory-constrained devices) and reports real tokens/second. The first thing users notice: the number is much lower than what they see in Ollama or LM Studio on the same machine. RTX 4090 owners report 15-25 tokens/s on 9bench browser test vs 100-160 on llama.cpp. That gap is real and not a 9bench bug.

This article explains why browser LLM is slower, by how much, when the browser is fast enough anyway, and what changes over the next 12-24 months.

The headline numbers

Measured on identical hardware, same model (Llama 7B Q4_K_M), same prompt, default settings:

| Hardware | Native (llama.cpp) | Browser (transformers.js) | Slowdown |
|---|---|---|---|
| RTX 4090 (24 GB) | 100-160 t/s | 15-25 t/s | ~6-7× |
| RTX 4070 Super (12 GB) | 50-75 t/s | 10-18 t/s | ~5× |
| RTX 4060 Laptop (8 GB) | 25-45 t/s | 5-12 t/s | ~5× |
| Apple M3 Max | 50-80 t/s (MLX) | 8-15 t/s | ~6-8× |
| Apple M2 (base) | 20-35 t/s (MLX) | 3-7 t/s | ~5-6× |
| Ryzen 7840U iGPU (Radeon 780M) | 8-18 t/s (Vulkan) | 2-4 t/s | ~4-5× |
| Intel Iris Xe iGPU | 5-12 t/s (CPU+iGPU) | 1.5-3 t/s | ~3-4× |

Two things stand out. First, the slowdown ratio is fairly consistent at 4-8× across all hardware classes; it is not a quirk of one chip. Second, weaker hardware carries a slightly smaller penalty (3-4× on iGPUs vs 6-7× on top-end discrete GPUs): when raw compute is the bottleneck for native code too, the browser's fixed overheads account for a smaller share of total time.

📐 Methodology note
Numbers above are medians from llama.cpp GitHub issues (#4167, #5950, #7062), the LM Studio public leaderboard, MLX-Examples README benchmarks, and 9bench's own browser-LLM test aggregated across 2000+ submissions. Same prompt (~30 tokens in, 200 tokens out), Q4_K_M quantisation, default sampler. Your specific number can swing ±20% with PSU, thermals, background processes, and OS scheduler.

Why is the browser slower? Five concrete reasons

1. WebAssembly overhead (1.5-2× slower than native code)

WebAssembly (Wasm) is a portable bytecode that browsers JIT-compile to machine code. Wasm SIMD adds vectorised ops, which help, but Wasm code still runs slower than equivalent native C++ for several reasons:

  1. Every linear-memory access is bounds-checked (or guard-paged), a cost native code avoids.
  2. Wasm SIMD is fixed at 128-bit lanes, while native builds use 256-bit AVX2 or 512-bit AVX-512 vectors.
  3. The browser JIT must compile quickly and conservatively, missing optimisations an ahead-of-time native compiler applies.

Wasm-SIMD-vs-native-AVX2 benchmarks consistently show a 1.5-2× slowdown. For the parts of LLM inference that run on the CPU (tokenisation, sampling, KV-cache management), this is the dominant cost.
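The CPU-side penalty starts with whether Wasm SIMD is even available. A minimal runtime probe (the same byte sequence the wasm-feature-detect library validates) looks like this:

```javascript
// Probe for Wasm SIMD support by validating a tiny module that uses
// v128 instructions. If the engine rejects it, inference falls back to
// scalar code: roughly 4x fewer lanes per instruction on the CPU path.
const simdProbe = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // "\0asm" magic
  0x01, 0x00, 0x00, 0x00, // version 1
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7b, // type section: () -> v128
  0x03, 0x02, 0x01, 0x00,                   // one function of that type
  0x0a, 0x0a, 0x01, 0x08, 0x00,             // code section, one body
  0x41, 0x00,             // i32.const 0
  0xfd, 0x0f,             // i8x16.splat  (a SIMD opcode)
  0xfd, 0x62,             // i8x16.popcnt (another SIMD opcode)
  0x0b,                   // end
]);

const hasWasmSimd = WebAssembly.validate(simdProbe);
console.log('Wasm SIMD available:', hasWasmSimd);
```

Any modern Chromium, Firefox, Safari, or Node build validates this module; only very old engines fail the probe and drop to scalar kernels.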

2. WebGPU has stricter constraints than CUDA / Metal / Vulkan

WebGPU is the new standard, and it is impressive, but it deliberately restricts certain operations for security and portability. Specifically for LLM workloads:

  1. Kernels are written in WGSL; there is no access to vendor intrinsics such as CUDA's tensor-core (WMMA) instructions.
  2. FP16 shader arithmetic is an optional feature (shader-f16), not a guarantee.
  3. Every buffer access is bounds-checked ("robust buffer access"), adding per-access overhead.
  4. Subgroup (warp-level) operations were missing from the baseline spec until recently.

These add up. A native llama.cpp build for an RTX 4090 uses CUDA tensor cores in FP16 with int4 weight de-quantisation kernels. The WebGPU equivalent runs general-purpose FP16 (if available) without tensor-core acceleration. That alone explains 2-3× of the gap.

3. Memory bandwidth bottleneck (browser caps + sandbox cost)

LLM inference is bandwidth-bound. Each generated token requires reading every weight from VRAM. The faster you can read, the faster tokens come out.
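The bandwidth-bound claim gives a useful mental model: tokens/s is capped by memory bandwidth divided by model size. A back-of-envelope sketch, using assumed numbers (a 4090's ~1008 GB/s spec bandwidth, ~4.1 GB for a 7B Q4_K_M model):

```javascript
// Ceiling for token generation on a bandwidth-bound model: every token
// reads (roughly) all weights once, so
//   max tokens/s ~= memory bandwidth / model size.
// Numbers are illustrative assumptions, not measurements.
function tokensPerSecondCeiling(bandwidthGBs, modelSizeGB) {
  return bandwidthGBs / modelSizeGB;
}

const native = tokensPerSecondCeiling(1008, 4.1);         // full spec bandwidth
const browser = tokensPerSecondCeiling(1008 * 0.65, 4.1); // ~65% effective in a tab

console.log(native.toFixed(0), browser.toFixed(0)); // theoretical ceilings only
```

Real rates land well below both ceilings (compute, cache misses, sampling), but the ratio between the two lines tracks the measured native-vs-browser gap.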

In the browser:

  1. Wasm32 linear memory is capped at 4 GB per module, and browsers add their own per-tab limits on top.
  2. WebGPU buffer sizes are limited by device caps (maxBufferSize, maxStorageBufferBindingSize), which can force weights to be split across buffers.
  3. Data crosses extra boundaries (JS to Wasm to GPU), each with copies or validation.

Native llama.cpp has direct, unrestricted GPU memory access: it can map the entire model into VRAM and read straight from compute kernels, while the browser must marshal data through the layers above. On a 4090, the effective bandwidth available to a browser tab is closer to 60-70% of what native gets.

4. No KV-cache optimisations browser-side (yet)

Modern LLM serving uses sophisticated KV-cache strategies: paged attention, sliding-window cache, Flash-Attention 2/3 for long contexts. These are critical when generating many tokens or handling long prompts.

As of early 2026, transformers.js implements a basic KV cache but neither paged attention nor Flash-Attention. WebLLM has slightly better KV handling. Native llama.cpp has all of them.

For short responses (200 tokens), this matters less. For long generations (2000+ tokens) or long-context conversations, the gap widens.
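One reason the gap widens with context: the KV cache grows linearly with token count and must also be streamed from memory on every step. A rough sizing sketch, assuming Llama-7B-like dimensions (the dimensions are assumptions for illustration, not any library's output):

```javascript
// Rough KV-cache footprint: 2 tensors (K and V) per layer, per token.
// Assumes a Llama-7B-like architecture: 32 layers, 32 KV heads,
// head_dim 128, fp16 cache (2 bytes per element).
function kvCacheBytes(ctxTokens, layers = 32, kvHeads = 32, headDim = 128, bytesPerElem = 2) {
  return 2 * layers * kvHeads * headDim * bytesPerElem * ctxTokens;
}

const shortChat = kvCacheBytes(200) / 2 ** 20;  // MiB for a 200-token reply
const longChat = kvCacheBytes(2048) / 2 ** 30;  // GiB for a 2k-token context

console.log(shortChat.toFixed(0) + ' MiB', longChat.toFixed(2) + ' GiB');
```

At 2k tokens the cache alone reaches roughly a gigabyte, which is exactly where paged attention and Flash-Attention start paying off, and where browser stacks that lack them fall further behind.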

5. Initial load time + warm-up cost

Native LLM tools download the model once and load it into memory. Subsequent runs reuse the loaded weights.

Browser LLM:

  1. Downloads model weights (1-5 GB) on first visit, though they are cached (Cache Storage / IndexedDB) for later sessions.
  2. Compiles WGSL shaders and JIT-compiles the Wasm runtime each session.
  3. Needs a few warm-up tokens before reaching steady-state speed.

For "open the page, ask one question, leave" use cases, this initial cost dominates the actual inference cost. Steady-state tokens/s is still the right basis for comparison, and that is what 9bench measures.

When the browser is fast enough anyway

Slower than native isn't the same as too slow. For many use cases, browser LLM is plenty.

Use case: interactive writing assistance (10+ t/s is enough)

A blog editor that suggests sentence completions doesn't need 100 t/s. It needs 5-15 t/s consistently, with low first-token latency. Browser LLM hits this on any GPU released since 2020.

Notable examples shipping in 2025-2026: Notion AI's optional client-side mode, GitHub Copilot's experimental "private browser inference", several form-autofill services.

Use case: privacy-sensitive client-side AI (no upload allowed)

Healthcare apps, legal documents, financial planning tools — anywhere the user data must not leave the device. Browser LLM is the only option that keeps data on the client without requiring the user to install software.

The privacy win is total: model weights download to the browser, inference happens locally, nothing about the user's input ever touches a server. No "we promise we won't train on your data" trust required.

Use case: cross-platform deployment (mobile + desktop + Chromebook)

Native LLM tools require platform-specific builds. Mobile is especially hard (iOS forbids JIT compilation in third-party apps, complicating embedded Wasm; Android has fragmented GPU support). Browser LLM runs anywhere with a modern Chromium-based browser, and Safari 17.4+ now ships WebGPU too.
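The "one URL" deployment works because backend selection happens at runtime. A hypothetical sketch of that feature detection, written against a navigator-like object so the logic stays testable outside a browser (the tiering is an assumption, not 9bench's actual code):

```javascript
// Pick an inference backend from browser capabilities.
// navigator.gpu exists wherever WebGPU is enabled (Chromium 113+,
// Safari 17.4+); everything else falls back to CPU Wasm.
function pickBackend(nav) {
  if (nav && nav.gpu) return 'webgpu';                    // GPU-accelerated path
  if (typeof WebAssembly !== 'undefined') return 'wasm';  // CPU fallback, much slower
  return 'unsupported';
}

console.log(pickBackend({ gpu: {} })); // 'webgpu'
console.log(pickBackend({}));          // 'wasm' in any Wasm-capable runtime
```

In a real page you would call `pickBackend(navigator)` once at startup and choose the model size accordingly.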

9bench itself is the demo: same browser code runs on Windows desktop, Mac, Linux, Chromebook, Android tablet, iPad. We didn't ship 6 builds. We shipped one URL.

Use case: educational / demo / "is local AI possible on my machine?"

For users who don't know what an LLM is, asking them to install Ollama is a non-starter. Loading a webpage and clicking a button isn't. Browser LLM at 5-15 t/s is dramatically more accessible than native LLM at 50-100 t/s — because most users will never get to the 50-100.

When the browser is NOT enough

Use native (llama.cpp / Ollama / LM Studio / MLX) when:

  1. You generate long outputs (2000+ tokens) or run batch jobs where total throughput matters.
  2. You need models above ~13B parameters, which exceed browser memory limits.
  3. You need long-context machinery (paged attention, Flash-Attention) that browser stacks don't ship yet.
  4. You paid for a high-end GPU and want the 5-10× it can actually deliver.

Rule of thumb: if you'd hit "stop generation" and edit the model output mid-stream, the browser is fine. If you'd kick off a 30-minute run and walk away, use native.

How 9bench reports browser-vs-native — and why

We built two pieces specifically to address this confusion:

  1. Browser score (measured): What 9bench actually observes in your browser. This is the honest "what you'd get from a Phi-3-mini chat in this tab right now" number.
  2. Native estimate (calibrated): Looked up from a curated GPU-class table (50+ entries) based on your detected GPU. This is an estimate of what llama.cpp / Ollama would generate on the same hardware.

Both numbers appear side-by-side on the result page, with the gap explained. Goal: nobody walks away thinking "9bench says my 4090 only does 20 t/s, the marketing was lying" when the real story is "browser does 20, native does 150, and both numbers are correct".
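The two-number report boils down to a measured value plus a table lookup. The GPU classes and estimates below are hypothetical placeholders, not 9bench's curated data:

```javascript
// Side-by-side result: measured browser t/s plus a calibrated native
// estimate looked up by GPU class. Table values are placeholders.
const NATIVE_ESTIMATE_TPS = {
  'rtx-4090': 150,
  'rtx-4070-super': 62,
  'apple-m3-max': 65,
};

function buildResult(gpuClass, measuredBrowserTps) {
  const native = NATIVE_ESTIMATE_TPS[gpuClass] ?? null; // null if GPU unrecognised
  return {
    browserTps: measuredBrowserTps,
    nativeTpsEstimate: native,
    gap: native ? +(native / measuredBrowserTps).toFixed(1) : null,
  };
}

console.log(buildResult('rtx-4090', 20)); // both numbers shown, gap made explicit
```

Reporting the gap as its own field is the point: the user sees "browser 20, native ~150" instead of one number with no context.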

📊 Try it yourself
Run 9bench, scroll to AI Capabilities, click "Run live AI test". Wait ~60 seconds for the first run (model downloads). The result page shows your browser tokens/s + a calibrated native estimate. You'll see the gap on your specific hardware.

What changes in 2026-2027

The browser-vs-native gap is shrinking. Three forces are pushing it down:

WebGPU 2026 features (FP16 baseline, subgroup ops, FP8)

WebGPU spec is adding subgroup operations (warp-level primitives equivalent to CUDA's __shfl) and broader FP8 support. These were the missing primitives needed for tensor-core-class kernels in browser. Chrome 132+ has experimental support; full rollout expected late 2026. Impact: 1.5-2× speedup on top-end GPUs.

WebGPU Tensor extension (proposal)

Direct tensor-core access via a WebGPU extension is in early proposal stage. If it lands (2027 realistic), browser LLM on RTX/H100/M-series would close most of the remaining gap. "If" is doing real work in that sentence — extensions like this often take 2-3 years.

Better browser-native ML libraries

transformers.js v3 (early 2026) added Wasm SIMD threadpool optimisations. WebLLM is shipping paged attention. Microsoft's ONNX Runtime Web is integrating Flash-Attention. Each of these is a 1.2-1.5× compounding speedup.

Stacked: by mid-2027, expect browser LLM at ~3× slower than native (vs 5-10× today). That's the realistic ceiling. Native will always have some edge because of unrestricted memory and vendor extensions.
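The "stacked" arithmetic is multiplicative: each improvement divides the remaining gap. With illustrative factors only, applied to an assumed 6× gap today:

```javascript
// Compounding the forecast: each speedup factor divides the remaining
// browser-vs-native gap. Factors are rough illustrations from the
// sections above, not predictions.
const gapToday = 6;
const speedups = [1.75, 1.35]; // WebGPU subgroups/FP8, library-level gains
const gap2027 = speedups.reduce((gap, s) => gap / s, gapToday);
console.log('projected gap: ~' + gap2027.toFixed(1) + 'x');
```

Two modest factors already land in the ~2.5-3× range, which is where the article's mid-2027 ceiling comes from.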

Practical recommendations

If you're a user: use native (Ollama, LM Studio) for serious work. Use browser LLM (9bench, demos on websites) for triage and "does this work on my machine" checks.

If you're building a product:

  1. Gate features on measured tokens/s, not on the detected GPU name.
  2. Ship a ladder of model sizes (e.g. 0.5B to 3.8B) and pick based on available memory.
  3. Download weights lazily, cache them, and show honest first-run progress.
  4. Offer a native or server-side path for workloads the browser can't carry.

If you're benchmarking: always cite which mode you measured. "RTX 4090 does X tokens/s" is not a complete sentence. "RTX 4090 does X t/s in the browser via 9bench" or "Y t/s on llama.cpp native" is.

Common questions

"Can I run Llama 70B in the browser?" Technically yes via WebLLM with model sharding. Practically: no. Memory caps prevent it on most browsers. Even on Chrome with flags enabled, 70B Q4 (~40 GB) exceeds the per-tab limit. Stick to 7B-13B in browser.

"Why does the first run take so long?" First run = downloading model weights (1-5 GB) + JIT-compiling Wasm/WGSL kernels. Subsequent runs in the same browser reuse cached weights and warm shaders. Typical speedup: 5-10× on second run. 9bench shows this in the load-time breakdown.

"Does WebGPU work on iPhone Safari?" WebGPU is enabled by default in Safari 17.4+ (iOS 17.4+). Most iPhone 14+ devices can run small LLMs via transformers.js. iPad Pro M-series chips run them comfortably. Older iPhones and iOS 16- fall back to Wasm-only, much slower.

"Is WebLLM faster than transformers.js?" Slightly, in some cases. WebLLM (from MLC) implements paged attention and uses TVM-compiled kernels. transformers.js is more general-purpose and supports more model architectures. For Llama-class models, WebLLM is ~20-40% faster. For general experimentation, transformers.js has wider model support.

Run a live LLM test in your browser — 60 seconds

9bench downloads a small open-weight LLM (Phi-3-mini on most machines, smaller fallbacks for memory-constrained browsers) and measures real tokens/second on your hardware. Compare against the native estimate on your result page. No install, no account, model cached after first run for instant subsequent tests.

Test browser LLM speed →