9bench includes a "live LLM test" that runs a small open-weight LLM in your browser (Phi-3-mini on machines with ~3.5 GB+ of browser memory headroom; Qwen 0.5B or DistilGPT-2 on memory-constrained devices) and reports real tokens/second. The first thing users notice: the number is much lower than what they see in Ollama or LM Studio on the same machine. RTX 4090 owners report 15-25 tokens/s in the 9bench browser test versus 100-160 t/s in llama.cpp. That gap is real, and it is not a 9bench bug.
This article explains why browser inference is slower, by how much, when the browser is fast enough anyway, and what will change over the next 12-24 months.
The headline numbers
Measured on identical hardware, same model (Llama 7B Q4_K_M), same prompt, default settings:
| Hardware | Native (llama.cpp) | Browser (transformers.js) | Slowdown |
|---|---|---|---|
| RTX 4090 (24 GB) | 100-160 t/s | 15-25 t/s | ~6-7× |
| RTX 4070 Super (12 GB) | 50-75 t/s | 10-18 t/s | ~5× |
| RTX 4060 Laptop (8 GB) | 25-45 t/s | 5-12 t/s | ~5× |
| Apple M3 Max | 50-80 t/s (MLX) | 8-15 t/s | ~6-8× |
| Apple M2 (base) | 20-35 t/s (MLX) | 3-7 t/s | ~5-6× |
| Ryzen 7840U iGPU (Radeon 780M) | 8-18 t/s (Vulkan) | 2-4 t/s | ~4-5× |
| Intel Iris Xe iGPU | 5-12 t/s (CPU+iGPU) | 1.5-3 t/s | ~3-4× |
Two things stand out. First, the slowdown ratio is fairly consistent at 4-8× across all hardware classes, so it's not a quirk of one chip. Second, weaker hardware suffers a slightly smaller penalty (3-4× on iGPUs vs 6-7× on top-end discrete GPUs): the browser's biggest handicap is losing tensor cores and vendor-tuned kernels, and iGPUs give native code little of that advantage to exploit in the first place.
Why is the browser slower? Five concrete reasons
1. WebAssembly overhead (1.5-2× slower than native code)
WebAssembly (Wasm) is a portable bytecode that browsers JIT-compile to machine code. Wasm SIMD adds vectorised ops which help, but Wasm code still runs slower than equivalent native C++ for several reasons:
- Bounds checking on memory access (security requirement)
- Limited visibility into register pressure, so the JIT can't always allocate registers optimally
- No vendor-specific intrinsics (AVX-512, NEON-Dotprod, AMX) that llama.cpp uses heavily
- No tail-call optimisation in many engines
Benchmarks comparing Wasm SIMD against native AVX2 consistently show a 1.5-2× slowdown. For the parts of LLM inference that run on the CPU (tokenisation, sampling, KV-cache management), this is the dominant cost.
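Whether a given machine even stays inside that 1.5-2× band depends on the engine exposing SIMD and threads at all. A minimal sketch of the check, using the wasm-feature-detect npm package (a common approach in browser ML code; the package and its exports are assumptions you can swap for your own probes):

```ts
// Minimal sketch: detect which Wasm extensions this engine supports.
// Engines without SIMD or threads fall back to scalar, single-threaded
// kernels, which widens the gap well beyond 1.5-2x.
import { simd, threads } from "wasm-feature-detect";

export async function wasmCapabilities(): Promise<void> {
  const hasSimd = await simd();       // 128-bit vector ops (v128)
  const hasThreads = await threads(); // SharedArrayBuffer + atomics
  console.log(`Wasm SIMD: ${hasSimd}, threads: ${hasThreads}`);
}
```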
2. WebGPU has stricter constraints than CUDA / Metal / Vulkan
WebGPU is the new standard, and genuinely impressive, but it deliberately restricts certain operations for security and portability. The constraints that matter for LLM inference:
- Workgroup memory size limited to 16 KB on most GPUs (CUDA allows 48-100 KB depending on chip)
- No persistent kernel programs — every dispatch reinitialises state
- FP16 (shader-f16) is optional and missing on many laptops/older GPUs. Without it, matmul runs in FP32 → 2× memory bandwidth, 2× compute
- No int8 or int4 dot products at the WebGPU level (yet — coming in WGSL extensions)
- No tensor cores directly accessible — you get general-purpose shaders only
These add up. A native llama.cpp build for an RTX 4090 uses CUDA tensor cores in FP16 with int4 weight de-quantisation kernels. The WebGPU equivalent runs general-purpose FP16 (if available) without tensor-core acceleration. That alone explains 2-3× of the gap.
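You can inspect these constraints on your own machine. The sketch below uses the standard WebGPU API; the thresholds in the comments come from the list above, and `maxComputeWorkgroupStorageSize` defaults to 16 KB per the spec:

```ts
// Minimal sketch: report the WebGPU features and limits that matter most
// for LLM kernels on this browser/GPU combination.
export async function probeWebGPU(): Promise<void> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) {
    console.log("No WebGPU adapter: inference falls back to Wasm-only");
    return;
  }

  // Optional feature: without shader-f16, matmuls run in FP32 at roughly
  // twice the memory traffic and twice the compute.
  const hasF16 = adapter.features.has("shader-f16");

  // Workgroup (shared) memory cap: 16 KB by default, vs 48-100 KB in CUDA.
  const wgMem = adapter.limits.maxComputeWorkgroupStorageSize;

  // Largest single buffer a tab can allocate: bounds usable model size.
  const maxBuf = adapter.limits.maxBufferSize;

  console.log(`shader-f16: ${hasF16}`);
  console.log(`workgroup memory: ${(wgMem / 1024).toFixed(0)} KB`);
  console.log(`max buffer: ${(maxBuf / 2 ** 30).toFixed(1)} GiB`);

  // Request the device with f16 enabled when the adapter offers it.
  await adapter.requestDevice({
    requiredFeatures: hasF16 ? ["shader-f16"] : [],
  });
}
```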
3. Memory bandwidth bottleneck (browser caps + sandbox cost)
LLM inference is bandwidth-bound. Each generated token requires reading every weight from VRAM. The faster you can read, the faster tokens come out.
In the browser:
- WebGPU has to validate every buffer transfer (security)
- Browser tabs are typically capped at 4-8 GB of GPU memory by the browser and OS
- WebGPU buffer reads may be coalesced through CPU-side staging in some implementations
- The browser process and the GPU process are separate, so every call crosses an IPC boundary
Native llama.cpp has direct, unlimited GPU memory access. It can map the entire model into pinned VRAM and read straight from compute kernels. The browser must marshal through layers. On a 4090, the effective bandwidth available to a browser tab is closer to 60-70% of what native gets.
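A back-of-envelope calculation shows why that bandwidth haircut matters. Decoding is bandwidth-bound, so an upper bound on tokens/s is effective bandwidth divided by model size. The figures below (4090 peak bandwidth, Q4 model size) are approximations for illustration:

```ts
// Bandwidth-bound decoding ceiling: each generated token reads roughly all
// model weights once, so tokens/s <= effective_bandwidth / model_bytes.
function tokenCeiling(bandwidthGBs: number, modelGB: number): number {
  return bandwidthGBs / modelGB;
}

const modelGB = 4.0;                 // Llama 7B Q4_K_M weights, approx.
const nativeBw = 1008;               // RTX 4090 peak VRAM bandwidth, GB/s
const browserBw = nativeBw * 0.65;   // ~60-70% effective in a browser tab

console.log(tokenCeiling(nativeBw, modelGB).toFixed(0));  // ~252 t/s ceiling
console.log(tokenCeiling(browserBw, modelGB).toFixed(0)); // ~164 t/s ceiling
// Real throughput sits far below both ceilings (compute, KV cache, and
// sampling all take time), but the bandwidth loss alone costs ~35% before
// any of the other browser penalties apply.
```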
4. No KV-cache optimisations browser-side (yet)
Modern LLM serving uses sophisticated KV-cache strategies: paged attention, sliding-window cache, Flash-Attention 2/3 for long contexts. These are critical when generating many tokens or handling long prompts.
transformers.js, as of early 2026, implements a basic KV cache but not paged attention or Flash-Attention. WebLLM has slightly better KV handling. Native llama.cpp has all of them.
For short responses (200 tokens), this matters less. For long generations (2000+ tokens) or long-context conversations, the gap widens.
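The stakes are easy to quantify. Here's a rough KV-cache footprint for a Llama-7B-class model; the shapes (32 layers, 32 KV heads, head dim 128) are the common configuration, used purely for illustration:

```ts
// KV-cache bytes per token of context:
//   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
function kvBytesPerToken(
  layers: number, kvHeads: number, headDim: number, bytes: number,
): number {
  return 2 * layers * kvHeads * headDim * bytes;
}

const perToken = kvBytesPerToken(32, 32, 128, 2);  // FP16 cache
console.log(perToken / 1024);                      // 512 KB per token
console.log((perToken * 2048) / 2 ** 30);          // 1 GiB at 2K context
// Paged attention exists to manage exactly this growth; without it, long
// contexts eat straight into the tab's already-capped GPU memory.
```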
5. Initial load time + warm-up cost
Native LLM tools download the model once and load it into memory. Subsequent runs reuse the loaded weights.
Browser LLM:
- First visit: download model from Hugging Face CDN (1.7 GB for Phi-3-mini, 4.5 GB for Llama 7B Q4)
- Subsequent visits: weights are cached in the browser via IndexedDB (good!), but JIT compilation of WebAssembly and WGSL shaders happens on every page load (5-15 seconds)
- The first token after a page load is much slower than subsequent tokens (warm-up cost)
For "open the page, ask one question, leave" use cases, this initial cost dominates the actual inference cost. tokens/s during steady-state generation is the right thing to compare, which is what 9bench measures.
When the browser is fast enough anyway
Slower than native isn't the same as too slow. For many use cases, browser LLM is plenty.
Use case: interactive writing assistance (10+ t/s is enough)
A blog editor that suggests sentence completions doesn't need 100 t/s. It needs 5-15 t/s consistently, with low latency to the first token. Browser LLM hits this on virtually any GPU released since 2020.
Notable examples shipping in 2025-2026: Notion AI's optional client-side mode, GitHub Copilot's experimental "private browser inference", several form-autofill services.
Use case: privacy-sensitive client-side AI (no upload allowed)
Healthcare apps, legal documents, financial planning tools — anywhere the user data must not leave the device. Browser LLM is the only option that keeps data on the client without requiring the user to install software.
The privacy win is total: model weights download to the browser, inference happens locally, nothing about the user's input ever touches a server. No "we promise we won't train on your data" trust required.
Use case: cross-platform deployment (mobile + desktop + Chromebook)
Native LLM tools require platform-specific builds. Mobile is especially hard (iOS forbids JIT, complicating Wasm; Android has fragmented GPU support). Browser LLM runs anywhere with a modern Chromium.
9bench itself is the demo: same browser code runs on Windows desktop, Mac, Linux, Chromebook, Android tablet, iPad. We didn't ship 6 builds. We shipped one URL.
Use case: educational / demo / "is local AI possible on my machine?"
For users who don't know what an LLM is, asking them to install Ollama is a non-starter. Loading a webpage and clicking a button isn't. Browser LLM at 5-15 t/s is dramatically more accessible than native LLM at 50-100 t/s — because most users will never get to the 50-100.
When the browser is NOT enough
Use native (llama.cpp / Ollama / LM Studio / MLX) when:
- Coding agents / autonomous loops — these often need 50+ t/s for usable iteration speed. Browser caps you at ~20 t/s on top hardware.
- Long-context tasks (32K+ tokens) — browser KV-cache handling degrades sharply above ~4K context.
- Batch processing — running 100 queries through a model. Native batches efficiently; browser does not.
- 70B+ models — browser memory caps prevent loading them. You're effectively capped around 7B-13B even on a 4090.
- Fine-tuning / training — basically not viable in browser at this point. Use native.
Rule of thumb: if you'd hit "stop generation" and edit the model output mid-stream, browser is fine. If you'd kick off a 30-minute run and walk away, use native.
How 9bench reports browser-vs-native — and why
We built two pieces specifically to address this confusion:
- Browser score (measured): What 9bench actually observes in your browser. This is the honest "what you'd get from a Phi-3-mini chat in this tab right now" number.
- Native estimate (calibrated): Looked up from a curated GPU-class table (50+ entries) based on your detected GPU. This is what llama.cpp / Ollama would generate on the same hardware.
Both numbers appear side-by-side on the result page, with the gap explained. Goal: nobody walks away thinking "9bench says my 4090 only does 20 t/s, the marketing was lying" when the real story is "browser does 20, native does 150, and both numbers are correct".
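Mechanically, the native estimate is a table lookup keyed off the detected GPU. A minimal sketch of the idea follows; the table entries and matching rule are illustrative, not 9bench's actual calibration data:

```ts
type NativeEstimate = { match: RegExp; nativeTps: [number, number] };

// Illustrative entries; the real table has 50+ GPU classes.
const NATIVE_TABLE: NativeEstimate[] = [
  { match: /rtx 4090/i, nativeTps: [100, 160] },
  { match: /rtx 4070/i, nativeTps: [50, 75] },
  { match: /apple m3/i, nativeTps: [50, 80] },
];

export async function nativeEstimate(): Promise<[number, number] | null> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) return null;
  // GPUAdapterInfo exposes vendor/architecture/description strings; how
  // much detail you get varies by browser for fingerprinting reasons.
  const { vendor, architecture, description } = adapter.info;
  const id = `${vendor} ${architecture} ${description}`;
  return NATIVE_TABLE.find((e) => e.match.test(id))?.nativeTps ?? null;
}
```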
What changes in 2026-2027
The browser-vs-native gap is shrinking. Three forces are pushing it down:
WebGPU 2026 features (FP16 baseline, subgroup ops, FP8)
The WebGPU spec is adding subgroup operations (warp-level primitives equivalent to CUDA's __shfl) and broader FP8 support. These were the missing primitives needed for tensor-core-class kernels in the browser. Chrome 132+ has experimental support; full rollout is expected late 2026. Impact: 1.5-2× speedup on top-end GPUs.
WebGPU Tensor extension (proposal)
Direct tensor-core access via a WebGPU extension is in early proposal stage. If it lands (2027 realistic), browser LLM on RTX/H100/M-series would close most of the remaining gap. "If" is doing real work in that sentence — extensions like this often take 2-3 years.
Better browser-native ML libraries
transformers.js v3 (early 2026) added Wasm SIMD threadpool optimisations. WebLLM is shipping paged attention. Microsoft's ONNX Runtime Web is integrating Flash-Attention. Each of these is a 1.2-1.5× compounding speedup.
Stacked: by mid-2027, expect browser LLM at roughly 3× slower than native (vs 4-8× today). That's the realistic ceiling. Native will always keep some edge because of unrestricted memory access and vendor extensions.
Practical recommendations
If you're a user: use native (Ollama, LM Studio) for serious work. Use browser LLM (9bench, demos on websites) for triage and "does this work on my machine" checks.
If you're building a product:
- Privacy-critical or wide-distribution → browser LLM. Accept the slowdown; it's worth it for the deployment story.
- Power-user tooling → native. Ship as Tauri/Electron app or CLI. Don't apologise for "users have to install".
- Hybrid → browser as fallback, native as accelerator. Best UX: works on any device, 10× faster on power-user setups.
If you're benchmarking: always cite which mode you measured. "RTX 4090 does X tokens/s" is not a complete sentence. "RTX 4090 does X t/s on browser via 9bench" or "Y t/s on llama.cpp native" is.
Common questions
"Can I run Llama 70B in the browser?" Technically yes via WebLLM with model sharding. Practically: no. Memory caps prevent it on most browsers. Even on Chrome with flags enabled, 70B Q4 (~40 GB) exceeds the per-tab limit. Stick to 7B-13B in browser.
"Why does the first run take so long?" First run = downloading model weights (1-5 GB) + JIT-compiling Wasm/WGSL kernels. Subsequent runs in the same browser reuse cached weights and warm shaders. Typical speedup: 5-10× on second run. 9bench shows this in the load-time breakdown.
"Does WebGPU work on iPhone Safari?" WebGPU is enabled by default in Safari 17.4+ (iOS 17.4+). Most iPhone 14+ devices can run small LLMs via transformers.js. iPad Pro M-series chips run them comfortably. Older iPhones and iOS 16- fall back to Wasm-only, much slower.
"Is WebLLM faster than transformers.js?" Slightly, in some cases. WebLLM (from MLC) implements paged attention and uses TVM-compiled kernels. transformers.js is more general-purpose and supports more model architectures. For Llama-class models, WebLLM is ~20-40% faster. For general experimentation, transformers.js has wider model support.
Run a live LLM test in your browser — 60 seconds
9bench downloads a small open-weight LLM (Phi-3-mini on most machines, smaller fallbacks for memory-constrained browsers) and measures real tokens/second on your hardware. Compare against the native estimate on your result page. No install, no account, model cached after first run for instant subsequent tests.
Test browser LLM speed →