Inference Without a Round Trip

This post covers how in-browser LLM inference works, what it costs, and what it changes about product design. The demo on this page runs a real model on your GPU, in your browser. No API call leaves the tab. Top-of-the-line LLMs are still too big to run in a browser, but models in the 0.5B to 4B parameter range are now small enough to fit and good enough to be useful. The stack of technologies that makes this possible is new, but the implications for product design are bigger than the novelty. Local inference isn't just a fun demo. It's a new architecture for AI features, with different tradeoffs and failure modes than the hosted-LLM model.

01. THE SHIFT

For the past few years, running a model in a datacenter wasn't a decision anyone debated. It was the only reasonable choice, so obvious that nobody bothered writing it down. And the reasoning held up. Large language models needed hardware measured in terabytes of GPU memory, and even a modest serving cluster cost more per month than most product teams had discretion over. The browser was where you built the UI. Inference happened somewhere else, over HTTPS, billed by the token. Then three things changed, all at roughly the same time.

WebGPU landed across the browser matrix. Chrome shipped it in stable in 2023. Firefox followed on Windows in 2025, and Safari 26 enabled it by default the same year. By early 2026, WebGPU had become a baseline capability, present in browsers covering 85% of desktop traffic without flags, polyfills, or any user opt-in. For the first time, JavaScript running in a tab had real access to GPU compute, at a level you'd actually want to use for something other than games.

Quantization made the models small enough to move. 4-bit schemes such as llama.cpp's Q4 variants (distributed as GGUF files), AWQ, and MLC's q4f16_1 compress model weights roughly 4 times relative to fp16 originals and roughly 8 times relative to fp32. There's a quality cost on instruction-following tasks, and it's real, but it's modest enough that a 4-bit quantized 3B model still answers questions well enough to be useful. That tradeoff used to be theoretical. It became practical the moment it meant the difference between a file that needed a server rack and one that fits in a browser tab.

The third force was the models themselves. Qwen 2.5, Phi-3.5, and Gemma 2 are instruction-tuned models in the 0.5B to 4B parameter range, built specifically to perform well at small sizes. They aren't cut-down versions of larger models. They're designed from the start for environments where you have 4GB, not 40. At 4-bit precision, a 1.5B-parameter Qwen 2.5 model weighs around 1GB. A browser can download that. A user can wait for it.

The demo on this page is what those three changes add up to: inference happens on your GPU, through WebGPU, at your desk, and no API call ever leaves the tab. What follows is how that works, and what it costs.

02. THE STACK

Three layers sit between a user's prompt and a token appearing on screen. Looking at each layer separately makes the performance numbers and the failure modes a lot easier to read.

WebGPU, the compute substrate. WebGPU is not WebGL with a different name. WebGL was a graphics API pressed into service for compute workloads it was never designed to handle. WebGPU exposes explicit compute pipelines, typed buffer bindings, and a shading language called WGSL that's designed for general-purpose workloads. You describe a compute pass, bind your weight buffers and activation tensors, dispatch a shader, and read the results back. No triangle is ever involved. On Apple Silicon, WebGPU maps to Metal. On Windows, it maps to Direct3D 12, and on Linux and Android, to Vulkan. The browser abstraction is thin enough that matrix multiplications now run at speeds that would have been implausible in a tab two years ago.

WebAssembly, the runtime for the inference engine. Tokenization, attention orchestration, KV-cache management, and sampling all happen inside a .wasm module. The engine is typically a C++ or Rust codebase compiled to wasm: the ONNX Runtime core for onnxruntime-web, a TVM-compiled runtime for web-llm. There are three reasons to compile to wasm rather than ship equivalent JavaScript. Tight loops run without the overhead of a JIT that might deoptimize on you. SIMD intrinsics are available through wasm's fixed-width SIMD instructions. And memory layout is predictable and dense in ways that a GC'd heap can never quite manage. The engine calls into WebGPU for the heavy matrix math, and handles everything else itself.
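Of the jobs listed above, sampling is the easiest to show in isolation. What follows is a minimal sketch of temperature sampling over a logits array, written in TypeScript for readability. `sampleToken` and its parameters are illustrative names, not an API from any of the engines mentioned here, and a real engine does this inside the wasm module over typed buffers.

```typescript
// Picks the next token id from raw logits.
// temperature <= 0 means greedy decoding (always take the highest logit).
// `rand` is injectable so the behavior can be reproduced in tests.
function sampleToken(
  logits: number[],
  temperature: number,
  rand: () => number = Math.random,
): number {
  if (logits.length === 0) throw new Error("logits must not be empty");

  // Find the largest logit; also the answer for greedy decoding.
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  if (temperature <= 0) return best;

  // Softmax with temperature, shifted by the max logit for numerical stability.
  const weights: number[] = [];
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const w = Math.exp((logits[i] - logits[best]) / temperature);
    weights.push(w);
    total += w;
  }

  // Walk the cumulative distribution until the random draw is used up.
  let r = rand() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r < 0) return i;
  }
  return weights.length - 1; // guards against floating-point leftovers
}
```

Lower temperatures concentrate probability on the top logit, and higher ones flatten the distribution. Production engines usually layer top-k or top-p filtering on top of this step.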

Quantized weights, the model itself. A full-precision (fp32) 0.5B-parameter model takes up roughly 2 GB. q4f16_1, MLC's 4-bit quantization scheme with fp16 scales, brings that down to around 400 MB. The scheme stores 4 bits per weight, plus an fp16 scale factor shared by each group of 32 weights that maps the 4-bit integers back to approximate real values. Quality on instruction-following drops measurably, but it stays inside the range where the model still answers questions usefully. 400 MB fits in browser storage and survives across page loads without a re-download.
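The idea behind group-wise 4-bit quantization can be sketched in a few lines. This is a simplified symmetric variant, assuming one scale per group of 32 weights and codes left unpacked in a `Uint8Array`. It is not MLC's actual q4f16_1 bit layout, and the function names are made up for illustration.

```typescript
const GROUP_SIZE = 32;

interface QuantizedGroup {
  scale: number;     // one per group; fp16 in a real scheme, a JS number here
  codes: Uint8Array; // one code in 0..14 per weight (unpacked for readability)
}

// Maps each weight to an integer in [-7, 7], stored with an offset of 7.
function quantizeGroup(weights: Float32Array): QuantizedGroup {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  const scale = maxAbs > 0 ? maxAbs / 7 : 1; // avoid dividing by zero
  const codes = new Uint8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.max(-7, Math.min(7, Math.round(weights[i] / scale)));
    codes[i] = q + 7;
  }
  return { scale, codes };
}

// Recovers approximate weights; the error per weight is at most scale / 2.
function dequantizeGroup(group: QuantizedGroup): Float32Array {
  const out = new Float32Array(group.codes.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = (group.codes[i] - 7) * group.scale;
  }
  return out;
}

// Storage cost in this sketch: 4 bits per weight plus a 16-bit scale per group.
const bitsPerWeight = 4 + 16 / GROUP_SIZE; // 4.5
```

The per-group scale is what keeps the error bounded: a group of small weights gets a small scale, so its rounding error shrinks with it.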

Three libraries have production-quality implementations of this stack. They overlap in what they can do, and they differ mostly in where they aim.

@mlc-ai/web-llm is what this demo uses. It ships a batteries-included chat API, a prebuilt model catalog with browser-side caching (the Cache API by default, IndexedDB as an option), and an OpenAI-compatible interface. The runtime underneath is Apache TVM compiled for the browser. It's the right choice when you're building a chat-style UI and want to skip writing model-loading infrastructure yourself.

@huggingface/transformers (transformers.js) is more general. It covers a wider model catalog, including encoders, classifiers, embedding models, and seq2seq, and gives you lower-level control over inputs and outputs. It's the right choice for non-chat workloads: semantic search, zero-shot classification, audio transcription.

onnxruntime-web uses the Microsoft ONNX Runtime compiled to wasm. If your pipeline already starts from an ONNX model export, or if you need the ONNX Runtime's quantization and optimization toolchain upstream, this is the runtime to stay consistent with.

Here's what loading and streaming look like in web-llm:

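```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads the weights (or loads them from cache) and compiles the shaders.
const engine = await CreateMLCEngine("Qwen2.5-0.5B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (r) => console.log(r.text, r.progress),
});

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true,
});

// process.stdout does not exist in a browser tab, so append to the page instead.
// "#output" is a placeholder id; point it at whatever element shows the reply.
const output = document.querySelector("#output")!;
for await (const chunk of stream) {
  output.append(chunk.choices[0]?.delta?.content ?? "");
}
```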

03. WHAT IT FEELS LIKE

The first time this demo loads, your browser downloads around 400 MB. That's the Qwen2.5-0.5B-Instruct-q4f16_1-MLC model: 0.5 billion parameters at 4-bit quantization, packaged as a set of weight shards that web-llm fetches over HTTPS. On a home connection running 50 to 150 Mbps, which covers most residential broadband in North America and Western Europe, the download alone takes roughly 20 seconds to a minute at full line rate, and longer whenever the connection or the CDN delivers less than that. Add shader compilation on top. WebGPU has to compile the WGSL shaders against your specific GPU driver before the first forward pass can run, and you can't count on the browser caching that work across visits the way it caches compiled JavaScript. The combined wait is real. The progress bar isn't decorative.
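The download part of that wait is simple arithmetic, sketched below. `downloadSeconds` is an illustrative helper, and it assumes the link runs at its full advertised rate with no protocol overhead, so treat the result as a lower bound rather than a prediction.

```typescript
// Best-case transfer time: megabytes to megabits, divided by link speed.
function downloadSeconds(sizeMB: number, linkMbps: number): number {
  const megabits = sizeMB * 8;
  return megabits / linkMbps;
}

const slow = downloadSeconds(400, 50);  // 64 seconds
const fast = downloadSeconds(400, 150); // about 21 seconds
```

Real transfers run slower than this because of shared Wi-Fi, CDN throughput, and per-shard request overhead, and shader compilation comes on top.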

That cost is paid once per device and browser profile. web-llm caches the weights in browser storage after the first load (the Cache API by default, IndexedDB if configured). On subsequent visits, the browser finds the cache, skips the download, and the shader pipeline typically compiles from a much warmer start. The gap between a cold first load and a warm return visit is large enough that they feel like different products. Most users who encounter this in a production app will hit the cold path once per device, then the warm path for as long as the browser keeps the cached weights.

In steady state, once the model is loaded and the GPU is running, throughput depends almost entirely on the GPU in the user's machine. On Apple Silicon (M1, M2, M3), the Qwen 0.5B model at 4-bit typically produces around 40 to 80 tokens per second in a browser context. Recent discrete GPUs on Windows (RTX 30- and 40-series and comparable AMD hardware) land in a similar range or higher. Integrated GPUs (Intel Iris Xe, older AMD iGPUs, pre-M1 Intel Macs) run the model at roughly 10 to 25 tok/s. That's still usable for a chat interface, but it's noticeably slower. You can see the generation pause between sentences. All of these numbers are reference points, not guarantees. Thermal throttling, browser overhead, and background load on the same GPU will all shift the actual rate.
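Since the figures above are only reference points, it can be worth measuring the rate on the user's own machine. A minimal sketch, assuming you record one timestamp per streamed chunk and that each chunk carries roughly one token; `tokensPerSecond` is an illustrative helper, not part of any library.

```typescript
// Steady-state decode rate from chunk arrival times in milliseconds.
// The first timestamp is the baseline, so N timestamps cover N - 1 tokens.
function tokensPerSecond(arrivalTimesMs: number[]): number {
  if (arrivalTimesMs.length < 2) return 0;
  const first = arrivalTimesMs[0];
  const last = arrivalTimesMs[arrivalTimesMs.length - 1];
  const elapsedMs = last - first;
  if (elapsedMs <= 0) return 0;
  return ((arrivalTimesMs.length - 1) * 1000) / elapsedMs;
}

// In a browser, push performance.now() into an array inside the streaming loop,
// then call tokensPerSecond(times) once generation finishes.
```

Measuring from the first chunk onward deliberately excludes time-to-first-token, which is a separate number with a separate cause.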

The latency profile is different from a hosted API in a way that raw tok/s numbers don't capture. With a hosted API, the first token has to survive a network round trip, typically 100 to 400 ms to a datacenter, even on a low-latency connection. With local inference, the first token arrives as soon as the GPU has processed the prompt, which for a 0.5B model and a short prompt takes tens of milliseconds at the throughput figures above. Time-to-first-token for local inference on a mid-range machine can actually be faster than a cloud endpoint, even though a server GPU running the same model at full precision would be producing tokens four or five times faster overall. The experience feels qualitatively different. Generation starts immediately, and the pacing is set by the GPU in front of you rather than by datacenter geography and request queue depth. Whether that tradeoff is worth 400 MB of cold load cost depends on what you're building and who your users are.
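The tradeoff in that paragraph can be put into a small model: total response time is time-to-first-token plus generation time. The sketch below finds the response length at which a hosted endpoint catches up with a local one. All numbers in the example are assumptions chosen for illustration, not measurements, and the names are hypothetical.

```typescript
interface Endpoint {
  ttftMs: number;          // time to first token
  tokensPerSecond: number; // steady-state decode rate
}

// Total time to deliver a response of the given length.
function responseTimeMs(e: Endpoint, tokens: number): number {
  return e.ttftMs + (tokens / e.tokensPerSecond) * 1000;
}

// Response length at which both endpoints finish at the same moment.
// Below it, the endpoint with the lower TTFT finishes first.
// Assumes local has the lower TTFT and hosted the higher decode rate.
function breakEvenTokens(local: Endpoint, hosted: Endpoint): number {
  const perTokenGapMs = 1000 / local.tokensPerSecond - 1000 / hosted.tokensPerSecond;
  if (perTokenGapMs <= 0) return Infinity; // hosted never catches up
  return Math.max(0, (hosted.ttftMs - local.ttftMs) / perTokenGapMs);
}

// Assumed numbers: local starts in 50 ms at 50 tok/s,
// hosted starts in 300 ms at 250 tok/s (five times faster).
const local: Endpoint = { ttftMs: 50, tokensPerSecond: 50 };
const hosted: Endpoint = { ttftMs: 300, tokensPerSecond: 250 };
const crossover = breakEvenTokens(local, hosted); // 15.625 tokens
```

Under these assumptions local inference finishes first only for very short completions, but it starts first for every completion, and the start is the part users perceive.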

04. WHAT IT BREAKS

Local inference rewrites several product design patterns that relied on the round trip.

Privacy-by-default chat. The hosted-LLM threat model has a familiar shape. Prompts leave the device, land on a third-party server, get logged to some retention window, and may show up in an audit trail. That's the model that drove the "no AI tools" policies at law firms, newsrooms, and healthcare orgs. When inference runs on-device and no transcript ever leaves the browser, the attack surface collapses down to the device itself, which is a threat model most of those organizations already know how to reason about. Local inference doesn't eliminate privacy risk. It moves it somewhere more tractable. For legal, medical, and journalism contexts, that's a meaningful distinction, and not just a cosmetic one.

Zero-marginal-cost AI features in SaaS. SaaS unit economics assume that inference calls cost the vendor money. That assumption holds for large, general-purpose models. It stops holding for the 0.5B to 4B range. Things like recommendation engines, smart defaults, "summarize my notes," and autocomplete in long-form editors are all candidates for features that could run entirely in the browser at zero cost per call. Right now SaaS vendors either absorb that cost, rate-limit aggressively, or push users toward paid tiers to access AI features. If the inference moves to the client, those features change from a margin problem into a retention feature.

Offline-first apps. AI features have always been a carve-out in offline-first design, the part of the app that stops working when there's no network. A note-taking app could sync offline. Its "smart tag suggestions" could not. A translation tool could cache vocabulary. It couldn't run a model. That boundary is dissolving. Features that used to require connectivity now have a path that doesn't. The apps most affected are the ones that already did the hard work of building offline-first. They already cache data locally and manage sync. Local inference is a clean fit for that architecture.

End of the thin-wrapper moat. Products whose differentiation was a well-designed UI over a hosted API assumed that access to inference was the scarce resource. It wasn't, but it felt like it, because the economics made rolling your own serving impractical. At the 0.5B to 4B scale, that barrier is gone. The interesting product design question shifts from "where do we host inference?" to "where does inference actually need to happen?" That's a harder question, and it selects for different answers than "we have a good system prompt."

Having the option to run locally isn't the same thing as "always run locally."

05. THE GAPS

KV-cache memory scales roughly linearly with context length. A 32K-token context on a quantized 1B model needs a meaningful chunk of VRAM beyond the model weights themselves, because keys and values are cached for every token at every layer. That's enough to push against browser memory budgets on all but high-end hardware. Long-context tasks (reading a long document, synthesizing a full codebase, holding a multi-turn research thread over hours) belong on a server. That isn't a temporary limitation that better weight quantization will eventually fix. The math doesn't change.
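The memory in question is the KV cache, and its size follows from the attention shape. A back-of-envelope sketch: the shape below (16 layers, 8 key/value heads, head dimension 64, fp16 cache) is an assumed, illustrative configuration in the ballpark of a 1B-parameter model with grouped-query attention, not the configuration of any model named in this post.

```typescript
interface AttentionShape {
  layers: number;        // transformer blocks
  kvHeads: number;       // key/value heads per layer
  headDim: number;       // dimension of each head
  bytesPerValue: number; // 2 for an fp16 cache
}

// Each token stores one key and one value vector per KV head, per layer.
function kvCacheBytes(shape: AttentionShape, contextTokens: number): number {
  const perToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue;
  return perToken * contextTokens;
}

// Assumed shape, for illustration only.
const assumedShape: AttentionShape = {
  layers: 16,
  kvHeads: 8,
  headDim: 64,
  bytesPerValue: 2,
};

const perTokenBytes = kvCacheBytes(assumedShape, 1);      // 32768 bytes (32 KiB)
const at32k = kvCacheBytes(assumedShape, 32 * 1024);      // 1073741824 bytes (1 GiB)
```

With this shape the cache at 32K tokens is larger than the 4-bit weights of the model it serves. Quantizing the cache shrinks the constant, but the growth with context length stays linear.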

Frontier-class models aren't coming to the browser. A 70B model at 4-bit quantization weighs roughly 35GB. Consumer machines with 35GB of GPU-accessible memory exist, but they aren't the browsers you're shipping to. Quantization closed the gap between "requires a datacenter" and "runs on a laptop." It doesn't close a 50x gap. Tasks that need a frontier model (complex code reasoning, long-document synthesis, multi-step research) will stay on a server. Useful small-model tasks are plentiful. They aren't everything.
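The size figures in this post all come from the same multiplication. A minimal sketch, with `weightBytes` as an illustrative helper that ignores per-group scale factors and any tensors kept at higher precision:

```typescript
// Raw weight storage: parameters times bits per weight, converted to bytes.
function weightBytes(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8;
}

const frontier4bit = weightBytes(70e9, 4); // 35e9 bytes, about 35 GB
const small32bit = weightBytes(0.5e9, 32); // 2e9 bytes, about 2 GB
```

Real quantized files come out somewhat larger than this estimate once scale factors and metadata are included.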

Mobile is unreliable for now. Safari enabled WebGPU by default in version 26 in 2025, on macOS and iOS alike, but the iOS side comes with more caveats: memory ceilings are tighter, background tabs lose GPU state more aggressively, and the user agent space is fragmented. A model that runs fine on an M2 Mac in Safari may not load on an iPhone 13 running the same Safari version. Treat local inference as a desktop feature. Mobile is graceful degradation. Fall back to a server endpoint rather than shipping a broken cold-load experience to half your users.
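A fallback like this usually starts with a capability check before any weights are fetched. The sketch below relies on two real WebGPU members, `navigator.gpu.requestAdapter()` and the adapter's `limits.maxBufferSize`. Everything else is an assumption for illustration: the function names, the minimal `*Like` types that stand in for the WebGPU type definitions, and the 256 MiB threshold, which is a placeholder rather than a recommendation.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  limits: { maxBufferSize: number };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "local" | "server";

// Hypothetical threshold: tune it against the model you actually ship.
const MIN_BUFFER_BYTES = 256 * 1024 * 1024;

// Pure decision step, kept separate so it can be unit tested.
function decideBackend(adapter: AdapterLike | null): Backend {
  if (adapter === null) return "server"; // no usable GPU adapter
  return adapter.limits.maxBufferSize >= MIN_BUFFER_BYTES ? "local" : "server";
}

// Browser-facing wrapper: any missing API or thrown error means "server".
async function chooseBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "server"; // WebGPU is not exposed at all
  try {
    return decideBackend(await nav.gpu.requestAdapter());
  } catch {
    return "server";
  }
}

// In a page: const backend = await chooseBackend(navigator as NavigatorLike);
```

A check like this can pass and the model can still fail to load, for example when a tab runs out of memory, so the server path is worth keeping as a runtime fallback as well.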

The first-load tax is structural. There's no engine-layer fix for "the weights have to get to the device before anything runs." A multi-hundred-megabyte download is an upfront cost that some product surfaces just can't absorb. Landing pages, conversion funnels, anything where bounce rates are measured in seconds: the cold load is a disqualification, not a tradeoff. IndexedDB caching makes the return visit cheap. It does nothing for the user who lands once and leaves.

The right framing is additive, not replacement. Local inference adds a deployment target. It doesn't retire cloud inference. The interesting design question is "which inference, where?", and for some workloads the answer is still "always server," just as it was before. The mistake is treating local as a categorical upgrade when really it's a new option in a set that already had working options.
