A WebGPU stack source-level walkthrough
A single line pass.dispatchWorkgroups(16384) is translated eight times before it reaches a GPU core: ① JS API → Blink WebIDL; ② Blink → renderer-side Wire; ③ Mojo IPC across processes; ④ GPU-process Dawn validation; ⑤ Tint compiles WGSL into SPIR-V/MSL/HLSL; ⑥ native API (Metal/D3D12/Vulkan); ⑦ vendor driver compiles to GPU ISA; ⑧ GPU command processor dispatches to SMs/CUs/EUs. This is a full WebGPU stack dissection — from one JS call to one GPU instruction, 25 chapters · Dawn + wgpu side-by-side · real source · real latencies.
From one CPU core to 16,384 GPU ALUs · from one JS line to one SIMT instruction
WebGPU's existence comes from five orders-of-magnitude differences. Putting them on one page explains why browsers need a second graphics API, why WebGL wasn't enough, why it took until 2023 — and why every one of the 24 chapters that follow is wrestling with one of these five formulas.
pass.dispatchWorkgroups(16384) traverses 8 layers before it ever reaches silicon.

These five formulas thread the whole book. Ch01-05 is the background for them, Ch06-21 is their step-by-step unpacking, Ch22-25 is their synthesis.
Three forces converging: modern native APIs · three browser proposals · W3C standardisation
WebGPU didn't appear out of nowhere. It's the product of three forces converging around 2017:
Full timeline:
Three names, one API, three codebases: users see navigator.gpu but underneath are three independent implementations — Dawn (Chrome, C++), wgpu (Firefox + Deno + Bevy + Servo, Rust), and WebKit's own (Safari, Obj-C++). This book contrasts the first two repeatedly; Safari's is too young and still missing key extensions.
Explicit ≠ safe · cross-platform ≠ consistent · fast ≠ portable
This is the most commonly asked question about WebGPU's design. Vulkan has existed since 2016, runs on Windows/Linux/Android, exposes explicit control, and performs close to the metal. Why not just navigator.vk = ... and let JS call it?
Because "explicit" ≠ "safe", "cross-platform" ≠ "consistent", and "fast" ≠ "portable". Each in turn:
Vulkan validation is opt-in and ~80% of installs ship it disabled. A bad vkCmdDraw (e.g. a vertex buffer with the wrong stride) on a release driver will read out-of-bounds memory. On a desktop app that's a crash; in a browser it's a cross-origin info leak — other origins' textures readable under the same GPU context. So WebGPU's design baseline is: every API call must be 100% validated before it reaches the GPU. That's why Dawn and wgpu each carry a thick device/validation layer.
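Vulkan leaves hardware variance to the app. A VkBuffer on NVIDIA aligns to 16 bytes, on Intel to 64, on ARM Mali to 256. Vulkan gives you VkPhysicalDeviceLimits.minStorageBufferOffsetAlignment and says handle it yourself; failing to is UB. WebGPU takes the opposite approach: the spec fixes the default minStorageBufferOffsetAlignment at 256 bytes on every implementation (spec §3.6.2), and a smaller value applies only if the app explicitly requests it and the adapter supports it. You waste a bit of space, but code written against the defaults behaves identically on every browser and every device.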
Vulkan doesn't run on macOS/iOS (directly). On Apple platforms you need MoltenVK, which translates Vulkan → Metal — one more layer means one more bug surface, and Apple actively abandoned the Vulkan WG. WebGPU's stance: assume no specific native API exists. Each browser picks the right backend (macOS → Metal, Windows → D3D12, Linux/Android → Vulkan). The user-facing API is the same.
There's also a non-technical reason often underweighted: Khronos moves slowly. Vulkan 1.3 needed 6 years to land dynamic rendering (2022); Metal had it on day one. The WebGPU WG moves faster because it doesn't need all three native APIs to ship a feature simultaneously — Dawn/wgpu only need to emulate it consistently. Example: subgroup ops are Vulkan 1.1 core, but on D3D12 they're "wave intrinsics" and on Metal they're "simdgroup". A single WGSL subgroup keyword maps to all three.
Vulkan was designed to feed GPUs from desktop AAA games; WebGPU was designed to safely expose the GPU to untrusted code (JS). Same hardware underneath, two audiences, two API shapes. WebGPU isn't "Vulkan-lite" — it's a different design point.
WebGPU vs WebGL vs Vulkan vs Metal vs D3D12 on four axes
"Safe, portable, fast" was an impossible triangle pre-2015. OpenGL was safe but slow; Vulkan was fast but unsafe; Metal was safe-and-fast but Apple-only. WebGPU bet you could have all three, with three concessions:
VK_KHR_acceleration_structure (ray tracing) isn't yet in WebGPU. subgroup only landed in 2025. Every new feature waits for at least 2 of 3 native APIs to support it before standardisation.

Set WebGPU side by side with its four relatives:
| | WebGL 2 | WebGPU | Vulkan | Metal | D3D12 |
|---|---|---|---|---|---|
| API surface | ~300 | ~120 | ~500 | ~250 | ~400 |
| Explicit sync | no | mostly auto | yes (fence/semaphore/barrier) | mostly auto | yes (fence + resource barrier) |
| Compute | none | first-class | yes | yes | yes |
| Validation mandatory | yes | yes (spec-required) | no (layer opt-in) | partial | no (Debug Layer opt-in) |
| Cross-OS | yes | yes | yes (macOS via MoltenVK) | no (Apple only) | no (Windows/Xbox only) |
| Shading lang | GLSL ES | WGSL | SPIR-V (compile from GLSL/HLSL) | MSL | HLSL / DXIL |
| Typical app | legacy Web 3D · games | Figma · transformers.js · Babylon · Bevy-on-Web | AAA Linux · Switch · Android games | native macOS/iOS apps | Windows games · Xbox |
| Design year | 2012 (OpenGL ES 3.0) | 2017–2023 | 2015 | 2014 | 2015 |
WebGPU is the only API that satisfies all four of: ① cross-OS, ② mandatory validation, ③ compute, ④ API surface under 200 functions. That's its raison d'être — not to beat Vulkan in benchmarks, but to safely expose the GPU to untrusted code.
The same device.queue.submit() traverses 8 different layers in each browser
"WebGPU" is one API, but it has three independent implementations. The 8-layer journey from JS to GPU differs across browsers. Side by side:
(Stack diagram; surviving labels: web_gpu_*.idl · third_party/blink/renderer/modules/webgpu/ · dom/webgpu/ (C++ glue) · wgpu-core (Rust validation + state tracking) · wgpu-hal → Metal · D3D12 · Vulkan)

All three stacks share: ① API shape (the spec), ② shader language (WGSL), ③ security model (same-origin, zeroed memory, capped limits). But implementations are entirely independent, which is both a strength and a burden of W3C standardisation. Ch11–Ch17 mostly walks the Chrome (Dawn) path, since it has the most users and the cleanest source; wgpu/Naga appears for contrast.
Keeping GPU calls in a separate process is core to the browser security model. Reasons: ① GPU drivers are enormous C/C++ codebases (NVIDIA's is ~10 MLOC, AMD's ~5 MLOC) and ship bugs; ② driver crashes typically take down the host process; ③ GPU memory contains textures from other origins / other tabs, requiring process-level isolation. A GPU-process crash blanks one page; renderers survive; the UI process certainly survives.
The article's main line. Each chapter dissects one line (or group), tracing it through the full stack
This is Field Note 09's main line. It does one thing: square each element of a 1,048,576-float array on the GPU, then read back to verify. Every line maps to one of the 25 chapters that follow — read them in order and you're translating these 40 lines layer by layer down to GPU instructions.
These 40 lines unfold the 8 translations:
| # | Translation | Chapter | Typical cost (M2 Pro · Chrome) |
|---|---|---|---|
| ① | JS → Blink WebIDL binding | Ch06, Ch10 | < 1 µs (V8 inline cache hit) |
| ② | Blink → Dawn Wire client (serialise) | Ch10, Ch11 | ~5 µs / dispatch (~80 bytes) |
| ③ | Mojo IPC · Renderer → GPU process | Ch11 | ~30 µs (with ring-buffer flush) |
| ④ | Dawn validation + state tracking | Ch12 | ~10 µs / dispatch |
| ⑤ | Tint: WGSL → MSL/HLSL/SPIR-V (first) | Ch14, Ch16 | ~5 ms (first) · 0 cached |
| ⑥ | Metal / D3D12 / Vulkan API call | Ch17 | ~20 µs (incl. PSO switch) |
| ⑦ | Vendor driver compiles to GPU ISA | Ch17 | ~10 ms (first) · 0 cached |
| ⑧ | GPU CP dispatches · SIMT execute | Ch18, Ch19 | ~200 µs (16384 wg × 64 thr × N cycles) |
Total (warm caches): ~250 µs CPU + ~200 µs GPU = ~450 µs. First-time cold: ~15 ms. Steady state: ~450 µs per submit. The chapters that follow unpack this timeline layer by layer.
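The listing itself is not reproduced here, so the following is a hedged reconstruction of what such a main-line program could look like, pieced together from the details the text gives (1,048,576 floats, `@workgroup_size(64)`, 16,384 workgroups, a `buf` / `read` buffer pair, `layout: 'auto'`). The names `WGSL_SQUARE`, `workgroupCount`, `squareOnGpu` and `runMainLine` are illustrative assumptions, not the original identifiers.

```javascript
// Hypothetical reconstruction of the main line; not the original 40-line listing.
const WGSL_SQUARE = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  if (id.x < arrayLength(&data)) {
    data[id.x] = data[id.x] * data[id.x];
  }
}`;

// Number of workgroups needed so that every element gets one invocation.
function workgroupCount(elements, workgroupSize) {
  return Math.ceil(elements / workgroupSize);
}

// Squares every element of `input` (a Float32Array) on the GPU and reads it back.
async function squareOnGpu(device, input) {
  const byteLen = input.byteLength;
  const buf = device.createBuffer({
    size: byteLen,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  const read = device.createBuffer({
    size: byteLen,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buf, 0, input);

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code: WGSL_SQUARE }), entryPoint: 'main' },
  });
  const bg = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: buf } }],
  });

  const enc = device.createCommandEncoder();
  const pass = enc.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bg);
  pass.dispatchWorkgroups(workgroupCount(input.length, 64));
  pass.end();
  enc.copyBufferToBuffer(buf, 0, read, 0, byteLen);
  device.queue.submit([enc.finish()]);

  await read.mapAsync(GPUMapMode.READ);
  const out = new Float32Array(read.getMappedRange().slice(0)); // copy before unmap
  read.unmap();
  buf.destroy();
  read.destroy();
  return out;
}

// Entry point; returns null where WebGPU is unavailable.
async function runMainLine() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return null;
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null;
  const device = await adapter.requestDevice();
  const input = Float32Array.from({ length: 1048576 }, (_, i) => i % 100);
  return squareOnGpu(device, input);
}
```

Calling `runMainLine()` in a WebGPU-capable browser should resolve to the squared array; the bounds check in the shader keeps the sketch safe for lengths that are not a multiple of 64.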
Two awaits · one hardware pick · one capability negotiation
Main line ②, two lines:
    const adapter = await navigator.gpu.requestAdapter(); // physical GPU
    const device = await adapter.requestDevice();         // virtual handle
Those two lines do four things — this chapter covers three (the fourth, device loss, is Ch13):
- requestAdapter() returns the first usable adapter by default. You can hint powerPreference: 'high-performance' or 'low-power' (advisory; on NVIDIA laptops this actually flips to the dGPU). forceFallbackAdapter: true forces SwiftShader / Dawn's null backend, useful for tests.
- The adapter reports optional features ('float32-filterable', 'shader-f16', 'subgroups') and hardware limits (maxBufferSize, maxStorageBufferBindingSize, maxComputeWorkgroupSizeX). These are the first gate in WebGPU's safety model: every limit is the spec-mandated minimum or higher.
- requestDevice() accepts requiredFeatures and requiredLimits; if the hardware doesn't support them, the Promise rejects. It's a contract: the device is a virtual GPU guaranteed to have the requested capabilities. A process can have multiple devices (different origins/iframes), but once a device is lost, it never recovers.

Adapter selection can be nondeterministic: the browser checks permissions, privacy settings, current GPU contention, and may even surface a permission prompt (the first WebGPU call sometimes needs a user gesture). A synchronous return is unrealistic. Device creation is async too, because the backend may need to allocate a resource pool, load a shader cache, and negotiate an IPC channel. Dawn on macOS takes ~12 ms for the first device, ~1–2 ms cached.
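The negotiation described above can be sketched as follows. This is a minimal illustration, assuming you prefer to degrade gracefully rather than let `requestDevice()` reject; `supportedFeatures`, `clampLimits` and `openDevice` are hypothetical helper names. The clamping only makes sense for maximum-style limits such as `maxBufferSize`; alignment limits run in the opposite direction and are not handled here.

```javascript
// Keep only the requested features that the adapter actually reports.
function supportedFeatures(adapter, wanted) {
  return wanted.filter((name) => adapter.features.has(name));
}

// Clamp each requested maximum-style limit to the adapter's reported value.
// Assumption: every key in `wanted` is a "max..." limit, not a "min...Alignment" one.
function clampLimits(adapter, wanted) {
  const out = {};
  for (const [name, value] of Object.entries(wanted)) {
    const available = adapter.limits[name];
    if (available === undefined) continue; // unknown limit name: skip it
    out[name] = Math.min(value, available);
  }
  return out;
}

// `gpu` is navigator.gpu in a browser; passed in so the helper stays testable.
async function openDevice(gpu, wantedFeatures, wantedLimits) {
  const adapter = await gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) throw new Error('no WebGPU adapter available');
  return adapter.requestDevice({
    requiredFeatures: supportedFeatures(adapter, wantedFeatures),
    requiredLimits: clampLimits(adapter, wantedLimits),
  });
}
```

A call such as `openDevice(navigator.gpu, ['shader-f16'], { maxBufferSize: 2 ** 30 })` would then request f16 only where the adapter offers it, and never ask for more buffer space than the hardware reports.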
Measured cost of main line ② on M2 Pro · Chrome 130 (warm caches, second time onward):
- requestAdapter(): ~0.8 ms (IPC round trip + browser policy check)
- requestDevice(): ~1.5 ms (Metal device + Dawn validation init)
How 4 MB of Float32Array gets from JS to GPU memory
Main line ③ and ⑦ both touch buffers: one storage buffer (GPU read/write) and one map-read buffer (CPU read-back). This chapter unpacks the 10 usage flags, 3 data paths, and 2 mapping modes that the spec offers.
| Flag | Meaning | Where it goes | Main line uses |
|---|---|---|---|
| MAP_READ | CPU can map for reading | readback buffer | yes (read) |
| MAP_WRITE | CPU can map for writing | staging (rare) | no |
| COPY_SRC | source of copyBuffer | storage → read | yes (buf) |
| COPY_DST | target of copyBuffer / writeBuffer | writeBuffer target · read buffer target | yes (both) |
| INDEX | bindable as index buffer | render pipeline | no |
| VERTEX | bindable as vertex buffer | render pipeline | no |
| UNIFORM | uniform binding (read-only, small) | shader constants | no |
| STORAGE | storage binding (rw, large) | compute / fragment | yes (buf) |
| INDIRECT | argument for dispatchIndirect / drawIndirect | GPU-driven flows | no |
| QUERY_RESOLVE | target of timestamp/occlusion queries | perf measurement | no |
Combinatorial constraint: MAP_READ can only combine with COPY_DST; MAP_WRITE only with COPY_SRC. This is deliberate isolation — a mapped buffer can't be bound to a shader, eliminating CPU-write-while-GPU-reads races. Hence the two-buffer pattern in the main line: compute writes into STORAGE | COPY_SRC; we then copyBufferToBuffer into a MAP_READ | COPY_DST buffer and mapAsync.
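The combination rule can be checked before a buffer is ever created. The sketch below uses the numeric flag values from the WebGPU spec under a local `Usage` table so that it also runs outside a browser; `checkBufferUsage` is a hypothetical helper, not part of any WebGPU implementation.

```javascript
// Numeric values of the GPUBufferUsage flags, copied into a local table.
const Usage = {
  MAP_READ: 0x0001, MAP_WRITE: 0x0002, COPY_SRC: 0x0004, COPY_DST: 0x0008,
  INDEX: 0x0010, VERTEX: 0x0020, UNIFORM: 0x0040, STORAGE: 0x0080,
  INDIRECT: 0x0100, QUERY_RESOLVE: 0x0200,
};

// Returns null when the combination is allowed, otherwise a description of the problem.
function checkBufferUsage(usage) {
  if (usage === 0) return 'usage must not be empty';
  if ((usage & Usage.MAP_READ) && (usage & ~(Usage.MAP_READ | Usage.COPY_DST))) {
    return 'MAP_READ may only be combined with COPY_DST';
  }
  if ((usage & Usage.MAP_WRITE) && (usage & ~(Usage.MAP_WRITE | Usage.COPY_SRC))) {
    return 'MAP_WRITE may only be combined with COPY_SRC';
  }
  return null;
}
```

Under these rules the main line's two buffers pass (`STORAGE | COPY_SRC | COPY_DST` and `MAP_READ | COPY_DST`), while a tempting single buffer with `MAP_READ | STORAGE` is rejected, which is why the readback needs a second buffer.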
- writeBuffer (most common). Provided by the queue: device.queue.writeBuffer(dst, dstOffset, srcData). Dawn internally a) allocates a transient upload buffer on the GPU, b) memcpys srcData into it, c) enqueues a copy command into dst. The call is synchronous; execution is asynchronous. Main line ③ takes this path.
- mappedAtCreation: true. Get an ArrayBuffer at creation: buf.getMappedRange() → write → buf.unmap(). Saves one memcpy (writing straight into the GPU's upload heap). The flag works with any usage, including buffers that have no MAP_WRITE, but the buffer size must be a multiple of 4 bytes. Ideal for static vertex/uniform data.
- copyBufferToBuffer + mapAsync. For read-back. Requires two buffers: ① compute target STORAGE | COPY_SRC, ② readback MAP_READ | COPY_DST. After submit() you await mapAsync(). Dawn polls a fence; the mapAsync promise resolves once the GPU has finished that submit. Main line ⑦ + ⑧.
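As an illustration of the second path, here is a small sketch of uploading static data with `mappedAtCreation`. The helper names `alignTo4` and `createBufferWithData` are assumptions; the padding reflects the requirement that a buffer created mapped has a size that is a multiple of 4 bytes.

```javascript
// Round a byte length up to the next multiple of 4.
function alignTo4(byteLength) {
  return Math.ceil(byteLength / 4) * 4;
}

// Create a buffer whose initial contents are `data` (any TypedArray),
// writing through the mapping that exists at creation time.
function createBufferWithData(device, data, usage) {
  const buffer = device.createBuffer({
    size: alignTo4(data.byteLength),
    usage,
    mappedAtCreation: true,
  });
  const bytes = new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  new Uint8Array(buffer.getMappedRange()).set(bytes);
  buffer.unmap(); // the contents become visible to the GPU after unmap
  return buffer;
}
```

In a browser this might be called as `createBufferWithData(device, vertices, GPUBufferUsage.VERTEX)`; no `COPY_DST` is needed because nothing is copied into the buffer afterwards.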
GPUBuffer is an explicit resource. Unlike a Float32Array, JS GC can't see the 4 MB sitting on the GPU. If you only hold a JS reference and never call buf.destroy(), the buffer survives until GPUDevice itself is GC'd — on a long-running page that means multiple GB of VRAM held for minutes. The spec is explicit: destroy() immediately releases all references and returns the memory. Dawn's implementation: src/dawn/native/Buffer.cpp::APIDestroy().
WebGL's buffer lifetime is scoped to the GL context: on context loss, GL cleans up. WebGPU has no such semantic; on device loss, all buffers are still visible from the app side but invalid, and you must clean them up manually. Pattern: hang all GPU resources off a single resourceGarden and walk-and-destroy when the device.lost promise resolves.
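One possible shape for that resourceGarden pattern is sketched below. The class and its method names are illustrative assumptions; the only WebGPU facts relied on are that buffers and textures expose `destroy()` and that `device.lost` is a promise.

```javascript
// Tracks every GPU resource an app creates so they can be destroyed together.
class ResourceGarden {
  constructor() {
    this.resources = new Set();
  }
  // Register a resource (anything with destroy()) and hand it back.
  track(resource) {
    this.resources.add(resource);
    return resource;
  }
  // Destroy one resource early and stop tracking it.
  release(resource) {
    if (this.resources.delete(resource)) resource.destroy();
  }
  // Destroy everything still tracked.
  destroyAll() {
    for (const resource of this.resources) resource.destroy();
    this.resources.clear();
  }
}

// Tear the garden down when the device is lost.
function watchDevice(device, garden) {
  return device.lost.then((info) => {
    garden.destroyAll();
    return info;
  });
}
```

Typical use would be `const buf = garden.track(device.createBuffer({...}))`, with `garden.release(buf)` as soon as the data is no longer needed, so VRAM is not held until garbage collection.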
Not GLSL · not HLSL · not MSL · designed to round-trip to all three
That string literal in main line ④ is WGSL. The WebGPU protocol only accepts WGSL as shader source — the single most consequential (and contentious) decision the W3C working group made.
So the WG decided to design a new small language. Syntax leans Rust (fn / let / var / ->); the type system is strict, no implicit conversions. Spec at w3.org/TR/WGSL/, about 400 pages.
| Address space | Lifetime | Use | Performance tier |
|---|---|---|---|
| function | one function call | local vars | registers (fastest) |
| private | shader invocation | per-thread globals | registers / L1 |
| workgroup | one workgroup | workgroup-shared memory | shared memory (Metal: threadgroup) |
| uniform | whole pipeline | shader constants (<64 KB) | uniform cache |
| storage | whole pipeline | large rw buffers | VRAM (slowest) |
Our main line's data lives in storage, so each access hits L2/VRAM. Ch19 will show how moving partial sums into workgroup shared memory speeds matmul up 4–8×.
- Scalars: i32 · u32 · f32 · f16 · bool. f16 is an optional feature.
- Vectors: vec2<T> · vec3<T> · vec4<T>. Swizzling: v.xyz, v.rg, v.rrgb.
- Matrices: mat2x3<f32> (2 columns, 3 rows).
- Arrays: array<T, N> (fixed) or array<T> (runtime-sized, only as the last storage member).

In WGSL, const is a compile-time constant (it must be evaluable at compile time), let is runtime-immutable (like a C++ const local), and var is mutable. WGSL's const matches Rust's const but differs from const in JS and C++. Keep the three tiers in mind when reading code.
Why you need a "layout of layouts" — and how layout:'auto' really works
Main line ⑤ does two things: createComputePipeline + createBindGroup. Together they answer one underlying question: how does the shader find the resources it uses?
Every modern graphics API uses a two-level descriptor model:
Why two levels rather than one? Pipeline compilation only cares about shape; switching BindGroups at runtime only swaps instances. One pipeline can serve 1,000 BindGroups (e.g. rendering 1,000 models, each with its own textures) without recompiling the shader. That's the core trick of explicit APIs — separating compile-time decisions from runtime rebinding.
What layout: 'auto' really does

Our main line uses layout: 'auto'. This doesn't mean "no layout"; it means Dawn/wgpu infers the layout from the shader. Dawn's logic is in src/dawn/native/ShaderModule.cpp::ExtractAutoLayout(): it walks the WGSL AST, collects every @group(X) @binding(Y) declaration, groups by X, and synthesises a BindGroupLayout per group. That layout is then used to compile the pipeline.
The cost of layout: 'auto': BindGroups aren't shareable across pipelines. Even if two pipelines declare identical bindings, their inferred layouts are distinct instances, so each needs its own BindGroup. Production code with N pipelines sharing bindings should call device.createBindGroupLayout() explicitly and reuse the same layout in every pipeline descriptor.
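A minimal sketch of the explicit alternative follows, assuming every pipeline binds one read-write storage buffer at `@group(0) @binding(0)` and uses an entry point called `main`. `createSharedLayout` and `createPipelines` are hypothetical helpers; the literal `0x4` stands in for `GPUShaderStage.COMPUTE` so the snippet does not depend on browser globals.

```javascript
const COMPUTE_STAGE = 0x4; // numeric value of GPUShaderStage.COMPUTE

// One BindGroupLayout plus one PipelineLayout, created once and reused.
function createSharedLayout(device) {
  const bindGroupLayout = device.createBindGroupLayout({
    entries: [{ binding: 0, visibility: COMPUTE_STAGE, buffer: { type: 'storage' } }],
  });
  const pipelineLayout = device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  });
  return { bindGroupLayout, pipelineLayout };
}

// Every pipeline gets the same explicit layout instead of 'auto'.
function createPipelines(device, pipelineLayout, shaderModules) {
  return shaderModules.map((module) =>
    device.createComputePipeline({
      layout: pipelineLayout,
      compute: { module, entryPoint: 'main' },
    }),
  );
}
```

A bind group created with `layout: bindGroupLayout` can then be passed to `setBindGroup(0, ...)` for any of the returned pipelines, which is exactly what `layout: 'auto'` prevents.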
Main line ⑤'s createComputePipeline is a heavy call:
Sum: 10–30 ms total. Never create pipelines in a frame loop. Production apps pre-warm all pipelines at startup. Chrome also persists the pipeline cache to disk so a second visit skips re-compilation.
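Pre-warming can be sketched with `createComputePipelineAsync`, which lets compilation proceed without blocking the calling thread. The `PipelineCache` class and its string keys are illustrative assumptions; this is an in-memory cache inside the page and is unrelated to Chrome's on-disk cache.

```javascript
// Deduplicates pipeline creation: each key is compiled at most once.
class PipelineCache {
  constructor(device) {
    this.device = device;
    this.pending = new Map(); // key -> Promise<GPUComputePipeline>
  }
  // Returns the (possibly still compiling) pipeline for `key`.
  get(key, descriptor) {
    let promise = this.pending.get(key);
    if (!promise) {
      promise = this.device.createComputePipelineAsync(descriptor);
      this.pending.set(key, promise);
    }
    return promise;
  }
  // Kick off compilation of many pipelines at startup.
  prewarm(entries) {
    return Promise.all(entries.map(([key, descriptor]) => this.get(key, descriptor)));
  }
}
```

At startup an app might `await cache.prewarm([['square', squareDescriptor], ...])`; inside the frame loop, `cache.get('square', squareDescriptor)` then resolves to the already compiled pipeline.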
Batch dispatch / draw / copy into a CommandBuffer, then submit to the GPU
Main line ⑥ does six things in eight statements:

    const enc = device.createCommandEncoder();           // 1. open a recorder
    const pass = enc.beginComputePass();                 // 2. begin compute pass
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bg);
    pass.dispatchWorkgroups(16384);                      // 3. record a dispatch
    pass.end();                                          // 4. end the pass
    enc.copyBufferToBuffer(buf, 0, read, 0, byteLen);    // 5. record a copy
    device.queue.submit([enc.finish()]);                 // 6. submit
Two key concepts:
Calling setPipeline / dispatchWorkgroups / copyBufferToBuffer does not run anything on the GPU. Each call appends commands into the encoder's internal buffer. This is the D3D12/Metal command-buffer model — batch-record, submit-once — giving the driver room to reorder, merge, and parallelise.
Dawn's encoder lives at src/dawn/native/CommandEncoder.cpp. Every API call is validated inline (pipeline compatibility, binding alignment, usage tracking). Validation failures don't throw JS exceptions — they poison the encoder, deferring rejection until finish(). That's a perf optimisation (errors don't stall the recording pipeline).
device.queue.submit([cmdBuf]) is where cross-process IPC and native-API calls actually happen:
- the native dispatch is recorded: Metal [encoder dispatchThreadgroups:...] / D3D12 ID3D12GraphicsCommandList::Dispatch / Vulkan vkCmdDispatch
- the native command buffer (MTLCommandBuffer / D3D12CommandList) is submitted to its queue

On M2 Pro, one submit costs ~30 µs IPC + ~10 µs validation + ~20 µs Metal API = ~60 µs CPU overhead. The GPU then runs asynchronously; the main line's mapAsync resolves when the GPU finishes.
Our main line submits one command + one copy. Production code batches many passes into one encoder: dispatch compute pass A, then compute pass B, then a render pass, then a copy. One submit pushes a whole frame of work, 5–10× faster than per-pass submits (no extra IPC). Babylon.js renders a typical frame in 1 submit + 50–100 passes.
WebGPU exposes workgroup count instead of thread count so the grouping is explicit. One dispatchWorkgroups(16384) with @workgroup_size(64) = 16,384 × 64 = 1,048,576 threads. Each workgroup's 64 threads run on the same SM (Apple GPU: cluster) and can share workgroup memory. Ch18 unpacks the grouping semantics in depth.
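The batching advice and the workgroup arithmetic can be put together in a short sketch. `encodeFrame` and the shape of its `passes` / `copies` arguments are assumptions made for illustration; the point is that any number of passes ends in exactly one `submit`.

```javascript
// Total shader invocations launched by one dispatch.
function totalInvocations(workgroups, workgroupSize) {
  return workgroups * workgroupSize;
}

// Record several compute passes and copies into ONE encoder, then submit once.
// passes: [{ pipeline, bindGroup, workgroups }], copies: [{ src, dst, size }]
function encodeFrame(device, passes, copies = []) {
  const enc = device.createCommandEncoder();
  for (const { pipeline, bindGroup, workgroups } of passes) {
    const pass = enc.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(workgroups);
    pass.end();
  }
  for (const { src, dst, size } of copies) {
    enc.copyBufferToBuffer(src, 0, dst, 0, size);
  }
  device.queue.submit([enc.finish()]); // the only cross-process flush
}
```

For the main line, `totalInvocations(16384, 64)` gives the 1,048,576 threads quoted above, one per array element.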
Why dispatches cross a process boundary, how, and what it costs
This unpacks translation ③ Mojo IPC. In a desktop native app, one dispatchThreadgroups Metal call is one function call, pure user-space. In Chrome the same operation is cross-process: JS in the Renderer (sandboxed), driver calls in the GPU process (also sandboxed, but with driver access). This chapter dissects that boundary.
Every JS pass.dispatchWorkgroups(16384) is translated by the Dawn Wire client (C++, in the Renderer) into a short binary record. Roughly:
Wire serialisation is synchronous but doesn't send immediately — commands append to a ring buffer, flushed only on queue.submit(). The flush IPCs the entire ring (typically 1–10 KB / frame) in one shot.
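To make the "serialise now, send later" behaviour concrete, here is a toy model of a command ring. It is explicitly not Dawn's wire format: the record layout (five little-endian u32 words), the `CMD_DISPATCH` id and the `CommandRing` class are all invented for illustration.

```javascript
const CMD_DISPATCH = 1; // hypothetical command id
const RECORD_BYTES = 20; // 5 x u32: id, pass handle, x, y, z (made-up layout)

class CommandRing {
  // `flush` stands in for the IPC send; it receives a copy of the pending bytes.
  constructor(capacityBytes, flush) {
    if (capacityBytes < RECORD_BYTES) throw new RangeError('ring too small');
    this.bytes = new ArrayBuffer(capacityBytes);
    this.view = new DataView(this.bytes);
    this.offset = 0;
    this.flush = flush;
  }
  // Synchronous: only appends bytes, nothing crosses the process boundary.
  recordDispatch(passId, x, y = 1, z = 1) {
    if (this.offset + RECORD_BYTES > this.bytes.byteLength) this.submit();
    for (const word of [CMD_DISPATCH, passId, x, y, z]) {
      this.view.setUint32(this.offset, word, true);
      this.offset += 4;
    }
  }
  // The single point where everything recorded so far is sent.
  submit() {
    if (this.offset === 0) return;
    this.flush(this.bytes.slice(0, this.offset));
    this.offset = 0;
  }
}
```

Recording many dispatches costs only memory writes; the expensive step happens once per `submit()`, or earlier if the ring fills up.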
Mojo is Chromium's IPC layer (the name collides with Chrome OS Mojo by coincidence). Under the hood: Unix domain sockets on Linux/macOS, named pipes on Windows. On top of that Mojo provides:
Dawn uses Mojo shared memory for the wire ring buffer — a single dispatch is ~80 bytes, but a busy frame's batch can be 50 KB. shm is ~3× faster than send/recv (avoids the kernel copy). See Mojo docs.
| Step | M2 Pro · macOS | i7-13700H · Windows | Pixel 8 · Android |
|---|---|---|---|
| Renderer serialise (80 bytes) | ~3 µs | ~4 µs | ~6 µs |
| Mojo shm flush | ~10 µs | ~12 µs | ~20 µs |
| Schedule + GPU process wake | ~15 µs | ~18 µs | ~30 µs |
| GPU-side deserialise + validate | ~10 µs | ~12 µs | ~15 µs |
| Total / dispatch | ~38 µs | ~46 µs | ~71 µs |
This explains why a main-line submit is ~60 µs — IPC eats more than half. It's also why you shouldn't multi-submit per frame: every extra submit adds ~40 µs.
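As a rough back-of-the-envelope model, the fixed cost per frame scales linearly with the number of submits. The sketch below only re-adds the M2 Pro column of the table above; the function names are made up for illustration.

```typescript
// Per-step latencies in microseconds, copied from the M2 Pro column of the table.
const stepsMicros: Record<string, number> = {
  serialise: 3,
  shmFlush: 10,
  wakeGpuProcess: 15,
  deserialiseAndValidate: 10,
};

// Sum of all steps: the fixed cost of one submit.
function perSubmitMicros(steps: Record<string, number>): number {
  return Object.values(steps).reduce((sum, us) => sum + us, 0);
}

// Fixed IPC overhead per frame grows linearly with the number of submits.
function frameOverheadMicros(submitsPerFrame: number): number {
  return submitsPerFrame * perSubmitMicros(stepsMicros);
}

const oneSubmit = frameOverheadMicros(1);
const tenSubmits = frameOverheadMicros(10);
```

Batching ten passes into one submit instead of ten keeps the fixed overhead at one multiple of the per-submit cost rather than ten.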
Firefox uses IPDL (Mozilla's own IPC framework) for a different byte format. wgpu's wire is Rust structs serialised via bincode, see wgpu/src/backend/wgpu_core/mod.rs. Total latency is comparable to Dawn (~40 µs). Interestingly, wgpu has a same-process fast path (e.g. Bevy/Servo embedding it directly) that skips IPC entirely — just mem-copy structs. The Firefox browser always goes via IPC for sandbox reasons.
异步错误模型 · pushErrorScope/popErrorScope · uncapturederror 事件
Async error model · push/popErrorScope · uncapturederror events
WebGPU 的错误处理有一种奇怪的味道:写错代码很少抛 JS 异常,但东西就是不工作。这章解释为什么——它是故意 设计成这样的,叫做contagion model(污染模型)。
传统的同步 API(如 WebGL)模式:每次调用都同步检查 + 同步报错。WebGL 用 gl.getError()——你每次 draw call 后要主动查;不查就丢错误。每次 getError 都要 IPC 一次(Renderer → GPU process → 取错误状态 → 回),开销巨大。
WebGPU 反过来:错误是异步传染的。一个错误的 dispatchWorkgroups(70000)(超过默认 maxComputeWorkgroupsPerDimension = 65535;注意 workgroup 数为 0 是合法的空操作)不会抛 JS 异常:
如果调用被 pushErrorScope('validation') 包着,那么 popErrorScope() 返回一个 GPUValidationError;否则触发 device 上的 uncapturederror 事件。为什么这样设计:
| 类型 | 触发条件 | 典型例 |
|---|---|---|
| GPUValidationError | API 用法错误(spec 违规) | buffer usage 不匹配 · binding 越界 · workgroup_size 太大 |
| GPUOutOfMemoryError | 真没显存了 | 申请系统无法满足的超大 buffer · 创建过多 texture |
| GPUInternalError | 实现 bug 或硬件异常 | shader 编译炸了 · driver 内部 timeout |
错误要从 GPU 进程反向流到 Renderer——和 dispatch 同一条 IPC 通道但反向。popErrorScope() 返回的 Promise 在对应的所有命令都被 GPU 进程处理完之后才 resolve。所以 popErrorScope 之后再 dispatchWorkgroups 是同步的(不阻塞),但读取错误结果是异步的。
Dawn 的 validation 不是一处,它分布在每条 API 的 GPU-side handler 里。例如 ComputePassEncoder::APIDispatchWorkgroups(in src/dawn/native/ComputePassEncoder.cpp)会调 ValidateDispatch(),检查 ① 每一维 workgroup count 不超过 maxComputeWorkgroupsPerDimension、② 当前 BindGroup 已绑、③ pipeline 已绑、④ 三个 binding usage 兼容。失败时 poison the command encoder,但不 throw;错误延迟到 finish() 才作为 validation error 报出。
Dawn 错误信息带调用栈式 context:每个嵌套的 API 调用都会 append 一段 "While calling [X]"。这是 src/dawn/native/ErrorData.cpp 里 ErrorData::AppendContext() 干的。比 WebGL 的 gl.getError() === gl.INVALID_OPERATION 信息量大得多。
WebGPU error handling has a strange flavour: getting things wrong rarely throws a JS exception, but stuff just doesn't work. This chapter explains why — it's deliberate, often called the contagion model.
Traditional sync APIs (like WebGL): each call validates and reports inline. WebGL uses gl.getError() — you must poll after every draw call, or the error vanishes. Each getError costs one IPC round trip (Renderer → GPU process → fetch error state → reply) — huge overhead.
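WebGPU inverts this: errors propagate asynchronously by contagion. A bad dispatchWorkgroups(70000) (over the default maxComputeWorkgroupsPerDimension of 65535; a count of 0 is a valid no-op) throws no JS exception:
If the call is wrapped in pushErrorScope('validation'), the matching popErrorScope() resolves to a GPUValidationError. Otherwise the error fires uncapturederror on the device. Why design it this way: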
| Type | Trigger | Typical cause |
|---|---|---|
| GPUValidationError | API misuse (spec violation) | buffer usage mismatch · binding OOB · workgroup_size too large |
| GPUOutOfMemoryError | actually out of VRAM | a buffer too large for the system to back · texture flood |
| GPUInternalError | implementation bug / hardware fault | shader compiler crashed · driver timeout |
Errors flow backwards from the GPU process to the Renderer — same IPC channel as the dispatch, opposite direction. The Promise from popErrorScope() resolves only after all matched commands have been processed in the GPU process. So issuing dispatchWorkgroups after popErrorScope is synchronous (non-blocking), but reading the error is async.
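A small helper makes the "record synchronously, read the error asynchronously" split concrete. The sketch below declares its own minimal `ErrorScopeDevice` interface so it stands alone; a real GPUDevice exposes pushErrorScope and popErrorScope with these shapes, while `captureValidationError` and the fake device are hypothetical names used only for illustration.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
interface ScopedError { message: string }
interface ErrorScopeDevice {
  pushErrorScope(filter: "validation" | "out-of-memory" | "internal"): void;
  popErrorScope(): Promise<ScopedError | null>;
}

// Runs `record` inside a validation scope and resolves to the captured error, if any.
async function captureValidationError(
  device: ErrorScopeDevice,
  record: () => void,
): Promise<ScopedError | null> {
  device.pushErrorScope("validation");
  record();                      // returns immediately, even if it recorded invalid work
  return device.popErrorScope(); // resolves to null when nothing went wrong
}

// Fake device for demonstration: remembers the last error "raised" inside the scope.
let pendingError: ScopedError | null = null;
const fakeDevice: ErrorScopeDevice = {
  pushErrorScope() { pendingError = null; },
  popErrorScope() {
    const captured = pendingError;
    pendingError = null;
    return Promise.resolve(captured);
  },
};

const capturedPromise = captureValidationError(fakeDevice, () => {
  pendingError = { message: "dispatch exceeds maxComputeWorkgroupsPerDimension" };
});
```

With a real device, `record` would contain the encoder calls and the submit; the caller awaits the returned promise only when it actually wants the diagnostic.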
Dawn's validation isn't in one place; it's distributed across every API's GPU-side handler. E.g. ComputePassEncoder::APIDispatchWorkgroups (in src/dawn/native/ComputePassEncoder.cpp) calls ValidateDispatch(), which checks that ① each workgroup count is within maxComputeWorkgroupsPerDimension, ② a BindGroup is bound, ③ a pipeline is bound, ④ the three binding usages are compatible. Failure poisons the command encoder but doesn't throw; the validation error is reported when finish() is called.
Dawn error messages carry call-stack-style context: each nested API call appends a "While calling [X]" segment. This is implemented by ErrorData::AppendContext() in src/dawn/native/ErrorData.cpp. The result is far more informative than WebGL's gl.getError() === gl.INVALID_OPERATION.
GPU 不可信的时候怎么活下来
Staying alive when the GPU lies to you
WebGPU 的一个明显设计决策是:device 可能突然死掉。device.lost 是个 Promise,一定会 resolve(要么因为应用主动 destroy,要么因为系统级原因)。
info.reason 的可能值(spec §22 Errors & Debugging · #device-lost):
- "destroyed":应用主动调 device.destroy()。
- "unknown":其他原因(OS、driver、TDR 等都归这里)。
规范把不区分具体丢失原因当作隐私防御:能告诉应用"TDR 触发了"就能用 shader 运行时长度作为指纹(每个 GPU TDR 阈值不同)。所以应用只能粗粒度恢复。
Dawn 内部有 三层 handling(src/dawn/native/Device.cpp):① 一个 backend 报错(如 Metal command buffer execution failed)→ ② Dawn 把当前 device 标记为 lost 状态,所有 in-flight 命令的 promise reject → ③ 触发 Wire client 发 lost 事件到 Renderer,触发 JS 的 device.lost promise resolve。整个流程是异步的——可能错误发生 50ms 后 JS 才知道。
Babylon.js 5+ 默认开 device-loss 监听。Figma 上线 WebGPU 时报告过 TDR 在 Windows + 老 NVIDIA 驱动上的发生率 ~0.3% / 用户 / 月——很低,但有几万用户就有人遇到。Figma 的恢复策略:保存所有 vector 数据在 JS 端,device 丢失时 5 秒 cooldown 后重建 + 重提交所有 pipeline + 重传所有静态纹理(typical 30-100 ms)。
An obvious WebGPU design choice: devices can die at any time. device.lost is a Promise that always resolves eventually (either because the app destroyed it, or some external event killed it).
Possible info.reason values (spec §22 Errors & Debugging · #device-lost):
- "destroyed": the app called device.destroy().
- "unknown": anything else (OS / driver / TDR all go here).
The spec keeps the device-loss reason coarse as a privacy defence: telling the app "TDR fired" would let an attacker use shader runtime length as a fingerprint (every GPU has a different TDR threshold). So apps only get coarse-grained recovery info.
Dawn has three layers of handling (src/dawn/native/Device.cpp): ① a backend reports an error (e.g. Metal command-buffer execution failed) → ② Dawn marks the current device as lost and rejects all in-flight command promises → ③ Wire client fires a lost event to the Renderer, which resolves JS's device.lost promise. The whole flow is asynchronous — JS may learn about the error 50ms after it happened.
Babylon.js 5+ ships device-loss listening on by default. When launching WebGPU, Figma reported that TDRs on Windows with older NVIDIA drivers occur at ~0.3% / user / month: small, but with tens of thousands of users someone hits it. Figma's recovery: keep all vector data in JS, wait out a 5-second cooldown after loss, then rebuild the device, resubmit all pipelines and re-upload all static textures (the rebuild itself typically takes 30–100 ms).
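A recovery loop along these lines can be sketched as follows. The `"destroyed"` and `"unknown"` strings are the GPUDeviceLostReason values from the WebGPU spec; everything else (`runWithRecovery`, `LosableDevice`, the callbacks) is a hypothetical shape chosen so the sketch runs without a browser, and the cooldown is left out for brevity.

```typescript
type LostReason = "destroyed" | "unknown"; // GPUDeviceLostReason in the WebGPU spec
interface LostInfo { reason: LostReason; message: string }
interface LosableDevice { lost: Promise<LostInfo> }

// Re-creates the device whenever it is lost for a reason other than an explicit destroy().
// Resolves to the number of restarts that were needed.
async function runWithRecovery(
  createDevice: () => Promise<LosableDevice>,
  rebuildResources: (device: LosableDevice) => void,
  maxRestarts: number,
): Promise<number> {
  let restarts = 0;
  for (;;) {
    const device = await createDevice();
    rebuildResources(device);       // pipelines, static textures, buffers
    const info = await device.lost; // resolves only once the device is gone
    if (info.reason === "destroyed" || restarts >= maxRestarts) return restarts;
    restarts += 1;
  }
}

// Demonstration with fake devices: the first is lost unexpectedly, the second is destroyed.
const scriptedReasons: LostReason[] = ["unknown", "destroyed"];
let devicesCreated = 0;
let rebuildCount = 0;
const restartsPromise = runWithRecovery(
  () => {
    const reason = scriptedReasons[devicesCreated] ?? "destroyed";
    devicesCreated += 1;
    return Promise.resolve({ lost: Promise.resolve({ reason, message: "" }) });
  },
  () => { rebuildCount += 1; },
  3,
);
```

Because the reason is deliberately coarse, the loop treats every non-"destroyed" loss the same way and relies on `maxRestarts` to avoid retrying forever on a machine that keeps failing.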
parser → resolver → IR → writer 的四阶段编译器
A four-stage compiler: parser → resolver → IR → writer
Tint 是 Chromium 项目里把 WGSL 翻译到 3 种原生 shader 语言的编译器子项目。地址:chromium/src/third_party/dawn/src/tint/。约 11 万行 C++,由 Google 主要维护,但 Apple/Intel 也有 commit。
原本工作组想:写一个 WGSL→SPIR-V 前端,然后 SPIR-V→MSL 走 SPIRV-Cross,SPIR-V→HLSL 走 SPIRV-Cross,HLSL→DXIL 走 DXC。实际尝试后发现:
于是 Tint 走一跳直达:每个 backend 直接从 Tint IR 翻到目标语言。架构清晰、安全可控、性能更好。
| 阶段 | 路径 | 输入 | 输出 | 典型耗时(一个 shader) |
|---|---|---|---|---|
| 1. Parse | tint/lang/wgsl/reader/ | WGSL 源文本 | AST | ~0.5 ms |
| 2. Resolve | tint/lang/wgsl/resolver/ | AST | typed AST(含 type、constant value) | ~1 ms |
| 3. Lower to IR | tint/lang/wgsl/program_to_ir/ | typed AST | Tint IR(SSA) | ~0.5 ms |
| 4a. Writer · MSL | tint/lang/msl/writer/ | IR | Metal Shading Language | ~1 ms |
| 4b. Writer · HLSL | tint/lang/hlsl/writer/ | IR | HLSL(再被 DXC 编 DXIL) | ~1.5 ms |
| 4c. Writer · SPIR-V | tint/lang/spirv/writer/ | IR | SPIR-V binary | ~1 ms |
主线那段 16 行 WGSL 全程 ~3 ms(首次)。缓存后 0——Dawn 把编译产物 keyed on (WGSL source hash, backend type, target API version) 缓存到 device-scoped HashMap,二次 createShaderModule 直接复用。
Resolver(tint/lang/wgsl/resolver/)是 Tint 里最大的子目录——主文件 resolver.cc 约 3,500 行,整个 resolver/ 目录加起来约 25,000 行 C++——因为它要 enforce 整个 WGSL spec 的静态语义。简略列表:
a + b 要求 a/b 同类型)const x = 1 + 2 * 3 在编译期求出 7)function 不能跨 invocation 共享)@diagnostic(off, derivative_uniformity))Resolver 失败 → 整个 ShaderModule 创建失败 → 抛 GPUValidationError 到 JS。这种错误是同步可见的——通过 shader.getCompilationInfo() async API 拿到详细信息。
Tint is the Chromium subproject that translates WGSL into three native shader languages. Path: chromium/src/third_party/dawn/src/tint/. About 110k lines of C++, maintained mostly by Google with Apple/Intel contributions.
The working group's first plan: write a WGSL→SPIR-V frontend, then SPIR-V→MSL via SPIRV-Cross, SPIR-V→HLSL via SPIRV-Cross, HLSL→DXIL via DXC. In practice:
So Tint takes one direct hop: each backend walks Tint IR straight to the target language. Cleaner architecture, controllable security, better performance.
| Stage | Path | Input | Output | Typical (one shader) |
|---|---|---|---|---|
| 1. Parse | tint/lang/wgsl/reader/ | WGSL source | AST | ~0.5 ms |
| 2. Resolve | tint/lang/wgsl/resolver/ | AST | typed AST (types + const values) | ~1 ms |
| 3. Lower to IR | tint/lang/wgsl/program_to_ir/ | typed AST | Tint IR (SSA) | ~0.5 ms |
| 4a. Writer · MSL | tint/lang/msl/writer/ | IR | Metal Shading Language | ~1 ms |
| 4b. Writer · HLSL | tint/lang/hlsl/writer/ | IR | HLSL (DXC then → DXIL) | ~1.5 ms |
| 4c. Writer · SPIR-V | tint/lang/spirv/writer/ | IR | SPIR-V binary | ~1 ms |
Our main line's 16-line WGSL takes ~3 ms end-to-end first time. Cached afterwards — Dawn keys compilation results on (WGSL source hash, backend type, target API version) in a device-scoped HashMap; second createShaderModule reuses.
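The caching idea reduces to a map keyed on that triple. The following is only a sketch of the concept under assumed names (`ShaderCache`, `CacheKey`, `getOrCompile`); it is not Dawn's cache, and the compiled output is modelled as a plain string.

```typescript
interface CacheKey {
  sourceHash: string;                 // hash of the WGSL text
  backend: "msl" | "hlsl" | "spirv";  // which writer produced the output
  apiVersion: string;                 // target API version
}

class ShaderCache {
  private readonly entries = new Map<string, string>();

  // Returns the cached output for the key, compiling only on the first request.
  getOrCompile(key: CacheKey, compile: () => string): string {
    const id = `${key.sourceHash}|${key.backend}|${key.apiVersion}`;
    const cached = this.entries.get(id);
    if (cached !== undefined) return cached;
    const compiled = compile(); // the slow path (~3 ms in the text)
    this.entries.set(id, compiled);
    return compiled;
  }
}

const shaderCache = new ShaderCache();
let compileCalls = 0;
const compileStub = (): string => { compileCalls += 1; return "compiled-output"; };

const mslKey: CacheKey = { sourceHash: "abc123", backend: "msl", apiVersion: "3.1" };
shaderCache.getOrCompile(mslKey, compileStub);
shaderCache.getOrCompile({ ...mslKey }, compileStub);                  // same key: cache hit
const callsAfterSameKey = compileCalls;
shaderCache.getOrCompile({ ...mslKey, backend: "hlsl" }, compileStub); // new backend: miss
```

Any component of the key changing (source, backend, or API version) forces a fresh compile, which is why all three belong in the key.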
The resolver (tint/lang/wgsl/resolver/) is Tint's largest subdirectory — the main file resolver.cc is ~3,500 lines, but the entire resolver/ directory totals ~25,000 lines of C++, because it must enforce the entire WGSL spec's static semantics. A non-exhaustive list:
- Type checking (a + b requires same type)
- Constant evaluation (const x = 1 + 2 * 3 resolves to 7 at compile time)
- Address-space rules (function can't be shared across invocations)
- Diagnostic filters (@diagnostic(off, derivative_uniformity))
Resolver failure → ShaderModule creation fails → a GPUValidationError is reported to JS. The error is observable: the async shader.getCompilationInfo() returns a structured list of messages.
Rust · arena IR · 编译给 Firefox / Bevy / Deno 共用
Rust · arena IR · powers Firefox / Bevy / Deno alike
Naga 是 Tint 的 Rust 对应物。地址:github.com/gfx-rs/wgpu/tree/trunk/naga。约 6 万行 Rust,gfx-rs 团队维护。它和 Tint 干同样的事——WGSL 翻译到 MSL / HLSL / SPIR-V / GLSL(外加 WGSL→WGSL re-print)——但架构很不一样。
Tint 用经典的对象图——每个 IR 节点都是一个堆分配的 C++ 对象,靠 Block* / Value* 指针互引。Naga 反过来——用arena(slab allocator):
为什么用 arena:
Box<T> 满天飞和 borrow checker 难题。Handle<T> 是 Copy,arena 是不可变借用。bincode),让 Firefox 在 GPU 进程间传 IR 极快。Naga 不只有 WGSL 前端——它还能反向读 SPIR-V、GLSL 当输入:
| 前端 | 路径 | 用途 |
|---|---|---|
| WGSL → Module | naga/src/front/wgsl/ | 主路径,wgpu 用 |
| SPIR-V → Module | naga/src/front/spv/ | 把外部 shader 引入 wgpu |
| GLSL → Module | naga/src/front/glsl/ | 从 OpenGL 项目迁移用 |
这是 Naga 比 Tint 灵活的地方——Tint 只 接 WGSL。但 WebGPU 浏览器实现只用 WGSL 输入,所以这种灵活性主要服务 wgpu 在非浏览器场景(Bevy 引擎、Deno runtime、Servo)。
| 后端 | 路径 | 使用方 |
|---|---|---|
| Module → MSL | naga/src/back/msl/ | Firefox macOS, Bevy macOS |
| Module → HLSL | naga/src/back/hlsl/ | Firefox Windows(→ DXC → DXIL) |
| Module → SPIR-V | naga/src/back/spv/ | Firefox Linux, Bevy Vulkan |
| Module → GLSL | naga/src/back/glsl/ | WebGL 兼容(兜底) |
| Module → WGSL | naga/src/back/wgsl/ | 调试 / round trip 验证 |
Naga 的 WGSL→MSL 在 M2 Pro 上对主线那段 16 行约 1.2 ms,比 Tint 略快(~3 ms)。差距主要来自 Rust 的 monomorphization 和 arena 的局部性。在 wgpu 的内部 benchmark 上(wgpu/wgpu-types/benches/),Naga 编译 1000 行复杂 shader 约 15 ms,Tint 约 25–30 ms。两者都远快于原生 driver 编译(spirv-tools/glslc/DXC 都 100+ ms)。
Tint 的代码组织紧贴 WGSL spec 章节号——tint/lang/wgsl/resolver/uniformity.cc 直接对应 spec §13 uniformity analysis。Naga 的代码组织按数据流——前端折叠各种 source 语言进 Module,后端把 Module 翻成各种 target。两种风格都对,但 Tint 更适合 spec 实现者(一一对应)、Naga 更适合 graphics engine 嵌入者(IR 友好)。
Naga is Tint's Rust counterpart. Repo: github.com/gfx-rs/wgpu/tree/trunk/naga. About 60k lines of Rust, maintained by the gfx-rs team. It does the same job as Tint — translate WGSL into MSL / HLSL / SPIR-V / GLSL (plus a WGSL re-printer) — but its architecture is very different.
Tint uses a classic object graph — each IR node is a heap-allocated C++ object referenced via Block* / Value*. Naga inverts this and uses arenas (slab allocators):
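Why arenas:
- Rust-friendly: no Box<T> everywhere; Handle<T> is Copy and arenas are immutably borrowed, so the borrow checker is happy.
- Serialisable: handles are plain indices, so a Module can be serialised directly (e.g. with bincode), making cross-process IR transfer in Firefox very fast.
Naga isn't WGSL-only. It can also take SPIR-V and GLSL as input: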
| Frontend | Path | Use |
|---|---|---|
| WGSL → Module | naga/src/front/wgsl/ | main path, wgpu |
| SPIR-V → Module | naga/src/front/spv/ | importing external shaders |
| GLSL → Module | naga/src/front/glsl/ | migration from OpenGL projects |
That's where Naga is more flexible than Tint — Tint accepts only WGSL. But browser WebGPU only feeds WGSL, so this flexibility serves wgpu in non-browser contexts (Bevy, Deno, Servo).
| Backend | Path | Consumer |
|---|---|---|
| Module → MSL | naga/src/back/msl/ | Firefox macOS, Bevy macOS |
| Module → HLSL | naga/src/back/hlsl/ | Firefox Windows (→ DXC → DXIL) |
| Module → SPIR-V | naga/src/back/spv/ | Firefox Linux, Bevy Vulkan |
| Module → GLSL | naga/src/back/glsl/ | WebGL fallback |
| Module → WGSL | naga/src/back/wgsl/ | debug / round-trip testing |
Naga's WGSL→MSL on M2 Pro for our 16-line shader takes about 1.2 ms, roughly 2.5× faster than Tint's ~3 ms. The gap comes from Rust monomorphisation and arena locality. On wgpu's internal benchmarks (wgpu/wgpu-types/benches/), Naga compiles a 1000-line shader in ~15 ms vs Tint's ~25–30 ms. Both are far faster than the downstream native compilers (spirv-tools/glslc/DXC all run 100+ ms).
Tint's source layout follows WGSL spec chapters — e.g. tint/lang/wgsl/resolver/uniformity.cc maps to spec §13 uniformity analysis. Naga's layout follows data flow — frontends fold various source languages into Module, backends fold Module into various targets. Both are valid; Tint is friendlier to spec implementers (one-to-one mapping), Naga to engine embedders (IR-friendly).
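The arena-plus-handle layout described earlier in this chapter can be illustrated outside Rust as well. The sketch below is an analogy in TypeScript, not Naga's API: `Arena`, `Handle`, and the two-variant `Expression` type are simplified stand-ins, and the example expression is the `1 + 2 * 3` constant from the resolver discussion.

```typescript
// A handle is just an index; the type parameter only keeps arenas of different types apart.
interface Handle<T> { readonly index: number; readonly _phantom?: T }

class Arena<T> {
  private readonly items: T[] = [];

  // Stores the item and returns its index wrapped as a handle.
  append(item: T): Handle<T> {
    this.items.push(item);
    return { index: this.items.length - 1 };
  }

  get(handle: Handle<T>): T {
    const item = this.items[handle.index];
    if (item === undefined) throw new RangeError(`bad handle ${handle.index}`);
    return item;
  }

  get length(): number { return this.items.length; }
}

// Nodes refer to each other by handle, never by pointer.
type Expression =
  | { kind: "literal"; value: number }
  | { kind: "binary"; op: "+" | "*"; left: Handle<Expression>; right: Handle<Expression> };

function evaluate(arena: Arena<Expression>, handle: Handle<Expression>): number {
  const expr = arena.get(handle);
  if (expr.kind === "literal") return expr.value;
  const left = evaluate(arena, expr.left);
  const right = evaluate(arena, expr.right);
  return expr.op === "+" ? left + right : left * right;
}

// Build 1 + 2 * 3 as five arena entries.
const exprArena = new Arena<Expression>();
const one = exprArena.append({ kind: "literal", value: 1 });
const two = exprArena.append({ kind: "literal", value: 2 });
const three = exprArena.append({ kind: "literal", value: 3 });
const product = exprArena.append({ kind: "binary", op: "*", left: two, right: three });
const sum = exprArena.append({ kind: "binary", op: "+", left: one, right: product });
const folded = evaluate(exprArena, sum);
```

Since every cross-reference is a small integer, the whole arena is a flat array that can be copied or written out without chasing pointers.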
同一行 WGSL,翻译到三种 shader 语言后长什么样
The same WGSL line, in three shading languages
WGSL 进了 Tint/Naga,出来的不再是源代码而是三种格式之一。这一章把同一段主线 WGSL 翻译到三个目标,看每种目标长什么样、各自有什么 quirk。
注意点:
device = MSL 的 storage 地址空间,对应 WGSL storage。kernel = MSL 的 compute shader 入口(不是 fragment 或 vertex)。[[thread_position_in_grid]] = MSL builtin,对应 WGSL @builtin(global_invocation_id)。arrayLength(&data) 不存在于 MSL——Tint 把它编译成 uniform 常量,由 Dawn 在 dispatch 时填进去。这是 Tint 的语义适配层。D3D12 的 storage buffer 用 RWByteAddressBuffer,所有访问都按字节。Tint 把 data[i](f32 数组)翻译成 data.Load(i * 4) 加 asfloat()。这是巨大语义跳变——也是 HLSL 后端比 MSL 后端复杂的根本原因。
SPIR-V 是二进制 IR,下面给的是 spirv-dis 的可读版本。一个 16 行 WGSL 出来约 120 条 SPIR-V 指令、二进制 ~600 字节。SPIR-V 的优点是可被驱动直接吃、Khronos 已经有十几个 vendor 实现;缺点是不可读、debug 必须 spirv-dis。
| 语义 | WGSL | MSL | HLSL | SPIR-V |
|---|---|---|---|---|
| 整数溢出 | wrap (2's complement) | wrap | wrap | wrap (with OpDecorate) |
| 除以 0(整数) | 有定义:运行时结果为被除数(x / 0 = x) | 未定义 | 未定义 | 未定义 |
| NaN ordering | std::isnan 可用 | 同 | 同 | 同 |
| 越界 array read | required: 0 或 clamp | 需手动 clamp | RWByteAddressBuffer 自动 0 | 需 capability |
| workgroup memory(典型,非 spec 最小) | 16 KB 最低保证 | ~32 KB 典型(Apple) | 32 KB 典型(D3D12) | 16 KB spec 下限(Vulkan maxComputeSharedMemorySize),实际 32-48 KB 常见 |
注意"越界 read"那行——WGSL spec 强制规定越界读必须 0 或 clamp(不能 UB)。这是 WebGPU 安全模型的核心。Tint/Naga 的每个 backend都要 patch 进相应保护代码:MSL 加显式 if 判断、HLSL 用 RWByteAddressBuffer 自带的保护、SPIR-V 加 RobustBufferAccess capability。
WGSL goes into Tint/Naga and comes out as one of three formats. This chapter translates the same main-line WGSL to all three targets, side by side, with the quirks each one brings.
Things to notice:
- device = MSL's storage address space, equivalent to WGSL storage.
- kernel = MSL's compute-shader entry (not fragment, not vertex).
- [[thread_position_in_grid]] = MSL builtin, maps to WGSL @builtin(global_invocation_id).
- arrayLength(&data) doesn't exist in MSL: Tint compiles it into a uniform constant filled in by Dawn at dispatch time. This is Tint's semantic adaptation layer.
D3D12 storage buffers are RWByteAddressBuffer; all access is byte-addressed. Tint translates data[i] (an f32 array) into data.Load(i * 4) + asfloat(). This is a huge semantic shift and the reason the HLSL backend is harder than the MSL backend.
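The byte-addressed view can be mimicked on the CPU with a DataView. This is only an analogy for what `asfloat(data.Load(i * 4))` does; `loadF32` is a hypothetical helper, and little-endian layout is assumed.

```typescript
// Emulates HLSL `asfloat(data.Load(i * 4))` over a raw byte buffer.
function loadF32(buffer: ArrayBufferLike, index: number): number {
  const byteOffset = index * 4; // f32 elements are 4 bytes wide
  // Read 4 raw bytes at that offset and reinterpret them as a little-endian f32.
  return new DataView(buffer).getFloat32(byteOffset, true);
}

// The typed view plays the role of WGSL's array<f32>; the raw buffer is what HLSL sees.
const typedData = new Float32Array([1.5, -2, 3.25]);
const viaTypedIndex = typedData[2];
const viaByteLoad = loadF32(typedData.buffer, 2);
```

Both reads return the same value; the difference is that the byte-addressed form makes the element size part of every access, which is the extra work the HLSL writer has to emit.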
SPIR-V is binary IR; the text above is spirv-dis output. Our 16-line WGSL becomes ~120 SPIR-V instructions, ~600 bytes binary. SPIR-V's advantage: directly consumable by drivers, dozens of vendor implementations already exist. Disadvantage: unreadable; debugging requires spirv-dis.
| Semantic | WGSL | MSL | HLSL | SPIR-V |
|---|---|---|---|---|
| Integer overflow | wraps (two's complement) | wraps | wraps | wraps (with OpDecorate) |
| Integer divide by 0 | defined: at runtime the result is the dividend (x / 0 = x) | undefined | undefined | undefined |
| NaN ordering | std::isnan available | same | same | same |
| Out-of-bounds array read | required: 0 or clamp | manual clamp needed | RWByteAddressBuffer auto-zeros | needs capability |
| Workgroup memory (typical, not spec floor) | 16 KB guaranteed | ~32 KB typical (Apple) | 32 KB typical (D3D12) | 16 KB Vulkan maxComputeSharedMemorySize floor; 32-48 KB common in practice |
Note the "out-of-bounds read" row: the WGSL spec mandates 0 or clamp on OOB reads (no UB allowed). That's core to WebGPU's safety model. Every Tint/Naga backend patches in the corresponding protection: MSL adds an explicit bounds check, HLSL relies on RWByteAddressBuffer's built-in zeroing, and the Vulkan path enables the robustBufferAccess device feature (a Vulkan feature rather than a SPIR-V capability).
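The two permitted behaviours are easy to state in code. The function below is a CPU-side sketch of what an injected guard amounts to, not the code Tint or Naga emits; `robustRead` and `OobPolicy` are illustrative names.

```typescript
type OobPolicy = "clamp" | "zero";

// Makes data[i] well-defined for any integer i, following one of the two allowed policies.
function robustRead(data: Float32Array, i: number, policy: OobPolicy): number {
  const inBounds = i >= 0 && i < data.length;
  if (inBounds) return data[i] ?? 0;
  if (policy === "zero" || data.length === 0) return 0;   // zeroing behaviour
  const clamped = Math.min(Math.max(i, 0), data.length - 1);
  return data[clamped] ?? 0;                              // clamp to the nearest valid index
}

const sample = new Float32Array([10, 20, 30]);
const inRange = robustRead(sample, 1, "clamp");
const clampedHigh = robustRead(sample, 7, "clamp");
const clampedLow = robustRead(sample, -3, "clamp");
const zeroed = robustRead(sample, 7, "zero");
```

Either policy keeps the read inside the buffer the shader was given, which is the property the safety model needs; which value comes back is left to the backend.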
同一个 WebGPU dispatch,在三个原生 API 下的调用链
One WebGPU dispatch, mapped to three native API call chains
这是翻译⑥的最后一步。Dawn / wgpu 拿到编译好的 shader 后要把它变成原生 API 调用。这一章把主线 ⑥ 在三个原生 API 下的完整调用链列出来。
Metal 的优点:类型一致——dispatch 直接传 threadgroup 数 + 每 threadgroup 线程数(对应 WGSL workgroupCount + workgroup_size)。Apple GPU 直接用这套模型,没有翻译损失。Apple Silicon 上 Dawn 测得 dispatch 命令本身~10 µs(命令录制),等待 GPU 执行另算。
D3D12 多两个东西:① root signature(每个 pipeline 必须显式给出整个 binding shape)和 ② resource barrier(每个资源状态变化必须显式宣告)。Dawn 的 D3D12 backend 比 Metal 多 ~2× 代码量就因为要处理这两个。
| | Metal | D3D12 | Vulkan |
|---|---|---|---|
| 命令 buffer 类型 | MTLCommandBuffer | ID3D12GraphicsCommandList | VkCommandBuffer |
| Pipeline 状态 | argument buffer 自动 | root signature 显式 | descriptor set + pipeline layout 显式 |
| Resource barrier | 大部分自动 | 显式 ResourceBarrier | 显式 vkCmdPipelineBarrier(最严) |
| Threadgroup vs threads | threadgroup 和 thread 都显 | numthreads 在 shader; Dispatch 是 group 数 | 同 D3D12 |
| 同步原语 | MTLFence (event-based) | ID3D12Fence (value-based) | VkFence / VkSemaphore (binary) |
| Dawn 后端代码量 | ~3.5 万行 Obj-C++ | ~5 万行 C++ | ~5.5 万行 C++ |
Vulkan 的显式同步是三家最严的——每个 buffer/texture 在 dispatch 之前都要 declare 当前 state 和目标 state,每个 image layout transition 都要写 barrier。Dawn 的 Vulkan backend 有一个完整的 "subresource state tracker" 来在录制 encoder 时自动算出所有 barrier。这部分代码占 Vulkan backend 的 ~30%。
把主线的 8 个 JS 调用串起来,逐 API 看:
| 主线 JS | Metal(macOS) | D3D12(Windows) | Vulkan(Linux) |
|---|---|---|---|
| requestDevice() | MTLCreateSystemDefaultDevice() + [device newCommandQueue] | D3D12CreateDevice() + CreateCommandQueue() | vkCreateDevice() + vkGetDeviceQueue() |
| createBuffer(4 MB, STORAGE) | [device newBufferWithLength:4194304 options:MTLResourceStorageModePrivate] | CreateCommittedResource(D3D12_HEAP_TYPE_DEFAULT, 4 MB) | vkCreateBuffer({4 MB, STORAGE_BUFFER_BIT}) + vkAllocateMemory() |
| queue.writeBuffer(buf, 0, input) | 临时 shared buffer + blitCommandEncoder copyFromBuffer:... | upload heap + cl->CopyBufferRegion(...) | staging buffer + vkCmdCopyBuffer() |
| createShaderModule(wgsl) | Tint→MSL · [device newLibraryWithSource:msl options:...] | Tint→HLSL→DXC→DXIL · IDxcCompiler3::Compile() | Tint→SPIR-V · vkCreateShaderModule({spv_bytes, len}) |
| createComputePipeline({...}) | [device newComputePipelineStateWithFunction:...] | device->CreateComputePipelineState(&psoDesc) | vkCreateComputePipelines(layout, shader_stage) |
| pass.dispatchWorkgroups(16384) | [enc dispatchThreadgroups:MTLSizeMake(16384,1,1) threadsPerThreadgroup:MTLSizeMake(64,1,1)] | cl->Dispatch(16384, 1, 1)(每 group 64 thread 已在 HLSL [numthreads(64,1,1)] 里写死) | vkCmdDispatch(cb, 16384, 1, 1)(每 group 64 thread 在 SPIR-V LocalSize 64 1 1) |
| queue.submit([...]) | [cb commit] · 等待 completionHandler | queue->ExecuteCommandLists(1, &cl) · 等 fence | vkQueueSubmit(queue, 1, &si, fence) · vkWaitForFences() |
| read.mapAsync(READ) | shared buffer + [buf contents] 直读 | readback heap + resource->Map(0, &range, &ptr) | vkMapMemory(device, mem, 0, size, 0, &ptr) |
总结:主线的 8 个 JS 调用 → 三家 ~12-15 个原生 API 调用 + 一次驱动 shader 编译。八重翻译 到此结束第七层。最后一层(GPU 命令处理器实际 dispatch),Ch18 展开。
This unpacks translation step ⑥. Once Dawn / wgpu has the compiled shader, it must translate into native API calls. This chapter shows the complete call chain for main line ⑥ on all three native APIs.
Metal's strength: type-aligned. A dispatch takes both the threadgroup count and the per-threadgroup thread count (matching WGSL workgroupCount + workgroup_size). Apple GPUs natively use this model, so there is no translation loss. On Apple Silicon, Dawn measures the dispatch command itself at ~10 µs (command recording); GPU execution is separate.
D3D12 has two extra concepts: ① root signature (every pipeline must explicitly declare its full binding shape) and ② resource barriers (every resource state change must be declared explicitly). Dawn's D3D12 backend is ~2× the code of its Metal backend mostly because of these.
| | Metal | D3D12 | Vulkan |
|---|---|---|---|
| Command buffer type | MTLCommandBuffer | ID3D12GraphicsCommandList | VkCommandBuffer |
| Pipeline state binding | argument buffer (auto) | root signature (explicit) | descriptor set + pipeline layout (explicit) |
| Resource barriers | mostly automatic | explicit ResourceBarrier | explicit vkCmdPipelineBarrier (strictest) |
| Threadgroup vs threads | both explicit | numthreads in shader; Dispatch counts groups | same as D3D12 |
| Sync primitives | MTLFence (event-based) | ID3D12Fence (value-based) | VkFence / VkSemaphore (binary) |
| Dawn backend size | ~35k lines Obj-C++ | ~50k lines C++ | ~55k lines C++ |
Vulkan's explicit synchronisation is the strictest of the three — every buffer/texture must declare current and target state before a dispatch, every image layout transition needs a barrier. Dawn's Vulkan backend has an entire "subresource state tracker" to compute all barriers at encoder-record time. That subsystem alone is ~30% of the backend.
Threading the eight main-line JS calls through each native API:
| Main-line JS | Metal (macOS) | D3D12 (Windows) | Vulkan (Linux) |
|---|---|---|---|
| requestDevice() | MTLCreateSystemDefaultDevice() + [device newCommandQueue] | D3D12CreateDevice() + CreateCommandQueue() | vkCreateDevice() + vkGetDeviceQueue() |
| createBuffer(4 MB, STORAGE) | [device newBufferWithLength:4194304 options:MTLResourceStorageModePrivate] | CreateCommittedResource(D3D12_HEAP_TYPE_DEFAULT, 4 MB) | vkCreateBuffer({4 MB, STORAGE_BUFFER_BIT}) + vkAllocateMemory() |
| queue.writeBuffer(buf, 0, input) | transient shared buffer + blitCommandEncoder copyFromBuffer:... | upload heap + cl->CopyBufferRegion(...) | staging buffer + vkCmdCopyBuffer() |
| createShaderModule(wgsl) | Tint→MSL · [device newLibraryWithSource:msl options:...] | Tint→HLSL→DXC→DXIL · IDxcCompiler3::Compile() | Tint→SPIR-V · vkCreateShaderModule({spv_bytes, len}) |
| createComputePipeline({...}) | [device newComputePipelineStateWithFunction:...] | device->CreateComputePipelineState(&psoDesc) | vkCreateComputePipelines(layout, shader_stage) |
| pass.dispatchWorkgroups(16384) | [enc dispatchThreadgroups:MTLSizeMake(16384,1,1) threadsPerThreadgroup:MTLSizeMake(64,1,1)] | cl->Dispatch(16384, 1, 1) (per-group 64 threads baked into HLSL [numthreads(64,1,1)]) | vkCmdDispatch(cb, 16384, 1, 1) (per-group 64 threads in SPIR-V LocalSize 64 1 1) |
| queue.submit([...]) | [cb commit] · awaits completionHandler | queue->ExecuteCommandLists(1, &cl) · fence wait | vkQueueSubmit(queue, 1, &si, fence) · vkWaitForFences() |
| read.mapAsync(READ) | shared buffer + [buf contents] direct read | readback heap + resource->Map(0, &range, &ptr) | vkMapMemory(device, mem, 0, size, 0, &ptr) |
In summary: the main line's eight JS calls fan out to ~12-15 native API calls per platform, plus one driver shader compilation. Seven of eight translations done. The eighth (the GPU command processor actually dispatching SIMT waves) is unpacked in Ch18.
从一个 dispatch 到 16384 个 workgroup 再到一万亿个 ALU 操作
From one dispatch to 16,384 workgroups to a trillion ALU ops
主线 ⑧ 终于落到 GPU 硬件上了。一行 vkCmdDispatch(16384, 1, 1) 让 GPU 命令处理器(CP)启动调度。它会发生这些事:
@workgroup_size(64) 决定每个 workgroup 有 64 threads。这是程序员可控的分组单位。一个 workgroup 内的 threads:
var<workgroup> 内存(典型 16-32 KB)workgroupBarrier() 同步主线 @workgroup_size(64) 太小,没用到 workgroup memory,所以纯算"每 thread 一个 element"。Ch19 会演示用 workgroup memory 把 matmul 速度提 5×。
Subgroup(也叫 wave / warp / simdgroup)是硬件层的并行单位——一组 threads 在完全 lockstep 跑同一条指令。这是 GPU 算力的物理基础。
| 厂商 | 术语 | 典型大小 | API 中的 builtin |
|---|---|---|---|
| NVIDIA | warp | 32 | WARP_SIZE intrinsic |
| AMD GCN/RDNA1 | wave | 64 | WaveGetLaneCount |
| AMD RDNA2+ | wave | compute 默认 32 · graphics 可动态 32/64 | 同 |
| Apple GPU | simdgroup | 32 | simdgroup_size |
| Intel Gen11+ | EU thread | 8/16/32 SIMD-wide | SIMD width |
| ARM Mali | quad / warp | 4 / 8 / 16 | 视架构而定 |
WebGPU 的设计选择:subgroup 大小不暴露给应用,应用要么不用 subgroup ops,要么用 enable subgroups 启用扩展并通过 builtin subgroup_size 拿到运行时大小。这是和原生 API 的差异——native shader 一般 hard-code 一个常量。
没有 subgroup 时,做一个 32 元素 sum 至少要 5 次 workgroupBarrier() + 32 → 16 → 8 → 4 → 2 → 1 折半。用 subgroupAdd 是1 条硬件指令——NVIDIA 的 shfl.bfly、AMD 的 ds_swizzle、Apple 的 simd_sum。性能差 5-10×。
把主线在 Apple M2 Pro 上的 GPU 执行展开:
每个 thread 做一次 load 4 字节 + 一次 store 4 字节 = 8 字节内存流量 × 1M thread = 8 MB。M2 Pro 显存带宽 ~200 GB/s = 0.04 ms 满速。但 cache 命中率不到 100%、scheduling 开销、launch latency 等,实测 200 µs ≈ 40 GB/s 有效带宽——已经是显存带宽的 20%。Ch19 会演示访存模式如何决定性能。
Main line ⑧ finally hits the GPU. One vkCmdDispatch(16384, 1, 1) wakes the GPU's command processor (CP). Here's what happens:
@workgroup_size(64) declares 64 threads per workgroup. This is the programmer-controlled grouping. Threads within a workgroup:
- can share var<workgroup> memory (typically 16–32 KB)
- can synchronise with workgroupBarrier()
Our main line uses @workgroup_size(64) but no workgroup memory: it is pure "one thread per element" arithmetic. Ch19 will show how using workgroup memory delivers a roughly 4× speedup on matmul.
Subgroup (also wave / warp / simdgroup) is the hardware-level parallel unit — a set of threads running in complete lockstep. It's the physical basis of GPU throughput.
| Vendor | Term | Typical size | API builtin |
|---|---|---|---|
| NVIDIA | warp | 32 | WARP_SIZE intrinsic |
| AMD GCN/RDNA1 | wave | 64 | WaveGetLaneCount |
| AMD RDNA2+ | wave | compute defaults to 32 · graphics can be dynamic 32/64 | same |
| Apple GPU | simdgroup | 32 | simdgroup_size |
| Intel Gen11+ | EU thread | 8/16/32 SIMD-wide | SIMD width |
| ARM Mali | quad / warp | 4 / 8 / 16 | varies by arch |
WebGPU's design choice: subgroup size is not a compile-time constant for apps. You either don't use subgroup ops, or you enable subgroups and read the size at runtime via the subgroup_size builtin. Native shaders typically hard-code the constant; WebGPU has to be portable.
Without subgroups, a 32-element sum takes 5 × workgroupBarrier() + a 32 → 16 → 8 → 4 → 2 → 1 tree. With subgroupAdd it's one hardware instruction — NVIDIA's shfl.bfly, AMD's ds_swizzle, Apple's simd_sum. 5–10× faster.
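The 32 → 16 → 8 → 4 → 2 → 1 halving can be simulated lane by lane. The sketch below models a butterfly (XOR-shuffle) all-reduce on the CPU; it is an illustration of the exchange pattern, not of any vendor's instruction, and `subgroupAddSim` is a made-up name. In a workgroup-memory version each of the counted steps would need a barrier, whereas shuffle-style hardware lets lanes exchange registers directly.

```typescript
// Simulates a butterfly all-reduce across the lanes of one subgroup.
function subgroupAddSim(lanes: number[]): { total: number; steps: number } {
  const laneCount = lanes.length;
  if (laneCount === 0 || (laneCount & (laneCount - 1)) !== 0) {
    throw new RangeError("lane count must be a power of two");
  }
  let values = lanes.slice();
  let steps = 0;
  for (let offset = laneCount >> 1; offset > 0; offset >>= 1) {
    // Every lane adds the value held by the lane whose index differs in exactly one bit.
    // `values` inside the callback is still the previous step's array.
    values = values.map((value, lane) => value + (values[lane ^ offset] ?? 0));
    steps += 1;
  }
  // After the last step every lane holds the full sum; lane 0 is returned.
  return { total: values[0] ?? 0, steps };
}

// 32 lanes holding 1..32.
const laneValues = Array.from({ length: 32 }, (_, lane) => lane + 1);
const reduction = subgroupAddSim(laneValues);
```

For 32 lanes the loop runs log2(32) = 5 times, matching the five halving steps in the text.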
Main line execution on Apple M2 Pro:
Each thread does 1 load (4 bytes) + 1 store (4 bytes) = 8 bytes × 1M threads = 8 MB of memory traffic. M2 Pro's memory bandwidth is ~200 GB/s = 0.04 ms at peak. But cache misses, scheduling overhead, and launch latency push real measurement to ~200 µs ≈ 40 GB/s effective — 20% of peak bandwidth. Ch19 shows how access pattern dominates performance.
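The arithmetic in this paragraph can be reproduced directly. The constants below are the figures quoted in the text (200 GB/s peak, ~200 µs measured); nothing here is a new measurement.

```typescript
const workgroupCount = 16384;
const threadsPerWorkgroup = 64;
const bytesPerThread = 4 + 4;     // one f32 load + one f32 store
const peakBytesPerSecond = 200e9; // ~200 GB/s, as quoted in the text
const measuredSeconds = 200e-6;   // ~200 µs, as quoted in the text

const totalThreads = workgroupCount * threadsPerWorkgroup;  // about 1M threads
const trafficBytes = totalThreads * bytesPerThread;         // about 8 MB
const idealMicros = (trafficBytes / peakBytesPerSecond) * 1e6;
const effectiveGBps = trafficBytes / measuredSeconds / 1e9;
const fractionOfPeak = (effectiveGBps * 1e9) / peakBytesPerSecond;
```

The ideal time comes out near 0.04 ms and the effective bandwidth near 40 GB/s, i.e. roughly a fifth of peak, which is the gap that the access-pattern discussion sets out to explain.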
同一个 1024×1024 矩阵乘,三种实现,差 50×
Same 1024×1024 matmul, three implementations, 50× spread
主线是访存受限的纯算,但真实 GPU 工作大多是矩阵乘——LLM、CV、PCA 全都是 matmul。这一章用 1024×1024 的 matmul 演示从 naive 到 tiled 到 wmma 怎么把性能拉 50×。
分析:每个 thread 算一个 C[row,col],要读 N=1024 个 A 和 1024 个 B,做 1024 次 FMA。读 8 KB 数据做 2 KFLOP——算密度(FLOP/byte)= 0.25。M2 Pro 算力 / 带宽比 ~7,意味着访存远远跟不上,性能受带宽限制。
实测 1024×1024 matmul:~95 ms。总计算量 2 × 1024³ ≈ 2.1 GFLOP,95 ms 跑完约 23 GFLOP/s,只有理论算力 1.4 TFLOPS 的百分之几。GPU 大部分时间在等内存。
关键:256 threads(16×16)一起从 storage 读 256 个 A 和 256 个 B 元素到 workgroup memory(共 2 KB),然后每个 thread 用 16 次。读一次用 16 次 = 16× 带宽节省。算密度提到 ~4 FLOP/byte。
实测:~22 ms,比 naive 快 4.3×。
进一步,让每个 thread 算4×4 块的 C 而非单元素:算密度从 4 升到 ~32 FLOP/byte,寄存器复用,访存量再降 4×。代码省略(~60 行 WGSL)。
实测:~6 ms,比 naive 快 16×。
NVIDIA Volta+/Apple M3+/Intel Arc 的 GPU 有专用矩阵指令:NVIDIA 叫 wmma(warp matrix multiply-accumulate)、Apple 叫 simdgroup matrix、Intel 叫 XMX。一条 wmma 指令做 16×16 × 16×16 matmul,即 16³ = 4096 次 FMA(8192 FLOP)。
WebGPU 这部分目前是 chromium-experimental-subgroup-matrix 扩展(W3C 在标准化中,2026 年应进 1.1)。一旦标准化,主线 WGSL 加几行:
实测(Tint experimental,M3 Pro 上):~2 ms,比同机 naive(~75 ms)快约 37×,比 tiled+register-blocking(~5 ms)快约 2.5×。这是未来 1-2 年浏览器 ML 性能的重要催化剂。
| 实现 | 1024² matmul(M2 Pro · Chrome) | 1024² matmul(M3 Pro · Chrome) | 4096² matmul(RTX 4090 · Chrome) | 算力利用率 |
|---|---|---|---|---|
| naive | ~95 ms | ~75 ms | ~280 ms | ~5% |
| tiled (workgroup mem) | ~22 ms | ~17 ms | ~70 ms | ~22% |
| + register blocking 4×4 | ~6 ms | ~5 ms | ~18 ms | ~82% |
| + subgroup matrix (实验) | n/a(Apple matrix 仅 M3+) | ~2 ms | ~3.5 ms | ~95% |
| 参考:Metal Performance Shaders | ~1.5 ms | ~1.2 ms | — | ~100% |
| 参考:CUDA cuBLAS | — | — | ~3 ms | ~95% |
WebGPU 的极限是原生 BLAS 库的 80-95%。差距来自 ① validation 开销、② 缺少最深层硬件特性(如 NVIDIA 的 cp.async 流水线 load)、③ 编译器优化(DXC/Apple Metal 编译器仍比 Tint/Naga 经过更多优化迭代)。
The main line is memory-bound pure arithmetic, but most real GPU work is matrix multiplication — LLMs, computer vision, PCA, all matmul. This chapter walks 1024×1024 matmul from naive to tiled to wmma, showing a 50× spread.
Analysis: each thread computes one C[row, col], reading N=1024 A's and 1024 B's, doing 1024 FMAs. Reads 8 KB of data for 2 KFLOP — arithmetic intensity (FLOP/byte) = 0.25. M2 Pro's compute/bandwidth ratio is ~7, meaning bandwidth can't keep up; the workload is bandwidth-bound.
Measured 1024×1024 matmul: ~95 ms. The total work is 2 × 1024³ ≈ 2.1 GFLOP, so ~95 ms means roughly 23 GFLOP/s, only a few percent of the quoted 1.4 TFLOPS peak. The GPU spends most of its time waiting on memory.
The key: 256 threads (16×16) cooperatively load 256 A's and 256 B's from storage into workgroup memory (2 KB total), then each thread reuses each element 16 times. Load-once-use-16 = 16× bandwidth savings. Arithmetic intensity climbs to ~4 FLOP/byte.
Measured: ~22 ms, 4.3× faster than naive.
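The bandwidth saving from tiling can be checked on the CPU by counting how many operand fetches reach "global" memory. This is a plain TypeScript emulation under stated assumptions (square matrices, n a multiple of the tile size, two scratch arrays standing in for var<workgroup>); it is not the WGSL kernel, and it measures load counts, not time.

```typescript
interface MatmulResult { c: Float32Array; globalLoads: number }

// Naive: every multiply-add fetches both operands from "global" memory.
function matmulNaive(a: Float32Array, b: Float32Array, n: number): MatmulResult {
  const c = new Float32Array(n * n);
  let globalLoads = 0;
  for (let row = 0; row < n; row++) {
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        acc += (a[row * n + k] ?? 0) * (b[k * n + col] ?? 0);
        globalLoads += 2;
      }
      c[row * n + col] = acc;
    }
  }
  return { c, globalLoads };
}

// Tiled: a tile x tile "workgroup" stages one tile of A and one of B, then reuses them.
// Assumes n is a multiple of tile.
function matmulTiled(a: Float32Array, b: Float32Array, n: number, tile: number): MatmulResult {
  const c = new Float32Array(n * n);
  const tileA = new Float32Array(tile * tile); // stands in for var<workgroup>
  const tileB = new Float32Array(tile * tile);
  let globalLoads = 0;
  for (let r0 = 0; r0 < n; r0 += tile) {
    for (let c0 = 0; c0 < n; c0 += tile) {
      for (let k0 = 0; k0 < n; k0 += tile) {
        // Cooperative load: "thread" (i, j) fetches one element of A and one of B.
        for (let i = 0; i < tile; i++) {
          for (let j = 0; j < tile; j++) {
            tileA[i * tile + j] = a[(r0 + i) * n + k0 + j] ?? 0;
            tileB[i * tile + j] = b[(k0 + i) * n + c0 + j] ?? 0;
            globalLoads += 2;
          }
        }
        // A GPU version would call workgroupBarrier() here before reading the tiles.
        for (let i = 0; i < tile; i++) {
          for (let j = 0; j < tile; j++) {
            let acc = 0;
            for (let k = 0; k < tile; k++) {
              acc += (tileA[i * tile + k] ?? 0) * (tileB[k * tile + j] ?? 0);
            }
            const idx = (r0 + i) * n + c0 + j;
            c[idx] = (c[idx] ?? 0) + acc;
          }
        }
      }
    }
  }
  return { c, globalLoads };
}

// Small integer-valued inputs keep every sum exactly representable in f32.
const size = 32;
const matA = new Float32Array(size * size);
const matB = new Float32Array(size * size);
for (let i = 0; i < size * size; i++) {
  matA[i] = i % 7;
  matB[i] = i % 5;
}
const naiveRun = matmulNaive(matA, matB, size);
const tiledRun = matmulTiled(matA, matB, size, 16);
const loadRatio = naiveRun.globalLoads / tiledRun.globalLoads;
```

With a 16×16 tile the global-load count drops by a factor of 16 while the result stays identical, which is the "load once, use 16 times" effect described above.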
Go further: each thread computes a 4×4 block of C instead of one element. Arithmetic intensity rises from 4 to ~32 FLOP/byte; register reuse cuts memory traffic another 4×. Code omitted (~60 lines of WGSL).
Measured: ~6 ms, 16× faster than naive.
NVIDIA Volta+ / Apple M3+ / Intel Arc GPUs ship dedicated matrix instructions: NVIDIA calls them wmma (warp matrix multiply-accumulate), Apple calls them simdgroup matrix, Intel calls them XMX. One wmma does a 16×16 × 16×16 matmul, i.e. 16³ = 4096 FMAs (8192 FLOP).
In WebGPU this is currently the chromium-experimental-subgroup-matrix extension (W3C is standardising it; expected in 1.1 in 2026). Once standardised, you'd write:
Measured (Tint experimental, on M3 Pro): ~2 ms, about 37× faster than naive on the same machine (~75 ms) and about 2.5× faster than tiled-with-register-blocking (~5 ms). This is the main catalyst for browser-side ML perf over the next 1–2 years.
| Implementation | 1024² matmul (M2 Pro · Chrome) | 1024² matmul (M3 Pro · Chrome) | 4096² matmul (RTX 4090 · Chrome) | Utilisation |
|---|---|---|---|---|
| naive | ~95 ms | ~75 ms | ~280 ms | ~5% |
| tiled (workgroup mem) | ~22 ms | ~17 ms | ~70 ms | ~22% |
| + register blocking 4×4 | ~6 ms | ~5 ms | ~18 ms | ~82% |
| + subgroup matrix (experimental) | n/a (Apple matrix is M3+ only) | ~2 ms | ~3.5 ms | ~95% |
| reference: Metal Performance Shaders | ~1.5 ms | ~1.2 ms | — | ~100% |
| reference: CUDA cuBLAS | — | — | ~3 ms | ~95% |
WebGPU's ceiling is 80–95% of native BLAS. The gap comes from ① validation overhead, ② missing the deepest HW features (e.g. NVIDIA's cp.async pipeline loads), ③ compiler maturity (DXC and Apple's Metal compiler have years more optimisation work than Tint/Naga).
从 HuggingFace 模型到 30 tokens/s 输出,靠 WebGPU 全栈
HuggingFace model → 30 tokens/s, powered by the WebGPU stack
2024 年开始浏览器里出现了真正实用的 LLM 推理——Llama-3.2-1B 能在 M2 Mac 上跑 30 tokens/s,Qwen2.5-0.5B 能跑 60 tokens/s。这一章解释这背后的工程栈,主角是 transformers.js 和 ONNX Runtime Web。
writeBuffer 上传到 GPU。在 M2 Pro 上 ~150 ms。每个 forward pass 是 ~200 个 dispatch。每 dispatch ~150 µs(含 IPC + 实际 GPU),总共 ~30 ms。关键优化:所有 dispatch 在一次 encoder.finish() + queue.submit()里——不能每 dispatch 单独 submit(那样 IPC 开销会把性能拖垮)。
Naive 实现每个 token 都要重新算所有先前 token 的 key/value,复杂度 O(n²)。KV cache 把每个 token 的 K/V 算一次就存住,后续 token 只算自己的并 concat。复杂度变 O(n)。
在 WebGPU 里 KV cache 就是两个长 storage buffer,每生成一个 token 就 append。每层 ~16 MB(4096 hidden × 4 KV head × 2 K+V × 2 字节 × seq len),16 层 = ~256 MB。生成长 context(~2048 tokens)容易吃 GB 级 GPU 显存。这是当前浏览器 LLM 的主要瓶颈——RAM 比算力更宝贵。
| 模型 | 大小 | M2 Pro · Chrome | i9-13900K + RTX 4090 · Chrome | 原生 CPU(同模型) |
|---|---|---|---|---|
| Qwen2.5-0.5B (f16) | ~500 MB | ~60 tok/s | ~180 tok/s | ~5 tok/s (Apple Neural Engine) |
| Llama-3.2-1B (q4) | ~700 MB | ~30 tok/s | ~90 tok/s | ~3 tok/s |
| Llama-3.2-3B (q4) | ~2 GB | ~12 tok/s | ~40 tok/s | ~1 tok/s |
| SDXL-Turbo | ~1.5 GB | 4-8 s / 图 | ~0.5 s / 图 | ~30 s / 图 |
来源:HuggingFace transformers.js README perf 表 · WebLLM benchmarks · Xenova/transformers.js benchmark issues · 单机实测仅供参考。
2017 年用 WebGL 跑 BERT 是玩具(没 compute、性能差 100×)。2023 年 WebGPU GA + 2024 年 subgroup ops + Llama 系开源 + 量化(int4/int8)成熟,浏览器里跑 LLM 从不可能变成不舒服但能。再过 1-2 年 subgroup_matrix 标准化 + KV cache compression 普及,家用 GPU 上跑 7B 模型会是默认。
From 2024, browser-side LLM inference became practical — Llama-3.2-1B runs at 30 tokens/s on M2 Macs, Qwen2.5-0.5B at 60 tokens/s. This chapter dissects the engineering stack — starring transformers.js and ONNX Runtime Web.
Upload to the GPU goes through writeBuffer: ~150 ms on M2 Pro.
Each forward pass is ~200 dispatches. At ~150 µs each (IPC + actual GPU) you get ~30 ms total. Crucial optimisation: record all dispatches into one encoder.finish() + queue.submit(). Submitting per dispatch is not viable, because IPC overhead would crush performance.
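A minimal sketch of the single-submit pattern follows. The interfaces are hypothetical structural stand-ins for the WebGPU objects (so the snippet compiles without WebGPU type definitions), and `Step` and `runForwardPass` are illustrative names, not part of transformers.js or ONNX Runtime Web.

```typescript
// Structural stand-ins for the parts of the WebGPU API used here (assumption:
// real GPUDevice / GPUCommandEncoder / GPUComputePassEncoder objects fit these shapes).
interface PassLike {
  setPipeline(pipeline: unknown): void;
  setBindGroup(index: number, group: unknown): void;
  dispatchWorkgroups(x: number, y?: number, z?: number): void;
  end(): void;
}
interface EncoderLike { beginComputePass(): PassLike; finish(): unknown; }
interface DeviceLike {
  createCommandEncoder(): EncoderLike;
  queue: { submit(commandBuffers: unknown[]): void };
}

// One kernel launch of the forward pass.
interface Step { pipeline: unknown; bindGroup: unknown; workgroups: [number, number, number]; }

// Record every dispatch of a forward pass, then cross the process boundary once.
function runForwardPass(device: DeviceLike, steps: Step[]): void {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  for (const step of steps) {
    pass.setPipeline(step.pipeline);
    pass.setBindGroup(0, step.bindGroup);
    pass.dispatchWorkgroups(...step.workgroups);
  }
  pass.end();
  device.queue.submit([encoder.finish()]); // a single submit for all dispatches
}
```

Recording is cheap because it only appends to a command stream; the cost that matters is paid at submit time, which is why it should happen once per forward pass.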
A naive implementation recomputes every prior token's keys/values on every new token: O(n²). The KV cache computes each token's K/V once and reuses, only computing for the new token and concatenating. O(n).
In WebGPU the KV cache is two long storage buffers, appended per token. Per layer it holds K and V for every cached token: 2 (K+V) × KV width × 2 bytes (f16) × sequence length. The ~16 MB per layer quoted here works out to 8 KB per token at a 2048-token context, and 16 layers bring it to ~256 MB; longer contexts or wider models quickly reach GBs of GPU memory. This is the current bottleneck in browser LLMs: RAM is scarcer than FLOPS.
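The sizing can be written as a one-line formula. In this sketch the ~16 MB per layer and ~256 MB total are reproduced by assuming a KV width of 2048 f16 elements per token per layer and a 2048-token context; both inputs are assumptions chosen to match the quoted totals, not values read from a particular model config.

```typescript
interface KvCacheShape {
  layers: number;          // transformer layers
  kvWidth: number;         // elements of K (and of V) stored per token per layer
  bytesPerElement: number; // 2 for f16
  seqLen: number;          // tokens held in the cache
}

// Total bytes for K and V across all layers.
function kvCacheBytes(shape: KvCacheShape): number {
  const perTokenPerLayer = 2 * shape.kvWidth * shape.bytesPerElement; // K + V
  return perTokenPerLayer * shape.seqLen * shape.layers;
}

const MiB = 1024 * 1024;
const assumed: KvCacheShape = { layers: 16, kvWidth: 2048, bytesPerElement: 2, seqLen: 2048 };
console.log(kvCacheBytes({ ...assumed, layers: 1 }) / MiB); // 16
console.log(kvCacheBytes(assumed) / MiB);                   // 256
```

Because the size is linear in `seqLen`, doubling the context doubles the cache; grouped-query attention shrinks `kvWidth` and with it every term.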
| Model | Size | M2 Pro · Chrome | i9-13900K + RTX 4090 · Chrome | Native CPU (same) |
|---|---|---|---|---|
| Qwen2.5-0.5B (f16) | ~500 MB | ~60 tok/s | ~180 tok/s | ~5 tok/s (Apple Neural Engine) |
| Llama-3.2-1B (q4) | ~700 MB | ~30 tok/s | ~90 tok/s | ~3 tok/s |
| Llama-3.2-3B (q4) | ~2 GB | ~12 tok/s | ~40 tok/s | ~1 tok/s |
| SDXL-Turbo | ~1.5 GB | 4–8 s / image | ~0.5 s / image | ~30 s / image |
Sources: HuggingFace transformers.js README perf table · WebLLM benchmarks · Xenova/transformers.js benchmark issues · single-machine measurements only.
BERT on WebGL in 2017 was a toy (no compute, 100× slower than native). 2023 WebGPU GA + 2024 subgroup ops + open Llama models + mature quantisation (int4/int8) flipped browser LLMs from impossible to uncomfortable-but-real. With subgroup_matrix standardisation and KV cache compression another 1–2 years out, running 7B models on consumer GPUs in the browser should become the default.
需要 enable feature 才能用的能力
Features you must opt-in to before using
WebGPU 的核心 API 是所有支持设备都能跑的最小子集。三个常用扩展需要 explicit enable,本章逐个看。
shader-f1616-bit 浮点。LLM/CV 用 f16 推理已成标准——精度够用、内存减半、算力多 2×(如果 GPU 有 native f16 路径)。启用:
WGSL 用法:
enable f16; @group(0) @binding(0) var<storage, read_write> data: array<f16>;
注意:f16 在 storage 里 packed 2 个一组,对齐要求 4 字节。vec4<f16> 是 8 字节而非 16。Tint/Naga 自动处理 packing。
WGSL 核心就支持 atomic<u32> 和 atomic<i32>——这是不需要 feature enable 的。但有几个加强需要 feature:
atomic<f32>:feature chromium-experimental-storage-f32-atomic。NVIDIA + AMD 硬件支持,Apple GPU 软件 emulate。LLM 训练用得到,inference 一般不需要。atomic<i64>/u64:尚未标准化。Vulkan 1.2 有 shaderBufferInt64Atomics,但 Metal 2024 才加上。var<workgroup> atomic<u32> 上做 atomic。这是核心支持的——典型用法 reduction 时算 workgroup 内部累积。timestamp-queryGPU 上能精确测时间——而不是 performance.now() 在 JS 端 measure(那个被 IPC 延迟 polluted)。启用:
隐私限制:timestamp 的分辨率被 clamp 到 100 µs(spec §20.4 Timestamp Query · #timestamp),防止 timing side-channel 攻击。所以 100 µs 以下的 pass 测出来都是 0 或 100 µs。Native 应用能拿 GPU clock 的纳秒级分辨率,浏览器拿不到。
WebGPU 的 feature gating 用三个原生 API 至少 2 个支持 + 安全模型 OK 作 cliff。Ray Tracing extension(VK_KHR_acceleration_structure + Metal MTL ray tracing + D3D12 DXR)三家都有,但 WebGPU仍未提案——因为 BVH 构建可能泄漏几何信息 给恶意 origin。Mesh shaders、subgroup matrix 也都因为类似原因等待中。"安全是慢的"是 WebGPU 的核心代价。
WebGPU's core API is the minimum subset all supported devices can run. Three common extensions need explicit enabling; this chapter walks them.
shader-f16
16-bit floats. f16 inference is standard for LLM/CV: precision is enough, memory is halved, and compute can be 2× faster (if the GPU has a native f16 path). Enabling:
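A minimal sketch of the opt-in, assuming an adapter has already been obtained from navigator.gpu.requestAdapter(). `AdapterLike`, `supportedSubset` and `requestDeviceWithF16` are illustrative names; the interface is a structural stand-in so the snippet compiles without WebGPU type definitions.

```typescript
// Structural stand-in for the parts of GPUAdapter used here.
interface AdapterLike {
  features: { has(name: string): boolean };
  requestDevice(descriptor?: { requiredFeatures?: string[] }): Promise<unknown>;
}

// Keep only the features this adapter actually advertises.
function supportedSubset(adapter: AdapterLike, wanted: string[]): string[] {
  return wanted.filter((name) => adapter.features.has(name));
}

// Request shader-f16 when available; otherwise fall back to a core-only device.
async function requestDeviceWithF16(adapter: AdapterLike): Promise<{ device: unknown; hasF16: boolean }> {
  const requiredFeatures = supportedSubset(adapter, ["shader-f16"]);
  const device = await adapter.requestDevice({ requiredFeatures });
  return { device, hasF16: requiredFeatures.length > 0 };
}
```

Checking `adapter.features` first matters because requesting a feature the adapter lacks makes requestDevice reject; the `hasF16` flag lets the caller pick between an f16 and an f32 shader variant.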
In WGSL:
enable f16;
@group(0) @binding(0) var<storage, read_write> data: array<f16>;
Note: a scalar f16 has a size and alignment of 2 bytes, so array<f16> packs two values per 32-bit word in storage. vec2<f16> is 4 bytes and vec4<f16> is 8 bytes, not 16. Tint/Naga handles the packing.
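On the JavaScript side the data still has to arrive as 16-bit values. The sketch below converts numbers to binary16 bit patterns and packs them into a Uint16Array suitable for queue.writeBuffer. It is an illustrative helper, not part of any library; rounding is a simple round-half-up on the dropped bits rather than full ties-to-even.

```typescript
const f32View = new Float32Array(1);
const u32View = new Uint32Array(f32View.buffer);

// Convert one JS number to IEEE 754 binary16 bits.
function f32ToF16Bits(value: number): number {
  f32View[0] = value;
  const x = u32View[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  const mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant !== 0 ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // re-bias the exponent for f16
  if (e >= 0x1f) return sign | 0x7c00; // too large: becomes Inf
  if (e <= 0) {
    if (e < -10) return sign; // too small even for a subnormal: becomes signed zero
    const m = (mant | 0x800000) >> (1 - e); // subnormal: shift in the implicit 1
    return sign | ((m + 0x1000) >> 13);
  }
  // Normal number: drop 13 mantissa bits, rounding up when the top dropped bit is set.
  return sign | (((e << 10) | (mant >> 13)) + ((mant & 0x1000) !== 0 ? 1 : 0));
}

// Pack values as array<f16> data. The length is padded to an even count so the
// byte length is a multiple of 4, which queue.writeBuffer requires.
function packF16(values: number[]): Uint16Array {
  const out = new Uint16Array(values.length + (values.length % 2));
  values.forEach((v, i) => { out[i] = f32ToF16Bits(v); });
  return out;
}

console.log(packF16([1, 2, 3, 4]).byteLength); // 8, the size of one vec4<f16>
```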
Core WGSL supports atomic<u32> and atomic<i32> — no feature enabling needed. Some extensions:
- atomic<f32>: feature chromium-experimental-storage-f32-atomic. NVIDIA + AMD hardware-supported, Apple emulates in software. Used by LLM training; inference rarely needs it.
- atomic<i64>/u64: not standardised yet. Vulkan 1.2 has shaderBufferInt64Atomics, but Metal only added it in 2024.
- var<workgroup> atomic<u32> is core. Used commonly for in-workgroup reductions.

timestamp-query
Measure time precisely on the GPU, not with performance.now() on the JS side (which gets polluted by IPC latency). Enabling:
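A sketch of the usual flow is shown below. The WebGPU calls appear as comments so the helper stays runnable outside a browser; the helper names are illustrative. The 64-bit nanosecond timestamps are read as pairs of 32-bit words to avoid depending on BigInt support, which is exact for values below 2^53.

```typescript
// Setup and recording (inside an async function, with `adapter`, `encoder`
// and `resolveBuffer` already created):
//   const device = await adapter.requestDevice({ requiredFeatures: ["timestamp-query"] });
//   const querySet = device.createQuerySet({ type: "timestamp", count: 2 });
//   const pass = encoder.beginComputePass({
//     timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
//   });
//   ...dispatches...
//   pass.end();
//   encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0); // usage: QUERY_RESOLVE | COPY_SRC
//   Then copy resolveBuffer into a MAP_READ buffer, map it, and wrap it in a Uint32Array.

// Read the index-th 64-bit little-endian timestamp (nanoseconds) from 32-bit words.
function readTimestampNs(words: Uint32Array, index: number): number {
  const lo = words[2 * index];
  const hi = words[2 * index + 1];
  return hi * 4294967296 + lo;
}

// Duration of the pass in microseconds, from the begin/end timestamps.
function passDurationMicros(words: Uint32Array): number {
  return (readTimestampNs(words, 1) - readTimestampNs(words, 0)) / 1000;
}

console.log(passDurationMicros(new Uint32Array([1000, 0, 251000, 0]))); // 250
```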
Privacy clamp: timestamp resolution is clamped to 100 µs (spec §20.4 Timestamp Query · #timestamp) to prevent timing side-channel attacks. Anything under 100 µs reads as 0 or 100 µs. Native apps get nanosecond GPU clocks; browsers don't.
WebGPU's bar for admitting a feature is that at least 2 of the 3 native APIs support it and that the security model holds. Ray tracing (VK_KHR_acceleration_structure + Metal MTL ray tracing + D3D12 DXR) exists on all three, but WebGPU still has no proposal, because BVH construction can leak geometry information to a malicious origin. Mesh shaders and subgroup matrix wait for similar reasons. "Safety is slow" is WebGPU's core cost.
为什么 WebGL 2 还在 · 为什么 WebCL 死了
Why WebGL 2 sticks around · why WebCL died
每隔几年浏览器都有一次"GPU API 提案",结局都不一样。WebGL 1.0/2.0 双双 GA;WebCL 2011 起草、2014 stagnate、2018 弃;WebGPU 2017-2023 一路打到 GA。把三个放一起:
| 维度 | WebGL 2 | WebCL | WebGPU |
|---|---|---|---|
| 状态 | 2017 GA · 仍主流 | 2011 草案 · 2018 撤回 | 2023 GA · 标准化中 |
| 底层映射 | OpenGL ES 3.0 | OpenCL 1.2 | Metal · D3D12 · Vulkan |
| Compute | 无 | 是核心 | 是核心 |
| Shading 语言 | GLSL ES | OpenCL C | WGSL |
| 典型 use case | 2D/3D 游戏 · 数据可视化 | (理论上) 科学计算 | matmul · LLM · 复杂渲染 |
| 支持厂商 | 4 大全支持 | Apple/Google 一直拒绝 | 4 大都已支持 |
WebGL 2 还会在 Web 平台上活 5-10 年,原因:
所以 Three.js、Babylon、Mapbox 都做双后端:检测 navigator.gpu 可用就走 WebGPU,否则 fallback WebGL。性能差 1.5-3×(取决于工作负载),但用户都拿到合理体验。
WebGL 2 ★★★★ · WebGPU ★★★★★。
WebGPU 多线程命令录制(虽然 Web 还是单线程,但驱动层是的)+ 显式 pipeline cache = 减少 CPU bottleneck。Babylon 测得复杂场景 1.5–2× FPS 提升。
WebGL 2 ☆ · WebGPU ★★★★★。
WebGL 没 compute,只能用 framebuffer 当"compute"——慢、丑。WebGPU 是 first-class,transformers.js 全靠它。
WebGL 2 ★★★★★ · WebGPU ★★★。
据 caniuse · webgpu,WebGPU 全球 baseline 覆盖约 85%(Chrome/Edge/Firefox 桌面 + macOS Safari 26),剩下 ~15%(老 Android、iOS 18 以前、内网 IE/旧版浏览器)仍需 fallback。WebGL 2 全球 ~96%。
WebGL 2 ★★★ · WebGPU ★★。
WebGPU 概念多:device, adapter, queue, encoder, pass, pipeline, bindgroup ... 入门要 1 周。
WebGL 2 ★★ · WebGPU ★★★★。
WebGPU 有结构化错误(validation error + scope)。WebGL 只有 getError 字符串。Chrome DevTools WebGPU panel 后续会有 capture/replay。
WebGL 2 ★ · WebGPU ★★★★★。
WebGL 已冻结。WebGPU 还在加 subgroup matrix · ray tracing · multi-GPU · HDR · WebTransport for textures 等。
Every few years the Web platform sees a new "GPU API proposal", and the outcomes vary. WebGL 1.0 and 2.0 both shipped; WebCL was drafted in 2011, stagnated by 2014 and was withdrawn in 2018; WebGPU went from a 2017 start to GA in 2023. Side by side:
| Axis | WebGL 2 | WebCL | WebGPU |
|---|---|---|---|
| Status | 2017 GA · still dominant | 2011 draft · withdrawn 2018 | 2023 GA · still standardising |
| Maps to | OpenGL ES 3.0 | OpenCL 1.2 | Metal · D3D12 · Vulkan |
| Compute | none | core | core |
| Shading lang | GLSL ES | OpenCL C | WGSL |
| Typical use | 2D/3D games · viz | (theoretical) sci-compute | matmul · LLM · advanced rendering |
| Vendor support | all four | Apple/Google always refused | all four shipped |
WebGL 2 sticks around for 5–10 more years because:
So Three.js, Babylon, Mapbox all ship dual backends: detect navigator.gpu and use WebGPU, fall back to WebGL. Performance gap is 1.5–3× depending on workload — everyone gets a reasonable experience.
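The detection logic can be sketched as follows. This is not how any of the named engines implement it; `Backend`, `decideBackend` and `pickBackend` are illustrative names, and the interfaces are structural stand-ins for navigator.gpu and a canvas element.

```typescript
type Backend = "webgpu" | "webgl2" | "none";

// Structural stand-ins (assumption: navigator.gpu and HTMLCanvasElement fit these shapes).
interface GpuLike { requestAdapter(): Promise<unknown>; }
interface CanvasLike { getContext(id: string): unknown; }

// Pure decision: prefer WebGPU, then WebGL 2, else nothing usable.
function decideBackend(hasAdapter: boolean, hasWebGL2: boolean): Backend {
  if (hasAdapter) return "webgpu";
  return hasWebGL2 ? "webgl2" : "none";
}

// navigator.gpu existing is not enough: requestAdapter() can still resolve to null
// (for example on a blocklisted driver), so the adapter itself is the real test.
async function pickBackend(gpu: GpuLike | undefined, canvas: CanvasLike): Promise<Backend> {
  let adapter: unknown = null;
  if (gpu) {
    try {
      adapter = await gpu.requestAdapter();
    } catch {
      adapter = null;
    }
  }
  // Only touch the WebGL context when WebGPU is unavailable, because getContext
  // binds the canvas to the first context type requested.
  const hasWebGL2 = adapter ? false : canvas.getContext("webgl2") !== null;
  return decideBackend(Boolean(adapter), hasWebGL2);
}
```

In a page this would be called roughly as `pickBackend(navigator.gpu, document.createElement("canvas"))`, with the result selecting which renderer to construct.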
WebGL 2 ★★★★ · WebGPU ★★★★★.
WebGPU has multithreaded command recording (single-threaded on Web, but the driver layer parallelises) + explicit pipeline cache = less CPU bottleneck. Babylon measures 1.5–2× FPS on complex scenes.
WebGL 2 ☆ · WebGPU ★★★★★.
WebGL has no compute; people abuse framebuffers as "compute" — slow and ugly. WebGPU is first-class; transformers.js depends entirely on it.
WebGL 2 ★★★★★ · WebGPU ★★★.
Per caniuse · webgpu, WebGPU's global baseline reach is ~85% (Chrome/Edge/Firefox desktop + Safari 26 on macOS); the remaining ~15% (old Android, pre-iOS 18, intranet legacy browsers) still need fallbacks. WebGL 2 global ~96%.
WebGL 2 ★★★ · WebGPU ★★.
WebGPU has many concepts: device, adapter, queue, encoder, pass, pipeline, bindgroup … ~1 week to first useful program.
WebGL 2 ★★ · WebGPU ★★★★.
WebGPU has structured errors (validation error + scope). WebGL only has getError(), which returns a bare numeric error code. Chrome DevTools' WebGPU panel will soon have capture/replay.
WebGL 2 ★ · WebGPU ★★★★★.
WebGL is frozen. WebGPU is still adding subgroup matrix · ray tracing · multi-GPU · HDR · WebTransport for textures.
四个真实工程的 WebGPU 上线史
Four real engineering tales of WebGPU adoption
背景:Figma 把所有 vector 渲染用 WASM + WebGL 跑。一个百万图层 design file 在 WebGL 上 FPS ~25。2023 年 GA 后 Figma 开始迁。
挑战:① WebGL fragment shader 风格的 stencil 写法在 WebGPU 里要换成 compute;② path rasterisation 用了 GPU-side tessellation,WebGL 用 vertex shader emulate,WebGPU 直接走 compute shader 算 Bezier subdivision;③ device loss 处理(早期 NVIDIA 驱动经常 hang)。
结果:复杂文档 FPS 提到 ~60-90,CPU 占用降 40%。但设备覆盖损失——~10% 用户在 WebGL 上才稳定(主要是老 Windows + 集显),他们继续走 WebGL。
背景:Babylon.js 是 Microsoft 主导的 WebGL 渲染引擎,2022 年开始加 WebGPU 支持。现在 Babylon 7.0 默认 WebGPU + 自动 fallback。
独到设计:① Snapshot Rendering——把 WebGPU 的 command 录到一个 snapshot,复用每帧,CPU 开销降 80%;② "Engine harmonisation"——同一份 JS API 在 WebGL 后端和 WebGPU 后端都跑得动;③ 主动接入 subgroup ops 给 compute particles。
数据:Babylon 自己 benchmark 显示一个 1 万物体场景 WebGL 25 FPS / WebGPU 60 FPS / WebGPU+Snapshot 90 FPS。官方文档。
背景:Bevy 是 Rust 写的游戏引擎,用 wgpu 作 GPU 抽象层。Bevy 编译到 WebAssembly + wgpu 的 WebGPU 后端,能在浏览器里跑 Rust 游戏。
有意思的地方:① wgpu 的同一份 Rust 代码既能编译给 native(直接调 Vulkan/Metal/D3D12)也能编译给 Web(调 navigator.gpu);② Bevy 在 native 上典型 60+ FPS,编 Web 上掉到 30-45 FPS——损失主要来自 IPC(native 没那一跳)和 JS↔WASM 边界;③ 是 wgpu 项目最大的用户,反向催生了 wgpu 很多 perf 优化。
背景:Unity 历史上靠 WebGL 1/2 backend 在浏览器里跑("Unity WebGL")。2024 年 12 月发布 Unity 6 时加了实验性 WebGPU backend——和原 WebGL backend 并存,开发者在 Player Settings 里勾选启用。注意"Unity Web Player"是 2017 年就退役的老 NPAPI 插件,不要和这个新 backend 混淆。
不寻常的事:Unity 用 IL2CPP 把 C# 编到 C++ 再 emscripten → WASM,跑在 WebGPU 上。新 backend 比 WebGL 启动慢 ~2×(一次性 pipeline 编译耗时),但稳态渲染性能在复杂场景里 1.5-2× 提升。早期 demo 加载时间 ~5s(WebGPU)vs ~3s(WebGL)。Unity 6.1 / 2025 路线图把 WebGPU backend 推向 Production-Ready。
① Pipeline 预热是关键:所有四家都报告启动时把所有 pipeline 编出来很费时(~3-30 秒),用户感觉"卡"。生产策略:先编最常用的 10-20 个 pipeline,剩下的lazy。
② Device loss 必须处理:Figma 的统计数字 ~0.3% / 月。不处理 = 用户看到永久黑屏。
③ Fallback 必须 ship:~10-15% 用户暂时不能用 WebGPU。不做 fallback 等于丢这部分用户。
Background: Figma renders all vectors via WASM + WebGL. A million-layer design file on WebGL hits ~25 FPS. After WebGPU GA in 2023, Figma started migrating.
Challenges: ① WebGL stencil-style fragment writes had to become compute kernels; ② path rasterisation used GPU-side tessellation on WebGL via vertex-shader emulation — on WebGPU it goes straight to a compute shader doing Bezier subdivision; ③ device-loss handling (early NVIDIA drivers hung frequently).
Result: complex docs went from 25 FPS to 60–90 FPS, CPU usage down 40%. But device-coverage loss — ~10% of users (mostly older Windows + integrated GPUs) had to stay on WebGL.
Background: Babylon.js is the Microsoft-led WebGL renderer; WebGPU support added in 2022. Babylon 7.0 defaults to WebGPU with automatic fallback.
Distinctive design: ① Snapshot Rendering — captures WebGPU commands into a snapshot reused per frame, dropping CPU overhead by 80%; ② "Engine harmonisation" — one JS API runs on both backends; ③ early adoption of subgroup ops for GPU particles.
Data: Babylon's own benchmark — 10k objects: WebGL 25 FPS / WebGPU 60 FPS / WebGPU+Snapshot 90 FPS. Docs.
Background: Bevy is a Rust game engine using wgpu as its GPU abstraction. Compiled to WASM + wgpu's WebGPU backend, you can run Rust games in the browser.
Interesting bits: ① wgpu's same Rust code compiles to native (calls Vulkan/Metal/D3D12 directly) or to Web (calls navigator.gpu); ② Bevy hits 60+ FPS native, 30–45 FPS on Web — the gap comes from IPC (native has none) and the JS↔WASM boundary; ③ Bevy is wgpu's largest consumer and reverse-pushed many perf optimisations into wgpu.
Background: Unity has historically run in browsers via a WebGL 1/2 backend ("Unity WebGL"). When Unity 6 shipped in December 2024 it added an experimental WebGPU backend, coexisting with the WebGL backend — developers opt in via Player Settings. (Don't confuse this with "Unity Web Player", the old NPAPI plugin retired in 2017.)
The unusual thing: Unity compiles C# via IL2CPP to C++, then emscripten → WASM, running atop WebGPU. Startup is ~2× slower than the WebGL backend (one-time pipeline compilation), but steady-state rendering performance is 1.5–2× faster on complex scenes. Early demos load in ~5 s on WebGPU vs ~3 s on WebGL. Unity 6.1 / 2025 roadmap targets graduating the WebGPU backend to production-ready.
① Pipeline pre-warm is mandatory: all four report startup pipeline compilation taking ~3–30 s, which users feel as "stalls". Production strategy: pre-compile the top 10–20 pipelines, lazy-load the rest.
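One way to structure "pre-compile the hot set, lazy-load the rest" is a small promise cache over createComputePipelineAsync. The sketch below is an assumption about how such a cache could look, not code from any of the four projects; `createPipelineCache` and the string keys are hypothetical.

```typescript
// Structural stand-in for the one GPUDevice method used here.
interface PipelineDevice<D, P> {
  createComputePipelineAsync(descriptor: D): Promise<P>;
}

function createPipelineCache<D, P>(device: PipelineDevice<D, P>, descriptors: Map<string, D>) {
  const pending = new Map<string, Promise<P>>();

  // Start compilation on first request; later requests share the same promise.
  function get(key: string): Promise<P> {
    const cached = pending.get(key);
    if (cached) return cached;
    if (!descriptors.has(key)) return Promise.reject(new Error("unknown pipeline: " + key));
    const started = device.createComputePipelineAsync(descriptors.get(key) as D);
    pending.set(key, started);
    return started;
  }

  // Kick off the most-used pipelines at startup; everything else compiles on demand.
  function prewarm(hotKeys: string[]): Promise<P[]> {
    return Promise.all(hotKeys.map((key) => get(key)));
  }

  return { get, prewarm };
}
```

The async variant is used so that compilation does not block the main thread; awaiting `prewarm` behind a loading indicator keeps the first frames from stalling on the hot pipelines.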
② Device loss must be handled: Figma's number is ~0.3% / month. Ignoring it = users see permanent black screens.
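Handling device loss means waiting on the device's `lost` promise and rebuilding GPU state when it resolves. The following is a sketch under stated assumptions: `superviseDevice`, `create` and `onReady` are hypothetical names, and the retry cap is an arbitrary illustrative choice.

```typescript
// Structural stand-ins for GPUDeviceLostInfo and the `lost` promise on GPUDevice.
interface LostInfo { reason: string; message: string; }
interface LosableDevice { lost: Promise<LostInfo>; }

// A loss with reason "destroyed" was requested by the app itself (device.destroy()),
// so only other reasons should trigger a rebuild.
function shouldRecreate(info: LostInfo): boolean {
  return info.reason !== "destroyed";
}

// Create a device, hand it to the app, and recreate it whenever it is lost unexpectedly.
async function superviseDevice<T extends LosableDevice>(
  create: () => Promise<T>,
  onReady: (device: T) => void,
  maxRestarts = 3,
): Promise<void> {
  for (let attempt = 0; attempt <= maxRestarts; attempt++) {
    const device = await create();
    onReady(device); // the app re-uploads buffers and rebuilds pipelines here
    const info = await device.lost;
    if (!shouldRecreate(info)) return;
  }
}
```

All GPU resources belong to the lost device, so `onReady` has to recreate buffers, textures and pipelines rather than reuse old handles.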
③ Fallback must ship: ~10–15% of users still can't use WebGPU. Without a WebGL fallback, you lose them.
所有的"为什么不行" 都能追到隐私 / 安全 / 一致性这三个原因
Every "why doesn't this work?" traces back to privacy / security / consistency
WebGPU 的边界都是故意设的。这一章列三类边界。
所有 maxXXX 限制都是 spec 的最低保证,实际硬件可以更高,但你不应该假设更高:
| 限制 | spec 最小 | 典型 hw(M2 Pro) | 含义 |
|---|---|---|---|
maxBufferSize | 256 MB | ~4 GB | 单个 buffer 最大字节数 |
maxStorageBufferBindingSize | 128 MB | ~4 GB | 单次绑定的 storage 最大 |
maxComputeWorkgroupStorageSize | 16 KB | 32 KB | workgroup memory 最大 |
maxComputeInvocationsPerWorkgroup | 256 | 1024 | workgroup 内 thread 总数 |
maxComputeWorkgroupSizeX/Y | 256 / 256 | 1024 / 1024 | 单维度 |
maxComputeWorkgroupsPerDimension | 65535 | ~2^31 | dispatch 总 workgroup 数 |
想用 256 MB 以上的 buffer?需要 requestDevice({ requiredLimits: { maxBufferSize: 1024*1024*1024 } }) 显式 opt-in。如果硬件不支持,requestDevice reject——不是悄悄给你一个降级 device。
许多原生 GPU API 都允许应用读取 GPU 型号字符串(如 "NVIDIA RTX 4090 / driver 555.85")。WebGPU 不允许——adapter.info 被严格 limited:
vendor:粗粒度厂商名("apple"/"amd"/"intel"/"nvidia"/"qualcomm"/"arm"),不给版本号architecture:粗粒度架构名(如 "apple-7", "intel-gen11", "ada-lovelace"),可选device:通常空字符串description:通常空字符串为什么这么严:GPU 型号 + 驱动版本是极强的指纹——某些 GPU 全世界几千张。配合 IP、UA、字体列表,单次访问就能唯一识别用户。WebGPU 选择只暴露厂商类——能让程序 dispatch 适合的优化,但不能 fingerprint。
GPU 是很好的时序攻击平台——多个 origin 共用同一颗 GPU,cache 共享,timing 可推测出其他 origin 的内存访问模式。学界已 demo'd(USENIX Security 2024)从 GPU timing 推出另一个 tab 的 LLM 推理 token。WebGPU 的防御:
performance.now() 也 clamp 到 100 µs(早已)WGSL spec 要求所有 workgroup memory 在 invocation 前必须为 0。这是性能损失——硬件本来不需要 zero——但避免了 "读到上一个 dispatch 留下的数据"导致的 cross-origin 泄漏。Dawn/wgpu 在生成 shader 时自动加 zero-init prologue。
新创建的 texture 在第一次读之前必须保证为 0。如果应用没显式 clear,Dawn 在第一次绑定到 shader 时插一个 clear pass。这是开销但避免了 GPU 显存里别的 origin 残留数据被读取。
"GPU 上不应该有 origin A 的数据 origin B 能看到的可能性"——这一句话决定了 WebGPU 80% 的非性能决策。所有看起来"为什么这事不让做",最后都能追到这条。WebGPU 因此不会暴露 raw VRAM 指针、DMA buffer 共享、persistent memory 映射、cross-process GPU resource 共享——即使原生 API 有。
WebGPU's boundaries are all deliberate. This chapter enumerates three categories.
Every maxXXX value in the spec is a guaranteed minimum; real hardware can do more, but you must not assume it does:
| Limit | Spec min | Typical HW (M2 Pro) | Meaning |
|---|---|---|---|
| maxBufferSize | 256 MB | ~4 GB | largest single buffer in bytes |
| maxStorageBufferBindingSize | 128 MB | ~4 GB | largest storage binding |
| maxComputeWorkgroupStorageSize | 16 KB | 32 KB | workgroup memory cap |
| maxComputeInvocationsPerWorkgroup | 256 | 1024 | threads per workgroup |
| maxComputeWorkgroupSizeX/Y | 256 / 256 | 1024 / 1024 | per-axis cap |
| maxComputeWorkgroupsPerDimension | 65535 | ~2^31 | dispatch workgroup count |
Want a buffer over 256 MB? You must requestDevice({ requiredLimits: { maxBufferSize: 1024*1024*1024 } }) to opt in. If the hardware lacks it, requestDevice rejects — not a silent fallback to a lesser device.
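Because an over-ask makes requestDevice reject, it helps to compare the wanted limits against adapter.limits first. The helper below is an illustrative sketch (`negotiateLimits` is a hypothetical name) and only covers "maximum"-style limits, where a larger supported value is better; alignment limits work the other way round.

```typescript
type Limits = Record<string, number>;

// Split the wanted limits into those the adapter can satisfy and those it cannot.
function negotiateLimits(wanted: Limits, supported: Limits): { requiredLimits: Limits; unmet: string[] } {
  const requiredLimits: Limits = {};
  const unmet: string[] = [];
  for (const name of Object.keys(wanted)) {
    const want = wanted[name];
    const max = supported[name];
    if (max !== undefined && max >= want) requiredLimits[name] = want;
    else unmet.push(name);
  }
  return { requiredLimits, unmet };
}

// Usage with a real adapter (inside an async function):
//   const { requiredLimits, unmet } = negotiateLimits(
//     { maxBufferSize: 1024 * 1024 * 1024 },
//     { maxBufferSize: adapter.limits.maxBufferSize },
//   );
//   if (unmet.length > 0) { /* split the data across several smaller buffers */ }
//   const device = await adapter.requestDevice({ requiredLimits });
```

Requesting only what is needed, rather than copying every adapter limit, also keeps the application honest about what it depends on.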
Many native GPU APIs let apps read the GPU model string ("NVIDIA RTX 4090 / driver 555.85"). WebGPU does not — adapter.info is strictly limited:
- vendor: coarse name ("apple"/"amd"/"intel"/"nvidia"/"qualcomm"/"arm"), no version
- architecture: coarse arch (e.g. "apple-7", "intel-gen11", "ada-lovelace"), optional
- device: usually empty string
- description: usually empty string

Why so strict: GPU model + driver version is a very strong fingerprint; some GPUs have only a few thousand units worldwide. Combined with IP, UA, and font list, a single visit could uniquely identify a user. WebGPU exposes only vendor class: enough to dispatch the right optimisations, not enough to fingerprint.
GPUs are great timing side-channel platforms — multiple origins share one GPU, share caches, and timings can hint at other origins' memory access patterns. Academia demonstrated extracting another tab's LLM inference tokens from GPU timing (USENIX Security 2024). WebGPU's defences:
- performance.now() already clamped to 100 µs

The WGSL spec requires all workgroup memory to be zeroed before invocation. It's a performance cost (hardware doesn't intrinsically need it) but it prevents cross-origin leaks from "reading leftover data from a prior dispatch". Dawn/wgpu auto-inject a zero-init prologue.
A freshly created texture must read as zero before any write. If the app doesn't explicitly clear, Dawn injects a clear pass on first bind to a shader. Costs CPU+GPU work but prevents leftover VRAM from other origins leaking.
"There should be zero possibility that origin A's data is visible to origin B via the GPU." This single sentence drives 80% of WebGPU's non-perf decisions. Every "why can't I do X?" eventually traces back to it. As a result WebGPU does not expose raw VRAM pointers, DMA buffer sharing, persistent memory mapping, or cross-process GPU resource sharing — even when native APIs do.
subgroup matrix · ray tracing · multi-GPU · HDR · WGSL+
WebGPU 1.0 GA 在 2023 年。现在(2026 年初)是 WebGPU 1.1 的标准化期——8-10 个扩展在 W3C 工作组里走流程。给三年后的图:
vec2<T> 类型参数是编译器内置。社区 proposal 在讨论 user-defined generics。import 等。#define 一类。各种 build tool(如 webpack wgsl-loader)在 JS 里做 string templating。可能成为 standardised feature。storage ptr 加一些受控算术,方便实现 zero-copy 矩阵 view。从 transformers.js 到 Apple Intelligence on Web Inference,浏览器逐渐变成本地 AI 设备。WebGPU 是这条路唯一可走的桥——WebNN(Web Neural Network API)也在标准化但定位是更 high-level、走原生 ML 后端(CoreML、DirectML、NNAPI)。两个都会上线,但 WebGPU 给灵活、WebNN 给极致 ML 性能。
预测:① 7B LLM 模型在消费级 GPU上跑到 30 tok/s 是默认;② 视频会议 client 全靠 WebGPU 做实时人脸增强、降噪、背景;③ in-browser game 性能 ~原生 90%;④ Web 上首次出现"GPU 算力时常成为瓶颈"的应用(之前都是 CPU/网络)。WebGPU 是这一切的底座。
"一行 pass.dispatchWorkgroups()
要被翻译八次。
但只翻译一次
就可以让 GPU 算
一万亿次乘法。"
WebGPU 1.0 went GA in 2023. Right now (early 2026) is WebGPU 1.1's standardisation window — 8–10 extensions are moving through the W3C working group. A picture of three years out:
- vec2<T> type parameterisation is compiler-built-in. Community proposals are discussing user-defined generics.
- import / linking.
- #define-like construct. Build tools (e.g. webpack wgsl-loader) do JS-side templating. Could become standardised.
- Controlled arithmetic on storage pointers, to enable zero-copy matrix views.

From transformers.js to Apple Intelligence's Web Inference, the browser is becoming a local AI device. WebGPU is the only bridge available; WebNN (Web Neural Network API) is also standardising, but at a higher level, dispatching to native ML backends (CoreML, DirectML, NNAPI). Both will ship; WebGPU gives flexibility, WebNN gives ultimate ML performance.
Predictions: ① 7B LLMs running at 30 tok/s on consumer GPUs by default; ② video-call clients use WebGPU for real-time face enhancement / denoising / background; ③ in-browser games reach ~90% of native performance; ④ for the first time on the Web, GPU throughput becomes the routine bottleneck (previously always CPU/network). WebGPU is the foundation for all of it.
"One pass.dispatchWorkgroups()
gets translated eight times.
But translated only once,
it lets the GPU compute
a trillion multiplies."