ursb.me / notes
FIELD NOTE / 09 · Graphics · Systems · 2026

Eight translations
of one dispatch.

A WebGPU stack source-level walkthrough

A single line pass.dispatchWorkgroups(16384) is translated eight times before it reaches a GPU core: ① JS API → Blink WebIDL; ② Blink → renderer-side Wire; ③ Mojo IPC across processes; ④ GPU-process Dawn validation; ⑤ Tint compiles WGSL into SPIR-V/MSL/HLSL; ⑥ native API (Metal/D3D12/Vulkan); ⑦ vendor driver compiles to GPU ISA; ⑧ GPU command processor dispatches to SMs/CUs/EUs. This is a full WebGPU stack dissection — from one JS call to one GPU instruction, 25 chapters · Dawn + wgpu side-by-side · real source · real latencies.

CHAPTER 01

Five formulas — why GPUs, why WebGPU

From one CPU core to 16,384 GPU ALUs · from one JS line to one SIMT instruction

WebGPU exists because of five quantitative gaps. Putting them on one page explains why browsers needed a second graphics API, why WebGL wasn't enough, and why it took until 2023. It also explains why each of the 24 chapters that follow wrestles with at least one of these five formulas.

Compute formula: one CPU core at 4 GHz × 8-way SIMD × 2 IPC ≈ 64 GFLOPS. An M2 Pro across 10 cores: ~640 GFLOPS.
The same M2 Pro's GPU: 16 clusters × 32 ALUs × 1.4 GHz × 2 ops = 1,433 GFLOPS.
That is 2.2× the 10-core CPU figure on M2 Pro (~10× on a discrete GPU), but only if you can feed the same instruction to all ALUs in lockstep. That's the SIMT contract (Single Instruction, Multiple Threads).
Bandwidth formula: CPU L1 ~1 TB/s, but DRAM only ~50 GB/s. GPU HBM3 ~1 TB/s, GDDR6X ~1 TB/s.
GPUs aren't compute machines, they're bandwidth machines. How fast one dispatch runs depends on memory access patterns — so Ch19 spends a long section on how matmul goes from naive → tiled (shared memory) → wmma (tensor cores).
Translation formula: one pass.dispatchWorkgroups(16384) traverses 8 layers before it ever reaches silicon:
JS → Blink WebIDL → Renderer Wire → Mojo IPC → Dawn validate → Tint (WGSL→MSL/SPIR-V/HLSL) → Metal/D3D12/Vulkan → driver ISA → GPU
On M2 Pro, those eight add up to 200–400 µs. This article unpacks them, layer by layer, across 25 chapters.
Latency vs throughput: CPU single-thread latency is ~1 ns per operation on a 4 GHz core; GPU per-thread latency is ~1 µs (SIMT scheduling + memory round-trips).
Don't use a GPU like a CPU. One dispatch covering 1M elements finishes in < 1 ms; 1M dispatches of 1 element each take minutes (about 7.5 minutes at the ~450 µs per submit quoted later in this note). Our main line, one dispatch over 1M elements, sits exactly in the sweet spot.
Cross-platform formula: 3 native graphics APIs (Metal · D3D12 · Vulkan) × 3 browsers (Chrome · Firefox · Safari) × 2 shader IRs (SPIR-V · MSL/HLSL).
Without WebGPU you'd write 18 ports. With WebGPU it's 3 (one per browser, each adapting to 3 native APIs) + 1 common IR (WGSL). Ch16–17 walk the two real compiler chains: Tint (Google) and Naga (Mozilla), both WGSL → three shading languages.
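
The arithmetic behind the compute, latency and cross-platform formulas can be reproduced in a few lines. This is only a back-of-envelope sketch: the function names are invented for illustration, and the inputs are the round figures quoted in this note (including the ~450 µs per submit from the main-line timing table), not measurements.

```javascript
// Back-of-envelope arithmetic for three of the five formulas.
// All inputs are the round figures quoted in the text, not measurements.

// Peak throughput in GFLOPS = units × lanes × clock (GHz) × ops per lane per cycle.
function peakGflops(units, lanesPerUnit, clockGHz, opsPerCycle) {
  return units * lanesPerUnit * clockGHz * opsPerCycle;
}

const cpuCore = peakGflops(1, 8, 4.0, 2);   // one core: 8-way SIMD, 2 IPC
const cpuAll = peakGflops(10, 8, 4.0, 2);   // ten cores
const gpu = peakGflops(16, 32, 1.4, 2);     // 16 clusters × 32 ALUs

// Total wall time in seconds when every submit costs `perSubmitUs` microseconds.
function totalSeconds(dispatches, perSubmitUs) {
  return (dispatches * perSubmitUs) / 1e6;
}

// Ports needed without a common API: native APIs × browsers × shader IRs.
const portsWithoutWebGPU = 3 * 3 * 2;

console.log({ cpuCore, cpuAll, gpu, ratio: gpu / cpuAll });
console.log({ millionTinyDispatchesSec: totalSeconds(1_000_000, 450), portsWithoutWebGPU });
```

The ratio comes out at about 2.24, and a million one-element submits at 450 µs each cost 450 seconds, which is why batching work into one large dispatch matters far more than shaving microseconds off a single call.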

These five formulas thread the whole book. Ch01-05 is the background for them, Ch06-21 is their step-by-step unpacking, Ch22-25 is their synthesis.

CHAPTER 02

12-year family tree — WebGL 2011 → WebGPU 2023

Three forces converging: modern native APIs · three browser proposals · W3C standardisation

WebGPU didn't appear out of nowhere. It's the product of three forces converging around 2017:

  • Native API modernisation (2013–2016): AMD Mantle (2013) → Apple Metal (2014) → Microsoft D3D12 (2015) and Khronos Vulkan (2016). Common theme: explicit. Implicit driver-side state machines move into the application, and programmers manage resources, synchronisation and batching themselves. Cost: the API surface grew from OpenGL's ~300 functions to Vulkan's ~500 plus thousands of enums.
  • WebGL hit a wall (2017): WebGL 1.0 = OpenGL ES 2.0 (designed 2007), WebGL 2.0 = OpenGL ES 3.0 (designed 2012). Neither has compute shaders or multithreaded command recording, and per-call overhead is high. Before the AI wave few people cared about compute on the Web, but once deeplearn.js appeared in 2017 (it became TensorFlow.js in 2018), demand for compute in the browser turned urgent.
  • Three competing proposals (2017): Apple WebMetal (Metal-flavoured), Google NXT (Metal+D3D12 hybrid), Mozilla Obsidian (Vulkan-flavoured). All three solved the same problem with differently shaped APIs. The W3C "GPU for the Web" Community Group merged them. The final shape leans towards Metal because it maps most cleanly onto all three native APIs.

Full timeline:

  • 2011 · WebGL 1.0: based on OpenGL ES 2.0; Chrome 9 + Firefox 4.
  • 2013 · AMD Mantle: first "explicit" graphics API; AMD later dropped it and contributed it to Khronos as the starting point for Vulkan 1.0.
  • 2014 · Apple Metal: iOS 8 first, macOS 10.11 a year later; Apple winds down OpenGL maintenance (officially deprecated 2018).
  • 2015 · D3D12: Windows 10; near-simultaneous with Vulkan; lots of cross-pollination.
  • 2016-02 · Vulkan 1.0: released by Khronos; first cross-platform explicit API.
  • 2017-01 · WebGL 2.0: Chrome 56 + Firefox 51 on by default; still OpenGL ES 3.0 (2012 design).
  • 2017-02 · Apple WebMetal: Apple proposes "GPU for the Web" to W3C.
  • 2017-03 · Google NXT: Google's "next" codename; would become Dawn. Metal+D3D12 hybrid.
  • 2017-09 · W3C "GPU for the Web" CG: jointly chartered by Apple/Google/Mozilla/Microsoft.
  • 2018-02 · Mozilla Obsidian: Vulkan-flavoured proposal; would evolve into wgpu.
  • 2019-05 · "WebGPU" named: merged proposal; shading language still undecided (SPIR-V vs MSL vs HLSL).
  • 2020-04 · WGSL first draft: independent design, Rust-ish syntax; translatable to SPIR-V/MSL/HLSL.
  • 2021-09 · Origin Trial · Chrome 94: first public-Internet WebGPU testing.
  • 2023-05-02 · Chrome 113 GA: on by default on Windows/macOS/ChromeOS; Android followed.
  • 2024-09 · Safari 18 · WebKit: macOS 15 + iOS 18 ship WebGPU but off by default (behind a flag). WebKit's own implementation (not Dawn, not wgpu).
  • 2025-07 · Firefox 141 GA: Windows first, macOS/Linux to follow; powered by wgpu (Rust).
  • 2025-09 · Safari 26 default-on: enabled by default on macOS 26 Tahoe / iOS 26 / iPadOS 26, the release in which Apple moved to year-based version numbers.
  • 2026 · WebGPU 1.0 W3C REC (projected): currently Candidate Recommendation; exact REC date not yet set; subgroup / FP16 extensions in flight.

Three names, one API, three codebases: users see navigator.gpu but underneath are three independent implementations — Dawn (Chrome, C++), wgpu (Firefox + Deno + Bevy + Servo, Rust), and WebKit's own (Safari, Obj-C++). This book contrasts the first two repeatedly; Safari's is too young and still missing key extensions.

CHAPTER 03

Why not just expose Vulkan to JS

Explicit ≠ safe · cross-platform ≠ consistent · fast ≠ portable

This is the most commonly asked question about WebGPU's design. Vulkan has been a Khronos standard since 2016, runs on Windows/Linux/Android, exposes explicit control, and performs close to the metal. Why not just add navigator.vk = ... and let JS call it?

Because "explicit" ≠ "safe", "cross-platform" ≠ "consistent", and "fast" ≠ "portable". Each in turn:

Issue 1 · Safety

Vulkan validation is opt-in and ~80% of installs ship it disabled. A bad vkCmdDraw (e.g. a vertex buffer with the wrong stride) on a release driver will read out-of-bounds memory. On a desktop app that's a crash; in a browser it's a cross-origin info leak — other origins' textures readable under the same GPU context. So WebGPU's design baseline is: every API call must be 100% validated before it reaches the GPU. That's why Dawn and wgpu each carry a thick device/validation layer.

Issue 2 · Consistency

Vulkan leaves hardware variance to the app. Storage-buffer offsets might need 16-byte alignment on NVIDIA, 64 on Intel, and 256 on ARM Mali. Vulkan reports the value in VkPhysicalDeviceLimits.minStorageBufferOffsetAlignment and leaves the rest to you; getting it wrong is UB. WebGPU inverts this: minStorageBufferOffsetAlignment defaults to 256 bytes on every implementation (spec §3.6.2), and a misaligned offset is a validation error rather than UB. You waste a bit of space, but code written against the default behaves identically on every browser and every device.
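
In practice the 256-byte rule shows up whenever several sub-ranges are packed into one storage buffer. A minimal sketch, assuming the default alignment of 256; alignUp and layoutSuballocations are hypothetical helper names, not part of the WebGPU API:

```javascript
// Round `offset` up to the next multiple of `alignment` (a power of two).
function alignUp(offset, alignment) {
  if (alignment <= 0 || (alignment & (alignment - 1)) !== 0) {
    throw new RangeError('alignment must be a positive power of two');
  }
  return Math.ceil(offset / alignment) * alignment;
}

// Place several sub-allocations (byte sizes) inside one buffer so that each
// one starts at an offset that is legal for a storage binding.
function layoutSuballocations(sizes, alignment = 256) {
  let cursor = 0;
  const offsets = [];
  for (const size of sizes) {
    cursor = alignUp(cursor, alignment);
    offsets.push(cursor);
    cursor += size;
  }
  // Round the total up to 4 bytes, a convenient size for buffer copies.
  return { offsets, totalSize: alignUp(cursor, 4) };
}

// In a browser the actual value would come from the device:
//   const alignment = device.limits.minStorageBufferOffsetAlignment;
console.log(layoutSuballocations([100, 300, 16]));
```

For sizes of 100, 300 and 16 bytes the offsets land at 0, 256 and 768, so roughly half of the 784-byte buffer is padding. That is the "bit of space" the consistency guarantee costs.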

Issue 3 · Portability

Vulkan doesn't run on macOS/iOS (directly). On Apple platforms you need MoltenVK, which translates Vulkan → Metal — one more layer means one more bug surface, and Apple actively abandoned the Vulkan WG. WebGPU's stance: assume no specific native API exists. Each browser picks the right backend (macOS → Metal, Windows → D3D12, Linux/Android → Vulkan). The user-facing API is the same.

There's also a non-technical reason that is often underweighted: Khronos moves slowly. Vulkan needed 6 years to make dynamic rendering core (Vulkan 1.3, 2022); Metal had it on day one. The WebGPU WG moves faster because it doesn't need all three native APIs to ship a feature simultaneously; Dawn/wgpu only need to be able to emulate it consistently on each backend. Example: subgroup ops are Vulkan 1.1 core, "wave intrinsics" on D3D12, and "simdgroup" functions on Metal. One set of WGSL subgroup built-ins maps to all three.

Summary

Vulkan was designed to feed GPUs from desktop AAA games; WebGPU was designed to safely expose the GPU to untrusted code (JS). Same hardware underneath, two audiences, two API shapes. WebGPU isn't "Vulkan-lite" — it's a different design point.

CHAPTER 04

Design philosophy — safe · portable · fast, all three

WebGPU vs WebGL vs Vulkan vs Metal vs D3D12 on four axes

"Safe, portable, fast" was an impossible triangle pre-2015. OpenGL was safe but slow; Vulkan was fast but unsafe; Metal was safe-and-fast but Apple-only. WebGPU bet you could have all three, with three concessions:

  1. Concession 1 · Performance: give up the last 10–20%. Validation + safe defaults mean each dispatch costs ~50–200 µs extra CPU. wmma, async transfers, and extreme tricks aren't exposed yet. Production matmul demos reach ~85% of native Metal (Dawn benchmark suite).
  2. Concession 2 · Flexibility: give up some hardware-specific features. Vulkan's VK_KHR_acceleration_structure (ray tracing) isn't yet in WebGPU. subgroup only landed in 2025. Every new feature waits for at least 2 of 3 native APIs to support it before standardisation.
  3. Concession 3 · Time: 12 years. Six years of Community-Group discussion after 2017 before 2023 launch. Standard W3C pace. But compared with the alternative — "50 lines of JS, one bug = read another origin's screenshot" — worth it.

Set WebGPU against its four relatives:

| | WebGL 2 | WebGPU | Vulkan | Metal | D3D12 |
|---|---|---|---|---|---|
| API surface (functions) | ~300 | ~120 | ~500 | ~250 | ~400 |
| Explicit sync | no | mostly auto | yes (fence/semaphore/barrier) | mostly auto | yes (fence + resource barrier) |
| Compute | none | first-class | yes | yes | yes |
| Validation mandatory | yes | yes (spec-required) | no (layer opt-in) | partial | no (Debug Layer opt-in) |
| Cross-OS | yes | yes | yes (macOS via MoltenVK) | no (Apple only) | no (Windows/Xbox only) |
| Shading lang | GLSL ES | WGSL | SPIR-V (compile from GLSL/HLSL) | MSL | HLSL / DXIL |
| Typical app | legacy Web 3D · games | Figma · transformers.js · Babylon · Bevy-on-Web | AAA Linux · Switch · Android games | native macOS/iOS apps | Windows games · Xbox |
| Design year | 2012 (OpenGL ES 3.0) | 2017–2023 | 2015 | 2014 | 2015 |

Key observation

WebGPU is the only API that satisfies all four of: ① cross-OS, ② mandatory validation, ③ compute, ④ API surface under 200 functions. That's its raison d'être — not to beat Vulkan in benchmarks, but to safely expose the GPU to untrusted code.

CHAPTER 05

Three-stack atlas — Chrome · Firefox · Safari

The same device.queue.submit() traverses 8 different layers in each browser

"WebGPU" is one API, but it has three independent implementations. The 8-layer journey from JS to GPU differs across browsers. Side by side:

Chrome / Edge
  1. JS: V8 → WebIDL bindings (generated from web_gpu_*.idl)
  2. Renderer: Blink, third_party/blink/renderer/modules/webgpu/
  3. Wire: Dawn Wire client (C++, serialised command stream)
  4. IPC: Mojo · Renderer ↔ GPU process
  5. GPU process: Dawn Wire server + Dawn native validation
  6. Shader compile: Tint, WGSL → MSL / HLSL / SPIR-V
  7. Native API: Metal · D3D12 · Vulkan (per platform)
  8. Driver + GPU: vendor driver compiles to GPU ISA · command processor dispatches

Firefox
  1. JS: SpiderMonkey → WebIDL bindings
  2. Renderer: dom/webgpu/ (C++ glue)
  3. Wire: wgpu Rust client (in-process IPDL fast path)
  4. IPC: IPDL · Content ↔ GPU process
  5. GPU process: wgpu-core (Rust validation + state tracking)
  6. Shader compile: Naga, WGSL → MSL / HLSL / SPIR-V
  7. Native API: wgpu-hal → Metal · D3D12 · Vulkan
  8. Driver + GPU: same as Chrome

Safari
  1. JS: JavaScriptCore → WebIDL bindings
  2. Renderer: WebKit/Source/WebCore/Modules/WebGPU/
  3. Wire: XPC-encoded RemoteGPU interface
  4. IPC: XPC · WebContent ↔ GPU.framework process
  5. GPU process: WebGPU.framework (Obj-C++)
  6. Shader compile: in-house WGSL→MSL compiler (not Tint)
  7. Native API: Metal exclusively
  8. Driver + GPU: Apple M-series GPU / AMD / Intel

All three stacks share: ① API shape (the spec), ② shader language (WGSL), ③ security model (same-origin, zeroed memory, capped limits). But implementations are entirely independent — both a strength and a burden of W3C standardisation. Ch11–Ch17 mostly walks the Chrome (Dawn) path, since it has the most users and the cleanest source; wgpu/Naga appears for contrast.

Why a separate GPU process

Keeping GPU calls in a separate process is core to the browser security model. Reasons: ① GPU drivers are enormous C/C++ codebases (NVIDIA's is ~10 MLOC, AMD's ~5 MLOC) and ship bugs; ② driver crashes typically take down the host process; ③ GPU memory contains textures from other origins / other tabs, requiring process-level isolation. A GPU-process crash blanks one page; renderers survive; the UI process certainly survives.

MAIN LINE ✦

1M floats squared — 40 lines of JS · 8 translations

The article's main line. Each chapter dissects one line (or group), tracing it through the full stack

This is Field Note 09's main line. It does one thing: square each element of a 1,048,576-float array on the GPU, then read back to verify. Every line maps to one of the 25 chapters that follow — read them in order and you're translating these 40 lines layer by layer down to GPU instructions.

MAIN LINE · JS · 1 dispatch · 8 translations

// ① Prepare 1M floats. input[i] = sqrt(i), so out[i] ≈ i (exact for perfect squares).
const N = 1 << 20; // 1,048,576
const input = new Float32Array(N);
for (let i = 0; i < N; i++) input[i] = Math.sqrt(i);

// ② Get GPU adapter + device · Ch06
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// ③ Create a storage buffer · Ch07
const buf = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
});
device.queue.writeBuffer(buf, 0, input); // 4 MB host→device

// ④ Compile WGSL · Ch08, Ch14-16
const shader = device.createShaderModule({ code: `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i < arrayLength(&data)) { data[i] = data[i] * data[i]; }
  }`});

// ⑤ Pipeline + bind group · Ch09
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: shader, entryPoint: 'main' },
});
const bg = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: buf } }],
});

// ⑥ Record and submit dispatch · Ch10, Ch11
const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bg);
pass.dispatchWorkgroups(Math.ceil(N / 64)); // 16,384 workgroups
pass.end();

// ⑦ Copy back to host + async map · Ch07
const read = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
enc.copyBufferToBuffer(buf, 0, read, 0, input.byteLength);
device.queue.submit([enc.finish()]); // kick the GPU

// ⑧ Read result back to JS
await read.mapAsync(GPUMapMode.READ);
const out = new Float32Array(read.getMappedRange());
console.log(out[100]); // 100.0 ✓ (sqrt(100)² == 100)
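
The listing only spot-checks out[100]. A fuller check compares every element against a CPU reference. The sketch below is an illustration under stated assumptions: verifySquares is a hypothetical helper, the relative tolerance of 1e-6 is an arbitrary choice, and a CPU-computed stand-in replaces the GPU result so the code runs without WebGPU.

```javascript
// CPU-side check for the main line: out[i] should equal input[i] squared,
// evaluated in 32-bit floats. Math.fround rounds a double to the nearest f32.
function verifySquares(input, out, relTol = 1e-6) {
  if (input.length !== out.length) return { ok: false, firstBad: 0 };
  for (let i = 0; i < input.length; i++) {
    const expected = Math.fround(input[i] * input[i]);
    const err = Math.abs(out[i] - expected);
    if (err > relTol * Math.max(1, Math.abs(expected))) {
      return { ok: false, firstBad: i };
    }
  }
  return { ok: true, firstBad: -1 };
}

// Stand-in for the GPU result so the sketch runs outside a browser.
const n = 1024;
const input = new Float32Array(n);
for (let i = 0; i < n; i++) input[i] = Math.sqrt(i);
const fakeOut = input.map((x) => x * x); // Float32Array.map keeps f32 storage

console.log(verifySquares(input, fakeOut));
```

In the browser, the same function would be called as verifySquares(input, out) right after getMappedRange(), before read.unmap().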

These 40 lines unfold the 8 translations:

| # | Translation | Chapter | Typical cost (M2 Pro · Chrome) |
|---|---|---|---|
| ① | JS → Blink WebIDL binding | Ch06, Ch10 | < 1 µs (V8 inline cache hit) |
| ② | Blink → Dawn Wire client (serialise) | Ch10, Ch11 | ~5 µs / dispatch (~80 bytes) |
| ③ | Mojo IPC · Renderer → GPU process | Ch11 | ~30 µs (with ring-buffer flush) |
| ④ | Dawn validation + state tracking | Ch12 | ~10 µs / dispatch |
| ⑤ | Tint: WGSL → MSL/HLSL/SPIR-V (first) | Ch14, Ch16 | ~5 ms (first) · 0 cached |
| ⑥ | Metal / D3D12 / Vulkan API call | Ch17 | ~20 µs (incl. PSO switch) |
| ⑦ | Vendor driver compiles to GPU ISA | Ch17 | ~10 ms (first) · 0 cached |
| ⑧ | GPU CP dispatches · SIMT execute | Ch18, Ch19 | ~200 µs (16384 wg × 64 thr × N cycles) |

Total (warm caches): ~250 µs CPU + ~200 µs GPU = ~450 µs. First-time cold: ~15 ms. Steady state: ~450 µs per submit. The chapters that follow unpack this timeline layer by layer.
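
The budget can be re-added from the table rows. The sketch below copies the table's figures as data (treating "< 1 µs" as 1 µs, which is an assumption) and sums them. Note that the itemised CPU rows alone come to about 66 µs; the rest of the ~250 µs CPU figure is not broken out in the table.

```javascript
// Per-layer costs from the table, in microseconds (M2 Pro · Chrome).
// `warm` is the cached steady state; `first` is the cold, first-time cost.
const layers = [
  { name: 'JS → WebIDL binding',    side: 'cpu', warm: 1,   first: 1 },
  { name: 'Dawn Wire serialise',    side: 'cpu', warm: 5,   first: 5 },
  { name: 'Mojo IPC',               side: 'cpu', warm: 30,  first: 30 },
  { name: 'Dawn validation',        side: 'cpu', warm: 10,  first: 10 },
  { name: 'Tint WGSL compile',      side: 'cpu', warm: 0,   first: 5000 },
  { name: 'Native API call',        side: 'cpu', warm: 20,  first: 20 },
  { name: 'Driver ISA compile',     side: 'cpu', warm: 0,   first: 10000 },
  { name: 'GPU dispatch + execute', side: 'gpu', warm: 200, first: 200 },
];

// Sum one cost column over a set of rows.
function sum(rows, key) {
  return rows.reduce((acc, row) => acc + row[key], 0);
}

const warmCpuUs = sum(layers.filter((l) => l.side === 'cpu'), 'warm');
const warmGpuUs = sum(layers.filter((l) => l.side === 'gpu'), 'warm');
const coldTotalUs = sum(layers, 'first');

console.log({ warmCpuUs, warmGpuUs, coldTotalMs: coldTotalUs / 1000 });
```

The cold total of about 15.3 ms matches the ~15 ms quoted above and is dominated by the two compile steps, which is why shader and pipeline caching matter more than any per-call optimisation on first run.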

CHAPTER 06

Adapter & Device — getting hold of a GPU

Two awaits · one hardware pick · one capability negotiation

Main line ②, two lines:

const adapter = await navigator.gpu.requestAdapter();   // physical GPU
const device  = await adapter.requestDevice();           // virtual handle

Those two lines do four things — this chapter covers three (the fourth, device loss, is Ch13):

  1. Enumerate physical GPUs. A MacBook Pro might have an Apple GPU + Intel iGPU + an external eGPU. requestAdapter() returns the first usable one by default. You can hint powerPreference: 'high-performance' or 'low-power' (advisory; on NVIDIA laptops this actually flips to the dGPU). forceFallbackAdapter: true forces SwiftShader / Dawn's null backend, useful for tests.
  2. Read features / limits. Adapter exposes optional features (e.g. 'float32-filterable', 'shader-f16', 'subgroups') and hardware limits (maxBufferSize, maxStorageBufferBindingSize, maxComputeWorkgroupSizeX). These are the first gate in WebGPU's safety model: every reported limit is at least as capable as the spec-mandated default.
  3. Create the device. requestDevice() accepts requiredFeatures and requiredLimits — if hardware doesn't support them the Promise rejects. It's a contract: the device is a virtual GPU guaranteed to have the requested capabilities. A process can have multiple devices (different origins/iframes), but once a device is lost, it never recovers.
Dawn · src/dawn/native/Adapter.cpp · simplified

// How Dawn implements requestDevice (simplified)
MaybeError AdapterBase::CreateDevice(
    const DeviceDescriptor* descriptor,
    Ref<DeviceBase>* result) {
  // ① Validate requiredFeatures against this adapter
  DAWN_TRY(ValidateFeatureSupportedWithToggles(descriptor, ...));
  // ② Validate requiredLimits against adapter caps
  DAWN_TRY(ValidateLimits(adapterLimits, requiredLimits));
  // ③ Defer to the backend's device factory:
  //    Vulkan → vulkan::Device::Create
  //    Metal  → metal::Device::Create (Obj-C++)
  //    D3D12  → d3d12::Device::Create
  DAWN_TRY_ASSIGN(*result, mPhysicalDevice->CreateDevice(descriptor));
  // ④ Register with instance, arm error tracking
  (*result)->TrackPendingError();
  return {};
}
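
Seen from the JS side, the three steps above can be sketched as one function. This is an illustration, not the only way to do it: initDevice is a hypothetical name, the gpu object is passed in (navigator.gpu in a browser) so the function can be exercised with a stub, and clamping with Math.min only makes sense for "maximum"-style limits.

```javascript
// Sketch of the adapter → device negotiation.
async function initDevice(gpu, wantedFeatures = [], wantedLimits = {}) {
  if (!gpu) throw new Error('WebGPU is not available');

  // Step 1: pick an adapter. The power preference is only a hint.
  const adapter = await gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) throw new Error('no suitable GPU adapter');

  // Step 2a: keep only the optional features this adapter actually reports.
  const requiredFeatures = wantedFeatures.filter((f) => adapter.features.has(f));

  // Step 2b: clamp "maximum"-style limits to what the adapter reports.
  const requiredLimits = {};
  for (const [name, value] of Object.entries(wantedLimits)) {
    requiredLimits[name] = Math.min(value, adapter.limits[name]);
  }

  // Step 3: contract-style creation; the promise rejects if a requirement
  // is not supported, so filtering first avoids that failure.
  const device = await adapter.requestDevice({ requiredFeatures, requiredLimits });
  return { adapter, device, requiredFeatures, requiredLimits };
}

// In a browser:
//   const { device } = await initDevice(navigator.gpu, ['shader-f16'],
//                                        { maxBufferSize: 1 << 30 });
```

Filtering features up front trades a hard rejection for graceful degradation: the caller inspects requiredFeatures afterwards and picks a shader variant accordingly.
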
Why two awaits

Adapter selection can be nondeterministic: the browser checks permission policy, privacy settings and current GPU contention, and may even surface a permission prompt (in some browsers the first WebGPU call needs a user gesture). A synchronous return is unrealistic. Device creation is async too, because the backend may need to allocate a resource pool, load a shader cache, and negotiate an IPC channel. Dawn on macOS takes ~12 ms for the first device, ~1–2 ms cached.

Measured cost of main line ② on M2 Pro · Chrome 130 (warm caches, second time onward):

  • requestAdapter(): ~0.8 ms (IPC round trip + browser policy check)
  • requestDevice(): ~1.5 ms (Metal device + Dawn validation init)
  • First call (cold): ~50 ms (driver process load + GPU process bring-up)
CHAPTER 07

Buffer · 映射 · 生命周期

Buffers · mapping · lifetimes

4 MB 数据怎么从 JS Float32Array 到达 GPU 显存

How 4 MB of Float32Array gets from JS to GPU memory

主线 ③ + ⑦ 处理两个 buffer:一个 storage buffer(GPU 读写)和一个 map-read buffer(CPU 回读)。这一章拆 GPUBuffer 的 10 个 usage flag、3 种数据路径、2 种映射模式。

① 十个 usage flag

Flag含义能去哪主线用了吗
MAP_READCPU 可 map 读读结果 buffer是(read)
MAP_WRITECPU 可 map 写staging buffer(少见)
COPY_SRC能作为 copyBuffer 的源storage → read 拷贝是(buf)
COPY_DST能作为 copyBuffer/writeBuffer 的目标writeBuffer 目标 · 读 buffer 目标是(两者都)
INDEX能绑定为 index bufferrender pipeline
VERTEX能绑定为 vertex bufferrender pipeline
UNIFORM能绑为 uniform binding(read-only,小)shader 入参常量
STORAGE能绑为 storage binding(rw,大)compute / fragment是(buf)
INDIRECT能作为 dispatchIndirect/drawIndirect 的参数GPU-driven 工作流
QUERY_RESOLVEtimestamp/occlusion query 的目标性能测量

组合约束MAP_READ 只能和 COPY_DST 一起;MAP_WRITE 只能和 COPY_SRC 一起。这是设计上的隔离——map 过的 buffer 不能直接用于 shader binding,避免 CPU 边写 GPU 边读的竞争。主线里读结果先 copyBufferToBuffer 到一个 MAP_READ | COPY_DST buffer,再 mapAsync 读。

② 三种数据路径

路径 A · writeBuffer(最常用)

由 queue 提供:device.queue.writeBuffer(dst, dstOffset, srcData)。Dawn 内部走两步:a) 在 GPU 端分配一个临时 upload buffer;b) memcpy srcData 进去;c) 排一个 GPU copy 命令把它拷到 dst。同步调用 + 异步执行。主线 ③ 就是这条路径。

路径 B · mappedAtCreation: true

创建 buffer 时直接拿到 ArrayBuffer 写:buf.getMappedRange() → 写 → buf.unmap()。优点:可能少一次 memcpy(直接写到 GPU upload heap)。不需要任何 map flag:mappedAtCreation: true 对任意 usage 都合法(没有 MAP_WRITE 也可以),唯一的额外要求是 size 必须是 4 的倍数。适合静态数据如 vertex/uniform。

路径 C · copyBufferToBuffer + mapAsync

用于读回。需要分两个 buffer:① 计算用的 STORAGE | COPY_SRC,② 读取用的 MAP_READ | COPY_DST。submit() 后 await mapAsync()。Dawn 会 poll fence;mapAsync 的 promise 在 GPU 完成那次 submit 后 resolve。主线 ⑦ + ⑧ 用此。

③ 生命周期 — destroy() vs GC

GPUBuffer 是显式资源。它和 Float32Array 不一样:JS GC 看不到 GPU 上分配的 4 MB。如果只持有 JS 引用而不调 buf.destroy(),buffer 会在 GPUDevice 被 GC 时才释放——对长跑的页面这可能持有 GB 级显存几分钟。规范规定:destroy() 立即解绑所有引用并归还显存。Dawn 实现见 src/dawn/native/Buffer.cpp::APIDestroy()

写过 WebGL 的人最容易踩的坑

WebGL 的 buffer 生命周期是 GL 上下文范围——context lost 时 GL 帮你清。WebGPU 没有这个语义——device lost 时你的 buffer 列表从应用的视角看依然存在但全部 invalid,需要手动管。模式:把所有 GPU 资源都挂在一个 resourceGarden 类下,devicelost 时遍历 destroy。

Main line ③ and ⑦ both touch buffers: one storage buffer (GPU read/write) and one map-read buffer (CPU read-back). This chapter unpacks GPUBuffer's 10 usage flags, 3 data paths, and 2 mapping modes.

① The ten usage flags

Flag | Meaning | Where it goes | Main line uses
MAP_READ | CPU can map for reading | readback buffer | yes (read)
MAP_WRITE | CPU can map for writing | staging (rare) | no
COPY_SRC | source of copyBuffer | storage → read | yes (buf)
COPY_DST | target of copyBuffer / writeBuffer | writeBuffer target · read buffer target | yes (both)
INDEX | bindable as index buffer | render pipeline | no
VERTEX | bindable as vertex buffer | render pipeline | no
UNIFORM | uniform binding (read-only, small) | shader constants | no
STORAGE | storage binding (rw, large) | compute / fragment | yes (buf)
INDIRECT | argument for dispatchIndirect / drawIndirect | GPU-driven flows | no
QUERY_RESOLVE | target of timestamp/occlusion queries | perf measurement | no

Combinatorial constraint: MAP_READ can only combine with COPY_DST; MAP_WRITE only with COPY_SRC. This is deliberate isolation — a mapped buffer can't be bound to a shader, eliminating CPU-write-while-GPU-reads races. Hence the two-buffer pattern in the main line: compute writes into STORAGE | COPY_SRC; we then copyBufferToBuffer into a MAP_READ | COPY_DST buffer and mapAsync.
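The combination rule is easy to check with bit masks. The sketch below hard-codes the numeric values of GPUBufferUsage from the WebGPU spec so it runs without a GPU; `checkMapUsage` is a hypothetical helper, not a WebGPU API.

```javascript
// Bit values of GPUBufferUsage as defined in the WebGPU spec.
const Usage = {
  MAP_READ: 0x0001, MAP_WRITE: 0x0002, COPY_SRC: 0x0004, COPY_DST: 0x0008,
  INDEX: 0x0010, VERTEX: 0x0020, UNIFORM: 0x0040, STORAGE: 0x0080,
  INDIRECT: 0x0100, QUERY_RESOLVE: 0x0200,
};

// Hypothetical helper: returns null when the combination is allowed,
// otherwise a short reason string.
function checkMapUsage(usage) {
  if (usage === 0) return 'usage must not be 0';

  // Any bit other than MAP_READ / COPY_DST is illegal next to MAP_READ.
  const extraWithRead = usage & ~(Usage.MAP_READ | Usage.COPY_DST);
  if ((usage & Usage.MAP_READ) && extraWithRead !== 0) {
    return 'MAP_READ may only be combined with COPY_DST';
  }

  // Any bit other than MAP_WRITE / COPY_SRC is illegal next to MAP_WRITE.
  const extraWithWrite = usage & ~(Usage.MAP_WRITE | Usage.COPY_SRC);
  if ((usage & Usage.MAP_WRITE) && extraWithWrite !== 0) {
    return 'MAP_WRITE may only be combined with COPY_SRC';
  }
  return null;
}
```

With these rules the main line's two buffers both pass (`STORAGE | COPY_SRC | COPY_DST` and `MAP_READ | COPY_DST`), while a tempting shortcut such as `MAP_READ | STORAGE` is refused.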

② Three data paths

Path A · writeBuffer (most common)

Provided by queue: device.queue.writeBuffer(dst, dstOffset, srcData). Dawn internally: a) allocate a transient upload buffer on the GPU, b) memcpy srcData into it, c) enqueue a copy command into dst. Synchronous call · asynchronous execution. Main line ③ takes this path.

Path B · mappedAtCreation: true

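Get an ArrayBuffer at creation: buf.getMappedRange() → write → buf.unmap(). This can save one memcpy, since the implementation may hand you memory in its upload heap directly. No mapping flag is required: mappedAtCreation: true is valid for any usage, even without MAP_WRITE, so a STORAGE, VERTEX or UNIFORM buffer can be filled this way. The one extra rule is that size must be a multiple of 4. Ideal for static vertex/uniform data.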

Path C · copyBufferToBuffer + mapAsync

For read-back. Requires two buffers: ① compute target STORAGE | COPY_SRC, ② readback MAP_READ | COPY_DST. After submit() you await mapAsync(). Dawn polls a fence; the mapAsync promise resolves once the GPU has finished that submit. Main line ⑦ + ⑧.
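Path C can be sketched as a single helper. `readBackF32` is a hypothetical name; the fallback constants mirror the spec values of GPUBufferUsage and GPUMapMode and exist only so the sketch also loads outside a browser.

```javascript
// Spec values, used as fallbacks when the WebGPU globals are absent.
const BufUsage = globalThis.GPUBufferUsage ?? { MAP_READ: 0x0001, COPY_DST: 0x0008 };
const MapMode = globalThis.GPUMapMode ?? { READ: 0x0001 };

// Hypothetical helper: copy `byteLen` bytes out of a STORAGE | COPY_SRC
// buffer and return them as a Float32Array owned by JS.
async function readBackF32(device, src, byteLen) {
  // The readback buffer may only carry MAP_READ | COPY_DST.
  const read = device.createBuffer({
    size: byteLen,
    usage: BufUsage.MAP_READ | BufUsage.COPY_DST,
  });

  // Record and submit the GPU-side copy.
  const enc = device.createCommandEncoder();
  enc.copyBufferToBuffer(src, 0, read, 0, byteLen);
  device.queue.submit([enc.finish()]);

  // Resolves once the GPU has finished the submitted work.
  await read.mapAsync(MapMode.READ);

  // slice() copies the data out, because unmap() detaches the mapped range.
  const result = new Float32Array(read.getMappedRange()).slice();
  read.unmap();
  read.destroy();
  return result;
}
```

The `slice()` is the easy step to forget: without it the returned view points at an ArrayBuffer that becomes detached as soon as `unmap()` runs.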

③ Lifetime — destroy() vs GC

GPUBuffer is an explicit resource. Unlike a Float32Array, the 4 MB sitting on the GPU is invisible to the JS GC, which only sees a small wrapper object. If you drop the JS reference and never call buf.destroy(), the memory is reclaimed only when that wrapper happens to be collected, and GPU memory pressure does nothing to make the GC hurry; on a long-running page that can mean multiple GB of VRAM held for minutes. destroy() makes the release explicit: the buffer becomes unusable at once, and the implementation can free its memory as soon as already-submitted work that uses it has finished. Dawn's implementation: src/dawn/native/Buffer.cpp::APIDestroy().

The trap WebGL veterans hit

WebGL's buffer lifetime is scoped to the GL context: on context loss, GL cleans up. WebGPU has no such semantic; after device loss, your buffer objects still exist from the app's point of view but are all invalid, and you must clean them up yourself. Pattern: hang all GPU resources off a single resourceGarden and walk-and-destroy when the device.lost promise resolves.
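A resourceGarden can be very small. The class below is one possible shape, with hypothetical method names (`track`, `release`, `destroyAll`, `watch`); the only WebGPU facts it relies on are that resources expose `destroy()` and that `device.lost` is a Promise.

```javascript
// One possible resourceGarden: a registry of everything that owns GPU memory.
class ResourceGarden {
  constructor() {
    this.resources = new Set();
  }

  // Register a buffer/texture and hand it back, so creation stays a one-liner.
  track(resource) {
    this.resources.add(resource);
    return resource;
  }

  // Destroy one resource early and forget it.
  release(resource) {
    if (this.resources.delete(resource)) resource.destroy();
  }

  // Destroy everything that is still tracked.
  destroyAll() {
    for (const resource of this.resources) resource.destroy();
    this.resources.clear();
  }

  // Walk-and-destroy once the device is lost; resolves with the loss info.
  watch(device) {
    return device.lost.then((info) => {
      this.destroyAll();
      return info;
    });
  }
}
```

Typical use would be `const buf = garden.track(device.createBuffer({...}))` at every creation site, plus a single `garden.watch(device)` right after the device is created.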

CHAPTER 08

WGSL — 第四种 shading language

WGSL — the fourth shading language

不是 GLSL · 不是 HLSL · 不是 MSL · 设计成可双向翻译到三家

Not GLSL · not HLSL · not MSL · designed to round-trip to all three

主线 ④ 那段字符串就是 WGSL,整个 WebGPU 协议只接受 WGSL 一种 shader 源码——这是 W3C 工作组做过的最大也是最有争议的决定。

为什么不复用现有的

  • GLSL:和 OpenGL 状态机绑得太紧(uniform location、varying、attribute slot),不适合显式 pipeline。语义有 unspecified behaviour(如 NaN 处理),跨厂商不可移植。
  • HLSL:Microsoft 独家,许可证不允许其他平台独立实现编译器,且 spec 在 DirectX 私有头文件里。
  • MSL:Apple 独家,相同问题。
  • SPIR-V:Khronos 中立,但不是源码——是二进制 IR。在浏览器里直接吃 SPIR-V 意味着每个浏览器要分别防御 SPIR-V 解析器的内存安全漏洞(Vulkan 历史上有数十个 CVE 都源于此)。

于是工作组决定从零设计一种小语言。语法选了类 Rust(fn / let / var / ->),类型系统强制显式且无隐式转换。spec 是 w3.org/TR/WGSL/,约 400 页。

WGSL 一页速览

WGSL · main line shader // 1. 资源绑定声明 · @group 是 BindGroup index, @binding 是 entry index @group(0) @binding(0) var<storage, read_write> data: array<f32>; // ↑ address space ↑ access mode // 2. 入口函数声明 · @compute = compute shader · workgroup_size = 64 threads @compute @workgroup_size(64) fn main(@builtin(global_invocation_id) gid: vec3<u32>) { // ↑ built-in: 该 thread 的全局 (x,y,z) 索引 let i = gid.x; // let = immutable if (i < arrayLength(&data)) { // ↑ 在 storage array 上的内置函数 data[i] = data[i] * data[i]; } }

五个 address space

地址空间生命周期用途性能层
function单次函数调用局部 var寄存器(最快)
privateshader 调用per-thread 全局寄存器或 L1
workgroup一个 workgroup 的生命周期workgroup 内共享内存shared memory(Metal: threadgroup)
uniform整条 pipelineshader 常量(<64 KB)uniform cache
storage整条 pipeline大 buffer 读写VRAM(最慢)

主线的 datastorage 里,每次访问要走 L2/VRAM。Ch19 会演示怎么用 workgroup 共享内存把 matmul 性能提 4–8×。

WGSL 的类型系统

  • 标量i32 · u32 · f32 · f16 · bool。f16 是可选 feature。
  • 向量vec2<T> · vec3<T> · vec4<T>。swizzling 支持:v.xyzv.rgv.rrgb
  • 矩阵mat2x3<f32>(2 列 3 行)。
  • 数组array<T, N>(固定)或 array<T>(运行时大小,只能在 storage 最末位)。
  • struct:和 Rust 类似,可在 buffer 里布局。
  • ptr<A, T, M>:显式指针类型,A 是 address space、T 是指向的类型、M 是 access mode(例如 ptr<storage, f32, read_write>)。WGSL 不允许 raw pointer 算术。
为什么 const 不是 const,让人想抓狂

WGSL 里 const编译期常量(必须在 compile time 求值),let运行期不可变(C++ const),var可变。这和 Rust 的const 一样但和 JavaScript / C++ 不一样。看代码时要注意三者层级。

That string literal in main line ④ is WGSL. The WebGPU API accepts WGSL as its only shader source language, which is the single most consequential (and contentious) decision the W3C working group made.

Why not reuse an existing language

  • GLSL: tied to the OpenGL state machine (uniform location, varying, attribute slot), poor fit for explicit pipelines. Has unspecified behaviour (NaN handling), not portable across vendors.
  • HLSL: Microsoft proprietary; license forbids independent compiler implementations on other platforms, and the spec lives in DirectX private headers.
  • MSL: Apple proprietary, same problem.
  • SPIR-V: Khronos-neutral, but not source — it's a binary IR. Accepting SPIR-V in browsers means every browser has to harden a SPIR-V parser; Vulkan accumulated dozens of CVEs from this.

So the WG decided to design a new small language. Syntax leans Rust (fn / let / var / ->); the type system is strict, no implicit conversions. Spec at w3.org/TR/WGSL/, about 400 pages.

WGSL on one page

WGSL · main line shader
// 1. Resource binding · @group = BindGroup index, @binding = entry index
//    storage = address space, read_write = access mode
@group(0) @binding(0)
var<storage, read_write> data: array<f32>;

// 2. Entry point · @compute = compute shader · 64 threads per workgroup
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  // gid is a built-in: this thread's global (x,y,z) index
  let i = gid.x;                   // let = immutable
  if (i < arrayLength(&data)) {    // arrayLength: built-in on storage arrays
    data[i] = data[i] * data[i];
  }
}

Five address spaces

Address space | Lifetime | Use | Performance tier
function | one function call | local vars | registers (fastest)
private | shader invocation | per-thread globals | registers / L1
workgroup | one workgroup | workgroup-shared memory | shared memory (Metal: threadgroup)
uniform | whole pipeline | shader constants (<64 KB) | uniform cache
storage | whole pipeline | large rw buffers | VRAM (slowest)

Our main line's data lives in storage, so each access hits L2/VRAM. Ch19 will show how moving partial sums into workgroup shared memory speeds matmul up 4–8×.

WGSL's type system

  • Scalars: i32 · u32 · f32 · f16 · bool. f16 is an optional feature.
  • Vectors: vec2<T> · vec3<T> · vec4<T>. Swizzling: v.xyz, v.rg, v.rrgb.
  • Matrices: mat2x3<f32> (2 columns, 3 rows).
  • Arrays: array<T, N> (fixed) or array<T> (runtime-sized, only as the last storage member).
  • struct: Rust-like, with explicit memory layout for buffers.
  • ptr<A, T, M>: explicit pointer type, where A = address space, T = pointee type, M = access mode (for example ptr<storage, f32, read_write>). No raw pointer arithmetic.
Why "const" isn't const — the gotcha

In WGSL, const is a compile-time constant (must be evaluable at compile time), let is runtime-immutable (C++ const), var is mutable. Same as Rust, different from JS/C++. Triple-layer mental model required when reading code.

CHAPTER 09

Pipeline · BindGroup — 描述符的两层

Pipeline · BindGroup — the two-layer descriptor model

为什么需要"layout 的 layout",以及 layout:'auto' 是怎么实现的

Why you need a "layout of layouts" — and how layout:'auto' really works

主线 ⑤ 两步:createComputePipeline + createBindGroup。这两个 API 一起回答了一个底层问题:shader 怎么找到它要用的资源?

所有现代图形 API 都用 两级描述符(descriptor)来回答这个问题:

  1. 第一级 · 形状:BindGroupLayout = "我需要 4 个 binding,第 0 个是 storage buffer,第 1 个是 sampler,第 2 个是 texture,第 3 个是 uniform"。
  2. 第二级 · 实体:BindGroup = "把这个具体 buffer装到 binding 0、那个具体 sampler装到 binding 1……"。

为什么两级而不是一级?因为pipeline 编译只关心形状,运行时切换 BindGroup 只换实体。一个 pipeline 可以服务 1000 个不同的 BindGroup(如渲染 1000 个 model,每个一组贴图),不需要重新编译 shader。这是显式 API 的核心性能技巧——把编译期决策运行期换绑分开。

layout: 'auto' 的真相

主线用了 layout: 'auto'。这不是"没有 layout"——而是让 Dawn/wgpu 帮你从 shader 反推 layout。Dawn 的实现见 src/dawn/native/ShaderModule.cpp::ExtractAutoLayout():扫一遍 WGSL AST,找所有 @group(X) @binding(Y) 声明,按 X 分组,每组生成一个 BindGroupLayout。这个 layout 然后用来编译 pipeline。

Dawn · src/dawn/native/ShaderModule.cpp · simplified // 从 Tint 解析出来的 WGSL AST 反推 BindGroupLayout ResultOrError<EntryPointMetadata> ReflectEntryPoint( const tint::Program& program, const tint::inspector::EntryPoint& entry) { EntryPointMetadata meta; for (const auto& binding : entry.resource_bindings) { BindGroupLayoutEntry entry; entry.binding = binding.binding; entry.visibility = ShaderStageFor(entry.stage); switch (binding.resource_type) { case tint::ResourceType::kStorageBuffer: entry.buffer.type = BufferBindingType::Storage; break; case tint::ResourceType::kUniformBuffer: entry.buffer.type = BufferBindingType::Uniform; break; // ... textures, samplers, storage textures ... } meta.bindings[binding.bind_group].push_back(entry); } return meta; }

layout: 'auto' 的代价:不同 pipeline 之间不能共享 BindGroup。两个 pipeline 即使 shader 完全相同的 binding 声明,layout: 'auto' 生成的两个 BindGroupLayout 也是不同实例,BindGroup 必须分别创建。生产代码里如果有 N 个 pipeline 共用 binding,应该显式 device.createBindGroupLayout() + 在每个 pipeline descriptor 里复用同一个 layout。

Pipeline 编译有多贵

主线 ⑤ 的 createComputePipeline显著开销的调用:

  • Dawn 验证(reflection、binding 对齐检查):~0.2 ms
  • Tint 编译 WGSL → MSL/HLSL/SPIR-V:~3–8 ms(一次)
  • 原生 API 创建 PSO(Metal 的 MTLComputePipelineState):~2–4 ms
  • 驱动编译 MSL → GPU ISA:~5–15 ms(一次)

总和 10–30 ms。所以不要在 frame 里创建 pipeline。生产应用通常在启动时把所有 pipeline 预编译好("pre-warming")。Chrome 还有 pipeline cache 把编译产物存到磁盘,下次启动直接复用。

Main line ⑤ does two things: createComputePipeline + createBindGroup. Together they answer one underlying question: how does the shader find the resources it uses?

Every modern graphics API uses a two-level descriptor model:

  1. Level 1 · shape: BindGroupLayout = "I need 4 bindings: binding 0 is a storage buffer, 1 is a sampler, 2 is a texture, 3 is a uniform".
  2. Level 2 · instances: BindGroup = "put this specific buffer in binding 0, that specific sampler in binding 1…"

Why two levels rather than one? Pipeline compilation only cares about shape; switching BindGroups at runtime only swaps instances. One pipeline can serve 1,000 BindGroups (e.g. rendering 1,000 models, each with its own textures) without recompiling the shader. That's the core trick of explicit APIs — separating compile-time decisions from runtime rebinding.

What layout: 'auto' really does

Our main line uses layout: 'auto'. This doesn't mean "no layout" — it means let Dawn/wgpu infer the layout from the shader. Dawn's logic is in src/dawn/native/ShaderModule.cpp::ExtractAutoLayout(): it walks the WGSL AST, collects every @group(X) @binding(Y) declaration, groups by X, and synthesises a BindGroupLayout per group. That layout is then used to compile the pipeline.

Dawn · src/dawn/native/ShaderModule.cpp · simplified
// Reflect BindGroupLayout entries from a parsed WGSL program
ResultOrError<EntryPointMetadata> ReflectEntryPoint(
    const tint::Program& program,
    const tint::inspector::EntryPoint& entryPoint) {
  EntryPointMetadata meta;
  for (const auto& binding : entryPoint.resource_bindings) {
    // Named layoutEntry so it does not shadow the entryPoint parameter
    BindGroupLayoutEntry layoutEntry;
    layoutEntry.binding = binding.binding;
    layoutEntry.visibility = ShaderStageFor(entryPoint.stage);
    switch (binding.resource_type) {
      case tint::ResourceType::kStorageBuffer:
        layoutEntry.buffer.type = BufferBindingType::Storage;
        break;
      case tint::ResourceType::kUniformBuffer:
        layoutEntry.buffer.type = BufferBindingType::Uniform;
        break;
      // ... textures, samplers, storage textures ...
    }
    meta.bindings[binding.bind_group].push_back(layoutEntry);
  }
  return meta;
}
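The "collect every @group/@binding, then group by X" idea can be imitated in a few lines of JavaScript. This is a deliberately naive toy: it uses a regular expression where Dawn uses Tint's parsed program, it assumes @group is written before @binding, and `inferBindGroupShapes` is a hypothetical name.

```javascript
// Toy reflection: find `@group(X) @binding(Y) var<space, ...>` declarations
// in WGSL source and bucket them by group index.
function inferBindGroupShapes(wgsl) {
  const pattern = /@group\((\d+)\)\s*@binding\((\d+)\)\s*var(?:<([^>]*)>)?/g;
  const groups = new Map();

  for (const match of wgsl.matchAll(pattern)) {
    const group = Number(match[1]);
    const binding = Number(match[2]);
    // `var<storage, read_write>` -> 'storage'; a bare `var` (textures,
    // samplers) lives in the handle address space.
    const space = (match[3] ?? 'handle').split(',')[0].trim();

    if (!groups.has(group)) groups.set(group, []);
    groups.get(group).push({ binding, space });
  }
  return groups; // Map: group index -> list of { binding, space }
}
```

Fed the main line's shader, this yields a single group 0 holding one `storage` entry at binding 0, which is the shape a `layout: 'auto'` pipeline would need for that shader.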

The cost of layout: 'auto': BindGroups aren't shareable across pipelines. Even if two pipelines declare identical bindings, their inferred layouts are distinct instances, so each needs its own BindGroup. Production code with N pipelines sharing bindings should call device.createBindGroupLayout() explicitly and reuse the same layout in every pipeline descriptor.
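A sketch of the explicit alternative, assuming every pipeline uses the main line's single storage binding. `createSharedLayout` and `createPipelines` are hypothetical helpers; the `Stage` fallback mirrors the spec value of GPUShaderStage.COMPUTE so the sketch loads outside a browser.

```javascript
// Spec value, used as a fallback when the WebGPU globals are absent.
const Stage = globalThis.GPUShaderStage ?? { COMPUTE: 0x4 };

// Build one BindGroupLayout + PipelineLayout that all pipelines will share.
function createSharedLayout(device) {
  const bindGroupLayout = device.createBindGroupLayout({
    entries: [
      // Matches `@group(0) @binding(0) var<storage, read_write> ...`
      { binding: 0, visibility: Stage.COMPUTE, buffer: { type: 'storage' } },
    ],
  });
  const pipelineLayout = device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  });
  return { bindGroupLayout, pipelineLayout };
}

// Every pipeline receives the same layout object instead of 'auto'.
function createPipelines(device, shaderModules, shared) {
  return shaderModules.map((module) =>
    device.createComputePipeline({
      layout: shared.pipelineLayout,
      compute: { module, entryPoint: 'main' },
    })
  );
}
```

A BindGroup created against `shared.bindGroupLayout` can then be passed to `setBindGroup(0, …)` for any of these pipelines, which is what `layout: 'auto'` rules out.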

How expensive is pipeline compilation

Main line ⑤'s createComputePipeline is a heavy call:

  • Dawn validation (reflection, binding alignment): ~0.2 ms
  • Tint compile WGSL → MSL/HLSL/SPIR-V: ~3–8 ms (once)
  • Native PSO creation (Metal's MTLComputePipelineState): ~2–4 ms
  • Driver compile MSL → GPU ISA: ~5–15 ms (once)

Sum: 10–30 ms total. Never create pipelines in a frame loop. Production apps pre-warm all pipelines at startup. Chrome also persists the pipeline cache to disk so a second visit skips re-compilation.

CHAPTER 10

Encoder · Queue — 录制和提交

Encoder · Queue — recording and submitting

把 dispatch / draw / copy 攒成 CommandBuffer,再 submit 给 GPU

Batch dispatch / draw / copy into a CommandBuffer, then submit to the GPU

主线 ⑥ 用 7 行做了 6 件事:

const enc  = device.createCommandEncoder();      // 1. 开一个录制器
const pass = enc.beginComputePass();              // 2. 开 compute pass
pass.setPipeline(pipeline); pass.setBindGroup(0, bg);
pass.dispatchWorkgroups(16384);                   // 3. 录一条 dispatch
pass.end();                                       // 4. 结束 pass
enc.copyBufferToBuffer(buf, 0, read, 0, byteLen);   // 5. 录一条 copy
device.queue.submit([enc.finish()]);              // 6. 提交

这里有两个关键概念:

① Encoder 是录音机,不是执行器

调用 setPipeline / dispatchWorkgroups / copyBufferToBuffer 都不会立刻让 GPU 做事。它们只是把命令记到 encoder 的内部 buffer 里。这是 D3D12 和 Metal 的 command-buffer 模型——批量录制 + 一次提交,让 driver 有机会重排、合并、并行化。

Dawn 内部 encoder 是 src/dawn/native/CommandEncoder.cpp,每条 API 调用都附带 validation(pipeline 兼容性、binding 对齐、usage tracking)。validation 失败不抛 JS 异常——而是把 encoder 标 invalid,等到 finish() 才一次性 reject。这是性能优化(错误检查不打断流水线)。

② Queue.submit() 是真正的切换点

device.queue.submit([cmdBuf]) 才是真正跨进程通信 + 调用原生 API。它做:

  1. Dawn Wire client 把 cmdBuf 序列化(每条命令一个 tag + 参数)
  2. Mojo IPC 跨 Renderer ↔ GPU process(一次 round trip)
  3. GPU process 端 Dawn Wire server 反序列化、重新 validate
  4. 原生 API 调用:Metal [encoder dispatchThreadgroups:...]、D3D12 ID3D12CommandList::Dispatch、Vulkan vkCmdDispatch
  5. MTLCommandBuffer / D3D12CommandList 提交到队列

主线一次 submit 在 M2 Pro 上:~30 µs IPC + ~10 µs validation + ~20 µs Metal API = ~60 µs CPU 开销。然后 GPU 异步执行,主线的 mapAsync 在 GPU 完成时 resolve。

主线的并行潜力

主线一次提交一条命令 + 一条 copy。生产代码会把多个 pass 攒到一个 encoder 里:例如先 dispatch compute pass A、再 dispatch compute pass B、再 render pass、最后 copy。这样一次 submit 推一整帧的工作,比每条单独 submit 快 5–10×(少了 IPC 来回)。Babylon.js 渲染一帧典型 1 submit + 50–100 个 pass。

为什么 dispatchWorkgroups 而不是 dispatchThreads

WebGPU 用 workgroup 数而不是thread 数,让你显式知道在分组。一次 dispatchWorkgroups(16384)@workgroup_size(64) = 16384 × 64 = 1,048,576 个 thread。每个 workgroup 内的 64 thread 在同一个 SM(Apple GPU 术语:cluster)上跑,可以共享 workgroup memory。Ch18 会展开讨论这个分组语义。

Main line ⑥ does six things in seven lines:

const enc  = device.createCommandEncoder();      // 1. open a recorder
const pass = enc.beginComputePass();              // 2. begin compute pass
pass.setPipeline(pipeline); pass.setBindGroup(0, bg);
pass.dispatchWorkgroups(16384);                   // 3. record a dispatch
pass.end();                                       // 4. end the pass
enc.copyBufferToBuffer(buf, 0, read, 0, byteLen);   // 5. record a copy
device.queue.submit([enc.finish()]);              // 6. submit

Two key concepts:

① The encoder is a tape recorder, not an executor

Calling setPipeline / dispatchWorkgroups / copyBufferToBuffer does not run anything on the GPU. Each call appends commands into the encoder's internal buffer. This is the D3D12/Metal command-buffer model — batch-record, submit-once — giving the driver room to reorder, merge, and parallelise.

Dawn's encoder lives at src/dawn/native/CommandEncoder.cpp. Every API call is validated inline (pipeline compatibility, binding alignment, usage tracking). Validation failures don't throw JS exceptions; they poison the encoder, and the error surfaces only at finish(), which reports one validation error and returns an invalid command buffer. That's a perf optimisation (errors don't stall the recording pipeline).

② Queue.submit() is the real kick-off point

device.queue.submit([cmdBuf]) is where cross-process IPC and native-API calls actually happen:

  1. Dawn Wire client serialises cmdBuf (a tag + args per command)
  2. Mojo IPC across Renderer ↔ GPU process (one round trip)
  3. GPU-process-side Dawn Wire server deserialises and re-validates
  4. Native API call: Metal [encoder dispatchThreadgroups:...] / D3D12 ID3D12GraphicsCommandList::Dispatch / Vulkan vkCmdDispatch
  5. Native command buffer (MTLCommandBuffer / ID3D12GraphicsCommandList / VkCommandBuffer) submitted to its queue

On M2 Pro, one submit costs ~30 µs IPC + ~10 µs validation + ~20 µs Metal API = ~60 µs CPU overhead. The GPU then runs asynchronously; the main line's mapAsync resolves when the GPU finishes.

Parallelism opportunities

Our main line submits one dispatch + one copy. Production code batches many passes into one encoder: dispatch compute pass A, then compute pass B, then a render pass, then a copy. One submit pushes a whole frame of work, 5–10× faster than per-pass submits (no extra IPC). Babylon.js renders a typical frame in 1 submit + 50–100 passes.

Why dispatchWorkgroups, not dispatchThreads

WebGPU exposes workgroup count instead of thread count so the grouping is explicit. One dispatchWorkgroups(16384) with @workgroup_size(64) = 16,384 × 64 = 1,048,576 threads. Each workgroup's 64 threads run on the same SM (Apple GPU: cluster) and can share workgroup memory. Ch18 unpacks the grouping semantics in depth.
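The arithmetic runs in the other direction in practice: you know the element count and need the workgroup count. A small sketch, with hypothetical helper names and the main line's workgroup size of 64 as the default:

```javascript
// Number of workgroups needed to cover `elementCount` items when each
// workgroup runs `workgroupSize` threads. Rounds up so no element is missed.
function workgroupsFor(elementCount, workgroupSize = 64) {
  if (!Number.isInteger(elementCount) || elementCount < 0) {
    throw new RangeError('elementCount must be a non-negative integer');
  }
  return Math.ceil(elementCount / workgroupSize);
}

// Threads actually launched; can exceed elementCount because of the round-up.
function threadsLaunched(workgroups, workgroupSize = 64) {
  return workgroups * workgroupSize;
}
```

For the main line's 1,048,576 floats this gives 16,384 workgroups with no spare threads. For 1,000 elements it gives 16 workgroups and 1,024 threads, so 24 threads have nothing to do, which is why the shader guards with `if (i < arrayLength(&data))`.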

CHAPTER 11

Renderer ↔ GPU 进程 — Mojo IPC 与 Wire 序列化

Renderer ↔ GPU process — Mojo IPC and Wire serialisation

为什么浏览器要把 dispatch 跨进程跑,怎么跨,付什么代价

Why dispatches cross a process boundary, how, and what it costs

这是翻译③ Mojo IPC 那一层。在桌面 native 应用里调 Metal 一次 dispatchThreadgroups一个函数调用,纯 user-space。在 Chrome 里同样的事是跨进程:JS 在 Renderer process(沙箱里),驱动调用在 GPU process(更深的沙箱,但有 driver 访问权)。这一章拆这个边界。

为什么有 GPU 进程

  1. 驱动是巨型不可信代码。NVIDIA Windows 驱动 ~10 MLOC,AMD ~5 MLOC,Apple Metal stack 数百万行。让 untrusted JS 直接坐在它上面是巨大的攻击面——任何 driver bug + 一个 origin 就是远程代码执行。
  2. GPU 显存里有其他 origin 的纹理。同一台机器上 100 个 tab 的纹理都在 GPU 显存里。如果某 origin 通过 GPU 越界读拿到了另一个 origin 的纹理,就是跨域数据泄漏。GPU 进程的边界让 Chrome 至少在 process 层做了一次隔离。
  3. 驱动崩溃不该拉整个浏览器。WebGPU 用得激进时一个 shader bug 经常让驱动 hang/crash。在独立进程里 crash 只丢一帧 + reload 一次 device,不会把 Chrome 拖死。

Wire 协议:把 API 调用编码成字节流

JS 里每次调 pass.dispatchWorkgroups(16384) 都会被 Dawn Wire client(C++,在 Renderer 里)翻译成一段二进制。结构大致:

Dawn Wire · command serialisation · simplified // 每条 API 调用都是一个 tag + 参数 struct DispatchWorkgroupsCmd { // 24 bytes 总 CommandID id; // kComputePassEncoderDispatchWorkgroups (4 bytes) ObjectId self; // pass encoder ID (4 bytes) uint32_t workgroupCountX; // 4 bytes uint32_t workgroupCountY; // 4 bytes uint32_t workgroupCountZ; // 4 bytes // + padding to 24 byte alignment }; // Wire client 把每条命令 append 到 ring buffer // ComputePassEncoder::DispatchWorkgroups(x, y, z) { // ASSIGN_TYPED(cmd, DispatchWorkgroupsCmd); // cmd->self = GetObjectId(this); ... // cmd->workgroupCountX = x; cmd->workgroupCountY = y; cmd->workgroupCountZ = z; // }

这个 wire 序列化是同步不立刻发送——命令 append 到一个 ring buffer,等 queue.submit() 才 flush。Flush 走 Mojo 把整个 ring 内容(典型 1–10 KB / frame)一口气 IPC 过去。

Mojo · Chromium 的 IPC 库

Mojo 是 Chromium 自己的 IPC 系统(不是 Chrome OS Mojo,是同名巧合)。底层是 Unix domain socket(Linux/macOS)或 named pipe(Windows)。Mojo 在上面建了:

  • Message pipes:双向、有序、message-based
  • Shared memory regions:大数据用 shm 而非 socket bypass kernel copy
  • Handle 传递:fd / socket 跨进程传

Dawn 用 Mojo shared memory 来传 wire ring buffer——主线一次 submit 推 ~80 字节命令(一个 dispatch),但更大的 batch 一帧可能 50 KB。用 shm 比单次 send/recv 快 ~3×(少了内核数据拷贝)。延迟数据见 Mojo docs

实测延迟分解

步骤M2 Pro · macOSi7-13700H · WindowsPixel 8 · Android
Renderer 序列化(80 bytes)~3 µs~4 µs~6 µs
Mojo shm flush~10 µs~12 µs~20 µs
调度 + GPU 进程唤醒~15 µs~18 µs~30 µs
GPU 进程反序列化 + validate~10 µs~12 µs~15 µs
合计 / dispatch~38 µs~46 µs~71 µs

这就是为什么主线一次 submit ~60 µs 是合理的——IPC 占了一半多。也是为什么不要在 frame 里多 submit:每多一次 submit 就多 40+ µs。

wgpu 的对应实现

Firefox 用 IPDL(Mozilla 自己的 IPC 框架)传一段不同格式的字节流。wgpu 的 wire 是 Rust struct serialise via bincode,see wgpu/src/backend/wgpu_core/mod.rs。整体延迟和 Dawn 相近(~40 µs)。比较有趣的是 wgpu 在 same-process fast path 时(如 Bevy/Servo 直接嵌入)能完全跳过 IPC,直接 mem-copy struct。Firefox 浏览器里始终走 IPC(沙箱要求)。

This unpacks translation ③ Mojo IPC. In a desktop native app, one dispatchThreadgroups Metal call is one function call, pure user-space. In Chrome the same operation is cross-process: JS in the Renderer (sandboxed), driver calls in the GPU process (also sandboxed, but with driver access). This chapter dissects that boundary.

Why a GPU process at all

  1. Drivers are massive untrusted code. NVIDIA's Windows driver is ~10 MLOC, AMD's ~5 MLOC, Apple's Metal stack is millions of lines. Sitting untrusted JS directly on top is a huge attack surface — any driver bug + a malicious origin = RCE.
  2. GPU memory holds textures from other origins. On one machine, the textures of 100 tabs all sit in GPU memory. If one origin reads across via a GPU OOB, it's cross-origin data leakage. A separate GPU process gives Chrome at least process-level isolation.
  3. Driver crashes shouldn't take down the whole browser. Aggressive WebGPU use makes shader bugs hang/crash drivers regularly. In a separate process a crash costs one frame + one device-loss reset, not the entire browser.

The Wire protocol: encode API calls as a byte stream

Every JS pass.dispatchWorkgroups(16384) is translated by the Dawn Wire client (C++, in the Renderer) into a small binary record. Roughly:

Dawn Wire · command serialisation · simplified
// Each API call is a tag + parameters
struct DispatchWorkgroupsCmd {   // 24 bytes total
  CommandID id;                  // kComputePassEncoderDispatchWorkgroups (4 bytes)
  ObjectId self;                 // pass encoder ID (4 bytes)
  uint32_t workgroupCountX;      // 4 bytes
  uint32_t workgroupCountY;      // 4 bytes
  uint32_t workgroupCountZ;      // 4 bytes
  // + padding: 20 bytes of fields rounded up to 24
};

// The Wire client appends each command to a ring buffer:
// ComputePassEncoder::DispatchWorkgroups(x, y, z) {
//   ASSIGN_TYPED(cmd, DispatchWorkgroupsCmd);
//   cmd->self = GetObjectId(this); ...
//   cmd->workgroupCountX = x; cmd->workgroupCountY = y; cmd->workgroupCountZ = z;
// }

Wire serialisation is synchronous but doesn't send immediately — commands append to a ring buffer, flushed only on queue.submit(). The flush IPCs the entire ring (typically 1–10 KB / frame) in one shot.
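The tag-plus-parameters layout is easy to reproduce with a DataView. This sketch follows the simplified 24-byte struct from this chapter, not Dawn's real wire format; the command ID value and both function names are made up for illustration.

```javascript
const DISPATCH_CMD_ID = 0x2a;   // hypothetical tag value
const DISPATCH_CMD_SIZE = 24;   // 5 x uint32 = 20 bytes, padded to 24

// Append one dispatch record at `offset`; returns the offset of the next record.
function encodeDispatch(view, offset, passId, x, y = 1, z = 1) {
  view.setUint32(offset + 0, DISPATCH_CMD_ID, true);  // command tag
  view.setUint32(offset + 4, passId, true);           // pass encoder object ID
  view.setUint32(offset + 8, x, true);                // workgroupCountX
  view.setUint32(offset + 12, y, true);               // workgroupCountY
  view.setUint32(offset + 16, z, true);               // workgroupCountZ
  // bytes 20..23 stay zero (padding)
  return offset + DISPATCH_CMD_SIZE;
}

// What the receiving side would do: read the record back.
function decodeDispatch(view, offset) {
  return {
    id: view.getUint32(offset + 0, true),
    passId: view.getUint32(offset + 4, true),
    x: view.getUint32(offset + 8, true),
    y: view.getUint32(offset + 12, true),
    z: view.getUint32(offset + 16, true),
  };
}
```

Encoding is only memory writes into a preallocated buffer, which is why recording a command stays cheap and the IPC cost is paid once, at flush time.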

Mojo · Chromium's IPC library

Mojo is Chromium's IPC layer (the name collides with Chrome OS Mojo by coincidence). Under the hood: Unix domain sockets on Linux/macOS, named pipes on Windows. On top of that Mojo provides:

  • Message pipes: bidirectional, ordered, message-framed
  • Shared memory regions: large data via shm, bypassing kernel copy
  • Handle passing: send file descriptors / sockets across processes

Dawn uses Mojo shared memory for the wire ring buffer. The main line's submit pushes only ~80 bytes of commands (one dispatch), but a busy frame's batch can reach 50 KB. shm is ~3× faster than send/recv (it avoids the kernel copy). See Mojo docs.

Latency, measured

Step | M2 Pro · macOS | i7-13700H · Windows | Pixel 8 · Android
Renderer serialise (80 bytes) | ~3 µs | ~4 µs | ~6 µs
Mojo shm flush | ~10 µs | ~12 µs | ~20 µs
Schedule + GPU process wake | ~15 µs | ~18 µs | ~30 µs
GPU-side deserialise + validate | ~10 µs | ~12 µs | ~15 µs
Total / dispatch | ~38 µs | ~46 µs | ~71 µs

This explains why a main-line submit is ~60 µs — IPC eats more than half. It's also why you shouldn't multi-submit per frame: every extra submit adds ~40 µs.
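To see what that means per frame, the table's M2 Pro column can be turned into a tiny budget calculator. The step names and both functions are illustrative; the numbers are the ones quoted above.

```javascript
// M2 Pro column of the latency table, in microseconds.
const M2_PRO_STEPS_US = {
  serialise: 3,
  shmFlush: 10,
  wakeGpuProcess: 15,
  deserialiseValidate: 10,
};

// IPC-side cost of one submit: the sum of the four steps.
function ipcCostPerSubmitUs(steps) {
  return Object.values(steps).reduce((sum, us) => sum + us, 0);
}

// Total IPC time per frame, in milliseconds, for a given submit count.
function ipcMsPerFrame(submitsPerFrame, steps = M2_PRO_STEPS_US) {
  return (submitsPerFrame * ipcCostPerSubmitUs(steps)) / 1000;
}
```

One submit per frame costs 0.038 ms of IPC. A hundred submits cost 3.8 ms, which, assuming a 60 Hz target of about 16.7 ms per frame, is roughly a fifth of the frame spent on process hops alone.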

wgpu's equivalent

Firefox uses IPDL (Mozilla's own IPC framework) for a different byte format. wgpu's wire is Rust structs serialised via bincode, see wgpu/src/backend/wgpu_core/mod.rs. Total latency is comparable to Dawn (~40 µs). Interestingly, wgpu has a same-process fast path (e.g. Bevy/Servo embedding it directly) that skips IPC entirely — just mem-copy structs. The Firefox browser always goes via IPC for sandbox reasons.

CHAPTER 12

验证 & 错误域 — 为什么 WebGPU 看起来"几乎不报错"

Validation & error scopes — why WebGPU "rarely throws"

异步错误模型 · pushErrorScope/popErrorScope · uncapturederror 事件

Async error model · push/popErrorScope · uncapturederror events

WebGPU 的错误处理有一种奇怪的味道:写错代码很少抛 JS 异常,但东西就是不工作。这章解释为什么——它是故意 设计成这样的,叫做contagion model(污染模型)。

同步抛异常 vs 异步污染

传统的同步 API(如 WebGL)模式:每次调用都同步检查 + 同步报错。WebGL 用 gl.getError()——你每次 draw call 后要主动查;不查就丢错误。每次 getError 都要 IPC 一次(Renderer → GPU process → 取错误状态 → 回),开销巨大。

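WebGPU 反过来:错误是异步传染的。以一个非法的 dispatchWorkgroups(70000) 为例(超过默认的 maxComputeWorkgroupsPerDimension = 65535;workgroup 数为 0 反而是合法的 no-op):

  1. 不会同步抛异常。
  2. 污染当前的 encoder:encoder.finish() 时报一个 validation error,并返回 invalid 的 command buffer。
  3. 污染 command buffer:submit 拒绝它,其中的命令都不会执行。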
  4. 如果当时有 pushErrorScope('validation') 包着,那么 popErrorScope() 返回一个 GPUValidationError
  5. 否则触发 device 的 uncapturederror 事件。

为什么这样设计:

  • 性能:每次同步检查要 IPC 一来一回 ~40 µs,频繁 dispatch 时积累起来巨大。
  • 流水线友好:错误不打断 JS 执行,应用可以继续 record 后面的命令,最后再统一处理。
  • 批处理:一帧里 1000 个 dispatch,只在 submit 时检查一次。

三个错误类型

类型触发条件典型例
GPUValidationErrorAPI 用法错误(spec 违规)buffer usage 不匹配 · binding 越界 · workgroup_size 太大
GPUOutOfMemoryError真没显存了申请 > maxBufferSize 的 buffer · 创建过多 texture
GPUInternalError实现 bug 或硬件异常shader 编译炸了 · driver 内部 timeout

使用模式

JS · 典型错误处理模式 // 1. 全局监听:开发时挂这个,能立刻看见所有 uncaught device.addEventListener('uncapturederror', e => { console.error(e.error.message); }); // 2. 范围捕获:只在某段代码捕获 validation 错误 device.pushErrorScope('validation'); pass.dispatchWorkgroups(N); pass.end(); const err = await device.popErrorScope(); if (err) { console.warn('dispatch failed:', err.message); } // 3. 嵌套范围:内层只捕 oom,外层捕 validation device.pushErrorScope('validation'); // outer device.pushErrorScope('out-of-memory'); // inner await tryBigAllocation(); const oomErr = await device.popErrorScope(); const valErr = await device.popErrorScope();
为什么 popErrorScope 是 async

错误要从 GPU 进程反向流到 Renderer——和 dispatch 同一条 IPC 通道但反向。popErrorScope() 返回的 Promise 在对应的所有命令都被 GPU 进程处理完之后才 resolve。所以 popErrorScope 之后再 dispatchWorkgroups 是同步的(不阻塞),但读取错误结果是异步的。

Dawn 的内部 validation 流

Dawn 的 validation 不是一处,它分布在每条 API 的 GPU-side handler 里。例如 ComputePassEncoder::APIDispatchWorkgroups(in src/dawn/native/ComputePassEncoder.cpp)会调 ValidateDispatch(),检查 ① 每个 workgroup count 不超过 maxComputeWorkgroupsPerDimension(0 合法,是 no-op)、② 当前 BindGroup 已绑、③ pipeline 已绑、④ 绑定资源的 usage 兼容。失败时 poison the command encoder,但不 throw,延迟到 finish() 才报 validation error。

完整 worked example — 故意写错一行看看会发生什么

JS · 主线第 ⑤ 步漏掉 bindGroup 的 binding 0 device.addEventListener('uncapturederror', e => { console.error('[uncaught]', e.error.message); }); device.pushErrorScope('validation'); // 故意:BindGroup entries 留空(应该有 binding 0) const bg = device.createBindGroup({ layout: pipeline.getBindGroupLayout(0), entries: [] // ← bug here }); const err = await device.popErrorScope(); console.log(err); // → GPUValidationError { // message: "Number of entries (0) did not match the number // of entries (1) specified in [BindGroupLayout]. // - While calling [Device].CreateBindGroup([BindGroupDescriptor])." // } // 注意:bg 不是 null —— 它是一个 invalid BindGroup 对象 // (contagion 模型)。下面调 setBindGroup(0, bg) 不抛—— // 而是污染 encoder。直到 finish() 才会再次 reject。

Dawn 错误信息带调用栈式 context——每个嵌套的 API 调用都会 append 一段 "While calling [X]"。这是 src/dawn/native/ErrorScope.cppErrorScope::AppendContext() 干的。比 WebGL 的 getError() === GL_INVALID_OPERATION 信息量多 100×。

WebGPU error handling has a strange flavour: getting things wrong rarely throws a JS exception, but stuff just doesn't work. This chapter explains why — it's deliberate, often called the contagion model.

Synchronous throw vs asynchronous poisoning

Traditional sync APIs (like WebGL): each call validates and reports inline. WebGL uses gl.getError() — you must poll after every draw call, or the error vanishes. Each getError costs one IPC round trip (Renderer → GPU process → fetch error state → reply) — huge overhead.

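WebGPU inverts this: errors propagate asynchronously by contagion. Take an invalid dispatchWorkgroups(70000), which exceeds the default maxComputeWorkgroupsPerDimension of 65,535 (a count of 0 would not do as an example: it is a valid no-op):

  1. Does not throw synchronously.
  2. Poisons the current encoder: encoder.finish() reports a validation error and returns an invalid command buffer.
  3. Poisons the command buffer: submit refuses it with a further validation error, and none of its commands run.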
  4. If wrapped in pushErrorScope('validation'), the matching popErrorScope() returns a GPUValidationError.
  5. Otherwise fires uncapturederror on the device.
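The whole flow fits in a toy model that runs without a GPU. `ToyDevice` and `ToyEncoder` are invented for this sketch and simplify the real API: scopes are not filtered by error type, and the only validation rule is an over-limit workgroup count (65,535 is the spec's default maxComputeWorkgroupsPerDimension).

```javascript
// Toy device: a stack of error scopes plus a list standing in for
// the uncapturederror event.
class ToyDevice {
  constructor() {
    this.scopes = [];
    this.uncaptured = [];
  }
  pushErrorScope() {
    this.scopes.push(null);
  }
  // Async like the real API; resolves to the first captured error or null.
  popErrorScope() {
    return Promise.resolve(this.scopes.pop() ?? null);
  }
  report(message) {
    if (this.scopes.length > 0) {
      const top = this.scopes.length - 1;
      if (this.scopes[top] === null) this.scopes[top] = message;
    } else {
      this.uncaptured.push(message);
    }
  }
}

// Toy encoder: records commands, never throws, becomes poisoned on error.
class ToyEncoder {
  constructor(device, maxWorkgroups = 65535) {
    this.device = device;
    this.maxWorkgroups = maxWorkgroups;
    this.commands = [];
    this.error = null;
  }
  dispatchWorkgroups(x) {
    if (this.error) return; // already poisoned: ignore further recording
    if (x > this.maxWorkgroups) {
      this.error = `workgroup count ${x} exceeds ${this.maxWorkgroups}`;
      return;
    }
    this.commands.push({ op: 'dispatch', x });
  }
  // The error surfaces here, not at the call that caused it.
  finish() {
    if (this.error) {
      this.device.report(this.error);
      return { valid: false, commands: [] };
    }
    return { valid: true, commands: this.commands };
  }
}
```

Recording the bad dispatch returns normally; only `finish()` hands the message to the innermost open scope, or to the uncaptured list when no scope is open.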

Why design it this way:

  • Performance: each sync check would be one IPC round trip (~40 µs). With many dispatches that's massive.
  • Pipeline-friendly: errors don't interrupt JS execution; the app keeps recording commands and handles them all at the end.
  • Batchable: 1000 dispatches per frame are validated once at submit.

Three error types

Type | Trigger | Typical cause
GPUValidationError | API misuse (spec violation) | buffer usage mismatch · binding OOB · workgroup_size too large
GPUOutOfMemoryError | actually out of VRAM | an allocation the driver cannot satisfy (a size above maxBufferSize is a validation error instead) · texture flood
GPUInternalError | implementation bug / hardware fault | shader compiler crashed · driver timeout

Usage patterns

JS · typical error-handling patterns
// 1. Global listener: keep this on during development
device.addEventListener('uncapturederror', e => {
  console.error(e.error.message);
});

// 2. Scoped capture: only catch validation errors in a region.
//    Encoder errors surface at finish(), so finish() must sit inside the scope.
device.pushErrorScope('validation');
pass.dispatchWorkgroups(N);
pass.end();
const cmdBuf = enc.finish();
const err = await device.popErrorScope();
if (err) { console.warn('dispatch failed:', err.message); }

// 3. Nested scopes: inner catches OOM, outer catches validation
device.pushErrorScope('validation');      // outer
device.pushErrorScope('out-of-memory');   // inner
await tryBigAllocation();
const oomErr = await device.popErrorScope();   // pops the inner scope first
const valErr = await device.popErrorScope();
Why popErrorScope is async

Errors flow back from the GPU process to the Renderer over the same IPC channel as the dispatch, in the opposite direction. The Promise from popErrorScope() resolves only after every command issued inside the scope has been processed in the GPU process. Recording commands such as dispatchWorkgroups therefore stays synchronous and non-blocking, while reading the error is asynchronous.
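
The scope semantics can be modelled without a GPU. The sketch below is a toy, not Dawn's or Blink's code: `ToyDevice` and `reportError` are hypothetical names, and the model assumes an error is claimed by the innermost scope whose filter matches, with each scope keeping only its first error.

```javascript
// Toy model of the error-scope stack (illustration only).
class ToyDevice {
  constructor() {
    this.scopes = [];      // stack of { filter, error }
    this.uncaptured = [];  // errors that no scope claimed
  }

  pushErrorScope(filter) {
    this.scopes.push({ filter, error: null });
  }

  // Stands in for an error arriving from the GPU process.
  reportError(type, message) {
    // Walk from the innermost scope outwards; the first matching filter wins.
    for (let i = this.scopes.length - 1; i >= 0; i--) {
      const scope = this.scopes[i];
      if (scope.filter === type) {
        if (scope.error === null) scope.error = { type, message };
        return;
      }
    }
    this.uncaptured.push({ type, message });
  }

  // Returns a Promise, like the real API: the answer is only known later.
  popErrorScope() {
    if (this.scopes.length === 0) {
      return Promise.reject(new Error('no error scope to pop'));
    }
    return Promise.resolve(this.scopes.pop().error);
  }
}
```

With a 'validation' scope outside and an 'out-of-memory' scope inside, a validation error skips the inner scope and lands in the outer one, which is why the nested pattern pops twice.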

Dawn's internal validation flow

Dawn's validation isn't in one place; it's distributed across the GPU-process handler of each API call. E.g. ComputePassEncoder::APIDispatchWorkgroups (in src/dawn/native/ComputePassEncoder.cpp) calls ValidateDispatch(), which checks ① each workgroup count is within maxComputeWorkgroupsPerDimension, ② a BindGroup is bound, ③ a pipeline is bound, ④ the binding usages are compatible. A failure poisons the command encoder but doesn't throw; the error is deferred until finish().

Worked example — break step ⑤ of the main line, observe what fires

JS · main line step ⑤, with the BindGroup binding 0 dropped

    device.addEventListener('uncapturederror', e => {
      console.error('[uncaught]', e.error.message);
    });

    device.pushErrorScope('validation');
    // Intentionally: BindGroup entries left empty (should have binding 0)
    const bg = device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: []                          // ← bug here
    });
    const err = await device.popErrorScope();
    console.log(err);
    // → GPUValidationError {
    //     message: "Number of entries (0) did not match the number
    //               of entries (1) specified in [BindGroupLayout].
    //               - While calling [Device].CreateBindGroup([BindGroupDescriptor])."
    //   }

    // Note: bg is NOT null. It's an invalid BindGroup object (contagion).
    // A later setBindGroup(0, bg) won't throw; instead the encoder gets
    // poisoned and finish() reports a validation error.

Dawn error messages carry call-stack-style context: each nested API call appends a "While calling [X]" segment. The context is attached to the error object via ErrorData::AppendContext() (src/dawn/native/ErrorData.cpp). That is far more informative than WebGL's gl.getError() === gl.INVALID_OPERATION.

CHAPTER 13

设备丢失与恢复 — TDR · driver crash · 全屏切换

Device loss & recovery — TDR · driver crash · fullscreen switch

GPU 不可信的时候怎么活下来

Staying alive when the GPU lies to you

WebGPU 的一个明显设计决策是:device 可能突然死掉。device.lost 是个 Promise,一定会 resolve(要么因为应用主动 destroy,要么因为系统级原因)。

device 死掉的 6 种原因

  • TDR · Timeout Detection & Recovery(Windows):GPU 上某个 shader 跑 > 2 秒(默认)触发 TDR,整个 D3D12 device 被重置。Windows 这是 OS 层的安全机制。Linux 也有 GPU hang detection 但更宽松。
  • Driver crash:NVIDIA 驱动经常 hang/crash。Chrome 的 GPU process 也会一起死。
  • GPU 全屏切换:用户切到独占全屏游戏可能拉走 GPU 资源,浏览器 device 失效。
  • OS 资源回收:内存压力大时 OS 可能把 GPU 资源逐出。
  • 用户切显卡(eGPU 拔插、双显卡笔记本切换):device 引用的 adapter 不再存在。
  • 页面 navigate 走:所有 device 都死。

恢复模式

JS · 正确的 devicelost 处理 let device; async function getDevice() { const adapter = await navigator.gpu.requestAdapter(); device = await adapter.requestDevice(); // 注册 lost 监听(promise 形式) device.lost.then(info => { console.warn('device lost:', info.reason, info.message); if (info.reason === 'destroyed') return; // 我们主动 destroy // 1. 清理所有 GPU 资源的 JS 句柄(它们已经 invalid) resourceGarden.disposeAll(); // 2. 重新创建 getDevice().then(rebuildPipelines); }); }

info.reason 的可能值(spec §22 Errors & Debugging · #device-lost):

  • "destroyed" — 应用主动调 device.destroy()
  • "undefined" — 其他原因(OS、driver、TDR 等都归这里)。没错,就是字符串 "undefined"

规范把不区分具体丢失原因当作隐私防御——能告诉应用"TDR 触发了"就能用shader 运行时长度作为指纹(每个 GPU TDR 阈值不同)。所以应用只能粗粒度恢复。

Dawn 怎么实现 device loss

Dawn 内部有 三层 handling(src/dawn/native/Device.cpp):① 一个 backend 报错(如 Metal command buffer execution failed)→ ② Dawn 把当前 device 标记为 lost 状态,所有 in-flight 命令的 promise reject → ③ 触发 Wire client 发 lost 事件到 Renderer,触发 JS 的 device.lost promise resolve。整个流程是异步的——可能错误发生 50ms 后 JS 才知道。

生产实战

Babylon.js 5+ 默认开 device-loss 监听。Figma 上线 WebGPU 时报告过 TDR 在 Windows + 老 NVIDIA 驱动上的发生率 ~0.3% / 用户 / 月——很低,但有几万用户就有人遇到。Figma 的恢复策略:保存所有 vector 数据在 JS 端,device 丢失时 5 秒 cooldown 后重建 + 重提交所有 pipeline + 重传所有静态纹理(typical 30-100 ms)。

WebGPU makes one design choice explicit: a device can die at any time. device.lost is a Promise that always resolves eventually, either because the app destroyed the device or because some external event killed it.

Six reasons a device dies

  • TDR · Timeout Detection & Recovery (Windows): if the GPU stays busy on one shader for longer than 2 s (the default), the entire D3D12 device is reset. This is an OS-level safety mechanism. Linux has GPU hang detection too, but it is more permissive.
  • Driver crash: a driver from any vendor can hang or crash, and the Chrome GPU process dies with it.
  • GPU exclusive fullscreen switch: a user switching to an exclusive-fullscreen game can yank GPU resources, invalidating the browser's device.
  • OS resource reclaim: under memory pressure the OS may evict GPU resources.
  • User switches GPU (eGPU plug/unplug, dual-GPU laptop switch): the adapter the device referenced no longer exists.
  • Page navigation: all devices die.

Recovery pattern

JS · correct devicelost handling

    let device;

    async function getDevice() {
      const adapter = await navigator.gpu.requestAdapter();
      if (!adapter) throw new Error('no WebGPU adapter available');
      device = await adapter.requestDevice();

      // Register the lost listener (promise form)
      device.lost.then(info => {
        console.warn('device lost:', info.reason, info.message);
        if (info.reason === 'destroyed') return;   // we destroyed it

        // 1. Drop all JS references to GPU resources (they're invalid)
        resourceGarden.disposeAll();
        // 2. Re-create; getDevice() resolves to the new device
        getDevice().then(rebuildPipelines);
      });

      return device;   // so callers and .then(rebuildPipelines) receive it
    }

Possible info.reason values (spec §22 Errors & Debugging · #device-lost):

  • "destroyed": the app called device.destroy().
  • "unknown": anything else (OS / driver / TDR all go here).

The spec keeps device-loss reason coarse as a privacy defence — telling the app "TDR fired" would let an attacker use shader runtime length as a fingerprint (every GPU has a different TDR threshold). So apps only get coarse-grained recovery info.
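
Because the reason is coarse, a recovery loop can only distinguish "we destroyed it" from "everything else", and it should guard against re-creating a device forever. The sketch below is one way to structure that, assuming hypothetical app-supplied callbacks `createDevice` and `rebuild` and an arbitrary retry cap; it is testable with a fake device object.

```javascript
// Sketch: bounded device-loss recovery.
// `createDevice` and `rebuild` are hypothetical callbacks supplied by the app.
async function keepDeviceAlive(createDevice, rebuild, maxRetries = 3) {
  let retries = 0;

  async function acquire() {
    const device = await createDevice();
    await rebuild(device);               // recreate pipelines, buffers, textures

    device.lost.then(async (info) => {
      if (info.reason === 'destroyed') return;  // intentional shutdown
      if (++retries > maxRetries) return;       // give up, avoid a crash loop
      await acquire();
    });
    return device;
  }

  return acquire();
}
```

In a browser, `createDevice` would typically wrap `navigator.gpu.requestAdapter()` followed by `adapter.requestDevice()`. A real app would probably also add a cooldown between attempts, which this sketch omits.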

How Dawn implements device loss

Dawn has three layers of handling (src/dawn/native/Device.cpp): ① a backend reports an error (e.g. Metal command-buffer execution failed) → ② Dawn marks the current device as lost and rejects all in-flight command promises → ③ the Wire client sends a lost event to the Renderer, which resolves JS's device.lost promise. The whole flow is asynchronous: JS may learn about the error 50 ms after it happened.

Production reality

Babylon.js 5+ ships device-loss listening on by default. Figma reported when launching WebGPU that TDR on Windows + older NVIDIA drivers occurs at ~0.3% / user / month — small, but with millions of users some hit it. Figma's recovery: keep all vector data in JS, 5-second cooldown after loss, then rebuild + resubmit all pipelines + retransfer all static textures (typically 30–100 ms).

CHAPTER 14

Tint — Dawn 的 WGSL 前端

Tint — Dawn's WGSL frontend

parser → resolver → IR → writer 的四阶段编译器

A four-stage compiler: parser → resolver → IR → writer

Tint 是 Chromium 项目里把 WGSL 翻译到 3 种原生 shader 语言的编译器子项目。地址:chromium/src/third_party/dawn/src/tint/。约 11 万行 C++,由 Google 主要维护,但 Apple/Intel 也有 commit。

为什么不复用 SPIRV-Cross / DXC / spirv-tools

原本工作组想:写一个 WGSL→SPIR-V 前端,然后 SPIR-V→MSL 走 SPIRV-Cross,SPIR-V→HLSL 走 SPIRV-Cross,HLSL→DXIL 走 DXC。实际尝试后发现

  • SPIRV-Cross 历史上有大量 CVE(Khronos 的解析器从 native gamedev 用例进化而来,安全模型不够)。
  • 多跳翻译累积语义损失:WGSL 语义 → SPIR-V → MSL 的 round trip 可能丢精度(如 NaN 处理、FP 优化决策)。
  • WebGPU 规定 WGSL 语义严格,每跳都要重新 enforce。

于是 Tint 走一跳直达:每个 backend 直接从 Tint IR 翻到目标语言。架构清晰、安全可控、性能更好。

Tint 四阶段

阶段路径输入输出典型耗时(一个 shader)
1. Parsetint/lang/wgsl/reader/WGSL 源文本AST~0.5 ms
2. Resolvetint/lang/wgsl/resolver/ASTtyped AST(含 type、constant value)~1 ms
3. Lower to IRtint/lang/wgsl/program_to_ir/typed ASTTint IR(SSA)~0.5 ms
4a. Writer · MSLtint/lang/msl/writer/IRMetal Shading Language~1 ms
4b. Writer · HLSLtint/lang/hlsl/writer/IRHLSL(再被 DXC 编 DXIL)~1.5 ms
4c. Writer · SPIR-Vtint/lang/spirv/writer/IRSPIR-V binary~1 ms

主线那段 16 行 WGSL 全程 ~3 ms(首次)。缓存后 0——Dawn 把编译产物 keyed on (WGSL source hash, backend type, target API version) 缓存到 device-scoped HashMap,二次 createShaderModule 直接复用。

Tint IR 长什么样

Tint IR · tint/lang/core/ir/ · main-line shader after lowering // Tint IR 是 SSA,每条 Value 类型明确,结构类似 LLVM IR 但 // 简化了很多(专门服务 shader 编译,不需要 alias 分析) %main = func(%gid: vec3<u32>) { $B1: { %2:u32 = access %gid, 0u // gid.x %3:ptr<storage, array<f32>, read_write> = access @data %4:u32 = arrayLength %3 // 内置函数 %5:bool = lt %2, %4 // i < arrayLength(&data) if %5 [true: $B2] { $B2: { %6:ptr<storage, f32, read_write> = access %3, %2 %7:f32 = load %6 // data[i] %8:f32 = mul %7, %7 // data[i] * data[i] store %6, %8 // data[i] = ... } } ret } }

Resolver 做的 21 件事

Resolver(tint/lang/wgsl/resolver/)是 Tint 里最大的子目录——主文件 resolver.cc 约 3,500 行,整个 resolver/ 目录加起来约 25,000 行 C++——因为它要 enforce 整个 WGSL spec 的静态语义。简略列表:

  • 类型推断 + 类型检查(如 a + b 要求 a/b 同类型)
  • const-evaluation(const x = 1 + 2 * 3 在编译期求出 7)
  • 地址空间合法性(function 不能跨 invocation 共享)
  • function 参数 / 返回值合法性
  • 资源 binding 唯一性(同一 group/binding 只能有一个声明)
  • workgroup_size 表达式必须是 const-expression
  • recursion 检查(WGSL 不允许递归)
  • diagnostic filter(@diagnostic(off, derivative_uniformity)
  • uniform-flow 分析(fragment 派生在 non-uniform 控制流里要报警)
  • ...另外 12 项

Resolver 失败 → 整个 ShaderModule 创建失败 → 抛 GPUValidationError 到 JS。这种错误是同步可见的——通过 shader.getCompilationInfo() async API 拿到详细信息。

Tint is the Chromium subproject that translates WGSL into three native shader languages. Path: chromium/src/third_party/dawn/src/tint/. About 110k lines of C++, maintained mostly by Google with Apple/Intel contributions.

Why not reuse SPIRV-Cross / DXC / spirv-tools

The working group's first plan: write a WGSL→SPIR-V frontend, then SPIR-V→MSL via SPIRV-Cross, SPIR-V→HLSL via SPIRV-Cross, HLSL→DXIL via DXC. In practice:

  • SPIRV-Cross has accumulated many CVEs (the Khronos parser comes from gamedev needs, weak threat model).
  • Multi-hop translation accumulates semantic loss: WGSL → SPIR-V → MSL can lose precision (NaN handling, FP optimisation decisions).
  • WebGPU requires strict WGSL semantics; every hop has to re-enforce them.

So Tint takes one direct hop: each backend walks Tint IR straight to the target language. Cleaner architecture, controllable security, better performance.

The four Tint stages

Stage | Path | Input | Output | Typical time (one shader)
1. Parse | tint/lang/wgsl/reader/ | WGSL source | AST | ~0.5 ms
2. Resolve | tint/lang/wgsl/resolver/ | AST | typed AST (types + const values) | ~1 ms
3. Lower to IR | tint/lang/wgsl/program_to_ir/ | typed AST | Tint IR (SSA) | ~0.5 ms
4a. Writer · MSL | tint/lang/msl/writer/ | IR | Metal Shading Language | ~1 ms
4b. Writer · HLSL | tint/lang/hlsl/writer/ | IR | HLSL (DXC then compiles it to DXIL) | ~1.5 ms
4c. Writer · SPIR-V | tint/lang/spirv/writer/ | IR | SPIR-V binary | ~1 ms

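Our main line's 16-line WGSL takes ~3 ms end-to-end the first time (0.5 + 1 + 0.5 ms for the first three stages plus ~1 ms for the MSL or SPIR-V writer). After that it is served from cache: Dawn keys compilation results on (WGSL source hash, backend type, target API version) in a device-scoped HashMap, so a second createShaderModule with the same source reuses the earlier result.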
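
The caching idea can be sketched in a few lines. This is a toy model of the key described above, not Dawn's implementation: `ShaderCache`, `fnv1a`, and the `compile` callback are all hypothetical, and a production cache would use a collision-resistant hash rather than 32-bit FNV-1a.

```javascript
// 32-bit FNV-1a over UTF-16 code units; good enough for a sketch.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Cache keyed on (source hash, backend, API version).
class ShaderCache {
  constructor(compile) {
    this.compile = compile;  // stands in for the WGSL → target pipeline
    this.entries = new Map();
    this.misses = 0;
  }

  get(wgsl, backend, apiVersion) {
    const key = `${fnv1a(wgsl)}|${backend}|${apiVersion}`;
    if (!this.entries.has(key)) {
      this.misses++;
      this.entries.set(key, this.compile(wgsl, backend));
    }
    return this.entries.get(key);
  }
}
```

The same WGSL compiled for a different backend or API version misses the cache on purpose, because the generated MSL, HLSL, or SPIR-V differs.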

What Tint IR looks like

Tint IR · tint/lang/core/ir/ · main-line shader after lowering

    // Tint IR is SSA. Every Value has a definite type. Looks like
    // LLVM IR but simpler: purpose-built for shader compilation.
    %main = func(%gid: vec3<u32>) {
      $B1: {
        %2:u32 = access %gid, 0u                             // gid.x
        %3:ptr<storage, array<f32>, read_write> = access @data
        %4:u32 = arrayLength %3                              // builtin
        %5:bool = lt %2, %4                                  // i < arrayLength(&data)
        if %5 [true: $B2] {
          $B2: {
            %6:ptr<storage, f32, read_write> = access %3, %2
            %7:f32 = load %6                                 // data[i]
            %8:f32 = mul %7, %7                              // data[i] * data[i]
            store %6, %8                                     // data[i] = ...
          }
        }
        ret
      }
    }

The 21 things Resolver does

The resolver (tint/lang/wgsl/resolver/) is Tint's largest subdirectory — the main file resolver.cc is ~3,500 lines, but the entire resolver/ directory totals ~25,000 lines of C++, because it must enforce the entire WGSL spec's static semantics. A non-exhaustive list:

  • Type inference + type checking (e.g. a + b requires same type)
  • const-evaluation (const x = 1 + 2 * 3 resolves to 7 at compile time)
  • Address-space legality (function can't be shared across invocations)
  • Function parameter / return type legality
  • Resource-binding uniqueness (one declaration per group/binding)
  • workgroup_size arguments must be const-expressions or override-expressions
  • Recursion check (WGSL forbids recursion)
  • Diagnostic filters (@diagnostic(off, derivative_uniformity))
  • Uniformity analysis (fragment derivatives in non-uniform control flow trigger derivative_uniformity, which is an error by default)
  • … and 12 more
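
The const-evaluation item is the easiest of these to show in miniature. The folder below is a sketch over a made-up AST shape (`lit` and `bin` nodes are hypothetical, not Tint's node types) and handles only three integer operators.

```javascript
// Minimal constant folder over a tiny expression AST.
function constEval(node) {
  if (node.kind === 'lit') return node.value;
  if (node.kind === 'bin') {
    const l = constEval(node.lhs);
    const r = constEval(node.rhs);
    switch (node.op) {
      case '+': return l + r;
      case '-': return l - r;
      case '*': return l * r;
      default: throw new Error(`unsupported operator ${node.op}`);
    }
  }
  // Anything else (a variable read, a call) is not known at compile time.
  throw new Error(`not a const-expression: ${node.kind}`);
}

const lit = (value) => ({ kind: 'lit', value });
const bin = (op, lhs, rhs) => ({ kind: 'bin', op, lhs, rhs });

// const x = 1 + 2 * 3
const x = constEval(bin('+', lit(1), bin('*', lit(2), lit(3))));
console.log(x); // 7
```

A real resolver additionally tracks types and rejects overflow during folding; the sketch skips both.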

Resolver failure → ShaderModule creation fails → a GPUValidationError is reported to JS. createShaderModule() itself does not throw; the details are observable through the async shader.getCompilationInfo(), which returns a structured list of messages.

CHAPTER 15

Naga — wgpu 的 WGSL 前端

Naga — wgpu's WGSL frontend

Rust · arena IR · 编译给 Firefox / Bevy / Deno 共用

Rust · arena IR · powers Firefox / Bevy / Deno alike

Naga 是 Tint 的 Rust 对应物。地址:github.com/gfx-rs/wgpu/tree/trunk/naga。约 6 万行 Rust,gfx-rs 团队维护。它和 Tint 干同样的事——WGSL 翻译到 MSL / HLSL / SPIR-V / GLSL(外加 WGSL→WGSL re-print)——但架构很不一样

最大的差异:Arena IR

Tint 用经典的对象图——每个 IR 节点都是一个堆分配的 C++ 对象,靠 Block* / Value* 指针互引。Naga 反过来——用arena(slab allocator):

Naga · naga/src/lib.rs · Module 数据结构 pub struct Module { // 所有 Type 存在一个 Arena 里,节点引用是 Handle<Type>(u32 index) pub types: UniqueArena<Type>, pub constants: Arena<Constant>, pub overrides: Arena<Override>, pub global_variables: Arena<GlobalVariable>, pub functions: Arena<Function>, pub entry_points: Vec<EntryPoint>, } pub struct Function { pub name: Option<String>, pub arguments: Vec<FunctionArgument>, pub result: Option<FunctionResult>, // Local 变量、Expression、Statement 都各自 arena 化 pub local_variables: Arena<LocalVariable>, pub expressions: Arena<Expression>, pub body: Block, }

为什么用 arena:

  • Rust 友好:避免 Box<T> 满天飞和 borrow checker 难题。Handle<T>Copy,arena 是不可变借用
  • 缓存友好:所有同类节点连续布局在内存里,扫一次 IR cache 命中率高。
  • 序列化便宜:整个 Module 可以 zero-copy serialise(如 bincode),让 Firefox 在 GPU 进程间传 IR 极快。
  • 易做静态分析:所有 Expression 是平铺的 SSA,写 visitor 不用 trait object。

Naga 的四个前端

Naga 不只有 WGSL 前端——它还能反向读 SPIR-V、GLSL 当输入:

前端路径用途
WGSL → Modulenaga/src/front/wgsl/主路径,wgpu 用
SPIR-V → Modulenaga/src/front/spv/把外部 shader 引入 wgpu
GLSL → Modulenaga/src/front/glsl/从 OpenGL 项目迁移用

这是 Naga 比 Tint 灵活的地方——Tint 接 WGSL。但 WebGPU 浏览器实现只用 WGSL 输入,所以这种灵活性主要服务 wgpu 在非浏览器场景(Bevy 引擎、Deno runtime、Servo)。

Naga 的四个后端

后端路径使用方
Module → MSLnaga/src/back/msl/Firefox macOS, Bevy macOS
Module → HLSLnaga/src/back/hlsl/Firefox Windows(→ DXC → DXIL)
Module → SPIR-Vnaga/src/back/spv/Firefox Linux, Bevy Vulkan
Module → GLSLnaga/src/back/glsl/WebGL 兼容(兜底)
Module → WGSLnaga/src/back/wgsl/调试 / round trip 验证

编译速度

Naga 的 WGSL→MSL 在 M2 Pro 上对主线那段 16 行约 1.2 ms,比 Tint 略快(~3 ms)。差距主要来自 Rust 的 monomorphization 和 arena 的局部性。在 wgpu 的内部 benchmark 上(wgpu/wgpu-types/benches/),Naga 编译 1000 行复杂 shader 约 15 ms,Tint 约 25–30 ms。两者都远快于原生 driver 编译(spirv-tools/glslc/DXC 都 100+ ms)。

两个项目的 spec 拥护方式不一样

Tint 的代码组织紧贴 WGSL spec 章节号——tint/lang/wgsl/resolver/uniformity.cc 直接对应 spec §13 uniformity analysis。Naga 的代码组织按数据流——前端折叠各种 source 语言进 Module,后端把 Module 翻成各种 target。两种风格都对,但 Tint 更适合 spec 实现者(一一对应)、Naga 更适合 graphics engine 嵌入者(IR 友好)。

Naga is Tint's Rust counterpart. Repo: github.com/gfx-rs/wgpu/tree/trunk/naga. About 60k lines of Rust, maintained by the gfx-rs team. It does the same job as Tint — translate WGSL into MSL / HLSL / SPIR-V / GLSL (plus a WGSL re-printer) — but its architecture is very different.

The biggest difference: arena IR

Tint uses a classic object graph — each IR node is a heap-allocated C++ object referenced via Block* / Value*. Naga inverts this and uses arenas (slab allocators):

Naga · naga/src/lib.rs · Module structure

    pub struct Module {
        // All Types live in one Arena; references are Handle<Type> (u32 index)
        pub types: UniqueArena<Type>,
        pub constants: Arena<Constant>,
        pub overrides: Arena<Override>,
        pub global_variables: Arena<GlobalVariable>,
        pub functions: Arena<Function>,
        pub entry_points: Vec<EntryPoint>,
    }

    pub struct Function {
        pub name: Option<String>,
        pub arguments: Vec<FunctionArgument>,
        pub result: Option<FunctionResult>,
        // Locals, expressions, statements each in their own arena
        pub local_variables: Arena<LocalVariable>,
        pub expressions: Arena<Expression>,
        pub body: Block,
    }

Why arenas:

  • Rust-friendly: no Box<T> everywhere; Handle<T> is Copy; arenas are immutably borrowed, the borrow checker is happy.
  • Cache-friendly: same-kind nodes live contiguously; one pass walks IR with high cache hit rate.
  • Cheap to serialise: handles are plain indices, so an entire Module can be serialised without pointer fix-ups (e.g. via serde and bincode), which keeps cross-process IR transfer in Firefox cheap.
  • Easy static analysis: flat SSA — write visitors without trait objects.
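
The handle-instead-of-pointer idea carries over to any language. The sketch below mirrors the shape of Arena and UniqueArena in JavaScript; it is an illustration only (Naga's real handles are typed index wrappers, and the JSON-based dedup key is an assumption made for brevity).

```javascript
// An arena stores nodes contiguously and hands out integer handles.
class Arena {
  constructor() { this.items = []; }
  append(item) {
    this.items.push(item);
    return this.items.length - 1;   // the handle is just the index
  }
  get(handle) { return this.items[handle]; }
}

// A unique arena deduplicates structurally equal items (used for types).
class UniqueArena extends Arena {
  constructor() { super(); this.index = new Map(); }
  insert(item) {
    const key = JSON.stringify(item);
    if (!this.index.has(key)) this.index.set(key, this.append(item));
    return this.index.get(key);
  }
}

const types = new UniqueArena();
const f32 = types.insert({ kind: 'scalar', name: 'f32' });
const f32Again = types.insert({ kind: 'scalar', name: 'f32' });

// Expressions refer to each other and to types by handle, never by reference.
const expressions = new Arena();
const lhs = expressions.append({ op: 'load', type: f32 });
const product = expressions.append({ op: 'mul', left: lhs, right: lhs, type: f32 });
```

Since every cross-reference is an integer, the whole structure survives `JSON.stringify` and a round trip through another process unchanged, which is the serialisation argument in the list above.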

Naga's three frontends

Naga isn't WGSL-only: it can also read SPIR-V and GLSL as input.

Frontend | Path | Use
WGSL → Module | naga/src/front/wgsl/ | main path, wgpu
SPIR-V → Module | naga/src/front/spv/ | importing external shaders
GLSL → Module | naga/src/front/glsl/ | migration from OpenGL projects

That's where Naga is more flexible than Tint — Tint accepts only WGSL. But browser WebGPU only feeds WGSL, so this flexibility serves wgpu in non-browser contexts (Bevy, Deno, Servo).

Naga's five backends

Backend | Path | Consumer
Module → MSL | naga/src/back/msl/ | Firefox macOS, Bevy macOS
Module → HLSL | naga/src/back/hlsl/ | Firefox Windows (→ DXC → DXIL)
Module → SPIR-V | naga/src/back/spv/ | Firefox Linux, Bevy Vulkan
Module → GLSL | naga/src/back/glsl/ | WebGL fallback
Module → WGSL | naga/src/back/wgsl/ | debug / round-trip testing

Compilation speed

Naga's WGSL→MSL on M2 Pro for our 16-line shader takes about 1.2 ms against Tint's ~3 ms, roughly 2.5× faster. The gap is attributed to Rust monomorphisation and arena locality. On wgpu's internal benchmarks (wgpu/wgpu-types/benches/), Naga compiles a 1000-line shader in ~15 ms vs Tint's ~25–30 ms. Both are far faster than the downstream native compilers (spirv-tools/glslc/DXC all run 100+ ms).

Two projects, two organising principles

Tint's source layout follows WGSL spec chapters — e.g. tint/lang/wgsl/resolver/uniformity.cc maps to spec §13 uniformity analysis. Naga's layout follows data flow — frontends fold various source languages into Module, backends fold Module into various targets. Both are valid; Tint is friendlier to spec implementers (one-to-one mapping), Naga to engine embedders (IR-friendly).

CHAPTER 16

三个翻译目标 — SPIR-V · MSL · HLSL

Three translation targets — SPIR-V · MSL · HLSL

同一行 WGSL,翻译到三种 shader 语言后长什么样

The same WGSL line, in three shading languages

WGSL 进了 Tint/Naga,出来的不再是源代码而是三种格式之一。这一章把同一段主线 WGSL 翻译到三个目标,看每种目标长什么样、各自有什么 quirk。

输入 WGSL(主线)

INPUT · WGSL @group(0) @binding(0) var<storage, read_write> data: array<f32>; @compute @workgroup_size(64) fn main(@builtin(global_invocation_id) gid: vec3<u32>) { let i = gid.x; if (i < arrayLength(&data)) { data[i] = data[i] * data[i]; } }

输出 1 · MSL(Metal Shading Language)

OUTPUT · Metal · 用于 macOS · iOS · iPadOS #include <metal_stdlib> using namespace metal; struct tint_module_vars_struct { device array<float, 1>* data; const constant uint* tint_array_length; }; kernel void main_( uint3 gid [[thread_position_in_grid]], tint_module_vars_struct tv [[buffer(0)]]) { uint i = gid.x; if (i < *tv.tint_array_length) { tv.data[i] = tv.data[i] * tv.data[i]; } }

注意点:

  • device = MSL 的 storage 地址空间,对应 WGSL storage
  • kernel = MSL 的 compute shader 入口(不是 fragmentvertex)。
  • [[thread_position_in_grid]] = MSL builtin,对应 WGSL @builtin(global_invocation_id)
  • arrayLength(&data) 不存在于 MSL——Tint 把它编译成 uniform 常量,由 Dawn 在 dispatch 时填进去。这是 Tint 的语义适配层

输出 2 · HLSL(DirectX 12 用)

OUTPUT · HLSL · 用于 Windows D3D12 // Tint 输出 SM 6.0 HLSL,然后由 DXC 编成 DXIL RWByteAddressBuffer data : register(u0); [numthreads(64, 1, 1)] void main(uint3 gid : SV_DispatchThreadID) { uint i = gid.x; uint dataLength; data.GetDimensions(dataLength); dataLength /= 4; // byte → f32 if (i < dataLength) { float v = asfloat(data.Load(i * 4)); data.Store(i * 4, asuint(v * v)); } }

D3D12 的 storage buffer 用 RWByteAddressBuffer所有访问都按字节。Tint 把 data[i](f32 数组)翻译成 data.Load(i * 4)asfloat()。这是巨大语义跳变——也是 HLSL 后端比 MSL 后端复杂的根本原因。

输出 3 · SPIR-V(Vulkan 用)

OUTPUT · SPIR-V · 用于 Linux Vulkan · Android // SPIR-V 是二进制 IR,下面是 SPIRV-Dis 反汇编 ; Module Version 1.3 ; Generator: Tint 1.x OpCapability Shader OpMemoryModel Logical GLSL450 OpEntryPoint GLCompute %main "main" %gid OpExecutionMode %main LocalSize 64 1 1 %void = OpTypeVoid %float = OpTypeFloat 32 %uint = OpTypeInt 32 0 %v3uint = OpTypeVector %uint 3 %data_arr = OpTypeRuntimeArray %float %data_buf = OpTypeStruct %data_arr %data_ptr = OpTypePointer StorageBuffer %data_buf OpDecorate %data_buf Block ; SPV_KHR_storage_buffer_storage_class · SPIR-V 1.3+ OpDecorate %data DescriptorSet 0 OpDecorate %data Binding 0 ; ... 接下来 80 多条指令编码 if + 乘 + store ...

SPIR-V 是二进制 IR,下面给的是 spirv-dis 的可读版本。一个 16 行 WGSL 出来约 120 条 SPIR-V 指令、二进制 ~600 字节。SPIR-V 的优点是可被驱动直接吃、Khronos 已经有十几个 vendor 实现;缺点是不可读、debug 必须 spirv-dis。

三个目标的语义对齐

语义WGSLMSLHLSLSPIR-V
整数溢出wrap (2's complement)wrapwrapwrap (with OpDecorate)
除以 0(整数)未定义(typically 0)未定义未定义未定义
NaN orderingstd::isnan 可用
越界 array readrequired: 0 或 clamp需手动 clampRWByteAddressBuffer 自动 0需 capability
workgroup memory(典型,非 spec 最小)16 KB 最低保证~32 KB 典型(Apple)32 KB 典型(D3D12)16 KB spec 下限(Vulkan maxComputeSharedMemorySize),实际 32-48 KB 常见

注意"越界 read"那行——WGSL spec 强制规定越界读必须 0 或 clamp(不能 UB)。这是 WebGPU 安全模型的核心。Tint/Naga 的每个 backend都要 patch 进相应保护代码:MSL 加显式 if 判断、HLSL 用 RWByteAddressBuffer 自带的保护、SPIR-V 加 RobustBufferAccess capability。

WGSL goes into Tint/Naga and comes out as one of three formats. This chapter translates the same main-line WGSL to all three targets, side by side, with the quirks each one brings.

Input WGSL (the main line)

INPUT · WGSL

    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      let i = gid.x;
      if (i < arrayLength(&data)) {
        data[i] = data[i] * data[i];
      }
    }

Output 1 · MSL (Metal Shading Language)

OUTPUT · Metal · for macOS · iOS · iPadOS

    #include <metal_stdlib>
    using namespace metal;

    struct tint_module_vars_struct {
      device array<float, 1>* data;
      const constant uint* tint_array_length;
    };

    kernel void main_(
        uint3 gid [[thread_position_in_grid]],
        tint_module_vars_struct tv [[buffer(0)]]) {
      uint i = gid.x;
      if (i < *tv.tint_array_length) {
        tv.data[i] = tv.data[i] * tv.data[i];
      }
    }

Things to notice:

  • device = MSL's storage address space, equivalent to WGSL storage.
  • kernel = MSL's compute-shader entry (not fragment, not vertex).
  • [[thread_position_in_grid]] = MSL builtin, maps WGSL @builtin(global_invocation_id).
  • arrayLength(&data) doesn't exist in MSL — Tint compiles it into a uniform constant filled in by Dawn at dispatch time. This is Tint's semantic adaptation layer.

Output 2 · HLSL (for D3D12)

OUTPUT · HLSL · for Windows D3D12

    // Tint emits Shader Model 6.0 HLSL; DXC then compiles it to DXIL
    RWByteAddressBuffer data : register(u0);

    [numthreads(64, 1, 1)]
    void main(uint3 gid : SV_DispatchThreadID) {
      uint i = gid.x;
      uint dataLength;
      data.GetDimensions(dataLength);
      dataLength /= 4;                    // bytes → f32 count
      if (i < dataLength) {
        float v = asfloat(data.Load(i * 4));
        data.Store(i * 4, asuint(v * v));
      }
    }

D3D12 storage buffers are RWByteAddressBuffer; all access is byte-addressed. Tint translates data[i] (an f32 array) into data.Load(i * 4) + asfloat(). This is a huge semantic shift and the reason the HLSL backend is harder than the MSL backend.
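
The byte-addressed lowering can be imitated with a DataView, which also works in byte offsets. This is an emulation for illustration, assuming little-endian storage; `squareElement` is a hypothetical name.

```javascript
// Emulates the HLSL lowering of `data[i] = data[i] * data[i]`
// on a byte-addressed buffer.
function squareElement(bytes, i) {
  const view = new DataView(bytes);
  const elementCount = bytes.byteLength / 4;   // GetDimensions(), then /= 4
  if (i < elementCount) {
    const v = view.getFloat32(i * 4, true);    // asfloat(data.Load(i * 4))
    view.setFloat32(i * 4, v * v, true);       // data.Store(i * 4, asuint(v * v))
  }
}

const buffer = new Float32Array([1.5, 2, 3]).buffer;
squareElement(buffer, 1);   // in range: element 1 becomes 4
squareElement(buffer, 3);   // out of range: the guard makes it a no-op
console.log(Array.from(new Float32Array(buffer)));
```

The typed view (`Float32Array`) plays the role of WGSL's `array<f32>`, while the DataView shows what the backend has to spell out: a byte offset of `i * 4` and an explicit reinterpretation on every load and store.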

Output 3 · SPIR-V (for Vulkan)

OUTPUT · SPIR-V · for Linux Vulkan · Android

    // SPIR-V is binary IR; below is spirv-dis output
    ; Module Version 1.3
    ; Generator: Tint 1.x
    OpCapability Shader
    OpMemoryModel Logical GLSL450
    OpEntryPoint GLCompute %main "main" %gid
    OpExecutionMode %main LocalSize 64 1 1
    %void     = OpTypeVoid
    %float    = OpTypeFloat 32
    %uint     = OpTypeInt 32 0
    %v3uint   = OpTypeVector %uint 3
    %data_arr = OpTypeRuntimeArray %float
    %data_buf = OpTypeStruct %data_arr
    %data_ptr = OpTypePointer StorageBuffer %data_buf
    OpDecorate %data_buf Block   ; SPV_KHR_storage_buffer_storage_class · SPIR-V 1.3+
    OpDecorate %data DescriptorSet 0
    OpDecorate %data Binding 0
    ; ... ~80 more instructions encode if + mul + store ...

SPIR-V is binary IR; the text above is spirv-dis output. Our 16-line WGSL becomes ~120 SPIR-V instructions, ~600 bytes binary. SPIR-V's advantage: directly consumable by drivers, dozens of vendor implementations already exist. Disadvantage: unreadable; debugging requires spirv-dis.
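
"Binary IR" here means a stream of 32-bit words: a five-word header, then instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits. The walker below is a minimal sketch of that layout; `listInstructions` is a hypothetical helper, and the hand-built module contains only the first two instructions of the listing above.

```javascript
const SPIRV_MAGIC = 0x07230203;
const HEADER_WORDS = 5;  // magic, version, generator, bound, schema

// Walks a SPIR-V word stream and returns { opcode, wordCount } per instruction.
function listInstructions(words) {
  if (words[0] !== SPIRV_MAGIC) throw new Error('not a SPIR-V module');
  const out = [];
  let i = HEADER_WORDS;
  while (i < words.length) {
    const wordCount = words[i] >>> 16;
    const opcode = words[i] & 0xffff;
    if (wordCount === 0) throw new Error('malformed instruction');
    out.push({ opcode, wordCount });
    i += wordCount;
  }
  return out;
}

// Header for version 1.3, then OpCapability Shader and OpMemoryModel Logical GLSL450.
const words = new Uint32Array([
  SPIRV_MAGIC, 0x00010300, 0, 8, 0,
  (2 << 16) | 17, 1,        // OpCapability (opcode 17), Shader (1)
  (3 << 16) | 14, 0, 1,     // OpMemoryModel (opcode 14), Logical (0), GLSL450 (1)
]);
const instructions = listInstructions(words);
console.log(instructions.length, 'instructions in', words.byteLength, 'bytes');
```

Tools such as spirv-dis do the same walk and then map each opcode and operand to its textual name.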

Semantic alignment across targets

Semantic | WGSL | MSL | HLSL | SPIR-V
Integer overflow | wraps (two's complement) | wraps | wraps | wraps (with OpDecorate)
Integer divide by 0 | defined at runtime: x / 0 yields x (a shader-creation error in const-expressions) | undefined | undefined | undefined
NaN ordering | std::isnan available | same | same | same
Out-of-bounds array read | required: 0 or clamp | manual clamp needed | RWByteAddressBuffer auto-zeros | needs capability
Workgroup memory | 16 KB default limit (maxComputeWorkgroupStorageSize) | ~32 KB typical (Apple) | 32 KB typical (D3D12) | 16 KB Vulkan maxComputeSharedMemorySize floor; 32-48 KB common in practice

Note the "out-of-bounds read" row: the WGSL spec requires an OOB read to produce a well-defined result such as 0 or a clamped element (no UB allowed). That's core to WebGPU's safety model. Every Tint/Naga backend patches in the corresponding protection: MSL adds an explicit bounds check, HLSL relies on RWByteAddressBuffer returning zero for out-of-range loads, and the Vulkan path can lean on the device's robustBufferAccess feature.
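
The two permitted outcomes, clamp or zero, look like this when written out by hand. This is a sketch of the idea rather than generated backend code; both function names are hypothetical, and the clamped variant assumes a non-empty array.

```javascript
// Variant 1: clamp the index into range, so an OOB read returns the last element.
// Assumes array.length >= 1.
function loadClamped(array, i) {
  const index = Math.min(i >>> 0, array.length - 1);  // treat i as u32, then clamp
  return array[index];
}

// Variant 2: predicate the access, so an OOB read returns zero.
function loadOrZero(array, i) {
  const index = i >>> 0;
  return index < array.length ? array[index] : 0;
}

const data = [10, 20, 30];
console.log(loadClamped(data, 99), loadOrZero(data, 99)); // 30 0
```

Either way the shader never observes memory outside its own buffer, which is the property the safety model depends on.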

CHAPTER 17

原生 API 映射 — Metal · D3D12 · Vulkan

Native API mapping — Metal · D3D12 · Vulkan

同一个 WebGPU dispatch,在三个原生 API 下的调用链

One WebGPU dispatch, mapped to three native API call chains

这是翻译⑥的最后一步。Dawn / wgpu 拿到编译好的 shader 后要把它变成原生 API 调用。这一章把主线 ⑥ 在三个原生 API 下的完整调用链列出来。

主线提交 → Metal(macOS · iOS)

Dawn · src/dawn/native/metal/CommandBufferMTL.mm // WebGPU dispatchWorkgroups(16384) 在 Metal 下: id<MTLCommandQueue> queue = ...; id<MTLCommandBuffer> cb = [queue commandBuffer]; id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder]; [enc setComputePipelineState:pipelineState]; [enc setBuffer:dataBuffer offset:0 atIndex:0]; // 一次 dispatch · workgroupCount × workgroupSize [enc dispatchThreadgroups:MTLSizeMake(16384, 1, 1) threadsPerThreadgroup:MTLSizeMake(64, 1, 1)]; [enc endEncoding]; [cb commit];

Metal 的优点:类型一致——dispatch 直接传 threadgroup 数 + 每 threadgroup 线程数(对应 WGSL workgroupCount + workgroup_size)。Apple GPU 直接用这套模型,没有翻译损失。Apple Silicon 上 Dawn 测得 dispatch 命令本身~10 µs(命令录制),等待 GPU 执行另算。

主线提交 → D3D12(Windows)

Dawn · src/dawn/native/d3d12/CommandBufferD3D12.cpp // 同一行 dispatchWorkgroups(16384) 在 D3D12 下: ID3D12CommandAllocator* allocator = ...; ID3D12GraphicsCommandList* cl = ...; cl->Reset(allocator, nullptr); cl->SetPipelineState(pipelineState); cl->SetComputeRootSignature(rootSignature); // Descriptor heap 切换(如有必要) cl->SetDescriptorHeaps(2, heaps); cl->SetComputeRootDescriptorTable(0, dataDescriptor); // 显式 resource barrier · D3D12 比 Metal 要求多 D3D12_RESOURCE_BARRIER barrier = ...; cl->ResourceBarrier(1, &barrier); cl->Dispatch(16384, 1, 1); cl->Close(); ID3D12CommandQueue* q = ...; q->ExecuteCommandLists(1, &cl);

D3D12 多两个东西:① root signature(每个 pipeline 必须显式给出整个 binding shape)和 ② resource barrier(每个资源状态变化必须显式宣告)。Dawn 的 D3D12 backend 比 Metal 多 ~2× 代码量就因为要处理这两个。

主线提交 → Vulkan(Linux · Android)

Dawn · src/dawn/native/vulkan/CommandBufferVk.cpp // 同一行 dispatch 在 Vulkan 下: VkCommandBuffer cb = ...; vkBeginCommandBuffer(cb, &beginInfo); vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline); vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout, 0, 1, &descriptorSet, 0, nullptr); // pipeline barrier(比 D3D12 还显式) vkCmdPipelineBarrier(cb, srcStage, dstStage, 0, 0, nullptr, 1, &bufferBarrier, 0, nullptr); vkCmdDispatch(cb, 16384, 1, 1); vkEndCommandBuffer(cb); VkSubmitInfo si = ...; si.commandBufferCount = 1; si.pCommandBuffers = &cb; vkQueueSubmit(queue, 1, &si, fence);

三家差异速查

MetalD3D12Vulkan
命令 buffer 类型MTLCommandBufferID3D12GraphicsCommandListVkCommandBuffer
Pipeline 状态argument buffer 自动root signature 显式descriptor set + pipeline layout 显式
Resource barrier大部分自动显式 ResourceBarrier显式 vkCmdPipelineBarrier(最严)
Threadgroup vs threadsthreadgroup 和 thread 都显numthreads 在 shader; Dispatch 是 group 数同 D3D12
同步原语MTLFence (event-based)ID3D12Fence (value-based)VkFence / VkSemaphore (binary)
Dawn 后端代码量~3.5 万行 Obj-C++~5 万行 C++~5.5 万行 C++
为什么 Vulkan 后端最大

Vulkan 的显式同步是三家最严的——每个 buffer/texture 在 dispatch 之前都要 declare 当前 state 和目标 state,每个 image layout transition 都要写 barrier。Dawn 的 Vulkan backend 有一个完整的 "subresource state tracker" 来在录制 encoder 时自动算出所有 barrier。这部分代码占 Vulkan backend 的 ~30%。

主线在三家上的完整轨迹

把主线的 7 个 JS 调用串起来,逐 API 看:

主线 JSMetal(macOS)D3D12(Windows)Vulkan(Linux)
requestDevice()MTLCreateSystemDefaultDevice() + [device newCommandQueue]D3D12CreateDevice() + CreateCommandQueue()vkCreateDevice() + vkGetDeviceQueue()
createBuffer(4 MB, STORAGE)[device newBufferWithLength:4194304 options:MTLResourceStorageModePrivate]CreateCommittedResource(D3D12_HEAP_TYPE_DEFAULT, 4 MB)vkCreateBuffer({4 MB, STORAGE_BUFFER_BIT}) + vkAllocateMemory()
queue.writeBuffer(buf, 0, input)临时 shared buffer + blitCommandEncoder copyFromBuffer:...upload heap + cl->CopyBufferRegion(...)staging buffer + vkCmdCopyBuffer()
createShaderModule(wgsl)Tint→MSL · [device newLibraryWithSource:msl options:...]Tint→HLSL→DXC→DXIL · D3DCompile()Tint→SPIR-V · vkCreateShaderModule({spv_bytes, len})
createComputePipeline({...})[device newComputePipelineStateWithFunction:...]device->CreateComputePipelineState(&psoDesc)vkCreateComputePipelines(layout, shader_stage)
pass.dispatchWorkgroups(16384)[enc dispatchThreadgroups:MTLSizeMake(16384,1,1) threadsPerThreadgroup:MTLSizeMake(64,1,1)]cl->Dispatch(16384, 1, 1)(每 group 64 thread 已在 HLSL [numthreads(64,1,1)] 里写死)vkCmdDispatch(cb, 16384, 1, 1)(每 group 64 thread 在 SPIR-V LocalSize 64 1 1
queue.submit([...])[cb commit] · 等待 completionHandlerqueue->ExecuteCommandLists(1, &cl) · 等 fencevkQueueSubmit(queue, 1, &si, fence) · vkWaitForFences()
read.mapAsync(READ)shared buffer + [buf contents] 直读readback heap + resource->Map(0, &range, &ptr)vkMapMemory(device, mem, 0, size, 0, &ptr)

总结:主线的 7 个 JS 调用 → 三家 ~12-15 个原生 API 调用 + 一次驱动 shader 编译。八重翻译 到此结束第七层。最后一层(GPU 命令处理器实际 dispatch),Ch18 展开。

This unpacks translation step ⑥. Once Dawn / wgpu has the compiled shader, it must turn the recorded WebGPU commands into native API calls. This chapter shows the complete call chain for main line ⑥ on all three native APIs.

Main line submit → Metal (macOS · iOS)

Dawn · src/dawn/native/metal/CommandBufferMTL.mm

    // WebGPU dispatchWorkgroups(16384) on Metal:
    id<MTLCommandQueue> queue = ...;
    id<MTLCommandBuffer> cb = [queue commandBuffer];
    id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
    [enc setComputePipelineState:pipelineState];
    [enc setBuffer:dataBuffer offset:0 atIndex:0];

    // one dispatch · workgroup count × per-workgroup threads
    [enc dispatchThreadgroups:MTLSizeMake(16384, 1, 1)
        threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
    [enc endEncoding];
    [cb commit];

Metal's strength: type-aligned. The dispatch takes both the threadgroup count and the per-threadgroup thread count (matching WGSL's workgroup count + workgroup_size). Apple GPUs use this model natively, so nothing is lost in translation. On Apple Silicon, Dawn measures the dispatch command itself at ~10 µs (command recording); GPU execution is separate.
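
The two numbers in that dispatch are related by simple arithmetic. The sketch below assumes the main line's 4 MB buffer holds one f32 per invocation, which the shader's bounds check suggests but the text does not state outright; `workgroupCount` is a hypothetical helper.

```javascript
// Number of workgroups needed to cover `elementCount` items,
// rounding up so a partial last group is still dispatched.
function workgroupCount(elementCount, workgroupSize) {
  return Math.ceil(elementCount / workgroupSize);
}

const bufferBytes = 4 * 1024 * 1024;       // 4 MB storage buffer
const elements = bufferBytes / 4;          // 1,048,576 f32 values
const groups = workgroupCount(elements, 64);
console.log(groups);                       // 16384
```

When the element count is not a multiple of 64, the round-up dispatches a few extra invocations, and the `i < arrayLength(&data)` guard in the shader is what keeps them from touching memory past the end.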

Main line submit → D3D12 (Windows)

Dawn · src/dawn/native/d3d12/CommandBufferD3D12.cpp

    // Same dispatchWorkgroups(16384) on D3D12:
    ID3D12CommandAllocator* allocator = ...;
    ID3D12GraphicsCommandList* cl = ...;
    cl->Reset(allocator, nullptr);
    cl->SetPipelineState(pipelineState);
    cl->SetComputeRootSignature(rootSignature);

    // Descriptor heap switch (if needed)
    cl->SetDescriptorHeaps(2, heaps);
    cl->SetComputeRootDescriptorTable(0, dataDescriptor);

    // Explicit resource barrier · D3D12 needs more than Metal
    D3D12_RESOURCE_BARRIER barrier = ...;
    cl->ResourceBarrier(1, &barrier);

    cl->Dispatch(16384, 1, 1);
    cl->Close();

    // ExecuteCommandLists takes an array of base-class ID3D12CommandList pointers
    ID3D12CommandQueue* q = ...;
    ID3D12CommandList* lists[] = { cl };
    q->ExecuteCommandLists(1, lists);

D3D12 has two extra concepts: ① root signature (every pipeline must explicitly declare its full binding shape) and ② resource barriers (every resource state change must be declared explicitly). These account for much of the extra code in Dawn's D3D12 backend, which is ~1.4× the size of its Metal backend (~50k vs ~35k lines in the quick reference below).

Main line submit → Vulkan (Linux · Android)

Dawn · src/dawn/native/vulkan/CommandBufferVk.cpp

    // Same dispatch on Vulkan:
    VkCommandBuffer cb = ...;
    vkBeginCommandBuffer(cb, &beginInfo);
    vkCmdBindPipeline(cb, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    vkCmdBindDescriptorSets(cb, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);

    // Pipeline barrier: even more explicit than D3D12
    vkCmdPipelineBarrier(cb, srcStage, dstStage, 0,
                         0, nullptr, 1, &bufferBarrier, 0, nullptr);

    vkCmdDispatch(cb, 16384, 1, 1);
    vkEndCommandBuffer(cb);

    VkSubmitInfo si = ...;
    si.commandBufferCount = 1;
    si.pCommandBuffers = &cb;
    vkQueueSubmit(queue, 1, &si, fence);

Three-API quick reference

Aspect | Metal | D3D12 | Vulkan
Command buffer type | MTLCommandBuffer | ID3D12GraphicsCommandList | VkCommandBuffer
Pipeline state binding | argument buffer (auto) | root signature (explicit) | descriptor set + pipeline layout (explicit)
Resource barriers | mostly automatic | explicit ResourceBarrier | explicit vkCmdPipelineBarrier (strictest)
Threadgroup vs threads | both explicit | numthreads in shader; Dispatch counts groups | same as D3D12
Sync primitives | MTLFence (event-based) | ID3D12Fence (value-based) | VkFence / VkSemaphore (binary)
Dawn backend size | ~35k lines Obj-C++ | ~50k lines C++ | ~55k lines C++

Why the Vulkan backend is biggest

Vulkan's explicit synchronisation is the strictest of the three — every buffer/texture must declare current and target state before a dispatch, every image layout transition needs a barrier. Dawn's Vulkan backend has an entire "subresource state tracker" to compute all barriers at encoder-record time. That subsystem alone is ~30% of the backend.

The main line, end-to-end on each backend

Threading the eight main-line JS calls through each native API:

Main-line JS | Metal (macOS) | D3D12 (Windows) | Vulkan (Linux)
requestDevice() | MTLCreateSystemDefaultDevice() + [device newCommandQueue] | D3D12CreateDevice() + CreateCommandQueue() | vkCreateDevice() + vkGetDeviceQueue()
createBuffer(4 MB, STORAGE) | [device newBufferWithLength:4194304 options:MTLResourceStorageModePrivate] | CreateCommittedResource(D3D12_HEAP_TYPE_DEFAULT, 4 MB) | vkCreateBuffer({4 MB, STORAGE_BUFFER_BIT}) + vkAllocateMemory()
queue.writeBuffer(buf, 0, input) | transient shared buffer + blitCommandEncoder copyFromBuffer:... | upload heap + cl->CopyBufferRegion(...) | staging buffer + vkCmdCopyBuffer()
createShaderModule(wgsl) | Tint→MSL · [device newLibraryWithSource:msl options:...] | Tint→HLSL, then DXC (IDxcCompiler3::Compile) → DXIL, or FXC (D3DCompile()) → DXBC | Tint→SPIR-V · vkCreateShaderModule({spv_bytes, len})
createComputePipeline({...}) | [device newComputePipelineStateWithFunction:...] | device->CreateComputePipelineState(&psoDesc) | vkCreateComputePipelines(layout, shader_stage)
pass.dispatchWorkgroups(16384) | [enc dispatchThreadgroups:MTLSizeMake(16384,1,1) threadsPerThreadgroup:MTLSizeMake(64,1,1)] | cl->Dispatch(16384, 1, 1) (per-group 64 threads baked into HLSL [numthreads(64,1,1)]) | vkCmdDispatch(cb, 16384, 1, 1) (per-group 64 threads in SPIR-V LocalSize 64 1 1)
queue.submit([...]) | [cb commit] · awaits completionHandler | queue->ExecuteCommandLists(1, &cl) · fence wait | vkQueueSubmit(queue, 1, &si, fence) · vkWaitForFences()
read.mapAsync(READ) | shared buffer + [buf contents] direct read | readback heap + resource->Map(0, &range, &ptr) | vkMapMemory(device, mem, 0, size, 0, &ptr)

In summary: the main line's eight JS calls fan out to ~12–15 native API calls per platform, plus one driver shader compilation. Seven of eight translations done. The eighth (the GPU command processor actually dispatching SIMT waves) is unpacked in Ch18.

CHAPTER 18

Workgroup · Subgroup: GPU's two levels of parallelism

From one dispatch to 16,384 workgroups to about a million threads

Main line ⑧ finally hits the GPU. One vkCmdDispatch(16384, 1, 1) wakes the GPU's command processor (CP). Here's what happens:

  1. CP distributes 16,384 workgroups across the GPU's SMs/CUs/clusters (Apple M2 Pro has 16 clusters).
  2. Each cluster receives a batch and works through it in waves: a typical Apple cluster keeps 4–8 workgroups resident at once and handles ~1,000 (16,384 / 16 = 1,024) over the whole dispatch.
  3. Each workgroup's 64 threads are further split into subgroups — Apple is 32 threads/simdgroup, NVIDIA is 32/warp, AMD is 32 or 64/wave, Intel is 8–32/EU thread.
  4. Each subgroup runs the same instruction in lockstep (SIMT — Single Instruction Multiple Threads).
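
The four steps above amount to a few integer divisions. The following is a minimal sketch, assuming the figures quoted in this chapter (16 clusters, 64-thread workgroups, 32-wide subgroups); the function name `dispatchShape` is hypothetical and not part of any WebGPU API.

```javascript
// Sketch: how one dispatch decomposes into threads, subgroups and per-cluster work.
// All parameters are assumptions taken from the chapter's Apple M2 Pro example.
function dispatchShape(workgroupCount, workgroupSize, subgroupSize, clusters) {
  const threads = workgroupCount * workgroupSize;
  // A workgroup is split into ceil(size / subgroupSize) subgroups.
  const subgroupsPerWorkgroup = Math.ceil(workgroupSize / subgroupSize);
  // Assuming an even spread, each cluster handles this many workgroups in total.
  const workgroupsPerCluster = Math.ceil(workgroupCount / clusters);
  return { threads, subgroupsPerWorkgroup, workgroupsPerCluster };
}

const shape = dispatchShape(16384, 64, 32, 16);
console.log(shape);
```

Changing `subgroupSize` to 64 (AMD wave64) leaves the thread count unchanged but halves the number of subgroups per workgroup.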

Workgroup — the smallest programmer-visible parallel unit

@workgroup_size(64) declares 64 threads per workgroup. This is the programmer-controlled grouping. Threads within a workgroup:

  • Run on the same SM/cluster (guaranteed)
  • Share var<workgroup> memory (typically 16–32 KB)
  • Can synchronise with workgroupBarrier()
  • Cannot share workgroup memory or barrier-synchronise with other workgroups

Our main line uses @workgroup_size(64) but no workgroup memory — pure "one thread per element" arithmetic. Ch19 will show how using workgroup memory delivers a 5× speedup on matmul.

Subgroup — the hardware SIMT unit

Subgroup (also wave / warp / simdgroup) is the hardware-level parallel unit — a set of threads running in complete lockstep. It's the physical basis of GPU throughput.

Vendor | Term | Typical size | API builtin
NVIDIA | warp | 32 | warpSize (CUDA built-in variable)
AMD GCN/RDNA1 | wave | 64 | WaveGetLaneCount
AMD RDNA2+ | wave | compute defaults to 32 · graphics can be dynamic 32/64 | WaveGetLaneCount
Apple GPU | simdgroup | 32 | [[threads_per_simdgroup]] (MSL attribute)
Intel Gen11+ | EU thread | 8/16/32 SIMD-wide | SIMD width
ARM Mali | quad / warp | 4 / 8 / 16 | varies by arch

WebGPU's design choice: subgroup size is not a compile-time constant that apps can assume. You either don't use subgroup ops, or you enable subgroups and read the size at runtime via the subgroup_size builtin. Native shaders typically hard-code the size for one vendor; WebGPU has to be portable.

Subgroup ops — one instruction for N-thread cooperation

WGSL · subgroup-based reduction

    enable subgroups;

    @group(0) @binding(0) var<storage, read> input: array<f32>;
    @group(0) @binding(1) var<storage, read_write> sums: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>,
            @builtin(subgroup_invocation_id) sid: u32,
            @builtin(subgroup_size) sg_size: u32) {
      let v = input[gid.x];
      // One call sums every lane of the subgroup (32 lanes on Apple and NVIDIA)
      let sg_sum = subgroupAdd(v);
      // Only subgroup lane 0 writes back. Index by the runtime subgroup size
      // instead of a hard-coded 32 (assumes lanes are assigned in linear order).
      if (sid == 0u) {
        sums[gid.x / sg_size] = sg_sum;
      }
    }

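Without subgroups, a 32-element sum needs a 32 → 16 → 8 → 4 → 2 → 1 tree through workgroup memory with 5 workgroupBarrier() calls. With subgroupAdd it is a single call that lowers to register-to-register cross-lane operations (NVIDIA's shfl.bfly, AMD's ds_swizzle, Apple's simd_sum), with no round trip through workgroup memory. Typically 5–10× faster.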
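
To make the cross-lane idea concrete, here is a CPU-side simulation of a butterfly (XOR-shuffle) reduction over one 32-lane subgroup. It is only an illustration of the data movement, not how any driver actually implements subgroupAdd; `butterflySum` is a hypothetical helper.

```javascript
// Sketch: butterfly reduction. At each step every lane adds the value held by
// the lane whose index differs in one bit (lane XOR offset). After log2(n)
// steps all lanes hold the full sum, with no shared memory and no barriers.
function butterflySum(lanes) {
  const n = lanes.length; // assumed to be a power of two, e.g. 32
  let cur = Float64Array.from(lanes);
  let steps = 0;
  for (let offset = n >> 1; offset >= 1; offset >>= 1) {
    const next = new Float64Array(n);
    for (let lane = 0; lane < n; lane++) {
      next[lane] = cur[lane] + cur[lane ^ offset];
    }
    cur = next;
    steps++;
  }
  return { total: cur[0], steps, allLanesAgree: cur.every((x) => x === cur[0]) };
}

const laneValues = Array.from({ length: 32 }, (_, i) => i + 1); // 1..32
const reduced = butterflySum(laneValues);
console.log(reduced);
```

The step count matches the five levels of the 32 → 1 tree; the difference on real hardware is that each step is a register exchange rather than a write, a barrier and a read.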

How main line ⑧ actually runs on the GPU

Main line execution on Apple M2 Pro:

  • 16,384 workgroups × 64 threads = 1,048,576 threads (exactly 1M)
  • 16 clusters × 32 ALUs = 512 ALUs; one cluster runs 32 threads per cycle
  • Each workgroup = 2 simdgroups (64/32 = 2)
  • Apple M2 scheduler keeps 4 workgroups resident per cluster = 8 simdgroups = 256 threads in flight (this hides latency; a cluster still issues only 32 threads per cycle)
  • Each thread: 1 load + 1 multiply + 1 store ≈ 3 cycles
  • So 1,048,576 threads / (16 clusters × 32 threads per cycle) × 3 cycles = 6,144 cycles ≈ 4.4 µs of pure compute at 1.4 GHz
  • Measured: ~200 µs. The rest is memory-bandwidth waits: the main line is memory-bound.

Why the main line is memory-bound

Each thread does 1 load (4 bytes) + 1 store (4 bytes) = 8 bytes × 1M threads = 8 MB of memory traffic. M2 Pro's memory bandwidth is ~200 GB/s = 0.04 ms at peak. But cache misses, scheduling overhead, and launch latency push real measurement to ~200 µs ≈ 40 GB/s effective — 20% of peak bandwidth. Ch19 shows how access pattern dominates performance.
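
The compute-versus-bandwidth comparison can be written down as a small model. This is a back-of-envelope sketch, assuming the chapter's figures (16 clusters × 32 lanes issuing per cycle, 1.4 GHz, ~200 GB/s, 3 cycles and 8 bytes per thread); `estimateDispatch` is a hypothetical name.

```javascript
// Sketch: lower bounds on dispatch time from ALU throughput and from memory traffic.
function estimateDispatch(p) {
  const computeSec = (p.threads / p.lanesPerCycle) * p.cyclesPerThread / p.clockHz;
  const memorySec = (p.threads * p.bytesPerThread) / p.bandwidthBytesPerSec;
  return {
    computeUs: computeSec * 1e6,
    memoryUs: memorySec * 1e6,
    bound: memorySec > computeSec ? 'memory' : 'compute',
  };
}

const est = estimateDispatch({
  threads: 16384 * 64,
  bytesPerThread: 8,         // one 4-byte load + one 4-byte store
  cyclesPerThread: 3,
  lanesPerCycle: 16 * 32,    // clusters × ALUs per cluster
  clockHz: 1.4e9,
  bandwidthBytesPerSec: 200e9,
});
console.log(est);
```

Even at peak bandwidth the memory term is roughly ten times the compute term, which is why tuning the arithmetic in this kernel would change little.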

CHAPTER 19

Matrix multiply: naive → tiled → wmma

Same 1024×1024 matmul, three implementations, roughly 50× spread

The main line is memory-bound pure arithmetic, but most real GPU work is matrix multiplication — LLMs, computer vision, PCA, all matmul. This chapter walks 1024×1024 matmul from naive to tiled to wmma, showing a 50× spread.

Baseline · naive WGSL matmul

WGSL · naive · one thread per C[row, col] element

    @group(0) @binding(0) var<storage, read> A: array<f32>;
    @group(0) @binding(1) var<storage, read> B: array<f32>;
    @group(0) @binding(2) var<storage, read_write> C: array<f32>;

    const N: u32 = 1024u;

    @compute @workgroup_size(16, 16)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      let row = gid.y;
      let col = gid.x;
      if (row >= N || col >= N) { return; }
      var acc: f32 = 0.0;
      for (var k: u32 = 0u; k < N; k = k + 1u) {
        acc = acc + A[row * N + k] * B[k * N + col];
      }
      C[row * N + col] = acc;
    }

Analysis: each thread computes one C[row, col], reading N=1024 A's and 1024 B's, doing 1024 FMAs. Reads 8 KB of data for 2 KFLOP — arithmetic intensity (FLOP/byte) = 0.25. M2 Pro's compute/bandwidth ratio is ~7, meaning bandwidth can't keep up; the workload is bandwidth-bound.

Measured 1024×1024 matmul: ~95 ms. The multiply is 2 × 1024³ ≈ 2.1 GFLOP of work, so 95 ms is roughly 23 GFLOP/s, a low single-digit percentage of the 1.4 TFLOPS theoretical peak. The GPU spends most of its time waiting on memory.
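
The analysis above is a roofline calculation and can be checked numerically. A minimal sketch, assuming the chapter's ~200 GB/s bandwidth figure and ignoring caches entirely; `naiveMatmulRoofline` is a hypothetical helper.

```javascript
// Sketch: arithmetic intensity and achieved throughput of the naive kernel.
function naiveMatmulRoofline(n, seconds, bandwidthGBps) {
  const flopPerThread = 2 * n;       // n multiplies + n adds
  const bytesPerThread = 2 * n * 4;  // n f32 values from A and n from B
  const intensity = flopPerThread / bytesPerThread; // FLOP per byte
  const totalGflop = (2 * n ** 3) / 1e9;
  return {
    intensity,
    totalGflop,
    achievedGflops: totalGflop / seconds,
    // With no reuse, bandwidth caps throughput at intensity × bandwidth.
    bandwidthCeilingGflops: intensity * bandwidthGBps,
  };
}

const roof = naiveMatmulRoofline(1024, 0.095, 200);
console.log(roof);
```

Under these assumptions the bandwidth ceiling sits far below the ALU peak, so raising arithmetic intensity (more FLOP per byte fetched) is the only lever that matters for this kernel.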

Optimisation 1 · tiled with workgroup memory

WGSL · tiled (16×16 tile) · cooperatively load A/B tiles into workgroup memory

    // A, B, C bindings and N as in the naive version
    var<workgroup> tileA: array<array<f32, 16>, 16>;
    var<workgroup> tileB: array<array<f32, 16>, 16>;

    @compute @workgroup_size(16, 16)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>,
            @builtin(local_invocation_id) lid: vec3<u32>) {
      let row = gid.y;
      let col = gid.x;
      let ly = lid.y;
      let lx = lid.x;
      var acc: f32 = 0.0;
      for (var kBase: u32 = 0u; kBase < N; kBase = kBase + 16u) {
        // 16×16 threads cooperate: load one 16×16 A tile and B tile
        tileA[ly][lx] = A[row * N + (kBase + lx)];
        tileB[ly][lx] = B[(kBase + ly) * N + col];
        workgroupBarrier();  // wait for everyone to finish loading
        for (var k: u32 = 0u; k < 16u; k = k + 1u) {
          acc = acc + tileA[ly][k] * tileB[k][lx];
        }
        workgroupBarrier();  // wait before reloading tiles
      }
      C[row * N + col] = acc;
    }

The key: 256 threads (16×16) cooperatively load 256 A's and 256 B's from storage into workgroup memory (2 KB total), and each loaded element is then read by 16 different threads. Load once, use 16 times = 16× bandwidth savings. Arithmetic intensity climbs from 0.25 to ~4 FLOP/byte.

Measured: ~22 ms, 4.3× faster than naive.
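
For readers who want to convince themselves that tiling only reorders the work, here is a CPU reference in plain JavaScript that mirrors the kernel's loop structure. It is a sketch for checking correctness, not a performance comparison; `matmulNaive` and `matmulTiled` are hypothetical names, and the small Float32Array tiles merely stand in for workgroup memory.

```javascript
// Sketch: naive vs tiled matmul on the CPU, same loop structure as the WGSL above.
function matmulNaive(A, B, n) {
  const C = new Float32Array(n * n);
  for (let row = 0; row < n; row++)
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let k = 0; k < n; k++) acc += A[row * n + k] * B[k * n + col];
      C[row * n + col] = acc;
    }
  return C;
}

function matmulTiled(A, B, n, tile) {
  if (n % tile !== 0) throw new Error('n must be a multiple of tile');
  const C = new Float32Array(n * n);
  const tA = new Float32Array(tile * tile);
  const tB = new Float32Array(tile * tile);
  for (let r0 = 0; r0 < n; r0 += tile)
    for (let c0 = 0; c0 < n; c0 += tile)
      for (let k0 = 0; k0 < n; k0 += tile) {
        // "Cooperative load": copy one tile of A and one tile of B.
        for (let ly = 0; ly < tile; ly++)
          for (let lx = 0; lx < tile; lx++) {
            tA[ly * tile + lx] = A[(r0 + ly) * n + (k0 + lx)];
            tB[ly * tile + lx] = B[(k0 + ly) * n + (c0 + lx)];
          }
        // Each output element reuses the loaded tiles `tile` times.
        for (let ly = 0; ly < tile; ly++)
          for (let lx = 0; lx < tile; lx++) {
            let acc = C[(r0 + ly) * n + (c0 + lx)];
            for (let k = 0; k < tile; k++) acc += tA[ly * tile + k] * tB[k * tile + lx];
            C[(r0 + ly) * n + (c0 + lx)] = acc;
          }
      }
  return C;
}

// Small integer inputs keep every partial sum exactly representable in f32.
const n = 32;
const A = Float32Array.from({ length: n * n }, (_, i) => i % 7);
const B = Float32Array.from({ length: n * n }, (_, i) => i % 5);
const sameResult = matmulNaive(A, B, n).every((v, i) => v === matmulTiled(A, B, n, 16)[i]);
console.log(sameResult);
```

With general floating-point inputs the two versions can differ in the last bits because the summation order changes; that is expected and also true on the GPU.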

Optimisation 2 · each thread handles a 4×4 tile

Go further: each thread computes a 4×4 block of C instead of one element, so a 16×16 workgroup now covers a 64×64 tile of C. Register reuse cuts storage traffic another 4×, lifting arithmetic intensity from ~4 to ~16 FLOP/byte (for a T×T output tile the ratio is T/4). Code omitted (~60 lines of WGSL).

Measured: ~6 ms, 16× faster than naive.

Optimisation 3 · subgroup matrix (hardware matrix instructions)

NVIDIA Volta+ / Apple M3+ / Intel Arc GPUs ship dedicated matrix instructions. NVIDIA calls them wmma (warp matrix multiply-accumulate), Apple calls them simdgroup matrix, Intel calls them XMX. One 16×16 × 16×16 multiply-accumulate is 4,096 FMAs, i.e. 8,192 FLOP.

In WebGPU this is currently the chromium-experimental-subgroup-matrix extension (W3C is standardising it; expected in 1.1 in 2026). Once standardised, you'd write:

WGSL · subgroup matrix · experimental extension

    enable subgroup_matrix;

    let a = subgroupMatrixLoad<subgroup_matrix_left<f16, 16, 16>>(&A, offsetA, stride);
    let b = subgroupMatrixLoad<subgroup_matrix_right<f16, 16, 16>>(&B, offsetB, stride);
    var c = subgroupMatrixZero<subgroup_matrix_result<f32, 16, 16>>();
    c = subgroupMatrixMultiplyAccumulate(a, b, c);  // one instruction: 16×16 × 16×16 = 4096 FMA
    subgroupMatrixStore(&C, offsetC, c, stride);

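Measured (Tint experimental, on M3 Pro): ~2 ms. Against the ~75 ms naive run on the same machine that is about 37× (47× against the M2 Pro naive baseline of ~95 ms), and 2.5–3× faster than tiled-with-register-blocking. This is the main catalyst for browser-side ML perf over the next 1–2 years.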

Performance summary

Implementation | 1024² matmul (M2 Pro · Chrome) | 1024² matmul (M3 Pro · Chrome) | 4096² matmul (RTX 4090 · Chrome) | Utilisation
naive | ~95 ms | ~75 ms | ~280 ms | ~5%
tiled (workgroup mem) | ~22 ms | ~17 ms | ~70 ms | ~22%
+ register blocking 4×4 | ~6 ms | ~5 ms | ~18 ms | ~82%
+ subgroup matrix (experimental) | n/a (Apple matrix is M3+ only) | ~2 ms | ~3.5 ms | ~95%
reference: Metal Performance Shaders | ~1.5 ms | ~1.2 ms | n/a | ~100%
reference: CUDA cuBLAS | n/a | n/a | ~3 ms | ~95%

WebGPU's ceiling is 80–95% of native BLAS. The gap comes from ① validation overhead, ② missing the deepest HW features (e.g. NVIDIA's cp.async pipeline loads), ③ compiler maturity (DXC and Apple's Metal compiler have years more optimisation work than Tint/Naga).

CHAPTER 20

transformers.js: running LLMs in the browser

HuggingFace model → 30 tokens/s, powered by the WebGPU stack

From 2024, browser-side LLM inference became practical — Llama-3.2-1B runs at 30 tokens/s on M2 Macs, Qwen2.5-0.5B at 60 tokens/s. This chapter dissects the engineering stack — starring transformers.js and ONNX Runtime Web.

Six layers from model to output

  1. Model export: PyTorch / Safetensors → ONNX format (HF pre-converts). Llama-3.2-1B is ~600–700 MB once quantised to 4 bits; its roughly 1.2 billion parameters would take about 2.5 GB in f16.
  2. JS load: transformers.js uses ONNX Runtime Web (wasm) to parse the .onnx file and build a computation graph.
  3. WebGPU backend: ORT Web's webgpu execution provider maps every ONNX op (Conv, MatMul, Softmax, LayerNorm) to a pre-written WGSL kernel.
  4. Pipeline compilation: every kernel gets its compute pipeline created at load time. Llama-1B needs ~200+ pipelines, a one-off cost of ~3 seconds.
  5. Weight upload: ~600 MB of quantised weights are uploaded to the GPU via writeBuffer. ~150 ms on M2 Pro.
  6. Inference loop: each token requires one full forward pass (200+ dispatches). On M2 a forward is ~33 ms = ~30 tokens/s.

A WebGPU LLM forward pass

transformers.js + ONNX Runtime Web · forward per token (pseudocode)

    // 1. embedding lookup (one dispatch)
    embed_pipeline.setBindGroup(input_ids, embedding_table);
    pass.dispatchWorkgroups(1);              // 1×4096 token embedding
    // 2. 16 transformer layers · ~8 dispatches each
    for (let layer = 0; layer < 16; layer++) {
      layernorm_pipeline.dispatch(...);
      qkv_matmul.dispatch(...);              // 3 matmuls
      attention_kernel.dispatch(...);        // flash-attention style
      out_matmul.dispatch(...);
      layernorm_pipeline.dispatch(...);
      ffn_up_matmul.dispatch(...);
      gelu_pipeline.dispatch(...);
      ffn_down_matmul.dispatch(...);
    }
    // 3. final layernorm + lm_head matmul → logits
    final_ln.dispatch(...);
    lm_head_matmul.dispatch(...);
    // 4. argmax / top-k on GPU (or read back and pick on CPU)
    queue.submit([encoder.finish()]);          // submit() returns nothing to await
    await readback.mapAsync(GPUMapMode.READ);  // resolves once the GPU work is done
    const next_token = new Uint32Array(readback.getMappedRange())[0];

Each forward pass is ~200 dispatches. At ~150 µs each (IPC + actual GPU) you get ~30 ms total. Crucial optimisation: record all dispatches into one encoder.finish() + queue.submit(). Submitting per dispatch is legal but the IPC overhead would crush perf.
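
The batching rule can be sketched with the standard WebGPU encoder calls. The `steps` array (objects with `pipeline`, `bindGroup`, `workgroups`) and the counting mock device are assumptions made so the sketch runs outside a browser; with a real GPUDevice only `encodeForward` would be needed.

```javascript
// Sketch: record every dispatch of a forward pass into ONE command buffer.
function encodeForward(device, steps) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  for (const step of steps) {
    pass.setPipeline(step.pipeline);
    pass.setBindGroup(0, step.bindGroup);
    pass.dispatchWorkgroups(step.workgroups);
  }
  pass.end();
  // A single submit: the whole pass crosses the process boundary as one batch.
  device.queue.submit([encoder.finish()]);
}

// Hypothetical stand-in for GPUDevice that only counts calls.
function makeCountingDevice() {
  const counts = { dispatches: 0, submits: 0 };
  const pass = {
    setPipeline() {}, setBindGroup() {}, end() {},
    dispatchWorkgroups() { counts.dispatches++; },
  };
  const encoder = { beginComputePass: () => pass, finish: () => ({}) };
  const device = {
    createCommandEncoder: () => encoder,
    queue: { submit() { counts.submits++; } },
  };
  return { device, counts };
}

const mock = makeCountingDevice();
const forwardSteps = Array.from({ length: 200 }, () => ({ pipeline: {}, bindGroup: {}, workgroups: 1 }));
encodeForward(mock.device, forwardSteps);
console.log(mock.counts);
```

Putting all dispatches in one compute pass is the simplest layout; an implementation may still split work across several passes as long as they end up in a single submit.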

KV cache — linear time generation

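A naive implementation recomputes every prior token's keys/values on every new token, so generating n tokens costs O(n²) K/V projections. The KV cache computes each token's K/V once, stores them, and only projects the new token at each step: O(n) projections in total (each step's attention still reads the whole cache).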

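In WebGPU the KV cache is two long storage buffers, appended per token. Per layer it holds seq_len × KV width × 2 (K and V) × 2 bytes (f16). The quoted ~16 MB per layer corresponds to a 2,048-token context with a 2,048-wide K/V projection (2048 × 2048 × 2 × 2 bytes = 16 MiB); 16 layers make ~256 MB. On top of the weights, longer contexts or larger models quickly push total GPU memory into the GB range. This is the current bottleneck in browser LLMs: RAM is scarcer than FLOPS.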
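
The cache size follows from a one-line formula. A minimal sketch, where `kvWidth` means the combined width of all K (or V) heads for one token; the grouped-query configuration shown (8 KV heads × 64 dims = 512) is an illustrative assumption, not a figure from this chapter.

```javascript
// Sketch: KV cache bytes = seqLen × kvWidth × 2 (K and V) × bytesPerElement × layers.
function kvCacheBytes({ layers, kvWidth, seqLen, bytesPerElement }) {
  const perLayer = seqLen * kvWidth * 2 * bytesPerElement;
  return { perLayer, total: perLayer * layers };
}

const MiB = 1024 * 1024;
// Full-width K/V (every attention head keeps its own K and V), f16.
const fullCache = kvCacheBytes({ layers: 16, kvWidth: 2048, seqLen: 2048, bytesPerElement: 2 });
// Assumed grouped-query layout: fewer KV heads shrink the cache proportionally.
const gqaCache = kvCacheBytes({ layers: 16, kvWidth: 512, seqLen: 2048, bytesPerElement: 2 });
console.log(fullCache.total / MiB, gqaCache.total / MiB);
```

The cache grows linearly with context length, so doubling `seqLen` doubles both numbers regardless of the head layout.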

Measured comparison

Model | Size | M2 Pro · Chrome | i9-13900K + RTX 4090 · Chrome | Native CPU (same)
Qwen2.5-0.5B (f16) | ~500 MB | ~60 tok/s | ~180 tok/s | ~5 tok/s (Apple Neural Engine)
Llama-3.2-1B (q4) | ~700 MB | ~30 tok/s | ~90 tok/s | ~3 tok/s
Llama-3.2-3B (q4) | ~2 GB | ~12 tok/s | ~40 tok/s | ~1 tok/s
SDXL-Turbo | ~1.5 GB | 4–8 s / image | ~0.5 s / image | ~30 s / image

Sources: HuggingFace transformers.js README perf table · WebLLM benchmarks · Xenova/transformers.js benchmark issues · single-machine measurements only.

Why this suddenly works

BERT on WebGL in 2017 was a toy (no compute, 100× slower than native). WebGPU GA in 2023, subgroup ops in 2024, open Llama models and mature quantisation (int4/int8) flipped browser LLMs from impossible to uncomfortable-but-real. Once subgroup_matrix is standardised and KV cache compression spreads, perhaps 1–2 years out, 7B models on consumer GPUs could become the browser default.

CHAPTER 21

FP16 · atomics · timestamps: three optional features

Features you must opt in to before using

WebGPU's core API is the minimum subset that every supported device can run. Anything beyond it must be requested explicitly; this chapter walks through three commonly used areas.

shader-f16

16-bit floats. f16 inference is standard for LLM/CV — precision is enough, memory halved, compute can be 2× faster (if the GPU has a native f16 path). Enabling:

JS · enabling shader-f16

    const adapter = await navigator.gpu.requestAdapter();
    const hasF16 = adapter.features.has('shader-f16');
    if (!hasF16) {
      console.warn('no f16 support on this device');
    }
    // Only request the feature when it is present: requestDevice() rejects otherwise
    const device = await adapter.requestDevice({
      requiredFeatures: hasF16 ? ['shader-f16'] : [],
    });

In WGSL:

enable f16;
@group(0) @binding(0) var<storage, read_write> data: array<f16>;

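Note: an f16 is 2 bytes with 2-byte alignment, so array<f16> packs two values per 32 bits in storage. vec2<f16> is 4 bytes and vec4<f16> is 8 bytes (8-byte aligned), not 16. Tint/Naga handles the packing on the shader side.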
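
On the JavaScript side the data still has to arrive as 16-bit values. The following is a sketch of one way to pack numbers for an array<f16> buffer without relying on a built-in Float16Array; `f32ToF16Bits` and `packF16` are hypothetical helpers, and the rounding is a simplified round-half-up rather than full round-to-nearest-even.

```javascript
// Sketch: convert a JS number to IEEE 754 binary16 bits.
function f32ToF16Bits(value) {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  const mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15;                                     // re-bias the exponent
  if (e >= 0x1f) return sign | 0x7c00;                          // too large: Inf
  if (e <= 0) {
    if (e < -10) return sign;                                   // too small: signed zero
    const m = (mant | 0x800000) >>> (1 - e);                    // subnormal half
    return sign | ((m + 0x1000) >>> 13);
  }
  // Normal number: keep the top 10 mantissa bits, round on the first dropped bit.
  return (sign | (e << 10) | (mant >>> 13)) + ((mant & 0x1000) ? 1 : 0);
}

// Pack values two-per-32-bits; pad to an even count so the byte length is a
// multiple of 4, which queue.writeBuffer requires.
function packF16(values) {
  const out = new Uint16Array(values.length + (values.length % 2));
  values.forEach((v, i) => { out[i] = f32ToF16Bits(v); });
  return out;
}

const packed = packF16([1.0, -2.0, 0.5]);
console.log(Array.from(packed, (h) => h.toString(16)), packed.byteLength);
```

The resulting Uint16Array can be handed directly to queue.writeBuffer for a buffer bound as array<f16>.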

② Atomics

Core WGSL supports atomic<u32> and atomic<i32>: no feature enabling needed. Some extensions:

  • atomic<f32>: feature chromium-experimental-storage-f32-atomic. NVIDIA + AMD hardware-supported, Apple emulates in software. Used by LLM training; inference rarely needs it.
  • atomic<i64>/u64: not standardised yet. Vulkan 1.2 has shaderBufferInt64Atomics, but Metal only added it in 2024.
  • Workgroup-shared atomics: var<workgroup> atomic<u32> is core. Used commonly for in-workgroup reductions.

timestamp-query

Measure time precisely on the GPU — not performance.now() on the JS side (which gets polluted by IPC latency). Enabling:

JS · timestamp query

    const device = await adapter.requestDevice({
      requiredFeatures: ['timestamp-query'],
    });
    const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
    const pass = encoder.beginComputePass({
      timestampWrites: {
        querySet,
        beginningOfPassWriteIndex: 0,
        endOfPassWriteIndex: 1,
      },
    });
    // ... dispatch ...
    pass.end();
    encoder.resolveQuerySet(querySet, 0, 2, resultBuf, 0);
    // readback → u64[2]; the difference is elapsed GPU time in nanoseconds

Privacy clamp: timestamp resolution is clamped to 100 µs (spec §20.4 Timestamp Query · #timestamp) to prevent timing side-channel attacks. Anything under 100 µs reads as 0 or 100 µs. Native apps get nanosecond GPU clocks; browsers don't.
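
Once the two resolved values are read back, turning them into a duration is plain 64-bit arithmetic. A minimal sketch, assuming the readback is viewed as a BigUint64Array of [begin, end] in nanoseconds; the `quantise` helper only illustrates what a 100 µs clamp does to a reading (rounding down is an assumption here, not a statement about any browser's exact rule).

```javascript
// Sketch: elapsed time from two resolved timestamp queries.
function elapsedNanoseconds(timestamps) {
  return timestamps[1] - timestamps[0]; // BigInt, avoids precision loss on u64
}

// Illustration of coarsening: snap a duration down to a multiple of stepNs.
function quantise(ns, stepNs = 100000n) {
  return (ns / stepNs) * stepNs; // BigInt division truncates
}

const sample = BigUint64Array.of(1000000n, 1250000n);
const dt = elapsedNanoseconds(sample);
const coarse = quantise(dt);
console.log(`${Number(dt) / 1000} µs measured, ${Number(coarse) / 1000} µs after coarsening`);
```

Averaging many repetitions of the same pass is the usual way to recover sub-clamp precision when individual readings are coarsened.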

Why so many features aren't shipped yet

WebGPU's bar for shipping a feature is support in at least two of the three native APIs plus a security model that holds up. Ray tracing (VK_KHR_acceleration_structure, Metal ray tracing, D3D12 DXR) exists on all three, but WebGPU still has no proposal, because BVH construction can leak geometry information to a malicious origin. Mesh shaders and subgroup matrix wait for similar reasons. "Safety is slow" is WebGPU's core cost.

CHAPTER 22

vs WebGL · WebCL: a six-axis matrix

Why WebGL 2 sticks around · why WebCL died

Every few years the Web platform sees a "GPU API proposal", and the outcomes vary. WebGL 1.0 and 2.0 both shipped; WebCL was drafted in 2011, stagnated by 2014 and was withdrawn in 2018; WebGPU ran from 2017 to GA in 2023. Side by side:

Axis | WebGL 2 | WebCL | WebGPU
Status | 2017 GA · still dominant | 2011 draft · withdrawn 2018 | 2023 GA · still standardising
Maps to | OpenGL ES 3.0 | OpenCL 1.2 | Metal · D3D12 · Vulkan
Compute | none | core | core
Shading lang | GLSL ES | OpenCL C | WGSL
Typical use | 2D/3D games · viz | (theoretical) sci-compute | matmul · LLM · advanced rendering
Vendor support | all four | Apple/Google always refused | all four shipped

Why WebCL died

  • Apple didn't want OpenCL. Apple deprecated OpenCL in 2018 in favour of Metal Performance Shaders. No Safari, no Web platform.
  • Google pushed WebGL Compute instead (2018). That also never shipped — OpenGL Compute Shaders never worked well on macOS (Apple's implementation was poor).
  • OpenCL itself withered. OpenCL 2.0 stuttered (NVIDIA refused SVM), 3.0 walked back massively, the community fragmented to SYCL/oneAPI. Even Khronos stopped pushing it.
  • WebGPU absorbed both jobs. WebGPU has compute (what WebCL wanted) and rendering (what WebGL did). One API replaces two.

Why WebGL won't vanish soon

WebGL 2 sticks around for 5–10 more years because:

  • Wider reach. On old hardware (pre-2018 GPUs) and older Linux drivers, WebGL is usually more reliable than WebGPU.
  • Deep ecosystem. Three.js is WebGL-first, Babylon.js still ships a WebGL fallback, and every mapbox-gl deployment in the wild already uses WebGL.
  • Simplicity. Thirty lines of WebGL draws a triangle; WebGPU needs ~80. Teaching contexts favour WebGL.

So Three.js, Babylon, Mapbox all ship dual backends: detect navigator.gpu and use WebGPU, fall back to WebGL. Performance gap is 1.5–3× depending on workload — everyone gets a reasonable experience.
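The detection step can be sketched as follows. This is a minimal illustration, not how any of the named engines actually implement it: `pickBackend`, `NavigatorLike`, and `GpuLike` are hypothetical names, and the structural types stand in for the real `navigator.gpu` so the snippet compiles without browser typings. The one point it makes is that the presence of `navigator.gpu` is not sufficient, because `requestAdapter()` can still resolve to `null`.

```typescript
// Minimal structural types (hypothetical) so the sketch compiles without @webgpu/types.
type AdapterLike = object;
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "webgl2";

// navigator.gpu existing is not enough: requestAdapter() may resolve to null
// (for example a blocklisted driver), so both checks are needed before committing.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "webgl2";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "webgl2";
  } catch {
    // Treat any failure during adapter discovery as "WebGPU unavailable".
    return "webgl2";
  }
}
```

In a page this would be called once at startup as `pickBackend(navigator)`, before any renderer is constructed.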

Six-axis comparison

Rendering

WebGL 2 ★★★★ · WebGPU ★★★★★.
WebGPU is built for multithreaded command recording (the Web API itself is still single-threaded, but the driver layer parallelises) and has an explicit pipeline cache, so the CPU is less of a bottleneck. Babylon measures 1.5–2× FPS on complex scenes.

General-purpose compute

WebGL 2 ☆ · WebGPU ★★★★★.
WebGL has no compute; people abuse framebuffers as "compute", which is slow and ugly. In WebGPU compute is first-class, and transformers.js relies on it for its GPU-accelerated path.

Compatibility

WebGL 2 ★★★★★ · WebGPU ★★★.
Per caniuse · webgpu, WebGPU's global baseline reach is ~85% (Chrome/Edge/Firefox desktop + Safari 26 on macOS); the remaining ~15% (old Android, pre-iOS 18, intranet legacy browsers) still need fallbacks. WebGL 2 global ~96%.

Learning curve

WebGL 2 ★★★ · WebGPU ★★.
WebGPU has many concepts: device, adapter, queue, encoder, pass, pipeline, bindgroup … ~1 week to first useful program.

Debugging

WebGL 2 ★★ · WebGPU ★★★★.
WebGPU has structured errors (validation errors plus error scopes). WebGL only has getError(), which returns a numeric error code with no message. Chrome DevTools' WebGPU panel will soon have capture/replay.
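To make the error-scope idea concrete, here is a small sketch of wrapping a piece of GPU work in a validation scope. `pushErrorScope` and `popErrorScope` are the real `GPUDevice` methods; the helper `withValidationScope` and the `ErrorScopeDevice` interface are assumptions introduced only so the example is self-contained.

```typescript
// Structural stand-ins (hypothetical) for the parts of GPUDevice used here.
interface GpuErrorLike {
  readonly message: string;
}
type ErrorFilter = "validation" | "out-of-memory" | "internal";
interface ErrorScopeDevice {
  pushErrorScope(filter: ErrorFilter): void;
  popErrorScope(): Promise<GpuErrorLike | null>;
}

// Runs `work` inside a validation scope and turns a captured GPU error
// into a rejected promise that names the call site via `label`.
async function withValidationScope<T>(
  device: ErrorScopeDevice,
  label: string,
  work: () => T,
): Promise<T> {
  device.pushErrorScope("validation");
  let result: T;
  try {
    result = work();
  } catch (e) {
    await device.popErrorScope(); // keep the scope stack balanced
    throw e;
  }
  const error = await device.popErrorScope();
  if (error) {
    throw new Error(`[${label}] WebGPU validation error: ${error.message}`);
  }
  return result;
}
```

A call such as `withValidationScope(device, "create vertex buffer", () => device.createBuffer(desc))` then attributes the message to one specific call, which a bare `getError()` code cannot do.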

Future extensions

WebGL 2 ★ · WebGPU ★★★★★.
WebGL is frozen. WebGPU is still adding subgroup matrix · ray tracing · multi-GPU · HDR · WebTransport for textures.

CHAPTER 23

生产故事簿 — Figma · Babylon · Bevy · Unity

Production stories — Figma · Babylon · Bevy · Unity

四个真实工程的 WebGPU 上线史

Four real engineering tales of WebGPU adoption

CASE 1 Figma — 上线 WebGPU 渲染(2024)

背景:Figma 把所有 vector 渲染用 WASM + WebGL 跑。一个百万图层 design file 在 WebGL 上 FPS ~25。2023 年 GA 后 Figma 开始迁。

挑战:① WebGL fragment shader 风格的 stencil 写法在 WebGPU 里要换成 compute;② path rasterisation 用了 GPU-side tessellation,WebGL 用 vertex shader emulate,WebGPU 直接走 compute shader 算 Bezier subdivision;③ device loss 处理(早期 NVIDIA 驱动经常 hang)。

结果:复杂文档 FPS 提到 ~60-90,CPU 占用降 40%。但设备覆盖损失——~10% 用户在 WebGL 上才稳定(主要是老 Windows + 集显),他们继续走 WebGL。

CASE 2 Babylon.js 5+ — 双后端渲染引擎

背景:Babylon.js 是 Microsoft 主导的 WebGL 渲染引擎,2022 年开始加 WebGPU 支持。现在 Babylon 7.0 默认 WebGPU + 自动 fallback。

独到设计:① Snapshot Rendering——把 WebGPU 的 command 录到一个 snapshot,复用每帧,CPU 开销降 80%;② "Engine harmonisation"——同一份 JS API 在 WebGL 后端和 WebGPU 后端都跑得动;③ 主动接入 subgroup ops 给 compute particles。

数据:Babylon 自己 benchmark 显示一个 1 万物体场景 WebGL 25 FPS / WebGPU 60 FPS / WebGPU+Snapshot 90 FPS。官方文档

CASE 3 Bevy — Rust 游戏引擎 + WebAssembly

背景:Bevy 是 Rust 写的游戏引擎,用 wgpu 作 GPU 抽象层。Bevy 编译到 WebAssembly + wgpu 的 WebGPU 后端,能在浏览器里跑 Rust 游戏。

有意思的地方:① wgpu 的同一份 Rust 代码既能编译给 native(直接调 Vulkan/Metal/D3D12)也能编译给 Web(调 navigator.gpu);② Bevy 在 native 上典型 60+ FPS,编 Web 上掉到 30-45 FPS——损失主要来自 IPC(native 没那一跳)和 JS↔WASM 边界;③ 是 wgpu 项目最大的用户,反向催生了 wgpu 很多 perf 优化。

CASE 4 Unity 6 Web — 实验性 WebGPU backend

背景:Unity 历史上靠 WebGL 1/2 backend 在浏览器里跑("Unity WebGL")。2024 年 10 月发布 Unity 6 时加了实验性 WebGPU backend,和原 WebGL backend 并存,开发者在 Player Settings 里勾选启用。注意"Unity Web Player"是 2017 年就退役的老 NPAPI 插件,不要和这个新 backend 混淆。

不寻常的事:Unity 用 IL2CPP 把 C# 编到 C++ 再 emscripten → WASM,跑在 WebGPU 上。新 backend 比 WebGL 启动慢 ~2×(一次性 pipeline 编译耗时),但稳态渲染性能在复杂场景里 1.5-2× 提升。早期 demo 加载时间 ~5s(WebGPU)vs ~3s(WebGL)。Unity 6.1 / 2025 路线图把 WebGPU backend 推向 Production-Ready。

三个反复出现的"教训"

① Pipeline 预热是关键:所有四家都报告启动时把所有 pipeline 编出来很费时(~3-30 秒),用户感觉"卡"。生产策略:先编最常用的 10-20 个 pipeline,剩下的lazy
② Device loss 必须处理:Figma 的统计数字 ~0.3% / 月。不处理 = 用户看到永久黑屏。
③ Fallback 必须 ship:~10-15% 用户暂时不能用 WebGPU。不做 fallback 等于丢这部分用户。

CASE 1 Figma — shipping WebGPU rendering (2024)

Background: Figma renders all vectors via WASM + WebGL. A million-layer design file on WebGL hits ~25 FPS. After WebGPU GA in 2023, Figma started migrating.

Challenges: ① WebGL stencil-style fragment writes had to become compute kernels; ② path rasterisation used GPU-side tessellation on WebGL via vertex-shader emulation — on WebGPU it goes straight to a compute shader doing Bezier subdivision; ③ device-loss handling (early NVIDIA drivers hung frequently).

Result: complex docs went from ~25 FPS to 60–90 FPS, with CPU usage down 40%. The cost was device coverage: ~10% of users (mostly older Windows machines with integrated GPUs) were only stable on WebGL and stayed there.

CASE 2 Babylon.js 5+ — dual-backend rendering engine

Background: Babylon.js is the Microsoft-led WebGL renderer; WebGPU support added in 2022. Babylon 7.0 defaults to WebGPU with automatic fallback.

Distinctive design: ① Snapshot Rendering — captures WebGPU commands into a snapshot reused per frame, dropping CPU overhead by 80%; ② "Engine harmonisation" — one JS API runs on both backends; ③ early adoption of subgroup ops for GPU particles.

Data: Babylon's own benchmark — 10k objects: WebGL 25 FPS / WebGPU 60 FPS / WebGPU+Snapshot 90 FPS. Docs.

CASE 3 Bevy — Rust game engine + WebAssembly

Background: Bevy is a Rust game engine using wgpu as its GPU abstraction. Compiled to WASM + wgpu's WebGPU backend, you can run Rust games in the browser.

Interesting bits: ① the same Rust code on wgpu compiles to native (calling Vulkan/Metal/D3D12 directly) or to Web (calling navigator.gpu); ② Bevy hits 60+ FPS native but 30–45 FPS on Web, and the gap comes mainly from IPC (native has no such hop) and the JS↔WASM boundary; ③ Bevy is wgpu's largest consumer and has driven many performance optimisations back into wgpu.

CASE 4 Unity 6 Web — experimental WebGPU backend

Background: Unity has historically run in browsers via a WebGL 1/2 backend ("Unity WebGL"). When Unity 6 shipped in October 2024 it added an experimental WebGPU backend that coexists with the WebGL backend; developers opt in via Player Settings. (Don't confuse this with "Unity Web Player", the old NPAPI plugin retired in 2017.)

The unusual thing: Unity compiles C# via IL2CPP to C++, then emscripten → WASM, running atop WebGPU. Startup is ~2× slower than the WebGL backend (one-time pipeline compilation), but steady-state rendering performance is 1.5–2× faster on complex scenes. Early demos load in ~5 s on WebGPU vs ~3 s on WebGL. Unity 6.1 / 2025 roadmap targets graduating the WebGPU backend to production-ready.

Three lessons that keep recurring

① Pipeline pre-warm is mandatory: all four report startup pipeline compilation taking ~3–30 s, which users feel as "stalls". Production strategy: pre-compile the top 10–20 pipelines, lazy-load the rest.
② Device loss must be handled: Figma's number is ~0.3% / month. Ignoring it = users see permanent black screens.
③ Fallback must ship: ~10–15% of users still can't use WebGPU. Without a WebGL fallback, you lose them.
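Lessons ① and ② can be sketched together. The code below is an illustration under stated assumptions, not any of the four teams' implementations: `PipelineCache`, `recoverOnLoss`, and the structural interfaces are hypothetical names. The real API surface it leans on is `createRenderPipelineAsync` and the `device.lost` promise, whose info object carries `reason` ("unknown" or "destroyed") and `message`.

```typescript
// Structural stand-in (hypothetical) for the async pipeline-creation part of GPUDevice.
interface PipelineDevice<D, P> {
  createRenderPipelineAsync(descriptor: D): Promise<P>;
}

class PipelineCache<D, P> {
  private readonly device: PipelineDevice<D, P>;
  private readonly descriptors: ReadonlyMap<string, D>;
  private readonly compiled = new Map<string, Promise<P>>();

  constructor(device: PipelineDevice<D, P>, descriptors: ReadonlyMap<string, D>) {
    this.device = device;
    this.descriptors = descriptors;
  }

  // Lazy path: compile on first request, then share the same promise.
  get(key: string): Promise<P> {
    const cached = this.compiled.get(key);
    if (cached) return cached;
    const descriptor = this.descriptors.get(key);
    if (descriptor === undefined) {
      return Promise.reject(new Error(`unknown pipeline "${key}"`));
    }
    const promise = this.device.createRenderPipelineAsync(descriptor);
    this.compiled.set(key, promise);
    return promise;
  }

  // Pre-warm: start compiling the hot set in parallel before the first frame.
  async prewarm(hotKeys: readonly string[]): Promise<void> {
    await Promise.all(hotKeys.map((key) => this.get(key)));
  }
}

interface LostInfo {
  readonly reason: string;
  readonly message: string;
}
interface LosableDevice {
  readonly lost: Promise<LostInfo>;
}

// Re-initialise after an unexpected loss; skip it when the app destroyed the device itself.
function recoverOnLoss(
  device: LosableDevice,
  reinit: (info: LostInfo) => Promise<void>,
): Promise<void> {
  return device.lost.then(async (info) => {
    if (info.reason === "destroyed") return;
    await reinit(info);
  });
}
```

The design choice is that `get` stores the promise rather than the resolved pipeline, so a pipeline requested twice while still compiling is only compiled once.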

CHAPTER 24

限制 · 指纹 · 安全

Limits · fingerprinting · security

所有的"为什么不行" 都能追到隐私 / 安全 / 一致性这三个原因

Every "why doesn't this work?" traces back to privacy / security / consistency

WebGPU 的边界都是故意设的。这一章列五类边界。

① 数值上限 — spec-mandated minimums

所有 maxXXX 限制都是 spec 的最低保证,实际硬件可以更高,但你不应该假设更高:

限制 | spec 最小 | 典型 hw(M2 Pro) | 含义
maxBufferSize | 256 MB | ~4 GB | 单个 buffer 最大字节数
maxStorageBufferBindingSize | 128 MB | ~4 GB | 单次绑定的 storage 最大
maxComputeWorkgroupStorageSize | 16 KB | 32 KB | workgroup memory 最大
maxComputeInvocationsPerWorkgroup | 256 | 1024 | workgroup 内 thread 总数
maxComputeWorkgroupSizeX/Y | 256 / 256 | 1024 / 1024 | 单维度
maxComputeWorkgroupsPerDimension | 65535 | ~2^31 | dispatch 总 workgroup 数

想用 256 MB 以上的 buffer?需要 requestDevice({ requiredLimits: { maxBufferSize: 1024*1024*1024 } }) 显式 opt-in。如果硬件不支持,requestDevice reject——不是悄悄给你一个降级 device

② 指纹防御 — 为什么不能拿 GPU 型号

许多原生 GPU API 都允许应用读取 GPU 型号字符串(如 "NVIDIA RTX 4090 / driver 555.85")。WebGPU 不允许——adapter.info 被严格 limited

  • vendor:粗粒度厂商名("apple"/"amd"/"intel"/"nvidia"/"qualcomm"/"arm"),不给版本号
  • architecture:粗粒度架构名(如 "apple-7", "intel-gen11", "ada-lovelace"),可选
  • device:通常空字符串
  • description:通常空字符串

为什么这么严:GPU 型号 + 驱动版本极强的指纹——某些 GPU 全世界几千张。配合 IP、UA、字体列表,单次访问就能唯一识别用户。WebGPU 选择只暴露厂商类——能让程序 dispatch 适合的优化,但不能 fingerprint。

③ 时序攻击 — 为什么 timestamp 被 clamp

GPU 是很好的时序攻击平台——多个 origin 共用同一颗 GPU,cache 共享,timing 可推测出其他 origin 的内存访问模式。学界已 demo'd(USENIX Security 2024)从 GPU timing 推出另一个 tab 的 LLM 推理 token。WebGPU 的防御:

  • timestamp-query 分辨率 clamp 到 100 µs(spec §4.7.4)
  • 不暴露 GPU clock 频率
  • performance.now() 也 clamp 到 100 µs(早已)
  • device 之间 GPU 资源池独立分配(不复用 buffer slot 避免 cache 共享)

④ Workgroup memory 必须 zero-init

WGSL spec 要求所有 workgroup memory 在 invocation 前必须为 0。这是性能损失——硬件本来不需要 zero——但避免了 "读到上一个 dispatch 留下的数据"导致的 cross-origin 泄漏。Dawn/wgpu 在生成 shader 时自动加 zero-init prologue。

⑤ Texture 自动初始化

新创建的 texture 在第一次读之前必须保证为 0。如果应用没显式 clear,Dawn 在第一次绑定到 shader 时插一个 clear pass。这是开销但避免了 GPU 显存里别的 origin 残留数据被读取。

WebGPU 安全模型的设计原则

"GPU 上不应该有 origin A 的数据 origin B 能看到的可能性"——这一句话决定了 WebGPU 80% 的非性能决策。所有看起来"为什么这事不让做",最后都能追到这条。WebGPU 因此不会暴露 raw VRAM 指针、DMA buffer 共享、persistent memory 映射、cross-process GPU resource 共享——即使原生 API 有。

WebGPU's boundaries are all deliberate. This chapter enumerates five of them.

① Numeric limits — spec-mandated minimums

Every maxXXX is the spec's minimum guarantee; real hardware can do more, but you must not assume it does:

Limit | Spec min | Typical HW (M2 Pro) | Meaning
maxBufferSize | 256 MB | ~4 GB | largest single buffer in bytes
maxStorageBufferBindingSize | 128 MB | ~4 GB | largest storage binding
maxComputeWorkgroupStorageSize | 16 KB | 32 KB | workgroup memory cap
maxComputeInvocationsPerWorkgroup | 256 | 1024 | threads per workgroup
maxComputeWorkgroupSizeX/Y | 256 / 256 | 1024 / 1024 | per-axis cap
maxComputeWorkgroupsPerDimension | 65535 | ~2^31 | dispatch workgroup count

Want a buffer over 256 MB? You must requestDevice({ requiredLimits: { maxBufferSize: 1024*1024*1024 } }) to opt in. If the hardware lacks it, requestDevice rejects — not a silent fallback to a lesser device.
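A hedged sketch of that opt-in flow: check what the adapter can offer, request only what is needed, and let the rejection surface instead of assuming a bigger limit. `requestDeviceForBuffer` and the structural `LimitAdapter` type are hypothetical; `adapter.limits.maxBufferSize` and `requiredLimits` are the real WebGPU names, and 268435456 is the 256 MB figure from the table above.

```typescript
// Structural stand-in (hypothetical) for the parts of GPUAdapter used here.
interface LimitAdapter<Dev> {
  readonly limits: { readonly maxBufferSize: number };
  requestDevice(descriptor?: { requiredLimits?: Record<string, number> }): Promise<Dev>;
}

const SPEC_MIN_MAX_BUFFER_SIZE = 256 * 1024 * 1024; // 268435456 bytes

// Requests a device able to hold one buffer of `wantBytes`.
// Fails loudly rather than continuing with a device that is too small.
async function requestDeviceForBuffer<Dev>(
  adapter: LimitAdapter<Dev>,
  wantBytes: number,
): Promise<Dev> {
  if (wantBytes <= SPEC_MIN_MAX_BUFFER_SIZE) {
    // Within the guaranteed minimum: no opt-in required.
    return adapter.requestDevice();
  }
  if (wantBytes > adapter.limits.maxBufferSize) {
    throw new Error(
      `adapter caps maxBufferSize at ${adapter.limits.maxBufferSize} bytes, need ${wantBytes}`,
    );
  }
  // Ask for exactly what is needed, not the adapter's maximum.
  return adapter.requestDevice({ requiredLimits: { maxBufferSize: wantBytes } });
}
```

Checking `adapter.limits` first is optional, since `requestDevice` would reject anyway, but it produces a clearer error message than the generic rejection.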

② Fingerprint defence — why you can't read the GPU model

Many native GPU APIs let apps read the GPU model string ("NVIDIA RTX 4090 / driver 555.85"). WebGPU does not — adapter.info is strictly limited:

  • vendor: coarse name ("apple"/"amd"/"intel"/"nvidia"/"qualcomm"/"arm"), no version
  • architecture: coarse arch (e.g. "apple-7", "intel-gen11", "ada-lovelace"), optional
  • device: usually empty string
  • description: usually empty string

Why so strict: GPU model plus driver version is a very strong fingerprint; some GPUs exist in only a few thousand units worldwide. Combined with IP, UA, and font list, a single visit could uniquely identify a user. WebGPU exposes only the vendor and a coarse architecture class: enough to dispatch the right optimisations, not enough to fingerprint.
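What "dispatch the right optimisations" can look like in practice is sketched below. Only the `vendor` and `architecture` field names come from `adapter.info`; the `Tuning` shape, the `chooseTuning` helper, and every number in the table are made-up placeholders that a real application would replace with measured values.

```typescript
// Subset of adapter.info used for tuning decisions.
interface AdapterInfoLike {
  readonly vendor: string;
  readonly architecture: string;
}

// Hypothetical tuning knobs; the values below are illustrative, not measured.
interface Tuning {
  readonly workgroupSize: number;
  readonly tileSize: number;
}

const DEFAULT_TUNING: Tuning = { workgroupSize: 64, tileSize: 8 };

const TUNING_BY_VENDOR = new Map<string, Tuning>([
  ["nvidia", { workgroupSize: 256, tileSize: 16 }],
  ["amd", { workgroupSize: 256, tileSize: 16 }],
  ["apple", { workgroupSize: 128, tileSize: 16 }],
  ["intel", { workgroupSize: 128, tileSize: 8 }],
]);

// Falls back to a conservative default when the vendor is empty or unknown,
// which is the common case whenever the browser withholds the information.
function chooseTuning(info: AdapterInfoLike): Tuning {
  return TUNING_BY_VENDOR.get(info.vendor.toLowerCase()) ?? DEFAULT_TUNING;
}
```

Because any field may be an empty string, the default branch is the one that has to be correct; the vendor table is only an optimisation on top of it.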

③ Timing side channels — why timestamps are clamped

GPUs are great timing side-channel platforms — multiple origins share one GPU, share caches, and timings can hint at other origins' memory access patterns. Academia demonstrated extracting another tab's LLM inference tokens from GPU timing (USENIX Security 2024). WebGPU's defences:

  • timestamp-query resolution clamped to 100 µs (spec §4.7.4)
  • GPU clock frequency not exposed
  • performance.now() already clamped to 100 µs
  • GPU resource pools allocated independently per device (no buffer-slot reuse that would share caches)
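The practical effect of a 100 µs clamp on profiling can be shown with a little arithmetic. The sketch assumes the clamp behaves like rounding each timestamp down to a multiple of the quantum, which is an assumption about the mechanism rather than a statement of what any browser does; the function names are hypothetical. Real timestamp queries return 64-bit nanosecond values, and plain numbers are used here only to keep the example short.

```typescript
const QUANTUM_NS = 100_000; // 100 µs expressed in nanoseconds

// Assumed clamp model: round a timestamp down to a multiple of the quantum.
function quantize(timestampNs: number): number {
  return Math.floor(timestampNs / QUANTUM_NS) * QUANTUM_NS;
}

// What the page observes when it subtracts two clamped timestamps.
function observedDurationNs(startNs: number, endNs: number): number {
  return quantize(endNs) - quantize(startNs);
}

// Workaround for short kernels: time many back-to-back dispatches between
// one pair of timestamps and divide, so the clamp error is amortised.
function estimatePerDispatchNs(observedTotalNs: number, iterations: number): number {
  if (iterations <= 0) throw new RangeError("iterations must be positive");
  return observedTotalNs / iterations;
}

// A 40 µs kernel reads as 0 µs or 100 µs depending on where it starts:
const sameBucket = observedDurationNs(1_210_000, 1_250_000);
const crossesBucket = observedDurationNs(1_290_000, 1_330_000);
// 1000 repetitions of the same kernel recover the true per-dispatch cost:
const amortised = estimatePerDispatchNs(observedDurationNs(1_210_000, 41_210_000), 1000);
```

Under this model a single short dispatch cannot be timed meaningfully, so benchmarks need batching, while the coarse resolution gives an attacker far less signal per sample.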

④ Workgroup memory must zero-init

WGSL spec requires all workgroup memory to be zeroed before invocation. It's a performance cost — hardware doesn't intrinsically need it — but prevents "reading leftover data from a prior dispatch" cross-origin leaks. Dawn/wgpu auto-inject a zero-init prologue.
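For intuition, the prologue has roughly the shape of a cooperative clear followed by a barrier. The helper below generates such a WGSL snippet as a string. It is a sketch of the idea only and not the code Tint or naga actually emit; it assumes a workgroup array of f32 named by the caller and a u32 `local_index` already bound to `@builtin(local_invocation_index)`. The function name is hypothetical.

```typescript
// Emits an illustrative WGSL zero-init prologue for a workgroup array of f32.
// Assumes `local_index` (u32) is bound to @builtin(local_invocation_index).
function zeroInitPrologue(varName: string, elementCount: number, workgroupSize: number): string {
  if (!Number.isInteger(elementCount) || elementCount <= 0) {
    throw new RangeError("elementCount must be a positive integer");
  }
  if (!Number.isInteger(workgroupSize) || workgroupSize <= 0) {
    throw new RangeError("workgroupSize must be a positive integer");
  }
  return [
    // Each invocation clears a strided slice, so the work is shared evenly.
    `for (var i = local_index; i < ${elementCount}u; i += ${workgroupSize}u) {`,
    `  ${varName}[i] = 0.0;`,
    `}`,
    // No invocation may read the array before every slice is cleared.
    `workgroupBarrier();`,
  ].join("\n");
}
```

The loop plus barrier is where the cost comes from: every dispatch pays for the stores and one synchronisation even if the shader overwrites the whole array immediately afterwards.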

⑤ Textures auto-initialised

A freshly created texture must read as zero before any write. If the app doesn't explicitly clear, Dawn injects a clear pass on first bind to a shader. Costs CPU+GPU work but prevents leftover VRAM from other origins leaking.

WebGPU's security design principle

"There should be zero possibility that origin A's data is visible to origin B via the GPU." This single sentence drives 80% of WebGPU's non-perf decisions. Every "why can't I do X?" eventually traces back to it. As a result WebGPU does not expose raw VRAM pointers, DMA buffer sharing, persistent memory mapping, or cross-process GPU resource sharing — even when native APIs do.

CHAPTER 25

之后 — WebGPU 的下一个十年

What's next — WebGPU in the next decade

subgroup matrix · ray tracing · multi-GPU · HDR · WGSL+


WebGPU 1.0 GA 在 2023 年。现在(2026 年初)是 WebGPU 1.1 的标准化期——8-10 个扩展在 W3C 工作组里走流程。给三年后的图:

即将进 1.1(2026-2027)

  • subgroup ops:subgroupAdd/Ballot/Shuffle 等。已在 Chrome flag 后;Spec 在 Candidate Recommendation。
  • subgroup matrix:wmma / Apple matrix / Intel XMX 的 web 暴露。Ch19 已 demo。LLM 性能 +3-5×。
  • FP16:spec 已稳定,浏览器全部支持。从 "optional" 升到 "core"。
  • chromium-experimental-multi-draw:一次 draw N 个 instance 但 instance 大小不同。GPU-driven rendering 的关键。
  • chromium-experimental-pixel-local-storage:避免 framebuffer 来回 swap。移动 GPU 受益最大。

在路上但更远(2027-2029)

  • Ray tracing:DXR / Vulkan RT / Metal RT 三家都有,但 WebGPU 还没提案——核心 blocker 是 BVH 构建可能泄漏几何信息。最早 2027 看到 origin trial。
  • Mesh shaders:替代 vertex/geometry shader 的现代化管线。三家都有但语义差大;WebGPU 的统一抽象正在设计。
  • Multi-GPU:双卡机器、CPU+GPU shared memory(Apple unified、Intel iGPU+dGPU)。目前 WebGPU 只能看到一个 adapter。
  • HDR rendering:rec2100 色域、HDR10 输出。需要和 canvas、CSS Color 4 协调。
  • WebTransport for textures:从 server 直接 stream texture 到 GPU 不经 CPU。视频会议、云游戏的关键。
  • Persistent compute:GPU 上跑长任务(> 2 秒)不被 TDR 杀。需要让 OS / driver 区分"前台"和"后台" GPU 工作。

WGSL 自身演化

  • generics:当前 WGSL 没有泛型函数,vec2<T> 类型参数是编译器内置。社区 proposal 在讨论 user-defined generics。
  • module system:当前一个 WGSL 文件就是一个 module。社区想加 import 等。
  • preprocessor:当前没有 #define 一类。各种 build tool(如 webpack wgsl-loader)在 JS 里做 string templating。可能成为 standardised feature。
  • pointer arithmetic:当前禁。讨论中给 storage ptr 加一些受控算术,方便实现 zero-copy 矩阵 view。

大方向:浏览器变成 AI 设备

从 transformers.js 到 Apple Intelligence on Web Inference,浏览器逐渐变成本地 AI 设备。WebGPU 是这条路唯一可走的桥——WebNN(Web Neural Network API)也在标准化但定位是更 high-level、走原生 ML 后端(CoreML、DirectML、NNAPI)。两个都会上线,但 WebGPU 给灵活、WebNN 给极致 ML 性能。

2030 年的浏览器

预测:① 7B LLM 模型在消费级 GPU上跑到 30 tok/s 是默认;② 视频会议 client 全靠 WebGPU 做实时人脸增强、降噪、背景;③ in-browser game 性能 ~原生 90%;④ Web 上首次出现"GPU 算力时常成为瓶颈"的应用(之前都是 CPU/网络)。WebGPU 是这一切的底座

"一行 pass.dispatchWorkgroups()
要被翻译八次。
但只翻译一次
就可以让 GPU 算
一万亿次乘法。"

FIN // END OF FIELD NOTE 09

WebGPU 1.0 went GA in 2023. Right now (early 2026) is WebGPU 1.1's standardisation window — 8–10 extensions are moving through the W3C working group. A picture of three years out:

Landing in 1.1 (2026–2027)

  • Subgroup ops: subgroupAdd/Ballot/Shuffle and friends. Already behind a Chrome flag; spec at Candidate Recommendation.
  • Subgroup matrix: web exposure for wmma / Apple matrix / Intel XMX. Demo'd in Ch19. LLM perf +3-5×.
  • FP16 (shader-f16): spec is stable and implementations are widespread. Slated to move from "optional" to "core".
  • chromium-experimental-multi-draw: one call issues N draws, each with its own parameters. Key to GPU-driven rendering.
  • chromium-experimental-pixel-local-storage: avoid framebuffer swap-in/out. Mobile GPUs benefit most.
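Until these features are core, code has to probe for them. The sketch below splits a wish list into features the adapter supports and features that need a fallback path. `adapter.features` is a set-like object with `has()`, and "shader-f16" and "subgroups" are feature names as I understand them; `planFeatures` and `FeaturePlan` are hypothetical, and the experimental Chromium names should be checked against the browser version in use.

```typescript
// adapter.features is set-like; only has() is needed here.
interface FeatureSetLike {
  has(name: string): boolean;
}

interface FeaturePlan {
  readonly required: string[]; // pass as requiredFeatures to requestDevice
  readonly missing: string[];  // the app needs a fallback shader path for these
}

// Splits the wish list so requestDevice never asks for an unsupported feature,
// which would otherwise make the whole request fail.
function planFeatures(supported: FeatureSetLike, wanted: readonly string[]): FeaturePlan {
  const required: string[] = [];
  const missing: string[] = [];
  for (const name of wanted) {
    (supported.has(name) ? required : missing).push(name);
  }
  return { required, missing };
}
```

Typical use would be `planFeatures(adapter.features, ["shader-f16", "subgroups"])`, followed by `adapter.requestDevice({ requiredFeatures: plan.required })` and a shader variant chosen per entry in `plan.missing`.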

On the road, further out (2027–2029)

  • Ray tracing: DXR / Vulkan RT / Metal RT all exist, but WebGPU has no proposal yet; the core blocker is that BVH construction can leak geometry information. Origin trial in 2027 at the earliest.
  • Mesh shaders: modern replacement for vertex/geometry shaders. All three native APIs have them but semantics differ; WebGPU's unified abstraction is being designed.
  • Multi-GPU: dual-GPU machines, CPU+GPU shared memory (Apple unified, Intel iGPU+dGPU). Today WebGPU only sees one adapter.
  • HDR rendering: rec2100 colour space, HDR10 output. Requires coordination with canvas + CSS Color 4.
  • WebTransport for textures: stream textures from server directly to GPU bypassing CPU. Key to video conferencing and cloud gaming.
  • Persistent compute: long GPU tasks (> 2 s) without TDR killing them. Needs OS/driver to distinguish "foreground" from "background" GPU work.

WGSL's own evolution

  • Generics: WGSL has no generic functions today; vec2<T> type parameterisation is compiler-built-in. Community proposals discussing user-defined generics.
  • Module system: one WGSL file = one module. Community wants import / linking.
  • Preprocessor: no #define-like construct. Build tools (e.g. webpack wgsl-loader) do JS-side templating. Could become standardised.
  • Pointer arithmetic: currently forbidden. Discussion of controlled arithmetic on storage pointers to enable zero-copy matrix views.
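The JS-side templating mentioned under "Preprocessor" usually amounts to string substitution before the source is handed to `createShaderModule`. A minimal sketch follows, assuming a `{{NAME}}` placeholder syntax; that syntax and the `applyDefines` helper are inventions for this example and not the behaviour of any particular loader.

```typescript
// Replaces {{NAME}} placeholders in WGSL source with values from `defines`.
// An unknown placeholder is an error, so typos fail at build time, not on the GPU.
function applyDefines(
  source: string,
  defines: ReadonlyMap<string, string | number>,
): string {
  return source.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    const value = defines.get(name);
    if (value === undefined) throw new Error(`missing define: ${name}`);
    return String(value);
  });
}

// Example template: the workgroup size is chosen in JS before compilation.
const KERNEL_TEMPLATE = `
@compute @workgroup_size({{WG_X}}, {{WG_Y}})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  // kernel body omitted in this sketch
}`;

const kernelSource = applyDefines(
  KERNEL_TEMPLATE,
  new Map<string, string | number>([["WG_X", 16], ["WG_Y", 8]]),
);
```

WGSL's own `override` constants cover part of this use case without string manipulation; templating remains common where the value changes types or array sizes.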

The big direction: browsers become AI devices

From transformers.js to Apple Intelligence's Web Inference, the browser is becoming a local AI device. WebGPU is the only general-purpose bridge available today; WebNN (Web Neural Network API) is also standardising, but at a higher level, dispatching to native ML backends (CoreML, DirectML, NNAPI). Both will ship: WebGPU gives flexibility, WebNN gives peak ML performance.

Browsers in 2030

Predictions: ① 7B LLMs running at 30 tok/s on consumer GPUs by default; ② video-call clients use WebGPU for real-time face enhancement / denoising / background; ③ in-browser games reach ~90% of native performance; ④ for the first time on the Web, GPU throughput becomes the routine bottleneck (previously always CPU/network). WebGPU is the foundation for all of it.

"One pass.dispatchWorkgroups()
gets translated eight times.
But translated only once,
it lets the GPU compute
a trillion multiplies."

FIN // END OF FIELD NOTE 09