一段 Rust 卷积循环要穿过 17 道工序、两层 JIT、4 GiB 的线性内存,才能在你的屏幕上跑成一条 SSE 指令。
这是 WebAssembly 从字节到机器码的全景手册。
A Rust convolution loop has to cross seventeen stages, two tiers of JIT, and 4 GiB of linear memory before it lights up as a single SSE instruction on your screen.
This is a field map of WebAssembly from bytes to machine code.
从 .rs 源文件出发,穿过 4 个阶段、13 道工序、3 个进程、~ 16.7 ms 的帧预算,最后变成屏幕上的一个像素。streaming compile 让阶段 Ⅱ 与 Ⅲ 完全重叠——这是 wasm 在浏览器的"边下边编" 体验来源。
From .rs source through four phases, thirteen stages, three processes and a 16.7 ms frame budget — to one pixel on screen. Streaming compile overlaps phases Ⅱ and Ⅲ entirely, the basis of wasm's "compile-while-downloading" feel.
先把 WebAssembly 这件事放回它的历史位置:它是 asm.js 的延伸,是 JavaScript 走到天花板之后的另一条腿,是浏览器从"文档查看器"变成"通用计算机"的最后一块拼图。先有这四章作为骨骼,后面 22 章的细节才会落到合适的位置。
Before we sink to bits, put WebAssembly back into its historical slot: an extension of asm.js, a second leg the browser grew once JavaScript hit its ceiling, the last piece that turned the browser from a "document viewer" into a "general-purpose computer". With these four chapters as skeleton, the 22 that follow fall into place.
把这个庞然大物压成三行
crushing the elephant into three lines
"WebAssembly"在大部分讲座 PPT 里被画成一个紫色方块,旁边是"fast, safe, portable"三个词,听起来像一份产品宣传单。但当你真正打开 spec 仓库会发现:它不是一个东西,而是三层契约叠在一起——一层是字节格式,一层是执行模型,一层是宿主接口。把这三层各写成一个公式,后面所有故事都能从里面长出来。
"WebAssembly" gets painted as a purple block in most slide decks, captioned fast, safe, portable like a product brochure. Open the spec repo and you discover it is not one thing — it's three contracts stacked on top of each other: one for the byte format, one for the execution model, one for the host interface. Write each as a formula and every later story grows out of them.
| 引擎 Engine | Tier 0(基线 baseline) | Tier 1(优化 optimising) | 用在哪 Used in |
|---|---|---|---|
| V8 | Liftoff (2018) | TurboFan / Turboshaft (2023→) | Chrome, Edge, Node, Deno |
| SpiderMonkey | Baseline (2018) | Ion | Firefox |
| JavaScriptCore | BBQ (Build Bytecode Quickly) | OMG (Optimized Machine Generator) | Safari, WebKit |
| Wasmtime | — | Cranelift | Bytecode Alliance, edge runtimes |
| Wasmer | Singlepass | Cranelift / LLVM | standalone, plugin sandbox |
| WAMR | interpreter / fast-interp | AOT (LLVM) | IoT, embedded |
从这张表里冒出一个事实:浏览器三家都选择了"双层 JIT",非浏览器引擎多半只留一层优化器或反而留解释器。原因是浏览器要兼顾"开页面要立刻能跑"和"久了要够快",而服务器端 wasm 通常是冷启动一次跑很久,直接 AOT 即可。同一个 spec,生出两套截然不同的实现哲学。
A fact climbs out: all three browser engines went with two-tier JIT, while non-browser engines tend to keep just one optimiser — or revert to an interpreter. Browsers must reconcile "must run instantly" with "must run fast eventually"; server-side wasm cold-starts once and runs forever, so AOT alone is enough. One spec, two diverging philosophies of implementation.
WebAssembly 不是一种语言,
是一份让 LLVM 和浏览器握手的协议。 Field Note · 03
WebAssembly is not a language.
It is the handshake between LLVM and the browser. Field Note · 03
每一个提案都是一次妥协的化石
every proposal is a fossilised compromise
2010 年 Google 在 Chrome 里塞了一个叫 NaCl 的东西——它能跑原生码,但每一种 CPU 各编译一份。后来 PNaCl 用 LLVM bitcode 当中间格式,通用化是有了,但只有 Chrome 支持。"在浏览器里跑 C++"这件事整整失败了五年。
2011 年另一个分支冒头:Mozilla 的 Alon Zakai 写了 Emscripten,把 LLVM bitcode 翻成 JavaScript;2013 年他和 Luke Wagner 进一步把"JS 的一个类型化子集"标准化成 asm.js——你可以用 "use asm" 告诉引擎这段代码全是 int32,引擎就能跳过类型检查,直接 AOT 编译。Firefox 上的 asm.js 跑出过原生 1.5 倍的成绩。
但 asm.js 仍然要走 JS parser,文件还是文本,还是要走 V8 的 SMI/HeapNumber 边界。所有人都看到了一条更短的路:把那个类型化子集直接二进制化。2015 年 6 月 17 日,W3C 上的四家——Mozilla、Google、Apple、Microsoft——宣布合作。两年后 MVP 在四大浏览器同时落地,这是 web 平台史上罕见的一次性达成。
In 2010 Google shipped NaCl in Chrome — it ran native code, but you had to compile once per CPU. PNaCl tried LLVM bitcode as a portable IR, but only Chrome supported it. "Running C++ in the browser" failed cleanly for five years.
The other branch sprouted in 2011: Mozilla's Alon Zakai wrote Emscripten, which translated LLVM bitcode into JavaScript. By 2013 he and Luke Wagner had standardised "a typed subset of JS" as asm.js — drop a "use asm" at the top and the engine could skip type checks and AOT-compile. Firefox's asm.js engine hit ~1.5× of native.
But asm.js still went through the JS parser, was still text, still bumped into V8's SMI/HeapNumber boundary. Everyone saw the shortcut: binarise that typed subset. On 17 June 2015 the four browser vendors — Mozilla, Google, Apple, Microsoft — announced collaboration on the W3C. Two years later the MVP shipped in all four browsers simultaneously — a rare instance of platform consensus actually happening.
四条血脉(NaCl · Emscripten · asm.js · JSC 经验) 在 2015 年 6 月 17 日的 W3C 会议室里收敛成 wasm 主干。MVP 之后,提案像枝条一样从主干长出来——绿色是编译器/运行时提案,紫色是计算能力提案。Wasm 2.0 在 2025 年成为 W3C Recommendation,把过去 8 年的 8 个独立提案合并成一份新基线。
Four ancestor strands (NaCl · Emscripten · asm.js · JSC heritage) converge into the wasm trunk on 17 Jun 2015 at the W3C. Post-MVP, proposals sprout — green are compiler/runtime proposals, purple are compute proposals. Wasm 2.0 became a W3C Recommendation in 2025, folding eight separate proposals into a new baseline.
Emscripten 的输出就是 HEAP32[(p+4)>>2] = x | 0 这种风格。它证明了"用 JS 当虚拟 CPU"在工程上可行。今天 Emscripten 还在,但它的 backend 已经直接输出 wasm。
Emscripten's output was the HEAP32[(p+4)>>2] = x | 0 style. It proved "JS as virtual CPU" was engineerable. Emscripten still ships, but its backend now emits wasm directly.
asm.js 用 "use asm" 一行声明,引擎认出后用 AOT 而非 JIT 编译该函数。Firefox 的 OdinMonkey 在 asm.js 上跑出过 1.5× of native。但 asm.js 仍是文本,要走 JS parser,parse 一个 100 MB 的游戏 bundle 要十几秒。这成了催生 wasm 二进制格式的最后一根稻草。
asm.js added a one-line "use asm" directive that let the engine AOT-compile (rather than JIT) a recognised function. Firefox's OdinMonkey hit 1.5× of native on it. But asm.js was still text, still went through the JS parser, and a 100 MB game bundle took tens of seconds to parse. That was the final straw that forced wasm to be binary.

| Phase | 含义 Meaning | 谁同意 Who agreed | 能不能用 Usable? |
|---|---|---|---|
| 0 · Pre-proposal | 某人提个 idea,有仓库 / someone has an idea + repo | — | 不 / no |
| 1 · Feature proposal | CG 同意值得做 / CG agrees it's worth doing | CG | 不 / no |
| 2 · Proposed spec text | 有正式规范文字 / formal spec text exists | CG | flag 后可用 / behind flag(Chromium --enable-experimental-webassembly-features) |
| 3 · Implementation | 至少 2 个引擎实现 / ≥ 2 engines shipped impl | CG | flag 后可用 + Origin Trial / behind flag, Origin Trial |
| 4 · Standardize | WG 投票纳入正式规范 / WG votes to standardise | WG | 默认开启 / on by default |
CG = Community Group(社区组,任何人可加入);WG = Working Group(工作组,需要会员资格)。一个提案常常在 phase 3 待两到三年——SIMD 在 phase 3 待了 26 个月才升 phase 4。这套机制让 wasm 的每一步演进都需要至少两家厂商先实现,从根上把"一家独大"挡住了。
CG = Community Group (anyone can join); WG = Working Group (membership required). A proposal often sits in phase 3 for two to three years — SIMD spent 26 months at phase 3 before stepping up. The mechanism forces every evolutionary step to be implemented by at least two vendors first — structurally blocking unilateral moves.
JVM 走过的路,wasm 又走了一遍
JVM walked this path, wasm walked it again
"为什么 wasm 不是寄存器机?Dalvik 不是更快吗?"——这是每个第一次看 wasm 字节码的人会问的问题。答案藏在一个看似无关的数字里:wasm 字节码的体积要小到能流式下载。MVP 设计期(2015)给自己定的目标是 4 MB 文本的 asm.js 程序压成不超过 1 MB 的二进制——压缩比 1:4。所有的设计决策都要让步于这个数字。
"Why isn't wasm a register machine? Aren't Dalvik registers faster?" — every first-time reader of wasm bytecode asks this. The answer hides in a seemingly unrelated number: wasm bytes must be small enough to stream-download. The MVP target (2015) was to fit 4 MB of asm.js text into < 1 MB of binary — a 1:4 ratio. Every design choice bows to that number.
考虑一行表达式 c = a + b。在两种 ISA 里它的字节序列分别是:
Take a single expression c = a + b. The byte sequences in the two ISAs:
看起来寄存器机指令更少。但寄存器号需要 bits 来编:LLVM 的 SSA 寄存器数量无界,实际编码时需要 32 位甚至更多;Dalvik 把寄存器限到 256 个,8 bit;ARM/x86 真寄存器 16 个,4 bit。栈机一字节就是一条 opcode(i32.add = 0x6A),局部变量索引用 LEB128(通常 1 byte),整体下来栈机一般赢 30~40% 字节。
Register ops look fewer. But register IDs need bits: LLVM SSA values are unbounded, encoded at 32+ bits each; Dalvik caps at 256 registers (8 bits); ARM/x86 have 16 real registers (4 bits). A stack-machine opcode is one byte (i32.add = 0x6A), local indices LEB128 (usually 1 byte). The stack form typically wins 30–40% on bytes.
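To make the byte-count argument concrete, here is a small sketch of ours (not taken from the article's toolchain output) spelling out the stack-machine encoding of `c = a + b`, using real opcode values from the wasm spec — `0x20` local.get, `0x21` local.set, `0x6A` i32.add — with each local index fitting in a single LEB128 byte:

```rust
// Stack-machine encoding of `c = a + b`, where a/b/c are locals 0/1/2.
// Opcode values are from the wasm spec: 0x20 local.get, 0x21 local.set,
// 0x6A i32.add. Each local index here fits in one LEB128 byte.
fn main() {
    let body: [u8; 7] = [
        0x20, 0x00, // local.get 0   ;; push a
        0x20, 0x01, // local.get 1   ;; push b
        0x6A,       // i32.add       ;; pop two, push a+b
        0x21, 0x02, // local.set 2   ;; pop result into c
    ];
    // 7 bytes total — no register number wider than one byte anywhere.
    assert_eq!(body.len(), 7);
}
```

The register-machine version of the same statement would spend extra bits naming three registers per instruction; here every operand reference costs exactly one byte.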
opcode 1 byte,大多数立即数 LEB128 1~2 byte。同样语义比 ARM64 大约小 35%。
1-byte opcode, most immediates 1–2-byte LEB128. About 35% smaller than equivalent ARM64.
类型栈抽象解释,一遍扫完即可证明类型安全。Ch11 详谈。
Type-stack abstract interpretation: one linear pass proves type safety. See Ch11.
栈位置编译期可知,Liftoff 边解码边发射机器码,无中间 IR。
Stack positions are statically known; Liftoff emits machine code while decoding, no IR.
不绑定寄存器数量或调用约定,同一份字节在 x86/ARM/RISC-V 上都能跑。
Not tied to a register count or calling convention; the same bytes run on x86/ARM/RISC-V.
栈机解释执行慢——每条指令要操作栈顶,栈本身常驻内存,L1 cache 命中率不如寄存器机。这是 JVM 早期被嘲讽"慢得像树懒"的根本原因。wasm 怎么解?用 JIT 而不是解释器。设计者赌的是:既然反正都要 JIT,那就让字节码偏向解码密度,机器码偏向执行速度,各取所长。
Stack interpreters are slow — every op touches the stack top, the stack lives in memory, L1 hit-rate trails register machines. That's why early JVMs felt "sloth-slow". Wasm's answer: skip the interpreter. The bet was: we'll JIT anyway, so let the bytecode optimise for density and the machine code optimise for speed. Best of both.
栈机还配了一个"半寄存器"层:locals。每个函数有固定数量的 locals(像寄存器),local.get / local.set 在栈和 locals 之间搬运值。这套设计让 wasm 既像栈机一样紧凑,又像寄存器机一样能"存中间结果"。JVM 的 locals 与之几乎完全相同——wasm 的设计者把 JVM 学了一遍。
Stack machines also carry a "half-register" file: locals. Each function has a fixed number of locals (register-like), with local.get / local.set moving values between stack and locals. That gives wasm a stack's compactness with a register machine's ability to "hold intermediate values". The JVM has the exact same construct — wasm's designers studied JVM thoroughly.
fib(40) 在 wasm 和 Dalvik 上的字节数 · fib(40) in wasm vs Dalvik bytes
实测把 fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } 编到 wasm 和 dex:wasm body 是 31 byte(含两次递归调用),dex 经压缩 27 byte。差距不大,因为函数体太短,寄存器号编码 vs locals 索引几乎抵平。真正拉开差距的是大函数——一个 1000 行的 SIMD inner loop,wasm 大约赢 33%,这才是 wasm 选栈机的真正回报。
Compile fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } to wasm and dex: wasm body is 31 bytes (including two recursive calls), dex 27 bytes. Tiny gap, because the body is too short — register encoding vs local index cancels out. The gap widens on large functions — a 1000-line SIMD inner loop sees wasm ~33% smaller. That's the real return on the stack-machine bet.
栈是密度,寄存器是速度。
wasm 选了让编译器付出速度。 Field Note · 03
The stack buys density, the register buys speed.
Wasm chose to make the compiler pay for speed. Field Note · 03
为什么 wasm 必须存在
why wasm has to exist
V8 是一台让人叹服的 JIT 引擎——它在运行时学习对象形状、追踪类型、构造内联缓存、把热函数从 Ignition 经 Sparkplug、Maglev 一路提升到 TurboFan。但所有这些工程都建立在一个前提之上:JS 是动态类型。这个前提注定了 JIT 有一个跨不过去的天花板。
把天花板写成三件事:(1) 类型不确定,所以要 inline cache,猜错就要 deopt;(2) 数字不止一种表示,SMI / HeapNumber / Float64 之间的装箱拆箱无法消除;(3) GC 不可关,即使是数值密集的图像处理,引用计数和写屏障也要付。这三件事单独看每一件都是几个百分点,叠起来就是 5× ~ 10× 的差距。
V8 is an awe-inspiring JIT — it learns object shapes at runtime, traces types, builds inline caches, lifts hot functions from Ignition through Sparkplug and Maglev to TurboFan. But all that engineering rests on one premise: JS is dynamically typed. That premise dictates a ceiling.
Three sides to that ceiling: (1) types are uncertain, so you need inline caches, deopt on misses; (2) numbers have multiple representations (SMI / HeapNumber / Float64), boxing/unboxing cannot be eliminated; (3) GC cannot be turned off — even on pixel-pushing loops you pay write barriers and reference counts. Each of these is a few percent; stacked, they multiply to 5–10×.
引擎把 obj.x 优化成"直接偏移 +8 取值"——前提是 obj 总是这个形状。一旦你给某个 obj 加了字段,这条优化作废,引擎不得不 deopt 回 Ignition,函数从 TurboFan 掉到字节码。wasm 没有 obj.x,有的是 i32.load offset=8——偏移在编译期就钉死,没有 deopt。
The engine optimises obj.x into "offset +8 of this shape" — until you add a property and the shape changes, at which point the optimisation invalidates, the engine deopts back to Ignition, and the function falls from TurboFan to bytecode. Wasm has no obj.x; it has i32.load offset=8 — the offset is fixed at compile time, deopt-free.
x = 1 是 SMI(31-bit tagged 整数,栈上);x = 1.5 是 HeapNumber(堆指针,要 GC)。x = a + b 时引擎要先判断两边是 SMI 还是 HeapNumber,再决定加法 opcode。一个简单的 inner loop 里这种判断每次都跑。wasm 的 i32.add 输入永远是 i32——没有判断,直接出 add eax, ebx。
x = 1 is an SMI (31-bit tagged int, stack); x = 1.5 is a HeapNumber (heap pointer, GC-tracked). For x = a + b, the engine first checks both sides' representations, then picks the add op. In a tight inner loop, that check runs every iteration. Wasm's i32.add always takes two i32s — no check, straight to add eax, ebx.
arr[i] = x 这种写入会触发 write barrier,以保护代际收集器。一个 1M 次写入的循环里,write barrier 占 10~15% 时钟。wasm 的 i32.store offset=0 写到 linear memory——它是 ArrayBuffer 的一片,GC 完全不参与。这也是为什么 wasm 适合做图像/音视频/物理引擎而不适合做 React 组件树。
A write like arr[i] = x triggers a write barrier to keep the generational GC sound. In a 1 M-write loop, write barriers consume 10–15% of cycles. Wasm's i32.store offset=0 hits linear memory — a slice of ArrayBuffer the GC never touches. That's why wasm shines on images/video/physics and slogs on React component trees.
这是上一篇文章《V8 是怎么把 JS 跑快的》结尾的句子——它正是这一章要展开的主张。JS 引擎走完了它能走的所有路:Sparkplug 把启动延迟干到 1× of native parse;Maglev 把热路径速度做到 0.8× of TurboFan;TurboFan 把寄存器分配做到接近 LLVM。当你需要更快,你需要的不是更聪明的 JIT,你需要的是更少的不确定性——这就是 wasm 的角色。
That's the closing line of the previous piece «How V8 Makes JS Fast» — and it's exactly the claim this chapter unpacks. The JS engine has walked every road it can: Sparkplug brings startup to ~1× native parse, Maglev hits 0.8× of TurboFan on hot paths, TurboFan's register allocator approaches LLVM's. When you need more speed, you don't need a smarter JIT — you need less uncertainty. That is wasm's role.
| 基准 Benchmark | JS (V8 TurboFan) | Wasm (V8 TurboFan) | 原生 C / Native C (LLVM -O3) | wasm / native |
|---|---|---|---|---|
| SciMark 2.0 (geom mean) | 2.4× | 1.15× | 1.00× | 87% |
| fasta (computational) | 3.1× | 1.08× | 1.00× | 93% |
| n-body (3D physics) | 2.8× | 1.18× | 1.00× | 85% |
| JPEG decode (libjpeg) | 4.5× | 1.25× | 1.00× | 80% |
| SHA-256(纯算术 / pure arithmetic) | 3.6× | 1.10× | 1.00× | 91% |
| DOM diff(JS-bound) | 1.00× | 1.7×↓ | — | — |
表里最后一行是反例:DOM diff 在 wasm 里反而更慢,因为每次 DOM 调用都要跨 wasm/JS 边界,trampoline 成本压过了算术加速。wasm 比 JS 快的是"算数",不是"调用浏览器 API"——这条边界 Ch17 会量化。
The last row is a counter-example: wasm is slower at DOM diff, because each DOM call crosses the wasm/JS boundary, and the trampoline cost outweighs the arithmetic speedup. Wasm beats JS at arithmetic, not at calling browser APIs — Ch17 quantifies that boundary.
姐妹篇 chromium-renderer 用一张名片(The Card)当贯穿全文的实例。这里我们用一段 3×3 卷积循环——它来自图像滤镜,小到可以打印在一页纸上,大到能压出栈机、SIMD、JIT、Tier-up 几乎所有的特性。后面 22 章每一章都会切回这段代码,看它在那一道工序里是什么样子。
In the sibling piece chromium-renderer, a business card (The Card) served as the through-line. Here we use a 3×3 convolution loop — straight out of image filtering, small enough to print on one page yet rich enough to exercise the stack machine, SIMD, JIT, and tier-up. Every one of the next 22 chapters cuts back to this code and shows what it looks like at that stage.
11 行 Rust,17 道工序,1 条 SSE 指令
11 lines of Rust, 17 stages, 1 SSE op
"WebAssembly 的字节是什么样子" 这种问题用文字描述会很抽象。我们换一种问法:这一段你能看得懂的 Rust 函数,在每一道工序里长什么样。下面是它的源头——一个 3×3 盒型模糊滤镜,把一张灰度图的每个像素替换成它周围 9 个像素的平均值。这是 Photoshop 里"模糊"按钮在内核做的事的精简版,也是 wasm 最擅长跑的那种代码:循环密、整数为主、对内存 layout 敏感。
"What do WebAssembly bytes look like?" gets abstract in prose. So we switch the question: what does this Rust function look like at every stage? Below is the source — a 3×3 box blur that replaces each grayscale pixel with the average of its 9 neighbours. A miniature of Photoshop's blur button kernel, and the kind of code wasm shines on: loop-heavy, integer-dominated, memory-layout sensitive.
```rust
// hot.rs — 3×3 box blur on an 8-bit grayscale image
// w · h are pre-checked, no panics on bounds
#[no_mangle]
pub fn blur3(src: &[u8], dst: &mut [u8], w: usize, h: usize) {
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            let mut sum: u32 = 0;
            for dy in 0..3 {
                for dx in 0..3 {
                    sum += src[(y + dy - 1) * w + (x + dx - 1)] as u32;
                }
            }
            dst[y * w + x] = (sum / 9) as u8;
        }
    }
}
```
五个观察:① #[no_mangle] 让 rustc 把符号名原样导出,后面 wasm 才能用 blur3 找到它;② 输入是切片,Rust 编译到 wasm 时会拆成"指针 + 长度"两个 i32 参数;③ 内层 9 次 src[...] 索引,每次都会被 LLVM 展平成 i32.load offset=?;④ sum / 9 编译成 i32.div_u——不是浮点;⑤ Rust 的 as u8 编译成 i32.store8,只写低 8 位。这五件事每一件都对应 wasm 的一个设计点,后面会一个个回到。
Five observations: ① #[no_mangle] tells rustc to export the symbol literally so wasm callers can find blur3; ② slice arguments are split into "pointer + length" — two i32 args each; ③ the nine inner src[...] indices each flatten into i32.load offset=?; ④ sum / 9 becomes i32.div_u — integer, not float; ⑤ as u8 becomes i32.store8, writing only the low byte. Each of these maps to a wasm design choice; we'll come back to them one by one.
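Observation ② can be simulated in plain Rust: at the wasm boundary each slice argument becomes a (pointer, length) pair, so the exported function effectively has the raw-pointer signature below. This is our illustration (the name blur3_raw is ours, not something rustc generates); we check the raw-pointer version agrees with the safe version on a tiny 4×3 image:

```rust
// Sketch of the (pointer, length) ABI a &[u8] slice lowers to at the wasm
// boundary. blur3_raw is an illustrative name, not rustc output.
unsafe fn blur3_raw(src: *const u8, dst: *mut u8, w: usize, h: usize) {
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            let mut sum: u32 = 0;
            for dy in 0..3 {
                for dx in 0..3 {
                    // same index arithmetic as the safe version, via raw offsets
                    sum += *src.add((y + dy - 1) * w + (x + dx - 1)) as u32;
                }
            }
            *dst.add(y * w + x) = (sum / 9) as u8;
        }
    }
}

fn main() {
    let (w, h) = (4usize, 3usize);
    let src: Vec<u8> = (0..(w * h) as u8).collect(); // pixels 0,1,2,...,11
    let mut dst = vec![0u8; w * h];
    unsafe { blur3_raw(src.as_ptr(), dst.as_mut_ptr(), w, h) };
    // interior pixel (1,1) averages bytes {0,1,2, 4,5,6, 8,9,10} = 45 → 45/9 = 5
    assert_eq!(dst[1 * w + 1], 5);
}
```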
$ rustc --target wasm32-unknown-unknown -O --crate-type cdylib -o hot.wasm hot.rs
$ wasm-opt -O3 hot.wasm -o hot.opt.wasm # Binaryen post-pass
$ ls -l hot*.wasm
-rw-r--r-- 1 airing staff 192 May 16 14:32 hot.opt.wasm
-rw-r--r-- 1 airing staff 248 May 16 14:32 hot.wasm
192 字节的 .wasm 包含完整的模块——8 字节 header 加 type / function / memory / export / code 五个 section,加起来不到一条 tweet。这是栈机+LEB128 编码密度的胜利。把这 192 字节十六进制打印出来,你能眼睛看完:
192 bytes contains the entire module — an 8-byte header plus five sections (type / function / memory / export / code), less than a tweet. That's the win from stack machine + LEB128. Print those 192 bytes as hex and you can read them with your eyes:
```
00000000  00 61 73 6d 01 00 00 00   ; \0asm magic + version=1
00000008  01 0b 02 60 04 7f 7f      ; type section, 2 types
00000010  7f 7f 00 60 00 00         ; (func (param i32 i32 i32 i32)), (func)
00000018  03 02 01 00               ; function section: func0 has type0
0000001c  05 03 01 00 01            ; memory section: 1 page (64 KiB)
00000021  07 09 01 05 62 6c 75      ; export "blur3"
00000029  72 33 00 00               ; → func 0
0000002d  0a ...                    ; code section, body of blur3 (155 byte)
...
000000be  0b                        ; end · final byte = 0xC0 (192)
```
注意三件事:① 00 61 73 6d 是 ASCII 的 \0asm——所有 wasm 模块都以它开头,像 ELF 的 0x7F ELF;② 01 00 00 00 是版本号 1,小端;③ 每个 section 以一个 ID byte(0x01 = type, 0x03 = function, ...)开头,然后是 LEB128 编码的长度。Ch06 会把这层皮一字一字撕开。
Three things to note: ① 00 61 73 6d is ASCII \0asm — every wasm module starts with it, like ELF's 0x7F ELF; ② 01 00 00 00 is version 1, little-endian; ③ each section opens with an ID byte (0x01 = type, 0x03 = function, …) followed by LEB128-encoded length. Ch06 peels this skin off byte by byte.
```wat
;; hot.wat — 经过 wasm-opt -O3 优化后的等价文本
(module
  (type $t0 (func (param i32 i32 i32 i32)))
  (memory (export "memory") 1)
  (func $blur3 (export "blur3") (type $t0)
        (param $src i32) (param $dst i32) (param $w i32) (param $h i32)
        (local $y i32) (local $x i32) (local $sum i32)
    ;; for y = 1..h-1
    (local.set $y (i32.const 1))
    (block $break_y
      (loop $loop_y
        (br_if $break_y (i32.ge_s (local.get $y)
                                  (i32.sub (local.get $h) (i32.const 1))))
        ;; for x = 1..w-1
        (local.set $x (i32.const 1))
        (block $break_x
          (loop $loop_x
            (br_if $break_x (i32.ge_s (local.get $x)
                                      (i32.sub (local.get $w) (i32.const 1))))
            ;; sum = 9 个 load 加起来(LLVM 已经把内 2 层循环展平)
            local.get $src
            i32.load8_u offset=0      ;; src[(y-1)*w + (x-1)]
            local.get $src
            i32.load8_u offset=1
            i32.add
            ;; ... 共 9 次 i32.load8_u + 8 次 i32.add ...(展开版省略)
            local.set $sum
            ;; dst[y*w + x] = sum / 9
            local.get $dst
            local.get $sum
            i32.const 9
            i32.div_u
            i32.store8                ;; 写回
            (local.set $x (i32.add (local.get $x) (i32.const 1)))
            br $loop_x))              ;; end x
        (local.set $y (i32.add (local.get $y) (i32.const 1)))
        br $loop_y))                  ;; end y
  ))
```
这是 wat 的一种展开形式(为了讲解可读)。实际的 LLVM 输出会把内 2 层循环完全展开成 9 条 i32.load8_u + 8 条 i32.add。注意三个关键点:(a) 控制流只有 block / loop / br / br_if 这几个原语——没有 goto;(b) 所有内存访问都带 offset=N,这个 offset 是编译期常量,Liftoff 可以直接折进地址计算;(c) 每条算术指令的输入输出类型由 opcode 自身决定(i32.add 必然两 i32 输入一 i32 输出)——这是 wasm "静态类型"的核心,Ch08 和 Ch11 会展开。
This is wat in an unrolled-but-readable form. The real LLVM output unrolls the two inner loops into 9 × i32.load8_u + 8 × i32.add. Three key things: (a) control flow uses only block / loop / br / br_if primitives — no goto; (b) every memory access carries an offset=N immediate that is a compile-time constant, which Liftoff folds straight into address arithmetic; (c) each arithmetic opcode self-describes its operand types (i32.add is always two-i32-in, one-i32-out) — that's the core of wasm's static typing, expanded in Ch08 and Ch11.
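As an aside (our illustration, not engine output): wasm's structured control flow maps almost one-to-one onto Rust's labelled loops — a br to a block label is a forward break out, while falling to the bottom of a loop body followed by br re-enters it. A sketch of the outer y loop's skeleton:

```rust
// Rust labelled loops mirror the wat skeleton:
//   block $break_y (loop $loop_y ... br_if $break_y ... br $loop_y)
fn outer_iterations(h: i32) -> i32 {
    let mut y = 1;
    let mut iters = 0;
    'break_y: loop {            // loop $loop_y inside block $break_y
        if y >= h - 1 {
            break 'break_y;     // br_if $break_y — forward branch out
        }
        iters += 1;             // ...loop body runs here...
        y += 1;
        // reaching the bottom of `loop {}` plays the role of br $loop_y
    }
    iters
}

fn main() {
    // same trip count as `for y in 1..h-1`
    assert_eq!(outer_iterations(10), 8);
}
```

This is why relooping compilers can always express goto-free source loops in wasm's four primitives.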
把 hot.wasm 喂给 V8,默认会先用 Liftoff 编译。在 Chrome 里用 --print-wasm-code dump 出 Liftoff 生成的 x86-64:
Feed hot.wasm to V8 and Liftoff compiles first by default. Use --print-wasm-code in Chrome to dump the generated x86-64:
```asm
; Liftoff output for blur3 (excerpt of inner body)
push rbp
mov rbp, rsp
sub rsp, 0x30                        ; reserve 6 slots for locals
mov [rbp-0x08], rdi                  ; spill $src (arg 0)
mov [rbp-0x10], rsi                  ; spill $dst (arg 1)
...
; inner: load src[idx]
mov rax, [rbp-0x08]                  ; rax = $src
mov rcx, [rbp-0x18]                  ; rcx = computed index
movzx edx, byte ptr [r15+rax+rcx]    ; bounds-check via r15 base
add [rbp-0x28], edx                  ; sum += byte
...
; tier-up trigger
cmp dword ptr [r13+0x40], 0x100
jne +0x4
call WasmCompileLazy
```
Liftoff 的输出有几个标志:(1) 几乎所有局部变量都 spill 到栈上,不做寄存器分配——这让 codegen 走单遍;(2) r15 是 V8 约定的 "wasm memory base" 寄存器,所有 load/store 都通过它做基址相对寻址,自带越界检查(用大段保留页 + signal handler);(3) 函数尾巴塞了一个 tier-up 计数器,每次进入函数就 cmp 一下,达到阈值就触发后台 TurboFan 重编译——这是 Ch14 / Ch15 的故事。
Marks of Liftoff: (1) nearly every local spills to the stack — no register allocation, single-pass; (2) r15 is V8's "wasm memory base" register; every load/store uses it as the base, with bounds checking via guard pages + signal handlers; (3) the function tail packs a tier-up counter, cmp'd on each entry — when the threshold trips, a background TurboFan recompile fires. That's the Ch14 / Ch15 story.
```asm
; TurboFan output for blur3 (excerpt of inner body)
mov edi, [r15+rcx]                   ; row 0 starting load — held in reg, not spilled
movzx eax, dil
movzx ebx, byte ptr [r15+rcx+1]
add eax, ebx                         ; 9 loads, 8 adds — all in regs
movzx ebx, byte ptr [r15+rcx+2]
add eax, ebx
...
mov ebx, 0x38e38e39                  ; (2^33 + 1) / 9 magic for div-by-9
mul ebx
shr edx, 1                           ; quotient lands in edx
mov byte ptr [r15+rdi], dl           ; store the average
add ecx, 1                           ; x++
cmp ecx, esi
jl -0x53
```
TurboFan 输出几乎就是手写汇编的样子——寄存器分配把 9 次 load 的中间结果留在 eax/ebx 里,sum / 9 被识别成"除以常数",用魔数乘法(0x38e38e39,即 (2³³+1)/9)替换了昂贵的 div 指令。这一招 LLVM 也会做(Hacker's Delight 第 10 章),V8 的 TurboFan 把它原样搬过来。这是 wasm "原生 80%" 的具体形式。
TurboFan output reads like hand-written assembly — the register allocator keeps the 9 loads in eax/ebx, and sum / 9 is recognised as "divide by constant" and replaced with magic-number multiplication (0x38e38e39, i.e. (2³³+1)/9). LLVM does the same trick (Hacker's Delight Ch10); V8 ports it over. This is the concrete form of wasm's "80% of native".
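The magic-number trick is easy to verify by hand. A sketch of ours: multiply by M = 0x38E38E39 = (2³³+1)/9 and shift right 33, which equals n/9 for every u32 n (the error term n/(9·2³³) is always below 1/18, too small to push the floor over):

```rust
// Verify the divide-by-9 strength reduction the compilers emit:
// n / 9 == (n * 0x38E38E39) >> 33 for all 32-bit unsigned n.
fn div9_magic(n: u32) -> u32 {
    ((n as u64 * 0x38E3_8E39) >> 33) as u32
}

fn main() {
    for n in [0u32, 1, 8, 9, 10, 81, 12345, u32::MAX] {
        assert_eq!(div9_magic(n), n / 9);
    }
}
```

One multiply plus one shift replaces a division that costs tens of cycles on most x86 cores.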
如果你打开 RUSTFLAGS="-C target-feature=+simd128",LLVM 会把这段代码完全向量化——同样的 inner loop 变成一条 v128.load + 一条 v128.add 即可处理 16 个像素。Ch19 会把向量化全过程展开。这里先给一行 punch line:
Add RUSTFLAGS="-C target-feature=+simd128" and LLVM vectorises completely — the inner loop becomes one v128.load + one v128.add processing 16 pixels. Ch19 unfolds the full vectorisation. One-line punchline:
Liftoff 不做寄存器分配,但它的输出已经远胜 JS:① 类型确定,无 IC,无 deopt;② 内存是 Uint8Array 的一片,无 GC write barrier;③ sum/9 单次 div,JS 要先 ToInt32 再 ToUInt8,慢 3 倍。翻成机器码的"愚蠢版"已经胜过 JS 的"聪明版"。Liftoff 名字来自飞机起飞——快到不需要长跑道。
Liftoff does not do register allocation, but its cost-per-byte-of-output is already much lower than JS: ① types are settled, no IC, no deopt; ② memory is a slice of Uint8Array, no GC write barrier; ③ sum/9 is one div, JS has to ToInt32 then ToUInt8 — 3× slower. The "dumb" wasm machine code already beats the "clever" JS machine code. Liftoff is named after takeoff — fast enough not to need a runway.
| Ch | 在那一章里它是什么 / What it looks like there |
|---|---|
| 06 | 前 8 字节:\0asm magic + version / First 8 bytes: \0asm magic + version |
| 07 | 展开它的 11 个 section / Its 11 sections fully unpacked |
| 08 | i32 占 99%,出现 i32→u8 的窄化 / 99% i32, with i32→u8 narrowing |
| 09 | 用到哪 6 类 opcode / Which 6 opcode families it touches |
| 10 | src + dst 在 linear memory 的 layout / src + dst layout in linear memory |
| 11 | 验证时类型栈一步一步走 / Type stack walks during validation |
| 12 / 13 | 流式 decode,边下边编 / Streaming decode, compile-while-fetching |
| 14 | Liftoff 出的机器码 / Liftoff's machine code |
| 15 | TurboFan 的 sea-of-nodes 图 / TurboFan's sea-of-nodes graph |
| 16 | 实例化时 memory 怎么分配 / How instantiation allocates memory |
| 17 | 从 JS 调它要花多少 ns / JS calling it: how many ns per call |
| 18 / 19 / 20 / 21 / 22 | 线程版 / SIMD 版 / wasm-GC 改写 / 组件模型导出 — Threaded · SIMD · GC-rewrite · Component-exported |
| 23 ~ 25 | 性能分析、DevTools 调试、移植到 Figma/Photoshop 的回响 / Perf profile, DevTools debug, echoes in Figma / Photoshop |
下面是 hot.rs 从源码到屏幕的12 个快照。每一格是这段代码在那一秒的实际面貌——左边五格是"静态形态",中间四格是"编译动作",右边三格是"运行时事件"。读完这张图,你应该能在脑里把后面 Act III/IV/V 的每一章对应回这条主线上。后面 8 个章节(Ch11-Ch19)的顶部都挂了一个 "MAIN-LINE STOP X/12" 胶囊,告诉你"你现在站在哪一格"。
Below: 12 snapshots of hot.rs, from source to pixel. Each cell is what the code actually looks like at that moment — the first five cells are static forms, the middle four are compiler actions, the last three are runtime events. After this image, every chapter in Acts III/IV/V can be slotted back onto this main-line. Eight later chapters (Ch11–Ch19) carry a "MAIN-LINE STOP X/12" capsule at the top telling you "which cell you're standing in".
从 cargo build 到屏幕上一个像素,11 行 Rust 走过 12 个快照——前 4 格是开发者机器上发生的事(rustc → LLVM IR → wat → .wasm 字节),第 5 格是CDN 到浏览器的传输,中间 6 格是渲染进程内的编译与第一次执行,最后 1 格是 SIMD 向量化版本在 GPU 显示前的最后一秒。每格底部标了对应章节,后面 8 章的顶部都挂"MAIN-LINE STOP X/12" 胶囊回引这张图。读完这张图你就拿到了整篇文章的骨架。
From cargo build to a pixel on screen, 11 lines of Rust pass through 12 snapshots — the first four cells happen on the developer's machine (rustc → LLVM IR → wat → .wasm bytes), the fifth is CDN to browser transport, the middle six are compilation and first execution inside the renderer process, and the last is the SIMD-vectorised version one heartbeat before the GPU lights the pixel. Every cell is anchored to a chapter; the next eight chapters carry a "MAIN-LINE STOP X/12" capsule at the top linking back here. Read this picture and you hold the article's skeleton.
→ 向右拖动可查看完整 12 格
→ scroll horizontally to see all 12 cells
11 行 Rust,要走 17 道工序
才能在屏幕上动一个像素。 The Hot Loop · main-line
11 lines of Rust, 17 stages,
before one pixel moves on screen. The Hot Loop · main-line
这一段是 wasm 的解剖学:外壳怎么定形,11 段 section 各自存什么,类型系统从 4 个数字类型怎么膨胀到 v128 和 GC,400+ 条 opcode 怎么塞进单字节,线性内存为什么 64 KiB 一页,以及——验证算法为什么能在线性时间里证明类型安全。这 6 章读完,你拿到一个 .wasm 文件可以一字一字读出来。
This act is wasm's anatomy: how the shell is shaped, what the 11 sections each carry, how the type system grew from four numeric types into v128 and GC, how 400+ opcodes pack into single bytes, why linear memory is 64 KiB per page, and — how validation proves type safety in linear time. After these six chapters you can pick up a .wasm file and read it byte by byte.
\0asm + version + 11 个 section
\0asm + version + 11 sections
每个 .wasm 文件的前 8 字节是固定的:00 61 73 6d 01 00 00 00——magic 4 字节(\0asm),版本 4 字节(目前是 1)。后面跟着一个 section 序列,每段一个 ID + LEB128 长度 + 内容。仅此而已。
The first 8 bytes of every .wasm file are fixed: 00 61 73 6d 01 00 00 00 — magic 4 bytes (\0asm), version 4 bytes (currently 1). Then comes a sequence of sections: each is an ID byte + LEB128 length + payload. That's all.
这个设计有两个目标:① 第 0 字节 = 0x00 让任何把它当 JS 解析的工具立刻报错;② section 用 ID + length 而非偏移表,允许流式解析——边下载边解码。Ch12 会用到这个性质。
Two goals: ① byte 0 = 0x00 ensures any tool that tries to parse the file as JS fails immediately; ② sections use ID + length (not an offset table) to enable streaming parse — decode while downloading. Ch12 hinges on this.
v8/src/wasm/module-decoder.cc :: DecodeModule()
192 字节里,code section 占 79%——其余 4 个 section 加上 8 字节 header 只占 21%。这是 wasm "structure 紧凑,代码占比高" 的可视证据。magic + version 8 字节让任何 JS 解析器立刻报错;接下来每个 section 都是 id + LEB128 length + payload 的三段式。
Of 192 bytes, the code section is 79% — the other four sections plus the 8-byte header account for 21%. Visible proof of wasm's "compact structure, code-heavy ratio". The magic + version 8 bytes make any JS parser fail instantly; every following section follows id + LEB128 length + payload.
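That three-part layout can be walked mechanically. A minimal section walker — a sketch under the layout rules above, nothing like V8's real decoder — that checks the header and lists (id, length) pairs:

```rust
// Minimal .wasm section walker: verify \0asm + version 1, then iterate
// id + LEB128 length + payload until the bytes run out.
fn decode_uleb(bytes: &[u8], mut pos: usize) -> (u64, usize) {
    let (mut value, mut shift) = (0u64, 0);
    loop {
        let b = bytes[pos];
        pos += 1;
        value |= ((b & 0x7F) as u64) << shift;
        if b & 0x80 == 0 {
            return (value, pos);
        }
        shift += 7;
    }
}

fn sections(bytes: &[u8]) -> Option<Vec<(u8, u64)>> {
    if bytes.len() < 8 || bytes[0..4] != *b"\0asm" || bytes[4..8] != [1, 0, 0, 0] {
        return None; // not a wasm v1 module
    }
    let mut out = Vec::new();
    let mut pos = 8;
    while pos < bytes.len() {
        let id = bytes[pos];
        let (len, next) = decode_uleb(bytes, pos + 1);
        out.push((id, len));
        pos = next + len as usize; // skip the payload, go to the next section
    }
    Some(out)
}

fn main() {
    // header + a hand-built type section: one functype () -> ()
    let module = [
        0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00, // \0asm + version 1
        0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // id=1, len=4, 1 type
    ];
    assert_eq!(sections(&module), Some(vec![(1u8, 4u64)]));
}
```

Because each section carries its own length up front, this loop can run on a partially downloaded stream — which is exactly the streaming property Ch12 relies on.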
魔数是 \0asm。第一字节 NULL 让 cat file.wasm | node 立刻抛 SyntaxError。
The magic is \0asm. Byte 0 = NULL makes cat file.wasm | node throw SyntaxError instantly.
0x01 = Type section,后面紧跟 LEB128 编码的长度(可变长 1~5 字节)。
0x01 = Type section, immediately followed by the LEB128-encoded length (variable 1–5 bytes).
Payload 的格式由 section 类型决定。除 Custom(0x00)外,其余 section 必须按 ID 升序出现。
Payload format is determined by section type. Except for Custom (0x00), sections must appear in ascending ID order.
LEB128 = Little Endian Base-128,一种变长整数编码:每字节 7 个数据位,最高位 1 表示"后面还有",0 表示"结束"。0~127 用 1 字节,128~16383 用 2 字节,以此类推。它为 DWARF 调试格式发明,wasm 拿来用——因为大多数 wasm 整数都很小(类型索引、locals 数、跳转目标),平均不到 2 字节。几乎所有 wasm 整数(部分立即数除外)都是 LEB128 编码的。
LEB128 = Little Endian Base-128. A variable-length integer encoding: 7 data bits per byte, top bit = 1 means "more coming", 0 means "done". 0–127 use 1 byte, 128–16383 use 2 bytes, and so on. Invented for DWARF, adopted by wasm — because most wasm integers are small (type indices, local counts, branch targets), averaging < 2 bytes. Nearly every wasm integer (except some immediate operands) is LEB128-encoded.
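The encoding rule fits in a few lines. A sketch of ours for the unsigned variant, checked against DWARF's textbook example (624485 → E5 8E 26):

```rust
// Unsigned LEB128 as wasm uses it: emit 7 data bits per byte, LSB first;
// set the high bit on every byte except the last.
fn uleb128(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte); // high bit clear: this is the final byte
            break;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
    out
}

fn main() {
    assert_eq!(uleb128(127), vec![0x7F]);                // still 1 byte
    assert_eq!(uleb128(128), vec![0x80, 0x01]);          // first 2-byte value
    assert_eq!(uleb128(624485), vec![0xE5, 0x8E, 0x26]); // DWARF's classic example
}
```

Most indices in a real module land in the 1-byte range, which is where the "averaging < 2 bytes" figure comes from.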
11 个 section 按引用方向分四类:蓝声明(谁存在)、绿函数体(真正的字节码)、橙初始化器(给 table/memory 灌数据)、紫宿主接口(import/export/start)。这四类必须按 ID 升序出现——只有 Custom 段(灰)可以出现在任何地方,出现多少次都行。
The 11 sections split four ways by reference direction: blue declarations (who exists), green bodies (the real bytecode), orange initialisers (filling tables/memory), purple host-facing (import/export/start). The four must appear in ascending ID order — only Custom (grey) may appear anywhere, any number of times.
Custom section 是规范留给所有人的逃生舱口——它没有规定的格式,只有一个名字(LEB128 长度 + UTF-8 字节)和任意 payload。DWARF 调试信息、source map、wasm-bindgen 的 JS 胶水都藏在这里。Ch24 会展开 name custom section,它给函数和局部变量起名,让 DevTools 能显示符号。
The Custom section is the spec's escape hatch for everyone — no prescribed format, just a name (LEB128 length + UTF-8 bytes) and arbitrary payload. DWARF debug info, source maps, and wasm-bindgen's JS glue all hide here. Ch24 unpacks the name custom section, which names functions and locals so DevTools can show symbols.
回看 Act II 给的 192 字节十六进制,排查 section ID:01(type)、03(function)、05(memory)、07(export)、0a(code)。没有 import,没有 table,没有 global,没有 data——因为我们的卷积函数不依赖宿主、不做间接调用、没有模块级常量、不预填内存。最小可运行的 wasm 模块就是这 5 个 section。
Re-read the 192-byte hex from Act II and you'll find section IDs: 01 (type), 03 (function), 05 (memory), 07 (export), 0a (code). No import, table, global, or data — because our blur function imports nothing, uses no indirect calls, has no module-level constants, and pre-fills no memory. The minimum runnable wasm module is exactly these five sections.
每一段都是一个 K-V 仓库
each section is a K-V vault
看一个 .wasm 文件最容易的方式,就是把它当成一组按 ID 升序排列的 K-V 仓库。每个 section 解一个具体问题。这一章把 11 个 section 各拆一遍,每段给一个最小例子 + 在主线 Hot Loop 里的角色。
The easiest way to read a .wasm file is to treat it as a sequence of K-V vaults, ordered by ID. Each section answers one specific question. This chapter walks all 11, with a minimal example and the role each plays in the main-line Hot Loop.
问题:"函数 $blur3 长什么样?"
答:"它是 type[0]:(i32 i32 i32 i32) -> ()"——所有 module 内出现的函数签名先注册一遍,后面引用用索引。
Question: "What does $blur3 look like?"
Answer: "It's type[0]: (i32 i32 i32 i32) -> ()" — every signature used in the module is registered up front, later referenced by index.
(type $t0 (func (param i32 i32 i32 i32))) ;; void return implied
为什么把签名独立成表?因为同一个签名会被多个函数共用——主线里只有一个函数,签名表有 1 条。但 Photoshop 的 wasm 里有几十万个函数,只用到几百种签名,共用让 type section 体积小 2~3 个数量级。
Why pull signatures into their own table? Because the same signature is shared among many functions — the main-line has 1 function, so 1 entry. Photoshop's wasm has hundreds of thousands of functions but only hundreds of distinct signatures; sharing collapses the type section by 2–3 orders of magnitude.
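The dedup is just interning. A tiny sketch (ours, with signatures as illustrative strings — real wasm keys on the structural functype, not text):

```rust
// Intern function signatures: identical signatures share one type index,
// which is what keeps the type section small.
use std::collections::HashMap;

fn intern<'a>(
    table: &mut Vec<&'a str>,
    index: &mut HashMap<&'a str, u32>,
    sig: &'a str,
) -> u32 {
    *index.entry(sig).or_insert_with(|| {
        table.push(sig); // first sighting: append and hand out a fresh index
        (table.len() - 1) as u32
    })
}

fn main() {
    let mut table = Vec::new();
    let mut index = HashMap::new();
    // three functions, two distinct signatures → type section has 2 entries
    let f0 = intern(&mut table, &mut index, "(i32 i32 i32 i32) -> ()");
    let f1 = intern(&mut table, &mut index, "(i32) -> (i32)");
    let f2 = intern(&mut table, &mut index, "(i32) -> (i32)");
    assert_eq!((f0, f1, f2), (0, 1, 1));
    assert_eq!(table.len(), 2);
}
```

The function section then stores only these small indices, one LEB128 integer per function.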
主线 Hot Loop 没有 import——纯数学函数,不依赖任何 JS API。但下面是 Photoshop 的实际 import section 缩影:
The main-line Hot Loop has no imports — pure math, no JS-side dependency. Below is a snapshot of Photoshop's real import section:
```wat
(import "env" "memory" (memory 256 32768 shared))
(import "env" "__indirect_func_table" (table 4096 funcref))
(import "env" "emscripten_resize_heap" (func (param i32) (result i32)))
(import "wasi_snapshot_preview1" "fd_write" (func (param i32 i32 i32 i32) (result i32)))
```
四个观察:① 每一条 import 是两段名字 + 一个描述符(函数是签名,memory/table 是 limits)——"env" "memory" 是惯例,Emscripten 用 "env" 当 module 名;② memory 可以被 import——这是多线程 wasm 共享内存的关键;③ table 也能 import,允许 JS 给函数指针填值;④ WASI 函数通过 import 引入,在浏览器外的 wasm 里这是主要的"系统调用"通道。
Four notes: ① each import is a two-part name plus a descriptor — a signature for functions, limits for memory/table ("env" "memory" is the Emscripten convention); ② memory itself can be imported — that's the foundation of shared-memory multi-threaded wasm; ③ tables too, letting JS populate function pointers; ④ WASI functions enter via imports, which is the primary "syscall" channel for non-browser wasm.
这一段长得最简洁——就是一个 type index 数组:"函数 0 用 type[0],函数 1 用 type[2],函数 2 用 type[2],..."。函数体本身不在这里,它们在 Code section(0x0a)。把"签名声明"和"函数体"分开是为了流式解码——下载到 function section 就能开始检查 import/export 的类型匹配,不必等 code 段下完。
The plainest section — just an array of type indices: "function 0 is type[0], function 1 is type[2], function 2 is type[2], …". The body lives elsewhere, in the Code section (0x0a). The split between "signature declaration" and "body" exists for streaming decode — once function section is in, you can check import/export type matching without waiting for code.
Table 是 wasm 的 "函数指针表",最初是为了 C 函数指针 / C++ vtable / Java 接口分发服务。每个 table 元素是 funcref(MVP)或 externref(2021)。call_indirect 指令用 table 索引 + 类型 ID 调用——类型 ID 必须匹配,否则 trap。Ch09 / Ch11 会展开。
Table is wasm's "function pointer table", born to serve C function pointers / C++ vtables / Java interfaces. Each element is funcref (MVP) or externref (2021). call_indirect uses (table idx + type id) to dispatch — the type id must match or it traps. Expanded in Ch09 / Ch11.
2021 年 reference-types 提案前,一个 module 只能有一张 table。之后可以有多张。主线 Hot Loop 不用 table——它没有间接调用。
Before reference-types (2021), a module could carry only one table. After: multiple. The main-line Hot Loop uses no table — no indirect call.
主线声明 (memory 1)——min=1 page=64 KiB,max 不指定。Ch10 完整展开线性内存。
Main-line declares (memory 1) — min=1 page=64 KiB, max unspecified. Ch10 covers linear memory fully.
(global $stack_top (mut i32) (i32.const 0x10000))
(global $PI f64 (f64.const 3.14159265358979))
每个 global 是 (type, mut?, init expr) 三件套。mut 标记可写,initialiser 是一段受限的常量表达式(只能用 iN.const / fN.const / global.get)。Rust / C 的 static 数据如果是常量就来这里,如果是读写就放到 linear memory 的 data section。
Each global is (type, mut?, init expr). mut means writable; initialiser is a constant expression (only iN.const / fN.const / global.get). Rust/C static data lives here when constant; mutable static data goes into linear memory via the data section.
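The same (type, mut?, init) triple is exposed verbatim through the JS API. A quick node sketch mirroring the two WAT globals above:

```javascript
// The two globals from the WAT snippet, built via WebAssembly.Global.
// Value type strings mirror the wasm value types: 'i32' / 'i64' / 'f32' / 'f64'.
const stackTop = new WebAssembly.Global({ value: 'i32', mutable: true }, 0x10000);
const pi = new WebAssembly.Global({ value: 'f64' }, 3.14159265358979); // immutable by default

stackTop.value -= 16;                    // (mut i32): writable from JS too
console.log(stackTop.value);             // 65520

try {
  pi.value = 3;                          // immutable global: assignment traps in JS...
} catch (e) {
  console.log(e instanceof TypeError);   // ...as a TypeError: true
}
```

The `mutable: false` default is the JS-side shadow of the `mut` flag in the binary encoding.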
(export "memory" (memory 0))
(export "blur3" (func $blur3))
(export "alloc" (func $alloc))
name → (kind, index) 的字典。kind 可以是 func / table / memory / global / tag(tag 是 exception handling 提案加的)。所有从 JS 调 wasm 的入口都在这里。 JS 那边的 instance.exports.blur3 就是查这张表。
A name → (kind, index) dictionary. Kind ∈ {func, table, memory, global, tag} (tag added by exception handling). Every JS-to-wasm entry point lives here. JS-side instance.exports.blur3 looks up this very table.
仅一个数字——某个函数的索引。该函数不能有参数,不能有返回值,在 module instantiate 完成的最后阶段被引擎自动调用。用来做模块级初始化(注册回调、填充常量表)。主线 Hot Loop 没有 start。
Just one number — the index of a function. The start function takes no params, returns nothing, and is invoked automatically by the engine at the end of instantiation. Used for module-level setup (registering callbacks, filling constant tables). Main-line Hot Loop omits start.
语义类似 data section,但写入对象是 table 而非 memory。一个 module 实例化时,element 段把 funcref 们填进对应 table 槽位。C 程序的"函数指针表"就在这里活;C++ 的 vtable 也是。
Semantically similar to data section but writing into tables rather than memory. On instantiation, element segments populate funcref slots. C function pointer tables live here; so do C++ vtables.
这是最大的一段——主线 Hot Loop 的 code section 占整个文件的 80% 字节。每个函数体的格式是:locals 声明(类型聚合表)+ 表达式序列 + 终止符 0x0b (end)。Ch09 把指令格式撕开。
The biggest section — the Hot Loop's code section is 80% of the entire file. Each function body's format is: locals declaration (run-length encoded by type) + expression sequence + terminator 0x0b (end). Ch09 unpacks the instruction format.
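The run-length locals declaration decodes in a few lines. A sketch (the input bytes are a made-up example, not the Hot Loop's own locals):

```javascript
// Decode a code-section locals declaration:
// uleb(number of runs), then (uleb(run length), valtype byte) pairs.
const VALTYPE = { 0x7f: 'i32', 0x7e: 'i64', 0x7d: 'f32', 0x7c: 'f64' };

function decodeLocals(bytes) {
  let i = 0;
  const uleb = () => {                   // unsigned LEB128, enough for small counts
    let n = 0, shift = 0, b;
    do { b = bytes[i++]; n |= (b & 0x7f) << shift; shift += 7; } while (b & 0x80);
    return n;
  };
  const locals = [];
  const runs = uleb();
  for (let r = 0; r < runs; r++) {
    const count = uleb();
    const type = VALTYPE[bytes[i++]];
    for (let k = 0; k < count; k++) locals.push(type);
  }
  return locals;
}

// 2 runs: 3 × i32, then 1 × f64
console.log(decodeLocals([0x02, 0x03, 0x7f, 0x01, 0x7c]));
// -> ['i32', 'i32', 'i32', 'f64']
```

A thousand identical locals cost three bytes, which is why compilers sort locals by type before emitting.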
把"这段字节请在实例化时写到 linear memory 的某地址"批量声明。C 程序的字符串字面量、Rust 的 static 数组、Emscripten 的 stdlib 数据表都在这里。MVP 时每条数据段必须 active(立即写入);bulk memory 提案(2020)加了 passive 模式,允许 wasm 代码显式调 memory.init 来用——支持代码热更新。
Bulk-declares "at instantiation, write these bytes to memory at address X". C string literals, Rust static arrays, Emscripten's stdlib tables all live here. MVP required every segment to be active (written immediately). The bulk-memory proposal (2020) added passive mode, letting code call memory.init explicitly — supports hot reload.
2020 年加 bulk memory 后,memory.init segIdx 指令需要在验证时立刻知道 data 段总数。但 code section 在 data section 前面解析——为了不让 validator 反复回扫,设计者插入了一个新 section 0x0c,专门告诉解码器"我有 N 个 data 段"。这是 wasm spec 仅有的"事后补丁"section,反映了流式解析的硬约束。
Bulk memory (2020) made memory.init segIdx need to know the total number of data segments during validation. But code section parses before data — to spare the validator from a back-scan, the designers slipped in a new section 0x0c that just says "I have N data segments". It's the only "retroactive patch" section in the spec, reflecting the hard constraint of streaming parse.
Section 不是设计,是约束的化石。 Field Note · 03
Sections are not design.
They are fossilised constraints. Field Note · 03
小到 1 字节,大到任意结构
one byte to arbitrary structure
2017 年 MVP 上线时,wasm 一共只有 4 种值类型:i32 / i64 / f32 / f64。理由极其务实:这是所有 CPU 都能直接处理的 4 种,JIT 不需要费力适配。九年后的今天,加上 SIMD 的 v128、reference 的 funcref/externref、以及 wasm-GC 的 struct/array/i31,wasm 已经有了"近似于一门完整语言"的类型系统——但每一次扩张都要回答同一个问题:新类型怎么不破坏栈机的"一字节 opcode"承诺?
The 2017 MVP shipped with just four value types: i32 / i64 / f32 / f64. The reasoning was ruthlessly practical: these are the four that every CPU handles natively, so the JIT has no fitting to do. Nine years on, with SIMD's v128, reference-types' funcref/externref, and wasm-GC's struct/array/i31, wasm now has a type system that "looks like a real language". But every expansion answers the same question: how does the new type not break the stack machine's "one-byte opcode" promise?
2024 之后,wasm 值类型分两个世界:左半是 5 个独立的原始类型(i32/i64/f32/f64/v128),没有子类型关系;右半是引用类型 lattice——顶层 anyref,下分 eqref / funcref / externref,再下到具体 struct / array / 函数签名引用。所有 nullable 引用最终指向 nullref。
After 2024, wasm value types split into two worlds: left — 5 independent primitive types (i32/i64/f32/f64/v128), no subtyping; right — a reference type lattice topped by anyref, descending into eqref / funcref / externref, then concrete struct / array / function-sig references. All nullable refs ultimately point to nullref.
| Category | Type | Size | Tag (encoding) | Since | What |
|---|---|---|---|---|---|
| numeric | i32 | 4 byte | 0x7F | MVP | 32 位整数(符号自指令)32-bit integer (sign-per-op) |
| | i64 | 8 byte | 0x7E | MVP | 64 位整数64-bit integer |
| | f32 | 4 byte | 0x7D | MVP | IEEE 754 single |
| | f64 | 8 byte | 0x7C | MVP | IEEE 754 double |
| vector | v128 | 16 byte | 0x7B | 2021 | 128 位 SIMD,可解释成 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64128-bit SIMD, viewable as 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64 |
| reference | funcref | ptr | 0x70 | 2021 | 指向 wasm 函数的不透明引用opaque reference to a wasm function |
| | externref | ptr | 0x6F | 2021 | 指向宿主对象(JS Object / DOM 节点)reference to a host object (JS Object / DOM node) |
| GC (2024) | (ref $T) | ptr | 0x6B | 2024 | 指向 struct / array 的强类型引用typed reference to a struct or array |
| | (ref null $T) | ptr | 0x6C | 2024 | 允许为 null 的版本nullable version |
| | i31ref | 31-bit | 0x6C+ | 2024 | SMI 风格的内联小整数(避免堆分配)SMI-style inline small int (skip heap alloc) |
注意 i32 的 tag 是 0x7F = -1,i64 是 0x7E = -2——这些是 signed LEB128 编码的小负数。规范选负数空间是有意的:正数空间留给"type index"(给 GC 用),这样验证器一字节就能判断"这是基本类型还是 struct 引用"。tag 设计是为 GC 的未来留的接口——MVP 时代设计者已经预想到这一步。
Note i32's tag is 0x7F = -1, i64 is 0x7E = -2 — these are signed LEB128 small negatives. The negative space was deliberate: positive space is reserved for type indices (for GC), so the validator can decide "basic type vs struct reference" in one byte. The tag design is the interface MVP designers left for GC's future — anticipated from day one.
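The "small negative" claim is easy to check with a signed LEB128 decoder, the encoding behind both the type tags and `i32.const`/`i64.const` immediates:

```javascript
// Signed LEB128 decode: 7 payload bits per byte, high bit = continuation,
// bit 6 of the final byte = sign. BigInt so i64-range values also work.
function sleb128(bytes) {
  let result = 0n, shift = 0n, i = 0, b;
  do {
    b = bytes[i++];
    result |= BigInt(b & 0x7f) << shift;
    shift += 7n;
  } while (b & 0x80);
  if (b & 0x40) result -= 1n << shift;   // sign bit set -> sign-extend
  return result;
}

console.log(sleb128([0x7f]));             // -1n  -> the i32 type tag
console.log(sleb128([0x7e]));             // -2n  -> the i64 type tag
console.log(sleb128([0x09]));             //  9n  -> the immediate in `i32.const 9`
console.log(sleb128([0xc0, 0xbb, 0x78])); // -123456n -> a multi-byte immediate
```

One decoder serves tags, immediates, and branch-table entries alike, which is part of why wasm decoders stay so small.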
在 Type section 里出现的 (func (param i32 i32) (result i32)) 用编码 0x60 引导。MVP 时只有 func 这一种"组合类型",GC 提案后加了 0x5F = struct 和 0x5E = array——把 Type section 从"函数签名表"扩成了"组合类型表"。同一段 binary 在 2017 年和 2026 年解析出来的"section 0x01"含义已经悄悄扩张了一倍。
In Type section, (func (param i32 i32) (result i32)) begins with tag 0x60. The MVP had only this one "compound type". The GC proposal added 0x5F = struct and 0x5E = array — quietly stretching Type section from "signature table" to "compound-type table". The same byte (section 0x01) means twice as much in 2026 as in 2017.
"为什么 wasm 没有 i8 类型?字符串处理要怎么办?"——这是另一个常见疑问。答案:wasm 的值类型不区分 i8/i16/i32,但内存读写有 i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s——读 8/16 位 byte,符号或零扩展到 i32。窄类型只存在于 memory 边界,寄存器里永远是 i32 或 i64。
"Why no i8? How do you process strings?" — another perennial question. Answer: wasm's value types don't distinguish i8/i16/i32, but memory access does: i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s read 8/16-bit bytes and sign- or zero-extend to i32. Narrow types exist only at the memory boundary; in registers, everything is i32 or i64.
同理无符号 vs 有符号区分也只活在指令层面:i32.div_s(signed) vs i32.div_u(unsigned)、i32.lt_s vs i32.lt_u。"类型只标 32/64 位,符号由 op 携带"是 wasm 的核心设计简化——让值类型集合保持小,降低验证器和 JIT 的复杂度。
Same with signed-vs-unsigned: it lives at the opcode layer, not the type layer — i32.div_s vs i32.div_u, i32.lt_s vs i32.lt_u. "Types carry width only; signedness rides on the op" is a core simplification. It keeps the value-type set small and shrinks both validator and JIT.
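What the `_s`/`_u` suffix does to the same memory byte is plain two's-complement extension. It can be mimicked in JS with shifts:

```javascript
// What i32.load8_u vs i32.load8_s produce from the same raw byte.
const byte = 0xFF;                        // raw byte sitting in linear memory

const zeroExtended = byte & 0xff;         // i32.load8_u -> 255
const signExtended = (byte << 24) >> 24;  // i32.load8_s -> -1

console.log(zeroExtended, signExtended);  // 255 -1

// Same story at 16 bits:
console.log(0x8000 & 0xffff);             // i32.load16_u -> 32768
console.log((0x8000 << 16) >> 16);        // i32.load16_s -> -32768
```

The register value is identical 32 bits either way; only the extension rule at load time differs, which is exactly why the type system never needs an i8.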
主线 Hot Loop 几乎是 100% i32——这是 wasm 的常态。绝大多数 LLVM 后端在 wasm32 目标上把 usize / size_t 编译成 i32(因为 wasm32 上指针就是 32 位),数组下标也是 i32。f64 主要出现在浮点计算场景,i64 出现在 BigInt 场景。如果你 grep 一个真实 wasm 模块,i32. 开头的 opcode 占 70% ~ 90%。
The main-line is ~100% i32 — wasm's norm. Most LLVM backends compile usize / size_t to i32 on wasm32 (pointers are 32-bit), and array indices are i32. f64 shows up in floating-point math; i64 in BigInt scenarios. Grep any real wasm module and 70–90% of opcodes start with i32.
six families, one prefix scheme
MVP 时 wasm 用了 256 个 opcode 空间里的 190 个左右。后来 SIMD / Bulk Memory / Reference Types / GC / Atomics 每个提案都要加新指令,字节空间不够了。解法是多字节 opcode:第一字节用一个保留值(0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC),后跟一个 LEB128 子 opcode。单字节空间维持紧凑,扩展走 prefix。
The MVP used ~190 of the 256 opcode slots. SIMD / Bulk Memory / Reference Types / GC / Atomics each demanded new ops, and the byte space ran short. The fix: multi-byte opcodes. A reserved first byte (0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC) followed by a LEB128 sub-opcode. The single-byte space stays compact; extensions ride the prefix.
256 个单字节 opcode 槽位里,numeric + memory + 控制流 占去六成多。底部 0xFB-0xFE 四个紫色格是 prefix 字节——每个 prefix 后跟一个 LEB128 子 opcode,把扩展空间延展到无穷。2017 MVP 时 只有上半部分被占用,所有 2019 后加的指令(SIMD/GC/Atomics)都缩在这四个 prefix 后面。
Of 256 single-byte opcode slots, numeric + memory + control occupy over 60%. The four purple cells at the bottom (0xFB–0xFE) are prefix bytes — each followed by a LEB128 sub-opcode, extending the space without bound. In the 2017 MVP only the upper half was filled; every post-2019 op (SIMD/GC/Atomics) lives behind these four prefixes.
block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop。没有 goto。结构化控制是 wasm 的硬约束,Ch11 的验证算法依赖这个。
block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop. No goto. Structured control is a hard invariant — Ch11's validator depends on it.
drop / select / local.get / local.set / local.tee / global.get / global.set。tee 是 set 的"留个备份在栈顶"版。
drop / select / local.get / local.set / local.tee / global.get / global.set. tee is set that also keeps a copy on the stack top.
i32.load / i32.load8_s / i32.load8_u / ... / i32.store / i32.store8 / memory.size / memory.grow。每条 load/store 带 align + offset 立即数。
i32.load / i32.load8_s / i32.load8_u / … / i32.store / i32.store8 / memory.size / memory.grow. Each load/store carries align + offset immediates.
i32.const / i64.const / f32.const / f64.const。i32/i64 立即数用 signed LEB128;f32/f64 用原始字节序。
i32.const / i64.const / f32.const / f64.const. i32/i64 immediates use signed LEB128; f32/f64 use raw IEEE bytes.
i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / ... / f64.sqrt / f64.nearest。约 130 条,覆盖 IEEE 754 算术。
i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / … / f64.sqrt / f64.nearest. ~130 ops, full IEEE 754 coverage.
0xFC nn = 饱和转换 + bulk memory;0xFD nn = SIMD(~250 op);0xFE nn = atomics(threads);0xFB nn = GC。子 opcode 用 LEB128,所以是无界的。
0xFC nn = saturating convert + bulk memory; 0xFD nn = SIMD (~250 ops); 0xFE nn = atomics (threads); 0xFB nn = GC. Sub-opcode is LEB128, so unbounded.
以 i32.load offset=4 align=2 为例。它的字节序列:
Take i32.load offset=4 align=2. Its byte sequence:
0x28 0x02 0x04   ; opcode (i32.load) · align=2 (2² = 4-byte hint) · offset=4
i32.load 在栈上 pop 一个 i32 地址,push 一个 i32 加载值。
i32.load pops one i32 address and pushes one i32 loaded value.
三字节里隐藏的设计点:① opcode 是 1 字节,空间精确;② align 是 hint 不是约束——这让 wasm 能跑在 ARM(对齐)和 x86(自由对齐)上无差别;③ offset 是常量,Liftoff 可以 fold 进 [base + reg + 4] 这种寻址模式,免一条 add。三字节里塞了三层信息。
Three bytes hide three design points: ① opcode is one byte, slot-precise; ② align is a hint, not a constraint — letting wasm run on both ARM (aligned) and x86 (free) without changes; ③ offset is constant, so Liftoff folds it into [base + reg + 4] addressing — saving one add. Three bytes, three layers.
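The encoder direction is just as small. A sketch (the helper names are mine; the immediates are unsigned LEB128, per the spec):

```javascript
// Emit a load instruction: opcode byte + uleb(align log2) + uleb(offset).
function uleb(n) {
  const out = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n) b |= 0x80;   // continuation bit
    out.push(b);
  } while (n);
  return out;
}
const emitLoad = (opcode, alignLog2, offset) => [opcode, ...uleb(alignLog2), ...uleb(offset)];

console.log(emitLoad(0x28, 2, 4));
// -> [0x28, 0x02, 0x04]  i.e. `i32.load align=2 offset=4`

// A large constant offset only grows the LEB; the opcode stays one byte:
console.log(emitLoad(0x28, 2, 0x1fb000).length);  // 1 + 1 + 3 = 5 bytes
```

Small offsets (the common case) cost one byte each, which is where the ~1.2-byte average immediate size later in this chapter comes from.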
| Family | Used | Count | Example |
|---|---|---|---|
| control | block / loop / br_if | 4 | block 0x02 40 |
| param | local.get / local.set | 14 | local.get 0x20 00 |
| memory | i32.load8_u / i32.store8 | 10 | i32.load8_u 0x2D 00 00 |
| const | i32.const | 5 | i32.const 0x41 09 (= 9) |
| numeric | i32.add / i32.sub / i32.div_u / i32.mul / i32.ge_s | 16 | i32.add 0x6A |
| prefix | — | 0 | 本主线无 SIMD,Ch19 才会出现 0xFDno SIMD; 0xFD appears in Ch19 |
49 条指令,49 字节(opcode 部分)+ 立即数(平均 1.2 字节)≈ 110 字节,加上 locals 声明 5 字节、function header 6 字节,凑成约 121 字节的 code section。再加上前面的 6 个 section header 和 export 段,合 192 字节。密度的来源在每一字节都看得见。
49 ops × (1B opcode + ~1.2B imm avg) ≈ 110 bytes, plus 5B locals declaration + 6B function header ≈ 121 bytes of code section. Add six section headers and the export segment: 192 bytes. Density is visible in every byte.
2021 年 SIMD 提案进入 phase 4 时,V8 测得 0xFD nn 的两字节 opcode 在 inner loop 里每次都要多 fetch 一字节,影响热路径性能。最终方案是在 Liftoff 阶段把 SIMD 字节序展开成两字节但 TurboFan IR 里仍按一字节代理——这是 wasm spec 罕见的"实现影响 spec"的例子,SIMD 的子 opcode 数量被精心控制在 256 内,避免出现 3 字节 opcode 的可能。
When SIMD reached phase 4 in 2021, V8 measured that the two-byte 0xFD nn opcode forced an extra fetch on every inner-loop iteration. The fix: Liftoff decodes it as two bytes, but TurboFan IR proxies it as one — a rare instance of "implementation pressuring spec". SIMD sub-opcodes are deliberately capped at 256 to avoid the 3-byte opcode scenario.
64 KiB 一页,最大 4 GiB
64 KiB per page, 4 GiB max
线性内存是 wasm 最简洁的设计之一——它就是一片连续字节,从地址 0 开始,长度是 N 个 64 KiB 的 page。所有 i32.load / i32.store 都读写这片字节。没有指针类型,没有 GC,没有别的内存空间——堆、栈、静态数据全部混在这一片。这片字节在 JS 那边是一个 WebAssembly.Memory 对象,可以 new Uint8Array(mem.buffer) 直接看到原始字节。
Linear memory is one of wasm's most distilled designs — a flat slab of bytes starting at address 0, length = N × 64 KiB pages. Every i32.load / i32.store reads or writes this slab. No pointer type, no GC, no other memory space — heap, stack, static data all share the slab. From JS, this is a WebAssembly.Memory object; new Uint8Array(mem.buffer) lets you see the bytes directly.
wasm32 的实际内存只有顶部那条窄绿条(用户的 N 个 64 KiB page),但浏览器引擎提前预留了整 4 GiB 虚拟地址空间——下面 99% 都是 PROT_NONE 陷阱区。越界访问由硬件 + signal handler 接住,JIT 出码里完全没有 cmp/jcc 边界检查指令——这就是 wasm 边界检查"免费"的真相。
wasm32's actual memory is just the thin green strip at top (user's N × 64 KiB pages), but the browser engine pre-reserves the entire 4 GiB virtual address space — 99% of it is PROT_NONE trap zone. Out-of-bounds is caught by hardware + signal handler; the JIT emits no cmp/jcc bounds-check instructions — the true source of wasm's "free" bounds checking.
为什么是 64 KiB 一页?这数字不是来自 OS 的 4 KiB / 16 KiB 内存页对齐;它来自 i32 寻址空间的一道算术:4 GiB 总空间(2^32 字节)÷ 2^16 页 = 2^16 字节 = 64 KiB 一页。选 64 KiB 是要在"页粒度太细(grow 太频繁)"和"页太大(浪费)"之间找平衡点,设计者参考了 x86 的 large page 与 ARM 的 64K granule。
Why 64 KiB per page? Not from OS 4 KiB / 16 KiB page alignment. It comes from i32 addressing arithmetic: 4 GiB of total space (2^32 bytes) ÷ 2^16 pages = 2^16 bytes = 64 KiB per page. The choice balances "too fine-grained (grow too often)" against "too coarse (waste)", with an eye on x86's large pages and ARM's 64K granule.
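The page size is directly observable from the JS API; no spec-reading required:

```javascript
// One wasm page is 64 KiB, straight from WebAssembly.Memory.
const mem = new WebAssembly.Memory({ initial: 1 });   // 1 page
console.log(mem.buffer.byteLength);                   // 65536
console.log(mem.buffer.byteLength === 1 << 16);       // true

// The arithmetic from the text: 4 GiB ÷ 2^16 pages = 64 KiB per page.
console.log(2 ** 32 / 2 ** 16);                       // 65536
```

Every `initial` / `maximum` in the JS API and every limit in the binary format counts in these 64 KiB units, never in bytes.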
每一次 load/store 都必须保证地址在 [0, memory_size) 内,否则 trap。朴素实现:cmp addr, mem_size; ja .trap;——每个内存访问加两条指令,在 inner loop 里 5~10% 开销。
Every load/store must keep its address in [0, memory_size), else trap. Naïve: cmp addr, mem_size; ja .trap; — two extra instructions per access, 5–10% overhead in an inner loop.
现代引擎用一个聪明技巧:把 wasm 的整个 4 GiB 地址空间作为虚拟保留页映射,只把 [0, memory_size) 设为可读写,后面全部设为 PROT_NONE。任何越界访问会触发 SIGSEGV,引擎挂一个 signal handler 把它翻译成 wasm trap。结果是 inner loop 里完全没有显式边界检查,跑得跟 native 几乎一样快——只在 trap 时才慢。
Modern engines use a clever trick: reserve the full 4 GiB virtual address space, mark [0, memory_size) as RW, mark the rest as PROT_NONE. Any overrun raises SIGSEGV, caught by a signal handler that translates it into a wasm trap. The result: zero explicit bounds checks in the hot loop, near-native speed — only slow on actual trap.
; Linear memory after JS-side setup
0x000000 ┌─────────────────────────────────┐
│ src image data (8 bpp grayscale)│ 1920×1080 = 2 073 600 byte
0x1FA400 ├─────────────────────────────────┤
│ padding ( 4 KiB align ) │
0x1FB000 ├─────────────────────────────────┤
│ dst image data (output) │ another 2 073 600 byte
0x3F5400 ├─────────────────────────────────┤
│ unused │ ~ 43 KiB
0x400000 └─────────────────────────────────┘ 64 pages (4 MiB)
JS 那边先 mem.grow(63) 把内存扩到 64 page = 4 MiB,然后用 Uint8ClampedArray 视图把 src 图像数据 copy 进去,调 instance.exports.blur3(0, 0x1FB000, 1920, 1080),wasm 函数对 src 做卷积写到 dst,JS 再 new Uint8ClampedArray(mem.buffer, 0x1FB000, len) 取出来显示。整个过程 没有把数据 copy 出 wasm 内存——只是不同 JS 视图共享同一片字节。这是 wasm/JS 协作的"zero copy"模式。
JS first mem.grow(63) to reach 64 pages = 4 MiB, then copies src image bytes in via Uint8ClampedArray, calls instance.exports.blur3(0, 0x1FB000, 1920, 1080), wasm convolves and writes dst, JS reads back via new Uint8ClampedArray(mem.buffer, 0x1FB000, len). The data never leaves wasm memory — different JS views share the same bytes. This is the wasm/JS "zero copy" pattern.
memory.grow 的代价:grow 是个昂贵指令——它可能触发 ArrayBuffer 重新分配(老 4 MiB 不够时申请新的 16 MiB,copy 整片字节)。grow 后所有 JS 视图 (TypedArray) 立刻被 detached,所有 wasm 那边持有的 base 指针会被引擎自动更新。这条约束让 grow 在 inner loop 里几乎是禁忌,通常只在 module 启动时或者明显边界(图像变大、文件加载)才调用。
grow is expensive — it can trigger a full ArrayBuffer realloc (allocating a fresh 16 MiB when 4 MiB runs short, then copying). After grow, all JS-side TypedArray views are detached immediately; wasm-side base pointers are auto-updated by the engine. The invariant makes grow nearly forbidden inside hot loops — typically called only at startup or coarse boundaries (image resize, file load).
类型栈的抽象解释
abstract interpretation on a type stack
"怎么证明这段二进制没有缓冲区溢出、没有未初始化变量、没有类型混乱?" Java 的解法是 bytecode verifier——一段几千行的 dataflow 分析。wasm 用了一招更猛的:把"类型栈"作为唯一的抽象状态,沿指令序列做一遍 forward sweep。算法只用一个数据结构(类型栈)、只走一遍(单遍 forward),时间复杂度 O(n)。
"How do you prove this binary has no buffer overflow, no uninitialised variable, no type confusion?" Java's answer is the bytecode verifier — a few thousand lines of dataflow analysis. Wasm went bolder: use a "type stack" as the only abstract state, then forward-sweep through the instruction sequence. One data structure (the type stack), one pass (forward only), O(n) time.
维护两个东西:
Maintain two things:
vstack:值类型栈,元素是 i32 / f32 / funcref 这类 value type。vstack: a stack of value types (i32 / f32 / funcref …).
cstack:控制帧栈,每进入一个 block / loop / if 就 push 一个 frame,frame 在创建时钉死跳转目标的类型。cstack: a stack of control frames, one pushed per block / loop / if, each pinning its branch target's types at creation.
br k 跳到 cstack 顶往下数第 k 个 frame。
br k jumps to the k-th frame from the top.
遍历指令序列,每条指令做三件事:① pop 走它需要的输入类型(类型不对 → fail);② push 它产生的输出类型;③ 如果是控制指令,适当 push/pop cstack。函数末尾 vstack 必须正好等于函数返回类型——否则 fail。就这么简单。但这套机制证明了:任何通过验证的 wasm 不会类型混乱、不会栈溢出、不会未初始化访问。
For each instruction: ① pop the input types it expects (mismatch → fail); ② push the output types it produces; ③ if it's a control op, push/pop cstack accordingly. At function end, vstack must exactly equal the return type — else fail. That's it. Yet this proves any validated wasm cannot suffer type confusion, stack overflow, or uninitialised access.
左侧指令逐条扫过,右侧类型栈跟着推演:load → load → add 让 vstack 在 [i32]→[i32,i32]→[i32] 之间走;const → div 让它再次先升后降。这套 abstract interpretation 用一个数据结构(类型栈)+ 一遍 forward scan 证明类型安全。动画每 7 秒自动循环。
Left: instructions stream in one by one; right: the type stack evolves in lockstep — load → load → add moves vstack through [i32] → [i32,i32] → [i32]; const → div pushes and pops again. This abstract interpretation proves type safety using one data structure (the type stack) + one forward pass. Animation auto-loops every 7 s.
| Case | WAT | Why rejected |
|---|---|---|
| 栈不够underflow | i32.const 1; i32.add | i32.add 要 pop 两个,vstack 只有一个 → faili32.add expects two pops, vstack has one → fail |
| 类型不对type mismatch | f32.const 1.5; i32.const 2; i32.add | i32.add 要 [i32, i32],拿到 [f32, i32] → faili32.add wants [i32, i32], got [f32, i32] → fail |
| 函数尾未消栈leftover at end | i32.const 1; end (无 return) | 函数返回 (),vstack 末态须空 → failfunction returns (), vstack must be empty at end → fail |
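The whole sweep, minus cstack frame tracking, fits in a toy sketch. The op table below is mine and covers only what the three rejection cases need:

```javascript
// Toy single-pass validator over a type stack, enough to reproduce the
// three rejections in the table. Real validators also track cstack frames.
const SIG = {
  'i32.const': { pop: [],             push: ['i32'] },
  'f32.const': { pop: [],             push: ['f32'] },
  'i32.add':   { pop: ['i32', 'i32'], push: ['i32'] },
};

function validate(body, results) {
  const vstack = [];
  for (const op of body) {
    for (const want of SIG[op].pop) {
      const got = vstack.pop();
      if (got === undefined) return `underflow at ${op}`;
      if (got !== want)      return `type mismatch at ${op}: want ${want}, got ${got}`;
    }
    vstack.push(...SIG[op].push);
  }
  // at function end the stack must exactly match the result types
  return vstack.join() === results.join() ? 'ok' : 'leftover/missing values at end';
}

console.log(validate(['i32.const', 'i32.const', 'i32.add'], ['i32'])); // ok
console.log(validate(['i32.const', 'i32.add'], ['i32']));              // underflow at i32.add
console.log(validate(['f32.const', 'i32.const', 'i32.add'], ['i32'])); // type mismatch at i32.add: ...
console.log(validate(['i32.const'], []));                              // leftover/missing values at end
```

One array, one forward loop, O(n): the same shape the real algorithm has, just without control frames and the polymorphic-stack rule.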
一个微妙的细节:验证遇到 br(无条件跳转)或 return 或 unreachable 后,后续直到下一个 end 的指令都无法执行到。但代码还在那儿——验证器要怎么处理?答案:把 vstack 标记为 polymorphic stack(假栈),后续 pop 操作都不真的检查,push 也接受任何类型。等遇到 end 或者 else 时再恢复真实栈状态。这一招让验证器即使对"死代码"也能 O(n) 走完。
A subtle detail: after br (unconditional), return, or unreachable, instructions up to the next end are unreachable. But the bytes are still there — what should the validator do? Answer: mark vstack as polymorphic stack — subsequent pops are not really checked, pushes accept any type. Real state restored at the next end or else. This keeps the validator O(n) even through dead code.
Java verifier 需要 dataflow 分析,是因为 JVM 字节码允许非结构化 goto,会产生不规则 CFG。wasm 的结构化控制(block / loop / if 配 br k)从根本上禁止了不规则跳转——每个跳转目标都是当前 cstack 上的某个 frame,目标类型在 frame 创建时就钉死。这让验证不需要反向传播分析。"结构化控制"是 wasm 验证可以一遍完成的根本前提,也是为什么没有 goto。
The Java verifier needs dataflow because JVM bytecode allows non-structured goto, producing irregular CFGs. Wasm's structured control (block / loop / if + br k) bans irregular jumps at the root — every branch target is a frame on the current cstack with its target type fixed at frame creation. No backward propagation needed. "Structured control" is why wasm validation finishes in one pass — and why there's no goto.
单遍验证 + 函数互相独立(只引用 Type / Function / Memory / Table 等"全局"section,这些先解析完)= 函数级并行。V8 的实现给 N 个函数开 min(N, CPU 核数) 个验证 worker,每个 worker 拿一个函数独立验。Photoshop 那种 30 万函数的 wasm 模块,在 8 核机器上 ~500 ms 就能验完——这是为什么 wasm 启动比想象中快。
Single-pass validation + function independence (functions reference only the "global" sections — Type / Function / Memory / Table — already parsed) = function-level parallelism. V8 spawns min(N, num_cpus) workers and each takes one function. Photoshop's 300 K-function wasm validates in ~500 ms on an 8-core box — which is why wasm startup is faster than people expect.
验证不是检查代码,
是把代码读成一个可证明的形状。 Field Note · 03
Validation isn't checking code.
It is reading code into a provable shape. Field Note · 03
从这一段起,字节离开磁盘,进入引擎。我们追着主线 Hot Loop 走过 6 道工序:流式 decode 用 LEB128 把字节翻成 Module 数据结构;Validate 在函数级并行里把类型证明做完;Tier-0 Liftoff 单遍出机器码,启动 0 等待;Tier-1 TurboFan 在后台把热函数重编译到接近 native;然后实例化分配 memory、填 table、跑 start;最后 JS 跟 wasm 之间的 trampoline 把调用边界缝起来。这 6 章是整个文章最"引擎"的部分。
From here, the bytes leave disk and enter the engine. We follow the main-line Hot Loop through six stages: streaming decode turns LEB128 bytes into a Module; validate proves type safety with function-level parallelism; Tier-0 Liftoff emits machine code in one pass, zero startup wait; Tier-1 TurboFan re-compiles hot functions to near-native in the background; then instantiation allocates memory, fills tables, runs start; finally trampolines stitch the JS ↔ wasm call boundary. The most "engine" part of the article.
streaming compilation
WebAssembly.compileStreaming(fetch('hot.wasm')) · WebAssembly.instantiateStreaming(...)
"下载完再编译"是 2017 年 MVP 时的默认行为。2018 年起 V8 / SpiderMonkey 都实现了 streaming compile——浏览器 fetch 第一个 chunk 进来就交给 wasm decoder,decoder 拿到一个 section 完整字节就解析,拿到一个函数体完整字节就交给 Liftoff。下载和编译完全并行,这一招的本质是把 wasm 当成"边下边播"的视频流。
"Download then compile" was the default in the 2017 MVP. From 2018, V8 and SpiderMonkey both shipped streaming compile — the browser hands each fetched chunk to the wasm decoder, which parses each complete section and forwards each complete function body to Liftoff. Download and compile run fully in parallel; in essence, treat wasm like a "stream-while-watching" video.
// 慢路径(non-streaming):先 ArrayBuffer 再 compile
const buf = await fetch('hot.wasm').then(r => r.arrayBuffer());
const mod = await WebAssembly.compile(buf);   // 等下完才开始

// 快路径(streaming):fetch 进来一段就开始编
const mod = await WebAssembly.compileStreaming(fetch('hot.wasm'));
compileStreaming 需要 server 回 Content-Type: application/wasm——否则 Promise 会直接被 TypeError reject,需要你自己 fallback 到 buffer 路径。这是常见踩坑(把 .wasm 当 .bin 上 CDN 时 MIME 不对)。
compileStreaming requires the server to return Content-Type: application/wasm — otherwise the promise rejects with a TypeError and you must fall back to the buffer path yourself. A common pitfall when serving .wasm as .bin from a CDN.
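The standard guard is to try streaming first and take the buffer path when it rejects. A sketch — `loadWasm` and its injectable `fetchFn` parameter are my own illustration, not a standard API:

```javascript
// Try the streaming path; on a wrong MIME type (or missing streaming
// support) re-fetch and instantiate from an ArrayBuffer instead.
async function loadWasm(url, imports = {}, fetchFn = fetch) {
  if (typeof WebAssembly.instantiateStreaming === 'function') {
    try {
      return await WebAssembly.instantiateStreaming(fetchFn(url), imports);
    } catch {
      // e.g. .wasm served as application/octet-stream: fall through
    }
  }
  const buf = await (await fetchFn(url)).arrayBuffer();
  return WebAssembly.instantiate(buf, imports);   // -> { module, instance }
}
```

Note the second `fetchFn(url)`: once streaming has touched the response body, the fallback must fetch again rather than reuse it.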
// V8 ModuleDecoder 简化状态机
[kPreamble]              ; expecting magic + version (8 bytes)
  ├─▶ kSectionHeader     ; expecting id byte + LEB128 length
  ├─▶ kTypeSection       ; 0x01 · vec[Type]
  ├─▶ kImportSection     ; 0x02 · vec[Import]
  ├─▶ kFunctionSection   ; 0x03 · vec[u32 type-idx]
  ├─▶ kTableSection      ; 0x04
  ├─▶ kMemorySection     ; 0x05
  ├─▶ kGlobalSection     ; 0x06
  ├─▶ kExportSection     ; 0x07
  ├─▶ kStartSection      ; 0x08
  ├─▶ kElementSection    ; 0x09
  ├─▶ kCodeSection       ; 0x0a · 进入 per-function loop
  │     └─▶ for each function:
  │           1. parse body bytes
  │           2. enqueue to validator worker
  │           3. enqueue to Liftoff worker
  ├─▶ kDataSection       ; 0x0b
  └─▶ kCustomSection     ; 0x00 · name / dwarf / vendor
每个状态对应一个 section 的解析函数,内部都是同一种结构:先读 LEB128 数量,再 for 循环依次解析每个 entry。这种规则化让 decoder 简单到可以单文件 (module-decoder.cc) 几千行写完。
Each state maps to a parsing function for one section, all sharing the same shape: read LEB128 count, then for-loop entries. The regularity keeps the decoder small — one file (module-decoder.cc), a few thousand lines.
code section 的格式有一个细节让流式编译变得可行:每个函数体前面都有一个 LEB128 长度。这让 decoder 不必先扫一遍找边界,可以直接 fread 长度 → fread 函数体 → 入队 → fread 下一段长度。"self-describing 长度前缀" 是 wasm 设计里反复出现的母题——module 长度、section 长度、函数体长度、import name 长度,全是 LEB128 前缀。
A detail in the code section makes streaming feasible: each function body is prefixed by a LEB128 length. The decoder doesn't need a pre-scan — just read length → read body → enqueue → next length. "Self-describing length prefix" is a recurring motif — module length, section length, body length, import-name length, all LEB128-prefixed.
N 个函数,N 个 worker
N functions, N workers
Ch11 已经讲了验证算法本身。这一章谈工程实现:V8 怎么把 N 个函数的验证拆到 N 个 worker 上,什么时候 fail-fast,什么时候 graceful。
Ch11 covered the algorithm itself. This chapter is about engineering: how V8 spreads N functions across N workers, when to fail-fast, when to be graceful.
关键前提是函数体只引用模块级声明(type / function / table / memory / global / element),这些都在 code section 之前的 section 里解析完了。函数 A 验证时不需要看函数 B——它最多通过 call 引用 B 的签名(已知)。所以 N 个 worker 可以独立验证 N 个函数,彼此不通信。
The key invariant: function bodies reference only module-level declarations (type / function / table / memory / global / element), all parsed before the code section. Function A's validator doesn't need to look at function B — at most it sees B's signature via call (already known). So N workers validate N functions independently, no inter-thread comms.
| 引擎Engine | workers | strategy |
|---|---|---|
| V8 | min(N, num_logical_cpus) | each thread pulls from one queue |
| SpiderMonkey | helper threads (configurable) | tile-based, 64 KB per tile |
| JavaScriptCore | WTF::WorkerPool | per-function, with size-aware scheduling |
| Wasmtime | rayon parallel iterator | per-function |
如果第 3 个函数验证失败,后面 1000 个函数还要不要继续验证?V8 选择继续——所有 worker 把活做完,最后聚合错误。这听起来浪费,但因为 worker 是并行的,继续做不会延后失败时间;反而提前 abort 需要协调(kill 其他 worker),代码复杂度反而高。"并行算法里 fail-fast 不一定快"是 V8 设计里反复出现的取舍。
If function 3 fails, do the remaining 1000 keep validating? V8 says yes — let all workers finish, aggregate errors at the end. Sounds wasteful, but because workers run in parallel, continuing doesn't delay failure; aborting would need coordination (kill other workers), with higher code complexity. "Fail-fast isn't necessarily fast in a parallel pipeline" — a trade-off V8 makes repeatedly.
主线只有 1 个函数,所以"并行"在这里退化成单 worker。49 条指令,vstack 最深 4 槽,cstack 最深 2 frame,在 M1 Pro 上验证耗时 ~ 6 µs。这一数字给你一个数量级感受:验证比解码还快,因为验证不分配内存(用栈上小固定容量数组就够了)。
The main-line has 1 function, so "parallel" degenerates to single worker. 49 ops, vstack max depth 4, cstack max depth 2 — on M1 Pro, validation takes ~ 6 µs. The order of magnitude: validation is faster than decoding, because it allocates nothing — a small fixed-cap stack array suffices.
10 MB/s · 0 IR · 0 register alloc
v8/src/wasm/baseline/liftoff-compiler.cc :: VisitOpcode · liftoff-assembler-{x64,arm64}.cc
local.get → mov reg, [rbp-N](N 是该 local 的固定偏移);local.set → mov [rbp-N], reg。栈顶值用瞬时寄存器 rax/rbx 之类即可。不优化中间结果留寄存器——出来的码"很啰嗦但确定"。
local.get: mov reg, [rbp-N], where N is that local's fixed offset. For local.set: mov [rbp-N], reg. Stack-top values land in ad-hoc registers like rax/rbx. No effort to keep intermediates in regs — the code is "verbose but deterministic".

; blur3 -- Liftoff codegen (x86-64, simplified)
0x000000  push rbp
0x000001  mov rbp, rsp
0x000004  sub rsp, 0x40                  ; 8 stack slots
0x000008  mov [rbp-0x08], rdi            ; spill $src
0x00000c  mov [rbp-0x10], rsi            ; spill $dst
0x000010  mov [rbp-0x18], edx            ; spill $w
0x000014  mov [rbp-0x1c], ecx            ; spill $h
          ; outer loop: y = 1
0x000018  mov dword ptr [rbp-0x20], 1    ; $y = 1
0x000020  mov eax, [rbp-0x1c]
0x000024  dec eax                        ; eax = h - 1
0x000026  cmp [rbp-0x20], eax
0x000029  jge .end_y
.loop_y:
          ; inner loop: x = 1
0x00002b  mov dword ptr [rbp-0x24], 1    ; $x = 1
0x000033  mov eax, [rbp-0x18]
0x000037  dec eax                        ; eax = w - 1
0x000039  cmp [rbp-0x24], eax
0x00003c  jge .end_x
.loop_x:
          ; sum = 0
0x00003e  mov dword ptr [rbp-0x28], 0
          ; 9 byte loads, 9 adds — Liftoff emits each one
0x000046  mov rax, [rbp-0x08]            ; $src
0x00004a  movzx edx, byte ptr [r15+rax]  ; i32.load8_u offset=0 (r15 = mem base)
0x00004e  add [rbp-0x28], edx            ; sum += byte
0x000051  mov rax, [rbp-0x08]            ; reload $src ← spill cost
0x000055  movzx edx, byte ptr [r15+rax+1]
0x00005a  add [rbp-0x28], edx
...       ; (similar 7 more times — Liftoff makes no attempt to hoist $src)
          ; sum / 9 — Liftoff does NOT do magic-number multiplication
0x0000c0  mov eax, [rbp-0x28]
0x0000c3  xor edx, edx
0x0000c5  mov ecx, 9
0x0000ca  div ecx                        ; expensive! ~25 cycles
          ; store dst[y*w + x]
0x0000cc  ...
0x0000e0  mov byte ptr [r15+rbx], al
          ; x++; loop_x
0x0000e4  inc dword ptr [rbp-0x24]
0x0000e8  jmp .loop_x
...
.end_x:
.end_y:
0x000130  leave
0x000131  ret
这段~ 240 字节的 x86-64 代码就是 Liftoff 对 hot.wasm 的输出。三个观察:① $src 在每次 load 前都重新从 [rbp-0x08] 加载——Liftoff 不知道也不分析 "这个值我刚加载过";② sum / 9 用了真 div 指令,~25 cycle;③ 函数体没有 SIMD 化。但出码时间在 200 µs 量级——这正是它要的。
~240 bytes of x86-64 is Liftoff's output for hot.wasm. Three notes: ① $src is reloaded from [rbp-0x08] before every load — Liftoff doesn't know it just loaded this; ② sum / 9 uses a real div, ~25 cycles; ③ no SIMD. But codegen time is ~200 µs — exactly the target.
每个 Liftoff 函数入口都塞一个计数器:
Every Liftoff function prologue carries a counter:
cmp dword ptr [r13+0x40], 0x100   ; tier-up threshold = 256 calls
jne +0x4
call WasmCompileLazy              ; → schedule TurboFan recompile
2 条指令的开销,每次进入函数加 1 次 cmp + 1 次 jne(不跳转)。达到阈值时调用 WasmCompileLazy,把这个函数入队到 TurboFan 后台 worker——不阻塞当前调用,Liftoff 版继续跑。后台 worker 编完后,引擎用一个 atomic store 把函数地址表里的入口换成 TurboFan 版,下次调用就走 TurboFan。
Two-instruction overhead: one cmp + one jne (not taken) per entry. At threshold, call WasmCompileLazy to enqueue the function for a background TurboFan worker — does not block the current call, Liftoff version keeps running. After the worker finishes, an atomic store swaps the function-table entry to point to the TurboFan version; the next call goes to TurboFan.
2 ms 出码,80% of native
2 ms emit, 80% of native
此刻的 hot.rs:第 256 次调用触发了 tier-up;后台 worker 把它送进 TurboFan 的 sea-of-nodes + LoadElimination + SimplifiedLowering + Schedule + RegAlloc 流水线。Liftoff 那 9 次 local.get $src 被合并成 1 次寄存器读;sum / 9 被识别为常量除,替换成魔数乘法 0x1c71c71d。180 字节 x86,3.8 ms/帧。下一站:atomic 安装。
hot.rs right now: the 256th call triggered tier-up; a background worker pushes it through TurboFan's sea-of-nodes + LoadElimination + SimplifiedLowering + Schedule + RegAlloc pipeline. Liftoff's nine local.get $srcs collapse to one register read; sum / 9 is recognised as div-by-constant and rewritten as magic-number mul 0x1c71c71d. 180 bytes of x86, 3.8 ms/frame. Next stop: atomic install.
TurboFan 原本是 V8 的 JavaScript 优化编译器。2017 年起它兼任 wasm 的优化编译器——但 wasm 那一面用的 pipeline 跟 JS 那边完全不一样。JS 那边要处理 SMI 标记、IC 反馈、deopt 边界;wasm 这边类型钉死,没有反馈,没有 deopt。所以 wasm TurboFan 是个"静态优化器",更接近 LLVM 的工作流。
TurboFan was originally V8's JS optimising compiler. From 2017 it doubles as wasm's optimiser — but the wasm pipeline differs entirely from JS. JS-side juggles SMI tags, IC feedback, deopt edges; wasm-side has fixed types, no feedback, no deopt. So wasm-TurboFan is a "static optimiser", much closer to LLVM's workflow.
$src 的 9 次 local.get 被压成 1 次寄存器持有。sum / 9 在这一步被识别为常量除,替换成 magic-number multiplication。
The nine local.get $srcs collapse into one held register; sum / 9 here is recognised as div-by-constant and rewritten as magic-number mul.

| Metric | Liftoff | TurboFan | Ratio |
|---|---|---|---|
| 编译耗时Compile time | 200 µs | 2.1 ms | 10× |
| 出码字节Code bytes | 240 B | 180 B | 0.75× |
| 运行耗时(1080p)Runtime (1080p frame) | 12 ms | 3.8 ms | 0.32× |
| $src reload 次数$src reloads | 9 | 1 | — |
| sum / 9 | div(25 cy) | mul+shr(4 cy) | ~6× faster |
| SIMD ?SIMD ? | — | (no, 默认未开)(no, default off) | — |
TurboFan 编译耗时是 Liftoff 的 10×,但出的码运行比 Liftoff 快 ~3×。关键洞察是这两个数字不矛盾——TurboFan 在后台编,把"编译延迟"摊到了后台 worker。用户看到的延迟是"Liftoff 编译完 + 第一次跑",TurboFan 是后面"悄悄变快"的。这是 wasm tiering 的全部哲学。
TurboFan compiles 10× slower than Liftoff, but the resulting code runs ~3× faster. The crucial insight: these don't conflict — TurboFan compiles in the background, amortising its latency onto a worker. Perceived latency is "Liftoff done + first run"; TurboFan is the "silent" speedup later. That's wasm tiering in one sentence.
左:Liftoff 把每条 wasm 字节翻成机器指令,9 次 local.get $src 各自 emit 一条 mov。右:TurboFan 看穿这 9 次都指向同一个 SSA 节点,LoadElimination 合并成 1 次寄存器读;sum / 9 在 SimplifiedLowering 阶段识别为常量除,替换成魔数乘法(0x1c71c71d)。这就是 wasm "原生 80%" 的具体形式。
Left: Liftoff emits one machine op per wasm byte; nine local.get $src turn into nine movs. Right: TurboFan sees the nine reference the same SSA node, LoadElimination merges them into a single register read; sum / 9 is recognised as divide-by-constant during SimplifiedLowering and rewritten as magic-number multiplication (0x1c71c71d). The concrete form of wasm's "80% of native".
2022 年 V8 团队启动 Turboshaft 项目,目标是替换 TurboFan。原 TurboFan 的 sea-of-nodes 在内存里是图结构,每次访问要解引用,在大模块上 cache miss 严重。Turboshaft 改成线性 IR 序列(类似 LLVM 的 BB instructions),内存连续,优化 pass 速度上升 30~50%。2023 年起 V8 wasm 默认走 Turboshaft,但 IR 和 pass 集合跟 TurboFan 高度兼容,从外部看几乎无感。
In 2022 the V8 team began Turboshaft, with the goal of replacing TurboFan. The original TurboFan keeps its sea-of-nodes as an in-memory graph; every access dereferences, and on large modules cache misses dominate. Turboshaft uses a linear IR sequence (like LLVM BB instructions), so memory is contiguous and pass speed improves 30–50%. Since 2023, V8 wasm has run Turboshaft by default, but IR and pass set are highly TurboFan-compatible — externally near-invisible.
local.get $src 怎么变成 1 次寄存器读
local.get $src collapses into one register read
Liftoff 把每个 local.get $src 都翻成 mov rax, [rbp-0x08]——9 次。TurboFan 看到这 9 次 local.get 都引用同一个 SSA 节点(因为 wasm 验证已经证明 $src 在此区间未被赋值),LoadElimination 把它们合并成一个 SSA 引用。寄存器分配阶段把这个 SSA 值留在 rcx 里——9 次内存读变成 1 次内存读 + 9 次寄存器引用。这一招省下 ~ 20 ns 每像素,在 1920×1080 图像上是 40 ms 的差距。
Liftoff turns each local.get $src into mov rax, [rbp-0x08] — nine times. TurboFan sees all nine reference the same SSA node (validation already proved $src isn't reassigned in this region), and LoadElimination merges them into one SSA reference. RegAlloc keeps the SSA value in rcx — nine memory reads collapse into one memory read + nine reg references. ~20 ns saved per pixel; on a 1920×1080 image, that's a 40 ms swing.
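可以用几行 JS 验证这个魔数:在 blur3 实际产生的取值范围(9 个 u8 之和,0..2295)内,(n × 0x1C71C71D) >> 32 恰好等于 n/9。引擎实际选的常量和移位可能不同——这里只验证"除法换乘法"背后的算术恒等式。
A few lines of JS verify the magic number: over the range blur3 actually produces (the sum of nine u8s, 0..2295), (n × 0x1C71C71D) >> 32 equals n/9 exactly. The constant and shift an engine emits may differ; this only checks the identity behind "replace div with mul + shift".

```javascript
// Magic-number division by 9: MAGIC = ceil(2^32 / 9). Multiply into a
// 64-bit product (BigInt here), keep the high 32 bits — no div needed.
const MAGIC = 0x1C71C71Dn;
const divBy9 = (n) => Number((BigInt(n) * MAGIC) >> 32n);

// Exhaustively check the range a 3x3 box-blur sum can take (9 * 255).
for (let n = 0; n <= 2295; n++) {
  if (divBy9(n) !== Math.floor(n / 9)) throw new Error(`mismatch at ${n}`);
}
```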
memory · table · globals · imports · start
memory · table · globals · imports · start
WebAssembly.Module 是不可变的编译产物——它只装了代码、类型、import 声明、export 声明。要真正"跑"它,得创建一个 WebAssembly.Instance,把 import 满足、memory / table / globals 分配出来。同一个 Module 可以创建多个 Instance,每个 Instance 有自己的 memory——这是 wasm 实现"多线程"和"沙箱隔离"的基础。
WebAssembly.Module is an immutable compilation artifact — it carries code, types, import/export declarations. To actually run it, create a WebAssembly.Instance: satisfy imports, allocate memory / table / globals. One Module can spawn many Instances, each with its own memory — the foundation of wasm's "multithreading" and "sandbox isolation".
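Module / Instance 的分工可以用一个手工拼字节的最小模块直接演示(导出一个 add、无 import;这段字节序列是规范里的经典例子):同一个 Module 实例化两次,代码共享、导出各自独立。
The Module/Instance split can be shown with a minimal hand-assembled module (one exported add, no imports; the byte sequence is the classic spec example): one Module instantiated twice, code shared, exports distinct.

```javascript
// Minimal wasm binary: magic + version, then type / func / export / code
// sections for an exported add(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // \0asm + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                               // func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, 1 body
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0/1; i32.add; end
]);

const module = new WebAssembly.Module(bytes); // compile once (immutable)
const a = new WebAssembly.Instance(module);   // instantiate twice —
const b = new WebAssembly.Instance(module);   // code shared, state separate

console.log(a.exports.add(2, 3));             // 5
console.log(a.exports.add === b.exports.add); // false: per-instance exports
```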
Memory section 的 min pages 决定新分配的 ArrayBuffer 大小;global 初始化表达式里允许 global.get 已初始化的 global。
min pages from the Memory section size the freshly allocated ArrayBuffer; global.get of already-inited globals is allowed in init expressions.

```js
const importObject = {
  env: {
    memory: new WebAssembly.Memory({ initial: 64, maximum: 256 }),
    log: (x) => console.log('wasm says', x),
  },
  wasi_snapshot_preview1: { /* WASI shims */ },
};
const { module, instance } = await WebAssembly.instantiateStreaming(
  fetch('hot.wasm'),
  importObject
);
instance.exports.blur3(srcPtr, dstPtr, 1920, 1080);
```
Web Worker + SharedArrayBuffer 场景里,常见做法是主线程 compile 一次 Module,所有 worker 都用同一个 Module 创建独立 Instance。每个 worker 有自己的 memory(可能是 import 来的共享 memory),共享代码、不共享栈。Photoshop / Figma 都这样做。编译只跑一次,内存按需复制。
With Web Workers + SharedArrayBuffer, the standard pattern is: main thread compiles the Module once; each worker creates its own Instance from the same Module. Each worker has its own memory (possibly imported shared memory), sharing code but not stacks. Photoshop and Figma both do this. Compile once, memory on demand.
两条 ABI 中间的桥
the bridge between two ABIs
hot.rs 此刻:JS 侧 instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) 触发了第一次执行。V8 在中间塞了一层 JS-to-Wasm wrapper 栈帧——SMI 解包 + r15/r14 装填 + tail-jmp 进 Liftoff 出码。2025 年 V8 把这层压到 5 ns。Storyboard 第 8 格的跨边界细节。
hot.rs right now: JS-side instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) triggers the first invocation. V8 inserts a JS-to-Wasm wrapper frame between them — SMI unbox + r15/r14 setup + tail-jmp into Liftoff code. 2025 V8 has the whole thing down to 5 ns. The boundary detail behind Storyboard cell 8.
JS 用 SMI 标记的 31 位整数,wasm 用裸 i32。JS 调用约定走的是 V8 的 JS calling convention,wasm 内部用的是 wasm calling convention。JS 调 wasm 函数,引擎要在中间塞一个 trampoline——把 SMI 解包成 i32,把 NaN 之类的非法值 throw 出来,然后跳进 wasm 函数体。反过来也一样。这一切 V8 都在编译期生成,但你看不到。
JS represents 31-bit integers as SMIs; wasm uses raw i32. JS calls follow V8's JS calling convention; wasm internally uses the wasm convention. When JS calls wasm, the engine slips a trampoline in between — unbox the SMI to i32, throw on illegal values like NaN, then jump into the wasm function body. Same in reverse. V8 generates all of this at compile time, but you never see it.
| Name | Direction | Used when |
|---|---|---|
| JS-to-Wasm wrapper | JS → Wasm | JS calls instance.exports.f |
| Wasm-to-JS wrapper | Wasm → JS | Wasm calls an imported JS function |
| Wasm-to-Wasm | Wasm → Wasm | Direct or indirect call to another wasm func |
| Capi wrapper | C/C++ ↔ Wasm | Embedder uses the wasm_c_api headers |
JS 调 wasm 时,V8 在中间塞了一层JS-to-Wasm wrapper 栈帧——专门做 SMI 解包 + r15/r14 寄存器装填,然后尾跳进 wasm 函数体。整个过程 2025 年 V8 已压到 ~ 5 ns(2017 是 80 ns)。剩下三件事不能省:栈指针切换、wasm 关键寄存器装填、异常处理元数据 push——这是 trampoline 的物理下限。
When JS calls wasm, V8 inserts a JS-to-Wasm wrapper frame — it unboxes SMIs, loads r15/r14, then tail-jumps into the wasm body. 2025 V8 has the whole thing down to ~5 ns (from 80 ns in 2017). Three things refuse to compress: stack-pointer swap, wasm context-register load, EH metadata push — the trampoline's physical floor.
```asm
; JS-to-Wasm wrapper for blur3(src, dst, w, h)
push rbp
mov rbp, rsp
; arg 0 (src): expect SMI, unbox to i32
mov rax, [rdi+0x10]        ; rdi = first arg, JS heap pointer
test rax, 0x1              ; SMI test (low bit = 0 means SMI in V8)
jnz .slow_path             ; HeapNumber path
sar rax, 1                 ; SMI shift to get raw int
mov edi, eax               ; load into wasm arg reg
...                        ; same for args 1..3
; setup wasm frame
mov r15, [r13+0x20]        ; load wasm memory base
mov r14, [r13+0x28]        ; load wasm instance pointer
; tail-call into wasm function
jmp [r14+0x40]             ; → Liftoff/TurboFan-compiled blur3
.slow_path:
call ConvertNumberToInt32  ; handles HeapNumber, BigInt, throws on NaN
jmp back
```
观察:① 主路径是纯寄存器操作 + 一条 jmp,~5 ns;② SMI 解包是一条 test + sar,几乎免费;③ 慢路径处理 HeapNumber/BigInt/NaN,大约 50~100 ns;④ r15(memory base)和 r14(instance)被显式 load——wasm 函数运行时假设这两个寄存器有效。
Notes: ① fast path is pure register ops + one jmp ≈ 5 ns; ② SMI unboxing is one test + sar, practically free; ③ slow path (HeapNumber / BigInt / NaN) is ~50–100 ns; ④ r15 (memory base) and r14 (instance) are explicitly loaded — the wasm body assumes these registers hold valid values.
反方向更贵:wasm 调用 JS 函数(比如 console.log)需要构造 JS call frame、把 i32 boxing 成 SMI、检查 receiver、可能 GC——单次大概 100~300 ns。"wasm 频繁调 DOM" 是性能反模式——每次过桥的成本就吃掉了算术速度的优势。Photoshop 的策略是把整张图片 copy 到 linear memory,处理完一整张再过桥回 JS,把过桥次数压到极少。
The reverse is pricier: wasm calling JS (e.g. console.log) constructs a JS call frame, boxes i32 to SMI, checks the receiver, possibly triggers GC — ~100–300 ns per call. "Frequent DOM calls from wasm" is the canonical perf anti-pattern — boundary cost devours arithmetic speedup. Photoshop's strategy: copy the entire image into linear memory, process the whole thing, cross the boundary once on return. Minimise crossings.
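把"少过桥"写成可运行的对比(纯 JS 替身,不是真 wasm——这里只关心过桥次数的算术):逐像素过桥 N 次 vs 整张图过桥 1 次。
The "minimise crossings" strategy as a runnable contrast (plain-JS stand-ins, not real wasm — only the crossing count matters here): one boundary call per pixel vs one per image.

```javascript
// `crossings` counts boundary hops; the kernels are stand-ins for a wasm export.
let crossings = 0;

// chatty: one boundary call per pixel
const brightenPixel = (p) => { crossings++; return Math.min(p + 16, 255); };

// batched: one boundary call per image; the loop runs "inside wasm"
const brightenImage = (pixels) => {
  crossings++;
  return pixels.map(p => Math.min(p + 16, 255));
};

const img = Array.from({ length: 1000 }, (_, i) => i % 256);

crossings = 0;
const out1 = img.map(brightenPixel);
const chattyCrossings = crossings;   // 1000 hops for 1000 pixels

crossings = 0;
const out2 = brightenImage(img);
const batchedCrossings = crossings;  // 1 hop for the whole image
```

同样的结果,过桥次数差 3 个数量级——乘上每次 100~300 ns,就是上文反模式的来源。
Same output, three orders of magnitude fewer hops — multiply by 100–300 ns each and you get the anti-pattern above.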
2017 MVP 时,JS 调 wasm 单次成本 ~ 80 ns——这意味着每秒最多 1200 万次过桥,对 60 fps 的游戏来说是真实瓶颈。V8 后续 5 年逐步优化:把 wrapper 做成 builtin、把 SMI 解包内联、用 call-ref 替代 call-indirect 间接调用、最后是 2025 年的直接调用——把 wrapper 完全 elide,JS 编译器看穿"这次 call 一定调 wasm" 时直接 emit 一条 call。如今边界几乎免费。
In the 2017 MVP, JS-calling-wasm cost ~80 ns per call — a hard cap of 12 M crossings/s, a real bottleneck for 60 fps games. V8 optimised over five years: turn wrappers into builtins, inline SMI unboxing, replace call-indirect with call-ref, and finally the 2025 direct call — elide the wrapper entirely, JS compiler sees "this call definitely lands in wasm" and emits a plain call. Today the boundary is nearly free.
5 ns 是 V8 当前的下限——三件事不可再压缩:① 栈指针切换(JS 用 V8 的栈,wasm 有自己的栈);② r15 / r14 寄存器 load(wasm 函数假设它们有效);③ 异常处理元数据 push(为了 wasm trap 能被 JS try/catch 抓到)。理论上还能再砍 1~2 ns,但工程复杂度极高。"5 ns 是 trampoline 自身的物理极限"。
5 ns is V8's current floor — three things remain irreducible: ① stack-pointer swap (JS uses V8 stack, wasm has its own); ② r15 / r14 register loads (the wasm body assumes these are valid); ③ exception-handling metadata push (so wasm traps can be caught by JS try/catch). 1–2 ns more could be carved, but engineering cost is high. "5 ns is the trampoline's physical floor".
最贵的指令不是除法,是过边界。 Field Note · 03
The most expensive instruction is not division.
It is crossing the boundary. Field Note · 03
MVP 之后 8 年,wasm 加了 ~15 个生效中的提案。这 5 章只挑最重要的展开:Threads 把共享内存接进 wasm 沙箱;SIMD 用 16 字节寄存器给 inner loop 提速 6 倍;GC 让 Java/Kotlin/Dart 不再背着自己的运行时;Component Model 给 wasm 一个跨语言 ABI;以及还有六个候选提案在排队。每个提案都要回答"怎么不破坏 portable + safe + fast + compact 四目标"——这是 wasm 委员会评审的根本问题。
Eight years post-MVP, wasm has shipped ~15 live proposals. These five chapters cover the most consequential: Threads plug shared memory into the sandbox; SIMD turns 16-byte registers into 6× inner-loop speedups; GC frees Java/Kotlin/Dart from shipping their own runtimes; Component Model gives wasm a cross-language ABI; and six more are queued. Every proposal must answer "does this still honour portable + safe + fast + compact?" — the working group's gatekeeping question.
SharedArrayBuffer · atomics · futex
SharedArrayBuffer · atomics · futex
"能不能在浏览器里跑 pthread?"——这是从 2014 年起 game engine 开发者就在问的问题。Threads 提案 2019 年 ship,答案是:能,但用新的方式。WebWorker 已经存在(线程没有共享内存,只有消息传递);wasm threads 在这上面叠加了 SharedArrayBuffer(共享内存)和 atomic ops(无锁原语)。
"Can we run pthread in the browser?" — a question game-engine devs have asked since 2014. The Threads proposal shipped in 2019 with the answer: yes, but in a new way. WebWorker already existed (no shared memory, only message passing); wasm threads layer SharedArrayBuffer (shared memory) and atomic ops (lock-free primitives) on top.
i32.atomic.load / store / rmw.add / rmw.cmpxchg。出码 x86 的 LOCK XADD / LOCK CMPXCHG,ARM 的 LDAR / STLR / LDADD。顺序一致(sequential consistency)是 wasm 的默认。
i32.atomic.load / store / rmw.add / rmw.cmpxchg. Emit x86 LOCK XADD / LOCK CMPXCHG or ARM LDAR / STLR / LDADD. Sequential consistency is wasm's default.
类似 Linux futex:线程 a 等地址 X 的值变,引擎挂起这个线程到内核。不能在主线程上用(浏览器禁止主线程阻塞 > 0 ms)。
Linux-futex-like: thread a sleeps until the value at address X changes; the engine parks the thread in the kernel. Not callable from the main thread (browsers forbid > 0 ms main-thread blocks).
唤醒在地址 X 上等待的 K 个线程(K 可以是 ∞)。配合 wait 实现 mutex / condvar / barrier。
Wake K (possibly ∞) waiters on address X. Combined with wait → mutex / condvar / barrier.
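这三类原语在 JS 侧的镜像就是 Atomics API——同一块 SharedArrayBuffer,同一组语义(下面是单线程演示,只看语义不看竞争):
The JS-side mirror of these primitives is the Atomics API — same SharedArrayBuffer, same semantics (single-threaded demo below: semantics, not contention):

```javascript
// One shared buffer, viewed as i32 lanes — the JS twin of wasm's atomics.
const sab = new SharedArrayBuffer(8);
const i32 = new Int32Array(sab);

Atomics.store(i32, 0, 5);                            // ~ i32.atomic.store
const old1 = Atomics.add(i32, 0, 3);                 // ~ rmw.add, returns old value (5)
const old2 = Atomics.compareExchange(i32, 0, 8, 42); // ~ rmw.cmpxchg: expect 8 → 42
const woken = Atomics.notify(i32, 1, 1);             // ~ atomic.notify: 0 waiters here
```

Atomics.wait 在浏览器主线程被禁止,和 memory.atomic.wait 的限制一致。
Atomics.wait is forbidden on the browser main thread, matching the memory.atomic.wait restriction.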
主线程 compile 一次 Module + 分配 一个 SharedArrayBuffer,通过 postMessage 把它发给 N 个 worker。每个 worker 创建独立 Instance(独立栈、独立 locals、独立 trap state),但都 import 同一个 Memory——这才是真正的"共享内存多线程"。Spectre 漏洞之后 COOP+COEP 头是必需的进程隔离保险。
The main thread compiles the Module once + allocates one SharedArrayBuffer, then ships them to N workers via postMessage. Each worker spawns its own Instance (own stack, locals, trap state) but imports the same Memory — true "shared-memory multithreading". Post-Spectre, COOP+COEP headers gate the process isolation that makes this safe.
```js
// Main thread
const mem = new WebAssembly.Memory({
  initial: 256,
  maximum: 2048,
  shared: true                     // ← key flag
});
const buf = mem.buffer;            // instanceof SharedArrayBuffer

const workers = [...Array(8)].map(() => new Worker('worker.js'));
workers.forEach(w => w.postMessage({ mem }));  // share Memory across workers

// worker.js
self.onmessage = async ({ data }) => {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('hot.wasm'),
    { env: { memory: data.mem } }  // import same Memory
  );
  instance.exports.blur3_threaded(srcPtr, dstPtr, w, h, workerId);
};
```
五件事:① shared: true 让 ArrayBuffer 变成 SharedArrayBuffer——浏览器对此要 Cross-Origin-Isolated 才允许;② maximum 必填——因为 grow shared memory 在 JS 那边复杂(所有 worker 都要被通知),所以提前占好上限;③ 主线程 compile 一次 Module,所有 worker 复用;④ 每个 worker 创建自己的 Instance,但 import 同一个 Memory——这是共享内存的关键;⑤ wasm 那边 thread id 通过函数参数显式传入,不是隐式。
Five things: ① shared: true upgrades the ArrayBuffer to SharedArrayBuffer — browsers require Cross-Origin-Isolated for it; ② maximum is mandatory — growing shared memory cross-worker is complex, so the ceiling is fixed up front; ③ main thread compiles the Module once, all workers reuse it; ④ each worker spawns its own Instance but imports the same Memory — that's the shared-memory bridge; ⑤ thread id is passed explicitly as a function arg, not implicit.
2018 年 1 月 Spectre 漏洞披露后,所有浏览器立刻关闭了 SharedArrayBuffer——因为高分辨率定时器 + 共享内存 = 可以利用 cache 旁路通道。wasm threads 当时 phase 3 即将 ship,被推迟了一年半。最终方案:要求页面声明 Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin——把进程隔离到只跟自己同源的脚本一起跑,这样旁路通道泄漏只会泄漏自己的数据,无意义。2021 年起浏览器在 COOP+COEP 头下重新启用 SharedArrayBuffer。如今你能在 Figma 上跑 wasm threads,就是因为它服务端正确设置了这两个头。
When Spectre dropped in January 2018, browsers immediately disabled SharedArrayBuffer — high-res timers + shared memory = a usable cache side-channel. wasm threads, then at phase 3 and almost shipping, slipped a year and a half. The eventual mitigation: require pages to declare Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin — isolate the process so it only co-resides with same-origin scripts; any side-channel leak only leaks your own data, which is harmless. From 2021, browsers re-enabled SharedArrayBuffer behind COOP+COEP. You can run wasm threads on Figma because they set those headers correctly.
8 个 worker 把图像分成 8 个水平条带,每个 worker 独立处理。理想线性加速 8×,实测 5.8×——剩下 28% 损失在 worker 启动开销、worker 间内存通信、最后聚合点同步。这已经是 wasm 在 web 平台能做到的"多核计算"上限。
Eight workers slice the image into eight horizontal stripes, each processed independently. Ideal linear speedup is 8×; measured 5.8×. The remaining 28% goes to worker startup, cross-worker memory contention, and the final sync barrier. This is the practical ceiling for "multi-core computation" on the web platform today.
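条带划分本身只是下标算术——一个假设性的辅助函数(函数名是示意)把 rows 行分给 n 个 worker,余数行摊给前几个条带:
The stripe split itself is just index arithmetic — a hypothetical helper (name illustrative) dividing rows among n workers, with the remainder spread over the first stripes:

```javascript
// Split `rows` image rows into `n` near-equal [start, end) stripes.
function stripes(rows, n) {
  const base = Math.floor(rows / n), extra = rows % n;
  const out = [];
  let start = 0;
  for (let i = 0; i < n; i++) {
    const h = base + (i < extra ? 1 : 0); // first `extra` stripes get one more row
    out.push([start, start + h]);
    start += h;
  }
  return out;
}

stripes(1080, 8); // eight stripes of 135 rows each
```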
v128 · lane ops · 6× speedup
v128 · lane ops · 6× speedup
hot.rs 此刻(SIMD build):RUSTFLAGS="-C target-feature=+simd128",LLVM 把内层循环向量化——一次循环处理 16 个像素而不是 1 个。inner loop 变成 18 条 SSE2 指令(PADDW / PMULHRSW / PSRLW),平均 1.1 条 SSE / 像素。这是 Storyboard 最后一格,也是 wasm 在 1080p 图像上达到 6.8× of JS 的根源。终点站。
hot.rs right now (SIMD build): with RUSTFLAGS="-C target-feature=+simd128", LLVM vectorises the inner loop — 16 pixels per iteration, not 1. The body becomes 18 SSE2 instructions (PADDW / PMULHRSW / PSRLW), averaging ~1.1 SSE per pixel. The final Storyboard cell — and the reason wasm hits 6.8× of JS on a 1080p image. End of the line.
SIMD 是 wasm 提案里争议最大的一个——主要分歧是固定宽度 vs 可变宽度(scalable)。x86 的 AVX-512 是 512 bit,ARM 的 SVE2 是 128~2048 bit 可变,RISC-V 的 V 扩展是真的可变。真"可移植"的方案应该是 scalable,但 scalable SIMD 的 codegen 复杂度极高,JIT 在浏览器里跑不起。最终 wasm 选了固定 128 bit——所有现代 CPU 都至少有 128 bit 寄存器,JIT 输出最直接。SIMD 是 wasm 唯一一个"明确放弃移植性最大化" 的提案。
SIMD was the most contentious wasm proposal — the central debate was fixed width vs scalable. x86 AVX-512 is 512 bit; ARM SVE2 is 128–2048 bit scalable; RISC-V's V extension is genuinely scalable. The portable answer is scalable, but scalable codegen is so complex that no JIT can handle it in the browser. Wasm settled on fixed 128 bit — every modern CPU has 128-bit registers, so JIT output is direct. SIMD is the one proposal where wasm explicitly traded portability for tractability.
同一个 128-bit 寄存器,根据操作解释成 16/8/4/2 个 lane。i8x16.add 是 16 个并行 8-bit 整数加;f32x4.mul 是 4 个并行 32-bit 浮点乘;i64x2.shl 是 2 个并行 64-bit 位移。类型在指令里,不在值里——这是 wasm 整个类型系统的统一原则,SIMD 也不例外。
The same 128-bit register is interpreted into 16/8/4/2 lanes by the op: i8x16.add = 16 parallel 8-bit adds, f32x4.mul = 4 parallel 32-bit float mults, i64x2.shl = 2 parallel 64-bit shifts. The type lives in the op, not the value — wasm's unifying principle, SIMD included.
128 bit 可以看成:
128 bits can be viewed as:
| Lane shape | Lanes | Per-lane type | Examples |
|---|---|---|---|
| i8x16 | 16 | i8 | i8x16.add |
| i16x8 | 8 | i16 | i16x8.mul |
| i32x4 | 4 | i32 | i32x4.add |
| i64x2 | 2 | i64 | i64x2.add |
| f32x4 | 4 | f32 | f32x4.sqrt |
| f64x2 | 2 | f64 | f64x2.sqrt |
同一个 v128 寄存器,根据操作解释成不同 lane shape。i8x16.add 是 16 个 8-bit 整数对应相加,f32x4.mul 是 4 个 32-bit 浮点对应相乘。类型在指令里,不在值里——这是 wasm 整个类型系统的复制粘贴(回顾 Ch08)。
A single v128 register is reinterpreted by the op's lane shape. i8x16.add = 16 paired 8-bit adds; f32x4.mul = 4 paired 32-bit float mults. The type lives in the op, not the value — a copy-paste from wasm's overall design (recall Ch08).
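"类型在指令里,不在值里"可以用 typed array 直接摸到:同样 16 个字节,按不同 lane shape 解释——这正是 v128 各指令对同一个寄存器做的事:
"Type lives in the op, not the value" is touchable with typed arrays: the same 16 bytes read through different lane shapes — exactly what v128 ops do to one register:

```javascript
// 16 bytes of "v128", two views over the same storage.
const buf = new ArrayBuffer(16);
const i8  = new Uint8Array(buf);    // the i8x16 interpretation
const f32 = new Float32Array(buf);  // the f32x4 interpretation

f32[0] = 1.0;                       // write one f32 lane...
// ...and the i8x16 view sees its IEEE-754 byte pattern
// (00 00 80 3f on a little-endian host)
const bytesOfOne = [i8[0], i8[1], i8[2], i8[3]];
```

值本身没有类型标签——改的是"读它的指令"。
The value carries no type tag — what changes is the op reading it.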
```wat
;; SIMD-vectorised inner loop: process 16 columns at once
v128.load offset=0              ;; 16 bytes from src row 0
v128.load offset=1              ;; 16 bytes shifted right by 1
i16x8.extadd_pairwise_i8x16_u   ;; widen + add pairs → 8 × i16
v128.load offset=2              ;; 3rd column
i16x8.extadd_pairwise_i8x16_u
i16x8.add                       ;; sum 3 columns of row 0
;; ... repeat for rows 1 and 2, then sum 3 rows → 8 × i16 sums ...
i16x8.div_u                     ;; ÷ 9 (one v128 op = 8 div in 4 cy)
i8x16.narrow_i16x8_u            ;; saturate back to 8-bit
v128.store offset=0             ;; 16 bytes written to dst
```
一次循环处理 16 个像素,而不是一个。3×3 卷积变成"3 行 SIMD 加法 + 1 次 SIMD 除法 + 1 次 SIMD 写入"。TurboFan 把这段 wasm 翻成 x86 的 PADDW / PMULHRSW / PSRLW 等 SSE2 指令——每条 inner loop 大约 18 条 SSE 指令,处理 16 像素,平均每像素 ~1.1 条 SSE。这比标量版本(每像素 ~20 条 x86 指令)快 6 倍以上,正是 hero pulse bar 里看到的 6.8× 来源。
One iteration handles 16 pixels, not one. The 3×3 convolution becomes "3 rows of SIMD add + 1 SIMD divide + 1 SIMD store". TurboFan lowers this wasm into x86 PADDW / PMULHRSW / PSRLW SSE2 ops — ~18 SSE instructions per inner iteration handling 16 pixels, averaging ~1.1 SSE per pixel. Vs ~20 x86 instructions per pixel in the scalar version → 6×+ speedup, exactly the 6.8× from the hero pulse bar.
Relaxed SIMD 提案(2024)加了一组结果在不同 CPU 上可能略有差异的 SIMD 指令——例如 i16x8.relaxed_q15mulr_s 在不同 CPU 上结果可能差 1 个 ulp。原因:严格的 SIMD 在 x86 上有时要 emulate(因为 SSE2 不完全等价于 NEON 的某些精确语义),emulate 的开销大。Relaxed SIMD 允许 JIT 选最快的硬件 op,牺牲严格 deterministic。这是 wasm 第一次主动放弃"portable" 的子集——专门给图像滤镜、AI 推理这些"差 1 个 ulp 没人 care" 的场景。
The Relaxed SIMD proposal (2024) added a set of SIMD ops whose results may differ slightly across CPUs — e.g. i16x8.relaxed_q15mulr_s may diverge by 1 ulp across x86 vs ARM. Reason: strict SIMD sometimes needs emulation on x86 (SSE2 isn't bit-equivalent to certain NEON semantics), and emulation is expensive. Relaxed SIMD lets the JIT pick the fastest hardware op, trading strict determinism. The first time wasm willingly dropped a portion of "portable" — aimed at image filters, AI inference, the "1 ulp doesn't matter" scenarios.
struct · array · i31 · ref
struct · array · i31 · ref
2017 MVP 时 wasm 没有 GC。Java / Kotlin / Dart / C# 这些带 GC 的语言要跑 wasm,只能把整个 GC 运行时也编译进 wasm——TeaVM 编 Java,Kotlin/Wasm 编 Kotlin,DartVM-wasm 编 Dart,每个加 1-2 MB 的运行时 wasm。这意味着同一个标签页里 10 个 wasm 模块就有 10 份 GC 在跑,堆不共享,STW 不协调——巨大的浪费。
wasm-GC 提案的解法:让 wasm 模块共享宿主 GC(浏览器里就是 V8 或 SM 自带的 GC),wasm 那边定义 struct 和 array 类型,引擎负责分配 / 回收。2024 年 V8 130、Firefox 120 同时 ship,Kotlin/Wasm 立刻把运行时从 1.4 MB 砍到 400 KB,Dart 团队也在改造。
The 2017 MVP had no GC. For GC-bearing languages like Java / Kotlin / Dart / C#, the only way was to compile your GC runtime into wasm — TeaVM for Java, Kotlin/Wasm for Kotlin, DartVM-wasm for Dart, each adding 1–2 MB of runtime wasm. Ten wasm modules in one tab meant ten GCs running, heaps unshared, STW pauses uncoordinated — staggering waste.
The wasm-GC proposal's answer: wasm modules share the host's GC (V8's or SM's in browsers); wasm declares struct and array types, the engine allocates and collects. V8 130 and Firefox 120 shipped it simultaneously in 2024. Kotlin/Wasm immediately cut its runtime from 1.4 MB to 400 KB; the Dart team is refactoring.
```wat
(type $Point (struct (field $x f64) (field $y f64)))
(type $Vec (array (mut i32)))

;; allocate
struct.new $Point      ;; (consume 2 f64 on stack → produce (ref $Point))
array.new $Vec         ;; (i32 len + i32 init → produce (ref $Vec))

;; access
struct.get $Point $x   ;; pop ref, push f64
array.get_u $Vec       ;; pop ref + i32 idx, push elem

;; cast (downcast)
ref.cast (ref $Point)  ;; runtime check, trap on mismatch

;; small ints inline (avoid heap alloc)
i31.new                ;; pop i32 (31-bit limited), push i31ref
i31.get_s              ;; unbox
```
三个观察:① 类型用名义定义($Point 不等于另一个 same-shape struct),允许 sub-typing;② 引用类型可以为 null 也可以 nullable;③ i31ref 直接借用 V8 的 SMI 技术——31 位整数不分配堆,直接 inline 到 ref 槽位低位,跟 JS 的 Number 互操作几乎免费。
Three notes: ① types are defined nominally (one $Point ≠ another struct of the same shape) and support sub-typing; ② references can be nullable or not; ③ i31ref borrows V8's SMI trick — 31-bit ints are not heap-allocated, inlined into the low bits of a ref slot, near-free interop with JS Number.
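SMI/i31 的标记技巧可以用两行位运算示意(只演示方案思路;各引擎的 tag 约定并不相同):31 位整数左移一位塞进槽位,低位作标记,取回时算术右移还原符号。
The SMI/i31 tagging trick in two lines of bit twiddling (scheme illustration only; engines differ on tag conventions): the 31-bit int is shifted into the slot, the low bit is the tag, and an arithmetic right shift restores the sign.

```javascript
// Tagged-word sketch: low bit 0 = inline int, low bit 1 = "heap reference".
const toI31   = (x) => (x << 1) | 0;   // 31-bit int → tagged word, no allocation
const isI31   = (w) => (w & 1) === 0;
const fromI31 = (w) => w >> 1;         // arithmetic shift keeps the sign

fromI31(toI31(1000));  // 1000 — round-trips without touching the heap
fromI31(toI31(-7));    // -7
```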
"wasm 跑在 V8 上,直接用 JS Object 不就行了?"——这是另一个常见疑问。答案:JS Object 是动态形状的(隐藏类、IC、加 property 时形状变),wasm 需要静态形状的对象才能做高效 codegen。wasm-GC 的 struct 类型在编译期固定 layout——访问 $Point.$x 就是 mov reg, [ref+8],一条指令,没有 IC,没有 deopt。共享 GC 不等于共享对象模型。
"Wasm runs on V8 — why not use JS Object directly?" — another common question. Answer: JS Object is dynamically shaped (hidden classes, ICs, shape changes on property add), but wasm needs statically shaped objects for efficient codegen. A wasm-GC struct has a layout fixed at compile time — accessing $Point.$x compiles to mov reg, [ref+8], one instruction, no IC, no deopt. Sharing the GC ≠ sharing the object model.
WIT · interface types · WASI 0.2
WIT · interface types · WASI 0.2
"一个 Rust 写的 wasm 怎么调一个 Go 写的 wasm,传一个字符串?"——MVP wasm 无法回答这个问题。原因:wasm 只有 i32/i64/f32/f64/v128,没有 "字符串" 类型。Rust 那边 String 是 (ptr, len, cap) 三件套,Go 那边 string 是 (ptr, len) 两件套,Java 那边 String 是 UTF-16 数组——三方都要用约定俗成的方式把字符串"展开"成 i32 + 长度。每两种语言都要写一套胶水,N² 复杂度。
Component Model 用一个新的组件层解决这个问题。在 .wasm 的"core module" 之上,加一个 .component 文件,声明用语言无关的类型系统(string / list<T> / record / variant)描述接口。组件间互调由规范的 ABI 处理 lift / lower——发送端把语言原生类型 lower 成 ABI 形式,接收端 lift 回它的原生类型。N 个语言只需要 N 套 binding generator,复杂度从 N² 降到 N。
"How does Rust wasm call Go wasm, passing a string?" — the MVP can't answer. Reason: wasm has only i32/i64/f32/f64/v128, no "string" type. Rust's String is (ptr, len, cap); Go's string is (ptr, len); Java's String is UTF-16 array — each pair of languages needs custom glue. N² complexity.
Component Model solves this with a new component layer. On top of a wasm "core module", a .component file declares interfaces in a language-agnostic type system (string / list<T> / record / variant). Inter-component calls are mediated by a canonical ABI that lifts and lowers — the caller lowers its native type to the ABI shape, the callee lifts back into its native type. N languages need only N binding generators; complexity drops from N² to N.
Component Model 的核心抽象:lower 把语言原生类型(Rust 的 String / Go 的 string / Java 的 String)降成 wasm 原始 scalar(i32 ptr + i32 len)+ linear memory 字节;lift 是反过程。两端语言不知道彼此存在,只跟同一份 WIT 协议握手。N 种语言只需 N 套 binding generator,复杂度从 N² 降到 N。
Component Model's core abstraction: lower takes a language-native type (Rust's String, Go's string, Java's String) and lowers it into wasm scalars (i32 ptr + i32 len) + linear-memory bytes; lift is the inverse. Neither side knows the other exists; both shake hands with the same WIT contract. N languages need only N binding generators — complexity drops from N² to N.
```wit
// blur.wit — the interface for our image-processor component
package ursb:image@0.1.0;

interface filter {
  record bitmap {
    width: u32,
    height: u32,
    pixels: list<u8>,
  }

  enum error { invalid-size, oom }

  blur3: func(input: bitmap) -> result<bitmap, error>;
}

world image-tools {
  export filter;
}
```
用 wit-bindgen 工具:
Use wit-bindgen:
```sh
$ wit-bindgen rust blur.wit --out-dir ./bindings/
# Generates Rust types matching the WIT — `Bitmap { width: u32, ... }` plus a trait you `impl`

$ wit-bindgen go blur.wit --out-dir ./bindings/
# Same component, now in Go

$ wasm-tools component new core.wasm -o blur.component.wasm
# Package the core module + component metadata
```
每种语言的 binding generator 知道怎么把它的原生类型 marshal 进 ABI 形式:Rust 的 String → (ptr, len),Go 的 string → (ptr, len) 但用 GC 跟踪,Java 的 String → UTF-8 编码后传递。所有这些细节对组件作者完全透明。
Each binding generator knows how to marshal native types into the ABI: Rust's String → (ptr, len), Go's string → (ptr, len) tracked by GC, Java's String → UTF-8-encoded. All these details are invisible to the component author.
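lower / lift 可以用一个 Uint8Array 充当 linear memory 来示意(bump allocator 代替 realloc,仅为草图,不是规范 ABI 的完整实现):字符串降成 (ptr, len) + UTF-8 字节,再从字节提升回原生字符串。
Lower / lift sketched with a Uint8Array standing in for linear memory (a bump allocator in place of realloc; a sketch, not a full canonical-ABI implementation): a string lowers to (ptr, len) + UTF-8 bytes, and lifts back to a native string.

```javascript
// "Linear memory" plus a bump allocator.
const memory = new Uint8Array(65536);
let heapTop = 0;

function lowerString(s) {  // native string → (ptr, len) + bytes in memory
  const bytes = new TextEncoder().encode(s); // canonical-ABI strings are UTF-8
  const ptr = heapTop;
  memory.set(bytes, ptr);
  heapTop += bytes.length;
  return [ptr, bytes.length];
}

function liftString(ptr, len) {  // (ptr, len) → native string
  return new TextDecoder().decode(memory.subarray(ptr, ptr + len));
}

const [ptr, len] = lowerString("blur me");
liftString(ptr, len);  // "blur me" — Rust and Go never see each other's layout
```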
WASI 0.2(2024 ship)的所有"系统接口" —— wasi:io / wasi:filesystem / wasi:http / wasi:clocks —— 都用 Component Model 声明。这意味着同一个 wasm 组件 可以在 Wasmtime / Wasmer / Spin / Jco / 浏览器 polyfill 上跑,只要 host 提供对应的 WASI 接口实现。这是真正的"一次编译,处处运行"——比 Java 当年的承诺更彻底,因为它跨语言。Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin 等"边缘计算 wasm" 平台,本质都是 Component Model 的客户。
WASI 0.2 (shipped 2024) declares all its "system interfaces" — wasi:io / wasi:filesystem / wasi:http / wasi:clocks — through the Component Model. That means one wasm component runs on Wasmtime / Wasmer / Spin / Jco / a browser polyfill, as long as the host implements the matching WASI interface. The real "compile once, run anywhere" — more thorough than Java's old promise because it's cross-language. Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin — every edge-wasm platform is essentially a Component Model customer.
tail-call · EH · memory64 · JSPI · stack-switching · multi-memory
tail-call · EH · memory64 · JSPI · stack-switching · multi-memory
除了 Threads / SIMD / GC / Component Model 这 4 个"明星" 提案外,还有六个对生态影响很大的提案正在不同阶段。下面是 2026 年 5 月的现状快照——这些数字会变,但格局相对稳定。
Beyond the four "headliners" (Threads / SIMD / GC / Component Model), six more proposals materially shape the ecosystem at various phases. A snapshot as of May 2026 — the numbers move, but the landscape is stable.
- 尾调用 Tail calls:return_call + return_call_indirect。让函数式语言(Scheme/OCaml)能尾递归到 wasm 不爆栈。V8 2023 ship。return_call + return_call_indirect. Lets functional langs (Scheme/OCaml) tail-recurse without stack blow-up. V8 shipped 2023.
- 异常处理 Exception handling:try / catch / throw / tag 一套。C++ 异常、Rust panic 现在能"真正"抛而不是断电。2023 ship。try / catch / throw / tag. C++ exceptions, Rust panics now genuinely throw rather than abort. Shipped 2023.

```wat
;; before — call + return: stack grows N deep for N tail calls
(func $fact (param $n i32) (param $acc i32) (result i32)
  local.get $n
  i32.eqz
  (if (result i32)
    (then local.get $acc)
    (else
      local.get $n
      i32.const 1
      i32.sub
      local.get $n
      local.get $acc
      i32.mul
      call $fact        ;; ← regular call, stack grows N frames
    )))

;; after — return_call: reuse current frame, stack stays at 1 frame
return_call $fact       ;; ← O(1) stack
```
没有 tail-call 时,函数式语言只能用 trampoline 模拟尾递归——把 "下一步要调用什么" 当返回值,外层循环里轮询。代码丑、慢 3 倍。tail-call ship 之后 Scheme / Erlang / OCaml 编译到 wasm 才真正可用。
Without tail-call, functional languages had to simulate tail recursion via trampolines — returning "what to call next" and looping in an outer loop. Ugly, 3× slower. Post-tail-call, Scheme / Erlang / OCaml-to-wasm is genuinely usable.
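trampoline 写出来就是几行(JS 示意):尾调用处返回一个 thunk("下一步调什么"),外层循环驱动,栈深恒为 1。
The trampoline spelled out (a JS sketch): the tail position returns a thunk ("what to call next"), an outer loop drives it, and stack depth stays at 1.

```javascript
// Drive a computation that returns thunks instead of making tail calls.
function trampoline(step, ...args) {
  let r = step(...args);
  while (typeof r === 'function') r = r();  // keep bouncing
  return r;
}

// sum 1..n in accumulator style; the "tail call" is a returned closure
const sum = (n, acc) => n === 0 ? acc : () => sum(n - 1, acc + n);

trampoline(sum, 100000, 0);  // 5000050000 — no stack overflow at depth 100000
```

直接递归 sum(100000) 会爆栈;trampoline 版丑、慢,但能跑——这正是 return_call 让编译器免去的代价。
Direct recursion at n = 100000 blows the stack; the trampolined version is ugly and slow but runs — exactly the cost return_call lets compilers drop.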
假设你的 wasm 要 fetch 一个网络资源。在 MVP 里你只能:① wasm 调用一个 JS 函数,JS 起 fetch,等 fetch 完后回调 wasm 的另一个函数。代码丑,因为 wasm 函数被切成两半。JSPI 让 wasm 函数能停在中间,等 JS Promise resolved 再继续。引擎在 stack 上记一个 continuation,fetch 完后从这个 continuation 恢复。对开发者像是同步代码,引擎在底下做了异步。
Suppose your wasm wants to fetch a network resource. In the MVP, you could only: ① wasm calls a JS function, JS issues fetch, on completion JS calls back another wasm function. Ugly — your wasm function is cut in half. JSPI lets a wasm function pause mid-execution, await a JS Promise, resume. The engine stores a continuation on the stack; on resolve, it resumes from that continuation. To the developer it reads as sync code; under the hood the engine does async.
前 22 章拆开看每一道工序;这 4 章把它们拼回来。先写一份性能模型,把"为什么 wasm 比 JS 快" 拆成具体百分比;然后讲怎么用 Chrome DevTools 在 wasm 里设断点、看变量、追 SourceMap;接着是真实战场——Figma / Photoshop / AutoCAD / Ruffle / ffmpeg 这些把 wasm 用到极限的工业级产品;最后一份术语表,把全文出现过的 50 个名词钉死定义。读完这 4 章,你应该能在任何技术讨论里 hold 住 wasm 这个话题。
The previous 22 chapters dissected each stage; these 4 stitch them back. First, a performance model that decomposes "why wasm is faster than JS" into concrete percentages; then Chrome DevTools — setting breakpoints, inspecting locals, following source maps in wasm; then the battlefield — Figma / Photoshop / AutoCAD / Ruffle / ffmpeg, the industrial products that push wasm to the edge; finally a glossary of 50 terms used throughout. By the end, you should be able to hold any wasm conversation.
把工程经验写成公式
turning engineering folklore into a formula
"wasm 比 JS 快多少" 是个没法一句话回答的问题——它依赖于代码模式、引擎版本、SIMD / 多线程是否开。但我们可以写一个分解公式:
"How much faster is wasm than JS" can't be answered in one sentence — it depends on code pattern, engine version, SIMD/threads. But we can write a decomposition formula:
人们以为 wasm 启动慢——其实Liftoff 让 wasm 启动比 JS 还快。一个 1 MB 的 wasm 模块,Liftoff ~ 100 ms 出码就能跑;一个 1 MB 的 minified JS,V8 要 parse + Ignition + 进 inline cache,~ 200 ms 才稳定。wasm 启动从 2018 年起就不再是性能问题。剩下的延迟主要是下载——文件大小决定的,不是 wasm 的错。
People assume wasm startup is slow — actually Liftoff makes wasm boot faster than JS. A 1 MB wasm module: Liftoff ~100 ms to runnable. A 1 MB minified JS: V8 parses + Ignition + IC warmup ~200 ms to steady state. Since 2018, wasm startup hasn't been a perf problem. Remaining latency is download — a function of file size, not wasm's fault.
错。短函数 + 频繁过桥时 wasm 慢。"wasm 快"是大块计算的快。
False. Short funcs with frequent crossings: wasm loses. "Wasm-fast" describes chunky compute.
错。Cloudflare Workers, Spin, Fastly, Shopify Functions 都在服务器跑 wasm,数量已超过浏览器 wasm 模块的总数。
False. Cloudflare Workers, Spin, Fastly, Shopify Functions all run server-side wasm — collectively more module-instances than the browser.
2024 年前是,现在不是。wasm-GC ship 之后,Kotlin/Wasm 已经是生产就绪。
True pre-2024, false now. Post-wasm-GC, Kotlin/Wasm is production-ready.
错。Rust 是 wasm 最大的语言,但 C++(Emscripten)、Go(TinyGo)、Kotlin、AssemblyScript、Swift 都能编 wasm。
False. Rust is wasm's biggest source language, but C++ (Emscripten), Go (TinyGo), Kotlin, AssemblyScript, Swift all compile to wasm.
name section · source maps · DWARF
name section · source maps · DWARF
编完 wasm 后函数 / 局部变量都变成索引——func 17 而不是 blur3,local 3 而不是 sum。直接在 DevTools 里看一团数字几乎不可能调试。三层调试信息把符号补回来:① name custom section(函数 + locals 名字);② source map(行号 → 文件名+行号);③ DWARF custom section(完整的类型信息 + 局部变量映射 + inline 信息)。三层都通过 custom section 加塞进 .wasm,运行时不影响,DevTools 启用 "WebAssembly Debugging" 选项后才解析。
After compilation, functions and locals are indices — func 17 not blur3, local 3 not sum. Debugging a wall of numbers in DevTools is near-impossible. Three layers of debug info put names back: ① name custom section (function + local names); ② source map (line → file+line); ③ DWARF custom section (full type info + local-variable mapping + inline info). All three ride custom sections inside the .wasm — invisible at runtime, parsed when DevTools "WebAssembly Debugging" is enabled.
| Layer | Section | What it gives you | Cost (size) | Tooling |
|---|---|---|---|---|
| name | "name" | 函数 + locals 名字function + local names | + 1–3 % | built into rustc / wasm-bindgen |
| source map | "sourceMappingURL" | 行号 ↔ 源文件位置line ↔ source file position | + 5–10 % | wasm-opt --emit-source-map |
| DWARF | ".debug_*" | 完整类型 + 变量 + inlinetypes + locals + inlining | + 30–100 % | clang -g · rustc -g · DWARF dumping |
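To make "rides in a custom section, invisible at runtime" concrete, here is a minimal sketch in plain JS: a hand-assembled module whose only content beyond the header is a custom section named "note" with a two-byte payload (both the name and the "hi" payload are invented for illustration). The engine compiles it without complaint because custom sections are inert; `WebAssembly.Module.customSections` reads them back, which is exactly how tooling finds `name`, `sourceMappingURL`, and `.debug_*` data.

```javascript
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // \0asm magic + version 1
  0x00, 0x07,                                     // section id 0 (custom), 7 bytes
  0x04, 0x6e, 0x6f, 0x74, 0x65,                   // name length 4, "note"
  0x68, 0x69,                                     // payload: "hi"
]);
const mod = new WebAssembly.Module(bytes);        // compiles fine: the section is ignored
const [payload] = WebAssembly.Module.customSections(mod, "note");
const text = new TextDecoder().decode(payload);
console.log(text); // "hi"
```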
2020 年 Chrome 88 起,DevTools 集成了 wasm 调试器(由 Google 内部 chrome-devtools-frontend 团队和 Bloomberg 合作开发)。开启步骤:
From Chrome 88 (2020), DevTools includes a wasm debugger (built by Chromium's chrome-devtools-frontend team and Bloomberg). Enable in three steps:
① 编译时加 -g:RUSTFLAGS="-g",或 cargo build --release 配 [profile.release] debug = true;C++: emcc -g hot.c -o hot.wasm。会让 wasm 体积涨 30–100%,但调试体验质变。Compile with -g: RUSTFLAGS="-g", or cargo build --release with [profile.release] debug = true. C++: emcc -g hot.c -o hot.wasm. Bumps wasm size 30–100% but transforms the debug experience.
② 变量直接按源码类型显示:sum: u32 = 1842(从 DWARF 解出类型 + 当前寄存器/栈位置 + 字节解码)。Variables display with their source types: sum: u32 = 1842 (decoded via DWARF type + register/stack location + byte interpretation).
③ 要看最终机器码:--print-wasm-code,或 D8 + --print-code,看具体 x86-64 / ARM64。For the final machine code: --print-wasm-code, or D8 with --print-code for raw x86-64 / ARM64.
如果服务器对 .wasm 做了 gzip / brotli,Chrome 会在解压后的字节上解析 source-map URL——但 .map 文件是一个独立的 URL(hot.wasm.map)。记得 .map 也 deploy 上 CDN,否则 DevTools 会报 "404, falling back to disassembly"。这是新手最常踩的坑之一。
If your server gzips / brotlis the .wasm, Chrome resolves the source-map URL against the decompressed bytes — but the .map file is a separate URL (hot.wasm.map). Deploy the .map alongside the .wasm, or DevTools shows "404, falling back to disassembly". One of the most common newbie traps.
Figma · Photoshop · AutoCAD · Ruffle · ffmpeg
这一章不讲技术,讲产品。下面的每一个案例都是 wasm 跨过工程门槛、跑在百万级用户上的真实证据。它们一起构成了"这是 wasm 能做的事"的最有力证明。
This chapter isn't technical — it's about products. Each case below is real evidence of wasm clearing the engineering bar to ship to millions of users. Together they form the strongest argument for what wasm can actually do.
Figma 2016 年上线时,渲染引擎已经是 C++ 编到 asm.js 跑在浏览器里。2017 年 wasm MVP ship 后立刻迁移到 wasm——启动速度提升 3 倍,文件加载提升 2 倍。Evan Wallace(Figma 联合创始人)在博客里写过:"without WebAssembly, Figma would not exist"。Figma 的整个矢量编辑、canvas 渲染、协作 OT 算法都在 wasm 里——只有 UI 是 React。它定义了"wasm-first 应用"的工程模板。
Figma launched in 2016 with its rendering engine already compiled from C++ to asm.js. Post-wasm MVP in 2017 it migrated immediately — 3× startup speedup, 2× file load. Co-founder Evan Wallace wrote on the blog: "without WebAssembly, Figma would not exist". Figma's vector editing, canvas rendering, and collaborative OT all run inside wasm — only the UI is React. It defined the engineering template for "wasm-first apps".
2023 年 Adobe 把 Photoshop 的 pixel pipeline 编译到 wasm,在 Chromium 上开始公测。模块大小:70 MB(gzip 后 18 MB)。用了 wasm threads + SIMD + 多 memory + JSPI。其中最大的工程难点是 Photoshop 自带的内存分配器(jemalloc)要从假定有 mmap 的 native 环境改为 wasm 的 linear memory——他们花了 9 个月把 jemalloc 移植成"wasm 友好" 的版本。Photoshop Web 是目前为止编到 wasm 的最大商业代码库。
In 2023 Adobe compiled Photoshop's pixel pipeline to wasm and opened public beta on Chromium. Module size: 70 MB (18 MB gzipped). Uses wasm threads + SIMD + multi-memory + JSPI. The hardest engineering hurdle was porting Photoshop's bundled allocator (jemalloc) from a mmap-assuming native world to wasm's linear memory — 9 months to produce a "wasm-friendly" jemalloc. Photoshop Web is the largest commercial codebase ever compiled to wasm.
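The allocator pain is easy to reproduce from JS. Wasm linear memory grows in 64 KiB pages, never shrinks, and each grow detaches the previous ArrayBuffer — three properties any allocator ported from an mmap/munmap world has to live with. A minimal sketch:

```javascript
const mem = new WebAssembly.Memory({ initial: 1, maximum: 4 });
const before = mem.buffer;
console.log(before.byteLength);      // 65536: one 64 KiB page

mem.grow(1);                         // request one more page; memory cannot shrink back
console.log(mem.buffer.byteLength);  // 131072: two pages
console.log(before.byteLength);      // 0: the old ArrayBuffer is detached after grow
```

Native code holding raw pointers survives a grow (offsets into linear memory stay valid), but any JS-side view must be re-created, which is one reason wasm-bindgen-style glue re-fetches `memory.buffer` on every access.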
AutoCAD 1982 年首次发布,代码累计 30M+ LOC。Autodesk 2018 年开始把它编到 wasm,2020 年正式上线 AutoCAD Web App。移植中最大的挑战不是计算速度,是文件 IO 路径——AutoCAD 假定有本地文件系统,wasm 在浏览器里没有,要用 OPFS(Origin Private File System)和 fetch API 模拟。这是 WASI 0.2 的 wasi:filesystem 在浏览器里也有用的原因。
AutoCAD shipped in 1982 with 30 M+ cumulative LOC. Autodesk began compiling to wasm in 2018; AutoCAD Web launched in 2020. The biggest port hurdle wasn't compute speed — it was the filesystem path. AutoCAD assumes a local FS; wasm in the browser has none, so they shim via OPFS (Origin Private File System) and fetch. This is why WASI 0.2's wasi:filesystem matters in the browser too.
Adobe Flash 2020 年正式 EOL。但无数 90s/00s 的网页游戏 + 互动课件 + 文化档案因此面临"不能再打开"的危机。Ruffle 是一个用 Rust 写的 Flash player,编到 wasm,在浏览器里跑——纯客户端,不需要 Adobe 任何东西。在 Internet Archive 上,50 万个 .swf 游戏 / 视频已经"复活"。Ruffle 是 wasm 在文化遗产保存方向最暖的一个故事。
Adobe Flash reached EOL in 2020, leaving countless 90s/00s web games, interactive courseware, and cultural archives facing the prospect of never opening again. Ruffle is a Flash player written in Rust, compiled to wasm, running in the browser — pure client-side, nothing from Adobe required. On the Internet Archive, 500 K .swf games and videos have been "resurrected" through it. Ruffle is wasm's warmest story: cultural preservation.
把 ffmpeg(上百万行 C)编到 wasm。生成的 .wasm 大约 25 MB(gzip 后 6 MB)。性能大概是 native ffmpeg 的 40~60%——主要差距在 SIMD 不完全(ffmpeg 用了大量 AVX-512,wasm SIMD 只到 128 bit)。但 client-side 视频转码、抠图、字幕合成全部可以做。1Password、Loom、Riverside、CapCut Web 都集成了 ffmpeg.wasm。
ffmpeg (1 M+ lines of C) compiled to wasm. Result: ~25 MB .wasm (6 MB gzipped). Perf is 40–60% of native ffmpeg — the gap mostly from SIMD shortfall (ffmpeg leans on AVX-512; wasm SIMD caps at 128 bit). Even so, client-side video transcoding, chroma keying, and subtitle compositing are all on the table. 1Password, Loom, Riverside, and CapCut Web all embed ffmpeg.wasm.
Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin, Fermyon Cloud, NGINX Unit ngx_wasm。这些边缘计算平台不用容器,用 wasm 实例——冷启动 ~ 1 ms(容器 ~ 100 ms),内存隔离更便宜,可以一台机器跑十万个客户。2024 年起服务器端 wasm 实例的总数超过了浏览器。如果你只看浏览器,你只看到了 wasm 故事的一半。
Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin, Fermyon Cloud, NGINX Unit ngx_wasm. These edge platforms don't use containers — they use wasm instances. Cold start ~1 ms (containers ~100 ms), cheaper memory isolation, 100 K tenants per box. From 2024, server-side wasm instances outnumber browser instances. Watching only the browser sees only half the story.
| Battlefield | Source language | Module size | Year | Key feature used |
|---|---|---|---|---|
| Figma | C++ | ~ 3 MB | 2017 | MVP arithmetic |
| Google Earth | C++ | ~ 15 MB | 2019 | threads (early) |
| AutoCAD Web | C++ | ~ 80 MB | 2020 | threads + OPFS |
| Photoshop Web | C++ | ~ 70 MB | 2023 | threads + SIMD + multi-mem + JSPI |
| Ruffle | Rust | ~ 2 MB | 2021 | MVP + SIMD |
| ffmpeg.wasm | C | ~ 25 MB | 2019 | SIMD + threads |
| Blazor | C# | ~ 3 MB AOT | 2020 | GC (custom runtime) → wasm-GC migrating |
| 1Password CLI | Rust | ~ 5 MB | 2022 | WASI |
| Cloudflare Workers | any | variable | 2018 | server-side, 1 ms cold start |
读完这一章你能 hold 住任何 wasm 讨论
after this, you can hold any wasm conversation
- **asm.js**:wasm 的直接前身。带 "use asm" 指令的 JS 子集,引擎可以 AOT 编译。Firefox 实测过 1.5× of native。Wasm's direct predecessor: a JS subset marked with "use asm" that engines can AOT-compile. Firefox measured 1.5× of native.
- **table**:一组 funcref / externref 值,用 call_indirect 索引调用。是 C 函数指针 / C++ vtable 在 wasm 里的形式。An array of funcref / externref values, indexed via call_indirect. The wasm representation of C function pointers / C++ vtables.
- **name section**:最简单的 custom section:给函数 / locals / globals 起 UTF-8 名字,DevTools 显示 blur3 而不是 func 17。The simplest custom section: gives UTF-8 names to functions / locals / globals. DevTools shows blur3 instead of func 17.
- **streaming compilation**:WebAssembly.compileStreaming(fetch(...))。每收一段就编一段,不等下载完。WebAssembly.compileStreaming(fetch(...)). Compile each chunk as it arrives, don't wait for the full file.
- **structured control flow**:没有 goto,只有 block / loop / if 配 br k。让验证可以单遍完成。No goto; only block / loop / if + br k. Enables single-pass validation.
- **trap**:"不可恢复"的中止:越界、除零、类型转换失败。在 JS 侧表现为 WebAssembly.RuntimeError。An "unrecoverable" abort: bounds, div-by-zero, failed type cast. Surfaces as WebAssembly.RuntimeError on the JS side.
- **atomics**:wasm 的 i32.atomic.* / memory.atomic.wait/notify,出码 x86 LOCK 前缀指令 / ARM acquire-release op。Wasm's i32.atomic.* / memory.atomic.wait/notify; emits x86 LOCK-prefixed ops or ARM acquire/release.
- **tail calls**:2023 年 ship。return_call + return_call_indirect 给函数式语言(Scheme/OCaml)做 O(1) 栈的尾递归。Shipped 2023. return_call + return_call_indirect give functional langs (Scheme/OCaml) O(1)-stack tail recursion.
- **Binaryen**:一套 wasm IR + 优化器 + 后处理工具链。wasm-opt -O3 来自这里。AssemblyScript 把它当主编译器。A wasm IR + optimiser + post-pass toolchain. wasm-opt -O3 comes from here. AssemblyScript uses it as the main compiler.

读到这里,你已经看完 wasm 从字节到 SIMD 的一生。
下次再有人问 "wasm 是什么"——别用一句话回答。 Field Note · 03 · Final
You have now read the life of wasm, from byte to SIMD.
Next time someone asks "what is wasm" — refuse the one-liner. Field Note · 03 · Final
type safety · memory safety · CFI
wasm 的"safe"是被严肃证明过的——Conrad Watt 2018 年在 Isabelle/HOL 里把整套规范 mechanise 了一遍,过程中还顺手挑出 spec 里几处 bug。这一章把安全保证拆成三层,顺便讲 Spectre 漏洞如何让 wasm threads 推迟了一年半,以及 wasm 设计里那些反过来的限制——这些限制不是缺点,是故意。
Wasm's "safe" has been formally proven — in 2018 Conrad Watt mechanised the whole spec in Isabelle/HOL, discovering spec-level bugs along the way. This chapter splits the safety story into three layers, recounts how the Spectre disclosure shoved wasm threads back by 18 months, and explains the inverted design constraints — limits that are not flaws but deliberate choices.
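One of those guarantees can be poked directly from JS. A minimal sketch: a module exporting i32.div_s (the bytes are hand-encoded here only to avoid a toolchain). Division by zero doesn't scribble over memory or continue with garbage — it traps, and the trap surfaces as WebAssembly.RuntimeError.

```javascript
// Hand-encoded module: (func (export "div") (param i32 i32) (result i32) local.get 0 local.get 1 i32.div_s)
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // \0asm magic + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section: func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x64, 0x69, 0x76, 0x00, 0x00, // export section: "div" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: 1 body, 0 locals
  0x20, 0x00, 0x20, 0x01, 0x6d, 0x0b,                   // local.get 0, local.get 1, i32.div_s, end
]);
const { div } = new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports;
console.log(div(7, 2)); // 3: truncating signed division
try {
  div(1, 0);            // traps: integer division by zero is never UB in wasm
} catch (e) {
  console.log(e instanceof WebAssembly.RuntimeError); // true
}
```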
call_indirect 在 table 里查 funcref 时必须验证目标函数签名匹配,否则 trap。所有跳转目标(br k)都是结构化控制框架内的 frame——不可能跳到任意地址。这给 wasm 提供了 ROP / JOP 攻击免疫——攻击者无法把任意机器码地址塞进 funcref。
A call_indirect looking up a funcref in a table must verify the target's signature matches, else it traps. Every br k jumps inside the structured control frame — it cannot land at an arbitrary address. This grants immunity to ROP / JOP: attackers cannot stuff arbitrary machine-code addresses into a funcref.
2018 年 1 月 3 日,Spectre / Meltdown 漏洞披露。这两个漏洞利用 CPU 推测执行 + cache 时序旁路通道,可以从一个进程读到另一个进程的内存。wasm threads 当时正好处在 phase 3、即将 ship 阶段——共享内存 + 高精度计时器(performance.now() 当时还是 5 µs 精度)就是 Spectre 的完美材料。
On 3 Jan 2018, Spectre / Meltdown were disclosed. Both exploit CPU speculative execution + cache timing side channels to read another process's memory. Wasm threads were at phase 3, on the verge of shipping — shared memory + high-precision timers (performance.now() at the time was 5 µs precise) were perfect Spectre ingredients.
所有浏览器在 24 小时内做了两件事:① 把 performance.now() 精度降到 ms 级;② 关闭 SharedArrayBuffer。wasm threads 推迟一年半。最终方案是用进程隔离(COOP/COEP 头)让每个站点跑在独立进程里——旁路通道泄漏只能泄漏自己的数据,无意义。这是 web 平台史上第一次因为硬件漏洞推迟了一个软件特性。
All browsers shipped two fixes within 24 hours: ① coarsen performance.now() to ms precision; ② disable SharedArrayBuffer. Wasm threads slid 18 months. The eventual fix used process isolation (COOP/COEP headers) — each site runs in its own process, so side-channel leaks only reveal its own data. The first time the web platform delayed a software feature because of a hardware vulnerability.
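The process-isolation fix is opt-in. A page gets SharedArrayBuffer (and with it wasm shared memory and threads) back only when it ships both headers:

```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

With both present, `self.crossOriginIsolated` reports true and the page runs in its own process group, so a Spectre-style read can only leak the page's own data.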
| Year | CVE | What | Layer |
|---|---|---|---|
| 2018 | CVE-2018-6065 | V8 wasm interpreter 整数溢出V8 wasm interpreter int overflow | impl bug |
| 2020 | CVE-2020-9802 | JSC wasm 类型混淆JSC wasm type confusion | impl bug |
| 2021 | CVE-2021-21195 | V8 wasm UAFV8 wasm use-after-free | impl bug |
| 2022 | CVE-2022-4135 | V8 wasm heap buffer overflowV8 wasm heap buffer overflow | impl bug |
| 2023 | CVE-2023-2935 | V8 wasm 类型混淆,sandbox 逃逸V8 wasm type confusion, sandbox escape | impl bug |
| 2024 | CVE-2024-11116 | V8 Turboshaft wasm OOBV8 Turboshaft wasm OOB | impl bug |
注意一个模式:所有 CVE 都是引擎实现 bug,没有规范级漏洞——这正是 mechanised proof 的胜利。spec 是数学上正确的,但 V8/SM/JSC 必须把它落地到 C++ 代码,这一步会出错。各浏览器现在都跑 fuzzing 工具(wasm-mutate、OSS-Fuzz)持续测试,每个月在主线分支上跑数万 CPU 小时。
A pattern: every CVE is an implementation bug, never a spec-level hole — the win of mechanised proof. The spec is mathematically sound; V8/SM/JSC must land it in C++ and that's where errors creep in. Browsers now run continuous fuzzing (wasm-mutate, OSS-Fuzz) — tens of thousands of CPU-hours per month on the main branches.
CF Workers · Spin · Fermyon · Wasmtime
到 2024 年,全球服务端 wasm 实例数量超过了浏览器 wasm 模块的总数——但大多数前端工程师不知道这件事。这一章把视野从浏览器移开。服务端 wasm 解决了一个不同的问题:容器太慢、太重——一个 Docker 容器冷启动 100 ms-数秒,而一个 wasm 实例 1 ms。当你想跑 10 万个客户的 isolated 代码在同一台机器,这个差距决定了一个商业模式能不能成立。
By 2024, global server-side wasm instance counts had overtaken browser wasm module counts — but most front-end engineers don't know this. This chapter shifts the focus off the browser. Server-side wasm solves a different problem: containers are too slow, too heavy — a Docker container cold-starts in 100 ms to seconds, a wasm instance in ~1 ms. When you want to run 100 K customers' isolated code on one machine, that gap decides whether a business model is viable.
| Platform | Runtime | Cold start | Mem limit | Isolation | WASI 0.2? |
|---|---|---|---|---|---|
| Cloudflare Workers | V8 isolates | ~ 5 ms | 128 MiB | V8 isolate | partial |
| Fastly Compute@Edge | Wasmtime + Lucet | ~ 1 ms | 128 MiB | per-instance | yes |
| Fermyon Spin | Wasmtime | ~ 1 ms | config | per-instance | yes |
| Shopify Functions | Wasmtime | ~ 5 ms | 10 MiB | strict | partial |
| NGINX Unit ngx_wasm | WAMR / Wasmtime | ~ 2 ms | config | per-request | partial |
| Wasmtime (standalone) | Cranelift | ~ 0.5 ms | 4 GiB (wasm32) | process | yes |
数字差 2 个数量级。这让 wasm 在函数即服务(FaaS)场景里成为唯一可行的隔离方案——AWS Lambda 用容器,冷启动 100 ms~3 s 是真实痛点;Cloudflare Workers 用 V8 isolate(算 wasm 半亲戚),冷启动 5 ms;Fastly 用 Wasmtime,1 ms。同样的代码,延迟差 100 倍。
Two orders of magnitude difference. That makes wasm the only viable isolation model for function-as-a-service — AWS Lambda runs containers, cold-starts of 100 ms–3 s are a real pain point; Cloudflare Workers run V8 isolates (a wasm half-sibling) at ~5 ms; Fastly runs Wasmtime at ~1 ms. Same code, 100× latency gap.
浏览器 wasm 通过 import 拿到 JS 函数;服务端 wasm 通过 import 拿到系统接口——文件读写、网络、时钟、随机数。MVP 时代每家平台都自己定义,Cloudflare 的 API ≠ Fastly 的 API ≠ Wasmtime 的 API。WASI 0.2 (2024 ship) 用 Component Model 把这套接口标准化成一组 .wit 文件:wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets。同一个 wasm 组件可以跑在所有支持 WASI 0.2 的平台——这才是真正的 "compile once, run anywhere"。
Browser wasm gets JS functions via import; server wasm gets system interfaces via import — file I/O, networking, clocks, randomness. In the MVP era every platform defined its own; Cloudflare's API ≠ Fastly's ≠ Wasmtime's. WASI 0.2 (shipped 2024) standardised them as Component Model .wit files: wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets. One wasm component runs on every WASI 0.2-compliant platform — the real "compile once, run anywhere".
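What those interfaces look like from a component's side can be sketched in WIT. The `example:imaging` package, `blur-tool` world, and `blur` export below are invented for illustration; the `wasi:*` names are the real WASI 0.2 interfaces:

```wit
// Hypothetical package and world; a component targeting this world
// runs on any host that supplies the imported wasi:* interfaces.
package example:imaging@0.1.0;

world blur-tool {
  import wasi:filesystem/types@0.2.0;
  import wasi:random/random@0.2.0;

  // one typed export, callable from any WASI 0.2 host
  export blur: func(pixels: list<u8>, width: u32, radius: u32) -> list<u8>;
}
```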
wasm 没有 fork(),WASI 0.2 也没标准化 threads。Go 程序的 goroutine、Node.js 的 worker thread 在 wasm 里都失效——除非用 stack-switching 提案(还在 phase 3)。
No fork(); WASI 0.2 still doesn't standardise threads. Go goroutines and Node.js worker threads all break in wasm — until stack-switching ships (still phase 3).
七个硬限制及它们的绕过办法
seven hard limits and how to route around them
这一章倒过来定义 wasm。前 28 章描述了 wasm 能做什么,这一章列七件它结构性做不到的事——以及工程上怎么绕。这些"不能"不是 bug,是 feature,体现了 wasm 的设计哲学:小而硬,而不是大而软。
This chapter defines wasm by negation. The previous 28 described what wasm can do; this one lists seven things wasm structurally cannot — and how engineers route around them. These "cannot"s are features, not bugs — they reflect wasm's design philosophy: small and hard, not big and soft.
Rust 的 String 和 Go 的 string 不兼容——每对语言要自己写 marshal 代码。绕法:用 Component Model + WIT,把 N² glue 复杂度降到 N。
Rust's String and Go's string are incompatible — every pair of languages needs its own marshalling code. Workaround: Component Model + WIT drops the N² glue complexity to N.
wasm 的"不能",定义了它的"能"。 Field Note · 03 · Appendix
Wasm's "cannot" defines its "can". Field Note · 03 · Appendix
W3C · IETF · IEEE · 学术 · 源码
W3C · IETF · IEEE · academia · source
这一节把全文用到的所有外部标准、规范、论文、源码归档。每条引用带状态(REC = W3C Recommendation,CR = Candidate Recommendation,WD = Working Draft)+ 链接 + 你在哪一章会用到它。所有 URL 在 2026 年 5 月有效;wasm 提案演化快,phase 4 后会迁移到 W3C TR/ 命名空间。
This section archives every external standard, spec, paper, or source-code reference the article touches. Each carries a status pill (REC = W3C Recommendation, CR = Candidate Recommendation, WD = Working Draft) + link + the chapter that needs it. All URLs valid as of May 2026; wasm proposals move quickly, post-phase-4 entries migrate to W3C TR/ namespaces.
WebAssembly.Module/Instance/Memory/Table/Global 接口。Ch16/17 用。JS-side WebAssembly.* surface. Used by Ch16/17.
compileStreaming / instantiateStreaming / Response 集成。Ch12 用。compileStreaming / instantiateStreaming / Response integration. Used by Ch12.
- **bulk memory**:memory.copy/fill/init 等。Ch07/22。memory.copy/fill/init etc. Ch07/22.
- **tail calls**:return_call + return_call_indirect。Ch22。return_call + return_call_indirect. Ch22.
- chrome://flags/#enable-experimental-webassembly-features · 在 Chrome 里打开所有 phase 2-3 提案。Enables all phase 2-3 proposals in Chrome.
- **DWARF for WebAssembly**:wasm 调试信息(.debug_* custom sections)。Ch24。Wasm debug info (.debug_* custom sections). Ch24.
- **source maps**:2024 年起在 TC39 标准化;wasm 通过 sourceMappingURL custom section 引用。Ch24。Standardising at TC39 since 2024; wasm references it via the sourceMappingURL custom section. Ch24.
- v8/src/wasm/module-decoder.cc · 流式 decode 主流程。Ch12。Streaming decode main loop. Ch12.
- v8/src/wasm/function-body-decoder-impl.h · 类型栈验证模板。Ch11/13。Type-stack validator template. Ch11/13.
- v8/src/wasm/baseline/liftoff-compiler.cc · 基线 JIT 主入口。Ch14。Baseline JIT main entry. Ch14.
- v8/src/compiler/wasm-compiler.cc · wasm → TF graph build。Ch15。wasm → TF graph build. Ch15.
- v8/src/compiler/turboshaft/wasm-*.cc · 2023 后默认的 wasm 优化器。Ch15。Default wasm optimiser since 2023. Ch15.
- v8/src/wasm/wasm-import-wrapper-cache.cc · JS↔Wasm trampoline 缓存。Ch17。JS↔Wasm trampoline cache. Ch17.
- v8/src/wasm/wasm-objects.cc · Module/Instance/Memory/Table 的 JS 对象。Ch16。JS objects for Module/Instance/Memory/Table. Ch16.
- mozilla-central/js/src/wasm/WasmBaselineCompile.cpp
- mozilla-central/js/src/wasm/WasmIonCompile.cpp
- WebKit/JavaScriptCore/wasm/WasmBBQ*.cpp · WasmOMG*.cpp
- **Binaryen**:wasm-opt, wasm-as, IR 优化器。wasm-opt, wasm-as, the IR optimiser.
- **WABT**:wasm2wat, wasm-objdump, wat2wasm。Ch06/07/24 用。wasm2wat, wasm-objdump, wat2wasm. Used by Ch06/07/24.

技术写作的鉴别度
不在花活,在出处。 Field Note · 03 · Appendix
Technical writing is judged not by flourishes but by the rigour of its sources. Field Note · 03 · Appendix