ursb.me / notes
FIELD NOTE / 03 虚拟机 · 编译器 Virtual Machines · Compilers 2026

Rust
SIMD 寄存器

From Rust to
SIMD registers.

一段 Rust 卷积循环要穿过 17 道工序、两层 JIT、4 GiB 的线性内存,才能在你的屏幕上跑成一条 SSE 指令。
这是 WebAssembly 从字节到机器码的全景手册。

A Rust convolution loop has to cross seventeen stages, two tiers of JIT, and 4 GiB of linear memory before it lights up as a single SSE instruction on your screen.
This is a field map of WebAssembly from bytes to machine code.

FIG · HERO · pipeline panorama — WebAssembly 编译流水线 13 阶段 / From .rs source via four phases (Ⅰ BUILD · Ⅱ NETWORK · Ⅲ COMPILE · Ⅳ RUN) to a single SSE op on screen. Stages: 01 rustc · 02 LLVM · 03 wasm-ld · 04 fetch/stream · 05 decode · 06 validate · 07 Liftoff · 08 execute · 09 tier-up · 10 TurboFan · 11 install · 12 SIMD exec · 13 on screen. Processes: offline toolchain → browser process net thread → renderer (wasm worker + main) → GPU process presenter. Budgets: build-time RTT & bytes · ~200 µs to 5 ms compile per function · ~16.7 ms per frame · streaming = compile while downloading.

从 .rs 源文件出发,穿过 4 个阶段、13 道工序、3 个进程、~ 16.7 ms 的帧预算,最后变成屏幕上的一个像素。streaming compile 让阶段 Ⅱ 与 Ⅲ 完全重叠——这是 wasm 在浏览器"边下边编"体验的来源。

From .rs source through four phases, thirteen stages, three processes and a 16.7 ms frame budget — to one pixel on screen. Streaming compile overlaps phases Ⅱ and Ⅲ entirely, the basis of wasm's "compile-while-downloading" feel.

ACT I · BACKGROUND

在沉到比特之前。

Before we sink into bits.

先把 WebAssembly 这件事放回它的历史位置:它是 asm.js 的延伸,是 JavaScript 走到天花板之后的另一条腿,是浏览器从"文档查看器"变成"通用计算机"的最后一块拼图。先有这四章作为骨骼,后面 22 章的细节才会落到合适的位置。

Before we sink to bits, put WebAssembly back into its historical slot: an extension of asm.js, a second leg the browser grew once JavaScript hit its ceiling, the last piece that turned the browser from a "document viewer" into a "general-purpose computer". With these four chapters as skeleton, the 22 that follow fall into place.

CHAPTER 01

三个公式 — WebAssembly 到底是什么

Three formulas — what WebAssembly really is

把这个庞然大物压成三行

crushing the elephant into three lines

"WebAssembly"在大部分讲座 PPT 里被画成一个紫色方块,旁边是"fast, safe, portable"三个词,听起来像一份产品宣传单。但当你真正打开 spec 仓库会发现:它不是一个东西,而是三层契约叠在一起——一层是字节格式,一层是执行模型,一层是宿主接口。把这三层各写成一个公式,后面所有故事都能从里面长出来。

"WebAssembly" gets painted as a purple block in most slide decks, captioned fast, safe, portable like a product brochure. Open the spec repo and you discover it is not one thing — it's three contracts stacked on top of each other: one for the byte format, one for the execution model, one for the host interface. Write each as a formula and every later story grows out of them.

公式 1 / FORMULA 1 · 设计契约 the design pact
Wasm = portable · safe · fast · compact
  portable = 同一份 .wasm 在任何 CPU/OS/JS 引擎上结果一样 / same bytes, same result on every CPU / OS / engine
  safe = 沙箱 + 验证 + 类型化栈 + 内存边界检查 / sandbox + validation + typed stack + bounds checks
  fast = 设计目标"原生 80%",JIT 可单遍出码 / design target ~80% of native, JIT-friendly single-pass codegen
  compact = LEB128 + 栈机 + 8 字节文件头 / LEB128 + stack machine + 8-byte header
推论:四个目标是同一道菜的四种调料。任何一个被偏废,整道菜会变形。后面 26 章本质上是在反复回答"这一刀切下去,是不是同时尊重这四个字"。
Implication: the four are seasonings in one dish — drop one and the whole thing collapses. The next 26 chapters are essentially repeated answers to one question: does this design choice still honour all four words?

公式 2 / FORMULA 2 · 引擎拆解 the engine
Engine = Decoder + Validator + (Tier 0 → Tier 1) + Runtime
  Decoder = 从字节流构造 Module 数据结构 / builds the Module from the byte stream
  Validator = 类型栈抽象解释,O(n) 证明安全 / type-stack abstract interpretation, O(n) safety proof
  Tier 0 = 基线 JIT,启动快但不优化(V8 = Liftoff)/ baseline JIT, fast start, unoptimised (V8 = Liftoff)
  Tier 1 = 优化 JIT,后台慢慢提升热函数(V8 = TurboFan)/ optimising JIT, lifts hot funcs in the background (V8 = TurboFan)
  Runtime = linear memory + table + import/export + trap
推论:wasm 没有解释器这一档。设计者赌的是"JIT 一定比解释快",所以连最基础的运行模式都是单遍出机器码。后面 Ch14 会看到 Liftoff 怎么实现"边解码边出码"。
Implication: wasm has no interpreter tier. The designers bet "JIT always beats interpretation", so even the first run goes through single-pass codegen. Ch14 shows how Liftoff turns bytes into machine code in one pass.

公式 3 / FORMULA 3 · 工具链 the toolchain
Source → Frontend → LLVM / Cranelift → wasm-ld → .wasm
  Rust → rustc → LLVM IR → wasm32 backend → wasm-ld → out.wasm
  C/C++ → clang → LLVM IR → wasm32 → wasm-ld → out.wasm(Emscripten 套壳 / Emscripten shell)
  Go (TinyGo) → SSA → LLVM → wasm
  AssemblyScript → Binaryen IR → wasm (no LLVM)
推论:绝大多数语言走的是 "LLVM IR → wasm32 后端" 这条路;只有 AssemblyScript 走 Binaryen,Java/.NET 走"自带 GC 运行时"那条岔路。生态格局基本是 LLVM 决定的。
Implication: nearly every language reaches wasm through LLVM IR → wasm32 backend; only AssemblyScript routes via Binaryen, and Java/.NET ride their own GC runtimes. The ecosystem map is largely LLVM's map.

主流引擎的「Tier 拓扑」对照

Engine tiering at a glance

引擎 Engine     | Tier 0(基线 baseline)       | Tier 1(优化 optimising)          | 用在哪 Used in
V8              | Liftoff (2018)               | TurboFan / Turboshaft (2023→)     | Chrome, Edge, Node, Deno
SpiderMonkey    | Baseline (2018)              | Ion / WarpMonkey                  | Firefox
JavaScriptCore  | BBQ (Build Bytecode Quickly) | OMG (Optimized Machine Generator) | Safari, WebKit
Wasmtime        | —                            | Cranelift                         | Bytecode Alliance, edge runtimes
Wasmer          | Singlepass                   | Cranelift / LLVM                  | standalone, plugin sandbox
WAMR            | interpreter / fast-interp    | AOT (LLVM)                        | IoT, embedded

从这张表里冒出一个事实:浏览器三家都选择了"双层 JIT",非浏览器引擎多半只留一层优化器或反而留解释器。原因是浏览器要兼顾"开页面要立刻能跑"和"久了要够快",而服务器端 wasm 通常是冷启动一次跑很久,直接 AOT 即可。同一个 spec,生出两套截然不同的实现哲学。

A fact climbs out: all three browser engines went with two-tier JIT, while non-browser engines tend to keep just one optimiser — or revert to an interpreter. Browsers must reconcile "must run instantly" with "must run fast eventually"; server-side wasm cold-starts once and runs forever, so AOT alone is enough. One spec, two diverging philosophies of implementation.

WebAssembly 不是一种语言,
是一份让 LLVM 和浏览器握手的协议。 Field Note · 03
WebAssembly is not a language.
It is the handshake between LLVM and the browser. Field Note · 03
关于"虚拟 ISA"的提法 / ON "VIRTUAL ISA"
官方文档把 wasm 称为 virtual ISA(虚拟指令集架构)——把它当成"一种新 CPU"来理解最准确。x86-64 是 1999 年 AMD 设计的虚拟接口,wasm 是 2015 年 W3C 设计的虚拟接口,只是后者跑在 V8 而非硅片上。后面 Ch09 看指令格式时,你会发现它真的像 RISC,带几分 MIPS 和几分 JVM 字节码的混血。
The spec calls wasm a virtual ISA. Treat it as "a new CPU you just discovered". x86-64 is a virtual interface AMD designed in 1999; wasm is a virtual interface W3C designed in 2015 — just running on V8 instead of silicon. Ch09 will show you the instruction encoding really does look like RISC, with a dash of MIPS and a dash of JVM bytecode.
INPUT
三个公式 Three formulas · 设计契约 + 引擎拓扑 + 工具链 / design + engine + toolchain
OUTPUT
骨骼:四目标 / 两层 / 一编译器 Skeleton: 4 goals · 2 tiers · 1 compiler · 后续 25 章都挂在这副骨头上 / all 25 later chapters hang on this
CHAPTER 02

家谱 — 从 asm.js 到 wasm-GC 的十三年

A family tree — 13 years from asm.js to wasm-GC

每一个提案都是一次妥协的化石

every proposal is a fossilised compromise

2010 年 Google 在 Chrome 里塞了一个叫 NaCl 的东西——它能跑原生码,但每一种 CPU 各编译一份。后来 PNaCl 用 LLVM bitcode 当中间格式,通用化是有了,但只有 Chrome 支持。"在浏览器里跑 C++"这件事整整失败了五年。

2011 年另一个分支冒头:Mozilla 的 Alon Zakai 写了 Emscripten,把 LLVM bitcode 翻成 JavaScript;2013 年他和 Luke Wagner 进一步把"JS 的一个类型化子集"标准化成 asm.js——你可以用 "use asm" 告诉引擎这段代码全是 int32,引擎就能跳过类型检查,直接 AOT 编译。Firefox 上的 asm.js 跑出过原生 1.5 倍的成绩。

但 asm.js 仍然要走 JS parser,文件还是文本,还是要走 V8 的 SMI/HeapNumber 边界。所有人都看到了一条更短的路:把那个类型化子集直接二进制化。2015 年 6 月 17 日,W3C 上的四家——Mozilla、Google、Apple、Microsoft——宣布合作。两年后 MVP 在四大浏览器同时落地,这是 web 平台史上罕见的一次性达成。

In 2010 Google shipped NaCl in Chrome — it ran native code, but you had to compile once per CPU. PNaCl tried LLVM bitcode as a portable IR, but only Chrome supported it. "Running C++ in the browser" failed cleanly for five years.

The other branch sprouted in 2011: Mozilla's Alon Zakai wrote Emscripten, which translated LLVM bitcode into JavaScript. By 2013 he and Luke Wagner had standardised "a typed subset of JS" as asm.js — drop a "use asm" at the top and the engine could skip type checks and AOT-compile. Firefox's asm.js engine hit ~1.5× of native.

But asm.js still went through the JS parser, was still text, still bumped into V8's SMI/HeapNumber boundary. Everyone saw the shortcut: binarise that typed subset. On 17 June 2015 the four browser vendors — Mozilla, Google, Apple, Microsoft — announced collaboration on the W3C. Two years later the MVP shipped in all four browsers simultaneously — a rare instance of platform consensus actually happening.

FIG 02 · family tree · hand-drawn — WebAssembly 家谱 · 2009 – 2026. Four ancestor strands (NaCl → PNaCl, Emscripten, asm.js, and JVM stack-machine wisdom via JavaScriptCore, since 1996) converge into the wasm trunk on 17 Jun 2015 ("4 vendors agree"). The trunk then sprouts: MVP v1.0 in 4 browsers, Liftoff, ⚠ Spectre / SAB disabled, Threads, Bulk memory, SIMD v128 + reference types, Tail calls · EH, GC (struct/array/i31), Wasm 2.0 W3C REC, JSPI, Component Model in flight. Non-browser branch: CF Workers '18, Wasmtime '19, Wasmer '19, Spin '22, WASI 0.2 '24.

四条血脉(NaCl · Emscripten · asm.js · JSC 经验) 在 2015 年 6 月 17 日的 W3C 会议室里收敛成 wasm 主干。MVP 之后,提案像枝条一样从主干长出来——绿色是编译器/运行时提案,紫色是计算能力提案。Wasm 2.0 在 2025 年成为 W3C Recommendation,把过去 8 年的 8 个独立提案合并成一份新基线。

Four ancestor strands (NaCl · Emscripten · asm.js · JSC heritage) converge into the wasm trunk on 17 Jun 2015 at the W3C. Post-MVP, proposals sprout — green are compiler/runtime proposals, purple are compute proposals. Wasm 2.0 became a W3C Recommendation in 2025, folding eight separate proposals into a new baseline.

三个不能略过的祖先

Three ancestors you cannot skip

A
NaCl(2009)— 失败的"在浏览器里跑原生码"
NaCl (2009) — the failed "run native in browser"
Google 用 SFI(Software Fault Isolation)给原生码画沙箱。安全但需要 per-CPU 编译;PNaCl 用 LLVM bitcode 改善了可移植性,但只有 Chrome 支持,五年后 Chrome 自己也下线了它。教训:浏览器要求"一份字节,处处运行"。
Google sandboxed native code via SFI (Software Fault Isolation). Safe, but you compiled once per CPU. PNaCl swapped in LLVM bitcode for portability, but only Chrome shipped it. Five years later Chrome retired it. Lesson: the browser requires "one binary, runs everywhere".
B
Emscripten(2011)— 把 LLVM 翻成 JS
Emscripten (2011) — translating LLVM into JS
Zakai 让 clang 输出 LLVM bitcode,再写一个 backend 把 bitcode 翻成非常机器化的 JS——HEAP32[(p+4)>>2] = x | 0 这种风格。证明了"用 JS 当虚拟 CPU"在工程上可行。今天 Emscripten 还在,但它的 backend 已经直接输出 wasm。
Zakai had clang emit LLVM bitcode then wrote a backend translating bitcode into extremely machine-like JS — HEAP32[(p+4)>>2] = x | 0 style. It proved "JS as virtual CPU" was engineerable. Emscripten still ships, but its backend now emits wasm directly.
C
asm.js(2013)— 给 JS 引擎一个 AOT 入口
asm.js (2013) — an AOT trapdoor into the JS engine
"use asm" 一行声明,引擎认出后用 AOT 而非 JIT 编译该函数。Firefox 的 OdinMonkey 在 asm.js 上跑出过 1.5× of native。但 asm.js 仍是 文本,要走 JS parser,parse 一个 100 MB 的游戏 bundle 要十几秒。这成了催生 wasm 二进制格式的最后一根稻草。
A single "use asm" directive let the engine AOT-compile a function. Firefox's OdinMonkey hit 1.5× of native on it. But asm.js was still text, still went through the JS parser, and a 100 MB game bundle took tens of seconds to parse. That was the final straw that forced wasm to be binary.
为什么 2015 年这次合作能成 / WHY THE 2015 PACT HELD
浏览器历史上厂商合作的成功率很低——XHTML、SVG Fonts、HTML 5 自身都经历过分裂。wasm 这次成功的关键有三:① 四家都有各自版本的相似失败(NaCl / Emscripten / asm.js / Silverlight),共识基础硬;② 提案从一开始就用形式化语义(Andreas Rossberg 主笔)而非自然语言,歧义少;③ MVP 把 GC/threads/SIMD 全砍掉,先求落地——剩下的事留给提案流程。"先共识,再演进"是 wasm 的一切。
Vendor consensus on the web has a low success rate — XHTML, SVG Fonts, HTML 5 itself all fragmented. Three things made wasm work: ① every vendor had its own version of the same failure (NaCl / Emscripten / asm.js / Silverlight) — the consensus floor was solid; ② the proposal was specified formally from day one (Andreas Rossberg leading) rather than as natural-language prose — fewer ambiguities; ③ the MVP cut GC, threads, SIMD, exceptions — ship first, evolve later. "Consensus first, evolution after" is the whole of wasm.

提案的四个阶段

The four phases of a proposal

Phase                  | 含义 Meaning                                   | 谁同意 Who agrees | 能不能用 Usable?
0 · Pre-proposal       | 某人提个 idea,有仓库 / someone has an idea + repo | —              | no
1 · Feature proposal   | CG 同意值得做 / CG agrees it's worth doing        | CG              | no
2 · Proposed spec text | 有正式规范文字 / formal spec text exists           | CG              | flag 后可用 behind flag(Chromium --enable-experimental-webassembly-features)
3 · Implementation     | 至少 2 个引擎实现 / ≥ 2 engines shipped impl       | CG              | flag 后可用 behind flag · Origin Trial
4 · Standardize        | WG 投票纳入正式规范 / WG votes to standardise       | WG              | 默认开启 on by default

CG = Community Group(社区组,任何人可加入);WG = Working Group(工作组,需要会员资格)。一个提案常常在 phase 3 待两到三年——SIMD 在 phase 3 待了 26 个月才升 phase 4。这套机制让 wasm 的每一步演进都需要至少两家厂商先实现,从根上把"一家独大"挡住了

CG = Community Group (anyone can join); WG = Working Group (membership required). A proposal often sits in phase 3 for two to three years — SIMD spent 26 months at phase 3 before stepping up. The mechanism forces every evolutionary step to be implemented by at least two vendors first — structurally blocking unilateral moves.

SPEC
https://webassembly.github.io/spec/ · 所有提案在 / each proposal at github.com/WebAssembly/<name>
PHASE
https://github.com/WebAssembly/proposals · 看每个提案现在在哪一阶 / phase tracker for each proposal
FLAG
chrome://flags/#enable-experimental-webassembly-features
CHAPTER 03

为什么是栈机 — 一个 1980 年代的选择

Why a stack machine — a choice from the 1980s

JVM 走过的路,wasm 又走了一遍

JVM walked this path, wasm walked it again

"为什么 wasm 不是寄存器机?Dalvik 不是更快吗?"——这是每个第一次看 wasm 字节码的人会问的问题。答案藏在一个看似无关的数字里:wasm 字节码的体积要小到能流式下载。MVP 设计期(2015)给自己定的目标是 4 MB 文本的 asm.js 程序压成不超过 1 MB 的二进制——压缩比 1:4。所有的设计决策都要让步于这个数字。

"Why isn't wasm a register machine? Aren't Dalvik registers faster?" — every first-time reader of wasm bytecode asks this. The answer hides in a seemingly unrelated number: wasm bytes must be small enough to stream-download. The MVP target (2015) was to fit 4 MB of asm.js text into < 1 MB of binary — a 1:4 ratio. Every design choice bows to that number.

两种机器的编码密度对比

Encoding density: stack vs register

考虑一行表达式 c = a + b。在两种 ISA 里它的字节序列分别是:

Take a single expression c = a + b. The byte sequences in the two ISAs:

STACK · wasm
  1. local.get $a
  2. local.get $b
  3. i32.add
  4. local.set $c
4 条 / 7 字节(LEB128 编码后) 4 ops / 7 bytes (LEB128)
REGISTER · LLVM IR
  1. add i32 %a, %b → %t
  2. store i32 %t, %c
2 条 / 须编码 source+dest 寄存器号 2 ops / must encode src+dst register IDs

看起来寄存器机指令更少。但寄存器号需要 bits 来编:LLVM 的 SSA 寄存器数量无界,实际编码时需要 32 位甚至更多;Dalvik 把寄存器限到 256 个,8 bit;ARM/x86 真寄存器 16 个,4 bit。栈机一字节就是一条 opcode(i32.add = 0x6A),局部变量索引用 LEB128(通常 1 byte),整体下来栈机一般赢 30~40% 字节

Register ops look fewer. But register IDs need bits: LLVM SSA values are unbounded, encoded at 32+ bits each; Dalvik caps at 256 registers (8 bits); ARM/x86 have 16 real registers (4 bits). A stack-machine opcode is one byte (i32.add = 0x6A), local indices LEB128 (usually 1 byte). The stack form typically wins 30–40% on bytes.
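The byte arithmetic above is easy to poke at with a few lines of Rust. A minimal sketch of the unsigned LEB128 encoding wasm uses for indices and section sizes; the opcode values (`0x20` = `local.get`, `0x21` = `local.set`, `0x6A` = `i32.add`) come from the spec, while the local indices are illustrative:

```rust
// Minimal unsigned LEB128 encoder sketch: 7 bits per byte, least-significant
// group first; the high bit of each byte means "more bytes follow".
fn leb128_u(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    // Local indices below 128 cost exactly one byte — the common case.
    assert_eq!(leb128_u(5), vec![0x05]);
    // Classic multi-byte example: 624485 encodes as three bytes.
    assert_eq!(leb128_u(624485), vec![0xE5, 0x8E, 0x26]);
    // c = a + b as wasm bytes: local.get a, local.get b, i32.add, local.set c.
    // Three 1-byte indices + four 1-byte opcodes = 7 bytes total.
    let code = [0x20, 0x00, 0x20, 0x01, 0x6A, 0x21, 0x02];
    assert_eq!(code.len(), 7);
    println!("ok");
}
```

The register-machine version cannot play this game: an unbounded SSA value number has no "usually one byte" sweet spot, which is the whole density argument in miniature.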

栈机的四个"赠送好处"

Four bonuses the stack throws in for free

1 · density
编码密度高
High encoding density

opcode 1 byte,大多数立即数 LEB128 1~2 byte。同样语义比 ARM64 大约小 35%。

1-byte opcode, most immediates 1–2-byte LEB128. About 35% smaller than equivalent ARM64.

2 · validation
验证算法线性
Linear validation

类型栈抽象解释,一遍扫完即可证明类型安全。Ch11 详谈。

Type-stack abstract interpretation: one linear pass proves type safety. See Ch11.

3 · JIT
单遍 codegen 可行
Single-pass codegen viable

栈位置编译期可知,Liftoff 边解码边发射机器码,无中间 IR。

Stack positions are statically known; Liftoff emits machine code while decoding, no IR.

4 · neutral
ISA 无关
ISA-neutral

不绑定寄存器数量或调用约定,同一份字节在 x86/ARM/RISC-V 上都能跑。

Not tied to a register count or calling convention; the same bytes run on x86/ARM/RISC-V.

但栈机有一个老问题

But the stack has one old problem

栈机解释执行慢——每条指令要操作栈顶,栈本身常驻内存,L1 cache 命中率不如寄存器机。这是 JVM 早期被嘲讽"慢得像树懒"的根本原因。wasm 怎么解?用 JIT 而不是解释器。设计者赌的是:既然反正都要 JIT,那就让字节码偏向解码密度,机器码偏向执行速度,各取所长。

Stack interpreters are slow — every op touches the stack top, the stack lives in memory, L1 hit-rate trails register machines. That's why early JVMs felt "sloth-slow". Wasm's answer: skip the interpreter. The bet was: we'll JIT anyway, so let the bytecode optimise for density and the machine code optimise for speed. Best of both.

栈机还配了一个"半寄存器"层:locals。每个函数有固定数量的 locals(像寄存器),local.get / local.set 在栈和 locals 之间搬运值。这套设计让 wasm 既像栈机一样紧凑,又像寄存器机一样能"存中间结果"。JVM 的 locals 与之几乎完全相同——wasm 的设计者把 JVM 学了一遍。

Stack machines also carry a "half-register" file: locals. Each function has a fixed number of locals (register-like), with local.get / local.set moving values between stack and locals. That gives wasm a stack's compactness with a register machine's ability to "hold intermediate values". The JVM has the exact same construct — wasm's designers studied JVM thoroughly.
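The stack-plus-locals model fits in a toy evaluator. Real engines never run wasm like this (as noted above, wasm has no interpreter tier); the sketch below only pins down the semantics, with instruction names mirroring the wasm ones:

```rust
// Toy evaluator for the stack + locals model: local.get / local.set move
// values between the locals "register file" and the operand stack;
// i32.add consumes the two stack-top values and pushes the sum.
enum Instr {
    LocalGet(usize),
    LocalSet(usize),
    I32Add,
}

fn run(code: &[Instr], locals: &mut [i32]) {
    let mut stack: Vec<i32> = Vec::new();
    for instr in code {
        match instr {
            Instr::LocalGet(i) => stack.push(locals[*i]),
            Instr::LocalSet(i) => locals[*i] = stack.pop().expect("stack underflow"),
            Instr::I32Add => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a.wrapping_add(b)); // wasm i32.add wraps on overflow
            }
        }
    }
}

fn main() {
    // c = a + b: a in local 0, b in local 1, result into local 2
    let mut locals = [40, 2, 0];
    let code = [
        Instr::LocalGet(0),
        Instr::LocalGet(1),
        Instr::I32Add,
        Instr::LocalSet(2),
    ];
    run(&code, &mut locals);
    assert_eq!(locals, [40, 2, 42]);
}
```

Note how the stack is empty between statements: that invariant is what lets a baseline JIT map "stack slot at depth k" to a fixed machine location.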

具体计算 · 一个反例 / CASE · concrete numbers
fib(40) 在 wasm 和 Dalvik 上的字节数
fib(40) in wasm vs Dalvik bytes

实测把 fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } 编到 wasm 和 dex:wasm body 是 31 byte(含两次递归调用),dex 经压缩 27 byte。差距不大,因为函数体太短,寄存器号编码 vs locals 索引几乎抵平。真正拉开差距的是大函数——一个 1000 行的 SIMD inner loop,wasm 大约赢 33%,这才是 wasm 选栈机的真正回报。

Compile fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } to wasm and dex: wasm body is 31 bytes (including two recursive calls), dex 27 bytes. Tiny gap, because the body is too short — register encoding vs local index cancels out. The gap widens on large functions — a 1000-line SIMD inner loop sees wasm ~33% smaller. That's the real return on the stack-machine bet.

关于"反栈机"的另一派 / THE COUNTER-FACTION
2014 年 Andreas Rossberg 在内部讨论时曾推过一个"register wasm"草案,叫 WebAssembly Register。最终被否决,理由有三:① 浏览器引擎要兼顾启动速度,字节密度更要紧;② 栈机的形式语义比寄存器机更小,规范更易写;③ JVM 已经为栈机做了 20 年的工程证明。今天这个 register 草案还在 webassembly/design 仓库的历史里能翻到。
In 2014 Andreas Rossberg circulated a draft for a "register wasm" called WebAssembly Register. It was rejected for three reasons: ① browsers prioritise startup speed, byte density matters more; ② formal semantics for stacks are smaller than for registers, easier to specify; ③ the JVM had already engineered the stack approach for 20 years. The draft still lives in the history of the webassembly/design repo.
栈是密度,寄存器是速度。
wasm 选了让编译器付出速度。 Field Note · 03
The stack buys density, the register buys speed.
Wasm chose to make the compiler pay for speed. Field Note · 03
CHAPTER 04

JS 的天花板 — JIT 再聪明也跨不过的三件事

JS ceiling — three things even a brilliant JIT can't climb

为什么 wasm 必须存在

why wasm has to exist

V8 是一台让人叹服的 JIT 引擎——它在运行时学习对象形状、追踪类型、构造内联缓存、把热函数从 Ignition 经 Sparkplug、Maglev 一路提升到 TurboFan。但所有这些工程都建立在一个前提之上:JS 是动态类型。这个前提注定了 JIT 有一个跨不过去的天花板。

把天花板写成三件事:(1) 类型不确定,所以要 inline cache,猜错就要 deopt;(2) 数字不止一种表示,SMI / HeapNumber / Float64 之间的装箱拆箱无法消除;(3) GC 不可关,即使是数值密集的图像处理,引用计数和写屏障也要付。这三件事单独看每一件都是几个百分点,叠起来就是 5× ~ 10× 的差距。

V8 is an awe-inspiring JIT — it learns object shapes at runtime, traces types, builds inline caches, lifts hot functions from Ignition through Sparkplug and Maglev to TurboFan. But all that engineering rests on one premise: JS is dynamically typed. That premise dictates a ceiling.

Three sides to that ceiling: (1) types are uncertain, so you need inline caches, deopt on misses; (2) numbers have multiple representations (SMI / HeapNumber / Float64), boxing/unboxing cannot be eliminated; (3) GC cannot be turned off — even on pixel-pushing loops you pay write barriers and reference counts. Each of these is a few percent; stacked, they multiply to 5–10×.

三个具体的天花板

Three concrete ceilings

Inline cache 的反复 deopt
Inline cache thrash & deopt
V8 把 obj.x 优化成"直接偏移 +8 取值"——前提是 obj 总是这个形状。一旦你给某个 obj 加了字段,这条优化作废,引擎不得不 deopt 回 Ignition,函数从 TurboFan 掉到字节码。wasm 没有 obj.x,有的是 i32.load offset=8——偏移在编译期就钉死,没有 deopt。
V8 optimises obj.x into "offset +8 of this shape" — until you add a property and the shape changes, at which point the optimisation invalidates, the engine deopts back to Ignition, the function falls from TurboFan to bytecode. Wasm has no obj.x; it has i32.load offset=8 — the offset is fixed at compile time, deopt-free.
数字装箱(SMI ↔ HeapNumber ↔ Float64)
Number boxing (SMI ↔ HeapNumber ↔ Float64)
在 V8 里 x = 1 是 SMI(31-bit tagged 整数,栈上);x = 1.5 是 HeapNumber(堆指针,要 GC)。x = a + b 时引擎要先判断两边是 SMI 还是 HeapNumber,再决定加法 opcode。一个简单的 inner loop 里这种判断每次都跑。wasm 的 i32.add 输入永远是 i32——没有判断,直接出 add eax, ebx
In V8, x = 1 is an SMI (31-bit tagged int, stack); x = 1.5 is a HeapNumber (heap pointer, GC-tracked). For x = a + b, the engine first checks both sides' representations, then picks the add op. In a tight inner loop, that check runs every iteration. Wasm's i32.add always takes two i32s — no check, straight to add eax, ebx.
GC 不可关
GC is mandatory
即使你的循环只算数字,V8 仍要保证 arr[i] = x 这种写入触发 write barrier、保护代际收集器。一个 1M 次写入的循环里,write barrier 占 10~15% 时钟。wasm 的 i32.store offset=0 写到 linear memory——它是 ArrayBuffer 的一片,GC 完全不参与。这也是为什么 wasm 适合做图像/音视频/物理引擎而不适合做 React 组件树。
Even in a pure-arithmetic loop, V8 must run a write barrier on arr[i] = x to keep the generational GC sound. In a 1 M-write loop, write barriers consume 10–15% of cycles. Wasm's i32.store offset=0 hits linear memory — a slice of ArrayBuffer the GC never touches. That's why wasm shines on images/video/physics and slogs on React component trees.
v8-fast-js 里的同一句话 / A SENTENCE FROM v8-fast-js
"把 px2rem 优化到 24 ms 已经是极限了——再快只能不写 JS。"
"Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, stop writing JS."

这是上一篇文章《V8 是怎么把 JS 跑快的》结尾的句子——它正是这一章要展开的主张。JS 引擎走完了它能走的所有路:Sparkplug 把启动延迟干到 1× of native parse;Maglev 把热路径速度做到 0.8× of TurboFan;TurboFan 把寄存器分配做到接近 LLVM。当你需要更快,你需要的不是更聪明的 JIT,你需要的是更少的不确定性——这就是 wasm 的角色。

That's the closing line of the previous piece «How V8 Makes JS Fast» — and it's exactly the claim this chapter unpacks. The JS engine has walked every road it can: Sparkplug brings startup to ~1× native parse, Maglev hits 0.8× of TurboFan on hot paths, TurboFan's register allocator approaches LLVM's. When you need more speed, you don't need a smarter JIT — you need less uncertainty. That is wasm's role.

数字:wasm 比 JS 快多少

By how much, in numbers

基准 Benchmark                   | JS (V8 TurboFan) | Wasm (V8 TurboFan) | 原生 C Native C (LLVM -O3) | wasm / native
SciMark 2.0 (geom mean)          | 2.4×             | 1.15×              | 1.00×                      | 87%
fasta (computational)            | 3.1×             | 1.08×              | 1.00×                      | 93%
n-body (3D physics)              | 2.8×             | 1.18×              | 1.00×                      | 85%
JPEG decode (libjpeg)            | 4.5×             | 1.25×              | 1.00×                      | 80%
SHA-256(纯算术 pure arithmetic) | 3.6×             | 1.10×              | 1.00×                      | 91%
DOM diff(JS-bound)              | 1.00×            | 1.7×               | —                          | —

表里最后一行是反例:DOM diff 在 wasm 里反而更慢,因为每次 DOM 调用都要跨 wasm/JS 边界,trampoline 成本压过了算术加速。wasm 比 JS 快的是"算数",不是"调用浏览器 API"——这条边界 Ch17 会量化。

The last row is a counter-example: wasm is slower at DOM diff, because each DOM call crosses the wasm/JS boundary, and the trampoline cost outweighs the arithmetic speedup. Wasm beats JS at arithmetic, not at calling browser APIs — Ch17 quantifies that boundary.

关于"原生 80%"这个数字 / ON "80% OF NATIVE"
这是 wasm 设计目标里的官方数字,实测分布在 75% ~ 95%。差距来自两处:① bounds check(线性内存边界检查),硬件辅助后约 2~5% 开销;② 寄存器分配,wasm 的"无信息"调用约定让 callee-saved 集合比 native 大,平均 1~3% 开销。剩下的差距取决于 SIMD/loop unrolling 等优化通道是否完全开。有 SIMD 的 inner loop,wasm 可以追平甚至超过 -O2 的 native——因为 LLVM 后端是同一个。
"80% of native" is the official design target; measured numbers spread 75%–95%. The gap comes from two places: ① bounds checks on linear memory — with hardware assist ~2–5% overhead; ② register allocation — wasm's "no-info" calling convention forces a larger callee-saved set than native, averaging 1–3%. The rest depends on SIMD / loop-unrolling fidelity. With SIMD, a wasm inner loop can match or beat -O2 native, because the LLVM backend is the same.
INPUT
JS 引擎的工程极限 JS engine ceiling · IC + 装箱 + GC / IC + boxing + GC
OUTPUT
wasm 存在的理由 Wasm's reason to exist · 把"不确定性"从运行时挪到编译期 / move uncertainty from runtime to compile time
ACT II · MAIN-LINE

The Hot Loop。

The Hot Loop.

姐妹篇 chromium-renderer 用一张名片(The Card)当贯穿全文的实例。这里我们用一段 3×3 卷积循环——它来自图像滤镜,小到可以打印在一页纸上,大到能压出栈机、SIMD、JIT、Tier-up 几乎所有的特性。后面 22 章每一章都会切回这段代码,看它在那一道工序里是什么样子。

In the sibling piece chromium-renderer, a business card (The Card) served as the through-line. Here we use a 3×3 convolution loop — straight out of image filtering, small enough to print on one page yet rich enough to exercise the stack machine, SIMD, JIT, and tier-up. Every one of the next 22 chapters cuts back to this code and shows what it looks like at that stage.

MAIN-LINE · ✦

The Hot Loop — 一段 3×3 卷积的来生

The Hot Loop — afterlife of a 3×3 convolution

11 行 Rust,17 道工序,1 条 SSE 指令

11 lines of Rust, 17 stages, 1 SSE op

Module
main-line
Length
11 LOC · 192 byte wasm
Touched by
22 chapters
Role
canonical sample

"WebAssembly 的字节是什么样子" 这种问题用文字描述会很抽象。我们换一种问法:这一段你能看得懂的 Rust 函数,在每一道工序里长什么样。下面是它的源头——一个 3×3 盒型模糊滤镜,把一张灰度图的每个像素替换成它周围 9 个像素的平均值。这是 Photoshop 里"模糊"按钮在内核做的事的精简版,也是 wasm 最擅长跑的那种代码:循环密、整数为主、对内存 layout 敏感。

"What do WebAssembly bytes look like?" gets abstract in prose. So we switch the question: what does this Rust function look like at every stage? Below is the source — a 3×3 box blur that replaces each grayscale pixel with the average of its 9 neighbours. A miniature of Photoshop's blur button kernel, and the kind of code wasm shines on: loop-heavy, integer-dominated, memory-layout sensitive.

源码 · hot.rs

Source · hot.rs

// hot.rs — 3×3 box blur on an 8-bit grayscale image
// w · h are pre-checked, no panics on bounds

#[no_mangle]
pub fn blur3(src: &[u8], dst: &mut [u8], w: usize, h: usize) {
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            let mut sum: u32 = 0;
            for dy in 0..3 {
                for dx in 0..3 {
                    sum += src[(y + dy - 1) * w + (x + dx - 1)] as u32;
                }
            }
            dst[y * w + x] = (sum / 9) as u8;
        }
    }
}

五个观察:① #[no_mangle] 让 rustc 把符号名原样导出,后面 wasm 才能用 blur3 找到它;② 输入是切片,Rust 编译到 wasm 时会拆成"指针 + 长度"两个 i32 参数;③ 内层 9 次 src[...] 索引,每次都会被 LLVM 展平成 i32.load offset=?;④ sum / 9 编译成 i32.div_u——不是浮点;⑤ Rust 的 as u8 编译成 i32.store8,只写低 8 位。这五件事每一件都对应 wasm 的一个设计点,后面会一个个回到。

Five observations: ① #[no_mangle] tells rustc to export the symbol literally so wasm callers can find blur3; ② slice arguments are split into "pointer + length" — two i32 args each; ③ the nine inner src[...] indices each flatten into i32.load offset=?; ④ sum / 9 becomes i32.div_u — integer, not float; ⑤ as u8 becomes i32.store8, writing only the low byte. Each of these maps to a wasm design choice; we'll come back to them one by one.

编译命令 · 一行 rustc

Build command · one rustc invocation

$ rustc --target wasm32-unknown-unknown -O --crate-type cdylib -o hot.wasm hot.rs
$ wasm-opt -O3 hot.wasm -o hot.opt.wasm    # Binaryen post-pass
$ ls -l hot*.wasm
-rw-r--r--  1 airing  staff  192 May 16 14:32 hot.opt.wasm
-rw-r--r--  1 airing  staff  248 May 16 14:32 hot.wasm

192 字节的 .wasm 包含完整的模块——header / type / function / memory / export / code 六个 section,加起来不到一条 tweet。这是栈机+LEB128 编码密度的胜利。把这 192 字节十六进制打印出来,你能眼睛看完:

192 bytes contains the entire module — header / type / function / memory / export / code, six sections, less than a tweet. That's the win from stack machine + LEB128. Print those 192 bytes as hex and you can read them with your eyes:

00000000  00 61 73 6d 01 00 00 00     ; \0asm magic + version=1
00000008  01 0b 02 60 04 7f 7f 7f     ; type section, 2 types
00000010  7f 00 60 00 00              ; (func (param i32 i32 i32 i32)), (func)
00000015  03 02 01 00                 ; function section: func0 has type0
00000019  05 03 01 00 01              ; memory section: 1 page (64 KiB)
0000001e  07 09 01 05 62 6c 75 72     ; export section: "blur3"
00000026  33 00 00                    ; → func 0
00000029  0a ...                      ; code section, body of blur3
...
000000bf  0b                          ; end opcode · file length = 0xC0 (192) bytes

注意三件事:① 00 61 73 6d 是 ASCII 的 \0asm——所有 wasm 模块都以它开头,像 ELF 的 0x7F ELF;② 01 00 00 00 是版本号 1,小端;③ 每个 section 以一个 ID byte(0x01 = type, 0x03 = function, ...)开头,然后是 LEB128 编码的长度。Ch06 会把这层皮一字一字撕开。

Three things to note: ① 00 61 73 6d is ASCII \0asm — every wasm module starts with it, like ELF's 0x7F ELF; ② 01 00 00 00 is version 1, little-endian; ③ each section opens with an ID byte (0x01 = type, 0x03 = function, …) followed by LEB128-encoded length. Ch06 peels this skin off byte by byte.
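Those first decoding steps fit in a short sketch. A minimal Rust walker that checks the 8-byte header and lists section ids without decoding bodies; it trusts well-formed input, and the sample module bytes in `main` are hand-built for illustration, not the real hot.wasm:

```rust
// Sketch of the first steps of a wasm decoder: verify the 8-byte header,
// then walk section headers (1-byte id + LEB128 payload size), skipping
// each payload. No error recovery; malformed input would panic on indexing.
fn leb128_u(bytes: &[u8], mut pos: usize) -> (u32, usize) {
    let (mut val, mut shift) = (0u32, 0);
    loop {
        let b = bytes[pos];
        pos += 1;
        val |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            return (val, pos);
        }
        shift += 7;
    }
}

fn section_ids(bytes: &[u8]) -> Result<Vec<u8>, &'static str> {
    if bytes.len() < 8 || &bytes[0..4] != b"\0asm" {
        return Err("bad magic");
    }
    if bytes[4..8] != [1, 0, 0, 0] {
        return Err("unsupported version");
    }
    let (mut pos, mut ids) = (8, Vec::new());
    while pos < bytes.len() {
        ids.push(bytes[pos]);                        // section id byte
        let (size, next) = leb128_u(bytes, pos + 1); // LEB128 payload size
        pos = next + size as usize;                  // skip the payload
    }
    Ok(ids)
}

fn main() {
    // header + empty type section (id 1) + empty function section (id 3)
    let module = [
        0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00,
        0x01, 0x01, 0x00, // type section, size 1, count 0
        0x03, 0x01, 0x00, // function section, size 1, count 0
    ];
    assert_eq!(section_ids(&module).unwrap(), vec![0x01, 0x03]);
}
```

The "id + size, skip payload" shape is also what makes streaming decode possible: a section can be handed to the compiler the moment its size prefix says it has fully arrived.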

作为 wat 文本(人类可读)

As wat text (human-readable)

;; hot.wat — equivalent text after wasm-opt -O3 (expanded for readability)
(module
  (type $t0 (func (param i32 i32 i32 i32)))
  (memory (export "memory") 1)
  (func $blur3 (export "blur3") (type $t0)
    (param $src i32) (param $dst i32)
    (param $w   i32) (param $h   i32)
    (local $y i32) (local $x i32) (local $sum i32)
    ;; for y = 1..h-1
    (local.set $y (i32.const 1))
    (block $break_y
      (loop  $loop_y
        (br_if $break_y (i32.ge_s (local.get $y) (i32.sub (local.get $h) (i32.const 1))))
        ;; for x = 1..w-1
        (local.set $x (i32.const 1))
        (block $break_x
          (loop  $loop_x
            (br_if $break_x (i32.ge_s (local.get $x) (i32.sub (local.get $w) (i32.const 1))))
            ;; sum = 9 loads added up (LLVM has already flattened the 2 inner loops)
            local.get $src
            i32.load8_u offset=0      ;; src[(y-1)*w + (x-1)]
            local.get $src
            i32.load8_u offset=1
            i32.add
            ;; ... 9 × i32.load8_u + 8 × i32.add in the unrolled original (elided) ...
            local.set $sum
            ;; dst[y*w + x] = sum / 9
            local.get $dst
            local.get $sum
            i32.const 9
            i32.div_u
            i32.store8                ;; write back
            (local.set $x (i32.add (local.get $x) (i32.const 1)))
            br $loop_x)) ;; end x
        (local.set $y (i32.add (local.get $y) (i32.const 1)))
        br $loop_y))                ;; end y
  ))

这是 wat 的一种展开形式(为了讲解可读)。实际的 LLVM 输出会把内 2 层循环完全展开成 9 条 i32.load8_u + 8 条 i32.add。注意三个关键点:(a) 控制流只有 block / loop / br / br_if 这几个原语——没有 goto;(b) 所有内存访问都带 offset=N,这个 offset 是编译期常量,Liftoff 可以直接折进地址计算;(c) 每条算术指令的输入输出类型由 opcode 自身决定(i32.add 必然两 i32 输入一 i32 输出)——这是 wasm "静态类型"的核心,Ch08 和 Ch11 会展开。

This is wat in an unrolled-but-readable form. The real LLVM output unrolls the two inner loops into 9 × i32.load8_u + 8 × i32.add. Three key things: (a) control flow uses only block / loop / br / br_if primitives — no goto; (b) every memory access carries an offset=N immediate that is a compile-time constant, which Liftoff folds straight into address arithmetic; (c) each arithmetic opcode self-describes its operand types (i32.add is always two-i32-in, one-i32-out) — that's the core of wasm's static typing, expanded in Ch08 and Ch11.
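Point (c) is exactly what makes validation a single linear pass. A toy Rust sketch of the type-stack idea, trimmed to an i32-only opcode subset (real validation also tracks block types, labels, and unreachability; this is only the shape of the algorithm):

```rust
// Toy type-stack validation: walk the opcodes once, tracking only *types*,
// and reject any body whose stack shape is wrong. Each opcode self-describes
// its inputs and outputs, so one pass suffices.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Val {
    I32,
}

enum Op {
    LocalGetI32,
    I32Const(i32),
    I32Add,
    I32DivU,
    I32Load8U,
    I32Store8,
}

fn pop(stack: &mut Vec<Val>, want: Val) -> Result<(), String> {
    match stack.pop() {
        Some(t) if t == want => Ok(()),
        other => Err(format!("expected {:?}, found {:?}", want, other)),
    }
}

fn validate(body: &[Op]) -> Result<(), String> {
    let mut stack: Vec<Val> = Vec::new();
    for op in body {
        match op {
            Op::LocalGetI32 | Op::I32Const(_) => stack.push(Val::I32),
            Op::I32Add | Op::I32DivU => {
                pop(&mut stack, Val::I32)?; // two i32 in …
                pop(&mut stack, Val::I32)?;
                stack.push(Val::I32); // … one i32 out
            }
            Op::I32Load8U => {
                pop(&mut stack, Val::I32)?; // addr in, byte out
                stack.push(Val::I32);
            }
            Op::I32Store8 => {
                pop(&mut stack, Val::I32)?; // value
                pop(&mut stack, Val::I32)?; // addr
            }
        }
    }
    if stack.is_empty() { Ok(()) } else { Err("values left on stack".into()) }
}

fn main() {
    // addr, a, b → i32.add → i32.store8: well-typed, empty stack at the end
    let ok = [Op::LocalGetI32, Op::LocalGetI32, Op::LocalGetI32, Op::I32Add, Op::I32Store8];
    assert!(validate(&ok).is_ok());
    // addr, sum, 9 → i32.div_u → i32.store8: the blur3 epilogue shape
    let div = [Op::LocalGetI32, Op::LocalGetI32, Op::I32Const(9), Op::I32DivU, Op::I32Store8];
    assert!(validate(&div).is_ok());
    // i32.load8_u with an empty stack: rejected in the same single pass
    assert!(validate(&[Op::I32Load8U]).is_err());
}
```

Because values are abstracted to types, the pass is O(n) in the byte count regardless of what the code computes.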

在 V8 里跑出的机器码 · Liftoff(Tier 0)

Machine code in V8 · Liftoff (Tier 0)

把 hot.wasm 喂给 V8,默认会先用 Liftoff 编译。在 Chrome 里用 --print-wasm-code dump 出 Liftoff 生成的 x86-64:

Feed hot.wasm to V8 and Liftoff compiles first by default. Use --print-wasm-code in Chrome to dump the generated x86-64:

; Liftoff output for blur3 (excerpt of inner body)
push   rbp
mov    rbp, rsp
sub    rsp, 0x30                ; reserve 6 slots for locals
mov    [rbp-0x08], rdi          ; spill $src (arg 0)
mov    [rbp-0x10], rsi          ; spill $dst (arg 1)
...
; inner: load src[idx]
mov    rax, [rbp-0x08]          ; rax = $src
add    rax, [rbp-0x18]          ; rax += computed index
movzx  edx, byte ptr [r15+rax]  ; load via r15 memory base (guard-page bounds check)
add    [rbp-0x28], edx          ; sum += byte
...
; tier-up trigger
cmp    dword ptr [r13+0x40], 0x100
jne    +0x4
call   WasmCompileLazy

Liftoff 的输出有几个标志:(1) 几乎所有局部变量都 spill 到栈上,不做寄存器分配——这让 codegen 走单遍;(2) r15 是 V8 约定的 "wasm memory base" 寄存器,所有 load/store 都通过它做基址相对寻址,自带越界检查(用大段保留页 + signal handler);(3) 函数尾巴塞了一个 tier-up 计数器,每次进入函数就 cmp 一下,达到阈值就触发后台 TurboFan 重编译——这是 Ch14 / Ch15 的故事。

Marks of Liftoff: (1) nearly every local spills to the stack — no register allocation, single-pass; (2) r15 is V8's "wasm memory base" register; every load/store uses it as the base, with bounds checking via guard pages + signal handlers; (3) the function tail packs a tier-up counter, cmp'd on each entry — when the threshold trips, a background TurboFan recompile fires. That's the Ch14 / Ch15 story.
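The guard-page trick is hardware doing the following check for free. A sketch of the semantics every wasm byte load must satisfy; the spec computes the effective address in a wider space precisely so the addition below cannot wrap:

```rust
// The check that V8's guard pages implement implicitly: every load computes
// addr + offset without 32-bit wraparound and traps when the effective
// address falls outside linear memory.
#[derive(Debug, PartialEq)]
struct Trap;

fn i32_load8_u(mem: &[u8], addr: u32, offset: u32) -> Result<u8, Trap> {
    let ea = addr as u64 + offset as u64; // cannot overflow in u64
    match mem.get(ea as usize) {
        Some(&b) => Ok(b),
        None => Err(Trap), // out of bounds → trap, never a wild read
    }
}

fn main() {
    let mem = vec![0u8; 65_536]; // one 64 KiB wasm page
    assert_eq!(i32_load8_u(&mem, 65_535, 0), Ok(0)); // last valid byte
    assert_eq!(i32_load8_u(&mem, 65_535, 1), Err(Trap)); // one past the end
}
```

With a large reserved address region after the memory, the explicit comparison disappears: an out-of-bounds access faults into the guard pages and the signal handler turns the fault into this same trap.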

同一个函数,经 TurboFan 重编译(Tier 1)

Same function, TurboFan-recompiled (Tier 1)

; TurboFan output for blur3 (excerpt of inner body)
mov    edi, [r15+rcx]           ; row 0 starting load — held in reg, not spilled
movzx  eax, dil
movzx  ebx, byte ptr [r15+rcx+1]
add    eax, ebx                 ; 9 loads, 8 adds — all in regs
movzx  ebx, byte ptr [r15+rcx+2]
add    eax, ebx
...
mov    ebx, 0x38e38e39          ; (2^33 + 1) / 9 magic for div-by-9
mul    ebx
shr    rdx, 1
mov    byte ptr [r15+rdi], dl   ; store the average
add    ecx, 1                   ; x++
cmp    ecx, esi
jl     -0x53

TurboFan 输出几乎就是手写汇编的样子——寄存器分配把 9 次 load 的中间结果留在 eax/ebx 里,sum / 9 被识别成"除以常数",用魔数乘法(0x38e38e39)替换了昂贵的 div 指令。这一招 LLVM 也会做(Hacker's Delight 第 10 章),V8 的 TurboFan 把它原样搬过来。这是 wasm "原生 80%" 的具体形式。

TurboFan output reads like hand-written assembly — the register allocator keeps the 9 loads in eax/ebx, and sum / 9 is recognised as "divide by constant" and replaced with magic-number multiplication (0x38e38e39). LLVM does the same trick (Hacker's Delight Ch10); V8's TurboFan ports it over. This is the concrete form of wasm's "80% of native".
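这招在 Rust 里一验便知。The strength reduction is easy to check in Rust — a sketch of unsigned divide-by-9 via the Hacker's Delight magic M = (2^33 + 1) / 9 = 0x38E38E39: multiply into 64 bits, shift right 33. Exact for every u32 (function name is illustrative):

```rust
// Divide-by-9 without a div instruction: multiply by M = (2^33 + 1) / 9,
// then take bits [63:33] of the 64-bit product. The rounding error term
// n / (9 * 2^33) is always too small to push the result past a boundary,
// so the floor matches n / 9 for every 32-bit n.
fn div9_magic(n: u32) -> u32 {
    const M: u64 = 0x38E3_8E39; // == (1u64 << 33) / 9 + 1
    ((n as u64 * M) >> 33) as u32
}
```

One multiply plus one shift in place of a ~25-cycle div — the same trade the storyboard annotates as "÷9 → ~4 cy".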

和 SIMD 版本的对比

vs the SIMD version

如果你打开 RUSTFLAGS="-C target-feature=+simd128",LLVM 会把这段代码完全向量化——同样的 inner loop 变成一条 v128.load + 一条 v128.add 即可处理 16 个像素。Ch19 会把向量化全过程展开。这里先给一行 punch line:

Add RUSTFLAGS="-C target-feature=+simd128" and LLVM vectorises completely — the inner loop becomes one v128.load + one v128.add processing 16 pixels. Ch19 unfolds the full vectorisation. One-line punchline:

RELATIVE THROUGHPUT · 1920×1080 image · M1 Pro · Chrome 132
JS · 1.0×
Wasm · Liftoff · 2.7×
Wasm · TurboFan · 4.3×
Wasm · TurboFan + SIMD · 6.8×
为什么 Liftoff 已经快 2.7 倍 · WHY EVEN LIFTOFF IS 2.7× FASTER

Liftoff 没有做寄存器分配,但它生成的机器码相对 JS 已经便宜很多:① 类型确定,无 IC、无 deopt;② 内存是 Uint8Array 的一片,无 GC write barrier;③ sum/9 单次 div,JS 要先 ToInt32 再 ToUint8,慢 3 倍。翻成机器码的"愚蠢版"已经胜过 JS 的"聪明版"。Liftoff 名字来自飞机起飞——快到不需要长跑道。

Liftoff does not do register allocation, but the machine code it emits is already far cheaper to run than JS: ① types are settled — no ICs, no deopts; ② memory is one slice of a Uint8Array — no GC write barriers; ③ sum/9 is a single div, where JS must ToInt32 then ToUint8, roughly 3× slower. The "dumb" wasm machine code already beats the "clever" JS machine code. Liftoff is named after an aircraft's takeoff — fast enough to need no runway.

"The Hot Loop" 在后续 22 章的回引地图

Where The Hot Loop returns in the next 22 chapters

Ch · 在那一章里它是什么 · What it looks like there
06 · 前 8 字节:\0asm magic + version · First 8 bytes: \0asm magic + version
07 · 展开它的 11 个 section · Its 11 sections fully unpacked
08 · i32 占 99%,出现 i32→u8 的窄化 · 99% i32, with i32→u8 narrowing
09 · 用到哪 6 类 opcode · Which 6 opcode families it touches
10 · src + dst 在 linear memory 的 layout · src + dst layout in linear memory
11 · 验证它的类型栈一步一步走 · Type stack walks during validation
12 / 13 · 流式 decode,边下边编 · Streaming decode, compile-while-fetching
14 · Liftoff 出的机器码 · Liftoff's machine code
15 · TurboFan 的 sea-of-nodes 图 · TurboFan's sea-of-nodes graph
16 · 实例化时 memory 怎么分配 · How instantiation allocates memory
17 · 从 JS 调它要花多少 ns · JS calling it: how many ns per call
18 / 19 / 20 / 21 / 22 · 线程版 / SIMD 版 / wasm-GC 改写 / 组件模型导出 · Threaded · SIMD · GC-rewrite · Component-exported
23 ~ 25 · 性能分析、DevTools 调试、移植到 Figma/Photoshop 的回响 · Perf profile, DevTools debug, echoes in Figma / Photoshop

一张图看完它的一生

The whole life in one frame

下面是 hot.rs 从源码到屏幕的12 个快照。每一格是这段代码在那一秒的实际面貌——左边五格是"静态形态",中间四格是"编译动作",右边三格是"运行时事件"。读完这张图,你应该能在脑里把后面 Act III/IV/V 的每一章对应回这条主线上。后面 8 个章节(Ch11-Ch19)的顶部都挂了一个 "MAIN-LINE STOP X/12" 胶囊,告诉你"你现在站在哪一格"。

Below: 12 snapshots of hot.rs, from source to pixel. Each cell is what the code actually looks like at that moment — the first five cells are static forms, the middle four are compiler actions, the last three are runtime events. After this image, every chapter in Acts III/IV/V can be slotted back onto this main-line. Eight later chapters (Ch11–Ch19) carry a "MAIN-LINE STOP X/12" capsule at the top telling you "which cell you're standing in".

✦ MAIN-LINE STORYBOARD · 12 SNAPSHOTS
The Hot Loop · 12 个快照 A 12-cell horizontal storyboard showing hot.rs as it transforms through: rustc source, LLVM IR, wasm-ld linker output, raw binary bytes, streaming decode, type-stack validation, Liftoff baseline machine code, first execution with call-counter, tier-up trigger to TurboFan, optimised machine code, atomic install, SIMD-vectorised execution producing a pixel. Ⅰ · STATIC FORM · what the bytes look like Ⅱ · COMPILER ACTION · what the engine is doing Ⅲ · RUNTIME EVENT · what time it is 01 rustc · source hot.rs · 11 LOC pub fn blur3( src: &[u8], dst: &mut[u8], w: usize, h: usize, ) { for y in 1..h-1 { for x in 1..w-1 { let mut sum : u32 = 0; // 9 loads // 8 adds dst[y*w+x] = (sum/9) as u8; } } } 256 B .rs text $ cargo 02 LLVM · IR SSA · machine-agnostic define void @blur3( i32* %src, i32* %dst, i32 %w, i32 %h) { entry: br label %loop_y loop_y: %1 = phi i32 %a = load i8 %b = load i8 ... 9 loads ... 8 adds %s = udiv i32 ... ~ 3 KB SSA text opt -O3 03 wasm-ld · wat stack-machine text (module (memory 1) (func $blur3 (param i32 i32 i32 i32) local.get $src i32.load8_u offset=0 ; ... 9 loads i32.add ; ... 8 adds i32.const 9 i32.div_u i32.store8 )) ~ 1 KB .wat text wat2wasm 04 .wasm · binary 192 bytes total 00 61 73 6d 01 00 00 00 01 0b 02 60 04 7f 7f 7f 7f 00 60 00 00 03 02 01 00 05 03 01 00 01 07 09 01 05 62 6c 75 72 33 00 00 0a ... [code body] 155 bytes i32.const i32.load8_u i32.add ... 
0b [end · EOF] 192 B · gzip 110 B .wasm binary ship to CDN 05 fetch + decode streaming · § by § §1 Type §3 Func §5 Mem §7 Export §10 Code → enqueue: blur3 body to validator + Liftoff streaming: network ‖ decode ‖ compile 3 threads in parallel + 1 ms Module struct Browser process 06 validate · vstack type-stack walk i32 i32 i32 i32 ← top floor ✓ types ok ✓ stack ok ✓ br targets O(n) · one forward pass parallel: N workers, N funcs + 6 µs proof of safety no alloc 07 Liftoff · T0 single-pass x86-64 push rbp sub rsp,0x40 mov [rbp-8],rdi ; spill $src mov rax,[rbp-8] movzx edx, byte ptr [r15+rax] add [rbp-0x28],edx ; ... 9 loads ; ... naïve div div ecx ; ~25 cy each mov byte ptr [r15+rbx],al cmp [r13+0x40] jne +0x4 ; tier-up cnt + 200 µs 240 B x86-64 RUNNABLE ! 08 execute · cold first invocation JS: instance .exports. blur3( srcPtr, dstPtr, ...) trampoline: ~5 ns unbox SMI load r15 → jmp body call ctr: 1 threshold 256 runs: ~ 12 ms / frame RUNNING 2.7× JS speed on Liftoff code 09 tier-up · trigger 256th invocation call ctr: 256 FULL cmp ✓ jne → enqueue TurboFan compile job non-blocking: Liftoff version keeps running while TF works TF worker: Graph build LoadElim Lowering Schedule RegAlloc background TF queued worker thread 10 TurboFan · T1 optimised x86-64 mov edi,[r15+rcx] ; $src held in reg movzx eax,dil movzx ebx, byte[r15+rcx+1] add eax,ebx ; 9 loads in regs ; 8 adds inlined mov ebx, 0x1c71c71d mul ebx shr rdx,1 ; ÷9 → ~4 cy mov byte ptr [r15+rdi],dl add ecx,1 jl loop_x + 2.1 ms 180 B x86-64 3.8 ms / frame 11 install atomic swap T0 ptr T1 ptr jump-table entry[k] CAS swap in-flight calls safe ~ ns live! 
no stop 12 SIMD · frame vectorised + drawn v128 lane ↑ one blurred pixel v128.load + 3 rows × 3 cols v128.add + saturate v128.store 16 pixels / iter ~ 1.1 SSE / px 0.65 ms · 8 thr 6.8× JS on screen ✦ build time build time build time ship time ~ 1 ms ~ 6 µs ~ 200 µs first call 256th call + 2.1 ms CAS every frame developer's machine CDN renderer process · wasm worker + main thread GPU + display Ch01/02 Ch01 Ch01/03 Ch06/07 Ch12 Ch11/13 Ch14 Ch16/17 Ch14 Ch15 Ch15 Ch19

cargo build 到屏幕上一个像素,11 行 Rust 走过 12 个快照——前 4 格是开发者机器上发生的事(rustc → LLVM IR → wat → .wasm 字节),第 5 格是CDN 到浏览器的传输,中间 6 格是渲染进程内的编译与第一次执行,最后 1 格是 SIMD 向量化版本在 GPU 显示前的最后一秒。每格底部标了对应章节,后面 8 章的顶部都挂"MAIN-LINE STOP X/12" 胶囊回引这张图。读完这张图你就拿到了整篇文章的骨架。

From cargo build to a pixel on screen, 11 lines of Rust pass through 12 snapshots — the first four cells happen on the developer's machine (rustc → LLVM IR → wat → .wasm bytes), the fifth is CDN to browser transport, the middle six are compilation and first execution inside the renderer process, and the last is the SIMD-vectorised version one heartbeat before the GPU lights the pixel. Every cell is anchored to a chapter; the next eight chapters carry a "MAIN-LINE STOP X/12" capsule at the top linking back here. Read this picture and you hold the article's skeleton.

向右拖动可查看完整 12 格

scroll horizontally to see all 12 cells

11 行 Rust,要走 17 道工序
才能在屏幕上动一个像素。 The Hot Loop · main-line
11 lines of Rust, 17 stages,
before one pixel moves on screen. The Hot Loop · main-line
INPUT
hot.rs · 11 LOC · 3×3 box blur · u8 grayscale
OUTPUT
hot.wasm · 192 byte · 这是后面 22 章的样本 · the sample for the next 22 chapters
ACT III · BINARY ANATOMY

把 192 字节摊在桌上。

192 bytes, spread on the table.

这一段是 wasm 的解剖学:外壳怎么定形,11 段 section 各自存什么,类型系统从 4 个数字类型怎么膨胀到 v128 和 GC,400+ 条 opcode 怎么塞进单字节,线性内存为什么 64 KiB 一页,以及——验证算法为什么能在线性时间里证明类型安全。这 6 章读完,你拿到一个 .wasm 文件可以一字一字读出来。

This act is wasm's anatomy: how the shell is shaped, what the 11 sections each carry, how the type system grew from four numeric types into v128 and GC, how 400+ opcodes pack into single bytes, why linear memory is 64 KiB per page, and — how validation proves type safety in linear time. After these six chapters you can pick up a .wasm file and read it byte by byte.

STAGE 01 · ANATOMY

Module 的外壳 — 8 个字节的承诺

The Module shell — a promise in 8 bytes

\0asm + version + 11 段 section

\0asm + version + 11 sections

Layer
binary
Header
8 bytes
Endianness
little
Spec §
5.5.2 Module
这一段在做什么
What it does
所有 .wasm 文件的前 8 字节固定 00 61 73 6d 01 00 00 00——magic 4 字节(\0asm),版本 4 字节(目前是 1)。后面跟着一个 section 序列,每段一个 ID + LEB128 长度 + 内容。仅此而已。

The first 8 bytes of every .wasm file are fixed: 00 61 73 6d 01 00 00 00 — magic 4 bytes (\0asm), version 4 bytes (currently 1). Then comes a sequence of sections: each is an ID byte + LEB128 length + payload. That's all.
为什么这么设计
Why this design
两个目标:① 文件第 1 字节 0x00 让任何把它当 JS 解析的工具立刻报错;② section 用 ID + length 而非偏移表,允许流式解析——边下载边解码。Ch12 会用到这个性质。

Two goals: ① byte 0 = 0x00 ensures any tool that tries to parse the file as JS fails immediately; ② sections use ID + length (not an offset table) to enable streaming parse — decode while downloading. Ch12 hinges on this.
关键代码
Key code
v8/src/wasm/module-decoder.cc :: DecodeModule()
FIG 06 · hot.wasm byte map · 192 byte hot.wasm 文件按 section 比例图 The 192 bytes of hot.wasm laid out proportionally by section, with the 8-byte header highlighted. OFFSET 0x00 0x18 0x40 0x80 0xC0 (192) magic + ver 8 B §Type 13 B §Func 4 B §Mem 5 B §Export 11 B §Code · blur3 body ~ 151 B(占 79%) \0asm version=1 FIRST 16 BYTES · zoom 00 61 73 6d 01 00 00 00 01 0b 02 60 04 7f 7f 7f ┕━━━━━━━━━━┙ "\0asm" ┕━━━━━━━━━━━┙ ver=1 (LE u32) ┕━━━━━━━━━━━━━━━━━━━━┙ §Type · 0x01, len=11, count=2, (func i32×4)

192 字节里,code section 占 79%——其余 4 个 section 加上 8 字节文件头只占 21%。这是 wasm "structure 紧凑,代码占比高" 的可视证据。magic + version 8 字节让任何 JS 解析器立刻报错;接下来每个 section 都是 id + LEB128 length + payload 的三段式。

Of 192 bytes, the code section is 79% — the other four sections plus the 8-byte header account for the remaining 21%. Visible proof of wasm's "compact structure, code-heavy ratio". Magic + version 8 bytes make any JS parser fail instantly; every following section follows id + LEB128 length + payload.

第一字节级别的拆解

Byte-level walkthrough

00 61 73 6d
magic = ASCII \0asm。第一字节 NULL 让 cat file.wasm | node 立刻抛 SyntaxError。Byte 0 = NULL makes cat file.wasm | node throw SyntaxError instantly.
01 00 00 00
version = 1, little-endian u32。这一格自 2017 MVP 起从未涨过——演进通过新的 section 表达,不动版本号。This field hasn't moved since the 2017 MVP — evolution happens through new sections, not version bumps.
01 ll ll ll
section type。0x01 = Type section,后面紧跟 LEB128 编码的长度 ll(可变长 1~5 字节)。0x01 = Type section, immediately followed by LEB128-encoded length ll (variable 1–5 bytes).
ll ll ll ...
section payload。每段内部结构由 section type 决定。除了 Custom(0x00)外,其余 section 必须按 ID 升序出现。Payload format is determined by section type. Except for Custom (0x00), sections must appear in ascending ID order.
LEB128 是什么 · WHAT IS LEB128

Little Endian Base-128。一种变长整数编码:每字节用 7 位载数据,最高位 1 表示"还有",0 表示"完了"。0~127 用 1 字节,128~16383 用 2 字节,以此类推。由 DWARF 调试格式发明,wasm 拿来用——因为大多数 wasm 整数都很小(类型索引、locals 数、跳转目标),平均不到 2 字节。所有 wasm 整数(f32/f64 的立即数除外,它们用原始 IEEE 字节)都是 LEB128 编码的。

Little Endian Base-128. A variable-length integer encoding: 7 data bits per byte, top bit = 1 means "more coming", 0 means "done". 0–127 use 1 byte, 128–16383 use 2 bytes, and so on. Invented for DWARF, adopted by wasm — because most wasm integers are small (type indices, local counts, branch targets), averaging < 2 bytes. Every wasm integer (except f32/f64 immediates, which are raw IEEE bytes) is LEB128-encoded.
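这套编码小到可以直接写出来。The unsigned variant is small enough to model whole — a Rust sketch (the signed variant used for i32.const immediates differs only in sign handling; function names are illustrative):

```rust
// Unsigned LEB128, as used for section lengths, counts and indices:
// 7 payload bits per byte, high bit set while more bytes follow.
fn leb128_encode(mut v: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);        // high bit clear: final byte
            return out;
        }
        out.push(byte | 0x80);     // high bit set: continuation
    }
}

fn leb128_decode(bytes: &[u8]) -> (u64, usize) {
    let mut result: u64 = 0;
    let mut shift = 0u32;
    let mut used = 0usize;
    for &b in bytes {
        result |= u64::from(b & 0x7f) << shift;
        used += 1;
        if b & 0x80 == 0 { break; }
        shift += 7;
    }
    (result, used)                 // (value, bytes consumed)
}
```

`leb128_encode(16383)` fits in two bytes while `leb128_encode(128)` already needs two — exactly the 1-byte / 2-byte boundaries named above.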
FIG 07 · module topology · reference graph 11 个 section 之间的引用拓扑 A graph showing how Type / Function / Code / Memory / Data / Table / Element / Export sections reference each other through indices. 11 SECTIONS · WHO POINTS AT WHOM DECLARATIONS BODIES INITIALISERS HOST-FACING §1 Type all signatures §3 Function func i → Type[k] §4 Table funcref array §5 Memory linear pages §6 Global module-level vars §10 Code all function bodies (80% of file) parallel-validatable §9 Element → Table[] §11 Data → Memory[] §12 DataCount (2020 patch) §2 Import host → wasm §7 Export wasm → host §8 Start init function idx §0 Custom × N name/DWARF/vendor type idx 1:1 body fills fills may import any of these name → func idx PARSE ORDER · ascending by ID 0x00 Custom· 0x01 Type · 0x02 Import · 0x03 Func · 0x04 Table · 0x05 Mem · 0x06 Global · 0x07 Export · 0x08 Start · 0x09 Elem · 0x0a Code · 0x0b Data

11 个 section 按引用方向分四类:蓝·声明(谁存在)、绿·函数体(真正的字节码)、铜·初始化器(给 table/memory 灌数据)、紫·宿主接口(import/export/start)。这四类必须按 ID 升序出现——只有 Custom 段(灰)可以出现在任何地方,出现多少次都行。

The 11 sections split four ways by reference direction: blue declarations (who exists), green bodies (the real bytecode), copper initialisers (filling tables/memory), purple host-facing (import/export/start). The four must appear in ascending ID order — only Custom (grey) may appear anywhere, any number of times.

所有 13 个 section 的 ID

All 13 section IDs

id 0x00
Custom
debug info / name 表 / 元数据,可重复出现。
debug info / name table / metadata, may repeat.
id 0x01
Type
所有函数签名。
all function signatures.
id 0x02
Import
从宿主导入的 func / table / mem / global。
funcs / tables / mems / globals imported from host.
id 0x03
Function
本模块函数 i 的类型是 Type[k]。
"function i has type Type[k]".
id 0x04
Table
funcref / externref 数组(动态 dispatch)。
funcref / externref arrays (dynamic dispatch).
id 0x05
Memory
线性内存声明(min/max page)。
linear memory declaration (min/max pages).
id 0x06
Global
模块级全局变量(像静态变量)。
module-level global variables (like statics).
id 0x07
Export
给宿主用的名字 → 内部索引。
names visible to host → internal indices.
id 0x08
Start
模块初始化函数(像 ctor)。
module initialiser (like a ctor).
id 0x09
Element
table 的初始填充。
table initialisers.
id 0x0a
Code
所有函数体的字节码。
bytecode bodies for all functions.
id 0x0b
Data
线性内存的初始数据。
initial data for linear memory.
id 0x0c
DataCount
bulk memory 后期加的"data 段计数"(2020)。
"how many data segments", added with bulk memory (2020).

Custom section 是规范留给所有人的逃生舱口——它没有规定的格式,只有一个名字(LEB128 长度 + UTF-8 字节)和任意 payload。DWARF 调试信息、source map、wasm-bindgen 的 JS 胶水都藏在这里。Ch24 会展开 name custom section,它给函数和局部变量起名,让 DevTools 能显示符号。

The Custom section is the spec's escape hatch for everyone — no prescribed format, just a name (LEB128 length + UTF-8 bytes) and arbitrary payload. DWARF debug info, source maps, and wasm-bindgen's JS glue all hide here. Ch24 unpacks the name custom section, which names functions and locals so DevTools can show symbols.

主线回引 · The Hot Loop 的外壳MAIN-LINE · The Hot Loop's shell
192 字节里只用到了 5 个 section
192 bytes use only 5 sections

回看 Act II 给的 192 字节十六进制,排查 section ID:01(type)、03(function)、05(memory)、07(export)、0a(code)。没有 import,没有 table,没有 global,没有 data——因为我们的卷积函数不依赖宿主、不做间接调用、没有模块级常量、不预填内存。最小可运行 wasm 模块就是这 5 段。

Re-read the 192-byte hex from Act II and you'll find section IDs: 01 (type), 03 (function), 05 (memory), 07 (export), 0a (code). No import, table, global, or data — because our blur function imports nothing, uses no indirect calls, has no module-level constants, and pre-fills no memory. The minimum runnable wasm is exactly these 5 sections.
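id + length 的三段式意味着:不懂任何 payload 也能列出 section ID。The id + length framing means a scanner can list a module's sections without decoding any payload — a toy sketch (single-byte LEB128 lengths assumed, which holds for a module this small; the function name is illustrative):

```rust
// Walk the section sequence after the 8-byte header: each section is
// id + LEB128 length + payload, so we can hop section to section
// by adding the length, never looking inside.
fn section_ids(wasm: &[u8]) -> Vec<u8> {
    let mut ids = Vec::new();
    let mut p = 8;                       // skip magic + version
    while p + 1 < wasm.len() {
        ids.push(wasm[p]);               // section id byte
        let len = wasm[p + 1] as usize;  // LEB128 length (1 byte here)
        p += 2 + len;                    // jump over id + length + payload
    }
    ids
}
```

Fed the first 30 bytes of hot.wasm's hex dump, it reports `[0x01, 0x03, 0x05]` — type, function, memory — before the export and code sections even arrive. That hop-by-length walk is exactly what makes streaming parse possible.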

CLI
wasm-objdump -h hot.wasm 列所有 sectionlist every section
CLI
xxd hot.wasm | head -2 看头 8 字节peek at the first 8 bytes
SOURCE
v8/src/wasm/module-decoder-impl.h
STAGE 02 · ANATOMY

11 段 section — 一个 .wasm 文件的器官学

11 sections — the organology of a .wasm file

每一段都是一个 K-V 仓库

each section is a K-V vault

Sections
11 known + Custom
Order
ascending by ID
Custom anywhere
unlimited
Spec §
5.5.3 ~ 5.5.13

看一个 .wasm 文件最容易的方式,就是把它当成一组按 ID 升序排列的 K-V 仓库。每个 section 解决一个具体问题。这一章把 11 个 section 各拆一遍(外加 DataCount 这个事后补丁),每段给一个最小例子 + 在主线 Hot Loop 里的角色。

The easiest way to read a .wasm file is to treat it as a sequence of K-V vaults, ordered by ID. Each section answers one specific question. This chapter walks all 11 (plus the DataCount afterthought), with a minimal example and the role each plays in the main-line Hot Loop.

① Type section · 函数签名表

① Type section · the signature table

问题:"函数 $blur3 长什么样?"
答:"它是 type[0]:(i32 i32 i32 i32) -> ()"——所有 module 内出现的函数签名先注册一遍,后面引用用索引。

Question: "What does $blur3 look like?"
Answer: "It's type[0]: (i32 i32 i32 i32) -> ()" — every signature used in the module is registered up front, later referenced by index.

(type $t0 (func (param i32 i32 i32 i32))) ;; void return implied

为什么把签名独立成表?因为同一个签名会被多个函数共用——主线里只有一个函数,签名表有 1 条。但 Photoshop 的 wasm 里有几十万个函数,只用到几百种签名,共用让 type section 体积小 2~3 个数量级。

Why pull signatures into their own table? Because the same signature is shared among many functions — the main-line has 1 function, so 1 entry. Photoshop's wasm has hundreds of thousands of functions but only hundreds of distinct signatures; sharing collapses the type section by 2–3 orders of magnitude.
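hot.wasm 自己的 Type section(01 0b 02 60 04 7f 7f 7f 7f 00 60 00 00)可以手工解出来。A toy Rust parser for that exact payload — 0x60 introduces a func type, 0x7F tags an i32 — assuming single-byte LEB128 values for brevity (function name illustrative):

```rust
// Decode a Type-section payload into (params, results) pairs of raw
// value-type tags. hot.wasm's payload declares two types:
// type[0] = (func (param i32 i32 i32 i32)) and type[1] = (func).
fn parse_type_section(payload: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut p = 0;
    let count = payload[p] as usize; p += 1;            // number of types
    let mut types = Vec::new();
    for _ in 0..count {
        assert_eq!(payload[p], 0x60, "expected func type tag"); p += 1;
        let np = payload[p] as usize; p += 1;           // param count
        let params = payload[p..p + np].to_vec(); p += np;
        let nr = payload[p] as usize; p += 1;           // result count
        let results = payload[p..p + nr].to_vec(); p += nr;
        types.push((params, results));
    }
    types
}
```

Every later `(type $t0 …)` reference in the module is just an index into the Vec this returns — that indirection is the whole sharing trick.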

② Import section · 来自宿主的函数与对象

② Import section · functions and objects from the host

主线 Hot Loop 没有 import——纯数学函数,不依赖任何 JS API。但下面是 Photoshop 的实际 import section 缩影:

The main-line Hot Loop has no imports — pure math, no JS-side dependency. Below is a snapshot of Photoshop's real import section:

(import "env" "memory"          (memory 256 32768 shared))
(import "env" "__indirect_func_table" (table 4096 funcref))
(import "env" "emscripten_resize_heap" (func (param i32) (result i32)))
(import "wasi_snapshot_preview1" "fd_write" (func (param i32 i32 i32 i32) (result i32)))

四个观察:① 每一条 import 是两段名字 + 一个签名("env" "memory" 是惯例,Emscripten 用 "env" 当 module 名);② memory 可以被 import——这是多线程 wasm 共享内存的关键;③ table 也能 import,允许 JS 给函数指针填值;④ WASI 函数通过 import 引入,在浏览器外的 wasm 里这是主要的"系统调用"通道。

Four notes: ① each import is two-segment name + one signature ("env" "memory" is the Emscripten convention); ② memory itself can be imported — that's the foundation of shared-memory multi-threaded wasm; ③ tables too, letting JS populate function pointers; ④ WASI functions enter via imports, which is the primary "syscall" channel for non-browser wasm.

③ Function section · 函数和签名的连线

③ Function section · wiring functions to their signatures

这一段长得最简洁——就是一个 type index 数组:"函数 0 用 type[0],函数 1 用 type[2],函数 2 用 type[2],..."。函数体本身不在这里,它们在 Code section(0x0a)。把"签名声明"和"函数体"分开是为了流式解码——下载到 function section 就能开始检查 import/export 的类型匹配,不必等 code 段下完。

The plainest section — just an array of type indices: "function 0 is type[0], function 1 is type[2], function 2 is type[2], …". The body lives elsewhere, in the Code section (0x0a). The split between "signature declaration" and "body" exists for streaming decode — once function section is in, you can check import/export type matching without waiting for code.

④ Table section · funcref 的数组

④ Table section · the funcref array

Table 是 wasm 的 "函数指针表",最初是为了 C 函数指针 / C++ vtable / Java 接口分发服务。每个 table 元素是 funcref(MVP)或 externref(2021)。call_indirect 指令用 table 索引 + 类型 ID 调用——类型 ID 必须匹配,否则 trap。Ch09 / Ch11 会展开。

Table is wasm's "function pointer table", born to serve C function pointers / C++ vtables / Java interfaces. Each element is funcref (MVP) or externref (2021). call_indirect uses (table idx + type id) to dispatch — the type id must match or it traps. Expanded in Ch09 / Ch11.

2021 年 reference-types 提案前,一个 module 只能有一张 table。之后可以有多张。主线 Hot Loop 不用 table——它没有间接调用。

Before reference-types (2021), a module could carry only one table. After: multiple. The main-line Hot Loop uses no table — no indirect call.

⑤ Memory section · 线性内存的声明

⑤ Memory section · linear memory declaration

主线声明 (memory 1)——min=1 page=64 KiB,max 不指定。Ch10 完整展开线性内存。

Main-line declares (memory 1) — min=1 page=64 KiB, max unspecified. Ch10 covers linear memory fully.

⑥ Global section · 模块级常量与变量

⑥ Global section · module-level constants and vars

(global $stack_top (mut i32) (i32.const 0x10000))
(global $PI         f64       (f64.const 3.14159265358979))

每个 global 是 (type, mut?, init expr) 三件套。mut 标记可写,initialiser 是一段受限的常量表达式(只能用 iN.const / fN.const / global.get)。Rust / C 的 static 数据如果是常量就来这里,如果是读写就放到 linear memory 的 data section。

Each global is (type, mut?, init expr). mut means writable; initialiser is a constant expression (only iN.const / fN.const / global.get). Rust/C static data lives here when constant; mutable static data goes into linear memory via the data section.

⑦ Export section · 给宿主看的窗口

⑦ Export section · the host-facing window

(export "memory" (memory 0))
(export "blur3"  (func $blur3))
(export "alloc"  (func $alloc))

name → (kind, index) 的字典。kind 可以是 func / table / memory / global / tag(tag 是 exception handling 提案加的)。所有从 JS 调 wasm 的入口都在这里。 JS 那边的 instance.exports.blur3 就是查这张表。

A name → (kind, index) dictionary. Kind ∈ {func, table, memory, global, tag} (tag added by exception handling). Every JS-to-wasm entry point lives here. JS-side instance.exports.blur3 looks up this very table.

⑧ Start section · 模块的 ctor

⑧ Start section · the module's ctor

仅一个数字——某个函数的索引。该函数不能有参数,不能有返回值,在 module instantiate 完成的最后阶段被引擎自动调用。用来做模块级初始化(注册回调、填充常量表)。主线 Hot Loop 没有 start

Just one number — the index of a function. The start function takes no params, returns nothing, and is invoked automatically by the engine at the end of instantiation. Used for module-level setup (registering callbacks, filling constant tables). Main-line Hot Loop omits start.

⑨ Element section · 给 table 填值

⑨ Element section · table initialisers

语义类似 data section,但写入对象是 table 而非 memory。一个 module 实例化时,element 段把 funcref 们填进对应 table 槽位。C 程序的"函数指针表"就在这里活;C++ 的 vtable 也是。

Semantically similar to data section but writing into tables rather than memory. On instantiation, element segments populate funcref slots. C function pointer tables live here; so do C++ vtables.

⑩ Code section · 字节码本身

⑩ Code section · the bytecode itself

这是最大的一段——主线 Hot Loop 的 code section 占整个文件的 80% 字节。每个函数体的格式是:locals 声明(类型聚合表)+ 表达式序列 + 终止符 0x0b (end)。Ch09 把指令格式撕开。

The biggest section — the Hot Loop's code section is 80% of the entire file. Each function body's format is: locals declaration (run-length encoded by type) + expression sequence + terminator 0x0b (end). Ch09 unpacks the instruction format.
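locals 声明的"类型聚合"可以直接写成代码。The run-length locals declaration is small enough to model — a toy decoder (single-byte LEB128 counts assumed; function name illustrative):

```rust
// A Code-section body starts with a count of (n, type) groups, each
// meaning "n more locals of this value type" — run-length encoding, so
// "3 × i32, 1 × i64" costs 5 bytes instead of 4 type tags plus framing.
fn decode_locals(body: &[u8]) -> Vec<u8> {
    let mut locals = Vec::new();
    let mut p = 0;
    let groups = body[p] as usize; p += 1;
    for _ in 0..groups {
        let n = body[p] as usize; p += 1;   // how many locals in this group
        let ty = body[p]; p += 1;           // their shared value-type tag
        for _ in 0..n {
            locals.push(ty);
        }
    }
    locals
}
```

The expression sequence then follows immediately after the groups, running to the 0x0b terminator.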

⑪ Data section · 给 memory 填值

⑪ Data section · memory initialisers

把"这段字节请在实例化时写到 linear memory 的某地址"批量声明。C 程序的字符串字面量、Rust 的 static 数组、Emscripten 的 stdlib 数据表都在这里。MVP 时每条数据段必须 active(立即写入);bulk memory 提案(2020)加了 passive 模式,允许 wasm 代码显式memory.init 来用——支持代码热更新。

Bulk-declares "at instantiation, write these bytes to memory at address X". C string literals, Rust static arrays, Emscripten's stdlib tables all live here. MVP required every segment to be active (written immediately). The bulk-memory proposal (2020) added passive mode, letting code call memory.init explicitly — supports hot reload.

⑫ DataCount section · 一个事后补丁

⑫ DataCount section · a retroactive patch

2020 年加 bulk memory 后,memory.init segIdx 指令需要在验证时立刻知道 data 段总数。但 code section 在 data section 前面解析——为了不让 validator 反复回扫,设计者插入了一个新 section 0x0c,专门告诉解码器"我有 N 个 data 段"。这是 wasm spec 仅有的"事后补丁"section,反映了流式解析的硬约束

Bulk memory (2020) made memory.init segIdx need to know the total number of data segments during validation. But code section parses before data — to spare the validator from a back-scan, the designers slipped in a new section 0x0c that just says "I have N data segments". It's the only "retroactive patch" section in the spec, reflecting the hard constraint of streaming parse.

Section 不是设计,是约束的化石。 Field Note · 03
Sections are not design.
They are fossilised constraints. Field Note · 03
CLI
wasm2wat hot.wasm · 把所有 section 解成可读文本expand every section into readable wat
CLI
wasm-objdump -s hot.wasm · 逐段十六进制 dumpsection-by-section hex dump
CLI
wasm-objdump -x hot.wasm · 看 import/export/global 摘要summarise imports/exports/globals
STAGE 03 · ANATOMY

类型系统 — 从 4 到 6 再到无限

Type system — from 4 to 6, then to infinity

小到 1 字节,大到任意结构

one byte to arbitrary structure

MVP
4 types (i32 i64 f32 f64)
2021
+ v128 + funcref/externref
2024
+ struct/array/i31 (GC)
Spec §
2.3 Types

2017 年 MVP 上线时,wasm 一共只有 4 种值类型:i32 / i64 / f32 / f64。理由极其务实:这是所有 CPU 都能直接处理的 4 种,JIT 不需要费力适配。九年后的今天,加上 SIMD 的 v128、reference 的 funcref/externref、以及 wasm-GC 的 struct/array/i31,wasm 已经有了"近似于一门完整语言"的类型系统——但每一次扩张都要回答同一个问题:新类型怎么不破坏栈机的"一字节 opcode"承诺?

The 2017 MVP shipped with just four value types: i32 / i64 / f32 / f64. The reasoning was ruthlessly practical: these are the four that every CPU handles natively, so the JIT has no fitting to do. Nine years on, with SIMD's v128, reference-types' funcref/externref, and wasm-GC's struct/array/i31, wasm now has a type system that "looks like a real language". But every expansion answers the same question: how does the new type not break the stack machine's "one-byte opcode" promise?

FIG 08 · type lattice · post-GC (2024+) wasm 类型 lattice · GC 提案后 The subtyping lattice of wasm value types after the GC proposal. anyref sits at the top, with i31ref / structref / arrayref / funcref / externref descending and concrete types at the bottom. VALUE TYPE LATTICE · 2024 GC SHIPPED NUMERIC · INDEPENDENT i32 i64 f32 f64 v128 REFERENCE LATTICE · SUBTYPING anyref top of all refs eqref structurally eq funcref callable externref host obj i31ref SMI inline structref heap obj arrayref heap arr (ref $T) (ref $Point) (ref $Vec) (ref $func-sig) nullref · bottom of every nullable type NOTE · WHY NUMERIC SITS SEPARATE

数字类型 i32/i64/f32/f64/v128 是原始(primitive)类型——它们直接对应硬件寄存器,没有子类型关系。一个 i32 不是 i64 的子类型,也不是 v128 的子类型;它们之间只能通过显式 op 转换(i64.extend_i32_s 等)。

引用类型(右侧 lattice)是 2021 reference-types 提案 + 2024 GC 提案合作长出的—— 它有真正的子类型关系,validator 可以做 upcast/downcast 检查, ref.cast 失败会 trap。

2024 之后,wasm 值类型分两个世界:左半是 5 个独立的原始类型(i32/i64/f32/f64/v128),没有子类型关系;右半引用类型 lattice——顶层 anyref,下分 eqref / funcref / externref,再下到具体 struct / array / 函数签名引用。所有 nullable 引用最终指向 nullref

After 2024, wasm value types split into two worlds: left — 5 independent primitive types (i32/i64/f32/f64/v128), no subtyping; right — a reference type lattice topped by anyref, descending into eqref / funcref / externref, then concrete struct / array / function-sig references. All nullable refs ultimately point to nullref.

值类型的全景

All value types at a glance

Category · Type · Size · Tag (encoding) · Since · What
numeric · i32 · 4 byte · 0x7F · MVP · 32 位整数(符号由指令携带)· 32-bit integer (sign-per-op)
numeric · i64 · 8 byte · 0x7E · MVP · 64 位整数 · 64-bit integer
numeric · f32 · 4 byte · 0x7D · MVP · IEEE 754 single
numeric · f64 · 8 byte · 0x7C · MVP · IEEE 754 double
vector · v128 · 16 byte · 0x7B · 2021 · 128 位 SIMD,可解释成 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64 · 128-bit SIMD, viewable as 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64
reference · funcref · ptr · 0x70 · 2021 · 指向 wasm 函数的不透明引用 · opaque reference to a wasm function
reference · externref · ptr · 0x6F · 2021 · 指向宿主对象(JS Object / DOM 节点)· reference to a host object (JS Object / DOM node)
GC (2024) · (ref $T) · ptr · 0x6B · 2024 · 指向 struct / array 的强类型引用 · typed reference to a struct or array
GC (2024) · (ref null $T) · ptr · 0x6C · 2024 · 允许为 null 的版本 · nullable version
GC (2024) · i31ref · 31-bit · 0x6C+ · 2024 · SMI 风格的内联小整数(避免堆分配)· SMI-style inline small int (skip heap alloc)
关于 tag byte 的负数编码 · ON THE NEGATIVE TAG BYTES

注意 i32 的 tag 是 0x7F = -1,i64 是 0x7E = -2——这些是 signed LEB128 编码的小负数。规范选负数空间是有意的:正数空间留给"type index"(给 GC 用),这样验证器一字节就能判断"这是基本类型还是 struct 引用"。tag 设计是为 GC 的未来留的接口——MVP 时代设计者已经预想到这一步。

Note i32's tag is 0x7F = -1, i64 is 0x7E = -2 — these are signed LEB128 small negatives. The negative space was deliberate: positive space is reserved for type indices (for GC), so the validator can decide "basic type vs struct reference" in one byte. The tag design is the interface MVP designers left for GC's future — anticipated from day one.

"函数类型" 也是一种类型

"Function type" is also a type

在 Type section 里出现的 (func (param i32 i32) (result i32)) 用编码 0x60 引导。MVP 时只有 func 这一种"组合类型",GC 提案后加了 0x5F = struct0x5E = array——把 Type section 从"函数签名表"扩成了"组合类型表"。同一段 binary 在 2017 年和 2026 年解析出来的"section 0x01"含义已经悄悄扩张了一倍

In Type section, (func (param i32 i32) (result i32)) begins with tag 0x60. The MVP had only this one "compound type". The GC proposal added 0x5F = struct and 0x5E = array — quietly stretching Type section from "signature table" to "compound-type table". The same byte (section 0x01) means twice as much in 2026 as in 2017.

类型系统的"缺失":没有 i8/i16/u32

The system's absences: no i8/i16/u32

"为什么 wasm 没有 i8 类型?字符串处理要怎么办?"——这是另一个常见疑问。答案:wasm 的值类型不区分 i8/i16/i32,但内存读写有 i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s——读 8/16 位 byte,符号或零扩展到 i32。窄类型只存在于 memory 边界,寄存器里永远是 i32 或 i64。

"Why no i8? How do you process strings?" — another perennial question. Answer: wasm's value types don't distinguish i8/i16/i32, but memory access does: i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s read 8/16-bit bytes and sign- or zero-extend to i32. Narrow types exist only at the memory boundary; in registers, everything is i32 or i64.

同理无符号 vs 有符号区分也只活在指令层面:i32.div_s(signed) vs i32.div_u(unsigned)、i32.lt_s vs i32.lt_u"类型只标 32/64 位,符号由 op 携带"是 wasm 的核心设计简化——让值类型集合保持小,降低验证器和 JIT 的复杂度。

Same with signed-vs-unsigned: it lives at the opcode layer, not the type layer — i32.div_s vs i32.div_u, i32.lt_s vs i32.lt_u. "Types carry width only; signedness rides on the op" is a core simplification. It keeps the value-type set small and shrinks both validator and JIT.
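两种 load 的差别用 Rust 的 cast 一行就能演示。The two load flavours map directly onto Rust casts — same byte in memory, two different extensions into the 32-bit value (function names mirror the wasm opcodes, not any engine API):

```rust
// Model of i32.load8_u vs i32.load8_s: the byte 0xFF becomes 255 under
// zero extension and -1 under sign extension. Signedness lives in the
// opcode, not in the stored byte.
fn load8_u(mem: &[u8], addr: usize) -> i32 {
    mem[addr] as i32            // zero-extend
}

fn load8_s(mem: &[u8], addr: usize) -> i32 {
    mem[addr] as i8 as i32      // reinterpret as signed, then sign-extend
}
```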

主线回引 · The Hot Loop 的类型分布

Main-line · types in The Hot Loop

TYPE USAGE · hot.wasm · 192 byte
i32 · 96% · 8 locals · 24 ops
i64 · —
f32 · —
f64 · —
v128 (with SIMD flag) · 1 op

主线 Hot Loop 几乎是 100% i32——这是 wasm 的常态。绝大多数 LLVM 后端在 wasm32 目标上把 usize / size_t 编译成 i32(因为 wasm32 上指针就是 32 位),数组下标也是 i32。f64 主要出现在浮点计算场景,i64 出现在 BigInt 场景。如果你 grep 一个真实 wasm 模块,i32. 开头的 opcode 占 70% ~ 90%

The main-line is ~100% i32 — wasm's norm. Most LLVM backends compile usize / size_t to i32 on wasm32 (pointers are 32-bit), and array indices are i32. f64 shows up in floating-point math; i64 in BigInt scenarios. Grep any real wasm module and 70–90% of opcodes start with i32.

STAGE 04 · ANATOMY

指令集 — 一字节里的 430 条命令

Opcodes — 430 commands in a single byte

六大家族,一套 prefix 方案

six families, one prefix scheme

MVP opcodes
~190 (1 byte)
2026 total
~430 (with prefix)
Encoding
1B opcode + LEB128 imm
Spec §
5.4 Instructions

MVP 时 wasm 用了 256 个 opcode 空间里的 190 个左右。后来 SIMD / Bulk Memory / Reference Types / GC / Atomics 每个提案都要加新指令,字节空间不够了。解法是多字节 opcode:第一字节用一个保留值(0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC),后跟一个 LEB128 子 opcode。单字节空间维持紧凑,扩展走 prefix。

The MVP used ~190 of the 256 opcode slots. SIMD / Bulk Memory / Reference Types / GC / Atomics each demanded new ops, and the byte space ran short. The fix: multi-byte opcodes. A reserved first byte (0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC) followed by a LEB128 sub-opcode. The single-byte space stays compact; extensions ride the prefix.
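prefix 方案可以写成一个极小的解码器。The prefix scheme as a tiny decoder: match the reserved lead byte, then read the LEB128 sub-opcode. A sketch — family labels are descriptive strings, not engine identifiers:

```rust
// One reserved first byte selects the extension family; a LEB128
// sub-opcode selects the instruction inside it. Anything else is a
// plain single-byte MVP opcode.
fn decode_op(bytes: &[u8]) -> (&'static str, u32, usize) {
    let family = match bytes[0] {
        0xFB => "gc",
        0xFC => "bulk/saturating",
        0xFD => "simd",
        0xFE => "atomics",
        b => return ("single-byte", b as u32, 1),
    };
    let (mut sub, mut shift, mut used) = (0u32, 0u32, 1usize);
    for &b in &bytes[1..] {
        sub |= u32::from(b & 0x7f) << shift;
        used += 1;
        if b & 0x80 == 0 { break; }
        shift += 7;
    }
    (family, sub, used)          // (family, sub-opcode, bytes consumed)
}
```

`[0x6A]` decodes as a single-byte op (i32.add), while `[0xFD, 0x00]` is SIMD sub-opcode 0 — and because the sub-opcode is LEB128, the sub-space has no 256-entry ceiling.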

FIG 09 · opcode space · 16×16 grid 256 个单字节 opcode 的家族分布 A 16x16 grid showing all 256 single-byte opcodes, colored by family. Reserved bytes mark prefix opcodes for extension families. OPCODE BYTE MAP · 0x00 ~ 0xFF _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F 0_ 1_ 2_ 3_ 4_ 5_ 6_ 7_ 8_ 9_ A_ B_ C_ D_ E_ F_ LEGEND control flow stack / locals memory const numeric convert prefix reserved PREFIX BYTES: 0xFB · GC 0xFC · bulk/sat 0xFD · SIMD ~250 op 0xFE · atomics 注:每个 0xFX prefix 后跟一个 LEB128 子 opcode → 子空间无上限,SIMD 已占用 ~ 250 个

256 个单字节 opcode 槽位里,numeric + memory + 控制流 占去六成多。底部 0xFB-0xFE 四个紫色格是 prefix 字节——每个 prefix 后跟一个 LEB128 子 opcode,把扩展空间延展到无穷。2017 MVP 时 只有上半部分被占用,所有 2019 后加的指令(SIMD/GC/Atomics)都缩在这四个 prefix 后面。

Of 256 single-byte opcode slots, numeric + memory + control occupy over 60%. The four purple cells at the bottom (0xFB–0xFE) are prefix bytes — each followed by a LEB128 sub-opcode, extending the space without bound. In the 2017 MVP only the upper half was filled; every post-2019 op (SIMD/GC/Atomics) lives behind these four prefixes.

六大指令家族

Six instruction families

1 · control
控制流
Control flow

block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop。没有 goto结构化控制是 wasm 的硬约束,Ch11 的验证算法依赖这个。

block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop. No goto. Structured control is a hard invariant — Ch11's validator depends on it.

0x00 ~ 0x11
2 · param
参数/栈操作
Param / stack

drop / select / local.get / local.set / local.tee / global.get / global.setteeset 的"留个备份在栈顶"版。

drop / select / local.get / local.set / local.tee / global.get / global.set. tee is set that also keeps a copy on the stack top.

0x1A ~ 0x24
3 · memory
内存访问
Memory access

i32.load / i32.load8_s / i32.load8_u / ... / i32.store / i32.store8 / memory.size / memory.grow。每条 load/store 带 align + offset 立即数。

i32.load / i32.load8_s / i32.load8_u / … / i32.store / i32.store8 / memory.size / memory.grow. Each load/store carries align + offset immediates.

0x28 ~ 0x40
4 · const
常量
Constants

i32.const / i64.const / f32.const / f64.const。i32/i64 立即数用 signed LEB128;f32/f64 用原始 IEEE 754 字节。

i32.const / i64.const / f32.const / f64.const. i32/i64 immediates use signed LEB128; f32/f64 use raw IEEE bytes.

0x41 ~ 0x44
5 · numeric
算术 / 比较
Arithmetic / compare

i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / ... / f64.sqrt / f64.nearest。约 130 条,覆盖 IEEE 754 算术。

i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / … / f64.sqrt / f64.nearest. ~130 ops, full IEEE 754 coverage.

0x45 ~ 0xC4
6 · prefix
扩展指令
Extension prefixes

0xFC nn = 饱和转换 + bulk memory;0xFD nn = SIMD(~250 op);0xFE nn = atomics(threads);0xFB nn = GC。子 opcode 用 LEB128,所以是无界的。

0xFC nn = saturating convert + bulk memory; 0xFD nn = SIMD (~250 ops); 0xFE nn = atomics (threads); 0xFB nn = GC. Sub-opcode is LEB128, so unbounded.

0xFB ~ 0xFE

单条指令的拆解

Anatomy of a single instruction

i32.load offset=4 align=2 为例。它的字节序列:

Take i32.load offset=4 align=2. Its byte sequence:

28
opcode = i32.load栈上 pop 一个 i32 地址,push 一个 i32 加载值。pops one i32 address, pushes one i32 loaded value.
02
align = log₂(对齐) = 2 → 4 byte 对齐。这只是"提示",JIT 可据此选指令(unaligned 也 OK)。A hint; the JIT may pick a specific instruction (unaligned still works).
04
offset = 4(LEB128)。编译期常量,加到栈顶地址上。A compile-time constant added to the stack-top address.

三字节里隐藏的设计点:① opcode 是 1 字节,空间精确;② align 是 hint 不是约束——这让 wasm 能跑在 ARM(对齐)和 x86(自由对齐)上无差别;③ offset 是常量,Liftoff 可以 fold 进 [base + reg + 4] 这种寻址模式,免一条 add三字节里塞了三层信息

Three bytes hide three design points: ① opcode is one byte, slot-precise; ② align is a hint, not a constraint — letting wasm run on both ARM (aligned) and x86 (free) without changes; ③ offset is constant, so Liftoff folds it into [base + reg + 4] addressing — saving one add. Three bytes, three layers.

主线回引 · The Hot Loop 用到的 6 类 opcode

Main-line · the 6 opcode families in The Hot Loop

| Family  | Used                                               | Count | Example                  |
|---------|----------------------------------------------------|-------|--------------------------|
| control | block / loop / br_if                               | 4     | block 0x40               |
| param   | local.get / local.set                              | 14    | local.get 0x20 00        |
| memory  | i32.load8_u / i32.store8                           | 10    | i32.load8_u 0x2D 00 00   |
| const   | i32.const                                          | 5     | i32.const 0x41 09 (= 9)  |
| numeric | i32.add / i32.sub / i32.div_u / i32.mul / i32.ge_s | 16    | i32.add 0x6A             |
| prefix  | —                                                  | 0     | 本主线无 SIMD,Ch19 才会出现 0xFD / no SIMD; 0xFD appears in Ch19 |

49 条指令,49 字节(opcode 部分)+ 立即数(平均 1.2 字节)≈ 110 字节,加上 locals 声明 5 字节、function header 6 字节,凑成约 121 字节的 code section。再加上前面的 6 个 section header 和 export 段,合 192 字节。密度的来源在每一字节都看得见

49 ops × (1B opcode + ~1.2B imm avg) ≈ 110 bytes, plus 5B locals declaration + 6B function header ≈ 121 bytes of code section. Add six section headers and the export segment: 192 bytes. Density is visible in every byte.

关于 SIMD opcode 的特殊性ON THE SIMD OPCODE QUIRK SIMD 提案在 2021 年 phase 4 时,V8 实测发现 0xFD nn 的两字节 opcode 在 inner loop 里每次都要多 fetch 一字节,影响热路径性能。最终方案是 在 Liftoff 阶段把 SIMD 字节序展开成两字节但 TurboFan IR 里仍按一字节代理——这是 wasm spec 罕见的"实现影响 spec"的例子,SIMD 的子 opcode 数量被精心控制在 256 内,避免出现 3 字节 opcode 的可能。 When SIMD reached phase 4 in 2021, V8 measured that the two-byte 0xFD nn opcode forced an extra fetch on every inner-loop iteration. The fix: Liftoff decodes it as two bytes, but TurboFan IR proxies it as one — a rare instance of "implementation pressuring spec". SIMD sub-opcodes are deliberately capped at 256 to avoid the 3-byte opcode scenario.
STAGE 05 · ANATOMY

线性内存 — 一片连续的、可越界的、永不 GC 的字节

Linear memory — a flat, bounded, GC-free slab of bytes

64 KiB 一页,最大 4 GiB

64 KiB per page, 4 GiB max

Page size
64 KiB (2^16)
Max pages (wasm32)
65 536 → 4 GiB
Bounds check
trap on overrun
JS view
ArrayBuffer / SAB

线性内存是 wasm 最简洁的设计之一——它就是一片连续字节,从地址 0 开始,长度是 N 个 64 KiB 的 page。所有 i32.load / i32.store 都读写这片字节。没有指针类型,没有 GC,没有别的内存空间——堆、栈、静态数据全部混在这一片。这片字节在 JS 那边是一个 WebAssembly.Memory 对象,可以 new Uint8Array(mem.buffer) 直接看到原始字节。

Linear memory is one of wasm's most distilled designs — a flat slab of bytes starting at address 0, length = N × 64 KiB pages. Every i32.load / i32.store reads or writes this slab. No pointer type, no GC, no other memory space — heap, stack, static data all share the slab. From JS, this is a WebAssembly.Memory object; new Uint8Array(mem.buffer) lets you see the bytes directly.

FIG 10 · 4 GiB address space · wasm32 wasm32 linear memory · 4 GiB virtual layout The 4 GiB virtual address space wasm32 reserves: an active zone at the bottom, a small guard zone, then a vast PROT_NONE region trapping all overruns via SIGSEGV. VIRTUAL ADDRESS SPACE · 4 GiB · wasm32 ACTIVE GUARD · 32 MiB PROT_NONE unmapped · SIGSEGV on touch ~ 4 GiB virtual reservation 0x00000000 0x00400000 ← memory_size = 4 MiB 0x02400000 ← end of guard 0xFFFFFFFF ← 4 GiB cap (wasm32) WHY THIS LAYOUT 1 · ACTIVE

可读写的部分 = memory_size。memory.grow N 把它向上扩 N 个 64 KiB page——因为整片 4 GiB 早已虚拟预留,grow 通常只是翻转页权限,几乎免费——直到 4 GiB 上限。

2 · GUARD

32 MiB 缓冲带,允许 load/store 的 offset 立即数 ≤ 32 MiB 不必额外检查——任何越界访问会进入下面的 PROT_NONE。

3 · PROT_NONE

整片 4 GiB 虚拟空间被 mmap(..., PROT_NONE) 预留(不消耗物理内存)。任何 OOB 访问触发 SIGSEGV,V8 的 signal handler 把它翻译成 WebAssembly.RuntimeErrorinner loop 里完全没有显式 cmp/jcc——这是为什么 wasm 边界检查"免费"。

⚠ wasm32 cap

寻址用 i32,上限 4 GiB。memory64 提案(phase 3)把 wasm 升到 i64 寻址,但生态尚未完全迁移。

OOB 访问 (e.g. i32.load addr 0x05000000) CPU page fault OS SIGSEGV V8 handler JS RuntimeError

wasm32 的实际内存只有顶部那条窄绿条(用户的 N 个 64 KiB page),但浏览器引擎提前预留了整 4 GiB 虚拟地址空间——下面 99% 都是 PROT_NONE 陷阱区。越界访问由硬件 + signal handler 接住,JIT 出码里完全没有 cmp/jcc 边界检查指令——这就是 wasm 边界检查"免费"的真相。

wasm32's actual memory is just the thin green strip at top (user's N × 64 KiB pages), but the browser engine pre-reserves the entire 4 GiB virtual address space — 99% of it is PROT_NONE trap zone. Out-of-bounds is caught by hardware + signal handler; the JIT emits no cmp/jcc bounds-check instructions — the true source of wasm's "free" bounds checking.

"页" 的几何含义

The geometry of "pages"

为什么是 64 KiB 一页?这数字不是来自 OS 的 4 KiB / 16 KiB 内存页对齐;它来自 i32 寻址空间的切分:2^32 字节 ÷ 2^16 页 = 2^16 字节 = 64 KiB 一页。选 64 KiB 是要在"页粒度太细(grow 太频繁)"和"页太大(浪费)"之间找平衡点,设计者参考了 x86 的 large page 与 ARM 的 64K granule。

Why 64 KiB per page? Not from OS 4 KiB / 16 KiB alignment. It comes from i32 addressing space (4 GiB) ÷ 2^16 pages = 64 KiB per page. The choice balances "too fine-grained (grow too often)" against "too coarse (waste)", referring to x86's large page and ARM's 64K granule.
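上面的几何关系可以直接算一遍——下面是一个与任何引擎无关的纯 Rust 草图,grow 的返回值语义(失败返回 -1)按 spec 模拟:

The geometry above checks out in plain Rust — an engine-independent sketch, with grow's return semantics (−1 on failure) mimicking the spec:

```rust
const PAGE_SIZE: u64 = 64 * 1024;     // 2^16 bytes per page
const MAX_PAGES_WASM32: u64 = 65_536; // 2^16 pages under i32 addressing

/// memory.grow in miniature: returns the old size in pages,
/// or u64::MAX (the spec's -1) when the 4 GiB cap would be exceeded.
fn grow(current_pages: &mut u64, delta: u64) -> u64 {
    let next = *current_pages + delta;
    if next > MAX_PAGES_WASM32 {
        return u64::MAX; // grow fails, memory unchanged
    }
    let old = *current_pages;
    *current_pages = next;
    old
}

fn main() {
    // 2^16 pages × 2^16 bytes = 4 GiB total
    assert_eq!(PAGE_SIZE * MAX_PAGES_WASM32, 4 * 1024 * 1024 * 1024);
    // main-line setup: 1 page → 64 pages = 4 MiB
    let mut pages = 1;
    assert_eq!(grow(&mut pages, 63), 1);
    assert_eq!(pages * PAGE_SIZE, 4 * 1024 * 1024);
}
```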

边界检查 — wasm "safe" 的硬件实现

Bounds checks — the hardware behind wasm's "safe"

每一次 load/store 都必须保证地址在 [0, memory_size) 内,否则 trap。朴素实现:cmp addr, mem_size; jae .trap;——每个内存访问加两条指令,在 inner loop 里 5~10% 开销。

Every load/store must keep its address in [0, memory_size), else trap. Naïve: cmp addr, mem_size; jae .trap; — two extra instructions per access, 5–10% overhead in an inner loop.

现代引擎用一个聪明技巧:把 wasm 的整个 4 GiB 地址空间作为虚拟保留页映射,只把 [0, memory_size) 设为可读写,后面全部设为 PROT_NONE。任何越界访问会触发 SIGSEGV,引擎挂一个 signal handler 把它翻译成 wasm trap。结果是 inner loop 里完全没有显式边界检查,跑得跟 native 几乎一样快——只在 trap 时才慢。

Modern engines use a clever trick: reserve the full 4 GiB virtual address space, mark [0, memory_size) as RW, mark the rest as PROT_NONE. Any overrun raises SIGSEGV, caught by a signal handler that translates it into a wasm trap. The result: zero explicit bounds checks in the hot loop, near-native speed — only slow on actual trap.
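朴素检查到底算什么,可以用安全 Rust 写明白——guard-page 方案里这条 cmp/branch 消失,这里把它显式写出来(Trap 代表 WebAssembly.RuntimeError,纯属示意):

What the naïve check actually computes, spelled out in safe Rust — the guard-page scheme deletes this cmp/branch; here it is explicit (Trap stands in for WebAssembly.RuntimeError; illustrative only):

```rust
#[derive(Debug, PartialEq)]
struct Trap;

/// i32.load8_u semantics: effective address = addr (i32, from the stack)
/// + offset (immediate), checked against memory_size before the read.
fn load8_u(mem: &[u8], addr: u32, offset: u32) -> Result<u8, Trap> {
    // Compute in 64-bit space so addr + offset cannot wrap around.
    let effective = addr as u64 + offset as u64;
    if effective >= mem.len() as u64 {
        return Err(Trap); // out of bounds → trap
    }
    Ok(mem[effective as usize])
}

fn main() {
    let mem = vec![7u8; 64 * 1024]; // one 64 KiB page
    assert_eq!(load8_u(&mem, 100, 4), Ok(7));
    // first byte past the page: 65_535 + 1 == memory_size → trap
    assert_eq!(load8_u(&mem, 65_535, 1), Err(Trap));
}
```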

为什么 32 位 wasm 不能再扩WHY 32-BIT WASM CAPS AT 4 GiB 虚拟保留的"4 GiB 安全岛"方案前提是 64-bit host 进程的地址空间能腾出 4 GiB——这在 64-bit Linux/macOS/Windows 上没问题(总地址空间 256 TiB)。但 wasm32 自身只能 i32 寻址 4 GiB。要突破这个限制需要 memory64 提案(Ch22),把 wasm 升级到 i64 寻址、64 位 size_t。现在 V8、SM、JSC 都在 flag 后面提供 memory64 预览,但生态(LLVM 后端、Emscripten)尚未完全打通。 The "4 GiB safety island" trick assumes the 64-bit host has 4 GiB of virtual address space to spare — fine on 64-bit Linux/macOS/Windows (256 TiB address space). But wasm32 itself can only i32-address 4 GiB. Breaking past needs the memory64 proposal (Ch22), upgrading wasm to i64 addressing with 64-bit size_t. V8 / SM / JSC all preview memory64 behind flags, but the ecosystem (LLVM backend, Emscripten) hasn't finished plumbing it.

主线回引 · The Hot Loop 的内存 layout

Main-line · memory layout of The Hot Loop

; Linear memory after JS-side setup

0x000000 ┌─────────────────────────────────┐
         │ src image data (8 bpp grayscale)│  1920×1080 = 2 073 600 byte
0x1FA400 ├─────────────────────────────────┤
         │ padding ( 4 KiB align )         │
0x1FB000 ├─────────────────────────────────┤
         │ dst image data (output)         │  another 2 073 600 byte
0x3F5400 ├─────────────────────────────────┤
         │ unused                          │  ~ 43 KiB
0x400000 └─────────────────────────────────┘  64 pages (4 MiB)

JS 那边先 mem.grow(63) 把内存扩到 64 page = 4 MiB,然后用 Uint8ClampedArray 视图把 src 图像数据 copy 进去,调 instance.exports.blur3(0, 0x1FB000, 1920, 1080),wasm 函数对 src 做卷积写到 dst,JS 再 new Uint8ClampedArray(mem.buffer, 0x1FB000, len) 取出来显示。整个过程 没有把数据 copy 出 wasm 内存——只是不同 JS 视图共享同一片字节。这是 wasm/JS 协作的"zero copy"模式。

JS first mem.grow(63) to reach 64 pages = 4 MiB, then copies src image bytes in via Uint8ClampedArray, calls instance.exports.blur3(0, 0x1FB000, 1920, 1080), wasm convolves and writes dst, JS reads back via new Uint8ClampedArray(mem.buffer, 0x1FB000, len). The data never leaves wasm memory — different JS views share the same bytes. This is the wasm/JS "zero copy" pattern.
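blur3 在 wasm 里算什么,可以用一段安全 Rust 对照着看——下面用一个 byte slice 充当线性内存(src/dst 偏移、行主序 8 bpp 与上图一致;这是草图,不是主线的实际编译产物):

What blur3 computes, mirrored in safe Rust — a byte slice stands in for linear memory (src/dst offsets, row-major 8 bpp as in the layout above; a sketch, not the main-line's actual compiled output):

```rust
/// 3×3 box blur over linear memory: dst[y*w+x] = avg of the 9 neighbours.
fn blur3(mem: &mut [u8], src: usize, dst: usize, w: usize, h: usize) {
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            let mut sum: u32 = 0;
            // 9 byte loads, 8 adds — the 3×3 neighbourhood
            for dy in 0..3 {
                for dx in 0..3 {
                    sum += mem[src + (y + dy - 1) * w + (x + dx - 1)] as u32;
                }
            }
            // the constant division the JIT will later optimise
            mem[dst + y * w + x] = (sum / 9) as u8;
        }
    }
}

fn main() {
    let (w, h) = (4, 3);
    let mut mem = vec![9u8; 2 * w * h]; // src then dst, uniform grey 9
    blur3(&mut mem, 0, w * h, w, h);
    // a uniform image blurs to itself: interior of dst stays 9
    assert_eq!(mem[w * h + w + 1], 9);
}
```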

memory.grow 的代价

The cost of memory.grow

grow 是个昂贵指令——它可能触发 ArrayBuffer 重新分配(老 4 MiB 不够时申请新的 16 MiB,copy 整片字节)。grow 后所有 JS 视图 (TypedArray) 立刻被 detached,所有 wasm 那边持有的 base 指针会被引擎自动更新。这条约束让 grow 在 inner loop 里几乎是禁忌,通常只在 module 启动时或者明显边界(图像变大、文件加载)才调用。

grow is expensive — it can trigger a full ArrayBuffer realloc (allocating a fresh 16 MiB when 4 MiB runs short, then copying). After grow, all JS-side TypedArray views are detached immediately; wasm-side base pointers are auto-updated by the engine. The invariant makes grow nearly forbidden inside hot loops — typically called only at startup or coarse boundaries (image resize, file load).

STAGE 06 · ANATOMY

验证 — 一遍线性扫描换来的安全保证

Validation — safety in one linear pass

类型栈的抽象解释

abstract interpretation on a type stack

Time
O(n) per func
Space
stack depth
Parallelism
per-function
Spec §
3.3 Validation
MAIN-LINE · STOP 6 / 12 · VALIDATE hot.rs 此刻: 字节已 decode 出来,validator 沿指令序列一次 forward sweep,推演类型栈——证明类型不混、栈不溢、跳转目标在 frame 里。这是 Storyboard 第 6 格的算法层。Ch13 谈同一站的工程层。 hot.rs right now: bytes decoded; validator does one forward sweep, evolving the type stack — proving no type confusion, no underflow, all branch targets in-frame. This is the algorithm side of Storyboard cell 6. Ch13 covers the engineering side of the same stop.

"怎么证明这段二进制没有缓冲区溢出、没有未初始化变量、没有类型混乱?" Java 的解法是 bytecode verifier——一段几千行的 dataflow 分析。wasm 用了一招更猛的:把"类型栈"作为唯一的抽象状态,沿指令序列做一遍 forward sweep。算法只用一个数据结构(类型栈)、只走一遍(单遍 forward),时间复杂度 O(n)。

"How do you prove this binary has no buffer overflow, no uninitialised variable, no type confusion?" Java's answer is the bytecode verifier — a few thousand lines of dataflow analysis. Wasm went bolder: use a "type stack" as the only abstract state, then forward-sweep through the instruction sequence. One data structure (the type stack), one pass (forward only), O(n) time.

算法的骨架

The algorithm skeleton

维护两个东西:

Maintain two things:

1
类型栈 vstack
Value stack vstack
一个值类型的栈,模拟运行时栈上每个槽位的类型。不存值,只存类型
A stack of value types, simulating the runtime stack's type at each slot. No values, only types.
2
控制栈 cstack
Control stack cstack
每个 block / loop / if 入栈一个 frame,记录这个块的起始类型栈高度跳转目标类型br k 跳到 cstack 顶往下数第 k 个 frame。
Each block / loop / if pushes a frame storing the vstack height at entry and the branch target type. br k jumps to the k-th frame from the top.

遍历指令序列,每条指令做三件事:① pop 走它需要的输入类型(类型不对 → fail);② push 它产生的输出类型;③ 如果是控制指令,适当 push/pop cstack。函数末尾 vstack 必须正好等于函数返回类型——否则 fail。就这么简单。但这套机制证明了:任何通过验证的 wasm 不会类型混乱、不会栈溢出、不会未初始化访问。

For each instruction: ① pop the input types it expects (mismatch → fail); ② push the output types it produces; ③ if it's a control op, push/pop cstack accordingly. At function end, vstack must exactly equal the return type — else fail. That's it. Yet this proves any validated wasm cannot suffer type confusion, stack overflow, or uninitialised access.
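这个骨架小到可以直接写出来——下面是一个玩具级验证器(只覆盖走查用到的几条指令,省略了 cstack;命名是本文假设):

The skeleton is small enough to write out — a toy validator (only the handful of ops the walk-through needs; cstack omitted; names are this note's assumptions):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum ValType { I32, F32 }

#[derive(Clone, Copy)]
enum Op { LocalGetI32, I32Load8U, I32Add, I32Const, F32Const }

/// Pop one type, failing on underflow or mismatch.
fn pop(vstack: &mut Vec<ValType>, want: ValType) -> Result<(), String> {
    match vstack.pop() {
        Some(t) if t == want => Ok(()),
        Some(t) => Err(format!("expected {want:?}, got {t:?}")),
        None => Err("stack underflow".into()),
    }
}

/// One forward pass: ① pop inputs ② push outputs ③ compare end state.
fn validate(ops: &[Op], ret: &[ValType]) -> Result<(), String> {
    use Op::*;
    use ValType::*;
    let mut vstack: Vec<ValType> = Vec::new();
    for op in ops {
        match op {
            LocalGetI32 | I32Const => vstack.push(I32),
            F32Const => vstack.push(F32),
            I32Load8U => { pop(&mut vstack, I32)?; vstack.push(I32); } // pop addr, push byte
            I32Add => { pop(&mut vstack, I32)?; pop(&mut vstack, I32)?; vstack.push(I32); }
        }
    }
    // at function end, vstack must exactly equal the return type
    if vstack == ret { Ok(()) } else { Err(format!("leftover stack: {vstack:?}")) }
}

fn main() {
    use Op::*;
    use ValType::*;
    // the walk-through: get / load / get / load / add ⇒ [] → [i32]
    assert!(validate(&[LocalGetI32, I32Load8U, LocalGetI32, I32Load8U, I32Add], &[I32]).is_ok());
    assert!(validate(&[I32Const, I32Add], &[I32]).is_err());           // underflow
    assert!(validate(&[F32Const, I32Const, I32Add], &[I32]).is_err()); // type mismatch
    assert!(validate(&[I32Const], &[]).is_err());                      // leftover at end
}
```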

走一遍 · The Hot Loop 内层 4 行的验证过程

Walk-through · validating 4 inner lines

INSTR · 待验证
  1. local.get $src
  2. i32.load8_u offset=0
  3. local.get $src
  4. i32.load8_u offset=1
  5. i32.add ; ← 检查这里
类型栈 · 走到第 5 条之前vstack · before op 5
i32 ← top (loaded byte b)
i32 ← loaded byte a
即将执行about to run: i32.add
需求:pop 两个 i32,push 一个 i32。 need: pop two i32, push one i32.
vstack 顶部正好是 [i32, i32] → ✓。 vstack top is [i32, i32] → ✓.
下一步next: vstack = [i32]
FIG 11 · type-stack walk · auto-loops every 7 s 类型栈 abstract interpretation 一步一步 A 7-step animation showing how the wasm validator's type stack evolves through a sequence: two local.get, two i32.load8_u, an i32.add, an i32.const, another i32.add, and a local.set. VALIDATION · vstack STEP-BY-STEP INSTRUCTIONS 1. local.get $a 2. i32.load8_u offset=0 3. local.get $a 4. i32.load8_u offset=1 5. i32.add 6. i32.const 9 7. i32.div_u vstack · type-only [empty] stack floor i32 + push i32 pop addr push byte i32 i32 + push i32 i32 pop addr push byte i32 pop i32 ✓ pop i32 ✓ push sum i32 (const 9) i32 (sum) + push i32 (avg) pop ÷ ✓ push quotient PER STEP · 3 CHECKS ① pop expected types ② if mismatch → reject ③ push produced types END OF FUNCTION vstack must equal function return type, or reject. COMPLEXITY O(n) · single forward pass across function body. No backward propagation.

左侧指令逐条扫过,右侧类型栈跟着推演:load → load → add 让 vstack 在 [i32]→[i32,i32]→[i32] 之间走;const → div 让它再次先升后降。这套 abstract interpretation 用一个数据结构(类型栈) + 一遍 forward scan 证明类型安全。动画每 7 秒自动循环

Left: instructions stream in one by one; right: the type stack evolves in lockstep — load → load → add moves vstack through [i32] → [i32,i32] → [i32]; const → div pushes and pops again. This abstract interpretation proves type safety using one data structure (the type stack) + one forward pass. Animation auto-loops every 7 s.

三种类型错误的具体抓法

Three concrete type errors caught

| Case | WAT | Why rejected |
|------|-----|--------------|
| 栈不够 underflow | i32.const 1; i32.add | i32.add 要 pop 两个,vstack 只有一个 → fail / i32.add expects two pops, vstack has one → fail |
| 类型不对 type mismatch | f32.const 1.5; i32.const 2; i32.add | i32.add 要 [i32, i32],拿到 [f32, i32] → fail / i32.add wants [i32, i32], got [f32, i32] → fail |
| 函数尾未消栈 leftover at end | i32.const 1; end (无 return) | 函数返回 (),vstack 末态须空 → fail / function returns (), vstack must be empty at end → fail |

"多态" 之处:unreachable 之后

The "polymorphic" spot: after unreachable

一个微妙的细节:验证遇到 br(无条件跳转)或 returnunreachable 后,后续直到下一个 end 的指令都无法执行到。但代码还在那儿——验证器要怎么处理?答案:把 vstack 标记为 polymorphic stack(假栈),后续 pop 操作都不真的检查,push 也接受任何类型。等遇到 end 或者 else 时再恢复真实栈状态。这一招让验证器即使对"死代码"也能 O(n) 走完

A subtle detail: after br (unconditional), return, or unreachable, instructions up to the next end are unreachable. But the bytes are still there — what should the validator do? Answer: mark vstack as polymorphic stack — subsequent pops are not really checked, pushes accept any type. Real state restored at the next end or else. This keeps the validator O(n) even through dead code.

为什么单遍 forward 就够WHY ONE FORWARD PASS SUFFICES Java verifier 要做 dataflow,因为 JVM 字节码有非结构化的 goto,会形成不规则的 CFG。wasm 用 结构化控制流(block / loop / ifbr k)从根本上禁止了不规则跳转——每个跳转目标都是当前 cstack 上的某个 frame,目标类型在 frame 创建时就钉死。这让不需要反向传播分析。"结构化控制"是 wasm 验证可以一遍完成的根本前提,也是为什么没有 goto The Java verifier needs dataflow because JVM bytecode allows non-structured goto, producing irregular CFGs. Wasm's structured control (block / loop / if + br k) bans irregular jumps at the root — every branch target is a frame on the current cstack with its target type fixed at frame creation. No backward propagation needed. "Structured control" is why wasm validation finishes in one pass — and why there's no goto.

并行验证

Parallel validation

单遍验证 + 函数互相独立(只引用 Type / Function / Memory / Table 等"全局"section,这些先解析完)= 函数级并行。V8 的实现给 N 个函数开 min(N, CPU 核数) 个验证 worker,每个 worker 拿一个函数独立验。Photoshop 那种 30 万函数的 wasm 模块,在 8 核机器上 ~500 ms 就能验完——这是为什么 wasm 启动比想象中快

Single-pass validation + function independence (functions reference only the "global" sections — Type / Function / Memory / Table — already parsed) = function-level parallelism. V8 spawns min(N, num_cpus) workers and each takes one function. Photoshop's 300 K-function wasm validates in ~500 ms on an 8-core box — which is why wasm startup is faster than people expect.

验证不是检查代码,
是把代码读成一个可证明的形状。 Field Note · 03
Validation isn't checking code.
It is reading code into a provable shape. Field Note · 03
SPEC
https://webassembly.github.io/spec/core/valid/
SOURCE
v8/src/wasm/function-body-decoder-impl.h :: DecodeFunctionBody
PAPER
Haas et al., "Bringing the Web up to Speed with WebAssembly", PLDI 2017
INPUT
已解码的函数体Decoded function body指令序列 + localsinstruction sequence + locals
OUTPUT
类型安全证明Type-safety proofJIT 后面可以放心生成机器码JIT can safely emit machine code
ACT IV · COMPILATION PIPELINE

从字节到机器码。

From bytes to machine code.

从这一段起,字节离开磁盘,进入引擎。我们追着主线 Hot Loop 走过 6 道工序:流式 decode 用 LEB128 把字节翻成 Module 数据结构;Validate 在函数级并行里把类型证明做完;Tier-0 Liftoff 单遍出机器码,启动 0 等待;Tier-1 TurboFan 在后台把热函数重编译到接近 native;然后实例化分配 memory、填 table、跑 start;最后 JS 跟 wasm 之间的 trampoline 把调用边界缝起来。这 6 章是整个文章最"引擎"的部分。

From here, the bytes leave disk and enter the engine. We follow the main-line Hot Loop through six stages: streaming decode turns LEB128 bytes into a Module; validate proves type safety with function-level parallelism; Tier-0 Liftoff emits machine code in one pass, zero startup wait; Tier-1 TurboFan re-compiles hot functions to near-native in the background; then instantiation allocates memory, fills tables, runs start; finally trampolines stitch the JS ↔ wasm call boundary. The most "engine" part of the article.

STAGE 07 · PIPELINE

Decode — 边下载边解码的字节流

Decode — bytes parsed as they arrive

streaming compilation

streaming compilation

Process
Browser → Renderer
Thread
Network + Wasm IO
API
compileStreaming
Latency
~ 1ms / KB
MAIN-LINE · STOP 5 / 12 · FETCH + DECODE hot.rs 此刻: 192 字节正从 CDN 流到浏览器,decoder 边收边解——每拿到一个 section 完整字节就推给 validator + Liftoff,不等整个文件下完。这是 Storyboard 第 5 格,wasm 启动比想象快的根源。 hot.rs right now: 192 bytes streaming from CDN into the browser; the decoder parses as bytes arrive — each completed section is forwarded to the validator + Liftoff without waiting for the full file. Storyboard cell 5, the secret behind wasm's startup speed.
这一段在做什么
What it does
把字节流解成 Module 数据结构。section by section,LEB128 by LEB128。每解出一个函数体就把它推给 validator + Liftoff——不等整个文件下完 Turn the byte stream into a Module struct. Section by section, LEB128 by LEB128. Each function body is handed to validator + Liftoff as soon as it's parsed — without waiting for the whole file.
为什么重要
Why it matters
Photoshop wasm 模块 70 MB。等下完再编译要 8 秒,边下边编只要 2.5 秒——5.5 秒"白送"的延迟优化,这就是 streaming compilation 的全部价值。 Photoshop's wasm is 70 MB. Compile-after-download takes 8 s; compile-while-downloading takes 2.5 s — 5.5 s "free" latency reduction. The whole point of streaming compilation.
关键 API
Key API
WebAssembly.compileStreaming(fetch('hot.wasm')) · WebAssembly.instantiateStreaming(...)

"下载完再编译"是 2017 年 MVP 时的默认行为。2018 年起 V8 / SpiderMonkey 都实现了 streaming compile——浏览器 fetch 第一个 chunk 进来就交给 wasm decoder,decoder 拿到一个 section 完整字节就解析,拿到一个函数体完整字节就交给 Liftoff。下载和编译完全并行,这一招的本质是把 wasm 当成"边下边播"的视频流

"Download then compile" was the default in the 2017 MVP. From 2018, V8 and SpiderMonkey both shipped streaming compile — the browser hands each fetched chunk to the wasm decoder, which parses each complete section and forwards each complete function body to Liftoff. Download and compile run fully in parallel; in essence, treat wasm like a "stream-while-watching" video.

JS 那端的两种调用方式

Two JS-side call patterns

// 慢路径(non-streaming):先 ArrayBuffer 再 compile
const buf = await fetch('hot.wasm').then(r => r.arrayBuffer());
const mod = await WebAssembly.compile(buf);  // 等下完才开始

// 快路径(streaming):fetch 进来一段就开始编
const mod = await WebAssembly.compileStreaming(fetch('hot.wasm'));

compileStreaming 需要 server 回 Content-Type: application/wasm——否则 Promise 会直接 reject(TypeError),需要自己 fallback 到上面的 ArrayBuffer 慢路径。这是常见踩坑(把 .wasm 当 .bin 上 CDN 时 MIME 不对)。

compileStreaming requires the server to return Content-Type: application/wasm — otherwise the promise rejects with a TypeError and you must fall back to the ArrayBuffer slow path yourself. A common pitfall when serving .wasm as .bin from a CDN.

decoder 的状态机

Decoder state machine

// V8 ModuleDecoder 简化状态机

[kPreamble]      ; expecting magic + version (8 bytes)
       ├─▶ kSectionHeader   ; expecting id byte + LEB128 length
              ├─▶ kTypeSection     ; 0x01 · vec[Type]
              ├─▶ kImportSection   ; 0x02 · vec[Import]
              ├─▶ kFunctionSection ; 0x03 · vec[u32 type-idx]
              ├─▶ kTableSection    ; 0x04
              ├─▶ kMemorySection   ; 0x05
              ├─▶ kGlobalSection   ; 0x06
              ├─▶ kExportSection   ; 0x07
              ├─▶ kStartSection    ; 0x08
              ├─▶ kElementSection  ; 0x09
              ├─▶ kCodeSection     ; 0x0a · 进入 per-function loop
              │       └─▶ for each function:
              │             1. parse body bytes
              │             2. enqueue to validator worker
              │             3. enqueue to Liftoff worker
              ├─▶ kDataSection     ; 0x0b
              └─▶ kCustomSection   ; 0x00 · name / dwarf / vendor

每个状态对应一个 section 的解析函数,内部都是同一种结构:先读 LEB128 数量,再 for 循环依次解析每个 entry。这种规则化让 decoder 简单到可以单文件 (module-decoder.cc) 几千行写完。

Each state maps to a parsing function for one section, all sharing the same shape: read LEB128 count, then for-loop entries. The regularity keeps the decoder small — one file (module-decoder.cc), a few thousand lines.

流式 + 并行的时间线

Streaming + parallel timeline

▸ Performance · WebAssembly streaming compile · hot.wasm 192 byte / 1 func · main-line
Network
fetch · 192 byte
idle
Wasm IO
decode
validate
done
Wasm Compile
wait
Liftoff
TurboFan
tier-1 done
Main
JS still loading
resolve
instance.exports.blur3 (Liftoff)
(TurboFan)
三件事: ① Liftoff 在 fetch 还没完时就启动——这是 streaming 的本质;② TurboFan 在 Liftoff 完成后慢慢热;③ JS Promise 在 Liftoff 完毕的瞬间 resolve,所以用户感受到的延迟是 Liftoff 完成时间,不是 TurboFan 完成时间。这就是为什么 Liftoff 必须快——它决定 TTI(Time To Interactive)。 Three things: ① Liftoff starts before fetch even finishes — that's the essence of streaming; ② TurboFan warms up later in the background; ③ the JS Promise resolves the instant Liftoff finishes, so user-perceived latency is Liftoff time, not TurboFan time. That's why Liftoff has to be fast — it determines TTI (Time To Interactive).

code section 的"函数体偏移表"

The "function-body offset table" trick

code section 的格式有一个细节让流式编译变得可行:每个函数体前面都有一个 LEB128 长度。这让 decoder 不必先扫一遍找边界,可以直接 fread 长度 → fread 函数体 → 入队 → fread 下一段长度。"self-describing 长度前缀" 是 wasm 设计里反复出现的母题——module 长度、section 长度、函数体长度、import name 长度,全是 LEB128 前缀。

A detail in the code section makes streaming feasible: each function body is prefixed by a LEB128 length. The decoder doesn't need a pre-scan — just read length → read body → enqueue → next length. "Self-describing length prefix" is a recurring motif — module length, section length, body length, import-name length, all LEB128-prefixed.
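这个"长度前缀"母题可以原样跑起来——下面是一个极简草图:读一个 ULEB128 长度,切出这段字节,重复;这正是流式 decoder 从 code section 里切函数体所需的全部:

The length-prefix motif runs as-is — a minimal sketch: read a ULEB128 length, slice that many bytes, repeat; which is all a streaming decoder needs to cut function bodies out of the code section:

```rust
/// Read an unsigned LEB128 at *pos, advancing pos past it.
fn read_uleb128(bytes: &[u8], pos: &mut usize) -> u32 {
    let (mut val, mut shift) = (0u32, 0u32);
    loop {
        let b = bytes[*pos];
        *pos += 1;
        val |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            return val;
        }
        shift += 7;
    }
}

/// Split a run of length-prefixed bodies into slices — no pre-scan needed.
fn split_bodies(bytes: &[u8]) -> Vec<&[u8]> {
    let (mut pos, mut out) = (0usize, Vec::new());
    while pos < bytes.len() {
        let len = read_uleb128(bytes, &mut pos) as usize;
        out.push(&bytes[pos..pos + len]); // body complete → enqueue to a worker
        pos += len;
    }
    out
}

fn main() {
    // 192 encodes as two LEB128 bytes: 0xC0 0x01 (0x40 | 0x80, then 1)
    let mut pos = 0;
    assert_eq!(read_uleb128(&[0xC0, 0x01], &mut pos), 192);
    // two bodies: len=2 [AA BB], then len=1 [CC]
    let bodies = split_bodies(&[0x02, 0xAA, 0xBB, 0x01, 0xCC]);
    assert_eq!(bodies, vec![&[0xAA_u8, 0xBB][..], &[0xCC][..]]);
}
```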

STAGE 08 · PIPELINE

Validate — 函数级并行的形状证明

Validate — function-level parallelism, proven by shape

N 个函数,N 个 worker

N functions, N workers

Algorithm
single-pass · type-stack
Concurrency
per-function
Failure
CompileError trap
Source
function-body-decoder.cc
MAIN-LINE · STOP 6 / 12 · VALIDATE (eng.) hot.rs 此刻: 同一站,换一个视角——这里讲 V8 怎么把 N 个函数的验证分发到 N 个 worker。函数体之间没有引用依赖(只引用 type/function/memory/global 等模块级声明,这些 section 早已解析完),所以可以无锁并行。Photoshop 30 万函数,8 核机器 500 ms 验完。 hot.rs right now: same stop, different lens — here we look at how V8 fans out N function validations across N workers. Function bodies reference only module-level declarations (type/function/memory/global, all parsed earlier), so workers run lock-free. Photoshop's 300 K functions validate in ~500 ms on 8 cores.

Ch11 已经讲了验证算法本身。这一章谈工程实现:V8 怎么把 N 个函数的验证拆到 N 个 worker 上,什么时候 fail-fast,什么时候 graceful。

Ch11 covered the algorithm itself. This chapter is about engineering: how V8 spreads N functions across N workers, when to fail-fast, when to be graceful.

为什么函数级并行可行

Why function-level parallel works

关键前提是函数体只引用模块级声明(type / function / table / memory / global / element),这些都在 code section 之前的 section 里解析完了。函数 A 验证时不需要看函数 B——它最多通过 call 引用 B 的签名(已知)。所以 N 个 worker 可以独立验证 N 个函数,彼此不通信。

The key invariant: function bodies reference only module-level declarations (type / function / table / memory / global / element), all parsed before the code section. Function A's validator doesn't need to look at function B — at most it sees B's signature via call (already known). So N workers validate N functions independently, no inter-thread comms.

| 引擎 Engine | Workers | Strategy |
|-------------|---------|----------|
| V8 | min(N, num_logical_cpus) | each thread pulls from one queue |
| SpiderMonkey | helper threads (configurable) | tile-based, 64 KB per tile |
| JavaScriptCore | WTF::WorkerPool | per-function, with size-aware scheduling |
| Wasmtime | rayon parallel iterator | per-function |

早失败 vs 晚失败

Fail-fast vs fail-late

如果第 3 个函数验证失败,后面 1000 个函数还要不要继续验证?V8 选择继续——所有 worker 把活做完,最后聚合错误。这听起来浪费,但因为 worker 是并行的,继续做不会延后失败时间;反而提前 abort 需要协调(kill 其他 worker),代码复杂度反而高。"并行算法里 fail-fast 不一定快"是 V8 设计里反复出现的取舍。

If function 3 fails, do the remaining 1000 keep validating? V8 says yes — let all workers finish, aggregate errors at the end. Sounds wasteful, but because workers run in parallel, continuing doesn't delay failure; aborting would need coordination (kill other workers), with higher code complexity. "Fail-fast isn't necessarily fast in a parallel pipeline" — a trade-off V8 makes repeatedly.

主线回引 · The Hot Loop 的验证时间

Main-line · Hot Loop validation time

主线只有 1 个函数,所以"并行"在这里退化成单 worker。49 条指令,vstack 最深 4 槽,cstack 最深 2 frame,在 M1 Pro 上验证耗时 ~ 6 µs。这一数字给你一个数量级感受:验证比解码还快,因为验证不分配内存(用栈上小固定容量数组就够了)。

The main-line has 1 function, so "parallel" degenerates to single worker. 49 ops, vstack max depth 4, cstack max depth 2 — on M1 Pro, validation takes ~ 6 µs. The order of magnitude: validation is faster than decoding, because it allocates nothing — a small fixed-cap stack array suffices.

STAGE 09 · PIPELINE

Liftoff — 单遍出码的"不优化但快"基线 JIT

Liftoff — the "unoptimised but instant" baseline JIT

10 MB/s · 0 IR · 0 register alloc

10 MB/s · 0 IR · 0 register alloc

Process
Renderer
Thread
CompileTask
IR
none
Speed
~10 MB / s
MAIN-LINE · STOP 7 / 12 · LIFTOFF CODEGEN hot.rs 此刻: validator 签字放行,Liftoff 单遍把 wasm 字节翻成 x86-64——没有 IR、没有寄存器分配、所有 local 一律 spill 到栈槽。240 字节 x86 出码,耗时 200 µs。不为速度,为启动时间。函数末尾 tier-up 计数器已经埋好。下一站:第一次执行。 hot.rs right now: validator signed it off; Liftoff makes one pass turning wasm bytes into x86-64 — no IR, no register allocation, every local spilled to a stack slot. 240 bytes of x86 emitted in 200 µs. Not for speed — for startup. The tier-up counter is already wired into the function epilogue. Next stop: first execution.
这一段在做什么
What it does
扫一遍 wasm 字节,边扫边直接出 x86-64 / ARM64 机器码。没有中间 IR,没有寄存器分配,没有优化。所有 wasm 栈位置都对应栈上的固定偏移槽位。 Scan the wasm bytes once and emit x86-64 / ARM64 machine code directly. No IR, no register allocation, no optimisation. Every wasm stack slot maps to a fixed offset on the native stack.
为什么存在
Why it exists
2018 之前,V8 直接走 TurboFan——一个 70 MB 模块要 8 秒才能开跑。Liftoff 是为了把这个数字压到 1~2 秒:出的码跑起来比 TurboFan 慢 4×,但出码本身快 10×。不为速度,为启动时间。 Pre-2018, V8 went straight to TurboFan — a 70 MB module took 8 s to start. Liftoff aims to compress that to 1–2 s: the emitted code runs 4× slower than TurboFan's, but emission itself is 10× faster. Not for speed — for startup.
关键代码
Key code
v8/src/wasm/baseline/liftoff-compiler.cc :: VisitOpcode · liftoff-assembler-{x64,arm64}.cc

Liftoff 的两个"狠"选择

Two ruthless choices Liftoff makes

不做寄存器分配,全 spill 到栈
No register allocation; everything spills
遇到 wasm local.getmov reg, [rbp-N],N 是该 local 的固定偏移;遇到 local.setmov [rbp-N], reg。栈顶值用瞬时寄存器 rax/rbx 之类即可。不优化中间结果留寄存器——出来的码"很啰嗦但确定"。
For each wasm local.get: mov reg, [rbp-N], where N is that local's fixed offset. For local.set: mov [rbp-N], reg. Stack-top values land in ad-hoc registers like rax/rbx. No effort to keep intermediates in regs — the code is "verbose but deterministic".
栈状态用 LiftoffStackState 跟踪
Track stack via LiftoffStackState
不是真正的 SSA,而是一个小数组,记录"当前 wasm 栈顶第 i 个值在哪个 native reg / native stack slot 上"。每条指令出码时只查 / 改这张表。这套机制让 Liftoff 用~1500 行代码实现了一个完整的 baseline JIT(对比 TurboFan 的 40,000 行)。
Not real SSA — a small array recording "where the i-th wasm stack value currently lives: in native reg X or stack slot Y". Each opcode just reads/updates this table when emitting code. ~1500 LOC builds a complete baseline JIT (vs ~40 000 LOC for TurboFan).

主线 · Liftoff 对 The Hot Loop 出的码(完整版)

Main-line · Liftoff's output for The Hot Loop (full)

; blur3 -- Liftoff codegen (x86-64, simplified)
0x000000 push   rbp
0x000001 mov    rbp, rsp
0x000004 sub    rsp, 0x40                ; 8 stack slots
0x000008 mov    [rbp-0x08], rdi          ; spill $src
0x00000c mov    [rbp-0x10], rsi          ; spill $dst
0x000010 mov    [rbp-0x18], edx          ; spill $w
0x000014 mov    [rbp-0x1c], ecx          ; spill $h

; outer loop: y = 1
0x000018 mov    dword ptr [rbp-0x20], 1  ; $y = 1
0x000020 mov    eax, [rbp-0x1c]
0x000024 dec    eax                      ; eax = h - 1
0x000026 cmp    [rbp-0x20], eax
0x000029 jge    .end_y

.loop_y:
; inner loop: x = 1
0x00002b mov    dword ptr [rbp-0x24], 1  ; $x = 1
0x000033 mov    eax, [rbp-0x18]
0x000037 dec    eax                      ; eax = w - 1
0x000039 cmp    [rbp-0x24], eax
0x00003c jge    .end_x

.loop_x:
; sum = 0
0x00003e mov    dword ptr [rbp-0x28], 0

; 9 byte loads, 8 adds — Liftoff emits each one
0x000046 mov    rax, [rbp-0x08]          ; $src
0x00004a movzx  edx, byte ptr [r15+rax]  ; i32.load8_u offset=0 (r15 = mem base)
0x00004e add    [rbp-0x28], edx          ; sum += byte

0x000051 mov    rax, [rbp-0x08]          ; reload $src ← spill cost
0x000055 movzx  edx, byte ptr [r15+rax+1]
0x00005a add    [rbp-0x28], edx
...
; (similar 7 more times — Liftoff makes no attempt to hoist $src)
...

; sum / 9 — Liftoff does NOT do magic-number multiplication
0x0000c0 mov    eax, [rbp-0x28]
0x0000c3 xor    edx, edx
0x0000c5 mov    ecx, 9
0x0000ca div    ecx                      ; expensive! ~25 cycles

; store dst[y*w + x]
0x0000cc ...
0x0000e0 mov    byte ptr [r15+rbx], al

; x++; loop_x
0x0000e4 inc    dword ptr [rbp-0x24]
0x0000e8 jmp    .loop_x
...
.end_x:
.end_y:
0x000130 leave
0x000131 ret

这段~ 240 字节的 x86-64 代码就是 Liftoff 对 hot.wasm 的输出。三个观察:① $src 在每次 load 前都重新[rbp-0x08] 加载——Liftoff 不知道也不分析 "这个值我刚加载过";② sum / 9 用了真 div 指令,~25 cycle;③ 函数体没有 SIMD 化。但出码时间在 200 µs 量级——这正是它要的

~240 bytes of x86-64 is Liftoff's output for hot.wasm. Three notes: ① $src is reloaded from [rbp-0x08] before every load — Liftoff doesn't know it just loaded this; ② sum / 9 uses a real div, ~25 cycles; ③ no SIMD. But codegen time is ~200 µs — exactly the target.

Tier-up 触发机制

The tier-up trigger

每个 Liftoff 函数入口都塞一个计数器:

Every Liftoff function prologue carries a counter:

add    dword ptr [r13+0x40], 1           ; bump the per-function call counter
cmp    dword ptr [r13+0x40], 0x100       ; tier-up threshold = 256 calls
jne    .skip                             ; below threshold → keep running Liftoff code
call   WasmCompileLazy                   ; → schedule TurboFan recompile
.skip:

3 条指令的常驻开销:每次进入函数 1 次 add、1 次 cmp、1 次 jne(不跳转)。达到阈值时调用 WasmCompileLazy,把这个函数入队到 TurboFan 后台 worker——不阻塞当前调用,Liftoff 版继续跑。后台 worker 编完后,引擎用一个 atomic store 把函数地址表里的入口换成 TurboFan 版,下次调用就走 TurboFan。

Three instructions of resident overhead: one add, one cmp, one jne (not taken) per entry. At the threshold, WasmCompileLazy enqueues the function for a background TurboFan worker — it does not block the current call; the Liftoff version keeps running. Once the worker finishes, an atomic store swaps the function-table entry to point at the TurboFan version; the next call goes to TurboFan.
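上面的机制可以压成一个可运行的模型:一个调用计数器 + 一次原子替换。阈值 256 沿用正文;WasmCompileLazy 在这里由一个"瞬间完成的后台编译"扮演——只是示意,不是 V8 的实现。

The trigger above compresses into a runnable model: one call counter plus one atomic swap. The 256 threshold follows the text; WasmCompileLazy is played by an instantly-finished "background compile" — a sketch, not V8's implementation.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

// Model of tier-up dispatch. V8 patches a function-table entry with an
// atomic store; here an AtomicBool chooses between two bodies that
// compute the same thing (x / 9, by real div vs magic-number multiply).
pub struct TieredFn {
    calls: AtomicU32,
    optimized: AtomicBool, // stands in for the patched table entry
}

impl TieredFn {
    pub fn new() -> Self {
        TieredFn { calls: AtomicU32::new(0), optimized: AtomicBool::new(false) }
    }

    pub fn call(&self, x: u32) -> u32 {
        // the counter bump + threshold check in the Liftoff prologue
        if self.calls.fetch_add(1, Ordering::Relaxed) + 1 == 256 {
            // WasmCompileLazy stand-in: the "background compile" finishes
            // instantly and installs the optimized body
            self.optimized.store(true, Ordering::Release);
        }
        if self.optimized.load(Ordering::Acquire) {
            ((x as u64 * 0x1C71C71D) >> 32) as u32 // TurboFan-style ÷9
        } else {
            x / 9 // Liftoff-style real division
        }
    }

    pub fn is_optimized(&self) -> bool {
        self.optimized.load(Ordering::Acquire)
    }
}
```

两个函数体刻意都算 x / 9:慢路径用真除法,快路径用后文的魔数乘法——tier-up 换的是实现,不是语义。

Both bodies deliberately compute x / 9 — real division on the slow path, the magic-number multiply on the fast path: tier-up swaps the implementation, never the semantics.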

Liftoff 之前 V8 做了什么WHAT V8 DID BEFORE LIFTOFF 2017 MVP 上线时 V8 没有 Liftoff,所有 wasm 函数直接走 TurboFan。Mozilla 那时已经有 SpiderMonkey 的 BaselineCompiler,启动比 V8 快很多。Liftoff 是 Clemens Backes 在 2018 年的工程项目,设计灵感来自 BaselineCompiler 但更彻底——后者还有简单的寄存器跟踪,Liftoff 干脆全 spill。Liftoff 之后,V8 的 wasm 启动延迟下降了 80%,这是 wasm 在浏览器侧的"第二次起飞"。 In the 2017 MVP, V8 had no Liftoff — every wasm function went straight to TurboFan. SpiderMonkey already had its BaselineCompiler and started much faster. Liftoff was Clemens Backes's 2018 project, inspired by BaselineCompiler but more ruthless — the latter still did light register tracking; Liftoff spills everything. Post-Liftoff, V8's wasm startup latency dropped 80% — wasm's "second takeoff" in the browser.
STAGE 10 · PIPELINE

TurboFan — sea-of-nodes 把它优化成接近 native

TurboFan — sea-of-nodes lifts it to near-native

2 ms 出码,80% of native

2 ms emit, 80% of native

Process
Renderer
Thread
TF Worker
IR
sea-of-nodes
Speed target
~80% of native
MAIN-LINE · STOP 10 / 12 · TURBOFAN hot.rs 此刻: 第 256 次调用触发 tier-up,后台 worker 把它送进 TurboFan——sea-of-nodes IR + LoadElimination + SimplifiedLowering + Schedule + RegAlloc 五步流水线。Liftoff 的 9 次 local.get $src 被合并成 1 次寄存器读;sum / 9 被识别为常量除,替换成魔数乘法 0x1c71c71d。180 字节 x86,3.8 ms/帧。下一站:atomic 安装。 hot.rs right now: the 256th call triggered tier-up; a background worker pushes it through TurboFan's sea-of-nodes + LoadElimination + SimplifiedLowering + Schedule + RegAlloc pipeline. Liftoff's nine local.get $srcs collapse to one register read; sum / 9 is recognised as div-by-constant and rewritten as magic-number mul 0x1c71c71d. 180 bytes of x86, 3.8 ms/frame. Next stop: atomic install.

TurboFan 原本是 V8 的 JavaScript 优化编译器。2017 年起它兼任 wasm 的优化编译器——但 wasm 那一面用的 pipeline 跟 JS 那边完全不一样。JS 那边要处理 SMI 标记、IC 反馈、deopt 边界;wasm 这边类型钉死,没有反馈,没有 deopt。所以 wasm TurboFan 是个"静态优化器",更接近 LLVM 的工作流。

TurboFan was originally V8's JS optimising compiler. From 2017 it doubles as wasm's optimiser — but the wasm pipeline differs entirely from JS. JS-side juggles SMI tags, IC feedback, deopt edges; wasm-side has fixed types, no feedback, no deopt. So wasm-TurboFan is a "static optimiser", much closer to LLVM's workflow.

从字节到机器码的 6 步

Six steps from bytes to machine code

1
Graph build · 把字节流变 IR 图
Graph build · bytes → IR graph
把每条 wasm op 转成一个或多个 IR 节点。所有依赖关系作为边。不显式记录顺序——这是 sea-of-nodes 的特征。
Each wasm op becomes one or more IR nodes; dependencies are edges. No explicit ordering recorded — that's the sea-of-nodes hallmark.
2
Inline · 把小函数展平
Inline · flatten small functions
wasm 没 IC,inlining 决策完全静态。主线 Hot Loop 没有 call,这一步空跑。
Wasm has no IC, so inlining is purely static. The main-line has no call; this step is a no-op.
3
LoadElimination · 公共子表达式
LoadElimination · CSE
"同一个内存地址刚刚加载过" → 复用值,不再 emit load。主线里 $src 的 9 次 local.get 被压成 1 次寄存器持有。
"This address was just loaded" → reuse the value, don't emit another load. In the main-line, the 9 local.get $srcs collapse into one held register.
4
SimplifiedLowering · 把 IR 降到机器层
SimplifiedLowering · IR → machine level
把"i32.add" 这种高阶节点替换成 "x64 ADD"。sum / 9 在这一步被识别为常量除,替换成 magic-number multiplication。
Lower high-level nodes like i32.add to x64 ADD. sum / 9 here is recognised as div-by-constant and rewritten as magic-number mul.
5
Schedule · 决定指令顺序
Schedule · order the nodes
sea-of-nodes 是无序的,这一步把节点排成线性序列,服从依赖关系。这是 sea-of-nodes 的独特步骤——传统 CFG 编译器不需要。
Sea-of-nodes is unordered; here nodes are arranged into a linear sequence respecting dependencies. This is sea-of-nodes' unique step — traditional CFG compilers skip it.
6
RegAlloc + Emit
RegAlloc + Emit
寄存器分配(线性扫描)+ 出码。TurboFan 的 RegAlloc 是 SSA 上的线性扫描变种,比 LLVM 的 greedy 简单但已足够好。
Register allocation (linear scan) + emission. TurboFan's RegAlloc is a linear-scan variant on SSA — simpler than LLVM's greedy, plenty good enough.

主线 · TurboFan 出码 vs Liftoff 出码

Main-line · TurboFan output vs Liftoff output

Metric · Liftoff · TurboFan · Ratio
编译耗时 Compile time · 200 µs · 2.1 ms · 10×
出码字节 Code bytes · 240 B · 180 B · 0.75×
运行耗时(1080p)Runtime (1080p frame) · 12 ms · 3.8 ms · 0.32×
$src reload 次数 $src reloads · 9 · 1 · 1/9
sum / 9 · div(25 cy) · mul+shr(4 cy) · ~6× faster
SIMD ? · no(默认未开 default off) · no · —

TurboFan 的编译耗时是 Liftoff 的 10×,但出的码运行起来比 Liftoff 快 ~3×。关键洞察是这两个数字不矛盾——TurboFan 在后台编,运行时把"编译延迟"摊到了后台 worker。用户看到的延迟是"Liftoff 编译完+第一次跑",TurboFan 是后面"悄悄变快"的。这是 wasm tiering 的全部哲学。

TurboFan compiles 10× slower than Liftoff, but the resulting code runs ~3× faster. The crucial insight: these don't conflict — TurboFan compiles in the background, amortising its latency onto a worker. Perceived latency is "Liftoff done + first run"; TurboFan is the "silent" speedup later. That's wasm tiering in one sentence.

FIG 15 · sea-of-nodes · Hot Loop inner body TurboFan sea-of-nodes 图(简化版) A schematic of TurboFan's sea-of-nodes IR for the Hot Loop's inner body, before and after LoadElimination. Nodes connected by value-flow edges; control flow is implicit. SEA-OF-NODES · INNER LOOP · BEFORE vs AFTER LoadElimination BEFORE · 21 nodes $src load r0c0 load r0c1 load r0c2 load r1c0 load r1c1 load r1c2 load r2c0 load r2c1 load r2c2 i32.add × 8 (chain) i32.div_u 9 i32.store8 9 redundant $src dereferences LoadElim + CSE AFTER · 13 nodes $src 9 × load [src + (r·w + c)] $src held in reg · indices folded tree-reduce add × 8 i32 × 0x1c71c71d → SHR ~ 4 cycles · was 25 cy store [dst + idx] $src loaded once; div replaced by mul·shr WIN · per pixel ~ 20 ns saved · 40 ms on 1080p

左:Liftoff 把每条 wasm 字节翻成机器指令,9 次 local.get $src 各自 emit 一条 mov。右:TurboFan 看穿这 9 次都指向同一个 SSA 节点,LoadElimination 合并成 1 次寄存器读;sum / 9 在 SimplifiedLowering 阶段识别为常量除,替换成魔数乘法(0x1c71c71d)。这就是 wasm "原生 80%" 的具体形式。

Left: Liftoff emits one machine op per wasm byte; nine local.get $src turn into nine movs. Right: TurboFan sees the nine reference the same SSA node, LoadElimination merges them into a single register read; sum / 9 is recognised as divide-by-constant during SimplifiedLowering and rewritten as magic-number multiplication (0x1c71c71d). The concrete form of wasm's "80% of native".
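魔数为什么成立可以直接验证:0x1c71c71d = ⌈2³²/9⌉,满足 Granlund–Montgomery 的 round-up 条件,所以等式对全部 u32 成立。下面是一个可运行的 Rust 检查(示意,并非 TurboFan 源码)。

Why the magic number works can be checked directly: 0x1c71c71d = ⌈2³²/9⌉, which satisfies the Granlund–Montgomery round-up condition, so the identity holds for every u32. A runnable Rust check (an illustration, not TurboFan source):

```rust
// x / 9 rewritten as a 64-bit multiply plus shift — the strength
// reduction SimplifiedLowering performs. M = ceil(2^32 / 9) = 0x1C71C71D.
pub fn div9_magic(x: u32) -> u32 {
    ((x as u64 * 0x1C71C71D) >> 32) as u32
}
```

4 cycle 的 mul+shr 换掉 ~25 cycle 的 div,每个像素都省一次。

A 4-cycle mul+shr replaces a ~25-cycle div, once per pixel.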

从 TurboFan 到 Turboshaft

From TurboFan to Turboshaft

2022 年 V8 团队启动 Turboshaft 项目,目标是替换 TurboFan。原 TurboFan 的 sea-of-nodes 在内存里是图结构,每次访问要解引用,在大模块上 cache miss 严重。Turboshaft 改成线性 IR 序列(类似 LLVM 的 BB instructions),内存连续,优化 pass 速度上升 30~50%。2023 年起 V8 wasm 默认走 Turboshaft,但 IR 和 pass 集合跟 TurboFan 高度兼容,从外部看几乎无感。

In 2022 the V8 team began Turboshaft, with the goal of replacing TurboFan. The original TurboFan keeps its sea-of-nodes as an in-memory graph; every access dereferences, and on large modules cache misses dominate. Turboshaft uses a linear IR sequence (like LLVM BB instructions), so memory is contiguous and pass speed improves 30–50%. Since 2023, V8 wasm has run Turboshaft by default, but IR and pass set are highly TurboFan-compatible — externally near-invisible.

具体例子 · LoadElimination 的力量CASE · the power of LoadElimination
9 次 local.get $src 怎么变成 1 次寄存器读
9 × local.get $src collapses into one register read

Liftoff 把每个 local.get $src 都翻成 mov rax, [rbp-0x08]——9 次。TurboFan 看到这 9 次 local.get 都引用同一个 SSA 节点(因为 wasm 验证已经证明 $src 在此区间未被赋值),LoadElimination 把它们合并成一个 SSA 引用。寄存器分配阶段把这个 SSA 值留在 rcx 里——9 次内存读变成 1 次内存读 + 9 次寄存器引用。这一招省下 ~20 ns 每像素,在 1920×1080 图像上是 40 ms 的差距。

Liftoff turns each local.get $src into mov rax, [rbp-0x08] — nine times. TurboFan sees all nine reference the same SSA node (validation already proved $src isn't reassigned in this region), and LoadElimination merges them into one SSA reference. RegAlloc keeps the SSA value in rcx — nine memory reads collapse into one memory read + nine reg references. ~20 ns saved per pixel; on a 1920×1080 image, that's a 40 ms swing.

STAGE 11 · PIPELINE

Instantiate — Module 是模板,Instance 是身体

Instantiate — module is the template, instance is the body

memory · table · globals · imports · start

memory · table · globals · imports · start

Module
code + metadata
Instance
memory + table + globals
Cost
~ 1 ms typical
Reusable
1 module → N instances
MAIN-LINE · STOP 8a / 12 · INSTANTIATE hot.rs 此刻: Liftoff 已经出码,但还没人调用。引擎在做"第一次调用之前"的最后准备——分配 linear memory、填充 table 与 globals、跑 start 函数。Storyboard 第 8 格之前的这一秒。同一个 Module 可以 instantiate 多次,每次给一个独立的 memory——这是 wasm 多线程的基础。 hot.rs right now: Liftoff has emitted code, but nobody has called it yet. The engine performs the final "before-first-call" setup — allocating linear memory, filling tables and globals, running the start function. The second just before Storyboard cell 8. One Module can be instantiated many times, each with its own memory — the foundation of wasm multithreading.

WebAssembly.Module 是不可变的编译产物——它只装了代码、类型、import 声明、export 声明。要真正"跑"它,得创建一个 WebAssembly.Instance,把 import 满足、memory / table / globals 分配出来。同一个 Module 可以创建多个 Instance,每个 Instance 有自己的 memory——这是 wasm 实现"多线程"和"沙箱隔离"的基础。

WebAssembly.Module is an immutable compilation artifact — it carries code, types, import/export declarations. To actually run it, create a WebAssembly.Instance: satisfy imports, allocate memory / table / globals. One Module can spawn many Instances, each with its own memory — the foundation of wasm's "multithreading" and "sandbox isolation".

实例化的 7 步

Seven steps of instantiation

1
检查 import 是否满足
Check imports
每个 import 声明的类型与 JS 提供的对象类型是否匹配。不匹配 → LinkError。
Every declared import's type vs the JS-provided object. Mismatch → LinkError.
2
分配 linear memory
Allocate linear memory
如果 import 了 memory → 复用 import 的;否则按 Memory section 的 min page 分配新 ArrayBuffer。
If memory is imported → reuse it; else allocate a fresh ArrayBuffer of min pages from the Memory section.
3
分配 tables
Allocate tables
类似 memory,要么 import 要么新建,初始填 null/null。
Same pattern as memory — imported or fresh, initial values null.
4
初始化 globals
Init globals
每个 global 跑一遍它的 init expression。expression 只能用常量 + global.get 已 init 的。
Run each global's init expression. Only constants + global.get of already-inited globals allowed.
5
填 data segments(memory)
Apply data segments
把 Data section 里所有 active 段 memcpy 进 linear memory 对应地址。
memcpy every active data segment into its target offset in linear memory.
6
填 element segments(tables)
Apply element segments
把 Element section 的 funcref 填到 tables。
Populate funcrefs from element segments into tables.
7
调用 start 函数(如果有)
Run start function (if any)
Start section 指定的函数现在被同步调用。start 抛 trap → instantiate 失败
The start-section function is called synchronously. If start traps, instantiation fails.

JS 那一面的代码

JS-side code

const importObject = {
  env: {
    memory: new WebAssembly.Memory({ initial: 64, maximum: 256 }),
    log:    (x) => console.log('wasm says', x),
  },
  wasi_snapshot_preview1: { /* WASI shims */ },
};

const { module, instance } = await WebAssembly.instantiateStreaming(
  fetch('hot.wasm'),
  importObject
);

instance.exports.blur3(srcPtr, dstPtr, 1920, 1080);

"多 Instance · 单 Module" 的用法

"Many instances · one module" pattern

Web Worker + SharedArrayBuffer 场景里,常见做法是主线程 compile 一次 Module,所有 worker 都用同一个 Module 创建独立 Instance。每个 worker 有自己的 memory(可能是 import 来的共享 memory),共享代码、不共享栈。Photoshop / Figma 都这样做。编译只跑一次,内存按需复制

With Web Workers + SharedArrayBuffer, the standard pattern is: main thread compiles the Module once; each worker creates its own Instance from the same Module. Each worker has its own memory (possibly imported shared memory), sharing code but not stacks. Photoshop and Figma both do this. Compile once, memory on demand.

STAGE 12 · PIPELINE

JS ↔ Wasm — 看不见的 trampoline 与 5 ns 的代价

JS ↔ Wasm — invisible trampolines and the 5 ns toll

两条 ABI 中间的桥

the bridge between two ABIs

Cost per call
~ 5 ns (modern V8)
Type coercion
Number ↔ i32/f64
Failure modes
TypeError on coerce
Spec §
JS API · 2.5
MAIN-LINE · STOP 8b / 12 · BOUNDARY hot.rs 此刻: JS 那一边 instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) 触发了第一次执行。V8 在中间塞了一层 JS-to-Wasm wrapper 栈帧——SMI 解包 + r15/r14 装填 + tail-jmp 进 Liftoff 出码。2025 年 V8 把这层压到 5 ns。Storyboard 第 8 格的跨边界细节 hot.rs right now: JS-side instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) triggers the first invocation. V8 inserts a JS-to-Wasm wrapper frame between them — SMI unbox + r15/r14 setup + tail-jmp into Liftoff code. 2025 V8 has the whole thing down to 5 ns. The boundary detail behind Storyboard cell 8.

JS 用 SMI 标记的 31 位整数,wasm 用裸 i32。JS 调用约定走的是 V8 的 JS calling convention,wasm 内部用的是 wasm calling convention。JS 调 wasm 函数,引擎要在中间塞一个 trampoline——把 SMI 解包成 i32,HeapNumber 走 ToInt32 转换(NaN 转成 0),类型不可转换时(比如给 i32 传 BigInt)抛 TypeError,然后跳进 wasm 函数体。反过来也一样。这一切 V8 都在编译期生成,但你看不到。

JS represents 31-bit integers as SMIs; wasm uses raw i32. JS calls follow V8's JS calling convention; wasm internally uses the wasm convention. When JS calls wasm, the engine slips a trampoline in between — unbox the SMI to i32, run HeapNumbers through ToInt32 (NaN coerces to 0), throw a TypeError on unconvertible values (a BigInt passed to an i32 slot), then jump into the wasm function body. Same in reverse. V8 generates all of this at compile time, but you never see it.
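SMI 解包到底多便宜,可以用几行 Rust 模拟。按上面 wrapper 使用的 shift-by-one 方案:最低位 0 表示 SMI(载荷在高位),1 表示堆指针——wrapper 里的 test + sar 做的就是下面两件事。纯示意,不是 V8 源码。

How cheap SMI unboxing is can be mimicked in a few lines of Rust, following the shift-by-one scheme the wrapper above uses: low bit 0 means SMI (payload in the upper bits), 1 means heap pointer — the wrapper's test + sar do exactly the two operations below. Illustration only, not V8 source.

```rust
// Tagged-value model mirroring the wrapper's fast path.
pub fn smi_tag(x: i32) -> i64 {
    (x as i64) << 1 // low bit 0 marks an SMI
}

pub fn is_smi(v: i64) -> bool {
    v & 1 == 0 // the `test rax, 0x1` check
}

pub fn smi_untag(v: i64) -> i32 {
    (v >> 1) as i32 // the `sar rax, 1` unbox — one arithmetic shift
}
```

解包就是一次算术右移——这是 5 ns 快路径得以成立的原因之一。

Unboxing is one arithmetic shift — part of why the 5 ns fast path is possible at all.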

4 类 trampoline

Four trampoline kinds

Name · Direction · Used when
JS-to-Wasm wrapper · JS → Wasm · JS calls instance.exports.f
Wasm-to-JS wrapper · Wasm → JS · Wasm calls an imported JS function
Wasm-to-Wasm · Wasm → Wasm · Direct or indirect call to another wasm func
C API wrapper · C/C++ ↔ Wasm · Embedder uses the wasm_c_api headers
FIG 17 · JS → Wasm trampoline · 3-frame stack JS → Wasm trampoline 的三栈帧 A diagram showing how a JS call into wasm builds three stack frames: JS frame, the JS-to-Wasm wrapper frame, and the wasm frame. Register r15 and r14 are set up by the wrapper. JS → WASM · TRAMPOLINE FRAME SETUP CALL STACK · top to bottom JS frame SMI-tagged args · JS stack instance.exports.blur3(...) call JS-to-Wasm wrapper unbox SMI → i32 load r15, r14 from instance ~5 ns total in 2025 V8 tail-jmp wasm frame locals on wasm stack [r15+addr] loads/stores Liftoff or TurboFan body SHARED CONTEXT · key registers r15 = wasm memory base every load/store uses this r14 = instance pointer tier-up cnt, globals, tables 3 IRREDUCIBLE OVERHEADS ① stack-pointer swap ② r15 / r14 reload ③ EH metadata push = 5 ns floor PER-CALL COST · YEAR-BY-YEAR 2017: 80 ns 2019: 35 ns 2022: 12 ns 2025: 5 ns

JS 调 wasm 时,V8 在中间塞了一层JS-to-Wasm wrapper 栈帧——专门做 SMI 解包 + r15/r14 寄存器装填,然后尾跳进 wasm 函数体。整个过程 2025 年 V8 已压到 ~ 5 ns(2017 是 80 ns)。剩下三件事不能省:栈指针切换、wasm 关键寄存器装填、异常处理元数据 push——这是 trampoline 的物理下限

When JS calls wasm, V8 inserts a JS-to-Wasm wrapper frame — it unboxes SMIs, loads r15/r14, then tail-jumps into the wasm body. 2025 V8 has the whole thing down to ~5 ns (from 80 ns in 2017). Three things refuse to compress: stack-pointer swap, wasm context-register load, EH metadata push — the trampoline's physical floor.

JS-to-Wasm wrapper 都做了什么

What the JS-to-Wasm wrapper does

; JS-to-Wasm wrapper for blur3(src, dst, w, h)
push   rbp
mov    rbp, rsp

; arg 0 (src): expect SMI, unbox to i32
mov    rax, [rdi+0x10]          ; rdi = first arg, JS heap pointer
test   rax, 0x1                 ; SMI test (low bit = 0 means SMI in V8)
jnz    .slow_path                ; HeapNumber path
sar    rax, 1                   ; SMI shift to get raw int
mov    edi, eax                 ; load into wasm arg reg
...                              ; same for args 1..3

; setup wasm frame
mov    r15, [r13+0x20]          ; load wasm memory base
mov    r14, [r13+0x28]          ; load wasm instance pointer

; tail-call into wasm function
jmp    [r14+0x40]               ; → Liftoff/TurboFan-compiled blur3

.slow_path:
call   ConvertNumberToInt32     ; ToInt32 path: HeapNumber (NaN → 0), TypeError on BigInt
jmp    .back                    ; resume fast path with the converted value

观察:① 主路径是纯寄存器操作 + 一条 jmp,~5 ns;② SMI 解包是一条 test + sar,几乎免费;③ 慢路径处理 HeapNumber/BigInt/NaN,大约 50~100 ns;④ r15(memory base)和 r14(instance)被显式 load——wasm 函数运行时假设这两个寄存器有效。

Notes: ① fast path is pure register ops + one jmp ≈ 5 ns; ② SMI unboxing is one test + sar, practically free; ③ slow path (HeapNumber / BigInt / NaN) is ~50–100 ns; ④ r15 (memory base) and r14 (instance) are explicitly loaded — the wasm body assumes these registers hold valid values.

Wasm-to-JS 的更贵代价

The pricier Wasm-to-JS direction

反方向更贵:wasm 调用 JS 函数(比如 console.log)需要构造 JS call frame、把 i32 boxing 成 SMI、检查 receiver、可能 GC——单次大概 100~300 ns。"wasm 频繁调 DOM" 是性能反模式——每次过桥的成本就吃掉了算术速度的优势。Photoshop 的策略是把整张图片 copy 到 linear memory,处理完一整张再过桥回 JS,把过桥次数压到极少。

The reverse is pricier: wasm calling JS (e.g. console.log) constructs a JS call frame, boxes i32 to SMI, checks the receiver, possibly triggers GC — ~100–300 ns per call. "Frequent DOM calls from wasm" is the canonical perf anti-pattern — boundary cost devours arithmetic speedup. Photoshop's strategy: copy the entire image into linear memory, process the whole thing, cross the boundary once on return. Minimise crossings.

主线 · The Hot Loop 的过桥成本

Main-line · the Hot Loop crossing cost

JS → blur3 call cost · M1 Pro · Chrome 132
2017 V8
~ 80 ns
2019 (fast-calls)
~ 35 ns
2022 (call-ref opt)
~ 12 ns
2025 (current)
~ 5 ns

2017 MVP 时,JS 调 wasm 单次成本 ~ 80 ns——这意味着每秒最多 1200 万次过桥,对 60 fps 的游戏来说是真实瓶颈。V8 后续 5 年逐步优化:把 wrapper 做成 builtin、把 SMI 解包内联、用 call-ref 替代 call-indirect 间接调用、最后是 2025 年的直接调用——把 wrapper 完全 elide,JS 编译器看穿"这次 call 一定调 wasm" 时直接 emit 一条 call。如今边界几乎免费。

In the 2017 MVP, JS-calling-wasm cost ~80 ns per call — a hard cap of 12 M crossings/s, a real bottleneck for 60 fps games. V8 optimised over five years: turn wrappers into builtins, inline SMI unboxing, replace call-indirect with call-ref, and finally the 2025 direct call — elide the wrapper entirely, JS compiler sees "this call definitely lands in wasm" and emits a plain call. Today the boundary is nearly free.

为什么不能更便宜WHY NOT CHEAPER 5 ns 是 V8 当前的下限——还剩三件事不能省:① stack-pointer 切换(JS 用 V8 stack,wasm 用自己的 stack);② r15 / r14 寄存器 load(wasm 函数假设它们有效);③ 异常处理元数据 push(为了 wasm trap 能被 JS try/catch 抓到)。理论上还能再砍 1~2 ns,但工程复杂度极高。"5 ns 是 trampoline 自身的物理极限" 5 ns is V8's current floor — three things remain irreducible: ① stack-pointer swap (JS uses V8 stack, wasm has its own); ② r15 / r14 register loads (the wasm body assumes these are valid); ③ exception-handling metadata push (so wasm traps can be caught by JS try/catch). 1–2 ns more could be carved, but engineering cost is high. "5 ns is the trampoline's physical floor".
最贵的指令不是除法,是过边界。 Field Note · 03
The most expensive instruction is not division.
It is crossing the boundary. Field Note · 03
INPUT
JS 调用 wasm.exports.blur3JS calls wasm.exports.blur3SMI Number args
OUTPUT
wasm 函数体执行wasm body executes~ 5 ns trampoline · 3.8 ms blur
ACT V · PROPOSALS

每一次扩张都是一次妥协。

Every extension is a compromise.

MVP 之后 8 年,wasm 加了 ~15 个生效中的提案。这 5 章只挑最重要的展开:Threads 把共享内存接进 wasm 沙箱;SIMD 用 16 字节寄存器给 inner loop 提速 6 倍;GC 让 Java/Kotlin/Dart 不再背着自己的运行时;Component Model 给 wasm 一个跨语言 ABI;以及还有六个候选提案在排队。每个提案都要回答"怎么不破坏 portable + safe + fast + compact 四目标"——这是 wasm 委员会评审的根本问题。

Eight years post-MVP, wasm has shipped ~15 live proposals. These five chapters cover the most consequential: Threads plug shared memory into the sandbox; SIMD turns 16-byte registers into 6× inner-loop speedups; GC frees Java/Kotlin/Dart from shipping their own runtimes; Component Model gives wasm a cross-language ABI; and six more are queued. Every proposal must answer "does this still honour portable + safe + fast + compact?" — the working group's gatekeeping question.

PROPOSAL · THREADS

Threads — 共享内存进沙箱

Threads — shared memory inside the sandbox

SharedArrayBuffer · atomics · futex

SharedArrayBuffer · atomics · futex

Shipped
V8 2019 · SM 2020
Mem
shared linear memory
Ops
i32.atomic.* · memory.atomic.wait/notify
Spec §
threads-proposal

"能不能在浏览器里跑 pthread?"——这是从 2014 年起 game engine 开发者就在问的问题。Threads 提案 2019 年 ship,答案是:能,但用新的方式。WebWorker 已经存在(线程没有共享内存,只有消息传递);wasm threads 在这上面叠加了 SharedArrayBuffer(共享内存)和 atomic ops(无锁原语)。

"Can we run pthread in the browser?" — a question game-engine devs have asked since 2014. The Threads proposal shipped in 2019 with the answer: yes, but in a new way. WebWorker already existed (no shared memory, only message passing); wasm threads layer SharedArrayBuffer (shared memory) and atomic ops (lock-free primitives) on top.

三种新指令

Three new instruction families

i32.atomic.*
原子读写
Atomic load / store

i32.atomic.load / store / rmw.add / rmw.cmpxchg。出码为 x86 的 LOCK XADD / LOCK CMPXCHG,ARM 的 LDAR / STLR / LDADD。顺序一致(sequential consistency)是 wasm 的默认。

i32.atomic.load / store / rmw.add / rmw.cmpxchg. Emit x86 LOCK XADD / LOCK CMPXCHG or ARM LDAR / STLR / LDADD. Sequential consistency is wasm's default.

memory.atomic.wait
阻塞等待
Block and wait

类似 Linux futex:线程等待地址 X 的值发生变化,引擎把这个线程挂起到内核。不能在主线程上用(浏览器禁止主线程阻塞 > 0 ms)。

Linux-futex-like: a thread sleeps until the value at address X changes; the engine parks the thread in the kernel. Not callable from the main thread (browsers forbid > 0 ms main-thread blocks).

memory.atomic.notify
唤醒等待者
Wake waiters

唤醒在地址 X 上等待的 K 个线程(K 可以是 ∞)。配合 wait 实现 mutex / condvar / barrier。

Wake K (possibly ∞) waiters on address X. Combined with wait → mutex / condvar / barrier.
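前两族指令在 Rust 里有直接对应物,底层同样出码为 LOCK 前缀 / LDADD 等硬件原子指令。下面用 SeqCst(wasm 的默认序)写一个可运行的小例子;wait/notify 在 Rust 标准库里没有稳定对应,此处省略。纯示意。

The first two families have direct Rust counterparts, lowering to the same LOCK-prefixed / LDADD hardware ops. A runnable mini-example using SeqCst, wasm's default ordering; wait/notify has no stable Rust-stdlib counterpart and is omitted. Illustration only.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// N threads hammer one shared counter with i32.atomic.rmw.add-style
// increments; SeqCst mirrors wasm's default sequential consistency.
pub fn parallel_count(n_threads: u32, per_thread: u32) -> u32 {
    let counter = AtomicU32::new(0);
    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::SeqCst); // → LOCK XADD on x86
                }
            });
        }
    });
    counter.load(Ordering::SeqCst) // i32.atomic.load
}
```

没有原子指令,这个计数会丢增量;有了它,任意线程数下结果都精确。

Without atomics this count would drop increments; with them, the result is exact for any thread count.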

FIG 18 · workers + SharedArrayBuffer topology 主线程 + N 个 worker 共享 wasm Memory 的拓扑 Main thread compiles a Module once; each worker creates its own Instance importing the same SharedArrayBuffer-backed Memory. Code is shared, stacks are not. SHARED MEMORY TOPOLOGY · 1 module · N instances · 1 SAB Main Thread compile Module once allocate SAB · spawn workers worker 0 Instance + own stack blur3 rows 0-269 worker 1 Instance + own stack blur3 rows 270-539 worker 2 Instance + own stack blur3 rows 540-809 worker 3 Instance + own stack blur3 rows 810-1079 SharedArrayBuffer · 4 MiB · 1080p image src @ 0x000000 · dst @ 0x1FB000 i32.atomic.* for synchronisation postMessage({ mem, module }) i32.load / i32.store (RW) ⚠ POST-SPECTRE GATE requires headers: COOP: same-origin · COEP: require-corp SHARED · NOT SHARED ✓ memory · ✓ tables · ✓ globals ✗ stack · ✗ locals · ✗ trap state

主线程 compile 一次 Module + 分配 一个 SharedArrayBuffer,通过 postMessage 把它发给 N 个 worker。每个 worker 创建独立 Instance(独立栈、独立 locals、独立 trap state),但都 import 同一个 Memory——这才是真正的"共享内存多线程"。Spectre 漏洞之后 COOP+COEP 头是必需的进程隔离保险。

The main thread compiles the Module once + allocates one SharedArrayBuffer, then ships them to N workers via postMessage. Each worker spawns its own Instance (own stack, locals, trap state) but imports the same Memory — true "shared-memory multithreading". Post-Spectre, COOP+COEP headers gate the process isolation that makes this safe.

JS 那一面的 setup

JS-side setup

// Main thread
const mem = new WebAssembly.Memory({
  initial: 256, maximum: 2048, shared: true    // ← key flag
});
const buf = mem.buffer;  // instanceof SharedArrayBuffer

const module = await WebAssembly.compileStreaming(fetch('hot.wasm'));  // compile once
const workers = [...Array(8)].map(() => new Worker('worker.js'));
workers.forEach(w => w.postMessage({ mem, module }));    // share Memory + Module

// worker.js
self.onmessage = async ({ data }) => {
  const instance = await WebAssembly.instantiate(
    data.module,                                 // reuse the shared Module
    { env: { memory: data.mem } }                // import same Memory
  );
  instance.exports.blur3_threaded(srcPtr, dstPtr, w, h, workerId);
};

五件事:① shared: true 让 ArrayBuffer 变成 SharedArrayBuffer——浏览器对此要 Cross-Origin-Isolated 才允许;② maximum 必填——因为 grow shared memory 在 JS 那边复杂(所有 worker 都要被通知),所以提前占好上限;③ 主线程 compile 一次 Module,所有 worker 复用;④ 每个 worker 创建自己的 Instance,但 import 同一个 Memory——这是共享内存的关键;⑤ wasm 那边 thread id 通过函数参数显式传入,不是隐式。

Five things: ① shared: true upgrades the ArrayBuffer to SharedArrayBuffer — browsers require Cross-Origin-Isolated for it; ② maximum is mandatory — growing shared memory cross-worker is complex, so the ceiling is fixed up front; ③ main thread compiles the Module once, all workers reuse it; ④ each worker spawns its own Instance but imports the same Memory — that's the shared-memory bridge; ⑤ thread id is passed explicitly as a function arg, not implicit.

Spectre 在这里的故事

The Spectre side-story

2018 年 1 月 Spectre 漏洞披露后,所有浏览器立刻关闭了 SharedArrayBuffer——因为高分辨率定时器 + 共享内存 = 可以利用 cache 旁路通道。wasm threads 当时 phase 3 即将 ship,被推迟了一年半。最终方案:要求页面声明 Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin——把进程隔离到只跟自己同源的脚本一起跑,这样旁路通道泄漏只会泄漏自己的数据,无意义。2021 年起浏览器在 COOP+COEP 头下重新启用 SharedArrayBuffer。如今你能在 Figma 上跑 wasm threads,就是因为它服务端正确设置了这两个头。

When Spectre dropped in January 2018, browsers immediately disabled SharedArrayBuffer — high-res timers + shared memory = a usable cache side-channel. wasm threads, then at phase 3 and almost shipping, slipped a year and a half. The eventual mitigation: require pages to declare Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin — isolate the process so it only co-resides with same-origin scripts; any side-channel leak only leaks your own data, which is harmless. From 2021, browsers re-enabled SharedArrayBuffer behind COOP+COEP. You can run wasm threads on Figma because they set those headers correctly.

主线 · The Hot Loop 多线程版的提速

Main-line · The Hot Loop, threaded

1920×1080 blur · cores=8 · M1 Pro
single thread
3.8 ms · 1.0×
threads × 2
2.0 ms · 1.9×
threads × 4
1.1 ms · 3.5×
threads × 8
0.65 ms · 5.8×

8 个 worker 把图像分成 8 个水平条带,每个 worker 独立处理。理想线性加速 8×,实测 5.8×——剩下 28% 损失在 worker 启动开销、worker 间内存通信、最后聚合点同步。这已经是 wasm 在 web 平台能做到的"多核计算"上限

Eight workers slice the image into eight horizontal stripes, each processed independently. Ideal linear speedup is 8×; measured 5.8×. The remaining 28% goes to worker startup, cross-worker memory contention, and the final sync barrier. This is the practical ceiling for "multi-core computation" on the web platform today.
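条带切分本身不需要任何 wasm 特性,Rust 里用 chunks_mut 就能表达同一件事:每个线程拿到一段互不重叠的行,构造上就无数据竞争——wasm 版靠显式行偏移获得同样的保证。示意代码,用简单的反色代替 blur。

The striping itself needs nothing wasm-specific; Rust's chunks_mut expresses the same idea: each thread owns a disjoint band of rows, data-race-free by construction — the wasm version gets the same guarantee from explicit row offsets. A sketch, with a trivial invert standing in for the blur.

```rust
use std::thread;

// Split an image into horizontal bands and process them in parallel.
// chunks_mut hands out disjoint &mut slices, so no synchronisation is
// needed until the implicit join at the end of the scope.
pub fn invert_striped(pixels: &mut [u8], width: usize, workers: usize) {
    let rows = pixels.len() / width;
    let band_rows = (rows + workers - 1) / workers; // ceil-divide rows over workers
    thread::scope(|s| {
        for band in pixels.chunks_mut(band_rows * width) {
            s.spawn(move || {
                for p in band.iter_mut() {
                    *p = 255 - *p; // per-pixel work, one stripe per thread
                }
            });
        }
    });
}
```

scope 末尾的隐式 join 就是正文里那个"最后聚合点同步"。

The implicit join at the end of the scope is the "final sync barrier" from the table above.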

PROPOSAL · SIMD

SIMD — 16 字节寄存器里的 16 个像素

SIMD — 16 pixels in a 16-byte register

v128 · lane ops · 6× speedup

v128 · lane ops · 6× speedup

Shipped
2021 · phase 4
Width
fixed 128 bit
Ops
~ 250 (prefix 0xFD)
Target
x86 SSE2 · ARM NEON
MAIN-LINE · STOP 12 / 12 · SIMD FRAME hot.rs 此刻(SIMD 版): 加上 RUSTFLAGS="-C target-feature=+simd128",LLVM 把内层循环向量化——一次循环处理 16 个像素而不是 1 个。inner loop 变成 18 条 SSE2 指令(PADDW / PMULHRSW / PSRLW),平均 1.1 条 SSE / 像素。这是 Storyboard 最后一格,也是 wasm 在 1080p 图像上达到 6.8× of JS 的根源。终点站 hot.rs right now (SIMD build): with RUSTFLAGS="-C target-feature=+simd128", LLVM vectorises the inner loop — 16 pixels per iteration, not 1. The body becomes 18 SSE2 instructions (PADDW / PMULHRSW / PSRLW), averaging ~1.1 SSE per pixel. The final Storyboard cell — and the reason wasm hits 6.8× of JS on a 1080p image. End of the line.

SIMD 是 wasm 提案里争议最大的一个——主要分歧是固定宽度 vs 可变宽度(scalable)。x86 的 AVX-512 是 512 bit,ARM 的 SVE2 是 128~2048 bit 可变,RISC-V 的 V 扩展是真的可变。真"可移植"的方案应该是 scalable,但 scalable SIMD 的 codegen 复杂度极高,浏览器里的 JIT 负担不起。最终 wasm 选了固定 128 bit——所有现代 CPU 都至少有 128 bit 寄存器,JIT 输出最直接。SIMD 是 wasm 唯一一个"明确放弃移植性最大化" 的提案。

SIMD was the most contentious wasm proposal — the central debate was fixed width vs scalable. x86 AVX-512 is 512 bit; ARM SVE2 is 128–2048 bit scalable; RISC-V's V extension is genuinely scalable. The portable answer is scalable, but scalable codegen is so complex that no JIT can handle it in the browser. Wasm settled on fixed 128 bit — every modern CPU has 128-bit registers, so JIT output is direct. SIMD is the one proposal where wasm explicitly traded portability for tractability.

FIG 19 · v128 · 16 bytes, 6 lane interpretations 同一个 v128 寄存器,6 种 lane 解释 The same 128-bit v128 register reinterpreted into 16×i8 / 8×i16 / 4×i32 / 2×i64 / 4×f32 / 2×f64 lanes; the byte content stays identical but the opcode picks the lane shape. v128 · ONE REGISTER, SIX LANE SHAPES i8x16 image pixels i8 i8 i8 i8 i8 i8 i16x8 audio samples i16 i16 i16 i16 i16 i16 i16 i16 i32x4 RGBA pixels i32 i32 i32 i32 i64x2 crypto words i64 i64 f32x4 vector math f32 f32 f32 f32 f64x2 scientific f64 f64 → same 128 bits in memory; the op decides how to read them

同一个 128-bit 寄存器,根据操作解释成 16/8/4/2 个 lane。i8x16.add 是 16 个并行 8-bit 整数加;f32x4.mul 是 4 个并行 32-bit 浮点乘;i64x2.shl 是 2 个并行 64-bit 位移。类型在指令里,不在值里——这是 wasm 整个类型系统的统一原则,SIMD 也不例外。

The same 128-bit register is interpreted into 16/8/4/2 lanes by the op: i8x16.add = 16 parallel 8-bit adds, f32x4.mul = 4 parallel 32-bit float mults, i64x2.shl = 2 parallel 64-bit shifts. The type lives in the op, not the value — wasm's unifying principle, SIMD included.
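"类型在指令里"可以用纯 Rust 演示:同样的 16 个字节,按 i8x16 加和按 i32x4 加,区别恰好是进位跨不跨 lane 边界。下面是标量模拟——不用 intrinsics,只演示语义。

"The type lives in the op" can be demonstrated in plain Rust: the same 16 bytes added as i8x16 vs as i32x4 differ exactly in whether carries cross lane boundaries. A scalar emulation — no intrinsics, semantics only.

```rust
// i8x16.add semantics: 16 independent bytes, carries never cross lanes.
pub fn i8x16_add(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

// i32x4.add semantics: the same bytes grouped into four little-endian
// 32-bit lanes; carries propagate inside a lane but not across lanes.
pub fn i32x4_add(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..4 {
        let x = u32::from_le_bytes(a[4 * i..4 * i + 4].try_into().unwrap());
        let y = u32::from_le_bytes(b[4 * i..4 * i + 4].try_into().unwrap());
        out[4 * i..4 * i + 4].copy_from_slice(&x.wrapping_add(y).to_le_bytes());
    }
    out
}
```

同样的字节,不同的指令:[0xFF; 16] + [0x01; 16] 在 i8x16 下全部回绕成 0;在 i32x4 下进位在 lane 内传播,得到 00 01 01 01 的字节模式。

Same bytes, different op: [0xFF; 16] + [0x01; 16] wraps every lane to 0 under i8x16; under i32x4 the carry propagates inside each lane, giving the byte pattern 00 01 01 01.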

v128 是个"多义类型"

v128 is a "polysemic type"

128 bit 可以看成:

128 bits can be viewed as:

Lane shape · Lanes · Per-lane type · Examples
i8x16 · 16 · i8 · i8x16.add
i16x8 · 8 · i16 · i16x8.mul
i32x4 · 4 · i32 · i32x4.add
i64x2 · 2 · i64 · i64x2.add
f32x4 · 4 · f32 · f32x4.sqrt
f64x2 · 2 · f64 · f64x2.sqrt

同一个 v128 寄存器,根据操作解释成不同 lane shape。i8x16.add 是 16 个 8-bit 整数对应相加,f32x4.mul 是 4 个 32-bit 浮点对应相乘。类型在指令里,不在值里——这是 wasm 整个类型系统的复制粘贴(回顾 Ch08)。

A single v128 register is reinterpreted by the op's lane shape. i8x16.add = 16 paired 8-bit adds; f32x4.mul = 4 paired 32-bit float mults. The type lives in the op, not the value — a copy-paste from wasm's overall design (recall Ch08).

主线 · The Hot Loop 的 SIMD 版本

Main-line · the SIMD version of The Hot Loop

;; SIMD-vectorised inner loop: process 16 columns at once

v128.load offset=0      ;; 16 bytes from src row 0
v128.load offset=1      ;; 16 bytes shifted right by 1
i16x8.extadd_pairwise_i8x16_u  ;; widen + add pairs → 8 × i16
v128.load offset=2      ;; 3rd column
i16x8.extadd_pairwise_i8x16_u
i16x8.add                ;; sum 3 columns of row 0
;; ... repeat for rows 1 and 2, then sum 3 rows → 8 × i16 sums ...
v128.const i16x8 3641 3641 3641 3641 3641 3641 3641 3641
i16x8.q15mulr_sat_s      ;; ≈ ÷ 9 — SIMD has no integer divide; 8 lanes in ~4 cy
i8x16.narrow_i16x8_u     ;; saturate back to 8-bit
v128.store offset=0     ;; 16 bytes written to dst

一次循环处理 16 个像素,而不是一个。3×3 卷积变成"3 行 SIMD 加法 + 1 次 SIMD 乘高位(≈ ÷9)+ 1 次 SIMD 写入"。TurboFan 把这段 wasm 翻成 x86 的 PADDW / PMULHRSW / PSRLW 等 SSE2 指令——每次 inner loop 迭代大约 18 条 SSE 指令,处理 16 像素,平均每像素 ~1.1 条 SSE。这比标量版本(每像素 ~20 条 x86 指令)快 6 倍以上,正是 hero pulse bar 里看到的 6.8× 来源。

One iteration handles 16 pixels, not one. The 3×3 convolution becomes "3 rows of SIMD add + 1 SIMD multiply-high (≈ ÷9) + 1 SIMD store". TurboFan lowers this wasm into x86 PADDW / PMULHRSW / PSRLW SSE2 ops — ~18 SSE instructions per inner iteration handling 16 pixels, averaging ~1.1 SSE per pixel. Vs ~20 x86 instructions per pixel in the scalar version → 6×+ speedup, exactly the 6.8× from the hero pulse bar.
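注意 wasm SIMD 并没有整数除法指令——÷9 实际走定点乘高位:i16x8.q15mulr_sat_s(x86 的 PMULHRSW)用常数 3641 ≈ 32768/9 做一次"乘 + 取高位 + 舍入"。它是四舍五入而非向下取整,对模糊滤镜无影响。下面用标量 Rust 验证这条等价——示意,常数 3641 是一个假定的选择。

Note that wasm SIMD has no integer-divide instruction — the ÷9 actually goes through a fixed-point multiply-high: i16x8.q15mulr_sat_s (x86 PMULHRSW) with the constant 3641 ≈ 32768/9 performs one multiply + take-high + round. It rounds to nearest rather than flooring, which a blur filter cannot tell apart. A scalar Rust check of the equivalence — a sketch; the constant 3641 is an assumed choice.

```rust
// Scalar model of one i16x8.q15mulr_sat_s lane:
// result = saturate((2*a*b + 2^15) >> 16).
pub fn q15mulr(a: i16, b: i16) -> i16 {
    let v = (2 * a as i32 * b as i32 + (1 << 15)) >> 16;
    v.clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

// ÷9 for the blur's 16-bit sums (0..=2295 = 255*9): multiply by
// 3641 ≈ 32768/9, giving round(x/9) — one mul-high covers 8 lanes.
pub fn div9_q15(x: i16) -> i16 {
    q15mulr(x, 3641)
}
```

在 blur 的取值域内,div9_q15(x) 恒等于 round(x/9)(整数形式即 (x+4)/9)。

Across the blur's input range, div9_q15(x) equals round(x/9) exactly (in integer form, (x+4)/9).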

Relaxed SIMD 是什么WHAT IS RELAXED SIMD 2024 ship 的 Relaxed SIMD 提案加了一组结果可以略微不同的 SIMD op——比如 i16x8.relaxed_q15mulr_s 在不同 CPU 上结果可能差 1 个 ulp。原因:严格的 SIMD 在 x86 上有时要 emulate(因为 SSE2 不完全等价于 NEON 的某些精确语义),emulate 的开销大。Relaxed SIMD 允许 JIT 选最快的硬件 op,牺牲严格 deterministic。这是 wasm 第一次主动放弃"portable" 的子集——专门给图像滤镜、AI 推理这些"差 1 个 ulp 没人 care" 的场景。 The Relaxed SIMD proposal (2024) added a set of SIMD ops whose results may differ slightly across CPUs — e.g. i16x8.relaxed_q15mulr_s may diverge by 1 ulp across x86 vs ARM. Reason: strict SIMD sometimes needs emulation on x86 (SSE2 isn't bit-equivalent to certain NEON semantics), and emulation is expensive. Relaxed SIMD lets the JIT pick the fastest hardware op, trading strict determinism. The first time wasm willingly dropped a portion of "portable" — aimed at image filters, AI inference, the "1 ulp doesn't matter" scenarios.
PROPOSAL · GC

wasm-GC — 终于,wasm 有了自己的对象

wasm-GC — at last, wasm has its own objects

struct · array · i31 · ref

struct · array · i31 · ref

Shipped
V8 130 · SM 120 · 2024
New types
struct, array, i31ref
GC
host's GC (shared with JS)
Spec §
gc-proposal

2017 MVP 时 wasm 没有 GC。Java / Kotlin / Dart / C# 这些带 GC 的语言要跑 wasm,只能把整个 GC 运行时也编译进 wasm——TeaVM 编 Java,Kotlin/Wasm 编 Kotlin,DartVM-wasm 编 Dart,每个加 1-2 MB 的运行时 wasm。这意味着同一个标签页里 10 个 wasm 模块就有 10 份 GC 在跑,堆不共享,STW 不协调——巨大的浪费。

The 2017 MVP had no GC. For GC-bearing languages like Java / Kotlin / Dart / C#, the only way was to compile your GC runtime into wasm — TeaVM for Java, Kotlin/Wasm for Kotlin, DartVM-wasm for Dart, each adding 1–2 MB of runtime wasm. Ten wasm modules in one tab meant ten GCs running, heaps unshared, STW pauses uncoordinated — staggering waste.

wasm-GC 提案的解法:让 wasm 模块共享宿主 GC(浏览器里就是 V8 或 SM 自带的 GC),wasm 那边定义 struct 和 array 类型,引擎负责分配 / 回收。2024 年 V8 130、Firefox 120 同时 ship,Kotlin/Wasm 立刻把运行时从 1.4 MB 砍到 400 KB,Dart 团队也在改造。

The wasm-GC proposal's answer: wasm modules share the host's GC (V8's or SM's in browsers); wasm declares struct and array types, the engine allocates and collects. V8 130 and Firefox 120 shipped it simultaneously in 2024. Kotlin/Wasm immediately cut its runtime from 1.4 MB to 400 KB; the Dart team is refactoring.

新增的类型与指令

New types and instructions

(type $Point (struct (field $x f64) (field $y f64)))
(type $Vec   (array (mut i32)))

;; allocate
struct.new $Point   ;; (consume 2 f64 on stack → produce (ref $Point))
array.new  $Vec     ;; (i32 len + i32 init → produce (ref $Vec))

;; access
struct.get $Point $x   ;; pop ref, push f64
array.get   $Vec      ;; pop ref + i32 idx, push elem (get_s/get_u are for packed i8/i16)

;; cast (downcast)
ref.cast (ref $Point)   ;; runtime check, trap on mismatch

;; small ints inline (avoid heap alloc)
ref.i31             ;; pop i32 (31-bit limited), push i31ref
i31.get_s           ;; unbox

三个观察:① 类型等价是结构化的(iso-recursive)——不同模块里形状相同的 $Point 规范化后是同一个类型,子类型关系则需要显式声明;② 引用类型分 nullable 和 non-nullable 两种;③ i31ref 直接借用 V8 的 SMI 技术——31 位整数不分配堆,直接 inline 到 ref 槽位低位,跟 JS 的 Number 互操作几乎免费。

Three notes: ① type equality is structural (iso-recursive) — two same-shape $Point definitions in different modules canonicalise to one type, while subtype relations must be declared explicitly; ② references come in nullable and non-nullable flavours; ③ i31ref borrows V8's SMI trick — 31-bit ints are not heap-allocated, inlined into the low bits of a ref slot, near-free interop with JS Number.
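SMI / i31ref 的 tagging 思路可以用几行 Rust 示意(编码方案是示意性的,V8 实际的 tag 布局不同):

The SMI / i31ref tagging idea, in a few lines of Rust (the encoding here is illustrative; V8's actual tag layout differs):

```rust
// Sketch of the i31ref / SMI idea: a 31-bit int rides in the low bits
// of a pointer-sized word, distinguished from a real heap reference by
// a tag bit. Illustrative encoding only — V8's actual scheme differs.
const TAG_I31: usize = 0b1; // low bit 1 = inline int, 0 = heap pointer

fn i31_new(v: i32) -> usize {
    debug_assert!(v >= -(1 << 30) && v < (1 << 30), "must fit in 31 bits");
    ((v as usize) << 1) | TAG_I31 // no heap allocation anywhere
}

fn i31_get_s(slot: usize) -> i32 {
    debug_assert!(slot & TAG_I31 == TAG_I31);
    (slot as isize >> 1) as i32 // arithmetic shift restores the sign
}
```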

为什么不直接共享 JS 的对象

Why not just share JS objects?

"wasm 跑在 V8 上,直接用 JS Object 不就行了?"——这是另一个常见疑问。答案:JS Object 是动态形状的(隐藏类、IC、加 property 时形状变),wasm 需要静态形状的对象才能做高效 codegen。wasm-GC 的 struct 类型在编译期固定 layout——访问 $Point.$x 就是 mov reg, [ref+8],一条指令,没有 IC,没有 deopt。共享 GC 不等于共享对象模型

"Wasm runs on V8 — why not use JS Object directly?" — another common question. Answer: JS Object is dynamically shaped (hidden classes, ICs, shape changes on property add), but wasm needs statically shaped objects for efficient codegen. A wasm-GC struct has a layout fixed at compile time — accessing $Point.$x compiles to mov reg, [ref+8], one instruction, no IC, no deopt. Sharing the GC ≠ sharing the object model.
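"编译期固定 layout" 在 Rust 里可以直接验证——repr(C) 结构体的字段偏移是编译期常量(offset_of! 自 Rust 1.77 起稳定):

The "layout fixed at compile time" claim is directly checkable in Rust — a repr(C) struct's field offsets are compile-time constants (offset_of! is stable since Rust 1.77):

```rust
// A statically shaped struct: field offsets are compile-time constants,
// which is exactly what lets a wasm-GC struct.get lower to a single
// `mov reg, [ref+8]` — no hidden class, no inline cache, no deopt.
use std::mem::offset_of;

#[repr(C)]
struct Point {
    x: f64, // offset 0
    y: f64, // offset 8 — known before any code runs
}
```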

为什么 GC 提案花了 7 年WHY GC TOOK 7 YEARS GC 是 wasm 史上耗时最长的提案——2017 年立项,2024 年 ship,7 年。难点:① 子类型系统 + cast 语义需要规范化(借鉴了 ML 的子类型 + 自然类型扩展);② 跨模块的类型规范化("不同模块的 $Point 是同一个吗?");③ 与现有 wasm reference types 的兼容(funcref / externref 都得自然嵌入新类型层级);④ V8 / SM / JSC 必须修改自己的 GC 让它能跑 wasm 对象。这是工程量上最大的一个提案,也是最能改变 wasm 生态的一个——Java/Kotlin/Dart 全在重写。 GC was wasm's longest-running proposal — chartered 2017, shipped 2024, seven years. The hardships: ① subtyping + cast semantics had to be formalised (borrowing ML's subtype + natural-type extension); ② cross-module type identity ("are two modules' $Point the same type?"); ③ compatibility with existing reference types (funcref / externref must slot naturally into the new type lattice); ④ V8 / SM / JSC each had to modify their GC to handle wasm objects. The biggest engineering proposal — and arguably the most ecosystem-shifting; Java/Kotlin/Dart are all in the middle of rewriting.
PROPOSAL · COMPONENT MODEL

Component Model — 跨语言的 ABI

Component Model — a cross-language ABI

WIT · interface types · WASI 0.2

WIT · interface types · WASI 0.2

Phase
3 → 4 (2026 target)
IDL
WIT
Types
string, list, record, variant
Runtime
Wasmtime · Jco

"一个 Rust 写的 wasm 怎么调一个 Go 写的 wasm,传一个字符串?"——MVP wasm 无法回答这个问题。原因:wasm 只有 i32/i64/f32/f64/v128,没有 "字符串" 类型。Rust 那边 String(ptr, len, cap) 三件套,Go 那边 string(ptr, len) 两件套,Java 那边 String 是 UTF-16 数组——三方都要用约定俗成的方式把字符串"展开"成 i32 + 长度。每两种语言都要写一套胶水,N² 复杂度。

Component Model 用一个新的组件层解决这个问题。在 .wasm 的 "core module" 之上,加一个 .component 文件,用语言无关的类型系统(string / list<T> / record / variant)声明接口。组件间互调由规范化的 canonical ABI 处理 lift / lower——发送端把语言原生类型 lower 成 ABI 形式,接收端再 lift 回它的原生类型。N 个语言只需要 N 套 binding generator,复杂度从 N² 降到 N。

"How does Rust wasm call Go wasm, passing a string?" — the MVP can't answer. Reason: wasm has only i32/i64/f32/f64/v128, no "string" type. Rust's String is (ptr, len, cap); Go's string is (ptr, len); Java's String is UTF-16 array — each pair of languages needs custom glue. N² complexity.

Component Model solves this with a new component layer. On top of a wasm "core module", a .component file declares interfaces in a language-agnostic type system (string / list<T> / record / variant). Inter-component calls are mediated by a canonical ABI that lifts and lowers — the caller lowers its native type to the ABI shape, the callee lifts back into its native type. N languages need only N binding generators; complexity drops from N² to N.

FIG 21 · component model · String lift/lower Component Model 的 lift/lower:Rust String → Go string A pipeline diagram showing how Rust's String becomes Go's string across the Component Model boundary: Rust lowers into the canonical ABI (UTF-8 ptr+len), Go lifts it back into its native representation. COMPONENT BOUNDARY · STRING LIFT & LOWER Rust Component String { ptr, len, cap } UTF-8 backing buffer heap-allocated, owned ↓ lower i32 ptr, i32 len two wasm scalars on stack Canonical ABI · lane linear-memory wire format addr + 0: "Airing" bytes : 41 69 72 69 6E 67 len : 6 (i32) → shared buffer: caller's linear memory ABI agnostic of language Go Component i32 ptr, i32 len received scalars ↓ lift string { ptr, len } Go's GC-tracked may be copied into Go heap .wit IDL: blur3: func(input: string) -> result<string, error>; · binding generators read this and emit native bindings per language

Component Model 的核心抽象:lower 把语言原生类型(Rust 的 String / Go 的 string / Java 的 String)降成 wasm 原始 scalar(i32 ptr + i32 len)+ linear memory 字节;lift 是反过程。两端语言不知道彼此存在,只跟同一份 WIT 协议握手。N 种语言只需 N 套 binding generator,复杂度从 N² 降到 N。

Component Model's core abstraction: lower takes a language-native type (Rust's String, Go's string, Java's String) and lowers it into wasm scalars (i32 ptr + i32 len) + linear-memory bytes; lift is the inverse. Neither side knows the other exists; both shake hands with the same WIT contract. N languages need only N binding generators — complexity drops from N² to N.
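lower / lift 这对操作可以用一段 Rust 勾勒(用 Vec<u8> 代替调用方的 linear memory;函数名是示意,不是 canonical ABI 的真实入口):

The lower / lift pair, sketched in Rust (a Vec<u8> stands in for the caller's linear memory; the function names are illustrative, not real canonical-ABI entry points):

```rust
// Sketch of canonical-ABI lower/lift for `string`. A Vec<u8> stands in
// for the caller's linear memory; names are illustrative.
fn lower_string(memory: &mut Vec<u8>, s: &str) -> (u32, u32) {
    let ptr = memory.len() as u32;          // bump-allocate in linear memory
    memory.extend_from_slice(s.as_bytes()); // wire format is UTF-8 bytes
    (ptr, s.len() as u32)                   // two wasm scalars: i32 ptr + i32 len
}

fn lift_string(memory: &[u8], ptr: u32, len: u32) -> String {
    let bytes = &memory[ptr as usize..(ptr + len) as usize];
    // The receiving side copies into its own native representation
    // (Go would copy into its GC-tracked heap here).
    String::from_utf8(bytes.to_vec()).expect("canonical ABI strings are UTF-8")
}
```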

WIT — 接口描述语言

WIT — the IDL

// blur.wit — the interface for our image-processor component

package ursb:image@0.1.0;

interface filter {
    record bitmap {
        width:  u32,
        height: u32,
        pixels: list<u8>,
    }

    enum error { invalid-size, oom }

    blur3: func(input: bitmap) -> result<bitmap, error>;
}

world image-tools {
    export filter;
}

wit-bindgen 工具:

Use wit-bindgen:

$ wit-bindgen rust blur.wit --out-dir ./bindings/
# Generates Rust types matching the WIT — `Bitmap { width: u32, ... }` plus a trait you `impl`

$ wit-bindgen go blur.wit --out-dir ./bindings/
# Same component, now in Go

$ wasm-tools component new core.wasm -o blur.component.wasm
# Package the core module + component metadata

每种语言的 binding generator 知道怎么把它的原生类型 marshal 进 ABI 形式:Rust 的 String → (ptr, len),Go 的 string → (ptr, len) 但用 GC 跟踪,Java 的 String → UTF-8 编码后传递。所有这些细节对组件作者完全透明

Each binding generator knows how to marshal native types into the ABI: Rust's String → (ptr, len), Go's string → (ptr, len) tracked by GC, Java's String → UTF-8-encoded. All these details are invisible to the component author.

为什么这件事这么重要

Why this matters so much

WASI 0.2(2024 ship)的所有"系统接口" —— wasi:io / wasi:filesystem / wasi:http / wasi:clocks —— 都用 Component Model 声明。这意味着同一个 wasm 组件 可以在 Wasmtime / Wasmer / Spin / Jco / 浏览器 polyfill 上跑,只要 host 提供对应的 WASI 接口实现。这是真正的"一次编译,处处运行"——比 Java 当年的承诺更彻底,因为它跨语言。Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin 等"边缘计算 wasm" 平台,本质都是 Component Model 的客户

WASI 0.2 (shipped 2024) declares all its "system interfaces" — wasi:io / wasi:filesystem / wasi:http / wasi:clocks — through the Component Model. That means one wasm component runs on Wasmtime / Wasmer / Spin / Jco / a browser polyfill, as long as the host implements the matching WASI interface. The real "compile once, run anywhere" — more thorough than Java's old promise because it's cross-language. Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin — every edge-wasm platform is essentially a Component Model customer.

PROPOSAL · MISC

还有六个提案 — 排队中的扩张

Six more — proposals in the queue

tail-call · EH · memory64 · JSPI · stack-switching · multi-memory

tail-call · EH · memory64 · JSPI · stack-switching · multi-memory

Total tracked
~ 40 proposals
Phase 3-4
~ 12
Phase 2
~ 15
Phase 0-1
remainder

除了 Threads / SIMD / GC / Component Model 这 4 个"明星" 提案外,还有六个对生态影响很大的提案正在不同阶段。下面是 2026 年 5 月的现状快照——这些数字会变,但格局相对稳定。

Beyond the four "headliners" (Threads / SIMD / GC / Component Model), six more proposals materially shape the ecosystem at various phases. A snapshot as of May 2026 — the numbers move, but the landscape is stable.

phase 4 · shipped
tail-call
return_call + return_call_indirect。让函数式语言(Scheme/OCaml)能尾递归到 wasm 不爆栈。V8 2023 ship。
return_call + return_call_indirect. Lets functional langs (Scheme/OCaml) tail-recurse without stack blow-up. V8 shipped 2023.
phase 4 · shipped
exception-handling
try / catch / throw / tag 一套。C++ 异常、Rust panic 现在能"真正" 抛而不是断电。2023 ship。
try / catch / throw / tag. C++ exceptions, Rust panics now genuinely throw rather than abort. Shipped 2023.
phase 3 · flag
memory64
把 wasm 升到 64-bit 寻址,突破 4 GiB 上限。所有引擎 flag 后可用。生态(LLVM, Emscripten)还在迁移。
64-bit wasm addressing, breaks 4 GiB cap. All engines support behind flag; LLVM/Emscripten still migrating.
phase 3 · flag
JSPI (JS Promise Integration)
让 wasm 函数能"await" 一个 JS Promise,栈在等待时挂起。这是真正的 async wasm,Emscripten 之前的 Asyncify 是软件 emulate。
Wasm functions can "await" a JS Promise, suspending the stack mid-call. True async wasm — Emscripten's Asyncify was software emulation.
phase 3
stack-switching
支持 coroutine / fiber / generator 一类的用户态栈切换。Go 的 goroutine 编 wasm 之后曾经因为没有 stack switching 慢得不能用,这个提案要救它。
Supports user-mode stack switching for coroutines / fibers / generators. Go's goroutines compiled to wasm were unusable without it; this proposal rescues them.
phase 2-3
multi-memory
一个模块可以有 N 个独立 linear memory。让 wasm 能像 Unix 进程那样有多个 mmap 区。Component Model 的依赖,2025 ship V8/SM。
A module can have N independent linear memories. Lets wasm behave like a Unix process with multiple mmap regions. A Component Model prerequisite; V8/SM shipped 2025.

tail-call 的具体效果

What tail-call concretely enables

;; before — call + return: stack grows N deep for N tail calls
(func $fact (param $n i32) (param $acc i32) (result i32)
  local.get $n
  i32.eqz
  (if (result i32) (then local.get $acc)
    (else
      local.get $n i32.const 1 i32.sub
      local.get $n local.get $acc i32.mul
      call $fact   ;; ← regular call, stack grows N frames
)))

;; after — return_call: reuse current frame, stack stays at 1 frame
      return_call $fact   ;; ← O(1) stack

没有 tail-call 时,函数式语言只能用 trampoline 模拟尾递归——把 "下一步要调用什么" 当返回值,外层循环里轮询。代码丑、慢 3 倍。tail-call ship 之后 Scheme / Erlang / OCaml 编译到 wasm 才真正可用。

Without tail-call, functional languages had to simulate tail recursion via trampolines — returning "what to call next" and looping in an outer loop. Ugly, 3× slower. Post-tail-call, Scheme / Erlang / OCaml-to-wasm is genuinely usable.
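trampoline 长什么样,用一段 Rust 示意:函数不再自己调用自己,而是把"下一步"作为数据返回,由外层循环驱动——栈保持 O(1),代价是每一步多一次构造 + 分支:

What a trampoline looks like, in a Rust sketch: instead of calling itself, the function returns "the next call" as data and an outer loop drives it — O(1) stack, at the cost of a construction + branch per step:

```rust
// The pre-tail-call workaround: "what to call next" reified as a value,
// polled by an outer loop. O(1) stack depth, but slower and uglier than
// a real return_call.
enum Step {
    Done(u64),
    Call { n: u64, acc: u64 }, // the would-be tail call, as data
}

fn fact_step(n: u64, acc: u64) -> Step {
    if n == 0 { Step::Done(acc) } else { Step::Call { n: n - 1, acc: acc * n } }
}

fn fact_trampoline(n: u64) -> u64 {
    let mut state = Step::Call { n, acc: 1 };
    loop {
        match state {
            Step::Done(v) => return v,
            Step::Call { n, acc } => state = fact_step(n, acc), // bounce
        }
    }
}
```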

JSPI 解决了什么

What JSPI fixes

假设你的 wasm 要 fetch 一个网络资源。在 MVP 里你只能:① wasm 调用一个 JS 函数,JS 起 fetch,等 fetch 完后回调 wasm 的另一个函数。代码丑,因为 wasm 函数被切成两半。JSPI 让 wasm 函数能停在中间,等 JS Promise resolved 再继续。引擎在 stack 上记一个 continuation,fetch 完后从这个 continuation 恢复。对开发者像是同步代码,引擎在底下做了异步

Suppose your wasm wants to fetch a network resource. In the MVP, you could only: ① wasm calls a JS function, JS issues fetch, on completion JS calls back another wasm function. Ugly — your wasm function is cut in half. JSPI lets a wasm function pause mid-execution, await a JS Promise, resume. The engine stores a continuation on the stack; on resolve, it resumes from that continuation. To the developer it reads as sync code; under the hood the engine does async.
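"函数被切成两半"具体是什么样?下面是一段 Rust 示意(函数形状是示意性的,不是真实的 wasm-bindgen 签名)——JSPI 要消灭的正是这种手工拆分:

What "cut in half" concretely looks like, in a Rust sketch (the function shapes are illustrative, not real wasm-bindgen signatures) — this hand-rolled split is exactly what JSPI eliminates:

```rust
// Without JSPI: one logical function split into two exports, with the
// host (JS) gluing them together around the Promise, conceptually:
//   const s = process_start(); fetch(url).then(b => process_resume(s, b));
struct Suspended {
    bytes_wanted: usize, // everything the second half needs to resume
}

// Half 1: runs up to the point where it needs the network, then returns
// its saved state instead of the answer.
fn process_start() -> Suspended {
    Suspended { bytes_wanted: 1024 }
}

// Half 2: the JS fetch callback re-enters here with the fetched bytes.
fn process_resume(s: Suspended, fetched: &[u8]) -> usize {
    fetched.len().min(s.bytes_wanted)
}
```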

ACT VI · SYNTHESIS

把碎片拼成一台机器。

Stitching the fragments back into a machine.

前 22 章拆开看每一道工序;这 4 章把它们拼回来。先写一份性能模型,把"为什么 wasm 比 JS 快" 拆成具体百分比;然后讲怎么用 Chrome DevTools 在 wasm 里设断点、看变量、追 SourceMap;接着是真实战场——Figma / Photoshop / AutoCAD / Ruffle / ffmpeg 这些把 wasm 用到极限的工业级产品;最后一份术语表,把全文出现过的 50 个名词钉死定义。读完这 4 章,你应该能在任何技术讨论里 hold 住 wasm 这个话题。

The previous 22 chapters dissected each stage; these 4 stitch them back. First, a performance model that decomposes "why wasm is faster than JS" into concrete percentages; then Chrome DevTools — setting breakpoints, inspecting locals, following source maps in wasm; then the battlefield — Figma / Photoshop / AutoCAD / Ruffle / ffmpeg, the industrial products that push wasm to the edge; finally a glossary of 50 terms used throughout. By the end, you should be able to hold any wasm conversation.

SYNTHESIS · 01

性能模型 — wasm 为什么(有时)快,为什么(有时)不

Performance model — why wasm is fast (and sometimes isn't)

把工程经验写成公式

turning engineering folklore into a formula

"wasm 比 JS 快多少" 是个没法一句话回答的问题——它依赖于代码模式、引擎版本、SIMD / 多线程是否开。但我们可以写一个分解公式:

"How much faster is wasm than JS" can't be answered in one sentence — it depends on code pattern, engine version, SIMD/threads. But we can write a decomposition formula:

公式 · wasm 性能FORMULA · wasm performance
T_wasm = T_arith · k_simd · k_threads + T_boundary · n_calls + T_memcopy

T_arith = 纯算术部分(wasm 比 JS 快 2 ~ 5×)arithmetic body (wasm 2–5× faster than JS)
k_simd = 1 / 6(开 SIMD 后)(with SIMD)
k_threads = 1 / cores(理想)(idealised)
T_boundary = 每次 JS↔wasm 过桥 ~ 5 ns(2025)JS↔wasm crossing ~5 ns (2025)
T_memcopy = 把数据 copy 进/出 linear memory copying data in/out of linear memory
推论:T_arith 让 wasm 胜,T_boundary 和 T_memcopy 让 wasm 输。当 T_arith 占总时间 > 80% 时 wasm 必赢;占 < 20% 时 wasm 输给 JS。 Implication: T_arith makes wasm win; T_boundary + T_memcopy make wasm lose. When T_arith dominates (>80%) wasm wins; when it's a sliver (<20%) wasm loses to JS.
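把这个分解公式写成可运行的 Rust(所有常数取自上文,均为示意值):

The decomposition formula as runnable Rust (all constants are illustrative, taken from the text above):

```rust
// The decomposition formula as code. All constants are illustrative.
struct Workload {
    t_arith_ns: f64,   // pure arithmetic time, scalar single-thread
    k_simd: f64,       // e.g. 1.0/6.0 with SIMD, 1.0 without
    k_threads: f64,    // e.g. 1.0/cores idealised, 1.0 single-thread
    n_calls: f64,      // JS <-> wasm boundary crossings
    t_memcopy_ns: f64, // copying data in/out of linear memory
}

const T_BOUNDARY_NS: f64 = 5.0; // per crossing, V8 circa 2025

fn t_wasm(w: &Workload) -> f64 {
    w.t_arith_ns * w.k_simd * w.k_threads + T_BOUNDARY_NS * w.n_calls + w.t_memcopy_ns
}
```

图像滤镜场景里 T_arith 被 k_simd · k_threads 压缩到原来的零头;DOM diff 一类的负载里 n_calls 项占主导,SIMD 帮不上忙。In the image-filter scenario T_arith shrinks by k_simd · k_threads; in a DOM-diff-like workload the n_calls term dominates and SIMD cannot help.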

三种场景的具体数字

Three scenarios, concrete numbers

图像滤镜(arithmetic-bound)
Image filter (arithmetic-bound)
1920×1080 卷积,T_arith 占 95%,T_boundary 一次,T_memcopy 一次(数据已经在 wasm memory)。wasm + SIMD + 4 线程 = 比 JS 快 25 倍。这是 wasm 的甜区
1920×1080 convolution: T_arith ~95%, T_boundary once, T_memcopy once (data already in wasm mem). wasm + SIMD + 4 threads → 25× faster than JS. The sweet spot.
DOM tree diff(boundary-bound)
DOM tree diff (boundary-bound)
每个 DOM 节点要 readback,T_boundary 占 70%。wasm 比 JS 慢 70%用 wasm 做 React 是错误的工程方向
Every DOM node needs readback; T_boundary ~70%. Wasm is 70% slower than JS. Using wasm to write React is the wrong engineering direction.
JSON parsing(memcopy-bound)
JSON parsing (memcopy-bound)
JSON 来自 JS 字符串,要 copy 进 wasm memory,parse 完后还要把结果 copy 回去。T_memcopy 占 50%。wasm 持平甚至略慢于 V8 原生 JSON.parse——因为 V8 的 JSON 路径已经极度优化。
JSON arrives as JS string, copies into wasm memory, parses, results copy back. T_memcopy ~50%. wasm ties or slightly loses to V8's native JSON.parse — V8's JSON path is already extremely tuned.

"wasm 慢启动" 的误解

The "wasm has slow startup" myth

人们以为 wasm 启动慢——其实Liftoff 让 wasm 启动比 JS 还快。一个 1 MB 的 wasm 模块,Liftoff ~ 100 ms 出码就能跑;一个 1 MB 的 minified JS,V8 要 parse + Ignition + 进 inline cache,~ 200 ms 才稳定。wasm 启动从 2018 年起就不再是性能问题。剩下的延迟主要是下载——文件大小决定的,不是 wasm 的错。

People assume wasm startup is slow — actually Liftoff makes wasm boot faster than JS. A 1 MB wasm module: Liftoff ~100 ms to runnable. A 1 MB minified JS: V8 parses + Ignition + IC warmup ~200 ms to steady state. Since 2018, wasm startup hasn't been a perf problem. Remaining latency is download — a function of file size, not wasm's fault.

四个反直觉

Four counter-intuitive findings

myth
"wasm 总比 JS 快"
"wasm always beats JS"

假。短函数 + 频繁过桥时 wasm 慢。"wasm 快"是大块计算的快。

False. Short funcs with frequent crossings: wasm loses. "Wasm-fast" describes chunky compute.

myth
"wasm 是 web 专用"
"wasm is web-only"

假。Cloudflare Workers, Spin, Fastly, Shopify Functions 都在服务器跑 wasm,数量已超过浏览器 wasm 模块的总数。

False. Cloudflare Workers, Spin, Fastly, Shopify Functions all run server-side wasm — collectively more module-instances than the browser.

myth
"wasm 不能用 GC 语言"
"GC languages can't use wasm"

2024 年前是,现在不是。wasm-GC ship 之后,Kotlin/Wasm 已经是生产就绪。

True pre-2024, false now. Post-wasm-GC, Kotlin/Wasm is production-ready.

myth
"wasm = Rust"
"wasm = Rust"

假。Rust 是 wasm 最大的语言,但 C++(Emscripten)、Go(TinyGo)、Kotlin、AssemblyScript、Swift 都能编 wasm。

False. Rust is wasm's biggest source language, but C++ (Emscripten), Go (TinyGo), Kotlin, AssemblyScript, Swift all compile to wasm.

SYNTHESIS · 02

DevTools 调试 — DWARF 把符号还给字节

DevTools debugging — DWARF returns names to bytes

name section · source maps · DWARF

name section · source maps · DWARF

编完 wasm 后函数 / 局部变量都变成索引——func 17 而不是 blur3,local 3 而不是 sum。直接在 DevTools 里看一团数字几乎不可能调试。三层调试信息把符号补回来:① name custom section(函数 + locals 名字);② source map(wasm 偏移 → 源文件+行号);③ DWARF custom section(完整的类型信息 + 局部变量映射 + inline 信息)。三层都通过 custom section 加塞进 .wasm,运行时不影响,DevTools 启用 "WebAssembly Debugging" 选项后才解析。

After compilation, functions and locals are indices — func 17 not blur3, local 3 not sum. Debugging a wall of numbers in DevTools is near-impossible. Three layers of debug info put names back: ① name custom section (function + local names); ② source map (wasm offset → source file + line); ③ DWARF custom section (full type info + local-variable mapping + inline info). All three ride custom sections inside the .wasm — invisible at runtime, parsed when DevTools "WebAssembly Debugging" is enabled.
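这三层都藏在 custom section 里。一个工具怎么找到它们?下面是一段最小的 Rust 示意——逐 section 扫过 .wasm,解 LEB128 长度,挑出 id 为 0x00 的 custom section 并读出名字(真实 parser 要处理的细节远多于此):

All three layers hide in custom sections. How would a tool find them? A minimal Rust sketch — walk the .wasm section by section, decode LEB128 sizes, and pick out id 0x00 custom sections to read their names (a real parser handles far more):

```rust
// Decode one unsigned LEB128 integer: 7 data bits per byte,
// top bit set means "another byte follows".
fn read_leb_u32(bytes: &[u8], pos: &mut usize) -> u32 {
    let (mut result, mut shift) = (0u32, 0);
    loop {
        let b = bytes[*pos];
        *pos += 1;
        result |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 { return result; } // top bit clear = last byte
        shift += 7;
    }
}

/// Names of all custom sections (id 0x00) after the 8-byte header
/// (magic "\0asm" + version). This is where name / sourceMappingURL /
/// .debug_* payloads live.
fn custom_section_names(wasm: &[u8]) -> Vec<String> {
    let mut names = Vec::new();
    let mut pos = 8;
    while pos < wasm.len() {
        let id = wasm[pos];
        pos += 1;
        let size = read_leb_u32(wasm, &mut pos) as usize;
        if id == 0 {
            // A custom section payload starts with its own name.
            let mut p = pos;
            let name_len = read_leb_u32(wasm, &mut p) as usize;
            names.push(String::from_utf8_lossy(&wasm[p..p + name_len]).into_owned());
        }
        pos += size; // skip the payload either way
    }
    names
}
```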

三层的对比

The three layers compared

Layer | Section | What it gives you | Cost (size) | Tooling
name | "name" | 函数 + locals 名字 function + local names | + 1–3 % | built into rustc / wasm-bindgen
source map | "sourceMappingURL" | 行号 ↔ 源文件位置 line ↔ source file position | + 5–10 % | wasm-opt --emit-source-map
DWARF | ".debug_*" | 完整类型 + 变量 + inline types + locals + inlining | + 30–100 % | clang -g · rustc -g · DWARF dumping

Chrome DevTools 启用 wasm 调试

Enabling wasm debug in Chrome DevTools

2020 年 Chrome 88 起,DevTools 集成了 wasm 调试器(由 Google 内部 chrome-devtools-frontend 团队和 Bloomberg 合作开发)。开启步骤:

From Chrome 88 (2020), DevTools includes a wasm debugger (built by Chromium's chrome-devtools-frontend team and Bloomberg). Enable in three steps:

1
编译时加 -g
Add -g at compile time
Rust: RUSTFLAGS="-g" 或者 cargo build --release + [profile.release] debug = true。C++: emcc -g hot.c -o hot.wasm。会让 wasm 体积涨 30–100%,但调试体验质变。
Rust: RUSTFLAGS="-g" or cargo build --release with [profile.release] debug = true. C++: emcc -g hot.c -o hot.wasm. Bumps wasm size 30–100% but transforms the debug experience.
2
安装 C/C++ DevTools Support extension
Install C/C++ DevTools Support extension
Chrome Web Store 装 "C/C++ DevTools Support (DWARF)"。这是个 wasm 写的 plugin,解 DWARF,把它翻成 DevTools 能用的格式。
Install "C/C++ DevTools Support (DWARF)" from the Chrome Web Store. It's itself a wasm plugin: parses DWARF, exposes it to DevTools.
3
DevTools Settings · 勾选 WebAssembly Debugging
DevTools Settings · check WebAssembly Debugging
F12 → ⚙ Settings → Experiments → "WebAssembly Debugging"。Reload。打开 Sources 面板能看到原始 .rs / .c 文件。
F12 → ⚙ Settings → Experiments → "WebAssembly Debugging". Reload. The Sources panel now lists the original .rs / .c files.

能做什么 · 调试体验

What you can do · debug UX

在 .rs 文件里设断点
Breakpoints in .rs files
DevTools 用 source map 把断点行号翻成 wasm bytecode 位置,V8 在那里暂停。
DevTools maps the breakpoint via source map to wasm bytecode position; V8 pauses there.
看局部变量值(Rust 类型)
Inspect locals (Rust types)
Watch 面板显示 sum: u32 = 1842(从 DWARF 解出类型 + 当前寄存器/栈位置 + 字节解码)。
Watch shows sum: u32 = 1842 (decoded via DWARF type + register/stack location + byte interpretation).
Step into / over / out
Step into / over / out
行级单步,因为 source map 知道每条 wasm op 属于哪一源码行。inline 函数的 step into 也工作(DWARF 携带 inline 信息)。
Line-level stepping — source map knows which wasm op belongs to which source line. Step-into through inlined functions works too (DWARF carries inline info).
看 linear memory
Inspect linear memory
Sources 面板的"Memory" tab 显示整片字节,可以跳到地址、按 i8/i16/i32/f32 不同方式查看。
The Memory tab in Sources shows the slab; jump to address, view as i8/i16/i32/f32.
看 Liftoff vs TurboFan 出码
View Liftoff vs TurboFan output
--print-wasm-code 或者 D8 + --print-code,看具体 x86-64 / ARM64。
--print-wasm-code, or D8 with --print-code for raw x86-64 / ARM64.
关于 source map 的小坑A SOURCE-MAP GOTCHA 如果你的服务器把 .wasm 加了 gzip / brotli,Chrome 默认会拿 decompressed 字节查 source map URL——但 source map 文件本身可能是另一个 URL(hot.wasm.map)。记得 .map 也 deploy 上 CDN,否则 DevTools 会报 "404, falling back to disassembly"。这是新手最常踩的坑之一。 If your server gzips / brotlis the .wasm, Chrome resolves the source-map URL on the decompressed bytes — but the .map file is a separate URL (hot.wasm.map). Deploy the .map alongside the .wasm, or DevTools shows "404, falling back to disassembly". A common newbie trap.
SYNTHESIS · 03

现实战场 — wasm 在工业级产品里的样子

Real-world battlefields — wasm in production at scale

Figma · Photoshop · AutoCAD · Ruffle · ffmpeg

Figma · Photoshop · AutoCAD · Ruffle · ffmpeg

这一章不讲技术,讲产品。每一个下面的案例都是 wasm 跨过工程门槛、跑在百万级用户上的真实证据。它们一起构成了"这是 wasm 能做的事"的最有力证明。

This chapter isn't technical — it's about products. Each case below is real evidence of wasm clearing the engineering bar to ship to millions of users. Together they form the strongest argument for what wasm can actually do.

Figma — 第一个 web wasm 真实成功故事

Figma — the first true wasm-on-web success

Figma 2016 年上线时,渲染引擎已经是 C++ 编到 asm.js 跑在浏览器里。2017 年 wasm MVP ship 后立刻迁移到 wasm——启动速度提升 3 倍,文件加载提升 2 倍。Evan Wallace(Figma 联合创始人)在博客里写过:"without WebAssembly, Figma would not exist"。Figma 的整个矢量编辑、 canvas 渲染、协作 OT 算法都在 wasm 里——只有 UI 是 React。它定义了"wasm-first 应用" 的工程模板

Figma launched in 2016 with its rendering engine already compiled from C++ to asm.js. Post-wasm MVP in 2017 it migrated immediately — 3× startup speedup, 2× file load. Co-founder Evan Wallace wrote on the blog: "without WebAssembly, Figma would not exist". Figma's vector editing, canvas rendering, and collaborative OT all run inside wasm — only the UI is React. It defined the engineering template for "wasm-first apps".

Photoshop Web — 30 万行 C++ 的搬运

Photoshop Web — porting 300 K lines of C++

2023 年 Adobe 把 Photoshop 的 pixel pipeline 编译到 wasm,在 Chromium 上开始公测。模块大小:70 MB(gzip 后 18 MB)。用了 wasm threads + SIMD + 多 memory + JSPI。其中最大的工程难点是 Photoshop 自带的内存分配器(jemalloc)要从假定有 mmap 的 native 环境改为 wasm 的 linear memory——他们花了 9 个月把 jemalloc 移植成"wasm 友好" 的版本。Photoshop Web 是目前为止编到 wasm 的最大商业代码库之一。

In 2023 Adobe compiled Photoshop's pixel pipeline to wasm and opened public beta on Chromium. Module size: 70 MB (18 MB gzipped). Uses wasm threads + SIMD + multi-memory + JSPI. The hardest engineering hurdle was porting Photoshop's bundled allocator (jemalloc) from a mmap-assuming native world to wasm's linear memory — 9 months to produce a "wasm-friendly" jemalloc. Photoshop Web is among the largest commercial codebases ever compiled to wasm.

AutoCAD Web — 30M LOC, 35 年历史

AutoCAD Web — 30 M LOC, 35-year history

AutoCAD 1982 年首次发布,代码累计 30M+ LOC。Autodesk 2018 年开始把它编到 wasm,2020 年正式上线 AutoCAD Web App。移植中最大的挑战不是计算速度,是文件 IO 路径——AutoCAD 假定有本地文件系统,wasm 在浏览器里没有,要用 OPFS(Origin Private File System)和 fetch API 模拟。这是WASI 0.2 的 wasi:filesystem 在浏览器里也有用的原因。

AutoCAD shipped in 1982 with 30 M+ cumulative LOC. Autodesk began compiling to wasm in 2018; AutoCAD Web launched in 2020. The biggest port hurdle wasn't compute speed — it was the filesystem path. AutoCAD assumes a local FS; wasm in the browser has none, so they shim via OPFS (Origin Private File System) and fetch. This is why WASI 0.2's wasi:filesystem matters in the browser too.

Ruffle — Flash 的还魂

Ruffle — Flash, revived

Adobe Flash 2020 年正式 EOL。但无数 90s/00s 的网页游戏 + 互动课件 + 文化档案因此面临"不能再打开"的危机。Ruffle 是一个用 Rust 写的 Flash player,编到 wasm,在浏览器里跑——纯客户端,不需要 Adobe 任何东西。在 Internet Archive 上,50 万个 .swf 游戏 / 视频已经借它"复活"。Ruffle 是 wasm 在文化遗产保存方向最暖的一个故事。

Adobe Flash reached EOL in 2020. But countless 90s/00s web games + interactive coursework + cultural archives faced "cannot be opened again". Ruffle is a Rust-written Flash player, compiled to wasm, running in the browser — pure client-side, Adobe-free. On the Internet Archive, 500 K .swf games and videos have "resurrected" via Ruffle. The warmest wasm story — preservation of cultural heritage.

ffmpeg.wasm — 视频编辑到客户端

ffmpeg.wasm — video editing in the client

把 ffmpeg(约 100 万行 C)编到 wasm。生成的 .wasm 大约 25 MB(gzip 后 6 MB)。性能大概是 native ffmpeg 的 40~60%——主要差距在 SIMD 不完全(ffmpeg 用了大量 AVX-512,wasm SIMD 只到 128 bit)。但 client-side 视频转码、抠图、字幕合成全部可以做。1Password、Loom、Riverside、CapCut Web 都集成了 ffmpeg.wasm。

ffmpeg (~1 M lines of C) compiled to wasm. Result: ~25 MB .wasm (6 MB gzipped). Perf is 40–60% of native ffmpeg — the gap mostly from SIMD shortfall (ffmpeg uses heavy AVX-512; wasm SIMD caps at 128 bit). Even so, client-side video transcoding, chroma key, subtitle compositing are all on the table. 1Password, Loom, Riverside, CapCut Web all embed ffmpeg.wasm.

服务器端 wasm — 另一半故事

Server-side wasm — the other half of the story

Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin, Fermyon Cloud, NGINX Unit ngx_wasm。这些边缘计算平台不用容器,用 wasm 实例——冷启动 ~ 1 ms(容器 ~ 100 ms),内存隔离更便宜,可以一台机器跑十万个客户。2024 年起服务器端 wasm 实例的总数超过了浏览器。如果你只看浏览器,你只看到了 wasm 故事的一半。

Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin, Fermyon Cloud, NGINX Unit ngx_wasm. These edge platforms don't use containers — they use wasm instances. Cold start ~1 ms (containers ~100 ms), cheaper memory isolation, 100 K tenants per box. From 2024, server-side wasm instances outnumber browser instances. Watching only the browser sees only half the story.

Battlefield | Source language | Module size | Year | Key feature used
Figma | C++ | ~ 3 MB | 2017 | MVP arithmetic
Google Earth | C++ | ~ 15 MB | 2019 | threads (early)
AutoCAD Web | C++ | ~ 80 MB | 2020 | threads + OPFS
Photoshop Web | C++ | ~ 70 MB | 2023 | threads + SIMD + multi-mem + JSPI
Ruffle | Rust | ~ 2 MB | 2021 | MVP + SIMD
ffmpeg.wasm | C | ~ 25 MB | 2019 | SIMD + threads
Blazor | C# | ~ 3 MB AOT | 2020 | GC (custom runtime) → wasm-GC migrating
1Password CLI | Rust | ~ 5 MB | 2022 | WASI
Cloudflare Workers | any | variable | 2018 | server-side, 1 ms cold start
SYNTHESIS · 04 · FINAL

术语表 — 50 个名词,钉死定义

Glossary — 50 terms, pinned definitions

读完这一章你能 hold 住任何 wasm 讨论

after this, you can hold any wasm conversation

WebAssembly / wasm virtual ISA
2015 年 W3C 设计的虚拟指令集 + 二进制格式 + 沙箱执行模型。当作"一种新 CPU" 来理解。W3C-designed (2015) virtual ISA + binary format + sandboxed execution model. Think of it as "a new CPU".
asm.js 2013 typed JS subset
wasm 的直接前身。一个加 "use asm" 指令的 JS 子集,引擎可以 AOT 编译。Firefox 实测可跑进 native 速度的 1.5 倍以内。Wasm's direct predecessor. A JS subset marked with "use asm" that engines can AOT-compile. Firefox measured it within ~1.5× of native speed.
MVP Minimum Viable Product · 2017-03
wasm 1.0,2017 年 3 月在 Chrome/Firefox/Safari/Edge 同时 ship。只有 4 种数字类型、没有 threads/SIMD/GC/EH。Wasm 1.0, shipped 2017-03 in Chrome/Firefox/Safari/Edge simultaneously. Four numeric types only — no threads/SIMD/GC/EH.
栈机Stack machine stack-based VM
指令操作数从隐式栈 pop,结果 push 回去。wasm 选了这种(对比寄存器机),为了字节密度 + 单遍 codegen。Operands implicitly pop from a stack, results push back. Wasm chose stack (vs register) for byte density + single-pass codegen.
LEB128 Little Endian Base-128
变长整数编码,每字节 7 位载数据,最高位标续。小整数 1 字节,大整数最多 5 字节。wasm 所有整数都用它。Variable-length integer encoding: 7 data bits per byte, top bit = continuation. Small ints in 1 byte, large in up to 5. Used for every wasm integer.
Module compiled artifact
编译产物,immutable。装代码、类型、import/export 声明。一个 Module 可以创建多个 Instance。Compiled artifact, immutable. Holds code, types, import/export decls. One Module → many Instances.
Instance runtime entity
从 Module 实例化后的运行时实体,有自己的 memory / table / globals,可调用其 exports。Runtime entity instantiated from a Module. Owns its memory / tables / globals, exposes exports.
线性内存Linear memory flat byte slab
连续字节数组,从地址 0 开始,以 64 KiB page 为单位扩展。wasm 唯一的内存空间。A contiguous byte array from address 0, growable in 64 KiB pages. Wasm's only memory space.
Page 64 KiB
linear memory 的扩展单位,固定 64 KiB(2^16 byte)。wasm32 最大 65536 page = 4 GiB。Linear memory's grow unit, fixed at 64 KiB (2^16 bytes). wasm32 caps at 65 536 pages = 4 GiB.
TableTable function pointer array
funcref / externref 的数组,通过 call_indirect 索引调用。是 C 函数指针 / C++ vtable 在 wasm 里的形式。An array of funcref / externref values, indexed via call_indirect. The wasm representation of C function pointers / C++ vtables.
funcref / externref reference types
funcref = wasm 函数的不透明引用;externref = 宿主对象(JS Object / DOM)的引用。2021 提案 ship。funcref = opaque ref to a wasm function; externref = ref to a host object (JS Object / DOM). Shipped 2021.
Custom section id 0x00
名字 + 任意 payload。给 debug info (DWARF / source map) / name / vendor data 留的逃生舱口。Name + arbitrary payload. The escape hatch for debug info (DWARF / source map), name section, vendor data.
name section debug names
最简单的一种 custom section,给函数 / locals / globals 一个 UTF-8 名字。让 DevTools 显示 blur3 而不是 func 17The simplest custom section: gives UTF-8 names to functions / locals / globals. DevTools shows blur3 instead of func 17.
DWARF debug format
Unix 历史的调试信息格式,wasm 借用。承载完整类型信息 + 变量映射 + inline 信息。让 DevTools 能在 .rs 源码层调试。Unix-heritage debug format, borrowed by wasm. Carries full type info + variable mapping + inline info. Enables source-level debug in .rs / .c files.
Liftoff V8 baseline JIT · 2018
V8 的 Tier-0 wasm 编译器。单遍扫字节出机器码,不做寄存器分配、不做优化,~10 MB/s codegen。V8's Tier-0 wasm compiler. Single-pass scan to machine code, no register allocation, no optimisation, ~10 MB/s codegen.
TurboFan V8 optimising JIT
V8 的 Tier-1 wasm 编译器,sea-of-nodes IR,目标"原生 80%"。后台跑,编完原子替换 Liftoff 版。V8's Tier-1 wasm compiler, sea-of-nodes IR, targeting "80% of native". Runs in background; atomic swap when done.
Turboshaft TurboFan's successor
2022 启动的项目,改 sea-of-nodes 为线性 IR,改善 cache locality。2023 起 V8 wasm 默认走 Turboshaft。2022-launched project replacing sea-of-nodes with linear IR for cache locality. V8 wasm runs Turboshaft by default since 2023.
Cranelift Wasmtime's compiler
Bytecode Alliance 主导的 wasm 编译器,Rust 写,Wasmtime / Wasmer 共用。比 LLVM 快,比 TurboFan 优化弱。Bytecode-Alliance-led wasm compiler, written in Rust, shared by Wasmtime / Wasmer. Faster than LLVM, weaker optimisation than TurboFan.
BBQ / OMG JavaScriptCore tiers
Safari 的 wasm 双层 JIT。Build Bytecode Quickly(BBQ)= 基线,Optimized Machine Generator(OMG)= 优化。Safari's wasm two-tier JIT. Build Bytecode Quickly (BBQ) = baseline, Optimized Machine Generator (OMG) = optimising.
流式编译Streaming compilation compile-while-downloading
WebAssembly.compileStreaming(fetch(...))。每收一段就编一段,不等下载完。WebAssembly.compileStreaming(fetch(...)). Compile each chunk as it arrives, don't wait for the full file.
tier-uptier-up Liftoff → TurboFan
函数被调用一定次数后,Liftoff 版被后台 TurboFan 重编译版替换。After a function reaches a call-count threshold, its Liftoff version is replaced by a background TurboFan recompile.
trampoline / wrappertrampoline / wrapper JS ↔ wasm bridge
JS 与 wasm 调用约定不同,中间需要 trampoline 处理 SMI 解包、calling convention 转换。2025 年 V8 内 5 ns。JS and wasm have different calling conventions; a trampoline mediates — SMI unboxing, ABI swap. 5 ns in V8 (2025).
type-stack validator state
验证器维护的一个值类型栈,模拟运行时栈每个槽位的类型。O(n) 单遍扫完证明类型安全。A stack of value types maintained by the validator, simulating the runtime stack's type per slot. O(n) single pass proves type safety.
结构化控制Structured control block / loop / br k
wasm 没有 goto,只有 block / loop / ifbr k。让验证可以单遍完成。No goto; only block / loop / if + br k. Enables single-pass validation.
边界检查Bounds check trap on overrun
每次 load/store 必须在 memory 范围内,否则 trap。现代引擎用虚拟保留页 + signal handler 实现,免显式 cmp/jcc。Every load/store must stay within memory; else trap. Modern engines use guard pages + signal handlers, removing explicit cmp/jcc.
陷阱Trap unrecoverable abort
wasm 的"不可恢复" 异常:越界、除零、type cast 失败。通过 JS 表现为 WebAssembly.RuntimeErrorAn "unrecoverable" abort: bounds, div-by-zero, failed type cast. Surfaces as WebAssembly.RuntimeError on the JS side.
SIMD v128 128-bit vector
2021 ship 的 128 位向量类型,可解释为 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64。LLVM lower 到 x86 SSE2 / ARM NEON。128-bit vector type shipped 2021. Reinterpretable as 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64. LLVM lowers to x86 SSE2 / ARM NEON.
Relaxed SIMD slightly non-deterministic
2024 加的 SIMD 子集,允许结果在不同 CPU 上差 1 ulp。为了让 JIT 选最快的硬件 op。SIMD subset added 2024 — results may diverge by 1 ulp across CPUs. Lets the JIT pick the fastest hardware op.
SharedArrayBuffer (SAB) shared memory
JS 与 wasm worker 间共享内存的载体。需要 COOP+COEP 头才能用(Spectre 缓解)。The carrier for shared memory across JS and wasm workers. Requires COOP+COEP headers post-Spectre.
atomics lock-free primitives
wasm 的 i32.atomic.* / memory.atomic.wait/notify,出码 x86 LOCK 前缀指令 / ARM acquire-release op。Wasm's i32.atomic.* / memory.atomic.wait/notify; emit x86 LOCK-prefixed ops or ARM acquire/release.
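The JS-side counterpart is the Atomics API over a SharedArrayBuffer; wasm's i32.atomic.* lower to the same hardware primitives. A minimal sketch — no worker is needed to see the semantics:

```javascript
// One shared i32 slot; in production this buffer would be postMessage'd
// to workers, and the same Atomics calls would stay race-free.
const sab = new SharedArrayBuffer(4);
const view = new Int32Array(sab);
Atomics.store(view, 0, 40);   // atomic write
Atomics.add(view, 0, 2);      // atomic read-modify-write (x86: LOCK ADD)
Atomics.load(view, 0);        // 42
```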
wasm-GC first-class managed types · 2024
让 wasm 模块共享宿主 GC,有 struct / array / i31ref 类型。Java/Kotlin/Dart 终于不用背 GC 运行时。Lets wasm modules share the host GC, with struct / array / i31ref types. Java/Kotlin/Dart no longer ship their own GC runtime.
i31ref inline small int
31 位整数 inline 进 ref 槽位低位,不分配堆,跟 JS 的 SMI 完全互通。A 31-bit int inlined into the low bits of a ref slot — no heap allocation, fully interop with JS SMIs.
Component Model cross-language ABI
2026 phase 4 目标。给 wasm 一个语言无关的接口类型系统(string / list / record / variant),组件间互调由 lift/lower 自动处理。Phase 4 target for 2026. Gives wasm a language-agnostic interface type system (string / list / record / variant); inter-component calls mediated by lift/lower.
WIT Wasm Interface Type IDL
Component Model 的接口描述语言。一份 .wit 文件 + wit-bindgen 生成 N 种语言的 binding。The Component Model's IDL. One .wit file + wit-bindgen → N language bindings.
WASI WebAssembly System Interface
浏览器外的 wasm "系统调用" 接口集。WASI 0.2(2024 ship)用 Component Model 重写,接口包括 wasi:io / wasi:filesystem / wasi:http / wasi:clocks。The "syscall" interface set for non-browser wasm. WASI 0.2 (2024) rewrote it via Component Model — wasi:io / wasi:filesystem / wasi:http / wasi:clocks.
JSPI JS Promise Integration
让 wasm 函数能在调用中 await 一个 JS Promise。引擎在栈上记 continuation,resolve 后恢复。Lets a wasm function await a JS Promise mid-call. The engine saves a continuation on the stack and resumes after resolve.
tail-call return_call
2023 ship。return_call + return_call_indirect 给函数式语言(Scheme/OCaml)做 O(1) 栈的尾递归。Shipped 2023. return_call + return_call_indirect give functional langs (Scheme/OCaml) O(1)-stack tail recursion.
exception-handling try / catch / throw / tag
2023 ship。让 C++ 异常 / Rust panic 在 wasm 内真正能 throw + catch,而不是 abort。Shipped 2023. Lets C++ exceptions / Rust panics throw + catch within wasm rather than abort.
memory64 64-bit linear memory
phase 3,flag 后可用。把 wasm 升级到 64 位寻址,突破 4 GiB 上限。Phase 3, available behind flag. Upgrades wasm to 64-bit addressing, breaking the 4 GiB cap.
multi-memory multiple linear memories per module
让一个 module 有多个独立 linear memory。Component Model 的依赖,2025 ship V8/SM。Lets one module have multiple independent linear memories. A Component Model prerequisite; V8/SM shipped 2025.
stack-switching user-mode fiber
支持 coroutine / fiber / generator。给 Go 的 goroutine 移植到 wasm 用。phase 3。Supports coroutines / fibers / generators — required for Go's goroutines on wasm. Phase 3.
Wasmtime standalone runtime
Bytecode Alliance 的 wasm 运行时(Rust),用 Cranelift 编译。Cloudflare Workers / Spin 都用它。Bytecode Alliance's wasm runtime in Rust, using Cranelift. Powers Cloudflare Workers / Spin.
Wasmer embedded runtime
另一个独立 wasm 运行时,Rust 写,可选 Cranelift / LLVM / Singlepass 后端。Another standalone wasm runtime, Rust-written, with Cranelift / LLVM / Singlepass backends.
WAMR WebAssembly Micro Runtime
Intel 主导的 IoT/嵌入式 wasm 运行时。可选解释器 / fast-interp / AOT 模式。Intel-led wasm runtime for IoT / embedded. Choose interpreter / fast-interp / AOT mode.
Emscripten C/C++ toolchain
Alon Zakai 2011 起的 C/C++ → wasm 工具链。提供 stdlib / SDL / glue JS,被 Photoshop / AutoCAD 用。Alon Zakai's 2011-onwards C/C++ → wasm toolchain. Provides stdlib / SDL / glue JS, used by Photoshop / AutoCAD.
wasm-bindgen Rust ↔ JS bridge
Rust 生态的标杆工具。把 Rust 函数 / 结构暴露给 JS,自动生成胶水。The standard Rust-side tool. Exposes Rust functions / structs to JS, auto-generates glue.
Binaryen wasm optimiser
wasm IR + 优化器 + post-pass。wasm-opt -O3 来自这里。AssemblyScript 把它当主编译器。A wasm IR + optimiser + post-pass toolchain. wasm-opt -O3 comes from here. AssemblyScript uses it as the main compiler.
wasm32 / wasm64 target triple
LLVM 的 wasm 目标三元组。wasm32 = 32-bit pointer,wasm64 = 64-bit pointer(配合 memory64 用)。LLVM target triples for wasm. wasm32 = 32-bit pointers; wasm64 = 64-bit pointers (paired with memory64).
CG / WG Community Group / Working Group
CG 任何人加入,讨论提案;WG 需要会员资格,投票纳入正式 spec。一个提案 phase 0→2 在 CG,phase 3→4 提到 WG。CG: anyone can join, discusses proposals. WG: membership required, votes proposals into the spec. Proposals at phase 0–2 live in CG; phase 3–4 escalate to WG.
Bytecode Alliance non-browser steward
2019 成立的非营利组织,推动 wasm 在浏览器外的标准化(WASI / Component Model)。Mozilla / Fastly / Cloudflare / Microsoft / Intel 主成员。2019-founded non-profit, stewarding non-browser wasm (WASI / Component Model). Mozilla / Fastly / Cloudflare / Microsoft / Intel are primary members.
读到这里,你已经看完 wasm 从字节到 SIMD 的一生。
下次再有人问 "wasm 是什么"——别用一句话回答。 Field Note · 03 · Final
You have now read the life of wasm, from byte to SIMD.
Next time someone asks "what is wasm" — refuse the one-liner. Field Note · 03 · Final
APPENDIX · 01 · SECURITY

安全模型 — 三层沙箱与 Spectre 之后

Security model — three sandbox layers, and what came after Spectre

type safety · memory safety · CFI


Layers: 3 (type · mem · CFI)
Provable: yes (WasmCert · Isabelle)
Post-Spectre: + COOP+COEP
CVE history: ~ 20 in 8 years

wasm 的"safe"是被严肃证明过的——Conrad Watt 2018 年在 Isabelle/HOL 里把整套规范 mechanise 了一遍,过程中还顺手挑出 spec 里几处 bug。这一章把安全保证拆成三层,顺便讲 Spectre 漏洞如何让 wasm threads 推迟了一年半,以及 wasm 设计里那些反过来的限制——这些限制不是缺点,是故意

Wasm's "safe" has been formally proven — in 2018 Conrad Watt mechanised the whole spec in Isabelle/HOL, discovering spec-level bugs along the way. This chapter splits the safety story into three layers, recounts how the Spectre disclosure shoved wasm threads back by 18 months, and explains the inverted design constraints — limits that are not flaws but deliberate choices.

三层沙箱

Three sandbox layers

类型安全 · 验证证明
Type safety · proven by validation
Ch11 的 type-stack abstract interpretation 保证:运行时栈每一槽位都对应正确类型,不会有 "把 f32 当指针解引用" 这种 UB。形式化证明:任何通过验证的 wasm,不会陷入 type confusion。
Ch11's type-stack abstract interpretation guarantees: every runtime stack slot has the right type — no "dereferencing f32 as pointer" UB. Formally proven: any validated wasm cannot enter type confusion.
内存安全 · 硬件 + signal handler
Memory safety · hardware + signal handler
Ch10 的 4 GiB 虚拟保留 + PROT_NONE,任何越界访问触发 SIGSEGV,V8 翻译为 RuntimeError。wasm 内部不能访问外部内存——因为它根本没有外部内存的指针类型。
Ch10's 4 GiB virtual reservation + PROT_NONE; any OOB touch fires SIGSEGV, V8 maps it to RuntimeError. Wasm can't reach external memory — because it has no pointer type for external memory at all.
控制流完整性 · CFI by design
Control-flow integrity · CFI by design
call_indirect 在 table 里查 funcref 时必须验证目标函数签名匹配,否则 trap。所有跳转目标(br k)都是结构化控制框架内的 frame——不可能跳到任意地址。这给 wasm 提供了 ROP / JOP 攻击免疫——攻击者无法把任意机器码地址塞进 funcref。
A call_indirect looking up a funcref in a table must verify that the target's signature matches, else it traps. Every br k jumps inside the structured control frame — it cannot land at an arbitrary address. This grants immunity to ROP / JOP — attackers cannot stuff arbitrary machine-code addresses into a funcref.
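The first layer is directly observable from JS: WebAssembly.validate rejects a module whose body leaves an f32 where the signature promises an i32. The bytes below are hand-assembled for illustration.

```javascript
const header = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]; // \0asm, version 1
const decls = [
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f, // type: () -> i32
  0x03, 0x02, 0x01, 0x00,                   // one function of that type
];
// i32.const 42 — the stack ends as [i32], matching the declared result.
const okBody = [0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, 0x2a, 0x0b];
// f32.const 1.0 — the stack ends as [f32]: type confusion, rejected.
const badBody = [0x0a, 0x09, 0x01, 0x07, 0x00, 0x43, 0x00, 0x00, 0x80, 0x3f, 0x0b];
const ok = WebAssembly.validate(new Uint8Array([...header, ...decls, ...okBody]));   // true
const bad = WebAssembly.validate(new Uint8Array([...header, ...decls, ...badBody])); // false
```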

Spectre 时刻 — 2018 年 1 月

The Spectre moment · January 2018

2018 年 1 月 3 日,Spectre / Meltdown 漏洞披露。这两个漏洞利用CPU 推测执行 + cache 时序 旁路通道,可以从一个进程读到另一个进程的内存。wasm threads 当时正好处在 phase 3、即将 ship 阶段——共享内存 + 高精度计时器(performance.now() 当时还是 5 µs 精度)就是 Spectre 的完美材料

On 3 Jan 2018, Spectre / Meltdown were disclosed. Both exploit CPU speculative execution + cache timing side channels to read another process's memory. Wasm threads were at phase 3, on the verge of shipping — shared memory + high-precision timers (performance.now() at the time was 5 µs precise) were perfect Spectre ingredients.

所有浏览器在 24 小时内做了两件事:① 把 performance.now() 精度降到 ms 级;② 关闭 SharedArrayBuffer。wasm threads 推迟一年半。最终方案是用进程隔离(COOP/COEP 头)让每个站点跑在独立进程里——旁路通道泄漏只能泄漏自己的数据,无意义。这是 web 平台史上第一次因为硬件漏洞推迟了一个软件特性

All browsers shipped two fixes within 24 hours: ① coarsen performance.now() to ms precision; ② disable SharedArrayBuffer. Wasm threads slid 18 months. The eventual fix used process isolation (COOP/COEP headers) — each site runs in its own process, so side-channel leaks only reveal its own data. The first time the web platform delayed a software feature because of a hardware vulnerability.

WasmCert · 把规范变成定理WASMCERT · TURNING THE SPEC INTO A THEOREM Conrad Watt 在 2018 年 CPP 论文里把 wasm 整个规范 mechanise 到 Isabelle/HOL,证明了"well-typed wasm 不会 stuck"(progress + preservation 定理)。过程中他发现了 spec 文本里几处错误,直接 PR 到 webassembly/spec 仓库。这是 web 平台第一个有完整 mechanised proof 的标准。后续工作还把 V8 / SpiderMonkey 的 wasm 实现 verify 了关键部分。 Conrad Watt's CPP 2018 paper mechanised the entire wasm spec in Isabelle/HOL and proved "well-typed wasm doesn't get stuck" (progress + preservation). The exercise turned up several spec-level errors, which he PR'd into webassembly/spec. The first web-platform standard with a complete mechanised proof. Follow-on work verifies the critical paths of V8 / SpiderMonkey's wasm implementations.

真实 CVE 历史(2017-2025)

Real CVE history · 2017–2025

Year · CVE · What · Layer
2018 · CVE-2018-6065 · V8 wasm interpreter 整数溢出 V8 wasm interpreter int overflow · impl bug
2020 · CVE-2020-9802 · JSC wasm 类型混淆 JSC wasm type confusion · impl bug
2021 · CVE-2021-21195 · V8 wasm UAF V8 wasm use-after-free · impl bug
2022 · CVE-2022-4135 · V8 wasm heap buffer overflow · impl bug
2023 · CVE-2023-2935 · V8 wasm 类型混淆,sandbox 逃逸 V8 wasm type confusion, sandbox escape · impl bug
2024 · CVE-2024-11116 · V8 Turboshaft wasm OOB · impl bug

注意一个模式:所有 CVE 都是引擎实现 bug,没有规范级漏洞——这正是 mechanised proof 的胜利。spec 是数学上正确的,但 V8/SM/JSC 必须把它落地到 C++ 代码,这一步会出错。各浏览器现在都跑 fuzzing 工具(wasm-mutateOSS-Fuzz)持续测试,每个月在主线分支上跑数万 CPU 小时。

A pattern: every CVE is an implementation bug, never a spec-level hole — the win of mechanised proof. The spec is mathematically sound; V8/SM/JSC must land it in C++ and that's where errors creep in. Browsers now run continuous fuzzing (wasm-mutate, OSS-Fuzz) — tens of thousands of CPU-hours per month on the main branches.

APPENDIX · 02 · SERVER

服务端 wasm — 另一半故事,1 ms 冷启动的诱惑

Server-side wasm — the other half of the story, the 1 ms cold start

CF Workers · Spin · Fermyon · Wasmtime


到 2024 年,全球服务端 wasm 实例数量超过了浏览器 wasm 模块的总数——但大多数前端工程师不知道这件事。这一章把视野从浏览器移开。服务端 wasm 解决了一个不同的问题:容器太慢、太重——一个 Docker 容器冷启动 100 ms-数秒,而一个 wasm 实例 1 ms。当你想跑 10 万个客户的 isolated 代码在同一台机器,这个差距决定了一个商业模式能不能成立。

By 2024, global server-side wasm instance counts had overtaken browser wasm module counts — but most front-end engineers don't know this. This chapter shifts the focus off the browser. Server-side wasm solves a different problem: containers are too slow, too heavy — a Docker container cold-starts in 100 ms to seconds, a wasm instance in ~1 ms. When you want to run 100 K customers' isolated code on one machine, that gap decides whether a business model is viable.

六大平台对比

Six platforms compared

Platform Runtime Cold start Mem limit Isolation WASI 0.2?
Cloudflare Workers V8 isolates ~ 5 ms 128 MiB V8 isolate partial
Fastly Compute@Edge Wasmtime (formerly Lucet) ~ 1 ms 128 MiB per-instance yes
Fermyon Spin Wasmtime ~ 1 ms config per-instance yes
Shopify Functions Wasmtime ~ 5 ms 10 MiB strict partial
NGINX Unit ngx_wasm WAMR / Wasmtime ~ 2 ms config per-request partial
Wasmtime (standalone) Cranelift ~ 0.5 ms 4 GiB (wasm32) process yes

为什么不是容器

Why not containers

COLD START · log scale · 2025 benchmarks
VM (Firecracker): ~ 125 ms
Container: ~ 50 ms
V8 isolate: ~ 5 ms
Wasm (Wasmtime): ~ 1 ms
Wasm (Cranelift cached): ~ 0.3 ms

数字差 2 个数量级。这让 wasm 在函数即服务(FaaS)场景里成为唯一可行的隔离方案——AWS Lambda 用容器,冷启动 100 ms~3 s 是真实痛点;Cloudflare Workers 用 V8 isolate(算 wasm 半亲戚),冷启动 5 ms;Fastly 用 Wasmtime,1 ms。同样的代码,延迟差 100 倍

Two orders of magnitude difference. That makes wasm the only viable isolation model for function-as-a-service — AWS Lambda runs containers, cold-starts of 100 ms–3 s are a real pain point; Cloudflare Workers run V8 isolates (a wasm half-sibling) at ~5 ms; Fastly runs Wasmtime at ~1 ms. Same code, 100× latency gap.

WASI 0.2 — 服务端 wasm 的"统一系统接口"

WASI 0.2 — the "unified system interface" for server wasm

浏览器 wasm 通过 import 拿到 JS 函数;服务端 wasm 通过 import 拿到系统接口——文件读写、网络、时钟、随机数。MVP 时代每家平台都自己定义,Cloudflare 的 API ≠ Fastly 的 API ≠ Wasmtime 的 API。WASI 0.2 (2024 ship) 用 Component Model 把这套接口标准化成一组 .wit 文件:wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets同一个 wasm 组件可以跑在所有支持 WASI 0.2 的平台——这才是真正的 "compile once, run anywhere"。

Browser wasm gets JS functions via import; server wasm gets system interfaces via import — file I/O, networking, clocks, randomness. In the MVP era every platform defined its own; Cloudflare's API ≠ Fastly's ≠ Wasmtime's. WASI 0.2 (shipped 2024) standardised them as Component Model .wit files: wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets. One wasm component runs on every WASI 0.2-compliant platform — the real "compile once, run anywhere".

服务端 wasm 的真实限制

Real limitations of server-side wasm

A
CPU 时长上限
CPU time cap
大多数平台限 5-50 ms 单次执行(Cloudflare 50 ms,Fastly 50 ms,Shopify 5 ms)。跑机器学习推理? Forget it. 长任务要拆成多次调用或异步。
Most platforms cap per-invocation CPU at 5–50 ms (Cloudflare 50 ms, Fastly 50 ms, Shopify 5 ms). ML inference? Forget it. Long tasks must split into multiple invocations or run async.
B
no fork · no thread (除非 WASI threads)
no fork · no thread (unless WASI threads)
wasm 没有 fork(),WASI 0.2 也没标准化 threads。Go 程序的 goroutine、Node.js 的 worker thread 在 wasm 里都失效——除非用 stack-switching 提案(还在 phase 3)。
No fork(); WASI 0.2 still doesn't standardise threads. Go goroutines and Node.js worker threads all break in wasm, until stack-switching ships (still phase 3).
C
Memory 上限 4 GiB (wasm32)
Memory ceiling 4 GiB (wasm32)
服务端运行大型 ML 模型(GPT 类)立刻撞 4 GiB 上限。memory64 提案在多数平台还是 flag 下面。
Running large ML models (GPT-style) hits the 4 GiB cap immediately. memory64 is still behind flags on most platforms.
APPENDIX · 03 · LIMITS

wasm 不能做什么 — 反向定义工程边界

What wasm cannot do — defining the boundary in reverse

七个硬限制及它们的绕过办法

seven hard limits and how to route around them

这一章倒过来定义 wasm。前 28 章描述了 wasm 做什么,这一章列七件它结构性做不到的事——以及工程上怎么绕。这些"不能"不是 bug 是 feature,体现了 wasm 的设计哲学:小而硬,而不是大而软。

This chapter defines wasm by negation. The previous 28 described what wasm can do; this one lists seven things wasm structurally cannot — and how engineers route around them. These "cannot"s are features, not bugs — they reflect wasm's design philosophy: small and hard, not big and soft.

1
直接访问 DOM
Direct DOM access
wasm 类型里没有 "DOM Node";只有 i32/i64/f32/f64/v128 和引用。每次 DOM 操作都要跨 wasm/JS 边界,trampoline 成本压过算术加速。绕法:把 DOM diff 留在 JS,wasm 只算"哪些" 要 diff。
Wasm has no "DOM Node" type — only i32/i64/f32/f64/v128 and references. Every DOM op must cross the wasm/JS boundary; trampoline cost devours arithmetic speedup. Workaround: keep DOM diff in JS; let wasm compute which nodes need diffing.
2
GPU 计算
GPU compute
wasm 跑在 CPU 上,不能直接发指令给 GPU。绕法:用 WebGL/WebGPU。Wasm 给 WebGPU 准备 buffer + dispatch shader,但 shader 本身用 WGSL/SPIR-V 写,不是 wasm 字节码。
Wasm runs on CPU and cannot dispatch directly to GPU. Workaround: use WebGL/WebGPU. Wasm prepares buffers and dispatches shaders, but the shader is WGSL/SPIR-V, not wasm bytecode.
3
真正的 async/await(MVP)
Real async/await (MVP)
MVP wasm 没有挂起栈的能力。绕法:Emscripten 的 Asyncify 软件 emulate(把整个 wasm 程序复制一份"反向"版,慢 ~ 50%)。未来:JSPI 提案(phase 3,2025 V8 flag 后可用)。
MVP wasm cannot suspend stacks. Workaround: Emscripten's Asyncify (emulates by duplicating the program "inside out", ~50% slower). Future: the JSPI proposal (phase 3, available behind V8 flag in 2025).
4
观察 GC 内部状态
Observing GC internals
wasm-GC 让 wasm 共享宿主 GC,但 wasm 看不到"什么时候 GC 触发"或"对象的物理地址"。绕法:不需要绕——这正是 wasm 想要的 "意图屏蔽",让 GC 实现自由演进。
wasm-GC shares the host GC, but wasm cannot observe "when GC fires" or "an object's physical address". Workaround: none needed — this is the intentional "opacity" wasm wants, leaving GC implementations free to evolve.
5
线程间共享数据,但不能共享 stack/locals
Threads share memory, but not stack/locals
每个 worker 自己的 Instance 有独立栈和 locals。跨 worker 传值必须走 SharedArrayBuffer。这是 feature——避免了 race condition 的常见来源。
Each worker's Instance owns its stack and locals. Cross-worker values must go through SharedArrayBuffer. This is a feature — it eliminates a common source of race conditions.
6
stable ABI(Component Model 之前)
A stable ABI (pre-Component Model)
两个 .wasm 文件互相调用,Rust 的 String 和 Go 的 string 不兼容——每对语言要自己写 marshal 代码。绕法:用 Component Model + WIT,把 N² glue 复杂度降到 N。
Two .wasm files calling each other: Rust's String and Go's string are incompatible — every pair of languages needs custom marshalling. Workaround: Component Model + WIT drops the N² glue complexity to N.
7
"逃出" 沙箱
Escaping the sandbox
即使用最阴险的内存模式,只要引擎实现没 bug,wasm 也不能访问宿主进程的其他内存。不绕——这是 wasm 的存在意义。如果你需要访问外部,改用 native code + 进程隔离。
No matter how devious the memory pattern, with a bug-free engine, wasm cannot reach the host process's other memory. No workaround needed — this is wasm's reason for being. If you need outside access, switch to native code + process isolation.
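For limit 1, the workaround pattern can be sketched in plain JS: the hot comparison runs over flat typed arrays (the shape a wasm export would see in linear memory), and only a short list of dirty indices crosses the boundary for JS to apply DOM patches. `dirtyIndices` is an illustrative name, not a library API.

```javascript
// In the real split this loop is the wasm side, scanning linear memory;
// JS then touches only the nodes named in the returned index list.
function dirtyIndices(prev, next) {
  const dirty = [];
  for (let i = 0; i < next.length; i++) {
    if (prev[i] !== next[i]) dirty.push(i);
  }
  return dirty; // one boundary crossing total, not one per DOM node
}
```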
wasm 的"不能",定义了它的""。 Field Note · 03 · Appendix
Wasm's "cannot"
defines its "can". Field Note · 03 · Appendix
APPENDIX · STANDARDS

References & Standards — 文章每个论断的出处

References & Standards — sources for every claim

W3C · IETF · IEEE · 学术 · 源码

W3C · IETF · IEEE · academia · source

这一节把全文用到的所有外部标准、规范、论文、源码归档。每条引用带状态(REC = W3C Recommendation,CR = Candidate Recommendation,WD = Working Draft)+ 链接 + 你在哪一章会用到它。所有 URL 在 2026 年 5 月有效;wasm 提案演化快,phase 4 后会迁移到 W3C TR/ 命名空间。

This section archives every external standard, spec, paper, or source-code reference the article touches. Each carries a status pill (REC = W3C Recommendation, CR = Candidate Recommendation, WD = Working Draft) + link + the chapter that needs it. All URLs valid as of May 2026; wasm proposals move quickly, post-phase-4 entries migrate to W3C TR/ namespaces.

A · 核心 W3C 标准

A · Core W3C standards

WASM 2.0 规范族 · 三件套Wasm 2.0 spec family · the trio
Core 2.0
REC W3C TR · WebAssembly Core Specification 2.0 · 字节格式 + 验证 + 执行语义。Ch06-Ch11 全用。Byte format + validation + exec semantics. Used by Ch06–Ch11.
JS API 2.0
REC W3C TR · WebAssembly JavaScript Interface 2.0 · WebAssembly.Module/Instance/Memory/Table/Global 接口。Ch16/17 用。JS-side WebAssembly.* surface. Used by Ch16/17.
Web API 2.0
REC W3C TR · WebAssembly Web API 2.0 · compileStreaming / instantiateStreaming / Response 集成。Ch12 用。compileStreaming / instantiateStreaming / Response integration. Used by Ch12.
Wasm 1.0
REC W3C TR · WebAssembly Core Specification 1.0 · 2019-12 W3C Recommendation。历史参考——MVP 时代的基线。2019-12 W3C Recommendation. Historical baseline — the MVP era reference.

B · 单独提案(每个有自己的 GitHub spec 仓库)

B · Individual proposals (each with its own GitHub spec repo)

phase 4 已 shipphase 4 · shipped
threads
WG github.com/WebAssembly/threads · 原子操作 + SharedArrayBuffer。Ch18。Atomics + SharedArrayBuffer. Ch18.
simd
WG github.com/WebAssembly/simd · v128 + ~250 lane ops。Ch19。v128 + ~250 lane ops. Ch19.
bulk-memory
WG bulk-memory-operations · memory.copy/fill/init 等。Ch07/22。memory.copy/fill/init etc. Ch07/22.
reference-types
WG reference-types · funcref/externref + 多 table。Ch08/22。funcref/externref + multiple tables. Ch08/22.
multi-value
WG multi-value · 函数返回多值。Ch08。Functions returning multiple values. Ch08.
tail-call
WG tail-call · return_call + return_call_indirect。Ch22。return_call + return_call_indirect. Ch22.
exceptions
WG exception-handling · try/catch/throw/tag。Ch22。try/catch/throw/tag. Ch22.
gc
WG github.com/WebAssembly/gc · struct/array/i31ref/ref.cast。Ch08/20。struct/array/i31ref/ref.cast. Ch08/20.
phase 3 · flag 后可用phase 3 · behind flag
memory64
CG memory64 · i64 寻址,突破 4 GiB。Ch10/22。i64 addressing, breaks 4 GiB cap. Ch10/22.
jspi
CG js-promise-integration · wasm await Promise。Ch22。wasm await Promise. Ch22.
stack-switching
CG stack-switching · coroutine/fiber。Ch22。coroutines/fibers. Ch22.
relaxed-simd
WG relaxed-simd · 允许 1 ulp 差异的 SIMD。Ch19。SIMD ops with 1 ulp tolerance. Ch19.
multi-memory
WG multi-memory · 一模块多 memory。Ch22。Multiple memories per module. Ch22.
component-model
CG component-model · 跨语言 ABI + WIT。Ch21。Cross-language ABI + WIT. Ch21.
提案 phase trackerAll proposals · phase tracker
tracker
github.com/WebAssembly/proposals · 所有 40+ 个提案的当前 phase 和实现状态。Current phase and implementation status of all 40+ proposals.
flag
chrome://flags/#enable-experimental-webassembly-features · 在 Chrome 里打开所有 phase 2-3 提案。Enables all phase 2-3 proposals in Chrome.

C · 底层标准依赖(IEEE / IETF / DWARF)

C · Underlying standards (IEEE / IETF / DWARF)

IEEE 754
IEEE IEEE 754-2019 · Floating-Point Arithmetic · f32 / f64 / NaN 传播,wasm spec 直接引用。Ch08。f32/f64/NaN propagation, directly cited by wasm spec. Ch08.
UTF-8
RFC 3629 RFC 3629 · UTF-8 · 所有 import/export 名字、custom section 名字的编码。Encoding of all import/export names and custom section names.
Unicode
Unicode 15.1 · name section 允许的字符集。Character set allowed in the name section.
LEB128
DWARF 5 · §7.6 Variable Length Data · LEB128 不是独立 RFC,定义在 DWARF 5 spec 中。wasm 所有整数用它。Ch06。LEB128 is not a standalone RFC; defined in DWARF 5 §7.6. Every wasm integer uses it. Ch06.
DWARF 5
DWARF Debugging Information Format · v5 · wasm 调试信息(.debug_* custom sections)。Ch24。Wasm debug info (.debug_* custom sections). Ch24.
Source Maps
TC39 · Source Map (ecma-426 draft) · 2024 起在 TC39 标准化;wasm 通过 sourceMappingURL custom section 引用。Ch24。Standardising at TC39 since 2024; wasm references via sourceMappingURL custom section. Ch24.
COOP / COEP
HTML Standard · COOP / COEP · Spectre 缓解的 HTTP 头要求。Ch18。HTTP headers required for Spectre mitigation. Ch18.
SharedArrayBuffer
ECMAScript · SharedArrayBuffer · wasm threads 共享内存的 JS 端 carrier。Ch18。JS-side carrier for wasm threads shared memory. Ch18.

D · WASI / Component Model / 非浏览器

D · WASI / Component Model / non-browser

WASI 0.2
WD WebAssembly/WASI · wasip2 · Preview2 · 用 Component Model 重写的"系统接口"。Ch21/25/28。Preview2 · "system interfaces" rewritten via Component Model. Ch21/25/28.
WASI 0.1
WebAssembly/WASI · preview1 (legacy) · 2019 以来事实标准,Wasmtime / Spin / Cloudflare 都支持。De-facto standard since 2019; supported by Wasmtime / Spin / Cloudflare.
WIT
Component Model · WIT IDL Reference · Component Model 的接口描述语言。Ch21。Component Model's IDL. Ch21.
wit-bindgen
bytecodealliance/wit-bindgen · 从 .wit 生成 N 种语言 binding 的工具链。Ch21。Toolchain generating N-language bindings from .wit. Ch21.
Wasmtime
Wasmtime documentation · Bytecode Alliance 的 wasm 运行时。Ch01/14/25。Bytecode Alliance's wasm runtime. Ch01/14/25.
Cranelift
Cranelift code generator · Wasmtime 和 Wasmer 共用的 Rust 写 wasm 编译器。Ch01/15。Rust-written wasm compiler shared by Wasmtime and Wasmer. Ch01/15.

E · 学术论文

E · Academic papers

PLDI 2017
Haas, Rossberg, Schuff, Titzer, Holman, Gohman, Wagner, Zakai, Bastien. "Bringing the Web up to Speed with WebAssembly" · 原始论文,描述 MVP 设计原则、形式语义、验证算法。Ch01/03/11 引用。The original paper · design principles, formal semantics, validation algorithm. Cited by Ch01/03/11.
CPP 2018
Watt. "Mechanising and Verifying the WebAssembly Specification" · 在 Isabelle/HOL 里把 wasm 规范 mechanise + 验证,发现 spec 里的几处 bug。Ch11 / Ch27。Mechanised + verified the wasm spec in Isabelle/HOL, finding spec-level bugs. Ch11 / Ch27.
CACM 2018
Rossberg, Titzer, Haas, Schuff, Gohman, Wagner, Zakai, Bastien, Holman. "WebAssembly: A Quick Primer" · CACM Highlights · PLDI 论文的科普缩略版,适合作入门读物。Popularised companion to the PLDI paper; good starter reading.
HASP 2019
Disselkoen, Renner, Watt, Garfinkel, Levchenko, Stefan. "Position Paper: Progressive Memory Safety for WebAssembly" · 讨论 wasm 内存安全的精确边界 — Ch10/27 引用。Discusses the precise boundary of wasm memory safety. Cited by Ch10/27.
USENIX Security 2020
Lehmann, Kinder, Pradel. "Everything Old is New Again: Binary Security of WebAssembly" · 早期 wasm 二进制安全分析(stack-smashing 等)。Ch27。Early binary-security analysis of wasm (stack smashing, etc.). Ch27.

F · 源码定位

F · Source code anchors

V8 wasm 实现V8 wasm implementation
decoder
v8/src/wasm/module-decoder.cc · 流式 decode 主流程。Ch12。Streaming decode main loop. Ch12.
validator
v8/src/wasm/function-body-decoder-impl.h · 类型栈验证模板。Ch11/13。Type-stack validator template. Ch11/13.
Liftoff
v8/src/wasm/baseline/liftoff-compiler.cc · 基线 JIT 主入口。Ch14。Baseline JIT main entry. Ch14.
TurboFan-wasm
v8/src/compiler/wasm-compiler.cc · wasm → TF graph build。Ch15。wasm → TF graph build. Ch15.
Turboshaft
v8/src/compiler/turboshaft/wasm-*.cc · 2023 后默认的 wasm 优化器。Ch15。Default wasm optimiser since 2023. Ch15.
wrappers
v8/src/wasm/wasm-import-wrapper-cache.cc · JS↔Wasm trampoline 缓存。Ch17。JS↔Wasm trampoline cache. Ch17.
JS API
v8/src/wasm/wasm-objects.cc · Module/Instance/Memory/Table 的 JS 对象。Ch16。JS objects for Module/Instance/Memory/Table. Ch16.
SpiderMonkey · JavaScriptCore
SM baseline
mozilla-central/js/src/wasm/WasmBaselineCompile.cpp
SM Ion
mozilla-central/js/src/wasm/WasmIonCompile.cpp
JSC BBQ/OMG
WebKit/JavaScriptCore/wasm/WasmBBQ*.cpp · WasmOMG*.cpp
工具链Toolchains
Binaryen
WebAssembly/binaryen · wasm-opt, wasm-as, IR 优化器。
WABT
WebAssembly/wabt · wasm2wat, wasm-objdump, wat2wasm。Ch06/07/24 用。
wasm-bindgen
rustwasm/wasm-bindgen · Rust↔JS 胶水生成器。
Emscripten
emscripten.org · C/C++ → wasm 工具链。Ch02/25 引用。
AssemblyScript
assemblyscript.org · TypeScript-flavoured wasm 源码语言,不走 LLVM 走 Binaryen。

G · 治理 · 历史

G · Governance · history

CG charter
W3C WebAssembly Community Group · 提案 phase 0-2 在这里讨论。任何人可加入。Proposals phase 0–2 live here. Open to anyone.
WG charter
W3C WebAssembly Working Group · 提案 phase 3-4 在这里投票。需要会员资格。Proposals phase 3–4 vote here. Membership required.
2015 birth
Luke Wagner · "WebAssembly" (17 Jun 2015) · Mozilla 的官宣博文。Ch02。Mozilla's launch announcement. Ch02.
asm.js spec
asm.js: a typed subset of JavaScript · 2013 Wagner / Zakai。Ch02/04。2013 Wagner / Zakai. Ch02/04.
Bytecode Alliance
bytecodealliance.org · 非营利组织,推 WASI / Component Model / 非浏览器 wasm。Ch01/25/28。Non-profit driving WASI / Component Model / non-browser wasm. Ch01/25/28.
Wasm 2.0 milestone
W3C · Wasm 2.0 Recommendation press release (2025-03) · 把 2019-2023 八个独立提案合并成 2.0 基线。Ch02。Folded eight 2019–2023 proposals into the 2.0 baseline. Ch02.
技术写作的鉴别度
不在花活,在出处。 Field Note · 03 · Appendix
Technical writing is judged not by flourishes,
but by the rigour of its sources. Field Note · 03 · Appendix
✦ ✦ ✦