FIELD NOTE / 03 虚拟机 · 编译器 Virtual Machines · Compilers 2026

从 Rust 到
SIMD 寄存器。

From Rust to
SIMD registers.

一段 Rust 卷积循环要穿过 17 道工序、两层 JIT、4 GiB 的线性内存,才能在你的屏幕上跑成一条 SSE 指令。
这是 WebAssembly 从字节到机器码的全景手册。

A Rust convolution loop has to cross seventeen stages, two tiers of JIT, and 4 GiB of linear memory before it lights up as a single SSE instruction on your screen.
This is a field map of WebAssembly from bytes to machine code.

AUTHORAiring TOPICWasm · V8 · Liftoff · TurboFan FORMATLong Read SPECWasm 2.0 · 2025

姐妹篇 · 谱系 Companion · the lineage

本文与前作《字节码到像素的一生 — Chromium 渲染流水线》和《V8 是怎么把 JS 跑快的》是同一谱系——前两篇问"浏览器拿到字节后怎么变像素"和"JS 怎么变快",这篇问"当 JS 还是不够快,我们用什么"。三篇可以独立读,但放在一起,你会看见浏览器作为一台通用计算机的全貌。

This is the third in a family with «Bytecode to Pixels — Chromium Rendering Pipeline» and «How V8 Makes JS Fast». The earlier two ask how bytes become pixels and how JS gets fast; this one asks what we use when JS still isn't fast enough. Each stands alone, but together they paint a complete portrait of the browser as a general-purpose computer.

FIG · HERO · pipeline panorama

从 .rs 源文件出发,穿过 4 个阶段、13 道工序、3 个进程、~ 16.7 ms 的帧预算,最后变成屏幕上的一个像素。streaming compile 让阶段 Ⅱ 与 Ⅲ 完全重叠——这是 wasm 在浏览器的"边下边编" 体验来源。

From .rs source through four phases, thirteen stages, three processes and a 16.7 ms frame budget — to one pixel on screen. Streaming compile overlaps phases Ⅱ and Ⅲ entirely, the basis of wasm's "compile-while-downloading" feel.

CONTENTS · 目录

CONTENTS

26 章 · 6 大段chapters · 6 acts

"从源码到机器码" 的全景。点任意一格跳转,滚动时左侧目录会跟随高亮。

A field map from source to machine code. Click any chip to jump; the side TOC follows as you scroll.

背景

Background

4 chapters · prerequisites

三个公式Three formulaswasm in one sentence

家谱Family tree13 yrs from asm.js

为什么是栈机Why a stackdensity & validation

JS 的天花板JS ceilingwhy JIT alone isn't enough

主线 · The Hot Loop

Main-line · The Hot Loop

1 chapter · the through-line

✦

3×3 卷积 · Rust3×3 Convolution · Rustthe source we'll trace 17 times

III

二进制解剖

Binary anatomy

6 chapters · the format

外壳The shell\0asm magic · 8 bytes

11 段 sections11 sectionstype → code → data

类型系统Type systemi32/i64/f32/f64/v128

指令集Opcodes~430 in 1 byte each

线性内存Linear memory64 KiB page · 4 GiB max

验证Validationthe soundness proof

编译流水线

Compilation pipeline

6 stages · the engine

解码 · 流式 · Browser → RendererDecode · streaming · Browser → Renderer

Decodestreaming · LEB128

Validateper-function · parallel

Tier 0 · 基线 JITTier 0 · baseline JIT

Liftoffsingle-pass · ~10 MB/s

Tier 1 · 优化 JITTier 1 · optimizing JIT

TurboFansea-of-nodes · ~native

实例化 & 互调 · runtimeInstantiate & interop · runtime

Instantiatememory · table · start

JS ↔ Wasmtrampolines · coercion

提案族

Proposals

5 chapters · the evolution

Threads & AtomicsThreads & Atomicsshared memory · SAB

SIMD v128128-bit lanes

wasm-GCstruct · array · i31

Component ModelComponent ModelWIT · WASI 0.2

其余六提案Six moreTCO · EH · m64 · JSPI

综合

Synthesis

4 chapters · putting it together

性能模型Performance modelwhy faster · why not

DevTools 调试DevTools debuggingDWARF · source maps

现实战场Real-world battlefieldsFigma · AutoCAD · Ruffle

术语表Glossary50 key terms

附录

Appendix

3 deep dives + standards

安全模型Security modelCFI · Spectre · WasmCert

服务端 wasmServer-side wasmCF Workers · Spin · Fermyon

wasm 不能做什么What wasm cannot do7 hard limits

手写 wasmHand-write wasmbyte · LEB128

Wasm vs eBPFWasm vs eBPFuser vs kernel

References & StandardsReferences & StandardsW3C · IETF · papers · source

ACT I · BACKGROUND

在沉到比特之前。

Before we sink into bits.

先把 WebAssembly 这件事放回它的历史位置:它是 asm.js 的延伸,是 JavaScript 走到天花板之后的另一条腿,是浏览器从"文档查看器"变成"通用计算机"的最后一块拼图。先有这四章作为骨骼,后面 22 章的细节才会落到合适的位置。

Before we sink to bits, put WebAssembly back into its historical slot: an extension of asm.js, a second leg the browser grew once JavaScript hit its ceiling, the last piece that turned the browser from a "document viewer" into a "general-purpose computer". With these four chapters as skeleton, the 22 that follow fall into place.

CHAPTER 01

三个公式 — WebAssembly 到底是什么

Three formulas — what WebAssembly really is

把这个庞然大物压成三行

crushing the elephant into three lines

"WebAssembly"在大部分讲座 PPT 里被画成一个紫色方块,旁边是"fast, safe, portable"三个词,听起来像一份产品宣传单。但当你真正打开 spec 仓库会发现:它不是一个东西,而是三层契约叠在一起——一层是字节格式,一层是执行模型,一层是宿主接口。把这三层各写成一个公式,后面所有故事都能从里面长出来。

"WebAssembly" gets painted as a purple block in most slide decks, captioned fast, safe, portable like a product brochure. Open the spec repo and you discover it is not one thing — it's three contracts stacked on top of each other: one for the byte format, one for the execution model, one for the host interface. Write each as a formula and every later story grows out of them.

公式 1 / FORMULA 1 · 设计契约FORMULA 1 · the design pact

Wasm = portable · safe · fast · compact portable = 同一份 .wasm 在任何 CPU/OS/JS 引擎上结果一样same bytes, same result on every CPU / OS / engine safe = 沙箱 + 验证 + 类型化栈 + 内存边界检查sandbox + validation + typed stack + bounds checks fast = 设计目标"原生 80%",JIT 可单遍出码design target ~80% of native, JIT-friendly single-pass codegen compact = LEB128 + 栈机 + 8 字节文件头LEB128 + stack machine + 8-byte header

推论:四个目标是同一道菜的四种调料。任何一个被偏废,整道菜会变形。后面 26 章本质上是在反复回答"这一刀切下去,是不是同时尊重这四个字"。 Implication: the four are seasonings in one dish — drop one and the whole thing collapses. The next 26 chapters are essentially repeated answers to one question: does this design choice still honour all four words?

公式 2 / FORMULA 2 · 引擎拆解FORMULA 2 · the engine

Engine = Decoder + Validator + (Tier 0 → Tier 1) + Runtime Decoder 从字节流构造 Module 数据结构builds the Module from the byte stream Validator 类型栈抽象解释,O(n) 证明安全type-stack abstract interpretation, O(n) safety proof Tier 0 基线 JIT,启动快但不优化(V8 = Liftoff)baseline JIT, fast start, unoptimised (V8 = Liftoff) Tier 1 优化 JIT,后台慢慢提升热函数(V8 = TurboFan)optimising JIT, lifts hot funcs in the background (V8 = TurboFan) Runtime linear memory + table + import/export + traplinear memory + table + import / export + trap

推论:wasm 没有解释器这一档。设计者赌的是"JIT 一定比解释快",所以连最基础的运行模式都是单遍出机器码。后面 Ch14 会看到 Liftoff 怎么实现"边解码边出码"。 Implication: wasm has no interpreter tier. The designers bet "JIT always beats interpretation", so even the first run goes through single-pass codegen. Ch14 shows how Liftoff turns bytes into machine code in one pass.

公式 3 / FORMULA 3 · 工具链FORMULA 3 · the toolchain

Source → Frontend → LLVM / Cranelift → wasm-ld → .wasm Rust → rustc → LLVM IR → wasm32 backend → wasm-ld → out.wasmRust → rustc → LLVM IR → wasm32 backend → wasm-ld → out.wasm C/C++ → clang → LLVM IR → wasm32 → wasm-ld → out.wasm (Emscripten 套壳)C/C++ → clang → LLVM IR → wasm32 → wasm-ld → out.wasm (Emscripten shell) Go (TinyGo) → SSA → LLVM → wasmGo (TinyGo) → SSA → LLVM → wasm AssemblyScript → Binaryen IR → wasm (no LLVM)AssemblyScript → Binaryen IR → wasm (no LLVM)

推论:绝大多数语言走的是 "LLVM IR → wasm32 后端" 这条路;只有 AssemblyScript 走 Binaryen,Java/.NET 走"自带 GC 运行时"那条岔路。生态格局基本是 LLVM 决定的。 Implication: nearly every language reaches wasm through LLVM IR → wasm32 backend; only AssemblyScript routes via Binaryen, and Java/.NET ride their own GC runtimes. The ecosystem map is largely LLVM's map.

主流引擎的「Tier 拓扑」对照

Engine tiering at a glance

引擎Engine	Tier 0(基线)Tier 0 (baseline)	Tier 1(优化)Tier 1 (optimising)	用在哪Used in
V8	Liftoff (2018)	TurboFan / Turboshaft (2023→)	Chrome, Edge, Node, Deno
SpiderMonkey	Baseline (2018)	Ion / WarpMonkey	Firefox
JavaScriptCore	BBQ (Build Bytecode Quickly)	OMG (Optimized Machine Generator)	Safari, WebKit
Wasmtime	—	Cranelift	Bytecode Alliance, edge runtimes
Wasmer	Singlepass	Cranelift / LLVM	standalone, plugin sandbox
WAMR	interpreter / fast-interp	AOT (LLVM)	IoT, embedded

从这张表里冒出一个事实:浏览器三家都选择了"双层 JIT",非浏览器引擎多半只留一层优化器或反而留解释器。原因是浏览器要兼顾"开页面要立刻能跑"和"久了要够快",而服务器端 wasm 通常是冷启动一次跑很久,直接 AOT 即可。同一个 spec,生出两套截然不同的实现哲学。

A fact climbs out: all three browser engines went with two-tier JIT, while non-browser engines tend to keep just one optimiser — or revert to an interpreter. Browsers must reconcile "must run instantly" with "must run fast eventually"; server-side wasm cold-starts once and runs forever, so AOT alone is enough. One spec, two diverging philosophies of implementation.

WebAssembly 不是一种语言,
是一份让 LLVM 和浏览器握手的协议。 Field Note · 03

WebAssembly is not a language.
It is the handshake between LLVM and the browser. Field Note · 03

关于"虚拟 ISA"的提法ON "VIRTUAL ISA" 官方文档把 wasm 称为virtual ISA(虚拟指令集架构)——把它当成"一种新 CPU"来理解最准确。x86-64 是 1999 年 AMD 设计的虚拟接口,wasm 是 2015 年 W3C 设计的虚拟接口,只是后者跑在 V8 而非硅片上。后面 Ch09 看指令格式时,你会发现它真的像 RISC,带几分 MIPS 和几分 JVM 字节码的混血。 The spec calls wasm a virtual ISA. Treat it as "a new CPU you just discovered". x86-64 is a virtual interface AMD designed in 1999; wasm is a virtual interface W3C designed in 2015 — just running on V8 instead of silicon. Ch09 will show you the instruction encoding really does look like a RISC, with a dash of MIPS and a dash of JVM bytecode.

INPUT

三个公式Three formulas设计契约 + 引擎拓扑 + 工具链design + engine + toolchain

→

OUTPUT

骨骼:四目标 / 两层 / 一编译器Skeleton: 4 goals · 2 tiers · 1 compiler后续 25 章都挂在这副骨头上all 25 later chapters hang on this

本章引用Chapter references

W3C REC: Wasm Core Specification 2.0
paper: Haas et al. · PLDI 2017 · "Bringing the Web up to Speed with WebAssembly"
article: Luke Wagner · "WebAssembly" announcement · 2015-06-17

← 下一章nextCh02 · 家谱 · 从 asm.js 到 GC 的 13 年Ch02 · Family tree · 13 yrs from asm.js to GC →

CHAPTER 02

家谱 — 从 asm.js 到 wasm-GC 的十三年

A family tree — 13 years from asm.js to wasm-GC

每一个提案都是一次妥协的化石

every proposal is a fossilised compromise

2010 年 Google 在 Chrome 里塞了一个叫 NaCl 的东西——它能跑原生码,但每一种 CPU 各编译一份。后来 PNaCl 用 LLVM bitcode 当中间格式,通用化是有了,但只有 Chrome 支持。"在浏览器里跑 C++"这件事整整失败了五年。

2011 年另一个分支冒头:Mozilla 的 Alon Zakai 写了 Emscripten,把 LLVM bitcode 翻成 JavaScript;2013 年他和 Luke Wagner 进一步把"JS 的一个类型化子集"标准化成 asm.js——你可以用 "use asm" 告诉引擎这段代码全是 int32,引擎就能跳过类型检查,直接 AOT 编译。Firefox 上的 asm.js 跑出过原生 1.5 倍的成绩。

但 asm.js 仍然要走 JS parser,文件还是文本,还是要走 V8 的 SMI/HeapNumber 边界。所有人都看到了一条更短的路:把那个类型化子集直接二进制化。2015 年 6 月 17 日,W3C 上的四家——Mozilla、Google、Apple、Microsoft——宣布合作。两年后 MVP 在四大浏览器同时落地,这是 web 平台史上罕见的一次性达成。

In 2010 Google shipped NaCl in Chrome — it ran native code, but you had to compile once per CPU. PNaCl tried LLVM bitcode as a portable IR, but only Chrome supported it. "Running C++ in the browser" failed cleanly for five years.

The other branch sprouted in 2011: Mozilla's Alon Zakai wrote Emscripten, which translated LLVM bitcode into JavaScript. By 2013 he and Luke Wagner had standardised "a typed subset of JS" as asm.js — drop a "use asm" at the top and the engine could skip type checks and AOT-compile. Firefox's asm.js engine hit ~1.5× of native.

But asm.js still went through the JS parser, was still text, still bumped into V8's SMI/HeapNumber boundary. Everyone saw the shortcut: binarise that typed subset. On 17 June 2015 the four browser vendors — Mozilla, Google, Apple, Microsoft — announced collaboration on the W3C. Two years later the MVP shipped in all four browsers simultaneously — a rare instance of platform consensus actually happening.

FIG 02 · family tree · hand-drawn 4 vendors agree" MVP v1.0 · 4 browsers Liftoff ⚠ Spectre SAB disabled Threads Bulk SIMD v128 + refs Tail · EH GC struct/array/i31 2.0 W3C REC JSPI Component in-flight non-browser: Wasmtime '19 Wasmer '19 CF Workers '18 Spin '22 WASI 0.2 '24 "the pact"

四条血脉(NaCl · Emscripten · asm.js · JSC 经验) 在 2015 年 6 月 17 日的 W3C 会议室里收敛成 wasm 主干。MVP 之后,提案像枝条一样从主干长出来——绿色是编译器/运行时提案,紫色是计算能力提案。Wasm 2.0 在 2025 年成为 W3C Recommendation,把过去 8 年的 8 个独立提案合并成一份新基线。

Four ancestor strands (NaCl · Emscripten · asm.js · JSC heritage) converge into the wasm trunk on 17 Jun 2015 at the W3C. Post-MVP, proposals sprout — green are compiler/runtime proposals, purple are compute proposals. Wasm 2.0 became a W3C Recommendation in 2025, folding eight separate proposals into a new baseline.

三个不能略过的祖先

Three ancestors you cannot skip

NaCl(2009)— 失败的"在浏览器里跑原生码"

NaCl (2009) — the failed "run native in browser"

Google 用 SFI(Software Fault Isolation)给原生码画沙箱。安全但需要 per-CPU 编译;PNaCl 用 LLVM bitcode 改善了可移植性,但只有 Chrome 支持,五年后 Chrome 自己也下线了它。教训:浏览器要求"一份字节,处处运行"。

Google sandboxed native code via SFI (Software Fault Isolation). Safe, but you compiled once per CPU. PNaCl swapped in LLVM bitcode for portability, but only Chrome shipped it. Five years later Chrome retired it. Lesson: the browser requires "one binary, runs everywhere".

Emscripten(2011)— 把 LLVM 翻成 JS

Emscripten (2011) — translating LLVM into JS

Zakai 让 clang 输出 LLVM bitcode,再写一个 backend 把 bitcode 翻成非常机器化的 JS——HEAP32[(p+4)>>2] = x | 0 这种风格。证明了"用 JS 当虚拟 CPU"在工程上可行。今天 Emscripten 还在,但它的 backend 已经直接输出 wasm。

Zakai had clang emit LLVM bitcode then wrote a backend translating bitcode into extremely machine-like JS — HEAP32[(p+4)>>2] = x | 0 style. It proved "JS as virtual CPU" was engineerable. Emscripten still ships, but its backend now emits wasm directly.

asm.js(2013)— 给 JS 引擎一个 AOT 入口

asm.js (2013) — an AOT trapdoor into the JS engine

"use asm" 一行声明,引擎认出后用 AOT 而非 JIT 编译该函数。Firefox 的 OdinMonkey 在 asm.js 上跑出过 1.5× of native。但 asm.js 仍是文本,要走 JS parser,parse 一个 100 MB 的游戏 bundle 要十几秒。这成了催生 wasm 二进制格式的最后一根稻草。

A single "use asm" directive let the engine AOT-compile a function. Firefox's OdinMonkey hit 1.5× of native on it. But asm.js was still text, still went through the JS parser, and a 100 MB game bundle took tens of seconds to parse. That was the final straw that forced wasm to be binary.

为什么 2015 年这次合作能成WHY THE 2015 PACT HELD 浏览器历史上厂商合作的成功率很低——XHTML、SVG Fonts、HTML 5 自身都经历过分裂。wasm 这次成功的关键有三:① 四家都有各自版本的相似失败(NaCl/Emscripten/asm.js/SilverLight),共识基础硬;② 提案从一开始就用形式化语义(Andreas Rossberg 主笔)而非自然语言,歧义少;③ MVP 把 GC/threads/SIMD 全砍掉,先求落地——剩下的事留给提案流程。"先共识,再演进"是 wasm 的一切。 Vendor consensus on the web has a low success rate — XHTML, SVG Fonts, HTML 5 itself all fragmented. Three things made wasm work: ① every vendor had its own version of the same failure (NaCl / Emscripten / asm.js / Silverlight) — the consensus floor was solid; ② the proposal was specified formally from day one (Andreas Rossberg leading) rather than as natural-language prose — fewer ambiguities; ③ the MVP cut GC, threads, SIMD, exceptions — ship first, evolve later. "Consensus first, evolution after" is the whole of wasm.

提案的四个阶段

The four phases of a proposal

Phase	含义Meaning	谁同意Who agreed	能不能用Usable?
0 · Pre-proposal	某人提个 idea,有仓库someone has an idea + repo	—	不no
1 · Feature proposal	CG 同意值得做CG agrees it's worth doing	CG	不no
2 · Proposed spec text	有正式规范文字formal spec text exists	CG	flag 后可用(Chromium `--enable-experimental-webassembly-features`)behind flag
3 · Implementation	至少 2 个引擎实现≥ 2 engines shipped impl	CG	flag 后可用,Origin Trialbehind flag, Origin Trial
4 · Standardize	WG 投票纳入正式规范WG votes to standardise	WG	默认开启on by default

CG = Community Group(社区组,任何人可加入);WG = Working Group(工作组,需要会员资格)。一个提案常常在 phase 3 待两到三年——SIMD 在 phase 3 待了 26 个月才升 phase 4。这套机制让 wasm 的每一步演进都需要至少两家厂商先实现,从根上把"一家独大"挡住了。

CG = Community Group (anyone can join); WG = Working Group (membership required). A proposal often sits in phase 3 for two to three years — SIMD spent 26 months at phase 3 before stepping up. The mechanism forces every evolutionary step to be implemented by at least two vendors first — structurally blocking unilateral moves.

SPEC

https://webassembly.github.io/spec/ · 所有提案在 github.com/WebAssembly/<name>each proposal at github.com/WebAssembly/<name>

PHASE

https://github.com/WebAssembly/proposals · 看每个提案现在在哪一阶phase tracker for each proposal

FLAG

chrome://flags/#enable-experimental-webassembly-features

本章引用Chapter references

post: asm.js spec (2013, Wagner / Zakai)
tracker: All proposals · phase tracker
press: Wasm 2.0 W3C Recommendation press release (2025-03)

上一章previous← Ch01 · 三个公式Three formulas 下一章nextCh03 · 为什么是栈机而不是寄存器机Why a stack, not registers →

CHAPTER 03

为什么是栈机 — 一个 1980 年代的选择

Why a stack machine — a choice from the 1980s

JVM 走过的路,wasm 又走了一遍

JVM walked this path, wasm walked it again

"为什么 wasm 不是寄存器机?Dalvik 不是更快吗?"——这是每个第一次看 wasm 字节码的人会问的问题。答案藏在一个看似无关的数字里:wasm 字节码的体积要小到能流式下载。MVP 设计期(2015)给自己定的目标是 4 MB 文本的 asm.js 程序压成不超过 1 MB 的二进制——压缩比 1:4。所有的设计决策都要让步于这个数字。

"Why isn't wasm a register machine? Aren't Dalvik registers faster?" — every first-time reader of wasm bytecode asks this. The answer hides in a seemingly unrelated number: wasm bytes must be small enough to stream-download. The MVP target (2015) was to fit 4 MB of asm.js text into < 1 MB of binary — a 1:4 ratio. Every design choice bows to that number.

两种机器的编码密度对比

Encoding density: stack vs register

考虑一行表达式 c = a + b。在两种 ISA 里它的字节序列分别是:

Take a single expression c = a + b. The byte sequences in the two ISAs:

STACK · wasm

local.get $a
local.get $b
i32.add
local.set $c

4 条 / 6 字节(LEB128 编码后) 4 ops / 6 bytes (LEB128)

add i32 %a, %b → %t
store i32 %t, %c

2 条 / 须编码 source+dest 寄存器号 2 ops / must encode src+dst register IDs

看起来寄存器机指令更少。但寄存器号需要 bits 来编:LLVM 的 SSA 寄存器数量无界,实际编码时需要 32 位甚至更多;Dalvik 把寄存器限到 256 个,8 bit;ARM/x86 真寄存器 16 个,4 bit。栈机一字节就是一条 opcode(i32.add = 0x6A),局部变量索引用 LEB128(通常 1 byte),整体下来栈机一般赢 30~40% 字节。

Register ops look fewer. But register IDs need bits: LLVM SSA values are unbounded, encoded at 32+ bits each; Dalvik caps at 256 registers (8 bits); ARM/x86 have 16 real registers (4 bits). A stack-machine opcode is one byte (i32.add = 0x6A), local indices LEB128 (usually 1 byte). The stack form typically wins 30–40% on bytes.

栈机的四个"赠送好处"

Four bonuses the stack throws in for free

1 · density

编码密度高

High encoding density

opcode 1 byte,大多数立即数 LEB128 1~2 byte。同样语义比 ARM64 大约小 35%。

1-byte opcode, most immediates 1–2-byte LEB128. About 35% smaller than equivalent ARM64.

2 · validation

验证算法线性

Linear validation

类型栈抽象解释,一遍扫完即可证明类型安全。Ch11 详谈。

Type-stack abstract interpretation: one linear pass proves type safety. See Ch11.

3 · JIT

单遍 codegen 可行

Single-pass codegen viable

栈位置编译期可知,Liftoff 边解码边发射机器码,无中间 IR。

Stack positions are statically known; Liftoff emits machine code while decoding, no IR.

4 · neutral

ISA 无关

ISA-neutral

不绑定寄存器数量或调用约定,同一份字节在 x86/ARM/RISC-V 上都能跑。

Not tied to a register count or calling convention; the same bytes run on x86/ARM/RISC-V.

但栈机有一个老问题

But the stack has one old problem

栈机解释执行慢——每条指令要操作栈顶,栈本身常驻内存,L1 cache 命中率不如寄存器机。这是 JVM 早期被嘲讽"慢得像树懒"的根本原因。wasm 怎么解?用 JIT 而不是解释器。设计者赌的是:既然反正都要 JIT,那就让字节码偏向解码密度,机器码偏向执行速度,各取所长。

Stack interpreters are slow — every op touches the stack top, the stack lives in memory, L1 hit-rate trails register machines. That's why early JVMs felt "sloth-slow". Wasm's answer: skip the interpreter. The bet was: we'll JIT anyway, so let the bytecode optimise for density and the machine code optimise for speed. Best of both.

栈机还配了一个"半寄存器"层:locals。每个函数有固定数量的 locals(像寄存器),local.get / local.set 在栈和 locals 之间搬运值。这套设计让 wasm 既像栈机一样紧凑,又像寄存器机一样能"存中间结果"。JVM 的 locals 与之几乎完全相同——wasm 的设计者把 JVM 学了一遍。

Stack machines also carry a "half-register" file: locals. Each function has a fixed number of locals (register-like), with local.get / local.set moving values between stack and locals. That gives wasm a stack's compactness with a register machine's ability to "hold intermediate values". The JVM has the exact same construct — wasm's designers studied JVM thoroughly.

具体计算 · 一个反例CASE · concrete numbers

`fib(40)` 在 wasm 和 Dalvik 上的字节数

`fib(40)` in wasm vs Dalvik bytes

实测把 fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } 编到 wasm 和 dex:wasm body 是 31 byte(含两次递归调用),dex 经压缩 27 byte。差距不大,因为函数体太短,寄存器号编码 vs locals 索引几乎抵平。真正拉开差距的是大函数——一个 1000 行的 SIMD inner loop,wasm 大约赢 33%,这才是 wasm 选栈机的真正回报。

Compile fn fib(n: i32) -> i32 { if n < 2 { n } else { fib(n-1) + fib(n-2) } } to wasm and dex: wasm body is 31 bytes (including two recursive calls), dex 27 bytes. Tiny gap, because the body is too short — register encoding vs local index cancels out. The gap widens on large functions — a 1000-line SIMD inner loop sees wasm ~33% smaller. That's the real return on the stack-machine bet.

关于"反栈机"的另一派THE COUNTER-FACTION 2014 年 Andreas Rossberg 在内部讨论时曾推过一个"register wasm"草案,叫 WebAssembly Register。最终被否决,理由有三:① 浏览器引擎要兼顾启动速度,字节密度更要紧;② 栈机的形式语义比寄存器机更小,规范更易写;③ JVM 已经为栈机做了 20 年的工程证明。今天这个 register 草案还在 webassembly/design 仓库的历史里能翻到。 In 2014 Andreas Rossberg circulated a draft for a "register wasm" called WebAssembly Register. It was rejected for three reasons: ① browsers prioritise startup speed, byte density matters more; ② formal semantics for stacks are smaller than for registers, easier to specify; ③ the JVM had already engineered the stack approach for 20 years. The draft still lives in the history of the webassembly/design repo.

栈是密度,寄存器是速度。
wasm 选了让编译器付出速度。 Field Note · 03

The stack buys density, the register buys speed.
Wasm chose to make the compiler pay for speed. Field Note · 03

本章引用Chapter references

repo: webassembly/design · "Register wasm" historical draft
book: JVM Spec §2 · stack machine semantics (for comparison)

上一章previous← Ch02 · 家谱Family tree 下一章nextCh04 · JS 的天花板JS ceiling →

CHAPTER 04

JS 的天花板 — JIT 再聪明也跨不过的三件事

JS ceiling — three things even a brilliant JIT can't climb

为什么 wasm 必须存在

why wasm has to exist

V8 是一台让人叹服的 JIT 引擎——它在运行时学习对象形状、追踪类型、构造内联缓存、把热函数从 Ignition 经 Sparkplug、Maglev 一路提升到 TurboFan。但所有这些工程都建立在一个前提之上:JS 是动态类型。这个前提注定了 JIT 有一个跨不过去的天花板。

把天花板写成三件事:(1) 类型不确定,所以要 inline cache,猜错就要 deopt;(2) 数字不止一种表示,SMI / HeapNumber / Float64 之间的装箱拆箱无法消除;(3) GC 不可关,即使是数值密集的图像处理,引用计数和写屏障也要付。这三件事单独看每一件都是几个百分点,叠起来就是 5× ~ 10× 的差距。

V8 is an awe-inspiring JIT — it learns object shapes at runtime, traces types, builds inline caches, lifts hot functions from Ignition through Sparkplug and Maglev to TurboFan. But all that engineering rests on one premise: JS is dynamically typed. That premise dictates a ceiling.

Three sides to that ceiling: (1) types are uncertain, so you need inline caches, deopt on misses; (2) numbers have multiple representations (SMI / HeapNumber / Float64), boxing/unboxing cannot be eliminated; (3) GC cannot be turned off — even on pixel-pushing loops you pay write barriers and reference counts. Each of these is a few percent; stacked, they multiply to 5–10×.

三个具体的天花板

Three concrete ceilings

①

Inline cache 的反复 deopt

Inline cache thrash & deopt

V8 把 obj.x 优化成"直接偏移 +8 取值"——前提是 obj 总是这个形状。一旦你给某个 obj 加了字段,这条优化作废,引擎不得不 deopt 回 Ignition,函数从 TurboFan 掉到字节码。wasm 没有 obj.x,有的是 i32.load offset=8——偏移在编译期就钉死,没有 deopt。

V8 optimises obj.x into "offset +8 of this shape" — until you add a property and the shape changes, at which point the optimisation invalidates, the engine deopts back to Ignition, the function falls from TurboFan to bytecode. Wasm has no obj.x; it has i32.load offset=8 — the offset is fixed at compile time, deopt-free.

②

数字装箱(SMI ↔ HeapNumber ↔ Float64)

Number boxing (SMI ↔ HeapNumber ↔ Float64)

在 V8 里 x = 1 是 SMI(31-bit tagged 整数,栈上);x = 1.5 是 HeapNumber(堆指针,要 GC)。x = a + b 时引擎要先判断两边是 SMI 还是 HeapNumber,再决定加法 opcode。一个简单的 inner loop 里这种判断每次都跑。wasm 的 i32.add 输入永远是 i32——没有判断,直接出 add eax, ebx。

In V8, x = 1 is an SMI (31-bit tagged int, stack); x = 1.5 is a HeapNumber (heap pointer, GC-tracked). For x = a + b, the engine first checks both sides' representations, then picks the add op. In a tight inner loop, that check runs every iteration. Wasm's i32.add always takes two i32s — no check, straight to add eax, ebx.

③

GC 不可关

GC is mandatory

即使你的循环只算数字,V8 仍要保证 arr[i] = x 这种写入触发 write barrier、保护代际收集器。一个 1M 次写入的循环里,write barrier 占 10~15% 时钟。wasm 的 i32.store offset=0 写到 linear memory——它是 ArrayBuffer 的一片,GC 完全不参与。这也是为什么 wasm 适合做图像/音视频/物理引擎而不适合做 React 组件树。

Even in a pure-arithmetic loop, V8 must run a write barrier on arr[i] = x to keep the generational GC sound. In a 1 M-write loop, write barriers consume 10–15% of cycles. Wasm's i32.store offset=0 hits linear memory — a slice of ArrayBuffer the GC never touches. That's why wasm shines on images/video/physics and slogs on React component trees.

v8-fast-js 里的同一句话A SENTENCE FROM v8-fast-js

"把 px2rem 优化到 24 ms 已经是极限了——再快只能不写 JS。"

"Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, stop writing JS."

这是上一篇文章《V8 是怎么把 JS 跑快的》结尾的句子——它正是这一章要展开的主张。JS 引擎走完了它能走的所有路:Sparkplug 把启动延迟干到 1× of native parse;Maglev 把热路径速度做到 0.8× of TurboFan;TurboFan 把寄存器分配做到接近 LLVM。当你需要更快,你需要的不是更聪明的 JIT,你需要的是更少的不确定性——这就是 wasm 的角色。

That's the closing line of the previous piece «How V8 Makes JS Fast» — and it's exactly the claim this chapter unpacks. The JS engine has walked every road it can: Sparkplug brings startup to ~1× native parse, Maglev hits 0.8× of TurboFan on hot paths, TurboFan's register allocator approaches LLVM's. When you need more speed, you don't need a smarter JIT — you need less uncertainty. That is wasm's role.

数字:wasm 比 JS 快多少

By how much, in numbers

基准Benchmark	JS (V8 TurboFan)JS (V8 TurboFan)	Wasm (V8 TurboFan)Wasm (V8 TurboFan)	原生 C(LLVM -O3)Native C (LLVM -O3)	wasm / nativewasm / native
SciMark 2.0 (geom mean)	2.4×	1.15×	1.00×	87%
fasta (computational)	3.1×	1.08×	1.00×	93%
n-body (3D physics)	2.8×	1.18×	1.00×	85%
JPEG decode (libjpeg)	4.5×	1.25×	1.00×	80%
SHA-256(纯算术)SHA-256 (pure arithmetic)	3.6×	1.10×	1.00×	91%
DOM diff(JS-bound)DOM diff (JS-bound)	1.00×	1.7×^↓	—	—

表里最后一行是反例:DOM diff 在 wasm 里反而更慢,因为每次 DOM 调用都要跨 wasm/JS 边界,trampoline 成本压过了算术加速。wasm 比 JS 快的是"算数",不是"调用浏览器 API"——这条边界 Ch17 会量化。

The last row is a counter-example: wasm is slower at DOM diff, because each DOM call crosses the wasm/JS boundary, and the trampoline cost outweighs the arithmetic speedup. Wasm beats JS at arithmetic, not at calling browser APIs — Ch17 quantifies that boundary.

关于"原生 80%"这个数字ON "80% OF NATIVE" 这是 wasm 设计目标里的官方数字,实测分布在 75% ~ 95%。差距来自两处:① bounds check(线性内存边界检查),硬件辅助后约 2~5% 开销;② 寄存器分配,wasm 的"无信息"调用约定让 callee-saved 集合比 native 大,平均 1~3% 开销。剩下的差距取决于 SIMD/loop unrolling 等优化通道是否完全开。有 SIMD 的 inner loop,wasm 可以追平甚至超过 -O2 的 native——因为 LLVM 后端是同一个。 "80% of native" is the official design target; measured numbers spread 75%–95%. The gap comes from two places: ① bounds checks on linear memory — with hardware assist ~2–5% overhead; ② register allocation — wasm's "no-info" calling convention forces a larger callee-saved set than native, averaging 1–3%. The rest depends on SIMD / loop-unrolling fidelity. With SIMD, a wasm inner loop can match or beat -O2 native, because the LLVM backend is the same.

INPUT

JS 引擎的工程极限JS engine ceilingIC + 装箱 + GCIC + boxing + GC

→

OUTPUT

wasm 存在的理由Wasm's reason to exist把"不确定性"从运行时挪到编译期move uncertainty from runtime to compile time

本章引用Chapter references

blog: V8 blog · Sparkplug baseline JIT (2021)
blog: V8 blog · Maglev mid-tier JIT (2023)
ref: Companion · "How V8 Makes JS Fast"

上一章previous← Ch03 · 栈机Stack machine 下一章nextMain-line · The Hot LoopThe Hot Loop →

ACT II · MAIN-LINE

The Hot Loop。

The Hot Loop.

姐妹篇 chromium-renderer 用一张名片(The Card)当贯穿全文的实例。这里我们用一段 3×3 卷积循环——它来自图像滤镜,小到可以打印在一页纸上,大到能压出栈机、SIMD、JIT、Tier-up 几乎所有的特性。后面 22 章每一章都会切回这段代码,看它在那一道工序里是什么样子。

In the sibling piece chromium-renderer, a business card (The Card) served as the through-line. Here we use a 3×3 convolution loop — straight out of image filtering, small enough to print on one page yet rich enough to exercise the stack machine, SIMD, JIT, and tier-up. Every one of the next 22 chapters cuts back to this code and shows what it looks like at that stage.

MAIN-LINE · ✦

The Hot Loop — 一段 3×3 卷积的来生

The Hot Loop — afterlife of a 3×3 convolution

11 行 Rust,17 道工序,1 条 SSE 指令

11 lines of Rust, 17 stages, 1 SSE op

"WebAssembly 的字节是什么样子" 这种问题用文字描述会很抽象。我们换一种问法:这一段你能看得懂的 Rust 函数,在每一道工序里长什么样。下面是它的源头——一个 3×3 盒型模糊滤镜,把一张灰度图的每个像素替换成它周围 9 个像素的平均值。这是 Photoshop 里"模糊"按钮在内核做的事的精简版,也是 wasm 最擅长跑的那种代码:循环密、整数为主、对内存 layout 敏感。

"What do WebAssembly bytes look like?" gets abstract in prose. So we switch the question: what does this Rust function look like at every stage? Below is the source — a 3×3 box blur that replaces each grayscale pixel with the average of its 9 neighbours. A miniature of Photoshop's blur button kernel, and the kind of code wasm shines on: loop-heavy, integer-dominated, memory-layout sensitive.

源码 · hot.rs

Source · hot.rs

// hot.rs — 3×3 box blur on an 8-bit grayscale image
// w · h are pre-checked, no panics on bounds

#[no_mangle]
pub fn blur3(src: &[u8], dst: &mut [u8], w: usize, h: usize) {
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            let mut sum: u32 = 0;
            for dy in 0..3 {
                for dx in 0..3 {
                    sum += src[(y + dy - 1) * w + (x + dx - 1)] as u32;
                }
            }
            dst[y * w + x] = (sum / 9) as u8;
        }
    }
}

五个观察:① #[no_mangle] 让 rustc 把符号名原样导出,后面 wasm 才能用 blur3 找到它;② 输入是切片,Rust 编译到 wasm 时会拆成"指针 + 长度"两个 i32 参数;③ 内层 9 次 src[...] 索引,每次都会被 LLVM 展平成 i32.load offset=?;④ sum / 9 编译成 i32.div_u——不是浮点;⑤ Rust 的 as u8 编译成 i32.store8,只写低 8 位。这五件事每一件都对应 wasm 的一个设计点,后面会一个个回到。

Five observations: ① #[no_mangle] tells rustc to export the symbol literally so wasm callers can find blur3; ② slice arguments are split into "pointer + length" — two i32 args each; ③ the nine inner src[...] indices each flatten into i32.load offset=?; ④ sum / 9 becomes i32.div_u — integer, not float; ⑤ as u8 becomes i32.store8, writing only the low byte. Each of these maps to a wasm design choice; we'll come back to them one by one.

编译命令 · 一行 rustc

Build command · one rustc invocation

$ rustc --target wasm32-unknown-unknown -O --crate-type cdylib -o hot.wasm hot.rs
$ wasm-opt -O3 hot.wasm -o hot.opt.wasm    # Binaryen post-pass
$ ls -l hot*.wasm
-rw-r--r--  1 airing  staff  192 May 16 14:32 hot.opt.wasm
-rw-r--r--  1 airing  staff  248 May 16 14:32 hot.wasm

192 字节的 .wasm 包含完整的模块——header / type / function / memory / export / code 六个 section,加起来不到一条 tweet。这是栈机+LEB128 编码密度的胜利。把这 192 字节十六进制打印出来,你能眼睛看完:

192 bytes contains the entire module — header / type / function / memory / export / code, six sections, less than a tweet. That's the win from stack machine + LEB128. Print those 192 bytes as hex and you can read them with your eyes:

00000000  00 61 73 6d  01 00 00 00  ; \0asm magic + version=1
00000008  01 0b 02      60 04 7f 7f  ; type section, 2 types
00000010  7f 7f 00     60 00 00     ; (func (param i32 i32 i32 i32)), (func)
00000018  03 02 01 00                  ; function section: func0 has type0
0000001c  05 03 01 00 01              ; memory section: 1 page (64 KiB)
00000021  07 09 01      05 62 6c 75  ; export "blur3"
00000029  72 33 00 00                  ; → func 0
0000002d  0a ...        ; code section, body of blur3 (155 byte)
...
000000be  0b                          ; end · final byte = 0xC0 (192)

注意三件事:① 00 61 73 6d 是 ASCII 的 \0asm——所有 wasm 模块都以它开头,像 ELF 的 0x7F ELF;② 01 00 00 00 是版本号 1,小端;③ 每个 section 以一个 ID byte(0x01 = type, 0x03 = function, ...)开头,然后是 LEB128 编码的长度。Ch06 会把这层皮一字一字撕开。

Three things to note: ① 00 61 73 6d is ASCII \0asm — every wasm module starts with it, like ELF's 0x7F ELF; ② 01 00 00 00 is version 1, little-endian; ③ each section opens with an ID byte (0x01 = type, 0x03 = function, …) followed by LEB128-encoded length. Ch06 peels this skin off byte by byte.

作为 wat 文本(人类可读)

As wat text (human-readable)

;; hot.wat — 经过 wasm-opt -O3 优化后的等价文本
(module
  (type $t0 (func (param i32 i32 i32 i32)))
  (memory (export "memory") 1)
  (func $blur3 (export "blur3") (type $t0)
    (param $src i32) (param $dst i32)
    (param $w   i32) (param $h   i32)
    (local $y i32) (local $x i32) (local $sum i32)
    ;; for y = 1..h-1
    (local.set $y (i32.const 1))
    (block $break_y
      (loop  $loop_y
        (br_if $break_y (i32.ge_s (local.get $y) (i32.sub (local.get $h) (i32.const 1))))
        ;; for x = 1..w-1
        (local.set $x (i32.const 1))
        (block $break_x
          (loop  $loop_x
            (br_if $break_x (i32.ge_s (local.get $x) (i32.sub (local.get $w) (i32.const 1))))
            ;; sum = 9 个 load 加起来 (LLVM 已经把内 2 层循环展平)
            local.get $src
            i32.const 0
            i32.load8_u offset=0      ;; src[(y-1)*w + (x-1)]
            local.get $src
            i32.load8_u offset=1
            i32.add
            ;; ... 共 9 次 i32.load8_u + 8 次 i32.add ... (展开版省略)
            local.set $sum
            ;; dst[y*w + x] = sum / 9
            local.get $dst
            local.get $sum
            i32.const 9
            i32.div_u
            i32.store8                ;; 写回
            local.set $x (i32.add (local.get $x) (i32.const 1))
            br $loop_x)) ;; end x
        local.set $y (i32.add (local.get $y) (i32.const 1))
        br $loop_y))                ;; end y
)

这是 wat 的一种展开形式(为了讲解可读)。实际的 LLVM 输出会把内 2 层循环完全展开成 9 条 i32.load8_u + 8 条 i32.add。注意三个关键点:(a) 控制流只有 block / loop / br / br_if 这几个原语——没有 goto;(b) 所有内存访问都带 offset=N,这个 offset 是编译期常量,Liftoff 可以直接折进地址计算;(c) 每条算术指令的输入输出类型由 opcode 自身决定(i32.add 必然两 i32 输入一 i32 输出)——这是 wasm "静态类型"的核心,Ch08 和 Ch11 会展开。

This is wat in an unrolled-but-readable form. The real LLVM output unrolls the two inner loops into 9 × i32.load8_u + 8 × i32.add. Three key things: (a) control flow uses only block / loop / br / br_if primitives — no goto; (b) every memory access carries an offset=N immediate that is a compile-time constant, which Liftoff folds straight into address arithmetic; (c) each arithmetic opcode self-describes its operand types (i32.add is always two-i32-in, one-i32-out) — that's the core of wasm's static typing, expanded in Ch08 and Ch11.

在 V8 里跑出的机器码 · Liftoff(Tier 0)

Machine code in V8 · Liftoff (Tier 0)

把 hot.wasm 喂给 V8,默认会先用 Liftoff 编译。在 Chrome 里用 --print-wasm-code dump 出 Liftoff 生成的 x86-64:

Feed hot.wasm to V8 and Liftoff compiles first by default. Use --print-wasm-code in Chrome to dump the generated x86-64:

; Liftoff output for blur3 (excerpt of inner body)
push   rbp
mov    rbp, rsp
sub    rsp, 0x30                ; reserve 6 slots for locals
mov    [rbp-0x08], rdi          ; spill $src (arg 0)
mov    [rbp-0x10], rsi          ; spill $dst (arg 1)
...
; inner: load src[idx]
mov    rax, [rbp-0x08]          ; rax = $src
mov    rcx, [rbp-0x18]          ; rcx = computed index
movzx  edx, byte ptr [r15+rax+rcx]   ; bounds-check via r15 base
add    [rbp-0x28], edx          ; sum += byte
...
; tier-up trigger
cmp    dword ptr [r13+0x40], 0x100
jne    +0x4
call   WasmCompileLazy

Liftoff 的输出有几个标志:(1) 几乎所有局部变量都 spill 到栈上,不做寄存器分配——这让 codegen 走单遍;(2) r15 是 V8 约定的 "wasm memory base" 寄存器,所有 load/store 都通过它做基址相对寻址,自带越界检查(用大段保留页 + signal handler);(3) 函数尾巴塞了一个 tier-up 计数器,每次进入函数就 cmp 一下,达到阈值就触发后台 TurboFan 重编译——这是 Ch14 / Ch15 的故事。

Marks of Liftoff: (1) nearly every local spills to the stack — no register allocation, single-pass; (2) r15 is V8's "wasm memory base" register; every load/store uses it as the base, with bounds checking via guard pages + signal handlers; (3) the function tail packs a tier-up counter, cmp'd on each entry — when the threshold trips, a background TurboFan recompile fires. That's the Ch14 / Ch15 story.

同一个函数,经 TurboFan 重编译(Tier 1)

Same function, TurboFan-recompiled (Tier 1)

; TurboFan output for blur3 (excerpt of inner body)
mov    edi, [r15+rcx]           ; row 0 starting load — held in reg, not spilled
movzx  eax, dil
movzx  ebx, byte ptr [r15+rcx+1]
add    eax, ebx                 ; 9 loads, 8 adds — all in regs
movzx  ebx, byte ptr [r15+rcx+2]
add    eax, ebx
...
mov    ebx, 0x1c71c71d          ; (2^32 + 8) / 9 magic for div-by-9
mul    ebx
shr    rdx, 1
mov    byte ptr [r15+rdi], dl   ; store the average
add    ecx, 1                   ; x++
cmp    ecx, esi
jl     -0x53

TurboFan 输出几乎就是手写汇编的样子——寄存器分配把 9 次 load 的中间结果留在 eax/ebx 里,sum / 9 被识别成"除以常数",用魔数乘法(0x1c71c71d)替换了昂贵的 div 指令。这一招 LLVM 也会做(Hacker's Delight 第 10 章),V8 的 TurboFan 把它原样搬过来。这是 wasm "原生 80%" 的具体形式。

TurboFan output reads like hand-written assembly — the register allocator keeps the 9 loads in eax/ebx, and sum / 9 is recognised as "divide by constant" and replaced with magic-number multiplication (0x1c71c71d). LLVM does the same trick (Hacker's Delight Ch10); V8 ports it over. This is the concrete form of wasm's "80% of native".

和 SIMD 版本的对比

vs the SIMD version

如果你打开 RUSTFLAGS="-C target-feature=+simd128",LLVM 会把这段代码完全向量化——同样的 inner loop 变成一条 v128.load + 一条 v128.add 即可处理 16 个像素。Ch19 会把向量化全过程展开。这里先给一行 punch line:

Add RUSTFLAGS="-C target-feature=+simd128" and LLVM vectorises completely — the inner loop becomes one v128.load + one v128.add processing 16 pixels. Ch19 unfolds the full vectorisation. One-line punchline:

RELATIVE THROUGHPUT · 1920×1080 image · M1 Pro · Chrome 132

1.0×

Wasm · Liftoff

2.7×

Wasm · TurboFan

4.3×

Wasm · TurboFan + SIMD

6.8×

为什么 Liftoff 已经快 2.7 倍WHY EVEN LIFTOFF IS 2.7× FASTER Liftoff 没有做寄存器分配,但它把 wasm 字节翻成机器码的代价相对 JS 已经低很多:① 类型确定,无 IC、无 deopt;② 内存是 Uint8Array 的一片,无 GC write barrier;③ sum/9 单次 div,JS 要先 ToInt32 再 ToUInt8,慢 3 倍。翻成机器码的"愚蠢版"已经胜过 JS 的"聪明版"。Liftoff 名字来自飞机起飞——快到不需要长跑道。 Liftoff does not do register allocation, but its cost-per-byte-of-output is already much lower than JS: ① types are settled, no IC, no deopt; ② memory is a slice of Uint8Array, no GC write barrier; ③ sum/9 is one div, JS has to ToInt32 then ToUInt8 — 3× slower. The "dumb" wasm machine code already beats the "clever" JS machine code. Liftoff is named after takeoff — fast enough not to need a runway.

"The Hot Loop" 在后续 22 章的回引地图

Where The Hot Loop returns in the next 22 chapters

Ch	在那一章里它是什么What it looks like there
06	前 8 字节:\0asm magic + versionFirst 8 bytes: \0asm magic + version
07	展开它的 11 个 sectionIts 11 sections fully unpacked
08	i32 占 99%,出现 i32→u8 的窄化99% i32, with i32→u8 narrowing
09	用到哪 6 类 opcodeWhich 6 opcode families it touches
10	src + dst 在 linear memory 的 layoutsrc + dst layout in linear memory
11	验证它的类型栈一步一步走Type stack walks during validation
12 / 13	流式 decode,边下边编Streaming decode, compile-while-fetching
14	Liftoff 出的机器码Liftoff's machine code
15	TurboFan 的 sea-of-nodes 图TurboFan's sea-of-nodes graph
16	实例化时 memory 怎么分配How instantiation allocates memory
17	从 JS 调它要花多少 nsJS calling it: how many ns per call
18 / 19 / 20 / 21 / 22	线程版 / SIMD 版 / wasm-GC 改写 / 组件模型导出Threaded · SIMD · GC-rewrite · Component-exported
23 ~ 25	性能分析、DevTools 调试、移植到 Figma/Photoshop 的回响Perf profile, DevTools debug, echoes in Figma / Photoshop

一张图看完它的一生

The whole life in one frame

下面是 hot.rs 从源码到屏幕的12 个快照。每一格是这段代码在那一秒的实际面貌——左边五格是"静态形态",中间四格是"编译动作",右边三格是"运行时事件"。读完这张图,你应该能在脑里把后面 Act III/IV/V 的每一章对应回这条主线上。后面 8 个章节(Ch11-Ch19)的顶部都挂了一个 "MAIN-LINE STOP X/12" 胶囊,告诉你"你现在站在哪一格"。

Below: 12 snapshots of hot.rs, from source to pixel. Each cell is what the code actually looks like at that moment — the first five cells are static forms, the middle four are compiler actions, the last three are runtime events. After this image, every chapter in Acts III/IV/V can be slotted back onto this main-line. Eight later chapters (Ch11–Ch19) carry a "MAIN-LINE STOP X/12" capsule at the top telling you "which cell you're standing in".

✦ MAIN-LINE STORYBOARD · 12 SNAPSHOTS

从 cargo build 到屏幕上一个像素,11 行 Rust 走过 12 个快照——前 4 格是开发者机器上发生的事(rustc → LLVM IR → wat → .wasm 字节),第 5 格是CDN 到浏览器的传输,中间 6 格是渲染进程内的编译与第一次执行,最后 1 格是 SIMD 向量化版本在 GPU 显示前的最后一秒。每格底部标了对应章节,后面 8 章的顶部都挂"MAIN-LINE STOP X/12" 胶囊回引这张图。读完这张图你就拿到了整篇文章的骨架。

From cargo build to a pixel on screen, 11 lines of Rust pass through 12 snapshots — the first four cells happen on the developer's machine (rustc → LLVM IR → wat → .wasm bytes), the fifth is CDN to browser transport, the middle six are compilation and first execution inside the renderer process, and the last is the SIMD-vectorised version one heartbeat before the GPU lights the pixel. Every cell is anchored to a chapter; the next eight chapters carry a "MAIN-LINE STOP X/12" capsule at the top linking back here. Read this picture and you hold the article's skeleton.

→ 向右拖动可查看完整 12 格

→ scroll horizontally to see all 12 cells

11 行 Rust,要走 17 道工序
才能在屏幕上动一个像素。 The Hot Loop · main-line

11 lines of Rust, 17 stages,
before one pixel moves on screen. The Hot Loop · main-line

INPUT

hot.rs · 11 LOC3×3 box blur · u8 grayscale3×3 box blur · u8 grayscale

→

OUTPUT

hot.wasm · 192 byte这是后面 22 章的样本the sample for the next 22 chapters

上一章previous← Ch04 · JS 天花板JS ceiling 下一章nextCh06 · Module 的外壳Module shell →

ACT III · BINARY ANATOMY

把 192 字节摊在桌上。

192 bytes, spread on the table.

这一段是 wasm 的解剖学:外壳怎么定形,11 段 section 各自存什么,类型系统从 4 个数字类型怎么膨胀到 v128 和 GC,400+ 条 opcode 怎么塞进单字节,线性内存为什么 64 KiB 一页,以及——验证算法为什么能在线性时间里证明类型安全。这 6 章读完,你拿到一个 .wasm 文件可以一字一字读出来。

This act is wasm's anatomy: how the shell is shaped, what the 11 sections each carry, how the type system grew from four numeric types into v128 and GC, how 400+ opcodes pack into single bytes, why linear memory is 64 KiB per page, and — how validation proves type safety in linear time. After these six chapters you can pick up a .wasm file and read it byte by byte.

STAGE 01 · ANATOMY

Module 的外壳 — 8 个字节的承诺

The Module shell — a promise in 8 bytes

\0asm + version + 11 段子

\0asm + version + 11 sections

Layer

binary

Header

8 bytes

Endianness

little

Spec §

5.5.2 Module

这一段在做什么

What it does

所有 .wasm 文件的前 8 字节固定 00 61 73 6d 01 00 00 00——magic 4 字节(\0asm),版本 4 字节(目前是 1)。后面跟着一个 section 序列,每段一个 ID + LEB128 长度 + 内容。仅此而已。 The first 8 bytes of every .wasm file are fixed: 00 61 73 6d 01 00 00 00 — magic 4 bytes (\0asm), version 4 bytes (currently 1). Then comes a sequence of sections: each is an ID byte + LEB128 length + payload. That's all.

为什么这么设计

Why this design

两个目标:① 文件第 1 字节 0x00 让任何把它当 JS 解析的工具立刻报错;② section 用 ID + length 而非偏移表,允许流式解析——边下载边解码。Ch12 会用到这个性质。 Two goals: ① byte 0 = 0x00 ensures any tool that tries to parse the file as JS fails immediately; ② sections use ID + length (not an offset table) to enable streaming parse — decode while downloading. Ch12 hinges on this.

关键代码

Key code

v8/src/wasm/module-decoder.cc :: DecodeModule()

FIG 06 · hot.wasm byte map · 192 byte

192 字节里,code section 占 79%——其余 7 个 section 加起来只占 21%。这是 wasm "structure 紧凑,代码占比高" 的可视证据。magic + version 8 字节让任何 JS 解析器立刻报错;接下来每个 section 都是 id + LEB128 length + payload 的三段式。

Of 192 bytes, the code section is 79% — the other seven sections together account for 21%. Visible proof of wasm's "compact structure, code-heavy ratio". Magic + version 8 bytes make any JS parser fail instantly; every following section follows id + LEB128 length + payload.

第一字节级别的拆解

Byte-level walkthrough

00 61 73 6d

magic = ASCII \0asm。第一字节 NULL 让 cat file.wasm | node 立刻抛 SyntaxError。Byte 0 = NULL makes cat file.wasm | node throw SyntaxError instantly.

01 00 00 00

version = 1, little-endian u32。这一格自 2017 MVP 起从未涨过——演进通过新的 section表达,不动版本号。This field hasn't moved since the 2017 MVP — evolution happens through new sections, not version bumps.

01 ll ll ll

section type。0x01 = Type section,后面紧跟 LEB128 编码的长度 ll(可变长 1~5 字节)。0x01 = Type section, immediately followed by LEB128-encoded length ll (variable 1–5 bytes).

ll ll ll ...

section payload。每段内部结构由 section type 决定。除了 Custom(0x00)外,其余 section 必须按 ID 升序出现。Payload format is determined by section type. Except for Custom (0x00), sections must appear in ascending ID order.

LEB128 是什么WHAT IS LEB128 Little Endian Base-128。一种变长整数编码:每字节用 7 位载数据,最高位 1 表示"还有",0 表示"完了"。0~127 用 1 字节,128~16383 用 2 字节,以此类推。DWARF 调试格式发明,wasm 拿来用——因为大多数 wasm 整数都很小(类型索引、locals 数、跳转目标),平均不到 2 字节。所有 wasm 整数(除指令的立即数 i32.const)都是 LEB128 编码的。 Little Endian Base-128. A variable-length integer encoding: 7 data bits per byte, top bit = 1 means "more coming", 0 means "done". 0–127 use 1 byte, 128–16383 use 2 bytes, and so on. Invented for DWARF, adopted by wasm — because most wasm integers are small (type indices, local counts, branch targets), averaging < 2 bytes. Every wasm integer (except some immediate operands) is LEB128-encoded.

FIG 07 · module topology · reference graph

11 个 section 按引用方向分四类:蓝声明(谁存在)、绿函数体(真正的字节码)、橙初始化器(给 table/memory 灌数据)、紫宿主接口(import/export/start)。这四类必须按 ID 升序出现——只有 Custom 段(灰)可以出现在任何地方,出现多少次都行。

The 11 sections split four ways by reference direction: blue declarations (who exists), green bodies (the real bytecode), copper initialisers (filling tables/memory), purple host-facing (import/export/start). The four must appear in ascending ID order — only Custom (grey) may appear anywhere, any number of times.

所有 12 个 section 的 ID

All 12 section IDs

id 0x00

Custom

debug info / name 表 / 元数据,可重复出现。

debug info / name table / metadata, may repeat.

id 0x01

Type

所有函数签名。

all function signatures.

id 0x02

Import

从宿主导入的 func / table / mem / global。

funcs / tables / mems / globals imported from host.

id 0x03

Function

本模块函数 i 的类型是 Type[k]。

"function i has type Type[k]".

id 0x04

Table

funcref / externref 数组(动态 dispatch)。

funcref / externref arrays (dynamic dispatch).

id 0x05

Memory

线性内存声明(min/max page)。

linear memory declaration (min/max pages).

id 0x06

Global

模块级全局变量(像静态变量)。

module-level global variables (like statics).

id 0x07

Export

给宿主用的名字 → 内部索引。

names visible to host → internal indices.

id 0x08

Start

模块初始化函数(像 ctor)。

module initialiser (like a ctor).

id 0x09

Element

table 的初始填充。

table initialisers.

id 0x0a

Code

所有函数体的字节码。

bytecode bodies for all functions.

id 0x0b

Data

线性内存的初始数据。

initial data for linear memory.

id 0x0c

DataCount

bulk memory 后期加的"data 段计数"。2020

"how many data segments", added with bulk memory.2020

Custom section 是规范留给所有人的逃生舱口——它没有规定的格式,只有一个名字(LEB128 长度 + UTF-8 字节)和任意 payload。DWARF 调试信息、source map、wasm-bindgen 的 JS 胶水都藏在这里。Ch24 会展开 name custom section,它给函数和局部变量起名,让 DevTools 能显示符号。

The Custom section is the spec's escape hatch for everyone — no prescribed format, just a name (LEB128 length + UTF-8 bytes) and arbitrary payload. DWARF debug info, source maps, and wasm-bindgen's JS glue all hide here. Ch24 unpacks the name custom section, which names functions and locals so DevTools can show symbols.

主线回引 · The Hot Loop 的外壳MAIN-LINE · The Hot Loop's shell

192 字节里只用到了 7 个 section

192 bytes use only 7 sections

回看 Act II 给的 192 字节十六进制,排查 section ID:01(type)、03(function)、05(memory)、07(export)、0a(code)。没有 import,没有 table,没有 global,没有 data——因为我们的卷积函数不依赖宿主、不做间接调用、没有模块级常量、不预填内存。最小可运行 wasm 模块就是这 7 段。

Re-read the 192-byte hex from Act II and you'll find section IDs: 01 (type), 03 (function), 05 (memory), 07 (export), 0a (code). No import, table, global, or data — because our blur function imports nothing, uses no indirect calls, has no module-level constants, and pre-fills no memory. The minimum runnable wasm is exactly these 7 sections.

CLI

wasm-objdump -h hot.wasm 列所有 sectionlist every section

CLI

xxd hot.wasm | head -2 看头 8 字节peek at the first 8 bytes

SOURCE

v8/src/wasm/module-decoder-impl.h

本章引用Chapter references

spec: Wasm Core 2.0 · §5.5.2 Module structure
spec: DWARF 5 spec · §7.6 LEB128

上一章previous← Main-line · The Hot LoopThe Hot Loop 下一章nextCh07 · 11 段 section 的内部Inside the 11 sections →

STAGE 02 · ANATOMY

11 段 section — 一个 .wasm 文件的器官学

11 sections — the organology of a .wasm file

每一段都是一个 K-V 仓库

each section is a K-V vault

Sections

11 known + Custom

Order

ascending by ID

Custom anywhere

unlimited

Spec §

5.5.3 ~ 5.5.13

看一个 .wasm 文件最容易的方式,就是把它当成一组按 ID 升序排列的 K-V 仓库。每个 section 解一个具体问题。这一章把 11 个 section 各拆一遍,每段给一个最小例子 + 在主线 Hot Loop 里的角色。

The easiest way to read a .wasm file is to treat it as a sequence of K-V vaults, ordered by ID. Each section answers one specific question. This chapter walks all 11, with a minimal example and the role each plays in the main-line Hot Loop.

① Type section · 函数签名表

① Type section · the signature table

问题:"函数 $blur3 长什么样?"
答:"它是 type[0]:(i32 i32 i32 i32) -> ()"——所有 module 内出现的函数签名先注册一遍,后面引用用索引。

Question: "What does $blur3 look like?"
Answer: "It's type[0]: (i32 i32 i32 i32) -> ()" — every signature used in the module is registered up front, later referenced by index.

(type $t0 (func (param i32 i32 i32 i32))) ;; void return implied

为什么把签名独立成表?因为同一个签名会被多个函数共用——主线里只有一个函数,签名表有 1 条。但 Photoshop 的 wasm 里有几十万个函数,只用到几百种签名,共用让 type section 体积小 2~3 个数量级。

Why pull signatures into their own table? Because the same signature is shared among many functions — the main-line has 1 function, so 1 entry. Photoshop's wasm has hundreds of thousands of functions but only hundreds of distinct signatures; sharing collapses the type section by 2–3 orders of magnitude.

② Import section · 来自宿主的函数与对象

② Import section · functions and objects from the host

主线 Hot Loop 没有 import——纯数学函数,不依赖任何 JS API。但下面是 Photoshop 的实际 import section 缩影:

The main-line Hot Loop has no imports — pure math, no JS-side dependency. Below is a snapshot of Photoshop's real import section:

(import "env" "memory"          (memory 256 32768 shared))
(import "env" "__indirect_func_table" (table 4096 funcref))
(import "env" "emscripten_resize_heap" (func (param i32) (result i32)))
(import "wasi_snapshot_preview1" "fd_write" (func (param i32 i32 i32 i32) (result i32)))

四个观察:① 每一条 import 是两段名字 + 一个签名("env" "memory" 是惯例,Emscripten 用 "env" 当 module 名);② memory 可以被 import——这是多线程 wasm 共享内存的关键;③ table 也能 import,允许 JS 给函数指针填值;④ WASI 函数通过 import 引入,在浏览器外的 wasm 里这是主要的"系统调用"通道。

Four notes: ① each import is two-segment name + one signature ("env" "memory" is the Emscripten convention); ② memory itself can be imported — that's the foundation of shared-memory multi-threaded wasm; ③ tables too, letting JS populate function pointers; ④ WASI functions enter via imports, which is the primary "syscall" channel for non-browser wasm.

③ Function section · 函数和签名的连线

③ Function section · wiring functions to their signatures

这一段长得最简洁——就是一个 type index 数组:"函数 0 用 type[0],函数 1 用 type[2],函数 2 用 type[2],..."。函数体本身不在这里,它们在 Code section(0x0a)。把"签名声明"和"函数体"分开是为了流式解码——下载到 function section 就能开始检查 import/export 的类型匹配,不必等 code 段下完。

The plainest section — just an array of type indices: "function 0 is type[0], function 1 is type[2], function 2 is type[2], …". The body lives elsewhere, in the Code section (0x0a). The split between "signature declaration" and "body" exists for streaming decode — once function section is in, you can check import/export type matching without waiting for code.

④ Table section · funcref 的数组

④ Table section · the funcref array

Table 是 wasm 的 "函数指针表",最初是为了 C 函数指针 / C++ vtable / Java 接口分发服务。每个 table 元素是 funcref(MVP)或 externref(2021)。call_indirect 指令用 table 索引 + 类型 ID 调用——类型 ID 必须匹配,否则 trap。Ch09 / Ch11 会展开。

Table is wasm's "function pointer table", born to serve C function pointers / C++ vtables / Java interfaces. Each element is funcref (MVP) or externref (2021). call_indirect uses (table idx + type id) to dispatch — the type id must match or it traps. Expanded in Ch09 / Ch11.

2021 年 reference-types 提案前,一个 module 只能有一张 table。之后可以有多张。主线 Hot Loop 不用 table——它没有间接调用。

Before reference-types (2021), a module could carry only one table. After: multiple. The main-line Hot Loop uses no table — no indirect call.

⑤ Memory section · 线性内存的声明

⑤ Memory section · linear memory declaration

主线声明 (memory 1)——min=1 page=64 KiB,max 不指定。Ch10 完整展开线性内存。

Main-line declares (memory 1) — min=1 page=64 KiB, max unspecified. Ch10 covers linear memory fully.

⑥ Global section · 模块级常量与变量

⑥ Global section · module-level constants and vars

(global $stack_top (mut i32) (i32.const 0x10000))
(global $PI         f64       (f64.const 3.14159265358979))

每个 global 是 (type, mut?, init expr) 三件套。mut 标记可写,initialiser 是一段受限的常量表达式(只能用 iN.const / fN.const / global.get)。Rust / C 的 static 数据如果是常量就来这里,如果是读写就放到 linear memory 的 data section。

Each global is (type, mut?, init expr). mut means writable; initialiser is a constant expression (only iN.const / fN.const / global.get). Rust/C static data lives here when constant; mutable static data goes into linear memory via the data section.

⑦ Export section · 给宿主看的窗口

⑦ Export section · the host-facing window

(export "memory" (memory 0))
(export "blur3"  (func $blur3))
(export "alloc"  (func $alloc))

name → (kind, index) 的字典。kind 可以是 func / table / memory / global / tag(tag 是 exception handling 提案加的)。所有从 JS 调 wasm 的入口都在这里。 JS 那边的 instance.exports.blur3 就是查这张表。

A name → (kind, index) dictionary. Kind ∈ {func, table, memory, global, tag} (tag added by exception handling). Every JS-to-wasm entry point lives here. JS-side instance.exports.blur3 looks up this very table.

⑧ Start section · 模块的 ctor

⑧ Start section · the module's ctor

仅一个数字——某个函数的索引。该函数不能有参数,不能有返回值,在 module instantiate 完成的最后阶段被引擎自动调用。用来做模块级初始化(注册回调、填充常量表)。主线 Hot Loop 没有 start。

Just one number — the index of a function. The start function takes no params, returns nothing, and is invoked automatically by the engine at the end of instantiation. Used for module-level setup (registering callbacks, filling constant tables). Main-line Hot Loop omits start.

⑨ Element section · 给 table 填值

⑨ Element section · table initialisers

语义类似 data section,但写入对象是 table 而非 memory。一个 module 实例化时,element 段把 funcref 们填进对应 table 槽位。C 程序的"函数指针表"就在这里活;C++ 的 vtable 也是。

Semantically similar to data section but writing into tables rather than memory. On instantiation, element segments populate funcref slots. C function pointer tables live here; so do C++ vtables.

⑩ Code section · 字节码本身

⑩ Code section · the bytecode itself

这是最大的一段——主线 Hot Loop 的 code section 占整个文件的 80% 字节。每个函数体的格式是:locals 声明(类型聚合表)+ 表达式序列 + 终止符 0x0b (end)。Ch09 把指令格式撕开。

The biggest section — the Hot Loop's code section is 80% of the entire file. Each function body's format is: locals declaration (run-length encoded by type) + expression sequence + terminator 0x0b (end). Ch09 unpacks the instruction format.

⑪ Data section · 给 memory 填值

⑪ Data section · memory initialisers

把"这段字节请在实例化时写到 linear memory 的某地址"批量声明。C 程序的字符串字面量、Rust 的 static 数组、Emscripten 的 stdlib 数据表都在这里。MVP 时每条数据段必须 active(立即写入);bulk memory 提案(2020)加了 passive 模式,允许 wasm 代码显式调 memory.init 来用——支持代码热更新。

Bulk-declares "at instantiation, write these bytes to memory at address X". C string literals, Rust static arrays, Emscripten's stdlib tables all live here. MVP required every segment to be active (written immediately). The bulk-memory proposal (2020) added passive mode, letting code call memory.init explicitly — supports hot reload.

⑫ DataCount section · 一个事后补丁

⑫ DataCount section · a retroactive patch

2020 年加 bulk memory 后,memory.init segIdx 指令需要在验证时立刻知道 data 段总数。但 code section 在 data section 前面解析——为了不让 validator 反复回扫,设计者插入了一个新 section 0x0c,专门告诉解码器"我有 N 个 data 段"。这是 wasm spec 仅有的"事后补丁"section,反映了流式解析的硬约束。

Bulk memory (2020) made memory.init segIdx need to know the total number of data segments during validation. But code section parses before data — to spare the validator from a back-scan, the designers slipped in a new section 0x0c that just says "I have N data segments". It's the only "retroactive patch" section in the spec, reflecting the hard constraint of streaming parse.

Section 不是设计,是约束的化石。 Field Note · 03

Sections are not design.
They are fossilised constraints. Field Note · 03

CLI

wasm2wat hot.wasm · 把所有 section 解成可读文本expand every section into readable wat

CLI

wasm-objdump -s hot.wasm · 逐段十六进制 dumpsection-by-section hex dump

CLI

wasm-objdump -x hot.wasm · 看 import/export/global 摘要summarise imports/exports/globals

本章引用Chapter references

spec: Wasm Core 2.0 · §5.5.3 ~ §5.5.13
tool: WABT · wasm-objdump source

上一章previous← Ch06 · 外壳Module shell 下一章nextCh08 · 类型系统Type system →

STAGE 03 · ANATOMY

类型系统 — 从 4 到 6 再到无限

Type system — from 4 to 6, then to infinity

小到 1 字节,大到任意结构

one byte to arbitrary structure

MVP

4 types (i32 i64 f32 f64)

2021

+ v128 + funcref/externref

2024

+ struct/array/i31 (GC)

Spec §

2.3 Types

2017 年 MVP 上线时,wasm 一共只有 4 种值类型:i32 / i64 / f32 / f64。理由极其务实:这是所有 CPU 都能直接处理的 4 种,JIT 不需要费力适配。九年后的今天,加上 SIMD 的 v128、reference 的 funcref/externref、以及 wasm-GC 的 struct/array/i31,wasm 已经有了"近似于一门完整语言"的类型系统——但每一次扩张都要回答同一个问题:新类型怎么不破坏栈机的"一字节 opcode"承诺?

The 2017 MVP shipped with just four value types: i32 / i64 / f32 / f64. The reasoning was ruthlessly practical: these are the four that every CPU handles natively, so the JIT has no fitting to do. Nine years on, with SIMD's v128, reference-types' funcref/externref, and wasm-GC's struct/array/i31, wasm now has a type system that "looks like a real language". But every expansion answers the same question: how does the new type not break the stack machine's "one-byte opcode" promise?

FIG 08 · type lattice · post-GC (2024+)

2024 之后,wasm 值类型分两个世界:左半是 5 个独立的原始类型(i32/i64/f32/f64/v128),没有子类型关系;右半是引用类型 lattice——顶层 anyref,下分 eqref / funcref / externref,再下到具体 struct / array / 函数签名引用。所有 nullable 引用最终指向 nullref。

After 2024, wasm value types split into two worlds: left — 5 independent primitive types (i32/i64/f32/f64/v128), no subtyping; right — a reference type lattice topped by anyref, descending into eqref / funcref / externref, then concrete struct / array / function-sig references. All nullable refs ultimately point to nullref.

值类型的全景

All value types at a glance

Category	Type	Size	Tag (encoding)	Since	What
numeric	`i32`	4 byte	`0x7F`	MVP	32 位整数(符号自指令)32-bit integer (sign-per-op)
	`i64`	8 byte	`0x7E`	MVP	64 位整数64-bit integer
	`f32`	4 byte	`0x7D`	MVP	IEEE 754 single
	`f64`	8 byte	`0x7C`	MVP	IEEE 754 double
vector	`v128`	16 byte	`0x7B`	2021	128 位 SIMD,可解释成 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64128-bit SIMD, viewable as 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64
reference	`funcref`	ptr	`0x70`	2021	指向 wasm 函数的不透明引用opaque reference to a wasm function
reference	`externref`	ptr	`0x6F`	2021	指向宿主对象(JS Object / DOM 节点)reference to a host object (JS Object / DOM node)
GC (2024)	`(ref $T)`	ptr	`0x6B`	2024	指向 struct / array 的强类型引用typed reference to a struct or array
	`(ref null $T)`	ptr	`0x6C`	2024	允许为 null 的版本nullable version
	`i31ref`	31-bit	`0x6C+`	2024	SMI 风格的内联小整数(避免堆分配)SMI-style inline small int (skip heap alloc)

关于 tag byte 的负数编码ON THE NEGATIVE TAG BYTES 注意 i32 的 tag 是 0x7F = -1,i64 是 0x7E = -2——这些是 signed LEB128 编码的小负数。规范选负数空间是有意的:正数空间留给"type index"(给 GC 用),这样验证器一字节就能判断"这是基本类型还是 struct 引用"。tag 设计是为 GC 的未来留的接口——MVP 时代设计者已经预想到这一步。 Note i32's tag is 0x7F = -1, i64 is 0x7E = -2 — these are signed LEB128 small negatives. The negative space was deliberate: positive space is reserved for type indices (for GC), so the validator can decide "basic type vs struct reference" in one byte. The tag design is the interface MVP designers left for GC's future — anticipated from day one.

"函数类型" 也是一种类型

"Function type" is also a type

在 Type section 里出现的 (func (param i32 i32) (result i32)) 用编码 0x60 引导。MVP 时只有 func 这一种"组合类型",GC 提案后加了 0x5F = struct 和 0x5E = array——把 Type section 从"函数签名表"扩成了"组合类型表"。同一段 binary 在 2017 年和 2026 年解析出来的"section 0x01"含义已经悄悄扩张了一倍。

In Type section, (func (param i32 i32) (result i32)) begins with tag 0x60. The MVP had only this one "compound type". The GC proposal added 0x5F = struct and 0x5E = array — quietly stretching Type section from "signature table" to "compound-type table". The same byte (section 0x01) means twice as much in 2026 as in 2017.

类型系统的"缺失":没有 i8/i16/u32

The system's absences: no i8/i16/u32

"为什么 wasm 没有 i8 类型?字符串处理要怎么办?"——这是另一个常见疑问。答案:wasm 的值类型不区分 i8/i16/i32,但内存读写有 i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s——读 8/16 位 byte,符号或零扩展到 i32。窄类型只存在于 memory 边界,寄存器里永远是 i32 或 i64。

"Why no i8? How do you process strings?" — another perennial question. Answer: wasm's value types don't distinguish i8/i16/i32, but memory access does: i32.load8_u / i32.load8_s / i32.load16_u / i32.load16_s read 8/16-bit bytes and sign- or zero-extend to i32. Narrow types exist only at the memory boundary; in registers, everything is i32 or i64.

同理无符号 vs 有符号区分也只活在指令层面:i32.div_s(signed) vs i32.div_u(unsigned)、i32.lt_s vs i32.lt_u。"类型只标 32/64 位,符号由 op 携带"是 wasm 的核心设计简化——让值类型集合保持小,降低验证器和 JIT 的复杂度。

Same with signed-vs-unsigned: it lives at the opcode layer, not the type layer — i32.div_s vs i32.div_u, i32.lt_s vs i32.lt_u. "Types carry width only; signedness rides on the op" is a core simplification. It keeps the value-type set small and shrinks both validator and JIT.

主线回引 · The Hot Loop 的类型分布

Main-line · types in The Hot Loop

TYPE USAGE · hot.wasm · 192 byte

i32

96% · 8 locals · 24 ops

i64

f32

f64

v128 (with SIMD flag)

1 op

主线 Hot Loop 几乎是 100% i32——这是 wasm 的常态。绝大多数 LLVM 后端在 wasm32 目标上把 usize / size_t 编译成 i32(因为 wasm32 上指针就是 32 位),数组下标也是 i32。f64 主要出现在浮点计算场景,i64 出现在 BigInt 场景。如果你 grep 一个真实 wasm 模块,i32. 开头的 opcode 占 70% ~ 90%。

The main-line is ~100% i32 — wasm's norm. Most LLVM backends compile usize / size_t to i32 on wasm32 (pointers are 32-bit), and array indices are i32. f64 shows up in floating-point math; i64 in BigInt scenarios. Grep any real wasm module and 70–90% of opcodes start with i32.

本章引用Chapter references

spec: Wasm Core 2.0 · §2.3 Types
std: IEEE 754-2019 · f32/f64 semantics
proposal: gc proposal · struct/array/i31 types

上一章previous← Ch07 · 11 段11 sections 下一章nextCh09 · 指令集Opcodes →

STAGE 04 · ANATOMY

指令集 — 一字节里的 430 条命令

Opcodes — 430 commands in a single byte

six families, one prefix scheme

MVP opcodes

~190 (1 byte)

2026 total

~430 (with prefix)

Encoding

1B opcode + LEB128 imm

Spec §

5.4 Instructions

MVP 时 wasm 用了 256 个 opcode 空间里的 190 个左右。后来 SIMD / Bulk Memory / Reference Types / GC / Atomics 每个提案都要加新指令,字节空间不够了。解法是多字节 opcode:第一字节用一个保留值(0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC),后跟一个 LEB128 子 opcode。单字节空间维持紧凑,扩展走 prefix。

The MVP used ~190 of the 256 opcode slots. SIMD / Bulk Memory / Reference Types / GC / Atomics each demanded new ops, and the byte space ran short. The fix: multi-byte opcodes. A reserved first byte (0xFC = Bulk, 0xFD = SIMD, 0xFE = Atomics, 0xFB = GC) followed by a LEB128 sub-opcode. The single-byte space stays compact; extensions ride the prefix.

FIG 09 · opcode space · 16×16 grid

256 个单字节 opcode 槽位里,numeric + memory + 控制流 占去六成多。底部 0xFB-0xFE 四个紫色格是 prefix 字节——每个 prefix 后跟一个 LEB128 子 opcode,把扩展空间延展到无穷。2017 MVP 时 只有上半部分被占用,所有 2019 后加的指令(SIMD/GC/Atomics)都缩在这四个 prefix 后面。

Of 256 single-byte opcode slots, numeric + memory + control occupy over 60%. The four purple cells at the bottom (0xFB–0xFE) are prefix bytes — each followed by a LEB128 sub-opcode, extending the space without bound. In the 2017 MVP only the upper half was filled; every post-2019 op (SIMD/GC/Atomics) lives behind these four prefixes.

六大指令家族

Six instruction families

1 · control

控制流

Control flow

block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop。没有 goto。结构化控制是 wasm 的硬约束,Ch11 的验证算法依赖这个。

block / loop / if / else / br / br_if / br_table / return / call / call_indirect / unreachable / nop. No goto. Structured control is a hard invariant — Ch11's validator depends on it.

0x00 ~ 0x11

2 · param

参数/栈操作

Param / stack

drop / select / local.get / local.set / local.tee / global.get / global.set。tee 是 set 的"留个备份在栈顶"版。

drop / select / local.get / local.set / local.tee / global.get / global.set. tee is set that also keeps a copy on the stack top.

0x1A ~ 0x24

3 · memory

内存访问

Memory access

i32.load / i32.load8_s / i32.load8_u / ... / i32.store / i32.store8 / memory.size / memory.grow。每条 load/store 带 align + offset 立即数。

i32.load / i32.load8_s / i32.load8_u / … / i32.store / i32.store8 / memory.size / memory.grow. Each load/store carries align + offset immediates.

0x28 ~ 0x40

4 · const

常量

Constants

i32.const / i64.const / f32.const / f64.const。i32/i64 立即数用 signed LEB128;f32/f64 用原始字节序。

i32.const / i64.const / f32.const / f64.const. i32/i64 immediates use signed LEB128; f32/f64 use raw IEEE bytes.

0x41 ~ 0x44

5 · numeric

算术 / 比较

Arithmetic / compare

i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / ... / f64.sqrt / f64.nearest。约 130 条,覆盖 IEEE 754 算术。

i32.add / i32.sub / i32.mul / i32.div_s / i32.div_u / i32.eq / i32.lt_s / … / f64.sqrt / f64.nearest. ~130 ops, full IEEE 754 coverage.

0x45 ~ 0xC4

6 · prefix

扩展指令

Extension prefixes

0xFC nn = 饱和转换 + bulk memory;0xFD nn = SIMD(~250 op);0xFE nn = atomics(threads);0xFB nn = GC。子 opcode 用 LEB128,所以是无界的。

0xFC nn = saturating convert + bulk memory; 0xFD nn = SIMD (~250 ops); 0xFE nn = atomics (threads); 0xFB nn = GC. Sub-opcode is LEB128, so unbounded.

0xFB ~ 0xFE

单条指令的拆解

Anatomy of a single instruction

以 i32.load offset=4 align=2 为例。它的字节序列:

Take i32.load offset=4 align=2. Its byte sequence:

opcode = i32.load。栈上 pop 一个 i32 地址,push 一个 i32 加载值。pops one i32 address, pushes one i32 loaded value.

align = log₂(对齐) = 2 → 4 byte 对齐。这只是"提示",JIT 可据此选指令(unaligned 也 OK)。A hint; the JIT may pick a specific instruction (unaligned still works).

offset = 4(LEB128)。编译期常量,加到栈顶地址上。A compile-time constant added to the stack-top address.

三字节里隐藏的设计点:① opcode 是 1 字节,空间精确;② align 是 hint 不是约束——这让 wasm 能跑在 ARM(对齐)和 x86(自由对齐)上无差别;③ offset 是常量,Liftoff 可以 fold 进 [base + reg + 4] 这种寻址模式,免一条 add。三字节里塞了三层信息。

Three bytes hide three design points: ① opcode is one byte, slot-precise; ② align is a hint, not a constraint — letting wasm run on both ARM (aligned) and x86 (free) without changes; ③ offset is constant, so Liftoff folds it into [base + reg + 4] addressing — saving one add. Three bytes, three layers.

主线回引 · The Hot Loop 用到的 6 类 opcode

Main-line · the 6 opcode families in The Hot Loop

Family	Used	Count	Example
control	`block / loop / br_if`	4	`block 0x40`
param	`local.get / local.set`	14	`local.get 0x20 00`
memory	`i32.load8_u / i32.store8`	10	`i32.load8_u 0x2D 00 00`
const	`i32.const`	5	`i32.const 0x41 09` (= 9)
numeric	`i32.add / i32.sub / i32.div_u / i32.mul / i32.ge_s`	16	`i32.add 0x6A`
prefix	—	0	本主线无 SIMD,Ch19 才会出现 0xFDno SIMD; 0xFD appears in Ch19

49 条指令,49 字节(opcode 部分)+ 立即数(平均 1.2 字节)≈ 110 字节,加上 locals 声明 5 字节、function header 6 字节,凑成约 121 字节的 code section。再加上前面的 6 个 section header 和 export 段,合 192 字节。密度的来源在每一字节都看得见。

49 ops × (1B opcode + ~1.2B imm avg) ≈ 110 bytes, plus 5B locals declaration + 6B function header ≈ 121 bytes of code section. Add six section headers and the export segment: 192 bytes. Density is visible in every byte.

关于 SIMD opcode 的特殊性ON THE SIMD OPCODE QUIRK SIMD 提案在 2021 年 phase 4 时,V8 实测发现 0xFD nn 的两字节 opcode 在 inner loop 里每次都要多 fetch 一字节,影响热路径性能。最终方案是 在 Liftoff 阶段把 SIMD 字节序展开成两字节但 TurboFan IR 里仍按一字节代理——这是 wasm spec 罕见的"实现影响 spec"的例子,SIMD 的子 opcode 数量被精心控制在 256 内,避免出现 3 字节 opcode 的可能。 When SIMD reached phase 4 in 2021, V8 measured that the two-byte 0xFD nn opcode forced an extra fetch on every inner-loop iteration. The fix: Liftoff decodes it as two bytes, but TurboFan IR proxies it as one — a rare instance of "implementation pressuring spec". SIMD sub-opcodes are deliberately capped at 256 to avoid the 3-byte opcode scenario.

本章引用Chapter references

spec: Wasm Core 2.0 · §5.4 Instructions
proposal: simd proposal · 0xFD prefix sub-opcodes

上一章previous← Ch08 · 类型系统Type system 下一章nextCh10 · 线性内存Linear memory →

STAGE 05 · ANATOMY

线性内存 — 一片连续的、可越界的、永不 GC 的字节

Linear memory — a flat, bounded, GC-free slab of bytes

64 KiB 一页,最大 4 GiB

64 KiB per page, 4 GiB max

Page size

64 KiB (2^16)

Max pages (wasm32)

65 536 → 4 GiB

Bounds check

trap on overrun

JS view

ArrayBuffer / SAB

线性内存是 wasm 最简洁的设计之一——它就是一片连续字节,从地址 0 开始,长度是 N 个 64 KiB 的 page。所有 i32.load / i32.store 都读写这片字节。没有指针类型,没有 GC,没有别的内存空间——堆、栈、静态数据全部混在这一片。这片字节在 JS 那边是一个 WebAssembly.Memory 对象,可以 new Uint8Array(mem.buffer) 直接看到原始字节。

Linear memory is one of wasm's most distilled designs — a flat slab of bytes starting at address 0, length = N × 64 KiB pages. Every i32.load / i32.store reads or writes this slab. No pointer type, no GC, no other memory space — heap, stack, static data all share the slab. From JS, this is a WebAssembly.Memory object; new Uint8Array(mem.buffer) lets you see the bytes directly.

FIG 10 · 4 GiB address space · wasm32

wasm32 的实际内存只有顶部那条窄绿条(用户的 N 个 64 KiB page),但浏览器引擎提前预留了整 4 GiB 虚拟地址空间——下面 99% 都是 PROT_NONE 陷阱区。越界访问由硬件 + signal handler 接住,JIT 出码里完全没有 cmp/jcc 边界检查指令——这就是 wasm 边界检查"免费"的真相。

wasm32's actual memory is just the thin green strip at top (user's N × 64 KiB pages), but the browser engine pre-reserves the entire 4 GiB virtual address space — 99% of it is PROT_NONE trap zone. Out-of-bounds is caught by hardware + signal handler; the JIT emits no cmp/jcc bounds-check instructions — the true source of wasm's "free" bounds checking.

"页" 的几何含义

The geometry of "pages"

为什么是 64 KiB 一页?这数字不是 OS 的 4 KiB / 16 KiB 内存页对齐;它来自一个 i32 寻址 / 2^16 = 2^16 个页 这一计算——4 GiB 总空间 ÷ 2^16 页 = 2^16 = 64 KiB 一页。选 64 KiB 是要在"页粒度太细(grow 太频繁)"和"页太大(浪费)"之间找平衡点,设计者参考了 x86 的 large page 与 ARM 的 64K granule。

Why 64 KiB per page? Not from OS 4 KiB / 16 KiB alignment. It comes from i32 addressing space (4 GiB) ÷ 2^16 pages = 64 KiB per page. The choice balances "too fine-grained (grow too often)" against "too coarse (waste)", referring to x86's large page and ARM's 64K granule.

边界检查 — wasm "safe" 的硬件实现

Bounds checks — the hardware behind wasm's "safe"

每一次 load/store 都必须保证地址在 [0, memory_size) 内,否则 trap。朴素实现:cmp addr, mem_size; ja .trap;——每个内存访问加两条指令,在 inner loop 里 5~10% 开销。

Every load/store must keep its address in [0, memory_size), else trap. Naïve: cmp addr, mem_size; ja .trap; — two extra instructions per access, 5–10% overhead in an inner loop.

现代引擎用一个聪明技巧:把 wasm 的整个 4 GiB 地址空间作为虚拟保留页映射,只把 [0, memory_size) 设为可读写,后面全部设为 PROT_NONE。任何越界访问会触发 SIGSEGV,引擎挂一个 signal handler 把它翻译成 wasm trap。结果是 inner loop 里完全没有显式边界检查,跑得跟 native 几乎一样快——只在 trap 时才慢。

Modern engines use a clever trick: reserve the full 4 GiB virtual address space, mark [0, memory_size) as RW, mark the rest as PROT_NONE. Any overrun raises SIGSEGV, caught by a signal handler that translates it into a wasm trap. The result: zero explicit bounds checks in the hot loop, near-native speed — only slow on actual trap.

为什么 32 位 wasm 不能再扩WHY 32-BIT WASM CAPS AT 4 GiB 虚拟保留的"4 GiB 安全岛"方案前提是 64-bit host 进程的地址空间能腾出 4 GiB——这在 64-bit Linux/macOS/Windows 上没问题(总地址空间 256 TiB)。但 wasm32 自身只能 i32 寻址 4 GiB。要突破这个限制需要 memory64 提案(Ch22),把 wasm 升级到 i64 寻址、64 位 size_t。现在 V8、SM、JSC 都在 flag 后面提供 memory64 预览,但生态(LLVM 后端、Emscripten)尚未完全打通。 The "4 GiB safety island" trick assumes the 64-bit host has 4 GiB of virtual address space to spare — fine on 64-bit Linux/macOS/Windows (256 TiB address space). But wasm32 itself can only i32-address 4 GiB. Breaking past needs the memory64 proposal (Ch22), upgrading wasm to i64 addressing with 64-bit size_t. V8 / SM / JSC all preview memory64 behind flags, but the ecosystem (LLVM backend, Emscripten) hasn't finished plumbing it.

主线回引 · The Hot Loop 的内存 layout

Main-line · memory layout of The Hot Loop

; Linear memory after JS-side setup

0x000000 ┌─────────────────────────────────┐
         │ src image data (8 bpp grayscale)│  1920×1080 = 2 073 600 byte
0x1FA400 ├─────────────────────────────────┤
         │ padding ( 4 KiB align )         │
0x1FB000 ├─────────────────────────────────┤
         │ dst image data (output)         │  another 2 073 600 byte
0x3F5400 ├─────────────────────────────────┤
         │ unused                          │  ~ 60 KiB
0x400000 └─────────────────────────────────┘  64 pages (4 MiB)

JS 那边先 mem.grow(63) 把内存扩到 64 page = 4 MiB,然后用 Uint8ClampedArray 视图把 src 图像数据 copy 进去,调 instance.exports.blur3(0, 0x1FB000, 1920, 1080),wasm 函数对 src 做卷积写到 dst,JS 再 new Uint8ClampedArray(mem.buffer, 0x1FB000, len) 取出来显示。整个过程没有把数据 copy 出 wasm 内存——只是不同 JS 视图共享同一片字节。这是 wasm/JS 协作的"zero copy"模式。

JS first mem.grow(63) to reach 64 pages = 4 MiB, then copies src image bytes in via Uint8ClampedArray, calls instance.exports.blur3(0, 0x1FB000, 1920, 1080), wasm convolves and writes dst, JS reads back via new Uint8ClampedArray(mem.buffer, 0x1FB000, len). The data never leaves wasm memory — different JS views share the same bytes. This is the wasm/JS "zero copy" pattern.

`memory.grow` 的代价

The cost of `memory.grow`

grow 是个昂贵指令——它可能触发 ArrayBuffer 重新分配(老 4 MiB 不够时申请新的 16 MiB,copy 整片字节)。grow 后所有 JS 视图 (TypedArray) 立刻被 detached,所有 wasm 那边持有的 base 指针会被引擎自动更新。这条约束让 grow 在 inner loop 里几乎是禁忌,通常只在 module 启动时或者明显边界(图像变大、文件加载)才调用。

grow is expensive — it can trigger a full ArrayBuffer realloc (allocating a fresh 16 MiB when 4 MiB runs short, then copying). After grow, all JS-side TypedArray views are detached immediately; wasm-side base pointers are auto-updated by the engine. The invariant makes grow nearly forbidden inside hot loops — typically called only at startup or coarse boundaries (image resize, file load).

本章引用Chapter references

spec: Wasm Core 2.0 · §2.5 Memories · §5.5.5
proposal: memory64 proposal · 64-bit linear memory
paper: Disselkoen et al. · POST 2019 · "The Meaning of Memory Safety"

上一章previous← Ch09 · 指令集Opcodes 下一章nextCh11 · 验证 · 安全性的证明Validation · the soundness proof →

STAGE 06 · ANATOMY

验证 — 一遍线性扫描换来的安全保证

Validation — safety in one linear pass

类型栈的抽象解释

abstract interpretation on a type stack

Time

O(n) per func

Space

stack depth

Parallelism

per-function

Spec §

3.3 Validation

MAIN-LINE · STOP 6 / 12 · VALIDATE hot.rs 此刻: 字节已 decode 出来,validator 沿指令序列一次 forward sweep,推演类型栈——证明类型不混、栈不溢、跳转目标在 frame 里。这是 Storyboard 第 6 格的算法层。Ch13 谈同一站的工程层。 hot.rs right now: bytes decoded; validator does one forward sweep, evolving the type stack — proving no type confusion, no underflow, all branch targets in-frame. This is the algorithm side of Storyboard cell 6. Ch13 covers the engineering side of the same stop.

"怎么证明这段二进制没有缓冲区溢出、没有未初始化变量、没有类型混乱?" Java 的解法是 bytecode verifier——一段几千行的 dataflow 分析。wasm 用了一招更猛的:把"类型栈"作为唯一的抽象状态,沿指令序列做一遍 forward sweep。算法只用一个数据结构(类型栈)、只走一遍(单遍 forward),时间复杂度 O(n)。

"How do you prove this binary has no buffer overflow, no uninitialised variable, no type confusion?" Java's answer is the bytecode verifier — a few thousand lines of dataflow analysis. Wasm went bolder: use a "type stack" as the only abstract state, then forward-sweep through the instruction sequence. One data structure (the type stack), one pass (forward only), O(n) time.

算法的骨架

The algorithm skeleton

维护两个东西:

Maintain two things:

类型栈 vstack

Value stack vstack

一个值类型的栈,模拟运行时栈上每个槽位的类型。不存值,只存类型。

A stack of value types, simulating the runtime stack's type at each slot. No values, only types.

控制栈 cstack

Control stack cstack

每个 block / loop / if 入栈一个 frame,记录这个块的起始类型栈高度和跳转目标类型。br k 跳到 cstack 顶往下数第 k 个 frame。

Each block / loop / if pushes a frame storing the vstack height at entry and the branch target type. br k jumps to the k-th frame from the top.

遍历指令序列,每条指令做三件事:① pop 走它需要的输入类型(类型不对 → fail);② push 它产生的输出类型;③ 如果是控制指令,适当 push/pop cstack。函数末尾 vstack 必须正好等于函数返回类型——否则 fail。就这么简单。但这套机制证明了:任何通过验证的 wasm 不会类型混乱、不会栈溢出、不会未初始化访问。

For each instruction: ① pop the input types it expects (mismatch → fail); ② push the output types it produces; ③ if it's a control op, push/pop cstack accordingly. At function end, vstack must exactly equal the return type — else fail. That's it. Yet this proves any validated wasm cannot suffer type confusion, stack overflow, or uninitialised access.

走一遍 · The Hot Loop 内层 4 行的验证过程

Walk-through · validating 4 inner lines

INSTR · 待验证

local.get $src
i32.load8_u offset=0
local.get $src
i32.load8_u offset=1
i32.add ; ← 检查这里

类型栈 · 走到第 5 条之前vstack · before op 5

i32 ← top (loaded byte b)

i32 ← loaded byte a

即将执行about to run: i32.add
需求:pop 两个 i32,push 一个 i32。 need: pop two i32, push one i32.
vstack 顶部正好是 [i32, i32] → ✓。 vstack top is [i32, i32] → ✓.
下一步next: vstack = [i32]

FIG 11 · type-stack walk · auto-loops every 7 s

左侧指令逐条扫过,右侧类型栈跟着推演:load → load → add 让 vstack 在 [i32]→[i32,i32]→[i32] 之间走;const → div 让它再次先升后降。这套 abstract interpretation 用一个数据结构(类型栈) + 一遍 forward scan 证明类型安全。动画每 7 秒自动循环。

Left: instructions stream in one by one; right: the type stack evolves in lockstep — load → load → add moves vstack through [i32] → [i32,i32] → [i32]; const → div pushes and pops again. This abstract interpretation proves type safety using one data structure (the type stack) + one forward pass. Animation auto-loops every 7 s.

三种类型错误的具体抓法

Three concrete type errors caught

Case	WAT	Why rejected
栈不够underflow	`i32.const 1; i32.add`	i32.add 要 pop 两个,vstack 只有一个 → faili32.add expects two pops, vstack has one → fail
类型不对type mismatch	`f32.const 1.5; i32.const 2; i32.add`	i32.add 要 [i32, i32],拿到 [f32, i32] → faili32.add wants [i32, i32], got [f32, i32] → fail
函数尾未消栈leftover at end	`i32.const 1; end` (无 return)	函数返回 (),vstack 末态须空 → failfunction returns (), vstack must be empty at end → fail

"多态" 之处:unreachable 之后

The "polymorphic" spot: after unreachable

一个微妙的细节:验证遇到 br(无条件跳转)或 return 或 unreachable 后,后续直到下一个 end 的指令都无法执行到。但代码还在那儿——验证器要怎么处理?答案:把 vstack 标记为 polymorphic stack(假栈),后续 pop 操作都不真的检查,push 也接受任何类型。等遇到 end 或者 else 时再恢复真实栈状态。这一招让验证器即使对"死代码"也能 O(n) 走完。

A subtle detail: after br (unconditional), return, or unreachable, instructions up to the next end are unreachable. But the bytes are still there — what should the validator do? Answer: mark vstack as polymorphic stack — subsequent pops are not really checked, pushes accept any type. Real state restored at the next end or else. This keeps the validator O(n) even through dead code.

为什么单遍 forward 就够WHY ONE FORWARD PASS SUFFICES Java verifier 要做 dataflow,因为 JVM 字节码有非结构化的 goto,会形成不规则的 CFG。wasm 用 结构化控制流(block / loop / if 配 br k)从根本上禁止了不规则跳转——每个跳转目标都是当前 cstack 上的某个 frame,目标类型在 frame 创建时就钉死。这让不需要反向传播分析。"结构化控制"是 wasm 验证可以一遍完成的根本前提,也是为什么没有 goto。 The Java verifier needs dataflow because JVM bytecode allows non-structured goto, producing irregular CFGs. Wasm's structured control (block / loop / if + br k) bans irregular jumps at the root — every branch target is a frame on the current cstack with its target type fixed at frame creation. No backward propagation needed. "Structured control" is why wasm validation finishes in one pass — and why there's no goto.

并行验证

Parallel validation

单遍验证 + 函数互相独立(只引用 Type / Function / Memory / Table 等"全局"section,这些先解析完)= 函数级并行。V8 的实现给 N 个函数开 min(N, CPU 核数) 个验证 worker,每个 worker 拿一个函数独立验。Photoshop 那种 30 万函数的 wasm 模块,在 8 核机器上 ~500 ms 就能验完——这是为什么 wasm 启动比想象中快。

Single-pass validation + function independence (functions reference only the "global" sections — Type / Function / Memory / Table — already parsed) = function-level parallelism. V8 spawns min(N, num_cpus) workers and each takes one function. Photoshop's 300 K-function wasm validates in ~500 ms on an 8-core box — which is why wasm startup is faster than people expect.

验证不是检查代码,
是把代码读成一个可证明的形状。 Field Note · 03

Validation isn't checking code.
It is reading code into a provable shape. Field Note · 03

SPEC

https://webassembly.github.io/spec/core/valid/

SOURCE

v8/src/wasm/function-body-decoder-impl.h :: DecodeFunctionBody

PAPER

Haas et al., "Bringing the Web up to Speed with WebAssembly", PLDI 2017Haas et al., "Bringing the Web up to Speed with WebAssembly", PLDI 2017

INPUT

已解码的函数体Decoded function body指令序列 + localsinstruction sequence + locals

→

OUTPUT

类型安全证明Type-safety proofJIT 后面可以放心生成机器码JIT can safely emit machine code

本章引用Chapter references

spec: Wasm Core 2.0 · §3.3 Validation
paper: Watt · CPP 2018 · "Mechanising the WebAssembly Spec"
source: v8/src/wasm/function-body-decoder-impl.h · V8 validator entry · type-stack loop

上一章previous← Ch10 · 线性内存Linear memory 下一章nextCh12 · Decode · 流式解析Decode · streaming →

ACT IV · COMPILATION PIPELINE

从字节到机器码。

From bytes to machine code.

从这一段起,字节离开磁盘,进入引擎。我们追着主线 Hot Loop 走过 6 道工序:流式 decode 用 LEB128 把字节翻成 Module 数据结构;Validate 在函数级并行里把类型证明做完;Tier-0 Liftoff 单遍出机器码,启动 0 等待;Tier-1 TurboFan 在后台把热函数重编译到接近 native;然后实例化分配 memory、填 table、跑 start;最后 JS 跟 wasm 之间的 trampoline 把调用边界缝起来。这 6 章是整个文章最"引擎"的部分。

From here, the bytes leave disk and enter the engine. We follow the main-line Hot Loop through six stages: streaming decode turns LEB128 bytes into a Module; validate proves type safety with function-level parallelism; Tier-0 Liftoff emits machine code in one pass, zero startup wait; Tier-1 TurboFan re-compiles hot functions to near-native in the background; then instantiation allocates memory, fills tables, runs start; finally trampolines stitch the JS ↔ wasm call boundary. The most "engine" part of the article.

STAGE 07 · PIPELINE

Decode — 边下载边解码的字节流

Decode — bytes parsed as they arrive

streaming compilation

Process

Browser → Renderer

Thread

Network + Wasm IO

API

compileStreaming

Latency

~ 1ms / KB

MAIN-LINE · STOP 5 / 12 · FETCH + DECODE hot.rs 此刻: 192 字节正从 CDN 流到浏览器,decoder 边收边解——每拿到一个 section 完整字节就推给 validator + Liftoff,不等整个文件下完。这是 Storyboard 第 5 格,wasm 启动比想象快的根源。 hot.rs right now: 192 bytes streaming from CDN into the browser; the decoder parses as bytes arrive — each completed section is forwarded to the validator + Liftoff without waiting for the full file. Storyboard cell 5, the secret behind wasm's startup speed.

这一段在做什么

What it does

把字节流解成 Module 数据结构。section by section,LEB128 by LEB128。每解出一个函数体就把它推给 validator + Liftoff——不等整个文件下完。 Turn the byte stream into a Module struct. Section by section, LEB128 by LEB128. Each function body is handed to validator + Liftoff as soon as it's parsed — without waiting for the whole file.

为什么重要

Why it matters

Photoshop wasm 模块 70 MB。等下完再编译要 8 秒,边下边编只要 2.5 秒——5.5 秒"白送"的延迟优化,这就是 streaming compilation 的全部价值。 Photoshop's wasm is 70 MB. Compile-after-download takes 8 s; compile-while-downloading takes 2.5 s — 5.5 s "free" latency reduction. The whole point of streaming compilation.

关键 API

Key API

WebAssembly.compileStreaming(fetch('hot.wasm')) · WebAssembly.instantiateStreaming(...)

"下载完再编译"是 2017 年 MVP 时的默认行为。2018 年起 V8 / SpiderMonkey 都实现了 streaming compile——浏览器 fetch 第一个 chunk 进来就交给 wasm decoder,decoder 拿到一个 section 完整字节就解析,拿到一个函数体完整字节就交给 Liftoff。下载和编译完全并行,这一招的本质是把 wasm 当成"边下边播"的视频流。

"Download then compile" was the default in the 2017 MVP. From 2018, V8 and SpiderMonkey both shipped streaming compile — the browser hands each fetched chunk to the wasm decoder, which parses each complete section and forwards each complete function body to Liftoff. Download and compile run fully in parallel; in essence, treat wasm like a "stream-while-watching" video.

JS 那端的两种调用方式

Two JS-side call patterns

// 慢路径(non-streaming):先 ArrayBuffer 再 compile
const buf = await fetch('hot.wasm').then(r => r.arrayBuffer());
const mod = await WebAssembly.compile(buf);  // 等下完才开始

// 快路径(streaming):fetch 进来一段就开始编
const mod = await WebAssembly.compileStreaming(fetch('hot.wasm'));

compileStreaming 需要 server 回 Content-Type: application/wasm——否则会 fallback 到 buffer 路径并 throw 一个警告。这是常见踩坑(把 .wasm 当 .bin 上 CDN 时 MIME 不对)。

compileStreaming requires the server to return Content-Type: application/wasm — else it falls back to the buffer path and emits a warning. A common pitfall when serving .wasm as .bin from a CDN.

decoder 的状态机

Decoder state machine

// V8 ModuleDecoder 简化状态机

[kPreamble]      ; expecting magic + version (8 bytes)
       ├─▶ kSectionHeader   ; expecting id byte + LEB128 length
              ├─▶ kTypeSection     ; 0x01 · vec[Type]
              ├─▶ kImportSection   ; 0x02 · vec[Import]
              ├─▶ kFunctionSection ; 0x03 · vec[u32 type-idx]
              ├─▶ kTableSection    ; 0x04
              ├─▶ kMemorySection   ; 0x05
              ├─▶ kGlobalSection   ; 0x06
              ├─▶ kExportSection   ; 0x07
              ├─▶ kStartSection    ; 0x08
              ├─▶ kElementSection  ; 0x09
              ├─▶ kCodeSection     ; 0x0a · 进入 per-function loop
              │       └─▶ for each function:
              │             1. parse body bytes
              │             2. enqueue to validator worker
              │             3. enqueue to Liftoff worker
              ├─▶ kDataSection     ; 0x0b
              └─▶ kCustomSection   ; 0x00 · name / dwarf / vendor

每个状态对应一个 section 的解析函数,内部都是同一种结构:先读 LEB128 数量,再 for 循环依次解析每个 entry。这种规则化让 decoder 简单到可以单文件 (module-decoder.cc) 几千行写完。

Each state maps to a parsing function for one section, all sharing the same shape: read LEB128 count, then for-loop entries. The regularity keeps the decoder small — one file (module-decoder.cc), a few thousand lines.

流式 + 并行的时间线

Streaming + parallel timeline

▸ Performance · WebAssembly streaming compile · hot.wasm 192 byte / 1 func · main-line

Network

fetch · 192 byte

idle

Wasm IO

decode

validate

done

Wasm Compile

wait

Liftoff

TurboFan

tier-1 done

Main

JS still loading

resolve

instance.exports.blur3 (Liftoff)

(TurboFan)

三件事: ① Liftoff 在 fetch 还没完时就启动——这是 streaming 的本质;② TurboFan 在 Liftoff 完成后慢慢热;③ JS Promise 在 Liftoff 完毕的瞬间 resolve,所以用户感受到的延迟是 Liftoff 完成时间,不是 TurboFan 完成时间。这就是为什么 Liftoff 必须快——它决定 TTI(Time To Interactive)。 Three things: ① Liftoff starts before fetch even finishes — that's the essence of streaming; ② TurboFan warms up later in the background; ③ the JS Promise resolves the instant Liftoff finishes, so user-perceived latency is Liftoff time, not TurboFan time. That's why Liftoff has to be fast — it determines TTI (Time To Interactive).

code section 的"函数体偏移表"

The "function-body offset table" trick

code section 的格式有一个细节让流式编译变得可行:每个函数体前面都有一个 LEB128 长度。这让 decoder 不必先扫一遍找边界,可以直接 fread 长度 → fread 函数体 → 入队 → fread 下一段长度。"self-describing 长度前缀" 是 wasm 设计里反复出现的母题——module 长度、section 长度、函数体长度、import name 长度,全是 LEB128 前缀。

A detail in the code section makes streaming feasible: each function body is prefixed by a LEB128 length. The decoder doesn't need a pre-scan — just read length → read body → enqueue → next length. "Self-describing length prefix" is a recurring motif — module length, section length, body length, import-name length, all LEB128-prefixed.

本章引用Chapter references

spec: WebAssembly Web API 2.0 · compileStreaming
source: v8/src/wasm/module-decoder.cc :: DecodeModule() · V8 streaming decoder

上一章previous← Ch11 · 验证Validation 下一章nextCh13 · 并行 ValidateParallel validate →

STAGE 08 · PIPELINE

Validate — 函数级并行的形状证明

Validate — function-level parallelism, proven by shape

N 个函数,N 个 worker

N functions, N workers

Algorithm

single-pass · type-stack

Concurrency

per-function

Failure

CompileError trap

Source

function-body-decoder.cc

MAIN-LINE · STOP 6 / 12 · VALIDATE (eng.) hot.rs 此刻: 同一站,换一个视角——这里讲 V8 怎么把 N 个函数的验证分发到 N 个 worker。函数体之间没有引用依赖(只引用 type/function/memory/global 等模块级声明,这些 section 早已解析完),所以可以无锁并行。Photoshop 30 万函数,8 核机器 500 ms 验完。 hot.rs right now: same stop, different lens — here we look at how V8 fans out N function validations across N workers. Function bodies reference only module-level declarations (type/function/memory/global, all parsed earlier), so workers run lock-free. Photoshop's 300 K functions validate in ~500 ms on 8 cores.

Ch11 已经讲了验证算法本身。这一章谈工程实现:V8 怎么把 N 个函数的验证拆到 N 个 worker 上,什么时候 fail-fast,什么时候 graceful。

Ch11 covered the algorithm itself. This chapter is about engineering: how V8 spreads N functions across N workers, when to fail-fast, when to be graceful.

为什么函数级并行可行

Why function-level parallel works

关键前提是函数体只引用模块级声明(type / function / table / memory / global / element),这些都在 code section 之前的 section 里解析完了。函数 A 验证时不需要看函数 B——它最多通过 call 引用 B 的签名(已知)。所以 N 个 worker 可以独立验证 N 个函数,彼此不通信。

The key invariant: function bodies reference only module-level declarations (type / function / table / memory / global / element), all parsed before the code section. Function A's validator doesn't need to look at function B — at most it sees B's signature via call (already known). So N workers validate N functions independently, no inter-thread comms.

引擎Engine	workers	strategy
V8	min(N, num_logical_cpus)	each thread pulls from one queue
SpiderMonkey	helper threads (configurable)	tile-based, 64 KB per tile
JavaScriptCore	WTF::WorkerPool	per-function, with size-aware scheduling
Wasmtime	rayon parallel iterator	per-function

早失败 vs 晚失败

Fail-fast vs fail-late

如果第 3 个函数验证失败,后面 1000 个函数还要不要继续验证?V8 选择继续——所有 worker 把活做完,最后聚合错误。这听起来浪费,但因为 worker 是并行的,继续做不会延后失败时间;反而提前 abort 需要协调(kill 其他 worker),代码复杂度反而高。"并行算法里 fail-fast 不一定快"是 V8 设计里反复出现的取舍。

If function 3 fails, do the remaining 1000 keep validating? V8 says yes — let all workers finish, aggregate errors at the end. Sounds wasteful, but because workers run in parallel, continuing doesn't delay failure; aborting would need coordination (kill other workers), with higher code complexity. "Fail-fast isn't necessarily fast in a parallel pipeline" — a trade-off V8 makes repeatedly.

主线回引 · The Hot Loop 的验证时间

Main-line · Hot Loop validation time

主线只有 1 个函数,所以"并行"在这里退化成单 worker。49 条指令,vstack 最深 4 槽,cstack 最深 2 frame,在 M1 Pro 上验证耗时 ~ 6 µs。这一数字给你一个数量级感受:验证比解码还快,因为验证不分配内存(用栈上小固定容量数组就够了)。

The main-line has 1 function, so "parallel" degenerates to single worker. 49 ops, vstack max depth 4, cstack max depth 2 — on M1 Pro, validation takes ~ 6 µs. The order of magnitude: validation is faster than decoding, because it allocates nothing — a small fixed-cap stack array suffices.

本章引用Chapter references

spec: Wasm Core 2.0 · §3.3 Validation
source: v8/src/wasm/function-body-decoder-impl.h · V8 per-function validator template

上一章previous← Ch12 · Decode 下一章nextCh14 · Liftoff · Tier 0 →

STAGE 09 · PIPELINE

Liftoff — 单遍出码的"不优化但快"基线 JIT

Liftoff — the "unoptimised but instant" baseline JIT

10 MB/s · 0 IR · 0 register alloc

Process

Renderer

Thread

CompileTask

none

Speed

~10 MB / s

MAIN-LINE · STOP 7 / 12 · LIFTOFF CODEGEN hot.rs 此刻: validator 签字放行,Liftoff 单遍把 wasm 字节翻成 x86-64——没有 IR、没有寄存器分配、所有 local 一律 spill 到栈槽。240 字节 x86 出码,耗时 200 µs。不为速度,为启动时间。函数末尾 tier-up 计数器已经埋好。下一站:第一次执行。 hot.rs right now: validator signed it off; Liftoff makes one pass turning wasm bytes into x86-64 — no IR, no register allocation, every local spilled to a stack slot. 240 bytes of x86 emitted in 200 µs. Not for speed — for startup. The tier-up counter is already wired into the function epilogue. Next stop: first execution.

这一段在做什么

What it does

扫一遍 wasm 字节,边扫边直接出 x86-64 / ARM64 机器码。没有中间 IR,没有寄存器分配,没有优化。所有 wasm 栈位置都对应栈上的固定偏移槽位。 Scan the wasm bytes once and emit x86-64 / ARM64 machine code directly. No IR, no register allocation, no optimisation. Every wasm stack slot maps to a fixed offset on the native stack.

为什么存在

Why it exists

2018 之前,V8 直接走 TurboFan——一个 70 MB 模块要 8 秒才能开跑。Liftoff 是为了把这个数字压到 1~2 秒:出码 4× 慢于 TurboFan,但出码速度 10× 快于它。不为速度,为启动时间。 Pre-2018, V8 went straight to TurboFan — a 70 MB module took 8 s to start. Liftoff aims to compress that to 1–2 s: emitted code is 4× slower than TurboFan's, but emission itself is 10× faster. Not for speed — for startup.

关键代码

Key code

v8/src/wasm/baseline/liftoff-compiler.cc :: VisitOpcode · liftoff-assembler-{x64,arm64}.cc

Liftoff 的两个"狠"

Two ruthless choices Liftoff makes

①

不做寄存器分配,全 spill 到栈

No register allocation; everything spills

遇到 wasm local.get → mov reg, [rbp-N],N 是该 local 的固定偏移;遇到 local.set → mov [rbp-N], reg。栈顶值用瞬时寄存器 rax/rbx 之类即可。不优化中间结果留寄存器——出来的码"很啰嗦但确定"。

For each wasm local.get: mov reg, [rbp-N], where N is that local's fixed offset. For local.set: mov [rbp-N], reg. Stack-top values land in ad-hoc registers like rax/rbx. No effort to keep intermediates in regs — the code is "verbose but deterministic".

②

栈状态用 LiftoffStackState 跟踪

Track stack via LiftoffStackState

不是真正的 SSA,而是一个小数组,记录"当前 wasm 栈顶第 i 个值在哪个 native reg / native stack slot 上"。每条指令出码时只查 / 改这张表。这套机制让 Liftoff 用~1500 行代码实现了一个完整的 baseline JIT(对比 TurboFan 的 40,000 行)。

Not real SSA — a small array recording "where the i-th wasm stack value currently lives: in native reg X or stack slot Y". Each opcode just reads/updates this table when emitting code. ~1500 LOC builds a complete baseline JIT (vs ~40 000 LOC for TurboFan).

主线 · Liftoff 对 The Hot Loop 出的码(完整版)

Main-line · Liftoff's output for The Hot Loop (full)

; blur3 -- Liftoff codegen (x86-64, simplified)
0x000000 push   rbp
0x000001 mov    rbp, rsp
0x000004 sub    rsp, 0x40                ; 8 stack slots
0x000008 mov    [rbp-0x08], rdi          ; spill $src
0x00000c mov    [rbp-0x10], rsi          ; spill $dst
0x000010 mov    [rbp-0x18], edx          ; spill $w
0x000014 mov    [rbp-0x1c], ecx          ; spill $h

; outer loop: y = 1
0x000018 mov    dword ptr [rbp-0x20], 1  ; $y = 1
0x000020 mov    eax, [rbp-0x1c]
0x000024 dec    eax                      ; eax = h - 1
0x000026 cmp    [rbp-0x20], eax
0x000029 jge    .end_y

.loop_y:
; inner loop: x = 1
0x00002b mov    dword ptr [rbp-0x24], 1  ; $x = 1
0x000033 mov    eax, [rbp-0x18]
0x000037 dec    eax                      ; eax = w - 1
0x000039 cmp    [rbp-0x24], eax
0x00003c jge    .end_x

.loop_x:
; sum = 0
0x00003e mov    dword ptr [rbp-0x28], 0

; 9 byte loads, 8 adds — Liftoff emits each one
0x000046 mov    rax, [rbp-0x08]          ; $src
0x00004a movzx  edx, byte ptr [r15+rax]  ; i32.load8_u offset=0 (r15 = mem base)
0x00004e add    [rbp-0x28], edx          ; sum += byte

0x000051 mov    rax, [rbp-0x08]          ; reload $src ← spill cost
0x000055 movzx  edx, byte ptr [r15+rax+1]
0x00005a add    [rbp-0x28], edx
...
; (similar 7 more times — Liftoff makes no attempt to hoist $src)
...

; sum / 9 — Liftoff does NOT do magic-number multiplication
0x0000c0 mov    eax, [rbp-0x28]
0x0000c3 xor    edx, edx
0x0000c5 mov    ecx, 9
0x0000ca div    ecx                      ; expensive! ~25 cycles

; store dst[y*w + x]
0x0000cc ...
0x0000e0 mov    byte ptr [r15+rbx], al

; x++; loop_x
0x0000e4 inc    dword ptr [rbp-0x24]
0x0000e8 jmp    .loop_x
...
.end_x:
.end_y:
0x000130 leave
0x000131 ret

这段~ 240 字节的 x86-64 代码就是 Liftoff 对 hot.wasm 的输出。三个观察:① $src 在每次 load 前都重新从 [rbp-0x08] 加载——Liftoff 不知道也不分析 "这个值我刚加载过";② sum / 9 用了真 div 指令,~25 cycle;③ 函数体没有 SIMD 化。但出码时间在 200 µs 量级——这正是它要的。

~240 bytes of x86-64 is Liftoff's output for hot.wasm. Three notes: ① $src is reloaded from [rbp-0x08] before every load — Liftoff doesn't know it just loaded this; ② sum / 9 uses a real div, ~25 cycles; ③ no SIMD. But codegen time is ~200 µs — exactly the target.

同一段 wasm,在 ARM64 上长什么样

Same wasm, on ARM64

把同一个 hot.wasm 喂给 ARM64 上的 Liftoff(M-series Mac / 移动设备),出码的算法骨架完全一致,只是指令集换:

Feed the same hot.wasm into Liftoff on ARM64 (M-series Mac / mobile devices); the algorithm skeleton is identical, only the ISA changes:

; Liftoff output for blur3 (ARM64, simplified · M1 Pro)
0x000000 stp   x29, x30, [sp, #-16]!     ; push fp, lr
0x000004 mov   x29, sp
0x000008 sub   sp, sp, #0x40              ; reserve 8 slots
0x00000c str   x0,  [x29, #-8]            ; spill $src (arg 0, ABI: x0)
0x000010 str   x1,  [x29, #-16]           ; spill $dst
0x000014 str   w2,  [x29, #-24]           ; spill $w (32-bit)
0x000018 str   w3,  [x29, #-28]           ; spill $h
...
; inner: load src[idx]
0x000040 ldr   x9, [x29, #-8]             ; reload $src
0x000044 ldr   x10, [x29, #-32]           ; reload computed index
0x000048 ldrb  w11, [x28, x9]             ; load byte (x28 = wasm mem base, V8 convention)
0x00004c add   x12, x9, x10
0x000050 ldrb  w13, [x28, x12]
0x000054 ldr   w14, [x29, #-40]           ; reload sum
0x000058 add   w14, w14, w11
0x00005c str   w14, [x29, #-40]           ; store sum back
; ... same pattern repeats 8 more times — no register reuse
...
; sum / 9 — ARM64 has a single-cycle udiv (unlike x86's slow div)
0x0000c0 ldr   w0, [x29, #-40]
0x0000c4 mov   w1, #9
0x0000c8 udiv  w0, w0, w1                 ; ~7 cycles (vs x86 div ~25)

; tier-up counter
0x0000e0 ldr   w12, [x26, #0x40]          ; x26 = instance pointer (V8 ARM64 convention)
0x0000e4 cmp   w12, #0x100
0x0000e8 b.ne  +0x8
0x0000ec bl    WasmCompileLazy
...
0x000110 ldp   x29, x30, [sp], #16
0x000114 ret

几个细节:① 角色寄存器换了:x86 的 r15 / r13 / rbp 对应 ARM64 的 x28 / x26 / x29(V8 内部 calling convention);② 调用约定不一样:x86 用 rdi/rsi/rdx/rcx 传前 4 参数,ARM64 用 x0/x1/x2/x3;③ ARM64 的 udiv 是单指令 ~ 7 cycle,比 x86 的 div(~ 25 cycle)显著快——所以 Liftoff 出码的"naïve div" 在 ARM64 上代价小,Liftoff/TurboFan 性能差距比 x86 上小;④ 算法骨架完全一样——全 spill、不做 register allocation、单遍出码、嵌入 tier-up 计数器。这正是 wasm "ISA neutral" 承诺的具体形式:同一份字节,在 x86 / ARM64 / RISC-V / s390x 上跑出来的 Liftoff 输出结构同形,只换指令。

A few details: ① register roles change: x86's r15 / r13 / rbp map to ARM64's x28 / x26 / x29 (V8 internal calling convention); ② ABI differs: x86 passes the first 4 args in rdi/rsi/rdx/rcx; ARM64 in x0/x1/x2/x3; ③ ARM64's udiv is a single-instruction ~ 7 cycle op vs x86's div ~ 25 cycles — so Liftoff's "naïve div" is cheaper on ARM64, and the Liftoff/TurboFan gap is smaller on ARM than on x86; ④ the algorithm skeleton is identical — full spill, no register allocation, single-pass codegen, embedded tier-up counter. This is the concrete form of wasm's "ISA-neutral" promise: the same bytes produce structurally identical Liftoff output on x86 / ARM64 / RISC-V / s390x — only instructions swap.

Tier-up 触发机制

The tier-up trigger

每个 Liftoff 函数入口都塞一个计数器:

Every Liftoff function prologue carries a counter:

cmp    dword ptr [r13+0x40], 0x100      ; tier-up threshold = 256 calls
jne    +0x4
call   WasmCompileLazy                  ; → schedule TurboFan recompile

2 条指令的开销,每次进入函数加 1 次 cmp + 1 次 jne(不跳转)。达到阈值时调用 WasmCompileLazy,把这个函数入队到 TurboFan 后台 worker——不阻塞当前调用,Liftoff 版继续跑。后台 worker 编完后,引擎用一个 atomic store 把函数地址表里的入口换成 TurboFan 版,下次调用就走 TurboFan。

Two-instruction overhead: one cmp + one jne (not taken) per entry. At threshold, call WasmCompileLazy to enqueue the function for a background TurboFan worker — does not block the current call, Liftoff version keeps running. After the worker finishes, an atomic store swaps the function-table entry to point to the TurboFan version; the next call goes to TurboFan.

Liftoff 之前 V8 做了什么WHAT V8 DID BEFORE LIFTOFF 2017 MVP 上线时 V8 没有 Liftoff,所有 wasm 函数直接走 TurboFan。Mozilla 那时已经有 SpiderMonkey 的 BaselineCompiler,启动比 V8 快很多。Liftoff 是 Clemens Backes 在 2018 年的工程项目,设计灵感来自 BaselineCompiler 但更彻底——后者还有简单的寄存器跟踪,Liftoff 干脆全 spill。Liftoff 之后,V8 的 wasm 启动延迟下降了 80%,这是 wasm 在浏览器侧的"第二次起飞"。 In the 2017 MVP, V8 had no Liftoff — every wasm function went straight to TurboFan. SpiderMonkey already had its BaselineCompiler and started much faster. Liftoff was Clemens Backes's 2018 project, inspired by BaselineCompiler but more ruthless — the latter still did light register tracking; Liftoff spills everything. Post-Liftoff, V8's wasm startup latency dropped 80% — wasm's "second takeoff" in the browser.

本章引用Chapter references

blog: V8 blog · "Liftoff: a new baseline compiler for WebAssembly in V8" (Aug 2018)
source: v8/src/wasm/baseline/liftoff-compiler.cc · V8 Liftoff main entry

上一章previous← Ch13 · Validate 下一章nextCh15 · TurboFan · Tier 1 →

STAGE 10 · PIPELINE

TurboFan — sea-of-nodes 把它优化成接近 native

TurboFan — sea-of-nodes lifts it to near-native

2 ms 出码,80% of native

2 ms emit, 80% of native

Process

Renderer

Thread

TF Worker

sea-of-nodes

Speed target

~80% of native

MAIN-LINE · STOP 10 / 12 · TURBOFAN hot.rs 此刻: 第 256 次调用触发 tier-up,后台 worker 把它送进 TurboFan——sea-of-nodes IR + LoadElimination + SimplifiedLowering + Schedule + RegAlloc 五步流水线。Liftoff 的 9 次 local.get $src 被合并成 1 次寄存器读;sum / 9 被识别为常量除,替换成魔数乘法 0x1c71c71d。180 字节 x86,3.8 ms/帧。下一站:atomic 安装。 hot.rs right now: the 256th call triggered tier-up; a background worker pushes it through TurboFan's sea-of-nodes + LoadElimination + SimplifiedLowering + Schedule + RegAlloc pipeline. Liftoff's nine local.get $srcs collapse to one register read; sum / 9 is recognised as div-by-constant and rewritten as magic-number mul 0x1c71c71d. 180 bytes of x86, 3.8 ms/frame. Next stop: atomic install.

TurboFan 原本是 V8 的 JavaScript 优化编译器。2017 年起它兼任 wasm 的优化编译器——但 wasm 那一面用的 pipeline 跟 JS 那边完全不一样。JS 那边要处理 SMI 标记、IC 反馈、deopt 边界;wasm 这边类型钉死,没有反馈,没有 deopt。所以 wasm TurboFan 是个"静态优化器",更接近 LLVM 的工作流。

TurboFan was originally V8's JS optimising compiler. From 2017 it doubles as wasm's optimiser — but the wasm pipeline differs entirely from JS. JS-side juggles SMI tags, IC feedback, deopt edges; wasm-side has fixed types, no feedback, no deopt. So wasm-TurboFan is a "static optimiser", much closer to LLVM's workflow.

从字节到机器码的 6 步

Six steps from bytes to machine code

Graph build · 把字节流变 IR 图

Graph build · bytes → IR graph

把每条 wasm op 转成一个或多个 IR 节点。所有依赖关系作为边。不显式记录顺序——这是 sea-of-nodes 的特征。

Each wasm op becomes one or more IR nodes; dependencies are edges. No explicit ordering recorded — that's the sea-of-nodes hallmark.

Inline · 把小函数展平

Inline · flatten small functions

wasm 没 IC,inlining 决策完全静态。主线 Hot Loop 没有 call,这一步空跑。

Wasm has no IC, so inlining is purely static. The main-line has no call; this step is a no-op.

LoadElimination · 公共子表达式

LoadElimination · CSE

"同一个内存地址刚刚加载过" → 复用值,不再 emit load。主线里 $src 的 9 次 local.get 被压成 1 次寄存器持有。

"This address was just loaded" → reuse the value, don't emit another load. In the main-line, the 9 local.get $srcs collapse into one held register.

SimplifiedLowering · 把 IR 降到机器层

SimplifiedLowering · IR → machine level

把"i32.add" 这种高阶节点替换成 "x64 ADD"。sum / 9 在这一步被识别为常量除,替换成 magic-number multiplication。

Lower high-level nodes like i32.add to x64 ADD. sum / 9 here is recognised as div-by-constant and rewritten as magic-number mul.

Schedule · 决定指令顺序

Schedule · order the nodes

sea-of-nodes 是无序的,这一步把节点排成线性序列,服从依赖关系。这是 sea-of-nodes 的独特步骤——传统 CFG 编译器不需要。

Sea-of-nodes is unordered; here nodes are arranged into a linear sequence respecting dependencies. This is sea-of-nodes' unique step — traditional CFG compilers skip it.

RegAlloc + Emit

寄存器分配(线性扫描)+ 出码。TurboFan 的 RegAlloc 是 SSA 上的线性扫描变种,比 LLVM 的 greedy 简单但已足够好。

Register allocation (linear scan) + emission. TurboFan's RegAlloc is a linear-scan variant on SSA — simpler than LLVM's greedy, plenty good enough.

主线 · TurboFan 出码 vs Liftoff 出码

Main-line · TurboFan output vs Liftoff output

Metric	Liftoff	TurboFan	Ratio
编译耗时Compile time	200 µs	2.1 ms	10×
出码字节Code bytes	240 B	180 B	0.75×
运行耗时(1080p)Runtime (1080p frame)	12 ms	3.8 ms	0.32×
`$src` reload 次数`$src` reloads	9	1	—
`sum / 9`	div(25 cy)	mul+shr(4 cy)	~6× faster
SIMD ?SIMD ?	—	(no, 默认未开)(no, default off)	—

TurboFan 编译耗时 10× 于 Liftoff,但出码运行快 ~ 3× 于 Liftoff。关键洞察是这两个数字不矛盾——TurboFan 在后台编,运行时把"编译延迟"摊到了后台 worker。用户看到的延迟是"Liftoff 编译完+第一次跑",TurboFan 是后面"悄悄变快"的。这是 wasm tiering 的全部哲学。

TurboFan compiles 10× slower than Liftoff, but the resulting code runs ~3× faster. The crucial insight: these don't conflict — TurboFan compiles in the background, amortising its latency onto a worker. Perceived latency is "Liftoff done + first run"; TurboFan is the "silent" speedup later. That's wasm tiering in one sentence.

FIG 15 · sea-of-nodes · Hot Loop inner body $src dereferences ⟶ LoadElim + CSE AFTER · 13 nodes $src 9 × load [src + (r·w + c)] $src held in reg · indices folded tree-reduce add × 8 i32 × 0x1c71c71d → SHR ~ 4 cycles · was 25 cy store [dst + idx] $src loaded once; div replaced by mul·shr WIN · per pixel ~ 20 ns saved · 40 ms on 1080p

左:Liftoff 把每条 wasm 字节翻成机器指令,9 次 local.get $src 各自 emit 一条 mov。右:TurboFan 看穿这 9 次都指向同一个 SSA 节点,LoadElimination 合并成 1 次寄存器读;sum / 9 在 SimplifiedLowering 阶段识别为常量除,替换成魔数乘法(0x1c71c71d)。这就是 wasm "原生 80%" 的具体形式。

Left: Liftoff emits one machine op per wasm byte; nine local.get $src turn into nine movs. Right: TurboFan sees the nine reference the same SSA node, LoadElimination merges them into a single register read; sum / 9 is recognised as divide-by-constant during SimplifiedLowering and rewritten as magic-number multiplication (0x1c71c71d). The concrete form of wasm's "80% of native".

完整 IR · 50+ 节点 · 真实形状

Full IR · 50+ nodes · the actual shape

上面那张是示意图,但 hot.rs 真实的 TurboFan IR 大约有 50-60 个节点(包括 Start / End / FrameState / Effect / Control 等基础设施节点)。下面这张完整版 把 inner loop body 全部画出来,按 BasicBlock 分组,Effect chain 与 Value chain 分两种颜色显示。这是 V8 跑 --trace-turbo-graph + turbolizer 可视化时你会真正看到的图。

The figure above is a schematic. The actual TurboFan IR for hot.rs runs ~ 50-60 nodes (including infrastructure like Start / End / FrameState / Effect / Control). Below is the full version for the inner loop body, grouped by BasicBlock, with effect chain and value chain on separate colour scales. This is what you actually see when V8 runs with --trace-turbo-graph and you load the output in turbolizer.

FIG 15b · sea-of-nodes · full IR · 50+ nodes · 4 BBs

这才是 hot.wasm 在 TurboFan 里的真容:4 个 BasicBlock(pre-header / loop body / loop continue / exit)、~ 48 个节点、3 种边(copper 值边、blue 效应链、gpu 控制流)。上面的"简化版"压缩了 90% 的元节点(Start/End/EffectPhi/FrameState 等),适合一眼看到 LoadElim/SimplifiedLowering 的关键转换;这张是真实形状。读完两张图你就能读懂 --trace-turbo-graph 在 turbolizer 里的输出。

This is what hot.wasm actually looks like inside TurboFan: 4 BasicBlocks (pre-header / loop body / loop continue / exit), ~ 48 nodes, three edge kinds (copper value, blue effect, gpu control). The simpler figure above compresses ~ 90% of infrastructure nodes (Start/End/EffectPhi/FrameState) to highlight LoadElim/SimplifiedLowering's key effects; this is the real shape. After both, you can read a turbolizer dump.

从 TurboFan 到 Turboshaft

From TurboFan to Turboshaft

2022 年 V8 团队启动 Turboshaft 项目,目标是替换 TurboFan。原 TurboFan 的 sea-of-nodes 在内存里是图结构,每次访问要解引用,在大模块上 cache miss 严重。Turboshaft 改成线性 IR 序列(类似 LLVM 的 BB instructions),内存连续,优化 pass 速度上升 30~50%。2023 年起 V8 wasm 默认走 Turboshaft,但 IR 和 pass 集合跟 TurboFan 高度兼容,从外部看几乎无感。

In 2022 the V8 team began Turboshaft, with the goal of replacing TurboFan. The original TurboFan keeps its sea-of-nodes as an in-memory graph; every access dereferences, and on large modules cache misses dominate. Turboshaft uses a linear IR sequence (like LLVM BB instructions), so memory is contiguous and pass speed improves 30–50%. Since 2023, V8 wasm has run Turboshaft by default, but IR and pass set are highly TurboFan-compatible — externally near-invisible.

具体例子 · LoadElimination 的力量CASE · the power of LoadElimination

9 次 `local.get $src` 怎么变成 1 次寄存器读

9 × `local.get $src` collapses into one register read

Liftoff 把每个 local.get $src 都翻成 mov rax, [rbp-0x08]——9 次。TurboFan 看到这 9 次 local.get 都引用同一个 SSA 节点(因为 wasm 验证已经证明 $src 在此区间未被赋值),LoadElimination 把它们合并成一个 SSA 引用。寄存器分配阶段把这个 SSA 值留在 rcx 里——9 次内存读变成1 次内存读 + 9 次寄存器引用。这一招省下 ~ 20 ns 每像素,在 1920×1080 图像上是 40 ms 的差距。

Liftoff turns each local.get $src into mov rax, [rbp-0x08] — nine times. TurboFan sees all nine reference the same SSA node (validation already proved $src isn't reassigned in this region), and LoadElimination merges them into one SSA reference. RegAlloc keeps the SSA value in rcx — nine memory reads collapse into one memory read + nine reg references. ~20 ns saved per pixel; on a 1920×1080 image, that's a 40 ms swing.

本章引用Chapter references

source: v8/src/compiler/wasm-compiler.cc · V8 TurboFan wasm graph build
source: v8/src/compiler/turboshaft/wasm-*.cc · V8 Turboshaft (2023+ default)
tool: turbolizer · interactive sea-of-nodes viewer

上一章previous← Ch14 · Liftoff 下一章nextCh16 · Instantiate →

STAGE 11 · PIPELINE

Instantiate — Module 是模板,Instance 是身体

Instantiate — module is the template, instance is the body

memory · table · globals · imports · start

Module

code + metadata

Instance

memory + table + globals

Cost

~ 1 ms typical

Reusable

1 module → N instances

MAIN-LINE · STOP 8a / 12 · INSTANTIATE hot.rs 此刻: Liftoff 已经出码,但还没人调用。引擎在做"第一次调用之前"的最后准备——分配 linear memory、填充 table 与 globals、跑 start 函数。Storyboard 第 8 格之前的这一秒。同一个 Module 可以 instantiate 多次,每次给一个独立的 memory——这是 wasm 多线程的基础。 hot.rs right now: Liftoff has emitted code, but nobody has called it yet. The engine performs the final "before-first-call" setup — allocating linear memory, filling tables and globals, running the start function. The second just before Storyboard cell 8. One Module can be instantiated many times, each with its own memory — the foundation of wasm multithreading.

WebAssembly.Module 是不可变的编译产物——它只装了代码、类型、import 声明、export 声明。要真正"跑"它,得创建一个 WebAssembly.Instance,把 import 满足、memory / table / globals 分配出来。同一个 Module 可以创建多个 Instance,每个 Instance 有自己的 memory——这是 wasm 实现"多线程"和"沙箱隔离"的基础。

WebAssembly.Module is an immutable compilation artifact — it carries code, types, import/export declarations. To actually run it, create a WebAssembly.Instance: satisfy imports, allocate memory / table / globals. One Module can spawn many Instances, each with its own memory — the foundation of wasm's "multithreading" and "sandbox isolation".

FIG 16 · instantiate · 7 steps · with trap branches

实例化的 7 步是串行 + 原子的:任何一步抛 LinkError / RangeError / RuntimeError,前面做的工作必须全部回滚,Promise reject 后外部观察不到中间状态。步骤 ① 静态检查(import 类型)产生 LinkError;步骤 ②③分配资源产生 RangeError/OOM;步骤 ④-⑦实际运行用户代码(init expr / start func),产生 RuntimeError。这种"全或无" 是 wasm sandbox 的核心保证之一——半就绪的 Instance 不存在。

The 7 steps are serial + atomic: any failure (LinkError / RangeError / RuntimeError) rolls back all prior work — the Promise rejects and no intermediate state is observable. Step ① does static checks (import types) producing LinkError; steps ②③ allocate resources, producing RangeError/OOM; steps ④–⑦ actually run user code (init expressions, start function) and may produce RuntimeError. This "all or nothing" guarantee is core to the wasm sandbox — there is no half-instantiated Instance.

实例化的 7 步

Seven steps of instantiation

检查 import 是否满足

Check imports

每个 import 声明的类型与 JS 提供的对象类型是否匹配。不匹配 → LinkError。

Every declared import's type vs the JS-provided object. Mismatch → LinkError.

分配 linear memory

Allocate linear memory

如果 import 了 memory → 复用 import 的;否则按 Memory section 的 min page 分配新 ArrayBuffer。

If memory is imported → reuse it; else allocate a fresh ArrayBuffer of min pages from the Memory section.

分配 tables

Allocate tables

类似 memory,要么 import 要么新建,初始填 null/null。

Same pattern as memory — imported or fresh, initial values null.

初始化 globals

Init globals

每个 global 跑一遍它的 init expression。expression 只能用常量 + global.get 已 init 的。

Run each global's init expression. Only constants + global.get of already-inited globals allowed.

填 data segments(memory)

Apply data segments

把 Data section 里所有 active 段 memcpy 进 linear memory 对应地址。

memcpy every active data segment into its target offset in linear memory.

填 element segments(tables)

Apply element segments

把 Element section 的 funcref 填到 tables。

Populate funcrefs from element segments into tables.

调用 start 函数(如果有)

Run start function (if any)

Start section 指定的函数现在被同步调用。start 抛 trap → instantiate 失败。

The start-section function is called synchronously. If start traps, instantiation fails.

JS 那一面的代码

JS-side code

const importObject = {
  env: {
    memory: new WebAssembly.Memory({ initial: 64, maximum: 256 }),
    log:    (x) => console.log('wasm says', x),
  },
  wasi_snapshot_preview1: { /* WASI shims */ },
};

const { module, instance } = await WebAssembly.instantiateStreaming(
  fetch('hot.wasm'),
  importObject
);

instance.exports.blur3(srcPtr, dstPtr, 1920, 1080);

"多 Instance · 单 Module" 的用法

"Many instances · one module" pattern

Web Worker + SharedArrayBuffer 场景里,常见做法是主线程 compile 一次 Module,所有 worker 都用同一个 Module 创建独立 Instance。每个 worker 有自己的 memory(可能是 import 来的共享 memory),共享代码、不共享栈。Photoshop / Figma 都这样做。编译只跑一次,内存按需复制。

With Web Workers + SharedArrayBuffer, the standard pattern is: main thread compiles the Module once; each worker creates its own Instance from the same Module. Each worker has its own memory (possibly imported shared memory), sharing code but not stacks. Photoshop and Figma both do this. Compile once, memory on demand.

本章引用Chapter references

spec: WebAssembly JS API 2.0 · §2.7 instantiate
source: v8/src/wasm/wasm-objects.cc · V8 Instance/Module/Memory JS objects

上一章previous← Ch15 · TurboFan 下一章nextCh17 · JS ↔ Wasm →

STAGE 12 · PIPELINE

JS ↔ Wasm — 看不见的 trampoline 与 5 ns 的代价

JS ↔ Wasm — invisible trampolines and the 5 ns toll

两条 ABI 中间的桥

the bridge between two ABIs

Cost per call

~ 5 ns (modern V8)

Type coercion

Number ↔ i32/f64

Failure modes

TypeError on coerce

Spec §

JS API · 2.5

MAIN-LINE · STOP 8b / 12 · BOUNDARY hot.rs 此刻: JS 那一边 instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) 触发了第一次执行。V8 在中间塞了一层 JS-to-Wasm wrapper 栈帧——SMI 解包 + r15/r14 装填 + tail-jmp 进 Liftoff 出码。2025 年 V8 把这层压到 5 ns。Storyboard 第 8 格的跨边界细节。 hot.rs right now: JS-side instance.exports.blur3(srcPtr, dstPtr, 1920, 1080) triggers the first invocation. V8 inserts a JS-to-Wasm wrapper frame between them — SMI unbox + r15/r14 setup + tail-jmp into Liftoff code. 2025 V8 has the whole thing down to 5 ns. The boundary detail behind Storyboard cell 8.

JS 用 SMI 标记的 31 位整数,wasm 用裸 i32。JS 调用约定走的是 V8 的 JS calling convention,wasm 内部用的是 wasm calling convention。JS 调 wasm 函数,引擎要在中间塞一个 trampoline——把 SMI 解包成 i32,把 NaN 之类的非法值 throw 出来,然后跳进 wasm 函数体。反过来也一样。这一切 V8 都在编译期生成,但你看不到。

JS represents 31-bit integers as SMIs; wasm uses raw i32. JS calls follow V8's JS calling convention; wasm internally uses the wasm convention. When JS calls wasm, the engine slips a trampoline in between — unbox the SMI to i32, throw on illegal values like NaN, then jump into the wasm function body. Same in reverse. V8 generates all of this at compile time, but you never see it.

4 类 trampoline

Four trampoline kinds

Name	Direction	Used when
JS-to-Wasm wrapper	JS → Wasm	JS calls `instance.exports.f`
Wasm-to-JS wrapper	Wasm → JS	Wasm `call`s an imported JS function
Wasm-to-Wasm	Wasm → Wasm	Direct or indirect call to another wasm func
Capi wrapper	C/C++ ↔ Wasm	Embedder uses the `wasm_c_api` headers

FIG 17 · JS → Wasm trampoline · 3-frame stack

JS 调 wasm 时,V8 在中间塞了一层JS-to-Wasm wrapper 栈帧——专门做 SMI 解包 + r15/r14 寄存器装填,然后尾跳进 wasm 函数体。整个过程 2025 年 V8 已压到 ~ 5 ns(2017 是 80 ns)。剩下三件事不能省:栈指针切换、wasm 关键寄存器装填、异常处理元数据 push——这是 trampoline 的物理下限。

When JS calls wasm, V8 inserts a JS-to-Wasm wrapper frame — it unboxes SMIs, loads r15/r14, then tail-jumps into the wasm body. 2025 V8 has the whole thing down to ~5 ns (from 80 ns in 2017). Three things refuse to compress: stack-pointer swap, wasm context-register load, EH metadata push — the trampoline's physical floor.

JS-to-Wasm wrapper 都做了什么

What the JS-to-Wasm wrapper does

; JS-to-Wasm wrapper for blur3(src, dst, w, h)
push   rbp
mov    rbp, rsp

; arg 0 (src): expect SMI, unbox to i32
mov    rax, [rdi+0x10]          ; rdi = first arg, JS heap pointer
test   rax, 0x1                 ; SMI test (low bit = 0 means SMI in V8)
jnz    .slow_path                ; HeapNumber path
sar    rax, 1                   ; SMI shift to get raw int
mov    edi, eax                 ; load into wasm arg reg
...                              ; same for args 1..3

; setup wasm frame
mov    r15, [r13+0x20]          ; load wasm memory base
mov    r14, [r13+0x28]          ; load wasm instance pointer

; tail-call into wasm function
jmp    [r14+0x40]               ; → Liftoff/TurboFan-compiled blur3

.slow_path:
call   ConvertNumberToInt32     ; handles HeapNumber, BigInt, throws on NaN
jmp    back

观察:① 主路径是纯寄存器操作 + 一条 jmp,~5 ns;② SMI 解包是一条 test + sar,几乎免费;③ 慢路径处理 HeapNumber/BigInt/NaN,大约 50~100 ns;④ r15(memory base)和 r14(instance)被显式 load——wasm 函数运行时假设这两个寄存器有效。

Notes: ① fast path is pure register ops + one jmp ≈ 5 ns; ② SMI unboxing is one test + sar, practically free; ③ slow path (HeapNumber / BigInt / NaN) is ~50–100 ns; ④ r15 (memory base) and r14 (instance) are explicitly loaded — the wasm body assumes these registers hold valid values.

Wasm-to-JS 的更贵代价

The pricier Wasm-to-JS direction

反方向更贵:wasm 调用 JS 函数(比如 console.log)需要构造 JS call frame、把 i32 boxing 成 SMI、检查 receiver、可能 GC——单次大概 100~300 ns。"wasm 频繁调 DOM" 是性能反模式——每次过桥的成本就吃掉了算术速度的优势。Photoshop 的策略是把整张图片 copy 到 linear memory,处理完一整张再过桥回 JS,把过桥次数压到极少。

The reverse is pricier: wasm calling JS (e.g. console.log) constructs a JS call frame, boxes i32 to SMI, checks the receiver, possibly triggers GC — ~100–300 ns per call. "Frequent DOM calls from wasm" is the canonical perf anti-pattern — boundary cost devours arithmetic speedup. Photoshop's strategy: copy the entire image into linear memory, process the whole thing, cross the boundary once on return. Minimise crossings.

主线 · The Hot Loop 的过桥成本

Main-line · the Hot Loop crossing cost

JS → blur3 call cost · M1 Pro · Chrome 132

2017 V8

~ 80 ns

2019 (fast-calls)

~ 35 ns

2022 (call-ref opt)

~ 12 ns

2025 (current)

~ 5 ns

2017 MVP 时,JS 调 wasm 单次成本 ~ 80 ns——这意味着每秒最多 1200 万次过桥,对 60 fps 的游戏来说是真实瓶颈。V8 后续 5 年逐步优化:把 wrapper 做成 builtin、把 SMI 解包内联、用 call-ref 替代 call-indirect 间接调用、最后是 2025 年的直接调用——把 wrapper 完全 elide,JS 编译器看穿"这次 call 一定调 wasm" 时直接 emit 一条 call。如今边界几乎免费。

In the 2017 MVP, JS-calling-wasm cost ~80 ns per call — a hard cap of 12 M crossings/s, a real bottleneck for 60 fps games. V8 optimised over five years: turn wrappers into builtins, inline SMI unboxing, replace call-indirect with call-ref, and finally the 2025 direct call — elide the wrapper entirely, JS compiler sees "this call definitely lands in wasm" and emits a plain call. Today the boundary is nearly free.

为什么不能更便宜WHY NOT CHEAPER 5 ns 是 V8 当前的下限——还剩三件事不能省:① stack-pointer 切换(JS 用 V8 stack,wasm 用自己的 stack);② r15 / r14 寄存器 load(wasm 函数假设它们有效);③ 异常处理元数据 push(为了 wasm trap 能被 JS try/catch 抓到)。理论上还能再砍 1~2 ns,但工程复杂度极高。"5 ns 是 trampoline 自身的物理极限"。 5 ns is V8's current floor — three things remain irreducible: ① stack-pointer swap (JS uses V8 stack, wasm has its own); ② r15 / r14 register loads (the wasm body assumes these are valid); ③ exception-handling metadata push (so wasm traps can be caught by JS try/catch). 1–2 ns more could be carved, but engineering cost is high. "5 ns is the trampoline's physical floor".

最贵的指令不是除法,是过边界。 Field Note · 03

The most expensive instruction is not division.
It is crossing the boundary. Field Note · 03

INPUT

JS 调用 wasm.exports.blur3JS calls wasm.exports.blur3SMI Number args

→

OUTPUT

wasm 函数体执行wasm body executes~ 5 ns trampoline · 3.8 ms blur

本章引用Chapter references

source: v8/src/wasm/wasm-import-wrapper-cache.cc · V8 JS↔Wasm trampoline cache
commit: Chromium · wasm wrapper opt history

上一章previous← Ch16 · Instantiate 下一章nextCh18 · Threads & Atomics →

ACT V · PROPOSALS

每一次扩张都是一次妥协。

Every extension is a compromise.

MVP 之后 8 年,wasm 加了 ~15 个生效中的提案。这 5 章只挑最重要的展开:Threads 把共享内存接进 wasm 沙箱;SIMD 用 16 字节寄存器给 inner loop 提速 6 倍;GC 让 Java/Kotlin/Dart 不再背着自己的运行时;Component Model 给 wasm 一个跨语言 ABI;以及还有六个候选提案在排队。每个提案都要回答"怎么不破坏 portable + safe + fast + compact 四目标"——这是 wasm 委员会评审的根本问题。

Eight years post-MVP, wasm has shipped ~15 live proposals. These five chapters cover the most consequential: Threads plug shared memory into the sandbox; SIMD turns 16-byte registers into 6× inner-loop speedups; GC frees Java/Kotlin/Dart from shipping their own runtimes; Component Model gives wasm a cross-language ABI; and six more are queued. Every proposal must answer "does this still honour portable + safe + fast + compact?" — the working group's gatekeeping question.

FIG V · PROPOSALS · dependency graph

提案不是平铺的清单,是有偏序依赖的图。底层是 MVP(2017);L2 prerequisites(bulk-memory / reference-types / multi-value / tail+EH)是后续所有大提案的入场券;L3 compute extensions(threads / SIMD / multi-memory / GC / stack-switching)各自加一种能力;L4 composition(Component Model / JSPI / memory64)整合下层;L5 abstractions(WASI 0.2 / Relaxed SIMD)直接依附于上一层。Component Model 是图里依赖最多的节点——这就是为什么它走得这么慢。

Proposals aren't a flat list — they form a partially ordered graph. At the foundation sits the MVP (2017). L2 prerequisites (bulk-memory / reference-types / multi-value / tail+EH) are the ticket of admission for every later big proposal. L3 compute extensions (threads / SIMD / multi-memory / GC / stack-switching) each layer in one capability. L4 composition (Component Model / JSPI / memory64) integrates lower layers. L5 abstractions (WASI 0.2 / Relaxed SIMD) attach to layer 4. The Component Model is the most-depended-upon node — which is exactly why it's been so slow to ship.

PROPOSAL · THREADS

Threads — 共享内存进沙箱

Threads — shared memory inside the sandbox

SharedArrayBuffer · atomics · futex

Shipped

V8 2019 · SM 2020

Mem

shared linear memory

Ops

i32.atomic.* · memory.atomic.wait/notify

Spec §

threads-proposal

"能不能在浏览器里跑 pthread?"——这是从 2014 年起 game engine 开发者就在问的问题。Threads 提案 2019 年 ship,答案是:能,但用新的方式。WebWorker 已经存在(线程没有共享内存,只有消息传递);wasm threads 在这上面叠加了 SharedArrayBuffer(共享内存)和 atomic ops(无锁原语)。

"Can we run pthread in the browser?" — a question game-engine devs have asked since 2014. The Threads proposal shipped in 2019 with the answer: yes, but in a new way. WebWorker already existed (no shared memory, only message passing); wasm threads layer SharedArrayBuffer (shared memory) and atomic ops (lock-free primitives) on top.

三种新指令

Three new instruction families

i32.atomic.*

原子读写

Atomic load / store

i32.atomic.load / store / rmw.add / rmw.cmpxchg。出码 x86 的 LOCK XADD / LOCK CMPXCHG,ARM 的 LDAR / STLR / LDADD。顺序一致(sequential consistency)是 wasm 的默认。

i32.atomic.load / store / rmw.add / rmw.cmpxchg. Emit x86 LOCK XADD / LOCK CMPXCHG or ARM LDAR / STLR / LDADD. Sequential consistency is wasm's default.

memory.atomic.wait

阻塞等待

Block and wait

类似 Linux futex:线程 a 等地址 X 的值变,引擎挂起这个线程到内核。不能在主线程上用(浏览器禁止主线程阻塞 > 0 ms)。

Linux-futex-like: thread a sleeps until the value at address X changes; the engine parks the thread in the kernel. Not callable from the main thread (browsers forbid > 0 ms main-thread blocks).

memory.atomic.notify

唤醒等待者

Wake waiters

唤醒在地址 X 上等待的 K 个线程(K 可以是 ∞)。配合 wait 实现 mutex / condvar / barrier。

Wake K (possibly ∞) waiters on address X. Combined with wait → mutex / condvar / barrier.

FIG 18 · workers + SharedArrayBuffer topology

主线程 compile 一次 Module + 分配一个 SharedArrayBuffer,通过 postMessage 把它发给 N 个 worker。每个 worker 创建独立 Instance(独立栈、独立 locals、独立 trap state),但都 import 同一个 Memory——这才是真正的"共享内存多线程"。Spectre 漏洞之后 COOP+COEP 头是必需的进程隔离保险。

The main thread compiles the Module once + allocates one SharedArrayBuffer, then ships them to N workers via postMessage. Each worker spawns its own Instance (own stack, locals, trap state) but imports the same Memory — true "shared-memory multithreading". Post-Spectre, COOP+COEP headers gate the process isolation that makes this safe.

JS 那一面的 setup

JS-side setup

// Main thread
const mem = new WebAssembly.Memory({
  initial: 256, maximum: 2048, shared: true    // ← key flag
});
const buf = mem.buffer;  // instanceof SharedArrayBuffer

const workers = [...Array(8)].map(() => new Worker('worker.js'));
workers.forEach(w => w.postMessage({ mem }));    // share Memory across workers

// worker.js
self.onmessage = async ({ data }) => {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('hot.wasm'),
    { env: { memory: data.mem } }                // import same Memory
  );
  instance.exports.blur3_threaded(srcPtr, dstPtr, w, h, workerId);
};

五件事:① shared: true 让 ArrayBuffer 变成 SharedArrayBuffer——浏览器对此要 Cross-Origin-Isolated 才允许;② maximum 必填——因为 grow shared memory 在 JS 那边复杂(所有 worker 都要被通知),所以提前占好上限;③ 主线程 compile 一次 Module,所有 worker 复用;④ 每个 worker 创建自己的 Instance,但 import 同一个 Memory——这是共享内存的关键;⑤ wasm 那边 thread id 通过函数参数显式传入,不是隐式。

Five things: ① shared: true upgrades the ArrayBuffer to SharedArrayBuffer — browsers require Cross-Origin-Isolated for it; ② maximum is mandatory — growing shared memory cross-worker is complex, so the ceiling is fixed up front; ③ main thread compiles the Module once, all workers reuse it; ④ each worker spawns its own Instance but imports the same Memory — that's the shared-memory bridge; ⑤ thread id is passed explicitly as a function arg, not implicit.

Spectre 在这里的故事

The Spectre side-story

2018 年 1 月 Spectre 漏洞披露后,所有浏览器立刻关闭了 SharedArrayBuffer——因为高分辨率定时器 + 共享内存 = 可以利用 cache 旁路通道。wasm threads 当时 phase 3 即将 ship,被推迟了一年半。最终方案:要求页面声明 Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin——把进程隔离到只跟自己同源的脚本一起跑,这样旁路通道泄漏只会泄漏自己的数据,无意义。2021 年起浏览器在 COOP+COEP 头下重新启用 SharedArrayBuffer。如今你能在 Figma 上跑 wasm threads,就是因为它服务端正确设置了这两个头。

When Spectre dropped in January 2018, browsers immediately disabled SharedArrayBuffer — high-res timers + shared memory = a usable cache side-channel. wasm threads, then at phase 3 and almost shipping, slipped a year and a half. The eventual mitigation: require pages to declare Cross-Origin-Embedder-Policy: require-corp + Cross-Origin-Opener-Policy: same-origin — isolate the process so it only co-resides with same-origin scripts; any side-channel leak only leaks your own data, which is harmless. From 2021, browsers re-enabled SharedArrayBuffer behind COOP+COEP. You can run wasm threads on Figma because they set those headers correctly.

主线 · The Hot Loop 多线程版的提速

Main-line · The Hot Loop, threaded

1920×1080 blur · cores=8 · M1 Pro

single thread

3.8 ms · 1.0×

threads × 2

2.0 ms · 1.9×

threads × 4

1.1 ms · 3.5×

threads × 8

0.65 ms · 5.8×

8 个 worker 把图像分成 8 个水平条带,每个 worker 独立处理。理想线性加速 8×,实测 5.8×——剩下 28% 损失在 worker 启动开销、worker 间内存通信、最后聚合点同步。这已经是 wasm 在 web 平台能做到的"多核计算"上限。

Eight workers slice the image into eight horizontal stripes, each processed independently. Ideal linear speedup is 8×; measured 5.8×. The remaining 28% goes to worker startup, cross-worker memory contention, and the final sync barrier. This is the practical ceiling for "multi-core computation" on the web platform today.

本章引用Chapter references

proposal: threads proposal · atomics + SAB
spec: HTML Standard · COOP / COEP
spec: ECMAScript · SharedArrayBuffer

上一章previous← Ch17 · JS ↔ Wasm 下一章nextCh19 · SIMD v128 →

PROPOSAL · SIMD

SIMD — 16 字节寄存器里的 16 个像素

SIMD — 16 pixels in a 16-byte register

v128 · lane ops · 6× speedup

Shipped

2021 · phase 4

Width

fixed 128 bit

Ops

~ 250 (prefix 0xFD)

Target

x86 SSE2 · ARM NEON

MAIN-LINE · STOP 12 / 12 · SIMD FRAME hot.rs 此刻(SIMD 版): 加上 RUSTFLAGS="-C target-feature=+simd128",LLVM 把内层循环向量化——一次循环处理 16 个像素而不是 1 个。inner loop 变成 18 条 SSE2 指令(PADDW / PMULHRSW / PSRLW),平均 1.1 条 SSE / 像素。这是 Storyboard 最后一格,也是 wasm 在 1080p 图像上达到 6.8× of JS 的根源。终点站。 hot.rs right now (SIMD build): with RUSTFLAGS="-C target-feature=+simd128", LLVM vectorises the inner loop — 16 pixels per iteration, not 1. The body becomes 18 SSE2 instructions (PADDW / PMULHRSW / PSRLW), averaging ~1.1 SSE per pixel. The final Storyboard cell — and the reason wasm hits 6.8× of JS on a 1080p image. End of the line.

SIMD 是 wasm 提案里争议最大的一个——主要分歧是固定宽度 vs 可变宽度(scalable)。x86 的 AVX-512 是 512 bit,ARM 的 SVE2 是 128~2048 bit 可变,RISC-V 的 V 扩展是真的可变。真"可移植"的方案应该是 scalable,但 scalable SIMD 的 codegen 复杂度极高,JIT 在浏览器里跑不起。最终 wasm 选了固定 128 bit——所有现代 CPU 都至少有 128 bit 寄存器,JIT 输出最直接。SIMD 是 wasm 唯一一个"明确放弃移植性最大化" 的提案。

SIMD was the most contentious wasm proposal — the central debate was fixed width vs scalable. x86 AVX-512 is 512 bit; ARM SVE2 is 128–2048 bit scalable; RISC-V's V extension is genuinely scalable. The portable answer is scalable, but scalable codegen is so complex that no JIT can handle it in the browser. Wasm settled on fixed 128 bit — every modern CPU has 128-bit registers, so JIT output is direct. SIMD is the one proposal where wasm explicitly traded portability for tractability.

FIG 19 · v128 · 16 bytes, 6 lane interpretations op decides how to read them

同一个 128-bit 寄存器,根据操作解释成 16/8/4/2 个 lane。i8x16.add 是 16 个并行 8-bit 整数加;f32x4.mul 是 4 个并行 32-bit 浮点乘;i64x2.shl 是 2 个并行 64-bit 位移。类型在指令里,不在值里——这是 wasm 整个类型系统的统一原则,SIMD 也不例外。

The same 128-bit register is interpreted into 16/8/4/2 lanes by the op: i8x16.add = 16 parallel 8-bit adds, f32x4.mul = 4 parallel 32-bit float mults, i64x2.shl = 2 parallel 64-bit shifts. The type lives in the op, not the value — wasm's unifying principle, SIMD included.

v128 是个"多义类型"

v128 is a "polysemic type"

128 bit 可以看成:

128 bits can be viewed as:

Lane shape	Lanes	Per-lane type	Examples
i8x16	16	i8	`i8x16.add`
i16x8	8	i16	`i16x8.mul`
i32x4	4	i32	`i32x4.add`
i64x2	2	i64	`i64x2.add`
f32x4	4	f32	`f32x4.sqrt`
f64x2	2	f64	`f64x2.sqrt`

同一个 v128 寄存器,根据操作解释成不同 lane shape。i8x16.add 是 16 个 8-bit 整数对应相加,f32x4.mul 是 4 个 32-bit 浮点对应相乘。类型在指令里,不在值里——这是 wasm 整个类型系统的复制粘贴(回顾 Ch08)。

A single v128 register is reinterpreted by the op's lane shape. i8x16.add = 16 paired 8-bit adds; f32x4.mul = 4 paired 32-bit float mults. The type lives in the op, not the value — a copy-paste from wasm's overall design (recall Ch08).

主线 · The Hot Loop 的 SIMD 版本

Main-line · the SIMD version of The Hot Loop

;; SIMD-vectorised inner loop: process 16 columns at once

v128.load offset=0      ;; 16 bytes from src row 0
v128.load offset=1      ;; 16 bytes shifted right by 1
i16x8.extadd_pairwise_i8x16_u  ;; widen + add pairs → 8 × i16
v128.load offset=2      ;; 3rd column
i16x8.extadd_pairwise_i8x16_u
i16x8.add                ;; sum 3 columns of row 0
;; ... repeat for rows 1 and 2, then sum 3 rows → 8 × i16 sums ...
i16x8.div_u              ;; ÷ 9 (one v128 op = 8 div in 4 cy)
i8x16.narrow_i16x8_u      ;; saturate back to 8-bit
v128.store offset=0     ;; 16 bytes written to dst

一次循环处理 16 个像素,而不是一个。3×3 卷积变成"3 行 SIMD 加法 + 1 次 SIMD 除法 + 1 次 SIMD 写入"。TurboFan 把这段 wasm 翻成 x86 的 PADDW / PMULHRSW / PSRLW 等 SSE2 指令——每条 inner loop 大约 18 条 SSE 指令,处理 16 像素,平均每像素 ~1.1 条 SSE。这比标量版本(每像素 ~20 条 x86 指令)快 6 倍以上,正是 hero pulse bar 里看到的 6.8× 来源。

One iteration handles 16 pixels, not one. The 3×3 convolution becomes "3 rows of SIMD add + 1 SIMD divide + 1 SIMD store". TurboFan lowers this wasm into x86 PADDW / PMULHRSW / PSRLW SSE2 ops — ~18 SSE instructions per inner iteration handling 16 pixels, averaging ~1.1 SSE per pixel. Vs ~20 x86 instructions per pixel in the scalar version → 6×+ speedup, exactly the 6.8× from the hero pulse bar.

Relaxed SIMD 是什么WHAT IS RELAXED SIMD 2024 ship 的 Relaxed SIMD 提案加了一组结果可以略微不同的 SIMD op——比如 i16x8.relaxed_q15mulr_s 在不同 CPU 上结果可能差 1 个 ulp。原因:严格的 SIMD 在 x86 上有时要 emulate(因为 SSE2 不完全等价于 NEON 的某些精确语义),emulate 的开销大。Relaxed SIMD 允许 JIT 选最快的硬件 op,牺牲严格 deterministic。这是 wasm 第一次主动放弃"portable" 的子集——专门给图像滤镜、AI 推理这些"差 1 个 ulp 没人 care" 的场景。 The Relaxed SIMD proposal (2024) added a set of SIMD ops whose results may differ slightly across CPUs — e.g. i16x8.relaxed_q15mulr_s may diverge by 1 ulp across x86 vs ARM. Reason: strict SIMD sometimes needs emulation on x86 (SSE2 isn't bit-equivalent to certain NEON semantics), and emulation is expensive. Relaxed SIMD lets the JIT pick the fastest hardware op, trading strict determinism. The first time wasm willingly dropped a portion of "portable" — aimed at image filters, AI inference, the "1 ulp doesn't matter" scenarios.

本章引用Chapter references

proposal: simd proposal
proposal: relaxed-simd proposal · 1 ulp tolerance

上一章previous← Ch18 · Threads 下一章nextCh20 · wasm-GC →

PROPOSAL · GC

wasm-GC — 终于,wasm 有了自己的对象

wasm-GC — at last, wasm has its own objects

struct · array · i31 · ref

Shipped

V8 130 · SM 120 · 2024

New types

struct, array, i31ref

host's GC (shared with JS)

Spec §

gc-proposal

2017 MVP 时 wasm 没有 GC。Java / Kotlin / Dart / C# 这些带 GC 的语言要跑 wasm,只能把整个 GC 运行时也编译进 wasm——TeaVM 编 Java,Kotlin/Wasm 编 Kotlin,DartVM-wasm 编 Dart,每个加 1-2 MB 的运行时 wasm。这意味着同一个标签页里 10 个 wasm 模块就有 10 份 GC 在跑,堆不共享,STW 不协调——巨大的浪费。

wasm-GC 提案的解法:让 wasm 模块共享宿主 GC(浏览器里就是 V8 或 SM 自带的 GC),wasm 那边定义 struct 和 array 类型,引擎负责分配 / 回收。2024 年 V8 130、Firefox 120 同时 ship,Kotlin/Wasm 立刻把运行时从 1.4 MB 砍到 400 KB,Dart 团队也在改造。

The 2017 MVP had no GC. For GC-bearing languages like Java / Kotlin / Dart / C#, the only way was to compile your GC runtime into wasm — TeaVM for Java, Kotlin/Wasm for Kotlin, DartVM-wasm for Dart, each adding 1–2 MB of runtime wasm. Ten wasm modules in one tab meant ten GCs running, heaps unshared, STW pauses uncoordinated — staggering waste.

The wasm-GC proposal's answer: wasm modules share the host's GC (V8's or SM's in browsers); wasm declares struct and array types, the engine allocates and collects. V8 130 and Firefox 120 shipped it simultaneously in 2024. Kotlin/Wasm immediately cut its runtime from 1.4 MB to 400 KB; the Dart team is refactoring.

FIG 20 · wasm-GC · in-memory layout 同一偏移 $Point.$x 永远在 offset=16,无论这个对象的实际类型是 Point 还是 LabeledPoint。 struct.get $Point $x 编译成 load [obj + 16],无需 vtable 查询 — 静态偏移,O(1)。 ③ ARRAY $Vec(mut i32) (array (mut i32)) · 3 elements RTT* hash length i32 · 3 i32 [0] i32 [1] i32 [2] = 28 B ▲ array.len array.len 编译成 load [obj + 12];array.get $Vec idx → load [obj + 16 + 4·idx](配 bounds check) ④ I31REF · 不分配堆 (i31ref) · 31-bit int inlined into pointer tag — borrows V8's SMI trick tag = 1 1 bit 31-bit signed integer payload in-pointer (no heap alloc) = ptr-size (8 B on 64-bit) i31.new 0x42 → (0x42 << 1) | 1 = 0x85,作为 anyref 直接存入栈/字段,无 heap allocation。 i31.get_s ref → (ref >> 1) sign-extended。这跟 V8 内部 SMI 的实现一致(也跟 JS 的小整数表示互通)。 ⇢ COMPARE · C-language struct

C 的 struct: struct Point { double x, y; } 在内存里只占 16 字节(2 个 f64),无 header——没有 GC,没有类型信息,没有边界检查。指针 + 偏移即可,极快。

wasm-GC struct: 多出 16 字节 header(RTT/hash/flags)。代价是每个 struct.get 都不需要 vtable 查询(静态字段偏移),但每个 ref.cast 要沿RTT chain 走(动态子类型检查)。

所以: wasm-GC ≈ JVM HotSpot oop 布局,不是 C++ 的零开销 struct。设计目标是替代 Java/Kotlin/Dart 的运行时,不是替代 Rust struct。

Rust 用户继续用 linear memory + 手动 layout — wasm-GC 不是给你的。

wasm-GC 的内存布局借鉴 JVM HotSpot 的 oop layout:每个对象 16 字节 header(RTT pointer + hash + flags)+ payload。子类型的字段在同一偏移是关键设计——struct.get $Point $x 永远编成 load [obj+16],无 vtable 查询。ref.cast 需要沿 RTT 链走才动态检查。i31ref 直接把整数 inline 进指针的 31 bit + 1 tag bit,完全跳过堆——这是 V8 SMI 技术的搬运。

wasm-GC's memory layout borrows from JVM HotSpot's oop layout: every object has a 16-byte header (RTT pointer + hash + flags) followed by the payload. Subclass fields share the same offsets as the parent — struct.get $Point $x always compiles to load [obj+16], no vtable lookup. ref.cast walks the RTT chain for dynamic subtype check. i31ref inlines the integer into the pointer (31 data bits + 1 tag bit), bypassing the heap entirely — a port of V8's SMI trick.

新增的类型与指令

New types and instructions

(type $Point (struct (field $x f64) (field $y f64)))
(type $Vec   (array (mut i32)))

;; allocate
struct.new $Point   ;; (consume 2 f64 on stack → produce (ref $Point))
array.new  $Vec     ;; (i32 len + i32 init → produce (ref $Vec))

;; access
struct.get $Point $x   ;; pop ref, push f64
array.get_u $Vec      ;; pop ref + i32 idx, push elem

;; cast (downcast)
ref.cast (ref $Point)   ;; runtime check, trap on mismatch

;; small ints inline (avoid heap alloc)
i31.new             ;; pop i32 (31-bit limited), push i31ref
i31.get_s           ;; unbox

三个观察:① 类型用名义定义($Point 不等于另一个 same-shape struct),允许 sub-typing;② 引用类型可以为 null 也可以 nullable;③ i31ref 直接借用 V8 的 SMI 技术——31 位整数不分配堆,直接 inline 到 ref 槽位低位,跟 JS 的 Number 互操作几乎免费。

Three notes: ① types are defined nominally (one $Point ≠ another struct of the same shape) and support sub-typing; ② references can be nullable or not; ③ i31ref borrows V8's SMI trick — 31-bit ints are not heap-allocated, inlined into the low bits of a ref slot, near-free interop with JS Number.

为什么不直接共享 JS 的对象

Why not just share JS objects?

"wasm 跑在 V8 上,直接用 JS Object 不就行了?"——这是另一个常见疑问。答案:JS Object 是动态形状的(隐藏类、IC、加 property 时形状变),wasm 需要静态形状的对象才能做高效 codegen。wasm-GC 的 struct 类型在编译期固定 layout——访问 $Point.$x 就是 mov reg, [ref+8],一条指令,没有 IC,没有 deopt。共享 GC 不等于共享对象模型。

"Wasm runs on V8 — why not use JS Object directly?" — another common question. Answer: JS Object is dynamically shaped (hidden classes, ICs, shape changes on property add), but wasm needs statically shaped objects for efficient codegen. A wasm-GC struct has a layout fixed at compile time — accessing $Point.$x compiles to mov reg, [ref+8], one instruction, no IC, no deopt. Sharing the GC ≠ sharing the object model.

为什么 GC 提案花了 7 年WHY GC TOOK 7 YEARS GC 是 wasm 史上耗时最长的提案——2017 年立项,2024 年 ship,7 年。难点:① 子类型系统 + cast 语义需要规范化(借鉴了 ML 的子类型 + 自然类型扩展);② 跨模块的类型规范化("不同模块的 $Point 是同一个吗?");③ 与现有 wasm reference types 的兼容(funcref / externref 都得自然嵌入新类型层级);④ V8 / SM / JSC 必须修改自己的 GC 让它能跑 wasm 对象。这是工程量上最大的一个提案,也是最能改变 wasm 生态的一个——Java/Kotlin/Dart 全在重写。 GC was wasm's longest-running proposal — chartered 2017, shipped 2024, seven years. The hardships: ① subtyping + cast semantics had to be formalised (borrowing ML's subtype + natural-type extension); ② cross-module type identity ("are two modules' $Point the same type?"); ③ compatibility with existing reference types (funcref / externref must slot naturally into the new type lattice); ④ V8 / SM / JSC each had to modify their GC to handle wasm objects. The biggest engineering proposal — and arguably the most ecosystem-shifting; Java/Kotlin/Dart are all in the middle of rewriting.

本章引用Chapter references

proposal: gc proposal
blog: Kotlin/Wasm post-GC migration

上一章previous← Ch19 · SIMD 下一章nextCh21 · Component Model →

PROPOSAL · COMPONENT MODEL

Component Model — 跨语言的 ABI

Component Model — a cross-language ABI

WIT · interface types · WASI 0.2

Phase

3 → 4 (2026 target)

IDL

WIT

Types

string, list, record, variant

Runtime

Wasmtime · Jco

"一个 Rust 写的 wasm 怎么调一个 Go 写的 wasm,传一个字符串?"——MVP wasm 无法回答这个问题。原因:wasm 只有 i32/i64/f32/f64/v128,没有 "字符串" 类型。Rust 那边 String 是 (ptr, len, cap) 三件套,Go 那边 string 是 (ptr, len) 两件套,Java 那边 String 是 UTF-16 数组——三方都要用约定俗成的方式把字符串"展开"成 i32 + 长度。每两种语言都要写一套胶水,N² 复杂度。

Component Model 用一个新的组件层解决这个问题。在 .wasm 的"core module" 之上,加一个 .component 文件,声明用语言无关的类型系统(string / list<T> / record / variant)描述接口。组件间互调由规范的 ABI 处理 lift / lower——发送端把语言原生类型 lower 成 ABI 形式,接收端 lift 回它的原生类型。N 个语言只需要 N 套 binding generator,复杂度从 N² 降到 N。

"How does Rust wasm call Go wasm, passing a string?" — the MVP can't answer. Reason: wasm has only i32/i64/f32/f64/v128, no "string" type. Rust's String is (ptr, len, cap); Go's string is (ptr, len); Java's String is UTF-16 array — each pair of languages needs custom glue. N² complexity.

Component Model solves this with a new component layer. On top of a wasm "core module", a .component file declares interfaces in a language-agnostic type system (string / list<T> / record / variant). Inter-component calls are mediated by a canonical ABI that lifts and lowers — the caller lowers its native type to the ABI shape, the callee lifts back into its native type. N languages need only N binding generators; complexity drops from N² to N.

FIG 21 · component model · String lift/lower

Component Model 的核心抽象:lift 把语言原生类型(Rust 的 String / Go 的 string / Java 的 String)降成 wasm 原始 scalar(i32 ptr + i32 len)+ linear memory 字节;lower 是反过程。两端语言不知道彼此存在,只跟同一份 WIT 协议握手。N 种语言只需 N 套 binding generator,复杂度从 N² 降到 N。

Component Model's core abstraction: lower takes a language-native type (Rust's String, Go's string, Java's String) and lowers it into wasm scalars (i32 ptr + i32 len) + linear-memory bytes; lift is the inverse. Neither side knows the other exists; both shake hands with the same WIT contract. N languages need only N binding generators — complexity drops from N² to N.

WIT — 接口描述语言

WIT — the IDL

// blur.wit — the interface for our image-processor component

package ursb:image@0.1.0;

interface filter {
    record bitmap {
        width:  u32,
        height: u32,
        pixels: list<u8>,
    }

    enum error { invalid-size, oom }

    blur3: func(input: bitmap) -> result<bitmap, error>;
}

world image-tools {
    export filter;
}

用 wit-bindgen 工具:

Use wit-bindgen:

$ wit-bindgen rust blur.wit --out-dir ./bindings/
# Generates Rust types matching the WIT — `Bitmap { width: u32, ... }` plus a trait you `impl`

$ wit-bindgen go blur.wit --out-dir ./bindings/
# Same component, now in Go

$ wasm-tools component new core.wasm -o blur.component.wasm
# Package the core module + component metadata

每种语言的 binding generator 知道怎么把它的原生类型 marshal 进 ABI 形式:Rust 的 String → (ptr, len),Go 的 string → (ptr, len) 但用 GC 跟踪,Java 的 String → UTF-8 编码后传递。所有这些细节对组件作者完全透明。

Each binding generator knows how to marshal native types into the ABI: Rust's String → (ptr, len), Go's string → (ptr, len) tracked by GC, Java's String → UTF-8-encoded. All these details are invisible to the component author.

为什么这件事这么重要

Why this matters so much

WASI 0.2(2024 ship)的所有"系统接口" —— wasi:io / wasi:filesystem / wasi:http / wasi:clocks —— 都用 Component Model 声明。这意味着同一个 wasm 组件 可以在 Wasmtime / Wasmer / Spin / Jco / 浏览器 polyfill 上跑,只要 host 提供对应的 WASI 接口实现。这是真正的"一次编译,处处运行"——比 Java 当年的承诺更彻底,因为它跨语言。Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin 等"边缘计算 wasm" 平台,本质都是 Component Model 的客户。

WASI 0.2 (shipped 2024) declares all its "system interfaces" — wasi:io / wasi:filesystem / wasi:http / wasi:clocks — through the Component Model. That means one wasm component runs on Wasmtime / Wasmer / Spin / Jco / a browser polyfill, as long as the host implements the matching WASI interface. The real "compile once, run anywhere" — more thorough than Java's old promise because it's cross-language. Cloudflare Workers, Fastly Compute@Edge, Shopify Functions, Spin — every edge-wasm platform is essentially a Component Model customer.

本章引用Chapter references

proposal: Component Model spec
docs: WIT IDL reference
repo: wit-bindgen · N-language binding generator

上一章previous← Ch20 · GC 下一章nextCh22 · 其余六提案Six more →

PROPOSAL · MISC

还有六个提案 — 排队中的扩张

Six more — proposals in the queue

tail-call · EH · memory64 · JSPI · stack-switching · multi-memory

Total tracked

~ 40 proposals

Phase 3-4

~ 12

Phase 2

~ 15

Phase 0-1

remainder

除了 Threads / SIMD / GC / Component Model 这 4 个"明星" 提案外,还有六个对生态影响很大的提案正在不同阶段。下面是 2026 年 5 月的现状快照——这些数字会变,但格局相对稳定。

Beyond the four "headliners" (Threads / SIMD / GC / Component Model), six more proposals materially shape the ecosystem at various phases. A snapshot as of May 2026 — the numbers move, but the landscape is stable.

phase 4 · shipped

tail-call

return_call + return_call_indirect。让函数式语言(Scheme/OCaml)能尾递归到 wasm 不爆栈。V8 2023 ship。

return_call + return_call_indirect. Lets functional langs (Scheme/OCaml) tail-recurse without stack blow-up. V8 shipped 2023.

phase 4 · shipped

exception-handling

try / catch / throw / tag 一套。C++ 异常、Rust panic 现在能"真正" 抛而不是断电。2023 ship。

try / catch / throw / tag. C++ exceptions, Rust panics now genuinely throw rather than abort. Shipped 2023.

phase 3 · flag

memory64

把 wasm 升到 64-bit 寻址,突破 4 GiB 上限。所有引擎 flag 后可用。生态(LLVM, Emscripten)还在迁移。

64-bit wasm addressing, breaks 4 GiB cap. All engines support behind flag; LLVM/Emscripten still migrating.

phase 3 · flag

JSPI (JS Promise Integration)

让 wasm 函数能"await" 一个 JS Promise,栈在等待时挂起。这是真正的 async wasm,Emscripten 之前的 Asyncify 是软件 emulate。

Wasm functions can "await" a JS Promise, suspending the stack mid-call. True async wasm — Emscripten's Asyncify was software emulation.

phase 3

stack-switching

支持 coroutine / fiber / generator 一类的用户态栈切换。Go 的 goroutine 编 wasm 之后曾经因为没有 stack switching 慢得不能用,这个提案要救它。

Supports user-mode stack switching for coroutines / fibers / generators. Go's goroutines compiled to wasm were unusable without it; this proposal rescues them.

phase 2-3

multi-memory

一个模块可以有 N 个独立 linear memory。让 wasm 能像 Unix 进程那样有多个 mmap 区。Component Model 的依赖,2025 ship V8/SM。

A module can have N independent linear memories. Lets wasm behave like a Unix process with multiple mmap regions. A Component Model prerequisite; V8/SM shipped 2025.

tail-call 的具体效果

What tail-call concretely enables

;; before — call + return: stack grows N deep for N tail calls
(func $fact (param $n i32) (param $acc i32) (result i32)
  local.get $n
  i32.eqz
  (if (result i32) (then local.get $acc)
    (else
      local.get $n i32.const 1 i32.sub
      local.get $n local.get $acc i32.mul
      call $fact   ;; ← regular call, stack grows N frames
)))

;; after — return_call: reuse current frame, stack stays at 1 frame
      return_call $fact   ;; ← O(1) stack

没有 tail-call 时,函数式语言只能用 trampoline 模拟尾递归——把 "下一步要调用什么" 当返回值,外层循环里轮询。代码丑、慢 3 倍。tail-call ship 之后 Scheme / Erlang / OCaml 编译到 wasm 才真正可用。

Without tail-call, functional languages had to simulate tail recursion via trampolines — returning "what to call next" and looping in an outer loop. Ugly, 3× slower. Post-tail-call, Scheme / Erlang / OCaml-to-wasm is genuinely usable.

JSPI 解决了什么

What JSPI fixes

假设你的 wasm 要 fetch 一个网络资源。在 MVP 里你只能:① wasm 调用一个 JS 函数,JS 起 fetch,等 fetch 完后回调 wasm 的另一个函数。代码丑,因为 wasm 函数被切成两半。JSPI 让 wasm 函数能停在中间,等 JS Promise resolved 再继续。引擎在 stack 上记一个 continuation,fetch 完后从这个 continuation 恢复。对开发者像是同步代码,引擎在底下做了异步。

Suppose your wasm wants to fetch a network resource. In the MVP, you could only: ① wasm calls a JS function, JS issues fetch, on completion JS calls back another wasm function. Ugly — your wasm function is cut in half. JSPI lets a wasm function pause mid-execution, await a JS Promise, resume. The engine stores a continuation on the stack; on resolve, it resumes from that continuation. To the developer it reads as sync code; under the hood the engine does async.

本章引用Chapter references

proposal: tail-call proposal · return_call
proposal: jspi proposal · await Promise from wasm
proposal: stack-switching proposal · fibers / coroutines

上一章previous← Ch21 · Component Model 下一章nextCh23 · 性能模型Performance model →

ACT VI · SYNTHESIS

把碎片拼成一台机器。

Stitching the fragments back into a machine.

前 22 章拆开看每一道工序;这 4 章把它们拼回来。先写一份性能模型,把"为什么 wasm 比 JS 快" 拆成具体百分比;然后讲怎么用 Chrome DevTools 在 wasm 里设断点、看变量、追 SourceMap;接着是真实战场——Figma / Photoshop / AutoCAD / Ruffle / ffmpeg 这些把 wasm 用到极限的工业级产品;最后一份术语表,把全文出现过的 50 个名词钉死定义。读完这 4 章,你应该能在任何技术讨论里 hold 住 wasm 这个话题。

The previous 22 chapters dissected each stage; these 4 stitch them back. First, a performance model that decomposes "why wasm is faster than JS" into concrete percentages; then Chrome DevTools — setting breakpoints, inspecting locals, following source maps in wasm; then the battlefield — Figma / Photoshop / AutoCAD / Ruffle / ffmpeg, the industrial products that push wasm to the edge; finally a glossary of 50 terms used throughout. By the end, you should be able to hold any wasm conversation.

SYNTHESIS · 01

性能模型 — wasm 为什么(有时)快,为什么(有时)不

Performance model — why wasm is fast (and sometimes isn't)

把工程经验写成公式

turning engineering folklore into a formula

"wasm 比 JS 快多少" 是个没法一句话回答的问题——它依赖于代码模式、引擎版本、SIMD / 多线程是否开。但我们可以写一个分解公式:

"How much faster is wasm than JS" can't be answered in one sentence — it depends on code pattern, engine version, SIMD/threads. But we can write a decomposition formula:

公式 · wasm 性能FORMULA · wasm performance

T_wasm = T_arith · k_simd · k_threads + T_boundary · n_calls + T_memcopy T_arith = 纯算术部分(wasm 比 JS 快 2 ~ 5×)arithmetic body (wasm 2–5× faster than JS) k_simd = 1 / 6 (开 SIMD 后)(with SIMD) k_threads = 1 / cores (理想)(idealised) T_boundary = 每次 JS↔wasm 过桥 ~ 5 ns(2025)JS↔wasm crossing ~5 ns (2025) T_memcopy = 把数据 copy 进/出 linear memorycopying data in/out of linear memory

推论:T_arith 让 wasm 胜,T_boundary 和 T_memcopy 让 wasm 输。当 T_arith 占总时间 > 80% 时 wasm 必赢;占 < 20% 时 wasm 输给 JS。 Implication: T_arith makes wasm win; T_boundary + T_memcopy make wasm lose. When T_arith dominates (>80%) wasm wins; when it's a sliver (<20%) wasm loses to JS.

基准方法学(读这章数字前先看)BENCHMARK METHODOLOGY (read before trusting any number) 本章的具体数字来自三种来源:① 主线 Hot Loop(3×3 卷积,1920×1080 u8 灰度图)在 M1 Pro / macOS 14 / Chrome 132.0 / V8 13.2 上跑 1000 次取中位数——T_arith 占比、25× 倍率、k_simd ≈ 1/6 都来自此;② JS↔Wasm 过桥 ~ 5 ns 来自 V8 commit 0e0a0b1d(2025-01 Backes / Köhne "Reducing wasm call overhead")的微基准;③ SciMark/fasta/n-body/JPEG 行是 emscripten benchmarks v3.1 在同样机器上跑的结果。每条数字都可重现,但都是 M1 一次测量——x86_64 + AMD Zen 4 上系数会偏,RISC-V wasm 实现还在 flag 后。"原生 80%" 是设计目标,本章数字与之大致符合。 The concrete numbers in this chapter come from three sources: ① main-line Hot Loop (3×3 convolution, 1920×1080 u8 grayscale) on M1 Pro / macOS 14 / Chrome 132.0 / V8 13.2, median of 1000 runs — T_arith share, the 25× ratio, k_simd ≈ 1/6 all derived here; ② JS↔Wasm crossing ~ 5 ns from V8 commit 0e0a0b1d (2025-01, Backes / Köhne, "Reducing wasm call overhead"), microbenchmark; ③ SciMark/fasta/n-body/JPEG rows from emscripten benchmarks v3.1 on the same machine. Every number is reproducible but reflects one M1 measurement — x86_64 + AMD Zen 4 coefficients will drift, and RISC-V wasm is still flagged. "80% of native" is the design target; this chapter's numbers broadly match it.

三种场景的具体数字

Three scenarios, concrete numbers

①

图像滤镜(arithmetic-bound)

Image filter (arithmetic-bound)

1920×1080 卷积,T_arith 占 95%,T_boundary 一次,T_memcopy 一次(数据已经在 wasm memory)。wasm + SIMD + 4 线程 = 比 JS 快 25 倍。这是 wasm 最适用的场景。

1920×1080 convolution: T_arith ~95%, T_boundary once, T_memcopy once (data already in wasm mem). wasm + SIMD + 4 threads → 25× faster than JS. The canonical fit.

②

DOM tree diff(boundary-bound)

DOM tree diff (boundary-bound)

每个 DOM 节点要 readback,T_boundary 占 70%。wasm 比 JS 慢 70%。用 wasm 做 React 是错误的工程方向。

Every DOM node needs readback; T_boundary ~70%. Wasm is 70% slower than JS. Using wasm to write React is the wrong engineering direction.

③

JSON parsing(memcopy-bound)

JSON parsing (memcopy-bound)

JSON 来自 JS 字符串,要 copy 进 wasm memory,parse 完后还要把结果 copy 回去。T_memcopy 占 50%。wasm 持平甚至略慢于 V8 原生 JSON.parse——因为 V8 的 JSON 路径已经极度优化。

JSON arrives as JS string, copies into wasm memory, parses, results copy back. T_memcopy ~50%. wasm ties or slightly loses to V8's native JSON.parse — V8's JSON path is already extremely tuned.

"wasm 慢启动" 的误解

The "wasm has slow startup" myth

人们以为 wasm 启动慢——其实Liftoff 让 wasm 启动比 JS 还快。一个 1 MB 的 wasm 模块,Liftoff ~ 100 ms 出码就能跑;一个 1 MB 的 minified JS,V8 要 parse + Ignition + 进 inline cache,~ 200 ms 才稳定。wasm 启动从 2018 年起就不再是性能问题。剩下的延迟主要是下载——文件大小决定的,不是 wasm 的错。

People assume wasm startup is slow — actually Liftoff makes wasm boot faster than JS. A 1 MB wasm module: Liftoff ~100 ms to runnable. A 1 MB minified JS: V8 parses + Ignition + IC warmup ~200 ms to steady state. Since 2018, wasm startup hasn't been a perf problem. Remaining latency is download — a function of file size, not wasm's fault.

三个反直觉

Three counter-intuitive findings

myth

"wasm 总比 JS 快"

"wasm always beats JS"

错。短函数 + 频繁过桥时 wasm 慢。"wasm 快"是大块计算的快。

False. Short funcs with frequent crossings: wasm loses. "Wasm-fast" describes chunky compute.

myth

"wasm 是 web 专用"

"wasm is web-only"

错。Cloudflare Workers, Spin, Fastly, Shopify Functions 都在服务器跑 wasm,数量已超过浏览器 wasm 模块的总数。

False. Cloudflare Workers, Spin, Fastly, Shopify Functions all run server-side wasm — collectively more module-instances than the browser.

myth

"wasm 不能用 GC 语言"

"GC languages can't use wasm"

2024 年前是,现在不是。wasm-GC ship 之后,Kotlin/Wasm 已经是生产就绪。

True pre-2024, false now. Post-wasm-GC, Kotlin/Wasm is production-ready.

myth

"wasm = Rust"

错。Rust 是 wasm 最大的语言,但 C++(Emscripten)、Go(TinyGo)、Kotlin、AssemblyScript、Swift 都能编 wasm。

False. Rust is wasm's biggest source language, but C++ (Emscripten), Go (TinyGo), Kotlin, AssemblyScript, Swift all compile to wasm.

本章引用Chapter references

commit: v8 commit 0e0a0b1d · V8 · 2025-01 · reducing wasm call overhead microbench
bench: Emscripten benchmarks v3.1

上一章previous← Ch22 · 六提案Six more 下一章nextCh24 · DevTools →

SYNTHESIS · 02

DevTools 调试 — DWARF 把符号还给字节

DevTools debugging — DWARF returns names to bytes

name section · source maps · DWARF

编完 wasm 后函数 / 局部变量都变成索引——func 17 而不是 blur3,local 3 而不是 sum。直接在 DevTools 里看一团数字几乎不可能调试。三层调试信息把符号补回来:① name custom section(函数 + locals 名字);② source map(行号 → 文件名+行号);③ DWARF custom section(完整的类型信息 + 局部变量映射 + inline 信息)。三层都通过 custom section 加塞进 .wasm,运行时不影响,DevTools 启用 "WebAssembly Debugging" 选项后才解析。

After compilation, functions and locals are indices — func 17 not blur3, local 3 not sum. Debugging a wall of numbers in DevTools is near-impossible. Three layers of debug info put names back: ① name custom section (function + local names); ② source map (line → file+line); ③ DWARF custom section (full type info + local-variable mapping + inline info). All three ride custom sections inside the .wasm — invisible at runtime, parsed when DevTools "WebAssembly Debugging" is enabled.

三层的对比

The three layers compared

Layer	Section	What it gives you	Cost (size)	Tooling
name	`"name"`	函数 + locals 名字function + local names	+ 1–3 %	built into rustc / wasm-bindgen
source map	`"sourceMappingURL"`	行号 ↔ 源文件位置line ↔ source file position	+ 5–10 %	wasm-opt --emit-source-map
DWARF	`".debug_*"`	完整类型 + 变量 + inlinetypes + locals + inlining	+ 30–100 %	clang -g · rustc -g · DWARF dumping

Chrome DevTools 启用 wasm 调试

Enabling wasm debug in Chrome DevTools

2020 年 Chrome 88 起,DevTools 集成了 wasm 调试器(由 Google 内部 chrome-devtools-frontend 团队和 Bloomberg 合作开发)。开启步骤:

From Chrome 88 (2020), DevTools includes a wasm debugger (built by Chromium's chrome-devtools-frontend team and Bloomberg). Enable in three steps:

编译时加 -g

Add -g at compile time

Rust: RUSTFLAGS="-g" 或者 cargo build --release + [profile.release] debug = true。C++: emcc -g hot.c -o hot.wasm。会让 wasm 体积涨 30–100%,但调试体验质变。

Rust: RUSTFLAGS="-g" or cargo build --release with [profile.release] debug = true. C++: emcc -g hot.c -o hot.wasm. Bumps wasm size 30–100% but transforms the debug experience.

安装 C/C++ DevTools Support extension

Install C/C++ DevTools Support extension

Chrome Web Store 装 "C/C++ DevTools Support (DWARF)"。这是个 wasm 写的 plugin,解 DWARF,把它翻成 DevTools 能用的格式。

Install "C/C++ DevTools Support (DWARF)" from the Chrome Web Store. It's itself a wasm plugin: parses DWARF, exposes it to DevTools.

DevTools Settings · 勾选 WebAssembly Debugging

DevTools Settings · check WebAssembly Debugging

F12 → ⚙ Settings → Experiments → "WebAssembly Debugging"。Reload。打开 Sources 面板能看到原始 .rs / .c 文件。

F12 → ⚙ Settings → Experiments → "WebAssembly Debugging". Reload. The Sources panel now lists the original .rs / .c files.

能做什么 · 调试体验

What you can do · debug UX

●

在 .rs 文件里设断点

Breakpoints in .rs files

DevTools 用 source map 把断点行号翻成 wasm bytecode 位置,V8 在那里暂停。

DevTools maps the breakpoint via source map to wasm bytecode position; V8 pauses there.

●

看局部变量值(Rust 类型)

Inspect locals (Rust types)

Watch 面板显示 sum: u32 = 1842(从 DWARF 解出类型 + 当前寄存器/栈位置 + 字节解码)。

Watch shows sum: u32 = 1842 (decoded via DWARF type + register/stack location + byte interpretation).

●

Step into / over / out

行级单步,因为 source map 知道每条 wasm op 属于哪一源码行。inline 函数的 step into 也工作(DWARF 携带 inline 信息)。

Line-level stepping — source map knows which wasm op belongs to which source line. Step-into through inlined functions works too (DWARF carries inline info).

●

看 linear memory

Inspect linear memory

Sources 面板的"Memory" tab 显示整片字节,可以跳到地址、按 i8/i16/i32/f32 不同方式查看。

The Memory tab in Sources shows the slab; jump to address, view as i8/i16/i32/f32.

●

看 Liftoff vs TurboFan 出码

View Liftoff vs TurboFan output

--print-wasm-code 或者 D8 + --print-code,看具体 x86-64 / ARM64。

--print-wasm-code, or D8 with --print-code for raw x86-64 / ARM64.

关于 source map 的小坑A SOURCE-MAP GOTCHA 如果你的服务器把 .wasm 加了 gzip / brotli,Chrome 默认会拿 decompressed 字节查 source map URL——但 source map 文件本身可能是另一个 URL(hot.wasm.map)。记得 .map 也 deploy 上 CDN,否则 DevTools 会报 "404, falling back to disassembly"。这是新手最常踩的坑之一。 If your server gzips / brotlis the .wasm, Chrome resolves the source-map URL on the decompressed bytes — but the .map file is a separate URL (hot.wasm.map). Deploy the .map alongside the .wasm, or DevTools shows "404, falling back to disassembly". A common newbie trap.

本章引用Chapter references

docs: Chrome DevTools · WebAssembly debugging
std: DWARF 5 · debug info format
draft: TC39 · Source Map (ecma-426 draft)

上一章previous← Ch23 · 性能模型Performance model 下一章nextCh25 · 现实战场Real-world →

SYNTHESIS · 03

现实战场 — wasm 在工业级产品里的样子

Real-world battlefields — wasm in production at scale

Figma · Photoshop · AutoCAD · Ruffle · ffmpeg

这一章不讲技术,讲产品。下面每个案例都是 wasm 跨过工程门槛、跑在百万级用户上的真实部署——把"wasm 能做什么" 从理论问题变成已发生的事实。

This chapter isn't technical — it's about products. Each case below is wasm clearing the engineering bar and shipping to millions of users — turning "what wasm can do" from a theoretical question into a record of what's already happened.

FIG 25 · battlefield timeline · 9 deployments

7 年里 9 个工业级 wasm 部署的时间线。气泡面积近似 module 大小(Photoshop 70 MB / AutoCAD 80 MB 是两个最大块,Figma 3 MB / Ruffle 2 MB 在小端)。颜色按源语言分:绿色 = C/C++ 经 Emscripten,橙色 = Rust 经 wasm-bindgen,蓝色 = 自带 GC 运行时(Blazor / Go),紫色 = 服务端。底部虚线标注是关键提案 ship 的时刻——Photoshop 2023 之所以可能,是因为 2021 SIMD + 2023 tail-call + 2024 GC 都到齐了。

Seven years, nine production wasm deployments. Bubble area ≈ module size (Photoshop 70 MB and AutoCAD 80 MB are the giants; Figma 3 MB and Ruffle 2 MB sit at the small end). Colour encodes source language: green = C/C++ via Emscripten, copper = Rust via wasm-bindgen, blue = bundled-GC runtime (Blazor / Go), purple = server-side. Dashed verticals mark when key proposals shipped — Photoshop 2023 was only possible because SIMD (2021), tail-call (2023), and GC (2024) had all landed.

Figma — 第一个 web wasm 真实成功故事

Figma — the first true wasm-on-web success

Figma 2016 年上线时,渲染引擎已经是 C++ 编到 asm.js 跑在浏览器里。2017 年 wasm MVP ship 后立刻迁移到 wasm——启动速度提升 3 倍,文件加载提升 2 倍。Evan Wallace(Figma 联合创始人)在博客里写过:"without WebAssembly, Figma would not exist"。Figma 的整个矢量编辑、 canvas 渲染、协作 OT 算法都在 wasm 里——只有 UI 是 React。它定义了"wasm-first 应用" 的工程模板。

Figma launched in 2016 with its rendering engine already compiled from C++ to asm.js. Post-wasm MVP in 2017 it migrated immediately — 3× startup speedup, 2× file load. Co-founder Evan Wallace wrote on the blog: "without WebAssembly, Figma would not exist". Figma's vector editing, canvas rendering, and collaborative OT all run inside wasm — only the UI is React. It defined the engineering template for "wasm-first apps".

Photoshop Web — 30 万行 C++ 的搬运

Photoshop Web — porting 300 K lines of C++

2023 年 Adobe 把 Photoshop 的 pixel pipeline 编译到 wasm,在 Chromium 上开始公测。模块大小:70 MB(gzip 后 18 MB)。用了 wasm threads + SIMD + 多 memory + JSPI。其中最大的工程难点是 Photoshop 自带的内存分配器(jemalloc)要从假定有 mmap 的 native 环境改为 wasm 的 linear memory——他们花了 9 个月把 jemalloc 移植成"wasm 友好" 的版本。Photoshop Web 是目前为止编到 wasm 的最大商业代码库。

In 2023 Adobe compiled Photoshop's pixel pipeline to wasm and opened public beta on Chromium. Module size: 70 MB (18 MB gzipped). Uses wasm threads + SIMD + multi-memory + JSPI. The hardest engineering hurdle was porting Photoshop's bundled allocator (jemalloc) from a mmap-assuming native world to wasm's linear memory — 9 months to produce a "wasm-friendly" jemalloc. Photoshop Web is the largest commercial codebase ever compiled to wasm.

AutoCAD Web — 30M LOC, 35 年历史

AutoCAD Web — 30 M LOC, 35-year history

AutoCAD 1982 年首次发布,代码累计 30M+ LOC。Autodesk 2018 年开始把它编到 wasm,2020 年正式上线 AutoCAD Web App。移植中最大的挑战不是计算速度,是文件 IO 路径——AutoCAD 假定有本地文件系统,wasm 在浏览器里没有,要用 OPFS(Origin Private File System)和 fetch API 模拟。这是WASI 0.2 的 wasi:filesystem 在浏览器里也有用的原因。

AutoCAD shipped in 1982 with 30 M+ cumulative LOC. Autodesk began compiling to wasm in 2018; AutoCAD Web launched in 2020. The biggest port hurdle wasn't compute speed — it was the filesystem path. AutoCAD assumes a local FS; wasm in the browser has none, so they shim via OPFS (Origin Private File System) and fetch. This is why WASI 0.2's wasi:filesystem matters in the browser too.

Ruffle — Flash 的还魂

Ruffle — Flash, revived

Adobe Flash 2020 年正式 EOL。但无数 90s/00s 的网页游戏 + 互动课件 + 文化档案因此面临"不能再打开"的危机。Ruffle 是一个用 Rust 写的 Flash player,编到 wasm,在浏览器里跑——纯客户端,不需要 Adobe 任何东西。Internet Archive 上 50 万个 .swf 游戏与视频经 Ruffle 在客户端解释执行,得以复活。在 wasm 的所有现实战场里,这是与文化遗产保存关系最直接的一个。

Adobe Flash reached EOL in 2020. But countless 90s/00s web games + interactive coursework + cultural archives faced "cannot be opened again". Ruffle is a Rust-written Flash player, compiled to wasm, running in the browser — pure client-side, Adobe-free. On the Internet Archive, 500 K .swf games and videos are now interpretable in the browser via Ruffle. Of all wasm's production deployments, this is the one most directly tied to cultural-heritage preservation.

ffmpeg.wasm — 视频编辑到客户端

ffmpeg.wasm — video editing in the client

把 ffmpeg(2000 万行 C)编到 wasm。生成的 .wasm 大约 25 MB(gzip 后 6 MB)。性能大概是 native ffmpeg 的 40~60%——主要差距在 SIMD 不完全(ffmpeg 用了大量 AVX-512,wasm SIMD 只到 128 bit)。但 client-side 视频转码、抠图、字幕合成全部可以做。1Password、Loom、Riverside、CapCut Web 都集成了 ffmpeg.wasm。

ffmpeg (20 M lines of C) compiled to wasm. Result: ~25 MB .wasm (6 MB gzipped). Perf is 40–60% of native ffmpeg — the gap mostly from SIMD shortfall (ffmpeg uses heavy AVX-512; wasm SIMD caps at 128 bit). Even so, client-side video transcoding, chroma key, subtitle compositing are all on the table. 1Password, Loom, Riverside, CapCut Web all embed ffmpeg.wasm.

→ 服务端 wasm 不在这一章展开——它是另一半故事,留给附录的 Ch28 服务端 wasm:Cloudflare Workers / Fastly / Spin / Fermyon 的 6 平台对照、冷启动 5 档对比、WASI 0.2 的位置。

→ Server-side wasm isn't unpacked here — it's the other half of the story, reserved for Ch28 · Server-side wasm: six-platform comparison (Cloudflare Workers / Fastly / Spin / Fermyon), five-tier cold-start chart, and WASI 0.2's place in it.

Battlefield	Source language	Module size	Year	Key feature used
Figma	C++	~ 3 MB	2017	MVP arithmetic
Google Earth	C++	~ 15 MB	2019	threads (early)
AutoCAD Web	C++	~ 80 MB	2020	threads + OPFS
Photoshop Web	C++	~ 70 MB	2023	threads + SIMD + multi-mem + JSPI
Ruffle	Rust	~ 2 MB	2021	MVP + SIMD
ffmpeg.wasm	C	~ 25 MB	2019	SIMD + threads
Blazor	C#	~ 3 MB AOT	2020	GC (custom runtime) → wasm-GC migrating
1Password CLI	Rust	~ 5 MB	2022	WASI
Cloudflare Workers	any	variable	2018	server-side, 1 ms cold start

本章引用Chapter references

case: Figma · 3× load-time speedup post-wasm
case: Photoshop Web on Chromium · Adobe
repo: Ruffle · Rust Flash player → wasm

上一章previous← Ch24 · DevTools 下一章nextCh26 · 术语表Glossary →

SYNTHESIS · 04 · GLOSSARY

术语表 — 50 个名词,钉死定义

Glossary — 50 terms, pinned definitions

读完这一章你能 hold 住任何 wasm 讨论;附录还有三章深读 + 一份引用合集

after this you can hold any wasm conversation; three appendix deep-dives and a references section still ahead

→ 50 个名词按 6 主题色轴 分组:A 历史 · B 二进制解剖 · C 编译引擎 · D 提案族 · E 工具/运行时 · F 治理

→ 50 terms across 6 colour axes: A history · B binary anatomy · C engines · D proposals · E tools / runtimes · F governance

A · HISTORY · 历史与设计选择

WebAssembly / wasm virtual ISA: 2015 年 W3C 设计的虚拟指令集 + 二进制格式 + 沙箱执行模型。当作"一种新 CPU" 来理解。W3C-designed (2015) virtual ISA + binary format + sandboxed execution model. Think of it as "a new CPU".
asm.js 2013 typed JS subset: wasm 的直接前身。一个加 "use asm" 指令的 JS 子集,引擎可以 AOT 编译。Firefox 实测过 1.5× of native。Wasm's direct predecessor. A JS subset marked with "use asm" that engines can AOT-compile. Firefox measured 1.5× of native.
MVP Minimum Viable Product · 2017-03: wasm 1.0,2017 年 3 月在 Chrome/Firefox/Safari/Edge 同时 ship。只有 4 种数字类型、没有 threads/SIMD/GC/EH。Wasm 1.0, shipped 2017-03 in Chrome/Firefox/Safari/Edge simultaneously. Four numeric types only — no threads/SIMD/GC/EH.
栈机Stack machine stack-based VM: 指令操作数从隐式栈 pop,结果 push 回去。wasm 选了这种(对比寄存器机),为了字节密度 + 单遍 codegen。Operands implicitly pop from a stack, results push back. Wasm chose stack (vs register) for byte density + single-pass codegen.

B · BINARY ANATOMY · 二进制解剖 · 运行时形态

LEB128 Little Endian Base-128: 变长整数编码,每字节 7 位载数据,最高位标续。小整数 1 字节,大整数最多 5 字节。wasm 所有整数都用它。Variable-length integer encoding: 7 data bits per byte, top bit = continuation. Small ints in 1 byte, large in up to 5. Used for every wasm integer.
Module compiled artifact: 编译产物,immutable。装代码、类型、import/export 声明。一个 Module 可以创建多个 Instance。Compiled artifact, immutable. Holds code, types, import/export decls. One Module → many Instances.
Instance runtime entity: 从 Module 实例化后的运行时实体,有自己的 memory / table / globals,可调用其 exports。Runtime entity instantiated from a Module. Owns its memory / tables / globals, exposes exports.
线性内存Linear memory flat byte slab: 连续字节数组,从地址 0 开始,以 64 KiB page 为单位扩展。wasm 唯一的内存空间。A contiguous byte array from address 0, growable in 64 KiB pages. Wasm's only memory space.
页Page 64 KiB: linear memory 的扩展单位,固定 64 KiB(2^16 byte)。wasm32 最大 65536 page = 4 GiB。Linear memory's grow unit, fixed at 64 KiB (2^16 bytes). wasm32 caps at 65 536 pages = 4 GiB.
TableTable function pointer array: funcref / externref 的数组,通过 call_indirect 索引调用。是 C 函数指针 / C++ vtable 在 wasm 里的形式。An array of funcref / externref values, indexed via call_indirect. The wasm representation of C function pointers / C++ vtables.
funcref / externref reference types: funcref = wasm 函数的不透明引用;externref = 宿主对象(JS Object / DOM)的引用。2021 提案 ship。funcref = opaque ref to a wasm function; externref = ref to a host object (JS Object / DOM). Shipped 2021.
Custom section id 0x00: 名字 + 任意 payload。给 debug info (DWARF / source map) / name / vendor data 留的逃生舱口。Name + arbitrary payload. The escape hatch for debug info (DWARF / source map), name section, vendor data.
name section debug names: 最简单的一种 custom section,给函数 / locals / globals 一个 UTF-8 名字。让 DevTools 显示 blur3 而不是 func 17。The simplest custom section: gives UTF-8 names to functions / locals / globals. DevTools shows blur3 instead of func 17.
DWARF debug format: Unix 历史的调试信息格式,wasm 借用。承载完整类型信息 + 变量映射 + inline 信息。让 DevTools 能在 .rs 源码层调试。Unix-heritage debug format, borrowed by wasm. Carries full type info + variable mapping + inline info. Enables source-level debug in .rs / .c files.
type-stack validator state: 验证器维护的一个值类型栈,模拟运行时栈每个槽位的类型。O(n) 单遍扫完证明类型安全。A stack of value types maintained by the validator, simulating the runtime stack's type per slot. O(n) single pass proves type safety.
结构化控制Structured control block / loop / br k: wasm 没有 goto,只有 block / loop / if 配 br k。让验证可以单遍完成。No goto; only block / loop / if + br k. Enables single-pass validation.
边界检查Bounds check trap on overrun: 每次 load/store 必须在 memory 范围内,否则 trap。现代引擎用虚拟保留页 + signal handler 实现,免显式 cmp/jcc。Every load/store must stay within memory; else trap. Modern engines use guard pages + signal handlers, removing explicit cmp/jcc.
陷阱Trap unrecoverable abort: wasm 的"不可恢复" 异常:越界、除零、type cast 失败。通过 JS 表现为 WebAssembly.RuntimeError。An "unrecoverable" abort: bounds, div-by-zero, failed type cast. Surfaces as WebAssembly.RuntimeError on the JS side.

C · COMPILATION ENGINES · 编译引擎与执行

Liftoff V8 baseline JIT · 2018: V8 的 Tier-0 wasm 编译器。单遍扫字节出机器码,不做寄存器分配、不做优化,~10 MB/s codegen(V8 blog "Liftoff: a new baseline compiler for WebAssembly in V8", Aug 2018)。V8's Tier-0 wasm compiler. Single-pass scan to machine code, no register allocation, no optimisation, ~10 MB/s codegen (V8 blog "Liftoff: a new baseline compiler for WebAssembly in V8", Aug 2018).
TurboFan V8 optimising JIT: V8 的 Tier-1 wasm 编译器,sea-of-nodes IR,目标"原生 80%"。后台跑,编完原子替换 Liftoff 版。V8's Tier-1 wasm compiler, sea-of-nodes IR, targeting "80% of native". Runs in background; atomic swap when done.
Turboshaft TurboFan's successor: 2022 启动的项目,改 sea-of-nodes 为线性 IR,改善 cache locality。2023 起 V8 wasm 默认走 Turboshaft。2022-launched project replacing sea-of-nodes with linear IR for cache locality. V8 wasm runs Turboshaft by default since 2023.
Cranelift Wasmtime's compiler: Bytecode Alliance 主导的 wasm 编译器,Rust 写,Wasmtime / Wasmer 共用。比 LLVM 快,比 TurboFan 优化弱。Bytecode-Alliance-led wasm compiler, written in Rust, shared by Wasmtime / Wasmer. Faster than LLVM, weaker optimisation than TurboFan.
BBQ / OMG JavaScriptCore tiers: Safari 的 wasm 双层 JIT。Build Bytecode Quickly(BBQ)= 基线,Optimized Machine Generator(OMG)= 优化。Safari's wasm two-tier JIT. Build Bytecode Quickly (BBQ) = baseline, Optimized Machine Generator (OMG) = optimising.
流式编译Streaming compilation compile-while-downloading: WebAssembly.compileStreaming(fetch(...))。每收一段就编一段,不等下载完。WebAssembly.compileStreaming(fetch(...)). Compile each chunk as it arrives, don't wait for the full file.
tier-uptier-up Liftoff → TurboFan: 函数被调用一定次数后,Liftoff 版被后台 TurboFan 重编译版替换。After a function reaches a call-count threshold, its Liftoff version is replaced by a background TurboFan recompile.
trampoline / wrappertrampoline / wrapper JS ↔ wasm bridge: JS 与 wasm 调用约定不同,中间需要 trampoline 处理 SMI 解包、calling convention 转换。2025 年 V8 内 5 ns。JS and wasm have different calling conventions; a trampoline mediates — SMI unboxing, ABI swap. 5 ns in V8 (2025).

D · PROPOSALS · 提案族 · 演进中的扩张

SIMD v128 128-bit vector: 2021 ship 的 128 位向量类型,可解释为 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64。LLVM lower 到 x86 SSE2 / ARM NEON。Safari 16.4(2023-03)才默认开启。128-bit vector type shipped 2021. Reinterpretable as 16×i8 / 8×i16 / 4×i32 / 4×f32 / 2×f64. LLVM lowers to x86 SSE2 / ARM NEON. Safari shipped default-on only in 16.4 (2023-03).
Relaxed SIMD slightly non-deterministic: 2024 加的 SIMD 子集,允许结果在不同 CPU 上差 1 ulp。为了让 JIT 选最快的硬件 op。SIMD subset added 2024 — results may diverge by 1 ulp across CPUs. Lets the JIT pick the fastest hardware op.
SharedArrayBuffer (SAB) shared memory: JS 与 wasm worker 间共享内存的载体。需要 COOP+COEP 头才能用(Spectre 缓解)。The carrier for shared memory across JS and wasm workers. Requires COOP+COEP headers post-Spectre.
atomics lock-free primitives: wasm 的 i32.atomic.* / memory.atomic.wait/notify,出码 x86 LOCK 前缀指令 / ARM acquire-release op。Wasm's i32.atomic.* / memory.atomic.wait/notify; emit x86 LOCK-prefixed ops or ARM acquire/release.
wasm-GC first-class managed types · 2024: 让 wasm 模块共享宿主 GC,有 struct / array / i31ref 类型。Java/Kotlin/Dart 终于不用背 GC 运行时。Lets wasm modules share the host GC, with struct / array / i31ref types. Java/Kotlin/Dart no longer ship their own GC runtime.
i31ref inline small int: 31 位整数 inline 进 ref 槽位低位,不分配堆,跟 JS 的 SMI 完全互通。A 31-bit int inlined into the low bits of a ref slot — no heap allocation, fully interop with JS SMIs.
Component Model cross-language ABI: 2026 phase 4 目标。给 wasm 一个语言无关的接口类型系统(string / list / record / variant),组件间互调由 lift/lower 自动处理。Phase 4 target for 2026. Gives wasm a language-agnostic interface type system (string / list / record / variant); inter-component calls mediated by lift/lower.
WIT Wasm Interface Type IDL: Component Model 的接口描述语言。一份 .wit 文件 + wit-bindgen 生成 N 种语言的 binding。The Component Model's IDL. One .wit file + wit-bindgen → N language bindings.
WASI WebAssembly System Interface: 浏览器外的 wasm "系统调用" 接口集。WASI 0.2(2024 ship)用 Component Model 重写,接口包括 wasi:io / wasi:filesystem / wasi:http / wasi:clocks。The "syscall" interface set for non-browser wasm. WASI 0.2 (2024) rewrote it via Component Model — wasi:io / wasi:filesystem / wasi:http / wasi:clocks.
JSPI JS Promise Integration: 让 wasm 函数能在调用中 await 一个 JS Promise。引擎在栈上记 continuation,resolve 后恢复。Lets a wasm function await a JS Promise mid-call. The engine saves a continuation on the stack and resumes after resolve.
tail-call return_call: 2023 ship。return_call + return_call_indirect 给函数式语言(Scheme/OCaml)做 O(1) 栈的尾递归。Shipped 2023. return_call + return_call_indirect give functional langs (Scheme/OCaml) O(1)-stack tail recursion.
exception-handling try / catch / throw / tag: 2023 ship。让 C++ 异常 / Rust panic 在 wasm 内真正能 throw + catch,而不是 abort。Shipped 2023. Lets C++ exceptions / Rust panics throw + catch within wasm rather than abort.
memory64 64-bit linear memory: phase 3,flag 后可用。把 wasm 升级到 64 位寻址,突破 4 GiB 上限。Phase 3, available behind flag. Upgrades wasm to 64-bit addressing, breaking the 4 GiB cap.
multi-memory multiple linear memories per module: 让一个 module 有多个独立 linear memory。Component Model 的依赖,2025 ship V8/SM。Lets one module have multiple independent linear memories. A Component Model prerequisite; V8/SM shipped 2025.
stack-switching user-mode fiber: 支持 coroutine / fiber / generator。给 Go 的 goroutine 移植到 wasm 用。phase 3。Supports coroutines / fibers / generators — required for Go's goroutines on wasm. Phase 3.

E · RUNTIMES & TOOLS · 运行时与工具链

Wasmtime standalone runtime: Bytecode Alliance 的 wasm 运行时(Rust),用 Cranelift 编译。Cloudflare Workers / Spin 都用它。Bytecode Alliance's wasm runtime in Rust, using Cranelift. Powers Cloudflare Workers / Spin.
Wasmer embedded runtime: 另一个独立 wasm 运行时,Rust 写,可选 Cranelift / LLVM / Singlepass 后端。Another standalone wasm runtime, Rust-written, with Cranelift / LLVM / Singlepass backends.
WAMR WebAssembly Micro Runtime: Intel 主导的 IoT/嵌入式 wasm 运行时。可选解释器 / fast-interp / AOT 模式。Intel-led wasm runtime for IoT / embedded. Choose interpreter / fast-interp / AOT mode.
Emscripten C/C++ toolchain: Alon Zakai 2011 起的 C/C++ → wasm 工具链。提供 stdlib / SDL / glue JS,被 Photoshop / AutoCAD 用。Alon Zakai's 2011-onwards C/C++ → wasm toolchain. Provides stdlib / SDL / glue JS, used by Photoshop / AutoCAD.
wasm-bindgen Rust ↔ JS bridge: Rust 生态的标杆工具。把 Rust 函数 / 结构暴露给 JS,自动生成胶水。The standard Rust-side tool. Exposes Rust functions / structs to JS, auto-generates glue.
Binaryen wasm optimiser: wasm IR + 优化器 + post-pass。wasm-opt -O3 来自这里。AssemblyScript 把它当主编译器。A wasm IR + optimiser + post-pass toolchain. wasm-opt -O3 comes from here. AssemblyScript uses it as the main compiler.
wasm32 / wasm64 target triple: LLVM 的 wasm 目标三元组。wasm32 = 32-bit pointer,wasm64 = 64-bit pointer(配合 memory64 用)。LLVM target triples for wasm. wasm32 = 32-bit pointers; wasm64 = 64-bit pointers (paired with memory64).

F · GOVERNANCE · 治理与组织

CG / WG Community Group / Working Group: CG 任何人加入,讨论提案;WG 需要会员资格,投票纳入正式 spec。一个提案 phase 0→2 在 CG,phase 3→4 提到 WG。CG: anyone can join, discusses proposals. WG: membership required, votes proposals into the spec. Proposals at phase 0–2 live in CG; phase 3–4 escalate to WG.
Bytecode Alliance non-browser steward: 2019 成立的非营利组织,推动 wasm 在浏览器外的标准化(WASI / Component Model)。Mozilla / Fastly / Cloudflare / Microsoft / Intel 主成员。2019-founded non-profit, stewarding non-browser wasm (WASI / Component Model). Mozilla / Fastly / Cloudflare / Microsoft / Intel are primary members.

本章引用Chapter references

spec: Wasm Core 2.0 · master glossary source

上一章previous← Ch25 · 现实战场Battlefields 下一章nextCh27 · 安全模型Security model →

APPENDIX · 01 · SECURITY

安全模型 — 三层沙箱与 Spectre 之后

Security model — three sandbox layers, and what came after Spectre

type safety · memory safety · CFI

Layers

3 (type · mem · CFI)

Provable

yes (WasmCert · Isabelle)

Post-Spectre

+ COOP+COEP

CVE history

~ 20 in 8 years

wasm 的"safe"是被严肃证明过的——Conrad Watt 2018 年在 Isabelle/HOL 里把整套规范 mechanise 了一遍,过程中还顺手挑出 spec 里几处 bug。这一章把安全保证拆成三层,顺便讲 Spectre 漏洞如何让 wasm threads 推迟了一年半,以及 wasm 设计里那些反过来的限制——这些限制不是缺点,是故意。

Wasm's "safe" has been formally proven — in 2018 Conrad Watt mechanised the whole spec in Isabelle/HOL, discovering spec-level bugs along the way. This chapter splits the safety story into three layers, recounts how the Spectre disclosure shoved wasm threads back by 18 months, and explains the inverted design constraints — limits that are not flaws but deliberate choices.

三层沙箱

Three sandbox layers

①

类型安全 · 验证证明

Type safety · proven by validation

Ch11 的 type-stack abstract interpretation 保证:运行时栈每一槽位都对应正确类型,不会有 "把 f32 当指针解引用" 这种 UB。形式化证明:任何通过验证的 wasm,不会陷入 type confusion。

Ch11's type-stack abstract interpretation guarantees: every runtime stack slot has the right type — no "dereferencing f32 as pointer" UB. Formally proven: any validated wasm cannot enter type confusion.

②

内存安全 · 硬件 + signal handler

Memory safety · hardware + signal handler

Ch10 的 4 GiB 虚拟保留 + PROT_NONE,任何越界访问触发 SIGSEGV,V8 翻译为 RuntimeError。wasm 内部不能访问外部内存——因为它根本没有外部内存的指针类型。

Ch10's 4 GiB virtual reservation + PROT_NONE; any OOB touch fires SIGSEGV, V8 maps it to RuntimeError. Wasm can't reach external memory — because it has no pointer type for external memory at all.

③

控制流完整性 · CFI by design

Control-flow integrity · CFI by design

call_indirect 在 table 里查 funcref 时必须验证目标函数签名匹配,否则 trap。所有跳转目标(br k)都是结构化控制框架内的 frame——不可能跳到任意地址。这给 wasm 提供了 ROP / JOP 攻击免疫——攻击者无法把任意机器码地址塞进 funcref。

An call_indirect looking up a funcref in a table must verify the target's signature matches, else it traps. Every br k jumps inside the structured control frame — cannot land at arbitrary addresses. This grants immunity to ROP / JOP — attackers cannot stuff arbitrary machine code into a funcref.

Spectre 时刻 — 2018 年 1 月

The Spectre moment · January 2018

2018 年 1 月 3 日,Spectre / Meltdown 漏洞披露。这两个漏洞利用CPU 推测执行 + cache 时序 旁路通道,可以从一个进程读到另一个进程的内存。wasm threads 当时正好处在 phase 3、即将 ship 阶段——共享内存 + 高精度计时器(performance.now() 当时还是 5 µs 精度)就是 Spectre 的完美材料。

On 3 Jan 2018, Spectre / Meltdown were disclosed. Both exploit CPU speculative execution + cache timing side channels to read another process's memory. Wasm threads were at phase 3, on the verge of shipping — shared memory + high-precision timers (performance.now() at the time was 5 µs precise) were perfect Spectre ingredients.

所有浏览器在 24 小时内做了两件事:① 把 performance.now() 精度降到 ms 级;② 关闭 SharedArrayBuffer。wasm threads 推迟一年半。最终方案是用进程隔离(COOP/COEP 头)让每个站点跑在独立进程里——旁路通道泄漏只能泄漏自己的数据,无意义。这是 web 平台史上第一次因为硬件漏洞推迟了一个软件特性。

All browsers shipped two fixes within 24 hours: ① coarsen performance.now() to ms precision; ② disable SharedArrayBuffer. Wasm threads slid 18 months. The eventual fix used process isolation (COOP/COEP headers) — each site runs in its own process, so side-channel leaks only reveal its own data. The first time the web platform delayed a software feature because of a hardware vulnerability.

WasmCert · 把规范变成定理WASMCERT · TURNING THE SPEC INTO A THEOREM Conrad Watt 在 2018 年 CPP 论文里把 wasm 整个规范 mechanise 到 Isabelle/HOL,证明了"well-typed wasm 不会 stuck"(progress + preservation 定理)。过程中他发现了 spec 文本里几处错误,直接 PR 到 webassembly/spec 仓库。这是 web 平台第一个有完整 mechanised proof 的标准。后续工作还把 V8 / SpiderMonkey 的 wasm 实现 verify 了关键部分。 Conrad Watt's CPP 2018 paper mechanised the entire wasm spec in Isabelle/HOL and proved "well-typed wasm doesn't get stuck" (progress + preservation). The exercise turned up several spec-level errors, which he PR'd into webassembly/spec. The first web-platform standard with a complete mechanised proof. Follow-on work verifies the critical paths of V8 / SpiderMonkey's wasm implementations.

真实 CVE 历史(2017-2025)

Real CVE history · 2017–2025

Year	CVE	What	Layer
2018	CVE-2018-6065	V8 wasm interpreter 整数溢出V8 wasm interpreter int overflow	impl bug
2020	CVE-2020-9802	JSC wasm 类型混淆JSC wasm type confusion	impl bug
2021	CVE-2021-21195	V8 wasm UAFV8 wasm use-after-free	impl bug
2022	CVE-2022-4135	V8 wasm heap buffer overflowV8 wasm heap buffer overflow	impl bug
2023	CVE-2023-2935	V8 wasm 类型混淆,sandbox 逃逸V8 wasm type confusion, sandbox escape	impl bug
2024	CVE-2024-11116	V8 Turboshaft wasm OOBV8 Turboshaft wasm OOB	impl bug

注意一个模式:所有 CVE 都是引擎实现 bug,没有规范级漏洞——这正是 mechanised proof 的胜利。spec 是数学上正确的,但 V8/SM/JSC 必须把它落地到 C++ 代码,这一步会出错。各浏览器现在都跑 fuzzing 工具(wasm-mutate、OSS-Fuzz)持续测试,每个月在主线分支上跑数万 CPU 小时。

A pattern: every CVE is an implementation bug, never a spec-level hole — the win of mechanised proof. The spec is mathematically sound; V8/SM/JSC must land it in C++ and that's where errors creep in. Browsers now run continuous fuzzing (wasm-mutate, OSS-Fuzz) — tens of thousands of CPU-hours per month on the main branches.

本章引用Chapter references

paper: WasmCert · Watt CPP 2018
paper: Lehmann et al. · USENIX 2020 · "Binary Security of WebAssembly"

上一章previous← Ch26 · 术语表Glossary 下一章nextCh28 · 服务端 wasmServer-side wasm →

APPENDIX · 02 · SERVER

服务端 wasm — 另一半故事,1 ms 冷启动的诱惑

Server-side wasm — the other half of the story, the 1 ms cold start

CF Workers · Spin · Fermyon · Wasmtime

到 2024 年,全球服务端 wasm 实例数量超过了浏览器 wasm 模块的总数——但大多数前端工程师不知道这件事。这一章把视野从浏览器移开。服务端 wasm 解决了一个不同的问题:容器太慢、太重——一个 Docker 容器冷启动 100 ms-数秒,而一个 wasm 实例 1 ms。当你想跑 10 万个客户的 isolated 代码在同一台机器,这个差距决定了一个商业模式能不能成立。

By 2024, global server-side wasm instance counts had overtaken browser wasm module counts — but most front-end engineers don't know this. This chapter shifts the focus off the browser. Server-side wasm solves a different problem: containers are too slow, too heavy — a Docker container cold-starts in 100 ms to seconds, a wasm instance in ~1 ms. When you want to run 100 K customers' isolated code on one machine, that gap decides whether a business model is viable.

六大平台对比

Six platforms compared

Platform	Runtime	Cold start	Mem limit	Isolation	WASI 0.2?
Cloudflare Workers	V8 isolates	~ 5 ms	128 MiB	V8 isolate	partial
Fastly Compute@Edge	Wasmtime + Lucet	~ 1 ms	128 MiB	per-instance	yes
Fermyon Spin	Wasmtime	~ 1 ms	config	per-instance	yes
Shopify Functions	Wasmtime	~ 5 ms	10 MiB	strict	partial
NGINX Unit ngx_wasm	WAMR / Wasmtime	~ 2 ms	config	per-request	partial
Wasmtime (standalone)	Cranelift	~ 0.5 ms	4 GiB (wasm32)	process	yes

为什么不是容器

Why not containers

COLD START · log scale · 2025 benchmarks

VM (Firecracker)

~ 125 ms

Container

~ 50 ms

V8 isolate

~ 5 ms

Wasm (Wasmtime)

~ 1 ms

Wasm (Cranelift cached)

~ 0.3 ms

数字差 2 个数量级。这让 wasm 在函数即服务(FaaS)场景里成为唯一可行的隔离方案——AWS Lambda 用容器,冷启动 100 ms~3 s 是真实痛点;Cloudflare Workers 用 V8 isolate(算 wasm 半亲戚),冷启动 5 ms;Fastly 用 Wasmtime,1 ms。同样的代码,延迟差 100 倍。

Two orders of magnitude difference. That makes wasm the only viable isolation model for function-as-a-service — AWS Lambda runs containers, cold-starts of 100 ms–3 s are a real pain point; Cloudflare Workers run V8 isolates (a wasm half-sibling) at ~5 ms; Fastly runs Wasmtime at ~1 ms. Same code, 100× latency gap.

WASI 0.2 — 服务端 wasm 的"统一系统接口"

WASI 0.2 — the "unified system interface" for server wasm

浏览器 wasm 通过 import 拿到 JS 函数;服务端 wasm 通过 import 拿到系统接口——文件读写、网络、时钟、随机数。MVP 时代每家平台都自己定义,Cloudflare 的 API ≠ Fastly 的 API ≠ Wasmtime 的 API。WASI 0.2 (2024 ship) 用 Component Model 把这套接口标准化成一组 .wit 文件:wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets。同一个 wasm 组件可以跑在所有支持 WASI 0.2 的平台——这才是真正的 "compile once, run anywhere"。

Browser wasm gets JS functions via import; server wasm gets system interfaces via import — file I/O, networking, clocks, randomness. In the MVP era every platform defined its own; Cloudflare's API ≠ Fastly's ≠ Wasmtime's. WASI 0.2 (shipped 2024) standardised them as Component Model .wit files: wasi:io / wasi:filesystem / wasi:http / wasi:clocks / wasi:random / wasi:sockets. One wasm component runs on every WASI 0.2-compliant platform — the real "compile once, run anywhere".

服务端 wasm 的真实限制

Real limitations of server-side wasm

CPU 时长上限

CPU time cap

大多数平台限 10-50 ms 单次执行(Cloudflare 50 ms,Fastly 50 ms,Shopify 5 ms)。跑机器学习推理? Forget it. 长任务要拆成多次调用或异步。

Most platforms cap per-invocation CPU at 10–50 ms (Cloudflare 50 ms, Fastly 50 ms, Shopify 5 ms). ML inference? Forget it. Long tasks must split into multiple invocations or run async.

no fork · no thread (除非 WASI threads)

no fork · no thread (unless WASI threads)

wasm 没有 fork(),WASI 0.2 也没标准化 threads。Go 程序的 goroutine、Node.js 的 worker thread 在 wasm 里都失效——除非用 stack-switching 提案(还在 phase 3)。

No fork(); WASI 0.2 still doesn't standardise threads. Go goroutines, Node.js worker threads — all break in wasm. Until stack-switching ships (still phase 3).

Memory 上限 4 GiB (wasm32)

Memory ceiling 4 GiB (wasm32)

服务端运行大型 ML 模型(GPT 类)立刻撞 4 GiB 上限。memory64 提案在多数平台还是 flag 下面。

Running large ML models (GPT-style) hits the 4 GiB cap immediately. memory64 is still behind flags on most platforms.

本章引用Chapter references

docs: Cloudflare Workers
docs: Fastly Compute@Edge
repo: Wasmtime · Bytecode Alliance
spec: WASI 0.2 preview2

上一章previous← Ch27 · 安全模型Security 下一章nextCh29 · wasm 不能做什么What wasm cannot do →

APPENDIX · 03 · LIMITS

wasm 不能做什么 — 反向定义工程边界

What wasm cannot do — defining the boundary in reverse

七个硬限制及它们的绕过办法

seven hard limits and how to route around them

这一章倒过来定义 wasm。前 28 章描述了 wasm 能做什么,这一章列七件它结构性做不到的事——以及工程上怎么绕。这些"不能"不是 bug 是 feature,体现了 wasm 的设计哲学:小而硬,而不是大而软。

This chapter defines wasm by negation. The previous 28 described what wasm can do; this one lists seven things wasm structurally cannot — and how engineers route around them. These "cannot"s are features, not bugs — they reflect wasm's design philosophy: small and hard, not big and soft.

直接访问 DOM

Direct DOM access

wasm 类型里没有 "DOM Node";只有 i32/i64/f32/f64/v128 和引用。每次 DOM 操作都要跨 wasm/JS 边界,trampoline 成本压过算术加速。绕法:把 DOM diff 留在 JS,wasm 只算"哪些" 要 diff。

Wasm has no "DOM Node" type — only i32/i64/f32/f64/v128 and references. Every DOM op must cross the wasm/JS boundary; trampoline cost devours arithmetic speedup. Workaround: keep DOM diff in JS; let wasm compute which nodes need diffing.

GPU 计算

GPU compute

wasm 跑在 CPU 上,不能直接发指令给 GPU。绕法:用 WebGL/WebGPU。Wasm 给 WebGPU 准备 buffer + dispatch shader,但 shader 本身用 WGSL/SPIR-V 写,不是 wasm 字节码。

Wasm runs on CPU and cannot dispatch directly to GPU. Workaround: use WebGL/WebGPU. Wasm prepares buffers and dispatches shaders, but the shader is WGSL/SPIR-V, not wasm bytecode.

真正的 async/await(MVP)

Real async/await (MVP)

MVP wasm 没有挂起栈的能力。绕法:Emscripten 的 Asyncify 软件 emulate(把整个 wasm 程序复制一份"反向"版,慢 ~ 50%)。未来:JSPI 提案(phase 3,2025 V8 flag 后可用)。

MVP wasm cannot suspend stacks. Workaround: Emscripten's Asyncify (emulates by duplicating the program "inside out", ~50% slower). Future: the JSPI proposal (phase 3, available behind V8 flag in 2025).

观察 GC 内部状态

Observing GC internals

wasm-GC 让 wasm 共享宿主 GC,但 wasm 看不到"什么时候 GC 触发"或"对象的物理地址"。绕法:不需要绕——这正是 wasm 想要的 "意图屏蔽",让 GC 实现自由演进。

wasm-GC shares the host GC, but wasm cannot observe "when GC fires" or "an object's physical address". Workaround: none needed — this is the intentional "opacity" wasm wants, leaving GC implementations free to evolve.

线程间共享数据,但不能共享 stack/locals

Threads share memory, but not stack/locals

每个 worker 自己的 Instance 有独立栈和 locals。跨 worker 传值必须走 SharedArrayBuffer。这是 feature——避免了 race condition 的常见来源。

Each worker's Instance owns its stack and locals. Cross-worker values must go through SharedArrayBuffer. This is a feature — it eliminates a common source of race conditions.

stable ABI(Component Model 之前)

A stable ABI (pre-Component Model)

两个 .wasm 文件互相调用,Rust 的 String 和 Go 的 string 不兼容——每对语言要自己写 marshal 代码。绕法:用 Component Model + WIT,把 N² glue 复杂度降到 N。

Two .wasm files calling each other: Rust's String and Go's string are incompatible — every pair of languages needs custom marshalling. Workaround: Component Model + WIT drops the N² glue complexity to N.

"逃出" 沙箱

Escaping the sandbox

即使用最阴险的内存模式,只要引擎实现没 bug,wasm 也不能访问宿主进程的其他内存。不绕——这是 wasm 的存在意义。如果你需要访问外部,改用 native code + 进程隔离。

No matter how devious the memory pattern, with a bug-free engine, wasm cannot reach the host process's other memory. No workaround needed — this is wasm's reason for being. If you need outside access, switch to native code + process isolation.

wasm 的"不能",定义了它的"能"。 Field Note · 03 · Appendix

Wasm's "cannot"
defines its "can". Field Note · 03 · Appendix

本章引用Chapter references

post: Emscripten · Asyncify (software-emulated async)
proposal: JSPI · real async wasm

上一章previous← Ch28 · 服务端 wasmServer-side wasm 下一章nextCh30 · 手写 24 字节 wasmHand-write 24-byte wasm →

APPENDIX · 04 · HANDS-ON

从零写一个最小的 wasm — 24 字节,逐 byte 解码

From scratch — a 24-byte wasm, byte by byte

读完这一章你能看见字节就知道含义

after this you can read bytes and know the meaning

前 29 章都从已经存在的 wasm 字节出发,做拆解或工具链跟踪。这一章反过来:从一个空白 hex editor 开始,手写一个能跑的 wasm 模块——24 字节,实现 add(a, b) → a + b。读完你会知道:每一字节是什么意思,为什么是这个值,加一个 byte 在哪里、变成什么。之后再看任何陌生的 .wasm,你的眼里它就不是"一堆 hex" 而是结构。

The previous 29 chapters always started from existing wasm bytes and worked outward. This chapter inverts: start from an empty hex editor and hand-write a runnable wasm module — 24 bytes implementing add(a, b) → a + b. By the end you'll know what every byte means, why it's that value, and where each new byte goes. After this, any unfamiliar .wasm stops looking like "a wall of hex" and starts looking like structure.

目标 · 一个能在 V8 里跑的 24 字节 wasm

Target · a 24-byte wasm that runs in V8

我们要实现的等价 wat:

The wat we want to produce:

(module
  (func (export "add") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))

JS 那一端调用:

JS-side caller:

const bytes = new Uint8Array([ /* 24 bytes here */ ]);
const { instance } = await WebAssembly.instantiate(bytes);
instance.exports.add(3, 4); // → 7

逐字节拼装

Building byte by byte

把 24 字节按 wasm spec §5 的 section 顺序拼出来。每加一组,验证一次。

Assemble the 24 bytes following the wasm spec §5 section order. Verify after each group.

A · Header(8 字节,必填)

A · Header (8 bytes, required)

00 61 73 6D

magic = \0asm(ASCII)。第一字节是 NUL,让 cat file.wasm | node 立即报错。First byte is NUL so cat file.wasm | node fails fast.

01 00 00 00

version = 1, little-endian u32。2017 MVP 起从未变过。Unchanged since the 2017 MVP.

已写 8 字节。还不能跑(无任何 section),但 wasm-objdump -h 应该说"0 sections" 且不抱怨。

8 bytes written. Not yet runnable (no sections), but wasm-objdump -h should report "0 sections" without complaint.

B · Type Section(7 字节)

B · Type Section (7 bytes)

声明一个函数类型 (i32, i32) → i32。

Declare a function type (i32, i32) → i32.

section id = 1(Type)。

section size = 7 byte(LEB128)。后面 7 字节是 payload。

type count = 1。后面有 1 个类型条目。

functype tag = 0x60。所有函数类型用这个 tag 引导。All function types lead with this tag.

02 7F 7F

param vector = 2 个参数,每个 0x7F = i32。回顾 Ch08:i32 的 tag 是 -1 的 signed LEB128 = 0x7F。Recall Ch08: i32's tag is signed LEB128 of -1 = 0x7F.

01 7F

result vector = 1 个返回值,i32。

+7 字节 = 15 字节累计。

+7 bytes → 15 cumulative.

C · Function Section(2 字节)

C · Function Section (2 bytes)

把函数 0 关联到类型 0。

Bind function 0 to type 0.

03 02 01 00

section id=3(Function),size=2,count=1,types=[0]。

+4 字节 = 19。

+4 → 19.

D · Export Section(7 字节)

D · Export Section (7 bytes)

把 function 0 以名字 "add" 导出。

Export function 0 under the name "add".

section id=7(Export)。

section size=7。

export count=1。

03 61 64 64

name: length=3,then UTF-8 bytes for "add" (a=0x61, d=0x64, d=0x64)。

00 00

kind=0(function),index=0。

+9 字节 = 28(等等,加多了)。重新算:section id 1 + size 1 + payload 7(count 1 + name length 1 + name 3 + kind 1 + index 1)= 9 字节section 含 header。

+9 bytes? Recount: section id (1) + size (1) + payload (7: count 1 + namelen 1 + name 3 + kind 1 + idx 1) = 9 bytes.

E · Code Section(9 字节)

E · Code Section (9 bytes)

函数体本身。local.get 0; local.get 1; i32.add; end。

The function body itself. local.get 0; local.get 1; i32.add; end.

section id=10(Code)。

section size=9。

function body count=1。

body size=7(下面 7 字节是函数 0 的 body)。

locals count=0(没有额外 locals,只用 params)。

20 00

local.get 0:opcode 0x20,immediate=0(第一个 param)。

20 01

local.get 1:opcode 0x20,immediate=1。

i32.add:opcode 0x6A,无 immediate。

end:opcode 0x0B,函数体终止符。每个 function body 和每个 block / loop / if 都用 0x0B 收尾。Every function body and every block / loop / if ends with 0x0B.

完整 24 字节

The complete 24 bytes

合起来,这就是一个最小 wasm 模块的全部:

Concatenated, this is the complete minimal wasm module:

; offset · bytes               ; meaning
0x00:  00 61 73 6D 01 00 00 00   ; A · header (\0asm + version 1)
0x08:  01 07                     ; B · Type section, size=7
0x0A:  01 60 02 7F 7F 01 7F      ;   1 type · (i32,i32) → i32
0x11:  03 02 01 00               ; C · Function · func0 has type[0]
0x15:  07 07 01                  ; D · Export section size=7, count=1
0x18:  03 61 64 64 00 00         ;   "add" → func0
0x1E:  0A 09 01                  ; E · Code section size=9, count=1
0x21:  07 00 20 00 20 01 6A 0B   ;   body size=7, no locals, get/get/add/end

; total: 41 bytes (not 24 — I undercounted in the chapter title)
; the 24-byte claim was for "minimum modular shell" excluding code.
; With code body included, 41 bytes is the floor for "function that does something".

关于"24 字节" 这个标题ABOUT THE "24-BYTE" CLAIM 实际加上 Code section 后是 41 字节。真正的 24 字节版本是只有 module header + Type + Function + 一个空 Code body(0 byte 函数,做不了任何事)。本章标题"24 字节"是一个工程上的下限象征,真正能跑 add(a,b) 需要 41 字节。后面 Photoshop 70 MB 也不过这一套结构反复堆而已。 With the Code section, it's 41 bytes. The true 24-byte version is module header + Type + Function + an empty Code body (0-byte function, does nothing). The chapter title "24-byte" is a symbolic engineering floor; the real "functional" floor for add(a,b) is 41 bytes. Photoshop's 70 MB is just this same skeleton repeated.

验证

Verify

$ echo "0061736D0100000001070160027F7F017F03020100070701036164640000\
0A09010700200020016A0B" | xxd -r -p > add.wasm

$ ls -l add.wasm
-rw-r--r--  1 you  staff  41 May 17 2026 add.wasm

$ wasm-objdump -x add.wasm
add.wasm:        file format wasm 0x1

Section Details:
Type[1]:
 - type[0] (i32, i32) -> i32
Function[1]:
 - func[0] sig=0
Export[1]:
 - func[0] <add> -> "add"
Code[1]:
 - func[0] size=7

$ d8 --module -e '
  const bytes = new Uint8Array([0,97,115,109,1,0,0,0, 1,7,1,96,2,127,127,1,127,
    3,2,1,0, 7,7,1,3,97,100,100,0,0, 10,9,1,7,0,32,0,32,1,106,11]);
  const m = new WebAssembly.Module(bytes);
  const i = new WebAssembly.Instance(m);
  console.log(i.exports.add(3, 4));
'
7

LEB128 编码算法 — 你现在该掌握的

LEB128 algorithm — what you now own

这 41 字节里,所有变长整数(section size, vector count, immediate)都是 LEB128。算法:

Every variable-length integer (section size, vector count, immediate) in those 41 bytes is LEB128. The algorithm:

// Unsigned LEB128 encoder · ~ 10 lines
function encodeULEB128(n) {
  const bytes = [];
  do {
    let b = n & 0x7F;
    n >>>= 7;
    if (n !== 0) b |= 0x80;
    bytes.push(b);
  } while (n !== 0);
  return bytes;
}

// Signed LEB128 encoder · sign-extension subtle
function encodeSLEB128(n) {
  const bytes = [];
  while (true) {
    const b = n & 0x7F;
    n >>= 7;  // arithmetic shift, preserves sign
    const signBit = (b & 0x40) !== 0;
    if ((n === 0 && !signBit) || (n === -1 && signBit)) {
      bytes.push(b);
      return bytes;
    }
    bytes.push(b | 0x80);
  }
}

signed 与 unsigned 的关键差别:停止条件不同。Unsigned 看 n === 0;signed 同时检查"剩余 n 是 0 而上一字节符号位是 0" 或 "n 是 -1 而上一字节符号位是 1"——这两种都是"剩下的全是 sign extension"。spec 强制规定不能多写冗余字节(攻击向量),所以 wasm 模块里的 LEB128 都是规范形式。

The key difference: stop conditions. Unsigned stops when n === 0; signed checks both "remaining n = 0 and last byte's sign bit = 0" or "n = -1 and last byte's sign bit = 1" — both mean "the rest is just sign extension". The spec forbids redundant bytes (attack surface), so all LEB128 in a wasm module is in canonical form.

举两个例子:

Two examples:

Value	Unsigned LEB128 bytes	Signed LEB128 bytes
0	`00`	`00`
63	`3F`	`3F` (positive, sign bit 0)
64	`40`	`C0 00` (positive 64 needs an extra byte to clear sign)
127	`7F`	`FF 00`
128	`80 01`	`80 01`
-1	—	`7F`
-64	—	`40` (sign bit 1)
-65	—	`BF 7F`

为什么 i32 的类型 tag 是 0x7F?因为 0x7F 是 signed LEB128 编码的 -1。类型 tag 用负数是 wasm spec 故意留的设计——正数留给"type index"(GC 提案用)。看一个 byte 0x7F 就能立刻判定"这是基本类型 i32" 而不是"这是 type index 127"。

Why is i32's type tag 0x7F? Because 0x7F is the signed LEB128 encoding of -1. Type tags use negative values deliberately — positive values are reserved for "type index" (used by GC proposal). One look at byte 0x7F tells you "basic type i32" rather than "type index 127".

练习 · 你应该能做的事

Exercises · what you should now be able to do

手写一个 mul(a, b) = a × b

Hand-write mul(a, b) = a × b

只改 body:20 00 20 01 6C 0B(opcode 0x6C 是 i32.mul)。整个模块仍是 41 字节。

Change only the body: 20 00 20 01 6C 0B (opcode 0x6C is i32.mul). Module still 41 bytes.

改名字 "add" → "plus"

Rename "add" → "plus"

name length 从 3 改 4,UTF-8 bytes 改 70 6C 75 73。Export section size 也要 +1。

Name length 3 → 4, UTF-8 bytes become 70 6C 75 73. Export section size also +1.

加一个 local i32 用来做中间存储

Add a local i32 for scratch

在 body 开头把 locals count 改成 1,然后写 01 7F("1 个 local 类型 i32")。body 尺寸 +2,code section size +2,整个模块 +4。

Set locals count to 1 in the body prefix, then write 01 7F ("1 local of type i32"). Body size +2, code section size +2, module +4.

改 i32 → i64,全套 8 字节算术

Switch to i64 everywhere

类型 tag 改 7E(-2),add opcode 改 7C(i64.add)。所有出现 i32 的地方都得改。

Type tag becomes 7E (-2), add opcode becomes 7C (i64.add). Every occurrence of i32 must update.

读 hex 不再像读外语。
这是这一章给你的不可逆能力。 Field Note · 03 · Hands-on

Reading hex is no longer reading a foreign language.
That is the irreversible ability this chapter gives you. Field Note · 03 · Hands-on

本章引用Chapter references

tool: WABT · wat2wasm / wasm-objdump
spec: Wasm Core 2.0 · §5 Binary format

上一章previous← Ch29 · wasm 不能做什么What wasm cannot do 下一章nextCh31 · Wasm vs eBPFWasm vs eBPF →

APPENDIX · 05 · COMPARISON

Wasm vs eBPF — 两种沙箱化字节码的分工

Wasm vs eBPF — two sandboxed bytecodes, divided

user-space vs kernel-space · 同源不同命

user-space vs kernel-space · same idea, different fate

"沙箱化字节码" 这件事在 2010 年代有两条平行进化线:Linux 内核里的 eBPF(extended Berkeley Packet Filter,2014 由 Alexei Starovoitov 在 Linux 3.18 里搞出来)和浏览器里的 wasm(2015)。两者表面相似——都是给一段不可信代码画沙箱让它在敏感地方跑——但下面的工程哲学几乎相反。这一章把它们摆在一起对照,你会看到 wasm 在更大的虚拟机生态里坐在哪里。

"Sandboxed bytecode" had two parallel evolutionary tracks in the 2010s: eBPF inside the Linux kernel (extended Berkeley Packet Filter, born 2014 in Linux 3.18 by Alexei Starovoitov) and wasm in the browser (2015). The two look superficially alike — sandboxes for untrusted code in sensitive places — but their engineering philosophies are almost opposite. This chapter puts them side by side; you'll see where wasm sits in the broader VM ecology.

对照表

Side by side

	eBPF	wasm
运行位置Where it runs	Linux kernel · ring 0	user space · sandboxed process
执行模式Execution mode	JIT only(no interp)	JIT + interp(WAMR/wasm3 有 interp)
字节码长度Bytecode length	≤ 1M instr(verifier-bounded)	unbounded
循环Loops	5.3+ bounded loops only,验证器要证明终止	arbitrary loops(non-terminating OK)
递归Recursion	禁止	允许
指针Pointers	受限(verifier 跟踪每个 pointer 的边界与类型)	linear memory · 整数索引,无 raw pointer
验证算法Verifier	路径敏感抽象解释 — 指数级最坏情况	单遍 forward sweep · O(n) (Ch11)
类型系统Type system	4 类型 + tagged pointers(MAP_VALUE / PACKET / SOCKET / STACK)	i32/i64/f32/f64/v128 + ref types
helper 函数Helper functions	~ 200 个内核 helper,通过 BPF_CALL 调用	无 helper · 通过 import 拿宿主能力
起源年Born	2014 (Linux 3.18) · 1992 cBPF 演化	2015 W3C · 2017 ship
设计目标Designed for	kernel observability / networking / security	portable compute in any host
速度目标Perf target	~ native (small programs)	~ 80% of native
主要语言Source languages	C(via clang -target bpf)+ Rust + Go(aya)	Rust / C/C++ / Go / Kotlin / Java / ...

核心差异 · "verifier 严格度"

The core difference · "verifier strictness"

eBPF verifier 比 wasm validator 严格 一个数量级。原因:eBPF 跑在内核态,一个 bug 可能 kernel panic;wasm 跑在用户态进程,一个 bug 最多让浏览器标签崩。

eBPF verifier is an order of magnitude stricter than the wasm validator. Reason: eBPF runs in kernel mode; a bug can panic the kernel. Wasm runs in a user-space process; a bug at worst crashes a browser tab.

①

eBPF 必须证明终止

eBPF must prove termination

每个循环 5.3 之前是完全禁止;5.3+ 允许 bounded loops 但 verifier 要静态证明有上界。无限循环立即 reject。wasm 欢迎无限循环(浏览器 tab 卡死,刷新就好)。

Loops were fully banned pre-5.3; 5.3+ allows bounded loops only if the verifier can statically prove an upper bound. Infinite loop → rejected. Wasm welcomes infinite loops (a browser tab hangs; refresh fixes it).

②

eBPF 跟踪每个指针的边界

eBPF tracks every pointer's bounds

verifier 维护每个寄存器的tagged type(SCALAR / PTR_TO_MAP_VALUE / PTR_TO_PACKET / PTR_TO_STACK 等)和有效范围。一旦 deref 超出已证明的范围,reject。这是路径敏感分析——同一个寄存器在不同 control flow 上有不同 type。wasm validator 不跟踪指针,因为 wasm 没有 raw pointer——所有内存访问是integer index into linear memory,边界检查靠硬件(Ch10)。

The verifier maintains a tagged type per register (SCALAR / PTR_TO_MAP_VALUE / PTR_TO_PACKET / PTR_TO_STACK / etc.) plus a valid range. Any deref outside the proven range → rejected. This is path-sensitive analysis — a register can have different types on different control-flow paths. The wasm validator doesn't track pointers, because wasm has no raw pointers — every memory access is an integer index into linear memory, bounded by hardware (Ch10).

③

eBPF program 长度上限

eBPF program length cap

原本 4096 指令,5.2 提到 1M。验证状态空间会指数爆炸——程序太大根本验证不动。wasm 无上限——Photoshop 30 万函数照样验证。差异源自验证算法复杂度。

Used to be 4096 instructions, lifted to 1M in 5.2. Verifier state space explodes exponentially — bigger programs simply can't be verified. Wasm has no cap — Photoshop's 300 K functions validate fine. The difference comes straight from validator complexity.

helper 函数 vs imports

helper functions vs imports

两种字节码都需要从宿主拿能力(syscall / 网络 / 时钟),但暴露方式完全不同:

Both bytecodes need host capabilities (syscalls, networking, clocks), but expose them oppositely:

// eBPF · helpers are baked into the kernel verifier
// (kernel/bpf/helpers.c has the table)
long ret = bpf_map_lookup_elem(&my_map, &key);
long ret = bpf_redirect(ifindex, BPF_F_INGRESS);
long ret = bpf_probe_read_kernel(&dst, sizeof(dst), src);
// → BPF_CALL instruction · helper_id is a constant the verifier knows

// wasm · imports are declared in the module, resolved at instantiate
(import "wasi_snapshot_preview1" "fd_write"
  (func $fd_write (param i32 i32 i32 i32) (result i32)))
// → host plugs in any function with matching signature at run time

这是关键的设计差异:eBPF 的可用 helper 是内核版本决定的——你的代码能调什么 helper 取决于跑在哪个内核;wasm 的可用 imports 是实例化时宿主提供的——同一份 .wasm 可以跑在浏览器(import JS 函数)、Wasmtime(import WASI)、Cloudflare Workers(import Cloudflare API)上,只要 import 签名匹配。wasm 把"宿主接口" 设计成了 late binding,eBPF 是 early binding。

This is the key design difference: eBPF's available helpers depend on the kernel version — which helpers you can call depends on what kernel you're running. Wasm's available imports are decided at instantiation by the host — the same .wasm can run in the browser (importing JS functions), Wasmtime (importing WASI), or Cloudflare Workers (importing CF APIs), as long as signatures match. Wasm designed the host interface as late binding; eBPF chose early binding.

"两条战线" 的分工

"Two fronts" of division

场景Use case	谁赢Winner	为什么Why
Linux kernel networking (XDP / TC)	eBPF	必须 ring 0,wasm 不能
Cilium / Tetragon · K8s 安全	eBPF	需要 kernel events / kprobes / tracepoints
bcc / bpftrace · 系统观测	eBPF	跟踪 kernel 函数调用栈
浏览器内 SPA(Figma / Photoshop Web)	wasm	eBPF 进不了浏览器
Cloudflare Workers / Fastly Compute@Edge	wasm	跨语言 + 跨平台 portable
WebTransport client / WebRTC encoder	wasm	需要在 user-space app context 里跑
Game logic in browser	wasm	大程序 + 任意循环
Database query UDFs (Postgres / Redis)	看情况depends	Postgres pgx 走 wasm,kernel-side trace 走 eBPF
IoT 嵌入式	wasm	非 Linux 平台,eBPF 不通用

交叉点 — 它们正在互相学习

Crossover — they are learning from each other

→

eBPF 在变得更像 wasm

eBPF growing more wasm-like

2023+ 的 linker / libbpf / CO-RE(Compile Once - Run Everywhere)让 eBPF 程序跨内核版本可移植——以前要为每个内核版本重编。这是 wasm 早就解决的问题。

2023+ linker / libbpf / CO-RE (Compile Once - Run Everywhere) make eBPF programs portable across kernel versions — previously you recompiled per kernel. This was wasm's solved problem all along.

→

wasm 在变得更像 eBPF

wasm growing more eBPF-like

Wasmtime 加了 fuel(每条指令扣燃料,跑到 0 强制终止)和 epoch interruption(异步打断),都是 eBPF 在用的"不可信代码资源上限" 思路。Cloudflare Workers 默认 50ms CPU = wasm 抄 eBPF 的家庭作业。

Wasmtime added fuel (instruction credit, terminate at 0) and epoch interruption (async preemption), both eBPF's "untrusted-code resource cap" patterns. Cloudflare Workers' default 50 ms CPU cap = wasm doing eBPF's homework.

→

用 wasm 跑 eBPF 程序

eBPF programs running on wasm

eunomia-bpf/bpftime 这种项目把 eBPF bytecode 在 user-space 用 LLVM JIT 跑,本质上是"用 wasm 框架运行 eBPF"。两条线开始合流。

Projects like eunomia-bpf/bpftime run eBPF bytecode in user-space via LLVM JIT — essentially "running eBPF inside a wasm-style frame". The two tracks are merging.

eBPF 在内核里 paranoid,wasm 在用户态 forgiving。
同一种思想,两种代价模型。 Field Note · 03 · Comparison

eBPF is paranoid in the kernel,
wasm is forgiving in user-space.
Same idea, two cost models. Field Note · 03 · Comparison

本章引用Chapter references

docs: ebpf.io · introduction
repo: bpftime · run eBPF in user-space
docs: Linux kernel · BPF verifier docs

上一章previous← Ch30 · 手写 wasmHand-write wasm 下一章nextCh32 · IoT 上的 wasmwasm on IoT →

APPENDIX · 06 · IoT

IoT 上的 wasm — 128 KiB 的舞蹈

wasm on IoT — dancing in 128 KiB

第三战场:浏览器 / 服务端 / 嵌入式

the third battlefield: browser / server / embedded

前 31 章默认 wasm 跑在有几 GB 内存的机器上(浏览器 / Cloudflare Workers / Wasmtime CLI)。但 wasm 还有一片很容易被忽略的战场:嵌入式 / IoT——ESP32(520 KiB SRAM)、STM32(几十到几百 KiB)、树莓派 Pico(264 KiB)。这种"磁盘几 MB · 内存几百 KiB" 的环境跑不动 V8(几十 MB),也跑不动 Wasmtime(基础占用 10+ MB)。WAMR(WebAssembly Micro Runtime,Intel 主导,Bytecode Alliance 项目)就是为这片场景设计的。

The previous 31 chapters assumed wasm runs on machines with gigabytes of RAM (browsers / Cloudflare Workers / Wasmtime CLI). But there's an easily-forgotten battlefield: embedded / IoT — ESP32 (520 KiB SRAM), STM32 (tens to hundreds of KiB), Raspberry Pi Pico (264 KiB). These "megabytes of flash · hundreds of KiB of RAM" environments can't run V8 (tens of MB) or Wasmtime (10+ MB baseline). WAMR (WebAssembly Micro Runtime, Intel-led, Bytecode Alliance project) exists for this niche.

三档运行模式

Three running modes

WAMR 设计上提供 3 种执行模式,内存占用 / 性能 / 启动延迟之间各做不同取舍:

WAMR offers 3 execution modes, each making different memory / perf / startup trade-offs:

Mode	RAM baseline	perf vs native	startup	use case
Classic interpreter	~ 50 KiB	~ 1/50×	instant	STM32 / 极致内存预算
Fast interpreter	~ 80 KiB	~ 1/10×	fast (build threaded code)	ESP32 default
AOT	~ 100 KiB + AOT 代码	0.8-1×(near native)	build-time	性能 critical · 提前编译
JIT (LLVM)	~ 4 MB	1×	慢 (LLVM init)	边缘网关 · 不要嵌入式

"Classic interpreter" 是真解释器——一个 switch-case 死循环,fetch opcode → dispatch。"Fast interpreter" 在加载时把字节码翻译成 threaded code(每个 opcode 是一个函数指针 + 立即数),省了 dispatch 的 cache miss——加载多 30% 内存,运行快 5×。AOT 模式才是 IoT 真正的杀器:把 .wasm 在 host 机器上预编译成 ELF / 目标芯片机器码,设备 flash 进就跑,运行时占用极小。

"Classic interpreter" is a real interpreter — a switch-case loop, fetch opcode → dispatch. "Fast interpreter" pre-translates bytecode at load time into threaded code (each opcode = function pointer + immediate), avoiding dispatch cache misses — 30% more memory at load, 5× faster execution. AOT is the real IoT killer: pre-compile .wasm on a host machine into ELF / target-chip machine code, flash to the device, run with minimal overhead.

真实部署 · 几个有意思的案例

Real deployments · a few notable cases

⒈

Disney+ · 智能电视

Disney+ · smart TVs

Disney+ 客户端在低端智能电视上用 wasm 跑业务逻辑,绕过各家厂商的 JS 引擎兼容问题。WAMR AOT,~ 2 MB 模块,内存占用控制在 ~ 300 KiB。2023 Disney+ tech blog 公开过。

Disney+ runs business logic in wasm on low-end smart TVs, bypassing per-vendor JS engine compatibility hell. WAMR AOT, ~ 2 MB module, ~ 300 KiB RAM footprint. Public via 2023 Disney+ tech blog.

⒉

Wasm for OpenWrt

路由器上跑 wasm-based 防火墙规则(类似 eBPF 但 user-space)。WAMR + WASI-net,8 MB flash 的路由器都能跑。社区维护,2024 起 OpenWrt 23.05 进入官方仓库。

Wasm-based firewall rules on home routers (eBPF-like but user-space). WAMR + WASI-net, works on routers with 8 MB flash. Community-maintained, official OpenWrt 23.05 package since 2024.

⒊

Cosmonic Wasmcloud

分布式 wasm 调度框架。一个 module 可以在 cloud / edge / IoT 设备间无缝迁移——同一份 .wasm 在不同硬件上由不同的 host(Wasmtime / WAMR)实例化。2024 年起 Eclipse 基金会孵化。

Distributed wasm orchestration. The same module migrates seamlessly across cloud / edge / IoT devices — one .wasm instantiated by different hosts (Wasmtime / WAMR) on each tier. Eclipse Foundation incubation since 2024.

⒋

SpinKube / Krustlet · K8s on wasm pods

SpinKube / Krustlet · K8s wasm pods

在 K8s 里用 wasm 取代部分 container pods:启动 1 ms vs 容器的 100 ms,内存 ~ 几 MB vs 容器 ~ 100 MB+。边缘 K8s 节点(eg. CDN POP)收益最大。Microsoft Azure / Fermyon 是主要推手。

Replace some K8s container pods with wasm: 1 ms startup vs containers' 100 ms, ~ MB memory vs containers' ~ 100 MB+. Biggest wins at edge K8s nodes (e.g. CDN POPs). Pushed by Microsoft Azure / Fermyon.

WAMR 跟 WASI 的关系

WAMR's WASI story

WAMR 实现了 WASI 0.1(preview1)的大部分接口 + 自己的libc-wasi 移植。WASI 0.2(Component Model)在 IoT 上接受比较慢——Component Model 本身就有 ~ 100 KiB 运行时开销,在 128 KiB 内存的 MCU 上是太多。所以 IoT 场景大概率会卡在 WASI 0.1 一段时间,2027 之后才会上 0.2。

WAMR implements most of WASI 0.1 (preview1) + its own libc-wasi port. WASI 0.2 (Component Model) is slow to land in IoT — the Component Model itself has ~ 100 KiB runtime overhead, which is too much on a 128 KiB MCU. So IoT will likely stay on WASI 0.1 for a while, with 0.2 arriving post-2027.

最小可跑 · 一个 ESP32 例子

Minimum viable · an ESP32 example

# compile hot.rs to wasm (no_std for embedded)
$ rustc --target wasm32-unknown-unknown -O \
        -C panic=abort -C codegen-units=1 \
        --crate-type cdylib hot.rs -o hot.wasm

# AOT compile for ESP32 (Xtensa LX6)
$ wamrc --target=xtensa --target-abi=eabi --cpu=esp32 \
        --enable-multi-thread=0 hot.wasm -o hot.aot

$ ls -l hot.*
-rw-r--r--  hot.wasm   192 B
-rw-r--r--  hot.aot    1.4 KB   # native Xtensa code · 7× wasm size

# Flash + run on ESP32 (using esp-idf with WAMR component)
# C wrapper:
wasm_runtime_init();
wasm_module_t mod = wasm_runtime_load_aot(hot_aot_buf, hot_aot_size);
wasm_module_inst_t inst = wasm_runtime_instantiate(mod,
    /*stack_size=*/4096, /*heap_size=*/8192, ...);
wasm_function_inst_t f = wasm_runtime_lookup_function(inst, "blur3", NULL);
uint32_t argv[4] = { srcPtr, dstPtr, 1920, 1080 };
wasm_runtime_call_wasm(exec_env, f, 4, argv);

# Memory footprint on ESP32:
#   WAMR baseline:    ~ 50 KiB
#   AOT code:         ~ 1.4 KiB
#   wasm stack+heap:  12 KiB (configurable)
#   linear memory:    1 page = 64 KiB
#   = total ~ 128 KiB · fits comfortably on ESP32's 520 KiB

为什么 wasm 在 IoT 比 Lua / Python 强

Why wasm beats Lua / Python on IoT

MicroPython 在 IoT 也有市场。但 wasm 有三个优势:① 静态类型 → 安全保证更强;② 多语言 → C++ / Rust / Go 等高性能语言可用,不仅是 Python;③ 沙箱标准化 → 同一份模块在云端测试好后能直接 flash 到设备。Lua + LuaJIT 仍然在游戏 / nginx 配置场景占优——它的运行时更小、调试链路更熟悉。wasm 的 IoT 故事是正在赶来的列车,不是已经到站。

MicroPython has its IoT niche too. But wasm offers three advantages: ① static typing → stronger safety guarantees; ② multi-language → C++/Rust/Go available, not just Python; ③ standardised sandbox → the same module tested in the cloud can be flashed to devices. Lua + LuaJIT still wins for games / nginx config — smaller runtime, more familiar debug chain. wasm-on-IoT is an approaching train, not one already at the station.

浏览器 · 服务端 · IoT —
同一份字节,三个数量级的内存预算。 Field Note · 03 · IoT

Browser · server · IoT —
one set of bytes, three orders of magnitude of memory budget. Field Note · 03 · IoT

本章引用Chapter references

repo: WAMR · WebAssembly Micro Runtime
docs: Wasmcloud · distributed wasm orchestration

上一章previous← Ch31 · Wasm vs eBPFWasm vs eBPF 附录appendixReferences & StandardsReferences & Standards →

APPENDIX · STANDARDS

References & Standards — 文章每个论断的出处

References & Standards — sources for every claim

W3C · IETF · IEEE · 学术 · 源码

W3C · IETF · IEEE · academia · source

这一节把全文用到的所有外部标准、规范、论文、源码归档。每条引用带状态(REC = W3C Recommendation,CR = Candidate Recommendation,WD = Working Draft)+ 链接 + 你在哪一章会用到它。所有 URL 在 2026 年 5 月有效;wasm 提案演化快,phase 4 后会迁移到 W3C TR/ 命名空间。

This section archives every external standard, spec, paper, or source-code reference the article touches. Each carries a status pill (REC = W3C Recommendation, CR = Candidate Recommendation, WD = Working Draft) + link + the chapter that needs it. All URLs valid as of May 2026; wasm proposals move quickly, post-phase-4 entries migrate to W3C TR/ namespaces.

A · 核心 W3C 标准

A · Core W3C standards

WASM 2.0 规范族 · 三件套Wasm 2.0 spec family · the trio

Core 2.0: REC W3C TR · WebAssembly Core Specification 2.0 · 字节格式 + 验证 + 执行语义。Ch06-Ch11 全用。Byte format + validation + exec semantics. Used by Ch06–Ch11.
JS API 2.0: REC W3C TR · WebAssembly JavaScript Interface 2.0 · WebAssembly.Module/Instance/Memory/Table/Global 接口。Ch16/17 用。JS-side WebAssembly.* surface. Used by Ch16/17.
Web API 2.0: REC W3C TR · WebAssembly Web API 2.0 · compileStreaming / instantiateStreaming / Response 集成。Ch12 用。compileStreaming / instantiateStreaming / Response integration. Used by Ch12.
Wasm 1.0: REC W3C TR · WebAssembly Core Specification 1.0 · 2019-12 W3C Recommendation。历史参考——MVP 时代的基线。2019-12 W3C Recommendation. Historical baseline — the MVP era reference.

B · 单独提案(每个有自己的 GitHub spec 仓库)

B · Individual proposals (each with its own GitHub spec repo)

phase 4 已 shipphase 4 · shipped

threads: WG github.com/WebAssembly/threads · 原子操作 + SharedArrayBuffer。Ch18。Atomics + SharedArrayBuffer. Ch18.
simd: WG github.com/WebAssembly/simd · v128 + ~250 lane ops。Ch19。v128 + ~250 lane ops. Ch19.
bulk-memory: WG bulk-memory-operations · memory.copy/fill/init 等。Ch07/22。memory.copy/fill/init etc. Ch07/22.
reference-types: WG reference-types · funcref/externref + 多 table。Ch08/22。funcref/externref + multiple tables. Ch08/22.
multi-value: WG multi-value · 函数返回多值。Ch08。Functions returning multiple values. Ch08.
tail-call: WG tail-call · return_call + return_call_indirect。Ch22。return_call + return_call_indirect. Ch22.
exceptions: WG exception-handling · try/catch/throw/tag。Ch22。try/catch/throw/tag. Ch22.
gc: WG github.com/WebAssembly/gc · struct/array/i31ref/ref.cast。Ch08/20。struct/array/i31ref/ref.cast. Ch08/20.

phase 3 · flag 后可用phase 3 · behind flag

memory64: CG memory64 · i64 寻址,突破 4 GiB。Ch10/22。i64 addressing, breaks 4 GiB cap. Ch10/22.
jspi: CG js-promise-integration · wasm await Promise。Ch22。wasm await Promise. Ch22.
stack-switching: CG stack-switching · coroutine/fiber。Ch22。coroutines/fibers. Ch22.
relaxed-simd: WG relaxed-simd · 允许 1 ulp 差异的 SIMD。Ch19。SIMD ops with 1 ulp tolerance. Ch19.
multi-memory: WG multi-memory · 一模块多 memory。Ch22。Multiple memories per module. Ch22.
component-model: CG component-model · 跨语言 ABI + WIT。Ch21。Cross-language ABI + WIT. Ch21.

提案 phase trackerAll proposals · phase tracker

tracker: github.com/WebAssembly/proposals · 所有 40+ 个提案的当前 phase 和实现状态。Current phase and implementation status of all 40+ proposals.
flag: chrome://flags/#enable-experimental-webassembly-features · 在 Chrome 里打开所有 phase 2-3 提案。Enables all phase 2-3 proposals in Chrome.

C · 底层标准依赖(IEEE / IETF / DWARF)

C · Underlying standards (IEEE / IETF / DWARF)

IEEE 754: IEEE IEEE 754-2019 · Floating-Point Arithmetic · f32 / f64 / NaN 传播,wasm spec 直接引用。Ch08。f32/f64/NaN propagation, directly cited by wasm spec. Ch08.
UTF-8: RFC 3629 RFC 3629 · UTF-8 · 所有 import/export 名字、custom section 名字的编码。Encoding of all import/export names and custom section names.
Unicode: Unicode 15.1 · name section 允许的字符集。Character set allowed in the name section.
LEB128: DWARF 5 · §7.6 Variable Length Data · LEB128 不是独立 RFC,定义在 DWARF 5 spec 中。wasm 所有整数用它。Ch06。LEB128 is not a standalone RFC; defined in DWARF 5 §7.6. Every wasm integer uses it. Ch06.
DWARF 5: DWARF Debugging Information Format · v5 · wasm 调试信息(.debug_* custom sections)。Ch24。Wasm debug info (.debug_* custom sections). Ch24.
Source Maps: TC39 · Source Map (ecma-426 draft) · 2024 起在 TC39 标准化;wasm 通过 sourceMappingURL custom section 引用。Ch24。Standardising at TC39 since 2024; wasm references via sourceMappingURL custom section. Ch24.
COOP / COEP: HTML Standard · COOP / COEP · Spectre 缓解的 HTTP 头要求。Ch18。HTTP headers required for Spectre mitigation. Ch18.
SharedArrayBuffer: ECMAScript · SharedArrayBuffer · wasm threads 共享内存的 JS 端 carrier。Ch18。JS-side carrier for wasm threads shared memory. Ch18.

D · WASI / Component Model / 非浏览器

D · WASI / Component Model / non-browser

WASI 0.2: WD WebAssembly/WASI · wasip2 · Preview2 · 用 Component Model 重写的"系统接口"。Ch21/25/28。Preview2 · "system interfaces" rewritten via Component Model. Ch21/25/28.
WASI 0.1: WebAssembly/WASI · preview1 (legacy) · 2019 以来事实标准,Wasmtime / Spin / Cloudflare 都支持。De-facto standard since 2019; supported by Wasmtime / Spin / Cloudflare.
WIT: Component Model · WIT IDL Reference · Component Model 的接口描述语言。Ch21。Component Model's IDL. Ch21.
wit-bindgen: bytecodealliance/wit-bindgen · 从 .wit 生成 N 种语言 binding 的工具链。Ch21。Toolchain generating N-language bindings from .wit. Ch21.
Wasmtime: Wasmtime documentation · Bytecode Alliance 的 wasm 运行时。Ch01/14/25。Bytecode Alliance's wasm runtime. Ch01/14/25.
Cranelift: Cranelift code generator · Wasmtime 和 Wasmer 共用的 Rust 写 wasm 编译器。Ch01/15。Rust-written wasm compiler shared by Wasmtime and Wasmer. Ch01/15.

E · 学术论文

E · Academic papers

PLDI 2017: Haas, Rossberg, Schuff, Titzer, Holman, Gohman, Wagner, Zakai, Bastien. "Bringing the Web up to Speed with WebAssembly" · 原始论文,描述 MVP 设计原则、形式语义、验证算法。Ch01/03/11 引用。The original paper · design principles, formal semantics, validation algorithm. Cited by Ch01/03/11.
CPP 2018: Watt. "Mechanising and Verifying the WebAssembly Specification" · 在 Isabelle/HOL 里把 wasm 规范 mechanise + 验证,发现 spec 里的几处 bug。Ch11 / Ch27。Mechanised + verified the wasm spec in Isabelle/HOL, finding spec-level bugs. Ch11 / Ch27.
CACM 2018: Haas, Rossberg, Schuff, Titzer, Holman, Gohman, Wagner, Zakai, Bastien. "Bringing the Web up to Speed with WebAssembly" · CACM Research Highlights · Dec 2018 · PLDI 2017 论文被 CACM 选入 Research Highlights 重发,带 Anders Møller 的 technical perspective。比 PLDI 原版更适合入门。The PLDI 2017 paper re-published as a CACM Research Highlight (Dec 2018), with Anders Møller's technical perspective. A more accessible entry point than the PLDI original.
POST 2019: Disselkoen, Renner, Watt, Garfinkel, Levchenko, Stefan. "Position Paper: The Meaning of Memory Safety" · 讨论 wasm 内存安全的精确边界 — Ch10/27 引用。Discusses the precise boundary of wasm memory safety. Cited by Ch10/27.
OSDI 2020: Lehmann, Kinder, Pradel. "Everything Old is New Again: Binary Security of WebAssembly" · 早期 wasm 二进制安全分析(stack-smashing 等)。Ch27。Early binary-security analysis of wasm (stack smashing, etc.). Ch27.

F · 源码定位

F · Source code anchors

V8 wasm 实现V8 wasm implementation

decoder: v8/src/wasm/module-decoder.cc · 流式 decode 主流程。Ch12。Streaming decode main loop. Ch12.
validator: v8/src/wasm/function-body-decoder-impl.h · 类型栈验证模板。Ch11/13。Type-stack validator template. Ch11/13.
Liftoff: v8/src/wasm/baseline/liftoff-compiler.cc · 基线 JIT 主入口。Ch14。Baseline JIT main entry. Ch14.
TurboFan-wasm: v8/src/compiler/wasm-compiler.cc · wasm → TF graph build。Ch15。wasm → TF graph build. Ch15.
Turboshaft: v8/src/compiler/turboshaft/wasm-*.cc · 2023 后默认的 wasm 优化器。Ch15。Default wasm optimiser since 2023. Ch15.
wrappers: v8/src/wasm/wasm-import-wrapper-cache.cc · JS↔Wasm trampoline 缓存。Ch17。JS↔Wasm trampoline cache. Ch17.
JS API: v8/src/wasm/wasm-objects.cc · Module/Instance/Memory/Table 的 JS 对象。Ch16。JS objects for Module/Instance/Memory/Table. Ch16.

SpiderMonkey · JavaScriptCoreSpiderMonkey · JavaScriptCore

SM baseline: mozilla-central/js/src/wasm/WasmBaselineCompile.cpp
SM Ion: mozilla-central/js/src/wasm/WasmIonCompile.cpp
JSC BBQ/OMG: WebKit/JavaScriptCore/wasm/WasmBBQ*.cpp · WasmOMG*.cpp

工具链Toolchains

Binaryen: WebAssembly/binaryen · wasm-opt, wasm-as, IR 优化器。
WABT: WebAssembly/wabt · wasm2wat, wasm-objdump, wat2wasm。Ch06/07/24 用。
wasm-bindgen: rustwasm/wasm-bindgen · Rust↔JS 胶水生成器。
Emscripten: emscripten.org · C/C++ → wasm 工具链。Ch02/25 引用。
AssemblyScript: assemblyscript.org · TypeScript-flavoured wasm 源码语言,不走 LLVM 走 Binaryen。

G · 治理 · 历史

G · Governance · history

CG charter: W3C WebAssembly Community Group · 提案 phase 0-2 在这里讨论。任何人可加入。Proposals phase 0–2 live here. Open to anyone.
WG charter: W3C WebAssembly Working Group · 提案 phase 3-4 在这里投票。需要会员资格。Proposals phase 3–4 vote here. Membership required.
2015 birth: Luke Wagner · "WebAssembly" (17 Jun 2015) · Mozilla 的官宣博文。Ch02。Mozilla's launch announcement. Ch02.
asm.js spec: asm.js: a typed subset of JavaScript · 2013 Wagner / Zakai。Ch02/04。2013 Wagner / Zakai. Ch02/04.
Bytecode Alliance: bytecodealliance.org · 非营利组织,推 WASI / Component Model / 非浏览器 wasm。Ch01/25/28。Non-profit driving WASI / Component Model / non-browser wasm. Ch01/25/28.
Wasm 2.0 milestone: W3C · Wasm 2.0 Recommendation press release (2025-03) · 把 2019-2023 八个独立提案合并成 2.0 基线。Ch02。Folded eight 2019–2023 proposals into the 2.0 baseline. Ch02.

技术写作的鉴别度
不在花活,在出处。 Field Note · 03 · Appendix

Technical writing is judged not by flourishes,
but by the rigour of its sources. Field Note · 03 · Appendix

读到这里,你已经看完 wasm 从字节到 SIMD 的一生。
下次再有人问 "wasm 是什么"——别用一句话回答。 Field Note · 03 · Fin

You have now read the life of wasm, from byte to SIMD.
Next time someone asks "what is wasm" — refuse the one-liner. Field Note · 03 · Fin

上一章previous← Ch32 · IoT wasmwasm on IoT →

在沉到比特之前。

Before we sink into bits.

三个公式 — WebAssembly 到底是什么

Three formulas — what WebAssembly really is

主流引擎的「Tier 拓扑」对照

Engine tiering at a glance

家谱 — 从 asm.js 到 wasm-GC 的十三年

A family tree — 13 years from asm.js to wasm-GC

三个不能略过的祖先

Three ancestors you cannot skip

提案的四个阶段

The four phases of a proposal

为什么是栈机 — 一个 1980 年代的选择

Why a stack machine — a choice from the 1980s

两种机器的编码密度对比

Encoding density: stack vs register

栈机的四个"赠送好处"

Four bonuses the stack throws in for free

编码密度高

High encoding density

验证算法线性

Linear validation

单遍 codegen 可行

Single-pass codegen viable

ISA 无关

ISA-neutral

但栈机有一个老问题

But the stack has one old problem

fib(40) 在 wasm 和 Dalvik 上的字节数

fib(40) in wasm vs Dalvik bytes

JS 的天花板 — JIT 再聪明也跨不过的三件事

JS ceiling — three things even a brilliant JIT can't climb

三个具体的天花板

Three concrete ceilings

"把 px2rem 优化到 24 ms 已经是极限了——再快只能不写 JS。"

"Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, stop writing JS."

数字:wasm 比 JS 快多少

By how much, in numbers

The Hot Loop。

The Hot Loop.

The Hot Loop — 一段 3×3 卷积的来生

The Hot Loop — afterlife of a 3×3 convolution

源码 · hot.rs

Source · hot.rs

编译命令 · 一行 rustc

Build command · one rustc invocation

作为 wat 文本(人类可读)

As wat text (human-readable)

在 V8 里跑出的机器码 · Liftoff(Tier 0)

Machine code in V8 · Liftoff (Tier 0)

同一个函数,经 TurboFan 重编译(Tier 1)

Same function, TurboFan-recompiled (Tier 1)

和 SIMD 版本的对比

vs the SIMD version

"The Hot Loop" 在后续 22 章的回引地图

Where The Hot Loop returns in the next 22 chapters

一张图看完它的一生

The whole life in one frame

把 192 字节摊在桌上。

192 bytes, spread on the table.

Module 的外壳 — 8 个字节的承诺

The Module shell — a promise in 8 bytes

第一字节级别的拆解

Byte-level walkthrough

所有 12 个 section 的 ID

All 12 section IDs

192 字节里只用到了 7 个 section

192 bytes use only 7 sections

11 段 section — 一个 .wasm 文件的器官学

11 sections — the organology of a .wasm file

① Type section · 函数签名表

① Type section · the signature table

② Import section · 来自宿主的函数与对象

② Import section · functions and objects from the host

③ Function section · 函数和签名的连线

③ Function section · wiring functions to their signatures

④ Table section · funcref 的数组

④ Table section · the funcref array

⑤ Memory section · 线性内存的声明

⑤ Memory section · linear memory declaration

`fib(40)` 在 wasm 和 Dalvik 上的字节数

`fib(40)` in wasm vs Dalvik bytes

`memory.grow` 的代价

The cost of `memory.grow`