JS 极致性能优化 — V8 优化原理与一段热点函数的重生

CHAPTER 01 · PROLOGUE

三视角看 `const a = 3 + 4`

Three eyes on `const a = 3 + 4`

同一行代码,三种世界,三种翻译

one line of code, three worlds, three translations

性能优化的第一步,不是 profile,是回答一个问题:你脑子里那行 JS,跟 CPU 真正执行的那串指令,中间到底差了几层翻译?

把 const a = 3 + 4 摆出来,用三种"眼睛"去看它,你会发现这一行字面上看似一回事的代码,在三层世界里长得完全不一样——而 V8 的所有优化魔法,都发生在这三层之间的翻译过程里。

The first step of performance work isn't profiling — it's answering one question: how many translations sit between the line of JavaScript in your head and the instructions the CPU actually runs?

Take const a = 3 + 4. Look at it with three different eyes. The same one-liner shape-shifts across three worlds, and every V8 optimization trick lives in the translation between them.

人脑 / Brain · js source

const a = 3 + 4; // → a = 7

V8 Ignition · bytecode

LdaSmi [3]
Star0
LdaSmi [4]
Add r0, [0]
Star1
Return

TurboFan · x86-64 asm

; const folded → 7
mov eax, 7
ret

人脑 / Brain · js source

function add(a, b) { return a + b; }

V8 Ignition · bytecode

// 3 ops · 6 bytes
Ldar a1
Add a0, [0]
Return

TurboFan · x86-64 (assumed SMI)

testb [rbx+0xf], 0x1
jne deopt
testb [rcx+0xf], 0x1
jne deopt
mov rax, rbx
add rax, rcx
jo overflow
ret

人脑 / Brain · js source

function abs(x) { return x < 0 ? -x : x; }

V8 Ignition · bytecode

Ldar a0
TestLessThan [0], [1]
JumpIfFalse [+5]
Ldar a0
Negate [2]
Return
Ldar a0
Return

TurboFan · x86-64 (assumed SMI)

testb [rbx+0xf], 0x1
jne deopt
test rbx, rbx
jns .pos
neg rbx
.pos:
mov rax, rbx
ret

人脑 / Brain · js source

function sum(n) {
  let s = 0;
  for (let i = 0; i < n; i++) { s += i; }
  return s;
}

V8 Ignition · bytecode

LdaZero
Star0 ; s
LdaZero
Star1 ; i
.L:
Ldar a0
TestLessThan r1, [0] ; i < n (register on the left, accumulator on the right)
JumpIfFalse [+9]
Ldar r0
Add r1, [1]
Star0
Ldar r1
Inc [2]
Star1
JumpLoop [-13]
Ldar r0
Return

TurboFan · x86-64 (loop opt)

xor rax, rax ; s = 0
xor rcx, rcx ; i = 0
.loop:
cmp rcx, rbx
jge .done
add rax, rcx
inc rcx
jmp .loop
.done:
ret

抽象程度高 / More abstract → 抽象程度低 / Less abstract

FIG. 01 · interactive 点上面 4 个 tab 切换不同 JS 样例,每段都给出 V8 Ignition 的字节码产物和 TurboFan 的机器码产物。同一段 JS,经过两次翻译,从抽象到具体。注意右栏汇编里多出来的 testb 和 jne deopt——那不是逻辑,是 V8 埋的类型检查 checkpoint(后面 Phase II 详细讲)。 Click the four tabs above to swap JS samples; each shows V8 Ignition's bytecode and TurboFan's machine code. The same JS, two translations, abstract to concrete. Note the extra testb + jne deopt in the asm column — those aren't logic, they're V8's type-check checkpoints (Phase II will dissect them).

为什么要这么多层翻译

Why so many translations

裸的 CPU 只认机器码,但 JS 是动态的——它的类型是运行时才知道的。一个 add(a, b) 函数,你不告诉我 a 和 b 是什么,我就没法把它编译成"两个整数相加"这一条 add eax, ebx 指令——因为下一秒你可能会传两个字符串过来。

于是 V8 在脑和 CPU 之间垫了一层字节码:它比 AST 接近物理机,又比机器码灵活——可以解释执行,可以收集"参数到底是什么类型"的反馈,等收集够了再把字节码编译成机器码。

这一层就是性能优化的全部战场。下面整本文章,讲的都是 V8 在这一层做了什么、能被你怎么利用。

A bare CPU only speaks machine code, but JavaScript is dynamic — types are a runtime fact. Given add(a, b), I can't fold it down to a single add eax, ebx if I don't know what a and b are — next second you might pass me two strings.

So V8 inserts a bytecode layer between brain and CPU: closer to a real machine than the AST, more flexible than asm — interpretable, observable, and recompilable into machine code once V8 has watched enough calls to know what your types actually are.

That layer is the entire battlefield of JS performance work. The rest of this piece is about what V8 does in there — and how you can cooperate with it.
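A minimal sketch of why one source line cannot be pinned to a single machine instruction ahead of time: the same `add` takes a different path for each combination of runtime types. The input values are chosen only for illustration.

```javascript
// The same source line behaves differently depending on runtime types.
function add(a, b) {
  return a + b;
}

const results = [
  add(3, 4),      // number + number: numeric addition
  add("3", "4"),  // string + string: concatenation
  add(3, "4"),    // mixed: the number is converted to a string first
  add(0.1, 0.2),  // doubles: floating-point addition, not an integer add
];

console.log(results);
```

An engine that compiled `add` down to a bare integer `add` instruction would be wrong for three of these four calls, which is why it has to observe the calls first.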

所谓 JS 的"快",
其实是 V8 在背着你猜对了一万次。 Field Note · 03

"Fast JavaScript" really means
V8 quietly guessed right ten thousand times. Field Note · 03

为什么 V8 要从字节码出发 / WHY START FROM BYTECODE

本质上 V8 字节码和 x86 汇编是同一种东西:都是给"虚拟机"或"物理机"消费的指令。区别只在于这世上没有一台 CPU 直接跑 V8 字节码,机器码因为能在硬件上裸跑所以快;V8 之所以引入字节码这一层,是因为它比 AST 更接近物理机(无层次嵌套、是基于栈的带累加寄存器的指令集),又比机器码灵活(可解释、可观测、可热替换)。

Underneath, V8 bytecode and x86 asm are the same kind of thing: instructions for a (virtual or physical) machine. The only difference is that no CPU ships with V8 bytecode in silicon. Machine code wins because hardware runs it directly; V8 introduced a bytecode layer because it sits closer to the metal than the AST (a flat, stack-based ISA with an accumulator register) and stays more flexible than asm (interpretable, observable, hot-swappable).

所以这篇文章在解决什么问题

So what is this piece actually for

性能优化的"难"不在改代码——难在看懂 V8 当前在干什么。这篇文章不会铺一遍 V8 全部知识,而是围绕一个具体问题:

The hard part isn't editing code. The hard part is seeing what V8 is doing right now. So this piece doesn't tour every V8 internal — it answers one concrete question:

主问题 / THE PROBLEM

function px2rem(input, base) { /* … */ } // 1M loops · polymorphic input → 240 ms

怎么把它跑到 ~24 ms,且知道为什么变快了?

How do we get this down to ~24 ms, and know why?

这是一段很普通的工具函数:输入可能是 number、string、或 { value, unit },输出是一个 rem 数值。它在我们某个项目里被每帧调用上百次,占了一段不容忽视的 CPU 时间。

下面的 19 章,每一章都是把它跑得更快这件事里的一刀。流水线、Hidden Class、Inline Cache 不是为了好看的术语——它们是用来切割问题的刀。

It's a perfectly ordinary helper. Input may be a number, a string, or a { value, unit } shape; output is a rem value. In one of our projects it ran hundreds of times per frame and burned non-trivial CPU.

The 19 chapters that follow are each one cut of making it faster. Pipeline, Hidden Class, Inline Cache — these aren't decorations. They are the knives we'll cut the problem with.


CHAPTER 02 · PROLOGUE

JIT vs AOT — 编译时机的两条路

JIT vs AOT — two roads to a binary

编译期很薄,运行时很厚

a thin compile-time, a fat runtime

C / Rust / Go 这种静态语言,编译期就能确定每个变量是什么类型、每个函数怎么调用、对象在内存里长什么样——所以它们走 AOT (Ahead-Of-Time):发布之前就生成最终机器码。运行时几乎没有"编译"这件事。

而 JS 是动态的,函数被调用前没人能保证 a + b 里的两个值是数字、字符串、还是别的什么东西。所以 V8 走 JIT (Just-In-Time):运行边编译,边收集类型反馈,边根据反馈优化。

Static languages like C / Rust / Go can pin down types, call shapes, and memory layouts at compile time — so they take the AOT (Ahead-Of-Time) road. By the time the binary ships, almost no "compiling" happens at runtime.

JavaScript is dynamic. Until add(a, b) is actually called, nobody can promise the two arguments are numbers — they could be strings or objects. So V8 takes the JIT (Just-In-Time) road: it compiles while running, observes types, and re-optimizes from feedback.

维度 / Dimension	AOT · C / Rust / Go	JIT · V8 / JSC
编译期 / Compile-time	很厚 · 全部优化都在这里做 / thick · all optimization happens here	很薄 · 只 parse + 生成 bytecode / thin · only parse + bytecode
运行时 / Runtime	很薄 · 直接跑机器码 / thin · just runs machine code	很厚 · 编译和反优化都发生在运行过程中 / thick · (re)compile and deopt during execution
类型信息 / Type info	编译期已知 / known at compile time	运行时收集 (feedback) / collected at runtime (feedback)
最优代码生成时机 / Peak code emitted when	编译完成那一刻 / at compile end	"足够热"之后 (TurboFan) / after "hot enough" (TurboFan)
代价 / Cost	构建慢 · 二进制大 / slow build · large binary	冷启动慢 · 编译产物占内存 / cold start · compiled-code memory

"很厚的运行时"是什么意思

What does "thick runtime" actually mean

意思是:你写的同一段 JS,V8 运行时会反复地编译它:先用解释器跑(Ignition),发现是热点函数后,会用基线编译(Sparkplug)、中间编译(Maglev)、最后峰值编译(TurboFan)轮番上阵。每一次升级都要在编译本身上花时间(Sparkplug 几乎免费,TurboFan 则是毫秒级),这部分时间是 AOT 语言在运行时不需要付的。

所以你写 function add(a, b) { return a + b },V8 在你眼皮底下可能跑过 4 个不同版本的 add——每个都对应不同程度的优化和不同程度的"假设"。

这恰恰也是性能优化的机会:如果你能让 V8 的假设保持稳定,它就能一直跑在最优版本(TurboFan 机器码)上,不会被"反优化"打回字节码解释。

It means the same chunk of JS gets recompiled while it runs: interpreter first (Ignition), then once V8 notices it's hot, baseline (Sparkplug), mid-tier (Maglev), and finally peak (TurboFan) take turns. Each upgrade spends time on compilation itself, from nearly free for Sparkplug up to milliseconds for TurboFan, a tax AOT languages never pay at runtime.

So when you write function add(a, b) { return a + b }, four different versions of add may have run inside V8 by the time you blink — each at a different level of optimization, each with a different set of assumptions.

That's also where the leverage sits: keep V8's assumptions stable and your function lives on the peak (TurboFan machine code) forever — never "deoptimized" back into bytecode interpretation.
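One way to keep assumptions stable, sketched under the assumption that messy inputs can be normalized once at a boundary: the hot helper then only ever sees one type combination. The function names `scale`, `toNumber`, and `scaleAll` are hypothetical, and whether V8 actually keeps the hot path optimized depends on the engine version and the surrounding code.

```javascript
// Hypothetical helper names; nothing here is a V8 API.

// Hot path: only ever called with (number, number), so its type feedback stays stable.
function scale(value, factor) {
  return value * factor;
}

// Boundary: absorbs the mixed input types once, outside the hot call site.
function toNumber(input) {
  if (typeof input === "number") return input;
  if (typeof input === "string") return Number.parseFloat(input);
  throw new TypeError(`unsupported input: ${typeof input}`);
}

function scaleAll(inputs, factor) {
  const numbers = inputs.map(toNumber); // normalize first
  const out = new Array(numbers.length);
  for (let i = 0; i < numbers.length; i++) {
    out[i] = scale(numbers[i], factor); // this call site sees numbers only
  }
  return out;
}

console.log(scaleAll([1, "2", "3.5px"], 2));
```

The polymorphism has not disappeared; it has moved into `toNumber`, which is the design choice being illustrated.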

为什么 JS 不能像 C 一样 AOT / WHY JS CAN'T JUST AOT

不是不能,是很难精确。Hermes(React Native 用的 JS 引擎)在打包阶段会做一次 AOT 字节码生成,体积/启动都比 V8 漂亮,但代价是没有 JIT,峰值性能往往打不过 V8。本质权衡是:动态性能 vs 启动 / 包体。V8 选了前者,Hermes 选了后者。

Not "can't", just "can't precisely". Hermes (React Native's engine) does AOT-bytecode at bundle time, winning on size and start-up. The tradeoff: no JIT, so peak throughput often loses to V8. Dynamic peak vs start-up / size: V8 chose the first, Hermes the second.


CHAPTER 03 · PIPELINE

四层 JIT — Parser → Ignition → Sparkplug → Maglev → TurboFan

Four-tier JIT — the compile pipeline

同一段 JS,在 V8 里其实有四份不同的它

the same JS exists in V8 as four different versions of itself

V8 不是只有一个编译器,而是一条流水线上的四级编译器。同一个函数,会随着"被调用次数"在四级之间向上爬——每爬一级,生成的代码越接近裸机,执行越快,但编译本身的耗时也越大。

这是 V8 的核心权衡:冷代码不值得花力气编译,热代码越烫值得越深的优化。所以函数的"性能"不是一个数字,而是一条会变化的曲线——这一章是这条曲线的地图。

V8 is not one compiler — it's a pipeline of four. The same function climbs the tiers as it gets called more often. Each tier emits code closer to the metal, runs faster, but costs more time to compile in the first place.

This is V8's core tradeoff: cold code isn't worth optimizing, hot code is worth optimizing harder. A function's "performance" isn't a single number — it's a curve that moves over time. This chapter is the map of that curve.

Parser (source → AST) → Ignition (interpreter · bytecode) → Sparkplug (baseline JIT · asm, no opt) → Maglev (mid-tier JIT · asm, light opt) → TurboFan (peak JIT · asm, full opt)

▲ 绿线 · 收集 feedback,推动升级 / green · collect feedback, climb up

▼ 红线 · 假设打破,反优化回 Ignition / red · assumption broken, deopt back to Ignition

FIG. 02 V8 四层编译流水线。每一级输出都比上一级更接近 CPU 直接消费的形式。绿线"上"代表收集到足够的 type feedback 把代码升级,红线"下"代表运行时假设被打破,被迫反优化退回 Ignition。 V8's four-tier compile pipeline. Each output is closer to what the CPU runs raw. The green arrow climbs (more feedback → upgrade), the red arrow falls (assumption broken → deopt back to Ignition).

每一级在干什么

What each tier actually does

Parser

把源码字符串解析成 AST;随后 Ignition 的字节码生成器从 AST 生成第一版 bytecode。后面三层(Sparkplug / Maglev / TurboFan)的输入都是这个 bytecode,而不是源码。换句话说,V8 后续的优化全部基于字节码,源码到此为止。 Tokenizes and parses source into an AST; Ignition's bytecode generator then emits the first bytecode from it. Every later tier (Sparkplug, Maglev, TurboFan) consumes that bytecode, not the source. From here on V8's world is bytecode; the source is gone.

Ignition

字节码解释器。直接解释执行 bytecode,边跑边收集 type feedback(参数大概率是什么类型、对象大概率是什么 shape),写到一个叫 FeedbackVector 的结构里。所有冷代码都停在这一级,没必要再爬。 The bytecode interpreter. Runs bytecode instruction by instruction and collects type feedback (which types/shapes the args usually take) into a structure called the FeedbackVector. Cold code stays here; it has no reason to climb.

Sparkplug

2021 年加入,是一个非优化基线 JIT。它做一件事:把 bytecode 一对一翻译成原生汇编,跳过解释器的"取指令-decode-dispatch"开销。它不使用 feedback,所以编译几乎是免费的。 Added in 2021. A non-optimizing baseline JIT: emits a one-to-one translation of bytecode into native asm to skip the interpreter's fetch-decode-dispatch tax. It doesn't use feedback, so compiling is nearly free.

Maglev

2023 年加入的中间级优化 JIT。会看 feedback,但只做轻量优化,目标是用比 TurboFan 短得多的编译时间换接近 TurboFan 70% 的运行性能。Speedometer 跑分上比纯 Sparkplug 提升 ~21%。 Added in 2023. A mid-tier optimizing JIT that does read feedback, but only does light optimizations — targeting a fraction of TurboFan's compile time for ~70% of TurboFan's peak. Speedometer shows ~21% over Sparkplug alone.

TurboFan

峰值优化 JIT。基于 Sea-of-Nodes IR 做 inline 展开、逃逸分析、循环不变量外提、Inline Cache 内联等几十种优化。生成的机器码可能跟 C 编译器输出一样紧凑。代价:编译耗时毫秒级,内存占用大。 Peak optimizing JIT. Sea-of-Nodes IR with dozens of passes: inlining, escape analysis, LICM, IC inlining. The output can be as tight as a C compiler's. Cost: millisecond-level compile time and a fat code footprint.

同一段 JS · 五级输出对照

Same JS · five-tier output trace

把 Ch1 那个 function add(a, b) { return a + b; } 沿着五级流水线走一遍,你能直观看到每一级输出的代码长得有多不一样。点 tab 切换:

Trace Ch1's function add(a, b) { return a + b; } through all five tiers. Each tier's output looks dramatically different. Click a tab:

输入 · JS 源码 Input · JS source

function add(a, b) {
  return a + b;
}

PARSER AST · JSON-like tree

Program {
  body: [
    FunctionDeclaration {
      id: { name: "add" },
      params: [
        Identifier { name: "a" },
        Identifier { name: "b" }
      ],
      body: BlockStatement {
        body: [
          ReturnStatement {
            argument: BinaryExpression {
              operator: "+",
              left: Identifier { name: "a" },
              right: Identifier { name: "b" }
            }
          }
        ]
      }
    }
  ]
}
; No machine instructions yet, only a syntax tree.
; The four later tiers work from this tree (and the bytecode generated from it), not from the source text.

IGNITION bytecode · 6 bytes · 3 opcodes

[generated bytecode for function: add]
Parameter count 3 (including the receiver)
Register count 0 (no locals)
00 : Ldar a1        ; load b → accumulator
02 : Add a0, [0]    ; acc = a + acc, feedback slot 0
05 : Return
; Dispatched and executed by the Ignition interpreter.
; Instructions that record type feedback, such as Add, carry a [slot] operand.

SPARKPLUG baseline x86-64 · 1:1 字节码翻译 · 不读 feedback

push rbp
mov rbp, rsp
push rsi                      ; save context
push rdi                      ; save fn
;── Ldar a1 ──
mov rax, [rbp+0x10]           ; rax = b
;── Add a0, [0] ──
mov rcx, [rbp+0x18]           ; rcx = a
call Builtin::Add_Baseline    ; ★ calls the generic builtin, not inlined
;── Return ──
mov rsp, rbp
pop rbp
ret
; ★ Key point: the Add bytecode becomes a call. Sparkplug does not know
; the types of a/b, so it goes through the generic builtin. Roughly 3-5×
; faster than the interpreter, but still far from TurboFan.

MAGLEV mid-tier x86-64 · 读 feedback · 内联 add · 轻量 deopt

push rbp
mov rbp, rsp
;── load b with a light SMI check ──
mov rax, [rbp+0x10]    ; rax = b
test al, 0x1           ; SMI?
jne Maglev_Deopt       ; no → deopt
;── load a with a light SMI check ──
mov rcx, [rbp+0x18]    ; rcx = a
test cl, 0x1
jne Maglev_Deopt
;── inlined add ──
add rax, rcx           ; ★ inlined, no call
jo Maglev_Deopt        ; overflow
pop rbp
ret
; ★ Adds SMI checkpoints compared with Sparkplug, but the add is now inlined.
; Compile time is much shorter than TurboFan's; performance is ~70% of TurboFan.

TURBOFAN peak x86-64 · 完整 checkpoint + deopt scaffold

push rbp
mov rbp, rsp
;── full type guards (the protagonist of Phase II) ──
testb [rbx+0xf], 0x1             ; ★ checkpoint: a SMI?
jne CompileLazyDeoptimizedCode
testb [rcx+0xf], 0x1             ; ★ checkpoint: b SMI?
jne CompileLazyDeoptimizedCode
;── the actual add (raw registers, SMI confirmed) ──
mov rax, rbx
add rax, rcx                     ; ★ the core logic is this one line
jo OverflowDeopt                 ; overflow
pop rbp
ret
; ★ 1 line of JS → 11 instructions in this listing. Only the add is real logic;
; the rest (frame setup, checkpoints, jo, ret) is scaffold.
; That scaffold is the tax V8 pays compared with C.

FIG. 02b · interactive 同一个 add(a, b) 在 V8 流水线上的五种形态。Parser 出 AST 树;Ignition 出 3 条字节码;Sparkplug 出 ~10 行 baseline asm 但还要 call 通用 builtin;Maglev 内联了 add,用 SMI 检查替代 call;TurboFan 加了完整的 deopt 钢架,但核心 add 还是裸寄存器一条指令。每升一级,代码量增多,但运行时越来越接近 CPU 直接消费。 The same add(a, b) in five forms across V8's pipeline. Parser emits an AST tree; Ignition emits 3 bytecodes; Sparkplug emits ~10 lines of baseline asm that still calls a generic builtin; Maglev inlines the add, replacing the call with SMI guards; TurboFan adds the full deopt scaffold but the core add is still one register-level instruction. Each tier adds lines but moves runtime closer to what the CPU eats raw.

为什么不一上来就用 TurboFan

Why not just go straight to TurboFan

因为 TurboFan 的编译耗时本身就是性能成本——而且 TurboFan 需要 feedback 才能优化得好,没有 feedback 时它生成的代码也很平庸。

真实的程序里,大部分代码只跑几次:页面初始化的某个 setup 函数、某个一次性的 callback——把它们编进 TurboFan 是纯亏。所以 V8 用一个简单的启发式:

Because TurboFan's compile time is itself a performance cost — and TurboFan needs feedback to optimize well; without it, even TurboFan's output is mediocre.

In real apps, most code only runs a handful of times: a setup function on page load, a one-shot callback. Compiling those into TurboFan is a pure loss. So V8 uses a simple heuristic:

V8 的升级策略V8'S TIERING POLICY

cold → Ignition (interpret)
warm → Sparkplug (baseline asm)
getting hot → Maglev (mid-tier opt)
hot → TurboFan (peak opt)

"热"的判断在 v8::internal::TieringManager::ShouldOptimize——后面 Ch9 会拆开看。"hot" is decided in v8::internal::TieringManager::ShouldOptimize — we'll dissect it in Ch9.

这对优化意味着什么

What this means for optimization

三个推论:

性能是"跑过几次之后"的事。冷启动跑分跟稳态跑分可能差 5-10 倍,benchmark 一定要让函数预热(几千次循环)再测。
反优化(deopt)是性能噩梦。一次 deopt 把函数从 TurboFan 机器码退回 Ignition 解释器,等于从 Top Gear 换回 1 挡。后面 Ch7 单独讲怎么避免。
Maglev 让"小函数也能优化"。在 Maglev 之前,小热点函数有时撑不到 TurboFan 阈值就放弃了;Maglev 等于在中间多塞一档,让它们也能享受到一定优化。

Three implications:

Performance is "after-N-calls" performance. Cold and steady-state numbers can differ by 5–10×; any benchmark must pre-warm the function for thousands of iterations before timing.
Deopt is the performance nightmare. One deopt drops a function from TurboFan machine code back to Ignition — like shifting from top gear straight to first. Ch7 covers how to avoid it.
Maglev makes small hot functions matter. Pre-Maglev, small hotspots sometimes never crossed TurboFan's threshold and stayed un-optimized. Maglev gave them a middle gear to hit.
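The pre-warming advice can be sketched as a tiny timing harness. This assumes a runtime with a global `performance.now()` (recent Node versions and browsers); the iteration counts are arbitrary illustrations, not V8 thresholds, and the measured numbers will vary by machine and engine version.

```javascript
function add(a, b) {
  return a + b;
}

// Time `iterations` calls of fn(i, 1) and return the elapsed milliseconds.
function timeCalls(fn, iterations) {
  let sink = 0; // accumulate results so the loop body has an observable effect
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    sink += fn(i, 1);
  }
  const elapsed = performance.now() - start;
  return { elapsed, sink };
}

const cold = timeCalls(add, 1_000); // first calls, before any warm-up
timeCalls(add, 100_000);            // warm-up run, result discarded
const warm = timeCalls(add, 1_000); // same workload after warm-up

console.log(`cold: ${cold.elapsed.toFixed(3)} ms, warm: ${warm.elapsed.toFixed(3)} ms`);
```

Reporting both numbers, rather than only the warm one, keeps the cold-start cost visible.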

怎么观察:你的函数现在在哪一级? / HOW TO INSPECT: WHICH TIER IS MY FUNCTION ON?

用 --allow-natives-syntax 启动 node(Chromium 需要通过 --js-flags 传入),然后 %GetOptimizationStatus(fn) 会返回一个 bitmask:在较新的 V8 版本里,位 4 是 kOptimized、位 5 是 kMaglevved、位 6 是 kTurboFanned(位布局会随 V8 版本变化,以你所用版本的源码为准)。Ch20 工具箱里有完整命令清单。

Launch node with --allow-natives-syntax (for Chromium, pass it through --js-flags), then %GetOptimizationStatus(fn) returns a bitmask: in recent V8 versions bit 4 is kOptimized, bit 5 kMaglevved, bit 6 kTurboFanned (the layout changes between V8 versions, so check the source of the one you run). The full command list lives in the Ch20 toolbox.
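A small sketch of decoding that bitmask. The bit positions are the ones quoted above and should be treated as an assumption tied to the V8 version in use; `decodeStatus` and `getOptimizationStatus` are hypothetical helper names. The `%` call is built lazily so the file still parses without the flag.

```javascript
// Bit layout as quoted in the text; an assumption that may differ per V8 version.
const STATUS_BITS = {
  kOptimized: 1 << 4,
  kMaglevved: 1 << 5,
  kTurboFanned: 1 << 6,
};

// Turn a status bitmask into the list of flag names that are set.
function decodeStatus(mask) {
  return Object.keys(STATUS_BITS).filter((name) => (mask & STATUS_BITS[name]) !== 0);
}

// %GetOptimizationStatus only parses under --allow-natives-syntax, so the call
// is compiled at runtime and null is returned when the flag is absent.
function getOptimizationStatus(fn) {
  try {
    return new Function("fn", "return %GetOptimizationStatus(fn);")(fn);
  } catch {
    return null;
  }
}

function add(a, b) {
  return a + b;
}
for (let i = 0; i < 100_000; i++) add(i, 1);

const mask = getOptimizationStatus(add);
console.log(mask === null ? "run with --allow-natives-syntax" : decodeStatus(mask));
```

Run it as `node --allow-natives-syntax file.js` to see the decoded flags; without the flag it prints the hint instead.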


CHAPTER 04 · PIPELINE

字节码 vs 机器码 — 同一种东西的两副面孔

Bytecode vs machine code — two faces, one idea

为什么虚拟机指令集存在这个世界上

why virtual ISAs exist at all

"字节码"听起来像个魔法词,但它的本质很朴素——一个虚拟 CPU 的指令集。和 x86 / arm 这些真 CPU 的指令集相比,只差在"这世上没有一台 CPU 直接跑它"。

把 const a = 3 + 4 在 V8 里走完一遭,你会看到它在两种语言里出场两次:第一次是 Ignition 的字节码,第二次是 TurboFan 的机器码。下面把它们摆在一起对照看。

"Bytecode" sounds like a magic word, but underneath it's plain: an instruction set for a virtual CPU. The only difference from x86/arm is that no silicon ships with a bytecode decoder built in.

Walk const a = 3 + 4 through V8 and you'll see it appear in two languages: first as Ignition bytecode, then as TurboFan machine code. Side by side:

▸ V8 BYTECODE · Ignition

// function add() { return 3 + 4 }
LdaSmi [3]         // load 3 into the accumulator
Star0              // acc → r0
AddSmi [4], [0]    // acc += 4 (feedback slot 0)
Return
// 4 instructions · stack + accumulator
// dispatched and executed by the Ignition interpreter

▸ MACHINE CODE · TurboFan x86-64

; the same add(), as emitted by TurboFan
mov eax, 3
add eax, 4
ret
; 3 instructions · register-based
; consumed directly by the CPU pipeline
; (constant folding can reduce this further to mov eax, 7)

FIG. 03 同一个函数,左边是 V8 字节码(给 Ignition 解释器消费),右边是 TurboFan 输出的 x86-64 机器码(给 CPU 直接消费)。看起来不一样,本质上做的是同一件事:取数 → 加 → 返。 Same function. Left: V8 bytecode for the Ignition interpreter. Right: TurboFan-emitted x86-64 for the CPU itself. Different surface, same job: load → add → return.
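The constant folding mentioned in the machine-code panel can be sketched as a pass over a tiny AST. The node shapes (`NumericLiteral`, `Identifier`, `BinaryExpression`) are hypothetical and modeled loosely on the tree shown in Chapter 3; this is not how TurboFan represents code internally.

```javascript
// Fold "+" expressions whose operands are both numeric literals.
function fold(node) {
  if (node.type !== "BinaryExpression") return node;
  const left = fold(node.left);
  const right = fold(node.right);
  const bothLiteral = left.type === "NumericLiteral" && right.type === "NumericLiteral";
  if (bothLiteral && node.operator === "+") {
    return { type: "NumericLiteral", value: left.value + right.value };
  }
  return { ...node, left, right }; // keep the node, but with folded children
}

// Small constructors for the hypothetical node shapes.
const lit = (value) => ({ type: "NumericLiteral", value });
const id = (name) => ({ type: "Identifier", name });
const plus = (left, right) => ({ type: "BinaryExpression", operator: "+", left, right });

const folded = fold(plus(lit(3), lit(4)));                 // 3 + 4
const partial = fold(plus(id("a"), plus(lit(3), lit(4)))); // a + (3 + 4)

console.log(folded, partial);
```

The second case shows the limit: once a runtime value such as `a` is involved, only the constant subtree can be folded.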

为什么字节码长这样

Why bytecode looks like this

V8 的字节码 ISA 是一个带累加寄存器的栈机(stack machine with accumulator)。LdaSmi [3] 里的 Lda 就是 "Load into accumulator",Star0 是 "Store accumulator to register 0"。这种设计有两个好处:

指令编码短。大部分指令只占 1-3 字节,读起来比基于"三地址码"的 IR 紧凑很多——这对 V8 启动速度很关键,因为冷启动阶段就是 Ignition 在跑。
解释器 dispatch 简单。每条字节码都有一个 handler,handler 之间通过 accumulator 串连——CPU 流水线友好,分支预测也好做。

但代价也明显:解释器每条指令都要走"取指令 → decode → 跳到 handler → 执行 → 跳回 dispatch"的循环——这个循环本身的开销,大概比裸跑机器码慢一个数量级。所以才有 Sparkplug:把 bytecode 一对一翻译成 asm,把这个 dispatch 循环踢掉。

V8's bytecode ISA is a stack machine with an accumulator. LdaSmi [3] means "Load Smi into accumulator"; Star0 means "Store accumulator into register 0". Two wins:

Compact encoding. Most instructions are 1–3 bytes — far tighter than a three-address IR. This matters for V8's start-up: the cold path runs in Ignition.
Simple dispatch. Each opcode has a handler; handlers chain through the accumulator. CPU-pipeline friendly and easy on branch prediction.

The cost is the interpreter loop: every instruction pays "fetch → decode → jump-to-handler → execute → jump-back". That loop alone is roughly an order of magnitude slower than running raw machine code. That's why Sparkplug exists — translate bytecode 1-to-1 into asm and kill the dispatch loop.
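The fetch, decode, dispatch loop can be made concrete with a toy accumulator machine. The opcode names mirror the ones quoted above, but the encoding (arrays of `[opcode, operand]`) and the single named register file are invented for this sketch and do not match Ignition's real format.

```javascript
// Toy accumulator machine: arguments and locals share one named register file.
function run(bytecode, args = {}) {
  const regs = { ...args };
  let acc = 0;
  let pc = 0;
  for (;;) {
    const [op, operand] = bytecode[pc++]; // fetch + decode
    switch (op) {                         // dispatch to the handler
      case "LdaSmi": acc = operand; break;             // load a small integer into acc
      case "Ldar":   acc = regs[operand]; break;       // load a register into acc
      case "Star":   regs[operand] = acc; break;       // store acc into a register
      case "Add":    acc = regs[operand] + acc; break; // acc = register + acc
      case "Return": return acc;
      default: throw new Error(`unknown opcode: ${op}`);
    }
  }
}

// const a = 3 + 4
const threePlusFour = [
  ["LdaSmi", 3], ["Star", "r0"], ["LdaSmi", 4], ["Add", "r0"], ["Star", "r1"], ["Return"],
];

// function add(a, b) { return a + b; }
const addFn = [["Ldar", "a1"], ["Add", "a0"], ["Return"]];

console.log(run(threePlusFour), run(addFn, { a0: 10, a1: 32 }));
```

Every instruction pays for the array read, the destructuring, and the `switch`; that per-instruction overhead is the part a baseline compiler removes.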

扩展阅读 / Extended · 栈机 vs 寄存器机:V8 为什么选了"带累加寄存器的栈机" / Stack vs register machines: why V8 chose "stack with accumulator"

"虚拟机指令集"的形态历史上分两大流派,选哪种几十年来一直是 VM 设计的第一刀:

栈机 (Stack machine):操作数在一条隐式栈上 push/pop。代表:JVM(Java bytecode)、CPython、.NET CIL、早期 Lua。
寄存器机 (Register machine):操作数显式编码到指令里(三地址码),没有栈。代表:Lua 5.0+、Dalvik(Android)、JavaScriptCore LLInt、Hermes。
带累加寄存器的栈机 (Stack-with-accumulator):V8 Ignition 选的混合形态——有一个隐式累加寄存器(acc)+ 256 个显式寄存器(r0~r255),所有运算结果默认进 acc,寄存器只用来"暂存"。

把 c = a + b 在三种 ISA 里编出来,差距一目了然:

Bytecode ISAs split into two historical schools — picking one is one of the first decisions any VM designer makes:

Stack machine: operands flow through an implicit stack, push/pop driven. Examples: JVM (Java bytecode), CPython, .NET CIL, early Lua.
Register machine: operands are encoded explicitly into the instruction (three-address code), no stack. Examples: Lua 5.0+, Dalvik (Android), JavaScriptCore LLInt, Hermes.
Stack with accumulator: V8 Ignition's hybrid — one implicit accumulator register (acc) + 256 explicit registers (r0~r255). All ops default to writing the result into acc; registers are just temporaries.

Compile c = a + b in three ISA flavors and the differences become obvious:

栈机 / Stack machine · JVM · CPython

① push a → stack: [a]
② push b → stack: [a, b]
③ add → stack: [a+b]
④ store c → stack: []

iload a ; 1B
iload b ; 1B
iadd ; 1B
istore c ; 1B

+ 指令短(无操作数)、好生成、好移植 / tiny opcodes, easy to emit, portable
− 指令多(4 条做一件事),解释器要频繁读写"栈顶" / many ops per logical step (4 here), interpreter hammers stack top

带累加器 / Stack + accumulator · V8 Ignition

① Ldar a → acc = a
② Add b, [0] → acc = a+b
③ Star c → c = a+b
(acc 隐式 / acc is implicit)

Ldar a ; 2B
Add b, [0] ; 3B (含 fb / incl. feedback slot)
Star c ; 2B

+ 指令短(acc 不占操作数位) / tight (acc costs no operand bits)
+ 比纯栈机快 / faster than pure stack
− 编码不如纯寄存器机简单 / encoding rules slightly trickier than pure register

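寄存器机 / Register machine · JSC · Hermes · Lua 5+

① add r3, r1, r2 → r3 = r1+r2
r1, r2 已是输入,r3 已是输出(单条搞定) / r1 and r2 already hold the inputs, r3 receives the output (one instruction does it all)

add r3, r1, r2 ; 4B
; 1 instruction = 1 high-level operation
; but each instruction is longer

+ 解释器跑得最快(指令少) / interpreter is fastest (fewer instructions)
− 指令长(操作数占位多),代码体积更大 / longer per-op encoding → larger code size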
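The trade-off between the two pure styles can be sketched by running `c = a + b` on a toy stack machine and a toy register machine and counting dispatches. Both instruction formats are invented for illustration and are not the encodings of any real VM.

```javascript
// Toy stack machine: operands travel through an implicit stack.
function runStack(program, vars) {
  const stack = [];
  let dispatches = 0;
  for (const [op, name] of program) {
    dispatches++;
    if (op === "push") stack.push(vars[name]);
    else if (op === "add") { const b = stack.pop(); const a = stack.pop(); stack.push(a + b); }
    else if (op === "store") vars[name] = stack.pop();
    else throw new Error(`unknown stack op: ${op}`);
  }
  return dispatches;
}

// Toy register machine: every instruction names its destination and sources.
function runRegister(program, regs) {
  let dispatches = 0;
  for (const [op, dst, lhs, rhs] of program) {
    dispatches++;
    if (op === "add") regs[dst] = regs[lhs] + regs[rhs];
    else throw new Error(`unknown register op: ${op}`);
  }
  return dispatches;
}

const vars = { a: 2, b: 3 };
const stackDispatches = runStack([["push", "a"], ["push", "b"], ["add"], ["store", "c"]], vars);

const regs = { r1: 2, r2: 3 };
const registerDispatches = runRegister([["add", "r3", "r1", "r2"]], regs);

console.log({ stackDispatches, c: vars.c, registerDispatches, r3: regs.r3 });
```

The stack version dispatches four short instructions, the register version one long instruction with three operands, which is the size-versus-dispatch trade the panels describe.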

主流引擎选型对照

What the major engines picked

引擎 · 语言 / Engine · lang	ISA 形态 / ISA shape	为什么这么选 / Why
V8 Ignition · Chrome / Node	栈 + 累加器(混合) / stack + accumulator	2016 年从纯 JIT 架构(直接把源码编成机器码,没有解释器)切到 Ignition · 优先缩短启动时间 + 节省内存(短字节码) · acc 让指令编码紧凑 / 2016 cutover from JIT-only to Ignition · optimized for fast start-up + small footprint (short bytecode) · acc keeps encoding compact
JSCore LLInt · Safari	寄存器机(3 地址码) / register (3-address)	优先解释器稳态吞吐 · 体积代价由 LLInt 的紧凑编码缓解 · FTL 后端 JIT(早期基于 LLVM,2016 年起改用 B3)也对寄存器 IR 友好 / optimized for steady-state throughput · LLInt's encoding mitigates the size cost · the FTL JIT backend (LLVM-based at first, B3 since 2016) also favors a register IR
SpiderMonkey · Firefox	纯栈机 / pure stack	最早期(1996)就这么定的 · ECMA-262 spec 本身用栈语义描述 · CacheIR 在解释器之上加优化 · 没改架构 / picked in 1996 · ECMA-262 spec itself uses stack semantics · CacheIR layered on top instead of redesign
Hermes · React Native	寄存器机 / register	没有 JIT(iOS 不允许动态生成代码) · 解释器是唯一跑代码的层,所以必须最快 · 寄存器机解释器吞吐高 / no JIT (iOS bans dynamic code-gen) · the interpreter is the only execution path, so it has to be fast · register machines lead in interpreter throughput
JVM · Java	纯栈机 / pure stack	1995 年的设计选择 · 优先跨平台编码紧凑 · HotSpot JIT 把字节码再编译成寄存器机器码,补回性能 / 1995 design call · prioritized cross-platform compact encoding · HotSpot JIT rewrites bytecode into register machine code at runtime
Lua 5+	寄存器机 / register	2003 年 Lua 5.0 发布时从栈机切到寄存器机 · 论文显示解释器吞吐提升 2-3× · 是这个领域的经典转折点 / Lua 5.0 (2003) switched from stack to register · published interpreter throughput jumped 2–3× · the canonical case study
CPython 3.11+	栈机(加 specialization) / stack + specialization	没改 ISA 形态 · 但 3.11 加了 "specializing adaptive interpreter":在不改字节码格式的前提下,运行时把热点字节码替换成类型专用版,变相获得寄存器机的部分收益 / ISA unchanged · but 3.11 added the "specializing adaptive interpreter": hot bytecodes are rewritten into type-specialized versions at runtime, a backdoor to some register-machine wins

V8 为什么选 "栈 + 累加器"

Why V8 specifically picked "stack + accumulator"

2016 年 Ignition 设计文档里写的三个原因:

启动速度优先:Web 主线程冷启动每减 1ms 都是用户能感知的。短字节码 = 短编译时间 = 短启动延迟。栈+累加器的编码比纯寄存器机短约 20-30%。
内存优先:Chrome 多 tab 场景每个 V8 isolate 都要存一份字节码。短字节码 = 小内存——这也是 Chrome 一直被骂"吃内存"时少有的反向收益。
给 JIT 留弹性:解释器吞吐不是 V8 的瓶颈(冷代码很快被 JIT 接管)。Ignition 故意做得"够用就好",把性能优化让给 Sparkplug / Maglev / TurboFan——这才是 V8 的设计哲学。

有意思的是:JSCore 走完全相反的方向——纯寄存器机 LLInt 是为解释器吞吐优化的,因为 JSC 历来更看重"没 JIT 时也得快"(早期 iOS 限制 + 服务端场景)。两个引擎的 ISA 选型,本质上反映的是两家公司对"JIT 之前那段时间该多重要"的不同押注。

The 2016 Ignition design doc listed three reasons:

Start-up first: every millisecond on the Web main thread cold start is user-perceptible. Short bytecode = short compile time = shorter time-to-first-paint. Stack+accumulator encoding is ~20–30% shorter than pure register.
Memory first: Chrome's multi-tab world stores a bytecode copy per isolate. Short bytecode = less memory — a rare "we got smaller" win for Chrome.
Defer to the JIT: interpreter throughput isn't V8's bottleneck — JITs take over quickly. Ignition is deliberately "good enough", leaving optimization to Sparkplug / Maglev / TurboFan. That's V8's design philosophy.

Tellingly, JSCore went the opposite way — its pure-register LLInt is optimized for interpreter throughput, because JSC has always cared more about "fast even without JIT" (early iOS restrictions + server-side use cases). The two engines' ISA choices reflect different bets on how much the time before the JIT kicks in matters.

读源码线索SOURCE READING POINTERS

想看 V8 字节码 ISA 的全部 opcode,直接读 v8/src/interpreter/bytecodes.h——里面定义了 ~200 条 opcode + 每条的操作数布局。Lua 5.0 改寄存器机的经典论文是 Roberto Ierusalimschy 的 "The Implementation of Lua 5.0"(2005),JS 引擎设计参考最多的就是这篇。

For V8's full bytecode ISA, read v8/src/interpreter/bytecodes.h — ~200 opcodes with operand layout. The canonical paper on Lua's switch to register-based VMs is Roberto Ierusalimschy's "The Implementation of Lua 5.0" (2005) — the most-cited reference in JS-engine design discussions.

汇编里多出来的那些"看不懂"指令

The "extra" instructions you'll see in real asm

真实的 TurboFan 输出比上面 figure 里的三条指令长得多——你用 node --print-opt-code --allow-natives-syntax 打印出来,会看到一堆 cmp / jne / test 指令围着核心逻辑。这些不是逻辑本身,而是 V8 在做 checkpoint(类型检查)和 调用约定 的钢架。

Real TurboFan output is much fatter than the three-line figure above. Run node --print-opt-code --allow-natives-syntax and you'll see a swarm of cmp / jne / test around the core. That's not logic — it's V8's checkpoint machinery (type guards) plus the calling convention scaffold.

OUTPUT · node --print-opt-code --allow-natives-syntax ./add.js turbofan x86-64

; function add(a, b) { return a + b } (assumed Smi+Smi)
push rbp                          ; calling convention
mov rbp, rsp
testb [rbx+0xf], 0x1              ; checkpoint: is a a SMI?
jne CompileLazyDeoptimizedCode    ; no → deoptimize back to Ignition
testb [rcx+0xf], 0x1              ; checkpoint: is b a SMI?
jne CompileLazyDeoptimizedCode
add rbx, rcx                      ; ★ the actual logic is this one line ★
jo OverflowDeopt                  ; overflow → deoptimize
mov rax, rbx
pop rbp
ret

橙色高亮的是类型检查:它们在每次调用时验证"这次传进来的还是 SMI 吗?"。验证通过就走绿色那一行核心 add,验证失败就跳走反优化。上面这份清单里,1 行 JS 编出来 11 条指令,只有 add 那 1 条是逻辑,其余都是钢架(调用约定、checkpoint、溢出检查、返回)。

这套钢架就是后面 Phase II 的主角——assumption + feedback + checkpoint 三件套。它解释了"为什么 TurboFan 比 Ignition 快"和"为什么打破假设性能会突然崩塌"是同一件事的正反面。

The orange lines are type guards: they verify "is this still a SMI?" on every call. Pass → fall through to the green core add; fail → jump out to deopt. In the listing above, one line of JS becomes eleven instructions: only the add is logic, and the rest is scaffold (calling convention, checkpoints, overflow check, return).

That scaffold is the protagonist of Phase II: the assumption + feedback + checkpoint trio. It's why TurboFan is faster than Ignition and why breaking an assumption tanks performance — same coin, two faces.
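The assumption plus checkpoint pattern can be modeled in plain JavaScript. This is a simulation of the idea only: real guards are machine instructions and real deopts swap code, whereas here `genericAdd` simply stands in for "bail out to the slow path". All names are hypothetical.

```javascript
// 31-bit signed range used by the text for SMIs.
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;
const looksLikeSmi = (x) =>
  Number.isInteger(x) && x >= SMI_MIN && x <= SMI_MAX && !Object.is(x, -0);

let bailouts = 0;

// Slow path: handles every type combination, counted as a "deopt" in this model.
function genericAdd(a, b) {
  bailouts++;
  return a + b;
}

// Fast path: guarded by the same three checks as the listing above.
function speculativeAdd(a, b) {
  if (!looksLikeSmi(a)) return genericAdd(a, b);   // checkpoint: a
  if (!looksLikeSmi(b)) return genericAdd(a, b);   // checkpoint: b
  const sum = a + b;                               // the one line of real logic
  if (!looksLikeSmi(sum)) return genericAdd(a, b); // overflow check (the jo)
  return sum;
}

const fast = speculativeAdd(3, 4);
const strings = speculativeAdd("3", "4");
const overflow = speculativeAdd(SMI_MAX, 1);

console.log({ fast, strings, overflow, bailouts });
```

Results stay correct in every case; only the path taken, and therefore the cost, changes when an assumption breaks.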

JS 是动态的。
除非你让它看起来是静态的。 Field Note · 03

JavaScript is dynamic —
until you make it look static. Field Note · 03


CHAPTER 05 · PIPELINE

Tagged Pointer — 最低位决定的世界

Tagged Pointer — one bit decides reality

SMI 与 HeapObject 的 1 bit 之差

SMI vs HeapObject, decided by one bit

上一章那段汇编里出现了 testb [rbx+0xf], 0x1——它在检查最低一位。这个习惯并非 V8 独有,C/C++ 里叫 Tagged Pointer:用一个指针的若干低位携带类型信息,而不是另开一个字段。V8 的版本是这样的:

最低位是 0 → 这是一个 SMI(Small Integer,31 位有符号小整数),实际数值是把整个字右移一位读出来。
最低位是 1 → 这是一个 HeapObject 指针,真正的对象在堆上,需要解引用。

这一个 bit 决定了 V8 看到一个 64 位字时的两条完全不同的处理路径。下面把它点亮看看:

That testb [rbx+0xf], 0x1 in the previous chapter is checking the lowest bit. This isn't V8-only — in C/C++ it's called a Tagged Pointer: stuff type info into a pointer's low bits instead of carrying a separate field. V8's flavor:

low bit = 0 → it's a SMI (Small Integer, 31-bit signed). Read the value by shifting the word one bit to the right.
low bit = 1 → it's a HeapObject pointer. The real object lives on the heap; dereference required.

That single bit forks V8's handling of a 64-bit word into two completely different paths. Try lighting one up:

输入数 / number → SMI:整数,塞得下,走 SMI 快路径 / integer that fits → SMI fast path

图例 / Legend:数值位 / value bit (1) · SMI tag:末位 0 / low bit 0 · HeapObject tag:末位 1 / low bit 1

FIG. 04 · interactive 输入一个数,看看 V8 把它装进 32 位字时长什么样:整数(−2³⁰ ≤ n ≤ 2³⁰−1)走 SMI,末位是 0 表示"这个字本身就是数值";其他情况(浮点、大整数、字符串、对象)走 HeapObject,末位是 1 表示"这个字是个指针,得去堆上找真东西"。 Type a number and see how V8 packs it into a 32-bit word. Integers in the range −2³⁰ ≤ n ≤ 2³⁰−1 go SMI: low bit 0 means "the word IS the value". Anything else (floats, big ints, strings, objects) goes HeapObject: low bit 1 means "the word is a pointer, look on the heap".
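The tagging scheme can be sketched with 32-bit integer operations, assuming the 31-bit SMI layout described above. `tagSmi`, `untagSmi`, and `isSmiWord` are hypothetical helper names, and the fake pointer is only a number with its low bit set.

```javascript
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;

// Does the value fit the 31-bit signed SMI range?
function fitsSmi(n) {
  return Number.isInteger(n) && n >= SMI_MIN && n <= SMI_MAX;
}

// Tag: shift left by one, which leaves the low bit at 0.
function tagSmi(n) {
  if (!fitsSmi(n)) throw new RangeError(`${n} does not fit in a SMI`);
  return n << 1;
}

// The one-bit test a guard performs.
function isSmiWord(word) {
  return (word & 1) === 0;
}

// Untag: arithmetic right shift, which preserves the sign.
function untagSmi(word) {
  return word >> 1;
}

const word = tagSmi(42);
const fakePointer = 0x1000 | 1; // low bit 1 marks a HeapObject pointer

console.log(word.toString(2), isSmiWord(word), untagSmi(word), isSmiWord(fakePointer));
```

The tagged form of 42 is 84 (binary 1010100), the same bit pattern the SMI diagram shows with a trailing 0.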

为什么 V8 要这么省

Why V8 hoards bits like this

因为整数太常见了。一个普通页面跑起来,堆里大半是数字——下标、像素、毫秒、坐标、计数器。如果每个数字都老老实实地包成一个 HeapObject(配 hiddenClass / 元信息 / GC 头),内存和指针追逐都会把性能拖死。

用末位来 tag SMI,V8 可以做到:

SMI 不进堆——直接以 立即数 形式跑在寄存器里,加减乘除一条 CPU 指令搞定。
类型检查只是一个 testb——上一章那段汇编里的"是不是 SMI?"在 CPU 上只占 1 周期。
GC 不用扫 SMI——它们根本不是指针,GC 走过就跳过去。

Because integers are everywhere. In a real page, the heap is mostly numbers — indices, pixels, milliseconds, coordinates, counters. Boxing every one of them into a HeapObject (with a hiddenClass, meta header, and GC bits) would drown V8 in memory traffic and pointer chasing.

Tagging SMIs with the low bit lets V8:

Skip the heap for SMIs — they live in registers as immediate values; arithmetic is one CPU instruction.
Reduce the type check to a testb — the "is it a SMI?" guard from the previous chapter is one cycle.
Skip SMIs in GC — they aren't pointers, so the collector just moves past them.

扩展阅读 / Extended · SMI vs HeapObject:两种"装值"的方式,为什么 HeapObject "重" / SMI vs HeapObject: two ways V8 boxes a value, and why HeapObject is "heavy"

V8 在内部表达"一个 JS 值"的方式只有两种,这两种就是 SMI 和 HeapObject。所有 JS 里能拿到的东西——number / string / boolean / null / object / function——背后都得装进这两个之一。

SMI · 装得下,直接塞进指针里

SMI(Small Integer)是一个整数且足够小(31 位有符号,±2³⁰ 范围内)时 V8 走的那条路。它跳过堆,把数值编码进那个 64 位字本身:把数左移 1 位,末位填 0 标识"我是 SMI"。

读 SMI 时只需要右移一位拿数值;算术时直接用 CPU 寄存器,一条 add 指令搞定。整个过程不分配堆内存,不进 GC,不解引用。

HeapObject · 装不下的,丢去堆上

其他一切——浮点数、大整数、字符串、对象、数组、函数——都装不进 SMI,V8 会在堆上分配一块内存,把数据放那儿,然后用一个指针指过去。这个指针的末位是 1,V8 通过这一位区分"这是 SMI 还是要去堆上找的指针"。

但 HeapObject 不只是"一段值",它有一整套元信息——这是它"重"的根源。下面这张图把 SMI 跟 HeapNumber(用来装 42.5 这种浮点数)摆在一起对比:

Internally, V8 represents any JS value in exactly one of two ways: SMI or HeapObject. Everything you can hold in JS — numbers, strings, booleans, null, objects, functions — eventually fits into one bucket or the other.

SMI · small enough, packed into the pointer itself

SMI (Small Integer) is the path V8 takes when a value is an integer and small enough (31-bit signed, ±2³⁰). It skips the heap and packs the value directly into the 64-bit word: shift left by one and set the low bit to 0 to mark "I am a SMI".

Reading a SMI is just a right shift; arithmetic happens in CPU registers, one add instruction. No heap alloc, no GC visit, no dereference.

HeapObject · everything else lives on the heap

Anything bigger or more complex — floats, big integers, strings, objects, arrays, functions — doesn't fit. V8 allocates a chunk on the heap, puts the data there, and hands you a pointer. That pointer's low bit is 1; V8 uses that bit to know "this is a pointer, dereference me to find the real thing".

But a HeapObject isn't just "a value" — it carries a whole stack of metadata, and that's the source of its weight. Here's a SMI and a HeapNumber (the box used for floats like 42.5) side by side:

SMI · 装整数 42 / SMI · holding integer 42

值字本身 / word itself: 0…0 0010 1010 0 (SMI)

8 字节 · 1 个 word / 8 bytes · 1 word
0 次堆分配 · 0 次 GC 扫描 · 0 次解引用 / 0 heap allocs · 0 GC visits · 0 deref
运算就是一条 CPU add / arithmetic is one CPU add

vs

HeapNumber · 装浮点 42.5 / HeapNumber · holding float 42.5

指针字 / pointer word: 0x... 1 (HeapObject)
↓
Map ptr → HeapNumber · 8 B
value 42.5 · 8 B

至少 16 字节(含 Map 头 + GC 税) / ≥ 16 bytes (Map header + GC tax)
1 次堆分配 · 每次 GC 都要扫 / 1 heap alloc · scanned by every GC pass
读取要先解引用拿到 value / reading requires a deref to fetch the value

A plain object {x: 1, y: 2} (a JSObject) is heavier still: three header words (Map, properties, elements) plus the two in-object fields add up to 40 bytes, and the Map word references a separate Hidden Class:

JSObject · {x: 1, y: 2} 的全部底层占用JSObject · the full physical footprint of {x: 1, y: 2}

指针字: 0x... 1pointer word: 0x... 1 HeapObject

↓

Map ptr → Hidden Class8 B

Properties ptr8 B

Elements ptr8 B

in-obj #0 = 1 (= obj.x)8 B

in-obj #1 = 2 (= obj.y)8 B

≥ 40 字节(含 Map 头 + GC 税)· 还引用一个 Hidden Class≥ 40 bytes (Map header + GC tax) · plus a referenced Hidden Class
每次 GC 要追整条引用链 · 每次属性访问要走 IC 路径 GC walks the entire reference chain · every property access goes through the IC

So "SMI vs HeapObject" is one of the biggest cost differences inside V8, and it is decided by a single tag bit. Any number that cannot stay a SMI and has to be stored as a tagged value may cost a heap allocation, which is why V8 fights so hard to keep values in SMI form.

It's also why Phase III (Hidden Class, IC, Fast/Slow Properties) matters so much: every mechanism there is about making HeapObject access as cheap as SMI access.

扩展阅读extended "GC 头" 到底是什么 — V8 GC 在每个 HeapObject 上收的三份税 What "GC header" really means — the three taxes V8 GC charges per HeapObject

"GC 头"是个简化叫法。V8 里其实没有像 JVM 那种独立的"GC header"字段——每个 HeapObject 的第一个 word(8 字节)就是 Map pointer,这个指针同时承担三件事:

类型描述:Map 告诉 V8"这是个 HeapNumber / String / JSObject / ..."
Shape 描述:对 JSObject 来说,Map 就是 Hidden Class,记录有哪些字段、每个字段在哪个 offset。
GC 入口:GC 拿到一个 HeapObject,先读它的 Map,从 Map 上得知"这个对象身上哪几个 word 是指针"——这样才知道接下来要追哪些位置。

所以严格说,"GC 头"就是这个 Map pointer——它不是额外加的字节,就是对象本身的第一字段。HeapNumber 占 16 字节里,8 字节是 Map,另外 8 字节才是数值;Map 占了一半。

那为什么我管它叫"GC 税"

因为 Map 字段只是入场券。GC 真正每次跑都要在每个 HeapObject 上做的事情,远不止这一份字节占用。下面把 SMI 和 HeapObject 在一次完整 GC 周期里被对待的方式摆出来对比:

"GC header" is shorthand. V8 doesn't actually have a JVM-style separate "GC header" word — the first word (8 bytes) of every HeapObject is the Map pointer, and it pulls triple duty:

Type: the Map tells V8 "this is a HeapNumber / String / JSObject / ...".
Shape: for JSObjects, the Map IS the Hidden Class — recording field names and their offsets.
GC entry point: when the GC visits a HeapObject, it first reads the Map. The Map tells it "these N words on this object are pointers" — so the GC knows what to follow.

Strictly, "GC header" is that Map pointer — not extra bytes, just the object's first field. A 16-byte HeapNumber spends 8 of those on the Map; the other 8 carry the value. Half the box is the label.

So why call it "GC tax"

Because the Map word is just the entry ticket. The GC's per-HeapObject work on every cycle is much more than those bytes. Compare what happens to a SMI vs a HeapObject across a full GC pass:

GC phase	SMI	HeapObject
Allocation	none; lives in a register or stack slot	bump-allocated in NewSpace: advance top, write Map, write fields
Marking	low bit 0, so the GC skips it at a glance	read Map, learn which words are pointers, push to the mark queue, set a mark bit
Sweeping	N/A	unmarked memory is returned to the free list or claimed by compaction
Compaction	N/A	may be relocated; the old location holds a forwarding pointer and all referrers must be rewritten
Next survival	free	survivors are promoted from NewSpace to OldSpace (the object is copied, its Map is unchanged); LargeObjectSpace is for objects too big for a regular page, not for long-lived ones

Key observation: the SMI column is empty in every row; the HeapObject column pays at every step. An allocation-heavy Chrome page can trigger the Scavenger (the minor GC for NewSpace) many times per second. A scavenge only visits objects that are still reachable, but each of those has its Map read and is copied to a new location, and every object that died still cost an allocation. Cumulatively, that's the real weight of "HeapObject is heavy".

Sidebar: aren't mark bits per-object too?

Yes, but V8's mark bits don't live on the object itself. They live in a separate marking bitmap (a side table) with one bit per word-aligned heap address. So a HeapNumber really is 16 bytes and a JSObject really is 40 bytes; mark bits don't add to that. But when counting "GC work", the bitmap bit still costs one read and write per marked object.

So more precisely, the HeapObject "GC tax" comprises:

The object's Map pointer (8 bytes, lives on the object)
One mark bit in the side bitmap (off-object, but one per object)
The 4-step lifecycle (allocate / mark / sweep / maybe compact)
During compaction, possibly a forwarding pointer and pointer rewriting at every referrer

That's why an innocent-looking { x: 1 } on a hot path that creates and discards such objects at a high rate can have GC eat 10–20% of main-thread time. You aren't paying bytes; you're paying GC labor.
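
One common way to cut the allocation count on a hot path is to write results into a caller-supplied object instead of returning a fresh one per call. The sketch below is illustrative only: midpointAlloc, midpointInto and the scratch object are hypothetical names, not something from the measured code.

```javascript
// Allocating version: one fresh JSObject per call, each a future GC visit.
function midpointAlloc(a, b) {
  return { x: (a.x + b.x) / 2, y: (a.y + b.y) / 2 };
}

// Allocation-free version: the caller owns `out` and reuses it across calls.
function midpointInto(out, a, b) {
  out.x = (a.x + b.x) / 2;
  out.y = (a.y + b.y) / 2;
  return out;
}

const a = { x: 0, y: 0 };
const b = { x: 10, y: 4 };
const scratch = { x: 0, y: 0 };   // allocated once, outside the loop

let sum = 0;
for (let i = 0; i < 1000; i++) {
  b.x = i;
  sum += midpointInto(scratch, a, b).x;   // no per-iteration allocation
}
console.log(sum);   // 249750
```

The trade-off is aliasing: the caller must not keep a reference to scratch and expect it to stay unchanged after the next call.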

怎么测HOW TO MEASURE

Chrome DevTools → Performance → 录一段,在 Summary 里看 GC 那一段占帧时间多少。常见信号:Minor GC(NewSpace,毫秒以下)很多 → 大量短命对象;Major GC(OldSpace,十几毫秒)出现 → 长命对象太多 / 内存压力。Phase IV 的"逃逸分析"规则就是为了让短命对象根本不上堆,直接消除前者。

Chrome DevTools → Performance → record a session and look at GC's share in the Summary panel. Common signals: lots of Minor GC (NewSpace, sub-millisecond) → many short-lived objects; Major GC (OldSpace, 10s of ms) showing up → too many long-lived objects / memory pressure. Phase IV's "escape analysis" rule kills short-lived heap objects entirely so that Minor GC has nothing to scan.
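
Outside the browser, Node.js exposes per-space heap numbers through its built-in v8 module. The sketch below is an assumption about workflow (the text above only describes DevTools): it reads how many bytes are in use in the new and old generations before and after a hypothetical workload, which gives a rough signal of where a code path puts its objects.

```javascript
// Node.js only: inspect V8 heap spaces with the built-in `v8` module.
const v8 = require('v8');

function heapSpaces() {
  const bySpace = {};
  for (const s of v8.getHeapSpaceStatistics()) {
    bySpace[s.space_name] = s.space_used_size;   // bytes currently in use
  }
  return bySpace;
}

const before = heapSpaces();

// Hypothetical workload: many short-lived objects that die young.
let keep = 0;
for (let i = 0; i < 1e5; i++) keep += { x: i }.x;

const after = heapSpaces();

console.log('new_space bytes:', before.new_space, '->', after.new_space);
console.log('old_space bytes:', before.old_space, '->', after.old_space);
```

The exact numbers vary from run to run and an optimizing compiler may remove the allocation entirely, so treat the output as a trend indicator rather than a benchmark.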

SIZING DETAILS On 32-bit V8, a SMI is a 31-bit signed integer (−2³⁰ to 2³⁰−1); floats and out-of-range integers are boxed as HeapNumber, and BigInt values are separate heap objects. On 64-bit V8 with pointer compression (the default in Chrome, though not in stock Node.js builds), SMIs and pointers are both 32-bit values, and a full pointer is rebuilt by adding a per-isolate base address to the low 32 bits. This is one reason "passing a small integer is cheaper than passing a string": strings are always HeapObjects.

这跟"写快 JS"有什么关系

What this has to do with writing fast JS

Three direct, actionable consequences:

Keep hot-path numbers in SMI range when you can. JS has no separate integer type, so 1.0 + 2.0 is exactly the same computation as 1 + 2 and stays SMI; it is fractional or out-of-range results such as 0.1 + 0.2 that need a HeapNumber when stored. Likewise, after Math.floor(x) V8 can often keep the result as a SMI, provided it fits the range.
Prefer numbers over strings as keys / switch values. In hot functions, replace dynamic lookups such as obj[key] with obj.x when the name is fixed, and switch on integer enums rather than string literals; V8's check path becomes shorter.
Don't mix types in arrays. [1, 2, 'three'] moves the whole array's elements kind to the generic PACKED_ELEMENTS, and that transition is one-way, so later reads can no longer assume a SMI. [1, 2, 3] stays in PACKED_SMI_ELEMENTS, where V8 knows every slot holds a SMI.

Tagged Pointer isn't a curiosity. It's the minimal decision V8 makes behind every line of JS you write.
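
To catch mixed arrays before they reach a hot path, a dev-time check along these lines can help. It is a rough approximation written for this article: guessElementsKind is a hypothetical helper, the labels are simplified, and the 31-bit SMI range is an assumption about the build.

```javascript
// Dev-time helper (hypothetical): roughly predict how V8 would classify
// an array's contents. Real elements kinds have more variants than this.
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;   // assumes 31-bit SMIs

function guessElementsKind(arr) {
  let kind = 'SMI';
  for (let i = 0; i < arr.length; i++) {
    if (!(i in arr)) return 'HOLEY';              // a hole, e.g. [1, , 3]
    const v = arr[i];
    if (typeof v !== 'number') return 'GENERIC';  // strings, objects, ...
    const isSmi = Number.isInteger(v) && !Object.is(v, -0) &&
                  v >= SMI_MIN && v <= SMI_MAX;
    if (!isSmi) kind = 'DOUBLE';                  // keep scanning for worse
  }
  return kind;
}

console.log(guessElementsKind([1, 2, 3]));         // SMI
console.log(guessElementsKind([1, 2.5, 3]));       // DOUBLE
console.log(guessElementsKind([1, 2, 'three']));   // GENERIC
console.log(guessElementsKind([1, , 3]));          // HOLEY
```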

extended · Why obj.x beats obj[key]

obj.x and obj[key] look like two spellings of the same read, but V8 compiles them into two different bytecodes taking two different IC paths:

bytecode diff · v8/ignition

    // obj.x: the key "x" is known at compile time
    LdaNamedProperty <obj>, ['x'], [0]
    ; goes through LoadIC; the key is a constant, so a hit can
    ; become a single mov [base + offset]

    // obj['x']: a string-literal key is also a static name, so the
    // bytecode generator emits the same named load as for obj.x
    LdaNamedProperty <obj>, ['x'], [0]

    // obj[key]: the key is a runtime value
    LdaKeyedProperty <obj>, [0]        ; key is in the accumulator
    ; goes through KeyedLoadIC, which has to handle both
    ;   - string keys  (obj[name])
    ;   - SMI indices  (arr[0])
    ; polymorphic feedback is more likely; worst case is a dictionary lookup

(Recent V8 versions print these bytecodes as GetNamedProperty and GetKeyedProperty.)

Three practical differences:

Different bytecode: LdaNamedProperty carries the key index, so V8 knows the target property at compile time and can take the most direct path. LdaKeyedProperty doesn't.
Different IC state machine: LoadIC only handles "lookup by name", which is simple. KeyedLoadIC also has to handle numeric indices (arr[0]), so its fast paths have more checks and degrade to megamorphic more easily.
Different blast radius: an obj[key] site records feedback for every key and every shape it sees, so one site can be polluted by many usages. An obj.x site only ever looks up one name.

Note: because a literal obj['x'] is treated as a named load, it costs the same as obj.x in the V8 versions we are aware of. Writing .x is still worthwhile: it tells V8 and your future self that this is a static name, and it leaves no opening for someone to swap in a variable key later.

SAME REASONING APPLIES TO

switch (str) vs switch (intEnum) · string switches compare strings case by case (a pointer compare when both sides are internalized). Integer switches over dense small literal values can compile into a jump table, an O(1) direct branch. Likewise, obj.method() vs obj[name](): the dot form goes through a named load + call; the bracket form with a runtime key takes the slower KeyedLoadIC + call path.
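
A minimal sketch of the integer-enum pattern follows. Mode and cursorFor are illustrative names. Whether V8 actually emits a jump table depends on details such as the case labels being small integer literals, so the only claim made here is that every comparison is an integer compare rather than a string compare.

```javascript
// Integer enum instead of string tags (names are illustrative).
const Mode = Object.freeze({ IDLE: 0, DRAG: 1, RESIZE: 2 });

function cursorFor(mode) {
  switch (mode) {            // each case is a cheap integer comparison
    case Mode.IDLE:   return 'default';
    case Mode.DRAG:   return 'grabbing';
    case Mode.RESIZE: return 'nwse-resize';
    default:          throw new RangeError(`unknown mode ${mode}`);
  }
}

console.log(cursorFor(Mode.DRAG));   // grabbing
```

Freezing the enum object keeps its shape stable and prevents accidental reassignment of the constants.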

上一章prev ← 字节码 vs 机器码Bytecode vs asm 下一章next 三件套 — assumption + feedback + checkpointThe trio →

CHAPTER 06 · ASSUMPTIONS

三件套 — assumption + feedback + checkpoint

The trio — assumption + feedback + checkpoint

V8 是怎么"猜对"的

how V8 manages to guess right

到这里出现了一个真正的悖论:JS 是动态类型的语言,V8 凭什么能把它编译成跟 C 一样紧凑的机器码?

答案是 V8 不"知道",而是猜。它边跑边收集类型反馈,根据反馈做大胆假设,然后基于假设生成快路径机器码——同时在机器码里埋下类型检查 checkpoint,一旦假设被打破就立刻抛弃机器码,退回字节码解释执行。

这就是 V8 性能的三件套:

Here's the real paradox: JS is dynamically typed, so how does V8 ever produce C-tight machine code?

Answer: V8 doesn't "know" — it guesses. It runs, watches the types your function actually sees, makes bold assumptions, and emits fast-path asm based on those assumptions — with type-check checkpoints inlined so it can throw the asm away the moment the guess fails.

The trio:

#	名字Name	在哪一层Lives at	在做什么Does what
1	feedback	Ignition / Sparkplug 跑的时候while Ignition / Sparkplug runs	观察"这个函数被调用时,参数是什么类型,对象是什么 shape",写到 `FeedbackVector` 里。Watches what types/shapes flow through each call site, writing to a `FeedbackVector`.
2	assumption	Maglev / TurboFan 编译的时候when Maglev / TurboFan compiles	读 feedback 决定:"这次我假设两个参数都是 SMI",据此走快路径;feedback 越单态,假设越大胆。Reads feedback and decides: "I'll assume both args are SMIs". The more monomorphic the feedback, the bolder the bet.
3	checkpoint	编译出来的机器码里in the emitted machine code	每个假设都对应一行 `testb`/`cmp` 守卫——验证通过走快路径,失败立刻 deopt。Each assumption gets a `testb`/`cmp` guard inlined — pass → fast path; fail → deopt immediately.

三件套的协作流程

How the trio cooperates

把它画成时序就是:

As a timeline:

三件套时序TRIO TIMELINE

    [1]    add(1, 2)      → interpreted by Ignition; feedback: arg0=SMI, arg1=SMI
    [2]    add(3, 4)      → feedback reinforced: still SMI+SMI
           ... (~10K calls) ...
    [N]    ShouldOptimize() == true → TurboFan takes over
           assumption:   a is SMI && b is SMI && the add does not overflow
           emitted code: testb · jne deopt · add · jo deopt · ret
    [N+1]  add(1, 2)      → runs the machine code; checkpoints pass; returns almost immediately
    [X]    add('a', 'b')  → checkpoint fails; deopt; machine code discarded; back to Ignition

关键观察:checkpoint 是 assumption 的"保险栓"。没有它,V8 就不敢做大胆优化;有它,V8 可以激进到几乎跟 C 一样快。Key observation: checkpoint is the safety pin for assumption. Without it V8 wouldn't dare optimize aggressively; with it, V8 can be nearly as bold as a C compiler.
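
The same control flow can be mimicked in a few lines of ordinary JavaScript. This is a toy model written for this article, not how V8 is implemented: makeSpeculativeAdd, hotThreshold and the string-based feedback set are all invented for illustration.

```javascript
// Toy model of feedback → assumption → checkpoint. Nothing here is a V8 API;
// it only mirrors the control flow described above.
function makeSpeculativeAdd(hotThreshold = 3) {
  const feedback = new Set();      // [1] type combos observed so far
  let calls = 0;
  let fastPath = null;             // stands in for optimized machine code
  let deopts = 0;

  // The "interpreter": always correct, and it records what it sees.
  const generic = (a, b) => {
    feedback.add(`${typeof a}+${typeof b}`);
    return a + b;
  };

  function add(a, b) {
    if (fastPath !== null) {
      // [3] checkpoint: verify the assumption before taking the fast path
      if (typeof a === 'number' && typeof b === 'number') return fastPath(a, b);
      fastPath = null;             // assumption broken: discard the fast path
      deopts++;
      calls = 0;
      return generic(a, b);
    }
    const result = generic(a, b);
    // [2] assumption: "compile" only when hot and feedback is monomorphic
    if (++calls >= hotThreshold && feedback.size === 1 &&
        feedback.has('number+number')) {
      fastPath = (x, y) => x + y;
    }
    return result;
  }

  add.isOptimized = () => fastPath !== null;
  add.deoptCount = () => deopts;
  return add;
}

const add = makeSpeculativeAdd();
add(1, 2); add(3, 4); add(5, 6);
console.log(add.isOptimized());    // true
console.log(add('a', 'b'));        // ab
console.log(add.isOptimized());    // false
```

Note that after the string call the feedback set holds two entries, so this toy never re-optimizes; real V8 would recompile with a more general assumption instead.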

怎么观察这套机制

How to observe this in practice

三件套是看不见的——除非你打开 V8 自带的几个开关。这是分析慢 JS 的第一类工具:

The trio is invisible by default — until you flip V8's built-in switches. This is the first class of tool for analyzing slow JS:

OBSERVATION TOOLBOX · 观察工具 node / chromium flags

    ▸ --allow-natives-syntax
        enables %DebugPrint / %GetOptimizationStatus / %OptimizeFunctionOnNextCall
    ▸ --trace-opt
        prints every "function tiered up to TurboFan" event
        → [optimizing 0x... <JSFunction add> (target TURBOFAN) - took 0.8 ms]
    ▸ --trace-deopt
        prints every deoptimization event plus its reason (the failed checkpoint)
        → [bailout (kind: eager): begin. reason: not a smi]
    ▸ --print-opt-code
        prints the machine code TurboFan emits, so you can see what the checkpoints look like
    ▸ %DebugPrint(fn)
        call it from your code; prints the function's feedback vector and map

把这几个开关组合起来,你就能看见 V8 在背后做什么。下一章我们用真函数演示一遍——当假设被打破时,V8 是怎么 deopt 的。

Compose these switches and you can see what V8 is doing behind the scenes. The next chapter walks a real function through a deopt event.

上一章prev ← Tagged Pointer 下一章next 反优化第一现场First scene of deopt →

CHAPTER 07 · ASSUMPTIONS

反优化第一现场 — 当 add(1,2) 突然来了 add('a','b')

First scene of deopt — when add(1,2) meets add('a','b')

一段实测:打破假设之后性能为什么不只是变慢,而是断崖

a real measurement of why broken assumptions don't just slow you down — they cliff-drop

把上一章讲的"checkpoint fail → deopt"放到 benchmark 里看一眼。下面这段代码 V8 在执行时会跑出三段截然不同的性能,差距高达 3-5 倍:

Let's actually measure the "checkpoint fail → deopt" event from the previous chapter. The code below runs in three distinct performance regions, with a 3–5× swing:

BENCHMARK · deopt-on-type-mismatch node --allow-natives-syntax

    const { printOptimizationStatus } = require('./v8-print');

    function add(a, b) { return a + b; }

    // L1: monomorphic warm-up, all SMI+SMI
    console.time('mono');
    for (let i = 0; i < 99999999; i++) add(i, i);
    console.timeEnd('mono');           // → 66 ms

    // L2: one call with strings breaks the assumption
    add('a', 'b');                     // ★ once is enough ★
    printOptimizationStatus(add);      // → kIsLazy (deopted!)

    // L3: the same SMI+SMI loop, run again
    console.time('after-deopt');
    for (let i = 0; i < 99999999; i++) add(i, i);
    console.timeEnd('after-deopt');    // → 243 ms · 3.6× slower

66ms

L1 · monomorphic

deopt

L2 · one bad call

243ms

L3 · same code, 3.6× slower

为什么 L3 的代码跟 L1 一样,却跑得慢三倍

Why the same loop runs 3× slower after one bad call

因为 V8 把 L1 时编出来的 TurboFan 机器码扔了。L3 的循环重新从 Ignition 开始跑——而 Ignition 是字节码解释器,本身就慢一个数量级。等到再跑足够多次,V8 才会重新编译,但这次的 feedback 已经"被污染"了:它知道 a/b 既可能是 number 也可能是 string,所以新版本的 assumption 退化成 any+any,生成的机器码必须额外多打一份类型分支——比第一次的 mono-SMI 版本臃肿得多。

这就是反优化的真正成本:

立即成本:扔掉的那份 TurboFan 机器码白编了——几 ms 编译时间打水漂。
过渡成本:函数掉回 Ignition 的几千次解释执行,每次都慢 5-10×。
长期成本:重编后的版本是多态版,稳态性能也比单态版差得多——上面 L3 比 L1 慢 3.6× 就是这个原因。

Because V8 threw away the TurboFan machine code it had compiled in L1. L3's loop restarts in Ignition — the bytecode interpreter, an order of magnitude slower on its own. Eventually V8 re-compiles, but the feedback is now "polluted": it knows a/b can be either number or string, so the new assumption degrades to any+any, and the emitted asm has to carry extra type branches — fatter than the original mono-SMI version.

That's the real cost of deopt:

Immediate: the discarded TurboFan code wasted milliseconds of compile time.
Transient: thousands of interpreted calls in Ignition before re-tiering, each 5–10× slower.
Lasting: the recompiled version is polymorphic; even its steady state is much slower than the original monomorphic version — that's why L3 is 3.6× slower than L1.

怎么用 `--trace-deopt` 抓现场

How `--trace-deopt` catches it

    $ node --allow-natives-syntax --trace-deopt deopt-demo.js

    ; (... optimization log for L1 omitted ...)
    [bailout (kind: eager): begin. deopting 0x1f9422b23479 <Code kind=TURBOFAN>]
    [  reason: not a Smi]             ; ★ triggered by the L2 line
    [  bytecode offset: 7]
    [  function: 0x1f942f4cc121 <JSFunction add>]
    [  script: deopt-demo.js · line: 3]
    [bailout: end. ↩ Ignition]

"reason: not a Smi" 这一行就是分析慢 JS 时最常见的元凶——它告诉你 哪一行 JS、第几个字节码偏移、为什么触发了反优化。后面 Phase 4 主线函数的优化过程里,我们会用这条日志一行行倒推问题。

"reason: not a Smi" is the single most common smoking gun when chasing slow JS — it pinpoints which line, which bytecode offset, which assumption blew up. In Phase IV's main-line we'll use this exact log to backtrack issues line by line.

实战教训REAL-WORLD LESSON

一次"加日志"导致整个页面卡顿When "just adding a log" tanked a whole page

某次 PR 在一个被每帧调用上千次的格子计算函数里加了 console.log(arg),arg 偶尔是 undefined。结果 Profiler 显示这个函数突然慢了 4 倍——不是 console.log 的开销,而是 undefined 这种类型让函数 deopt 了,从此跑在多态机器码上。把日志移到外层(只在 dev 模式生效)后,性能立刻回到原状。A PR added console.log(arg) to a per-cell function called thousands of times per frame. arg was occasionally undefined. Profiler showed the function suddenly 4× slower — not because of the log itself, but because undefined deopted the function into polymorphic asm forever after. Hoisting the log to the outer scope (dev-only) restored performance instantly.

上一章prev← 三件套The trio 下一章next{Mono | Poly | Mega}morphic →

CHAPTER 08 · ASSUMPTIONS

{Mono | Poly | Mega} morphic — 一个 feedback slot 的命运

{Mono | Poly | Mega} morphic — a feedback slot's fate

单态最快,多态次之,巨态退化成解释器

monomorphic flies, polymorphic crawls, megamorphic gives up

上一章看到一次 add('a','b') 让函数 deopt——但实际情况比这更细。每个调用点的 FeedbackVector 都有一个 状态机,会随着接收到的类型种类逐步退化:

The previous chapter showed one add('a','b') triggering a deopt — but the truth is finer-grained. Each call site's FeedbackVector entry runs a state machine that degrades step by step as more type variations come through:

Monomorphic

单态

Mono

SMI · SMI

只见过一种类型组合。V8 可以做最深的优化,checkpoint 只有一行。

Has only seen one type combo. V8 inlines the deepest optimization; the checkpoint is a single line.

~1× · 基准 / baseline

Polymorphic

多态

Poly

见过 2–4 种 类型组合。V8 还能优化,但需要在快路径上多打几条 cmp/jne 分支。

Has seen 2–4 type combos. Still optimizable, but with extra cmp/jne branches inlined.

~2–3× slower

Megamorphic

巨态

Mega

Beyond 4 type combos. V8 stops specializing this site and falls back to generic dispatch through a shared megamorphic cache, which is far slower than an inlined fast path.

~5–10× slower

点一下加一种类型组合,看 feedback slot 怎么退化: click to add a new type combo, watch the slot degrade:

FIG. 05 · interactive 一个调用点的 feedback slot 随类型多样化而退化:1 种 → mono;2-4 种 → poly;>4 种 → mega 放弃优化。点上面的按钮试试看。 A call site's feedback slot degrades as more type combos arrive: 1 → mono, 2–4 → poly, >4 → mega (V8 gives up). Click the buttons above to try.
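
For readers without the interactive figure, the state machine can be approximated in code. This is a toy model, not a V8 data structure: FeedbackSlot and MAX_POLYMORPHIC are names chosen for this sketch, with the limit of 4 taken from the text above.

```javascript
// Toy model of one feedback slot (not a V8 API).
const MAX_POLYMORPHIC = 4;

class FeedbackSlot {
  constructor() { this.seen = new Set(); }

  record(typeCombo) {              // e.g. 'SMI+SMI' or a shape name
    // Once megamorphic, the slot stops tracking individual entries.
    if (this.state() !== 'megamorphic') this.seen.add(typeCombo);
    return this.state();
  }

  state() {
    const n = this.seen.size;
    if (n === 0) return 'uninitialized';
    if (n === 1) return 'monomorphic';
    if (n <= MAX_POLYMORPHIC) return 'polymorphic';
    return 'megamorphic';
  }
}

const slot = new FeedbackSlot();
console.log(slot.record('SMI+SMI'));         // monomorphic
console.log(slot.record('SMI+SMI'));         // monomorphic
console.log(slot.record('String+String'));   // polymorphic
```

The transitions only go one way in this model: there is no path from polymorphic back to monomorphic, which matches the "degrades step by step" description.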

为什么有 4 种这个具体数字

Why specifically 4

这是 V8 工程上的权衡。每多打一条类型分支,生成的机器码就多几行 cmp/jne,体积变大、缓存压力变大。V8 团队跑过大量 benchmark,发现 4 种以下的多态分支还能跑得比解释器快,超过这个就得不偿失了——干脆退回通用 dispatch。

这意味着:4 是工业经验,不是物理常数。但你写代码时只需要记一个原则:

It's V8's engineering tradeoff. Each extra type branch adds a few cmp/jne lines to the asm — code grows, i-cache pressure grows. V8 benchmarked extensively and found that polymorphism up to 4 still beats the interpreter; beyond that, it's a net loss — fall back to generic dispatch.

So 4 is empirical, not physical. As an author you only need one rule:

让每个热点函数
都尽量是 monomorphic 的。 The single most useful V8 heuristic.

Make every hot function
as monomorphic as you can. The single most useful V8 heuristic.
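
One practical way to follow this rule is to normalize inputs once at the boundary, so that the hot function only ever sees one type combination. The sketch below uses made-up names (cellArea, toInt, rawCells) and is only meant to show the shape of the pattern.

```javascript
// Hot function: written on the assumption that both inputs are integers.
function cellArea(w, h) {
  return w * h;
}

// Boundary helper (hypothetical): coerce once, outside the hot loop,
// so undefined and strings never reach cellArea.
function toInt(value, fallback = 0) {
  const n = Number(value);
  return Number.isFinite(n) ? Math.trunc(n) : fallback;
}

const rawCells = [{ w: 3, h: 4 }, { w: '5', h: 2 }, { w: undefined, h: 7 }];
const cells = rawCells.map(c => ({ w: toInt(c.w), h: toInt(c.h) }));

let total = 0;
for (const c of cells) total += cellArea(c.w, c.h);   // always number × number
console.log(total);   // 22
```

The coercion costs something too, so it belongs at the edge of the system (parsing, deserialization), not inside the loop it is protecting.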

怎么读 feedback slot 的当前状态

How to read a slot's current state

用 %DebugPrint(fn),然后翻到 feedback_vector 那一段,会看到类似:

Use %DebugPrint(fn) and find the feedback_vector section. You'll see something like:

%DebugPrint(test) excerpt

    …
    - feedback vector: 0x2c7742c43b89 <FeedbackVector>
    - shared function info: …
    - tiering state: TieringState::kNone
    - invocation count: 22554
    - slot #0 BinaryOp BinaryOp::SignedSmall {    // ★ Mono · has only seen SMI
        [0]: 1
      }

    // After feeding the same function a string once, print again:
    - slot #0 BinaryOp BinaryOp::Any {            // ★ degraded to Any
        [0]: 1
      }

看到 BinaryOp::SignedSmall 就放心(SMI 单态),看到 BinaryOp::Any 就要警觉了——这个 slot 已经退到最差。这是 Phase 4 主线优化里反复用到的第一个诊断信号。

BinaryOp::SignedSmall means you're golden (SMI mono); BinaryOp::Any means the slot has degraded to the worst case. This is the first diagnostic signal we'll reach for repeatedly in Phase IV's main-line.

上一章prev← 反优化第一现场First scene of deopt 下一章nextShouldOptimize →

CHAPTER 09 · ASSUMPTIONS

ShouldOptimize — 什么时候才会真正进 TurboFan

ShouldOptimize — when does TurboFan kick in

直接拆 V8 源码里的那个判断

reading the actual V8 source for the decision

"足够热"到底意味着什么?V8 把这件事写在了一个具体的函数里——v8::internal::TieringManager::ShouldOptimize。我们直接拆它:

What does "hot enough" actually mean? V8 codifies it in one function — v8::internal::TieringManager::ShouldOptimize. Let's read it:

v8/src/execution/tiering-manager.cc · ShouldOptimize simplified

    OptimizationDecision ShouldOptimize(FeedbackVector fbv, CodeKind kind) {
         SharedFunctionInfo shared = fbv.shared_function_info();
    L1   if (kind == CodeKind::TURBOFAN)
    L2     return DoNotOptimize();    // already on TurboFan, skip
    L3   if (TiersUpToMaglev(kind) &&
    L4       shared->PassesFilter(maglev_filter) &&
    L5       !shared->maglev_compilation_failed()) {
    L6     return Maglev();
    L7   }
    L8   if (V8_UNLIKELY(!v8_flags.turbofan ||
    L9       !shared->PassesFilter(turbo_filter) ||
    L10      v8_flags.efficiency_mode_disable_turbofan ||
    L11      isolate->EfficiencyModeEnabledForTiering())) {
    L12    return DoNotOptimize();    // power saving / disabled / excluded by filter
    L13  }
    L14  if (fbv.invocation_count() < v8_flags.minimum_invocations_before_optimization) {
    L15    return DoNotOptimize();    // ★ not called often enough ★
    L16  }
    L17  BytecodeArray bc = shared->GetBytecodeArray(isolate_);
    L18  if (bc.length() > v8_flags.max_optimized_bytecode_size) {
    L19    return DoNotOptimize();    // ★ function too long ★
    L20  }
    L21  return TurbofanHotAndStable();
    }

从开发者视角读这段

What this means as an engineer

L1: code that is already on TurboFan isn't optimized again.
L3: the Maglev decision; constrained by maglev_filter (see --maglev-filter=name).
L8: an explicit disable, power-saver or efficiency mode skips optimization. V8 flags are passed straight to node, e.g. node --turbo-filter=xxxxx followed by the script name; node --v8-options only prints the list of available flags.
L14: a function must be called enough times before it is optimized. This is the threshold on the performance curve: your first thousand cold-start calls won't make it into TurboFan. There's also efficiency_mode_delay_turbofan to push tiering further out.
L18: overly long functions are skipped. max_optimized_bytecode_size defaults to 60K bytes of bytecode. That's why our first move in Phase IV will be function decomposition: break giant functions into small ones so each can be optimized.
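
To make the order of the checks easier to experiment with, here is a paraphrase of the simplified listing in JavaScript. Everything in it is a stand-in: the fn fields, the flag names and especially minInvocations: 1000 are assumptions for illustration, not V8's real structures or defaults. Only the 60K size limit comes from the text.

```javascript
// Paraphrase of the simplified listing. Field and flag names are
// illustrative stand-ins, not V8's real data structures.
const defaultFlags = {
  turbofan: true,
  minInvocations: 1000,                 // hypothetical threshold
  maxOptimizedBytecodeSize: 60 * 1024,  // 60K, the default quoted in the text
};

function shouldOptimize(fn, flags = defaultFlags) {
  if (fn.tier === 'turbofan') return 'none';                        // L1
  if (fn.tier !== 'maglev' && fn.maglevAllowed) return 'maglev';    // L3
  if (!flags.turbofan) return 'none';                               // L8
  if (fn.invocationCount < flags.minInvocations) return 'none';     // L14
  if (fn.bytecodeLength > flags.maxOptimizedBytecodeSize) return 'none'; // L18
  return 'turbofan';                                                // L21
}

const hot = { tier: 'maglev', maglevAllowed: true,
              invocationCount: 5000, bytecodeLength: 400 };
const giant = { ...hot, bytecodeLength: 70000 };

console.log(shouldOptimize(hot));     // turbofan
console.log(shouldOptimize(giant));   // none
```

The giant case is the one to remember: no invocation count can rescue a function whose bytecode is over the size limit.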

Maglev 的跑分对照

Maglev's bench numbers

Configuration	JetStream score
Ignition	64
+ Sparkplug	93
+ TurboFan	279
+ Maglev	302
all four	327

FIG. 06 JetStream 跑分对照(分数越高越快)。Maglev 在已经有 TurboFan 的情况下还能再加 8% 左右,核心收益是缩短"够热再优化"的等待时间。数据来自 v8.dev/blog/maglev。 JetStream scores (higher = faster). Maglev still adds ~8% on top of TurboFan; its real win is shrinking the "wait until hot" gap. Data: v8.dev/blog/maglev.

推论:不要写超大函数TAKEAWAY · DON'T WRITE GIANT FUNCTIONS L18 那条 max_optimized_bytecode_size 是性能优化里最容易踩的坑——一个 1000 行的处理函数,V8 会因为字节码太长直接放弃优化它,无论你跑多少次都没用。Phase IV 的"函数拆解"规则之所以排第二,就是为了把这种函数拆出 TurboFan 阈值之内。 The max_optimized_bytecode_size at L18 is one of the easiest traps. A thousand-line handler can sit forever above the threshold — V8 simply skips optimizing it no matter how often it's called. That's why the "function decomposition" rule in Phase IV is non-negotiable.

上一章prev← Mono | Poly | Mega 下一章next2008 年的设计题A 2008 design puzzle →

CHAPTER 10 · OBJECT MODEL

一道 2008 年的设计题 — JSObject 的内存布局

A 2008 design puzzle — laying out JSObject in memory

假如你是当年 Google 的工程师

if you were Lars Bak in 2008

到这里我们已经讲完 V8 的编译流水线和假设系统。现在转入第三块,也是性能优化里最有趣的一块——对象内存模型。

用一个思想实验开场:假如你是 2008 年 Chrome V8 项目的工程师,任务是设计 JS 对象在内存里怎么布局,你会怎么做?先看 C 是怎么做的:

We've covered V8's compile pipeline and assumption system. Now into the third — and most rewarding — block: the object memory model.

A thought experiment: it's 2008, you're on the Chrome V8 team, and your job is to lay JS objects out in memory. How would you do it? First, here's how C does it:

C struct · 静态布局 x86-64 gcc -O2

    #include <stdio.h>

    struct Point { int x; int y; };

    /* returns nothing, so the return type is void */
    void printPoint(struct Point p) {
        printf("x=%d, y=%d", p.x, p.y);
    }

    ; compiled access to p.x and p.y:
    mov eax, DWORD PTR [rbp-12]    ; base + 0 → x
    mov edx, DWORD PTR [rbp-8]     ; base + 4 → y

静态语言的 struct 是一段连续线性内存。编译期就知道 x 在偏移 0、y 在偏移 4——属性访问就是O(1) 的偏移寻址。但这有两个不可调和的前提:

编译期就知道结构里有哪些字段。
结构是定长的,不能运行时加字段。

JS 全反过来——obj.foo = 42 可以在任何时刻给对象加属性,delete obj.foo 又可以随时拿走。所以你不能像 C 那样"一条 mov 指令搞定属性读取"。

A static struct is one contiguous block of memory. The compiler knows x is at offset 0, y at offset 4 — property access is an O(1) offset load. But that rests on two assumptions you can't break:

The compiler knows which fields exist at compile time.
The struct is fixed-size; you can't add fields at runtime.

JS shatters both. obj.foo = 42 can graft on at any moment; delete obj.foo rips off at will. So you can't get away with "one mov per property read".

最朴素的设计:存 [key, value] 数组

The naive design: an array of [key, value]

第一反应可能是:既然字段是动态的,那就存成 [key1, val1, key2, val2 ...] 这种"键值对数组"——每次读 obj.x 时遍历查找。

First instinct: if fields are dynamic, store them as [key1, val1, key2, val2 ...] and walk the array on every obj.x.

index	key	value
[0]	"x"	3
[1]	"y"	5
[2]	"z"	6

但有两个问题:

属性查找是 O(n)——对每个属性访问都要扫一遍 keys。
同一种 shape 的对象会重复存 key 名:有 100 万个 {x, y},内存里就有 100 万份 "x" / "y" 字符串。

第二个问题尤其致命——典型 Web 应用里同样 shape 的对象动辄几万几百万,这是不能接受的浪费。

Two problems:

Lookup is O(n) — scan keys on every access.
Objects of the same shape duplicate the key strings: a million {x, y} objects means a million copies of "x" and "y".

The second one's lethal — a real Web app has tens or hundreds of thousands of identically-shaped objects. That's unacceptable bloat.

V8 的解法:把 shape 抽出来

V8's answer: lift the shape out

V8 的设计是:每个 JSObject 有三类存储,加上一个指向 Hidden Class 的指针——这个 Hidden Class 才是"shape 的描述"。所有同 shape 对象共享一份。

V8 chose this: every JSObject has three storage areas plus a pointer to a Hidden Class — and the Hidden Class itself is the "shape description". All same-shape objects share one.

offset	field	holds
+0	*hiddenClass	→ shape
+8	*properties	→ named-props array
+16	*elements	→ indexed-elems array
+24	in-object #0	3 (= obj.x)
+32	in-object #1	5 (= obj.y)
+40	in-object #2	6 (= obj.z)

*hiddenClass

指向形状描述。所有 {x, y, z} 对象都指向同一个 Hidden Class——key 名只存一份。下一章详细拆。 Points to the shape descriptor. All {x, y, z} objects point at the same Hidden Class — keys are stored once. Next chapter dissects it.

*properties

指向命名属性数组。当属性多到 in-object 槽放不下,溢出来的就放这里。 Points to a named-properties array. When properties overflow the in-object slots, they spill here.

*elements

指向下标元素数组。专门存 arr[0] 这种数字下标的元素——下标访问是连续内存,极快。 Points to an indexed-elements array. Numeric-indexed (arr[0]-style) values live here — contiguous memory, very fast.

in-object

预留在 JSObject 本体里的属性槽。访问最快——直接 base + offset,跟 C 的 struct 一样!但要"预知 shape"才能用——这正是 Hidden Class + IC 配合的产物。 Slots reserved inside JSObject itself. Fastest access — base + offset, just like a C struct! But only when "shape is known", which is exactly what Hidden Class + IC give you.

这就解决了上面两个问题

This fixes both problems

不再重复存 key。100 万个 {x, y, z} 对象只存 1 份 "x" / "y" / "z"(在共享的 Hidden Class 里)。
属性查找可以变成 O(1)——只要事先知道这个对象的 shape。Ch11 讲 Hidden Class 怎么记录 shape,Ch13 讲 Inline Cache 怎么把它编进汇编偏移量。

但代价是:对象的 shape 一旦变化,Hidden Class 也得变。这就引入了 Phase III 的核心机制——Transition Chain(下一章)。

No more key duplication. A million {x, y, z}s share a single set of "x"/"y"/"z" strings (inside the shared Hidden Class).
Property lookup can become O(1) — provided we know the object's shape ahead of time. Ch11 covers how Hidden Class records the shape; Ch13 covers how Inline Cache compiles it into asm offsets.

The price: change the shape, change the Hidden Class. Hence the central mechanism of Phase III — Transition Chain (next chapter).
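
The idea of lifting the shape out can be shown with a small model. This is not V8 code: Shape and ToyObject are invented for this sketch, and a JS Map stands in for the DescriptorArray.

```javascript
// Toy model (not V8 code): objects store only values; the key → slot
// mapping lives in a shared Shape, the stand-in for a Hidden Class.
class Shape {
  constructor(keys) {
    this.slotOf = new Map(keys.map((k, i) => [k, i]));
  }
}

class ToyObject {
  constructor(shape, values) {
    this.shape = shape;      // one pointer to the shared description
    this.slots = values;     // plays the role of the in-object slots
  }
  get(key) {
    const slot = this.shape.slotOf.get(key);                    // shape lookup
    return slot === undefined ? undefined : this.slots[slot];   // base + offset
  }
}

const xyz = new Shape(['x', 'y', 'z']);    // key names stored exactly once
const p1 = new ToyObject(xyz, [3, 5, 6]);
const p2 = new ToyObject(xyz, [7, 8, 9]);

console.log(p1.get('y'), p2.get('z'));   // 5 9
console.log(p1.shape === p2.shape);      // true
```

In this model get still does a map lookup on every call; the inline cache covered later is what lets a real engine remember the slot index and skip that step.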

上一章prev← ShouldOptimize 下一章nextHidden ClassHidden Class →

CHAPTER 11 · OBJECT MODEL

Hidden Class / Shapes — 对象的骨架

Hidden Class / Shapes — an object's skeleton

同 shape 的对象共享一份描述

same-shape objects share one descriptor

"Hidden Class"是 V8 的术语,在 V8 源码里它的工程名是 Map(就是 %DebugPrint 里看到的那个 Map);Edge Chakra 叫 Types,JavaScriptCore 叫 Structure,SpiderMonkey 叫 Shapes。所有现代 JS 引擎都有同一个东西——只是名字不一样。

Hidden Class 内部最关键的子结构是 DescriptorArray——它记录"这个 shape 上有哪些 key、key 对应的 in-object 槽位下标是几"。下面用一个具体例子:

"Hidden Class" is V8's term — internally, the V8 source calls it a Map (yes, the same Map you see in %DebugPrint). Edge Chakra calls it Types, JavaScriptCore calls it Structure, SpiderMonkey calls it Shapes. Every modern JS engine has the exact same thing under different labels.

The most important sub-structure inside a Hidden Class is the DescriptorArray — it records "this shape has these keys, and each key corresponds to this in-object slot index". Concrete example:

JSObject (o)
*hiddenClass	→ Hidden Class
*properties	[]
*elements	[]
in-obj #0	11
in-obj #1	22

Hidden Class (Map · DescriptorArray)
"x"	offset 0
"y"	offset 1

FIG. 07 const o = { x: 11, y: 22 } 在 V8 内部的真实样子。JSObject 本体只存值,key 名一律由共享的 Hidden Class 描述。如果再创建一个 { x: 33, y: 44 },它会指向同一个 Hidden Class——这正是性能优化的杠杆所在。 What const o = { x: 11, y: 22 } really looks like in V8. The JSObject itself only carries values; key names are shared via the Hidden Class. Another { x: 33, y: 44 } would point at the same Hidden Class — that's the lever.

两个对象 shape 相同就能复用 Hidden Class

Same shape ⇒ same Hidden Class

关键性质:shape 完全相同的对象,共享同一个 Hidden Class 实例。

const o1 = { x: 11, y: 22 } · Hidden Class A
const o2 = { x: 33, y: 44 } · 同一个 Hidden Class A
const o3 = { y: 11, x: 22 } · 不同的 Hidden Class B(顺序变了!)
const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C(多了一个 key)

注意 o3 和 o1 的区别——属性赋值的顺序也是 shape 的一部分。这是 Phase IV 第 4 条改写规则的依据:"保持对象赋值顺序不变"。

The crucial property: objects of identical shape share the same Hidden Class instance.

const o1 = { x: 11, y: 22 } · Hidden Class A
const o2 = { x: 33, y: 44 } · same Hidden Class A
const o3 = { y: 11, x: 22 } · different Hidden Class B (order changed!)
const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C (extra key)

Note o3 vs o1 — assignment order is part of the shape. This underlies rule #4 of Phase IV: "keep property assignment order stable".

怎么验证两个对象 Hidden Class 是同一个?HOW TO VERIFY TWO OBJECTS SHARE A HIDDEN CLASS 用 %DebugPrint(obj),看输出里的 map: 0x... 字段。两个对象的 map 物理地址一样,就说明它们走的是同一个 Hidden Class——后面 IC 优化能命中同一份汇编。 Run %DebugPrint(obj) and look at the map: 0x... line. Same physical address = same Hidden Class = the same IC fast path will hit both.

in-object properties vs *properties

The previous chapter mentioned V8 has two places to store named properties: in-object (reserved inside the JSObject body) and the *properties array (overflow). The DescriptorArray in Hidden Class covers both; to you it's just obj.x, but V8 may take either path internally.

Which one? An empty object literal {} gets a small number of in-object slots up front (4 in current V8), so the first few properties added later go in-object and the rest overflow into *properties. For constructors and classes, V8 sizes the in-object area from the instances it observes, a mechanism called slack tracking (see Ch20). That's the foundation for Phase IV rule #5 ("declare class fields with defaults"): an object that has all of its fields from construction gets them placed in-object.


CHAPTER 12 · OBJECT MODEL

Transition Chain: growing a linked list as the object grows

watch the chain grow node by node

The previous chapter said same-shape objects share a Hidden Class. But how does a shape change? V8's answer is to link Hidden Classes into a transition chain: each property added to an object appends one Hidden Class node, and objects that added the same properties in the same order share the same nodes.

The original page has an interactive demo here: pressing "+x", "+y" and "+z" adds one property at a time and grows the chain by one node per press.

[Interactive figure: a JSObject with three pointers (*hiddenClass, *properties[], *elements[]); its current Hidden Class starts at ∅ (empty).]

FIG. 08 · interactive Each new property appends a node to the Hidden Class chain. Two objects that walk the same insertion path end up on the same node, which is the physical basis for sharing IC fast paths between them.

Why insertion order matters

The chain structure makes it obvious:

Two objects that look identical but aren't (shape pitfall)

// Path A: x first, then y
const a = {};
a.x = 1;
a.y = 2;
// → chain ∅ → "x" → "y" → Hidden Class A

// Path B: y first, then x (a different chain)
const b = {};
b.y = 2;
b.x = 1;
// → chain ∅ → "y" → "x" → Hidden Class B ≠ A

// Consequence: a function that receives both a and b becomes polymorphic.
// Its IC now has to check two shapes, and throughput can drop by 2-3×.

Two innocuous-looking blocks. In V8's eyes they end on two completely different Hidden Classes, and every function that takes both as arguments gets pushed into polymorphism. It is the easiest invisible trap to fall into when writing performance-critical code.

The fix is mechanical: initialize all fields up front, in a fixed order. Frameworks such as React and Vue do this deliberately for their internal per-component objects: every field is assigned in the same order, so that all instances stay on one Hidden Class.

Branches: when the chain forks

A chain only describes objects that grow along the same path. When two paths diverge at some step, the Hidden Classes form a tree of transitions instead:

TRANSITION TREE

∅ → x → y → u    // o1 = {x, y, u}
          ↘ v    // o2 = {x, y, v}

o1 and o2 share the first two steps, ∅→x→y; at step three they fork, and each gets its own new transition. Shared prefixes keep V8's Hidden Class count far below the Cartesian product of object shapes.
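The tree can be modelled in a few lines. This is a toy sketch with hypothetical names (ShapeNode, shapeOf), not V8's actual Map and TransitionArray structures. It only illustrates the rule "same insertion path, same node; diverging paths fork".

```javascript
// Toy model only: V8's real transition structures are more involved.
class ShapeNode {
  constructor(key, parent) {
    this.key = key;               // property added to reach this node (null for the root)
    this.parent = parent;         // back pointer to the previous shape
    this.transitions = new Map(); // property name -> child ShapeNode
  }

  // Follow the existing transition for `key`, or create it on first use.
  add(key) {
    let next = this.transitions.get(key);
    if (next === undefined) {
      next = new ShapeNode(key, this);
      this.transitions.set(key, next);
    }
    return next;
  }
}

const root = new ShapeNode(null, null); // the ∅ shape

// Walk a list of property insertions from the root and return the final shape.
function shapeOf(keys) {
  return keys.reduce((node, key) => node.add(key), root);
}

const a = shapeOf(['x', 'y']);
const b = shapeOf(['x', 'y']);       // same path -> same node
const c = shapeOf(['y', 'x']);       // different order -> different node
const o1 = shapeOf(['x', 'y', 'u']);
const o2 = shapeOf(['x', 'y', 'v']); // forks after the shared ∅ → x → y prefix

console.log(a === b, a === c, o1.parent === o2.parent); // true false true
```

Because transitions are looked up before they are created, the number of nodes grows with the number of distinct insertion paths, not with the number of objects.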


CHAPTER 13 · OBJECT MODEL

Inline Caches: turning string lookup into offset reads

the knife that cuts O(n) down to O(1)

Everything in Phase III leads here. The Inline Cache (IC) is the steepest part of V8's performance curve: it can cut an obj.x access from an O(n) lookup by name down to a single load at a fixed offset, O(1). The gap can exceed 100×.

A measurement: the same "service discovery" function written two ways, dynamic and static, each run 10M times:

dynamic lookup · 10M iterations · ~6.4 s

static lookup · 10M iterations · ~44 ms

function select(map, key) {
  if (key === 'userLogin') return map.userLogin;
  if (key === 'a') return map.a;   // ★ static key: the IC can turn it into an offset read
  if (key === 'b') return map.b;
  if (key === 'c') return map.c;
  return _404Handler;
}
// the same 10M calls finish in ~44 ms
// ↑ a ~145× speed gap

FIG. 09 · interactive The original page switches between the two variants with tabs; the static one is shown here. Same logic, same 10M calls: the dynamic map[key] version takes 6.4 s, the static map.a version 44 ms, a gap of ~145×. The difference is not in the function's logic but in whether V8 can compile the access into an IC.
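To try the comparison locally, something like the following can serve as a starting point. The dynamic variant is not reproduced in the text, so selectDynamic, the services table and the time helper are all assumptions made for illustration. A variant this small will not necessarily show the article's ratio, so treat the printed timings as indicative only.

```javascript
const _404Handler = () => 'not found';

// Hypothetical service table; every entry is a handler function.
const services = {
  userLogin: () => 'login',
  a: () => 'A',
  b: () => 'B',
  c: () => 'C',
};

// Dynamic form: the property name is only known at run time.
function selectDynamic(map, key) {
  return map[key] ?? _404Handler;
}

// Static form, as in the article: every access names its property literally.
function selectStatic(map, key) {
  if (key === 'userLogin') return map.userLogin;
  if (key === 'a') return map.a;
  if (key === 'b') return map.b;
  if (key === 'c') return map.c;
  return _404Handler;
}

// Tiny timing helper; numbers vary a lot between machines and Node versions.
function time(label, fn, iterations) {
  const keys = ['userLogin', 'a', 'b', 'c', 'missing'];
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(services, keys[i % keys.length]);
  console.log(label, (performance.now() - start).toFixed(1), 'ms');
}

time('dynamic', selectDynamic, 1_000_000);
time('static ', selectStatic, 1_000_000);
```

Both functions return the same handler for every key in the test set, so any timing difference comes from how the property is reached.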

What an IC looks like in asm: where the 145× comes from

Reading obj.a through the dynamic form map[key] and through the static form map.a gives V8 completely different work to do. Drawing out the steps of each call makes the gap concrete:

map[key] · dynamic key · O(N)

The key is only known at runtime, so V8 must walk the Hidden Class key table:

"x" · offset 24 · no match
"y" · offset 32 · no match
"a" · offset 40 · match
"b" · offset 48 · not reached
"c" · offset 56 · not reached

Step	Work	Cost
1	load map.hiddenClass	~3 cyc
2	load DescriptorArray	~3 cyc
3-5	3 pointer compares (keys are interned)	~9 cyc
6	load the offset entry	~3 cyc
7	final ldr to read the value	~3 cyc

Per call: ~50–200 cyc

map.a · static key · IC hit · O(1)

The hiddenClass pointer + offset are baked into the asm at compile time:

"a" → in-object slot 1 · offset 32 baked in
other keys: not consulted

Step	Work	Cost
1	cmp shape ptr (validate shape)	~1 cyc
2	ldr [obj + 32], load directly	~3 cyc

No loop, no string compares.

Per call: ~5 cyc

FIG. 09b The same obj.a read. On the left, the dynamic lookup walks the Hidden Class's key table: in this run it hits on the 3rd row, but when the target key sits at the end of the table it has to scan all the way down. On the right, once the IC has hit, only two steps remain: 1 cmp to check that the shape is unchanged and 1 ldr to read the value at a fixed offset. The cycle gap between these two paths is where the 145× in the opening benchmark comes from.

Written out as ARM64, the fast path on the right comes down to the 5 lines below. This is a simplified rendering of TurboFan's output (pointer tagging is ignored and the map address is shown as an immediate): the cmp + ldr correspond one-to-one to the two steps in the figure, and the extra b.ne is the deopt guard.

map.a · simplified TurboFan output · arm64 (M1)

; map.a, assuming map's Hidden Class lives at 0x3a8d76b74971
ldr  x0, [x4]                     ; load map.hiddenClass, the first word of the object (before step 1)
cmp  x0, 0x3a8d76b74971           ; ★ step 1: is the shape unchanged?
b.ne CompileLazyDeoptimizedCode   ; changed → deopt (fallback outside the fast path)
ldr  x0, [x4, #+32]               ; ★ step 2: load the value at the baked-in offset
                                  ;   (32 = 24-byte JSObject header + slot 1 × 8)
ret
; → one obj.a compiles to 5 instructions; the core is just two (cmp + ldr)
; → no loop, no string compares, no pointer chasing: that is all the magic of an IC

That ★ ldr x0, [x4, #+32] is the Inline Cache in the flesh: V8 has inlined "find the property by key" into "read at a fixed offset". The offset can be hard-coded because the cmp line has already verified that the shape is unchanged; if the shape does change, the deopt discards this code and it is compiled again. The "cache" is inlined into the machine code itself, which is where the name comes from.

Static beats dynamic.
Not a style preference: a 145× gap. Field Note · 03

IC states

ICs follow the same state machine as Chapter 8: uninitialized (before the access has run) → monomorphic (one Hidden Class seen) → polymorphic (2–4 different Hidden Classes) → megamorphic (more than 4; the per-site cache is abandoned). "Same shape" and "static key" are two faces of the same thing: the IC fast path only applies when both hold.
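The state machine can be sketched as a toy model. ToyInlineCache and its fields are hypothetical names, and the threshold of 4 is the one quoted in this chapter. Real ICs live in V8's feedback vector and are considerably more elaborate.

```javascript
// Toy model of one property-access site; not V8's actual data structures.
const MAX_POLYMORPHIC = 4; // the chapter's threshold: more than 4 shapes -> megamorphic

class ToyInlineCache {
  constructor() {
    this.seen = new Map(); // shape id -> cached property offset
    this.megamorphic = false;
  }

  get state() {
    if (this.megamorphic) return 'megamorphic';
    if (this.seen.size === 0) return 'uninitialized';
    return this.seen.size === 1 ? 'monomorphic' : 'polymorphic';
  }

  // Record that an object with `shapeId` reached this access site.
  record(shapeId, offset) {
    if (this.megamorphic || this.seen.has(shapeId)) return;
    if (this.seen.size === MAX_POLYMORPHIC) {
      this.megamorphic = true; // give up on per-shape entries
      this.seen.clear();
      return;
    }
    this.seen.set(shapeId, offset);
  }
}

const ic = new ToyInlineCache();
const states = [ic.state];
for (const shapeId of ['A', 'A', 'B', 'C', 'D', 'E']) {
  ic.record(shapeId, 32);
  states.push(ic.state);
}
console.log(states.join(' → '));
```

Seeing shape A twice leaves the site monomorphic; only a fifth distinct shape pushes it to megamorphic, and in this model nothing brings it back.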


CHAPTER 14 · OBJECT MODEL

Fast Properties vs Slow Properties: the cost of `delete`

caching's worst enemy is invalidation

Everything so far has been about Fast Properties: Hidden Class + IC compress a property access into one ldr. One operation can kick an object off that fast path entirely and demote it to Slow Properties, where access is dozens to hundreds of times slower.

That operation is delete.

%HasFastProperties · before / after delete · node --allow-natives-syntax

> const obj = { x: 123, y: 555 };
> console.log('after init:', obj, %HasFastProperties(obj));
→ {x: 123, y: 555} true
> obj.xxxxx = 123;
> console.log('after adding a member:', obj, %HasFastProperties(obj));
→ {x: 123, y: 555, xxxxx: 123} true ; ★ adding a property keeps it Fast
> delete obj.x;
> console.log('after deleting a member:', obj, %HasFastProperties(obj));
→ {y: 555, xxxxx: 123} false ; ★ after the delete it drops to Slow!

The transcript deletes obj.x rather than the property that was just added: recent V8 versions can undo the deletion of the most recently added property by stepping back along the transition chain, so deleting obj.xxxxx here may still report true.

Why V8 stops maintaining the Hidden Class after delete

Supporting delete on the fast path opens three cans of worms:

What happens to the freed in-object slot? Compact the later properties into it, and the offsets other code relies on break. Leave it empty, and the cached IC offsets no longer match the Hidden Class.
What about the other objects on the same Hidden Class? Keep them pointing here, and references go stale (o1 has lost x while the class still lists it); fork to a new class, and every IC that points at the old class must be invalidated.
Every offset already inlined into TurboFan machine code would need to be re-emitted.

All three are hard. V8 takes the simple way out: in the general case, an object that has had a property deleted is demoted to Slow Properties. Its properties move into a dictionary (much like a Map<string, Value>), and in-object storage + IC optimization are abandoned.

Dictionary access is a hash lookup, which can be dozens to a hundred times slower than an IC hit. The demotion is also effectively one-way: V8 converts objects back to fast mode only in a few special cases, so for ordinary application objects, once Slow means Slow for good.

REAL-WORLD LESSON

"Cleanup" tanks performance instead of helping

Some code ran delete obj.bigPayload on every cache object at the end of a loop to "free memory". On the next frame, every function that read properties of those objects deopted: the cache objects had all dropped to Slow Properties, and the whole module ran 4× slower. The fix is obj.bigPayload = null (or = undefined), which leaves the Hidden Class untouched while still letting the GC reclaim the referenced memory.

RULE In performance-critical code, avoid delete wherever you can. To "clear" a property, assign obj.foo = null or obj.foo = undefined instead: null states "no value" explicitly, undefined stays compatible with code that checks for a missing value. The Hidden Class survives, ICs stay valid, and the GC still reclaims the referenced memory.
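A minimal sketch of the null-instead-of-delete cleanup, assuming a hypothetical cache array and releasePayload helper. Plain JavaScript cannot observe the Hidden Class, so the check below only confirms that the key set is unchanged, which is what lets the shape stay put.

```javascript
// Hypothetical cache entries; `bigPayload` stands in for any large field.
const cache = [
  { id: 1, bigPayload: new Array(1000).fill(0) },
  { id: 2, bigPayload: new Array(1000).fill(0) },
];

// Drop the reference but keep the property, so the set of keys never changes.
function releasePayload(item) {
  item.bigPayload = null;
}

const keysBefore = cache.map((item) => Object.keys(item).join(','));
cache.forEach(releasePayload);
const keysAfter = cache.map((item) => Object.keys(item).join(','));

console.log(keysBefore, keysAfter);                       // same key lists before and after
console.log(cache.every((item) => 'bigPayload' in item)); // true: the property still exists
```

The arrays themselves become unreachable once the references are overwritten, so the GC can collect them just as it would after a delete.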


CHAPTER 15 · HOT FUNCTION

Before · ~240 ms: the polymorphic mess

putting all 14 chapters' diagnostic tools to work

The previous 14 chapters were the knives. This chapter takes the main-line function and dissects it with them.

The function fits in one sentence: normalize any input (number / string / object) into a rem value. In our codebase it is called hundreds of times per frame during layout, and the profiler shows it as an obvious hot spot:

v0 · naive · before any optimization

// input:  number | string | { value, unit }
// output: rem value (number)
function px2rem(input, base) {
  let value, unit;
  if (typeof input === 'number') {                   // case A
    value = input;
    unit = 'px';
  } else if (typeof input === 'string') {            // case B
    const m = input.match(/^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/);
    value = m ? parseFloat(m[1]) : 0;
    unit = m && m[2] || 'px';
  } else if (input && typeof input === 'object') {   // case C
    value = input.value;
    unit = input.unit || 'px';
  } else {                                           // case D
    return 0;
  }
  if (unit === 'rem') return value;
  if (unit === 'em') return value;
  if (unit === '%') return value / 100;
  return value / base;                               // 'px' by default
}

Cut #1: profile and time it

$ node --allow-natives-syntax bench.js · timing

// feed three kinds of input to mimic the real call distribution
const samples = [12, '14px', { value: 16, unit: 'rem' }, 20, '1.5em'];
console.time('v0');
for (let i = 0; i < 1_000_000; i++) px2rem(samples[i % 5], 16);
console.timeEnd('v0');
// → v0: 243.7 ms

A NOTE ON THESE NUMBERS The 240ms / 24ms / 145× / 10× figures in this piece are typical magnitudes on an M1 MacBook Pro with Node 22, used to tell the story. On your machine the ratios may land anywhere from 2× to 20× depending on hardware, Node version and loop size. What is deterministic is how V8's internal feedback state machine changes: polymorphism produces BinaryOp::Any, and deoptimization writes not a Smi to the deopt log. Those signals do not depend on the milliseconds you measure. Judge by the signals, and don't fixate on reproducing a specific multiplier.

Cut #2: ask V8 what it thinks of this function

%GetOptimizationStatus(px2rem) + %DebugPrint(px2rem) · diagnosis

// %GetOptimizationStatus output:
→ kIsFunction | kOptimized | kTurboFanned   ; already on TurboFan, yet still slow

// %DebugPrint, key excerpt:
- feedback vector:
  - invocation count: 1000000
  - slot #0 BinaryOp BinaryOp::Any             ; ★ degraded to Any!
  - slot #4 LoadProperty (LoadIC) MEGAMORPHIC  ; ★ megamorphic!
  - slot #7 Compare CompareOp::Any             ; ★ type guard degraded

try it yourself · have V8 tell you where this code is sick: full a.js + node command

Save the snippet below as a.js and run node --allow-natives-syntax a.js. You'll get the same kind of output as the cut above, and can see BinaryOp::Any and MEGAMORPHIC with your own eyes.

a.js · v0 · deliberately polymorphic

// v0, the naive version: the same polymorphic mess as in the section above
function px2rem(input, base) {
  let value, unit;
  if (typeof input === 'number') {
    value = input;
    unit = 'px';
  } else if (typeof input === 'string') {
    const m = input.match(/^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/);
    value = m ? parseFloat(m[1]) : 0;
    unit = m && m[2] || 'px';
  } else if (input && typeof input === 'object') {
    value = input.value;
    unit = input.unit || 'px';
  } else {
    return 0;
  }
  if (unit === 'rem') return value;
  if (unit === 'em') return value;
  if (unit === '%') return value / 100;
  return value / base;
}

// 1. Feed mixed types to trigger polymorphism (the key step: without the mix
//    the function stays monomorphic and the symptoms never show)
const samples = [12, '14px', { value: 16, unit: 'rem' }, 20, '1.5em'];
console.time('v0');
for (let i = 0; i < 1_000_000; i++) px2rem(samples[i % 5], 16);
console.timeEnd('v0');

// 2. Force TurboFan compilation on the next call (so we inspect the peak tier)
%OptimizeFunctionOnNextCall(px2rem);
px2rem(12, 16);

// 3. Ask V8 what it thinks of the function
const status = %GetOptimizationStatus(px2rem);
console.log('\n--- %GetOptimizationStatus ---');
console.log('raw:', status, ' binary:', status.toString(2));
const FLAGS = [
  [0, 'kIsFunction'], [3, 'kMaybeDeopted'], [4, 'kOptimized'],
  [5, 'kMaglevved'], [6, 'kTurboFanned'], [7, 'kInterpreted'],
  [14, 'kMarkedForDeoptimization'],
];
for (const [b, n] of FLAGS) if (status & (1 << b)) console.log(' ★', n);

// 4. Full internal dump; the feedback vector is near the bottom of the output
console.log('\n--- %DebugPrint ---');
%DebugPrint(px2rem);

How to run

terminal · node 22+

$ node --allow-natives-syntax a.js

# to see the deopt log + TurboFan's actual machine code, turn the traces on:
$ node --allow-natives-syntax \
    --trace-opt --trace-deopt --print-opt-code \
    a.js 2>&1 | less

How to read the output

raw status is an integer; each binary bit is one flag. For example, 81 = 1010001 = bits 0+4+6 = kIsFunction | kOptimized | kTurboFanned, which is the status line shown in Cut #2.
The feedback vector section of %DebugPrint is where diagnosis really happens. Look for entries of the form slot #N <op>::<type>. SignedSmall / Number = monomorphic (fast path); Any or MEGAMORPHIC = the "already sick" state described above.
--trace-deopt prints every deopt event with its reason (not a Smi / wrong map) and an @ bytecode N offset, which lets you trace back to the JS line that triggered it.
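The bit decoding from the first point can be wrapped in a small helper. This is a sketch: decodeStatus and STATUS_FLAGS are hypothetical names, and the bit positions are copied from the FLAGS table in a.js. They are an assumption about the V8 version in use and can differ between versions.

```javascript
// Bit positions copied from the FLAGS table in a.js; they can differ
// between V8 versions, so treat them as an assumption.
const STATUS_FLAGS = [
  [0, 'kIsFunction'],
  [3, 'kMaybeDeopted'],
  [4, 'kOptimized'],
  [5, 'kMaglevved'],
  [6, 'kTurboFanned'],
  [7, 'kInterpreted'],
  [14, 'kMarkedForDeoptimization'],
];

// Turn the raw integer reported by %GetOptimizationStatus into flag names.
function decodeStatus(status) {
  return STATUS_FLAGS
    .filter(([bit]) => (status & (1 << bit)) !== 0)
    .map(([, name]) => name);
}

console.log((81).toString(2));             // 1010001
console.log(decodeStatus(81).join(' | ')); // kIsFunction | kOptimized | kTurboFanned
```

The helper itself needs no natives flag, so it can live in an ordinary test utility and decode values logged from a separate --allow-natives-syntax run.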

A few gotchas

SyntaxError: Unexpected token '%' means you forgot --allow-natives-syntax. It is a parse error, not a runtime error, so the whole file fails to load.
Node version matters: Sparkplug needs 17+, Maglev is on by default from Node 22, so Node 22 has the full pipeline.
%DebugPrint output is written by V8 itself rather than through console.log. When piping into grep, add 2>&1 so that both streams are captured.
You really do need to feed mixed types (the array contains number, string and object values). Otherwise the function stays monomorphic and you will never see ::Any.
The feedback for the same code differs between the first call and the state after millions of calls. That is the V8 performance curve in action; to observe steady state, let the loop run long enough.

Cut #3: read the deopt log

$ node --trace-deopt bench.js · abridged

[bailout (kind: eager): reason: not a Smi; px2rem @ bytecode 7]
[bailout (kind: eager): reason: wrong map; px2rem @ bytecode 41]
[bailout (kind: eager): reason: unexpected type; px2rem @ bytecode 18]
[bailout (kind: eager): reason: not a Smi; px2rem @ bytecode 7]
; deopt fired 47 times over the 1M-iteration loop ★

Five symptoms, five diagnoses

#	Symptom	Root cause	Ref
1	`BinaryOp::Any`	args mix number / string / object → polymorphic	Ch8
2	`LoadIC::MEGAMORPHIC`	`input.value` / `input.unit` see too many shapes → IC gives up	Ch13
3	reason: not a Smi	the number path assumed SMI, but a float (HeapNumber) came through and triggered a deopt	Ch5
4	reason: wrong map	several different `{value, unit}` shapes alternate on the object branch	Ch11
5	function is also long	three input paths in one function → more bytecode, harder to inline, and in the extreme the `max_optimized_bytecode_size` limit	Ch9

All five trace back to one design mistake: a single function handling three structurally different inputs. From V8's point of view, that forces it to emit "cope with three types" polymorphic machine code for every property access and every binary op, so the fast path never gets a chance to form.

The fix is in the next chapter: split it into three monomorphic functions, then apply the knives from the previous 14 chapters one cut at a time.


CHAPTER 16 · HOT FUNCTION

Twelve V8-friendly rewrite rules

a before/after example for each rule

The 12 rules below aren't a style guide: each one is an engineering summary of a specific mechanism from the previous 14 chapters. They are ordered by how often each applied to the main-line px2rem function, so the first few are the cuts that buy the most.

RULE 01 · #1

Split polymorphic functions into monomorphic ones

The main-line px2rem accepts number / string / object, so V8 has to emit polymorphic machine code for every binary op, which degrades to BinaryOp::Any. Split it into px2remNumber / px2remString / px2remObject and dispatch at the call site; each of the three can then stay monomorphic.

Before:

function px2rem(input, base) {
  // one function, three input types → BinaryOp::Any
  if (typeof input === 'number') ...
  else if (typeof input === 'string') ...
  else ...
}

After:

function px2rem(i, base) {
  // the caller dispatches to monomorphic functions
  if (typeof i === 'number') return px2remNumber(i, base);
  if (typeof i === 'string') return px2remString(i, base);
  return px2remObject(i, base);
}

Maps to: Ch8 (Mono/Poly/Mega).
This cut usually accounts for 50%+ of the total speedup. On the main-line function it alone takes v0 from 240ms to ~120ms.

RULE 02 · #2

Decompose hot functions until each is small

A function whose bytecode exceeds max_optimized_bytecode_size (60K bytes by default) is never optimized. Well below that limit, small functions still win because TurboFan can inline a small callee into its caller, which removes the call overhead.

Before:

function processOrder(o) {
  // 1000 lines mixing validation / calc / format / dispatch
  // → can exceed max_optimized_bytecode_size
  // → V8 then gives up optimizing the whole function
}

After:

function processOrder(o) {
  validate(o);   // < 200 bytes of bytecode
  calc(o);       // < 200 bytes of bytecode
  format(o);     // each can be inlined
}

Maps to: Ch3 (Pipeline), Ch9 (ShouldOptimize).
The main-line splits unit conversion and distance math into separate functions, each under 200 bytes of bytecode.

RULE 03 · #3

Use TypeScript to lock arg types

TS types aren't decoration. In engineering terms they happen to protect the monomorphism of hot functions for you: with a signature of (n: number) => number, you are very unlikely to feed it a string by accident.

Before:

function add(a, b) {
  // no type constraint, anyone can pass a string in
  return a + b;
}

After:

function add(a: number, b: number): number {
  // non-number callers are rejected at compile time
  return a + b;
}

Caveat: TS can't distinguish SMI from float; that is a V8-internal difference. It does keep number and string apart.

RULE 04 · #4

Keep property assignment order stable

{x:1, y:2} and {y:2, x:1} are two different Hidden Classes in V8. In a factory function, assign properties in the same order for every object so that all instances walk the same transition chain.

Before:

// an assignment inside a conditional changes the order
if (debug) o.dbgFlag = 1;
o.x = 1;
o.y = 2;
// → with debug on, the Hidden Class chain forks at the first step

After:

// always assign in the same order
o.x = 1;
o.y = 2;
if (debug) o.dbgFlag = 1;
// → every instance shares ∅ → x → y; debug only adds one trailing node

Maps to: Ch11, Ch12 (Hidden Class · Transition Chain).

RULE 05 · #5

Declare class fields with defaults

V8 reserves in-object slots for each instance (and trims the unused ones via Slack Tracking). If the constructor only "sometimes" assigns a field, the Hidden Class forks. Initialize every field in the constructor, using null/undefined when there is no value yet, so that all instances stay on one chain.

Before:

class Point {
  constructor(x, y) {
    this.x = x;
    if (y !== undefined) this.y = y;
    // instances with and without y get different Hidden Classes
  }
}

After:

class Point {
  y = 0;   // field default, set before the constructor body runs
  constructor(x, y = 0) {
    this.x = x;
    this.y = y;   // every instance has the same Hidden Class
  }
}

The internal result object in the main-line px2remObject is initialized in one shot following this rule.

RULE 06 · #6

Don't use delete

A single delete obj.x can kick the object from Fast Properties into Slow Properties: the ICs that served it stop hitting, later accesses are dozens to a hundred times slower, and in practice the object does not come back. To "clear" a property, assign obj.x = null.

Before:

// freeing memory, so delete while we're at it
delete cache.payload;
// → cache can be stuck on Slow Properties for good
// → every IC that reads cache.* stops hitting

After:

// set to null; the GC still reclaims the referenced memory
cache.payload = null;
// → Hidden Class unchanged
// → all ICs preserved

RULE 07 · #7

Avoid deopts

Run your core scenarios against the production build with --trace-deopt and see which functions deopt. Most cases come from a value that is only occasionally undefined, or from a try-catch that only occasionally catches. Remove those "occasionals".

Before:

function format(x) {
  try { return x.toFixed(2); }
  catch { return '0'; }
  // an occasional non-number x breaks the number assumption → deopt, then the throw
}

After:

function format(x) {
  return typeof x === 'number'
    ? x.toFixed(2) : '0';
  // a typeof guard replaces the try-catch
}

RULE 08 · #8

Static beats dynamic

This is the 145× benchmark gap from Ch13. In hot paths, replace obj[key] with obj.knownKey and switch on an integer enum instead of a string.

Before:

// KeyedLoadIC, O(N) string compares
return obj[key];
switch (mode) {
  case 'show': ...
  case 'hide': ...
}

After:

// LoadIC, O(1) offset read
return obj.knownKey;
switch (mode) {
  case MODE_SHOW: ...   // integer enum
  case MODE_HIDE: ...
}

RULE 09 · #9

Literals beat procedural construction

const o = {x: 1, y: 2} is more robust than const o = {}; o.x = 1; o.y = 2: the literal gets its final Hidden Class in one step, while the procedural form walks two transitions.

Before:

const o = {};
o.x = 1;
o.y = 2;
// → ∅ → "x" → "y", two transitions

After:

const o = { x: 1, y: 2 };
// → Hidden Class built in one step
// → shared with a million other objects of the same shape

RULE 10 · #10

Keep object lifetime within one function

This relies on escape analysis: if an object never escapes its function, TurboFan can replace its fields with plain local values and skip the heap allocation entirely. The GC benefits for free.

Before:

function dist2(a, b) {
  const tmp = { dx: a-b, dy: 0 };
  return tmp.dx * tmp.dx;
  // → tmp is heap-allocated (allocation + GC + indirection) unless
  //   escape analysis manages to remove it in optimized code
}

After:

function dist2(a, b) {
  const dx = a - b;
  return dx * dx;
  // → no object at all: nothing to allocate in any tier
}

RULE 11 · #11

Use integers over floats when you can

A SMI (small integer) is an immediate value in V8 and never touches the heap. A float stored in an ordinary field or slot is usually boxed as a HeapNumber, which costs an allocation, GC work and an indirection. Use Math.floor and integer enums where you can, and do the single / 100 conversion to float only at the final output layer.

Before:

let sum = 0;
for (...) sum += 1.5;
return sum / 1000;
// → sum stops being a SMI on the first +=

After:

let sum = 0;
for (...) sum += 1500;          // integer accumulation
return sum / 1_000_000;
// → stays SMI while sum fits the SMI range; one float conversion at the end
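A runnable version of the same idea, with a hypothetical steps parameter standing in for the elided loop bounds. It also shows a side benefit that does not depend on V8 internals: scaled integers accumulate without rounding error, while fractional steps such as 0.1 do not.

```javascript
// Sum `steps` increments of 1.5 in two ways.
function sumAsFloat(steps) {
  let sum = 0;
  for (let i = 0; i < steps; i++) sum += 1.5; // fractional from the first add
  return sum / 1000;
}

function sumAsInteger(steps) {
  let sum = 0;
  for (let i = 0; i < steps; i++) sum += 1500; // 1.5 scaled by 1000, stays integral
  return sum / 1_000_000;                       // single conversion at the end
}

// Scaled integers also avoid accumulated rounding error.
let tenths = 0;
for (let i = 0; i < 10; i++) tenths += 0.1;
let scaled = 0;
for (let i = 0; i < 10; i++) scaled += 1;

console.log(sumAsFloat(1000), sumAsInteger(1000)); // 1.5 1.5
console.log(tenths === 1, scaled / 10 === 1);      // false true
```

Whether the integer loop is actually faster depends on the engine tier and on the sum staying inside the SMI range, so measure before relying on it.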

RULE 12 · #12

Avoid Ref<T>-style wrappers

React's useRef(0) wraps a number into a { current: 0 } object (Vue's ref(0) similarly exposes the number through a .value property), so every read and write goes through a Hidden Class check + IC. If a hot path reads and writes a number at high frequency, a closure-captured let variable can be several times faster than a ref.

Before:

const count = useRef(0);
for (let i = 0; i < 1e6; i++) {
  count.current++;   // every access goes through the IC
}

After:

let count = 0;   // closure variable
for (let i = 0; i < 1e6; i++) {
  count++;           // a plain register increment
}
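Outside a loop body, the closure form usually needs a small factory. The sketch below uses a hypothetical createScrollCounter. It is not a replacement for useRef when a value must persist across renders, only an illustration of keeping the hot counter in a closure variable.

```javascript
// Hypothetical scroll counter: the hot path increments a plain closure variable.
function createScrollCounter() {
  let count = 0; // lives in the closure, no wrapper object to read through

  return {
    onScroll() { count++; },  // hot path
    read() { return count; }, // cold path: expose the value when needed
  };
}

// Pull the functions out once, so the hot loop calls them directly.
const { onScroll, read } = createScrollCounter();
for (let i = 0; i < 1_000_000; i++) onScroll();
console.log(read()); // 1000000
```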

PRIORITIZING THE RULES You don't need all twelve. Rules 1 / 2 / 6 / 8 bring the largest gains; the rest are mostly protective, there to keep hot functions out of traps. If you can only make one or two cuts to a piece of code, pick from those four.

Five innocent-looking perf killers

The 12 rules above say what to do. The five patterns below are the inverse: typical violations of those rules. What they share is that they look harmless and usually pass code review, yet can cut performance in half once they land on a hot path. Memorize them as patterns so you can spot them at a glance during review.

#1 · the cleanup-killer

"Cleanup" via delete

Disguised as memory hygiene; a one-line PR that sails through review

for (const item of cache) {
  delete item.bigPayload;   // looks like it helps the GC
}

Can demote every item to Slow Properties for good, so the ICs that served them stop hitting. To clear a field, assign = null instead; the Hidden Class stays intact. → Ch14

#2 · the debug-trail

console.log(arg) in a hotspot

Disguised as temporary debugging; usually never removed

function layoutCell(cell) {
  console.log('cell:', cell);   // cell is occasionally undefined
  // → the function can deopt and stay polymorphic
}

The log call is not free (it formats its arguments and writes them out on every invocation), and if arg is occasionally undefined or a complex object, the host function can also end up deoptimized and running polymorphic code from then on. In production code, remove the call entirely or gate it as dev-only. → Ch7

#3 · the dynamic key

obj[key] with a variable key

Looks just like obj.x; even static analysis rarely catches it

function pick(obj, key) {
  return obj[key];   // KeyedLoadIC, the 145× slow path from Ch13
}

Goes through KeyedLoadIC rather than LoadIC: the state machine is more complex and degrades to megamorphic more easily. If the set of keys is known, switch to an if-chain, or to a map literal accessed via .knownKey; if the key really must be dynamic, at least guard it with typeof key === 'string'. → Ch13

#4 · the wrapper trap

useRef(0) / numbers wrapped in objects

Disguised as idiomatic React; the docs all write it this way

const count = useRef(0);
function onScroll() {
  count.current++;   // both the read and the write go through the IC
}

useRef(n) wraps the number into a { current: n } object, so every count.current goes through a Hidden Class check + IC, roughly 5× the cost of incrementing a SMI in a register. Need a value that persists across renders? Use a ref. Need high-frequency reads and writes on a hot path? Use a closure let. Don't conflate the two. → Ch13

#5 · the catch-all

try / catch as a type guard

Disguised as "robust" code ("what if it's malformed?"); review often rewards it

function format(x) {
  try { return x.toFixed(2); }
  catch { return '0'; }
  // an occasional throw pushes the whole function onto the polymorphic + deopt path,
  // and V8 is also more conservative when optimizing the try-catch block itself
}

Using try-catch as a type guard costs you twice: (1) the occasional throw triggers a deopt; (2) historically V8 was more conservative about inlining and optimizing functions containing try-catch (modern versions are much better, but the cost is still not zero). An explicit typeof guard can be several times faster. → Ch7

HOW TO USE THIS LIST These five aren't "always wrong"; they are "worth questioning when they appear in a hot function". delete in configuration setup is perfectly fine, and so is useRef(0) for state that persists across renders. They are only lethal on code paths called hundreds of times per second. The workflow: find the hotspot in the Performance panel → grep it for these five patterns → fix whatever hits.


CHAPTER 17 · HOT FUNCTION

After · ~24 ms: monomorphic, stable shapes, IC-friendly

after all the cuts have landed

Here is the version after all twelve rules have been applied. The code is longer, but every function is monomorphic, field order is fixed, nothing is ever deleted, and the data objects are never accessed through dynamic keys:

v1 · final · all 12 rules applied monomorphic + IC-friendly

// ── three monomorphic branches ── (Rule 1 + 3)
function px2remNumber(value /* number */, base /* number */) {
  return value / base;
}

// Hoisted to module top level, compiled only once (Rule 9)
const RE = /^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/;
const UNIT_PX = 0, UNIT_REM = 1, UNIT_EM = 2, UNIT_PCT = 3; // integer enum (Rule 11)
const UNIT_MAP = { 'px': UNIT_PX, 'rem': UNIT_REM, 'em': UNIT_EM, '%': UNIT_PCT };

function px2remString(input /* string */, base /* number */) {
  const m = input.match(RE);
  if (!m) return 0;
  const v = +m[1];                          // unary + is more direct than parseFloat
  const u = UNIT_MAP[m[2]] ?? UNIT_PX;      // keys come from a fixed, known set (Rule 8)
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

// (Rule 5) A factory ensures every input object has exactly the same shape:
// fixed field order value, unit (Rule 4), never deleted from (Rule 6)
function px2remObject(input /* {value:number, unit:string} */, base) {
  const u = UNIT_MAP[input.unit] ?? UNIT_PX; // input.unit is a static property access (Rule 8)
  const v = input.value;                     // (Rule 8)
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

// Caller-side dispatch (Rule 1): polymorphism is confined to this one place,
// and the dispatch disappears once the branches are inlined
function px2rem(input, base) {
  if (typeof input === 'number') return px2remNumber(input, base);
  if (typeof input === 'string') return px2remString(input, base);
  return px2remObject(input, base);
}
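
The listing mentions a factory (Rule 5) but does not show one. A minimal sketch of what it could look like follows; `makeLength` and its 'px' default are assumptions made for illustration, not the author's code. The only property that matters is that every object is created by the same literal, with `value` first and `unit` second.

```javascript
// Hypothetical factory: always initializes value first, then unit,
// so every object it returns has the same property order.
function makeLength(value, unit) {
  return {
    value: Number(value),
    unit: unit === undefined ? 'px' : String(unit),
  };
}

const a = makeLength(16, 'rem');
const b = makeLength('12');   // unit omitted: both fields are still present
console.log(Object.keys(a).join(), Object.keys(b).join());
```

Routing all construction through one function is what keeps callers from building `{ unit, value }` in the opposite order or omitting a field.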

Benchmark comparison

Step	Time (1M calls)
v0 · naive	243 ms
+ rule 1 (split)	122 ms
+ rule 2 + 9	79 ms
+ rule 4 + 5	48 ms
+ rule 8 + 11	31 ms
v1 · all rules	24 ms

FIG. 10 Cumulative effect of each cut. 1M px2rem calls drop from 243 ms to 24 ms, a 10× speedup. The first cut (splitting into monomorphic functions) takes half on its own; each later cut shaves another 23–39%. That's why rule #1 sits at the top.

243 ms (v0 · naive) → 24 ms (v1 · final): 10× speedup, verified

Asking V8 what it thinks now

%DebugPrint(px2remNumber) verified mono

- feedback vector:
- tiering state: TieringState::kNone
- invocation count: 200000
- slot #0 BinaryOp BinaryOp::Number            ; ★ monomorphic Number
  [0]: 1
- Code:
- kind: TURBOFAN
- bytecode size: 28 bytes                      ; far below the 60K threshold
- inlined into px2rem? YES (3 call sites)      ; ★ inlined at the call sites

try it yourself: verify your rewrite really is monomorphic (full a.js + node command)

Save the snippet below as a.js and run node --allow-natives-syntax a.js. You'll see the BinaryOp::Number (monomorphic!) and kind: TURBOFAN from the excerpt above with your own eyes. If you see Any instead of Number, your rewrite still has a leak.

Run it side by side with the v0 a.js from Ch15 and the contrast becomes concrete: same machine, same loop size, but one prints ::Any and the other prints ::Number; one takes ~240 ms, the other ~24 ms. That's the physical evidence for the 10×.

a.js v1 · monomorphic dispatch

// Hoisted to module top level, compiled only once (Rule 9)
const RE = /^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/;
const UNIT_PX = 0, UNIT_REM = 1, UNIT_EM = 2, UNIT_PCT = 3;
const UNIT_MAP = { px: UNIT_PX, rem: UNIT_REM, em: UNIT_EM, '%': UNIT_PCT };

// Three monomorphic branches (Rule 1 + Rule 3)
function px2remNumber(value, base) { return value / base; }

function px2remString(input, base) {
  const m = input.match(RE);
  if (!m) return 0;
  const v = +m[1];
  const u = UNIT_MAP[m[2]] ?? UNIT_PX;
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

function px2remObject(input, base) {
  const u = UNIT_MAP[input.unit] ?? UNIT_PX;
  const v = input.value;
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

function px2rem(input, base) {
  if (typeof input === 'number') return px2remNumber(input, base);
  if (typeof input === 'string') return px2remString(input, base);
  return px2remObject(input, base);
}

// Same mixed inputs, same number of iterations as v0
const samples = [12, '14px', { value: 16, unit: 'rem' }, 20, '1.5em'];
console.time('v1');
for (let i = 0; i < 1_000_000; i++) px2rem(samples[i % 5], 16);
console.timeEnd('v1');

// Force all three monomorphic branches up to TurboFan
%OptimizeFunctionOnNextCall(px2remNumber);
%OptimizeFunctionOnNextCall(px2remString);
%OptimizeFunctionOnNextCall(px2remObject);
px2rem(12, 16);
px2rem('14px', 16);
px2rem({ value: 16, unit: 'rem' }, 16);

// Key check: look at px2remNumber's feedback vector and expect BinaryOp::Number
console.log('\n--- px2remNumber optimization status ---');
const sN = %GetOptimizationStatus(px2remNumber);
console.log('binary:', sN.toString(2),
  ' TurboFan?', !!(sN & (1 << 6)),
  ' Optimized?', !!(sN & (1 << 4)));
console.log('\n--- %DebugPrint(px2remNumber) ---');
%DebugPrint(px2remNumber);

// Check the feedback of the String / Object branches the same way
console.log('\n--- %DebugPrint(px2remString) ---');
%DebugPrint(px2remString);
console.log('\n--- %DebugPrint(px2remObject) ---');
%DebugPrint(px2remObject);

How to run

terminal node 22+

$ node --allow-natives-syntax a.js

# To see optimization and deoptimization events plus TurboFan's generated assembly:
$ node --allow-natives-syntax \
    --trace-opt --trace-deopt --print-opt-code \
    a.js 2>&1 | less

What you should see

All three branch functions should show feedback vectors like this, in sharp contrast with v0:

expected excerpt verified mono

// px2remNumber: pure arithmetic, so the BinaryOp slot is monomorphic Number
- slot #0 BinaryOp BinaryOp::Number             ; ★ monomorphic (not Any)
- code: kind: TURBOFAN                          ; ★ already at the peak tier

// px2remString: the static-property LoadIC does not degrade
- slot #X LoadProperty (LoadIC) MONOMORPHIC     ; ★ monomorphic (not MEGAMORPHIC)

// px2remObject: input.value / input.unit all hit the IC
- slot #X LoadProperty (LoadIC) MONOMORPHIC     ; ★ same as above

Compare + self-check

Run the v0 a.js from Ch15 too and paste both terminal outputs side by side. Confirm that v0 shows ::Any + MEGAMORPHIC while v1 shows ::Number + MONOMORPHIC. That's the state-machine evidence behind the 10×.
If any v1 branch still shows ::Any: most likely you accidentally fed it an unexpected type (e.g. px2remNumber received a string or an object; note that NaN still counts as a Number). Audit the dispatch logic in the caller.
If kind isn't TURBOFAN but BASELINE / IGNITION: either the 1M-call loop didn't cross the tiering threshold, or the function is over max_optimized_bytecode_size (Ch9). Increase the loop count or split the function into smaller pieces.
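
One way to audit the dispatch is a debug-only wrapper that fails loudly when a branch receives a type it does not expect. The sketch below is an assumption-laden illustration: `guardTypes` is a hypothetical helper, not part of the article's code, and it only compares `typeof` results. It is meant for development runs; the wrapper itself adds overhead and should not sit on the measured hot path.

```javascript
// Hypothetical debug helper: wraps a function and throws when an argument's
// typeof does not match what the monomorphic branch expects.
function guardTypes(fn, expectedTypes) {
  return function guarded(...args) {
    expectedTypes.forEach((expected, i) => {
      if (typeof args[i] !== expected) {
        throw new TypeError(
          `${fn.name}: argument ${i} is ${typeof args[i]}, expected ${expected}`);
      }
    });
    return fn(...args);
  };
}

function px2remNumber(value, base) { return value / base; }
const px2remNumberChecked = guardTypes(px2remNumber, ['number', 'number']);

console.log(px2remNumberChecked(12, 16));
```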

Where the 10× actually comes from

No single cut is magic. Each one solves one specific V8 mechanism problem, and the small wins compound. As a ledger:

Cut	Problem fixed	Per-cut win	Cumulative
v0	baseline	n/a	243 ms
+ R1	three mono fns → exit BinaryOp::Any	−50%	122 ms
+ R2/9	small fns get inlined + top-level constants	−35%	79 ms
+ R4/5	all result/input objects share a Hidden Class	−39%	48 ms
+ R8/11	static keys + int enums → IC kicks in	−35%	31 ms
+ R10	escape analysis, temp objects skip the heap	−23%	24 ms
v1	vs v0	10.1×	24 ms
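
The per-cut percentages can be rechecked from the cumulative column. The sketch below uses only the numbers in the ledger; the variable names are made up for illustration.

```javascript
// Cumulative timings from the ledger (ms per 1M calls).
const ledger = [
  ['v0', 243], ['+ R1', 122], ['+ R2/9', 79],
  ['+ R4/5', 48], ['+ R8/11', 31], ['+ R10', 24],
];

// Per-cut reduction = 1 - (this step / previous step).
const perCut = ledger.slice(1).map(([name, ms], i) => {
  const prev = ledger[i][1];
  return { name, reductionPct: Math.round((1 - ms / prev) * 100) };
});

// Overall speedup = first timing / last timing.
const speedup = ledger[0][1] / ledger[ledger.length - 1][1];
console.log(perCut, speedup.toFixed(1) + '×');
```

The reductions multiply: 0.50 × 0.65 × 0.61 × 0.65 × 0.77 ≈ 0.099, which is where the roughly tenfold overall figure comes from.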

A tenfold speedup isn't magic.
It's twelve cuts that compound. Field Note · 03

Will this method work on any hotspot

Mostly yes. The precondition is that your bottleneck really is JS execution. If it's DOM work, compositing, network, or GC, that's a different mountain (each is covered in a different chapter of the chromium-renderer piece).

Quick check: open the Performance panel in Chrome DevTools and look at your hot function's share of frame time and its category color. If it is Scripting (yellow in current DevTools) and takes more than 5%, this methodology will almost certainly help.


CHAPTER 18 · FRONTIERS

JSCore does the same: Safari's FTL JIT

this methodology is engine-agnostic

This piece has been about V8, but those 12 rules hold across engines. The reason is simple: Hidden Class, Inline Cache, and type feedback all trace back to the Self language research of around 1991, and every modern JS engine has independently implemented the same trio.

Engine	JIT tiers	Hidden Class	IC	Type feedback
V8 · Chrome / Node	Ignition · Sparkplug · Maglev · TurboFan	Map	✓	FeedbackVector
JSCore · Safari	LLInt · Baseline · DFG · FTL (B3 backend; LLVM until 2016)	Structure	✓	ValueProfile
SpiderMonkey · Firefox	Interpreter · Baseline Interpreter · Baseline JIT · Warp (Ion backend)	Shape	✓	CacheIR
Hermes · RN	AOT bytecode (no JIT)	HiddenClass	✓	n/a (no JIT)

What's special about JSCore: the FTL backend

JSCore (WebKit's JS engine, used by Safari on iOS and macOS) has an unusual design history. Its peak tier, FTL, originally stood for "Fourth Tier LLVM": it compiled hot JS into LLVM IR and ran LLVM's optimization passes, the same LLVM that compiles C++, Rust, and Swift. In 2016 WebKit replaced LLVM with its own B3 backend, which fills the same heavyweight-optimizer role with much lower compile latency, and FTL is now read as "Faster Than Light".

Real-world impact: on certain benchmarks Safari's JSCore beats Chrome's V8, especially on compute-heavy, type-stable code, where the FTL tier's loop optimizations and inlining can be more aggressive than TurboFan's.

But fast-JS writing is the same trade across engines: the 12 rules above apply essentially unchanged to JSCore.

REAL TEST

Same optimized code, Chrome vs Safari

On my M1 MacBook Pro, the v1 px2rem at 1M iterations: Chrome (V8) 24 ms, Safari (JSCore) 17 ms. Safari wins here because its FTL backend fully unrolled the UNIT_MAP lookup into direct compares. But code that runs fast runs fast in every browser, and that's the real value of this methodology.


CHAPTER 19 · FRONTIERS

Wasm: V8's other pipeline

when JavaScript isn't fast enough

Optimizing px2rem down to 24 ms is roughly the JS ceiling. To go faster, you have to stop writing JS. That's WebAssembly's slot.

V8 actually runs two separate pipelines. JS uses the four tiers covered in Ch3 (Ignition → Sparkplug → Maglev → TurboFan). Wasm has its own two tiers: Liftoff (baseline, compiles within milliseconds) and TurboFan (peak, reusing the same backend). Both pipelines share the same machine-code memory, the same GC, and the same main thread, so Wasm isn't "another language" so much as another shape of the JS performance curve.

When to reach for Wasm

Scenario	Recommendation
UI hotspots (layout, scroll, animation)	JS optimization is usually enough
media codec / crypto / physics simulation	Wasm wins decisively (2–10×)
bulk data (collab editing, spreadsheets)	it depends: repeated JS↔Wasm boundary crossings can swamp the win
DOM ops	Wasm is slower here (it must bridge through JS)

And critically: Wasm isn't "fast just by being Wasm". Poorly written Wasm (frequent boundary calls, unfriendly memory layout, no vectorization) sometimes loses to equivalent, optimized JS.

The last word of this piece, then: push JS to its limit with the 12 rules first, and only then consider Wasm. In most business code, JS optimization solves 80% of performance problems without adding build complexity.
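
To make "boundary call" concrete, here is a small sketch. The module is a hand-assembled, illustrative stand-in that exports a single `add(i32, i32) -> i32` function; a real project would load a compiled .wasm file instead. The loop crosses the JS↔Wasm boundary once per element, which is exactly the pattern that can eat the gains on bulk data.

```javascript
// Hand-assembled minimal module exporting add(i32, i32) -> i32.
// The bytes are an illustrative stand-in for a real compiled module.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, one body
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);
const { add } = new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports;

const data = [1, 2, 3, 4, 5];
let crossings = 0;
let total = 0;
for (const x of data) {
  total = add(total, x); // one JS -> Wasm -> JS round trip per element
  crossings++;
}
console.log(total, crossings);
```

The usual remedy is to move the loop inside the module and pass the data through linear memory, so the boundary is crossed once per batch instead of once per element.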


CODA · TOOLBOX

Toolbox: `--allow-natives-syntax` in practice

every command, flag, and native syntax used in this piece, in one place

Every "how to see what V8 is doing" tool used across this piece is collected here. Save this chapter as a cheatsheet, and next time you face slow JS, copy-paste from it.

Glossary: every jargon term, with a one-line definition and a pointer to its first appearance

Stuck on a term halfway through? Look it up here. Entries are ordered by frequency of appearance, not alphabetically, to match the rhythm of reading this piece.

SMI · Small Integer: 31-bit signed integer packed directly into the 64-bit word, low bit 0. → Ch5
HeapObject: a heap-allocated object; the pointer's low bit is 1. Anything that isn't a SMI (floats, strings, objects, functions). → Ch5
HeapNumber: a HeapObject that boxes a float or an integer too large for a SMI. → Ch5
Tagged Pointer: stuffing type info into a pointer's low bits. V8 uses the lowest bit to tell SMI from HeapObject. → Ch5
JSObject: V8's core object struct: Map / properties / elements / in-object slots. → Ch10
Hidden Class · Map · Shape: an object's "shape descriptor". Same-shape objects share one. Edge: Types · JSC: Structure · SpiderMonkey: Shape · V8 source: Map. → Ch11
DescriptorArray: sub-structure inside a Hidden Class; records each key name and its in-object slot index. → Ch11
Transition Chain: linked list / tree of Hidden Classes describing how a shape grows as properties are added. → Ch12
in-object properties: slots reserved inside the JSObject body itself; fastest access (base + offset). → Ch10
Slack Tracking: V8 reserves 4 in-object slots for an empty object; unused ones are reclaimed in a later GC. → Ch11
IC · Inline Cache: inlining "look up by key" into "load at a fixed offset"; V8's core performance trick. → Ch13
LoadIC · KeyedLoadIC: two property-load IC variants: LoadIC for static obj.x (fast); KeyedLoadIC for dynamic obj[key] and array indexing (slower). → Ch13
Mono · Poly · Mega morphic: IC state machine: 1 shape seen → mono (fastest); 2–4 → poly (OK); more than 4 → mega (cache abandoned, near-interpreter speed). → Ch8
Fast / Slow Properties: Fast uses in-object slots + IC; Slow uses a dictionary (after delete, or when too many properties overflow). Slow is 10–100× slower and irreversible. → Ch14
FeedbackVector: a per-function "type feedback table"; one slot per feedback site (call, property access, arithmetic op). → Ch6
Feedback Slot: a single entry in the FeedbackVector; records which types / shapes a given site has seen. → Ch6
Assumption: V8's "bold guess" at compile time, based on feedback, e.g. "a and b are both SMIs". → Ch6
Checkpoint · Type Guard: a type guard (testb / cmp) inlined into the JIT's assembly to protect an assumption. → Ch4 · Ch6
Deoptimization · deopt: the JIT discards optimized code when an assumption breaks and falls back to Ignition. A performance nightmare. → Ch7
Parser · AST: V8's first stage: parses source into an abstract syntax tree and emits the first bytecode. → Ch3
Ignition: V8's bytecode interpreter. Where cold code lives. → Ch3
Bytecode: V8's custom VM instruction set (stack + accumulator). It is the input of every JIT tier, not the source text. → Ch4
Sparkplug: the non-optimizing baseline JIT added in 2021. Translates bytecode 1:1 into assembly, skipping interpreter dispatch. → Ch3
Maglev: the mid-tier JIT added in 2023. Uses feedback for light optimizations and reaches ~70% of TurboFan's performance. → Ch3
TurboFan: V8's peak optimizing JIT. Sea-of-Nodes IR with dozens of optimization passes. → Ch3
NewSpace · OldSpace: V8's two GC generations: NewSpace for short-lived objects (scanned often by the Scavenger); OldSpace for objects that survived a GC. → Ch5
Scavenger · Orinoco: V8's GC implementations: Scavenger is the copying collector for NewSpace; Orinoco is the umbrella name for V8's parallel/incremental GC. → Ch5
Forwarding Pointer: the "moved-to address" hint left at an object's old location when GC compaction relocates it. → Ch5
Mark Bitmap: V8 stores GC mark bits in a side bitmap outside the objects, one bit per heap address. → Ch5
JSCore · LLInt · DFG · FTL: Safari's JS engine (WebKit) and its four tiers (LLInt interpreter / Baseline / DFG / FTL, originally LLVM-based, on the B3 backend since 2016). → Ch18
Hermes: Meta's JS engine for React Native, no JIT (iOS restrictions). Compiles to bytecode ahead of time at bundle time. → Ch2
SpiderMonkey · CacheIR: Firefox's JS engine and its IC IR (an intermediate representation built from cache operations). → Ch18
Wasm · Liftoff: WebAssembly, plus V8's Wasm baseline JIT (the counterpart of Sparkplug). → Ch19

Startup flags

node / chromium startup flags

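# node
$ node --allow-natives-syntax bench.js

# node + full tracing (strongly recommended)
#   --trace-opt       log when a function is tiered up to TurboFan
#   --trace-deopt     log when a function is deoptimized
#   --print-opt-code  print the machine code TurboFan emits
$ node --allow-natives-syntax \
    --trace-opt \
    --trace-deopt \
    --print-opt-code \
    bench.js

# Chromium (macOS)
$ open -a Chromium --args --js-flags="--allow-natives-syntax --trace-deopt"

# Restrict tracing to one function family (avoids a log explosion);
# the pattern is quoted so the shell does not expand the *
$ node --allow-natives-syntax \
    --turbo-filter='px2rem*' \
    --print-opt-code bench.js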

Native syntax commands

callable from JS code itself % prefix

// ─ 1. Inspect a function's current optimization status ─
console.log(%GetOptimizationStatus(px2rem).toString(2));
// Returns a bitmask; the key bits:
//   bit 4  = kOptimized                 // running an optimized version
//   bit 5  = kMaglevved                 // on Maglev
//   bit 6  = kTurboFanned               // on TurboFan
//   bit 14 = kMarkedForDeoptimization

// ─ 2. Force a tier-up to TurboFan on the next call ─
px2rem(10, 16);                          // run at least once so it collects feedback
%OptimizeFunctionOnNextCall(px2rem);
px2rem(10, 16);                          // this call is compiled by TurboFan

// ─ 3. Print the internals of an object / function ─
%DebugPrint(px2rem);
// Outputs feedback vector / Hidden Class / optimization status, etc.

// ─ 4. Check whether an object has Fast or Slow Properties ─
console.log(%HasFastProperties(obj));    // → true / false

// ─ 5. Deoptimize a function immediately (for testing) ─
%DeoptimizeFunction(px2rem);
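
Reading the bitmask by eye gets tedious, so a small decoder helps. The sketch below is plain JavaScript (no native syntax needed) and uses only the four bit positions listed above; `decodeStatus` and `STATUS_BITS` are assumed names, and the bit layout can change between V8 versions, so check it against your own build.

```javascript
// Bit positions as listed above; they can differ between V8 versions.
const STATUS_BITS = {
  optimized: 1 << 4,
  maglevved: 1 << 5,
  turbofanned: 1 << 6,
  markedForDeoptimization: 1 << 14,
};

// Turn a numeric status into named booleans.
function decodeStatus(status) {
  const out = {};
  for (const [name, mask] of Object.entries(STATUS_BITS)) {
    out[name] = (status & mask) !== 0;
  }
  return out;
}

// Example with a made-up status value (bits 0, 4 and 6 set).
console.log(decodeStatus(0b1010001));
```

In a script started with --allow-natives-syntax, the result of %GetOptimizationStatus(fn) can be passed straight into `decodeStatus`.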

Heap snapshot · reading Hidden Classes

Chrome DevTools → Memory → Take heap snapshot, then:

In the Summary view, type your constructor's name (e.g. Object) into the "Class filter" box at the top.
Expand any entry and you'll see map :: system / Map @...: that is the object's Hidden Class, and the @ number is DevTools' id for it.
Click the Map entry; the Retainers panel below lists every object pointing at the same Hidden Class. Tens of thousands of objects sharing one Map ✓ means shapes are stable. A handful of objects each pointing at a different Map ✗ means shapes have split.

This is the most direct way to check whether object shapes are stable, and it is easier than comparing map addresses in %DebugPrint output.

Watching V8 in the browser: the DevTools Performance panel

All the %DebugPrint / --trace-deopt material above is Node CLI tooling, but your real-world bottlenecks mostly surface in the browser, where the DevTools Performance panel is the daily entry point. Node tools dissect a hotspot you've already located; the Performance panel tells you which function is the hotspot. The flow has three steps:

step 1 · find

Open the Performance panel and record for 3–5 seconds while triggering the slow path (scroll, click, etc.), then stop. In the chart at the top, find the tallest and widest Scripting (yellow) region: that's the hotspot. If your "slow" is entirely Rendering purple (style and layout) or Painting green, this piece's methodology won't help; read the chromium-renderer companion instead.

step 2 · quantify

Switch the bottom panel to the Bottom-Up tab and sort by Self Time, descending. Self Time is how much CPU the function itself burns, excluding its callees. More than 5% and in the Scripting category → this piece's methodology will help. Click any row to open it in the Call Tree and see the full call chain.

step 3 · dissect

Lift that hot function out of the browser into Node (together with a realistic input distribution) and apply the %DebugPrint + --trace-deopt routine from Ch15 / Ch17 to see its state inside V8. "Find in the browser, fix in Node" is this methodology's standard workflow.

SIGNALS NOT TO IGNORE

When you see these in the Performance panel, V8 is calling for help: a jagged Minor GC pattern (lots of short-lived NewSpace objects → see Phase IV rule #10, escape analysis); high self time with a shallow call stack (a pure CPU hotspot, exactly this piece's target); frequent "Compile Code" / "Optimize Code" entries (deopts are happening repeatedly → Ch7); Long Tasks (red triangles, over 50 ms) (the main thread is being held by some stretch of JS).

extended: why DevTools can't show you BinaryOp::Any or deopt logs

The Performance panel tells you which function burns time, but not why it is slow: whether it degraded to megamorphic, is deopting, or is dominated by GC. Those "why" signals live in V8's internal state, and DevTools doesn't expose that layer (most web developers don't need it, so the V8 team has had little incentive to surface it).

Three ways to actually see them:

Replicate in Node (this piece's path): lift the hot function and realistic inputs into a Node script, then run it with --allow-natives-syntax and %DebugPrint to inspect the feedback vector. Most direct, but you have to rebuild the calling context yourself.
Launch Chrome with V8 flags: start it with open -a Chromium --args --js-flags="--allow-natives-syntax --trace-deopt", then read the deopt logs on stdout. The catch: the DevTools Console doesn't show them; you have to watch the terminal that launched Chromium.
chrome://tracing: enable V8's internal tracing categories to see Maglev / TurboFan compile events. Most complete, but with the highest barrier; basically engine-developer territory.

Practical advice: just take the first route. Find the hotspot in the Performance panel, move it into a Node file, and run the a.js templates from Ch15 / Ch17.

Minimal repro cheatsheet

bench.js · paste-and-run template copy this

// ── A minimal V8 performance measurement template ──
// Usage: node --allow-natives-syntax bench.js

function target(a, b) {
  /* put the function you want to measure here */
  return a + b;
}

// 1. Warm up (lets V8 collect feedback and tier up to TurboFan)
for (let i = 0; i < 10000; i++) target(i, i);
%OptimizeFunctionOnNextCall(target);
target(0, 0);

// 2. See how V8 views this function
%DebugPrint(target);

// 3. Time 1M calls
console.time('target');
for (let i = 0; i < 1_000_000; i++) target(i, i);
console.timeEnd('target');

// 4. Feed in one suspect argument, run again, and check for a deopt
target('a', 'b');                 // deliberately break monomorphism
console.time('after-bad');
for (let i = 0; i < 1_000_000; i++) target(i, i);
console.timeEnd('after-bad');     // how many times slower?
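
A single console.time reading is noisy. The sketch below is a plain-Node variant (no native syntax) that repeats the timed loop and reports the median; `medianTime` and its default run counts are assumptions chosen for illustration, not part of the template above.

```javascript
const { performance } = require('node:perf_hooks');

// Time `iterations` calls of fn, repeat `runs` times, return the median in ms.
function medianTime(fn, runs = 5, iterations = 100_000) {
  const times = [];
  for (let r = 0; r < runs; r++) {
    const start = performance.now();
    for (let i = 0; i < iterations; i++) fn(i, i);
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}

function target(a, b) { return a + b; }
console.log('median ms:', medianTime(target).toFixed(3));
```

Comparing medians before and after a change is less sensitive to a stray GC pause or a tier-up that happens to land inside one run.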

References

OFFICIAL DESIGN DOCS v8.dev · webkit.org

▸ V8 blog · v8.dev/blog
   Recommended: "Maglev - V8's Fastest Optimizing JIT", "Sparkplug - a non-optimizing JavaScript compiler", "Faster JavaScript calls", "Hidden classes"
▸ V8 source · TieringManager
   chromium.googlesource.com/v8/v8/+/main/src/execution/tiering-manager.cc
▸ V8 native runtime list
   chromium.googlesource.com/v8/v8/+/main/src/runtime/runtime.h
   The complete list of % functions (several hundred; this piece uses only a small subset)
▸ JSCore design · webkit.org/blog
   "Speculation in JavaScriptCore": the same story as Phase II of this piece
▸ Original notes · app.tana.inc/shared/js/aHdCaFV4clZMaC9FSXBhUmRrVk9EUlo=
   The earliest Tana version of this piece's methodology (private notes)

SISTER PIECE

Beyond this piece: other perf bottlenecks

This piece only covers bottlenecks in JS execution. If your bottleneck is in DOM, layout, paint, or compositing, see the sister piece Bytecode to Pixels: Chromium's Rendering Pipeline. For compositor jank specifically, see Jank & Stutter.

