FIELD NOTE / 03 · JS Performance Engineering · 2026

JavaScript
at the limit.

When a piece of JS runs slow, how do you know where it's slow? How do you rewrite it with V8's internals as a guide? How do you prove it really got ten times faster? This is a methodology handbook, not an encyclopedia.

[Diagram: V8's four-tier compile pipeline, cold → hot: Parser (source → AST) → Ignition (interpreter · bytecode) → Sparkplug (baseline JIT) → Maglev (mid-tier JIT) → TurboFan (peak JIT)]
CHAPTER 01 · PROLOGUE

Three eyes on const a = 3 + 4

one line of code, three worlds, three translations

The first step of performance work isn't profiling. It's answering one question: how many translations sit between the line of JavaScript in your head and the instructions the CPU actually runs?

Take const a = 3 + 4. Look at it with three different eyes. The same one-liner shape-shifts across three worlds, and every V8 optimization trick lives in the translation between them.

Brain · high abstraction
    // compute 3+4, store in a
    a = 7

V8 engine · parser → bytecode → asm
    // Ignition bytecode
    LdaSmi [3]
    Star0
    AddSmi [4], [0]
    Star1
    Return

CPU · low abstraction
    ; TurboFan x86-64
    mov eax, 3
    add eax, 4
    ret

FIG. 01 · The same one-liner sampled at three points on the abstraction axis, from more abstract to less. V8's job is to fold the left into the right, in the middle box.

Why so many translations

A bare CPU only speaks machine code, but JavaScript is dynamic: types are a runtime fact. Given add(a, b), I can't fold it down to a single add eax, ebx if I don't know what a and b are; the next second you might pass two strings.

So V8 inserts a bytecode layer between brain and CPU: closer to a real machine than the AST, more flexible than asm. It can be interpreted, it can collect feedback about what the argument types actually are, and once V8 has watched enough calls it can be recompiled into machine code.

That layer is the entire battlefield of JS performance work. The rest of this piece is about what V8 does in there, and how you can cooperate with it.
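
You can dump that middle layer yourself. A minimal sketch, assuming a scratch file named add.js; the exact dump wording varies across V8 versions:

$ node --print-bytecode --print-bytecode-filter=add add.js

// add.js
function add() { return 3 + 4; }
add();

// excerpt from the dump:
// LdaSmi [3]
// Star0
// AddSmi [4], [0]
// Star1
// Return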

"Fast JavaScript" really means
V8 quietly guessed right ten thousand times. Field Note · 03
WHY START FROM BYTECODE · Underneath, V8 bytecode and x86 asm are the same kind of thing: instructions for a (virtual or physical) machine. The only difference is that no CPU ships with V8 bytecode in silicon. Machine code wins because hardware runs it directly; V8 introduced bytecode because it sits closer to the metal than the AST (a flat, stack-based ISA with an accumulator register) and stays more flexible than asm (interpretable, observable, hot-swappable).

So what is this piece actually for

The hard part isn't editing code. The hard part is seeing what V8 is doing right now. So this piece doesn't tour every V8 internal; it answers one concrete question:

THE PROBLEM
function px2rem(input, base) { /* … */ } // 1M loops · polymorphic input → 240 ms
How do we get this down to ~24 ms, and know why?

It's a perfectly ordinary helper. Input may be a number, a string, or a { value, unit } shape; output is a rem value. In one of our projects it ran hundreds of times per frame and burned non-trivial CPU.

The 19 chapters that follow are each one cut of making it faster. Pipeline, Hidden Class, Inline Cache: these aren't decorations. They are the knives we'll cut the problem with.

CHAPTER 02 · PROLOGUE

JIT vs AOT · two roads to a binary

a thin compile-time, a fat runtime

Static languages like C / Rust / Go can pin down types, call shapes, and memory layouts at compile time, so they take the AOT (Ahead-Of-Time) road: the final machine code is generated before the binary ships, and almost no "compiling" happens at runtime.

JavaScript is dynamic. Until add(a, b) is actually called, nobody can promise the two arguments are numbers; they could be strings or objects. So V8 takes the JIT (Just-In-Time) road: it compiles while running, observes types, and re-optimizes from feedback.

Dimension              | AOT · C / Rust / Go                    | JIT · V8 / JSC
Compile-time           | thick · all optimization happens here  | thin · only parse + bytecode
Runtime                | thin · just runs machine code          | thick · (re)compile and deopt while running
Type info              | known at compile time                  | collected at runtime (feedback)
Peak code emitted when | at compile end                         | after "hot enough" (TurboFan)
Cost                   | slow build · large binary              | cold start · compiled-code memory

"很厚的运行时"是什么意思

What does "thick runtime" actually mean

意思是:你写的同一段 JS,V8 运行时会反复地编译它——先用解释器跑(Ignition),发现是热点函数后,会用基线编译(Sparkplug)、中间编译(Maglev)、最后峰值编译(TurboFan)轮番上阵。每一次升级都要花 ms 级的时间在编译本身上,这部分时间是 AOT 语言不需要付的。

所以你写 function add(a, b) { return a + b },V8 在你眼皮底下可能跑过 4 个不同版本的 add——每个都对应不同程度的优化和不同程度的"假设"。

这恰恰也是性能优化的机会:如果你能让 V8 的假设保持稳定,它就能一直跑在最优版本(TurboFan 机器码)上,不会被"反优化"打回字节码解释。

It means the same chunk of JS gets recompiled while it runs — interpreter first (Ignition), then once V8 notices it's hot, baseline (Sparkplug), mid-tier (Maglev), and finally peak (TurboFan) take turns. Each upgrade burns milliseconds in compilation itself — a tax AOT languages never pay.

So when you write function add(a, b) { return a + b }, four different versions of add may have run inside V8 by the time you blink — each at a different level of optimization, each with a different set of assumptions.

That's also where the leverage sits: keep V8's assumptions stable and your function lives on the peak (TurboFan machine code) forever — never "deoptimized" back into bytecode interpretation.

WHY JS CAN'T JUST AOT · Not "can't", just "can't precisely". Hermes (React Native's engine) does AOT bytecode generation at bundle time, winning on size and start-up. The tradeoff: no JIT, so peak throughput often loses to V8. The essential tradeoff is dynamic peak performance vs start-up / bundle size; V8 chose the first, Hermes the second.
CHAPTER 03 · PIPELINE

Four-tier JIT · the compile pipeline

the same JS exists in V8 as four different versions of itself

V8 is not one compiler; it's a pipeline of four. The same function climbs the tiers as it gets called more often. Each tier emits code closer to the metal, runs faster, but costs more time to compile in the first place.

This is V8's core tradeoff: cold code isn't worth optimizing, and the hotter the code, the deeper the optimization it deserves. So a function's "performance" isn't a single number; it's a curve that moves over time. This chapter is the map of that curve.

FIG. 02 · V8's four-tier compile pipeline: Parser (source → AST) → Ignition (interpreter · bytecode) → Sparkplug (baseline JIT · asm, no opt) → Maglev (mid-tier JIT · asm, light opt) → TurboFan (peak JIT · asm, full opt). Each tier's output is closer to what the CPU runs raw. The green arrow climbs (enough type feedback → upgrade); the red arrow falls (assumption broken → deopt back to Ignition).

What each tier actually does

Parser
Tokenizes source into an AST and emits the first bytecode. Every later tier consumes that bytecode, not the source. From here on V8's world is bytecode; the source is gone.

Ignition
The bytecode interpreter. Runs bytecode instruction by instruction and collects type feedback (which types/shapes the args usually take) into a structure called the FeedbackVector. Cold code dies here; no need to climb.

Sparkplug
Added in 2021. A non-optimizing baseline JIT: emits a one-to-one translation of bytecode into native asm to skip the interpreter's fetch-decode-dispatch tax. It doesn't use feedback, so compiling is nearly free.

Maglev
Added in 2023. A mid-tier optimizing JIT that does read feedback, but only does light optimizations, targeting a fraction of TurboFan's compile time for roughly 70% of TurboFan's peak. Speedometer shows ~21% over Sparkplug alone.

TurboFan
Peak optimizing JIT. Sea-of-Nodes IR with dozens of passes: inlining, escape analysis, loop-invariant code motion, IC inlining. The output can be as tight as a C compiler's. Cost: millisecond-level compile time and a fat code footprint.

Why not just go straight to TurboFan

Because TurboFan's compile time is itself a performance cost, and TurboFan needs feedback to optimize well; without feedback, even TurboFan's output is mediocre.

In real apps, most code only runs a handful of times: a setup function on page load, a one-shot callback. Compiling those with TurboFan is a pure loss. So V8 uses a simple heuristic:

V8'S TIERING POLICY
cold → Ignition (interpret)
warm → Sparkplug (baseline asm)
getting hot → Maglev (mid-tier opt)
hot → TurboFan (peak opt)
"Hot" is decided in v8::internal::TieringManager::ShouldOptimize; we'll dissect it in Ch9.

What this means for optimization

Three implications:

  1. Performance is "after-N-calls" performance. Cold and steady-state numbers can differ by 5–10×; any benchmark must pre-warm the function for thousands of iterations before timing.
  2. Deopt is the performance nightmare. One deopt drops a function from TurboFan machine code back to the Ignition interpreter, like shifting from top gear straight into first. Ch7 covers how to avoid it.
  3. Maglev makes small hot functions matter. Pre-Maglev, small hotspots sometimes never crossed TurboFan's threshold and stayed un-optimized. Maglev gave them a middle gear to hit.
HOW TO INSPECT · WHICH TIER IS MY FUNCTION ON? Launch node or Chromium with --allow-natives-syntax; %GetOptimizationStatus(fn) then returns a bitmask: bit 4 is kOptimized, bit 5 kMaglevved, bit 6 kTurboFanned. The full command list lives in the Ch20 toolbox.
CHAPTER 04 · PIPELINE

Bytecode vs machine code · two faces, one idea

why virtual ISAs exist at all

"字节码"听起来像个魔法词,但它的本质很朴素——一个虚拟 CPU 的指令集。和 x86 / arm 这些真 CPU 的指令集相比,只差在"这世上没有一台 CPU 直接跑它"。

const a = 3 + 4 在 V8 里走完一遭,你会看到它在两种语言里出场两次:第一次是 Ignition 的字节码,第二次是 TurboFan 的机器码。下面把它们摆在一起对照看。

"Bytecode" sounds like a magic word, but underneath it's plain: an instruction set for a virtual CPU. The only difference from x86/arm is that no silicon ships with a bytecode decoder built in.

Walk const a = 3 + 4 through V8 and you'll see it appear in two languages: first as Ignition bytecode, then as TurboFan machine code. Side by side:

▸ V8 BYTECODE · Ignition

// function add() { return 3 + 4 }
LdaSmi [3]        // load 3 into the accumulator
Star0             // acc → r0
AddSmi [4], [0]   // acc += 4 (feedback slot 0)
Star1             // acc → r1
Return
// 5 instructions · stack + accumulator
// dispatched by the Ignition interpreter

▸ MACHINE CODE · TurboFan x86-64

; the same add(), emitted by TurboFan
mov eax, 3
add eax, 4
ret
; 3 instructions · register-based
; consumed directly by the CPU pipeline
; (constant folding can collapse it to mov eax, 7)

FIG. 03 · Same function. Left: V8 bytecode for the Ignition interpreter. Right: TurboFan-emitted x86-64 for the CPU itself. Different surface, same job: load → add → return.

Why bytecode looks like this

V8's bytecode ISA is a stack machine with an accumulator. The Lda in LdaSmi [3] means "Load Smi into accumulator"; Star0 means "Store accumulator into register 0". Two wins:

  1. Compact encoding. Most instructions are 1–3 bytes, far tighter than a three-address IR. This matters for V8's start-up: the cold path runs in Ignition.
  2. Simple dispatch. Each opcode has a handler; handlers chain through the accumulator. CPU-pipeline friendly and easy on branch prediction.

The cost is the interpreter loop: every instruction pays "fetch → decode → jump-to-handler → execute → jump-back". That loop alone is roughly an order of magnitude slower than running raw machine code. That's why Sparkplug exists: translate bytecode one-to-one into asm and kill the dispatch loop.

The "extra" instructions you'll see in real asm

Real TurboFan output is much fatter than the three-line figure above. Run node --print-opt-code --allow-natives-syntax and you'll see a swarm of cmp / jne / test around the core. That's not logic; it's V8's checkpoint machinery (type guards) plus the calling-convention scaffold.

OUTPUT · node --print-opt-code --allow-natives-syntax ./add.js · turbofan x86-64

; function add(a, b) { return a + b } (assumed Smi+Smi)
L01  push rbp                          ; calling convention
L02  mov rbp, rsp
L19  testb [rbx+0xf], 0x1              ; checkpoint: is a a SMI?
L20  jne CompileLazyDeoptimizedCode    ; no → deopt back to Ignition
L21  testb [rcx+0xf], 0x1              ; checkpoint: is b a SMI?
L22  jne CompileLazyDeoptimizedCode
L23  add rbx, rcx                      ; ★ the actual logic, this one line ★
L24  jo OverflowDeopt                  ; overflow → deopt
L25  mov rax, rbx
L26  pop rbp
L27  ret

The testb / jne pairs are type guards: they verify "is this still a SMI?" on every call. Pass → fall through to the one-line core add; fail → jump out to deopt. One line of JS becomes nine lines of asm: six scaffold, one logic.

That scaffold is the protagonist of Phase II: the assumption + feedback + checkpoint trio. It explains why "TurboFan is faster than Ignition" and "breaking an assumption tanks performance" are two faces of the same coin.

JavaScript is dynamic,
until you make it look static. Field Note · 03
CHAPTER 05 · PIPELINE

Tagged Pointer · one bit decides reality

SMI vs HeapObject, decided by one bit

That testb [rbx+0xf], 0x1 in the previous chapter is checking the lowest bit. This habit isn't V8-only; in C/C++ it's called a Tagged Pointer: stuff type info into a pointer's low bits instead of carrying a separate field. V8's flavor:

  • low bit = 0 → it's a SMI (Small Integer, a 31-bit signed integer). Read the value by shifting the word right by one.
  • low bit = 1 → it's a HeapObject pointer. The real object lives on the heap; dereference required.

That single bit forks V8's handling of a 64-bit word into two completely different paths.

FIG. 04 · How V8 packs a number into a 32-bit word. Integers fitting |n| ≤ 2³⁰ go SMI: low bit 0 means "the word IS the value", the SMI fast path. Anything else (floats, big ints, strings, objects) goes HeapObject: low bit 1 means "the word is a pointer; look on the heap".

Why V8 hoards bits like this

Because integers are everywhere. In a real page, the heap is mostly numbers: indices, pixels, milliseconds, coordinates, counters. Boxing every one of them into a HeapObject (with a hiddenClass, meta header, and GC bits) would drown V8 in memory traffic and pointer chasing.

Tagging SMIs with the low bit lets V8:

  • Skip the heap for SMIs. They live in registers as immediate values; arithmetic is one CPU instruction.
  • Reduce the type check to a testb. The "is it a SMI?" guard from the previous chapter is one cycle.
  • Skip SMIs in GC. They aren't pointers, so the collector just moves past them.

SIZING DETAILS · On 32-bit V8, SMI is a 31-bit signed integer (±2³⁰); floats and bigints are boxed as HeapNumber / BigInt. On 64-bit V8, pointer compression is on by default: SMI and Pointer both fit in 32 bits, with the high 32 derived from an "isolate root". This is why passing a number is much faster than passing a string: strings are always HeapObjects.

What this has to do with writing fast JS

Three direct, actionable consequences:

  1. Use integers over floats when you can. 1.5 + 2.5 boxes into HeapNumber (slow path); 1 + 2 stays SMI throughout (fast path). Same for Math.floor(x) followed by arithmetic: V8 knows the result is an int and keeps it SMI. (Measured in the sketch below.)
  2. Prefer numbers over strings as keys / switch values. In hot functions, replace obj['x'] with obj.x, and switch on integer enums rather than string literals; V8's check path becomes shorter.
  3. Don't mix types in arrays. [1, 2, 'three'] escalates the whole array's elements kind to the generic PACKED_ELEMENTS, where all reads/writes go through HeapObject paths. [1, 2, 3] stays in PACKED_SMI_ELEMENTS, where reads are raw memory accesses.

Tagged Pointer isn't a curiosity. It's the minimal decision V8 makes behind every line of JS you write.
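
A quick way to feel consequence #1. A minimal sketch, assuming stock Node; absolute numbers vary by machine and V8 version, and TurboFan can keep doubles unboxed in registers once the loop is hot, so the gap is largest before peak optimization kicks in:

// smi-vs-double.js · run: node smi-vs-double.js
function sumSmi(n) {
  let acc = 0;
  for (let i = 0; i < n; i++) acc = (acc + i) | 0;  // |0 keeps acc inside SMI range
  return acc;
}

function sumDouble(n) {
  let acc = 0.5;                                    // non-integral seed → double path
  for (let i = 0; i < n; i++) acc += i + 0.5;
  return acc;
}

console.time('smi');    sumSmi(100_000_000);    console.timeEnd('smi');
console.time('double'); sumDouble(100_000_000); console.timeEnd('double');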

CHAPTER 06 · ASSUMPTIONS

The trio · assumption + feedback + checkpoint

how V8 manages to guess right

Here's the real paradox: JS is dynamically typed, so how does V8 ever produce C-tight machine code?

Answer: V8 doesn't "know"; it guesses. It runs, watches the types your function actually sees, makes bold assumptions, and emits fast-path asm based on those assumptions, with type-check checkpoints inlined so it can throw the asm away the moment a guess fails.

The trio:

#  | Name       | Lives at                        | Does what
1  | feedback   | while Ignition / Sparkplug runs | Watches what types/shapes flow through each call site, writing to a FeedbackVector.
2  | assumption | when Maglev / TurboFan compiles | Reads feedback and decides: "I'll assume both args are SMIs." The more monomorphic the feedback, the bolder the bet.
3  | checkpoint | in the emitted machine code     | Each assumption gets a testb/cmp guard inlined: pass → fast path; fail → deopt immediately.

How the trio cooperates

As a timeline:

TRIO TIMELINE

[1]   add(1, 2)     → Ignition interprets; feedback: arg0=SMI, arg1=SMI
[2]   add(3, 4)     → feedback hardens: still SMI+SMI
...   (~10K calls) ...
[N]   ShouldOptimize() == true → TurboFan takes over
      assumption: a is SMI && b is SMI && add doesn't overflow
      emitted asm: testb · jne deopt · add · jo deopt · ret
[N+1] add(1, 2)     → runs machine code; checkpoints pass; returns in a cycle
[X]   add('a', 'b') → checkpoint fails; deopt; machine code discarded; back to Ignition

Key observation: the checkpoint is the safety pin for the assumption. Without it V8 wouldn't dare optimize aggressively; with it, V8 can be nearly as bold as a C compiler.

How to observe this in practice

The trio is invisible by default, until you flip V8's built-in switches. This is the first class of tool for analyzing slow JS:

OBSERVATION TOOLBOX · node / chromium flags

--allow-natives-syntax   // enable %DebugPrint / %GetOptimizationStatus / %OptimizeFunctionOnNextCall
--trace-opt              // log every "function promoted to TurboFan" event
                         // → [optimizing 0x... <JSFunction add> (target TURBOFAN) - took 0.8 ms]
--trace-deopt            // log every deopt event + reason (checkpoint name)
                         // → [bailout (kind: eager): begin. reason: not a smi]
--print-opt-code         // dump TurboFan's machine code: see the checkpoints for yourself
%DebugPrint(fn)          // call directly in code; prints the function's feedback vector + map

Compose these switches and you can see what V8 is doing behind the scenes. The next chapter walks a real function through a deopt event.
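
A minimal script tying the switches together; a sketch, since the status bits are V8-version dependent (see the Ch3 box):

// trio-demo.js · run: node --allow-natives-syntax trio-demo.js
function add(a, b) { return a + b; }

%PrepareFunctionForOptimization(add);
add(1, 2);                          // Ignition interprets, records feedback: SMI+SMI
add(3, 4);                          // feedback hardens, still monomorphic
%OptimizeFunctionOnNextCall(add);   // skip the "hot enough" wait for this demo
add(5, 6);                          // compiled under the assumption "a and b are SMIs"

const status = %GetOptimizationStatus(add);
console.log(status.toString(2));    // inspect the status bitmask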

CHAPTER 07 · ASSUMPTIONS

First scene of deopt · when add(1,2) meets add('a','b')

a real measurement of why broken assumptions don't just slow you down, they cliff-drop

Let's actually measure the "checkpoint fail → deopt" event from the previous chapter. The code below runs in three distinct performance regions, with a 3–5× swing:

BENCHMARK · deopt-on-type-mismatch · node --allow-natives-syntax

const { printOptimizationStatus } = require('./v8-print');

function add(a, b) { return a + b; }

// L1 · monomorphic warm-up: all SMI+SMI
console.time('mono');
for (let i = 0; i < 99999999; i++) add(i, i);
console.timeEnd('mono');                      // → 66 ms

// L2 · one string call breaks the assumption
add('a', 'b');                                // ★ once is enough ★
printOptimizationStatus(add);                 // → kIsLazy (deopted!)

// L3 · the exact same SMI+SMI loop again
console.time('after-deopt');
for (let i = 0; i < 99999999; i++) add(i, i);
console.timeEnd('after-deopt');               // → 243 ms · 3.6× slower
66 ms (L1 · monomorphic) → deopt (L2 · one bad call) → 243 ms (L3 · same code, 3.6× slower)
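
The ./v8-print helper required above is not part of Node. A minimal sketch of what it could look like, assuming the bit positions from the Ch3 box (they shift across V8 versions):

// v8-print.js · hypothetical helper · must run under --allow-natives-syntax
function printOptimizationStatus(fn) {
  const status = %GetOptimizationStatus(fn);   // natives syntax → bitmask
  const tags = [];
  if (status & (1 << 4)) tags.push('kOptimized');
  if (status & (1 << 5)) tags.push('kMaglevved');
  if (status & (1 << 6)) tags.push('kTurboFanned');
  console.log(fn.name, '→', tags.length ? tags.join(' | ') : 'kIsLazy / not optimized');
}
module.exports = { printOptimizationStatus };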

Why the same loop runs 3× slower after one bad call

Because V8 threw away the TurboFan machine code it had compiled in L1. L3's loop restarts in Ignition, the bytecode interpreter, an order of magnitude slower on its own. Eventually V8 re-compiles, but the feedback is now "polluted": it knows a/b can be either number or string, so the new assumption degrades to any+any, and the emitted asm has to carry extra type branches, fatter than the original mono-SMI version.

That's the real cost of deopt:

  • Immediate: the discarded TurboFan code wasted milliseconds of compile time.
  • Transient: thousands of interpreted calls in Ignition before re-tiering, each 5–10× slower.
  • Lasting: the recompiled version is polymorphic; even its steady state is much slower than the original monomorphic version. That's why L3 is 3.6× slower than L1.

How --trace-deopt catches it

$ node --allow-natives-syntax --trace-deopt deopt-demo.js

; (...L1's optimization log elided...)
[bailout (kind: eager): begin. deopting 0x1f9422b23479 <Code kind=TURBOFAN>]
[ reason: not a Smi]            ; ★ the culprit from the L2 call
[ bytecode offset: 7]
[ function: 0x1f942f4cc121 <JSFunction add>]
[ script: deopt-demo.js · line: 3]
[bailout: end. ↩ Ignition]

"reason: not a Smi" 这一行就是分析慢 JS 时最常见的元凶——它告诉你 哪一行 JS、第几个字节码偏移、为什么触发了反优化。后面 Phase 4 主线函数的优化过程里,我们会用这条日志一行行倒推问题。

"reason: not a Smi" is the single most common smoking gun when chasing slow JS — it pinpoints which line, which bytecode offset, which assumption blew up. In Phase IV's main-line we'll use this exact log to backtrack issues line by line.

REAL-WORLD LESSON
When "just adding a log" tanked a whole page

A PR added console.log(arg) to a per-cell function called thousands of times per frame; arg was occasionally undefined. The profiler showed the function suddenly 4× slower, not because of the log itself, but because the occasional undefined deopted the function into polymorphic asm forever after. Hoisting the log to the outer scope (dev-only) restored performance instantly.

CHAPTER 08 · ASSUMPTIONS

{Mono | Poly | Mega}morphic · a feedback slot's fate

monomorphic flies, polymorphic crawls, megamorphic gives up

The previous chapter showed one add('a','b') triggering a deopt, but the truth is finer-grained. Each call site's FeedbackVector entry runs a state machine that degrades step by step as more type variations come through:

Monomorphic (Mono) · e.g. SMI · SMI
Has seen only one type combo. V8 inlines the deepest optimization; the checkpoint is a single line. ~1× · baseline

Polymorphic (Poly)
Has seen 2–4 type combos. Still optimizable, but with extra cmp/jne branches inlined on the fast path. ~2–3× slower

Megamorphic (Mega)
Beyond 4 type combos. V8 gives up on optimizing this site and falls back to generic dispatch, about as slow as the interpreter. ~5–10× slower

FIG. 05 · A call site's feedback slot degrades as more type combos arrive: 1 → mono; 2–4 → poly; >4 → mega (V8 gives up).

Why specifically 4

It's an engineering tradeoff. Each extra type branch adds a few cmp/jne lines to the asm; code grows, i-cache pressure grows. The V8 team benchmarked extensively and found that polymorphism up to 4 combos still beats the interpreter; beyond that it's a net loss, so fall back to generic dispatch.

So 4 is empirical, not physical. As an author you only need one rule:

Make every hot function
as monomorphic as you can. The single most useful V8 heuristic.

How to read a slot's current state

Use %DebugPrint(fn) and find the feedback_vector section. You'll see something like:

%DebugPrint(test) · excerpt

- feedback vector: 0x2c7742c43b89 <FeedbackVector>
- shared function info: …
- tiering state: TieringState::kNone
- invocation count: 22554
- slot #0 BinaryOp BinaryOp::SignedSmall {  // ★ Mono · has only seen SMI
    [0]: 1
  }

// after feeding the same function one string call:
- slot #0 BinaryOp BinaryOp::Any {          // ★ Poly → Mega · degraded to Any
    [0]: 1
  }

BinaryOp::SignedSmall means you're golden (SMI mono); BinaryOp::Any means the slot has degraded to the worst case. This is the first diagnostic signal we'll reach for repeatedly in Phase IV's main line.
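
To reproduce the degradation end to end; a sketch, and the exact %DebugPrint output differs across V8 versions:

// slot-degrade.js · run: node --allow-natives-syntax slot-degrade.js
function test(a, b) { return a + b; }

for (let i = 0; i < 100_000; i++) test(i, i);  // monomorphic diet: SMI+SMI only
%DebugPrint(test);                             // slot #0 → BinaryOp::SignedSmall

test('a', 'b');                                // one string call pollutes the slot
test(1.5, 2.5);                                // and a double for good measure
%DebugPrint(test);                             // slot #0 → BinaryOp::Any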

CHAPTER 09 · ASSUMPTIONS

ShouldOptimize · when does TurboFan kick in

reading the actual V8 source for the decision

"足够热"到底意味着什么?V8 把这件事写在了一个具体的函数里——v8::internal::TieringManager::ShouldOptimize。我们直接拆它:

What does "hot enough" actually mean? V8 codifies it in one function — v8::internal::TieringManager::ShouldOptimize. Let's read it:

v8/src/execution/tiering-manager.cc · ShouldOptimize (simplified)

OptimizationDecision ShouldOptimize(FeedbackVector fbv, CodeKind kind) {
  SharedFunctionInfo shared = fbv.shared_function_info();
L1  if (kind == CodeKind::TURBOFAN)
L2    return DoNotOptimize();              // already on TurboFan, skip
L3  if (TiersUpToMaglev(kind) &&
L4      shared->PassesFilter(maglev_filter) &&
L5      !shared->maglev_compilation_failed()) {
L6    return Maglev();
L7  }
L8  if (V8_UNLIKELY(!v8_flags.turbofan ||
L9      !shared->PassesFilter(turbo_filter) ||
L10     v8_flags.efficiency_mode_disable_turbofan ||
L11     isolate->EfficiencyModeEnabledForTiering())) {
L12    return DoNotOptimize();              // power saving / disabled / filtered out
L13  }
L14  if (fbv.invocation_count() < v8_flags.minimum_invocations_before_optimization) {
L15    return DoNotOptimize();              // ★ not called enough times yet ★
L16  }
L17  BytecodeArray bc = shared->GetBytecodeArray(isolate_);
L18  if (bc.length() > v8_flags.max_optimized_bytecode_size) {
L19    return DoNotOptimize();              // ★ function too long, never optimized ★
L20  }
L21  return TurbofanHotAndStable();
}

What this means as an engineer

  1. L1 · already-optimized code isn't re-optimized (no need).
  2. L3 · the Maglev decision, constrained by maglev_filter (scope it with --maglev-filter=name).
  3. L8 · explicit disables, power-saver / efficiency mode, and filters skip optimization (e.g. node --v8-options="--turbo-filter=xxxxx").
  4. L14 · a function must be called enough times to be optimized. This is the threshold on the performance curve: your first few thousand cold-start calls won't make it into TurboFan. There's also an efficiency_mode_delay_turbofan setting that pushes tiering further out. (Watch the threshold live in the sketch below.)
  5. L18 · overly long functions are skipped. max_optimized_bytecode_size defaults to 60K bytecode bytes. That's why our first move in Phase IV will be function decomposition: break giant functions into small ones so each can be optimized.
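
To watch the invocation threshold from item 4 in action; a sketch, and the thresholds and log wording vary by Node/V8 version:

// tierup-watch.js · run: node --trace-opt tierup-watch.js
function add(a, b) { return a + b; }

let acc = 0;
for (let i = 0; i < 1_000_000; i++) acc = add(acc, i);
console.log(acc);

// expected somewhere in the log, once the threshold is crossed:
// [marking 0x... <JSFunction add> for optimization ...]
// [optimizing 0x... <JSFunction add> (target TURBOFAN) ...]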

Maglev's bench numbers

FIG. 06 · JetStream scores (higher = faster): Ignition 64 · + Sparkplug 93 · + TurboFan 279 · + Maglev 302 · all four 327. Maglev still adds ~8% on top of TurboFan; its real win is shrinking the "wait until hot" gap. Data: v8.dev/blog/maglev.
TAKEAWAY · DON'T WRITE GIANT FUNCTIONS · The max_optimized_bytecode_size check at L18 is one of the easiest traps. A thousand-line handler can sit forever above the threshold; V8 simply skips optimizing it no matter how often it's called. That's why the "function decomposition" rule in Phase IV is non-negotiable.
CHAPTER 10 · OBJECT MODEL

A 2008 design puzzle · laying out JSObject in memory

if you were Lars Bak in 2008

We've covered V8's compile pipeline and assumption system. Now into the third, and most rewarding, block: the object memory model.

A thought experiment: it's 2008, you're on the Chrome V8 team, and your job is to lay JS objects out in memory. How would you do it? First, here's how C does it:

C struct · static layout · x86-64 gcc -O2

#include <stdio.h>

struct Point { int x; int y; };

void printPoint(struct Point p) {
  printf("x=%d, y=%d", p.x, p.y);
}

; compiled access to p.x / p.y:
mov eax, DWORD PTR [rbp-12]   ; base + 0 → x
mov edx, DWORD PTR [rbp-8]    ; base + 4 → y

A static struct is one contiguous block of memory. The compiler knows x is at offset 0 and y at offset 4; property access is an O(1) offset load. But that rests on two assumptions you can't break:

  1. The compiler knows which fields exist at compile time.
  2. The struct is fixed-size; you can't add fields at runtime.

JS shatters both. obj.foo = 42 can graft a property on at any moment; delete obj.foo rips it off at will. So you can't get away with "one mov per property read".

The naive design: an array of [key, value]

First instinct: if fields are dynamic, store them as [key1, val1, key2, val2 ...] and walk the array on every obj.x.

[0]"x"3
[1]"y"5
[2]"z"6

Two problems:

  1. Lookup is O(n): scan the keys on every access.
  2. Objects of the same shape duplicate the key strings: a million {x, y} objects means a million copies of "x" and "y".

The second one is lethal. A real Web app has tens of thousands to millions of identically-shaped objects; that's unacceptable bloat.

V8's answer: lift the shape out

V8 chose this: every JSObject has three storage areas plus a pointer to a Hidden Class, and the Hidden Class itself is the "shape description". All same-shape objects share one.

+0   *hiddenClass → shape
+8   *properties  → named-props array
+16  *elements    → indexed-elems array
+24  in-object #0 → 3 (= obj.x)
+32  in-object #1 → 5 (= obj.y)
+40  in-object #2 → 6 (= obj.z)

*hiddenClass
Points to the shape descriptor. All {x, y, z} objects point at the same Hidden Class, so keys are stored once. The next chapter dissects it.

*properties
Points to a named-properties array. When properties overflow the in-object slots, they spill here.

*elements
Points to an indexed-elements array. Numeric-indexed (arr[0]-style) values live here: contiguous memory, very fast.

in-object
Slots reserved inside the JSObject itself. Fastest access: base + offset, just like a C struct! But only when the shape is known ahead of time, which is exactly what Hidden Class + IC give you.

This fixes both problems

  1. No more key duplication. A million {x, y, z} objects share a single set of "x" / "y" / "z" strings (inside the shared Hidden Class).
  2. Property lookup can become O(1), provided we know the object's shape ahead of time. Ch11 covers how the Hidden Class records the shape; Ch13 covers how the Inline Cache compiles it into asm offsets.

The price: change the shape, change the Hidden Class. Hence the central mechanism of Phase III: the Transition Chain (next chapter).

CHAPTER 11 · OBJECT MODEL

Hidden Class / Shapes · an object's skeleton

same-shape objects share one descriptor

"Hidden Class"是 V8 的术语,在 V8 源码里它的工程名是 Map(就是 %DebugPrint 里看到的那个 Map);Edge Chakra 叫 Types,JavaScriptCore 叫 Structure,SpiderMonkey 叫 Shapes所有现代 JS 引擎都有同一个东西——只是名字不一样。

Hidden Class 内部最关键的子结构是 DescriptorArray——它记录"这个 shape 上有哪些 key、key 对应的 in-object 槽位下标是几"。下面用一个具体例子:

"Hidden Class" is V8's term — internally, the V8 source calls it a Map (yes, the same Map you see in %DebugPrint). Edge Chakra calls it Types, JavaScriptCore calls it Structure, SpiderMonkey calls it Shapes. Every modern JS engine has the exact same thing under different labels.

The most important sub-structure inside a Hidden Class is the DescriptorArray — it records "this shape has these keys, and each key corresponds to this in-object slot index". Concrete example:

JSObject (o)                    Hidden Class (Map · DescriptorArray)
*hiddenClass  ────────────────→ "x" → offset 0
*properties []                  "y" → offset 1
*elements []
in-obj #0: 11
in-obj #1: 22

FIG. 07 · What const o = { x: 11, y: 22 } really looks like inside V8. The JSObject itself only carries values; key names are shared via the Hidden Class. Another { x: 33, y: 44 } would point at the same Hidden Class. That's the lever.

Same shape ⇒ same Hidden Class

The crucial property: objects of identical shape share the same Hidden Class instance.

  • const o1 = { x: 11, y: 22 } · Hidden Class A
  • const o2 = { x: 33, y: 44 } · the same Hidden Class A
  • const o3 = { y: 11, x: 22 } · a different Hidden Class B (order changed!)
  • const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C (extra key)

Note o3 vs o1: assignment order is part of the shape. This underlies rule #4 of Phase IV: "keep property assignment order stable".

HOW TO VERIFY TWO OBJECTS SHARE A HIDDEN CLASS · Run %DebugPrint(obj) and look at the map: 0x... line. Same physical address = same Hidden Class = the same IC fast path will hit both.
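
Or ask V8 directly. A minimal sketch using the %HaveSameMap native (same flag required):

// maps.js · run: node --allow-natives-syntax maps.js
const o1 = { x: 11, y: 22 };
const o2 = { x: 33, y: 44 };
const o3 = { y: 11, x: 22 };           // same keys, different insertion order
const o4 = { x: 11, y: 22, z: 33 };    // extra key

console.log(%HaveSameMap(o1, o2));     // true  → the same Hidden Class A
console.log(%HaveSameMap(o1, o3));     // false → order is part of the shape
console.log(%HaveSameMap(o1, o4));     // false → extra key, different class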

in-object properties vs *properties

The previous chapter mentioned V8 has two places to store named properties: in-object (reserved inside the JSObject body) and the *properties array (overflow). The DescriptorArray in the Hidden Class describes both; to you it's just obj.x, but V8 may take either path internally.

Which one, when? V8 reserves 4 in-object slots for an empty object by default (called Slack Tracking, see the Ch20 toolbox). The first 4 properties go in-object; later ones overflow into the *properties array. That's the foundation for Phase IV rule #5 ("declare class fields with defaults"): fill those slots the moment the object is born.

CHAPTER 12 · OBJECT MODEL

Transition Chain · growing a linked list as the object grows

watch the chain grow node by node

The previous chapter said same-shape objects share a Hidden Class, but how does a shape change? V8's answer: chain Hidden Classes into a transition chain. Each new property appends a node; objects that took the same path of insertions share the same chain node.

FIG. 08 · Starting from an empty object (*hiddenClass → ∅, empty *properties and *elements), each new property ("+x", "+y", "+z") appends a node to the chain. Two objects walking the same insertion path end up on the same Hidden Class node: this is the physical basis of V8's IC sharing.

为什么"赋值顺序"很重要

Why insertion order matters

从链表结构能直接看出来:

The chain structure makes it obvious:

two objects that look identical but aren't · shape pitfall

// path A: x first, then y
const a = {};
a.x = 1;
a.y = 2;
// → chain ∅ → "x" → "y" → Hidden Class A

// path B: y first, then x ★ a different chain ★
const b = {};
b.y = 2;
b.x = 1;
// → chain ∅ → "y" → "x" → Hidden Class B ≠ A

// consequence: a function that receives both a and b goes polymorphic
// its IC caches break, and performance can drop 2–3× outright

Two innocuous-looking blocks. In V8's eyes they point at two completely different Hidden Classes, and any function that takes either gets pushed into polymorphism. This is the most insidious trap in performance-critical code.

The fix is mechanical: initialize all fields up front, in a fixed order. React/Vue-style internal instance pools deliberately keep field order identical across instances for exactly this reason: keep every instance on the same Hidden Class. A factory sketch follows.
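
A minimal sketch of the pattern; makePoint and its fields are illustrative, not from the main line:

// point-factory.js
// Every instance is born with the same keys in the same order,
// so all of them share one Hidden Class and one transition chain.
function makePoint(x, y, label) {
  return {
    x,                      // always slot 0
    y,                      // always slot 1
    label: label ?? null,   // null instead of "sometimes absent"
  };
}

const p1 = makePoint(1, 2, 'anchor');
const p2 = makePoint(3, 4);   // no label → still the same shape
// check under node --allow-natives-syntax: %HaveSameMap(p1, p2) === true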

Branches: when the chain forks

A chain only describes same-path growth. When two insertion paths diverge at some step, the Hidden Classes form a tree with transitions:

TRANSITION TREE

∅ → x → y → u   // o1 = {x, y, u}
         └→ v   // o2 = {x, y, v}

o1 and o2 share the prefix ∅ → x → y; at the third step each hangs its own transition. Shared prefixes keep V8's Hidden Class count far below the Cartesian product of object shapes.
CHAPTER 13 · OBJECT MODEL

Inline Caches · turning string lookup into offset reads

the knife that cuts O(n) down to O(1)

Everything in Phase III leads here. The Inline Cache (IC) is the steepest part of V8's performance curve: it can cut an obj.x access from an O(n) string lookup down to a single mov. The gap can exceed 100×.

A real measurement: the same "service discovery" function written two ways, 10M iterations:

dynamic lookup · 10M iters · ~6.4 s

const _404Handler = (req) => ({ a: '404' });

function select(map, key) {
  return map[key] || _404Handler;   // ★ dynamic key, O(n)
}

const serviceMap = { userLogin: …, a: …, b: …, c: … };

for (let i = 0; i < 10_000_000; i++) {
  select(serviceMap, 'userLogin')(req);
  select(serviceMap, 'a')(req);
  select(serviceMap, 'b')(req);
  select(serviceMap, '404')(req);
}
// → ~6400 ms

static lookup · 10M iters · ~44 ms

function select(map, key) {
  if (key === 'userLogin') return map.userLogin;
  if (key === 'a') return map.a;    // ★ static keys: the IC folds each into an offset
  if (key === 'b') return map.b;
  if (key === 'c') return map.c;
  return _404Handler;
}
// same 10M calls → ~44 ms
// ↑ a 145× gap

FIG. 09 · Same logic, same 10M calls: dynamic map[key] takes 6.4 s, static map.a takes 44 ms, ~145×. Not the function's fault; it's whether V8 can fold the access into an IC.

What an IC looks like in asm

When V8 compiles a property access with a static key like map.a, on the first call it follows the Hidden Class to find "a"'s in-object offset (say, 1), then burns that number directly into the emitted asm:

IC fast path · TurboFan output · arm64 (m1)

; map.a, assuming map's Hidden Class lives at 0x3a8d76b74971
L133  ldr x0, [x4, #+24]               ; load map's *hiddenClass
L134  cmp x0, 0x3a8d76b74971           ; ★ checkpoint: shape unchanged?
L135  b.ne CompileLazyDeoptimizedCode  ; changed → deopt
L136  ldr x0, [x4, #+32]               ; ★ read in-object slot 1 directly
                                       ; (32 = JSObject header 24 + slot 1 × 8)
L137  ret

That's the Inline Cache in the flesh: it inlines "look up by key" into "compare one Hidden Class pointer, then load a fixed offset". The name comes from exactly this: the cache is inlined into the asm itself.

Contrast dynamic map[key]: V8 doesn't know the value of key at compile time, so each call has to:

  1. compare the key string against every key name in the Hidden Class;
  2. then load the matched offset.

Even with V8's internal string interning (same string → same pointer, so only pointers are compared), this is dramatically slower than one ldr. Hence the 145×.

Static beats dynamic.
Not a style preference: a 145× gap. Field Note · 03

IC states

Note that ICs follow the same state machine as Chapter 8: uninitialized on the first call, monomorphic from the second, polymorphic after seeing 2–4 different Hidden Classes, megamorphic (cache abandoned) beyond 4. "Keep objects the same shape" and "use static keys" are two faces of the same thing: the IC optimization only kicks in when both hold.

CHAPTER 14 · OBJECT MODEL

Fast vs Slow Properties · the cost of delete

caching's worst enemy is invalidation

Everything so far has been Fast Properties: Hidden Class + IC compressing property access into one ldr. There's one operation that kicks an object off the fast path entirely, demoting it to Slow Properties, dozens to hundreds of times slower.

That operation is delete.

%HasFastProperties · before / after delete · node --allow-natives-syntax

> const obj = { x: 123, y: 555 };
> console.log('after init:', obj, %HasFastProperties(obj));
→ {x: 123, y: 555} true

> obj.xxxxx = 123;
> console.log('after adding a member:', obj, %HasFastProperties(obj));
→ {x: 123, y: 555, xxxxx: 123} true   ; ★ adding a property keeps it Fast

> delete obj.xxxxx;
> console.log('after deleting a member:', obj, %HasFastProperties(obj));
→ {x: 123, y: 555} false              ; ★ after delete it drops to Slow!

Why V8 stops maintaining the Hidden Class after delete

Allowing delete opens three nasty cans of worms:

  1. What happens to the freed in-object slot? Compact the later properties into it → every other object's cached offsets break. Leave it empty → every cached IC offset is now wrong.
  2. What about the other objects on the same Hidden Class? Keep them pointing at it → references go stale (o1 lost x but others still expect it); fork a new Hidden Class → every IC pointing at the old one must be invalidated.
  3. Every offset already inlined into TurboFan machine code needs re-emitting.

All three are hard. V8 picked the simplest give-up plan: any object touched by delete degrades to Slow Properties. Its properties move into a dictionary (think Map<string, Value>), abandoning in-object storage and IC optimization.

Dictionary access is a hash lookup, dozens to a hundred times slower than an IC hit. And the demotion is one-way: once Slow, always Slow.

REAL-WORLD LESSON
When "cleanup" tanks performance instead of helping

Some code did delete obj.bigPayload on every cache object at end-of-loop to "free memory". Next frame, every function reading those objects' properties deopted: the cache objects had all dropped to Slow Properties, and the whole module ran 4× slower. The fix: obj.bigPayload = null (or = undefined), which preserves the Hidden Class while still letting GC reclaim the referenced memory.

RULE · In performance-critical code, avoid delete. To "clear" a property, set it to null or undefined (null states "no value" explicitly; undefined keeps legacy checks happy). The Hidden Class survives, ICs stay valid, and GC still reclaims the referenced memory.
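
A before/after check of the rule; a sketch for node --allow-natives-syntax:

// null-vs-delete.js · run: node --allow-natives-syntax null-vs-delete.js
const a = { x: 1, bigPayload: new Array(1_000_000).fill(0) };
const b = { x: 1, bigPayload: new Array(1_000_000).fill(0) };

a.bigPayload = null;                   // clear by nulling → shape preserved
delete b.bigPayload;                   // clear by deleting → dictionary mode

console.log(%HasFastProperties(a));    // true  → still Fast, ICs intact
console.log(%HasFastProperties(b));    // false → demoted to Slow Properties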
CHAPTER 15 · HOT FUNCTION

Before · ~240 ms · the polymorphic mess

putting all 14 chapters' diagnostic tools to work

The previous 14 chapters were the knives. This chapter takes the main-line function and dissects it.

The function in one sentence: normalize any input (number / string / object) into a rem value. In our codebase it ran hundreds of times per layout frame; the profiler called it out as a hot spot:

v0 · naive · before any optimization

// input:  number | string | { value, unit }
// output: rem value (number)
function px2rem(input, base) {
  let value, unit;
  if (typeof input === 'number') {                  // case A
    value = input;
    unit = 'px';
  } else if (typeof input === 'string') {           // case B
    const m = input.match(/^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/);
    value = m ? parseFloat(m[1]) : 0;
    unit = (m && m[2]) || 'px';
  } else if (input && typeof input === 'object') {  // case C
    value = input.value;
    unit = input.unit || 'px';
  } else {                                          // case D
    return 0;
  }
  if (unit === 'rem') return value;
  if (unit === 'em') return value;
  if (unit === '%') return value / 100;
  return value / base;                              // 'px' default
}

Cut #1: profile and time it

$ node --allow-natives-syntax bench.js · timing

// feed three kinds of input, simulating the real call distribution
const samples = [12, '14px', { value: 16, unit: 'rem' }, 20, '1.5em'];

console.time('v0');
for (let i = 0; i < 1_000_000; i++) px2rem(samples[i % 5], 16);
console.timeEnd('v0');
// → v0: 243.7 ms

Cut #2: ask V8 what it thinks of this function

%GetOptimizationStatus(px2rem) + %DebugPrint(px2rem) · diagnosis

// %GetOptimizationStatus output:
kIsFunction | kOptimized | kTurboFanned   ; already on TurboFan, yet still slow

// %DebugPrint, key excerpt:
- feedback vector:
  - invocation count: 1000000
  - slot #0 BinaryOp BinaryOp::Any             ; ★ degraded to Any!
  - slot #4 LoadProperty (LoadIC) MEGAMORPHIC  ; ★ megamorphic!
  - slot #7 Compare CompareOp::Any             ; ★ type guards degraded

Cut #3: read the deopt log

$ node --trace-deopt bench.js

[bailout (kind: eager): reason: not a Smi; px2rem @ bytecode 7]
[bailout (kind: eager): reason: wrong map; px2rem @ bytecode 41]
[bailout (kind: eager): reason: unexpected type; px2rem @ bytecode 18]
[bailout (kind: eager): reason: not a Smi; px2rem @ bytecode 7]
; deopt fired 47 times across the 1M-iteration loop ★

Five symptoms, five diagnoses

# | Symptom                 | Root cause                                                          | Ref
1 | BinaryOp::Any           | args mix number / string / object → polymorphic                    | Ch8
2 | LoadIC::MEGAMORPHIC     | input.value / input.unit see too many shapes → IC gives up         | Ch13
3 | reason: not a Smi       | the number path assumed SMI, but a float (HeapNumber) deopted it   | Ch5
4 | reason: wrong map       | multiple {value, unit} shapes flow through the object branch       | Ch11
5 | function is also long   | three input paths in one function → bytecode bloat → near the max_optimized_bytecode_size threshold | Ch9

All five trace back to one design mistake: one function handling three structurally different inputs. From V8's view, you've forced it to emit "handle three types" polymorphic asm for every property access and every binary op; the fast path never gets a chance to form.

The fix is next chapter: split it into three monomorphic functions, then work through the rest of the 14 knives, cut by cut.

CHAPTER 16 · HOT FUNCTION

Twelve V8-friendly rewrite rules

one worked example per rule

The following 12 rules aren't style preferences; each maps to a specific mechanism from the previous 14 chapters. They're ordered by how often they applied to the main-line px2rem function: the top few cuts buy the most.

RULE 01 · #1
Split polymorphic functions into monomorphic ones

px2rem takes number / string / object at once, so V8 must emit polymorphic asm for every binop, dropping to BinaryOp::Any. Split it into three functions, px2remNumber / px2remString / px2remObject, and dispatch at the call site; each function can then stay monomorphic (see the sketch below).

Maps to: Ch8 (Mono/Poly/Mega).
Usually accounts for 50%+ of the total speedup. On the main line, this single cut takes v0 from 240 ms to ~120 ms.
RULE 02 · #2
把热点函数拆得足够小
Decompose hot functions until each is small

超过 max_optimized_bytecode_size(默认 60K bytecode 字节)的函数,V8 不会优化。即使没超,小函数还能享受 inline 展开——TurboFan 会把小被调函数 inline 进调用方,省一次 push/pop。

Functions over max_optimized_bytecode_size (60K bytecode bytes by default) skip optimization entirely. Even below the limit, small functions get inlined — TurboFan folds them into the caller, saving the push/pop.

对应章节: Ch3 (流水线), Ch9 (ShouldOptimize)。
主线函数把单位换算和距离计算拆成独立函数,各自 < 200 字节 bytecode。
Maps to: Ch3 (Pipeline), Ch9 (ShouldOptimize).
The main-line splits unit conversion and distance math into separate functions, each under 200 bytecode bytes.
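One way to check where a function stands against that budget — a sketch using V8's bytecode-printing flags via Node (exact header text varies by V8 version):

$ node --print-bytecode --print-bytecode-filter=px2remString bench.js | head
# [generated bytecode for function: px2remString]
# Parameter count 3
# Register count 2
# ... the printed bytecode length / frame size — compare against the 60K budget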
RULE 03 · #3
用 TypeScript 锁住函数的入参类型
Use TypeScript to lock arg types

TS 类型系统不是为了"装",它在工程上恰好替你保证了热点函数的单态性——只要类型签名是 (n: number) => number,你就基本不会不小心给它喂 string。

TS types aren't decoration. In practice they enforce the monomorphism of hot functions — a signature of (n: number) => number means you basically won't accidentally feed it a string.

注意: TS 不能保证 SMI vs 浮点的区分,这是 V8 内部的差异。但它能保证 number vs string 不混。
Caveat: TS can't enforce SMI vs float — that's a V8 internal distinction. But it does keep number and string apart.
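If the codebase is plain JS and you don't want a .ts build step, JSDoc plus // @ts-check buys the same guarantee — a minimal sketch (the checker flags the bad call at edit time):

// @ts-check

/**
 * @param {number} value
 * @param {number} base
 * @returns {number}
 */
function px2remNumber(value, base) {
  return value / base;
}

px2remNumber(14, 16);        // ok
// px2remNumber('14px', 16); // ← tsc error — the string can never sneak in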
RULE 04 · #4
保持对象赋值顺序不变
Keep property assignment order stable

{x:1, y:2}{y:2, x:1} 在 V8 里是两个不同的 Hidden Class。在 factory 函数里,所有对象都按同一个顺序赋值——这样所有 instance 共享同一条 transition chain。

{x:1, y:2} and {y:2, x:1} are two different Hidden Classes. In factory functions, assign properties in a fixed order so every instance walks the same transition chain.

对应章节: Ch11, Ch12 (Hidden Class · Transition Chain)。
Maps to: Ch11, Ch12 (Hidden Class · Transition Chain).
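A minimal sketch of the factory pattern (names hypothetical):

// One factory, one assignment order — every instance walks the
// same Hidden Class transition chain.
function makePoint(x, y) {
  return { x: x, y: y };       // always x first, then y
}

const a = makePoint(1, 2);
const b = makePoint(3, 4);     // same shape as `a`
// const c = { y: 4, x: 3 };   // ← different order → different Hidden Class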
RULE 05 · #5
class 字段加默认值
Declare class fields with defaults

V8 给空对象预留 4 个 in-object 槽位(Slack Tracking)。如果你在 constructor 里"有时"才赋某个字段,会触发 Hidden Class 分叉。所有字段在 constructor 一次写齐(没值就 null/undefined),让所有实例走同一条链。

V8 reserves 4 in-object slots (Slack Tracking). If your constructor "sometimes" assigns a field, you fork the Hidden Class. Initialize every field in the constructor (use null/undefined if no value), keeping all instances on one chain.

主线 px2remObject 的内部 result 对象就是按这条规则一次性初始化的。
px2remObject's internal result object follows this rule for its single-shot init.
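A sketch of the rule on a hypothetical class — every field is written exactly once, in the same order, on every construction path:

class Measurement {
  constructor(value, unit) {
    this.value = value;
    this.unit = unit ?? 'px';  // assigned even when the caller omits it
    this.cached = null;        // declared now, filled later — no late shape fork
  }
}
// new Measurement(16) and new Measurement(16, 'rem')
// land on the same Hidden Class.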
RULE 06 · #6
不用 delete
Don't use delete

一次 delete obj.x 会把对象从 Fast Properties 一脚踹进 Slow Properties——所有 IC 失效,后续访问慢几十~百倍且不可逆。要"清掉"就 obj.x = null

A single delete obj.x kicks an object from Fast to Slow Properties — invalidates every IC, slows access dozens to a hundred times, and is irreversible. To "clear" a property, use obj.x = null.
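You can watch the demotion happen — a small sketch to run with node --allow-natives-syntax:

const a = { x: 1, y: 2 };
delete a.x;                           // kicked into dictionary mode
console.log(%HasFastProperties(a));   // → false

const b = { x: 1, y: 2 };
b.x = null;                           // shape intact, ICs stay valid
console.log(%HasFastProperties(b));   // → true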

RULE 07 · #7
避免反优化
Avoid deopts

在生产 build 上加 --trace-deopt 跑一遍核心场景,看哪些函数 deopt——大多数是偶尔传 undefined 或者偶尔抛 try-catch。把这些"偶尔"消除就行。

Run your core scenarios with --trace-deopt in a prod build and find every deopting function. Most cases are occasional undefined or occasional try-catch throws. Remove the "occasionals".
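When the log gets noisy, one way to rank offenders — a sketch assuming a POSIX shell, with app.js standing in for your entry point:

$ node --trace-deopt app.js 2>&1 \
    | grep -o 'reason: [^;]*' \
    | sort | uniq -c | sort -rn
#  47 reason: not a Smi          (counts illustrative)
#  12 reason: wrong map
#   3 reason: unexpected type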

RULE 08 · #8
静态写法,优于动态写法
Static beats dynamic

这是 Ch13 的 145 倍跑分差距。在热点里把 obj[key] 改成 obj.knownKey,把 switch(string) 改成 switch(intEnum)——一刀切。

The 145× from Ch13. In hot paths, replace obj[key] with obj.knownKey and string switches with int-enum switches. One clean cut.
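A sketch of both rewrites (names hypothetical; the int enum mirrors the UNIT_* constants of the next chapter):

// 1) dynamic key → static key
// before: const size = style[prop];    // varying `prop` → megamorphic LoadIC
// after:  const size = style.fontSize; // one shape, one compiled offset

// 2) string switch → int-enum switch
const KIND_PX = 0, KIND_REM = 1, KIND_PCT = 2;
function toRem(kind, v, base) {
  switch (kind) {                       // Smi compares — no string hashing
    case KIND_REM: return v;
    case KIND_PCT: return v / 100;
    default:       return v / base;     // KIND_PX
  }
}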

RULE 09 · #9
字面量声明优于过程式声明
Literals beat procedural construction

const o = {x: 1, y: 2}const o = {}; o.x = 1; o.y = 2 更稳——前者一次性建好 Hidden Class,后者要走两次 transition。

const o = {x: 1, y: 2} is more reliable than const o = {}; o.x = 1; o.y = 2 — the literal builds the Hidden Class in one shot; the procedural form walks two transitions.

RULE 10 · #10
让对象只活在一个函数内
Keep object lifetime within one function

基于逃逸分析(Escape Analysis):如果对象不逃出函数,V8 可以把它的字段全部展开成寄存器变量,根本不分配堆内存。这对 GC 也是免费收益。

Based on escape analysis: if an object never escapes its function, V8 can replace its fields with register variables and skip heap allocation entirely. Free GC win too.
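A sketch of a temporary that never escapes (function hypothetical):

function distance(x1, y1, x2, y2) {
  const d = { dx: x2 - x1, dy: y2 - y1 };  // never leaves this function
  return Math.sqrt(d.dx * d.dx + d.dy * d.dy);
}
// Escape analysis can scalar-replace `d`: dx/dy live in registers,
// no heap allocation, nothing for the GC to trace. Returning `d`
// (or storing it on another object) would make it escape — and
// bring the allocation back.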

RULE 11 · #11
能用整数就别用浮点
Use integers over floats when you can

SMI(整数)在 V8 里是立即数,不进堆;float 一律装箱成 HeapNumber,要分配 + GC + 间接寻址。能用 Math.floor / 整数 enum 就用,只在最终输出层做一次 / 100 转浮点。

SMIs (ints) live as immediate values; floats box into HeapNumber with allocation, GC, and indirection. Prefer Math.floor and integer enums; only divide-by-100 at the very last output step.
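A sketch of the "integers until the edge" pattern (names hypothetical):

// Keep money as integer cents — SMIs all the way through the hot loop.
function addCents(a, b) {
  return a + b;                        // Smi add, no HeapNumber boxing
}

let totalCents = 0;
for (const c of [1999, 501, 250]) totalCents = addCents(totalCents, c);
const display = totalCents / 100;      // the one float conversion, at output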

RULE 12 · #12
慎用 Ref<T> 之类的包装
Avoid Ref<T>-style wrappers

React/Vue 里 useRef(0) 把数字包成 { current: 0 } 对象——读写都得过一层 Hidden Class + IC。如果你需要在热点里高频读写一个数,直接用闭包 let 变量,比 ref 快好几倍。

React/Vue's useRef(0) wraps a number into { current: 0 } — every read/write hits a Hidden Class + IC. For high-frequency hot-path reads, a closure-captured let outperforms a ref by several times.
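A sketch of the closure alternative (API shape hypothetical):

function makeCounter() {
  let n = 0;                 // lives in the closure — no wrapper object, no IC
  return {
    tick() { n++; },
    read() { return n; },
  };
}
// versus: const ref = { current: 0 }; ref.current++;
// — every ++ goes through a LoadIC + StoreIC on the wrapper's Hidden Class.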

规则的优先级PRIORITIZING THE RULES 不是每条都得用上。Rule 1 / 2 / 6 / 8 是性能收益最大的四条——其他几条更多是"保护性"规则,在热点函数上别踩坑。如果改一段代码只能动一两刀,从这四条里挑。 You don't need all twelve. Rules 1 / 2 / 6 / 8 carry the most weight — the others are protective: don't step on these traps. If you only have time for two cuts, pick from those four.
CHAPTER 17 · HOT FUNCTION

今生 · ~24 ms — 单态 + Hidden Class 稳定 + IC 友好

After · ~24 ms — monomorphic, stable shapes, IC-friendly

把所有刀切下去之后

after all the cuts have landed

下面是按 12 条规则改完的版本。代码更长了——但每个函数都是单态、字段顺序固定、没有 delete、没有动态 key:

The version after all twelve rules. The code is longer — but every function is monomorphic, field order is fixed, no delete, no dynamic keys:

v1 · final · all 12 rules applied monomorphic + IC-friendly
// ── 三个单态分支 ── (Rule 1 + 3)
function px2remNumber(value /* number */, base /* number */) {
  return value / base;
}

// 提到模块顶层,只编译一次 (Rule 9)
const RE = /^(-?\d+(?:\.\d+)?)(px|rem|em|%)?$/;
const UNIT_PX = 0, UNIT_REM = 1, UNIT_EM = 2, UNIT_PCT = 3;  // 整数 enum (Rule 11)
const UNIT_MAP = { 'px': UNIT_PX, 'rem': UNIT_REM, 'em': UNIT_EM, '%': UNIT_PCT };

function px2remString(input /* string */, base /* number */) {
  const m = input.match(RE);
  if (!m) return 0;
  const v = +m[1];                       // + 比 parseFloat 更直接
  const u = UNIT_MAP[m[2]] ?? UNIT_PX;   // 静态 key (Rule 8)
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

// (Rule 5) 工厂确保所有 input 对象 shape 完全一致 —
// 字段顺序固定 value, unit (Rule 4),从不 delete (Rule 6)
function px2remObject(input /* {value:number, unit:string} */, base) {
  const u = UNIT_MAP[input.unit] ?? UNIT_PX;  // 静态属性访问 (Rule 8)
  const v = input.value;                      // (Rule 8)
  if (u === UNIT_REM || u === UNIT_EM) return v;
  if (u === UNIT_PCT) return v / 100;
  return v / base;
}

// 调用方分发 (Rule 1) — 只有这里碰多态,且分发一次 inline 就消失了
function px2rem(input, base) {
  if (typeof input === 'number') return px2remNumber(input, base);
  if (typeof input === 'string') return px2remString(input, base);
  return px2remObject(input, base);
}

跑分对比

Benchmark comparison

v0 · naive
243 ms
+ rule 1 (split)
122 ms
+ rule 2 + 9
79 ms
+ rule 4 + 5
48 ms
+ rule 8 + 11
31 ms
v1 · all rules
24 ms
FIG. 10 每加一刀的累积效应。1M 次 px2rem 调用,从 243 ms 降到 24 ms,十倍提速——其中第一刀(拆单态)占了一半,接下来的几刀各自砍了 20–40%。这就是规则 #1 的"绝对优先级"由来。 Cumulative effect of each cut. 1M px2rem calls drop from 243 ms to 24 ms — 10×. The first cut (split into monomorphic) takes half; each later cut shaves another 20–40%. That's why rule #1 sits at the top.
243 ms (v0 · naive) → 24 ms (v1 · final) — 10× speedup, verified

验证 V8 现在怎么看这段代码

Asking V8 what it thinks now

%DebugPrint(px2remNumber) verified mono
- feedback vector:
  - tiering state: TieringState::kNone
  - invocation count: 200000
  - slot #0 BinaryOp BinaryOp::Number   ; ★ 单态 Number
    [0]: 1
- Code:
  - kind: TURBOFAN
  - bytecode size: 28 bytes             ; 远低于 60K 阈值
- inlined into px2rem? YES (3 call sites) ; ★ 被 inline 展开了

十倍是怎么算出来的

Where the 10× actually comes from

不是某一刀很神,而是每一刀都解决了一个具体的 V8 机制问题,所有的小提速复合起来。把它列成账本:

No single cut is magic. Each one solves one specific V8 mechanism problem, and the small wins compound. As a ledger:

Cut      解决的问题 Problem fixed                           单刀贡献 Per-cut   累计 Cumulative
v0       起点 baseline                                      —                 243 ms
+ R1     三个单态函数 → 退出 BinaryOp::Any                   −50%              122 ms
         three mono fns → exit BinaryOp::Any
+ R2/9   小函数被 inline + 常量提到模块顶层                  −35%              79 ms
         small fns get inlined + top-level constants
+ R4/5   所有 result/input 对象同 Hidden Class               −39%              48 ms
         all result/input objects share a Hidden Class
+ R8/11  静态 key + 整数枚举 → IC 优化                       −35%              31 ms
         static keys + int enums → ICs kick in
+ R10    逃逸分析,临时对象不上堆                             −23%              24 ms
         escape analysis — temp objects skip the heap
v1       对比 v0 · vs v0                                    10.1×             24 ms
十倍提速不是魔法,
是十二刀切下去的累加。 Field Note · 03
A tenfold speedup isn't magic.
It's twelve cuts that compound. Field Note · 03

这套方法论可以照搬到任何热点上吗

Will this method work on any hotspot

大部分情况能。但前提是你的瓶颈真的是 JS 执行——如果是 DOM 操作、合成层、网络、GC——那就是另外一座山(分别对应 chromium-renderer 那篇文章里的不同章节)。

检验方法很简单:打开 Chrome DevTools 的 Performance 面板,看热点函数占帧时间的比例、落在哪种颜色上。如果是 Scripting 的黄色、且占比超过 5%,这套方法论几乎一定有用。

Mostly, yes — provided the bottleneck actually is JS execution. If it's DOM, compositing, network, or GC, that's a different mountain (each covered in its own chapter of the chromium-renderer piece).

Quick check: open Chrome DevTools' Performance panel and look at your hot function's share of frame time — and its color. Scripting-yellow and more than 5% of the frame means this methodology will almost certainly help.

CHAPTER 18 · FRONTIERS

JSCore 也在做这件事 — Safari 的 FTL JIT

JSCore does the same — Safari's FTL JIT

这套方法论是引擎无关的

this methodology is engine-agnostic

这篇文章一直在讲 V8——但前面 12 条规则跨引擎都成立。原因很简单:Hidden Class、Inline Cache、type feedback,这套设计 1991 年的 Self 语言研究里就有——所有现代 JS 引擎都各自独立实现了一份。

This piece has been about V8, but those 12 rules are engine-agnostic. The reason: Hidden Class, Inline Cache, type feedback all trace back to 1991 Self research — every modern JS engine has independently implemented the same trio.

引擎 Engine              JIT 层级 JIT tiers                           Hidden Class   IC   Type feedback
V8 · Chrome / Node       Ignition · Sparkplug · Maglev · TurboFan     Map            ✓    FeedbackVector
JSCore · Safari          LLInt · Baseline · DFG · FTL (B3)            Structure      ✓    ValueProfile
SpiderMonkey · Firefox   Interpreter · Baseline · Warp · Ion          Shape          ✓    CacheIR
Hermes · RN              AOT bytecode (no JIT)                        HiddenClass    ✓    — (no JIT)

JSCore 的特别之处:把 LLVM 拉来当后端

What's special about JSCore: LLVM as backend

JSCore(WebKit 的 JS 引擎,iOS / macOS Safari 用)有一个独门设计:它的峰值层 FTL("Fourth Tier LLVM")最初直接把 JS 编译进 LLVM IR,然后调用 LLVM 的全套优化——同一份 LLVM 用来编 C++ / Rust / Swift,也用来编你的热点 JS。2016 年起 WebKit 用自研的 B3 后端替换了 LLVM(编译延迟更低),但"峰值层接一条静态编译器级别的优化管线"这个设计没变。

实战影响:在某些 benchmark 上,Safari 的 JSCore 比 Chrome 的 V8 还快——尤其是计算密集型 + 类型稳定的代码,FTL 的循环优化、向量化、内联策略都比 V8 的 TurboFan 更激进。

跨引擎的高性能 JS 写法是同一套——前面那 12 条规则在 JSCore 上一字不差地适用。

JSCore (WebKit's engine, used in iOS/macOS Safari) has a unique design: its peak tier FTL ("Fourth Tier LLVM") originally compiled JS straight into LLVM IR and ran the full LLVM optimization passes — the same LLVM that ships C++/Rust/Swift, pointed at your hot JS. Since 2016 WebKit has swapped LLVM for its homegrown B3 backend (lower compile latency), but the design — a static-compiler-grade optimization pipeline as the peak tier — stands.

Real-world impact: on certain benchmarks Safari's JSCore beats Chrome's V8 — especially on compute-heavy, type-stable code, where FTL's loop, vectorization, and inlining strategies are more aggressive than TurboFan's.

But high-performance JS is written the same way across engines — the 12 rules apply word-for-word to JSCore.

实测对比REAL TEST
同一段优化后的代码,Chrome vs Safari
Same optimized code, Chrome vs Safari

在我的电脑上 (M1 MacBook Pro),v1 版 px2rem 跑 1M 次:Chrome (V8) 24 ms,Safari (JSCore) 17 ms。Safari 更快——FTL 把 UNIT_MAP 那个查表完全展开成了直接比较。但跑得快的代码,在哪个浏览器上都跑得快——这才是这套方法论的真正价值。

On my M1 MacBook Pro, the v1 px2rem at 1M iterations: Chrome (V8) 24 ms, Safari (JSCore) 17 ms. Safari wins — FTL fully unrolled the UNIT_MAP lookup into direct compares. But fast code stays fast across browsers. That's the methodology's real value.

CHAPTER 19 · FRONTIERS

Wasm — V8 的另一条流水线

Wasm — V8's other pipeline

当 JS 已经不够快

when JavaScript isn't fast enough

把 px2rem 优化到 24 ms 已经是极限了——再快只能不写 JS。这就是 WebAssembly 的位置。

V8 内部其实有两条独立的流水线:JS 那条在前 14 章讲过(Ignition→Sparkplug→Maglev→TurboFan);Wasm 有自己的两层——Liftoff(基线编译,毫秒内编完)和 TurboFan(峰值编译,Wasm 也复用了同一个后端)。两条流水线共享同一份机器码内存、同一份 GC、同一个 main thread——所以 Wasm 不是"另一种语言",而是JS 性能曲线的另一种形状

Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, you have to stop writing JS. That's WebAssembly's slot.

V8 actually runs two parallel pipelines: JS uses the four-tier covered in Ch3 (Ignition→Sparkplug→Maglev→TurboFan); Wasm has its own two-tier — Liftoff (baseline, compiles in milliseconds) and TurboFan (peak, shared backend). Both pipelines share the same machine-code memory, the same GC, the same main thread — so Wasm isn't "another language" so much as another shape of the JS performance curve.

什么时候上 Wasm

When to reach for Wasm

场景 Scenario                                 建议 Recommendation
业务热点(布局、滚动、动画)                     优化 JS 即可,基本能搞定
UI hotspots (layout, scroll, animation)       JS optimization is enough
媒体编解码 / 加解密 / 物理仿真                  Wasm 决定性更好(2–10×)
media codecs / crypto / physics               Wasm wins decisively (2–10×)
大规模数据处理(协同编辑、Excel 表)             视情况——多次 JS↔Wasm 边界开销可能吃掉收益
bulk data (collab editing, spreadsheets)      it depends — JS↔Wasm boundary costs can swamp the win
DOM 操作                                       Wasm 反而更慢(必须经 JS 桥)
DOM ops                                       Wasm is slower here — it must bridge through JS

而且重要的是:Wasm 不是"用了就快"。一段写得不好的 Wasm(频繁的 boundary call、不友好的内存布局、没向量化)有时还不如同等逻辑的优化过的 JS。

所以这篇文章的最后一句话还是:先用前面 12 条把 JS 优化到极限,再去考虑 Wasm——大部分业务场景里,JS 优化能解决 80% 的性能问题,而且不引入构建复杂度。

And critically: Wasm isn't "fast just by being Wasm". Poorly-written Wasm (frequent boundary calls, unfriendly memory layout, no vectorization) sometimes loses to equivalent optimized JS.

The last word of this piece, then: push JS to its limit with the 12 rules first; reach for Wasm second. In most business code, JS optimization solves 80% of perf without adding build complexity.
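To make that boundary cost concrete, here's a self-contained micro-benchmark sketch. The byte array hand-encodes a minimal Wasm module exporting add(i32, i32) — a stand-in for "trivial work per call". Run it as an ES module (e.g. node bench.mjs); absolute numbers will vary by machine, but a call this cheap can't amortize the JS↔Wasm crossing.

// Minimal JS↔Wasm boundary micro-benchmark (illustrative sketch).
// The bytes encode: (module (func (export "add") (param i32 i32) (result i32)
//                      local.get 0  local.get 1  i32.add))
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,              // \0asm + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,        // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                                      // func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,        // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // function body
]);
const { instance } = await WebAssembly.instantiate(bytes);
const wasmAdd = instance.exports.add;
const jsAdd = (a, b) => (a + b) | 0;

// 10M trivial calls: every wasmAdd call pays the boundary crossing,
// while jsAdd gets inlined by TurboFan.
console.time('wasm'); for (let i = 0; i < 10_000_000; i++) wasmAdd(i, i); console.timeEnd('wasm');
console.time('js');   for (let i = 0; i < 10_000_000; i++) jsAdd(i, i);   console.timeEnd('js');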

CODA · TOOLBOX

工具箱 — --allow-natives-syntax 全套实战

Toolbox — --allow-natives-syntax in practice

所有用到的命令、参数、native syntax,集中在这里

every command, flag, and native syntax used in this piece, in one place

这篇文章从头到尾用到的所有"怎么观察 V8 在干什么"的工具,集中放在这里。建议把这一章存成 cheatsheet——下次遇到慢 JS 时直接抄。

Every "how to see what V8 is doing" tool used across this piece, in one place. Save this chapter as a cheatsheet — next time you face slow JS, copy-paste from here.

启动开关

Startup flags

node / chromium 启动时 flags
# node
$ node --allow-natives-syntax bench.js

# node + 全套 trace(强烈推荐)
#   --trace-opt        升级 TurboFan 时打日志
#   --trace-deopt      反优化时打日志
#   --print-opt-code   打印 TurboFan 输出的机器码
$ node --allow-natives-syntax --trace-opt --trace-deopt --print-opt-code bench.js

# Chromium
$ open -a Chromium --args --js-flags="--allow-natives-syntax --trace-deopt"

# 限定只 trace 某个函数(避免日志爆炸)
$ node --allow-natives-syntax --turbo-filter='px2rem*' --print-opt-code bench.js

Native syntax 命令

Native syntax commands

在 JS 代码里直接调用 · callable from JS code itself · % prefix
// ─ 1. 看一个函数当前的优化状态 ─
console.log(%GetOptimizationStatus(px2rem).toString(2));
// 返回 bitmask,关键位:
//   bit 4  = kOptimized      // 在优化版本上跑
//   bit 5  = kMaglevved      // 在 Maglev 上
//   bit 6  = kTurboFanned    // 在 TurboFan 上
//   bit 14 = kMarkedForDeoptimization

// ─ 2. 强制下次调用就升级到 TurboFan ─
px2rem(10, 16);                        // 至少跑一次让它 collect feedback
%OptimizeFunctionOnNextCall(px2rem);
px2rem(10, 16);                        // 这次会被 TurboFan 编译

// ─ 3. 打印对象 / 函数的内部信息 ─
%DebugPrint(px2rem);                   // 输出 feedback vector / Hidden Class / 优化状态等

// ─ 4. 看对象是 Fast 还是 Slow Properties ─
console.log(%HasFastProperties(obj)); // → true / false

// ─ 5. 让函数立刻反优化(测试用)─
%DeoptimizeFunction(px2rem);

Heap snapshot · 看 Hidden Class

Heap snapshot · reading Hidden Classes

Chrome DevTools → Memory → Take heap snapshot,然后:

  1. 左上角 dropdown 选 "Class filter",在搜索框里输入对象的构造器名(比如 Object)。
  2. 展开任意一条,会看到 map :: system / Map @0x...——这就是 Hidden Class 的物理地址。
  3. 点这条 Map,下方 Retainers 面板会列出所有指向同一个 Hidden Class 的对象。如果你看到几万个对象指向同一个 Hidden Class——✓ shape 稳定。如果几个对象各指向不同的 Hidden Class——✗ shape 分裂了。

这是排查"对象 shape 是否稳定"最直接的方法,比看 %DebugPrint 的 map 地址更直观。

Chrome DevTools → Memory → Take heap snapshot, then:

  1. Top-left dropdown → "Class filter", search for your constructor (e.g. Object).
  2. Expand any entry, you'll see map :: system / Map @0x... — that's the Hidden Class's physical address.
  3. Click the Map entry; the Retainers panel below lists every object pointing at the same Hidden Class. Tens of thousands sharing one map ✓ — shapes stable. A few objects each pointing at different maps ✗ — shapes split.

This is the most direct way to verify shape stability — easier than diffing %DebugPrint output.

最小重现 cheatsheet

Minimal repro cheatsheet

bench.js · paste-and-run template
// ── 一段最小 V8 性能测量模板 ──
// 用法: node --allow-natives-syntax bench.js
function target(a, b) {
  /* 你要测的函数,写在这 */
  return a + b;
}

// 1. 预热(让 V8 收集 feedback 并升级到 TurboFan)
for (let i = 0; i < 10000; i++) target(i, i);
%OptimizeFunctionOnNextCall(target);
target(0, 0);

// 2. 看 V8 怎么看这个函数
%DebugPrint(target);

// 3. 跑 1M 次 timing
console.time('target');
for (let i = 0; i < 1_000_000; i++) target(i, i);
console.timeEnd('target');

// 4. 改一个怀疑参数喂进去,再跑一次,看是否 deopt
target('a', 'b');                      // 故意打破单态
console.time('after-bad');
for (let i = 0; i < 1_000_000; i++) target(i, i);
console.timeEnd('after-bad');          // 慢了几倍?

参考

References

OFFICIAL DESIGN DOCS · 官方设计文档 v8.dev · webkit.org
V8 blog · v8.dev/blog
  // 推荐: "Maglev — V8's Fastest Optimizing JIT"
  //       "Sparkplug — a non-optimizing JavaScript compiler"
  //       "Faster JavaScript calls" / "Hidden classes"
V8 source · TieringManager
  // chromium.googlesource.com/v8/v8/+/main/src/execution/tiering-manager.cc
V8 native runtime list
  // chromium.googlesource.com/v8/v8/+/main/src/runtime/runtime.h
  // 完整 % 函数清单(几百个,本文用到的只是一小部分)
JSCore design · webkit.org/blog
  // "Speculation in JavaScriptCore" — 跟本文 Phase II 是同一个故事
原版笔记 · app.tana.inc/shared/js/aHdCaFV4clZMaC9FSXBhUmRrVk9EUlo=
  // 本文方法论的最早 Tana 版本(私有笔记)
姊妹篇SISTER PIECE
本文之外:其他性能瓶颈
Beyond this piece: other perf bottlenecks

这篇文章只覆盖了"JS 执行"这一类瓶颈。如果你的瓶颈在 DOM、布局、绘制、合成,可以读姊妹篇 字节码到像素的一生 — Chromium 渲染流水线全景;如果是合成卡顿,可以读 Jank & Stutter。

This piece only covers JS execution. If your bottleneck is DOM, layout, paint, or compositing, see the sister piece Bytecode to Pixels — Chromium's Rendering Pipeline. For compositor jank specifically, see Jank & Stutter.

从 240ms 到 24ms,
十倍提速不是魔法,是十二刀切下去的累加。
每一刀都对应 V8 的一个机制——让假设保持稳定,它就一直跑在最优版本上。

From 240ms to 24ms,
a tenfold speedup is not magic — it's twelve cuts that compound.
Each cut maps to a V8 mechanism. Keep its assumptions stable, and your code stays on the peak.

FIN // END OF FIELD NOTE 03