When a piece of JS runs slow — how do you know where it's slow? How do you rewrite it using V8's internals as a guide? How do you prove it really got ten times faster? A methodology, not an encyclopedia.
const a = 3 + 4
one line of code, three worlds, three translations
The first step of performance work isn't profiling — it's answering one question: how many translations sit between the line of JavaScript in your head and the instructions the CPU actually runs?
Take const a = 3 + 4. Look at it with three different eyes. The same one-liner shape-shifts across three worlds, and every V8 optimization trick lives in the translation between them.
Click the four tabs above to swap JS samples; each shows V8 Ignition's bytecode and TurboFan's machine code. The same JS, two translations, abstract to concrete. Note the extra testb + jne deopt in the asm column — those aren't logic, they're V8's type-check checkpoints (Phase II will dissect them).
A bare CPU only speaks machine code, but JavaScript is dynamic — types are a runtime fact. Given add(a, b), I can't fold it down to a single add eax, ebx if I don't know what a and b are — next second you might pass me two strings.
So V8 inserts a bytecode layer between brain and CPU: closer to a real machine than the AST, more flexible than asm — interpretable, observable, and recompilable into machine code once V8 has watched enough calls to know what your types actually are.
That layer is the entire battlefield of JS performance work. The rest of this piece is about what V8 does in there — and how you can cooperate with it.
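A small illustration of why the engine cannot commit to a single machine instruction up front: the same `add` means integer addition, floating-point addition, or string concatenation depending on what arrives at runtime. This is an illustrative sketch, not code from the project discussed in this piece.

```javascript
// The same source line, three different runtime meanings.
function add(a, b) {
  return a + b;
}

const intSum = add(3, 4);          // integer addition
const floatSum = add(0.5, 0.25);   // floating-point addition
const joined = add('3', '4');      // string concatenation

console.log(intSum, floatSum, joined);
```

Only the first call could ever be served by one integer add instruction; the other two need entirely different machine code.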
"Fast JavaScript" really means
V8 quietly guessed right ten thousand times. Field Note · 03
The hard part isn't editing code. The hard part is seeing what V8 is doing right now. So this piece doesn't tour every V8 internal — it answers one concrete question:
It's a perfectly ordinary helper. Input may be a number, a string, or a { value, unit } shape; output is a rem value. In one of our projects it ran hundreds of times per frame and burned non-trivial CPU.
The 19 chapters that follow are each one cut of making it faster. Pipeline, Hidden Class, Inline Cache — these aren't decorations. They are the knives we'll cut the problem with.
a thin compile-time, a fat runtime
Static languages like C / Rust / Go can pin down types, call shapes, and memory layouts at compile time — so they take the AOT (Ahead-Of-Time) road. By the time the binary ships, almost no "compiling" happens at runtime.
JavaScript is dynamic. Until add(a, b) is actually called, nobody can promise the two arguments are numbers — they could be strings or objects. So V8 takes the JIT (Just-In-Time) road: it compiles while running, observes types, and re-optimizes from feedback.
| Dimension | AOT · C / Rust / Go | JIT · V8 / JSC |
|---|---|---|
| Compile-time | Thick: all optimization happens here | Thin: only parse + bytecode |
| Runtime | Thin: just runs machine code | Thick: (re)compilation and deopt happen during execution |
| Type info | Known at compile time | Collected at runtime (feedback) |
| Peak code emitted when | At the end of compilation | After the code is "hot enough" (TurboFan) |
| Cost | Slow build, large binary | Slow cold start, memory spent on compiled code |
It means the same chunk of JS gets recompiled while it runs — interpreter first (Ignition), then once V8 notices it's hot, baseline (Sparkplug), mid-tier (Maglev), and finally peak (TurboFan) take turns. Each upgrade burns milliseconds in compilation itself — a tax AOT languages never pay.
So when you write function add(a, b) { return a + b }, four different versions of add may have run inside V8 by the time you blink — each at a different level of optimization, each with a different set of assumptions.
That's also where the leverage sits: keep V8's assumptions stable and your function lives on the peak (TurboFan machine code) forever — never "deoptimized" back into bytecode interpretation.
the same JS exists in V8 as four different versions of itself
V8 is not one compiler — it's a pipeline of four. The same function climbs the tiers as it gets called more often. Each tier emits code closer to the metal, runs faster, but costs more time to compile in the first place.
This is V8's core tradeoff: cold code isn't worth optimizing, hot code is worth optimizing harder. A function's "performance" isn't a single number — it's a curve that moves over time. This chapter is the map of that curve.
Parser. Tokenizes source into an AST and emits the first bytecode. Every later tier consumes that bytecode, not the source. From here on V8's world is bytecode; the source is gone.
Ignition. The bytecode interpreter. Runs bytecode one instruction at a time and collects type feedback (which types and shapes the arguments usually take) into a structure called the FeedbackVector. Cold code stays at this tier; there is no need to climb.
Trace Ch1's function add(a, b) { return a + b; } through all five stages (the parser plus the four execution tiers). Each stage's output looks dramatically different:
function add(a, b) { return a + b; }
The same add(a, b) in five forms across V8's pipeline. Parser emits an AST tree; Ignition emits 3 bytecodes; Sparkplug emits ~10 lines of baseline asm that still calls a generic builtin; Maglev inlines the add, replacing the call with SMI guards; TurboFan adds the full deopt scaffold but the core add is still one register-level instruction. Each tier adds lines but moves runtime closer to what the CPU eats raw.
Because TurboFan's compile time is itself a performance cost — and TurboFan needs feedback to optimize well; without it, even TurboFan's output is mediocre.
In real apps, most code only runs a handful of times: a setup function on page load, a one-shot callback. Compiling those into TurboFan is a pure loss. So V8 uses a simple heuristic:
"Hot" is decided in v8::internal::TieringManager::ShouldOptimize; we'll dissect it in Ch9.
Three implications:
Launch node or Chromium with --allow-natives-syntax, then %GetOptimizationStatus(fn) returns a bitmask — bit 4 is kOptimized, bit 5 kMaglevved, bit 6 kTurboFanned. The full command list lives in the Ch20 toolbox.
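A minimal sketch of reading that bitmask from a script. The bit positions are the ones listed above and can shift between V8 versions, so treat them as assumptions to check against your own build. `decodeStatus` and `readStatus` are hypothetical helper names. Because `%GetOptimizationStatus` only parses under `--allow-natives-syntax`, the call is built through `new Function` so the script still runs (and reports that the flag is missing) without it.

```javascript
// Bit positions as listed in the text; verify them for your V8 version.
const STATUS_BITS = {
  kOptimized: 1 << 4,
  kMaglevved: 1 << 5,
  kTurboFanned: 1 << 6,
};

// Turn the raw bitmask into a list of flag names.
function decodeStatus(mask) {
  return Object.keys(STATUS_BITS).filter((name) => (mask & STATUS_BITS[name]) !== 0);
}

// Returns the bitmask, or null when natives syntax is not enabled.
function readStatus(fn) {
  try {
    return new Function('fn', 'return %GetOptimizationStatus(fn);')(fn);
  } catch {
    return null; // SyntaxError without --allow-natives-syntax
  }
}

function add(a, b) { return a + b; }
for (let i = 0; i < 100000; i++) add(i, 1); // warm the function up

const mask = readStatus(add);
console.log(mask === null ? 'natives syntax not enabled' : decodeStatus(mask));
```

Run it as `node --allow-natives-syntax script.js` to see real flags; which tier the function has reached at that moment depends on the V8 version and its tiering thresholds.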
why virtual ISAs exist at all
"Bytecode" sounds like a magic word, but underneath it's plain: an instruction set for a virtual CPU. The only difference from x86/arm is that no silicon ships with a bytecode decoder built in.
Walk const a = 3 + 4 through V8 and you'll see it appear in two languages: first as Ignition bytecode, then as TurboFan machine code. Side by side:
V8's bytecode ISA is a register machine with an accumulator. LdaSmi [3] means "load the Smi 3 into the accumulator"; Star0 means "store the accumulator into register r0". Two wins:
The cost is the interpreter loop: every instruction pays "fetch → decode → jump-to-handler → execute → jump-back". That loop alone is roughly an order of magnitude slower than running raw machine code. That's why Sparkplug exists: it translates bytecode almost one-to-one into machine code and removes the dispatch loop.
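To make the dispatch loop concrete, here is a toy accumulator machine. The opcode names mimic Ignition's, but the encoding, the handler table, and the three-instruction program are hand-written assumptions for illustration, not a dump from V8 (a real engine may also fold 3 + 4 into a constant before any bytecode exists).

```javascript
// A toy accumulator machine; handlers are invented for illustration.
const handlers = {
  LdaSmi(state, value) { state.acc = value; },              // load constant into acc
  AddSmi(state, value) { state.acc = state.acc + value; },  // acc += constant
  Star(state, reg) { state.regs[reg] = state.acc; },        // store acc into a register
};

function run(bytecode) {
  const state = { acc: undefined, regs: [] };
  // The dispatch loop: fetch, decode, jump to the handler, execute, repeat.
  for (let pc = 0; pc < bytecode.length; pc++) {
    const [opcode, operand] = bytecode[pc]; // fetch + decode
    handlers[opcode](state, operand);       // dispatch + execute
  }
  return state;
}

// const a = 3 + 4, with `a` living in register r0.
const program = [
  ['LdaSmi', 3],
  ['AddSmi', 4],
  ['Star', 0],
];

const finalState = run(program);
console.log(finalState.regs[0]);
```

Every iteration pays for the array read, the destructuring, and the table lookup before any useful work happens; that per-instruction overhead is the cost a baseline compiler removes.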
Bytecode ISAs split into two historical schools, stack machines and register machines, and picking one is among the first decisions any VM designer makes:
… r0~r255). All ops default to writing the result into acc; registers are just temporaries.
Compile c = a + b in three ISA flavors and the differences become obvious:
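As a stand-in for that comparison, the sketch below encodes c = a + b for a stack machine, a three-address register machine, and an accumulator machine, each with a tiny interpreter. All opcode names and encodings here are invented for illustration; real VMs differ in detail.

```javascript
// Three toy encodings of c = a + b.
const stackProgram = [['push', 'a'], ['push', 'b'], ['add'], ['pop', 'c']];
const registerProgram = [['add', 'c', 'a', 'b']];
const accumulatorProgram = [['Ldar', 'a'], ['Add', 'b'], ['Star', 'c']];

// Stack machine: operands are implicit, they live on the operand stack.
function runStack(program, vars) {
  const stack = [];
  for (const [op, arg] of program) {
    if (op === 'push') stack.push(vars[arg]);
    else if (op === 'add') {
      const right = stack.pop();
      const left = stack.pop();
      stack.push(left + right);
    } else if (op === 'pop') vars[arg] = stack.pop();
  }
  return vars;
}

// Register machine: one wide instruction names destination and both sources.
function runRegister(program, vars) {
  for (const [op, dst, lhs, rhs] of program) {
    if (op === 'add') vars[dst] = vars[lhs] + vars[rhs];
  }
  return vars;
}

// Accumulator machine: one implicit register, one explicit operand.
function runAccumulator(program, vars) {
  let acc;
  for (const [op, arg] of program) {
    if (op === 'Ldar') acc = vars[arg];
    else if (op === 'Add') acc = acc + vars[arg];
    else if (op === 'Star') vars[arg] = acc;
  }
  return vars;
}

const stackResult = runStack(stackProgram, { a: 3, b: 4 });
const registerResult = runRegister(registerProgram, { a: 3, b: 4 });
const accumulatorResult = runAccumulator(accumulatorProgram, { a: 3, b: 4 });
console.log(stackProgram.length, registerProgram.length, accumulatorProgram.length);
```

The trade-off shows up in the shapes: the stack version needs the most instructions but each is tiny, the register version needs one instruction with three operands, and the accumulator version sits in between.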
| Engine / language | ISA shape | Why |
|---|---|---|
| V8 Ignition · Chrome / Node | register + accumulator | Introduced in 2016, replacing a JIT-only pipeline that compiled source straight to machine code; optimized for fast start-up and a small footprint (short bytecode); the accumulator keeps the encoding compact |
| JSCore LLInt · Safari | register (3-address) | Optimized for steady-state interpreter throughput; LLInt's compact encoding mitigates the size cost; the optimizing JIT tiers (the top tier was originally LLVM-based) also work well with register IR |
| SpiderMonkey · Firefox | pure stack | Picked in 1996; the ECMA-262 spec itself uses stack semantics; CacheIR was layered on top instead of a redesign |
| Hermes · React Native | register | No JIT (iOS does not let third-party apps generate code dynamically); the interpreter is the only execution path, so it has to be fast; register machines lead in interpreter throughput |
| JVM · Java | pure stack | A 1995 design call; prioritized compact, cross-platform encoding; the HotSpot JIT recompiles bytecode into register-based machine code at runtime to win the performance back |
| Lua 5+ | register | Lua 5.0 (2003) switched from a stack machine to a register machine; the published benchmarks show interpreter speed-ups of up to about 2×; the canonical case study |
| CPython 3.11+ | stack + specialization | ISA shape unchanged, but 3.11 added the "specializing adaptive interpreter": hot bytecodes are rewritten into type-specialized versions at runtime, recovering part of the gap without changing the bytecode format |
The 2016 Ignition design doc listed three reasons:
Tellingly, JSCore went the opposite way — its pure-register LLInt is optimized for interpreter throughput, because JSC has always cared more about "fast even without JIT" (early iOS restrictions + server-side use cases). The two engines' ISA choices reflect different bets on how much the time before the JIT kicks in matters.
For V8's full bytecode ISA, read v8/src/interpreter/bytecodes.h: roughly 200 opcodes, each with its operand layout. The canonical paper on Lua's switch to a register-based VM is "The Implementation of Lua 5.0" by Roberto Ierusalimschy, Luiz Henrique de Figueiredo, and Waldemar Celes (2005), a frequently cited reference in VM design discussions.
Real TurboFan output is much fatter than the three-line figure above. Run node --print-opt-code --allow-natives-syntax and you'll see a swarm of cmp / jne / test around the core. That's not logic — it's V8's checkpoint machinery (type guards) plus the calling convention scaffold.
The orange lines are type guards: they verify "is this still a SMI?" on every call. Pass → fall through to the green core add; fail → jump out to deopt. One line of JS becomes nine lines of asm — six scaffold, one logic.
That scaffold is the protagonist of Phase II: the assumption + feedback + checkpoint trio. It's why TurboFan is faster than Ignition and why breaking an assumption tanks performance — same coin, two faces.
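The guard-then-core structure can be mimicked in plain JavaScript. This is a conceptual model only: `isSmi`, `deoptimize`, and `optimizedAdd` are hypothetical names, and real guards are a handful of machine instructions rather than function calls.

```javascript
// Mirrors the tagged check: a SMI here is an integer in the 31-bit signed range.
function isSmi(v) {
  return Number.isInteger(v) && v >= -(2 ** 30) && v <= 2 ** 30 - 1 && !Object.is(v, -0);
}

let deoptCount = 0;

// Stand-in for "throw the fast code away and resume in the interpreter".
function deoptimize(a, b) {
  deoptCount++;
  return a + b; // generic semantics: may be float add or string concat
}

// Shape of the optimized code: guards first, then the one-line core.
function optimizedAdd(a, b) {
  if (!isSmi(a)) return deoptimize(a, b);   // checkpoint on a
  if (!isSmi(b)) return deoptimize(a, b);   // checkpoint on b
  const sum = a + b;                        // the actual logic
  if (!isSmi(sum)) return deoptimize(a, b); // overflow checkpoint
  return sum;
}

console.log(optimizedAdd(3, 4));     // all guards pass
console.log(optimizedAdd('a', 'b')); // first guard fails
console.log(deoptCount);
```

Three of the four statements in `optimizedAdd` are scaffold, which is the same proportion the annotated assembly shows.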
JavaScript is dynamic —
until you make it look static. Field Note · 03
SMI vs HeapObject, decided by one bit
That testb [rbx+0xf], 0x1 in the previous chapter is checking the lowest bit. This isn't V8-only — in C/C++ it's called a Tagged Pointer: stuff type info into a pointer's low bits instead of carrying a separate field. V8's flavor:
That single bit forks V8's handling of a 64-bit word into two completely different paths. Try lighting one up:
Type a number and see how V8 packs it into a 32-bit word. Integers in the range −2³⁰ ≤ n ≤ 2³⁰ − 1 go SMI: low bit 0 means "the word IS the value". Anything else (floats, big ints, strings, objects) goes HeapObject: low bit 1 means "the word is a pointer, look on the heap".
Because integers are everywhere. In a real page, a large share of the values in flight are numbers: indices, pixels, milliseconds, coordinates, counters. Boxing every one of them into a HeapObject (with a hiddenClass, meta header, and GC bits) would drown V8 in memory traffic and pointer chasing.
Tagging SMIs with the low bit lets V8:
- … testb: the "is it a SMI?" guard from the previous chapter is one cycle.
Internally, V8 represents any JS value in exactly one of two ways: SMI or HeapObject. Everything you can hold in JS — numbers, strings, booleans, null, objects, functions — eventually fits into one bucket or the other.
SMI (Small Integer) is the path V8 takes when a value is an integer and small enough (31-bit signed, ±2³⁰). It skips the heap and packs the value directly into the 64-bit word: shift left by one and set the low bit to 0 to mark "I am a SMI".
Reading a SMI is just a right shift; arithmetic happens in CPU registers, one add instruction. No heap alloc, no GC visit, no dereference.
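The tagging arithmetic can be sketched with JavaScript's own 32-bit bitwise operators. This models the 31-bit layout described above; the helper names (`fitsSmi`, `smiTag`, `smiUntag`, `isSmiWord`) are hypothetical, and real V8 performs these steps on raw machine words.

```javascript
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;

// Only integers in the 31-bit signed range qualify; -0 does not.
function fitsSmi(n) {
  return Number.isInteger(n) && n >= SMI_MIN && n <= SMI_MAX && !Object.is(n, -0);
}

// Tag: shift left by one, which leaves the low bit at 0.
function smiTag(n) {
  if (!fitsSmi(n)) throw new RangeError(`${n} does not fit in a SMI`);
  return n << 1;
}

// The one-bit test the guard performs.
function isSmiWord(word) {
  return (word & 1) === 0;
}

// Untag: arithmetic right shift, which preserves the sign.
function smiUntag(word) {
  return word >> 1;
}

const tagged = smiTag(42);
console.log(tagged, isSmiWord(tagged), smiUntag(tagged));
```

A word with the low bit set, such as `0x1001`, fails `isSmiWord` and would be treated as a heap pointer instead.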
Anything bigger or more complex — floats, big integers, strings, objects, arrays, functions — doesn't fit. V8 allocates a chunk on the heap, puts the data there, and hands you a pointer. That pointer's low bit is 1; V8 uses that bit to know "this is a pointer, dereference me to find the real thing".
But a HeapObject isn't just "a value" — it carries a whole stack of metadata, and that's the source of its weight. Here's a SMI and a HeapNumber (the box used for floats like 42.5) side by side:
A plain object {x: 1, y: 2} (a JSObject) is even more dramatic: on a 64-bit build without pointer compression it occupies 40 bytes, made up of three 8-byte header words (the Map pointer, i.e. its Hidden Class, plus the properties and elements pointers) and two 8-byte in-object fields:
So "SMI vs HeapObject" is the single biggest cost difference in V8 internals, and in a real page most of those HeapObjects are just boxed numbers. That's why V8 fights so hard to keep things in SMI form.
It's also why Phase III (Hidden Class, IC, Fast/Slow Properties) matters so much: every mechanism there is about making HeapObject access as cheap as SMI access.
"GC header" is shorthand. V8 doesn't actually have a JVM-style separate "GC header" word — the first word (8 bytes) of every HeapObject is the Map pointer, and it pulls triple duty:
Strictly, "GC header" is that Map pointer — not extra bytes, just the object's first field. A 16-byte HeapNumber spends 8 of those on the Map; the other 8 carry the value. Half the box is the label.
Because the Map word is just the entry ticket. The GC's per-HeapObject work on every cycle is much more than those bytes. Compare what happens to a SMI vs a HeapObject across a full GC pass:
| GC phase | SMI | HeapObject |
|---|---|---|
| Allocation | None: lives in a register or stack slot | Bump-allocated in NewSpace: advance top, write the Map, write the fields |
| Marking | Low bit 0, so the GC skips it at a glance | Read the Map to learn which words are pointers, push onto the mark queue, set a mark bit |
| Sweeping | N/A | If unmarked, its space returns to the free list or is reclaimed by compaction |
| Compaction | N/A | May be relocated: the old slot holds a forwarding pointer and every referrer must be rewritten |
| Next survival | Free | Survivors may be promoted (copied) from NewSpace to OldSpace; objects above the size threshold are allocated in LargeObjectSpace from the start |
Key observation: the SMI column is blank in every row; the HeapObject column pays at every step. A typical Chrome page can trigger GC more than 10× per second (mostly the Scavenger working on NewSpace), and every cycle has to trace the objects that are still alive: read each Map, mark or copy the object, and fix up the pointers that refer to it. Cumulatively, that's the real weight of "HeapObject is heavy".
Isn't the mark bit also one per object? Yes, but V8's mark bits don't live on the object itself. They live in a separate MarkingBitmap (a side table), with one bit per pointer-sized word of heap memory. So a HeapNumber really is 16 bytes and a JSObject really is 40 bytes; mark bits don't add to that. But when counting "GC work", the bitmap bit still costs one read and one write per object.
So more precisely, the HeapObject "GC tax" comprises:
That's why an innocent-looking { x: 1 } on a hot path that creates+discards hundreds per second can have GC eat 10–20% of main-thread time. You aren't paying bytes — you're paying GC labor.
Chrome DevTools → Performance → record a session and look at GC's share in the Summary panel. Common signals: lots of Minor GC (NewSpace, sub-millisecond) → many short-lived objects; Major GC (OldSpace, 10s of ms) showing up → too many long-lived objects / memory pressure. Phase IV's "escape analysis" rule kills short-lived heap objects entirely so that Minor GC has nothing to scan.
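One common way to cut short-lived allocations on a hot path is to let the caller own a scratch object and have the helper write into it. The sketch below contrasts the two styles; `midpointAlloc` and `midpointInto` are hypothetical names, and whether the rewrite pays off depends on whether the engine was already eliminating the allocation, so measure before and after.

```javascript
// Allocating version: one short-lived { x, y } object per call.
function midpointAlloc(ax, ay, bx, by) {
  return { x: (ax + bx) / 2, y: (ay + by) / 2 };
}

// Reusing version: the caller passes in an object to fill.
function midpointInto(out, ax, ay, bx, by) {
  out.x = (ax + bx) / 2;
  out.y = (ay + by) / 2;
  return out;
}

const scratch = { x: 0, y: 0 }; // allocated once, outside the loop
let total = 0;
for (let i = 0; i < 1000; i++) {
  const p = midpointInto(scratch, i, i, i + 2, i + 4);
  total += p.x + p.y;
}
console.log(total);
```

The trade-off is aliasing: every call overwrites the same object, so a caller that needs to keep a result must copy it out first.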
On 32-bit V8, a SMI is a 31-bit signed integer (−2³⁰ to 2³⁰ − 1); floats and bigints are boxed as HeapNumber / BigInt. On 64-bit V8 with pointer compression (on by default in Chrome; Node.js builds have typically shipped without it), SMIs and pointers both fit in 32 bits, with the high 32 bits derived from an "isolate root". This is part of why "passing a number is much faster than passing a string": strings are always HeapObjects.
Three direct, actionable consequences:
- A result with a fractional part, such as 0.5 + 0.25, has to be represented as a double (a HeapNumber once it is stored as a tagged value), which is the slower path; 1 + 2 stays SMI throughout. Note that 1.0 and 1 are the same value in JS, so writing 1.0 + 2.0 changes nothing. Same for Math.floor(x) followed by arithmetic: the result is integral, so V8 can keep it in SMI form when it fits.
- Replace obj['x'] with obj.x, and switch on integer enums rather than string literals: V8's check path becomes shorter.
- [1, 2, 'three'] moves the whole array's elements kind from PACKED_SMI_ELEMENTS to the generic PACKED_ELEMENTS (HOLEY_ELEMENTS once the array also has holes), and the transition is one-way: all later reads and writes go through the generic tagged-value path. [1, 2, 3] stays in PACKED_SMI_ELEMENTS, where reads are raw memory accesses.
Tagged Pointer isn't a curiosity. It's the minimal decision V8 makes behind every line of JS you write.
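A hypothetical dev-time check for the array rule: it approximates "could this array be stored as packed small integers?" from the values alone. It cannot see the real elements kind (an array that once held a string stays generic even after the string is removed), so treat it as a lint-style hint, not a view into V8.

```javascript
// A value qualifies if it is an integer in the 31-bit signed range (and not -0).
function fitsSmi(v) {
  return typeof v === 'number' && Number.isInteger(v) &&
    v >= -(2 ** 30) && v <= 2 ** 30 - 1 && !Object.is(v, -0);
}

// True when the array has no holes and every element would fit in a SMI.
function looksPackedSmi(arr) {
  for (let i = 0; i < arr.length; i++) {
    if (!(i in arr)) return false;     // a hole makes the array "holey"
    if (!fitsSmi(arr[i])) return false; // a float, string, or object generalizes it
  }
  return true;
}

console.log(looksPackedSmi([1, 2, 3]));
console.log(looksPackedSmi([1, 2, 'three']));
console.log(looksPackedSmi([1, 2.5]));
```

Dropping a check like this into an assertion on a hot array during development can catch the accidental string or float before it reaches production data.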
Why obj.x beats obj['x']
They look equivalent, but V8 has two kinds of property load, named and keyed, which compile to different bytecodes and take different IC paths:
Three practical differences:
- LdaNamedProperty carries the property name as a constant operand, so V8 knows the target property at compile time and can take the most direct path. LdaKeyedProperty doesn't.
- … arr[0]), so its fast paths have more checks and degrade to megamorphic more easily.
- … obj['x'] next to an obj[someVar] and the call site's IC gets polluted by both usages. An obj.x site stays a LoadIC and never gets pulled into the keyed path.
Note: when the key is a string literal that is not an array index, V8 can already treat obj['x'] as a named load, so the two forms often end up on the same LoadIC. The keyed-path costs above bite mainly with computed keys such as obj[someVar]. Using .x is your way of telling V8 and your future self that this is a static name, with no hidden degradation later.
switch (str) vs switch (intEnum) · string switches do per-case string comparisons (a pointer compare after interning, but still one comparison per case). Dense integer switches can compile into jump tables, an O(1) direct branch. Likewise, obj.method() vs obj[name](): the dot form goes through a named LoadIC + Call; the computed-key form takes the slower KeyedLoadIC + Call path.
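A sketch of the integer-enum pattern applied to a unit conversion. The `Unit` enum, the 16px root font size, and the treatment of em are assumptions made for the example. The string is parsed once at the boundary, and the hot function only ever sees small integers.

```javascript
const Unit = Object.freeze({ REM: 0, PX: 1, EM: 2 });
const ROOT_FONT_PX = 16; // assumed root font size

// Boundary: translate the string to an enum value once.
const UNIT_BY_NAME = new Map([['rem', Unit.REM], ['px', Unit.PX], ['em', Unit.EM]]);
function parseUnit(name) {
  const unit = UNIT_BY_NAME.get(name);
  if (unit === undefined) throw new RangeError(`unknown unit "${name}"`);
  return unit;
}

// Hot path: switches on a small integer, never on a string.
function toRem(value, unit) {
  switch (unit) {
    case Unit.REM: return value;
    case Unit.PX: return value / ROOT_FONT_PX;
    case Unit.EM: return value; // assumes 1em equals 1rem at this call site
    default: throw new RangeError(`unknown unit ${unit}`);
  }
}

console.log(toRem(32, parseUnit('px')));
```

Keeping `parseUnit` outside the per-frame loop is the important part; the enum only helps if the string lookup is not repeated on every call.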
how V8 manages to guess right
Here's the real paradox: JS is dynamically typed, so how does V8 ever produce C-tight machine code?
Answer: V8 doesn't "know" — it guesses. It runs, watches the types your function actually sees, makes bold assumptions, and emits fast-path asm based on those assumptions — with type-check checkpoints inlined so it can throw the asm away the moment the guess fails.
The trio:
| # | Name | Lives at | Does what |
|---|---|---|---|
| 1 | feedback | While Ignition / Sparkplug runs | Watches what types and shapes flow through each call site and writes them to a FeedbackVector. |
| 2 | assumption | When Maglev / TurboFan compiles | Reads the feedback and decides, for example, "I'll assume both args are SMIs", then emits the fast path for that case. The more monomorphic the feedback, the bolder the bet. |
| 3 | checkpoint | In the emitted machine code | Each assumption gets a testb/cmp guard inlined: pass and the fast path runs; fail and the code deopts immediately. |
As a timeline:
The trio is invisible by default — until you flip V8's built-in switches. This is the first class of tool for analyzing slow JS:
Compose these switches and you can see what V8 is doing behind the scenes. The next chapter walks a real function through a deopt event.
a real measurement of why broken assumptions don't just slow you down — they cliff-drop
Let's actually measure the "checkpoint fail → deopt" event from the previous chapter. The code below runs in three distinct performance regions, with a 3–5× swing:
Because V8 threw away the TurboFan machine code it had compiled in L1. L3's loop restarts in Ignition — the bytecode interpreter, an order of magnitude slower on its own. Eventually V8 re-compiles, but the feedback is now "polluted": it knows a/b can be either number or string, so the new assumption degrades to any+any, and the emitted asm has to carry extra type branches — fatter than the original mono-SMI version.
That's the real cost of deopt:
--trace-deopt catches it
"reason: not a Smi" is the single most common smoking gun when chasing slow JS — it pinpoints which line, which bytecode offset, which assumption blew up. In Phase IV's main-line we'll use this exact log to backtrack issues line by line.
A PR added console.log(arg) to a per-cell function called thousands of times per frame. arg was occasionally undefined. The profiler showed the function suddenly running 4× slower, not because of the log itself, but because undefined deopted the function, which ran on polymorphic machine code from then on. Moving the log to the outer scope (dev-only) restored performance immediately.
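The three-region measurement described in this chapter can be approximated with a script like the one below. It is a reconstruction under assumptions, not the original benchmark: the loop size, the `hotLoop` and `measure` helpers, and the phase labels are all made up here. Absolute timings depend on the machine and the V8 version, and inlining decisions can shrink or hide the gap, so run it with `node --trace-opt --trace-deopt` to confirm whether a deopt actually fired.

```javascript
function add(a, b) {
  return a + b;
}

const ITERATIONS = 1e6;

// Keeps both operands small so the integer phases stay in SMI range.
function hotLoop() {
  let acc = 0;
  for (let i = 0; i < ITERATIONS; i++) acc = add(i & 0xff, acc & 0xff);
  return acc;
}

function measure(label, fn) {
  const start = performance.now();
  const result = fn();
  const elapsed = performance.now() - start;
  console.log(`${label}: ${elapsed.toFixed(2)} ms`);
  return result;
}

const phase1 = measure('phase 1 (integers only)', hotLoop);
const poisoned = add('a', 'b'); // one string call that contradicts the SMI feedback
const phase2 = measure('phase 2 (right after the string call)', hotLoop);
const phase3 = measure('phase 3 (after V8 has had time to re-optimize)', hotLoop);
```

The results of the three phases are identical by construction; only the elapsed times are expected to differ, and the trace output is the reliable way to attribute any difference to a deopt.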
monomorphic flies, polymorphic crawls, megamorphic gives up
The previous chapter showed one add('a','b') triggering a deopt — but the truth is finer-grained. Each call site's FeedbackVector entry runs a state machine that degrades step by step as more type variations come through:
… cmp/jne branches inlined.
It's V8's engineering tradeoff. Each extra type branch adds a few cmp/jne lines to the asm — code grows, i-cache pressure grows. V8 benchmarked extensively and found that polymorphism up to 4 still beats the interpreter; beyond that, it's a net loss — fall back to generic dispatch.
So 4 is empirical, not physical. As an author you only need one rule:
Make every hot function
as monomorphic as you can. The single most useful V8 heuristic.
Use %DebugPrint(fn) and find the feedback_vector section. You'll see something like:
BinaryOp::SignedSmall means you're golden (SMI mono); BinaryOp::Any means the slot has degraded to the worst case. This is the first diagnostic signal we'll reach for repeatedly in Phase IV's main-line.
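For property access, the same rule means feeding a hot function objects that were all built the same way. The sketch below contrasts a monomorphic call pattern with a polymorphic one; it only demonstrates how the two kinds of input arise, since the feedback state itself is not visible from plain JavaScript. `makePoint`, `getX`, and `sumX` are hypothetical names.

```javascript
function getX(point) {
  return point.x;
}

// Monomorphic input: one factory, same keys, same order, every time.
function makePoint(x, y) {
  return { x, y };
}
const sameShape = [makePoint(1, 2), makePoint(3, 4), makePoint(5, 6)];

// Polymorphic input: same data, but key order and extra keys differ,
// so the objects no longer share one shape.
const mixedShapes = [{ x: 1, y: 2 }, { y: 4, x: 3 }, { x: 5, y: 6, z: 0 }];

function sumX(points) {
  let total = 0;
  for (const p of points) total += getX(p);
  return total;
}

console.log(sumX(sameShape), sumX(mixedShapes));
```

Both calls return the same sum; the difference is in how many shapes the `point.x` load has to account for, which is what the feedback slot records.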
reading the actual V8 source for the decision
What does "hot enough" actually mean? V8 codifies it in one function — v8::internal::TieringManager::ShouldOptimize. Let's read it:
- … maglev_filter (see --maglev-filter=name).
- … pass the flag straight to node, for example node --turbo-filter=xxxxx script.js).
- … efficiency_mode_delay_turbofan to push tiering further out.
- … max_optimized_bytecode_size defaults to 60K bytecode bytes. That's why an early move in Phase IV will be function decomposition: break giant functions into small ones so each can be optimized.
The max_optimized_bytecode_size limit is one of the easiest traps. A very large handler can sit above the threshold forever, and V8 simply skips optimizing it no matter how often it's called. That's why the "function decomposition" rule in Phase IV is non-negotiable.
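The decomposition pattern itself is simple. The sketch below splits a converter that accepts a number, a string, or a { value, unit } object into one small helper per input kind plus a thin dispatcher. It is far too small to hit any bytecode limit and is not the project's real function; the helper names, the 16px root size, and the handling of only px and rem are assumptions made for the example.

```javascript
const ROOT_FONT_PX = 16; // assumed root font size

// One small function per input kind; each sees a single type.
function remFromNumber(px) {
  return px / ROOT_FONT_PX;
}

function remFromString(text) {
  const value = parseFloat(text);
  return text.endsWith('rem') ? value : value / ROOT_FONT_PX;
}

function remFromObject(spec) {
  return spec.unit === 'rem' ? spec.value : spec.value / ROOT_FONT_PX;
}

// Thin dispatcher: the only place that has to deal with mixed types.
function toRem(input) {
  if (typeof input === 'number') return remFromNumber(input);
  if (typeof input === 'string') return remFromString(input);
  return remFromObject(input);
}

console.log(toRem(32), toRem('24px'), toRem('1.5rem'), toRem({ value: 8, unit: 'px' }));
```

Besides keeping each piece small, the split means callers that already know their input type can call the specific helper directly and skip the dispatcher.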
if you were Lars Bak in 2008
We've covered V8's compile pipeline and assumption system. Now into the third, and most rewarding, block: the object memory model.
A thought experiment: it's 2008, you're on the Chrome V8 team, and your job is to lay JS objects out in memory. How would you do it? Start with how C does it.
A static struct is one contiguous block of memory. The compiler knows x is at offset 0 and y at offset 4, so property access is an O(1) offset load. But that rests on two compile-time assumptions, and JS shatters both: obj.foo = 42 can graft a property on at any moment, and delete obj.foo can rip one off at will. So you can't get away with "one mov per property read".
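To make "property access is an offset load" concrete, here is a small sketch that emulates a two-field struct in JavaScript with a `DataView`. The 4-byte `int32` fields are an assumption (they match "x at offset 0, y at offset 4"), and the names `OFFSET_X`, `OFFSET_Y`, `writePoint`, and `readY` are made up for illustration.

```javascript
// Emulating `struct { int32_t x; int32_t y; }` with fixed byte offsets.
const OFFSET_X = 0; // known before the program runs
const OFFSET_Y = 4;
const SIZEOF_POINT = 8;

// Room for two points laid out back to back.
const memory = new DataView(new ArrayBuffer(SIZEOF_POINT * 2));

function writePoint(base, x, y) {
  memory.setInt32(base + OFFSET_X, x, true);
  memory.setInt32(base + OFFSET_Y, y, true);
}

// A property read is base + constant offset: no key lookup anywhere.
function readY(base) {
  return memory.getInt32(base + OFFSET_Y, true);
}

writePoint(0, 11, 22);
writePoint(SIZEOF_POINT, 33, 44);
const ys = [readY(0), readY(SIZEOF_POINT)];
console.log(ys); // [ 22, 44 ]
```

Nothing in this layout can grow or shrink at run time, which is exactly the restriction JS objects do not have.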
First instinct: if fields are dynamic, store them as [key1, val1, key2, val2 ...] and walk the array on every obj.x.
Two problems:
- Lookup is O(n): every property access scans the keys.
- A million {x, y} objects means a million copies of "x" and "y".
The second one is lethal: a real Web app has tens or hundreds of thousands of identically-shaped objects. That's unacceptable bloat.
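The naive design is easy to sketch, and both problems show up immediately. This is an illustration only; `makeNaive` and `naiveGet` are hypothetical names.

```javascript
// Naive layout: each object carries its own [key, value, key, value, ...] list.
function makeNaive(...pairs) {
  return pairs.flat();
}

// Every read is a linear scan over the keys: O(n) in the number of properties.
function naiveGet(obj, key) {
  for (let i = 0; i < obj.length; i += 2) {
    if (obj[i] === key) return obj[i + 1];
  }
  return undefined;
}

const points = [];
for (let i = 0; i < 1000; i++) points.push(makeNaive(['x', i], ['y', i * 2]));

// Each object stores its keys again: 2 key slots per object, 2000 in total.
const keySlots = points.reduce((n, p) => n + p.length / 2, 0);
console.log(naiveGet(points[3], 'y'), keySlots); // 6 2000
```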
V8 chose this: every JSObject has three storage areas plus a pointer to a Hidden Class, and the Hidden Class itself is the "shape description". All same-shape objects share one.
- *hiddenClass: points to the shape descriptor. All {x, y, z} objects point at the same Hidden Class, so key names are stored once. The next chapter dissects it.
- *properties: points to the overflow array for named properties that don't fit in-object.
- *elements: points to an indexed-elements array. Numeric-indexed (arr[0]-style) values live here, in contiguous memory, which makes index access very fast.
- In-object slots: slots reserved inside the JSObject itself. Fastest access (base + offset, just like a C struct), but only when the shape is known, which is exactly what Hidden Class + IC give you.
The payoff: all {x, y, z} objects share a single set of "x" / "y" / "z" strings (inside the shared Hidden Class). The price: change the shape, change the Hidden Class. Hence the central mechanism of Phase III, the Transition Chain (next chapter).
same-shape objects share one descriptor
"Hidden Class" is V8's term; internally, the V8 source calls it a Map (yes, the same Map you see in %DebugPrint output). Edge's Chakra calls them Types, JavaScriptCore calls them Structures, SpiderMonkey calls them Shapes. Every modern JS engine has the same thing under a different label.
The most important sub-structure inside a Hidden Class is the DescriptorArray: it records which keys this shape has and which in-object slot index each key maps to.
Figure: what const o = { x: 11, y: 22 } really looks like in V8. The JSObject itself only carries values; key names live in the shared Hidden Class. Another { x: 33, y: 44 } would point at the same Hidden Class, and that sharing is the lever.
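A toy model of that split may help. It is not V8's real data structure: `Shape`, `makeObject`, and `getProperty` are hypothetical, and a `Map` stands in for the DescriptorArray.

```javascript
// Toy model only: a "shape" holds the keys, each object holds just the values.
class Shape {
  constructor(keys) {
    // DescriptorArray stand-in: key -> slot index inside the object.
    this.descriptors = new Map(keys.map((key, slot) => [key, slot]));
  }
}

const SHAPE_XY = new Shape(['x', 'y']); // shared by every {x, y} object

function makeObject(shape, ...values) {
  return { shape, slots: values }; // no key names stored here
}

function getProperty(obj, key) {
  const slot = obj.shape.descriptors.get(key);
  return slot === undefined ? undefined : obj.slots[slot];
}

const o = makeObject(SHAPE_XY, 11, 22);
const p = makeObject(SHAPE_XY, 33, 44);
console.log(getProperty(o, 'y'), getProperty(p, 'x'), o.shape === p.shape);
// 22 33 true
```

The key strings exist once, in `SHAPE_XY`, no matter how many objects are created.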
The crucial property: objects of identical shape share the same Hidden Class instance.
- const o1 = { x: 11, y: 22 } · Hidden Class A
- const o2 = { x: 33, y: 44 } · same Hidden Class A
- const o3 = { y: 11, x: 22 } · different Hidden Class B (order changed!)
- const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C (extra key)
Note o3 vs o1: assignment order is part of the shape. This underlies rule #4 of Phase IV: "keep property assignment order stable".
Run %DebugPrint(obj) and look at the map: 0x... line. Same physical address = same Hidden Class = the same IC fast path will hit both objects.
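Instead of comparing addresses by eye, V8's `%HaveSameMap(a, b)` runtime function can answer the question directly. The sketch below assumes `--allow-natives-syntax`; the `tryNative` wrapper is a hypothetical convenience so the file still loads without the flag, in which case it reports "unknown".

```javascript
// Hypothetical helper: returns null when --allow-natives-syntax is missing.
function tryNative(body, ...argNames) {
  try {
    return new Function(...argNames, body);
  } catch {
    return null;
  }
}
const haveSameMap = tryNative('return %HaveSameMap(a, b)', 'a', 'b');

const o1 = { x: 11, y: 22 };
const o2 = { x: 33, y: 44 };
const o3 = { y: 11, x: 22 };        // same keys, different order
const o4 = { x: 11, y: 22, z: 33 }; // one extra key

const pairs = { 'o1/o2': [o1, o2], 'o1/o3': [o1, o3], 'o1/o4': [o1, o4] };
const report = {};
for (const [name, [a, b]] of Object.entries(pairs)) {
  report[name] = haveSameMap ? haveSameMap(a, b) : 'unknown (flag missing)';
}
console.log(report);
```

If the list above holds on your build, only the `o1/o2` pair should come back as sharing a map.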
The previous chapter mentioned that V8 has two places to store named properties: in-object (reserved inside the JSObject body) and the *properties array (overflow). The DescriptorArray in the Hidden Class describes both; to you it's just obj.x, but V8 may take either path internally.
Which one? V8 reserves 4 in-object slots for an empty object (the mechanism that sizes this reservation is called Slack Tracking, see the Ch20 toolbox). The first 4 properties go in-object; later ones overflow into *properties. That's the foundation for Phase IV rule #5 ("declare class fields with defaults"): fill those slots immediately at construction.
click to watch the chain grow node by node
The previous chapter said same-shape objects share a Hidden Class, but how does a shape change? V8's answer: chain Hidden Classes into a transition chain. Each new property appends a node; objects that took the same path of insertions share the same chain node.
In the interactive demo, clicking "+x", "+y", "+z" grows the chain one node at a time, starting from the empty shape ∅.
The chain structure makes the order dependence obvious. Take two innocuous-looking blocks that build the same object but add its properties in a different order. In V8's eyes they point at two completely different Hidden Classes, and any function that takes either gets pushed into polymorphism. It's the most insidious trap in performance-critical code.
The fix is mechanical: initialize all fields up front, in a fixed order. React/Vue's internal instance pools deliberately preserve field order across components for exactly this reason: keep every instance on the same Hidden Class.
A chain only handles same-path growth. When two paths diverge at some step, the Hidden Classes form a tree of transitions, with one branch per divergent property.
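The chain and the tree are the same structure, which a toy model shows in a few lines. This is an illustration under simplifying assumptions, not V8's implementation; `ShapeNode` and `shapeOf` are hypothetical names.

```javascript
// Toy model only: each node is a shape; edges are "add property <key>".
class ShapeNode {
  constructor(keys = []) {
    this.keys = keys;
    this.transitions = new Map(); // key -> child ShapeNode
  }
  add(key) {
    // Reuse the existing edge if some object already took this step.
    if (!this.transitions.has(key)) {
      this.transitions.set(key, new ShapeNode([...this.keys, key]));
    }
    return this.transitions.get(key);
  }
}

const EMPTY = new ShapeNode(); // the root shape ∅

// Walk (and extend, if needed) the tree along an insertion order.
const shapeOf = (...keys) => keys.reduce((node, key) => node.add(key), EMPTY);

const a = shapeOf('x', 'y');
const b = shapeOf('x', 'y'); // same path -> same node
const c = shapeOf('y', 'x'); // diverges at the first step -> different branch

console.log(a === b, a === c, EMPTY.transitions.size); // true false 2
```

Two objects end up on the same node only if they added the same keys in the same order, which is why a fixed initialization order matters.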
the knife that cuts O(n) down to O(1)
Everything in Phase III leads here. Inline Cache (IC) is the steepest part of V8's performance curve: it can cut an obj.x access from an O(n) string lookup down to a single mov. The gap can exceed 100×.
A real measurement: the same "service discovery" function written two ways, dynamic and static, 10M iterations each. Same logic, same 10M calls: dynamic map[key] takes 6.4 s, static map.a takes 44 ms, a gap of ~145×. That's not the function's fault; it's whether V8 can fold the access into an IC.
Reading obj.a via dynamic map[key] vs static map.a sends V8 down two completely different paths. Drawing what each call has to do makes the gap concrete:
[Figure: two lookup paths. Left: map[key], dynamic key, O(N). Right: map.a, static key, IC hit, O(1), ending in ldr [obj + 32] (load directly, ~3 cycles).]
Same obj.a: on the left, the dynamic lookup walks the Hidden Class. In the best case (this run) it hits on the 3rd row; in the worst case (target key at the end of the table) it scans to the end. On the right, once the IC has cached, only two ops remain: 1 cmp to validate that the shape is unchanged, and 1 ldr at a fixed offset. The cycle gap between these two paths is the physical source of the 145× in the benchmark above.
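A rough way to try the comparison on your own machine is sketched below. It is not the article's "service discovery" benchmark: the object, the loop size, and the helper names are all assumptions, and with only four keys a modern V8 may narrow the gap considerably, so expect a ratio that differs from 145×.

```javascript
const map = { a: 1, b: 2, c: 3, d: 4 };
const keys = Object.keys(map);
const N = 1_000_000; // the article uses 10M; smaller here to keep the run short

// Dynamic: the key is only known at run time.
function dynamicSum(obj, ks, n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += obj[ks[i & 3]];
  return sum;
}

// Static: every key is spelled out at the access site.
function staticSum(obj, n) {
  let sum = 0;
  for (let i = 0; i < n; i += 4) sum += obj.a + obj.b + obj.c + obj.d;
  return sum;
}

function time(label, fn) {
  const start = process.hrtime.bigint();
  const result = fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
  return result;
}

const dynamicResult = time('map[key]', () => dynamicSum(map, keys, N));
const staticResult = time('map.key ', () => staticSum(map, N));
```

Both functions read the same one million values, so equal results confirm the comparison is fair before you look at the timings.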
Emitted as real ARM64, that fast path is five instructions: the cmp and ldr map directly to the two steps above, and the extra b.ne is the deopt guard.
The starred instruction, ★ ldr x0, [x4, #+32], is Inline Cache in the flesh: V8 inlines "look up by key" into "load at a fixed offset". The offset can be baked into the machine code because the cmp before it verifies the shape is unchanged; if the shape changes, deopt throws the whole block away and recompiles. The "cache" is inlined into the machine code itself, and that's where the name comes from.
Static beats dynamic.
Not a style preference: a 145× gap. Field Note · 03
ICs follow the same state machine as Chapter 8: uninitialized (before the first call) → monomorphic (one Hidden Class seen) → polymorphic (2 to 4 different Hidden Classes) → megamorphic (more than 4, cache abandoned). "Same shape" and "static key" are two faces of the same thing: the IC only kicks in when both hold.
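The state machine can be sketched as a toy cache for a single `obj.x` access site. Everything here is a simplified model: the `InlineCache` class, the `{ shape, slots }` object layout, and the hard limit of four entries are assumptions taken from the description above, not V8 code.

```javascript
// Toy model only: one inline cache for one `obj.x` access site.
// Objects are modelled as { shape, slots }, shapes as { offsets: { key: slot } }.
class InlineCache {
  constructor(key) {
    this.key = key;
    this.entries = new Map(); // shape -> cached slot offset
    this.megamorphic = false;
  }
  get state() {
    if (this.megamorphic) return 'megamorphic';
    if (this.entries.size === 0) return 'uninitialized';
    return this.entries.size === 1 ? 'monomorphic' : 'polymorphic';
  }
  load(obj) {
    const cached = this.entries.get(obj.shape);
    if (cached !== undefined) return obj.slots[cached]; // fast path: check + load
    const offset = obj.shape.offsets[this.key];         // slow path: real lookup
    if (!this.megamorphic) {
      if (this.entries.size < 4) {
        this.entries.set(obj.shape, offset);
      } else {
        this.entries.clear(); // a 5th shape: give up on caching
        this.megamorphic = true;
      }
    }
    return obj.slots[offset];
  }
}

const ic = new InlineCache('x');
const states = [ic.state];
for (let n = 0; n < 5; n++) {
  const shape = { offsets: { x: n } }; // a brand-new shape each round
  const slots = new Array(n + 1).fill(0);
  slots[n] = 42;
  ic.load({ shape, slots });
  states.push(ic.state);
}
console.log(states.join(' -> '));
// uninitialized -> monomorphic -> polymorphic -> polymorphic -> polymorphic -> megamorphic
```

Feeding the same shape repeatedly would keep the cache monomorphic forever, which is the state the rules in Phase IV try to preserve.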
The cost of delete
caching's worst enemy is invalidation
Everything so far has been Fast Properties: Hidden Class + IC compressing access into one ldr. There's one operation that kicks an object off the fast path entirely, demoting it to Slow Properties (dozens to hundreds of times slower).
That operation is delete.
Allowing delete opens three nasty cans of worms:
- After delete o1.x, what happens to the remaining in-object slots? Shift the later properties to fill the gap, and pointers held elsewhere go stale.
- Leave the hole unfilled, and the offsets cached by ICs are wrong.
- x is gone but something else still points at it, so references break; and if the object switches to another Hidden Class, every IC has to be invalidated.
All three are hard. V8 picked the simple give-up plan: an object hit by delete degrades to Slow Properties, storing its properties in a dictionary (like Map<string, Value>) and abandoning in-object + IC optimization.
Dictionary access is a hash lookup, dozens to a hundred times slower than an IC hit. And the demotion is effectively one-way: once Slow, the object stays Slow.
A war story: some code ran delete obj.bigPayload on every cache object at the end of a loop to "free memory". On the next frame, every function reading those objects' properties deopted, the cache objects all dropped to Slow Properties, and the whole module ran 4× slower. The fix: obj.bigPayload = null (or = undefined), which preserves the Hidden Class while still letting GC reclaim the referenced memory.
In performance-critical code, avoid delete wherever you can. To "clear" a property, set it to null (an explicit "no value") or undefined. The Hidden Class survives, ICs stay valid, and GC still reclaims the referenced memory.
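The difference between the two ways of "clearing" can be checked with V8's `%HasFastProperties` runtime function. The sketch assumes `--allow-natives-syntax`; the `tryNative` wrapper and the `makeEntry` cache entry are hypothetical, and without the flag only the key lists are printed.

```javascript
// Hypothetical helper: returns null when --allow-natives-syntax is missing.
function tryNative(body, ...argNames) {
  try {
    return new Function(...argNames, body);
  } catch {
    return null;
  }
}
const hasFastProperties = tryNative('return %HasFastProperties(o)', 'o');

const makeEntry = () => ({ id: 1, bigPayload: new Array(1000).fill(0), ttl: 60 });

const deleted = makeEntry();
delete deleted.bigPayload; // removes the key: the object's shape changes

const nulled = makeEntry();
nulled.bigPayload = null;  // keeps the key: only the value changes

console.log(Object.keys(deleted), Object.keys(nulled));
if (hasFastProperties) {
  console.log('deleted fast?', hasFastProperties(deleted));
  console.log('nulled  fast?', hasFastProperties(nulled));
}
```

Both variants release the large array for garbage collection; only the second keeps the key set, and therefore the shape, intact.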
putting all 14 chapters' diagnostic tools to work
The previous 14 chapters were the knives. This chapter takes the main-line function and dissects it.
The function in one sentence: normalize any input (number / string / object) into a rem value. In our codebase it ran hundreds of times per layout frame, and the profiler called it out as a clear hot spot.
The 240 ms / 24 ms / 145× / 10× numbers in this piece are typical magnitudes on an M1 MacBook Pro with Node 22; your machine may show anywhere from 2× to 20× depending on hardware, Node version, and loop size. But V8's internal feedback state machine is deterministic: polymorphism produces BinaryOp::Any, and deoptimization writes not a Smi to the deopt log. Read those signals; don't fixate on reproducing a specific multiplier.
Save the snippet below as a.js and run node --allow-natives-syntax a.js. You'll get the same kind of output the cut above shows: you can see BinaryOp::Any and MEGAMORPHIC with your own eyes.
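The article's own listing is not reproduced in this text, so what follows is a hypothetical stand-in for a v0 `px2rem`: one function, three kinds of input, a dynamic-key lookup, and objects built in two property orders. `UNIT_MAP`'s contents and the input mix are assumptions. The `%DebugPrint` call is wrapped so the file still loads without `--allow-natives-syntax`.

```javascript
// Hypothetical v0: one function, three structurally different inputs.
const UNIT_MAP = { px: 1, rem: 16 }; // assumed: pixels per unit

function px2rem(input, base) {
  if (typeof input === 'number') return input / base;
  if (typeof input === 'string') return parseFloat(input) / base;
  return (input.value * UNIT_MAP[input.unit]) / base; // dynamic key
}

// Mixed inputs, including objects built in different property orders.
const inputs = [16, 24.5, '32px', { value: 8, unit: 'px' }, { unit: 'rem', value: 2 }];

let total = 0;
const start = process.hrtime.bigint();
for (let i = 0; i < 1_000_000; i++) total += px2rem(inputs[i % inputs.length], 16);
const ms = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`v0: ${ms.toFixed(1)} ms, checksum ${total}`);

// Guarded so the file still parses without --allow-natives-syntax.
try {
  new Function('fn', '%DebugPrint(fn)')(px2rem);
} catch {
  console.log('re-run with --allow-natives-syntax to see the feedback vector');
}
```

Adding `--trace-deopt` to the same command line should also surface the deopt reasons discussed in the table that follows.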
- 81 = 1010001 = bits 0+4+6 = kIsFunction | kOptimized | kTurboFanned, exactly the line in the screenshot.
- The feedback vector section of %DebugPrint is where diagnosis really happens. Look for slot #N <op>::<type>: SignedSmall / Number = monomorphic (fast path); Any or MEGAMORPHIC = the "already sick" state from the article.
- --trace-deopt prints every deopt event with its reason (not a Smi / wrong map) and the @ bytecode N offset, which lets you trace back to the exact JS line that triggered it.
- SyntaxError: Unexpected token '%' means you forgot --allow-natives-syntax. It's a parse error, not a runtime error, so the whole file fails to load.
- If the %DebugPrint output doesn't reach your grep, redirect stderr as well (2>&1).

| # | Symptom | Root cause | Ref |
|---|---|---|---|
| 1 | BinaryOp::Any | args mix number / string / object → polymorphic | Ch8 |
| 2 | LoadIC::MEGAMORPHIC | input.value / input.unit see too many shapes → IC gives up | Ch13 |
| 3 | reason: not a Smi | number path expected SMI, but a float (HeapNumber) deopted it | Ch5 |
| 4 | reason: wrong map | multiple {value, unit} shapes flowing through the object branch | Ch11 |
| 5 | function is also long | three input paths in one function → bytecode bloat → near max_optimized_bytecode_size | Ch9 |
All five trace back to one design mistake: one function handling three structurally different inputs. From V8's view, you've forced it to emit polymorphic machine code for every property access and every binary op, so the fast path never gets a chance to form.
The fix is in the next chapter: split it into three monomorphic functions, then apply the knives from the previous 14 chapters one cut at a time.
click each rule to expand the example
The following 12 rules aren't style preferences: each maps to a specific mechanism from the previous 14 chapters. I've ordered them by how often each one applied to the main-line px2rem function, so the top few cuts buy the most.
px2rem takes number / string / object, so V8 must emit polymorphic machine code for every binary op, dropping to BinaryOp::Any. Split it into three functions (px2remNumber / px2remString / px2remObject) and dispatch at the call site; each function can then be monomorphic.

Before:

    function px2rem(input, base) {
      // one function, three types → BinaryOp::Any
      if (typeof input === 'number') ...
      else if (typeof input === 'string') ...
      else ...
    }

After:

    function px2rem(i, base) {
      // dispatch to monomorphic functions
      if (typeof i === 'number') return px2remNumber(i, base);
      if (typeof i === 'string') return px2remString(i, base);
      return px2remObject(i, base);
    }
Functions over max_optimized_bytecode_size (60K bytecode bytes by default) skip optimization entirely. Even below the limit, small functions benefit from inlining: TurboFan folds small callees into the caller, saving the call's push/pop.

Before:

    function processOrder(o) {
      // 1000 lines mixing validation / calc / format / dispatch
      // → over max_optimized_bytecode_size
      // → V8 gives up optimizing the whole thing
    }

After:

    function processOrder(o) {
      validate(o); // < 200 bytes
      calc(o);     // < 200 bytes
      format(o);   // each one can be inlined
    }
TS types aren't decoration. In practice they protect the monomorphism of hot functions: with a signature of (n: number) => number, you basically won't feed it a string by accident.

Before:

    function add(a, b) {
      // no type constraint: anyone can pass a string
      return a + b;
    }

After:

    function add(a: number, b: number): number {
      // non-number callers are rejected at compile time
      return a + b;
    }
{x:1, y:2} and {y:2, x:1} are two different Hidden Classes in V8. In factory functions, assign properties in one fixed order so every instance walks the same transition chain.

Before:

    // the conditional assignment changes the order
    if (debug) o.dbgFlag = 1;
    o.x = 1;
    o.y = 2;
    // → the Hidden Class forks when debug is on

After:

    // always assign in the same order
    o.x = 1;
    o.y = 2;
    if (debug) o.dbgFlag = 1;
    // → every instance walks the same x → y transitions
V8 reserves 4 in-object slots for an empty object (Slack Tracking). If your constructor only "sometimes" assigns a field, you fork the Hidden Class. Initialize every field in the constructor (use null / undefined when there is no value) so all instances stay on one chain.

Before:

    class Point {
      constructor(x, y) {
        this.x = x;
        if (y !== undefined) this.y = y;
        // instances with and without y get different HCs
      }
    }

After:

    class Point {
      y = 0; // field default
      constructor(x, y = 0) {
        this.x = x;
        this.y = y; // every instance shares one HC
      }
    }
delete

A single delete obj.x kicks an object from Fast Properties into Slow Properties: every IC is invalidated, later accesses run dozens to a hundred times slower, and the demotion is irreversible. To "clear" a property, use obj.x = null.

Before:

    // freeing memory with a casual delete
    delete cache.payload;
    // → cache is stuck in Slow Properties
    // → every IC that reads cache.* is invalidated

After:

    // set to null; GC still reclaims the referenced memory
    cache.payload = null;
    // → Hidden Class unchanged
    // → all ICs preserved
Run your core scenarios with --trace-deopt on a production build and see which functions deopt. Most cases come from an occasional undefined or an occasional try-catch throw. Remove the "occasionals".

Before:

    function format(x) {
      try { return x.toFixed(2); }
      catch { return '0'; }
      // an occasional throw triggers a deopt → permanently polymorphic
    }

After:

    function format(x) {
      return typeof x === 'number' ? x.toFixed(2) : '0';
      // a typeof guard replaces the try-catch
    }
This is the 145× benchmark gap from Ch13. In hot paths, replace obj[key] with obj.knownKey and string switches with integer-enum switches. One clean cut.

Before:

    // KeyedLoadIC, O(N) string comparison
    return obj[key];
    switch (mode) {
      case 'show': ...
      case 'hide': ...
    }

After:

    // LoadIC, O(1) offset read
    return obj.knownKey;
    switch (mode) {
      case MODE_SHOW: ... // integer enum
      case MODE_HIDE: ...
    }
const o = {x: 1, y: 2} is more reliable than const o = {}; o.x = 1; o.y = 2. The literal builds the Hidden Class in one shot; the procedural form walks two transitions.

Before:

    const o = {};
    o.x = 1;
    o.y = 2;
    // → ∅ → "x" → "y": two transitions

After:

    const o = { x: 1, y: 2 };
    // → Hidden Class built in one shot
    // → shared with a million same-shape objects
Based on escape analysis: if an object never escapes its function, V8 can replace its fields with register variables and skip heap allocation entirely. That's a free win for GC too.

Before:

    function dist2(a, b) {
      const tmp = { dx: a - b, dy: 0 };
      return tmp.dx * tmp.dx;
      // → tmp goes on the heap: allocation + GC + indirection
    }

After:

    function dist2(a, b) {
      const dx = a - b;
      return dx * dx;
      // → everything stays in registers, zero allocation
    }
SMIs (small integers) are immediate values in V8 and never touch the heap; floats are generally boxed as HeapNumber, which costs allocation, GC, and an indirection. Prefer Math.floor and integer enums where you can, and convert to float once (a single / 100) at the final output step.

Before:

    let sum = 0;
    for (...) sum += 1.5;
    return sum / 1000;
    // → boxed as HeapNumber from the first +=

After:

    let sum = 0;
    for (...) sum += 1500; // integer accumulation
    return sum / 1_000_000;
    // → SMI throughout, one float conversion at the end
Ref<T>-style wrappers

React's useRef(0) (and similar Ref<T> wrappers in Vue) wraps a number into a { current: 0 } object, so every read and write goes through a Hidden Class + IC. If you need to read and write a number at high frequency in a hot path, a closure-captured let is several times faster than a ref.

Before:

    const count = useRef(0);
    for (let i = 0; i < 1e6; i++) {
      count.current++; // goes through the IC every time
    }

After:

    let count = 0; // closure variable
    for (let i = 0; i < 1e6; i++) {
      count++; // plain register increment
    }
The 12 rules above say what to do. These five are the inverse: typical violations of those rules, all sharing one trait. They look harmless and usually pass code review, but the moment they enter a hot path they cut performance in half. Memorize them as patterns and you'll spot them at a glance during review.
delete

    for (const item of cache) {
      delete item.bigPayload; // looks like it helps GC
    }

Demotes every item to Slow Properties for good · all ICs invalidated · irreversible. To clear, use = null instead; the Hidden Class stays intact. → Ch14
console.log(arg) in a hotspot

    function layoutCell(cell) {
      console.log('cell:', cell);
      // arg is occasionally undefined
      // → the function deopts and stays polymorphic
    }

console.log itself isn't slow, but if arg is occasionally undefined or a complex object, the host function deopts and runs on polymorphic machine code from then on. In production code, strip it entirely or gate it dev-only. → Ch7
obj[key] with a variable key
Looks just like obj.x · even static analysis usually misses it

    function pick(obj, key) {
      return obj[key]; // KeyedLoadIC, 145× slower
    }

Routes through KeyedLoadIC instead of LoadIC: a more complex state machine, more prone to decaying into megamorphic. If the key set is known, switch to an if-chain or a map literal with .knownKey. If it truly must be dynamic, at least guard with typeof key === 'string'. → Ch13
useRef(0) / numeric wrappers

    const count = useRef(0);
    function onScroll() {
      count.current++; // both the read and the write go through the IC
    }

useRef(n) wraps the number into a { current: n } object, so every count.current goes through a Hidden Class + IC, roughly 5× the cost of a SMI register increment. For persistence across React renders, use a ref; for a hot-path counter, use a closure let. Don't conflate the two. → Ch13
try / catch as a type guard

    function format(x) {
      try { return x.toFixed(2); }
      catch { return '0'; }
      // an occasional throw pushes the whole function onto the
      // polymorphic + deopt path, and V8 is also more conservative
      // about optimizing the try-catch block itself
    }

Using try-catch as a type guard pays twice: (1) the occasional throw triggers a deopt; (2) historically V8 was more conservative about inlining and optimizing functions containing try-catch (much better in modern versions, but still non-zero). Replace it with an explicit typeof guard for a speedup of several times. → Ch7
These five aren't "always wrong"; they're "worth questioning when they show up in hot code". delete in a setup function is fine, and useRef(0) for cross-render state is fine. They're only lethal on paths called hundreds of times per second. The workflow: find the hotspot in the Performance panel → grep the hotspot for these five patterns → fix whatever hits.
after all the cuts have landed
The version after all twelve rules. The code is longer, but every function is monomorphic, field order is fixed, and there is no delete and no dynamic key:
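The original v1 listing is not part of this text. The sketch below is a hypothetical reconstruction that follows the rules as stated: three monomorphic branch functions, a fixed field order via a factory, and an integer enum in place of string units. All names other than `px2rem`, `px2remNumber`, `px2remString`, and `px2remObject` are assumptions.

```javascript
// Hypothetical v1, mirroring the article's rule list; not the original listing.
const UNIT_PX = 0;  // integer enum instead of string units
const UNIT_REM = 1;

function px2remNumber(n, base) {
  return n / base;
}

function px2remString(s, base) {
  return parseFloat(s) / base;
}

// Callers always build { value, unit } in this order, so one shape reaches here.
function px2remObject(o, base) {
  return o.unit === UNIT_REM ? o.value : o.value / base;
}

// Thin dispatcher: each branch function only ever sees one kind of input.
function px2rem(input, base) {
  if (typeof input === 'number') return px2remNumber(input, base);
  if (typeof input === 'string') return px2remString(input, base);
  return px2remObject(input, base);
}

const makeLength = (value, unit) => ({ value, unit }); // fixed field order

const inputs = [16, '32px', makeLength(8, UNIT_PX), makeLength(2, UNIT_REM)];
const results = inputs.map((input) => px2rem(input, 16));
console.log(results); // [ 1, 2, 0.5, 2 ]
```

In a benchmark loop, hot call sites that already know the input type can call the branch functions directly and skip the dispatcher.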
Save the snippet below as a.js and run node --allow-natives-syntax a.js. You'll see the BinaryOp::Number (monomorphic!) and kind: TURBOFAN from the screenshot above with your own eyes. If you see Any instead of Number, your rewrite still has a leak.
Run this side by side with the v0 a.js from Ch15 and the contrast is concrete: same machine, same loop size, but one prints ::Any and the other prints ::Number; one takes ~240 ms, the other ~24 ms. That contrast is the physical evidence for the 10×.
All three branch functions should show monomorphic feedback vectors (::Number, MONOMORPHIC), a sharp contrast with v0:
- Run the v0 a.js from Ch15 too and paste both terminal outputs side by side. Confirm v0 shows ::Any + MEGAMORPHIC while v1 shows ::Number + MONOMORPHIC; that's the state-machine evidence behind the 10×.
- If you still see ::Any: most likely you accidentally fed the function an unexpected type (e.g. px2remNumber got undefined or a boolean). Audit your dispatch logic.
- If kind isn't TURBOFAN but BASELINE / IGNITION: either the 1M-call loop didn't cross the tiering threshold, or the function is over max_optimized_bytecode_size (Ch9). Bump the loop or split smaller.
No single cut is magic. Each one solves one specific V8 mechanism problem, and the small wins compound. As a ledger:
| Cut | Problem fixed | Per-cut win | Cumulative |
|---|---|---|---|
| v0 | baseline | - | 243 ms |
| + R1 | three monomorphic functions → exit BinaryOp::Any | −50% | 122 ms |
| + R2/9 | small functions get inlined + constants hoisted to module top level | −35% | 79 ms |
| + R4/5 | all result/input objects share a Hidden Class | −39% | 48 ms |
| + R8/11 | static keys + integer enums → IC kicks in | −35% | 31 ms |
| + R10 | escape analysis, temp objects skip the heap | −23% | 24 ms |
| v1 | vs v0 | 10.1× | 24 ms |
A tenfold speedup isn't magic.
It's twelve cuts that compound. Field Note · 03
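The compounding in the ledger is plain multiplication, which a few lines can verify. The percentages and the 243 ms baseline come from the table; the variable names are illustrative.

```javascript
const baselineMs = 243;
// Per-cut reductions from the ledger, in order: R1, R2/9, R4/5, R8/11, R10.
const reductions = [0.50, 0.35, 0.39, 0.35, 0.23];

let ms = baselineMs;
for (const r of reductions) {
  ms *= 1 - r; // each cut removes a fraction of whatever time is left
}
const speedup = baselineMs / ms;
console.log(`final: ${Math.round(ms)} ms, overall: ${speedup.toFixed(1)}x`);
// final: 24 ms, overall: 10.1x
```

No single reduction exceeds 50%, yet five of them multiplied together leave about a tenth of the original time.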
Mostly yes. The precondition is that your bottleneck really is JS execution. If it's DOM work, compositing, network, or GC, that's a different mountain (each covered in a different chapter of the chromium-renderer piece).
A quick check: open the Performance panel in Chrome DevTools and look at your hot function's share of frame time and its color. If it sits in the yellow Scripting category and takes more than 5% of the frame, this methodology will almost certainly help.
this methodology is engine-agnostic
This piece has been about V8, but those 12 rules hold across engines. The reason is simple: Hidden Class, Inline Cache, and type feedback all trace back to the Self language research of 1991, and every modern JS engine has independently implemented the same trio.
| Engine | JIT tiers | Hidden Class | IC | Type feedback |
|---|---|---|---|---|
| V8 · Chrome / Node | Ignition · Sparkplug · Maglev · TurboFan | Map | ✓ | FeedbackVector |
| JSCore · Safari | LLInt · Baseline · DFG · FTL (B3; originally LLVM) | Structure | ✓ | ValueProfile |
| SpiderMonkey · Firefox | Interpreter · Baseline · Warp · Ion | Shape | ✓ | CacheIR |
| Hermes · RN | AOT bytecode (no JIT) | HiddenClass | ✓ | none (no JIT) |
JSCore (WebKit's engine, used in Safari on iOS and macOS) has a distinctive design: its peak tier, FTL, was originally built as "Fourth Tier LLVM", compiling hot JS into LLVM IR and running LLVM's optimization passes over it, the same LLVM that compiles C++, Rust, and Swift. (WebKit has since swapped LLVM for its own B3 backend, but the tier keeps the same role.)
Real-world impact: on certain benchmarks Safari's JSCore beats Chrome's V8, especially on compute-heavy, type-stable code, where the FTL tier's loop, SIMD, and inlining strategies can be more aggressive than TurboFan's.
But fast-JS writing is the same trade across engines: the 12 rules apply word-for-word to JSCore.
On my M1 MacBook Pro, the v1 px2rem at 1M iterations: Chrome (V8) 24 ms, Safari (JSCore) 17 ms. Safari wins here because its top tier fully unrolled the UNIT_MAP lookup into direct compares. But fast code stays fast across browsers. That's the methodology's real value.
when JavaScript isn't fast enough
Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, you have to stop writing JS. That's WebAssembly's slot.
V8 actually runs two separate pipelines: JS uses the four tiers covered earlier (Ignition→Sparkplug→Maglev→TurboFan); Wasm has its own two tiers, Liftoff (baseline, compiles in milliseconds) and TurboFan (peak, sharing the same backend). Both pipelines share the same machine-code memory, the same GC, and the same main thread, so Wasm isn't "another language" so much as another shape of the JS performance curve.
| Scenario | Recommendation |
|---|---|
| UI hotspots (layout, scroll, animation) | JS optimization is usually enough |
| Media codecs / crypto / physics simulation | Wasm wins decisively (2–10×) |
| Bulk data processing (collaborative editing, spreadsheets) | It depends: repeated JS↔Wasm boundary crossings can eat the gain |
| DOM operations | Wasm is slower here, because every call must bridge through JS |
And critically, Wasm isn't fast just by being Wasm. Poorly written Wasm (frequent boundary calls, an unfriendly memory layout, no vectorization) sometimes loses to optimized JS that implements the same logic.
So the last word of this piece is still: push JS to its limit with the 12 rules first, and reach for Wasm second. In most business code, JS optimization solves 80% of performance problems without adding build complexity.
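To get a feel for the boundary cost mentioned above, here is a minimal sketch. It hand-assembles a tiny Wasm module that exports `add(i32, i32) -> i32` and calls it once per loop iteration, next to an equivalent JS function. The module bytes, `jsAdd`, and `time` are illustrative assumptions, not code from this article, and the timings depend entirely on your engine and version.

```javascript
// Hand-assembled Wasm binary: one function, exported as "add", (i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,       // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // function 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,       // export "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // local.get 0/1, i32.add
]);

// Synchronous compile + instantiate is acceptable for a module this small.
const wasmAdd = new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports.add;

// JS equivalent with the same 32-bit wrapping behaviour as i32.add.
const jsAdd = (a, b) => (a + b) | 0;

function time(fn, n = 1_000_000) {
  let acc = 0;
  const t0 = performance.now();
  for (let i = 0; i < n; i++) acc = fn(acc, i); // one call, hence one boundary crossing, per iteration
  return { ms: performance.now() - t0, acc };
}

const js = time(jsAdd);
const wasm = time(wasmAdd);
console.log(`JS:   ${js.ms.toFixed(1)} ms`);
console.log(`Wasm: ${wasm.ms.toFixed(1)} ms (one JS->Wasm call per iteration)`);
```

The work per call is deliberately trivial, so whatever difference you see is dominated by call overhead rather than arithmetic. That is the pattern to avoid: move the whole loop into Wasm, or keep it all in JS.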
--allow-natives-syntax in practice
every command, flag, and native syntax used in this piece, in one place
Every tool this piece uses to see what V8 is doing is collected here. Save this chapter as a cheatsheet, and copy from it the next time you face slow JS.
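As a quick starting point, the sketch below exercises three V8 natives: `%HaveSameMap`, `%HasFastProperties`, and `%DebugPrint`. The usual route is `node --allow-natives-syntax a.js`. To keep this file runnable with plain `node`, it instead enables the flag at runtime through `v8.setFlagsFromString` and compiles the `%` calls afterwards with `new Function`. That runtime trick is an assumption of this sketch, not the workflow used elsewhere in this piece, and the helper names (`native`, `makePoint`) are hypothetical.

```javascript
// Assumption: turn the flag on at runtime, then compile the % calls afterwards.
const v8 = require('node:v8');
v8.setFlagsFromString('--allow-natives-syntax');

// Build a small wrapper around a native; it is parsed only after the flag is set.
const native = (params, body) => new Function(...params, `return ${body};`);

const haveSameMap = native(['a', 'b'], '%HaveSameMap(a, b)');
const hasFastProperties = native(['o'], '%HasFastProperties(o)');
const debugPrint = native(['o'], '%DebugPrint(o)');

// Same factory, same property order: the objects should share one hidden class.
const makePoint = (x, y) => ({ x, y });
const p1 = makePoint(1, 2);
const p2 = makePoint(3, 4);

// Same keys in a different insertion order: a different hidden class.
const flipped = { y: 1, x: 2 };

console.log('p1 / p2 share a map:      ', haveSameMap(p1, p2));
console.log('p1 / flipped share a map: ', haveSameMap(p1, flipped));
console.log('p1 has fast properties:   ', hasFastProperties(p1));

// Prints V8's internal view of the object; detail varies with the Node build.
debugPrint(p1);
```

If the runtime flag is rejected on your Node version, drop the first two statements, write the `%` calls inline, and run the file with `node --allow-natives-syntax`.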
Stuck on a term halfway through? Look it up here. Entries are ordered by how often they appear, not alphabetically, so they follow the reading order of this piece.
- In-object properties: slots reserved inside the JSObject body itself; the fastest access path (base + offset). → Ch10
- LoadIC / KeyedLoadIC: two property-load IC variants. LoadIC handles static obj.x (fast); KeyedLoadIC handles dynamic obj[key] and array indexing (slower). → Ch13
- Fast / Slow properties: Fast uses in-object slots plus ICs; Slow uses a dictionary (after a delete, or when too many keys overflow). Slow is 10–100× slower and irreversible. → Ch14
- Checkpoint: an inlined type guard (testb / cmp) that protects an assumption. → Ch4 · Ch6

Chrome DevTools → Memory → Take heap snapshot, then locate an object of interest (for example under Object) and find its map :: system / Map @0x... entry; that is the address of its Hidden Class. This is the most direct way to verify shape stability, and easier than diffing %DebugPrint output.
Everything above that uses %DebugPrint / --trace-deopt is a Node CLI tool, but your real-world bottlenecks usually surface in the browser, where DevTools' Performance panel is the daily entry point. The Performance panel finds which function is the hotspot; the Node tools dissect a hotspot you have already located. Two-step flow:
1. Switch the bottom panel to Bottom-Up and sort by Self Time descending. Self Time is how much CPU the function itself burns, excluding its callees. If a row is above 5% and is JS (blue), this piece's methodology will help. Click any row to open the Call Tree and see the full call chain.
2. Lift that hotspot out of the browser into Node (with a realistic input distribution) and apply Ch15 / Ch20's %DebugPrint + --trace-deopt to see its state inside V8. "Find in the browser, fix in Node" is this methodology's standard workflow.
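A lifted hotspot file might look like the sketch below. The function `add`, the file name a.js, and the input mix are hypothetical stand-ins for whatever the Performance panel pointed you at. Run it with `node --trace-opt --trace-deopt a.js`. The exact log format and the point at which V8 optimizes vary by version, so no expected log output is shown.

```javascript
// a.js (hypothetical): a hotspot lifted out of the browser for inspection.
// Run with: node --trace-opt --trace-deopt a.js
function add(a, b) {
  return a + b;
}

// Phase 1: feed only small integers so V8 can specialize on that assumption.
let total = 0;
for (let i = 0; i < 200_000; i++) total += add(i, 1);

// Phase 2: one call with a string breaks the assumption. If add was optimized
// for integers, this is where a deopt entry would appear in the trace.
const surprise = add('px', 16);

// Phase 3: keep calling so any re-optimization also shows up in --trace-opt.
for (let i = 0; i < 200_000; i++) total += add(i, 1);

console.log(total, surprise);
```

The point of replaying a realistic input distribution is visible in Phase 2: a single off-type call in production is enough to change what V8 does, so a benchmark that only feeds integers would never reproduce it.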
The Performance panel tells you which function burns time, but not why it is slow: has an IC degraded to megamorphic, is the function deopting, or is GC dominating? Those signals live in V8's internal state, and DevTools doesn't surface that layer (most web developers don't need it, so the V8 team has had little incentive to expose it).
Three ways to actually see them:
1. Node with --allow-natives-syntax + %DebugPrint to inspect the feedback vector. Most direct, but you have to rebuild the calling context yourself.
2. Launch Chromium with open -a Chromium --args --js-flags="--allow-natives-syntax --trace-deopt", then watch stdout for deopt logs. The catch: the DevTools Console doesn't show these, so you have to read the terminal that launched Chromium.
3. A third route, enabled through extra flags (--enable-blink-features=... and the like), exposes Maglev / TurboFan compile events. It is the most complete option but has the highest barrier; essentially only engine developers use it.
Practical advice: just do (1). Find the hotspot in the Performance panel, move it into a Node file, and run the a.js templates from Ch15 / Ch17.
This piece only covers bottlenecks in JS execution. If your bottleneck is in DOM, layout, paint, or compositing, see the sister piece Bytecode to Pixels — Chromium's Rendering Pipeline. For compositor jank specifically, see Jank & Stutter.
From 240 ms to 24 ms,
a tenfold speedup is not magic; it's twelve cuts that compound.
Each cut maps to a V8 mechanism. Keep its assumptions stable, and your code stays on the peak tier.