一段 JS 跑得慢——你怎么知道它慢在哪里?怎么基于 V8 内部原理动手改写?怎么验证它真的快了十倍?这是一份方法论手册,不是百科全书。
When a piece of JS runs slow — how do you know where it's slow? How do you rewrite it using V8's internals as a guide? How do you prove it really got ten times faster? A methodology, not an encyclopedia.
const a = 3 + 4 —— 同一行代码,三种世界,三种翻译
one line of code, three worlds, three translations
性能优化的第一步,不是 profile,是回答一个问题:你脑子里那行 JS,跟 CPU 真正执行的那串指令,中间到底差了几层翻译?
把 const a = 3 + 4 摆出来,用三种"眼睛"去看它,你会发现这一行字面上看似一回事的代码,在三层世界里长得完全不一样——而 V8 的所有优化魔法,都发生在这三层之间的翻译过程里。
The first step of performance work isn't profiling — it's answering one question: how many translations sit between the line of JavaScript in your head and the instructions the CPU actually runs?
Take const a = 3 + 4. Look at it with three different eyes. The same one-liner shape-shifts across three worlds, and every V8 optimization trick lives in the translation between them.
裸的 CPU 只认机器码,但 JS 是动态的——它的类型是运行时才知道的。一个 add(a, b) 函数,你不告诉我 a 和 b 是什么,我就没法把它编译成"两个整数相加"这一条 add eax, ebx 指令——因为下一秒你可能会传两个字符串过来。
于是 V8 在脑和 CPU 之间垫了一层字节码:它比 AST 接近物理机,又比机器码灵活——可以解释执行,可以收集"参数到底是什么类型"的反馈,等收集够了再把字节码编译成机器码。
这一层就是性能优化的全部战场。下面整篇文章,讲的都是 V8 在这一层做了什么、能被你怎么利用。
A bare CPU only speaks machine code, but JavaScript is dynamic — types are a runtime fact. Given add(a, b), I can't fold it down to a single add eax, ebx if I don't know what a and b are — next second you might pass me two strings.
So V8 inserts a bytecode layer between brain and CPU: closer to a real machine than the AST, more flexible than machine code — interpretable, observable, and recompilable into machine code once V8 has watched enough calls to know what your types actually are.
That layer is the entire battlefield of JS performance work. The rest of this piece is about what V8 does in there — and how you can cooperate with it.
所谓 JS 的"快",
其实是 V8 在背着你猜对了一万次。 Field Note · 03
"Fast JavaScript" really means
V8 quietly guessed right ten thousand times. Field Note · 03
性能优化的"难"不在改代码——难在看懂 V8 当前在干什么。这篇文章不会铺一遍 V8 全部知识,而是围绕一个具体问题:
The hard part isn't editing code. The hard part is seeing what V8 is doing right now. So this piece doesn't tour every V8 internal — it answers one concrete question:
这是一段很普通的工具函数:输入可能是 number、string、或 { value, unit },输出是一个 rem 数值。它在我们某个项目里被每帧调用上百次,占了一段不容忽视的 CPU 时间。
下面的 19 章,每一章都是把它跑得更快这件事里的一刀。流水线、Hidden Class、Inline Cache 不是为了好看的术语——它们是用来切割问题的刀。
It's a perfectly ordinary helper. Input may be a number, a string, or a { value, unit } shape; output is a rem value. In one of our projects it ran hundreds of times per frame and burned non-trivial CPU.
The 19 chapters that follow are each one cut of making it faster. Pipeline, Hidden Class, Inline Cache — these aren't decorations. They are the knives we'll cut the problem with.
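The helper itself isn't reproduced above, so here is a minimal sketch of its v0 shape — the name px2rem matches the article's main-line function, but the 16px root assumption and exact branches are illustrative, not the project's exact code:

```js
// v0 — one function, three structurally different inputs
// (this is exactly the anti-pattern the rest of the piece dissects)
const ROOT_PX = 16; // assumed root font size

function px2rem(input) {
  if (typeof input === 'number') return input / ROOT_PX;             // 12       → 0.75
  if (typeof input === 'string') return parseFloat(input) / ROOT_PX; // '12px'   → 0.75
  return input.unit === 'rem' ? input.value : input.value / ROOT_PX; // { value, unit }
}
```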
编译期很薄,运行时很厚
a thin compile-time, a fat runtime
C / Rust / Go 这种静态语言,编译期就能确定每个变量是什么类型、每个函数怎么调用、对象在内存里长什么样——所以它们走 AOT (Ahead-Of-Time):发布之前就生成最终机器码。运行时几乎没有"编译"这件事。
而 JS 是动态的,函数被调用前没人能保证 a + b 里的两个值是数字、字符串、还是别的什么东西。所以 V8 走 JIT (Just-In-Time):运行边编译,边收集类型反馈,边根据反馈优化。
Static languages like C / Rust / Go can pin down types, call shapes, and memory layouts at compile time — so they take the AOT (Ahead-Of-Time) road. By the time the binary ships, almost no "compiling" happens at runtime.
JavaScript is dynamic. Until add(a, b) is actually called, nobody can promise the two arguments are numbers — they could be strings or objects. So V8 takes the JIT (Just-In-Time) road: it compiles while running, observes types, and re-optimizes from feedback.
| 维度 Dimension | AOT · C / Rust / Go | JIT · V8 / JSC |
|---|---|---|
| 编译期 Compile-time | 很厚 · 全部优化都在这里做 / thick · all optimization happens here | 很薄 · 只 parse + 生 bytecode / thin · only parse + bytecode |
| 运行时 Runtime | 很薄 · 直接跑机器码 / thin · just runs machine code | 很厚 · 编译/反优化都在跑的过程中 / thick · (re)compile and deopt during execution |
| 类型信息 Type info | 编译期已知 / known at compile time | 运行时收集 (feedback) / collected at runtime (feedback) |
| 最优代码生成时机 Peak code emitted when | 编译完成那一刻 / at compile end | "足够热"之后 (TurboFan) / after "hot enough" (TurboFan) |
| 代价 Cost | 部署慢、二进制大 / slow build, large binary | 冷启动慢、内存被编译产物占用 / slow cold start, memory spent on compiled code |
意思是:你写的同一段 JS,V8 运行时会反复地编译它——先用解释器跑(Ignition),发现是热点函数后,会用基线编译(Sparkplug)、中间编译(Maglev)、最后峰值编译(TurboFan)轮番上阵。每一次升级都要花 ms 级的时间在编译本身上,这部分时间是 AOT 语言不需要付的。
所以你写 function add(a, b) { return a + b },V8 在你眼皮底下可能跑过 4 个不同版本的 add——每个都对应不同程度的优化和不同程度的"假设"。
这恰恰也是性能优化的机会:如果你能让 V8 的假设保持稳定,它就能一直跑在最优版本(TurboFan 机器码)上,不会被"反优化"打回字节码解释。
It means the same chunk of JS gets recompiled while it runs — interpreter first (Ignition), then once V8 notices it's hot, baseline (Sparkplug), mid-tier (Maglev), and finally peak (TurboFan) take turns. Each upgrade burns milliseconds in compilation itself — a tax AOT languages never pay.
So when you write function add(a, b) { return a + b }, four different versions of add may have run inside V8 by the time you blink — each at a different level of optimization, each with a different set of assumptions.
That's also where the leverage sits: keep V8's assumptions stable and your function lives on the peak (TurboFan machine code) forever — never "deoptimized" back into bytecode interpretation.
同一段 JS,在 V8 里其实有四份不同的它
the same JS exists in V8 as four different versions of itself
V8 不是只有一个编译器,而是一条流水线上的四级编译器。同一个函数,会随着"被调用次数"在四级之间向上爬——每爬一级,生成的代码越接近裸机,执行越快,但编译本身的耗时也越大。
这是 V8 的核心权衡:冷代码不值得花力气编译,热代码越烫值得越深的优化。所以函数的"性能"不是一个数字,而是一条会变化的曲线——这一章是这条曲线的地图。
V8 is not one compiler — it's a pipeline of four. The same function climbs the tiers as it gets called more often. Each tier emits code closer to the metal, runs faster, but costs more time to compile in the first place.
This is V8's core tradeoff: cold code isn't worth optimizing, hot code is worth optimizing harder. A function's "performance" isn't a single number — it's a curve that moves over time. This chapter is the map of that curve.
Parser:把源码 tokenize 成 AST,并生成第一份 bytecode。所有后面三层的输入都是这个 bytecode,而不是源码。换句话说,V8 后续的优化全部基于字节码,源码到此为止。
Tokenizes source into an AST and emits the first bytecode. Every later tier consumes that bytecode — not the source. From here on V8's world is bytecode; the source is gone.
Ignition:字节码解释器。逐条执行 bytecode,同时把类型反馈(参数通常是什么类型 / shape)收集到名为 FeedbackVector 的结构里。所有冷代码都死在这一级——没必要再爬。
The bytecode interpreter. Runs bytecode line-by-line and collects type feedback (which types/shapes the args usually take) into a structure called the FeedbackVector. Cold code dies here — no need to climb.
因为 TurboFan 的编译耗时本身就是性能成本——而且 TurboFan 需要 feedback 才能优化得好,没有 feedback 时它生成的代码也很平庸。
真实的程序里,大部分代码只跑几次:页面初始化的某个 setup 函数、某个一次性的 callback——把它们编进 TurboFan 是纯亏。所以 V8 用一个简单的启发式:
Because TurboFan's compile time is itself a performance cost — and TurboFan needs feedback to optimize well; without it, even TurboFan's output is mediocre.
In real apps, most code only runs a handful of times: a setup function on page load, a one-shot callback. Compiling those into TurboFan is a pure loss. So V8 uses a simple heuristic:
"足够热"的判断写在 v8::internal::TieringManager::ShouldOptimize 里——后面 Ch9 会拆开看。

"Hot" is decided in v8::internal::TieringManager::ShouldOptimize — we'll dissect it in Ch9.

三个推论:
Three implications:
--allow-natives-syntax 启动 node 或 Chromium,然后 %GetOptimizationStatus(fn) 会返回一个 bitmask——位 4 是 kOptimized、位 5 是 kMaglevved、位 6 是 kTurboFanned。Ch20 工具箱里有完整命令清单。
Launch node or Chromium with --allow-natives-syntax, then %GetOptimizationStatus(fn) returns a bitmask — bit 4 is kOptimized, bit 5 kMaglevved, bit 6 kTurboFanned. The full command list lives in the Ch20 toolbox.
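A minimal sketch of that workflow — the %-prefixed calls are real V8 native syntax, but the bit layout of the returned mask varies across V8 versions, so treat the bit positions the article quotes as version-specific rather than a stable API:

```js
// node --allow-natives-syntax status.js
function add(a, b) { return a + b; }

add(1, 2);                         // seed type feedback first
%OptimizeFunctionOnNextCall(add);  // ask V8 to tier the function up on the next call
add(1, 2);

// Print the status bitmask in binary and check the tier bits by hand.
console.log(%GetOptimizationStatus(add).toString(2));
```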
为什么虚拟机指令集存在这个世界上
why virtual ISAs exist at all
"字节码"听起来像个魔法词,但它的本质很朴素——一个虚拟 CPU 的指令集。和 x86 / arm 这些真 CPU 的指令集相比,只差在"这世上没有一台 CPU 直接跑它"。
把 const a = 3 + 4 在 V8 里走完一遭,你会看到它在两种语言里出场两次:第一次是 Ignition 的字节码,第二次是 TurboFan 的机器码。下面把它们摆在一起对照看。
"Bytecode" sounds like a magic word, but underneath it's plain: an instruction set for a virtual CPU. The only difference from x86/arm is that no silicon ships with a bytecode decoder built in.
Walk const a = 3 + 4 through V8 and you'll see it appear in two languages: first as Ignition bytecode, then as TurboFan machine code. Side by side:
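An idealized Ignition listing for the bytecode side (illustrative — a real V8 build may constant-fold the two literals at parse time and emit a single LdaSmi [7]):

```
LdaSmi [3]        // load SMI 3 into the accumulator
Star0             // store accumulator into register r0
LdaSmi [4]        // load SMI 4 into the accumulator
Add r0, [0]       // acc = r0 + acc, recording type feedback in slot [0]
Star1             // store the result (a) into register r1
```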
V8 的字节码 ISA 是一台带累加器的寄存器机(register machine with an accumulator)。LdaSmi [3] 里的 Lda 就是 "Load into accumulator",Star0 是 "Store accumulator to register 0"。这种设计有两个好处:累加器是隐式操作数,大多数指令不必写出目标寄存器,字节码因此非常紧凑;寄存器直接映射到解释器栈帧上的槽位,比纯栈机少搬运很多中间值。
但代价也明显:解释器每条指令都要走"取指令 → decode → 跳到 handler → 执行 → 跳回 dispatch"的循环——这个循环本身的开销,大概比裸跑机器码慢一个数量级。所以才有 Sparkplug:把 bytecode 一对一翻译成 asm,把这个 dispatch 循环踢掉。
V8's bytecode ISA is a register machine with an accumulator. LdaSmi [3] means "Load Smi into accumulator"; Star0 means "Store accumulator into register 0". Two wins: the accumulator is an implicit operand, so most instructions don't name a destination register and the bytecode stays very compact; and the registers map onto interpreter stack-frame slots, shuffling far fewer intermediates than a pure stack machine would.
The cost is the interpreter loop: every instruction pays "fetch → decode → jump-to-handler → execute → jump-back". That loop alone is roughly an order of magnitude slower than running raw machine code. That's why Sparkplug exists — translate bytecode 1-to-1 into asm and kill the dispatch loop.
真实的 TurboFan 输出比上面 figure 里的三条指令长得多——你用 node --print-opt-code --allow-natives-syntax 打印出来,会看到一堆 cmp / jne / test 指令围着核心逻辑。这些不是逻辑本身,而是 V8 在做 checkpoint(类型检查)和 调用约定 的钢架。
Real TurboFan output is much fatter than the three-line figure above. Run node --print-opt-code --allow-natives-syntax and you'll see a swarm of cmp / jne / test around the core. That's not logic — it's V8's checkpoint machinery (type guards) plus the calling convention scaffold.
橙色高亮的是类型检查——它们在每次调用时验证"这次传进来的还是 SMI 吗?"。验证通过就走绿色那一行核心 add,验证失败就跳走反优化。一段 1 行的 JS 编出来 9 行汇编——6 行是类型检查钢架,2 行是调用约定,只有 1 行是逻辑。
这套钢架就是后面 Phase II 的主角——assumption + feedback + checkpoint 三件套。它解释了"为什么 TurboFan 比 Ignition 快"和"为什么打破假设性能会突然崩塌"是同一件事的正反面。
The orange lines are type guards: they verify "is this still a SMI?" on every call. Pass → fall through to the green core add; fail → jump out to deopt. One line of JS becomes nine lines of asm — six of type-guard scaffold, two of calling convention, one of logic.
That scaffold is the protagonist of Phase II: the assumption + feedback + checkpoint trio. It's why TurboFan is faster than Ignition and why breaking an assumption tanks performance — same coin, two faces.
JS 是动态的。
除非你让它看起来是静态的。 Field Note · 03
JavaScript is dynamic —
until you make it look static. Field Note · 03
SMI 与 HeapObject 的 1 bit 之差
SMI vs HeapObject, decided by one bit
上一章那段汇编里出现了 testb [rbx+0xf], 0x1——它在检查最低一位。这个习惯并非 V8 独有,C/C++ 里叫 Tagged Pointer:用一个指针的若干低位携带类型信息,而不是另开一个字段。V8 的版本是这样的:
这一个 bit 决定了 V8 看到一个 64 位字时的两条完全不同的处理路径。下面把它点亮看看:
That testb [rbx+0xf], 0x1 in the previous chapter is checking the lowest bit. This isn't V8-only — in C/C++ it's called a Tagged Pointer: stuff type info into a pointer's low bits instead of carrying a separate field. V8's flavor:
That single bit forks V8's handling of a 64-bit word into two completely different paths. Try lighting one up:
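A conceptual sketch of the two paths (illustrative, not real V8 code; on 64-bit builds pointer compression changes the widths, as discussed below):

```js
// Reading a tagged 32-bit word:
//   ...xxxxxxx0  → SMI        — shift the tag off, the rest IS the integer
//   ...xxxxxxx1  → HeapObject — clear the tag, the rest is a heap pointer
function untag(word) {
  return (word & 1) === 0
    ? { kind: 'smi',  value: word >> 1 }     // value carried in the word itself
    : { kind: 'heap', address: word & ~1 };  // a pointer: go chase it on the heap
}
```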
输入一个数字,看 V8 怎么把它打包进一个 32 位字:整数(|num| ≤ 2³⁰)走 SMI,末位是 0 表示"这个字本身就是数值";其他情况(浮点、大整数、字符串、对象)走 HeapObject,末位是 1 表示"这个字是个指针,得去堆上找真东西"。
Type a number and see how V8 packs it into a 32-bit word. Integers fitting |n| ≤ 2³⁰ go SMI — low bit 0 means "the word IS the value". Anything else (floats, big ints, strings, objects) goes HeapObject — low bit 1 means "the word is a pointer, look on the heap".
因为整数太常见了。一个普通页面跑起来,堆里大半是数字——下标、像素、毫秒、坐标、计数器。如果每个数字都老老实实地包成一个 HeapObject(配 hiddenClass / 元信息 / GC 头),内存和指针追逐都会把性能拖死。

Because integers are everywhere. In a real page, the heap is mostly numbers — indices, pixels, milliseconds, coordinates, counters. Boxing every one of them into a HeapObject (with a hiddenClass, meta header, and GC bits) would drown V8 in memory traffic and pointer chasing.

用末位来 tag SMI,V8 可以做到:类型检查只需一条 testb——上一章那段汇编里的"是不是 SMI?"在 CPU 上只占 1 周期。

Tagging SMIs with the low bit lets V8 type-check with a single testb — the "is it a SMI?" guard from the previous chapter is one cycle.

在 32 位 V8 上,SMI 是 31 位带符号整数(±2³⁰);浮点和大整数装箱成 HeapNumber / BigInt。在 64 位 V8 上,默认开启指针压缩,SMI/Pointer 都是 32 位,Pointer 高 32 位由"isolate root"统一,所以低 32 位放得下大部分情况。这个细节决定了"传 number 比传 string 快得多"——string 永远是 HeapObject。

On 32-bit V8, SMI is a 31-bit signed integer (±2³⁰); floats and bigints are boxed as HeapNumber / BigInt. On 64-bit V8, pointer compression is on by default — SMI and Pointer both fit in 32 bits, with the high 32 derived from an "isolate root". This is why "passing a number is much faster than passing a string" — strings are always HeapObjects.
有三个直接的实战推论:

1. 1.5 + 2.5 会被装成 HeapNumber,走慢路径;1 + 2 全程 SMI,走快路径。同样的道理,Math.floor(x) 之后立刻参与运算,V8 知道结果是整数,可以保持 SMI。
2. 把 obj['x'] 改成 obj.x,把 switch ('mode') 改成 switch (MODE_ENUM)(整数枚举),V8 的检查路径会短一截。
3. [1, 2, 'three'] 会让 V8 把整个数组的 elements kind 升级到通用模式(PACKED_ELEMENTS),后续读写都得走 HeapObject 路径——而 [1, 2, 3] 全程是 PACKED_SMI_ELEMENTS,读写都是裸内存访问。

Tagged Pointer 不是知识点——它是你写每一行 JS 时,V8 在背后做的那个最小决定。
Three direct, actionable consequences:

1. 1.5 + 2.5 boxes into HeapNumber (slow path); 1 + 2 stays SMI throughout (fast path). Same for Math.floor(x) followed by arithmetic — V8 knows the result is an int and keeps it SMI.
2. Replace obj['x'] with obj.x, and switch on integer enums rather than string literals — V8's check path becomes shorter.
3. [1, 2, 'three'] escalates the whole array's elements kind to generic PACKED_ELEMENTS — all reads/writes go through HeapObject paths. [1, 2, 3] stays in PACKED_SMI_ELEMENTS, where reads are raw memory accesses.

Tagged Pointer isn't a curiosity. It's the minimal decision V8 makes behind every line of JS you write.
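You can watch consequence #3 happen with native syntax (a sketch; %DebugPrint's output format varies by V8 build and version — look for the elements-kind line):

```js
// node --allow-natives-syntax elements.js
const nums = [1, 2, 3];
%DebugPrint(nums);    // elements kind: PACKED_SMI_ELEMENTS — raw memory reads

nums.push('three');   // one string is enough to force the upgrade
%DebugPrint(nums);    // elements kind: PACKED_ELEMENTS — HeapObject path from now on
```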
V8 是怎么"猜对"的
how V8 manages to guess right
到这里出现了一个真正的悖论:JS 是动态类型的语言,V8 凭什么能把它编译成跟 C 一样紧凑的机器码?
答案是 V8 不"知道",而是猜。它边跑边收集类型反馈,根据反馈做大胆假设,然后基于假设生成快路径机器码——同时在机器码里埋下类型检查 checkpoint,一旦假设被打破就立刻抛弃机器码,退回字节码解释执行。
这就是 V8 性能的三件套:
Here's the real paradox: JS is dynamically typed, so how does V8 ever produce C-tight machine code?
Answer: V8 doesn't "know" — it guesses. It runs, watches the types your function actually sees, makes bold assumptions, and emits fast-path asm based on those assumptions — with type-check checkpoints inlined so it can throw the asm away the moment the guess fails.
The trio:
| # | 名字 Name | 在哪一层 Lives at | 在做什么 Does what |
|---|---|---|---|
| 1 | feedback | Ignition / Sparkplug 跑的时候 / while Ignition / Sparkplug runs | 观察"这个函数被调用时,参数是什么类型,对象是什么 shape",写到 FeedbackVector 里。/ Watches what types/shapes flow through each call site, writing to a FeedbackVector. |
| 2 | assumption | Maglev / TurboFan 编译的时候 / when Maglev / TurboFan compiles | 读 feedback 决定:"这次我假设两个参数都是 SMI",据此走快路径;feedback 越单态,假设越大胆。/ Reads feedback and decides: "I'll assume both args are SMIs". The more monomorphic the feedback, the bolder the bet. |
| 3 | checkpoint | 编译出来的机器码里 / in the emitted machine code | 每个假设都对应一行 testb/cmp 守卫——验证通过走快路径,失败立刻 deopt。/ Each assumption gets a testb/cmp guard inlined — pass → fast path; fail → deopt immediately. |
把它画成时序就是:
As a timeline:
三件套是看不见的——除非你打开 V8 自带的几个开关。这是分析慢 JS 的第一类工具:
The trio is invisible by default — until you flip V8's built-in switches. This is the first class of tool for analyzing slow JS:
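The switches in question, shown with node (which forwards them straight to V8):

```
node --trace-opt app.js              # log every tier-up decision (an assumption being made)
node --trace-deopt app.js            # log every deopt and its reason (an assumption broken)
node --trace-ic app.js               # log IC state changes per call site (feedback evolving)
node --allow-natives-syntax app.js   # unlock %DebugPrint / %GetOptimizationStatus and friends
```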
把这几个开关组合起来,你就能看见 V8 在背后做什么。下一章我们用真函数演示一遍——当假设被打破时,V8 是怎么 deopt 的。
Compose these switches and you can see what V8 is doing behind the scenes. The next chapter walks a real function through a deopt event.
一段实测:打破假设之后性能为什么不只是变慢,而是断崖
a real measurement of why broken assumptions don't just slow you down — they cliff-drop
把上一章讲的"checkpoint fail → deopt"放到 benchmark 里看一眼。下面这段代码 V8 在执行时会跑出三段截然不同的性能,差距高达 3-5 倍:
Let's actually measure the "checkpoint fail → deopt" event from the previous chapter. The code below runs in three distinct performance regions, with a 3–5× swing:
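The interactive benchmark isn't reproduced here; a minimal sketch of its structure — the L1/L3 labels match the discussion below, and the iteration counts are illustrative:

```js
function add(a, b) { return a + b; }

console.time('L1 mono SMI loop');          // region 1: tiers up, mono-SMI TurboFan fast path
for (let i = 0; i < 1e7; i++) add(i, 1);
console.timeEnd('L1 mono SMI loop');

add('a', 'b');                             // L2: one string call — checkpoint fails, deopt fires

console.time('L3 same loop after deopt');  // region 3: restarts in Ignition, feedback now polluted
for (let i = 0; i < 1e7; i++) add(i, 1);
console.timeEnd('L3 same loop after deopt');
```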
因为 V8 把 L1 时编出来的 TurboFan 机器码扔了。L3 的循环重新从 Ignition 开始跑——而 Ignition 是字节码解释器,本身就慢一个数量级。等到再跑足够多次,V8 才会重新编译,但这次的 feedback 已经"被污染"了:它知道 a/b 既可能是 number 也可能是 string,所以新版本的 assumption 退化成 any+any,生成的机器码必须额外多打一份类型分支——比第一次的 mono-SMI 版本臃肿得多。
这就是反优化的真正成本:编出来的机器码直接作废(编译费打了水漂);执行退回慢一个数量级的解释器;feedback 被污染后,重新编出来的版本也永远比原来臃肿。
Because V8 threw away the TurboFan machine code it had compiled in L1. L3's loop restarts in Ignition — the bytecode interpreter, an order of magnitude slower on its own. Eventually V8 re-compiles, but the feedback is now "polluted": it knows a/b can be either number or string, so the new assumption degrades to any+any, and the emitted asm has to carry extra type branches — fatter than the original mono-SMI version.
That's the real cost of deopt: the compiled machine code is thrown away (the compile time was wasted); execution falls back to an interpreter an order of magnitude slower; and once the feedback is polluted, even the recompiled version stays permanently fatter.
--trace-deopt 抓现场
--trace-deopt catches it

"reason: not a Smi" 这一行就是分析慢 JS 时最常见的元凶——它告诉你 哪一行 JS、第几个字节码偏移、为什么触发了反优化。后面 Phase IV 主线函数的优化过程里,我们会用这条日志一行行倒推问题。
"reason: not a Smi" is the single most common smoking gun when chasing slow JS — it pinpoints which line, which bytecode offset, which assumption blew up. In Phase IV's main-line we'll use this exact log to backtrack issues line by line.
某次 PR 在一个被每帧调用上千次的格子计算函数里加了 console.log(arg),arg 偶尔是 undefined。结果 Profiler 显示这个函数突然慢了 4 倍——不是 console.log 的开销,而是 undefined 这种类型让函数 deopt 了,从此跑在多态机器码上。把日志移到外层(只在 dev 模式生效)后,性能立刻回到原状。

A PR added console.log(arg) to a per-cell function called thousands of times per frame. arg was occasionally undefined. The profiler showed the function suddenly 4× slower — not because of the log itself, but because undefined deopted the function into polymorphic asm forever after. Hoisting the log to the outer scope (dev-only) restored performance instantly.
单态最快,多态次之,巨态退化成解释器
monomorphic flies, polymorphic crawls, megamorphic gives up
上一章看到一次 add('a','b') 让函数 deopt——但实际情况比这更细。每个调用点的 FeedbackVector 都有一个 状态机,会随着接收到的类型种类逐步退化:
The previous chapter showed one add('a','b') triggering a deopt — but the truth is finer-grained. Each call site's FeedbackVector entry runs a state machine that degrades step by step as more type variations come through:
- monomorphic:只见过 1 种类型 / shape——最快,快路径直接内联。
- polymorphic:见过 2–4 种——机器码里多打几条 cmp/jne 分支。
- megamorphic:超过 4 种——放弃内联缓存,退回通用 dispatch。

- monomorphic: one type/shape seen — fastest, the fast path is inlined directly.
- polymorphic: 2–4 seen — extra cmp/jne branches inlined.
- megamorphic: more than 4 — inline caching abandoned, generic dispatch.

这是 V8 工程上的权衡。每多打一条类型分支,生成的机器码就多几行 cmp/jne,体积变大、缓存压力变大。V8 团队跑过大量 benchmark,发现 4 种以下的多态分支还能跑得比解释器快,超过这个就得不偿失了——干脆退回通用 dispatch。
这意味着:4 是工业经验,不是物理常数。但你写代码时只需要记一个原则:
It's V8's engineering tradeoff. Each extra type branch adds a few cmp/jne lines to the asm — code grows, i-cache pressure grows. V8 benchmarked extensively and found that polymorphism up to 4 still beats the interpreter; beyond that, it's a net loss — fall back to generic dispatch.
So 4 is empirical, not physical. As an author you only need one rule:
让每个热点函数
都尽量是 monomorphic 的。 The single most useful V8 heuristic.
Make every hot function
as monomorphic as you can. The single most useful V8 heuristic.
用 %DebugPrint(fn),然后翻到 feedback_vector 那一段,会看到类似:
Use %DebugPrint(fn) and find the feedback_vector section. You'll see something like:
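An illustrative excerpt of that section (trimmed; real output is longer and version-dependent):

```
# %DebugPrint(add) → ...
 - feedback vector
   slot #0 BinaryOp  BinaryOp:SignedSmall   ← mono: both args have always been SMI
   slot #1 BinaryOp  BinaryOp:Any           ← degraded: this op has seen everything
```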
看到 BinaryOp::SignedSmall 就放心(SMI 单态),看到 BinaryOp::Any 就要警觉了——这个 slot 已经退到最差。这是 Phase IV 主线优化里反复用到的第一个诊断信号。
BinaryOp::SignedSmall means you're golden (SMI mono); BinaryOp::Any means the slot has degraded to the worst case. This is the first diagnostic signal we'll reach for repeatedly in Phase IV's main-line.
直接拆 V8 源码里的那个判断
reading the actual V8 source for the decision
"足够热"到底意味着什么?V8 把这件事写在了一个具体的函数里——v8::internal::TieringManager::ShouldOptimize。我们直接拆它:
What does "hot enough" actually mean? V8 codifies it in one function — v8::internal::TieringManager::ShouldOptimize. Let's read it:
- 函数名要通过 maglev_filter(调试时可用 --maglev-filter=name 限定范围)。
- 同理 TurboFan 有 turbo_filter(node --turbo-filter=name)。
- 省电模式下,efficiency_mode_delay_turbofan 配置可以延后启动 TurboFan。
- 字节码长度不能超过 max_optimized_bytecode_size,默认 60K bytecode 字节。这就是为什么后面 Phase IV 会把"函数拆解"排在最前面的几刀里——把超大函数拆小,让每一段都能进 TurboFan。

- The function name must pass maglev_filter (scope it while debugging with --maglev-filter=name).
- TurboFan has the matching turbo_filter (node --turbo-filter=name).
- In efficiency mode, efficiency_mode_delay_turbofan pushes tiering further out.
- Bytecode length must stay under max_optimized_bytecode_size, which defaults to 60K bytecode bytes. That's why Phase IV puts "function decomposition" among its very first cuts: break giant functions into small ones so each can be optimized.

max_optimized_bytecode_size 是性能优化里最容易踩的坑——一个 1000 行的处理函数,V8 会因为字节码太长直接放弃优化它,无论你跑多少次都没用。Phase IV 的"函数拆解"规则之所以排第二,就是为了把这种函数拆出 TurboFan 阈值之内。
The max_optimized_bytecode_size threshold is one of the easiest traps. A thousand-line handler can sit forever above it — V8 simply skips optimizing the function no matter how often it's called. That's why the "function decomposition" rule in Phase IV is non-negotiable.
假如你是当年 Google 的工程师
if you were Lars Bak in 2008
到这里我们已经讲完 V8 的编译流水线和假设系统。现在转入第三块,也是性能优化里最有趣的一块——对象内存模型。
用一个思想实验开场:假如你是 2008 年 Chrome V8 项目的工程师,任务是设计 JS 对象在内存里怎么布局,你会怎么做?先看 C 是怎么做的:
We've covered V8's compile pipeline and assumption system. Now into the third — and most rewarding — block: the object memory model.
A thought experiment: it's 2008, you're on the Chrome V8 team, and your job is to lay JS objects out in memory. How would you do it? First, here's how C does it:
静态语言的 struct 是一段连续线性内存。编译期就知道 x 在偏移 0、y 在偏移 4——属性访问就是 O(1) 的偏移寻址。但这有两个不可调和的前提:字段集合在编译期就完全固定,不能运行时增删;每个字段的类型和偏移也固定,不能运行时改变。
JS 全反过来——obj.foo = 42 可以在任何时刻给对象加属性,delete obj.foo 又可以随时拿走。所以你不能像 C 那样"一条 mov 指令搞定属性读取"。
A static struct is one contiguous block of memory. The compiler knows x is at offset 0, y at offset 4 — property access is an O(1) offset load. But that rests on two assumptions you can't break: the field set is fully fixed at compile time (nothing added or removed at runtime), and each field's type and offset are fixed too.
JS shatters both. obj.foo = 42 can graft on at any moment; delete obj.foo rips off at will. So you can't get away with "one mov per property read".
第一反应可能是:既然字段是动态的,那就存成 [key1, val1, key2, val2 ...] 这种"键值对数组"——每次读 obj.x 时遍历查找。
First instinct: if fields are dynamic, store them as [key1, val1, key2, val2 ...] and walk the array on every obj.x.
但有两个问题:

- 查找是 O(n)——对每个属性访问都要扫一遍 keys。
- 100 万个 {x, y},内存里就有 100 万份 "x" / "y" 字符串。

第二个问题尤其致命——典型 Web 应用里同样 shape 的对象动辄几万几百万,这是不能接受的浪费。
Two problems:

- Lookup is O(n) — scan keys on every access.
- A million {x, y} objects means a million copies of "x" and "y".

The second one's lethal — a real Web app has tens or hundreds of thousands of identically-shaped objects. That's unacceptable bloat.
V8 的设计是:每个 JSObject 有三类存储,加上一个指向 Hidden Class 的指针——这个 Hidden Class 才是"shape 的描述"。所有同 shape 对象共享一份。
V8 chose this: every JSObject has three storage areas plus a pointer to a Hidden Class — and the Hidden Class itself is the "shape description". All same-shape objects share one.
- *hiddenClass:指向 shape 描述。所有 {x, y, z} 对象都指向同一个 Hidden Class——key 名只存一份。下一章详细拆。
  Points to the shape descriptor. All {x, y, z} objects point at the same Hidden Class — keys are stored once. Next chapter dissects it.
- *properties:指向溢出的命名属性数组——放不进对象本体的命名属性落在这里(见 Ch11)。
  Points to the overflow array for named properties — the ones that don't fit in-object land here (see Ch11).
- *elements:指向数字下标元素的数组。arr[0] 这种数字下标的元素都在这里——下标访问是连续内存,极快。
  Points to an indexed-elements array. Numeric-indexed (arr[0]-style) values live here — contiguous memory, very fast.
- in-object 槽位:预留在 JSObject 本体里的槽位。访问最快——base + offset,跟 C 的 struct 一样!但要"预知 shape"才能用——这正是 Hidden Class + IC 配合的产物。
  Slots reserved inside the JSObject itself. Fastest access — base + offset, just like a C struct! But only when the shape is known, which is exactly what Hidden Class + IC give you.
这样一来,100 万个 {x, y, z} 对象只存 1 份 "x" / "y" / "z"(在共享的 Hidden Class 里)。但代价是:对象的 shape 一旦变化,Hidden Class 也得变。这就引入了 Phase III 的核心机制——Transition Chain(下一章)。

With this layout, a million {x, y, z}s share a single set of "x"/"y"/"z" strings (inside the shared Hidden Class). The price: change the shape, change the Hidden Class. Hence the central mechanism of Phase III — Transition Chain (next chapter).
同 shape 的对象共享一份描述
same-shape objects share one descriptor
"Hidden Class"是 V8 的术语,在 V8 源码里它的工程名是 Map(就是 %DebugPrint 里看到的那个 Map);Edge Chakra 叫 Types,JavaScriptCore 叫 Structure,SpiderMonkey 叫 Shapes。所有现代 JS 引擎都有同一个东西——只是名字不一样。
Hidden Class 内部最关键的子结构是 DescriptorArray——它记录"这个 shape 上有哪些 key、key 对应的 in-object 槽位下标是几"。下面用一个具体例子:
"Hidden Class" is V8's term — internally, the V8 source calls it a Map (yes, the same Map you see in %DebugPrint). Edge Chakra calls it Types, JavaScriptCore calls it Structure, SpiderMonkey calls it Shapes. Every modern JS engine has the exact same thing under different labels.
The most important sub-structure inside a Hidden Class is the DescriptorArray — it records "this shape has these keys, and each key corresponds to this in-object slot index". Concrete example:
const o = { x: 11, y: 22 } 在 V8 内部的真实样子。JSObject 本体只存值,key 名一律由共享的 Hidden Class 描述。如果再创建一个 { x: 33, y: 44 },它会指向同一个 Hidden Class——这正是性能优化的杠杆所在。
What const o = { x: 11, y: 22 } really looks like in V8. The JSObject itself only carries values; key names are shared via the Hidden Class. Another { x: 33, y: 44 } would point at the same Hidden Class — that's the lever.
关键性质:shape 完全相同的对象,共享同一个 Hidden Class 实例。
- const o1 = { x: 11, y: 22 } · Hidden Class A
- const o2 = { x: 33, y: 44 } · 同一个 Hidden Class A
- const o3 = { y: 11, x: 22 } · 不同的 Hidden Class B(顺序变了!)
- const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C(多了一个 key)

注意 o3 和 o1 的区别——属性赋值的顺序也是 shape 的一部分。这是 Phase IV 第 4 条改写规则的依据:"保持对象赋值顺序不变"。
The crucial property: objects of identical shape share the same Hidden Class instance.
- const o1 = { x: 11, y: 22 } · Hidden Class A
- const o2 = { x: 33, y: 44 } · same Hidden Class A
- const o3 = { y: 11, x: 22 } · different Hidden Class B (order changed!)
- const o4 = { x: 11, y: 22, z: 33 } · Hidden Class C (extra key)

Note o3 vs o1 — assignment order is part of the shape. This underlies rule #4 of Phase IV: "keep property assignment order stable".
跑一遍 %DebugPrint(obj),看输出里的 map: 0x... 字段。两个对象的 map 物理地址一样,就说明它们走的是同一个 Hidden Class——后面 IC 优化能命中同一份汇编。
Run %DebugPrint(obj) and look at the map: 0x... line. Same physical address = same Hidden Class = the same IC fast path will hit both.
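A quicker check that skips eyeballing addresses — %HaveSameMap is another real native-syntax helper (a sketch; requires --allow-natives-syntax):

```js
// node --allow-natives-syntax shapes.js
const o1 = { x: 11, y: 22 };
const o2 = { x: 33, y: 44 };
const o3 = { y: 11, x: 22 };        // same keys, different insertion order

console.log(%HaveSameMap(o1, o2));  // true  — one shared Hidden Class
console.log(%HaveSameMap(o1, o3));  // false — order is part of the shape
```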
上一章提到 V8 有两种存"命名属性"的位置:in-object(预留在 JSObject 本体里)和 *properties 数组(溢出存储)。Hidden Class 的 DescriptorArray 同时描述这两类——开发者眼里只是 obj.x,V8 内部却可能走两条路。
你可能会问:那什么时候走哪条?V8 默认给空对象预留 4 个 in-object 槽位(称为 Slack Tracking,见 Ch20 工具箱),前 4 个属性走 in-object,后面溢出到 *properties 数组。这是 Phase IV 第 5 条规则"class 字段加默认值"的根源——让对象一出生就立刻把 4 个槽位填满。
The previous chapter mentioned V8 has two places to store named properties: in-object (reserved inside the JSObject body) and the *properties array (overflow). The DescriptorArray in Hidden Class covers both — to you it's just obj.x, but V8 may take either path internally.
Which one? V8 reserves 4 in-object slots for an empty object (called Slack Tracking, see Ch20). The first 4 properties go in-object; later ones overflow into *properties. That's the foundation for Phase IV rule #5 ("declare class fields with defaults") — fill those slots immediately at construction.
看链表怎么一节一节长出来
watch the chain grow node by node
上一章说同 shape 共享 Hidden Class——但 shape 怎么变化?V8 的设计是把 Hidden Class 链成一条 transition chain:每给对象加一个属性,就追加一个 Hidden Class 节点。同样路径走过的对象,共用同一条链上的同一个节点。
下面从空对象开始,依次加 x、y、z,看链表怎么一节一节长出来:
The previous chapter said same-shape objects share a Hidden Class — but how does shape change? V8's answer: chain Hidden Classes into a transition chain. Each new property appends a node; objects that took the same path of insertions share the same chain node.
Click "+x", "+y", "+z" below to watch the chain grow:
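The growth written out statically (HC numbering is illustrative):

```js
const o = {};                 // HC0: ∅ (empty) — the root Hidden Class
o.x = 1;                      // HC0 ──x──▶ HC1 {x}        new node appended to the chain
o.y = 2;                      // HC1 ──y──▶ HC2 {x, y}
o.z = 3;                      // HC2 ──z──▶ HC3 {x, y, z}

const p = {};                 // a second object taking the SAME insertion path
p.x = 1; p.y = 2; p.z = 3;    // walks HC0 → HC1 → HC2 → HC3 — no new nodes created
```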
从链表结构能直接看出来:
The chain structure makes it obvious:
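The two code blocks in question, reconstructed as a sketch:

```js
// Block 1: fields written in one order…
const a = { x: 1 };
a.y = 2;                      // path: ∅ → {x} → {x, y}

// Block 2: …same fields, opposite order
const b = { y: 2 };
b.x = 1;                      // path: ∅ → {y} → {y, x}

// a and b end up with identical keys but sit on different transition-chain
// branches — %HaveSameMap(a, b) would print false.
```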
看似无害的两段代码,在 V8 眼里指向两个完全不同的 Hidden Class——所有把它们当参数的函数都会被推入 polymorphic。这是写性能敏感代码时最容易踩的隐形坑。
解决办法非常机械:初始化对象时就把所有字段一次性写齐,顺序固定。比如 React/Vue 这种框架内部维护对象池时,会刻意保证每个 component 实例的字段顺序一致——目的就是让所有 instance 共用一个 Hidden Class。
Two innocuous-looking blocks. In V8's eyes they point at two completely different Hidden Classes, and any function that takes either gets pushed into polymorphism. The most insidious trap in performance-critical code.
The fix is mechanical: initialize all fields up front, in a fixed order. React/Vue's internal instance pools deliberately preserve field order across components for exactly this reason — keep every instance on the same Hidden Class.
链表只能描述"路径相同"的情况。当两条路径在某一步分叉时,Hidden Class 会变成一棵带 transition 的树。比如:
A chain only handles same-path growth. When two paths diverge, the Hidden Class becomes a tree with transitions:
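For instance, the two paths above sketched as a tree (comment form, illustrative):

```js
// ∅ ──x──▶ {x} ──y──▶ {x, y}     ← path taken by `a`
// └──y──▶ {y} ──x──▶ {y, x}     ← path taken by `b` — a second branch
// One root, two branches: the transition *chain* has become a transition *tree*.
```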
从 O(n) 到 O(1) 的那把刀
the knife that cuts O(n) down to O(1)
到这里,Phase III 的所有铺垫都是为了讲清楚这一章。Inline Cache (IC) 是 V8 性能曲线最陡的那一段——它能让一个 obj.x 的访问从字符串查找的 O(n) 降到一条 mov 指令的 O(1)。差距能上百倍。
看一段实测:同一个"服务发现"函数,一种动态写法,一种静态写法,跑 10M 次:
Everything in Phase III leads here. Inline Cache (IC) is the steepest part of V8's performance curve — it can cut an obj.x access from O(n) string lookup down to a single mov. The gap is over 100×.
Real measurement, same "service discovery" function written two ways, 10M iterations:
对照两种写法:同样的逻辑、同样的 10M 次调用——动态 map[key] 跑了 6.4 s,静态 map.a 跑了 44 ms——差距 ~145 倍。这不是函数本身的差异,是 V8 能不能把它编进 IC 的差异。
Compare the two versions. Same logic, same 10M calls — dynamic map[key] takes 6.4 s, static map.a takes 44 ms. ~145×. Not the function's fault — it's whether V8 can fold the access into an IC.
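A condensed sketch of the two versions (an illustrative stand-in for the article's service-discovery function; the timings quoted above are the article's):

```js
const services = { a: 1, b: 2, c: 3 };

// dynamic: the key is only known at runtime — V8 hashes the string and
// walks the Hidden Class's key names on every call
function lookupDynamic(key) { return services[key]; }

// static: the key is fixed in the source — after one warm-up call the IC
// compiles this down to "compare map pointer, load fixed offset"
function lookupStatic() { return services.a; }
```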
当 V8 编译 map.a 这种静态 key 的属性访问时,它会在第一次调用时通过 Hidden Class 找到 "a" 对应的 in-object offset(比如 1),然后把这个数字直接写死在编译出来的汇编里:
When V8 compiles a property access with a static key like map.a, it follows the Hidden Class on first call to find "a"'s in-object offset (say, 1), then burns that number into the emitted asm:
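The shape of that emitted fast path, as pseudo-assembly (illustrative — not literal V8 output):

```
check:  cmp   [obj + #mapOffset], #expectedHiddenClass   ; same shape as last time?
        b.ne  ic_miss                                     ; no → slow path, re-learn
        ldr   x0,  [obj + #fieldOffset_a]                 ; yes → "a" at its burned-in offset
```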
这就是Inline Cache 的真身——它把"通过 key 找属性"这件事 inline 成了"先比一次 Hidden Class 物理地址,再读一个固定偏移"。这个名字也由此而来:把缓存 inline 到了汇编里。
对比一下动态 map[key]:V8 不知道 key 是什么字符串,只能在每次调用时:
- 把 key 字符串和 Hidden Class 里的所有 key 名字符串一一比较,找到匹配的那一项;
- 再按查到的下标取值。

哪怕 V8 给字符串比较做了内部 intern(同字符串复用同一个指针,只比指针),也比"一条 ldr"慢得多。这就是 145 倍差距的来源。
That's Inline Cache in the flesh — it inlines "look up by key" into "first compare a Hidden Class pointer, then load a fixed offset". The name comes from this: the cache is inlined into the asm itself.
Contrast dynamic map[key]: V8 doesn't know the value of key at compile time, so each call has to:
- compare the key string against every key name in the Hidden Class until one matches;
- then load the value at whatever index that turned up.

Even with V8's internal string interning (same string → same pointer; pointer-compare only), this is dramatically slower than one ldr. Hence the 145×.
静态写法,优于动态写法。
不是风格之争,是 145 倍 的差距。 Field Note · 03
Static beats dynamic.
Not a style preference — a 145× gap. Field Note · 03
注意 IC 也走第 8 章的状态机:第一次调用时未初始化 (uninitialized),第二次起进入 monomorphic,见过 2-4 个不同 Hidden Class 的对象进入 polymorphic,>4 个就 megamorphic 放弃缓存。所以"让对象保持同 shape"和"用静态 key"是同一件事的两面——IC 优化只在它们都满足时生效。
ICs follow the same state machine as Chapter 8: uninitialized → monomorphic (after first call) → polymorphic (2–4 different Hidden Classes) → megamorphic (>4, cache abandoned). "Same shape" and "static key" are two faces of the same thing — IC only kicks in when both are true.
delete 的代价——缓存技术最怕的就是失效
caching's worst enemy is invalidation
到目前为止,我们讲的都是Fast Properties——用 Hidden Class + IC 把属性访问压到一条 ldr。但有一种操作能把对象一脚踹出快路径,让它退化成Slow Properties——慢几十甚至几百倍。
这个操作就是 delete。
Everything so far has been Fast Properties — Hidden Class + IC compressing access into one ldr. There's one operation that kicks an object off the fast path entirely, demoting it to Slow Properties (dozens to hundreds of times slower).
That operation is delete.
因为 delete 一旦允许,会引爆一连串问题:
1. 删掉 o1.x 之后,剩下的 in-object 槽位怎么办?移动后面的属性补齐 → 其他对象指针就乱了;空着不填 → IC 缓存的偏移就错了。
2. shape 变了:x 没了别人却还指着,引用乱套;
3. 如果切换 Hidden Class,所有建立在旧 Hidden Class 上的 IC 都得失效。

三个问题都很难解。V8 选了最简单的放弃方案:被 delete 过的对象一律退化为 Slow Properties——把属性集中存到一个字典里(类似 Map<string, Value>),抛弃 in-object + IC 优化。
这个字典的访问要走哈希查找,比 IC 慢几十到一百倍。而且这个降级是不可逆的——一旦掉到 Slow,这个对象就回不去 Fast 了。
Allowing delete opens three nasty cans of worms:

1. After deleting o1.x, what happens to the freed in-object slot? Compact the later properties → other objects' pointers go stale; leave it empty → the offsets cached in ICs are wrong.
2. The shape changed: x is gone but others still point at it — references fall apart.
3. Switching the Hidden Class means invalidating every IC built on the old one.

All three are hard. V8 picked the simple give-up plan: any object touched by delete degrades to Slow Properties — store properties in a dictionary (like Map<string, Value>) and abandon in-object + IC optimization.
Dictionary access is hash lookup — dozens to a hundred times slower than IC. And the demotion is one-way — once Slow, always Slow.
某段代码循环结束后想"释放内存",对每个 cache 对象做了 delete obj.bigPayload。结果下一帧还在用这些对象做属性访问的函数全部 deopt——cache 对象悉数掉到 Slow Properties,整个模块慢了 4 倍。正确做法是 obj.bigPayload = null 或 obj.bigPayload = undefined——这样不改变 Hidden Class,GC 也能正常回收引用的内存。
Some code did delete obj.bigPayload on every cache object at end-of-loop to "free memory". Next frame, every function reading those objects' properties deopted. The whole module ran 4× slower. The fix: obj.bigPayload = null (or = undefined) — preserves Hidden Class while still letting GC reclaim the referenced memory.
性能敏感的代码里,能不用 delete 就不用 delete。要"清掉"一个属性,改成 obj.foo = null 或 obj.foo = undefined——前者明确表达"无值",后者保持兼容。Hidden Class 不会变,IC 不会失效,GC 会回收引用的内存。
In performance-critical code, avoid delete. To "clear" a property, set it to null or undefined. Hidden Class survives, ICs stay valid, and GC still reclaims the referenced memory.
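The safe pattern, in two lines:

```js
obj.bigPayload = null;     // shape unchanged, ICs stay valid, GC frees the payload
// delete obj.bigPayload;  // would demote obj to Slow Properties — irreversibly
```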
把前面 14 章的诊断工具一次性用上
putting all 14 chapters' diagnostic tools to work
到这里,前面 14 章已经给齐了所有的刀。这一章我们拿出主线那段函数,用刀解剖它。
函数本身一句话就能描述:把任意输入(数字 / 字符串 / 对象)归一化成 rem 数值。在我们的代码库里,它在每帧布局计算里被叫上百次,profiler 显示是个明显热点:
The previous 14 chapters were the knives. This chapter takes the main-line function and dissects it.
The function in one sentence: normalize any input (number / string / object) into a rem value. In our codebase it ran hundreds of times per layout frame; the profiler called it out as a hot spot:
| # | 症状 Symptom | 病因 Root cause | 章节 Ref |
|---|---|---|---|
| 1 | BinaryOp::Any | 参数类型混杂(number / string / object 都见过)→ polymorphic / args mix number, string, object → polymorphic | Ch8 |
| 2 | LoadIC::MEGAMORPHIC | input.value / input.unit 看到太多 shape → IC 放弃 / input.value / input.unit see too many shapes → IC gives up | Ch13 |
| 3 | reason: not a Smi | number 路径假设是 SMI,但浮点跑了 HeapNumber 路径,触发 deopt / number path expected SMI but a float (HeapNumber) deopted it | Ch5 |
| 4 | reason: wrong map | object 路径上多个 shape 的 {value, unit} 来回切 / multiple object shapes flowing through the object branch | Ch11 |
| 5 | 函数还很长 / function is also long | 三种输入塞在一个函数里 → bytecode 多 → 接近 max_optimized_bytecode_size 阈值 / three input paths in one function → bytecode bloat → near max_optimized_bytecode_size | Ch9 |
这五个病都源于一个共同的设计错误:用一个函数处理三种结构性不同的输入。从 V8 的角度,这等于强迫它对每个属性访问、每个二元运算都做"应付三种类型"的多态机器码——快路径根本没机会形成。
修法在下一章——把它拆成三个单态函数,然后顺着前面 14 章的刀一刀一刀切。
All five trace back to one design mistake: one function handling three structurally different inputs. From V8's view, you've forced it to emit polymorphic asm for every property access and every binary op — the fast path never gets to form.
The fix is next chapter — split into three monomorphic functions and apply the rest of the 14 knives.
每条规则点开看示例
click each rule to expand the example
下面这 12 条规则不是"风格指南"——是每一条都对应前面 14 章里某个具体机制的工程总结。我把它们按"应用次数"在主线 px2rem 上的频度排序——前几条是收益最大的几刀。
点每一条的标题展开看示例。
The following 12 rules aren't style preferences — each maps to a specific mechanism from the previous 14 chapters. I've ordered them by impact frequency on the main-line px2rem function — the top few cuts buy the most.
Click each rule's heading to expand its example.
主线 px2rem 同时接 number / string / object,V8 必须为每个 binop 都生成多态机器码 → 退化到 BinaryOp::Any。把它拆成 px2remNumber / px2remString / px2remObject 三个函数,在调用方分发——每个函数都可以是 monomorphic。
px2rem takes number / string / object — V8 must emit polymorphic asm for every binop, dropping to BinaryOp::Any. Split into three: px2remNumber / px2remString / px2remObject, dispatch at the call site — each function can be monomorphic.
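A sketch of that split (simplified; assumes the 16px root from earlier):

```js
function px2rem(input) {                    // the only polymorphic spot left: one dispatch
  switch (typeof input) {
    case 'number': return px2remNumber(input);
    case 'string': return px2remString(input);
    default:       return px2remObject(input);
  }
}
function px2remNumber(n) { return n / 16; }              // only ever sees numbers
function px2remString(s) { return parseFloat(s) / 16; }  // only ever sees strings
function px2remObject(o) { return o.value / 16; }        // only ever sees { value, unit }
```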
超过 max_optimized_bytecode_size(默认 60K bytecode 字节)V8 不会优化。即使没超,小函数还能享受 inline 展开——TurboFan 会把小被调函数 inline 进调用方,省一次 push/pop。
Functions over max_optimized_bytecode_size (60K bytecode bytes by default) skip optimization entirely. Even below the limit, small functions get inlined — TurboFan folds them into the caller, saving the push/pop.
TS 类型系统不是为了"装",它在工程上恰好替你保证了热点函数的单态性——只要类型签名是 (n: number) => number,你就基本不会不小心给它喂 string。
TS types aren't decoration. In practice they enforce the monomorphism of hot functions — a signature of (n: number) => number means you basically won't accidentally feed it a string.
{x:1, y:2} 和 {y:2, x:1} 在 V8 里是两个不同的 Hidden Class。在 factory 函数里,所有对象都按同一个顺序赋值——这样所有 instance 共享同一条 transition chain。
{x:1, y:2} and {y:2, x:1} are two different Hidden Classes. In factory functions, assign properties in a fixed order so every instance walks the same transition chain.
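In code:

```js
// one factory, one insertion order → every instance walks the same transition chain
function makePoint(x, y) { return { x, y }; }   // always x first, never { y, x }
```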
V8 给空对象预留 4 个 in-object 槽位(Slack Tracking)。如果你在 constructor 里"有时"才赋某个字段,会触发 Hidden Class 分叉。所有字段在 constructor 一次写齐(没值就 null/undefined),让所有实例走同一条链。
V8 reserves 4 in-object slots (Slack Tracking). If your constructor "sometimes" assigns a field, you fork the Hidden Class. Initialize every field in the constructor (use null/undefined if no value), keeping all instances on one chain.
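A sketch:

```js
class Cell {
  // every field declared with a default — all instances are born with the
  // same Hidden Class, and the in-object slots are claimed immediately
  value = 0;
  unit = null;
  dirty = false;
  cached = null;
}
```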
不用 delete
avoid delete

一次 delete obj.x 会把对象从 Fast Properties 一脚踹进 Slow Properties——所有 IC 失效,后续访问慢几十~百倍且不可逆。要"清掉"就 obj.x = null。
A single delete obj.x kicks an object from Fast to Slow Properties — invalidates every IC, slows access dozens to a hundred times, and is irreversible. To "clear" a property, use obj.x = null.
在生产 build 上加 --trace-deopt 跑一遍核心场景,看哪些函数 deopt——大多数是偶尔传 undefined 或者偶尔抛 try-catch。把这些"偶尔"消除就行。
Run your core scenarios with --trace-deopt in a prod build and find every deopting function. Most cases are occasional undefined or occasional try-catch throws. Remove the "occasionals".
这是 Ch13 的 145 倍跑分差距。在热点里把 obj[key] 改成 obj.knownKey,把 switch(string) 改成 switch(intEnum)——一刀切。
The 145× from Ch13. In hot paths, replace obj[key] with obj.knownKey and string switches with int-enum switches. One clean cut.
const o = {x: 1, y: 2} 比 const o = {}; o.x = 1; o.y = 2 更稳——前者一次性建好 Hidden Class,后者要走两次 transition。
const o = {x: 1, y: 2} is more reliable than const o = {}; o.x = 1; o.y = 2 — the literal builds the Hidden Class in one shot; the procedural form walks two transitions.
基于 逃逸分析(Escape Analysis):如果对象不逃出函数,V8 可以把它的字段全部展开成寄存器变量,根本不分配堆内存。这对 GC 也是免费收益。
Based on escape analysis: if an object never escapes its function, V8 can replace its fields with register variables and skip heap allocation entirely. Free GC win too.
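A sketch of the pattern escape analysis rewards:

```js
function dist(x1, y1, x2, y2) {
  const d = { dx: x2 - x1, dy: y2 - y1 };  // never leaves this function…
  return Math.hypot(d.dx, d.dy);           // …so V8 can turn dx/dy into registers
}
```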
SMI(整数)在 V8 里是立即数,不进堆;float 一律装箱成 HeapNumber,要分配 + GC + 间接寻址。能用 Math.floor / 整数 enum 就用,只在最终输出层做一次 / 100 转浮点。
SMIs (ints) live as immediate values; floats box into HeapNumber with allocation, GC, and indirection. Prefer Math.floor and integer enums; only divide-by-100 at the very last output step.
Ref<T> 之类的包装
Ref<T>-style wrappers

React/Vue 里 useRef(0) 把数字包成 { current: 0 } 对象——读写都得过一层 Hidden Class + IC。如果你需要在热点里高频读写一个数,直接用闭包 let 变量,比 ref 快好几倍。
React/Vue's useRef(0) wraps a number into { current: 0 } — every read/write hits a Hidden Class + IC. For high-frequency hot-path reads, a closure-captured let outperforms a ref by several times.
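Sketch of the difference:

```js
// wrapper: every tick pays a property load/store through { current } (Hidden Class + IC)
const frameRef = { current: 0 };
function tickRef() { frameRef.current++; }

// closure: the number lives in a captured slot — no shape, no IC on the hot path
let frame = 0;
function tick() { frame++; }
```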
把所有刀切下去之后
after all the cuts have landed
下面是按 12 条规则改完的版本。代码量更长了——但每个函数都是单态、字段顺序固定、没有 delete、没有动态 key:
The version after all twelve rules. The code is longer — but every function is monomorphic, field order is fixed, no delete, no dynamic keys:
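The project's exact v1 isn't reproduced here; a condensed sketch of its shape, composing rules 1, 4/5, 8, and 11 (names, the 16px root, and the int-enum values are illustrative):

```js
const ROOT_PX = 16;                           // module-level constant (rule 9)
const UNIT_PX = 0, UNIT_REM = 1;              // int enums instead of string units (rule 11)

function px2remNumber(n) { return n / ROOT_PX; }
function px2remString(s) { return parseFloat(s) / ROOT_PX; }
function px2remObject(o) {                    // o is always { value, unit }, in this order
  return o.unit === UNIT_REM ? o.value : o.value / ROOT_PX;
}
function px2rem(input) {                      // one dispatch at the boundary (rule 1)
  switch (typeof input) {
    case 'number': return px2remNumber(input);
    case 'string': return px2remString(input);
    default:       return px2remObject(input);
  }
}
```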
不是某一刀很神,而是每一刀都解决了一个具体的 V8 机制问题,所有的小提速复合起来。把它列成账本:
No single cut is magic. Each one solves one specific V8 mechanism problem, and the small wins compound. As a ledger:
| 刀 Cut | 解决的问题 Problem fixed | 单刀贡献 Per-cut win | 累计 Cumulative |
|---|---|---|---|
| v0 | 起点 / baseline | — | 243 ms |
| + R1 | 三个单态函数 → 退出 BinaryOp::Any / three mono fns → exit BinaryOp::Any | −50% | 122 ms |
| + R2/9 | 小函数被 inline + 提模块顶层常量 / small fns get inlined + top-level constants | −35% | 79 ms |
| + R4/5 | 所有 result/input 对象同 Hidden Class / all result/input objects share a Hidden Class | −39% | 48 ms |
| + R8/11 | 静态 key + 整数枚举 → IC 优化 / static keys + int enums → IC kicks in | −35% | 31 ms |
| + R10 | 逃逸分析,临时对象不上堆 / escape analysis, temp objects skip heap | −23% | 24 ms |
| v1 | 对比 v0 / vs v0 | 10.1× | 24 ms |
十倍提速不是魔法,
是十二刀切下去的累加。 Field Note · 03
A tenfold speedup isn't magic.
It's twelve cuts that compound. Field Note · 03
大部分情况能。但前提是你的瓶颈真的是 JS 执行——如果是 DOM 操作、合成层、网络、GC——那就是另外一座山(分别对应 chromium-renderer 那篇文章里的不同章节)。
检验方法很简单:打开 Chrome DevTools 的 Performance 面板,看你的热点函数占帧时间多少、是 JS 颜色还是别的颜色。如果是 JS 蓝色 + 占比超过 5%,这套方法论几乎一定有用。
Mostly yes. The precondition is your bottleneck is actually JS execution. If it's DOM, compositing, network, or GC — that's a different mountain (each covered in different chapters of the chromium-renderer piece).
Quick check: open Chrome DevTools' Performance panel and see your hot function's share of frame time and color. JS-blue + over 5% means this methodology will almost certainly help.
这套方法论是引擎无关的
this methodology is engine-agnostic
这篇文章一直在讲 V8——但前面 12 条规则跨引擎都成立。原因很简单:Hidden Class、Inline Cache、type feedback,这套设计是 1991 年 Self 语言研究里就有的——所有现代 JS 引擎都独立实现了一份。
This piece has been about V8, but those 12 rules are engine-agnostic. The reason: Hidden Class, Inline Cache, type feedback all trace back to 1991 Self research — every modern JS engine has independently implemented the same trio.
| 引擎 Engine | JIT 层级 JIT tiers | Hidden Class | IC | Type feedback |
|---|---|---|---|---|
| V8 · Chrome / Node | Ignition · Sparkplug · Maglev · TurboFan | Map | ✓ | FeedbackVector |
| JSCore · Safari | LLInt · Baseline · DFG · FTL (LLVM) | Structure | ✓ | ValueProfile |
| SpiderMonkey · Firefox | Interpreter · Baseline · Warp · Ion | Shape | ✓ | CacheIR |
| Hermes · RN | AOT bytecode (no JIT) | HiddenClass | ✓ | — (no JIT) |
JSCore(WebKit 的 JS 引擎,iOS / macOS Safari 用)有一个独门设计:它的峰值层 FTL 得名于 Fourth Tier LLVM——最初直接把 JS 编译进 LLVM IR,复用编 C++ / Swift 的整套 LLVM 优化;2016 年起 LLVM 被自研的 B3 后端取代,但"峰值层就是一台重量级静态编译器"的思路没变。
实战影响:在某些 benchmark 上,Safari 的 JSCore 比 Chrome 的 V8 还快——尤其是计算密集型 + 类型稳定的代码,FTL 的循环优化、向量化、内联策略都比 V8 的 TurboFan 更激进。
但跨引擎的高性能 JS 写法是同一套——前面那 12 条规则在 JSCore 上一字不差地适用。
JSCore (WebKit's engine, used in iOS/macOS Safari) has a unique design: its peak tier FTL is named for Fourth Tier LLVM — it originally compiled JS straight into LLVM IR and reused the full LLVM pass pipeline that ships C++/Swift; since 2016 LLVM has been replaced by the in-house B3 backend, but the idea of the peak tier as a heavyweight static compiler stands.
Real-world impact: on certain benchmarks Safari's JSCore beats V8 — especially on compute-heavy, type-stable code, where FTL's loop, vectorization, and inlining strategies are more aggressive than TurboFan's.
But fast-JS writing is the same trade across engines — the 12 rules apply word-for-word to JSCore.
在我的电脑上 (M1 MacBook Pro),v1 版 px2rem 跑 1M 次:Chrome (V8) 24 ms,Safari (JSCore) 17 ms。Safari 更快——因为 FTL 把 UNIT_MAP 那个查表完全展开成了直接比较。但跑得快的代码,在哪个浏览器上都跑得快——这才是这套方法论的真正价值。

On my M1 MacBook Pro, the v1 px2rem at 1M iterations: Chrome (V8) 24 ms, Safari (JSCore) 17 ms. Safari wins — FTL fully unrolled the UNIT_MAP lookup into direct compares. But fast code stays fast across browsers. That's the methodology's real value.
当 JS 已经不够快
when JavaScript isn't fast enough
把 px2rem 优化到 24 ms 已经是极限了——再快只能不写 JS。这就是 WebAssembly 的位置。
V8 内部其实有两条独立的流水线:JS 那条在 Ch3 讲过(Ignition→Sparkplug→Maglev→TurboFan);Wasm 有自己的两层——Liftoff(基线编译,毫秒内编完)和 TurboFan(峰值编译,Wasm 也复用了同一个后端)。两条流水线共享同一份机器码内存、同一份 GC、同一个 main thread——所以 Wasm 不是"另一种语言",而是 JS 性能曲线的另一种形状。
Optimizing px2rem to 24 ms is roughly the JS ceiling. Beyond that, you have to stop writing JS. That's WebAssembly's slot.
V8 actually runs two parallel pipelines: JS uses the four-tier covered in Ch3 (Ignition→Sparkplug→Maglev→TurboFan); Wasm has its own two-tier — Liftoff (baseline, compiles in milliseconds) and TurboFan (peak, shared backend). Both pipelines share the same machine-code memory, the same GC, the same main thread — so Wasm isn't "another language" so much as another shape of the JS performance curve.
| 场景 Scenario | 建议 Recommendation |
|---|---|
| 业务热点(布局、滚动、动画)/ UI hotspots (layout, scroll, animation) | 优化 JS 即可,基本能搞定 / JS optimization is enough |
| 媒体编解码 / 加解密 / 物理仿真 / media codecs, crypto, physics | Wasm 决定性更好(2-10×)/ Wasm wins decisively (2–10×) |
| 大规模数据处理(协同编辑、Excel 表)/ bulk data (collab editing, spreadsheets) | 视情况——多次 JS↔Wasm 边界开销可能吃掉收益 / it depends — JS↔Wasm boundary costs can swamp the win |
| DOM 操作 / DOM ops | Wasm 反而更慢(必须经 JS 桥)/ Wasm is slower here — must bridge through JS |
而且重要的是:Wasm 不是"用了就快"。一段写得不好的 Wasm(频繁的 boundary call、不友好的内存布局、没向量化)有时还不如同等逻辑的优化过的 JS。
所以这篇文章最后一句话还是:先用前面 12 条把 JS 优化到极限,再去考虑 Wasm——大部分业务场景里,JS 优化能解决 80% 的性能问题,而且不引入构建复杂度。
And critically: Wasm isn't "fast just by being Wasm". Poorly-written Wasm (frequent boundary calls, unfriendly memory layout, no vectorization) sometimes loses to equivalent optimized JS.
The last word of this piece, then: push JS to its limit with the 12 rules first; reach for Wasm second. In most business code, JS optimization solves 80% of perf without adding build complexity.
--allow-natives-syntax 全套实战
--allow-natives-syntax in practice

所有用到的命令、参数、native syntax,集中在这里
every command, flag, and native syntax used in this piece, in one place
这篇文章从头到尾用到的所有"怎么观察 V8 在干什么"工具,集中放在这里。建议把这一章存成 cheatsheet——下次遇到慢 JS 时直接抄。
Every "how to see what V8 is doing" tool used across this piece, in one place. Save this chapter as a cheatsheet — next time you face slow JS, copy-paste from here.
Chrome DevTools → Memory → Take heap snapshot,然后:

1. 按构造函数名搜索你的对象(比如 Object)。
2. 选中一个实例,展开它的内部字段,找到 map :: system / Map @0x...——这就是 Hidden Class 的物理地址。两个实例的地址相同,shape 就是稳定的。

这是排查"对象 shape 是否稳定"最直接的方法,比看 %DebugPrint 的 map 地址更直观。
Chrome DevTools → Memory → Take heap snapshot, then:

1. Search for your objects by constructor name (e.g. Object).
2. Select an instance and expand its internals to find map :: system / Map @0x... — that's the Hidden Class's physical address. Two instances with the same address have the same shape.

This is the most direct way to verify shape stability — easier than diffing %DebugPrint output.
这本文只覆盖了"JS 执行"这一类瓶颈。如果你的瓶颈在 DOM、布局、绘制、合成,可以读姊妹篇 字节码到像素的一生 — Chromium 渲染流水线全景;如果是合成卡顿,可以读 Jank & Stutter。
This piece only covers JS execution. If your bottleneck is DOM, layout, paint, or compositing, see the sister piece Bytecode to Pixels — Chromium's Rendering Pipeline. For compositor jank specifically, see Jank & Stutter.
从 243ms 到 24ms,
十倍提速不是魔法,是十二刀切下去的累加。
每一刀都对应 V8 的一个机制——让假设保持稳定,它就一直跑在最优版本上。
From 243ms to 24ms,
a tenfold speedup is not magic — it's twelve cuts that compound.
Each cut maps to a V8 mechanism. Keep its assumptions stable, and your code stays on the peak.