A stream of bytes from the network has to cross thirteen stages, three processes and four property trees before it can light a single pixel. This is a field map of Chromium's rendering pipeline.
AUTHOR: Airing · TOPIC: Blink / cc / Viz / GPU · FORMAT: Long Read
Rendering pipeline · 13 stages
To users, a "browser" is a single product. Open the chest cavity and you see a set of replaceable parts. Before walking through Chromium's pipeline, fix two formulas in your head — they are the skeleton the rest of this story hangs on.
A quiet fact climbs out of the table: apart from Firefox and the long-buried IE, the mainstream browser world has converged on either Blink + V8 or WebKit + JavaScriptCore. It was a silent annexation.
The famed «browser war»
ended with only two engines still on the track.
Field Note · 02
WHAT IS A RENDERING ENGINE · Parses HTML / CSS / JS and draws the page. Firefox's Gecko alone bundles a dozen workgroups: a document parser, layout engine, style system, JS runtime, image library, networking (Necko), platform graphics adapters, a font library, a security library (NSS)… A rendering engine is never one thing — it is a factory.
In 2001 Apple lifted WebKit out of KDE's KHTML. Seven years later Google lifted its first Chromium engine out of WebKit. Five years after that, Google forked again — this time into Blink. The line is still alive today.
Plot the 22 years as a sequence of forks and you get the family tree below — a single picture of an engine being split, inherited and renamed.
FIG 02 · The WebKit family tree: solid lines for inheritance, dashed lines for forks. Blink today still carries vast tracts of Apple-and-WebKit ghost code.
Three moments of divergence
01
2001 · Apple lifts KHTML
A browser for Mac OS X. Safari 1.0 ships with KHTML rewritten into WebKit.
02
2008 · Google ships Chromium
Chromium ships with WebKit, but is born inside a multi-process architecture — the body type that would determine the fork to come.
03
2013 · Google forks Blink
Not about adding features — about losing weight. Blink's first big change deletes 8,000 files and 450,000 lines of code.
COMPATIBILITY · In Web Platform Tests' Interop reports, Blink sits consistently in the leading tier. Read it like this: the browser war split into two camps, but «the web» itself remains a shared standard — Blink's way of saluting WebKit, while quietly overtaking it.
The rendering engine decides what a page looks like; the JS engine decides what a page does. The two are neighbours — the JS engine usually runs as a module inside the rendering engine, yet stays independent enough to be lifted into Node.js, embedded firmware or IoT.
The names worth knowing:
GOOGLE
V8
C++ · with JIT, leaves the rest behind · Chromium / Node.js / Android WebView
APPLE
JavaScriptCore
A system-level API exposed to iOS apps (JIT disabled in the app sandbox)
MOZILLA
SpiderMonkey
One of the oldest JS engines · the heart of Firefox
FACEBOOK
Hermes
Built for RN · loads bytecode directly · no JIT, but excellent cold-start TTI
Place these engines side by side and you see the trade-off everyone is balancing: V8 trades double-digit megabytes of runtime for top performance; QuickJS trades performance for 210 KB and embeddability. Hermes walks the middle line — «fast cold-start, no JIT».
WHY MOBILE TURNS JIT OFF · JIT warm-up is long, so cold-start regresses; JIT also bloats binary size and memory. And when the system sandbox bans dynamically generated executable memory (as for iOS apps), JIT is simply impossible. That is the root reason JavaScriptCore inside iOS apps runs without JIT.
"JIT" inside V8 is not one thing — it's four tiers. As a function heats up, V8 promotes it through progressively more expensive but faster implementations:
FIG 03 · V8's four-tier JIT pipeline. A function is promoted as it heats up: Ignition for the first few runs, then Sparkplug, then Maglev, then TurboFan; once a type assumption breaks, the whole frame is deopted back to Ignition. Maglev is the middle tier added in Chrome 117 (2023), filling the "cold to hot" staircase out from 2 steps to 4 — the old Sparkplug → TurboFan jump was too steep.
Key design: compilation runs on background threads (--concurrent-recompilation, on by default). The Main thread only runs Ignition (zero compile overhead); background threads spot the hot functions and compile them, then atomically swap a dispatch-table pointer to the new version — the next call lands in Sparkplug/Maglev/TurboFan with no stop-the-world. This is how V8 "gets faster while running" without stalling.
Hidden classes + Inline caches · making a dynamic language "nearly static"
JS is dynamic — an object's shape can change at any moment. Yet in memory, V8 quietly assigns each "property set" a HiddenClass (Map), so property access can fetch by offset like a C++ struct. Combined with Inline Caches (IC), a single obj.x access can skip any dictionary lookup:
HIDDEN CLASS · A WORKED EXAMPLE · v8/src/objects/map.h
This is why "initialise properties in consistent order" is V8's golden rule. React/Vue's createElement, class field initialisation, the order of assignments in a constructor — all of these directly affect IC hit rate. Each step down the IC ladder costs an order of magnitude: Monomorphic ≫ Polymorphic ≫ Megamorphic.
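The golden rule is visible in ordinary application code. A sketch (the factory names are invented for illustration; hidden classes themselves are not observable from plain JS, so the example only contrasts the two initialisation patterns):

```javascript
// Two factories producing the "same" object. makeConsistent always
// initialises x before y, so every object shares one hidden class and
// a `p.x` access site stays monomorphic. makeShifty flips the order,
// so call sites that see both shapes degrade to polymorphic ICs.
function makeConsistent(x, y) {
  return { x: x, y: y };           // shape: {x, y} every time
}

function makeShifty(x, y, flip) {
  const o = {};
  if (flip) { o.y = y; o.x = x; }  // shape: {y, x}
  else      { o.x = x; o.y = y; }  // shape: {x, y} — a different Map in V8
  return o;
}

// A hot access site: with makeConsistent objects, the IC here
// only ever sees one shape.
function sumX(points) {
  let s = 0;
  for (const p of points) s += p.x;
  return s;
}

const pts = Array.from({ length: 4 }, (_, i) => makeConsistent(i, i * 2));
console.log(sumX(pts)); // → 6
```

If you run Node with --allow-natives-syntax, the %HaveSameMap(a, b) intrinsic can confirm the two factories really produce different shapes; it is a debugging-only API.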
Orinoco · V8's generational GC
V8's heap is split by age: a Young Gen (new objects, ~1-8 MB) and an Old Gen (long-lived objects, tens to hundreds of MB). The vast majority of objects die young (unreferenced within milliseconds of allocation) and aren't worth an expensive mark-sweep — so the Young Gen uses the Scavenger, a semi-space copying GC that scans only live objects; dead objects need no "collection" — they simply vanish.
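The semi-space idea fits in a few lines. A toy Cheney-style scavenge (purely illustrative and nothing like V8's real heap layout; all names invented): copy whatever the roots reach, and never even visit the rest:

```javascript
// Toy semi-space scavenge: copy only objects reachable from the
// roots into to-space; garbage is never touched at all.
function scavenge(roots) {
  const toSpace = [];                    // the survivor copies
  const forwarded = new Map();           // from-space obj -> its copy

  function copy(obj) {
    if (obj === null || typeof obj !== "object") return obj;
    if (forwarded.has(obj)) return forwarded.get(obj); // already moved
    const clone = { ...obj };
    forwarded.set(obj, clone);           // acts as a forwarding pointer
    toSpace.push(clone);
    return clone;
  }

  const newRoots = roots.map(copy);
  // Cheney scan: walk to-space linearly, copying whatever it references.
  for (let i = 0; i < toSpace.length; i++) {
    for (const k of Object.keys(toSpace[i])) {
      toSpace[i][k] = copy(toSpace[i][k]);
    }
  }
  return { newRoots, survivors: toSpace.length };
}

// 3 objects allocated, only 2 reachable — the dead one costs nothing.
const live = { name: "kept", child: { name: "also kept" } };
const dead = { name: "garbage" };        // no root points here
console.log(scavenge([live]).survivors); // → 2
```

Note the cost model: work is proportional to the number of survivors, not to the number of allocations, which is exactly why a nursery full of short-lived objects is cheap.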
What this means for web devs: creating heaps of short-lived objects (function return values, Array.map, JSX re-renders) is fine — the Scavenger sweeps them in ~1 ms. The real killer is objects that should have died but stayed accidentally referenced — they get promoted to Old Gen, and the Major GC that eventually collects them marks the whole heap, possibly stalling Main for tens of milliseconds. A memory leak in V8 doesn't show up as OOM — it shows up as "jank every few seconds": that's the Major GC stealing the Main thread.
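The retention pattern behind most of those leaks is mundane. A sketch (the cache names are invented) of the classic accidental root, next to the WeakMap variant that lets entries die with their keys:

```javascript
// An "accidental root": a module-level cache that outlives its entries.
// Everything keyed in here survives every Scavenge, gets promoted to
// Old Gen, and stays alive until explicitly deleted.
const leakyCache = new Map();

function rememberLeaky(key, payload) {
  leakyCache.set(key, payload);   // key AND payload are now pinned
}

// WeakMap keys are not counted as references: when the key object
// dies, the entry becomes collectable — no manual delete needed.
const safeCache = new WeakMap();

function rememberSafe(key, payload) {
  safeCache.set(key, payload);
}

let widget = { id: 1 };
rememberLeaky(widget, new Array(1000).fill(0));
rememberSafe(widget, new Array(1000).fill(0));

widget = null;                // our last strong reference is gone, but…
console.log(leakyCache.size); // → 1   (the Map still pins key + payload)
```

The WeakMap entry, by contrast, is now unreachable and free to be collected on the next cycle.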
V8 and the rendering pipeline · the "neighbour" relationship
V8 is not a "stage" of the rendering pipeline, but it squats inside every stage — JS handlers fill the gaps left by Style/Layout/Paint on the Main thread. A typical timeline for one 16.7 ms frame:
MAIN THREAD · 16.7 MS · WHO RUNS WHEN · v8 ↔ blink ↔ cc
 0 ms ─▶ vsync · the Compositor sends Main a BeginMainFrame
 0 ms ─▶ V8: event callbacks (click / keypress / setTimeout)
 2 ms ─▶ V8: microtask queue (Promise.then)
 3 ms ─▶ Blink: Style + Layout + Pre-paint + Paint
 6 ms ─▶ cc: Commit · Main blocked ~1 ms
 7 ms ─▶ V8: requestAnimationFrame callbacks (animation / last DOM writes before render)
 9 ms ─▶ V8: requestIdleCallback / scheduler.postTask low-priority tasks
14 ms ─▶ V8: idle · waiting for the next vsync
// V8 owns three windows: [0,3) [7,9) [9,14)
// one 50 ms JS long task → blocks 3 frames → INP > 200 ms
A 16.7 ms frame budget already reserves ~6 ms for Style/Layout/Paint, leaving ~10 ms for JS. A single JS task over 50 ms spans 3 frames — the browser's PerformanceObserver flags it as a Long Task, and Web Vitals charges that input's INP with the actual input-to-next-paint time, instantly 200 ms+. scheduler.yield() and scheduler.postTask({ priority }) exist precisely for this — to actively slice long tasks.
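Slicing is straightforward to sketch. A hedged example (processInChunks is an invented helper): run the loop in sub-5 ms slices and yield between them, preferring scheduler.yield() where the runtime exposes it:

```javascript
// Yield to the event loop between slices so input/rAF callbacks can
// run. scheduler.yield() (Chromium) keeps continuation priority;
// elsewhere we fall back to an ordinary macrotask.
const yieldToBrowser =
  typeof scheduler !== "undefined" && scheduler.yield
    ? () => scheduler.yield()
    : () => new Promise((resolve) => setTimeout(resolve, 0));

async function processInChunks(items, workFn, budgetMs = 5) {
  const results = [];
  let sliceStart = Date.now();
  for (const item of items) {
    results.push(workFn(item));
    if (Date.now() - sliceStart >= budgetMs) {
      await yieldToBrowser();   // let rendering/input run between slices
      sliceStart = Date.now();
    }
  }
  return results;
}

processInChunks([1, 2, 3, 4], (n) => n * n).then((r) =>
  console.log(r)                // → [1, 4, 9, 16]
);
```

The trade-off: total wall-clock time goes up slightly, but no single Main-thread turn exceeds the budget, so frames keep landing between slices.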
V8 vs JSCore vs Hermes · three design philosophies
Three engines take three roads: V8 pushes "peak speed in long sessions" to the limit; JSCore uses an LLInt interpreter (executable in the sandbox) to compensate for iOS's no-JIT rule; Hermes moves parse + bytecode-gen to build time — the APK ships bytecode directly, the app skips parsing on launch. There is no "best" engine, only the "best fit for this scenario".
2024+ · Maglev on by default + Sparkplug eager compilation: Chrome 117+ ships Maglev by default, lifting "warm function" performance from 5× to 30×. Chrome 121 added Compile Hints (Magic-Bytecode annotations) — sites can tell V8 via an HTTP header "take these scripts straight to Sparkplug, don't wait", trimming cold-start JS by another ~30%.
V8 doesn't just run JS — it also runs WebAssembly (Wasm). These two pipelines share the V8 process but virtually nothing else: different bytecode formats, different compilers, different heaps, different optimisation philosophies. Wasm carries types (i32/i64/f32/f64); no IC feedback needed. No GC (linear memory is manually managed), so the entire Orinoco machinery is absent on the Wasm side.
Liftoff's "streaming" is its sharpest trick: as Wasm bytes download from the network, V8 compiles concurrently — the decoder fires on the first byte, Liftoff compiles each function the moment its boundary arrives, and the main entry point can be ready before the byte stream finishes. A 10 MB Wasm bundle compiles end-to-end in ~50 ms on a GHz-class CPU; an equivalent amount of JS would need V8's full parse → bytecode → optimisation walk, easily hundreds of ms.
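The "types are in the bytes" point shows up in even the smallest module. Below is the canonical hand-assembled Wasm module exporting add(i32, i32) → i32 (byte layout per the Wasm 1.0 binary format); note the 0x7f bytes declaring i32 right in the type section, which is what lets a baseline compiler emit code per function with no profiling:

```javascript
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                               // func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, 1 body
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0/1, i32.add, end
]);

// Synchronous compile + instantiate — fine for tiny modules; real apps
// should prefer WebAssembly.instantiateStreaming to overlap compilation
// with the download, which is exactly the Liftoff streaming path.
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
console.log(instance.exports.add(2, 3)); // → 5
```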
Wasm and the rendering pipeline · who owns Main?
Wasm runs on V8 → V8 runs on the Render process's Main thread → by default Wasm, JS and Style/Layout/Paint all share that single Main thread and block each other. But Wasm can do what JS cannot:
True multi-threading · via SharedArrayBuffer + Atomics + Web Workers, Wasm gets genuine shared-memory parallelism: Main triggers the work, Workers run the Wasm compute. Worker threads aren't the Main thread, so Wasm compute runs truly parallel with rendering on Main — something JS alone can't match (JS Workers can't touch the DOM directly; Wasm Workers do pure compute and never needed the DOM anyway).
SIMD · Wasm's 128-bit SIMD (v128) is explicit vectorisation. One SIMD add handles 4 float32 or 2 float64 — perfect for image processing, ML inference, crypto. JS has no SIMD (the SIMD.js proposal died years ago).
Predictable performance · no GC, no deopt → Wasm functions take nearly the same time on every call. Decisive for real-time audio/video (WebRTC codecs, AudioWorklet) — JS occasionally stalls for 50ms, Wasm never.
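The shared-memory primitive underneath all of this is small. A sketch of SharedArrayBuffer + Atomics (run here on a single thread for brevity; in a browser the buffer would be postMessage'd to a Worker, and only when crossOriginIsolated is true):

```javascript
// Shared memory + atomics: the substrate under Wasm threads.
const sab = new SharedArrayBuffer(4);   // one i32 slot
const counter = new Int32Array(sab);

// Atomic read-modify-write: safe even if a Worker increments the same
// slot concurrently — no torn reads, no lost updates.
for (let i = 0; i < 1000; i++) Atomics.add(counter, 0, 1);

console.log(Atomics.load(counter, 0)); // → 1000

// Atomics.wait / Atomics.notify add futex-style blocking on top of
// this — the Web-platform primitive that Wasm's atomic wait lowers to.
```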
Figma compiles its C++ rendering engine into Wasm; its DOM holds a single canvas; all graphics, layout and font work is computed in Wasm. Photoshop Web (an Adobe + Chrome team collaboration) does the same. Google Earth runs its 3D terrain in Wasm. For these apps, most CPU work is not on the Main thread — Wasm runs on Worker threads, and Main only pastes the result into the canvas (the cc::TextureLayer path).
2024+ · Wasm GC + JSPI: Wasm now has native GC (--experimental-wasm-gc, on by default since Chrome 119), letting Java/Kotlin/Dart compile to Wasm without bundling their own GC. JSPI (JavaScript Promise Integration) lets Wasm "suspend + await a Promise + resume" — synchronous-looking code running on JS's async machinery, fully bridging Wasm into the Web's async ecosystem.
Boot Chromium and you don't get a process — you get a city. A capital (Browser), a few suburbs (Render), an airport (Viz), some factories (Utility / Plugin). The map is what makes "one crashing tab won't take down the browser" possible.
Three districts are relevant to rendering: Browser · Render · Viz. The thread roster of each:
Imagine three tabs: foo.com / bar.com / baz.com. Inside foo.com, two iframes point at foo.com/other-url and bar.com. The cross-site iframe spawns an additional Render Process.
Note that the bar.com iframe in Tab 1 and Tab 2's bar.com share the same render process (same-site reuse), but live in a different process from Tab 1's foo.com because they're cross-site. Site Isolation moved the "render island" boundary from per-tab to per-site.
WHAT VIZ ACTUALLY DOES · Viz is the compositing-and-display service hosted in the GPU process (in current Chromium the "Viz process" and the "GPU process" are the same process; Viz is the service it hosts). It accepts the viz::CompositorFrame (CF) that each Render process and the Browser process produce, merges them with SurfaceAggregator, and uses the GPU to put the result in the window. Whatever you see on screen — Viz wrote it.
A short history of Site Isolation · how Spectre rewrote the process model
"One Render process per site" sounds like a founding design choice — it isn't. Before 2018, Chromium's process model was per-tab (one process per tab; cross-origin iframes shared their parent's process). It was a three-way compromise between performance / memory / security — per-tab gave enough "islands", iframe sharing saved a process per embed. Then January 2018 happened.
FIG 04 · A short history of Site Isolation. The 2018 Spectre disclosure was the watershed — attackers could leak memory at any address in the same process via branch-prediction side channels in JS, meaning a cross-origin iframe in the same process was no longer safe. The Chromium team spent 4 months and hundreds of bug fixes redrawing the renderer-process boundary from tab to site (eTLD+1); Chrome 67 turned it on by default, at a cost of 10-13% more memory.
Why Spectre made "same-process iframe" unsafe
Modern CPUs use branch prediction to speculatively execute likely-needed instructions — wrong predictions get rolled back. Spectre's insight: even rolled-back speculation leaves cache traces that can be probed. Carefully constructed JS branches can trick the CPU into speculatively reading any memory address in the same process, then recover the byte value via cache-hit timing. The Same-Origin Policy says "you can't read a cross-origin iframe's DOM" — Spectre says "I'll just read its physical memory location directly".
Site Isolation's fix is brutally simple: put cross-origin iframes in different processes. You can't read cross-origin content from your process — the bytes aren't in your address space. The cost: every embedded cross-origin iframe (ads, social buttons, third-party widgets) adds a Render Process; per-tab memory grew 10-13%. This is why, post-2018, a page that embeds a third-party widget such as Baidu Analytics carries an extra Render process.
SharedArrayBuffer (SAB) is the linchpin of JS multi-threading, but post-Spectre every browser killed it overnight — SAB itself provides high-precision timing (atomic counters), exactly what Spectre's side channel needs. Two years later, the COOP/COEP header pair resurrected it.
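The resurrection takes the form of two response headers on the top-level document, plus an opt-in on every cross-origin subresource (require-corp blocks anything unmarked):

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

# and on each cross-origin subresource the page embeds:
Cross-Origin-Resource-Policy: cross-origin
```

With both document headers in place, window.crossOriginIsolated reports true and SharedArrayBuffer is exposed again.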
The practical consequence: Figma / Photoshop Web / any Wasm-multi-threaded app must ship the full header set. Without them, SAB doesn't work, and Wasm threads become a hollow shell. This is why crossOriginIsolated is the 2024 admission ticket for high-performance Web apps.
2024+ · Origin-Agent-Cluster · isolation one notch finer: since Chrome 88, a page can send the Origin-Agent-Cluster: ?1 header to request "isolate me even from other same-site (eTLD+1) origins". By default a.example.com and b.example.com share a process (same site); with this header they split. Site Isolation is the default; Origin Isolation is the opt-in for high-security scenarios.
FURTHER READING
Official design docs · "Site Isolation"
If you want to watch a major refactor land in close-up, the Chromium team published the full Site Isolation design + retrospective at chromium.org/Home/chromium-security/site-isolation: threat model, every cross-process boundary (document.domain, window.opener, clipboard events), performance numbers — all in the open. Pair it with Charlie Reis' USENIX Security 2019 paper «Site Isolation: Process Separation for Web Sites within the Browser» — that's a textbook view of how this kind of "major surgery" actually lands.
The six thread segments: Network → Main → Compositor → Raster → Compositor → Skia
The rendering pipeline is a chain that turns network bytes into pixels. Chromium cuts the chain into thirteen stages — sliced across three processes, owned by three modules, run by six thread segments.
The master diagram below is the map for every chapter that follows. Four layers, all at once:
The master map answers "who works where", but not a deeper question: how long does each artifact live, and what is reusable across frames? The figure below charts the lifelines of 11 core data structures across the 14 stages, colour-coded by cacheability — green = cached across frames (cheap), yellow = partially cached (fragile), red = born fresh every frame (expensive). After reading it you can work backward: why is mutating a transform cheap? Because only the red-zone LayerImpl properties refresh — everything else stays green.
FIG 05B · Data-structure lifelines · X-axis: 14 stages; Y-axis: 11 core data structures. Left-leaning artifacts are mostly green (DOM / ComputedStyle / Property Trees / DisplayItemList / SharedImage all persist across frames); right-tail artifacts are mostly red (CF / Quad / Aggregated CF are born fresh each frame). This is the central design idea of the pipeline — push "what must not be recomputed" as far left, and as green, as possible.
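Read through the lifeline chart, the standard compositor-friendly hover pattern looks like this (a CSS sketch; the .follow class name is borrowed from the running example later in the article):

```
/* Compositor-friendly: transform/opacity mutate only the red-zone
   per-frame state; Property Trees, the DisplayItemList and the
   rastered tiles all stay green (cached). */
.follow {
  will-change: transform;            /* pre-promote to its own layer */
  transition: transform 150ms ease;
}
.follow:hover {
  transform: translateY(-2px);       /* Compositor thread only */
  /* a background-color change here would re-enter Paint on Main */
}
```

The same mutation expressed as top/left instead of transform would dirty Layout and repaint, dragging the whole left half of the chart back to red.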
A keyword for each stage
#    Stage      In → Out (keyword)
01   Parsing    bytes → DOM Tree
02   Style      DOM Tree → Render Tree (with ComputedStyle)
03   Layout     Render Tree → Layout Tree (with geometry)
04   Pre-paint  Layout Tree → Property Trees
Each stage exists not for elegance —
but to shrink the surface of what has to be recomputed when something changes.
Field Note · 02
MAIN-LINE EXAMPLE
The running example — one business card's journey
the card we'll watch through every stage
An abstract pipeline always slips out of memory — 3 processes, 6 thread segments, 13 stages, and an hour later you can't recall a single number. So from the next chapter onward, each chapter opens with a "Main-line · The Card after this stage" block, tracking what happens to one business card at every step.
This is that card — Airing's business card:
FIG 05.5 · The running example. Twenty lines of HTML+CSS, yet all 13 stages, 4 property trees and 3 processes can be told through this single card. Hover over the Follow button — the transform half of that hover animation runs without ever touching the Main thread.
On :hover the card mutates both transform and background — the Display-stage finale: transform stays on the Compositor, background drags Main back in. Real code rarely produces a 100% Compositor-pure animation.
Stage-by-stage transformation map
#    Stage        The card after this stage
00   Network      HTML bytes arrive + airing.png fired early by the PreloadScanner
01   Parsing      11 tokens → DOM stack 4 deep → a 6-node DOM tree
02   Style        5 rules split across 3 RuleMaps; a ComputedStyle attached to each node
03   Layout       LayoutNGFlexibleBox (card 340×88) finishes its two passes
04   Pre-paint    .follow gets a Transform tree node; the Effect tree gains 2 nodes (shadow + gradient)
05   Paint        a ~12-entry DisplayItemList; Save/ClipRRect/Restore wrap the avatar
06   Commit       two cc::Layers: the main one + a dedicated one for .follow
07   Compositing  .follow is promoted to its own GraphicsLayer thanks to will-change
··   Raster       each tile plays back its slice of the DisplayItemList; the avatar goes via the ImageDecodeCache
10   Activate     Pending Tree → Active Tree; all tiles ready
11   Draw         the main layer emits 2 TileDrawQuads; .follow emits 1; the shadow triggers a separate RenderPass
12   Aggregate    (variant) if embedded as an OOPIF, the parent references it via SurfaceDrawQuad
13   Display      SwapBuffers to screen. On hover, transform stays on the Compositor while background drags Main back in
HOW TO READ THE TABLE · This table is an index. After finishing the article you should be able to use it in reverse — see any card and predict what happens to it at every stage. If a row stops making sense, jump back to that chapter's "Main-line · The Card after this stage" block.
STAGE 00 · NETWORK
Loading — what happens before the first byte
network thread, mojo IPC, the preload scanner head-start
Module
network_service
Process
Browser
Thread
Network ×N
Output
bytes → Renderer
What it does
The Browser process's NetworkService streams HTML bytes to the Render process's blink::DocumentLoader via Mojo IPC. Meanwhile, the in-Render HTMLPreloadScanner races ahead of the main parser, spots the <img> / <link rel="stylesheet"> URLs and asks the Browser for a second batch of resources — sub-resources are on the wire before the main HTML is even fully parsed.
Why count it as a stage
The original counts 13 stages starting from Parsing and folds Loading into the Browser-process box. But 80% of the pipeline's P50 latency lives here — until the first byte arrives, the other 13 stages cannot even start. This chapter promotes Loading to a first-class stage.
From DNS lookup to the first byte arriving at the Renderer, the chain crosses Browser-process NetworkService → Mojo IPC → Render-process ResourceFetcher. The figure below unpacks it:
FIG 00 · The real topology of Loading. The Browser process's NetworkService owns all the low-level connections, cookies and cache; the Render process receives bytes via a Mojo DataPipe and, in the opposite direction, fires new sub-resource requests. The main parser blocks, the PreloadScanner races ahead — this is the secret of Chromium's cold-start speed.
Main-line · The Card after this stage
STAGE 00 · Network stage
Two URLLoaders in flight, side by side
The home HTML byte stream (a few KB) flows from URLLoader · main into blink::DocumentLoader::DataReceived. The moment HTMLPreloadScanner spots <img class="avatar" src="airing.png"> in an unblocked window, it calls ResourceFetcher::PreloadStarted back at the Browser process to fetch the avatar PNG. The avatar's GET request leaves before the main HTML even finishes downloading.
Early Chromium baked the network stack into the Browser process. Chrome 73 split it out as NetworkService — either in-process or as a standalone utility process. The split is not just engineering hygiene: a separate process means cookies and credentials live in their own sandbox. A pwned Render process can never read raw cookies — it can only ask Mojo for "the bytes of this URL", and NetworkService attaches the cookies on its behalf.
2024+ · Pre-connect + speculative prefetch: Chromium now uses chrome.predictors heuristics on link hover to pre-warm DNS / TCP / TLS, sometimes prefetching the HTML itself — all before Stage 0, in what is effectively "Stage -1". Combined with <link rel="modulepreload"> and the Speculation Rules API, perceived cold-start keeps shrinking.
DEVTOOLS
Network panel · check whether the Initiator column says (preload)
3 things to diagnose: ① Look in the Initiator column for the (preload) tag — every critical resource should carry it (the PreloadScanner's head-start succeeded); if it shows (parser) instead, the resource was discovered by the main parser, one beat late. ② A long queued segment (here analytics.js queued for 55% of its time) means the browser's per-origin concurrency cap (6 connections) blocked it and critical resources are stuck behind low-priority JS — fix with fetchpriority="high". ③ The TLS segment reveals whether HTTP/2 multiplexing kicked in (same-origin requests should share one connection); if not, every request redoes the handshake.
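Fixes ① and ② from that checklist map to a pair of attributes (an HTML sketch; airing.png and analytics.js are the example's resource names):

```
<!-- Make the head-start explicit instead of relying on the scanner: -->
<link rel="preload" as="image" href="airing.png" fetchpriority="high">

<!-- Demote the script that was clogging the 6-connection lane: -->
<script src="analytics.js" fetchpriority="low" defer></script>
```

After this, the Network panel should show the image's Initiator as (preload) and the analytics request sorted behind the critical path.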
URL · from the address bar / a link click / an API
→
OUTPUT
bytes · pushed to the Render process
STAGE 01 · DOC PHASE
Parsing — bytes to a DOM tree
bytes → characters → tokens → DOM
Module
blink
Process
Render
Thread
Main
Output
DOM Tree
What it does
Take the bytes coming out of the network thread and twist them, stage by stage, into a DOM tree hanging off blink::TreeScope.
Why five sub-stages
Every input has a distinct "shape" — bytes are a network stream, characters depend on encoding, tokens are a W3C standard, Element is a Blink data structure. Splitting them lets each layer stream incrementally and be reused — the same tokenizer feeds the Preload Scanner to fire requests early.
Parsing is the Main thread's opening act: take the bytes the Browser process's network thread hands over, and turn them into a living DOM tree. The data flow splits into five hand-offs:
Loading
bytes
Conversion
characters
Tokenizing
W3C tokens
Lexing
Element
DOM Build
DOM Tree
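The middle hand-off can be sketched in miniature. A toy tokenizer (nothing like Blink's spec-defined state machine, which has ~80 states and handles attributes, comments, CDATA and error recovery; this covers only bare well-formed tags):

```javascript
// characters in → StartTag / EndTag / Character tokens out.
function* tokenize(html) {
  let i = 0;
  while (i < html.length) {
    if (html[i] === "<") {
      const close = html.indexOf(">", i);
      const body = html.slice(i + 1, close);
      if (body.startsWith("/")) yield { type: "EndTag", name: body.slice(1) };
      else yield { type: "StartTag", name: body };
      i = close + 1;
    } else {
      const next = html.indexOf("<", i);
      const end = next === -1 ? html.length : next;
      yield { type: "Character", data: html.slice(i, end) };
      i = end;
    }
  }
}

const tokens = [...tokenize("<h2>Airing</h2>")];
console.log(tokens.map((t) => t.type).join(","));
// → StartTag,Character,EndTag
```

Because it is a generator, tokens stream out as characters arrive — the same property that lets Blink feed one tokenizer to both the tree builder and the Preload Scanner.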
STAGE 01 · Main-line · The Card after Parsing
11 tokens, a stack 4 deep, a 6-node DOM tree
Fed to the tokenizer, the card's source emits 11 tokens, and the DOM construction stack peaks 4 deep (article → div.info → one of h2 / p / a). When the stack empties, this DOM tree is left behind:
Notice: ComputedStyle is not yet attached (that's Style's job), the img isn't decoded (Raster's job), and the button has no idea it's about to be promoted into its own layer (Compositing's job). This bare DOM is the seed for every downstream stage.
PARSE ↔ FETCH ↔ EXEC · Hit <link> / <script> / <img> mid-tokenizing and the parser fires new network requests; hit <script> and the JavaScript must finish executing before parsing resumes — because document.write() may rewrite the DOM that follows. The «parse-and-wait» tax is one of the steepest costs of HTML parsing.
Read it bottom-up: the network thread hands in AppendBytes(char*), DecodedDataDocumentParser decodes the bytes into a String, and the result lands at the tokenizer's Append(String&) entry. Decoding follows the page's declared encoding (UTF-8 / GBK / ISO-8859-1) — get that wrong and everything downstream is wrong.
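The platform exposes the same Conversion primitive to JS as TextDecoder, which makes the "get the encoding wrong" failure easy to see. A minimal sketch:

```javascript
// Raw bytes + a declared encoding → a string. These six bytes are the
// UTF-8 encoding of the two CJK characters meaning "render".
const raw = new Uint8Array([0xe6, 0xb8, 0xb2, 0xe6, 0x9f, 0x93]);

const utf8 = new TextDecoder("utf-8").decode(raw);
console.log(utf8); // → "渲染"

// The same bytes through the wrong decoder: mojibake, not an error —
// every later stage (tokens, DOM, text rendering) inherits the damage.
const latin1 = new TextDecoder("iso-8859-1").decode(raw);
console.log(utf8 === latin1); // → false
```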
The key entrypoint: HTMLConstructionSite::CreateElement. Internally, a stack tracks currently-open Elements — HTML5's implicit close rules (a <div> appearing inside <p> auto-closes the <p>) are implemented through this stack:
HTML CONSTRUCTION SITE · STACK OPS · html_construction_site.h
HTML5 rule: <p> may contain only phrasing content; encountering a block element like <div> forces the <p> to close first. The stack is silently mutated — when the StartTag-div arrives, the constructor pops <p> before pushing <div>. The result: what you wrote as <p><div>...</div></p> is in fact three sibling nodes in the DOM — <p></p> + <div></div> + an implicit empty <p></p>.
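That pop-before-push is easy to sketch. A toy construction site (a drastic simplification of HTMLConstructionSite; real Blink tracks insertion modes, active formatting elements and much more, and also emits the trailing empty <p>) showing the implicit close:

```javascript
// Start-tags push, end-tags pop, and the one HTML5 rule from the text —
// a block element inside <p> force-closes the <p> — is a pop before
// the push.
const BLOCK = new Set(["div", "section", "article", "table"]);

function buildDom(tokens) {
  const root = { name: "#root", children: [] };
  const stack = [root];                      // stack of open elements
  const top = () => stack[stack.length - 1];

  for (const t of tokens) {
    if (t.type === "StartTag") {
      if (BLOCK.has(t.name) && top().name === "p") stack.pop(); // implicit close
      const el = { name: t.name, children: [] };
      top().children.push(el);
      stack.push(el);
    } else if (t.type === "EndTag") {
      // pop to the most recent matching open tag ("error recovery lite")
      while (stack.length > 1 && top().name !== t.name) stack.pop();
      if (stack.length > 1) stack.pop();
    }
  }
  return root;
}

// <p><div></div></p> — the div is NOT a child of the p:
const tree = buildDom([
  { type: "StartTag", name: "p" },
  { type: "StartTag", name: "div" },
  { type: "EndTag", name: "div" },
  { type: "EndTag", name: "p" },
]);
console.log(tree.children.map((c) => c.name)); // → [ 'p', 'div' ]
```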
A stack is what builds the tree
Lexing turns tokens into Element instances. "DOM construction" then walks a stack — start-tags push, end-tags pop. When the stack empties, the tree is finished.
栈 · 实时
Stack · live
DOM 树 · 增量
DOM tree · incremental
输入:<div><p><div></div></p><span></span></div>
Input: <div><p><div></div></p><span></span></div>
为什么用栈,不用链表? · 嵌套天然 LIFO,栈是规范要求的数据结构Why a stack, not a linked list? · nesting is naturally LIFO; the stack is the spec-required data structure
这是个很好的问题——4 个原因,从最直接到最深入:
A great question — four reasons, from the most immediate to the deepest:
HTML nesting is naturally LIFO. When the parser sees </div>, it needs the "most recent unclosed open tag with that name" — exactly the top of the stack. Stack pop is O(1); a linked list has to traverse to find the latest match, O(n). A 100KB page does thousands of push/pop pairs — O(1) vs O(n) is an order-of-magnitude gap.
HTML5 规范本身就用栈描述。W3C/WHATWG 的 HTML5 解析算法里有两个明文叫做"stack of open elements"和"list of active formatting elements"的数据结构。最复杂的那段——adoption agency algorithm(处理 <b><p>X</b>Y</p> 这种交叉嵌套的"错误恢复")——直接按栈的术语写规范。换成链表,你不仅要重写所有规范引文,逻辑也表达不出来了。
The HTML5 spec itself describes it as a stack. W3C/WHATWG's HTML parsing algorithm explicitly uses two data structures named "stack of open elements" and "list of active formatting elements". The hairiest piece — the adoption agency algorithm (error recovery for crossed nestings like <b><p>X</b>Y</p>) — is written directly in stack terminology. Switching to a linked list would force you to rewrite every spec quote, and the logic would no longer express itself.
Cache-friendly. Stacks typically live in a contiguous array (Blink's HTMLElementStack is internally a Vector<HTMLStackItem*>). One 64-byte cache line holds 8 pointers; push/pop runs in L1. Linked-list nodes scatter across the heap, so each next-pointer chase risks a cache miss — 5-10× slower in practice.
Stack depth = nesting depth, a free semantic index. Many HTML5 spec rules dispatch on "nesting depth": foster parenting inside <table>, banning <p> from nesting block elements, the implicit close of <option> inside <select>… Stack .size() answers in O(1). A linked list would need a separately maintained depth counter, with its own sync cost.
Flip the question: "When would a linked list be a better fit?" Answer: scenarios needing arbitrary mid-list insert/delete. HTML parsing rarely needs that (only the adoption agency algorithm's fragment-tree rearrangement, and even that's cheap on top of a stack). So the real answer to "why a stack, not a linked list?" is: the question is inverted — HTML nesting is a stack; a linked list would be the counter-intuitive choice.
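The push / implicit-close / pop mechanics above can be sketched in a few lines. This is a toy model under two assumptions only (a block `<div>` implicitly closes an open `<p>`; a stray `</p>` synthesises an empty paragraph); the names like `ToyConstructionSite` are invented, not Blink's:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the stack of open elements. Only the two rules discussed
// here are implemented: a block <div> implicitly closes an open <p>, and
// a stray </p> synthesises an empty <p></p>. Names are invented.
struct ToyConstructionSite {
    std::vector<std::string> open_elements;  // the stack
    std::vector<std::string> top_level;      // completed top-level siblings

    void StartTag(const std::string& tag) {
        if (tag == "div" && !open_elements.empty() && open_elements.back() == "p")
            EndTag("p");                     // implicit close before the push
        open_elements.push_back(tag);
    }
    void EndTag(const std::string& tag) {
        if (tag == "p" && (open_elements.empty() || open_elements.back() != "p")) {
            top_level.push_back("p");        // stray </p>: empty paragraph
            return;
        }
        if (open_elements.empty()) return;   // toy error recovery: ignore
        std::string top = open_elements.back();
        open_elements.pop_back();
        if (open_elements.empty())
            top_level.push_back(top);
    }
};
```

Feeding the StartTag/EndTag sequence of `<p><div></div></p>` through it leaves the three siblings p · div · p from the example above, with the stack empty at the end.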
Chrome DevTools Performance can show Parsing as a flame graph — but only the JS-side call stack. The C++ internals are a black box to it.
想看到 HTMLDocumentParser::AppendBytes → ... → HTMLConstructionSite::CreateElement 这一整条 C++ 栈,就必须用 Perfetto 录制——它不仅能拉出 C++ 调用栈,还能告诉你这个调用属于哪个线程,跨进程通信还会自动连线"发出端 → 接收端"的两个函数调用。
To see the full C++ stack HTMLDocumentParser::AppendBytes → ... → HTMLConstructionSite::CreateElement, you need Perfetto traces — they expose C++ stacks, tag each call with its thread, and even draw cross-process IPC as "sender → receiver" arcs.
"Tokenizing" sounds simple but is in fact a sprawling state machine spelled out in the W3C HTML5 spec — 80+ states, hundreds of transitions. HTMLTokenizer::NextToken is one giant switch that reads a character based on the current state and either emits a token or switches state. The most common edges:
HTML_TOKENIZER · STATE TRANSITIONSthird_party/blink/renderer/core/html/parser/html_tokenizer.cc
The hard problem this machine solves is error recovery. HTML5 spec describes "how to fix errors" with 24 "insertion modes" + a stack-based "original insertion mode" rewind — for instance, a <span> appearing inside <table> is mandated to be "foster-parented out of the table". That's why every browser parses bad HTML identically — they all follow this same spec.
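To make the state-machine shape concrete, here is a toy slice covering only the DataState → TagOpenState → TagNameState edges, emitting a tag name at '>'. The real machine has 80+ states and full error recovery; this sketch is not Blink's HTMLTokenizer:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy slice of the HTML5 tokenizer: three states, one emit rule.
enum class State { Data, TagOpen, TagName };

std::vector<std::string> Tokenize(const std::string& input) {
    std::vector<std::string> tags;
    State state = State::Data;
    std::string name;
    for (char c : input) {
        switch (state) {
            case State::Data:
                if (c == '<') state = State::TagOpen;  // Data -> TagOpen
                break;
            case State::TagOpen:
                name.assign(1, c);                     // first name char
                state = State::TagName;
                break;
            case State::TagName:
                if (c == '>') {                        // emit, back to Data
                    tags.push_back(name);
                    state = State::Data;
                } else {
                    name += c;
                }
                break;
        }
    }
    return tags;
}
```

The "one giant switch that reads a character based on the current state" structure is exactly this, scaled up by two orders of magnitude.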
Parsing's real "go-faster" trick lives in HTMLPreloadScanner. When the main parser is blocked on a <script> (waiting for JS to run), a second lightweight tokenizer continues scanning ahead on a side thread. The moment it sees <link rel="stylesheet"> / <img src> / <script src> it fires the network request early. By the time the main parser unblocks, the bytes are on the wire — sometimes already arrived.
This is what makes "HTML parsing" and "resource download" effectively parallel — and the real reason Chromium's cold-start is 30-50% faster than a "naive single-threaded parser". Those (Preload)-tagged requests you see in DevTools' Network panel? All fired by PreloadScanner ahead of time.
To see the Tokenizer state machine flip in real-time, the fastest path is Chromium's tracing: in chrome://tracing, enable the blink.parser category and reload — you'll see a time-aligned "state trace" with a colour block for every tag open/close. Here's roughly how it looks:
看 Main · Tokenizer 那条轨——每个色块是一次 token 触发,蓝/橙/绿对应不同 tag 类型;红色 <script> 期间 Tokenizer 完全冻结(主 parser 停了 6ms 等 V8 跑完);但同一时刻下面 PreloadScanner 那条还在偷偷扫,提前发了 app.js / avatar.png 的请求——上面 Network 轨里那两个棕条就是抢跑出去的。"parse-and-wait 的真实代价"在这张图里一目了然。Watch the Main · Tokenizer lane — each block is one token, blue / orange / green map to different tag types. During the red <script> the Tokenizer freezes (main parser stalls 6 ms while V8 runs); but at the same time the PreloadScanner below keeps scanning and fires app.js / avatar.png early — the two brown bars on the Network lane are those head-start requests. The real cost of parse-and-wait is right here in one picture.
DEVTOOLS
Performance > "Parse HTML" 段;Memory > Heap snapshot 看 DOM 节点数Performance > "Parse HTML" segment; Memory > Heap snapshot for DOM node count
bytes来自 Browser Process 网络线程from Browser network thread
→
OUTPUT
DOM Tree · blink::TreeScope
STAGE 02 · DOC PHASE
Style — CSS 是从右到左读的
Style — CSS is read right-to-left
CSSOM 与反向匹配
CSSOM and right-to-left selectors
Module
blink
Process
Render
Thread
Main
Output
Render Tree
这一步在做什么
What it does
遍历 DOM Tree,每个节点跑一遍"哪些 CSS 命中我",把命中的样式合并 + 继承 + UA 默认值,最后挂一个 ComputedStyle——这就是 Render Tree。Walk the DOM tree. For each node, find which CSS rules match, then merge + inherit + UA-default them. Attach a ComputedStyle to the node — that's the Render Tree.
为什么不能跳过
Why not skip
CSS 是 render-blocking。一棵无样式的 DOM 渲染上屏,下一帧 CSS 一到又得整页重排——等是更便宜的。所以浏览器宁可白屏也要等 CSSOM。CSS is render-blocking. Rendering an unstyled DOM only to re-lay out the whole page the moment CSS arrives costs more than waiting; a blank screen is cheaper. So the browser waits for the CSSOM.
The Style Engine walks the DOM, matches against the CSSOM and attaches a ComputedStyle to every node. The output: a Render Tree. Core: Document::UpdateStyleAndLayout.
Three sub-stages: CSS load → CSS parse → CSS compute. Two counter-intuitive facts here decide the entire performance shape — selectors are read right-to-left and RuleMap shards by selector type.
STAGE 02主线 · The Card 在样式后Main-line · The Card after Style
5 条规则进 3 张 RuleMap,每节点挂 ComputedStyle
5 rules into 3 RuleMaps, ComputedStyle on every node
每个 DOM 节点跑一遍"右往左反向匹配"——比如 article.card 上的".card 命中我吗"是 1 跳 hash 命中;给它合并 + 继承所有命中的属性,再套上 UA 默认值,挂出 ComputedStyle。最后一步 article.card 的 ComputedStyle 长这样:
Every DOM node runs the right-to-left match — for instance, article.card asks "does .card match me?" with a single hash hit. Then merge + inherit + UA defaults, and attach a ComputedStyle. After all that, article.card's ComputedStyle reads roughly:
// ComputedStyle · article.card
display : flex // from .card
flex-direction : row // flex default
align-items : center // from .card
gap : 14px
width : 340px
padding : 18px 20px
border-radius : 14px
background : linear-gradient(...) // triggers Effect tree
box-shadow : 0 6px 20px rgba(...) // triggers Effect tree
font-family : -apple-system, ... // inherited from body
color : rgb(21,23,28) // inherited from body
关键产物: 6 节点的 DOM 树各挂一个 ComputedStyle。从这一刻起,样式不再是字符串——它是 RGBA32、Length、EFlexDirection 这些紧凑的 C++ 类型。下游所有阶段都按这套结构化数据干活,没人需要再看 CSS 字符串。
The output: the 6-node DOM tree, each carrying a ComputedStyle. From this moment on style is no longer strings — it's RGBA32, Length, EFlexDirection, all compact C++ types. Every downstream stage operates on this structured data; no one needs to look at the CSS source again.
CSS 加载 · 真实日志
CSS load · the real log
在 Blink 里加桩打印解析过程,可以看到 HTML 解析与 CSS 解析是交错进行的。当 HTML 解析到 readystatechange = Interactive 之后,CSSParserImpl 才开始把外联样式表解析为 StyleRule:
Instrument Blink and print as it parses — you can see HTML and CSS parsing interleave. Only after HTML reaches readystatechange = Interactive does CSSParserImpl start turning the external stylesheet into StyleRules:
First stop in CSS parsing: characters → tokens. The tokenizer emits flavours like the ones below — FunctionToken (blue), HashToken (copper) and DelimToken (purple) being the hot ones:
Blink stores colours as RGBA32 — one 32-bit int — via CSSColor::Create. #hex goes through HashToken's direct path: bitwise pack straight into RGBA32. rgb() is a FunctionToken: parse arg list, range-check, then pack. Same white, more hops.
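The two paths can be sketched as follows. `MakeRGBA` stands in for where the slow rgb() path lands after argument parsing, `ParseHex` for the hex fast path; both names are invented, not CSSColor's real API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of RGBA32 packing: one 32-bit int in 0xAARRGGBB layout.
using RGBA32 = uint32_t;

// Where rgb(r, g, b) ends up after its arguments are parsed and checked.
RGBA32 MakeRGBA(int r, int g, int b, int a = 255) {
    return (uint32_t(a) << 24) | (uint32_t(r) << 16) |
           (uint32_t(g) << 8) | uint32_t(b);
}

// Fast path: six hex digits, one stoul, one OR; no function-call machinery.
RGBA32 ParseHex(const std::string& rrggbb) {
    uint32_t v = static_cast<uint32_t>(std::stoul(rrggbb, nullptr, 16));
    return 0xFF000000u | v;  // opaque alpha
}
```

Both produce the identical packed int; the difference is purely how many hops it takes to get there.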
15% 这个数字真的有意义吗? · 微基准 vs 实际页面收益Is the 15% really meaningful? · micro-bench vs real-page payoff
Bottom line: for nearly every business page, the 15% is invisible. CSS parsing runs once at first load; a stylesheet of thousands of rules takes 5-15ms total, so 15% means ~2ms — noise inside cold-start (which starts at hundreds of ms). "Convert all rgb() to hex" is textbook over-optimisation.
那为什么这个数字还值得记? 因为它是一扇窗户——透过这 15% 你能看到 V8/Blink 这种 C++ 系统的性能哲学:
Then why is the number worth knowing? Because it's a window — through that 15% you see the performance philosophy of C++ systems like V8/Blink:
"fast path + slow path" is a Blink/V8 staple. Hex is the fast path (pure bitwise); rgb() is the slow path (function-parsing subsystem). The CSS parser, JS engine, and layout engine all use this structure — optimise the common case to nanoseconds, let the rare case take its microseconds.
"function calls are themselves performance events" — one C++ virtual call costs ~5-10ns, plus arg validation ~50ns. A row of color: rgb(...) adds 50ns; thousands of rows add tens of microseconds. The "drop in the bucket" calculus only matters on per-frame hot paths — CSS parsing isn't one, so the math doesn't move the needle.
真正每帧都跑的颜色路径在 paint/raster:每个 DisplayItem 的 fill color 解析、每个 tile 的像素采样,这些才是 RGBA32 优化的真正受益者。CSS 解析的 15% 只是同一套数据结构在 cold path 上的副产品。
The colour path that actually runs every frame is in paint/raster: every DisplayItem's fill color, every tile's pixel sampling — these are the real beneficiaries of RGBA32. The CSS parsing 15% is just the same data structure showing up on the cold path.
So: don't rewrite existing CSS for that 15%; do understand that "fast/slow path bifurcation" is a pattern the entire browser engine uses — the real optimisation targets are the "1000 times per frame" paths in later pipeline stages (C9/C10/C14 are the actual battleground).
Parsing .text .hello and #world, Blink emits the structure below. The relation field — that's the pointer right-to-left matching follows: start at .hello, walk Descendant edges up to .text.
selector text = ".text .hello"
  value = "hello" · matchType = "Class" · relation = "Descendant"
  ↓ tag history
  selector text = ".text"
    value = "text" · matchType = "Class" · relation = "SubSelector"

selector text = "#world"
  value = "world" · matchType = "Id" · relation = "SubSelector"
The UA default stylesheets ship as bundled resources — the full list lives in blink/public/blink_resources.grd. This is why "your button style overrides the UA's without !important" — your CSS comes last, winning on declaration order.
The browser blocks rendering until it has both
the DOM and the CSSOM.
Render-blocking CSS · MDN
为什么?因为没样式的 DOM 是无意义的。一棵裸树渲染上屏,下一帧又要因为 CSS 进来重排——还不如等。这是 CSS 一直被叫 "render-blocking" 的根源。
Because rendering a bare DOM is meaningless. Drawing it to the screen and re-laying it out the moment CSS arrives is more expensive than waiting. That is the root reason CSS is called "render-blocking".
StyleRules are not piled into one big array — they're sharded by the type of each rule's rightmost simple selector into five maps. Matching only consults the relevant bucket, collapsing O(N) into O(N / k).
Map · #id
id_rules_
#world #header
Map · .class
class_rules_
.text .btn-primary
Map · [attr]
attr_rules_
[data-state="open"]
Map · tag
tag_rules_
div, span, p…
Map · ::pseudo
ua_shadow_…
::before ::placeholder
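The sharding idea as a toy sketch — key single-token selectors into the three hottest buckets, so a lookup is a few hash hits instead of a scan over all N rules. Hugely simplified from Blink's RuleSet; all names are invented:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy RuleSet: rules are bucketed by their rightmost simple selector.
struct ToyRuleSet {
    std::map<std::string, std::vector<std::string>> id_rules_;
    std::map<std::string, std::vector<std::string>> class_rules_;
    std::map<std::string, std::vector<std::string>> tag_rules_;

    void Add(const std::string& selector) {
        if (selector[0] == '#')
            id_rules_[selector.substr(1)].push_back(selector);
        else if (selector[0] == '.')
            class_rules_[selector.substr(1)].push_back(selector);
        else
            tag_rules_[selector].push_back(selector);
    }

    // Candidate rules for an element: consult only the buckets its own
    // id / class / tag can possibly hit.
    std::vector<std::string> CandidatesFor(const std::string& tag,
                                           const std::string& id,
                                           const std::string& cls) {
        std::vector<std::string> out;
        auto take = [&out](std::map<std::string, std::vector<std::string>>& m,
                           const std::string& key) {
            auto it = m.find(key);
            if (it != m.end())
                out.insert(out.end(), it->second.begin(), it->second.end());
        };
        take(id_rules_, id);
        take(class_rules_, cls);
        take(tag_rules_, tag);
        return out;
    }
};
```

A `<div id="world" class="text">` pulls candidates from exactly three buckets, no matter how many thousands of rules the page carries.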
从右到左 · 选择器为什么这样读
Right-to-left · why selectors read backwards
假设你写下 .text .hello。要快速判断"这个 div 命中吗",浏览器从最右边开始:先看节点本身有没有 .hello,命中了再向上找祖先里有没有 .text。从右到左能在第一步就否决掉绝大多数节点。
You write .text .hello. The fastest way to decide "does this div match?" is to start from the right: does the node itself have .hello? If yes, walk ancestors looking for .text. Right-to-left rejects the vast majority of nodes on the very first check.
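The rejection logic fits in a few lines. Here is a toy matcher for class-only descendant selectors, far simpler than Blink's SelectorChecker (invented names, stated assumptions: selector parts are bare class names written left-to-right):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A node's class list.
using Classes = std::vector<std::string>;

static bool Has(const Classes& c, const std::string& name) {
    for (const auto& x : c)
        if (x == name) return true;
    return false;
}

// ancestors: root-first chain above the node; selector: left-to-right parts.
bool Matches(const std::vector<Classes>& ancestors, const Classes& self,
             const std::vector<std::string>& selector) {
    // Step 1: rightmost part against the node itself; this single check
    // rejects the vast majority of candidate nodes.
    if (!Has(self, selector.back())) return false;
    // Step 2: remaining parts right-to-left, walking ancestors upward.
    int part = static_cast<int>(selector.size()) - 2;
    for (int i = static_cast<int>(ancestors.size()) - 1; i >= 0 && part >= 0; --i)
        if (Has(ancestors[i], selector[part])) --part;
    return part < 0;
}
```

For a node without `.hello`, the function returns on the very first check; the ancestor walk never happens.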
Whether a declaration applies, and which one wins, is decided by four ranked criteria. Only when one ties does the next level kick in — declaration order is the final tiebreaker.
01
Cascade layers 顺序
Cascade layers
@layer 块的声明顺序,最先声明的最弱。
Order of @layer declarations — earliest is weakest.
02
选择器特异度
Selector specificity
id (100) · class/attr/pseudo-class (10) · tag (1) 之和。
Sum of id (100) · class/attr/pseudo-class (10) · tag (1).
03
Proximity 排序
Proximity ordering
Cascade Level 6 引入,作用范围嵌套深的获胜。
Introduced in Cascade Level 6 — the more deeply nested scope wins.
04
声明位置
Declaration order
最后到达的获胜——这就是为什么 main-heading2 写在后面就赢了。
Last-write-wins — this is why main-heading2 wins simply by being declared later.
Suppose <h1 class="main-heading main-heading2"> with two rules: .main-heading { color: red; } and .main-heading2 { color: blue; }.
特异度同为 0,1,0,class 顺序无关——决定胜负的是声明位置。.main-heading2 写在后面,标题就是蓝色,把 class 顺序反过来写也一样。HTML 里 class 出现的先后从来不影响 CSS。
Specificity is identical at 0,1,0; the order of class names is irrelevant — declaration order decides. .main-heading2 is declared later, so the heading is blue, no matter what order you write the classes in HTML. Class order in HTML never affects CSS.
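The worked case above, as a sketch of cascade tiers 02 and 04 only: specificity as a lexicographically compared (id, class, tag) triple, declaration order as the tiebreaker. This is a toy model; @layer and proximity are left out, and the types are invented:

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// One matched declaration: its specificity triple, its position in the
// stylesheet, and the value it would apply.
struct Rule {
    std::tuple<int, int, int> specificity;  // (id, class, tag)
    int order;                              // declaration position
    std::string color;
};

std::string Winner(const std::vector<Rule>& matched) {
    const Rule* best = &matched[0];
    for (const Rule& r : matched) {
        // Higher specificity wins; on a tie, the later declaration wins.
        if (r.specificity > best->specificity ||
            (r.specificity == best->specificity && r.order > best->order))
            best = &r;
    }
    return best->color;
}
```

With two 0,1,0 rules the later one wins (blue), exactly the .main-heading2 case; an id rule beats any class rule regardless of order.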
为什么 UA 样式排在你之前WHY UA STYLES DECLARE BEFORE YOURSBlink 内置默认样式表(html.css 等)总是第一个注册到 RuleSet。在第 4 级判定(声明位置)里,业务样式由于是后注册的,每次都赢——这就是 UA 样式可以被你覆盖 的根本机制。Blink's UA stylesheet (html.css and friends) is always registered into the RuleSet first. At the 4th cascade tier (declaration order), your CSS is registered later — and wins by being last. That is the actual mechanism that lets you override UA styles without !important.
DEVTOOLS
Performance > "Recalculate Style";Elements > Computed 看 ComputedStylePerformance > "Recalculate Style"; Elements > Computed for ComputedStyle
▸ Performance · Main thread · "Recalculate Style"selected · 4.2 ms · 1842 elements
Main
JS
Recalculate Style 4.2ms
Layout 3.1ms
Paint 1.8ms
Commit
idle
Recalc Style 展开
CollectMatching
CompareRules · cascade
ApplyMatched · ComputedStyle
Selector match stats · this frameRuleSet hit-rate: 96.4%
id_rules_ = 12 hits
class_rules_ = 1432 hits
tag_rules_ = 387 hits
attr_rules_ = 9 hits ⚠ slow
看 3 件事: ① Recalc Style / Layout / Paint 三个块的相对宽度 — 哪个胖哪个就是瓶颈。Style 通常是 Layout 的 1/2;若反过来,八成是选择器太复杂(每个节点跑右往左匹配的成本爆了);② 底部 RuleSet hit-rate < 80% = 大量节点跑了无效匹配;③ attr_rules_ 命中 标红 — 属性选择器([data-state="open"])是最慢的桶,遇到全文档量级 selector 时尤其贵。3 things to watch: ① Recalc Style vs Layout vs Paint width ratios — fattest one is the bottleneck. Style is usually ½ of Layout; if reversed, you almost certainly have over-complex selectors (per-node right-to-left match cost explodes); ② RuleSet hit-rate < 80% = many nodes running futile matches; ③ attr_rules_ hits in red — attribute selectors ([data-state="open"]) are the slowest bucket, particularly costly with document-scale selectors.
Render Tree · 每节点附 ComputedStyle · + ComputedStyle per node
STAGE 03 · DOC PHASE
Layout — 几何属性的归宿
Layout — where geometry settles
几何归位
where geometry settles
Module
blink
Process
Render
Thread
Main
Output
Layout Tree
这一步在做什么
What it does
遍历 Render Tree,给每个 LayoutObject 计算 x · y · width · height。所谓 LayoutTree = Render Tree + 几何属性。Walk the Render Tree, compute x · y · width · height for every LayoutObject. LayoutTree = Render Tree + geometry.
为什么不能跳过
Why not skip
没有几何就没法绘制——"画一个红色矩形" 至少需要 4 个数字。Layout 还要解决 inline ↔ block ↔ float ↔ flex ↔ grid 之间错综的相互影响,是 Main thread 上最容易长尾的一段。No geometry → no painting. "Draw a red rectangle" needs four numbers at minimum. Layout also has to resolve the tangled interactions between inline ↔ block ↔ float ↔ flex ↔ grid — and it is the Main thread's most long-tail-prone stage.
Layout is about geometry — position and size. Each LayoutObject carries a LayoutRect that stores x / y / width / height.
The catch: LayoutObject does not map 1 : 1 to a DOM node. A display: list-item becomes two LayoutObjects (item box + marker box). An anonymous block "appears from nowhere" to keep layout rules consistent.
STAGE 03主线 · The Card 在布局后Main-line · The Card after Layout
LayoutNGFlexibleBox 的两遍布局,8 个 LayoutObject 各就位
LayoutNGFlexibleBox's two passes, 8 LayoutObjects in place
The card is a display: flex container, so the root article uses LayoutNGFlexibleBox, not LayoutNGBlockFlow. The flex algorithm runs twice: first the main axis (horizontal) — avatar 56 + gap 14 + button 53 = 123, leaving 217 for .info; then the cross axis (vertical) — center via align-items. The final LayoutTree:
Five things to notice: ① img uses LayoutImage (a LayoutReplaced subclass) — it has no children and never will; img is a replaced element, Layout gives it a single box for the external resource; ② a uses LayoutInline, not BlockFlow — it's phrasing content, occupies a line-box; ③ button brings its own LayoutNGBlockFlow with default padding / border-radius from UA stylesheet; ④ DOM 6 nodes → 8 LayoutObjects (the three LayoutText appear from nowhere); ⑤ the layout algorithm ran twice — that's the flex tax. A plain block flow only runs once.
What stages does mutating each CSS property trigger? CSS Triggers answers it. The most-cited rows below — moving a property from the reflow path to the composite path is the lowest-hanging perf win:
CSS 属性 CSS property                  | Layout | Paint | Composite
width / height / padding / margin     |   ●    |   ●   |    ●
top / left / right / bottom           |   ●    |   ●   |    ●
font-size / line-height / display     |   ●    |   ●   |    ●
color / background-color / box-shadow |   —    |   ●   |    ●
border-radius / outline               |   —    |   ●   |    ●
opacity                               |   —    |   —   |    ●
transform                             |   —    |   —   |    ●
filter                                |   —    |   —   |    ●
用法HOW TO USE想做位移动画 → 用 transform: translate(x, y),不要用 top / left;想淡入淡出 → opacity,不要 display 切换;圆角变化 → 改 border-radius 时整张图层都得重 Paint,能避就避。不同浏览器内核的处理表略有差异,CSS Triggers 是 Lookup table,不是宪法。Want a position animation? Use transform: translate(x, y), not top / left. Cross-fade? opacity, not toggling display. Animating border-radius repaints the whole layer — avoid where you can. Different engines vary slightly; CSS Triggers is a lookup, not a law.
"LayoutObject" is not a single class — it's an inheritance tree. Blink uses subclassing to encode different box-model rules: block / inline / table / svg / mathml each walk their own algorithm. Below is a condensed map of the third_party/blink/renderer/core/layout/ tree:
LAYOUT OBJECT · CLASS HIERARCHYthird_party/blink/renderer/core/layout/layout_object.h
Two details worth remembering: ① LayoutText doesn't inherit from LayoutBox — it has no box, its geometry is decided by the parent LayoutInline / LayoutBlockFlow inside an inline line-box; ② LayoutView is the single root — it owns the viewport's size + the root ScrollableArea. Removing document.body doesn't kill it; LayoutView is a permanent member of Document.
Starting in 2017, Chromium rewrote the layout engine as LayoutNG (Next-Generation Layout). The headline change: introduce a "Fragment" as a read-only geometry snapshot — LayoutObjects remain the input "recipe", but layout output is no longer written back into them. Instead we get an immutable fragment tree (NGPhysicalFragment tree). This split lets layout be cached, parallelised, and short-circuited at subtree boundaries.
Operationally, LayoutBlockFlow::UpdateLayout() constructs an NGBlockLayoutAlgorithm, feeds it an NGConstraintSpace, runs it and emits an NGLayoutResult — at its centre an NGPhysicalBoxFragment. "Constraint + Algorithm → Fragment" is LayoutNG's three-act form, each act purely functional, each act cacheable. This is what reduces many O(whole tree) reflows to O(dirty subtree) under NG.
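The "Constraint + Algorithm → Fragment" shape can be sketched as a pure, memoised function. All types here are invented toys; NG's real NGConstraintSpace keys are far richer than a single width:

```cpp
#include <cassert>
#include <map>

// Toy constraint space: the only input that affects this box's layout.
struct ToyConstraintSpace {
    float available_width;
    bool operator<(const ToyConstraintSpace& o) const {
        return available_width < o.available_width;
    }
};

// Toy fragment: the read-only geometry snapshot layout emits.
struct ToyFragment {
    float width;
    float height;
};

struct ToyLayoutBox {
    float aspect = 0.5f;  // toy content model: height = width * aspect
    std::map<ToyConstraintSpace, ToyFragment> cache;
    int layout_runs = 0;

    // Pure function of the constraint space, so results are cacheable:
    // same constraints in, same fragment out, no work done.
    ToyFragment Layout(const ToyConstraintSpace& space) {
        auto it = cache.find(space);
        if (it != cache.end())
            return it->second;  // unchanged constraints: cache hit
        ++layout_runs;
        ToyFragment frag{space.available_width, space.available_width * aspect};
        cache[space] = frag;
        return frag;
    }
};
```

A second call with identical constraints runs zero layout work; this is the mechanism behind "O(whole tree) reflows become O(dirty subtree)".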
CSS says: adjacent vertical margins between block siblings collapse to the larger one. The catch: whether the current block's margin-top participates in collapse can only be decided after its first child block has been laid out — a cross-subtree backward dependency. LayoutNG models this explicitly with NGUnpositionedFloat and NGMarginStrut — keeping the algorithm pure functional.
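The strut idea as a sketch, under the simplification that only adjoining margins are appended — a toy version of what NGMarginStrut tracks, not the real struct:

```cpp
#include <algorithm>
#include <cassert>

// Toy margin strut: accumulate adjoining vertical margins, resolve the
// collapsed value lazily. CSS collapsing rule modelled here: the largest
// positive margin plus the most negative margin.
struct ToyMarginStrut {
    float positive = 0;
    float negative = 0;

    void Append(float margin) {
        if (margin >= 0)
            positive = std::max(positive, margin);
        else
            negative = std::min(negative, margin);
    }
    float Sum() const { return positive + negative; }  // resolved collapse
};
```

Because the strut only accumulates, a parent can keep appending child margins without deciding anything; the decision happens once, when the strut is finally resolved — which is what keeps the algorithm purely functional.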
为什么 Flex 要跑两遍布局? · 主轴 / 交叉轴的强先后依赖Why does Flex run layout twice? · main-axis must finish before cross-axis can begin
flex 的两遍布局,本质是"分配尺寸 → 排列方向" 这两步必须串行:
Flex's two layout passes really come from "distribute size → place along axis" being a strict sequence:
Pass 1 · main axis (horizontal): solve flex-grow / flex-shrink / flex-basis arithmetic — given the container's available width, distribute it across children per flex: 1 0 200px style rules. The output: each child's main-axis size (i.e. width for a horizontal flexbox).
Pass 2 · cross axis (vertical): now that every child's width is known, we can measure their heights (height often depends on width — e.g. auto-wrapping text: 1 line if the container is wide, 3 lines if narrow). align-items / align-self alignment + "the tallest item in a flex-line sets the line's height" rule both need this second pass.
Why can't this run in parallel? Because a child's height depends on the width it received in pass 1 — a one-way dependency. If you forced height-first, you'd get "height computed against the original container width" — once the container width changes (because flex-grow expanded a child), all heights are stale and need recomputing. So the "two passes" are a mathematical constraint, not an engineering whim.
Grid is even worse: CSS Grid runs three or more passes in some scenarios (the min-content / max-content track-sizing algorithm is iterative until convergence). Flex's two passes are actually frugal in comparison.
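The two passes can be sketched with a toy flexbox whose item heights depend on their assigned widths (the "auto-wrapping text" dependency described above). Invented model: one flex line, items are either fixed-width or "flex: 1", and height = content_area / width:

```cpp
#include <cassert>
#include <vector>

struct ToyFlexItem {
    float fixed_width;   // 0 means flexible
    float content_area;  // toy stand-in for "amount of text to wrap"
    float width = 0;
    float height = 0;
};

void LayoutFlexRow(std::vector<ToyFlexItem>& items, float container, float gap) {
    // Pass 1 · main axis: subtract gaps and fixed widths, split the rest
    // among the flexible items.
    float remaining = container - gap * (items.size() - 1);
    int flexible = 0;
    for (const auto& it : items) {
        remaining -= it.fixed_width;
        if (it.fixed_width == 0) ++flexible;
    }
    for (auto& it : items)
        it.width = (it.fixed_width != 0) ? it.fixed_width : remaining / flexible;
    // Pass 2 · cross axis: heights are only computable once widths exist.
    for (auto& it : items)
        it.height = (it.width > 0) ? it.content_area / it.width : 0;
}
```

Swapping the passes is impossible here: pass 2 reads `it.width`, which does not exist until pass 1 has run — the one-way dependency in code form.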
▸ Performance · Main thread · Layout/Reflow event (expanded)selected · 8.4 ms · forced
Main
JS
Recalc Style 1.6ms
Layout 8.4ms ⚠ forced reflow
Paint 2.1ms
Commit
idle
Layout (expanded)
PerformLayout
UpdateLayout · 312 nodes
UpdateLayout (expanded)
NGFlexLayoutAlgorithm
NGBlockLayout · pass1
NGBlockLayout · pass2
offsetWidth (sync layout)
0 · 4 ms · 8 ms · 12 ms · 16 ms
3 个红线信号: ① Layout 占主线程比例 > 30%(本图 8.4/16.7 = 50%,直接掉帧);② Layout bar 上有 ⚠ forced reflow 标 — JS 在读 offsetWidth/getBoundingClientRect() 之前刚改了 DOM,触发同步布局(C8 章 reflow 那段说的);③ 同一帧内出现多次 Layout 块 — 典型 layout thrashing,把读/写操作分批合并到 rAF 里就能消除。3 red-line signals: ① Layout takes > 30% of the Main thread (here 8.4/16.7 = 50%, frame dropped); ② Layout bar carries a ⚠ forced reflow tag — JS read offsetWidth/getBoundingClientRect() right after a DOM mutation, triggering sync layout (the C8 reflow story); ③ multiple Layout blocks in one frame — classic layout thrashing, fixed by batching reads + writes inside a rAF.
Layout Tree · 含 LayoutRect + Fragment 树 · + LayoutRect + Fragment tree
STAGE 04 · DOC PHASE
Pre-paint — 四棵属性树的诞生
Pre-paint — birth of the four property trees
局部更新的"语法"
isolation contracts for transform · clip · effect · scroll
Module
blink
Process
Render
Thread
Main
Output
Property Trees ×4
这一步在做什么
What it does
从 LayoutTree 抽出 4 棵属性树(Transform / Clip / Effect / Scroll),每棵树用父子继承的方式表达"该节点上面叠了哪些变换 / 裁剪 / 特效 / 滚动"。把这部分从图层结构里剥离出来,是 Compositor 能"只更新一个属性、不重画"的根基。Extract 4 property trees (Transform / Clip / Effect / Scroll) from the LayoutTree. Each tree uses parent-child inheritance to express "what transforms / clips / effects / scrolls are stacked above this node". Splitting these axes from the layer structure is what lets the Compositor "mutate one property without repainting".
为什么不能跳过
Why not skip
没有这 4 棵树,每次 transform / opacity / scroll 改变都要顺着图层树重新计算继承关系,跨线程把整棵树拷一遍。属性树是 Compositor "动一个节点的一条属性,其他都 cache 命中"的隔离合同。Without the four trees, every transform / opacity / scroll change has to recompute inheritance up the layer tree and ship the whole tree cross-thread. Property trees are the isolation contract that lets "mutate one property of one node" stay local.
CAP · COMPOSITE AFTER PAINT新版本 Chromium 的 Pre-paint & Paint 已重写为 CAP(Composite After Paint) 模式——属性树的构建从 Layout 后剥离到 Paint 之后完成,去掉了 PaintLayer 这一层。结果是更少的中间产物 + 更精确的失效计算。本文中的描述基于 CAP 之前的世界,但脉络与新版完全一致。Recent Chromium has rewritten Pre-paint & Paint as CAP (Composite After Paint) — property-tree construction moves from "after Layout" to "after Paint", and the PaintLayer abstraction is gone. The result: fewer intermediate artifacts and more precise invalidation. The model below predates CAP, but the shape is identical in the new world.
STAGE 04主线 · The Card 在 Pre-paint 后Main-line · The Card after Pre-paint
4 棵树各取所需,Effect 树多 2 个节点,Transform 树多 1 个
Four trees each take a piece — Effect gains 2 nodes, Transform gains 1
名片这次会让 4 棵属性树都"动起来"——但每棵树多出来的节点不一样。NeedsEffect() 看到 .card 上的 box-shadow 与 linear-gradient,在 Effect tree 里给它建一个节点;看到 .follow 的 background 切色又建一个;NeedsTransform() 看到 .follow 的 will-change,在 Transform tree 给它建一个;NeedsClip() 看到 .avatar 的 border-radius: 50%,在 Clip tree 多一个 RRect 节点。Scroll tree 没动(整张卡都不滚)。
This time the card stirs all four property trees — but each tree gains a different number of nodes. NeedsEffect() sees .card's box-shadow + linear-gradient and creates one Effect node; .follow's background-mutation creates another. NeedsTransform() sees .follow's will-change and creates a Transform node. NeedsClip() sees .avatar's border-radius: 50% and creates a Clip node. Scroll tree remains untouched — nothing scrolls.
// 4 property trees · after The Card

Transform tree              Clip tree
─ root                      ─ root
└─ .follow                  └─ .avatar (RRect 50%)
   [will-change]               [border-radius]

Effect tree                 Scroll tree
─ root                      ─ root
├─ .card                    // (unchanged)
│  [shadow + gradient]      // nothing scrolls in
│  render_surface = YES     // the example
└─ .follow
   [will-change]
Key decision: .card's render_surface_reason_ flips to non-null (the box-shadow needs off-screen compositing to render correctly) — meaning .card's whole subtree will land in its own RenderPass during Draw. The decision is made in Pre-paint but only cashes in at C16. The four trees are the contract that powers every "local update" later — hover-changing .follow's transform mutates one node in the Transform tree; the other three trees are cache hits and Layout doesn't run at all.
四棵属性树
The four property trees
TRANSFORM
变换树Transform tree
每个节点的位移、旋转、缩放、3D 变换;动画热路径必经。
Per-node translation / rotation / scale / 3D. The hot path of animations.
CLIP
裁剪树Clip tree
overflow / clip-path 在层级里生成的剪裁矩形。
Clip rectangles inherited from overflow / clip-path.
EFFECT
特效树Effect tree
opacity / filter / 混合模式等视觉特效;决定 RenderPass 边界。
Per-node opacity / filter / blend effects; decides RenderPass boundaries.
SCROLL
滚动树Scroll tree
滚动容器的 offset 与可滚动范围;Compositor 线程可直接修改。
Scroll offsets and scrollable bounds; mutable by the Compositor without Main.
Backed by these trees, Chromium can mutate one node's transform / clip / effect / scroll without disturbing its descendants. This is why CSS animations stay smooth — the entire animation never goes back to the Main thread.
案例 · 一个 div,4 棵树各取所需
Worked case · one div, four trees each take a piece
CASE · DECOMPOSITION
CSS 属性是怎么"分流"到不同属性树上的
How CSS properties get routed to different trees
<div style="
transform: rotate(10deg); /* → Transform tree */
overflow: hidden; /* → Clip tree */
opacity: 0.5; /* → Effect tree */
filter: blur(4px); /* → Effect tree */
overflow-y: scroll; /* → Scroll tree */
">
The same LayoutObject contributes one node to each of Transform / Clip / Effect / Scroll during Pre-paint. Animating opacity later mutates only the Effect tree — Transform / Clip / Scroll are all cache hits.
The key is step ①: each object does not rebuild the whole tree, it only updates the 4 property-node pointers tied to itself. Pre-paint runs in O(changed nodes), not O(whole tree) — that's what makes it cheap enough to run every frame.
"Property tree" sounds abstract; in code it's a handful of small structs that inherit from cc::PropertyTreeNode. Each tree is an array (Vector<Node>); nodes link via integer parent_id — not pointers, because the whole tree gets memcpy'd cross-thread at Commit time. The four flavours of node, with their key fields:
Three details unlock the design's power: ① all four trees share one node-id space (1-based, 0 is the root) — a LayerImpl needs only 4 ints to know "which transform / clip / effect / scroll chain I belong to"; ② every node carries a changed_flag, so the next frame only pushes the nodes that actually changed — Commit bytes scale with delta; ③ the render_surface_reason_ on an Effect node is the real switch for "do we need an off-screen RenderPass?" — filter: blur(), mix-blend-mode, mask-image all flip it on.
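That storage shape can be sketched directly: flat arrays, integer parent ids, per-node changed flags. Toy structs, not cc's real PropertyTreeNode:

```cpp
#include <cassert>
#include <vector>

// Toy transform node: linked by index, never by pointer, so the whole
// array can be copied across threads wholesale.
struct ToyTransformNode {
    int parent_id;  // -1 at the root
    float tx, ty;   // toy payload: a 2D translation
    bool changed;   // only changed nodes get pushed at Commit
};

struct ToyTransformTree {
    std::vector<ToyTransformNode> nodes;

    int Insert(int parent_id, float tx, float ty) {
        nodes.push_back({parent_id, tx, ty, true});
        return static_cast<int>(nodes.size()) - 1;
    }

    // Screen-space offset: walk the parent chain, summing translations.
    void Accumulate(int id, float& x, float& y) const {
        for (int i = id; i != -1; i = nodes[i].parent_id) {
            x += nodes[i].tx;
            y += nodes[i].ty;
        }
    }
};
```

Mutating one node means flipping one `changed` flag and one payload; the rest of the array ships unchanged.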
When PrePaintTreeWalk reaches a LayoutObject, PaintPropertyTreeBuilder::UpdateForSelf() follows a "build-on-demand" rule — it creates a node only if a CSS property needs one. For example:
This is the real answer to "which properties trigger GPU acceleration?" — anything that makes Pre-paint emit a new Transform / Effect node with a non-empty render_surface_reason_ promotes that subtree to its own composited layer. will-change: transform "creates a layer" precisely because it bypasses the NeedsTransform guard and forces a Transform node into existence.
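The build-on-demand guards as a sketch: a property-tree node exists only when some CSS property demands one. The guard names follow the text; the predicate and structs are invented:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy computed-style slice: just the properties the guards look at.
struct ToyStyle {
    bool has_transform = false;
    bool has_will_change_transform = false;
    float opacity = 1.0f;
    bool has_overflow_hidden = false;
};

// Which property-tree nodes would Pre-paint create for this element?
std::set<std::string> PropertyNodesFor(const ToyStyle& s) {
    std::set<std::string> nodes;
    if (s.has_transform || s.has_will_change_transform)  // NeedsTransform
        nodes.insert("transform");
    if (s.opacity < 1.0f)                                // NeedsEffect
        nodes.insert("effect");
    if (s.has_overflow_hidden)                           // NeedsClip
        nodes.insert("clip");
    return nodes;
}
```

A plain div creates nothing; will-change: transform forces a Transform node into existence even though no transform is applied yet — the "creates a layer" behaviour in miniature.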
为什么是 4 棵树,而不是 1 棵大树? · 不同属性的失效粒度本质不同Why four trees, not one big tree? · different properties have fundamentally different invalidation granularity
最直接的答案:这 4 类属性的"变化频率" 与"影响范围" 完全不一样。把它们塞进一棵树,每次变一个属性都要把整棵树推到 Compositor 线程,那就退化成"整页重做"了。
The most direct answer: these four kinds of properties have completely different "change frequency" and "impact radius". Stuff them into one tree and every property change ships the whole tree to the Compositor — degenerating to "redo the whole page".
具体看 4 棵树各自的"性格":
Each tree's "personality" specifically:
Transform tree · 变化最频繁(每帧动画都改),但不影响其他元素的几何。一个独立树,变化只 push 一个节点。
Transform tree · changes most often (every animation frame), but doesn't affect other elements' geometry. Its own tree → one node push per change.
Clip tree · driven by layout, changes less often; but clipping is cumulative (parent clip ∩ child clip), needing its own ancestor chain. Mixed with Transform, every transform change would re-evaluate clip — pure waste.
Effect tree · decides RenderPass boundaries (box-shadow / opacity / blend-mode trigger off-screen composition). This tree's nodes are the buckets of GPU work — they decide "which quads go to a dedicated RenderPass". Transform / Clip don't affect RenderPass partitioning, so this must stay separate.
Scroll tree · 滚动是用户输入触发,跟 vsync 节奏完全独立。Compositor 线程要在没有 Main 线程参与的情况下直接修改 scroll offset。如果 Scroll 在 Transform 树里,Compositor 修一个 offset 就要"叫醒整棵 Transform 树",违背"滚动跑在 Compositor 上"的初衷。
Scroll tree · scrolling is triggered by user input, on a completely different cadence from vsync. The Compositor needs to mutate scroll offset without involving Main. If Scroll lived inside Transform, mutating one offset would "wake the entire Transform tree" — defeating the purpose of "scrolling runs on the Compositor".
So the 4 trees are an "orthogonal decomposition": each property axis has independent "change frequency × impact radius"; cramming them into one tree disables 4 independent optimisations. Storing orthogonal axes separately and updating locally per axis — the same trick used by database engines, graphics engines, and OS schedulers: "shard by mutation pattern".
Counter-example: early Blink's PaintLayer was exactly "one tree with all properties bundled" — every transform change traversed the PaintLayer tree. The CAP (Composite After Paint) project's core action was killing PaintLayer and replacing it with these 4 independent property trees. The perf win came from "splitting things that shouldn't change together" — a rule that holds for any system.
DEVTOOLS
Performance > "Update Layer Tree";Layers 面板看图层树Performance > "Update Layer Tree"; Layers panel for the layer tree
看 3 处: ① 树视图里星标 ★ promoted 的就是 4 棵属性树给它建了独立节点 的元素 — Pre-paint 的输出在这里可见;② 右侧画布是层级俯视图,实线方框 = 独立合成层(.card 与 .follow),虚线 = 普通子元素;③ 鼠标悬到任一层上 DevTools 会显示 Compositing Reasons(kActiveTransformAnimation / kBackdropFilter 等) — 直接对应 cc::CompositingReason enum。独立合成层数量爆炸是 will-change 滥用的明证。3 spots to inspect: ① ★ promoted entries in the tree are elements where the four property trees created dedicated nodes — Pre-paint's output is visible here; ② the right canvas is the top-down layer view; solid borders = composited layers (.card and .follow), dashed = inline children; ③ hover any layer and DevTools shows the Compositing Reasons (kActiveTransformAnimation / kBackdropFilter etc.) — directly mapping to the cc::CompositingReason enum. An explosion of composited layers is proof of will-change abuse.
layout objects → display item list, in CSS painting order
Module
blink → cc
Process
Render
Thread
Main
Output
cc::PictureLayer + DisplayItemList
这一步在做什么
What it does
遍历 LayoutTree,按 CSS 绘画顺序把每个 LayoutObject 翻译成一组 DisplayItem,组成一份 DisplayItemList,挂到 cc::PictureLayer。这一步不画一个像素——它只是把"要画什么"写成可重放的脚本。Walk the LayoutTree in CSS painting order, translate each LayoutObject into a batch of DisplayItems assembled into a DisplayItemList, attach to cc::PictureLayer. Not one pixel painted here — Paint writes a replayable script of "what to draw".
为什么不能跳过
Why not skip
指令而非像素 = 可跨线程传递。Main thread 只生产 DisplayItemList,Raster 线程拿过去 playback。同一份指令在不同 scale / 不同 Tile / 不同设备上反复使用——是 Chromium "便宜地多次画" 的根。Instructions, not pixels = cross-thread transferable. Main only produces DisplayItemList; Raster threads play it back. The same list is reused across scales, tiles, devices — the root of Chromium's "paint cheaply, paint many times".
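The "instructions, not pixels" idea in miniature. A hedged sketch: a DisplayItem here is just a rect and Playback a scale multiply, but the shape of the win is the real one: record once on Main, replay on any thread at any scale.

```cpp
#include <cassert>
#include <vector>

// Toy DisplayItem: a rect in layout coordinates. Playback multiplies by a
// device scale, so one recorded list can be replayed per tile / per scale.
// (Sketch only; real items are Skia ops inside cc::DisplayItemList.)
struct Rect { int x, y, w, h; };

struct DisplayList {
  std::vector<Rect> items;  // recorded once on the Main thread

  // Replayed on any Raster thread, at any scale, any number of times.
  std::vector<Rect> Playback(int scale) const {
    std::vector<Rect> px;
    for (const Rect& r : items)
      px.push_back({r.x * scale, r.y * scale, r.w * scale, r.h * scale});
    return px;
  }
};
```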
STAGE 05主线 · The Card 在 Paint 后Main-line · The Card after Paint
14 条 DisplayItem 排好,头像被 ClipRRect 圈住
14 DisplayItems in line, avatar wrapped by ClipRRect
Paint 不画一个像素——它只把 Layout 与 Effect tree 给的指令转写成 cc::DisplayItemList。名片的最终 list 是按 CSS Appendix E 的 7 阶段绘画顺序排好的:
Paint paints zero pixels — it only transcribes Layout + Effect tree decisions into a cc::DisplayItemList. The card's final list is ordered by the CSS Appendix E painting sequence:
Three things to notice: ① the avatar's ClipRRect is a real circle — SkRRect carries four independent corner radii, all four set to 28px, so the GPU treats it as a true circle; ② SaveLayer nests twice — the outer one carves an off-screen texture for .card's shadow effect; the inner one is .follow's own composite layer. To cc these are two independent PaintChunks, each bound to its own 4-tree state; ③ the whole list reuses incrementally via PaintController's double buffer — next frame's unchanged chunks are O(1) range-moved over, only mutated chunks enter Raster.
The CSS spec mandates a fixed 7-phase order for every block — background at the bottom, text at the top. The order is "CSS 2.1 Appendix E painting order", strictly followed by Blink:
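The phase ordering is easy to model: tag every item with its phase and stable-sort. The phase names below follow the article's bg → border → float → in-flow → positioned → outline → text summary, not the spec's full prose:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Appendix E in miniature: every item carries a phase, and the final list
// is ordered phase-by-phase (background at the bottom, text on top).
enum Phase { kBackground, kBorder, kFloat, kInFlow, kPositioned, kOutline, kText };

struct Item { Phase phase; std::string name; };

// stable_sort keeps document order within a phase, as the spec requires.
std::vector<Item> InPaintOrder(std::vector<Item> items) {
  std::stable_sort(items.begin(), items.end(),
                   [](const Item& a, const Item& b) { return a.phase < b.phase; });
  return items;
}
```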
The Paint stage doesn't paint pixels — it records "what to paint". Each entry is a DisplayItem, ultimately fed into a cc::PictureLayer.
和 DOM 构建一样,Paint 走的是栈式遍历——遇到一个 LayoutObject,先 SaveLayer 压栈,绘制完子节点后 Restore 出栈。这种模式让裁剪、变换、不透明度都能就地嵌套,又不会污染兄弟节点。
Like DOM construction, Paint walks a stack — hit a LayoutObject, push SaveLayer, paint children, pop Restore. This pattern lets clip, transform and opacity nest in place without contaminating siblings.
"DisplayItemList" inside Blink is not a flat array — it nests three layers: DisplayItem → PaintChunk → PaintArtifact. The design lets invalidation granularity and property-tree binding both target sub-ranges precisely:
PAINT_CONTROLLER · DATA MODELthird_party/blink/renderer/platform/graphics/paint/
Why a "chunk layer"? Because cc's Layerization step doesn't operate on individual DisplayItems — it operates on "contiguous DisplayItems sharing one property-tree state". Only a chunk's worth can land on the same cc::Layer. The chunk's properties_ field carries the current pointer into the four property trees Pre-paint just produced — it switches at every SaveLayer/Clip/Transform boundary. This is the seam that stitches "property trees" and "display lists" together.
Every LayoutObject's DisplayItems carry an identity (DisplayItemClient). On the next Paint, unchanged LayoutObjects reuse last frame's DisplayItems verbatim. Paint is incremental — its cost scales with the dirty region only.
Mechanically the reuse runs on a double-buffer inside PaintController: current_paint_artifact_ is last frame's output; new_paint_artifact_ is this frame's accumulator. When PaintController::UseCachedItemIfPossible(client, type) hits (the client is clean and last frame painted this type), the whole contiguous slice of DisplayItems is bulk-moved into new_artifact — O(1) range move, no repaint. The mechanism is called "Paint cache", and it's why a typical frame in Blink only paints 200~300 new items even though the document has thousands.
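A minimal sketch of that double buffer, with one item per client for brevity (the real PaintController moves contiguous ranges and matches by DisplayItemClient id):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy PaintController double buffer: a clean client's slice of last frame's
// items is moved into the new artifact without repainting; only dirty
// clients re-record. (Simplified to one item per client.)
struct Controller {
  std::vector<std::string> current;  // last frame's artifact
  std::vector<std::string> next;     // this frame's accumulator
  int repaints = 0;

  void PaintClient(size_t index, bool clean) {
    if (clean) {
      next.push_back(std::move(current[index]));  // cache hit: O(1) move
    } else {
      ++repaints;                                 // cache miss: re-record
      next.push_back("repainted#" + std::to_string(index));
    }
  }
};
```

Three clients, one dirty: the frame's Paint cost is one item, not three.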
cc::Layer 的 5 个家族成员
The cc::Layer family — five subtypes
Paint 把 LayoutObject 转成的不是一张图——而是一棵 cc::Layer 树,运行在主线程,每个 Render Process 有且只有一棵。它的子类决定了"上屏方式":
Paint hands off not an image — but a cc::Layer tree, living on the Main thread, exactly one per Render process. The subclass decides "how this will reach the screen":
Embeds a CompositorFrame from another process. Used by iframes, OffscreenCanvas, video.
cc::UIResourceLayer / NinePatch
软件渲染场景下的"位图层"——类 TextureLayer 的 fallback。
Software-rendering bitmap layer — the fallback cousin of TextureLayer.
cc::VideoLayer (deprecated)
已弃用,被 SurfaceLayer 取代。
Replaced by SurfaceLayer.
cc 是什么WHAT 'CC' STANDS FOR这里的 cc = content collator(内容编排器),不是 Chromium Compositor。整个 cc 模块的工作就是在 Render 进程内组织好"该给 Viz 画什么",所以叫 collator 比 compositor 更贴切。cc = content collator — not Chromium Compositor. The cc module's job is to assemble what should be drawn for Viz inside the Render process. Collator fits better than Compositor.
3 个判读: ① 7 阶段顺序固定(CSS 2.1 Appendix E),你看到的色块顺序永远是 bg → border → float → in-flow → positioned → outline → text;② PaintChunk 数量 ≈ 独立 PaintLayer 数量,过多说明子树绑了不同的 property tree state;③ UseCachedItemIfPossible 命中率 < 80% 是典型的 invalidation 风暴 — 改了某个共享属性导致整子树标脏。开 Paint Flashing 浮层(浅绿色矩形闪一下),能直观看到哪一块被重 paint。3 reads: ① The 7-phase order is fixed (CSS 2.1 Appendix E) — you'll always see bg → border → float → in-flow → positioned → outline → text; ② PaintChunk count ≈ distinct PaintLayer count; explosion means subtrees are bound to different property-tree states; ③ UseCachedItemIfPossible hit rate < 80% is a classic invalidation storm — a shared property mutation marked the whole subtree dirty. Toggle Paint Flashing (light-green flashing rectangles) and you can directly see which region was repainted.
把 Main thread 上的 cc::Layer 树(外加 4 棵属性树、DisplayItemList)同步到 Compositor thread 上的 LayerImpl 树。这是渲染管线唯一一次显式跨线程的时刻——执行期间 Main thread 被短暂"冻住"。Synchronise the Main-thread cc::Layer tree (plus 4 property trees + DisplayItemList) onto the Compositor-thread LayerImpl tree. This is the pipeline's only explicit cross-thread moment — Main is briefly frozen while it happens.
为什么不能跳过
Why not skip
两线程不能直接共享指针——JS 随时可能在 Main thread 上改动 cc::Layer,Compositor thread 同时还要光栅化它。Commit 是一次 "snapshot + 转交所有权" 的同步,让两边在边界上不打架。The two threads cannot share pointers directly — JS may mutate cc::Layer on Main any moment while the Compositor thread is rasterising it. Commit is a "snapshot + ownership transfer", so the two sides never trip over each other.
Paint produced 2 cc::Layers (main + .follow) + 4 property trees + 2 PaintChunks (each bound to a property-tree state). Commit pushes this whole bundle atomically from Main to Compositor:
The mechanism: TreeSynchronizer walks the two trees in lockstep — Main's cc::Layer tree on one side, the Compositor's LayerImpl tree on the other. Each pair calls PushPropertiesTo() to push deltas. If only .follow's transform changed, this push touches one LayerImpl + one Transform-tree node (changed_flag). The bytes copied per Commit scale with the actual number of changed properties — that's why mutating a single transform doesn't blow up the Commit cost.
Frame Lifecycle · BeginMainFrame 到 Commit
Frame lifecycle · from BeginMainFrame to Commit
vsync 一来,Compositor thread 上的 Scheduler 给 Main thread 发一个 BeginMainFrame。Main thread 接到信号后跑 Style → Layout → Pre-paint → Paint 这四步,把产物(cc::Layer 树)准备好;准备完毕,触发 Commit;Commit 执行期间 Main 被阻塞,结束后 Main 立刻继续干别的(执行 JS、跑 microtask 等)。
When vsync ticks, the Compositor-thread Scheduler sends a BeginMainFrame to the Main thread. Main runs Style → Layout → Pre-paint → Paint to prepare the cc::Layer tree, then triggers Commit. Commit blocks Main; once it returns, Main is free to do other things (execute JS, run microtasks, …).
FIG 11.A一个完整的 frame lifecycle:vsync 触发 BeginMainFrame;Main 跑前 5 步;Commit 后 Main 立即解放,Compositor 接着跑后 7 步。A complete frame lifecycle: vsync fires BeginMainFrame; Main runs the first five steps; Commit unblocks Main, and the Compositor takes over the remaining seven.
Commit is essentially every cc::Layer pushing its latest properties onto its matching cc::LayerImpl. TreeSynchronizer walks both trees in lockstep, calling each Layer's PushPropertiesTo. Textures aren't copied (they live in SharedImage), only properties.
SingleThreadProxy vs ProxyMain · two compositing modes
Chromium 实际上有两套 Commit 实现:
Chromium actually has two Commit implementations:
SingleThreadProxy
Compositor 跑在 Main thread 自己内(非典型)。Android WebView、headless mode 用它。Commit 退化成函数调用。
Compositor runs on the Main thread itself (atypical). Used by Android WebView, headless mode. Commit degenerates into a function call.
ProxyMain ↔ ProxyImpl
默认模式。Main 和 Compositor 各跑各的,通过消息泵通信。Commit 是一次跨线程同步——Main 阻塞等 Compositor 拷完属性。
The default. Main and Compositor each run on their own thread, communicating via message pumps. Commit is a cross-thread sync — Main blocks while the Compositor copies properties.
Commit 慢起来 · 为什么
Slow commits · why
PERF DIAGNOSIS
"为什么我的 Commit 要 30ms"
"Why is my commit taking 30ms"
Commit 慢,常见三种原因:
Three common causes:
① 图层树过深——一万个 cc::Layer 一个一个 PushPropertiesTo 是有成本的。优化:合并相邻图层、避免无意义的 will-change。
① The layer tree is too deep — calling PushPropertiesTo on ten thousand cc::Layers isn't free. Fix: merge adjacent layers, drop pointless will-change.
② Property tree 节点爆炸——每个 transform / clip / effect 都新建一个属性节点。一个 1000 节点的属性树同步起来会卡。
② Property tree node explosion — every transform / clip / effect creates a new property-tree node. A 1000-node tree syncs slowly.
③ 大块图片资源——TransferableResource 引用一旦改动,需要更新 SharedImage 引用,跨进程通信增多。
③ Large image resources — TransferableResource refs that churn force SharedImage ref updates, increasing IPC.
为什么 Commit 必须阻塞 Main thread? · 跨线程数据一致性的硬约束Why must Commit block the Main thread? · a hard constraint on cross-thread data consistency
No blocking = torn reads. A cc::Layer holds a dozen fields — transform / opacity / bounds / display_list, etc. If Main is mid-mutation on transform (x updated, y not yet) when Compositor reads the layer, it gets an illegal "new x, old y" state and the next frame jitters. Either lock (expensive + deadlock risk) or stop-the-world — Chromium picks the latter, because Commit is usually short (<1ms), cheaper than the lock overhead.
Commit 是"事务边界"。 一帧 Main 上跑了:Style → Layout → Paint → 改 cc::Layer。这一长串变更必须原子地一起出现在 Compositor 端,否则 Compositor 看到的是"新的 Layout 但是老的 Paint",几何与像素对不上。Commit 阻塞 Main 那一瞬,本质是在执行一次跨线程事务提交——跟数据库的 COMMIT 同名同义。
Commit is a "transaction boundary". In one frame, Main ran: Style → Layout → Paint → mutate cc::Layer. This whole train of changes must appear atomically on the Compositor side; otherwise Compositor sees "new Layout but old Paint" and geometry doesn't match pixels. Commit blocking Main is literally a cross-thread transaction commit — same name, same meaning as a database COMMIT.
Commit really is short (typically 0.5-2ms), because it moves pointers, not data: LayerImpl is a "shadow copy" of cc::Layer, sharing the underlying SkPicture / TransferableResource (refcount); property trees are shallow vector copies. So even with Main blocked, the impact is small — most pages commit in under 1ms and you can't feel it. What actually slows Commit is cc::Layer count exploding into the thousands, or property tree node count going wild (covered in the "slow commits" section above).
Are there non-blocking alternatives? Yes — "impl-side painting" (experimented in 2014, abandoned) and "composite without commit" (a side-channel for the very few Compositor-only animations). The first was killed for complexity; the second only handles "pure transform / opacity" animations. For most business pages, accepting 1ms of Commit blocking in exchange for a clean pipeline is an excellent trade.
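The transaction framing can be made concrete. A toy sketch, with Commit() as a stop-the-world struct copy: mid-frame the impl side keeps its old consistent snapshot, and after Commit it has the new one, never a mix of new x and old y.

```cpp
#include <cassert>

// Toy commit-as-transaction: Main batches mutations on its side, and
// Commit() applies the whole batch in one copy while Main is blocked, so
// the Compositor never observes a half-updated transform.
struct Transform { int x = 0, y = 0; };

struct Channel {
  Transform main_side;   // mutated freely between commits
  Transform impl_side;   // only ever updated inside Commit()

  void Commit() { impl_side = main_side; }  // atomic w.r.t. impl reads
};
```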
DEVTOOLS
Performance > Frames 行 + "Commit" 事件;Main 线和 Compositor 线对齐看Performance > Frames row + "Commit" event; align Main and Compositor lanes
TRACING
cc, blink.commit, viz.frame_production, scheduler
FLAG
--single-process/ 看 SingleThreadProxy 模式 vs ProxyMain/ inspect SingleThreadProxy vs ProxyMain modes
▸ Performance · BeginMainFrame 到 Commit 的完整周期vsync N · 16.7 ms budget
[Timeline] Compositor lane: BMF → wait for Main → Commit → Tile + Activate → wait for Raster → Draw + Submit → idle.
Main lane: wake → JS → Style → Layout → Pre-paint → Paint → ⤳ Commit (blocked) → microtask + rAF → idle (until next BMF).
Raster ×4 lane: idle → parallel raster (4 threads) → idle.
Axis: 0 · 4 · 8 · 12 · 16.7 ms.
3 个判读: ① Compositor 与 Main 在 "Commit" 那条红线同时停下 — 这是唯一一段两线程显式同步的时刻,~1ms 阻塞;② Commit 之后 Main 立即解锁,可以跑 microtask 与 rAF — 不是在等渲染完成;③ Compositor 在 Commit 之后并不闲,而是继续推进 Tile / Activate / Draw,与 Main 上的 JS 工作并行 — 这就是 cc 的 "异步流水线"。看到 Commit 时间长 > 5ms,八成是 cc::Layer 树过大或 Property 节点爆炸。3 reads: ① Compositor and Main both pause at the red "Commit" line — the one moment of explicit cross-thread sync, ~1ms block; ② Main unlocks immediately after Commit and runs microtasks + rAF — not waiting for rendering to finish; ③ Compositor isn't idle after Commit either — it keeps pushing Tile / Activate / Draw, in parallel with Main's JS work. That's cc's "asynchronous pipeline". Commit time > 5ms almost always means an oversized cc::Layer tree or property-node explosion.
cc::Layer Tree + Property Trees + DisplayItemList
→
OUTPUT
LayerImpl Tree挂在 Compositor thread 上on the Compositor thread
STAGE 07 · CC PHASE
Compositing — 把页面切成独立图层
Compositing — slicing the page into independent layers
"动一处只重画一处" 的物理基础
why "mutate one, repaint one" is possible
Module
cc
Process
Render
Thread
Compositor
Output
GraphicsLayer Tree
这一步在做什么
What it does
在 Compositor thread 上把 LayerImpl 树分组成独立的 GraphicsLayer——每一个 GraphicsLayer 拥有自己的纹理与变换矩阵,能独立动画、独立失效、独立合成。On the Compositor thread, group the LayerImpl tree into independent GraphicsLayers — each owns its own texture and transform matrix, can be animated, invalidated and composited on its own.
为什么不能跳过
Why not skip
没了图层切分,一次普通的滚动也会让所有像素重新 Paint + Raster。即便每个阶段都做缓存,无图层下也救不了——失效粒度是"全屏"。Without layer separation, even a single scroll repaints every pixel. Caches at every prior stage can't save you — the invalidation granularity collapses to "the whole screen".
Pre-paint already reserved a Transform-tree node for .follow; Compositing simply cashes in: extract .follow from the main layer into its own GraphicsLayer. The cost is one extra GPU texture (53×32 ≈ 6KB); the payoff is "only this layer moves on hover".
The promotion criteria: consult the cc::CompositingReason enum — 30+ triggers, any one matches and you're promoted. .follow matches kActiveTransformAnimation (because of will-change: transform); video / canvas / iframe / 3D transform / fixed scroll / overflow:scroll-snap would also trigger. Every layer costs memory, so cc has the reverse heuristic too — "if a layer is too small and isn't animating, fold it back into the parent".
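The promotion logic reduces to a bitmask check plus the reverse heuristic. The reason names echo the cc::CompositingReason examples above; the 16×16 "too small" threshold is invented for this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Toy compositing decision: any matching reason promotes a layer; a tiny,
// non-animating layer is folded back into its parent.
enum Reason : uint32_t {
  kNone                     = 0,
  kActiveTransformAnimation = 1 << 0,
  kBackdropFilter           = 1 << 1,
  kVideo                    = 1 << 2,
};

bool ShouldPromote(uint32_t reasons, int w, int h) {
  if (reasons == kNone) return false;
  bool animating = reasons & kActiveTransformAnimation;
  if (!animating && w * h < 16 * 16) return false;  // fold tiny static layers
  return true;
}
```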
Imagine this stage removed: Paint goes straight to Raster, straight to the screen. The moment Raster's data isn't ready when vsync arrives, a frame drops. Even with caches at every prior stage, a single scroll would force every pixel to be re-Painted and re-Rastered.
Without will-change: .wobble shares a layer with its surroundings; every frame retriggers Paint + Raster of the whole layer. With will-change: transform: the Compositor promotes it to its own GraphicsLayer ahead of time, animation reduces to a matrix multiply on the Compositor thread — zero Main-thread work, zero Raster re-runs.
Animation runs on the Compositor thread · zero Main-thread involvement.
什么会被升格为独立图层 · 完整清单
What gets promoted to its own layer · the full list
Compositor 不是无脑给每个元素一个图层——它有明确的升格条件。下面这张表是常见的命中点(CAP 之后规则有简化但骨架不变):
The Compositor doesn't blindly give every element its own layer — it has explicit promotion criteria. The list below is the common hit set (post-CAP simplifies the rules, but the skeleton is the same):
Every promoted layer costs memory — tile cache, textures, property-tree entries. Sprinkle will-change everywhere and you watch VRAM filled with motionless layers. The rule: only on elements that actually move; remove it once the animation ends.
DevTools · 看一眼自己的图层
DevTools · inspecting your own layers
CHROME DEVTOOLS · LAYERS
每个图层为什么存在 / 占多少内存 / 重画了几次
Why each layer exists, how much memory, how many repaints
DevTools 的 Layers 面板(实验功能里开启)能列出当前页面的所有 GraphicsLayer,点开任意一个会告诉你:这个图层的产生原因(will-change / 3D transform / video / iframe / mix-blend-mode…)、纹理内存占用、已绘制次数。看到一个莫名其妙存在的图层,往往是性能洞的入口。
DevTools' Layers panel (toggle in experiments) lists every GraphicsLayer. Click one and it tells you: why this layer exists (will-change / 3D transform / video / iframe / mix-blend-mode…), texture memory, paint count so far. An unexpectedly-existing layer is often the entry to a perf hole.
输入事件的小后门
Input · the side door
Compositor Thread 还有一个小职能:处理输入事件。Browser Process 把 mousewheel / scroll / touch 投到 Compositor thread 上,它能直接处理而不用麻烦 Main thread——前提是页面没有 JS 监听这些事件。一旦你 addEventListener,Compositor 就只能把事件转发回 Main thread 了。
The Compositor thread also has a side door: input event handling. The Browser process throws mousewheel / scroll / touch directly to the Compositor — bypassing Main thread entirely — as long as no JS is listening. The moment you addEventListener, the Compositor must hand the event back to Main.
默认情况下浏览器认为你可能 preventDefault,所以把 touch 事件路由回 Main thread——一旦 Main 阻塞(JS 慢函数),滚动就掉帧。解决:明确声明 passive listener:
By default the browser assumes you might preventDefault, so it routes touch events back to Main — and any slow JS on Main drops the frame rate. Fix: declare a passive listener:
这样 Compositor thread 会直接处理事件,滚动永远不被 Main thread 拖累。
Now the Compositor thread handles the event directly — scrolling stops paying the Main-thread tax.
Compositor 像一个客厅。
你不开窗户,没人吵——所有动画自己滚自己的;
你一开 onScroll,Main thread 就被叫醒。
Field Note · 02
The Compositor is like a quiet living room.
Keep the windows shut and animations roll themselves;
open an onScroll and the Main thread is awakened.
Field Note · 02
把每个 cc::PictureLayerImpl 按 256×256 / 512×512 切成一组 cc::Tile,根据距视口的距离排好优先级,封装成 cc::TileTask 投入 TaskGraph。这一步只调度,不画。Cut every cc::PictureLayerImpl into 256×256 / 512×512 cc::Tiles, order them by viewport distance, and wrap each into a cc::TileTask for the TaskGraph. This stage only schedules — no painting.
为什么不能跳过
Why not skip
两条物理边界:① GPU 不支持任意大小的纹理——一张超大图层必须切;② 多 Tab Chromium 共用一个统一缓冲池,Tile 是池的最小分配单位。没了 Tiling,多开几个 Tab 就显存爆。Two physical limits: ① GPUs can't support arbitrary-sized textures — a huge layer must be split; ② multi-tab Chromium shares a unified buffer pool, and the tile is its smallest allocation unit. Without Tiling, opening a few extra tabs runs out of VRAM.
The main layer is 340×88 — sliced into 256-aligned tiles, that's 1 full tile + 1 right-edge tile (84×88). The standalone .follow layer is only 53×32 — smaller than a tile but still owns one (because it's bound to its own property-tree state). In-viewport, all 3 tiles are priority_bin = NOW:
Three subtleties: ① edge tiles still allocate a full 256×256 texture (GPUs reject arbitrary sizes, so #r2 is really 256×256 with only 84×88 painted — wasted memory, simpler pool management); ② .follow occupies a tile of its own — this is the memory cost of "promotion to a composited layer", made concrete: a 53×32 button, eating a 256×256 texture slot; ③ 3 cc::TileTasks enter the TaskGraph — alongside 1 ImageDecodeTask (for airing.png), totaling 4 tasks for the Raster thread pool to chew through.
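The tile math for the card is two ceil-divisions. A sketch using the article's 256px tile size:

```cpp
#include <cassert>

// Toy tiling: how many 256px tiles cover a layer, ceil-division per axis.
// Edge tiles still allocate a full 256×256 texture; only part is painted.
struct Grid { int cols, rows, tiles; };

Grid TileGrid(int w, int h, int tile = 256) {
  Grid g;
  g.cols = (w + tile - 1) / tile;
  g.rows = (h + tile - 1) / tile;
  g.tiles = g.cols * g.rows;
  return g;
}
```

TileGrid(340, 88) reproduces the walkthrough's 1 full + 1 edge tile; the 84×88 edge tile still occupies a 256×256 texture slot.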
cc::Tile · 一个最小渲染单元的内部
cc::Tile · what's inside one mosaic piece
每一个 Tile 不是一张单纯的位图——它带着身份与状态:
A Tile is not a plain bitmap — it carries identity and state:
The figure below shows TileManager re-prioritising tiles during inertial scroll — deeper blue = higher priority; the chequerboard zone is the "not yet ready" placeholder:
FIG 13视口在网页中跳来跳去(模拟惯性滚动),cc::TileManager 实时调度每个 Tile 的光栅化优先级。As the viewport jumps across the page (simulating inertial scroll), cc::TileManager re-orders raster priority for every tile in real time.
Before a Tile task lands on a Raster thread, it travels this full chain — from the vsync-triggered BeginMainFrame down to SingleThreadTaskGraphRunner::ScheduleTasks:
256×256 is the empirical sweet spot. Mobile uses 256 (narrow screens, small animation range); desktop uses 512 (large viewport amortises the overhead). Tile size is not a constant — it is chosen per device.
Intuition says "small layer = small tile, big layer = big tile" should be most efficient. cc doesn't do this, because tiles aren't just "chunks" — they're the entire GPU memory system's "currency unit":
Uniform size → reusable texture pool. GPU texture allocation is extremely slow (tens of µs to ms). cc maintains a ResourcePool caching "allocated but idle" texture slots — when a tile needs one, grab from the pool, O(1). But that requires all tiles to be the same size: with mixed sizes, the pool buckets by size, large slots sit unused while small ones are starved — heavy fragmentation. Uniform 256/512 = every slot in the pool is fully interchangeable — the foundation of ResourcePool's hit rate.
统一尺寸 → 跨 Tab 共享内存。Chrome 多 Tab 共用同一个 GPU 进程的纹理池。如果 Tab A 用 256, Tab B 用 333, Tab C 用 512,池子被三种尺寸切碎,共享几乎不可能。统一标准让 30 个 Tab 共用一池——就像内存分页用 4KB 一个标准,不会按文件大小做动态页大小。
Uniform size → cross-tab memory sharing. Chrome's tabs share one GPU-process texture pool. If Tab A uses 256, Tab B uses 333, Tab C uses 512, the pool is sliced three ways and sharing is nearly impossible. Standard size lets 30 tabs share one pool — just like memory paging uses one 4KB standard, not dynamic page sizes per file.
Uniform size → simpler algorithm. TileManager's priority calc uses grid-coord distance fields — different grid pitches per layer would force normalisation before comparison. Uniform 256 makes "row N, col M" equivalent across all layers; priority compares directly.
So "per-device" 256 vs 512 is fine (same device → all layers use the same size → pool stays uniform); but "per-layer" dynamic isn't — it would gut the entire cache-reuse mechanism of ResourcePool. The same trade-off as an OS picking 4KB vs 2MB Huge Pages: the reuse dividend of one standard far outweighs the savings of "fitting each case".
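Why uniform size buys O(1) reuse, in code. A toy ResourcePool: because every slot is interchangeable, acquire is a pop and release is a push, and size-bucket fragmentation cannot happen:

```cpp
#include <cassert>
#include <vector>

// Toy ResourcePool: all tiles share one texture size, so any idle slot
// satisfies any request. Pool hits skip the slow GPU allocation entirely.
struct ResourcePool {
  std::vector<int> idle_slots;  // ids of allocated-but-unused textures
  int allocations = 0;

  int Acquire() {
    if (!idle_slots.empty()) {
      int id = idle_slots.back();
      idle_slots.pop_back();
      return id;               // pool hit: no GPU allocation
    }
    return allocations++;      // pool miss: slow GPU allocation
  }
  void Release(int id) { idle_slots.push_back(id); }
};
```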
预测光栅化 · 先低后高
Predictive raster · low first, high later
Chromium 还做一件事:"先粗后细"。首次合成图块时降低分辨率("LOW resolution tiling"),等优先级转成 NOW 后再补上高分辨率版本。这样首屏看起来"立刻有内容",但内容会在第二三帧"清晰一下"——你在 4G 网络下加载长页时常会看到这种行为。
Chromium also does "coarse first, fine later". The first composite of a tile is rendered at lower resolution ("LOW resolution tiling"); the high-res version follows once it gets bumped to NOW. The screen has "something" immediately, but you'll see content "sharpen" a frame or two later — common when you load a long page over 4G.
TileManager 与 ImageDecodeCache · 共用 TaskGraph
TileManager & ImageDecodeCache · sharing the TaskGraph
Tile tasks aren't islands — cc::ImageDecodeCache drops JPEG/PNG/WebP decode tasks into the same TaskGraph, consumed by the same Raster threads. Decode and rasterisation share one CPU pool. That's why image-heavy pages stutter more on low-end devices — Raster threads get monopolised by decoding.
3 个判读规则: ① 视口内必须全 NOW + HIGH 分辨率 — 任何一个 SOON 就是 jank 信号;② SOON tiles 数量 = 用户即将看到的内容量,这个数大表示页面"纵向接续" 多,惯性滚动会更平滑;③ LOW resolution 数量 > 0 = 上一帧 raster 跟不上 GPU 上传需求,系统主动降级 — 短暂可接受,持续就是性能崩盘。TileManager 调度的全部目标:让视口里永远是绿色 NOW,SOON 在路上,EVENTUALLY 别动。3 reading rules: ① viewport must be entirely NOW + HIGH res — any SOON is a jank signal; ② SOON tile count = volume of about-to-show content, larger = more "vertical continuity", smoother inertial scroll; ③ LOW resolution count > 0 = last frame's raster couldn't keep up with GPU upload demand, system actively downgraded — temporarily OK, sustained = perf collapse. TileManager's entire goal: keep the viewport green-NOW, SOON in flight, EVENTUALLY untouched.
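The binning rule can be sketched in one function. NOW / SOON / EVENTUALLY follow the article; the 512px pre-paint band is an invented threshold for the sketch:

```cpp
#include <cassert>

// Toy TileManager binning: a tile's vertical distance from the viewport
// decides which priority bin it lands in.
enum Bin { NOW, SOON, EVENTUALLY };

Bin PriorityBin(int tile_top, int tile_bottom,
                int viewport_top, int viewport_bottom) {
  if (tile_bottom >= viewport_top && tile_top <= viewport_bottom)
    return NOW;                                   // intersects the viewport
  int dist = (tile_top > viewport_bottom) ? tile_top - viewport_bottom
                                          : viewport_top - tile_bottom;
  return dist < 512 ? SOON : EVENTUALLY;          // one pre-paint band
}
```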
cc::TileTask[]已按优先级排序,丢入 TaskGraphprioritised, posted into the TaskGraph
STAGE 09 · CC PHASE
Raster — 把"绘画清单"翻成像素
Raster — translating the display list into pixels
绘画清单的真实落地
DisplayItemList playback into bitmaps / textures
Module
cc
Process
Render
Thread
Raster ×N
Output
Tile texture / bitmap
这一步在做什么
What it does
Raster 线程逐个执行 cc::TileTask——把 DisplayItemList 中属于该 Tile 的绘画指令"Playback" 到一块纹理(GPU SharedImage)或位图(共享内存)上。13 步里第一次真正"画像素"的一步。The Raster threads run cc::TileTasks one by one — "playing back" the DisplayItemList's draw commands that fall in this tile onto a texture (GPU SharedImage) or bitmap (shared memory). The first stage in the 13 that actually paints pixels.
为什么不能跳过
Why not skip
DrawQuad 需要的"图片资源" 必须事先存在。Raster 是把 cc 的"指令"变成 GPU 能采样的"纹理"的唯一桥梁。没有 Raster,Display 阶段无 quad 可贴。DrawQuads require pre-existing "image resources". Raster is the sole bridge from cc's "instructions" into GPU-samplable "textures". Without Raster, Display has nothing to sample.
Three tiles dispatch to 3 Raster threads in parallel, each calling Skia to replay the relevant slice of the DisplayItemList onto its own SharedImage. Simultaneously, ImageDecodeCache hands airing.png to a 4th Raster thread for PNG decoding; the result lands directly into yet another SharedImage:
Notice #r4 is separate from #r1: the avatar doesn't belong to any tile — it's a standalone resource. When Tile #t1 plays back the avatar's DisplayItem, it actually writes a reference (DrawImageRect(SharedImage_id=#r4, dst_rect=avatar)); the GPU samples #r4 only at final composite time. This is the heart of "resource independent + reference assembled" — the same avatar can be shared across many tiles and layers without duplication. If on this frame at 9ms #r4 hasn't finished decoding, Tile #t1's playback still proceeds — the avatar slot just paints transparent; once #r4 is ready next frame, the avatar appears. This is the physical source of "avatar appears grey first, then loads".
"Playback" is not a metaphor — given the Tile's DisplayItemList, the Raster thread re-executes every DisplayItem in order, outputting onto the target buffer. This is exactly why Paint doesn't paint and Raster does — the same instruction list can be played back at different scales, on different tiles, on different devices. It's the heart of Chromium's performance model.
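The "grey avatar first" behaviour falls out of a simple rule: playback never blocks on a decode, it just counts what was missing. A sketch (image keys are plain strings here; real cc keys are PaintImage ids):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy tile playback: draw-image items that reference an undecoded image
// paint a transparent placeholder instead of blocking. The tile still
// activates, and num_missing rides back to Viz as "another frame needed".
int PlaybackTile(const std::vector<std::string>& image_refs,
                 const std::set<std::string>& decoded) {
  int num_missing = 0;
  for (const auto& ref : image_refs)
    if (!decoded.count(ref)) ++num_missing;  // slot painted transparent
  return num_missing;
}
```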
同步 vs 异步光栅化
Sync vs async rasterisation
浏览器走的是异步分块光栅化;移动 OS 与 Flutter 走同步光栅化。两条路线各有优势——下面这张对照能让你看清边界:
Browsers run async tiled raster; mobile OSes and Flutter run sync raster. Each route has its strong suit. The boundary, side-by-side:
SYNCHRONOUS
同步光栅化Synchronous raster
Android / iOS / Flutter · 间接像素缓冲
Android / iOS / Flutter · indirect pixel buffer
内存占用 Memory footprint: A+
首屏性能 Cold-start TTI: B
动态变化 Dynamic content: B
图层动画 Layer animation: C
低端机 Low-end devices: C
ASYNCHRONOUS · TILED
异步分块光栅化Async tiled raster
Chromium / WebView · Raster thread
Chromium / WebView · Raster thread
内存占用 Memory footprint: D
首屏性能 Cold-start TTI: C
动态变化 Dynamic content: C
图层动画 Layer animation: A+
惯性滚动 Inertial scroll: A
总结一句:浏览器内核的性能,大半是用内存换来的。异步光栅化给惯性滚动和 CSS 动画带来绝对优势,但代价是内存占用极高 · 快速滚动会白屏 · 滚动中 DOM 更新可能不同步。
In one line: browser-engine performance is mostly bought with memory. Async raster gives inertial scroll and CSS animations their unfair advantage, at the cost of massive memory · white screens during fast scroll · DOM updates that may desync mid-scroll.
The figure below runs both strategies live — left: sync freezes the screen at each "raster" moment (yellow RASTER flash), scroll proceeds in discrete steps. Right: async scrolls continuously, but the viewport edges show chequer placeholders (raster hasn't caught up); tiles fill in one by one. The thread strips below explain who is moving and who is idle.
SYNC
同步光栅化 · 串行 · raster→composite→displaysync raster · serial · raster→composite→display
[Lanes] Main (RASTER… bursts) · GPU · Display
看出来: Main thread 一直在 raster (满格铜色),屏幕只在每次 raster 末尾"跳一下"。8 帧/4 秒 = 2 fps 的视觉节奏。每多滚 1px 就要重 raster 整屏。Look: the Main thread is always rastering (full copper bars), and the screen only "jumps" at the tail of each raster. 8 frames in 4s = 2 fps visual rhythm. Every extra px of scroll re-rasters the entire screen.
看出来: Main thread 全程闲(灰条),Compositor 与 GPU 不停跑(满格),Raster 1/2/3 三条线程并行各自处理 tile。视口里出现棋盘格的瞬间——那是raster 还没追上;但屏幕每帧都在动,不冻结。这就是 60fps 的代价:多 3 条 Raster 线程 + 一堆 SharedImage 内存。Look: the Main thread idles all the way (grey strip), Compositor and GPU run continuously, Raster 1/2/3 process tiles in parallel. The chequer cells in the viewport are tiles raster hasn't caught up to yet — but the screen moves every frame, never freezes. This is the cost of 60fps: three extra Raster threads + a chunk of SharedImage memory.
FIG 14·anim两种光栅化策略的实时对照。Sync 那侧整屏冻结、离散滚动;Async 那侧持续滚动、棋盘占位。彩蛋: 这个动画自身只用 transform 和 opacity,所以它正是它讲的"纯 Compositor 动画"的实例——你读这段字时,Main 和 Raster 都没在跑这个动画的任何一帧。The two raster strategies, live. Sync freezes the entire screen and scrolls in discrete steps; Async scrolls continuously and shows chequer placeholders. Easter egg: this very animation uses only transform and opacity, which means it is itself an instance of the "pure Compositor animation" it describes — while you read this caption, neither Main nor Raster runs a single frame of this animation.
Different hardware / accel capability / config map to different RasterBufferProvider subtypes. Their difference is really about "how the raster output reaches GPU memory" — fewer copies the better:
SharedImage is Chromium's abstraction over GPU data storage — it replaced the older Mailbox mechanism. The architecture is a classic Client / Service split:
FIG 14SharedImage:多个 Client 都能直写 GPU 内存,由 GPU Process 上的唯一 Service 协调。这是 cc Raster 与 Viz 之间能"零拷贝"传纹理的底座。SharedImage: many Clients write directly to GPU memory, coordinated by the single Service in the GPU process. This is the substrate that makes textures travel from cc Raster to Viz with zero copies.
Got <img> on the page? JPEG / PNG / WebP decoding also happens here — cc::ImageDecodeCache orchestrates the Raster threads to decode asynchronously: decode tasks and tile tasks share the same TaskGraph.
Raster thread count is fixed (typically ≤ 4, tied to CPU cores). A large image decode takes 10–80ms; while it holds a thread, tile tasks queue behind it. Critical viewport content gets delayed — the root cause of "image-heavy pages scroll-jank + render-slow".
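The starvation is plain arithmetic once you model the shared lane. Durations below are illustrative, taken from the 10–80ms decode range quoted above:

```cpp
#include <cassert>
#include <vector>

// Toy shared TaskGraph lane: tasks run in order on one Raster thread, so a
// long image decode pushes back every tile task queued behind it.
struct Task { int duration_ms; };

int FinishTimeOfLast(const std::vector<Task>& lane) {
  int t = 0;
  for (const Task& task : lane) t += task.duration_ms;
  return t;
}
```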
The difference between all RasterBufferProvider subclasses really hides in two virtuals — AcquireBufferForRaster() decides where the texture is borrowed from, RasterBuffer::Playback() decides how the DisplayItemList lands on it:
The string of "depends_on_*" booleans hides a core secret of cc — it asks: does this tile's output depend on any "not-yet-decoded image"?. If yes, cc cannot reuse the previous tile's pixels for partial raster — the whole tile re-queues. This is why "scrolling onto a WebP-heavy region often shows the chequer first" — the precondition fails, and the tile gets re-scheduled from scratch.
The GPU raster path on Skia rides on DDL (Deferred Display List). Skia physically splits "building the drawing commands" from "submitting them to the GPU" into two threads: the Raster thread records the DDL (no GL context), the GPU thread replays it (owns the GL context).
This is why Chromium's desktop rendering doesn't need to "funnel every GL call back to the main thread" — Raster threads only build command buffers, while the GPU context stays exclusive to the Viz process from beginning to end. Many tabs rasterising in parallel = dozens of Raster threads each writing their own DDL, queued for the single GPU thread to replay. "Record / Replay" is the real hinge of Chromium's multi-threaded rendering.
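Record / Replay in miniature. A string list stands in for Skia's deferred display list; the point is the split: recording needs no GL context, replay owns it:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy DDL: Raster threads append commands with no GPU context; the single
// GPU thread replays the finished list later.
struct CommandList { std::vector<std::string> cmds; };

CommandList RecordOnRasterThread() {             // no GL context needed here
  return {{"drawRect", "drawImage", "drawText"}};
}

int ReplayOnGpuThread(const CommandList& ddl) {  // owns the GL context
  int executed = 0;
  for (const auto& c : ddl.cmds) { (void)c; ++executed; }
  return executed;
}
```

Dozens of tabs each recording their own list, one GPU thread draining them: that queue is the "hinge" the paragraph above describes.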
[Legend] NOW · ready / NOW · raster pending / LOW-res placeholder / chequer · not yet rasterised / viewport
RASTER THREADS · 6 ms. Raster 1: tile #t1 → #t5 → #t9; Raster 2: tile #t2 → #t6; Raster 3: tile #t3 → #t7; Raster 4: decode avatar.png → tile #t8.
Axis: 0 · 3 ms · 6 ms.
看 4 件事: ① 视口内(黄框)必须全绿 — 否则就是棋盘;② 视口外的 LOW-res(对角线)是预渲染,会逐步 upgrade 到 NOW;③ 右侧 4 条 Raster 线程满载并行,Raster 4 上 Image decode 跟 tile raster 共用 TaskGraph(C14 那段说的);④ 棋盘格密集 = TileManager 优先级排错或 raster 跟不上 — 用 cc.debug.scheduling 找瓶颈。4 things to look at: ① the viewport (yellow border) must be all green — otherwise chequer; ② LOW-res (diagonal) outside the viewport is pre-render, gradually upgraded to NOW; ③ the 4 Raster lanes on the right run in full parallel; Raster 4 shares its TaskGraph between image decode and tile raster (the C14 footgun); ④ dense chequer = TileManager priority misorder or raster can't keep up — use cc.debug.scheduling to find the bottleneck.
pending → active → recycle, the buffering you didn't see
Module
cc
Process
Render
Thread
Compositor
Output
Active LayerImpl Tree
这一步在做什么
What it does
把已经光栅化好的 Pending Tree "翻"成可被 Draw 使用的 Active Tree。这一步是 Compositor thread 上的原子切换:切换前后屏幕始终能看到一帧合法画面。Promote the now-rasterised Pending Tree into a draw-ready Active Tree. The switch is atomic on the Compositor thread — the screen always sees a valid frame, before and after.
为什么不能跳过
Why not skip
没了 Activate,光栅化与上屏就成了串行:要么等所有 tile 画完再上屏(卡顿),要么边画边上屏(撕裂)。三棵树的中间层是把这两个矛盾解开的设计。Without Activate, raster and display would serialise: either wait for every tile, then display (stalls) or display while painting (tearing). The triple-tree middle layer is what unties the knot.
All 3 tiles successfully rastered — IsReadyToActivate() returns true. The Compositor thread swaps the active_tree_ and pending_tree_ pointers: the old pending (freshly rastered) becomes active; the old active (last frame) becomes recycle. From this instant, Draw emits quads from the new active.
Counter-example: if avatar #r4 hasn't decoded yet, #t1 is a "partial raster with placeholder"; IsReadyToActivate() may still return true (partial raster counts as ready). But num_missing_tiles > 0, and this number rides the CompositorFrame metadata back to Viz — Viz knows another frame is likely needed. "Activated ≠ complete" is a key fact about this stage — it only guarantees "drawable", never "complete".
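A minimal sketch of the pointer rotation, assuming `Tree` and the helper names are hypothetical stand-ins (only the `*_tree_` field names follow the LayerTreeHostImpl snippet below):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Sketch of the three-tree rotation. It is "atomic" in the sense that all
// the swaps happen inside one task on the Compositor thread.
struct Tree { std::string label; };

struct TreeHost {
  std::unique_ptr<Tree> active_tree_;   // being drawn (read-only to cc)
  std::unique_ptr<Tree> pending_tree_;  // being rastered / committed into
  std::unique_ptr<Tree> recycle_tree_;  // parked, awaiting the next Commit

  // ActivateSyncTree()-style swap: pure pointer moves, no allocation.
  void Activate() {
    recycle_tree_ = std::move(active_tree_);  // old frame -> recycle
    active_tree_ = std::move(pending_tree_);  // rastered frame -> screen
  }

  // The next Commit reuses the recycled allocation instead of a new tree.
  void BeginCommit(const std::string& label) {
    pending_tree_ = std::move(recycle_tree_);
    if (!pending_tree_) pending_tree_ = std::make_unique<Tree>();
    pending_tree_->label = label;
  }
};
```

Note that `Activate()` never frees anything: the retired active tree survives in `recycle_tree_`, which is exactly the "no per-frame alloc/free" property the third tree buys.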
三棵树 · 各司其职
Three trees · each with one job
Compositor thread 同时持有三棵 LayerImpl 树:
The Compositor thread holds three LayerImpl trees at the same time:
class LayerTreeHostImpl {
  // Tree currently being drawn.
  std::unique_ptr<LayerTreeImpl> active_tree_;

  // In impl-side painting mode, tree with possibly
  // incomplete rasterized content.
  // May be promoted to active by ActivateSyncTree().
  std::unique_ptr<LayerTreeImpl> pending_tree_;

  // Inert tree with layers that can be recycled
  // by the next sync from the main thread.
  std::unique_ptr<LayerTreeImpl> recycle_tree_;
};
两棵不够:如果只有 Pending + Active,每次 Commit 都要等 Active 用完再回收,主线程要等 Compositor 一次 Draw 周期,无法连续提交。
Two trees is not enough: with just Pending + Active, every Commit would have to wait for Active to be done before recycling, so the Main thread waits a Compositor draw cycle — no continuous commits.
Activation has a precondition: IsReadyToActivate() must return true — meaning every viewport tile on the Pending tree is rasterised. If a far-away tile or an image decode hasn't finished, activation is deferred until NotifyReadyToActivate fires. That's the physical source of "checkerboard tiles during fast scroll" — Pending isn't ready, so Active is still last frame.
为什么是三棵 LayerImpl,不是两棵? · 三个并发态需要三块"各自占位"Why three LayerImpl trees, not two? · three concurrent phases need three independent slots
Pending + Active 两棵看起来够了——但实际跑下来,每一帧都会卡一下。三棵的真正理由,是"下一帧的准备工作 ‖ 当前帧的展示 ‖ 上一帧的回收" 这三件事在物理上同时存在:
Two trees (Pending + Active) seem enough — but in practice, every frame would stall briefly. The real reason for three: "preparing the next frame ‖ displaying the current frame ‖ recycling the previous frame" all coexist in time:
// Timeline · time-slicing across the three trees
t = 0ms   Active  ─▶ drawing frame N on screen   // the GPU is reading it
          Pending ─▶ receiving frame N+1's Commit
          Recycle ─▶ idle (reserved for N+2)
t = 4ms   Pending ─▶ raster complete, ready to Activate
          Active  ─▶ still drawing frame N (GPU not done)
          Recycle ─▶ idle
t = 8ms   vsync · ActivateSyncTree() swaps the pointers
          old Active  ──▶ Recycle (reused by the next Commit)
          old Pending ─▶ Active (frame N+1 goes on screen)
          new Pending ◀─ Main thread starts committing frame N+2
三棵刚好对应三个角色:
Three trees, three roles:
Active · 当前帧,GPU 正在采样它的 tile 纹理。这棵树只能读,不能改——一改 GPU 就读到撕裂数据。
Active · the current frame, GPU is sampling its tile textures. This tree is read-only — mutate it and the GPU sees torn data.
Pending · the next frame's worksite, Raster threads pour new tiles in, Main's Commit also writes here. Must be independent of Active, or the rule above is violated.
Recycle · the "resting position" for the just-retired Active tree. When the next Commit arrives, simply rename Recycle to Pending (pointer swap) and reuse its memory slot. Without Recycle, every Commit would allocate a fresh LayerImpl tree — GC + fragmentation cost would be brutal.
Only 2 trees (Active + Pending): when Activate promotes Pending to Active, the old Active is immediately discarded — next Commit must build a fresh one. One alloc + one free per frame, 60 times/sec at 60Hz. Each LayerImpl carries Tile references + property-tree copies; allocating multi-MB object trees that often is expensive.
So "three" is forced by the pipeline's three concurrent states: "being drawn" / "being built" / "waiting to be reused". The same pattern as OS process scheduling (running / ready / blocked), and database WAL (active / clean / recycled). Any "consume + produce + recycle" system's minimum viable config is three slots.
DEVTOOLS
Performance > "Activate Layer Tree" 事件;Layers 面板看 pending vs activePerformance > "Activate Layer Tree" event; Layers panel for pending vs active
3 个观察点: ① Active 指针在每个 vsync 整数倍切换 = 健康节奏(60fps);② Pending 提前 ~12ms 完成 build 才跟得上 — 跟不上(图中 frame N+4)就 Activate 推迟,Active 还显示老 frame,用户感知"这一帧没动";③ Recycle 永远满载 — 它总是上一个被替换下来的 Active,留给下一次 Commit 复用,这就是"不需要每帧 alloc/free LayerImpl" 的源头。3 watch points: ① Active pointer flips at every vsync = healthy rhythm (60fps); ② Pending finishes build ~12ms early to keep up — when it can't (frame N+4 here), Activate is delayed, Active stays on the old frame, and the user perceives "this frame didn't move"; ③ Recycle is always full — always holding the just-retired Active, ready for next Commit's reuse. This is the source of "no per-frame alloc/free of LayerImpl".
遍历 Active Tree 的每个 LayerImpl,调用 AppendQuads 生成一组 viz::DrawQuad,封装为 viz::CompositorFrame,发给 Viz Process。这一步不动 GPU 一根毛——它生产的是"指令脚本"。Walk the Active tree's LayerImpls, call AppendQuads on each, produce a batch of viz::DrawQuad, wrap them in a viz::CompositorFrame and ship to the Viz process. The GPU is not touched here — what's produced is an "instruction script".
为什么不能跳过
Why not skip
Render Process 不能直接画屏幕——OS 把 GPU 上下文交给 GPU/Viz Process。所以 Render 必须把"我想画什么"序列化成 CF,由 Viz 执行。这是多进程隔离的代价。A Render process cannot draw to the screen directly — the OS hands the GPU context to the GPU/Viz process. So Render must serialise "what to draw" into a CF that Viz executes. This is the price of multi-process isolation.
On the Active tree, AppendQuads runs: the main PictureLayerImpl emits 2 TileDrawQuads (one per tile); the standalone .follow layer emits 1. Because box-shadow flipped render_surface_reason_ on in Pre-paint, the shadow becomes its own RenderPass. The final viz::CompositorFrame:
Four meaningful details: ① the shadow gets its own RenderPass + #r5 temporary texture — this is box-shadow's real GPU cost, the larger the blur radius the bigger the temp texture; ② the two main-layer TileDrawQuads share one SharedQuadState (same transform/clip/opacity) — Viz computes the matrix once; ③ the avatar doesn't appear in the quad list — its reference is baked into #r1's tile texture (Raster painted it in); to Viz, #r1 is just one solid 256×88 chunk of pixels; ④ the whole CF travels via LayerTreeFrameSink::SubmitCompositorFrame over Mojo IPC to the Viz process — at this instant, the Render process is done with this frame.
The hairiest one is PictureLayerImpl::AppendQuads — it walks all "currently visible tiles" of the layer and emits Quads based on each tile's state: rasterised → TileDrawQuad; in-viewport-but-unrasterised → SolidColorDrawQuad placeholder (background colour + chequer); missing but low-res available → fall back to the low-res tier. The skeleton:
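A hedged sketch of that per-tile decision — not the real PictureLayerImpl::AppendQuads (which iterates tilings through iterators); every type here is a simplified stand-in:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a tile's raster state.
struct Tile { bool rastered; bool low_res_available; };

// Per visible tile: rastered -> TileDrawQuad; missing but low-res tier
// available -> fall back; otherwise emit a chequer placeholder and count
// it, so the number can ride the CompositorFrame metadata to Viz.
std::vector<std::string> AppendQuads(const std::vector<Tile>& visible_tiles,
                                     int* num_missing_tiles) {
  std::vector<std::string> quads;
  *num_missing_tiles = 0;
  for (const auto& tile : visible_tiles) {
    if (tile.rastered) {
      quads.push_back("TileDrawQuad");           // happy path
    } else if (tile.low_res_available) {
      quads.push_back("TileDrawQuad(low-res)");  // fall back a tier
    } else {
      quads.push_back("SolidColorDrawQuad");     // chequer placeholder
      ++*num_missing_tiles;
    }
  }
  return quads;
}
```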
Two details form the contract behind smooth scrolling: ① shared SharedQuadState — the few hundred TileDrawQuads from a single layer share one transform / clip / opacity, Viz computes the matrix once; ② num_missing_tiles bubbles into the frame metadata — Viz reads it to know "how chequered is this frame?" and, if needed, can defer Activate until the next Raster cycle catches up. That's the actual shape of the "feedback loop" between cc and viz.
By default, all DrawQuads land in the same RenderPass — "draw onto the root surface". But when a layer has an effect that requires off-screen compositing (e.g. filter: blur(8px), mix-blend-mode, mask-image), cc creates a dedicated RenderPass: render the subtree to a temporary texture, then reference it back into the main pass as a RenderPassDrawQuad.
backdrop-filter reads the background pixels and blurs them. cc spins up a RenderPass: render every layer it covers to a temporary texture, run the blur shader, then composite back into the main pass. Every frame pays for an off-screen pass + a GPU blur — double cost in VRAM and bandwidth.
DrawQuad 的 6 种类型
Six flavours of DrawQuad
TileDrawQuad
最常见——一个 Tile 块。DisplayItemList 被 cc 光栅化后就变它。
The default — one tile. DisplayItemLists become these after cc rasterises them.
TextureDrawQuad
引用一份 GPU 资源——Canvas / WebGL / Video 都走它。
References a GPU resource — Canvas / WebGL / Video all take this path.
SolidColorDrawQuad
纯色矩形。最便宜的 Quad。
A solid-coloured rectangle. The cheapest quad on the menu.
RenderPassDrawQuad
引用另一个 RenderPass 的 ID——给嵌套特效用。
References another RenderPass by ID — for nested effects.
SurfaceDrawQuad
嵌入另一个进程的 Surface——OOPIF / OffscreenCanvas 的关键。
Embeds a Surface from another process — the linchpin of OOPIF and OffscreenCanvas.
PictureDrawQuad
里面直接装 DisplayItemList——目前只 Android WebView 用。
Carries a DisplayItemList directly — only Android WebView uses this today.
LayerTreeFrameSink · 把 CF 寄出去
LayerTreeFrameSink · the parcel office
CF 装好之后,cc 调用 LayerTreeFrameSink::SubmitCompositorFrame(local_surface_id, frame, hit_test_data) 把它通过 Mojo IPC 投到 Viz 进程。Render Process 的渲染至此结束——剩下的事归 Viz 管。
Once the CF is packed, cc calls LayerTreeFrameSink::SubmitCompositorFrame(local_surface_id, frame, hit_test_data) and ships it to Viz over Mojo IPC. The Render process's rendering work ends here — everything that follows belongs to Viz.
为什么叫 "Submit" 而不是 "Draw"WHY "SUBMIT", NOT "DRAW"cc 在这里强调"提交"——它不直接画像素,只是把"应该画什么"寄给 Viz。如果 Viz 没空(GPU 忙、vsync 错过),CF 会被排队甚至丢弃。Submit 的成功 ≠ 上屏。Render 进程通过 BeginFrameAck 才知道"自己上一次的 Submit 上屏了没"。cc emphasises "submit" here — it doesn't paint pixels, it ships "what to paint" to Viz. If Viz is busy (GPU is full, vsync missed), the CF queues or even drops. Submit succeeding ≠ on-screen. The Render process learns whether its last submission landed via BeginFrameAck.
Submit 与 Draw,Chromium 词汇里到底是什么区别? · "下单" vs "下厨"Submit vs Draw — what's the precise difference in Chromium-speak? · "placing the order" vs "cooking it"
In Chromium's vocabulary, these two words are never interchangeable, but they look so similar they're easy to confuse. The simplest analogy: Submit is "placing the order", Draw is "cooking it".
Submit
Draw
谁在做Who
cc (Render Process · Compositor thread)
Viz (GPU Process · GPU thread)
在做什么Doing what
把 CompositorFrame 通过 Mojo IPC 寄出去ships CompositorFrame via Mojo IPC
真的调 GL/Vulkan 把像素画到 framebufferactually calls GL/Vulkan, paints pixels to framebuffer
"Submit success" only tells you "cc packaged its work". Web Vitals' LCP/CLS cannot use Submit time — they must use actual on-screen time. That's exactly why Chrome internals have FrameMetrics, which after the Display stage uses BeginFrameAck to feed "did this frame reach the screen?" back to the Render process.
The "queue period" between Submit and Draw is Viz's load buffer. The GPU is occasionally busy (other tabs running heavy animations); Viz can queue 2-3 frames of CF. Once it exceeds 3, old ones drop (kSkipped) — cc, via BeginFrameAck, learns "my last frame was wasted work" and decides whether to degrade the next (lower resolution, skip animation frame). "Submit is push, Draw is pull, queue in between" is the essence of this architecture.
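The drop-oldest buffering can be sketched as follows — the cap of 3 and the kSkipped label follow the text above, while the class itself is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model of Viz's load buffer between Submit (push) and Draw (pull).
class FrameQueue {
 public:
  // Submit is push: if the queue is full, the oldest CF is dropped and
  // its fate (reported back to cc via BeginFrameAck) is returned.
  std::string Submit(const std::string& frame) {
    std::string dropped;
    if (queue_.size() == 3) {                     // buffer at most ~3 CFs
      dropped = queue_.front() + ": kSkipped";    // wasted work for cc
      queue_.pop_front();
    }
    queue_.push_back(frame);
    return dropped;  // empty string == nothing dropped
  }
  std::size_t depth() const { return queue_.size(); }

 private:
  std::deque<std::string> queue_;
};
```

Seeing `kSkipped` in the ack is cc's cue to degrade the next frame (lower resolution, skip an animation step) rather than keep pushing work the GPU will never show.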
Analogy: very much like Git's git push vs git merge — push only "uploads commits to the remote" (which may reject); merge is "actually integrating into trunk". Chromium pushes this "submit ≠ land" semantic to the limit, and any perf monitor that conflates Submit with Draw timestamps gives wrong numbers.
Active LayerImpl Tree + rasterised tile textures
→
OUTPUT
viz::CompositorFrame · 通过 Mojo IPC 投到 Viz · delivered to Viz via Mojo IPC
STAGE 12 · VIZ PHASE
Aggregate — 多个进程的 CF 合成一帧
Aggregate — many CFs become one frame
把跨进程的 Surface 拍平
flattening surface trees across processes
Module
viz
Process
GPU (hosts viz)
Thread
Skia / Display Compositor
Output
Aggregated CF
这一步在做什么
What it does
Viz 把当前所有"活着"的 CF(来自 Browser UI、每个 Render、每个 OOPIF、每个 OffscreenCanvas)按 SurfaceId 引用关系铺平成一份 Aggregated CF。Viz takes every live CF (Browser UI, every Render process, every OOPIF, every OffscreenCanvas) and flattens them — following SurfaceId references — into a single Aggregated CF.
为什么不能跳过
Why not skip
屏幕只有一块。多进程产出多份 CF,必须有人决定它们的层叠与裁剪。Aggregate 是 Site Isolation × 流畅渲染的结合点。There's only one screen. Multiple processes produce multiple CFs — someone has to decide their stacking and clipping. Aggregate is where Site Isolation meets smooth rendering.
The variant for this stage: drop the card into an imaginary "third-party praise wall" page. The card is rendered by Render B (ursb.me origin), the parent page by Render A. Both processes submit their CFs to Viz; SurfaceAggregator flattens them into one:
3 件让人惊叹的事: ① Render A 永远拿不到名片的真实像素——它只持有一个 SurfaceId,具体的 #r1~#r5 由 Viz 进程持有。这是 Site Isolation 的图形侧实现,跨域 iframe 的安全边界靠这一刀刻出来;② 变换矩阵会跨边界相乘——父页面给名片 Surface 应用的 T_card 与名片自己内部的 T_inside 在 Viz 里乘起来,等价于"名片直接画在父页面坐标系上";③ 裁剪求交可能让整张卡白白渲染——如果父页面把名片的 clip 设成 0×0(可能因 overflow:hidden 滚出视口),Viz 会跳过整张卡的所有 quad,GPU 一根毛不动,但 Render B 的 cc 仍然在背后默默 raster——这就是"不可见的 OOPIF 也消耗 CPU 但不消耗 GPU"。
3 things to marvel at: ① Render A can never see the card's real pixels — it only holds a SurfaceId; the actual #r1~#r5 live in the Viz process. This is Site Isolation's graphics-side implementation; the cross-origin iframe security boundary is carved here; ② transform matrices multiply across the boundary — the parent's T_card applied to the card's Surface times the card's own internal T_inside equals "the card painted directly into the parent's coordinate system"; ③ clip intersection can make the whole card render in vain — if the parent clips the card to 0×0 (e.g. scrolled out via overflow:hidden), Viz skips every quad of the card, the GPU doesn't move, but Render B's cc is still silently rastering in the background — this is "invisible OOPIFs cost CPU but not GPU".
Aggregate walks depth-first: start from the root surface (typically Browser UI), every SurfaceDrawQuad hit triggers a jump to the referenced Surface, copy its RenderPasses and DrawQuads (with proper transform + clip), then continue.
FIG 17SurfaceAggregator 把分布在多个进程的 CF 合成一帧。OOPIF 之所以"隔离但顺滑",靠的是这个步骤。SurfaceAggregator merges CFs scattered across processes into a single frame. OOPIF stays isolated yet seamless because of this stage.
这套 ID 是跨进程引用的核心——一个 OOPIF 知道父页面的 SurfaceId,但拿不到真实 GPU 纹理;一切只通过这个 ID 由 Viz 在合成时解引用。
This is how cross-process references work — an OOPIF knows the parent page's SurfaceId but cannot reach its actual GPU textures; the dereference happens inside Viz during aggregation.
Damage 跟踪 · 不是每帧都"全合"
Damage tracking · not every frame is fully aggregated
Aggregate has built-in diffing. Each frame, SurfaceAggregator computes damage_rect — the actual area changed since the previous frame. The GPU only redraws that area; the rest is reused from last frame's Front Buffer. A static page + one spinning badge can mean only a few hundred pixels of GPU work per frame.
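A toy version of the damage computation, assuming axis-aligned rects and a made-up `DamageRatio` helper (real tracking lives in viz and is tracked per RenderPass):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified stand-in for gfx::Rect.
struct Rect { int x, y, w, h; };

// Bounding union of two rects.
Rect Union(const Rect& a, const Rect& b) {
  int x1 = std::min(a.x, b.x), y1 = std::min(a.y, b.y);
  int x2 = std::max(a.x + a.w, b.x + b.w);
  int y2 = std::max(a.y + a.h, b.y + b.h);
  return {x1, y1, x2 - x1, y2 - y1};
}

// Fraction of the viewport the GPU actually has to redraw this frame.
// Empty damage => static frame, the front buffer is simply reused.
double DamageRatio(const std::vector<Rect>& damages, const Rect& viewport) {
  if (damages.empty()) return 0.0;
  Rect d = damages[0];
  for (std::size_t i = 1; i < damages.size(); ++i) d = Union(d, damages[i]);
  return double(d.w) * d.h / (double(viewport.w) * viewport.h);
}
```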
CASE · OOPIF
为什么跨域 iframe 也能"完美贴合"父页
Why cross-origin iframes still composite seamlessly
父页面 Render A 的 CF 里包含一个 SurfaceDrawQuad,引用 OOPIF Render B 的 SurfaceId。两进程独立提交 CF 到 Viz;SurfaceAggregator 在 Viz 里把它们用 变换矩阵 + 裁剪 rect 拼好。父页面永远拿不到 OOPIF 的像素,但屏幕上看起来天衣无缝——这就是 Site Isolation 的图形侧实现。
The parent page's CF (Render A) contains a SurfaceDrawQuad referencing OOPIF Render B's SurfaceId. Both processes submit CFs to Viz independently; SurfaceAggregator stitches them together with transform + clip rect. The parent never sees the OOPIF's pixels, yet the screen looks seamless. This is Site Isolation's graphics-side implementation.
HandleSurfaceQuad · 跨进程指针的展开
HandleSurfaceQuad · expanding the cross-process pointer
The flatten algorithm's key hook is SurfaceAggregator::HandleSurfaceQuad. Each DFS visit to a SurfaceDrawQuad expands the referenced surface's whole subtree inline into the main RenderPass, threading "child coordinate system → parent" transforms and clips the whole way:
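The two operations it threads through the recursion can be sketched with translation-only transforms and axis-aligned clips (simplified stand-ins for gfx::Transform / gfx::Rect):

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Translation-only "transform" -- enough to show the concatenation.
struct Offset { int dx, dy; };
struct Clip { int x, y, w, h; };

// Parent transform x child transform (here: just add the offsets).
Offset Concat(const Offset& parent, const Offset& child) {
  return {parent.dx + child.dx, parent.dy + child.dy};
}

// Parent clip ∩ child clip. An empty intersection means the whole child
// surface is pruned -- no quad copied, no GPU memory allocated.
std::optional<Clip> Intersect(const Clip& a, const Clip& b) {
  int x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
  int x2 = std::min(a.x + a.w, b.x + b.w);
  int y2 = std::min(a.y + a.h, b.y + b.h);
  if (x2 <= x1 || y2 <= y1) return std::nullopt;  // prune subtree
  return Clip{x1, y1, x2 - x1, y2 - y1};
}
```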
Two pieces of math deliver the "perfectly stitched" look: ① transform multiplication — parent frame's transform × child RenderPass's transform = the child quad's final on-screen pose; ② clip-rect intersection — parent's clip ∩ child surface's clip = the actual visible region. If the intersection is empty, the whole child surface is never rasterised — no GPU memory allocated. That's why "invisible OOPIFs cost no GPU".
为什么 SurfaceAggregator 用 DFS,不是 BFS? · 渲染顺序就是深度顺序Why does SurfaceAggregator use DFS, not BFS? · rendering order IS depth order
"Wouldn't BFS be faster? Shallower levels, cache-friendly." — a fair question. But Aggregate is a step in a rendering pipeline, and DFS is forced by three hard constraints:
z-order is depth-order, not breadth-order. "Child surface paints on top of parent surface" is the HTML/CSS stacking rule — picture a surface tree with Browser UI at the root and the deepest OOPIF at the leaves. The correct paint order: start from the root, draw self, then immediately recurse into the first child's entire subtree, then into the second child's, etc. — that's exactly DFS pre-order. Under BFS you'd paint all level-1 surfaces, then all level-2, but two level-N surfaces have no relation to each other and get interleaved across subtrees — z-order breaks.
RenderPass dependency = child before parent. In SurfaceAggregator's output RenderPass list, when a parent references a child Pass via RenderPassDrawQuad, the child Pass must appear earlier in the list (GPU executes in list order). DFS post-order naturally produces a "leaves first, root last" list — a topological sort. BFS gives no such guarantee — you'd need a separate topo-sort pass, doubling the cost.
Earliest possible pruning. When DFS recurses into a child surface, it can immediately compute "parent transform × child transform" and "parent clip ∩ child clip" — if the intersection is empty, the entire subtree is skipped, not even one quad is copied. BFS finishes level 1 before level 2 — either compute every transform/clip up front (waste), or discover the empty clip late (waste). DFS's "back off on empty clip" is natural pruning.
Bottom line: DFS naturally fits SurfaceAggregator on three axes — z-order = depth-first, RenderPass dependency = topological order, clip pruning = early backoff. BFS looks friendlier but each property costs extra code. "Pick the right traversal and half the algorithm is free" is a common phenomenon in graph engineering.
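The topological-order property of the second point is easy to demonstrate: a DFS post-order emit over a toy surface tree always lists children before the parent that references them (the tree shape here is hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A surface with nested child surfaces. std::vector supports the
// incomplete element type since C++17.
struct Surface {
  std::string name;
  std::vector<Surface> children;
};

// DFS post-order: recurse into all children first, then emit self.
// The resulting list is a topological sort -- leaves first, root last --
// which matches "child RenderPass must appear before its referencing
// parent" in the GPU's execution order.
void EmitPasses(const Surface& s, std::vector<std::string>& out) {
  for (const auto& child : s.children)
    EmitPasses(child, out);
  out.push_back(s.name);
}
```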
DEVTOOLS
chrome://compositor-thread-rendering-stats · Performance > "Frame submitted to display"
看 4 件事: ① Surface 数量 = Browser UI + 每个 Render + 每个 OffscreenCanvas;② damage_rect 占 viewport 比例(本帧 8.4%)— 越小,Display 阶段 GPU 工作量越小;③ "PRUNED" 标记 = 子 surface 因裁剪/damage 空被跳过,connected to "不可见 OOPIF 不耗 GPU" 那条规则;④ 嵌套深度 > 4 是危险信号 — 说明 OOPIF 套 OOPIF 套 OffscreenCanvas,SurfaceAggregator 的 DFS 会变贵。4 things to watch: ① Surface count = Browser UI + every Render + every OffscreenCanvas; ② damage_rect as % of viewport (8.4% this frame) — smaller = less GPU work in Display; ③ "PRUNED" marker = child surface skipped due to empty clip or damage, the "invisible OOPIF costs no GPU" rule; ④ Nesting depth > 4 is a red flag — OOPIF nested inside OOPIF nested inside OffscreenCanvas makes the DFS expensive.
N 份 CF · N CFs · Browser UI · Render × N · OffscreenCanvas…
→
OUTPUT
Aggregated CF + damage_rect
STAGE 13 · VIZ PHASE
Display — 像素终于上屏
Display — pixels finally reach the screen
字节走完最后一步 · the bytes' final step
DrawQuads → GL/Skia → vsync → photons
Module
viz
Process
GPU (hosts viz)
Thread
Skia / GPU main
Output
on-screen pixels
这一步在做什么
What it does
把 Aggregated CF 里的 DrawQuad 翻译成实际 GPU 调用,画到 Back Buffer;vsync 一来,Display::DrawAndSwap 把 Back / Front 互换,新一帧就出现在屏幕上。Translate the Aggregated CF's DrawQuads into real GPU calls, paint them into the Back Buffer; on vsync, Display::DrawAndSwap swaps Back / Front, and the new frame appears on screen.
为什么不能跳过
Why not skip
这是 13 步流水线的唯一真实操作 GPU 的一步。前 12 步都在"准备"——分类、组织、序列化、调度——而 Display 是把指令真的执行下去的那一刻。This is the only step of the 13 that actually drives the GPU. The previous 12 stages all "prepare" — classify, organise, serialise, schedule — Display is the moment instructions actually execute.
After all 12 stages, the Aggregated CF lands in the SkiaRenderer in Viz. SkiaRenderer records every quad into one SkDeferredDisplayList, hands it to the GPU thread → SkSurface::draw replays it → OutputSurface::SwapBuffers() → the card lights up before your eyes.
Try this now: hover over the Follow button on the real Airing card at the top of this article (the interactive one in the Main-line example chapter). The entire execution path of the hover animation:
This is what 13 stages and 20 years of engineering bought you — a single transform animation, from input to on-screen, crosses 3 processes and 5 thread segments, yet every segment only moves the bare minimum it must. One Transform-tree node mutates; one transform matrix on .follow LayerImpl mutates; one 53×32 tile re-rasters; one TileDrawQuad re-emits; the GPU repaints 1696 pixels; SwapBuffers. "What's not recomputed — that is performance itself" — this is the true meaning of C5's epigraph.
但要注意CAVEAT同一次 hover 还同时改了 background ——这个动画走的是 Paint 路径,Main thread 会 被叫起来。所以"纯 Compositor 动画" 在真实代码里很少 100% 纯。名片用的是混合动画,这正是真实业务的样子。要 100% 纯合成,只改 transform / opacity / filter 即可。The same hover also mutates background — that path goes through Paint, and Main thread does wake up. So "pure Compositor animation" is rare in real code at 100%. The card uses a hybrid animation, which is what real product code looks like. To stay 100% on the Compositor, only mutate transform / opacity / filter.
SkiaRenderer's core is the SkDeferredDisplayListRecorder (DDL) — it doesn't paint immediately, but records every RenderPass's draw operations into a DDL. When all RenderPasses are recorded, SkiaOutputSurfaceImpl::SubmitPaint ships the whole batch to SkiaOutputSurfaceImplOnGpu for one execution on the GPU thread.
FIG 18SkiaRenderer 的延迟绘制流:DrawQuads 先在 Compositor thread 录成 DDL,最后在 GPU thread 上 SkSurface::draw 一次性执行。SkiaRenderer's deferred-draw flow: DrawQuads recorded into a DDL on the Compositor thread, then the GPU thread runs SkSurface::draw in one shot.
GLRenderer (deprecated) tunnels through the CommandBuffer: GL calls on the Compositor thread don't really execute — they're serialised into a command byte stream, posted via InProcessCommandBuffer to the GPU process's CrGpuMain thread, where the real OpenGL ES happens. The split decouples GL caller ↔ real driver — and is what makes the security sandbox possible.
Every modern graphics stack double-buffers. The Front Buffer is what the screen reads; the Back Buffer is where you paint. At vsync, Display::DrawAndSwap swaps the pointers and the new frame is on display at the next refresh — the screen never sees a half-painted frame.
Front Buffer
A
Back Buffer
B
VSYNC
FIG 18.VVSync 的每次"▼",Front 与 Back 互换。屏幕永远从 Front 读,所以从不闪烁。At every "▼" of vsync, Front and Back swap. The screen always reads Front — and never flickers.
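The flip itself is just a pointer swap — a sketch, with strings standing in for buffer memory:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy double buffer: paint into Back, flip at vsync. The scanout engine
// only ever reads Front, so it never observes a half-painted frame.
class DoubleBuffer {
 public:
  void Paint(const std::string& frame) { back_ = frame; }  // may be partial
  void SwapAtVsync() { std::swap(front_, back_); }
  const std::string& OnScreen() const { return front_; }

 private:
  std::string front_ = "frame 0";
  std::string back_;
};
```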
Triple buffer · 牺牲一点延迟换流畅
Triple buffer · trading a touch of latency for smoothness
Mobile OSes (Android Surface Flinger / iOS) default to triple buffering: while the GPU is still painting frame N and the screen still displays N-1, the CPU can already prepare frame N+1. The cost is +1 frame of input latency; the reward is far less stutter — no party ever waits for another. Chromium employs a similar strategy on most desktop / mobile platforms.
"Why not just translate quads to glDrawArrays directly?" That was GLRenderer's path (now deprecated). SkiaRenderer chose DDL (SkDeferredDisplayList) for three reasons:
GL 上下文不能跨线程。OpenGL/Vulkan 规范要求同一时刻一个 GL 上下文只能被一个线程访问(make_current 的语义)。如果 SkiaRenderer 在 Compositor 线程上直接调 GL,就要把 GL 上下文 make_current 到 Compositor 线程——但 Compositor 线程除了渲染还要处理输入、滚动、动画,GL 上下文的独占性会变成串行瓶颈。DDL 解耦了"构造命令"与"提交命令":Compositor 线程构造 DDL(无 GL 上下文,纯内存操作),GPU 线程独占 GL 上下文执行 DDL,两条线程真正并行。
GL contexts can't cross threads. OpenGL/Vulkan specs require only one thread can access a GL context at a time (make_current semantics). If SkiaRenderer called GL directly on the Compositor thread, you'd have to make_current to that thread — but Compositor also handles input, scroll, animation, and the GL context's exclusivity becomes a serial bottleneck. DDL decouples "building commands" from "submitting commands": Compositor builds DDL (no GL context, pure memory ops), GPU thread exclusively owns the GL context and replays the DDL — true parallelism.
Batched submission = minimum state changes. GL "state changes" (swap shader, swap texture, swap blend mode) are extremely slow on GPUs. Quad-by-quad GL means a state switch per quad — most of the GPU's time goes to swapping state, not painting pixels. Skia, when recording the DDL, reorders commands to batch same-state draws together (like a database batching queries) — setShader once, draw 100 quads, then switch state. 3-5× faster than quad-by-quad GL in practice.
Pluggable backends = Skia abstraction dividend. The same DDL can feed Skia's GL backend, Vulkan backend, Metal backend (macOS), Dawn (WebGPU), even a software backend. GLRenderer hard-coded GL; adding Vulkan / Metal / Graphite would have meant a full renderer rewrite each time. SkiaRenderer covers all of them in one codebase — switching backends is just swapping the SkSurface implementation. This is why Chrome 122+'s "SkiaGraphite" experiment (moving Skia rendering onto the modern Graphite backend) only touches SkiaRenderer, not cc — the dividend of a clean architecture.
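The state-batching win from the second point can be sketched by counting state switches before and after grouping draws by state key. This is a toy cost model — real Skia batching must also respect draw overlap, so it cannot reorder arbitrarily:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A draw with a GPU "state key" (think: shader + texture + blend mode).
struct Draw { std::string state_key; };

// Each transition between different keys is one expensive state change.
int StateSwitches(const std::vector<Draw>& draws) {
  int switches = 0;
  for (std::size_t i = 1; i < draws.size(); ++i)
    if (draws[i].state_key != draws[i - 1].state_key) ++switches;
  return switches;
}

// Group draws by state key; stable_sort keeps the original order inside
// each bucket (a stand-in for the overlap constraints a real batcher obeys).
std::vector<Draw> BatchByState(std::vector<Draw> draws) {
  std::stable_sort(draws.begin(), draws.end(),
                   [](const Draw& a, const Draw& b) {
                     return a.state_key < b.state_key;
                   });
  return draws;
}
```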
In essence, DDL is Skia's intermediate representation for the "multi-threaded graphics API era" — same kind of thing as LLVM IR for compilers, or Mojo IDL for IPC. "Record + replay" is the path of any system that needs to decouple "construction" from "execution". Chromium uses it at the very end of the rendering pipeline, locking the GL context's serial bottleneck inside one GPU-process thread once and for all.
DEVTOOLS
chrome://gpu · Performance > GPU lane · Rendering > FPS meter
The previous 18 chapters sliced the pipeline open, one chapter per stage. But in real life all 13 stages are running at the same time — some have hard serial constraints (must wait), many can run in parallel (everybody works at once). The figure below puts the card's one-frame composition back onto the time axis: who moves, and when.
FIG 19名片在 16.7ms 内的真实时间线。3 件值得记的事: ① Main thread 真正"渲染相关"的时间只有约 6ms,剩下 10ms 全是 idle——可以塞 JS / microtask / rAF;② 3 条 Raster 线程并行,与 Compositor 的 Tiling/Activate 阶段重叠;③ Viz/GPU 在最后 4ms 被叫醒,做 Aggregate + DrawAndSwap + Swap,整个前 12ms 它在睡。The card's real timeline within 16.7ms. Three things worth remembering: ① the Main thread spends only ~6ms on rendering, the remaining 10ms is idle — perfect for JS / microtasks / rAF; ② 3 Raster threads run in parallel and overlap with the Compositor's Tiling/Activate; ③ Viz/GPU wakes up only in the last 4ms to Aggregate + DrawAndSwap + Swap — it sleeps the first 12ms.
Main thread 的一秒钟
A typical second on the Main thread
把上面那张帧时间线按 60 倍重复就是 1 秒。但 Main thread 的工作不只是渲染——还有 JS 执行、事件处理、microtask、setTimeout 回调。一个典型 1 秒(中等复杂度的 SPA)的 Main thread 时间分布大致如此:
Repeat that frame timeline 60 times — that's a second. But the Main thread does more than rendering: JS, event handlers, microtasks, setTimeout callbacks. A typical 1-second budget on the Main thread of a moderately complex SPA looks like this:
JS 35%
Style 12%
Layout 8%
Paint 5%
idle 40%
0ms · 250ms · 500ms · 750ms · 1000ms
JS 是头号竞争者——React 重 render、状态库 reducer、IntersectionObserver 回调,这些都在抢 Main thread。当 JS 一个长任务超过 50ms,整条流水线在那 50ms 都在排队:Style/Layout/Paint 都做不了,vsync 来了也只能丢帧。这就是为什么 Web Vitals 把 INP(Interaction to Next Paint)和 TBT(Total Blocking Time)放在前面——它们直接量"JS 占用 Main 多久"。
JS is the chief competitor — React re-renders, state-library reducers, IntersectionObserver callbacks all fight for the Main thread. When a JS long task exceeds 50ms, the entire pipeline queues up for those 50ms: Style/Layout/Paint cannot proceed, the vsync arrives only to drop the frame. That's why Web Vitals leads with INP (Interaction to Next Paint) and TBT (Total Blocking Time) — both measure "how long does JS hold the Main thread".
2024 +Scheduler.yield() 与 isInputPending(): 现代 Chromium 提供 scheduler.yield() 让 JS 主动让出主线程,以及 navigator.scheduling.isInputPending() 让长任务可以提前退让给输入事件。这两个 API 让"不要让 JS 阻塞渲染"从口号变成可量化的工程实践。Scheduler.yield() and isInputPending(): modern Chromium ships scheduler.yield() for JS to voluntarily yield the Main thread, plus navigator.scheduling.isInputPending() for long tasks to step aside for incoming input. These two APIs make "don't let JS block render" measurable rather than aspirational.
三种"掉帧"的物理来源
Three physical sources of "jank"
#
物理现象Physical event
看到什么What you see
1
Main 长任务Long Main-thread task
JS 跑了 80ms,5 帧没刷新——卡顿"段落式"出现JS ran 80ms, 5 frames missed — jank in "chunks"
2
Raster 跟不上Raster can't keep up
滚动时屏幕一直在动,但视口边缘棋盘格screen keeps moving while scrolling, viewport edges show chequer
3
GPU 排队GPU queueing
动画起步那一刹那"卡一下",之后顺畅(GPU 上了纹理)animation "hitches" at the very first frame, smooth afterward (GPU loaded textures)
The previous 18 chapters described the forward pipeline: bytes in, pixels out. But half of a browser's complexity hides in the reverse pipeline: a click, from a hardware interrupt, crossing 3 processes and 5 thread segments, eventually firing the next 13-stage round. This is the real topology behind RAIL's R (Response).
输入流水线 · 一次 click 的旅程Input pipeline · one click's journey
OS EventHardware IRQ
→
Browser · IOBrowser process
→
Browser · UIrouting & hit-test
→
Render · Compositortry handler
↘
Render · MainJS handler · setState
→
Style + Layout + Paintrender pipeline
→
Viz · GPUSwapBuffers
5 个关键节点:
Five key checkpoints:
OS → Browser IO 线程:操作系统通过 evdev / WindowProc / NSEvent 把硬件中断翻译成 InputEvent,塞进 Browser process 的 IO 线程消息队列。
OS → Browser IO thread: the OS translates the hardware interrupt into an InputEvent via evdev / WindowProc / NSEvent and enqueues it on the Browser process's IO thread.
Browser UI 路由 + hit-test:Browser 用 hit-test region(由 cc 提供的命中测试矩形列表)决定该事件归哪个 Render Process 的哪个 frame——OOPIF 的事件路由就靠这个。
Browser UI routes + hit-tests: Browser uses cc-supplied hit-test region (a list of hit-test rectangles) to decide which Render process's which frame owns this event — this is how OOPIF event routing works.
Render Compositor 先看一眼:输入事件优先送到 Render 的 Compositor thread。如果是滚动 / pinch / non-blocking touch,Compositor 自己处理就够(直接调整 scroll offset / transform),从不打扰 Main——这就是"滚动跑在 Compositor 上"的物理实现。
Render Compositor takes a first look: input goes to Render's Compositor thread first. If it's a scroll / pinch / non-blocking touch, the Compositor handles it alone (just adjust scroll offset / transform) and never wakes Main — this is the physical implementation of "scrolling runs on the Compositor".
Bounce 给 Main · 跑 JS handler:如果是 click / keypress / 注册了 active touch listener 的事件,Compositor 把事件转给 Main thread,这才轮到 JS handler 跑。passive: true 是关键标记——它告诉 Compositor"这个事件不会调 preventDefault",Compositor 可以在等 Main 处理的同时继续把后续的滚动事件按自己的节奏处理。
Bounce to Main · run the JS handler: if it's a click / keypress / event with active touch listener registered, Compositor forwards it to Main, where the JS handler finally runs. passive: true is the key flag — it tells the Compositor "this event will not call preventDefault", letting Compositor keep handling subsequent scroll events on its own cadence while Main works.
Trigger a new frame: the JS handler calls setState / changes className / mutates DOM → invalidates Style → on the next BeginMainFrame, the forward pipeline runs again. If the mutation only touches Compositor-only properties, Main doesn't even need to wake.
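The compositor-first triage in steps 3 and 4 can be sketched as a single decision function. Names and rules are simplified; the real logic lives around Chromium's InputHandlerProxy and is far more nuanced:

```cpp
#include <cassert>
#include <string>

// Where an input event gets handled first.
enum class Route { kCompositorOnly, kForwardToMain };

Route TriageEvent(const std::string& type, bool has_blocking_listener) {
  // Scroll / pinch are answered by adjusting scroll offset or transform
  // on the Compositor thread alone -- Main is never woken.
  if (type == "scroll" || type == "pinch") return Route::kCompositorOnly;
  // passive:true listeners promise not to call preventDefault, so touch
  // can also stay on the Compositor's own cadence.
  if (type == "touch" && !has_blocking_listener) return Route::kCompositorOnly;
  // click / keypress / blocking touch must run the JS handler on Main.
  return Route::kForwardToMain;
}
```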
Google's RAIL model classifies user-perceptible work by time budget: Response 100 ms · Animation 16 ms · Idle 50 ms · Load 1000 ms. The 100 ms Response budget is not "click to pixel" — it is "click to feedback-on-screen" (a spinner, a pressed-button state, a ripple all count). This gives the Compositor a precious "react fast, then process properly" window — every modern UI library leans on it (the :active pseudo-class, focus rings, ripple animations).
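As a memory aid, the four budgets can be written down as a lookup; the numbers are straight from the RAIL model above, while the helper name is mine:

```javascript
// The RAIL time budgets quoted above, as data.
const RAIL_BUDGET_MS = { response: 100, animation: 16, idle: 50, load: 1000 };

// Hypothetical helper: does a measured duration fit its RAIL budget?
function withinRailBudget(category, measuredMs) {
  return measuredMs <= RAIL_BUDGET_MS[category];
}
```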
click waits out the "maybe a double-click?" window (~300 ms by default, though browser heuristics often shrink it to ~100 ms). pointerdown fires immediately. Material Design's ripple appears the moment you press down precisely because it is bound to pointerdown, not click — exploiting RAIL's 100 ms react-first window.
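The pointerdown-versus-click ordering is easy to demonstrate with a bare EventTarget (used here so the sketch also runs outside a browser; on a page you would attach to the element itself):

```javascript
// "React first, process properly later": instant feedback on pointerdown,
// the real work on click.
const button = new EventTarget();
const log = [];
button.addEventListener("pointerdown", () => log.push("feedback")); // ripple / pressed state, fires immediately
button.addEventListener("click", () => log.push("action"));         // may sit in the double-click window
button.dispatchEvent(new Event("pointerdown"));
button.dispatchEvent(new Event("click"));
// log is now ["feedback", "action"] — feedback always lands first
```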
The original was written in 2022; the Chromium pipeline has kept moving for another three years since. This chapter pins the most meaningful recent changes back onto their stages — none reshapes the pipeline's skeleton, but each adds a new "hook" somewhere along it.
PRE-PAINT · Anchor Positioning (CSS Anchor). New properties anchor-name / position-anchor / inset-area let an element "fly with" another. This introduces a new Transform-node subclass in Pre-paint — anchor-position changes sync through cc to the Compositor without a round trip to Main. For the first time, "tooltip / popover tracks its trigger" runs entirely on the Compositor.
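A minimal sketch of the three properties in use (assumes a Chromium build with Anchor Positioning shipped; note that later spec drafts rename inset-area to position-area):

```javascript
// The tooltip-follows-trigger pattern from the text, as a style-sheet string.
const anchorCss = `
  #trigger { anchor-name: --trigger; }
  #tooltip {
    position: fixed;
    position-anchor: --trigger;  /* which anchor to follow */
    inset-area: top;             /* where to sit relative to it */
  }
`;

// Attach it in a page (guarded so the sketch also runs outside a browser):
if (typeof document !== "undefined") {
  const sheet = new CSSStyleSheet();
  sheet.replaceSync(anchorCss);
  document.adoptedStyleSheets = [...document.adoptedStyleSheets, sheet];
}
```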
COMPOSITING · Scroll-Driven Animations (CSS animation-timeline). animation-timeline: scroll() / view() drives a CSS animation's progress by scroll position rather than time. The entire animation runs on the Compositor thread — cc feeds the scroll offset straight into the animation interpolator, with no Main-thread involvement. Overnight, the "parallax / progress-bar" pattern that used to need IntersectionObserver + JS becomes a few lines of CSS.
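The progress-bar case in a few lines — the CSS form, plus the equivalent JS ScrollTimeline (feature-detected; the `#bar` element id is made up for the sketch):

```javascript
// CSS form: animation progress driven by root scroll position, not time.
const progressCss = `
  @keyframes grow { from { transform: scaleX(0); } to { transform: scaleX(1); } }
  #bar {
    transform-origin: left;
    animation: grow auto linear;
    animation-timeline: scroll(root);  /* scroll offset is the clock */
  }
`;

// JS equivalent via the Web Animations API, also Compositor-driven:
if (typeof document !== "undefined" && "ScrollTimeline" in globalThis) {
  document.querySelector("#bar").animate(
    [{ transform: "scaleX(0)" }, { transform: "scaleX(1)" }],
    { timeline: new ScrollTimeline({ source: document.documentElement }) }
  );
}
```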
AGGREGATE · View Transitions API. document.startViewTransition() brings native cross-state smooth transitions to SPA route changes. Under the hood it is SurfaceAggregator's snapshot-plus-cross-state composition: the old state is captured into a SharedImage, the new state renders normally, and Viz cross-fades / slides / scales the two surfaces during aggregation. From C17's perspective, this is the first time web developers can invoke SurfaceAggregator directly.
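A hedged sketch of wiring an SPA route swap through the API — updateDom is a placeholder for whatever mutates the page; the fallback path simply swaps without the Viz cross-fade:

```javascript
// Wrap a DOM update in a View Transition when the API exists.
function navigateWithTransition(updateDom) {
  if (typeof document === "undefined" || !document.startViewTransition) {
    updateDom();   // no snapshot, no cross-fade — plain swap
    return null;
  }
  // Old state is snapshotted, updateDom runs, Viz blends the two surfaces.
  return document.startViewTransition(updateDom);
}
```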
STYLE · @scope / @container / @starting-style. Three new at-rules add new sharding dimensions to the Style chapter's RuleSet. @scope gives RuleSet an extra scoped_rules_ bucket; @container makes a rule's match condition depend on an ancestor container's layout — breaking the old constraint that Style runs strictly before Layout, so Chromium implemented a two-pass layout-style-layout for Container Queries, the biggest style-system rework since LayoutNG.
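A minimal @container rule makes the two-pass requirement concrete: whether .card gets a grid layout depends on the laid-out width of its ancestor container (class names are invented for the sketch):

```javascript
// The layout-dependent match condition from the text, as a style-sheet string.
const containerCss = `
  .card-list { container-type: inline-size; container-name: cards; }

  /* Matches only after .card-list has been laid out at >= 400px wide —
     hence Chromium's layout → style → layout second pass. */
  @container cards (min-width: 400px) {
    .card { display: grid; grid-template-columns: 1fr 2fr; }
  }
`;
```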
RASTER · RasterInducingScroll (default-on). Early Chromium scrolled on the Compositor alone and never re-rastered mid-scroll, so fast scrolling showed checkerboarding. The new RasterInducingScroll strategy proactively triggers raster during inertial scrolling, trading a little CPU for "no checkerboard". Default-on since Chrome 122.
PROCESS · NetworkService as a separate process by default. In 2024 Chrome flipped NetworkService's default to a dedicated process (it could previously run in-process). That makes Stage 0's Mojo IPC genuinely cross-process rather than same-process message passing, and the sandbox runs deeper: even a pwned Render process cannot read raw cookies.
SYNTHESIS 04 · DEBUG GUIDE
Symptom reverse lookup — from jank back to a stage
By now you know what every stage does. But the question engineers actually ask in PRs is the reverse: page is janky / scroll is sluggish / animation drops frames / cold-start is white — which stage do I look at first? The table below maps common symptoms back to a stage, what to capture, and which tool to reach for.
Symptom
Suspect stage
First capture
Cold-start blank screen (LCP > 2.5 s)
Stage 00 + 02
Network panel for render-blocking resources · check whether any critical CSS / font escaped the PreloadScanner
Slow click response (INP > 200 ms)
C20 input + Main
Performance trace for the click handler's long task · split it with scheduler.yield()
Scroll jank, Compositor thread saturated
DevTools Layers panel · confirm the element has its own compositing layer · check whether will-change is taking effect
backdrop-filter is heavy
C16 Draw + C18 Display
Rendering panel: turn on "Layer borders" · check whether a separate RenderPass is produced · measure GPU usage
Layout thrashing on bulk DOM mutation
C8 Layout
Performance trace · look for forced-reflow warnings · batch mutations via requestAnimationFrame
Page suddenly turns smooth ~5 s after opening
C7 Style + V8
V8 JIT finished (bytecode → optimized code) · either wait for warm-up or pre-warm the critical paths
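The scheduler.yield() fix named in the INP row can be sketched as follows. scheduler.yield() is feature-detected with a setTimeout fallback because it is Chromium-only at the time of writing; chunkWork and handleClick are names invented for the sketch:

```javascript
// Yield back to the event loop, preferring scheduler.yield()
// (the continuation keeps its priority) over a setTimeout fallback.
async function yieldToMain() {
  if (typeof scheduler !== "undefined" && scheduler.yield) return scheduler.yield();
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Split one long task into budget-sized chunks (pure, so it is testable).
function chunkWork(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

// A long click handler, rewritten: input and rendering can run between chunks,
// which is what keeps INP under the 200 ms budget.
async function handleClick(items, processOne) {
  for (const chunk of chunkWork(items, 50)) {
    chunk.forEach(processOne);
    await yieldToMain();  // let pending input / a frame slip in here
  }
}
```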
UNIVERSAL WORKFLOW — 1. Record: capture 5-10 seconds in the Performance panel, including the moment the symptom appears. 2. Diagnose: see which thread is red — Main red = JS / Style / Layout / Paint; Compositor red = Tiling / Activate / Draw; Raster red = rasterization can't keep up; GPU red = pixel-throughput ceiling. 3. Capture: use the table above to land on a specific stage, then dive in with that stage's tool.
SYNTHESIS 05 · INDEX
Glossary — 64 key terms
A quick lookup for class names & concepts
Blink
Chromium's rendering engine, forked from WebKit in 2013. → CH 02
This piece has taken you deep enough to start working with the pipeline. If you want to dig further, the documents below are Chromium's and V8's primary sources; after reading them you'll be ready to contribute code or fix bugs.
OFFICIAL DESIGN DOCS · chromium.org · v8.dev · web.dev
«WebKit 技术内幕» (Chen Zihao) — a Chinese-language deep dive into rendering engines; based on early WebKit/Blink, but its pipeline skeleton matches this article's. «Inside Chromium» (Tom Dale, online) — module-level diagrams. «High Performance Browser Networking» (Ilya Grigorik) — the bible of the networking layer and the perfect companion to Stage 0. Read them alongside this piece and the loop is closed.
From bytes to pixels,
Chromium translates "a web page" into "light" in thirteen movements.
Every frame you see is this pipeline rehearsing the score in 16.7 ms.