A stream of bytes from the network has to cross thirteen stages, three processes and four property trees before it can light a single pixel. This is a field map of Chromium's rendering pipeline.
AUTHOR: Airing · TOPIC: Blink / cc / Viz / GPU · FORMAT: Long Read
Rendering pipeline · 13 stages
To users, a "browser" is a single product. Open the chest cavity and you see a set of replaceable parts. Before walking through Chromium's pipeline, fix two formulas in your head — they are the skeleton the rest of this story hangs on.
A quiet fact climbs out of the table: apart from Firefox and the long-buried IE, the mainstream browser world has converged on either Blink + V8 or WebKit + JavaScriptCore. It was a silent annexation.
The famed «browser war»
ended with only two engines still on the track.
Field Note · 02
WHAT IS A RENDERING ENGINE · Parses HTML / CSS / JS and draws the page. Firefox's Gecko alone bundles a dozen workgroups: a document parser, layout engine, style system, JS runtime, image library, networking (Necko), platform graphics adapters, a font library, a security library (NSS)… A rendering engine is never one thing — it is a factory.
In 2001 Apple lifted WebKit out of KDE's KHTML. Seven years later Google lifted its first Chromium engine out of WebKit. Five years after that, Google forked again — this time into Blink. The line is still alive today.
Plot the 22 years as a sequence of forks and you get the family tree below — a single picture of an engine being split, inherited and renamed.
FIG 02 · The WebKit family tree: solid lines for inheritance, dashed lines for forks. Blink today still carries vast tracts of Apple-and-WebKit ghost code.
Three moments of divergence
01
2001 · Apple lifts KHTML
A browser for Mac OS X. Safari 1.0 ships with KHTML rewritten into WebKit.
02
2008 · Google ships Chromium
Chromium ships with WebKit, but is born inside a multi-process architecture — the body type that would determine the fork to come.
03
2013 · Google forks Blink
Not about adding features — about losing weight. Blink's first big change deletes 8,000 files and 450,000 lines of code.
COMPATIBILITY · In Web Platform Tests' Interop reports, Blink sits consistently in the leading tier. Read it like this: the browser war split into two camps, but «the web» itself remains a shared standard — Blink's way of saluting WebKit, while quietly overtaking it.
The rendering engine decides what a page looks like; the JS engine decides what a page does. The two are neighbours — the JS engine usually runs as a module inside the rendering engine, yet stays independent enough to be lifted into Node.js, embedded firmware or IoT.
The names worth knowing:
GOOGLE
V8
C++ · with JIT, leaves the rest behind · Chromium / Node.js / Android WebView
APPLE
JavaScriptCore
A system-level API exposed to iOS apps (JIT disabled in the app sandbox)
MOZILLA
SpiderMonkey
One of the oldest JS engines · the heart of Firefox
FACEBOOK
Hermes
Built for RN · loads bytecode directly · no JIT, but excellent cold-start TTI
Place these engines side by side and you see the trade-off everyone is balancing: V8 trades double-digit megabytes of runtime for top performance; QuickJS trades performance for 210 KB and embeddability. Hermes walks the middle line — «fast cold-start, no JIT».
WHY MOBILE TURNS JIT OFF · JIT warm-up is long, so cold-start regresses; JIT also bloats binary size and memory. And when the system sandbox bans dynamically generated executable memory (as for iOS apps), JIT is simply impossible. That is the root reason JavaScriptCore inside iOS apps runs without JIT.
"JIT" inside V8 is not one thing — it's four tiers. As a function heats up, V8 promotes it through progressively more expensive but faster implementations:
FIG 03 · V8's four-tier JIT pipeline. A function is promoted as it heats up: Ignition for the first few runs, then Sparkplug, then Maglev, then TurboFan; once a type assumption breaks, the whole frame is deopted back to Ignition. Maglev is the middle tier added in Chrome 117 (2023), filling the "cold to hot" staircase out from 2 steps to 4 — the old Sparkplug → TurboFan jump was too steep.
Key design: compilation runs on background threads (--concurrent-recompilation, on by default). The Main thread only runs Ignition (zero compile overhead); background threads spot the hot functions and compile them, then atomically swap a dispatch-table pointer to the new version — the next call lands in Sparkplug/Maglev/TurboFan with no stop-the-world. This is how V8 "gets faster while running" without stalling.
Hidden classes + Inline caches · making a dynamic language "nearly static"
JS is dynamic — an object's shape can change at any moment. Yet in memory, V8 quietly assigns each "property set" a HiddenClass (Map), so property access can fetch by offset like a C++ struct. Combined with Inline Caches (IC), a single obj.x access can skip any dictionary lookup:
HIDDEN CLASS · A WORKED EXAMPLE · v8/src/objects/map.h
This is why "initialise properties in consistent order" is V8's golden rule. React/Vue's createElement, class field initialisation, the order of assignments in a constructor — all of these directly affect IC hit rate. Each step down the IC ladder costs an order of magnitude: Monomorphic ≫ Polymorphic ≫ Megamorphic.
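The golden rule is visible in ordinary application code. A sketch (the factory names are invented for illustration; hidden classes themselves are not observable from plain JS, so the example only contrasts the two initialisation patterns):

```javascript
// Two factories producing the "same" object. makeConsistent always
// initialises x before y, so every object shares one hidden class and
// a `p.x` access site stays monomorphic. makeShifty flips the order,
// so call sites that see both shapes degrade to polymorphic ICs.
function makeConsistent(x, y) {
  return { x: x, y: y };           // shape: {x, y} every time
}

function makeShifty(x, y, flip) {
  const o = {};
  if (flip) { o.y = y; o.x = x; }  // shape: {y, x}
  else      { o.x = x; o.y = y; }  // shape: {x, y} — a different Map in V8
  return o;
}

// A hot access site: with makeConsistent objects, the IC here
// only ever sees one shape.
function sumX(points) {
  let s = 0;
  for (const p of points) s += p.x;
  return s;
}

const pts = Array.from({ length: 4 }, (_, i) => makeConsistent(i, i * 2));
console.log(sumX(pts)); // → 6
```

If you run Node with --allow-natives-syntax, the %HaveSameMap(a, b) intrinsic can confirm the two factories really produce different shapes; it is a debugging-only API.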
Orinoco · V8's generational GC
V8's heap is split by age: a Young Gen (new objects, ~1-8 MB) and an Old Gen (long-lived objects, tens to hundreds of MB). The vast majority of objects die young (unreferenced within milliseconds of allocation) and aren't worth an expensive mark-sweep — so the Young Gen uses the Scavenger, a semi-space copying GC that scans only live objects; dead objects need no "collection" — they simply vanish.
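The semi-space idea fits in a few lines. A toy Cheney-style scavenge (purely illustrative and nothing like V8's real heap layout; all names invented): copy whatever the roots reach, and never even visit the rest:

```javascript
// Toy semi-space scavenge: copy only objects reachable from the
// roots into to-space; garbage is never touched at all.
function scavenge(roots) {
  const toSpace = [];                    // the survivor copies
  const forwarded = new Map();           // from-space obj -> its copy

  function copy(obj) {
    if (obj === null || typeof obj !== "object") return obj;
    if (forwarded.has(obj)) return forwarded.get(obj); // already moved
    const clone = { ...obj };
    forwarded.set(obj, clone);           // acts as a forwarding pointer
    toSpace.push(clone);
    return clone;
  }

  const newRoots = roots.map(copy);
  // Cheney scan: walk to-space linearly, copying whatever it references.
  for (let i = 0; i < toSpace.length; i++) {
    for (const k of Object.keys(toSpace[i])) {
      toSpace[i][k] = copy(toSpace[i][k]);
    }
  }
  return { newRoots, survivors: toSpace.length };
}

// 3 objects allocated, only 2 reachable — the dead one costs nothing.
const live = { name: "kept", child: { name: "also kept" } };
const dead = { name: "garbage" };        // no root points here
console.log(scavenge([live]).survivors); // → 2
```

Note the cost model: work is proportional to the number of survivors, not to the number of allocations, which is exactly why a nursery full of short-lived objects is cheap.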
What this means for web devs: creating heaps of short-lived objects (function return values, Array.map, JSX re-renders) is fine — the Scavenger sweeps them in ~1 ms. The real killer is objects that should have died but stayed accidentally referenced — they get promoted to Old Gen, and the Major GC that eventually collects them marks the whole heap, possibly stalling Main for tens of milliseconds. A memory leak in V8 doesn't show up as OOM — it shows up as "jank every few seconds": that's the Major GC stealing the Main thread.
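The retention pattern behind most of those leaks is mundane. A sketch (the cache names are invented) of the classic accidental root, next to the WeakMap variant that lets entries die with their keys:

```javascript
// An "accidental root": a module-level cache that outlives its entries.
// Everything keyed in here survives every Scavenge, gets promoted to
// Old Gen, and stays alive until explicitly deleted.
const leakyCache = new Map();

function rememberLeaky(key, payload) {
  leakyCache.set(key, payload);   // key AND payload are now pinned
}

// WeakMap keys are not counted as references: when the key object
// dies, the entry becomes collectable — no manual delete needed.
const safeCache = new WeakMap();

function rememberSafe(key, payload) {
  safeCache.set(key, payload);
}

let widget = { id: 1 };
rememberLeaky(widget, new Array(1000).fill(0));
rememberSafe(widget, new Array(1000).fill(0));

widget = null;                // our last strong reference is gone, but…
console.log(leakyCache.size); // → 1   (the Map still pins key + payload)
```

The WeakMap entry, by contrast, is now unreachable and free to be collected on the next cycle.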
V8 and the rendering pipeline · the "neighbour" relationship
V8 is not a "stage" of the rendering pipeline, but it squats inside every stage — JS handlers fill the gaps left by Style/Layout/Paint on the Main thread. A typical timeline for one 16.7 ms frame:
MAIN THREAD · 16.7 MS · WHO RUNS WHEN · v8 ↔ blink ↔ cc
 0 ms ─▶ vsync · the Compositor sends Main a BeginMainFrame
 0 ms ─▶ V8: event callbacks (click / keypress / setTimeout)
 2 ms ─▶ V8: microtask queue (Promise.then)
 3 ms ─▶ Blink: Style + Layout + Pre-paint + Paint
 6 ms ─▶ cc: Commit · Main blocked ~1 ms
 7 ms ─▶ V8: requestAnimationFrame callbacks (animation / last DOM writes before render)
 9 ms ─▶ V8: requestIdleCallback / scheduler.postTask low-priority tasks
14 ms ─▶ V8: idle · waiting for the next vsync
// V8 owns three windows: [0,3) [7,9) [9,14)
// one 50 ms JS long task → blocks 3 frames → INP > 200 ms
A 16.7 ms frame budget already reserves ~6 ms for Style/Layout/Paint, leaving ~10 ms for JS. A single JS task over 50 ms spans 3 frames — the browser's PerformanceObserver flags it as a Long Task, and Web Vitals charges that input's INP with the actual input-to-next-paint time, instantly 200 ms+. scheduler.yield() and scheduler.postTask({ priority }) exist precisely for this — to actively slice long tasks.
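Slicing is straightforward to sketch. A hedged example (processInChunks is an invented helper): run the loop in sub-5 ms slices and yield between them, preferring scheduler.yield() where the runtime exposes it:

```javascript
// Yield to the event loop between slices so input/rAF callbacks can
// run. scheduler.yield() (Chromium) keeps continuation priority;
// elsewhere we fall back to an ordinary macrotask.
const yieldToBrowser =
  typeof scheduler !== "undefined" && scheduler.yield
    ? () => scheduler.yield()
    : () => new Promise((resolve) => setTimeout(resolve, 0));

async function processInChunks(items, workFn, budgetMs = 5) {
  const results = [];
  let sliceStart = Date.now();
  for (const item of items) {
    results.push(workFn(item));
    if (Date.now() - sliceStart >= budgetMs) {
      await yieldToBrowser();   // let rendering/input run between slices
      sliceStart = Date.now();
    }
  }
  return results;
}

processInChunks([1, 2, 3, 4], (n) => n * n).then((r) =>
  console.log(r)                // → [1, 4, 9, 16]
);
```

The trade-off: total wall-clock time goes up slightly, but no single Main-thread turn exceeds the budget, so frames keep landing between slices.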
V8 vs JSCore vs Hermes · three design philosophies
Three engines take three roads: V8 pushes "peak speed in long sessions" to the limit; JSCore uses an LLInt interpreter (executable in the sandbox) to compensate for iOS's no-JIT rule; Hermes moves parse + bytecode-gen to build time — the APK ships bytecode directly, the app skips parsing on launch. There is no "best" engine, only the "best fit for this scenario".
2024+ · Maglev on by default + Sparkplug eager compilation: Chrome 117+ ships Maglev by default, lifting "warm function" performance from 5× to 30×. Chrome 121 added Compile Hints (Magic-Bytecode annotations) — sites can tell V8 via an HTTP header "take these scripts straight to Sparkplug, don't wait", trimming cold-start JS by another ~30%.
V8 doesn't just run JS — it also runs WebAssembly (Wasm). These two pipelines share the V8 process but virtually nothing else: different bytecode formats, different compilers, different heaps, different optimisation philosophies. Wasm carries types (i32/i64/f32/f64); no IC feedback needed. No GC (linear memory is manually managed), so the entire Orinoco machinery is absent on the Wasm side.
Liftoff's "streaming" is its sharpest trick: as Wasm bytes download from the network, V8 compiles concurrently — the decoder fires on the first byte, Liftoff compiles each function the moment its boundary arrives, and the main entry point can be ready before the byte stream finishes. A 10 MB Wasm bundle compiles end-to-end in ~50 ms on a GHz-class CPU; an equivalent amount of JS would need V8's full parse → bytecode → optimisation walk, easily hundreds of ms.
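The "types are in the bytes" point shows up in even the smallest module. Below is the canonical hand-assembled Wasm module exporting add(i32, i32) → i32 (byte layout per the Wasm 1.0 binary format); note the 0x7f bytes declaring i32 right in the type section, which is what lets a baseline compiler emit code per function with no profiling:

```javascript
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32,i32)->i32
  0x03, 0x02, 0x01, 0x00,                               // func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, 1 body
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0/1, i32.add, end
]);

// Synchronous compile + instantiate — fine for tiny modules; real apps
// should prefer WebAssembly.instantiateStreaming to overlap compilation
// with the download, which is exactly the Liftoff streaming path.
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
console.log(instance.exports.add(2, 3)); // → 5
```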
Wasm and the rendering pipeline · who owns Main?
Wasm runs on V8 → V8 runs on the Render process's Main thread → by default Wasm, JS and Style/Layout/Paint all share that single Main thread and block each other. But Wasm can do what JS cannot:
True multi-threading · via SharedArrayBuffer + Atomics + Web Workers, Wasm gets genuine shared-memory parallelism: Main triggers the work, Workers run the Wasm compute. Worker threads aren't the Main thread, so Wasm compute runs truly parallel with rendering on Main — something JS alone can't match (JS Workers can't touch the DOM directly; Wasm Workers do pure compute and never needed the DOM anyway).
SIMD · Wasm's 128-bit SIMD (v128) is explicit vectorisation. One SIMD add handles 4 float32 or 2 float64 — perfect for image processing, ML inference, crypto. JS has no SIMD (the SIMD.js proposal died years ago).
Predictable performance · no GC, no deopt → Wasm functions take nearly the same time on every call. Decisive for real-time audio/video (WebRTC codecs, AudioWorklet) — JS occasionally stalls for 50ms, Wasm never.
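The shared-memory primitive underneath all of this is small. A sketch of SharedArrayBuffer + Atomics (run here on a single thread for brevity; in a browser the buffer would be postMessage'd to a Worker, and only when crossOriginIsolated is true):

```javascript
// Shared memory + atomics: the substrate under Wasm threads.
const sab = new SharedArrayBuffer(4);   // one i32 slot
const counter = new Int32Array(sab);

// Atomic read-modify-write: safe even if a Worker increments the same
// slot concurrently — no torn reads, no lost updates.
for (let i = 0; i < 1000; i++) Atomics.add(counter, 0, 1);

console.log(Atomics.load(counter, 0)); // → 1000

// Atomics.wait / Atomics.notify add futex-style blocking on top of
// this — the Web-platform primitive that Wasm's atomic wait lowers to.
```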
Figma compiles its C++ rendering engine into Wasm; its DOM holds a single canvas; all graphics, layout and font work is computed in Wasm. Photoshop Web (an Adobe + Chrome team collaboration) does the same. Google Earth runs its 3D terrain in Wasm. For these apps, most CPU work is not on the Main thread — Wasm runs on Worker threads, and Main only pastes the result into the canvas (the cc::TextureLayer path).
2024+ · Wasm GC + JSPI: Wasm now has native GC (--experimental-wasm-gc, on by default since Chrome 119), letting Java/Kotlin/Dart compile to Wasm without bundling their own GC. JSPI (JavaScript Promise Integration) lets Wasm "suspend + await a Promise + resume" — synchronous-looking code running on JS's async machinery, fully bridging Wasm into the Web's async ecosystem.
Boot Chromium and you don't get a process — you get a city. A capital (Browser), a few suburbs (Render), an airport (Viz), some factories (Utility / Plugin). The map is what makes "one crashing tab won't take down the browser" possible.
Three districts are relevant to rendering: Browser · Render · Viz. The thread roster of each:
Imagine three tabs: foo.com / bar.com / baz.com. Inside foo.com, two iframes point at foo.com/other-url and bar.com. The cross-site iframe spawns an additional Render Process.
Note that the bar.com iframe in Tab 1 and Tab 2's bar.com share the same render process (same-site reuse), but live in a different process from Tab 1's foo.com because they're cross-site. Site Isolation moved the "render island" boundary from per-tab to per-site.
WHAT VIZ ACTUALLY DOES · Viz is the compositing-and-display service hosted in the GPU process (in current Chromium the "Viz process" and the "GPU process" are the same process; Viz is the service it hosts). It accepts the viz::CompositorFrame (CF) that each Render process and the Browser process produce, merges them with SurfaceAggregator, and uses the GPU to put the result in the window. Whatever you see on screen — Viz wrote it.
A short history of Site Isolation · how Spectre rewrote the process model
"One Render process per site" sounds like a founding design choice — it isn't. Before 2018, Chromium's process model was per-tab (one process per tab; cross-origin iframes shared their parent's process). It was a three-way compromise between performance / memory / security — per-tab gave enough "islands", iframe sharing saved a process per embed. Then January 2018 happened.
FIG 04 · A short history of Site Isolation. The 2018 Spectre disclosure was the watershed — attackers could leak memory at any address in the same process via branch-prediction side channels in JS, meaning a cross-origin iframe in the same process was no longer safe. The Chromium team spent 4 months and hundreds of bug fixes redrawing the renderer-process boundary from tab to site (eTLD+1); Chrome 67 turned it on by default, at a cost of 10-13% more memory.
Why Spectre made "same-process iframe" unsafe
Modern CPUs use branch prediction to speculatively execute likely-needed instructions — wrong predictions get rolled back. Spectre's insight: even rolled-back speculation leaves cache traces that can be probed. Carefully constructed JS branches can trick the CPU into speculatively reading any memory address in the same process, then recover the byte value via cache-hit timing. The Same-Origin Policy says "you can't read a cross-origin iframe's DOM" — Spectre says "I'll just read its physical memory location directly".
Site Isolation's fix is brutally simple: put cross-origin iframes in different processes. You can't read cross-origin content from your process — the bytes aren't in your address space. The cost: every embedded cross-origin iframe (ads, social buttons, third-party widgets) adds a Render Process; per-tab memory grew 10-13%. This is why, post-2018, a page that embeds a third-party widget such as Baidu Analytics carries an extra Render process.
SharedArrayBuffer (SAB) is the linchpin of JS multi-threading, but post-Spectre every browser killed it overnight — SAB itself provides high-precision timing (atomic counters), exactly what Spectre's side channel needs. Two years later, the COOP/COEP header pair resurrected it.
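The resurrection takes the form of two response headers on the top-level document, plus an opt-in on every cross-origin subresource (require-corp blocks anything unmarked):

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

# and on each cross-origin subresource the page embeds:
Cross-Origin-Resource-Policy: cross-origin
```

With both document headers in place, window.crossOriginIsolated reports true and SharedArrayBuffer is exposed again.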
The practical consequence: Figma / Photoshop Web / any Wasm-multi-threaded app must ship the full header set. Without them, SAB doesn't work, and Wasm threads become a hollow shell. This is why crossOriginIsolated is the 2024 admission ticket for high-performance Web apps.
2024+ · Origin-Agent-Cluster · isolation one notch finer: since Chrome 88, a page can send the Origin-Agent-Cluster: ?1 header to request "isolate me even from other same-site (eTLD+1) origins". By default a.example.com and b.example.com share a process (same site); with this header they split. Site Isolation is the default; Origin Isolation is the opt-in for high-security scenarios.
FURTHER READING
Official design docs · "Site Isolation"
If you want to watch a major refactor land in close-up, the Chromium team published the full Site Isolation design + retrospective at chromium.org/Home/chromium-security/site-isolation: threat model, every cross-process boundary (document.domain, window.opener, clipboard events), performance numbers — all in the open. Pair it with Charlie Reis' USENIX Security 2019 paper «Site Isolation: Process Separation for Web Sites within the Browser» — that's a textbook view of how this kind of "major surgery" actually lands.
The six thread segments: Network → Main → Compositor → Raster → Compositor → Skia
The rendering pipeline is a chain that turns network bytes into pixels. Chromium cuts the chain into thirteen stages — sliced across three processes, owned by three modules, run by six thread segments.
The master diagram below is the map for every chapter that follows. Four layers, all at once:
The master map answers "who works where", but not a deeper question: how long does each artifact live, and what is reusable across frames? The figure below charts the lifelines of 11 core data structures across the 14 stages, colour-coded by cacheability — green = cached across frames (cheap), yellow = partially cached (fragile), red = born fresh every frame (expensive). After reading it you can work backward: why is mutating a transform cheap? Because only the red-zone LayerImpl properties refresh — everything else stays green.
FIG 05B · Data-structure lifelines · X-axis: 14 stages; Y-axis: 11 core data structures. Left-leaning artifacts are mostly green (DOM / ComputedStyle / Property Trees / DisplayItemList / SharedImage all persist across frames); right-tail artifacts are mostly red (CF / Quad / Aggregated CF are born fresh each frame). This is the central design idea of the pipeline — push "what must not be recomputed" as far left, and as green, as possible.
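Read through the lifeline chart, the standard compositor-friendly hover pattern looks like this (a CSS sketch; the .follow class name is borrowed from the running example later in the article):

```
/* Compositor-friendly: transform/opacity mutate only the red-zone
   per-frame state; Property Trees, the DisplayItemList and the
   rastered tiles all stay green (cached). */
.follow {
  will-change: transform;            /* pre-promote to its own layer */
  transition: transform 150ms ease;
}
.follow:hover {
  transform: translateY(-2px);       /* Compositor thread only */
  /* a background-color change here would re-enter Paint on Main */
}
```

The same mutation expressed as top/left instead of transform would dirty Layout and repaint, dragging the whole left half of the chart back to red.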
A keyword for each stage
#    Stage      In → Out (keyword)
01   Parsing    bytes → DOM Tree
02   Style      DOM Tree → Render Tree (with ComputedStyle)
03   Layout     Render Tree → Layout Tree (with geometry)
04   Pre-paint  Layout Tree → Property Trees
Each stage exists not for elegance —
but to shrink the surface of what has to be recomputed when something changes.
Field Note · 02
MAIN-LINE EXAMPLE
The running example — one business card's journey
the card we'll watch through every stage
An abstract pipeline always slips out of memory — 3 processes, 6 thread segments, 13 stages, and an hour later you can't recall a single number. So from the next chapter onward, each chapter opens with a "Main-line · The Card after this stage" block, tracking what happens to one business card at every step.
This is that card — Airing's business card:
FIG 05.5 · The running example. Twenty lines of HTML+CSS, yet all 13 stages, 4 property trees and 3 processes can be told through this single card. Hover over the Follow button — the transform half of that hover animation runs without ever touching the Main thread.
On :hover the card mutates both transform and background — the Display-stage finale: transform stays on the Compositor, background drags Main back in. Real code rarely produces a 100% Compositor-pure animation.
Stage-by-stage transformation map
#    Stage        The card after this stage
00   Network      HTML bytes arrive + airing.png fired early by the PreloadScanner
01   Parsing      11 tokens → DOM stack 4 deep → a 6-node DOM tree
02   Style        5 rules split across 3 RuleMaps; a ComputedStyle attached to each node
03   Layout       LayoutNGFlexibleBox (card 340×88) finishes its two passes
04   Pre-paint    .follow gets a Transform tree node; the Effect tree gains 2 nodes (shadow + gradient)
05   Paint        a ~12-entry DisplayItemList; Save/ClipRRect/Restore wrap the avatar
06   Commit       two cc::Layers: the main one + a dedicated one for .follow
07   Compositing  .follow is promoted to its own GraphicsLayer thanks to will-change
··   Raster       each tile plays back its slice of the DisplayItemList; the avatar goes via the ImageDecodeCache
10   Activate     Pending Tree → Active Tree; all tiles ready
11   Draw         the main layer emits 2 TileDrawQuads; .follow emits 1; the shadow triggers a separate RenderPass
12   Aggregate    (variant) if embedded as an OOPIF, the parent references it via SurfaceDrawQuad
13   Display      SwapBuffers to screen. On hover, transform stays on the Compositor while background drags Main back in
HOW TO READ THE TABLE · This table is an index. After finishing the article you should be able to use it in reverse — see any card and predict what happens to it at every stage. If a row stops making sense, jump back to that chapter's "Main-line · The Card after this stage" block.
STAGE 00 · NETWORK
Loading — what happens before the first byte
network thread, mojo IPC, the preload scanner head-start
Module
network_service
Process
Browser
Thread
Network ×N
Output
bytes → Renderer
What it does
The Browser process's NetworkService streams HTML bytes to the Render process's blink::DocumentLoader via Mojo IPC. Meanwhile, the in-Render HTMLPreloadScanner races ahead of the main parser, spots the <img> / <link rel="stylesheet"> URLs and asks the Browser for a second batch of resources — sub-resources are on the wire before the main HTML is even fully parsed.
Why count it as a stage
The original counts 13 stages starting from Parsing and folds Loading into the Browser-process box. But 80% of the pipeline's P50 latency lives here — until the first byte arrives, the other 13 stages cannot even start. This chapter promotes Loading to a first-class stage.
From DNS lookup to the first byte arriving at the Renderer, the chain crosses Browser-process NetworkService → Mojo IPC → Render-process ResourceFetcher. The figure below unpacks it:
FIG 00 · The real topology of Loading. The Browser process's NetworkService owns all the low-level connections, cookies and cache; the Render process receives bytes via a Mojo DataPipe and, in the opposite direction, fires new sub-resource requests. The main parser blocks, the PreloadScanner races ahead — this is the secret of Chromium's cold-start speed.
Main-line · The Card after this stage
STAGE 00 · Network stage
Two URLLoaders in flight, side by side
The home HTML byte stream (a few KB) flows from URLLoader · main into blink::DocumentLoader::DataReceived. The moment HTMLPreloadScanner spots <img class="avatar" src="airing.png"> in an unblocked window, it calls ResourceFetcher::PreloadStarted back at the Browser process to fetch the avatar PNG. The avatar's GET request leaves before the main HTML even finishes downloading.
Early Chromium baked the network stack into the Browser process. Chrome 73 split it out as NetworkService — either in-process or as a standalone utility process. The split is not just engineering hygiene: a separate process means cookies and credentials live in their own sandbox. A pwned Render process can never read raw cookies — it can only ask Mojo for "the bytes of this URL", and NetworkService attaches the cookies on its behalf.
2024+ · Pre-connect + speculative prefetch: Chromium now uses chrome.predictors heuristics on link hover to pre-warm DNS / TCP / TLS, sometimes prefetching the HTML itself — all before Stage 0, in what is effectively "Stage -1". Combined with <link rel="modulepreload"> and the Speculation Rules API, perceived cold-start keeps shrinking.
DEVTOOLS
Network panel · check whether the Initiator column says (preload)
3 things to diagnose: ① Look in the Initiator column for the (preload) tag — every critical resource should carry it (the PreloadScanner's head-start succeeded); if it shows (parser) instead, the resource was discovered by the main parser, one beat late. ② A long queued segment (here analytics.js queued for 55% of its time) means the browser's per-origin concurrency cap (6 connections) blocked it and critical resources are stuck behind low-priority JS — fix with fetchpriority="high". ③ The TLS segment reveals whether HTTP/2 multiplexing kicked in (same-origin requests should share one connection); if not, every request redoes the handshake.
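Fixes ① and ② from that checklist map to a pair of attributes (an HTML sketch; airing.png and analytics.js are the example's resource names):

```
<!-- Make the head-start explicit instead of relying on the scanner: -->
<link rel="preload" as="image" href="airing.png" fetchpriority="high">

<!-- Demote the script that was clogging the 6-connection lane: -->
<script src="analytics.js" fetchpriority="low" defer></script>
```

After this, the Network panel should show the image's Initiator as (preload) and the analytics request sorted behind the critical path.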
URL · from the address bar / a link click / an API
→
OUTPUT
bytes · pushed to the Render process
STAGE 01 · DOC PHASE
Parsing — bytes to a DOM tree
bytes → characters → tokens → DOM
Module
blink
Process
Render
Thread
Main
Output
DOM Tree
What it does
Take the bytes coming out of the network thread and twist them, stage by stage, into a DOM tree hanging off blink::TreeScope.
Why five sub-stages
Every input has a distinct "shape" — bytes are a network stream, characters depend on encoding, tokens are a W3C standard, Element is a Blink data structure. Splitting them lets each layer stream incrementally and be reused — the same tokenizer feeds the Preload Scanner to fire requests early.
Parsing is the Main thread's opening act: take the bytes the Browser process's network thread hands over, and turn them into a living DOM tree. The data flow splits into five hand-offs:
Loading
bytes
Conversion
characters
Tokenizing
W3C tokens
Lexing
Element
DOM Build
DOM Tree
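The middle hand-off can be sketched in miniature. A toy tokenizer (nothing like Blink's spec-defined state machine, which has ~80 states and handles attributes, comments, CDATA and error recovery; this covers only bare well-formed tags):

```javascript
// characters in → StartTag / EndTag / Character tokens out.
function* tokenize(html) {
  let i = 0;
  while (i < html.length) {
    if (html[i] === "<") {
      const close = html.indexOf(">", i);
      const body = html.slice(i + 1, close);
      if (body.startsWith("/")) yield { type: "EndTag", name: body.slice(1) };
      else yield { type: "StartTag", name: body };
      i = close + 1;
    } else {
      const next = html.indexOf("<", i);
      const end = next === -1 ? html.length : next;
      yield { type: "Character", data: html.slice(i, end) };
      i = end;
    }
  }
}

const tokens = [...tokenize("<h2>Airing</h2>")];
console.log(tokens.map((t) => t.type).join(","));
// → StartTag,Character,EndTag
```

Because it is a generator, tokens stream out as characters arrive — the same property that lets Blink feed one tokenizer to both the tree builder and the Preload Scanner.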
STAGE 01 · Main-line · The Card after Parsing
11 tokens, a stack 4 deep, a 6-node DOM tree
Fed to the tokenizer, the card's source emits 11 tokens, and the DOM construction stack peaks 4 deep (article → div.info → one of h2 / p / a). When the stack empties, this DOM tree is left behind:
Notice: ComputedStyle is not yet attached (that's Style's job), the img isn't decoded (Raster's job), and the button has no idea it's about to be promoted into its own layer (Compositing's job). This bare DOM is the seed for every downstream stage.
PARSE ↔ FETCH ↔ EXEC · Hit <link> / <script> / <img> mid-tokenizing and the parser fires new network requests; hit <script> and the JavaScript must finish executing before parsing resumes — because document.write() may rewrite the DOM that follows. The «parse-and-wait» tax is one of the steepest costs of HTML parsing.
Read it bottom-up: the network thread hands in AppendBytes(char*), DecodedDataDocumentParser decodes the bytes into a String, and the result lands at the tokenizer's Append(String&) entry. Decoding follows the page's declared encoding (UTF-8 / GBK / ISO-8859-1) — get that wrong and everything downstream is wrong.
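The platform exposes the same Conversion primitive to JS as TextDecoder, which makes the "get the encoding wrong" failure easy to see. A minimal sketch:

```javascript
// Raw bytes + a declared encoding → a string. These six bytes are the
// UTF-8 encoding of the two CJK characters meaning "render".
const raw = new Uint8Array([0xe6, 0xb8, 0xb2, 0xe6, 0x9f, 0x93]);

const utf8 = new TextDecoder("utf-8").decode(raw);
console.log(utf8); // → "渲染"

// The same bytes through the wrong decoder: mojibake, not an error —
// every later stage (tokens, DOM, text rendering) inherits the damage.
const latin1 = new TextDecoder("iso-8859-1").decode(raw);
console.log(utf8 === latin1); // → false
```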
The key entrypoint: HTMLConstructionSite::CreateElement. Internally, a stack tracks currently-open Elements — HTML5's implicit close rules (a <div> appearing inside <p> auto-closes the <p>) are implemented through this stack:
HTML CONSTRUCTION SITE · STACK OPS · html_construction_site.h
HTML5 rule: <p> may contain only phrasing content; encountering a block element like <div> forces the <p> to close first. The stack is silently mutated — when the StartTag-div arrives, the constructor pops <p> before pushing <div>. The result: what you wrote as <p><div>...</div></p> is in fact three sibling nodes in the DOM — <p></p> + <div></div> + an implicit empty <p></p>.
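That pop-before-push is easy to sketch. A toy construction site (a drastic simplification of HTMLConstructionSite; real Blink tracks insertion modes, active formatting elements and much more, and also emits the trailing empty <p>) showing the implicit close:

```javascript
// Start-tags push, end-tags pop, and the one HTML5 rule from the text —
// a block element inside <p> force-closes the <p> — is a pop before
// the push.
const BLOCK = new Set(["div", "section", "article", "table"]);

function buildDom(tokens) {
  const root = { name: "#root", children: [] };
  const stack = [root];                      // stack of open elements
  const top = () => stack[stack.length - 1];

  for (const t of tokens) {
    if (t.type === "StartTag") {
      if (BLOCK.has(t.name) && top().name === "p") stack.pop(); // implicit close
      const el = { name: t.name, children: [] };
      top().children.push(el);
      stack.push(el);
    } else if (t.type === "EndTag") {
      // pop to the most recent matching open tag ("error recovery lite")
      while (stack.length > 1 && top().name !== t.name) stack.pop();
      if (stack.length > 1) stack.pop();
    }
  }
  return root;
}

// <p><div></div></p> — the div is NOT a child of the p:
const tree = buildDom([
  { type: "StartTag", name: "p" },
  { type: "StartTag", name: "div" },
  { type: "EndTag", name: "div" },
  { type: "EndTag", name: "p" },
]);
console.log(tree.children.map((c) => c.name)); // → [ 'p', 'div' ]
```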
A stack is what builds the tree
Lexing turns tokens into Element instances. "DOM construction" then walks a stack — start-tags push, end-tags pop. When the stack empties, the tree is finished.
栈 · 实时
Stack · live
DOM 树 · 增量
DOM tree · incremental
输入:<div><p><div></div></p><span></span></div>
Input: <div><p><div></div></p><span></span></div>
为什么用栈,不用链表? · 嵌套天然 LIFO,栈是规范要求的数据结构Why a stack, not a linked list? · nesting is naturally LIFO; the stack is the spec-required data structure
这是个很好的问题——4 个原因,从最直接到最深入:
A great question — four reasons, from the most immediate to the deepest:
HTML nesting is naturally LIFO. When the parser sees </div>, it needs the "most recent unclosed open tag with that name" — exactly the top of the stack. Stack pop is O(1); a linked list has to traverse to find the latest match, O(n). A 100KB page does thousands of push/pop pairs — O(1) vs O(n) is an order-of-magnitude gap.
HTML5 规范本身就用栈描述。W3C/WHATWG 的 HTML5 解析算法里有两个明文叫做"stack of open elements"和"list of active formatting elements"的数据结构。最复杂的那段——adoption agency algorithm(处理 <b><p>X</b>Y</p> 这种交叉嵌套的"错误恢复")——直接按栈的术语写规范。换成链表,你不仅要重写所有规范引文,逻辑也表达不出来了。
The HTML5 spec itself describes it as a stack. W3C/WHATWG's HTML parsing algorithm explicitly uses two data structures named "stack of open elements" and "list of active formatting elements". The hairiest piece — the adoption agency algorithm (error recovery for crossed nestings like <b><p>X</b>Y</p>) — is written directly in stack terminology. Switching to a linked list would force you to rewrite every spec quote, and the logic would no longer express itself.
Cache-friendly. Stacks typically live in a contiguous array (Blink's HTMLElementStack is internally a Vector<HTMLStackItem*>). One 64-byte cache line holds 8 pointers; push/pop runs in L1. Linked-list nodes scatter across the heap, so each next-pointer chase risks a cache miss — 5-10× slower in practice.
Stack depth = nesting depth, a free semantic index. Many HTML5 spec rules dispatch on "nesting depth": foster parenting inside <table>, banning <p> from nesting block elements, the implicit close of <option> inside <select>… Stack .size() answers in O(1). A linked list would need a separately maintained depth counter, with its own sync cost.
Flip the question: "When would a linked list be a better fit?" Answer: scenarios needing arbitrary mid-list insert/delete. HTML parsing rarely needs that (only the adoption agency algorithm's fragment-tree rearrangement, and even that's cheap on top of a stack). So the real answer to "why a stack, not a linked list?" is: the question is inverted — HTML nesting is a stack; a linked list would be the counter-intuitive choice.
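The push / implicit-close / pop mechanics above can be sketched in a few lines. This is a toy model under two assumptions only (a block `<div>` implicitly closes an open `<p>`; a stray `</p>` synthesises an empty paragraph); the names like `ToyConstructionSite` are invented, not Blink's:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the stack of open elements. Only the two rules discussed
// here are implemented: a block <div> implicitly closes an open <p>, and
// a stray </p> synthesises an empty <p></p>. Names are invented.
struct ToyConstructionSite {
    std::vector<std::string> open_elements;  // the stack
    std::vector<std::string> top_level;      // completed top-level siblings

    void StartTag(const std::string& tag) {
        if (tag == "div" && !open_elements.empty() && open_elements.back() == "p")
            EndTag("p");                     // implicit close before the push
        open_elements.push_back(tag);
    }
    void EndTag(const std::string& tag) {
        if (tag == "p" && (open_elements.empty() || open_elements.back() != "p")) {
            top_level.push_back("p");        // stray </p>: empty paragraph
            return;
        }
        if (open_elements.empty()) return;   // toy error recovery: ignore
        std::string top = open_elements.back();
        open_elements.pop_back();
        if (open_elements.empty())
            top_level.push_back(top);
    }
};
```

Feeding the StartTag/EndTag sequence of `<p><div></div></p>` through it leaves the three siblings p · div · p from the example above, with the stack empty at the end.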
Chrome DevTools Performance can show Parsing as a flame graph — but only the JS-side call stack. The C++ internals are a black box to it.
想看到 HTMLDocumentParser::AppendBytes → ... → HTMLConstructionSite::CreateElement 这一整条 C++ 栈,就必须用 Perfetto 录制——它不仅能拉出 C++ 调用栈,还能告诉你这个调用属于哪个线程,跨进程通信还会自动连线"发出端 → 接收端"的两个函数调用。
To see the full C++ stack HTMLDocumentParser::AppendBytes → ... → HTMLConstructionSite::CreateElement, you need Perfetto traces — they expose C++ stacks, tag each call with its thread, and even draw cross-process IPC as "sender → receiver" arcs.
"Tokenizing" sounds simple but is in fact a sprawling state machine spelled out in the W3C HTML5 spec — 80+ states, hundreds of transitions. HTMLTokenizer::NextToken is one giant switch that reads a character based on the current state and either emits a token or switches state. The most common edges:
HTML_TOKENIZER · STATE TRANSITIONSthird_party/blink/renderer/core/html/parser/html_tokenizer.cc
The hard problem this machine solves is error recovery. HTML5 spec describes "how to fix errors" with 24 "insertion modes" + a stack-based "original insertion mode" rewind — for instance, a <span> appearing inside <table> is mandated to be "foster-parented out of the table". That's why every browser parses bad HTML identically — they all follow this same spec.
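To make the state-machine shape concrete, here is a toy slice covering only the DataState → TagOpenState → TagNameState edges, emitting a tag name at '>'. The real machine has 80+ states and full error recovery; this sketch is not Blink's HTMLTokenizer:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy slice of the HTML5 tokenizer: three states, one emit rule.
enum class State { Data, TagOpen, TagName };

std::vector<std::string> Tokenize(const std::string& input) {
    std::vector<std::string> tags;
    State state = State::Data;
    std::string name;
    for (char c : input) {
        switch (state) {
            case State::Data:
                if (c == '<') state = State::TagOpen;  // Data -> TagOpen
                break;
            case State::TagOpen:
                name.assign(1, c);                     // first name char
                state = State::TagName;
                break;
            case State::TagName:
                if (c == '>') {                        // emit, back to Data
                    tags.push_back(name);
                    state = State::Data;
                } else {
                    name += c;
                }
                break;
        }
    }
    return tags;
}
```

The "one giant switch that reads a character based on the current state" structure is exactly this, scaled up by two orders of magnitude.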
Parsing's real "go-faster" trick lives in HTMLPreloadScanner. When the main parser is blocked on a <script> (waiting for JS to run), a second lightweight tokenizer continues scanning ahead on a side thread. The moment it sees <link rel="stylesheet"> / <img src> / <script src> it fires the network request early. By the time the main parser unblocks, the bytes are on the wire — sometimes already arrived.
This is what makes "HTML parsing" and "resource download" effectively parallel — and the real reason Chromium's cold-start is 30-50% faster than a "naive single-threaded parser". Those (Preload)-tagged requests you see in DevTools' Network panel? All fired by PreloadScanner ahead of time.
To see the Tokenizer state machine flip in real-time, the fastest path is Chromium's tracing: in chrome://tracing, enable the blink.parser category and reload — you'll see a time-aligned "state trace" with a colour block for every tag open/close. Here's roughly how it looks:
看 Main · Tokenizer 那条轨——每个色块是一次 token 触发,蓝/橙/绿对应不同 tag 类型;红色 <script> 期间 Tokenizer 完全冻结(主 parser 停了 6ms 等 V8 跑完);但同一时刻下面 PreloadScanner 那条还在偷偷扫,提前发了 app.js / avatar.png 的请求——上面 Network 轨里那两个棕条就是抢跑出去的。"parse-and-wait 的真实代价"在这张图里一目了然。Watch the Main · Tokenizer lane — each block is one token, blue / orange / green map to different tag types. During the red <script> the Tokenizer freezes (main parser stalls 6 ms while V8 runs); but at the same time the PreloadScanner below keeps scanning and fires app.js / avatar.png early — the two brown bars on the Network lane are those head-start requests. The real cost of parse-and-wait is right here in one picture.
DEVTOOLS
Performance > "Parse HTML" 段;Memory > Heap snapshot 看 DOM 节点数Performance > "Parse HTML" segment; Memory > Heap snapshot for DOM node count
bytes来自 Browser Process 网络线程from Browser network thread
→
OUTPUT
DOM Tree · blink::TreeScope
STAGE 02 · DOC PHASE
Style — CSS 是从右到左读的
Style — CSS is read right-to-left
CSSOM 与反向匹配
CSSOM and right-to-left selectors
Module
blink
Process
Render
Thread
Main
Output
Render Tree
这一步在做什么
What it does
遍历 DOM Tree,每个节点跑一遍"哪些 CSS 命中我",把命中的样式合并 + 继承 + UA 默认值,最后挂一个 ComputedStyle——这就是 Render Tree。Walk the DOM tree. For each node, find which CSS rules match, then merge + inherit + UA-default them. Attach a ComputedStyle to the node — that's the Render Tree.
为什么不能跳过
Why not skip
CSS 是 render-blocking。一棵无样式的 DOM 渲染上屏,下一帧 CSS 一到又得整页重排——等是更便宜的。所以浏览器宁可白屏也要等 CSSOM。CSS is render-blocking. Rendering an unstyled DOM only to re-lay out the whole page the moment CSS arrives costs more than waiting; a blank screen is cheaper. So the browser waits for the CSSOM.
The Style Engine walks the DOM, matches against the CSSOM and attaches a ComputedStyle to every node. The output: a Render Tree. Core: Document::UpdateStyleAndLayout.
Three sub-stages: CSS load → CSS parse → CSS compute. Two counter-intuitive facts here decide the entire performance shape — selectors are read right-to-left and RuleMap shards by selector type.
STAGE 02主线 · The Card 在样式后Main-line · The Card after Style
5 条规则进 3 张 RuleMap,每节点挂 ComputedStyle
5 rules into 3 RuleMaps, ComputedStyle on every node
每个 DOM 节点跑一遍"右往左反向匹配"——比如 article.card 上的".card 命中我吗"是 1 跳 hash 命中;给它合并 + 继承所有命中的属性,再套上 UA 默认值,挂出 ComputedStyle。最后一步 article.card 的 ComputedStyle 长这样:
Every DOM node runs the right-to-left match — for instance, article.card asks "does .card match me?" with a single hash hit. Then merge + inherit + UA defaults, and attach a ComputedStyle. After all that, article.card's ComputedStyle reads roughly:
// ComputedStyle · article.card
display : flex // from .card
flex-direction : row // flex default
align-items : center // from .card
gap : 14px
width : 340px
padding : 18px 20px
border-radius : 14px
background : linear-gradient(...) // triggers Effect tree
box-shadow : 0 6px 20px rgba(...) // triggers Effect tree
font-family : -apple-system, ... // inherited from body
color : rgb(21,23,28) // inherited from body
关键产物: 6 节点的 DOM 树各挂一个 ComputedStyle。从这一刻起,样式不再是字符串——它是 RGBA32、Length、EFlexDirection 这些紧凑的 C++ 类型。下游所有阶段都按这套结构化数据干活,没人需要再看 CSS 字符串。
The output: the 6-node DOM tree, each carrying a ComputedStyle. From this moment on style is no longer strings — it's RGBA32, Length, EFlexDirection, all compact C++ types. Every downstream stage operates on this structured data; no one needs to look at the CSS source again.
CSS 加载 · 真实日志
CSS load · the real log
在 Blink 里加桩打印解析过程,可以看到 HTML 解析与 CSS 解析是交错进行的。当 HTML 解析到 readystatechange = Interactive 之后,CSSParserImpl 才开始把外联样式表解析为 StyleRule:
Instrument Blink and print as it parses — you can see HTML and CSS parsing interleave. Only after HTML reaches readystatechange = Interactive does CSSParserImpl start turning the external stylesheet into StyleRules:
First stop in CSS parsing: characters → tokens. The tokenizer emits flavours like the ones below — FunctionToken (blue), HashToken (copper) and DelimToken (purple) being the hot ones:
Blink stores colours as RGBA32 — one 32-bit int — via CSSColor::Create. #hex goes through HashToken's direct path: bitwise pack straight into RGBA32. rgb() is a FunctionToken: parse arg list, range-check, then pack. Same white, more hops.
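The two paths can be sketched as follows. `MakeRGBA` stands in for where the slow rgb() path lands after argument parsing, `ParseHex` for the hex fast path; both names are invented, not CSSColor's real API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of RGBA32 packing: one 32-bit int in 0xAARRGGBB layout.
using RGBA32 = uint32_t;

// Where rgb(r, g, b) ends up after its arguments are parsed and checked.
RGBA32 MakeRGBA(int r, int g, int b, int a = 255) {
    return (uint32_t(a) << 24) | (uint32_t(r) << 16) |
           (uint32_t(g) << 8) | uint32_t(b);
}

// Fast path: six hex digits, one stoul, one OR; no function-call machinery.
RGBA32 ParseHex(const std::string& rrggbb) {
    uint32_t v = static_cast<uint32_t>(std::stoul(rrggbb, nullptr, 16));
    return 0xFF000000u | v;  // opaque alpha
}
```

Both produce the identical packed int; the difference is purely how many hops it takes to get there.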
15% 这个数字真的有意义吗? · 微基准 vs 实际页面收益Is the 15% really meaningful? · micro-bench vs real-page payoff
Bottom line: for nearly every business page, the 15% is invisible. CSS parsing runs once at first load; a stylesheet of thousands of rules takes 5-15ms total, so 15% means ~2ms — noise inside cold-start (which starts at hundreds of ms). "Convert all rgb() to hex" is textbook over-optimisation.
那为什么这个数字还值得记? 因为它是一扇窗户——透过这 15% 你能看到 V8/Blink 这种 C++ 系统的性能哲学:
Then why is the number worth knowing? Because it's a window — through that 15% you see the performance philosophy of C++ systems like V8/Blink:
"fast path + slow path" is a Blink/V8 staple. Hex is the fast path (pure bitwise); rgb() is the slow path (function-parsing subsystem). The CSS parser, JS engine, and layout engine all use this structure — optimise the common case to nanoseconds, let the rare case take its microseconds.
"function calls are themselves performance events" — one C++ virtual call costs ~5-10ns, plus arg validation ~50ns. A row of color: rgb(...) adds 50ns; thousands of rows add tens of microseconds. The "drop in the bucket" calculus only matters on per-frame hot paths — CSS parsing isn't one, so the math doesn't move the needle.
真正每帧都跑的颜色路径在 paint/raster:每个 DisplayItem 的 fill color 解析、每个 tile 的像素采样,这些才是 RGBA32 优化的真正受益者。CSS 解析的 15% 只是同一套数据结构在 cold path 上的副产品。
The colour path that actually runs every frame is in paint/raster: every DisplayItem's fill color, every tile's pixel sampling — these are the real beneficiaries of RGBA32. The CSS parsing 15% is just the same data structure showing up on the cold path.
So: don't rewrite existing CSS for that 15%; do understand that "fast/slow path bifurcation" is a pattern the entire browser engine uses — the real optimisation targets are the "1000 times per frame" paths in later pipeline stages (C9/C10/C14 are the actual battleground).
Parsing .text .hello and #world, Blink emits the structure below. The relation field — that's the pointer right-to-left matching follows: start at .hello, walk Descendant edges up to .text.
selector text = ".text .hello"
  value = "hello" · matchType = "Class" · relation = "Descendant"
  ↓ tag history
  selector text = ".text"
    value = "text" · matchType = "Class" · relation = "SubSelector"

selector text = "#world"
  value = "world" · matchType = "Id" · relation = "SubSelector"
The UA default stylesheets ship as bundled resources — the full list lives in blink/public/blink_resources.grd. This is why "your button style overrides the UA's without !important" — your CSS comes last, winning on declaration order.
The browser blocks rendering until it has both
the DOM and the CSSOM.
Render-blocking CSS · MDN
为什么?因为没样式的 DOM 是无意义的。一棵裸树渲染上屏,下一帧又要因为 CSS 进来重排——还不如等。这是 CSS 一直被叫 "render-blocking" 的根源。
Because rendering a bare DOM is meaningless. Drawing it to the screen and re-laying it out the moment CSS arrives is more expensive than waiting. That is the root reason CSS is called "render-blocking".
StyleRules are not piled into one big array — they're sharded by the type of each rule's rightmost simple selector into five maps. Matching only consults the relevant bucket, collapsing O(N) into O(N / k).
Map · #id
id_rules_
#world #header
Map · .class
class_rules_
.text .btn-primary
Map · [attr]
attr_rules_
[data-state="open"]
Map · tag
tag_rules_
div, span, p…
Map · ::pseudo
ua_shadow_…
::before ::placeholder
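The sharding idea as a toy sketch — key single-token selectors into the three hottest buckets, so a lookup is a few hash hits instead of a scan over all N rules. Hugely simplified from Blink's RuleSet; all names are invented:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy RuleSet: rules are bucketed by their rightmost simple selector.
struct ToyRuleSet {
    std::map<std::string, std::vector<std::string>> id_rules_;
    std::map<std::string, std::vector<std::string>> class_rules_;
    std::map<std::string, std::vector<std::string>> tag_rules_;

    void Add(const std::string& selector) {
        if (selector[0] == '#')
            id_rules_[selector.substr(1)].push_back(selector);
        else if (selector[0] == '.')
            class_rules_[selector.substr(1)].push_back(selector);
        else
            tag_rules_[selector].push_back(selector);
    }

    // Candidate rules for an element: consult only the buckets its own
    // id / class / tag can possibly hit.
    std::vector<std::string> CandidatesFor(const std::string& tag,
                                           const std::string& id,
                                           const std::string& cls) {
        std::vector<std::string> out;
        auto take = [&out](std::map<std::string, std::vector<std::string>>& m,
                           const std::string& key) {
            auto it = m.find(key);
            if (it != m.end())
                out.insert(out.end(), it->second.begin(), it->second.end());
        };
        take(id_rules_, id);
        take(class_rules_, cls);
        take(tag_rules_, tag);
        return out;
    }
};
```

A `<div id="world" class="text">` pulls candidates from exactly three buckets, no matter how many thousands of rules the page carries.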
从右到左 · 选择器为什么这样读
Right-to-left · why selectors read backwards
假设你写下 .text .hello。要快速判断"这个 div 命中吗",浏览器从最右边开始:先看节点本身有没有 .hello,命中了再向上找祖先里有没有 .text。从右到左能在第一步就否决掉绝大多数节点。
You write .text .hello. The fastest way to decide "does this div match?" is to start from the right: does the node itself have .hello? If yes, walk ancestors looking for .text. Right-to-left rejects the vast majority of nodes on the very first check.
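The rejection logic fits in a few lines. Here is a toy matcher for class-only descendant selectors, far simpler than Blink's SelectorChecker (invented names, stated assumptions: selector parts are bare class names written left-to-right):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A node's class list.
using Classes = std::vector<std::string>;

static bool Has(const Classes& c, const std::string& name) {
    for (const auto& x : c)
        if (x == name) return true;
    return false;
}

// ancestors: root-first chain above the node; selector: left-to-right parts.
bool Matches(const std::vector<Classes>& ancestors, const Classes& self,
             const std::vector<std::string>& selector) {
    // Step 1: rightmost part against the node itself; this single check
    // rejects the vast majority of candidate nodes.
    if (!Has(self, selector.back())) return false;
    // Step 2: remaining parts right-to-left, walking ancestors upward.
    int part = static_cast<int>(selector.size()) - 2;
    for (int i = static_cast<int>(ancestors.size()) - 1; i >= 0 && part >= 0; --i)
        if (Has(ancestors[i], selector[part])) --part;
    return part < 0;
}
```

For a node without `.hello`, the function returns on the very first check; the ancestor walk never happens.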
Whether a declaration applies, and which one wins, is decided by four ranked criteria. Only when one ties does the next level kick in — declaration order is the final tiebreaker.
01
Cascade layers 顺序
Cascade layers
@layer 块的声明顺序,最先声明的最弱。
Order of @layer declarations — earliest is weakest.
02
选择器特异度
Selector specificity
id (100) · class/attr/pseudo-class (10) · tag (1) 之和。
Sum of id (100) · class/attr/pseudo-class (10) · tag (1).
03
Proximity 排序
Proximity ordering
Cascade Level 6 引入,作用范围嵌套深的获胜。
Introduced in Cascade Level 6 — the more deeply nested scope wins.
04
声明位置
Declaration order
最后到达的获胜——这就是为什么 main-heading2 写在后面就赢了。
Last-write-wins — this is why main-heading2 wins simply by being declared later.
Suppose <h1 class="main-heading main-heading2"> with two rules: .main-heading { color: red; } and .main-heading2 { color: blue; }.
特异度同为 0,1,0,class 顺序无关——决定胜负的是声明位置。.main-heading2 写在后面,标题就是蓝色,把 class 顺序反过来写也一样。HTML 里 class 出现的先后从来不影响 CSS。
Specificity is identical at 0,1,0; the order of class names is irrelevant — declaration order decides. .main-heading2 is declared later, so the heading is blue, no matter what order you write the classes in HTML. Class order in HTML never affects CSS.
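The worked case above, as a sketch of cascade tiers 02 and 04 only: specificity as a lexicographically compared (id, class, tag) triple, declaration order as the tiebreaker. This is a toy model; @layer and proximity are left out, and the types are invented:

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// One matched declaration: its specificity triple, its position in the
// stylesheet, and the value it would apply.
struct Rule {
    std::tuple<int, int, int> specificity;  // (id, class, tag)
    int order;                              // declaration position
    std::string color;
};

std::string Winner(const std::vector<Rule>& matched) {
    const Rule* best = &matched[0];
    for (const Rule& r : matched) {
        // Higher specificity wins; on a tie, the later declaration wins.
        if (r.specificity > best->specificity ||
            (r.specificity == best->specificity && r.order > best->order))
            best = &r;
    }
    return best->color;
}
```

With two 0,1,0 rules the later one wins (blue), exactly the .main-heading2 case; an id rule beats any class rule regardless of order.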
为什么 UA 样式排在你之前WHY UA STYLES DECLARE BEFORE YOURSBlink 内置默认样式表(html.css 等)总是第一个注册到 RuleSet。在第 4 级判定(声明位置)里,业务样式由于是后注册的,每次都赢——这就是 UA 样式可以被你覆盖 的根本机制。Blink's UA stylesheet (html.css and friends) is always registered into the RuleSet first. At the 4th cascade tier (declaration order), your CSS is registered later — and wins by being last. That is the actual mechanism that lets you override UA styles without !important.
DEVTOOLS
Performance > "Recalculate Style";Elements > Computed 看 ComputedStylePerformance > "Recalculate Style"; Elements > Computed for ComputedStyle
▸ Performance · Main thread · "Recalculate Style"selected · 4.2 ms · 1842 elements
Main
JS
Recalculate Style 4.2ms
Layout 3.1ms
Paint 1.8ms
Commit
idle
Recalc Style 展开
CollectMatching
CompareRules · cascade
ApplyMatched · ComputedStyle
Selector match stats · this frameRuleSet hit-rate: 96.4%
id_rules_ = 12 hits
class_rules_ = 1432 hits
tag_rules_ = 387 hits
attr_rules_ = 9 hits ⚠ slow
看 3 件事: ① Recalc Style / Layout / Paint 三个块的相对宽度 — 哪个胖哪个就是瓶颈。Style 通常是 Layout 的 1/2;若反过来,八成是选择器太复杂(每个节点跑右往左匹配的成本爆了);② 底部 RuleSet hit-rate < 80% = 大量节点跑了无效匹配;③ attr_rules_ 命中 标红 — 属性选择器([data-state="open"])是最慢的桶,遇到全文档量级 selector 时尤其贵。3 things to watch: ① Recalc Style vs Layout vs Paint width ratios — fattest one is the bottleneck. Style is usually ½ of Layout; if reversed, you almost certainly have over-complex selectors (per-node right-to-left match cost explodes); ② RuleSet hit-rate < 80% = many nodes running futile matches; ③ attr_rules_ hits in red — attribute selectors ([data-state="open"]) are the slowest bucket, particularly costly with document-scale selectors.
Render Tree · 每节点附 ComputedStyle · + ComputedStyle per node
STAGE 03 · DOC PHASE
Layout — 几何属性的归宿
Layout — where geometry settles
几何归位
where geometry settles
Module
blink
Process
Render
Thread
Main
Output
Layout Tree
这一步在做什么
What it does
遍历 Render Tree,给每个 LayoutObject 计算 x · y · width · height。所谓 LayoutTree = Render Tree + 几何属性。Walk the Render Tree, compute x · y · width · height for every LayoutObject. LayoutTree = Render Tree + geometry.
为什么不能跳过
Why not skip
没有几何就没法绘制——"画一个红色矩形" 至少需要 4 个数字。Layout 还要解决 inline ↔ block ↔ float ↔ flex ↔ grid 之间错综的相互影响,是 Main thread 上最容易长尾的一段。No geometry → no painting. "Draw a red rectangle" needs four numbers at minimum. Layout also has to resolve the tangled interactions between inline ↔ block ↔ float ↔ flex ↔ grid — and it is the Main thread's most long-tail-prone stage.
Layout is about geometry — position and size. Each LayoutObject carries a LayoutRect that stores x / y / width / height.
The catch: LayoutObject does not map 1 : 1 to a DOM node. A display: list-item becomes two LayoutObjects (item box + marker box). An anonymous block "appears from nowhere" to keep layout rules consistent.
STAGE 03主线 · The Card 在布局后Main-line · The Card after Layout
LayoutNGFlexibleBox 的两遍布局,8 个 LayoutObject 各就位
LayoutNGFlexibleBox's two passes, 8 LayoutObjects in place
The card is a display: flex container, so the root article uses LayoutNGFlexibleBox, not LayoutNGBlockFlow. The flex algorithm runs twice: first the main axis (horizontal) — avatar 56 + gap 14 + button 53 = 123, leaving 217 for .info; then the cross axis (vertical) — center via align-items. The final LayoutTree:
Five things to notice: ① img uses LayoutImage (a LayoutReplaced subclass) — it has no children and never will; img is a replaced element, Layout gives it a single box for the external resource; ② a uses LayoutInline, not BlockFlow — it's phrasing content, occupies a line-box; ③ button brings its own LayoutNGBlockFlow with default padding / border-radius from UA stylesheet; ④ DOM 6 nodes → 8 LayoutObjects (the three LayoutText appear from nowhere); ⑤ the layout algorithm ran twice — that's the flex tax. A plain block flow only runs once.
What stages does mutating each CSS property trigger? CSS Triggers answers it. The most-cited rows below — moving a property from the reflow path to the composite path is the lowest-hanging perf win:
CSS 属性 CSS property                  | Layout | Paint | Composite
width / height / padding / margin     |   ●    |   ●   |    ●
top / left / right / bottom           |   ●    |   ●   |    ●
font-size / line-height / display     |   ●    |   ●   |    ●
color / background-color / box-shadow |   —    |   ●   |    ●
border-radius / outline               |   —    |   ●   |    ●
opacity                               |   —    |   —   |    ●
transform                             |   —    |   —   |    ●
filter                                |   —    |   —   |    ●
用法HOW TO USE想做位移动画 → 用 transform: translate(x, y),不要用 top / left;想淡入淡出 → opacity,不要 display 切换;圆角变化 → 改 border-radius 时整张图层都得重 Paint,能避就避。不同浏览器内核的处理表略有差异,CSS Triggers 是 Lookup table,不是宪法。Want a position animation? Use transform: translate(x, y), not top / left. Cross-fade? opacity, not toggling display. Animating border-radius repaints the whole layer — avoid where you can. Different engines vary slightly; CSS Triggers is a lookup, not a law.
"LayoutObject" is not a single class — it's an inheritance tree. Blink uses subclassing to encode different box-model rules: block / inline / table / svg / mathml each walk their own algorithm. Below is a condensed map of the third_party/blink/renderer/core/layout/ tree:
LAYOUT OBJECT · CLASS HIERARCHYthird_party/blink/renderer/core/layout/layout_object.h
Two details worth remembering: ① LayoutText doesn't inherit from LayoutBox — it has no box, its geometry is decided by the parent LayoutInline / LayoutBlockFlow inside an inline line-box; ② LayoutView is the single root — it owns the viewport's size + the root ScrollableArea. Removing document.body doesn't kill it; LayoutView is a permanent member of Document.
Starting in 2017, Chromium rewrote the layout engine as LayoutNG (Next-Generation Layout). The headline change: introduce a "Fragment" as a read-only geometry snapshot — LayoutObjects remain the input "recipe", but layout output is no longer written back into them. Instead we get an immutable fragment tree (NGPhysicalFragment tree). This split lets layout be cached, parallelised, and short-circuited at subtree boundaries.
Operationally, LayoutBlockFlow::UpdateLayout() constructs an NGBlockLayoutAlgorithm, feeds it an NGConstraintSpace, runs it and emits an NGLayoutResult — at its centre an NGPhysicalBoxFragment. "Constraint + Algorithm → Fragment" is LayoutNG's three-act form, each act purely functional, each act cacheable. This is what reduces many O(whole tree) reflows to O(dirty subtree) under NG.
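The "Constraint + Algorithm → Fragment" shape can be sketched as a pure, memoised function. All types here are invented toys; NG's real NGConstraintSpace keys are far richer than a single width:

```cpp
#include <cassert>
#include <map>

// Toy constraint space: the only input that affects this box's layout.
struct ToyConstraintSpace {
    float available_width;
    bool operator<(const ToyConstraintSpace& o) const {
        return available_width < o.available_width;
    }
};

// Toy fragment: the read-only geometry snapshot layout emits.
struct ToyFragment {
    float width;
    float height;
};

struct ToyLayoutBox {
    float aspect = 0.5f;  // toy content model: height = width * aspect
    std::map<ToyConstraintSpace, ToyFragment> cache;
    int layout_runs = 0;

    // Pure function of the constraint space, so results are cacheable:
    // same constraints in, same fragment out, no work done.
    ToyFragment Layout(const ToyConstraintSpace& space) {
        auto it = cache.find(space);
        if (it != cache.end())
            return it->second;  // unchanged constraints: cache hit
        ++layout_runs;
        ToyFragment frag{space.available_width, space.available_width * aspect};
        cache[space] = frag;
        return frag;
    }
};
```

A second call with identical constraints runs zero layout work; this is the mechanism behind "O(whole tree) reflows become O(dirty subtree)".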
CSS says: adjacent vertical margins between block siblings collapse to the larger one. The catch: whether the current block's margin-top participates in collapse can only be decided after its first child block has been laid out — a cross-subtree backward dependency. LayoutNG models this explicitly with NGUnpositionedFloat and NGMarginStrut — keeping the algorithm pure functional.
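The strut idea as a sketch, under the simplification that only adjoining margins are appended — a toy version of what NGMarginStrut tracks, not the real struct:

```cpp
#include <algorithm>
#include <cassert>

// Toy margin strut: accumulate adjoining vertical margins, resolve the
// collapsed value lazily. CSS collapsing rule modelled here: the largest
// positive margin plus the most negative margin.
struct ToyMarginStrut {
    float positive = 0;
    float negative = 0;

    void Append(float margin) {
        if (margin >= 0)
            positive = std::max(positive, margin);
        else
            negative = std::min(negative, margin);
    }
    float Sum() const { return positive + negative; }  // resolved collapse
};
```

Because the strut only accumulates, a parent can keep appending child margins without deciding anything; the decision happens once, when the strut is finally resolved — which is what keeps the algorithm purely functional.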
为什么 Flex 要跑两遍布局? · 主轴 / 交叉轴的强先后依赖Why does Flex run layout twice? · main-axis must finish before cross-axis can begin
flex 的两遍布局,本质是"分配尺寸 → 排列方向" 这两步必须串行:
Flex's two layout passes really come from "distribute size → place along axis" being a strict sequence:
Pass 1 · main axis (horizontal): solve flex-grow / flex-shrink / flex-basis arithmetic — given the container's available width, distribute it across children per flex: 1 0 200px style rules. The output: each child's main-axis size (i.e. width for a horizontal flexbox).
Pass 2 · cross axis (vertical): now that every child's width is known, we can measure their heights (height often depends on width — e.g. auto-wrapping text: 1 line if the container is wide, 3 lines if narrow). align-items / align-self alignment + "the tallest item in a flex-line sets the line's height" rule both need this second pass.
Why can't this run in parallel? Because a child's height depends on the width it received in pass 1 — a one-way dependency. If you forced height-first, you'd get "height computed against the original container width" — once the container width changes (because flex-grow expanded a child), all heights are stale and need recomputing. So the "two passes" are a mathematical constraint, not an engineering whim.
Grid is even worse: CSS Grid runs three or more passes in some scenarios (the min-content / max-content track-sizing algorithm is iterative until convergence). Flex's two passes are actually frugal in comparison.
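The two passes can be sketched with a toy flexbox whose item heights depend on their assigned widths (the "auto-wrapping text" dependency described above). Invented model: one flex line, items are either fixed-width or "flex: 1", and height = content_area / width:

```cpp
#include <cassert>
#include <vector>

struct ToyFlexItem {
    float fixed_width;   // 0 means flexible
    float content_area;  // toy stand-in for "amount of text to wrap"
    float width = 0;
    float height = 0;
};

void LayoutFlexRow(std::vector<ToyFlexItem>& items, float container, float gap) {
    // Pass 1 · main axis: subtract gaps and fixed widths, split the rest
    // among the flexible items.
    float remaining = container - gap * (items.size() - 1);
    int flexible = 0;
    for (const auto& it : items) {
        remaining -= it.fixed_width;
        if (it.fixed_width == 0) ++flexible;
    }
    for (auto& it : items)
        it.width = (it.fixed_width != 0) ? it.fixed_width : remaining / flexible;
    // Pass 2 · cross axis: heights are only computable once widths exist.
    for (auto& it : items)
        it.height = (it.width > 0) ? it.content_area / it.width : 0;
}
```

Swapping the passes is impossible here: pass 2 reads `it.width`, which does not exist until pass 1 has run — the one-way dependency in code form.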
▸ Performance · Main thread · Layout/Reflow event (expanded)selected · 8.4 ms · forced
Main
JS
Recalc Style 1.6ms
Layout 8.4ms ⚠ forced reflow
Paint 2.1ms
Commit
idle
Layout (expanded)
PerformLayout
UpdateLayout · 312 nodes
UpdateLayout (expanded)
NGFlexLayoutAlgorithm
NGBlockLayout · pass1
NGBlockLayout · pass2
offsetWidth (sync layout)
0 · 4 ms · 8 ms · 12 ms · 16 ms
3 个红线信号: ① Layout 占主线程比例 > 30%(本图 8.4/16.7 = 50%,直接掉帧);② Layout bar 上有 ⚠ forced reflow 标 — JS 在读 offsetWidth/getBoundingClientRect() 之前刚改了 DOM,触发同步布局(C8 章 reflow 那段说的);③ 同一帧内出现多次 Layout 块 — 典型 layout thrashing,把读/写操作分批合并到 rAF 里就能消除。3 red-line signals: ① Layout takes > 30% of the Main thread (here 8.4/16.7 = 50%, frame dropped); ② Layout bar carries a ⚠ forced reflow tag — JS read offsetWidth/getBoundingClientRect() right after a DOM mutation, triggering sync layout (the C8 reflow story); ③ multiple Layout blocks in one frame — classic layout thrashing, fixed by batching reads + writes inside a rAF.
Layout Tree · 含 LayoutRect + Fragment 树 · + LayoutRect + Fragment tree
STAGE 04 · DOC PHASE
Pre-paint — 四棵属性树的诞生
Pre-paint — birth of the four property trees
局部更新的"语法"
isolation contracts for transform · clip · effect · scroll
Module
blink
Process
Render
Thread
Main
Output
Property Trees ×4
这一步在做什么
What it does
从 LayoutTree 抽出 4 棵属性树(Transform / Clip / Effect / Scroll),每棵树用父子继承的方式表达"该节点上面叠了哪些变换 / 裁剪 / 特效 / 滚动"。把这部分从图层结构里剥离出来,是 Compositor 能"只更新一个属性、不重画"的根基。Extract 4 property trees (Transform / Clip / Effect / Scroll) from the LayoutTree. Each tree uses parent-child inheritance to express "what transforms / clips / effects / scrolls are stacked above this node". Splitting these axes from the layer structure is what lets the Compositor "mutate one property without repainting".
为什么不能跳过
Why not skip
没有这 4 棵树,每次 transform / opacity / scroll 改变都要顺着图层树重新计算继承关系,跨线程把整棵树拷一遍。属性树是 Compositor "动一个节点的一条属性,其他都 cache 命中"的隔离合同。Without the four trees, every transform / opacity / scroll change has to recompute inheritance up the layer tree and ship the whole tree cross-thread. Property trees are the isolation contract that lets "mutate one property of one node" stay local.
CAP · COMPOSITE AFTER PAINT新版本 Chromium 的 Pre-paint & Paint 已重写为 CAP(Composite After Paint) 模式——属性树的构建从 Layout 后剥离到 Paint 之后完成,去掉了 PaintLayer 这一层。结果是更少的中间产物 + 更精确的失效计算。本文中的描述基于 CAP 之前的世界,但脉络与新版完全一致。Recent Chromium has rewritten Pre-paint & Paint as CAP (Composite After Paint) — property-tree construction moves from "after Layout" to "after Paint", and the PaintLayer abstraction is gone. The result: fewer intermediate artifacts and more precise invalidation. The model below predates CAP, but the shape is identical in the new world.
STAGE 04主线 · The Card 在 Pre-paint 后Main-line · The Card after Pre-paint
4 棵树各取所需,Effect 树多 2 个节点,Transform 树多 1 个
Four trees each take a piece — Effect gains 2 nodes, Transform gains 1
名片这次会让 4 棵属性树都"动起来"——但每棵树多出来的节点不一样。NeedsEffect() 看到 .card 上的 box-shadow 与 linear-gradient,在 Effect tree 里给它建一个节点;看到 .follow 的 background 切色又建一个;NeedsTransform() 看到 .follow 的 will-change,在 Transform tree 给它建一个;NeedsClip() 看到 .avatar 的 border-radius: 50%,在 Clip tree 多一个 RRect 节点。Scroll tree 没动(整张卡都不滚)。
This time the card stirs all four property trees — but each tree gains a different number of nodes. NeedsEffect() sees .card's box-shadow + linear-gradient and creates one Effect node; .follow's background-mutation creates another. NeedsTransform() sees .follow's will-change and creates a Transform node. NeedsClip() sees .avatar's border-radius: 50% and creates a Clip node. Scroll tree remains untouched — nothing scrolls.
// 4 property trees · after The Card

Transform tree              Clip tree
─ root                      ─ root
└─ .follow                  └─ .avatar (RRect 50%)
   [will-change]               [border-radius]

Effect tree                 Scroll tree
─ root                      ─ root
├─ .card                    // (unchanged)
│  [shadow + gradient]      // nothing scrolls in
│  render_surface = YES     // the example
└─ .follow
   [will-change]
Key decision: .card's render_surface_reason_ flips to non-null (the box-shadow needs off-screen compositing to render correctly) — meaning .card's whole subtree will land in its own RenderPass during Draw. The decision is made in Pre-paint but only cashes in at C16. The four trees are the contract that powers every "local update" later — hover-changing .follow's transform mutates one node in the Transform tree; the other three trees are cache hits and Layout doesn't run at all.
四棵属性树
The four property trees
TRANSFORM
变换树Transform tree
每个节点的位移、旋转、缩放、3D 变换;动画热路径必经。
Per-node translation / rotation / scale / 3D. The hot path of animations.
CLIP
裁剪树Clip tree
overflow / clip-path 在层级里生成的剪裁矩形。
Clip rectangles inherited from overflow / clip-path.
EFFECT
特效树Effect tree
opacity / filter / 混合模式等视觉特效;决定 RenderPass 边界。
Per-node opacity / filter / blend effects; decides RenderPass boundaries.
SCROLL
滚动树Scroll tree
滚动容器的 offset 与可滚动范围;Compositor 线程可直接修改。
Scroll offsets and scrollable bounds; mutable by the Compositor without Main.
Backed by these trees, Chromium can mutate one node's transform / clip / effect / scroll without disturbing its descendants. This is why CSS animations stay smooth — the entire animation never goes back to the Main thread.
案例 · 一个 div,4 棵树各取所需
Worked case · one div, four trees each take a piece
CASE · DECOMPOSITION
CSS 属性是怎么"分流"到不同属性树上的
How CSS properties get routed to different trees
<div style="
transform: rotate(10deg); /* → Transform tree */
overflow: hidden; /* → Clip tree */
opacity: 0.5; /* → Effect tree */
filter: blur(4px); /* → Effect tree */
overflow-y: scroll; /* → Scroll tree */
">
The same LayoutObject contributes one node to each of Transform / Clip / Effect / Scroll during Pre-paint. Animating opacity later mutates only the Effect tree — Transform / Clip / Scroll are all cache hits.
The key is step ①: each object does not rebuild the whole tree, it only updates the 4 property-node pointers tied to itself. Pre-paint runs in O(changed nodes), not O(whole tree) — that's what makes it cheap enough to run every frame.
"Property tree" sounds abstract; in code it's a handful of small structs that inherit from cc::PropertyTreeNode. Each tree is an array (Vector<Node>); nodes link via integer parent_id — not pointers, because the whole tree gets memcpy'd cross-thread at Commit time. The four flavours of node, with their key fields:
Three details unlock the design's power: ① all four trees share one node-id space (1-based, 0 is the root) — a LayerImpl needs only 4 ints to know "which transform / clip / effect / scroll chain I belong to"; ② every node carries a changed_flag, so the next frame only pushes the nodes that actually changed — Commit bytes scale with delta; ③ the render_surface_reason_ on an Effect node is the real switch for "do we need an off-screen RenderPass?" — filter: blur(), mix-blend-mode, mask-image all flip it on.
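That storage shape can be sketched directly: flat arrays, integer parent ids, per-node changed flags. Toy structs, not cc's real PropertyTreeNode:

```cpp
#include <cassert>
#include <vector>

// Toy transform node: linked by index, never by pointer, so the whole
// array can be copied across threads wholesale.
struct ToyTransformNode {
    int parent_id;  // -1 at the root
    float tx, ty;   // toy payload: a 2D translation
    bool changed;   // only changed nodes get pushed at Commit
};

struct ToyTransformTree {
    std::vector<ToyTransformNode> nodes;

    int Insert(int parent_id, float tx, float ty) {
        nodes.push_back({parent_id, tx, ty, true});
        return static_cast<int>(nodes.size()) - 1;
    }

    // Screen-space offset: walk the parent chain, summing translations.
    void Accumulate(int id, float& x, float& y) const {
        for (int i = id; i != -1; i = nodes[i].parent_id) {
            x += nodes[i].tx;
            y += nodes[i].ty;
        }
    }
};
```

Mutating one node means flipping one `changed` flag and one payload; the rest of the array ships unchanged.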
When PrePaintTreeWalk reaches a LayoutObject, PaintPropertyTreeBuilder::UpdateForSelf() follows a "build-on-demand" rule — it creates a node only if a CSS property needs one. For example:
This is the real answer to "which properties trigger GPU acceleration?" — anything that makes Pre-paint emit a new Transform / Effect node with a non-empty render_surface_reason_ promotes that subtree to its own composited layer. will-change: transform "creates a layer" precisely because it bypasses the NeedsTransform guard and forces a Transform node into existence.
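The build-on-demand guards as a sketch: a property-tree node exists only when some CSS property demands one. The guard names follow the text; the predicate and structs are invented:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy computed-style slice: just the properties the guards look at.
struct ToyStyle {
    bool has_transform = false;
    bool has_will_change_transform = false;
    float opacity = 1.0f;
    bool has_overflow_hidden = false;
};

// Which property-tree nodes would Pre-paint create for this element?
std::set<std::string> PropertyNodesFor(const ToyStyle& s) {
    std::set<std::string> nodes;
    if (s.has_transform || s.has_will_change_transform)  // NeedsTransform
        nodes.insert("transform");
    if (s.opacity < 1.0f)                                // NeedsEffect
        nodes.insert("effect");
    if (s.has_overflow_hidden)                           // NeedsClip
        nodes.insert("clip");
    return nodes;
}
```

A plain div creates nothing; will-change: transform forces a Transform node into existence even though no transform is applied yet — the "creates a layer" behaviour in miniature.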
为什么是 4 棵树,而不是 1 棵大树? · 不同属性的失效粒度本质不同Why four trees, not one big tree? · different properties have fundamentally different invalidation granularity
最直接的答案:这 4 类属性的"变化频率" 与"影响范围" 完全不一样。把它们塞进一棵树,每次变一个属性都要把整棵树推到 Compositor 线程,那就退化成"整页重做"了。
The most direct answer: these four kinds of properties have completely different "change frequency" and "impact radius". Stuff them into one tree and every property change ships the whole tree to the Compositor — degenerating to "redo the whole page".
具体看 4 棵树各自的"性格":
Each tree's "personality" specifically:
Transform tree · 变化最频繁(每帧动画都改),但不影响其他元素的几何。一个独立树,变化只 push 一个节点。
Transform tree · changes most often (every animation frame), but doesn't affect other elements' geometry. Its own tree → one node push per change.
Clip tree · driven by layout, changes less often; but clipping is cumulative (parent clip ∩ child clip), needing its own ancestor chain. Mixed with Transform, every transform change would re-evaluate clip — pure waste.
Effect tree · decides RenderPass boundaries (box-shadow / opacity / blend-mode trigger off-screen composition). This tree's nodes are the buckets of GPU work — they decide "which quads go to a dedicated RenderPass". Transform / Clip don't affect RenderPass partitioning, so this must stay separate.
Scroll tree · 滚动是用户输入触发,跟 vsync 节奏完全独立。Compositor 线程要在没有 Main 线程参与的情况下直接修改 scroll offset。如果 Scroll 在 Transform 树里,Compositor 修一个 offset 就要"叫醒整棵 Transform 树",违背"滚动跑在 Compositor 上"的初衷。
Scroll tree · scrolling is triggered by user input, on a completely different cadence from vsync. The Compositor needs to mutate scroll offset without involving Main. If Scroll lived inside Transform, mutating one offset would "wake the entire Transform tree" — defeating the purpose of "scrolling runs on the Compositor".
So the 4 trees are an "orthogonal decomposition": each property axis has independent "change frequency × impact radius"; cramming them into one tree disables 4 independent optimisations. Storing orthogonal axes separately and updating locally per axis — the same trick used by database engines, graphics engines, and OS schedulers: "shard by mutation pattern".
Counter-example: early Blink's PaintLayer was exactly "one tree with all properties bundled" — every transform change traversed the PaintLayer tree. The CAP (Composite After Paint) project's core action was killing PaintLayer and replacing it with these 4 independent property trees. The perf win came from "splitting things that shouldn't change together" — a rule that holds for any system.
DEVTOOLS
Performance > "Update Layer Tree";Layers 面板看图层树Performance > "Update Layer Tree"; Layers panel for the layer tree
看 3 处: ① 树视图里星标 ★ promoted 的就是 4 棵属性树给它建了独立节点 的元素 — Pre-paint 的输出在这里可见;② 右侧画布是层级俯视图,实线方框 = 独立合成层(.card 与 .follow),虚线 = 普通子元素;③ 鼠标悬到任一层上 DevTools 会显示 Compositing Reasons(kActiveTransformAnimation / kBackdropFilter 等) — 直接对应 cc::CompositingReason enum。独立合成层数量爆炸是 will-change 滥用的明证。3 spots to inspect: ① ★ promoted entries in the tree are elements where the four property trees created dedicated nodes — Pre-paint's output is visible here; ② the right canvas is the top-down layer view; solid borders = composited layers (.card and .follow), dashed = inline children; ③ hover any layer and DevTools shows the Compositing Reasons (kActiveTransformAnimation / kBackdropFilter etc.) — directly mapping to the cc::CompositingReason enum. An explosion of composited layers is proof of will-change abuse.
layout objects → display item list, in CSS painting order
Module
blink → cc
Process
Render
Thread
Main
Output
cc::PictureLayer + DisplayItemList
这一步在做什么
What it does
遍历 LayoutTree,按 CSS 绘画顺序把每个 LayoutObject 翻译成一组 DisplayItem,组成一份 DisplayItemList,挂到 cc::PictureLayer。这一步不画一个像素——它只是把"要画什么"写成可重放的脚本。Walk the LayoutTree in CSS painting order, translate each LayoutObject into a batch of DisplayItems assembled into a DisplayItemList, attach to cc::PictureLayer. Not one pixel painted here — Paint writes a replayable script of "what to draw".
为什么不能跳过
Why not skip
指令而非像素 = 可跨线程传递。Main thread 只生产 DisplayItemList,Raster 线程拿过去 playback。同一份指令在不同 scale / 不同 Tile / 不同设备上反复使用——是 Chromium "便宜地多次画" 的根。Instructions, not pixels = cross-thread transferable. Main only produces DisplayItemList; Raster threads play it back. The same list is reused across scales, tiles, devices — the root of Chromium's "paint cheaply, paint many times".
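The "instructions, not pixels" idea in miniature. A hedged sketch: a DisplayItem here is just a rect and Playback a scale multiply, but the shape of the win is the real one: record once on Main, replay on any thread at any scale.

```cpp
#include <cassert>
#include <vector>

// Toy DisplayItem: a rect in layout coordinates. Playback multiplies by a
// device scale, so one recorded list can be replayed per tile / per scale.
// (Sketch only; real items are Skia ops inside cc::DisplayItemList.)
struct Rect { int x, y, w, h; };

struct DisplayList {
  std::vector<Rect> items;  // recorded once on the Main thread

  // Replayed on any Raster thread, at any scale, any number of times.
  std::vector<Rect> Playback(int scale) const {
    std::vector<Rect> px;
    for (const Rect& r : items)
      px.push_back({r.x * scale, r.y * scale, r.w * scale, r.h * scale});
    return px;
  }
};
```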
STAGE 05主线 · The Card 在 Paint 后Main-line · The Card after Paint
14 条 DisplayItem 排好,头像被 ClipRRect 圈住
14 DisplayItems in line, avatar wrapped by ClipRRect
Paint 不画一个像素——它只把 Layout 与 Effect tree 给的指令转写成 cc::DisplayItemList。名片的最终 list 是按 CSS Appendix E 的 7 阶段绘画顺序排好的:
Paint paints zero pixels — it only transcribes Layout + Effect tree decisions into a cc::DisplayItemList. The card's final list is ordered by the CSS Appendix E painting sequence:
Three things to notice: ① the avatar's ClipRRect is a real circle — SkRRect carries four independent corner radii, all four set to 28px, so the GPU treats it as a true circle; ② SaveLayer nests twice — the outer one carves an off-screen texture for .card's shadow effect; the inner one is .follow's own composite layer. To cc these are two independent PaintChunks, each bound to its own 4-tree state; ③ the whole list reuses incrementally via PaintController's double buffer — next frame's unchanged chunks are O(1) range-moved over, only mutated chunks enter Raster.
The CSS spec mandates a fixed 7-phase order for every block — background at the bottom, text at the top. The order is "CSS 2.1 Appendix E painting order", strictly followed by Blink:
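The phase ordering is easy to model: tag every item with its phase and stable-sort. The phase names below follow the article's bg → border → float → in-flow → positioned → outline → text summary, not the spec's full prose:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Appendix E in miniature: every item carries a phase, and the final list
// is ordered phase-by-phase (background at the bottom, text on top).
enum Phase { kBackground, kBorder, kFloat, kInFlow, kPositioned, kOutline, kText };

struct Item { Phase phase; std::string name; };

// stable_sort keeps document order within a phase, as the spec requires.
std::vector<Item> InPaintOrder(std::vector<Item> items) {
  std::stable_sort(items.begin(), items.end(),
                   [](const Item& a, const Item& b) { return a.phase < b.phase; });
  return items;
}
```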
The Paint stage doesn't paint pixels — it records "what to paint". Each entry is a DisplayItem, ultimately fed into a cc::PictureLayer.
和 DOM 构建一样,Paint 走的是栈式遍历——遇到一个 LayoutObject,先 SaveLayer 压栈,绘制完子节点后 Restore 出栈。这种模式让裁剪、变换、不透明度都能就地嵌套,又不会污染兄弟节点。
Like DOM construction, Paint walks a stack — hit a LayoutObject, push SaveLayer, paint children, pop Restore. This pattern lets clip, transform and opacity nest in place without contaminating siblings.
"DisplayItemList" inside Blink is not a flat array — it nests three layers: DisplayItem → PaintChunk → PaintArtifact. The design lets invalidation granularity and property-tree binding both target sub-ranges precisely:
PAINT_CONTROLLER · DATA MODELthird_party/blink/renderer/platform/graphics/paint/
Why a "chunk layer"? Because cc's Layerization step doesn't operate on individual DisplayItems — it operates on "contiguous DisplayItems sharing one property-tree state". Only a chunk's worth can land on the same cc::Layer. The chunk's properties_ field carries the current pointer into the four property trees Pre-paint just produced — it switches at every SaveLayer/Clip/Transform boundary. This is the seam that stitches "property trees" and "display lists" together.
Every LayoutObject's DisplayItems carry an identity (DisplayItemClient). On the next Paint, unchanged LayoutObjects reuse last frame's DisplayItems verbatim. Paint is incremental — its cost scales with the dirty region only.
Mechanically the reuse runs on a double-buffer inside PaintController: current_paint_artifact_ is last frame's output; new_paint_artifact_ is this frame's accumulator. When PaintController::UseCachedItemIfPossible(client, type) hits (the client is clean and last frame painted this type), the whole contiguous slice of DisplayItems is bulk-moved into new_artifact — O(1) range move, no repaint. The mechanism is called "Paint cache", and it's why a typical frame in Blink only paints 200~300 new items even though the document has thousands.
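A minimal sketch of that double buffer, with one item per client for brevity (the real PaintController moves contiguous ranges and matches by DisplayItemClient id):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy PaintController double buffer: a clean client's slice of last frame's
// items is moved into the new artifact without repainting; only dirty
// clients re-record. (Simplified to one item per client.)
struct Controller {
  std::vector<std::string> current;  // last frame's artifact
  std::vector<std::string> next;     // this frame's accumulator
  int repaints = 0;

  void PaintClient(size_t index, bool clean) {
    if (clean) {
      next.push_back(std::move(current[index]));  // cache hit: O(1) move
    } else {
      ++repaints;                                 // cache miss: re-record
      next.push_back("repainted#" + std::to_string(index));
    }
  }
};
```

Three clients, one dirty: the frame's Paint cost is one item, not three.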
cc::Layer 的 5 个家族成员
The cc::Layer family — five subtypes
Paint 把 LayoutObject 转成的不是一张图——而是一棵 cc::Layer 树,运行在主线程,每个 Render Process 有且只有一棵。它的子类决定了"上屏方式":
Paint hands off not an image — but a cc::Layer tree, living on the Main thread, exactly one per Render process. The subclass decides "how this will reach the screen":
Embeds a CompositorFrame from another process. Used by iframes, OffscreenCanvas, video.
cc::UIResourceLayer / NinePatch
软件渲染场景下的"位图层"——类 TextureLayer 的 fallback。
Software-rendering bitmap layer — the fallback cousin of TextureLayer.
cc::VideoLayer (deprecated)
已弃用,被 SurfaceLayer 取代。
Replaced by SurfaceLayer.
cc 是什么WHAT 'CC' STANDS FOR这里的 cc = content collator(内容编排器),不是 Chromium Compositor。整个 cc 模块的工作就是在 Render 进程内组织好"该给 Viz 画什么",所以叫 collator 比 compositor 更贴切。cc = content collator — not Chromium Compositor. The cc module's job is to assemble what should be drawn for Viz inside the Render process. Collator fits better than Compositor.
3 个判读: ① 7 阶段顺序固定(CSS 2.1 Appendix E),你看到的色块顺序永远是 bg → border → float → in-flow → positioned → outline → text;② PaintChunk 数量 ≈ 独立 PaintLayer 数量,过多说明子树绑了不同的 property tree state;③ UseCachedItemIfPossible 命中率 < 80% 是典型的 invalidation 风暴 — 改了某个共享属性导致整子树标脏。开 Paint Flashing 浮层(浅绿色矩形闪一下),能直观看到哪一块被重 paint。3 reads: ① The 7-phase order is fixed (CSS 2.1 Appendix E) — you'll always see bg → border → float → in-flow → positioned → outline → text; ② PaintChunk count ≈ distinct PaintLayer count; explosion means subtrees are bound to different property-tree states; ③ UseCachedItemIfPossible hit rate < 80% is a classic invalidation storm — a shared property mutation marked the whole subtree dirty. Toggle Paint Flashing (light-green flashing rectangles) and you can directly see which region was repainted.
把 Main thread 上的 cc::Layer 树(外加 4 棵属性树、DisplayItemList)同步到 Compositor thread 上的 LayerImpl 树。这是渲染管线唯一一次显式跨线程的时刻——执行期间 Main thread 被短暂"冻住"。Synchronise the Main-thread cc::Layer tree (plus 4 property trees + DisplayItemList) onto the Compositor-thread LayerImpl tree. This is the pipeline's only explicit cross-thread moment — Main is briefly frozen while it happens.
为什么不能跳过
Why not skip
两线程不能直接共享指针——JS 随时可能在 Main thread 上改动 cc::Layer,Compositor thread 同时还要光栅化它。Commit 是一次 "snapshot + 转交所有权" 的同步,让两边在边界上不打架。The two threads cannot share pointers directly — JS may mutate cc::Layer on Main any moment while the Compositor thread is rasterising it. Commit is a "snapshot + ownership transfer", so the two sides never trip over each other.
Paint produced 2 cc::Layers (main + .follow) + 4 property trees + 2 PaintChunks (each bound to a property-tree state). Commit pushes this whole bundle atomically from Main to Compositor:
The mechanism: TreeSynchronizer walks the two trees in lockstep — Main's cc::Layer tree on one side, the Compositor's LayerImpl tree on the other. Each pair calls PushPropertiesTo() to push deltas. If only .follow's transform changed, this push touches one LayerImpl + one Transform-tree node (changed_flag). The bytes copied per Commit scale with the actual number of changed properties — that's why mutating a single transform doesn't blow up the Commit cost.
Frame Lifecycle · BeginMainFrame 到 Commit
Frame lifecycle · from BeginMainFrame to Commit
vsync 一来,Compositor thread 上的 Scheduler 给 Main thread 发一个 BeginMainFrame。Main thread 接到信号后跑 Style → Layout → Pre-paint → Paint 这四步,把产物(cc::Layer 树)准备好;准备完毕,触发 Commit;Commit 执行期间 Main 被阻塞,结束后 Main 立刻继续干别的(执行 JS、跑 microtask 等)。
When vsync ticks, the Compositor-thread Scheduler sends a BeginMainFrame to the Main thread. Main runs Style → Layout → Pre-paint → Paint to prepare the cc::Layer tree, then triggers Commit. Commit blocks Main; once it returns, Main is free to do other things (execute JS, run microtasks, …).
FIG 11.A一个完整的 frame lifecycle:vsync 触发 BeginMainFrame;Main 跑前 5 步;Commit 后 Main 立即解放,Compositor 接着跑后 7 步。A complete frame lifecycle: vsync fires BeginMainFrame; Main runs the first five steps; Commit unblocks Main, and the Compositor takes over the remaining seven.
Commit is essentially every cc::Layer pushing its latest properties onto its matching cc::LayerImpl. TreeSynchronizer walks both trees in lockstep, calling each Layer's PushPropertiesTo. Textures aren't copied (they live in SharedImage), only properties.
SingleThreadProxy vs ProxyMain · two compositing modes
Chromium 实际上有两套 Commit 实现:
Chromium actually has two Commit implementations:
SingleThreadProxy
Compositor 跑在 Main thread 自己内(非典型)。Android WebView、headless mode 用它。Commit 退化成函数调用。
Compositor runs on the Main thread itself (atypical). Used by Android WebView, headless mode. Commit degenerates into a function call.
ProxyMain ↔ ProxyImpl
默认模式。Main 和 Compositor 各跑各的,通过消息泵通信。Commit 是一次跨线程同步——Main 阻塞等 Compositor 拷完属性。
The default. Main and Compositor each run on their own thread, communicating via message pumps. Commit is a cross-thread sync — Main blocks while the Compositor copies properties.
Commit 慢起来 · 为什么
Slow commits · why
PERF DIAGNOSIS
"为什么我的 Commit 要 30ms"
"Why is my commit taking 30ms"
Commit 慢,常见三种原因:
Three common causes:
① 图层树过深——一万个 cc::Layer 一个一个 PushPropertiesTo 是有成本的。优化:合并相邻图层、避免无意义的 will-change。
① The layer tree is too deep — calling PushPropertiesTo on ten thousand cc::Layers isn't free. Fix: merge adjacent layers, drop pointless will-change.
② Property tree 节点爆炸——每个 transform / clip / effect 都新建一个属性节点。一个 1000 节点的属性树同步起来会卡。
② Property tree node explosion — every transform / clip / effect creates a new property-tree node. A 1000-node tree syncs slowly.
③ 大块图片资源——TransferableResource 引用一旦改动,需要更新 SharedImage 引用,跨进程通信增多。
③ Large image resources — TransferableResource refs that churn force SharedImage ref updates, increasing IPC.
为什么 Commit 必须阻塞 Main thread? · 跨线程数据一致性的硬约束Why must Commit block the Main thread? · a hard constraint on cross-thread data consistency
No blocking = torn reads. A cc::Layer holds a dozen fields — transform / opacity / bounds / display_list, etc. If Main is mid-mutation on transform (x updated, y not yet) when Compositor reads the layer, it gets an illegal "new x, old y" state and the next frame jitters. Either lock (expensive + deadlock risk) or stop-the-world — Chromium picks the latter, because Commit is usually short (<1ms), cheaper than the lock overhead.
Commit 是"事务边界"。 一帧 Main 上跑了:Style → Layout → Paint → 改 cc::Layer。这一长串变更必须原子地一起出现在 Compositor 端,否则 Compositor 看到的是"新的 Layout 但是老的 Paint",几何与像素对不上。Commit 阻塞 Main 那一瞬,本质是在执行一次跨线程事务提交——跟数据库的 COMMIT 同名同义。
Commit is a "transaction boundary". In one frame, Main ran: Style → Layout → Paint → mutate cc::Layer. This whole train of changes must appear atomically on the Compositor side; otherwise Compositor sees "new Layout but old Paint" and geometry doesn't match pixels. Commit blocking Main is literally a cross-thread transaction commit — same name, same meaning as a database COMMIT.
Commit really is short (typically 0.5-2ms), because it moves pointers, not data: LayerImpl is a "shadow copy" of cc::Layer, sharing the underlying SkPicture / TransferableResource (refcount); property trees are shallow vector copies. So even with Main blocked, the impact is small — most pages commit in under 1ms and you can't feel it. What actually slows Commit is cc::Layer count exploding into the thousands, or property tree node count going wild (covered in the "slow commits" section above).
Are there non-blocking alternatives? Yes — "impl-side painting" (experimented in 2014, abandoned) and "composite without commit" (a side-channel for the very few Compositor-only animations). The first was killed for complexity; the second only handles "pure transform / opacity" animations. For most business pages, accepting 1ms of Commit blocking in exchange for a clean pipeline is an excellent trade.
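The transaction framing can be made concrete. A toy sketch, with Commit() as a stop-the-world struct copy: mid-frame the impl side keeps its old consistent snapshot, and after Commit it has the new one, never a mix of new x and old y.

```cpp
#include <cassert>

// Toy commit-as-transaction: Main batches mutations on its side, and
// Commit() applies the whole batch in one copy while Main is blocked, so
// the Compositor never observes a half-updated transform.
struct Transform { int x = 0, y = 0; };

struct Channel {
  Transform main_side;   // mutated freely between commits
  Transform impl_side;   // only ever updated inside Commit()

  void Commit() { impl_side = main_side; }  // atomic w.r.t. impl reads
};
```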
DEVTOOLS
Performance > Frames 行 + "Commit" 事件;Main 线和 Compositor 线对齐看Performance > Frames row + "Commit" event; align Main and Compositor lanes
TRACING
cc, blink.commit, viz.frame_production, scheduler
FLAG
--single-process/ 看 SingleThreadProxy 模式 vs ProxyMain/ inspect SingleThreadProxy vs ProxyMain modes
▸ Performance · BeginMainFrame 到 Commit 的完整周期vsync N · 16.7 ms budget
[Timeline] Compositor lane: BMF → wait for Main → Commit → Tile + Activate → wait for Raster → Draw + Submit → idle.
Main lane: wake → JS → Style → Layout → Pre-paint → Paint → ⤳ Commit (blocked) → microtask + rAF → idle (until next BMF).
Raster ×4 lane: idle → parallel raster (4 threads) → idle.
Axis: 0 · 4 · 8 · 12 · 16.7 ms.
3 个判读: ① Compositor 与 Main 在 "Commit" 那条红线同时停下 — 这是唯一一段两线程显式同步的时刻,~1ms 阻塞;② Commit 之后 Main 立即解锁,可以跑 microtask 与 rAF — 不是在等渲染完成;③ Compositor 在 Commit 之后并不闲,而是继续推进 Tile / Activate / Draw,与 Main 上的 JS 工作并行 — 这就是 cc 的 "异步流水线"。看到 Commit 时间长 > 5ms,八成是 cc::Layer 树过大或 Property 节点爆炸。3 reads: ① Compositor and Main both pause at the red "Commit" line — the one moment of explicit cross-thread sync, ~1ms block; ② Main unlocks immediately after Commit and runs microtasks + rAF — not waiting for rendering to finish; ③ Compositor isn't idle after Commit either — it keeps pushing Tile / Activate / Draw, in parallel with Main's JS work. That's cc's "asynchronous pipeline". Commit time > 5ms almost always means an oversized cc::Layer tree or property-node explosion.
cc::Layer Tree + Property Trees + DisplayItemList
→
OUTPUT
LayerImpl Tree挂在 Compositor thread 上on the Compositor thread
STAGE 07 · CC PHASE
Compositing — 把页面切成独立图层
Compositing — slicing the page into independent layers
"动一处只重画一处" 的物理基础
why "mutate one, repaint one" is possible
Module
cc
Process
Render
Thread
Compositor
Output
GraphicsLayer Tree
这一步在做什么
What it does
在 Compositor thread 上把 LayerImpl 树分组成独立的 GraphicsLayer——每一个 GraphicsLayer 拥有自己的纹理与变换矩阵,能独立动画、独立失效、独立合成。On the Compositor thread, group the LayerImpl tree into independent GraphicsLayers — each owns its own texture and transform matrix, can be animated, invalidated and composited on its own.
为什么不能跳过
Why not skip
没了图层切分,一次普通的滚动也会让所有像素重新 Paint + Raster。即便每个阶段都做缓存,无图层下也救不了——失效粒度是"全屏"。Without layer separation, even a single scroll repaints every pixel. Caches at every prior stage can't save you — the invalidation granularity collapses to "the whole screen".
Pre-paint already reserved a Transform-tree node for .follow; Compositing simply cashes in: extract .follow from the main layer into its own GraphicsLayer. The cost is one extra GPU texture (53×32 ≈ 6KB); the payoff is "only this layer moves on hover".
The promotion criteria: consult the cc::CompositingReason enum — 30+ triggers, any one matches and you're promoted. .follow matches kActiveTransformAnimation (because of will-change: transform); video / canvas / iframe / 3D transform / fixed scroll / overflow:scroll-snap would also trigger. Every layer costs memory, so cc has the reverse heuristic too — "if a layer is too small and isn't animating, fold it back into the parent".
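The promotion logic reduces to a bitmask check plus the reverse heuristic. The reason names echo the cc::CompositingReason examples above; the 16×16 "too small" threshold is invented for this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Toy compositing decision: any matching reason promotes a layer; a tiny,
// non-animating layer is folded back into its parent.
enum Reason : uint32_t {
  kNone                     = 0,
  kActiveTransformAnimation = 1 << 0,
  kBackdropFilter           = 1 << 1,
  kVideo                    = 1 << 2,
};

bool ShouldPromote(uint32_t reasons, int w, int h) {
  if (reasons == kNone) return false;
  bool animating = reasons & kActiveTransformAnimation;
  if (!animating && w * h < 16 * 16) return false;  // fold tiny static layers
  return true;
}
```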
Imagine this stage removed: Paint goes straight to Raster, straight to the screen. The moment Raster's data isn't ready when vsync arrives, a frame drops. Even with caches at every prior stage, a single scroll would force every pixel to be re-Painted and re-Rastered.
Without will-change: .wobble shares a layer with its surroundings; every frame retriggers Paint + Raster of the whole layer. With will-change: transform: the Compositor promotes it to its own GraphicsLayer ahead of time, animation reduces to a matrix multiply on the Compositor thread — zero Main-thread work, zero Raster re-runs.
Animation runs on the Compositor thread · zero Main-thread involvement.
什么会被升格为独立图层 · 完整清单
What gets promoted to its own layer · the full list
Compositor 不是无脑给每个元素一个图层——它有明确的升格条件。下面这张表是常见的命中点(CAP 之后规则有简化但骨架不变):
The Compositor doesn't blindly give every element its own layer — it has explicit promotion criteria. The list below is the common hit set (post-CAP simplifies the rules, but the skeleton is the same):
Every promoted layer costs memory — tile cache, textures, property-tree entries. Sprinkle will-change everywhere and you watch VRAM filled with motionless layers. The rule: only on elements that actually move; remove it once the animation ends.
DevTools · 看一眼自己的图层
DevTools · inspecting your own layers
CHROME DEVTOOLS · LAYERS
每个图层为什么存在 / 占多少内存 / 重画了几次
Why each layer exists, how much memory, how many repaints
DevTools 的 Layers 面板(实验功能里开启)能列出当前页面的所有 GraphicsLayer,点开任意一个会告诉你:这个图层的产生原因(will-change / 3D transform / video / iframe / mix-blend-mode…)、纹理内存占用、已绘制次数。看到一个莫名其妙存在的图层,往往是性能洞的入口。
DevTools' Layers panel (toggle in experiments) lists every GraphicsLayer. Click one and it tells you: why this layer exists (will-change / 3D transform / video / iframe / mix-blend-mode…), texture memory, paint count so far. An unexpectedly-existing layer is often the entry to a perf hole.
输入事件的小后门
Input · the side door
Compositor Thread 还有一个小职能:处理输入事件。Browser Process 把 mousewheel / scroll / touch 投到 Compositor thread 上,它能直接处理而不用麻烦 Main thread——前提是页面没有 JS 监听这些事件。一旦你 addEventListener,Compositor 就只能把事件转发回 Main thread 了。
The Compositor thread also has a side door: input event handling. The Browser process throws mousewheel / scroll / touch directly to the Compositor — bypassing Main thread entirely — as long as no JS is listening. The moment you addEventListener, the Compositor must hand the event back to Main.
默认情况下浏览器认为你可能 preventDefault,所以把 touch 事件路由回 Main thread——一旦 Main 阻塞(JS 慢函数),滚动就掉帧。解决:明确声明 passive listener:
By default the browser assumes you might preventDefault, so it routes touch events back to Main — and any slow JS on Main drops the frame rate. Fix: declare a passive listener:
这样 Compositor thread 会直接处理事件,滚动永远不被 Main thread 拖累。
Now the Compositor thread handles the event directly — scrolling stops paying the Main-thread tax.
Compositor 像一个客厅。
你不开窗户,没人吵——所有动画自己滚自己的;
你一开 onScroll,Main thread 就被叫醒。
Field Note · 02
The Compositor is like a quiet living room.
Keep the windows shut and animations roll themselves;
open an onScroll and the Main thread is awakened.
Field Note · 02
把每个 cc::PictureLayerImpl 按 256×256 / 512×512 切成一组 cc::Tile,根据距视口的距离排好优先级,封装成 cc::TileTask 投入 TaskGraph。这一步只调度,不画。Cut every cc::PictureLayerImpl into 256×256 / 512×512 cc::Tiles, order them by viewport distance, and wrap each into a cc::TileTask for the TaskGraph. This stage only schedules — no painting.
为什么不能跳过
Why not skip
两条物理边界:① GPU 不支持任意大小的纹理——一张超大图层必须切;② 多 Tab Chromium 共用一个统一缓冲池,Tile 是池的最小分配单位。没了 Tiling,多开几个 Tab 就显存爆。Two physical limits: ① GPUs can't support arbitrary-sized textures — a huge layer must be split; ② multi-tab Chromium shares a unified buffer pool, and the tile is its smallest allocation unit. Without Tiling, opening a few extra tabs runs out of VRAM.
The main layer is 340×88 — sliced into 256-aligned tiles, that's 1 full tile + 1 right-edge tile (84×88). The standalone .follow layer is only 53×32 — smaller than a tile but still owns one (because it's bound to its own property-tree state). In-viewport, all 3 tiles are priority_bin = NOW:
Three subtleties: ① edge tiles still allocate a full 256×256 texture (GPUs reject arbitrary sizes, so #r2 is really 256×256 with only 84×88 painted — wasted memory, simpler pool management); ② .follow occupies a tile of its own — this is the memory cost of "promotion to a composited layer", made concrete: a 53×32 button, eating a 256×256 texture slot; ③ 3 cc::TileTasks enter the TaskGraph — alongside 1 ImageDecodeTask (for airing.png), totaling 4 tasks for the Raster thread pool to chew through.
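The tile math for the card is two ceil-divisions. A sketch using the article's 256px tile size:

```cpp
#include <cassert>

// Toy tiling: how many 256px tiles cover a layer, ceil-division per axis.
// Edge tiles still allocate a full 256×256 texture; only part is painted.
struct Grid { int cols, rows, tiles; };

Grid TileGrid(int w, int h, int tile = 256) {
  Grid g;
  g.cols = (w + tile - 1) / tile;
  g.rows = (h + tile - 1) / tile;
  g.tiles = g.cols * g.rows;
  return g;
}
```

TileGrid(340, 88) reproduces the walkthrough's 1 full + 1 edge tile; the 84×88 edge tile still occupies a 256×256 texture slot.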
cc::Tile · 一个最小渲染单元的内部
cc::Tile · what's inside one mosaic piece
每一个 Tile 不是一张单纯的位图——它带着身份与状态:
A Tile is not a plain bitmap — it carries identity and state:
The figure below shows TileManager re-prioritising tiles during inertial scroll — deeper blue = higher priority; the chequerboard zone is the "not yet ready" placeholder:
FIG 13视口在网页中跳来跳去(模拟惯性滚动),cc::TileManager 实时调度每个 Tile 的光栅化优先级。As the viewport jumps across the page (simulating inertial scroll), cc::TileManager re-orders raster priority for every tile in real time.
Before a Tile task lands on a Raster thread, it travels this full chain — from the vsync-triggered BeginMainFrame down to SingleThreadTaskGraphRunner::ScheduleTasks:
256×256 is the empirical sweet spot. Mobile uses 256 (narrow screens, small animation range); desktop uses 512 (large viewport amortises the overhead). Tile size is not a constant — it is chosen per device.
Intuition says "small layer = small tile, big layer = big tile" should be most efficient. cc doesn't do this, because tiles aren't just "chunks" — they're the entire GPU memory system's "currency unit":
Uniform size → reusable texture pool. GPU texture allocation is extremely slow (tens of µs to ms). cc maintains a ResourcePool caching "allocated but idle" texture slots — when a tile needs one, grab from the pool, O(1). But that requires all tiles to be the same size: with mixed sizes, the pool buckets by size, large slots sit unused while small ones are starved — heavy fragmentation. Uniform 256/512 = every slot in the pool is fully interchangeable — the foundation of ResourcePool's hit rate.
统一尺寸 → 跨 Tab 共享内存。Chrome 多 Tab 共用同一个 GPU 进程的纹理池。如果 Tab A 用 256, Tab B 用 333, Tab C 用 512,池子被三种尺寸切碎,共享几乎不可能。统一标准让 30 个 Tab 共用一池——就像内存分页用 4KB 一个标准,不会按文件大小做动态页大小。
Uniform size → cross-tab memory sharing. Chrome's tabs share one GPU-process texture pool. If Tab A uses 256, Tab B uses 333, Tab C uses 512, the pool is sliced three ways and sharing is nearly impossible. Standard size lets 30 tabs share one pool — just like memory paging uses one 4KB standard, not dynamic page sizes per file.
Uniform size → simpler algorithm. TileManager's priority calc uses grid-coord distance fields — different grid pitches per layer would force normalisation before comparison. Uniform 256 makes "row N, col M" equivalent across all layers; priority compares directly.
So "per-device" 256 vs 512 is fine (same device → all layers use the same size → pool stays uniform); but "per-layer" dynamic isn't — it would gut the entire cache-reuse mechanism of ResourcePool. The same trade-off as an OS picking 4KB vs 2MB Huge Pages: the reuse dividend of one standard far outweighs the savings of "fitting each case".
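Why uniform size buys O(1) reuse, in code. A toy ResourcePool: because every slot is interchangeable, acquire is a pop and release is a push, and size-bucket fragmentation cannot happen:

```cpp
#include <cassert>
#include <vector>

// Toy ResourcePool: all tiles share one texture size, so any idle slot
// satisfies any request. Pool hits skip the slow GPU allocation entirely.
struct ResourcePool {
  std::vector<int> idle_slots;  // ids of allocated-but-unused textures
  int allocations = 0;

  int Acquire() {
    if (!idle_slots.empty()) {
      int id = idle_slots.back();
      idle_slots.pop_back();
      return id;               // pool hit: no GPU allocation
    }
    return allocations++;      // pool miss: slow GPU allocation
  }
  void Release(int id) { idle_slots.push_back(id); }
};
```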
预测光栅化 · 先低后高
Predictive raster · low first, high later
Chromium 还做一件事:"先粗后细"。首次合成图块时降低分辨率("LOW resolution tiling"),等优先级转成 NOW 后再补上高分辨率版本。这样首屏看起来"立刻有内容",但内容会在第二三帧"清晰一下"——你在 4G 网络下加载长页时常会看到这种行为。
Chromium also does "coarse first, fine later". The first composite of a tile is rendered at lower resolution ("LOW resolution tiling"); the high-res version follows once it gets bumped to NOW. The screen has "something" immediately, but you'll see content "sharpen" a frame or two later — common when you load a long page over 4G.
TileManager 与 ImageDecodeCache · 共用 TaskGraph
TileManager & ImageDecodeCache · sharing the TaskGraph
Tile tasks aren't islands — cc::ImageDecodeCache drops JPEG/PNG/WebP decode tasks into the same TaskGraph, consumed by the same Raster threads. Decode and rasterisation share one CPU pool. That's why image-heavy pages stutter more on low-end devices — Raster threads get monopolised by decoding.
3 个判读规则: ① 视口内必须全 NOW + HIGH 分辨率 — 任何一个 SOON 就是 jank 信号;② SOON tiles 数量 = 用户即将看到的内容量,这个数大表示页面"纵向接续" 多,惯性滚动会更平滑;③ LOW resolution 数量 > 0 = 上一帧 raster 跟不上 GPU 上传需求,系统主动降级 — 短暂可接受,持续就是性能崩盘。TileManager 调度的全部目标:让视口里永远是绿色 NOW,SOON 在路上,EVENTUALLY 别动。3 reading rules: ① viewport must be entirely NOW + HIGH res — any SOON is a jank signal; ② SOON tile count = volume of about-to-show content, larger = more "vertical continuity", smoother inertial scroll; ③ LOW resolution count > 0 = last frame's raster couldn't keep up with GPU upload demand, system actively downgraded — temporarily OK, sustained = perf collapse. TileManager's entire goal: keep the viewport green-NOW, SOON in flight, EVENTUALLY untouched.
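The binning rule can be sketched in one function. NOW / SOON / EVENTUALLY follow the article; the 512px pre-paint band is an invented threshold for the sketch:

```cpp
#include <cassert>

// Toy TileManager binning: a tile's vertical distance from the viewport
// decides which priority bin it lands in.
enum Bin { NOW, SOON, EVENTUALLY };

Bin PriorityBin(int tile_top, int tile_bottom,
                int viewport_top, int viewport_bottom) {
  if (tile_bottom >= viewport_top && tile_top <= viewport_bottom)
    return NOW;                                   // intersects the viewport
  int dist = (tile_top > viewport_bottom) ? tile_top - viewport_bottom
                                          : viewport_top - tile_bottom;
  return dist < 512 ? SOON : EVENTUALLY;          // one pre-paint band
}
```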
cc::TileTask[]已按优先级排序,丢入 TaskGraphprioritised, posted into the TaskGraph
STAGE 09 · CC PHASE
Raster — 把"绘画清单"翻成像素
Raster — translating the display list into pixels
绘画清单的真实落地
DisplayItemList playback into bitmaps / textures
Module
cc
Process
Render
Thread
Raster ×N
Output
Tile texture / bitmap
这一步在做什么
What it does
Raster 线程逐个执行 cc::TileTask——把 DisplayItemList 中属于该 Tile 的绘画指令"Playback" 到一块纹理(GPU SharedImage)或位图(共享内存)上。13 步里第一次真正"画像素"的一步。The Raster threads run cc::TileTasks one by one — "playing back" the DisplayItemList's draw commands that fall in this tile onto a texture (GPU SharedImage) or bitmap (shared memory). The first stage in the 13 that actually paints pixels.
为什么不能跳过
Why not skip
DrawQuad 需要的"图片资源" 必须事先存在。Raster 是把 cc 的"指令"变成 GPU 能采样的"纹理"的唯一桥梁。没有 Raster,Display 阶段无 quad 可贴。DrawQuads require pre-existing "image resources". Raster is the sole bridge from cc's "instructions" into GPU-samplable "textures". Without Raster, Display has nothing to sample.
Three tiles dispatch to 3 Raster threads in parallel, each calling Skia to replay the relevant slice of the DisplayItemList onto its own SharedImage. Simultaneously, ImageDecodeCache hands airing.png to a 4th Raster thread for PNG decoding; the result lands directly into yet another SharedImage:
Notice #r4 is separate from #r1: the avatar doesn't belong to any tile — it's a standalone resource. When Tile #t1 plays back the avatar's DisplayItem, it actually writes a reference (DrawImageRect(SharedImage_id=#r4, dst_rect=avatar)); the GPU samples #r4 only at final composite time. This is the heart of "resource independent + reference assembled" — the same avatar can be shared across many tiles and layers without duplication. If on this frame at 9ms #r4 hasn't finished decoding, Tile #t1's playback still proceeds — the avatar slot just paints transparent; once #r4 is ready next frame, the avatar appears. This is the physical source of "avatar appears grey first, then loads".
"Playback" is not a metaphor — given the Tile's DisplayItemList, the Raster thread re-executes every DisplayItem in order, outputting onto the target buffer. This is exactly why Paint doesn't paint and Raster does — the same instruction list can be played back at different scales, on different tiles, on different devices. It's the heart of Chromium's performance model.
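The "grey avatar first" behaviour falls out of a simple rule: playback never blocks on a decode, it just counts what was missing. A sketch (image keys are plain strings here; real cc keys are PaintImage ids):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy tile playback: draw-image items that reference an undecoded image
// paint a transparent placeholder instead of blocking. The tile still
// activates, and num_missing rides back to Viz as "another frame needed".
int PlaybackTile(const std::vector<std::string>& image_refs,
                 const std::set<std::string>& decoded) {
  int num_missing = 0;
  for (const auto& ref : image_refs)
    if (!decoded.count(ref)) ++num_missing;  // slot painted transparent
  return num_missing;
}
```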
同步 vs 异步光栅化
Sync vs async rasterisation
浏览器走的是异步分块光栅化;移动 OS 与 Flutter 走同步光栅化。两条路线各有优势——下面这张对照能让你看清边界:
Browsers run async tiled raster; mobile OSes and Flutter run sync raster. Each route has its strong suit. The boundary, side-by-side:
SYNCHRONOUS
同步光栅化Synchronous raster
Android / iOS / Flutter · 间接像素缓冲
Android / iOS / Flutter · indirect pixel buffer
内存占用 Memory footprint: A+
首屏性能 Cold-start TTI: B
动态变化 Dynamic content: B
图层动画 Layer animation: C
低端机 Low-end devices: C
ASYNCHRONOUS · TILED
异步分块光栅化Async tiled raster
Chromium / WebView · Raster thread
Chromium / WebView · Raster thread
内存占用 Memory footprint: D
首屏性能 Cold-start TTI: C
动态变化 Dynamic content: C
图层动画 Layer animation: A+
惯性滚动 Inertial scroll: A
总结一句:浏览器内核的性能,大半是用内存换来的。异步光栅化给惯性滚动和 CSS 动画带来绝对优势,但代价是内存占用极高 · 快速滚动会白屏 · 滚动中 DOM 更新可能不同步。
In one line: browser-engine performance is mostly bought with memory. Async raster gives inertial scroll and CSS animations their unfair advantage, at the cost of massive memory · white screens during fast scroll · DOM updates that may desync mid-scroll.
The figure below runs both strategies live — left: sync freezes the screen at each "raster" moment (yellow RASTER flash), scroll proceeds in discrete steps. Right: async scrolls continuously, but the viewport edges show chequer placeholders (raster hasn't caught up); tiles fill in one by one. The thread strips below explain who is moving and who is idle.
SYNC
同步光栅化 · 串行 · raster→composite→displaysync raster · serial · raster→composite→display
[Lanes] Main (RASTER… bursts) · GPU · Display
看出来: Main thread 一直在 raster (满格铜色),屏幕只在每次 raster 末尾"跳一下"。8 帧/4 秒 = 2 fps 的视觉节奏。每多滚 1px 就要重 raster 整屏。Look: the Main thread is always rastering (full copper bars), and the screen only "jumps" at the tail of each raster. 8 frames in 4s = 2 fps visual rhythm. Every extra px of scroll re-rasters the entire screen.
看出来: Main thread 全程闲(灰条),Compositor 与 GPU 不停跑(满格),Raster 1/2/3 三条线程并行各自处理 tile。视口里出现棋盘格的瞬间——那是raster 还没追上;但屏幕每帧都在动,不冻结。这就是 60fps 的代价:多 3 条 Raster 线程 + 一堆 SharedImage 内存。Look: the Main thread idles all the way (grey strip), Compositor and GPU run continuously, Raster 1/2/3 process tiles in parallel. The chequer cells in the viewport are tiles raster hasn't caught up to yet — but the screen moves every frame, never freezes. This is the cost of 60fps: three extra Raster threads + a chunk of SharedImage memory.
FIG 14·anim两种光栅化策略的实时对照。Sync 那侧整屏冻结、离散滚动;Async 那侧持续滚动、棋盘占位。彩蛋: 这个动画自身只用 transform 和 opacity,所以它正是它讲的"纯 Compositor 动画"的实例——你读这段字时,Main 和 Raster 都没在跑这个动画的任何一帧。The two raster strategies, live. Sync freezes the entire screen and scrolls in discrete steps; Async scrolls continuously and shows chequer placeholders. Easter egg: this very animation uses only transform and opacity, which means it is itself an instance of the "pure Compositor animation" it describes — while you read this caption, neither Main nor Raster runs a single frame of this animation.
Different hardware / accel capability / config map to different RasterBufferProvider subtypes. Their difference is really about "how the raster output reaches GPU memory" — fewer copies the better:
SharedImage is Chromium's abstraction over GPU data storage — it replaced the older Mailbox mechanism. The architecture is a classic Client / Service split:
FIG 14SharedImage:多个 Client 都能直写 GPU 内存,由 GPU Process 上的唯一 Service 协调。这是 cc Raster 与 Viz 之间能"零拷贝"传纹理的底座。SharedImage: many Clients write directly to GPU memory, coordinated by the single Service in the GPU process. This is the substrate that makes textures travel from cc Raster to Viz with zero copies.
Got <img> on the page? JPEG / PNG / WebP decoding also happens here — cc::ImageDecodeCache orchestrates the Raster threads to decode asynchronously: decode tasks and tile tasks share the same TaskGraph.
Raster thread count is fixed (typically ≤ 4, tied to CPU cores). A large image decode takes 10–80ms; while it holds a thread, tile tasks queue behind it. Critical viewport content gets delayed — the root cause of "image-heavy pages scroll-jank + render-slow".
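The starvation is plain arithmetic once you model the shared lane. Durations below are illustrative, taken from the 10–80ms decode range quoted above:

```cpp
#include <cassert>
#include <vector>

// Toy shared TaskGraph lane: tasks run in order on one Raster thread, so a
// long image decode pushes back every tile task queued behind it.
struct Task { int duration_ms; };

int FinishTimeOfLast(const std::vector<Task>& lane) {
  int t = 0;
  for (const Task& task : lane) t += task.duration_ms;
  return t;
}
```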
The difference between all RasterBufferProvider subclasses really hides in two virtuals — AcquireBufferForRaster() decides where the texture is borrowed from, RasterBuffer::Playback() decides how the DisplayItemList lands on it:
The string of "depends_on_*" booleans hides a core secret of cc — it asks: does this tile's output depend on any "not-yet-decoded image"?. If yes, cc cannot reuse the previous tile's pixels for partial raster — the whole tile re-queues. This is why "scrolling onto a WebP-heavy region often shows the chequer first" — the precondition fails, and the tile gets re-scheduled from scratch.
The GPU raster path on Skia rides on DDL (Deferred Display List). Skia physically splits "building the drawing commands" from "submitting them to the GPU" into two threads: the Raster thread records the DDL (no GL context), the GPU thread replays it (owns the GL context).
This is why Chromium's desktop rendering doesn't need to "funnel every GL call back to the main thread" — Raster threads only build command buffers, while the GPU context stays exclusive to the Viz process from beginning to end. Many tabs rasterising in parallel = dozens of Raster threads each writing their own DDL, queued for the single GPU thread to replay. "Record / Replay" is the real hinge of Chromium's multi-threaded rendering.
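Record / Replay in miniature. A string list stands in for Skia's deferred display list; the point is the split: recording needs no GL context, replay owns it:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy DDL: Raster threads append commands with no GPU context; the single
// GPU thread replays the finished list later.
struct CommandList { std::vector<std::string> cmds; };

CommandList RecordOnRasterThread() {             // no GL context needed here
  return {{"drawRect", "drawImage", "drawText"}};
}

int ReplayOnGpuThread(const CommandList& ddl) {  // owns the GL context
  int executed = 0;
  for (const auto& c : ddl.cmds) { (void)c; ++executed; }
  return executed;
}
```

Dozens of tabs each recording their own list, one GPU thread draining them: that queue is the "hinge" the paragraph above describes.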
[Legend] NOW · ready / NOW · raster pending / LOW-res placeholder / chequer · not yet rasterised / viewport
RASTER THREADS · 6 ms. Raster 1: tile #t1 → #t5 → #t9; Raster 2: tile #t2 → #t6; Raster 3: tile #t3 → #t7; Raster 4: decode avatar.png → tile #t8.
Axis: 0 · 3 ms · 6 ms.
看 4 件事: ① 视口内(黄框)必须全绿 — 否则就是棋盘;② 视口外的 LOW-res(对角线)是预渲染,会逐步 upgrade 到 NOW;③ 右侧 4 条 Raster 线程满载并行,Raster 4 上 Image decode 跟 tile raster 共用 TaskGraph(C14 那段说的);④ 棋盘格密集 = TileManager 优先级排错或 raster 跟不上 — 用 cc.debug.scheduling 找瓶颈。4 things to look at: ① the viewport (yellow border) must be all green — otherwise chequer; ② LOW-res (diagonal) outside the viewport is pre-render, gradually upgraded to NOW; ③ the 4 Raster lanes on the right run in full parallel; Raster 4 shares its TaskGraph between image decode and tile raster (the C14 footgun); ④ dense chequer = TileManager priority misorder or raster can't keep up — use cc.debug.scheduling to find the bottleneck.
pending → active → recycle, the buffering you didn't see
Module
cc
Process
Render
Thread
Compositor
Output
Active LayerImpl Tree
这一步在做什么
What it does
把已经光栅化好的 Pending Tree "翻"成可被 Draw 使用的 Active Tree。这一步是 Compositor thread 上的原子切换:切换前后屏幕始终能看到一帧合法画面。Promote the now-rasterised Pending Tree into a draw-ready Active Tree. The switch is atomic on the Compositor thread — the screen always sees a valid frame, before and after.
为什么不能跳过
Why not skip
没了 Activate,光栅化与上屏就成了串行:要么等所有 tile 画完再上屏(卡顿),要么边画边上屏(撕裂)。三棵树的中间层是把这两个矛盾解开的设计。Without Activate, raster and display would serialise: either wait for every tile, then display (stalls) or display while painting (tearing). The triple-tree middle layer is what unties the knot.
All 3 tiles successfully rastered — IsReadyToActivate() returns true. The Compositor thread swaps the active_tree_ and pending_tree_ pointers: the old pending (freshly rastered) becomes active; the old active (last frame) becomes recycle. From this instant, Draw emits quads from the new active.
Counter-example: if avatar #r4 hasn't decoded yet, #t1 is a "partial raster with placeholder"; IsReadyToActivate() may still return true (partial raster counts as ready). But num_missing_tiles > 0, and this number rides the CompositorFrame metadata back to Viz — Viz knows another frame is likely needed. "Activated ≠ complete" is a key fact about this stage — it only guarantees "drawable", never "complete".
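A minimal sketch of the pointer rotation, assuming `Tree` and the helper names are hypothetical stand-ins (only the `*_tree_` field names follow the LayerTreeHostImpl snippet below):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Sketch of the three-tree rotation. It is "atomic" in the sense that all
// the swaps happen inside one task on the Compositor thread.
struct Tree { std::string label; };

struct TreeHost {
  std::unique_ptr<Tree> active_tree_;   // being drawn (read-only to cc)
  std::unique_ptr<Tree> pending_tree_;  // being rastered / committed into
  std::unique_ptr<Tree> recycle_tree_;  // parked, awaiting the next Commit

  // ActivateSyncTree()-style swap: pure pointer moves, no allocation.
  void Activate() {
    recycle_tree_ = std::move(active_tree_);  // old frame -> recycle
    active_tree_ = std::move(pending_tree_);  // rastered frame -> screen
  }

  // The next Commit reuses the recycled allocation instead of a new tree.
  void BeginCommit(const std::string& label) {
    pending_tree_ = std::move(recycle_tree_);
    if (!pending_tree_) pending_tree_ = std::make_unique<Tree>();
    pending_tree_->label = label;
  }
};
```

Note that `Activate()` never frees anything: the retired active tree survives in `recycle_tree_`, which is exactly the "no per-frame alloc/free" property the third tree buys.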
三棵树 · 各司其职
Three trees · each with one job
Compositor thread 同时持有三棵 LayerImpl 树:
The Compositor thread holds three LayerImpl trees at the same time:
class LayerTreeHostImpl {
  // Tree currently being drawn.
  std::unique_ptr<LayerTreeImpl> active_tree_;

  // In impl-side painting mode, tree with possibly
  // incomplete rasterized content.
  // May be promoted to active by ActivateSyncTree().
  std::unique_ptr<LayerTreeImpl> pending_tree_;

  // Inert tree with layers that can be recycled
  // by the next sync from the main thread.
  std::unique_ptr<LayerTreeImpl> recycle_tree_;
};
两棵不够:如果只有 Pending + Active,每次 Commit 都要等 Active 用完再回收,主线程要等 Compositor 一次 Draw 周期,无法连续提交。
Two trees is not enough: with just Pending + Active, every Commit would have to wait for Active to be done before recycling, so the Main thread waits a Compositor draw cycle — no continuous commits.
Activation has a precondition: IsReadyToActivate() must return true — meaning every viewport tile on the Pending tree is rasterised. If a far-away tile or an image decode hasn't finished, activation is deferred until NotifyReadyToActivate fires. That's the physical source of "checkerboard tiles during fast scroll" — Pending isn't ready, so Active is still last frame.
为什么是三棵 LayerImpl,不是两棵? · 三个并发态需要三块"各自占位"Why three LayerImpl trees, not two? · three concurrent phases need three independent slots
Pending + Active 两棵看起来够了——但实际跑下来,每一帧都会卡一下。三棵的真正理由,是"下一帧的准备工作 ‖ 当前帧的展示 ‖ 上一帧的回收" 这三件事在物理上同时存在:
Two trees (Pending + Active) seem enough — but in practice, every frame would stall briefly. The real reason for three: "preparing the next frame ‖ displaying the current frame ‖ recycling the previous frame" all coexist in time:
// Timeline · time-slicing across the three trees
t = 0ms   Active  ─▶ drawing frame N on screen   // the GPU is reading it
          Pending ─▶ receiving frame N+1's Commit
          Recycle ─▶ idle (reserved for N+2)
t = 4ms   Pending ─▶ raster complete, ready to Activate
          Active  ─▶ still drawing frame N (GPU not done)
          Recycle ─▶ idle
t = 8ms   vsync · ActivateSyncTree() swaps the pointers
          old Active  ──▶ Recycle (reused by the next Commit)
          old Pending ─▶ Active (frame N+1 goes on screen)
          new Pending ◀─ Main thread starts committing frame N+2
三棵刚好对应三个角色:
Three trees, three roles:
Active · 当前帧,GPU 正在采样它的 tile 纹理。这棵树只能读,不能改——一改 GPU 就读到撕裂数据。
Active · the current frame, GPU is sampling its tile textures. This tree is read-only — mutate it and the GPU sees torn data.
Pending · the next frame's worksite, Raster threads pour new tiles in, Main's Commit also writes here. Must be independent of Active, or the rule above is violated.
Recycle · the "resting position" for the just-retired Active tree. When the next Commit arrives, simply rename Recycle to Pending (pointer swap) and reuse its memory slot. Without Recycle, every Commit would allocate a fresh LayerImpl tree — GC + fragmentation cost would be brutal.
Only 2 trees (Active + Pending): when Activate promotes Pending to Active, the old Active is immediately discarded — next Commit must build a fresh one. One alloc + one free per frame, 60 times/sec at 60Hz. Each LayerImpl carries Tile references + property-tree copies; allocating multi-MB object trees that often is expensive.
So "three" is forced by the pipeline's three concurrent states: "being drawn" / "being built" / "waiting to be reused". The same pattern as OS process scheduling (running / ready / blocked), and database WAL (active / clean / recycled). Any "consume + produce + recycle" system's minimum viable config is three slots.
DEVTOOLS
Performance > "Activate Layer Tree" 事件;Layers 面板看 pending vs activePerformance > "Activate Layer Tree" event; Layers panel for pending vs active
3 个观察点: ① Active 指针在每个 vsync 整数倍切换 = 健康节奏(60fps);② Pending 提前 ~12ms 完成 build 才跟得上 — 跟不上(图中 frame N+4)就 Activate 推迟,Active 还显示老 frame,用户感知"这一帧没动";③ Recycle 永远满载 — 它总是上一个被替换下来的 Active,留给下一次 Commit 复用,这就是"不需要每帧 alloc/free LayerImpl" 的源头。3 watch points: ① Active pointer flips at every vsync = healthy rhythm (60fps); ② Pending finishes build ~12ms early to keep up — when it can't (frame N+4 here), Activate is delayed, Active stays on the old frame, and the user perceives "this frame didn't move"; ③ Recycle is always full — always holding the just-retired Active, ready for next Commit's reuse. This is the source of "no per-frame alloc/free of LayerImpl".
遍历 Active Tree 的每个 LayerImpl,调用 AppendQuads 生成一组 viz::DrawQuad,封装为 viz::CompositorFrame,发给 Viz Process。这一步不动 GPU 一根毛——它生产的是"指令脚本"。Walk the Active tree's LayerImpls, call AppendQuads on each, produce a batch of viz::DrawQuad, wrap them in a viz::CompositorFrame and ship to the Viz process. The GPU is not touched here — what's produced is an "instruction script".
为什么不能跳过
Why not skip
Render Process 不能直接画屏幕——OS 把 GPU 上下文交给 GPU/Viz Process。所以 Render 必须把"我想画什么"序列化成 CF,由 Viz 执行。这是多进程隔离的代价。A Render process cannot draw to the screen directly — the OS hands the GPU context to the GPU/Viz process. So Render must serialise "what to draw" into a CF that Viz executes. This is the price of multi-process isolation.
On the Active tree, AppendQuads runs: the main PictureLayerImpl emits 2 TileDrawQuads (one per tile); the standalone .follow layer emits 1. Because box-shadow flipped render_surface_reason_ on in Pre-paint, the shadow becomes its own RenderPass. The final viz::CompositorFrame:
Four meaningful details: ① the shadow gets its own RenderPass + #r5 temporary texture — this is box-shadow's real GPU cost, the larger the blur radius the bigger the temp texture; ② the two main-layer TileDrawQuads share one SharedQuadState (same transform/clip/opacity) — Viz computes the matrix once; ③ the avatar doesn't appear in the quad list — its reference is baked into #r1's tile texture (Raster painted it in); to Viz, #r1 is just one solid 256×88 chunk of pixels; ④ the whole CF travels via LayerTreeFrameSink::SubmitCompositorFrame over Mojo IPC to the Viz process — at this instant, the Render process is done with this frame.
The hairiest one is PictureLayerImpl::AppendQuads — it walks all "currently visible tiles" of the layer and emits Quads based on each tile's state: rasterised → TileDrawQuad; in-viewport-but-unrasterised → SolidColorDrawQuad placeholder (background colour + chequer); missing but low-res available → fall back to the low-res tier. The skeleton:
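A hedged sketch of that per-tile decision — not the real PictureLayerImpl::AppendQuads (which iterates tilings through iterators); every type here is a simplified stand-in:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a tile's raster state.
struct Tile { bool rastered; bool low_res_available; };

// Per visible tile: rastered -> TileDrawQuad; missing but low-res tier
// available -> fall back; otherwise emit a chequer placeholder and count
// it, so the number can ride the CompositorFrame metadata to Viz.
std::vector<std::string> AppendQuads(const std::vector<Tile>& visible_tiles,
                                     int* num_missing_tiles) {
  std::vector<std::string> quads;
  *num_missing_tiles = 0;
  for (const auto& tile : visible_tiles) {
    if (tile.rastered) {
      quads.push_back("TileDrawQuad");           // happy path
    } else if (tile.low_res_available) {
      quads.push_back("TileDrawQuad(low-res)");  // fall back a tier
    } else {
      quads.push_back("SolidColorDrawQuad");     // chequer placeholder
      ++*num_missing_tiles;
    }
  }
  return quads;
}
```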
Two details form the contract behind smooth scrolling: ① shared SharedQuadState — the few hundred TileDrawQuads from a single layer share one transform / clip / opacity, Viz computes the matrix once; ② num_missing_tiles bubbles into the frame metadata — Viz reads it to know "how chequered is this frame?" and, if needed, can defer Activate until the next Raster cycle catches up. That's the actual shape of the "feedback loop" between cc and viz.
By default, all DrawQuads land in the same RenderPass — "draw onto the root surface". But when a layer has an effect that requires off-screen compositing (e.g. filter: blur(8px), mix-blend-mode, mask-image), cc creates a dedicated RenderPass: render the subtree to a temporary texture, then reference it back into the main pass as a RenderPassDrawQuad.
backdrop-filter reads the background pixels and blurs them. cc spins up a RenderPass: render every layer it covers to a temporary texture, run the blur shader, then composite back into the main pass. Every frame pays for an off-screen pass + a GPU blur — double cost in VRAM and bandwidth.
DrawQuad 的 6 种类型
Six flavours of DrawQuad
TileDrawQuad
最常见——一个 Tile 块。DisplayItemList 被 cc 光栅化后就变它。
The default — one tile. DisplayItemLists become these after cc rasterises them.
TextureDrawQuad
引用一份 GPU 资源——Canvas / WebGL / Video 都走它。
References a GPU resource — Canvas / WebGL / Video all take this path.
SolidColorDrawQuad
纯色矩形。最便宜的 Quad。
A solid-coloured rectangle. The cheapest quad on the menu.
RenderPassDrawQuad
引用另一个 RenderPass 的 ID——给嵌套特效用。
References another RenderPass by ID — for nested effects.
SurfaceDrawQuad
嵌入另一个进程的 Surface——OOPIF / OffscreenCanvas 的关键。
Embeds a Surface from another process — the linchpin of OOPIF and OffscreenCanvas.
PictureDrawQuad
里面直接装 DisplayItemList——目前只 Android WebView 用。
Carries a DisplayItemList directly — only Android WebView uses this today.
LayerTreeFrameSink · 把 CF 寄出去
LayerTreeFrameSink · the parcel office
CF 装好之后,cc 调用 LayerTreeFrameSink::SubmitCompositorFrame(local_surface_id, frame, hit_test_data) 把它通过 Mojo IPC 投到 Viz 进程。Render Process 的渲染至此结束——剩下的事归 Viz 管。
Once the CF is packed, cc calls LayerTreeFrameSink::SubmitCompositorFrame(local_surface_id, frame, hit_test_data) and ships it to Viz over Mojo IPC. The Render process's rendering work ends here — everything that follows belongs to Viz.
为什么叫 "Submit" 而不是 "Draw"WHY "SUBMIT", NOT "DRAW"cc 在这里强调"提交"——它不直接画像素,只是把"应该画什么"寄给 Viz。如果 Viz 没空(GPU 忙、vsync 错过),CF 会被排队甚至丢弃。Submit 的成功 ≠ 上屏。Render 进程通过 BeginFrameAck 才知道"自己上一次的 Submit 上屏了没"。cc emphasises "submit" here — it doesn't paint pixels, it ships "what to paint" to Viz. If Viz is busy (GPU is full, vsync missed), the CF queues or even drops. Submit succeeding ≠ on-screen. The Render process learns whether its last submission landed via BeginFrameAck.
Submit 与 Draw,Chromium 词汇里到底是什么区别? · "下单" vs "下厨"Submit vs Draw — what's the precise difference in Chromium-speak? · "placing the order" vs "cooking it"
In Chromium's vocabulary, these two words are never interchangeable, but they look so similar they're easy to confuse. The simplest analogy: Submit is "placing the order", Draw is "cooking it".
Submit
Draw
谁在做Who
cc (Render Process · Compositor thread)
Viz (GPU Process · GPU thread)
在做什么Doing what
把 CompositorFrame 通过 Mojo IPC 寄出去ships CompositorFrame via Mojo IPC
真的调 GL/Vulkan 把像素画到 framebufferactually calls GL/Vulkan, paints pixels to framebuffer
"Submit success" only tells you "cc packaged its work". Web Vitals' LCP/CLS cannot use Submit time — they must use actual on-screen time. That's exactly why Chrome internals have FrameMetrics, which after the Display stage uses BeginFrameAck to feed "did this frame reach the screen?" back to the Render process.
The "queue period" between Submit and Draw is Viz's load buffer. The GPU is occasionally busy (other tabs running heavy animations); Viz can queue 2-3 frames of CF. Once it exceeds 3, old ones drop (kSkipped) — cc, via BeginFrameAck, learns "my last frame was wasted work" and decides whether to degrade the next (lower resolution, skip animation frame). "Submit is push, Draw is pull, queue in between" is the essence of this architecture.
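The drop-oldest buffering can be sketched as follows — the cap of 3 and the kSkipped label follow the text above, while the class itself is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model of Viz's load buffer between Submit (push) and Draw (pull).
class FrameQueue {
 public:
  // Submit is push: if the queue is full, the oldest CF is dropped and
  // its fate (reported back to cc via BeginFrameAck) is returned.
  std::string Submit(const std::string& frame) {
    std::string dropped;
    if (queue_.size() == 3) {                     // buffer at most ~3 CFs
      dropped = queue_.front() + ": kSkipped";    // wasted work for cc
      queue_.pop_front();
    }
    queue_.push_back(frame);
    return dropped;  // empty string == nothing dropped
  }
  std::size_t depth() const { return queue_.size(); }

 private:
  std::deque<std::string> queue_;
};
```

Seeing `kSkipped` in the ack is cc's cue to degrade the next frame (lower resolution, skip an animation step) rather than keep pushing work the GPU will never show.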
Analogy: very much like Git's git push vs git merge — push only "uploads commits to the remote" (which may reject); merge is "actually integrating into trunk". Chromium pushes this "submit ≠ land" semantic to the limit, and any perf monitor that conflates Submit with Draw timestamps gives wrong numbers.
Active LayerImpl Tree + rasterised tile textures
→
OUTPUT
viz::CompositorFrame · 通过 Mojo IPC 投到 Viz · delivered to Viz via Mojo IPC
STAGE 12 · VIZ PHASE
Aggregate — 多个进程的 CF 合成一帧
Aggregate — many CFs become one frame
把跨进程的 Surface 拍平
flattening surface trees across processes
Module
viz
Process
GPU (hosts viz)
Thread
Skia / Display Compositor
Output
Aggregated CF
这一步在做什么
What it does
Viz 把当前所有"活着"的 CF(来自 Browser UI、每个 Render、每个 OOPIF、每个 OffscreenCanvas)按 SurfaceId 引用关系铺平成一份 Aggregated CF。Viz takes every live CF (Browser UI, every Render process, every OOPIF, every OffscreenCanvas) and flattens them — following SurfaceId references — into a single Aggregated CF.
为什么不能跳过
Why not skip
屏幕只有一块。多进程产出多份 CF,必须有人决定它们的层叠与裁剪。Aggregate 是 Site Isolation × 流畅渲染的结合点。There's only one screen. Multiple processes produce multiple CFs — someone has to decide their stacking and clipping. Aggregate is where Site Isolation meets smooth rendering.
The variant for this stage: drop the card into an imaginary "third-party praise wall" page. The card is rendered by Render B (ursb.me origin), the parent page by Render A. Both processes submit their CFs to Viz; SurfaceAggregator flattens them into one:
3 件让人惊叹的事: ① Render A 永远拿不到名片的真实像素——它只持有一个 SurfaceId,具体的 #r1~#r5 由 Viz 进程持有。这是 Site Isolation 的图形侧实现,跨域 iframe 的安全边界靠这一刀刻出来;② 变换矩阵会跨边界相乘——父页面给名片 Surface 应用的 T_card 与名片自己内部的 T_inside 在 Viz 里乘起来,等价于"名片直接画在父页面坐标系上";③ 裁剪求交可能让整张卡白白渲染——如果父页面把名片的 clip 设成 0×0(可能因 overflow:hidden 滚出视口),Viz 会跳过整张卡的所有 quad,GPU 一根毛不动,但 Render B 的 cc 仍然在背后默默 raster——这就是"不可见的 OOPIF 也消耗 CPU 但不消耗 GPU"。
3 things to marvel at: ① Render A can never see the card's real pixels — it only holds a SurfaceId; the actual #r1~#r5 live in the Viz process. This is Site Isolation's graphics-side implementation; the cross-origin iframe security boundary is carved here; ② transform matrices multiply across the boundary — the parent's T_card applied to the card's Surface times the card's own internal T_inside equals "the card painted directly into the parent's coordinate system"; ③ clip intersection can make the whole card render in vain — if the parent clips the card to 0×0 (e.g. scrolled out via overflow:hidden), Viz skips every quad of the card, the GPU doesn't move, but Render B's cc is still silently rastering in the background — this is "invisible OOPIFs cost CPU but not GPU".
Aggregate walks depth-first: start from the root surface (typically Browser UI), every SurfaceDrawQuad hit triggers a jump to the referenced Surface, copy its RenderPasses and DrawQuads (with proper transform + clip), then continue.
FIG 17SurfaceAggregator 把分布在多个进程的 CF 合成一帧。OOPIF 之所以"隔离但顺滑",靠的是这个步骤。SurfaceAggregator merges CFs scattered across processes into a single frame. OOPIF stays isolated yet seamless because of this stage.
这套 ID 是跨进程引用的核心——一个 OOPIF 知道父页面的 SurfaceId,但拿不到真实 GPU 纹理;一切只通过这个 ID 由 Viz 在合成时解引用。
This is how cross-process references work — an OOPIF knows the parent page's SurfaceId but cannot reach its actual GPU textures; the dereference happens inside Viz during aggregation.
Damage 跟踪 · 不是每帧都"全合"
Damage tracking · not every frame is fully aggregated
Aggregate has built-in diffing. Each frame, SurfaceAggregator computes damage_rect — the actual area changed since the previous frame. The GPU only redraws that area; the rest is reused from last frame's Front Buffer. A static page + one spinning badge can mean only a few hundred pixels of GPU work per frame.
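A toy version of the damage computation, assuming axis-aligned rects and a made-up `DamageRatio` helper (real tracking lives in viz and is tracked per RenderPass):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified stand-in for gfx::Rect.
struct Rect { int x, y, w, h; };

// Bounding union of two rects.
Rect Union(const Rect& a, const Rect& b) {
  int x1 = std::min(a.x, b.x), y1 = std::min(a.y, b.y);
  int x2 = std::max(a.x + a.w, b.x + b.w);
  int y2 = std::max(a.y + a.h, b.y + b.h);
  return {x1, y1, x2 - x1, y2 - y1};
}

// Fraction of the viewport the GPU actually has to redraw this frame.
// Empty damage => static frame, the front buffer is simply reused.
double DamageRatio(const std::vector<Rect>& damages, const Rect& viewport) {
  if (damages.empty()) return 0.0;
  Rect d = damages[0];
  for (std::size_t i = 1; i < damages.size(); ++i) d = Union(d, damages[i]);
  return double(d.w) * d.h / (double(viewport.w) * viewport.h);
}
```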
CASE · OOPIF
为什么跨域 iframe 也能"完美贴合"父页
Why cross-origin iframes still composite seamlessly
父页面 Render A 的 CF 里包含一个 SurfaceDrawQuad,引用 OOPIF Render B 的 SurfaceId。两进程独立提交 CF 到 Viz;SurfaceAggregator 在 Viz 里把它们用 变换矩阵 + 裁剪 rect 拼好。父页面永远拿不到 OOPIF 的像素,但屏幕上看起来天衣无缝——这就是 Site Isolation 的图形侧实现。
The parent page's CF (Render A) contains a SurfaceDrawQuad referencing OOPIF Render B's SurfaceId. Both processes submit CFs to Viz independently; SurfaceAggregator stitches them together with transform + clip rect. The parent never sees the OOPIF's pixels, yet the screen looks seamless. This is Site Isolation's graphics-side implementation.
HandleSurfaceQuad · 跨进程指针的展开
HandleSurfaceQuad · expanding the cross-process pointer
The flatten algorithm's key hook is SurfaceAggregator::HandleSurfaceQuad. Each DFS visit to a SurfaceDrawQuad expands the referenced surface's whole subtree inline into the main RenderPass, threading "child coordinate system → parent" transforms and clips the whole way:
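The two operations it threads through the recursion can be sketched with translation-only transforms and axis-aligned clips (simplified stand-ins for gfx::Transform / gfx::Rect):

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Translation-only "transform" -- enough to show the concatenation.
struct Offset { int dx, dy; };
struct Clip { int x, y, w, h; };

// Parent transform x child transform (here: just add the offsets).
Offset Concat(const Offset& parent, const Offset& child) {
  return {parent.dx + child.dx, parent.dy + child.dy};
}

// Parent clip ∩ child clip. An empty intersection means the whole child
// surface is pruned -- no quad copied, no GPU memory allocated.
std::optional<Clip> Intersect(const Clip& a, const Clip& b) {
  int x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
  int x2 = std::min(a.x + a.w, b.x + b.w);
  int y2 = std::min(a.y + a.h, b.y + b.h);
  if (x2 <= x1 || y2 <= y1) return std::nullopt;  // prune subtree
  return Clip{x1, y1, x2 - x1, y2 - y1};
}
```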
Two pieces of math deliver the "perfectly stitched" look: ① transform multiplication — parent frame's transform × child RenderPass's transform = the child quad's final on-screen pose; ② clip-rect intersection — parent's clip ∩ child surface's clip = the actual visible region. If the intersection is empty, the whole child surface is never rasterised — no GPU memory allocated. That's why "invisible OOPIFs cost no GPU".
为什么 SurfaceAggregator 用 DFS,不是 BFS? · 渲染顺序就是深度顺序Why does SurfaceAggregator use DFS, not BFS? · rendering order IS depth order
"Wouldn't BFS be faster? Shallower levels, cache-friendly." — a fair question. But Aggregate is a step in a rendering pipeline, and DFS is forced by three hard constraints:
z-order is depth-order, not breadth-order. "Child surface paints on top of parent surface" is the HTML/CSS stacking rule — picture a surface tree with Browser UI at the root and the deepest OOPIF at the leaves. The correct paint order: start from the root, draw self, then immediately recurse into the first child's entire subtree, then into the second child's, etc. — that's exactly DFS pre-order. Under BFS you'd paint all level-1 surfaces, then all level-2, but two level-N surfaces have no relation to each other and get interleaved across subtrees — z-order breaks.
RenderPass dependency = child before parent. In SurfaceAggregator's output RenderPass list, when a parent references a child Pass via RenderPassDrawQuad, the child Pass must appear earlier in the list (GPU executes in list order). DFS post-order naturally produces a "leaves first, root last" list — a topological sort. BFS gives no such guarantee — you'd need a separate topo-sort pass, doubling the cost.
Earliest possible pruning. When DFS recurses into a child surface, it can immediately compute "parent transform × child transform" and "parent clip ∩ child clip" — if the intersection is empty, the entire subtree is skipped, not even one quad is copied. BFS finishes level 1 before level 2 — either compute every transform/clip up front (waste), or discover the empty clip late (waste). DFS's "back off on empty clip" is natural pruning.
Bottom line: DFS naturally fits SurfaceAggregator on three axes — z-order = depth-first, RenderPass dependency = topological order, clip pruning = early backoff. BFS looks friendlier but each property costs extra code. "Pick the right traversal and half the algorithm is free" is a common phenomenon in graph engineering.
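The topological-order property of the second point is easy to demonstrate: a DFS post-order emit over a toy surface tree always lists children before the parent that references them (the tree shape here is hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A surface with nested child surfaces. std::vector supports the
// incomplete element type since C++17.
struct Surface {
  std::string name;
  std::vector<Surface> children;
};

// DFS post-order: recurse into all children first, then emit self.
// The resulting list is a topological sort -- leaves first, root last --
// which matches "child RenderPass must appear before its referencing
// parent" in the GPU's execution order.
void EmitPasses(const Surface& s, std::vector<std::string>& out) {
  for (const auto& child : s.children)
    EmitPasses(child, out);
  out.push_back(s.name);
}
```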
DEVTOOLS
chrome://compositor-thread-rendering-stats · Performance > "Frame submitted to display"
看 4 件事: ① Surface 数量 = Browser UI + 每个 Render + 每个 OffscreenCanvas;② damage_rect 占 viewport 比例(本帧 8.4%)— 越小,Display 阶段 GPU 工作量越小;③ "PRUNED" 标记 = 子 surface 因裁剪/damage 空被跳过,connected to "不可见 OOPIF 不耗 GPU" 那条规则;④ 嵌套深度 > 4 是危险信号 — 说明 OOPIF 套 OOPIF 套 OffscreenCanvas,SurfaceAggregator 的 DFS 会变贵。4 things to watch: ① Surface count = Browser UI + every Render + every OffscreenCanvas; ② damage_rect as % of viewport (8.4% this frame) — smaller = less GPU work in Display; ③ "PRUNED" marker = child surface skipped due to empty clip or damage, the "invisible OOPIF costs no GPU" rule; ④ Nesting depth > 4 is a red flag — OOPIF nested inside OOPIF nested inside OffscreenCanvas makes the DFS expensive.
N 份 CF · N CFs · Browser UI · Render × N · OffscreenCanvas…
→
OUTPUT
Aggregated CF + damage_rect
STAGE 13 · VIZ PHASE
Display — 像素终于上屏
Display — pixels finally reach the screen
字节走完最后一步 · the bytes' final step
DrawQuads → GL/Skia → vsync → photons
Module
viz
Process
GPU (hosts viz)
Thread
Skia / GPU main
Output
on-screen pixels
这一步在做什么
What it does
把 Aggregated CF 里的 DrawQuad 翻译成实际 GPU 调用,画到 Back Buffer;vsync 一来,Display::DrawAndSwap 把 Back / Front 互换,新一帧就出现在屏幕上。Translate the Aggregated CF's DrawQuads into real GPU calls, paint them into the Back Buffer; on vsync, Display::DrawAndSwap swaps Back / Front, and the new frame appears on screen.
为什么不能跳过
Why not skip
这是 13 步流水线的唯一真实操作 GPU 的一步。前 12 步都在"准备"——分类、组织、序列化、调度——而 Display 是把指令真的执行下去的那一刻。This is the only step of the 13 that actually drives the GPU. The previous 12 stages all "prepare" — classify, organise, serialise, schedule — Display is the moment instructions actually execute.
After all 12 stages, the Aggregated CF lands in the SkiaRenderer in Viz. SkiaRenderer records every quad into one SkDeferredDisplayList, hands it to the GPU thread → SkSurface::draw replays it → OutputSurface::SwapBuffers() → the card lights up before your eyes.
Try this now: hover over the Follow button on the real Airing card at the top of this article (the interactive one in the Main-line example chapter). The entire execution path of the hover animation:
This is what 13 stages and 20 years of engineering bought you — a single transform animation, from input to on-screen, crosses 3 processes and 5 thread segments, yet every segment only moves the bare minimum it must. One Transform-tree node mutates; one transform matrix on .follow LayerImpl mutates; one 53×32 tile re-rasters; one TileDrawQuad re-emits; the GPU repaints 1696 pixels; SwapBuffers. "What's not recomputed — that is performance itself" — this is the true meaning of C5's epigraph.
但要注意CAVEAT同一次 hover 还同时改了 background ——这个动画走的是 Paint 路径,Main thread 会 被叫起来。所以"纯 Compositor 动画" 在真实代码里很少 100% 纯。名片用的是混合动画,这正是真实业务的样子。要 100% 纯合成,只改 transform / opacity / filter 即可。The same hover also mutates background — that path goes through Paint, and Main thread does wake up. So "pure Compositor animation" is rare in real code at 100%. The card uses a hybrid animation, which is what real product code looks like. To stay 100% on the Compositor, only mutate transform / opacity / filter.
SkiaRenderer's core is the SkDeferredDisplayListRecorder (DDL) — it doesn't paint immediately, but records every RenderPass's draw operations into a DDL. When all RenderPasses are recorded, SkiaOutputSurfaceImpl::SubmitPaint ships the whole batch to SkiaOutputSurfaceImplOnGpu for one execution on the GPU thread.
FIG 18SkiaRenderer 的延迟绘制流:DrawQuads 先在 Compositor thread 录成 DDL,最后在 GPU thread 上 SkSurface::draw 一次性执行。SkiaRenderer's deferred-draw flow: DrawQuads recorded into a DDL on the Compositor thread, then the GPU thread runs SkSurface::draw in one shot.
GLRenderer (deprecated) tunnels through the CommandBuffer: GL calls on the Compositor thread don't really execute — they're serialised into a command byte stream, posted via InProcessCommandBuffer to the GPU process's CrGpuMain thread, where the real OpenGL ES happens. The split decouples GL caller ↔ real driver — and is what makes the security sandbox possible.
Every modern graphics stack double-buffers. The Front Buffer is what the screen reads; the Back Buffer is where you paint. At vsync, Display::DrawAndSwap swaps the pointers and the new frame is on display at the next refresh — the screen never sees a half-painted frame.
Front Buffer
A
Back Buffer
B
VSYNC
FIG 18.VVSync 的每次"▼",Front 与 Back 互换。屏幕永远从 Front 读,所以从不闪烁。At every "▼" of vsync, Front and Back swap. The screen always reads Front — and never flickers.
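The flip itself is just a pointer swap — a sketch, with strings standing in for buffer memory:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy double buffer: paint into Back, flip at vsync. The scanout engine
// only ever reads Front, so it never observes a half-painted frame.
class DoubleBuffer {
 public:
  void Paint(const std::string& frame) { back_ = frame; }  // may be partial
  void SwapAtVsync() { std::swap(front_, back_); }
  const std::string& OnScreen() const { return front_; }

 private:
  std::string front_ = "frame 0";
  std::string back_;
};
```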
Triple buffer · 牺牲一点延迟换流畅
Triple buffer · trading a touch of latency for smoothness
Mobile OSes (Android Surface Flinger / iOS) default to triple buffering: while the GPU is still painting frame N and the screen still displays N-1, the CPU can already prepare frame N+1. The cost is +1 frame of input latency; the reward is far less stutter — no party ever waits for another. Chromium employs a similar strategy on most desktop / mobile platforms.
"Why not just translate quads to glDrawArrays directly?" That was GLRenderer's path (now deprecated). SkiaRenderer chose DDL (SkDeferredDisplayList) for three reasons:
GL 上下文不能跨线程。OpenGL/Vulkan 规范要求同一时刻一个 GL 上下文只能被一个线程访问(make_current 的语义)。如果 SkiaRenderer 在 Compositor 线程上直接调 GL,就要把 GL 上下文 make_current 到 Compositor 线程——但 Compositor 线程除了渲染还要处理输入、滚动、动画,GL 上下文的独占性会变成串行瓶颈。DDL 解耦了"构造命令"与"提交命令":Compositor 线程构造 DDL(无 GL 上下文,纯内存操作),GPU 线程独占 GL 上下文执行 DDL,两条线程真正并行。
GL contexts can't cross threads. OpenGL/Vulkan specs require only one thread can access a GL context at a time (make_current semantics). If SkiaRenderer called GL directly on the Compositor thread, you'd have to make_current to that thread — but Compositor also handles input, scroll, animation, and the GL context's exclusivity becomes a serial bottleneck. DDL decouples "building commands" from "submitting commands": Compositor builds DDL (no GL context, pure memory ops), GPU thread exclusively owns the GL context and replays the DDL — true parallelism.
Batched submission = minimum state changes. GL "state changes" (swap shader, swap texture, swap blend mode) are extremely slow on GPUs. Quad-by-quad GL means a state switch per quad — most of the GPU's time goes to swapping state, not painting pixels. Skia, when recording the DDL, reorders commands to batch same-state draws together (like a database batching queries) — setShader once, draw 100 quads, then switch state. 3-5× faster than quad-by-quad GL in practice.
Pluggable backends = Skia abstraction dividend. The same DDL can feed Skia's GL backend, Vulkan backend, Metal backend (macOS), Dawn (WebGPU), even a software backend. GLRenderer hard-coded GL; adding Vulkan / Metal / Graphite would have meant a full renderer rewrite each time. SkiaRenderer covers all of them in one codebase — switching backends is just swapping the SkSurface implementation. This is why Chrome 122+'s "SkiaGraphite" experiment (moving Skia rendering onto the modern Graphite backend) only touches SkiaRenderer, not cc — the dividend of a clean architecture.
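The state-batching win from the second point can be sketched by counting state switches before and after grouping draws by state key. This is a toy cost model — real Skia batching must also respect draw overlap, so it cannot reorder arbitrarily:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A draw with a GPU "state key" (think: shader + texture + blend mode).
struct Draw { std::string state_key; };

// Each transition between different keys is one expensive state change.
int StateSwitches(const std::vector<Draw>& draws) {
  int switches = 0;
  for (std::size_t i = 1; i < draws.size(); ++i)
    if (draws[i].state_key != draws[i - 1].state_key) ++switches;
  return switches;
}

// Group draws by state key; stable_sort keeps the original order inside
// each bucket (a stand-in for the overlap constraints a real batcher obeys).
std::vector<Draw> BatchByState(std::vector<Draw> draws) {
  std::stable_sort(draws.begin(), draws.end(),
                   [](const Draw& a, const Draw& b) {
                     return a.state_key < b.state_key;
                   });
  return draws;
}
```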
In essence, DDL is Skia's intermediate representation for the "multi-threaded graphics API era" — same kind of thing as LLVM IR for compilers, or Mojo IDL for IPC. "Record + replay" is the path of any system that needs to decouple "construction" from "execution". Chromium uses it at the very end of the rendering pipeline, locking the GL context's serial bottleneck inside one GPU-process thread once and for all.
DEVTOOLS
chrome://gpu · Performance > GPU lane · Rendering > FPS meter
The previous 18 chapters sliced the pipeline open, one chapter per stage. But in real life all 13 stages are running at the same time — some have hard serial constraints (must wait), many can run in parallel (everybody works at once). The figure below puts the card's one-frame composition back onto the time axis: who moves, and when.
FIG 19名片在 16.7ms 内的真实时间线。3 件值得记的事: ① Main thread 真正"渲染相关"的时间只有约 6ms,剩下 10ms 全是 idle——可以塞 JS / microtask / rAF;② 3 条 Raster 线程并行,与 Compositor 的 Tiling/Activate 阶段重叠;③ Viz/GPU 在最后 4ms 被叫醒,做 Aggregate + DrawAndSwap + Swap,整个前 12ms 它在睡。The card's real timeline within 16.7ms. Three things worth remembering: ① the Main thread spends only ~6ms on rendering, the remaining 10ms is idle — perfect for JS / microtasks / rAF; ② 3 Raster threads run in parallel and overlap with the Compositor's Tiling/Activate; ③ Viz/GPU wakes up only in the last 4ms to Aggregate + DrawAndSwap + Swap — it sleeps the first 12ms.
Main thread 的一秒钟
A typical second on the Main thread
把上面那张帧时间线按 60 倍重复就是 1 秒。但 Main thread 的工作不只是渲染——还有 JS 执行、事件处理、microtask、setTimeout 回调。一个典型 1 秒(中等复杂度的 SPA)的 Main thread 时间分布大致如此:
Repeat that frame timeline 60 times — that's a second. But the Main thread does more than rendering: JS, event handlers, microtasks, setTimeout callbacks. A typical 1-second budget on the Main thread of a moderately complex SPA looks like this:
JS 35%
Style 12%
Layout 8%
Paint 5%
idle 40%
0ms · 250ms · 500ms · 750ms · 1000ms
JS 是头号竞争者——React 重 render、状态库 reducer、IntersectionObserver 回调,这些都在抢 Main thread。当 JS 一个长任务超过 50ms,整条流水线在那 50ms 都在排队:Style/Layout/Paint 都做不了,vsync 来了也只能丢帧。这就是为什么 Web Vitals 把 INP(Interaction to Next Paint)和 TBT(Total Blocking Time)放在前面——它们直接量"JS 占用 Main 多久"。
JS is the chief competitor — React re-renders, state-library reducers, IntersectionObserver callbacks all fight for the Main thread. When a JS long task exceeds 50ms, the entire pipeline queues up for those 50ms: Style/Layout/Paint cannot proceed, the vsync arrives only to drop the frame. That's why Web Vitals leads with INP (Interaction to Next Paint) and TBT (Total Blocking Time) — both measure "how long does JS hold the Main thread".
2024 +Scheduler.yield() 与 isInputPending(): 现代 Chromium 提供 scheduler.yield() 让 JS 主动让出主线程,以及 navigator.scheduling.isInputPending() 让长任务可以提前退让给输入事件。这两个 API 让"不要让 JS 阻塞渲染"从口号变成可量化的工程实践。Scheduler.yield() and isInputPending(): modern Chromium ships scheduler.yield() for JS to voluntarily yield the Main thread, plus navigator.scheduling.isInputPending() for long tasks to step aside for incoming input. These two APIs make "don't let JS block render" measurable rather than aspirational.
三种"掉帧"的物理来源
Three physical sources of "jank"
#
物理现象Physical event
看到什么What you see
1
Main 长任务Long Main-thread task
JS 跑了 80ms,5 帧没刷新——卡顿"段落式"出现JS ran 80ms, 5 frames missed — jank in "chunks"
2
Raster 跟不上Raster can't keep up
滚动时屏幕一直在动,但视口边缘棋盘格screen keeps moving while scrolling, viewport edges show chequer
3
GPU 排队GPU queueing
动画起步那一刹那"卡一下",之后顺畅(GPU 上了纹理)animation "hitches" at the very first frame, smooth afterward (GPU loaded textures)
The previous 18 chapters described the forward pipeline: bytes in, pixels out. But half of a browser's complexity hides in the reverse pipeline: a click, from a hardware interrupt, crossing 3 processes and 5 thread segments, eventually firing the next 13-stage round. This is the real topology behind RAIL's R (Response).
输入流水线 · 一次 click 的旅程Input pipeline · one click's journey
OS EventHardware IRQ
→
Browser · IOBrowser process
→
Browser · UIrouting & hit-test
→
Render · Compositortry handler
↘
Render · MainJS handler · setState
→
Style + Layout + Paintrender pipeline
→
Viz · GPUSwapBuffers
5 个关键节点:
Five key checkpoints:
OS → Browser IO 线程:操作系统通过 evdev / WindowProc / NSEvent 把硬件中断翻译成 InputEvent,塞进 Browser process 的 IO 线程消息队列。
OS → Browser IO thread: the OS translates the hardware interrupt into an InputEvent via evdev / WindowProc / NSEvent and enqueues it on the Browser process's IO thread.
Browser UI 路由 + hit-test:Browser 用 hit-test region(由 cc 提供的命中测试矩形列表)决定该事件归哪个 Render Process 的哪个 frame——OOPIF 的事件路由就靠这个。
Browser UI routes + hit-tests: Browser uses cc-supplied hit-test region (a list of hit-test rectangles) to decide which Render process's which frame owns this event — this is how OOPIF event routing works.
Render Compositor 先看一眼:输入事件优先送到 Render 的 Compositor thread。如果是滚动 / pinch / non-blocking touch,Compositor 自己处理就够(直接调整 scroll offset / transform),从不打扰 Main——这就是"滚动跑在 Compositor 上"的物理实现。
Render Compositor takes a first look: input goes to Render's Compositor thread first. If it's a scroll / pinch / non-blocking touch, the Compositor handles it alone (just adjust scroll offset / transform) and never wakes Main — this is the physical implementation of "scrolling runs on the Compositor".
Bounce 给 Main · 跑 JS handler:如果是 click / keypress / 注册了 active touch listener 的事件,Compositor 把事件转给 Main thread,这才轮到 JS handler 跑。passive: true 是关键标记——它告诉 Compositor"这个事件不会调 preventDefault",Compositor 可以在等 Main 处理的同时继续把后续的滚动事件按自己的节奏处理。
Bounce to Main · run the JS handler: if it's a click / keypress / event with active touch listener registered, Compositor forwards it to Main, where the JS handler finally runs. passive: true is the key flag — it tells the Compositor "this event will not call preventDefault", letting Compositor keep handling subsequent scroll events on its own cadence while Main works.
Trigger a new frame: the JS handler calls setState / changes className / mutates DOM → invalidates Style → on the next BeginMainFrame, the forward pipeline runs again. If the mutation only touches Compositor-only properties, Main doesn't even need to wake.
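The compositor-first triage in steps 3 and 4 can be sketched as a single decision function. Names and rules are simplified; the real logic lives around Chromium's InputHandlerProxy and is far more nuanced:

```cpp
#include <cassert>
#include <string>

// Where an input event gets handled first.
enum class Route { kCompositorOnly, kForwardToMain };

Route TriageEvent(const std::string& type, bool has_blocking_listener) {
  // Scroll / pinch are answered by adjusting scroll offset or transform
  // on the Compositor thread alone -- Main is never woken.
  if (type == "scroll" || type == "pinch") return Route::kCompositorOnly;
  // passive:true listeners promise not to call preventDefault, so touch
  // can also stay on the Compositor's own cadence.
  if (type == "touch" && !has_blocking_listener) return Route::kCompositorOnly;
  // click / keypress / blocking touch must run the JS handler on Main.
  return Route::kForwardToMain;
}
```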
Google's RAIL model classifies user-perceptible work by time budget: Response 100 ms · Animation 16 ms · Idle 50 ms · Load 1000 ms. The 100 ms Response budget is not "click to pixel" — it is "click to feedback-on-screen" (a spinner, a pressed-button state, a ripple all count). This gives the Compositor a precious "react fast, then process properly" window — every modern UI library leans on it (the :active pseudo-class, focus rings, ripple animations).
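As a memory aid, the four budgets can be written down as a lookup; the numbers are straight from the RAIL model above, while the helper name is mine:

```javascript
// The RAIL time budgets quoted above, as data.
const RAIL_BUDGET_MS = { response: 100, animation: 16, idle: 50, load: 1000 };

// Hypothetical helper: does a measured duration fit its RAIL budget?
function withinRailBudget(category, measuredMs) {
  return measuredMs <= RAIL_BUDGET_MS[category];
}
```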
click waits out the "maybe a double-click?" window (~300 ms by default, though browser heuristics often shrink it to ~100 ms). pointerdown fires immediately. Material Design's ripple appears the moment you press down precisely because it is bound to pointerdown, not click — exploiting RAIL's 100 ms react-first window.
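The pointerdown-versus-click ordering is easy to demonstrate with a bare EventTarget (used here so the sketch also runs outside a browser; on a page you would attach to the element itself):

```javascript
// "React first, process properly later": instant feedback on pointerdown,
// the real work on click.
const button = new EventTarget();
const log = [];
button.addEventListener("pointerdown", () => log.push("feedback")); // ripple / pressed state, fires immediately
button.addEventListener("click", () => log.push("action"));         // may sit in the double-click window
button.dispatchEvent(new Event("pointerdown"));
button.dispatchEvent(new Event("click"));
// log is now ["feedback", "action"] — feedback always lands first
```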
The original was written in 2022; the Chromium pipeline has kept moving for another three years since. This chapter pins the most meaningful recent changes back onto their stages — none reshapes the pipeline's skeleton, but each adds a new "hook" somewhere along it.
PRE-PAINT · Anchor Positioning (CSS Anchor). New properties anchor-name / position-anchor / inset-area let an element "fly with" another. This introduces a new Transform-node subclass in Pre-paint — anchor-position changes sync through cc to the Compositor without a round trip to Main. For the first time, "tooltip / popover tracks its trigger" runs entirely on the Compositor.
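A minimal sketch of the three properties in use (assumes a Chromium build with Anchor Positioning shipped; note that later spec drafts rename inset-area to position-area):

```javascript
// The tooltip-follows-trigger pattern from the text, as a style-sheet string.
const anchorCss = `
  #trigger { anchor-name: --trigger; }
  #tooltip {
    position: fixed;
    position-anchor: --trigger;  /* which anchor to follow */
    inset-area: top;             /* where to sit relative to it */
  }
`;

// Attach it in a page (guarded so the sketch also runs outside a browser):
if (typeof document !== "undefined") {
  const sheet = new CSSStyleSheet();
  sheet.replaceSync(anchorCss);
  document.adoptedStyleSheets = [...document.adoptedStyleSheets, sheet];
}
```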
COMPOSITING · Scroll-Driven Animations (CSS animation-timeline). animation-timeline: scroll() / view() drives a CSS animation's progress by scroll position rather than time. The entire animation runs on the Compositor thread — cc feeds the scroll offset straight into the animation interpolator, with no Main-thread involvement. Overnight, the "parallax / progress-bar" pattern that used to need IntersectionObserver + JS becomes a few lines of CSS.
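The progress-bar case in a few lines — the CSS form, plus the equivalent JS ScrollTimeline (feature-detected; the `#bar` element id is made up for the sketch):

```javascript
// CSS form: animation progress driven by root scroll position, not time.
const progressCss = `
  @keyframes grow { from { transform: scaleX(0); } to { transform: scaleX(1); } }
  #bar {
    transform-origin: left;
    animation: grow auto linear;
    animation-timeline: scroll(root);  /* scroll offset is the clock */
  }
`;

// JS equivalent via the Web Animations API, also Compositor-driven:
if (typeof document !== "undefined" && "ScrollTimeline" in globalThis) {
  document.querySelector("#bar").animate(
    [{ transform: "scaleX(0)" }, { transform: "scaleX(1)" }],
    { timeline: new ScrollTimeline({ source: document.documentElement }) }
  );
}
```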
AGGREGATE · View Transitions API. document.startViewTransition() brings native cross-state smooth transitions to SPA route changes. Under the hood it is SurfaceAggregator's snapshot-plus-cross-state composition: the old state is captured into a SharedImage, the new state renders normally, and Viz cross-fades / slides / scales the two surfaces during aggregation. From C17's perspective, this is the first time web developers can invoke SurfaceAggregator directly.
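A hedged sketch of wiring an SPA route swap through the API — updateDom is a placeholder for whatever mutates the page; the fallback path simply swaps without the Viz cross-fade:

```javascript
// Wrap a DOM update in a View Transition when the API exists.
function navigateWithTransition(updateDom) {
  if (typeof document === "undefined" || !document.startViewTransition) {
    updateDom();   // no snapshot, no cross-fade — plain swap
    return null;
  }
  // Old state is snapshotted, updateDom runs, Viz blends the two surfaces.
  return document.startViewTransition(updateDom);
}
```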
STYLE · @scope / @container / @starting-style. Three new at-rules add new sharding dimensions to the Style chapter's RuleSet. @scope gives RuleSet an extra scoped_rules_ bucket; @container makes a rule's match condition depend on an ancestor container's layout — breaking the old constraint that Style runs strictly before Layout, so Chromium implemented a two-pass layout-style-layout for Container Queries, the biggest style-system rework since LayoutNG.
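A minimal @container rule makes the two-pass requirement concrete: whether .card gets a grid layout depends on the laid-out width of its ancestor container (class names are invented for the sketch):

```javascript
// The layout-dependent match condition from the text, as a style-sheet string.
const containerCss = `
  .card-list { container-type: inline-size; container-name: cards; }

  /* Matches only after .card-list has been laid out at >= 400px wide —
     hence Chromium's layout → style → layout second pass. */
  @container cards (min-width: 400px) {
    .card { display: grid; grid-template-columns: 1fr 2fr; }
  }
`;
```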
RASTER · RasterInducingScroll (default-on). Early Chromium scrolled on the Compositor alone and never re-rastered mid-scroll, so fast scrolling showed checkerboarding. The new RasterInducingScroll strategy proactively triggers raster during inertial scrolling, trading a little CPU for "no checkerboard". Default-on since Chrome 122.
PROCESS · NetworkService as a separate process by default. In 2024 Chrome flipped NetworkService's default to a dedicated process (it could previously run in-process). That makes Stage 0's Mojo IPC genuinely cross-process rather than same-process message passing, and the sandbox runs deeper: even a pwned Render process cannot read raw cookies.
SYNTHESIS 04 · DEBUG GUIDE
Symptom reverse lookup — from jank back to a stage
By now you know what every stage does. But the question engineers actually ask in PRs is the reverse: page is janky / scroll is sluggish / animation drops frames / cold-start is white — which stage do I look at first? The table below maps common symptoms back to a stage, what to capture, and which tool to reach for.
Symptom
Suspect stage
First capture
Cold-start blank screen (LCP > 2.5 s)
Stage 00 + 02
Network panel for render-blocking resources · check whether any critical CSS / font escaped the PreloadScanner
Slow click response (INP > 200 ms)
C20 input + Main
Performance trace for the click handler's long task · split it with scheduler.yield()
Scroll jank, Compositor thread saturated
DevTools Layers panel · confirm the element has its own compositing layer · check whether will-change is taking effect
backdrop-filter is heavy
C16 Draw + C18 Display
Rendering panel: turn on "Layer borders" · check whether a separate RenderPass is produced · measure GPU usage
Layout thrashing on bulk DOM mutation
C8 Layout
Performance trace · look for forced-reflow warnings · batch mutations via requestAnimationFrame
Page suddenly turns smooth ~5 s after opening
C7 Style + V8
V8 JIT finished (bytecode → optimized code) · either wait for warm-up or pre-warm the critical paths
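The scheduler.yield() fix named in the INP row can be sketched as follows. scheduler.yield() is feature-detected with a setTimeout fallback because it is Chromium-only at the time of writing; chunkWork and handleClick are names invented for the sketch:

```javascript
// Yield back to the event loop, preferring scheduler.yield()
// (the continuation keeps its priority) over a setTimeout fallback.
async function yieldToMain() {
  if (typeof scheduler !== "undefined" && scheduler.yield) return scheduler.yield();
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Split one long task into budget-sized chunks (pure, so it is testable).
function chunkWork(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

// A long click handler, rewritten: input and rendering can run between chunks,
// which is what keeps INP under the 200 ms budget.
async function handleClick(items, processOne) {
  for (const chunk of chunkWork(items, 50)) {
    chunk.forEach(processOne);
    await yieldToMain();  // let pending input / a frame slip in here
  }
}
```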
UNIVERSAL WORKFLOW — 1. Record: capture 5-10 seconds in the Performance panel, including the moment the symptom appears. 2. Diagnose: see which thread is red — Main red = JS / Style / Layout / Paint; Compositor red = Tiling / Activate / Draw; Raster red = rasterization can't keep up; GPU red = pixel-throughput ceiling. 3. Capture: use the table above to land on a specific stage, then dive in with that stage's tool.
SYNTHESIS 05 · INDEX
Glossary — 64 key terms
A quick lookup for class names & concepts
Blink
Chromium's rendering engine, forked from WebKit in 2013. → CH 02
This piece has taken you deep enough to start working with the pipeline. If you want to dig further, the documents below are Chromium's and V8's primary sources; after reading them you'll be ready to contribute code or fix bugs.
OFFICIAL DESIGN DOCS · chromium.org · v8.dev · web.dev
«WebKit 技术内幕» (Chen Zihao) — a Chinese-language deep dive into rendering engines; based on early WebKit/Blink, but its pipeline skeleton matches this article's. «Inside Chromium» (Tom Dale, online) — module-level diagrams. «High Performance Browser Networking» (Ilya Grigorik) — the bible of the networking layer and the perfect companion to Stage 0. Read them alongside this piece and the loop is closed.
From bytes to pixels,
Chromium translates "a web page" into "light" in thirteen movements.
Every frame you see is this pipeline rehearsing the score in 16.7 ms.