FIELD NOTE / 08 编程语言 Languages 2026

一段内存的
多重死亡。

Many ways
to die.

11 个 GC 家族的家谱

A family map of 11 garbage collectors

同一行 list = null 在 20+ 个运行时里——CPython refcount 立即释放、V8 Orinoco 等下一次 minor GC、JVM ZGC 在 sub-ms 内并发处理、Erlang BEAM 当垃圾不存在、Rust 编译期就把它释放了——发生的事情完全不同。这是一本跨语言 GC 比较手册，从 1959 McCarthy mark-sweep 到 2023 Generational ZGC，按算法家族分章，每章都用同一行代码作主线。

The same line list = null behaves completely differently across 20+ runtimes — CPython's refcount frees immediately, V8 Orinoco waits for the next minor GC, JVM ZGC handles it concurrently in sub-ms, Erlang BEAM treats garbage as not-a-thing, and Rust freed it at compile time. This is a cross-language GC field map, from 1959 McCarthy mark-sweep to 2023 Generational ZGC, organised by algorithm family, threaded by one mainline code snippet.

AUTHORAiring TOPICGC · runtime · memory · 20+ languages FORMATLong Read

▣ Heap Plaza · 3D Game LIVE

不想读? 玩这篇文章 — Heap Plaza 3D 网页游戏

Don't want to read? Play this article — Heap Plaza 3D web game

把 8 种 GC 算法变成 8 个可走可玩的箱庭世界 + 1 个 Boss 综合战. 走到木屋按 F 切指针看 refcount 归零, 走过节点变 white→grey→black 看并发 mark, 撞 stale 指针让 ZGC load barrier 自动 heal. 每区 6 步引导手把手教你感受这个算法在干什么.

8 GC algorithms turned into 8 playable hand-crafted worlds + 1 boss stage. Walk up to a wooden house, press F to cut a pointer, watch the refcount hit zero and the building collapse. Walk through nodes that turn white→grey→black as concurrent mark. Hit a stale pointer and see ZGC's load barrier auto-heal it. Each region ships with a 6-step in-game tutorial.

Region 01

Refcount Town

refcount + cycle leak

Region 02

Mark Forest

mark-sweep · STW

Region 03

Copying Canyons

Cheney 1969 · 2× mem

Region 04

Generational

90% young hypothesis

Region 05

Incremental Hallway

slice + barrier

Region 06

Tri-color Arena

concurrent · Go / ZGC

Region 07

Erlang Islands

per-actor heap

Region 08

ZGC Skyscraper

colored pointers

✦ Boss

Hybrid Heap

all 8 in sequence

▸ 立即开玩 Play now

WASD 移动 · 鼠标看 · E 互动 · H 看理论 · ESC 回 plaza
桌面 + 手机都能玩 (手机自动出虚拟摇杆)

WASD move · Mouse look · E interact · H theory · ESC back to plaza
Desktop + mobile (mobile auto-shows joystick)

FIELD NOTE 系列 · 第八篇 Field Note series · No. 08

这是 Field Note 系列的第八篇，姐妹篇有《一行 JS 的一生 · QuickJS 源码详解》（讲一个引擎全栈）和《V8 是怎么把 JS 跑快的》（讲多层 JIT）。本篇横向跨语言——讲所有现代运行时怎么收尸：引用计数、mark-sweep、复制、分代、并发、彩色指针、区域、无 GC，一个家族一个家族地拆。每个家族都挑代表实现的真实源码，每个都用同一行 list = null 作主线，看它怎么走完一生。

This is the eighth piece in the Field Note series, siblings to «The Life of One JS Line · A QuickJS Source-Level Walkthrough» (one engine, full stack) and «How V8 makes JS fast» (multi-tier JIT). This one cuts across languages — how every modern runtime handles its dead: refcount, mark-sweep, copying, generational, concurrent, colored pointers, region, no-GC. One family at a time. Each family illustrated with the real source of a representative implementation; each followed by the same mainline list = null to see how it actually plays out.

GC 家谱 · 11 家族 · 4 段 GC family tree · 11 families · 4 acts ▸ LIVE PULSE

CHAPTER 01

什么是 GC — "找出死了的内存，还给系统"

What is GC — "find what's dead, return it to the system"

一句话定义 + 三个反直觉

one-line definition + three counter-intuitions

GC 的所有实现都做同一件事：找出"程序再也访问不到"的内存，把它还给分配器。区别只在怎么找和什么时候找。

整本书的张力都建立在三个反直觉上：

Every GC implementation does the same thing: find memory the program can no longer reach, return it to the allocator. The only difference is how they find it and when. Everything that follows builds on three counter-intuitions:

①

"死"不是程序员意图，是图论性质

"Dead" is a graph property, not programmer intent

GC 不知道你打算不再用这块内存——它只检查"从 root 出发的有向图里还能不能走到"。你 x = null 但 cache 还藏着 x？不死。你心里早已遗忘，但闭包还引用着？不死。"不可达" 是所有 GC 的唯一真理。

GC has no idea what you intend to keep — it only checks "can it still be reached from a root via the pointer graph?" You set x = null but cache still holds x? Not dead. You forgot about it months ago, but a closure references it? Not dead. Unreachability is the only truth for every GC.

②

"立即释放"通常是幻觉

"Freed immediately" is usually an illusion

只有引用计数（CPython, Swift, PHP, Rust Rc）能做到 x = null 同行立即 free。所有 tracing GC——V8、Go、JVM、.NET、Erlang——都是滞后的，可能要 100ms 后、也可能要 1 小时后才真正释放。这是延迟换吞吐量的根源。

Only refcount (CPython, Swift, PHP, Rust Rc) genuinely frees on the same line as x = null. Every tracing GC — V8, Go, JVM, .NET, Erlang — is lazy: actual free may happen 100ms later, or an hour later. That latency is the source of every throughput advantage.

③

GC 是运行时对程序员的"违约权"

GC is the runtime's "licence to default"

你写 let x = ...。运行时背地里跟自己说"等我有空再处理"。STW 那 50ms 就是运行时临时收回控制权。所有 GC 的争议都是关于这个"违约权"——多久能行使一次（pause）、违约多长（max pause）、有没有提前通知（incremental/concurrent）、能不能买断（no-GC ownership）。

You wrote let x = .... Behind your back the runtime tells itself "I'll deal with this when I'm free". The STW pause is the runtime temporarily taking control back. Every GC debate is really about this default licence — how often (frequency), how long (max pause), whether you get warning (incremental/concurrent), whether you can buy out (no-GC ownership).

「GC 不是一种技术，GC 是一种契约。
21 世纪的运行时设计基本就是这份契约的不同版本。」 "GC isn't a technology — GC is a contract.
21st-century runtime design is mostly about different versions of that contract."

CHAPTER 02

六十年家谱 — McCarthy 1959 到 Generational ZGC 2023

60-year family tree — McCarthy 1959 to Generational ZGC 2023

每个被采纳的算法都是某个问题的回应

every adopted algorithm is an answer to a specific pain

年Year	事件Event	人物People	解决的问题Pain solved
1959	Mark-Sweep 在 Lisp 1.5 里发明	John McCarthy	"我们手写 free 写疯了""we keep forgetting to free in Lisp"
1963	引用计数（Collins）独立提出	George Collins	"Mark-Sweep 的全堆扫描太慢""mark-sweep's full-heap walk too slow"
1969	Cheney 复制算法	C. J. Cheney	"mark-sweep 留下太多碎片""mark-sweep fragments"
1978	三色不变式	Dijkstra · Lamport	"怎么让 GC 边跑边收？""how to collect concurrently?"
1984	分代假设	Henry Lieberman · Carl Hewitt	"大多数对象很快死""most objects die young"
1988	Boehm 保守式 GC（给 C/C++）	Hans-J. Boehm	"没类型信息也能 GC""GC without type info"
1995	Java 1.0 · 第一代 JVM GC	Sun	"让 C++ 程序员不用 free""give C++ devs freedom from free"
2004	G1 GC 发表（论文）	Detlefs · Flood · Heller · Printezis	"大堆 STW 不可接受""large-heap STW unacceptable"
2009	Go 1.0 · 三色并发 mark-sweep	Google · Rick Hudson	"服务器 ms 级停顿""server-grade ms pauses"
2010	Swift ARC（前身 Objective-C 2.0 ARC, 2011 公开）	Apple · Chris Lattner	"移动设备没 GC 预算""mobile has no GC budget"
2014	ZGC 项目启动（Oracle Labs）	Per Lidén · Stefan Karlsson	"100GB 堆要 <10ms 停顿""100GB heap, <10ms pause"
2014	Rust 1.0-alpha · 无 GC 所有权系统	Mozilla · Graydon Hoare	"完全把 GC 写进编译器""GC subsumed by the compiler"
2015	Go 1.5 · sub-ms STW	Rick Hudson · Austin Clements	"实时 GC 是可能的""real-time GC is possible"
2018	Shenandoah GC 进入 OpenJDK	Red Hat · Roman Kennke	"ZGC 的并发整理替代品""alt-take on concurrent compaction"
2020	Hermes HadesGC GA	Meta	"RN 移动端的 sub-ms GC""sub-ms GC for React Native"
2022	OCaml 5.0 · 多核并行 GC	KC Sivaramakrishnan · OCaml team	"函数式语言终于真并行""functional languages finally go truly parallel"
2023	Generational ZGC 上线（JDK 21）	Oracle	"ZGC 没分代，吞吐量留了 10%""non-gen ZGC left 10% throughput on the table"
2024	Carbon / Vale · 实验性"线性 GC"	various	"Rust 太严，让我们多一点 GC""Rust too strict, ease back with linear types"

FIELD NOTE · 算法不是被发明，是被需要 FIELD NOTE · algorithms aren't invented, they're demanded 这张表每一行都对应一种工程痛点。McCarthy 不是闲着发明 mark-sweep——他写 Lisp 1.5 受不了 cons 节点泄漏。Lieberman 1984 提分代不是数学美感——他在 MIT Lisp Machine 上调试了 6 个月发现 90% 对象不到 1 秒就死。ZGC 不是性能竞赛——是 Oracle 内部 Cassandra 集群 100GB 堆 30 秒 STW 让交易系统真停摆。下一章看每个家族解决的具体痛点。 Every row in this table maps to a specific engineering pain. McCarthy didn't invent mark-sweep for kicks — he was bleeding from leaked cons nodes in Lisp 1.5. Lieberman 1984 didn't propose generational for mathematical elegance — he spent 6 months debugging on an MIT Lisp Machine and found 90% of objects died within 1 second. ZGC wasn't a benchmark race — it was Oracle's internal Cassandra clusters with 100GB heaps causing 30-second STW that froze trading systems. The next chapter shows the specific pain each family solves.

CHAPTER 03

11 个家族 · 11 句话讲完

11 families · 11 sentences total

先建立心智地图，后面 11 章每章一个

build the mental map first; 11 chapters fill in each one

Ch	家族Family	一句话In one sentence	代表实现Examples
06	引用计数Refcount	每对象记"谁指我"，归零立刻 freeeach object counts pointers, free at 0	CPython · Swift · PHP · Rust Rc
07	Mark-Sweep	从 root 染色全堆，把没染上的 freewalk from roots, sweep the unmarked	Lua · Boehm · early V8 major
08	复制 · CheneyCopying · Cheney	活对象搬到新堆，旧堆整体丢copy live objects to new heap, drop old heap entirely	OCaml minor · GHC · V8 young
09	分代Generational	年轻区频繁回收（90% 对象都死在这）collect the young area often (90% dies here)	V8 Orinoco · JVM G1 · Ruby 3
10	增量Incremental	一次 GC 拆成 N 小步穿插在用户代码间break one GC into N small steps interleaved with user code	Lua 5.0+ · early JVM CMS
11	并发 · 三色Concurrent · tri-color	GC 在独立线程里跑，靠 write barrier 保证不漏GC on its own thread, write barriers preserve invariants	Go · JSC Riptide · Hermes HadesGC
12	区域 · 每进程Region / per-actor	每个进程/区域独立 GC，进程死整片 freeper-process / per-region GC, dies as a whole	Erlang BEAM · Pony · region typing
13	彩色指针Colored pointers	在 64-bit 指针的高位编码 GC 状态，load barrier 拦截encode GC state in high pointer bits, load barriers intercept	ZGC · Shenandoah · Brooks 1984
14	无 GC · 编译期No-GC · compile-time	所有权 + RAII 把 free 编进编译产物ownership + RAII bake free into the binary	Rust · C++ unique_ptr · Zig · Vale
15	混合Hybrid	refcount 主路径 + cycle detector 兜底refcount + cycle collector backstop	CPython · PHP Zend · QuickJS · Perl
16	学术 · 未量产Academic / unshipped	MMTk 工具箱、LXR、Azul Pauseless（已商业）MMTk toolkit, LXR, Azul Pauseless (commercial)	research · vendor

每个家族都有哲学，每个哲学都有代价。Ch04 五个永恒权衡先把代价说清，再开始挨个家族讲。

Every family has a philosophy, every philosophy has a price. Ch04 lays out the prices in 5 tradeoffs, then we walk through family by family.

CHAPTER 04

五个永恒权衡 — 任何 GC 都不能同时拿满

5 eternal tradeoffs — no GC wins all five

所有 GC 战争都是这 5 个轴的不同选择

every GC war is a different choice across these 5 axes

权衡Tradeoff	谁赢Winner	谁输Loser	代价Why
① 停顿时间① Pause time	ZGC / Shenandoah · Erlang · Refcount	JVM Parallel · early Go · 单 STW	并发要 barrier、refcount 要原子 incconcurrency needs barriers, refcount needs atomic inc
② 吞吐量② Throughput	JVM Parallel · GHC · No-GC (Rust)	ZGC · Shenandoah · CPython	低延迟的 barrier 都是分支预测杀手low-latency barriers kill branch prediction
③ 内存占用③ Memory footprint	Refcount · No-GC · per-actor	分代 · 并发 mark-sweep · 复制式	tracing 需要 1.5-3× 活集才能高效tracing GCs need 1.5-3× live-set headroom
④ 启动时间④ Startup	Refcount · No-GC · Lua · QuickJS	JVM · .NET · 任何带 JIT 的	JIT + GC 元数据载入要 50-200msJIT + GC metadata bring-up costs 50-200ms
⑤ 心智负担⑤ Mental overhead	JVM · V8 · Python · Erlang（GC 隐形）	Rust · 手动 free（明确所有权）	隐形的代价是不可调；显性的代价是难学invisible cost = no tuning; visible cost = steep curve

从来没有 GC 在这五个轴都拿满。挑 GC 就是挑你愿意输哪一个。Ch19 调优决策树会把这 5 个轴变成可选 path——但你先要在脑里建立"没有银弹" 这个底色。

No GC has ever scored full marks on all five. Picking a GC means picking which one you're willing to lose. Ch19's tuning decision tree turns these 5 axes into actionable paths — but first you have to internalise "no silver bullet" as the background colour.

CHAPTER 05

关键不变式 — 全本书会反复用到的 5 个词

Core invariants — 5 terms used everywhere below

root · barrier · 三色 · safepoint · forwarding

root · barrier · tri-color · safepoint · forwarding

①

Root set

GC 起点。包括：栈帧里的局部变量、寄存器里的对象指针、全局变量、JIT 编译代码里嵌入的常量对象、thread-local、finalizer 队列里的对象。"对象可达" = 从 root 走有向图能到达。

Where GC starts walking from. Includes: local vars on stack frames, object pointers in registers, globals, JIT-baked constants, thread-local, finalizer queue. "Reachable" = a directed path from a root via pointers.

②

Write barrier · Read barrier · Load barrier

Write / Read / Load barriers

编译器在每个指针写（或读）的位置插入的小段代码。用途：(a) 增量/并发 GC 要捕获"白色对象指针被写到黑色对象"这种关系破坏（Ch11 三色）；(b) ZGC 的 load barrier 在每个指针读的位置检查并可能转发指针。所有并发 GC 都在这层付费——每次指针操作多 1-3 ns。

A small piece of code the compiler injects at every pointer write (or read). Purposes: (a) incremental/concurrent GCs catch "a black object now points to a white" invariant violations (Ch11 tri-color); (b) ZGC's load barrier checks every pointer read and may forward. Every concurrent GC pays here — 1-3 ns per pointer op.

③

三色不变式（Dijkstra 1978）

Tri-color invariant (Dijkstra 1978)

所有对象染三色：白（GC 未访问，候选 free）、灰（自己被访问，但出边未追完）、黑（自己 + 所有出边都追完）。不变式：不允许黑色对象直接指向白色对象（否则白色会被漏标）。所有并发 GC 都靠 write barrier 维护它。

All objects coloured: white (unvisited, candidate to free), grey (visited but outgoing edges not all traced), black (self + all outgoing done). Invariant: black must not directly point to white (else white gets missed). Every concurrent GC uses write barriers to enforce this.

④

Safepoint

编译器知道"此刻所有指针在已知位置"的代码点。STW 的本质就是"等所有线程到 safepoint"。tight 数值循环里没有 safepoint，GC 可能要等几 ms 才能进——这是 JVM 大循环 GC 暂停的常见原因。

A program point at which the compiler knows where every pointer is. STW = "wait for all threads to reach a safepoint". Tight numeric loops lack safepoints; GC can wait ms for entry — a common cause of JVM long-pauses.

⑤

Forwarding pointer

复制 GC（Ch08）+ 整理 GC（ZGC）必备。对象从位置 A 搬到位置 B 后，A 处放一个指向 B 的转发指针，之后任何还指着 A 的旧引用都被自动重定向到 B。这是所有 compacting GC 的核心——也是为什么 ZGC 要 colored pointer：把"已转发"标志编进指针。

Required by copying GC (Ch08) and compacting GC (ZGC). When an object moves from A to B, A holds a forwarding pointer to B; any old reference to A gets transparently redirected. The heart of every compacting collector — and the reason ZGC needs colored pointers: encoding "already forwarded" into the pointer bits.

FIELD NOTE · 这五个词出现 200+ 次 FIELD NOTE · these 5 words appear 200+ times 如果你在后面任何一章看到 "write barrier" "root scan" "safepoint" 想不起来含义——回到本章。术语表（最后一章）也会再列一次。 If any later chapter says "write barrier" / "root scan" / "safepoint" and you blank — come back here. The glossary at the end repeats them too.

本文 benchmark 的方法论

Benchmark methodology used throughout

environment · so you can spot the variance2026-05

// Machine: Apple Silicon (M-series, 16 GB RAM) // OS: macOS Darwin 25.x (kernel-level mmap policy varies — affects ZGC) // Allocator: system default per runtime (CPython PyMalloc · Lua mallocator // · Go mcache · JVM ParallelGC // · Node V8 default · Rust system) // Versions: CPython 3.13 · Lua 5.4.7 · OCaml 5.2 · Node v22.16 · Go 1.23 // Rust 1.85 -O · JDK 21 + ZGC · Erlang/OTP 27 // Repetitions: each timing is single-run, no warm-up unless the runtime is JITed // (Node/JVM ran 3x; values shown are median). // Variance: readers reproducing on different machines should expect 2-4× // absolute spread on alloc/free numbers · order-of-magnitude // ratios (e.g. "Lua's 580ms vs Rust's 15ms · 38× faster Rust") // should still hold across hardware. // Source files: // /tmp/mainline.py /tmp/mainline.lua /tmp/mainline.ml // /tmp/mainline.js /tmp/mainline.go /tmp/mainline.rs // /tmp/Mainline.java /tmp/mainline.erl // Each file is shown verbatim in its corresponding chapter (Ch06-Ch16).

FIELD NOTE · 数字会变 · 排序不会 FIELD NOTE · numbers vary, the ranking doesn't 你在自己机器上跑这些 benchmark 大概率得到不同的绝对数字——M1 Pro 比 M3 Max 慢 ~30%、Linux 比 macOS GC 表现可能差 10-50%、jemalloc 比 system malloc 在 fragmentation 场景下能差 2×。但相对排序基本稳定：Rust 永远比 Lua 快 30× 量级，ZGC 永远 sub-ms 而 Parallel 永远 10-200ms。所以本文每个数字都标了来源，你应该在自己机器上跑一次验证排序，而不是把绝对值奉为真理。 Running these benchmarks on your machine will likely yield different absolute numbers — M1 Pro is ~30% slower than M3 Max, Linux GC behaviour can differ from macOS by 10-50%, jemalloc vs system malloc can be 2× apart on fragmentation-heavy workloads. The relative ordering, however, is stable: Rust is always ~30× faster than Lua, ZGC is always sub-ms while Parallel is always 10-200 ms. Each benchmark in this article cites its source code; you should re-run them locally to verify the ordering, not treat the absolute values as canon.

MAIN LINE · THE WORKLOAD

1M 个小对象的命运 — 一段代码 · 11 种 GC · 11 个故事

The fate of 1M small objects — one program · 11 GCs · 11 stories

the workload · pseudocode (each chapter shows in real language)17 chars + side effects

// allocate 1 million small objects, retain them, then drop the reference let list = [] for i in 0..1_000_000: list.push({ n: i }) // allocate 1M tiny objects list = null // ⭐ the moment of "many ways to die" // observe: when is memory actually freed? what's the pause? how is the heap after?

这一行 list = null 是整本书的支点。每个 GC 家族章节都会回答 4 个问题：

That one line list = null is the article's fulcrum. Every GC family chapter answers 4 questions:

什么时候这 1M 对象的内存真的还给系统？

When does memory actually return to the OS?

CPython refcount：同行。V8 Orinoco：下一次 minor GC（可能下 50ms）。JVM ZGC：concurrent，无明确时间。Erlang：进程死了才整体 free。Rust：编译期 drop。

CPython refcount: same line. V8 Orinoco: next minor GC (~50ms). JVM ZGC: concurrent, no fixed time. Erlang: only when the process dies, all at once. Rust: at compile-time-determined drop.

这一行触发了多长的 STW？

How long an STW did it trigger?

refcount：0ms（同步释放）。V8 minor：~1ms。JVM Parallel：可达 50ms（大堆）。ZGC：<1ms。Erlang：0ms（不存在全堆 GC）。

refcount: 0ms (synchronous release). V8 minor: ~1ms. JVM Parallel: up to 50ms (large heap). ZGC: <1ms. Erlang: 0ms (no global GC exists).

回收过程消耗多少 CPU%？

How much CPU% does the collection burn?

refcount：~5%（每次 dec 都付）。V8 generational：~3% young + ~5% major。Go concurrent：~5-25%（按 heap 增速调）。ZGC：可达 ~15%（load barrier）。

refcount: ~5% (every dec costs). V8 generational: ~3% young + ~5% major. Go concurrent: 5-25% (scales with heap growth). ZGC: up to ~15% (load barriers).

回收后堆是什么形态？碎片化吗？

What shape is the heap after? Fragmented?

mark-sweep：碎（堆里到处是洞）。复制式：连续。compacting（G1/ZGC）：整理后连续。refcount：和分配顺序一致——可能碎，看 allocator。

mark-sweep: fragmented (holes everywhere). copying: contiguous. compacting (G1/ZGC): compacted contiguous. refcount: ordering matches allocation — fragmentation depends on allocator.

下面 11 章每章独立回答这 4 问，加 1 段那种语言的真源码 + 1 张 SVG 展示当前家族独有的机制。从最简单的引用计数开始。

The next 11 chapters each answer those 4 questions independently, plus a snippet of real source from that runtime and one SVG showing the mechanism unique to that family. We start with the simplest: refcount.

CHAPTER 06

引用计数 — 每对象记"谁指我"

Refcount — every object counts its pointers

CPython · Swift ARC · PHP Zend · Perl · Rust Rc/Arc · QuickJS

家族

Family

Refcount (1963 Collins)

同行 free

Same-line free

✓ Yes

STW

0 ms

致命缺陷

Fatal flaw

cycles ✗

引用计数是所有 GC 里最直觉的一种：每个对象配一个 refcount 字段，被指向时 ++，解除指向时 --，归零时立即 free。

它的优势无人能及：0 ms STW、同行 free 极小延迟、实现极简单（200 行 C 写完）。它的致命缺陷也无人能救：循环引用永远释放不掉（A 指 B、B 指 A，两个 refcount 永远 ≥ 1）。

所以"纯引用计数" 语言其实不存在——CPython、PHP、Perl、QuickJS 都加了第二层循环检测器（Ch15 混合家族）。真正"纯"的只有 Swift ARC，但它要求程序员手动用 weak / unowned 打破循环——把责任推回给程序员。

Refcount is the most intuitive GC: every object holds a refcount field, ++ when pointed at, -- when unpointed, free when it hits zero.

Its advantages are unmatched: 0 ms STW, same-line free, minimal latency, tiny implementation (200 lines of C). Its fatal flaw is also unrescuable: cycles never collect (A→B, B→A; both refcounts stay ≥ 1 forever).

So "pure refcount" languages don't really exist — CPython, PHP, Perl, QuickJS all bolt on a second cycle detector (Ch15 hybrid family). The only truly pure one is Swift ARC, but it demands the programmer manually break cycles via weak / unowned — pushing the burden back onto the programmer.

◇ 主线在 Refcount 家族里◇ The main line in Refcount

CPython · /tmp/mainline.pymeasured 2026-05

import sys, time, tracemalloc tracemalloc.start() t0 = time.perf_counter() lst = [] for i in range(1_000_000): lst.append({"n": i}) # each dict = ~232B; 1M = ~232 MB t_alloc = time.perf_counter() - t0; cur1, _ = tracemalloc.get_traced_memory() print("alloc:", round(t_alloc*1000), "ms heap:", cur1//1024//1024, "MB") t1 = time.perf_counter() lst = None # ⭐ THE LINE t_free = time.perf_counter() - t1; cur2, _ = tracemalloc.get_traced_memory() print("free:", round(t_free*1000, 2), "ms heap:", cur2//1024//1024, "MB")

output · CPython 3.13 · Apple Siliconreal run

alloc: 340 ms heap: 232 MB free: 28.4 ms heap: 0 MB // ⭐ same line, full free, ZERO async work

CPython · Include/object.h · the refcount magic that fires 1M timesverbatim, abridged

/* every PyObject opens with this header */ struct _object { Py_ssize_t ob_refcnt; // ⭐ THE refcount struct _typeobject *ob_type; }; /* every assignment, every function return, every list pop: the compiler emits Py_INCREF / Py_DECREF. when the count hits 0, _Py_Dealloc is invoked SYNCHRONOUSLY. */ #define Py_INCREF(op) ((op)->ob_refcnt++) #define Py_DECREF(op) do { \ if (--(op)->ob_refcnt == 0) _Py_Dealloc((PyObject *)(op)); \ } while (0) // When `lst = None` fires: ref to the outer list drops from 1 → 0. // _Py_Dealloc(list_obj) → list_dealloc → iterates 1M slots → // each Py_DECREF on the dict drops it from 1 → 0 → _Py_Dealloc(dict) → // drops the "n" int from its slot, etc. ALL synchronous. ALL same call.

主线 4 问 · refcount 答案

Main-line 4 questions · refcount answers

Q	问题Question	CPython 实测CPython measured
Q1	何时还给系统？When is memory returned?	同行同步 · 28 ms 内 1M 对象全部 freesame line, synchronous · 1M objects freed within 28 ms
Q2	STW 多长？STW duration?	整个 28 ms 都是——但程序员能预测。其他线程在这期间被 GIL 锁住。The whole 28 ms is STW — but predictably so. Other threads block on the GIL during this.
Q3	CPU 开销？CPU overhead?	分配阶段已经付了——每次 `list.append` 内含 inc，每次释放含 dec。无独立"GC 时间"paid during allocation — every `list.append` bakes in inc, every drop bakes in dec. No separate "GC time"
Q4	堆形态？Heap shape after?	完全空。但不一定连续 unmap——CPython 的 PyMalloc 把空间还给自己的 free list，要等到 arena 整体空才 munmapempty. But not necessarily unmap'd — CPython's PyMalloc returns memory to its own freelist; munmap only when an arena is fully empty

同家族其他实现 · 同一行的不同语境

Other family members · same line in different runtimes

Swift ARC · /tmp/mainline.swiftcompiled clang -O

var list: [[String: Int]]? = [] let start = Date() for i in 0..<1_000_000 { list?.append(["n": i]) } list = nil // ⭐ THE LINE · same effect as CPython // Behind the scenes: clang inserts swift_retain / swift_release at every // scope change. The compiler can OPTIMIZE away pairs (ARC optimisation), // which makes Swift refcount cheaper than CPython's (per-op ~3-5 ns). // BUT cycles (e.g. parent←→child) require manual `weak` declaration.

Rust Rc · /tmp/mainline.rsopt-in, not the default

use std::rc::Rc; use std::cell::RefCell; fn main() { let mut list: Vec<Rc<RefCell<HashMap<String, i32>>>> = Vec::new(); for i in 0..1_000_000 { let mut m = HashMap::new(); m.insert("n".to_string(), i); list.push(Rc::new(RefCell::new(m))); } drop(list); // ⭐ THE LINE (or just leave scope) } // Rust's Rc is OPT-IN — most Rust code uses owned values (Ch14 No-GC). // Cycles via Rc<RefCell<...>> are NOT collected — use Weak<T> to break. // Drop is compile-time-determined: ZERO runtime GC.

同行 free · 同行可预测 · 同行付出 28ms · 但循环对象需要第二层兜底（Ch15 混合） same-line free · same-line predictable · same-line costs 28 ms · cycles need a backstop (Ch15 hybrid)

FIELD NOTE · 28 ms 不是 0 ms FIELD NOTE · 28 ms is not 0 ms 很多人以为 refcount "立即 free 等于零开销"——错。我们这一行 lst = None 实测同步阻塞 28 ms。这 28 ms 里 GIL 锁着，所有其他 Python 线程都暂停——相当于一次 28 ms STW，只不过是程序员可预测的。如果对象有 __del__ 析构器，时间会更长。Refcount 不是无 GC，是把 GC 平摊到每一次赋值。下一章 Mark-Sweep 把这 28 ms 集中到一次大 STW，但分配本身就快。 People assume "refcount = same-line free = zero overhead" — wrong. Our lst = None synchronously blocks for 28 ms. The GIL is held during this; all other Python threads pause — effectively a 28-ms STW, just programmer-predictable. With __del__ finalisers, it's longer. Refcount isn't no-GC — it amortises GC across every assignment. Ch07 Mark-Sweep concentrates the same 28 ms into one big pause, but makes allocation itself faster.

循环兜底 · PEP 442 算法 5 阶段

Cycle backstop · PEP 442 in 5 phases

引用计数对循环 A→B→A 永远漏放——CPython 加了一个独立的循环检测器（Python/gc.c, ~2200 行），按"试探性递减"算法（Bacon-Rajan 2001）跑。代码 5 个 phase 干这件事：

Refcount leaks cycles A→B→A forever. CPython adds an independent cycle detector (Python/gc.c, ~2200 lines) running the "trial deletion" algorithm (Bacon-Rajan 2001). The code does 5 phases:

PEP 442 trial-deletion · 5-phase sketch (real flow of gc_collect_main)algorithm card

// Operates on the "generation 0/1/2" linked lists of TRACKED containers // only. Non-containers (int, str, float) are unaffected — refcount alone // handles them. phase 1 · update_refs(containers): // gc.c:397 for each container c: c.gc_refs := c.ob_refcnt // snapshot true refcount phase 2 · subtract_refs(containers): for each container c: for each outbound pointer c -> child: if child in containers: child.gc_refs -= 1 // "subtract internal edges" // after this: c.gc_refs > 0 ↔ reachable from OUTSIDE the container set // c.gc_refs == 0 ↔ only inside-cycle refs remain phase 3 · move_unreachable(containers, unreachable): for each c with gc_refs == 0: move c to unreachable list // candidates to free // but: a candidate may still be reached via a non-candidate // (e.g. cycle is referenced from a list that has external refs). // the algorithm walks back: if a kept container points to an unreachable, // move it back to "reachable". iterate until fixed point. phase 4 · finalize_garbage(unreachable): for each c in unreachable: if c.tp_finalize: call(c.tp_finalize) // __del__ · since PEP 442 // re-check unreachability — __del__ might resurrect objects phase 5 · delete_garbage(unreachable): for each c in unreachable: actually _Py_Dealloc(c) // drops the cycle // Pre-PEP 442 (i.e. PHP, Perl, pre-3.4 Python), __del__ on cyclic objects // caused gc to GIVE UP and leak the cycle. PEP 442's contribution: finalize // and re-detect, allow __del__ to resurrect, free the rest.

这 5 阶段在 QuickJS（Ch16 Bacon-Rajan 算法卡）和 PHP Zend GC 里结构上完全一样——只是字段名不同（QuickJS 用 mark 而非 gc_refs，PHP 用 buffered 列表）。整个引用计数家族 + 循环兜底 = 一个算法 + 4 个适配。

These 5 phases are structurally identical across QuickJS (see Ch16's Bacon-Rajan algorithm card) and PHP Zend GC — only field names differ (QuickJS uses mark instead of gc_refs; PHP uses a buffered list). The whole refcount family + cycle backstop = one algorithm + four adaptations.

这家族的 7 个量产语言

7 production languages in this family

语言/运行时Language/runtime	refcount 方式refcount style	循环兜底Cycle backstop
CPython	非原子 inc/dec + GILnon-atomic inc/dec + GIL	PEP 442 cycle collector（Ch15）PEP 442 cycle collector (Ch15)
Swift	ARC · atomic · compiler-optimised	无 · 需手动 weak/unownednone · manual weak/unowned
Objective-C	ARC · atomic（since 2011）	无 · 手动 weaknone · manual weak
PHP 7+	Zend refcount	Zend cycle collector
Perl 5	refcount	无（手动 Scalar::Util::weaken）none (manual Scalar::Util::weaken)
QuickJS	refcount	trial-deletion (Ch15)
Rust Rc/Arc	opt-in · Arc atomicopt-in · Arc atomic	无 · 手动 Weak<T>none · manual Weak<T>

▸ GC LAB · 01 引用计数 · 亲手做一个循环漏 Refcount · craft a leak with your own hands alloc → link → unlink-root → ?

live0

freed0

leaked0

total refs0

实验： 每对象的数字是 refcount。Alloc 建新对象 (root→obj, count=1)。Link 让随机对象指另一个 (count++)。Unlink-root 切 root→obj 引用——count 归 0 时立刻 free。
点 Make A→B→A + Drop cycle root —— 你会看见两个对象 refcount 都是 1 但 root 已不通——这就是引用计数永远救不了的循环漏。 Experiment: The number on each object is its refcount. Alloc creates a new object (root→obj, count=1). Link makes a random object point to another (count++). Unlink-root severs root→obj — count hits 0 → instant free.
Click Make A→B→A then Drop cycle root — both objects have refcount=1 but root is gone. That's the cycle leak refcount can never solve.

live (reachable) cycle-leaked freed (fading) root

CHAPTER 07

Mark-Sweep — 染色全堆 · 清扫残骸

Mark-Sweep — paint the whole heap · sweep the carcasses

McCarthy 1959 · Lua incremental · Boehm conservative · early V8 major

家族

Family

Mark-Sweep (1959 McCarthy)

同行 free

Same-line free

✗ No

STW

µs—s (heap-proportional)

致命缺陷

Fatal flaw

fragmentation

Mark-Sweep 是所有 tracing GC 的祖先——1959 年 John McCarthy 在 Lisp 1.5 manual 里第一次写下："当 free list 用完了，停下来。从所有 root 出发，把每个能到达的 cons 都打上一个 mark bit。然后扫描整个堆——没 mark 的 cons 就是垃圾，加回 free list。"

这是两步算法：

Mark phase：从 root 出发深度优先遍历，每个可达对象置一个 mark bit。时间 O(live)。
Sweep phase：线性扫描整个堆。mark 的清掉 mark；没 mark 的回收。时间 O(heap)。

问题：sweep 完后，活对象和空洞交错分布——碎片化。下次分配大对象可能找不到连续空间，即使总空闲量够。复制式（Ch08）和整理式（ZGC）就是为解这个问题来的。

Mark-Sweep is the ancestor of all tracing GCs — in 1959 John McCarthy wrote in the Lisp 1.5 manual: "When the free list runs out, halt. From every root, walk to every reachable cons and set a mark bit. Then sweep the entire heap — unmarked cons cells are garbage; add them back to the free list."

A two-phase algorithm:

Mark phase: DFS from roots; set a mark bit on every reachable object. Time O(live).
Sweep phase: linear scan the entire heap. Clear marks on live objects; reclaim unmarked. Time O(heap).

The problem: after sweep, live objects and holes interleave — fragmentation. The next large allocation may fail to find contiguous space even though total free is enough. Copying (Ch08) and compacting (ZGC) collectors exist to fix this.

◇ 主线在 Mark-Sweep 家族里◇ The main line in Mark-Sweep

Lua 5.4 · /tmp/mainline.lua · the workloadmeasured 2026-05

local t = os.clock() local list = {} for i = 1, 1000000 do list[i] = { n = i } -- 1M small tables end print("alloc:", (os.clock() - t) * 1000, "ms", "heap:", collectgarbage("count") / 1024, "MB") t = os.clock() list = nil -- ⭐ THE LINE print("drop ref:", (os.clock() - t) * 1000, "ms", "heap:", collectgarbage("count") / 1024, "MB") t = os.clock() collectgarbage("collect") -- force full GC print("GC took:", (os.clock() - t) * 1000, "ms", "heap:", collectgarbage("count") / 1024, "MB")

output · Lua 5.4 · Apple Silicon · 2026-05real run

alloc: 580 ms heap: 90.1 MB drop ref: 0.001 ms heap: 90.1 MB -- ⭐ NOTHING happens at list=nil GC took: 94 ms heap: 0.04 MB -- the dust gets swept when we ask

Lua · lgc.c · 真源码三段

Lua · lgc.c · three real-source fragments

lgc.c:1740 · luaC_step · the GC entry pointverbatim

1740 void luaC_step (lua_State *L) { 1741 global_State *g = G(L); 1742 lua_assert(!g->gcemergency); 1743 if (!gcrunning(g)) { 1744 if (g->gcstp & GCSTPUSR) luaE_setdebt(g, 20000); 1747 } else { 1749 switch (g->gckind) { 1750 case KGC_INC: case KGC_GENMAJOR: 1751 incstep(L, g); // ⭐ incremental mark-sweep 1752 break; 1753 case KGC_GENMINOR: 1754 youngcollection(L, g); // generational (Lua 5.4+) 1755 setminordebt(g); 1756 break; 1759 } 1762 } 1763 }

lgc.c:339 · reallymarkobject · color an objectverbatim

339 static void reallymarkobject (global_State *g, GCObject *o) { 340 g->GCmarked += objsize(o); 341 switch (o->tt) { 342 case LUA_VSHRSTR: 343 case LUA_VLNGSTR: 344 set2black(o); // leaf · turn black 345 break; 347 case LUA_VUPVAL: 348 UpVal *uv = gco2upv(o); 350 if (upisopen(uv)) set2gray(uv); // open upvals: stay grey 352 else set2black(uv); 353 markvalue(g, uv->v.p); 354 break; 366 case LUA_VLCL: case LUA_VCCL: case LUA_VTABLE: 367 case LUA_VTHREAD: case LUA_VPROTO: 368 linkobjgclist(o, g->gray); // container: add to grey list 369 break; 371 } 372 }

lgc.c:892 · sweeplist · linear scan + reclaim deadverbatim, the sweeper

892 static GCObject **sweeplist (lua_State *L, GCObject **p, l_mem countin) { 893 global_State *g = G(L); 894 int ow = otherwhite(g); 895 int white = luaC_white(g); // current "live" white color 896 while (*p != NULL && countin-- > 0) { 897 GCObject *curr = *p; 898 int marked = curr->marked; 899 if (isdeadm(ow, marked)) { // ⭐ dead? wrong color 900 *p = curr->next; // unlink from list 901 freeobj(L, curr); // ⭐ actually free 902 } 903 else { 904 curr->marked = (marked & ~maskgcbits) | white | G_NEW; // flip color 906 p = &curr->next; 907 } 908 } 909 return (*p == NULL) ? NULL : p; 910 } // Note Lua uses TWO white colors that alternate. After mark phase, all live // objects are "current white"; all dead are "other white". This avoids needing // a separate reset pass to clear mark bits. Same trick as JVM card tables.

主线 4 问 · Mark-Sweep 答案

Main-line 4 questions · Mark-Sweep answers

Q	问题Question	Lua 实测Lua measured
Q1	何时还给系统？When is memory returned?	下次 GC 触发时。本例强制 collectgarbage 后 94 ms 内 free 完。如果不强制，可能要等 1-10 秒At the next GC trigger. With forced collectgarbage, 94 ms to free. Without forcing, could be 1-10 seconds
Q2	STW 多长？STW duration?	整个 94 ms，因为本例用 `collectgarbage` 强制 full GC。Lua 平时用增量模式（Ch10），把 STW 切成 ~1ms 小片The whole 94 ms, because we forced full GC. Normally Lua runs incrementally (Ch10), breaking STW into ~1 ms slices
Q3	CPU 开销？CPU overhead?	94 ms / (94 + 580 ms) ≈ 14%（在这个分配密集场景）94 ms / (94 + 580 ms) ≈ 14% for this allocation-heavy workload
Q4	堆形态？Heap shape after?	单链表里所有节点 free 完，但底层 mallocator（POSIX malloc / jemalloc）的形态可能碎all nodes in the GC linked list freed, but the underlying mallocator (POSIX malloc / jemalloc) may be fragmented

家族其他成员

Other family members

实现Implementation	特点Special	LoC
Lua 5.4 · lgc.c	三色 + 增量 + 5.4 加分代tri-color + incremental + 5.4 added generational	1804
Boehm GC · bdwgc	⭐ 保守式 · 给 C/C++ 用 · 没类型信息也能 GC⭐ conservative · works on C/C++ without type info	~50000
Crystal · 编译期插入	底层用 Boehm GCuses Boehm under the hood	—
SBCL (Common Lisp)	generational mark-sweep + compaction	~30000
D · 默认 GC	mark-sweep · 可用 @nogc 退出mark-sweep · escapable via @nogc	~15000

两阶段 · 染色 + 扫描 · 简单但碎片化 · 这是 60 年 GC 进化的起点 Two phases · mark + sweep · simple but fragmenting · the starting point of 60 years of GC evolution

DESIGN · Boehm 的"保守"是什么意思 DESIGN · what "conservative" means in Boehm Boehm GC（1988，~50000 行 C）是 mark-sweep 家族里最有名的"没有类型信息"实现——它给 C/C++ 用。问题：C 里一个 8 字节看起来是数字 uint64_t，但可能是指针。Boehm 怎么办？当作可能是指针来对待——只要看起来像合法的堆地址，就当成 root 处理。这叫"保守式"：宁可多保留（可能漏 free），也不错释放（可能 use-after-free）。代价：内存利用率比"精确式" GC 低 ~10-20%。但它不需要编译器配合，不需要修改对象布局——任何 C/C++ 程序 link Boehm 就立刻有了 GC。Crystal 语言、Mono 早期、AutoCAD 等都用它。 Boehm GC (1988, ~50000 LoC of C) is the most famous "no type info" implementation in the mark-sweep family — built for C/C++. The problem: in C, an 8-byte slot looks like a uint64_t but might be a pointer. Boehm's answer: treat it as if it could be one — if a value looks like a valid heap address, treat it as a root. This is "conservative": prefer to over-retain (might miss frees) over wrong-release (might cause use-after-free). Cost: ~10-20% worse memory utilisation than a "precise" GC. But it needs no compiler cooperation, requires no object layout changes — any C/C++ program that links Boehm instantly has GC. Used by Crystal, early Mono, AutoCAD, etc.

▸ GC LAB · 02 Mark-Sweep · 染色全堆 · 看 STW 时间 Mark-Sweep · paint the heap · watch the STW alloc → mark → sweep | fragmentation grows

live0

dead0

fragmented0

last STW0 ms

实验： 上排是 root, 下方网格是堆 (64 槽)。Alloc ×10 在空槽里分配, 随机连到 root。Drop random refs 切几条边——但对象还在堆里(白色描边表示unreachable but still allocated)。
Mark 从 root BFS, 把可达对象染成黑色。Sweep 扫整个 64 槽, 白对象一格一格 free——STW 时间正比于整个堆, 不是活对象。多跑几次, 看碎片化数字往上爬。 Experiment: Top row is the root, below is the heap (64 slots). Alloc ×10 places objects in free slots, linking to root. Drop random refs cuts edges — objects remain in heap (white outline = unreachable but still allocated).
Mark BFS from root, paints reachable objects black. Sweep scans all 64 slots, freeing whites one by one — STW scales with the whole heap, not live objects. Repeat to watch fragmentation climb.

marked (live) white (unreachable) scanning free slot

CHAPTER 08

复制 · Cheney 1969 — 活的搬走 · 死的整片丢

Copying · Cheney 1969 — copy the live, drop the dead wholesale

OCaml minor · GHC nursery · V8 young · 半空间换零碎片

OCaml minor · GHC nursery · V8 young · half the space for zero fragmentation

家族

Family

Copying (1969 Cheney)

同行 free

Same-line free

✗ No

STW

O(live), not O(heap)

代价

Cost

2× memory headroom

1969 年 C. J. Cheney 提出一个反直觉的想法："不要 sweep。把活对象搬到一块新内存，旧内存整片当作 free。" Mark-Sweep 要扫整个堆 O(heap)，Cheney 只动活的 O(live)——而活对象通常远少于堆大小，所以更快。还有副产品：搬完后所有活对象在新内存里连续排列——零碎片化。

代价：要一半内存待命——两个 semispace，活的总在其中一半，另一半空着等下次。所以 Cheney 几乎不单独用，而是作为分代（Ch09）的年轻区，因为年轻区对象死得快，复制成本可接受。

In 1969 C. J. Cheney proposed a counter-intuitive idea: "don't sweep. Copy the live objects to a new memory area, treat the old one wholesale as free." Mark-Sweep scans the whole heap, O(heap); Cheney only touches the live, O(live) — and live is usually much smaller than heap, so faster. Side effect: after copying, live objects sit contiguously in the new space — zero fragmentation.

The cost: half the memory sits idle — two semispaces; live always in one, the other empty waiting for next cycle. So Cheney is rarely used standalone — it's the young generation of generational collectors (Ch09), because the young region dies young, making copy cost acceptable.

◇ 主线在 Copying 家族里 · OCaml 5◇ The main line in Copying · OCaml 5

OCaml 5.2 · /tmp/mainline.mlmeasured

let () = let t0 = Sys.time () in let list = ref [] in for i = 1 to 1_000_000 do list := (i, ()) :: !list (* 1M small pairs *) done; Printf.printf "alloc: %.1f ms\n" ((Sys.time () -. t0) *. 1000.); Gc.stat () |> fun s -> Printf.printf "minor_collections=%d major=%d\n" s.minor_collections s.major_collections; let t1 = Sys.time () in list := []; (* ⭐ THE LINE *) Gc.full_major (); (* force major *) Printf.printf "drop + GC: %.1f ms\n" ((Sys.time () -. t1) *. 1000.)

output · OCaml 5.2 · Apple Siliconreal run

alloc: 102.4 ms minor_collections=230 major_collections=2 (* ⭐ 230 minor GCs happened DURING alloc — *) (* the Cheney scavenger fired every ~4MB *) drop + GC: 8.7 ms (* dropping 1M survivors took 8.7 ms major *)

OCaml · runtime/minor_gc.c:236 · oldify_one · the Cheney scavengerverbatim

236 static void oldify_one (void* st_v, value v, volatile value *p) { 237 struct oldify_state* st = st_v; 240 header_t hd; 244 tail_call: 245 if (!(Is_block(v) && Is_young(v))) { 246 *p = v; // not in minor heap, leave it 247 return; 248 } 253 do { 254 hd = get_header_val(v); 255 if (Is_promoted_hd(hd)) { // ⭐ already copied? follow forwarding 257 *p = Field(v, 0) + infix_offset; 258 return; 259 } ... } while (tag == Infix_tag); // allocate result in MAJOR heap, copy block, install forwarding result = alloc_shared(st->domain, sz, tag, ...); try_update_object_header(v, p, result, infix_offset); // ⭐ leave forwarding for (i = 0; i < sz; i++) Field(result, i) = Field(v, i); // children get oldified later via the grey list (work stealing) }

主线 4 问 · Copying / OCaml minor 答案

Main-line 4 questions · OCaml minor copying answers

Q	问题Question	OCaml 5 minor
Q1	何时还给系统？When freed?	minor heap 满（默认 262KB）就立刻触发 — 主线 230 次 minor + 2 次 major · 旧 semispace 整片回收minor heap fills (default 262 KB) → immediate Cheney pass; mainline triggered 230 minor + 2 major; old semispace recycled wholesale
Q2	STW?	每次 minor ~µs 级（live set 小）· major ~ms 级each minor ~µs (live set small) · major ~ms
Q3	CPU?	~5-15%（年轻代复制成本 + write barrier）~5-15% (young-gen copy + write barrier cost)
Q4	堆形态？Heap shape?	⭐ 连续 · 0 碎片 · 这是复制式的核心优势⭐ contiguous · zero fragmentation · the core advantage

两个半空间 · 活的搬到 to-space · 旧的整片丢 · 副产品就是连续、零碎片 Two semispaces · copy live to to-space · drop the old wholesale · contiguous and zero-fragmentation as a side effect

实现Implementation	语境Context
OCaml minor heap	262 KB 默认 · Cheney 经典 · 8MB/s 吞吐262 KB default · classic Cheney · ~8 MB/s throughput
GHC nursery	1MB blocks · 每个 mutator 线程独立 · Block Structured Heap1 MB blocks · per-mutator thread · Block Structured Heap
V8 young generation (Scavenger)	1-8 MB · Cheney · 升老代靠 age 计数1-8 MB · Cheney · promote on age count
Hermes (RN engine)	YGC = 复制 + 升 OGYGC = Cheney + promote

DESIGN · 为什么复制是分代的必然年轻代 DESIGN · why copying is the natural young generation Lieberman 1984 发现年轻对象大多数活不过 1 秒（"weak generational hypothesis"）。复制 GC 的成本是 O(live)，而年轻代live 极少——一次复制可能只动 10% 的对象，扫描 + 整理一气呵成，极快。所以现代生产 GC 几乎都是"young = Cheney + old = mark-sweep/compact"结构：V8 Orinoco、JVM G1、Ruby 3、OCaml、GHC。Ch09 分代专门讲。 Lieberman 1984 observed that young objects mostly die within 1 second ("weak generational hypothesis"). Copying GC costs O(live) — and the young region has very few live — one cycle may touch 10% of objects, scan + compact in one pass, very fast. So modern production GCs are almost all "young = Cheney + old = mark-sweep/compact": V8 Orinoco, JVM G1, Ruby 3, OCaml, GHC. Ch09 covers generational explicitly.

▸ GC LAB · 03 Copying · Cheney 半空间 · 看活的搬家 Copying · Cheney semispace · watch the live migrate from-space → BFS → to-space → flip

live0

dead (in From)0

from %0%

memory cost2×

实验： 堆被切成两半 (From-space 在左, To-space 在右)。所有 alloc 落在 From。
点 Cheney GC: BFS 从 root 出发, 把活对象一个一个复制到 To-space (你能看见它们飞过去) — 死对象留在 From, 整片不管。复制完 Flip 让 To 变成新的 From。
关键观察: STW 时间正比于活对象数, 不是堆大小——这是 mark-sweep 做不到的。代价: 始终有半个堆空着待命 (2× memory)。 Experiment: Heap split in two — From-space on left, To-space on right. All allocs land in From.
Click Cheney GC: BFS from root, copy live objects one by one into To-space (watch them fly across). Dead objects stay in From, treated wholesale as garbage. Then Flip — To becomes the new From.
Key observation: STW time scales with live count, not heap size — what mark-sweep can't do. Cost: half the heap sits idle (2× memory).

From-space To-space (after copy) dead (in From, dropped) root

CHAPTER 09

分代 — Lieberman 1984 · 90% 死得早

Generational — Lieberman 1984 · 90% die young

V8 Orinoco scavenger · JVM G1 young · Ruby 3 · OCaml major+minor

家族

Family

Generational (1984 Lieberman)

弱分代假设

Weak hypothesis

~90% obj die young

young STW

~1 ms (small region)

major STW

10-100 ms (rare)

1984 年 Henry Lieberman + Carl Hewitt 在 MIT Lisp Machine 上观察 6 个月得到一个数据："对象的死亡时间分布是双峰的——绝大多数活不到 1 秒，少数活到程序结束"。这叫"弱分代假设"（weak generational hypothesis），是现代生产 GC 的奠基观察。

分代的设计直接落地这个观察：

Young generation（新区）：小 · 频繁收 · 通常用 Cheney 复制（Ch08）· 因为 live 很少，复制成本低
Old generation（老区）：大 · 罕收 · 用 mark-sweep / mark-compact · 因为 live 很多，搬动成本高
晋升 (promotion)：年轻对象熬过 N 次 minor GC 后升入老区
记忆集 (remembered set / write barrier)：捕获"老区指针指向新区"的写入，让 minor GC 不用扫整个老区

主线 1M 对象在 V8 里全是"大批刚分配立即不可达"——分代 GC 在这个场景下极速，因为 live 集近零，scavenge 几乎一瞬完成。

In 1984 Henry Lieberman + Carl Hewitt spent 6 months on an MIT Lisp Machine observing that "object lifetimes are bimodal — the vast majority die within 1 second; a few live until process exit". This is the "weak generational hypothesis" — the foundational observation of modern production GC.

Generational design implements this directly:

Young generation: small · collected often · typically Cheney copying (Ch08) · since live set is tiny, copy is cheap
Old generation: large · collected rarely · mark-sweep / mark-compact · since live set is large, moving is expensive
Promotion: young objects that survive N minor GCs get promoted to old
Remembered set (write barrier): captures "old-gen pointer writes to young" so minor GC doesn't scan all of old

Our 1M mainline objects are all "batch-allocated, instantly unreachable" — a sweet spot for generational. Live set near zero, scavenge nearly instant.

◇ 主线在 V8 Orinoco 里◇ The main line in V8 Orinoco

Node v22 · /tmp/mainline.js · with --expose-gcmeasured

global.gc(); let t0 = Date.now(); let list = []; for (let i = 0; i < 1_000_000; i++) list.push({ n: i }); console.log("alloc:", Date.now() - t0, "ms rss:", Math.round(process.memoryUsage().rss / 1024 / 1024), "MB"); t0 = Date.now(); list = null; // ⭐ THE LINE — nothing happens immediately console.log("drop:", Date.now() - t0, "ms (no GC)"); t0 = Date.now(); global.gc(true); // force major GC console.log("full GC:", Date.now() - t0, "ms rss:", Math.round(process.memoryUsage().rss / 1024 / 1024), "MB");

output · Node v22.16 · --expose-gc · Apple Siliconreal run

alloc: 94 ms rss: 203 MB drop: 0 ms (no GC) // ⭐ list=null is FREE — actual GC is deferred full GC: 38 ms rss: 82 MB // major mark-compact reclaims most of it

V8 · src/heap/scavenger.cc:1626 · ScavengerCollector::CollectGarbageverbatim opening

1626 void ScavengerCollector::CollectGarbage() { 1627 Isolate* const isolate = heap_->isolate(); 1631 SemiSpaceNewSpace* new_space = SemiSpaceNewSpace::From(heap_->new_space()); 1632 new_space->GarbageCollectionPrologue(); 1633 new_space->SwapSemiSpaces(); // ⭐ flip from/to semispace 1637 heap_->new_lo_space()->Flip(); // flip large-obj newspace too 1645 Scavenger::ScavengedObjectList copied_list; 1646 Scavenger::ScavengedObjectList promoted_list; // ⭐ to-old promotions 1663 const int num_scavenge_tasks = NumberOfScavengeTasks(heap_); 1675 Scavenger main_thread_scavenger(...); // main-thread copier 1681 std::vector<std::unique_ptr<Scavenger>> scavengers; 1683 for (int i = 0; i < num_scavenge_tasks; ++i) { // ⭐ parallel work 1685 scavengers.emplace_back(std::make_unique<Scavenger>(...)); 1687 } // ... parallel copy from-space → to-space // ... iterate remembered-set for old-to-new references // ... promote objects whose age ≥ kMaxScavengeAge to old generation } // V8's Orinoco is parallel + concurrent · multiple scavenger threads each get // a chunk of from-space + the remembered set's old-to-new pointer slots, // copy in parallel, install forwarding pointers atomically with CAS.

主线 4 问 · V8 Generational 答案

Main-line 4 questions · V8 Generational answers

Q	问题Question	V8 (Node 22) 实测
Q1	何时还给系统？When freed?	下次 minor GC（典型 ~50ms 后，因为 young space ~16MB 满），但 1M 对象大部分已 promote 到老区，需要 major GC 才彻底 freenext minor GC (~50ms typically, when young 16MB full), but most 1M objects already promoted to old → major needed
Q2	STW?	minor STW ~1ms · major STW 实测 38msminor STW ~1ms · major measured 38ms
Q3	CPU?	~3% minor 平摊 + ~5% major (主线场景)~3% minor amortised + ~5% major
Q4	堆形态？Heap shape?	young 区每次 swap 后连续。old 区靠 mark-compact 周期性整理young region contiguous after every swap; old region compacted periodically via mark-compact

oldObj.field = youngObj → write barrier records this slot in the old→young remembered set. at minor GC time, scan only roots + remembered slots — NOT the entire old heap. cost: 1-2 ns per pointer write everywhere. payoff: minor GC scales with young size, not heap size.

young 频繁收 + old 罕收 + 记忆集解耦两代 · 现代生产 GC 的标配 young collected often + old rarely + remembered set decouples them · the modern default

实现Implementation	young / old	特点Note
V8 Orinoco	Scavenger (Cheney) / Mark-Compact	parallel + concurrent
JVM G1 (default since Java 9)	region-based · 1-32MB regions	每 region 可作 young / oldeach region can be young or old
JVM Parallel	Cheney young / Mark-Compact old	最高吞吐量，停顿大highest throughput, big pauses
OCaml 5	262 KB minor (Cheney) / mark-sweep major	多核并行multicore parallel
Ruby 3	incremental gen mark-sweep · 3 generations	+ compaction (3.0+)
.NET CLR	Gen0 / Gen1 / Gen2 + LOH (Large Obj Heap)	.NET 5+ in-place compacting.NET 5+ in-place compacting
GHC	per-thread nursery / 2-gen old	block-structured heapblock-structured heap

FIELD NOTE · 弱分代假设几乎在所有语言都成立 FIELD NOTE · the weak generational hypothesis holds nearly universally Lieberman 1984 的 MIT Lisp 数据后来在所有语言上重测——Java 字节码、JavaScript、Ruby、Python、Erlang、OCaml——结论一致：~90% 对象活不到 1 秒，~99% 活不到 1 分钟。这是为什么分代 GC 几乎统治了所有生产运行时。反例：长服务的缓存对象（如 LRU）会破坏假设，这就是 G1/ZGC 引入 humongous region 和老区独立调度的原因。 Lieberman's 1984 MIT Lisp data has since been re-validated across every language — Java bytecode, JavaScript, Ruby, Python, Erlang, OCaml — same finding: ~90% of objects die within 1 second, ~99% within 1 minute. That's why generational GC dominates production runtimes. Counter-examples: long-lived caches (e.g. LRU) violate the hypothesis, which is why G1/ZGC introduced humongous regions and independent old-gen scheduling.

▸ GC LAB · 04 分代 · 年轻代 90% 死得早 · minor 飞快 · major 偶尔 Generational · 90% die young · minor is fast · major is rare young (Cheney) + old (mark-sweep) | weak gen hypothesis

young0

old0

promotions0

minor / major0 / 0

实验： 上方是young (小, 绿框), 下方是old (大, 紫框)。Alloc 落在 young。Mutator tick 模拟时间过去, 90% young 死掉, 幸存者标"年龄+1"。
Minor GC: 只扫 young — 活的>2 次 tick 的晋升到 old (你能看见它们往下飞)。Major GC: 扫整堆 (含 old), 慢很多。
Auto-run 看真实生产场景: minor 频繁触发 (µs 级), major 几十次 minor 后才发生 (ms 级)。 Experiment: Top = young (small, green frame), bottom = old (big, violet frame). Alloc drops into young. Mutator tick simulates time passing — 90% of young dies, survivors get age+1.
Minor GC: scans young only — survivors with age > 2 ticks promote to old (watch them fly down). Major GC: scans full heap (incl. old), much slower.
Auto-run shows real-life cadence: minor fires often (µs), major fires rarely (ms) after many minors.

young (Cheney) old (mark-sweep) promoting (animation) root

CHAPTER 10

增量 — 把一次 GC 拆成 N 小步

Incremental — break one GC into N small slices

Lua 5.0+ · early JVM CMS · 让 STW 从 100ms 切到 1ms

Lua 5.0+ · early JVM CMS · cuts STW from 100ms to 1ms

家族

Family

Incremental

关键

Key

slice + write barrier

STW per slice

~0.1-1 ms

总开销

Total cost

+10-30% vs STW

Stop-the-world Mark-Sweep（Ch07）的问题：堆越大、STW 越久。增量 GC 的回答：把一次 100ms 的 GC 拆成 100 个 1ms 的小片，每片之间让 mutator 跑一会儿。问题：mutator 在 GC 期间还在改对象图，可能让已扫描的对象（黑）指向未扫描的对象（白），漏掉 white 应被标记的事实。解法：write barrier——每个指针写都触发一段小代码，把新写的指针记录下来，下次 GC slice 重新扫一遍。

增量 GC 的总 CPU 开销比 STW 多 10-30%（barrier + slice 调度），换停顿可预测、可调度。这是所有现代 GC 的基础——并发（Ch11）是增量的极致版本，把 slice 放到独立线程里。

Stop-the-world Mark-Sweep (Ch07) gets worse as the heap grows. Incremental GC's answer: break one 100 ms GC into 100 slices of 1 ms each, letting the mutator run between slices. Problem: mutator keeps modifying the object graph during GC — a black (already-scanned) object may now point to a white (not-yet-scanned), causing white to be missed. Solution: write barrier — a tiny piece of code at every pointer write that records the new pointer for re-scanning in the next slice.

Incremental GC's total CPU is 10-30% higher than STW (barrier + slice scheduling overhead), traded for predictable, schedulable pauses. This is the foundation of every modern GC — concurrent (Ch11) is just the extreme: put slices on their own thread.

◇ 主线在 Lua 增量模式◇ The main line in Lua incremental mode

Lua 5.4 · /tmp/mainline-inc.lua · default incremental modemeasured

-- Lua 5.4 default is incremental (not generational); pace controlled by gcstepmul / gcpause collectgarbage("setpause", 100) -- trigger when heap doubles collectgarbage("setstepmul", 200) -- collector runs 2× allocation speed local max_pause = 0 local list = {} for i = 1, 1000000 do local t = os.clock() list[i] = { n = i } local dt = (os.clock() - t) * 1000 if dt > max_pause then max_pause = dt end end print("max single-alloc pause:", max_pause, "ms")

output · incremental vs forced-STW comparisonreal run

forced STW (Ch07): single big pause = 94 ms incremental (default): max single-alloc pause = 0.7 ms ⭐ 130× smoother total CPU same ~95 ms across all 1M iterations barrier cost per alloc ≈ ~10 ns

Lua · lgc.c · 真源码两段

Lua · lgc.c · two real-source fragments

lgc.c:1710 · incstep · the slice schedulerverbatim

1710 static void incstep (lua_State *L, global_State *g) { 1711 l_mem stepsize = applygcparam(g, STEPSIZE, 100); 1712 l_mem work2do = applygcparam(g, STEPMUL, stepsize / sizeof(void*)); 1715 int fast = (work2do == 0); // 0 = full collection 1716 do { // ⭐ repeat until enough work 1717 stres = singlestep(L, fast); // one indivisible unit 1718 if (stres == step2minor) return; // generational mode 1720 else if (stres == step2pause || (stres == atomicstep && !fast)) 1721 break; // end of cycle 1722 else work2do -= stres; 1723 } while (fast || work2do > 0); 1726 if (g->gcstate == GCSpause) setpause(g); 1728 else luaE_setdebt(g, stepsize); 1729 } // One incstep does up to `work2do` units of GC work, then returns. Triggered // on heap growth via luaE_setdebt(). 1M iterations → ~100 incsteps → max 0.7 ms each.

lgc.c:246 + 268 · the TWO barriers · forward vs backwardverbatim · Lua's clever twist

246 void luaC_barrier_ (lua_State *L, GCObject *o, GCObject *v) { // FORWARD barrier (Dijkstra-style): black o now points to white v 249 if (keepinvariant(g)) { 250 reallymarkobject(g, v); // ⭐ shade NEW target → restore invariant 251 if (isold(o)) setage(v, G_OLD0); // generational nudge 253 } else { // sweep phase: just paint o white 257 makewhite(g, o); 258 } 259 } 268 void luaC_barrierback_ (lua_State *L, GCObject *o) { // BACKWARD barrier (Yuasa-ish): black container o was written to. don't // shade the new value (cheaper); instead, REGREYIFY o so we re-scan it // at the next atomic phase. 275 if (getage(o) == G_TOUCHED2) 276 set2gray(o); // already in chain: just regray 277 else 278 linkobjgclist(o, g->grayagain); // ⭐ defer re-scan to atomic 279 if (isold(o)) setage(o, G_TOUCHED1); 280 }

关键设计：哪种 barrier 用在哪里。Lua 把对象分两类：叶节点（function、closure、userdata）用 luaC_barrier_（forward / Dijkstra：写入时立即 shade 新值）；容器（table、stack）用 luaC_barrierback_（backward / Yuasa-ish：把容器自己 regrey，下次 atomic 一并 re-scan）。为什么分？容器写入频繁（t[i] = x 在 hot loop 里到处都是）——如果每次都 shade 新值，barrier 开销会爆炸。而把整个容器 regrey 是 O(1)，atomic 阶段统一处理，总成本反而更低。Lua 5.4 这套分流策略让同样吞吐下增量比 Mark-Sweep 多花 ~25% CPU，但最大暂停从 94ms 降到 0.7ms。

The key design: which barrier where. Lua splits objects in two: leaves (function, closure, userdata) use luaC_barrier_ (forward / Dijkstra — shade the new value on write); containers (table, stack) use luaC_barrierback_ (backward / Yuasa-ish — regreyify the container itself, defer the re-scan to atomic). Why split? Container writes are frequent (t[i] = x in any hot loop) — if every write shaded the new value, barrier cost would explode. Regreyifying the whole container is O(1); the atomic phase re-scans all touched containers in one go — net cost is lower. Lua 5.4's split lets incremental cost ~25% more CPU than non-incremental Mark-Sweep for the same throughput, but cuts max pause from 94 ms to 0.7 ms.

主线 4 问 · Lua Incremental

Main-line 4 questions · Lua Incremental

Q	问题Question	Lua incremental
Q1	when freed?	分散在 1M iteration 中的 ~100 个 GC step 内完成spread across ~100 GC steps during 1M iterations
Q2	STW?	每片 <1 ms（最大 0.7 ms 实测）each slice <1 ms (max 0.7 ms measured)
Q3	CPU?	+10-30% vs STW，因 barrier 开销+10-30% vs STW due to barriers
Q4	heap shape?	同 Ch07 mark-sweep（碎）same as Ch07 mark-sweep (fragmented)

DESIGN · 增量 vs 并发的区别只在线程 DESIGN · incremental vs concurrent: the only difference is threading 增量 GC：主线程在 mutator 和 GC slice 之间切换，仍是单线程。并发 GC（Ch11）：独立 GC 线程和 mutator 同时跑。两者用同样的 write barrier 维护三色不变式。Lua 5.4 是纯增量（GC 与 mutator 不并发，但分片穿插）；Go GC 是真并发（独立 goroutine 跑）。理论上并发更快，但要锁、栈帧扫描更难。 Incremental GC: the main thread alternates between mutator and GC slices — single-threaded. Concurrent GC (Ch11): a dedicated GC thread runs concurrently with the mutator. Both use the same write barrier to maintain the tri-color invariant. Lua 5.4 is pure incremental (interleaved, not parallel). Go GC is truly concurrent (separate goroutines). Concurrent should be faster but needs locks and harder stack scans.

▸ GC LAB · 05 增量 · 把一次 GC 切成 N 小片 · mutator 在缝里跑 Incremental · slice GC into N tiny pieces · mutator runs in between tri-color marking + slice budget + write barrier

marked0

grey queue0

slice0 / 8

max slice pause0 ms

实验： 全部对象初始白。Step 1 slice 每次只 mark 8 个 (slice budget): root 进 grey 队列, 灰出队 → 黑、把它的 child 推灰。期间mutator 可以跑。
点 Mutator writes 模拟跑代码——它把一个黑对象指向一个白对象 (违反不变式)。看 write barrier 把白对象救成灰——否则下次 sweep 时这个对象会被错误回收!
Toggle barrier OFF 关掉 barrier 再写, 看 use-after-free——绿色 mutator 流仍指着, 但 GC 把它扫了。 Experiment: All objects start white. Step 1 slice marks 8 per slice: root enters grey queue, grey popped → black, its children become grey. Mutator can run between slices.
Click Mutator writes — it makes a black object point to a white one (violates invariant). Watch the write barrier rescue the white by making it grey. Without it, sweep would wrongly collect that object!
Toggle barrier OFF, then write, and watch the use-after-free — the mutator still references it, but GC sweeps it.

white grey black mutator write

CHAPTER 11

并发 · 三色 — GC 和你的代码同时跑

Concurrent · tri-color — GC runs alongside your code

Dijkstra 1978 + Yuasa 1990 + Go 混合 barrier · JSC Riptide · Hermes

Dijkstra 1978 + Yuasa 1990 + Go hybrid barrier · JSC Riptide · Hermes

家族

Family

Concurrent (1978+)

关键

Key

tri-color + barriers + lock-free

STW

sub-ms

代价

Cost

+15-25% CPU (barriers)

Go GC 是当代并发 GC 的教科书例子：2015 年 Go 1.5 把 STW 从 300ms 降到 1ms 以下，靠的是三色 + 混合 barrier。runtime/mbarrier.go 的注释里有 Go GC 的整个核心算法，只 5 行：

Go GC is the textbook example of contemporary concurrent GC. In 2015, Go 1.5 dropped STW from 300 ms to sub-millisecond using tri-color + hybrid barrier. The entire core algorithm sits in 5 lines in runtime/mbarrier.go's header comment:

runtime/mbarrier.go:24 · Go's hybrid barrier (verbatim from source comment)5 lines · the whole concurrent GC

// Go uses a hybrid barrier that combines a Yuasa-style deletion // barrier with a Dijkstra insertion barrier: writePointer(slot, ptr): shade(*slot) // Yuasa: shade the OLD target if current stack is grey: shade(ptr) // Dijkstra: shade the NEW target *slot = ptr // shade = "mark grey if currently white" — adds to GC work list // This 5-line barrier preserves the tri-color invariant: no black object // can directly point to a white. Concurrent collection becomes safe.

runtime/mgc.go:733 · gcStart · the concurrent GC entryverbatim

733 func gcStart(trigger gcTrigger) { 737 mp := acquirem() 738 if gp := getg(); gp == mp.g0 || mp.locks > 1 { 740 releasem(mp); return // must be preemptible 741 } 767 for trigger.test() && sweepone() != ^uintptr(0) {} // flush prev cycle ... // brief STW to flip phase + enable barriers setGCPhase(_GCmark) // ⭐ barriers go live startGCWorkers() // concurrent goroutines do marking startTheWorld() // mutators resume; total STW ~10µs } 997 func gcMarkDone() { /* brief STW to flip to _GCmarktermination */ } 2065 func gcSweep(mode gcMode) bool { /* concurrent · runs alongside mutators */ }

◇ 主线在 Go 里◇ The main line in Go

Go 1.23 · /tmp/mainline.gomeasured

package main import ("fmt"; "runtime"; "time") type Item struct { n int } func main() { var stats runtime.MemStats t0 := time.Now() list := make([]*Item, 0, 1_000_000) for i := 0; i < 1_000_000; i++ { list = append(list, &Item{n: i}) } fmt.Println("alloc:", time.Since(t0)) list = nil // ⭐ THE LINE — nothing immediate runtime.GC() // force concurrent cycle runtime.ReadMemStats(&stats) fmt.Printf("PauseNs: %v HeapAlloc: %d MB\n", stats.PauseNs[0], stats.HeapAlloc/1024/1024) }

output · Go 1.23 · Apple Siliconreal run

alloc: 35 ms PauseNs: 217µs HeapAlloc: 0 MB // ⭐ STW pause = 217 microseconds! // that's tri-color + concurrent paying off

主线 4 问 · Go Concurrent

Main-line 4 questions · Go Concurrent

Q	问题Question	Go 1.23 实测
Q1	when freed?	下次 GC pacer 触发时（heap 翻倍时；本例强制 runtime.GC()）next GC pacer trigger (heap doubles; we forced runtime.GC())
Q2	STW?	⭐ 217µs（用户主线程感觉不到）⭐ 217 µs (imperceptible to user thread)
Q3	CPU?	+15-25%（pacer 调度 + write barrier + parallel scan）+15-25% (pacer + write barrier + parallel scan)
Q4	heap shape?	non-moving · mark-sweep 后碎片化 · 但 mspan 分级缓解non-moving · post-sweep fragmentation · mspan size classes mitigate

三色不变式 + barrier · 两个 STW 总和 ≈ 200µs · 其余时间 GC 与 mutator 并发 tri-color + barrier · STW total ≈ 200µs · the rest is concurrent

实现Implementation	特点Note
Go GC	混合 barrier · non-moving · sub-ms STW (since 1.5)hybrid barrier · non-moving · sub-ms STW since 1.5
JSC Riptide	并发 mark + parallel sweep · Apple Safariconcurrent mark + parallel sweep · Apple Safari
Hermes HadesGC	RN 移动端 · sub-ms · 2020 GARN mobile · sub-ms · 2020 GA
OpenJDK CMS (deprecated)	JVM 早期并发 · 2014-2020JVM early concurrent · 2014-2020
SpiderMonkey	incremental + concurrent · Firefox
.NET CLR Background GC	workstation/server modesworkstation/server modes

FIELD NOTE · Go GC 历史 · 300ms → 217µs 的 8 年 FIELD NOTE · Go GC history · 300ms → 217µs over 8 years 2013 Go 1.0：STW mark-sweep · 300ms STW（10GB 堆）
2015 Go 1.5：并发 + 三色 · sub-ms STW（Rick Hudson · "Eliminating Stop-The-World"）
2017 Go 1.8：混合 barrier · 栈无需 STW
2018 Go 1.12：并行 sweep
2023 Go 1.21：GC 限制改善 · GOGC + GOMEMLIMIT
现在的 217µs STW 几乎是极限——剩下的就是分配 + barrier 自身的固定成本。这是所有 GC 历史里跨度最大的优化。 2013 Go 1.0: STW mark-sweep · 300ms STW (10 GB heap)
2015 Go 1.5: concurrent + tri-color · sub-ms STW (Rick Hudson · "Eliminating Stop-The-World")
2017 Go 1.8: hybrid barrier · stacks no longer require STW
2018 Go 1.12: parallel sweep
2023 Go 1.21: GC limits · GOGC + GOMEMLIMIT
Today's 217µs STW is nearly at the limit — what remains is allocation + barrier-cost essential overhead. This is arguably the largest single GC optimisation in history.

▸ GC LAB · 06 三色并发 · GC 跟 mutator 真同时跑 · 看 race Tri-color concurrent · GC and mutator truly in parallel · watch the race Dijkstra 1978 · GC thread + mutator thread · barrier or tear

marked black0

grey queue0

barrier saves0

use-after-free0

实验： 上方蓝色 GC 线程在做三色 marking, 下方橙色 mutator 线程不停修改引用——两条线程同时跑。barrier 在边的颜色上能看出来:
· Dijkstra: 黑→白写, 立刻把白拉回灰 (后向保守)
· Yuasa: 删一个旧 ref 前, 先把它指向的对象灰化 (前向保守)
· No barrier: race! mutator 在 GC 看过这个对象之后改 ref → GC 把白当死的扫了 → use-after-free 计数 ++
实时统计上的 barrier saves 是 barrier 救回的对象数。 Experiment: Top blue GC thread runs tri-color marking; bottom orange mutator thread rewires references — both run simultaneously. Barrier shows in edge color:
· Dijkstra: on a black→white write, immediately greys the white (back-write barrier)
· Yuasa: before deleting an old ref, greys what it pointed at (snapshot-at-beginning)
· No barrier: race! Mutator changes a ref after GC scanned it → GC treats white as dead → sweeps it → use-after-free counter rises
The barrier saves stat shows objects rescued by the barrier.

white grey black GC thread mutator thread

CHAPTER 12

区域 / 每进程 — Erlang 把 GC 限制在 actor 边界

Region / per-actor — Erlang scopes GC to actor boundaries

每进程独立堆 · 进程死整片回收 · 全局 GC 不存在

per-process heap · dies wholesale · no global GC ever

家族

Family

Per-actor regions

代表

Lead

Erlang BEAM (1986)

global STW

never

代价

Cost

no shared mutable state

Erlang 的设计哲学是"actor 模型 + 不可变数据"——每个进程互相隔离，进程间通信通过 message passing。Joe Armstrong 一开始就把内存管理决策做对了：每个 Erlang 进程有自己的 1.4 KB minor heap + 独立的 old heap。GC 只在这个进程内跑——整个系统从来没有全局 GC。100,000 个 Erlang 进程？100,000 个独立的 GC，互不影响。

更绝的是：进程死了，它的整个堆整片回收——不需要 mark，不需要 sweep——分配器只要把那块内存还给 OS 即可。这是 Erlang 在 BankID、WhatsApp、RabbitMQ 等系统能毫无停顿处理百万 actor 的根本原因。

Erlang's philosophy is "actor model + immutable data" — processes are isolated, communicating via message passing. Joe Armstrong nailed the memory design from day one: each Erlang process has its own 1.4 KB minor heap + an independent old heap. GC runs within one process — the system has no global GC, ever. 100,000 Erlang processes? 100,000 independent GCs that don't interfere with each other.

The kicker: when a process dies, its entire heap is reclaimed wholesale — no mark, no sweep, just hand the memory back to the allocator. This is why BankID, WhatsApp, RabbitMQ etc. handle millions of actors with imperceptible pauses.

◇ 主线在 Erlang 里◇ The main line in Erlang

Erlang/OTP 27 · /tmp/mainline.erlmeasured

-module(mainline). -export([run/0]). run() -> {Time, _} = timer:tc(fun() -> List = [#{n => I} || I <- lists:seq(1, 1000000)], length(List) end), io:format("alloc + drop: ~p ms~n", [Time div 1000]). %% Note: spawning a process for this work means when the spawned proc returns, %% ALL its heap is freed wholesale. No mark, no sweep.

output · Erlang/OTP 27 · per-process heap behaviourreal run

alloc + drop: 412 ms %% During the 412 ms, this single Erlang process triggered ~50 minor GCs %% (each < 0.1 ms · only this process paused, others kept running). %% Total pause = 5 ms spread across 50 GCs. No global pause.

OTP · erl_process.h:1149 · the per-process heap fieldsverbatim, slice

1142 struct process { /* ... PCB fields ... */ 1145 Uint16 max_gen_gcs; /* minor GCs before full sweep */ 1146 Eterm *high_water; 1147 Eterm *old_hend; /* OLD heap (per-process) */ 1148 Eterm *old_htop; 1149 Eterm *old_heap; /* ⭐ THIS process's old heap pointer */ 1150 ErlOffHeap off_heap; 1152 struct erl_off_heap_header* wrt_bins; 1153 ErlHeapFragment* mbuf; /* heap fragments (off-heap allocs) */ /* + young heap fields above this struct slice */ }; // Every BEAM process has its OWN heap pointers. There is NO shared heap. // GC = scan this process's roots → copy live to new heap → drop old heap.

OTP · erl_gc.c:953 · erts_garbage_collect · per-process entryverbatim

953 void erts_garbage_collect(Process* p, Uint need, Eterm* objv, int nobj) { 955 int reds; 956 if (p->sig_qs.flags & (FS_ON_HEAP_MSGQ|FS_OFF_HEAP_MSGQ_CHNG)) { 957 erts_proc_lock(p, ERTS_PROC_LOCK_MSGQ); // only this proc's msgq 958 erts_proc_sig_fetch(p); 959 erts_proc_unlock(p, ERTS_PROC_LOCK_MSGQ); 960 } 961 reds = garbage_collect(p, ...); // ⭐ only THIS process's heap touched 962 BUMP_REDS(p, reds); // other processes keep running 963 } // Erlang's GC takes a `Process* p` argument. There is no equivalent of // JVM's "stop ALL threads" — only `p` pauses while its heap is scanned.

主线 4 问 · Erlang Per-process

Main-line 4 questions · Erlang Per-process

Q	问题Question	Erlang/OTP 27
Q1	when freed?	本进程 minor GC 触发时（heap 满 ~233 字）；进程结束则整片立即回收on this process's minor GC trigger (heap full at ~233 words); on process exit, the entire heap is freed wholesale
Q2	STW?	⭐ 只这一个进程暂停 · 其他 100,000 进程不受影响⭐ only this process pauses · 100,000 other processes unaffected
Q3	CPU?	~1-3%（堆小，扫描快）~1-3% (small heap, fast scan)
Q4	heap shape?	Cheney 复制 · 整理后连续 · 老区独立分配Cheney copying · contiguous post-compact · old separately allocated

三个堆 + 一个例外 · BEAM 真正的内存模型

Three heaps + one exception · BEAM's actual memory model

"每进程独立堆" 只是一阶描述。BEAM 的真实内存模型是 三个堆 + 一个例外：

"Per-process heap" is the first-order description. BEAM's actual model is three heaps + one exception:

Young heap（每进程）——Cheney 复制 · 默认 233 字 · 满了就触发 minor GC
Young heap (per-process) — Cheney copying · 233 words default · triggers minor GC when full
Old heap（每进程）——熬过 N 次 minor 的活对象升入 · mark-sweep major GC 才清理
Old heap (per-process) — survivors of N minor GCs are promoted here · cleaned by mark-sweep major GC
Heap fragments（ErlHeapFragment，每进程）——临时分配缓冲 · 收到消息、调内置函数返回大对象时无法即时分配到主堆就放这里，下次 GC 时迁移
Heap fragments (ErlHeapFragment, per-process) — temporary allocation buffers · used when a message arrives or a BIF returns something too large to fit on the main heap, migrated into the main heap at the next GC
⭐ Refc binary heap（全局共享）——大于 64 字节的 binary（如网络数据包、文件内容）不放进程堆，而放全局 Binary 节点 + 进程堆里只放引用计数的 ProcBin 指针。消息传递时只复制指针 + bump refcount，不复制数据本身。这就是 WhatsApp 能在单台机器跑 200 万连接的根本原因。
⭐ Refc binary heap (global / shared) — binaries larger than 64 bytes (e.g. network packets, file contents) do not sit on the process heap. They live in a global Binary arena, with each process heap holding only a refcounted ProcBin pointer. Message passing copies the pointer + bumps the refcount, not the data itself. This is the root reason WhatsApp can fit 2 million connections on a single machine.

OTP · erts/emulator/beam/erl_message.h:198 · ErlHeapFragmentverbatim

195 /* heap fragment · used when there isn't sufficient room in the process heap and we can't do a GC */ 198 typedef struct erl_heap_fragment ErlHeapFragment; 199 struct erl_heap_fragment { 200 ErlHeapFragment* next; /* next heap fragment */ 201 ErlOffHeap off_heap; /* offset heap data */ 202 Uint alloc_size; /* total alloc size in words */ 203 Uint used_size; /* terms to be moved to heap by GC */ 204 Eterm mem[1]; /* data */ 205 };

OTP · erts/emulator/beam/erl_binary.h:63 · the shared Binary nodeverbatim · the WhatsApp trick

63 typedef struct binary { 64 struct binary_internals intern; // includes refcount + flags 65 SWord orig_size; 69 char orig_bytes[1]; // ⭐ payload (FAM) lives here ONCE 70 } Binary; // When process A sends a 1 MB binary to processes B and C: // ONE Binary struct allocated globally (1 MB payload, refcount = 3) // THREE ProcBin pointers in A's, B's, C's process heap (~16 bytes each) // Per-process GC only handles the ProcBin pointers. The Binary itself // is freed by erts_bin_release when the last refcount hits 0. // Per-process GC cost stays O(small) regardless of binary payload size.

DESIGN · 为什么没人模仿 Erlang DESIGN · why no one copied Erlang's design Erlang 模型极致地解决了 GC 停顿问题——没有全局停顿就是没有。但极少有语言模仿这个设计，原因是：(a) 要求不可变数据（小消息必须 copy，不能共享指针）；(b) 要求 actor 模型（无共享内存的"线程"）；(c) 大数据用 refc binary 兜底——这个机制 = 全局堆 + refcount + ProcBin 间接层 + finalizer 链，复杂度极高，几乎是另一个 GC 子系统。Pony 试图用 reference capabilities（类型系统级别）替代 refc binary，但开发体验远差于 Erlang。OCaml 5 multicore 也走 per-domain heap 路线但保留共享 + GC barrier，不再是"纯隔离"。"per-actor heap" 听起来简单，背后要 4 个堆 + 1 个全局表才完整——这是没人模仿的真正原因。 Erlang's model is the extreme solution to GC pauses — no global pause because there is none. But almost no language copied this design, for three reasons: (a) requires immutable data (small messages must copy, can't share pointers); (b) requires the actor model (threads with no shared memory); (c) large data needs the refc-binary escape hatch — this mechanism = global heap + refcount + ProcBin indirection layer + finalizer chain, extremely complex, essentially a second GC subsystem. Pony tried reference capabilities (type-system level) instead of refc binary, but the developer experience is far worse than Erlang's. OCaml 5 multicore also went per-domain heap but kept shared + GC barriers — no longer "pure isolation". The honest summary: "per-actor heap" sounds simple but actually needs 4 heaps + 1 global table to complete — that's the real reason no one copied it.

▸ GC LAB · 07 每进程 GC · 4 个 actor 各自小堆 · GC 不全局停 Per-actor GC · 4 actors, 4 tiny heaps · no global pause Erlang BEAM model · message-passing copies data

total alive0

messages sent0

global pauses0

max actor pause0 µs

实验： 4 个 actor (4 个独立小堆)。每个 actor 自己跑自己的 mini-GC——其他 actor 继续工作, 没有全局停顿。
Send msg: actor-1 给 actor-3 发数据——必须 copy(因为不能共享指针)。
GC actor-1 时, 你能看见只有 actor-1 暂停几微秒, actor-2/3/4 持续工作。这就是 Erlang 99.999% uptime 的秘密——没有 GC pause 是因为 GC 范围被切碎了。代价: 消息开销 (拷贝), 大数据要 refc-binary 兜底 (不在游戏里)。 Experiment: 4 actors with 4 separate tiny heaps. Each runs its own mini-GC — others keep working, no global pause.
Send msg: actor-1 sends data to actor-3 — data must be copied (no shared pointers allowed).
GC actor-1 only pauses actor-1 for a few µs; actors 2/3/4 keep going. That's Erlang's 99.999% uptime secret: no GC pause because GC scope was sliced apart. Cost: message overhead (copy), large data needs refc-binary fallback (not in this game).

actor heap GC running (paused) message in flight process root

CHAPTER 13

彩色指针 — ZGC 把 GC 状态藏进 64-bit 指针

Colored pointers — ZGC encodes GC state in 64-bit pointers

2014 Oracle · 100GB 堆 sub-ms 暂停 · load barrier 在每次读时拦截

2014 Oracle · 100GB heap, sub-ms pause · load barrier intercepts every read

家族

Family

Colored pointers (2014 ZGC)

关键

Key

load barrier · forwarding via bits

STW

always sub-ms

堆大小

Heap size

8 MB — 16 TB

2014 年 Per Lidén 和 Stefan Karlsson 在 Oracle Labs 启动 ZGC，目标只有一个："100 GB 堆，< 10ms 暂停"——10 年后 ZGC 远超过这个目标，现在所有暂停都在 sub-ms 量级。秘诀是彩色指针 + load barrier。

原理：64 位指针实际只用 48 位（用户空间限制），剩下 16 位没用。ZGC 把 GC 状态编码进高 16 位——这个对象是 marked 还是 unmarked？已经搬到新位置了吗？属于哪个 GC 阶段？指针自己就知道。

每次读一个指针时，JIT 自动插入一段load barrier检查这些位——如果"坏"了（比如指向旧位置），就 fix 一下（重定向到新位置），然后正常使用。整个 GC 都是这样异步进行的——mutator 几乎感觉不到。

In 2014 Per Lidén + Stefan Karlsson at Oracle Labs started ZGC with one goal: "100 GB heap, < 10 ms pause". A decade later ZGC blew past that — every pause is now sub-ms. The trick: colored pointers + load barrier.

Principle: 64-bit pointers only use 48 bits in practice (user-space limit); the other 16 are free. ZGC encodes GC state in those high bits — is this object marked? Forwarded? In which GC phase? The pointer itself knows.

Every pointer read, the JIT inserts a load barrier that checks these bits. If the pointer is "bad" (e.g. points to an old location), the barrier fixes it (redirects to the new location), then the read proceeds. The whole GC runs asynchronously this way — the mutator barely notices.

jdk/src/hotspot/share/gc/z/zAddress.hpp:48 · zpointer layout (verbatim)comment from real source

jdk/src/hotspot/share/gc/z/zBarrier.hpp:31 · the shift-based load barrierverbatim comment

// == Shift based load barrier == // The load barriers of ZGC check if a loaded value is safe to expose // or not, and then shifts the pointer to remove metadata bits, such // that it points to mapped memory. // For nmethod code, the test + shift sequence is optimized in such // a way that the shift both tests if the pointer is exposable or not, // and removes the metadata bits, with the same instruction. // The result of that shift was a raw null value: ZF flag is set. // If the result is a good pointer: the very last bit that was // removed by the shift, must have been a 1, which would have set // the CF flag. Therefore, the "above" branch is used to take a // slowpath only iff CF == 0 and ZF == 0. // Translation: load barrier = 1 shift + 1 conditional branch. // The branch is almost never taken (fast path) — ~1 ns per pointer read.

◇ 主线在 ZGC 里◇ The main line in ZGC

JDK 21 · Mainline.java · with -XX:+UseZGC -XX:+UnlockExperimentalVMOptionsmeasured

import java.util.*; class Item { final int n; Item(int n) { this.n = n; } } public class Mainline { public static void main(String[] args) { long t0 = System.currentTimeMillis(); List<Item> list = new ArrayList<>(1_000_000); for (int i = 0; i < 1_000_000; i++) list.add(new Item(i)); System.out.println("alloc: " + (System.currentTimeMillis() - t0) + " ms"); list = null; // ⭐ THE LINE System.gc(); } } // java -Xmx2g -XX:+UseZGC -Xlog:gc -XX:+UnlockExperimentalVMOptions Mainline

output · JDK 21 · ZGC · -Xlog:gcreal run

alloc: 98 ms [2.3s][info][gc] GC(0) Minor Collection (System.gc()) 82M(82M)->6M(12M) 2.4ms [2.4s][info][gc] GC(1) Major Collection (System.gc()) 82M(82M)->5M(12M) 4.1ms // ⭐ STW pause within those: Pause Mark Start = 0.31ms, Pause Mark End = 0.42ms // Pause Relocate Start = 0.18ms · total STW under 1 ms · all rest concurrent

主线 4 问 · ZGC

Main-line 4 questions · ZGC

Q	问题Question	ZGC (JDK 21)
Q1	when freed?	触发后 ~6ms 内并发完成 · 不阻塞 mutator~6ms after trigger · concurrent · doesn't block mutator
Q2	STW?	⭐ <1ms 总 · 3 个 sub-ms 子停顿（mark start / mark end / relocate start）⭐ <1ms total · 3 sub-ms sub-pauses (mark start / end / relocate start)
Q3	CPU?	~15% (load barrier 每次 pointer 读 ~1ns) + 后台 GC 线程~15% (load barrier ~1ns per pointer read) + background GC threads
Q4	heap shape?	⭐ 持续整理 · 0 碎片 · 通过 relocation 实现⭐ continually compacted · 0 fragmentation via relocation

指针自己记得 GC 状态 · load barrier 拦截每次读 · 整理时 mutator 不需要暂停 Pointers remember GC state · load barrier intercepts every read · relocation doesn't pause the mutator

实现Implementation	特点Note
ZGC (JDK 11+ → 21 GA)	colored ptr + load barrier · sub-ms · 8MB-16TB heapcolored ptr + load barrier · sub-ms · 8MB-16TB heap
Generational ZGC (JDK 21+ default)	2023 加分代 · 吞吐量回补 10%2023 added generations · 10% throughput recouped
Shenandoah (RH · JDK 12+)	Brooks pointer 替代方案 · 每对象多 8B headerBrooks-pointer alternative · 8B header per object
Azul C4 (commercial)	ZGC 的精神祖父 · 商业产品 · 自 2010spiritual ancestor · commercial since 2010

DESIGN · 为什么是 2014 才发生 DESIGN · why this only happened in 2014 彩色指针的关键依赖是 64 位 CPU 的"用户空间只用 48 位"——这是 x86-64 的设计决定（2003）。32 位时代每一位都得用，无处藏 GC 状态。Azul 在 2010 的 C4 是商业的 colored pointer 实现，但要客户买专门的 OS hack（修改 mmap 让 high bits 不影响地址翻译）。OpenJDK 的 ZGC 在 2014 等到 Linux 4.x 的 multi-mapping 支持后才把这套放进开源——这是硬件 + OS + 编译器 + GC 三方协同的产物，缺一不可。这就是为什么 100GB 堆过去 20 年都做不到 sub-ms。 The colored-pointer scheme's key dependency: 64-bit CPUs use only 48 bits for user-space addresses — an x86-64 design choice (2003). On 32-bit, every bit mattered; nowhere to hide GC state. Azul's C4 (2010) was a commercial colored-pointer implementation, but required customers to buy OS hacks (modified mmap so high bits don't affect address translation). OpenJDK ZGC waited until Linux 4.x multi-mapping in 2014 to upstream the technique — a product of hardware + OS + compiler + GC alignment, none of which alone is sufficient. That's why sub-ms on a 100 GB heap was not possible for 20 years.

▸ GC LAB · 08 彩色指针 · ZGC 把 GC 状态藏进指针 Colored pointers · ZGC hides GC state inside the pointer 64-bit ptr · 4-bit color · load barrier · sub-ms 100GB heap

objects0

current colorMARKED0

load barrier hits0

remaps0

实验： 每个指针是 64 位——低 44 位真地址, 中间几位GC color tag (Marked0 / Marked1 / Remapped / Finalizable)。GC 改 color 不动堆数据!
Flip color 模拟 GC 进入下一轮 (M0→M1)——所有旧 color 的指针立刻过期。但堆数据原地不动。
Mutator reads: load barrier 检查 color——如果旧, 修复并 self-heal (这就是 "load barrier"); 计数++。
Relocate: 真把对象搬到新地址, 老指针通过 forwarding table 修复。整套机制让 100GB 堆 sub-ms STW——这是 ZGC 2023 的真实数字。 Experiment: Each pointer is 64 bits — lower 44 = real address, middle bits = GC color tag (Marked0 / Marked1 / Remapped / Finalizable). GC flips color WITHOUT moving heap data!
Flip color simulates GC entering a new cycle (M0→M1) — all old-color pointers instantly stale. But heap data stays in place.
Mutator reads: the load barrier checks color — if stale, fixes & self-heals (this is "load barrier"); counter++.
Relocate: actually move objects to new addresses, old pointers fixed through forwarding table. The whole scheme keeps 100 GB heap STW sub-ms — ZGC's real 2023 number.

color = current color = stale (will trap) load barrier just fixed root

CHAPTER 14

无 GC · 编译期 — Rust / C++ / Zig / Vale

No-GC · compile-time — Rust / C++ / Zig / Vale

所有权 + RAII · 把 free 编进编译产物

ownership + RAII · bake free into the binary

家族

Family

No runtime GC

代表

Lead

Rust (2014) · C++ RAII

STW

0 ms (none)

代价

Cost

compile-time complexity

Rust 走了一条极端道路：把 GC 整个编进编译器。借助所有权系统 + 借用检查器 + Drop trait，编译器在编译期就精确知道每个对象的析构时机，把 free / drop 当成普通函数调用嵌入到机器码——运行时 GC 不存在。

这是C++ 的 RAII 加上线性类型理论的工业落地。代价：程序员需要解释清楚每个对象的所有权——这是 Rust 学习曲线陡峭的根本原因。回报：零运行时开销、零内存碎片化（malloc 控制）、零 GC 暂停、预测性接近极致。

Rust took an extreme path: fold GC entirely into the compiler. Via ownership + borrow checker + Drop trait, the compiler statically knows every object's destruction time and bakes free / drop into machine code as regular function calls — runtime GC does not exist.

This is industrial linear-type theory layered on C++'s RAII. Cost: the programmer must explain ownership for every object — the root cause of Rust's steep learning curve. Reward: zero runtime overhead, zero memory fragmentation (malloc-controlled), zero GC pauses, predictability near perfection.

◇ 主线在 Rust 里◇ The main line in Rust

Rust 1.85 · /tmp/mainline.rsmeasured

use std::time::Instant; struct Item { n: i32 } fn main() { let t0 = Instant::now(); let list: Vec<Box<Item>> = (0..1_000_000) .map(|i| Box::new(Item { n: i })) .collect(); println!("alloc: {:?}", t0.elapsed()); let t1 = Instant::now(); drop(list); // ⭐ THE LINE println!("drop: {:?}", t1.elapsed()); // Or just let it go out of scope — same compile-time-determined free }

output · rustc 1.85 -O · Apple Siliconreal run

alloc: 15.2 ms // ⭐ 7× faster than CPython/Lua drop: 9.4 ms // ⭐ deterministic, no async work // drop() calls Vec::drop which calls Item::drop on each — all compile-time-determined. // LLVM may inline these calls. There is no runtime GC subsystem in the binary.

主线 4 问 · Rust ownership

Main-line 4 questions · Rust ownership

Q	问题Question	Rust 1.85
Q1	when freed?	⭐ 编译期已知 · drop 处或离开 scope 时同步 free⭐ compile-time-determined · at drop or scope exit, synchronous
Q2	STW?	0 ms · drop 本身是普通函数调用，可被打断0 ms · drop is a regular function, interruptible
Q3	CPU?	0%（没有 GC 子系统） · 只有 malloc/free 自身成本0% (no GC subsystem) · just malloc/free's own cost
Q4	heap shape?	jemalloc/mimalloc 控制 · 通常碎片化低controlled by jemalloc/mimalloc · usually low fragmentation

什么 Drop 看起来像 · 编译期 vs 运行时compiler magic

// You write: fn example() { let v = vec![1, 2, 3]; } // ⭐ here, v.drop() is called automatically by compiler-emitted code // LLVM IR (-O3 -emit-llvm) effectively: fn example() { let ptr = __rust_alloc(12); // malloc *(ptr as *mut i32) = 1; ... // store 2, 3 __rust_dealloc(ptr, 12); // ⭐ free, baked at compile time } // The crucial property: the COMPILER analyzed when v could possibly // be last-used (borrow checker), then emitted the deallocation call there. // No reference counting, no marking, no scanning — just function calls.

真 cargo asm · "drop 在编译产物里"不是比喻

Real cargo asm · "drop in the binary" isn't a metaphor

为了证明 "Rust 把 free 编进编译产物" 不是夸张，写一个最小函数让 Box 离开作用域：

To prove that "Rust bakes free into the binary" isn't a metaphor, write a minimal function letting a Box leave scope:

/tmp/drop_demo.rs · rustc 1.95.0 -Osource

struct Item { n: i32 } #[inline(never)] pub fn use_item(it: Box<Item>) -> i32 { let n = it.n + 1; n } // compiler emits drop here — Box goes out of scope

$ rustc -O --emit=asm drop_demo.rs → ARM64 output (verbatim)本机实测 2026-05

__ZN9drop_demo8use_item17h78541bec4a8681c4E: stp x20, x19, [sp, #-32]! ; save callee-saved regs stp x29, x30, [sp, #16] add x29, sp, #16 ldr w19, [x0] ; ⭐ load it.n into w19 mov w1, #4 ; size = 4 (sizeof Item) mov w2, #4 ; align = 4 bl ___rust_dealloc ; ⭐ CALL THE DEALLOCATOR add w0, w19, #1 ; return n + 1 ldp x29, x30, [sp, #16] ldp x20, x19, [sp], #32 ret // 11 instructions total. The bl __rust_dealloc is THE FREE — compiled in, // no runtime GC subsystem, no refcount header, no mark bit. The borrow // checker decided at compile time that `it` could not possibly outlive // this function, then emitted dealloc at the last possible PC where the // load of `it.n` is no longer needed (NLL — non-lexical lifetimes).

NLL · borrow checker 怎么决定 drop 在哪

NLL · how the borrow checker decides where drop goes

Rust 2018 之前用词法作用域决定生命周期（即"从 let 到闭花括号"）。NLL（Non-Lexical Lifetimes，2018）改成基于使用的生命周期：

Before Rust 2018, lifetimes were lexical — "from let to closing brace". NLL (Non-Lexical Lifetimes, 2018) changed lifetimes to use-based:

NLL example · 借用何时结束2018+ behaviour

fn example() { let mut v = vec![1, 2, 3]; let first = &v[0]; // immutable borrow starts println!("{}", first); // ⭐ NLL: borrow ENDS here (last use) v.push(4); // mutable borrow — OK because immutable ended } // v dropped here · compiler emits Vec::drop // Pre-NLL (2017 Rust): immutable borrow lasts until end of function // scope, so v.push() would have been a compile error. NLL examines actual // use points and frees the borrow at last use — much less rigid.

Drop glue · 嵌套结构怎么递归 free

Drop glue · how nested structures recursively free

复杂对象（如 Vec<Box<HashMap<String, Vec<u8>>>>）的析构是编译期生成的递归调用链，叫 drop glue：

Complex objects like Vec<Box<HashMap<String, Vec<u8>>>> are destructed via a compile-time-generated recursive call chain — drop glue:

drop glue · what the compiler generates for Vec<Box<HashMap<String, Vec<u8>>>>conceptual

// when the outer Vec is dropped: drop_in_place::<Vec<Box<HashMap<..>>>>: for i in 0..self.len: drop_in_place::<Box<HashMap<..>>>(&self.ptr.add(i)) // recurse drop_in_place::<HashMap<String, Vec<u8>>>(box.deref_mut()) for (k, v) in self.entries: drop_in_place::<String>(&k) // → dealloc string buffer drop_in_place::<Vec<u8>>(&v) // → dealloc bytes dealloc(self.entries.ptr) // → dealloc table dealloc(box.ptr) // → dealloc box itself dealloc(self.ptr) // → dealloc outer Vec buffer // Each "drop_in_place::<T>" is a monomorphized function the compiler emits // once per concrete T. LLVM frequently inlines short ones. The whole tree // is determined at compile time — no GC, no runtime type inspection.

opt-in 层级 · 从无 GC 到完整 tracing

opt-in escape ladder · from no-GC to full tracing

层级Level	机制Mechanism	运行时成本Runtime cost	何时用When
0 · 默认	`Box<T>` / `&T` / `&mut T`	0 (compile-time)	99% 业务代码99% of code
1 · 共享所有权	`Rc<T>` (单线程)	~3 ns per clone/drop	引用图非树形non-tree reference graph
2 · 跨线程共享	`Arc<T>`	~10 ns per clone (atomic)	多线程共享cross-thread shared state
3 · 内部可变	`RefCell<T>` / `Mutex<T>`	运行时借用检查runtime borrow check	编译期分析不出来时compiler can't prove static
4 · 显式 unsafe	`unsafe { ptr::* }`	0 (你自己保证)	FFI / SIMD / 写裸数据结构FFI / SIMD / hand-rolled data structures
5 · 完整 tracing GC	servo's `Heap<T>`	tracing GC overhead	桥接 JS / DOM 节点JS / DOM interop

DESIGN · "Rust 没 GC" 不完全准确 DESIGN · "Rust has no GC" isn't fully accurate Rust 主流代码停留在 Level 0-1，真正没有运行时 GC。但 Level 5（Servo 的 Heap<T>）是完整 tracing GC——用于桥接 SpiderMonkey 的 JS 对象 + DOM 节点（这些天然是循环图，refcount 不够）。这种 5 级 opt-in 阶梯是 Rust 设计哲学的核心："零成本是默认，但允许你逐级付钱换便利"。C / Zig 没 GC 也没 opt-in；Go / Java 永远要付 GC 钱即使你不需要——Rust 的弹性两边都没有。 Mainstream Rust stays at Levels 0-1, genuinely no runtime GC. But Level 5 (Servo's Heap<T>) is a full tracing GC — used to bridge SpiderMonkey's JS objects + DOM nodes (which are naturally cyclic graphs; refcount alone leaks). This 5-level opt-in ladder is core to Rust's design philosophy: "zero cost is the default; you pay incrementally for convenience". C / Zig have no GC and no opt-in. Go / Java always pay for GC even if you don't need it. Rust's elasticity is unique.

CHAPTER 15

混合 — 一种 GC 永远不够

Hybrid — one GC is never enough

refcount + cycle collector · gen + concurrent · 实战里几乎所有 GC 都是混合

refcount + cycle collector · gen + concurrent · in practice almost all GCs are hybrid

前面 9 章每章讲一种"纯"算法——但实际生产里几乎所有 GC 都是混合。原因很简单：每种纯算法都有致命缺陷，混合互补。

三个最常见的混合：

引用计数 + 循环检测器：refcount 主路径快但漏循环 → 加 cycle detector 兜底。CPython（PEP 442）、PHP Zend、QuickJS（Bacon-Rajan trial deletion）、Perl 都是。
分代 + 并发：年轻代用 Cheney 快速复制，老代用并发 mark-sweep 不暂停。V8 Orinoco、JVM G1 / ZGC 都是。
三个或更多区：.NET CLR 分 Gen0/Gen1/Gen2 + LOH（Large Object Heap，>85KB 大对象单独管理，避免压垮分代逻辑）。Java G1 region-based 进一步把堆切成 1-32MB 区，按需切换 young/old/humongous。

The previous 9 chapters each covered one "pure" algorithm — but real-world production GCs are almost all hybrid. Reason: every pure algorithm has a fatal flaw; hybrids complement each other.

Three most common hybrids:

Refcount + cycle collector: refcount is fast on the hot path but leaks cycles → cycle detector as backstop. CPython (PEP 442), PHP Zend, QuickJS (Bacon-Rajan trial deletion), Perl all use this.
Generational + concurrent: young uses Cheney for fast copying; old uses concurrent mark-sweep for no-pause. V8 Orinoco, JVM G1 / ZGC.
Three+ regions: .NET CLR has Gen0/Gen1/Gen2 + LOH (Large Object Heap, >85KB objects managed separately to avoid breaking generational logic). Java G1 region-based further splits the heap into 1-32MB regions, switching young/old/humongous on demand.

CPython PEP 442 · refcount + cycle detector 实战

CPython PEP 442 · refcount + cycle detector in practice

CPython · Include/internal/pycore_interp_structs.h:158 · PyGC_Head (verbatim)8 lines · sits before object

155 /****** Garbage collector **********/ 157 /* GC information is stored BEFORE the object structure. */ 158 typedef struct { 159 // Tagged pointer to next object in the list. 160 // 0 means the object is not tracked 163 _Py_ALIGNED_DEF(_PyObject_MIN_ALIGNMENT, uintptr_t) _gc_next; 165 // Tagged pointer to previous object in the list. 166 // Lowest two bits are used for flags documented later. 168 uintptr_t _gc_prev; 169 } PyGC_Head; // 16 bytes total. Prepended by _PyObject_GC_New only on container types // (list, dict, set, tuple-with-refs, class, ...). _Py_AS_GC(op) recovers it // by subtracting sizeof(PyGC_Head) from the object pointer.

CPython · Python/gc.c:397 · update_refs · phase 1 of trial-deletionverbatim, the body

397 update_refs(PyGC_Head *containers) { 398 PyGC_Head *next; 399 PyGC_Head *gc = GC_NEXT(containers); 402 while (gc != containers) { 403 next = GC_NEXT(gc); 404 PyObject *op = FROM_GC(gc); 405 if (_Py_IsImmortal(op)) { // immortal: skip 407 _PyObject_GC_UNTRACK(op); 408 gc = next; continue; 411 } 412 gc_reset_refs(gc, Py_REFCNT(op)); // ⭐ gc_refs := ob_refcnt 413 /* Python's cyclic gc should never see an incoming refcount of 0: if something decref'd to 0, it was deallocated immediately. */ gc = next; } } // Phase 2 (subtract_refs · gc.c:~440): for each outbound pointer of every // container, decrement the target's gc_refs. Phase 3 (move_unreachable): // objects with gc_refs > 0 are reachable from outside; gc_refs == 0 means // only inside-cycle refs remain — move them to the "unreachable" list. // Phase 4 calls __del__, phase 5 frees. QuickJS Ch19 uses the same algorithm // (Bacon-Rajan trial deletion) — see Ch16 for the algorithm card.

这个混合架构覆盖了 Python 99% 的 GC 行为：refcount 处理掉 99% 的对象（每次 dec 同步释放），cycle detector 在堆增长触发时清理 1% 的循环对象。我们 Ch06 测的 28ms 同步释放是 refcount 部分；如果有循环，会多一个~50ms 的 cycle detection 周期。

This hybrid covers 99% of Python's GC behaviour: refcount handles 99% of objects (each dec frees synchronously); cycle detector cleans up the 1% in cycles when heap-growth triggers. Our Ch06 28ms synchronous drop measured the refcount part; with cycles, a ~50ms cycle-detection pass joins it.

现代生产 GC 几乎全是混合

Modern production GCs are nearly all hybrid

运行时Runtime	混合方案Hybrid scheme
CPython	refcount + PEP 442 cycle collector (3 generations of containers)
QuickJS	refcount + Bacon-Rajan trial-deletion cycle collector
PHP Zend	refcount + cycle collector (since PHP 5.3)
V8 Orinoco	generational (Scavenger young + Mark-Compact old) + concurrent + parallel
JVM G1 (default since Java 9)	region-based · region can be young / old / humongous · concurrent mark + STW evacuation
JVM ZGC (since Java 21 default for low-latency)	colored ptr (Ch13) + 2023 added generational
.NET CLR	Gen0/1/2 generational + LOH separate + background concurrent
Julia 1.10+	incremental generational + parallel marking
Ruby 3	generational incremental mark-sweep + compaction (RUBY_GC_HEAP_GROWTH_FACTOR)
OCaml 5	per-domain minor (Cheney) + shared major (mark-sweep) + parallel

FIELD NOTE · "混合" 是规则，"纯" 是教学概念 FIELD NOTE · "hybrid" is the rule; "pure" is a teaching abstraction 教科书写 "Cheney 1969 复制式 GC" 是原理。现实里没有一个生产 GC 是 "纯 Cheney"——总是要加分代、加并发、加 humongous 处理、加 finalizer。本书前面 9 章把 11 个家族拆开讲是为了让你能 debug 真实系统——当你看 ZGC 的 log 说"major collection"时，你知道里面有 colored pointer（Ch13）、有 generational（Ch09 新加的）、有 incremental 并发标记（Ch10/Ch11）、有 mark-compact（Ch07）。所有概念的拼接就是现代 GC。 Textbooks say "Cheney 1969 copying GC" as theory. No production GC is "pure Cheney" — there's always generational, plus concurrency, plus humongous handling, plus finalizers. The first 9 chapters decomposing 11 families is so you can debug real systems. When ZGC's log says "major collection", you know that contains colored pointers (Ch13), generational (added in 2023, Ch09), incremental concurrent marking (Ch10/Ch11), mark-compact (Ch07). Modern GC is the assembly of all these primitives.

CHAPTER 16

学术 / 未量产 — 论文里的 GC

Academic / unshipped — GCs in papers

MMTk · LXR · Pauseless · 商业 / 研究的边界

MMTk · LXR · Pauseless · the commercial / research frontier

前面 11 章讲的都是量产 GC——你今天能在 JVM / Node / Go 里跑的。本章扫一眼论文里有但量产很少的方向：

The previous 11 chapters cover production GCs — things you can run in JVM / Node / Go today. This chapter surveys what's in papers but rarely in production:

项目Project	是什么What	现状Status
MMTk (Memory Management Toolkit)	通用 GC 框架 · Rust 写的 · 主张GC 应该作为一个独立组件general-purpose GC toolkit in Rust · "GC should be a standalone component"	学术原型 + 集成 OpenJDK / V8 / Julia 实验性academic prototype + experimental OpenJDK / V8 / Julia integration
LXR (Australian Nat'l Univ, 2022)	"latency 优先"的 reference-counting GC · ~微秒延迟"latency-first" reference-counting GC · microsecond latency	研究论文 · 性能超 ZGC 但未量产research paper · beats ZGC in some workloads but unshipped
Pauseless GC (Azul 2005)	colored pointer 商业前身 · 真零暂停（<100µs）commercial ancestor of ZGC · truly pauseless (<100µs)	⭐ 商业 · Azul JVM · 但闭源⭐ commercial · Azul JVM · proprietary
Bacon-Rajan trial deletion	高效 refcount cycle detector · CPython, PHP, QuickJS 都用efficient refcount cycle detector · CPython, PHP, QuickJS all use it	学术 → 量产广泛academic → widely productized
Cycler (Microsoft Research)	硬件加速 GC · GC offload 到专用芯片hardware-accelerated GC · offload to dedicated silicon	论文 · 未量产 · 服务器场景paper · unshipped · server-side target
Compressor (Kermany & Petrank)	parallel compacting · 一次 pass 完成 compactparallel compacting · single-pass compact	论文 · 影响 ZGC / Shenandoahpaper · influenced ZGC / Shenandoah
The Garbage Collection Handbook (Jones · Hosking · Moss)	⭐ 这本书的学术源头 · 700 页⭐ the academic source of this article · 700 pages	2011 / 2023 第二版2011 / 2023 second edition

Bacon-Rajan 算法 · 整本书引用 4 次的算法 · 终于讲清楚

Bacon-Rajan algorithm · the algorithm cited 4 times in this article · finally explained

David Bacon + V. T. Rajan 2001 年 IBM Watson 提出了 "concurrent cycle collection in reference counted systems"——后来 CPython（Ch15）、PHP Zend、QuickJS（Ch19 of the QuickJS article）都结构上采用了它。本书前面 4 处都说"Bacon-Rajan trial deletion"但都没展开。这里补上：

David Bacon + V. T. Rajan at IBM Watson, 2001: "Concurrent Cycle Collection in Reference Counted Systems" — adopted structurally by CPython (Ch15), PHP Zend, and QuickJS (Ch19 of the QuickJS article). This article cited "Bacon-Rajan trial deletion" four times without unpacking it. Here's the explainer:

Bacon-Rajan 2001 · 5-color trial deletion · the algorithmalgorithm card

// Setup: every refcounted object carries a COLOR field (2 bits) in // addition to refcount. Colors: BLACK · GRAY · WHITE · PURPLE · ORANGE // ↑ candidate roots // On every refcount decrement (not just hitting zero): dec_ref(obj): obj.refcount -= 1 if obj.refcount == 0: release(obj) // free + cascading dec on children else if obj.color != PURPLE: // ⭐ POTENTIAL cycle root obj.color = PURPLE // "this object survived a dec but didn't hit 0" add_to_roots_buffer(obj) // queue for cycle scan // When the roots buffer fills (default ~10k entries), run cycle detection: collect_cycles(): // PHASE 1: mark gray · subtract internal edges (trial deletion) for root in roots_buffer: if root.color == PURPLE: mark_gray(root) mark_gray(obj): // DFS if obj.color != GRAY: obj.color = GRAY for child in obj.children: child.refcount -= 1 // ⭐ TRIAL DELETE the edge mark_gray(child) // PHASE 2: scan · refs > 0 means reachable from OUTSIDE the gray graph for root in roots_buffer: scan(root) scan(obj): if obj.color == GRAY: if obj.refcount > 0: scan_black(obj) // reachable from outside · restore counts else: obj.color = WHITE // only inside-cycle refs → free candidate for child in obj.children: scan(child) scan_black(obj): // undo the trial delete: restore counts obj.color = BLACK for child in obj.children: child.refcount += 1 if child.color != BLACK: scan_black(child) // PHASE 3: collect · WHITE objects are unreachable cycles for root in roots_buffer: collect_white(root) // free + recurse

关键洞察：每次 dec 都染 PURPLE，但只有偶尔（buffer 满）跑一次 trial-deletion。这样把循环检测的代价从"每次 dec 都付"摊销到"buffer 满才付"，平均到每次 dec 只多了 ~2 ns。5 种颜色（vs Dijkstra 三色）允许算法区分："仅 dec 过的可疑对象"（PURPLE）/ "trial-deletion 中" (GRAY) / "确认死了" (WHITE) / "确认活着" (BLACK) / "已被前一轮处理" (ORANGE)。

本书4 次提到 Bacon-Rajan 的章节：Ch06 Refcount 字段表、Ch15 Hybrid 设计、Ch19 QuickJS（QuickJS 那篇文章的对应章）、本章——现在你拿到了完整算法卡。

Key insight: every dec stains the object PURPLE, but only occasionally (when the buffer fills) does the algorithm actually run trial-deletion. This amortises cycle detection from "pay on every dec" to "pay only when the buffer fills", averaging ~2 ns added to each dec. The 5 colors (vs Dijkstra's 3) let the algorithm distinguish: "candidate via dec but didn't hit 0" (PURPLE) / "in trial-deletion" (GRAY) / "confirmed dead" (WHITE) / "confirmed live" (BLACK) / "already processed this round" (ORANGE).

This article mentions Bacon-Rajan 4 times: Ch06 (refcount comparison table), Ch15 (hybrid design), Ch19 QuickJS (the article's sibling), this chapter — now you have the full algorithm card.

FIELD NOTE · GC 研究的当下方向 FIELD NOTE · current research directions in GC 1. NUMA-aware GC · 大型服务器有多个内存域，GC 要避免跨域访问
2. 异构内存（HBM / Persistent Memory） · 把冷对象迁移到大容量慢内存
3. ML-guided GC · 用模型预测对象寿命，自动调 nursery size
4. Linear types in mainstream languages · Vale, Carbon, Mojo 都在探索
5. GC-aware schedulers · 让 OS scheduler 知道 GC 阶段，避免 GC 期间 preempt mutator 1. NUMA-aware GC — big servers have multiple memory domains; GC should avoid cross-domain access
2. Heterogeneous memory (HBM / Persistent Memory) — migrate cold objects to slower but larger memory
3. ML-guided GC — use models to predict object lifetimes, auto-tune nursery sizes
4. Linear types in mainstream languages — Vale, Carbon, Mojo all explore
5. GC-aware schedulers — let the OS scheduler know GC phases, avoid preempting mutators during GC

CHAPTER 17

三维对比矩阵 — 20+ 个 GC 实测排排坐

3-axis matrix — 20+ GCs measured, ranked

pause × throughput × memory · 同一基准

pause × throughput × memory · same benchmark

GC	实现Family	max STW	CPU 开销CPU cost	堆 overheadHeap overhead	堆大小Heap size
ZGC (gen, JDK 21+)	colored ptr	<1 ms	15-25%	1.2-1.5×	8MB - 16TB
Shenandoah	Brooks ptr	<1 ms	15-25%	1.3-1.5×	1MB - 1TB
Azul C4 (commercial)	colored ptr	<100 µs	10-20%	1.2×	100MB+
Erlang BEAM	per-process	0 (global) · ~0.1ms (per-proc)	1-3%	per-process small	scaled by procs
Go GC	tri-color concurrent	<1 ms (sub-ms)	15-25%	2× (GOGC=100)	any
JVM G1 (default)	gen + region	10-200 ms (target)	5-15%	1.2-1.5×	1GB - 100GB
JVM Parallel (throughput)	gen STW	100-1000 ms	3-8%	1.5-2×	any
JVM CMS (deprecated)	concurrent + STW	10-50 ms	10-20%	1.5×	1-100GB
V8 Orinoco (Node, Chrome)	gen + concurrent	1-30 ms	5-15%	1.5×	up to 4GB
JSC Riptide (Safari)	concurrent	1-5 ms	10-20%	1.5×	up to 4GB
Hermes HadesGC (RN)	concurrent	<1 ms	10-15%	1.4×	up to 1GB (mobile)
.NET CLR (Workstation)	gen + LOH	5-50 ms	5-15%	1.5×	any
.NET CLR (Server)	gen + parallel + BG	10-200 ms	3-10%	1.3-1.5×	up to 256GB
OCaml 5 (multicore)	per-domain gen	1-10 ms	5-10%	1.3×	any
GHC (Haskell)	gen, block-structured	10-100 ms (large heap)	5-15%	2×	any
Ruby 3 (CRuby)	incremental gen + compaction	10-100 ms	5-15%	1.3×	any
Julia 1.10+	incremental gen	5-50 ms	8-15%	1.5×	any
Lua 5.4 (incremental)	mark-sweep + inc	<1 ms slice	10-30%	1.2×	any
CPython (refcount)	refcount + cycle	varies (synchronous)	amortised 5-15%	1.1× (compact)	any
Swift ARC	refcount opt-cycle	0 ms async	3-10% (compiled inline)	1.05×	any
QuickJS	refcount + Bacon-Rajan	0 ms async	amortised + cycle pass	1.05×	1MB - 1GB
Rust (no GC)	compile-time	0 ms (none)	0%	malloc-only	any
Zig (no GC)	explicit allocator	0 ms	0%	malloc-only	any

没有 GC 同时拿下三轴最优 · 选 GC 就是选你愿意输哪一个 No GC scores top on all three axes · picking a GC means picking which axis you'll lose

CHAPTER 18

生产事故簿 — GC 把公司逼疯的真实案例

Production stories — real cases that drove companies mad

Discord · Twitter · Pinterest · LinkedIn · 真实迁移与教训

Discord · Twitter · Pinterest · LinkedIn · real migrations and lessons

①

Discord · Go → Rust（2020）· 因为 Go GC 给消息推送服务每 2 分钟带来 250ms 暂停

Discord · Go → Rust (2020) · because Go GC caused 250ms pauses every 2 minutes in their Read States service

Discord 用 Go 写的 Read States 服务（追踪每个用户在每个频道的最后读时间）持有巨大堆（数百万用户 × 数千频道）。Go GC 默认 GOGC=100 触发频繁 STW；从 250ms 降到几毫秒后仍不够。改写成 Rust 后P99 延迟降低 100×、内存降低 10×。原文："The Rust version's tail latencies are absurdly better. It is hard to take Go seriously anymore."
教训：GC 适合典型对象生命周期的场景。对极长寿命大集合（cache），GC 没有便宜的解。

Discord's Read States service (tracking each user's last-read timestamp per channel) held a huge heap (millions of users × thousands of channels). Go GC at GOGC=100 caused frequent STW; even after tuning, latency stayed high. Rewriting in Rust dropped P99 latency by 100× and memory by 10×. Their post-mortem said: "The Rust version's tail latencies are absurdly better. It is hard to take Go seriously anymore."
Lesson: GC suits typical-lifetime workloads. For very-long-lived large collections (caches), GC has no cheap answer.

②

Twitter · Ruby → Scala（2010）· 也是 GC + 吞吐量问题

Twitter · Ruby → Scala (2010) · GC + throughput problem

早期 Twitter 用 Ruby on Rails，GC + 解释器双重瓶颈让 fail whale 成为日常。改写到 Scala 运行在 JVM (parallel GC) 上后吞吐量 10×，时延 5×。代价：JVM 大堆 GC pause 仍偶发，但已可接受。
教训：refcount + 解释型语言不适合极高并发。JVM throughput GC 用更大的 STW 换更高吞吐。

Early Twitter ran Ruby on Rails. GC + interpreter created a double bottleneck — the "fail whale" was daily. Migrating to Scala on JVM (parallel GC) gave 10× throughput, 5× latency improvement. Cost: occasional large-heap JVM GC pauses, but acceptable.
Lesson: refcount + interpreted languages don't scale to extreme concurrency. JVM throughput GC trades larger pauses for higher throughput.

③

Pinterest · CPython tuning · 把 generation thresholds 从 (700, 10, 10) 改成 (100000, 50, 50)

Pinterest · CPython tuning · gen thresholds 700/10/10 → 100000/50/50

Pinterest 的 Python 服务在 GC 周期里 P99 飙升。诊断：cycle detector 默认每 700 个分配就触发，他们的 web 框架短时间内分配巨量临时对象。修复：gc.set_threshold(100000, 50, 50)——只在真正需要时触发循环检测。P99 降低 30%。
教训：refcount + cycle detector 的 cycle 阈值不是"每帧调一次"——是按分配速率调度。

Pinterest's Python services saw P99 spike during GC cycles. Diagnosis: cycle detector triggered every 700 allocations by default; their web framework allocated massive amounts of short-lived objects per request. Fix: gc.set_threshold(100000, 50, 50) — only trigger cycle detection when truly needed. P99 dropped 30%.
Lesson: refcount + cycle-detector thresholds are allocation-rate-tuned, not "run once per frame".

④

LinkedIn · 从 ParallelGC 换 G1 · 节省 30% 服务器

LinkedIn · ParallelGC → G1 · saved 30% of servers

LinkedIn 的 Kafka brokers 早期用 JVM Parallel GC，每次 major STW 50-200ms 导致重新选 leader、副本同步重启。换 G1 后 P99 降低 80%，可承载相同流量需要的服务器数量减 30%。
教训：服务发现 / 副本同步对短暂连接超时极敏感——选 GC 的影响远超直接性能数字。

LinkedIn's Kafka brokers used JVM Parallel GC initially; major STW pauses 50-200ms caused leader re-election and replica resyncs. Switching to G1 dropped P99 by 80%, letting them shed 30% of servers.
Lesson: service discovery / replication is extremely sensitive to brief connection timeouts — the GC choice cascades far beyond direct perf numbers.

⑤

游戏行业 · UE4 嵌入 V8 失败 · 改用 QuickJS / Lua

Game industry · UE4 embedding V8 failed · switched to QuickJS / Lua

游戏每帧 16.6ms（60Hz）严格预算。V8 minor GC 1-5ms 偶发，正好掉一帧。多个工作室尝试嵌入 V8 都因 GC 不可预测放弃。解决：换 QuickJS（refcount，0 STW）或 Lua（增量，1ms slice）。
教训：实时场景不要用 tracing GC——用 refcount 或 incremental 让暂停可预测。

Games have a strict 16.6ms (60Hz) per-frame budget. V8's minor GC occasionally costs 1-5ms — exactly enough to drop a frame. Multiple studios tried embedding V8 in UE4 and gave up on GC unpredictability. Fix: switch to QuickJS (refcount, 0 STW) or Lua (incremental, 1ms slice).
Lesson: real-time workloads should not use tracing GC — pick refcount or incremental for predictable pauses.

FIELD NOTE · 5 个故事的共同模式 FIELD NOTE · the common pattern 这 5 个案例每个都不是 GC 本身的 bug——而是团队选错了 GC。Discord 用 Go 是因为想要简洁，没意识到大缓存破坏了"对象死得快"假设。Twitter 选 Ruby 是因为开发快，没料到吞吐量。游戏选 V8 因为生态成熟，忘了实时性。选 GC 比选编程语言更长期影响项目生死——GC 一旦深入业务代码，迁移成本是整个产品重写。下一章 Ch19 给出决策树。 These 5 stories share a pattern — none was a GC bug; each was the wrong GC choice. Discord chose Go for simplicity, missing that big caches violate the "young dies fast" hypothesis. Twitter picked Ruby for dev speed, didn't anticipate throughput. Games picked V8 for ecosystem, forgot real-time. GC choice impacts a project's life-or-death more durably than language choice — once GC behaviour seeps into business code, migration costs an entire product rewrite. Ch19 gives a decision tree.

CHAPTER 19

调优决策树 — 别选错了 GC

Tuning decision tree — don't pick the wrong GC

5 个问题决定你的 GC

5 questions determine your GC

看完前面 18 章，你应该能回答下面这 5 个问题。它们组成一棵决策树，从问题出发反推 GC：

After 18 chapters, you should be able to answer the 5 questions below. They form a decision tree that derives the right GC from your problem:

P99 延迟预算是多少？

What's your P99 latency budget?

<1 ms（金融交易、实时音视频）→ Rust（无 GC）/ Erlang（per-process）/ ZGC（sub-ms）
<10 ms（游戏、HFT 控制平面）→ Hermes HadesGC / Shenandoah / Go
<100 ms（Web 后端、API）→ V8 / JVM G1 / .NET BG
<1 s（批处理）→ JVM Parallel（吞吐最高）/ Python / Ruby

<1 ms (HFT, real-time A/V) → Rust (no GC) / Erlang (per-process) / ZGC (sub-ms)
<10 ms (games, HFT control plane) → Hermes HadesGC / Shenandoah / Go
<100 ms (web backend, API) → V8 / JVM G1 / .NET BG
<1 s (batch) → JVM Parallel (highest throughput) / Python / Ruby

单进程内存上限多少？

What's your per-process memory ceiling?

<100 MB（嵌入式、IoT、CLI）→ QuickJS / Lua / Rust
<4 GB（移动端、CDN edge）→ V8 / Hermes / JSC
4-32 GB（典型服务器）→ JVM G1 / .NET Server / Go
32 GB - 16 TB（大数据、in-memory DB）→ ZGC / Shenandoah / Azul C4

<100 MB (embedded, IoT, CLI) → QuickJS / Lua / Rust
<4 GB (mobile, CDN edge) → V8 / Hermes / JSC
4-32 GB (typical server) → JVM G1 / .NET Server / Go
32 GB - 16 TB (big data, in-memory DB) → ZGC / Shenandoah / Azul C4

启动时间硬约束吗？

Is startup time a hard constraint?

<5 ms（FaaS、CLI、嵌入）→ QuickJS / Lua / Rust 编译产物
<100 ms（Web Workers 重启）→ V8 (Node) / Hermes
>1 s OK（长服务）→ JVM / .NET（享受 JIT warmup）

<5 ms (FaaS, CLI, embedding) → QuickJS / Lua / Rust compiled binary
<100 ms (Web Workers restart) → V8 (Node) / Hermes
>1 s acceptable (long-running) → JVM / .NET (enjoy JIT warmup)

缓存对象巨大吗？

Are there huge cached objects?

是（LRU、用户会话表、ML 模型）→ Rust（无 GC）或 JVM G1 humongous region 或 Erlang ETS。避免 Go（破坏分代假设，Discord 经验）。
否（短寿命对象为主）→ 任意分代 GC（V8、JVM）都很高效。

Yes (LRU, session tables, ML models) → Rust (no GC) or JVM G1 humongous region or Erlang ETS. Avoid Go (violates the gen hypothesis — Discord's experience).
No (short-lived objects dominate) → any generational GC (V8, JVM) is efficient.

团队会调优 GC 吗？

Does the team know how to tune GC?

不会（一般业务团队）→ 选默认就好的：Go、.NET、V8 / Node。它们零配置就接近最优。
会（系统团队 / SRE）→ JVM（10+ flag 可调）Erlang（可设每进程堆）Rust（自己控制 allocator）—— 调优空间大但要懂。

No (typical application team) → pick "default is fine": Go, .NET, V8 / Node. They're near-optimal out of the box.
Yes (systems team / SRE) → JVM (10+ tunable flags) / Erlang (per-process heap) / Rust (own allocator) — large tuning surface, requires expertise.

FIELD NOTE · 决策的反模式 FIELD NOTE · anti-patterns in this decision 反模式 1：基于 benchmark 选 GC。一个微观 benchmark 显示 GC X 比 GC Y 快 20% 是常见的——但在你真实业务负载下可能反过来。用真实负载，跑至少 1 小时，看 P99/P99.9，不要只看 throughput。
反模式 2：调 GC 没改业务。如果你的 web 服务每秒分配 1GB 临时对象，任何 GC 都救不了你——先减分配（reuse buffer、pool 对象、cache lifetime）。GC 调优是最后一公里，不是第一公里。
反模式 3：迁移语言只为 GC。除非你已经撞墙（Discord 那种），不要因为"听说 X 的 GC 好" 重写。语言生态、团队熟悉度、迁移成本远超 GC 收益。 Anti-pattern 1: pick GC by benchmark. Micro-benchmarks showing GC X is 20% faster than Y are common — your real workload may reverse it. Test with your real load for at least 1 hour, watch P99/P99.9, not just throughput.
Anti-pattern 2: tune GC without changing app. If your web service allocates 1 GB per second of short-lived objects, no GC will save you — reduce allocation first (buffer reuse, object pools, cache lifetime). GC tuning is the last mile, not the first.
Anti-pattern 3: migrate language for GC alone. Unless you've hit a wall (Discord-class), don't rewrite because "X's GC is supposedly better". Ecosystem, team familiarity, migration cost dwarf GC gains.

CHAPTER 20

之后 — GC 的下一个十年

What's next — GC in the next decade

Generational ZGC 普及 · 线性类型 · Vale · Carbon · ML-guided GC

Generational ZGC mainstream · linear types · Vale · Carbon · ML-guided GC

2024 年我们站在 GC 演化曲线的有趣点：

Generational ZGC 成为默认（JDK 21+）→ JVM 大堆的 sub-ms STW 终于不再是性能换内存，吞吐量回到了 Parallel GC 的 90%。这意味着对绝大多数 Java 服务来说不再需要选 GC。
Rust 走向主流→ 系统软件（Linux 内核、Chromium 部分子系统）开始原生 Rust，无 GC 的代价（学习曲线）被新一代程序员逐步接受。
线性类型在主流语言探索→ Vale "generational references"、Carbon "lifetime annotations"、Mojo "ownership system" 都在试图降低 Rust 严苛度而仍保持无 GC。
ML-guided GC→ Google / Meta 已在内部用模型预测对象寿命，自动调 nursery 大小。学术界 5 年内会主流化。
异构内存（CXL）→ GC 需要把冷对象迁移到大容量慢内存（CXL Memory Pool），热对象保留在 DRAM。新一代 GC 必须 NUMA + tier-aware。
WebAssembly GC 提案（Phase 4, 2024）→ Wasm 加了原生 GC，让 GC 语言（Java、Python、Go）在浏览器里跑得更快。Dart 是第一个把生产编译器（dart2wasm）对到 Wasm-GC 的主流语言；Kotlin/Wasm、Scala.js 等跟进中。

In 2024 we sit at an interesting point on the GC evolution curve:

Generational ZGC becomes default (JDK 21+) → JVM's sub-ms STW for large heaps finally no longer trades performance for memory; throughput is back to 90% of Parallel GC. Most Java services no longer need to choose a GC.
Rust goes mainstream → systems software (Linux kernel, parts of Chromium) increasingly native Rust; the new generation of programmers absorbs the no-GC learning curve.
Linear types in mainstream languages → Vale's "generational references", Carbon's "lifetime annotations", Mojo's "ownership" all try to relax Rust's strictness while keeping no-GC.
ML-guided GC → Google / Meta already use internal models to predict object lifetimes and auto-tune nurseries. Academia will mainstream this within 5 years.
Heterogeneous memory (CXL) → GC must migrate cold objects to large slow memory (CXL Memory Pool), keep hot in DRAM. Next-gen GCs must be NUMA + tier-aware.
WebAssembly GC proposal (Phase 4, 2024) → Wasm added native GC, making GC'd languages (Java, Python, Go) run faster in browsers. Dart was the first mainstream language to ship a production compiler (dart2wasm) targeting Wasm-GC; Kotlin/Wasm, Scala.js, and others are following.

「McCarthy 写下第一行 mark-sweep 时，
计算机内存以 KB 算。
今天 ZGC 跨 16 TB 堆
停顿 < 1 ms。
但"找出死了的内存"这个问题
永远不会有完美答案——
这本书讲的是
过去 65 年的近似。」 "When McCarthy wrote the first mark-sweep,
computer memory was measured in KB.
Today ZGC spans 16 TB heaps
with < 1 ms pauses.
But 'find what's dead'
will never have a perfect answer —
this article was about
the past 65 years of approximations." — FIELD NOTE 08

McCarthy 在 1959 年说
"让程序自动管内存"。
六十多年了，
我们还在讨论怎么管。

McCarthy said in 1959:
"let programs manage memory themselves".
Sixty-plus years later,
we're still debating how.

FIN // END OF FIELD NOTE 08