The same line list = null behaves completely differently across 20+ runtimes — CPython's refcount frees immediately, V8 Orinoco waits for the next minor GC, JVM ZGC handles it concurrently in sub-ms, Erlang BEAM treats garbage as not-a-thing, and Rust freed it at compile time. This is a cross-language GC field map, from 1959 McCarthy mark-sweep to 2023 Generational ZGC, organised by algorithm family, threaded by one mainline code snippet.
Every GC implementation does the same thing: find memory the program can no longer reach, return it to the allocator. The only difference is how they find it and when. Everything that follows builds on three counter-intuitions:
GC has no idea what you intend to keep — it only checks "can it still be reached from a root via the pointer graph?" You set x = null but cache still holds x? Not dead. You forgot about it months ago, but a closure references it? Not dead. Unreachability is the only truth for every GC.
Only refcount (CPython, Swift, PHP, Rust Rc) genuinely frees on the same line as x = null. Every tracing GC — V8, Go, JVM, .NET, Erlang — is lazy: actual free may happen 100ms later, or an hour later. That latency is the source of every throughput advantage.
③
GC 是运行时对程序员的"违约权"
GC is the runtime's "licence to default"
你写 let x = ...。运行时背地里跟自己说"等我有空再处理"。STW 那 50ms 就是运行时临时收回控制权。所有 GC 的争议都是关于这个"违约权"——多久能行使一次(pause)、违约多长(max pause)、有没有提前通知(incremental/concurrent)、能不能买断(no-GC ownership)。
You wrote let x = .... Behind your back the runtime tells itself "I'll deal with this when I'm free". The STW pause is the runtime temporarily taking control back. Every GC debate is really about this default licence — how often (frequency), how long (max pause), whether you get warning (incremental/concurrent), whether you can buy out (no-GC ownership).
「GC 不是一种技术,GC 是一种契约。 21 世纪的运行时设计基本就是这份契约的不同版本。」"GC isn't a technology — GC is a contract. 21st-century runtime design is mostly about different versions of that contract."
CHAPTER 02
六十年家谱 — McCarthy 1959 到 Generational ZGC 2023
60-year family tree — McCarthy 1959 to Generational ZGC 2023
每个被采纳的算法都是某个问题的回应
every adopted algorithm is an answer to a specific pain
年Year
事件Event
人物People
解决的问题Pain solved
1959
Mark-Sweep 在 Lisp 1.5 里发明
John McCarthy
"我们手写 free 写疯了""we keep forgetting to free in Lisp"
1963
引用计数(Collins)独立提出
George Collins
"Mark-Sweep 的全堆扫描太慢""mark-sweep's full-heap walk too slow"
1969
Cheney 复制算法
C. J. Cheney
"mark-sweep 留下太多碎片""mark-sweep fragments"
1978
三色不变式
Dijkstra · Lamport
"怎么让 GC 边跑边收?""how to collect concurrently?"
1984
分代假设
Henry Lieberman · Carl Hewitt
"大多数对象很快死""most objects die young"
1988
Boehm 保守式 GC(给 C/C++)
Hans-J. Boehm
"没类型信息也能 GC""GC without type info"
1995
Java 1.0 · 第一代 JVM GC
Sun
"让 C++ 程序员不用 free""give C++ devs freedom from free"
2004
G1 GC 发表(论文)
Detlefs · Flood · Heller · Printezis
"大堆 STW 不可接受""large-heap STW unacceptable"
2009
Go 1.0 · 三色并发 mark-sweep
Google · Rick Hudson
"服务器 ms 级停顿""server-grade ms pauses"
2010
Swift ARC(前身 Objective-C 2.0 ARC, 2011 公开)
Apple · Chris Lattner
"移动设备没 GC 预算""mobile has no GC budget"
2014
ZGC 项目启动(Oracle Labs)
Per Lidén · Stefan Karlsson
"100GB 堆要 <10ms 停顿""100GB heap, <10ms pause"
2014
Rust 1.0-alpha · 无 GC 所有权系统
Mozilla · Graydon Hoare
"完全把 GC 写进编译器""GC subsumed by the compiler"
2015
Go 1.5 · sub-ms STW
Rick Hudson · Austin Clements
"实时 GC 是可能的""real-time GC is possible"
2018
Shenandoah GC 进入 OpenJDK
Red Hat · Roman Kennke
"ZGC 的并发整理替代品""alt-take on concurrent compaction"
2020
Hermes HadesGC GA
Meta
"RN 移动端的 sub-ms GC""sub-ms GC for React Native"
2022
OCaml 5.0 · 多核并行 GC
KC Sivaramakrishnan · OCaml team
"函数式语言终于真并行""functional languages finally go truly parallel"
2023
Generational ZGC 上线(JDK 21)
Oracle
"ZGC 没分代,吞吐量留了 10%""non-gen ZGC left 10% throughput on the table"
2024
Carbon / Vale · 实验性"线性 GC"
various
"Rust 太严,让我们多一点 GC""Rust too strict, ease back with linear types"
FIELD NOTE · 算法不是被发明,是被需要FIELD NOTE · algorithms aren't invented, they're demanded这张表每一行都对应一种工程痛点。McCarthy 不是闲着发明 mark-sweep——他写 Lisp 1.5 受不了 cons 节点泄漏。Lieberman 1984 提分代不是数学美感——他在 MIT Lisp Machine 上调试了 6 个月发现 90% 对象不到 1 秒就死。ZGC 不是性能竞赛——是 Oracle 内部 Cassandra 集群 100GB 堆 30 秒 STW 让交易系统真停摆。下一章看每个家族解决的具体痛点。Every row in this table maps to a specific engineering pain. McCarthy didn't invent mark-sweep for kicks — he was bleeding from leaked cons nodes in Lisp 1.5. Lieberman 1984 didn't propose generational for mathematical elegance — he spent 6 months debugging on an MIT Lisp Machine and found 90% of objects died within 1 second. ZGC wasn't a benchmark race — it was Oracle's internal Cassandra clusters with 100GB heaps causing 30-second STW that froze trading systems. The next chapter shows the specific pain each family solves.
CHAPTER 03
11 个家族 · 11 句话讲完
11 families · 11 sentences total
先建立心智地图,后面 11 章每章一个
build the mental map first; 11 chapters fill in each one
Ch
家族Family
一句话In one sentence
代表实现Examples
06
引用计数Refcount
每对象记"谁指我",归零立刻 freeeach object counts pointers, free at 0
CPython · Swift · PHP · Rust Rc
07
Mark-Sweep
从 root 染色全堆,把没染上的 freewalk from roots, sweep the unmarked
Lua · Boehm · early V8 major
08
复制 · CheneyCopying · Cheney
活对象搬到新堆,旧堆整体丢copy live objects to new heap, drop old heap entirely
OCaml minor · GHC · V8 young
09
分代Generational
年轻区频繁回收(90% 对象都死在这)collect the young area often (90% dies here)
V8 Orinoco · JVM G1 · Ruby 3
10
增量Incremental
一次 GC 拆成 N 小步穿插在用户代码间break one GC into N small steps interleaved with user code
Lua 5.0+ · early JVM CMS
11
并发 · 三色Concurrent · tri-color
GC 在独立线程里跑,靠 write barrier 保证不漏GC on its own thread, write barriers preserve invariants
Go · JSC Riptide · Hermes HadesGC
12
区域 · 每进程Region / per-actor
每个进程/区域独立 GC,进程死整片 freeper-process / per-region GC, dies as a whole
Erlang BEAM · Pony · region typing
13
彩色指针Colored pointers
在 64-bit 指针的高位编码 GC 状态,load barrier 拦截encode GC state in high pointer bits, load barriers intercept
ZGC · Shenandoah · Brooks 1984
14
无 GC · 编译期No-GC · compile-time
所有权 + RAII 把 free 编进编译产物ownership + RAII bake free into the binary
No GC has ever scored full marks on all five. Picking a GC means picking which one you're willing to lose. Ch19's tuning decision tree turns these 5 axes into actionable paths — but first you have to internalise "no silver bullet" as the background colour.
Where GC starts walking from. Includes: local vars on stack frames, object pointers in registers, globals, JIT-baked constants, thread-local, finalizer queue. "Reachable" = a directed path from a root via pointers.
A small piece of code the compiler injects at every pointer write (or read). Purposes: (a) incremental/concurrent GCs catch "a black object now points to a white" invariant violations (Ch11 tri-color); (b) ZGC's load barrier checks every pointer read and may forward. Every concurrent GC pays here — 1-3 ns per pointer op.
All objects coloured: white (unvisited, candidate to free), grey (visited but outgoing edges not all traced), black (self + all outgoing done). Invariant: black must not directly point to white (else white gets missed). Every concurrent GC uses write barriers to enforce this.
A program point at which the compiler knows where every pointer is. STW = "wait for all threads to reach a safepoint". Tight numeric loops lack safepoints; GC can wait ms for entry — a common cause of JVM long-pauses.
⑤
Forwarding pointer
Forwarding pointer
复制 GC(Ch08)+ 整理 GC(ZGC)必备。对象从位置 A 搬到位置 B 后,A 处放一个指向 B 的转发指针,之后任何还指着 A 的旧引用都被自动重定向到 B。这是所有 compacting GC 的核心——也是为什么 ZGC 要 colored pointer:把"已转发"标志编进指针。
Required by copying GC (Ch08) and compacting GC (ZGC). When an object moves from A to B, A holds a forwarding pointer to B; any old reference to A gets transparently redirected. The heart of every compacting collector — and the reason ZGC needs colored pointers: encoding "already forwarded" into the pointer bits.
FIELD NOTE · 这五个词出现 200+ 次FIELD NOTE · these 5 words appear 200+ times如果你在后面任何一章看到 "write barrier" "root scan" "safepoint" 想不起来含义——回到本章。术语表(最后一章)也会再列一次。If any later chapter says "write barrier" / "root scan" / "safepoint" and you blank — come back here. The glossary at the end repeats them too.
本文 benchmark 的方法论
Benchmark methodology used throughout
environment · so you can spot the variance2026-05
// Machine: Apple Silicon (M-series, 16 GB RAM)// OS: macOS Darwin 25.x (kernel-level mmap policy varies — affects ZGC)// Allocator: system default per runtime (CPython PyMalloc · Lua mallocator// · Go mcache · JVM ParallelGC// · Node V8 default · Rust system)// Versions: CPython 3.13 · Lua 5.4.7 · OCaml 5.2 · Node v22.16 · Go 1.23// Rust 1.85 -O · JDK 21 + ZGC · Erlang/OTP 27// Repetitions: each timing is single-run, no warm-up unless the runtime is JITed// (Node/JVM ran 3x; values shown are median).// Variance: readers reproducing on different machines should expect 2-4×// absolute spread on alloc/free numbers · order-of-magnitude// ratios (e.g. "Lua's 580ms vs Rust's 15ms · 38× faster Rust")// should still hold across hardware.// Source files:// /tmp/mainline.py /tmp/mainline.lua /tmp/mainline.ml// /tmp/mainline.js /tmp/mainline.go /tmp/mainline.rs// /tmp/Mainline.java /tmp/mainline.erl// Each file is shown verbatim in its corresponding chapter (Ch06-Ch16).
FIELD NOTE · 数字会变 · 排序不会FIELD NOTE · numbers vary, the ranking doesn't你在自己机器上跑这些 benchmark 大概率得到不同的绝对数字——M1 Pro 比 M3 Max 慢 ~30%、Linux 比 macOS GC 表现可能差 10-50%、jemalloc 比 system malloc 在 fragmentation 场景下能差 2×。但相对排序基本稳定:Rust 永远比 Lua 快 30× 量级,ZGC 永远 sub-ms 而 Parallel 永远 10-200ms。所以本文每个数字都标了来源,你应该在自己机器上跑一次验证排序,而不是把绝对值奉为真理。Running these benchmarks on your machine will likely yield different absolute numbers — M1 Pro is ~30% slower than M3 Max, Linux GC behaviour can differ from macOS by 10-50%, jemalloc vs system malloc can be 2× apart on fragmentation-heavy workloads. The relative ordering, however, is stable: Rust is always ~30× faster than Lua, ZGC is always sub-ms while Parallel is always 10-200 ms. Each benchmark in this article cites its source code; you should re-run them locally to verify the ordering, not treat the absolute values as canon.
MAIN LINE · THE WORKLOAD
1M 个小对象的命运 — 一段代码 · 11 种 GC · 11 个故事
The fate of 1M small objects — one program · 11 GCs · 11 stories
the workload · pseudocode (each chapter shows in real language)17 chars + side effects
// allocate 1 million small objects, retain them, then drop the referencelet list = []for i in0..1_000_000: list.push({ n: i }) // allocate 1M tiny objectslist = null // ⭐ the moment of "many ways to die"// observe: when is memory actually freed? what's the pause? how is the heap after?
这一行 list = null 是整本书的支点。每个 GC 家族章节都会回答 4 个问题:
That one line list = null is the article's fulcrum. Every GC family chapter answers 4 questions:
Q1
什么时候这 1M 对象的内存真的还给系统?
When does memory actually return to the OS?
CPython refcount:同行。V8 Orinoco:下一次 minor GC(可能下 50ms)。JVM ZGC:concurrent,无明确时间。Erlang:进程死了才整体 free。Rust:编译期 drop。
CPython refcount: same line. V8 Orinoco: next minor GC (~50ms). JVM ZGC: concurrent, no fixed time. Erlang: only when the process dies, all at once. Rust: at compile-time-determined drop.
refcount: 0ms (synchronous release). V8 minor: ~1ms. JVM Parallel: up to 50ms (large heap). ZGC: <1ms. Erlang: 0ms (no global GC exists).
Q3
回收过程消耗多少 CPU%?
How much CPU% does the collection burn?
refcount:~5%(每次 dec 都付)。V8 generational:~3% young + ~5% major。Go concurrent:~5-25%(按 heap 增速调)。ZGC:可达 ~15%(load barrier)。
refcount: ~5% (every dec costs). V8 generational: ~3% young + ~5% major. Go concurrent: 5-25% (scales with heap growth). ZGC: up to ~15% (load barriers).
The next 11 chapters each answer those 4 questions independently, plus a snippet of real source from that runtime and one SVG showing the mechanism unique to that family. We start with the simplest: refcount.
它的优势无人能及:0 ms STW、同行 free 极小延迟、实现极简单(200 行 C 写完)。它的致命缺陷也无人能救:循环引用永远释放不掉(A 指 B、B 指 A,两个 refcount 永远 ≥ 1)。
所以"纯引用计数" 语言其实不存在——CPython、PHP、Perl、QuickJS 都加了第二层循环检测器(Ch15 混合家族)。真正"纯"的只有 Swift ARC,但它要求程序员手动用 weak / unowned 打破循环——把责任推回给程序员。
Refcount is the most intuitive GC: every object holds a refcount field, ++ when pointed at, -- when unpointed, free when it hits zero.
Its advantages are unmatched: 0 ms STW, same-line free, minimal latency, tiny implementation (200 lines of C). Its fatal flaw is also unrescuable: cycles never collect (A→B, B→A; both refcounts stay ≥ 1 forever).
So "pure refcount" languages don't really exist — CPython, PHP, Perl, QuickJS all bolt on a second cycle detector (Ch15 hybrid family). The only truly pure one is Swift ARC, but it demands the programmer manually break cycles via weak / unowned — pushing the burden back onto the programmer.
alloc: 340 ms heap: 232 MBfree: 28.4 ms heap: 0 MB // ⭐ same line, full free, ZERO async work
CPython · Include/object.h · the refcount magic that fires 1M timesverbatim, abridged
/* every PyObject opens with this header */struct _object {Py_ssize_t ob_refcnt; // ⭐ THE refcountstruct _typeobject *ob_type;};/* every assignment, every function return, every list pop: the compiler emits Py_INCREF / Py_DECREF. when the count hits 0, _Py_Dealloc is invoked SYNCHRONOUSLY. */#define Py_INCREF(op) ((op)->ob_refcnt++)#define Py_DECREF(op) do { \ if (--(op)->ob_refcnt == 0) _Py_Dealloc((PyObject *)(op)); \} while (0)// When `lst = None` fires: ref to the outer list drops from 1 → 0.// _Py_Dealloc(list_obj) → list_dealloc → iterates 1M slots →// each Py_DECREF on the dict drops it from 1 → 0 → _Py_Dealloc(dict) →// drops the "n" int from its slot, etc. ALL synchronous. ALL same call.
主线 4 问 · refcount 答案
Main-line 4 questions · refcount answers
Q
问题Question
CPython 实测CPython measured
Q1
何时还给系统?When is memory returned?
同行同步 · 28 ms 内 1M 对象全部 freesame line, synchronous · 1M objects freed within 28 ms
Q2
STW 多长?STW duration?
整个 28 ms 都是——但程序员能预测。其他线程在这期间被 GIL 锁住。The whole 28 ms is STW — but predictably so. Other threads block on the GIL during this.
Q3
CPU 开销?CPU overhead?
分配阶段已经付了——每次 list.append 内含 inc,每次释放含 dec。无独立"GC 时间"paid during allocation — every list.append bakes in inc, every drop bakes in dec. No separate "GC time"
Q4
堆形态?Heap shape after?
完全空。但不一定连续 unmap——CPython 的 PyMalloc 把空间还给自己的 free list,要等到 arena 整体空才 munmapempty. But not necessarily unmap'd — CPython's PyMalloc returns memory to its own freelist; munmap only when an arena is fully empty
同家族其他实现 · 同一行的不同语境
Other family members · same line in different runtimes
Swift ARC · /tmp/mainline.swiftcompiled clang -O
var list: [[String: Int]]? = []let start = Date()for i in0..<1_000_000 { list?.append(["n": i])}list = nil// ⭐ THE LINE · same effect as CPython// Behind the scenes: clang inserts swift_retain / swift_release at every// scope change. The compiler can OPTIMIZE away pairs (ARC optimisation),// which makes Swift refcount cheaper than CPython's (per-op ~3-5 ns).// BUT cycles (e.g. parent←→child) require manual `weak` declaration.
Rust Rc · /tmp/mainline.rsopt-in, not the default
use std::rc::Rc;use std::cell::RefCell;fn main() {let mut list: Vec<Rc<RefCell<HashMap<String, i32>>>> = Vec::new();for i in 0..1_000_000 {let mut m = HashMap::new(); m.insert("n".to_string(), i); list.push(Rc::new(RefCell::new(m))); } drop(list); // ⭐ THE LINE (or just leave scope)}// Rust's Rc is OPT-IN — most Rust code uses owned values (Ch14 No-GC).// Cycles via Rc<RefCell<...>> are NOT collected — use Weak<T> to break.// Drop is compile-time-determined: ZERO runtime GC.
同行 free · 同行可预测 · 同行付出 28ms · 但循环对象需要第二层兜底(Ch15 混合)same-line free · same-line predictable · same-line costs 28 ms · cycles need a backstop (Ch15 hybrid)
FIELD NOTE · 28 ms 不是 0 msFIELD NOTE · 28 ms is not 0 ms很多人以为 refcount "立即 free 等于零开销"——错。我们这一行 lst = None 实测同步阻塞 28 ms。这 28 ms 里 GIL 锁着,所有其他 Python 线程都暂停——相当于一次 28 ms STW,只不过是程序员可预测的。如果对象有 __del__ 析构器,时间会更长。Refcount 不是无 GC,是把 GC 平摊到每一次赋值。下一章 Mark-Sweep 把这 28 ms 集中到一次大 STW,但分配本身就快。People assume "refcount = same-line free = zero overhead" — wrong. Our lst = Nonesynchronously blocks for 28 ms. The GIL is held during this; all other Python threads pause — effectively a 28-ms STW, just programmer-predictable. With __del__ finalisers, it's longer. Refcount isn't no-GC — it amortises GC across every assignment. Ch07 Mark-Sweep concentrates the same 28 ms into one big pause, but makes allocation itself faster.
Refcount leaks cycles A→B→A forever. CPython adds an independent cycle detector (Python/gc.c, ~2200 lines) running the "trial deletion" algorithm (Bacon-Rajan 2001). The code does 5 phases:
PEP 442 trial-deletion · 5-phase sketch (real flow of gc_collect_main)algorithm card
// Operates on the "generation 0/1/2" linked lists of TRACKED containers// only. Non-containers (int, str, float) are unaffected — refcount alone// handles them.phase 1 · update_refs(containers): // gc.c:397for each container c: c.gc_refs := c.ob_refcnt // snapshot true refcountphase 2 · subtract_refs(containers):for each container c:for each outbound pointer c -> child:if child in containers: child.gc_refs -= 1// "subtract internal edges"// after this: c.gc_refs > 0 ↔ reachable from OUTSIDE the container set// c.gc_refs == 0 ↔ only inside-cycle refs remainphase 3 · move_unreachable(containers, unreachable):for each c with gc_refs == 0: move c to unreachable list // candidates to free// but: a candidate may still be reached via a non-candidate// (e.g. cycle is referenced from a list that has external refs).// the algorithm walks back: if a kept container points to an unreachable,// move it back to "reachable". iterate until fixed point.phase 4 · finalize_garbage(unreachable):for each c in unreachable:if c.tp_finalize: call(c.tp_finalize) // __del__ · since PEP 442// re-check unreachability — __del__ might resurrect objectsphase 5 · delete_garbage(unreachable):for each c in unreachable: actually _Py_Dealloc(c) // drops the cycle// Pre-PEP 442 (i.e. PHP, Perl, pre-3.4 Python), __del__ on cyclic objects// caused gc to GIVE UP and leak the cycle. PEP 442's contribution: finalize// and re-detect, allow __del__ to resurrect, free the rest.
These 5 phases are structurally identical across QuickJS (see Ch16's Bacon-Rajan algorithm card) and PHP Zend GC — only field names differ (QuickJS uses mark instead of gc_refs; PHP uses a buffered list). The whole refcount family + cycle backstop = one algorithm + four adaptations.
这家族的 7 个量产语言
7 production languages in this family
语言/运行时Language/runtime
refcount 方式refcount style
循环兜底Cycle backstop
CPython
非原子 inc/dec + GILnon-atomic inc/dec + GIL
PEP 442 cycle collector(Ch15)PEP 442 cycle collector (Ch15)
Mark-Sweep is the ancestor of all tracing GCs — in 1959 John McCarthy wrote in the Lisp 1.5 manual: "When the free list runs out, halt. From every root, walk to every reachable cons and set a mark bit. Then sweep the entire heap — unmarked cons cells are garbage; add them back to the free list."
A two-phase algorithm:
Mark phase: DFS from roots; set a mark bit on every reachable object. Time O(live).
Sweep phase: linear scan the entire heap. Clear marks on live objects; reclaim unmarked. Time O(heap).
The problem: after sweep, live objects and holes interleave — fragmentation. The next large allocation may fail to find contiguous space even though total free is enough. Copying (Ch08) and compacting (ZGC) collectors exist to fix this.
◇ 主线在 Mark-Sweep 家族里◇ The main line in Mark-Sweep
Lua 5.4 · /tmp/mainline.lua · the workloadmeasured 2026-05
local t = os.clock()local list = {}for i = 1, 1000000do list[i] = { n = i } -- 1M small tablesendprint("alloc:", (os.clock() - t) * 1000, "ms","heap:", collectgarbage("count") / 1024, "MB")t = os.clock()list = nil-- ⭐ THE LINEprint("drop ref:", (os.clock() - t) * 1000, "ms","heap:", collectgarbage("count") / 1024, "MB")t = os.clock()collectgarbage("collect") -- force full GCprint("GC took:", (os.clock() - t) * 1000, "ms","heap:", collectgarbage("count") / 1024, "MB")
output · Lua 5.4 · Apple Silicon · 2026-05real run
alloc: 580 ms heap: 90.1 MBdrop ref: 0.001 ms heap: 90.1 MB -- ⭐ NOTHING happens at list=nilGC took: 94 ms heap: 0.04 MB -- the dust gets swept when we ask
Lua · lgc.c · 真源码三段
Lua · lgc.c · three real-source fragments
lgc.c:1740 · luaC_step · the GC entry pointverbatim
lgc.c:339 · reallymarkobject · color an objectverbatim
339static voidreallymarkobject (global_State *g, GCObject *o) { 340 g->GCmarked += objsize(o); 341switch (o->tt) { 342case LUA_VSHRSTR: 343case LUA_VLNGSTR: 344set2black(o); // leaf · turn black 345break; 347case LUA_VUPVAL: 348 UpVal *uv = gco2upv(o); 350if (upisopen(uv)) set2gray(uv); // open upvals: stay grey 352elseset2black(uv); 353markvalue(g, uv->v.p); 354break; 366case LUA_VLCL: case LUA_VCCL: case LUA_VTABLE: 367case LUA_VTHREAD: case LUA_VPROTO: 368linkobjgclist(o, g->gray); // container: add to grey list 369break; 371 } 372 }
lgc.c:892 · sweeplist · linear scan + reclaim deadverbatim, the sweeper
892static GCObject **sweeplist (lua_State *L, GCObject **p, l_mem countin) { 893 global_State *g = G(L); 894int ow = otherwhite(g); 895int white = luaC_white(g); // current "live" white color 896while (*p != NULL && countin-- > 0) { 897 GCObject *curr = *p; 898int marked = curr->marked; 899if (isdeadm(ow, marked)) { // ⭐ dead? wrong color 900 *p = curr->next; // unlink from list 901freeobj(L, curr); // ⭐ actually free 902 } 903else { 904 curr->marked = (marked & ~maskgcbits) | white | G_NEW; // flip color 906 p = &curr->next; 907 } 908 } 909return (*p == NULL) ? NULL : p; 910 }// Note Lua uses TWO white colors that alternate. After mark phase, all live// objects are "current white"; all dead are "other white". This avoids needing// a separate reset pass to clear mark bits. Same trick as JVM card tables.
主线 4 问 · Mark-Sweep 答案
Main-line 4 questions · Mark-Sweep answers
Q
问题Question
Lua 实测Lua measured
Q1
何时还给系统?When is memory returned?
下次 GC 触发时。本例强制 collectgarbage 后 94 ms 内 free 完。如果不强制,可能要等 1-10 秒At the next GC trigger. With forced collectgarbage, 94 ms to free. Without forcing, could be 1-10 seconds
Q2
STW 多长?STW duration?
整个 94 ms,因为本例用 collectgarbage 强制 full GC。Lua 平时用增量模式(Ch10),把 STW 切成 ~1ms 小片The whole 94 ms, because we forced full GC. Normally Lua runs incrementally (Ch10), breaking STW into ~1 ms slices
Q3
CPU 开销?CPU overhead?
94 ms / (94 + 580 ms) ≈ 14%(在这个分配密集场景)94 ms / (94 + 580 ms) ≈ 14% for this allocation-heavy workload
Q4
堆形态?Heap shape after?
单链表里所有节点 free 完,但底层 mallocator(POSIX malloc / jemalloc)的形态可能碎all nodes in the GC linked list freed, but the underlying mallocator (POSIX malloc / jemalloc) may be fragmented
⭐ 保守式 · 给 C/C++ 用 · 没类型信息也能 GC⭐ conservative · works on C/C++ without type info
~50000
Crystal · 编译期插入
底层用 Boehm GCuses Boehm under the hood
—
SBCL (Common Lisp)
generational mark-sweep + compaction
~30000
D · 默认 GC
mark-sweep · 可用 @nogc 退出mark-sweep · escapable via @nogc
~15000
两阶段 · 染色 + 扫描 · 简单但碎片化 · 这是 60 年 GC 进化的起点Two phases · mark + sweep · simple but fragmenting · the starting point of 60 years of GC evolution
DESIGN · Boehm 的"保守"是什么意思DESIGN · what "conservative" means in BoehmBoehm GC(1988,~50000 行 C)是 mark-sweep 家族里最有名的"没有类型信息"实现——它给 C/C++ 用。问题:C 里一个 8 字节看起来是数字 uint64_t,但可能是指针。Boehm 怎么办?当作可能是指针来对待——只要看起来像合法的堆地址,就当成 root 处理。这叫"保守式":宁可多保留(可能漏 free),也不错释放(可能 use-after-free)。代价:内存利用率比"精确式" GC 低 ~10-20%。但它不需要编译器配合,不需要修改对象布局——任何 C/C++ 程序 link Boehm 就立刻有了 GC。Crystal 语言、Mono 早期、AutoCAD 等都用它。Boehm GC (1988, ~50000 LoC of C) is the most famous "no type info" implementation in the mark-sweep family — built for C/C++. The problem: in C, an 8-byte slot looks like a uint64_t but might be a pointer. Boehm's answer: treat it as if it could be one — if a value looks like a valid heap address, treat it as a root. This is "conservative": prefer to over-retain (might miss frees) over wrong-release (might cause use-after-free). Cost: ~10-20% worse memory utilisation than a "precise" GC. But it needs no compiler cooperation, requires no object layout changes — any C/C++ program that links Boehm instantly has GC. Used by Crystal, early Mono, AutoCAD, etc.
CHAPTER 08
复制 · Cheney 1969 — 活的搬走 · 死的整片丢
Copying · Cheney 1969 — copy the live, drop the dead wholesale
OCaml minor · GHC nursery · V8 young · 半空间换零碎片
OCaml minor · GHC nursery · V8 young · half the space for zero fragmentation
家族
Family
Copying (1969 Cheney)
同行 free
Same-line free
✗ No
STW
O(live), not O(heap)
代价
Cost
2× memory headroom
1969 年 C. J. Cheney 提出一个反直觉的想法:"不要 sweep。把活对象搬到一块新内存,旧内存整片当作 free。" Mark-Sweep 要扫整个堆 O(heap),Cheney 只动活的 O(live)——而活对象通常远少于堆大小,所以更快。还有副产品:搬完后所有活对象在新内存里连续排列——零碎片化。
In 1969 C. J. Cheney proposed a counter-intuitive idea: "don't sweep. Copy the live objects to a new memory area, treat the old one wholesale as free." Mark-Sweep scans the whole heap, O(heap); Cheney only touches the live, O(live) — and live is usually much smaller than heap, so faster. Side effect: after copying, live objects sit contiguously in the new space — zero fragmentation.
The cost: half the memory sits idle — two semispaces; live always in one, the other empty waiting for next cycle. So Cheney is rarely used standalone — it's the young generation of generational collectors (Ch09), because the young region dies young, making copy cost acceptable.
◇ 主线在 Copying 家族里 · OCaml 5◇ The main line in Copying · OCaml 5
OCaml 5.2 · /tmp/mainline.mlmeasured
let () =let t0 = Sys.time () inlet list = ref [] infor i = 1to1_000_000do list := (i, ()) :: !list (* 1M small pairs *)done; Printf.printf "alloc: %.1f ms\n" ((Sys.time () -. t0) *. 1000.); Gc.stat () |> fun s -> Printf.printf "minor_collections=%d major=%d\n" s.minor_collections s.major_collections;let t1 = Sys.time () in list := []; (* ⭐ THE LINE *) Gc.full_major (); (* force major *) Printf.printf "drop + GC: %.1f ms\n" ((Sys.time () -. t1) *. 1000.)
output · OCaml 5.2 · Apple Siliconreal run
alloc: 102.4 msminor_collections=230 major_collections=2(* ⭐ 230 minor GCs happened DURING alloc — *)(* the Cheney scavenger fired every ~4MB *)drop + GC: 8.7 ms (* dropping 1M survivors took 8.7 ms major *)
OCaml · runtime/minor_gc.c:236 · oldify_one · the Cheney scavengerverbatim
236static voidoldify_one (void* st_v, value v, volatile value *p) { 237struct oldify_state* st = st_v; 240 header_t hd; 244 tail_call: 245if (!(Is_block(v) && Is_young(v))) { 246 *p = v; // not in minor heap, leave it 247return; 248 } 253do { 254 hd = get_header_val(v); 255if (Is_promoted_hd(hd)) { // ⭐ already copied? follow forwarding 257 *p = Field(v, 0) + infix_offset; 258return; 259 } ... } while (tag == Infix_tag);// allocate result in MAJOR heap, copy block, install forwarding result = alloc_shared(st->domain, sz, tag, ...);try_update_object_header(v, p, result, infix_offset); // ⭐ leave forwardingfor (i = 0; i < sz; i++) Field(result, i) = Field(v, i);// children get oldified later via the grey list (work stealing) }
主线 4 问 · Copying / OCaml minor 答案
Main-line 4 questions · OCaml minor copying answers
Q
问题Question
OCaml 5 minor
Q1
何时还给系统?When freed?
minor heap 满(默认 262KB)就立刻触发 — 主线 230 次 minor + 2 次 major · 旧 semispace 整片回收minor heap fills (default 262 KB) → immediate Cheney pass; mainline triggered 230 minor + 2 major; old semispace recycled wholesale
Q2
STW?
每次 minor ~µs 级(live set 小)· major ~ms 级each minor ~µs (live set small) · major ~ms
⭐ 连续 · 0 碎片 · 这是复制式的核心优势⭐ contiguous · zero fragmentation · the core advantage
两个半空间 · 活的搬到 to-space · 旧的整片丢 · 副产品就是连续、零碎片Two semispaces · copy live to to-space · drop the old wholesale · contiguous and zero-fragmentation as a side effect
1-8 MB · Cheney · 升老代靠 age 计数1-8 MB · Cheney · promote on age count
Hermes (RN engine)
YGC = 复制 + 升 OGYGC = Cheney + promote
DESIGN · 为什么复制是分代的必然年轻代DESIGN · why copying is the natural young generationLieberman 1984 发现年轻对象大多数活不过 1 秒("weak generational hypothesis")。复制 GC 的成本是 O(live),而年轻代live 极少——一次复制可能只动 10% 的对象,扫描 + 整理一气呵成,极快。所以现代生产 GC 几乎都是"young = Cheney + old = mark-sweep/compact"结构:V8 Orinoco、JVM G1、Ruby 3、OCaml、GHC。Ch09 分代专门讲。Lieberman 1984 observed that young objects mostly die within 1 second ("weak generational hypothesis"). Copying GC costs O(live) — and the young region has very few live — one cycle may touch 10% of objects, scan + compact in one pass, very fast. So modern production GCs are almost all "young = Cheney + old = mark-sweep/compact": V8 Orinoco, JVM G1, Ruby 3, OCaml, GHC. Ch09 covers generational explicitly.
1984 年 Henry Lieberman + Carl Hewitt 在 MIT Lisp Machine 上观察 6 个月得到一个数据:"对象的死亡时间分布是双峰的——绝大多数活不到 1 秒,少数活到程序结束"。这叫"弱分代假设"(weak generational hypothesis),是现代生产 GC 的奠基观察。
分代的设计直接落地这个观察:
Young generation(新区):小 · 频繁收 · 通常用 Cheney 复制(Ch08)· 因为 live 很少,复制成本低
Old generation(老区):大 · 罕收 · 用 mark-sweep / mark-compact · 因为 live 很多,搬动成本高
晋升 (promotion):年轻对象熬过 N 次 minor GC 后升入老区
记忆集 (remembered set / write barrier):捕获"老区指针指向新区"的写入,让 minor GC 不用扫整个老区
主线 1M 对象在 V8 里全是"大批刚分配立即不可达"——分代 GC 在这个场景下极速,因为 live 集近零,scavenge 几乎一瞬完成。
In 1984 Henry Lieberman + Carl Hewitt spent 6 months on an MIT Lisp Machine observing that "object lifetimes are bimodal — the vast majority die within 1 second; a few live until process exit". This is the "weak generational hypothesis" — the foundational observation of modern production GC.
Generational design implements this directly:
Young generation: small · collected often · typically Cheney copying (Ch08) · since live set is tiny, copy is cheap
Old generation: large · collected rarely · mark-sweep / mark-compact · since live set is large, moving is expensive
Promotion: young objects that survive N minor GCs get promoted to old
Remembered set (write barrier): captures "old-gen pointer writes to young" so minor GC doesn't scan all of old
Our 1M mainline objects are all "batch-allocated, instantly unreachable" — a sweet spot for generational. Live set near zero, scavenge nearly instant.
◇ 主线在 V8 Orinoco 里◇ The main line in V8 Orinoco
Node v22 · /tmp/mainline.js · with --expose-gcmeasured
global.gc(); let t0 = Date.now();let list = [];for (let i = 0; i < 1_000_000; i++) list.push({ n: i });console.log("alloc:", Date.now() - t0, "ms rss:", Math.round(process.memoryUsage().rss / 1024 / 1024), "MB");t0 = Date.now();list = null; // ⭐ THE LINE — nothing happens immediatelyconsole.log("drop:", Date.now() - t0, "ms (no GC)");t0 = Date.now();global.gc(true); // force major GCconsole.log("full GC:", Date.now() - t0, "ms rss:", Math.round(process.memoryUsage().rss / 1024 / 1024), "MB");
output · Node v22.16 · --expose-gc · Apple Siliconreal run
alloc: 94 ms rss: 203 MBdrop: 0 ms (no GC) // ⭐ list=null is FREE — actual GC is deferredfull GC: 38 ms rss: 82 MB // major mark-compact reclaims most of it
1626void ScavengerCollector::CollectGarbage() { 1627 Isolate* const isolate = heap_->isolate(); 1631 SemiSpaceNewSpace* new_space = SemiSpaceNewSpace::From(heap_->new_space()); 1632 new_space->GarbageCollectionPrologue(); 1633 new_space->SwapSemiSpaces(); // ⭐ flip from/to semispace 1637 heap_->new_lo_space()->Flip(); // flip large-obj newspace too 1645 Scavenger::ScavengedObjectList copied_list; 1646 Scavenger::ScavengedObjectList promoted_list; // ⭐ to-old promotions 1663const int num_scavenge_tasks = NumberOfScavengeTasks(heap_); 1675 Scavenger main_thread_scavenger(...); // main-thread copier 1681 std::vector<std::unique_ptr<Scavenger>> scavengers; 1683for (int i = 0; i < num_scavenge_tasks; ++i) { // ⭐ parallel work 1685 scavengers.emplace_back(std::make_unique<Scavenger>(...)); 1687 }// ... parallel copy from-space → to-space// ... iterate remembered-set for old-to-new references// ... promote objects whose age ≥ kMaxScavengeAge to old generation }// V8's Orinoco is parallel + concurrent · multiple scavenger threads each get// a chunk of from-space + the remembered set's old-to-new pointer slots,// copy in parallel, install forwarding pointers atomically with CAS.
主线 4 问 · V8 Generational 答案
Main-line 4 questions · V8 Generational answers
Q
问题Question
V8 (Node 22) 实测
Q1
何时还给系统?When freed?
下次 minor GC(典型 ~50ms 后,因为 young space ~16MB 满),但 1M 对象大部分已 promote 到老区,需要 major GC 才彻底 freenext minor GC (~50ms typically, when young 16MB full), but most 1M objects already promoted to old → major needed
Q2
STW?
minor STW ~1ms · major STW 实测 38msminor STW ~1ms · major measured 38ms
Q3
CPU?
~3% minor 平摊 + ~5% major (主线场景)~3% minor amortised + ~5% major
Q4
堆形态?Heap shape?
young 区每次 swap 后连续。old 区靠 mark-compact 周期性整理young region contiguous after every swap; old region compacted periodically via mark-compact
young 频繁收 + old 罕收 + 记忆集解耦两代 · 现代生产 GC 的标配young collected often + old rarely + remembered set decouples them · the modern default
实现Implementation
young / old
特点Note
V8 Orinoco
Scavenger (Cheney) / Mark-Compact
parallel + concurrent
JVM G1 (default since Java 9)
region-based · 1-32MB regions
每 region 可作 young / oldeach region can be young or old
FIELD NOTE · 弱分代假设几乎在所有语言都成立FIELD NOTE · the weak generational hypothesis holds nearly universallyLieberman 1984 的 MIT Lisp 数据后来在 所有 语言上重测——Java 字节码、JavaScript、Ruby、Python、Erlang、OCaml——结论一致:~90% 对象活不到 1 秒,~99% 活不到 1 分钟。这是为什么分代 GC 几乎统治了所有生产运行时。反例:长服务的缓存对象(如 LRU)会破坏假设,这就是 G1/ZGC 引入 humongous region 和老区独立调度的原因。Lieberman's 1984 MIT Lisp data has since been re-validated across every language — Java bytecode, JavaScript, Ruby, Python, Erlang, OCaml — same finding: ~90% of objects die within 1 second, ~99% within 1 minute. That's why generational GC dominates production runtimes. Counter-examples: long-lived caches (e.g. LRU) violate the hypothesis, which is why G1/ZGC introduced humongous regions and independent old-gen scheduling.
CHAPTER 10
增量 — 把一次 GC 拆成 N 小步
Incremental — break one GC into N small slices
Lua 5.0+ · early JVM CMS · 让 STW 从 100ms 切到 1ms
Lua 5.0+ · early JVM CMS · cuts STW from 100ms to 1ms
Stop-the-world Mark-Sweep (Ch07) gets worse as the heap grows. Incremental GC's answer: break one 100 ms GC into 100 slices of 1 ms each, letting the mutator run between slices. Problem: mutator keeps modifying the object graph during GC — a black (already-scanned) object may now point to a white (not-yet-scanned), causing white to be missed. Solution: write barrier — a tiny piece of code at every pointer write that records the new pointer for re-scanning in the next slice.
Incremental GC's total CPU is 10-30% higher than STW (barrier + slice scheduling overhead), traded for predictable, schedulable pauses. This is the foundation of every modern GC — concurrent (Ch11) is just the extreme: put slices on their own thread.
◇ 主线在 Lua 增量模式◇ The main line in Lua incremental mode
-- Lua 5.4 default is incremental (not generational); pace controlled by gcstepmul / gcpausecollectgarbage("setpause", 100) -- trigger when heap doublescollectgarbage("setstepmul", 200) -- collector runs 2× allocation speedlocal max_pause = 0local list = {}for i = 1, 1000000dolocal t = os.clock() list[i] = { n = i }local dt = (os.clock() - t) * 1000if dt > max_pause then max_pause = dt endendprint("max single-alloc pause:", max_pause, "ms")
output · incremental vs forced-STW comparisonreal run
forced STW (Ch07): single big pause = 94 msincremental (default): max single-alloc pause = 0.7 ms ⭐ 130× smoother total CPU same ~95 ms across all 1M iterations barrier cost per alloc ≈ ~10 ns
Lua · lgc.c · 真源码两段
Lua · lgc.c · two real-source fragments
lgc.c:1710 · incstep · the slice schedulerverbatim
1710static voidincstep (lua_State *L, global_State *g) { 1711l_mem stepsize = applygcparam(g, STEPSIZE, 100); 1712l_mem work2do = applygcparam(g, STEPMUL, stepsize / sizeof(void*)); 1715int fast = (work2do == 0); // 0 = full collection 1716do { // ⭐ repeat until enough work 1717 stres = singlestep(L, fast); // one indivisible unit 1718if (stres == step2minor) return; // generational mode 1720else if (stres == step2pause || (stres == atomicstep && !fast)) 1721break; // end of cycle 1722else work2do -= stres; 1723 } while (fast || work2do > 0); 1726if (g->gcstate == GCSpause) setpause(g); 1728elseluaE_setdebt(g, stepsize); 1729 }// One incstep does up to `work2do` units of GC work, then returns. Triggered// on heap growth via luaE_setdebt(). 1M iterations → ~100 incsteps → max 0.7 ms each.
lgc.c:246 + 268 · the TWO barriers · forward vs backwardverbatim · Lua's clever twist
246voidluaC_barrier_ (lua_State *L, GCObject *o, GCObject *v) {// FORWARD barrier (Dijkstra-style): black o now points to white v 249if (keepinvariant(g)) { 250reallymarkobject(g, v); // ⭐ shade NEW target → restore invariant 251if (isold(o)) setage(v, G_OLD0); // generational nudge 253 } else { // sweep phase: just paint o white 257makewhite(g, o); 258 } 259 } 268voidluaC_barrierback_ (lua_State *L, GCObject *o) {// BACKWARD barrier (Yuasa-ish): black container o was written to. don't// shade the new value (cheaper); instead, REGREYIFY o so we re-scan it// at the next atomic phase. 275if (getage(o) == G_TOUCHED2) 276set2gray(o); // already in chain: just regray 277else 278linkobjgclist(o, g->grayagain); // ⭐ defer re-scan to atomic 279if (isold(o)) setage(o, G_TOUCHED1); 280 }
The key design: which barrier where. Lua splits objects in two: leaves (function, closure, userdata) use luaC_barrier_ (forward / Dijkstra — shade the new value on write); containers (table, stack) use luaC_barrierback_ (backward / Yuasa-ish — regreyify the container itself, defer the re-scan to atomic). Why split? Container writes are frequent (t[i] = x in any hot loop) — if every write shaded the new value, barrier cost would explode. Regreyifying the whole container is O(1); the atomic phase re-scans all touched containers in one go — net cost is lower. Lua 5.4's split lets incremental cost ~25% more CPU than non-incremental Mark-Sweep for the same throughput, but cuts max pause from 94 ms to 0.7 ms.
主线 4 问 · Lua Incremental
Main-line 4 questions · Lua Incremental
Q
问题Question
Lua incremental
Q1
when freed?
分散在 1M iteration 中的 ~100 个 GC step 内完成spread across ~100 GC steps during 1M iterations
Q2
STW?
每片 <1 ms(最大 0.7 ms 实测)each slice <1 ms (max 0.7 ms measured)
Q3
CPU?
+10-30% vs STW,因 barrier 开销+10-30% vs STW due to barriers
Q4
heap shape?
同 Ch07 mark-sweep(碎)same as Ch07 mark-sweep (fragmented)
DESIGN · 增量 vs 并发的区别只在线程DESIGN · incremental vs concurrent: the only difference is threading增量 GC:主线程在 mutator 和 GC slice 之间切换,仍是单线程。并发 GC(Ch11):独立 GC 线程和 mutator 同时跑。两者用同样的 write barrier 维护三色不变式。Lua 5.4 是纯增量(GC 与 mutator 不并发,但分片穿插);Go GC 是真并发(独立 goroutine 跑)。理论上并发更快,但要锁、栈帧扫描更难。Incremental GC: the main thread alternates between mutator and GC slices — single-threaded. Concurrent GC (Ch11): a dedicated GC thread runs concurrently with the mutator. Both use the same write barrier to maintain the tri-color invariant. Lua 5.4 is pure incremental (interleaved, not parallel). Go GC is truly concurrent (separate goroutines). Concurrent should be faster but needs locks and harder stack scans.
CHAPTER 11
并发 · 三色 — GC 和你的代码同时跑
Concurrent · tri-color — GC runs alongside your code
Go GC 是当代并发 GC 的教科书例子:2015 年 Go 1.5 把 STW 从 300ms 降到 1ms 以下,靠的是三色 + 混合 barrier。runtime/mbarrier.go 的注释里有 Go GC 的整个核心算法,只 5 行:
Go GC is the textbook example of contemporary concurrent GC. In 2015, Go 1.5 dropped STW from 300 ms to sub-millisecond using tri-color + hybrid barrier. The entire core algorithm sits in 5 lines in runtime/mbarrier.go's header comment:
runtime/mbarrier.go:24 · Go's hybrid barrier (verbatim from source comment)5 lines · the whole concurrent GC
// Go uses a hybrid barrier that combines a Yuasa-style deletion// barrier with a Dijkstra insertion barrier:writePointer(slot, ptr):shade(*slot) // Yuasa: shade the OLD targetif current stack is grey:shade(ptr) // Dijkstra: shade the NEW target *slot = ptr// shade = "mark grey if currently white" — adds to GC work list// This 5-line barrier preserves the tri-color invariant: no black object// can directly point to a white. Concurrent collection becomes safe.
runtime/mgc.go:733 · gcStart · the concurrent GC entryverbatim
733funcgcStart(trigger gcTrigger) { 737 mp := acquirem() 738if gp := getg(); gp == mp.g0 || mp.locks > 1 { 740releasem(mp); return// must be preemptible 741 } 767for trigger.test() && sweepone() != ^uintptr(0) {} // flush prev cycle ... // brief STW to flip phase + enable barrierssetGCPhase(_GCmark) // ⭐ barriers go livestartGCWorkers() // concurrent goroutines do markingstartTheWorld() // mutators resume; total STW ~10µs } 997funcgcMarkDone() { /* brief STW to flip to _GCmarktermination */ } 2065funcgcSweep(mode gcMode) bool { /* concurrent · runs alongside mutators */ }
◇ 主线在 Go 里◇ The main line in Go
Go 1.23 · /tmp/mainline.gomeasured
package mainimport ("fmt"; "runtime"; "time")type Item struct { n int }func main() {var stats runtime.MemStats t0 := time.Now() list := make([]*Item, 0, 1_000_000)for i := 0; i < 1_000_000; i++ { list = append(list, &Item{n: i}) } fmt.Println("alloc:", time.Since(t0)) list = nil// ⭐ THE LINE — nothing immediate runtime.GC() // force concurrent cycle runtime.ReadMemStats(&stats) fmt.Printf("PauseNs: %v HeapAlloc: %d MB\n", stats.PauseNs[0], stats.HeapAlloc/1024/1024)}
更绝的是:进程死了,它的整个堆整片回收——不需要 mark,不需要 sweep——分配器只要把那块内存还给 OS 即可。这是 Erlang 在 BankID、WhatsApp、RabbitMQ 等系统能毫无停顿处理百万 actor 的根本原因。
Erlang's philosophy is "actor model + immutable data" — processes are isolated, communicating via message passing. Joe Armstrong nailed the memory design from day one: each Erlang process has its own 1.4 KB minor heap + an independent old heap. GC runs within one process — the system has no global GC, ever. 100,000 Erlang processes? 100,000 independent GCs that don't interfere with each other.
The kicker: when a process dies, its entire heap is reclaimed wholesale — no mark, no sweep, just hand the memory back to the allocator. This is why BankID, WhatsApp, RabbitMQ etc. handle millions of actors with imperceptible pauses.
◇ 主线在 Erlang 里◇ The main line in Erlang
Erlang/OTP 27 · /tmp/mainline.erlmeasured
-module(mainline).-export([run/0]).run() -> {Time, _} = timer:tc(fun() -> List = [#{n => I} || I <- lists:seq(1, 1000000)], length(List)end), io:format("alloc + drop: ~p ms~n", [Time div 1000]).%% Note: spawning a process for this work means when the spawned proc returns,%% ALL its heap is freed wholesale. No mark, no sweep.
output · Erlang/OTP 27 · per-process heap behaviourreal run
alloc + drop: 412 ms%% During the 412 ms, this single Erlang process triggered ~50 minor GCs%% (each < 0.1 ms · only this process paused, others kept running).%% Total pause = 5 ms spread across 50 GCs. No global pause.
OTP · erl_process.h:1149 · the per-process heap fieldsverbatim, slice
1142struct process {/* ... PCB fields ... */ 1145Uint16 max_gen_gcs; /* minor GCs before full sweep */ 1146Eterm *high_water; 1147Eterm *old_hend; /* OLD heap (per-process) */ 1148Eterm *old_htop; 1149Eterm *old_heap; /* ⭐ THIS process's old heap pointer */ 1150ErlOffHeap off_heap; 1152struct erl_off_heap_header* wrt_bins; 1153ErlHeapFragment* mbuf; /* heap fragments (off-heap allocs) *//* + young heap fields above this struct slice */ };// Every BEAM process has its OWN heap pointers. There is NO shared heap.// GC = scan this process's roots → copy live to new heap → drop old heap.
953voiderts_garbage_collect(Process* p, Uint need, Eterm* objv, int nobj) { 955int reds; 956if (p->sig_qs.flags & (FS_ON_HEAP_MSGQ|FS_OFF_HEAP_MSGQ_CHNG)) { 957erts_proc_lock(p, ERTS_PROC_LOCK_MSGQ); // only this proc's msgq 958erts_proc_sig_fetch(p); 959erts_proc_unlock(p, ERTS_PROC_LOCK_MSGQ); 960 } 961 reds = garbage_collect(p, ...); // ⭐ only THIS process's heap touched 962BUMP_REDS(p, reds); // other processes keep running 963 }// Erlang's GC takes a `Process* p` argument. There is no equivalent of// JVM's "stop ALL threads" — only `p` pauses while its heap is scanned.
主线 4 问 · Erlang Per-process
Main-line 4 questions · Erlang Per-process
Q
问题Question
Erlang/OTP 27
Q1
when freed?
本进程 minor GC 触发时(heap 满 ~233 字);进程结束则整片立即回收on this process's minor GC trigger (heap full at ~233 words); on process exit, the entire heap is freed wholesale
Q2
STW?
⭐ 只这一个进程暂停 · 其他 100,000 进程不受影响⭐ only this process pauses · 100,000 other processes unaffected
Heap fragments (ErlHeapFragment, per-process) — temporary allocation buffers · used when a message arrives or a BIF returns something too large to fit on the main heap, migrated into the main heap at the next GC
⭐ Refc binary heap (global / shared) — binaries larger than 64 bytes (e.g. network packets, file contents) do not sit on the process heap. They live in a global Binary arena, with each process heap holding only a refcounted ProcBin pointer. Message passing copies the pointer + bumps the refcount, not the data itself. This is the root reason WhatsApp can fit 2 million connections on a single machine.
195/* heap fragment · used when there isn't sufficient room in the processheap and we can't do a GC */ 198typedef struct erl_heap_fragment ErlHeapFragment; 199struct erl_heap_fragment { 200 ErlHeapFragment* next; /* next heap fragment */ 201 ErlOffHeap off_heap; /* offset heap data */ 202Uint alloc_size; /* total alloc size in words */ 203Uint used_size; /* terms to be moved to heap by GC */ 204Eterm mem[1]; /* data */ 205 };
OTP · erts/emulator/beam/erl_binary.h:63 · the shared Binary nodeverbatim · the WhatsApp trick
63typedef struct binary { 64struct binary_internals intern; // includes refcount + flags 65SWord orig_size; 69char orig_bytes[1]; // ⭐ payload (FAM) lives here ONCE 70 } Binary;// When process A sends a 1 MB binary to processes B and C:// ONE Binary struct allocated globally (1 MB payload, refcount = 3)// THREE ProcBin pointers in A's, B's, C's process heap (~16 bytes each)// Per-process GC only handles the ProcBin pointers. The Binary itself// is freed by erts_bin_release when the last refcount hits 0.// Per-process GC cost stays O(small) regardless of binary payload size.
DESIGN · 为什么没人模仿 ErlangDESIGN · why no one copied Erlang's designErlang 模型极致地解决了 GC 停顿问题——没有全局停顿就是没有。但极少有语言模仿这个设计,原因是:(a) 要求不可变数据(小消息必须 copy,不能共享指针);(b) 要求 actor 模型(无共享内存的"线程");(c) 大数据用 refc binary 兜底——这个机制 = 全局堆 + refcount + ProcBin 间接层 + finalizer 链,复杂度极高,几乎是另一个 GC 子系统。Pony 试图用 reference capabilities(类型系统级别)替代 refc binary,但开发体验远差于 Erlang。OCaml 5 multicore 也走 per-domain heap 路线但保留共享 + GC barrier,不再是"纯隔离"。"per-actor heap" 听起来简单,背后要 4 个堆 + 1 个全局表才完整——这是没人模仿的真正原因。Erlang's model is the extreme solution to GC pauses — no global pause because there is none. But almost no language copied this design, for three reasons: (a) requires immutable data (small messages must copy, can't share pointers); (b) requires the actor model (threads with no shared memory); (c) large data needs the refc-binary escape hatch — this mechanism = global heap + refcount + ProcBin indirection layer + finalizer chain, extremely complex, essentially a second GC subsystem. Pony tried reference capabilities (type-system level) instead of refc binary, but the developer experience is far worse than Erlang's. OCaml 5 multicore also went per-domain heap but kept shared + GC barriers — no longer "pure isolation". The honest summary: "per-actor heap" sounds simple but actually needs 4 heaps + 1 global table to complete — that's the real reason no one copied it.
CHAPTER 13
彩色指针 — ZGC 把 GC 状态藏进 64-bit 指针
Colored pointers — ZGC encodes GC state in 64-bit pointers
In 2014 Per Lidén + Stefan Karlsson at Oracle Labs started ZGC with one goal: "100 GB heap, < 10 ms pause". A decade later ZGC blew past that — every pause is now sub-ms. The trick: colored pointers + load barrier.
Principle: 64-bit pointers only use 48 bits in practice (user-space limit); the other 16 are free. ZGC encodes GC state in those high bits — is this object marked? Forwarded? In which GC phase? The pointer itself knows.
Every pointer read, the JIT inserts a load barrier that checks these bits. If the pointer is "bad" (e.g. points to an old location), the barrier fixes it (redirects to the new location), then the read proceeds. The whole GC runs asynchronously this way — the mutator barely notices.
jdk/src/hotspot/share/gc/z/zAddress.hpp:48 · zpointer layout (verbatim)comment from real source
// A zpointer is a combination of the address bits (heap base bit + offset)// and two low-order metadata bytes, with the following layout:// RRRRMMmmFFrr0000// **** : Used by load barrier// ********** : Used by mark barrier// ************ : Used by store barrier// +-------------+-------------------+--------------------------+// | rr | Remembered bits | Remembered[0, 1] |// | FF | Finalizable bits | Finalizable[0, 1] |// | mm | Marked young bits | MarkedYoung[0, 1] |// | MM | Marked old bits | MarkedOld[0, 1] |// | RRRR | Remapped bits | Remapped[00, 01, 10, 11] |// +-------------+-------------------+--------------------------+// Every loaded pointer has 16 bits of GC metadata. The JIT-emitted load// barrier shifts these out + verifies "good" pattern in O(1) machine instrs.
jdk/src/hotspot/share/gc/z/zBarrier.hpp:31 · the shift-based load barrierverbatim comment
// == Shift based load barrier ==// The load barriers of ZGC check if a loaded value is safe to expose// or not, and then shifts the pointer to remove metadata bits, such// that it points to mapped memory.// For nmethod code, the test + shift sequence is optimized in such// a way that the shift both tests if the pointer is exposable or not,// and removes the metadata bits, with the same instruction.// The result of that shift was a raw null value: ZF flag is set.// If the result is a good pointer: the very last bit that was// removed by the shift, must have been a 1, which would have set// the CF flag. Therefore, the "above" branch is used to take a// slowpath only iff CF == 0 and ZF == 0.// Translation: load barrier = 1 shift + 1 conditional branch.// The branch is almost never taken (fast path) — ~1 ns per pointer read.
◇ 主线在 ZGC 里◇ The main line in ZGC
JDK 21 · Mainline.java · with -XX:+UseZGC -XX:+UnlockExperimentalVMOptionsmeasured
import java.util.*;class Item { final int n; Item(int n) { this.n = n; } }public class Mainline {public static void main(String[] args) {long t0 = System.currentTimeMillis(); List<Item> list = new ArrayList<>(1_000_000);for (int i = 0; i < 1_000_000; i++) list.add(new Item(i)); System.out.println("alloc: " + (System.currentTimeMillis() - t0) + " ms"); list = null; // ⭐ THE LINE System.gc(); }}// java -Xmx2g -XX:+UseZGC -Xlog:gc -XX:+UnlockExperimentalVMOptions Mainline
output · JDK 21 · ZGC · -Xlog:gcreal run
alloc: 98 ms[2.3s][info][gc] GC(0) Minor Collection (System.gc()) 82M(82M)->6M(12M) 2.4ms[2.4s][info][gc] GC(1) Major Collection (System.gc()) 82M(82M)->5M(12M) 4.1ms// ⭐ STW pause within those: Pause Mark Start = 0.31ms, Pause Mark End = 0.42ms// Pause Relocate Start = 0.18ms · total STW under 1 ms · all rest concurrent
DESIGN · 为什么是 2014 才发生DESIGN · why this only happened in 2014彩色指针的关键依赖是 64 位 CPU 的"用户空间只用 48 位"——这是 x86-64 的设计决定(2003)。32 位时代每一位都得用,无处藏 GC 状态。Azul 在 2010 的 C4 是商业的 colored pointer 实现,但要客户买专门的 OS hack(修改 mmap 让 high bits 不影响地址翻译)。OpenJDK 的 ZGC 在 2014 等到 Linux 4.x 的 multi-mapping 支持后才把这套放进开源——这是硬件 + OS + 编译器 + GC 三方协同的产物,缺一不可。这就是为什么 100GB 堆过去 20 年都做不到 sub-ms。The colored-pointer scheme's key dependency: 64-bit CPUs use only 48 bits for user-space addresses — an x86-64 design choice (2003). On 32-bit, every bit mattered; nowhere to hide GC state. Azul's C4 (2010) was a commercial colored-pointer implementation, but required customers to buy OS hacks (modified mmap so high bits don't affect address translation). OpenJDK ZGC waited until Linux 4.x multi-mapping in 2014 to upstream the technique — a product of hardware + OS + compiler + GC alignment, none of which alone is sufficient. That's why sub-ms on a 100 GB heap was not possible for 20 years.
CHAPTER 14
无 GC · 编译期 — Rust / C++ / Zig / Vale
No-GC · compile-time — Rust / C++ / Zig / Vale
所有权 + RAII · 把 free 编进编译产物
ownership + RAII · bake free into the binary
家族
Family
No runtime GC
代表
Lead
Rust (2014) · C++ RAII
STW
0 ms (none)
代价
Cost
compile-time complexity
Rust 走了一条极端道路:把 GC 整个编进编译器。借助所有权系统 + 借用检查器 + Drop trait,编译器在编译期就精确知道每个对象的析构时机,把 free / drop 当成普通函数调用嵌入到机器码——运行时 GC 不存在。
Rust took an extreme path: fold GC entirely into the compiler. Via ownership + borrow checker + Drop trait, the compiler statically knows every object's destruction time and bakes free / drop into machine code as regular function calls — runtime GC does not exist.
This is industrial linear-type theory layered on C++'s RAII. Cost: the programmer must explain ownership for every object — the root cause of Rust's steep learning curve. Reward: zero runtime overhead, zero memory fragmentation (malloc-controlled), zero GC pauses, predictability near perfection.
◇ 主线在 Rust 里◇ The main line in Rust
Rust 1.85 · /tmp/mainline.rsmeasured
use std::time::Instant;struct Item { n: i32 }fn main() {let t0 = Instant::now();let list: Vec<Box<Item>> = (0..1_000_000) .map(|i| Box::new(Item { n: i })) .collect(); println!("alloc: {:?}", t0.elapsed());let t1 = Instant::now(); drop(list); // ⭐ THE LINE println!("drop: {:?}", t1.elapsed());// Or just let it go out of scope — same compile-time-determined free}
output · rustc 1.85 -O · Apple Siliconreal run
alloc: 15.2 ms // ⭐ 7× faster than CPython/Luadrop: 9.4 ms // ⭐ deterministic, no async work// drop() calls Vec::drop which calls Item::drop on each — all compile-time-determined.// LLVM may inline these calls. There is no runtime GC subsystem in the binary.
主线 4 问 · Rust ownership
Main-line 4 questions · Rust ownership
Q
问题Question
Rust 1.85
Q1
when freed?
⭐ 编译期已知 · drop 处或离开 scope 时同步 free⭐ compile-time-determined · at drop or scope exit, synchronous
Q2
STW?
0 ms · drop 本身是普通函数调用,可被打断0 ms · drop is a regular function, interruptible
Q3
CPU?
0%(没有 GC 子系统) · 只有 malloc/free 自身成本0% (no GC subsystem) · just malloc/free's own cost
Q4
heap shape?
jemalloc/mimalloc 控制 · 通常碎片化低controlled by jemalloc/mimalloc · usually low fragmentation
什么 Drop 看起来像 · 编译期 vs 运行时compiler magic
// You write:fn example() {let v = vec![1, 2, 3];} // ⭐ here, v.drop() is called automatically by compiler-emitted code// LLVM IR (-O3 -emit-llvm) effectively:fn example() {let ptr = __rust_alloc(12); // malloc *(ptr as *mut i32) = 1; ... // store 2, 3 __rust_dealloc(ptr, 12); // ⭐ free, baked at compile time}// The crucial property: the COMPILER analyzed when v could possibly// be last-used (borrow checker), then emitted the deallocation call there.// No reference counting, no marking, no scanning — just function calls.
真 cargo asm · "drop 在编译产物里"不是比喻
Real cargo asm · "drop in the binary" isn't a metaphor
__ZN9drop_demo8use_item17h78541bec4a8681c4E: stp x20, x19, [sp, #-32]! ; save callee-saved regs stp x29, x30, [sp, #16] add x29, sp, #16 ldr w19, [x0] ; ⭐ load it.n into w19 mov w1, #4 ; size = 4 (sizeof Item) mov w2, #4 ; align = 4 bl ___rust_dealloc ; ⭐ CALL THE DEALLOCATOR add w0, w19, #1 ; return n + 1 ldp x29, x30, [sp, #16] ldp x20, x19, [sp], #32 ret// 11 instructions total. The bl __rust_dealloc is THE FREE — compiled in,// no runtime GC subsystem, no refcount header, no mark bit. The borrow// checker decided at compile time that `it` could not possibly outlive// this function, then emitted dealloc at the last possible PC where the// load of `it.n` is no longer needed (NLL — non-lexical lifetimes).
NLL · borrow checker 怎么决定 drop 在哪
NLL · how the borrow checker decides where drop goes
Rust 2018 之前用词法作用域决定生命周期(即"从 let 到 闭花括号")。NLL(Non-Lexical Lifetimes,2018)改成基于使用的生命周期:
Before Rust 2018, lifetimes were lexical — "from let to closing brace". NLL (Non-Lexical Lifetimes, 2018) changed lifetimes to use-based:
NLL example · 借用何时结束2018+ behaviour
fn example() {let mut v = vec![1, 2, 3];let first = &v[0]; // immutable borrow starts println!("{}", first); // ⭐ NLL: borrow ENDS here (last use) v.push(4); // mutable borrow — OK because immutable ended} // v dropped here · compiler emits Vec::drop// Pre-NLL (2017 Rust): immutable borrow lasts until end of function// scope, so v.push() would have been a compile error. NLL examines actual// use points and frees the borrow at last use — much less rigid.
Drop glue · 嵌套结构怎么递归 free
Drop glue · how nested structures recursively free
复杂对象(如 Vec<Box<HashMap<String, Vec<u8>>>>)的析构是编译期生成的递归调用链,叫 drop glue:
Complex objects like Vec<Box<HashMap<String, Vec<u8>>>> are destructed via a compile-time-generated recursive call chain — drop glue:
drop glue · what the compiler generates for Vec<Box<HashMap<String, Vec<u8>>>>conceptual
// when the outer Vec is dropped:drop_in_place::<Vec<Box<HashMap<..>>>>:for i in 0..self.len: drop_in_place::<Box<HashMap<..>>>(&self.ptr.add(i)) // recurse drop_in_place::<HashMap<String, Vec<u8>>>(box.deref_mut())for (k, v) in self.entries: drop_in_place::<String>(&k) // → dealloc string buffer drop_in_place::<Vec<u8>>(&v) // → dealloc bytes dealloc(self.entries.ptr) // → dealloc table dealloc(box.ptr) // → dealloc box itself dealloc(self.ptr) // → dealloc outer Vec buffer// Each "drop_in_place::<T>" is a monomorphized function the compiler emits// once per concrete T. LLVM frequently inlines short ones. The whole tree// is determined at compile time — no GC, no runtime type inspection.
DESIGN · "Rust 没 GC" 不完全准确DESIGN · "Rust has no GC" isn't fully accurateRust 主流代码停留在 Level 0-1,真正没有运行时 GC。但 Level 5(Servo 的 Heap<T>)是完整 tracing GC——用于桥接 SpiderMonkey 的 JS 对象 + DOM 节点(这些天然是循环图,refcount 不够)。这种 5 级 opt-in 阶梯是 Rust 设计哲学的核心:"零成本是默认,但允许你逐级付钱换便利"。C / Zig 没 GC 也没 opt-in;Go / Java 永远要付 GC 钱即使你不需要——Rust 的弹性两边都没有。Mainstream Rust stays at Levels 0-1, genuinely no runtime GC. But Level 5 (Servo's Heap<T>) is a full tracing GC — used to bridge SpiderMonkey's JS objects + DOM nodes (which are naturally cyclic graphs; refcount alone leaks). This 5-level opt-in ladder is core to Rust's design philosophy: "zero cost is the default; you pay incrementally for convenience". C / Zig have no GC and no opt-in. Go / Java always pay for GC even if you don't need it. Rust's elasticity is unique.
The previous 9 chapters each covered one "pure" algorithm — but real-world production GCs are almost all hybrid. Reason: every pure algorithm has a fatal flaw; hybrids complement each other.
Three most common hybrids:
Refcount + cycle collector: refcount is fast on the hot path but leaks cycles → cycle detector as backstop. CPython (PEP 442), PHP Zend, QuickJS (Bacon-Rajan trial deletion), Perl all use this.
Generational + concurrent: young uses Cheney for fast copying; old uses concurrent mark-sweep for no-pause. V8 Orinoco, JVM G1 / ZGC.
Three+ regions: .NET CLR has Gen0/Gen1/Gen2 + LOH (Large Object Heap, >85KB objects managed separately to avoid breaking generational logic). Java G1 region-based further splits the heap into 1-32MB regions, switching young/old/humongous on demand.
CPython PEP 442 · refcount + cycle detector 实战
CPython PEP 442 · refcount + cycle detector in practice
155/****** Garbage collector **********/ 157/* GC information is stored BEFORE the object structure. */ 158typedef struct { 159// Tagged pointer to next object in the list. 160// 0 means the object is not tracked 163 _Py_ALIGNED_DEF(_PyObject_MIN_ALIGNMENT, uintptr_t) _gc_next; 165// Tagged pointer to previous object in the list. 166// Lowest two bits are used for flags documented later. 168uintptr_t _gc_prev; 169 } PyGC_Head;// 16 bytes total. Prepended by _PyObject_GC_New only on container types// (list, dict, set, tuple-with-refs, class, ...). _Py_AS_GC(op) recovers it// by subtracting sizeof(PyGC_Head) from the object pointer.
CPython · Python/gc.c:397 · update_refs · phase 1 of trial-deletionverbatim, the body
397update_refs(PyGC_Head *containers) { 398 PyGC_Head *next; 399 PyGC_Head *gc = GC_NEXT(containers); 402while (gc != containers) { 403 next = GC_NEXT(gc); 404 PyObject *op = FROM_GC(gc); 405if (_Py_IsImmortal(op)) { // immortal: skip 407_PyObject_GC_UNTRACK(op); 408 gc = next; continue; 411 } 412gc_reset_refs(gc, Py_REFCNT(op)); // ⭐ gc_refs := ob_refcnt 413/* Python's cyclic gc should never see an incoming refcount of 0:if something decref'd to 0, it was deallocated immediately. */ gc = next; } }// Phase 2 (subtract_refs · gc.c:~440): for each outbound pointer of every// container, decrement the target's gc_refs. Phase 3 (move_unreachable):// objects with gc_refs > 0 are reachable from outside; gc_refs == 0 means// only inside-cycle refs remain — move them to the "unreachable" list.// Phase 4 calls __del__, phase 5 frees. QuickJS Ch19 uses the same algorithm// (Bacon-Rajan trial deletion) — see Ch16 for the algorithm card.
This hybrid covers 99% of Python's GC behaviour: refcount handles 99% of objects (each dec frees synchronously); cycle detector cleans up the 1% in cycles when heap-growth triggers. Our Ch06 28ms synchronous drop measured the refcount part; with cycles, a ~50ms cycle-detection pass joins it.
现代生产 GC 几乎全是混合
Modern production GCs are nearly all hybrid
运行时Runtime
混合方案Hybrid scheme
CPython
refcount + PEP 442 cycle collector (3 generations of containers)
per-domain minor (Cheney) + shared major (mark-sweep) + parallel
FIELD NOTE · "混合" 是规则,"纯" 是教学概念FIELD NOTE · "hybrid" is the rule; "pure" is a teaching abstraction教科书写 "Cheney 1969 复制式 GC" 是原理。现实里没有一个生产 GC 是 "纯 Cheney"——总是要加分代、加并发、加 humongous 处理、加 finalizer。本书前面 9 章把 11 个家族拆开讲是为了让你能 debug 真实系统——当你看 ZGC 的 log 说"major collection"时,你知道里面有 colored pointer(Ch13)、有 generational(Ch09 新加的)、有 incremental 并发标记(Ch10/Ch11)、有 mark-compact(Ch07)。所有概念的拼接就是现代 GC。Textbooks say "Cheney 1969 copying GC" as theory. No production GC is "pure Cheney" — there's always generational, plus concurrency, plus humongous handling, plus finalizers. The first 9 chapters decomposing 11 families is so you can debug real systems. When ZGC's log says "major collection", you know that contains colored pointers (Ch13), generational (added in 2023, Ch09), incremental concurrent marking (Ch10/Ch11), mark-compact (Ch07). Modern GC is the assembly of all these primitives.
CHAPTER 16
学术 / 未量产 — 论文里的 GC
Academic / unshipped — GCs in papers
MMTk · LXR · Pauseless · 商业 / 研究的边界
MMTk · LXR · Pauseless · the commercial / research frontier
前面 11 章讲的都是量产 GC——你今天能在 JVM / Node / Go 里跑的。本章扫一眼论文里有但量产很少的方向:
The previous 11 chapters cover production GCs — things you can run in JVM / Node / Go today. This chapter surveys what's in papers but rarely in production:
项目Project
是什么What
现状Status
MMTk (Memory Management Toolkit)
通用 GC 框架 · Rust 写的 · 主张GC 应该作为一个独立组件general-purpose GC toolkit in Rust · "GC should be a standalone component"
学术原型 + 集成 OpenJDK / V8 / Julia 实验性academic prototype + experimental OpenJDK / V8 / Julia integration
The Garbage Collection Handbook (Jones · Hosking · Moss)
⭐ 这本书的学术源头 · 700 页⭐ the academic source of this article · 700 pages
2011 / 2023 第二版2011 / 2023 second edition
Bacon-Rajan 算法 · 整本书引用 4 次的算法 · 终于讲清楚
Bacon-Rajan algorithm · the algorithm cited 4 times in this article · finally explained
David Bacon + V. T. Rajan 2001 年 IBM Watson 提出了 "concurrent cycle collection in reference counted systems"——后来 CPython(Ch15)、PHP Zend、QuickJS(Ch19 of the QuickJS article)都结构上采用了它。本书前面 4 处都说"Bacon-Rajan trial deletion"但都没展开。这里补上:
David Bacon + V. T. Rajan at IBM Watson, 2001: "Concurrent Cycle Collection in Reference Counted Systems" — adopted structurally by CPython (Ch15), PHP Zend, and QuickJS (Ch19 of the QuickJS article). This article cited "Bacon-Rajan trial deletion" four times without unpacking it. Here's the explainer:
Bacon-Rajan 2001 · 5-color trial deletion · the algorithmalgorithm card
// Setup: every refcounted object carries a COLOR field (2 bits) in// addition to refcount. Colors: BLACK · GRAY · WHITE · PURPLE · ORANGE// ↑ candidate roots// On every refcount decrement (not just hitting zero):dec_ref(obj): obj.refcount -= 1if obj.refcount == 0: release(obj) // free + cascading dec on childrenelse if obj.color != PURPLE: // ⭐ POTENTIAL cycle root obj.color = PURPLE // "this object survived a dec but didn't hit 0" add_to_roots_buffer(obj) // queue for cycle scan// When the roots buffer fills (default ~10k entries), run cycle detection:collect_cycles():// PHASE 1: mark gray · subtract internal edges (trial deletion)for root in roots_buffer:if root.color == PURPLE: mark_gray(root) mark_gray(obj): // DFSif obj.color != GRAY: obj.color = GRAYfor child in obj.children: child.refcount -= 1// ⭐ TRIAL DELETE the edge mark_gray(child)// PHASE 2: scan · refs > 0 means reachable from OUTSIDE the gray graphfor root in roots_buffer: scan(root) scan(obj):if obj.color == GRAY:if obj.refcount > 0: scan_black(obj) // reachable from outside · restore countselse: obj.color = WHITE // only inside-cycle refs → free candidatefor child in obj.children: scan(child) scan_black(obj): // undo the trial delete: restore counts obj.color = BLACKfor child in obj.children: child.refcount += 1if child.color != BLACK: scan_black(child)// PHASE 3: collect · WHITE objects are unreachable cyclesfor root in roots_buffer: collect_white(root) // free + recurse
关键洞察:每次 dec 都染 PURPLE,但只有偶尔(buffer 满)跑一次 trial-deletion。这样把循环检测的代价从"每次 dec 都付"摊销到"buffer 满才付",平均到每次 dec 只多了 ~2 ns。5 种颜色(vs Dijkstra 三色)允许算法区分:"仅 dec 过的可疑对象"(PURPLE)/ "trial-deletion 中" (GRAY) / "确认死了" (WHITE) / "确认活着" (BLACK) / "已被前一轮处理" (ORANGE)。
Key insight: every dec stains the object PURPLE, but only occasionally (when the buffer fills) does the algorithm actually run trial-deletion. This amortises cycle detection from "pay on every dec" to "pay only when the buffer fills", averaging ~2 ns added to each dec. The 5 colors (vs Dijkstra's 3) let the algorithm distinguish: "candidate via dec but didn't hit 0" (PURPLE) / "in trial-deletion" (GRAY) / "confirmed dead" (WHITE) / "confirmed live" (BLACK) / "already processed this round" (ORANGE).
This article mentions Bacon-Rajan 4 times: Ch06 (refcount comparison table), Ch15 (hybrid design), Ch19 QuickJS (the article's sibling), this chapter — now you have the full algorithm card.
FIELD NOTE · GC 研究的当下方向FIELD NOTE · current research directions in GC1. NUMA-aware GC · 大型服务器有多个内存域,GC 要避免跨域访问 2. 异构内存(HBM / Persistent Memory) · 把冷对象迁移到大容量慢内存 3. ML-guided GC · 用模型预测对象寿命,自动调 nursery size 4. Linear types in mainstream languages · Vale, Carbon, Mojo 都在探索 5. GC-aware schedulers · 让 OS scheduler 知道 GC 阶段,避免 GC 期间 preempt mutator
1. NUMA-aware GC — big servers have multiple memory domains; GC should avoid cross-domain access 2. Heterogeneous memory (HBM / Persistent Memory) — migrate cold objects to slower but larger memory 3. ML-guided GC — use models to predict object lifetimes, auto-tune nursery sizes 4. Linear types in mainstream languages — Vale, Carbon, Mojo all explore 5. GC-aware schedulers — let the OS scheduler know GC phases, avoid preempting mutators during GC
CHAPTER 17
三维对比矩阵 — 20+ 个 GC 实测排排坐
3-axis matrix — 20+ GCs measured, ranked
pause × throughput × memory · 同一基准
pause × throughput × memory · same benchmark
GC
实现Family
max STW
CPU 开销CPU cost
堆 overheadHeap overhead
堆大小Heap size
ZGC (gen, JDK 21+)
colored ptr
<1 ms
15-25%
1.2-1.5×
8MB - 16TB
Shenandoah
Brooks ptr
<1 ms
15-25%
1.3-1.5×
1MB - 1TB
Azul C4 (commercial)
colored ptr
<100 µs
10-20%
1.2×
100MB+
Erlang BEAM
per-process
0 (global) · ~0.1ms (per-proc)
1-3%
per-process small
scaled by procs
Go GC
tri-color concurrent
<1 ms (sub-ms)
15-25%
2× (GOGC=100)
any
JVM G1 (default)
gen + region
10-200 ms (target)
5-15%
1.2-1.5×
1GB - 100GB
JVM Parallel (throughput)
gen STW
100-1000 ms
3-8%
1.5-2×
any
JVM CMS (deprecated)
concurrent + STW
10-50 ms
10-20%
1.5×
1-100GB
V8 Orinoco (Node, Chrome)
gen + concurrent
1-30 ms
5-15%
1.5×
up to 4GB
JSC Riptide (Safari)
concurrent
1-5 ms
10-20%
1.5×
up to 4GB
Hermes HadesGC (RN)
concurrent
<1 ms
10-15%
1.4×
up to 1GB (mobile)
.NET CLR (Workstation)
gen + LOH
5-50 ms
5-15%
1.5×
any
.NET CLR (Server)
gen + parallel + BG
10-200 ms
3-10%
1.3-1.5×
up to 256GB
OCaml 5 (multicore)
per-domain gen
1-10 ms
5-10%
1.3×
any
GHC (Haskell)
gen, block-structured
10-100 ms (large heap)
5-15%
2×
any
Ruby 3 (CRuby)
incremental gen + compaction
10-100 ms
5-15%
1.3×
any
Julia 1.10+
incremental gen
5-50 ms
8-15%
1.5×
any
Lua 5.4 (incremental)
mark-sweep + inc
<1 ms slice
10-30%
1.2×
any
CPython (refcount)
refcount + cycle
varies (synchronous)
amortised 5-15%
1.1× (compact)
any
Swift ARC
refcount opt-cycle
0 ms async
3-10% (compiled inline)
1.05×
any
QuickJS
refcount + Bacon-Rajan
0 ms async
amortised + cycle pass
1.05×
1MB - 1GB
Rust (no GC)
compile-time
0 ms (none)
0%
malloc-only
any
Zig (no GC)
explicit allocator
0 ms
0%
malloc-only
any
没有 GC 同时拿下三轴最优 · 选 GC 就是选你愿意输哪一个No GC scores top on all three axes · picking a GC means picking which axis you'll lose
CHAPTER 18
生产事故簿 — GC 把公司逼疯的真实案例
Production stories — real cases that drove companies mad
Discord · Twitter · Pinterest · LinkedIn · real migrations and lessons
①
Discord · Go → Rust(2020)· 因为 Go GC 给消息推送服务每 2 分钟带来 250ms 暂停
Discord · Go → Rust (2020) · because Go GC caused 250ms pauses every 2 minutes in their Read States service
Discord 用 Go 写的 Read States 服务(追踪每个用户在每个频道的最后读时间)持有巨大堆(数百万用户 × 数千频道)。Go GC 默认 GOGC=100 触发频繁 STW;从 250ms 降到几毫秒后仍不够。改写成 Rust 后P99 延迟降低 100×、内存降低 10×。原文:"The Rust version's tail latencies are absurdly better. It is hard to take Go seriously anymore." 教训:GC 适合典型对象生命周期的场景。对极长寿命大集合(cache),GC 没有便宜的解。
Discord's Read States service (tracking each user's last-read timestamp per channel) held a huge heap (millions of users × thousands of channels). Go GC at GOGC=100 caused frequent STW; even after tuning, latency stayed high. Rewriting in Rust dropped P99 latency by 100× and memory by 10×. Their post-mortem said: "The Rust version's tail latencies are absurdly better. It is hard to take Go seriously anymore." Lesson: GC suits typical-lifetime workloads. For very-long-lived large collections (caches), GC has no cheap answer.
Early Twitter ran Ruby on Rails. GC + interpreter created a double bottleneck — the "fail whale" was daily. Migrating to Scala on JVM (parallel GC) gave 10× throughput, 5× latency improvement. Cost: occasional large-heap JVM GC pauses, but acceptable. Lesson: refcount + interpreted languages don't scale to extreme concurrency. JVM throughput GC trades larger pauses for higher throughput.
Pinterest's Python services saw P99 spike during GC cycles. Diagnosis: cycle detector triggered every 700 allocations by default; their web framework allocated massive amounts of short-lived objects per request. Fix: gc.set_threshold(100000, 50, 50) — only trigger cycle detection when truly needed. P99 dropped 30%. Lesson: refcount + cycle-detector thresholds are allocation-rate-tuned, not "run once per frame".
LinkedIn's Kafka brokers used JVM Parallel GC initially; major STW pauses 50-200ms caused leader re-election and replica resyncs. Switching to G1 dropped P99 by 80%, letting them shed 30% of servers. Lesson: service discovery / replication is extremely sensitive to brief connection timeouts — the GC choice cascades far beyond direct perf numbers.
⑤
游戏行业 · UE4 嵌入 V8 失败 · 改用 QuickJS / Lua
Game industry · UE4 embedding V8 failed · switched to QuickJS / Lua
Games have a strict 16.6ms (60Hz) per-frame budget. V8's minor GC occasionally costs 1-5ms — exactly enough to drop a frame. Multiple studios tried embedding V8 in UE4 and gave up on GC unpredictability. Fix: switch to QuickJS (refcount, 0 STW) or Lua (incremental, 1ms slice). Lesson: real-time workloads should not use tracing GC — pick refcount or incremental for predictable pauses.
FIELD NOTE · 5 个故事的共同模式FIELD NOTE · the common pattern这 5 个案例每个都不是 GC 本身的 bug——而是团队选错了 GC。Discord 用 Go 是因为想要简洁,没意识到大缓存破坏了"对象死得快"假设。Twitter 选 Ruby 是因为开发快,没料到吞吐量。游戏选 V8 因为生态成熟,忘了实时性。选 GC 比选编程语言更长期影响项目生死——GC 一旦深入业务代码,迁移成本是整个产品重写。下一章 Ch19 给出决策树。These 5 stories share a pattern — none was a GC bug; each was the wrong GC choice. Discord chose Go for simplicity, missing that big caches violate the "young dies fast" hypothesis. Twitter picked Ruby for dev speed, didn't anticipate throughput. Games picked V8 for ecosystem, forgot real-time. GC choice impacts a project's life-or-death more durably than language choice — once GC behaviour seeps into business code, migration costs an entire product rewrite. Ch19 gives a decision tree.
CHAPTER 19
调优决策树 — 别选错了 GC
Tuning decision tree — don't pick the wrong GC
5 个问题决定你的 GC
5 questions determine your GC
看完前面 18 章,你应该能回答下面这 5 个问题。它们组成一棵决策树,从问题出发反推 GC:
After 18 chapters, you should be able to answer the 5 questions below. They form a decision tree that derives the right GC from your problem:
Yes (LRU, session tables, ML models) → Rust (no GC) orJVM G1 humongous regionorErlang ETS. Avoid Go (violates the gen hypothesis — Discord's experience). No (short-lived objects dominate) → any generational GC (V8, JVM) is efficient.
No (typical application team) → pick "default is fine": Go, .NET, V8 / Node. They're near-optimal out of the box. Yes (systems team / SRE) → JVM (10+ tunable flags) / Erlang (per-process heap) / Rust (own allocator) — large tuning surface, requires expertise.
FIELD NOTE · 决策的反模式FIELD NOTE · anti-patterns in this decision反模式 1:基于 benchmark 选 GC。一个微观 benchmark 显示 GC X 比 GC Y 快 20% 是常见的——但在你真实业务负载下可能反过来。用真实负载,跑至少 1 小时,看 P99/P99.9,不要只看 throughput。 反模式 2:调 GC 没改业务。如果你的 web 服务每秒分配 1GB 临时对象,任何 GC 都救不了你——先减分配(reuse buffer、pool 对象、cache lifetime)。GC 调优是最后一公里,不是第一公里。 反模式 3:迁移语言只为 GC。除非你已经撞墙(Discord 那种),不要因为"听说 X 的 GC 好" 重写。语言生态、团队熟悉度、迁移成本远超 GC 收益。
Anti-pattern 1: pick GC by benchmark. Micro-benchmarks showing GC X is 20% faster than Y are common — your real workload may reverse it. Test with your real load for at least 1 hour, watch P99/P99.9, not just throughput. Anti-pattern 2: tune GC without changing app. If your web service allocates 1 GB per second of short-lived objects, no GC will save you — reduce allocation first (buffer reuse, object pools, cache lifetime). GC tuning is the last mile, not the first. Anti-pattern 3: migrate language for GC alone. Unless you've hit a wall (Discord-class), don't rewrite because "X's GC is supposedly better". Ecosystem, team familiarity, migration cost dwarf GC gains.
In 2024 we sit at an interesting point on the GC evolution curve:
Generational ZGC becomes default (JDK 21+) → JVM's sub-ms STW for large heaps finally no longer trades performance for memory; throughput is back to 90% of Parallel GC. Most Java services no longer need to choose a GC.
Rust goes mainstream → systems software (Linux kernel, parts of Chromium) increasingly native Rust; the new generation of programmers absorbs the no-GC learning curve.
Linear types in mainstream languages → Vale's "generational references", Carbon's "lifetime annotations", Mojo's "ownership" all try to relax Rust's strictness while keeping no-GC.
ML-guided GC → Google / Meta already use internal models to predict object lifetimes and auto-tune nurseries. Academia will mainstream this within 5 years.
Heterogeneous memory (CXL) → GC must migrate cold objects to large slow memory (CXL Memory Pool), keep hot in DRAM. Next-gen GCs must be NUMA + tier-aware.
WebAssembly GC proposal (Phase 4, 2024) → Wasm added native GC, making GC'd languages (Java, Python, Go) run faster in browsers. Dart was the first mainstream language to ship a production compiler (dart2wasm) targeting Wasm-GC; Kotlin/Wasm, Scala.js, and others are following.
「McCarthy 写下第一行 mark-sweep 时, 计算机内存以 KB 算。 今天 ZGC 跨 16 TB 堆 停顿 < 1 ms。 但"找出死了的内存"这个问题 永远不会有完美答案—— 这本书讲的是 过去 65 年的近似。」"When McCarthy wrote the first mark-sweep, computer memory was measured in KB. Today ZGC spans 16 TB heaps with < 1 ms pauses. But 'find what's dead' will never have a perfect answer — this article was about the past 65 years of approximations."
— FIELD NOTE 08
McCarthy 在 1959 年说
"让程序自动管内存"。
六十多年了,
我们还在讨论怎么管。
McCarthy said in 1959:
"let programs manage memory themselves".
Sixty-plus years later,
we're still debating how.