FIELD NOTE / 09 LLM 推理工程 LLM Inference 2026

一次 LLM 推理
的一生。

The life of one
LLM inference.

tokenizer → embedding → KV cache → attention → MoE / MLA → logit → sampling

你按下回车之后那 200 毫秒里到底发生了什么？把 5 个 token 拽进 llama.cpp 的肠子里走一遍——28 个站,每站一段真源码。

What actually happens in the 200 ms after you hit Enter? Drag five tokens through llama.cpp's intestines — 28 stations, each one a slice of real source.

AUTHOR: Airing | TOPIC: llama.cpp · vLLM · Transformer · KV cache · MoE · MLA | FORMAT: Long Read

这篇要解决什么 What this is for

这不是"Transformer 教程"。市面上讲 attention 的文章足够多——画一张 Q·Kᵀ 的热力图，写一个 100 行的 toy GPT。但真到了线上，一个 prompt 怎么变成一个回答这件事，要踩过 tokenizer、scheduler、KV cache、graph builder、ggml backend、sampler 至少 18 层抽象——每一层都有自己的"为什么这么写"。这篇文章拿 llama.cpp 当主线代码，逐行解释那 18 层抽象。

This is not yet another Transformer tutorial. The internet has plenty of "draw a softmax(QKᵀ) heatmap, code up a 100-line toy GPT" pieces. The real question — how does a prompt become a reply in production — runs through at least 18 layers of abstraction: tokenizer, scheduler, KV cache, graph builder, ggml backend, sampler. Each layer has its own "why-it-is-this-way". This piece uses llama.cpp as the through-line and reads those 18 layers line by line.

主线 · 4 句话总结 Through-line · in 4 sentences

STEP 1

切字节

Tokenize

BPE · llama_tokenize

STEP 2

堆隐空间

Hidden state

32 layers · attention + FFN

STEP 3

投回词表

Project

LM head · 128k logits

STEP 4

掷骰子

Sample

top-p · temperature

已经懂 Transformer? 直接跳到 → KV cache · MLA · MoE · 量化 · 投机解码 · Continuous batching · TP/PP/EP · Constrained decoding Know Transformers already? Skip to → KV cache · MLA · MoE · Quantization · Speculative · Continuous batching · TP/PP/EP · Constrained decoding


CHAPTER 01 · PROLOGUE

三眼看 `"你好,llama"`

Three eyes on `"hello, llama"`

同一段 prompt,三种世界,三种语言

one prompt, three worlds, three languages

所有讲 LLM 推理的文章都会从同一处败下阵来——它们在第三段就跳到了"attention 是 Q·Kᵀ"。但你脑子里那串字符,跟显卡上跑的那段矩阵乘法,中间隔着至少三种语言。这三种语言不打通,attention 怎么解释都像在背公式。

把 "你好,llama" 这 8 个字符摆出来,用三种眼睛去看它,你会发现整篇文章接下来要做的事情,本质上就是在这三种语言之间反复翻译。

Every "LLM inference" tutorial loses you at the same spot — paragraph three jumps to "attention is just Q·Kᵀ". But the string of characters in your head and the matrix multiplications running on your GPU are separated by at least three different languages. Until those three are connected, attention will keep sounding like a memorized formula.

Lay "hello, llama" out, look at it through three different eyes, and you'll see what this whole article is really about: translating back and forth between those three languages.

眼睛 / Eye	看到的东西 / What it sees	尺寸 / Shape
人 / Human (utf-8 string)	`"你好,llama"` · 8 chars / 12 bytes	[1] · str
Token (BPE id)	`[128000, 47045, 50739, 11, 9091, 64]` = <|begin|> · 你 · 好 · , · ll · ama	[6] · int32
Tensor (embedding)	`[[0.012, -0.045, ..., 0.083], ...]` · 6 行 × 4096 列 / 6 rows × 4096 cols, fp16	[6, 4096] · fp16

FIG. 01 同一段 prompt 的三种"形状"。人眼看到 8 个字符(12 字节);tokenizer 切完是 6 个 id(Llama-3 词表 128k,每个 id 是个 int32);embedding 查完是 6 × 4096 的 fp16 矩阵,也就是 6 × 4096 × 2 = 49,152 字节 = 48 KiB。这 48 KiB 就是整个推理"输入"的真实重量。 Same prompt, three shapes. Human eye: 8 characters (12 bytes). Tokenizer: 6 BPE ids (Llama-3 vocab 128k, each an int32). Embedding: a 6×4096 fp16 matrix, i.e. 6 × 4096 × 2 = 49,152 bytes = 48 KiB. That 48 KiB is the real weight of "input" to the model.
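The three shapes can be checked with a few lines of arithmetic. This is a minimal sketch: the token ids are copied from the table rather than produced by a real tokenizer, and the hidden size of 4096 and 2-byte fp16 elements are the figures the article itself uses.

```python
# "你好,llama" written with escapes so the exact code points are unambiguous
PROMPT = "\u4f60\u597d,llama"

n_chars = len(PROMPT)                   # Unicode code points seen by a human
n_bytes = len(PROMPT.encode("utf-8"))   # bytes the tokenizer actually receives

# Ids copied from the article's trace; this snippet does not tokenize anything.
token_ids = [128000, 47045, 50739, 11, 9091, 64]

N_EMBD = 4096           # hidden size used throughout the article
BYTES_PER_FP16 = 2
tensor_bytes = len(token_ids) * N_EMBD * BYTES_PER_FP16

print(n_chars, n_bytes, len(token_ids), tensor_bytes)   # 8 12 6 49152
```

Each Chinese character costs 3 bytes in UTF-8 while each ASCII character costs 1, which is why 8 characters become 12 bytes.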

每跨一道门,你失去一点信息,得到一点结构

Each door costs information, buys structure

从"人眼"到"tensor"这两次翻译都是有损的,但每一次都把"东西"变得更可计算:

人 → token:你失去了字面边界。"llama" 不再是一个词,而是 "ll" + "ama" 两个 BPE 片段。模型不会"读 llama 这个英文单词",它在处理两个 id 的组合。
token → tensor:你失去了离散性。47045 这个 id 进了 embedding 表之后,变成 4096 维空间里的一个点。从这一刻起,模型干的所有事情都是在这个 4096 维空间里走。

反过来,出口的时候是逆翻译:tensor → token → 字符。第 13 章的 LM head 把最后一个 4096 维向量投射回 128k 维的 vocab logits,第 14 章的 sampler 挑出一个 id,第 3 章的逆函数 llama_detokenize 把这个 id 翻回字符串。整个推理就是这两次翻译的来回。

Both translations from "human" to "tensor" are lossy, and each one trades information for computability:

Human → token: you lose lexical boundaries. "llama" is no longer a word — it's "ll" + "ama", two BPE pieces. The model isn't "reading the English word llama"; it's processing a combination of two ids.
Token → tensor: you lose discreteness. The id 47045, once it hits the embedding table, becomes a point in a 4096-dim space. From this moment, everything the model does is movement inside that 4096-dim space.

The exit is the inverse: tensor → token → chars. Chapter 13's LM head projects the final 4096-vector back to 128k vocab logits; Chapter 14's sampler picks one id; Chapter 3's inverse llama_detokenize turns the id back into a string. The whole inference is just these two translations, in and out.

为什么这样开局WHY START HERE 后面 17 章每一章都在解释"从一种眼睛切换到另一种眼睛"具体怎么发生。第 3 章是人→token;第 4 章是token→tensor;第 6-12 章是tensor 在 4096 维空间里走的 32 层楼梯;第 13-15 章是tensor→token→人的逆翻译。这个框架记住了,剩下都是细节。 Each of the next 17 chapters explains one specific transition between these eyes. Ch.3 is human→token; Ch.4 is token→tensor; Ch.6–12 is the tensor walking its 32 flights of stairs in 4096-dim space; Ch.13–15 is tensor→token→human, the reverse. Hold this skeleton in your head and the rest is detail.

CHAPTER 02 · PROLOGUE

Prefill vs Decode — 一次推理的两条命

Prefill vs Decode — one inference, two lives

同一个 forward,两种工作负载

same forward, two completely different workloads

"一次 LLM 推理"听起来像一个步骤——模型 forward 一次,出一个回答。但实际上它有两条命,而这两条命在硬件上长得完全不一样,可以说是整个推理系统的核心矛盾:

Prefill:把 prompt 那 6 个 token 一次性全喂进模型,走 32 层 transformer,产出一个 hidden state 矩阵——准备好"第一个回答 token"。是矩阵 × 矩阵,算力密集(compute-bound)。
Decode:之后每生成一个 token,都得再 forward 一遍,但这一次只有 1 个 token进入模型。是矩阵 × 向量,带宽密集(memory-bandwidth-bound)。

这两条命在 llama.cpp 里都走同一个入口函数 llama_decode(实现在 src/llama-context.cpp),区别只在于这次喂进去的 batch 里有多少 token。表面上一致,内里完全是两种负载——KV cache 就是为了把它们粘起来才发明的(第 8 章详谈)。

"One inference" sounds like a single step — model forwards once, answer comes out. But in fact it has two lives, and these two lives look completely different on hardware. This is essentially the core tension of the whole inference system:

Prefill: shove all 6 prompt tokens through the 32 transformer layers at once, producing a hidden-state matrix — ready to predict "the first answer token". This is matrix × matrix, compute-bound.
Decode: for every subsequent token you forward again, but only 1 token enters the model this time. This is matrix × vector, memory-bandwidth-bound.

Both go through the same llama.cpp entry point — llama_decode in src/llama-context.cpp — differing only in how many tokens are in the batch. Identical surface, totally different physics. KV cache (Ch. 8) was invented exactly to glue these two regimes together.

	Prefill	Decode
输入 token 数 / tokens in	`N` (prompt 全长 / full prompt)	`1`
主算子 / main op	matmul · [N, d]×[d, d]	matvec · [1, d]×[d, d]
瓶颈 / bottleneck	FLOPs (compute)	DRAM bandwidth
典型 H100 利用率 / H100 util	≥ 70% MFU	5–15% MFU · 90% MBU
每 token 平均耗时 / latency per token	~1 ms (8K prompt)	~15–50 ms
KV cache 状态 / KV cache	写入 N 个 slot / writes N slots	读取 N+k 个 slot · 写入 1 个 / reads N+k slots, writes 1
用户感知 / user feel	TTFT (time to first token)	TPOT (time per output token)

FIG. 02 同一个模型、同一套权重,prefill 是算力题,decode 是内存题。这就是为什么"长 prompt + 短回答"和"短 prompt + 长回答"在同一台 GPU 上跑出的 token 速度天差地别——它们考的是不同的科目。 Same model, same weights — prefill is a compute test, decode is a memory test. That's why "long prompt + short reply" and "short prompt + long reply" hit totally different bottlenecks on the same GPU: they're two different exams.

为什么 decode 这么"贵"

Why decode is so "expensive"

反直觉但是真的:decode 每个 token 比 prefill 每个 token 慢一个数量级。原因是 GPU 的"计算饱和点":

一张 H100 SXM 的 fp16 算力是 ~990 TFLOPS,内存带宽是 ~3.35 TB/s。这两个数除一下,得到这张卡的"算术强度阈值" ≈ 295 FLOPs/byte——意思是你每从显存里搬 1 字节,只要做超过 295 次浮点运算,你就用满了算力。
Prefill:N 个 token 同时走 attention,Q 是 [N, d],K 也是 [N, d],QKᵀ 这一次 matmul 做 2·N²·d 次浮点运算,但只读 2·N·d 个数。算术强度 ≈ 每读一个数做 N 次运算(fp16 下约 N/2 FLOPs/byte),按这个粗算,prompt 有几百个 token 就能摸到 ~295 FLOPs/byte 的阈值。
Decode:每次只有 1 个 token 的 Q (是 [1, d]),要跟 KV cache 里的 N 个历史 K/V 做 matmul。这是 [1, d] × [d, N] = matvec。算术强度 ≈ 1——你做 1 次乘加就得读 1 个数。完全卡死在带宽上,算力根本喂不饱。

这就是所有现代推理优化的根源:第 8 章的 KV cache 是为了别重算、第 10-11 章的 GQA/MLA 是为了让 KV cache 更小、第 12 章的 MoE 是为了让 decode 的"有效参数"更少、第 17 章的 PagedAttention 是为了让 KV cache 装得更密、speculative decoding 是为了一次 decode 多个 token——它们全在跟"decode 是 memory-bound"作斗争。

Counterintuitive but true: each decode token is an order of magnitude slower than each prefill token. The reason is the GPU's "compute saturation point":

An H100 SXM does ~990 fp16 TFLOPS and has ~3.35 TB/s of HBM bandwidth. Dividing gives an arithmetic intensity threshold ≈ 295 FLOPs/byte — bring more than 295 ops per byte fetched and you saturate compute; less and you saturate bandwidth.
Prefill: N tokens go through attention together. Q is [N, d], K is [N, d]; the QKᵀ matmul does 2·N²·d FLOPs while reading only 2·N·d numbers. Intensity ≈ N FLOPs per number read (N/2 per byte in fp16), so by this rough count a prompt of a few hundred tokens already reaches the ~295 FLOPs/byte ridge.
Decode: only 1 token's Q ([1, d]) multiplies against the N history rows in the KV cache. That's [1, d] × [d, N], a matvec. Intensity ≈ 1: roughly one multiply-add per number fetched. You're dead in the water on bandwidth; the compute units starve.

This is the root of every modern inference optimization. KV cache (Ch.8) avoids redoing prefill on history. GQA/MLA (Ch.10–11) shrink the KV. MoE (Ch.12) shrinks the active parameter set per decode step. PagedAttention (Ch.17) packs KV cache denser. Speculative decoding emits more than 1 token per step. They are all fighting the same battle: decode is memory-bound.
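The roofline argument above can be reproduced numerically. The sketch below uses the H100 figures quoted in the text; the head dimension of 128, the 4096-token context, and the "8B parameters in fp16, streamed once per decode step" model are assumptions chosen for illustration, and the intensity function counts only Q and K reads.

```python
PEAK_FLOPS = 990e12     # H100 SXM fp16, FLOP/s (figure quoted in the text)
HBM_BW = 3.35e12        # bytes/s (figure quoted in the text)

ridge = PEAK_FLOPS / HBM_BW   # FLOPs per byte needed to saturate compute

def attn_intensity(n_query, n_ctx, d, bytes_per_el=2):
    """FLOPs per byte for Q[n_query, d] @ K[n_ctx, d]^T, counting only Q and K reads."""
    flops = 2 * n_query * n_ctx * d
    bytes_read = (n_query + n_ctx) * d * bytes_per_el
    return flops / bytes_read

prefill = attn_intensity(n_query=4096, n_ctx=4096, d=128)   # N/2 FLOPs per byte
decode = attn_intensity(n_query=1, n_ctx=4096, d=128)       # about 1 FLOP per byte

# Assumption: every decode step has to stream all weights of an 8B fp16 model once.
WEIGHT_BYTES = 8e9 * 2
min_decode_ms = WEIGHT_BYTES / HBM_BW * 1e3   # bandwidth-imposed floor per token

print(round(ridge, 1), prefill, round(decode, 3), round(min_decode_ms, 2))
```

Under these assumptions prefill sits well above the ridge and decode far below it, and the weight-streaming floor alone is a few milliseconds per token before any KV cache traffic is counted.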

src/llama-context.cpp · llama_context::decode() · prefill == decode in code

// One function handles both prefill and decode;
// the difference is just batch.n_tokens
int llama_context::decode(llama_batch & inp_batch) {
    // ...
    const uint32_t n_tokens_all = inp_batch.n_tokens;

    // 1. Split the batch into chunks of at most n_ubatch tokens
    llama_sbatch sbatch;
    sbatch.from_batch(inp_batch, ...);

    while (sbatch.n_tokens > 0) {
        llama_ubatch ubatch = sbatch.split_simple(n_ubatch);

        // 2. Find free slots in the KV cache for this ubatch
        kv_self->find_slot(ubatch);                  // → C08

        // 3. Run the graph: embedding → 32 × (attn + ffn) → LM head
        ggml_cgraph * gf = graph_build(ubatch);      // → C04..C13
        graph_compute(gf);

        // 4. Fetch logits / hidden state (consumed by the sampler downstream)
        extract_logits(...);                         // → C13, C14
    }
    return 0;
}

把握这个KEEP THIS IN MIND "prefill 用一次,decode 用很多次"——所以推理总时间 ≈ TTFT + n_out × TPOT。短回答里 TTFT 占大头(用户等"开始");长回答里 TPOT 占大头(用户等"说完")。这两个数据是不同 SKU 的服务在拼的两个赛道。 "Prefill runs once, decode runs many times" — so total wall time ≈ TTFT + n_out × TPOT. Short replies are TTFT-dominated (the user waits to start); long replies are TPOT-dominated (the user waits for the rest). These are the two metrics every inference SKU competes on.
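A quick worked example of the approximation total ≈ TTFT + n_out × TPOT. The latency values below are made-up illustrative numbers, not measurements.

```python
def wall_time_ms(ttft_ms, tpot_ms, n_out):
    # The text's approximation: total ≈ TTFT + n_out × TPOT
    return ttft_ms + n_out * tpot_ms

# Illustrative numbers only (assumptions, not benchmarks).
TTFT_MS, TPOT_MS = 800, 20

short_reply = wall_time_ms(TTFT_MS, TPOT_MS, n_out=10)     # 1000 ms
long_reply = wall_time_ms(TTFT_MS, TPOT_MS, n_out=1000)    # 20800 ms

ttft_share_short = TTFT_MS / short_reply
ttft_share_long = TTFT_MS / long_reply
```

With these numbers TTFT is 80% of the short reply but under 4% of the long one, which is the sense in which the two metrics dominate different workloads.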

CHAPTER 03 · INTAKE

Tokenizer · BPE 的字节舞

Tokenizer · the byte-pair waltz

人 → token · llama-vocab.cpp 第一站

human → token · the first station in llama-vocab.cpp

tokenizer 是整篇文章里唯一一个不在 GPU 上的环节,但它常常是第一处出问题的地方:同一段 prompt,Llama-3 tokenizer 切出 6 个 id,GPT-2 tokenizer 切出 9 个 id,DeepSeek tokenizer 又是另一组——而 KV cache 占用直接乘以这个数字。

llama.cpp 把所有 tokenizer 逻辑都塞进了 src/llama-vocab.cpp(早期版本叫 llama.cpp 主文件下半部分,后来拆分出来)。主入口是 llama_tokenize,核心算法是 BPE(Byte Pair Encoding)——"从字节出发,反复贪心合并出现频率最高的相邻对"。

The tokenizer is the only stage in this whole article that doesn't run on a GPU, yet it's often the first thing that goes wrong: the same prompt yields 6 ids under Llama-3, 9 under GPT-2, something else under DeepSeek — and KV cache scales linearly with that number.

llama.cpp puts all tokenizer logic in src/llama-vocab.cpp (in earlier versions it was the lower half of the main llama.cpp file, later refactored out). Entry point: llama_tokenize. Core algorithm: BPE (Byte Pair Encoding) — "start from bytes, greedily merge the most-frequent adjacent pair, repeat".


FIG. 03 · interactive 玩具版 BPE。改输入框就能看到 token 切分。注意 "llama" 会被切成 "ll" + "ama"(两个 id)而不是一个——BPE 的合并是基于训练语料的统计,而不是语义。这是玩具实现:真 BPE 的 trie 查找在 llama.cpp 里走 tokenize_with_pre_tokenizer(GPT-2 风格的正则切分)+ llm_tokenizer_bpe::tokenize(优先队列合并)。 Toy BPE. Edit the input and watch the split. Notice "llama" becomes "ll" + "ama" (two ids), not one — BPE merges are driven by training-corpus statistics, not semantics. This is a toy: real BPE in llama.cpp goes through tokenize_with_pre_tokenizer (GPT-2-style regex split) then llm_tokenizer_bpe::tokenize (priority-queue merge).

BPE 的两步:pre-tokenize · 然后贪心合并

Two steps: pre-tokenize · then greedy merge

真实 BPE 不是一次走完所有字节。Llama-3 / GPT-4 这类 tokenizer 都是两阶段:

Pre-tokenize:先用一段长正则(GPT-2 那段经典的 (?i:'s|'t|'re|'ve|...))把字符串切成"词块"——空格、标点、连续数字各自单独成块。这步避免了"跨标点的 BPE 合并"。
BPE merge:在每个词块内部,从单字节出发,反复找合并优先级最高的相邻对。"优先级"是训练时学出来的——每个 merge rule 有一个 rank,rank 越小越优先合并。

Real BPE isn't a single pass over the bytes. Llama-3 / GPT-4 family tokenizers run a two-stage process:

Pre-tokenize: split the string with a long regex (the classic GPT-2 (?i:'s|'t|'re|'ve|...)) into "chunks" — whitespace, punctuation, runs of digits each become their own chunk. This prevents cross-punctuation merges.
BPE merge: inside each chunk, start from single bytes and repeatedly find the highest-priority adjacent pair to merge. "Priority" is learned at training time — each merge rule has a rank, lower rank merges first.

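src/llama-vocab.cpp · llm_tokenizer_bpe::tokenize() · priority queue merge

// A pre-tokenized chunk comes in. Split it into symbols, push every adjacent
// pair into a heap ordered by merge rank, then repeatedly pop and merge the
// lowest-rank pair. (Simplified; "..." marks elided code.)
void llm_tokenizer_bpe::tokenize(const std::string & word, ...) {
    // symbols_ is the linked list of not-yet-merged pieces
    for (size_t i = 0; i < word.size(); i = utf8_next(word, i)) {
        symbols_.push_back(llm_symbol{ ..., word.substr(i, len) });
    }

    // Rank every adjacent pair (i - 1, i) and push it into the priority queue
    for (int i = 1; i < (int)symbols_.size(); ++i) {
        add_new_bigram(i - 1, i);   // queue is ordered by ascending rank
    }

    // Main loop: keep merging the pair with the lowest rank
    while (!work_queue_.empty()) {
        auto bigram = work_queue_.top();
        work_queue_.pop();

        // Skip pairs that an earlier merge has already invalidated
        if (symbols_[bigram.left].n == 0 || symbols_[bigram.right].n == 0) continue;

        // Merge: extend the left piece over the right one, mark the right as consumed
        symbols_[bigram.left].n += symbols_[bigram.right].n;
        symbols_[bigram.right].n = 0;

        // Unlink the consumed piece so the left piece sees its new right neighbour
        symbols_[bigram.left].next = symbols_[bigram.right].next;
        if (symbols_[bigram.right].next >= 0) {
            symbols_[symbols_[bigram.right].next].prev = bigram.left;
        }

        // The merged piece has new left/right neighbours: push those pairs
        add_new_bigram(symbols_[bigram.left].prev, bigram.left);
        add_new_bigram(bigram.left, symbols_[bigram.left].next);
    }

    // Whatever pieces remain unmerged form the token sequence;
    // look each one up in vocab.token_to_id[piece] to get the final ids
}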
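The same rank-driven merge can be sketched in a few lines of Python. This is a toy, not llama.cpp's implementation: it rebuilds the heap on every pass instead of keeping one queue with lazy invalidation, and the ranks in `TOY_RANKS` are invented so that "llama" ends up as "ll" + "ama".

```python
import heapq

def bpe_merge(symbols, ranks):
    """Greedy BPE over a list of string pieces; the lowest rank merges first."""
    symbols = list(symbols)
    while True:
        # Collect every adjacent pair that has a merge rule.
        heap = [(ranks[(a, b)], i)
                for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                if (a, b) in ranks]
        if not heap:
            return symbols                  # no applicable merge left
        heapq.heapify(heap)
        _, i = heapq.heappop(heap)          # lowest rank, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

# Hypothetical merge table, chosen only for this illustration.
TOY_RANKS = {("l", "l"): 0, ("m", "a"): 1, ("a", "ma"): 2}

pieces = bpe_merge(list("llama"), TOY_RANKS)
print(pieces)   # ['ll', 'ama']
```

Because there is no rule for ("ll", "ama") in the toy table, the merge stops at two pieces, mirroring the split described in the text.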

"你好" 这两个汉字是怎么变成一个 token 的

How "你好" becomes one (or two) token(s)

Llama-3 的词表是字节级的——所有 256 个字节都在词表里。当 "你"(utf-8 是 3 字节 0xE4 0xBD 0xA0)进 BPE 时,它先被拆成 3 个字节。然后:

如果训练语料里 0xE4 0xBD 这一对的合并 rank 是 12000,而 0xBD 0xA0 的 rank 是 8500,先合 0xBD 0xA0。
合完后剩 0xE4 [0xBD0xA0],再查这对的 rank……
最终的合并路径取决于训练语料里"你"出现的频率。Llama-3 在中文语料上跑过,所以 "你" 很可能恰好是一个 token(在上面的 trace 里是 id 47045)。在 GPT-2 上则不一定:它的词表中文偏少,"你"可能保留为 3 个字节级 token。

这就是为什么"同一段话,不同模型 token 数差几倍"——它直接决定:(1) prompt 占多少 context,(2) prefill 慢多少,(3) KV cache 占多少内存。一个偷懒的 tokenizer 会让你的 API 账单翻倍。

Llama-3's vocabulary is byte-level — all 256 raw bytes live in the vocab. When "你" (utf-8 0xE4 0xBD 0xA0) enters BPE, it's first split into 3 bytes. Then:

If the training corpus learned (0xE4, 0xBD) with merge rank 12000 and (0xBD, 0xA0) with rank 8500, the latter merges first.
After merging you have 0xE4 [0xBD0xA0]; look up the new pair's rank…
The final merge path is decided by how often "你" appears in training. Llama-3 saw Chinese, so "你" likely became a single token (id 47045 in the trace above). GPT-2's Chinese coverage is thin, so "你" may stay as 3 byte-level tokens there.

This is why "same sentence, different models, multi-fold token count differences" — directly determining: (1) how much context the prompt eats, (2) how slow prefill is, (3) how much memory KV cache costs. A lazy tokenizer doubles your API bill.
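The byte-level walk-through above can be replayed directly. The two ranks are the hypothetical values from the text, not entries from any real merge table.

```python
ch = "\u4f60"                                            # the character 你
byte_pieces = [bytes([b]) for b in ch.encode("utf-8")]   # three single-byte pieces

# Hypothetical merge ranks, taken from the worked example in the text.
HYPOTHETICAL_RANKS = {
    (b"\xe4", b"\xbd"): 12000,
    (b"\xbd", b"\xa0"): 8500,
}

pairs = list(zip(byte_pieces, byte_pieces[1:]))
candidates = [p for p in pairs if p in HYPOTHETICAL_RANKS]
first_merge = min(candidates, key=HYPOTHETICAL_RANKS.get)   # lowest rank wins

i = pairs.index(first_merge)
after = byte_pieces[:i] + [first_merge[0] + first_merge[1]] + byte_pieces[i + 2:]
print(byte_pieces, "->", after)
```

Whether the remaining pair then merges into a single token depends on whether the vocabulary holds a rule for it, which is exactly the training-corpus dependence the text describes.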

扩展 EXTENDED 四种主流 BPE 的取舍:GPT-2 / SentencePiece / Tiktoken / Tekken · Four mainstream BPE variants: GPT-2 / SentencePiece / Tiktoken / Tekken

都是 BPE,但实现细节差异巨大。llama.cpp 在 src/llama-vocab.cpp 里枚举了 LLAMA_VOCAB_TYPE_BPE / SPM / WPM / UGM 等多种类型,每种走略微不同的 path:

GPT-2 BPE(Llama-3 / Qwen2 / Mistral 都是这个家族):字节级,pre-tokenize 用一段长正则,贪心合并 from rank。最常见。
SentencePiece BPE(Llama-1/2 / Gemma):没有 pre-tokenize,直接在整段文本上跑 BPE,把空格当一个特殊字符 ▁。"原生支持任意语言"是它的卖点。
Tiktoken(GPT-3.5+ / GPT-4):跟 GPT-2 同家族,但是 OpenAI 用 Rust 重写了一套更快的 BPE,把合并从优先队列变成BPE rank table 直接查,常数倍快。llama.cpp 不直接调 Tiktoken,但兼容它的 vocab 文件。
Tekken(Mistral 新一代):基于 Tiktoken 改的,词表更紧凑,中文/代码效率更高。

对推理引擎来说,tokenizer 的速度几乎从来不是瓶颈——主要算力都在 GPU 上。但 tokenizer 的正确性是大坑:大小写、空格、emoji、罕见 unicode、模型 chat template 里的特殊 token(<|im_start|> 之类),错一个就让模型"看不懂"自己被怎么 prompt 的。llama.cpp 历史上 BPE 实现 bug 修了几十个 PR——这是整个推理栈里最容易引入隐 bug 的环节。

All BPE, but implementations diverge wildly. llama.cpp's src/llama-vocab.cpp enumerates LLAMA_VOCAB_TYPE_BPE / SPM / WPM / UGM, each taking a slightly different path:

GPT-2 BPE (Llama-3 / Qwen2 / Mistral family): byte-level, pre-tokenize with a long regex, greedy merge by rank. Most common.
SentencePiece BPE (Llama-1/2 / Gemma): no pre-tokenize; BPE runs over the raw text with whitespace mapped to a special ▁. "Language-agnostic out of the box" is its pitch.
Tiktoken (GPT-3.5+ / GPT-4): same family as GPT-2 BPE, but OpenAI rewrote it in Rust with a faster algorithm that replaces the priority queue with direct rank-table lookups, several × faster. llama.cpp doesn't link Tiktoken but reads its vocab.
Tekken (next-gen Mistral): Tiktoken derivative; tighter vocab, better Chinese / code efficiency.

For an inference engine, tokenizer speed is almost never the bottleneck — compute lives on the GPU. But tokenizer correctness is a minefield: case, whitespace, emoji, rare unicode, chat-template specials like <|im_start|>. Mess one up and the model literally can't see how it was prompted. llama.cpp's BPE has been bug-fixed in dozens of PRs over the years — this is the easiest place in the whole stack to silently regress.

主线 · 我们的 promptMAIN LINE · OUR PROMPT "你好,llama" 进 Llama-3 tokenizer,输出是 [128000, 47045, 50739, 11, 9091, 64] ——6 个 token,占 24 字节(int32)。这 6 个 int 就是第 4 章 embedding lookup 的入参。 "hello, llama" through Llama-3's tokenizer yields [128000, 9906, 11, 100793] — 4 tokens, 16 bytes (int32). Whichever language, those ints feed Ch. 4's embedding lookup.

扩展EXTENDED GPT-2 那段正则 — Pre-tokenize 的真实代码 The GPT-2 regex — actual pre-tokenize code

OpenAI 在 2019 年发布 GPT-2 时附了一行看起来像乱码的 pre-tokenize 正则。它做的事情是把任意 utf-8 字符串切成"词块":空格、标点、连续数字、连续字母各自一段。看一眼:

The pre-tokenize regex from GPT-2's 2019 release is famously cryptic-looking. What it does: slice any utf-8 string into "chunks" — whitespace, punctuation, digit runs, letter runs, each their own segment. Take a look:

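GPT-2 / Llama-3 pre-tokenize regex · the canonical pattern

# \p{L} / \p{N} need the third-party `regex` module; the stdlib `re` rejects them
import regex as re

# Original GPT-2 pattern
gpt2_pat = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

# Llama-3 variant: adds case-insensitive contraction matching
llama3_pat = re.compile(r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""")

# Piece by piece:
#   (?i:'s|'t|'re|'ve|'m|'ll|'d)   <- English contractions: 's, 't, 're, ...
#   [^\r\n\p{L}\p{N}]?\p{L}+       <- a run of letters (optionally led by one non-letter, non-digit character)
#   \p{N}{1,3}                     <- digits, at most 3 per chunk (so 1234567 is never 1 token)
#    ?[^\s\p{L}\p{N}]+[\r\n]*      <- a run of punctuation (optionally followed by newlines)
#   \s*[\r\n]+                     <- a run of newlines
#   \s+(?!\S)                      <- trailing whitespace
#   \s+                            <- any other whitespace run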

llama.cpp 没有为此引入外部正则库,而是走自己的 unicode_regex_split(src/unicode.cpp):底层是 std::regex,外加针对 GPT-2 / Llama-3 这两段正则的手写快速路径。注意一个细节:这段 regex 用了 \p{L}(任意 Unicode 字母)和 \p{N}(任意 Unicode 数字),它是 Unicode-aware 的,所以中文、阿拉伯文、emoji 都能正确分块。再配上大了四倍的 byte-level 词表(128k 对 32k),这是 Llama-3 比 Llama-2 中文支持好的重要原因:Llama-2 的 SentencePiece tokenizer 对很多汉字只能退回到逐字节编码。

"数字最多 3 位一起" 这条尤其重要:防止模型把"1234567"看成一个 token,而是切成 "123","456","7"——这样模型对数字的处理(尤其加减法)精度高得多。GPT-4 那段广为吐槽的"怎么不会算 17 + 28" 在 Llama-3 上好得多,部分原因就是 pre-tokenize 改进。

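llama.cpp does not pull in an external regex library for this; it runs the pattern through its own unicode_regex_split helper (src/unicode.cpp), which is built on std::regex plus hand-written fast paths for the GPT-2 and Llama-3 patterns. Notice that the pattern uses \p{L} (any Unicode letter) and \p{N} (any Unicode digit): it's Unicode-aware, so Chinese, Arabic and emoji all chunk correctly. Combined with a byte-level vocab that is four times larger (128k vs. 32k entries), this is a large part of why Llama-3 handles Chinese better than Llama-2, whose SentencePiece tokenizer fell back to raw bytes for many Chinese characters.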

The "digits chunked at most 3 at a time" rule matters a lot: prevents the model from treating "1234567" as one token. Instead it splits into "123","456","7" — way better for arithmetic accuracy. GPT-4's infamous "can't compute 17+28" issues are noticeably better in Llama-3, partly because of this pre-tokenize improvement.
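The three-digit chunking rule is easy to see in isolation. The pattern below is a simplified, ASCII-only stand-in written for the standard-library `re` module (which has no `\p{L}` / `\p{N}`); it is an assumption made for illustration, not the real Llama-3 pattern.

```python
import re

# Simplified stand-in: [A-Za-z] replaces \p{L}, \d replaces \p{N}.
SIMPLE_PAT = re.compile(r"[^\r\nA-Za-z\d]?[A-Za-z]+|\d{1,3}| ?[^\sA-Za-z\d]+|\s+")

def pre_tokenize(text):
    """Split text into chunks; BPE merges would later run inside each chunk."""
    return SIMPLE_PAT.findall(text)

chunks = pre_tokenize("pay 1234567 now!")
print(chunks)   # ['pay', ' ', '123', '456', '7', ' now', '!']
```

Note how the leading space attaches to "now" but not to the digit run, and how the seven digits are cut left to right into groups of at most three.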

扩展EXTENDED Llama-3 tokenizer launch 的"三个 bug" — 真实事故簿 Llama-3 tokenizer's launch bugs · a real incident log

2024 年 4 月 Llama-3 发布后两周内,llama.cpp / Hugging Face / Ollama 等所有推理栈都发生过 tokenizer 相关 bug。我挑三个最经典的:

添加 BOS token 两次(#7088 等):Llama-3 chat template 已经在最前面有 <|begin_of_text|>(BOS),但 llama.cpp 加载新模型时默认还会自动添加 BOS——导致第一个 token 是 BOS BOS,模型看到"两个开始符"行为异常,输出质量明显下降。修复方法:模型 metadata 里加 add_bos_token = false。
特殊 token 被普通 BPE 处理(#6920):某些用户 prompt 里包含字面 <|eot_id|> 字符串(可能是从日志复制粘贴)。早期 tokenizer 把它当普通字符串 BPE 切——而不是识别为特殊 token id 128009。结果模型把"用户在文本中提到的 eot_id" 当成"对话该结束了",立刻停止生成。要在 tokenize 前显式扫描已知特殊 token。
chat template 不一致:Llama-3 base 和 instruct 用了同一个 tokenizer 但不同 chat template。开发者下载 base 模型加载 instruct 的 template,模型给出"系统级"莫名其妙的输出。bug 修在 GGUF metadata 引入 chat_template 字段,llama.cpp 一加载就用模型自己的模板。

这些 bug 都不是算法错,是"接缝层"的人为错误。但它们的用户体感是"这个模型变笨了"。所以 tokenizer 是整个推理栈里最容易被忽视、但损害最大的环节——错一个字节,模型整体性能可见地崩。

In the two weeks after Llama-3's April 2024 release, llama.cpp / Hugging Face / Ollama all had tokenizer bugs. Three classics:

Double BOS (issue #7088 et al.): Llama-3 chat templates already include <|begin_of_text|> (BOS) at the start; but llama.cpp defaulted to auto-adding BOS when loading new models — first token becomes BOS BOS. The model sees "two start markers", behaves oddly, quality drops visibly. Fix: model metadata add_bos_token = false.
Special tokens treated as normal BPE (issue #6920): users sometimes paste prompts containing literal <|eot_id|> (copied from logs). Early tokenizer treated it as a regular BPE string — not as the special token id 128009. The model interpreted "user mentioning eot_id in text" as "this turn ended" and stopped generating immediately. Fix: scan for known special tokens before BPE.
Chat template mismatch: Llama-3 base and instruct share the same tokenizer but use different chat templates. Devs downloaded base, loaded the instruct template, and the model produced baffling output. Fix: GGUF metadata gained a chat_template field; llama.cpp picks up the model's own template at load.

None of these are algorithm bugs; they're "seam-layer" human errors. But the user-perceived impact is "this model got dumber". So tokenizer is the most overlooked, highest-damage piece of the inference stack — one byte wrong and the whole model visibly degrades.
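The double-BOS failure mode is simple to guard against at the seam. The helper below is a hypothetical illustration, not a llama.cpp function; the BOS id 128000 is the one that appears in the article's trace.

```python
BOS_ID = 128000   # <|begin_of_text|> in the article's Llama-3 trace

def with_single_bos(token_ids, add_bos=True, bos_id=BOS_ID):
    """Prepend BOS only if requested and not already present (hypothetical helper)."""
    ids = list(token_ids)
    if add_bos and (not ids or ids[0] != bos_id):
        ids.insert(0, bos_id)
    return ids

already_templated = with_single_bos([128000, 9906, 11])   # template already added BOS
raw_prompt = with_single_bos([9906, 11])                  # plain text, BOS gets added
```

The design choice is to make the check idempotent, so it does not matter whether the chat template or the loader was the one that added the marker.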

CHAPTER 04 · INTAKE

Embedding lookup — 一个 int 变成 4096 个 float

Embedding lookup — one int becomes 4096 floats

token → tensor · ggml_get_rows 一行就完成的进入

token → tensor · the entry that ggml_get_rows does in one row

从这一步开始,token id 就消失了。剩下的 17 个站全部在 4096 维(Llama-3-8B)/ 7168 维(DeepSeek-V3)的实数空间里走。这次翻译极其简单——简单到 ggml 里就一个函数 ggml_get_rows:

From this step on the token ids disappear. The next 17 stations all run inside a 4096-dim (Llama-3-8B) or 7168-dim (DeepSeek-V3) real-valued space. This translation is trivial — so trivial it's a single ggml op called ggml_get_rows:

src/llama-graph.cpp · llm_graph_context::build_inp_embd() · id → vector

// Input:  ubatch.token, an int32 array of shape [n_tokens]
// Output: inpL, an fp16 matrix of shape [n_embd, n_tokens]
ggml_tensor * build_inp_embd(ggml_tensor * tok_embd) {
    // 1. Create the inp_tokens input node (the actual data is filled in at compute time)
    ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
    ggml_set_input(inp_tokens);

    // 2. The key line: from the big [n_vocab, n_embd] table tok_embd,
    //    pull out the n_tokens rows selected by the indices in inp_tokens
    ggml_tensor * inpL = ggml_get_rows(ctx0, tok_embd, inp_tokens);

    // 3. Some models add token_type or position embeddings here.
    //    Llama does not; RoPE is applied later inside attention (C06)
    return inpL;   // shape: [n_embd, n_tokens]
}
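Stripped of the graph machinery, the lookup is a gather. The sketch below uses plain Python lists and a made-up 5 × 3 table; `get_rows` here is a stand-in written for this example, and it stores one row per token, whereas ggml lists the fastest-varying dimension first.

```python
def get_rows(table, ids):
    """Return table[i] for each i in ids: pure indexing, no arithmetic."""
    return [table[i] for i in ids]

# Toy embedding table: 5 "vocab" entries, 3 dims each (made-up values).
tok_embd = [[float(10 * r + c) for c in range(3)] for r in range(5)]

inp = get_rows(tok_embd, [4, 0, 4])
print(inp)   # [[40.0, 41.0, 42.0], [0.0, 1.0, 2.0], [40.0, 41.0, 42.0]]
```

Repeated ids simply return the same row again, which is why the cost of this step scales with the number of tokens and not with the vocabulary size.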

为什么 embedding 是整个推理里最贵的单层参数

Why embedding is the single most parameter-heavy layer

看个数字:Llama-3-8B 的 tok_embd 是 [128256, 4096] 的矩阵,fp16 占约 1 GB。同一个模型一层 attention 的 Q/K/V 投影矩阵加起来只有约 50 MB(GQA 下 8 个 KV head,K、V 各是 [4096, 1024])。embedding 表单独占全模型约 8B 参数的 6.5% 左右,再加上同样大小、不共享的 LM head 就是约 13%。

这就是为什么词表大小是一个工程决策,不是越大越好:

词表大 → 同一段文本 token 数少 → prefill 快、KV cache 小,但 embedding 表占内存翻倍。
词表小 → embedding 小,但同一段文本 token 数多。中文用 GPT-2 词表跑会比英文多 2-3 倍 token,context 占用、prefill 时间、KV cache 和账单都跟着乘以这个倍数。

多数现代模型(Llama-3 是 128k,DeepSeek 是 130k,Qwen2 是 152k)都故意选了一个"中文不太亏、英文不太浪费"的 sweet spot。

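Run the numbers: Llama-3-8B's tok_embd is [128256, 4096] in fp16, about 1 GB. One layer's Q/K/V projections sum to only about 50 MB (with 8 KV heads under GQA, K and V are each [4096, 1024]); the embedding table alone is roughly 6.5% of the model's ~8B parameters, and about 13% once you add the equally sized, untied LM head.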

This is why vocab size is an engineering decision, not a "bigger is better" knob:

Vocab large → fewer tokens per text → faster prefill, smaller KV cache, but the embedding table balloons.
Vocab small → small embedding table, but more tokens per text. Chinese under a GPT-2 vocab costs 2-3× the tokens of the equivalent English, and context use, prefill time, KV cache and the API bill all grow by that same factor.

Modern models (Llama-3 at 128k, DeepSeek at 130k, Qwen2 at 152k) deliberately pick a sweet spot — "doesn't penalize Chinese, doesn't waste on English".
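The size comparison can be worked out from the shapes. The vocabulary and hidden size come from the text; the head layout (32 query heads, 8 KV heads) and the 8.03B total parameter count are assumptions about Llama-3-8B stated here for the arithmetic.

```python
N_VOCAB, N_EMBD = 128256, 4096   # tok_embd shape quoted in the text
N_HEAD, N_KV_HEAD = 32, 8        # assumption: Llama-3-8B GQA layout
TOTAL_PARAMS = 8.03e9            # assumption: published parameter count
FP16 = 2                         # bytes per weight

d_head = N_EMBD // N_HEAD
embd_params = N_VOCAB * N_EMBD
# Wq is [n_embd, n_head * d_head]; Wk and Wv are [n_embd, n_kv_head * d_head]
qkv_params = N_EMBD * (N_HEAD * d_head) + 2 * N_EMBD * (N_KV_HEAD * d_head)

embd_gb = embd_params * FP16 / 1e9
qkv_mb = qkv_params * FP16 / 1e6
embd_share = embd_params / TOTAL_PARAMS

print(round(embd_gb, 2), round(qkv_mb, 1), round(embd_share, 3))   # 1.05 50.3 0.065
```

One vocabulary-sized matrix outweighs the attention projections of a single layer by roughly twenty to one, which is why vocabulary size shows up so clearly in the memory budget.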

实现细节 IMPL DETAIL ggml_get_rows 看着是"行选择",底层本质上就是按下标拷贝行:CPU 上基本等价于逐行 memcpy,GPU 上是一个很小的 gather kernel;如果表是量化存储的,拷贝时顺带反量化。它几乎不含数学运算,是寻址操作,所以这一步耗时几乎可以忽略(几个 μs)。"贵"的不是计算,是内存:模型一启动就要把这 1 GB 的 embedding 表加载进显存,占住不放。 ggml_get_rows looks like "row selection", and underneath it is essentially a row copy by index: roughly a per-row memcpy on CPU and a small gather kernel on GPU, with dequantization folded in when the table is stored quantized. It is an addressing op with almost no math, so the compute is negligible (microseconds). What's "expensive" is the resident memory: that 1 GB embedding table sits in VRAM for the entire lifetime of the model.

"为什么 4096 维"

"Why 4096"

这个数字没有第一原理。它是从 Transformer 原论文(d_model=512)开始,GPT-2 按规模用 768 到 1600,GPT-3 加到 12288,Llama-2-7B 是 4096,Llama-3-8B 沿用 4096,Llama-3-70B 用 8192。大致跟参数量开三次方根的方向走,但每个模型组各自有自己的工程口味。

真正的约束是:

n_embd 必须能被 n_head 整除:每个 attention head 要拿到 d_head = n_embd / n_head 维。常见取值 d_head ∈ {64, 96, 128},因为 CUDA 的 Tensor Core 对这几个尺寸最友好。
n_embd 决定了模型"每个 token 能记住多少东西"——这是一个 4096 维的"向量",它要装下"我是一个动词、我前面接过冠词、我整段话是讽刺语气"等等所有上下文信号。

4096 不是物理常数,是过去 7 年里大家觉得对 7-8B 规模够用、对 70B 规模偏小的那个数。

This number has no first principle. It started at 512 in the original Transformer paper; GPT-2 used 768 to 1600 depending on model size, GPT-3 went to 12288, Llama-2-7B uses 4096, Llama-3-8B kept 4096, and Llama-3-70B uses 8192. It scales roughly as the cube root of total params, but each model family has its own taste.

The real constraints:

n_embd must be divisible by n_head — each attention head gets d_head = n_embd / n_head dims. Common choices d_head ∈ {64, 96, 128}, because CUDA Tensor Cores like these shapes.
n_embd sets the model's "per-token memory capacity" — that 4096-dim vector has to encode "I'm a verb, I followed an article, the whole sentence is sarcastic" and all other contextual signals at once.

4096 isn't a physical constant. It's the number the field has roughly settled on as "enough for 7-8B, a little tight for 70B" over the past seven years.

扩展EXTENDED Weight tying · 为什么 LM head 和 embedding "可能共用同一个矩阵" Weight tying · why LM head and embedding sometimes share a matrix

"embedding 把 token id 翻成 4096 维向量" 是入口;"LM head 把 4096 维向量翻回 token id" 是出口(第 13 章)。从形状看,两者恰好互为转置:都是 [n_vocab, n_embd] 形状的矩阵,只是矩阵乘的方向相反。

所以 2016 年 Inan 等人提出 "weight tying":让 embedding 和 LM head 共用同一份权重。具体做法是 LM head 用 embedding.T 而不是另一个独立矩阵。好处:

省 ~1 GB 存储:Llama-3-8B 的 embedding 在 fp16 下是 1 GB,如果共用,整个模型能小 1 GB(约占总参数的 6.5%)。
训练时 input 和 output 同步学习:同一个 token 的输入语义和输出预测自然对齐。论文实测能提升 perplexity。

哪些模型用 tying:GPT-2、Gemma 以及 Qwen2 的小尺寸版本用。Llama-1、Llama-2、Mistral-7B 和 Llama-3-8B 都保留独立的 LM head。Meta 论文里没解释这个选择,业界推测是在这个规模、词表又大到 128k 之后,untied 给了模型更多表达自由度,让输入侧的 embedding 和输出侧的投影各自专精。

llama.cpp 在加载模型时处理这件事:GGUF 里如果没有单独的 output 权重张量,就 fallback 复用 token_embd。所以推理代码不需要关心是不是 tied。但训练框架必须显式声明(例如 Hugging Face 模型 config 里的 tie_word_embeddings),否则两个权重独立训练就没有 tying 的效果。

"Embedding turns a token id into a 4096-dim vector" is the entrance; "LM head turns a 4096-dim vector back into a token id" is the exit (Ch.13). Shape-wise, the two are literal transposes: both are [n_vocab, n_embd]; only the matmul direction differs.

So in 2016 Inan et al. proposed "weight tying": have embedding and LM head share the same weights. Implementation: LM head uses embedding.T instead of an independent matrix. Benefits:

Saves ~1 GB: Llama-3-8B's embedding is 1 GB in fp16; tying would shrink the full model by that 1 GB (about 6.5% of its parameters).
Input and output learn together: a single token's input semantics and output prediction stay aligned. Papers measure perplexity improvements.

Who uses tying: GPT-2 and Gemma tie, as do the smaller Qwen2 checkpoints. Llama-1, Llama-2, Mistral-7B and Llama-3-8B keep a separate LM head. Meta's paper doesn't explain the choice; the industry guess is that at this scale, and with a 128k vocab, an untied head gives the model more expressive freedom, letting the input-side embedding and the output-side projection each specialize.

llama.cpp handles this at load time: if the GGUF file has no separate output tensor, it falls back to reusing token_embd. So inference code doesn't need to know whether weights are tied. Training frameworks, however, must be told explicitly (for example via the tie_word_embeddings setting in Hugging Face model configs); otherwise the two matrices train independently and the tying benefit is lost.
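A tiny numeric sketch of what "the LM head is the embedding table used in the other direction" means. The 3 × 2 table holds made-up values, and both helper functions are written for this example only.

```python
def embed(table, token_id):
    """Entrance: pick the row for this token id."""
    return table[token_id]

def lm_head_tied(table, hidden):
    """Exit: logits[v] = dot(table[v], hidden), i.e. the same matrix applied as E^T."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in table]

E = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # toy [n_vocab=3, n_embd=2], made-up values

h = embed(E, 2)                # pretend the transformer layers left the vector unchanged
logits = lm_head_tied(E, h)
best = max(range(len(logits)), key=logits.__getitem__)
```

With unit-length rows, the token whose embedding matches the hidden vector gets the highest logit, which is the alignment between input and output semantics that tying is meant to encourage.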

扩展EXTENDED Intrinsic dimension · 为什么 4096 维"够用" Intrinsic dimension · why 4096 dims is "enough"

4096 维听起来很多,但 LLM 实际用到的维度可能远少于此。Intrinsic dimension(内在维度)研究:在 LLM 的 hidden state 上跑 SVD,前 90% 方差通常只用 ~500-1000 个 principal component。剩下 3000+ 维都是冗余——它们存在主要是为了训练时的优化空间,推理时大部分激活值很小。

这件事直接催生了几个推理优化方向:

LoRA(Low-Rank Adaptation):微调时只学一个低秩的 ΔW = A·B,A 和 B 都是 ~16-64 维的低秩矩阵。一个 7B 模型的 LoRA adapter 才几十 MB——因为模型的"实际信息容量" 是低维的。
Pruning:把 hidden state 里数值接近 0 的维度直接去掉。Wanda(2024)论文显示 50% pruning 几乎不掉精度——但工程实现复杂,生产用不多。
SVD-based 量化:对权重矩阵先做 SVD,只保留 top-k 奇异值,再对低秩近似量化。(常被一起提到的 GPTQ 走的是另一条路:用二阶统计量补偿量化误差,而不是 SVD。)

所以 4096 维不是"恰好够",更像是"训练时为了优化空间留的余量"。推理时的"有效维度" 远小于此——这是为什么各种 low-rank 压缩 trick 在 LLM 上效果好的根本原因。

4096 sounds like a lot, but LLMs actually use far fewer dimensions. Intrinsic dimension research: SVD on LLM hidden states; the top 90% variance is typically captured by ~500-1000 principal components. The remaining 3000+ dims are redundant — they exist mainly to provide optimization room during training; most are near-zero at inference.

This fact has directly driven several inference optimization directions:

LoRA (Low-Rank Adaptation): finetune only a low-rank ΔW = A·B; both A and B are ~16-64 dim. A 7B model's LoRA adapter is just tens of MB — because the model's "actual information capacity" is low-dimensional.
Pruning: drop dimensions with near-zero activation. Wanda (2024) shows 50% pruning barely affects accuracy — but engineering complexity keeps it rare in production.
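SVD-based quantization: SVD the weight matrix; keep top-k singular values; quantize the low-rank approximation. (GPTQ, often mentioned alongside, takes a different route: it compensates quantization error using second-order statistics rather than SVD.)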

So 4096 isn't "exactly enough" — it's more like "training-time slack for optimization". The "effective dimension" at inference is much smaller — which is the root reason every low-rank compression trick works so well on LLMs.
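The "tens of MB" figure for a LoRA adapter follows from counting parameters. In this sketch the rank of 16, the choice to adapt only the query and value projections, and the Llama-3-8B-like shapes are all assumptions made for the arithmetic.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in a rank-r update: one [d_out, r] factor plus one [r, d_in] factor."""
    return rank * (d_in + d_out)

N_EMBD, N_LAYER, RANK = 4096, 32, 16   # assumptions: Llama-3-8B-like shapes, rank 16
D_KV = 1024                            # assumption: 8 KV heads × 128 dims

# Assumption: adapters on Wq ([4096, 4096]) and Wv ([4096, 1024]) in every layer.
per_layer = lora_params(N_EMBD, N_EMBD, RANK) + lora_params(N_EMBD, D_KV, RANK)
adapter_params = per_layer * N_LAYER
adapter_mb = adapter_params * 2 / 1e6      # stored in fp16

full_wq_params = N_EMBD * N_EMBD           # one dense Wq, for comparison
print(adapter_params, round(adapter_mb, 1), full_wq_params)
```

Under these assumptions the whole adapter has fewer parameters than a single dense query projection, which is the practical payoff of the low-rank view.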

CHAPTER 05 · INTAKE

Batch / ubatch — 一段 prompt 怎么切成 prefill

Batch / ubatch — how a prompt is sliced into prefill

llama-batch.cpp · 调度的最小单位

llama-batch.cpp · the smallest unit the scheduler sees

到这里 prompt 已经是 [n_embd, n_tokens] 的 fp16 矩阵了。但"喂给模型"这件事还有一层抽象:batch。一个 batch 可以装多个 prompt(不同用户的请求拼在一起),也可以是同一个 prompt 在 prefill 阶段被切成几段(避免一次性吃太多显存)。这两件事都由 llama_batch 统一表达。

At this point the prompt is a [n_embd, n_tokens] fp16 matrix. But "feed it to the model" still has one more layer: batch. A batch can pack multiple prompts (different users' requests glued together) or one prompt sliced into chunks during prefill (avoiding a memory spike). Both are expressed through llama_batch.

include/llama.h · struct llama_batch · the shape of a batch

struct llama_batch {
    int32_t         n_tokens;   // number of tokens in this batch
    llama_token  *  token;      // token ids · [n_tokens]
    float        *  embd;       // or raw embeddings instead · use either token or embd
    llama_pos    *  pos;        // position of each token within its own sequence · [n_tokens]
    int32_t      *  n_seq_id;   // how many sequences each token belongs to
    llama_seq_id ** seq_id;     // per-token list of sequence ids (tells users apart in a shared batch)
    int8_t       *  logits;     // whether to output logits for this token (usually only the last → C13)
};

三个字段决定了一次推理的全部"谁"信息

Three fields encode the entire "who" of one inference

抽象层数不多但很关键:

token / embd:数据本身。要么是 token id(走 embedding lookup),要么直接是 embedding(给 multi-modal vision 那种用,图片 patch 已经投到 embedding 空间了)。
pos:RoPE 要用的位置编号(第 6 章)。第一次见这个字段会困惑——为什么 token 还得带"位置"?因为 batch 里同时跑多个序列时,token 0 是用户 A 的第 0 个 token,但下面 token 5 可能是用户 B 的第 100 个 token,光看 batch 内的下标不够。
seq_id:这个 token 属于哪条序列。同一条序列共享 KV cache,不同序列的 KV 互不影响。这就是多用户并发在 llama.cpp 里的实现方式——一张 KV cache 表里同时住着好几条对话,通过 seq_id 区分。

Not many abstractions, but each matters:

token / embd: the data itself. Either token ids (will go through embedding lookup) or raw embeddings (for multimodal vision — image patches already projected to embedding space).
pos: the position index used by RoPE (Ch. 6). First time you see this field you wonder — why does a token need a "position"? Because when multiple sequences share a batch, token 0 might be user A's first token while token 5 is user B's hundredth. The batch index alone isn't enough.
seq_id: which sequence this token belongs to. Tokens with the same seq_id share a KV cache; different sequences don't see each other's KV. This is how multi-user concurrency lives inside llama.cpp — several conversations co-exist in one KV cache table, distinguished by seq_id.
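To make the three "who" fields concrete, here is a minimal Python sketch that packs two conversations into one flat batch. `ToyBatch` and `pack` are hypothetical names invented for this illustration and are not part of llama.cpp; the first three token ids are borrowed from the running example, and the id 42 is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatch:
    """Flat, llama_batch-like container (hypothetical, for illustration only)."""
    token: list = field(default_factory=list)   # token id per entry
    pos: list = field(default_factory=list)     # position inside its own sequence
    seq_id: list = field(default_factory=list)  # owning sequence per entry
    logits: list = field(default_factory=list)  # 1 = emit logits for this entry


def pack(sequences):
    """sequences maps seq_id -> (start_pos, [token ids]); returns one ToyBatch."""
    batch = ToyBatch()
    for sid, (start, tokens) in sequences.items():
        for i, tok in enumerate(tokens):
            batch.token.append(tok)
            batch.pos.append(start + i)   # position in the sequence, not in the batch
            batch.seq_id.append(sid)
            # only the last token of each sequence needs logits
            batch.logits.append(1 if i == len(tokens) - 1 else 0)
    return batch


# user 0 is prefilling a 3-token prompt; user 1 is decoding its 101st token
batch = pack({0: (0, [128000, 9906, 11]), 1: (100, [42])})
print(batch.pos, batch.seq_id, batch.logits)
```

The batch index of the last entry is 3, yet its position is 100: that gap is exactly why `pos` and `seq_id` have to travel with every token.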

ubatch — 把大 batch 切成显存装得下的小块

ubatch — slicing the batch into memory-sized chunks

用户给的 llama_batch 是逻辑单位,但 GPU 上一次 forward 能装多少,是物理约束。n_ubatch 这个参数(默认 512)决定了"一次 graph 实际跑多少 token"。如果用户喂进来 4096 个 token 的 prompt,llama.cpp 会把它切成 8 个 ubatch 串行跑,中间 KV cache 累积。

The llama_batch the user hands in is a logical unit, but how many tokens fit through one forward is a physical constraint. The n_ubatch param (default 512) sets "tokens per actual graph run". A 4096-token prompt is sliced into 8 ubatches, run serially, with KV cache accumulating between them.

src/llama-batch.cpp · llama_sbatch::split_simple() · batch → ubatch

    // Take at most n_ubatch tokens from the sbatch (the not-yet-split "super batch")
    // and pack them into one ubatch for the graph to run
    llama_ubatch llama_sbatch::split_simple(size_t n_ubatch) {
        size_t n_tokens = std::min(n_ubatch, this->n_tokens);
        llama_ubatch ubatch = {
            .equal_seqs   = true,      // the whole chunk belongs to one seq
            .n_tokens     = n_tokens,
            .n_seq_tokens = n_tokens,
            .n_seqs       = 1,
            .token        = ids.data(),
            .pos          = pos.data(),
            ...
        };
        // consume n_tokens from the front, leaving the rest for the next split
        this->n_tokens -= n_tokens;
        return ubatch;
    }
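The slicing arithmetic is easy to check in isolation. The sketch below is a simplified Python analogue of the splitting idea, not a port of `split_simple`; the function name is reused only for readability, and the default of 512 is the value quoted in the text.

```python
def split_simple(tokens, n_ubatch=512):
    """Yield consecutive chunks of at most n_ubatch tokens (toy analogue)."""
    if n_ubatch <= 0:
        raise ValueError("n_ubatch must be positive")
    for start in range(0, len(tokens), n_ubatch):
        yield tokens[start:start + n_ubatch]


# a 4096-token prompt becomes 8 full ubatches
chunks = list(split_simple(list(range(4096)), 512))

# a prompt that is not a multiple of n_ubatch ends with a short chunk
ragged = [len(c) for c in split_simple(list(range(1000)), 512)]
print(len(chunks), ragged)
```

Each chunk would run through the graph in order, with the KV cache carrying state from one chunk to the next.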

主线 · 走到哪了MAIN LINE · WHERE WE ARE 至此,我们的 prompt 已经变成 llama_ubatch{ n_tokens=6, token=[128000,47045,...], pos=[0,1,2,3,4,5], seq_id=[0,0,0,0,0,0], logits=[0,0,0,0,0,1] } ——最后一位的 logits=1 告诉模型"只在最后一个 token 上输出 logits",因为我们只关心下一个要生成的 token。这个 ubatch 现在准备好被 graph_compute 推过 32 层 transformer。 Our prompt is now llama_ubatch{ n_tokens=4, token=[128000,9906,11,100793], pos=[0,1,2,3], seq_id=[0,0,0,0], logits=[0,0,0,1] } — the trailing logits=1 tells the model "only emit logits at the final token", because that's the only position we'll sample from. This ubatch is now ready for graph_compute to push it through 32 transformer layers.

扩展EXTENDED Request 生命周期 · 从 HTTP 到 batch 的旅程 Request lifecycle · from HTTP to batch

用户发一个 OpenAI 兼容 HTTP 请求到 vLLM 服务器,这个请求在变成 ubatch 之前要经过几个阶段:

1. HTTP receive(FastAPI async handler): 解 JSON,验证 schema · 1-2 ms
2. Tokenize(C03,在 CPU 上): 把 prompt 切成 token id · 1-5 ms 取决长度
3. Sequence 入队: 创建 SequenceGroup 对象,塞进 scheduler 的 WAITING 队列 · 几 μs
4. 等待调度: 一直到 Scheduler.schedule() 选到它 · 0-100 ms 取决并发情况
5. 变成 ubatch: scheduler 把这个 seq + 其他 seq 打包成一个 ubatch · μs 级
6. forward + sample: 走 graph · 几 ms 到几百 ms
7. Detokenize + stream: 把生成的 token 转字符串发回 client · 每 token ~0.5 ms

注意第 4 步——"等待调度" 是个不可忽视的延迟来源。在高负载下,新请求可能等几十毫秒才被排进 batch——这部分时间用户看到的就是 TTFT 高。"调度延迟" 跟 prefill 时间一样,都是 TTFT 的组成部分。

Priority scheduling: vLLM 1.0 之后引入,允许给请求加优先级(高优先级请求插队)。生产场景下区分 free tier / paid tier 用户、互动 chat / 异步批任务等。这是OS 调度器思路在 LLM 上的复用。

A user sends an OpenAI-compatible HTTP request to a vLLM server. Before becoming a ubatch, it passes through several stages:

1. HTTP receive (FastAPI async handler): JSON parse + schema validate · 1-2 ms
2. Tokenize (C03, on CPU): split prompt into token ids · 1-5 ms depending on length
3. Sequence enqueue: create a SequenceGroup, push to scheduler's WAITING queue · μs
4. Wait for scheduling: until Scheduler.schedule() picks it · 0-100 ms depending on load
5. Become ubatch: scheduler packs this seq + others into a ubatch · μs
6. forward + sample: walk the graph · ms to hundreds of ms
7. Detokenize + stream: convert tokens back to strings, send to client · ~0.5 ms per token

Note step 4 — "wait for scheduling" is a non-negligible source of latency. Under high load a new request can wait tens of ms before being batched — what the user sees is high TTFT. "Scheduling latency" is a TTFT component just as much as prefill time.

Priority scheduling: introduced after vLLM 1.0, lets requests have priority levels (high-pri jumps the queue). Production use: distinguishing free vs paid tier, interactive chat vs async batch jobs. OS-scheduler thinking reused for LLM serving.

CHAPTER 06 · HEART

RMSNorm + RoPE — 进入 attention 前的两道门

RMSNorm + RoPE — the two doors before attention

归一化 + 位置编码 · 每层 transformer 都要重做

normalize + rotate · redone in every layer

从这一章起,我们就进入了那 32 层 transformer 楼梯。每一层结构完全一样,只是权重不同——所以本章到第 9 章描述的 4 个步骤,会被重复执行 32 次。一次 forward 总共要走 32 × 4 = 128 步,这还只是 attention 部分,FFN/MoE 是另外的事。

每一层的第一件事不是计算 attention,而是洗澡:把 hidden state 用 RMSNorm 归一化一下,再把 query/key 用 RoPE 旋转一下,把位置信息埋进去。这两步都不学(RMSNorm 只有一个 scale 参数,RoPE 完全无参数),但少了任何一步模型就崩。

From here on we're inside the 32-floor transformer stairwell. Every floor is structurally identical — only weights differ. So the 4 steps described in Ch. 6–9 are repeated 32 times. One forward pass walks 32 × 4 = 128 steps, and that's just attention; FFN/MoE is on top.

Each layer's first move is not computing attention but giving the hidden state a bath: normalize it via RMSNorm, then rotate Q/K via RoPE to embed position. Both steps are almost parameter-free (RMSNorm has one scale vector, RoPE has zero learned params), but drop either and the model collapses.

RMSNorm · 比 LayerNorm 简化了一半

RMSNorm · half the work of LayerNorm

RMSNorm 干的事一句话:把每个 token 的 4096 维向量缩放到"平均能量"为 1,再乘一个可学的 scale。公式:

x' = x / √(mean(x²) + ε) · scale

跟 LayerNorm 相比少了"减均值"那一步。没有数学上的"必须保留",这纯粹是经验结论:Zhang 和 Sennrich 在 2019 年提出 RMSNorm 时发现,去掉中心化几乎不掉精度,却能缩短运行时间(论文报告的降幅随模型不同在 7% 到 64% 之间);Touvron 等人在 Llama-1 里采用了它。从此所有 Llama 系、Mistral、Qwen2、DeepSeek 全部用 RMSNorm。

RMSNorm in one sentence: rescale each token's 4096-vector so its average energy is 1, then multiply by a learnable scale. Formula:

x' = x / √(mean(x²) + ε) · scale

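Compared to LayerNorm, the mean-subtraction is dropped. No math forces this; it is an empirical result. Zhang and Sennrich, who proposed RMSNorm in 2019, found that removing centering barely hurts accuracy while cutting running time (their paper reports 7% to 64% depending on the model), and Touvron et al. adopted it for Llama-1. Llama, Mistral, Qwen2, and DeepSeek all use RMSNorm now.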

ggml/src/ggml.c · ggml_compute_forward_rms_norm_f32() · the whole op

    // RMSNorm over each token's ne0-dimensional vector
    for (int64_t i = 0; i < n_tokens; i++) {
        const float * x = (const float *)((const char *)src0->data + i*nb01);
        float       * y = (float *)((char *)dst->data + i*nb1);

        // step 1: sum of squares
        ggml_float sum = 0.0;
        for (int64_t j = 0; j < ne0; j++) {
            sum += (ggml_float)(x[j] * x[j]);
        }

        // step 2: reciprocal of RMS = 1 / sqrt(mean + eps)
        const float mean  = sum/ne0;
        const float scale = 1.0f/sqrtf(mean + eps);

        // step 3: copy x into y, then scale y in place
        // (the learned scale weights are multiplied in separately, later)
        memcpy(y, x, ne0 * sizeof(float));
        ggml_vec_scale_f32(ne0, y, scale);
    }
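The same formula fits in a few lines of plain Python, which makes the "average energy becomes 1" claim easy to verify. This is a minimal sketch, not the ggml kernel; `rms_norm` and its `gamma` argument (the learned scale) are names chosen for this example.

```python
import math


def rms_norm(x, gamma, eps=1e-5):
    """x' = x / sqrt(mean(x^2) + eps) * gamma, for one token's vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


x = [3.0, -4.0, 0.0, 5.0]
y = rms_norm(x, gamma=[1.0] * len(x), eps=0.0)

# with gamma = 1 and eps = 0 the mean of the squared outputs is 1
energy = sum(v * v for v in y) / len(y)
print(y, energy)
```

Note that the mean of `x` is never computed: the output keeps whatever offset the input had, which is the entire difference from LayerNorm.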

RoPE · 把"我是第 5 个 token"塞进 Q/K 本身

RoPE · embedding "I am token #5" into Q/K itself

原始 Transformer 用"位置 embedding 加在 input 上"——但这有个毛病:模型不知道两个 token 之间的相对距离,只知道各自的绝对位置。这在长上下文里特别糟糕——"5K 远的 token"和"50 远的 token"看起来一样陌生。

RoPE 的思路精彩:不加位置 embedding,而是把 Q 和 K 的每一对维度,看作复数平面上的点,按位置编号 m 旋转 mθ 角度。结果是 Q·Kᵀ 的内积自然带上了"两个 token 位置差"的信息——绝对位置消失了,只剩相对位置。

The original Transformer added "position embeddings to the input", but with a flaw: the model couldn't see the relative distance between two tokens, only their absolute positions. Long context made this brutal — "5K away" and "50 away" both feel equally foreign.

RoPE's idea is elegant: don't add a position embedding; instead, treat each pair of Q and K dims as a point in the complex plane, and rotate it by an angle mθ where m is the position. The result: Q·Kᵀ inner product naturally picks up "the position difference between two tokens" — absolute positions vanish, only relative ones survive.

维度对 / dim pair	θ_i	在 token m 上旋转 / rotate at position m	作用 / captures
(0, 1)	10000^0 = 1	m × 1 rad	高频 · 短距离 / high freq · short range
(2, 3)	10000^(-2/d)	m × 10000^(-2/d) rad	中频 / mid freq
(d-2, d-1)	≈ 10000^(-1)	m × 0.0001 rad	低频 · 长距离 / low freq · long range

FIG. 04 RoPE 把 d_head 维分成 d/2 对,每对的旋转频率从 1 rad 指数衰减到 10000⁻¹ rad。高频对捕捉"相邻 token"关系,低频对捕捉"几千 token 之前那段"关系——这就是 RoPE 能在 8K/32K/128K 上下文里都工作的物理基础。 RoPE splits d_head into d/2 pairs; each pair's rotation frequency decays exponentially from 1 rad to 10000⁻¹ rad. High-freq pairs catch "neighbouring token" relations, low-freq pairs catch "the section thousands of tokens back" — the physical basis for RoPE working at 8K / 32K / 128K context.
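The frequency ladder in FIG. 04 can be generated directly. The sketch below assumes d_head = 128 and base = 10000; the helper names are invented for this example.

```python
import math


def rope_frequencies(d_head=128, base=10000.0):
    """theta_i = base^(-2i/d_head) for i = 0 .. d_head/2 - 1."""
    if d_head % 2 != 0:
        raise ValueError("d_head must be even so dims can be paired")
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]


def wavelengths(freqs):
    """Number of positions needed for one full turn at each frequency."""
    return [2.0 * math.pi / f for f in freqs]


freqs = rope_frequencies()
waves = wavelengths(freqs)
print(f"fastest pair: theta={freqs[0]:.4f}, wavelength={waves[0]:.2f} tokens")
print(f"slowest pair: theta={freqs[-1]:.6f}, wavelength={waves[-1]:.0f} tokens")
```

For d_head = 128 the slowest pair lands a little under the 2π × 10000 limit, because its exponent is (d-2)/d rather than exactly 1.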

src/llama-graph.cpp · build_attn() · RoPE call site · two lines, one rotation

    // Both Q and K get rotated by RoPE · V does not (V is position-independent)
    // rotation angle = pos × theta_i, where theta_i is a fixed model parameter
    Qcur = ggml_rope_ext(ctx0,
        ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens),
        inp_pos,        // [n_tokens] · position of each token
        rope_factors,   // long-context scaling (YaRN/NTK-aware)
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ...);
    Kcur = ggml_rope_ext(ctx0,
        ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens),
        inp_pos, rope_factors,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ...);
    // Note: V is not rotated! The KV cache later stores K post-rotation and V as-is;
    // this detail is what lets MLA save half the cache (C11)

扩展EXTENDED YaRN / NTK-aware — 怎么把 8K 训练的模型推到 128K YaRN / NTK-aware — how to push an 8K-trained model to 128K

RoPE 训练时见过的最长位置是 n_ctx_orig(Llama-3 是 8192)。如果你想让它处理 128K 的 prompt,直接喂不工作——超出训练范围的位置 m,旋转角度 mθ 已经绕了好几圈,模型从来没见过。

YaRN(以及前身 NTK-aware) 的做法是对不同频率区别对待:低频维(管长距离,波长接近或超过训练长度)的频率被压低,让更大的位置编号落回模型见过的角度范围;高频维(管短距离)几乎不动,保住相邻 token 的分辨率。具体在 ggml_rope_ext 里通过 rope_factors 参数实现:这是个 d/2 维的向量,每个频率对应一个缩放因子(等效乘数 ≤ 1)。

这个 trick 让你不重新训练就能把 8K Llama-3 跑出 32K 上下文。代价是远距离精度略降,适合"找资料"型任务,不适合"逻辑推理跨整个 128K"型任务——后者需要真训长。

RoPE was trained seeing positions up to n_ctx_orig (8192 for Llama-3). Feeding a 128K prompt straight in doesn't work — angles mθ for unseen m have wrapped around several times and the model has no idea.

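YaRN (and its predecessor NTK-aware) rescales frequencies unevenly so that long positions land on angles the model has already seen. Low-freq dims (long range), whose wavelengths are comparable to or longer than the training context, get slowed down; high-freq dims (short range) are left almost untouched so local order stays sharp. In ggml_rope_ext the per-frequency rescaling enters through the rope_factors param, a d/2 vector with one factor per frequency (effective multiplier ≤ 1).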

The trick lets you push an 8K Llama-3 to 32K context without retraining. The cost is reduced precision at long distance — fine for "retrieval" tasks, weak for "logical reasoning spanning 128K" workloads. Those need actual long-context training.
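A simplified sketch of the "NTK-by-parts" interpolation that YaRN builds on may help here. It is an illustration under stated assumptions, not the llama.cpp code path: the ramp bounds `alpha=1` and `beta=32` are assumed values, the function name is invented, and YaRN's extra attention-temperature correction is left out entirely.

```python
import math


def ntk_by_parts(freqs, scale, n_ctx_orig, alpha=1.0, beta=32.0):
    """Interpolate only the frequencies that never completed a turn in training."""
    out = []
    for theta in freqs:
        wavelength = 2.0 * math.pi / theta
        rotations = n_ctx_orig / wavelength   # full turns seen during training
        # gamma = 1: leave the frequency alone; gamma = 0: divide it by `scale`
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1.0 - gamma) * theta / scale + gamma * theta)
    return out


d_head, base = 128, 500000.0
freqs = [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]
scaled = ntk_by_parts(freqs, scale=4.0, n_ctx_orig=8192)

print("fastest pair ratio:", scaled[0] / freqs[0])    # unchanged
print("slowest pair ratio:", scaled[-1] / freqs[-1])  # slowed by the full factor
```

The fastest pairs rotate thousands of times inside the training window, so they are kept as trained; the slowest pairs never finished a single turn, so they are the ones stretched to cover the longer context.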

扩展 · 数学EXTENDED · MATH RoPE 的完整复数推导 — 为什么旋转能编码"相对位置" RoPE in full complex-number derivation · why rotation encodes "relative position"

RoPE 看起来神奇——加位置 embedding 不香吗?为什么要旋转?它最美的地方在 attention 的 Q·K^T 自然等于"位置差的函数"——这件事在数学上是有唯一构造的。一步一步推:

把 Q 和 K 的每一对维度看作复平面上的点:
q_m = (q_a, q_b) → 复数 q = q_a + i·q_b
k_n = (k_a, k_b) → 复数 k = k_a + i·k_b

RoPE 给每个位置 m 旋转 mθ 角:
q'_m = q · e^(imθ)
k'_n = k · e^(inθ)

关键的魔法时刻:看 Q 和 K 的内积(对应 attention scores):

⟨q'_m, k'_n⟩ = Re(q'_m · k'_n*)
= Re(q · e^(imθ) · k* · e^(-inθ))
= Re(q · k* · e^(i(m-n)θ))

看那个 e^(i(m-n)θ) ——结果只跟 m-n 有关,不再跟 m 和 n 各自有关。绝对位置在内积里消失了,只剩相对位置。这就是 RoPE 的全部数学之美:用一个朴素的旋转,让 attention 自动只看相对距离。

而且这个性质不能被简单的 "加位置 embedding" 实现——加法在内积里展开会留下 q·pos_m + pos_n·k + pos_m·pos_n 三个交叉项,无法清干净绝对位置。Su Jianlin 在 2021 年发现这个构造时,真的是一记纯数学的妙手。

实际实现:d_head 维向量被分成 d_head/2 对,每对独立旋转一个不同频率的角度。频率 θ_i = 10000^(-2i/d_head) 从 1 衰减到 1/10000,使得不同对的"波长"覆盖从 ~6 token 到 ~62800 token 的所有相对距离尺度。这就是为什么 d_head 必须偶数——奇数会有一个孤零零的维度不能配对成复数。

RoPE looks magical — why not just add a position embedding? Why rotate? Its real beauty: Q·K^T naturally becomes a "function of position difference" — and there's a mathematically unique construction for that. Step by step:

View each pair of Q / K dims as a point in the complex plane:
q_m = (q_a, q_b) → complex q = q_a + i·q_b
k_n = (k_a, k_b) → complex k = k_a + i·k_b

RoPE rotates each position m by mθ:
q'_m = q · e^(imθ)
k'_n = k · e^(inθ)

The magic moment: look at the inner product Q · K (which is attention's score):

⟨q'_m, k'_n⟩ = Re(q'_m · k'_n*)
= Re(q · e^(imθ) · k* · e^(-inθ))
= Re(q · k* · e^(i(m-n)θ))

That e^(i(m-n)θ) — the result depends only on m-n, no longer on m and n individually. Absolute position vanishes from the inner product; only relative position remains. This is the entirety of RoPE's mathematical beauty: a plain rotation makes attention automatically see only relative distance.

And this property cannot be achieved by simple "add a position embedding" — addition leaves three crossterms (q·pos_m + pos_n·k + pos_m·pos_n) in the inner product expansion, irreducible. When Su Jianlin discovered this construction in 2021, it was a stroke of pure math brilliance.

In practice: d_head dims are paired into d_head/2 complex pairs; each pair rotates by a different frequency. Frequencies θ_i = 10000^(-2i/d_head) decay from 1 down to 1/10000, so "wavelengths" span ~6 tokens to ~62800 tokens — covering every relative-distance scale. This is why d_head must be even — an odd dim would leave one unpaired axis with no complex partner.
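The derivation can be checked numerically with Python's built-in complex numbers. This is a minimal sketch for one dimension pair; `rope_rotate` and `score` are names made up for the example, and the vectors and θ are arbitrary.

```python
import cmath


def rope_rotate(pair, pos, theta):
    """Rotate one (a, b) pair, viewed as a + ib, by pos * theta radians."""
    return complex(*pair) * cmath.exp(1j * pos * theta)


def score(q, k, m, n, theta):
    """Inner product of the rotated pair: Re(q'_m * conj(k'_n))."""
    return (rope_rotate(q, m, theta) * rope_rotate(k, n, theta).conjugate()).real


q, k, theta = (0.3, -1.2), (0.7, 0.5), 0.05

near_start = score(q, k, m=10, n=7, theta=theta)       # offset 3, early in the text
far_along = score(q, k, m=1010, n=1007, theta=theta)   # offset 3, a thousand tokens later
other_gap = score(q, k, m=10, n=3, theta=theta)        # offset 7

print(near_start, far_along, other_gap)
```

Shifting both positions by 1000 leaves the score unchanged up to rounding, while changing the gap changes it: only m − n survives, exactly as the algebra says.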

扩展EXTENDED 为什么是 10000 — RoPE 那个"底数"的选择 Why 10000 — RoPE's "base" choice

RoPE 公式里那个 10000^(-2i/d) 的10000不是物理常数。它是个设计参数,影响"模型能感知多远的位置差"。直觉:

base 太小(比如 100):最慢的一对也只降到 θ = 1/100,波长约 628 token,低频维很快绕回,长距离位置混叠、信息丢失。
base 相对训练长度太大(比如只训几千 token 却用 1000000):大部分维度转得极慢,在训练窗口内几乎不动;能分辨短距离顺序的维度变少,最慢的那些维度也学不充分,除非模型真的见过长序列。
base = 10000(原 Transformer 选的):一个经验上的折中——使得最长 wavelength ≈ 2π × 10000 ≈ 62800,约等于"训练时的最长 seq len 的几倍"——给推理时一定 extrapolation 余量。

Llama-3 在 Llama-2 基础上把 base 从 10000 调到了 500000,为更长的上下文留余量。base 大了之后所有 wavelength 都拉长,慢维度在 128K 范围内不会"绕回"。这是后来 Llama-3.1 支持 128K 的关键旋钮(配合长上下文训练数据和 RoPE scaling);最初的 Llama-3 训练长度是 8192,前文已提到。

DeepSeek-V3 的 base 更夸张,达到 1,000,000,对应 128K-256K 上下文。这件事跟 YaRN 配合使用——base 决定长度范围,YaRN 决定推理时怎么把训练长度外推。

The 10000 in 10000^(-2i/d) isn't a physical constant. It's a design parameter controlling "how far the model can perceive position differences". Intuition:

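base too small (e.g. 100): the slowest pair only gets down to θ = 1/100, a wavelength of about 628 tokens, so even the low-freq dims wrap quickly and long-range positions alias.
base too large for the training length (e.g. 1000000 with only a few thousand training tokens): most pairs rotate so slowly that they barely move inside the training window; fewer dims are left to resolve short-range order, and the slowest ones stay under-trained unless the model actually sees long sequences.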
base = 10000 (original Transformer pick): empirical compromise — max wavelength ≈ 2π × 10000 ≈ 62800, roughly "a few times the training seq len" — leaving some extrapolation room at inference.

Llama-3 bumped base from Llama-2's 10000 to 500000 to leave room for much longer context. A larger base stretches all wavelengths, so the slow dims do not "wrap" within 128K. This is the key knob (paired with long-context training data and RoPE scaling) behind the 128K window that Llama-3.1 later shipped; the original Llama-3 was trained at 8192, as noted above.

DeepSeek-V3 goes further: base 1,000,000, supporting 128K-256K. Often paired with YaRN — base sets the length budget, YaRN dictates how to extrapolate beyond training length at inference.

扩展EXTENDED RMSNorm vs LayerNorm vs QK-Norm — 三种归一化的细节 RMSNorm vs LayerNorm vs QK-Norm — three normalizations in detail

归一化在 Transformer 里不只 RMSNorm 一种。三种主流变体的差异其实影响训练稳定性,推理时形状一样,但有些性质值得知道:

three normalization variants · side by side · subtle differences

    # LayerNorm · 经典 Transformer · BERT 用
    y = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta
    // 减均值 + 除标准差 + 学 scale + 学 bias
    // 参数:gamma[d_model], beta[d_model]

    # RMSNorm · Llama / Mistral / DeepSeek 用
    y = x / sqrt(mean(x^2) + eps) * gamma
    // 不减均值 · 不学 bias
    // 参数:gamma[d_model] · 比 LayerNorm 少一半

    # QK-Norm · ViT-22B, Llama-3.x finetune · 部分 model card 提到
    q' = RMSNorm(q)
    k' = RMSNorm(k)
    attn = softmax(q' @ k'.T / sqrt(d))
    // 在 attention 内部 · 对 Q 和 K 各做一次 RMSNorm
    // 用于训练稳定性 · 解决大模型训长 context 时的 attention logit 爆炸

关键差异:

LayerNorm → RMSNorm:工程上少一半参数(去掉 beta),计算上去掉 mean,~20% 加速。理论上 RMSNorm 假设输入已经中心化(这个假设在残差网络上经验上对得起来)。
QK-Norm:Llama 论文没用,但训长 context 的 SOTA 都在用——理由是 attention 公式里 Q·K^T 在大 d_head 时数值范围爆炸,softmax 进 saturation,梯度消失。在 Q/K 各加一次 RMSNorm 直接把它们规整到固定 norm 范围,稳定训练。

推理时 RMSNorm 是 ggml 里的 ggml_rms_norm,QK-Norm 是在 build_attn 里手动加。无论哪个,推理代价都很小(每个 token ~5K FLOPs),但训练时的影响显著——错一个归一化策略,大模型可能在第 10000 step 突然发散。

RMSNorm isn't the only normalizer in Transformers. Three mainstream variants differ in subtle ways that affect training stability; at inference the shape is the same but some properties matter:

three normalization variants · side by side · subtle differences

    # LayerNorm · classic Transformer · used by BERT
    y = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta
    // subtract mean + divide by std + learned scale + learned bias
    // Params: gamma[d_model], beta[d_model]

    # RMSNorm · Llama / Mistral / DeepSeek
    y = x / sqrt(mean(x^2) + eps) * gamma
    // no mean-subtract · no learned bias
    // Params: gamma[d_model] · half of LayerNorm

    # QK-Norm · ViT-22B, some Llama-3.x finetunes · mentioned in model cards
    q' = RMSNorm(q)
    k' = RMSNorm(k)
    attn = softmax(q' @ k'.T / sqrt(d))
    // inside attention · normalize Q and K each
    // improves training stability · prevents logit explosion in long-context

Key differences:

LayerNorm → RMSNorm: half the params (no beta), one less op (no mean), ~20% faster. Theoretically RMSNorm assumes input is already centered (an assumption that empirically holds in residual nets).
QK-Norm: not in the Llama paper, but SOTA for long-context training — at large d_head, Q·K^T magnitudes blow up, softmax saturates, gradients vanish. RMSNorm on Q and K each pins them to a fixed-norm regime, stabilizing training.

At inference, RMSNorm is ggml_rms_norm; QK-Norm is added manually in build_attn. Either way inference cost is tiny (~5K FLOPs per token), but training impact is significant — pick the wrong norm and a large model can suddenly diverge at step 10000.
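The three variants can be compared side by side in plain Python. This is a sketch for single vectors, with function names chosen for the example; it shows two properties claimed above: RMSNorm matches LayerNorm when the input is already centered, and QK-Norm makes the attention logit insensitive to the magnitude of Q.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the std, then apply scale and bias."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    """No centering, no bias: divide by the root mean square, then scale."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def qk_norm_score(q, k, gamma_q, gamma_k):
    """One attention logit with RMSNorm applied to q and k first."""
    qn, kn = rms_norm(q, gamma_q), rms_norm(k, gamma_k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


ones, zeros = [1.0] * 4, [0.0] * 4

centered = [1.0, -1.0, 2.0, -2.0]            # mean is exactly 0
ln_out = layer_norm(centered, ones, zeros)
rms_out = rms_norm(centered, ones)

q, k = [1.0, 2.0, -1.0, 0.5], [0.5, -1.0, 2.0, 1.0]
small = qk_norm_score(q, k, ones, ones)
large = qk_norm_score([1000.0 * v for v in q], k, ones, ones)
print(ln_out, rms_out, small, large)
```

With unit gammas the normalized logit can never exceed sqrt(d) in magnitude, which is the "fixed-norm regime" that keeps softmax out of saturation.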

CHAPTER 07 · HEART

QKV projection — 一个 token 摊成三份

QKV projection — one token, three faces

build_attn 的前 3 行 · 最暴力的 matmul

first three lines of build_attn · the brutest matmul

RMSNorm 之后,hidden state 已经准备好"变身"。接下来这一步看起来平淡:三次矩阵乘。但这类稠密 matmul(QKV、输出投影、FFN)在短上下文下几乎吃掉整层全部 FLOPs,QKV 是其中第一批。权重也很大:单个 W_q 就是 4096×4096 ≈ 16.8M 个参数,fp16 约 32 MiB,32 层就是约 1 GiB,跟 embedding 表差不多大。W_k 和 W_v 在 GQA 下更小(第 10 章),FFN 的矩阵则更大。

"一个 token,三种身份":

Q(Query):我是谁、我在找什么——这个 token 自己出主动权问"谁能帮我?"
K(Key):我能被找到——存进 KV cache,等以后被别的 token 当成"可能的答案"对比。
V(Value):如果你选中我,我能给你什么——也存进 KV cache,等被加权求和的时候贡献内容。

After RMSNorm, the hidden state is ready to "transform". The next step looks unremarkable: three matrix multiplications. But dense matmuls like these (QKV, the output projection, and the FFN) are where nearly all of a layer's FLOPs go at short context, and QKV is the first of them. The weights are large too: W_q alone is 4096×4096 ≈ 16.8M params, about 32 MiB in fp16 and about 1 GiB across 32 layers, roughly the size of the embedding table. W_k and W_v are smaller under GQA (Ch. 10), and the FFN matrices are larger still.

"One token, three identities":

Q (Query): who am I, what am I looking for — this token actively asks "who can help me?"
K (Key): I can be found — stored in KV cache, later compared against by other tokens looking for "possible answers".
V (Value): if you pick me, here's what I contribute — also stored in KV cache, weighted-summed at attention time.

src/llama-graph.cpp · llm_build_llama::build_attn() · lines 1-12 · three matmuls

    // inpL is this layer's input hidden state · shape [n_embd, n_tokens]
    // RMSNorm is applied here (C06)
    ggml_tensor * cur = build_norm(inpL, model.layers[il].attn_norm, ...);

    // Q = cur × W_q · shape [n_embd_head_k × n_head, n_tokens]
    ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);

    // K = cur × W_k · shape [n_embd_head_k × n_head_kv, n_tokens]
    // note: n_head_kv ≠ n_head under GQA (C10)
    ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);

    // V = cur × W_v · same shape as K
    ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);

    // then RoPE (C06), KV cache write (C08), attention (C09)

"多头" 是怎么塞进同一个矩阵里的

How "multi-head" packs into a single matrix

Llama-3-8B 有 32 个 attention head。物理上看,你以为是 32 套独立的 Q/K/V 投影矩阵,但实现上不是——它们被拼成一个大矩阵。wq 是 [n_embd, n_head × d_head] = [4096, 32 × 128] = [4096, 4096]。它本身就是 32 个头,reshape 一下视角就分出来了。

这是个很重要的工程选择:三个大 matmul 比 32 × 3 = 96 个小 matmul 快得多。GPU 上启动一次 kernel 的开销是固定的(几 μs),小矩阵摊不开 SM 利用率。所以现代实现都是"大 matmul + 一次 reshape",把 head 维当成 reshape 出来的虚拟维度。

Llama-3-8B has 32 attention heads. Logically you'd imagine 32 independent Q/K/V projection matrix sets — but implementation-wise that's not how it works. They're packed into one big matrix. wq is [n_embd, n_head × d_head] = [4096, 32 × 128] = [4096, 4096]. It IS the 32 heads; reshape just changes how you view it.

This is a load-bearing engineering choice: three big matmuls beat 32 × 3 = 96 small matmuls by a huge margin. GPU kernel launch is a fixed cost (a few μs); small matrices can't saturate SMs. So modern implementations all do "one big matmul + one reshape", treating the head dim as a virtual axis exposed by reshape.

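数字感 NUMBERS Llama-3-8B,一个 token 进 attention 这一层:Q 投影是 4096×4096 = 16.8M 次乘加;K、V 在 GQA 下各只有 4096×1024 = 4.2M 次乘加。合起来 25.2M 次乘加,按 1 次乘加 = 2 FLOPs 算约 50M FLOPs。32 层乘起来,一个 token 光 QKV 投影就要 1.6 GFLOPs。这只是 attention 块的一部分,FFN 块(3 个 4096×14336 矩阵)每层约 352M FLOPs,是它的 7 倍左右。整个模型按"每个权重 2 FLOPs"估算,跑 1 个 token 约 16 GFLOPs,跑 1K 个 token 约 16 TFLOPs,所以只有 prefill 才能把 H100 喂饱。 Llama-3-8B, one token entering this attention layer: the Q projection is 4096×4096 = 16.8M multiply-adds; under GQA, K and V are only 4096×1024 = 4.2M each. That is 25.2M multiply-adds, or about 50M FLOPs at 2 FLOPs per multiply-add. Across 32 layers, one token spends 1.6 GFLOPs on QKV projection alone. And that is only part of the attention block; the FFN (three 4096×14336 matrices) costs about 352M FLOPs per layer, roughly 7× more. At 2 FLOPs per weight, the whole model is about 16 GFLOPs for 1 token and about 16 TFLOPs for 1K tokens, which is why only prefill saturates an H100.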
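The QKV arithmetic is short enough to script. The sketch below assumes the Llama-3-8B shapes used throughout this piece (n_embd 4096, 32 query heads, 8 KV heads under GQA, d_head 128) and counts one multiply-add as 2 FLOPs; the function name is invented for the example.

```python
def qkv_macs_per_token(n_embd=4096, n_head=32, n_head_kv=8, d_head=128):
    """Multiply-adds for the Q, K and V projections of one token in one layer."""
    q = n_embd * n_head * d_head       # full-width query projection
    k = n_embd * n_head_kv * d_head    # GQA: fewer key heads
    v = n_embd * n_head_kv * d_head    # GQA: fewer value heads
    return q + k + v


n_layer = 32
macs = qkv_macs_per_token()
flops_per_layer = 2 * macs                  # 1 multiply-add = 2 FLOPs
flops_all_layers = n_layer * flops_per_layer

print(f"{macs / 1e6:.1f}M MACs, {flops_per_layer / 1e6:.1f}M FLOPs per layer, "
      f"{flops_all_layers / 1e9:.2f} GFLOPs across {n_layer} layers")
```

Setting `n_head_kv=32` reproduces the no-GQA case, where K and V each cost as much as Q.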

CHAPTER 08 · HEART · KEY

KV cache — 整个推理系统最贵的东西

KV cache — the most expensive thing in the system

llama-kv-cache-unified.cpp · 一张让 decode 可行的桌子

llama-kv-cache-unified.cpp · the table that makes decode possible

这是整篇文章最关键的一章。如果只能记住一件事,记住这个:没有 KV cache,LLM 推理就是一台不可能跑起来的机器。

原因在第 2 章解释过:decode 时只有 1 个 token 进入模型,但 attention 公式要求这个 token 的 Q 跟所有历史 K/V 算内积。如果每次 decode 都重新算历史 token 的 K/V,那生成第 100 个 token 就要重算 99 个 token 的全部 32 层 attention——O(n²) 的 prefill 反复跑,完全没法用。

解决方法朴素到不像方法:把每个 token 的 K 和 V 算一次,存起来,以后直接读。这就是 KV cache。它把每步 decode 的 K/V 投影计算量从 O(n) 降到 O(1);attention 里读 cache 仍然是 O(n),但那只是读内存,不是重算。代价是内存,而且是显存,GPU 最贵的那种内存。

This is the most important chapter in this article. If you only remember one thing, remember this: without KV cache, LLM inference is a machine that simply cannot run.

The reason was sketched in Ch. 2: at decode time only one token enters the model, yet attention requires that token's Q to inner-product with all historical K/V. If you recompute history K/V on every decode step, producing the 100th token reruns 99 tokens' worth of full 32-layer attention — O(n²) prefill on every decode. Unusable.

The fix is so plain it barely sounds like one: compute each token's K and V once, save them, reuse forever. That's the KV cache. It cuts the K/V projection work per decode step from O(n) to O(1); reading the cache inside attention is still O(n), but that is a memory read rather than a recompute. The price is memory, and specifically VRAM, the most expensive memory the GPU has.
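A back-of-envelope count shows how quickly the recompute cost grows. The two helpers below are hypothetical and count only per-layer K/V projections (one per token processed), assuming a prompt of `n_prompt` tokens followed by `n_new` generated tokens.

```python
def kv_projections_without_cache(n_prompt, n_new):
    """Every generation step re-projects K/V for the whole sequence so far."""
    return sum(n_prompt + step for step in range(n_new))


def kv_projections_with_cache(n_prompt, n_new):
    """Prefill projects the prompt once; each later step projects one new token."""
    return n_prompt + (n_new - 1)


without = kv_projections_without_cache(n_prompt=4, n_new=100)
with_cache = kv_projections_with_cache(n_prompt=4, n_new=100)
print(without, with_cache, f"{without / with_cache:.0f}x")
```

The gap grows quadratically with the number of generated tokens, which is why the cache stops being optional well before a reply reaches a few hundred tokens.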

KV cache 的真实重量

The actual weight of KV cache

容量公式很简单:

size = 2 × n_layer × n_head_kv × d_head × n_ctx × sizeof(dtype)
(2 因为 K 和 V 各存一份)

代入 Llama-3-8B + 8K 上下文 + fp16:

2 × 32 (层) × 8 (n_head_kv,GQA 把 32 head 压成 8 个) × 128 (d_head) × 8192 (n_ctx) × 2 (bytes per fp16)
= 1.07 GB

同样这个模型,如果是每用户一份 KV cache,服务 32 个并发用户就是 32 GB——一张 80 GB H100 砸下去,模型权重 16 GB 占走,KV cache 占 32 GB,剩下空间装中间激活、通信缓冲。这就是为什么 LLM 推理服务最贵:一个用户 1 GB+,模型本身才 16 GB,用户的"历史"比模型自己还重。

Capacity formula:

size = 2 × n_layer × n_head_kv × d_head × n_ctx × sizeof(dtype)
(2 because K and V each stored)

Plug in Llama-3-8B + 8K context + fp16:

2 × 32 (layers) × 8 (n_head_kv, GQA shrinks 32 heads to 8) × 128 (d_head) × 8192 (n_ctx) × 2 (bytes/fp16)
= 1.07 GB

Same model, per-user KV cache: 32 concurrent users = 32 GB. On an 80 GB H100, model weights take 16 GB, KV cache eats 32 GB, the rest covers intermediate activations and comm buffers. This is why LLM serving is so expensive: each user 1 GB+, the model itself only 16 GB — the user's "history" outweighs the model.
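The capacity formula translates directly into a small helper. `kv_cache_bytes` is a name chosen for this example; the inputs are the Llama-3-8B values quoted above.

```python
def kv_cache_bytes(n_layer, n_head_kv, d_head, n_ctx, bytes_per_elem=2):
    """size = 2 (K and V) x layers x KV heads x head dim x context x dtype size."""
    return 2 * n_layer * n_head_kv * d_head * n_ctx * bytes_per_elem


per_user = kv_cache_bytes(n_layer=32, n_head_kv=8, d_head=128, n_ctx=8192)
fleet = 32 * per_user                       # 32 concurrent users
no_gqa = kv_cache_bytes(32, 32, 128, 8192)  # what it would cost with 32 KV heads

print(f"per user: {per_user / 1e9:.2f} GB, 32 users: {fleet / 1e9:.1f} GB, "
      f"without GQA: {no_gqa / 1e9:.2f} GB per user")
```

The last line shows why GQA matters here: with all 32 heads cached, the same context would need four times the memory per user.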

FIG. 05 · KV cache · 128 槽 · 每槽 = 1 个 token 的 K 和 V(图例:已用 / 本次写入 / 空闲)。每 decode 一个 token,cache 往前涨一格。Llama-3-8B 在 8K 上下文下,这张"桌子"一共 8192 个槽;每多一个用户,就多一张同样大的桌子;每加一倍上下文,桌子加一倍。这就是为什么"128K 上下文"是个工程奇迹而不是个简单的配置项:它意味着 KV cache 大 16 倍。

FIG. 05 · KV cache · 128 slots · 1 slot = K + V for one token (legend: used / just written / free). Each decoded token grows the cache by one slot. Llama-3-8B at 8K context = 8192 slots in this "table". Each extra user adds another table of the same size. Double the context, table doubles. This is why "128K context" is an engineering feat, not a config flag: it means a 16× larger KV cache.

llama.cpp 的 KV cache 实现:slot 分配 + ring buffer

llama.cpp's KV cache: slot allocation + ring buffer

llama.cpp 早期的 KV cache 是一个"线性写入,直到满"的简单数组(llama_kv_cache),但多用户 / 长对话场景下逐渐演化出 llama_kv_cache_unified(src/llama-kv-cache-unified.cpp),核心结构是"很多 slot,每 slot 标记 [pos, seq_id]"——同一张物理 cache 上同时住几条对话,跟 vLLM 的 PagedAttention 思路接近但实现简单很多。

Early llama.cpp KV cache was a simple "write linearly until full" array (llama_kv_cache). Multi-user and long-conversation workloads pushed it toward llama_kv_cache_unified (src/llama-kv-cache-unified.cpp): "lots of slots, each tagged with [pos, seq_id]" — multiple conversations share one physical cache, similar in spirit to vLLM's PagedAttention but much simpler in mechanism.

src/llama-kv-cache-unified.cpp · llama_kv_cache_unified::find_slot() · where this ubatch goes

    // Find n_tokens contiguous slots in the KV cache for the current ubatch
    // Returns the start index; subsequent K/V writes begin there
    slot_info llama_kv_cache_unified::find_slot(const llama_ubatch & ubatch) {
        const uint32_t n_tokens = ubatch.n_tokens;

        // scan from head for n_tokens contiguous "free or reusable" slots
        uint32_t n_tested = 0;
        while (true) {
            if (head + n_tokens > size) {
                // wrap around to the start
                n_tested += size - head;
                head = 0;
                continue;
            }
            bool found = true;
            for (uint32_t i = 0; i < n_tokens; i++) {
                if (cells[head + i].pos >= 0) {
                    // occupied
                    found = false;
                    head     += i + 1;
                    n_tested += i + 1;
                    break;
                }
            }
            if (found) break;
            if (n_tested >= size) return {-1};   // the whole cache is full
        }

        // tag these slots with the ubatch's token positions / seq_ids
        for (uint32_t i = 0; i < n_tokens; i++) {
            cells[head + i].pos = ubatch.pos[i];
            cells[head + i].seq_id.insert(ubatch.seq_id[i][0]);
        }
        return {head};
    }
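The scan-and-wrap search is easier to follow on a tiny cache. This is a toy Python analogue, not a port: cells are a plain list where `None` means free, the function returns only the start index, and an early size check is added so the loop always terminates.

```python
def find_slot(cells, head, n_tokens):
    """Return the start of n_tokens contiguous free cells, scanning from head, or -1."""
    size = len(cells)
    if n_tokens > size:
        return -1
    n_tested = 0
    while n_tested < size:
        if head + n_tokens > size:
            # not enough room before the end of the buffer: wrap to the start
            n_tested += size - head
            head = 0
            continue
        # index of the first occupied cell inside the candidate window, if any
        busy = next((i for i in range(n_tokens) if cells[head + i] is not None), None)
        if busy is None:
            return head
        head += busy + 1          # restart just past the occupied cell
        n_tested += busy + 1
    return -1


# 8-slot cache: seq "a" holds slots 0-1, seq "b" holds slot 5
cells = [("a", 0), ("a", 1), None, None, None, ("b", 0), None, None]

print(find_slot(cells, head=0, n_tokens=3))  # first free run of 3 starts at slot 2
print(find_slot(cells, head=7, n_tokens=2))  # wraps around and lands on slot 2
print(find_slot(cells, head=0, n_tokens=4))  # no run of 4 exists
```

The third call fails even though five slots are free in total: the requirement is contiguity, which is the fragmentation problem that paged designs set out to remove.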

KV cache 在 attention 里的两种角色

KV cache's two roles in attention

每一层 attention 块在同一张 KV cache上做两件事:

写:本 ubatch 的 n_tokens 个新 K 和 V 拷贝到 cache 的指定 slot——这一步在 ggml 里就是 ggml_cpy(本质 memcpy)。耗时 ~几十 μs。
读:本 ubatch 的 Q 要跟"从 cache 起始到当前位置的所有 K"做 matmul,再跟所有 V 做加权求和——这是第 9 章的 attention kernel。

这两件事在每一层都重复一次。所以 32 层模型一次 forward,KV cache 被读 32 次、写 32 次——它是整个推理里读写最频繁的张量。

Each attention block does two things to the same KV cache:

Write: copy this ubatch's n_tokens new K and V into the assigned cache slots — in ggml this is ggml_cpy (just memcpy). Takes ~tens of μs.
Read: this ubatch's Q matmuls with "all K from cache start to current position", then weighted-sums all V — Ch. 9's attention kernel.

Both happen per layer. So a 32-layer forward reads KV cache 32 times and writes it 32 times — the most-trafficked tensor in the whole inference.

主线 · KV cache 在我们的 prompt 上 MAIN LINE · OUR PROMPT'S KV prefill 这 6 个 token:find_slot 找到 head=0 的 6 个连续空 slot,标记 [pos=0..5, seq_id=0]。32 层 attention 各自把自己算出的 K/V 写进这 6 个 slot,总共写入 2 × 32 × 8 × 128 × 6 × 2 = 786,432 字节 ≈ 0.75 MB。然后第 1 个 decode token 来的时候,Q 跟这 6 个 K 做内积,再跟 6 个 V 加权,得到第 7 个 token 的 hidden state。这才是 attention 的"记得前文"。 Prefill 4 tokens: find_slot finds 4 contiguous empty slots at head=0 and marks them [pos=0..3, seq_id=0]. All 32 attention layers write their computed K/V into these 4 slots, 2 × 32 × 8 × 128 × 4 × 2 = 524,288 bytes = 0.5 MB in total. When the first decode token arrives, its Q dot-products against those 4 K rows, weighted-sums the 4 V rows, and produces token-5's hidden state. That is what "the model remembers what came before" really means.

扩展 · 重要EXTENDED · KEY KV cache 的物理布局 — 为什么 K 和 V 不同形状 KV cache physical layout — why K and V have different shapes

看 llama.cpp 的 KV 张量声明会发现一个细节:K 和 V 的形状不一样。

Looking at llama.cpp's KV tensor declarations reveals a curious detail: K and V have different shapes.

src/llama-kv-cache-unified.cpp · init() · K vs V layout · attention-friendly transpose

    // same layer · same token count · but the shapes are mirrored
    k_l = ggml_new_tensor_3d(ctx, type_k,
        n_embd_k_gqa,   // d_head × n_head_kv = 128 × 8 = 1024
        kv_size,        // n_ctx = 8192
        1);
    // K: shape [n_embd_k_gqa, n_ctx] · each row is one token's full K

    v_l = ggml_new_tensor_3d(ctx, type_v,
        kv_size,        // n_ctx = 8192 · note this dim comes first!
        n_embd_v_gqa,   // d_head × n_head_kv = 1024
        1);
    // V: shape [n_ctx, n_embd_v_gqa] · each column is one token's V

这不是手抖。attention 公式里 QKᵀ 跟 P·V 的内存访问模式相反:

QKᵀ:Q 是 [n_tokens, d_head],K 是 [n_ctx, d_head]。要做 Q × K.T,需要 K 按"行=token"组织——K cache 长这样,每行一个 token,顺序读高速。
P·V:P 是 [n_tokens, n_ctx],V 是 [n_ctx, d_head]。需要 V 按"列=token"组织——这样 P 的第 i 列乘 V 的第 i 行能顺序累加。

所以 V 物理上是转置存储的——写入时 ggml_cpy 会做一次 transpose,读取时直接顺序流。这一步看似小,实测能给 attention kernel 提 30% 速度——因为 GPU memory access 对"顺序 vs 跨步"极其敏感。

这也是为什么 fp8 KV cache(第 18 章)只对 K 容易做,V 转置之后 fp8 量化的 outlier 跨 token 分布,要做per-head per-block scale才能保持精度——vLLM 在这部分有大量 CUDA kernel 代码。

Not a typo. QKᵀ and P·V in attention have opposite memory access patterns:

QKᵀ: Q is [n_tokens, d_head], K is [n_ctx, d_head]. For Q × K.T, K must be organized "row = token" — K cache is laid out this way, each row a token, sequential reads are fast.
P·V: P is [n_tokens, n_ctx], V is [n_ctx, d_head]. For this we want V organized "column = token" — so P's i-th column times V's i-th row accumulates sequentially.

So V is physically stored transposed — ggml_cpy at write time does a transpose; reads stream sequentially. Looks minor, measurably 30% faster attention — GPU memory access is brutally sensitive to "sequential vs strided".

This is also why fp8 KV cache (Ch.18) is easier on K than V: after V's transpose, fp8 outliers spread across tokens, requiring per-head per-block scales to preserve precision. vLLM has a lot of CUDA kernel code dedicated to exactly this.
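The "sequential vs strided" point can be seen from flat memory offsets alone. The sketch below uses two hypothetical index helpers and the sizes quoted above (1024 features per token, 8192 context slots); it models element offsets only, not ggml's actual stride bookkeeping.

```python
def offset_token_major(token, dim, n_embd):
    """K-style layout: one token's whole vector sits in consecutive elements."""
    return token * n_embd + dim


def offset_dim_major(token, dim, n_ctx):
    """Transposed V-style layout: one feature across tokens sits in consecutive elements."""
    return dim * n_ctx + token


n_embd, n_ctx = 1024, 8192

# QK^T reads all features of one cached token: contiguous in the K layout
k_one_token = [offset_token_major(5, d, n_embd) for d in range(4)]

# P·V accumulates one output feature over many tokens: contiguous in the transposed layout
v_one_dim = [offset_dim_major(t, 7, n_ctx) for t in range(4)]

# the same access on an untransposed V would jump n_embd elements each step
v_if_not_transposed = [offset_token_major(t, 7, n_embd) for t in range(4)]

print(k_one_token, v_one_dim, v_if_not_transposed)
```

Stride-1 reads let the hardware fetch full cache lines of useful data; a stride of 1024 elements wastes almost all of each line.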

扩展EXTENDED KV cache 满了怎么办 — 三种淘汰策略 When KV cache fills up — three eviction strategies

llama.cpp 的 find_slot 在全部 slot 占满时返回 -1——上层主循环必须做点什么。三种主流策略:

硬截断:直接报错"context full",拒绝继续。早期 llama.cpp 的做法。
滑动窗口(llama_kv_self_seq_rm):丢掉最早的 N 个 token 的 KV,新 token 写进腾出来的 slot。问题是模型彻底忘了开头——chat 应用里"开头是 system prompt",丢掉就崩了。
StreamingLLM(Song Han 团队,2023):保留开头 4 个 token 的 KV(attention sinks) + 最近 N 个 token 的 KV,中间丢掉。研究发现 attention 在前几个 token 上有异常高的注意力分数(sink 现象),保住这几个,模型就还认得"对话语境"。可以让 4K 上下文的模型处理 4M token 的流。

实际生产里更常见的策略是在 KV cache 快满之前主动 evict 整条对话——把已经几小时没活跃的 session 的 KV 直接 free 掉。vLLM 的 BlockSpaceManager.swap_out 甚至能把不活跃的 session 的 KV 暂存到 CPU 内存,等用户回来再 swap_in。这是把 OS swap 抄进 LLM 推理。

llama.cpp's find_slot returns -1 when all slots are occupied — the caller must do something. Three mainstream strategies:

Hard cutoff: error out with "context full", refuse to continue. Old llama.cpp behavior.
Sliding window (llama_kv_self_seq_rm): drop the earliest N tokens' KV, write new tokens into the freed slots. Problem: model totally forgets the beginning — and in chat "the beginning is the system prompt"; lose it and behavior breaks.
StreamingLLM (Song Han's group, 2023): keep the first 4 tokens' KV (attention sinks) + the most recent N tokens' KV; discard the middle. Research found attention puts anomalously high scores on the first few tokens (the "sink" phenomenon). Keep those and the model still "holds the conversational context". A 4K-context model can stream 4M tokens this way.

In real production, the more common move is to actively evict whole conversations before the cache fills — sessions inactive for hours get their entire KV freed. vLLM's BlockSpaceManager.swap_out goes further: park inactive sessions' KV in CPU memory, swap_in when the user comes back. OS swap, transplanted into LLM inference.
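The StreamingLLM keep rule is simple enough to state as code. This is a minimal sketch of the selection step only: `streaming_keep` is a made-up name, the defaults are illustrative, and the position re-indexing that a real implementation needs after eviction is not modeled.

```python
def streaming_keep(n_cached, n_sink=4, n_recent=1020):
    """Cache indices to keep: the first n_sink tokens plus the last n_recent."""
    if n_cached <= n_sink + n_recent:
        return list(range(n_cached))          # nothing to evict yet
    return list(range(n_sink)) + list(range(n_cached - n_recent, n_cached))


kept = streaming_keep(10, n_sink=4, n_recent=3)
evicted = [i for i in range(10) if i not in kept]
print(kept, evicted)
```

A plain sliding window is the same function with `n_sink=0`, which is exactly the variant that loses the opening tokens.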

扩展EXTENDED KV cache 真实开销账 — 一个 70B 服务的成本结构 A real cost breakdown — what a 70B service spends per user

把 Llama-3-70B(GQA-8 · 80 层 · n_embd_head_v = 128)做一笔账。模型权重:fp16 是 140 GB,Q4_K_M 是 ~40 GB。per-user per-8K-context KV cache 是 ~2.5 GB。

用 8 × H100 (80GB) 跑这个服务:

模型权重 40 GB(quantized) · 占用 1 张卡左右
各种 activation buffer + CUDA 上下文 ~20 GB
剩下 640 - 40 - 20 = 580 GB 可分给 KV cache
每用户 2.5 GB · 理论最多 232 并发用户(8K 上下文)
实际还要留 ~25% 给碎片和峰值,实际 ~170 并发

一个 H100 节点(8 卡)的 AWS 报价是 ~$32/h。如果服务跑满 170 个并发用户、每用户每分钟 1 个 query,一小时 ~10000 query · 每 query ~$0.0032。但 KV cache 不工作时(用户在打字)也占着内存,实际 utilization 远不到这个数——所以 OpenAI / Anthropic 等服务的真实毛利都在 50-70% 区间,看起来高其实大头被 idle KV cache 吃掉了。

这就是 MLA(第 11 章)、PagedAttention(第 17 章)、prefix caching(第 19 章)、KV swap-out 等所有"省 KV / 共享 KV / 挪 KV"技术真正的商业意义:每砍 1/2 的 KV cache 占用,服务密度翻倍,毛利涨 20 个点。

Take Llama-3-70B (GQA-8 · 80 layers · n_embd_head_v = 128). Weights: 140 GB at fp16, ~40 GB at Q4_K_M. Per-user, per-8K KV cache is ~2.5 GB.

Serving on 8 × H100 (80 GB):

Weights 40 GB (quantized) · ~1 card's worth
Activation buffers + CUDA contexts ~20 GB
Remaining 640 − 40 − 20 = 580 GB for KV cache
Per user 2.5 GB · theoretical max 232 concurrent users at 8K context
Reserve ~25% for fragmentation and peaks → ~170 concurrent in practice

One 8×H100 node is priced here at ~$32/h (a round figure; actual list prices vary a lot by provider and commitment). Fully utilized at 170 concurrent users, 1 query/min each, that's ~10,000 queries/hour, or ~$0.0032 per query. But KV cache also holds memory while the user is typing, so real utilization is much lower. The 50-70% gross margins often quoted for OpenAI / Anthropic-style services should be read in that light: they look high, yet idle KV cache eats a large share of what the hardware could otherwise earn.

This is the actual commercial meaning of MLA (Ch.11), PagedAttention (Ch.17), prefix caching (Ch.19), KV swap-out and every other "save KV / share KV / move KV" technique: cut KV usage in half and service density roughly doubles, which by this rough accounting is worth about 20 points of gross margin.
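The budget above is plain arithmetic, so it can be re-run with other numbers. The sketch below uses only the article's round figures; the function and variable names are made up for illustration, the $32/h price is an assumed input, and GB and GiB are mixed as loosely as in the text.

```python
# Back-of-envelope KV-cache budget for the 70B example above.
# Every input is one of the article's round numbers, not a measurement.

GIB = 1024 ** 3

def kv_bytes_per_token(n_layers, n_head_kv, d_head, bytes_per_elem=2):
    # K and V each hold n_head_kv * d_head values per layer.
    return 2 * n_layers * n_head_kv * d_head * bytes_per_elem

per_token = kv_bytes_per_token(n_layers=80, n_head_kv=8, d_head=128)
per_user_gib = per_token * 8192 / GIB            # one user at 8K context

total_gb, weights_gb, overhead_gb = 8 * 80, 40, 20
kv_budget_gb = total_gb - weights_gb - overhead_gb

max_users = int(kv_budget_gb / per_user_gib)     # theoretical ceiling
practical_users = int(max_users * 0.75)          # keep 25% headroom

node_usd_per_hour = 32.0                         # assumed rental price
queries_per_hour = practical_users * 60          # 1 query per minute per user
usd_per_query = node_usd_per_hour / queries_per_hour
```

Halving `per_token` (fp8 KV, or a smaller `n_head_kv`) doubles `max_users` directly, which is the whole commercial argument in one line.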

扩展EXTENDED Attention sinks · 为什么去掉开头几个 token 模型就崩 Attention sinks · why dropping the first few tokens breaks the model

StreamingLLM(Han Song 团队 2023)论文有个反直觉发现:把 KV cache 的前 4 个 token 永远保留,模型就能处理任意长流;不保留,模型立刻崩。这 4 个 token 不是"系统 prompt",可能是 <|begin_of_text|> + 任意三个早期文本 token——为什么这么神奇?

研究后发现一个普适现象:Transformer 训练中,attention 倾向于把大量分数分配给序列最前面的几个 token,即使这些 token 语义无关。原因:softmax 公式 exp(QK)/Σexp 要求所有 token 的概率和为 1——但在某些 query 看来,"整段历史都没什么相关的"。模型必须把概率泄到某个地方,而早期 token 因为训练时永远存在 成了天然的"泄洪口"。

所以这些前几个 token 被称为 attention sinks——它们的 KV不承载有意义的信息,但必须存在,否则其它 token 的 attention 概率分配出问题,scores 数值范围混乱,生成质量崩坏。

这个发现实际催生了三件事:

StreamingLLM: 永远保留开头 4 个 + 最近 N 个 KV,中间可以丢——支持无限流。
训练时显式 sink token: StreamingLLM 论文同时发现,如果预训练时在每条样本开头放一个专门的可学习 sink token,推理时只保留这一个 token 就够当泄洪口。
KV cache 压缩(H2O, scissorhands): 既然只有前几个 token + 最近几个有用,中间的 KV 可以用更激进的策略压缩或丢弃。

这个研究领域 2023-2024 还很热,但没有大规模生产落地——StreamingLLM 在非聊天场景(纯流式 log 处理)用得多,聊天场景因为 prompt 长度本来就有限,attention sink 优化的边际收益不大。

StreamingLLM (Song Han's team, 2023) reported a counterintuitive result: always keep the KV of the first 4 tokens and a model handles arbitrarily long streams; drop them and the model breaks almost immediately. These 4 tokens need not be a "system prompt". They can be <|begin_of_text|> plus any three early text tokens. Why does that work?

Investigation revealed a universal phenomenon: in Transformer training, attention tends to assign large amounts of score to the first few tokens of the sequence — even when semantically irrelevant. Reason: the softmax formula exp(QK)/Σexp requires all tokens' probabilities sum to 1 — but in some queries' view, "nothing in the history is relevant". The model has to dump probability somewhere; early tokens, being always present during training, became the natural "overflow drain".

So these first few tokens are called attention sinks — their KV carries no meaningful info, but must exist, or other tokens' attention probability distributions break, score ranges go haywire, generation quality collapses.

This finding produced three things:

StreamingLLM: always keep first 4 + most recent N KV; toss the middle — enables infinite streaming.
Explicit sink token at training time: the StreamingLLM paper also shows that pre-training with one dedicated, learnable sink token at the start of every sample gives the model a built-in overflow drain, so a single sink suffices at inference.
KV cache compression (H2O, Scissorhands): if the first few tokens and the most recent ones carry most of the attention mass, the middle KV can be compressed or dropped more aggressively.

This area was hot in 2023-2024 but not widely deployed in production — StreamingLLM is more used in non-chat scenarios (pure-stream log processing); chat is bounded enough that attention-sink optimization's marginal gain is small.
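The eviction rule itself is tiny. A minimal sketch, assuming the cache is represented as a plain list of token positions; `streaming_keep` and its parameters are hypothetical names, not an API of any library.

```python
def streaming_keep(positions, n_sink=4, n_recent=8):
    """Positions that survive a 'sinks + sliding window' eviction policy."""
    if len(positions) <= n_sink + n_recent:
        return list(positions)                   # cache not full: keep all
    sinks = list(positions[:n_sink])             # the first few tokens stay forever
    recent = list(positions[len(positions) - n_recent:])
    return sinks + recent                        # everything in between is dropped

cache = list(range(20))                          # 20 cached tokens
kept = streaming_keep(cache)
```

With the defaults, `kept` holds positions 0-3 and 12-19. Memory stays constant however long the stream runs, at the price of forgetting the evicted middle.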

扩展EXTENDED KV cache 压缩 · H2O / scissorhands / SnapKV — 选择性遗忘 KV compression · H2O / scissorhands / SnapKV — selective forgetting

除了量化(C18 fp8 KV),另一类减小 KV cache 的思路是主动丢掉一部分 KV ——保留"重要"的,丢"不重要"的。怎么定义"重要"是一系列论文的核心:

H2O(Heavy-Hitter Oracle, 2023): 维护每个历史 token被注意到的累计概率。生成时只保留 attention sinks + 最近 token + 历史 heavy hitters,把"从来没人看过" 的 token KV 丢掉。实测能丢 ~50% KV 而 PPL 几乎不变。
Scissorhands(2023): 类似 H2O,但用更精细的"importance score" 跟踪每个 token,允许"动态进出" cache。
SnapKV(2024): 不在生成时 drop,而是在prompt prefill 完成后,根据"prompt 最后一个 token 对历史每个 token 的 attention" 选出 top-N 关键 token,只保留这些的 KV。简单粗暴但效果好——长 prompt 场景 KV cache 砍 80% 而精度不变。
PyramidKV(2024): 观察到不同 layer 对长程依赖的需求不同——底层需要更多上下文,高层只需要少数关键 token。每层维护不同大小的 KV cache,整体节省 ~70%。

注意这类技术跟量化是正交的——可以叠加。fp8 KV 砍一半内存 + H2O 再丢一半 KV = 总共 KV cache 砍到 1/4。这是 2024 年长上下文推理优化的组合战。

llama.cpp 主线还没集成这些(仍是简单 ring buffer),但研究分支 已经有实验。vLLM 在 2024 年加了 SnapKV 选项。这是个还在快速演化的领域,标准答案没出。

Besides quantization (C18 fp8 KV), another way to shrink KV cache is proactively drop parts of it — keep "important" tokens, drop "unimportant" ones. How to define "important" is the topic of a series of papers:

H2O (Heavy-Hitter Oracle, 2023): maintain per-token cumulative attention received. At generation, keep attention sinks + recent tokens + history heavy hitters; drop tokens "never attended to". ~50% KV dropped with near-zero PPL change.
Scissorhands (2023): similar to H2O but with finer "importance score" tracking, allowing "dynamic in/out" of the cache.
SnapKV (2024): instead of dropping during generation, it prunes once, right after prompt prefill completes. It uses the attention that the last few prompt tokens (an observation window) pay to each earlier token to pick the top-N positions, and keeps only their KV. Simple but effective: on long prompts the KV cache can be cut by roughly 80% with little accuracy loss.
PyramidKV (2024): observation: different layers need different amounts of long-range context — early layers need more context, deep layers need only a few key tokens. Different cache sizes per layer, ~70% total savings.

These techniques are orthogonal to quantization — they stack. fp8 KV halves memory + H2O drops half the tokens = total KV down to 1/4. This is the 2024 long-context inference optimization combo.

llama.cpp main hasn't integrated these (still simple ring buffer); experimental branches have. vLLM added SnapKV as an option in 2024. This is a fast-evolving field; no standard answer yet.
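To make "keep the important tokens" concrete, here is a rough sketch of an H2O-style selection rule. It assumes a per-token cumulative attention score has already been collected; the function name, the parameters, and the example scores are all invented for illustration and do not come from the H2O code.

```python
def h2o_keep(cum_attn, n_sink=4, n_recent=4, n_heavy=2):
    """Indices kept by a 'sinks + recent + heavy hitters' policy.

    cum_attn[i] is the total attention token i has received so far.
    """
    n = len(cum_attn)
    keep = set(range(min(n_sink, n)))              # attention sinks
    keep |= set(range(max(0, n - n_recent), n))    # most recent tokens
    middle = [i for i in range(n) if i not in keep]
    # Heavy hitters: the middle tokens that attracted the most attention.
    middle.sort(key=lambda i: cum_attn[i], reverse=True)
    keep.update(middle[:n_heavy])
    return sorted(keep)

scores = [9, 1, 1, 1, 0.1, 5, 0.2, 7, 0.3, 0.1, 4, 0.2, 1, 1, 1, 1]
kept_tokens = h2o_keep(scores)
```

Here tokens 5 and 7 survive from the middle because they were attended to most; the other six middle tokens are evicted.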

CHAPTER 09 · HEART

Attention kernel — softmax(QKᵀ)V 在 Flash 化之后

Attention kernel — softmax(QKᵀ)V after FlashAttention

ggml_flash_attn_ext · 三步合一

ggml_flash_attn_ext · three steps fused into one

到这一步,这一层 attention 的所有素材都齐了:

Q: 这个 ubatch n_tokens 个 token 各自的 query 向量,RoPE 旋转过 · shape [d_head, n_head, n_tokens]
K: 从 cache 起始到 head+n_tokens 的所有 K · shape [d_head, n_head_kv, n_kv]
V: 同样范围的 V · shape [d_head, n_head_kv, n_kv]

朴素 attention 公式三步:S = QKᵀ / √d → P = softmax(S + mask) → O = P·V。中间矩阵 S 大小是 [n_tokens × n_kv],也就是 n_tokens × n_kv 个 float。prefill 一段 8K 的 prompt 时,光一个 head 的 S 就是 64M 个 float = 256 MB(fp32),32 个 head 是每层 8 GB。各层是串行执行的,这块 buffer 可以复用,但 32 层累计仍有 256 GB 要写进 HBM 再读回来,而这个张量用完就扔。

这就是FlashAttention(Tri Dao, 2022)解决的问题。它的洞察是:S 这个中间矩阵根本不用整张物化。把 Q 和 K/V 都切成小块(tiles),在 SRAM 里算完一块就立刻 softmax 加权,累加到 O 上,然后扔掉这块 S。整个 attention 变成"三步合一"的单个 kernel——既省了显存读写,又省了显存容量。

By this point all ingredients are ready:

Q: this ubatch's n_tokens query vectors, RoPE-rotated · shape [d_head, n_head, n_tokens]
K: all K from cache start to head+n_tokens · shape [d_head, n_head_kv, n_kv]
V: matching V · shape [d_head, n_head_kv, n_kv]

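Naive attention is three steps: S = QKᵀ / √d → P = softmax(S + mask) → O = P·V. The intermediate S holds n_tokens × n_kv floats. When prefilling an 8K prompt, one head's S is 8192² ≈ 64M floats = 256 MB in fp32, and 32 heads make 8 GB per layer. Layers run one after another, so that buffer can be reused, but across 32 layers it is still 256 GB written to HBM and read back, all for a tensor that is thrown away immediately.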

This is what FlashAttention (Tri Dao, 2022) solved. The insight: S doesn't need to be materialized whole. Tile Q and K/V into blocks. For each tile, compute its S in SRAM, softmax-weight V immediately, accumulate to O, throw S away. The whole attention becomes "three steps fused into one" kernel — saving VRAM bandwidth and VRAM capacity at once.

	Naive	FlashAttention
Intermediate S	materialized in HBM	tiled, stays in SRAM
HBM traffic	Θ(N²) reads and writes of S	Q/K/V/O only; Θ(N²d²/M) accesses for SRAM size M, extra memory O(N)
SRAM usage	not exploited	tiles loaded into SRAM and reused
Measured speedup (H100)	1×	5–10× at long context
Longest practical context	~4K (80 GB)	~128K+

FIG. 06 FlashAttention 不改变结果,只改变怎么算。所以它跟 GQA / MLA 这些"改公式"的优化是正交的——可以叠加。现代推理几乎所有 attention 都走 Flash 实现。 FlashAttention doesn't change the answer; it changes how it's computed. It's orthogonal to GQA / MLA "formula-changing" optimizations — they compose. Almost all modern attention runs via Flash today.

src/llama-graph.cpp · build_attn() · the fused kernel callthree steps in one node

// ggml's flash-attention path (the backend picks CUDA / Metal / CPU).
// This single ggml node is equivalent to: matmul + softmax(+ mask + scale) + matmul
if (cparams.flash_attn) {
    cur = ggml_flash_attn_ext(ctx0,
        q,              // [d_head, n_head, n_tokens]
        k,              // [d_head, n_head_kv, n_kv]
        v,              // [d_head, n_head_kv, n_kv]
        kq_mask,        // causal mask · [n_kv, n_tokens]
        kq_scale,       // 1/√d
        max_bias,       // ALiBi relative-position bias (used by some models)
        logit_softcap); // Gemma-style softcap
} else {
    // fallback: the naive three steps, with S actually materialized
    ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
    kq  = ggml_soft_max_ext(ctx0, kq, kq_mask, kq_scale, max_bias);
    cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, v), kq);
}

causal mask · "我不能看到未来"

causal mask · "I can't see the future"

上面那个 kq_mask 是 transformer 里唯一让模型"顺序"的东西。它是一个 [n_kv, n_tokens] 的下三角矩阵:第 i 个 query token 只能看到 j ≤ i 的 key token。在 softmax 之前,mask 上"不该看到"的位置被加上 -inf,softmax 后这些位置概率为 0。

这就是 GPT 类自回归模型不能并行预测后面的 token 的根本原因——它的训练目标就是"看着前 i 个 token,预测第 i+1 个",mask 是这个目标在 attention 公式里的具体形式。如果去掉 mask,模型就变成"双向"的(BERT 那种),用法完全不同。

That kq_mask is what enforces left-to-right order in a decoder-only transformer (RoPE tells the model where a token sits; only the mask forbids looking ahead). It is an [n_kv, n_tokens] lower-triangular matrix: query token i can only see key tokens j ≤ i. Before softmax, -inf is added at the masked positions, so their probabilities come out as 0.

This is the root reason an autoregressive GPT can't predict several future tokens in parallel. The training objective is "look at the first i tokens, predict token i+1", and the mask is that objective made concrete in the attention formula. Drop the mask and you get a bidirectional model (BERT-style), which is a completely different use case.
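A small sketch of how such a mask behaves once a KV cache is involved. It assumes the n_tokens queries are the last n_tokens positions of an n_kv-long cache, and stores one row per query; the helper names are made up for this example.

```python
import math

def causal_mask(n_tokens, n_kv):
    """mask[i][j] is added to the score of query i against key j."""
    n_past = n_kv - n_tokens            # tokens already sitting in the cache
    return [[0.0 if j <= n_past + i else -math.inf for j in range(n_kv)]
            for i in range(n_tokens)]

def masked_softmax(scores, mask_row):
    z = [s + m for s, m in zip(scores, mask_row)]
    top = max(z)                        # finite: key 0 is always visible
    e = [math.exp(v - top) for v in z]  # exp(-inf) evaluates to 0.0
    total = sum(e)
    return [v / total for v in e]

mask = causal_mask(n_tokens=3, n_kv=5)  # 2 cached tokens + 3 new ones
probs = masked_softmax([1.0] * 5, mask[0])
```

The first new query sees the two cached keys and itself, so `probs` puts one third on each of the first three keys and exactly zero on the two future ones.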

一个 attention 块的 FLOPs 账FLOPs budget · one attention block

Step	Formula	N = 4, d = 4096
QKV projection	3 × 2·N·d²	3·2·4·4096² ≈ 400 MFLOPs
Q × Kᵀ	2·N²·d	2·4·4·4096 ≈ 130 KFLOPs
softmax	~5·N²	~80 FLOPs
P × V	2·N²·d	≈ 130 KFLOPs
output proj	2·N·d²	≈ 130 MFLOPs
Total	~530 MFLOPs	one layer, N = 4 prefill
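The table can be reproduced with a few lines, which also makes it easy to see how the balance shifts as N grows. This is a sketch using the table's own formulas (MHA-sized projections, softmax counted once rather than per head); the function name and dictionary keys are illustrative.

```python
def attention_block_flops(n_tokens, d_model, n_kv=None):
    """FLOPs of one attention block, using the table's approximations."""
    n_kv = n_tokens if n_kv is None else n_kv    # pure prefill: no prior cache
    return {
        "qkv_proj": 3 * 2 * n_tokens * d_model ** 2,
        "qk":       2 * n_tokens * n_kv * d_model,
        "softmax":  5 * n_tokens * n_kv,
        "pv":       2 * n_tokens * n_kv * d_model,
        "out_proj": 2 * n_tokens * d_model ** 2,
    }

flops = attention_block_flops(n_tokens=4, d_model=4096)
total_flops = sum(flops.values())
projection_share = (flops["qkv_proj"] + flops["out_proj"]) / total_flops
```

At N = 4 the projections account for more than 99.9% of the work. The N² terms only catch up once the context is on the order of d_model tokens.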

走到哪了WHERE WE ARE 这一层 attention 算完,输出是 [n_embd, n_tokens] 的 hidden state——跟输入同形。然后接一个 output projection(W_o)、残差连接,再走 FFN/MoE(第 12 章),再来一次 RMSNorm + attention + FFN……重复 32 次。32 层后,最后一层的 hidden state 进入第 13 章的 LM head。 This layer's attention finishes. Output is [n_embd, n_tokens] hidden state — same shape as input. Then output projection (W_o), residual add, then FFN/MoE (Ch. 12), then RMSNorm + attention + FFN again… 32 times. After 32 layers the final hidden state enters the LM head (Ch. 13).

扩展 · 数学EXTENDED · MATH Online softmax — FlashAttention 真正的"三步合一"是怎么算的 Online softmax — how FlashAttention actually fuses "three steps into one"

朴素 softmax 需要看完所有元素才能算:先求 max(数值稳定),再 exp 求和(归一化分母),最后逐项除。这就是为什么 attention 的中间 S = QKᵀ 矩阵必须先完整物化到 HBM,再做 softmax——典型 O(N²) 内存。

FlashAttention 的核心数学是 online softmax(Milakov & Gimelshein,2018):边读边更新,不需要先看完所有元素。原理是维护两个累加量,随着新元素流入做修正:

读到 x_i 之前 · 已经处理过 i-1 个元素:
m_{i-1} = max(x_1, ..., x_{i-1}) // 当前最大值
d_{i-1} = Σ exp(x_j - m_{i-1}) // 当前归一化分母

读到 x_i,更新:
m_i = max(m_{i-1}, x_i)
d_i = d_{i-1} · exp(m_{i-1} - m_i) + exp(x_i - m_i)

那个 exp(m_{i-1} - m_i) 因子是 修正系数——之前的所有 exp 都是基于旧的 max 算的,新 max 一来,所有旧值都被"整体缩小"一点。这一步保证了:无论你读到第几个元素,d_i 都是"就这一段已读元素的正确归一化分母"。

FlashAttention 把这套 online softmax 跟分块矩阵乘结合:Q 和 K/V 都切成小块(tile),每次只把一对 tile 拉进 SRAM。在 SRAM 里算这对 tile 的局部 S = QK^T,直接 online softmax 累加到输出 O 上,然后把当前块扔掉,读下一对——全程不物化整张 S 矩阵。

Naive softmax requires seeing all elements: first compute max (for numerical stability), then exp-sum (the normalization denominator), then divide. This is why attention's intermediate S = QKᵀ must be fully materialized to HBM before softmax — classic O(N²) memory.

FlashAttention's core math is online softmax (Milakov & Gimelshein, 2018): update incrementally; never need to see everything first. Maintain two accumulators that update as new elements stream in:

Before reading x_i · already processed i-1 elements:
m_{i-1} = max(x_1, ..., x_{i-1}) // running max
d_{i-1} = Σ exp(x_j - m_{i-1}) // running denominator

Read x_i, update:
m_i = max(m_{i-1}, x_i)
d_i = d_{i-1} · exp(m_{i-1} - m_i) + exp(x_i - m_i)

The exp(m_{i-1} - m_i) factor is a correction coefficient — all the prior exps were computed against the old max; when the max grows, all old values get uniformly shrunk a bit. After the update, d_i is "the correct normalization denominator for just the elements seen so far".

FlashAttention couples this online softmax with block matrix multiply: Q and K/V are tiled; each iteration loads one tile pair into SRAM. In SRAM, compute the tile's local S = QK^T, online-softmax-accumulate it onto the output O, discard, load next pair — the full S matrix is never materialized.
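The update rule is short enough to check numerically. A minimal sketch in plain Python, assuming finite inputs; the function names are chosen for this example only.

```python
import math

def softmax_two_pass(xs):
    """Reference: needs the whole list before it can start."""
    m = max(xs)
    d = sum(math.exp(x - m) for x in xs)
    return [math.exp(x - m) / d for x in xs]

def online_max_and_denominator(xs):
    """One streaming pass that maintains the running max m and denominator d."""
    m, d = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old denominator to the new max, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d

xs = [1.0, 3.0, -2.0, 3.5]
m_final, d_final = online_max_and_denominator(xs)
online_probs = [math.exp(x - m_final) / d_final for x in xs]
```

`online_probs` matches `softmax_two_pass(xs)` up to rounding, even though the running max changed twice along the way.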

FlashAttention algorithm · pseudocodeblock-wise tile loop

# Input:  Q ∈ ℝ^[N, d], K, V ∈ ℝ^[N, d] · block sizes Br × Bc
# Output: O ∈ ℝ^[N, d] = softmax(QKᵀ/√d) · V
# Step 1: split Q into Tr = N/Br blocks · split K, V into Tc = N/Bc blocks
for i in range(Tr):                      # outer loop · over Q blocks
    Q_i = Q[i*Br : (i+1)*Br]             # load the Q block into SRAM
    O_i = zeros(Br, d)                   # output accumulator
    m_i = full(Br, -inf)                 # running max, one per query row
    d_i = zeros(Br)                      # running denominator, one per query row
    for j in range(Tc):                  # inner loop · over K/V blocks
        K_j = K[j*Bc : (j+1)*Bc]         # load into SRAM
        V_j = V[j*Bc : (j+1)*Bc]
        S_ij  = Q_i @ K_j.T / sqrt(d)    # local attention scores
        m_ij  = rowmax(S_ij)             # this tile's row-wise max
        m_new = max(m_i, m_ij)           # new running max
        alpha = exp(m_i - m_new)         # rescales the old accumulators
        beta  = exp(m_ij - m_new)        # rescales the new tile
        P_ij  = beta * exp(S_ij - m_ij)  # local softmax numerator, = exp(S_ij - m_new)
        d_i = alpha * d_i + rowsum(P_ij) # accumulate the denominator
        O_i = alpha * O_i + P_ij @ V_j   # accumulate the output (alpha applied per row)
        m_i = m_new
    O[i*Br:(i+1)*Br] = O_i / d_i         # final normalization, a single division

看这个伪代码,你能感觉到FlashAttention 的精妙:它把所有"需要看完整列才能算"的依赖,全部用 alpha / beta 这两个修正项消解掉了。每次内循环只需要 SRAM 里的当前块和几个标量(m, d),整个外循环结束时 O_i 是正确的 softmax(QKᵀ)V[i] 行——跟 naive 算法bit-equivalent(忽略 fp 累加误差)。

实测收益:在 A100 上,8K seq len、d=128 的 attention,naive 算法用 ~70 GB HBM 流量(物化 S 矩阵 + 来回读写),FlashAttention 只用 ~7 GB——10× 内存带宽节省直接翻译成 ~5-7× 实际加速。同时支持的最长 seq 也从 ~4K 涨到 ~64K。

Reading the pseudocode you can feel FlashAttention's elegance: every "need to see the whole row first" dependency is dissolved by those two correction terms, alpha and beta. The inner loop needs only the current tiles in SRAM and two small per-row vectors (m, d). When the inner loop finishes, O_i holds the correct rows of softmax(QKᵀ/√d)·V: mathematically identical to the naive algorithm, and numerically equal up to floating-point rounding.

Measured win: on A100, 8K seq len, d=128 attention, naive uses ~70 GB HBM traffic (materialize S + back-and-forth). FlashAttention uses ~7 GB — 10× bandwidth saved, translating to ~5-7× real speedup. Max workable seq length grows from ~4K to ~64K.
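For readers who want to run the idea rather than read it, here is a rough pure-Python rendering of the tile loop with Br = 1 (one query row at a time) and no causal mask. It is a numerical check of the algebra, not a model of GPU behavior; all names are illustrative.

```python
import math

def naive_attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(s)
        p = [math.exp(v - m) for v in s]          # full score row materialized
        z = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(K))) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        m, denom, acc = -math.inf, 0.0, [0.0] * dv
        for start in range(0, len(K), block):     # walk the K/V tiles
            Kj, Vj = K[start:start + block], V[start:start + block]
            s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kj]
            m_new = max(m, max(s))
            alpha = math.exp(m - m_new)           # rescale what was accumulated
            p = [math.exp(v - m_new) for v in s]
            denom = alpha * denom + sum(p)
            acc = [alpha * a + sum(p[j] * Vj[j][c] for j in range(len(Kj)))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / denom for a in acc])      # normalize once at the end
    return out

Q = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
K = [[0.1 * i, 1.0 - 0.3 * i] for i in range(5)]
V = [[float(i), float(i * i), 1.0] for i in range(5)]
max_diff = max(abs(a - b)
               for ra, rb in zip(naive_attention(Q, K, V), tiled_attention(Q, K, V))
               for a, b in zip(ra, rb))
```

`max_diff` stays at rounding-error level for any `block` size, which is the "doesn't change the answer" claim in executable form.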

扩展EXTENDED FlashAttention v1 → v2 → v3 — 三代 4 年的演化 FlashAttention v1 → v2 → v3 — three generations, four years

FlashAttention 自 2022 年首发以来,迭代过两次大版本,每次都跟硬件代际深度耦合:

FlashAttention v1(2022): 分块 + online softmax 的思路由它首创。在 A100 上把 attention 推到理论峰值 FLOPs 的 25-40% 左右。业界由此普遍意识到"attention 是 memory-bound 不是 compute-bound"。v1 的循环是"外循环 K/V 块,内循环 Q 块",部分输出要反复写回 HBM 再重新缩放。
FlashAttention v2(2023): 改了循环顺序,变成"外循环 Q 块,内循环 K/V 块"(上面那段伪代码用的就是这个顺序),每个 thread block 独占一个 Q 块,算完自己的输出行才写回 HBM。另外沿序列维度并行、减少非 matmul 运算,SM occupancy 提升。A100 上比 v1 快 ~2×,达到理论峰值的 50-73%。是 2024-2025 年的主流实现。
FlashAttention v3(2024): 专门为 Hopper(H100)优化,利用三件事:(1) WGMMA(warp-group matrix multiply accumulate)异步张量核;(2) TMA(Tensor Memory Accelerator)硬件预取;(3) fp8 计算路径。在 H100 上比 v2 又快 1.5-2×,达到 ~75% 理论峰值。

有意思的遗产问题:vLLM 一开始用 FlashAttention v2,后来加 PagedAttention 自己的 attention kernel,FA v3 又重写了一遍兼容 paged KV。kernel 实现是个吃工程力的活——同样数学,不同硬件,得分别写。这也是为什么"新硬件出来到完整推理栈支持" 通常要 6-12 个月——不是模型移植慢,是 kernel 移植慢。

FlashAttention has shipped two major versions since 2022, each tightly coupled to a hardware generation:

FlashAttention v1 (2022): introduced the tiling plus online-softmax idea. On A100 it lifted attention to roughly 25-40% of peak FLOPs, and it made the field take seriously that "attention is memory-bound, not compute-bound". In v1 the outer loop runs over K/V blocks and the inner loop over Q blocks, so partial outputs are written back to HBM and rescaled repeatedly.
FlashAttention v2 (2023): swapped the loop order to "outer Q block, inner K/V block" (the order the pseudocode above uses), so each thread block owns one Q tile and finishes its output rows before touching HBM. It also parallelizes along the sequence dimension and trims non-matmul work, lifting SM occupancy. ~2× over v1 on A100, reaching roughly 50-73% of theoretical peak. The mainstream implementation in 2024-2025.
FlashAttention v3 (2024): Hopper-specific (H100), leveraging (1) WGMMA (warp-group async matrix multiply-accumulate); (2) TMA (Tensor Memory Accelerator hardware prefetch); (3) fp8 compute paths. 1.5-2× over v2 on H100, ~75% of theoretical peak.

An interesting legacy issue: vLLM shipped PagedAttention with its own CUDA kernel, and FlashAttention later had to grow paged-KV support so that it could read the same block-table layout. Kernel implementations are engineering-heavy: same math, different hardware, separate code. This is why "new hardware → full inference stack support" usually takes 6-12 months. Model porting isn't the slow part; kernel porting is.

扩展EXTENDED 为什么 decode 用 FA 收益小 — Q 长度 = 1 的尴尬 Why decode benefits less from FA · the Q=1 awkwardness

FlashAttention 的所有收益都来自"S 矩阵不物化到 HBM"。但 decode 阶段 Q 只有 1 行——S = QK^T 本来就只有 [1, N_kv] 行,物化到 HBM 也就几十 KB,跟模型权重比微不足道。所以 decode 阶段 FA 的收益主要在 prefill 的 attention 上,decode 收益很小。

这就是为什么 vLLM / TensorRT-LLM 等推理引擎都有专门为 decode 优化的 attention kernel——它们不走 FA,走"纯流式"算法,每次只处理 1 个 query token,但同时把 KV 按 PagedAttention 的 block 索引读——这就是 paged_attention_v2 CUDA kernel 在干的事。

FlashDecoding(Tri Dao 团队 2023): 解决了"decode 阶段 GPU 还是欠饱和" 的问题。思路是把 KV cache 沿 seq 维度切成多块,每块用一个 SM 独立算这一段 KV 对当前 Q 的 attention 贡献,最后再合并所有 SM 的结果。这样 SM 占用率从 ~30% 涨到 ~70%,在长上下文 decode 上加速 ~3-5×。是 2024 年 long-context decode 加速的最大单点优化。

All of FlashAttention's wins come from "S not materialized to HBM". But at decode, Q has only 1 row, so S = QK^T is just [1, N_kv] per head: a few tens of KB to materialize, negligible next to the model weights. So FA's win at decode is small; nearly all of it comes from prefill attention.

This is why vLLM / TensorRT-LLM ship decode-specific attention kernels — they don't use FA but a "pure streaming" algorithm: one query token at a time, but reading KV via PagedAttention's block indexing. This is exactly what the paged_attention_v2 CUDA kernel does.

FlashDecoding (Tri Dao's team, 2023): tackled "even decode under-saturates the GPU". The idea: split KV cache along the sequence axis into chunks; each SM independently computes "this chunk's contribution to the current Q's attention"; merge afterward. SM occupancy goes from ~30% to ~70%, ~3-5× speedup on long-context decode. Single biggest 2024 long-context decode optimization.
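The "merge afterward" step is the same online-softmax correction applied across chunks instead of across tiles. A minimal sketch for one query vector, with the 1/√d scale omitted for brevity; `partial_attention` and `merge` are hypothetical helper names, not functions from any kernel library.

```python
import math

def partial_attention(q, K, V):
    """One chunk's contribution: local max, local denominator, unnormalized output."""
    s = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    p = [math.exp(v - m) for v in s]
    o = [sum(p[j] * V[j][c] for j in range(len(K))) for c in range(len(V[0]))]
    return m, sum(p), o

def merge(parts):
    """Combine per-chunk results as if attention had run over the whole cache."""
    m_all = max(m for m, _, _ in parts)
    # Each chunk was scaled by its own max; bring them all to the global max.
    denom = sum(d * math.exp(m - m_all) for m, d, _ in parts)
    dv = len(parts[0][2])
    return [sum(o[c] * math.exp(m - m_all) for m, _, o in parts) / denom
            for c in range(dv)]

q = [1.0, -0.5]
K = [[0.1 * i, 1.0 - 0.3 * i] for i in range(6)]
V = [[float(i), 1.0] for i in range(6)]

whole = merge([partial_attention(q, K, V)])
split = merge([partial_attention(q, K[:2], V[:2]),     # these three calls are
               partial_attention(q, K[2:4], V[2:4]),   # independent, so they can
               partial_attention(q, K[4:], V[4:])])    # run on different SMs
```

`split` equals `whole` up to rounding. Because the per-chunk calls share nothing, a long KV cache can keep many SMs busy even though there is only one query row.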

CHAPTER 10 · VARIANTS

GQA / MQA — 几个 query 共享一份 KV

GQA / MQA — many queries, one KV

最朴素的"省 KV cache"招式

the simplest "save KV cache" trick

第 8 章那个 1 GB 的数字里,有一个变量叫 n_head_kv——Llama-3-8B 不是 32,而是 8。这并不是错。是GQA(Grouped Query Attention)的设计:让 32 个 query head 共享 8 个 K/V head——每 4 个 query 共用 1 组 K/V。

这件事的动机直接得难以反驳:

KV cache 大小正比于 n_head_kv。GQA 把 KV head 砍到 1/4,KV cache 直接砍 75%。
实测精度损失:几乎可以忽略(Llama 论文里大概是 0.5% 以内)。
FLOPs 几乎不变:Q 还是 32 个 head,attention 计算量不变。省的全是内存。

"MQA"(Multi-Query Attention)是 GQA 的极端版:所有 query head 共用一组 K/V(n_head_kv = 1)。再激进一点。由 Shazeer 在 2019 年提出,PaLM 率先大规模采用,后来 Falcon、StarCoder 也用。代价是精度损失更明显,这也是为什么 Llama-2 之后大家更喜欢 GQA(折中)而不是 MQA。

In Ch.8's 1 GB number there's a variable n_head_kv — Llama-3-8B has not 32 but 8. That's not a typo. It's GQA (Grouped Query Attention): 32 query heads share 8 K/V heads — every 4 queries share one K/V pair.

Motivation is hard to argue with:

KV cache size scales with n_head_kv. GQA quarters KV heads, quartering the cache.
Measured accuracy hit: nearly nothing (Llama paper reports <0.5%).
FLOPs almost unchanged: Q still has 32 heads, so attention compute is the same. The savings are purely memory.

"MQA" (Multi-Query Attention) is GQA's extreme: all query heads share one K/V head (n_head_kv = 1). More aggressive. Proposed by Shazeer in 2019 and adopted at scale by PaLM; later used by Falcon and StarCoder. The cost is a more visible accuracy hit, which is why after Llama-2 most teams settled on GQA (the middle ground) over MQA.

Config	n_head	n_head_kv	Query heads per KV head	KV cache · 8K · fp16
MHA (Llama-1)	32	32	1	4.3 GB
GQA-8 (Llama-3-8B)	32	8	4	1.07 GB
GQA-4	32	4	8	540 MB
MQA (Falcon)	32	1	32	135 MB

FIG. 07 同样是 Llama-3-8B 形状的模型,KV cache 在四种配置下差 32 倍。这是为什么 n_head_kv 是模型卡里最不起眼但成本最高的一个参数。 Same Llama-3-8B-shaped model — KV cache spans 32× across these four configs. n_head_kv is the most-overlooked, most-cost-impacting field on the model card.
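The sharing rule behind the table is a single integer division. The sketch below assumes the common layout in which consecutive query heads form a group; `kv_head_for` is an illustrative name.

```python
def kv_head_for(q_head, n_head=32, n_head_kv=8):
    """Index of the KV head that a given query head reads under GQA."""
    n_rep = n_head // n_head_kv      # query heads served by each KV head
    return q_head // n_rep

groups = {}
for q_head in range(32):
    groups.setdefault(kv_head_for(q_head), []).append(q_head)
```

With the defaults, `groups[0]` is `[0, 1, 2, 3]` and there are 8 groups. Setting `n_head_kv=32` gives MHA (each query head has its own KV head) and `n_head_kv=1` gives MQA (all query heads read head 0).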

src/llama-graph.cpp · build_attn() · the GQA repeathow 8 K become 32 K (logically)

// The attention formula needs Q and K to have matching head counts; here GQA handles that with a repeat.
// Conceptually a broadcast: K[0..3] all refer to K_real[0], so the cache itself stores only 8 heads.
const int64_t n_head    = hparams.n_head;      // 32
const int64_t n_head_kv = hparams.n_head_kv;   // 8
const int64_t n_rep     = n_head / n_head_kv;  // 4 · each KV group serves 4 query heads

// k starts as [d_head, n_head_kv, n_kv];
// after the repeat it is viewed as [d_head, n_head_kv, n_rep, n_kv] and reshaped to
// [d_head, n_head, n_kv], so attention then runs as if there were 32 heads
if (n_rep > 1) {
    k = ggml_repeat_4d(ctx0, k, d_head, n_head_kv*n_rep, n_kv, 1);
    v = ggml_repeat_4d(ctx0, v, d_head, n_head_kv*n_rep, n_kv, 1);
}
// The flash kernel can also support GQA natively, without this repeat

设计感DESIGN TAKE GQA 是少有的"不动模型容量、白送内存"的优化。Llama-2-70B 之后所有主流模型都默认 GQA。它的存在让 4-bit 量化的 70B 模型在单张 80 GB GPU 上能跑长上下文:MHA 的 70B(64 个 head)在 8K 上下文每条序列要 ~20 GB KV cache,GQA-8 只要 ~2.5 GB。没有 GQA,就没有"单卡 70B"。 GQA is rare: "same model capacity, free memory savings". After Llama-2-70B every mainstream model defaults to GQA. It's why a 4-bit-quantized 70B can run long context on a single 80 GB GPU: an MHA 70B (64 heads) at 8K would need ~20 GB of KV cache per sequence, GQA-8 only ~2.5 GB. No GQA, no "single-card 70B".

CHAPTER 11 · VARIANTS · KEY

MLA — DeepSeek 把 KV 压成 latent 那一手

MLA — DeepSeek's latent-KV trick

Multi-head Latent Attention · 一招省到 1/14

Multi-head Latent Attention · cache shrunk to 1/14

GQA 节省 KV cache 的方式是"少存几组 K/V"。MLA 走了另一条路——K 和 V 根本不存原始张量,只存一个"压缩潜空间"里的低维向量。要用的时候再展开。

这是 DeepSeek 在 V2 / V3 用的招式(论文《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》)。把每 token 的 KV cache 占用摆在一起对比:

Llama-3-8B(GQA-8 · 32 层): 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KB / token
DeepSeek-V2(MLA · 60 层): 60 × 576 × 2 bytes ≈ 68 KB / token · 只存一个 latent c_kv,576 = d_c 512 + d_rope 64
"同 shape 假如用 MHA"(60 层 × 128 head × 128 d_head): 2 × 60 × 128 × 128 × 2 ≈ 3.75 MB / token · 大约是 MLA 的 57 倍。DeepSeek-V2 论文自己的说法是:相对 DeepSeek 67B,KV cache 减少 93.3%,相当于只有 2.25 组的 GQA

注意 V2/V3 是大模型(总参数 236B / 671B),不是 7B 级别——MLA 设计的真正目的就是让这种规模的模型还能在合理硬件上 decode 长上下文。Llama 这边的 GQA 已经够 8B/70B 用了。

GQA saves KV cache by "storing fewer K/V groups". MLA takes a different road — don't store the raw K and V at all; store only a low-dim vector in a "compressed latent space". Expand on demand.

This is the trick DeepSeek used in V2/V3 (paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model). Per-token KV cache, side by side:

Llama-3-8B (GQA-8 · 32 layers): 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KB / token
DeepSeek-V2 (MLA · 60 layers): 60 × 576 × 2 bytes ≈ 68 KB / token · just one latent c_kv, where 576 = d_c 512 + d_rope 64
"Same shape with MHA" (60 layers × 128 heads × 128 d_head): 2 × 60 × 128 × 128 × 2 ≈ 3.75 MB / token · about 57× more than MLA. The DeepSeek-V2 paper states the saving differently: a 93.3% KV-cache reduction relative to DeepSeek 67B, and a cache equal to GQA with only 2.25 groups

Note: V2/V3 are large models (236B / 671B total params), not 7B-class — MLA exists precisely so that scale can still decode long context on reasonable hardware. Llama's GQA is already enough at 8B/70B.
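The three per-token figures above follow from two one-line formulas. A small sketch, assuming 2-byte (fp16/bf16) cache entries; the function names are illustrative.

```python
def kv_bytes_per_token(n_layers, n_head_kv, d_head, elem_bytes=2):
    # MHA / GQA / MQA: a K vector and a V vector per KV head, per layer.
    return 2 * n_layers * n_head_kv * d_head * elem_bytes

def mla_bytes_per_token(n_layers, d_c, d_rope, elem_bytes=2):
    # MLA: one shared latent plus a small RoPE key per layer, no separate V.
    return n_layers * (d_c + d_rope) * elem_bytes

llama3_8b_gqa = kv_bytes_per_token(32, 8, 128)
deepseek_v2_mla = mla_bytes_per_token(60, 512, 64)
deepseek_v2_as_mha = kv_bytes_per_token(60, 128, 128)
mla_saving = deepseek_v2_as_mha / deepseek_v2_mla
```

`mla_saving` comes out just under 57, and `deepseek_v2_mla` is about half of `llama3_8b_gqa` even though DeepSeek-V2 has almost twice as many layers.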

核心思想:在 latent 空间存,在 head 空间用

Core idea: store in latent, consume in head

朴素 attention:每个 head 有自己的 K/V,存 n_head_kv × d_head 个数。MLA 的拆解:

每个 token 算一个低维潜向量 c_kv ∈ ℝ^d_c(DeepSeek-V2:d_c = 512)。这是真正存进 KV cache 的东西。
要用 K 时,从 c_kv 投影出来:K = W_uk · c_kv(W_uk 是个 [n_head × d_head, d_c] 的"解压矩阵")。
要用 V 时同理:V = W_uv · c_kv。

原本要存 n_head_kv × d_head × 2 个数,现在只存 d_c 个数。Llama 数字:8 × 128 × 2 = 2048 → DeepSeek:512(再加少量 RoPE 维)≈ 1/4。

但这里有个问题:每次 decode 时,从 latent 解压出 K 和 V,再做 attention——岂不是多了一次 matmul?这就是 MLA 真正巧妙的地方。

Naive attention: each head stores n_head_kv × d_head numbers for K/V. MLA's decomposition:

Each token computes a low-dim latent c_kv ∈ ℝ^d_c (DeepSeek-V2: d_c = 512). This is what actually lives in the KV cache.
When K is needed: project up — K = W_uk · c_kv (W_uk is a [n_head × d_head, d_c] "decompression matrix").
Same for V: V = W_uv · c_kv.

Instead of storing n_head_kv × d_head × 2 numbers, store d_c. Llama: 8 × 128 × 2 = 2048 → DeepSeek: 512 (plus a small RoPE tail) ≈ 1/4.

But there's a catch: every decode step now expands latent → K, V, then runs attention. Extra matmul each step? This is exactly where MLA gets clever.

DeepSeek paper · MLA's "absorb" trickmath sleight-of-hand

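// Per (query, cached token) pair, in the column-vector convention used above:
//   score = qᵀ · k / √d,   with k = W_uk · c_kv
// Substitute k:
//   qᵀ · k = qᵀ · (W_uk · c_kv)
//          = (W_ukᵀ · q)ᵀ · c_kv
//
// "Absorb" W_ukᵀ into the query projection. With q = W_q · x:
//   q' = W_ukᵀ · q = (W_ukᵀ · W_q) · x = W_q' · x
// At inference:
//   q'    = W_q' · x           (one projection, computed in the forward pass)
//   score = q'ᵀ · c_kv         (scored directly against the latent; K is never decompressed)
//   o     = Σ_j p_j · (W_uv · c_kv_j)
//         = W_uv · (Σ_j p_j · c_kv_j)
//   W_uv is "absorbed" into the output projection W_o in the same way
//
// Result: the KV cache is O(d_c) per token instead of O(n_head × d_head),
// and no cached token is ever decompressed. The price: each head's dot
// product now runs over d_c dims instead of d_head.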

这就是 MLA 看起来像魔法的根源:"压缩"是数学上的 free lunch,不是计算量上的妥协。因为线性矩阵可以重新分组,你可以把"解压"那步永远放在权重侧而不是数据侧——latent 张量永远不需要被显式解开。

实际实现里还有个细节:RoPE 不能直接施加在 latent 上(latent 的几何不保留位置旋转)。所以 DeepSeek 把每个 head 的 K 拆成两部分:大部分维度走 MLA(无 RoPE),少数维度单独走标准 K(带 RoPE)。两部分拼起来给 attention。代码里这是 kv_lora_rank 和 qk_rope_head_dim 两个参数。

This is why MLA looks like magic: the compression never has to be undone on the cached data. Matrix products can be regrouped, so the "decompression" step lives permanently on the weight and query side, and latents are never explicitly expanded. The trade is extra arithmetic per head (dot products over d_c instead of d_head dims), which bandwidth-bound decode can usually afford.

One real-world wrinkle: RoPE can't be applied directly to the latent (it doesn't preserve position rotation geometry). So DeepSeek splits each head's K in two: most dims go via MLA (no RoPE), a few dims go standard K (with RoPE). Both are concatenated for attention. In code these are kv_lora_rank and qk_rope_head_dim.

src/llama-graph.cpp · llm_build_deepseek2::build_attn() · MLA pathtwo-branch K

// This is DeepSeek-V2/V3's real branch in llama.cpp
// It follows completely different logic from Llama's build_attn

// 1. Compute the latent c_kv · this is what actually goes into the KV cache
ggml_tensor * c_kv = build_lora_mm(model.layers[il].wkv_a_mqa, cur);
c_kv = build_norm(c_kv, model.layers[il].attn_kv_a_norm, ...);

// 2. Also compute K's "RoPE part" k_pe · stored as-is, it cannot go into the latent
ggml_tensor * k_pe = build_lora_mm(model.layers[il].wk_b, cur);
k_pe = ggml_rope_ext(ctx0, k_pe, inp_pos, ...);

// 3. Write to the KV cache: c_kv (non-RoPE part) + k_pe (RoPE part)
//    Together ~576 dims · far fewer than GQA's 2048
ggml_build_forward_expand(gf, ggml_cpy(ctx0, c_kv, k_cache_view));
ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_pe, k_pe_view));

// 4. Compute Q · W_uk has already been absorbed into Q (done on the weights at model load)
// 5. Assemble K_final = [latent_expanded, k_pe] and run flash attention

为什么这个 trick 这么重要WHY THIS MATTERS DeepSeek-V3 是 671B 参数(MoE,实际激活 ~37B)。这个量级如果用 MHA 的 KV cache(每 token ~3.8 MB),128K 上下文每用户要 ~500 GB,权重还没加载就吃掉 8×H100 节点 640 GB 的大半。MLA 把这个数字降到 ~9 GB,一台机器可以同时放下很多个这样的用户。MLA 不是"性能优化",MLA 是 DeepSeek-V3 能在商用硬件上跑的前提。 DeepSeek-V3 is 671B params (MoE, ~37B active). At MHA's KV cache rate (~3.8 MB per token), 128K context per user would be ~500 GB, most of an 8×H100 node's 640 GB before any weights are loaded. MLA brings it to ~9 GB, small enough that one machine can hold many such users at once. MLA isn't a "perf optimization"; it's the prerequisite for DeepSeek-V3 running on commodity hardware at all.

扩展 · 重要EXTENDED · KEY "吸收"那一步的代数 — 一步一步把 W_uk 塞进 W_q The "absorb" step worked out — folding W_uk into W_q

MLA 看着像魔法,关键就是"把解压矩阵 W_uk 吸收进 Q 的投影 W_q"这一步。展开看其实就是一连串合法的线性代数。

朴素 attention 公式(单 head,省略缩放和 softmax):

O = softmax((Q · Kᵀ) / √d) · V
Q ∈ ℝ^[n, d] · K ∈ ℝ^[m, d] · V ∈ ℝ^[m, d]

MLA 把 K 拆成 K = c_kv · W_uk(其中 c_kv ∈ ℝ^[m, d_c],W_uk ∈ ℝ^[d_c, d])。代入 Q·Kᵀ:

Q · Kᵀ = Q · (c_kv · W_uk)ᵀ
= Q · W_ukᵀ · c_kvᵀ
= (Q · W_ukᵀ) · c_kvᵀ

注意 Q = h · W_q(h 是输入 hidden state)。所以:

Q · W_ukᵀ = h · W_q · W_ukᵀ = h · W_q'
其中 W_q' := W_q · W_ukᵀ ∈ ℝ^[d_model, d_c]

这就是吸收:W_q' 可以在模型加载时一次性算出来,运行时 Q' = h · W_q' 是一次正常的 matmul,跟普通 attention 等价的代价。然后:

Q · Kᵀ = Q' · c_kvᵀ
直接拿 latent c_kv 算 attention scores · 不需要"解压" K

V 那边同样手法:V = c_kv · W_uv,把 W_uv 吸收进输出投影 W_o,得到 W_o' := W_uv · W_o。

结果:KV cache 只存 d_c 维的 c_kv,推理时算两个新 matmul 但没多花 FLOPs——你只是把"解压 K · 算 attention · 解压 V · 投影 O" 重新分组成了"算 Q' · 算 attention · 投影 O'"。这就是吸收的全部:线性代数里的纯结合律重排。

更妙的是,推理 batch size 大的时候,W_uk / W_uv 那两个矩阵从来不需要被算到——它们只活在权重侧,推理路径上看不到。这是模型架构里少有的、纯几何技巧带来的几倍内存收益。

MLA looks like magic; the key is "absorb the decompression matrix W_uk into Q's projection W_q". Expanded, it's just a chain of perfectly legal linear algebra.

Naive attention (single head, omitting scale and softmax):

O = softmax((Q · Kᵀ) / √d) · V
Q ∈ ℝ^[n, d] · K ∈ ℝ^[m, d] · V ∈ ℝ^[m, d]

MLA factors K = c_kv · W_uk (c_kv ∈ ℝ^[m, d_c], W_uk ∈ ℝ^[d_c, d]). Substitute into Q·Kᵀ:

Q · Kᵀ = Q · (c_kv · W_uk)ᵀ
= Q · W_ukᵀ · c_kvᵀ
= (Q · W_ukᵀ) · c_kvᵀ

And Q = h · W_q (h is the input hidden state), so:

Q · W_ukᵀ = h · W_q · W_ukᵀ = h · W_q'
where W_q' := W_q · W_ukᵀ ∈ ℝ^[d_model, d_c]

That's absorption: W_q' is precomputed once at model load; at runtime Q' = h · W_q' is one ordinary matmul, whose output is d_c wide per head instead of d wide. Then:

Q · Kᵀ = Q' · c_kvᵀ
attention scores directly from the latent c_kv · no "decompress K" needed

Same trick on V: V = c_kv · W_uv, absorb W_uv into the output projection W_o, get W_o' := W_uv · W_o.

Result: the KV cache stores only the d_c-dim c_kv. The runtime path is regrouped from "decompress K · attention · decompress V · project O" into "compute Q' · attention over the latent · project O'", so no per-token decompression is ever paid. It is not free in FLOPs, since each head now scores over d_c dims instead of d, but decode is bandwidth-bound and the cache is what dominates. That's all absorption is: associativity-driven regrouping of linear algebra.

Even better: however large the batch or the context, W_uk / W_uv are never applied to the cached latents. They live only on the weight side. A rare case of a model architecture getting multi-fold memory savings from a purely algebraic trick.
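The regrouping can be checked numerically. The sketch below uses small random integer matrices so that both paths agree exactly; shapes follow the row-vector convention of this section (W_uk ∈ ℝ^[d_c, d]), and every name and dimension is chosen for illustration only.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rand_matrix(rows, cols, rng):
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, m, d_model, d, d_c = 2, 5, 8, 4, 3          # toy sizes
h    = rand_matrix(n, d_model, rng)            # hidden states of the queries
W_q  = rand_matrix(d_model, d, rng)
W_uk = rand_matrix(d_c, d, rng)                # "decompression" matrix
c_kv = rand_matrix(m, d_c, rng)                # cached latents

# Path 1: decompress K from the latent, then score against Q.
K = matmul(c_kv, W_uk)
scores_decompressed = matmul(matmul(h, W_q), transpose(K))

# Path 2: fold W_uk into the query projection once, score against the latent.
W_q_absorbed = matmul(W_q, transpose(W_uk))    # shape [d_model, d_c]
scores_absorbed = matmul(matmul(h, W_q_absorbed), transpose(c_kv))
```

Both score matrices are identical, and path 2 never builds `K`. Note that `W_q_absorbed` maps to d_c columns rather than d, which is where the extra per-head arithmetic comes from.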

扩展EXTENDED 为什么 RoPE 不能进 latent · MLA 的"分裂头"设计 Why RoPE can't go into the latent · MLA's "split-head" design

上面那个吸收推导有个隐含前提:W_q 和 W_uk 之间不能夹着任何跟位置有关的东西。RoPE 打破了这一点:旋转本身是线性的,但每个 token 的旋转矩阵由它的位置决定。

具体地说,RoPE 把 K 变成 K(m) = R(m) · K_orig,其中 R(m) 是依赖位置 m 的旋转矩阵。如果你想吸收 W_uk 进 W_q,需要把 R(m) 一起处理——但 R(m) 跟 token 位置有关,每个 token 不同,没法预先打包进权重。

DeepSeek 的解决:把 K 拆成两部分:

c_kv(non-RoPE,d_c = 512 维):走 MLA latent,被吸收;
k_pe(RoPE 部分,d_rope = 64 维):单独存原始 K(经过 RoPE),不进 latent。

attention 时把两部分拼起来当 K_final 用。所以 MLA 的 KV cache 实际存的是 [d_c + d_rope] = 576 维的混合向量——绝大部分能享受吸收带来的压缩,少数维度走标准 RoPE 路径。这就是为什么前面公式里出现 "576 = d_c 512 + d_rope 64"。

同样的设计也用在 Q 上:q_nope(non-RoPE)+ q_rope(RoPE)。所以 DeepSeek-V2/V3 的 attention 节点比 Llama 多了几次拼接操作,但 KV cache 容量降下来,带宽收益远超拼接开销。这是模型架构与推理引擎的协同设计——光看模型论文不够,要对着 ggml 的 cgraph 节点才能完整看明白。

The absorption derivation above has a hidden assumption: nothing position-dependent sits between W_q and W_uk. RoPE breaks it. A rotation is linear, but each token gets its own rotation matrix, chosen by its position.

Concretely, RoPE makes K(m) = R(m) · K_orig, where R(m) is the rotation for position m. The score between a query and a key then contains a position-dependent rotation sandwiched between W_q and W_uk. It changes with every query/key pair and matrix products don't commute, so it can't be baked into precomputed weights.

DeepSeek's fix: split K in two:

c_kv (non-RoPE, d_c = 512 dims): goes through the MLA latent, gets absorbed;
k_pe (RoPE-part, d_rope = 64 dims): stored raw (post-RoPE), bypasses the latent.

At attention time the two parts are concatenated to form K_final. So MLA's KV actually stores a [d_c + d_rope] = 576-dim mixed vector — most dims enjoy absorption's compression, a few take the standard RoPE path. That's where "576 = d_c 512 + d_rope 64" earlier came from.

Same design on Q: q_nope (non-RoPE) + q_rope (RoPE). So DeepSeek-V2/V3 attention nodes have a few extra concat operations vs. Llama, but KV cache shrinks enough that bandwidth gains dwarf concat cost. Model architecture co-designed with the inference engine — reading the paper alone isn't enough; you need to read it against the ggml cgraph nodes to see the whole picture.
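A two-dimensional toy makes the obstacle visible. This sketch applies a single RoPE-style rotation pair to a query and a key; the frequency `theta` and the vectors are arbitrary illustrative values.

```python
import math

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def rope_score(q, k, pos_q, pos_k, theta=0.1):
    """Dot product after rotating q and k by their own positions."""
    rq, rk = rotate(q, pos_q * theta), rotate(k, pos_k * theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = [1.0, 2.0], [0.5, -1.0]
same_offset_a = rope_score(q, k, pos_q=5, pos_k=3)
same_offset_b = rope_score(q, k, pos_q=12, pos_k=10)
other_offset  = rope_score(q, k, pos_q=5, pos_k=4)
```

The first two scores agree because only the offset between the positions matters, while the third differs. The matrix that effectively sits between q and k therefore changes with every query/key pair, which is why it cannot be folded into one fixed W_q' and why the RoPE dims are cached separately.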

CHAPTER 12 · VARIANTS · KEY

MoE 路由 — gate → top-k experts

MoE routing — gate → top-k experts

同一层 FFN,256 个备选,只跑 8 个

same FFN slot, 256 candidates, only 8 fire

到这里 attention 块讲完了,但 transformer 一层不只有 attention——后面还有 FFN(Feed Forward Network),通常占模型参数量的 2/3。对 Llama-3-8B 来说,FFN 是一个 4096 → 14336 → 4096 的两层 MLP(SwiGLU 激活),每层每 token 都跑这个 14336 维的中间表示。Decode 时极其费内存带宽。

MoE(Mixture of Experts)思路:不是只有一个 FFN,而是有 N 个"专家"FFN。每个 token 经过一个 router,只激活 top-k 个专家。比如 DeepSeek-V3 是 256 个路由专家 + 1 个共享专家,每 token 激活 8 个路由专家。"激活" = "FFN 计算实际经过这些专家",其它 248 个专家这一步根本不参与。

结果:模型参数总量很大(DeepSeek-V3 是 671B),但每个 token 实际激活的参数只有 37B——decode 阶段的内存带宽消耗按 37B 算,但模型容量是 671B。"大模型而不慢"的诀窍就在这里。

Attention is done, but a transformer layer has more than attention: there's also the FFN (Feed Forward Network), usually about 2/3 of all parameters. For Llama-3-8B, the FFN is a 4096 → 14336 → 4096 gated MLP (SwiGLU, with gate, up and down weight matrices). Every token, in every layer, goes through that 14336-dim intermediate, which makes decode heavily bandwidth-bound.

MoE (Mixture of Experts) idea: have N "expert" FFNs. Each token passes through a router, which activates only top-k experts. DeepSeek-V3 is 256 routed experts + 1 shared expert, 8 activated per token. "Activated" = "FFN compute actually goes through these"; the other 248 experts don't participate.

Result: total params huge (DeepSeek-V3 has 671B), but active params per token only 37B. Decode bandwidth = 37B's worth. Model capacity = 671B's worth. "Big model that isn't slow" — that's the trick.

router: softmax(W·x) → top-2 · (toy: 8 experts, top-2)

FIG. 08 · animated 每 ~2 秒一轮:router 给 8 个 expert 各算一个分数,选 top-2 激活。这是 token-level 决策——同一段 prompt 里的不同 token 会选到完全不同的 experts。 Every ~2s a new round: router scores all 8, top-2 activate. This is a token-level decision — different tokens in the same prompt pick different experts.

src/llama-graph.cpp · build_moe_ffn() · router + dispatch

// Input  cur: shape [n_embd, n_tokens]
// Output: same shape, computed as a weighted sum over the top-k experts
ggml_tensor * build_moe_ffn(ggml_tensor * cur, ...) {
    // 1. Router logits, shape [n_expert, n_tokens]
    ggml_tensor * logits = ggml_mul_mat(ctx0, gate, cur);
    ggml_tensor * probs  = ggml_soft_max(ctx0, logits);
    // 2. Top-k: for each token, pick the k most probable experts
    //    selected_experts has shape [k, n_tokens]; weights has the same shape
    ggml_tensor * selected_experts = ggml_top_k(ctx0, probs, n_expert_used);
    ggml_tensor * weights = ggml_get_rows(ctx0,
        ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), selected_experts);
    // 3. (optional) Normalize so the top-k weights sum to 1
    if (norm_w) {
        ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights);
        weights = ggml_div(ctx0, weights, weights_sum);
    }
    // 4. Dispatch cur to the selected experts.
    //    ggml_mul_mat_id is a batched matmul in which selected_experts
    //    decides which expert's weights each row uses
    ggml_tensor * up       = ggml_mul_mat_id(ctx0, ffn_up_exps,   cur, selected_experts);
    // named gate_out so it does not shadow the router weight `gate` used in step 1
    ggml_tensor * gate_out = ggml_mul_mat_id(ctx0, ffn_gate_exps, cur, selected_experts);
    ggml_tensor * act = ggml_silu(ctx0, gate_out);   // the silu half of SwiGLU
    ggml_tensor * par = ggml_mul(ctx0, up, act);
    ggml_tensor * out = ggml_mul_mat_id(ctx0, ffn_down_exps, par, selected_experts);
    // 5. Weight each expert output and sum to get the final FFN output
    out = ggml_mul(ctx0, out, weights);
    return ggml_sum_rows(ctx0, out);   // add up the top-k contributions
}
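The five steps in the listing above can be traced on plain Python lists. This is a minimal sketch rather than llama.cpp's implementation: `moe_ffn`, the toy `experts` (each one just scales its input) and the 2-dim hidden state are hypothetical names and sizes chosen for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_ffn(x, gate_w, experts, k, normalize=True):
    """x: hidden state; gate_w: one router row per expert; experts: callables."""
    # 1. router logits: one dot product per expert, then softmax
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    probs = softmax(logits)
    # 2. top-k expert ids by router probability
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # 3. optionally renormalize so the k weights sum to 1
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    # 4 + 5. run only the selected experts and take the weighted sum
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights


# Toy setup (hypothetical): expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]
gate_w = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
out, top, weights = moe_ffn([1.0, 0.0], gate_w, experts, k=2)
print(top, [round(w, 3) for w in weights])
```

Experts 0 and 1 are never called for this token, which is the whole point: their weights do not need to be read from memory at this step.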

MoE 真实问题:负载均衡

The real MoE problem: load balancing

"每个 token 选 top-k"听起来简单,实际跑起来会出三种病:

专家坍缩:有几个专家被绝大多数 token 选中,其它专家几乎闲置——白白浪费了 MoE 的容量优势。训练阶段通过auxiliary loss(惩罚不均衡)解决,推理阶段已经是结果。
路由抖动:相似 token 选到完全不同的 experts,导致 cache 命中差、batch 利用率低。
跨设备通信:分布式部署时专家分布在不同 GPU 上,一个 token 选了 8 个专家可能跨 8 张卡——通信开销变成新瓶颈,催生了 expert parallelism 这个独立的优化领域。

DeepSeek-V3 用了"无 aux loss"路由(论文里的 "auxiliary-loss-free load balancing"),通过给每个专家一个动态 bias 来调整选择概率——既保留了"自然路由"的语义,又避免了 aux loss 对模型质量的影响。这个细节在 llama.cpp 里对应 build_moe_ffn 的 expert bias 参数。

"Each token picks top-k" sounds simple, but in practice three diseases emerge:

Expert collapse: a few experts get picked by most tokens; the rest sit idle, wasting MoE's capacity advantage. Mitigated at training via auxiliary loss (penalizing imbalance); at inference time it's already baked in.
Routing jitter: similar tokens pick wildly different experts, hurting cache locality and batch utilization.
Cross-device traffic: in distributed deployments experts live on different GPUs; one token picking 8 experts may span 8 cards — communication becomes the new bottleneck. This birthed expert parallelism as its own optimization area.

DeepSeek-V3 uses an "aux-loss-free" router (paper: "auxiliary-loss-free load balancing"), giving each expert a dynamic bias that nudges selection probabilities — keeping "natural routing" semantics while avoiding aux-loss's quality cost. In llama.cpp this surfaces as the expert-bias arg of build_moe_ffn.

实际 decode 速度 · REAL DECODE SPEED 671B 总参数,37B 激活参数,在 H100 级硬件 + MLA 配合下(671B 单卡装不下,需要多卡节点),decode 速度 ≈ 40 tokens/s(单 batch)。同样硬件跑一个 70B 稠密模型只有 ~12 tokens/s。MoE 让你拥有快于 70B 稠密模型的速度 + 远超 70B 的质量,这是 2024-2025 年 LLM 工程界最热的方向。 671B total params, 37B active. On H100-class hardware with MLA (a multi-GPU node, since 671B does not fit on one card), decode ≈ 40 tokens/s (single batch). A dense 70B on the same hardware does ~12 tokens/s. MoE buys you better-than-dense-70B speed at much-greater-than-70B quality, which made it the dominant industry direction in 2024–2025.

扩展 · 重要EXTENDED · KEY Expert parallelism — 256 个专家怎么分到 8 张卡上 Expert parallelism — splitting 256 experts across 8 GPUs

671B 参数的 DeepSeek-V3 不可能塞进一张 GPU——8×H100 的 640 GB 也要量化才装得下。Tensor Parallelism(沿 d_model 维切矩阵)是稠密模型的标准做法,但对 MoE 不友好:每个 expert 都是独立的 FFN,横切它意义不大。MoE 的自然并行轴是expert parallelism——把 256 个专家分散到多张卡上,每张卡负责 32 个 expert。

但这立刻引出一个新问题:一个 token 选中的 top-8 个专家,大概率不在同一张卡上。每一层 FFN 都要把这个 token 的 hidden state发到 8 张卡里有它选中的 expert 的那几张,算完再聚回来——这就是 all-to-all 通信,MoE 训练 / 推理里最大的工程挑战。

671B-param DeepSeek-V3 won't fit on a single GPU — 8×H100 = 640 GB needs quantization to fit. Tensor Parallelism (cut along d_model) is the dense-model default but doesn't suit MoE: every expert is an independent FFN, cutting it horizontally helps little. The natural axis for MoE is expert parallelism — spread the 256 experts across cards, 32 experts per card.

This immediately introduces a new problem: a token's top-8 experts almost certainly don't all live on one card. Every FFN layer must broadcast this token's hidden state to the cards holding its selected experts, compute, then gather back — that's all-to-all communication, the biggest engineering challenge in MoE training/inference.

a typical MoE forward · expert parallel · pseudocode · 2 all-to-alls per layer

# Each rank (GPU) holds a slice of the experts, e.g. rank_i owns expert[i*32:(i+1)*32]
def moe_layer_forward(x, gate_weights, expert_weights):
    # step 1: routing with the local gate; every rank does this (cheap)
    scores = softmax(x @ gate_weights)
    selected = top_k(scores, k=8)          # [n_tokens, 8] expert ids
    # step 2: dispatch. Group tokens by expert id and send them with an all-to-all;
    # each rank now holds the tokens destined for the experts it owns
    tokens_for_local_experts = all_to_all(x, selected)
    # step 3: local expert compute as a grouped GEMM (each expert handles its own batch)
    out_local = grouped_gemm(tokens_for_local_experts, expert_weights)
    # step 4: combine. A second all-to-all returns results to each token's home rank
    out = all_to_all(out_local, selected.T)
    # step 5: weighted sum of the top-k outputs
    return weighted_sum(out, scores)

关键瓶颈是 step 2 和 step 4 的两次 all-to-all。8 卡 NVLink 4.0 单向 ~450 GB/s,每个 token hidden 状态 7168 × 2 bytes ≈ 14 KB,一次 batch 1024 token 的两次 all-to-all 是 ~30 MB,约 67 μs(这是下界:每个 token 只算了一次,而 top-8 路由可能把它发往多张卡)。听起来不多,但 DeepSeek-V3 有 58 个 MoE 层(共 61 层,前 3 层是稠密 FFN),58 × 67 μs ≈ 3.9 ms 的纯通信时间。MoE 模型的 decode 延迟里,通信能占 20-30%。这就是为什么 DeepSeek 自己写了 DeepEP 这个高性能 all-to-all 库,以及为什么"训练 MoE 比训练同 FLOPs 的稠密模型工程量大得多"。

The bottleneck is the two all-to-alls in steps 2 and 4. 8-card NVLink 4.0 is ~450 GB/s one-way; each token hidden state is 7168 × 2 bytes ≈ 14 KB. Two all-to-alls for a 1024-token batch = ~30 MB ≈ 67 μs (a lower bound: it counts each token once, although top-8 routing can send it to several cards). Sounds small, but DeepSeek-V3 has 58 MoE layers (61 layers total, the first 3 with dense FFNs), so 58 × 67 μs ≈ 3.9 ms of pure communication. 20-30% of MoE decode latency can be communication, which is why DeepSeek wrote their own high-performance all-to-all library DeepEP, and why "training MoE is much more engineering than training a dense model of the same FLOPs".
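The back-of-envelope estimate above is easy to redo with other batch sizes or link speeds. The sketch below only reproduces the arithmetic; `all_to_all_seconds` and `total_comm_ms` are hypothetical helper names, and the layer count is left as a parameter because it depends on the model.

```python
def all_to_all_seconds(n_tokens, d_model, bytes_per_elem,
                       link_bytes_per_s, n_collectives=2):
    """Time to move every token's hidden state once per collective.

    Simplification taken from the text: each token is counted once per
    collective, even though top-8 routing can send it to several cards.
    """
    payload = n_tokens * d_model * bytes_per_elem * n_collectives
    return payload / link_bytes_per_s


# Numbers from the paragraph above: 1024 tokens, d_model 7168, fp16, 450 GB/s.
per_layer = all_to_all_seconds(n_tokens=1024, d_model=7168,
                               bytes_per_elem=2, link_bytes_per_s=450e9)


def total_comm_ms(n_moe_layers, per_layer_s=per_layer):
    """Pure communication time across all MoE layers, in milliseconds."""
    return n_moe_layers * per_layer_s * 1e3


print(f"per layer: {per_layer * 1e6:.1f} us")  # about 65 us
```

The exact payload is 29.4 MB rather than the rounded 30 MB, which is why this lands a couple of microseconds under the figure in the text.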

扩展EXTENDED "无 aux loss" 路由 — DeepSeek-V3 的 bias gate "Aux-loss-free" routing — DeepSeek-V3's bias-gate trick

MoE 训练里,经典的负载均衡方法是加一个 auxiliary loss(L_aux):惩罚"某些 expert 被选得过多"。但 aux loss 有两个副作用:

它跟主任务 loss 拔河——为了均衡而牺牲了一些质量。
它的权重难调——太小不均衡,太大模型质量掉。

DeepSeek-V3 论文提出的 auxiliary-loss-free 方法,核心是给每个 expert 加一个 动态 bias b_i,在 routing 时:

选 top-k 时按 (gate_logit_i + b_i) 排序
但最终的"路由权重"还是用原始 gate_logit_i

巧妙之处:b_i 只影响"谁被选中",不影响"选中后的权重"。训练时维护每个 expert 的 token 计数,过载的 expert 把 b_i 调低(下次少被选),冷门 expert 把 b_i 调高(下次多被选)。这是外部反馈而不是梯度信号——主任务 loss 完全不被污染。

llama.cpp 在 build_moe_ffn 里通过 expert_bias 参数支持这个:加载 DeepSeek-V3 模型时,bias 是模型权重的一部分,运行时直接加到 gate logit 上。训练时的动态调整在推理时变成了静态的 bias 向量——又一个"训练阶段算的代价,推理阶段直接享用"的例子。

The classic MoE load-balancing trick is an auxiliary loss (L_aux): penalize "some experts get picked too often". But aux loss has two side effects:

It tugs against the main task loss — sacrificing some quality for balance.
Its weight is hard to tune — too small means imbalance, too large means quality drop.

DeepSeek-V3's auxiliary-loss-free method adds a dynamic bias b_i per expert. At routing time:

top-k selection by (gate_logit_i + b_i)
but the final "routing weight" still uses raw gate_logit_i

Clever bit: b_i affects who gets picked, not the weight of the pick. During training they track per-expert token counts; overloaded experts get b_i decreased (picked less next time), underloaded experts get b_i increased. It's external feedback, not a gradient signal — the main loss stays clean.

llama.cpp's build_moe_ffn supports this via the expert_bias arg: when loading DeepSeek-V3, the bias is part of the saved weights, added directly to gate logits at runtime. Training-time dynamic adjustment becomes inference-time static bias — another "training pays, inference takes" idiom.
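The split between "who gets picked" and "how much the pick weighs" is compact enough to show directly. This is a toy sketch under stated assumptions, not DeepSeek's or llama.cpp's code: `route_with_bias` and `update_bias` are hypothetical names, the weights are renormalized over the chosen experts, and the fixed `step` stands in for whatever update rate training actually uses.

```python
def route_with_bias(scores, bias, k):
    """Pick top-k by (score + bias), but weight the picks by the raw score."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = order[:k]
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]  # bias never enters here
    return chosen, weights


def update_bias(bias, counts, step=0.001):
    """Training-time feedback: push overloaded experts down, idle ones up."""
    mean = sum(counts) / len(counts)
    return [b - step if c > mean else b + step if c < mean else b
            for b, c in zip(bias, counts)]


scores = [0.9, 0.5, 0.4, 0.1]
print(route_with_bias(scores, [0.0, 0.0, 0.0, 0.0], k=2)[0])   # no bias
print(route_with_bias(scores, [-0.6, 0.0, 0.0, 0.0], k=2)[0])  # expert 0 demoted
```

At inference time only `route_with_bias` matters: the bias vector is frozen and loaded with the rest of the weights.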

扩展EXTENDED MoE 家族大谱 — Mixtral / DBRX / Phi-MoE / DeepSeek-V3 / Llama-4 怎么选 MoE family tree — Mixtral / DBRX / Phi-MoE / DeepSeek-V3 / Llama-4 choices

MoE 不是一个固定方案,每家在专家数 / top-k / 共享专家 / 路由函数这几个旋钮上做了不同选择。把目前主流的拉一张表对比:

MoE isn't one fixed recipe — each lab picks differently on number of experts / top-k / shared experts / gating function. A side-by-side of current production models:

model	n_expert	top-k	shared experts (共享专家)	gate fn (路由函数)	active / total
Mixtral 8x7B	8	2	none	softmax	12.9B / 47B
Mixtral 8x22B	8	2	none	softmax	39B / 141B
DBRX	16	4	none	softmax	36B / 132B
Phi-3.5-MoE	16	2	none	softmax + aux loss	6.6B / 42B
DeepSeek-V2	160 routed + 2 shared	6	2, always active	softmax + aux loss	21B / 236B
DeepSeek-V3	256 routed + 1 shared	8	1, always active	sigmoid + aux-free bias	37B / 671B
Llama-4 Maverick	128	1	yes	sigmoid	17B / 400B
Llama-4 Behemoth	16	1	yes	sigmoid	288B / ~2T

看几个有意思的演化方向:

专家数从 8 → 256: 早期 Mixtral 8 个大专家,近年 DeepSeek 256 个小专家。趋势是"更细粒度"——更多但更小的专家给路由器更多组合空间,容量利用率更高。
共享专家的回归: 早期 Mixtral 没有共享专家,所有 FFN 计算都走 routed。DeepSeek-V2 起加回了"shared expert" 这个老想法(Switch Transformer 也有过)——理由是有些"通用"知识每个 token 都需要,与其让 router 反复选,不如固定永远激活一个共享 FFN,routed 部分只承担"专门化"工作。这让 routed 专家可以更小、分工更细。
softmax → sigmoid 的转变: DeepSeek-V3 和 Llama-4 都从 softmax router 改成了 sigmoid。softmax 强制 routing 权重和为 1,所有专家"抢一份饼";sigmoid 让每个专家有独立的"该不该激活"概率,top-k 选择更稳定。这是 2024 年 MoE 论文里的新共识。
top-1 vs top-k > 1: Mixtral / DBRX 都用 top-2 或 top-4——一个 token 走多个专家。Llama-4 大胆地走 top-1,只激活 1 个专家——通信开销更小、更适合大规模分布式部署,代价是模型组合性更弱。

从这张表能看出 LLM 工程界的方法收敛:大家在不同点上探索,但最终都在往"更多小专家 + 共享专家 + sigmoid router + aux-free balance" 这个方向走。2024 年底的"教科书 MoE" 跟 2021 年 Switch Transformer 论文里的"教科书 MoE" 已经很不一样。

Notable evolution lines:

n_experts: 8 → 256: early Mixtral 8 large experts; recent DeepSeek 256 small ones. The trend is "finer grain" — more but smaller experts give the router more combinatorial room, better capacity utilization.
Return of the shared expert: early Mixtral had no shared expert, all FFN went routed. From DeepSeek-V2 onward the "shared expert" idea returned (also in Switch Transformer earlier) — some "generic" knowledge is needed by every token; better to always activate a shared FFN than have the router repeatedly select for it, freeing routed experts to specialize.
softmax → sigmoid: DeepSeek-V3 and Llama-4 both switched from softmax to sigmoid routers. Softmax forces routing weights to sum to 1 — all experts "fight over one pie". Sigmoid gives each expert independent "should I fire" probabilities, making top-k more stable. New consensus in 2024 MoE papers.
top-1 vs top-k > 1: Mixtral / DBRX use top-2 or top-4 — a token visits multiple experts. Llama-4 boldly uses top-1 — lower comm cost, friendlier to large-scale distributed deployment, at the cost of weaker model compositionality.

This table shows the field's methodological convergence: people explored different points but all moved toward "more small experts + shared expert + sigmoid router + aux-free balance". The "textbook MoE" of late 2024 looks very different from Switch Transformer's "textbook MoE" of 2021.

扩展EXTENDED MoE 推理上的"token 丢弃" — capacity factor 这件事 "Token dropping" in MoE inference · the capacity-factor question

训练 MoE 时,因为batch 内每个专家收到的 token 数不固定,GPU 内存分配很难——你不知道要给每个专家留多少空间。早期 Switch Transformer 的解决方案是"给每个专家一个 capacity factor"——比如每 expert 最多收 1.5 × (n_tokens / n_experts) 个 token,超过的直接丢掉(走残差跳过 FFN)。

这件事在训练里 acceptable,在推理里完全不行——丢一个 token 模型生成就崩。所以推理时的 MoE 实现都不做 capacity 限制:Mixtral / DeepSeek 推理时每个专家无上限地处理所有路由到它的 token。代价是每个 expert 处理的 token 数不齐——某些 expert 一次跑 50 个 token,另一些跑 5 个。这导致 GPU 上 grouped GEMM 的负载不均,部分 SM 提前完工等待。

这就是"推理时 MoE 实际加速比低于理论激活率"的根本原因:理论上 671B/37B ≈ 18× 的算力节省,实测可能只到 10× 左右,中间差的那部分耗在负载不均 + 通信开销上。vLLM 和 SGLang 都在专家分组(把热门专家分到不同卡上)和动态调度上做了很多努力,但这仍是个开放问题。

In MoE training, each expert receives a variable number of tokens per batch — GPU memory allocation is awkward; you don't know how much to reserve per expert. Early Switch Transformer introduced "capacity factor" — cap each expert at 1.5 × (n_tokens / n_experts); overflow tokens are dropped (skip FFN via residual).

Acceptable in training; unacceptable in inference — drop one token and generation breaks. So inference MoE implementations don't cap capacity: Mixtral / DeepSeek let each expert process all tokens routed to it. Cost: uneven per-expert token counts — some get 50 tokens, others 5. This makes grouped GEMM load-imbalanced; some SMs finish early and idle.

This is the root reason MoE inference speedup falls short of the theoretical active ratio. 671B/37B suggests roughly an 18× compute saving; in practice it may be closer to 10×, with the remainder lost to load imbalance and communication overhead. vLLM and SGLang invest heavily in expert grouping (placing hot experts on different cards) and dynamic scheduling, but it remains an open problem.
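The difference between a capped (training-style) and an uncapped (inference-style) dispatch fits in a few lines. This is an illustrative sketch with top-1 routing for simplicity; `dispatch` and the toy assignment list are hypothetical.

```python
def dispatch(assignments, n_experts, capacity_factor=None):
    """assignments[t] is the expert chosen by token t (top-1 for simplicity).

    With a capacity factor, each expert accepts at most
    capacity_factor * n_tokens / n_experts tokens and the overflow is dropped.
    With capacity_factor=None nothing is dropped, as in inference.
    """
    cap = None
    if capacity_factor is not None:
        cap = int(capacity_factor * len(assignments) / n_experts)
    load = [0] * n_experts
    dropped = []
    for tok, expert in enumerate(assignments):
        if cap is not None and load[expert] >= cap:
            dropped.append(tok)  # token skips the FFN via the residual path
        else:
            load[expert] += 1
    return load, dropped


# 8 tokens, 4 experts, expert 0 is "hot".
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
print(dispatch(assignments, 4, capacity_factor=1.5))  # capped: tokens dropped
print(dispatch(assignments, 4))                       # uncapped: uneven load
```

In the uncapped case no token is lost, but expert 0 does five times the work of the others, which is the grouped-GEMM imbalance described above.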

CHAPTER 13 · OUTLET

LM head — hidden state 回到词表

LM head — back from hidden space to vocab

最后一次 matmul · 4096 维 → 128k 维

the last matmul · 4096 → 128k

走完 32 层 transformer,最后一个 token("llama")的 hidden state 是一个 4096 维向量。它是模型对"下一个 token 应该是什么"的分布式表示——但还不是概率分布。要变成概率分布,得先投回 128k 维的词表空间,得到logits:每个 vocab id 上一个实数,代表"它当下一个 token 的合理性"。

这一步的计算极简——一次 matmul。但矩阵很大:LM head 的权重 output 是 [n_embd, n_vocab] = [4096, 128256],fp16 占 1 GB。所以 LM head 跟 embedding 表是整个模型里两个最重的单体权重。

有意思的是,不少模型(Gemma、GPT-2,以及小尺寸的 Llama-3.2 1B/3B 等)把这两个权重共享(weight tying):embedding 表本身就是 LM head 的转置。这省了 1 GB,代价是输出和输入空间被绑定。Llama-3-8B 和 Llama-1/2 一样不绑,允许两边各自优化,精度涨,内存涨。

After 32 transformer layers, the final token's ("llama") hidden state is a 4096-dim vector. It's the model's distributed representation of "what the next token should be" — but not yet a probability distribution. To get one, project back to 128k-dim vocab space, producing logits: a real number per vocab id, encoding "plausibility as the next token".

The compute is dead simple — one matmul. But the matrix is huge: LM head's output weight is [n_embd, n_vocab] = [4096, 128256], 1 GB in fp16. LM head + embedding table = the two heaviest single weights in the model.

Interesting twist: a number of models (Gemma, GPT-2, and the small Llama-3.2 1B/3B) share these two via weight tying: the embedding table is the LM head's transpose. That saves 1 GB, at the cost of tying the input and output spaces together. Llama-3-8B, like Llama-1/2 before it, keeps them untied, so both sides can specialize: accuracy up, memory up.

src/llama-graph.cpp · llm_build_llama::build() · the final projection · last matmul

// inpL is the hidden state after all 32 layers, shape [n_embd, n_tokens]
// Final RMSNorm
cur = build_norm(inpL, model.output_norm, ...);
// LM head: shape becomes [n_vocab, n_tokens]
cur = build_lora_mm(model.output, cur);
// If the model uses final-logit soft-capping (Gemma 2 does), squash the range here:
// cap * tanh(logit / cap)
if (hparams.f_final_logit_softcapping != 0.0f) {
    cur = ggml_tanh(ctx0, ggml_scale(ctx0, cur, 1.0f/hparams.f_final_logit_softcapping));
    cur = ggml_scale(ctx0, cur, hparams.f_final_logit_softcapping);
}
// cur now holds the logits; next stop is the sampler (C14)
ggml_set_output(cur);
ggml_build_forward_expand(gf, cur);

"只在最后一个 token 上算 logits"

"Only compute logits at the last token"

第 5 章那个 logits=[0,0,0,0,0,1] 现在派上用场了。prefill 时,6 个 token 都走完了 32 层 transformer,理论上 6 个位置都能算出 logits——但我们只关心第 6 个(预测第 7 个 token)。所以 llama.cpp 在 LM head 之前会 skip 掉不需要的位置,只对那 1 个 token 做 matmul:

节省的 FLOPs ≈ (n_tokens - 1) × 2 × n_embd × n_vocab
prefill 6 token 时:省 ≈ 5 GFLOPs · 不小

这是个简单但关键的优化——所有现代推理引擎都默认开。如果你做 logprob 输出(用户要每个 token 的概率),才会强制要求"所有位置都算 logits",这时 LM head 就成了 prefill 阶段的次要瓶颈。

That logits=[0,0,0,0,0,1] from Ch.5 finally earns its keep. Prefill: all 6 tokens have walked the 32 layers, all 6 positions could in principle yield logits — but we only care about position 6 (predicts token 7). So before LM head llama.cpp skips the unneeded positions and matmuls only the 1 token of interest:

FLOPs saved ≈ (n_tokens - 1) × 2 × n_embd × n_vocab
6-token prefill: ~5 GFLOPs saved · not nothing

Simple but load-bearing optimization — every modern engine has it on. The only time it's off: logprob output mode (user wants per-token probabilities), then LM head becomes a secondary prefill bottleneck.
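The savings formula is a one-liner, so it is worth evaluating for a few prompt lengths. A small sketch using the Llama-3-8B dimensions quoted in this chapter; `lm_head_flops` and `flops_saved` are hypothetical helper names.

```python
def lm_head_flops(n_positions, n_embd=4096, n_vocab=128256):
    """FLOPs for the LM-head matmul: one multiply and one add per weight."""
    return 2 * n_positions * n_embd * n_vocab


def flops_saved(n_prompt_tokens, **dims):
    """FLOPs skipped by computing logits only at the last position."""
    return lm_head_flops(n_prompt_tokens - 1, **dims)


for n in (4, 6, 2048):
    print(f"{n:>5} prompt tokens: {flops_saved(n) / 1e9:,.2f} GFLOPs saved")
```

Each skipped position is worth about 1.05 GFLOPs, so the saving is small for a handful of tokens and runs to roughly 2 TFLOPs for a 2048-token prompt.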

扩展EXTENDED Vocab-parallel sharding · 当词表大到一张卡装不下 Vocab-parallel sharding · when the vocab exceeds one card

Llama-3 词表 128k,LM head 1 GB——还能装下一张 H100。但多语言模型(NLLB, XLM-R)词表能达到 250k-1M,LM head 直接膨胀到 4-10 GB。在大模型 + 大词表场景下,LM head 反而成了单卡放不下的张量。

解决方案是 vocab-parallel sharding:把 LM head 沿词表维度切到多张卡上,每张卡只算 vocab 的一段 logit。比如 TP=8 下,第 0 张卡算 vocab[0:16000] 的 logit,第 1 张卡算 vocab[16000:32000]……做 sampling 时需要一次跨卡聚合(all-gather 全部 logit 到一张卡上 sample,或者每张卡本地 top-k 然后聚合 top-k)。

注意这跟 TP 的"沿 d_model 切矩阵" 不同——TP 在 attention/FFN 里切的是输入或输出特征维,词表维是新增的并行轴。所以 vocab-parallel 在概念上是第 4 种并行(在 TP/PP/EP 之外)。生产环境很少独立使用,通常跟 TP 叠加。Megatron-LM 的 VocabParallelEmbedding 类专门做这件事。

另一个相关 trick: logit soft-capping(Gemma 2 用)。如果 LM head 的输出范围 unbounded,某些 token 的 logit 可能爆到 ±100,softmax 后退化成 one-hot,sampling 失去多样性。在 LM head 后面套一个 tanh(x / cap) × cap (cap 是 ~30),把范围拉回 [-cap, cap],既稳又给 sampling 留概率分布。llama.cpp 里对应的超参是 f_final_logit_softcapping;f_logit_scale 是另一个超参,只给 logits 乘一个常数系数。

Llama-3's 128k vocab + 1 GB LM head still fits one H100. But multilingual models (NLLB, XLM-R) push vocabs to 250k-1M, LM head balloons to 4-10 GB. Large model + large vocab → LM head becomes a tensor that doesn't fit on one card.

The fix: vocab-parallel sharding. Shard LM head along the vocab axis across cards; each card computes logits for one slice. Under TP=8, card 0 handles vocab[0:16000], card 1 handles [16000:32000], etc. Sampling then needs one cross-card aggregation (all-gather all logits to one card to sample, or local top-k per card then aggregate).

This is distinct from "shard along d_model" TP — TP cuts input/output feature dims in attention/FFN; vocab is a new parallel axis. So vocab-parallel is conceptually a fourth parallelism dimension (alongside TP/PP/EP). Rarely standalone in production, usually layered with TP. Megatron-LM's VocabParallelEmbedding class does this.
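The "local top-k per card, then aggregate" option can be checked on a toy vocabulary. The sketch below simulates the shards with list slices; `shard_top_k` and `merge_top_k` are hypothetical names, and a real system would do the merge with a collective rather than a Python loop.

```python
import heapq


def shard_top_k(logits_shard, offset, k):
    """Local top-k on one card, returned as (logit, global_vocab_id) pairs."""
    return heapq.nlargest(
        k, ((logit, offset + i) for i, logit in enumerate(logits_shard)))


def merge_top_k(per_shard_candidates, k):
    """Aggregate the per-card candidates into the global top-k."""
    return heapq.nlargest(
        k, (c for shard in per_shard_candidates for c in shard))


# Toy vocab of 16 ids split across 4 "cards" of 4 ids each.
logits = [((i * 7) % 16) / 4.0 for i in range(16)]
shard_size = 4
per_shard = [shard_top_k(logits[s:s + shard_size], s, k=3)
             for s in range(0, len(logits), shard_size)]
global_top = merge_top_k(per_shard, k=3)
print([vocab_id for _, vocab_id in global_top])
```

Only k candidates per card cross the interconnect instead of the full logit slice, which is why this variant is attractive when the sampler truncates to top-k anyway.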

A related trick: logit soft-capping (Gemma 2). If LM head output is unbounded, some token logits can blow up to ±100, softmax degenerates to one-hot, and sampling loses diversity. Wrapping the LM head output in tanh(x / cap) × cap (cap ~30) brings it back to [-cap, cap]: stable, and it leaves room for a real probability distribution. In llama.cpp the corresponding hyperparameter is f_final_logit_softcapping; f_logit_scale is a separate hyperparameter that only multiplies the logits by a constant.
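A quick numeric check shows what the cap buys. This sketch applies the tanh formula from the paragraph above to three made-up logits; `soft_cap` is a hypothetical helper name.

```python
import math


def soft_cap(logit, cap=30.0):
    """cap * tanh(logit / cap): near-identity for small logits, bounded by cap."""
    return cap * math.tanh(logit / cap)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


raw = [100.0, 60.0, 5.0]             # made-up, deliberately extreme logits
capped = [soft_cap(x) for x in raw]

print([round(p, 4) for p in softmax(raw)])
print([round(p, 4) for p in softmax(capped)])
```

Without the cap the first token takes essentially all the probability mass; with it the runner-up keeps a meaningful share, while small logits such as 5.0 are barely changed.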

扩展EXTENDED 为什么 LM head 量化最敏感 · 一个权重错位的代价 Why LM head is the most quantization-sensitive layer

第 18 章那个 Q4_K_M = "mixed precision" 把 LM head 单独留在 Q6_K 不量化到 Q4 ——这件事的原因值得展开。

普通层(attention / FFN)的权重量化损失会被后续层平滑掉:某个 weight 量化错 5%,经过 32 层 transformer 的层层抹平,对最终 hidden state 影响只剩 ~0.1%。但 LM head 不一样——它是最后一层,后面没有 transformer 抹平。它的输出直接是 sampling 的输入。

更糟的是,LM head 是 [n_embd × n_vocab] 的矩阵,n_vocab = 128k 行各自独立。一个权重出错只影响一个 vocab id 的概率——但哪一个 id 是随机的。如果倒霉量化错的那个 id 恰好是高频 token("the", " ", "的"),整个生成质量立刻断崖式下跌。

实测数据:对 Llama-3-8B,LM head 单独量化到 Q4 → PPL 涨 ~8%(比全模型 Q4 还差),保持 Q6_K → PPL 涨仅 ~0.5%。这个张量约 5.25 亿个权重,Q6_K 比 Q4_K 每权重多约 2 bit,也就是多花 ~0.13 GB 内存换回约 7.5 个百分点的 PPL,绝对划算。

这件事的一般原则:越接近模型 IO 边界的层,量化越敏感。LM head 和 embedding 表都属于"边界层",量化要谨慎;中间的 attention/FFN 厚得多,量化容忍度高。理解这个,你就明白为什么 GGUF 量化方案那么琐碎(每层不同位数)——它是大量经验测试出的层级量化策略。

Ch.18's Q4_K_M = "mixed precision" leaves LM head at Q6_K instead of quantizing to Q4. Worth unpacking why.

Quantization errors in regular layers (attention / FFN) get smoothed over by subsequent layers: a 5% weight error, after 32 layers of transformer flattening, contributes only ~0.1% to the final hidden state. LM head is different — it's the last layer; no transformer behind to smooth. Its output is directly sampling's input.

Worse: LM head is [n_embd × n_vocab] with n_vocab = 128k independent rows. One weight error affects just one vocab id's probability — but which id is random. If the misquantized id happens to be a high-frequency token ("the", " ", "的"), generation quality drops off a cliff.

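Measurements: Llama-3-8B with only the LM head quantized to Q4 → PPL up ~8% (worse than the whole model at Q4); keeping it at Q6_K → PPL up only ~0.5%. The tensor has about 525M weights and Q6_K spends roughly 2 more bits per weight than Q4_K, so that is about 0.13 GB of extra memory for about 7.5 points of PPL. Clearly worth it.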

General principle: layers near the model's IO boundary are more quant-sensitive. LM head and embedding are "boundary layers" — quantize carefully. Middle attention/FFN is much thicker, more tolerant. Understand this and you'll see why GGUF quantization is so fiddly (different bits per layer) — it's a per-layer empirical strategy crafted from extensive testing.

CHAPTER 14 · OUTLET

Sampling — 从 128k 维里掷一次骰子

Sampling — rolling a die in 128k dimensions

llama-sampling.cpp · 一条 sampler chain

llama-sampling.cpp · a chain of samplers

logits 在手了——128256 个浮点数,代表每个 vocab id 当下一个 token 的"可能性"。要让它变成一个具体的 token,要经过两步:

整形:用 temperature、top-k、top-p、repetition penalty 等等"过滤器"修改 logits/probabilities——让分布更尖、更扁、或者抑制重复。
采样:从修过形的概率分布里掷一次骰子,选一个 token。

llama.cpp 把这两步统一抽象成"sampler chain"——每个 sampler 是一个改 logits 的函数,链尾的那个负责"挑出一个 token"。这设计跟 Unix pipe 一样:简单、组合、可换。

Logits in hand — 128256 floats, each one's "plausibility" as the next token. To pick one concrete token, two steps:

Shape: apply temperature, top-k, top-p, repetition penalty and other "filters" to modify logits/probabilities — sharpening, flattening, or suppressing repeats.
Sample: roll a die over the reshaped probability distribution, pick one token.

llama.cpp models both as a "sampler chain" — each sampler is a function that mutates logits; the tail sampler picks the final id. Like Unix pipes: simple, composable, swappable.


FIG. 09 · interactive 玩 temperature 和 top-p。低 temperature(0.3)→ 分布更尖,模型更"确定";高 temperature(1.5)→ 分布更扁,更"有想象力"。低 top-p(0.6)→ 砍掉长尾,只在最可能的几个 token 里选;高 top-p(0.95)→ 几乎全保留。橘色那一行是"被采样选中"。 Play with temperature and top-p. Low temp (0.3) → sharper distribution, model more "certain". High temp (1.5) → flatter, more "imaginative". Low top-p (0.6) → cut the long tail, pick from a few. High top-p (0.95) → keep almost all. The orange row is "picked by sampling".

src/llama-sampling.cpp · llama_sampler_chain_apply() · unix pipe of samplers

// A user-configured sampler chain, for example:
//   penalties → top-k → top-p → min-p → temperature → dist
// llama.cpp applies the samplers in order; the last one draws the token
void llama_sampler_chain_apply(llama_sampler * chain, llama_token_data_array * cur_p) {
    auto * c = (llama_sampler_chain *) chain->ctx;
    // Every sub-sampler receives the full candidate array and may modify it freely
    for (auto * smpl : c->samplers) {
        llama_sampler_apply(smpl, cur_p);
    }
}
// A typical sampler implementation, using top-k as the example:
static void llama_sampler_top_k_apply(llama_sampler * smpl, llama_token_data_array * cur_p) {
    const auto * ctx = (llama_sampler_top_k *) smpl->ctx;
    // Clamp k so the partial sort never runs past the end of the array
    const size_t k = std::min((size_t) ctx->k, cur_p->size);
    // Partial sort: move only the k largest logits to the front
    std::partial_sort(cur_p->data, cur_p->data + k, cur_p->data + cur_p->size,
        [](const llama_token_data & a, const llama_token_data & b) {
            return a.logit > b.logit;
        });
    cur_p->size = k;   // drop everything else
}

几种 sampler 在干的事 · 一行话版

A one-liner for each sampler

sampler	在干什么	what it does	feel (控制感)
`temperature`	logit ÷ T · T < 1 让分布更尖,T > 1 让它更扁	divide logits by T · T<1 sharpens, T>1 flattens	creativity
`top-k`	只保留前 k 个,其余设 -inf	keep top k, others to -inf	cut tail
`top-p`	累积概率到 p 为止,其余设 -inf · 比 top-k 自适应	keep cumulative probability up to p · more adaptive than top-k	cut adaptive tail
`min-p`	保留 p ≥ min_p × max_p 的所有,丢其余 · 比 top-p 更稳	keep where p ≥ min_p × max_p · steadier than top-p	robust cut
`repetition penalty`	把最近出现过的 token 的 logit 按系数压低(正 logit 除以 penalty,负 logit 乘以 penalty) · 防重复	scale down logits of recent tokens (divide positive logits by the penalty, multiply negative ones) · anti-loop	de-loop
`frequency penalty`	按出现次数成比例地从 logit 里减 · OpenAI 风格	subtract from the logit in proportion to occurrence count · OpenAI-style	de-loop
`mirostat`	动态调截断阈值 mu,让 surprise 贴近目标值 tau	dynamically adjust the truncation threshold mu to hold surprise near a target tau	auto-pilot
`grammar / GBNF`	把"不符合给定语法的 token"全部 mask 掉 · 强约束输出格式	mask any token not matching a BNF grammar · structured output	hard-rail
`dist` (链尾 / tail)	最后这步:从修过形的概率里掷骰子选一个 token	final step: actually draw from the reshaped distribution	roll

FIG. 10 sampler 链有顺序:常见的好顺序是 penalties → top-k → top-p → temperature → dist——先按规则过滤,再调整尖度,最后掷骰子。顺序换了行为也变:先 temperature 再 top-p 跟先 top-p 再 temperature 结果完全不同。 Sampler order matters. A common good order: penalties → top-k → top-p → temperature → dist — filter first, sharpen second, roll last. Reorder and behavior changes: "temp before top-p" ≠ "top-p before temp".
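That order changes the result is easy to verify on four candidates. The sketch below is a toy chain in plain Python, not the llama.cpp API: `temperature`, `top_p` and `run_chain` are hypothetical stand-ins that operate on `(token, logit)` pairs.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature(cands, t):
    """Divide every logit by t."""
    return [(tok, logit / t) for tok, logit in cands]


def top_p(cands, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    cands = sorted(cands, key=lambda c: c[1], reverse=True)
    probs = softmax([logit for _, logit in cands])
    kept, cum = [], 0.0
    for cand, prob in zip(cands, probs):
        kept.append(cand)
        cum += prob
        if cum >= p:
            break
    return kept


def run_chain(cands, chain):
    """Apply each (sampler, argument) pair in order, like a pipe."""
    for fn, arg in chain:
        cands = fn(cands, arg)
    return cands


cands = [("a", 2.0), ("b", 1.0), ("c", 0.0), ("d", -1.0)]
filter_then_temp = run_chain(cands, [(top_p, 0.9), (temperature, 0.5)])
temp_then_filter = run_chain(cands, [(temperature, 0.5), (top_p, 0.9)])
print(len(filter_then_temp), len(temp_then_filter))
```

Applying temperature first sharpens the distribution before top-p measures cumulative mass, so fewer candidates are needed to reach 0.9 and the tail is cut harder.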

主线 · 我们的下一个 token · MAIN LINE · OUR NEXT TOKEN prompt "你好,llama" 走完 32 层、LM head 投影、sampler chain 之后,假设我们用 temperature=0.7 + top-p=0.9,选出来的下一个 token 可能是 "你"(id 47045)。这个 id 会被立即拼到对话历史里,作为下一次 decode 的最后一个 token——下一次 decode 时,它的 K 和 V 算出来写进 KV cache,Q 跟前 6 个 K 算 attention,产出第 8 个 token……如此循环,直到第 15 章的 EOS 出现。 After our prompt passes through 32 layers, LM head, and a sampler chain with temperature=0.7 + top-p=0.9, the next token might be (say) "Hi". That id is immediately appended to the conversation history and becomes the last token of the next decode. Its K and V are computed and written into KV cache; its Q attends over the 6 K entries already cached, producing the 8th token. Repeat until the EOS of Ch.15 fires.

扩展EXTENDED Min-p · 2024 年的新 sampler · 比 top-p 更稳 Min-p · 2024's new sampler · steadier than top-p

top-p 有个众人皆知但很少人说清楚的毛病:cutoff 不自适应。top-p=0.9 在高确定性情境下会保留太多——比如下一个 token 是 ".",最大概率 0.85,top-p=0.9 还得拉进来一两个 ~0.05 的备选 token,这些备选语义可能完全无关,纯粹靠 temperature 进 noise。top-p=0.9 在高不确定性情境下又保留太少——如果前 20 个 token 概率都在 0.04-0.06 之间,top-p=0.9 强行卡在 ~18 个,把第 19、20 个有意义的 token 砍掉。

Min-p(2024 paper《Min P Sampling: Balancing Creativity and Coherence at High Temperature》)解决了这个:不按累积概率算,而是按"跟最大概率的相对值"算。规则:保留所有 p ≥ min_p × p_max 的 token。

高确定性时(p_max = 0.85),min_p=0.1 → 阈值 = 0.085,几乎只剩最大的那一个。
高不确定性时(p_max = 0.06),min_p=0.1 → 阈值 = 0.006,保留所有 ≥ 0.006 的——可能 30 个 token——给采样足够丰富的备选。

结果是更适合高 temperature 创作场景 + 避免低概率 noise token。llama.cpp 早期就支持,vLLM 2024 也加进了 sampler chain。"2024 年的默认 sampler 是 min_p 而不是 top_p" 已是社区共识。

Top-p has a widely known but rarely articulated issue: its cutoff isn't adaptive. Top-p=0.9 keeps too many in high-certainty situations — if the next token is "." with p_max = 0.85, top-p=0.9 still pulls in one or two ~0.05 alternatives that may be semantically unrelated, pure temperature noise. Top-p=0.9 keeps too few in high-uncertainty situations — if the top 20 tokens all hover 0.04-0.06, top-p=0.9 cuts at ~18 and drops legitimate 19th and 20th candidates.

Min-p (2024 paper Min P Sampling: Balancing Creativity and Coherence at High Temperature) fixes this. Instead of cumulative probability, use "relative-to-max": keep all tokens with p ≥ min_p × p_max.

High certainty (p_max = 0.85), min_p=0.1 → threshold = 0.085 — almost only the top token.
High uncertainty (p_max = 0.06), min_p=0.1 → threshold = 0.006 — keep all ≥ 0.006 — maybe 30 tokens — giving sampling enough variety.

Result: a better fit for high-temperature creative writing, with fewer low-probability noise tokens. llama.cpp shipped support early; vLLM added it in 2024. Many practitioners in the local-inference community now prefer min_p to top_p as their default.
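The two regimes described above can be reproduced with a few lines. A minimal sketch of the rule "keep p ≥ min_p × p_max"; `min_p_filter` is a hypothetical name and the two distributions are made up to match the examples in the text.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the maximum,
    then renormalize the survivors. Returns (index, probability) pairs."""
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]


# High certainty: p_max = 0.85, threshold 0.085, only the top token survives.
confident = [0.85, 0.05, 0.05, 0.03, 0.02]
# High uncertainty: p_max = 0.06, threshold 0.006, all 20 tokens survive.
flat = [0.06] * 10 + [0.04] * 10

print(len(min_p_filter(confident, 0.1)), len(min_p_filter(flat, 0.1)))
```

The same `min_p` value yields a hard cut in one case and no cut in the other, because the threshold scales with the model's own confidence.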

扩展EXTENDED DRY 和 XTC — 两个看起来奇怪但实际有用的 sampler DRY and XTC — two strange-looking but practical samplers

"repetition penalty"(C14 主表里那条)是最早出现的抗重复 sampler,但它太粗暴——只要 token 出现过就一律打折,常常误杀必须重复的 token(逗号、空格、"the" 这种)。2024 年社区(主要 KoboldAI / SillyTavern 圈子)给出了两个更精细的方案:

DRY(Don't Repeat Yourself, 2024): 不惩罚单个 token,而是惩罚序列。算法:扫描历史,找当前位置如果选某 token 会形成"跟之前出现过的 n-gram 重合"的情况,对这些 token 施加随重复长度指数增长的惩罚(更长的重复 → 更重的惩罚)。这样模型不会重复整段话,但保留正常重复的小词。
XTC(Exclude Top Choices, 2024): 反 sampler,主动砍掉头部高概率 token,让模型去选"次优但仍合理"的备选。用法:在生成创意性文本时刻意让模型"不走最常见的路"。配合高 temperature 用,故事写作场景显著提升多样性。

这两个 sampler 都不在主流学术论文里,纯粹是 llama.cpp 社区(应用方)摸出来的工程经验。它们说明了 sampling 这一层的设计空间还远没有探索完——研究界主要关心模型本身,但 sampling 是离用户体感最近的一层,小调整能带来大改变。

"Repetition penalty" (in C14's main table) is the earliest anti-loop sampler, but too blunt — discounts any token that appeared, often killing necessary repeats (commas, spaces, "the"). The 2024 community (mostly KoboldAI / SillyTavern circles) came up with two finer-grained alternatives:

DRY (Don't Repeat Yourself, 2024): penalize sequences, not single tokens. Algorithm: scan history; find tokens whose selection would form "an n-gram already seen"; apply a penalty that grows exponentially with the length of the repeat (longer repeats → heavier). The model avoids repeating whole phrases but keeps natural function-word repeats.
XTC (Exclude Top Choices, 2024): an anti-sampler — actively chop the highest-probability tokens to force the model to pick "second-best-but-still-reasonable" alternatives. Use during creative writing to ensure the model "doesn't take the most obvious path". Combined with high temperature, dramatically boosts diversity in fiction.

Neither sampler is in mainstream academic papers — they're pure llama.cpp-community engineering folklore. They show that the sampling design space is far from exhausted. Research focuses on the model; sampling is the layer closest to user experience, and small tweaks can change a lot.
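XTC is the easier of the two to show in code. The sketch below follows the rule as it is commonly described (with some probability, remove every token above a threshold except the least likely of them); the function name, parameter names and defaults are assumptions for illustration, not llama.cpp's implementation.

```python
import random


def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """probs: dict token -> probability. Returns a filtered, renormalized dict."""
    # Only act on a random fraction of sampling steps.
    if rng.random() >= probability:
        return dict(probs)
    # Tokens at or above the threshold, most likely first.
    above = sorted((t for t, p in probs.items() if p >= threshold),
                   key=probs.get, reverse=True)
    # With fewer than two such tokens there is no "top choice" to exclude.
    if len(above) < 2:
        return dict(probs)
    removed = set(above[:-1])  # keep only the least likely of the top group
    kept = {t: p for t, p in probs.items() if t not in removed}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


probs = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
print(xtc_filter(probs, threshold=0.1, probability=1.0))
```

Keeping the weakest member of the top group is what stops the output from becoming incoherent: the survivor is still a token the model considered plausible.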

扩展EXTENDED Mirostat · 让 perplexity 稳定在目标值的自适应 sampler Mirostat · adaptive sampler targeting a perplexity setpoint

Mirostat 是个奇怪但优雅的 sampler。它的目标不是"砍掉低概率 token",而是"让生成序列的 surprise 保持在指定值"——具体说,让每个生成 token 的 -log(p) 平均稳定在 tau(用户给的目标值,比如 5.0)。

工作原理是一个简单的反馈控制:

定一个 tau ——这是目标"perplexity"(实际上是 ln-perplexity 的等价)。
维护一个动态变量 mu ——它就像一个 top-k 的"有效 k"。
每生成一个 token,看它实际的 surprise 跟 tau 的差距:实际偏高 → mu 调小(限制更严);实际偏低 → mu 调大(放更多 token 进来)。
本质上是一个简单的误差反馈回路(更新规则 mu ← mu − η·(surprise − tau),η 是学习率),把模型的"不确定性输出" 调到 tau。

用户体感: tau=3 → 模型谨慎、保守(每个 token 都"很确定");tau=8 → 模型大胆、有创造性(每个 token 都"有点出乎意料")。这比 temperature 更语义可控——temperature 是输入,perplexity 是输出,Mirostat 直接调输出。

llama.cpp 内置 mirostat v1 / v2 两个版本,sampler chain 里一个 sampler 就实现。它在故事生成 / 角色扮演等"需要保持一致语气"的场景下显著好于固定 temperature——因为它主动校正了局部 perplexity 漂移。

Mirostat is a strange-but-elegant sampler. Its goal isn't "cut low-prob tokens" but "keep generation surprise at a target setpoint" — specifically, average -log(p) per generated token at tau (user setpoint, e.g. 5.0).

It's literally a feedback controller:

Set tau — target "perplexity" (technically log-perplexity equivalent).
Maintain a dynamic mu — it's like an effective top-k cap.
After each token, compare its actual surprise to tau: too high → shrink mu (tighter cap); too low → grow mu (more candidates).
In control terms this is a simple error-feedback loop (mu ← mu − η·(surprise − tau), with learning rate η) that regulates the model's "uncertainty output" toward tau.

User feel: tau=3 → cautious, conservative (every token "certain"); tau=8 → bold, creative (every token "slightly surprising"). More semantically controllable than temperature — temperature is input, perplexity is output; Mirostat drives the output directly.

llama.cpp ships mirostat v1 / v2, each a single sampler in the chain. Notably better than fixed temperature for "maintain consistent tone" scenarios (fiction, roleplay) — it actively corrects local perplexity drift.
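One step of the feedback loop can be written out as follows. This is a sketch in the spirit of Mirostat v2, assuming surprise is measured in bits (log base 2); `mirostat_v2_step` and the parameter name `eta` are illustrative choices, not the llama.cpp API.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau, eta, rng):
    """probs: list of (token, prob) with prob > 0. Returns (token, new_mu)."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    # Truncate: drop candidates whose surprise exceeds mu, but never all of them.
    kept = [tp for tp in ranked if -math.log2(tp[1]) <= mu] or ranked[:1]
    total = sum(p for _, p in kept)
    # Draw one token from the renormalized survivors.
    r = rng.random() * total
    acc = 0.0
    for tok, p in kept:
        acc += p
        if r < acc:
            break
    # Feedback: compare the observed surprise with the target and nudge mu.
    surprise = -math.log2(p / total)
    new_mu = mu - eta * (surprise - tau)
    return tok, new_mu


rng = random.Random(0)
tok, mu = mirostat_v2_step([("a", 0.5), ("b", 0.25), ("c", 0.25)],
                           mu=1.5, tau=3.0, eta=0.1, rng=rng)
print(tok, round(mu, 3))
```

Here only "a" passes the cap, so the observed surprise is 0 bits, well under the target of 3, and mu is raised to let more candidates through on the next step.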

扩展EXTENDED Beam search · 为什么 LLM 几乎不用它 Beam search · why LLMs almost never use it

2018 年 BERT 之前的所有 seq2seq 翻译 / 摘要任务都用 beam search:维护 k 个候选序列,每步把每个候选扩展所有可能的下一 token,排序保留概率最高的 k 个。它不掷骰子,纯启发式搜索"整体最优序列"。

但 LLM 推理基本不用 beam search,理由:

速度: beam_width=4 意味着同时维护 4 倍 KV cache,decode 速度直接降到 1/4。
生成质量奇怪: 寻找"整体最优"导致过于安全——LLM beam search 会反复输出"I am an AI language model and I cannot" 这种万能套话,因为它们的累计概率最高。
多样性: 多 sample(temperature + top-p)+ 选最好那个,实测比 beam search 更好。这就是 ChatGPT API 提供 n 参数的原因。

少数仍用 beam search 的场景:翻译任务(正确翻译是相对客观的)、speculative decoding 里的 draft 模型(选 top-k 候选)、code completion(语法约束下 beam 比 sampling 稳)。但聊天 / 创意写作全部用 sampling-based。这也是Transformer 时代的 sampling 哲学跟 RNN 时代不一样的地方:模型本身够好,允许多样性 > 强行最优。

Before 2018's BERT, all seq2seq translation / summarization used beam search: maintain k candidate sequences; at each step expand each by all possible next tokens; keep the top k by probability. No dice rolling; pure heuristic search for the "globally most likely sequence".

But LLM inference almost never uses beam search:

Speed: beam_width=4 means 4× the KV cache; decode drops to 1/4 the speed.
Weird outputs: searching for "global optimum" leads to excessive safety — LLM beam search routinely produces "I am an AI language model and I cannot" boilerplate because that has the highest cumulative probability.
Diversity: sampling N times (temperature + top-p) + picking the best empirically beats beam search. That's why ChatGPT's API has n.

Beam survives in: translation (where "correct" is relatively objective); speculative decoding's draft model (top-k candidates); code completion (under syntactic constraints, beam is steadier). But chat / creative writing all uses sampling. This is the Transformer-era sampling philosophy: model is good enough that diversity beats forced optimality.
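For readers who have never seen beam search written down, here is a minimal sketch of the "keep the top k sequences" loop. The `next_logprobs` callback and the toy probability table are hypothetical stand-ins for a real model; the table is chosen so that greedy decoding and beam search disagree.

```python
import math


def beam_search(next_logprobs, start, beam_width, steps):
    """Return up to beam_width (score, sequence) pairs, best first.

    next_logprobs(seq) -> {token: logprob} is a hypothetical model interface.
    """
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Expand every surviving sequence by every possible next token.
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the k sequences with the highest cumulative log-probability.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy next-token table: "a" looks better at first, but "b" leads to a better pair.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_next_logprobs(seq):
    return {t: math.log(p) for t, p in TOY[seq[-1]].items()}


greedy = beam_search(toy_next_logprobs, "<s>", beam_width=1, steps=2)[0]
beam = beam_search(toy_next_logprobs, "<s>", beam_width=2, steps=2)[0]
```

With `beam_width=1` the search commits to `a` (0.6 × 0.5 = 0.30); with `beam_width=2` it keeps `b` alive and finds `b z` (0.4 × 0.9 = 0.36). Every extra beam is another sequence whose KV cache has to be kept alive, which is the memory cost the list above refers to.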

CHAPTER 15 · OUTLET

EOS — 一个 token 怎么决定"停下"

EOS — how one token decides "stop"

没有魔法,就是一个特殊 id

no magic, just one special id

模型怎么知道"该停了"?这件事看起来需要"理解语义",其实没有——它只是一个特殊的 token id。Llama-3 的 EOS 是 128009(对应字符串 <|eot_id|>),DeepSeek 是 100001。训练时,每段对话末尾都接这个 token,模型学会"在合适的地方输出它"。推理时,sampler 一旦采到这个 id,主循环就 break。

How does the model know "time to stop"? Sounds like it needs "semantic understanding". It doesn't — it's just one special token id. Llama-3's EOS is 128009 (string: <|eot_id|>); DeepSeek's is 100001. At training, every dialog ends with this token, and the model learns "output it at appropriate spots". At inference, the moment the sampler picks that id, the main loop breaks.

examples/main/main.cpp · the loop that calls llama_decode · where it stops

// User-level decode loop · unrelated code removed
while (n_remain > 0) {
    // 1. Feed the current batch to the model (prefill the first time, decode afterwards)
    llama_decode(ctx, batch);

    // 2. Take the logits at the last position and run the sampler chain to pick the next token
    const llama_token id = llama_sampler_sample(smpl, ctx, -1);

    // 3. Is it EOS? If so, exit
    if (llama_vocab_is_eog(vocab, id)) {
        break;
    }

    // 4. Not EOS: append it to the history and prepare the next decode
    llama_batch_clear(batch);
    llama_batch_add(batch, id, n_past++, { 0 }, true);
    n_remain--;
}

"EOG" 是什么 · 为什么有时候不止一个 EOS

"EOG" — sometimes more than one EOS

llama_vocab_is_eog 里的 EOG 是"end of generation"——一个广义概念,不只是 EOS。Llama-3 有两个 EOS:<|end_of_text|>(128001)是"整段文本结束",<|eot_id|>(128009)是"当前轮次结束"。在 chat 场景里我们要在轮次结束就停,而不是等"文本结束"——所以 sampler 默认把这两个都当 stop token。

这又是个容易踩坑的工程细节:Llama-3 刚开源时,很多人用 Llama-2 风格的 stop token(只检测 </s>),结果模型说完一轮后继续往下扯"User:..."——因为 <|eot_id|> 不在 stop 列表里。这种 bug 在整个推理栈里最难调,因为模型看起来一切正常,只是不知道该闭嘴。

EOG in llama_vocab_is_eog = "end of generation" — a broader concept than EOS. Llama-3 has two EOSs: <|end_of_text|> (128001) = "document ends", <|eot_id|> (128009) = "this turn ends". In chat we want to stop at turn end, not document end — so the sampler treats both as stop tokens by default.

This is another nasty engineering footgun: when Llama-3 first shipped, many integrations used Llama-2-style stop detection (only </s>) and the model would happily keep going after its turn, hallucinating "User:..." — because <|eot_id|> wasn't in the stop set. Hardest bug to debug in the whole stack: the model looks fine, just doesn't know how to shut up.
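The footgun is easy to reproduce with a toy loop. This sketch is not llama.cpp code: `generate` and `scripted` are hypothetical helpers, the Llama-3 ids are the ones quoted above, and `2` is used as the Llama-2-style `</s>` id.

```python
LLAMA3_EOG_IDS = {128001, 128009}  # <|end_of_text|>, <|eot_id|>
LLAMA2_STYLE_STOP = {2}            # </s> only (the stale stop set)


def generate(sample_next, eog_ids, max_new_tokens):
    """Minimal decode loop: stop as soon as any end-of-generation id appears."""
    out = []
    for _ in range(max_new_tokens):
        tok = sample_next(out)
        if tok in eog_ids:
            break
        out.append(tok)
    return out


def scripted(tokens):
    """Hypothetical stand-in for model + sampler: replays a fixed token stream."""
    it = iter(tokens)
    return lambda _history: next(it)


# The model ends its turn with <|eot_id|> (128009) after two tokens.
stream = [10, 11, 128009, 12, 13]
good = generate(scripted(stream), LLAMA3_EOG_IDS, max_new_tokens=5)
bad = generate(scripted(stream), LLAMA2_STYLE_STOP, max_new_tokens=5)
```

`good` stops after the two real tokens; `bad` runs straight through `<|eot_id|>` and keeps emitting whatever comes next until the token budget runs out.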

扩展EXTENDED Stop sequence matching · 用户指定 "停止字符串" 怎么实现 Stop sequence matching · how user-specified "stop strings" are handled

OpenAI API 允许传 stop=["\n\n", "User:", "###"]——模型一旦输出匹配这些字符串就停。看起来简单但实现有三个坑:

字符串 vs token 不对齐: stop 是字符串,模型输出是 token。"User:" 这个字符串可能跨越多个 token——"Use"+"r:"。所以推理引擎要每生成一个 token,都重新拼接最近 N 个 token 的字符串,做后缀匹配。
已经超出怎么撤回: 假设 stop 是 "User:",模型刚生成了 "User",可能下一个 token 是 ":" → 触发 stop,可能是 " feedback" → 不触发。但 "User" 已经流式发给客户端了。如果触发了 stop,要么不流式("攒一段再发"——延迟变高),要么截断流式输出(发了 "User" 就发不回去)。OpenAI 选了"不流式直到 stop 决出胜负"——所以你会观察到 streaming API 偶尔"卡一下"再继续,就是引擎在判断 stop。
性能: 每个 token 都做后缀字符串匹配是 O(stop_len × n_stops),如果用户传了 20 个 stop 加起来 200 字符,每 token 多花几 μs。批服务里这是不可忽视开销。vLLM 用 Aho-Corasick 自动机一次扫所有 stop,优化到 O(1) amortized。

这件事在 llama.cpp 里是 common_sampler::stop_strings 字段,匹配在主循环里做。很多 wrapper 库写错过这个——一个常见 bug 是 stop 触发了但已经流出去的 token 没有 rollback,客户端 UI 显示了不该显示的内容。

OpenAI's API accepts stop=["\n\n", "User:", "###"] — model output stops the moment it matches these strings. Looks simple; three traps:

Strings don't align to tokens: stop is strings, output is tokens. "User:" may span multiple tokens, e.g. "Use"+"r:". So after every token the engine must re-concatenate the text of the last N tokens and check it for a suffix match.
What if part of it has already been sent?: say stop is "User:" and the model just emitted "User"; the next token might be ":" → triggers; might be " feedback" → doesn't. But "User" has already streamed to the client. So you either hold text back while a match is still possible ("buffer until resolved", latency up) or truncate after the fact (you sent "User" and can't unsend it). OpenAI picked "no streaming until stop is resolved", which is why the streaming API occasionally "hitches" before continuing: the engine is resolving a possible stop.
Performance: per-token suffix matching is O(stop_len × n_stops); 20 stops totaling 200 chars adds a few μs per token. In batched serving this is not negligible. vLLM uses an Aho-Corasick automaton to scan all stops in a single pass, at amortized O(1) per character.

In llama.cpp this lives in common_sampler::stop_strings; matching happens in the main loop. Many wrapper libraries get this wrong — a common bug: stop triggers but already-streamed tokens aren't rolled back, and the client UI shows content that shouldn't have appeared.
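The "buffer until resolved" option can be sketched as a small streaming filter. This is an illustration of the idea only, not the llama.cpp or vLLM implementation; the class name `StopStream` and its methods are hypothetical, and it uses plain string scans rather than an automaton.

```python
class StopStream:
    """Hold back any tail that could still grow into a stop string."""

    def __init__(self, stops):
        self.stops = list(stops)
        self.pending = ""   # generated text not yet sent to the client
        self.done = False

    def feed(self, piece):
        """Add one detokenized piece; return the text that is now safe to stream."""
        if self.done:
            return ""
        self.pending += piece
        # 1. A full stop string appeared: emit what precedes it, then finish.
        hits = [h for h in (self.pending.find(s) for s in self.stops) if h != -1]
        if hits:
            safe = self.pending[:min(hits)]
            self.pending, self.done = "", True
            return safe
        # 2. Otherwise hold back the longest tail that is a proper prefix of a stop.
        hold = 0
        for s in self.stops:
            for n in range(min(len(s) - 1, len(self.pending)), 0, -1):
                if self.pending.endswith(s[:n]):
                    hold = max(hold, n)
                    break
        cut = len(self.pending) - hold
        safe, self.pending = self.pending[:cut], self.pending[cut:]
        return safe

    def flush(self):
        """Generation ended without a stop match: release whatever was held."""
        safe, self.pending = self.pending, ""
        return safe
```

Feeding `"Hi "`, `"User"`, `" feedback"` releases `"Hi "`, then nothing (the `"User"` tail is held), then `"User feedback"` once the match is ruled out. That pause is the "hitch" described above. Feeding `"Hi "`, `"Use"`, `"r:"` releases only `"Hi "`, so nothing ever has to be unsent.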

扩展EXTENDED Chat template inversion · 解析模型输出的关卡 Chat template inversion · parsing the model's output

第 23 章讲了"把对话格式化成模型懂的字符串" 这是 forward 方向。但反方向——模型输出回来怎么解析回结构化字段——同样重要。Function calling 尤甚:模型输出可能是:

model output · raw stream · multiple meta-tokens

<|start_header_id|>assistant<|end_header_id|> I'll check the weather for you. <|python_tag|>{"name":"get_weather", "arguments":{"city":"Beijing"}}<|eom_id|>

这段输出里有4 件事需要分别处理:

第一段 <|start_header_id|>...<|end_header_id|> 是角色头,要剥掉。
正文 "I'll check the weather for you." 是普通输出,流回客户端。
<|python_tag|> 后面是工具调用,要解析 JSON,触发外部函数。
<|eom_id|>(end of message)而不是 <|eot_id|>(end of turn)——意思是"我说完了这一段,但下面还要继续"(等工具结果)。

所以 chat completion API 内部其实是一个状态机,在 streaming token 时根据特殊 token 切换状态(普通流式 / 工具调用解析 / role 切换等)。OpenAI 的 chat completion 协议把这一切包装成"看似简单" 的 JSON,但底下是这一整套 chat template + 反解析 + 状态机。

llama.cpp 在 tools/server/utils.hpp 的 oaicompat_chat_params_parse 里做这件事,vLLM 有自己的一套openai/serving_chat.py。这部分是推理栈里业务最复杂的地方——比 attention kernel 还容易出 bug,因为它要兼容十几种 chat template 的不同特殊 token。

Ch.23 covered "format conversation into the model's preferred string" — the forward direction. The reverse direction — parsing model output back into structured fields — matters equally. Function calling especially: the model might emit:

model output · raw stream · multiple meta-tokens

<|start_header_id|>assistant<|end_header_id|> I'll check the weather for you. <|python_tag|>{"name":"get_weather", "arguments":{"city":"Beijing"}}<|eom_id|>

This output has four things to handle separately:

The leading <|start_header_id|>...<|end_header_id|> is the role header — strip it.
The body "I'll check the weather for you." is normal output — stream to client.
After <|python_tag|> is a tool call — parse JSON, trigger external function.
<|eom_id|> (end of message) instead of <|eot_id|> (end of turn) — meaning "I'm done with this segment, but more is coming" (waiting for tool result).

So chat completion APIs are internally state machines, switching state on special tokens while streaming (normal stream / tool-call parse / role switch). OpenAI's chat completion protocol wraps all this into simple-looking JSON, but underneath is the full chat template + reverse-parsing + state-machine machinery.

llama.cpp handles this in tools/server/utils.hpp's oaicompat_chat_params_parse; vLLM has its own openai/serving_chat.py. This is the most business-logic-heavy part of the inference stack — buggier than the attention kernel, because it must handle a dozen different chat-template variations.
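A stripped-down version of that state machine, applied to the example stream above. It is a sketch under simplifying assumptions: special tokens arrive as their own pieces, there is one tool call at most, and the output field names (`content`, `tool_call`, `finish_reason`) are illustrative rather than any server's actual schema. `parse_stream` is a hypothetical name.

```python
import json


def parse_stream(pieces):
    """Turn a Llama-3-style token stream into structured fields."""
    state = "text"            # one of: "header", "text", "tool"
    role, text, tool_json, finish = "", "", "", None
    for piece in pieces:
        if piece == "<|start_header_id|>":
            state = "header"              # role name follows
        elif piece == "<|end_header_id|>":
            state = "text"                # normal content follows
        elif piece == "<|python_tag|>":
            state = "tool"                # tool-call JSON follows
        elif piece == "<|eom_id|>":
            finish = "tool_calls"         # message ended, the turn has not
            break
        elif piece == "<|eot_id|>":
            finish = "stop"               # the turn ended
            break
        elif state == "header":
            role += piece
        elif state == "tool":
            tool_json += piece
        else:
            text += piece
    return {
        "role": role,
        "content": text.strip(),
        "tool_call": json.loads(tool_json) if tool_json else None,
        "finish_reason": finish,
    }


demo = parse_stream([
    "<|start_header_id|>", "assistant", "<|end_header_id|>", "\n\n",
    "I'll check", " the weather for you.",
    "<|python_tag|>", '{"name":"get_weather",', ' "arguments":{"city":"Beijing"}}',
    "<|eom_id|>",
])
```

A production parser also has to stream `content` out incrementally and cope with malformed JSON; this sketch only shows where the state transitions sit.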

CHAPTER 16 · BIG PICTURE

完整 forward — 把前 15 章串起来的 stack trace

Full forward — tying 15 chapters into one stack trace

从 main.cpp 到 ggml kernel · 一次推理的全谱

main.cpp → ggml kernel · the whole spectrum

把前面所有章节的函数名串起来,一次推理的完整调用栈大概是这样:

String together every function we've named, and one inference looks like this:

stack trace · one inference of "你好,llama" · 15 stations, one trace

═══════════ USER LAYER ═══════════
main()                                                      // examples/main/main.cpp
│
├─ llama_tokenize(prompt) → [128000, 47045, ...]            // C03 · llama-vocab.cpp
│
├─ while (!eos):                                            // C15
│  │
│  ├─ llama_decode(ctx, batch)                              // C02 · llama-context.cpp
│  │  │
│  │  ├─ sbatch.split_simple(n_ubatch)                      // C05 · llama-batch.cpp
│  │  │
│  │  ├─ kv_cache.find_slot(ubatch)                         // C08 · llama-kv-cache-unified.cpp
│  │  │
│  │  ├─ gf = graph_build(ubatch)                           // llama-graph.cpp
│  │  │  │
│  │  │  ├─ inpL = build_inp_embd()                         // C04 · ggml_get_rows
│  │  │  │
│  │  │  ├─ for il in 0..32:                                // 32 layers
│  │  │  │  ├─ cur = build_norm(inpL, attn_norm)            // C06 · RMSNorm
│  │  │  │  ├─ Q = ggml_rope_ext(cur·W_q, pos)              // C06,C07 · QKV + RoPE
│  │  │  │  ├─ K = ggml_rope_ext(cur·W_k, pos)
│  │  │  │  ├─ V = cur·W_v                                  // V is not rotated
│  │  │  │  ├─ ggml_cpy(K → kv_cache.k)                     // C08 · write KV
│  │  │  │  ├─ ggml_cpy(V → kv_cache.v)
│  │  │  │  ├─ attn = ggml_flash_attn_ext(Q, K, V)          // C09 · GQA-aware
│  │  │  │  │                                               // (DeepSeek: + MLA · C11)
│  │  │  │  ├─ ffn_input = attn·W_o + inpL                  // residual
│  │  │  │  ├─ cur = build_norm(ffn_input, ffn_norm)
│  │  │  │  ├─ cur = build_moe_ffn(cur)                     // C12 · or build_ffn for dense
│  │  │  │  └─ inpL = cur + ffn_input                       // residual
│  │  │  │
│  │  │  └─ logits = build_lora_mm(output, inpL)            // C13 · LM head
│  │  │
│  │  └─ graph_compute(gf)                                  // ggml backend runs kernels
│  │
│  ├─ id = llama_sampler_sample(smpl, ctx, -1)              // C14 · llama-sampling.cpp
│  │  ├─ apply_penalties(logits)
│  │  ├─ apply_top_k(logits)
│  │  ├─ apply_top_p(logits)
│  │  ├─ apply_temperature(logits)
│  │  └─ sample_dist(probs) → id
│  │
│  ├─ if llama_vocab_is_eog(id): break                      // C15
│  │
│  └─ batch.add(id, n_past++)                               // next decode round
│
└─ llama_detokenize(tokens) → "你好,llama 你也好呀..."

prefill 4 token + decode 100 token · 一个真实时间表

Prefill 4 tokens + decode 100 tokens · a real timeline

Llama-3-8B · 单卡 H100 · fp16 · 大致时序 / Llama-3-8B · single H100 · fp16 · approximate timeline

tokenize prompt	~0.5 ms	CPU · BPE merge
setup batch + KV slot	~0.1 ms	find_slot scan
prefill forward (4 tok × 32 layer)	~5 ms	weight-read-bound at 4 tokens · compute-bound for long prompts
LM head (final token only)	~0.5 ms	1 GB weight read
sample first token	~0.1 ms	top-k partial sort
TTFT (time to first token)	~6 ms	用户感知"开始" / user feels "started"
decode 1 token (avg)	~15 ms	memory-bound
decode 100 tokens	~1500 ms	用户感知"说完" / user feels "done"
detokenize	~0.5 ms	id → string
total wall	~1.51 s

从这张表能看出两件事:

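prefill 极快:6 ms 出第一个 token。只有 4 个 token 时,这次 forward 的计算量约 2 × 8B × 4 ≈ 64 GFLOPs,所以这 ~5 ms 几乎全花在把 16 GB 权重读一遍上(16 GB ÷ 3.35 TB/s ≈ 4.8 ms)。关键在于:几百个 token 的 prompt 也只付这一次权重读取,超过这个规模后 prefill 才变成 compute-bound,逼近 ~990 TFLOPS 的 fp16 峰值。
decode 是大头:100 个 token 用了 1500 ms,人均 15 ms。每个 token 要从显存里读 16 GB 模型权重 + ~1 MB KV(随长度增长),再写回新 token 的 K/V,大部分时间在等 HBM 带宽。batch=1 时约 16 GFLOPs / 15 ms ≈ 1 TFLOPS,MFU(model FLOPs utilization)只有 ~0.1% 量级。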

这就是为什么同样的 GPU,跑长 prompt 是赚的,跑长 output 是亏的。前者按算力定价,后者按时间定价。整个推理产业的服务化(API 定价、batch 调度、连续 batching、speculative decoding 等等)都在围绕这一点做文章。

Two takeaways from this table:

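Prefill is fast: 6 ms to first token. With only 4 tokens the forward pass costs about 2 × 8B × 4 ≈ 64 GFLOPs, so those ~5 ms are almost entirely one pass over the 16 GB of weights (16 GB ÷ 3.35 TB/s ≈ 4.8 ms). The point is that a prompt of a few hundred tokens pays for that same weight read only once; past that size prefill becomes compute-bound and approaches the ~990 TFLOPS fp16 peak.
Decode is the long tail: 100 tokens in 1500 ms, 15 ms each. Every step reads 16 GB of weights + ~1 MB of KV (growing) and writes back the new token's K/V, so it is mostly waiting on HBM. At batch size 1 that is about 16 GFLOPs per 15 ms ≈ 1 TFLOPS, so MFU (model FLOPs utilization) is on the order of 0.1%.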

This is why, on the same GPU, long prompts are profitable and long outputs are loss-making: the former is priced by compute, the latter by time. Every productization choice in inference serving (API pricing, batching, continuous batching, speculative decoding) revolves around this asymmetry.
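A back-of-the-envelope roofline check makes the prefill/decode asymmetry concrete. The sketch assumes roughly 2 FLOPs per parameter per token, a rounded 8B parameter count, and an H100 SXM spec-sheet bandwidth of 3.35 TB/s (an assumption not taken from the table); it ignores attention FLOPs and KV-cache traffic, so treat the outputs as orders of magnitude.

```python
PARAMS = 8.0e9          # Llama-3-8B parameter count, rounded
BYTES_PER_PARAM = 2     # fp16
HBM_BW = 3.35e12        # bytes/s, H100 SXM spec-sheet figure (assumption)
PEAK_FLOPS = 990e12     # dense fp16 peak quoted in the text


def forward_flops(n_tokens):
    """Approximate matmul FLOPs for one forward pass over n_tokens."""
    return 2 * PARAMS * n_tokens


def weight_read_seconds():
    """Time to stream every weight from HBM once (the floor for any forward)."""
    return PARAMS * BYTES_PER_PARAM / HBM_BW


def compute_seconds(n_tokens):
    """Time the matmuls would take at peak throughput."""
    return forward_flops(n_tokens) / PEAK_FLOPS


def mfu(n_tokens, measured_seconds):
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    return forward_flops(n_tokens) / measured_seconds / PEAK_FLOPS


def break_even_tokens():
    """Tokens per forward at which compute time equals weight-read time."""
    return weight_read_seconds() / compute_seconds(1)


print(f"weight read floor : {weight_read_seconds() * 1e3:.1f} ms")
print(f"decode MFU @15 ms : {mfu(1, 0.015) * 100:.2f} %")
print(f"break-even batch  : {break_even_tokens():.0f} tokens per forward")
```

Below the break-even token count a forward pass is limited by reading weights; above it, by arithmetic. Batching many users' decode steps into one forward is how serving engines move decode toward the compute-bound side.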

CHAPTER 17 · BIG PICTURE · KEY

llama.cpp vs vLLM — 两种实现哲学

llama.cpp vs vLLM — two philosophies

同一件事 · 两种写法

same job · different shapes

前 16 章都用 llama.cpp 当主线,是因为它小——一个 C++ 项目,几万行代码,源码可读、可改、可 fork。但生产环境的另一极是 vLLM——Python + CUDA,4 万颗 GitHub 星,UC Berkeley 出的,backed by Anyscale,主打"高吞吐量服务化"。两者代表的是两种推理引擎的哲学。

The first 16 chapters used llama.cpp as the through-line because it's small — a single C++ project, tens of thousands of lines, source-readable, fork-friendly. The other pole in production is vLLM — Python + CUDA, 40k GitHub stars, from UC Berkeley, backed by Anyscale, optimized for "high-throughput serving". They represent two philosophies of inference engines.

	llama.cpp	vLLM
语言 / language	C++ + its own ggml	Python + PyTorch + CUDA
代码规模 / codebase	~80k lines · single repo	~200k lines · many dependencies
Backend	CPU / CUDA / Metal / Vulkan / ROCm	CUDA first-class · ROCm beta
KV cache 管理 / KV cache	ring buffer / unified · simple	PagedAttention · paged allocation
batch 调度 / batching	static batch	continuous batching · token-level preemption
最佳场景 / sweet spot	single machine · single user · embedded	multi-user · high concurrency · API serving
部署形态 / deployment	static binary · ~10 MB	Python process · ~5 GB (incl. CUDA)
量化支持 / quantization	GGUF · Q2-Q8 · very mature	AWQ / GPTQ · newer
实际定位 / positioning	"local / edge / experiments"	"production / cluster / SaaS"

FIG. 11 两个项目的取舍轴几乎相反。llama.cpp 用"少抽象"换"可移植性 + 可读性";vLLM 用"多抽象"换"吞吐量 + 调度灵活性"。两条路都对,只是为不同问题。 The trade-off axes are nearly opposite. llama.cpp trades fewer abstractions for portability + readability. vLLM trades more abstractions for throughput + scheduling flexibility. Both are right, for different problems.

vLLM 的杀手锏:PagedAttention

vLLM's killer move: PagedAttention

llama.cpp 的 KV cache 是"连续分配 + ring buffer"——简单,但有碎片化问题。多用户场景下,user A 的对话长 200 token,user B 长 8000 token,如果 cache 按用户线性切分,user A 那一段后面 7800 token 的位置就是空的,但其他用户用不上(因为不知道这块何时被回收)。vLLM 的 PagedAttention(论文 "Efficient Memory Management for Large Language Model Serving with PagedAttention" )用操作系统的虚拟内存思路解决:

KV cache 切成固定大小的页(典型 16 token / 页)。
每个序列有一张"页表",指向若干物理页——这些页可以散落在 cache 的任何位置。
Attention 计算时,通过页表收集这个序列的所有页(block_tables)再算——多了一层间接,但 GPU 上一次 memory gather 的开销很小。
新用户上来直接分配新页,不需要"连续大块"。碎片几乎消失——同样硬件能服务 2–4 倍的并发用户。

llama.cpp's KV cache is "contiguous + ring buffer" — simple but suffers fragmentation. Multi-user case: user A's chat is 200 tokens, user B's is 8000. Linearly slicing the cache leaves a 7800-token gap behind A — useless to others, because nobody knows when it'll be reclaimed. vLLM's PagedAttention (paper: Efficient Memory Management for Large Language Model Serving with PagedAttention) borrows OS-level virtual memory:

KV cache is sliced into fixed-size pages (typical: 16 tokens/page).
Each sequence has a "page table" pointing to a list of physical pages — pages can sit anywhere.
Attention computation gathers the sequence's pages via the table (block_tables) — one extra indirection, but GPU memory gather is cheap.
New users get freshly allocated pages, with no "contiguous large block" required. Fragmentation almost vanishes, and the same hardware serves 2-4× more concurrent users.

vllm/core/block_manager_v1.py · BlockSpaceManager.allocate() · page table allocation

# When a sequence arrives, estimate how many pages it needs from its current token count
def allocate(self, seq_group):
    seq = seq_group.get_seqs()[0]
    num_required_blocks = len(seq.logical_token_blocks)

    # Allocate from the free-page pool; any physical location will do
    block_table: BlockTable = []
    for _ in range(num_required_blocks):
        block = self.gpu_allocator.allocate()  # grab one free physical page
        block_table.append(block)

    # Key point: the sequence's physical KV location is defined by block_table; it need not be contiguous
    self.block_tables[seq.seq_id] = block_table

# At attention time, block_table is passed into the CUDA kernel
# vllm/attention/ops/paged_attn.py
def forward_decode(self, q, k_cache, v_cache, block_tables, ...):
    # Custom kernel: indirect addressing through block_tables gathers all of this sequence's K/V
    paged_attention_v2(q, k_cache, v_cache, block_tables, ...)
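To see the page-table indirection without any GPU code, here is a toy allocator in plain Python. It is not vLLM's block manager: the class `PagedKV`, its method names, and the tiny block size used in the demo are all made up for illustration.

```python
class PagedKV:
    """Toy page table: maps (sequence, token position) to (physical block, offset)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.block_tables = {}                # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a KV slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())     # any free block will do
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free_sequence(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


kv = PagedKV(num_blocks=4, block_size=2)
slots = [kv.append_token("a"), kv.append_token("b"),
         kv.append_token("a"), kv.append_token("a")]
```

Sequence `a` ends up in two physical blocks that are not adjacent, because `b` grabbed the block in between. A sequence only ever wastes the unused tail of its last block, which is where the fragmentation savings come from.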

何时选谁 · 一句话决策

When to pick which · one-line decisions

"我想在 Mac 上跑一个 7B 模型聊天" → llama.cpp(Metal backend + GGUF 量化)。
"我要部署一个 API · 同时服务 200 个用户" → vLLM(PagedAttention + continuous batching)。
"我在调研一个新模型架构 · 需要改 attention 实现" → llama.cpp(C++ 改起来比 PyTorch + CUDA 容易得多)。
"我要 Function calling + 复杂调度" → vLLM(Python 生态 + OpenAI 兼容 API)。
"我要把模型嵌入 iOS app" → llama.cpp(静态二进制,Metal 跑得动)。

两个项目在互相借鉴——llama.cpp 后来也支持类 paged 的 KV 调度,vLLM 也支持 GGUF。但根本气质不一样:llama.cpp 是"能装进任何缝隙的瑞士军刀",vLLM 是"同时服务 1000 人的工业铣床"。

"Run a 7B chat on my Mac" → llama.cpp (Metal backend + GGUF quantization).
"Serve an API to 200 concurrent users" → vLLM (PagedAttention + continuous batching).
"Research a new attention shape; need to mod the kernel" → llama.cpp (C++ is easier to hack than PyTorch + custom CUDA).
"Function calling + complex scheduling" → vLLM (Python ecosystem + OpenAI-compatible API).
"Embed a model in an iOS app" → llama.cpp (static binary, Metal works).

The two projects borrow from each other — llama.cpp now supports a paged-ish KV scheduler, vLLM supports GGUF. But the temperaments differ: llama.cpp is "a Swiss army knife that fits any crack"; vLLM is "an industrial milling machine that serves 1000 simultaneously".

扩展EXTENDED CUDA Graph capture · 把一次 forward 录成宏 CUDA Graph capture · recording one forward as a macro

vLLM 比 llama.cpp 快的另一个不那么显眼的原因是 CUDA Graph。原理:一次 decode forward 涉及 ~几百个 CUDA kernel launch,每次 launch 在 H100 上有 ~3-5 μs 固定开销——decode 100 个 kernel × 5 μs = ~500 μs 全是 launch 时间。这部分时间跟模型计算无关,纯粹是 CPU → GPU 的指令传输开销。

CUDA Graph 的思路:把这 100 个 kernel 的静态序列 一次性"录"成一个 graph 对象,后续 forward 时直接整体重放这个 graph——只需要一次 graph launch,所有 kernel 顺序执行,launch 开销摊到 0。一次性能给 decode latency 降 10-20%。

但 CUDA Graph 有个严苛要求:graph 内部的 tensor 形状必须完全固定,kernel 序列必须完全确定。这跟 vLLM 的 dynamic batch 起冲突——batch 大小总在变。解决:vLLM 为常见 batch 大小(1, 2, 4, 8, 16, ..., 256)各自捕获一个 graph,运行时根据实际 batch 大小选最近的 graph(padding 一下)。代价是预热时间长(启动要捕获几十个 graph),内存多 ~1-2 GB。

llama.cpp 长期没用 CUDA Graph——因为它的 graph builder(ggml)就是动态构造的,跟 CUDA Graph 的"静态" 假设不兼容。直到 2024 年才加了实验性的 graph capture 支持,但仍不如 vLLM 成熟。这是"vLLM decode 100 tokens/s, llama.cpp 60 tokens/s 同模型同硬件" 这种数字差的来源之一。

Another less-obvious reason vLLM beats llama.cpp on speed: CUDA Graph. The setup: one decode forward issues hundreds of CUDA kernel launches, and each launch carries ~3-5 μs of fixed overhead on H100. Taking 100 kernels as a round number, that is 100 × 5 μs ≈ 500 μs of pure launch time per step. None of it is model compute; it is purely CPU → GPU command submission.

CUDA Graph's idea: "record" that static sequence of kernels once into a graph object; subsequent forwards replay the whole graph with a single launch, all kernels execute in order, and the per-kernel launch overhead is amortized to nearly zero. This gives a 10-20% decode latency reduction.

But CUDA Graph has strict requirements: tensor shapes inside the graph must be fully fixed, and the kernel sequence fully determined. This conflicts with vLLM's dynamic batching, where batch sizes change constantly. The fix: vLLM captures one graph per common batch size (1, 2, 4, 8, 16, ..., 256) and at runtime pads the actual batch up to the nearest captured size. Cost: long warmup (dozens of graphs captured at startup) and 1-2 GB of extra memory.

llama.cpp didn't use CUDA Graph for a long time — its graph builder (ggml) is dynamically constructed, incompatible with CUDA Graph's "static" assumption. Experimental support arrived in 2024 but is less mature than vLLM's. This is one source of "vLLM 100 tokens/s, llama.cpp 60 tokens/s on the same model and hardware" numbers.
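The "pick a captured graph and pad" step is a sorted-list lookup. The sketch below is illustrative only: the size list is a hypothetical powers-of-two set, not any engine's real capture list, and `pick_graph` is a made-up name.

```python
import bisect

# Hypothetical capture list for illustration; a real engine's list may differ.
HYPOTHETICAL_CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def pick_graph(batch_size, sizes=HYPOTHETICAL_CAPTURE_SIZES):
    """Return (captured_size, padding_rows).

    Returns (None, 0) when the batch is larger than every captured graph,
    meaning the engine would fall back to ordinary kernel-by-kernel launches.
    """
    i = bisect.bisect_left(sizes, batch_size)  # first captured size >= batch_size
    if i == len(sizes):
        return None, 0
    return sizes[i], sizes[i] - batch_size


for bs in (1, 5, 16, 300):
    print(bs, pick_graph(bs))
```

The padding rows are wasted work, which is the price of fixed shapes: a batch of 5 runs the 8-wide graph with 3 dummy rows. Denser capture lists waste less compute but cost more warmup time and memory.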

扩展EXTENDED Triton kernels vs 手写 CUDA — 推理引擎的"三层 kernel 栈" Triton vs hand-written CUDA · the "three-layer kernel stack"

vLLM 的 kernel 不全是手写 CUDA。它是个三层栈:

cuBLAS / cuDNN(NVIDIA 闭源库): 一般的 GEMM / GEMV 用 cuBLAS,标准卷积 / RNN 等用 cuDNN。这层接 NVIDIA 工程师对硬件最细致的优化,推理引擎调用就行。
Triton(OpenAI 开源 JIT 编译器): vLLM 自己写的 attention / norm / quantize / rope 等"非标准" kernel 都用 Triton 写。Triton 是个Python-like DSL,自动管理 SRAM tile、warp 分配、async copy——比手写 CUDA 快 5-10× 开发,性能能到 cuBLAS 的 80-95%。
手写 CUDA: 最关键的几个 kernel(FlashAttention 系列、PagedAttention)是手写 CUDA / CUTLASS,因为要榨干硬件每一个特性(WGMMA / TMA / async pipeline)。这部分 Tri Dao 团队等专门优化。

llama.cpp 走完全不同的路:它自己实现了 ggml,每个 backend 的 kernel 都是自己手写的 C++/CUDA(CUDA backend 的大矩阵乘仍可以调用 cuBLAS)。这给了它跨平台能力(Metal/Vulkan/SYCL 各自实现一套相同 op 的 backend),代价是每个新硬件特性都要在各 backend 里重写。Triton 也帮不上忙:它面向 GPU(以 NVIDIA 为主,也支持 AMD),不覆盖 Metal / Vulkan / CPU。

这是"llama.cpp 跨平台 vs vLLM CUDA-only" 这个根本差异的技术根源:Python 引擎可以用 Triton 把 NVIDIA 当一等公民,C++ 引擎要平衡多平台只能自己写 kernel。各自有道理,但路径锁死了发展空间。

vLLM's kernels aren't all hand-written CUDA. It's a three-layer stack:

cuBLAS / cuDNN (NVIDIA closed-source): general GEMM / GEMV via cuBLAS, standard conv / RNN via cuDNN. NVIDIA engineers' most hardware-tuned code; the inference engine just calls them.
Triton (OpenAI's open-source JIT compiler): vLLM writes attention / norm / quantize / rope and other "non-standard" kernels in Triton. Triton is a Python-like DSL; it auto-manages SRAM tiles, warp allocation, async copies — 5-10× faster to write than CUDA, achieving 80-95% of cuBLAS performance.
Hand-written CUDA: the most critical kernels (FlashAttention family, PagedAttention) are hand-written CUDA / CUTLASS, to squeeze every hardware feature (WGMMA / TMA / async pipeline). Specialists like Tri Dao's team work here.

llama.cpp takes a completely different path: it implements ggml itself and hand-writes its own C++/CUDA kernels for each backend (the CUDA backend can still call cuBLAS for large matrix multiplies). This gives cross-platform capability (Metal/Vulkan/SYCL each implement the same op set in their own backend), at the cost of re-implementing every new hardware feature per backend. Triton is no help here: it targets GPUs (NVIDIA first, with AMD support) and does not cover Metal, Vulkan, or CPU.

This is the technical root of the "llama.cpp cross-platform vs vLLM CUDA-only" fundamental divide: Python engines can treat NVIDIA as first-class via Triton; C++ engines balancing multi-platform must write their own kernels. Each path makes sense but locks the trajectory.

CHAPTER 18 · PRODUCTION · KEY

量化 — Q4_K_M / fp8 / GGUF 的真实位段

Quantization — Q4_K_M / fp8 / GGUF, bits laid bare

把 16 GB 模型挤进 5 GB 显存 · 精度损 3%

squeezing a 16 GB model into 5 GB VRAM · 3% accuracy cost

到这里整篇文章一直在用 fp16(2 字节)讲。但线上几乎没人用 fp16 跑推理——大家都在跑量化版本。一个 Llama-3-8B fp16 是 16 GB,Q4_K_M 量化是 4.6 GB,PPL 只差 ~3%——同样一张 16 GB 显存,fp16 只能装下模型本身没余地,Q4_K_M 还能留 11 GB 给 KV cache。量化决定了"什么模型能跑在什么硬件上"。

但"4-bit"这个词太笼统——4-bit 量化有十种不同实现,llama.cpp 自己就有 Q4_0 / Q4_1 / Q4_K / Q4_K_S / Q4_K_M 等等。差别全在怎么处理"异常值"——少数权重数值远大于均值,简单 4-bit 截断会把它们截烂。

The whole article so far has run on fp16 (2 bytes). But nobody runs production inference at fp16 — everyone uses quantized versions. Llama-3-8B fp16 is 16 GB, Q4_K_M is 4.6 GB, PPL difference is ~3% — on the same 16 GB GPU, fp16 barely fits the model itself, Q4_K_M leaves 11 GB for KV cache. Quantization decides "what model runs on what hardware".

But "4-bit" is too vague — there are ten different 4-bit schemes. llama.cpp alone ships Q4_0 / Q4_1 / Q4_K / Q4_K_S / Q4_K_M and more. The differences are all about handling "outliers" — a few weights with values far above the mean. Naive 4-bit truncation crushes them.

Q4_K_M 的真实位段:256 weights → 144 bytes

Q4_K_M's actual layout: 256 weights → 144 bytes

llama.cpp 的"K-quants"是 2023 年 ikawrakow 设计的一套精细方案。Q4_K 不是把每个 weight 量成 4 bit——它是把 32 个 weights 打包成一个"block",8 个 block 再打包成一个"super-block"。每个 block 有自己的 scale,每个 super-block 有自己的 super-scale——多层量化,把异常值"分摊"到合适的尺度。

llama.cpp's "K-quants" were designed by ikawrakow in 2023. Q4_K doesn't simply 4-bit each weight — it packs 32 weights into a "block", 8 blocks into a "super-block". Each block has its own scale; the super-block has a super-scale. Layered quantization that amortizes outliers across appropriate scales.

ggml/src/ggml-quants.h · block_q4_K · 256 weights → 144 bytes

// One super-block · 256 weights · 144 bytes total
// = 4.5 bits/weight (0.5 bit more than pure 4-bit, spent on scale / min)
// fp16 is 16 bits/weight · ~3.6× smaller (~7.1× versus fp32)
typedef struct {
    union {
        struct {
            ggml_half d;     // super-block scale · fp16 · 2 bytes
            ggml_half dmin;  // super-block min · fp16 · 2 bytes
        } GGML_COMMON_AGGR;
        ggml_half2 dm;
    };
    uint8_t scales[12];      // 6-bit scale + 6-bit min for each of the 8 sub-blocks · 12 bytes
    uint8_t qs[128];         // 256 4-bit weights · packed · 128 bytes
} block_q4_K;                // 144 bytes total

// Formula to reconstruct one weight:
//   w = d * scale[sub] * q - dmin * min[sub]
// where q ∈ [0, 15], and scale[sub] / min[sub] come from the 6-bit fields in scales[]
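The byte budget and the two-level reconstruction can be checked with a few lines of arithmetic. This sketch deliberately skips the bit-unpacking of `scales[12]` and takes the 6-bit scale and min as already-unpacked integers; the min term is subtracted, following ggml's reference dequantization. The function and constant names are hypothetical.

```python
SUPER_BLOCK_WEIGHTS = 256
FIELD_BYTES = {"d": 2, "dmin": 2, "scales": 12, "qs": 128}  # fields of block_q4_K


def bits_per_weight():
    """Total bits of one super-block divided by the weights it stores."""
    return sum(FIELD_BYTES.values()) * 8 / SUPER_BLOCK_WEIGHTS


def dequant_sub_block(d, dmin, scale6, min6, q4):
    """Reconstruct one 32-weight sub-block.

    d, dmin      : super-block scale / min, already converted from fp16 to float
    scale6, min6 : this sub-block's 6-bit integers (0..63), already unpacked
    q4           : the sub-block's 4-bit codes, each in 0..15
    """
    return [d * scale6 * q - dmin * min6 for q in q4]


# 32 codes alternating between the smallest and largest 4-bit value.
weights = dequant_sub_block(d=0.5, dmin=0.25, scale6=2, min6=4, q4=[0, 15] * 16)
print(bits_per_weight(), weights[:2])
```

Each sub-block gets its own effective scale `d * scale6` and offset `dmin * min6`, so a sub-block containing an outlier can stretch its range without forcing the other seven to lose resolution.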

Q4_K_M 里那个"M" — mixed precision

The "M" in Q4_K_M — mixed precision

Q4_K_M 不是"所有权重都 Q4_K"。它是 llama.cpp 量化方案家族里的"mixed"档:

attention.wv 和 feed_forward.w2(对精度更敏感的层): 一半用 Q6_K(~6.5 bits/weight)
其他层: 全部 Q4_K(~4.5 bits/weight)
output(LM head): 用 Q6_K——LM head 量化错了直接破坏生成,这层不能省

所以 Q4_K_M 实际平均 ~4.85 bits/weight。Q4_K_S(S = small)就全 Q4_K,~4.6 bits/weight,体积更小但 PPL 多 ~1%。这种哪些层用什么量化的策略在 llama_model_quantize_internal 里写死,基于对每层 weight 分布的经验观察。

Q4_K_M doesn't mean "all weights Q4_K". It's the "mixed" tier in llama.cpp's quant family:

attention.wv and feed_forward.w2 (more precision-sensitive layers): half use Q6_K (~6.5 bits/weight)
Other layers: all Q4_K (~4.5 bits/weight)
output (LM head): Q6_K — quantize this layer wrong and generation breaks; not negotiable

So Q4_K_M actually averages ~4.85 bits/weight. Q4_K_S (S = small) is full Q4_K, ~4.6 bits/weight — smaller but +1% PPL. The "which layer at which quant" policy is hard-coded in llama_model_quantize_internal, based on empirical per-layer weight-distribution observations.

Llama-3-8B · 同一个 wikitext-2 测试 · 实测 PPL 与体积Llama-3-8B · same wikitext-2 test · measured PPL vs size

fp16 (baseline)	16.0 GB	PPL ≈ 6.50 · reference
Q8_0	8.5 GB	PPL ≈ 6.50 · ~ identical
Q6_K	6.6 GB	PPL ≈ 6.52 · +0.3%
Q5_K_M	5.7 GB	PPL ≈ 6.55 · +0.8%
Q4_K_M (sweet spot)	4.6 GB	PPL ≈ 6.67 · +2.6%
Q3_K_M	3.8 GB	PPL ≈ 7.06 · +8.6%
Q2_K	2.9 GB	PPL ≈ 8.5 · +30% · 明显劣化

这张表是整个量化哲学的浓缩:Q4_K_M 是甜点,再省往下到 Q3 就开始明显掉精度。不是所有模型都有这个甜点:更小的模型(3B 以下)Q4 已经掉得厉害,要 Q5 以上才稳;更大的模型(70B+)Q3 都还能用,因为冗余多。大模型更"抗量化":这也是为什么 70B-Q4(~40 GB)比 70B-fp16(~140 GB)划算得多,同时比 8B-fp16 准得多。

This table is the whole quantization philosophy in one place: Q4_K_M is the sweet spot, and going down to Q3 starts to cost real accuracy (+8.6% PPL here). Not every model has this sweet spot: smaller models (under 3B) lose noticeably already at Q4 and need Q5+; larger models (70B+) survive Q3 well because they have more redundancy. Larger models are more "quantization-resilient", which is why a 70B at Q4 (~40 GB) is a far better deal than the same 70B at fp16 (~140 GB), and far more accurate than an 8B at fp16.

fp8 KV cache · 同样的 trick 套到第 8 章那张桌子上

fp8 KV cache · the same trick applied to Ch.8's table

权重量化是静态的(模型一次性量化完写盘),但 KV cache 是动态的——每一步 decode 都新写一份。要量化 KV cache,得在每次写入时实时压缩,读取时反量化。这个开销不能太大,所以一般选格式简单的 fp8(e4m3 或 e5m2)而不是 K-quants 那种多层结构。

fp8 KV cache 把 Ch.8 那 1 GB 直接砍一半到 512 MB。代价是 PPL 多 ~0.5%——比权重量化还安全,因为 attention 公式本身就在 softmax 后做归一化,抹平了一些精度噪声。vLLM 的 kv_cache_dtype="fp8_e5m2" 是生产环境标配。

Weight quantization is static (quantize once, write to disk). But KV cache is dynamic — written fresh on every decode step. To quantize the KV cache you compress on every write and dequantize on every read. That overhead can't be high, so usually a simple format like fp8 (e4m3 or e5m2) — not K-quants' multi-tier scheme.

fp8 KV cache halves Ch.8's 1 GB to 512 MB. PPL cost is ~0.5%, even safer than weight quantization, because the softmax normalization in attention itself absorbs some precision noise. vLLM's kv_cache_dtype="fp8_e5m2" is a common production setting (it is opt-in; the engine's default is "auto").
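The 1 GB → 512 MB figure follows directly from the cache's shape. A quick check, assuming Llama-3-8B's published configuration (32 layers, 8 KV heads under GQA, head dimension 128) and an 8K context; the helper name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Bytes for K and V across all layers for one sequence of n_ctx tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16_bytes = kv_cache_bytes(**LLAMA3_8B, n_ctx=8192, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(**LLAMA3_8B, n_ctx=8192, bytes_per_elem=1)
per_token_fp16 = kv_cache_bytes(**LLAMA3_8B, n_ctx=1, bytes_per_elem=2)

print(fp16_bytes / 2**30, "GiB at fp16")
print(fp8_bytes / 2**20, "MiB at fp8")
print(per_token_fp16 / 2**10, "KiB per token at fp16")
```

The leading factor of 2 is K plus V. Only `bytes_per_elem` changes when the cache goes to fp8, so the saving scales linearly with context length and with the number of concurrent sequences.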

扩展EXTENDED e4m3 vs e5m2 · 同样是 fp8 · 用法相反 e4m3 vs e5m2 · both fp8 · used for opposite things

fp8 有两种主流格式,1 字节里的位分配不同:

e4m3:1 sign + 4 exponent + 3 mantissa · 表达范围 ±448 · 精度高、范围窄 · 适合权重(权重值通常在 ±10 内,需要精度)
e5m2:1 sign + 5 exponent + 2 mantissa · 表达范围 ±57344 · 范围大、精度低 · 适合梯度 / activation / KV(动态范围大但能容忍噪声)

NVIDIA Hopper(H100)开始硬件原生支持这两种格式,fp8 GEMM 性能是 fp16 的 2×。TFLOPS 翻倍主要加速 compute-bound 的 prefill;对 memory-bound 的 decode,fp8 KV cache 的收益来自每步少读一半的 KV 字节。所以 vLLM 在 H100 上跑 fp8 KV cache 不只是省内存,还是省时间。

fp8 has two mainstream formats; the bit allocation differs:

e4m3: 1 sign + 4 exponent + 3 mantissa · range ±448 · more precision, narrower range · used for weights (typically ±10, precision matters)
e5m2: 1 sign + 5 exponent + 2 mantissa · range ±57344 · wider range, less precision · used for gradients / activations / KV (large dynamic range, tolerates noise)

NVIDIA Hopper (H100) introduced native hardware fp8 support; fp8 GEMM throughput is 2× fp16. The doubled TFLOPS mainly speeds up compute-bound prefill; for memory-bound decode, the gain from an fp8 KV cache comes from reading half as many KV bytes per step. So fp8 KV cache on H100 isn't just memory savings; it's time savings too.
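The two range limits (±448 and ±57344) fall out of the bit layouts. The sketch below derives them, assuming the common conventions: e5m2 reserves its top exponent code for inf/NaN like IEEE formats, while the e4m3 variant used for inference has no infinities and gives up only the all-ones mantissa at the top exponent for NaN. The function names are hypothetical.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of a 1-sign / exp_bits / man_bits float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf / NaN (e5m2).
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # No infinities; only the all-ones mantissa at the top exponent is NaN (e4m3).
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_code - bias) * (1 + max_mantissa / 2 ** man_bits)


def relative_step(man_bits):
    """Spacing between neighbouring values relative to their magnitude."""
    return 2.0 ** -man_bits


e4m3_max = fp8_max(4, 3, ieee_like=False)
e5m2_max = fp8_max(5, 2, ieee_like=True)
print(e4m3_max, relative_step(3))
print(e5m2_max, relative_step(2))
```

One mantissa bit moved to the exponent buys about 128× more range and costs a factor of 2 in relative precision (steps of 12.5% versus 25%), which is the whole trade between the two formats.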

主线 · 量化后的我们的模型MAIN LINE · OUR MODEL QUANTIZED 假设把 Llama-3-8B 从 fp16(16 GB)量到 Q4_K_M(4.6 GB)+ fp8 KV cache(8K 上下文从 1.07 GB → 535 MB):整个模型 + KV cache 从 17 GB 降到 5.1 GB。一张 8 GB 的 4060 现在能跑;一张 24 GB 的 4090 能服务 ~30 个并发用户而不是 6 个。这就是量化对"谁能跑 LLM"的真实影响。 Quantize Llama-3-8B from fp16 (16 GB) to Q4_K_M (4.6 GB) + fp8 KV cache (8K context: 1.07 GB → 535 MB): model + KV cache drops from 17 GB to 5.1 GB. An 8 GB 4060 can now run it; a 24 GB 4090 serves ~30 concurrent users instead of 6. This is quantization's real impact on "who can run an LLM".

扩展EXTENDED 量化方案动物园 · GGUF / AWQ / GPTQ / SmoothQuant / HQQ 的区别 Quantization zoo · GGUF / AWQ / GPTQ / SmoothQuant / HQQ

llama.cpp 的 GGUF 量化只是一种路线。生产环境里你会碰到至少 5 种主流方案,它们在 "如何选择 scale" 这个核心问题上走了不同路:

llama.cpp's GGUF quantization is just one approach. In production you'll encounter at least five mainstream schemes, each diverging on the core question "how to pick scales":

scheme	校准方式 / calibration	优势 / strength	劣势 / weakness	主要用户 / main users
GGUF (K-quants)	纯权重统计 · 离线 / weight stats · offline	简单 · 跨平台 / simple · cross-platform	~3% PPL 损 / ~3% PPL drop	llama.cpp · Ollama · LM Studio
GPTQ	校准数据集 · 逐层 Hessian / cal dataset · per-layer Hessian	精度高 · ~1% PPL 损 / accurate · ~1% PPL drop	需要校准数据 · 量化慢 / needs cal data · slow to quant	HuggingFace · AutoGPTQ
AWQ	activation-aware scale	~0.5% PPL 损 · 推理快 / ~0.5% PPL · fast inference	校准数据敏感 / cal-data-sensitive	vLLM · TGI
SmoothQuant	smooth activation outlier 到 weight / smooth act outliers to weights	支持 W8A8 / enables W8A8	实现复杂 · 模型特异 / complex · model-specific	NVIDIA TensorRT-LLM
HQQ	无校准 · 数据无关 / no calibration · data-free	极快量化(秒级) / extremely fast quant (seconds)	~2% PPL 损 / ~2% PPL drop	研究 · 实验 / research · experimental

三个关键观察:

"校准"是 GPTQ / AWQ 比 GGUF 准的根源: 它们用一段真实文本跑 forward,记录每层 activation 分布,然后反向算 "哪些 weight 量化错不重要"(因为乘上的 activation 接近 0),反过来给重要 weight 留更多位。代价是量化时间——一个 70B 模型 GPTQ 量化要 30 分钟,GGUF 几分钟。
SmoothQuant 解决"activation outlier"问题: LLM 的 activation 里偶尔有数值极大的"异常值",这些值不能量化(被 int8 截断会大错)。SmoothQuant 的招式:把 activation 的尺度搬到 weight 上 ——activation 除以 s, weight 乘以 s,数学等价,但 activation 更均衡,可以放心量化。这是 W8A8(8-bit weight + 8-bit activation)能跑的前提。
HQQ(Half-Quadratic Quantization)只用权重本身,通过 half-quadratic 求解器最小化重建误差来拟合 scale 和 zero-point,不需要校准数据。是 "不知道用户会跑什么" 场景的好选择(比如开源给用户自己 quant)。

实际选择:llama.cpp 用户用 GGUF,vLLM / TGI 部署用 AWQ,TensorRT-LLM 用 SmoothQuant。同一个模型 Llama-3-70B 在 HuggingFace Hub 上可能有 6-8 个不同量化版本——根据用户的引擎选用就行。

Three key observations:

"Calibration" is why GPTQ / AWQ beat GGUF: they forward a real text sample, record per-layer activation distributions, then reverse-compute "which weight quantization errors don't matter" (because they multiply near-zero activations); conversely give the important weights more bits. Cost: quant time — a 70B model takes 30 minutes under GPTQ, a few minutes under GGUF.
SmoothQuant solves "activation outliers": LLM activations occasionally have extremely large "outlier" values — these can't be quantized (int8 truncation causes huge error). SmoothQuant's trick: shift the activation's scale onto the weight — divide activation by s, multiply weight by s, mathematically equivalent but activations become balanced and quantizable. Prerequisite for W8A8 (8-bit weight + 8-bit activation).
HQQ (Half-Quadratic Quantization) fits scales and zero-points by minimizing a robust weight-reconstruction error with a half-quadratic solver, using the weights alone, so no calibration data is needed. Good for "we don't know what the user will run" scenarios (e.g. open-sourcing the quant tool for users to self-quantize).

Real choices: llama.cpp users use GGUF; vLLM / TGI deployments use AWQ; TensorRT-LLM uses SmoothQuant. The same Llama-3-70B on HuggingFace Hub may have 6-8 different quantized variants — pick by which engine you're using.
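The SmoothQuant trick ("divide the activation by s, multiply the weight by s") is easy to verify numerically. The sketch uses plain Python lists and the per-channel rule s_k = max|X_k|^α / max|W_k|^(1−α) with α = 0.5; the tiny matrices are made-up data with one deliberately large activation channel, and the helper names are hypothetical. It assumes no channel is all zeros.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Rescale each input channel k: X[:, k] / s[k] and W[k, :] * s[k]."""
    n_in = len(w)
    act_max = [max(abs(row[k]) for row in x) for k in range(n_in)]
    w_max = [max(abs(v) for v in w[k]) for k in range(n_in)]
    s = [act_max[k] ** alpha / w_max[k] ** (1 - alpha) for k in range(n_in)]
    x_s = [[row[k] / s[k] for k in range(n_in)] for row in x]
    w_s = [[v * s[k] for v in w[k]] for k in range(n_in)]
    return x_s, w_s, s


# Channel 1 of the activations is an outlier channel (values around 40-50).
X_DEMO = [[0.1, 50.0], [-0.2, -40.0]]
W_DEMO = [[0.5, -0.3], [0.02, 0.01]]

X_S, W_S, S = smooth(X_DEMO, W_DEMO)
Y_REF = matmul(X_DEMO, W_DEMO)
Y_SMOOTH = matmul(X_S, W_S)
```

The product is unchanged up to floating-point rounding, but the largest activation magnitude drops from 50 to 1 while the weights absorb the scale, so both tensors now fit an int8 grid far better.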

CHAPTER 19 · PRODUCTION · KEY

Speculative decoding + Prefix caching — 让 decode 真的快起来

Speculative decoding + Prefix caching — making decode actually fast

两个跟 decode is memory-bound 死磕的方法

two ways of fighting decode-is-memory-bound

第 2 章定下了整个推理的基本矛盾:decode is memory-bound。这一章讲两个真正改变这件事的方法——它们都是这两年 LLM 工程界最大的产品差异化武器,但在学术圈外讨论得不够。

Speculative decoding(投机解码):让一个小模型先猜几个 token,大模型一次性并行验证——把 n 个串行 decode压成 1 个并行 prefill,实测 2-3× 加速。
Prefix caching(前缀缓存):同一段 system prompt 被几百个用户共用?那 KV cache 也共用——hash 一下 token 序列,命中就跳过 prefill 这部分。生产环境普遍 70-90% 命中率,直接砍 TTFT。

这两件事都没改模型架构——纯粹是推理引擎层面的工程,但效果比改 attention 公式还大。理解这两个能让你看懂"为什么同样模型 vLLM 跑得比 llama.cpp 快 3 倍"。

Ch. 2 set the core tension: decode is memory-bound. This chapter covers the two techniques that actually change that — they're the biggest product differentiators in LLM inference these days, yet underdiscussed outside academia.

Speculative decoding: let a small model guess several tokens; the big model verifies them in parallel — squeezing n serial decodes into one parallel prefill. 2-3× speedup measured.
Prefix caching: same system prompt shared across hundreds of users? Share their KV cache too — hash the token sequence, skip prefill on hit. Production hit rates of 70-90% are normal; TTFT drops directly.

Neither changes the model architecture — pure engine-level engineering, but with bigger impact than changing the attention formula. Understanding these explains "why vLLM beats llama.cpp 3× on the same model".

Speculative decoding · "赌一把,猜对了就赚"

Speculative decoding · "place a bet, save time if you win"

朴素 decode 是纯串行:生成第 i 个 token 必须等第 i-1 个出来,因为 i 要喂回模型当输入。一次 decode ~15 ms,生成 100 个 token 就是 1.5 秒——这 1.5 秒里 GPU 大部分时间在等 HBM。算力闲着。

Speculative decoding 的洞察:用一个小模型(draft model)便宜地猜 k 个 token,然后让大模型对这 k 个 token 一次性跑一遍 forward 验证——大模型这一次是 prefill 而不是 decode,k 个 token 并行算,等于免费验证。

关键的数学:如果 draft 猜对了 k 个里的前 m 个,大模型这一次就等价于做了 m+1 次 decode——因为大模型 forward 的同时也算出了"给定第 i 个 token,第 i+1 个应该是什么"。一个被丢弃的 draft token 不亏(大模型这次还是产出了它的预测),只要 draft 命中率 > 1/k,就赚。

Naive decode is purely serial: generating token i requires token i-1 first, because token i-1 is fed back as the input. One decode step takes ~15 ms, so 100 tokens take 1.5 s. During that 1.5 s the GPU mostly waits on HBM while its compute units sit idle.

Speculative decoding's insight: let a small "draft model" cheaply guess k tokens; have the big model run a single forward over those k tokens to verify — that forward is a prefill, not a decode, all k tokens parallel, essentially free.

The key math: if the draft got the first m of k tokens right, the big model's forward is equivalent to m+1 decodes, because the big model also computes "given token i, what should token i+1 be" as a side product. A discarded draft token costs no correctness (the big model still emits its own prediction at that position), so every verify step yields at least one token. The scheme pays off once the accepted tokens outweigh the cost of running the draft; the rule of thumb used here is a draft hit rate above 1/k.
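The "m+1 decodes per verify" argument can be turned into a back-of-envelope estimate. The sketch below assumes each draft token is accepted independently with probability `alpha` (a simplification) and that one draft step costs `draft_cost` of a target step; both parameter names and the 0.02 value are illustrative assumptions, not measurements.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward: 1 + alpha + ... + alpha**k."""
    return sum(alpha ** i for i in range(k + 1))

def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit time relative to plain decode.

    draft_cost is the cost of one draft step as a fraction of one target step.
    """
    return expected_tokens_per_verify(alpha, k) / (1.0 + k * draft_cost)

tokens = expected_tokens_per_verify(0.6, 5)   # k=5 drafts, 60% acceptance
gain = speedup(0.6, 5, draft_cost=0.02)
```

With k=5 and a 60% acceptance rate this gives about 2.4 tokens per target forward, in line with the 2-3× range quoted for vanilla speculative decoding. With `alpha = 0` the estimate falls back to exactly one token per forward, which is the "you never lose a token" floor.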

变体 / Variant	draft 模型 / draft source	命中率 / hit rate	加速 / speedup
vanilla speculative	小模型 · 1B 给 70B 当 draft / small model · 1B drafts 70B	50-70%	2-3×
Medusa	主模型 + 几个并行 head 同时预测 / main model + extra heads predicting in parallel	60-80%	2.2×
EAGLE	小 draft head 直接看主模型的 hidden state / tiny head on top of main model's hidden state	75-85%	3×
Lookahead	自己 n-gram 缓存,无需 draft 模型 / self n-gram cache, no draft model	40-60%	1.5-2×
Speculative + prefix sharing	同一 batch 多用户共享 draft / batched users share draft	n/a	4-5× (高并发 / high concurrency)

FIG. 12 同一个家族 · 五种变体。共同点是用算力换时间——decode 时算力闲着,正好拿去算 draft / verify。EAGLE 是 2024 年后最强的变体,因为 draft 直接看主模型的 hidden state,几乎不可能猜错——已经是语义对齐的猜测。 Same family · five variants. The shared idea: trade compute for time — decode has compute to spare, use it for draft/verify. EAGLE is the strongest variant since 2024: the draft sees the main model's hidden state directly, so its guesses are semantically aligned and rarely wrong.

examples/speculative/speculative.cpp · the verify loop · draft + accept + reject

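    // Simplified speculative-decoding main loop (llama.cpp ships a fuller example).
    // id_last is the most recent accepted token; it is not in either KV cache yet.
    while (n_generated < n_max) {
        // 1. The draft model decodes n_draft tokens serially.
        std::vector<llama_token> drafted;
        llama_token id_in = id_last;
        for (int i = 0; i < n_draft; i++) {
            llama_batch_clear(batch_dft);
            llama_batch_add(batch_dft, id_in, n_past + i, {0}, true);
            llama_decode(ctx_draft, batch_dft);
            id_in = sample(ctx_draft);
            drafted.push_back(id_in);
        }

        // 2. Feed id_last plus the drafted tokens to the target model in one
        //    prefill-style batch, so logits[i] is the target's choice for drafted[i].
        llama_batch_clear(batch_tgt);
        llama_batch_add(batch_tgt, id_last, n_past, {0}, true);
        for (size_t i = 0; i < drafted.size(); i++) {
            llama_batch_add(batch_tgt, drafted[i], n_past + 1 + i, {0}, true);
        }
        llama_decode(ctx_tgt, batch_tgt);   // one forward over k+1 tokens

        // 3. Accept drafted tokens while the target agrees. The first
        //    disagreement (or the bonus position after a full match) supplies
        //    one extra token from the target itself.
        int n_accepted = 0;
        while (true) {
            id_last = sample(ctx_tgt, n_accepted);
            if (n_accepted == (int) drafted.size() || id_last != drafted[n_accepted]) break;
            n_accepted++;
        }

        // Even with zero matches we advance by one token.
        n_generated += n_accepted + 1;
        n_past      += n_accepted + 1;

        // 4. Rewind: rejected draft tokens were written to both KV caches.
        //    (After a full match the draft cache is one token short; a complete
        //    implementation must decode that token into it before drafting again.)
        llama_kv_self_seq_rm(ctx_tgt,   0, n_past, -1);
        llama_kv_self_seq_rm(ctx_draft, 0, n_past, -1);
    }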
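To see the accept/reject logic without any llama.cpp plumbing, here is a minimal greedy sketch with two toy "models" (plain functions of the last token, invented for this example). It illustrates the control flow only; real systems verify against probability distributions and run the target positions in a single batched forward.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]   # maps a token prefix to the next token (greedy)

def target(prefix: List[int]) -> int:
    """Toy 'large' model: a deterministic function of the last token."""
    return (prefix[-1] * 3 + 1) % 10

def draft(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after token 3."""
    return 9 if prefix[-1] == 3 else target(prefix)

def plain_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(prompt: List[int], n_new: int, k: int):
    out, target_calls = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # Draft k tokens serially with the cheap model.
        guess = list(out)
        for _ in range(k):
            guess.append(draft(guess))
        drafted = guess[len(out):]
        # One "parallel" target pass: its prediction at each of the k+1 positions.
        target_calls += 1
        preds = [target(out + drafted[:i]) for i in range(k + 1)]
        # Accept while the target agrees, then take one token from the target.
        n_ok = 0
        while n_ok < k and preds[n_ok] == drafted[n_ok]:
            n_ok += 1
        out += drafted[:n_ok] + [preds[n_ok]]
    return out[:len(prompt) + n_new], target_calls

tokens, calls = speculative_decode([1], n_new=12, k=4)
```

The output is identical to plain greedy decoding with the target model, but it takes 4 target passes instead of 12, even though the draft is wrong once per round.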

Prefix caching · 把 system prompt 的 KV 拿来当模板

Prefix caching · reusing system-prompt KV as a template

第 8 章那张 KV cache 桌子里,有个被忽视的事实:同一段 token 序列在同一个模型上,KV 永远一样。它是纯函数——给定输入 token 和位置 id,K 和 V 是确定的。所以如果两个用户的请求都以"You are a helpful assistant. Today is ..." 开头,这一段的 KV 完全可以共享——不用重新 prefill。

这个观察催生了 prefix caching。生产环境里,API 用户的 prompt 通常长这样:

system prompt(几千 token,所有请求都一样)
few-shot examples(几百 token,同一应用所有请求都一样)
user message(几十-几百 token,真正独特的部分)

典型一个 API 应用,system + examples 占总 prompt 的 70-90%。如果命中 prefix cache,这部分 prefill 时间归零——TTFT 从 100 ms 降到 10 ms 是常态。

Ch.8's KV cache table contains an underappreciated fact: the same token sequence over the same model always produces the same KV. It's a pure function: given input tokens and position ids, K and V are deterministic. So if two users' requests both start with "You are a helpful assistant. Today is ...", that portion's KV can be shared — no need to re-prefill.

This observation gave rise to prefix caching. In production, API user prompts typically look like:

system prompt (thousands of tokens, identical for all requests)
few-shot examples (hundreds of tokens, identical within an app)
user message (tens-hundreds of tokens, the actually unique part)

For a typical API app, system + examples account for 70-90% of total prompt tokens. A prefix-cache hit makes that prefill time zero — TTFT drops from 100 ms to 10 ms routinely.

vllm/core/block_manager.py + prefix_caching.py · hash-based block lookup

    # vLLM organizes the KV cache in blocks (16 tokens per block, see C17).
    # Prefix caching gives every block a chained hash:
    #   hash = H(parent_hash, tokens_in_block)
    # (this simplified listing uses Python's built-in hash for H).
    # Two sequences with the same prefix get identical hashes for their first N blocks.
    class PrefixCachingBlockAllocator:
        def allocate_with_prefix(self, token_ids):
            block_table = []
            prev_hash = 0
            for i in range(0, len(token_ids), self.block_size):
                block_tokens = tuple(token_ids[i:i + self.block_size])
                block_hash = hash((prev_hash, block_tokens))
                if block_hash in self.cached_blocks:
                    # Hit: reuse the existing KV and bump the refcount.
                    block = self.cached_blocks[block_hash]
                    block.refcount += 1
                    self.metrics.hits += 1
                else:
                    # Miss: allocate a new block; prefill will fill it.
                    block = self._allocate_new()
                    block.hash = block_hash
                    self.cached_blocks[block_hash] = block
                    self.metrics.misses += 1
                block_table.append(block)
                prev_hash = block_hash
            return block_table
    # Prefill then runs only over the cache-miss blocks; cache hits are skipped.
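The chained-hash idea is easy to try in isolation. The sketch below is a standalone toy, not vLLM code: `BLOCK_SIZE = 4`, the token values, and both helper names are assumptions made for the example. It hashes only full blocks and chains each block's hash to its parent, so the same tokens at a different position get a different hash.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical; the text above uses 16-token blocks

def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chained hash per full block: H(parent_hash || tokens_in_block)."""
    hashes, parent = [], ""
    n_full = len(token_ids) // block_size
    for b in range(n_full):
        chunk = token_ids[b * block_size:(b + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes

def shared_prefix_blocks(a: List[int], b: List[int]) -> int:
    """Number of leading blocks whose KV could be reused between two prompts."""
    n = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        n += 1
    return n

system = list(range(100, 110))          # 10 shared "system prompt" tokens
user_a = system + [1, 2, 3, 4, 5, 6]
user_b = system + [9, 8, 7, 6, 5, 4]
reusable = shared_prefix_blocks(user_a, user_b)
repeated = block_hashes([1, 2, 3, 4, 1, 2, 3, 4])
```

Only 2 of the 4 blocks are reusable here: the third block mixes shared and user-specific tokens, so sharing happens at block granularity, not token granularity.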

两个一起用 · 真正改变成本曲线

Stacked together · the real cost-curve mover

这两个 trick 是正交的——可以叠加。一个典型的 API 服务调用,完整时间分解:

These two tricks are orthogonal — they stack. A typical API call, decomposed:

API call · 4K system prompt + 200 user + 500 output · Llama-3-70B

朴素:全部 prefill + 串行 decode / Naive: full prefill + serial decode	prefill 800 ms + decode 12500 ms = 13.3 s	baseline
+ Prefix cache (4K sys 命中 / 4K sys hits)	prefill 40 ms + decode 12500 ms = 12.5 s	−6% wall · −95% TTFT
+ Speculative · k=5 · 60% hit	prefill 40 ms + decode 5000 ms = 5.0 s	−60% wall
+ fp8 KV cache + Q4_K_M weights	prefill 30 ms + decode 3500 ms = 3.5 s	−74% wall

叠加之后 13.3 s → 3.5 s,4× 加速,什么模型都没动。这就是为什么 2024-2025 年同样硬件、同样模型,主流推理服务的实际吞吐量比 2022 年提了 5-10 倍。训练这两年遇到瓶颈,推理这边在悄悄发力。

顺手说一句:这两个 trick 还有个共同的副作用——它们让"同一个 batch 里的所有用户"对计算量的贡献不再对称。命中 prefix cache 的用户几乎免费;接收 speculative draft 多的用户消耗更多算力。计费因此变成一个细微问题——OpenAI 在 2024 年底推出"prompt caching discount"就是为了把 prefix caching 的收益分给用户。

Stacked: 13.3 s → 3.5 s, about a 4× speedup without touching the model. This is why, on the same hardware and the same model, mainstream inference services in 2024-2025 deliver 5-10× the real throughput they did in 2022. While training ran into bottlenecks over those two years, inference quietly pulled ahead.

An aside: these two tricks share a side-effect — they make "users in the same batch" no longer contribute symmetrically to compute. A user hitting prefix cache is nearly free; a user with many accepted speculative tokens consumes more compute. Pricing becomes subtle — OpenAI's late-2024 "prompt caching discount" exists exactly to share the prefix-caching gain with users.

为什么这两章放最后 / WHY THESE TWO LAST

这两章讲的是"引擎做的而不是模型做的"事情。你看不到它们在 attention 公式里,但生产推理服务的商业可行性建立在它们之上。把模型架构(Phase II/III)和引擎工程(Phase VI)分开理解,你才能看懂为什么"同一个 Llama-3-8B,A 厂 100 tokens/s,B 厂 30 tokens/s":A 厂在引擎层面领先一个时代。

These two chapters cover what "the engine does, not the model". You don't see them in the attention formula, but the commercial viability of inference services rests on them. Once you separate model architecture (Phase II/III) from engine engineering (Phase VI), you can finally see why "same Llama-3-8B, vendor A does 100 tokens/s, vendor B does 30": A is a full engine generation ahead.

扩展 / EXTENDED · Spec decoding 三代演化 · vanilla → Medusa → EAGLE / Three generations of spec decoding · vanilla → Medusa → EAGLE

主表里那 5 种 speculative 变体值得展开。它们解决"怎么便宜地猜对下个 token" 这个核心问题的方式不同:

vanilla(Leviathan et al. 2022): 用一个独立的小模型(比如 1B)给大模型(70B)当 draft。优点是简单,缺点是要训练两个模型,小模型还得跟大模型的采样分布足够接近——很多时候这不容易,命中率只 ~50%。
Medusa(Cai et al. 2024): 不要独立 draft 模型,而是给主模型加几个并行 LM head。每个 medusa head 不是预测"下一个 token",而是"下下个 / 下下下个 token"。这样一次 forward 出 k 个候选 token——大模型自己当自己的 draft。优势:不需要独立模型,head 训练快。缺点:命中率没 EAGLE 高,因为 medusa head 只看 hidden state,看不到中间 token。
EAGLE(Li et al. 2024): 在 Medusa 的基础上,让 draft head 不只是看 hidden state,而是看一个"轻量级 transformer 层"跑出来的中间表示。轻量层小得多(几十 M 参数),但因为能"看到主模型的语义表示",命中率达 75-85%。是目前 SOTA。
Lookahead(Fu et al. 2023): 不用任何模型猜——而是从当前生成历史里 mine n-gram 频率,猜"这种上下文下接下来 k 个 token 很可能是什么"。极简,但只对重复性强的文本(code / structured output)有效。
spec + prefix sharing(2024-2025): 同一 batch 里多用户的 draft 模型共享 ——一个 batch 跑一次 draft,所有用户共享候选 token。在高并发场景下让 draft 摊销到接近 0 成本。

vLLM 2024 内置 EAGLE 支持,SGLang 也跟进。llama.cpp 主线只有 vanilla speculative,有人在做 EAGLE 移植但还没合并。在reasoning 模型 场景下(C26),EAGLE 加速尤其大——因为 reasoning 阶段语义连贯性高,draft 命中率能到 85%+。

The 5 speculative variants in the main table deserve unpacking. Each solves "cheaply guess the next token" differently:

Vanilla (Leviathan et al. 2022): use an independent small model (e.g. 1B) as draft for the big model (70B). Simple, but needs two trained models, and the small model's sampling distribution must match the big one closely — often hard; hit rate ~50%.
Medusa (Cai et al. 2024): no independent draft model — add multiple parallel LM heads to the main model. Each medusa head doesn't predict "next token" but "next-next / next-next-next token". One forward emits k candidates — the big model is its own draft. Pro: no separate model, fast head training. Con: lower hit rate than EAGLE — medusa heads see only hidden state, not intermediate tokens.
EAGLE (Li et al. 2024): builds on Medusa. The draft head does not read the hidden state alone; it reads the intermediate representation produced by a lightweight transformer layer on top of it. That layer is small (tens of M params), but because it sees the main model's semantic representation, the hit rate rises to 75-85%. Current SOTA.
Lookahead (Fu et al. 2023): no model guesses — mine n-gram frequencies from current generation history; guess "given this context, what k tokens likely come next". Ultra-simple but only works for repetition-heavy text (code / structured output).
Spec + prefix sharing (2024-2025): in one batch, multiple users share the draft model — run draft once per batch, all users use the same candidate tokens. At high concurrency this amortizes draft to near-zero cost.

vLLM added built-in EAGLE support in 2024, and SGLang followed. llama.cpp mainline only has vanilla speculative decoding; an EAGLE port is in progress but not merged. For reasoning models (C26) EAGLE helps even more: reasoning traces have high semantic continuity, so the draft hit rate reaches 85%+.

CHAPTER 20 · SCALE-OUT · KEY

Continuous batching — 让 GPU 在请求间隙不闲着

Continuous batching — keep the GPU busy between requests

vLLM 的真正杀招 · token-level 抢占

vLLM's actual killer move · token-level preemption

在线服务里,同时会有几十上百个用户在跟模型说话。最朴素的做法是static batching:把同时到来的请求拼成一个 batch 一起跑,直到 batch 里所有请求都生成完 EOS 才返回。问题是不同请求的输出长度差别巨大——有的说 20 个 token 就完,有的要说 2000 个。整个 batch 会被最慢的那个拖死,前面那些早早完成的请求等着,GPU 大部分时间在跑"无效 decode"。

2022 年 OSDI 的 Orca 论文(《Orca: A Distributed Serving System for Transformer-Based Generative Models》)第一次提出了continuous batching(也叫 in-flight batching):每生成一个 token 就重新组 batch,完成的请求立刻退出 batch,等待中的新请求立刻加入。vLLM 把它做成了开源标准实现,这是它最重要的工程贡献——比 PagedAttention 还重要。

In online serving, tens to hundreds of users talk to the model simultaneously. The naive approach is static batching: pack arriving requests into a batch, run them together until every request in the batch emits EOS. Problem: output lengths vary wildly — some requests finish in 20 tokens, others need 2000. The whole batch is held hostage by the longest. The early finishers wait around; the GPU spends most of its time on "useless decode".

The 2022 OSDI Orca paper (Orca: A Distributed Serving System for Transformer-Based Generative Models) introduced continuous batching (aka in-flight batching): regroup the batch after every generated token; finished requests drop out instantly, waiting requests jump in instantly. vLLM made it the open-source default — this is vLLM's single most important engineering contribution, more impactful than PagedAttention.

	Static batching	Continuous batching
batch 组成方式 / batch composition	一次组好,跑到所有 EOS / fixed once, runs until all EOS	每 step 重新组 / re-formed each step
输出长度差异敏感性 / output-length variance	短请求被拖死 / short requests dragged	完成即退出 / finish → leave
prefill 怎么处理 / prefill handling	所有 prefill 攒齐才跑 / collected, then batched	prefill 与 decode 混跑 / prefill mixed with decode
新请求等待时间 / new request wait	等当前 batch 结束 / until current batch ends	下一 step 立刻插入 / joins on next step
GPU 利用率 / GPU utilization	~30-50%	70-90%
吞吐量提升 / throughput gain	1×	5-20× (论文实测 / measured in the paper)

FIG. 13 Orca 论文里那张经典图:同样硬件、同样请求负载,continuous batching 的吞吐量是 static batching 的 5-20 倍。这不是模型变快了,是 GPU 终于不再"带薪喝咖啡"。 The classic figure from Orca: same hardware, same request load, continuous batching delivers 5-20× the throughput of static. The model isn't faster — the GPU finally stopped being "on paid coffee break".
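The gap between the two columns can be reproduced with a tiny step-counting simulation. This is a sketch under strong simplifications: every decode step costs the same, prefill is ignored, and the request lengths are made-up numbers echoing the "20 vs 2000 tokens" example above.

```python
from collections import deque
from typing import List

def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Re-form the batch every step: finished requests leave, waiting ones join."""
    waiting = deque(lengths)
    running: List[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                                   # one decode step for the batch
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps

lengths = [20, 2000, 20, 20, 2000, 20, 20, 20]   # output tokens per request
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
```

Static batching needs 4000 steps for this workload and continuous batching 2020, because the second long request starts at step 21 instead of waiting for the first batch to drain. Larger gains like those quoted above come from heavier traffic and wider length variance than this toy has.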

vLLM 的 Scheduler · 一个调度循环

vLLM's Scheduler · the dispatch loop

vLLM 把所有请求抽象成两类:

WAITING:刚到、还没 prefill 的请求
RUNNING:已经 prefill 完、正在 decode 的请求

每个 step,scheduler 决定这一 step 哪些 WAITING 请求要 prefill、哪些 RUNNING 请求要 decode,然后把它们打包成一个 batch给 model_runner 跑。关键技巧是 prefill 和 decode 可以混在同一个 batch 里——一个 batch 里同时有"5 个新请求各自 prefill 200 token"和"10 个老请求各 decode 1 token",GPU 一次 forward 把这些都算完。

vLLM abstracts all requests into two states:

WAITING: newly arrived, not yet prefilled
RUNNING: prefill done, in decode

Each step, the scheduler decides which WAITING requests to prefill this step and which RUNNING requests to decode, then packs them into one batch for model_runner. The key trick: prefill and decode can mix in the same batch — a single batch can contain "5 new requests prefilling 200 tokens each" and "10 ongoing requests each decoding 1 token"; the GPU finishes all of it in one forward.

vllm/core/scheduler.py · Scheduler.schedule() · policy-driven dispatch

    def schedule(self) -> SchedulerOutputs:
        # Step 1: let RUNNING requests that can keep decoding continue.
        # First check that the KV cache can grow by one slot per request
        # (every RUNNING request writes one new KV entry per step).
        running_scheduled, preempted = self._schedule_running()

        # Step 2: priority order: preempted requests come back first, then new
        # WAITING ones, selected by FCFS / priority policy.
        swapped_in = self._schedule_swapped()          # swapped back from CPU memory
        prefill_scheduled = self._schedule_prefills(
            token_budget=self.max_num_batched_tokens - len(running_scheduled.tokens))

        # Step 3: pack one mixed prefill + decode batch.
        return SchedulerOutputs(
            scheduled_seq_groups=running_scheduled + swapped_in + prefill_scheduled,
            num_prefill_groups=len(prefill_scheduled),
            preempted=preempted,                       # evicted requests re-queue
        )

    # Main loop
    def step(self):
        outputs = self.scheduler.schedule()
        if outputs.is_empty():
            return []
        # One model forward handles the whole mixed batch.
        model_output = self.model_runner.execute_model(outputs)
        # Append new tokens to their sequences; finished requests are
        # detokenized and streamed back to the client.
        finished = self._process_model_outputs(model_output, outputs)
        return finished

抢占 · 当 KV cache 真的满了

Preemption · when KV cache really fills up

continuous batching 是 happy path——但新请求一直涌入,KV cache 总会满。这时 scheduler 就要做抢占(preemption):把某些 RUNNING 请求踢出去,让出 KV slot 给新请求。vLLM 有两种抢占策略:

swap:把这个请求当前的 KV cache 整体从 GPU 内存挪到CPU 内存,等以后可调度时再 swap 回来。耗时取决于 PCIe 带宽——H100 PCIe 5.0 单向 64 GB/s,一个 2.5 GB 的 KV swap 一次 ~40 ms。
recompute:直接把这个请求的 KV 删掉,等下次调度时重新 prefill。听起来浪费,但如果原 prompt 不长,重新 prefill 比 swap 快——而且 prefix caching 命中后 recompute 几乎免费。

vLLM 默认 recompute,因为 prefix caching 命中率高时它最优。SGLang(另一家高吞吐引擎)默认 swap。这两种选择直接影响 P99 延迟:swap 让 P50 好但 P99 偶尔很差(swap-out 那次),recompute 让平均更平但偶尔有"请求莫名重新跑了一遍 prefill"。

Continuous batching is the happy path — but new requests keep arriving and KV cache eventually fills. The scheduler must preempt: evict some RUNNING requests to free KV slots for newcomers. vLLM has two policies:

swap: move this request's entire KV cache from GPU memory to CPU memory; swap it back when reschedulable. Time depends on PCIe bandwidth — H100 PCIe 5.0 is 64 GB/s one-way; a 2.5 GB KV swap is ~40 ms.
recompute: drop this request's KV outright; re-prefill on next schedule. Wasteful-sounding but if the original prompt is short, recompute beats swap — and with prefix-cache hits, recompute is nearly free.

vLLM defaults to recompute (best when prefix-cache hit rates are high). SGLang (another high-throughput engine) defaults to swap. The choice directly impacts P99 latency: swap keeps P50 good but makes P99 occasionally awful (the swap-out moment), while recompute is flatter on average but occasionally shows up as "a request mysteriously ran prefill twice".
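The swap-versus-recompute trade-off can be put into a small cost model. This is a sketch, not vLLM's policy code: the 64 GB/s PCIe figure and the 2.5 GB KV size come from the text, while the 5000 tokens/s prefill rate and the function names are assumptions for illustration.

```python
def swap_ms(kv_bytes: float, pcie_gb_per_s: float = 64.0) -> float:
    """One-way transfer time for a request's KV cache over PCIe."""
    return kv_bytes / (pcie_gb_per_s * 1e9) * 1e3

def recompute_ms(prompt_tokens: int, cached_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to re-prefill the tokens that are not covered by the prefix cache."""
    return max(prompt_tokens - cached_tokens, 0) / prefill_tok_per_s * 1e3

def cheaper_policy(kv_bytes, prompt_tokens, cached_tokens, prefill_tok_per_s=5000.0):
    swap_round_trip = 2 * swap_ms(kv_bytes)          # swap out now, swap in later
    recompute = recompute_ms(prompt_tokens, cached_tokens, prefill_tok_per_s)
    return "recompute" if recompute < swap_round_trip else "swap"

one_way = swap_ms(2.5e9)
no_cache = cheaper_policy(2.5e9, prompt_tokens=4200, cached_tokens=0)
with_cache = cheaper_policy(2.5e9, prompt_tokens=4200, cached_tokens=4000)
```

Under these numbers a cold 4200-token prompt is cheaper to swap (about 78 ms round trip versus 840 ms of prefill), but once 4000 of those tokens hit the prefix cache, recompute drops to about 40 ms and wins.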

扩展 / EXTENDED · 为什么 prefill + decode 混跑能行 · 同 kernel,异 N / Why prefill + decode in one batch works · same kernel, different N

第 2 章说过 prefill 和 decode 是两种工作负载。那它们怎么混在一个 batch 里?这件事需要解释一下,因为它不显然。

关键:模型 forward 的输入抽象是 llama_ubatch{ token, pos, seq_id, ... }(第 5 章)——它不区分"这个 token 是 prefill 还是 decode",只关心"这个 token 属于哪条序列、在那条序列的什么位置"。一个 batch 里:

seq_0 的 token 0,1,2,3,4,5(prefill 6 个 token,共享一个 seq_id)
seq_1 的 token 100(decode,只有 1 个 token)
seq_2 的 token 0,1,...,199(prefill 200 个 token)
seq_3 的 token 50(decode)

这个 batch 总共 6 + 1 + 200 + 1 = 208 个 token。一次 forward 处理 208 个 token,Q 矩阵 [208, d_head]。attention 的 mask 矩阵根据 seq_id + pos 算出来——seq_0 的 token 5 只能看到 seq_0 的 token 0..5,看不到 seq_2 的 token——这就是多用户 batch 的隔离。

这件事在 H100 上实测能跑得很好,因为 SM 数量大(132 个),一次 forward 同时容纳 prefill 和 decode 的算力 / 带宽混合负载毫无压力。但kernel 设计要支持这种 mixed batch——FlashAttention v2/v3 都明确针对这件事做了优化(varlen API,可以接受不同长度的 Q)。

顺手补一个 vLLM 在 mixed batch 上的陷阱:同一个 batch 里 prefill 的 N 大,decode 的 N=1。如果 batch 总 token 数有限(max_num_batched_tokens),一个长 prefill 进来会挤掉很多 decode——P99 latency 突然变差。vLLM 引入 --max-num-seqs 和 chunked prefill(第 21 章)就是为了解决这个不公平。

Ch.2 said prefill and decode are two different workloads. So how can they share one batch? This deserves an explanation, since it's not obvious.

Key: model forward's input abstraction is llama_ubatch{ token, pos, seq_id, ... } (Ch.5) — it doesn't distinguish "prefill vs decode token", only "which sequence does this token belong to, at what position". One batch:

seq_0 tokens 0,1,2,3,4,5 (prefill 6 tokens, shared seq_id)
seq_1 token 100 (decode, 1 token)
seq_2 tokens 0,1,...,199 (prefill 200 tokens)
seq_3 token 50 (decode)

Batch total: 6 + 1 + 200 + 1 = 208 tokens. One forward processes 208 tokens, Q matrix [208, d_head]. Attention's mask is computed from seq_id + pos — seq_0's token 5 sees only seq_0's tokens 0..5, not seq_2's — that's per-user isolation inside the batch.

This works great on H100 because there are 132 SMs; one forward easily absorbs the mixed compute/bandwidth load of prefill and decode. But the kernel must support this mixed batch — FlashAttention v2/v3 explicitly added this (the varlen API, accepting Q rows of different lengths).

One footgun: in a mixed batch, prefill has large N, decode N=1. If the batch's total token budget (max_num_batched_tokens) is capped, one long prefill crowds out many decodes — P99 latency spikes. vLLM's --max-num-seqs flag and chunked prefill (Ch.21) exist precisely to fight this unfairness.
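The per-sequence isolation described above comes down to a mask built from `(seq_id, pos)` pairs. The sketch below covers only the tokens inside the current batch; a real kernel also attends to each sequence's previously cached KV, which is omitted here. The batch contents are illustrative.

```python
from typing import List, Tuple

def build_mask(batch: List[Tuple[int, int]]) -> List[List[bool]]:
    """batch[i] = (seq_id, pos). mask[i][j] is True when token i may attend to token j."""
    return [[si == sj and pj <= pi for (sj, pj) in batch] for (si, pi) in batch]

# seq 0 prefills 3 tokens, seq 1 decodes 1 token at position 100, seq 2 prefills 2 tokens.
batch = [(0, p) for p in range(3)] + [(1, 100)] + [(2, p) for p in range(2)]
mask = build_mask(batch)
```

Row 2 (the last token of seq 0) sees all three seq 0 tokens and nothing else, and the lone decode token of seq 1 sees only itself within the batch. The mask is block-diagonal and causal inside each block, which is the structure varlen attention kernels exploit.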

同一台机器 · 不同 batching 策略的吞吐 / Same machine · throughput across strategies

static batching (HF transformers)	~50 req/min	baseline
+ continuous batching (Orca-style)	~250 req/min	5×
+ PagedAttention (密度更高 / denser KV packing)	~600 req/min	12×
+ chunked prefill (第 21 章 / Ch.21)	~800 req/min	16×
+ prefix caching + spec decode (第 19 章 / Ch.19)	~1500 req/min	30×

为什么这是"vLLM 的核心" / WHY THIS IS "vLLM'S CORE"

大众把 vLLM 跟 PagedAttention 划等号,其实 PagedAttention 只是它"更密集"地用 KV cache 的招式。真正让 vLLM 在 2023 年震惊业界的是 continuous batching + 抢占 + 调度策略这一整套:它把"一个推理引擎"从"跑模型 forward 的程序"升级成了"跟 OS 一样的调度系统"。理解这个,你才能解释"为什么 vLLM 单机 100K tokens/s 不是因为它的 attention kernel 比 llama.cpp 快":它的 attention kernel 跟 FlashAttention 一样。它快是因为它的 GPU 不闲着。

Popular discourse equates vLLM with PagedAttention, but PagedAttention is "just" the trick that packs KV cache denser. What stunned the industry in 2023 was the full package of continuous batching + preemption + scheduling policy: it upgraded "an inference engine" from "a program that runs model.forward" to "an OS-like scheduling system". With that in mind you can finally explain why vLLM's 100K tokens/s per machine is not the result of an attention kernel faster than llama.cpp's. Its attention kernel is FlashAttention. It is faster because its GPU never sits idle.

CHAPTER 21 · SCALE-OUT

Chunked prefill — 把 32K prompt 切成 64 段塞进 batch

Chunked prefill — slicing a 32K prompt into 64 batch chunks

长 prompt 不再拖死所有 decode

long prompts no longer hold decode hostage

第 20 章那张吞吐表里有一行 "chunked prefill"(~600 → ~800 req/min,约 +33%),这是 vLLM 2024 年的关键升级。背景是 continuous batching 的不公平:如果某个用户发了一个 32K token 的长 prompt,它的 prefill 要算 ~50 ms,这 50 ms 里同 batch 的其他 decode 用户全部等着,他们看到的 TPOT 突然变成 50 ms 而不是 15 ms。

chunked prefill 的招式:把长 prefill 切成小块,每块 ~512 token,跟 decode 混跑。一个 32K prefill 切成 64 块,每个 batch step 跑 1 块 + 一堆 decode,单 step latency 维持在 ~20 ms,decode 用户感知不到长 prompt 用户的存在。代价是长 prompt 用户的 TTFT 变高(prompt 现在要 64 个 step 才能跑完,而不是一次 forward),但 TPOT 更稳。

The "chunked prefill" row in Ch.20's throughput table (~600 → ~800 req/min, about +33%) is vLLM's key 2024 upgrade. Background: continuous batching is unfair. If one user sends a 32K prompt, prefilling it takes ~50 ms; during those 50 ms every other decode user in the batch waits, so their TPOT spikes from 15 ms to 50 ms.

Chunked prefill's move: slice a long prefill into small chunks of ~512 tokens and mix them with decode. A 32K prefill becomes 64 chunks; each batch step runs 1 chunk + many decodes, so per-step latency stays ~20 ms and decode users never notice the long-prompt user. Cost: the long-prompt user's TTFT rises, because the prompt now takes 64 scheduler steps instead of one forward pass, but TPOT is flatter.

vllm/core/scheduler.py · chunked prefill budget split · budget per step

    # A step's "budget" is max_num_batched_tokens (typically 2048),
    # shared between prefill and decode.
    def _get_token_budget(self, running_decodes, waiting_prefills):
        # Step 1: reserve room for every decode (1 token each).
        decode_tokens = len(running_decodes)
        # Step 2: whatever is left goes to prefill.
        prefill_budget = self.max_num_batched_tokens - decode_tokens
        # Step 3: pick prefills FCFS; a long prompt may be admitted as a chunk.
        chosen = []
        for req in waiting_prefills:
            if prefill_budget <= 0:
                break                             # never schedule a zero-token chunk
            n_tokens = min(req.tokens_remaining, prefill_budget)
            chosen.append((req, n_tokens))        # n_tokens < remaining means a chunk
            req.tokens_remaining -= n_tokens
            prefill_budget -= n_tokens
        return running_decodes + chosen
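The budget split above determines how many scheduler steps a long prompt needs. A minimal sketch, assuming decodes are always served first and the prompt is the only waiting prefill (the function name and the 48-decode figure are illustrative):

```python
def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int, n_decodes: int) -> int:
    """Scheduler steps needed to prefill one prompt when decodes are served first."""
    per_step = max_num_batched_tokens - n_decodes
    if per_step <= 0:
        raise ValueError("decodes already consume the whole token budget")
    return -(-prompt_tokens // per_step)   # ceiling division

# 512 prefill tokens per step next to 48 running decodes.
steps_512 = prefill_steps(32 * 1024, max_num_batched_tokens=512 + 48, n_decodes=48)
# The "typical 2048" budget leaves 2000 prefill tokens per step.
steps_2048 = prefill_steps(32 * 1024, max_num_batched_tokens=2048, n_decodes=48)
```

A 512-token chunk size gives the 64 steps used in the text; with a 2048-token budget the same prompt needs 17 steps. Smaller chunks keep per-step latency lower for decode users at the price of more steps before the first token.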

chunked prefill 的隐性代价 · 重复读 KV cache

Chunked prefill's hidden cost · re-reading KV cache

切成块跑不是完全免费的。原本 32K prompt 一次 prefill 要读 KV cache 0 次(因为是从空开始),chunked 之后每一块 prefill 都要读"之前已经写入的所有 KV":第 8 块要读前 7 块 × 512 = 3584 个 K,总读取量从 O(N²/2) 变成 O(N²)。

对 32K prompt,实测多读 ~3 GB 数据(per layer 维度累加)。看起来很多,但因为 prefill 阶段本来就 compute-bound,HBM 带宽有富余,这部分多读几乎不影响总时间。chunked prefill 在 compute-bound 阶段交换 latency-fairness——只在 prefill 才有意义,decode 是 memory-bound,chunked 反而损失。

这就是为什么 vLLM 默认只对 prefill chunked,decode 永远整段跑。调度策略跟硬件物理特性强绑定——理解这一点,你就能预测下一代 GPU(更高 HBM 带宽)会让 chunked prefill 的相对优势如何变化。

Slicing isn't free. A 32K prompt's monolithic prefill reads KV cache 0 times (starts empty); chunked prefill must read "everything written by prior chunks" — the 8th chunk reads 7 × 512 = 3584 K rows — total reads go from O(N²/2) to O(N²).

For a 32K prompt, that's ~3 GB extra data read (summed across layers). It looks expensive, but prefill is compute-bound and HBM bandwidth has headroom, so the extra reads barely move total time. Chunked prefill trades spare bandwidth for latency fairness in the compute-bound regime. The trade only makes sense for prefill; decode is memory-bound, so chunking it would lose rather than win.

That's why vLLM chunks only prefill, while decode always runs whole. Scheduling is tightly coupled to hardware physics: understand this and you can predict how the relative advantage shifts on next-gen GPUs (higher HBM bandwidth).
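The re-read cost can be counted directly. The sketch below counts rows of previously written K that later chunks must read back, per layer, assuming each chunk reads every earlier row exactly once (kernel tiling effects are ignored, and the function name is made up for the example):

```python
def prior_kv_rows_read(prompt_tokens: int, chunk: int) -> int:
    """Rows of previously written K (or V) that chunked prefill re-reads, per layer."""
    n_chunks = -(-prompt_tokens // chunk)
    # Chunk j (0-based) starts at token j * chunk and must read that many earlier rows.
    return sum(j * chunk for j in range(n_chunks))

eighth_chunk = 7 * 512                       # the "8th chunk" example from the text
total = prior_kv_rows_read(32 * 1024, 512)   # all 64 chunks of a 32K prompt
```

The 8th chunk re-reads 3584 rows, and the whole 32K prompt re-reads about 1.03 million rows per layer. Turning that into bytes requires the model's KV row size and layer count, which this sketch leaves as inputs.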

扩展 / EXTENDED · Sarathi / DistServe · 把 prefill 和 decode 物理分开 / physically splitting prefill from decode

chunked prefill 是逻辑上混跑 prefill 和 decode。2024 年又出现了一个更激进的思路——物理上把它们分开,放在不同的 GPU上:

Sarathi-Serve(MSR 2023):指出 prefill 和 decode 对硬件的需求差异巨大,提议 prefill chunked 进 decode batch。被 vLLM 采用。
DistServe(OSDI 2024,UC San Diego):更激进,给 prefill 专门一组 GPU,给 decode 专门一组 GPU,中间通过 KV cache 传输衔接。一个用户的 prompt 在"prefill 集群"上跑完,KV 整体搬到"decode 集群"开始 decode。延迟和吞吐量两端都改善,代价是 KV 传输的网络开销。
Splitwise(MSR 2024):同思路,把 prefill / decode 分别放到不同 SKU 的 GPU:prefill 用算力强的 H100,decode 用算力较弱但更便宜的 A100(decode 受限于带宽而非算力)。成本敏感场景下有意义。

这条路线还在研究 → 量产的过渡期。最大的实际部署是 Anthropic 内部和某些超大规模服务——你看不到它的具体代码,但能从 P99 latency 改善的口径推断。下一代推理引擎大概率会原生支持这种"disaggregated prefill/decode"。

Chunked prefill mixes prefill and decode logically. 2024 brought a more radical idea — split them physically across different GPUs:

Sarathi-Serve (MSR 2023): showed prefill and decode have wildly different hardware needs; proposed chunked prefill into decode batch. Adopted by vLLM.
DistServe (OSDI 2024, UC San Diego): more radical. Dedicate one GPU pool to prefill and another to decode, and bridge them via KV cache transfer. A user's prompt prefills on the "prefill cluster"; the resulting KV moves wholesale to the "decode cluster" to start decoding. Wins on both latency and throughput, at the cost of network KV transfer overhead.
Splitwise (MSR 2024): same idea, but on different GPU SKUs: prefill on compute-rich H100, decode on cheaper, compute-leaner A100 (decode is limited by memory bandwidth, not compute). Makes sense in cost-sensitive scenarios.

This line is still in the research → production transition. The biggest real deployments are at Anthropic-tier scale and a few hyperscalers — you can't see their code, but P99 latency improvements give it away. Next-gen inference engines will likely natively support this "disaggregated prefill/decode" pattern.

CHAPTER 22 · SCALE-OUT · KEY

多 GPU — TP / PP / EP 三种切法

Multi-GPU — TP / PP / EP, three ways to slice

一张卡跑不动,八张卡怎么协同

when one card isn't enough, how do eight cooperate

Llama-3-405B fp16 是 810 GB——单 H100 (80 GB) 一张装不下模型权重,更别说 KV cache。要让它跑起来,必须把权重切到多张卡上。"切"有三种思路,产业里叫 TP / PP / EP:

Tensor Parallelism (TP):同一个权重矩阵切到多张卡上(典型沿列切),每张卡算自己那部分,然后 all-reduce 合并。跨卡通信:每层 attention + FFN 一次 all-reduce。
Pipeline Parallelism (PP):不同层放到不同卡上。layer 0-10 在卡 A,layer 11-20 在卡 B……token 在 forward 时顺次穿过所有卡。跨卡通信:每个 stage boundary 一次 send/recv。
Expert Parallelism (EP):MoE 专家分布到不同卡上(第 12 章扩展讨论过)。每层 FFN 两次 all-to-all。

这三种是正交的,可以叠加。比如 Llama-3-405B 在 8 × H100 上典型用 TP=8(8 张卡协同算每一层);DeepSeek-V3 671B 在 16 × H100 上用 TP=4 × EP=4(每 4 张卡协同算 attention,16 张卡分摊 expert)。

Llama-3-405B fp16 = 810 GB — a single H100 (80 GB) can't hold the weights, never mind KV cache. To run it you must split weights across cards. Three strategies, industry-known as TP / PP / EP:

Tensor Parallelism (TP): shard one weight matrix across cards (typically along the column axis); each card computes its part; results are all-reduced. Cross-card traffic: one all-reduce after attention and one after the FFN in every layer.
Pipeline Parallelism (PP): different layers on different cards. Layers 0-10 on card A, 11-20 on card B... tokens flow sequentially through all cards. Cross-card traffic: one send/recv per stage boundary.
Expert Parallelism (EP): MoE experts distributed across cards (covered in Ch.12 extended). Two all-to-alls per FFN layer.

The three are orthogonal and compose. Llama-3-405B on 8×H100 typically runs TP=8 (8 cards cooperate per layer). DeepSeek-V3 671B on 16×H100 runs TP=4 × EP=4 (every 4 cards do attention together; the 16 split the experts).

	TP	PP	EP
切什么 / what to slice	单层权重矩阵 / a single weight matrix	不同层 / different layers	MoE 专家 / MoE experts
通信类型 / comms	all-reduce	send/recv	all-to-all
通信频率 / comms freq	每层 2 次 / 2 per layer	stage 边界 1 次 / 1 per stage boundary	每 FFN 层 2 次 / 2 per FFN layer
每次通信量 / comm size	[n_tokens × d_model]	[n_tokens × d_model]	[n_tokens × d_model]
需要快带宽 / needs fast link	NVLink 必需 / NVLink required	PCIe 即可 / PCIe suffices	NVLink 必需 / NVLink required
扩展上限 / scale limit	~8 (NVLink domain)	几十-几百 / tens to hundreds	~64
流水气泡 / pipeline bubble	n/a	prefill 小 · decode 大 / small at prefill · large at decode	n/a
适用场景 / use case	单机多卡 / single node, multi-GPU	跨节点 / cross-node	MoE only

FIG. 14 TP 通信频次高但每次小,需要 NVLink 这种"同机内"高速链路;PP 通信稀疏可以走 PCIe / InfiniBand,适合跨节点。所以"单机 TP,跨机 PP" 是 405B 这种真大模型的标准切法。 TP has frequent small comms — needs NVLink-class intra-node speed. PP comms are sparse and fit PCIe/InfiniBand — fine for cross-node. So "TP within a node, PP across nodes" is the standard cut for genuinely large models like 405B.

TP 的真实切法 · attention 的 column-row 配对

TP in detail · attention's column-row pairing

朴素地"把矩阵从中间切两半"是不够的——TP 要保证切完之后,计算结果跟未切版本完全一致(注意可能有微小 fp 误差,但语义上一致)。这需要在每一层用列切 + 行切配对:

Q/K/V 的投影矩阵 W_q / W_k / W_v 沿输出维度(列)切——每张卡得到 1/TP 个 head。
各卡独立算自己那部分 attention(不需要通信,因为 head 之间独立)。
输出投影 W_o 沿输入维度(行)切——每张卡乘自己那部分,然后 all-reduce 把结果加起来。
FFN 同样:W_gate / W_up 沿列切,W_down 沿行切 + all-reduce。

所以一层 transformer 在 TP=8 下,有两次 all-reduce(一次 attention 出口,一次 FFN 出口)。32 层模型 → 64 次 all-reduce。每次 all-reduce 通信量 ~n_tokens × d_model × 2 bytes,decode 时 ~30 KB,prefill 时几 MB。在 NVLink 4.0(单向 450 GB/s,双向 900 GB/s)上一次 all-reduce ~5-50 μs,32 层叠加 ~200 μs - 几 ms,占总 latency 5-15%。

Naively "slicing the matrix down the middle" doesn't suffice: TP must guarantee that the sharded computation is mathematically equivalent to the un-sliced version (up to floating-point rounding). The trick is to pair a column shard with a row shard in every layer:

Q/K/V projections W_q / W_k / W_v shard along the output dim (column) — each card owns 1/TP heads.
Each card runs its own attention slice (no comm needed; heads are independent).
Output projection W_o shards along the input dim (row) — each card multiplies its slice, then all-reduce sums everything up.
FFN same: W_gate / W_up column-sharded; W_down row-sharded + all-reduce.

So one transformer layer under TP=8 has two all-reduces (one at attention exit, one at FFN exit). 32 layers → 64 all-reduces. Each all-reduce moves ~n_tokens × d_model × 2 bytes: ~30 KB at decode, several MB at prefill. On NVLink 4.0 (450 GB/s per direction, 900 GB/s bidirectional), one all-reduce takes ~5-50 μs; 32 layers stack to ~200 μs - several ms, 5-15% of total latency.

vllm/distributed/parallel_state.py · all-reduce in attention · column · row pairing

    # Simplified view of TP attention in vLLM.
    class TPAttention(nn.Module):
        def __init__(self, hidden_size, num_heads, tp_size):
            super().__init__()
            # Column-parallel: the output is sharded along num_heads.
            self.qkv_proj = ColumnParallelLinear(
                hidden_size, 3 * hidden_size,
                gather_output=False)       # no gather; shards stay on this card
            # Row-parallel: the input is sharded along hidden_size / TP.
            self.o_proj = RowParallelLinear(
                hidden_size, hidden_size,
                input_is_parallel=True)    # input is already sharded

        def forward(self, hidden):
            qkv = self.qkv_proj(hidden)    # this card computes only its own heads
            q, k, v = qkv.chunk(3, dim=-1)
            attn = flash_attn(q, k, v)     # local heads, no communication
            out = self.o_proj(attn)        # all-reduce happens inside
            return out

    class RowParallelLinear(nn.Module):
        def forward(self, x):
            local_out = x @ self.weight_local                 # partial product on this card
            if self.tp_size > 1:
                dist.all_reduce(local_out, op=ReduceOp.SUM)   # NCCL all-reduce
            return local_out
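The column-then-row pairing can be verified on paper-sized matrices. The sketch below simulates the TP ranks in a loop on one process, with the sum over ranks standing in for the all-reduce. The matrices are arbitrary integers chosen for the example, and the nonlinearity between the two projections is omitted to keep the arithmetic exact.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tp_two_layer(x, w1, w2, tp: int):
    """y = (x @ w1) @ w2 with w1 column-sharded and w2 row-sharded across tp ranks."""
    hidden = len(w2)                      # inner dimension shared by w1 and w2
    shard = hidden // tp
    total = None
    for rank in range(tp):
        lo, hi = rank * shard, (rank + 1) * shard
        w1_cols = [row[lo:hi] for row in w1]            # column slice of w1
        w2_rows = w2[lo:hi]                             # matching row slice of w2
        partial = matmul(matmul(x, w1_cols), w2_rows)   # purely local work
        total = partial if total is None else add(total, partial)  # the all-reduce (sum)
    return total

x  = [[1, 2], [3, 4]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 2]]           # 2 x 4
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]       # 4 x 2
reference = matmul(matmul(x, w1), w2)
sharded = tp_two_layer(x, w1, w2, tp=2)
```

Each rank needs no communication until the final sum, which is why one all-reduce per column/row pair is enough. An elementwise activation between the two matmuls would not change this, since it acts independently on each column shard.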

PP 的"气泡"问题 · 为什么 PP 比 TP 麻烦

PP's "bubble" problem · why PP is trickier than TP

PP 看起来很自然——"把 32 层分给 8 张卡,每卡 4 层"——但 forward 是顺序的:卡 A 算完 layer 0-3,把结果发给卡 B 算 layer 4-7,卡 B 算的时候卡 A 在干嘛?闲着。这就是pipeline bubble:第一个 token 流过 8 张卡的时候,前 7 张卡都在等。

解决的办法是 micro-batching:把 batch 切成多个 micro-batch,让卡 A 算完 mb_0 立刻开始 mb_1,同时卡 B 在算 mb_0。理想情况下 8 张卡同时各自处理一个 micro-batch,气泡降到 ~1/n_micro_batches。

但decode 不能 micro-batch——一个 batch step 只有 1 个 token,切不动。所以 PP 在 decode 阶段气泡严重——卡 A 算一个 token 的 4 层 → 卡 B → 卡 C → ……→ 卡 H 算完出来,中间 7 张卡每次都要等。这就是为什么纯 PP 推理几乎没人用,生产环境都是 PP + TP 混合(大节点之间 PP,节点内 TP)。

PP looks natural — "32 layers across 8 cards, 4 layers each" — but forward is sequential: card A finishes layers 0-3, sends to card B for 4-7. What's card A doing while B works? Sitting idle. That's the pipeline bubble: as the first token flows through 8 cards, 7 cards wait.

The fix is micro-batching: slice the batch into multiple micro-batches; A starts mb_1 the moment it finishes mb_0, while B runs mb_0. Ideally 8 cards each process one micro-batch concurrently; bubble drops to ~1/n_micro_batches.

But decode can't micro-batch — one batch step has only 1 token to spread. So PP at decode has severe bubbles — A does 4 layers, B, C, ..., H finishes; 7 cards always waiting. Which is why pure-PP inference is rarely used; production deployments use PP + TP mixed (PP across nodes, TP within).
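The bubble argument can be quantified with the usual fill-and-drain estimate for a simple pipeline schedule: with p stages and m micro-batches, the idle share is (p − 1) / (m + p − 1). The sketch assumes equal-cost stages and ignores communication time.

```python
def bubble_fraction(n_stages: int, n_micro_batches: int) -> float:
    """Idle share of a simple fill-and-drain pipeline schedule."""
    return (n_stages - 1) / (n_micro_batches + n_stages - 1)

decode_like = bubble_fraction(8, 1)     # one token, nothing to split
prefill_8mb = bubble_fraction(8, 8)     # 8 micro-batches over 8 stages
prefill_64mb = bubble_fraction(8, 64)   # many micro-batches
```

With a single micro-batch, 7 of 8 cards are idle at any time (87.5%), matching the decode picture above. The bubble only gets small when the micro-batch count is well above the stage count: 8 micro-batches still leave about 47% idle, and 64 bring it under 10%.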

扩展 · EXTENDED

Llama-3-405B 部署配方 · 一台 8×H100 (640 GB) 怎么扛

Llama-3-405B deployment recipe · how 8×H100 (640 GB) holds it

Llama-3-405B 在 fp8 量化下是 ~405 GB。按 GQA-8、126 层、head_dim 128 计,fp16 KV cache 约 0.5 MB/token,128K 满上下文约 68 GB;这里按每用户 ~10 GB(约 2 万 token)做预算。一台 8×H100 节点 640 GB,拆账如下:

模型权重 fp8: 405 GB / 8 = ~51 GB/卡
CUDA 上下文 + activation buffer: ~10 GB/卡
剩余 KV cache 容量: 80 - 51 - 10 = ~19 GB/卡
全节点 KV 容量: 19 × 8 = 152 GB
支持并发用户(~10 GB/用户,约 2 万 token 上下文): ~15 个

这个 SKU 极度奢侈:一节点 ~$32/h,15 用户 = $2.13/用户·小时。这就是为什么 405B 服务的per-token 定价比 70B 服务高 5×——单 GPU 服务密度低了 10×。

下一代的优化方向:

Llama-3.1-405B-Instruct-MoE(假设)→ 用 MoE 砍激活参数,decode 速度上去,KV cache 下来。Meta 没出,但 DeepSeek-V3 / Llama-4 是这条路。
更低精度:fp4(MXFP4)→ 模型 ~200 GB,5 张卡装得下 → 同节点能服务 30 用户。
跨节点 PP:用 2 节点 16 张卡,PP=2 × TP=8,模型权重每卡 ~25 GB,KV 容量翻倍。代价是节点间 InfiniBand 通信。

Llama-3-405B in fp8 = ~405 GB. With GQA-8, 126 layers and head_dim 128, fp16 KV cache is ~0.5 MB per token, so a full 128K context would need ~68 GB; the budget here assumes ~10 GB of KV cache per user (about 20K tokens). One 8×H100 node (640 GB):

fp8 weights: 405 GB / 8 = ~51 GB/card
CUDA context + activation buffers: ~10 GB/card
Remaining for KV cache: 80 - 51 - 10 = ~19 GB/card
Node-wide KV capacity: 19 × 8 = 152 GB
Concurrent users (~10 GB/user, ~20K-token context): ~15

This SKU is luxurious: one node at ~$32/h, 15 users = $2.13/user-hour. Hence why 405B's per-token pricing is 5× a 70B's — service density is 10× lower per GPU.
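The budget above is plain arithmetic, so it can be parameterized. A minimal sketch follows. The function name and argument names are made up for illustration. The inputs are the figures quoted in the recipe (80 GB cards, ~405 GB of weights, ~10 GB overhead per card, ~10 GB of KV cache per user, ~$32/h per node).

```python
def node_kv_budget(n_gpus, hbm_gb, weights_gb, overhead_gb_per_gpu, kv_gb_per_user):
    """Return (KV GB per card, KV GB per node, concurrent users) for one node."""
    weights_per_gpu = weights_gb / n_gpus
    kv_per_gpu = hbm_gb - weights_per_gpu - overhead_gb_per_gpu
    kv_total = kv_per_gpu * n_gpus
    users = int(kv_total // kv_gb_per_user)
    return kv_per_gpu, kv_total, users

kv_per_gpu, kv_total, users = node_kv_budget(
    n_gpus=8, hbm_gb=80, weights_gb=405, overhead_gb_per_gpu=10, kv_gb_per_user=10)
cost_per_user_hour = 32 / users   # node price of ~$32/h taken from the text

print(f"KV per card: {kv_per_gpu:.1f} GB, node-wide: {kv_total:.0f} GB")
print(f"users: {users}, cost: ${cost_per_user_hour:.2f} per user-hour")
```

Without intermediate rounding the node-wide figure is 155 GB rather than 152 GB, but the user count is the same. Changing `weights_gb` or `n_gpus` reproduces the fp4 and two-node variants discussed in the next-gen list.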

Next-gen directions:

Llama-3.1-405B-Instruct-MoE (hypothetical) → MoE cuts active params, lifts decode speed, shrinks KV. Meta hasn't shipped this; DeepSeek-V3 / Llama-4 are on this path.
Lower precision: fp4 (MXFP4) → model ~200 GB, fits in 5 cards → 30 users per node.
Cross-node PP: 2 nodes / 16 cards, PP=2 × TP=8 → ~25 GB weights per card, KV capacity doubles. Cost: cross-node InfiniBand traffic.

CHAPTER 23 · LAST MILE

Streaming + Chat Template — 从 token 到用户屏幕

Streaming + Chat Template — from token to user screen

SSE · 流式输出 · 对话模板

SSE · streaming · the chat template

前 22 章讲的全是怎么生成 token。但 token 不是用户看到的东西——用户看到的是字符串,而且是一个一个蹦出来的。从 token 到屏幕,这"最后一公里"有两件事要做:

Chat template:用户给的是 "你好,llama" 这样的纯文本,但模型不是看纯文本工作的——它训练时的对话格式是 <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n你好,llama<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n(Llama-3 格式)。这一层包装错了模型就不工作。
Streaming:不能等模型生成完整段回答再返回——用户体验需要每生成一个 token 就发一个 token。HTTP 上用 Server-Sent Events(SSE),WebSocket 也可以但 SSE 简单得多。

The previous 22 chapters were entirely about generating tokens. But tokens aren't what users see — users see a string, and one that appears character by character. From token to screen, this "last mile" has two pieces:

Chat template: the user types "hello, llama" as plain text, but the model doesn't see plain text — it was trained on conversation formatted as <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello, llama<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n (Llama-3 format). Get this wrapping wrong and the model breaks.
Streaming: can't wait for the full answer — UX requires emitting each token as it's produced. HTTP uses Server-Sent Events (SSE); WebSocket works too but SSE is much simpler.
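The Llama-3 wrapping quoted above is simple enough to sketch by hand. This is an illustration only: the function name `render_llama3_chat` is made up, and production code should render the template shipped with the model instead of hard-coding it.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Wrap a list of {"role", "content"} dicts in the Llama-3 chat format."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                f"{m['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = render_llama3_chat([{"role": "user", "content": "hello, llama"}])
print(repr(prompt))
```

The returned string still has to be tokenized with special-token parsing enabled, so that `<|eot_id|>` and friends map to their single reserved ids instead of being split into ordinary text tokens.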

Chat template · 三个主流家族的实际格式

Chat template · three mainstream formats

chat templates · side by side · three families, three formats

这三种格式的每个分隔符都是特殊 token(不在普通词表里,需要 tokenizer 显式插入)。Llama-3 的 <|eot_id|> 是 128009,<|start_header_id|> 是 128006。如果你拿 user 输入直接 tokenize,这些 token 永远不会出现——必须由应用层把它们插进去。

llama.cpp 的 C API 把这件事放到 llama_chat_apply_template 函数:它读取模型自带的模板字符串(GGUF metadata 里的 tokenizer.chat_template),但并不真正执行 Jinja2,而是把它匹配到一组内置的已知格式;完整的 Jinja 渲染走 llama-server 的 --jinja 路径。OpenAI API 兼容的 chat 接口(包括 vLLM 的)都会自动做这一步,但裸 completion 接口需要你自己来。这就是为什么 raw completion 和 chat completion API 在同一个模型上输出风格差异巨大:不是模型不一样,是 chat completion 帮你套了正确的模板。

Every separator in these formats is a special token (not in regular vocab; must be explicitly inserted by the tokenizer). Llama-3's <|eot_id|> is 128009, <|start_header_id|> is 128006. If you tokenize raw user input, these never appear — they must be injected by the application layer.

llama.cpp's C API handles this in llama_chat_apply_template. It reads the model's bundled template string (tokenizer.chat_template in GGUF metadata), but it does not execute Jinja2; it matches the string against a built-in list of known formats, while full Jinja rendering lives in llama-server's --jinja path. OpenAI-compatible chat APIs (vLLM included) apply the template automatically, but the raw completion endpoint leaves it to you. This is why raw completion and chat completion on the same model produce wildly different styles: the model is the same, but chat completion applied the correct template for you.

Streaming · SSE 的简单美学

Streaming · the simple beauty of SSE

Server-Sent Events 协议非常简单:HTTP 响应不结束,服务器持续往 socket 上写 data: {...}\n\n 行,客户端 EventSource 一行行解析。OpenAI / vLLM / Anthropic 的 streaming API 都是这个协议:

SSE is dead simple: the HTTP response never ends; the server keeps writing data: {...}\n\n lines to the socket; the client EventSource parses line by line. OpenAI / vLLM / Anthropic streaming APIs all use this:

wire format · SSE chat completion · one chunk per token

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"choices":[{"delta":{"role":"assistant"}}]}

data: {"choices":[{"delta":{"content":"你"}}]}

data: {"choices":[{"delta":{"content":"好"}}]}

data: {"choices":[{"delta":{"content":","}}]}

// ... one chunk per token ...

data: {"choices":[{"finish_reason":"stop"}]}

data: [DONE]
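On the client side, consuming this stream amounts to filtering `data:` lines and decoding each JSON payload. The sketch below assumes the chunk shape shown in the wire format above (`choices[0].delta.content`) and a `[DONE]` sentinel. The generator name `iter_sse_deltas` is made up for illustration, and a real client would read lines from the HTTP response rather than a list.

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from an iterable of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

SAMPLE = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}', "",
    'data: {"choices":[{"delta":{"content":"你"}}]}', "",
    'data: {"choices":[{"delta":{"content":"好"}}]}', "",
    'data: {"choices":[{"delta":{"content":","}}]}', "",
    'data: {"choices":[{"finish_reason":"stop"}]}', "",
    "data: [DONE]", "",
]

text = "".join(iter_sse_deltas(SAMPLE))
print(text)
```

The first chunk carries only the role and the last only `finish_reason`, so neither contributes text.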

看起来简单,但实际有三个坑:

多字节 token 的边界:中文一个字往往是2-3 个 BPE token。如果服务器拿到一个 token 立刻 detokenize 再发,可能发出"半个字符"(utf-8 的 leading byte 但没有 continuation byte),前端 JSON 解析报错。解决:服务器内部维护一个 "buffer",只在完整字符形成时才发。
多 token 一起来的 burst:speculative decoding(第 19 章)一次能产出多个 accepted token,要么一次发一个,要么累积——前者降低延迟、后者降低字节开销。OpenAI 选了"一次发一个"。
客户端断开:HTTP 长连接不可靠,客户端中途关掉的话必须立刻停止 decode 释放 GPU。vLLM 的 OpenAI 兼容服务检测到断开后会 abort 该请求并释放它的 KV block,llama.cpp 的 HTTP server 也会在连接关闭时取消对应任务。否则 idle 的 KV cache 会一直占着,新请求被挤掉。

Looks simple, but three real footguns:

Multi-byte token boundaries: Chinese characters are often 2-3 BPE tokens. If the server detokenizes and emits per token, it can ship "half a character" (utf-8 leading byte without continuation) — front-end JSON breaks. Fix: server-side buffer, emit only when a complete character forms.
Multi-token bursts: speculative decoding (Ch.19) can produce several accepted tokens at once; emit individually or buffered — former cuts latency, latter cuts bytes. OpenAI picked "one at a time".
Client disconnects: HTTP long connections are unreliable; if the client drops, decode must stop immediately to free the GPU. vLLM's OpenAI-compatible server aborts the request and releases its KV blocks once it detects the disconnect; llama.cpp's HTTP server likewise cancels the task when the connection closes. Otherwise idle KV cache squats and crowds out new requests.

扩展 · EXTENDED

Detokenize 不是 tokenize 的反函数 · 流式 detoken 的细节

Detokenize isn't tokenize's inverse · streaming detoken in detail

第 3 章讲了 tokenize。反向(detokenize)看似简单:查 vocab 表把 id 翻成字符串拼起来。但这里有个坑:Llama-3 的 vocab 里有几千个"带前导空格"的 token(" the", " hello"),如果你去掉每个 token 的空白、或者在 token 之间自己补空格,就会得到 "thehelloworld" 或 "the  hello" 这样的输出,空格的位置全错了。

正确做法:每个 token 的字符串已经带了它该带的前导空格(BPE 的训练就是这么做的)。所以 detokenize 就是简单的字符串 concat,不要自作主张加空格。但 leading <|begin_of_text|> 这类 special token 在 detokenize 时要被吃掉,不能直接输出到用户屏幕。llama.cpp 通过 llama_vocab::detokenize(..., remove_special=true) 标记区分。

流式场景下,detokenize 必须按 token 增量跑,但不能假设每个 token 是独立可解码的 utf-8 序列。Chinese token "你"是 id 47045,detokenize 后是 3 字节 utf-8 E4 BD A0 ——干净的 1 字符。但 emoji "🤖"(U+1F916)在 BPE 里可能被拆成2 个 token,每个 token 单独解码出来是非法 utf-8。所以服务器要维护一个"待发字节缓冲",每次检查这个缓冲从头能解出多少完整字符,只发出可解部分,剩下半截留到下一次拼接。

Ch.3 covered tokenize. The reverse (detokenize) looks easy: look each id up in the vocab and concatenate. The trap is whitespace: Llama-3's vocab has thousands of tokens that carry a leading space (" the", " hello"). Strip each piece, or join the pieces with your own separator, and you get "thehelloworld" or "the  hello", with spaces in the wrong places.

Correct: each token's string already carries its leading space (BPE training did this for you). So detokenize is just concat — don't add spaces. But special tokens like <|begin_of_text|> must be eaten at detokenize, not shown to the user. llama.cpp distinguishes via llama_vocab::detokenize(..., remove_special=true).

In streaming, detokenize runs incrementally per token, but you can't assume each token decodes to a clean utf-8 substring. The Chinese "你" (id 47045) detokenizes to 3 utf-8 bytes E4 BD A0 — clean. But "🤖" (U+1F916) may be split across 2 BPE tokens; each individually is invalid utf-8. Servers must maintain a "pending byte buffer", checking how much of it forms complete characters, emitting only that, leaving the partial bytes for the next concat.
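The "pending byte buffer" described above is what Python's standard incremental decoder already implements, so a minimal sketch is short. The class name `StreamingDetokenizer` is illustrative, and the way "🤖" is split into two byte chunks below is an assumed example, not the actual Llama-3 tokenization.

```python
import codecs

class StreamingDetokenizer:
    """Accumulate token bytes and release only complete UTF-8 characters."""

    def __init__(self):
        # The incremental decoder keeps incomplete trailing bytes internally.
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def push(self, token_bytes: bytes) -> str:
        """Feed one token's raw bytes; return whatever text is now complete."""
        return self._decoder.decode(token_bytes)

detok = StreamingDetokenizer()
pieces = [
    b"\xe4\xbd\xa0",   # "你": one token, three bytes, decodes immediately
    b"\xf0\x9f",       # first half of U+1F916 (assumed split): nothing to emit yet
    b"\xa4\x96",       # second half: the emoji is now complete
]
emitted = [detok.push(p) for p in pieces]
print(emitted)
```

Only non-empty results need to be written to the SSE stream; the empty string for the half-emoji is the server holding bytes back until the character is whole.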

CHAPTER 24 · LAST MILE · KEY

Constrained decoding — 让模型只能输出合法 JSON

Constrained decoding — forcing valid JSON / regex output

GBNF · 限制采样空间 · 函数调用

GBNF · constrained sampling · function calling

很多生产场景下,你需要模型输出严格格式的内容——比如一段 JSON、一个 SQL 语句、一个函数调用。朴素方法是"在 prompt 里求模型乖一点"——但这不可靠,模型偶尔会瞎写。可靠的方法是constrained decoding:在 sampler chain 里加一道关卡,把所有违法 token 的 logit 设为 -inf,模型从一开始就不可能选到非法 token。

llama.cpp 内置 GBNF(GGML BNF,一种 BNF 变体)语法引擎,vLLM 用 outlines / xgrammar。两者本质一样:用户给一个语法,引擎把它编译成状态机(正则约束用有限状态机就够,GBNF 这类递归语法需要带栈的下推自动机),每生成一个 token 都推进状态,只允许"从当前状态能走出去"的 token。

Many production scenarios need strictly formatted output — JSON, SQL, function calls. The naive approach is "beg the model to behave in the prompt" — but it's unreliable; the model occasionally goes off. The reliable approach is constrained decoding: add a guard to the sampler chain that sets the logit of every illegal token to -inf — the model literally cannot pick an invalid one.

llama.cpp ships a GBNF (GGML BNF, a BNF variant) grammar engine; vLLM uses outlines / xgrammar. Both work the same way: the user provides a grammar, the engine compiles it into a state machine (a finite-state machine is enough for regex constraints; recursive grammars like GBNF need a pushdown automaton with a stack), every emitted token advances the state, and only tokens that can leave the current state are allowed.

grammars/json.gbnf · llama.cpp built-in · a recursive grammar

root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws
object ::= "{" ws ( string ":" ws value ("," ws string ":" ws value)* )? "}" ws
array  ::= "[" ws ( value ("," ws value)* )? "]" ws
string ::= "\"" ( [^"\\\x7F\x00-\x1F] | "\\" (["\\bfnrt] | "u" [0-9a-fA-F]{4}) )* "\"" ws
number ::= "-"? ([0] | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
ws     ::= ([ \t\n] ws)?

grammar mask 的计算 · sampler 链最后一道关

Grammar mask compute · the last guard in the sampler chain

每生成一个 token 之前,grammar sampler 都要做一件事:遍历整个 vocab 表(128k),对每个 token 的字符串问"这串字符接在当前 FSM 状态后面合法吗?" 不合法的 logit 设为 -inf。这听起来贵——但 llama.cpp 做了大量优化:

FSM 状态用 trie 索引,每次只检查那些可能从当前状态出去的 token,而不是全 vocab 扫一遍。
对单字符 token,用 ASCII bitmap 直接查表,极快。
对多字符 token,用 partial match 试探——很多 token 只要前缀就能判定。

实测开 grammar 后 sampling 慢 ~30%——但整体推理时间几乎不变,因为 sampling 本来就占不到 5% 总时间。换来 100% 的格式合规率,是绝对划算的交易。

Before emitting each token, the grammar sampler does one thing: iterate the entire vocab (128k), ask of each token's string "can this string follow the current FSM state legally?" Set illegal tokens' logits to -inf. Sounds expensive — but llama.cpp does heavy optimization:

FSM states are trie-indexed; only tokens that could possibly leave the current state are checked, not the whole vocab.
For single-char tokens, an ASCII bitmap is consulted — extremely fast.
For multi-char tokens, partial matches resolve via prefix in most cases.

Measured: sampling is ~30% slower with grammar on — but total inference time barely moves, because sampling was <5% of the total. 100% format compliance for a near-free cost. Excellent trade.

src/llama-grammar.cpp · llama_grammar_apply_impl() · mask invalid tokens

void llama_grammar_apply_impl(const llama_grammar & gr, llama_token_data_array * cur_p) {
    // Check each candidate token: can it be consumed from the current grammar stacks?
    for (size_t i = 0; i < cur_p->size; ++i) {
        const llama_token id = cur_p->data[i].id;
        const std::string & piece = gr.vocab->token_to_piece(id);
        // Simulate consuming `piece` on the current stacks · can any valid stack state be reached?
        if (!accepts_string(gr.stacks, piece)) {
            cur_p->data[i].logit = -INFINITY;   // kill it
        }
    }
}

// After a token is chosen, advance the grammar state
void llama_grammar_accept_impl(llama_grammar & gr, llama_token id) {
    const std::string & piece = gr.vocab->token_to_piece(id);
    gr.stacks = advance_grammar(gr.stacks, piece);
}
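The same mask-then-pick loop can be shown end to end on a toy scale. The sketch below is not llama.cpp's algorithm: it assumes an eight-entry hypothetical vocabulary and replaces the grammar engine with a hand-written prefix check for one rule, a JSON-style integer. All names (`VOCAB`, `is_valid_prefix`, `apply_grammar_mask`, `greedy_constrained_step`) are made up for illustration.

```python
import math
import re

# Hypothetical toy vocabulary; a real one has on the order of 128k pieces.
VOCAB = ["hello", "1", "23", "0", "-", " ", "7a", "4"]
INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")

def is_valid_prefix(text: str) -> bool:
    """True if `text` is a prefix of some string matching INTEGER."""
    # Every prefix of a valid integer is "", "-", or itself a valid integer.
    return text in ("", "-") or INTEGER.fullmatch(text) is not None

def apply_grammar_mask(logits, generated):
    """Set the logit of every piece that would break the constraint to -inf."""
    return [logit if is_valid_prefix(generated + piece) else -math.inf
            for logit, piece in zip(logits, VOCAB)]

def greedy_constrained_step(logits, generated):
    """Pick the highest-logit piece among those the constraint still allows."""
    masked = apply_grammar_mask(logits, generated)
    best = max(range(len(VOCAB)), key=masked.__getitem__)
    return VOCAB[best]

# The unconstrained favourite is "hello" (5.0), then "7a" (4.0), then " " (3.0).
logits = [5.0, 1.0, 2.0, 0.5, 0.1, 3.0, 4.0, 0.2]
print(greedy_constrained_step(logits, generated=""))
```

The model's three favourite pieces are all masked, so the step falls through to the best legal piece. A real engine also has to allow an end-of-sequence token once the grammar can terminate, which this sketch leaves out.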

Function calling · OpenAI 的真实实现

Function calling · how OpenAI actually does it

OpenAI 的 function calling 看起来像魔法:你给个 schema,模型自动产出符合 schema 的 JSON。背后就是 constrained decoding:开启 strict 模式(Structured Outputs)时,OpenAI 把 JSON Schema 转成上下文无关文法,采样时逐 token 屏蔽非法选项,保证输出符合 schema;不开 strict 的普通 function calling 只靠训练,不保证合法。它的"不会输出非法 JSON"承诺,是工程而不是魔法。

开源世界里,vLLM 的 guided_decoding 参数支持 JSON Schema / regex / lark grammar 三种约束源——它会自动把 JSON Schema 编译成 xgrammar 的内部 FSM。Anthropic 的 tool use API 也是类似实现。

有意思的限制:constrained decoding 让模型保证格式合法,但不保证内容正确。比如你约束输出 {"age": number},模型可能输出 {"age": 999999}——格式合法,内容荒谬。"格式"和"语义"是两件事——constrained decoding 解决前者,后者还得靠模型本身。

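OpenAI's function calling looks magical: give it a schema and the model produces schema-valid JSON. Behind the scenes it is constrained decoding. In strict mode (Structured Outputs), OpenAI converts the JSON Schema into a context-free grammar and masks invalid tokens at every sampling step, so the output always matches the schema; plain function calling without strict mode relies on training alone and carries no such guarantee. The "can't emit invalid JSON" promise is engineering, not magic.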

In the open world, vLLM's guided_decoding parameter supports JSON Schema / regex / lark grammar — it compiles JSON Schema into xgrammar's internal FSM. Anthropic's tool-use API uses a similar mechanism.

An interesting limitation: constrained decoding guarantees format legality, not content correctness. Constrain output to {"age": number} and the model can still emit {"age": 999999} — format legal, content absurd. Format and semantics are different — constrained decoding solves the former, the latter falls back on the model itself.
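The split between format and semantics is easy to make concrete. In the sketch below the first check is what a grammar can guarantee; the second is an application-level rule that no grammar knows about. The function name and the 0 to 130 plausibility bound are assumptions chosen for illustration.

```python
import json

def check_age_payload(raw: str):
    """Return (format_ok, semantics_ok) for a payload shaped like {"age": number}."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, False
    format_ok = (
        isinstance(obj, dict)
        and set(obj) == {"age"}
        and isinstance(obj["age"], (int, float))
        and not isinstance(obj["age"], bool)   # bool is a subclass of int in Python
    )
    # Assumed business rule: a human age between 0 and 130.
    semantics_ok = format_ok and 0 <= obj["age"] <= 130
    return format_ok, semantics_ok

print(check_age_payload('{"age": 999999}'))
print(check_age_payload('{"age": 42}'))
```

Constrained decoding removes the need for the first check; the second still has to run after generation.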

扩展 · EXTENDED

Structured Output 是怎么训练出来的 · schema-aware finetuning

How "structured output" gets trained · schema-aware finetuning

constrained decoding 的缺陷:它强迫模型输出特定 token,但模型可能不擅长在那个 token 上——一个不熟悉 JSON 的模型被强迫输出 JSON,可能在每个 key 上瞎选(因为它的概率分布不集中在合理 key 上)。

OpenAI / Anthropic 的"structured output" SOTA 是constraint + finetune 联动:在 SFT(supervised fine-tuning)阶段大量喂"schema → schema-valid JSON"的对应,模型自己就学会了"看到 schema 描述,自然产出合法 JSON"。这时再叠加 constrained decoding,既双保险又语义自然。

这就是为什么 GPT-4o structured outputs 比 Llama-3 + GBNF JSON 输出的语义质量高一截——前者是"模型本来就想这么写,grammar 兜底",后者是"模型不太会,grammar 硬掰"。同样的工具,语义底子不同,输出质量差很多。

Constrained decoding has a weakness: it forces the model into specific tokens, but the model may be bad at choosing those — forcing a JSON-naive model to emit JSON makes it pick keys somewhat randomly (its probability mass isn't concentrated on sensible keys).

OpenAI's / Anthropic's structured output SOTA is constraint + finetune in tandem: SFT (supervised fine-tuning) heavy on "schema → schema-valid JSON" pairs, so the model learns "see schema description, naturally emit valid JSON". Layer constrained decoding on top — double insurance and semantically natural.

This is why GPT-4o structured outputs are semantically a class above Llama-3 + GBNF JSON — the former is "model wants to write it this way, grammar backstop"; the latter is "model isn't great, grammar forces it". Same tool, different semantic base, very different quality.

CHAPTER 25 · FRONTIERS · KEY

Multimodal — 当 prompt 是一张图片

Multimodal — when the prompt is an image

vision encoder · projector · 像素也变成 token

vision encoder · projector · pixels become tokens too

前 24 章的所有讨论都基于一个隐含前提:输入是文字。但 GPT-4o / Claude 3.5 / Gemini / Llama-3.2 Vision 这些"看得见图"的模型怎么工作?它们没有专门处理图片的内部模块——而是把图片"翻译"成 transformer 听得懂的语言:tokens。一张 512×512 的图,在模型眼里就是 256 个"视觉 token",跟文字 token 一起排成同一个序列送进同一个 transformer。

这件事的精彩在于:整本书前 24 章的所有内容,在多模态模型上都不变——KV cache / attention / MoE / sampling / speculative / 量化全部照旧。改变的只是序列的开头那几百个 token从哪里来。

The whole article so far had an implicit assumption: input is text. So how do "see-the-image" models — GPT-4o / Claude 3.5 / Gemini / Llama-3.2 Vision — work? They don't have a special image module inside; they "translate" images into the language the transformer already understands: tokens. A 512×512 image, in the model's eyes, becomes 256 "visual tokens" sitting in the same sequence as text tokens, fed into the same transformer.

The elegant part: everything in the previous 24 chapters still applies to multimodal models — KV cache, attention, MoE, sampling, speculative, quantization. What changes is only where the first few hundred tokens of the sequence come from.

三步:patch → encoder → projector

Three steps: patch → encoder → projector

主流 vision encoder(LLaVA / Llama-3.2-Vision / Qwen2-VL 等)都走类似流程:

Patch: 把 512×512 的图切成 32×32 的小块,得到 16×16 = 256 个 patches。每个 patch 是 32 × 32 × 3 = 3072 个像素值。
Encoder: 一个独立的 vision transformer(常见 CLIP-ViT-L)把每个 patch 编码成一个 1024 维向量。这部分跟 LLM 完全独立,有自己的 attention / FFN / 24 层 transformer——但通常是预训练好的,推理时只 forward 一次。
Projector: 一个简单的 MLP(典型 2 层 + GeLU)把这 256 个 1024 维向量投射到 LLM 的 hidden 维度(Llama-3 是 4096)。投影完的 256 个 4096 维向量,就当作 token插入 LLM 的输入序列。

这一下子产生 256 个"视觉 token" 塞进序列前面——你可以把它们想象成 256 个"系统说话"——后续 LLM 推理跟纯文本完全一样。所以一张图实际占用 256 个上下文 token + 对应的 KV cache。这是为什么 GPT-4 Vision 的 image input 比 text input 贵——一张高清图等价于 ~1500 个文字 token,而且必须经过完整 prefill。

Mainstream vision encoders (LLaVA / Llama-3.2-Vision / Qwen2-VL) all follow a similar pipeline:

Patch: split a 512×512 image into 32×32 patches → 16×16 = 256 patches. Each patch is 32 × 32 × 3 = 3072 pixel values.
Encoder: an independent vision transformer (commonly CLIP-ViT-L) encodes each patch into a 1024-dim vector. This part is fully separate from the LLM with its own attention / FFN / 24-layer transformer — but typically pretrained-frozen, forward-only at inference.
Projector: a simple MLP (typically 2 layers + GeLU) projects the 256 × 1024-dim vectors to the LLM's hidden dim (4096 for Llama-3). The 256 projected 4096-dim vectors are then treated as tokens, inserted at the front of the LLM's input sequence.

This produces 256 "visual tokens" prepended to the sequence — think of them as 256 "system utterances". Subsequent LLM inference is identical to text. So an image actually occupies 256 context tokens + the matching KV cache. That's why GPT-4 Vision's image input is expensive — a high-res image equals ~1500 text tokens, all of which must go through a full prefill.
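The shape bookkeeping of the three steps can be written down directly. This sketch only tracks sizes, using the example figures from the text (512×512 image, 32×32 patches, 1024-dim encoder output, 4096-dim LLM hidden size). The function name is made up, and real encoders differ in patch size and resolution.

```python
def vision_token_shapes(image_size=512, patch_size=32, channels=3,
                        enc_dim=1024, llm_dim=4096):
    """Sizes at each stage of the patch -> encoder -> projector pipeline."""
    per_side = image_size // patch_size
    n_patches = per_side * per_side
    return {
        "n_patches": n_patches,                                   # visual tokens
        "values_per_patch": patch_size * patch_size * channels,   # raw pixel values
        "encoder_out": (n_patches, enc_dim),                      # after the ViT
        "projector_out": (n_patches, llm_dim),                    # what the LLM sees
    }

shapes = vision_token_shapes()
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Halving the patch size to 16 quadruples the token count to 1024, which is why resolution and patch size dominate the context cost of an image.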

examples/llava/llava-cli.cpp · llama.cpp's vision path · image → embedding → main LLM

// Simplified LLaVA forward · in practice llama.cpp goes through mtmd-cli
// (multimodal CLI · Llama-3.2-V / Qwen2-VL / LLaVA all use it)
int main(int argc, char** argv) {
    // 1. Load the main LLM and the vision encoder · two separate models sharing the ggml backend
    llama_model * llm = llama_model_load_from_file("llama-3.2-11b-vision.gguf", ...);
    clip_ctx * clip = clip_model_load("clip-vit-l.gguf");

    // 2. Process the image · cut patches + run CLIP-ViT encoder + projector
    clip_image_u8 img = clip_image_load_from_bytes("cat.jpg");
    float * image_embd;      // [n_image_tokens × n_embd]
    int n_image_tokens;
    clip_image_encode(clip, img, &image_embd, &n_image_tokens);

    // 3. Feed the image embeddings to the main LLM as a token sequence
    //    using .embd rather than .token (the embd / token either-or from the Ch.5 table)
    llama_batch batch = llama_batch_init(n_image_tokens, n_embd, 1);
    batch.embd = image_embd;
    batch.n_tokens = n_image_tokens;
    llama_decode(ctx, batch);   // runs the 32-layer transformer · prefills the image into the KV cache

    // 4. Then tokenize the text prompt as usual and feed it in after the image
    //    "What's in this image?" → tokenize → batch.token → llama_decode
    //    Note the sequence the LLM sees: [image_tok_0 ... image_tok_255, "What", "'s", " in", ...]

    // 5. Decode the answer · identical to the text-only path
}

为什么 "visual token" 比 "text token" 更耗时

Why visual tokens are more expensive than text tokens

同样占一个 token 位置,visual token 从产生到使用的成本远高于文本 token,因为它要先走一个完整的 vision encoder。以 CLIP-ViT-L 为例:~300M 参数,24 层 transformer,一张 512×512 图的 encoder forward 大约 ~80 GFLOPs,在 H100 上 ~15 ms。

这件事的几个隐性影响:

TTFT 突然变长:加上 vision encoder 的 15 ms,纯文本 prompt 6 ms TTFT 变成 21 ms——用户感知"开始慢"。
视觉 token 很难命中 prefix cache: 图片大多是用户独有的,除非同一张图被重复发送,否则 prefix cache 命中率接近 0。
多图叠加: GPT-4 Vision 允许一次传几张图,n × 256 个 visual token 直接占满 context——一张 4K 高清图就能吃掉 1500 token,5 张就 7500——很容易把 8K 上下文压爆。

所以"多模态模型在视觉任务上推理慢" 不是模型质量问题,是视觉路径上的固定开销。这也是为什么 GPT-4o 这类原生多模态模型用了"visual token 池化"或"动态分辨率"等优化——把 256 个 visual token 减到 64-128 个,但精度损失需要靠 finetune 补。

For the same token slot, a visual token's create-to-consume cost vastly exceeds a text token's, because it first runs through a full vision encoder. Take CLIP-ViT-L: ~300M params, 24 layers; one 512×512 image encoder forward is ~80 GFLOPs, ~15 ms on H100.

Hidden implications:

TTFT spikes: add the 15 ms encoder; a pure-text 6 ms TTFT becomes 21 ms — users feel "slow to start".
Visual tokens rarely hit the prefix cache: images are mostly user-unique, so unless the same image is sent again the prefix cache hit rate is close to 0.
Multi-image bloat: GPT-4 Vision allows multiple images; n × 256 visual tokens fill context fast — one 4K hi-res image eats 1500 tokens, 5 images = 7500 — easy to blow an 8K window.

So "multimodal inference is slow on vision tasks" isn't a model-quality issue — it's a fixed visual-path overhead. This is why GPT-4o-class native multimodal models use "visual token pooling" or "dynamic resolution" optimizations — squeezing 256 visual tokens down to 64-128, with precision losses recovered through finetuning.

扩展 · EXTENDED

三类多模态融合 · Early / Late / Cross-attention

Three flavors of multimodal fusion · Early / Late / Cross-attention

上面讲的"把 visual token 拼到 text token 前面" 是 early fusion(LLaVA 一类)。还有三种思路:

Late fusion(Flamingo 2022): 单独 vision encoder,只在特定的 LLM 层之间插入"视觉信息"——通过额外的 cross-attention 层让 text token 去"看"图片。优点:vision token 不占 LLM 序列长度,节省 KV cache;缺点:架构复杂,部署难。
Cross-attention(Llama-3.2-Vision): 类似 Flamingo,但 cross-attention 集成更紧密。Meta 的 Llama-3.2 Vision(11B / 90B)用这种,在 LLM 主路径每 4 层插入一个 cross-attention 块,vision features 通过这些块流入,不占主序列。
Native multimodal(GPT-4o, Gemini): 不区分 vision encoder 和主 LLM——同一个 transformer 从训练开始就同时见过 text token 和 visual token。架构上跟 LLaVA-style 没差别,但训练目标完全不同,效果更好但模型本身重一倍。

实际推理引擎里:llama.cpp 主要支持 early fusion(LLaVA 兼容),Llama-3.2-Vision 的 cross-attention 变体需要专门支持(因为它要在 graph 里插入额外的 attention 节点,KV cache 设计也不同)。支持新的多模态架构 是推理引擎团队最常加班的事——每出一个新模型,kernel + graph 都要改。

The above, "prepend visual tokens to text tokens", is early fusion (LLaVA-style). Three other approaches:

Late fusion (Flamingo 2022): separate vision encoder; inject "visual info" only at specific LLM layer boundaries via extra cross-attention layers that let text tokens "look at" the image. Pros: visual tokens don't occupy LLM sequence length, saving KV cache. Cons: architectural complexity, deployment difficulty.
Cross-attention (Llama-3.2-Vision): similar to Flamingo, more tightly integrated. Meta's Llama-3.2 Vision models (11B / 90B) use this, inserting a cross-attention block every 4 layers in the main LLM path; vision features flow through these blocks without taking sequence space.
Native multimodal (GPT-4o, Gemini): no separation between vision encoder and main LLM — same transformer trained on text and visual tokens together from the start. Architecturally identical to LLaVA-style but with totally different training objectives. Better quality but models are 2× heavier.

In real inference engines: llama.cpp mainly supports early fusion (LLaVA-compatible); Llama-3.2-Vision's cross-attention variant needs dedicated support (extra attention nodes in the graph, different KV cache layout). Supporting new multimodal architectures is what inference-engine teams pull overtime for — each new model means kernel and graph changes.

扩展 · EXTENDED

音频 · 视频 · 通用模态 · 扩展到 visual token 之外

Audio · video · general modalities · beyond visual tokens

同样的"把任意东西编码成 token" 思路对其他模态也成立:

音频(Whisper / GPT-4o-realtime / Qwen2-Audio): 1 秒 16kHz 音频 → mel-spectrogram → audio encoder → ~50 个 audio token。语音输入比图片更密集——15 秒一段语音就 ~750 个 token。这就是为什么"实时语音 AI" 推理成本极高——每秒输入持续吃 50 个 prefill token。
视频(Gemini 1.5 / Qwen2-VL): 1 秒视频 ≈ 1 帧 × 256 visual token + 1 秒 audio × 50 audio token = ~306 token/s。一段 60 秒视频就是 ~18000 token——直接耗光 8K 上下文,必须靠 Gemini 这种 1M+ 上下文的模型才能处理。
3D / 点云(Llama-Mesh, 还很早期): 把 3D 点云编码成 token——已经有论文跑通,效果还在改进。
具身智能(Octo / RT-2): 机器人传感器读数 + 摄像头 + 关节状态全部编码成 token,模型输出动作 token。本质都是"token-in token-out"。

所以 LLM 推理引擎本质上在变成"通用 token 处理器"——前端接什么模态的 encoder 不重要,只要能产出 token 就行。前 24 章讲的全部推理优化(KV cache / attention / sampling / batching / 量化)全部适用。这是 LLM 范式的力量:把一切异构数据统一到 token 序列上,优化路径就只有一条。

The same "encode anything as tokens" idea works for other modalities:

Audio (Whisper / GPT-4o-realtime / Qwen2-Audio): 1s of 16kHz audio → mel-spectrogram → audio encoder → ~50 audio tokens. Voice is denser than images — 15s of speech = ~750 tokens. This is why "real-time voice AI" inference is so expensive — every second of input continuously eats 50 prefill tokens.
Video (Gemini 1.5 / Qwen2-VL): 1s of video ≈ 1 frame × 256 visual tokens + 1s audio × 50 audio tokens = ~306 tokens/s. A 60-second video = ~18000 tokens — instantly torches an 8K window; you need Gemini-class 1M+ context to handle it.
3D / point clouds (Llama-Mesh, very early): encode 3D point clouds as tokens — papers work, quality improving.
Embodied AI (Octo / RT-2): robot sensor readings + camera + joint states all encoded as tokens; the model outputs action tokens. Same "token-in, token-out" structure.

So LLM inference engines are essentially becoming "universal token processors" — what encoder feeds the front-end doesn't matter, as long as it produces tokens. Everything from the previous 24 chapters (KV cache / attention / sampling / batching / quantization) applies. That's the power of the LLM paradigm: unify all heterogeneous data into token sequences and have only one optimization path.
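The per-modality token rates quoted above turn into a quick context-budget check. A minimal sketch follows. The constants are the approximate figures from the text (~50 audio tokens per second, 256 visual tokens per frame, 1 frame per second), and the function names are illustrative.

```python
AUDIO_TOKENS_PER_SEC = 50        # approximate rate quoted in the text
VISUAL_TOKENS_PER_FRAME = 256    # one frame encoded like a 512x512 image

def audio_tokens(seconds: int) -> int:
    return seconds * AUDIO_TOKENS_PER_SEC

def video_tokens(seconds: int, fps: int = 1) -> int:
    """Frames plus the accompanying audio track."""
    return seconds * (fps * VISUAL_TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SEC)

def fits(n_tokens: int, context_window: int) -> bool:
    return n_tokens <= context_window

print(audio_tokens(15))                      # 15 s of speech
print(video_tokens(60))                      # 60 s of video at 1 fps
print(fits(video_tokens(60), 8192))          # 8K window
print(fits(video_tokens(60), 1_000_000))     # 1M-class window
```

Raising `fps` above 1 scales the visual share linearly, which is why long-video models sample frames sparsely.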

CHAPTER 26 · FRONTIERS · KEY

Reasoning models — 当 decode 要花 10000 个 token

Reasoning models — when decode runs for 10000 tokens

o1 / R1 · test-time compute · "思考"也是推理

o1 / R1 · test-time compute · "thinking" is inference too

2024 年 9 月 OpenAI 发布 o1,把整个 LLM 推理的economics翻了个底朝天。之前的模型生成 100-1000 个 token 就给出最终回答;o1 / DeepSeek-R1 / Gemini-2-Flash-Thinking 可以"思考" 10000-50000 个内部 token,然后才输出几百 token 的最终回答。这些"思考 token"对用户是不可见的,但推理引擎要扎扎实实地生成它们。

从架构上,reasoning 模型跟普通 LLM 一模一样——同样的 transformer / attention / FFN / MoE。所有奇迹都来自训练数据 + RL 奖励:模型被训练成在给出答案前先生成一大段"思考过程"(在 <think>...</think> 标签里包着),并通过 RLHF / RLVR 让"思考能提升答案质量" 成为模型的本能。

但这件事在推理工程上的影响巨大。

September 2024, OpenAI released o1 and flipped the entire LLM inference economics upside down. Pre-o1 models emitted 100-1000 tokens before finalizing an answer; o1 / DeepSeek-R1 / Gemini-2-Flash-Thinking can "think" 10000-50000 internal tokens before producing a few hundred tokens of final answer. These "thinking tokens" are invisible to the user but the inference engine generates them all the same.

Architecturally, reasoning models are identical to regular LLMs — same transformer / attention / FFN / MoE. The magic comes from training data + RL rewards: the model is trained to produce a long "thinking" segment (wrapped in <think>...</think> tags) before answering, and RLHF / RLVR makes "thinking improves answer quality" an instinct.

But the impact on inference engineering is enormous.

o1 时代的成本曲线 · decode 成主战场

The o1-era cost curve · decode becomes the main battlefield

第 2 章那张 "TTFT + n_out × TPOT" 表在 o1 时代彻底变形。一个典型 o1 query:

prompt 200 token · prefill 几 ms · TTFT ~10 ms
"thinking" 阶段: 内部生成 20000 token · 全是 decode · 在 Llama-3-70B 上 ~20 秒
"answer" 阶段: 真正的回答 200 token · 又一段 decode · ~0.2 秒
总 wall time: ~20.2 秒

用户看到的是"慢但准" 的体验——但服务方付的是20 秒的 GPU 时间,而不是之前的 ~2 秒。一个 o1 query 的真实推理成本比同等长度的 GPT-4 答案贵 10×——这就是为什么 o1 API 定价高,而且 OpenAI 把"reasoning effort" 做成参数(low / medium / high)给用户选——让用户在成本和质量之间显式选。

The Ch.2 "TTFT + n_out × TPOT" formula warps completely in the o1 era. A typical o1 query:

Prompt 200 tokens · prefill few ms · TTFT ~10 ms
"Thinking" phase: 20000 internal tokens generated · all decode · ~20s on Llama-3-70B
"Answer" phase: actual answer 200 tokens · another decode · ~0.2s
Total wall time: ~20.2s

User sees "slow but accurate" — but the service pays for 20 seconds of GPU time, not the ~2s from before. An o1 query's real inference cost is 10× a same-length GPT-4 answer — which is why o1 API pricing is high, and why OpenAI exposes "reasoning effort" as a parameter (low / medium / high) letting users explicitly trade cost for quality.
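Plugging the example query into the "TTFT + n_out × TPOT" formula makes the imbalance explicit. The sketch assumes a TPOT of 1 ms per token, which is what the figures above imply (20000 tokens in ~20 s). The function name is illustrative.

```python
def wall_time_s(ttft_ms: float, n_out: int, tpot_ms: float) -> float:
    """Latency model from Ch.2: time to first token plus per-token decode time."""
    return (ttft_ms + n_out * tpot_ms) / 1000

TTFT_MS = 10      # from the example query
TPOT_MS = 1.0     # assumption implied by "20000 tokens in ~20 s"

thinking = 20000
answer = 200
total = wall_time_s(TTFT_MS, thinking + answer, TPOT_MS)
decode_share = (thinking + answer) * TPOT_MS / (total * 1000)

print(f"total: {total:.2f} s, decode share: {decode_share:.2%}")
print(f"thinking share of output tokens: {thinking / (thinking + answer):.1%}")
```

Prefill all but disappears from the total. Almost all of the wall time is decode, and about 99% of the decoded tokens are ones the user never sees.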

推理工程的新优化方向

The new inference optimizations this triggers

o1 时代的成本结构推动了几个很新的推理优化方向:

thinking 不流式返回: 用户既然看不到 thinking,这一段就不必追求逐 token 的低延迟。token 仍然要一个一个自回归地生成,但调度器可以用更大的 batch 换吞吐,这跟 prefill 的混批策略类似。OpenAI 的 reasoning models 实际上不流到客户端,等到 </think> 后才开始流答案,客户端体验是"很久没动静,然后一下子开始输出"。
KV cache for thinking: 20000 个 thinking token 全部占 KV cache。如果是 Llama-3-70B,~20000 × 130 KB = 2.6 GB/用户——之前文中算过的 70B 服务并发上限直接砍掉一半。
thinking compression: 学术界已经有 paper 探讨"用更少的 thinking token 达到同样效果"——比如让 thinking 阶段用更激进的 quantize / pruning,因为这部分不直接见用户。也有 "thinking summarization" 想法:每生成 N 个 thinking token,模型把它们压缩成更短的 latent。这是非常新的研究方向。
parallel thinking: 模型可以"同时探索多个思路"——在 decoding 时维护多个 branch,选最有希望的那个继续。本质是 beam search 的回归(C14 扩展讨论过),但只用在 thinking 阶段。OpenAI 没公开 o1 的具体实现,但学术界的 Self-Consistency / Tree of Thoughts 已经是这个方向。

有意思的是:reasoning 模型几乎是第一次让推理引擎需要为"大量串行 decode" 而不是"低延迟用户响应" 做优化。这反过来又催生 speculative decoding / fp4 / 更高效 attention kernel 等被研究界忽视已久的硬件路线——因为现在 decode 速度直接决定 reasoning 质量。

The o1-era cost structure drives several very new inference optimization directions:

Thinking isn't streamed: users can't see thinking, so this phase doesn't need low per-token latency. Tokens are still generated autoregressively one at a time, but the scheduler can trade latency for throughput with larger batches, similar to prefill mixing. OpenAI's reasoning models don't stream to the client until </think>; client-side it feels like "silent for a long time, then suddenly outputs".
KV cache for thinking: 20000 thinking tokens all occupy KV cache. On Llama-3-70B: ~20000 × 130 KB = 2.6 GB/user — halves the 70B service concurrency limit we computed earlier.
Thinking compression: papers explore "fewer thinking tokens, same effect" — e.g. more aggressive quantization or pruning during thinking, since users don't see it. Also "thinking summarization" ideas: every N thinking tokens, compress to a shorter latent. Very new research direction.
Parallel thinking: model "explores multiple lines of thought simultaneously" — maintain branches during decode, pick the most promising. Beam-search resurrection (Ch.14 extended), confined to thinking. OpenAI doesn't disclose o1's specifics, but academic Self-Consistency / Tree of Thoughts heads this direction.

Notably: reasoning models are almost the first time inference engines need to optimize for "massive serial decode" rather than "low-latency user response". This in turn drives speculative decoding / fp4 / more efficient attention kernels — hardware paths the research community had been neglecting — because now decode speed directly determines reasoning quality.
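The KV-cache item in the list above is a one-line calculation, sketched here with the article's own per-token figure (~130 KB per token, taken as given). The function name is illustrative.

```python
def kv_cache_gb(n_tokens: int, kv_kb_per_token: float = 130) -> float:
    """KV cache footprint in GB for one sequence (decimal units)."""
    return n_tokens * kv_kb_per_token * 1e3 / 1e9

chat = kv_cache_gb(200 + 200)               # 200-token prompt + 200-token answer
reasoning = kv_cache_gb(200 + 20000 + 200)  # same query with 20000 thinking tokens

print(f"chat: {chat:.3f} GB, reasoning: {reasoning:.3f} GB, ratio: {reasoning / chat:.0f}x")
```

Under this figure a single reasoning request holds about as much KV cache as fifty short chat requests, and it holds it for the whole thinking phase.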

不同推理 effort 的成本对比 · 假设 Llama-3-70B SKU

Cost across reasoning effort levels · Llama-3-70B SKU

档位 / tier	输出 token · 耗时 / output tokens · wall time	单次成本 / cost per query
chat baseline (no thinking)	200 out · 3 s	$0.001
o1-mini equivalent (low)	1000 think + 200 ans · 12 s	$0.004
o1 equivalent (medium)	5000 + 200 · 50 s	$0.016
o1 pro / R1 (high)	20000 + 500 · 3 min	$0.06
o3-tier high effort	80000 + 1000 · 12 min	$0.24

为什么这章重要 · WHY THIS CHAPTER MATTERS

2022-2024 的 LLM 推理工程主要矛盾是 "decode 太慢" → 各种 trick 让 decode 更快。2024 末开始,reasoning 模型把 decode 数量放大了 50×,decode 加速不再是奢侈品,而是必需品。第 18-22 章那些优化技术,在 reasoning 模型上的边际收益翻倍。这就是为什么 2025 年所有头部公司都在 reasoning + 推理引擎优化两条腿走:它们是同一个商业战役的两面。

From 2022 to 2024, the main tension in LLM inference engineering was "decode is too slow", answered by tricks to make decode faster. Starting in late 2024, reasoning models multiplied decode volume by 50×. Decode speedup is no longer a luxury; it is a requirement. The optimizations of chapters 18-22 all have doubled marginal value on reasoning models. This is why every frontier lab in 2025 pushes reasoning and inference-engine optimization at the same time: they are two sides of the same commercial battle.

CHAPTER 27 · FRONTIERS

硬件代际 — Ampere / Hopper / Blackwell 给推理换了三次跑鞋

Hardware eras — Ampere / Hopper / Blackwell, three pairs of running shoes

A100 → H100 → B200 · 同一个模型 · 不同的物理舞台

A100 → H100 → B200 · same model · different physical stage

前 26 章基本都把硬件当"就这样"——但其实每两年NVIDIA 发布一代新 GPU,推理引擎都要大改一遍。从 2020 年的 A100 到 2024 年的 H200 到 2025 年的 B200,关键性能指标变了几倍,推理路径上的优化策略也跟着变。

The previous 26 chapters mostly treated hardware as "given" — but every two years NVIDIA ships a new GPU generation and inference engines have to rewrite significant parts. From A100 (2020) to H200 (2024) to B200 (2025), key performance metrics moved by multiples; optimization strategy follows.

	A100 80GB	H100 80GB	H200 141GB	B200 192GB
Architecture	Ampere	Hopper	Hopper	Blackwell
fp16 TFLOPS (dense)	312	990	990	2250
fp8 TFLOPS (dense)	n/a	1980	1980	4500
fp4 TFLOPS (dense)	n/a	n/a	n/a	9000
HBM capacity	80 GB	80 GB	141 GB	192 GB
HBM bandwidth	2.0 TB/s	3.35 TB/s	4.8 TB/s	8.0 TB/s
NVLink (bidirectional)	600 GB/s	900 GB/s	900 GB/s	1800 GB/s
Key new hardware	Tensor Core gen 3	fp8 · TMA · WGMMA	+ HBM3e	fp4 · 2nd-gen Transformer Engine
Inference impact	FlashAttention v1 takes off	fp8 KV · FA v3 · TMA prefetch	long ctx · large models on one card	fp4 weights · train/infer fused

FIG. 15 The same Llama-3-70B model behaves very differently across four GPU generations. Per the table, HBM bandwidth grows 4× and peak tensor throughput grows about 7× at fp16, or about 29× going from fp16 on A100 to fp4 on B200. On A100, 70B at fp16 (~140 GB of weights) needs multi-card TP, typically 4 cards once KV cache is counted. On H100 the fp8 weights (~70 GB) fit on one card. On H200 even the fp16 weights just fit in 141 GB, with almost no room left for KV cache. A 405B model quantized to about 4 bits (~200 GB) is roughly the size of one B200's 192 GB, so it fits on a single card only with slightly more aggressive quantization. The boundary of "large-model deployment" is pushed along by GPU generations, which is one reason flagship model sizes tend to track the memory capacity of the flagship GPU of their year.
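
The "fits on one card" statements in the caption are simple arithmetic: parameters times bytes per weight, compared against HBM capacity. A minimal sketch follows, using the capacities from the table above. The helper names `weight_gb` and `min_cards` are hypothetical, and the calculation deliberately counts weights only, ignoring KV cache, activations, and quantization scales.

```python
import math

# HBM capacity per card in GB, copied from the table above.
HBM_GB = {"A100": 80, "H100": 80, "H200": 141, "B200": 192}
# Bits stored per weight; quantization scales and metadata are ignored (assumption).
BITS = {"fp16": 16, "fp8": 8, "fp4": 4}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BITS[fmt] / 8

def min_cards(params_billion: float, fmt: str, gpu: str) -> int:
    """Smallest number of cards whose combined HBM holds the weights alone."""
    return math.ceil(weight_gb(params_billion, fmt) / HBM_GB[gpu])

for params, fmt, gpu in [(70, "fp16", "A100"), (70, "fp8", "H100"),
                         (70, "fp16", "H200"), (405, "fp4", "B200")]:
    print(f"{params}B {fmt} on {gpu}: {weight_gb(params, fmt):.1f} GB of weights, "
          f"at least {min_cards(params, fmt, gpu)} card(s)")
```

Because only weights are counted, the result is a lower bound. 70B at fp16 on A100 comes out at 2 cards, and the 4-card TP quoted in the caption reflects the headroom that KV cache and activations need in practice. The 405B fp4 case lands at 202.5 GB, just over one B200's 192 GB.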

What Hopper brought to inference · three hardware weapons

Three new hardware units in Hopper (H100) are the physical basis for nearly every inference optimization of the past two years:

WGMMA (Warp Group MMA): asynchronous tensor-core instructions. A warp group of four warps issues one matrix multiply on a tile of up to 64×256×16 without blocking the issuing threads, so the next MMA can be issued immediately. This is a large part of why FlashAttention v3 is 1.5-2× faster than v2: with the older synchronous MMA, the SM idled while waiting for results.
TMA (Tensor Memory Accelerator): a separate DMA engine that takes the job of moving data from HBM to SRAM off the SM. While the SM runs MMAs, TMA prefetches the next tile, so compute and memory traffic genuinely overlap; previously the SM's own threads had to issue and track those copies.
fp8 / Transformer Engine: native hardware fp8 GEMM at 2× the throughput of fp16. Paired with the automatic quantization calibration in the Transformer Engine software, fp8 inference is close to "flip a switch": little manual tuning, with precision loss kept in check by the calibration.

Combined, these lift attention-kernel utilization for fp16 models on H100 from about 50% to 70%+, and fp8 roughly doubles throughput again. None of these gains arrive without writing new kernels, which is why the path from new hardware to a production inference stack tends to take 6-12 months.

EXTENDED Blackwell · the fp4 era · the train/inference boundary blurs

NVIDIA's B200 / GB200 NVL72 systems (2025) mark a new watershed for LLM inference. Two key shifts:

fp4 hardware (MXFP4 / NVFP4): 4-bit floats (1 sign bit, 2 exponent bits, 1 mantissa bit, plus a scale shared by each small block of values). They hold accuracy better than int4 because the exponent bits preserve dynamic range, and GEMM throughput is 2× that of fp8. A 405B model in fp4 takes ~200 GB, so a single GB200 node (GPUs linked by 5th-gen NVLink at 1.8 TB/s) can serve it at 70B-class throughput.
GB200 NVL72: a single rack with 72 B200 GPUs + 36 Grace CPUs, all NVLink-interconnected. Models at 405B / 671B / 1T+ that previously required cross-node InfiniBand can now run TP=72 inside one rack, which removes roughly 70% of the cross-node latency bottleneck.

Deeper still: Blackwell's "Transformer Engine 2.0" puts training-time fp8 auto-calibration and inference-time fp4 deployment in the same software stack. A model trained in fp8 can be deployed in fp4 directly, with the loss measured and compensated automatically. Training and inference toolchains are starting to merge, for the first time in five years. The next step, probably in 2026-2027, is training that is optimized for fp4 deployment from the start (quantization-aware training as the default).
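
To make "4-bit float" concrete, here is a minimal sketch of the E2M1 layout that fp4 formats use for the element values (1 sign bit, 2 exponent bits, 1 mantissa bit), together with a toy block quantizer. The function names are hypothetical, and the shared scale is kept as a plain Python float for readability. That is an assumption of this sketch: the real MXFP4 / NVFP4 formats store the per-block scale in a compact encoding of their own.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

# All representable values; +0 and -0 collapse, leaving 15 distinct numbers.
GRID = sorted({decode_e2m1(c) for c in range(16)})

def quantize_block(xs):
    """Quantize one block: pick a shared scale so the largest magnitude maps to 6.0,
    then round every element to the nearest representable E2M1 value."""
    amax = max(abs(x) for x in xs)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = [min(GRID, key=lambda g: abs(x / scale - g)) for x in xs]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate real values from the shared scale and the grid values."""
    return [scale * g for g in q]

print([g for g in GRID if g >= 0])            # the non-negative half of the grid
scale, q = quantize_block([0.1, -0.6, 1.2, 0.3])
print(scale, q, dequantize_block(scale, q))
```

The grid is dense near zero and sparse near the maximum, which is the "dynamic range preserved" property the text contrasts with int4's uniform steps. The shared scale is what lets such a coarse grid cover blocks of very different magnitude.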

EXTENDED Non-NVIDIA paths · what TPU / MI300X / Trainium are doing

NVIDIA's roughly 80% share of the LLM-inference market does not go unchallenged, but each challenger takes a different path:

Google TPU v5p / v6: used mainly in-house and not sold as hardware; outside Google it is reachable only through Google Cloud. The big advantage is the 2D mesh interconnect: 256 TPUs directly linked, reportedly at higher bandwidth than NVLink. Gemini trains and serves entirely on TPU. The software stack is JAX/XLA, largely separate from the PyTorch ecosystem.
AMD MI300X: 192 GB HBM3 per card, fp16 compute close to H100, about 30% cheaper. The weakness is software: ROCm trails CUDA by about 5 years, and its kernel libraries (rocBLAS / rocFFT) are far less mature than CUDA's. vLLM added ROCm support in 2024, but production deployments are rare.
AWS Trainium / Inferentia: AWS's in-house accelerators (Trainium aimed at training, Inferentia at inference); Trainium 2 launched in 2024. Good price-performance, but available only on AWS and locked to its ecosystem.
Cerebras / SambaNova / Groq: unconventional architectures. Groq replaces HBM with large amounts of SRAM and keeps the entire model's weights in SRAM. Llama-3-70B decodes at ~750 tokens/s (industry-leading), but it cannot run models above 80B and does not batch (that speed is per single query). Cerebras uses a wafer-scale single chip to fit a 100B model on one wafer. These are niche "bet the architecture" plays.

Trend: NVIDIA's lead in software ecosystem and compute should persist for another 2-3 years, after which the AMD MI series is the most likely to catch up, with the software gap gradually closed by vLLM, SGLang, and other open-source projects. TPU and Trainium are walled gardens: not necessarily worse technically, but ecosystem-locked.

WHY A WHOLE CHAPTER ON HARDWARE Every "why it is written this way" in the previous 26 chapters traces back to hardware. KV-cache size matters because HBM is scarce and expensive; FlashAttention matters because SRAM is small; quantization matters because fp16 GEMM is not fast enough; MoE took off because NVLink can support expert parallelism. Inference engineering is physical engineering: without knowing what current hardware looks like, you cannot read why inference engines are designed the way they are. From 2026 this boundary shifts again with Blackwell and NVL72, so this chapter is not a history lesson but a near-term roadmap.

CHAPTER 28 · CODA · TOOLBOX

How to trace an inference yourself

every claim in the previous 27 chapters · verify it with your own eyes

This chapter introduces nothing new; it only shows how to look for yourself. Every claim in the previous 27 chapters can be verified, and below is the most practical toolkit for doing so.

1. llama.cpp's verbose / perf logs

terminal · llama-cli with timing · step 1

# Basic timing output · the source of the table in Ch. 16
$ ./llama-cli -m llama-3-8b.gguf -p "你好,llama" -n 100 --log-verbose

# Tail of the output · printed by default · no extra flag needed
llama_perf_sampler_print: sampling time = 1.23 ms / 105 runs
llama_perf_context_print: load time = 812.4 ms
llama_perf_context_print: prompt eval time = 5.8 ms / 6 tokens
llama_perf_context_print: eval time = 1504.2 ms / 100 runs

# prompt eval = prefill · eval = decode
# You have just verified the claim "prefill 6 ms vs decode 100 × 15 ms"
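
If you want the per-token numbers rather than the totals, the perf lines are easy to post-process. The sketch below parses the four lines exactly as they are shown above and derives milliseconds per item and items per second. The regex and the helper name `parse_perf` are my own and only assume the "name time = X ms / N tokens|runs" shape of those lines; the real log carries extra columns that the pattern simply ignores.

```python
import re

# The perf lines as shown in step 1 (real output has extra per-token columns).
LOG = """\
llama_perf_sampler_print: sampling time = 1.23 ms / 105 runs
llama_perf_context_print: load time = 812.4 ms
llama_perf_context_print: prompt eval time = 5.8 ms / 6 tokens
llama_perf_context_print: eval time = 1504.2 ms / 100 runs
"""

# Matches "<name> time = <ms> ms / <count> tokens|runs".
# Lines without a count (such as load time) do not match and are skipped.
PATTERN = re.compile(
    r":\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)

def parse_perf(log: str) -> dict:
    """Return {stage: {"ms_per_item": ..., "items_per_s": ...}} for each timed stage."""
    stats = {}
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            ms, n = float(m["ms"]), int(m["n"])
            stats[m["name"].strip()] = {
                "ms_per_item": ms / n,
                "items_per_s": n / ms * 1000.0,
            }
    return stats

for stage, s in parse_perf(LOG).items():
    print(f"{stage:12s} {s['ms_per_item']:7.3f} ms/item  {s['items_per_s']:8.1f} items/s")
```

On the numbers above, prefill costs just under 1 ms per prompt token while decode costs about 15 ms per generated token, which is the prefill/decode asymmetry the chapter keeps returning to.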

2. Inspect the graph · ggml_graph_print

Every forward pass builds a ggml_cgraph, the static-graph form of Ch. 16's stack trace. Print it and you can see every node across all 32 layers.

examples/main/main.cpp · enable graph dump · step 2

// Add this line just before graph_compute
ggml_graph_print(gf);   // print the whole graph to stderr

// Or export to DOT and render it with graphviz
ggml_graph_dump_dot(gf, NULL, "graph.dot");

$ dot -Tsvg graph.dot -o graph.svg   # 4096 nodes · quite a sight

3. Nsight Compute / Nsight Systems · GPU view

Want to check whether the attention step really saturates the SMs on an H100? Use NVIDIA's own Nsight tools. ncu (Compute) gives per-kernel microscopy: FLOPs utilization, memory throughput, warp-stall reasons. nsys (Systems) gives the full timeline: you can see how many kernel launches fit into prefill's 5 ms and how long each decode step waits on memory.

terminal · nsys profile · capture one inference · step 3

# --capture-range=cudaProfilerApi is left out on purpose: it records nothing
# unless the binary itself calls cudaProfilerStart()
$ nsys profile -o llm-trace --trace=cuda,nvtx \
    ./llama-cli -m llama-3-8b.gguf -p "hi" -n 100

# Open llm-trace.nsys-rep · you will see:
# - ~100 kernel launches inside prefill's 5 ms · mostly cublasGemmEx
# - ~15 ms per decode step · single kernels ~50-200 μs · long idle stretches waiting on HBM
# - flash_attn_ext is the fused attention kernel · ~80 KFLOPs/step · compute sits idle
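
The "waiting on HBM" pattern in the trace has a simple lower bound behind it: at batch size 1, every decode step has to stream all the weights from HBM once, so the step can never be faster than weight bytes divided by memory bandwidth. A rough sketch follows, using the bandwidths from the hardware table in Chapter 27. The function names are hypothetical, and KV-cache reads and kernel launch overhead are ignored, so real steps are slower.

```python
# HBM bandwidths in TB/s, taken from the hardware table in Chapter 27.
HBM_TBPS = {"A100": 2.0, "H100": 3.35, "H200": 4.8, "B200": 8.0}

def weight_bytes(params_billion: float, bits_per_weight: int) -> float:
    """Total bytes of weights that one decode step must read."""
    return params_billion * 1e9 * bits_per_weight / 8

def min_step_ms(params_billion: float, bits_per_weight: int, tb_per_s: float) -> float:
    """Lower bound on one batch-1 decode step: all weights read from HBM exactly once."""
    return weight_bytes(params_billion, bits_per_weight) / (tb_per_s * 1e12) * 1e3

def implied_bandwidth_tbps(params_billion: float, bits_per_weight: int, step_ms: float) -> float:
    """Effective bandwidth implied by a measured step time, under the same assumption."""
    return weight_bytes(params_billion, bits_per_weight) / (step_ms * 1e-3) / 1e12

for gpu, bw in HBM_TBPS.items():
    print(f"8B fp16 on {gpu}: at least {min_step_ms(8, 16, bw):.2f} ms per decode step")
print(f"a 15 ms step implies ~{implied_bandwidth_tbps(8, 16, 15.0):.2f} TB/s effective")
```

Read this way, the ~15 ms decode step quoted in this chapter corresponds to roughly 1 TB/s of effective bandwidth for an 8B fp16 model. Comparing that figure with your own card's peak tells you how far a decode kernel is from the memory roof.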

4. transformers · read the reference implementation

llama.cpp is an inference engine with heavy optimizations. HuggingFace's transformers is the reference implementation: slow, but every step is exposed in plain Python, so it is easy to set breakpoints, add prints, and change formulas. To watch the hidden state grow layer by layer, patch LlamaModel.forward:

transformers/models/llama/modeling_llama.py · LlamaModel.forward · step 4

# Add one print per layer
for idx, layer in enumerate(self.layers):
    # some transformers versions return a tuple here; if so, take element [0]
    hidden_states = layer(hidden_states, ...)
    print(f"layer {idx} · mean={hidden_states.mean().item()} · "
          f"std={hidden_states.std().item()}")

# Run it and watch how the "shape" of the hidden state evolves across the 32 layers
# First few layers: mean ≈ 0 · later layers gradually show structure · this is what an LLM's "thinking" looks like

5. Quantization before/after · llama-perplexity

"How much does Q4_K_M quantization lose compared with fp16?" Measure it directly with llama.cpp's bundled llama-perplexity tool:

terminal · llama-perplexity · step 5

# Run the wikitext-2 test set through both models
$ ./llama-perplexity -m llama-3-8b-Q4_K_M.gguf -f wikitext-2.txt
$ ./llama-perplexity -m llama-3-8b-fp16.gguf -f wikitext-2.txt

# Typical output: fp16 PPL ≈ 6.5 · Q4_K_M PPL ≈ 6.7 · ~3% worse
# Memory drops from 16 GB to 5 GB · that is the real ROI of quantization
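
For reference, the number the tool reports is the exponential of the mean negative log-likelihood per token, and the "~3%" is a relative change between two such numbers. A minimal sketch of both calculations follows. The helper names are hypothetical, the log-probabilities are made up for illustration, and the PPL and memory figures are the typical values quoted above, not new measurements.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

# Toy check: a model that assigns probability 1/4 to every token has perplexity 4.
toy = [math.log(0.25)] * 8
print(f"toy perplexity: {perplexity(toy):.2f}")

# The typical figures quoted in step 5.
print(f"PPL change:    {relative_change(6.5, 6.7):+.1%}")
print(f"memory change: {relative_change(16.0, 5.0):+.1%}")
```

Putting the two ratios side by side is the whole argument for Q4_K_M: roughly 3% worse perplexity in exchange for roughly 69% less memory.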

The methodology of this article

Every chapter did the same thing: pull a piece of real llama.cpp source onto the page, put it in context, and explain line by line why it is written that way. The recipe generalizes to any kind of systems writing: more readable than a paper, deeper than a tutorial, and replayable in a way a talk is not. If you want to write in this style, the process is roughly:

Pick a real question: not "what is attention", but "why is decode so slow".
Find real source: HuggingFace transformers as the reference, llama.cpp / vLLM as production. Read both ends.
Construct a through-line prompt / through-line code: every chapter orbits it, so the reader never loses context.
One figure, one source snippet, one number per chapter: concept diagram + real code + measured datum. It only "lands" with all three.
Re-read your own draft: ask of every paragraph "does this carry information?" and cut it if not.

CREDITS & CAVEATS Code citations track the structure of the llama.cpp / ggml main branch (as of early 2026). Function names and file paths are relatively stable, but specific line numbers drift between versions, so defer to the repo at HEAD. vLLM citations follow the v0.6.x line. All benchmark numbers are commonly published figures; real measurements vary 2-4× with hardware, batch size, and context length, but the order-of-magnitude relationships should hold.
