tokenizer → embedding → KV cache → attention → MoE / MLA → logit → sampling
What actually happens in the 200 ms after you hit Enter? Drag five tokens through llama.cpp's guts: 28 stations, each one a slice of real source.
"你好,llama" / "hello, llama": one prompt, three worlds, three languages
Every "LLM inference" tutorial loses you at the same spot — paragraph three jumps to "attention is just Q·Kᵀ". But the string of characters in your head and the matrix multiplications running on your GPU are separated by at least three different languages. Until those three are connected, attention will keep sounding like a memorized formula.
Lay "hello, llama" out, look at it through three different eyes, and you'll see what this whole article is really about: translating back and forth between those three languages.
| eye | what it sees | shape |
|---|---|---|
| Human (utf-8 string) | "你好,llama" · 8 chars / 14 bytes | [1] · str |
| Token (BPE id) | [128000, 47045, 50739, 11, 9091, 64] (<\|begin\|> · 你 · 好 · , · ll · ama) | [6] · int32 |
| Tensor (embedding) | [[0.012, -0.045, ..., 0.083], ...] (6 rows × 4096 cols, fp16) | [6, 4096] · fp16 |
Both translations on the way from "human" to "tensor" are lossy, and each one trades information for computability:
"llama" is no longer a word: it's "ll" + "ama", two BPE pieces. The model isn't "reading the English word llama"; it's processing a combination of two ids.
47045, once it hits the embedding table, becomes a point in a 4096-dim space. From this moment, everything the model does is movement inside that 4096-dim space.
The exit is the inverse: tensor → token → chars. Chapter 13's LM head projects the final 4096-vector back to 128k vocab logits; Chapter 14's sampler picks one id; Chapter 3's inverse llama_detokenize turns the id back into a string. The whole inference is just these two translations, in and out.
same forward, two completely different workloads
"One inference" sounds like a single step — model forwards once, answer comes out. But in fact it has two lives, and these two lives look completely different on hardware. This is essentially the core tension of the whole inference system:
Both go through the same llama.cpp entry point — llama_decode in src/llama-context.cpp — differing only in how many tokens are in the batch. Identical surface, totally different physics. KV cache (Ch. 8) was invented exactly to glue these two regimes together.
| | Prefill | Decode |
|---|---|---|
| tokens in | N (full prompt) | 1 |
| main op | matmul · [N, d]×[d, d] | matvec · [1, d]×[d, d] |
| bottleneck | FLOPs (compute) | DRAM bandwidth |
| typical H100 util | ≥ 70% MFU | 5–15% MFU · 90% MBU |
| latency / token | ~1 ms (8K prompt) | ~15–50 ms |
| KV cache | writes N slots | reads N+k slots · writes 1 |
| user feel | TTFT (time to first token) | TPOT (time per output token) |
Counterintuitive but true: each decode token is an order of magnitude slower than each prefill token. The reason is the GPU's "compute saturation point":
2·N²·d FLOPs while reading only 2·N·d numbers. Intensity = N; cross N ≈ 300 and you're saturated.
This is the root of every modern inference optimization. KV cache (Ch. 8) avoids redoing prefill on history. GQA/MLA (Ch. 10–11) shrink the KV. MoE (Ch. 12) shrinks the active parameter set per decode step. PagedAttention (Ch. 17) packs KV cache denser. Speculative decoding emits more than 1 token per step. They are all fighting the same battle: decode is memory-bound.
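A rough way to see where a saturation point of a few hundred tokens comes from, as a sketch under stated assumptions: the two hardware constants are approximate published H100 SXM peaks (dense 16-bit throughput and HBM bandwidth), not measurements, and `matmul_intensity` is a hypothetical helper that models only a single [N, d]×[d, d] weight matmul in fp16.

```python
# Roofline-style estimate for one [N, d] x [d, d] matmul in fp16.
# Both peaks are assumptions (approximate spec-sheet values); real kernels reach less.
PEAK_FLOPS = 989e12   # dense 16-bit FLOP/s
PEAK_BYTES = 3.35e12  # HBM bytes/s

def matmul_intensity(n_tokens: int, d: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for activations [N, d] times weights [d, d]."""
    flops = 2 * n_tokens * d * d                 # one multiply + one add per MAC
    elems = n_tokens * d + d * d + n_tokens * d  # read x, read W, write y
    return flops / (elems * bytes_per_elem)

# Intensity a kernel needs before compute, not memory, becomes the limit.
ridge = PEAK_FLOPS / PEAK_BYTES

for n in (1, 64, 512, 8192):
    ai = matmul_intensity(n, 4096)
    regime = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"N={n:5d}  intensity={ai:8.1f} FLOP/byte  ({regime})")
```

With these assumed peaks, decode (N = 1) sits near 1 FLOP/byte, far below the ridge, while the crossover lands in the low hundreds of tokens. The weights dominate the bytes moved, so intensity grows almost linearly with N until the activations start to matter.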
"Prefill runs once, decode runs many times" — so total wall time ≈ TTFT + n_out × TPOT. Short replies are TTFT-dominated (the user waits to start); long replies are TPOT-dominated (the user waits for the rest). These are the two metrics every inference SKU competes on.
human → token · the first station in llama-vocab.cpp
The tokenizer is the only stage in this whole article that doesn't run on a GPU, yet it's often the first thing that goes wrong: the same prompt yields 6 ids under Llama-3, 9 under GPT-2, something else under DeepSeek — and KV cache scales linearly with that number.
llama.cpp puts all tokenizer logic in src/llama-vocab.cpp (in earlier versions it was the lower half of the main llama.cpp file, later refactored out). Entry point: llama_tokenize. Core algorithm: BPE (Byte Pair Encoding) — "start from bytes, greedily merge the most-frequent adjacent pair, repeat".
A toy BPE shows the effect: "llama" becomes "ll" + "ama" (two ids), not one, because BPE merges are driven by training-corpus statistics, not semantics. Real BPE in llama.cpp goes through tokenize_with_pre_tokenizer (GPT-2-style regex split) and then llm_tokenizer_bpe::tokenize (priority-queue merge).
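The greedy merge loop can be sketched in a few lines. This is a toy under loud assumptions: `TOY_RANKS` is a made-up merge table chosen so that "llama" splits the way the text describes, the function works on characters rather than bytes, and it rescans all pairs instead of using a priority queue.

```python
INF = float("inf")

# Hypothetical merge table: lower rank = learned earlier = merged first.
TOY_RANKS = {("l", "l"): 0, ("m", "a"): 1, ("a", "ma"): 2}

def bpe_merge(word, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    parts = list(word)  # real byte-level BPE starts from bytes, not characters
    while len(parts) > 1:
        # Rank every adjacent pair; pairs missing from the table get +inf.
        candidates = [
            (ranks.get((a, b), INF), i)
            for i, (a, b) in enumerate(zip(parts, parts[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == INF:
            break  # no known merge left
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

print(bpe_merge("llama", TOY_RANKS))
```

Because there is no rule for the pair ("ll", "ama") in the toy table, the word stays as two pieces; whether a real vocabulary merges it further depends entirely on the merges learned from its training corpus.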
Real BPE isn't a single pass over the bytes. Llama-3 / GPT-4 family tokenizers run a two-stage process:
Pre-tokenize: a regex ((?i:'s|'t|'re|'ve|...)) slices the string into "chunks": whitespace, punctuation and runs of digits each become their own chunk. This prevents cross-punctuation merges.
Merge: within each chunk, BPE repeatedly merges the adjacent pair with the best (lowest) merge rank.
Llama-3's vocabulary is byte-level: all 256 raw bytes live in the vocab. When "你" (utf-8 0xE4 0xBD 0xA0) enters BPE, it's first split into 3 bytes. Then:
Suppose the pair (0xE4, 0xBD) has merge rank 12000 and (0xBD, 0xA0) has rank 8500: the latter merges first.
The sequence is now 0xE4 [0xBD 0xA0]; look up the new pair's rank, and so on.
In Llama-3, "你" likely ends up as a single token (id around 100k+). GPT-2's Chinese coverage is thin, so "你" may stay as 3 bytes there.
This is why the same sentence can differ several-fold in token count across models, and that count directly determines (1) how much context the prompt eats, (2) how slow prefill is, and (3) how much memory the KV cache costs. A lazy tokenizer can double your API bill.
All BPE, but implementations diverge wildly. llama.cpp's src/llama-vocab.cpp enumerates LLAMA_VOCAB_TYPE_BPE / SPM / WPM / UGM, each taking a slightly different path:
▁. "Language-agnostic out of the box" is its pitch.
For an inference engine, tokenizer speed is almost never the bottleneck: compute lives on the GPU. But tokenizer correctness is a minefield: case, whitespace, emoji, rare unicode, chat-template specials like <|im_start|>. Get one wrong and the model can't see how it was actually prompted. llama.cpp's BPE has been bug-fixed in dozens of PRs over the years; this is the easiest place in the whole stack to silently regress.
"你好,llama" through Llama-3's tokenizer yields [128000, 47045, 50739, 11, 9091, 64]: 6 tokens, 24 bytes (int32). "hello, llama" yields [128000, 9906, 11, 100793]: 4 tokens, 16 bytes. Whichever language, those ints feed Ch. 4's embedding lookup.
The pre-tokenize regex from GPT-2's 2019 release is famously cryptic-looking. What it does: slice any utf-8 string into "chunks" — whitespace, punctuation, digit runs, letter runs, each their own segment. Take a look:
llama.cpp runs this regex through its own Unicode-aware splitter (unicode_regex_split in src/unicode.cpp). Notice that the pattern uses \p{L} (any Unicode letter) and \p{N} (any Unicode digit), so Chinese, Arabic and emoji all chunk correctly. This is part of why Llama-3 handles Chinese far better than Llama-2: a Unicode-aware pre-tokenizer paired with a 128k byte-level vocab, where Llama-2 had a 32k SentencePiece vocab, so Chinese doesn't fragment into "weird 0xE4-type byte slices".
The "digits chunked at most 3 at a time" rule matters a lot: it prevents the model from treating "1234567" as one token. Instead the string splits into "123", "456", "7", which helps arithmetic accuracy. GPT-4's infamous "can't compute 17+28" failures are reportedly rarer in Llama-3, partly because of this pre-tokenize improvement.
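The chunking behaviour is easy to try with a simplified stand-in. The pattern below is an approximation written for Python's standard `re` module, not the actual Llama-3 regex: it assumes `[^\W\d_]` is a good enough substitute for `\p{L}` and `\d` for `\p{N}`, and `pre_tokenize` is a hypothetical helper name.

```python
import re

# Simplified approximation of a GPT-style pre-tokenizer (assumption, not the real pattern).
PRETOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional space + run of letters (any script)
    r"|\d{1,3}"                # digits, at most 3 per chunk
    r"| ?(?:[^\s\w]|_)+"       # optional space + run of punctuation/symbols
    r"|\s+(?!\S)"              # whitespace at the end of the input
    r"|\s+",                   # any other whitespace
    re.IGNORECASE,
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges never cross a chunk boundary."""
    return PRETOKENIZE.findall(text)

print(pre_tokenize("1234567"))
print(pre_tokenize("hello, llama"))
```

The digit rule is the `\d{1,3}` alternative: a long number can never match as a single chunk, so the merge stage only ever sees pieces of at most three digits.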
In the two weeks after Llama-3's April 2024 release, llama.cpp / Hugging Face / Ollama all had tokenizer bugs. Three classics:
<|begin_of_text|> (BOS) at the start; but llama.cpp defaulted to auto-adding BOS when loading new models, so the first token becomes BOS BOS. The model sees "two start markers", behaves oddly, and quality drops visibly. Fix: model metadata add_bos_token = false.
<|eot_id|> (copied from logs). Early tokenizers treated it as a regular BPE string, not as the special token id 128009. The model interpreted "user mentioning eot_id in text" as "this turn ended" and stopped generating immediately. Fix: scan for known special tokens before BPE.
None of these are algorithm bugs; they're "seam-layer" human errors. But the user-perceived impact is "this model got dumber". So the tokenizer is the most overlooked, highest-damage piece of the inference stack: one byte wrong and the whole model visibly degrades.
token → tensor · the entry that ggml_get_rows handles in one line
From this step on the token ids disappear. The next 17 stations all run inside a 4096-dim (Llama-3-8B) or 7168-dim (DeepSeek-V3) real-valued space. This translation is trivial — so trivial it's a single ggml op called ggml_get_rows:
Run the numbers: Llama-3-8B's tok_embd is [128256, 4096] in fp16, about 1 GB. One layer's Q/K/V projections sum to only about 50 MB (wq is 4096×4096, while wk and wv are 4096×1024 each thanks to GQA's 8 KV heads). The embedding table alone is roughly 6.5% of the model's 8B parameters, and about 13% once you count the separate, equally sized output matrix.
This is why vocab size is an engineering decision, not a "bigger is better" knob:
Modern models (Llama-3 at 128k, DeepSeek at 130k, Qwen2 at 152k) deliberately pick a sweet spot — "doesn't penalize Chinese, doesn't waste on English".
ggml_get_rows looks like "row selection", and that is all it is: a gather that copies the selected rows out of the table (converting or dequantizing them when the table is stored in a compact format). It isn't a math op, it's an addressing op. The compute is negligible (microseconds). What's "expensive" is the resident memory: that 1 GB embedding table sits in VRAM for the entire lifetime of the model.
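A minimal sketch of the lookup, assuming nothing beyond plain Python lists: `get_rows` here is a hypothetical stand-in for the idea, not ggml's implementation, and the toy table is 8×4 rather than 128256×4096.

```python
def get_rows(table, ids):
    """Gather: copy row table[i] for each id. No multiplications involved."""
    return [list(table[i]) for i in ids]

# Toy embedding table; Llama-3-8B's real one is [128256, 4096].
n_vocab, n_embd = 8, 4
tok_embd = [[float(r * 10 + c) for c in range(n_embd)] for r in range(n_vocab)]

hidden = get_rows(tok_embd, [5, 0, 5])  # three tokens in, three rows out
print(hidden)

# Resident size of the real table in fp16 (2 bytes per element).
table_bytes = 128256 * 4096 * 2
print(table_bytes / 1e9, "GB")
```

The same id always returns the same row, which is why the first and third outputs are identical: all context-dependence is added later, by the transformer layers.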
This number has no first principle. It started at 512 in the original Transformer paper; GPT-2 ranged from 768 to 1600 across its sizes, GPT-3 went to 12288, Llama-2-7B uses 4096, Llama-3-8B kept 4096, and Llama-3-70B uses 8192. It grows roughly with the cube root of total params, but each model family has its own taste.
The real constraints:
n_embd must be divisible by n_head: each attention head gets d_head = n_embd / n_head dims. Common choices are d_head ∈ {64, 96, 128}, because CUDA Tensor Cores like these shapes.
n_embd sets the model's "per-token memory capacity": that 4096-dim vector has to encode "I'm a verb, I followed an article, the whole sentence is sarcastic" and all other contextual signals at once.
4096 isn't a physical constant. It's the number the field has roughly settled on as "enough for 7-8B, a little tight for 70B" over the past seven years.
"Embedding turns a token id into a 4096-dim vector" is the entrance; "LM head turns a 4096-dim vector back into a token id" is the exit (Ch. 13). Shape-wise the two mirror each other: both are [n_vocab, n_embd] matrices; only the matmul direction differs.
So in 2016 Inan et al. proposed "weight tying": have embedding and LM head share the same weights. Implementation: LM head uses embedding.T instead of an independent matrix. Benefits:
Who uses tying: GPT-2 and Gemma do, as do the smallest Qwen2 checkpoints. Llama-1, Llama-2, Llama-3-8B and Mistral-7B keep a separate output matrix. Meta's paper doesn't explain the choice; the industry guess is that once the vocab grows to 128k, untied weights give the model more expressive freedom, and that untying lets the prompt-side embedding and the generation-side embedding each specialize.
llama.cpp auto-detects this from the GGUF file: if the output tensor is absent, it falls back to token_embd. So inference code doesn't need to know whether weights are tied. Training frameworks, however, have to opt in explicitly (for example tie_word_embeddings in Hugging Face configs); otherwise the two weights train independently and the tying benefit is lost.
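The tying idea and the loader fallback can be sketched together. Both function names and the dict-of-tensors layout are hypothetical, chosen only to illustrate the behaviour described above; this is not llama.cpp's loader code.

```python
def lm_head(hidden, out_matrix):
    """logits[v] = dot(hidden, out_matrix[v]) for every vocab row v."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in out_matrix]

def pick_output_matrix(tensors):
    """Use a dedicated 'output' tensor if present, else reuse the embedding table."""
    return tensors.get("output", tensors["token_embd"])

# Toy model with 3 vocab entries and 2 dims, tied weights (no 'output' tensor).
tok_embd = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tensors = {"token_embd": tok_embd}

hidden = [2.0, 3.0]
logits = lm_head(hidden, pick_output_matrix(tensors))
print(logits)
```

With tying, the row that maps token v into the hidden space is the same row that scores how much the final hidden state "looks like" token v, which is the whole appeal of sharing the matrix.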
4096 sounds like a lot, but LLMs may actually use far fewer dimensions. Intrinsic-dimension studies run SVD on LLM hidden states and typically find that ~500-1000 principal components capture 90% of the variance. The remaining 3000+ dims are largely redundant: they exist mainly to provide optimization room during training, and most activations along them are small at inference.
This fact has directly driven several inference optimization directions:
So 4096 isn't "exactly enough" — it's more like "training-time slack for optimization". The "effective dimension" at inference is much smaller — which is the root reason every low-rank compression trick works so well on LLMs.
llama-batch.cpp · the smallest unit the scheduler sees
At this point the prompt is a [n_embd, n_tokens] fp16 matrix. But "feed it to the model" still has one more layer: batch. A batch can pack multiple prompts (different users' requests glued together) or one prompt sliced into chunks during prefill (avoiding a memory spike). Both are expressed through llama_batch.
Not many abstractions, but each matters:
token / embd: the data itself. Either token ids (which go through the embedding lookup) or raw embeddings (for multimodal vision, where image patches are already projected to embedding space).
pos: the position index used by RoPE (Ch. 6). The first time you see this field you wonder why a token needs a "position". The answer: when multiple sequences share a batch, token 0 might be user A's first token while token 5 is user B's hundredth. The batch index alone isn't enough.
seq_id: which sequence this token belongs to. Tokens with the same seq_id share a KV cache; different sequences don't see each other's KV. This is how multi-user concurrency lives inside llama.cpp: several conversations co-exist in one KV cache table, distinguished by seq_id.
The llama_batch the user hands in is a logical unit, but how many tokens fit through one forward is a physical constraint. The n_ubatch param (default 512) sets "tokens per actual graph run". A 4096-token prompt is sliced into 8 ubatches, run serially, with KV cache accumulating between them.
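The slicing arithmetic looks like this in a sketch. The dict layout and the name `split_into_ubatches` are hypothetical and only mirror the fields discussed above; llama.cpp's real structures carry more state.

```python
def split_into_ubatches(tokens, n_ubatch=512, seq_id=0):
    """Cut one logical batch into chunks of at most n_ubatch tokens."""
    ubatches = []
    for start in range(0, len(tokens), n_ubatch):
        chunk = tokens[start:start + n_ubatch]
        ubatches.append({
            "token": chunk,
            # Positions keep counting across chunks, so RoPE sees one sequence.
            "pos": list(range(start, start + len(chunk))),
            "seq_id": [seq_id] * len(chunk),
        })
    return ubatches

prompt = list(range(4096))  # stand-in token ids
ubatches = split_into_ubatches(prompt)
print(len(ubatches), ubatches[-1]["pos"][0], ubatches[-1]["pos"][-1])
```

The detail worth noticing is that pos is global to the sequence, not local to the chunk: the second ubatch starts at position 512, which is what lets it attend to the KV entries the first one wrote.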
Our prompt is now llama_ubatch{ n_tokens=4, token=[128000,9906,11,100793], pos=[0,1,2,3], seq_id=[0,0,0,0], logits=[0,0,0,1] } — the trailing logits=1 tells the model "only emit logits at the final token", because that's the only position we'll sample from. This ubatch is now ready for graph_compute to push it through 32 transformer layers.
A user sends an OpenAI-compatible HTTP request to a vLLM server. Before becoming a ubatch, it passes through several stages:
fastapi async handler): JSON parse + schema validate · 1-2 ms
SequenceGroup, push to scheduler's WAITING queue · μs
Scheduler.schedule() picks it · 0-100 ms depending on load
Note step 4: "wait for scheduling" is a non-negligible source of latency. Under high load a new request can wait tens of ms before being batched, and what the user sees is high TTFT. "Scheduling latency" is a TTFT component just as much as prefill time.
Priority scheduling: introduced after vLLM 1.0, lets requests have priority levels (high-pri jumps the queue). Production use: distinguishing free vs paid tier, interactive chat vs async batch jobs. OS-scheduler thinking reused for LLM serving.
normalize + rotate · redone in every layer
From here on we're inside the 32-floor transformer stairwell. Every floor is structurally identical — only weights differ. So the 4 steps described in Ch. 6–9 are repeated 32 times. One forward pass walks 32 × 4 = 128 steps, and that's just attention; FFN/MoE is on top.
Each layer's first move is not computing attention but giving the hidden state a bath: normalize it via RMSNorm, then rotate Q/K via RoPE to embed position. Both steps are almost parameter-free (RMSNorm has one scale vector, RoPE has zero learned params), but drop either and the model collapses.
RMSNorm in one sentence: rescale each token's 4096-vector so its average energy is 1, then multiply by a learnable scale. Formula:
x' = x / √(mean(x²) + ε) · scale
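Compared to LayerNorm, the mean-subtraction is dropped. No math forces this; RMSNorm was proposed by Zhang and Sennrich in 2019 on the observation that removing centering barely hurts accuracy while making normalization cheaper (on the order of 20% time saved in some setups), and the Llama-1 paper adopted it. Llama, Mistral, Qwen2 and DeepSeek all use RMSNorm now.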
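The formula translates almost word for word into code. A minimal sketch in plain Python; the default `eps` of 1e-5 is an assumption for illustration, since the actual value comes from each model's config.

```python
import math

def rms_norm(x, scale, eps=1e-5):
    """x' = x / sqrt(mean(x^2) + eps) * scale, applied to one token's vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * s for v, s in zip(x, scale)]

x = [3.0, -4.0, 0.0, 5.0]
y = rms_norm(x, scale=[1.0] * len(x))

# With a unit scale, the output's mean square is (almost exactly) 1.
print(sum(v * v for v in y) / len(y))
```

Note what is missing compared to LayerNorm: no mean is computed or subtracted, so the sign pattern and direction of the vector are preserved and only its length changes.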
The original Transformer added "position embeddings to the input", but with a flaw: the model couldn't see the relative distance between two tokens, only their absolute positions. Long context made this brutal — "5K away" and "50 away" both feel equally foreign.
RoPE's idea is elegant: don't add a position embedding; instead, treat each pair of Q and K dims as a point in the complex plane, and rotate it by an angle mθ where m is the position. The result: Q·Kᵀ inner product naturally picks up "the position difference between two tokens" — absolute positions vanish, only relative ones survive.
| dim pair | θᵢ | rotation at position m | captures |
|---|---|---|---|
| (0, 1) | 10000^0 = 1 | m × 1 rad | high freq · short range |
| (2, 3) | 10000^(-2/d) | m × 10000^(-2/d) rad | mid freq |
| (d-2, d-1) | ≈ 10000^(-1) | m × 0.0001 rad | low freq · long range |
RoPE was trained seeing positions up to n_ctx_orig (8192 for Llama-3). Feeding a 128K prompt straight in doesn't work: for positions m the model never saw, the angles mθ land in ranges it has no experience with.
YaRN (and its predecessor, NTK-aware scaling) rescales the rotation frequencies unevenly so that long prompts map back into angle ranges seen in training. High-frequency dims (short range) already complete many full turns inside the training window, so they are left almost untouched; low-frequency dims (long range), which never completed a full turn in training, are slowed down, in effect interpolated. In ggml_rope_ext, per-frequency factors can be passed as the rope_factors tensor (a d/2 vector), alongside the scalar scaling parameters.
The trick lets you push an 8K Llama-3 to 32K context without retraining. The cost is reduced precision at long distance — fine for "retrieval" tasks, weak for "logical reasoning spanning 128K" workloads. Those need actual long-context training.
RoPE looks magical — why not just add a position embedding? Why rotate? Its real beauty: Q·K^T naturally becomes a "function of position difference" — and there's a mathematically unique construction for that. Step by step:
View each pair of Q / K dims as a point in the complex plane:
q_m = (q_a, q_b) → complex q = q_a + i·q_b
k_n = (k_a, k_b) → complex k = k_a + i·k_b
RoPE rotates each position m by mθ:
q'_m = q · e^(imθ)
k'_n = k · e^(inθ)
The magic moment: look at the inner product Q · K (which is attention's score):
⟨q'_m, k'_n⟩ = Re(q'_m · k'_n*)
= Re(q · e^(imθ) · k* · e^(-inθ))
= Re(q · k* · e^(i(m-n)θ))
That e^(i(m-n)θ) — the result depends only on m-n, no longer on m and n individually. Absolute position vanishes from the inner product; only relative position remains. This is the entirety of RoPE's mathematical beauty: a plain rotation makes attention automatically see only relative distance.
And this property cannot be achieved by simple "add a position embedding" — addition leaves three crossterms (q·pos_m + pos_n·k + pos_m·pos_n) in the inner product expansion, irreducible. When Su Jianlin discovered this construction in 2021, it was a stroke of pure math brilliance.
In practice: d_head dims are paired into d_head/2 complex pairs; each pair rotates by a different frequency. Frequencies θ_i = 10000^(-2i/d_head) decay from 1 down to 1/10000, so "wavelengths" span ~6 tokens to ~62800 tokens — covering every relative-distance scale. This is why d_head must be even — an odd dim would leave one unpaired axis with no complex partner.
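The relative-position property is easy to check numerically. A small sketch using Python's `cmath`, treating one (a, b) dimension pair as the complex number a + ib exactly as in the derivation; the helper names and the sample vectors are made up for illustration.

```python
import cmath
import math

def rope_freqs(d_head, base=10000.0):
    """theta_i = base^(-2i/d_head), one frequency per dimension pair."""
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]

def rotate(pair, pos, theta):
    """Rotate one (a, b) pair, viewed as a + ib, by the angle pos * theta."""
    return complex(pair[0], pair[1]) * cmath.exp(1j * pos * theta)

def pair_score(q, k, m, n, theta):
    """Contribution of one pair to the attention score: Re(q'_m * conj(k'_n))."""
    return (rotate(q, m, theta) * rotate(k, n, theta).conjugate()).real

freqs = rope_freqs(128)
q, k = (0.3, -1.2), (0.7, 0.5)

near = pair_score(q, k, 10, 3, freqs[5])       # positions 10 and 3, offset 7
far = pair_score(q, k, 1010, 1003, freqs[5])   # shifted by 1000, same offset 7
print(abs(near - far) < 1e-9)

print(2 * math.pi / freqs[0])  # shortest wavelength, in tokens
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, which is the m − n dependence from the derivation made concrete.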
The 10000 in 10000^(-2i/d) isn't a physical constant. It's a design parameter controlling "how far the model can perceive position differences". Intuition:
Llama-3 bumped base from Llama-2's 10000 to 500000 with long context in mind. A larger base stretches the long wavelengths so the slowest dims never "wrap" within 128K. This is the key knob (paired with long-context training data) behind the 128K window of the Llama-3.1 releases.
DeepSeek-V3 goes further: base 1,000,000, supporting 128K-256K. Often paired with YaRN — base sets the length budget, YaRN dictates how to extrapolate beyond training length at inference.
RMSNorm isn't the only normalizer in Transformers. Three mainstream variants differ in subtle ways that affect training stability; at inference the shape is the same but some properties matter:
Key differences:
At inference, RMSNorm is ggml_rms_norm; QK-Norm is added manually in build_attn. Either way inference cost is tiny (~5K FLOPs per token), but training impact is significant — pick the wrong norm and a large model can suddenly diverge at step 10000.
first three lines of build_attn · the most brute-force matmul
After RMSNorm, the hidden state is ready to "transform". The next step looks unremarkable, just three matrix multiplications, yet together with the output projection it is where all of the attention block's learnable weights live. A single 4096×4096 matrix is 16M params, 32 MB in fp16; wq alone adds up to 1 GB across 32 layers, about the size of the embedding table. In Llama-3-8B the FFN matrices are larger still (4096×14336 each), so the FFN, not Q/K/V, carries most of a layer's FLOPs.
"One token, three identities":
Llama-3-8B has 32 attention heads. Logically you'd imagine 32 independent Q/K/V projection matrix sets — but implementation-wise that's not how it works. They're packed into one big matrix. wq is [n_embd, n_head × d_head] = [4096, 32 × 128] = [4096, 4096]. It IS the 32 heads; reshape just changes how you view it.
This is a load-bearing engineering choice: three big matmuls beat 32 × 3 = 96 small matmuls by a huge margin. GPU kernel launch is a fixed cost (a few μs); small matrices can't saturate SMs. So modern implementations all do "one big matmul + one reshape", treating the head dim as a virtual axis exposed by reshape.
llama-kv-cache-unified.cpp · the table that makes decode possible
This is the most important chapter in this article. If you only remember one thing, remember this: without KV cache, LLM inference is a machine that simply cannot run.
The reason was sketched in Ch. 2: at decode time only one token enters the model, yet attention requires that token's Q to inner-product with all historical K/V. If you recompute history K/V on every decode step, producing the 100th token reruns 99 tokens' worth of full 32-layer attention — O(n²) prefill on every decode. Unusable.
The fix is so plain it barely sounds like one: compute each token's K and V once, save them, reuse forever. That's the KV cache. Per decode step, the K/V work drops from O(n) (recompute K/V for every historical token) to O(1) (project only the new token); attention still has to read all n cached entries. The price is memory, and specifically VRAM, the most expensive memory the GPU has.
容量公式很简单:
size = 2 × n_layer × n_head_kv × d_head × n_ctx × sizeof(dtype)
(2 因为 K 和 V 各存一份)
代入 Llama-3-8B + 8K 上下文 + fp16:
同样这个模型,如果是每用户一份 KV cache,服务 32 个并发用户就是 32 GB——一张 80 GB H100 砸下去,模型权重 16 GB 占走,KV cache 占 32 GB,剩下空间装中间激活、通信缓冲。这就是为什么 LLM 推理服务最贵:一个用户 1 GB+,模型本身才 16 GB,用户的"历史"比模型自己还重。
Capacity formula:
size = 2 × n_layer × n_head_kv × d_head × n_ctx × sizeof(dtype)
(2 because K and V each stored)
Plug in Llama-3-8B + 8K context + fp16:
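The plug-in can be checked with a few lines of Python. This is a sketch: the helper name `kv_cache_bytes` is made up for illustration, and the shapes (32 layers, 8 KV heads, d_head 128) are the Llama-3-8B values used elsewhere in this article.

```python
def kv_cache_bytes(n_layer, n_head_kv, d_head, n_ctx, bytes_per_elem=2):
    """size = 2 * n_layer * n_head_kv * d_head * n_ctx * sizeof(dtype)."""
    return 2 * n_layer * n_head_kv * d_head * n_ctx * bytes_per_elem

# Llama-3-8B, 8K context, fp16 (2 bytes per element).
full = kv_cache_bytes(n_layer=32, n_head_kv=8, d_head=128, n_ctx=8192)
per_token = kv_cache_bytes(n_layer=32, n_head_kv=8, d_head=128, n_ctx=1)

print(f"{full / 2**30:.2f} GiB per 8K context")  # 1.00 GiB per 8K context
print(f"{per_token / 2**10:.0f} KiB per token")  # 128 KiB per token
```

So one 8K conversation costs about 1 GB of KV cache, or 128 KB for every token it contains.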
Same model, per-user KV cache: 32 concurrent users = 32 GB. On an 80 GB H100, model weights take 16 GB, KV cache eats 32 GB, the rest covers intermediate activations and comm buffers. This is why LLM serving is so expensive: each user 1 GB+, the model itself only 16 GB — the user's "history" outweighs the model.
llama.cpp 早期的 KV cache 是一个"线性写入,直到满"的简单数组(llama_kv_cache),但多用户 / 长对话场景下逐渐演化出 llama_kv_cache_unified(src/llama-kv-cache-unified.cpp),核心结构是"很多 slot,每 slot 标记 [pos, seq_id]"——同一张物理 cache 上同时住几条对话,跟 vLLM 的 PagedAttention 思路接近但实现简单很多。
Early llama.cpp KV cache was a simple "write linearly until full" array (llama_kv_cache). Multi-user and long-conversation workloads pushed it toward llama_kv_cache_unified (src/llama-kv-cache-unified.cpp): "lots of slots, each tagged with [pos, seq_id]" — multiple conversations share one physical cache, similar in spirit to vLLM's PagedAttention but much simpler in mechanism.
每一层 attention 块在同一张 KV cache上做两件事:
ggml_cpy(本质 memcpy)。耗时 ~几十 μs。这两件事在每一层都重复一次。所以 32 层模型一次 forward,KV cache 被读 32 次、写 32 次——它是整个推理里读写最频繁的张量。
Each attention block does two things to the same KV cache:
ggml_cpy (just memcpy). Takes ~tens of μs.
Both happen per layer. So a 32-layer forward reads the KV cache 32 times and writes it 32 times, which makes it the most-trafficked tensor in the whole inference.
Prefill 6 个 token:find_slot 找到从 head=0 开始的 6 个连续空 slot,标记 [pos=0..5, seq_id=0]。32 层 attention 各自把自己算出的 K/V 写进这 6 个 slot,总共写入 2 × 32 × 8 × 128 × 6 × 2 字节 = 768 KB。然后第 1 个 decode token 来的时候,Q 跟这 6 个 K 做内积,再跟 6 个 V 加权,得到第 7 个 token 的 hidden state。这才是 attention 的"记得前文"。
Prefill 4 tokens: find_slot grabs 4 contiguous empty slots starting at head=0 and marks them [pos=0..3, seq_id=0]. All 32 attention layers write their computed K/V into these 4 slots: 2 × 32 × 8 × 128 × 4 × 2 bytes = 512 KB in total. When the first decode token arrives, its Q dot-products against those 4 K rows, weighted-sums the 4 V rows, and produces token-5's hidden state. That is what "the model remembers what came before" really means.
看 llama.cpp 的 KV 张量声明会发现一个细节:K 和 V 的形状不一样。
Looking at llama.cpp's KV tensor declarations reveals a curious detail: K and V have different shapes.
这不是手抖。attention 公式里 QKᵀ 跟 P·V 的内存访问模式相反:
Q × K.T,需要 K 按"行=token"组织——K cache 长这样,每行一个 token,顺序读高速。所以 V 物理上是转置存储的——写入时 ggml_cpy 会做一次 transpose,读取时直接顺序流。这一步看似小,实测能给 attention kernel 提 30% 速度——因为 GPU memory access 对"顺序 vs 跨步"极其敏感。
这也是为什么 fp8 KV cache(第 18 章)只对 K 容易做,V 转置之后 fp8 量化的 outlier 跨 token 分布,要做per-head per-block scale才能保持精度——vLLM 在这部分有大量 CUDA kernel 代码。
Not a typo. QKᵀ and P·V in attention have opposite memory access patterns:
- Q × K.T: K must be organized "row = token". The K cache is laid out this way, one token per row, so sequential reads are fast.
- P × V: the weighted sum runs over tokens for each feature dimension, so V is best read with the token axis contiguous.
So V is physically stored transposed: ggml_cpy does the transpose at write time, and reads then stream sequentially. It looks minor, but it measures at ~30% faster attention, because GPU memory access is brutally sensitive to "sequential vs strided".
This is also why fp8 KV cache (Ch.18) is easier on K than V: after V's transpose, fp8 outliers spread across tokens, requiring per-head per-block scales to preserve precision. vLLM has a lot of CUDA kernel code dedicated to exactly this.
llama.cpp 的 find_slot 在全部 slot 占满时返回 -1——上层主循环必须做点什么。三种主流策略:
llama_kv_self_seq_rm):丢掉最早的 N 个 token 的 KV,新 token 写进腾出来的 slot。问题是模型彻底忘了开头——chat 应用里"开头是 system prompt",丢掉就崩了。实际生产里更常见的策略是在 KV cache 快满之前主动 evict 整条对话——把已经几小时没活跃的 session 的 KV 直接 free 掉。vLLM 的 BlockSpaceManager.swap_out 甚至能把不活跃的 session 的 KV 暂存到 CPU 内存,等用户回来再 swap_in。这是把 OS swap 抄进 LLM 推理。
llama.cpp's find_slot returns -1 when all slots are occupied — the caller must do something. Three mainstream strategies:
- Drop the oldest tokens (llama_kv_self_seq_rm): drop the earliest N tokens' KV, write new tokens into the freed slots. Problem: the model totally forgets the beginning, and in chat "the beginning is the system prompt"; lose it and behavior breaks.
In real production, the more common move is to actively evict whole conversations before the cache fills: sessions inactive for hours get their entire KV freed. vLLM's BlockSpaceManager.swap_out goes further: park inactive sessions' KV in CPU memory, swap_in when the user comes back. OS swap, transplanted into LLM inference.
把 Llama-3-70B(GQA-8 · 80 层 · n_embd_head_v = 128)做一笔账。模型权重:fp16 是 140 GB,Q4_K_M 是 ~40 GB。per-user per-8K-context KV cache 是 ~2.5 GB。
用 8 × H100 (80GB) 跑这个服务:
一个 H100 节点(8 卡)的 AWS 报价是 ~$32/h。如果服务跑满 170 个并发用户、每用户每分钟 1 个 query,一小时 ~10000 query · 每 query ~$0.0032。但 KV cache 不工作时(用户在打字)也占着内存,实际 utilization 远不到这个数——所以 OpenAI / Anthropic 等服务的真实毛利都在 50-70% 区间,看起来高其实大头被 idle KV cache 吃掉了。
这就是 MLA(第 11 章)、PagedAttention(第 17 章)、prefix caching(第 19 章)、KV swap-out 等所有"省 KV / 共享 KV / 挪 KV"技术真正的商业意义:每砍 1/2 的 KV cache 占用,服务密度翻倍,毛利涨 20 个点。
Take Llama-3-70B (GQA-8 · 80 layers · n_embd_head_v = 128). Weights: 140 GB at fp16, ~40 GB at Q4_K_M. Per-user, per-8K KV cache is ~2.5 GB.
Serving on 8 × H100 (80 GB):
One H100 node (8 cards) on AWS lists at ~$32/h. Fully utilized at 170 concurrent users, 1 query/min each, that's ~10000 queries/hour · ~$0.0032 per query. But the KV cache also holds memory while the user is typing, so real utilization is much lower. That idle KV cache is a large part of serving cost, and it is one reason the 50-70% gross margins often quoted for OpenAI / Anthropic-style services are lower than the fully-utilized arithmetic suggests.
This is the actual commercial meaning of MLA (Ch.11), PagedAttention (Ch.17), prefix caching (Ch.19), KV swap-out, every "save KV / share KV / move KV" technique: cut KV usage in half and service density doubles, gross margin gains 20 points.
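The back-of-envelope numbers above can be reproduced as follows. This is a sketch using only the figures quoted in the text; the variable names are illustrative, and `max_users` assumes every byte not taken by fp16 weights goes to KV cache, which is an upper bound rather than a realistic budget.

```python
node_vram_gb = 8 * 80      # 8 x H100 (80 GB)
weights_gb = 140           # Llama-3-70B at fp16
kv_per_user_gb = 2.5       # per user, per 8K context
node_cost_per_hour = 32.0  # USD, the list price quoted in the text

# Upper bound on concurrent users if all leftover memory held KV cache.
max_users = int((node_vram_gb - weights_gb) / kv_per_user_gb)

users = 170                     # the concurrency figure used in the text
queries_per_hour = users * 60   # 1 query per user per minute
cost_per_query = node_cost_per_hour / queries_per_hour

print(max_users, queries_per_hour, round(cost_per_query, 4))  # 200 10200 0.0031
```

The gap between the 200-user ceiling and the 170 used in the text is the headroom left for activations and communication buffers. Halving `kv_per_user_gb` doubles the ceiling, which is the "service density doubles" claim in arithmetic form.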
StreamingLLM(Han Song 团队 2023)论文有个反直觉发现:把 KV cache 的前 4 个 token 永远保留,模型就能处理任意长流;不保留,模型立刻崩。这 4 个 token 不是"系统 prompt",可能是 <|begin_of_text|> + 任意三个早期文本 token——为什么这么神奇?
研究后发现一个普适现象:Transformer 训练中,attention 倾向于把大量分数分配给序列最前面的几个 token,即使这些 token 语义无关。原因:softmax 公式 exp(QK)/Σexp 要求所有 token 的概率和为 1——但在某些 query 看来,"整段历史都没什么相关的"。模型必须把概率泄到某个地方,而早期 token 因为训练时永远存在 成了天然的"泄洪口"。
所以这些前几个 token 被称为 attention sinks——它们的 KV不承载有意义的信息,但必须存在,否则其它 token 的 attention 概率分配出问题,scores 数值范围混乱,生成质量崩坏。
这个发现实际催生了三件事:
<sink> token,后续推理时它就是天然的泄洪口。这个研究领域 2023-2024 还很热,但没有大规模生产落地——StreamingLLM 在非聊天场景(纯流式 log 处理)用得多,聊天场景因为 prompt 长度本来就有限,attention sink 优化的边际收益不大。
StreamingLLM (Xiao et al., from Song Han's group at MIT, 2023) found a counterintuitive result: always keep the first 4 tokens in the KV cache and a model handles arbitrarily long streams; drop them and the model breaks instantly. These 4 tokens aren't a "system prompt"; they could be <|begin_of_text|> plus any three early text tokens. Why so magical?
Investigation revealed a universal phenomenon: in Transformer training, attention tends to assign large amounts of score to the first few tokens of the sequence — even when semantically irrelevant. Reason: the softmax formula exp(QK)/Σexp requires all tokens' probabilities sum to 1 — but in some queries' view, "nothing in the history is relevant". The model has to dump probability somewhere; early tokens, being always present during training, became the natural "overflow drain".
So these first few tokens are called attention sinks — their KV carries no meaningful info, but must exist, or other tokens' attention probability distributions break, score ranges go haywire, generation quality collapses.
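A minimal sketch of the "keep the sinks, slide the rest" eviction policy, written as plain index bookkeeping. The function name `streaming_keep` and the default window size are hypothetical; the real StreamingLLM implementation also re-assigns positions inside the cache, which this sketch ignores.

```python
def streaming_keep(n_cached, n_sink=4, window=1020):
    """Return the cache positions kept by a 'sinks + recent window' policy."""
    if n_cached <= n_sink + window:
        return list(range(n_cached))  # nothing to evict yet
    recent_start = n_cached - window
    # Always keep the first n_sink tokens, plus the most recent `window` tokens.
    return list(range(n_sink)) + list(range(recent_start, n_cached))

print(streaming_keep(10, n_sink=4, window=3))  # [0, 1, 2, 3, 7, 8, 9]
```

Positions 4 to 6 are evicted while 0 to 3 survive indefinitely, so the softmax always has its "overflow drain" available.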
This finding produced three things:
<sink> special token at training, providing a natural overflow drain at inference.
This area was hot in 2023-2024 but is not widely deployed in production. StreamingLLM sees more use in non-chat scenarios (pure-stream log processing); chat prompts are bounded enough that the marginal gain from attention-sink optimization is small.
除了量化(C18 fp8 KV),另一类减小 KV cache 的思路是主动丢掉一部分 KV ——保留"重要"的,丢"不重要"的。怎么定义"重要"是一系列论文的核心:
注意这类技术跟量化是正交的——可以叠加。fp8 KV 砍一半内存 + H2O 再丢一半 KV = 总共 KV cache 砍到 1/4。这是 2024 年长上下文推理优化的组合战。
llama.cpp 主线还没集成这些(仍是简单 ring buffer),但研究分支 已经有实验。vLLM 在 2024 年加了 SnapKV 选项。这是个还在快速演化的领域,标准答案没出。
Besides quantization (Ch.18, fp8 KV), another way to shrink the KV cache is to proactively drop parts of it: keep "important" tokens, drop "unimportant" ones. How to define "important" is the topic of a series of papers:
These techniques are orthogonal to quantization — they stack. fp8 KV halves memory + H2O drops half the tokens = total KV down to 1/4. This is the 2024 long-context inference optimization combo.
llama.cpp main hasn't integrated these (still simple ring buffer); experimental branches have. vLLM added SnapKV as an option in 2024. This is a fast-evolving field; no standard answer yet.
ggml_flash_attn_ext · 三步合一
ggml_flash_attn_ext · three steps fused into one
到这一步,这一层 attention 的所有素材都齐了:
[d_head, n_head, n_tokens][d_head, n_head_kv, n_kv][d_head, n_head_kv, n_kv]朴素 attention 公式三步:S = QKᵀ / √d → P = softmax(S + mask) → O = P·V。中间矩阵 S 大小是 [n_tokens × n_kv]——n_tokens × n_kv 个 float。8K 上下文下,光一个 head 的 S 就是 64M 个 float = 256 MB,32 个 head 是 8 GB,32 层是 256 GB——显存装不下。
这就是FlashAttention(Tri Dao, 2022)解决的问题。它的洞察是:S 这个中间矩阵根本不用整张物化。把 Q 和 K/V 都切成小块(tiles),在 SRAM 里算完一块就立刻 softmax 加权,累加到 O 上,然后扔掉这块 S。整个 attention 变成"三步合一"的单个 kernel——既省了显存读写,又省了显存容量。
By this point all ingredients are ready:
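- Q: [d_head, n_head, n_tokens]
- K: [d_head, n_head_kv, n_kv]
- V: [d_head, n_head_kv, n_kv]
Naive attention is three steps: S = QKᵀ / √d → P = softmax(S + mask) → O = P·V. The intermediate S is [n_tokens × n_kv] floats. At 8K context, one head's S = 64M floats = 256 MB; 32 heads = 8 GB for a single layer; across 32 layers that is 256 GB of S. Layers run one after another, so only one layer's S is live at a time, but 8 GB of scratch per layer is already a heavy bite out of VRAM, and every byte of it has to be written to HBM and read back.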
This is what FlashAttention (Tri Dao, 2022) solved. The insight: S doesn't need to be materialized whole. Tile Q and K/V into blocks. For each tile, compute its S in SRAM, softmax-weight V immediately, accumulate to O, throw S away. The whole attention becomes "three steps fused into one" kernel — saving VRAM bandwidth and VRAM capacity at once.
| | Naive | FlashAttention |
|---|---|---|
| Intermediate S | materialized to HBM | tiled · stays in SRAM |
| HBM traffic | O(N²) · read and written at every step | O(N) · only Q/K/V/O are read or written |
| SRAM usage | unused | tiles loaded into SRAM and reused |
| Actual speedup on H100 | 1× | 5–10× (at longer context) |
| Max context | ~4K (80 GB) | ~128K+ |
上面那个 kq_mask 是 transformer 里唯一让模型"顺序"的东西。它是一个 [n_kv, n_tokens] 的下三角矩阵:第 i 个 query token 只能看到 j ≤ i 的 key token。在 softmax 之前,mask 上"不该看到"的位置被加上 -inf,softmax 后这些位置概率为 0。
这就是 GPT 类自回归模型不能并行预测后面的 token 的根本原因——它的训练目标就是"看着前 i 个 token,预测第 i+1 个",mask 是这个目标在 attention 公式里的具体形式。如果去掉 mask,模型就变成"双向"的(BERT 那种),用法完全不同。
That kq_mask is the only thing in the attention formula that gives the model a "direction". It is a [n_kv, n_tokens] lower-triangular matrix: query token i can only see key tokens j ≤ i. Before softmax, -inf is added at every position a query must not see, so after softmax those positions get probability 0.
This is the root reason autoregressive GPTs can't parallel-predict future tokens. The training objective is "look at first i tokens, predict token i+1", and the mask is that objective made concrete in the attention formula. Drop the mask and you get a bidirectional model (BERT-style) — a completely different use case.
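A small NumPy sketch of the causal mask in action. The function name `causal_softmax` is illustrative, and the scores are laid out as [n_tokens, n_kv] (rows are queries), which is the transpose of the ggml dimension order quoted above. It assumes the n_tokens queries are the last n_tokens positions of the n_kv cached positions, which covers both prefill and single-token decode.

```python
import numpy as np

def causal_softmax(scores):
    """Softmax over keys where query i may only see keys at positions <= its own."""
    n_tokens, n_kv = scores.shape
    q_pos = np.arange(n_kv - n_tokens, n_kv)[:, None]  # absolute position of each query
    k_pos = np.arange(n_kv)[None, :]                   # absolute position of each key
    mask = np.where(k_pos <= q_pos, 0.0, -np.inf)      # 0 = visible, -inf = hidden
    s = scores + mask
    s = s - s.max(axis=-1, keepdims=True)              # numerical stability
    p = np.exp(s)                                      # exp(-inf) == 0 for hidden keys
    return p / p.sum(axis=-1, keepdims=True)

probs = causal_softmax(np.zeros((4, 4)))  # uniform scores, 4-token prefill
print(np.round(probs, 2))
```

With uniform scores, row i spreads its probability evenly over keys 0..i and assigns exactly 0 to every later key. A decode step is the one-row case, where the single query sees the whole cache.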
[n_embd, n_tokens] 的 hidden state——跟输入同形。然后接一个 output projection(W_o)、残差连接,再走 FFN/MoE(第 12 章),再来一次 RMSNorm + attention + FFN……重复 32 次。32 层后,最后一层的 hidden state 进入第 13 章的 LM head。
This layer's attention finishes. Output is [n_embd, n_tokens] hidden state — same shape as input. Then output projection (W_o), residual add, then FFN/MoE (Ch. 12), then RMSNorm + attention + FFN again… 32 times. After 32 layers the final hidden state enters the LM head (Ch. 13).
朴素 softmax 需要看完所有元素才能算:先求 max(数值稳定),再 exp 求和(归一化分母),最后逐项除。这就是为什么 attention 的中间 S = QKᵀ 矩阵必须先完整物化到 HBM,再做 softmax——典型 O(N²) 内存。
FlashAttention 的核心数学是 online softmax(Milakov & Gimelshein,2018):边读边更新,不需要先看完所有元素。原理是维护两个累加量,随着新元素流入做修正:
读到 x_i 之前 · 已经处理过 i-1 个元素:
m_{i-1} = max(x_1, ..., x_{i-1}) // 当前最大值
d_{i-1} = Σ exp(x_j - m_{i-1}) // 当前归一化分母
读到 x_i,更新:
m_i = max(m_{i-1}, x_i)
d_i = d_{i-1} · exp(m_{i-1} - m_i) + exp(x_i - m_i)
那个 exp(m_{i-1} - m_i) 因子是 修正系数——之前的所有 exp 都是基于旧的 max 算的,新 max 一来,所有旧值都被"整体缩小"一点。这一步保证了:无论你读到第几个元素,d_i 都是"就这一段已读元素的正确归一化分母"。
FlashAttention 把这套 online softmax 跟分块矩阵乘结合:Q 和 K/V 都切成小块(tile),每次只把一对 tile 拉进 SRAM。在 SRAM 里算这对 tile 的局部 S = QK^T,直接 online softmax 累加到输出 O 上,然后把当前块扔掉,读下一对——全程不物化整张 S 矩阵。
Naive softmax requires seeing all elements: first compute max (for numerical stability), then exp-sum (the normalization denominator), then divide. This is why attention's intermediate S = QKᵀ must be fully materialized to HBM before softmax — classic O(N²) memory.
FlashAttention's core math is online softmax (Milakov & Gimelshein, 2018): update incrementally; never need to see everything first. Maintain two accumulators that update as new elements stream in:
Before reading x_i · already processed i-1 elements:
m_{i-1} = max(x_1, ..., x_{i-1}) // running max
d_{i-1} = Σ exp(x_j - m_{i-1}) // running denominator
Read x_i, update:
m_i = max(m_{i-1}, x_i)
d_i = d_{i-1} · exp(m_{i-1} - m_i) + exp(x_i - m_i)
The exp(m_{i-1} - m_i) factor is a correction coefficient — all the prior exps were computed against the old max; when the max grows, all old values get uniformly shrunk a bit. After the update, d_i is "the correct normalization denominator for just the elements seen so far".
FlashAttention couples this online softmax with block matrix multiply: Q and K/V are tiled; each iteration loads one tile pair into SRAM. In SRAM, compute the tile's local S = QK^T, online-softmax-accumulate it onto the output O, discard, load next pair — the full S matrix is never materialized.
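The update rule above, written out as a runnable sketch in plain Python. The function names are illustrative. It streams through the inputs once, keeping only the running max `m` and running denominator `d`, and assumes the inputs are finite numbers.

```python
import math

def online_softmax_stats(xs):
    """Single pass over xs, returning (max, sum of exp(x - max))."""
    m = -math.inf  # running max
    d = 0.0        # running denominator, always relative to the current m
    for x in xs:
        m_new = max(m, x)
        # Rescale the old denominator to the new max, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d

def online_softmax(xs):
    m, d = online_softmax_stats(xs)
    return [math.exp(x - m) / d for x in xs]

print(online_softmax_stats([1.0, 3.0, 2.0]))
```

On the first element the old max is -inf, so the correction factor is exp(-inf) = 0 and the denominator starts at exactly 1. After that, each new element either leaves the max unchanged (factor 1) or shrinks everything seen so far.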
看这个伪代码,你能感觉到FlashAttention 的精妙:它把所有"需要看完整列才能算"的依赖,全部用 alpha / beta 这两个修正项消解掉了。每次内循环只需要 SRAM 里的当前块和几个标量(m, d),整个外循环结束时 O_i 是正确的 softmax(QKᵀ)V[i] 行——跟 naive 算法bit-equivalent(忽略 fp 累加误差)。
实测收益:在 A100 上,8K seq len、d=128 的 attention,naive 算法用 ~70 GB HBM 流量(物化 S 矩阵 + 来回读写),FlashAttention 只用 ~7 GB——10× 内存带宽节省直接翻译成 ~5-7× 实际加速。同时支持的最长 seq 也从 ~4K 涨到 ~64K。
Reading the pseudocode you can feel FlashAttention's elegance: every "need to see the whole row first" dependency is dissolved by those two correction terms, alpha and beta. The inner loop needs only the current tiles in SRAM and a few scalars (m, d). After the outer loop, O_i is the correct softmax(QKᵀ)V[i], mathematically identical to the naive algorithm and differing only by floating-point rounding.
Measured win: on A100, 8K seq len, d=128 attention, naive uses ~70 GB HBM traffic (materialize S + back-and-forth). FlashAttention uses ~7 GB — 10× bandwidth saved, translating to ~5-7× real speedup. Max workable seq length grows from ~4K to ~64K.
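One way to write the tiled algorithm described above is the NumPy sketch below. It is not the FlashAttention kernel: there is no SRAM, no causal mask, and the tile size `block` is arbitrary. The names `alpha` and `beta` follow the two correction terms mentioned in the text; everything else is an illustrative assumption. It shows that the tile-by-tile accumulation reproduces the naive result without ever holding the full S matrix.

```python
import numpy as np

def naive_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[1])            # full [n, m] score matrix
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    for i in range(0, n, block):                       # outer loop: Q tiles
        q = Q[i:i + block]
        m = np.full((q.shape[0], 1), -np.inf)          # running row max
        dsum = np.zeros((q.shape[0], 1))               # running denominator
        acc = np.zeros((q.shape[0], V.shape[1]))       # unnormalized output
        for j in range(0, K.shape[0], block):          # inner loop: K/V tiles
            s = q @ K[j:j + block].T / np.sqrt(d)      # only a [block, block] tile of S
            m_tile = s.max(axis=1, keepdims=True)
            p = np.exp(s - m_tile)
            m_new = np.maximum(m, m_tile)
            alpha = np.exp(m - m_new)                  # rescales everything seen so far
            beta = np.exp(m_tile - m_new)              # rescales the current tile
            dsum = alpha * dsum + beta * p.sum(axis=1, keepdims=True)
            acc = alpha * acc + beta * (p @ V[j:j + block])
            m = m_new
        out[i:i + block] = acc / dsum                  # normalize once at the end
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The largest score buffer is `block × block`, regardless of sequence length. That property is what the real kernel exploits to keep S inside SRAM.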
FlashAttention 自 2022 年首发以来,迭代过两次大版本,每次都跟硬件代际深度耦合:
有意思的遗产问题:vLLM 一开始用 FlashAttention v2,后来加 PagedAttention 自己的 attention kernel,FA v3 又重写了一遍兼容 paged KV。kernel 实现是个吃工程力的活——同样数学,不同硬件,得分别写。这也是为什么"新硬件出来到完整推理栈支持" 通常要 6-12 个月——不是模型移植慢,是 kernel 移植慢。
FlashAttention has shipped two major versions since 2022, each tightly coupled to a hardware generation:
An interesting legacy issue: vLLM started on FA v2, then added PagedAttention with its own attention kernel, and FA v3 in turn had to be made compatible with paged KV. Kernel implementations are engineering-heavy: same math, different hardware, and each target needs its own kernel. This is why "new hardware → full inference stack support" usually takes 6-12 months. Model porting isn't slow; kernel porting is.
FlashAttention 的所有收益都来自"S 矩阵不物化到 HBM"。但 decode 阶段 Q 只有 1 行——S = QK^T 本来就只有 [1, N_kv] 行,物化到 HBM 也就几十 KB,跟模型权重比微不足道。所以 decode 阶段 FA 的收益主要在 prefill 的 attention 上,decode 收益很小。
这就是为什么 vLLM / TensorRT-LLM 等推理引擎都有专门为 decode 优化的 attention kernel——它们不走 FA,走"纯流式"算法,每次只处理 1 个 query token,但同时把 KV 按 PagedAttention 的 block 索引读——这就是 paged_attention_v2 CUDA kernel 在干的事。
FlashDecoding(Tri Dao 团队 2023): 解决了"decode 阶段 GPU 还是欠饱和" 的问题。思路是把 KV cache 沿 seq 维度切成多块,每块用一个 SM 独立算这一段 KV 对当前 Q 的 attention 贡献,最后再合并所有 SM 的结果。这样 SM 占用率从 ~30% 涨到 ~70%,在长上下文 decode 上加速 ~3-5×。是 2024 年 long-context decode 加速的最大单点优化。
All of FlashAttention's wins come from "S not materialized to HBM". But at decode, Q has only 1 row — S = QK^T is just [1, N_kv], a few tens of KB to materialize, negligible against model weights. So FA's win at decode is essentially zero; the win is all on prefill attention.
This is why vLLM / TensorRT-LLM ship decode-specific attention kernels — they don't use FA but a "pure streaming" algorithm: one query token at a time, but reading KV via PagedAttention's block indexing. This is exactly what the paged_attention_v2 CUDA kernel does.
FlashDecoding (Tri Dao's team, 2023): tackled "even decode under-saturates the GPU". The idea: split KV cache along the sequence axis into chunks; each SM independently computes "this chunk's contribution to the current Q's attention"; merge afterward. SM occupancy goes from ~30% to ~70%, ~3-5× speedup on long-context decode. Single biggest 2024 long-context decode optimization.
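The split-and-merge idea can be sketched in NumPy for a single query vector. Each "chunk" stands in for the work one SM would do. The function names are illustrative, and the merge reuses the same max-rescaling trick as online softmax. This is a numerical illustration of the algorithm, not the CUDA kernel.

```python
import numpy as np

def partial_attention(q, K_chunk, V_chunk):
    """Attention of one query against one KV chunk, left unnormalized."""
    s = K_chunk @ q / np.sqrt(q.shape[0])  # [chunk_len]
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V_chunk         # local max, local denom, local output

def split_kv_decode(q, K, V, n_chunks=4):
    parts = [
        partial_attention(q, Kc, Vc)
        for Kc, Vc in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks))
    ]
    m_all = max(m for m, _, _ in parts)    # global max across chunks
    # Rescale every chunk to the global max, then merge.
    denom = sum(d * np.exp(m - m_all) for m, d, _ in parts)
    out = sum(o * np.exp(m - m_all) for m, _, o in parts)
    return out / denom

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((100, 16))
V = rng.standard_normal((100, 16))

s = K @ q / np.sqrt(16)
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(split_kv_decode(q, K, V), reference))  # True
```

The chunks are independent until the final merge, so a long KV cache can keep many SMs busy even though there is only one query row.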
最朴素的"省 KV cache"招式
the simplest "save KV cache" trick
第 8 章那个 1 GB 的数字里,有一个变量叫 n_head_kv——Llama-3-8B 不是 32,而是 8。这并不是错。是GQA(Grouped Query Attention)的设计:让 32 个 query head 共享 8 个 K/V head——每 4 个 query 共用 1 组 K/V。
这件事的动机直接得难以反驳:
n_head_kv。GQA 把 KV head 砍到 1/4,KV cache 直接砍 75%。"MQA"(Multi-Query Attention)是 GQA 的极端版:所有 query head 共用一组 K/V(n_head_kv = 1)。再激进一点。最早是 PaLM 用,现在 Falcon、StarCoder 用。代价是精度损失更明显——这也是为什么 Llama-2 之后大家更喜欢 GQA(折中)而不是 MQA。
In Ch.8's 1 GB number there's a variable n_head_kv — Llama-3-8B has not 32 but 8. That's not a typo. It's GQA (Grouped Query Attention): 32 query heads share 8 K/V heads — every 4 queries share one K/V pair.
Motivation is hard to argue with:
n_head_kv. GQA quarters KV heads, quartering the cache.
"MQA" (Multi-Query Attention) is GQA's extreme: all query heads share a single K/V head (n_head_kv = 1). More aggressive. Proposed by Shazeer in 2019 and used by PaLM, Falcon and StarCoder. Cost: a more visible accuracy hit, which is why after Llama-2 most teams settled on GQA (the middle ground) over MQA.
| Config | n_head | n_head_kv | Query heads per KV head | KV cache · 8K · fp16 |
|---|---|---|---|---|
| MHA (Llama-1) | 32 | 32 | 1 | 4.3 GB |
| GQA-8 (Llama-3-8B) | 32 | 8 | 4 | 1.07 GB |
| GQA-4 | 32 | 4 | 8 | 540 MB |
| MQA (Falcon) | 32 | 1 | 32 | 135 MB |
n_head_kv 是模型卡里最不起眼但成本最高的一个参数。
Same Llama-3-8B-shaped model — KV cache spans 32× across these four configs. n_head_kv is the most-overlooked, most-cost-impacting field on the model card.
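A NumPy sketch of how query heads share K/V heads. The mapping "query head h reads KV head h // group" is an assumption for illustration (it matches the common repeat-interleave convention), the function name is made up, and the causal mask is omitted for brevity. The check at the end shows GQA is the same as MHA run on K/V heads that were physically repeated, with the repetition never stored.

```python
import numpy as np

def gqa_attention(Q, K, V):
    """Q: [n_head, n_tokens, d]; K, V: [n_head_kv, n_kv, d]."""
    n_head, n_head_kv = Q.shape[0], K.shape[0]
    group = n_head // n_head_kv             # query heads per KV head
    out = np.empty_like(Q)
    for h in range(n_head):
        kv = h // group                     # which shared K/V head this query head reads
        s = Q[h] @ K[kv].T / np.sqrt(Q.shape[-1])
        s -= s.max(axis=-1, keepdims=True)
        p = np.exp(s)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ V[kv]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 3, 16))   # 8 query heads
K = rng.standard_normal((2, 5, 16))   # 2 KV heads, so groups of 4
V = rng.standard_normal((2, 5, 16))

shared = gqa_attention(Q, K, V)
expanded = gqa_attention(Q, np.repeat(K, 4, axis=0), np.repeat(V, 4, axis=0))
print(np.allclose(shared, expanded))  # True
```

The cache only ever holds the 2 KV heads; the 4× expansion exists purely as an indexing rule, which is where the memory saving in the table comes from.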
Multi-head Latent Attention · 一招省到 1/14
Multi-head Latent Attention · cache shrunk to 1/14
GQA 节省 KV cache 的方式是"少存几组 K/V"。MLA 走了另一条路——K 和 V 根本不存原始张量,只存一个"压缩潜空间"里的低维向量。要用的时候再展开。
这是 DeepSeek 在 V2 / V3 用的招式(论文《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》)。把每 token 的 KV cache 占用摆在一起对比:
2 × 32 × 8 × 128 × 2 bytes ≈ 128 KB / token60 × 576 × 2 bytes ≈ 68 KB / token · 只存一个 latent c_kv,576 = d_c 512 + d_rope 642 × 60 × 128 × 128 × 2 ≈ 3.75 MB / token ·DeepSeek-V2 论文宣称 MLA 比 MHA 省 ~57×,跟这个数对得上注意 V2/V3 是大模型(总参数 236B / 671B),不是 7B 级别——MLA 设计的真正目的就是让这种规模的模型还能在合理硬件上 decode 长上下文。Llama 这边的 GQA 已经够 8B/70B 用了。
GQA saves KV cache by "storing fewer K/V groups". MLA takes a different road — don't store the raw K and V at all; store only a low-dim vector in a "compressed latent space". Expand on demand.
This is the trick DeepSeek used in V2/V3 (paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model). Per-token KV cache, side by side:
- Llama-3-8B (GQA-8): 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KB / token
- DeepSeek-V2 (MLA): 60 × 576 × 2 bytes ≈ 68 KB / token · just one latent c_kv, where 576 = d_c 512 + d_rope 64
- DeepSeek-V2 if it used plain MHA: 2 × 60 × 128 × 128 × 2 bytes ≈ 3.75 MB / token · DeepSeek-V2's paper claims MLA saves ~57× over MHA, consistent with this
Note: V2/V3 are large models (236B / 671B total params), not 7B-class. MLA exists precisely so that scale can still decode long context on reasonable hardware. Llama's GQA is already enough at 8B/70B.
朴素 attention:每个 head 有自己的 K/V,存 n_head_kv × d_head 个数。MLA 的拆解:
c_kv ∈ ℝ^d_c(DeepSeek-V2:d_c = 512)。这是真正存进 KV cache 的东西。c_kv 投影出来:K = W_uk · c_kv(W_uk 是个 [n_head × d_head, d_c] 的"解压矩阵")。V = W_uv · c_kv。原本要存 n_head_kv × d_head × 2 个数,现在只存 d_c 个数。Llama 数字:8 × 128 × 2 = 2048 → DeepSeek:512(再加少量 RoPE 维)≈ 1/4。
但这里有个问题:每次 decode 时,从 latent 解压出 K 和 V,再做 attention——岂不是多了一次 matmul?这就是 MLA 真正巧妙的地方。
Standard attention stores, per token and per layer, n_head_kv × d_head numbers for K and as many again for V. MLA's decomposition:
- Latent: c_kv ∈ ℝ^d_c (DeepSeek-V2: d_c = 512). This is what actually lives in the KV cache.
- K = W_uk · c_kv (W_uk is a [n_head × d_head, d_c] "decompression matrix").
- V = W_uv · c_kv.
Instead of storing n_head_kv × d_head × 2 numbers, store d_c. Llama: 8 × 128 × 2 = 2048 → DeepSeek: 512 (plus a small RoPE tail) ≈ 1/4.
But there's a catch: every decode step now expands latent → K, V, then runs attention. Extra matmul each step? This is exactly where MLA gets clever.
这就是 MLA 看起来像魔法的根源:"压缩"是数学上的 free lunch,不是计算量上的妥协。因为线性矩阵可以重新分组,你可以把"解压"那步永远放在权重侧而不是数据侧——latent 张量永远不需要被显式解开。
实际实现里还有个细节:RoPE 不能直接施加在 latent 上(latent 的几何不保留位置旋转)。所以 DeepSeek 把每个 head 的 K 拆成两部分:大部分维度走 MLA(无 RoPE),少数维度单独走标准 K(带 RoPE)。两部分拼起来给 attention。代码里这是 kv_lora_rank 和 qk_rope_head_dim 两个参数。
This is why MLA looks like magic: "compression" is a free lunch in math, not a compute trade-off. Linear matrices can be re-grouped, so the "decompression" step lives permanently on the weight side, never on the data side — latents never get explicitly expanded.
One real-world wrinkle: RoPE can't be applied directly to the latent (it doesn't preserve position rotation geometry). So DeepSeek splits each head's K in two: most dims go via MLA (no RoPE), a few dims go standard K (with RoPE). Both are concatenated for attention. In code these are kv_lora_rank and qk_rope_head_dim.
MLA 看着像魔法,关键就是"把解压矩阵 W_uk 吸收进 Q 的投影 W_q"这一步。展开看其实就是一连串合法的线性代数。
朴素 attention 公式(单 head,省略缩放和 softmax):
O = softmax((Q · Kᵀ) / √d) · V
Q ∈ ℝ^[n, d] · K ∈ ℝ^[m, d] · V ∈ ℝ^[m, d]
MLA 把 K 拆成 K = c_kv · W_uk(其中 c_kv ∈ ℝ^[m, d_c],W_uk ∈ ℝ^[d_c, d])。代入 Q·Kᵀ:
Q · Kᵀ = Q · (c_kv · W_uk)ᵀ
= Q · W_ukᵀ · c_kvᵀ
= (Q · W_ukᵀ) · c_kvᵀ
注意 Q = h · W_q(h 是输入 hidden state)。所以:
Q · W_ukᵀ = h · W_q · W_ukᵀ = h · W_q'
其中 W_q' := W_q · W_ukᵀ ∈ ℝ^[d_model, d_c]
这就是吸收:W_q' 可以在模型加载时一次性算出来,运行时 Q' = h · W_q' 是一次正常的 matmul,跟普通 attention 等价的代价。然后:
Q · Kᵀ = Q' · c_kvᵀ
直接拿 latent c_kv 算 attention scores · 不需要"解压" K
V 那边同样手法:V = c_kv · W_uv,把 W_uv 吸收进输出投影 W_o,得到 W_o' := W_uv · W_o。
结果:KV cache 只存 d_c 维的 c_kv,推理时算两个新 matmul 但没多花 FLOPs——你只是把"解压 K · 算 attention · 解压 V · 投影 O" 重新分组成了"算 Q' · 算 attention · 投影 O'"。这就是吸收的全部:线性代数里的纯结合律重排。
更妙的是,推理 batch size 大的时候,W_uk / W_uv 那两个矩阵从来不需要被算到——它们只活在权重侧,推理路径上看不到。这是模型架构里少有的、纯几何技巧带来的几倍内存收益。
MLA looks like magic; the key is "absorb the decompression matrix W_uk into Q's projection W_q". Expanded, it's just a chain of perfectly legal linear algebra.
Standard attention, single head:
O = softmax((Q · Kᵀ) / √d) · V
Q ∈ ℝ^[n, d] · K ∈ ℝ^[m, d] · V ∈ ℝ^[m, d]
MLA factors K = c_kv · W_uk (c_kv ∈ ℝ^[m, d_c], W_uk ∈ ℝ^[d_c, d]). Substitute into Q·Kᵀ:
Q · Kᵀ = Q · (c_kv · W_uk)ᵀ
= Q · W_ukᵀ · c_kvᵀ
= (Q · W_ukᵀ) · c_kvᵀ
And Q = h · W_q (h is the input hidden state), so:
Q · W_ukᵀ = h · W_q · W_ukᵀ = h · W_q'
where W_q' := W_q · W_ukᵀ ∈ ℝ^[d_model, d_c]
That's absorption: W_q' is precomputed once at model load; at runtime Q' = h · W_q' is one ordinary matmul that takes the place of the original Q projection. Then:
Q · Kᵀ = Q' · c_kvᵀ
attention scores directly from the latent c_kv · no "decompress K" needed
Same trick on V: V = c_kv · W_uv, absorb W_uv into the output projection W_o, get W_o' := W_uv · W_o.
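Result: the KV cache stores only the d_c-dim c_kv, and no per-token decompression matmul is left on the decode path. You've regrouped "decompress K · attention · decompress V · project O" into "compute Q' · attention · project O'". That's all absorption is: pure associativity-driven re-grouping of linear algebra. One caveat: the score and value dot products now run over d_c dims instead of d, so the attention step itself does more arithmetic per head; decode is memory-bandwidth-bound, so reading a much smaller cache is still the better trade.
Even better: on the inference path W_uk / W_uv are never applied to cached tokens. They live only on the weight side, folded into W_q' and W_o'. A rare case of a model architecture getting multi-fold memory savings from a purely algebraic trick.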
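The derivation can be checked numerically. The NumPy sketch below uses a single head, tiny made-up dimensions, and no RoPE. The symbol names follow the derivation (W_q, W_uk, W_uv, W_o, c_kv); everything else is illustrative. It compares the explicit "decompress K and V" path against the absorbed path that works directly on the latent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_model, d, d_c = 1, 6, 32, 8, 4  # 1 decode token, 6 cached tokens

h = rng.standard_normal((n, d_model))   # hidden state of the new token
c_kv = rng.standard_normal((m, d_c))    # cached latents
W_q = rng.standard_normal((d_model, d))
W_uk = rng.standard_normal((d_c, d))    # "decompress" latent into K
W_uv = rng.standard_normal((d_c, d))    # "decompress" latent into V
W_o = rng.standard_normal((d, d_model))

def softmax(s):
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

# Explicit path: materialize K and V from the latent, then standard attention.
K = c_kv @ W_uk
V = c_kv @ W_uv
out_explicit = softmax((h @ W_q) @ K.T / np.sqrt(d)) @ V @ W_o

# Absorbed path: fold W_uk into W_q and W_uv into W_o once, at load time.
W_q_abs = W_q @ W_uk.T                  # [d_model, d_c]
W_o_abs = W_uv @ W_o                    # [d_c, d_model]
scores = (h @ W_q_abs) @ c_kv.T / np.sqrt(d)  # scale still uses the original d
out_absorbed = (softmax(scores) @ c_kv) @ W_o_abs

print(np.allclose(out_explicit, out_absorbed))  # True
```

In the absorbed path, K and V never appear; attention reads and mixes the latent rows directly, which is why only c_kv has to live in the cache.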
上面那个吸收推导有个隐含前提:从 latent 到 K 的映射是一个固定矩阵,对每个 token 都一样。RoPE 打破了这一点。旋转本身仍是线性的,但它依赖位置:Q 和 K 各自被乘上一个位置相关的旋转矩阵。
具体地说,RoPE 把 K 变成 K(m) = R(m) · K_orig,其中 R(m) 是依赖位置 m 的旋转矩阵。如果你想吸收 W_uk 进 W_q,需要把 R(m) 一起处理——但 R(m) 跟 token 位置有关,每个 token 不同,没法预先打包进权重。
DeepSeek 的解决:把 K 拆成两部分:
attention 时把两部分拼起来当 K_final 用。所以 MLA 的 KV cache 实际存的是 [d_c + d_rope] = 576 维的混合向量——绝大部分能享受吸收带来的压缩,少数维度走标准 RoPE 路径。这就是为什么前面公式里出现 "576 = d_c 512 + d_rope 64"。
同样的设计也用在 Q 上:q_nope(non-RoPE)+ q_rope(RoPE)。所以 DeepSeek-V2/V3 的 attention 节点比 Llama 多了几次拼接操作,但 KV cache 容量降下来,带宽收益远超拼接开销。这是模型架构与推理引擎的协同设计——光看模型论文不够,要对着 ggml 的 cgraph 节点才能完整看明白。
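The absorption derivation above had a hidden assumption: the map from the latent to K is one fixed matrix, identical for every token. RoPE breaks that. A rotation is still linear, but it is position-dependent: Q and K are each multiplied by a rotation matrix that depends on the token's position.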
Concretely, RoPE makes K(m) = R(m) · K_orig, where R(m) is the rotation for position m. To absorb W_uk into W_q you'd need to absorb R(m) too — but R(m) is per-token, different each time; you can't bake it into precomputed weights.
DeepSeek's fix: split K in two:
At attention time the two parts are concatenated to form K_final. So MLA's KV actually stores a [d_c + d_rope] = 576-dim mixed vector — most dims enjoy absorption's compression, a few take the standard RoPE path. That's where "576 = d_c 512 + d_rope 64" earlier came from.
Same design on Q: q_nope (non-RoPE) + q_rope (RoPE). So DeepSeek-V2/V3 attention nodes have a few extra concat operations vs. Llama, but KV cache shrinks enough that bandwidth gains dwarf concat cost. Model architecture co-designed with the inference engine — reading the paper alone isn't enough; you need to read it against the ggml cgraph nodes to see the whole picture.
同一层 FFN,256 个备选,只跑 8 个
same FFN slot, 256 candidates, only 8 fire
到这里 attention 块讲完了,但 transformer 一层不只有 attention——后面还有 FFN(Feed Forward Network),通常占模型参数量的 2/3。对 Llama-3-8B 来说,FFN 是一个 4096 → 14336 → 4096 的两层 MLP(SwiGLU 激活),每层每 token 都跑这个 14336 维的中间表示。Decode 时极其费内存带宽。
MoE(Mixture of Experts)思路:不是只有一个 FFN,而是有 N 个"专家"FFN。每个 token 经过一个 router,只激活 top-k 个专家。比如 DeepSeek-V3 是 256 个路由专家 + 1 个共享专家,每 token 激活 8 个路由专家。"激活" = "FFN 计算实际经过这些专家",其它 248 个专家这一步根本不参与。
结果:模型参数总量很大(DeepSeek-V3 是 671B),但每个 token 实际激活的参数只有 37B——decode 阶段的内存带宽消耗按 37B 算,但模型容量是 671B。"大模型而不慢"的诀窍就在这里。
Attention is done, but a transformer layer has more than attention: there's also the FFN (Feed Forward Network), usually about 2/3 of all parameters. For Llama-3-8B, the FFN is a 4096 → 14336 → 4096 SwiGLU MLP (three weight matrices: gate, up, down), and every token, at every layer, goes through that 14336-dim intermediate. Murderously bandwidth-heavy at decode.
MoE (Mixture of Experts) idea: have N "expert" FFNs. Each token passes through a router, which activates only top-k experts. DeepSeek-V3 is 256 routed experts + 1 shared expert, 8 activated per token. "Activated" = "FFN compute actually goes through these"; the other 248 experts don't participate.
Result: total params huge (DeepSeek-V3 has 671B), but active params per token only 37B. Decode bandwidth = 37B's worth. Model capacity = 671B's worth. "Big model that isn't slow" — that's the trick.
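A toy sketch of top-k routing in NumPy. The router, the experts (plain linear maps here, not SwiGLU FFNs) and the softmax-over-selected-experts weighting are illustrative assumptions, not any particular model's implementation. The call counter makes the main point visible: only `top_k` experts do any work for a given token.

```python
import numpy as np

def moe_ffn(x, w_router, experts, top_k):
    """Route one token x through its top_k experts and mix their outputs."""
    logits = w_router @ x                    # one score per expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the top_k experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate /= gate.sum()                       # softmax over the chosen experts only
    y = sum(g * experts[i](x) for g, i in zip(gate, chosen))
    return y, chosen

rng = np.random.default_rng(0)
d, n_expert, top_k = 8, 16, 2
calls = [0] * n_expert                       # how many times each expert ran

def make_expert(i):
    w = rng.standard_normal((d, d))
    def expert(x):
        calls[i] += 1
        return w @ x
    return expert

experts = [make_expert(i) for i in range(n_expert)]
w_router = rng.standard_normal((n_expert, d))

y, chosen = moe_ffn(rng.standard_normal(d), w_router, experts, top_k)
print(sum(calls), "of", n_expert, "experts ran")  # 2 of 16 experts ran
```

All 16 weight matrices exist in memory, but only 2 are read for this token. That is the gap between total and active parameters in miniature.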
"每个 token 选 top-k"听起来简单,实际跑起来会出三种病:
DeepSeek-V3 用了"无 aux loss"路由(论文里的 "auxiliary-loss-free load balancing"),通过给每个专家一个动态 bias 来调整选择概率——既保留了"自然路由"的语义,又避免了 aux loss 对模型质量的影响。这个细节在 llama.cpp 里对应 build_moe_ffn 的 expert bias 参数。
"Each token picks top-k" sounds simple, but in practice three diseases emerge:
DeepSeek-V3 uses an "aux-loss-free" router (paper: "auxiliary-loss-free load balancing"), giving each expert a dynamic bias that nudges selection probabilities — keeping "natural routing" semantics while avoiding aux-loss's quality cost. In llama.cpp this surfaces as the expert-bias arg of build_moe_ffn.
671B 参数的 DeepSeek-V3 不可能塞进一张 GPU——8×H100 的 640 GB 也要量化才装得下。Tensor Parallelism(沿 d_model 维切矩阵)是稠密模型的标准做法,但对 MoE 不友好:每个 expert 都是独立的 FFN,横切它意义不大。MoE 的自然并行轴是expert parallelism——把 256 个专家分散到多张卡上,每张卡负责 32 个 expert。
但这立刻引出一个新问题:一个 token 选中的 top-8 个专家,大概率不在同一张卡上。每一层 FFN 都要把这个 token 的 hidden state发到 8 张卡里有它选中的 expert 的那几张,算完再聚回来——这就是 all-to-all 通信,MoE 训练 / 推理里最大的工程挑战。
671B-param DeepSeek-V3 won't fit on a single GPU — 8×H100 = 640 GB needs quantization to fit. Tensor Parallelism (cut along d_model) is the dense-model default but doesn't suit MoE: every expert is an independent FFN, cutting it horizontally helps little. The natural axis for MoE is expert parallelism — spread the 256 experts across cards, 32 experts per card.
This immediately introduces a new problem: a token's top-8 experts almost certainly don't all live on one card. Every FFN layer must broadcast this token's hidden state to the cards holding its selected experts, compute, then gather back — that's all-to-all communication, the biggest engineering challenge in MoE training/inference.
关键瓶颈是 step 2 和 step 4 的两次 all-to-all。8 卡 NVLink 4.0 单向 ~450 GB/s,每个 token hidden 状态 7168 × 2 bytes ≈ 14 KB,一次 batch 1024 token 的两次 all-to-all 是 ~30 MB,约 67 μs。听起来不多,但 DeepSeek-V3 有 58 个 MoE 层(共 61 层,前 3 层是稠密层),58 × 67 μs ≈ 3.9 ms 的纯通信时间。MoE 模型的 decode 延迟里,通信能占 20-30%。这就是为什么 DeepSeek 自己写了 DeepEP 那个高性能 all-to-all 库,以及为什么"训练 MoE 比训练同 FLOPs 的稠密模型工程量大得多"。
The bottleneck is the two all-to-alls in steps 2 and 4. 8-card NVLink 4.0 is ~450 GB/s one-way; each token's hidden state is 7168 × 2 bytes ≈ 14 KB. Two all-to-alls for a 1024-token batch move ~30 MB, about 67 μs. Sounds small, but DeepSeek-V3 has 58 MoE layers (61 layers in total, the first 3 dense), so that is 58 × 67 μs ≈ 3.9 ms of pure communication per forward pass. 20-30% of MoE decode latency can be communication, which is why DeepSeek wrote their own high-performance all-to-all library, DeepEP, and why "training MoE is much more engineering than training a dense model of the same FLOPs".
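The communication estimate above as a few lines of arithmetic. This is a sketch: it assumes each token's hidden state crosses the link once per all-to-all at the full one-way bandwidth, ignores latency and fan-out to multiple experts, and leaves the layer count as a parameter. Unrounded inputs come out slightly below the rounded 30 MB / 67 μs in the text.

```python
hidden_dim = 7168      # DeepSeek-V3 hidden size
bytes_per_elem = 2     # 16-bit activations
batch_tokens = 1024
link_bytes_per_s = 450e9   # ~450 GB/s one-way NVLink, as quoted in the text

def all_to_all_us(n_all_to_all=2):
    """Microseconds spent moving hidden states for one MoE layer."""
    payload = n_all_to_all * batch_tokens * hidden_dim * bytes_per_elem
    return payload / link_bytes_per_s * 1e6

def comm_ms(n_moe_layers):
    """Milliseconds of pure communication for one forward pass."""
    return n_moe_layers * all_to_all_us() / 1e3

per_layer_us = all_to_all_us()
print(f"{per_layer_us:.1f} us per MoE layer")
```

`comm_ms(n)` scales linearly with the number of MoE layers, so deeper MoE stacks pay this cost proportionally more often per generated token.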
MoE 训练里,经典的负载均衡方法是加一个 auxiliary loss(L_aux):惩罚"某些 expert 被选得过多"。但 aux loss 有两个副作用:
DeepSeek-V3 论文提出的 auxiliary-loss-free 方法,核心是给每个 expert 加一个 动态 bias b_i,在 routing 时:
选 top-k 时按 (gate_logit_i + b_i) 排序
但最终的"路由权重"还是用原始 gate_logit_i
巧妙之处:b_i 只影响"谁被选中",不影响"选中后的权重"。训练时维护每个 expert 的 token 计数,过载的 expert 把 b_i 调低(下次少被选),冷门 expert 把 b_i 调高(下次多被选)。这是外部反馈而不是梯度信号——主任务 loss 完全不被污染。
llama.cpp 在 build_moe_ffn 里通过 expert_bias 参数支持这个:加载 DeepSeek-V3 模型时,bias 是模型权重的一部分,运行时直接加到 gate logit 上。训练时的动态调整在推理时变成了静态的 bias 向量——又一个"训练阶段算的代价,推理阶段直接享用"的例子。
The classic MoE load-balancing trick is an auxiliary loss (L_aux): penalize "some experts get picked too often". But aux loss has two side effects:
DeepSeek-V3's auxiliary-loss-free method adds a dynamic bias b_i per expert. At routing time:
top-k selection by (gate_logit_i + b_i)
but the final "routing weight" still uses raw gate_logit_i
Clever bit: b_i affects who gets picked, not the weight of the pick. During training they track per-expert token counts; overloaded experts get b_i decreased (picked less next time), underloaded experts get b_i increased. It's external feedback, not a gradient signal — the main loss stays clean.
llama.cpp's build_moe_ffn supports this via its expert-bias argument (exp_probs_b in the llama.cpp source): when loading DeepSeek-V3, the bias is part of the saved weights and is added to the gate scores when picking the top-k. Training-time dynamic adjustment becomes an inference-time static bias, another "training pays, inference takes" idiom.
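A NumPy sketch of the two halves of this scheme: selection uses score plus bias, the mixing weights use the raw score, and the bias is nudged by load feedback rather than by gradients. The function names, the normalization of the weights, and the step size `gamma` are assumptions made for illustration, not DeepSeek's or llama.cpp's code.

```python
import numpy as np

def route(scores, bias, top_k):
    """Pick experts by (score + bias); weight them by the raw score only."""
    chosen = np.argsort(scores + bias)[-top_k:]
    weights = scores[chosen] / scores[chosen].sum()
    return chosen, weights

def update_bias(bias, load, gamma=0.001):
    """Training-time feedback: push overloaded experts down, underloaded ones up."""
    return bias - gamma * np.sign(load - load.mean())

scores = np.array([0.9, 0.8, 0.1])   # gate scores for 3 experts
bias = np.array([-0.5, 0.0, 0.0])    # expert 0 has been overloaded in the past

chosen, weights = route(scores, bias, top_k=2)
print(chosen, np.round(weights, 3))

new_bias = update_bias(np.zeros(3), load=np.array([10.0, 2.0, 0.0]))
print(new_bias)
```

With `top_k=2`, experts 0 and 1 are both selected, and expert 0 still receives the larger mixing weight because the weights ignore the bias. With `top_k=1`, the bias changes the outcome: expert 1 wins the slot even though expert 0 has the higher raw score.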
MoE 不是一个固定方案,每家在专家数 / top-k / 共享专家 / 路由函数这几个旋钮上做了不同选择。把目前主流的拉一张表对比:
MoE isn't one fixed recipe — each lab picks differently on number of experts / top-k / shared experts / gating function. A side-by-side of current production models:
| model | n_expert | top-k | shared experts | gate fn | active / total |
|---|---|---|---|---|---|
| Mixtral 8x7B | 8 | 2 | none | softmax | 12.9B / 47B |
| Mixtral 8x22B | 8 | 2 | none | softmax | 39B / 141B |
| DBRX | 16 | 4 | none | softmax | 36B / 132B |
| Phi-3.5-MoE | 16 | 2 | none | softmax + aux loss | 6.6B / 42B |
| DeepSeek-V2 | 160 routed + 2 shared | 6 | 2, always active | softmax + aux loss | 21B / 236B |
| DeepSeek-V3 | 256 routed + 1 shared | 8 | 1, always active | sigmoid + aux-free bias | 37B / 671B |
| Llama-4 Maverick | 128 | 1 | yes | sigmoid | 17B / 400B |
| Llama-4 Behemoth | 16 | 1 | yes | sigmoid | 288B / ~2T |
看几个有意思的演化方向:
从这张表能看出 LLM 工程界的方法收敛:大家在不同点上探索,但多数都在往"更多小专家 + 共享专家 + sigmoid router + aux-free balance" 这个方向走。2024 年底的"教科书 MoE" 跟 2021 年 Switch Transformer 论文里的"教科书 MoE" 已经很不一样。
Notable evolution lines:
This table shows the field's methodological convergence: labs explored different points, but most moved toward "more small experts + shared expert + sigmoid router + aux-free balance". The "textbook MoE" of late 2024 looks very different from the "textbook MoE" of the 2021 Switch Transformer paper.
训练 MoE 时,因为batch 内每个专家收到的 token 数不固定,GPU 内存分配很难——你不知道要给每个专家留多少空间。早期 Switch Transformer 的解决方案是"给每个专家一个 capacity factor"——比如每 expert 最多收 1.5 × (n_tokens / n_experts) 个 token,超过的直接丢掉(走残差跳过 FFN)。
这件事在训练里 acceptable,在推理里完全不行——丢一个 token 模型生成就崩。所以推理时的 MoE 实现都不做 capacity 限制:Mixtral / DeepSeek 推理时每个专家无上限地处理所有路由到它的 token。代价是每个 expert 处理的 token 数不齐——某些 expert 一次跑 50 个 token,另一些跑 5 个。这导致 GPU 上 grouped GEMM 的负载不均,部分 SM 提前完工等待。
这就是为什么"推理时 MoE 实际加速比理论激活率低" 的根本原因——理论上 671B/37B 是约 18× 算力节省,但实测可能只到 10× 左右,中间约 1.8 倍的差距损耗在负载不均 + 通信开销上。vLLM 和 SGLang 都在专家分组(把热门专家分到不同卡上)和动态调度上做了很多努力,但这仍是个开放问题。
In MoE training, each expert receives a variable number of tokens per batch — GPU memory allocation is awkward; you don't know how much to reserve per expert. Early Switch Transformer introduced "capacity factor" — cap each expert at 1.5 × (n_tokens / n_experts); overflow tokens are dropped (skip FFN via residual).
Acceptable in training; unacceptable in inference — drop one token and generation breaks. So inference MoE implementations don't cap capacity: Mixtral / DeepSeek let each expert process all tokens routed to it. Cost: uneven per-expert token counts — some get 50 tokens, others 5. This makes grouped GEMM load-imbalanced; some SMs finish early and idle.
This is the root reason MoE inference speedup falls short of the theoretical active ratio. 671B/37B suggests an ~18× compute saving; in practice it may be closer to ~10×. The remaining gap (roughly 1.8×) goes to load imbalance and communication overhead. vLLM and SGLang invest heavily in expert placement (spreading hot experts across cards) and dynamic scheduling, but it remains an open problem.
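To make the capacity-factor trade-off concrete, here is a small sketch of training-style dispatch with a cap versus inference-style dispatch without one. The helper names `expert_capacity` and `dispatch` are hypothetical, and top-1 routing is assumed to keep the example short.

```python
import math
from collections import Counter

def expert_capacity(n_tokens, n_experts, capacity_factor=1.5):
    """Per-expert token cap: capacity_factor * (n_tokens / n_experts), rounded up."""
    return math.ceil(capacity_factor * n_tokens / n_experts)

def dispatch(assignments, n_experts, capacity=None):
    """assignments[t] is the expert chosen for token t (top-1 for simplicity).

    Returns (tokens handled per expert, indices of dropped tokens).
    capacity=None models inference: no cap, nothing is dropped.
    """
    load = Counter()
    dropped = []
    for t, e in enumerate(assignments):
        if capacity is not None and load[e] >= capacity:
            dropped.append(t)      # overflow token skips the FFN via the residual
        else:
            load[e] += 1
    return [load[e] for e in range(n_experts)], dropped

if __name__ == "__main__":
    routing = [0] * 6 + [1] * 2                  # 8 tokens, expert 0 is "hot"
    cap = expert_capacity(len(routing), 4)
    print(dispatch(routing, 4, capacity=cap))    # capped: some tokens are dropped
    print(dispatch(routing, 4))                  # uncapped: uneven load, no drops
    print(671 / 37)                              # the theoretical active ratio quoted above
```

The uncapped result is the uneven per-expert load that the grouped GEMM then has to absorb.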
最后一次 matmul · 4096 维 → 128k 维
the last matmul · 4096 → 128k
走完 32 层 transformer,最后一个 token("llama")的 hidden state 是一个 4096 维向量。它是模型对"下一个 token 应该是什么"的分布式表示——但还不是概率分布。要变成概率分布,得先投回 128k 维的词表空间,得到logits:每个 vocab id 上一个实数,代表"它当下一个 token 的合理性"。
这一步的计算极简——一次 matmul。但矩阵很大:LM head 的权重 output 是 [n_embd, n_vocab] = [4096, 128256],fp16 占 1 GB。所以 LM head 跟 embedding 表是整个模型里两个最重的单体权重。
有意思的是,很多模型(GPT-2、Gemma,以及较小的 Llama-3.2 1B/3B)把这两个权重共享(weight tying):embedding 表本身就是 LM head 的转置。这省了 1 GB,代价是输出和输入空间被绑定。Llama-3-8B 跟 Llama-1/2 一样不绑,允许两边各自优化,代价是多占内存。
After 32 transformer layers, the final token's ("llama") hidden state is a 4096-dim vector. It's the model's distributed representation of "what the next token should be" — but not yet a probability distribution. To get one, project back to 128k-dim vocab space, producing logits: a real number per vocab id, encoding "plausibility as the next token".
The compute is dead simple — one matmul. But the matrix is huge: LM head's output weight is [n_embd, n_vocab] = [4096, 128256], 1 GB in fp16. LM head + embedding table = the two heaviest single weights in the model.
Interesting twist: many models (GPT-2, Gemma, and the small Llama-3.2 1B/3B) share these two via weight tying, where the embedding table is the LM head's transpose. That saves 1 GB at this size; the cost is that input and output spaces are tied. Llama-3-8B, like Llama-1/2 before it, keeps them untied: both sides can specialize, at the price of more memory.
第 5 章那个 logits=[0,0,0,0,0,1] 现在派上用场了。prefill 时,6 个 token 都走完了 32 层 transformer,理论上 6 个位置都能算出 logits——但我们只关心第 6 个(预测第 7 个 token)。所以 llama.cpp 在 LM head 之前会 skip 掉不需要的位置,只对那 1 个 token 做 matmul:
节省的 FLOPs ≈ (n_tokens - 1) × 2 × n_embd × n_vocab
prefill 6 token 时:省 ≈ 5.3 GFLOPs · 不小
这是个简单但关键的优化——所有现代推理引擎都默认开。如果你做 logprob 输出(用户要每个 token 的概率),才会强制要求"所有位置都算 logits",这时 LM head 就成了 prefill 阶段的次要瓶颈。
That logits=[0,0,0,0,0,1] from Ch.5 finally earns its keep. Prefill: all 6 tokens have walked the 32 layers, all 6 positions could in principle yield logits — but we only care about position 6 (predicts token 7). So before LM head llama.cpp skips the unneeded positions and matmuls only the 1 token of interest:
FLOPs saved ≈ (n_tokens - 1) × 2 × n_embd × n_vocab
6-token prefill: ~5.3 GFLOPs saved · not nothing
Simple but load-bearing optimization — every modern engine has it on. The only time it's off: logprob output mode (user wants per-token probabilities), then LM head becomes a secondary prefill bottleneck.
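The saving formula is easy to check by plugging in the numbers. A quick sketch, assuming the Llama-3-8B shapes used throughout (n_embd = 4096, n_vocab = 128256); the function names are made up for the example.

```python
def lm_head_flops(n_positions, n_embd=4096, n_vocab=128256):
    """FLOPs of the LM head matmul: 2 * n_embd * n_vocab per position."""
    return 2 * n_positions * n_embd * n_vocab

def flops_saved(n_tokens, n_embd=4096, n_vocab=128256):
    """FLOPs avoided by computing logits for the last position only."""
    return lm_head_flops(n_tokens, n_embd, n_vocab) - lm_head_flops(1, n_embd, n_vocab)

def lm_head_bytes(n_embd=4096, n_vocab=128256, bytes_per_weight=2):
    """Size of the LM head weight matrix (fp16 by default)."""
    return n_embd * n_vocab * bytes_per_weight

if __name__ == "__main__":
    for n_tokens in (4, 6, 2048):
        print(n_tokens, "tokens:", flops_saved(n_tokens) / 1e9, "GFLOPs saved")
    print("LM head size:", lm_head_bytes() / 1e9, "GB")
```

For a handful of prompt tokens the saving is a few GFLOPs; for a multi-thousand-token prompt it grows linearly with prompt length, which is why the skip matters most on long prefills.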
Llama-3 词表 128k,LM head 1 GB——还能装下一张 H100。但多语言模型(NLLB, XLM-R)词表能达到 250k-1M,LM head 直接膨胀到 4-10 GB。在大模型 + 大词表场景下,LM head 反而成了单卡放不下的张量。
解决方案是 vocab-parallel sharding:把 LM head 沿词表维度切到多张卡上,每张卡只算 vocab 的一段 logit。比如 TP=8 下,第 0 张卡算 vocab[0:16000] 的 logit,第 1 张卡算 vocab[16000:32000]……做 sampling 时需要一次跨卡聚合(all-gather 全部 logit 到一张卡上 sample,或者每张卡本地 top-k 然后聚合 top-k)。
注意这跟 TP 的"沿 d_model 切矩阵" 不同——TP 在 attention/FFN 里切的是输入或输出特征维,词表维是新增的并行轴。所以 vocab-parallel 在概念上是第 4 种并行(在 TP/PP/EP 之外)。生产环境很少独立使用,通常跟 TP 叠加。Megatron-LM 的 VocabParallelEmbedding 类专门做这件事。
另一个相关 trick: logit soft-capping(Gemma-2 用)。如果 LM head 的输出范围 unbounded,某些 token 的 logit 可能爆到 ±100,softmax 后退化成 one-hot,sampling 失去多样性。在 LM head 后面套一个 tanh(x / cap) × cap (cap 是 ~30),把范围拉回 [-cap, cap]——既稳又给 sampling 留概率分布。llama.cpp 里对应的超参数是 f_final_logit_softcapping;f_logit_scale 是另一个旋钮,只做简单的乘法缩放。
Llama-3's 128k vocab + 1 GB LM head still fits one H100. But multilingual models (NLLB, XLM-R) push vocabs to 250k-1M, LM head balloons to 4-10 GB. Large model + large vocab → LM head becomes a tensor that doesn't fit on one card.
The fix: vocab-parallel sharding. Shard LM head along the vocab axis across cards; each card computes logits for one slice. Under TP=8, card 0 handles vocab[0:16000], card 1 handles [16000:32000], etc. Sampling then needs one cross-card aggregation (all-gather all logits to one card to sample, or local top-k per card then aggregate).
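The "local top-k per card, then aggregate" variant can be sketched without any GPU at all. In this toy version each "card" is just an index range; `shard_ranges`, `local_topk`, and `global_topk` are hypothetical names, and a real system would replace the final merge with a collective operation.

```python
import heapq

def shard_ranges(n_vocab, n_shards):
    """Split [0, n_vocab) into n_shards contiguous slices (the last may be shorter)."""
    per = -(-n_vocab // n_shards)  # ceiling division
    return [(s * per, min((s + 1) * per, n_vocab)) for s in range(n_shards)]

def local_topk(logits, lo, hi, k):
    """What one card computes: top-k (logit, global_id) pairs within its slice."""
    return heapq.nlargest(k, ((logits[i], i) for i in range(lo, hi)))

def global_topk(logits, n_shards, k):
    """Merge the per-shard winners; only n_shards * k pairs cross the 'network'."""
    candidates = []
    for lo, hi in shard_ranges(len(logits), n_shards):
        candidates.extend(local_topk(logits, lo, hi, k))
    return heapq.nlargest(k, candidates)

if __name__ == "__main__":
    print(shard_ranges(128256, 8)[:2])   # exact slice bounds for a 128256-entry vocab
    toy_logits = [((i * 37) % 101) / 10 for i in range(1000)]
    print(global_topk(toy_logits, n_shards=8, k=5))
```

For a 128256-entry vocab the exact slice width under TP=8 is 16032; the 16000 in the text is a rounded figure. The merge is exact because any global top-k entry must also be in the top-k of its own shard.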
This is distinct from "shard along d_model" TP — TP cuts input/output feature dims in attention/FFN; vocab is a new parallel axis. So vocab-parallel is conceptually a fourth parallelism dimension (alongside TP/PP/EP). Rarely standalone in production, usually layered with TP. Megatron-LM's VocabParallelEmbedding class does this.
A related trick: logit soft-capping (Gemma-2). If LM head output is unbounded, some token logits can blow up to ±100, softmax degenerates to near one-hot, and sampling loses diversity. Wrapping the LM head output in cap × tanh(x / cap) (cap ≈ 30) pulls it back into [-cap, cap]: numerically stable, and it leaves a usable probability distribution for sampling. In llama.cpp this is driven by the f_final_logit_softcapping hyperparameter; f_logit_scale is a different knob, a plain multiplicative logit scale.
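The soft-capping formula is a one-liner; the sketch below just shows how it behaves at the extremes. The function name `soft_cap` and the default cap of 30 are illustrative choices based on the "~30" mentioned above.

```python
import math

def soft_cap(logit, cap=30.0):
    """Squash a logit smoothly into (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

if __name__ == "__main__":
    for x in (-100.0, -5.0, 0.0, 1.0, 5.0, 100.0):
        print(x, "->", round(soft_cap(x), 4))
```

Small logits pass through almost unchanged (tanh is close to the identity near zero), while extreme values saturate just under the cap, so the ordering of tokens is preserved but the gap between them is bounded.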
第 18 章那个 Q4_K_M = "mixed precision" 把 LM head 单独留在 Q6_K 不量化到 Q4 ——这件事的原因值得展开。
普通层(attention / FFN)的权重量化损失会被后续层平滑掉:某个 weight 量化错 5%,经过 32 层 transformer 的层层抹平,对最终 hidden state 影响只剩 ~0.1%。但 LM head 不一样——它是最后一层,后面没有 transformer 抹平。它的输出直接是 sampling 的输入。
更糟的是,LM head 是 [n_embd × n_vocab] 的矩阵,n_vocab = 128k 行各自独立。一个权重出错只影响一个 vocab id 的概率——但哪一个 id 是随机的。如果倒霉量化错的那个 id 恰好是高频 token("the", " ", "的"),整个生成质量立刻断崖式下跌。
实测数据:对 Llama-3-8B,LM head 单独量化到 Q4 → PPL 涨 ~8%(比全模型 Q4 还差),保持 Q6_K → PPL 涨仅 ~0.5%。多花约 130 MB 内存(约 5.25 亿个权重每个多 2 bit)换回约 7.5 个百分点的 PPL,绝对划算。
这件事的一般原则:越接近模型 IO 边界的层,量化越敏感。LM head 和 embedding 表都属于"边界层",量化要谨慎;中间的 attention/FFN 厚得多,量化容忍度高。理解这个,你就明白为什么 GGUF 量化方案那么琐碎(每层不同位数)——它是大量经验测试出的层级量化策略。
Ch.18's Q4_K_M = "mixed precision" leaves LM head at Q6_K instead of quantizing to Q4. Worth unpacking why.
Quantization errors in regular layers (attention / FFN) get smoothed over by subsequent layers: a 5% weight error, after 32 layers of transformer flattening, contributes only ~0.1% to the final hidden state. LM head is different — it's the last layer; no transformer behind to smooth. Its output is directly sampling's input.
Worse: LM head is [n_embd × n_vocab] with n_vocab = 128k independent rows. One weight error affects just one vocab id's probability — but which id is random. If the misquantized id happens to be a high-frequency token ("the", " ", "的"), generation quality drops off a cliff.
Measurements: Llama-3-8B with only the LM head quantized to Q4 → PPL up ~8% (worse than the whole model at Q4); keeping it at Q6_K → PPL up only ~0.5%. Roughly 130 MB of extra memory (about 2 extra bits on each of ~525M weights) buys back ~7.5 points of PPL, which is clearly worth it.
General principle: layers near the model's IO boundary are more quant-sensitive. LM head and embedding are "boundary layers" — quantize carefully. Middle attention/FFN is much thicker, more tolerant. Understand this and you'll see why GGUF quantization is so fiddly (different bits per layer) — it's a per-layer empirical strategy crafted from extensive testing.
llama-sampling.cpp · 一条 sampler chain
llama-sampling.cpp · a chain of samplers
logits 在手了——128256 个浮点数,代表每个 vocab id 当下一个 token 的"可能性"。要让它变成一个具体的 token,要经过两步:
先用 temperature、top-k、top-p、repetition penalty 等等"过滤器"修改 logits/probabilities——让分布更尖、更扁、或者抑制重复;再从修过形的分布里挑出一个具体的 token。
llama.cpp 把这两步统一抽象成"sampler chain"——每个 sampler 是一个改 logits 的函数,链尾的那个负责"挑出一个 token"。这设计跟 Unix pipe 一样:简单、组合、可换。
Logits in hand — 128256 floats, each one's "plausibility" as the next token. To pick one concrete token, two steps:
First, temperature, top-k, top-p, repetition penalty and other "filters" modify the logits/probabilities: sharpening, flattening, or suppressing repeats. Second, one concrete token id is drawn from the reshaped distribution.
llama.cpp models both as a "sampler chain": each sampler is a function that mutates logits, and the tail sampler picks the final id. Like Unix pipes: simple, composable, swappable.
| sampler | 在干什么 / what it does | 控制感 / feel |
|---|---|---|
| temperature | logit ÷ T · T < 1 让分布更尖,T > 1 让它更扁 / divide logits by T · T<1 sharpens, T>1 flattens | creativity |
| top-k | 只保留前 k 个,其余设 -inf / keep top k, others to -inf | cut tail |
| top-p | 累积概率到 p 为止,其余设 -inf · 比 top-k 自适应 / keep cumulative probability up to p; adaptive | cut adaptive tail |
| min-p | 保留 p ≥ min_p × max_p 的所有,丢其余 · 比 top-p 更稳 / keep where p ≥ min_p × max_p · steadier than top-p | robust cut |
| repetition penalty | 最近出现过的 token 的 logit 按比例打折 · 防重复 / scale down logits of recent tokens (divide if positive, multiply if negative) · anti-loop | de-loop |
| frequency penalty | 按出现次数从 logit 里减去 count × penalty · OpenAI 风格 / subtract count × penalty from the logit · OpenAI-style | de-loop |
| mirostat | 动态调 temperature 让"perplexity"贴近目标值 / dynamically tune temp toward a target perplexity | auto-pilot |
| grammar / GBNF | 把"不符合给定语法的 token"全部 mask 掉 · 强约束输出格式 / mask any token not matching a BNF grammar · structured output | hard-rail |
| dist (链尾 / chain tail) | 最后这步:从修过形的概率里掷骰子选一个 token / final step: actually draw from the reshaped distribution | roll |
Sampler 顺序很重要。一个常见的好顺序:penalties → top-k → top-p → temperature → dist——先按规则过滤,再调整尖度,最后掷骰子。顺序换了行为也变:先 temperature 再 top-p 跟先 top-p 再 temperature 结果完全不同。
Sampler order matters. A common good order: penalties → top-k → top-p → temperature → dist — filter first, sharpen second, roll last. Reorder and behavior changes: "temp before top-p" ≠ "top-p before temp".
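A toy sampler chain makes the ordering effect visible. This is a sketch of the idea only, not llama.cpp's `llama_sampler` API: candidates are plain `(token_id, logit)` pairs, filters drop candidates instead of setting them to -inf, and every name here is made up for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature(t):
    def apply(cands):
        return [(tok, logit / t) for tok, logit in cands]
    return apply

def top_k(k):
    def apply(cands):
        return sorted(cands, key=lambda c: c[1], reverse=True)[:k]
    return apply

def top_p(p):
    def apply(cands):
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        probs = softmax([logit for _, logit in ranked])
        kept, cum = [], 0.0
        for cand, pr in zip(ranked, probs):
            kept.append(cand)
            cum += pr
            if cum >= p:          # stop once the kept mass reaches p
                break
        return kept
    return apply

def dist(rng):
    def pick(cands):
        probs = softmax([logit for _, logit in cands])
        return rng.choices([tok for tok, _ in cands], weights=probs, k=1)[0]
    return pick

def run_chain(logits, filters, picker=None):
    """Apply each filter in order; the optional tail sampler picks one token id."""
    cands = list(enumerate(logits))
    for f in filters:
        cands = f(cands)
    return picker(cands) if picker else cands

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    print(len(run_chain(logits, [top_p(0.8), temperature(2.0)])))  # top-p sees the sharp distribution
    print(len(run_chain(logits, [temperature(2.0), top_p(0.8)])))  # top-p sees the flattened one
    print(run_chain(logits, [top_k(3), top_p(0.8), temperature(0.7)], dist(random.Random(0))))
```

With temperature applied first, top-p operates on a flatter distribution and needs more candidates to reach the same cumulative mass, so the surviving set differs between the two orders.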
"你好,llama" 走完 32 层、LM head 投影、sampler chain 之后,假设我们用 temperature=0.7 + top-p=0.9,选出来的下一个 token 可能是 "你"(id 47045)。这个 id 会被立即拼到对话历史里,作为下一次 decode 的最后一个 token——下一次 decode 时,它的 K 和 V 算出来写进 KV cache,Q 跟前 6 个 K 算 attention,产出第 8 个 token……如此循环,直到第 15 章的 EOS 出现。
After our prompt passes through 32 layers, LM head, and a sampler chain with temperature=0.7 + top-p=0.9, the next token might be (say) "Hi". That id is immediately appended to the conversation history and becomes the last token of the next decode. Its K and V are computed and written into the KV cache; its Q attends against the 6 cached K (plus its own), producing token 8. Repeat until the EOS of Ch.15 fires.
top-p 有个众人皆知但很少人说清楚的毛病:cutoff 不自适应。top-p=0.9 在高确定性情境下会保留太多——比如下一个 token 是 ".",最大概率 0.85,top-p=0.9 还得拉进来一两个 ~0.05 的备选 token,这些备选语义可能完全无关,纯粹靠 temperature 进 noise。top-p=0.9 在高不确定性情境下又保留太少——如果前 20 个 token 概率都在 0.04-0.06 之间,top-p=0.9 强行卡在 ~18 个,把第 19、20 个有意义的 token 砍掉。
Min-p(2024 paper《Min P Sampling: Balancing Creativity and Coherence at High Temperature》)解决了这个:不按累积概率算,而是按"跟最大概率的相对值"算。规则:保留所有 p ≥ min_p × p_max 的 token。
结果是更适合高 temperature 创作场景 + 避免低概率 noise token。llama.cpp 早期就支持,vLLM 2024 也加进了 sampler chain。"2024 年的默认 sampler 是 min_p 而不是 top_p" 已是社区共识。
Top-p has a widely known but rarely articulated issue: its cutoff isn't adaptive. Top-p=0.9 keeps too many in high-certainty situations — if the next token is "." with p_max = 0.85, top-p=0.9 still pulls in one or two ~0.05 alternatives that may be semantically unrelated, pure temperature noise. Top-p=0.9 keeps too few in high-uncertainty situations — if the top 20 tokens all hover 0.04-0.06, top-p=0.9 cuts at ~18 and drops legitimate 19th and 20th candidates.
Min-p (2024 paper Min P Sampling: Balancing Creativity and Coherence at High Temperature) fixes this. Instead of cumulative probability, use "relative-to-max": keep all tokens with p ≥ min_p × p_max.
Result: better for high-temperature creative writing + avoids low-prob noise tokens. llama.cpp shipped support early; vLLM added it in 2024. "The 2024 default sampler is min_p, not top_p" is community consensus now.
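The two failure modes of top-p and the min-p rule can be compared on the two kinds of distribution described above. A minimal sketch: `top_p_keep` and `min_p_keep` are hypothetical helpers that only count survivors, and the probability vectors are made-up examples shaped like the ones in the text.

```python
def top_p_keep(probs, p):
    """Number of tokens kept by top-p: smallest prefix whose mass reaches p."""
    kept, cum = 0, 0.0
    for pr in sorted(probs, reverse=True):
        kept += 1
        cum += pr
        if cum >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Number of tokens kept by min-p: everything with p >= min_p * p_max."""
    threshold = min_p * max(probs)
    return sum(1 for pr in probs if pr >= threshold)

if __name__ == "__main__":
    peaked = [0.85, 0.06, 0.05, 0.03, 0.01]   # high-certainty step
    flat = [0.05] * 20                        # high-uncertainty step
    print("peaked:", top_p_keep(peaked, 0.9), "vs", min_p_keep(peaked, 0.1))
    print("flat:  ", top_p_keep(flat, 0.9), "vs", min_p_keep(flat, 0.1))
```

On the peaked distribution min-p keeps only the dominant token while top-p drags in a low-probability runner-up; on the flat one min-p keeps every equally plausible candidate while top-p cuts some of them off.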
"repetition penalty"(C14 主表里那条)是最早出现的抗重复 sampler,但它太粗暴——只要 token 出现过就一律打折,常常误杀必须重复的 token(逗号、空格、"the" 这种)。2024 年社区(主要 KoboldAI / SillyTavern 圈子)给出了两个更精细的方案:
这两个 sampler 都不在主流学术论文里,纯粹是 llama.cpp 社区(应用方)摸出来的工程经验。它们说明了 sampling 这一层的设计空间还远没有探索完——研究界主要关心模型本身,但 sampling 是离用户体感最近的一层,小调整能带来大改变。
"Repetition penalty" (in C14's main table) is the earliest anti-loop sampler, but too blunt — discounts any token that appeared, often killing necessary repeats (commas, spaces, "the"). The 2024 community (mostly KoboldAI / SillyTavern circles) came up with two finer-grained alternatives:
Neither sampler is in mainstream academic papers — they're pure llama.cpp-community engineering folklore. They show that the sampling design space is far from exhausted. Research focuses on the model; sampling is the layer closest to user experience, and small tweaks can change a lot.
Mirostat 是个奇怪但优雅的 sampler。它的目标不是"砍掉低概率 token",而是"让生成序列的 surprise 保持在指定值"——具体说,让每个生成 token 的 -log(p) 平均稳定在 tau(用户给的目标值,比如 5.0)。
工作原理是一个简单的反馈控制:
tau ——这是目标"perplexity"(实际上是 ln-perplexity 的等价)。
mu ——它就像一个 top-k 的"有效 k"。

用户体感: tau=3 → 模型谨慎、保守(每个 token 都"很确定");tau=8 → 模型大胆、有创造性(每个 token 都"有点出乎意料")。这比 temperature 更语义可控——temperature 是输入,perplexity 是输出,Mirostat 直接调输出。
llama.cpp 内置 mirostat v1 / v2 两个版本,sampler chain 里一个 sampler 就实现。它在故事生成 / 角色扮演等"需要保持一致语气"的场景下显著好于固定 temperature——因为它主动校正了局部 perplexity 漂移。
Mirostat is a strange-but-elegant sampler. Its goal isn't "cut low-prob tokens" but "keep generation surprise at a target setpoint" — specifically, average -log(p) per generated token at tau (user setpoint, e.g. 5.0).
It's literally a feedback controller:
tau: the target "perplexity" (technically the log-perplexity equivalent).
mu: it acts like an effective top-k cap.

User feel: tau=3 → cautious, conservative (every token "certain"); tau=8 → bold, creative (every token "slightly surprising"). This is more semantically controllable than temperature: temperature is an input, perplexity is an output, and Mirostat drives the output directly.
llama.cpp ships mirostat v1 / v2, each a single sampler in the chain. Notably better than fixed temperature for "maintain consistent tone" scenarios (fiction, roleplay) — it actively corrects local perplexity drift.
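The feedback loop can be sketched as a single step function in the style of Mirostat v2. This is a simplified illustration under stated assumptions, not llama.cpp's implementation: surprise is measured in bits (log base 2), `mu` is typically started at 2 × tau, `eta` is the learning rate, all probabilities are assumed to be strictly positive, and the function name is made up.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau, eta, rng):
    """One sampling step: truncate by surprise, sample, then nudge mu toward tau.

    probs: next-token probabilities (all > 0). Returns (token_id, new_mu).
    """
    ranked = sorted(enumerate(probs), key=lambda c: c[1], reverse=True)
    # 1. Drop tokens whose surprise -log2(p) exceeds mu; always keep the top token.
    kept = [c for c in ranked if -math.log2(c[1]) <= mu] or ranked[:1]
    # 2. Renormalize over the survivors and sample one of them.
    total = sum(p for _, p in kept)
    weights = [p / total for _, p in kept]
    idx = rng.choices(range(len(kept)), weights=weights, k=1)[0]
    token = kept[idx][0]
    # 3. Feedback: if the pick was more surprising than tau, tighten mu; else loosen it.
    surprise = -math.log2(weights[idx])
    new_mu = mu - eta * (surprise - tau)
    return token, new_mu

if __name__ == "__main__":
    rng = random.Random(0)
    tau, eta = 3.0, 0.1
    mu = 2 * tau
    for _ in range(5):
        token, mu = mirostat_v2_step([0.7, 0.2, 0.1], mu, tau, eta, rng)
        print(token, round(mu, 3))
```

Because this toy distribution is much less surprising than tau = 3 bits, `mu` drifts upward over the steps, which is the controller loosening the cutoff to let more surprising tokens through.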
2018 年 BERT 之前的所有 seq2seq 翻译 / 摘要任务都用 beam search:维护 k 个候选序列,每步把每个候选扩展所有可能的下一 token,排序保留概率最高的 k 个。它不掷骰子,纯启发式搜索"整体最优序列"。
但 LLM 推理基本不用 beam search,理由:
n 参数的原因。
少数仍用 beam search 的场景:翻译任务(正确翻译是相对客观的)、speculative decoding 里的 draft 模型(选 top-k 候选)、code completion(语法约束下 beam 比 sampling 稳)。但聊天 / 创意写作全部用 sampling-based。这也是Transformer 时代的 sampling 哲学跟 RNN 时代不一样的地方:模型本身够好,允许多样性 > 强行最优。
Before 2018's BERT, all seq2seq translation / summarization used beam search: maintain k candidate sequences; at each step expand each by all possible next tokens; keep the top k by probability. No dice rolling; pure heuristic search for the "globally most likely sequence".
But LLM inference almost never uses beam search:
… the reason for the n parameter.
Beam search survives in: translation (where "correct" is relatively objective); speculative decoding's draft model (top-k candidates); code completion (under syntactic constraints, beam is steadier). But chat and creative writing all use sampling. This is the Transformer-era sampling philosophy: the model is good enough that diversity beats forced optimality.
没有魔法,就是一个特殊 id
no magic, just one special id
模型怎么知道"该停了"?这件事看起来需要"理解语义",其实没有——它只是一个特殊的 token id。Llama-3 的 EOS 是 128009(对应字符串 <|eot_id|>),DeepSeek-V2 是 100001。训练时,每段对话末尾都接这个 token,模型学会"在合适的地方输出它"。推理时,sampler 一旦采到这个 id,主循环就 break。
How does the model know "time to stop"? It sounds like that needs "semantic understanding". It doesn't: it's just one special token id. Llama-3's EOS is 128009 (string: <|eot_id|>); DeepSeek-V2's is 100001. At training time every dialog ends with this token, and the model learns to output it at appropriate spots. At inference, the moment the sampler picks that id, the main loop breaks.
llama_vocab_is_eog 里的 EOG 是"end of generation"——一个广义概念,不只是 EOS。Llama-3 有两个 EOS:<|end_of_text|>(128001)是"整段文本结束",<|eot_id|>(128009)是"当前轮次结束"。在 chat 场景里我们要在轮次结束就停,而不是等"文本结束"——所以 sampler 默认把这两个都当 stop token。
这又是个容易踩坑的工程细节:Llama-3 刚开源时,很多人用 Llama-2 风格的 stop token(只检测 </s>),结果模型说完一轮后继续往下扯"User:..."——因为 <|eot_id|> 不在 stop 列表里。这种 bug 在整个推理栈里最难调,因为模型看起来一切正常,只是不知道该闭嘴。
EOG in llama_vocab_is_eog = "end of generation" — a broader concept than EOS. Llama-3 has two EOSs: <|end_of_text|> (128001) = "document ends", <|eot_id|> (128009) = "this turn ends". In chat we want to stop at turn end, not document end — so the sampler treats both as stop tokens by default.
This is another nasty engineering footgun: when Llama-3 first shipped, many integrations used Llama-2-style stop detection (only </s>) and the model would happily keep going after its turn, hallucinating "User:..." — because <|eot_id|> wasn't in the stop set. Hardest bug to debug in the whole stack: the model looks fine, just doesn't know how to shut up.
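The footgun is easy to reproduce with a toy decode loop. In this sketch the "model" is just a scripted callable, and `generate` is a hypothetical helper; the two token ids are the Llama-3 values quoted above.

```python
END_OF_TEXT = 128001   # <|end_of_text|>: the whole document ends
EOT_ID = 128009        # <|eot_id|>: the current turn ends
EOG_TOKENS = {END_OF_TEXT, EOT_ID}

def generate(sample_next, max_new_tokens=256, eog_tokens=EOG_TOKENS):
    """Decode until any end-of-generation token is sampled (or the budget runs out).

    sample_next(history) stands in for "forward pass + sampler chain".
    """
    out = []
    for _ in range(max_new_tokens):
        tok = sample_next(out)
        if tok in eog_tokens:
            break              # the EOG token itself is not emitted
        out.append(tok)
    return out

if __name__ == "__main__":
    script = [9906, 11, EOT_ID, 1234]   # made-up ids; 1234 plays the hallucinated next turn
    it = iter(script)
    print(generate(lambda hist: next(it)))                      # stops at the turn boundary
    it = iter(script)
    print(generate(lambda hist: next(it), max_new_tokens=4,
                   eog_tokens={END_OF_TEXT}))                   # the bug: runs past <|eot_id|>
```

The second call mirrors an integration that only knows about the document-level end token: the loop keeps decoding straight through the turn boundary.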
OpenAI API 允许传 stop=["\n\n", "User:", "###"]——模型一旦输出匹配这些字符串就停。看起来简单但实现有三个坑:
"User:" 这个字符串可能跨越多个 token——"Use"+"r:"。所以推理引擎要每生成一个 token,都重新拼接最近 N 个 token 的字符串,做后缀匹配。
假设 stop 是 "User:",模型刚生成了 "User",可能下一个 token 是 ":" → 触发 stop,可能是 " feedback" → 不触发。但 "User" 已经流式发给客户端了。如果触发了 stop,要么不流式("攒一段再发"——延迟变高),要么截断流式输出(发了 "User" 就发不回去)。OpenAI 选了"不流式直到 stop 决出胜负"——所以你会观察到 streaming API 偶尔"卡一下"再继续,就是引擎在判断 stop。

这件事在 llama.cpp 里对应 common params 的 antiprompt 列表,匹配在主循环里做。很多 wrapper 库写错过这个——一个常见 bug 是 stop 触发了但已经流出去的 token 没有 rollback,客户端 UI 显示了不该显示的内容。
OpenAI's API accepts stop=["\n\n", "User:", "###"]: generation stops the moment the output matches one of these strings. Looks simple; three traps:
"User:" may span multiple tokens, e.g. "Use" + "r:". So after every token the engine must re-concatenate the text of the last N tokens and check for a suffix match.
Say the stop string is "User:" and the model just emitted "User"; the next token might be ":" → triggers; or " feedback" → doesn't. But "User" has already streamed to the client. If stop triggers, you either don't stream ("buffer until resolved", latency up) or truncate the streamed output (you sent "User" and can't unsend it). OpenAI picked "no streaming until stop is resolved", which is why the streaming API occasionally "hitches" before continuing: the engine is resolving a possible stop.

In llama.cpp the stop strings are carried in the antiprompt list of the common params, and matching happens in the main loop. Many wrapper libraries get this wrong. A common bug: stop triggers, but already-streamed tokens aren't rolled back, and the client UI shows content that shouldn't have appeared.
第 23 章讲了"把对话格式化成模型懂的字符串" 这是 forward 方向。但反方向——模型输出回来怎么解析回结构化字段——同样重要。Function calling 尤甚:模型输出可能是:
这段输出里有 4 件事需要分别处理:
<|start_header_id|>...<|end_header_id|> 是角色头,要剥掉。
<|python_tag|> 后面是工具调用,要解析 JSON,触发外部函数。
结尾是 <|eom_id|>(end of message)而不是 <|eot_id|>(end of turn)——意思是"我说完了这一段,但下面还要继续"(等工具结果)。

所以 chat completion API 内部其实是一个状态机,在 streaming token 时根据特殊 token 切换状态(普通流式 / 工具调用解析 / role 切换等)。OpenAI 的 chat completion 协议把这一切包装成"看似简单" 的 JSON,但底下是这一整套 chat template + 反解析 + 状态机。
llama.cpp 在 tools/server/utils.hpp 的 oaicompat_chat_params_parse 里做这件事,vLLM 有自己的一套openai/serving_chat.py。这部分是推理栈里业务最复杂的地方——比 attention kernel 还容易出 bug,因为它要兼容十几种 chat template 的不同特殊 token。
Ch.23 covered "format conversation into the model's preferred string" — the forward direction. The reverse direction — parsing model output back into structured fields — matters equally. Function calling especially: the model might emit:
This output has four things to handle separately:
<|start_header_id|>...<|end_header_id|> is the role header: strip it.
What follows <|python_tag|> is a tool call: parse the JSON, trigger the external function.
The message ends with <|eom_id|> (end of message) instead of <|eot_id|> (end of turn), meaning "I'm done with this segment, but more is coming" (waiting for the tool result).

So chat completion APIs are internally state machines, switching states based on special tokens while streaming (normal stream / tool call parse / role switch). OpenAI's chat completion protocol wraps all this into "look-simple" JSON, but underneath is the full chat template + reverse-parsing + state-machine machinery.
llama.cpp handles this in tools/server/utils.hpp's oaicompat_chat_params_parse; vLLM has its own openai/serving_chat.py. This is the most business-logic-heavy part of the inference stack — buggier than the attention kernel, because it must handle a dozen different chat-template variations.
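One small piece of that streaming state machine, the stop-string check with hold-back described in the previous section, can be sketched as follows. This is an illustrative toy, not llama.cpp's or vLLM's code: `longest_partial_stop` and `stream_with_stops` are hypothetical names, and the "tokens" are already-detokenized text pieces.

```python
def longest_partial_stop(text, stops):
    """Length of the longest suffix of text that is a proper prefix of a stop string."""
    best = 0
    for s in stops:
        for n in range(min(len(s) - 1, len(text)), 0, -1):
            if text.endswith(s[:n]):
                best = max(best, n)
                break
    return best

def stream_with_stops(pieces, stops):
    """Release text piece by piece, holding back anything that might become a stop.

    Returns (text released to the client, whether a stop string fired).
    """
    pending = ""          # generated but not yet released
    released = []
    for piece in pieces:
        pending += piece
        hits = [pending.find(s) for s in stops if s in pending]
        if hits:
            released.append(pending[:min(hits)])   # cut right before the stop string
            return "".join(released), True
        hold = longest_partial_stop(pending, stops)
        released.append(pending[:len(pending) - hold])
        pending = pending[len(pending) - hold:]
    released.append(pending)                        # generation ended: flush the rest
    return "".join(released), False

if __name__ == "__main__":
    stops = ["User:"]
    print(stream_with_stops(["Hello ", "Use", "r:", " more"], stops))
    print(stream_with_stops(["Hello ", "User", " feedback"], stops))
```

The hold-back is the "hitch" visible to the client: "User" is withheld until the next piece either completes the stop string (and is discarded) or rules it out (and is released together with what follows).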
从 main.cpp 到 ggml kernel · 一次推理的全谱
main.cpp → ggml kernel · the whole spectrum
把前面所有章节的函数名串起来,一次推理的完整调用栈大概是这样:
String together every function we've named, and one inference looks like this:
从这张表能看出两件事:
这就是为什么同样的 GPU,跑长 prompt 是赚的,跑长 output 是亏的。前者按算力定价,后者按时间定价。整个推理产业的服务化(API 定价、batch 调度、连续 batching、speculative decoding 等等)都在围绕这一点做文章。
Two takeaways from this table:
This is why, on the same GPU, long prompts are profitable and long outputs are costly: the former is priced by compute, the latter by time. Every productization choice in inference serving (API pricing, batch scheduling, continuous batching, speculative decoding) revolves around this asymmetry.
同一件事 · 两种写法
same job · different shapes
前 16 章都用 llama.cpp 当主线,是因为它小——一个 C++ 项目,几万行代码,源码可读、可改、可 fork。但生产环境的另一极是 vLLM——Python + CUDA,4 万颗 GitHub 星,UC Berkeley 出的,backed by Anyscale,主打"高吞吐量服务化"。两者代表的是两种推理引擎的哲学。
The first 16 chapters used llama.cpp as the through-line because it's small — a single C++ project, tens of thousands of lines, source-readable, fork-friendly. The other pole in production is vLLM — Python + CUDA, 40k GitHub stars, from UC Berkeley, backed by Anyscale, optimized for "high-throughput serving". They represent two philosophies of inference engines.
| | llama.cpp | vLLM |
|---|---|---|
| 语言 / language | C++ + its own ggml | Python + PyTorch + CUDA |
| 代码规模 / codebase | ~80k lines · single repo | ~200k lines · many dependencies |
| backend | CPU / CUDA / Metal / Vulkan / ROCm | CUDA first-class · ROCm beta |
| KV cache 管理 / KV cache | ring buffer / unified · simple | PagedAttention · paged allocation |
| batch 调度 / batching | static batch (llama-server adds continuous batching) | continuous batching · token-level preemption |
| 最佳场景 / sweet spot | single machine · single user · embedded | multi-user · high concurrency · API serving |
| 部署形态 / deployment | static binary · ~10 MB | Python process · ~5 GB (incl. CUDA) |
| 量化支持 / quantization | GGUF · Q2-Q8 · very mature | AWQ / GPTQ · newer |
| 实际定位 / positioning | "local / edge / experiments" | "production / cluster / SaaS" |
llama.cpp 的 KV cache 是"连续分配 + ring buffer"——简单,但有碎片化问题。多用户场景下,user A 的对话长 200 token,user B 长 8000 token,如果 cache 按用户线性切分,user A 那一段后面 7800 token 的位置就是空的,但其他用户用不上(因为不知道这块何时被回收)。vLLM 的 PagedAttention(论文 "Efficient Memory Management for Large Language Model Serving with PagedAttention" )用操作系统的虚拟内存思路解决:
block_tables)再算——多了一层间接,但 GPU 上一次 memory gather 的开销很小。
llama.cpp's KV cache is "contiguous + ring buffer": simple, but it suffers from fragmentation. Multi-user case: user A's chat is 200 tokens, user B's is 8000. Linearly slicing the cache leaves a 7800-token gap behind A that is useless to others, because nobody knows when it'll be reclaimed. vLLM's PagedAttention (paper: Efficient Memory Management for Large Language Model Serving with PagedAttention) borrows OS-level virtual memory:
block_tables): one extra indirection, but GPU memory gather is cheap.
两个项目在互相借鉴——llama.cpp 后来也支持类 paged 的 KV 调度,vLLM 也支持 GGUF。但根本气质不一样:llama.cpp 是"能装进任何缝隙的瑞士军刀",vLLM 是"同时服务 1000 人的工业铣床"。
The two projects borrow from each other — llama.cpp now supports a paged-ish KV scheduler, vLLM supports GGUF. But the temperaments differ: llama.cpp is "a Swiss army knife that fits any crack"; vLLM is "an industrial milling machine that serves 1000 simultaneously".
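The virtual-memory analogy can be sketched as a tiny block allocator. This is a conceptual toy, not vLLM's implementation: the class name `PagedKVCache`, the block size of 16 tokens, and the method names are assumptions for the example; only the `block_tables` idea comes from the text.

```python
class PagedKVCache:
    """Fixed-size physical blocks plus a per-sequence logical-to-physical table."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))   # ids of unused physical blocks
        self.block_tables = {}              # seq id -> list of physical block ids
        self.lengths = {}                   # seq id -> tokens stored so far

    def append_token(self, seq):
        table = self.block_tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n == len(table) * self.block_size:       # last block is full (or none yet)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())            # any free block will do
        self.lengths[seq] = n + 1

    def locate(self, seq, pos):
        """Translate a logical token position into (physical block, offset)."""
        return self.block_tables[seq][pos // self.block_size], pos % self.block_size

    def release(self, seq):
        """Return all blocks of a finished sequence to the pool immediately."""
        self.free.extend(self.block_tables.pop(seq))
        del self.lengths[seq]

if __name__ == "__main__":
    cache = PagedKVCache(n_blocks=600)
    for _ in range(200):
        cache.append_token("A")
    for _ in range(8000):
        cache.append_token("B")
    print(len(cache.block_tables["A"]), len(cache.block_tables["B"]), len(cache.free))
    cache.release("A")
    print(len(cache.free))
```

User A's 200 tokens occupy only the blocks they need, with at most one partially filled block of waste, and those blocks are reusable by anyone the moment A finishes.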
vLLM 比 llama.cpp 快的另一个不那么显眼的原因是 CUDA Graph。原理:一次 decode forward 涉及 ~几百 个 CUDA kernel launch,每次 launch 在 H100 上有 ~3-5 μs 固定开销——decode 100 个 kernel × 5 μs = ~500 μs 全是 launch 时间。这部分时间跟模型计算无关,纯粹是 CPU → GPU 的指令传输开销。
CUDA Graph 的思路:把这 100 个 kernel 的静态序列 一次性"录"成一个 graph 对象,后续 forward 时直接整体重放这个 graph——只需要一次 graph launch,所有 kernel 顺序执行,launch 开销摊到 0。一次性能给 decode latency 降 10-20%。
但 CUDA Graph 有个严苛要求:graph 内部的 tensor 形状必须完全固定,kernel 序列必须完全确定。这跟 vLLM 的 dynamic batch 起冲突——batch 大小总在变。解决:vLLM 为常见 batch 大小(1, 2, 4, 8, 16, ..., 256)各自捕获一个 graph,运行时根据实际 batch 大小选最近的 graph(padding 一下)。代价是预热时间长(启动要捕获几十个 graph),内存多 ~1-2 GB。
llama.cpp 长期没用 CUDA Graph——因为它的 graph builder(ggml)就是动态构造的,跟 CUDA Graph 的"静态" 假设不兼容。直到 2024 年才加了实验性的 graph capture 支持,但仍不如 vLLM 成熟。这是"vLLM decode 100 tokens/s, llama.cpp 60 tokens/s 同模型同硬件" 这种数字差的来源之一。
Another less-obvious reason vLLM beats llama.cpp on speed: CUDA Graph. The setup: one decode forward involves ~hundreds of CUDA kernel launches; each launch has ~3-5 μs fixed overhead on H100 — 100 kernels × 5 μs = ~500 μs of pure launch time. This time is unrelated to model compute — purely CPU → GPU instruction transfer.
CUDA Graph's idea: "record" the static sequence of those 100 kernels once into a graph object; subsequent forwards just replay the entire graph. One graph launch, all kernels execute in order, and the per-kernel launch overhead shrinks to almost nothing. This gives a 10-20% decode latency reduction.
But CUDA Graph has strict requirements: tensor shapes inside the graph must be fully fixed; kernel sequence must be fully determined. This conflicts with vLLM's dynamic batching — batch sizes change constantly. Fix: vLLM captures a graph for each common batch size (1, 2, 4, 8, 16, ..., 256); at runtime picks the nearest captured graph based on actual batch (with padding). Cost: long warmup (capture dozens of graphs at startup), 1-2 GB extra memory.
llama.cpp didn't use CUDA Graph for a long time — its graph builder (ggml) is dynamically constructed, incompatible with CUDA Graph's "static" assumption. Experimental support arrived in 2024 but is less mature than vLLM's. This is one source of "vLLM 100 tokens/s, llama.cpp 60 tokens/s on the same model and hardware" numbers.
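The bucket-and-pad strategy and the launch-overhead arithmetic can both be written down in a few lines. A sketch under assumptions: the bucket list is taken to be powers of two up to 256 (the text only gives "1, 2, 4, 8, 16, ..., 256"), and `pick_bucket` and `launch_overhead_us` are hypothetical helper names, not vLLM functions.

```python
import bisect

# Assumed capture sizes: powers of two up to 256.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def pick_bucket(batch_size, buckets=BUCKETS):
    """Smallest captured batch size that fits; None means fall back to eager launches."""
    i = bisect.bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None

def padding_waste(batch_size, buckets=BUCKETS):
    """Rows of padding added when replaying the chosen graph."""
    bucket = pick_bucket(batch_size, buckets)
    return None if bucket is None else bucket - batch_size

def launch_overhead_us(n_kernels, per_launch_us=5.0):
    """Pure launch time for one forward pass without a graph."""
    return n_kernels * per_launch_us

if __name__ == "__main__":
    for b in (1, 5, 8, 100, 300):
        print(b, "->", pick_bucket(b), "padding:", padding_waste(b))
    print(launch_overhead_us(100), "us of launch overhead per decode step")
```

The padding rows are wasted compute, which is the price paid for replaying a fixed-shape graph instead of re-launching every kernel.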
vLLM 的 kernel 不全是手写 CUDA。它是个三层栈:
llama.cpp 走完全不同的路:它自己实现了 ggml,kernel 大多是手写 C++/CUDA(CUDA backend 里大矩阵乘仍可以交给 cuBLAS)。这给了它跨平台能力(Metal/Vulkan/SYCL 各自实现一套相同 op 的 backend),代价是每个新硬件特性都要在各 backend 里重写。Triton 帮不上忙,因为 Triton 只面向 GPU(NVIDIA 为主,AMD 走 ROCm),不覆盖 CPU、Metal、Vulkan。
这是"llama.cpp 跨平台 vs vLLM CUDA 优先" 这个根本差异的技术根源:Python 引擎可以用 Triton 把 NVIDIA 当一等公民,C++ 引擎要平衡多平台只能自己写 kernel。各自有道理,但路径也限制了各自的发展空间。
vLLM's kernels aren't all hand-written CUDA. It's a three-layer stack:
llama.cpp takes a completely different path: it implements ggml itself, with mostly hand-written C++/CUDA kernels (the CUDA backend can still hand large matmuls to cuBLAS). This gives cross-platform capability (Metal/Vulkan/SYCL each implement the same op set in their own backend), at the cost of re-implementing every new hardware feature per backend. Triton is no help here: it targets GPUs (NVIDIA first, AMD via ROCm), not CPUs, Metal, or Vulkan.
This is the technical root of the "llama.cpp cross-platform vs vLLM CUDA-first" divide: Python engines can treat NVIDIA as first-class via Triton; a C++ engine balancing many platforms has to write its own kernels. Each path makes sense, but each also constrains where the project can go.
把 16 GB 模型挤进 5 GB 显存 · 精度损 3%
squeezing a 16 GB model into 5 GB VRAM · 3% accuracy cost
到这里整篇文章一直在用 fp16(2 字节)讲。但线上几乎没人用 fp16 跑推理——大家都在跑量化版本。一个 Llama-3-8B fp16 是 16 GB,Q4_K_M 量化约 4.9 GB,PPL 只差 ~3%——同样一张 16 GB 显存,fp16 只能装下模型本身没余地,Q4_K_M 还能留 11 GB 给 KV cache。量化决定了"什么模型能跑在什么硬件上"。
但"4-bit"这个词太笼统——4-bit 量化有十种不同实现,llama.cpp 自己就有 Q4_0 / Q4_1 / Q4_K / Q4_K_S / Q4_K_M 等等。差别全在怎么处理"异常值"——少数权重数值远大于均值,简单 4-bit 截断会把它们截烂。
The whole article so far has run on fp16 (2 bytes). But almost nobody runs production inference at fp16; everyone uses quantized versions. Llama-3-8B at fp16 is 16 GB; Q4_K_M is ~4.9 GB, with a PPL difference of ~3%. On the same 16 GB GPU, fp16 leaves no room beyond the weights, while Q4_K_M leaves ~11 GB for KV cache. Quantization decides "what model runs on what hardware".
But "4-bit" is too vague — there are ten different 4-bit schemes. llama.cpp alone ships Q4_0 / Q4_1 / Q4_K / Q4_K_S / Q4_K_M and more. The differences are all about handling "outliers" — a few weights with values far above the mean. Naive 4-bit truncation crushes them.
llama.cpp 的"K-quants"是 2023 年 ikawrakow 设计的一套精细方案。Q4_K 不是把每个 weight 量成 4 bit——它是把 32 个 weights 打包成一个"block",8 个 block 再打包成一个"super-block"。每个 block 有自己的 scale,每个 super-block 有自己的 super-scale——多层量化,把异常值"分摊"到合适的尺度。
llama.cpp's "K-quants" were designed by ikawrakow in 2023. Q4_K doesn't simply 4-bit each weight — it packs 32 weights into a "block", 8 blocks into a "super-block". Each block has its own scale; the super-block has a super-scale. Layered quantization that amortizes outliers across appropriate scales.
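The block/super-block structure explains why Q4_K costs slightly more than 4 bits per weight. The arithmetic below assumes a particular metadata layout (a 6-bit scale and a 6-bit minimum per 32-weight block, plus two fp16 values per 256-weight super-block); that layout is my reading of ggml's Q4_K format rather than something stated in the text, so treat the exact bit counts as an assumption.

```python
def q4_k_bits_per_weight(weights_per_block=32, blocks_per_super=8):
    """Effective bits per weight for a Q4_K-style two-level layout (assumed layout)."""
    n_weights = weights_per_block * blocks_per_super   # 256 weights per super-block
    quant_bits = n_weights * 4                         # the 4-bit codes themselves
    block_meta_bits = blocks_per_super * (6 + 6)       # assumed: 6-bit scale + 6-bit min per block
    super_meta_bits = 2 * 16                           # assumed: fp16 super-scale + fp16 super-min
    return (quant_bits + block_meta_bits + super_meta_bits) / n_weights

def model_size_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes) at a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    bpw = q4_k_bits_per_weight()
    print("Q4_K bits/weight:", bpw)
    print("8.03B params at that width:", round(model_size_gb(8.03e9, bpw), 2), "GB")
    print("8.03B params at 16 bits:  ", round(model_size_gb(8.03e9, 16), 2), "GB")
```

The half bit of overhead per weight is what pays for the per-block scales; it is the reason a "4-bit" file is always a little larger than params × 0.5 bytes.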
Q4_K_M 不是"所有权重都 Q4_K"。它是 llama.cpp 量化方案家族里的"mixed"档:
attention.wv 和 feed_forward.w2(对精度更敏感的层): 一半用 Q6_K(~6.5 bits/weight)
output(LM head): 用 Q6_K——LM head 量化错了直接破坏生成,这层不能省

所以 Q4_K_M 实际平均 ~4.85 bits/weight。Q4_K_S(S = small)就全 Q4_K,~4.6 bits/weight,体积更小但 PPL 多 ~1%。这种哪些层用什么量化的策略在 llama_model_quantize_internal 里写死,基于对每层 weight 分布的经验观察。
Q4_K_M doesn't mean "all weights Q4_K". It's the "mixed" tier in llama.cpp's quant family:
attention.wv and feed_forward.w2 (more precision-sensitive layers): half use Q6_K (~6.5 bits/weight)
output (LM head): Q6_K. Quantize this layer wrong and generation breaks; not negotiable.

So Q4_K_M actually averages ~4.85 bits/weight. Q4_K_S (S = small) is full Q4_K, ~4.6 bits/weight: smaller, but +1% PPL. The "which layer at which quant" policy is hard-coded in llama_model_quantize_internal, based on empirical per-layer weight-distribution observations.
这张表是整个量化哲学的浓缩:Q4_K_M 是甜点——再省往下到 Q3 就开始明显掉精度。不是所有模型都有这个甜点:更小的模型(3B 以下)Q4 已经掉得厉害,要 Q5 以上才稳;更大的模型(70B+)Q3 都还能用,因为冗余多。大模型更"抗量化"——这也是为什么 70B-Q4 通常比 8B-fp16 准得多,尽管它既不更小(约 40 GB 对 16 GB)也不更快。
This table is the whole quantization philosophy in one place: Q4_K_M is the sweet spot; go down to Q3 and accuracy starts to drop noticeably. Not every model has this sweet spot: smaller models (under 3B) already lose noticeably at Q4 and need Q5+; larger models (70B+) survive Q3 well because they have more redundancy. Larger models are more "quantization-resilient", which is why a 70B model at Q4 is typically far more accurate than an 8B at fp16, even though it is neither smaller (~40 GB vs 16 GB) nor faster.
权重量化是静态的(模型一次性量化完写盘),但 KV cache 是动态的——每一步 decode 都新写一份。要量化 KV cache,得在每次写入时实时压缩,读取时反量化。这个开销不能太大,所以一般选格式简单的 fp8(e4m3 或 e5m2)而不是 K-quants 那种多层结构。
fp8 KV cache 把 Ch.8 那 1 GB 直接砍一半到 512 MB。代价是 PPL 多 ~0.5%——比权重量化还安全,因为 attention 公式本身就在 softmax 后做归一化,抹平了一些精度噪声。vLLM 的 kv_cache_dtype="fp8_e5m2" 是生产环境里常见的配置(默认值是 "auto",跟随模型 dtype)。
Weight quantization is static (quantize once, write to disk). But KV cache is dynamic — written fresh on every decode step. To quantize the KV cache you compress on every write and dequantize on every read. That overhead can't be high, so usually a simple format like fp8 (e4m3 or e5m2) — not K-quants' multi-tier scheme.
fp8 KV cache halves Ch.8's 1 GB to 512 MB. The PPL cost is ~0.5%, even safer than weight quantization, because the softmax normalization in attention itself absorbs some precision noise. vLLM's kv_cache_dtype="fp8_e5m2" is a common production setting (the default is "auto", which follows the model dtype).
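The halving follows directly from the KV cache size formula. The sketch below assumes Llama-3-8B's attention shape (32 layers, 8 KV heads, head dimension 128) and an 8192-token context, which is one configuration that yields the 1 GB figure; whether Ch.8 used exactly these numbers is an assumption.

```python
def kv_cache_bytes(n_ctx, n_layer=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """K and V for every layer, KV head, and position: 2 * L * H_kv * d * n_ctx * bytes."""
    return 2 * n_layer * n_kv_heads * head_dim * n_ctx * bytes_per_elem

if __name__ == "__main__":
    fp16 = kv_cache_bytes(8192, bytes_per_elem=2)
    fp8 = kv_cache_bytes(8192, bytes_per_elem=1)
    print("per token, fp16:", kv_cache_bytes(1) // 1024, "KiB")
    print("8192 ctx, fp16: ", fp16 / 2**20, "MiB")
    print("8192 ctx, fp8:  ", fp8 / 2**20, "MiB")
```

Since every decode step reads the whole cache, the same factor of two applies to the bytes moved per generated token, not just to the bytes stored.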
fp8 有两种主流格式,1 字节里的位分配不同:
NVIDIA Hopper(H100)开始硬件原生支持这两种格式,fp8 GEMM 性能是 fp16 的 2×。翻倍的 TFLOPS 主要帮到 compute-bound 的 prefill;对 memory-bound 的 decode,省时间主要来自每步少读一半的 KV 字节。所以 vLLM 在 H100 上跑 fp8 KV cache 不只是省内存,还是省时间。
fp8 has two mainstream formats; the bit allocation differs:
NVIDIA Hopper (H100) introduced native hardware fp8 support; fp8 GEMM throughput is 2× fp16. The doubled TFLOPS mostly helps compute-bound prefill. For memory-bound decode, the time savings come from reading half as many KV bytes per step. Either way, fp8 KV cache on H100 isn't just memory savings, it's time savings too.
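The difference between the two formats is just where the 7 non-sign bits go: e4m3 spends 4 on the exponent and 3 on the mantissa, e5m2 spends 5 and 2. A small decoder makes the range-versus-precision trade visible. This sketch handles normal and subnormal values only and ignores the NaN/Inf encodings, which differ between the two formats; `decode_fp8` is a made-up helper.

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Decode one fp8 byte (sign | exponent | mantissa), ignoring NaN/Inf encodings."""
    bias = (1 << (exp_bits - 1)) - 1                 # 7 for e4m3, 15 for e5m2
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:                                     # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3)

def e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2)

if __name__ == "__main__":
    print("e4m3 largest finite:", e4m3(0x7E))        # 0 1111 110
    print("e5m2 largest finite:", e5m2(0x7B))        # 0 11110 11
    print("step after 1.0, e4m3:", e4m3(0x39) - e4m3(0x38))
    print("step after 1.0, e5m2:", e5m2(0x3D) - e5m2(0x3C))
```

e5m2 reaches much larger magnitudes, while e4m3 has a finer step around 1.0, which is the usual argument for e4m3 on values with a narrow range and e5m2 where range matters more.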
llama.cpp 的 GGUF 量化只是一种路线。生产环境里你会碰到至少 5 种主流方案,它们在 "如何选择 scale" 这个核心问题上走了不同路:
llama.cpp's GGUF quantization is just one approach. In production you'll encounter at least five mainstream schemes, each diverging on the core question "how to pick scales":
| scheme | 校准方式 / calibration | 优势 / strength | 劣势 / weakness | 主要用户 / main users |
|---|---|---|---|---|
| GGUF (K-quants) | 纯权重统计 · 离线 / weight stats · offline | 简单 · 跨平台 / simple · cross-platform | ~3% PPL 损 / ~3% PPL drop | llama.cpp · Ollama · LM Studio |
| GPTQ | 校准数据集 · 逐层 Hessian / cal dataset · per-layer Hessian | 精度高 · ~1% PPL 损 / accurate · ~1% PPL drop | 需要校准数据 · 量化慢 / needs cal data · slow to quant | HuggingFace · AutoGPTQ |
| AWQ | activation-aware scale | ~0.5% PPL 损 · 推理快 / ~0.5% PPL · fast inference | 校准数据敏感 / cal-data-sensitive | vLLM · TGI |
| SmoothQuant | smooth activation outlier 到 weight / smooth act outliers to weights | 支持 W8A8 / enables W8A8 | 实现复杂 · 模型特异 / complex · model-specific | NVIDIA TensorRT-LLM |
| HQQ | 无校准 · 数据无关 / no calibration · data-free | 极快量化(秒级) / extremely fast quant (seconds) | ~2% PPL 损 / ~2% PPL drop | 研究 · 实验 / research · experimental |
三个关键观察:
实际选择:llama.cpp 用户用 GGUF,vLLM / TGI 部署用 AWQ,TensorRT-LLM 用 SmoothQuant。同一个模型 Llama-3-70B 在 HuggingFace Hub 上可能有 6-8 个不同量化版本——根据用户的引擎选用就行。
Three key observations:
Real choices: llama.cpp users use GGUF; vLLM / TGI deployments use AWQ; TensorRT-LLM uses SmoothQuant. The same Llama-3-70B on HuggingFace Hub may have 6-8 different quantized variants — pick by which engine you're using.
两个跟 decode is memory-bound 死磕的方法
two ways of fighting decode-is-memory-bound
第 2 章定下了整个推理的基本矛盾:decode is memory-bound。这一章讲两个真正改变这件事的方法——它们都是这两年 LLM 工程界最大的产品差异化武器,但在学术圈外讨论得不够。
这两件事都没改模型架构——纯粹是推理引擎层面的工程,但效果比改 attention 公式还大。理解这两个能让你看懂"为什么同样模型 vLLM 跑得比 llama.cpp 快 3 倍"。
Ch. 2 set the core tension: decode is memory-bound. This chapter covers the two techniques that actually change that — they're the biggest product differentiators in LLM inference these days, yet underdiscussed outside academia.
Neither changes the model architecture — pure engine-level engineering, but with bigger impact than changing the attention formula. Understanding these explains "why vLLM beats llama.cpp 3× on the same model".
朴素 decode 是纯串行:生成第 i 个 token 必须等第 i-1 个出来,因为 i 要喂回模型当输入。一次 decode ~15 ms,生成 100 个 token 就是 1.5 秒——这 1.5 秒里 GPU 大部分时间在等 HBM。算力闲着。
Speculative decoding 的洞察:用一个小模型(draft model)便宜地猜 k 个 token,然后让大模型对这 k 个 token 一次性跑一遍 forward 验证——大模型这一次是 prefill 而不是 decode,k 个 token 并行算,等于免费验证。
关键的数学:如果 draft 猜对了 k 个里的前 m 个,大模型这一次就等价于做了 m+1 次 decode——因为大模型 forward 的同时也算出了"给定第 i 个 token,第 i+1 个应该是什么"。一个被丢弃的 draft token 不亏(大模型这次还是产出了它的预测),只要 draft 命中率 > 1/k,就赚。
Naive decode is purely serial: generating token i requires token i-1 first, because token i-1 is fed back in as input. One decode step takes ~15 ms, so 100 tokens take 1.5 s. During that 1.5 s the GPU mostly waits on HBM while its compute units sit idle.
Speculative decoding's insight: let a small "draft model" cheaply guess k tokens, then have the big model verify all k in a single forward pass. That pass behaves like a prefill rather than a decode: the k tokens are processed in parallel, so verification is nearly free.
The key math: if the draft got the first m of k tokens right, the big model's forward is equivalent to m+1 decodes — because the big model also computes "given token i, what should token i+1 be" as a side product. Any discarded draft token is not lost (the big model still computed its own prediction). As long as draft hit-rate > 1/k, you win.
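The "m accepted drafts yield m+1 tokens" rule can be sketched for the greedy case. This is a simplification: real implementations use a rejection-sampling rule so that sampled (non-greedy) output keeps the big model's distribution. Both function names and the independence assumption behind `expected_tokens` are illustrative, not taken from any engine:

```python
def verify_greedy(draft, target_next):
    """Accept the longest draft prefix the big model agrees with.

    draft:       k tokens proposed by the draft model
    target_next: k+1 tokens; target_next[i] is the big model's greedy
                 choice given the context plus draft[:i]
    Returns the tokens actually emitted this step (always at least one).
    """
    assert len(target_next) == len(draft) + 1
    m = 0
    while m < len(draft) and draft[m] == target_next[m]:
        m += 1
    # m accepted drafts, plus the big model's own token at the first mismatch
    # (or its bonus token if everything matched).
    return draft[:m] + [target_next[m]]


def expected_tokens(alpha: float, k: int) -> float:
    """Mean tokens per verify pass if each draft token is accepted
    independently with probability alpha (an idealized assumption)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


print(verify_greedy([5, 6, 7, 8], [5, 6, 9, 1, 2]))   # two drafts accepted, then the correction
print(round(expected_tokens(0.8, 4), 4))
```

Even a fully rejected draft still emits one token, which is why a bad guess costs draft time but never costs output.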
| Variant | draft source | hit rate | speedup |
|---|---|---|---|
| vanilla speculative | small model · 1B drafts 70B | 50-70% | 2-3× |
| Medusa | main model + extra heads predicting in parallel | 60-80% | 2.2× |
| EAGLE | tiny head on top of main model's hidden state | 75-85% | 3× |
| Lookahead | self n-gram cache, no draft model | 40-60% | 1.5-2× |
| Speculative + prefix sharing | batched users share draft | n/a | 4-5× (high concurrency) |
第 8 章那张 KV cache 桌子里,有个被忽视的事实:同一段 token 序列在同一个模型上,KV 永远一样。它是纯函数——给定输入 token 和位置 id,K 和 V 是确定的。所以如果两个用户的请求都以"You are a helpful assistant. Today is ..." 开头,这一段的 KV 完全可以共享——不用重新 prefill。
这个观察催生了 prefix caching。生产环境里,API 用户的 prompt 通常长这样:
典型一个 API 应用,system + examples 占总 prompt 的 70-90%。如果命中 prefix cache,这部分 prefill 时间归零——TTFT 从 100 ms 降到 10 ms 是常态。
Ch.8's KV cache table contains an underappreciated fact: the same token sequence over the same model always produces the same KV. It's a pure function: given input tokens and position ids, K and V are deterministic. So if two users' requests both start with "You are a helpful assistant. Today is ...", that portion's KV can be shared — no need to re-prefill.
This observation gave rise to prefix caching. In production, API user prompts typically look like:
For a typical API app, system + examples account for 70-90% of total prompt tokens. A prefix-cache hit makes that prefill time zero — TTFT drops from 100 ms to 10 ms routinely.
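One way to picture the lookup, as a minimal sketch rather than any engine's actual data structure: split the prompt into fixed-size blocks, give each block a hash that depends on everything before it, and reuse KV for the longest run of blocks already seen. The block size, class name, and string placeholder for the KV data are all illustrative assumptions:

```python
import hashlib

BLOCK = 4  # tokens per KV block (tiny, for illustration only)


def block_hashes(tokens):
    """Chain-hash full blocks so each key depends on the whole prefix before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # hash -> placeholder standing in for a cached K/V block

    def match(self, tokens):
        """Number of leading tokens whose KV is already cached."""
        n = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            n += BLOCK
        return n

    def insert(self, tokens):
        for h in block_hashes(tokens):
            self.blocks.setdefault(h, "kv-block")


cache = PrefixCache()
system = list(range(8))                   # shared system prompt, 8 tokens
cache.insert(system + [100, 101])         # first user's request
print(cache.match(system + [200, 201, 202, 203]))  # second user reuses the shared part
```

Chaining the hashes is what makes this safe: the same four tokens at a different position, or after a different prefix, get a different key, matching the "same tokens and same positions" condition above.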
这两个 trick 是正交的——可以叠加。一个典型的 API 服务调用,完整时间分解:
These two tricks are orthogonal — they stack. A typical API call, decomposed:
叠加之后 13.3 s → 3.5 s,4× 加速,什么模型都没动。这就是为什么 2024-2025 年同样硬件、同样模型,主流推理服务的实际吞吐量比 2022 年提了 5-10 倍。训练这两年遇到瓶颈,推理这边在悄悄发力。
顺手说一句:这两个 trick 还有个共同的副作用——它们让"同一个 batch 里的所有用户"对计算量的贡献不再对称。命中 prefix cache 的用户几乎免费;接收 speculative draft 多的用户消耗更多算力。计费因此变成一个细微问题——OpenAI 在 2024 年底推出"prompt caching discount"就是为了把 prefix caching 的收益分给用户。
Stacked: 13.3 s → 3.5 s, 4× speedup without touching the model. This is why in 2024-2025, on the same hardware and model, mainstream inference services hit 5-10× the real throughput of 2022. Training hit a wall these two years; inference quietly leveled up.
An aside: these two tricks share a side-effect — they make "users in the same batch" no longer contribute symmetrically to compute. A user hitting prefix cache is nearly free; a user with many accepted speculative tokens consumes more compute. Pricing becomes subtle — OpenAI's late-2024 "prompt caching discount" exists exactly to share the prefix-caching gain with users.
主表里那 5 种 speculative 变体值得展开。它们解决"怎么便宜地猜对下个 token" 这个核心问题的方式不同:
vLLM 2024 内置 EAGLE 支持,SGLang 也跟进。llama.cpp 主线只有 vanilla speculative,有人在做 EAGLE 移植但还没合并。在reasoning 模型 场景下(C26),EAGLE 加速尤其大——因为 reasoning 阶段语义连贯性高,draft 命中率能到 85%+。
The 5 speculative variants in the main table deserve unpacking. Each solves "cheaply guess the next token" differently:
vLLM gained built-in EAGLE support in 2024, and SGLang followed. llama.cpp's main branch only has vanilla speculative decoding; an EAGLE port is in progress but not merged. For reasoning models (Ch.26), EAGLE's speedup is especially large: the reasoning phase has high semantic continuity, so the draft hit rate can reach 85%+.
vLLM 的真正杀招 · token-level 抢占
vLLM's actual killer move · token-level preemption
在线服务里,同时会有几十上百个用户在跟模型说话。最朴素的做法是static batching:把同时到来的请求拼成一个 batch 一起跑,直到 batch 里所有请求都生成完 EOS 才返回。问题是不同请求的输出长度差别巨大——有的说 20 个 token 就完,有的要说 2000 个。整个 batch 会被最慢的那个拖死,前面那些早早完成的请求等着,GPU 大部分时间在跑"无效 decode"。
2022 年 OSDI 的 Orca 论文(《Orca: A Distributed Serving System for Transformer-Based Generative Models》)第一次提出了continuous batching(也叫 in-flight batching):每生成一个 token 就重新组 batch,完成的请求立刻退出 batch,等待中的新请求立刻加入。vLLM 把它做成了开源标准实现,这是它最重要的工程贡献——比 PagedAttention 还重要。
In online serving, tens to hundreds of users talk to the model simultaneously. The naive approach is static batching: pack arriving requests into a batch, run them together until every request in the batch emits EOS. Problem: output lengths vary wildly — some requests finish in 20 tokens, others need 2000. The whole batch is held hostage by the longest. The early finishers wait around; the GPU spends most of its time on "useless decode".
The 2022 OSDI Orca paper (Orca: A Distributed Serving System for Transformer-Based Generative Models) introduced continuous batching (aka in-flight batching): regroup the batch after every generated token; finished requests drop out instantly, waiting requests jump in instantly. vLLM made it the open-source default — this is vLLM's single most important engineering contribution, more impactful than PagedAttention.
| | Static batching | Continuous batching |
|---|---|---|
| batch composition | fixed once, runs until all EOS | re-formed each step |
| output-length variance | short requests dragged | finish → leave |
| prefill handling | collected, then batched | prefill mixed with decode |
| new request wait | until current batch ends | joins on next step |
| GPU utilization | ~30-50% | 70-90% |
| throughput gain | 1× | 5-20× (measured in the paper) |
vLLM 把所有请求抽象成两类:
每个 step,scheduler 决定这一 step 哪些 WAITING 请求要 prefill、哪些 RUNNING 请求要 decode,然后把它们打包成一个 batch给 model_runner 跑。关键技巧是 prefill 和 decode 可以混在同一个 batch 里——一个 batch 里同时有"5 个新请求各自 prefill 200 token"和"10 个老请求各 decode 1 token",GPU 一次 forward 把这些都算完。
vLLM abstracts all requests into two states:
Each step, the scheduler decides which WAITING requests to prefill this step and which RUNNING requests to decode, then packs them into one batch for model_runner. The key trick: prefill and decode can mix in the same batch — a single batch can contain "5 new requests prefilling 200 tokens each" and "10 ongoing requests each decoding 1 token"; the GPU finishes all of it in one forward.
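A toy simulation makes the difference between the two policies concrete. It only counts decode steps, ignores prefill cost and KV memory, and treats every step as equally expensive; the function names and workload are made up for illustration:

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Re-form the batch every step: finished requests leave, waiting ones join."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots from the waiting queue before this step runs.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        running = [r - 1 for r in running]       # one decode step per running request
        running = [r for r in running if r > 0]  # requests that hit EOS drop out
        steps += 1
    return steps


# Output lengths (in tokens) of eight requests, served with 4 batch slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

In the static schedule the short requests sit in a batch that runs for 100 steps twice; in the continuous one their slots are recycled as soon as they finish, so the second long request starts after 10 steps instead of 100.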
continuous batching 是 happy path——但新请求一直涌入,KV cache 总会满。这时 scheduler 就要做抢占(preemption):把某些 RUNNING 请求踢出去,让出 KV slot 给新请求。vLLM 有两种抢占策略:
vLLM 默认 recompute,因为 prefix caching 命中率高时它最优。Sglang(另一家高吞吐引擎)默认 swap。这两种选择直接影响 P99 延迟——swap 让 P50 好但 P99 偶尔很差(swap-out 那次),recompute 让平均更平但偶尔有"请求莫名重新跑了一遍 prefill"。
Continuous batching is the happy path — but new requests keep arriving and KV cache eventually fills. The scheduler must preempt: evict some RUNNING requests to free KV slots for newcomers. vLLM has two policies:
vLLM defaults to recompute (best when prefix-cache hit rates are high). SGLang (another high-throughput engine) defaults to swap. The choice directly affects tail latency: swap keeps P50 good but makes P99 occasionally awful (at the swap-out moment), while recompute is flatter on average but occasionally means "a request mysteriously ran prefill twice".
第 2 章说过 prefill 和 decode 是两种工作负载。那它们怎么混在一个 batch 里?这件事需要解释一下,因为它不显然。
关键:模型 forward 的输入抽象是 llama_ubatch{ token, pos, seq_id, ... }(第 5 章)——它不区分"这个 token 是 prefill 还是 decode",只关心"这个 token 属于哪条序列、在那条序列的什么位置"。一个 batch 里:
这个 batch 总共 6 + 1 + 200 + 1 = 208 个 token。一次 forward 处理 208 个 token,Q 矩阵 [208, d_head]。attention 的 mask 矩阵根据 seq_id + pos 算出来——seq_0 的 token 5 只能看到 seq_0 的 token 0..5,看不到 seq_2 的 token——这就是多用户 batch 的隔离。
这件事在 H100 上实测能跑得很好,因为 SM 数量大(132 个),一次 forward 同时容纳 prefill 和 decode 的算力 / 带宽混合负载毫无压力。但kernel 设计要支持这种 mixed batch——FlashAttention v2/v3 都明确针对这件事做了优化(varlen API,可以接受不同长度的 Q)。
顺手补一个 vLLM 在 mixed batch 上的陷阱:同一个 batch 里 prefill 的 N 大,decode 的 N=1。如果 batch 总 token 数有限(max_num_batched_tokens),一个长 prefill 进来会挤掉很多 decode——P99 latency 突然变差。vLLM 引入 --max-num-seqs 和 chunked prefill(第 21 章)就是为了解决这个不公平。
Ch.2 said prefill and decode are two different workloads. So how can they share one batch? This deserves an explanation, since it's not obvious.
Key: model forward's input abstraction is llama_ubatch{ token, pos, seq_id, ... } (Ch.5) — it doesn't distinguish "prefill vs decode token", only "which sequence does this token belong to, at what position". One batch:
Batch total: 6 + 1 + 200 + 1 = 208 tokens. One forward processes 208 tokens, Q matrix [208, d_head]. Attention's mask is computed from seq_id + pos — seq_0's token 5 sees only seq_0's tokens 0..5, not seq_2's — that's per-user isolation inside the batch.
This works great on H100 because there are 132 SMs; one forward easily absorbs the mixed compute/bandwidth load of prefill and decode. But the kernel must support this mixed batch — FlashAttention v2/v3 explicitly added this (the varlen API, accepting Q rows of different lengths).
One footgun: in a mixed batch, prefill has large N, decode N=1. If the batch's total token budget (max_num_batched_tokens) is capped, one long prefill crowds out many decodes — P99 latency spikes. vLLM's --max-num-seqs flag and chunked prefill (Ch.21) exist precisely to fight this unfairness.
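The isolation rule ("same sequence, not in the future") is small enough to write out. This sketch only covers tokens present in the current batch; in a real engine each query also attends to that sequence's keys already sitting in the KV cache. The function name is illustrative:

```python
def build_mask(seq_id, pos):
    """mask[i][j] is True when query token i may attend to key token j."""
    n = len(seq_id)
    return [
        [seq_id[i] == seq_id[j] and pos[j] <= pos[i] for j in range(n)]
        for i in range(n)
    ]


# One mixed batch: sequence 0 prefills three tokens, sequence 1 decodes one
# token at position 7 (its earlier keys live in the KV cache, not in this batch).
seq_id = [0, 0, 0, 1]
pos = [0, 1, 2, 7]
for row in build_mask(seq_id, pos):
    print(["x" if ok else "." for ok in row])
```

The prefill rows form the usual causal triangle, and the decode row sees nothing from the other sequence, even though all four tokens go through the same forward pass.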
长 prompt 不再拖死所有 decode
long prompts no longer hold decode hostage
第 20 章那张吞吐表里有一行加了 +60% 的 "chunked prefill"——这是 vLLM 2024 年的关键升级。背景是 continuous batching 的不公平:如果某个用户发了一个 32K token 的长 prompt,它的 prefill 要算 ~50 ms,这 50 ms 里同 batch 的其他 decode 用户全部等着——他们看到的 TPOT 突然变成 50 ms 而不是 15 ms。
chunked prefill 的招式:把长 prefill 切成小块,每块 ~512 token,跟 decode 混跑。一个 32K prefill 切成 64 块,每个 batch step 跑 1 块 + 一堆 decode——单 step latency 维持在 ~20 ms,decode 用户感知不到长 prompt 用户的存在。代价是长 prompt 用户的 TTFT 略微变高(从 50 ms 涨到 80 ms),但 TPOT 更稳。
That +60% row in Ch.20's throughput table for "chunked prefill" is vLLM's 2024 key upgrade. Background: continuous batching is unfair. If one user sends a 32K prompt, prefilling it takes ~50 ms; during those 50 ms every other decode user in the batch waits — their TPOT spikes from 15 ms to 50 ms.
Chunked prefill's move: slice long prefill into small chunks of ~512 tokens; mix them with decode. A 32K prefill becomes 64 chunks; each batch step runs 1 chunk + many decodes — per-step latency stays ~20 ms, decode users never notice the long-prompt user. Cost: the long-prompt user's TTFT goes from 50 ms to ~80 ms, but TPOT is flatter.
切成块跑不是完全免费的。原本 32K prompt 一次 prefill 要读 KV cache 1 次(因为是从空开始),chunked 之后每一块 prefill 都要读"之前已经写入的所有 KV"——第 8 块要读前 7 块 × 512 = 3584 个 K——总读取量从 O(N²/2) 变成 O(N²)。
对 32K prompt,实测多读 ~3 GB 数据(per layer 维度累加)。看起来很多,但因为 prefill 阶段本来就 compute-bound,HBM 带宽有富余,这部分多读几乎不影响总时间。chunked prefill 在 compute-bound 阶段交换 latency-fairness——只在 prefill 才有意义,decode 是 memory-bound,chunked 反而损失。
这就是为什么 vLLM 默认只对 prefill chunked,decode 永远整段跑。调度策略跟硬件物理特性强绑定——理解这一点,你就能预测下一代 GPU(更高 HBM 带宽)会让 chunked prefill 的相对优势如何变化。
Slicing isn't free. A 32K prompt's monolithic prefill reads KV cache 0 times (starts empty); chunked prefill must read "everything written by prior chunks" — the 8th chunk reads 7 × 512 = 3584 K rows — total reads go from O(N²/2) to O(N²).
For a 32K prompt, that's ~3 GB of extra data read (summed across layers). It looks expensive, but prefill is compute-bound and HBM bandwidth has headroom, so the extra reads barely move total time. Chunked prefill trades spare bandwidth for latency fairness in the compute-bound regime. That only makes sense for prefill: decode is memory-bound, so chunking it would cost time rather than save it.
That's why vLLM chunks only prefill, while decode always runs whole. Scheduling is tightly coupled to hardware physics: understand this and you can predict how the relative advantage shifts on next-gen GPUs (higher HBM bandwidth).
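The chunk bookkeeping can be sketched in a few lines. It counts rows only (one row = one token's K, and the same again for V, per layer), so it says nothing about bytes, which depend on the model's KV layout; the helper names are illustrative:

```python
def chunk_plan(n_prompt, chunk):
    """(start, end) token ranges, one per scheduler step."""
    return [(s, min(s + chunk, n_prompt)) for s in range(0, n_prompt, chunk)]


def kv_rows_reread(n_prompt, chunk):
    """K rows that later chunks must read back from the cache, per layer.

    A chunk starting at token `start` attends to the `start` rows written
    by all earlier chunks.
    """
    return sum(start for start, _ in chunk_plan(n_prompt, chunk))


plan = chunk_plan(32 * 1024, 512)
print(len(plan))                        # number of chunks for a 32K prompt
print(plan[7][0])                       # rows the 8th chunk reads back
print(kv_rows_reread(32 * 1024, 512))   # total rows re-read across all chunks
```

Smaller chunks mean more scheduler steps and more re-reads, larger chunks mean longer steps for the decode users sharing the batch; the ~512 figure above is one point on that curve.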
chunked prefill 是逻辑上混跑 prefill 和 decode。2024 年又出现了一个更激进的思路——物理上把它们分开,放在不同的 GPU上:
这条路线还在研究 → 量产的过渡期。最大的实际部署是 Anthropic 内部和某些超大规模服务——你看不到它的具体代码,但能从 P99 latency 改善的口径推断。下一代推理引擎大概率会原生支持这种"disaggregated prefill/decode"。
Chunked prefill mixes prefill and decode logically. 2024 brought a more radical idea — split them physically across different GPUs:
This line is still in the research → production transition. The biggest real deployments are at Anthropic-tier scale and a few hyperscalers — you can't see their code, but P99 latency improvements give it away. Next-gen inference engines will likely natively support this "disaggregated prefill/decode" pattern.
一张卡跑不动,八张卡怎么协同
when one card isn't enough, how do eight cooperate
Llama-3-405B fp16 是 810 GB——单 H100 (80 GB) 一张装不下模型权重,更别说 KV cache。要让它跑起来,必须把权重切到多张卡上。"切"有三种思路,产业里叫 TP / PP / EP:
这三种是正交的,可以叠加。比如 Llama-3-405B 在 8 × H100 上典型用 TP=8(8 张卡协同算每一层);DeepSeek-V3 671B 在 16 × H100 上用 TP=4 × EP=4(每 4 张卡协同算 attention,16 张卡分摊 expert)。
Llama-3-405B fp16 = 810 GB — a single H100 (80 GB) can't hold the weights, never mind KV cache. To run it you must split weights across cards. Three strategies, industry-known as TP / PP / EP:
The three are orthogonal and compose. Llama-3-405B on 8×H100 typically runs TP=8 (8 cards cooperate per layer). DeepSeek-V3 671B on 16×H100 runs TP=4 × EP=4 (every 4 cards do attention together; the 16 split the experts).
| | TP | PP | EP |
|---|---|---|---|
| what to slice | a single weight matrix | different layers | MoE experts |
| comms | all-reduce | send/recv | all-to-all |
| comms freq | 2 per layer | 1 per stage boundary | 2 per FFN layer |
| comm size | [n_tokens × d_model] | [n_tokens × d_model] | [n_tokens × d_model] |
| needs fast link | NVLink required | PCIe is enough | NVLink required |
| scale limit | ~8 (NVLink domain) | tens to hundreds | ~64 |
| pipeline bubble | n/a | large at prefill · small at decode | n/a |
| use case | single node, multi-GPU | cross-node | MoE only |
朴素地"把矩阵从中间切两半"是不够的——TP 要保证切完之后,计算结果跟未切版本完全一致(注意可能有微小 fp 误差,但语义上一致)。这需要在每一层用列切 + 行切配对:
- W_q / W_k / W_v 沿输出维度(列)切——每张卡得到 1/TP 个 head。
- W_o 沿输入维度(行)切——每张卡乘自己那部分,然后 all-reduce 把结果加起来。
- W_gate / W_up 沿列切,W_down 沿行切 + all-reduce。

所以一层 transformer 在 TP=8 下,有两次 all-reduce(一次 attention 出口,一次 FFN 出口)。32 层模型 → 64 次 all-reduce。每次 all-reduce 通信量 ~n_tokens × d_model × 2 bytes,decode 时 ~30 KB,prefill 时几 MB。在 NVLink 4.0(900 GB/s 双向,单向 450 GB/s)上一次 all-reduce ~5-50 μs,32 层叠加 ~200 μs - 几 ms,占总 latency 5-15%。
Naively "slicing the matrix down the middle" doesn't suffice: TP must guarantee that the post-slice computation is mathematically equivalent to the un-sliced version (up to small floating-point differences). The trick is to pair a column shard with a row shard in every layer:

- W_q / W_k / W_v shard along the output dim (column): each card owns 1/TP of the heads.
- W_o shards along the input dim (row): each card multiplies its slice, then an all-reduce sums everything up.
- W_gate / W_up are column-sharded; W_down is row-sharded, followed by an all-reduce.

So one transformer layer under TP=8 has two all-reduces (one at the attention exit, one at the FFN exit). 32 layers → 64 all-reduces. Each all-reduce moves ~n_tokens × d_model × 2 bytes: ~30 KB at decode, several MB at prefill. On NVLink 4.0 (900 GB/s bidirectional, 450 GB/s per direction), one all-reduce takes ~5-50 μs; 32 layers stack up to ~200 μs to several ms, 5-15% of total latency.
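Why the column-then-row pairing needs only one all-reduce can be checked on a toy FFN. This is a sketch under loud simplifications: plain Python lists instead of GPU tensors, ReLU instead of the gated SwiGLU, no W_gate, and a Python sum standing in for the all-reduce. All helper names are made up:

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def col_shard(w, parts):
    """Split a matrix into `parts` groups of columns (output dim)."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def row_shard(w, parts):
    """Split a matrix into `parts` groups of rows (input dim)."""
    step = len(w) // parts
    return [w[p * step:(p + 1) * step] for p in range(parts)]

def ffn(x, w_up, w_down):
    return matmul(relu(matmul(x, w_up)), w_down)

def ffn_tp(x, w_up, w_down, tp):
    # Each "card" holds one column slice of w_up and the matching row slice
    # of w_down, so the elementwise activation needs no communication.
    partials = [matmul(relu(matmul(x, wu)), wd)
                for wu, wd in zip(col_shard(w_up, tp), row_shard(w_down, tp))]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)  # stands in for the single all-reduce at the FFN exit
    return out

x = [[1, -2]]
w_up = [[1, 2, -1, 0], [0, 1, 3, -2]]
w_down = [[1, 0], [2, 1], [0, -1], [1, 1]]
print(ffn(x, w_up, w_down), ffn_tp(x, w_up, w_down, tp=2))
```

Sharding the first matrix by rows instead would force a sum before the nonlinearity, which is the extra communication the pairing avoids.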
PP 看起来很自然——"把 32 层分给 8 张卡,每卡 4 层"——但 forward 是顺序的:卡 A 算完 layer 0-3,把结果发给卡 B 算 layer 4-7,卡 B 算的时候卡 A 在干嘛?闲着。这就是pipeline bubble:第一个 token 流过 8 张卡的时候,前 7 张卡都在等。
解决的办法是 micro-batching:把 batch 切成多个 micro-batch,让卡 A 算完 mb_0 立刻开始 mb_1,同时卡 B 在算 mb_0。理想情况下 8 张卡同时各自处理一个 micro-batch,气泡降到 ~1/n_micro_batches。
但decode 不能 micro-batch——一个 batch step 只有 1 个 token,切不动。所以 PP 在 decode 阶段气泡严重——卡 A 算一个 token 的 4 层 → 卡 B → 卡 C → ……→ 卡 H 算完出来,中间 7 张卡每次都要等。这就是为什么纯 PP 推理几乎没人用,生产环境都是 PP + TP 混合(大节点之间 PP,节点内 TP)。
PP looks natural — "32 layers across 8 cards, 4 layers each" — but forward is sequential: card A finishes layers 0-3, sends to card B for 4-7. What's card A doing while B works? Sitting idle. That's the pipeline bubble: as the first token flows through 8 cards, 7 cards wait.
The fix is micro-batching: slice the batch into multiple micro-batches; A starts mb_1 the moment it finishes mb_0, while B runs mb_0. Ideally 8 cards each process one micro-batch concurrently; bubble drops to ~1/n_micro_batches.
But decode can't micro-batch — one batch step has only 1 token to spread. So PP at decode has severe bubbles — A does 4 layers, B, C, ..., H finishes; 7 cards always waiting. Which is why pure-PP inference is rarely used; production deployments use PP + TP mixed (PP across nodes, TP within).
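The bubble can be put in numbers with the standard fill-and-drain estimate, assuming every stage takes the same time per micro-batch and communication is free (both idealizations; the function name is illustrative):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle share of a simple fill-and-drain pipeline schedule.

    The pipeline needs (micro_batches + stages - 1) time slots in total,
    and each stage is busy for only micro_batches of them.
    """
    return (stages - 1) / (micro_batches + stages - 1)


print(bubble_fraction(8, 1))    # nothing to overlap: most cards idle
print(bubble_fraction(8, 32))   # many micro-batches: the bubble shrinks
```

With a single micro-batch the idle share is 7/8, which is the "7 cards wait" picture above; it only becomes small once the number of micro-batches is several times the number of stages.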
Llama-3-405B 在 fp8 量化下是 ~405 GB,GQA-8 在 128K 上下文下每用户 KV cache ~10 GB。一台 8×H100 节点 640 GB,拆账如下:
这个 SKU 极度奢侈:一节点 ~$32/h,15 用户 = $2.13/用户·小时。这就是为什么 405B 服务的per-token 定价比 70B 服务高 5×——单 GPU 服务密度低了 10×。
下一代的优化方向:
Llama-3-405B in fp8 = ~405 GB; GQA-8 at 128K context = ~10 GB KV cache per user. One 8×H100 node (640 GB):
This SKU is luxurious: one node at ~$32/h, 15 users = $2.13/user-hour. Hence why 405B's per-token pricing is 5× a 70B's — service density is 10× lower per GPU.
Next-gen directions:
SSE · 流式输出 · 对话模板
SSE · streaming · the chat template
前 22 章讲的全是怎么生成 token。但 token 不是用户看到的东西——用户看到的是字符串,而且是一个一个蹦出来的。从 token 到屏幕,这"最后一公里"有两件事要做:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n你好,llama<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n(Llama-3 格式)。这一层包装错了模型就不工作。
The previous 22 chapters were entirely about generating tokens. But tokens aren't what users see: users see a string, and one that appears character by character. From token to screen, this "last mile" has two pieces:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello, llama<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n (Llama-3 format). Get this wrapping wrong and the model breaks.
这三种格式的每个分隔符都是特殊 token(不在普通词表里,需要 tokenizer 显式插入)。Llama-3 的 <|eot_id|> 是 128009,<|start_header_id|> 是 128006。如果你拿 user 输入直接 tokenize,这些 token 永远不会出现——必须由应用层把它们插进去。
llama.cpp 把这件事放到 llama_chat_apply_template 函数,通过 模型自带的 Jinja2 模板(GGUF metadata 里有)渲染。OpenAI API 兼容的 chat 接口(包括 vLLM 的)都会自动做这一步——但裸 completion 接口需要你自己来。这就是为什么 raw completion 和 chat completion API 在同一个模型上输出风格差异巨大:不是模型不一样,是 chat completion 帮你套了正确的模板。
Every separator in these formats is a special token (not in regular vocab; must be explicitly inserted by the tokenizer). Llama-3's <|eot_id|> is 128009, <|start_header_id|> is 128006. If you tokenize raw user input, these never appear — they must be injected by the application layer.
llama.cpp handles this in llama_chat_apply_template, rendering the model's bundled Jinja2 template (stored in GGUF metadata). OpenAI-compatible chat APIs (vLLM included) do this automatically — but the raw completion endpoint requires you to do it. This is why raw completion and chat completion on the same model produce wildly different styles: not because the model differs, but because chat completion applied the correct template for you.
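What the template step produces can be shown with a hand-rolled renderer for the Llama-3 format quoted above. This is a sketch for illustration, not llama.cpp's code path (which renders the Jinja2 template from the GGUF metadata); the function name and the `add_generation_prompt` parameter are assumptions borrowed from common chat-template conventions:

```python
def render_llama3(messages, add_generation_prompt=True):
    """Wrap a list of {"role", "content"} dicts in Llama-3 chat markers."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues as the assistant.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


prompt = render_llama3([{"role": "user", "content": "hello, llama"}])
print(repr(prompt))
```

The resulting string still has to go through a tokenizer that maps the markers to their special-token ids; tokenizing them as ordinary text would defeat the purpose.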
Server-Sent Events 协议非常简单:HTTP 响应不结束,服务器持续往 socket 上写 data: {...}\n\n 行,客户端 EventSource 一行行解析。OpenAI / vLLM / Anthropic 的 streaming API 都是这个协议:
SSE is dead simple: the HTTP response never ends; the server keeps writing data: {...}\n\n lines to the socket; the client EventSource parses line by line. OpenAI / vLLM / Anthropic streaming APIs all use this:
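A minimal sketch of the client side, assuming the whole body is already in memory and that each event carries a single `data:` line. The `delta` field is a simplified, made-up payload; the `[DONE]` sentinel follows the convention OpenAI-style streams use to mark the end:

```python
import json


def iter_sse_data(raw: str):
    """Yield the payload of each `data:` line in an SSE body."""
    for event in raw.split("\n\n"):          # events are separated by a blank line
        for line in event.split("\n"):
            if line.startswith("data:"):
                yield line[len("data:"):].strip()


stream = (
    'data: {"delta": "Hel"}\n\n'
    'data: {"delta": "lo"}\n\n'
    'data: [DONE]\n\n'
)

text = ""
for payload in iter_sse_data(stream):
    if payload == "[DONE]":
        break
    text += json.loads(payload)["delta"]
print(text)
```

A production client reads from the socket incrementally and has to cope with an event split across two reads, which this sketch sidesteps by parsing a complete string.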
看起来简单,但实际有三个坑:
AsyncEngineClient 里通过 cancellation token 检测,llama.cpp HTTP server 也有 --idle-timeout。否则 idle 的 KV cache 会一直占着,新请求被挤掉。
Looks simple, but three real footguns:
AsyncEngineClient; llama.cpp HTTP server has --idle-timeout. Otherwise idle KV cache squats and crowds out new requests.
第 3 章讲了 tokenize。反向(detokenize)看似简单——查 vocab 表把 id 翻成字符串拼起来。但实际不能这么干:Llama-3 的 vocab 里有几千个"带前导空格"的 token(" the", " hello"),拼接时如果你直接 concat 那些 token 的字符串,会得到 "the helloworld" 这样的乱七八糟的输出——空格的位置全错了。
正确做法:每个 token 的字符串已经带了它该带的前导空格(BPE 的训练就是这么做的)。所以 detokenize 就是简单的字符串 concat,不要自作主张加空格。但 leading <|begin_of_text|> 这类 special token 在 detokenize 时要被吃掉,不能直接输出到用户屏幕。llama.cpp 通过 llama_vocab::detokenize(..., remove_special=true) 标记区分。
流式场景下,detokenize 必须按 token 增量跑,但不能假设每个 token 是独立可解码的 utf-8 序列。Chinese token "你"是 id 47045,detokenize 后是 3 字节 utf-8 E4 BD A0 ——干净的 1 字符。但 emoji "🤖"(U+1F916)在 BPE 里可能被拆成2 个 token,每个 token 单独解码出来是非法 utf-8。所以服务器要维护一个"待发字节缓冲",每次检查这个缓冲从头能解出多少完整字符,只发出可解部分,剩下半截留到下一次拼接。
Ch.3 covered tokenize. The reverse (detokenize) looks easy: look each id up in the vocab and concatenate. The catch is whitespace: Llama-3's vocab has thousands of tokens with a leading space (" the", " hello"), so any scheme that strips those spaces or inserts its own separators produces output like "the helloworld", with spaces in the wrong places.
Correct: each token's string already carries its leading space (BPE training did this for you). So detokenize really is plain concatenation; don't add spaces yourself. But special tokens like <|begin_of_text|> must be swallowed at detokenize time, not shown to the user. llama.cpp distinguishes the two cases via llama_vocab::detokenize(..., remove_special=true).
In streaming, detokenize runs incrementally per token, but you can't assume each token decodes to a clean utf-8 substring. The Chinese "你" (id 47045) detokenizes to 3 utf-8 bytes E4 BD A0 — clean. But "🤖" (U+1F916) may be split across 2 BPE tokens; each individually is invalid utf-8. Servers must maintain a "pending byte buffer", checking how much of it forms complete characters, emitting only that, leaving the partial bytes for the next concat.
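The "pending byte buffer" is exactly what an incremental UTF-8 decoder provides. A minimal sketch using Python's standard library; the class name is illustrative, and the way the emoji's four bytes are split across two tokens is a made-up example of what a BPE vocab might do:

```python
import codecs


class StreamingDetok:
    """Turns per-token byte strings into text, holding back partial characters."""

    def __init__(self):
        self._dec = codecs.getincrementaldecoder("utf-8")()

    def push(self, token_bytes: bytes) -> str:
        # Returns only the characters that are complete so far; trailing
        # bytes of an unfinished character stay buffered inside the decoder.
        return self._dec.decode(token_bytes)


d = StreamingDetok()
print(repr(d.push(b"\xe4\xbd\xa0")))   # a whole character arrives in one token
print(repr(d.push(b"\xf0\x9f")))       # first half of a 4-byte emoji: nothing to send yet
print(repr(d.push(b"\xa4\x96")))       # second half completes it
```

The server sends a chunk to the client only when `push` returns a non-empty string, so the browser never sees half a character.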
GBNF · 限制采样空间 · 函数调用
GBNF · constrained sampling · function calling
很多生产场景下,你需要模型输出严格格式的内容——比如一段 JSON、一个 SQL 语句、一个函数调用。朴素方法是"在 prompt 里求模型乖一点"——但这不可靠,模型偶尔会瞎写。可靠的方法是constrained decoding:在 sampler chain 里加一道关卡,把所有违法 token 的 logit 设为 -inf,模型从一开始就不可能选到非法 token。
llama.cpp 内置 GBNF(GGML BNF,一种 BNF 变体)语法引擎,vLLM 用 outlines / xgrammar。两者本质一样:用户给一个语法,引擎构建一个有限状态机,每生成一个 token 都更新 FSM 状态,只允许"从当前状态能走出去"的 token。
Many production scenarios need strictly formatted output — JSON, SQL, function calls. The naive approach is "beg the model to behave in the prompt" — but it's unreliable; the model occasionally goes off. The reliable approach is constrained decoding: add a guard to the sampler chain that sets the logit of every illegal token to -inf — the model literally cannot pick an invalid one.
llama.cpp ships GBNF (GGML BNF, a BNF variant) grammar engine; vLLM uses outlines / xgrammar. Both work the same way: user provides a grammar, engine builds a finite state machine, every emitted token updates FSM state, and only "tokens that can leave the current state" are allowed.
每生成一个 token 之前,grammar sampler 都要做一件事:遍历整个 vocab 表(128k),对每个 token 的字符串问"这串字符接在当前 FSM 状态后面合法吗?" 不合法的 logit 设为 -inf。这听起来贵——但 llama.cpp 做了大量优化:
实测开 grammar 后 sampling 慢 ~30%——但整体推理时间几乎不变,因为 sampling 本来就占不到 5% 总时间。换来 100% 的格式合规率,是绝对划算的交易。
Before emitting each token, the grammar sampler does one thing: iterate the entire vocab (128k), ask of each token's string "can this string follow the current FSM state legally?" Set illegal tokens' logits to -inf. Sounds expensive — but llama.cpp does heavy optimization:
Measured: sampling is ~30% slower with grammar on — but total inference time barely moves, because sampling was <5% of the total. 100% format compliance for a near-free cost. Excellent trade.
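The masking step itself is tiny; the engineering is in making the "is this token legal here?" check fast. A toy sketch with a four-token vocabulary and a hand-written rule in place of a compiled grammar (output must be one to three digits, then the end token). Everything here, including the vocabulary and function names, is illustrative and unrelated to llama.cpp's GBNF internals:

```python
import math


def apply_grammar_mask(logits, vocab, is_allowed):
    """Set the logit of every token the grammar forbids to -inf."""
    return [l if is_allowed(tok) else -math.inf for l, tok in zip(logits, vocab)]


def digits_only(prefix: str):
    """Toy 'grammar': 1-3 digits, then <eos>. Returns the legality check
    for the state reached after emitting `prefix`."""
    def is_allowed(tok: str) -> bool:
        if tok == "<eos>":
            return len(prefix) >= 1
        return tok.isdigit() and len(prefix) + len(tok) <= 3
    return is_allowed


vocab = ["7", "42", "cat", "<eos>"]
logits = [1.0, 0.5, 3.0, 2.0]            # the raw model prefers "cat"

print(apply_grammar_mask(logits, vocab, digits_only("")))
print(apply_grammar_mask(logits, vocab, digits_only("42")))
```

After masking, the rest of the sampler chain (temperature, top-p, and so on) runs unchanged; tokens at -inf simply end up with zero probability.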
OpenAI 的 function calling 看起来像魔法——你给个 schema,模型自动产出符合 schema 的 JSON。背后就是constrained decoding:OpenAI 把每个 function schema 编译成 GBNF / outlines 形式的 grammar,推理时挂到 sampler chain 上,保证输出 100% schema-valid。它的"不会输出非法 JSON"承诺,是工程而不是魔法。
开源世界里,vLLM 的 guided_decoding 参数支持 JSON Schema / regex / lark grammar 三种约束源——它会自动把 JSON Schema 编译成 xgrammar 的内部 FSM。Anthropic 的 tool use API 也是类似实现。
有意思的限制:constrained decoding 让模型保证格式合法,但不保证内容正确。比如你约束输出 {"age": number},模型可能输出 {"age": 999999}——格式合法,内容荒谬。"格式"和"语义"是两件事——constrained decoding 解决前者,后者还得靠模型本身。
OpenAI's function calling looks magical: give it a schema and the model produces schema-valid JSON. Behind the scenes it is constrained decoding. OpenAI compiles each function schema into a grammar (in the spirit of GBNF / outlines) and attaches it to the sampler chain at inference time. The "can't emit invalid JSON" guarantee is engineering, not magic.
In the open-source world, vLLM's guided_decoding parameter accepts three constraint sources (JSON Schema, regex, Lark grammar) and compiles JSON Schema into xgrammar's internal FSM. Anthropic's tool-use API uses a similar mechanism.
An interesting limitation: constrained decoding guarantees format legality, not content correctness. Constrain output to {"age": number} and the model can still emit {"age": 999999} — format legal, content absurd. Format and semantics are different — constrained decoding solves the former, the latter falls back on the model itself.
constrained decoding 的缺陷:它强迫模型输出特定 token,但模型可能不擅长在那个 token 上——一个不熟悉 JSON 的模型被强迫输出 JSON,可能在每个 key 上瞎选(因为它的概率分布不集中在合理 key 上)。
OpenAI / Anthropic 的"structured output" SOTA 是constraint + finetune 联动:在 SFT(supervised fine-tuning)阶段大量喂"schema → schema-valid JSON"的对应,模型自己就学会了"看到 schema 描述,自然产出合法 JSON"。这时再叠加 constrained decoding,既双保险又语义自然。
这就是为什么 GPT-4o structured outputs 比 Llama-3 + GBNF JSON 输出的语义质量高一截——前者是"模型本来就想这么写,grammar 兜底",后者是"模型不太会,grammar 硬掰"。同样的工具,语义底子不同,输出质量差很多。
Constrained decoding has a weakness: it forces the model into specific tokens, but the model may be bad at choosing those — forcing a JSON-naive model to emit JSON makes it pick keys somewhat randomly (its probability mass isn't concentrated on sensible keys).
OpenAI's / Anthropic's structured output SOTA is constraint + finetune in tandem: SFT (supervised fine-tuning) heavy on "schema → schema-valid JSON" pairs, so the model learns "see schema description, naturally emit valid JSON". Layer constrained decoding on top — double insurance and semantically natural.
This is why GPT-4o structured outputs are semantically a class above Llama-3 + GBNF JSON — the former is "model wants to write it this way, grammar backstop"; the latter is "model isn't great, grammar forces it". Same tool, different semantic base, very different quality.
vision encoder · projector · 像素也变成 token
vision encoder · projector · pixels become tokens too
前 24 章的所有讨论都基于一个隐含前提:输入是文字。但 GPT-4o / Claude 3.5 / Gemini / Llama-3.2 Vision 这些"看得见图"的模型怎么工作?它们没有专门处理图片的内部模块——而是把图片"翻译"成 transformer 听得懂的语言:tokens。一张 512×512 的图,在模型眼里就是 256 个"视觉 token",跟文字 token 一起排成同一个序列送进同一个 transformer。
这件事的精彩在于:整本书前 24 章的所有内容,在多模态模型上都不变——KV cache / attention / MoE / sampling / speculative / 量化全部照旧。改变的只是序列的开头那几百个 token从哪里来。
The whole article so far had an implicit assumption: input is text. So how do "see-the-image" models — GPT-4o / Claude 3.5 / Gemini / Llama-3.2 Vision — work? They don't have a special image module inside; they "translate" images into the language the transformer already understands: tokens. A 512×512 image, in the model's eyes, becomes 256 "visual tokens" sitting in the same sequence as text tokens, fed into the same transformer.
The elegant part: everything in the previous 24 chapters still applies to multimodal models — KV cache, attention, MoE, sampling, speculative, quantization. What changes is only where the first few hundred tokens of the sequence come from.
主流 vision encoder(LLaVA / Llama-3.2-Vision / Qwen2-VL 等)都走类似流程:
这一下子产生 256 个"视觉 token" 塞进序列前面——你可以把它们想象成 256 个"系统说话"——后续 LLM 推理跟纯文本完全一样。所以一张图实际占用 256 个上下文 token + 对应的 KV cache。这是为什么 GPT-4 Vision 的 image input 比 text input 贵——一张高清图等价于 ~1500 个文字 token,而且必须经过完整 prefill。
Mainstream vision encoders (LLaVA / Llama-3.2-Vision / Qwen2-VL) all follow a similar pipeline:
This produces 256 "visual tokens" prepended to the sequence — think of them as 256 "system utterances". Subsequent LLM inference is identical to text. So an image actually occupies 256 context tokens + the matching KV cache. That's why GPT-4 Vision's image input is expensive — a high-res image equals ~1500 text tokens, all of which must go through a full prefill.
同样占一个 token 位置,visual token 从产生到使用的成本远高于文本 token,因为它要先走一个完整的 vision encoder。Llama-3.2-Vision 的 CLIP-ViT-L 是 ~300M 参数,24 层 transformer——一张 512×512 图的 encoder forward 大约 ~80 GFLOPs,在 H100 上 ~15 ms。
这件事的几个隐性影响:
所以"多模态模型在视觉任务上推理慢" 不是模型质量问题,是视觉路径上的固定开销。这也是为什么 GPT-4o 这类原生多模态模型用了"visual token 池化"或"动态分辨率"等优化——把 256 个 visual token 减到 64-128 个,但精度损失需要靠 finetune 补。
For the same token slot, a visual token's create-to-consume cost vastly exceeds a text token's — it first runs through a full vision encoder. Llama-3.2-Vision's CLIP-ViT-L is ~300M params, 24 layers; one 512×512 image encoder forward is ~80 GFLOPs, ~15 ms on H100.
Hidden implications:
So "multimodal inference is slow on vision tasks" isn't a model-quality issue — it's a fixed visual-path overhead. This is why GPT-4o-class native multimodal models use "visual token pooling" or "dynamic resolution" optimizations — squeezing 256 visual tokens down to 64-128, with precision losses recovered through finetuning.
上面讲的"把 visual token 拼到 text token 前面" 是early fusion(LLaVA / Llama-3.2-Vision)。还有两种思路:
实际推理引擎里:llama.cpp 主要支持 early fusion(LLaVA 兼容),Llama-3.2-Vision 的 cross-attention 变体需要专门支持(因为它要在 graph 里插入额外的 attention 节点,KV cache 设计也不同)。支持新的多模态架构 是推理引擎团队最常加班的事——每出一个新模型,kernel + graph 都要改。
The above — "prepend visual tokens to text tokens" — is early fusion (LLaVA / Llama-3.2-Vision). Two other approaches:
In real inference engines: llama.cpp mainly supports early fusion (LLaVA-compatible); Llama-3.2-Vision's cross-attention variant needs dedicated support (extra attention nodes in the graph, different KV cache layout). Supporting new multimodal architectures is what inference-engine teams pull overtime for — each new model means kernel and graph changes.
同样的"把任意东西编码成 token" 思路对其他模态也成立:
所以 LLM 推理引擎本质上在变成"通用 token 处理器"——前端接什么模态的 encoder 不重要,只要能产出 token 就行。前 24 章讲的全部推理优化(KV cache / attention / sampling / batching / 量化)全部适用。这是 LLM 范式的力量:把一切异构数据统一到 token 序列上,优化路径就只有一条。
The same "encode anything as tokens" idea works for other modalities:
So LLM inference engines are essentially becoming "universal token processors" — what encoder feeds the front-end doesn't matter, as long as it produces tokens. Everything from the previous 24 chapters (KV cache / attention / sampling / batching / quantization) applies. That's the power of the LLM paradigm: unify all heterogeneous data into token sequences and have only one optimization path.
o1 / R1 · test-time compute · "思考"也是推理
o1 / R1 · test-time compute · "thinking" is inference too
2024 年 9 月 OpenAI 发布 o1,把整个 LLM 推理的economics翻了个底朝天。之前的模型生成 100-1000 个 token 就给出最终回答;o1 / DeepSeek-R1 / Gemini-2-Flash-Thinking 可以"思考" 10000-50000 个内部 token,然后才输出几百 token 的最终回答。这些"思考 token"对用户是不可见的,但推理引擎要扎扎实实地生成它们。
从架构上,reasoning 模型跟普通 LLM 一模一样——同样的 transformer / attention / FFN / MoE。所有奇迹都来自训练数据 + RL 奖励:模型被训练成在给出答案前先生成一大段"思考过程"(在 <think>...</think> 标签里包着),并通过 RLHF / RLVR 让"思考能提升答案质量" 成为模型的本能。
但这件事在推理工程上的影响巨大。
In September 2024 OpenAI released o1 and turned the economics of LLM inference upside down. Earlier models emitted 100-1000 tokens to reach a final answer; o1 / DeepSeek-R1 / Gemini-2-Flash-Thinking can "think" for 10,000-50,000 internal tokens before producing a few hundred tokens of final answer. These "thinking tokens" are invisible to the user, but the inference engine still has to generate every one of them.
Architecturally, reasoning models are identical to regular LLMs — same transformer / attention / FFN / MoE. The magic comes from training data + RL rewards: the model is trained to produce a long "thinking" segment (wrapped in <think>...</think> tags) before answering, and RLHF / RLVR makes "thinking improves answer quality" an instinct.
But the impact on inference engineering is enormous.
第 2 章那张 "TTFT + n_out × TPOT" 表在 o1 时代彻底变形。一个典型 o1 query:
用户看到的是"慢但准" 的体验——但服务方付的是20 秒的 GPU 时间,而不是之前的 ~2 秒。一个 o1 query 的真实推理成本比同等长度的 GPT-4 答案贵 10×——这就是为什么 o1 API 定价高,而且 OpenAI 把"reasoning effort" 做成参数(low / medium / high)给用户选——让用户在成本和质量之间显式选。
The Ch.2 "TTFT + n_out × TPOT" formula warps completely in the o1 era. A typical o1 query:
The user sees "slow but accurate", but the service pays for 20 seconds of GPU time instead of the earlier ~2 s. An o1 query's real inference cost is therefore roughly 10× that of a GPT-4 answer with the same visible length. That is why o1 API pricing is high, and why OpenAI exposes "reasoning effort" as a parameter (low / medium / high) that lets users explicitly trade cost for quality.
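The 20 s versus ~2 s gap follows directly from the "TTFT + n_out × TPOT" formula once thinking tokens are counted in n_out. The sketch below uses illustrative numbers that are assumptions, not measurements: a 0.2 s TTFT, a 10 ms TPOT, 180 answer tokens, and 1,800 thinking tokens, chosen only so the totals land on the 2 s and 20 s figures quoted above. A 10,000-50,000 token trace scales the same way.

```python
def request_seconds(ttft_s: float, tpot_s: float, n_answer: int, n_think: int = 0) -> float:
    """Latency model: TTFT + n_out * TPOT, where n_out includes hidden thinking tokens."""
    return ttft_s + (n_think + n_answer) * tpot_s


# Illustrative assumptions (not taken from any real deployment).
TTFT_S = 0.2     # time to first token
TPOT_S = 0.01    # time per output token
N_ANSWER = 180   # tokens the user actually sees
N_THINK = 1800   # hidden reasoning tokens

plain = request_seconds(TTFT_S, TPOT_S, N_ANSWER)
reasoning = request_seconds(TTFT_S, TPOT_S, N_ANSWER, N_THINK)

if __name__ == "__main__":
    print(f"plain answer:     {plain:.1f} s")
    print(f"reasoning answer: {reasoning:.1f} s")
    print(f"cost ratio:       {reasoning / plain:.1f}x")
```

Because decode is serial, the thinking tokens add to latency and GPU time linearly; nothing about them can be amortized the way prefill can.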
o1 时代的成本结构推动了几个很新的推理优化方向:
</think> 后才开始流答案,客户端体验是"很久没动静,然后一下子开始输出"。
有意思的是:reasoning 模型几乎是第一次让推理引擎需要为"大量串行 decode" 而不是"低延迟用户响应" 做优化。这反过来又催生 speculative decoding / fp4 / 更高效 attention kernel 等被研究界忽视已久的硬件路线——因为现在 decode 速度直接决定 reasoning 质量。
The o1-era cost structure drives several very new inference optimization directions:
</think>; client-side it feels like "silent for a long time, then suddenly outputs".
Notably, reasoning models are almost the first workload that forces inference engines to optimize for "massive serial decode" rather than "low-latency user response". That in turn pushes speculative decoding, fp4, and more efficient attention kernels, paths the research community had long neglected, because decode speed now directly determines how much reasoning, and therefore how much answer quality, fits into a given time budget.
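The "silent for a long time, then suddenly outputs" behaviour can be sketched as a filter on the token stream that swallows everything up to the closing `</think>` tag and only then starts forwarding text. This is a minimal sketch, not any particular server's implementation: the name `stream_visible` is hypothetical, and it assumes tokens arrive as plain strings with the tag possibly split across several of them.

```python
from typing import Iterable, Iterator


def stream_visible(tokens: Iterable[str], close_tag: str = "</think>") -> Iterator[str]:
    """Yield only the text that follows the first occurrence of close_tag."""
    buffer = ""       # tail of the hidden text, kept to catch a tag split across tokens
    thinking = True
    for tok in tokens:
        if not thinking:
            yield tok  # past the tag: stream straight through
            continue
        buffer += tok
        idx = buffer.find(close_tag)
        if idx == -1:
            # No tag yet: keep just enough characters to match a tag
            # that straddles this token and the next one.
            buffer = buffer[-(len(close_tag) - 1):]
            continue
        thinking = False
        tail = buffer[idx + len(close_tag):]
        if tail:
            yield tail  # text that shared a token with the tag


if __name__ == "__main__":
    demo = ["<think>", "let me ", "check", "</th", "ink>", "The answer", " is 4."]
    print("".join(stream_visible(demo)))
```

The engine still decodes every hidden token at full cost; the filter only changes what the client sees and when.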
A100 → H100 → B200 · 同一个模型 · 不同的物理舞台
A100 → H100 → B200 · same model · different physical stage
前 26 章基本都把硬件当"就这样"——但其实每两年NVIDIA 发布一代新 GPU,推理引擎都要大改一遍。从 2020 年的 A100 到 2024 年的 H200 到 2025 年的 B200,关键性能指标变了几倍,推理路径上的优化策略也跟着变。
The previous 26 chapters mostly treated hardware as a given, but NVIDIA ships a new GPU generation roughly every two years, and each time significant parts of the inference engine have to be rewritten. From A100 (2020) to H200 (2024) to B200 (2025), key performance metrics moved by multiples, and the optimization strategy along the inference path moved with them.
| | A100 80GB | H100 80GB | H200 141GB | B200 192GB |
|---|---|---|---|---|
| arch | Ampere | Hopper | Hopper | Blackwell |
| fp16 TFLOPS | 312 | 990 | 990 | 2250 |
| fp8 TFLOPS | — | 1980 | 1980 | 4500 |
| fp4 TFLOPS | — | — | — | 9000 |
| HBM capacity | 80 GB | 80 GB | 141 GB | 192 GB |
| HBM bandwidth | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| NVLink (bidirectional) | 600 GB/s | 900 GB/s | 900 GB/s | 1800 GB/s |
| key new HW | Tensor Core gen 3 | fp8 · TMA · WGMMA | + HBM3e | fp4 · 2nd-gen Transformer Engine |
| inference impact | FlashAttention v1 takes off | fp8 KV · FA v3 · TMA prefetch | long ctx · large model single-card | fp4 weights · train/infer fused |
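Two quantities can be read straight off the table: the roofline ridge point (how many FLOPs a kernel must do per byte moved before compute, not bandwidth, is the limit) and a bandwidth-only upper bound on batch-1 decode speed. The sketch below is a back-of-the-envelope calculation, not a benchmark. The 8B-parameter fp16 model (16 GB of weights) is an assumption chosen for illustration, and the bound ignores KV-cache reads, kernel overheads, and batching.

```python
# fp16 TFLOPS and HBM bandwidth copied from the table above.
GPUS = {
    "A100": {"fp16_tflops": 312.0, "hbm_tb_s": 2.0},
    "H100": {"fp16_tflops": 990.0, "hbm_tb_s": 3.35},
    "H200": {"fp16_tflops": 990.0, "hbm_tb_s": 4.8},
    "B200": {"fp16_tflops": 2250.0, "hbm_tb_s": 8.0},
}


def ridge_point(gpu: str) -> float:
    """FLOPs per byte at which compute and HBM bandwidth are balanced."""
    spec = GPUS[gpu]
    return (spec["fp16_tflops"] * 1e12) / (spec["hbm_tb_s"] * 1e12)


def decode_tokens_per_s_bound(gpu: str, weight_bytes: float) -> float:
    """Upper bound for batch-1 decode: every token must stream all weights once."""
    return GPUS[gpu]["hbm_tb_s"] * 1e12 / weight_bytes


# Assumption: an 8B-parameter model stored in fp16 (2 bytes per parameter).
WEIGHT_BYTES = 8e9 * 2

if __name__ == "__main__":
    for name in GPUS:
        print(
            f"{name}: ridge {ridge_point(name):6.1f} FLOP/byte, "
            f"decode <= {decode_tokens_per_s_bound(name, WEIGHT_BYTES):6.1f} tok/s"
        )
```

Under these assumptions the decode bound tracks the HBM bandwidth row, not the TFLOPS row, which is one way to see why H200 helps decode even though its compute matches H100.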
Hopper(H100)上的三个新硬件单元是近两年所有推理优化的物理基础:
这三个加在一起让 H100 上 fp16 模型的 attention kernel 利用率从 50% 涨到 70%+,fp8 模型再翻倍。不写新 kernel,根本拿不到这些收益——这是为什么"新硬件→量产推理栈" 总要 6-12 个月。
Three new hardware units in Hopper (H100) are the physical basis for nearly every inference optimization of the past two years:
Combined, these three lift attention-kernel utilization for fp16 models on H100 from about 50% to 70%+, and fp8 models roughly double throughput again. You can't get these gains without writing new kernels, which is why "new hardware → production inference stack" always takes 6-12 months.
2025 年 NVIDIA 出的 B200 / GB200 NVL72 系统是 LLM 推理的新一代分水岭。两个关键变化:
更深远的:Blackwell 的"Transformer Engine 2.0" 把训练时的 fp8 自动校准跟推理时的 fp4 部署放进同一个软件栈——一个模型可以 fp8 训完直接 fp4 部署,损失自动测算补偿。训练和推理的工具链开始合体——这是过去 5 年里第一次。下一步可能是 2026-2027 年的"训练时就为 fp4 部署优化" (quantization-aware training natively)。
NVIDIA's B200 / GB200 NVL72 systems (2025) mark a new watershed for LLM inference. Two key shifts:
Deeper still: Blackwell's "Transformer Engine 2.0" puts training-time fp8 auto-calibration and inference-time fp4 deployment in the same software stack — a model trained fp8 can deploy fp4 directly, with losses auto-measured and compensated. Training and inference toolchains begin to merge — a first in five years. The next step is probably 2026-2027's "train natively optimized for fp4 deployment" (quantization-aware training as default).
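To make "fp4 deployment, with losses measured" concrete, here is a toy block quantizer onto the fp4 E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). It is a sketch under stated assumptions and not how Transformer Engine works internally: the function names are hypothetical, the per-block scale is kept as an ordinary Python float (real formats also store the scale in low precision), and ties are broken toward the smaller magnitude.

```python
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(xs: List[float]) -> Tuple[float, List[float]]:
    """Map a block of values onto the E2M1 grid with one shared scale."""
    amax = max(abs(x) for x in xs)
    # Scale so the largest magnitude in the block lands on 6.0, the top of the grid.
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in xs:
        target = abs(x) / scale
        mag = min(E2M1_MAGNITUDES, key=lambda v: abs(target - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale: float, codes: List[float]) -> List[float]:
    """Reconstruct approximate values from the scale and grid codes."""
    return [scale * c for c in codes]


def max_abs_error(xs: List[float]) -> float:
    """Worst-case reconstruction error for one block."""
    scale, codes = quantize_block(xs)
    return max(abs(a - b) for a, b in zip(xs, dequantize_block(scale, codes)))


if __name__ == "__main__":
    block = [6.0, -3.0, 0.4, 0.0]
    print(quantize_block(block))
    print("max abs error:", max_abs_error(block))
```

With only eight magnitudes per sign, the error is dominated by how well the shared scale fits each block, which is why the measurement and compensation step matters.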
NVIDIA 80% 的 LLM 推理市场不是没人挑战,但每一家路径不同:
趋势:NVIDIA 软件生态 + 算力领先 还会持续 2-3 年,然后 AMD MI 系列追上的可能性最大(软件栈靠 vLLM / SGLang 等开源项目慢慢补齐)。TPU 和 Trainium 是"封闭花园"——技术上不一定差,但锁生态。
NVIDIA's roughly 80% share of the LLM-inference market does have challengers, but each one takes a different path:
Trend: NVIDIA's lead in software ecosystem and compute will probably persist for another 2-3 years, after which the AMD MI series is the most likely to catch up (its software gap is gradually being closed by vLLM, SGLang, and other open-source projects). TPU and Trainium are "walled gardens": not necessarily worse technically, but locked to their own ecosystems.
前 24 章的所有断言 · 自己验证
verify every claim above · with your own eyes
这一章不讲新东西,只讲怎么自己看。前面 27 章的每个断言都可以被验证——下面是一组最实用的工具。
No new content in this chapter, only how to look for yourself. Every claim in the previous 27 chapters can be verified. Here are the most practical tools.
每次 forward 都构建一张 ggml_cgraph——它就是第 16 章 stack trace 的静态图表示。打印出来能看到 32 层 attention 的每一个节点。
Every forward constructs a ggml_cgraph — the static-graph form of Ch. 16's stack trace. Print it and you see every node across all 32 layers.
想看 H100 上 attention 那一步真的把 SM 喂饱了没?用 NVIDIA 自家的 NSight。ncu(Compute)给单 kernel 的微观分析——FLOPs 利用率、memory throughput、warp stall 原因。nsys(Systems)给整个时间轴——你能看到 prefill 那 5 ms 里有几次 kernel launch,decode 每步等了多久 memory。
Want to verify that attention is actually saturating the SMs on H100? Use NVIDIA's own Nsight tools. ncu (Nsight Compute) gives per-kernel microscopy: FLOPs utilization, memory throughput, warp-stall reasons. nsys (Nsight Systems) gives the full timeline: you can see how many kernel launches fit into prefill's 5 ms, and how long each decode step waits on memory.
llama.cpp 是推理引擎,优化得很猛;HuggingFace 的 transformers 是 reference implementation——慢但每一步都裸露在 Python 里,容易插断点、加 print、改公式。如果你想"看着 hidden state 一层一层长大",在 LlamaModel.forward 里改改:
llama.cpp is an inference engine with heavy optimizations. HuggingFace's transformers is the reference implementation: slow, but every step is exposed in Python, so it is easy to set breakpoints, add prints, and change formulas. To "watch hidden state grow layer by layer", patch LlamaModel.forward:
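A less invasive alternative to editing `LlamaModel.forward` is to ask transformers for every layer's output via `output_hidden_states=True`. The sketch below assumes `torch` and `transformers` are installed. To stay self-contained it builds a tiny, randomly initialised Llama (the `tiny_llama` helper and its sizes are hypothetical) rather than downloading real weights, so the numbers it prints are meaningless; only the mechanics carry over to a real checkpoint.

```python
import torch
from transformers import LlamaConfig, LlamaModel


def tiny_llama(n_layers: int = 2) -> LlamaModel:
    """Build a small random-weight Llama so the example runs without a download."""
    config = LlamaConfig(
        vocab_size=128,
        hidden_size=64,
        intermediate_size=128,
        num_hidden_layers=n_layers,
        num_attention_heads=4,
        num_key_value_heads=4,
        max_position_embeddings=64,
    )
    return LlamaModel(config)


def layer_norms(model: LlamaModel, input_ids: torch.Tensor):
    """Return (index, shape, mean L2 norm) for the embeddings and each layer output."""
    model.eval()
    with torch.no_grad():
        out = model(input_ids=input_ids, output_hidden_states=True)
    rows = []
    # hidden_states[0] is the embedding output; entries 1..n are per-layer outputs.
    for i, h in enumerate(out.hidden_states):
        rows.append((i, tuple(h.shape), h.float().norm(dim=-1).mean().item()))
    return rows


if __name__ == "__main__":
    model = tiny_llama(n_layers=2)
    ids = torch.tensor([[1, 2, 3, 4, 5]])  # five arbitrary token ids
    for idx, shape, norm in layer_norms(model, ids):
        print(f"layer {idx}: shape={shape} mean-norm={norm:.3f}")
```

With a real checkpoint you would load the model with `from_pretrained` and pass tokenizer output as `input_ids`; the per-layer loop stays the same.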
"这个 Q4_K_M 量化跟 fp16 比损失多少?" 这种问题用 llama.cpp 自带的 llama-perplexity 工具直接量:
"How much does Q4_K_M quantization cost compared to fp16?" Measure it directly with llama.cpp's bundled llama-perplexity:
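The number such a tool reports is perplexity, the exponential of the mean negative log-likelihood per token: $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_i \log p(x_i \mid x_{<i})\right)$. The sketch below does not invoke llama-perplexity; it only shows how the metric and the fp16-versus-quantized gap are computed once you have per-token log-probabilities. The function names and the sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def quantization_cost(fp16_logprobs: Sequence[float],
                      quant_logprobs: Sequence[float]) -> Tuple[float, float]:
    """Return (absolute, relative) perplexity increase of the quantized model."""
    base = perplexity(fp16_logprobs)
    quant = perplexity(quant_logprobs)
    return quant - base, quant / base - 1.0


if __name__ == "__main__":
    # Made-up log-probs for the same four tokens under two model variants.
    fp16 = [math.log(0.50), math.log(0.25), math.log(0.50), math.log(0.25)]
    q4 = [math.log(0.45), math.log(0.25), math.log(0.50), math.log(0.20)]
    delta, rel = quantization_cost(fp16, q4)
    print(f"fp16 ppl={perplexity(fp16):.3f}  q4 ppl={perplexity(q4):.3f}  "
          f"delta={delta:.3f} ({rel:.1%})")
```

Both variants must be scored on the same text with the same tokenization and context length, otherwise the difference measures the evaluation setup, not the quantization.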
每一章我都做了同一件事:挑一段 llama.cpp 源码,把它放进上下文,逐行解释为什么这么写。这套手法在所有系统类技术写作里都通用——比 paper 易读、比 tutorial 深入、比 talk 可重放。如果你也想写这种,流程大概是:
Every chapter did the same thing: pull a piece of real llama.cpp source into the page, then explain line by line why it is written that way. The recipe generalizes to any systems-oriented technical writing: more readable than a paper, deeper than a tutorial, and replayable in a way a talk is not. If you want to write in this style yourself, the process is roughly: