FIELD NOTE / 06 网络协议 Network Protocols 2026

一次请求的一生

A Request,
end to end.

一个 GET 请求要在 UDP 之上跑完 13 道协议工序、跨 4 个加密级、穿过 3 类流,才能让你看到一个 200 OK——然后连接还要走完关闭、排空、复活三种结局。
这是 HTTP/3 与 QUIC 的全景手册,每一步都标出对应的 RFC 条款。

A single GET has to walk thirteen protocol stages on top of UDP, through four cryptographic levels and three stream classes, before it can land a single 200 OK — and even then the connection still faces one of three endings: close, drain, or revive.
This is a field map of HTTP/3 and QUIC, with every step pinned to the relevant RFC clause.

协议流水线 · 24 章 · 4 段 Protocol pipeline · 24 chapters · 4 acts
I · 为什么是 H3 Why H3
II · 传输层 · QUIC Transport · QUIC
III · HTTP/3 与生命周期 HTTP/3 & lifecycle
IV · 现状 · 决策 State of the world
CHAPTER 01

三个公式 — HTTP 到底是什么

Three formulas — what is HTTP, really?

三个公式,一具协议骨骼

three formulas, one protocol skeleton

"HTTP" 在大多数人嘴里是一种东西——一个能让浏览器去网站取页面的协议。但工程师如果还把它当成一种东西,就永远理解不了为什么会有 HTTP/3。HTTP 从来不是一个协议,它是三个正交层的乘积。

To most people, "HTTP" is a thing — the protocol your browser uses to fetch a page. Engineers who keep thinking of it as one thing will never understand why HTTP/3 exists. HTTP has never been one protocol; it has always been the product of three orthogonal layers.

公式 1 / FORMULA 1
HTTP     = Semantics + Framing + Transport
HTTP/1.1 = RFC 9110 + RFC 9112 (ASCII)  + TCP
HTTP/2   = RFC 9110 + RFC 9113 (binary) + TCP + TLS
HTTP/3   = RFC 9110 + RFC 9114 (binary) + QUIC (UDP)
推论:HTTP/1→2→3 只换了下面两层,语义没动。
Implication: HTTP/1→2→3 only swapped the bottom two layers. The semantics never moved.
公式 2 / FORMULA 2
QUIC = UDP + TLS 1.3 + Loss recovery + Congestion control + Streams
推论:QUIC 不是 "TCP over UDP"。它把 TCP 整个塞进了用户态,并把 TLS 写进了协议本身。
Implication: QUIC is not "TCP over UDP". It is TCP rewritten in user space, with TLS welded into the protocol itself.
公式 3 / FORMULA 3
HTTP/3 = QUIC streams + QPACK + H3 framing
推论:HTTP/3 拿掉了 HTTP/2 里所有"为了 TCP 而做"的复杂度(优先级树、PUSH_PROMISE、HPACK 严序依赖)。剩下的东西很薄。
Implication: HTTP/3 strips out everything HTTP/2 only had to do because of TCP — priority trees, PUSH_PROMISE, HPACK's strict ordering. What's left is thin.

三层「骨骼」对照

HTTP anatomy at a glance

版本 Version | Semantics | Framing | Transport
HTTP/0.9 (1991) | GET only | — | TCP
HTTP/1.0 (1996, RFC 1945) | headers, methods | ASCII, 1 req / conn | TCP
HTTP/1.1 (1997-2022, RFC 9112) | 同上 + chunked same + chunked | ASCII, keepalive, pipelining | TCP (+ TLS)
HTTP/2 (2015-2022, RFC 9113) | RFC 9110 | 二进制 · 多路复用 · HPACK binary · mux · HPACK | TCP + TLS 1.2/1.3
HTTP/3 (2022, RFC 9114) | RFC 9110 | 二进制 · 简化 · QPACK binary · simpler · QPACK | QUIC (UDP + TLS 1.3)
FIELD NOTE 公式 1 是这篇文章的真正主语
所有"为什么 HTTP/3 这么设计"的问题,答案都是"因为它在同一份语义下,把 Framing 和 Transport 都换了"。
Formula 1 is the real subject of this essay.
Every "why does HTTP/3 do it this way" question collapses into "because the semantics stayed the same, while Framing and Transport both got replaced".
CHAPTER 02

家谱 — 三十年 HTTP 演进

Family tree — 30 years of HTTP

从 Tim Berners-Lee 的一行 GET 到 Cloudflare 的 50% 全网流量

from Tim Berners-Lee's first GET to Cloudflare's 50% global traffic

HTTP/3 不是凭空出现的。它是 30 年技术堆栈一次次试错的产物:从 HTTP/0.9 的一行 GET /,到 SPDY 的实验,到 HTTP/2 的"二进制化",再到 QUIC 把 TCP 整个搬进用户态。每一步都少做了一个假设。

HTTP/3 didn't appear from nowhere. It's the product of thirty years of trial and error: HTTP/0.9's one-line GET /, SPDY's experiments, HTTP/2's binary framing, finally QUIC dragging TCP into user space. Each step drops one assumption.

[图:时间线 — HTTP/0.9 (1991 · TBL) → HTTP/1.0 (RFC 1945, 1996) → HTTP/1.1 (RFC 2616 → 9112) → SPDY (Google, 2009) → HTTP/2 (RFC 7540 · 9113, 2015);gQUIC (2012 · Roskind) → IETF QUIC (2016 WG) → QUIC v1 (RFC 9000 · 2021-05) → HTTP/3 (RFC 9114 · 2022-06);TLS 1.2 (RFC 5246 · 2008) → TLS 1.3 (RFC 8446 · 2018)]
FIG 02·1 HTTP 协议家谱 · 1991 → 2026 · 三条主线(semantics / framing / transport)的交错演进。 Fig 02·1 · HTTP family tree, 1991 → 2026 · three lines (semantics / framing / transport) braiding through 30 years.

关键节点

Key milestones

年份 Year | 事件 Event | 关键人物 / 文档 Person / Doc
1991 | HTTP/0.9 — 一行 GET / single-line GET / | Tim Berners-Lee · CERN
1996 | HTTP/1.0 · RFC 1945 | Henrik Frystyk Nielsen · W3C
1997 | HTTP/1.1 · RFC 2068 → 2616 (1999) → 7230 (2014) → 9112 (2022) | Roy Fielding · UCI
2008 | TLS 1.2 · RFC 5246 | Tim Dierks · Eric Rescorla
2009 | SPDY 在 Chrome 实验 experimental in Chrome | Mike Belshe · Roberto Peon · Google
2012 | gQUIC 在 Google 内部 internal at Google | Jim Roskind
2015 | HTTP/2 · RFC 7540 | Mark Nottingham · Martin Thomson
2016 | IETF QUIC WG 成立 chartered | Mark Nottingham · Lars Eggert
2018 | TLS 1.3 · RFC 8446 | Eric Rescorla · Mozilla
2018-11 | "HTTP/3" 正式命名 name finalised | Mark Nottingham · IETF 103
2021-05 | RFC 9000/9001/9002 · QUIC v1 | Iyengar · Thomson · Bishop · Pardue
2022-06 | RFC 9114 · HTTP/3 | Mike Bishop · Akamai
2022-06 | RFC 9204 · QPACK | Charles 'Buck' Krasic · Mike Bishop · Alan Frindell
2023 | RFC 9460 · HTTPS RR (SVCB) | Ben Schwartz · Mike Bishop · Erik Nygren
2023-05 | QUIC v2 · RFC 9369 · 包类型编码变更,反僵化 type-code re-shuffle, anti-ossification | Martin Duke
TRIVIA HTTP/3 命名差点叫 "HTTP over QUIC"。2018 年 11 月在 IETF 103 Bangkok 会上,Mark Nottingham 一句"为什么不直接叫 HTTP/3"被点头通过——这意味着 IETF 第一次公开承认 transport 选择是 HTTP 版本号的一部分。 HTTP/3 was almost called "HTTP over QUIC". At IETF 103 in Bangkok (Nov 2018), Mark Nottingham casually asked "why not just HTTP/3" — the room nodded. That was IETF's first public admission that transport choice is part of HTTP's version number.
CHAPTER 03

HTTP/2 的死结 — TCP 的三宗罪

HTTP/2's deadlock — TCP's three sins

为什么花了七年发现 HTTP/2 还不够

why it took seven years to find out HTTP/2 wasn't enough

2015 年 HTTP/2 发布的时候,大家以为 HTTP 终于"完工"了。它把 ASCII 换成了二进制,把 6 条 TCP 连接压成 1 条,把头部用 HPACK 压掉约 95%。结果跑了三年实战,工程师们发现 HTTP/2 留下了三个根本治不好的问题——而且都不是 HTTP/2 的错。是 TCP 的错。

When HTTP/2 shipped in 2015, everyone thought HTTP was finally "done". It swapped ASCII for binary, collapsed 6 TCP connections into 1, compressed headers ~95% with HPACK. Three years of production later, engineers found that HTTP/2 left three diseases that couldn't be cured — and none of them were HTTP/2's fault. They were TCP's fault.

三宗罪 · The three sins

The three sins

罪一 · TCP HOL
SIN 1 · TCP HOL
"一个包卡死全场""One packet stalls everyone"

HTTP/2 在应用层多路复用 100 个流,但 TCP 在传输层仍然要求按序交付。一个数据包丢了,整条 TCP 连接停下来等重传——即使另外 99 个流毫无关系。这叫 TCP head-of-line blocking

HTTP/2 multiplexes 100 streams at the application layer, but TCP at the transport layer still demands in-order delivery. Drop one packet, the entire TCP connection halts — even if the other 99 streams are unrelated. This is TCP head-of-line blocking.

实测:3% 丢包率下 HTTP/2 经常比 HTTP/1.1 多连接还慢。

Measured: at 3% loss, HTTP/2 often loses to HTTP/1.1 multi-connection.

罪二 · 握手 RTT
SIN 2 · Handshake RTT
"三步走才能开口""Three steps before you speak"

HTTP/2 必须跑在 TLS 上(事实上如此)。一次新连接要:TCP SYN/SYN-ACK/ACK(1 RTT)+ TLS 1.2 ClientHello/ServerHello(2 RTT)= 3 RTT;换成 TLS 1.3 也还要 2 RTT。200ms 的跨洲 RTT 下,开口就花 400~600ms。

HTTP/2 must run over TLS (in practice). A fresh connection needs: TCP SYN/SYN-ACK/ACK (1 RTT) + TLS 1.2 ClientHello/ServerHello (2 RTT) = 3 RTT; even with TLS 1.3 it is still 2 RTT. At 200 ms intercontinental RTT, you spend 400-600 ms before saying a word.

实测:手机 4G/5G 上,握手时间常常超过整个页面的 LCP 预算。

Measured: on 4G/5G, handshake alone often eats the page's entire LCP budget.
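The RTT accounting above is simple enough to write down. A back-of-envelope sketch, with the round-trip counts taken straight from this chapter (deliberately ignoring TCP Fast Open and packet loss):

```python
# Round trips spent before the first HTTP byte can leave, per stack,
# at a 200 ms intercontinental RTT. Numbers are from the text above.

SETUP_RTTS = {
    "TCP + TLS 1.2 (HTTP/2)": 3,  # TCP handshake + 2-RTT TLS 1.2
    "TCP + TLS 1.3 (HTTP/2)": 2,  # TCP handshake + 1-RTT TLS 1.3
    "QUIC fresh (HTTP/3)":    1,  # transport + TLS share one RTT
    "QUIC 0-RTT (HTTP/3)":    0,  # request rides the first flight
}

rtt_ms = 200
for name, n in SETUP_RTTS.items():
    print(f"{name:24s} {n} RTT = {n * rtt_ms} ms before the request")
```

Running it makes the spread concrete: 600 ms vs 0 ms on the same physical path, purely from handshake design.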

罪三 · 连接绑死
SIN 3 · IP-pinned
"Wi-Fi 切 5G 就断""Wi-Fi to 5G ⇒ disconnect"

TCP 连接由 (src_ip, src_port, dst_ip, dst_port) 四元组定义。手机从 Wi-Fi 切到 5G,src_ip 变了——TCP 连接立即报废,TLS 会话也跟着重建。前端 SPA 里那个长连接 WebSocket 就这样断了。

A TCP connection is identified by the 4-tuple (src_ip, src_port, dst_ip, dst_port). When a phone switches Wi-Fi → 5G, src_ip changes — the TCP connection dies on the spot, the TLS session along with it. That long-lived WebSocket inside your SPA? Gone.

实测:Meta 测算 5% 的视频流断流是因为切网。

Measured: Meta attributes ~5% of video stalls to network switches.

罪四(隐藏) · 协议僵化
SIN 4 (hidden) · Ossification
"想加新字段都加不了""You can't add a new field"

中间盒(运营商 NAT、企业防火墙、CDN)对 TCP/TLS 字段有路径上的判断逻辑。RFC 允许的扩展字段到中间盒手里就被丢包。TLS 1.3 当初为此用了"中间盒兼容模式"伪装成 TLS 1.2。HTTP/3 干脆躲到 UDP 里。

Middleboxes — ISP NATs, enterprise firewalls, CDNs — inspect TCP/TLS fields and silently drop anything new. RFC-permitted extensions get blackholed in flight. TLS 1.3 ended up disguising itself as TLS 1.2. HTTP/3 just hides inside UDP.

实测:TLS 1.3 早期遭遇 ~3% 中间盒丢包。

Measured: early TLS 1.3 saw ~3% middlebox drops.

罪一可视化 · TCP HOL vs QUIC 流独立

Visualising Sin 1 · TCP HOL vs QUIC stream independence

[图:5 条流(A–E),第 3 号包(C·1)丢失。左 · HTTP/2 over TCP:单一有序字节流,C·1 重传回来之前 A/B/D/E 全部阻塞(━━ stalled ━━),5 条流为 1 个丢包陪葬;右 · HTTP/3 over QUIC:五条独立流、各自按序,只有 C 流等待,A1 B1 D1 E1 照常交付,4/5 的流不受影响。]
FIG 03·1 5 条流,PN3 丢一次。左:TCP 一根管子按序送,全停;右:QUIC 五条独立流,只 C 流停。 Fig 03·1 · Five streams, one packet (PN3) lost. Left: TCP one ordered pipe, every stream stalls; Right: QUIC five independent streams, only C blocks.
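The asymmetry in Fig 03·1 fits in a few lines. This is a toy model, not a real transport — the function names and the five one-packet streams are invented purely for illustration:

```python
# Toy model of Fig 03·1: five streams, one packet each, and the packet
# carrying stream C's data is lost in flight.

def deliverable_tcp(packets, lost):
    """TCP: one ordered byte stream — nothing after a hole reaches the app."""
    out = []
    for pn, stream in packets:
        if pn in lost:
            break          # hole in the byte stream: everything behind it stalls
        out.append(stream)
    return out

def deliverable_quic(packets, lost):
    """QUIC: ordering is per stream — a hole stalls only its own stream."""
    return [stream for pn, stream in packets if pn not in lost]

packets = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E")]
lost = {3}  # C's packet dropped

print(deliverable_tcp(packets, lost))   # ['A', 'B'] — D and E wait too
print(deliverable_quic(packets, lost))  # ['A', 'B', 'D', 'E'] — only C waits
```

Same loss, same streams; the only difference is where the ordering requirement lives.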
「HTTP/2 把 HTTP 治好了,
但 HTTP/2 自己被 TCP 治残了。」
"HTTP/2 cured HTTP,
and then TCP crippled HTTP/2."
Daniel Stenberg · curl · 2018

为什么不直接改 TCP?

Why not just fix TCP?

这是 IETF 在 2015-2016 年最先想到的方案。但 TCP 是内核态协议——任何字段改动都要等 Linux / Windows / iOS / Android / 每一台路由器升级一遍。看看 TCP Fast Open(RFC 7413, 2014)现状:发布十年了,实际部署率仍然 < 5%,因为中间盒会丢掉它的 cookie。

结论:在 TCP 上演进 = 在十年这个时间尺度上演进。

That was IETF's first instinct in 2015-2016. But TCP lives in the kernel — any field change waits for Linux / Windows / iOS / Android / every router to ship a new version. Look at TCP Fast Open (RFC 7413, 2014): ten years on, deployment is still < 5%, because middleboxes drop its cookie.

Conclusion: evolving on top of TCP means evolving on a decade timescale.

FIELD NOTE · 数字 FIELD NOTE · NUMBERS Google 在 2017 SIGCOMM 论文里给了一个让所有人闭嘴的数字:Google.com 搜索的端到端延迟,gQUIC 比 TCP+TLS 快 8%(中位),慢链路上快 16%(中位)。这两个百分点是 IETF QUIC 工作组成立的直接动力。 Google's SIGCOMM 2017 paper dropped one number that shut the room up: end-to-end latency of Google.com search was 8% faster on gQUIC than TCP+TLS at the median, 16% faster at the slow-link median. Those two percentage points were the direct trigger for the IETF QUIC WG.
CHAPTER 04

为什么是 UDP — 中间盒,僵化,与可部署性

Why UDP — middleboxes, ossification, and deployability

不是因为 UDP 好,是因为 UDP 不被人管

not because UDP is good, but because nobody touches UDP

"为什么 QUIC 跑在 UDP 上?" 这是任何讲 HTTP/3 的人都要回答的第一个问题。直觉答案"UDP 没有可靠传输、所以 QUIC 自己实现可靠"是错的——这是结果,不是原因。真正的原因只有一个:UDP 是当今互联网上仅剩的、中间盒不会乱碰的协议号。

"Why does QUIC run on UDP?" is the first question every HTTP/3 talk has to answer. The intuitive answer — "UDP isn't reliable, so QUIC has to add its own reliability" — is wrong. That's a consequence, not a cause. The real reason is one sentence: UDP is the only protocol number left on the modern internet that middleboxes don't mess with.

候选清单 · The shortlist

The shortlist

选项 Option | 优势 Pros | 为什么不行 Why not
SCTP | 天然多流,按消息边界传输 native multi-stream, message-based | IP protocol 132 — 大多数 NAT 直接丢包,~50% 丢包率 most NATs drop it, ~50% loss
DCCP | 无序但有拥塞控制 unordered with cc | IP protocol 33 — 同上,部署率 < 0.1% same, < 0.1% deployed
新协议号 New IP protocol | 理论最干净 theoretically cleanest | 需要全球每一台路由器/NAT/防火墙升级,不可能 needs every router/NAT/firewall on Earth to upgrade — impossible
TCP option | 复用现有连接 reuse existing conn | 中间盒会清空未知 TCP options middleboxes strip unknown options
UDP | 所有 NAT/防火墙都放行 UDP/443 UDP/443 traverses everywhere | 需要在用户态重造 TCP——但这正是 QUIC 想做的 have to rebuild TCP in user space — but that's exactly what QUIC wants
FIELD NOTE · Ossification FIELD NOTE · Ossification "协议僵化"(protocol ossification)是 2015 年后 IETF 的核心关切。一个协议越成功,就越僵——因为越多中间盒会假设它的字段含义。TCP 已经僵到任何 RFC 改动都要 10 年才能跑通。QUIC 的策略是主动反僵化:从第一天起就加密 packet number、加密 header flags、GREASE 假参数、定期发版本协商——让中间盒"除了 UDP 头和源端口什么都看不见"。 "Protocol ossification" became IETF's main concern after 2015. A protocol becomes more rigid the more successful it gets — because more middleboxes start assuming what its fields mean. TCP is so ossified that any RFC change takes a decade to propagate. QUIC's strategy is active anti-ossification: encrypt packet numbers from day one, encrypt header flags, GREASE fake parameters, ship periodic version negotiation — so middleboxes see nothing but the UDP header and source port.
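One GREASE mechanism is concrete enough to show: RFC 9000 §18.1 reserves transport-parameter identifiers of the form 31 × N + 27, and endpoints advertise a few at random so any implementation that chokes on unknown parameters fails early, before it can ossify. A minimal sketch (helper names are ours):

```python
# GREASE transport parameters per RFC 9000 §18.1: ids of the form
# 31 * N + 27 are reserved, and MUST be ignored by a correct peer.

import random

def is_reserved_tp(ident: int) -> bool:
    """True if `ident` is a reserved (GREASE) transport parameter id."""
    return ident % 31 == 27

def random_grease_tp(rng: random.Random) -> int:
    """Pick a random reserved id to advertise alongside real parameters."""
    return 31 * rng.randrange(0, 2**8) + 27

rng = random.Random(9000)
g = random_grease_tp(rng)
print(g, is_reserved_tp(g))        # some reserved id, True
print(is_reserved_tp(0x20))        # False — a real parameter id
```

A peer that rejects these reserved values is broken by construction, which is exactly the point: breakage surfaces in testing, not ten years later.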

用户态的代价

The user-space cost

把 TCP 的所有功能(重传、拥塞控制、流控、多路复用、连接管理)搬到用户态,意味着每个 QUIC 数据包都要:进内核 → recvfrom() 拷贝到用户态 → 解密 → 处理 → 加密 → sendto() 拷贝回内核 → 网卡。Fastly 2020 年的实测:QUIC 的 CPU 成本是 TCP+TLS 的 ~2 倍。这是 HTTP/3 真正的负面成本,我们会在第 22 章详细讲。

Moving everything TCP did (retransmit, cc, flow control, mux, connection management) into user space means every QUIC packet has to: enter kernel → recvfrom() copy to user space → decrypt → handle → encrypt → sendto() copy back → NIC. Fastly's 2020 measurement: QUIC costs ~2x the CPU of TCP+TLS. That is HTTP/3's real downside, and we will revisit it in chapter 22.

FIELD NOTE · 反讽 FIELD NOTE · Irony UDP 在 1980 年被设计成"最简单的不可靠协议"——只是一层薄薄的端口分发。四十年后,它成了承载世界上一半 web 流量的可靠协议宿主。"最简单"反而是最难僵化的。 UDP was designed in 1980 as "the simplest unreliable protocol" — a thin port demultiplexer. Forty years later, it has become the host of half the world's web traffic — reliably. "Simplest" turns out to mean "hardest to ossify".
CHAPTER 05

QUIC 全景 — 4 加密级 · 3 PN 空间 · 2 类 Header

QUIC at a glance — 4 levels · 3 PN spaces · 2 headers

在钻进每章细节之前,先把骨架记牢

memorise the skeleton before diving into each chapter

QUIC 的设计可以用三个小数字描述:4 个加密级(Initial / 0-RTT / Handshake / 1-RTT)、3 个 Packet Number 空间(Initial / Handshake / Application)、2 类 Header(Long / Short)。这三个数字之间的关系,是后面所有章节的预读骨架。

QUIC's design fits into three small numbers: 4 encryption levels (Initial / 0-RTT / Handshake / 1-RTT), 3 Packet Number spaces (Initial / Handshake / Application), 2 Header types (Long / Short). The relationship between these three numbers is the pre-read skeleton for every later chapter.

协议栈 · The stack

The stack

应用 App
HTTP/3 (RFC 9114) + QPACK (RFC 9204)
传输 Transport
QUIC (RFC 9000–9002)
加密 Crypto
TLS 1.3 (RFC 8446)

↓ UDP/443 · IPv4 / IPv6 · 链路层 link layer

STACK 注意 TLS 1.3 不是在 QUIC 之下而是在 QUIC 内部。QUIC 用 CRYPTO 帧携带 TLS 1.3 的握手消息(TLS 的 record 层根本不用),而不是反过来。这就是为什么 RFC 9001 叫 "Using TLS to Secure QUIC" 而不是 "QUIC over TLS"。 Note that TLS 1.3 is not below QUIC but inside QUIC. QUIC carries TLS 1.3 handshake messages inside CRYPTO frames — the TLS record layer is not used at all — not the other way around. That is why RFC 9001 is titled "Using TLS to Secure QUIC", not "QUIC over TLS".

4 个加密级 · 4 levels

The four encryption levels

Initial
公开 salt + DCID 派生密钥。任何人都能解密——这层的"加密"只是为了反僵化、防止中间盒乱碰。
Keys derived from public salt + DCID. Anyone can decrypt — this "encryption" only exists to fight ossification, to keep middleboxes from poking inside.
0-RTT (Early Data)
用前一次会话恢复的 PSK 派生。只在恢复连接时存在。承担重放风险(见 Ch08)。
Keys derived from a resumed session's PSK. Only exists on connection resumption. Carries replay risk (see Ch08).
Handshake
TLS 1.3 EE/CERT/FIN 完成后派生。真加密开始,但还在握手过程中。
Keys derived after TLS 1.3 EE/CERT/FIN. Real encryption kicks in here — still inside the handshake.
1-RTT (Application)
握手完成后用的主密钥。承担 99% 的数据传输。可以做 key update(密钥滚动)。
The main key after handshake completes. Carries 99% of all data. Supports key update (rotating keys mid-connection).

3 个 PN 空间

The three PN spaces

空间 1 · Initial
Space 1 · Initial
PN_0..N
独立编号,从 0 起
independent, starts at 0
  • CRYPTO (ClientHello)
  • ACK (initial)
  • PADDING (anti-amp)
空间 2 · Handshake
Space 2 · Handshake
PN_0..M
独立编号,从 0 起
independent, starts at 0
  • CRYPTO (EE/CERT/FIN)
  • ACK (handshake)
空间 3 · Application
Space 3 · Application
PN_0..∞
独立编号,从 0 起
independent, starts at 0
  • STREAM, ACK, MAX_DATA …
  • HANDSHAKE_DONE
  • NEW_CONNECTION_ID
为什么三个空间? Why three? 如果 Initial / Handshake / 1-RTT 共用一套 PN,丢包检测就会"看错"——你不知道是 Initial 包丢了还是 1-RTT 包丢了,因为它们已经被你的内核乱序处理。三个独立空间 = 三套独立的 ACK 状态 = 没有"跨级"的 head-of-line blocking。这是 QUIC 比 TCP+TLS 干净的根源之一。 If Initial / Handshake / 1-RTT shared one PN, loss detection would "guess wrong" — you can't tell whether an Initial packet was lost or a 1-RTT one, because the kernel may have reordered them. Three independent spaces = three independent ACK clocks = no cross-level head-of-line blocking. This is one of the reasons QUIC is structurally cleaner than TCP+TLS.
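Within each space the wire carries only the low 8/16/24/32 bits of the packet number; the receiver reconstructs the full value relative to the largest PN acknowledged in that same space. A direct transcription of the pseudocode in RFC 9000 Appendix A:

```python
# Packet-number reconstruction, RFC 9000 Appendix A: pick the candidate
# full PN closest to (largest acked in this space) + 1.

def decode_packet_number(largest_pn: int, truncated_pn: int, pn_nbits: int) -> int:
    expected_pn = largest_pn + 1
    pn_win = 1 << pn_nbits          # window covered by the truncated bits
    pn_hwin = pn_win // 2
    pn_mask = pn_win - 1
    candidate_pn = (expected_pn & ~pn_mask) | truncated_pn
    if candidate_pn <= expected_pn - pn_hwin and candidate_pn < (1 << 62) - pn_win:
        return candidate_pn + pn_win
    if candidate_pn > expected_pn + pn_hwin and candidate_pn >= pn_win:
        return candidate_pn - pn_win
    return candidate_pn

# The worked example from RFC 9000 §A.3:
print(hex(decode_packet_number(0xa82f30ea, 0x9b32, 16)))  # 0xa82f9b32
```

Because `largest_pn` is tracked per space, the same truncated bits decode correctly in each of the three spaces independently.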

2 类 Header

The two header types

Long Header · 4 种
Long Header · 4 forms
握手期使用
used during handshake

字段:Version(32) · DCID Len(8) · DCID · SCID Len(8) · SCID · Type-specific...

Fields: Version(32) · DCID Len(8) · DCID · SCID Len(8) · SCID · Type-specific...

Initial · 0-RTT · Handshake · Retry

Short Header · 1 种
Short Header · 1 form
握手后使用(99% 流量)
post-handshake (99% of traffic)

字段:Flags(8) · DCID · PN(8/16/24/32)

Fields: Flags(8) · DCID · PN(8/16/24/32)

1-RTT only

FIELD NOTE · 反僵化探针 FIELD NOTE · Anti-ossification probe QUIC v2(RFC 9369)故意换掉了版本号、长包类型编码和密钥派生参数——目的就是检测中间盒是否在做"它不该做的" QUIC v1 解析。如果中间盒按 v1 的编码去解析 v2 包,会立刻出错。这是主动反僵化策略落到字节级的体现。 QUIC v2 (RFC 9369) deliberately changed the version number, the long-header packet-type codes and the key-derivation parameters — to catch middleboxes doing things they shouldn't with QUIC v1 parsing. A middlebox that applies v1 assumptions to a v2 packet breaks immediately. This is active anti-ossification realised at the byte level.
MAIN LINE · THE REQUEST

一次 GET ursb.me 的一生 — 字节级生命周期

The life of one GET ursb.me — a byte-level lifecycle

从 DNS 查询到 200 OK 到连接关闭 · 每一步都标 RFC §

from DNS query to 200 OK to connection close · every step pinned to its RFC §

接下来 14 章流水线都用同一条请求把它们串起来——在 Chrome 地址栏输入 https://ursb.me,按回车。我们跟着这次请求的字节流走完它的一生:DNS 解析、初次握手、传输请求、收到响应、连接闲置、网络切换、最后优雅关闭——一共 10 个阶段。每章都有一个 "◇ 在我们的 GET 请求里" 卡片,告诉你这一章的输入、变换、输出分别是什么。

这条主线的角色清单是:

The next 14 pipeline chapters all hang off one request: type https://ursb.me in Chrome, press Enter. We follow this request's byte stream through its full life: DNS query, first handshake, request payload, response, idle, network switch, graceful close — 10 phases. Every chapter below carries a "◇ In our GET request" card showing input, transform, output at that stage.

The cast on this main line:

角色清单 · The setup

The setup

// what the user typed
URL    = "https://ursb.me/"
Method = GET                   // idempotent → 0-RTT eligible

// client
Browser  = "Chrome 134 on macOS"
Library  = "google/quiche (C++)"
src_ip   = 192.168.1.42        // Wi-Fi at T+0
src_port = 52341               // ephemeral

// server
Origin   = "ursb.me"
Stack    = "Cloudflare quiche + nginx 1.26"
dst_ip   = 39.105.102.252
dst_port = 443                 // UDP — not TCP

// network
RTT      = 40 ms               // home Wi-Fi → BJ aliyun
Loss     = ~1.5%               // peak hour
Path MTU = 1500

// stored state from prior visit
PSK ticket = "valid · age=2h"  // 0-RTT eligible

10 个阶段全景

All 10 phases at a glance

Client ⇄ Server
T+0      Phase 0 · DNS — HTTPS RR · DoH/DoQ · RFC 9460 + 8484/9250
T+5ms    Phase 1 · Initial[CH + 0-RTT[GET /]] — 1228 B · padded ≥ 1200 · RFC 9000 §17.2, §14.1
T+25ms   Phase 2 · Initial[SH] + Handshake[EE,Cert,CV,FIN] — ~2900 B (cert chain + TP) · RFC 9001 §4
T+45ms   Phase 3 · 1-RTT[STREAM 0: 200 OK + HTML] — 3200 B body · QPACK 5 B header · RFC 9114 §7
T+65ms   Phase 4 · 1-RTT[FIN + ACK + HANDSHAKE_DONE ACK] — stream 0 closed · RFC 9000 §19.8
…        Phase 5 · idle · keep-alive PING every ~25s · RFC 9000 §10.1.2
T+8min   Phase 6 · PATH_CHALLENGE / RESPONSE (Wi-Fi → 5G) — new src_ip · same CID · RFC 9000 §9
T+15min  Phase 7-8 · GOAWAY → CONNECTION_CLOSE → drain (3 PTO) — RFC 9114 §5.2 · RFC 9000 §10
FIG cmain·1 主线 10 阶段总时序 · 颜色编码:蓝=客户端,紫=服务端加密层,绿=数据交付,铜=客户端响应,黄=关闭。 Fig cmain·1 · Full main-line timeline · colour code: blue = client, purple = server crypto, green = data delivery, copper = client response, amber = close.

阶段 0 · DNS 解析(pre-flight)

Phase 0 · DNS resolution (pre-flight)

Chrome 不会直接发 QUIC 包——它先要问 DNS:ursb.me 在哪?支持哪些 ALPN? 这里 Chrome 用 DoH(DNS over HTTPS,RFC 8484)向 1.1.1.1 查询,请求里同时问 A(IPv4)和 HTTPS(RFC 9460)两种 RR——后者一行就能拿到 ALPN 列表 + IP hint,省一个 RTT。

Chrome can't fire a QUIC packet yet — it needs DNS first: where's ursb.me? Which ALPNs does it speak? Chrome queries 1.1.1.1 over DoH (RFC 8484), asking simultaneously for A (IPv4) and the new HTTPS RR (RFC 9460). The latter returns ALPN + IP hint in one record, saving an RTT.

INPUT
URL bar string"https://ursb.me/"
OUTPUT (DNS)
DNS RR setA → 39.105.102.252
HTTPS 1 . alpn="h3,h2"
DoH wire · POST /dns-query · application/dns-messageRFC 8484 + 9460
; Question section:
QNAME  = ursb.me.
QTYPE  = 65     ; HTTPS (SVCB-compatible)
QCLASS = 1      ; IN

; Answer — HTTPS RR
ursb.me. 300 IN HTTPS 1 . \
    alpn="h3,h2"               ; ← tells the browser H3 is OK
    ipv4hint="39.105.102.252"  ; ← skip a separate A query
    port=443
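What actually travels inside that DoH POST body is an ordinary binary DNS query with QTYPE 65. A stdlib-only sketch of building those bytes — the helper name is ours; a real client would then POST them with content type application/dns-message (RFC 8484):

```python
# Hand-rolled DNS query for an HTTPS (type 65) record — the payload a
# DoH client would POST as application/dns-message.

import struct

def build_query(name: str, qtype: int, txid: int = 0) -> bytes:
    # Header: id, flags (RD=1), QDCOUNT=1, AN/NS/AR = 0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS=IN

wire = build_query("ursb.me", 65)  # 65 = HTTPS RR (RFC 9460)
print(wire.hex())
```

RFC 8484 suggests a zero transaction id for cache-friendliness, which is why `txid` defaults to 0 here.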

阶段 1 · 第一个 UDP 包出发

Phase 1 · The first UDP datagram leaves

DNS 回包 5 ms 后,Chrome 拼出第一个真正的 QUIC 包。因为有上次会话的 PSK ticket,这次走 0-RTT:ClientHello 和 GET 请求一起放进同一个 UDP 数据报

5 ms after the DNS response, Chrome assembles the first actual QUIC packet. Because we have a PSK ticket from last visit, this is a 0-RTT send: ClientHello and the GET request ride in the same UDP datagram.

INPUT
empty TLS state + PSK"send a ClientHello and a GET"
OUTPUT (wire)
1228-byte UDP datagramInitial[CRYPTO:CH+PSK]
+ 0-RTT[STREAM 0: GET]
+ PADDING
UDP/443 · the first packet on the wiretcpdump · 1228 B total
; IP/UDP headers (28 B)
IPv4: src=192.168.1.42  dst=39.105.102.252  proto=17
UDP:  src=52341  dst=443  len=1200

; Coalesced QUIC datagram (1200 B) — TWO packets, ONE UDP send
[Initial packet, PN_Initial=0] (~520 B)
  CRYPTO[0..516]: TLS 1.3 ClientHello {
    SNI            = "ursb.me"       ; (or ECH-encrypted)
    ALPN           = ["h3"]
    key_share      = x25519:0x9c4f…
    pre_shared_key = [ticket from last visit]
    early_data     = ext_42          ; "I will send 0-RTT"
    quic_transport_parameters = {    ; RFC 9000 §18
      initial_max_data         = 10_485_760  ; 10 MB
      initial_max_streams_bidi = 100
      initial_max_streams_uni  = 100
      max_idle_timeout         = 30_000      ; 30 s
      disable_active_migration = false
    }
  }

[0-RTT packet, PN_Application=0] (~680 B)
  STREAM sid=0, off=0, fin=true:
    HEADERS frame (QPACK static-indexed, see Ch15)
      :method=GET  :scheme=https  :authority=ursb.me  :path=/
      accept=text/html,*/*  user-agent=Mozilla/5.0…
      priority=u=0,i                 ; RFC 9218

PADDING × N                          ; pad to 1200 anti-amp floor

阶段 2 · 服务器握手响应

Phase 2 · Server's handshake response

20 ms 后第一个回程包到达。这是多包合并(coalesced datagram)的典型场景:服务器在同一个 UDP 数据报里塞了 Initial、Handshake、1-RTT 三种包,分别承载握手不同阶段的 CRYPTO 帧和首批数据。

20 ms later the first server datagram arrives. This is a classic coalesced case: the server packs Initial, Handshake and 1-RTT packets all into one UDP datagram, carrying CRYPTO frames for different handshake stages plus the first batch of response data.

INPUT (server received)
our 1228-byte datagramverifies PSK · accepts 0-RTT
OUTPUT (server → client)
~2900-byte coalesced replyInitial[SH]
+ Handshake[EE,Cert,CV,FIN]
+ 1-RTT[200 OK headers]
server → client · coalesced UDP datagramRFC 9001 §4 · RFC 9000 §12.2
[Initial packet, PN_Initial=0] (~80 B)
  ACK [0]                       ; ack the client's Initial
  CRYPTO[0..]: TLS ServerHello {
    selected_psk = 0
    key_share    = x25519:0xa1b2…
  }

[Handshake packet, PN_Handshake=0] (~2600 B)
  CRYPTO[0..]:
    EncryptedExtensions { early_data=accept, alpn=h3, TP=… }
    Certificate         { cert + intermediates · ~1700 B }
    CertificateVerify   { sig over transcript }
    Finished            { mac }

[1-RTT packet, PN_Application=0] (~250 B)
  HANDSHAKE_DONE                ; RFC 9000 §19.20 — strictly sent only after
                                ; the client's Finished arrives; shown in this
                                ; flight for brevity
  NEW_CONNECTION_ID seq=1       ; pre-stock 1 spare CID for migration
  STREAM sid=0, off=0:          ; can answer the 0-RTT GET already
    HEADERS (QPACK · 7 B) → :status=200, content-type=text/html

阶段 3 · 200 OK + 正文到达

Phase 3 · 200 OK + body arrives

前面那个 Handshake 包确认完后,Chrome 在 ~45 ms 收到完整正文。3200 字节的 HTML 通过同一个 Stream 0 的 DATA 帧分两个 1-RTT 包送到——这就是 0-RTT 的胜利:用户看到 200 OK 时握手还没完全结束

A few packets later, the complete body lands by ~45 ms. The 3200-byte HTML rides Stream 0 in two DATA frames spread across 1-RTT packets. The 0-RTT win is concrete here: the user sees 200 OK before the handshake is fully closed.

INPUT
HEADERS only:status=200 · content-type · content-length=3200
OUTPUT
full HTML to renderer3200 bytes body → into Chromium Loading stage
(see Field Note 02 Ch00 Loading)

阶段 4 · FIN + 收尾 ACK

Phase 4 · FIN + final ACK

Chrome 收完 3200 字节后在 STREAM 帧上看到 FIN=1,知道服务器不会再发了。客户端回一个空 STREAM(带 FIN)关闭自己的方向——这是双向流的半关闭语义。同时回一个 HANDSHAKE_DONE 的 ACK,让服务器知道可以丢掉 Handshake 密钥。

Once Chrome receives the 3200 bytes, the STREAM frame carries FIN=1 — no more data this direction. The client replies with an empty STREAM(FIN) to close its direction — bidirectional half-close semantics. It also ACKs HANDSHAKE_DONE, allowing the server to drop the Handshake keys.

INPUT (server sent)
STREAM[0] body + FIN
OUTPUT (client sent)
ACK + STREAM[0] FINstream 0 fully closed · resources freed

阶段 5 · 闲置与 PING 保活

Phase 5 · Idle & PING keep-alive

连接没有立刻关——Chrome 默认会保留它 30 秒,等下一个请求(CSS、图片、API 调用)复用。期间双方按需发 PING 帧(RFC 9000 §19.2)防 NAT 表项过期。max_idle_timeout 在 TP 里协商出来——min(client 30s, server 30s) = 30s。

The connection doesn't close immediately — Chrome holds it for 30 s, hoping the next request (CSS, images, an API call) reuses it. Either side may send PING frames (RFC 9000 §19.2) to keep NAT mappings alive. max_idle_timeout was negotiated in TP — min(client 30s, server 30s) = 30s.

INPUT
no app datastream 0 closed, others ready
OUTPUT
PING every ~25scwnd/RTT stat kept warm
NAT mapping refreshed
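The 30 s figure is not magic. Per RFC 9000 §10.1, each endpoint advertises its own max_idle_timeout transport parameter (0 meaning "I impose no idle timeout"), and the effective timeout is the minimum of the non-zero advertised values. A tiny sketch of that rule:

```python
# Effective idle timeout per RFC 9000 §10.1: min of the non-zero
# advertised values; if both sides advertise 0, the connection never
# idles out (only PTO/keep-alive logic applies).

def effective_idle_timeout(client_ms, server_ms):
    advertised = [t for t in (client_ms, server_ms) if t > 0]
    return min(advertised) if advertised else None

print(effective_idle_timeout(30_000, 30_000))  # 30000 — our main line
print(effective_idle_timeout(30_000, 0))       # 30000 — server imposes none
print(effective_idle_timeout(0, 0))            # None  — no idle timeout at all
```

This is also why the ~25 s PING cadence is chosen just under the 30 s result: refresh the NAT binding before either side's timer fires.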

阶段 6 · 切网迁移

Phase 6 · Network migration

8 分钟后用户走出咖啡馆,手机切到 5G——src_ip192.168.1.42 变成 10.220.5.13。Chrome 启用预存的备用 CID,服务器看到陌生 IP + 合法 CID 立刻发 PATH_CHALLENGE。一次 RTT 内完成路径验证,连接没断

8 minutes later the user walks out of the café and the phone switches to 5G — src_ip flips from 192.168.1.42 to 10.220.5.13. Chrome activates the pre-stocked spare CID; the server sees a new IP with a valid CID and fires PATH_CHALLENGE. Path validated in one RTT; the connection survives.

INPUT
old path = Wi-Fisrc=192.168.1.42:52341
OUTPUT
new path = 5Gsrc=10.220.5.13:34188
DCID rotated · same crypto state

阶段 7-8 · 优雅关闭

Phase 7-8 · Graceful close

15 分钟后服务器决定下线这个连接(也可能是版本升级、负载均衡、配额到期),发 GOAWAY(H3 帧 0x07)告诉客户端"我不再接受新流,但已开的流我处理完"。等所有未完成的流结束后,发 CONNECTION_CLOSE(QUIC 帧 0x1c)正式结束连接。然后进入 draining 状态 3 PTO,等任何延迟的包不再处理——避免和"新连接"混淆。详见 Ch19

15 minutes in, the server decides to retire this connection (rolling deploy, load-balance, quota expiry). It sends GOAWAY (H3 frame 0x07): "I'll finish in-flight streams but accept no new ones." After the last stream is done, it sends CONNECTION_CLOSE (QUIC frame 0x1c). The server then enters draining state for 3 PTO, ignoring any late packets to avoid confusion with a "new" connection. See Ch19.

INPUT
connection still alivestreams 4, 8, 12 running
OUTPUT
GOAWAY → drain → CLOSEstreams complete
3 PTO drain · then closed

阶段 9 · Stateless Reset(备选结局)

Phase 9 · Stateless Reset (alternate ending)

如果服务器进程意外重启(OOM、crash、容器升级),客户端发的下一个 1-RTT 包会让新进程找不到对应的连接上下文。新进程不能用 CONNECTION_CLOSE(没密钥也没握手状态),只能发一个 Stateless Reset——一段看起来像随机 UDP 数据但末尾带 16 字节 reset token(在阶段 2 的 NEW_CONNECTION_ID 里预发过)的包。客户端识别 token 后才能安全地说"对方真的丢状态了",然后销毁本地连接。这是 RFC 9000 §10.3 给出的无状态恢复路径

If the server process unexpectedly restarts (OOM, crash, container upgrade), the next 1-RTT packet from the client finds the new process without any matching connection state. The new process can't send CONNECTION_CLOSE (no keys, no state). Instead it emits a Stateless Reset: a packet that looks like random UDP bytes but ends in the 16-byte reset token the original server pre-distributed via NEW_CONNECTION_ID in Phase 2. Only the client can recognise the token — and only then can it safely conclude "peer really lost state" and tear down locally. This is the stateless-recovery path of RFC 9000 §10.3.

INPUT
server restartedno conn state · cannot decrypt
OUTPUT
Stateless Resetrandom-looking 21+ B
tail = reset_token from §18.2
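Detection on the client side is deliberately dumb: compare the last 16 bytes of an otherwise-undecryptable datagram against each reset token learned from NEW_CONNECTION_ID, in constant time (RFC 9000 §10.3.1). A sketch with an invented token value:

```python
# Stateless Reset detection, RFC 9000 §10.3.1: the token rides in the
# LAST 16 bytes of the datagram; compare in constant time so the check
# itself doesn't leak the token through timing.

import hmac

def is_stateless_reset(datagram: bytes, token: bytes) -> bool:
    if len(datagram) < 21:     # minimum: 5 header-like bytes + 16-byte token
        return False
    return hmac.compare_digest(datagram[-16:], token)

token = bytes.fromhex("6e1e9b2a44c0a7d5330d06e3f11a52b7")  # invented example
noise = b"\x40" + bytes(10)    # the rest just has to look like a short header

print(is_stateless_reset(noise + token, token))      # True  → tear down locally
print(is_stateless_reset(noise + bytes(16), token))  # False → just garbage
```

Putting the token at the tail is what lets the reset packet masquerade as a normal short-header packet to everyone except the one client holding the token.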

阶段 → 章节 路线图

Phase → chapter roadmap

每一章下面都会有一个 "◇ 在我们的 GET 请求里" 卡片,把这一章的输入/动作/输出对应到上面 10 个阶段。下面这张表先把对应关系列清楚——按这个顺序读:

Each chapter below carries a "◇ In our GET request" card that anchors its input / action / output to the 10 phases above. Use this table as the reading map:

主线阶段 Main-line phase | 深入章节 Drill-down chapter | RFC §
Phase 0 · DNS | Ch22 Field work · Ch04 UDP | 9460 · 8484 · 9250
Phase 1 · Initial out | Ch06 UDP Datagram · Ch08 0-RTT | 9000 §17.2 · 9001 §4
Phase 2 · Server crypto | Ch07 Handshake · Ch09 Crypto layers | 9001 §4-§5 · 8446 §4
Phase 3 · 200 OK | Ch14 H3 frames · Ch15 QPACK | 9114 §7 · 9204
Phase 4 · FIN | Ch11 Streams · Ch12 Loss | 9000 §3 (states) · §19.8
Phase 5 · idle | Ch13 Congestion | 9000 §10.1 · §19.2 PING
Phase 6 · migration | Ch17 Migration | 9000 §9
Phase 7-8 · close | Ch19 Lifecycle (new) | 9114 §5.2 · 9000 §10
Phase 9 · stateless reset | Ch19 Lifecycle (new) | 9000 §10.3 · §18.2
"DNS 解析 5 ms · 握手 + 0-RTT GET 25 ms · 收到 200 OK 45 ms ·
15 分钟后优雅关闭。
整个过程 50% 的时间花在加密,30% 在等光速。"
"DNS in 5 ms · handshake + 0-RTT GET in 25 ms · 200 OK at 45 ms ·
gracefully closed 15 minutes later.
Half the time was in crypto, a third in waiting on the speed of light."
主线 · 阶段总览 main-line · phase summary
CHAPTER 06

UDP Datagram — 包在线上长什么样

UDP Datagram — what the packet looks like

字节级的 QUIC 包结构

QUIC packet structure, byte by byte

在主线里
In our request
T+5ms
线程 / 层
Layer
QUIC / UDP
RFC
9000 §17
输入 → 输出
In → Out
bytes → UDP payload

主线时间 T+5ms:你的 Chrome 浏览器把一段内容还很少的 ClientHello(TLS 1.3)包成一个 UDP 包,从源端口 52341 发到目标 39.105.102.252:443。这一节我们把这个包按字节拆开。

Main-line time T+5ms: Chrome wraps a still-mostly-empty TLS 1.3 ClientHello into one UDP datagram, sent from source port 52341 to destination 39.105.102.252:443. This chapter takes that packet apart byte by byte.

◇ 在我们的 GET 请求里 · 主线阶段 1◇ In our GET request · Main-line phase 1

输入 / INPUT
INPUT
TLS 1.3 ClientHello~500 B 加密握手载荷~500 B encrypted handshake payload
输出 / OUTPUT
OUTPUT
1228-byte UDP datagramIP(28B) + UDP(8B) + QUIC header + Initial packet (PN=0) + 0-RTT(STREAM 0:GET) + PADDING

两类 Header

The two header types

FIG 06·1 · QUIC Long Header(握手期)
FIG 06·1 · QUIC Long Header (handshake)
Long Header(握手期 handshake):
  byte 0      Flags (1 B)
              · Header form: 1 bit — 1 = Long
              · Fixed bit: 1 bit — always 1
              · Type: 2 bits — 00=Initial · 01=0-RTT · 10=Handshake · 11=Retry
              · Type-specific: 2 bits · PN length: 2 bits — 1/2/3/4 bytes
                (these low 4 bits are header-protected)
  bytes 1..4  Version — 32 bit · 0x00000001 = v1
  byte 5      DCID len (1 B)
  bytes 6..   Destination CID — 0–20 bytes
  …           SCID len · Source CID
  …           Type-specific: token, length, packet number, payload

Short Header(1-RTT, post-handshake):
  Flags (1 B) · Destination CID(隐式长度 implicit length, negotiated)· Packet number(1–4 B · header-protected)· Encrypted payload(AEAD over frames)
FIG 06·1 QUIC 数据包结构 · 长/短 Header 对照 · 注意 PN 和 Flags 都被 Header Protection 加密。 Fig 06·1 · QUIC packet structure · long vs short header · note PN and flags are both header-protected.
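The first byte in Fig 06·1 can be decoded mechanically. A minimal sketch — the helper is ours and covers only the bits the figure names; remember that on the wire the low 4 bits of a long header (5 for a short header) arrive XOR-masked by header protection:

```python
# Decode a QUIC v1 flags byte per RFC 9000 §17.2 (long) / §17.3 (short).

LONG_TYPES = {0b00: "Initial", 0b01: "0-RTT", 0b10: "Handshake", 0b11: "Retry"}

def parse_flags(b: int) -> dict:
    if b & 0x80:                              # header form bit: 1 = Long
        return {"form": "long",
                "fixed_bit": bool(b & 0x40),  # must be 1 in QUIC v1
                "type": LONG_TYPES[(b >> 4) & 0x03],
                "pn_len": (b & 0x03) + 1}     # 1-4 byte packet number
    return {"form": "short",
            "fixed_bit": bool(b & 0x40),
            "spin_bit": bool(b & 0x20),
            "key_phase": bool(b & 0x04),
            "pn_len": (b & 0x03) + 1}

# The 0xc0 byte from the dump below: long header, type Initial, 1-byte PN.
print(parse_flags(0xC0))
```

Feed it 0x40-range values and it flips to the short-header interpretation, which is the shape 99% of the connection's packets take.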

UDP 之外没有别人

Nothing outside the UDP

UDP/443 · main-line T+5ms · first packet on the wiremock · constructed to spec
; Constructed by hand to match RFC 9000 §17.2 byte-for-byte; NOT a real
; tcpdump capture. The encrypted regions and checksums are abbreviated.
; To capture a real one: `SSLKEYLOGFILE=keys.log ./quiche-client https://ursb.me`
; then `tcpdump -i lo0 -w cap.pcap udp port 443` → open in Wireshark.
0x0000: 4500 04dc 0001 0000 4011 8c2b   ; IP header (UDP proto 17 = 0x11)
0x0010: c0a8 012a 2769 66fc cc55 01bb   ; src/dst IP · UDP: src=52341, dst=443 (0x1bb)
0x0014: 04c8 0000                       ; UDP len=1224, checksum (computed)
0x001c: c0                              ; Flags = 0b11000000 → Long header, Initial
0x001d: 00 00 00 01                     ; Version = QUIC v1 (RFC 9000)
0x0021: 08 7b 0f 23 e4 a1 c4 12 5b     ; DCID len=8, DCID=…
0x002a: 08 3a 51 02 d8 ef 99 11 7c     ; SCID len=8, SCID=…
0x0033: 00                              ; Token Length = 0 (no Retry token)
0x0034: 44 b0                           ; Length = 1200 (var-int)
0x0036: [ ENCRYPTED ]                   ; Packet number + payload (AEAD)
; payload (~1180 bytes after decryption):
;   CRYPTO[0..len] = TLS 1.3 ClientHello
;   PADDING                             ; pad to ≥ 1200 bytes — anti-amplification
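The `44 b0` Length field above is a QUIC variable-length integer (RFC 9000 §16): the top two bits of the first byte select a 1/2/4/8-byte encoding, and the remaining bits carry the value. A sketch of both directions:

```python
# QUIC varints per RFC 9000 §16: prefix 00/01/10/11 → 1/2/4/8 bytes,
# value range up to 2**62 - 1.

def encode_varint(v: int) -> bytes:
    for prefix, nbytes in ((0x00, 1), (0x40, 2), (0x80, 4), (0xC0, 8)):
        if v < 1 << (nbytes * 8 - 2):
            b = v.to_bytes(nbytes, "big")
            return bytes([b[0] | prefix]) + b[1:]
    raise ValueError("varint max is 2**62 - 1")

def decode_varint(data: bytes):
    nbytes = 1 << (data[0] >> 6)                  # 1, 2, 4 or 8
    value = int.from_bytes(data[:nbytes], "big") & ((1 << (nbytes * 8 - 2)) - 1)
    return value, nbytes                          # value + bytes consumed

print(encode_varint(1200).hex())             # 44b0 — matches the dump
print(decode_varint(bytes.fromhex("44b0")))  # (1200, 2)
```

Almost every length, stream id and offset in QUIC and HTTP/3 framing is one of these, which is why small values stay one byte on the wire.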
为什么是 1200 字节? Why 1200 bytes? RFC 9000 §14.1 强制 Initial 包必须填到 ≥ 1200 字节。原因:(1) 反放大攻击——客户端"先付够字节",服务器才有 3 倍预算(见 Ch18)回大包;(2) PMTU 下限——如果路径 MTU < 1200 包就被丢,客户端立刻 fallback。1200 这个数字 RFC 9000 直接定义为下限(不是从 IPv6 MTU 1280 减出来的——IPv6 头 40 + UDP 头 8 = 48,1280 − 48 = 1232,比 1200 还多 32 字节)。RFC 选 1200 是为了在 IPv4 隧道、6to4、IPsec 等场景留出额外 32 字节的封装余量。 RFC 9000 §14.1 mandates Initial packets ≥ 1200 bytes. Two reasons: (1) anti-amplification — the client must "send enough first" so the server gets a 3× budget (see Ch18) to send big responses; (2) PMTU floor — if path MTU < 1200, the packet drops and the client falls back fast. The 1200 figure is directly defined by RFC 9000 as the floor (it is not derived from IPv6's 1280-byte MTU: IPv6 header 40 + UDP header 8 = 48, so 1280 − 48 = 1232, leaving 32 bytes more than 1200). RFC chose 1200 to leave that 32-byte cushion for IPv4 tunnels, 6to4, IPsec and other encapsulations.

Header Protection · 头部加密

Header Protection

QUIC 不只加密 payload,还加密 packet number 和 flags 的最低几位。具体做法:取 payload 加密后的密文取 16 字节"样本",用对应级别的密钥跑 AES-ECB 派生出一个 mask,把 mask 异或到 PN 和 flags 上。这一层"header protection"专门防中间盒读取 PN 做流量分析。

QUIC encrypts not only the payload but also the packet number and the low bits of the flags. The recipe: take a 16-byte sample of the ciphertext payload, run AES-ECB with the level's HP key to derive a mask, XOR the mask onto PN and flags. This "header protection" specifically defeats middleboxes that would otherwise read PN for traffic analysis.
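The masking step itself is a few lines of bit-twiddling. Below is a minimal stdlib-only sketch: the real mask comes from running AES-ECB over a 16-byte ciphertext sample with the hp key (which needs a crypto library), so here the mask is passed in precomputed, and `protect`/`unprotect` are illustrative names, not any library's API.

```python
# Minimal sketch of RFC 9001 §5.4.1 header protection, stdlib only.
# The real mask = AES-ECB(hp_key, 16-byte ciphertext sample); we take the
# mask as a precomputed argument so only the masking step is shown.

def protect(packet: bytearray, pn_offset: int, mask: bytes) -> None:
    pn_len = (packet[0] & 0x03) + 1        # read pn length BEFORE masking byte 0
    if packet[0] & 0x80:                   # long header: mask low 4 bits of flags
        packet[0] ^= mask[0] & 0x0F
    else:                                  # short header: low 5 bits (incl. key phase)
        packet[0] ^= mask[0] & 0x1F
    for i in range(pn_len):                # then mask the packet-number bytes
        packet[pn_offset + i] ^= mask[1 + i]

def unprotect(packet: bytearray, pn_offset: int, mask: bytes) -> None:
    if packet[0] & 0x80:
        packet[0] ^= mask[0] & 0x0F
    else:
        packet[0] ^= mask[0] & 0x1F
    pn_len = (packet[0] & 0x03) + 1        # pn length readable only AFTER unmasking
    for i in range(pn_len):
        packet[pn_offset + i] ^= mask[1 + i]
```

The asymmetry is the point: the sender reads the PN length before masking byte 0, the receiver only after unmasking it. A middlebox without the hp key cannot even tell how many bytes the packet number occupies.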

DEVTOOLS 在 Chrome 看 H3 的最直接方法:DevTools → Network → 打开 Protocol 列。一行 h3 = 走的 HTTP/3。如果你看到 h2,说明被某个环节挡了——浏览器走了 TCP fallback。要查为什么,跑 chrome://net-export/ 导出 NetLog 再用 netlog-viewer.appspot.com 看。 The quickest way to see H3 in Chrome: DevTools → Network → enable the Protocol column. A row marked h3 = HTTP/3. If it says h2, something blocked you and the browser fell back to TCP. To diagnose, dump chrome://net-export/ and load it into netlog-viewer.appspot.com.
CHAPTER 07

握手 — 1-RTT 与 TLS 1.3 的合体

Handshake — 1-RTT and the TLS-1.3 merger

QUIC 不是 TLS over UDP,而是 QUIC carrying TLS

QUIC isn't TLS over UDP, QUIC carries TLS

在主线里
In our request
T+20..100ms
Layer
QUIC + TLS 1.3
RFC
9001 · 8446
输出
Output
1-RTT keys

把 HTTP/2 干掉的"2-RTT 起步"是 HTTP/3 最大的卖点。但要真正理解为什么 HTTP/3 能做到 1-RTT(重连 0-RTT),你需要看清 QUIC 和 TLS 1.3 是怎么融合的:不是上下层堆叠,而是 QUIC 用 CRYPTO 帧承载 TLS 1.3 的握手 records,让握手和应用数据共用一个 RTT。

Killing the "2-RTT minimum" left over from HTTP/2 is HTTP/3's biggest selling point. To really see why HTTP/3 hits 1-RTT (and 0-RTT on resumption), you need to look at how QUIC and TLS 1.3 merge: not as stacked layers, but QUIC carrying TLS 1.3 handshake records inside CRYPTO frames, letting handshake and application data share a single RTT.

◇ 在我们的 GET 请求里 · 主线阶段 1 + 2◇ In our GET request · Phase 1 + 2

INPUT
empty TLS state · 客户端首发 ClientHello(含 key_share、ALPN=h3、TP) client sends ClientHello (key_share, ALPN=h3, TP)
OUTPUT
4 key sets + HANDSHAKE_DONE · Initial / 0-RTT / Handshake / 1-RTT 四级密钥派生完成 Initial / 0-RTT / Handshake / 1-RTT keys all derived

完整 1-RTT 时序

Full 1-RTT timeline

Client Server Initial[CRYPTO: ClientHello, ALPN=h3] + PADDING to 1200 bytes · PN_Initial = 0 Initial[CRYPTO: ServerHello] + Handshake[CRYPTO: EE, Cert, CertVerify, Finished] Handshake[CRYPTO: Finished] + 1-RTT[STREAM 0: GET /] data piggybacks on the handshake 1-RTT[STREAM 0: 200 OK + body] 1 RTT
FIG 07·1 QUIC 1-RTT 握手时序 · 三种颜色 = 三个 PN 空间 · 注意客户端发送 Finished 时同包带了 GET。 Fig 07·1 · QUIC 1-RTT handshake · three colours = three PN spaces · note the client packs GET inside the same datagram as Finished.

TLS 1.3 进了哪里?

Where does TLS 1.3 live?

TLS records
TLS records
ClientHello / SH / EE …
QUIC 帧
QUIC frame
CRYPTO
QUIC 包
QUIC packet
Initial / Handshake
物理
Physical
UDP / IP
FIELD NOTE · 不是 "TLS over QUIC" FIELD NOTE · NOT "TLS over QUIC" RFC 9001 的标题刻意叫 "Using TLS to Secure QUIC"。TLS 1.3 在 QUIC 里只剩下两个角色:(1) 密钥协商引擎——产出 4 套密钥(Initial / 0-RTT / Handshake / 1-RTT);(2) 身份认证——证书链、CertVerify、Finished。TLS 1.3 的 record layer 整个被砍掉了——QUIC 自己做加密包装。这就是为什么老版 OpenSSL 不能直接用(QUIC 客户端接口 3.2 才补上),主流 QUIC 库用 BoringSSL、quictls(OpenSSL fork)或 s2n。 RFC 9001 is titled "Using TLS to Secure QUIC" on purpose. Inside QUIC, TLS 1.3 plays only two roles: (1) key-agreement engine — produces four key sets (Initial / 0-RTT / Handshake / 1-RTT); (2) identity authentication — certificate chain, CertVerify, Finished. TLS 1.3's record layer is amputated — QUIC handles the packet wrapping itself. This is why QUIC stacks historically couldn't use stock OpenSSL (client-side QUIC APIs only landed in OpenSSL 3.2): most ship BoringSSL, quictls (an OpenSSL fork) or s2n instead.

为什么是 1-RTT?

Why 1-RTT?

Protocol · handshake · + first data · 合计 total
TCP + TLS 1.2 · 1 RTT (SYN) + 2 RTT (TLS) · + 1 RTT · 4 RTT
TCP + TLS 1.3 · 1 RTT (SYN) + 1 RTT (TLS) · + 1 RTT · 3 RTT
TCP Fast Open + TLS 1.3 · 0.5 RTT (TFO) + 1 RTT · + 1 RTT · 2 RTT*
QUIC + TLS 1.3 (1-RTT) · 1 RTT (handshake + data) · piggybacks · 1 RTT
QUIC + TLS 1.3 (0-RTT) · 0 RTT (data on first packet) · piggybacks · 0.5 RTT

📖 RFC 9000 §18 · Transport Parameters 拆解📖 RFC 9000 §18 · Transport Parameters dissected

握手期间客户端和服务器各自声明一组 transport parameters(TP),夹在 TLS ClientHello / EncryptedExtensions 的扩展里。这是整个连接生命周期里所有窗口、超时、限额的源头。下面是 RFC 9000 定义的 17 个标准参数里最关键的 12 个,外加 RFC 9221 的 0x20:

During handshake both sides declare a set of transport parameters (TP), wrapped inside TLS ClientHello / EncryptedExtensions extensions. This is the single source of truth for every window, timeout, and limit in the connection's lifetime. Below: the twelve of RFC 9000's seventeen standard parameters that actually matter, plus 0x20 from RFC 9221:

id · name · 含义 Meaning · Chrome 默认 default
0x01 max_idle_timeout · 空闲超时(取双方最小值) idle timeout (min of both) · 30 s
0x02 stateless_reset_token · 用于 §10.3 无状态重置 used by §10.3 stateless reset · 16 B random
0x03 max_udp_payload_size · 能接受的最大 UDP 载荷 max UDP payload accepted · 1452
0x04 initial_max_data · 连接级流控窗 connection-level flow window · 10 MB
0x05 init_max_stream_data_bidi_local · 本方主动开的双向流的初始窗 stream window for streams we open · 6 MB
0x06 init_max_stream_data_bidi_remote · 对方开的双向流 streams peer opens · 6 MB
0x07 init_max_stream_data_uni · 单向流 unidirectional streams · 6 MB
0x08 initial_max_streams_bidi · 允许并发双向流数 concurrent bidi stream cap · 100
0x09 initial_max_streams_uni · 单向流数 uni stream cap · 100
0x0b max_ack_delay · 最大 ACK 拖延(影响 PTO) max ACK delay (drives PTO) · 25 ms
0x0c disable_active_migration · 禁用主动迁移(手机选 false) opt-out of active migration · false
0x0e active_connection_id_limit · 允许对端预存的 CID 数 peer's CID pool size · 8
0x20 max_datagram_frame_size · DATAGRAM 帧最大长度(默认 0 = 不启用) DATAGRAM frame max (0 = disabled) · 0 / 1200
§18 · 关键约束 TP 不是协商,是声明——每方独立宣布自己接受什么。生效值是两个声明的更严限制。比如双方都给 max_idle_timeout=30s ⇒ 30s 起效;如果客户端说 30s 服务器说 10s,10s 生效有些参数(如 disable_active_migration)只有服务器能发,客户端发了就是协议违反。 TP is not a negotiation — it's declarations. Each side independently states what it will accept. The effective value is the tighter of the two. Both say max_idle_timeout=30s ⇒ 30s wins; client says 30s, server says 10s ⇒ 10s wins. Some parameters (like disable_active_migration) are server-only; a client sending them is a protocol violation.
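The "tighter declaration wins" rule is mechanical. A sketch for the idle-timeout case; the special case "0 = I don't enforce an idle timeout" is from RFC 9000 §10.1, and the function name is mine:

```python
# Effective idle timeout per RFC 9000 §10.1: each side declares its own
# max_idle_timeout; the connection uses the smaller nonzero value, and a
# declaration of 0 means "no idle timeout enforced on my side".

def effective_idle_timeout_ms(local_ms: int, peer_ms: int) -> int:
    declared = [t for t in (local_ms, peer_ms) if t != 0]
    return min(declared) if declared else 0   # both sides opted out → no timeout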

* TFO 的 cookie 在路上经常被中间盒丢,工程界一般不把它算成"真的可用"。

* TFO cookies frequently get stripped by middleboxes; in practice not considered "really usable".

CHAPTER 08

0-RTT — 把请求塞进握手包里

0-RTT — stuffing the request inside the handshake

免费的午餐,但有重放的尾巴

a free lunch, with a replay-attack tail

在主线里
In our request
2nd visit · T+0ms
触发
Trigger
session resumption
RFC
9001 §4.1 · 8470
省下
Saves
1 RTT

第一次访问 ursb.me 之后,服务器在 1-RTT 握手末尾发了一个 NewSessionTicket——这是一段被服务器密钥加密的 blob,里面装着 PSK。Chrome 把它存起来。下次再访问 ursb.me,Chrome 把 ticket 重新发回去,同时把 GET 请求用 PSK 派生的 0-RTT 密钥加密、放进 Early Data 一起发出去——握手第 0 个 RTT 应用数据就在路上了。

After the first visit to ursb.me, the server appends a NewSessionTicket at the tail of the 1-RTT handshake — an opaque blob encrypted by the server's own key, containing a PSK. Chrome stores it. On the next visit, Chrome ships the ticket back, and simultaneously encrypts the GET request with the PSK-derived 0-RTT key and sends it as Early Data — application bytes are flying before RTT 1.

◇ 在我们的 GET 请求里 · 主线阶段 1(恢复)◇ In our GET request · Phase 1 (resumption)

INPUT
PSK ticket(age 2h) PSK ticket (age 2h) · 上次访问遗留的 session ticket leftover session ticket
OUTPUT
0-RTT 密钥 + Early Data 许可 0-RTT key + Early Data permission · CH 和 GET 拼进同一个 UDP 包 CH + GET in same UDP datagram

0-RTT 时序

0-RTT timeline

Client Server Initial[CRYPTO: ClientHello+PSK] + 0-RTT[STREAM 0: GET /] one UDP datagram, two QUIC packets coalesced, different keys Initial[SH] + Handshake[FIN] + 1-RTT[STREAM 0: 200 OK + body] handshake completion piggybacks the response body ~0.5 RTT
FIG 08·1 0-RTT 时序:握手包和 GET 共一份 UDP 数据报;客户端到服务器只有半个 RTT就发完整个请求。 Fig 08·1 · 0-RTT timeline: handshake and GET share one UDP datagram; the request reaches the server after half an RTT.

重放风险

The replay risk

0-RTT 的 PSK 没有新鲜度。攻击者可以录下你的第一个 UDP 包,重发任意多次——服务器无法区分"是你"还是"录像回放"。对查询型 GET 没问题(重复也是同一个结果),但如果是 POST /transfer/100USD,重放就是一百次转账

The 0-RTT PSK carries no freshness. An attacker can record your first UDP datagram and replay it at will; the server can't tell "you" from "tape rewind". Fine for an idempotent GET (the same answer comes back every time). Catastrophic for POST /transfer/100USD: a hundred replays means a hundred transfers.

三道防线

The three defences

浏览器侧 · 方法白名单
Client · method whitelist
Chrome / Firefox 只在幂等方法(GET、HEAD)上启用 0-RTT。POST / PUT / DELETE 一律退到 1-RTT。
Chrome / Firefox only enable 0-RTT on idempotent methods (GET, HEAD). POST / PUT / DELETE fall back to 1-RTT.
服务器侧 · Early-Data 头
Server · Early-Data header
RFC 8470 规定:服务器把 0-RTT 收到的请求转给上游时,加一行 Early-Data: 1。应用层(如 Cloudflare Worker)看到这行可以决定"不处理"或"返回 425 Too Early"。
RFC 8470: when forwarding 0-RTT-arrived requests upstream, the server adds Early-Data: 1. The application layer (e.g. Cloudflare Worker) can then choose "don't process" or "return 425 Too Early".
TLS 侧 · 时间窗 + 反重放缓存
TLS · time window + anti-replay cache
服务器只在 ticket 发出后的有限时间窗(一般 ≤ 10 秒)接受 0-RTT,并在 Redis-like 缓存里记下"已经见过的 PSK ID"做去重。Cloudflare 用 BoringSSL 的 SSL_CTX_set_early_data_enabled + 集群级 deduper。
The server accepts 0-RTT only inside a narrow time window after ticket issuance (typically ≤ 10s), backed by a Redis-style cache recording "PSK IDs already seen" for dedup. Cloudflare uses BoringSSL's SSL_CTX_set_early_data_enabled plus a cluster-wide deduper.
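Stacked together, the three defences amount to one gate at the server's edge. A sketch with hypothetical names — IDEMPOTENT, ISSUE_WINDOW_S, seen_psk and the argument shapes stand in for BoringSSL's early-data hooks and a cluster-wide dedup cache; none of this is a real API:

```python
# Hypothetical server-side 0-RTT gate combining the three defences.
IDEMPOTENT = {"GET", "HEAD"}     # defence 1: method whitelist
ISSUE_WINDOW_S = 10.0            # defence 3: freshness window after ticket issue

def accept_early_data(method: str, ticket_age_s: float,
                      psk_id: bytes, seen_psk: set) -> bool:
    if method not in IDEMPOTENT:
        return False             # POST/PUT/DELETE wait for the full handshake
    if ticket_age_s > ISSUE_WINDOW_S:
        return False             # stale ticket: force 1-RTT
    if psk_id in seen_psk:
        return False             # replayed datagram: reject the early data
    seen_psk.add(psk_id)
    return True                  # defence 2 applies upstream: forward the
                                 # request with an "Early-Data: 1" header
```

Rejecting here does not kill the request; it just falls back to 1-RTT, exactly as Chrome does for non-idempotent methods.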
CASE · CLOUDFLARE
Cloudflare 的 0-RTT 策略
Cloudflare's 0-RTT policy

Cloudflare 默认对所有客户开启 0-RTT,但仅限 GET/HEAD 且 URL 中不含 query string(query 经常是状态变更动作)。如果客户的 origin 返回 Cache-Control: privateSet-Cookie,Cloudflare 边缘自动把请求升级到 1-RTT 才转给 origin。Cloudflare 的工程博客《Even faster connection establishment with QUIC 0-RTT resumption》给出的实测:0-RTT 让已经访问过的回访用户首字节延迟(TTFB)的中位数降低 ~50ms

Cloudflare enables 0-RTT for all customers by default, but only for GET/HEAD requests without a query string (queries are often state-changing). If the origin returns Cache-Control: private or Set-Cookie, the Cloudflare edge auto-promotes the request to 1-RTT before forwarding upstream. Per their blog «Even faster connection establishment with QUIC 0-RTT resumption», 0-RTT lowers median TTFB for returning users by ~50ms.

CHAPTER 09

加密分层 — 4 套密钥的精确边界

Crypto layers — the exact boundaries of 4 key sets

为什么 Initial 包"加密"但任何人都能解密

why Initial packets are "encrypted" yet anyone can decrypt them

在主线里
In our request
T+0..140ms
Layer
QUIC · key schedule
RFC
9001 §5 · §7
关键
Key idea
salt → HKDF → keys

◇ 在我们的 GET 请求里 · 主线阶段 1-2(密钥派生)◇ In our GET request · Phase 1-2 (key schedule)

INPUT
TLS handshake secrets · master_secret · handshake_secret · early_secret
OUTPUT
4 级 × 3 = 12 个量 4 levels × 3 = 12 values · (key 16B, iv 12B, hp 16B) for Initial/0-RTT/Handshake/1-RTT

四套密钥的派生时点

When each key set is derived

① INITIAL
公开 salt
public salt
PN_Initial
  • salt = 0x38762cf7…
  • HKDF(salt, DCID)
  • 仅防中间盒,不防窃听middlebox-proof only
② EARLY DATA (0-RTT)
PSK-derived
PN_Application*
  • 从上次 ticket 的 PSK 派生derived from last session's PSK
  • 客户端单向使用client → server only
③ HANDSHAKE
DH 之后派生
post-DH derived
PN_Handshake
  • TLS DH 完成后立即派生derived after TLS DH
  • 用于 EE/Cert/FINprotects EE/Cert/FIN
④ 1-RTT (APPLICATION)
主密钥
primary key
PN_Application
  • 承载 99% 数据carries 99% of data
  • 支持 KEY_UPDATE 滚动supports KEY_UPDATE rotation
WHY "INITIAL" IS PUBLIC Initial 包的密钥从一个公开的 salt(RFC 9001 §5.2 写明 0x38762cf7…7f0a)+ 客户端选的 DCID 派生。任何人都能算出来。所以 Initial 包的"加密"不是防窃听——它防的是"中间盒看了 ClientHello 之后做出不该做的事"。这是反僵化策略落在密钥层的体现。 Initial packets derive their keys from a public salt (RFC 9001 §5.2 spells out 0x38762cf7…7f0a) + the client-chosen DCID. Anyone can compute them. So Initial-packet "encryption" does not protect confidentiality — it protects against "middleboxes peeking at ClientHello and then acting on what they saw". This is anti-ossification at the key-schedule layer.

Key Schedule 全图

Full key schedule

QUIC v1 key derivation · RFC 9001 §5 · HKDF · TLS_AES_128_GCM_SHA256
; Step 1 · Initial keys (公开 public)
initial_salt   = 0x38762cf7f55934b34d179ae6a4c80cadccbb7f0a   ; (QUIC v1)
initial_secret = HKDF-Extract(initial_salt, DCID)
client_initial_secret = HKDF-Expand-Label(initial_secret, "client in")
server_initial_secret = HKDF-Expand-Label(initial_secret, "server in")

; Step 2 · TLS 1.3 secrets (handshake DH 之后 after the handshake DH)
handshake_traffic_secret = TLS-derive(... DHE ...)
client_hs_secret = HKDF-Expand-Label(handshake_secret, "c hs traffic")
server_hs_secret = HKDF-Expand-Label(handshake_secret, "s hs traffic")

; Step 3 · 1-RTT (application) secrets
client_app_secret = HKDF-Expand-Label(master_secret, "c ap traffic")
server_app_secret = HKDF-Expand-Label(master_secret, "s ap traffic")

; Each secret then derives:
key = HKDF-Expand-Label(secret, "quic key", 16 bytes)   ; AEAD key
iv  = HKDF-Expand-Label(secret, "quic iv", 12 bytes)    ; AEAD nonce base
hp  = HKDF-Expand-Label(secret, "quic hp", 16 bytes)    ; header-protect key
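Step 1 is reproducible with nothing but the standard library: HKDF-Extract is a single HMAC, and HKDF-Expand-Label is HKDF-Expand with TLS 1.3's "tls13 "-prefixed info block. The sketch below uses RFC 9001 Appendix A's sample DCID 0x8394c8f03e515708 as its check.

```python
# Derive QUIC v1 client Initial keys from the public salt + DCID,
# using only hashlib/hmac (RFC 9001 §5.2 / TLS 1.3 HKDF-Expand-Label).
import hashlib, hmac

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand_label(secret: bytes, label: bytes, length: int) -> bytes:
    lbl = b"tls13 " + label
    info = length.to_bytes(2, "big") + bytes([len(lbl)]) + lbl + b"\x00"  # empty context
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(secret, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

INITIAL_SALT = bytes.fromhex("38762cf7f55934b34d179ae6a4c80cadccbb7f0a")  # QUIC v1

def client_initial_keys(dcid: bytes) -> dict:
    initial_secret = hkdf_extract(INITIAL_SALT, dcid)
    client_secret = hkdf_expand_label(initial_secret, b"client in", 32)
    return {
        "key": hkdf_expand_label(client_secret, b"quic key", 16),  # AEAD key
        "iv":  hkdf_expand_label(client_secret, b"quic iv", 12),   # AEAD nonce base
        "hp":  hkdf_expand_label(client_secret, b"quic hp", 16),   # header-protection key
    }
```

Which is exactly why anyone on the path can decrypt an Initial packet: the only variable input, the DCID, travels in the clear.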
FIELD NOTE · 抓包要点 FIELD NOTE · Decoding capture Wireshark 解 QUIC 必须有 SSLKEYLOGFILE:浏览器把每一级的 secret 写到这个文件,Wireshark 读了之后能解所有 4 级。在 macOS 启动 Chrome:SSLKEYLOGFILE=~/keys.log /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome。一旦 Encrypted Client Hello(ECH,draft-ietf-tls-esni) 进入稳定版(Cloudflare 2023 已开,Chrome 117+ 默认),这条招就只能拿到 outer ClientHello,真正的 SNI 在 inner ClientHello 里被 HPKE 加密 Wireshark needs SSLKEYLOGFILE to decrypt QUIC: the browser writes each level's secret to that file, and Wireshark can decode all four. On macOS, launch Chrome with SSLKEYLOGFILE=~/keys.log /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome. Once Encrypted Client Hello (ECH, draft-ietf-tls-esni) stabilises (Cloudflare turned it on in 2023; Chrome 117+ ships it on by default), this trick only yields the outer ClientHello — the real SNI lives inside an HPKE-encrypted inner ClientHello.
CHAPTER 10

帧字典 — 28 种 QUIC 帧

Frame catalog — 28 kinds of QUIC frame

payload 不是字节流,是帧的串联

payload isn't a byte stream, it's a chain of frames

Layer
QUIC
RFC
9000 §19 · 9221
总数
Total
28 + DATAGRAM
结构
Form
Type · varint · payload

解密一个 QUIC 包的 payload,你得到的不是"一段数据",而是一串帧。每个帧自带类型和长度——服务器和客户端按顺序处理。下面把 RFC 9000 §19 的全部帧(加上 RFC 9221 的 DATAGRAM)整理成四类,让骨架可见。

Decrypt a QUIC packet's payload and you don't get "a chunk of data" — you get a chain of frames. Each carries its own type and length; both ends process them in order. Below is the full RFC 9000 §19 catalogue (plus RFC 9221 DATAGRAM), sorted into four families.

◇ 在我们的 GET 请求里 · 主线阶段 1, 3, 4◇ In our GET request · Phases 1, 3, 4

INPUT
HTTP 语义 + 控制意图 HTTP semantics + control intents · GET + 流控更新 + ACK + 探测 GET + flow-control updates + ACKs + probes
OUTPUT
payload 里的一串帧 a chain of frames in the payload · STREAM · ACK · MAX_S_D · CRYPTO · PADDING · …

四大族 · The four families

The four families

族 1 · 控制
Family 1 · Control
连接生命周期
connection lifecycle
8
  • PADDING (0x00)
  • PING (0x01)
  • CONNECTION_CLOSE (0x1c-1d)
  • HANDSHAKE_DONE (0x1e)
  • NEW_TOKEN (0x07)
  • NEW_CONNECTION_ID (0x18)
  • RETIRE_CONNECTION_ID (0x19)
  • PATH_CHALLENGE / PATH_RESPONSE (0x1a-1b)
族 2 · 可靠性
Family 2 · Reliability
ACK 与丢包
acks & loss
2
  • ACK (0x02)
  • ACK_ECN (0x03)
族 3 · 流 & 流量控制
Family 3 · Streams & flow control
应用数据载体
application data carrier
12
  • STREAM (0x08-0x0f · 8 variants)
  • RESET_STREAM (0x04)
  • STOP_SENDING (0x05)
  • MAX_DATA / DATA_BLOCKED (0x10/0x14)
  • MAX_STREAM_DATA / STREAM_DATA_BLOCKED
  • MAX_STREAMS / STREAMS_BLOCKED
族 4 · 密码学 + 扩展
Family 4 · Crypto + ext.
握手 / 不可靠数据
handshake / datagram
2
  • CRYPTO (0x06)
  • DATAGRAM (0x30-0x31, RFC 9221)
DATAGRAM · RFC 9221 §5 DATAGRAM 帧是 QUIC 唯一不可靠的载荷——不重传、不流控、不排序。最大长度受 transport parameter max_datagram_frame_size(0x20)限制,默认 0 = 禁用,需要双方协商。它存在的全部理由是 WebTransport / MASQUE / Media-over-QUIC 这种"宁愿丢一帧也别等"的实时场景。普通 HTTP/3 流量根本不该碰它。 The DATAGRAM frame is QUIC's only unreliable payload — no retransmit, no flow control, no ordering. Max size is capped by the max_datagram_frame_size transport parameter (0x20; defaults to 0 = disabled, must be negotiated). Its sole purpose is to enable WebTransport / MASQUE / Media-over-QUIC — the "better-drop-than-wait" real-time use cases. Plain HTTP/3 traffic should never touch it.

STREAM 的 8 种变体

The eight STREAM variants

STREAM 帧的低 3 位编码了三个独立开关:OFF(带不带偏移量)/ LEN(带不带长度)/ FIN(是不是流尾)。2³ = 8 个 type 编码 0x08-0x0f。

The low 3 bits of a STREAM frame encode three independent flags: OFF (carries an offset?), LEN (carries a length?), FIN (is this stream's last byte?). 2³ = 8 type codes 0x08-0x0f.

0x08 STREAM
0x09 + FIN
0x0a + LEN
0x0b + LEN + FIN
0x0c + OFF
0x0d + OFF + FIN
0x0e + OFF + LEN
0x0f + OFF + LEN + FIN
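The variant table is just three bit tests; a sketch (the function name is mine):

```python
# Decode the low 3 bits of a STREAM frame type byte (0x08-0x0f, RFC 9000 §19.8).
def stream_flags(frame_type: int) -> dict:
    assert 0x08 <= frame_type <= 0x0F, "not a STREAM frame type"
    return {
        "off": bool(frame_type & 0x04),   # carries an explicit stream offset
        "len": bool(frame_type & 0x02),   # carries an explicit length field
        "fin": bool(frame_type & 0x01),   # marks the final byte of the stream
    }
```

So 0x0e decodes to OFF+LEN without FIN, matching the frame-chain figure below.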

一个 UDP 数据报里有什么

What's inside one UDP datagram

UDP DATAGRAM src=52341 · dst=443 · len=1224 · cksum=… → payload (1216 bytes) goes into a QUIC packet QUIC PACKET · 1-RTT Flags 1 B (HP) DCID 8 B PN 1-4 B (HP) ENCRYPTED PAYLOAD (AEAD-sealed) → decrypts to a chain of frames ▼ decrypt FRAME CHAIN · concat, processed left → right ACK type=0x02 acks PN 100-105, 90 STREAM (+OFF+LEN) type=0x0e · sid=0 · off=512 · len=200 data: "<html>...</html>" (stream A body bytes) STREAM (+LEN+FIN) type=0x0b · sid=4 · len=42 data: response trailer FIN ✓ MAX_S_D type=0x11 sid=0 max=1MB PADDING type=0x00 × N pad to MTU ~280 B ▸ same packet can carry frames for many streams ▸ a stream's bytes can be split across many packets
FIG 10·1 从 UDP 数据报 → QUIC packet → 帧链 · 一个 packet 可以承载多种帧、跨多条流。 Fig 10·1 · UDP datagram → QUIC packet → frame chain · one packet can carry many frame types across many streams.
VARINT QUIC 几乎每个长度字段都用可变长整数(var-int)编码:前 2 位决定占 1/2/4/8 字节。0x37 = 55;0x40 0x40 = 64;0x80 0x00 0x40 0x00 = 16384。这种"小数小占用"的设计让 ACK 帧之类的小包平均小 30%,是 QUIC 的隐形性能源。 Almost every length field in QUIC uses variable-length integers (var-ints): the top 2 bits decide 1/2/4/8-byte encoding. 0x37 = 55; 0x40 0x40 = 64; 0x80 0x00 0x40 0x00 = 16384. This "small = small" encoding makes small frames like ACKs ~30% smaller on average — an invisible source of QUIC's throughput edge.
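The var-int rule in runnable form, with the note's three examples as checks (encode_varint/decode_varint are illustrative names, not a library API):

```python
# QUIC variable-length integers (RFC 9000 §16): the top 2 bits of the first
# byte select a 1/2/4/8-byte encoding; the remaining bits hold the value.

def encode_varint(v: int) -> bytes:
    for prefix, nbytes in ((0b00, 1), (0b01, 2), (0b10, 4), (0b11, 8)):
        if v < 1 << (8 * nbytes - 2):      # value fits in nbytes minus prefix bits
            return (v | prefix << (8 * nbytes - 2)).to_bytes(nbytes, "big")
    raise ValueError("var-int values are capped at 2^62 - 1")

def decode_varint(buf: bytes):
    """Returns (value, bytes consumed)."""
    nbytes = 1 << (buf[0] >> 6)            # prefix 0/1/2/3 → 1/2/4/8 bytes
    v = int.from_bytes(buf[:nbytes], "big") & ((1 << (8 * nbytes - 2)) - 1)
    return v, nbytes
```

The smallest encoding is always chosen on the wire, which is where the ~30% ACK-size saving comes from.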

ACK 帧的精巧

The elegance of ACK

QUIC 的 ACK 帧比 TCP 的 SACK 强一个量级。一个 ACK 帧里可以装多个 Range:{largest_ack, [gap, ack_range]*}——告诉对方"我收到了 PN 100,PN 90-95 收到,PN 80-85 收到,..."。最多一个 ACK 帧就能描述整个连接的所有已收。TCP SACK option 挤在 40 字节的 TCP options 里,最多 3-4 个 range(带 timestamp 选项时只剩 3 个);QUIC 没有这种上限。

QUIC's ACK frame is an order of magnitude more capable than TCP SACK. One ACK frame can pack multiple ranges: {largest_ack, [gap, ack_range]*} — "I have PN 100, 90-95, 80-85, ..." A single ACK frame can describe every received PN of the whole connection. TCP SACK lives in the 40-byte TCP options space, capped at 3-4 ranges (3 when the timestamp option is present); QUIC has no such cap.
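Expanding the gap/range pairs back into PN intervals follows RFC 9000 §19.3.1: the first range hangs off largest_ack, and each later range steps down by gap + 2 because both fields are encoded minus one. A sketch:

```python
# Expand ACK frame fields into inclusive (lo, hi) packet-number ranges,
# per RFC 9000 §19.3.1. `pairs` is the frame's list of (gap, ack_range).
def ack_ranges(largest_acked: int, first_range: int, pairs: list) -> list:
    hi = largest_acked
    lo = hi - first_range
    ranges = [(lo, hi)]
    for gap, rng in pairs:
        hi = lo - gap - 2      # gap counts the unacked PNs in between, minus 1
        lo = hi - rng
        ranges.append((lo, hi))
    return ranges
```

ack_ranges(100, 0, [(3, 5), (3, 5)]) reproduces the "PN 100, 90-95, 80-85" example above.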

CHAPTER 11

流多路复用 — StreamID 的 2-bit 字典

Stream multiplexing — the 2-bit StreamID dictionary

HTTP/2 在应用层做多路复用,HTTP/3 在传输层做

HTTP/2 mux at the app layer, HTTP/3 mux at transport

Layer
QUIC streams
RFC
9000 §2-§4
编码
Encoding
low 2 bits
最大并发
Max concurrent
var-int (≥ 100 default)

◇ 在我们的 GET 请求里 · 主线阶段 3◇ In our GET request · Phase 3

INPUT
4 条逻辑流 4 logical streams · request bidi + control uni + QPACK enc/dec
OUTPUT
已分配的 StreamID allocated StreamIDs · 0 (req) · 2 (ctrl) · 6 (QPACK enc) · 10 (QPACK dec)

StreamID 编码

StreamID encoding

每个流有一个 var-int 编码的 ID。最低 2 位同时编码两件事:方向(双向 / 单向)+ 发起方(客户端 / 服务器)。

Every stream has a var-int ID. The low 2 bits encode two things at once: direction (bidi / uni) and originator (client / server).

bits · 编码 Encoded · 含义 Meaning · HTTP/3 use
0x00 · 0, 4, 8, 12, … · 客户端发起双向流 Client-initiated bidi · 请求流 request streams
0x01 · 1, 5, 9, 13, … · 服务端发起双向流 Server-initiated bidi · HTTP/3 不用 unused in H3
0x02 · 2, 6, 10, … · 客户端发起单向流 Client-initiated uni · control · QPACK encoder/decoder
0x03 · 3, 7, 11, … · 服务端发起单向流 Server-initiated uni · control · QPACK · Push
主线主线 · 我们的 GET In our main-line 浏览器发起 GET ursb.me/,用的是 Stream ID = 0(第一条客户端双向流)。Chrome 同时打开三条单向流:StreamID=2(H3 control stream)、StreamID=6(QPACK encoder)、StreamID=10(QPACK decoder)。这就是为什么下一章讲 HTTP/3 帧时你会看到"控制流要先开"。 Our GET uses StreamID = 0 (the first client-initiated bidi stream). Chrome simultaneously opens three uni streams: StreamID=2 (H3 control), StreamID=6 (QPACK encoder), StreamID=10 (QPACK decoder). This is why the next chapter says "the control stream must open first".
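The 2-bit dictionary as code (helper names are mine):

```python
# StreamID low bits, RFC 9000 §2.1: bit 0 = initiator, bit 1 = direction;
# everything above is a per-type sequence number.
def describe_stream(stream_id: int):
    initiator = "server" if stream_id & 0x01 else "client"
    direction = "uni" if stream_id & 0x02 else "bidi"
    return initiator, direction, stream_id >> 2   # (who, how, nth of that type)

def nth_stream_id(n: int, initiator: str, direction: str) -> int:
    return (n << 2) | ((direction == "uni") << 1) | (initiator == "server")
```

describe_stream(6) gives ("client", "uni", 1): Chrome's QPACK encoder stream is the second client-initiated uni stream, right after the control stream at ID 2.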

四条流并行 · 一条连接里的全部

Four streams in parallel · everything inside one connection

ONE QUIC CONNECTION · time → T+100ms → T+250ms Stream 0 client-initiated bidi type 0x00 · request/response HEADERS (req) FIN HEADERS (rsp) DATA · body Stream 2 client uni · control type 0x02 · SETTINGS, GOAWAY SETTINGS PRIORITY_U Stream 6 client uni · QPACK encoder type 0x02 · inserts → server INSERT INSERT INSERT Stream 10 client uni · QPACK decoder type 0x03 · acks ← server ACK ACK | pkt PN=12 | PN=13 | PN=14 | PN=15 | PN=16 Each UDP packet (PN) may carry frames from multiple streams. Each stream is independently flow-controlled.
FIG 11·1 一条 QUIC 连接里的 4 条平行流 · 请求双向流 + 控制流 + QPACK encoder/decoder · 帧可以挤进同一个 packet。 Fig 11·1 · Four parallel streams inside one QUIC connection · request bidi + control + QPACK encoder/decoder · frames coalesce into the same packet.

流量控制 · 两级

Flow control · two levels

STREAM LEVEL
每条流独立
Per-stream

每条流维护 MAX_STREAM_DATA。发送方累积发送的字节超过这个值就停。接收方通过 MAX_STREAM_DATA 帧主动增窗。

Each stream tracks MAX_STREAM_DATA. Sender stops when cumulative sent bytes hit the limit. Receiver grows the window with MAX_STREAM_DATA frames.

CONNECTION LEVEL
整条连接共享
Connection-wide

所有流字节的总和受 MAX_DATA 限。避免单个连接吃光内存。Chrome 默认 6 MB(OkHttp 25 MB · curl 1 MB)。

Sum of all streams' bytes capped by MAX_DATA. Stops one connection from eating all memory. Chrome defaults to 6 MB (OkHttp 25 MB · curl 1 MB).

📖 RFC 9000 §4 · 流量控制公式📖 RFC 9000 §4 · The flow-control formulas

流量控制有两个独立维度,每个维度都跑一组相同的状态变量:

Flow control runs in two independent dimensions, each with the same set of state variables:

QUIC flow control · sender state · RFC 9000 §4.1, §4.2
; Connection level (across ALL streams)
conn.max_data          ; advertised by peer via MAX_DATA frame
conn.bytes_in_flight   ; sum of all stream offsets sent
INVARIANT: bytes_in_flight ≤ max_data       ; otherwise DATA_BLOCKED

; Stream level (per stream)
stream.max_data        ; peer's MAX_STREAM_DATA(sid, N)
stream.offset          ; highest byte sent so far
INVARIANT: offset ≤ stream.max_data         ; else STREAM_DATA_BLOCKED

; Window update strategy (receiver side)
WHEN consumed(stream) ≥ stream.window_threshold:
    stream.max_data += stream_window_size
    SEND MAX_STREAM_DATA(sid, stream.max_data)

; Chrome's strategy: bump window when receiver consumes half of current
stream_window_size      = 6 MB              ; doubles on bandwidth detection
stream_window_threshold = stream.max_data / 2
§4 · 为什么要两级 流级窗防单条流吃光对端内存(比如客户端下大文件,对端 buffer 撑爆)。连接级窗防 100 条流总和吃光内存(每条流只占 60 KB 也能加起来 6 MB)。两个窗任何一个用完,对应流(或整连接)就停发 STREAM 数据——但 PING、ACK、控制帧还能发,连接不会死。这是 QUIC 比 TCP 多了一层的关键。 Stream-level window stops one stream from eating peer memory (client downloads a huge file, peer buffer blows up). Connection-level window stops the sum of all streams from doing the same (100 streams × 60 KB each = 6 MB). When either window empties, that stream (or the whole connection) stops sending STREAM data — but PING, ACK and control frames keep flowing, connection survives. This is the extra layer QUIC adds on top of TCP.
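The two invariants collapse into one sender-side gate: how many STREAM bytes may leave right now is the minimum of stream headroom, connection headroom, and what the application wants to write. A minimal sketch, not a full sender:

```python
# Sender-side flow-control gate combining RFC 9000 §4's two windows.
def sendable(want: int, stream_offset: int, stream_max: int,
             conn_sent: int, conn_max: int) -> int:
    stream_room = max(0, stream_max - stream_offset)   # MAX_STREAM_DATA headroom
    conn_room = max(0, conn_max - conn_sent)           # MAX_DATA headroom
    return min(want, stream_room, conn_room)
```

With 100 KB left in the stream window but only 500 bytes left at the connection level, a 1000-byte write is clipped to 500 — the tighter window always wins, mirroring how the transport parameters themselves combine.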
"HTTP/2 用一个 TCP 连接装 100 条流,
HTTP/3 用一个 QUIC 连接装 100 条真正独立的流。"
"HTTP/2 stuffs 100 streams into one TCP connection.
HTTP/3 stuffs 100 actually independent streams into one QUIC connection."
RFC 9000 §2 paraphrased
CHAPTER 12

丢包恢复 — 严格单调的 Packet Number

Loss recovery — strictly-monotonic Packet Number

为什么 QUIC 的 RTT 比 TCP 准

why QUIC measures RTT more accurately than TCP

Layer
QUIC recovery
RFC
9002
关键
Key
PN never reused
触发器
Triggers
ACK · PTO · time

◇ 在我们的 GET 请求里 · 主线阶段 3(如果丢包)◇ In our GET request · Phase 3 (if lost)

INPUT
PN 14 (body part 2) 长时间未 ACK PN 14 (body part 2) outstanding > thresh · 阈值:3 包 或 9/8 × max_RTT threshold: 3 packets or 9/8 × max_RTT
OUTPUT
重发同样数据,但分配新 PN 17 resend same bytes with fresh PN 17 · PN 14 永久弃用 PN 14 abandoned forever

TCP 的歧义 vs QUIC 的清晰

TCP's ambiguity vs QUIC's clarity

TCP · seq number = byte offset (reusable) QUIC · packet number is strictly monotonic sender side receiver side seq=1000, len=200 seq=1200, len=200 ✗ lost seq=1200, len=200 (retransmit) SAME seq as the lost one ACK seq=1400 ⚠ which transmit? Karn says: throw away RTT sample Result · 重传歧义 RTT estimate breaks on retransmits. Karn's algorithm: discard RTT on retx. PN=100 [STREAM offset=0..199] PN=101 [STREAM offset=200..399] ✗ lost PN=102 [STREAM offset=200..399] NEW PN — same stream data, fresh ID ACK [102] ✓ unambiguous · RTT = T_ack − T_send(102) Result · 单调清晰 RTT exact on every packet, even retransmits. Why BBR runs better on QUIC than TCP.
FIG 12·1 TCP 重传复用 seq 导致 RTT 测量歧义 · QUIC 每次重传分配新 PN · 解决 30 年的 Karn 算法 Fig 12·1 · TCP retransmits reuse seq, breaking RTT estimation (the Karn problem) · QUIC's monotonic PN gives every retransmit a fresh ID and exact RTT.
TCP
重传歧义
retransmission ambiguity

TCP 的 sequence number 指代字节偏移。重传时 seq 完全相同——你收到的 ACK 到底是回原包还是回重传包?没法分。这就是著名的 retransmission ambiguity,导致 RTT 测量必须用"Karn 算法"忽略重传 RTT。

TCP seq numbers identify byte offsets. A retransmission has the same seq as the original. When an ACK arrives, you can't tell whether it's for the original or the retransmit. This is the infamous retransmission ambiguity; it forces TCP to use "Karn's algorithm" and discard retransmit RTT samples.

QUIC
PN 严格单调
PN strictly monotonic

QUIC 的 packet number 永不复用。重传时新 PN,旧 PN 永远废弃。ACK 回的是哪个 PN,就是哪个 PN——RTT 测量绝对精确。这是 BBR 等高级拥塞控制能在 QUIC 上"开挂"的根源。

QUIC packet numbers are never reused. A retransmit carries a new PN, the old PN is dead forever. An ACK names exactly the PN it acknowledges — RTT samples are exact. This is why advanced cc like BBR runs better on QUIC than TCP.

三类丢包检测

Three classes of loss detection

ACK-based · 包裹被超越
ACK-based · packet outpaced
如果 PN X+3 已经 ACK 了,但 PN X 没 ACK——X 大概率丢了。RFC 9002 默认阈值:3 个包 9/8 × max_RTT 之后宣告丢失。
If PN X+3 is ACKed but PN X isn't — X is likely lost. RFC 9002 default thresholds: 3 packets or 9/8 × max_RTT before declaring it lost.
Probe Timeout (PTO)
Probe Timeout (PTO)
最久那个未 ACK 的包发出后超过 smoothed_RTT + 4 × RTTVAR + max_ack_delay,就触发 PTO——发一个 PING 探测包"叫醒"对方。取代了 TCP 的 RTO 一次干 1 秒。
If the oldest unacked packet has been outstanding longer than smoothed_RTT + 4·RTTVAR + max_ack_delay, PTO fires — send a PING to "wake up" the peer. Replaces TCP's RTO with its 1-second hammer.
Anti-amplification 校准
Anti-amplification adjust
握手期间,服务器受 3x 限制不能乱发探测包——RFC 9002 §6.2.2 规定 PTO 在握手期更保守。
During handshake the server is capped by the 3x amplification rule, so RFC 9002 §6.2.2 requires PTO to be more conservative during handshake.
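The PTO arithmetic from the middle card, with §6.2's exponential backoff; numbers in the usage note are illustrative and the function name is mine:

```python
# PTO per RFC 9002 §6.2: srtt + max(4·rttvar, granularity) + max_ack_delay,
# doubled on every consecutive timeout without progress.
K_GRANULARITY_MS = 1.0

def pto_ms(srtt: float, rttvar: float, max_ack_delay: float, pto_count: int) -> float:
    base = srtt + max(4 * rttvar, K_GRANULARITY_MS) + max_ack_delay
    return base * (2 ** pto_count)
```

With an sRTT of 40 ms, rttvar 5 ms and max_ack_delay 25 ms, the first PTO fires at 85 ms; after two consecutive timeouts it has backed off to 340 ms — still far from TCP's old 1-second RTO floor.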

连接级 vs 流级丢包

Connection-level vs stream-level loss

KILLER FEATURE KILLER FEATURE 假设 PN 5 丢了,里面装的是 Stream A 的字节 0-1200。Stream A 必须等重传。但 PN 6 装的是 Stream B——它的解密和应用层处理不需要等。这就是 HTTP/3 干掉 TCP head-of-line 的关键:丢包只阻塞被丢的流,不阻塞其它流 Suppose PN 5 is lost — it carried Stream A bytes 0-1200. Stream A must wait for retransmission. But PN 6 carried Stream B — its decryption and application-layer processing don't have to wait. This is the key to HTTP/3 killing TCP HOL: loss blocks only the affected stream, never the others.

📖 RFC 9002 §6 · 丢包检测伪代码📖 RFC 9002 §6 · Loss detection in pseudocode

RFC 9002 Appendix A · OnAckReceived · QUIC loss recovery
; State per PN space (Initial / Handshake / Application)
largest_acked_packet          ; highest PN we've heard ACKed
time_of_last_ack_eliciting    ; when did we last send something needing ACK
loss_detection_timer          ; the one timer that drives PTO + early loss
pto_count                     ; resets on each new ACK

; Constants (§6.1.1, §6.1.2)
kPacketThreshold = 3          ; ACKed past N pkts → loss
kTimeThreshold   = 9 / 8      ; × max(srtt, latest_rtt)
kGranularity     = 1 ms

OnAckReceived(ack_frame):
    FOR each newly_acked in ack_frame.ranges:
        UpdateRtt(newly_acked.send_time)
        sent_packets.remove(newly_acked.pn)
        cc.on_packet_acked(newly_acked)      ; cc grows cwnd
    DetectAndRemoveLostPackets()
    pto_count = 0
    SetLossDetectionTimer()

DetectAndRemoveLostPackets():                ; §6.1
    loss_delay = kTimeThreshold × max(srtt, latest_rtt)
    loss_delay = max(loss_delay, kGranularity)
    lost_send_time = now − loss_delay
    FOR each unacked in sent_packets:
        IF unacked.send_time ≤ lost_send_time:                ; time-based
            MARK unacked as LOST
        IF largest_acked − unacked.pn ≥ kPacketThreshold:     ; reordering-based
            MARK unacked as LOST
    cc.on_packets_lost(lost_packets)
    retransmit_data(lost_packets)            ; assign NEW PNs (Ch12 reason)

SetLossDetectionTimer():                     ; §6.2
    IF there are loss candidates:
        timer = earliest_loss_time + loss_delay
    ELSE:                                    ; PTO mode
        pto = (srtt + max(4 × rttvar, kGranularity)
               + max_ack_delay) × 2^pto_count
        timer = time_of_last_ack_eliciting + pto
§6 · 两根触发器 RFC 9002 给了两个独立的丢包判定:(1) 包阈值——后面 3 个包都 ACK 了但这个没;(2) 时间阈值——超过 9/8×max(sRTT,latestRTT) 还没 ACK。任一触发就视为丢失。没有触发但有 ACK 等待时,PTO(Probe Timeout)兜底——每次失败 PTO 指数翻倍(×2^pto_count)。这两个机制合起来取代了 TCP 的 RTO + Fast Retransmit。 RFC 9002 gives two independent loss triggers: (1) packet threshold — the next 3 are ACKed but this one isn't; (2) time threshold — outstanding longer than 9/8×max(sRTT,latestRTT). Either fires → declared lost. If neither fires but ACKs are outstanding, PTO kicks in — doubles on each timeout (×2^pto_count). Together these replace TCP's RTO + Fast Retransmit.
CHAPTER 13

拥塞控制 — 用户态的 BBR 实验场

Congestion control — BBR's user-space playground

QUIC 让拥塞控制变成应用配置

QUIC turns congestion control into an app setting

Layer
QUIC cc
RFC
9002 §7 (NewReno)
实际
In practice
BBR v2/v3 / CUBIC
特点
Property
pluggable

TCP 的拥塞控制写在内核里——升级一次要等几年。QUIC 把它搬到了用户态。Cloudflare 想换 BBR v3?改一行 Rust。Google YouTube 想用自家的 cc 算法?同样改一行 C++。这是 QUIC 真正的"研发加速器"价值——它让网络拥塞控制变成应用层关切,而不是十年内核排队等升级的事。

TCP cc lives in the kernel — upgrading takes years. QUIC moved it to user space. Cloudflare wants BBR v3? Change one Rust line. Google YouTube wants its own cc algorithm? Same — one C++ line. This is QUIC's real "R&D accelerator" value: it turns congestion control into an application concern, not a decade-long kernel queue.

◇ 在我们的 GET 请求里 · 主线阶段 3 + 5◇ In our GET request · Phase 3 + 5

INPUT
RTT 测量 + ACK 调步 RTT samples + ACK pacing · sRTT 40ms · bw 50 Mbps · loss 1.5%
OUTPUT
cwnd 目标 ≈ BDP cwnd target ≈ BDP · BBR ProbeBW 主导 BBR ProbeBW dominant

三大算法对照

Three algorithms side by side

cc · signal · throughput · fairness · 部署在 deployed at
NewReno (RFC 9002 default) · loss · baseline · good · 小实现库的默认 smaller libs' default
CUBIC (RFC 8312) · loss · 1.5x baseline · good · Linux TCP 默认 Linux TCP default · ngtcp2
BBR v2/v3 · bandwidth + RTT · 2-3x baseline · ⚠ can starve CUBIC · Google · Cloudflare · Meta

BBR 凭什么这么强

Why BBR wins

CUBIC / NewReno 用丢包当拥塞信号——但现代网络的丢包大多来自无线信道错误,不是拥塞。BBR 直接测量瓶颈带宽(max bandwidth)和最小 RTT,用 BDP(带宽时延积)当目标在途字节数。结果:BBR 在有损但不拥塞的链路(4G/5G/Wi-Fi)上吃满带宽,CUBIC 在那种链路上一脚刹车一脚油。

CUBIC / NewReno treat loss as the congestion signal — but most modern packet loss comes from wireless channel errors, not congestion. BBR directly measures bottleneck bandwidth (max bw) and minimum RTT, then uses BDP (bandwidth-delay product) as its target in-flight. Result: BBR saturates bandwidth on lossy but uncongested links (4G/5G/Wi-Fi), where CUBIC stutters between brake and accelerator.
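BBR's target is one multiplication. With the main line's link (50 Mbps bottleneck, 40 ms min RTT):

```python
# BDP = bottleneck bandwidth × min RTT, converted bits → bytes.
# Integer math (bps × ms) keeps the arithmetic exact.
def bdp_bytes(bandwidth_bps: int, min_rtt_ms: int) -> int:
    return bandwidth_bps * min_rtt_ms // (8 * 1000)
```

bdp_bytes(50_000_000, 40) works out to 250,000 bytes, roughly 208 full-size packets in flight: that is the in-flight budget BBR steers toward, while loss-based algorithms oscillate around it.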

CONGESTION WINDOW · over time · 4G with 1.5% random loss cwnd 100k 75k 50k 25k 0 ▼ loss ▼ loss ▼ loss ▼ loss NewReno (additive) CUBIC (concave) BBR v2 (bw-based) 0s 1s 2s 3s 4s BBR ignores random loss · CUBIC carves a sawtooth · NewReno crawls
FIG 13·1 cwnd 在弱网(1.5% 随机丢包)下的三种行为 · BBR 看带宽不看丢包,吃满;CUBIC 锯齿;NewReno 缓爬。 Fig 13·1 · cwnd behaviour on a 1.5%-random-loss link · BBR saturates the link (it measures bandwidth, not loss); CUBIC sawtooths; NewReno crawls.
CASE · GOOGLE YOUTUBE
YouTube 上 BBR 实测
BBR in YouTube production

Google 2017 年 SIGCOMM 论文《BBR: Congestion-Based Congestion Control》给出:在美国跨州链路上,BBR 让 YouTube 的视频缓冲事件率下降 53%,启动时间降低 8%。2024 年 BBR v3 进一步把吞吐稳定性提升约 15%。Google 把 BBR 同时部署到 TCP(Linux 内核 4.9+)和 QUIC(QUICHE)——但 QUIC 上的 BBR 因为 PN 单调更精确(见 Ch12),效果更稳。

Google's 2017 SIGCOMM paper «BBR: Congestion-Based Congestion Control» reported: on US cross-state links, BBR reduced YouTube's video rebuffer rate by 53% and startup time by 8%. BBR v3 (2024) tightened throughput stability another ~15%. Google deploys BBR on both TCP (Linux kernel 4.9+) and QUIC (QUICHE) — but the QUIC variant runs more stably thanks to monotonic PN (see Ch12).

Spin Bit · 让运营商喘口气

Spin Bit · throwing operators a bone

QUIC 把 packet number 都加密了——运营商再也不能用过去测 TCP RTT 的招测 QUIC RTT。这让大量运营商抓狂(他们的 SLA 监控、流量调度全靠 RTT 数据)。QUIC WG 妥协的设计:Spin Bit——short header 里有 1 比特,在每个 RTT 翻转一次,中间盒不解密也能被动测算 RTT。客户端可以选择关闭它(出于隐私),但生产环境基本都开。

QUIC encrypts packet numbers — operators can no longer measure RTT the way they did with TCP. This drove operators wild (their SLAs and traffic engineering all depend on RTT). QUIC WG's compromise: Spin Bit — 1 bit in the short header that flips once per RTT. Middleboxes can passively measure RTT without decrypting. Clients may disable it for privacy, but in production it's almost always on.
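The observer's side of the bargain is tiny. A toy passive estimator, assuming the input is a one-direction capture reduced to (timestamp, spin-bit) pairs — the gap between successive bit flips approximates one RTT:

```python
def spin_rtt_samples(packets):
    """Estimate RTT passively from the spin bit, as a middlebox would.
    `packets`: iterable of (timestamp_s, spin_bit) for ONE direction.
    Each flip marks the start of a new RTT; the gap between flips is one sample."""
    rtts, last_flip_t, last_bit = [], None, None
    for t, bit in packets:
        if last_bit is not None and bit != last_bit:
            if last_flip_t is not None:
                rtts.append(t - last_flip_t)
            last_flip_t = t
        last_bit = bit
    return rtts

# synthetic capture: the bit flips every ~100 ms => RTT ~ 100 ms
trace = [(0.00, 0), (0.05, 0), (0.10, 1), (0.16, 1), (0.20, 0), (0.30, 1)]
print(spin_rtt_samples(trace))   # two samples, both ~0.1 s
```

This is also why disabling the spin bit costs the client nothing: the endpoint already knows its RTT from ACK timing; only the passive observer loses the signal.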

📖 RFC 9002 §7 · NewReno 状态机伪代码📖 RFC 9002 §7 · NewReno state machine pseudocode

RFC 9002 §7 · NewReno (default cc)Appendix B
; State
  congestion_window (bytes)
  bytes_in_flight (bytes)
  ssthresh                           ; slow start threshold
  congestion_recovery_start_time     ; for filtering duplicate triggers

; Constants
  kInitialWindow           = 10 × max_datagram_size   ; ~14 KB
  kMinimumWindow           = 2 × max_datagram_size    ; ~2.9 KB
  kLossReductionFactor     = 0.5
  kPersistentCongestionThr = 3                        ; PTOs of no progress

OnPacketAcked(acked):                                 ; §7.3.1
  IF acked.in_congestion_recovery: RETURN             ; ignore old retransmits
  IF cwnd ≥ ssthresh:                                 ; congestion avoidance
    cwnd += (max_datagram_size × acked.size) / cwnd
  ELSE:                                               ; slow start
    cwnd += acked.size

OnPacketsLost(lost):                                  ; §7.3.2
  IF any(p.send_time > congestion_recovery_start_time for p in lost):
    congestion_recovery_start_time = now()
    ssthresh = cwnd × kLossReductionFactor            ; halve
    cwnd = max(ssthresh, kMinimumWindow)
  IF in_persistent_congestion(lost):                  ; §7.6
    cwnd = kMinimumWindow
    congestion_recovery_start_time = 0                ; restart from scratch

; Sending constraint (everywhere)
INVARIANT: bytes_in_flight ≤ cwnd
§7 · 拥塞控制三大概念 这段伪代码里出现的核心机制:(1) 慢启动——cwnd 每收一个 ACK 按被确认的字节数增长;(2) 拥塞回避——cwnd 每收一个 ACK 涨 1/cwnd(即每 RTT 涨约 1 包);(3) 持续拥塞——3 个 PTO 没有任何 ACK,被认定为路径中断,cwnd 重置到最小。BBR 抛弃了这套循环,直接测量瓶颈带宽,因此吞吐是 NewReno 的 1.5-3 倍——参见 Ch21 性能数据。 Three concepts in this pseudocode: (1) slow start — cwnd grows by the acked bytes per ACK; (2) congestion avoidance — cwnd grows by 1/cwnd per ACK (i.e. about 1 packet per RTT); (3) persistent congestion — 3 PTOs with no ACK is treated as a path break, cwnd resets to minimum. BBR ditches this whole loop and directly measures bottleneck bandwidth, hence 1.5-3× the throughput of NewReno — see Ch21 for production numbers.
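The pseudocode above translates almost line for line into a runnable toy (units are bytes, times are seconds; class and field names are mine, not mandated by RFC 9002):

```python
MAX_DATAGRAM = 1200                    # bytes, the usual QUIC floor
K_INITIAL = 10 * MAX_DATAGRAM          # kInitialWindow
K_MIN = 2 * MAX_DATAGRAM               # kMinimumWindow

class NewReno:
    """Toy RFC 9002 §7 sender-side congestion controller."""
    def __init__(self):
        self.cwnd = K_INITIAL
        self.ssthresh = float("inf")
        self.recovery_start = -1.0     # congestion_recovery_start_time
        self.now = 0.0
    def on_ack(self, sent_time: float, acked_bytes: int):
        if sent_time <= self.recovery_start:
            return                      # ACK for a pre-recovery packet: ignore
        if self.cwnd < self.ssthresh:
            self.cwnd += acked_bytes    # slow start: grow by acked bytes
        else:                           # congestion avoidance: ~1 MTU per RTT
            self.cwnd += MAX_DATAGRAM * acked_bytes // self.cwnd
    def on_loss(self, sent_time: float):
        if sent_time > self.recovery_start:   # at most one cut per RTT
            self.recovery_start = self.now
            self.ssthresh = self.cwnd // 2    # kLossReductionFactor = 0.5
            self.cwnd = max(self.ssthresh, K_MIN)

cc = NewReno()
cc.on_ack(0.1, 1200)       # slow start: 12000 -> 13200
cc.now = 1.0
cc.on_loss(0.5)            # halve: cwnd = ssthresh = 6600
cc.on_ack(1.5, 1200)       # avoidance: += 1200*1200//6600 -> 6818
print(cc.cwnd)
```

Run it with a 1.5% random-loss schedule and you reproduce Fig 13·1's sawtooth: every loss event halves cwnd regardless of whether the link was actually congested.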
CHAPTER 14

HTTP/3 帧 — 把 HTTP/2 砍掉一半

HTTP/3 frames — HTTP/2 with half the surface chopped off

QUIC 已经做完的事,HTTP/3 就不再重复

whatever QUIC already did, HTTP/3 doesn't redo

Layer
HTTP/3
RFC
9114 §7
帧类型
Frame types
7
对比 H2
vs H2
~50% less

HTTP/2 有 10 种帧(DATA / HEADERS / PRIORITY / RST_STREAM / SETTINGS / PUSH_PROMISE / PING / GOAWAY / WINDOW_UPDATE / CONTINUATION),HTTP/3 只有 7 种——因为 QUIC 把流控制、流终止、ping、优先级都包了。HTTP/3 只剩"HTTP 自己的事"。

HTTP/2 has 10 frame types. HTTP/3 has 7 — because QUIC already handles flow control, stream reset, ping, and priority. HTTP/3 only carries "HTTP's own business" now.

◇ 在我们的 GET 请求里 · 主线阶段 3◇ In our GET request · Phase 3

INPUT
HTTP method/path/headers + 3200 B bodyHTTP method/path/headers + 3200 B bodyGETHTTP/3 · 12 个头字段 + HTML
OUTPUT
HEADERS 帧 + DATA 帧HEADERS frame + DATA frameHEADERS ≈ 7 B (QPACK) · DATA = 3200 BHEADERS ≈ 7 B (QPACK) · DATA = 3200 B

帧清单

Frame list

Type | Hex | 用途 Purpose | HTTP/2 里 In HTTP/2
DATA | 0x00 | HTTP body | same
HEADERS | 0x01 | QPACK 压缩头部 QPACK-encoded headers | same
CANCEL_PUSH | 0x03 | 取消 Push(已死)cancel push (dead) | —
SETTINGS | 0x04 | 连接参数 connection params | same
PUSH_PROMISE | 0x05 | 服务器 Push(已死)server push (dead) | deprecated
GOAWAY | 0x07 | 优雅关闭 graceful close | same
MAX_PUSH_ID | 0x0d | 允许的 Push ID 上限 push limit | —
— 砍掉 removed — | PRIORITY · RST_STREAM · PING · WINDOW_UPDATE · CONTINUATION | QUIC 处理 handled by QUIC

三类流的开局

Three streams open the connection

Control · uni 0x00
SETTINGS · GOAWAY
每端必须开 1 条
each side must open one
QPACK encoder · uni 0x02
动态表更新
dynamic table updates
→ peer's decoder
QPACK decoder · uni 0x03
已收确认
insert count ack
→ peer's encoder
FIELD NOTE · 类型字节的诡计 FIELD NOTE · Type-byte trick 单向流的第一个字节是 stream type,不是 frame type。0x00 = control, 0x01 = push, 0x02 = QPACK encoder, 0x03 = QPACK decoder。GREASE 类型(用预留范围 0x1f * N + 0x21RFC 9114 §7.2.8 + RFC 9287)任何端都可以发——这就是 RFC 9114 的反僵化策略:故意送一些对方不认识的流,强迫实现"遇到不认识就忽略",否则永远不会有 0x04 出现。 A uni stream's first byte is the stream type, not a frame type. 0x00 = control, 0x01 = push, 0x02 = QPACK encoder, 0x03 = QPACK decoder. GREASE types (reserved range 0x1f·N + 0x21, RFC 9114 §7.2.8 + RFC 9287) can be sent by either side — RFC 9114's anti-ossification trick: deliberately send streams the peer doesn't recognise, forcing implementations to "ignore unknown", so 0x04 can land in the future.
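That stream-type byte — like every integer in QUIC and H3 — is a QUIC varint. A minimal decoder plus the GREASE check from the formula above (function names are mine; the test values are RFC 9000 Appendix A's worked examples):

```python
def decode_varint(buf: bytes, off: int = 0):
    """QUIC variable-length integer (RFC 9000 §16): the two MSBs of the
    first byte give the encoded length (1/2/4/8 bytes); the remaining
    bits are the big-endian value."""
    first = buf[off]
    length = 1 << (first >> 6)          # 00->1, 01->2, 10->4, 11->8
    value = first & 0x3F
    for b in buf[off + 1 : off + length]:
        value = (value << 8) | b
    return value, off + length          # (value, offset past the varint)

def is_h3_grease_type(t: int) -> bool:
    """Reserved H3 types have the form 0x1f * N + 0x21 (RFC 9114 §7.2.8)."""
    return t >= 0x21 and (t - 0x21) % 0x1f == 0

# RFC 9000 Appendix A examples:
print(decode_varint(bytes.fromhex("25"))[0])        # 37
print(decode_varint(bytes.fromhex("7bbd"))[0])      # 15293
print(decode_varint(bytes.fromhex("9d7f3e7d"))[0])  # 494878333
print(is_h3_grease_type(0x21), is_h3_grease_type(0x02))  # True False
```

An implementation that survives GREASE is simply one whose dispatch ends in "unknown type → skip and keep reading" instead of "unknown type → connection error".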

优先级 · 砍了 PRIORITY 之后

Priority · what replaced PRIORITY

HTTP/2 的优先级是个有名的笑话——RFC 7540 §5.3 设计了一棵weighted dependency tree,让客户端"告诉服务器谁先发"。Firefox 写过、Chrome 写过、Safari 没写。三家行为完全不一致,最后 RFC 9113 把它整段废弃了。

HTTP/3 选择了完全不同的路线 ——RFC 9218 · Extensible Priorities for HTTP(2022-06,和 RFC 9114 同期发):

HTTP/2's priority was a famous joke — RFC 7540 §5.3 designed a weighted dependency tree for clients to tell servers "send these first". Firefox shipped one. Chrome shipped a different one. Safari shipped none. The three implementations behaved nothing alike. RFC 9113 finally obsoleted the whole thing.

HTTP/3 went a different route entirely — RFC 9218 · Extensible Priorities for HTTP (2022-06, shipped with RFC 9114):

priority header
声明优先级
declare priority
u=0..7 · i
  • priority: u=3 — urgency 0(高)…7(低)
  • i — incremental(流式可逐字节渲染)
PRIORITY_UPDATE
中途调整
re-prioritise mid-flight
frame 0xF0700 (H3)
  • 客户端在 control stream 发送sent on the control stream
  • 用 SF-Item 结构SF-Item dict syntax
服务端策略
Server policy
建议而非强制
advisory, not mandatory
scheduler-defined
  • 服务器可以无视server may ignore
  • RFC 不规定调度算法RFC doesn't pick a scheduler
RFC 9218 example · what Chrome sends · curl -v --http3 ursb.me
; HTML — top urgency, render incrementally as bytes arrive
GET / HTTP/3
priority: u=0, i

; CSS — high urgency, blocking but not incremental
GET /app.css HTTP/3
priority: u=2

; Image — low priority, can wait
GET /hero.webp HTTP/3
priority: u=5, i

; PRIORITY_UPDATE on the control stream — Chrome bumps an image into view
[PRIORITY_UPDATE frame 0x0F0700]
  prioritized_id = 0x14        ; stream 20
  priority_field = "u=1, i"    ; now urgent, scrolled into viewport
DEVTOOLS Chrome DevTools → Network → 右侧 "Priority" 列。Chrome 内部把主资源/CSS/JS/图片/字体映射成 u=0..5。你可以用 fetch API 的 priority 选项手动覆盖:fetch(url, { priority: 'high' })。这是 RFC 9218 在浏览器侧的唯一对外接口。 Chrome DevTools → Network → "Priority" column on the right. Chrome maps main resource / CSS / JS / image / font internally to u=0..5. You can override with the Fetch API's priority option: fetch(url, { priority: 'high' }). That's the only browser-facing surface for RFC 9218.
CHAPTER 15

QPACK — 给 HPACK 解开有序枷锁

QPACK — unshackling HPACK from strict order

为什么不能直接用 HPACK

why we couldn't just keep HPACK

Layer
HTTP/3 header compression
RFC
9204
静态表
Static table
99 entries
动态表
Dynamic table
tunable · default 4096 B

◇ 在我们的 GET 请求里 · 主线阶段 3◇ In our GET request · Phase 3

INPUT
12 个 HTTP 头部键值对12 HTTP header KV pairs~600 B 原始字节~600 B raw
OUTPUT
5-7 字节5-7 bytes静态表查中 · 动态表命中 · 字面编码static hits · dynamic hits · literals

HPACK 在 QUIC 上的不可救药

Why HPACK can't live on QUIC

HPACK(HTTP/2)依赖一个严格同步的动态表。服务器在 Stream A 发了 ":status: 200",告诉客户端"把这条加进表,索引 62"。下一个流可以用索引 62 来引用——前提是 Stream A 在 Stream B 之前到达。HTTP/2 over TCP 天然按序,所以没问题。

QUIC 各流相互独立、并发到达。Stream A 的 update 还没来,Stream B 已经用了索引 62——无法解压。这就把 transport 层好不容易消灭的 head-of-line blocking 又拽回了应用层。

HPACK (HTTP/2) depends on a strictly synchronised dynamic table. Server sends ":status: 200" on Stream A and says "insert this, index 62". The next stream can now refer to index 62 — assuming Stream A arrives before Stream B. HTTP/2 over TCP is naturally ordered, so this works.

QUIC streams are independent and arrive concurrently. If Stream A's update hasn't landed yet but Stream B already references index 62 — cannot decode. The head-of-line blocking the transport layer worked so hard to kill comes roaring back at the app layer.

QPACK 的三招

QPACK's three moves

静态表扩容 + 现代化
Bigger, modernised static table
HPACK 静态表 61 项 → QPACK 99 项。新加了 alt-svccontent-security-policystrict-transport-security:scheme: https 等现代 web 必备字段。
HPACK static table 61 entries → QPACK 99. Added alt-svc, content-security-policy, strict-transport-security, :scheme: https and other modern-web staples.
双向同步流
Bi-directional sync streams
encoder stream(单向 0x02)发送"插入这条到动态表"指令;decoder stream(单向 0x03)反向告知"我已经收到 N 条 insertion"——这两个数字叫 Insert Count 和 Known Received Count。
The encoder stream (uni 0x02) sends "insert this into the dynamic table". The decoder stream (uni 0x03) reports back "I've received N insertions so far" — the two counters: Insert Count and Known Received Count.
Required Insert Count + 阻塞容忍
Required Insert Count + tolerable blocking
每个 HEADERS 帧带一个 Required Insert Count。如果接收端的 Insert Count 还不够,只有这一条 HEADERS 暂存——其它流照样跑。SETTINGS 里可配置允许多少条"阻塞中的流"(RFC 默认 0,实现通常配成 ~100)。如果发送端压缩太激进、会导致阻塞超限,发送端会自动退回到不引用动态表的字面编码。
Each HEADERS frame carries a Required Insert Count. If the receiver's Insert Count isn't there yet, only that HEADERS pends — other streams run on. SETTINGS configures how many "blocked streams" are tolerated (0 by default in the RFC; implementations commonly allow ~100). If the sender's aggressive compression would exceed the cap, it auto-falls-back to literal encoding that never references the dynamic table.
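The decoder-side accounting is small enough to sketch directly. A toy gate (class and method names are mine, not from any real QPACK library) that reproduces the per-stream blocking behaviour:

```python
class QpackDecoderGate:
    """Decoder-side sketch of RFC 9204's blocked-stream accounting."""
    def __init__(self, max_blocked: int = 100):
        self.insert_count = 0    # INSERTs received on the encoder stream
        self.blocked = {}        # stream_id -> Required Insert Count
        self.max_blocked = max_blocked
    def on_insert(self, n: int = 1):
        """Encoder-stream INSERTs arrived; returns streams now decodable."""
        self.insert_count += n
        ready = [s for s, ric in self.blocked.items()
                 if ric <= self.insert_count]
        for s in ready:
            del self.blocked[s]
        return ready
    def on_headers(self, stream_id: int, ric: int) -> str:
        """A HEADERS block arrived with Required Insert Count = ric."""
        if ric <= self.insert_count:
            return "decode"                 # table is far enough along
        if len(self.blocked) >= self.max_blocked:
            return "connection_error"       # QPACK_DECOMPRESSION_FAILED
        self.blocked[stream_id] = ric       # park ONLY this stream
        return "blocked"

g = QpackDecoderGate()
g.on_insert(61)
print(g.on_headers(0, 63))   # blocked — references inserts not yet seen
print(g.on_headers(4, 61))   # decode — other streams run on
print(g.on_insert(2))        # [0] — stream 0 unblocks once INSERTs land
```

This is the whole trick: HOL is confined to streams that opted into referencing not-yet-acknowledged table entries.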

阻塞容忍可视化

Tolerable blocking, visualised

[FIG · Required Insert Count · encoder stream (server → client) carries INSERT idx=62 ":status: 200" and INSERT idx=63 "cache-control: …" · Stream 0 (request A, HEADERS RIC=63, references idx 62) arrives before the INSERTs → only this one stream blocks, waiting for INSERT 62-63 · Stream 4 (request B, HEADERS RIC=63, references idx 62-63) arrives after the INSERTs are buffered → decodes immediately · counters: Insert Count (server) = 64 · Known Received Count = 61 · Required Insert Count = 63]
FIG 15·1 Stream 0 引用了还没到的 idx=62 → 它一条暂存;Stream 4 等 INSERT 都到了再来,立即解码。HOL 只发生在这一条流上。 Fig 15·1 · Stream 0 references idx=62 before the INSERT lands → only Stream 0 blocks; Stream 4 arrives after the INSERTs and decodes immediately. HOL stays per-stream.

实测压缩率

Measured compression

scenario | raw bytes | HPACK (H2) | QPACK (H3)
首次请求 first request | ~600 | ~50 | ~52
同连接重复请求 repeated request, same conn | ~600 | ~5 | ~6
弱网(丢包)weak link (lossy) | ~600 | ~5 + HOL | ~6 (no HOL)

压缩率本身差不多。QPACK 的赢面在抗丢包

Compression ratios are nearly identical. QPACK's win is in resistance to loss.

PRACTICAL 大多数 QUIC 库默认动态表只开 4 KB——比 HPACK 的 64 KB 小得多。原因:动态表越大,"阻塞中的流"越多。在内网/低延迟场景可以调大;公网/移动场景不要。如果你在 nginx 配置 H3 时看到 http3_max_field_size,那就是它。 Most QUIC libraries default the dynamic table to 4 KB — much smaller than HPACK's 64 KB. Reason: the bigger the table, the more "blocked streams" pile up. Bump it up on intra-datacenter / low-latency paths; don't on public / mobile networks. The nginx knob is http3_max_field_size.
CHAPTER 16

Server Push 之死

The death of Server Push

一个写在 RFC 里、被市场否决的功能

a feature that lived in the RFC and died in production

STATUS
默认禁用disabled by default
Chrome
v106+ 移除removed
Firefox
默认关off by default
替代方案
Replacement
103 Early Hints

2015 年 HTTP/2 把 Server Push 当成杀手特性写进了 RFC 7540——服务器知道客户端马上要 app.css,那为什么不提前推给它?2022 年 Chrome 106 默认禁用了 Server Push。2024 年彻底从 Chromium 代码里移除。HTTP/3 RFC 9114 出于"协议完整性"保留了 PUSH_PROMISE 帧——但浏览器都不接。

In 2015, HTTP/2 wrote Server Push into RFC 7540 as a killer feature — the server knows the client will need app.css, so why not push it ahead of time? In 2022, Chrome 106 disabled Server Push by default. In 2024, it was deleted from the Chromium tree. HTTP/3 RFC 9114 kept the PUSH_PROMISE frame for "protocol completeness" — but no browser accepts it anymore.

◇ 在我们的 GET 请求里 · 主线阶段 N/A · 不触发◇ In our GET request · Phase N/A · no-op

INPUT
— · 我们这次 GET 不触发 push— · our GET doesn't trigger pushChrome 106+ 默认禁 pushChrome 106+ default-disables push
OUTPUT
— · 浏览器直接 CANCEL_PUSH— · browser CANCEL_PUSHes immediately即使服务器主动 push 也会被丢pushes from server get cancelled

死因 · 三个

Three causes of death

死因 1 · 缓存不知道
Cause 1 · cache ignorance
服务器不知道客户端有什么
Server doesn't know what the client has

服务器盲目 push app.css——但如果客户端缓存里已经有了呢?带宽白浪费。Chrome 实测发现 70%+ 的 push 被客户端立即 CANCEL_PUSH 掉。

The server blindly pushes app.css — but what if the client already has it cached? Bandwidth wasted. Chrome's telemetry: 70%+ of pushes get immediately CANCEL_PUSHed.

死因 2 · 优先级混乱
Cause 2 · priority chaos
Push 抢了真正请求的带宽
Push steals real-request bandwidth

服务器推的 app.css 在线上跟客户端发起的 app.js 抢拥塞窗。BBR 不知道哪个更急——结果两个都慢。

The server-pushed app.css competes with the client-issued app.js on the congestion window. BBR can't tell which is more urgent — both end up slower.

死因 3 · 替代品更好
Cause 3 · better alternative
103 Early Hints + preload
103 Early Hints + preload

服务器先发一个 HTTP 103 Early Hints 响应(RFC 8297),告诉客户端"你可能会需要 app.css"。客户端自己决定要不要 preload。简单、可观察、不抢带宽。

Server sends a HTTP 103 Early Hints response (RFC 8297) telling the client "you'll probably need app.css". The client decides whether to preload. Simple, observable, no bandwidth war.
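On the wire the replacement pattern is two responses on one stream — a sketch in the same informal notation as the other traces here (paths and header values illustrative):

```
; client asks for the page
GET /index.html HTTP/3

; server answers twice on the same stream (RFC 8297):
HTTP/3 103 Early Hints           ; sent immediately, while SSR still renders
link: </app.css>; rel=preload; as=style
link: </app.js>; rel=preload; as=script

HTTP/3 200 OK                    ; the real response, however much later
content-type: text/html
; ...body...
```

The client fetches the hinted resources only if its cache misses — which is precisely the check Server Push could never perform.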

活下来的东西
What survived
实际部署的 Push 用法
Push patterns that actually shipped

CDN(Cloudflare 等)依然在边缘到 origin 之间偷偷用 Push 做 prefetch 优化——这不进客户端浏览器,所以不受 Chrome 106 影响。这种"内网 Push"还活着。

CDNs (Cloudflare et al.) still quietly use Push between their edge and origin for prefetch optimisation — that traffic never reaches the client browser, so Chrome 106 doesn't affect it. "Intra-network Push" lives on.

"Server Push 在 RFC 里完美无瑕,
在生产里几乎没找到一个稳定的用例。"
"Server Push was flawless in the RFC,
and almost no stable use case ever showed up in production."
Patrick Meenan · Chrome Web Performance · 2022
CHAPTER 17

连接迁移 — Wi-Fi 到 5G 不断线

Connection migration — Wi-Fi to 5G without dropping

CID 是 QUIC 的身份证

the Connection ID is QUIC's passport

在主线里
In our request
mid-flight switch
Layer
QUIC · CID layer
RFC
9000 §9
关键帧
Key frames
PATH_CHALLENGE / RESPONSE

主线时刻 T+200ms(请求中途):你走出咖啡馆,手机自动切到 5G。src_ip: 192.168.1.4210.220.5.13。TCP 在这里必死,因为连接由四元组定义。HTTP/3 不死——因为 QUIC 连接由 Connection ID 定义,而不是四元组。

Main-line time T+200ms (mid-request): you walk out of the café, the phone hops to 5G. src_ip: 192.168.1.4210.220.5.13. TCP dies here, because TCP identifies a connection by the 4-tuple. HTTP/3 doesn't die — because QUIC identifies a connection by the Connection ID, not by IP-port.

◇ 在我们的 GET 请求里 · 主线阶段 6◇ In our GET request · Phase 6

INPUT
旧路径 Wi-Fi 192.168.1.42:52341old path Wi-Fi 192.168.1.42:52341NSURLSession reports Wi-Fi lossWi-Fi 信号丢失
OUTPUT
新路径 5G 10.220.5.13:34188new path 5G 10.220.5.13:34188同 CID · PATH_CHALLENGE 验证完成same CID · PATH_CHALLENGE OK

CID 池 · 提前准备好

CID pool · prepared in advance

连接建立后,服务器和客户端不停发 NEW_CONNECTION_ID 帧,互相给对方备好"未来可以用的 CID 列表"。每个 CID 还附带一个 Stateless Reset Token——用于无状态重置。

Once the connection is up, both sides keep emitting NEW_CONNECTION_ID frames, populating each other's "list of CIDs you may use in future". Each CID carries a Stateless Reset Token too — for stateless reset.

Path Validation 流程

Path Validation flow

Client (5G now)                                   Server
  1-RTT [ANY: src=10.220.5.13]  ───────────→      ; DCID = next from pool
                 ←───────  PATH_CHALLENGE [random=0xAFBE…]
                                                  ; "prove you're on this path"
  PATH_RESPONSE [0xAFBE…]  ────────────────→      ; echo back the same nonce
                 ←───────  1-RTT [STREAM 0: body chunk N+1]
                                                  ; migration complete, stream resumes
FIG 17·1 Path Validation 时序 · 浏览器换 src_ip 后服务器主动发起 challenge · 一次 RTT 内验证完毕 Fig 17·1 · Path Validation timeline · server initiates the challenge after seeing a new src_ip · validated within one RTT
FIELD NOTE · Apple iCloud Private Relay FIELD NOTE · Apple iCloud Private Relay Apple 的 iCloud Private Relay(2021 上线)是迄今最大的连接迁移实战场。手机在 Wi-Fi/5G 之间频繁切换,每次切换 Path Validation 都要在 100-300ms 内完成。Apple 的实测:让中位 RTT 在切网瞬间没有明显抖动,因为新路径在旧路径还没关闭前就完成了验证——这就是 RFC 9000 §9.4 描述的"NAT rebinding without active migration"模式。 Apple's iCloud Private Relay (launched 2021) is by far the largest connection-migration testbed. Phones flip Wi-Fi/5G constantly, and each flip requires Path Validation in 100-300ms. Apple's data: median RTT shows no detectable jitter at the moment of switch — because the new path is validated before the old path is torn down. This is the "NAT rebinding without active migration" mode in RFC 9000 §9.4.

NAT Rebinding · 隐式迁移

NAT Rebinding · implicit migration

家用路由器的 NAT 表项一般有过期时间(30 秒~2 分钟)。如果客户端短时间没发包,NAT 会回收映射;下次再发包时,src_port 可能变了——这等于一次"客户端不知情的迁移"。RFC 9000 §9 把这种情况归到 "passive migration",处理逻辑和主动迁移一致:服务器看到新 4-tuple 就发 PATH_CHALLENGE。

Home router NAT entries usually have an expiration (30s-2min). If the client stays silent, NAT recycles the mapping; the next packet may have a different src_port — effectively a "migration the client doesn't know about". RFC 9000 §9 calls this "passive migration", handled identically: the server sees a new 4-tuple and sends PATH_CHALLENGE.
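Both cases funnel into the same server-side logic. A sketch (class and method names are mine) of RFC 9000 §8.2 path validation: an unseen 4-tuple triggers a PATH_CHALLENGE carrying a random nonce, and the path counts as validated only when PATH_RESPONSE echoes it exactly:

```python
import os

class PathValidator:
    """Server-side sketch of QUIC path validation (RFC 9000 §8.2 / §9)."""
    def __init__(self):
        self.validated = set()   # 4-tuples proven reachable
        self.pending = {}        # 4-tuple -> outstanding challenge nonce
    def on_packet(self, path):
        """Called with the (ip, port) a packet arrived from."""
        if path in self.validated or path in self.pending:
            return None          # known path: nothing to do
        nonce = os.urandom(8)    # 8 bytes of unpredictable data (§19.17)
        self.pending[path] = nonce
        return ("PATH_CHALLENGE", nonce)
    def on_path_response(self, path, data) -> bool:
        if self.pending.get(path) == data:
            del self.pending[path]
            self.validated.add(path)
            return True          # migration complete, streams resume here
        return False             # wrong/forged echo: path stays unvalidated

v = PathValidator()
frame = v.on_packet(("10.220.5.13", 34188))   # new 4-tuple after Wi-Fi -> 5G
print(frame[0])                               # PATH_CHALLENGE
print(v.on_path_response(("10.220.5.13", 34188), frame[1]))  # True
```

The nonce is the whole security argument: only someone actually on the new path can echo it, so an off-path attacker can't redirect a connection by spoofing a source address.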

CHAPTER 18

放大攻击与多路径 — UDP 的代价与红利

Amplification & Multipath — UDP's tax and bonus

3x 限制 · Retry · MPQUIC

3x limit · Retry · MPQUIC

◇ 在我们的 GET 请求里 · 主线阶段 1-2◇ In our GET request · Phase 1-2

INPUT
1228 B 客户端 Initial 字节1228 B from client Initial握手未完成 · 地址未验证handshake pending · address unverified
OUTPUT
服务器 3684 B 预算server budget 3684 B3 × 客户端字节3 × client bytes

为什么 UDP 容易被攻击放大

Why UDP invites amplification

UDP 无连接 ⇒ 服务器不知道"请求人是不是真的在这个 src_ip"。攻击者可以伪造 victim 的 src_ip 给 QUIC 服务器发 1 字节小包,让服务器回复 10000 字节大包到 victim ——典型的 DNS amp 攻击套路。QUIC 必须从协议层防住。

UDP is connectionless ⇒ the server doesn't know "is the requester really at this src_ip?" An attacker can spoof a victim's src_ip, send 1-byte QUIC packets to the server, and trick it into firing 10 000-byte responses at the victim — the classic DNS amp pattern. QUIC has to defend at the protocol level.

3 倍限制

The 3x limit

RFC 9000 §8.1 在握手未完成(即客户端地址未被验证)之前,服务器返给客户端的总字节数不能超过它从客户端收到的总字节数的 3 倍。这就是为什么客户端的第一个 Initial 包必须填到 ≥ 1200 字节——保证服务器有 3600 字节预算发完证书链。 Until the handshake completes (client address unverified), the server's total bytes to the client must not exceed 3× the bytes it has received from the client. This is exactly why the client's first Initial must pad to ≥ 1200 bytes — to give the server a 3600-byte budget to ship the cert chain.
[FIG · the 3× budget · client→server bytes vs server→client budget (≤ 3× received, until verified) · T+0: first Initial received = 1200 B → budget 3600 B (3 × 1200); typical cert chain ~4800 B (cert + intermediates) exceeds it · T+RTT: client ACKs / repeats, 1200 B + 1200 B received → budget now 7200 B, cert fits · after handshake completes: address verified, no limit — congestion control governs · the 3× rule is also what kills UDP reflection amp attacks: a forged-IP attacker only gets 3× back]
FIG 18·1 3 倍预算如何随客户端字节增长 · 证书链塞不下时分两次 RTT · 验证完成后 cap 解除。 Fig 18·1 · How the 3× budget grows with client bytes · cert chain split across two RTTs when oversized · cap removed once verified.
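The budget in the figure is one counter pair. A sketch of the server-side accounting (names are mine, not a real library's API):

```python
class AmpLimiter:
    """RFC 9000 §8.1 anti-amplification sketch: before the client's
    address is validated, the server may send at most 3x the bytes
    it has received from that address."""
    def __init__(self):
        self.received = 0
        self.sent = 0
        self.validated = False   # flips once handshake/Retry token validates
    def on_receive(self, n: int):
        self.received += n
    def sendable(self) -> float:
        if self.validated:
            return float("inf")  # only congestion control limits us now
        return max(0, 3 * self.received - self.sent)
    def on_send(self, n: int):
        assert self.validated or n <= self.sendable(), "amplification limit"
        self.sent += n

a = AmpLimiter()
a.on_receive(1200)     # client's padded Initial
print(a.sendable())    # 3600 — not enough for a ~4800 B cert chain
a.on_send(3600)
a.on_receive(1200)     # client ACKs / repeats
print(a.sendable())    # 3600 more — total budget reached 7200
```

Seen from the attacker's side, the same counters cap the reflection factor at 3× — versus the 50-100× historically achievable with open DNS resolvers.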

Retry · 服务器忙的时候

Retry · for busy servers

如果服务器收到的 ClientHello 看起来可疑(流量异常、资源紧张),可以回一个 Retry 包——里面装一个加密的 token。客户端必须重发 ClientHello 并带上 token。token 等于"我证明你在这个 IP"——下次再来直接信任。Cloudflare 在 DDoS 攻击期间会大量使用 Retry。

If a ClientHello looks suspicious (traffic spikes, resource crunch), the server can return a Retry packet carrying an encrypted token. The client must re-send ClientHello with that token. The token attests "I've proven you're at this IP" — next visits skip the check. Cloudflare hammers Retry during DDoS storms.

Multipath QUIC · 真正的红利

Multipath QUIC · the real bonus

draft-ietf-quic-multipath(截至 2026 已成熟)允许一个 QUIC 连接同时跑 Wi-Fi 和 5G 两条路径。包号空间共享,stream 数据在两条路径上自由调度。Apple iCloud Private Relay 是最早的大规模生产 MPQUIC 部署。

与 MPTCP 对比:MPTCP 只能在内核做,部署率 < 1%;MPQUIC 完全在用户态,每个 QUIC 库都可以独立实现。

draft-ietf-quic-multipath (mature by 2026) lets one QUIC connection simultaneously use Wi-Fi and 5G. Packet number spaces are shared; stream data schedules freely across paths. Apple iCloud Private Relay is the earliest large-scale MPQUIC deployment.

vs MPTCP: MPTCP is kernel-only, < 1% deployed. MPQUIC lives entirely in user space — any QUIC library can implement it independently.

CASE · APPLE
iCloud Private Relay 的多路径
iCloud Private Relay multipath

Apple 使用 MASQUE(CONNECT-UDP)把 QUIC 隧道分发给两个独立的中继节点。手机端的 NSURLSession + MPQUIC 自动在 Wi-Fi/5G 两条物理路径上做透明聚合——当 Wi-Fi 抖动时,5G 直接接管,应用层零感知。这是第一次在消费级设备上规模化跑 MPQUIC。

Apple uses MASQUE (CONNECT-UDP) to distribute QUIC tunnels across two independent relay nodes. NSURLSession + MPQUIC on the phone transparently aggregates across Wi-Fi/5G physical paths — when Wi-Fi jitters, 5G takes over instantly, with zero app awareness. The first consumer-scale MPQUIC deployment.

CHAPTER 19

连接的生命周期 — 关闭、排空、复活

Connection lifecycle — close, drain, revive

GOAWAY · CONNECTION_CLOSE · draining · idle · stateless reset

GOAWAY · CONNECTION_CLOSE · draining · idle · stateless reset

主线阶段
Phase
5 / 7-8 / 9
Layer
QUIC + HTTP/3 lifecycle
RFC
9000 §10 · 9114 §5.2
关键帧
Key frames
GOAWAY · CC · PING · NEW_TOKEN

之前 18 章都讲请求的事——但一个真实的 QUIC 连接还要走完关闭、排空、复活三种结局。生产环境里大部分 bug、半小时一次的"无原因连接重置"、CDN 滚动重启时的瞬时错误,全藏在这一章。

The previous 18 chapters followed the request itself. A real QUIC connection still has to walk through close, drain, revive. Most production bugs, the "mysterious connection resets" every 30 minutes, the transient errors during CDN rolling restarts — they all hide in this chapter.

◇ 在我们的 GET 请求里 · 主线阶段 5 / 7-8 / 9◇ In our GET request · Phase 5 / 7-8 / 9

INPUT
活的连接 · idle 15 分钟live connection · idle 15 minCDN 决定回收 / 进程升级 / 用户切网CDN recycles · server upgrade · user goes offline
OUTPUT
closed · drained · 或 resetclosed · drained · or reset3 PTO 后真正死亡 · 状态从客户端内存抹除truly dead 3 PTO later · state erased from client memory

四种结局 · The four endings

The four endings

优雅关闭 · Graceful close
Graceful close
服务器先发 GOAWAYRFC 9114 §5.2,H3 帧 0x07)告诉客户端"新流我不接,已开的流我处理完"。等所有 stream 跑完,发 CONNECTION_CLOSERFC 9000 §19.19,QUIC 帧 0x1c)正式结束。
Server sends GOAWAY first (RFC 9114 §5.2, H3 frame 0x07): "no new streams, but I'll finish in-flight ones". Once every stream completes, it sends CONNECTION_CLOSE (RFC 9000 §19.19, QUIC frame 0x1c) for real.
立即关闭 · Immediate close
Immediate close
不要 GOAWAY 这一步,直接发 CONNECTION_CLOSE(error=N)所有进行中的流立即收到 RESET_STREAM。常见于客户端检测到加密协议错误时——比如 PN 单调性被破坏(§13.2.3)。
Skip GOAWAY entirely and send CONNECTION_CLOSE(error=N) at once. All in-flight streams receive RESET_STREAM. Common when the client detects a crypto-layer violation — e.g. PN monotonicity broken (§13.2.3).
空闲超时 · Idle timeout
Idle timeout
RFC 9000 §10.1:双方在 TP 里协商出 max_idle_timeout,取较小值。30 秒没收到任何包,连接静默销毁——不发 CC,不通知对端。这是 NAT 表项过期的常态。要保活:发 PING 帧(§19.2)刷新计时器。
RFC 9000 §10.1: both ends negotiate max_idle_timeout in TP, take the smaller. After 30 s with no packets, the connection is silently destroyed — no CC, no peer notification. This is also how NAT entries die. To prevent: send PING (§19.2) to reset the timer.
无状态重置 · Stateless reset
Stateless reset
服务器进程崩了重启,找不到客户端发的 1-RTT 包对应的连接状态。它没有密钥发 CC——只能发一个看起来像随机噪声的 Stateless Reset 包(§10.3),末尾带 16 字节 reset_token(来自对端之前 NEW_CONNECTION_ID 时分配的)。客户端识别 token 后才能销毁本地连接。
Server process crashes and restarts, can't match the client's 1-RTT packet to any connection state. It has no key to send CC — only a packet that looks like random noise: a Stateless Reset (§10.3) ending in the 16-byte reset_token (issued earlier via NEW_CONNECTION_ID). Only the client can recognise that token, then tear down locally.

三态机 · Closing / Draining / Closed

Three states · Closing / Draining / Closed

CONNECTION CLOSE STATE MACHINE · RFC 9000 §10.2

ACTIVE (streams flowing · PING keeps alive)
  ── send CC ──────────────→ CLOSING  (resend CC on any recv · timer = 3 × PTO)
  ── recv CC ──────────────→ DRAINING (skip CLOSING, go straight to DRAINING)
  ── recv stateless reset ──→ DRAINING (immediate)
  ── idle timeout ──────────→ CLOSED  (silent · no CC, no DRAINING)
CLOSING
  ── timer fires · recv CC ─→ DRAINING (discard incoming · no resend)
DRAINING
  ── 3 × PTO ─→ CLOSED
FIG 19·1 RFC 9000 §10.2 状态机 · ACTIVE → CLOSING / DRAINING → CLOSED · 三条捷径分别为:收 CC、收 stateless reset、idle 超时。 Fig 19·1 · RFC 9000 §10.2 state machine · ACTIVE → CLOSING / DRAINING → CLOSED · three short-cuts: receiving CC, receiving stateless reset, idle timeout.

为什么需要 Draining

Why draining exists

关闭不能立刻完成——因为对端可能还在 in-flight 中送包过来。如果端点立刻销毁连接状态、再开一个新连接,新连接可能收到旧连接的包,把它误当成新连接的握手包处理——后果可能很严重。

RFC 9000 §10.2 的解法是:发完 CONNECTION_CLOSE 后进入 closing 状态,期间每收到一个包就再回一次 CC(CC 幂等,避免对端不断重试);收到对端的 CC 后进入 draining,只丢包、不再回应;整个 closing/draining 期至少持续 3 个 PTO,之后才进入 closed 销毁内存。按典型 PTO 折算大约 100-300ms——这才是 QUIC 连接关闭的真实耗时,不是你看到的"立刻"。

Close cannot complete instantly — the peer might still be sending packets in-flight. If an endpoint frees state immediately and opens a fresh connection, the fresh one might receive the old connection's packets and mistake them for new-connection handshake traffic — potentially catastrophic.

RFC 9000 §10.2's fix: after sending CONNECTION_CLOSE, enter closing and answer every incoming packet with another CC (CC is idempotent, so the peer's retries converge). On receiving the peer's CC, enter draining and silently drop everything. The combined closing/draining period lasts at least 3 × PTO — with typical PTOs roughly 100-300 ms. Only then does the endpoint enter closed and free its memory. That is the real cost of closing a QUIC connection, not the "instant" you see.
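The state machine condenses to a few lines. A toy version (state names from the RFC; the API is mine) covering the three transitions in Fig 19·1:

```python
class CloseStateMachine:
    """Toy RFC 9000 §10.2 close/drain state machine."""
    def __init__(self):
        self.state = "active"
    def close(self) -> str:
        """Local decision to close the connection."""
        self.state = "closing"
        return "send CONNECTION_CLOSE"
    def on_packet(self, frame: str):
        if self.state == "closing":
            if frame in ("CONNECTION_CLOSE", "STATELESS_RESET"):
                self.state = "draining"       # peer confirmed: stop answering
                return None
            return "resend CONNECTION_CLOSE"  # idempotent reply to stragglers
        if self.state == "draining":
            return None                       # discard silently
    def on_timer_3pto(self):
        """Fires once >= 3 x PTO has elapsed in closing/draining."""
        if self.state in ("closing", "draining"):
            self.state = "closed"             # state may finally be freed

m = CloseStateMachine()
print(m.close())                     # send CONNECTION_CLOSE
print(m.on_packet("STREAM"))         # resend CONNECTION_CLOSE
m.on_packet("CONNECTION_CLOSE")      # -> draining
m.on_timer_3pto()
print(m.state)                       # closed
```

The deliberate asymmetry — closing answers, draining stays mute — is what lets both endpoints converge without an infinite CC ping-pong.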

GOAWAY · HTTP/3 层的优雅

GOAWAY · HTTP/3's grace

HTTP/3 GOAWAY frame · sent on control stream · RFC 9114 §5.2
; H3 frame format: Type=0x07, Length, StreamID
[GOAWAY frame]
  Type     = 0x07   ; goaway
  Length   = 4      ; bytes after length
  StreamID = 0x14   ; "I won't process any stream >= 20"

; Semantics:
;  - Streams with id <  0x14: server WILL complete them
;  - Streams with id >= 0x14: server WILL reject (H3_REQUEST_REJECTED)
;  - Client MUST retry rejected requests on a NEW connection

; Servers can send GOAWAY multiple times, each one LOWERING the StreamID.
; Final GOAWAY can be StreamID=0 ("no more streams period"), then CC.
PRACTICAL · 滚动重启 PRACTICAL · Rolling restart CDN(Cloudflare、Fastly、Akamai)滚动重启边缘节点时必须正确实现 GOAWAY,否则上百万个长连接会被一次性 reset,客户端瞬时全员重连 = 雪崩。正确做法:先发 GOAWAY(stream_id=∞) 标记 "不接新请求",等 ~30 秒让 in-flight 完成,再发 GOAWAY(0) + CONNECTION_CLOSE。Cloudflare 的 Pingora 框架专门为这套逻辑做了状态机。 CDNs (Cloudflare, Fastly, Akamai) must implement GOAWAY correctly during edge node rolling restarts, or millions of long-lived connections get reset at once — every client reconnects simultaneously = thundering herd. Correct sequence: send GOAWAY(stream_id=∞) marking "no new requests", wait ~30 s for in-flight to drain, then GOAWAY(0) + CONNECTION_CLOSE. Cloudflare's Pingora framework has a dedicated state machine for this.

CONNECTION_CLOSE 错误码

CONNECTION_CLOSE error codes

CONNECTION_CLOSE 帧带一个错误码——按"是 QUIC 层错还是 H3 层错"分两种:

CONNECTION_CLOSE carries an error code — split into "QUIC-layer" vs "H3-layer":

frame 0x1c · QUIC 层 QUIC-layer | code | frame 0x1d · H3 层(透传)H3-layer (passthrough) | code
NO_ERROR | 0x00 | H3_NO_ERROR | 0x0100
INTERNAL_ERROR | 0x01 | H3_GENERAL_PROTOCOL_ERROR | 0x0101
CONNECTION_REFUSED | 0x02 | H3_INTERNAL_ERROR | 0x0102
FLOW_CONTROL_ERROR | 0x03 | H3_STREAM_CREATION_ERROR | 0x0103
STREAM_LIMIT_ERROR | 0x04 | H3_CLOSED_CRITICAL_STREAM | 0x0104
STREAM_STATE_ERROR | 0x05 | H3_FRAME_UNEXPECTED | 0x0105
PROTOCOL_VIOLATION | 0x0a | H3_REQUEST_REJECTED | 0x010b
CRYPTO_ERROR(N) | 0x0100+N | H3_VERSION_FALLBACK | 0x0110

完整清单:RFC 9000 §20 列 18 个 QUIC 错误码;RFC 9114 §8.1 列 17 个 H3 错误码。CRYPTO_ERROR(N) 把所有 TLS Alert 透传出来——比如 CRYPTO_ERROR(0x114) = TLS bad_record_mac(Alert 20)。

Full lists: RFC 9000 §20 defines 18 QUIC error codes; RFC 9114 §8.1 defines 17 H3 error codes. CRYPTO_ERROR(N) tunnels any TLS Alert — e.g. CRYPTO_ERROR(0x114) = TLS bad_record_mac (alert 20).
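The alert mapping is pure arithmetic, worth pinning down because the offset is easy to misread in hex dumps. A sketch (alert names from RFC 8446 §6; helper names are mine):

```python
# A few common TLS AlertDescription values (RFC 8446 §6)
TLS_ALERTS = {
    20: "bad_record_mac",
    42: "bad_certificate",
    50: "decode_error",
    70: "protocol_version",
    80: "internal_error",
}

def crypto_error(alert: int) -> int:
    """RFC 9001 §4.8: a TLS alert surfaces as QUIC error 0x0100 + alert."""
    return 0x0100 + alert

def describe(code: int) -> str:
    """Map a QUIC error code in the CRYPTO_ERROR range back to its alert."""
    if 0x0100 <= code <= 0x01FF:
        alert = code - 0x0100
        return f"CRYPTO_ERROR({TLS_ALERTS.get(alert, alert)})"
    return hex(code)

print(hex(crypto_error(20)))   # 0x114 — bad_record_mac
print(describe(0x0132))        # CRYPTO_ERROR(decode_error) — alert 50
```

When a CONNECTION_CLOSE lands with a code between 0x0100 and 0x01ff, subtracting 0x0100 hands you the exact TLS alert to chase.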

Stateless Reset · 服务器丢状态后的最后一招

Stateless Reset · the last resort after state loss

stateless reset · looks like noise · ends with reset_token · RFC 9000 §10.3 + §21.11
; 任意长度 packet (≥ 22 B),最后 16 字节 = 预分配的 reset_token
; arbitrary-length packet (≥ 22 B), last 16 B = pre-issued reset_token
+--------+-----------------------------+----------------------+
| fixed  | unpredictable random        | reset_token (16 B)   |
| bit    | (≥ 5 B, looks like PN)      |                      |
+--------+-----------------------------+----------------------+
; The receiver finds a CID in its table whose stateless_reset_token matches
; the last 16 bytes — that proves the peer "really lost state".
; Without the token, this would be indistinguishable from random injection.
FIELD NOTE · token 怎么分发 FIELD NOTE · how the token gets there 服务器在每次发 NEW_CONNECTION_ID(RFC 9000 §18.2)都会附上一个 stateless_reset_token,由 HMAC(reset_secret, CID) 派生。客户端把所有看到的 token 存起来;下次如果收到一个"看起来像随机包"且末尾 16 字节命中其中一个 token,就触发 stateless reset 销毁路径。无密钥下的状态恢复——这是 QUIC 工程最优雅的设计之一。 The server attaches a stateless_reset_token every time it sends NEW_CONNECTION_ID (RFC 9000 §18.2), derived as HMAC(reset_secret, CID). The client stores every token it's ever seen. Next time it receives a "looks-random" packet whose last 16 bytes match a stored token, it triggers the stateless-reset teardown path. Keyless state recovery — one of the most elegant designs in QUIC engineering.

KEY_UPDATE · 长连接的密钥滚动

KEY_UPDATE · key rotation on long-lived connections

如果一条连接活了几小时(比如 WebSocket 替代品),用同一把 1-RTT 密钥发太多包会增加分析攻击面。RFC 9001 §6 给出了原地滚动密钥的机制:发送方把 short header 的 Key Phase 位(1 bit)翻转,并用派生的下一代密钥加密。接收方看到 Key Phase 变了,跑一次 HKDF 派生新密钥解密。这一切不需要新一轮握手

If a connection lives for hours (e.g. as a WebSocket replacement), using the same 1-RTT key for too many packets opens analysis attack surface. RFC 9001 §6 defines in-place key rotation: the sender flips the short-header's Key Phase bit (1 bit) and encrypts with the next-generation derived key. The receiver notices Key Phase changed, runs an HKDF step to derive the new key, decrypts. All this without a new handshake.
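The derivation step is a single HKDF-Expand-Label call with the label "quic ku" (RFC 9001 §6.1). A self-contained sketch using only the standard library — SHA-256, empty context, one output block, placeholder secret rather than a real trace:

```python
import hashlib
import hmac
import struct

def hkdf_expand_label(secret: bytes, label: bytes, length: int = 32) -> bytes:
    """TLS 1.3 HKDF-Expand-Label (RFC 8446 §7.1), SHA-256, empty context.
    One HMAC block suffices since length <= 32 here."""
    assert length <= hashlib.sha256().digest_size
    full = b"tls13 " + label
    info = struct.pack(">H", length) + bytes([len(full)]) + full + b"\x00"
    # HKDF-Expand, first block: T(1) = HMAC(secret, info || 0x01)
    return hmac.new(secret, info + b"\x01", hashlib.sha256).digest()[:length]

def next_1rtt_secret(current: bytes) -> bytes:
    """RFC 9001 §6.1: secret_{n+1} = HKDF-Expand-Label(secret_n, "ku")."""
    return hkdf_expand_label(current, b"quic ku")

s0 = bytes(32)                 # placeholder secret, not from a real session
s1 = next_1rtt_secret(s0)
s2 = next_1rtt_secret(s1)
print(len(s1), s1 != s0, s2 != s1)   # 32 True True
```

Each side can also pre-compute the next generation, which is why a flipped Key Phase bit costs one HKDF step instead of one round trip.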

"关闭不是事件,是过程。" "Close isn't an event, it's a process." Martin Thomson · QUIC WG · RFC 9000 design note
CHAPTER 20

实现现状 — 谁在跑 HTTP/3

Implementations — who runs HTTP/3 today

2026 年的版图

the 2026 landscape

浏览器

Browsers

Chrome / Edge
QUICHE C++
Firefox
neqo · Rust
Safari
URLSession
Brave / Opera
= Chromium

服务器 / 反向代理

Servers / reverse proxies

Cloudflare edge
quiche
Caddy
quic-go
nginx 1.26+
quictls
LiteSpeed
lsquic
Apache mod_http3
实验性experimental
Google GFE
internal quiche
IIS / Win Server 2025
msquic (kernel-mode)

库与产品 · 星座图

Libraries & products · constellation map

[FIG · 2026 QUIC library constellation · google/quiche (C++ — Chrome/Edge, Envoy, gRPC · GFE) · cf/quiche (Rust · C-API — CF edge, nginx 1.26, curl --http3) · quinn (Rust · async — Hyper · Tonic, IPFS, Cloudflare WT) · msquic (C · kernel-mode — IIS · WinServer, .NET HttpClient) · ngtcp2+nghttp3 (C · lean — Node.js QUIC, curl alt, IETF interop) · quic-go (Go — Caddy) · s2n-quic (AWS · Rust) · aioquic (Python · research) · lsquic (LiteSpeed · C) · size of circle ≈ install base · arrows = "powers"]
FIG 20·1 2026 年 QUIC 库版图 · 圈大小 ≈ 部署量 · 连线表示"驱动"关系 · 同色系 = 同生态。 Fig 20·1 · The 2026 QUIC library constellation · circle size ≈ install base · lines mean "powers" · same colour = same ecosystem.

关键库

Key libraries

library | lang | 谁用 Used by | 特点 Strength
Google quiche | C++ | Chrome · gRPC · Envoy | 最早最完整 most complete
Cloudflare quiche | Rust | CF edge · nginx-quic | 最快 C-API fastest C-API
msquic | C | Windows Server · .NET | 内核态加速 kernel-mode boost
quic-go | Go | Caddy · IPFS | Go 生态唯一 the Go-ecosystem standard
quinn | Rust | Hyper · Tonic · IPFS | 异步原生 async-native
ngtcp2 + nghttp3 | C | curl · Node.js | 最克制最稳 lean & rock-stable
aioquic | Python | 学术研究 · CTF research · CTF | 易读源码 readable source
s2n-quic | Rust | AWS | 安全审计严格 security-first
picoquic | C | 学术参考实现 academic reference | IETF interop 主力 IETF interop workhorse
lsquic | C | LiteSpeed | 嵌入式部署 embeddable

部署份额

Deployment share

45%
Top 1M 站点已开 H3
of Top 1M sites support H3
~35%
全网 web 请求走 H3
of all web requests use H3
~8%
UDP/443 被中间盒阻断
UDP/443 blocked by middlebox

来源:Web Almanac 2025、Cloudflare Radar、W3Techs。CDN 默认开启(Cloudflare / Fastly / Akamai / AWS CloudFront / Google Cloud LB)是普及主因。

Source: Web Almanac 2025, Cloudflare Radar, W3Techs. CDN default-on (Cloudflare / Fastly / Akamai / AWS CloudFront / Google Cloud LB) drove the bulk of adoption.

CHAPTER 21

性能数据 — 真实生产里赢了多少

Performance — what HTTP/3 actually wins in production

不是包治百病的灵药

not a panacea

大厂的实测

Real production numbers

公司 · 场景 Company · scenario | 指标 Metric | 提升 Improvement | 来源 Source
Google · YouTube India (4G) | 视频卡顿率中位 video rebuffer median | −20% ~ −40% | SIGCOMM 2017 · Langley et al.
Google · Search | tail latency | −16% | SIGCOMM 2017
Meta · Facebook App | 请求错误率 request error rate | −5% | Meta Engineering · 2020
Meta · video stream | video stall rate | −20%+ | Meta Engineering · 2020
Cloudflare · returning users | 0-RTT median TTFB | −50ms | CF blog · 0-RTT resumption
Cloudflare · global | 弱网 TTFB poor-link TTFB | −10% ~ −15% | CF Radar · 2024
Fastly · GA launch | cold connect | −40% | Fastly blog · RFC 9000 GA
Apple · iCloud Private Relay | 切网 RTT 抖动 network-switch RTT jitter | ~ 0(看不出)~ zero (imperceptible) | WWDC 2022 · session 110337

数字来自厂商公开 blog / SIGCOMM 论文。原文如有更新请以最新版本为准;上表数字保留首次公开值。

Numbers cite each vendor's first public disclosure on blog or SIGCOMM. If the post has been updated since, the original disclosure value is kept here.

何时 HTTP/3 不如 HTTP/2

When H3 loses to H2

场景 1 · 内网零丢包
Case 1 · lossless internal net
机房内部微服务
intra-DC microservices

数据中心内部丢包 < 0.01%,TCP HOL 几乎不发生。但 HTTP/3 用户态 UDP 处理带来 2× CPU 成本。结果是纯吞吐 H2 over TCP 完胜。gRPC 至今主流仍是 HTTP/2。

Intra-DC loss < 0.01%, TCP HOL almost never fires. But HTTP/3's user-space UDP carries a roughly 2× CPU tax, so on pure throughput H2 over TCP wins outright. gRPC still defaults to HTTP/2.

场景 2 · UDP 被封
Case 2 · UDP blocked
企业网 / 金融网络
enterprise / fin networks

~8% 连接尝试因 UDP/443 被防火墙阻断。浏览器 Happy Eyeballs 会自动 fallback 到 H2 over TCP——但用户先付了"试错"的延迟。

~8% of connection attempts get UDP/443 blocked by firewalls. Browser Happy Eyeballs auto-falls back to H2 over TCP — but the user has already paid the "tried it and failed" latency.

场景 3 · 后端瓶颈
Case 3 · backend-bound
SSR + 重 React
SSR + heavy React

如果你的 LCP 主要花在服务端渲染或 JS 主线程上,省下来的 RTT 在水池里游泳,看不见。Patrick Meenan:H3 提升下限,不抬上限。

If your LCP is dominated by SSR or JS main-thread work, the RTTs you save are drops in a pond: invisible. Patrick Meenan: H3 raises the floor, not the ceiling.

场景 4 · H3 大杀器
Case 4 · H3 dominates
移动 + 弱网 + 多请求
mobile + weak net + many requests

手机用户、4G/5G、丢包 1-3%、页面有 50+ 子请求——这是 H3 设计场景。0-RTT、连接迁移、流独立丢包恢复全用上。

Mobile users, 4G/5G, 1-3% loss, page has 50+ subresources — H3's home turf. 0-RTT, migration, per-stream loss recovery all fire.

"如果你不知道你的用户在哪里,
HTTP/3 就是合理的默认选择。"
"If you don't know where your users are,
HTTP/3 is the sensible default."
Lucas Pardue · Cloudflare · IETF 116
CHAPTER 22

工程实战 — 部署与调试

Field work — deploying and debugging

DNS · curl · Wireshark · qlog · sysctl

DNS · curl · Wireshark · qlog · sysctl

协议发现 · DNS 路径

Discovery · the DNS path

老路 · Alt-Svc 头
Old way · Alt-Svc
第一次访问必走 TCP
first visit always TCP

服务器在响应头里加一行:alt-svc: h3=":443"; ma=86400。客户端记 24 小时,第二次访问才走 H3。意味着新用户的首屏永远拿不到 0-RTT

Server appends a response header: alt-svc: h3=":443"; ma=86400. Client caches it 24h, uses H3 from the next visit. Meaning new users never get 0-RTT on the first paint.
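To make the cached hint concrete, here is a minimal Python sketch that parses the simple quoted form of an Alt-Svc value shown above. It is a hedged toy for exactly this header shape, not a full RFC 7838 parser, and `parse_alt_svc` is a name invented here:

```python
def parse_alt_svc(value: str):
    """Parse a simple Alt-Svc value like 'h3=":443"; ma=86400'."""
    services = []
    for entry in value.split(","):
        parts = [p.strip() for p in entry.split(";")]
        proto, _, authority = parts[0].partition("=")
        params = {}
        for p in parts[1:]:
            k, _, v = p.partition("=")
            params[k] = v
        services.append({
            "protocol": proto,
            "authority": authority.strip('"'),
            # RFC 7838 defaults ma (freshness lifetime) to 24 h = 86400 s
            "max_age": int(params.get("ma", 86400)),
        })
    return services

print(parse_alt_svc('h3=":443"; ma=86400'))
# [{'protocol': 'h3', 'authority': ':443', 'max_age': 86400}]
```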

新路 · HTTPS RR (RFC 9460)
New way · HTTPS RR (RFC 9460)
DNS 里直接告诉你
DNS knows up-front

在 DNS 区文件加一行:
ursb.me. 300 IN HTTPS 1 . alpn="h3,h2" ipv4hint="39.105.102.252"
浏览器解析 DNS 就拿到了——第一次访问直接走 H3。配合 RFC 8484 DoHRFC 9250 DoQ,连 DNS 查询本身都加密。

Add one line to the DNS zone:
ursb.me. 300 IN HTTPS 1 . alpn="h3,h2" ipv4hint="39.105.102.252"
The browser gets it at DNS resolution time — first visit goes straight to H3. Combined with RFC 8484 DoH or RFC 9250 DoQ, the DNS query itself is encrypted.
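The `alpn="h3,h2"` parameter has a compact wire shape. A sketch of how it serialises under RFC 9460 (SvcParamKey 1 = alpn; the value is a sequence of 1-byte-length-prefixed alpn-ids; `encode_alpn_param` is a hypothetical helper):

```python
import struct

def encode_alpn_param(alpn_ids):
    """Encode the RFC 9460 'alpn' SvcParam: key 1, then length-prefixed ids."""
    value = b"".join(bytes([len(a)]) + a.encode("ascii") for a in alpn_ids)
    # SvcParam wire format: key (2 bytes) | value length (2 bytes) | value
    return struct.pack("!HH", 1, len(value)) + value

print(encode_alpn_param(["h3", "h2"]).hex())  # 00010006026833026832
```

Recent `dig` releases understand the record type natively, so `dig https ursb.me` shows the decoded parameters rather than raw bytes.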

DoQ · RFC 9250 DNS over QUIC(DoQ)是 QUIC 的第二大应用——不是 HTTP/3,是直接把 DNS 查询塞进 QUIC 流。AdGuard、NextDNS、Cloudflare 1.1.1.1 都支持。相比 DoT(DNS over TLS)省 1 RTT,相比 DoH 省 HTTP/3 那一层开销。ALPN 编号是 doq,默认端口 853。 DNS over QUIC (DoQ) is QUIC's second-biggest application — not HTTP/3, just plain DNS queries stuffed into a QUIC stream. AdGuard, NextDNS, Cloudflare 1.1.1.1 all support it. vs DoT it saves 1 RTT; vs DoH it skips the HTTP/3 overhead. ALPN doq, default port 853.

客户端工具

Client tools

verify a server speaks HTTP/3 · field commands

# curl with H3 (needs --with-quiche or --with-ngtcp2)
$ curl --http3 -I https://ursb.me/
HTTP/3 200
alt-svc: h3=":443"; ma=86400

# quiche-client (Cloudflare reference client)
$ quiche-client --no-verify https://ursb.me/

# Test a server with custom ALPN
$ nghttp3 https://ursb.me/

# Capture and decrypt in Wireshark
$ SSLKEYLOGFILE=~/keys.log /Applications/Google\ Chrome.app/...
$ tcpdump -i en0 -w cap.pcap udp port 443
# Wireshark → Preferences → TLS → (Pre)-Master-Secret log → ~/keys.log

qlog + qvis · QUIC 的标准日志

qlog + qvis · the standard QUIC log

因为 QUIC 加密了一切,光抓包看不出连接内部发生了什么。IETF 用 qlogdraft-ietf-quic-qlog-main-schema,2024 已多版)定义了一份结构化 JSON 日志格式——服务端/客户端用任何 QUIC 库都可以输出 qlog,把它扔到 qvis.quictools.info 就能看到拥塞窗口曲线、PN 单调性、ACK 时序、loss event、stream 优先级。这是 H3 调试的唯一正解。

Because QUIC encrypts everything, raw pcap shows nothing about what's happening inside. IETF defined qlog (draft-ietf-quic-qlog-main-schema, several revisions by 2024) — a structured JSON log format any QUIC library can emit. Drop it into qvis.quictools.info and you get the congestion-window curve, PN monotonicity, ACK timeline, loss events, stream priorities. The only sane debug path for H3.
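To make the "structured JSON" point concrete, a small sketch that extracts the cwnd curve qvis plots from a qlog trace. The event name follows the qlog QUIC-events draft (`recovery:metrics_updated`); the embedded sample is hypothetical data in that shape, not real capture output:

```python
import json

# Hypothetical qlog fragment in the draft main-schema shape.
sample = """
{
  "qlog_version": "0.3",
  "traces": [{
    "events": [
      {"time": 10.0, "name": "recovery:metrics_updated",
       "data": {"congestion_window": 14720, "bytes_in_flight": 3600}},
      {"time": 42.5, "name": "recovery:packet_lost",
       "data": {"header": {"packet_number": 421}}},
      {"time": 55.0, "name": "recovery:metrics_updated",
       "data": {"congestion_window": 7360}}
    ]
  }]
}
"""

def cwnd_series(qlog_text: str):
    """Return (time, congestion_window) points from metrics_updated events."""
    doc = json.loads(qlog_text)
    points = []
    for trace in doc.get("traces", []):
        for ev in trace.get("events", []):
            if ev.get("name") == "recovery:metrics_updated":
                cwnd = ev.get("data", {}).get("congestion_window")
                if cwnd is not None:
                    points.append((ev["time"], cwnd))
    return points

print(cwnd_series(sample))  # [(10.0, 14720), (55.0, 7360)]
```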

FIG 22·1 qvis 拥塞窗口面板(mock)· cwnd 蓝实线 · bytes_in_flight 淡蓝虚线 · sRTT 紫线 · 红色 ▼ 是丢包事件。 Fig 22·1 · A qvis congestion panel (mock) · cwnd solid blue · bytes_in_flight dashed · sRTT purple · red ▼ marks loss events.

服务器端调优

Server-side tuning

设置 Knob | 默认 Default | 推荐 Recommended | 为什么 Why
net.core.rmem_max | 208 KB | ≥ 2.5 MB | 单个 UDP socket 缓冲,避免突发丢包 single-socket buffer to absorb bursts
net.core.wmem_max | 208 KB | ≥ 2.5 MB | 同上 · 发送方 same · send side
GSO / GRO | off | on | 让网卡分片 = CPU 降一半 NIC segmentation = halve CPU
SO_REUSEPORT | — | on · per-core | 用 eBPF 把 CID 路由到 CPU eBPF-route CID → CPU
io_uring | — | experimental | 异步 IO · 减少系统调用 async I/O · fewer syscalls
QPACK dynamic table | 4 KB | 4-16 KB | 大 = 压缩好但 HOL 风险 larger = better compression, more HOL risk
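The first two rows of the table translate directly into a drop-in sysctl fragment (2.5 MB expressed in bytes; tune to your traffic profile):

```ini
# /etc/sysctl.d/90-quic.conf — UDP socket buffers per the table above
net.core.rmem_max = 2621440   # ≥ 2.5 MB receive buffer
net.core.wmem_max = 2621440   # ≥ 2.5 MB send buffer
```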
DEVOPS 第一次在 nginx 1.26 上跑 HTTP/3 的必修课:(1) 用 quictls fork 替代 OpenSSL,否则编不过;(2) 配置 listen 443 quic reuseport;——少了 reuseport 单核 CPU 直接吃满;(3) 在同一份配置里保留 listen 443 ssl; 走 TCP fallback;(4) 加 add_header alt-svc 'h3=":443"; ma=86400';——一开始我就忘了这条,浏览器永远走不到 H3。 The compulsory checklist for first-time HTTP/3 on nginx 1.26: (1) replace OpenSSL with the quictls fork or it won't build; (2) configure listen 443 quic reuseport; — without reuseport one CPU core pegs at 100%; (3) keep listen 443 ssl; in the same config for TCP fallback; (4) add add_header alt-svc 'h3=":443"; ma=86400'; — I once forgot this and the browser never upgraded.
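Items (2)-(4) of that checklist condense into one server block; item (1) is a build-time step. A sketch only — certificate paths and server_name are placeholders:

```nginx
# nginx 1.26 · H3 with TCP fallback (sketch; paths are placeholders)
server {
    listen 443 quic reuseport;   # (2) QUIC listener; without reuseport one core pegs at 100%
    listen 443 ssl;              # (3) TCP fallback in the same server block
    server_name ursb.me;

    ssl_certificate     /etc/ssl/example.pem;
    ssl_certificate_key /etc/ssl/example.key;

    # (4) advertise H3 so returning clients upgrade
    add_header alt-svc 'h3=":443"; ma=86400' always;
}
```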
CHAPTER 23

批评与争议 — HTTP/3 的负面成本

Critique — HTTP/3's downside ledger

没有免费的午餐

no free lunches

争议 1 · 用户态 CPU 成本

Critique 1 · user-space CPU tax

Fastly 在 2020 年公开的实测:在相同吞吐下,HTTP/3 的 CPU 消耗是 HTTP/2 over TLS 的 1.5x ~ 2x。原因:每个 UDP 包都要进出用户态、做独立 AEAD 加解密、维护用户态拥塞控制状态。这是 CDN 厂商真正头疼的事——同样的服务器,H3 流量上限只有 H2 的一半。

Fastly's 2020 disclosure: at equal throughput, HTTP/3 burns 1.5x – 2x the CPU of HTTP/2 over TLS. Reason: every UDP packet crosses user/kernel boundary, does its own AEAD encrypt/decrypt, and maintains user-space cc state. The real CDN pain — the same box can carry half the H3 traffic of H2.

解药正在路上

Cures in progress

io_uring + Generic Segmentation Offload
Linux 5.x+ 的 io_uring 把"批量发包"变成可能。配合 GSO,让网卡分片 UDP——CPU 成本立刻降一半。
Linux 5.x+ io_uring enables "batch send". Combined with GSO, the NIC segments UDP — halving CPU cost overnight.
AF_QUIC (Linux kernel)
2024 年 LWN 上有提案:把 QUIC 加密/拥塞控制塞回内核,给个新的 socket 类型 AF_QUIC。还在讨论,远未合入。
A 2024 LWN proposal: push QUIC encryption + cc back into the kernel, add a new socket family AF_QUIC. Still discussion, far from merge.
msquic kernel-mode (Windows)
Microsoft 已经把 msquic 跑在 Windows 内核态,IIS 直接享受。比纯用户态实现快 30-40%。
Microsoft already runs msquic in Windows kernel mode; IIS reaps the benefit directly. 30-40% faster than pure user-space.
DPDK / XDP offload
Cloudflare 在边缘节点用 XDP 在内核网络栈早段过滤 UDP/443,绕开 socket 处理,CPU 又省 20%。
Cloudflare's edge nodes use XDP to filter UDP/443 in the kernel networking fastpath, bypassing socket processing — another 20% off.

争议 2 · 可观测性"瞎了"

Critique 2 · observability "goes blind"

过去运营商靠 TCP 序列号、SACK、SNI 明文做带宽统计、QoS 调度、DPI 拦截。QUIC 把这些全加密了。运营商失去了路径上的"抓手"——这是有意的,但也是一些行业(金融监管、合规审计、家长控制)真正头疼的事。Spin Bit 是部分妥协,但远远不够。

Carriers used to measure bandwidth, do QoS, run DPI based on TCP seq/SACK/cleartext SNI. QUIC encrypted all of that. Operators lost their "handles" on the path — this was intentional, but it's a real pain for industries like financial regulation, compliance auditing, parental control. Spin Bit is a partial compromise; nowhere close to enough.

争议 3 · HTTP/2 其实够用?

Critique 3 · was HTTP/2 enough?

Patrick Meenan、Steve Souders 等 Web 性能老兵不停指出:如果你的网站性能瓶颈是 JS 执行SSR 等待第三方脚本,HTTP/3 帮你的部分微乎其微。这是真的。HTTP/3 抬升的是分布的下限——P95、P99 的弱网用户体验。如果你的产品根本没有 P95 弱网用户(比如你只服务美国/欧洲城市光纤),花精力上 H3 的 ROI 接近零。

Patrick Meenan, Steve Souders and other web-perf veterans keep pointing out: if your bottleneck is JS execution, SSR wait, or third-party scripts, HTTP/3 helps you very little. True. HTTP/3 lifts the floor of the distribution — P95/P99 weak-link users. If your product has no P95 weak-link users (e.g. you only serve fiber-grade US/EU cities), the ROI of switching to H3 is near zero.

FAIRNESS 公平地说,对开发者来说 H3 的实际损益取决于场景:服务全球移动用户 ⇒ 显著正收益;服务局域网/桌面办公 ⇒ 几乎中性;内部微服务 ⇒ 负收益。"所有人都要上 H3"不是技术决策——是 CDN 厂商希望你这么做。 Honestly: H3's payoff depends on the workload. Serving global mobile users ⇒ clear net win. Serving LAN / desktop office ⇒ roughly neutral. Internal microservices ⇒ net loss. "Everyone should be on H3" is not a technical conclusion — it's what CDNs want you to believe.
CHAPTER 24

HTTP/3 之后 — QUIC 上长出来的下一代

After HTTP/3 — what's growing on top of QUIC

WebTransport · MASQUE · MoQ · HTTP/4?

WebTransport · MASQUE · MoQ · HTTP/4?

HTTP/3 不是终点——它是 QUIC 这个"通用安全传输"找到的第一个杀手应用。QUIC 上正在长出一片新协议生态。下面是 2026 年的四个方向。

HTTP/3 isn't the finish line — it's the first killer app of QUIC as a "generic secure transport". A whole protocol ecosystem is growing on top. Here are 2026's four directions.

① WebTransport · 取代 WebSocket

① WebTransport · the WebSocket successor

WebSocket 跑在 HTTP/1.1 Upgrade 上,有 TCP head-of-line,无可靠/不可靠混合、不适合 RTC。WebTransport over HTTP/3W3C WebTransport API + draft-ietf-webtrans-http3)给浏览器开放:(a) 可靠双向流;(b) 不可靠 datagram(RFC 9221)。Chrome 自 97 起原生支持,ALPN 复用 h3。云游戏 / 在线协作 / 实时翻译已经开始迁。

WebSocket runs on HTTP/1.1 Upgrade, inherits TCP HOL, lacks a mixed reliable/unreliable channel, and is awful for RTC. WebTransport over HTTP/3 (W3C WebTransport API + draft-ietf-webtrans-http3) exposes to browsers: (a) reliable bidi streams; (b) unreliable datagrams (RFC 9221). Chrome shipped support in 97; ALPN reuses h3. Cloud gaming, collaboration, live translation are migrating.

② MASQUE · 下一代 VPN 隧道

② MASQUE · the next-gen VPN tunnel

CONNECT-UDP
用 H3 转发 UDP
tunnel UDP via H3
CONNECT-IP
用 H3 转发整个 IP 包
tunnel IP via H3
Capsule
通用 H3 容器协议
generic H3 container
CAPSULE · RFC 9297 RFC 9297 · HTTP Datagrams and the Capsule Protocol 是 CONNECT-UDP / CONNECT-IP 的容器规范——定义了如何在 HTTP/3 流里携带 "类似数据报" 的载荷。是整个 MASQUE 栈的第三块拼图,我之前漏了。 RFC 9297 · HTTP Datagrams and the Capsule Protocol is the container spec behind CONNECT-UDP / CONNECT-IP — it defines how "datagram-like" payloads ride inside HTTP/3 streams. The third piece of the MASQUE puzzle that I missed in my first pass.
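Capsules, like almost everything in this stack, are built from RFC 9000 varints. A tiny sketch of both encodings (`quic_varint` and `capsule` are names invented here):

```python
def quic_varint(n: int) -> bytes:
    # RFC 9000 §16: variable-length integer with a 2-bit length prefix
    if n < 1 << 6:
        return n.to_bytes(1, "big")
    if n < 1 << 14:
        return (n | 0x4000).to_bytes(2, "big")
    if n < 1 << 30:
        return (n | 0x8000_0000).to_bytes(4, "big")
    if n < 1 << 62:
        return (n | 0xC000_0000_0000_0000).to_bytes(8, "big")
    raise ValueError("value too large for a QUIC varint")

def capsule(ctype: int, payload: bytes) -> bytes:
    # RFC 9297: Capsule = Type (varint) + Length (varint) + Value
    return quic_varint(ctype) + quic_varint(len(payload)) + payload

# A DATAGRAM capsule (type 0x00) carrying three opaque bytes:
print(capsule(0x00, b"abc").hex())  # 0003616263
```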
CASE · ICLOUD PRIVATE RELAY
两跳隔离:MASQUE 的杀手案例
Two-hop relay: the MASQUE killer demo

Apple iCloud Private Relay(iOS 15+ 的 iCloud+ 功能)是目前最大量产的 MASQUE 实战。它的核心架构不是"用 H3 加密一下",而是故意把信任切两半

iCloud Private Relay (iOS 15+ as part of iCloud+) is by far the largest production MASQUE deployment. Its core trick isn't "tunnel things in H3" — it's deliberately splitting trust into two halves:

MASQUE · CONNECT-UDP · TWO-HOP RELAY (iCloud Private Relay):CLIENT(iPhone · Safari)→ INGRESS(Apple · mask.icloud.com)→ EGRESS(CDN partner · CF / Akamai)→ ORIGIN(ursb.me)。入口知道客户端 IP、看不到目标;出口知道目标、拿不到客户端 IP;只有两跳合谋才能破坏隐私。Ingress knows the client IP but not the target; egress knows the target but never the client IP; only collusion of both hops breaks privacy. RFC 9298 · CONNECT-UDP / RFC 9297 · Capsule / WWDC 2022 · session 10009
FIG 24·1 两跳架构 · Apple 看得到客户端 IP 但看不到目标域名 · CDN 伙伴看得到目标但拿不到客户端 IP · 单方泄露都不够 Fig 24·1 · Two-hop split · Apple sees client IP but not the destination; CDN partner sees the destination but not the client IP · neither side alone can deanonymise.

三件关键事实:

  • 客户端→入口 用 CONNECT-UDP(RFC 9298)建 H3 隧道;隧道内载荷再用 capsule(RFC 9297)打包传给出口。
  • 入口(Apple 自营)和出口(CDN 合作伙伴,目前为 Cloudflare / Akamai / Fastly)由不同主体运营——除非两家串通,没人能同时知道"谁访问了哪里"。
  • 每个连接的客户端 IP 在入口处被替换成同一地理区域内的盲化 IP——服务器看到的 IP 仍能做粗粒度地理路由(CDN POP 选择、本地化),但精度只到城市级。

Three things to know:

  • Client → ingress uses CONNECT-UDP (RFC 9298) to set up the H3 tunnel; payloads inside are wrapped in capsules (RFC 9297) and forwarded to the egress.
  • Ingress (Apple-operated) and egress (CDN partners — currently Cloudflare, Akamai, Fastly) are run by different entities. Absent collusion, no single side knows "who visited where".
  • The client IP is rewritten at the ingress to a region-blinded IP — origins can still do coarse geo-routing (POP selection, localisation), but only at city granularity.

这是 MASQUE 至今最大、唯一商用规模的部署。它没有用 CONNECT-IP(更激进的整 IP 包封装),只用 CONNECT-UDP——Apple 不需要 VPN 全包代理的语义,只需要让 Web 流量"看起来都是同一个 IP 发出的"。剩下的 CONNECT-IP 用例(VPN 替代)还在等下一波。

This is the largest — and so far only commercial-scale — MASQUE deployment. Notably it uses only CONNECT-UDP, not CONNECT-IP (the more aggressive whole-IP-packet tunnel). Apple doesn't need full VPN semantics; it just needs web traffic to "look like it comes from one IP". The CONNECT-IP use case (full VPN replacement) is still waiting for the next wave.

③ Media over QUIC(MoQ)

③ Media over QUIC (MoQ)

HLS / DASH 延迟 5-30 秒;WebRTC 延迟 100ms 但太重、不好缓存。Media over QUIC(IETF MoQ WG 推进中)目标是亚秒级延迟 + CDN 可缓存,发布/订阅模式。预期取代体育直播、低延迟视频、合作直播的传输层。Cloudflare 已经把它内置进 Workers。

HLS / DASH have 5-30s latency; WebRTC is 100ms but heavy and uncacheable. Media over QUIC (IETF MoQ WG in progress) targets sub-second latency + CDN-cacheable, with a pub/sub model. Slated to replace transport for sports streaming, low-latency video, collaborative live. Cloudflare already ships it inside Workers.

④ HTTP/4?

④ HTTP/4?

IETF 当前的共识:未来 5-10 年内不会有 HTTP/4。理由很务实:HTTP/3 + QUIC 的扩展机制(Datagram、KEY_UPDATE、ALPN、可插拔 cc、TLS extension)已经足够柔软。要加东西(抗量子加密、FEC 前向纠错、新拥塞算法),都可以作为扩展挂在 QUIC 上,不需要新的主版本号。所以 H3 大概率会像 IPv4 那样长寿。

Current IETF consensus: no HTTP/4 in the next 5-10 years. Practical reason: HTTP/3 + QUIC's extension mechanisms (Datagram, KEY_UPDATE, ALPN, pluggable cc, TLS extensions) are flexible enough. Adding things — post-quantum crypto, FEC, new cc algorithms — all fit as QUIC extensions; no new major version needed. So H3 will likely live like IPv4 — for decades.

"我们花三十年做了一个能装下未来三十年的传输层。" "Thirty years of work for a transport layer that can hold the next thirty." Lars Eggert · IETF QUIC WG · 2022
CHEATSHEET

如果你只记 10 件事

If you only remember 10 things

这一节就是给你撕下来贴墙的

the page you'd print, pin, and screenshot

读完 24 章是一回事,下一次给同事讲清楚是另一回事。下面 10 条是这篇文章里最反直觉、最值得带走的事实——每一条都标了对应章节和最关键的 RFC 锚点。

Reading 24 chapters is one thing; explaining it cleanly to a colleague is another. The ten facts below are the most counter-intuitive takeaways — each pinned to its chapter and the single most important RFC anchor.

  1. 01
    0-RTT ≠ 免费0-RTT isn't free
    只能用在 idempotent 方法(GET/HEAD)。POST/PUT 默认有重放风险;服务器读 Early-Data: 1 决定是否降级到 1-RTT。Ch 08 · RFC 8470
    Idempotent methods only (GET/HEAD). POST/PUT on 0-RTT is replay-prone; servers inspect Early-Data: 1 to decide whether to defer to 1-RTT. Ch 08 · RFC 8470
  2. 02
    Initial 必须 padding 到 ≥ 1200 字节Initial packets must pad to ≥ 1200 bytes
    反放大攻击的 3× 预算从这里来——客户端"先付够字节",服务器才能合法回大包。Ch 06 · RFC 9000 §14.1
    The 3× anti-amplification budget starts here — the client has to "pay enough bytes first" before the server may return a large response. Ch 06 · RFC 9000 §14.1
  3. 03
    TLS 1.3 在 QUIC 里被腰斩TLS 1.3 inside QUIC is amputated
    只保留两个角色:(1) 密钥协商引擎,(2) 身份认证。record layer 整个砍掉,QUIC 自己做加密包装——这就是为什么 stock OpenSSL 不行,所有 QUIC 库都 fork BoringSSL / quictls。Ch 07 · RFC 9001
    Two roles only: (1) key-agreement engine, (2) identity auth. The record layer is gone — QUIC handles packet wrapping itself. Hence stock OpenSSL won't work; every QUIC implementation forks BoringSSL or quictls. Ch 07 · RFC 9001
  4. 04
    4 加密级 / 3 PN 空间4 encryption levels / 3 PN spaces
    Initial / 0-RTT / Handshake / 1-RTT 四套密钥;PN 空间 Initial、Handshake、Application 互相独立,各自严格单调。0-RTT 复用 Application PN 空间。Ch 09 · RFC 9001 §4
    Four key sets: Initial / 0-RTT / Handshake / 1-RTT. PN spaces (Initial, Handshake, Application) are independent and each strictly monotonic; 0-RTT shares the Application space. Ch 09 · RFC 9001 §4
  5. 05
    28 种 QUIC 帧分四族28 QUIC frames in four families
    控制(8) · 可靠性(2) · 流与流控(12) · 加密+扩展(2)。STREAM 一种类型有 8 个变体——OFF/LEN/FIN 三 bit 编进 type 字节里。Ch 10 · RFC 9000 §19
    Control (8) · reliability (2) · streams & flow ctrl (12) · crypto + ext (2). The single STREAM type carries 8 variants — OFF/LEN/FIN as three bits inside the type byte. Ch 10 · RFC 9000 §19
  6. 06
    QPACK 用双向同步流杀 HOLQPACK kills HOL with two sync streams
    encoder 流(0x02) + decoder 流(0x03);每条 HEADERS 带 Required Insert Count。表没就绪只阻塞这一条流,其它流照跑。默认动态表 4 KB——比 HPACK 64 KB 小得多是有意的。Ch 15 · RFC 9204
    An encoder stream (0x02) + a decoder stream (0x03); each HEADERS frame carries a Required Insert Count. Missing inserts block only that stream — others run on. The 4 KB default dynamic table (vs HPACK's 64 KB) is intentional. Ch 15 · RFC 9204
  7. 07
    迁移靠 CID,不靠 IPMigration uses CID, not IP
    Wi-Fi → 5G 时 IP/port 全换,连接照样活——服务器看的是 Destination CID。PATH_CHALLENGE / PATH_RESPONSE 做新路径验证,防伪迁。Ch 17 · RFC 9000 §9
    When Wi-Fi → 5G swaps the IP/port pair, the connection survives because the server keys on Destination CID. PATH_CHALLENGE / PATH_RESPONSE validate the new path against spoofing. Ch 17 · RFC 9000 §9
  8. 08
    Alt-Svc 永远拿不到首屏 0-RTTAlt-Svc can't deliver first-hit 0-RTT
    客户端必须先走一次 TCP 才能拿到 H3 提示。HTTPS RR(RFC 9460)在 DNS 解析时就告诉浏览器走 H3——首次访问直接 H3。要做就做 HTTPS RR。Ch 22 · RFC 9460
    Client has to make one TCP visit first before learning H3 is on. HTTPS RR (RFC 9460) carries the hint in the DNS response itself — first visit goes straight to H3. If you're serious, ship HTTPS RR. Ch 22 · RFC 9460
  9. 09
    H3 比 H2 多 1.5–2× CPU 成本H3 burns 1.5–2× the CPU of H2
    Fastly 2020 实测。原因是用户态 UDP + 每包独立 AEAD + 用户态拥塞控制。GSO + SO_REUSEPORT + (Linux) io_uring + XDP 能砍掉一半。Ch 23
    Fastly 2020 measurement. Caused by user-space UDP, per-packet AEAD, and user-space congestion control. GSO + SO_REUSEPORT + Linux io_uring + XDP can halve it. Ch 23
  10. 10
    H3 抬下限,不抬上限H3 lifts the floor, not the ceiling
    P95 / P99 弱网用户受益巨大;光纤桌面用户感觉不出。如果你的用户没有 P95 弱网那一段,迁移 ROI 接近零。这是 Patrick Meenan 的公论。Ch 21
    P95 / P99 weak-link users gain hugely; fiber-grade desktop users feel nothing. If your audience has no P95 weak-link segment, the migration ROI is near zero (Patrick Meenan's call). Ch 21
DEBUG · 第 11 件事 DEBUG · the bonus one qlog + qvis 是唯一调试路。QUIC 加密了一切,原始 pcap 看不到拥塞窗口、丢包、ACK 时序、stream 优先级——任何严肃的 H3 部署都必须打 qlog,扔进 qvis 看。不接 qlog 等于盲调。Ch 22 qlog + qvis is the only debug path. QUIC encrypts everything; a raw pcap shows nothing about cwnd, loss, ACK timing or stream priorities. Any serious H3 deployment must emit qlog and load it into qvis. No qlog = flying blind. Ch 22

从你按下回车,
到屏幕上跳出 200 OK
HTTP/3 用 13 步把一次请求
封装成一个 UDP 包,
跨过四层加密,
在一个 RTT 里完成。

From the moment you press Enter,
to the moment 200 OK appears,
HTTP/3 wraps a request in a UDP datagram,
crosses four encryption layers,
and finishes in a single RTT —
in thirteen movements.

FIN // END OF FIELD NOTE 06
✦ ✦ ✦