vLLM Development Activity Report - 2026-01-31

Time window: 2026-01-31 11:44 (UTC+8) ~ 2026-02-01 11:44 (UTC+8)
Statistics: New issues 9 | Closed issues 12 | New PRs 27 | Merged PRs 36 | PRs closed without merging 7

📊 Daily Development Status Summary

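During this observation window the vLLM project stayed highly active: 36 PRs were merged and 27 new PRs were opened, reflecting a steady pace of integration and development. The main themes were deeper AMD/ROCm ecosystem integration (notably the new ATOM backend proposal), fixes and enhancements for speculative decoding, and continued work on KV cache quantization (such as INT8 support) and multimodal model support. Engineering optimizations, including compile-time improvements and memory management, were also prominent.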

🎯 AMD/ROCm Ecosystem Updates

AMD ecosystem activity was heavy in this window, spanning an architecture proposal, performance tuning, and bug fixes:

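New Issue [#33478]: Introduce ATOM as a model implementation backend for AMD GPUs (RFC)
- User: zejunchen-zejun
- Content: A formal proposal to make ATOM, a library developed by the ROCm team, a high-performance model implementation backend for vLLM on AMD GPUs (--model-impl atom). ATOM integrates ROCm's high-performance operator library aiter and its communication library mori. It aims to deliver more competitive inference performance on AMD GPUs while balancing ease of use against the fast iteration of the ROCm libraries. The proposal has drawn the attention of ROCm-related maintainers.
- Impact: This is a strategic step for the AMD ecosystem in vLLM. If accepted, it would give the AMD platform an officially maintained, deeply optimized inference backend alongside existing backends such as transformers, significantly strengthening vLLM's native support and performance on AMD hardware.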
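New PR [#33493]: Performance tuning and expanded case coverage for wvSplitKrc
- User: amd-hhashemi (AMD employee)
- Content: Tunes the performance of wvSplitKrc and widens the set of cases it supports. The PR description includes comparison data measured on MI355 showing significant gains across several (m, n) configurations, especially at larger n (for example, 4.42 ms down to 3.23 ms, roughly a 27% reduction).
- Impact: Directly improves kernel performance for specific compute patterns on AMD GPUs.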
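New PR [#33463]: [ROCm][AITER] KV cache split causes large accuracy regression
- User: dorhuri123
- Content: The PR adapts the ROCm AITER unified attention backend to the change that separates the KV cache update. Testing shows, however, that the change causes a severe accuracy regression on lm_eval benchmarks (for example, hellaswag accuracy drops from 0.60 to 0.26). This indicates that the current ROCm AITER path cannot yet safely support the KV cache split and may be missing key metadata or update operations.
- Impact: An important cautionary PR. It exposes compatibility and correctness problems the AMD backend may have when migrating to the new architecture (separated KV cache update), and these need to be resolved with priority.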
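Merged PR [#32492]: Enable the aiter attention backend for the qwen3-next model
- Content: Fixes support in the aiter attention backend for Qwen3-Next, which uses a non-standard block size of 544. The block size check is temporarily relaxed so that the model can run on this backend, and it passed accuracy testing.
- Impact: Broadens the set of models the aiter attention backend supports and improves the availability of this model on AMD platforms.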
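Other related items:
- Merged PR [#33492]: Updates the huggingface-hub version pin in ROCm CI to stay consistent with the main CI.
- Merged PR [#33277]: Forces max_num_seqs=1 in the test_sharded_state_loader test on ROCm CI to reduce flakiness caused by batching variance.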

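Summary: The AMD ecosystem stood out in this window, with work ranging from a high-level architecture proposal (ATOM) to low-level kernel tuning (wvSplitKrc) and correctness investigation (the accuracy regression), showing sustained and deep investment in vLLM by the AMD team.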

💬 High-Activity Discussion Analysis

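Closed Issue [#33091]: Whisper accuracy problem with FA2 + full CUDA graphs
- Core topic: Whisper generates badly wrong text when FlashAttention-2 (FA2) is combined with full CUDA graphs (cudagraph_mode=FULL).
- Views and debugging process:
  - Reporter and community: Provided detailed reproduction steps and confirmed that FA3 works while FA2 does not.
  - Maintainer (@ProExpertProg): Guided a series of tests to isolate the problem (-cc.cudagraph_mode=NONE avoided it) and ultimately established that FA2 is incompatible with full CUDA graph mode, rather than torch.compile being at fault.
  - Other participants: Reported reproductions on A100, L4, and other cards, along with a workaround (enforce-eager).
- Points of contention: No substantive disagreement; this was a collaborative debugging effort.
- Final conclusion: The root cause is that the FA2 kernel is unsafe under full CUDA graph capture. It was fixed by PR [#33360], which ensures CrossAttention uses the correct max_seq_len when building metadata.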
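Closed Issue [#22808]: Prefix caching is slower and occasionally outputs garbage with large batches and large prompts
- Core topic: Several users reported that on A100 and similar GPUs, with prefix caching enabled, throughput drops and output can even become incorrect as concurrency or prompt size grows.
- Summary of views:
  - Users: Shared similar experiences on A100, A800, and L20, and noted that behavior returns to normal once prefix caching is disabled.
  - Maintainer (@robertgshaw2-redhat): Initially suspected an FA2 kernel problem and suggested trying Triton attention.
  - Other users: Testing confirmed that switching to the FLASHINFER attention backend resolves the problem.
- Points of contention: No notable disagreement; multiple users confirmed the same problem and found an alternative.
- Status: The issue was closed automatically after more than 90 days without activity. The root cause may still be unresolved in the mainstream backends (FA2/Triton), so users can treat FLASHINFER as a temporary workaround.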
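New Issue [#33480]: Add INT8 support for KV cache quantization (currently FP8 only)
- Core topic: A user requests INT8 support for KV cache quantization to benefit the large installed base of older hardware that, according to the issue, lacks FP8 support (it cites A100 and RTX 4090 as examples).
- Views:
  - Reporter: Laid out in detail the broad hardware coverage of INT8, its memory efficiency, and its importance for cost-sensitive deployments.
  - Community developer (@drshvik): Responded immediately, offered to take on the implementation, and has submitted PR [#33495] to add validation support at the configuration layer.
- Points of contention: None; this is an important feature request that received a positive response.
- Current status: Implementation has started. PR [#33495] (submitted) adds configuration support and paves the way for the kernel implementation that must follow.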
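
To make the INT8 request concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not the scheme proposed in Issue #33480 or PR #33495, neither of which is described in detail here; the function names quantize_int8 and dequantize_int8 and the sample values are hypothetical, chosen only for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use scale 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical slice of KV cache values.
kv_slice = [0.02, -1.27, 0.64, 0.0, 1.0]
codes, scale = quantize_int8(kv_slice)
restored = dequantize_int8(codes, scale)
```

Each stored value takes one byte plus a shared scale, the same footprint as FP8, and the round-trip error per element is bounded by half the scale. A real KV cache implementation would more likely keep scales per head or per token and do the work inside the attention kernel.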

🔥 Hot Topics and Trends

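Hardware ecosystem expansion and optimization: AMD/ROCm support is the dominant topic (see above). Compatibility with new NVIDIA hardware is also being tracked, for example PR [#33417], which adds NVFP4 MoE kernel support for RTX Blackwell (SM120).
Speculative Decoding:
- Problem: DeepSeek-V3.2 shows a low acceptance rate with the deepseek_mtp method (#33497).
- Enhancements: EAGLE3 support was added for AFMoE models (#33111), and the benchmark tooling gained support for arbitrary multimodal datasets (#33486).
- Performance: The softmax computation is skipped for all-greedy sampling to improve performance (#32852).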
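
The idea behind skipping softmax for all-greedy sampling can be shown in a few lines. Softmax is strictly increasing in each logit, so the index of the largest logit is also the index of the largest probability. The sketch below demonstrates only that property; it is not the code from PR #32852, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(xs):
    """Index of the first maximum element."""
    return max(range(len(xs)), key=xs.__getitem__)


def greedy_token(logits):
    # Greedy decoding only needs the winning index, so the exp/sum/divide
    # work of softmax can be skipped entirely.
    return argmax(logits)


logits = [1.5, -0.3, 4.2, 0.0]
assert greedy_token(logits) == argmax(softmax(logits))
```

When every request in a batch is greedy, the sampler never needs the probabilities themselves, which is what makes the shortcut safe.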
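Quantization:
- KV cache quantization: INT8 support has emerged as a new request (#33480), reflecting community demand for broader hardware coverage.
- Weight quantization: KleidiAI-accelerated int4 dynamic quantization on Arm CPUs was extended and now supports BF16 activations (#33122).
- New format support: Parsing of the repo_id:quant_type syntax for GGUF models was improved to support a richer set of quantization variants (#33371).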
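
As a rough illustration of the repo_id:quant_type form mentioned above, the sketch below splits such a string at its last colon. It is a hypothetical helper, not vLLM's parser: the function name split_gguf_spec and the repository name in the example are made up, and details such as local Windows paths that contain a drive-letter colon are deliberately ignored.

```python
def split_gguf_spec(spec):
    """Split 'org/repo:QUANT' into (repo_id, quant_type).

    Returns (spec, None) when no quantization suffix is present.
    """
    repo_id, sep, quant_type = spec.rpartition(":")
    if not sep:
        # No colon at all: treat the whole string as the repository id.
        return spec, None
    if not repo_id or not quant_type:
        raise ValueError(f"malformed GGUF spec: {spec!r}")
    return repo_id, quant_type


# Hypothetical repository name; Q4_K_M is a common GGUF quantization type.
repo_id, quant_type = split_gguf_spec("example-org/example-model-GGUF:Q4_K_M")
```

Splitting at the last colon keeps the repository id intact even if it contains unusual characters, while an empty side on either end is rejected early instead of being passed downstream.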
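Multimodal and audio models:
- New model support: Adds the Kimi-Audio-7B audio-text model (#33469).
- Functional fixes: Fixes post-processing of Qwen3-ASR transcription output (#33410) and adds multimodal document support to the reranking API (#33468).
- Architecture refactoring: Refactors vision chunk modality processing toward a unified path (#33498).
Engineering and performance optimization:
- Compilation and CUDA graphs: Fixes a cold-start compilation time regression (#33441) and adds structured logs for torch.compile (#33213).
- Memory and cache management: Supports explicitly clearing the multimodal and encoder caches (#33452, with a follow-up fix in #33481).
- Benchmark tooling: Fixes the Authorization header sent by vllm bench serve when OPENAI_API_KEY is not set (#33488) and changes the defaults of vllm bench startup to make it more practical (#33489).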
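
The header behavior named in the title of PR #33488 ("only include Authorization header when OPENAI_API_KEY is set") can be sketched as follows. This is an illustration of the pattern, not the code in vllm bench serve; build_headers and its env parameter are hypothetical names.

```python
import os


def build_headers(env=None):
    """Build request headers, adding Authorization only when a key is configured."""
    env = os.environ if env is None else env
    headers = {"Content-Type": "application/json"}
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        # An unset or empty key produces no Authorization header at all,
        # rather than a malformed "Bearer None" or "Bearer " value.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


without_key = build_headers({})
with_key = build_headers({"OPENAI_API_KEY": "sk-test"})
```

Passing the environment as a parameter keeps the function easy to test without mutating os.environ.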

🛠️ Key Technical Changes

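ATOM backend RFC (#33478): A strategically significant proposal. Introducing ATOM as a dedicated backend for AMD GPUs suggests vLLM may integrate a hardware vendor's optimization stack as deeply as it has embraced transformers, with the potential for significant performance gains and better long-term maintainability on AMD platforms.
KV cache INT8 proposal and kickoff (#33480, #33495): Extending from FP8 to INT8 is a small step with large implications. It responds directly to users on older GPU architectures such as Pascal and Ampere, lowers the barrier to high-performance inference, and helps vLLM stay competitive across a wider range of deployments.
Whisper FA2 + CUDA graph accuracy fix (#33360): The fix resolves a compatibility problem for a specific model type (encoder-decoder) under a specific optimization configuration. It is a reminder that performance work such as graph capture must carefully cover the full range of model architectures and kernel combinations, and that stability matters as much as speed.
FlexAttention tracking issue closed (#19765): This long-running tracking issue was closed for inactivity, marking the end of the early exploration of PyTorch FlexAttention integration. Community focus may have shifted to attention backends that are more mature or faster (such as FlashInfer and AITER).
MoE backend default priority adjustment (#33490): This PR shows how complex it is to pick the best MoE kernel across hardware (Hopper vs Blackwell) and configurations (expert parallelism enabled or not). It tries to improve default performance by reordering the backend selector's priorities, but the deeper need is a smarter selection mechanism driven by performance data.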

📈 Development Activity Observations

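High merge rate: 36 PRs were merged within 24 hours, indicating an efficient review and merge process that absorbs community contributions quickly.
Diverse contributors: Active contributors include developers from AMD (amd-hhashemi), Meta/Facebook (jma99fb), Cohere (kkt-cohere), IBM (fuscof-ibm), and other companies, as well as independent contributors, which points to a healthy ecosystem.
Sustained ROCm investment: The number of AMD-related PRs is steady and covers everything from CI fixes to core feature optimization, showing commitment to the platform.
Issue resolution cycle: Several issues inactive for more than 90 days were closed automatically, while long-standing problems such as #26223 (hang when video input exceeds the encoder cache budget) were resolved through a PR (#33110). This shows the maintainers balancing backlog cleanup with key fixes.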

💡 Issues to Watch

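Outlook for the ATOM backend: RFC #33478 needs active feedback from the community, and from maintainers in particular. Its design philosophy, integration approach, and impact on the existing codebase deserve in-depth discussion.
Full implementation of KV cache INT8: Issue #33480 now has configuration-layer support in progress; kernel-level implementation and integration still need to follow. Its progress and performance are worth tracking.
Default container timezone: Issue #33472 proposes changing the container's default timezone from America/Los_Angeles to UTC. It is a small change, but it bears on production best practice and is expected to be adopted without difficulty.
DeepSeek-V3.2 speculative decoding problem: The newly opened Issue #33497 reports a low acceptance rate for the deepseek_mtp method. Given the popularity of the DeepSeek-V3 model family, investigating and fixing it is likely to be a high priority.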

📋 Appendix: Detailed Data Lists

New Issues

#33478 [Draft][RFC]: Introduce ATOM as model implementation backend for AMD GPU | rocm,RFC | by zejunchen-zejun (created: 2026-01-31 21:58 (UTC+8))
#33497 [Bug]: Low acceptance rate for DeepSeek-V3.2 with deepseek_mtp speculative method in v0.15.0 | bug | by qianlihuang (created: 2026-02-01 11:24 (UTC+8))
#33480 [Feature]: Add INT8 Support for KV Cache Quantization (Currently FP8-Only) | feature request | by Wiziechen (created: 2026-01-31 23:32 (UTC+8))
#33485 [Installation]: | installation | by KevinODonovan60 (created: 2026-02-01 02:59 (UTC+8))
#33484 [Installation]: | installation | by KevinODonovan60 (created: 2026-02-01 02:58 (UTC+8))
#33483 [Installation]: | installation | by KevinODonovan60 (created: 2026-02-01 02:54 (UTC+8))
#33464 [Bug]: qwen3-vl-reranker-8b run failed on VLLM 0.15.0 | bug | by zhcn000000 (created: 2026-01-31 13:04 (UTC+8))
#33472 [Feature]: default timezone in container should be UTC | feature request | by lee-b (created: 2026-01-31 18:50 (UTC+8))
#33470 [Bug]: V1 Engine: TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'layer_idx' for Qwen2 | bug | by yuliu625 (created: 2026-01-31 18:13 (UTC+8))

Closed Issues

#19765 Vllm + FlexAttention Work Tracking | feature request,stale | by drisspg (closed: 2026-02-01 10:17 (UTC+8))
#22808 [Bug]: Prefix caching with larger batch-sizes and large prompts is much slower and occasionally outputs garbage | bug,stale | by mobicham (closed: 2026-02-01 10:16 (UTC+8))
#23820 [Bug]: Incorrect output throughput calculation for concurrent requests in benchmark_serving.py | bug,stale | by z-vishal (closed: 2026-02-01 10:16 (UTC+8))
#33416 [Bug] NVFP4 MoE kernels fail on RTX Blackwell (SM12.0) - device capability family check missing SM120 | no labels | by renehonig (closed: 2026-02-01 06:06 (UTC+8))
#33411 [Bug]: parameter VLLM_USE_V1=0 is not effective in vllm v0.15.0. | bug | by soooocold (closed: 2026-02-01 05:16 (UTC+8))
#33453 [Usage]: | usage | by MachineLearningHan (closed: 2026-02-01 05:16 (UTC+8))
#33485 [Installation]: | installation | by KevinODonovan60 (closed: 2026-02-01 05:12 (UTC+8))
#33484 [Installation]: | installation | by KevinODonovan60 (closed: 2026-02-01 05:12 (UTC+8))
#33483 [Installation]: | installation | by KevinODonovan60 (closed: 2026-02-01 05:12 (UTC+8))
#26223 [Bug]: Silent hang when video frames exceed profiled encoder budget (Qwen2-VL, num_frames > _MAX_FRAMES_PER_VIDEO) | bug,stale | by Jerry-Terrasse (closed: 2026-02-01 00:41 (UTC+8))
#33464 [Bug]: qwen3-vl-reranker-8b run failed on VLLM 0.15.0 | bug | by zhcn000000 (closed: 2026-01-31 23:55 (UTC+8))
#33091 [Bug]: Whisper accuracy issue with FA2+CG | bug | by NickLucche (closed: 2026-01-31 12:15 (UTC+8))

New PRs

#33498 [Experimental][Refactor] Refactor vision chunk modality processing for unification | documentation,needs-rebase,multi-modality | by Isotr0py (created: 2026-02-01 11:41 (UTC+8))
#33481 [Bugfix] Fix inconsistent handling of cache reset | bug,documentation,performance,frontend,ready,v1 | by DarkLight1337 (created: 2026-02-01 01:17 (UTC+8))
#33494 [Doc]: update paths for Offline/Online/Others example sections | documentation | by soyr-redhat (created: 2026-02-01 09:54 (UTC+8))
#33493 Perf tuning and expansion of cases covered for wvSplitKrc | rocm | by amd-hhashemi (created: 2026-02-01 09:21 (UTC+8))
#33496 [Bugfix] Fix assertion error in flashmla backend with fullgraph enabled | bug,v1 | by Kurumi5210 (created: 2026-02-01 11:05 (UTC+8))
#33488 fix: only include Authorization header when OPENAI_API_KEY is set | performance,ready | by zack041 (created: 2026-02-01 04:02 (UTC+8))
#33492 [ROCm][CI] Update huggingface-hub pin | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-02-01 08:32 (UTC+8))
#33486 [Misc] support arbitrary MM datasets in spec dec bench | documentation,performance,speculative-decoding | by kkt-cohere (created: 2026-02-01 03:55 (UTC+8))
#33479 [Refactor] Make Renderer an abstract class | ready,v1 | by DarkLight1337 (created: 2026-01-31 22:49 (UTC+8))
#33495 [Feature] Support 'int8' in CacheConfig validation | no labels | by drshvik (created: 2026-02-01 09:54 (UTC+8))
#33491 Sort safetensors files to ensure deterministic loading order | no labels | by Lumosis (created: 2026-02-01 08:23 (UTC+8))
#33490 [MoE Refactor] Improve Oracle for Backend Specific Defaults | no labels | by xyang16 (created: 2026-02-01 05:55 (UTC+8))
#33467 [ModelRunner V2] Misc minor simplifications and optimizations | v1,nvidia | by njhill (created: 2026-01-31 16:23 (UTC+8))
#33489 Change defaults for vllm bench startup | performance,ready | by ProExpertProg (created: 2026-02-01 05:38 (UTC+8))
#33487 Fix spacing around non-special added tokens #33457 | v1 | by baonudesifeizhai (created: 2026-02-01 03:57 (UTC+8))
#33469 Add Kimi-Audio-7B model support | new-model,multi-modality | by tunglinwood (created: 2026-01-31 17:45 (UTC+8))
#33468 Add multimodal support to reranking API | frontend,needs-rebase | by sathergate (created: 2026-01-31 16:27 (UTC+8))
#33482 [CI/Build] Replace seed_everything | ready | by DarkLight1337 (created: 2026-02-01 01:46 (UTC+8))
#33473 Update huggingface-hub pin for the last time before Transformers v5 | ready,ci/build | by hmellor (created: 2026-01-31 20:05 (UTC+8))
#33477 [Deprecation] Remove deprecated items related to pooling | documentation,frontend,ready | by DarkLight1337 (created: 2026-01-31 21:56 (UTC+8))
#33463 [ROCm][AITER] KV cache split causes large accuracy regression | rocm,v1 | by dorhuri123 (created: 2026-01-31 12:08 (UTC+8))
#33476 [Doc] Update plugin deprecation notices | documentation,ready | by DarkLight1337 (created: 2026-01-31 21:55 (UTC+8))
#33474 [Misc] Replace deprecated interface seed_everything | performance,cpu | by esmeetu (created: 2026-01-31 20:30 (UTC+8))
#33475 [Bugfix] Fix incompatibility between #33372 and #32863 | bug | by DarkLight1337 (created: 2026-01-31 21:09 (UTC+8))
#33471 [EPLB]: Optimize export_load_view update | no labels | by Spicy-Stick (created: 2026-01-31 18:27 (UTC+8))
#33466 [Distributed] Add async P2P overlap for pipeline parallelism | v1,nvidia | by eicherseiji (created: 2026-01-31 15:17 (UTC+8))
#33465 [PluggableLayer][3/N] Apply PluggableLayer to llm_head and vocab embedding layer | no labels | by whx-sjtu (created: 2026-01-31 14:28 (UTC+8))

Merged PRs

#33417 fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels | ready,ci/build,nvidia,quantization | by renehonig (merged: 2026-02-01 06:06 (UTC+8))
#33492 [ROCm][CI] Update huggingface-hub pin | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-02-01 10:51 (UTC+8))
#33479 [Refactor] Make Renderer an abstract class | ready,v1 | by DarkLight1337 (merged: 2026-02-01 10:36 (UTC+8))
#33462 [Misc] Fix flashinfer related tests | ready,ci/build,nvidia | by esmeetu (merged: 2026-02-01 05:10 (UTC+8))
#33121 Fix grammar | v1,cpu | by smashyalts (merged: 2026-02-01 01:59 (UTC+8))
#32942 [Bugfix]: Fix display errors in TORCH_CHECK messages | bug,ready,cpu | by lingebeng (merged: 2026-02-01 01:48 (UTC+8))
#33246 [Misc] support collect_env for endpoint /server_info | frontend,ready | by muma378 (merged: 2026-02-01 01:43 (UTC+8))
#33473 Update huggingface-hub pin for the last time before Transformers v5 | ready,ci/build | by hmellor (merged: 2026-02-01 01:14 (UTC+8))
#33408 [Refactor] Move MM data parsing outside processor | ready,v1,multi-modality,llama,qwen | by DarkLight1337 (merged: 2026-02-01 00:46 (UTC+8))
#33477 [Deprecation] Remove deprecated items related to pooling | documentation,frontend,ready | by DarkLight1337 (merged: 2026-02-01 00:44 (UTC+8))
#33110 [Bugfix] Early-reject requests with MM data longer than encode cache capacity | bug,ready,v1 | by YunzhuLu (merged: 2026-02-01 00:41 (UTC+8))
#33452 Support clear mm and encoder cache | documentation,frontend,ready,v1,meta-exported,fb-exported | by jma99fb (merged: 2026-01-31 23:22 (UTC+8))
#33013 [BugFix][Router Replay] Capture Logical Experts with EPLB | bug,ready,v1 | by HollowMan6 (merged: 2026-01-31 23:12 (UTC+8))
#33441 [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops | ready,ready-run-all-tests | by ProExpertProg (merged: 2026-01-31 22:48 (UTC+8))
#33476 [Doc] Update plugin deprecation notices | documentation,ready | by DarkLight1337 (merged: 2026-01-31 22:48 (UTC+8))
#33213 [ez] Add structured torch.compile logs | ready | by angelayi (merged: 2026-01-31 21:00 (UTC+8))
#33378 support return prompt token ids in responses | frontend | by cmunley1 (merged: 2026-01-31 22:04 (UTC+8))
#33474 [Misc] Replace deprecated interface seed_everything | performance,cpu | by esmeetu (merged: 2026-01-31 21:38 (UTC+8))
#33475 [Bugfix] Fix incompatibility between #33372 and #32863 | bug | by DarkLight1337 (merged: 2026-01-31 21:21 (UTC+8))
#32863 [Frontend] Use new Renderer for Completions and Tokenize API | documentation,frontend,ready,v1,ready-run-all-tests | by DarkLight1337 (merged: 2026-01-31 20:51 (UTC+8))
#32852 [perf] v1/spec_decode: skip softmax for all-greedy rejection sampling | ready,v1 | by caozuoba (merged: 2026-01-31 17:51 (UTC+8))
#32492 [ROCM] Enable aiter attn backend for qwen3-next model | documentation,rocm,ready,ci/build,v1,qwen | by jennyyyyzhen (merged: 2026-01-31 17:03 (UTC+8))
#33078 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. | bug,ready,nvidia | by ayrnb (merged: 2026-01-31 16:14 (UTC+8))
#33203 [Kernel] [Helion] [3/N] Helion kernel registry | ready | by gmagogsfm (merged: 2026-01-31 15:38 (UTC+8))
#33122 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs | ready | by fadara01 (merged: 2026-01-31 15:16 (UTC+8))
#33111 Add EAGLE3 support for AFMoE | speculative-decoding,ready | by AutumnAurelium (merged: 2026-01-31 14:53 (UTC+8))
#33174 Add support for Mistral Large 3 inference with Flashinfer MoE | performance,ready,ci/build,deepseek,nvidia | by dbari (merged: 2026-01-31 14:48 (UTC+8))
#33200 [Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT | bug,ready | by mgehre-amd (merged: 2026-01-31 14:21 (UTC+8))
#33410 [Bugfix] Fix Qwen3ASR language asr tag in output | bug,frontend,ready,qwen | by NickLucche (merged: 2026-01-31 13:24 (UTC+8))
#32964 [Kernel] [Helion] [2/N] Helion kernel wrapper | ready | by gmagogsfm (merged: 2026-01-31 12:53 (UTC+8))
#33427 [Attention] Clarify comment explaining attn_logits +1 dimension | ready,v1 | by fuscof-ibm (merged: 2026-01-31 12:50 (UTC+8))
#33415 [Voxtral Streaming -> Voxtral Realtime] Rename all voxtral related classes, fn, files | documentation,new-model,ready,multi-modality | by patrickvonplaten (merged: 2026-01-31 12:49 (UTC+8))
#33277 [ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness | rocm,ready | by micah-wil (merged: 2026-01-31 12:28 (UTC+8))
#33444 [Misc] offest -> offset in comments and variable names | speculative-decoding,v1 | by russellb (merged: 2026-01-31 12:19 (UTC+8))
#33360 [BugFix] Fix whisper FA2 + full cudagraphs | bug,ready,v1,nvidia | by LucasWilkinson (merged: 2026-01-31 12:15 (UTC+8))
#33371 [UX] Use gguf repo_id:quant_type syntax for examples and docs | documentation,ready,quantization | by mgoin (merged: 2026-01-31 12:14 (UTC+8))

PRs Closed Without Merging

#33357 [BugFix] Fix cold start compilation time | bug,ready | by zou3519 (closed: 2026-02-01 11:23 (UTC+8))
#20359 [PP][V1]: Integrate Token Throttling into vLLM | needs-rebase,stale,v1 | by gty111 (closed: 2026-02-01 10:17 (UTC+8))
#20377 [Bugfix] Remove executable flag on a few files related to flash_attn and flashinfer | needs-rebase,stale,v1 | by tlrmchlsmth (closed: 2026-02-01 10:17 (UTC+8))
#33166 Update README.md | documentation | by chaosisnotopen (closed: 2026-02-01 01:58 (UTC+8))
#33482 [CI/Build] Replace seed_everything | ready | by DarkLight1337 (closed: 2026-02-01 01:49 (UTC+8))
#33425 [Bugfix] Fix correct error message when len(prompt) + max_tokens > max_model_len | bug,frontend,needs-rebase | by sducouedic (closed: 2026-01-31 23:07 (UTC+8))
#33466 [Distributed] Add async P2P overlap for pipeline parallelism | v1,nvidia | by eicherseiji (closed: 2026-01-31 16:32 (UTC+8))