vLLM Development Activity Report - 2026-01-15
Time window: 2026-01-15 10:54 (UTC+8) ~ 2026-01-16 10:54 (UTC+8). Stats: 20 new issues | 14 closed issues | 59 new PRs | 42 merged PRs | 9 PRs closed without merging
📊 Daily Development Summary
This cycle (January 15-16, 2026) saw sustained, high-tempo development and bug fixing in the vLLM project. Progress centered on multimodal model support, performance optimization, and fixes for the AMD ROCm platform. Community discussion focused on quantization API design, multimodal input handling, and distributed-deployment (Ray) issues, indicating rapid iteration to keep pace with increasingly complex models and deployment needs.
🎯 AMD/ROCm Ecosystem Activity
AMD/ROCm activity was brisk this cycle, concentrated on bug fixes, test adaptation, and performance optimization, reflecting sustained investment in the platform's stability.
Issues:
- #32434: [Bug]: gpt-oss no output with TRITON_ATTN backend with spec decode on ROCm
  - User: micah-wil (not an AMD employee)
  - Summary: On ROCm, the gpt-oss model produces no output when the TRITON_ATTN attention backend is combined with speculative decoding (spec decode).
  - Technical details: gpt-oss currently requires the ROCM_AITER_UNIFIED_ATTN backend to work correctly on ROCm; PR #32431 was opened to enable this backend for the affected tests.
  - Impact: Exposes compatibility gaps between attention backends and speculative decoding on ROCm; a specific configuration is required.
- #32410: [Bug][XPU]: Failed to serve w4a16 Qwen3 VL MoE… (mislabeled as ROCm)
  - Summary: Although this is an Intel XPU issue, the github-actions bot incorrectly labeled it and @-mentioned ROCm maintainers. This reflects a misfire of the labeling automation rather than an AMD-related technical problem.
Closed Issues:
- #26700: [Bug][ROCm]: W8A8BlockFp8LinearOp does not use AITER on MI355X
  - User: gronsti-amd (AMD employee)
  - Status: Fixed by commit 3c680f4a and closed. This resolves a case where a specific FP8 linear layer on the AMD MI355X did not use the optimized AITER kernel, improving performance.
PRs (merged and new):
- #32444: [CI][AMD] Skip test_permute_cols… (new PR)
  - User: rasmith
  - Summary: Skips the test_permute_cols test in ROCm CI, since that kernel is neither built nor used on ROCm.
  - Analysis: Platform-specific handling that avoids spurious test failures and keeps CI stable.
- #32431: [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test (merged)
  - User: micah-wil
  - Summary: Enables AITER and the ROCM_AITER_UNIFIED_ATTN attention backend for gpt-oss tests on ROCm, fixing Issue #32434 above.
  - Impact: Ensures this model's test suite executes correctly on ROCm.
- #32419: Support ROCm aiter specific fusion of per_tensor RMSNorm+QuantFP8 (new PR)
  - User: tpopp
  - Summary: Adds a fusion pattern for static-scale, per-tensor RMSNorm + FP8 quantization on the ROCm AITER path.
  - Technical details: Analogous to fusion.py on the CUDA side, but tailored to ROCm AITER. Reportedly yields a 4-5% speedup for Llama3 models on MI325.
  - Impact: A meaningful performance optimization, showing AMD's continued work on compute-graph fusion for its hardware.
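For intuition, the fused pattern combines RMSNorm with a static, per-tensor FP8 quantization step. Below is a minimal pure-Python sketch of the unfused math (function names and the FP8_MAX constant are illustrative; the actual PR implements this as a fusion pass over the compiled graph):

```python
import math

# Hedged sketch of the math being fused. FP8_MAX is illustrative:
# 448.0 for OCP e4m3fn; ROCm's e4m3fnuz variant uses 240.0.
FP8_MAX = 448.0

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: x / sqrt(mean(x^2) + eps), scaled by a learned weight.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def quant_fp8_per_tensor(x, scale):
    # Static per-tensor quantization: divide by a fixed scale, clamp to
    # the FP8 range. (Real FP8 also rounds to representable values.)
    return [max(-FP8_MAX, min(FP8_MAX, v / scale)) for v in x]

def rmsnorm_quant_fused(x, weight, scale, eps=1e-6):
    # What the fusion pattern replaces: one pass instead of two kernels.
    return quant_fp8_per_tensor(rmsnorm(x, weight, eps), scale)
```

Fusing the two steps avoids materializing the normalized activations between kernels, which is presumably where the reported 4-5% end-to-end gain on MI325 comes from.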
- #32408: [CI][Hardware][AMD][Kernel] Align FP8 quant utils and fix test_rotary_embedding_mla_cache_fused (new PR)
  - User: mawong-amd (AMD employee)
  - Summary: Fixes the failing test_rotary_embedding_mla_cache_fused test in AMD CI by aligning AMD's FP8 conversion logic with NVIDIA's, along with related fixes.
  - Impact: Ensures correct, cross-platform-consistent FP8 arithmetic on AMD, foundational for quantized-model accuracy.
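The cross-platform subtlety is that CUDA's float8_e4m3fn and ROCm's float8_e4m3fnuz formats have different representable ranges, so any scale computation must use the platform's own maximum. A small sketch (the maxima are the well-known format constants; the helper name is hypothetical):

```python
# Known format maxima: OCP e4m3fn (CUDA) vs. e4m3fnuz (ROCm gfx94x).
FP8_E4M3_MAX = {"float8_e4m3fn": 448.0, "float8_e4m3fnuz": 240.0}

def fp8_scale(amax: float, dtype: str) -> float:
    """Per-tensor scale so that amax maps onto the FP8 format's max value.
    Hypothetical helper, illustrating why the two platforms must not
    share a hard-coded FP8 max."""
    return amax / FP8_E4M3_MAX[dtype]

# The same activation range needs a larger scale on ROCm's fnuz format:
s_cuda = fp8_scale(120.0, "float8_e4m3fn")    # 120 / 448
s_rocm = fp8_scale(120.0, "float8_e4m3fnuz")  # 120 / 240 = 0.5
```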
- #32404: Remove non-aiter quant matching path for rocm_aiter_fusion.py (new PR)
  - User: tpopp
  - Summary: Removes a redundant non-AITER quantization matching path in rocm_aiter_fusion.py; the path was only valid with AITER enabled and otherwise caused duplicate pattern creation.
  - Impact: Code cleanup that makes the ROCm fusion logic more robust and readable.
- #32372: [CI][BugFix][AMD][FP8] Fix test_rms_norm… (merged)
  - User: rasmith
  - Summary: Fixes the test_rms_norm test on ROCm to use the platform-specific fp8_dtype and set the device correctly, avoiding illegal memory accesses.
  - Impact: A foundational fix that keeps FP8 unit tests reliable on ROCm.
- #32413: [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (merged)
  - User: ganyi1996ppo
  - Summary: Temporarily disables the ROCm hip sampler to fix an accuracy issue where the deepseek-r1 model tends to generate repetitive content on ROCm.
  - Impact: A critical, if temporary, fix for a defect that directly degraded model output quality on AMD. It suggests the underlying sampling library can be unstable on certain hardware/driver combinations, and the root cause still needs investigation.
💬 High-Engagement Discussions
- Issue #32412: [RFC]: online quantization user facing API
  - Core question: How to design a user-friendly API for increasingly complex online (runtime) quantization configuration.
  - Main positions:
    - Proposer (vkuzo): Add a new online_quantization_config object to extend the current simple string argument (e.g. quantization="fp8"), enabling advanced options such as scale granularity and per-layer ignore rules.
    - Participant 1 (kylesayrs): Extend the existing quantization parameter to accept either a preset string or a config object (e.g. QuantizationScheme); also discussed default ignore rules for layers such as MoE routers.
    - Participant 2 (vkuzo): Prefers keeping the two parameters separate for type clarity, but is open to the simpler scheme; further proposed renaming the existing parameter to quantization_backend for clarity.
  - Point of contention: API design philosophy — simplicity (one polymorphic parameter) versus clarity (separate, single-purpose parameters) — and whether to reuse the existing CompressedTensors data structures.
  - Status: Open. There is consensus that a richer configuration scheme is needed, but the concrete path is still being debated.
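To make the two proposed shapes concrete, here is a hedged sketch (class and field names are illustrative, not the RFC's final API) contrasting today's string argument with a structured config object resolved through one polymorphic parameter:

```python
from dataclasses import dataclass, field

# Illustrative only: names follow the RFC discussion, not a final vLLM API.
@dataclass
class OnlineQuantizationConfig:
    backend: str = "fp8"              # today's simple string, e.g. quantization="fp8"
    granularity: str = "per_tensor"   # e.g. per_tensor / per_channel / per_block
    ignored_layers: list = field(default_factory=lambda: ["lm_head"])

def resolve_quantization(quantization):
    """The single-polymorphic-parameter shape: accept a preset string or a
    full config object, and normalize everything to a config."""
    if isinstance(quantization, str):
        return OnlineQuantizationConfig(backend=quantization)
    return quantization

# Simple case stays simple; advanced case gains knobs:
cfg = resolve_quantization("fp8")
adv = resolve_quantization(OnlineQuantizationConfig(
    backend="fp8", granularity="per_block",
    ignored_layers=["lm_head", "mlp.gate"]))
```

The alternative shape keeps two parameters (a quantization_backend string plus a separate online_quantization_config object), trading a little verbosity for static type clarity.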
- Issue #32378: [Usage]: How to add mixed text and image modal inputs…
  - Core question: When using the Qwen3VL-Reranker model, the user is unsure how to pass candidates that mix image and text into the documents list via the vLLM API.
  - Main points / progress:
    - User (wade0604): Showed working code for text-only and image-only candidates, but was stuck on combining them.
    - Maintainers (DarkLight1337, noooop): Pointed the user at the existing multimodal scoring example code and suggested extending the content field of the request.
    - Follow-up: The user tried a nested structure but got multiple scores rather than the expected single score per text-image pair.
  - Point of contention: None substantive; mostly API usage guidance and documentation clarification. It does expose usability challenges for complex multimodal inputs.
  - Status: Open, awaiting clearer examples or documentation.
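The shape the user is reaching for is a single candidate whose content mixes text and image parts. Below is a hypothetical request body in the OpenAI-style "content parts" convention (field names and the URL are illustrative; consult vLLM's multimodal scoring examples for the schema it actually accepts):

```python
# Hypothetical payload: one document combining an image part and a text
# part, so the reranker should return ONE score for the whole pair.
# The schema here is illustrative, not confirmed vLLM API.
payload = {
    "model": "Qwen3VL-Reranker",
    "query": "a cat sitting on a sofa",
    "documents": [
        {
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "A tabby cat resting on a couch."},
            ]
        }
    ],
}
assert len(payload["documents"]) == 1                 # one candidate ...
assert len(payload["documents"][0]["content"]) == 2   # ... with two modalities
```

The user's reported symptom (multiple scores) suggests their nesting was interpreted as multiple documents rather than one multi-part document, which is exactly the distinction the structure above tries to make explicit.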
- PR #32379: fix: improve NIXL import safety and add UCX compatibility checks
  - Core issue: Fixes segfaults in the NIXL connector caused by mixed UCX library versions, and hardens its safety and error reporting.
  - Main points / interaction:
    - Author (seekskyworld): Proposed detecting mixed UCX installations and failing fast, importing NIXL lazily and safely, and handling the UCX_PROTO_ENABLE environment variable.
    - Affected user (esmeetu): Tested the patch but initially reported the problem persisted, providing environment details (UCX_PROTO_ENABLE unset).
    - Author's follow-up: Based on that feedback, found that UCX_PROTO_ENABLE=n can disable the GPU RMA protocol and trigger the crash, and added a guard for this in a later commit.
  - Point of contention: None; a collaborative debugging effort typical of how open-source communities resolve problems.
  - Status: PR open, iterating on the fix based on user feedback.
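The "detect mixed installs and fail fast" idea can be sketched as follows (the helper and its heuristic — scanning loaded shared objects for libucp resolved from more than one install prefix — are hypothetical, not the PR's actual code):

```python
def check_mixed_ucx(loaded_libs):
    """Fail fast if libucp resolves from more than one install prefix.

    `loaded_libs` would come from e.g. /proc/self/maps in a real check;
    this helper and its heuristic are illustrative only.
    """
    ucx_prefixes = {
        lib.rsplit("/lib/", 1)[0]
        for lib in loaded_libs
        if "libucp" in lib and "/lib/" in lib
    }
    if len(ucx_prefixes) > 1:
        raise RuntimeError(
            f"Mixed UCX installations detected ({sorted(ucx_prefixes)}); "
            "this is known to cause segfaults. Remove one install or point "
            "LD_LIBRARY_PATH at a single UCX."
        )

# Single install passes silently; a mixed install raises before NIXL loads.
check_mixed_ucx(["/usr/lib/libucp.so.0", "/usr/lib/libuct.so.0"])
```

Failing fast at import time converts an opaque segfault deep inside UCX into an actionable error message, which is the core of the PR's safety improvement.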
🔥 Hot Topics and Trends
- Deepening quantization support: The online quantization API design (#32412), NVFP4/MXFP4 MoE support (#32257, #32285), and AMD FP8 fusion work (#32419) show the community building toward more flexible, more efficient quantization, spanning post-training through online quantization.
- Expanding model support: New support for Molmo2 FP8 quantization (#32385), the OpenVLA robotics model (#32390), and A.X-K1 (#32407) shows vLLM tracking the model frontier and growing its multimodal and specialized-model ecosystem.
- Performance and core optimization: Dynamic speculative decoding (#32374), the MoE kernel selector refactor (#32414), Triton attention kernel tuning (#32403), and LoRA expand-kernel heuristic updates (#32425) show continued deep optimization of core performance and scheduling logic.
- Maturing multimodal inference: The multimodal input-handling discussion (#32378) and related refactors (#32406, #32382) indicate the feature is moving from "it works" toward "it is usable and the code is clean".
- KV cache and connectors: KV transfer vs. spec-decode compatibility (#32409), CPU offloading by default (#32421), NIXL connector stability (#32379), and the OffloadingConnector preemptions-only mode (#32398) mark distributed inference and long-context memory management as current focus areas.
- Platform support and stability: Extensive ROCm test fixes, CI work, and bug fixes (e.g. #32413, #32444, #32431), plus Docker build fixes (#32377, #32373), reflect strong attention to production stability and multi-platform support.
🛠️ Key Technical Changes
- PR #32414: [MoE Refactor] Make Oracle Select Kernels In Priority Order
  - Reading: A significant refactor of MoE kernel selection. It introduces a unified feature-support interface and a priority-ordered registration mechanism for the various MoE kernels (Triton, FlashInfer, AITER, etc.).
  - Impact: Enables more automatic, more reliable selection of the optimal kernel for a given hardware architecture and model, with compatibility validated at initialization time and clearer error messages. Foundational work for MoE deployment experience and performance.
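The priority-ordered oracle pattern can be sketched like this (kernel names come from the PR description; the registry interface is a hypothetical simplification of the refactor):

```python
# Hypothetical simplification of "oracle selects kernels in priority
# order": each kernel declares what it supports; the first supported entry
# in the priority-ordered registry wins, and incompatibility is reported
# up front rather than failing deep inside a forward pass.
KERNEL_REGISTRY = [
    # (name, supports(platform, quant) predicate), highest priority first.
    ("FlashInfer", lambda platform, quant: platform == "cuda"),
    ("AITER",      lambda platform, quant: platform == "rocm"),
    ("Triton",     lambda platform, quant: quant in (None, "fp8")),
]

def select_moe_kernel(platform, quant=None):
    for name, supports in KERNEL_REGISTRY:
        if supports(platform, quant):
            return name
    raise ValueError(
        f"No MoE kernel supports platform={platform!r}, quant={quant!r}"
    )
```

Running the check at initialization (rather than lazily) is what lets the refactor surface clear errors before any request is served.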
- PR #32421: [UX] Use kv_offloading_backend=native by default (merged)
  - Reading: Makes native CPU KV-cache offloading the default backend whenever the user specifies --kv_offloading_size.
  - Impact: A clear UX improvement: specifying an offload size is now enough to enable CPU offloading, with no extra backend flag. This lowers the barrier to using KV offloading for longer contexts or more concurrent requests.
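The UX change amounts to a default rule: if an offload size is given and no backend is named, pick native. A sketch of that resolution logic (the function and the alternative backend name are hypothetical; the real flags are --kv_offloading_size and kv_offloading_backend):

```python
def resolve_kv_offloading(kv_offloading_size=None, kv_offloading_backend=None):
    """Sketch of the new default rule: a size alone now implies the
    native CPU backend. Illustrative, not vLLM's actual config code."""
    if kv_offloading_size is None:
        return None  # offloading not enabled at all
    return kv_offloading_backend or "native"

# Before this PR users had to name a backend; now the size is enough:
assert resolve_kv_offloading(kv_offloading_size="8GiB") == "native"
# An explicit backend still wins ("other_backend" is a placeholder name):
assert resolve_kv_offloading("8GiB", "other_backend") == "other_backend"
```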
- PR #32413: [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (merged)
  - Reading: Uses an environment variable to work around a defect in the ROCm hip sampler that caused DeepSeek models to emit repetitive output on AMD GPUs.
  - Impact: Directly restores model usability on AMD. It treats the symptom rather than the cause, but preserves immediate stability for users while a proper fix lands in the hip library or the vLLM integration.
- Issue #32432: [Bug]: FlashInfer warmup crash on Blackwell NVFP4
  - Reading: On Blackwell GPUs, NVFP4-quantized models with the FlashInfer attention backend crash during warmup because of a PyTorch API change (non_blocking=None).
  - Impact: Blocks the combination of the newest hardware (Blackwell) with an advanced quantization format (NVFP4). Setting VLLM_USE_V1=0 falls back to the legacy engine as a workaround, but the V1 engine remains impaired on this hardware until FlashInfer or the vLLM adapter layer is updated.
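The crash boils down to passing non_blocking=None where newer PyTorch expects a bool for Tensor.to(). A torch-free sketch of the defensive coercion an adapter layer could apply (the wrapper is illustrative, not FlashInfer's or vLLM's actual fix):

```python
def safe_to_kwargs(device, non_blocking=None):
    """Normalize kwargs destined for Tensor.to(): newer PyTorch rejects
    non_blocking=None, so coerce None to False before the call.
    Hypothetical helper for illustration."""
    return {"device": device, "non_blocking": bool(non_blocking or False)}

kwargs = safe_to_kwargs("cuda", non_blocking=None)
assert kwargs["non_blocking"] is False   # None no longer leaks through
assert safe_to_kwargs("cuda", non_blocking=True)["non_blocking"] is True
```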
- PR #32374: [V1][Spec Decode] Add Dynamic SD (new PR)
  - Reading: Introduces dynamic speculative decoding to the V1 engine. It adjusts the number of speculated tokens (the K value) based on the live batch size and token acceptance rate, preserving the speedup across load levels.
  - Impact: Makes speculative decoding more robust and efficient under varying load (e.g. RL-training rollouts); an important step toward mature, adaptive speculation.
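The adaptation rule can be illustrated with a toy controller (the thresholds and the shrink-with-batch-size policy are invented for illustration; the PR's actual controller may differ):

```python
def dynamic_num_speculative_tokens(batch_size, acceptance_rate, max_k=5):
    """Toy controller: speculate more when acceptance is high and the
    batch is small (spare compute), less as the batch fills up or drafts
    keep getting rejected. Thresholds are illustrative, not the PR's
    tuned values."""
    if acceptance_rate < 0.3:
        return 0  # speculation is losing: fall back to plain decoding
    k = max_k if batch_size <= 8 else max_k // 2
    return min(k, max(1, round(k * acceptance_rate)))

# Light load + good drafts => aggressive speculation:
assert dynamic_num_speculative_tokens(4, 0.95) == 5
# Heavy load caps K even with good drafts:
assert dynamic_num_speculative_tokens(64, 0.95) == 2
# Poor acceptance disables speculation entirely:
assert dynamic_num_speculative_tokens(4, 0.1) == 0
```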
📈 Development Activity Observations
- Contributor diversity: Active contributors this cycle include AMD employees (tjtanaa, gronsti-amd, mawong-amd, and others), NVIDIA employees, Intel-affiliated contributors, and many independent community developers, reflecting broad ecosystem participation.
- AMD team activity: AMD contributions span the full chain, from driver-level issues (#32413) and kernel optimization (#32419) to test fixes (#32444, #32408) and CI/CD work (#32264), indicating a systematic effort to improve stability and performance in the vLLM ecosystem.
- Deep iteration on core modules: Refactoring and optimization PRs continue on MoE (#32414), speculative decoding (#32374), and the attention backends, showing that rapid feature growth has not come at the expense of core architecture.
- Review and merge throughput: 42 PRs merged and 14 issues handled within 24 hours points to efficient maintenance and automation (CI, mergebot). Multi-platform compatibility PRs (e.g. #32372, #32431) were processed and merged quickly.
💡 Issues Worth Watching
- Online quantization API design (Issue #32412): The outcome will define the user experience and extensibility of vLLM's quantization features; community members should follow and weigh in.
- gpt-oss produces no output on ROCm (Issue #32434): A temporary CI fix exists, but the root cause (why the TRITON_ATTN backend fails on ROCm) still needs to be identified, which matters for ROCm feature completeness.
- FlashInfer compatibility on Blackwell (Issue #32432): Blocks combining the newest hardware with the latest quantization; needs a fix upstream (FlashInfer) or in the vLLM adapter layer soon.
- Ray cluster deployment (Issues #32400, #32401): Placement-group conflicts and resource-scheduling problems when deploying multiple model instances across a large cluster highlight the scheduling challenges vLLM faces in complex production environments; solutions here matter greatly to cloud providers and large enterprise users.
📋 Appendix: Detailed Data
New Issues
- #32436 [Bug]: GLM 4.7 Tool Call issue — bug — by 9620300 (created: 2026-01-16 08:03 (UTC+8))
- #32446 [New Model]: Support for TranslateGemma series — no labels — by venki-lfc (created: 2026-01-16 09:37 (UTC+8))
- #32445 [Bug]: GLMASR rope error — bug — by catsled (created: 2026-01-16 09:29 (UTC+8))
- #32434 [Bug]: gpt-oss no output with TRITON_ATTN backend with spec decode on ROCm — bug,rocm — by micah-wil (created: 2026-01-16 07:05 (UTC+8))
- #32432 [Bug]: FlashInfer warmup crash on Blackwell NVFP4: non_blocking=None passed to Tensor.to() — bug — by CristyNel (created: 2026-01-16 06:37 (UTC+8))
- #32412 [RFC]: online quantization user facing API — RFC — by vkuzo (created: 2026-01-15 20:58 (UTC+8))
- #32424 [Bug]: MoE LoRA hits compile error on B200 — bug — by Jackmin801 (created: 2026-01-16 02:42 (UTC+8))
- #32410 [Bug][XPU]: Failed to serve w4a16 Qwen3 VL MoE on Intel Arc Pro B60 — bug,intel-gpu — by noobHappylife (created: 2026-01-15 19:32 (UTC+8))
- #32378 [Usage]: How to add mixed text and image modal inputs to documents for qwen3vl-rerank model vllm inference? — usage — by wade0604 (created: 2026-01-15 11:57 (UTC+8))
- #32409 [Bug]: KV Transfer does not work with Spec Decode — bug — by szutenberg (created: 2026-01-15 18:54 (UTC+8))
- #32402 [RFC]: Add a new /collect_env api to response current vllm instance environment — feature request — by lengrongfu (created: 2026-01-15 16:55 (UTC+8))
- #32399 [Bug]: High Rate of endless Decoding in DeepSeekV3.2 Inference with vLLM v0.13.0 — bug — by makabaka6338 (created: 2026-01-15 16:25 (UTC+8))
- #32401 [Feature]: when will custom placement group configuration be supported for DP — no labels — by ywExcellent (created: 2026-01-15 16:43 (UTC+8))
- #32400 [Bug]: Ray cluster(5 Node), Deploy DP=2, Creating too many placement groups — no labels — by ywExcellent (created: 2026-01-15 16:37 (UTC+8))
- #32391 [Bug]: shellcheck hook fails with find syntax error and causes silent failures — bug — by junuxyz (created: 2026-01-15 15:04 (UTC+8))
- #32388 [Usage]: Does EPLB support CompressedTensorsWNA16MarlinMoEMethod in v0.12.0 or higher version? — usage — by IEI-mjx (created: 2026-01-15 14:46 (UTC+8))
- #32370 [Bug]: use vllm server but tell me not import vllm.benchmarks.latency — bug — by ilovexsir (created: 2026-01-15 11:00 (UTC+8))
- #32380 [Feature]: Benchmak Scalability Optimization — feature request — by Bounty-hunter (created: 2026-01-15 12:16 (UTC+8))
- #32373 [Bug]: Fail to load vLLM on new NVIDIA driver — bug — by huydhn (created: 2026-01-15 11:15 (UTC+8))
- #32375 [Bug]: Tensor Parallel + NixlConnector failed — bug — by esmeetu (created: 2026-01-15 11:20 (UTC+8))
Closed Issues
- #15235 [Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. — bug,stale — by GennVa (closed: 2026-01-16 10:13 (UTC+8))
- #18567 [Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash — bug,stale — by Rexhaif (closed: 2026-01-16 10:13 (UTC+8))
- #32424 [Bug]: MoE LoRA hits compile error on B200 — bug — by Jackmin801 (closed: 2026-01-16 02:58 (UTC+8))
- #32214 [Feature]: Extend startup time collection script to work with sweep — help wanted,good first issue,feature request — by desertfire (closed: 2026-01-15 17:25 (UTC+8))
- #32213 [Bug]: GLM4 tool parser crashes with TypeError when parsing tools with no arguments — bug — by AnasMaar (closed: 2026-01-15 16:46 (UTC+8))
- #26700 [Bug][ROCm]: W8A8BlockFp8LinearOp does not use AITER on MI355X — bug,rocm — by gronsti-amd (closed: 2026-01-15 15:54 (UTC+8))
- #32311 [Bug/Question]: Unexpected CUDA Graph Replay observed only in the first request's prefill under PIECEWISE mode — usage — by zhenwei-intel (closed: 2026-01-15 15:13 (UTC+8))
- #27787 [Bug]: Under conditions of high concurrency, DeepSeek-V3.1 exhibits the phenomenon of generating nonsensical outputs when running on vllm 0.11.1rc4 — bug — by makabaka6338 (closed: 2026-01-15 14:47 (UTC+8))
- #27245 [Bug]: High Latency in One-Time Scenarios for DeepSeek Deployed on H200 — bug — by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
- #30453 [Bug]: Frequent Repetitive Decoding Problems in DeepSeek-V3.2 — bug — by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
- #30318 [Bug]: vllm v0.10.1 - Non-compliant JSON Output in Tool Calling When Inferencing DeepSeek-V3.1-Terminus — bug — by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
- #32225 [Bug]: common_attn_metadata.max_seq_len not incremented properly in Eagle implementation — bug — by ofirzaf (closed: 2026-01-15 14:39 (UTC+8))
- #25327 [Bug]: InternVL 3.5 is insanely slow — bug,stale — by mertunsall (closed: 2026-01-15 14:23 (UTC+8))
- #31864 [Bug][CPU Backend]: Gibberish output on CPU backend with DP2 + MoE Model — bug,cpu — by kzwrime (closed: 2026-01-15 12:50 (UTC+8))
New PRs
- #32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (created: 2026-01-16 10:46 (UTC+8))
- #32438 [Bug] Add TPU backend option — bug,ready — by vanbasten23 (created: 2026-01-16 08:32 (UTC+8))
- #32414 [MoE Refactor] Make Oracle Select Kernels In Priority Order — documentation,performance,rocm,ready,ci/build,v1,llama,gpt-oss,nvidia — by robertgshaw2-redhat (created: 2026-01-15 22:00 (UTC+8))
- #32442 [Bugfix] Abort engine requests on client disconnect — bug,frontend — by JayZenith (created: 2026-01-16 09:02 (UTC+8))
- #32423 [CI] Fix LM Eval Large Models (H100) — ready,ci-failure — by MatthewBonanni (created: 2026-01-16 02:29 (UTC+8))
- #32395 [Frontend][1/n] Make pooling entrypoints request schema consensus CompletionRequest — documentation,frontend,ready,multi-modality — by noooop (created: 2026-01-15 15:43 (UTC+8))
- #32444 [CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm — rocm — by rasmith (created: 2026-01-16 09:09 (UTC+8))
- #32437 [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. — needs-rebase,nvidia — by pavanimajety (created: 2026-01-16 08:21 (UTC+8))
- #32443 [Bugfix][chat_completion] ensure tokenizer eos_token_id is added to stop_token_ids — bug,frontend — by Flink-ddd (created: 2026-01-16 09:08 (UTC+8))
- #32441 [Docs] Clarify structured outputs configuration for Qwen3 reasoning mode — documentation,structured-output,qwen — by davzaman (created: 2026-01-16 09:00 (UTC+8))
- #32440 [Misc] Add VLLM Config to Prometheus Logger — v1 — by pooyadavoodi (created: 2026-01-16 08:50 (UTC+8))
- #32431 [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test — rocm,ready,gpt-oss — by micah-wil (created: 2026-01-16 05:56 (UTC+8))
- #32439 [Bugfix] Abort engine requests on client disconnect — bug,frontend — by JayZenith (created: 2026-01-16 08:42 (UTC+8))
- #32385 [Model] Molmo2: Enable quantized weight mapping for vision backbone — no labels — by George-Polya (created: 2026-01-15 13:10 (UTC+8))
- #32374 [V1][Spec Decode] Add Dynamic SD — documentation,speculative-decoding,needs-rebase,v1 — by ekagra-ranjan (created: 2026-01-15 11:16 (UTC+8))
- #32429 [Core] Cleanup shm based object store on engine shutdown — v1,multi-modality — by walterbm (created: 2026-01-16 05:04 (UTC+8))
- #32435 [Misc] Add Device Config to Prometheus Logger — v1 — by pooyadavoodi (created: 2026-01-16 07:28 (UTC+8))
- #32390 [Model] Add OpenVLA model support — new-model,multi-modality — by PalmDr (created: 2026-01-15 15:00 (UTC+8))
- #32433 [Refactor] Remove unused file pallas_kv_cache_update.py — tpu,ready,v1 — by yewentao256 (created: 2026-01-16 06:53 (UTC+8))
- #32422 [Refactor] Remove unused file — frontend,ready — by yewentao256 (created: 2026-01-16 00:56 (UTC+8))
- #32430 [Bugfix] Seed OSS 36B Tool Parsing — bug — by ApexArray (created: 2026-01-16 05:47 (UTC+8))
- #32418 [EPLB][BugFix]Possible deadlock fix — bug,ready — by ilmarkov (created: 2026-01-15 23:46 (UTC+8))
- #32420 [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision — frontend,needs-rebase,v1 — by wseaton (created: 2026-01-16 00:26 (UTC+8))
- #32427 [ROCM] add 3d triton kernel for non-standard block size support under rocm_attn — rocm,v1 — by jennyyyyzhen (created: 2026-01-16 03:59 (UTC+8))
- #32428 [Logging][Bugfix] fix scheduler stats logging — bug,v1,meta-exported,fb-exported — by jennyyyyzhen (created: 2026-01-16 04:02 (UTC+8))
- #32371 v1: Standardize request validation errors to VLLMValidationError — needs-rebase,v1 — by JayZenith (created: 2026-01-15 11:11 (UTC+8))
- #32425 [LoRA] Update LoRA expand kernel heuristic — no labels — by xyang16 (created: 2026-01-16 02:53 (UTC+8))
- #32416 [BugFix] Python file source reading can fail on UnicodeDecodeError — bug,ready — by zou3519 (created: 2026-01-15 23:10 (UTC+8))
- #32417 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 — bug,ready,v1,nvidia — by vadiklyutiy (created: 2026-01-15 23:42 (UTC+8))
- #32426 [CI] Remove unused docker/Dockerfile.nightly_torch — ci/build — by orionr (created: 2026-01-16 03:10 (UTC+8))
- #32398 [KVConnector] OffloadingConnector: Add preemptions-only mode — v1,kv-connector — by orozery (created: 2026-01-15 16:23 (UTC+8))
- #32421 [UX] Use kv_offloading_backend=native by default — ready,v1,kv-connector — by mgoin (created: 2026-01-16 00:34 (UTC+8))
- #32387 [FEATURE] Add support for capturing hidden states in Qwen3MoeLLMModel… — qwen — by tzhouam (created: 2026-01-15 13:53 (UTC+8))
- #32413 [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm — bug,rocm,ready,v1,deepseek — by ganyi1996ppo (created: 2026-01-15 21:53 (UTC+8))
- #32419 Support ROCm aiter specific fusion of per_tensor RMSNorm+QuantFP8 — rocm — by tpopp (created: 2026-01-16 00:25 (UTC+8))
- #32403 [Performance] Improve Triton prefill attention kernel's performance — v1 — by Isotr0py (created: 2026-01-15 17:05 (UTC+8))
- #32415 [Fix] Fix buffer sizing and memory handling for routed-expert return in TP — v1 — by TomerBN-Nvidia (created: 2026-01-15 22:35 (UTC+8))
- #32407 [Model] Add huggingface skt/A.X-K1 model — documentation,new-model — by fort726 (created: 2026-01-15 18:25 (UTC+8))
- #32379 fix: improve NIXL import safety and add UCX compatibility checks — documentation,kv-connector — by seekskyworld (created: 2026-01-15 12:04 (UTC+8))
- #32411 [Misc] Fix typo: seperator -> separator in flashmla_sparse.py — v1 — by T1mn (created: 2026-01-15 20:12 (UTC+8))
- #32406 [3/N] Group together media-related code — ready,multi-modality — by DarkLight1337 (created: 2026-01-15 18:12 (UTC+8))
- #32404 Remove non-aiter quant matching path for rocm_aiter_fusion.py — rocm — by tpopp (created: 2026-01-15 17:25 (UTC+8))
- #32394 fix(tool_parsers): enable type conversion for Seed OSS tool parser streaming mode — no labels — by karanb192 (created: 2026-01-15 15:30 (UTC+8))
- #32372 [CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm — bug,rocm,ready — by rasmith (created: 2026-01-15 11:12 (UTC+8))
- #32396 [Refactor] [11/N] to simplify the mcp architecture — structured-output,frontend,ready,v1,gpt-oss — by chaunceyjiang (created: 2026-01-15 15:51 (UTC+8))
- #32408 [CI][Hardware][AMD][Kernel] Align FP8 quant utils and fix test_rotary_embedding_mla_cache_fused — rocm — by mawong-amd (created: 2026-01-15 18:27 (UTC+8))
- #32405 [Hardware][RISCV] Add RISC-V RVV backend support for CPU executor — ci/build,cpu — by hansu2022 (created: 2026-01-15 17:33 (UTC+8))
- #32397 [Model] Enable LoRA support for internvl2 — no labels — by MatteoFari (created: 2026-01-15 16:17 (UTC+8))
- #32382 [2/N] Move cache factories to MM registry — ready,v1,multi-modality — by DarkLight1337 (created: 2026-01-15 12:22 (UTC+8))
- #32389 [Model] Avoid token selection in SigLIP pooling head — ready — by DarkLight1337 (created: 2026-01-15 14:48 (UTC+8))
- #32369 [Refactor] [10/N] to simplify the vLLM openai completion serving architecture — frontend,ready,v1,gpt-oss — by chaunceyjiang (created: 2026-01-15 10:58 (UTC+8))
- #32386 [Feature] Add FIPS 140-3 compliant hash algorithm option for multimodal hashing — multi-modality — by karanb192 (created: 2026-01-15 13:33 (UTC+8))
- #32393 Support custom URI schemes and trace handlers for profiler — v1,meta-exported,fb-exported — by diviramon (created: 2026-01-15 15:18 (UTC+8))
- #32392 Support custom URI schemes and trace handlers for profiler — v1,meta-exported,fb-exported — by brad-mengchi (created: 2026-01-15 15:12 (UTC+8))
- #32384 [Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference — bug,structured-output,v1 — by karanb192 (created: 2026-01-15 12:48 (UTC+8))
- #32376 [code clean] remove duplicate check — ready,v1,multi-modality — by andyxning (created: 2026-01-15 11:44 (UTC+8))
- #32381 [responsesAPI] allow reasoning parser to output multiple reasoning items — frontend,qwen,deepseek — by qandrew (created: 2026-01-15 12:20 (UTC+8))
- #32383 [responsesAPI] allow tuning include_stop_str_in_output — frontend — by qandrew (created: 2026-01-15 12:42 (UTC+8))
- #32377 [Docker] Remove CUDA compatibility library loading; fixes #32373 — ci/build,nvidia — by wangshangsam (created: 2026-01-15 11:54 (UTC+8))
Merged PRs
- #32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (merged: 2026-01-16 10:50 (UTC+8))
- #32423 [CI] Fix LM Eval Large Models (H100) — ready,ci-failure — by MatthewBonanni (merged: 2026-01-16 08:52 (UTC+8))
- #32431 [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test — rocm,ready,gpt-oss — by micah-wil (merged: 2026-01-16 08:55 (UTC+8))
- #32360 Add thread_n=64 support to Marlin MoE — ready — by mgoin (merged: 2026-01-16 08:45 (UTC+8))
- #32257 [Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors — ready,nvidia — by TomerBN-Nvidia (merged: 2026-01-16 08:15 (UTC+8))
- #32422 [Refactor] Remove unused file — frontend,ready — by yewentao256 (merged: 2026-01-16 06:59 (UTC+8))
- #31827 [MoE Refactor][17/N] Apply Refactor to Bf16 — ready,nvidia — by zyongye (merged: 2026-01-16 04:53 (UTC+8))
- #32238 [ROCM] DSfp4 mla projection gemms weight dynamic quantization — rocm,ready,v1 — by maleksan85 (merged: 2026-01-16 04:13 (UTC+8))
- #32416 [BugFix] Python file source reading can fail on UnicodeDecodeError — bug,ready — by zou3519 (merged: 2026-01-16 04:01 (UTC+8))
- #32421 [UX] Use kv_offloading_backend=native by default — ready,v1,kv-connector — by mgoin (merged: 2026-01-16 02:55 (UTC+8))
- #32264 [ROCm] [CI] [Release] Rocm wheel pipeline with sccache — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-16 02:56 (UTC+8))
- #32285 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models — ready,quantization — by dsikka (merged: 2026-01-15 23:25 (UTC+8))
- #32362 [BugFix] Fix `assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1]` in Blackwell Quantized MoE Test — bug,ready — by LucasWilkinson (merged: 2026-01-16 02:19 (UTC+8))
- #30361 [Attention][AMD] Make flash-attn optional — rocm,speculative-decoding,ready,v1 — by mgehre-amd (merged: 2026-01-16 01:18 (UTC+8))
- #32131 fixing podman build issue — rocm,ready,ci/build,meta-exported,fb-exported — by smitkadvani (merged: 2026-01-16 01:07 (UTC+8))
- #32359 [Feature] Support async scheduling + PP — ready,v1 — by yewentao256 (merged: 2026-01-16 01:06 (UTC+8))
- #32350 [ROCm][CI] Pin transformers 4.57.3 to fix jina test failures — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-15 15:19 (UTC+8))
- #32348 [Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] — bug,v1,nvidia — by WoosukKwon (merged: 2026-01-16 00:59 (UTC+8))
- #32413 [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm — bug,rocm,ready,v1,deepseek — by ganyi1996ppo (merged: 2026-01-16 00:35 (UTC+8))
- #29887 [ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend — rocm,ready,v1 — by ganyi1996ppo (merged: 2026-01-15 23:29 (UTC+8))
- #32339 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill — ready,nvidia — by MatthewBonanni (merged: 2026-01-15 22:49 (UTC+8))
- #31715 [ROCm] Improve error handling while loading quantized model on gfx120… — rocm,ready — by brian033 (merged: 2026-01-15 20:16 (UTC+8))
- #32406 [3/N] Group together media-related code — ready,multi-modality — by DarkLight1337 (merged: 2026-01-15 19:52 (UTC+8))
- #32372 [CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm — bug,rocm,ready — by rasmith (merged: 2026-01-15 19:05 (UTC+8))
- #31995 [ROCM] Add ROCm image build to release pipeline — rocm,ready,ci/build — by dllehr-amd (merged: 2026-01-15 19:01 (UTC+8))
- #32396 [Refactor] [11/N] to simplify the mcp architecture — structured-output,frontend,ready,v1,gpt-oss — by chaunceyjiang (merged: 2026-01-15 18:49 (UTC+8))
- #32337 [Benchmark] [Feature] add vllm bench sweep startup command — documentation,performance,ready — by lengrongfu (merged: 2026-01-15 17:25 (UTC+8))
- #32382 [2/N] Move cache factories to MM registry — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-01-15 17:02 (UTC+8))
- #32389 [Model] Avoid token selection in SigLIP pooling head — ready — by DarkLight1337 (merged: 2026-01-15 17:01 (UTC+8))
- #32321 fix: avoid crash on zero-arg tool calls in glm4 parser — ready — by seekskyworld (merged: 2026-01-15 16:45 (UTC+8))
- #32314 [Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode — bug,frontend,ready,v1 — by dtcccc (merged: 2026-01-15 16:31 (UTC+8))
- #32369 [Refactor] [10/N] to simplify the vLLM openai completion serving architecture — frontend,ready,v1,gpt-oss — by chaunceyjiang (merged: 2026-01-15 15:41 (UTC+8))
- #32312 [Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle — bug,speculative-decoding,ready,v1 — by ofirzaf (merged: 2026-01-15 14:39 (UTC+8))
- #32361 [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes — bug,ready,deepseek — by LucasWilkinson (merged: 2026-01-15 14:32 (UTC+8))
- #32376 [code clean] remove duplicate check — ready,v1,multi-modality — by andyxning (merged: 2026-01-15 13:29 (UTC+8))
- #32201 [CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.py::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol — bug,rocm,ready — by rasmith (merged: 2026-01-15 13:04 (UTC+8))
- #32355 [ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] — rocm,structured-output,ready,v1,llama — by micah-wil (merged: 2026-01-15 12:53 (UTC+8))
- #31997 [CI/Build][Hardware][AMD] Fix v1/shutdown — rocm,ready,v1 — by rjrock (merged: 2026-01-15 12:01 (UTC+8))
- #31867 [Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference — bug,ready,cpu — by kzwrime (merged: 2026-01-15 12:50 (UTC+8))
- #32366 [Misc] Remove redundant line — ready — by Potabk (merged: 2026-01-15 12:29 (UTC+8))
- #32345 Support configure skip_special_tokens in openai response api — frontend,ready — by 842974287 (merged: 2026-01-15 12:07 (UTC+8))
- #32342 Fix optional parameter parsing in MiniMax M2 tool parser #32278 — ready — by baonudesifeizhai (merged: 2026-01-15 12:05 (UTC+8))
Closed Without Merging
- #32442 [Bugfix] Abort engine requests on client disconnect — bug,frontend — by JayZenith (closed: 2026-01-16 10:16 (UTC+8))
- #16606 [Kernel] Adding basic Triton JitCache for triton_attn — needs-rebase,ci/build,stale,v1 — by bringlein (closed: 2026-01-16 10:13 (UTC+8))
- #32439 [Bugfix] Abort engine requests on client disconnect — bug,frontend — by JayZenith (closed: 2026-01-16 08:46 (UTC+8))
- #32293 support non gated fused moe for compressed tensors w8a8 int8 — ready — by NVShreyas (closed: 2026-01-16 08:27 (UTC+8))
- #32371 v1: Standardize request validation errors to VLLMValidationError — needs-rebase,v1 — by JayZenith (closed: 2026-01-16 05:46 (UTC+8))
- #30395 [ROCm] [CI] [Release] Add rocm wheel release pipeline — rocm,needs-rebase,ci/build — by tjtanaa (closed: 2026-01-16 03:45 (UTC+8))
- #32387 [FEATURE] Add support for capturing hidden states in Qwen3MoeLLMModel… — qwen — by tzhouam (closed: 2026-01-16 00:38 (UTC+8))
- #32392 Support custom URI schemes and trace handlers for profiler — v1,meta-exported,fb-exported — by brad-mengchi (closed: 2026-01-15 15:13 (UTC+8))
- #27958 Make pre-commit work on fedora — no labels — by rabi (closed: 2026-01-15 11:42 (UTC+8))