vLLM Development Activity Report - 2026-03-08
Time window: 2026-03-08 11:29 (UTC+8) ~ 2026-03-09 11:29 (UTC+8). Stats: 11 new issues | 8 closed issues | 54 new PRs | 11 merged PRs | 26 PRs closed without merging
📊 Daily Development Status Summary
During this period (March 8-9, 2026), vLLM development remained active: 11 new issues and 54 new PRs were filed, and 11 PRs were merged. Work centered on model-compatibility fixes (especially for the Qwen series), CPU-mode support, and deep optimization for the AMD ROCm platform. Discussions also continued on API design (e.g. the default behavior of the `usage` field) and internal architecture (e.g. splitting out the client protocol).
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was prominent this period, concentrated on performance optimization and bug fixes and driven by AMD engineers (ChuanLi1101, hongxiayang, and others).
- PR #36425: [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass (by ChuanLi1101)
  - Technical details: adds a ROCm-specific `RocmAiterAllReduceFusionPass` compilation pass that fuses the allreduce in TP (tensor parallelism) communication with the AITER CK RMSNorm computation into a single graph node (`rocm_aiter_fused_allreduce_rmsnorm`). This is the ROCm/AITER counterpart of the corresponding NVIDIA FlashInfer feature.
  - Impact: aims to cut graph overhead and memory traffic for multi-GPU ROCm deployments (e.g. MI300X/MI355X), a key step toward better distributed training/inference performance. The contributor notes this is phase one (graph-level fusion); phase two will move to AITER-native kernels for true communication/computation overlap.
  - Discussion highlight: the AMD reviewer (tjtanaa) asked for performance benchmarks and model-accuracy (lm-eval) validation data, reflecting strict standards for feature quality and measurable performance gains.
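The unfused pattern the pass matches can be sketched in NumPy for reference semantics only (function and variable names here are illustrative, not vLLM's actual implementation): each TP rank's partial result is summed (allreduce), then RMSNorm is applied once to the reduced tensor; the pass replaces this two-op sequence with the single `rocm_aiter_fused_allreduce_rmsnorm` node.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over the last axis: x / sqrt(mean(x^2) + eps) * weight
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def allreduce_then_rmsnorm(per_rank_partials, weight):
    # Reference semantics of the two ops the pass fuses: sum the TP ranks'
    # partial outputs (the allreduce), then normalize the reduced tensor once.
    reduced = np.sum(per_rank_partials, axis=0)
    return rms_norm(reduced, weight)

# Two ranks, each holding an identically shaped partial sum.
partials = np.stack([np.full((2, 4), 0.5), np.full((2, 4), 0.5)])
out = allreduce_then_rmsnorm(partials, np.ones(4))
```

Fusing the two steps into one graph node avoids materializing the intermediate reduced tensor between them, which is the phase-one benefit the contributor describes.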
- PR #36422: [ROCm][Bugfix] Fix MXFP4 MoE emulate fallback logic on MX-capable hardware (by ChuanLi1101)
  - Technical details: fixes a boolean-logic error in `QuarkOCP_MX_MoEMethod` that, on hardware with MX instruction support (e.g. MI350X/gfx950), prevented falling back to emulation mode when the AITER CK kernel was incompatible (e.g. a mismatched ROCm version), producing incorrect output.
  - Impact: makes MXFP4-format MoE models from the Quark quantization toolchain robust on AMD hardware, degrading safely when the high-performance kernel is unavailable.
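A minimal sketch of the corrected fallback predicate (the function name and arguments are hypothetical reconstructions, not the actual code in `QuarkOCP_MX_MoEMethod`): emulation must be chosen whenever the fast kernel cannot be used, regardless of hardware MX capability.

```python
def should_emulate_mxfp4(hw_supports_mx: bool, aiter_kernel_ok: bool) -> bool:
    # Fall back to emulation if the hardware lacks MX instructions OR the
    # AITER CK kernel is unusable (e.g. mismatched ROCm version). The bug
    # described in the PR was that MX-capable hardware never fell back even
    # when the kernel was incompatible, producing wrong output.
    return (not hw_supports_mx) or (not aiter_kernel_ok)
```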
- PR #36374 / #36376: [Bugfix][ROCm] full graph capture on triton-attn fix (by hongxiayang)
  - Technical details: proposes two alternative fixes for the "write to read-only page" memory fault triggered by full CUDA graph capture with Triton attention on ROCm.
    - Option 1 (PR #36374): on ROCm, skip allocating the `softmax_segm_output` and related buffers for Triton attention, forcing the 2D kernel (similar to the AITER path).
    - Option 2 (PR #36376): on ROCm, override Triton attention's `get_cudagraph_support()` to return NEVER, disabling full-graph capture entirely in favor of piecewise graphs.
  - Impact: resolves a core stability issue on the ROCm platform and keeps the CUDA graph optimization usable there.
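Option 2 can be sketched as a classmethod override; the class and enum names below are stand-ins, not vLLM's actual identifiers.

```python
from enum import Enum

class CudagraphSupport(Enum):
    # Stand-in for vLLM's internal cudagraph-support levels.
    NEVER = "never"
    PIECEWISE = "piecewise"
    FULL = "full"

class TritonAttentionBackend:
    @classmethod
    def get_cudagraph_support(cls) -> CudagraphSupport:
        return CudagraphSupport.FULL

class RocmTritonAttentionBackend(TritonAttentionBackend):
    @classmethod
    def get_cudagraph_support(cls) -> CudagraphSupport:
        # Option 2: report NEVER on ROCm so full-graph capture is skipped
        # and only piecewise graphs are used, avoiding the memory fault.
        return CudagraphSupport.NEVER
```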
- Closed Issue #36167: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM
  - Outcome: the user resolved the problem by manually adding ROCm's `clr` library to `LD_LIBRARY_PATH`, while noting it feels like a workaround. This exposes a library-path configuration issue on some AMD consumer platforms (Ryzen AI).
💬 High-Engagement Discussion Analysis
- Issue #36408: [Bug]: Qwen3.5-9B (BF16/AWQ) Illegal Memory Access (6 comments)
  - Core issue: a user running Qwen3.5-9B (including the AWQ-quantized variant) on WSL2 with an RTX 3090 Ti hits CUDA illegal memory access errors.
  - Viewpoints:
    - Reporter (d-etu): provided detailed reproduction steps and stressed that the problem is not limited to quantized models.
    - Participant 1 (SaxenaShiv): expressed interest in fixing it and asked whether it reproduces on other GPUs (RTX 4060).
    - Participant 2 (hocop): confirmed the same problem on dual 3090s and noted that disabling MTP made it go away, a key clue.
    - Participant 3 (ZJY0516): suggested testing against main, hinting the underlying issue may already be fixed.
    - Reporter follow-up: after updating the NVIDIA driver, the AWQ variant works, but the base BF16 variant now fails with a new error, suggesting multiple layered causes.
  - Conclusion: the problem likely involves specific driver versions and MTP (likely multi-token prediction) compatibility rather than a single cause. The driver update fixed part of it but revealed deeper instability.
- PR #36425: [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass (4 comments)
  - Core issue: the AMD reviewer set performance-validation requirements for the new feature.
  - Viewpoints:
    - Reviewer (tjtanaa): asked for pre/post-fusion model evaluation scores (lm-eval) and performance gains across TP sizes, and suggested enabling an end-to-end test in CI.
    - Contributor (ChuanLi1101): explained that this PR is phase one (graph-level fusion), whose main benefit is fewer intermediate tensors and less graph overhead; the real performance win (communication/computation overlap) depends on phase two's AITER-native fused kernel. He proposed joint performance testing on AMD CI nodes.
  - Takeaway: the discussion shows the rigor expected of vendor contributions in open source: not just functional correctness, but measurable performance gains and complete integration testing.
- Issue #36382: [CPU] AssertionError in CPUModelRunner (2 comments)
  - Core issue: in CPU mode on macOS (ARM64), `CPUModelRunner` initialization fails an assertion because the `num_tokens_no_spec` attribute is a NumPy array rather than a PyTorch tensor.
  - Viewpoints:
    - Reporter (maxwillzq): provided detailed reproduction steps and a direct fix, commenting out the type assertion.
    - Another user (NimaSoroush): confirmed the fix works.
  - Conclusion: a classic compatibility bug from divergent CPU/GPU code paths. The user supplied the fix directly, a good example of efficient community self-help.
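One defensive pattern for this class of bug, as a hypothetical sketch (the reported fix simply removed the assertion): normalize the value instead of asserting on its exact type. The converter is injected here so the sketch stays dependency-light; in a real runner it would be `torch.from_numpy`.

```python
import numpy as np

def ensure_tensor(value, from_numpy):
    # The CPU path can hand back a numpy.ndarray where the GPU path yields
    # a torch.Tensor; convert instead of asserting on the exact type.
    # `from_numpy` stands in for torch.from_numpy.
    if isinstance(value, np.ndarray):
        return from_numpy(value)
    return value
```

In `CPUModelRunner` this would replace the failing assertion on `num_tokens_no_spec` with a conversion, keeping both code paths working.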
🔥 Hot Topics and Trend Analysis
- Qwen-series issues arriving in a burst: multiple issues report problems with Qwen3.5/3 across configurations (#36408, #36411, #36372, #36432), spanning illegal memory access, LoRA load failures, a KeyError under pipeline parallelism, and chat-template application errors. Compatibility testing for this model family in vLLM is clearly under pressure.
- CPU support and robustness: beyond the macOS CPU issue above, closed Issue #35789 fixed NUMA binding for multi-node CPU deployments, and PR #36430 fixes non-tensor member replacement in the CPU model runner. CPU inference support is advancing steadily but with many edge cases.
- Multimodal and reasoning features keep maturing: PR #36431 fixes several bugs in Qwen3-Omni interleaved video/audio handling, and PR #36375 fixes a streaming/non-streaming output inconsistency for Qwen3 reasoning models, showing careful polish for these complex models.
- AMD platform optimization is clearly accelerating: several core PRs this period target the ROCm ecosystem, from kernel fusion (#36425) and quantization fallback logic (#36422) to CUDA graph compatibility (#36374/#36376), indicating the AMD team is systematically deep-optimizing vLLM and clearing out issues.
🛠️ Key Technical Changes
- PR #36398: Allow `markdownlint` to run locally (merged)
  - Summary: moves the documentation lint tool `markdownlint` from CI-only to also running as a local pre-commit hook, and upgrades to the official markdownlint-cli2.
  - Impact: improves developer experience (DX) by letting documentation contributors fix formatting before submitting, lowering PR failure rates; a nice refinement of the project's quality workflow.
- PR #36166: [Frontend] Add GPU-less render serving path (merged)
  - Summary: introduces a CPU-only preprocessing serving path (`vllm launch render`) that provides chat-template rendering, tokenization, and related functionality without a GPU or inference engine.
  - Impact: lays groundwork for a disaggregated frontend/backend architecture by letting costly preprocessing be offloaded to a dedicated service, an important step for architectural flexibility and scalability.
- PR #36388: [Bugfix] Fix hybrid Attention+Mamba models failing when hybrid KV cache manager is disabled
  - Summary: fixes startup failures for hybrid Attention+Mamba models (e.g. Qwen3.5-0.8B) when the hybrid KV cache manager is disabled (e.g. via `--kv-events-config`).
  - Impact: relaxes the validation logic to allow `MambaSpec` to coexist with an attention spec, keeping emerging hybrid-architecture models usable across more configurations.
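The relaxed check can be illustrated with a hypothetical sketch (vLLM's real spec types and validation code differ; this only shows the shape of the change): instead of requiring one uniform spec kind, both attention and Mamba kinds are accepted together.

```python
def check_kv_cache_specs(spec_kinds: set) -> None:
    # Hypothetical relaxed validation: before the fix, a single uniform spec
    # kind was effectively required when the hybrid KV cache manager was
    # disabled; after it, attention and Mamba specs may coexist.
    supported = {"attention", "mamba"}
    unsupported = spec_kinds - supported
    if unsupported:
        raise ValueError(f"unsupported KV cache spec kinds: {unsupported}")
```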
- PR #36170: [Dependency] Remove default ray dependency (merged)
  - Summary: makes Ray an optional dependency, installed only when the Ray distributed executor backend is used.
  - Impact: eliminates version conflicts caused by vLLM force-installing Ray, and shrinks the install footprint and potential attack surface; an important dependency-management cleanup.
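The pattern behind such a change can be sketched as a lazy, guarded import (a hypothetical illustration; vLLM's actual executor wiring is more involved):

```python
def load_ray_executor():
    # Import Ray only when the Ray distributed executor backend is actually
    # selected; after PR #36170 ray is no longer a default dependency.
    try:
        import ray
    except ImportError as exc:
        raise RuntimeError(
            "The Ray executor backend was requested but ray is not "
            "installed; install it separately, e.g. `pip install ray`."
        ) from exc
    return ray
```

Users who never select the Ray backend pay no install cost, and the error message tells the rest exactly what to do.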
📈 Development Activity Observations
- Contributors: AMD team members (ChuanLi1101, hongxiayang) were unusually active this period, submitting several PRs for core optimizations and fixes. Regular contributors such as lailoo, scoootscooob, and esmeetu continued bug-fixing and feature work across models, the scheduler, and the frontend.
- Review and merging: 11 PRs were merged, spanning documentation, frontend, model support, the CPU backend, and AMD optimizations. Some PRs (e.g. #35815, #36170) waited after approval before merging, suggesting the core team is pacing merges deliberately. Review threads on AMD-related PRs show strict demands for performance and test data.
- Issue handling: 8 issues were closed, including the long-running "Ray dependency" RFC discussion (#33445), indicating that some architectural decisions have reached consensus and landed.
💡 Issues Worth Watching
- Qwen model family stability: the cluster of Qwen-related bugs (#36408, #36411, #36372, and others) deserves focused attention from the core team and may call for systematic compatibility testing and fixes.
- CPU-mode robustness: the CPU/GPU code-path divergence exposed by Issue #36382 and PR #36430 will likely become more common as CPU support grows; a more unified abstraction or test harness may be needed.
- RFC #36406: Changing the Default Behavior of usage Field in Responses: this proposal would change the streaming-response default of not returning the `usage` field, to better serve production billing and monitoring needs. It deviates from the OpenAI API convention and may spark broad community discussion; compatibility and utility need careful weighing.
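Today a streaming client must opt in explicitly via `stream_options` (this request shape follows the OpenAI-compatible API; the model name is a placeholder). The RFC proposes flipping this default:

```python
# OpenAI-compatible request body: explicit opt-in is required today for
# usage reporting on streamed responses.
request_body = {
    "model": "Qwen/Qwen3-9B",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
```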
📋 Appendix: Detailed Data Lists
New Issues
- #36411 [Bug]: vLLM not working with qwen3.5 27B — bug — by surapuramakhil (created: 2026-03-08 22:15 (UTC+8))
- #36408 [Bug]: Qwen3.5-9B (BF16/AWQ) Illegal Memory Access in vLLM v0.17.0 (WSL2/RTX3090 Ti) — bug — by d-etu (created: 2026-03-08 20:52 (UTC+8))
- #36435 [Bug]: Responses API streaming emits tool call XML as `response.output_text.delta` instead of `response.function_call_arguments.delta` for non-harmony models — bug — by herve-ves (created: 2026-03-09 11:00 (UTC+8))
- #36433 [Bug]: matryoshka need gpu-memory??? — bug — by ciaoyizhen (created: 2026-03-09 10:56 (UTC+8))
- #36432 [Bug]: ValueError: No user query found in messages QWEN 3.5 27B VLLM 0.16.0 NIGHTLY — bug — by HojaMuerta (created: 2026-03-09 10:54 (UTC+8))
- #36382 [CPU] AssertionError in CPUModelRunner: device_tensor type mismatch (numpy.ndarray) — no labels — by maxwillzq (created: 2026-03-08 14:26 (UTC+8))
- #36407 [Bug]: KeyError in get_layers_from_vllm_config with pipeline parallelism (vLLM 0.16.0) — bug — by tusharshetty61 (created: 2026-03-08 20:35 (UTC+8))
- #36406 [RFC]: Changing the Default Behavior of usage Field in Responses — RFC — by Csrayz (created: 2026-03-08 20:32 (UTC+8))
- #36372 [Bug]: LoRA on Qwen-3.5-27B fails to run — bug — by Nero10578 (created: 2026-03-08 11:59 (UTC+8))
- #36389 [Bug]: Multi-node MP backend with vllm v17 - inner dp world group is not initialized — bug — by alan-cooney-dsit (created: 2026-03-08 16:00 (UTC+8))
- #36387 📋 Documentation Enhancement Suggestion — no labels — by croviatrust (created: 2026-03-08 15:05 (UTC+8))
Closed Issues
- #36435 [Bug]: Responses API streaming emits tool call XML as `response.output_text.delta` instead of `response.function_call_arguments.delta` for non-harmony models — bug — by herve-ves (closed: 2026-03-09 11:11 (UTC+8))
- #35789 [Bug]: CPU OMP thread autobind fails when number of local ranks <= NUMA nodes — bug — by shunzhiwen (closed: 2026-03-09 11:07 (UTC+8))
- #33445 [RFC] Remove mandatory ray installation — no labels — by yewentao256 (closed: 2026-03-09 11:06 (UTC+8))
- #27898 [Doc]: Multi-node EP on EFA (i.e. no IBGDA/DeepEP) — documentation,stale — by nathan-az (closed: 2026-03-09 10:14 (UTC+8))
- #35213 [RFC]: Disaggregate RendererClient from EngineClient as Foundation for Disaggregated Frontend — RFC — by sagearc (closed: 2026-03-09 06:29 (UTC+8))
- #33911 [Bug]: Gemma-3 multimodal models (4b/12b/27b) fail with torch.compile assertion error — torch.compile — by tomasruizt (closed: 2026-03-09 01:20 (UTC+8))
- #36167 [Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM — bug,rocm — by Eoghanmc22 (closed: 2026-03-09 00:23 (UTC+8))
- #34028 [Usage]: Unable to load Mistral-Small-3.2-24B-Instruct-2506-FP8 due to "Value error, Found unknown quantization", but the same configs worked for vllm v0.11.0 — usage — by gabinguo (closed: 2026-03-08 22:23 (UTC+8))
New PRs
- #36436 [Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers — no labels — by taneem-ibrahim (created: 2026-03-09 11:13 (UTC+8))
- #36385 [Model] Add support for BERT-like Chinese ERNIE pooling models — documentation,new-model,needs-rebase — by whyiug (created: 2026-03-08 14:55 (UTC+8))
- #36434 [XPU] Add test script of PD disaggregation — ready,v1,kv-connector — by zhenwei-intel (created: 2026-03-09 10:56 (UTC+8))
- #36431 [Bugfix] Fix Qwen3-Omni/Qwen2.5-Omni use_audio_in_video deployment issues — bug,multi-modality,qwen — by xxddccaa (created: 2026-03-09 10:44 (UTC+8))
- #36418 Populate multimedia pre-processing performance metrics via PerfStats in the vLLM engine — v1,multi-modality,meta-exported,fb-exported — by d-biswa (created: 2026-03-09 02:39 (UTC+8))
- #36398 Allow `markdownlint` to run locally — documentation,performance,ready,ci/build,cpu,kv-connector,nvidia — by hmellor (created: 2026-03-08 19:12 (UTC+8))
- #36430 [Bugfix] Avoid to replace non-tensor members in cpu model runner — bug,ready,v1,cpu — by bigPYJ1151 (created: 2026-03-09 10:34 (UTC+8))
- #36420 Revert “[Core] NGram GPU Implementation compatible with Async Scheduler” (#29184) — speculative-decoding,v1 — by zhewenl (created: 2026-03-09 03:57 (UTC+8))
- #36428 [Darft][Spec decode] Proposer Interface Unification — speculative-decoding,v1 — by wangxiyuan (created: 2026-03-09 10:27 (UTC+8))
- #36429 Bridge pad_token_id from model config to tokenizer — meta-exported,fb-exported — by cgufb (created: 2026-03-09 10:32 (UTC+8))
- #36427 TMA Support for Fully-Sharded LoRA MoE + Tuned Config Control — no labels — by RunkaiTao (created: 2026-03-09 10:22 (UTC+8))
- #36410 [Bugfix] Fix GLM4 tool parser double serialization issue — bug — by kasyoukin (created: 2026-03-08 21:39 (UTC+8))
- #36395 fix(lora): add bounds checking for TP configurations — no labels — by lailoo (created: 2026-03-08 18:52 (UTC+8))
- #36402 fix(lora): use replaced_module_name in pooling model name check — ready — by gambletan (created: 2026-03-08 19:33 (UTC+8))
- #36426 docs: fix local testing setup in contribution guide — documentation — by eexwhyzee (created: 2026-03-09 09:49 (UTC+8))
- #36425 [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass — rocm — by ChuanLi1101 (created: 2026-03-09 09:15 (UTC+8))
- #36424 [Refactor] Remove dead code in KV connector — ready,v1,kv-connector — by yewentao256 (created: 2026-03-09 09:03 (UTC+8))
- #36423 [XPU] Support cpu kv offloading on XPU platform — v1 — by chaojun-zhang (created: 2026-03-09 08:59 (UTC+8))
- #36416 [Bugfix] Clear stale CG keys after memory profiling — bug,ready,v1,nvidia — by MatthewBonanni (created: 2026-03-09 02:27 (UTC+8))
- #36422 [ROCm][Bugfix] Fix MXFP4 MoE emulate fallback logic on MX-capable hardware — bug,rocm — by ChuanLi1101 (created: 2026-03-09 06:58 (UTC+8))
- #36421 [chore] Refactor the commonly shared functions for the parsers under one module — no labels — by taneem-ibrahim (created: 2026-03-09 05:49 (UTC+8))
- #36414 [Bugfix] Add PyTorch GDN fallback for SM12.0+ (Blackwell) in Qwen3Next/Qwen3.5 — bug,performance,ci/build,qwen — by lucaspirola (created: 2026-03-09 02:08 (UTC+8))
- #36413 [Feat] add rmsnorm fp4 quant fusion of flashinfer — nvidia — by sparkecho (created: 2026-03-09 01:31 (UTC+8))
- #36417 [Bugfix] Download mmproj GGUF files for multimodal models — bug — by s-zx (created: 2026-03-09 02:35 (UTC+8))
- #36419 [Bugfix] Routed experts capturer reinitialization after KV cache profiling — bug,v1 — by HollowMan6 (created: 2026-03-09 02:45 (UTC+8))
- #36415 Usage stats V2 design — documentation — by simon-mo (created: 2026-03-09 02:22 (UTC+8))
- #36412 Fix/resupport nongated fused moe triton — ready,nvidia — by shaunkotek (created: 2026-03-08 23:21 (UTC+8))
- #36409 logging: opt-in per-rank log files (Ray-friendly) (#23761) — documentation — by KOKOSde (created: 2026-03-08 21:35 (UTC+8))
- #36380 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — no labels — by zpassion (created: 2026-03-08 13:31 (UTC+8))
- #36399 fix: handle model validation error in realtime WebSocket session.update — frontend — by gambletan (created: 2026-03-08 19:13 (UTC+8))
- #36394 fix: correct timestamp drift in speech-to-text for audio > 30s — frontend — by alvinttang (created: 2026-03-08 18:46 (UTC+8))
- #36384 Fix/fla triton autotune oom 34954 — qwen — by oneraghavan (created: 2026-03-08 14:35 (UTC+8))
- #36404 fix(lora): fix variable shadowing in get_supported_lora_modules — no labels — by gambletan (created: 2026-03-08 19:35 (UTC+8))
- #36401 fix: use byte count for realtime WebSocket audio size validation — frontend — by gambletan (created: 2026-03-08 19:17 (UTC+8))
- #36396 fix: release VideoCapture resources and guard div-by-zero in video utils — no labels — by gambletan (created: 2026-03-08 18:56 (UTC+8))
- #36392 fix: register atexit handler to prevent ImportError during FlashInfer shutdown — nvidia — by gambletan (created: 2026-03-08 17:51 (UTC+8))
- #36403 fix(s3): paginate list_objects_v2 to return all objects — no labels — by alvinttang (created: 2026-03-08 19:34 (UTC+8))
- #36400 fix: guard against ZeroDivisionError in layerwise profiler — no labels — by alvinttang (created: 2026-03-08 19:14 (UTC+8))
- #36397 fix: check HTTP status in batch read_file to prevent silent failures — frontend — by alvinttang (created: 2026-03-08 19:12 (UTC+8))
- #36393 add nemotron v3 reasoning parser — no labels — by shaunkotek (created: 2026-03-08 18:37 (UTC+8))
- #36405 tests(v1): cover parallel sampling output_kind matrix for AsyncLLM and LLMEngine — v1 — by KOKOSde (created: 2026-03-08 19:58 (UTC+8))
- #36378 fix(lora): add bounds checking in slice_lora_b for TP configurations — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,multi-modality,deepseek,gpt-oss — by lailoo (created: 2026-03-08 12:57 (UTC+8))
- #36391 [Core] Decouple draft-token hook from speculative decoding — v1 — by scoootscooob (created: 2026-03-08 16:23 (UTC+8))
- #36390 [Bugfix] Fix KV-cache head-of-line blocking in scheduler — bug,v1 — by scoootscooob (created: 2026-03-08 16:22 (UTC+8))
- #36388 [Bugfix] Fix hybrid Attention+Mamba models failing when hybrid KV cache manager is disabled — bug,v1 — by esmeetu (created: 2026-03-08 15:14 (UTC+8))
- #36383 feat(openai): add per-request timing metrics and completion_tokens_de… — documentation,frontend — by dryu-nv (created: 2026-03-08 14:32 (UTC+8))
- #36386 fix: clean shutdown for AR Fusion workspace on ctrl-c — nvidia — by goingforstudying-ctrl (created: 2026-03-08 14:56 (UTC+8))
- #36381 [Doc]: Add CUDA 13.0 recommendation for GPT-OSS on Blackwell GPUs — documentation,gpt-oss,nvidia — by lailoo (created: 2026-03-08 13:58 (UTC+8))
- #36377 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — no labels — by zpassion (created: 2026-03-08 12:31 (UTC+8))
- #36379 [Frontend] Fix usage incorrectly returned with empty `stream_options` — frontend,v1 — by Csrayz (created: 2026-03-08 13:06 (UTC+8))
- #36376 [Bugfix][ROCm] Memory access fault fix for full graph capture for triton-attn - Option 2 — bug,rocm,v1 — by hongxiayang (created: 2026-03-08 12:30 (UTC+8))
- #36375 [Bugfix] Fix non-streaming/streaming inconsistency for Qwen3 reasoning when enable_thinking is not set — bug,frontend,qwen — by esmeetu (created: 2026-03-08 12:28 (UTC+8))
- #36374 [Bugfix][ROCm] full graph capture on triton-attn fix - Option1 — bug,rocm,v1 — by hongxiayang (created: 2026-03-08 12:23 (UTC+8))
- #36373 [Bugfix][Frontend] Do not persist load_inplace on stored LoRA requests — bug,frontend — by omkaark (created: 2026-03-08 12:08 (UTC+8))
Merged PRs
- #36149 fix: Use iterator as not to store all the file loads in memory at once — ready — by shaunkotek (merged: 2026-03-09 11:25 (UTC+8))
- #35579 [Examples][1/n] Resettle basic examples. — documentation,ready,ci/build,cpu — by noooop (merged: 2026-03-09 11:22 (UTC+8))
- #35815 [Bugfix] Fix CPU OMP autobind assertion to use local_world_size — bug,ready,v1,cpu — by weiguangli-io (merged: 2026-03-09 11:07 (UTC+8))
- #36170 [Dependency] Remove default ray dependency — documentation,rocm,ready,ci/build,nvidia — by yewentao256 (merged: 2026-03-09 11:06 (UTC+8))
- #36398 Allow `markdownlint` to run locally — documentation,performance,ready,ci/build,cpu,kv-connector,nvidia — by hmellor (merged: 2026-03-09 11:05 (UTC+8))
- #36301 [XPU][Doc] update xpu document about triton dependency/conflict issue. — documentation,ready — by jikunshang (merged: 2026-03-09 10:09 (UTC+8))
- #28044 [cudagraph] fix cudagraph warning in deepseekv32 — ready,deepseek,nvidia — by ZJY0516 (merged: 2026-03-09 08:27 (UTC+8))
- #35986 Add support for ModelOpt MXFP8 MoE models — ready,nvidia,quantization — by danisereb (merged: 2026-03-09 04:00 (UTC+8))
- #36341 [CI] fix flaky empty responses and add diagnostic assertions in vision chat tests — rocm,ready — by AndreasKaratzas (merged: 2026-03-08 14:42 (UTC+8))
- #36166 [Frontend] Add GPU-less render serving path (`vllm launch render`) — frontend,ready,v1 — by sagearc (merged: 2026-03-08 23:35 (UTC+8))
- #35657 [Model] Nano Nemotron VL - fast media preprocessing — ready — by nvnbagrov (merged: 2026-03-08 18:04 (UTC+8))
PRs Closed Without Merging
- #31400 [wip] custom allreduce and custom unquantized_gemm — no labels — by wxsIcey (closed: 2026-03-09 10:25 (UTC+8))
- #36421 [chore] Refactor the commonly shared functions for the parsers under one module — no labels — by taneem-ibrahim (closed: 2026-03-09 05:54 (UTC+8))
- #36415 Usage stats V2 design — documentation — by simon-mo (closed: 2026-03-09 02:26 (UTC+8))
- #33958 Fix double‑scaled linear RoPE cache for long‑context configs #33911 — no labels — by baonudesifeizhai (closed: 2026-03-09 01:16 (UTC+8))
- #34919 [Bugfix][Model] Qwen3-Coder and Qwen3-Coder-Next: Fix qwen3_coder tool parser to handle </parameter> in parameter content — bug,qwen — by yaro-tal (closed: 2026-03-09 00:19 (UTC+8))
- #35228 [Frontend] Disaggregate RendererClient from EngineClient [1/N] — documentation,performance,new-model,rocm,structured-output,frontend,needs-rebase,ci/build,v1,qwen — by sagearc (closed: 2026-03-08 21:48 (UTC+8))
- #36378 fix(lora): add bounds checking in slice_lora_b for TP configurations — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,multi-modality,deepseek,gpt-oss — by lailoo (closed: 2026-03-08 18:46 (UTC+8))
- #36132 Handle null RoPE factor in max length derivation (+tests) — no labels — by siewcapital (closed: 2026-03-08 16:36 (UTC+8))
- #36334 [Doc]: Add README.md to examples directory — documentation,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36369 [Doc]: Add model card template — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36368 [Doc]: Add common errors and solutions guide — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36367 [Doc]: Add comprehensive environment variables reference — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36366 [Doc]: Add API migration guide — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36365 [Doc]: Add production deployment checklist — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36364 [Doc]: Add glossary of terms — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36363 [Doc]: Add performance tuning guide — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36362 [Doc]: Add APC debugging tips documentation — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:36 (UTC+8))
- #36354 [Bugfix]: Fix unclean shutdown from ctrl-c with AR Fusion — bug,documentation,qwen,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36353 [Bugfix]: Support transformers 5.x Qwen3_5MoeTextConfig — bug,documentation,qwen — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36352 [Bugfix]: Fix Qwen3.5 missing max_window_layers attribute — bug,documentation,qwen — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36351 [Doc]: Add deployment patterns guide — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36349 [Doc]: Add quick start cheat sheet — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36335 [Doc]: Add debugging tip for Automatic Prefix Caching — documentation — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36333 [Doc]: Add –speculative-config option documentation — documentation,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 14:35 (UTC+8))
- #36381 [Doc]: Add CUDA 13.0 recommendation for GPT-OSS on Blackwell GPUs — documentation,gpt-oss,nvidia — by lailoo (closed: 2026-03-08 14:08 (UTC+8))
- #36377 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — no labels — by zpassion (closed: 2026-03-08 13:20 (UTC+8))