vLLM Development Activity Report - 2026-02-28

Time window: 2026-02-28 11:37 (UTC+8) ~ 2026-03-01 11:37 (UTC+8)
Statistics: New Issues 11 | Closed Issues 12 | New PRs 45 | Merged PRs 30 | PRs Closed Without Merging 12

📊 Daily Development Status Summary

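During this analysis window the vLLM community remained highly active, merging 30 PRs and closing 12 issues. Development focused on model compatibility fixes (especially for the Qwen family), feature enablement and performance tuning on the AMD ROCm platform, and stability improvements in the inference backend (for example speculative decoding and CUDA Graph). The community responded quickly to user reports and fixed several critical bugs that affected deployment.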

🎯 AMD/ROCm Ecosystem Updates

AMD-related development was very active in this window, concentrated on feature enablement, performance tuning, and bug fixes.

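Issues and bug reports:
- Issue #35633: A user reports that on an AMD MI355X GPU the vLLM 0.16 ROCm Docker image cannot run the MXFP4-quantized model amd/Kimi-K2.5-MXFP4 because the amd-quark package is missing. This points directly at an integration gap between the AMD Quark quantization toolchain and the official vLLM image, and it is a key blocker for model deployment in the AMD ecosystem. The issue is labeled bug and rocm.
PRs and feature development/fixes:
- PR #35601 ([ROCm][Bugfix]: Disable AITER Triton ROPE by default): Disables the AITER RoPE implementation by default because its performance degrades with large batches on gfx942 (MI300), while keeping the +rotary_embedding custom op. This reflects fine-grained, per-operator performance tuning on AMD hardware.
- PR #35597, #35596, #35595: A series of PRs from contributor brucechanglongxu that enable more quantization schemes on ROCm. Together they noticeably widen the range of quantized models that can run on AMD GPUs, an important step for the platform's competitiveness.
  - #35597 enables the compressed-tensors WNA16 test.
  - #35596 lets moe_wna16 (GPTQ/AWQ MoE) run on ROCm by falling back to Triton kernels.
  - #35595 adds experts_int8 quantization to the ROCm support list.
- PR #35069 ([ROCm] Derive device capability from GCN arch string without CUDA init) (merged): Fixes failures in distributed environments such as Ray that were caused by initializing CUDA just to query the device capability. The capability is now derived by parsing the GCN architecture string, which makes the ROCm platform more robust in complex deployments.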
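The idea behind PR #35069, reading the device capability from the architecture string instead of querying an initialized device, can be illustrated with a small sketch. This is not vLLM's actual code: the function name `capability_from_gcn_arch` and the digit layout it assumes (last character is the stepping, the one before it is the minor version, and the rest is the major version) are assumptions made for illustration only.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration; not the implementation from PR #35069.
_GFX_RE = re.compile(r"^gfx([0-9a-f]+)$")


def capability_from_gcn_arch(arch: str) -> Optional[Tuple[int, int]]:
    """Map a string such as 'gfx942:sramecc+:xnack-' to (major, minor)."""
    # Drop feature suffixes such as ':sramecc+:xnack-'.
    base = arch.split(":", 1)[0].strip().lower()
    match = _GFX_RE.match(base)
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) < 3:
        return None
    # Assumed layout: <major><minor><stepping>, where minor and stepping
    # are single hexadecimal characters (for example 'gfx90a').
    major_part, minor_part = digits[:-2], digits[-2]
    if not major_part.isdigit():
        return None
    return int(major_part), int(minor_part, 16)


print(capability_from_gcn_arch("gfx942:sramecc+:xnack-"))  # (9, 4)
print(capability_from_gcn_arch("gfx1100"))                 # (11, 0)
```

Because the function only parses a string, it can run in a driver process (for example a Ray driver) that must not create a GPU context.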

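Summary: The AMD ecosystem was a development hotspot in this window. The team is systematically closing gaps in quantization support (MXFP4, WNA16 MoE, INT8 MoE) and doing hardware-specific performance tuning and stability fixes. The missing amd-quark package shows that complete packaging and delivery of the software stack still needs attention.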

💬 Analysis of High-Traffic Discussions

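Issue #35608: CUDA error: CUBLAS_STATUS_INVALID_VALUE:
- Core topic: A user hit a cuBLAS error when deploying Qwen3.5-122B with the v0.16.0 image and worked around it by running unset LD_LIBRARY_PATH.
- Positions:
  - Reporter (adenzhou1350): Believes the cause is a CUDA library conflict inside the Docker image.
  - Participant (ZJY0516): Points out that this is a known problem with CUDA 12.9 + Torch 2.10 and that the nvidia-cublas-cu12 package needs to be upgraded.
  - Another participant (shahizat): Confirms the same problem on H100 and links a related issue.
- Point of contention: No real disagreement; the thread is mostly shared experience and a collection of workarounds.
- Current status: The problem has been narrowed down to a library compatibility issue under a specific CUDA version, and a concrete upgrade path has been provided.
Issue #35414 / PR #35617: Qwen3.5 fails on older GPUs (2080 Ti) because of bfloat16:
- Core topic: Even with --dtype float16, Qwen3.5/Qwen3Next models still created bfloat16 parameters, which crashed on GPUs without bf16 support.
- Positions:
  - Affected users (chuanSir123, BUJIDAOVS): Reported the error and tried temporary workarounds through environment variables and edits to the configuration file.
  - Fix author (lailoo, via PR #35617): Identified the root cause: the model code used the HuggingFace config.dtype directly (the original torch_dtype from config.json) instead of the dtype the user requested with --dtype.
- Point of contention: None; this is a clear-cut bug.
- Final conclusion: Fixed by PR #35617 (merged), which removes the hard-coded dtype=config.dtype from model initialization so that parameters follow the global dtype set by the vLLM model loader and therefore honor the user's command-line argument.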
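The bug pattern described for PR #35617 can be sketched without any GPU libraries. The classes and function names below (`CheckpointConfig`, `LoaderState`, `param_dtype_before_fix`, `param_dtype_after_fix`) are hypothetical stand-ins, and dtypes are plain strings; the sketch only shows why pinning a layer to the checkpoint dtype bypasses the dtype chosen by the user, not how vLLM implements it.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Hypothetical stand-in for the HuggingFace config object."""
    dtype: str  # the torch_dtype recorded in config.json


class LoaderState:
    """Hypothetical stand-in for the global default dtype set by the model loader."""
    default_dtype = "float32"


def param_dtype_before_fix(config: CheckpointConfig) -> str:
    # Bug pattern: the layer pins its parameters to the checkpoint dtype,
    # so the user's --dtype choice never reaches it.
    return config.dtype


def param_dtype_after_fix(config: CheckpointConfig) -> str:
    # Fixed pattern: no explicit dtype is passed, so the loader default applies.
    return LoaderState.default_dtype


config = CheckpointConfig(dtype="bfloat16")
LoaderState.default_dtype = "float16"  # what --dtype float16 would select

print(param_dtype_before_fix(config))  # bfloat16
print(param_dtype_after_fix(config))   # float16
```

On a GPU without bf16 support, the first result is the one that leads to the crash reported in Issue #35414.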
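Issue #35633: Missing MXFP4 support on AMD MI355:
- Core topic: As noted above, the user cannot test AMD's latest MXFP4-quantized model in the vLLM ROCm image.
- Positions:
  - Reporter (functionstackx): Posted the full error stack and directly @-mentioned several developers who appear to be AMD employees (powderluv, chunfangamd, andyluo7).
  - Bot (github-actions): Automatically labeled the issue and CC'd the ROCm maintainers.
- Point of contention: No in-depth discussion yet, but the issue directly exposes a synchronization gap between new AMD hardware features and the vLLM release process.
- Current status: Newly opened; waiting for a response from the AMD team or the maintainers.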

🔥 Hot Topics and Trends

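Qwen deployment problems cluster together: Several issues involve the Qwen family (Qwen3.5-122B, 35B-A3B, 3-Omni, VL-Embedding), covering a CUDA library conflict (#35608), TTFT tail latency (#35625), an unrecognized argument (#35602), a multimodal limiter failure (#35624), and an ignored dtype (#35414). As Qwen models grow in popularity, their varied configurations and features stress the inference engine across the board.
AMD platform support is accelerating: As noted above, this window saw many ROCm-related PRs, ranging from fixes to basic functionality (device capability queries) to support for advanced features (several MoE quantization schemes). This suggests that the AMD team or the community is actively pushing vLLM toward maturity on AMD hardware.
Continued work on inference performance and stability: Fixes keep landing around speculative decoding (MTP weight validation #35548, MTP layer propagation fix #35606), CUDA Graph (better input address debugging #35605), pipeline parallelism (RoutedExperts fix #35623), and the distributed executor (Ray deadlock #35403). The project is pursuing peak performance while also hardening stability for large, complex deployments.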

🛠️ Key Technical Changes

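PR #35617 ([Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag) (merged): A seemingly small change that resolves a key compatibility problem for users deploying new large models on older GPUs. Technical impact: It makes the user's command-line argument authoritative, removes the model configuration's hidden dependency on the runtime environment, and makes deployments more predictable.
PR #35271 ([Feat] Add CUDA torch fallbacks for fp8_mqa_logits…) (merged): Adds PyTorch fallback implementations for DeepGemm functions such as fp8_mqa_logits. Technical impact: CUDA platforms that do not support DeepGemm (such as the sm80 A100) or that lack the library can now run sparse-attention models that depend on these functions, such as GLM-5, through standard PyTorch operators. This noticeably widens hardware support for those models.
PR #35557 ([Bugfix] Fix Anthropic API base64 image handling in Messages endpoint) (merged): Fixes the handling of base64 images and of images returned by tool calls in the Anthropic-compatible API. Technical impact: It rounds out vLLM's handling of multimodal input and improves compatibility with other API protocols (here Anthropic), which matters for building general-purpose AI serving gateways.
PR #35548 ([MTP] Validate that MTP weights are actually loaded) (merged): When loading an MTP (Multi-Token Prediction) model, verifies that the weights were really loaded so that uninitialized memory is never used. Technical impact: It makes speculative decoding safer by preventing the trap where an incomplete quantized checkpoint silently degrades performance (a zero acceptance rate), and it requires model providers to ship complete weights.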
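The kind of check described for PR #35548 can be sketched as a comparison between the parameter names a model expects and the names that were actually filled from the checkpoint. The helper names and the example parameter names below are hypothetical and chosen for illustration; the sketch does not reproduce the validation code in vLLM.

```python
from typing import Iterable, List, Set


def find_unloaded_weights(expected: Iterable[str], loaded: Set[str]) -> List[str]:
    """Return the expected parameter names that the checkpoint never provided."""
    return sorted(name for name in expected if name not in loaded)


def validate_mtp_weights(expected: Iterable[str], loaded: Set[str]) -> None:
    """Fail loudly instead of running a draft head on uninitialized weights."""
    missing = find_unloaded_weights(expected, loaded)
    if missing:
        raise ValueError(
            "MTP weights were not loaded from the checkpoint: " + ", ".join(missing)
        )


# Hypothetical parameter names for a single MTP layer.
expected_names = ["mtp.layers.0.proj.weight", "mtp.layers.0.norm.weight"]
loaded_names = {"mtp.layers.0.norm.weight"}

try:
    validate_mtp_weights(expected_names, loaded_names)
except ValueError as err:
    print(err)
```

Raising at load time turns a silent zero acceptance rate into an explicit error that names the missing tensors.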

📈 Development Activity Observations

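Contributor diversity: Besides core maintainers (such as DarkLight1337, WoosukKwon, chaunceyjiang), this window featured several active contributors focused on specific areas, for example brucechanglongxu on ROCm quantization, and lailoo and cwazai with multiple model fixes.
AMD team involvement: The @-mentioned and contributing usernames that carry an amd marker (for example chunfangamd and amd-hhashemi), together with the concentration of ROCm-related PRs, suggest that AMD engineers are deeply involved in vLLM development and are driving improvements to its ecosystem.
Efficient merging: 30 PRs were merged while 45 were opened, a merged-to-opened ratio of about 67% (the merged set also includes PRs opened before this window). Many fix PRs were merged on the day they were created (for example #35600, #35586, #35589, #35617), which reflects the core team's responsiveness to user problems and an efficient review process.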

💡 Issues Worth Watching

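Lagging integration of new AMD features: Issue #35633 exposes a break between AMD's latest hardware features (MXFP4 on MI355) and the vLLM release process. Integrating partners' cutting-edge technology into the official images more quickly is a problem that needs coordination.
CPU backend build problem: Issue #35599 shows that v0.16.0 fails to compile for CPU in a specific configuration (HEAD_DIM=576). A fallback was provided, but the failure suggests that CPU backend test coverage may be insufficient, which affects niche but important use cases.
Stability of the Ray distributed executor: Issue #35403 and its fix, PR #35405, reveal how fragile Ray Compiled DAG is when handling large tensors under pipeline parallelism with multimodal input. As demand for distributed, multimodal inference grows, the stability of this underlying execution framework is critical.
Complexity of large-model deployment: Issue #35625 (TTFT tail latency on Qwen3.5-35B-A3B) and #35496 (a compilation timeout on Qwen3.5-397B, reported as a timed-out sample_tokens RPC) show that performance tuning and initialization remain major challenges when deploying very large models or models with complex inference structure, and that users need deeper technical insight to handle them.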

📋 Appendix: Detailed Data Lists

New Issues

#35633 [Bug]: parity with cuda: rocm v0.16 image missing amd quark kimi k2.5 mxfp4 | bug,rocm | by functionstackx (created: 2026-03-01 09:20 (UTC+8))
#35631 [Feature]: Pooling Model Performance Optimizations | feature request | by yewentao256 (created: 2026-03-01 07:00 (UTC+8))
#35625 [Bug]: TTFT latency issue with Qwen3.5-35B-A3B model using vllm | bug | by shahizat (created: 2026-03-01 04:34 (UTC+8))
#35624 [Bug]: Qwen3-Omni Model Fails when try to l | bug | by Blaze-DSP (created: 2026-03-01 03:25 (UTC+8))
#35608 [Bug]: vllm 0.16.0+image encountered CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx | bug | by adenzhou1350 (created: 2026-02-28 18:40 (UTC+8))
#35612 [Bug]: invalid argument at cumem_allocator.cpp:119 | bug | by flutist (created: 2026-02-28 21:32 (UTC+8))
#35599 [Bug]: cpu version compile failed in 0.16.0 | bug | by wanghualex1-wq (created: 2026-02-28 15:40 (UTC+8))
#35603 [Bug]: vllm: error: unrecognized arguments: --task embedding | bug | by dojeese-maker (created: 2026-02-28 17:26 (UTC+8))
#35602 [Bug]: vllm: error: unrecognized arguments: --language-model-only | bug | by ciaoyizhen (created: 2026-02-28 16:40 (UTC+8))
#35593 [Bug]: CUDA Error 803 (system has unsupported display driver / cuda driver combination) when host driver is 590.48.01 due to cuda-compat conflict | bug | by kkkzhonghaiwei (created: 2026-02-28 14:49 (UTC+8))
#35585 [CI] MockWeightTransferEngine missing trainer_send_weights implementation | ci-failure | by LucasWilkinson (created: 2026-02-28 12:28 (UTC+8))

Closed Issues

#35414 [Bug]: 4*2080ti 22g deploy Qwen3.5-35B-A3B fail:2080 Ti does not support bfloat16 | bug | by chuanSir123 (closed: 2026-03-01 11:27 (UTC+8))
#27508 [Bug]: openai/gpt-oss-120b can’t run on H100 | bug,stale | by matheusfvesco (closed: 2026-03-01 10:16 (UTC+8))
#27807 [Bug]: When using CPU to run examples/offline_inference/basic/basic.py, “RuntimeError: Device string must not be empty” is generated. | bug,stale | by spacecat2002 (closed: 2026-03-01 10:16 (UTC+8))
#27868 [Bug]: Segfault in NCCL cuMemCreate during distributed engine initialization on 3080 | bug,stale | by zjcx524 (closed: 2026-03-01 10:16 (UTC+8))
#27877 [Usage]: How to install nightly version??? Why this command doesn’t work? | usage,stale | by sleepwalker2017 (closed: 2026-03-01 10:16 (UTC+8))
#27886 [Bug]: vLLM 0.10.2/0.11.0 bench serve deadlocks when benchmarking DeepSeek-R1-BF16 (sglang 0.4.7), with processes hanging indefinitely during script execution | bug,stale | by tears710 (closed: 2026-03-01 10:16 (UTC+8))
#35403 [Bug]: Ray Compiled DAG timeout/deadlock during VLM forward pass with PP>1 and high-res images | bug | by emricksini-h (closed: 2026-03-01 00:14 (UTC+8))
#35021 [Bug]: GLM-5 (Sparse MLA / DSA model) cannot run on sm80 GPUs (A100/A800): hard DeepGemm dependency with no fallback | bug | by qjxjy123 (closed: 2026-02-28 18:12 (UTC+8))
#35602 [Bug]: vllm: error: unrecognized arguments: --language-model-only | bug | by ciaoyizhen (closed: 2026-02-28 17:04 (UTC+8))
#29915 [Feature]: include reasoning tokens in /v1/messages Anthropic endpoint if model supports it | feature request | by alew3 (closed: 2026-02-28 17:02 (UTC+8))
#35585 [CI] MockWeightTransferEngine missing trainer_send_weights implementation | ci-failure | by LucasWilkinson (closed: 2026-02-28 14:47 (UTC+8))
#35496 [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B | bug | by DBDXSS (closed: 2026-02-28 14:06 (UTC+8))

New PRs

#35635 [torch.compile] Move torch.Size producers to consumer subgraph in split_graph | no labels | by bsherifi (created: 2026-03-01 10:42 (UTC+8))
#35620 [Chore] Cleanup BNB utilization dead code | ready | by Isotr0py (created: 2026-03-01 01:08 (UTC+8))
#35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs | bug,ready,qwen | by lailoo (created: 2026-03-01 00:49 (UTC+8))
#35629 feat(spec_decode): remove unpadded drafter batch mode | documentation,speculative-decoding,v1 | by sladyn98 (created: 2026-03-01 06:42 (UTC+8))
#35634 [Refactor] Simplify chat_completion_full_generator for tool parsers | frontend,ready | by yewentao256 (created: 2026-03-01 09:59 (UTC+8))
#35632 [Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA | nvidia | by eicherseiji (created: 2026-03-01 08:06 (UTC+8))
#35630 [MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE | ready | by taneem-ibrahim (created: 2026-03-01 06:46 (UTC+8))
#35628 [Model Runner V2] Minor refactoring for EncoderRunner | v1 | by WoosukKwon (created: 2026-03-01 05:30 (UTC+8))
#35627 [2/N] Elastic EP Milestone 2: Integrating NIXL-EP | kv-connector,nvidia | by itayalroy (created: 2026-03-01 05:09 (UTC+8))
#35626 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) | bug,v1 | by S1ro1 (created: 2026-03-01 04:49 (UTC+8))
#35621 [Model Runner V2] Add ModelStateInterface [4/N] | v1,nvidia | by WoosukKwon (created: 2026-03-01 01:59 (UTC+8))
#35623 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) | bug,v1 | by S1ro1 (created: 2026-03-01 02:54 (UTC+8))
#35622 test: verify auto-revert PR creation (will close immediately) | no labels | by zhewenl (created: 2026-03-01 02:14 (UTC+8))
#35611 [Bugfix] Fix CompressedTensors MoE g_idx registration when actorder is null | bug | by sxu75374 (created: 2026-02-28 19:46 (UTC+8))
#35619 [Perf] Eliminate duplicate bitmatrix metadata computation in gpt oss … | gpt-oss | by banparth (created: 2026-03-01 00:57 (UTC+8))
#35618 [Benchmark] Avoid unnecessary video download in MMVU | performance,ready | by DarkLight1337 (created: 2026-03-01 00:55 (UTC+8))
#35615 [Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode | frontend,qwen | by voipmonitor (created: 2026-02-28 23:59 (UTC+8))
#35613 [SP] Add opt-in ragged sequence parallelism path via VLLM_ENABLE_SP_RAGGED | v1 | by baonudesifeizhai (created: 2026-02-28 22:29 (UTC+8))
#35616 [Bugfix] Improve engine ready timeout error message | bug,v1 | by lailoo (created: 2026-03-01 00:08 (UTC+8))
#35614 [Docs] Add speculative-config configuration reference page | documentation | by lailoo (created: 2026-02-28 23:21 (UTC+8))
#35605 [CUDA Graph] Enhance CUDA graph input address debugging | nvidia | by yiz-liu (created: 2026-02-28 17:36 (UTC+8))
#35587 [BugFix][Model]Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start | bug | by CSWYF3634076 (created: 2026-02-28 13:04 (UTC+8))
#35601 [ROCm][Bugfix]: Disable AITER Triton ROPE by default | bug,rocm,ready | by Rohan138 (created: 2026-02-28 16:17 (UTC+8))
#35581 Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj | ready,qwen | by cwazai (created: 2026-02-28 12:00 (UTC+8))
#35597 [ROCm][Quantization] Enable compressed-tensors WNA16 test on ROCm | rocm,ready | by brucechanglongxu (created: 2026-02-28 15:18 (UTC+8))
#35594 [BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 | bug,nvidia | by flutist (created: 2026-02-28 15:17 (UTC+8))
#35609 [Bugfix] Fix TypeError in benchmarks/benchmark_prefix_caching.py with --sort | bug,performance | by GuoxiangZu (created: 2026-02-28 18:56 (UTC+8))
#35588 [Feat] Supports Anthropic Messages count_tokens API | frontend,ready | by chaunceyjiang (created: 2026-02-28 13:24 (UTC+8))
#35610 Add pytorch-triton-xpu to requirements | ci/build | by andswitch (created: 2026-02-28 19:04 (UTC+8))
#35607 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP | qwen | by cwazai (created: 2026-02-28 18:13 (UTC+8))
#35604 [Frontend][1/n] Improve pooling entrypoints classify. | frontend | by noooop (created: 2026-02-28 17:36 (UTC+8))
#35606 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP | needs-rebase,v1,qwen | by cwazai (created: 2026-02-28 17:57 (UTC+8))
#35591 Fix kv_cache_dtype auto resolution for ModelOpt kv_cache_quant_algo | documentation,kv-connector | by sriharshamudumba (created: 2026-02-28 14:32 (UTC+8))
#35592 [Docs][1/n] Reorganize pooling docs classify. | documentation | by noooop (created: 2026-02-28 14:45 (UTC+8))
#35600 [Benchmark] Improve UX of sweep scripts | documentation,performance | by DarkLight1337 (created: 2026-02-28 16:12 (UTC+8))
#35598 # [Bugfix]Fix indexError in moe wna16 quantization with enable-expert-parallel | bug | by ivyilike (created: 2026-02-28 15:34 (UTC+8))
#35586 [Benchmark] Rename SLA Finder to Workload Explorer | documentation,performance | by DarkLight1337 (created: 2026-02-28 12:54 (UTC+8))
#35596 [ROCm][Quantization] Enable moe_wna16 on ROCm via Triton fallback | rocm | by brucechanglongxu (created: 2026-02-28 15:18 (UTC+8))
#35595 [ROCm][Quantization] Enable experts_int8 on ROCm | rocm | by brucechanglongxu (created: 2026-02-28 15:17 (UTC+8))
#35589 [CI] add trainer_send_weights for MockWeightTransferEngine | ready | by chaunceyjiang (created: 2026-02-28 14:10 (UTC+8))
#35580 [CI] Defining extended V1 e2e + engine tests | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-02-28 11:45 (UTC+8))
#35590 Use smg-grpc-proto package for gRPC proto definitions | ci/build | by CatherineSue (created: 2026-02-28 14:15 (UTC+8))
#35582 [Do not merge] FA4 hang bisect Trial 1: full FA4 without diagnostics | documentation,ready,ci/build,v1,nvidia | by LucasWilkinson (created: 2026-02-28 12:22 (UTC+8))
#35583 [Do not merge] FA4 hang bisect Trial 2: Python changes only (main cmake) | documentation,ready,v1,nvidia | by LucasWilkinson (created: 2026-02-28 12:23 (UTC+8))
#35584 [Do not merge] FA4 hang bisect Trial 3: cmake + requirements only (main Python) | ready,ci/build,nvidia | by LucasWilkinson (created: 2026-02-28 12:23 (UTC+8))

Merged PRs

#35620 [Chore] Cleanup BNB utilization dead code | ready | by Isotr0py (merged: 2026-03-01 03:22 (UTC+8))
#35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs | bug,ready,qwen | by lailoo (merged: 2026-03-01 11:27 (UTC+8))
#32195 Add TMA support to fused_moe_lora kernel | ready | by gnovack (merged: 2026-03-01 10:55 (UTC+8))
#35621 [Model Runner V2] Add ModelStateInterface [4/N] | v1,nvidia | by WoosukKwon (merged: 2026-03-01 05:19 (UTC+8))
#35557 [Bugfix] Fix Anthropic API base64 image handling in Messages endpoint | bug,frontend,ready | by voipmonitor (merged: 2026-03-01 04:57 (UTC+8))
#35571 [ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances | rocm,ready | by AndreasKaratzas (merged: 2026-02-28 16:59 (UTC+8))
#35441 [Deprecation] Deprecate code in 0.17 as scheduled | frontend,ready,v1,multi-modality | by yewentao256 (merged: 2026-03-01 01:32 (UTC+8))
#35618 [Benchmark] Avoid unnecessary video download in MMVU | performance,ready | by DarkLight1337 (merged: 2026-03-01 01:07 (UTC+8))
#35405 [Fix] Avoid sending image input to other PP ranks | ready,v1 | by emricksini-h (merged: 2026-03-01 00:14 (UTC+8))
#35581 Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj | ready,qwen | by cwazai (merged: 2026-02-28 22:50 (UTC+8))
#35280 custom dataset img support base64 | documentation,performance,rocm,frontend,ready,v1,deepseek,kv-connector,nvidia | by flutist (merged: 2026-02-28 19:49 (UTC+8))
#35271 [Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function | ready,v1,nvidia | by chaunceyjiang (merged: 2026-02-28 18:12 (UTC+8))
#34214 add io_process_plugin for sparse embedding | documentation,frontend,ready,ci/build,v1 | by staugust (merged: 2026-02-28 17:16 (UTC+8))
#33671 [Feature]Supports Anthropic Thinking Block | bug,frontend,ready | by mariohong128 (merged: 2026-02-28 17:02 (UTC+8))
#33762 Add padding support to wvSplitK solution for skinny GEMMs | rocm,ready | by amd-hhashemi (merged: 2026-02-28 17:02 (UTC+8))
#34861 [1/N] Elastic EP Milestone 2 | rocm,frontend,ready,ci/build,v1,multi-modality,cpu,nvidia | by itayalroy (merged: 2026-02-28 12:46 (UTC+8))
#35600 [Benchmark] Improve UX of sweep scripts | documentation,performance | by DarkLight1337 (merged: 2026-02-28 16:36 (UTC+8))
#35586 [Benchmark] Rename SLA Finder to Workload Explorer | documentation,performance | by DarkLight1337 (merged: 2026-02-28 15:31 (UTC+8))
#35575 [Misc] Change logging level from info to debug for tool parser import | ready | by chaunceyjiang (merged: 2026-02-28 14:51 (UTC+8))
#35589 [CI] add trainer_send_weights for MockWeightTransferEngine | ready | by chaunceyjiang (merged: 2026-02-28 14:47 (UTC+8))
#35071 [ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-02-28 13:57 (UTC+8))
#35069 [ROCm] Derive device capability from GCN arch string without CUDA init | rocm,ready,nvidia | by AndreasKaratzas (merged: 2026-02-28 13:55 (UTC+8))
#35170 [ROCm][CI] Adding infiniband mappings for moriio tests | rocm,ready,ci/build,v1,kv-connector | by AndreasKaratzas (merged: 2026-02-28 13:53 (UTC+8))
#35212 [EPLB] Enforce sync eplb for NCCL-based all2all backend | ready | by ilmarkov (merged: 2026-02-28 13:47 (UTC+8))
#35510 [Bugfix] Move chat completion response_format validation to Pydantic model_validator | bug,frontend,ready | by umut-polat (merged: 2026-02-28 13:26 (UTC+8))
#35537 [Bugfix] Fixes for SLA finder | bug,documentation,performance | by DarkLight1337 (merged: 2026-02-28 12:20 (UTC+8))
#35503 [Bugfix] Propagate compilation_time from workers to main process for TP>1 | bug,ready,v1,cpu | by huydhn (merged: 2026-02-28 13:03 (UTC+8))
#35466 [CI/Build] CPU release supports both of AVX2 and AVX512 | ready,ci/build,v1,cpu | by majian4work (merged: 2026-02-28 12:35 (UTC+8))
#35548 [MTP] Validate that MTP weights are actually loaded | ready,deepseek | by MatthewBonanni (merged: 2026-02-28 12:27 (UTC+8))
#35527 [ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN | documentation,rocm,ready,v1 | by micah-wil (merged: 2026-02-28 12:16 (UTC+8))

PRs Closed Without Merging

#35623 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) | bug,v1 | by S1ro1 (closed: 2026-03-01 04:41 (UTC+8))
#35622 test: verify auto-revert PR creation (will close immediately) | no labels | by zhewenl (closed: 2026-03-01 02:14 (UTC+8))
#23806 [Fix]: Fix qwen moe quant config & parameter loading | needs-rebase,stale,qwen | by nikhil-arm (closed: 2026-03-01 00:25 (UTC+8))
#35421 [Bugfix] Fix Qwen3Coder tool call streaming with speculative decoding | bug,frontend,qwen | by voipmonitor (closed: 2026-03-01 00:06 (UTC+8))
#35485 [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x) | bug,rocm,nvidia | by haosdent (closed: 2026-02-28 22:33 (UTC+8))
#35606 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP | needs-rebase,v1,qwen | by cwazai (closed: 2026-02-28 18:03 (UTC+8))
#31561 [Feat][PP] support async send for PP | v1 | by pisceskkk (closed: 2026-02-28 15:37 (UTC+8))
#35498 Fix indexError in moe wna16 quantization with enable-expert-parallel | no labels | by ivyilike (closed: 2026-02-28 15:21 (UTC+8))
#35582 [Do not merge] FA4 hang bisect Trial 1: full FA4 without diagnostics | documentation,ready,ci/build,v1,nvidia | by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
#35583 [Do not merge] FA4 hang bisect Trial 2: Python changes only (main cmake) | documentation,ready,v1,nvidia | by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
#35584 [Do not merge] FA4 hang bisect Trial 3: cmake + requirements only (main Python) | ready,ci/build,nvidia | by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
#32918 [Fix] Update CUTLASS_REVISION to v4.3.5 | ready,needs-rebase,ci/build,v1,kv-connector,nvidia | by pacoxu (closed: 2026-02-28 11:57 (UTC+8))