vLLM Development Activity Report - 2026-02-28
Time window: 2026-02-28 11:37 (UTC+8) ~ 2026-03-01 11:37 (UTC+8) | Stats: new issues 11 | closed issues 12 | new PRs 45 | merged PRs 30 | PRs closed without merging 12
📊 Daily Development Status Summary
During this reporting window the vLLM community remained highly active, merging 30 PRs and working through a large volume of issues. Development focused on model-compatibility fixes (notably for the Qwen family), feature enablement and performance tuning on the AMD ROCm platform, and stability improvements in inference backends such as speculative decoding and CUDA Graph. The community responded quickly to user feedback, landing fixes for several deployment-blocking bugs.
🎯 AMD/ROCm Ecosystem Updates
AMD-related development was very active this cycle, centered on feature enablement, performance tuning, and bug fixes.
- Issues & bug reports:
  - Issue #35633: A user reports that on an AMD MI355X GPU, the vLLM 0.16 ROCm Docker image cannot run the MXFP4-quantized `amd/Kimi-K2.5-MXFP4` model because the `amd-quark` package is missing. This points directly at an integration gap between the AMD Quark quantization toolchain and the official vLLM image, and is a key blocker for model deployment in the AMD ecosystem. Labeled `bug` and `rocm`.
- PRs & feature work/fixes:
  - PR #35601 ([ROCm][Bugfix]: Disable AITER Triton ROPE by default): Disables the AITER RoPE implementation by default because of a performance regression at large batch sizes on gfx942 (MI300), while keeping the `rotary_embedding` custom op. This reflects fine-grained, per-op performance tuning on AMD hardware.
  - PRs #35597, #35596, #35595: A series of PRs by contributor `brucechanglongxu` enabling more quantization schemes on ROCm. #35597 enables the compressed-tensors WNA16 test; #35596 makes `moe_wna16` (GPTQ/AWQ MoE) runnable on ROCm via a Triton kernel fallback; #35595 adds `experts_int8` quantization to the ROCm support list. Together these significantly expand the range of quantized models runnable on AMD GPUs, an important step for platform competitiveness.
  - PR #35069 ([ROCm] Derive device capability from GCN arch string without CUDA init) (merged): Fixes failures in distributed environments such as Ray, where querying the device capability used to initialize CUDA. The capability is now derived by parsing the GCN architecture string, improving ROCm robustness in complex deployments.
Summary: the AMD ecosystem was a development hotspot this cycle. The team is systematically closing quantization gaps (MXFP4, WNA16 MoE, INT8 MoE) while tuning performance and fixing stability issues on specific hardware. The missing `amd-quark` package shows that complete packaging and delivery of the software stack still needs attention.
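The idea behind PR #35069 can be sketched in a few lines: instead of initializing the GPU runtime just to ask for a capability, parse the GCN architecture string reported for the device. The helper below is illustrative only; the function name and regex are assumptions, not vLLM's actual code:

```python
import re

def parse_gcn_arch(arch_string: str) -> str:
    """Extract the base GCN arch (e.g. 'gfx942') from a full arch string
    such as 'gfx942:sramecc+:xnack-', without any CUDA/HIP initialization."""
    match = re.match(r"(gfx[0-9a-f]+)", arch_string)
    if match is None:
        raise ValueError(f"unrecognized GCN arch string: {arch_string!r}")
    return match.group(1)

# Feature flags after the colon (sramecc, xnack) are irrelevant to capability.
print(parse_gcn_arch("gfx942:sramecc+:xnack-"))  # gfx942
```

Because this is pure string parsing, it is safe to call from a Ray worker before any GPU context exists, which is exactly the failure mode the PR addresses.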
💬 High-Activity Discussions
- Issue #35608: CUDA error: CUBLAS_STATUS_INVALID_VALUE:
  - Core issue: a user hit a cuBLAS error when deploying Qwen3.5-122B with the v0.16.0 image, and resolved it with `unset LD_LIBRARY_PATH`.
  - Viewpoints:
    - Reporter (`adenzhou1350`): believes it is a CUDA library conflict inside the Docker image.
    - Participant (`ZJY0516`): points out this is a known issue with CUDA 12.9 + Torch 2.10, fixed by upgrading the `nvidia-cublas-cu12` package.
    - Participant (`shahizat`): confirms the same problem on H100 and links a related issue.
  - Points of contention: none of substance; mostly experience sharing and a round-up of solutions.
  - Current status: identified as a library-compatibility problem under a specific CUDA version, with a clear upgrade path.
- Issue #35414 / #35617: Qwen3.5 fails on older GPUs (2080 Ti) due to bfloat16:
  - Core issue: even with `--dtype float16`, Qwen3.5/Qwen3Next models still created bfloat16 parameters, crashing on GPUs without bf16 support.
  - Viewpoints:
    - Affected users (`chuanSir123`, `BUJIDAOVS`): reported the error and worked around it via environment variables and config-file edits.
    - Maintainers (via PR #35617): identified the root cause as model code reading the HuggingFace `config.dtype` (the raw `torch_dtype` from `config.json`) directly, instead of the user-specified `--dtype`.
  - Points of contention: none; it is a clear-cut bug.
  - Resolution: fixed by PR #35617 (merged), which removes the hard-coded `dtype=config.dtype` in model initialization so the model follows the global dtype set by the vLLM model loader, respecting the user's command-line flag.
- Issue #35633: AMD MI355 MXFP4 support missing:
  - Core issue: as described above, users cannot test AMD's latest MXFP4-quantized models in the vLLM ROCm image.
  - Viewpoints:
    - Reporter (`functionstackx`): posted a detailed stack trace and directly @-mentioned several developers who appear to be AMD employees (`powderluv`, `chunfangamd`, `andyluo7`).
    - Bot (`github-actions`): auto-labeled the issue and CC'd the ROCm maintainers.
  - Points of contention: no deep discussion yet, but the issue directly exposes a synchronization gap between new AMD hardware features and the vLLM release process.
  - Current status: newly opened, awaiting a response from the AMD team or maintainers.
🔥 Hot Topics & Trends
- Qwen family deployment problems cluster: multiple issues involve the Qwen series (Qwen3.5-122B, 35B-A3B, 3-Omni, VL-Embedding), covering CUDA library conflicts (#35608), TTFT tail latency (#35625), unsupported arguments (#35602), a multimodal limiter failure (#35624), and the ignored dtype flag (#35414). As Qwen models gain popularity, their diverse configurations and features are stress-testing the inference engine across the board.
- AMD platform support accelerating: as noted above, this cycle saw a wave of ROCm PRs, from fixing basics (device-capability queries) to extending advanced features (several MoE quantization schemes), indicating that AMD's team or the community is actively pushing vLLM's maturity on AMD hardware.
- Deep work on inference performance and stability: ongoing fixes around speculative decoding (MTP weight validation #35548, MTP layer-propagation fix #35606), CUDA Graph (input-address debugging #35605), pipeline parallelism (RoutedExperts fix #35623), and the distributed executor (Ray deadlock #35403) show the project hardening large-scale, complex deployment scenarios while still chasing peak performance.
🛠️ Key Technical Changes
- PR #35617 ([Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag) (merged): a deceptively simple change that resolves a key compatibility problem blocking older-GPU users from deploying new large models. Technical impact: it makes the user's command-line flag authoritative, removes a hidden dependency of the model config on the runtime environment, and makes deployments more predictable.
- PR #35271 ([Feat] Add CUDA torch fallbacks for fp8_mqa_logits…) (merged): adds PyTorch fallback implementations for DeepGemm functions such as `fp8_mqa_logits`. Technical impact: CUDA platforms that do not support DeepGemm (e.g. sm80 A100) or lack the library can now run sparse-attention models that depend on these functions, such as GLM-5, via standard PyTorch ops, significantly widening hardware coverage.
- PR #35557 ([Bugfix] Fix Anthropic API base64 image handling in Messages endpoint) (merged): fixes handling of base64 images and tool-call-returned images in the Anthropic-compatible API. Technical impact: rounds out vLLM's multimodal input handling and improves compatibility with other API protocols (here, Anthropic's), which matters for building a general-purpose AI service gateway.
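The fallback mechanism in PR #35271 follows a common dispatch pattern: prefer the optimized kernel when its library imports, otherwise use a portable reference path. A simplified sketch; the import path and the naive reference math are assumptions for illustration, not vLLM's actual code:

```python
def _mqa_logits_reference(queries, keys):
    # Portable reference path: plain dot products, standing in for the
    # PyTorch-ops fallback the PR adds for unsupported platforms.
    return [[sum(q * k for q, k in zip(qrow, krow)) for krow in keys]
            for qrow in queries]

try:
    # Fast path: only importable where DeepGemm is installed and supported.
    from deep_gemm import fp8_mqa_logits as _mqa_logits_fast  # hypothetical path
except ImportError:
    _mqa_logits_fast = None

def mqa_logits(queries, keys):
    # Dispatch once at call time; callers never see which path ran.
    impl = _mqa_logits_fast if _mqa_logits_fast is not None else _mqa_logits_reference
    return impl(queries, keys)
```

The design choice is that callers get one stable entry point, so enabling a model on sm80 requires no changes above this layer.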
- PR #35548 ([MTP] Validate that MTP weights are actually loaded) (merged): when loading an MTP (Multi-Token Predictor) model, verify that its weights were actually loaded rather than left as uninitialized memory. Technical impact: makes speculative decoding safer, avoiding the trap where an incomplete quantized checkpoint silently degrades performance (zero acceptance rate), and effectively requires model providers to ship complete weights.
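The validation idea in PR #35548 amounts to a set-difference check after checkpoint loading: if any expected MTP parameter was never assigned, fail loudly instead of running on uninitialized memory. A hedged sketch with illustrative names:

```python
def validate_mtp_weights(expected: set, loaded: set) -> None:
    """Raise if the checkpoint left any expected MTP weight unloaded."""
    missing = sorted(expected - loaded)
    if missing:
        raise ValueError(
            f"MTP weights missing from checkpoint (would be uninitialized): {missing}"
        )

# A complete checkpoint passes silently; a truncated one raises immediately,
# instead of silently producing a zero acceptance rate at serve time.
validate_mtp_weights({"mtp.embed_tokens", "mtp.lm_head"},
                     {"mtp.embed_tokens", "mtp.lm_head"})
```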
📈 Development Activity Observations
- Contributor diversity: beyond core maintainers (`DarkLight1337`, `WoosukKwon`, `chaunceyjiang`), several contributors focused on specific areas this cycle, e.g. `brucechanglongxu` on ROCm quantization, and `lailoo` and `cwazai` with multiple model fixes.
- AMD team involvement: judging from the @-mentioned usernames (with `-amd` suffixes) and the concentration of ROCm PRs, AMD's in-house engineers are deeply engaged in vLLM development and driving the ecosystem forward.
- Efficient merging: 30 of 45 new PRs were merged, a rate of about 67%. Many fix PRs were merged the day they were opened (e.g. #35600, #35586, #35589, #35617), reflecting the core team's responsiveness to user problems and an efficient review process.
💡 Issues Worth Watching
- AMD feature-integration lag: Issue #35633 exposes a gap between AMD's newest hardware features (MXFP4 on MI355) and the vLLM release process. Coordinating faster integration of partners' cutting-edge tech into official images remains an open problem.
- CPU backend build problem: Issue #35599 shows that v0.16.0 fails to compile the CPU backend under a specific configuration (HEAD_DIM=576). A workaround exists, but it suggests CPU-backend test coverage may be thin, affecting a niche but important use case.
- Ray distributed executor stability: Issue #35403 and its fix PR #35405 reveal the fragility of Ray Compiled DAG when handling large tensors under pipeline parallelism plus multimodal input. As distributed, multimodal inference grows, the stability of this underlying execution framework is critical.
- Complexity of very-large-model deployment: Issue #35625 (TTFT tail latency for Qwen3.5-35B-A3B) and #35496 (compile timeout for Qwen3.5-397B) show that tuning and initializing ultra-large or structurally complex models remains challenging and demands deeper technical insight from users.
📋 Appendix: Detailed Data
New Issues
- #35633 [Bug]: parity with cuda: rocm v0.16 image missing amd quark kimi k2.5 mxfp4 — bug,rocm — by functionstackx (created: 2026-03-01 09:20 (UTC+8))
- #35631 [Feature]: Pooling Model Performance Optimizations — feature request — by yewentao256 (created: 2026-03-01 07:00 (UTC+8))
- #35625 [Bug]: TTFT latency issue with Qwen3.5-35B-A3B model using vllm — bug — by shahizat (created: 2026-03-01 04:34 (UTC+8))
- #35624 [Bug]: Qwen3-Omni Model Fails when try to l — bug — by Blaze-DSP (created: 2026-03-01 03:25 (UTC+8))
- #35608 [Bug]: vllm 0.16.0+image encountered CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx — bug — by adenzhou1350 (created: 2026-02-28 18:40 (UTC+8))
- #35612 [Bug]: invalid argument at cumem_allocator.cpp:119 — bug — by flutist (created: 2026-02-28 21:32 (UTC+8))
- #35599 [Bug]: cpu version compile failed in 0.16.0 — bug — by wanghualex1-wq (created: 2026-02-28 15:40 (UTC+8))
- #35603 [Bug]: vllm: error: unrecognized arguments: --task embedding — bug — by dojeese-maker (created: 2026-02-28 17:26 (UTC+8))
- #35602 [Bug]: vllm: error: unrecognized arguments: --language-model-only — bug — by ciaoyizhen (created: 2026-02-28 16:40 (UTC+8))
- #35593 [Bug]: CUDA Error 803 (system has unsupported display driver / cuda driver combination) when host driver is 590.48.01 due to cuda-compat conflict — bug — by kkkzhonghaiwei (created: 2026-02-28 14:49 (UTC+8))
- #35585 [CI] MockWeightTransferEngine missing trainer_send_weights implementation — ci-failure — by LucasWilkinson (created: 2026-02-28 12:28 (UTC+8))
Closed Issues
- #35414 [Bug]: 4*2080ti 22g deploy Qwen3.5-35B-A3B fail: 2080 Ti does not support bfloat16 — bug — by chuanSir123 (closed: 2026-03-01 11:27 (UTC+8))
- #27508 [Bug]: openai/gpt-oss-120b can't run on H100 — bug,stale — by matheusfvesco (closed: 2026-03-01 10:16 (UTC+8))
- #27807 [Bug]: When using CPU to run examples/offline_inference/basic/basic.py, "RuntimeError: Device string must not be empty" is generated. — bug,stale — by spacecat2002 (closed: 2026-03-01 10:16 (UTC+8))
- #27868 [Bug]: Segfault in NCCL cuMemCreate during distributed engine initialization on 3080 — bug,stale — by zjcx524 (closed: 2026-03-01 10:16 (UTC+8))
- #27877 [Usage]: How to install nightly version??? Why this command doesn't work? — usage,stale — by sleepwalker2017 (closed: 2026-03-01 10:16 (UTC+8))
- #27886 [Bug]: vLLM 0.10.2/0.11.0 bench serve deadlocks when benchmarking DeepSeek-R1-BF16 (sglang 0.4.7), with processes hanging indefinitely during script execution — bug,stale — by tears710 (closed: 2026-03-01 10:16 (UTC+8))
- #35403 [Bug]: Ray Compiled DAG timeout/deadlock during VLM forward pass with PP>1 and high-res images — bug — by emricksini-h (closed: 2026-03-01 00:14 (UTC+8))
- #35021 [Bug]: GLM-5 (Sparse MLA / DSA model) cannot run on sm80 GPUs (A100/A800) — hard DeepGemm dependency with no fallback — bug — by qjxjy123 (closed: 2026-02-28 18:12 (UTC+8))
- #35602 [Bug]: vllm: error: unrecognized arguments: --language-model-only — bug — by ciaoyizhen (closed: 2026-02-28 17:04 (UTC+8))
- #29915 [Feature]: include reasoning tokens in /v1/messages Anthropic endpoint if model supports it — feature request — by alew3 (closed: 2026-02-28 17:02 (UTC+8))
- #35585 [CI] MockWeightTransferEngine missing trainer_send_weights implementation — ci-failure — by LucasWilkinson (closed: 2026-02-28 14:47 (UTC+8))
- #35496 [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B — bug — by DBDXSS (closed: 2026-02-28 14:06 (UTC+8))
New PRs
- #35635 [torch.compile] Move torch.Size producers to consumer subgraph in split_graph — no labels — by bsherifi (created: 2026-03-01 10:42 (UTC+8))
- #35620 [Chore] Cleanup BNB utilization dead code — ready — by Isotr0py (created: 2026-03-01 01:08 (UTC+8))
- #35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs — bug,ready,qwen — by lailoo (created: 2026-03-01 00:49 (UTC+8))
- #35629 feat(spec_decode): remove unpadded drafter batch mode — documentation,speculative-decoding,v1 — by sladyn98 (created: 2026-03-01 06:42 (UTC+8))
- #35634 [Refactor] Simplify `chat_completion_full_generator` for tool parsers — frontend,ready — by yewentao256 (created: 2026-03-01 09:59 (UTC+8))
- #35632 [Docs] Update `CacheConfig` block_size docstring to remove inaccurate limit when using CUDA — nvidia — by eicherseiji (created: 2026-03-01 08:06 (UTC+8))
- #35630 [MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE — ready — by taneem-ibrahim (created: 2026-03-01 06:46 (UTC+8))
- #35628 [Model Runner V2] Minor refactoring for EncoderRunner — v1 — by WoosukKwon (created: 2026-03-01 05:30 (UTC+8))
- #35627 [2/N] Elastic EP Milestone 2: Integrating NIXL-EP — kv-connector,nvidia — by itayalroy (created: 2026-03-01 05:09 (UTC+8))
- #35626 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) — bug,v1 — by S1ro1 (created: 2026-03-01 04:49 (UTC+8))
- #35621 [Model Runner V2] Add ModelStateInterface [4/N] — v1,nvidia — by WoosukKwon (created: 2026-03-01 01:59 (UTC+8))
- #35623 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) — bug,v1 — by S1ro1 (created: 2026-03-01 02:54 (UTC+8))
- #35622 test: verify auto-revert PR creation (will close immediately) — no labels — by zhewenl (created: 2026-03-01 02:14 (UTC+8))
- #35611 [Bugfix] Fix CompressedTensors MoE g_idx registration when actorder is null — bug — by sxu75374 (created: 2026-02-28 19:46 (UTC+8))
- #35619 [Perf] Eliminate duplicate bitmatrix metadata computation in gpt oss … — gpt-oss — by banparth (created: 2026-03-01 00:57 (UTC+8))
- #35618 [Benchmark] Avoid unnecessary video download in MMVU — performance,ready — by DarkLight1337 (created: 2026-03-01 00:55 (UTC+8))
- #35615 [Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode — frontend,qwen — by voipmonitor (created: 2026-02-28 23:59 (UTC+8))
- #35613 [SP] Add opt-in ragged sequence parallelism path via VLLM_ENABLE_SP_RAGGED — v1 — by baonudesifeizhai (created: 2026-02-28 22:29 (UTC+8))
- #35616 [Bugfix] Improve engine ready timeout error message — bug,v1 — by lailoo (created: 2026-03-01 00:08 (UTC+8))
- #35614 [Docs] Add speculative-config configuration reference page — documentation — by lailoo (created: 2026-02-28 23:21 (UTC+8))
- #35605 [CUDA Graph] Enhance CUDA graph input address debugging — nvidia — by yiz-liu (created: 2026-02-28 17:36 (UTC+8))
- #35587 [BugFix][Model] Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start — bug — by CSWYF3634076 (created: 2026-02-28 13:04 (UTC+8))
- #35601 [ROCm][Bugfix]: Disable AITER Triton ROPE by default — bug,rocm,ready — by Rohan138 (created: 2026-02-28 16:17 (UTC+8))
- #35581 Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj — ready,qwen — by cwazai (created: 2026-02-28 12:00 (UTC+8))
- #35597 [ROCm][Quantization] Enable compressed-tensors WNA16 test on ROCm — rocm,ready — by brucechanglongxu (created: 2026-02-28 15:18 (UTC+8))
- #35594 [BUGFIX] fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 — bug,nvidia — by flutist (created: 2026-02-28 15:17 (UTC+8))
- #35609 [Bugfix] Fix TypeError in benchmarks/benchmark_prefix_caching.py with --sort — bug,performance — by GuoxiangZu (created: 2026-02-28 18:56 (UTC+8))
- #35588 [Feat] Supports Anthropic Messages count_tokens API — frontend,ready — by chaunceyjiang (created: 2026-02-28 13:24 (UTC+8))
- #35610 Add pytorch-triton-xpu to requirements — ci/build — by andswitch (created: 2026-02-28 19:04 (UTC+8))
- #35607 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP — qwen — by cwazai (created: 2026-02-28 18:13 (UTC+8))
- #35604 [Frontend][1/n] Improve pooling entrypoints classify. — frontend — by noooop (created: 2026-02-28 17:36 (UTC+8))
- #35606 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP — needs-rebase,v1,qwen — by cwazai (created: 2026-02-28 17:57 (UTC+8))
- #35591 Fix kv_cache_dtype auto resolution for ModelOpt kv_cache_quant_algo — documentation,kv-connector — by sriharshamudumba (created: 2026-02-28 14:32 (UTC+8))
- #35592 [Docs][1/n] Reorganize pooling docs classify. — documentation — by noooop (created: 2026-02-28 14:45 (UTC+8))
- #35600 [Benchmark] Improve UX of sweep scripts — documentation,performance — by DarkLight1337 (created: 2026-02-28 16:12 (UTC+8))
- #35598 [Bugfix] Fix indexError in moe wna16 quantization with enable-expert-parallel — bug — by ivyilike (created: 2026-02-28 15:34 (UTC+8))
- #35586 [Benchmark] Rename SLA Finder to Workload Explorer — documentation,performance — by DarkLight1337 (created: 2026-02-28 12:54 (UTC+8))
- #35596 [ROCm][Quantization] Enable moe_wna16 on ROCm via Triton fallback — rocm — by brucechanglongxu (created: 2026-02-28 15:18 (UTC+8))
- #35595 [ROCm][Quantization] Enable experts_int8 on ROCm — rocm — by brucechanglongxu (created: 2026-02-28 15:17 (UTC+8))
- #35589 [CI] add trainer_send_weights for MockWeightTransferEngine — ready — by chaunceyjiang (created: 2026-02-28 14:10 (UTC+8))
- #35580 [CI] Defining extended V1 e2e + engine tests — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-28 11:45 (UTC+8))
- #35590 Use smg-grpc-proto package for gRPC proto definitions — ci/build — by CatherineSue (created: 2026-02-28 14:15 (UTC+8))
- #35582 [Do not merge] FA4 hang bisect Trial 1: full FA4 without diagnostics — documentation,ready,ci/build,v1,nvidia — by LucasWilkinson (created: 2026-02-28 12:22 (UTC+8))
- #35583 [Do not merge] FA4 hang bisect Trial 2: Python changes only (main cmake) — documentation,ready,v1,nvidia — by LucasWilkinson (created: 2026-02-28 12:23 (UTC+8))
- #35584 [Do not merge] FA4 hang bisect Trial 3: cmake + requirements only (main Python) — ready,ci/build,nvidia — by LucasWilkinson (created: 2026-02-28 12:23 (UTC+8))
Merged PRs
- #35620 [Chore] Cleanup BNB utilization dead code — ready — by Isotr0py (merged: 2026-03-01 03:22 (UTC+8))
- #35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs — bug,ready,qwen — by lailoo (merged: 2026-03-01 11:27 (UTC+8))
- #32195 Add TMA support to fused_moe_lora kernel — ready — by gnovack (merged: 2026-03-01 10:55 (UTC+8))
- #35621 [Model Runner V2] Add ModelStateInterface [4/N] — v1,nvidia — by WoosukKwon (merged: 2026-03-01 05:19 (UTC+8))
- #35557 [Bugfix] Fix Anthropic API base64 image handling in Messages endpoint — bug,frontend,ready — by voipmonitor (merged: 2026-03-01 04:57 (UTC+8))
- #35571 [ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances — rocm,ready — by AndreasKaratzas (merged: 2026-02-28 16:59 (UTC+8))
- #35441 [Deprecation] Deprecate code in 0.17 as scheduled — frontend,ready,v1,multi-modality — by yewentao256 (merged: 2026-03-01 01:32 (UTC+8))
- #35618 [Benchmark] Avoid unnecessary video download in MMVU — performance,ready — by DarkLight1337 (merged: 2026-03-01 01:07 (UTC+8))
- #35405 [Fix] Avoid sending image input to other PP ranks — ready,v1 — by emricksini-h (merged: 2026-03-01 00:14 (UTC+8))
- #35581 Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj — ready,qwen — by cwazai (merged: 2026-02-28 22:50 (UTC+8))
- #35280 custom dataset img support base64 — documentation,performance,rocm,frontend,ready,v1,deepseek,kv-connector,nvidia — by flutist (merged: 2026-02-28 19:49 (UTC+8))
- #35271 [Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — ready,v1,nvidia — by chaunceyjiang (merged: 2026-02-28 18:12 (UTC+8))
- #34214 add io_process_plugin for sparse embedding — documentation,frontend,ready,ci/build,v1 — by staugust (merged: 2026-02-28 17:16 (UTC+8))
- #33671 [Feature] Supports Anthropic Thinking Block — bug,frontend,ready — by mariohong128 (merged: 2026-02-28 17:02 (UTC+8))
- #33762 Add padding support to wvSplitK solution for skinny GEMMs — rocm,ready — by amd-hhashemi (merged: 2026-02-28 17:02 (UTC+8))
- #34861 [1/N] Elastic EP Milestone 2 — rocm,frontend,ready,ci/build,v1,multi-modality,cpu,nvidia — by itayalroy (merged: 2026-02-28 12:46 (UTC+8))
- #35600 [Benchmark] Improve UX of sweep scripts — documentation,performance — by DarkLight1337 (merged: 2026-02-28 16:36 (UTC+8))
- #35586 [Benchmark] Rename SLA Finder to Workload Explorer — documentation,performance — by DarkLight1337 (merged: 2026-02-28 15:31 (UTC+8))
- #35575 [Misc] Change logging level from info to debug for tool parser import — ready — by chaunceyjiang (merged: 2026-02-28 14:51 (UTC+8))
- #35589 [CI] add trainer_send_weights for MockWeightTransferEngine — ready — by chaunceyjiang (merged: 2026-02-28 14:47 (UTC+8))
- #35071 [ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-28 13:57 (UTC+8))
- #35069 [ROCm] Derive device capability from GCN arch string without CUDA init — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-02-28 13:55 (UTC+8))
- #35170 [ROCm][CI] Adding infiniband mappings for moriio tests — rocm,ready,ci/build,v1,kv-connector — by AndreasKaratzas (merged: 2026-02-28 13:53 (UTC+8))
- #35212 [EPLB] Enforce sync eplb for NCCL-based all2all backend — ready — by ilmarkov (merged: 2026-02-28 13:47 (UTC+8))
- #35510 [Bugfix] Move chat completion response_format validation to Pydantic model_validator — bug,frontend,ready — by umut-polat (merged: 2026-02-28 13:26 (UTC+8))
- #35537 [Bugfix] Fixes for SLA finder — bug,documentation,performance — by DarkLight1337 (merged: 2026-02-28 12:20 (UTC+8))
- #35503 [Bugfix] Propagate compilation_time from workers to main process for TP>1 — bug,ready,v1,cpu — by huydhn (merged: 2026-02-28 13:03 (UTC+8))
- #35466 [CI/Build] CPU release supports both of AVX2 and AVX512 — ready,ci/build,v1,cpu — by majian4work (merged: 2026-02-28 12:35 (UTC+8))
- #35548 [MTP] Validate that MTP weights are actually loaded — ready,deepseek — by MatthewBonanni (merged: 2026-02-28 12:27 (UTC+8))
- #35527 [ROCm] Add `stablelm` Head Size 80 To Supported Head Sizes For ROCM_ATTN — documentation,rocm,ready,v1 — by micah-wil (merged: 2026-02-28 12:16 (UTC+8))
PRs Closed Without Merging
- #35623 [Bugfix] Fix RoutedExpertsCapturer for pipeline parallelism (PP > 1) — bug,v1 — by S1ro1 (closed: 2026-03-01 04:41 (UTC+8))
- #35622 test: verify auto-revert PR creation (will close immediately) — no labels — by zhewenl (closed: 2026-03-01 02:14 (UTC+8))
- #23806 [Fix]: Fix qwen moe quant config & parameter loading — needs-rebase,stale,qwen — by nikhil-arm (closed: 2026-03-01 00:25 (UTC+8))
- #35421 [Bugfix] Fix Qwen3Coder tool call streaming with speculative decoding — bug,frontend,qwen — by voipmonitor (closed: 2026-03-01 00:06 (UTC+8))
- #35485 [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x) — bug,rocm,nvidia — by haosdent (closed: 2026-02-28 22:33 (UTC+8))
- #35606 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP — needs-rebase,v1,qwen — by cwazai (closed: 2026-02-28 18:03 (UTC+8))
- #31561 [Feat][PP] support async send for PP — v1 — by pisceskkk (closed: 2026-02-28 15:37 (UTC+8))
- #35498 Fix indexError in moe wna16 quantization with enable-expert-parallel — no labels — by ivyilike (closed: 2026-02-28 15:21 (UTC+8))
- #35582 [Do not merge] FA4 hang bisect Trial 1: full FA4 without diagnostics — documentation,ready,ci/build,v1,nvidia — by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
- #35583 [Do not merge] FA4 hang bisect Trial 2: Python changes only (main cmake) — documentation,ready,v1,nvidia — by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
- #35584 [Do not merge] FA4 hang bisect Trial 3: cmake + requirements only (main Python) — ready,ci/build,nvidia — by LucasWilkinson (closed: 2026-02-28 12:31 (UTC+8))
- #32918 [Fix] Update CUTLASS_REVISION to v4.3.5 — ready,needs-rebase,ci/build,v1,kv-connector,nvidia — by pacoxu (closed: 2026-02-28 11:57 (UTC+8))