vLLM Development Activity Report - 2026-03-01
Time window: 2026-03-01 11:26 (UTC+8) ~ 2026-03-02 11:26 (UTC+8)
Stats: 15 new Issues | 10 closed Issues | 52 new PRs | 13 merged PRs | 12 PRs closed without merging
📊 Daily Development Summary
During this cycle (March 1-2, 2026), the vLLM project maintained very high development activity, adding 52 PRs and 15 Issues. Development focused on multi-hardware platform support (notably bug fixes for AMD ROCm and Intel XPU), improvements to the performance and evaluation toolchain (such as the RFC for a unified evaluation framework), and model and quantization support (e.g. NVFP4 and MXFP4). The community held lively discussions on several design questions (command naming, the build process), reflecting the project's attention to engineering quality amid rapid evolution.
🎯 AMD/ROCm Ecosystem Updates
AMD/ROCm ecosystem activity was very high this cycle, centered on quantized-model support issues on the new MI355X hardware and fixes to the HIP build process.
- Issues:
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error
  - Description: User functionstackx reports that on an MI355X GPU, using AMD's official MXFP4 quantized checkpoint (amd/MiniMax-M2.1-MXFP4) with AITER kernels enabled (VLLM_ROCM_USE_AITER=1) hits a runtime error under TP4, while TP2 works. This directly affects the performance story of MXFP4, a flagship MI355X feature.
  - Impact: Exposes a compatibility problem in AMD's AITER fused-MoE kernels under a specific combination of quantization format (MXFP4) and high parallelism (TP4), which may affect users' assessment of quantized performance on the new hardware (MI355X).
- #35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8)
  - Description: Also reported by functionstackx: running the Kimi K2.5 MXFP4 checkpoint on MI355X, AITER's MLA (multi-head latent attention) kernel errors out because it does not support num_head=8; only 16 or 128 heads are currently supported.
  - Impact: Prevents Kimi K2.5 from reaching full performance on the ROCm platform, indicating that AMD's AITER kernels still need broader coverage of model architectures.
- #35644 [Feature]: AMD MXFP4 MiniMax M2.5 Checkpoint
  - Description: User functionstackx asks the AMD team to publish an MXFP4 quantized checkpoint for the more popular MiniMax M2.5 model, replacing the existing M2.1 checkpoint.
  - Impact: Reflects strong community interest in seeing AMD's latest quantization technology (MXFP4) applied to mainstream models.
- #35642 [Bug]: HIP build in Docker: offload-arch stderr contaminates compiler flags
  - Description: User npathak13 filed a very detailed technical analysis showing that during a HIP build inside Docker (no GPU passthrough), the error output of the failing offload-arch tool contaminates the compiler flags via two separate paths, breaking the build.
  - Impact: A fundamental build problem affecting every containerized ROCm CI/CD pipeline.
- PRs:
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models
  - Description: Submitted by AMD engineer hongxiayang to fix a dependency issue (#35633): adds the amd-quark package to the ROCm requirements files, so that built Docker images and wheels support AMD quantized models directly.
  - Status: Open; pre-commit checks pending.
- #35672 [Core] Move test utility to test file
  - Description: Removes an unused weight-shuffling function from production code, noting that all production paths now use AITER for weight shuffling. This indirectly underscores AITER's central role in the AMD ecosystem.
  - Status: Open, labeled ready.
- #35692 [Bug] Fix HIP build in Docker: filter offload-arch stderr from compiler flags
  - Description: Fix for Issue #35642: filters the warnings emitted by offload-arch in cmake/utils.cmake and CMakeLists.txt so they cannot contaminate the compiler flags.
  - Status: Open; technical discussion and revisions in progress.
- #34931 [AMD][CI] Support Triton attention with ExampleConnector
  - Description: A merged PR that resolves a KV-cache layout incompatibility between ROCm's default Triton attention backend and the ExampleConnector, improving ROCm's functional coverage for KV offloading scenarios.
  - Status: Merged.
Summary: AMD ecosystem activity this cycle concentrated on removing software-stack bottlenecks on the new-generation hardware (MI355X) and shoring up the basic build and deployment flow. Community users actively reported problems, and the AMD team (via -amd-suffixed accounts and hongxiayang) followed up with fixes, showing healthy collaboration.
💬 High-Engagement Discussions
- Issue #35639: [RFC]: vllm bench eval for Unified Accuracy + Performance Evaluation
  - Core topic: Proposes a new unified command that runs accuracy (lm_eval) and performance (Prometheus metrics) evaluation in a single run, to detect performance regressions and correlate accuracy with performance data.
  - Points of contention:
    - Proposer (AndreasKaratzas): Argues that existing CI only evaluates accuracy and cannot catch performance regressions or hardware-specific problems; unified evaluation would save GPU compute and provide more complete insight.
    - Objection/suggestion (DarkLight1337): vllm bench has traditionally referred specifically to performance benchmarking; to avoid confusion, suggests a new command, vllm eval.
  - Development: The proposer then humorously offered vllm omnigres (omni + regression) as an alternative name, easing the tone and signaling openness on naming.
  - Current status: Discussion ongoing; the core functional design has broad support, but the exact name is undecided.
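The RFC's core idea, collecting accuracy and performance data in a single pass rather than in separate CI jobs, can be sketched roughly as follows. This is a minimal illustration of the concept only; the function names, report fields, and regression rule are all hypothetical and do not reflect the actual proposal's design.

```python
import time
from statistics import mean, quantiles

def unified_eval(model_fn, dataset, baseline_p95_ms=None, tolerance=1.10):
    """Run accuracy and performance evaluation in one pass.

    model_fn: callable(prompt) -> answer (stand-in for an inference endpoint).
    dataset: list of (prompt, expected_answer) pairs.
    Returns a single report correlating accuracy with latency, flagging a
    performance regression when p95 latency exceeds baseline * tolerance.
    """
    correct, latencies_ms = 0, []
    for prompt, expected in dataset:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += (answer == expected)

    p95 = quantiles(latencies_ms, n=20)[-1]  # ~95th percentile cut point
    return {
        "accuracy": correct / len(dataset),
        "mean_latency_ms": mean(latencies_ms),
        "p95_latency_ms": p95,
        "perf_regression": (baseline_p95_ms is not None
                            and p95 > baseline_p95_ms * tolerance),
    }

# Toy usage: a lookup-table "model" scored against a 4-item dataset.
data = [("2+2", "4"), ("3+3", "6"), ("1+1", "2"), ("5+5", "10")]
model = dict(data)
report = unified_eval(model.get, data, baseline_p95_ms=1000.0)
print(report["accuracy"])  # 1.0
```

The point of the single-pass design is that each request contributes to both metrics at once, so accuracy and latency are measured on the same traffic and the GPU time is not spent twice.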
- Issue #35642 & PR #35692: HIP Docker build failure
  - Core topic: As noted above, a deep technical problem. The Issue includes an extremely detailed root-cause analysis and a workaround from the reporter.
  - Discussion character: The heat here shows in the depth of analysis rather than comment count. User npathak13 independently diagnosed the problem and proposed a fix, reflecting the community's high technical skill. In PR #35692, submitter infektyd then iterated with an AI code assistant (gemini-code-assist) on the CMake fix details, an example of modern collaborative development.
  - Current status: Root cause identified; the fix PR is submitted and being refined.
- PR #35670 & #35684: fixing the multimodal render endpoint
  - Core topic: Fixes serialization failures of internal data structures when the /v1/chat/completions/render endpoint handles multimodal requests.
  - Discussion character: The same submitter (sergey-zinchenko) closed the original PR (#35670) over a DCO sign-off problem and resubmitted as #35684. Not a functional dispute, but it reflects the project's strict contribution standards (signed commits), part of how open-source quality is maintained.
  - Current status: Original PR closed; the new PR awaits review.
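The class of bug behind #35665 is common to any endpoint that returns internal objects: json serialization fails on non-primitive fields unless a conversion hook is supplied. A generic illustration of that failure class and its fix; the MultiModalPart structure below is hypothetical and is not vLLM's actual type:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass

@dataclass
class MultiModalPart:   # hypothetical internal structure
    kind: str           # e.g. "text" or "image_url"
    payload: bytes      # raw data that json cannot encode directly

def render_response(parts):
    """Serialize internal parts into a JSON response body.

    A plain json.dumps(parts) raises TypeError (dataclass instances and
    bytes are not JSON-serializable) -- the same failure class the render
    endpoint hit. A default= hook converts each part into JSON-safe
    primitives before encoding.
    """
    def encode(obj):
        if is_dataclass(obj):
            d = asdict(obj)
            d["payload"] = d["payload"].decode("latin-1")  # illustrative
            return d
        raise TypeError(f"not serializable: {type(obj)!r}")
    return json.dumps(parts, default=encode)

parts = [MultiModalPart(kind="text", payload=b"hello")]
failed = False
try:
    json.dumps(parts)          # the buggy path: raises TypeError
except TypeError:
    failed = True
body = render_response(parts)  # the fixed path
print(body)
```

Text-only requests never trip this because their payloads are already primitives, which is consistent with the bug surfacing only for multimodal requests.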
🔥 Hot Topics and Trends
- Hardware diversity and problem-solving: Beyond AMD MI355X, this cycle also saw hardware-specific bug reports for Intel XPU (Issues #35651, #35638) and NVIDIA Blackwell (Issue #35659). This reflects the compatibility challenges vLLM faces as it expands its hardware ecosystem, an inevitable stage for a mainstream inference engine. The community held many concrete discussions on driver versions, kernel precision, and best practices for each platform.
- Engineering of evaluation and monitoring: The RFC in Issue #35639 marks the community starting to think systematically about integrating accuracy evaluation and performance monitoring deep into CI/CD to catch performance regressions early, an important signal of the project maturing and prioritizing production stability.
- Quantization going deep, with growing pains: Discussion around newer quantization formats such as MXFP4 (AMD) and NVFP4 (NVIDIA) was dense (Issues #35637, #35641, #35644, #35659; PR #35660). Topics span checkpoint availability, kernel support, memory savings, and the resulting accuracy/stability problems (e.g. #35693, where an uninitialized scale produced inf). Quantization is the current front line of performance optimization, but it also introduces extra complexity.
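The scale-initialization hazard behind #35693 is easy to reproduce in miniature: quantization divides values by a per-tensor scale, so a scale left at a zero-initialized default overflows to inf on the first real weight. The toy below is illustrative only; the scale formula and names do not correspond to vLLM's NVFP4 code, and the helper mimics IEEE-754 division (x/0 -> ±inf) as tensor libraries behave, rather than Python's ZeroDivisionError:

```python
import math

def ieee_div(x, y):
    """Division with IEEE-754 semantics (x/0 -> +/-inf, 0/0 -> nan),
    mimicking tensor-library behavior instead of raising ZeroDivisionError."""
    if y == 0.0:
        return math.copysign(math.inf, x) if x != 0 else math.nan
    return x / y

def quantize(values, scale):
    """Toy symmetric quantization: q = value / scale."""
    return [ieee_div(v, scale) for v in values]

weights = [0.5, -1.25, 2.0]

# Bug: the "global scale" was left at its zero-initialized default,
# so every nonzero weight quantizes to +/-inf and poisons later math.
bad = quantize(weights, 0.0)
assert all(math.isinf(q) for q in bad)

# Fix: initialize the scale from the data before quantizing
# (here: map the max-magnitude weight to ~6, an arbitrary toy choice).
good_scale = max(abs(w) for w in weights) / 6.0
good = quantize(weights, good_scale)
print(bad[0], good[0])
```

Once an inf enters a weight tensor it propagates through matmuls into nan logits, which is why this class of bug tends to surface as garbage output rather than a crash at load time.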
🛠️ Key Technical Changes
- PR #35658: [ROCm] add amd-quark package in requirements
  - Technical reading: Promotes the amd-quark quantization toolkit from "install manually" to a formal dependency of the ROCm backend. An important usability and deployment improvement: users of the official ROCm image or wheel can run AMD quantized models out of the box.
  - Impact: Lowers the barrier to using AMD quantized models and strengthens vLLM's native support for the AMD quantization ecosystem.
- PR #35672: [Core] Move test utility to test file
  - Technical reading: Cleans a leftover weight-shuffling function from production code that had been superseded by the AITER kernel implementation. A code-health improvement.
  - Impact: Simplifies the codebase, makes AITER the single production path for weight shuffling on the AMD platform, and reduces maintenance burden and potential confusion.
- PR #35692: [Bug] Fix HIP build in Docker
  - Technical reading: Resolves an inherent defect of HIP in containerized build environments by filtering the error output of a compiler invocation. A key infrastructure-stability fix.
  - Impact: Should significantly improve the reliability and success rate of Docker-based ROCm CI/CD pipelines, which is critical to developer experience in the AMD ecosystem.
📈 Development Activity
- Contributors: Deep participation this cycle from hardware vendors (e.g. AMD's functionstackx, Intel's jikunshang) and cloud/research organizations (e.g. Red Hat, Cohere); they are not just users but also the ones finding and fixing problems.
- Review and merging: 13 PRs merged, including the long-running FA4 attention integration PR (#32974), showing large feature branches steadily landing on main. Many new PRs were quickly labeled ready, indicating efficient review by the core maintainers.
- Issue management: 10 older Issues closed (some as stale), with new reports actively triaged; issue flow remains healthy.
💡 Issues to Watch
- Software-stack maturity on AMD MI355X: Issues #35637 and #35641 show that on the new flagship GPU, support for the advanced quantization format (MXFP4) and high-performance kernels (AITER) is still incomplete for certain model architectures. AMD needs to prioritize these to deliver on the hardware's performance promise.
- Landing the unified evaluation framework: The RFC in Issue #35639 has high engineering value. Its concrete design, implementation path, and integration with the existing CI system are worth following; it could become a core quality-assurance component of future vLLM releases.
- Stability of the Qwen3.5 model family: Several Issues (#35638, #35700, #35702) and recently closed ones (#35238, #35414) point to assorted problems with Qwen3.5 under specific configurations (FP8, structured output, older GPUs): crashes, dtype errors, nonconforming output. A systematic review of the Qwen3.5 model implementation may be warranted.
- Weight-loading performance regression: Issue #35663 reports that nightly builds take more than 3x longer to load model weights, a serious user-experience and deployment-efficiency problem whose root cause needs prompt investigation.
📋 Appendix: Detailed Data
New Issues
- #35702 [Bug]: Qwen3.5-FP8 Crashes VLLM — bug — by darsh12 (created: 2026-03-02 11:17 (UTC+8))
- #35700 [Bug]: Qwen3.5 structured output doesn’t work — bug — by 315930399 (created: 2026-03-02 11:06 (UTC+8))
- #35663 [Bug]: Nightly takes >3x time to load weights compared to v0.16.0 and earlier nightlies — bug — by ehfd (created: 2026-03-01 23:49 (UTC+8))
- #35686 [Bug][UX]: Unclean shutdown from ctrl-c with AR Fusion — bug,help wanted,good first issue — by robertgshaw2-redhat (created: 2026-03-02 07:28 (UTC+8))
- #35651 [Bug][XPU]: Inference generates garbage with flash attention — bug — by andswitch (created: 2026-03-01 16:31 (UTC+8))
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error — bug,rocm — by functionstackx (created: 2026-03-01 12:32 (UTC+8))
- #35688 [CI Failure][Tool Calling]: — ci-failure — by robertgshaw2-redhat (created: 2026-03-02 08:12 (UTC+8))
- #35679 [CI] Chat template missing for PowerMoE-3b model — ci-failure — by LucasWilkinson (created: 2026-03-02 04:58 (UTC+8))
- #35665 [Bug]: Multimodal Requests Fail on /v1/chat/completions/render Endpoint — bug — by sergey-zinchenko (created: 2026-03-02 00:21 (UTC+8))
- #35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8) — bug,performance,rocm — by functionstackx (created: 2026-03-01 14:49 (UTC+8))
- #35659 [Bug]: cudaErrorIllegalAddress under sustained parallel load with CUDA Graphs on Blackwell SM120 (NVFP4 MoE) — no labels — by munakaya (created: 2026-03-01 23:11 (UTC+8))
- #35638 [Question][XPU]: Best practices and optimized arguments for running 30B+ models on Intel Arc B580 (Dual GPU) via vLLM-XPU — usage — by jiekechoo (created: 2026-03-01 13:09 (UTC+8))
- #35644 [Feature]: AMD MXFP4 MiniMax M2.5 Checkpoint — feature request,rocm — by functionstackx (created: 2026-03-01 15:13 (UTC+8))
- #35642 [Bug]: HIP build in Docker: offload-arch stderr contaminates compiler flags via cmake/utils.cmake and CMAKE_HIP_FLAGS — bug,rocm — by npathak13 (created: 2026-03-01 15:01 (UTC+8))
- #35639 [RFC]: vllm bench eval for Unified Accuracy + Performance Evaluation — rocm,RFC — by AndreasKaratzas (created: 2026-03-01 13:36 (UTC+8))
Closed Issues
- #35593 [Bug]: CUDA Error 803 (system has unsupported display driver / cuda driver combination) when host driver is 590.48.01 due to cuda-compat conflict — bug — by kkkzhonghaiwei (closed: 2026-03-02 10:49 (UTC+8))
- #20611 [Bug]: can not get function call result in tool_calls fileds when using qwen3 withing stream is true and enable_thinking is false — bug,stale — by zhaoyangmushiyi (closed: 2026-03-02 10:16 (UTC+8))
- #26360 [Bug]: Qwen3-VL-235b-thinking-awq suddenly stops in docker. — bug,stale — by kulievvitaly (closed: 2026-03-02 10:16 (UTC+8))
- #27911 [Bug]: Potential Integer Overflow and Out-of-bounds in selective_scan_fwd.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #27912 [Usage]: How should I use the CPU to deploy QWEN3 VL 30B-A3B? — usage,stale — by maxgameone (closed: 2026-03-02 10:15 (UTC+8))
- #27915 [Bug]: Potential Out-of-bounds in moe_wna16.cu and marlin_moe_wna16/ops.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #27917 [Bug]: Potential Out-of-bounds in moe_align_sum_kernels.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #35679 [CI] Chat template missing for PowerMoE-3b model — ci-failure — by LucasWilkinson (closed: 2026-03-02 07:45 (UTC+8))
- #35238 Qwen3.5-27B dtype mismatch in DeltaNet layers during torch.compile (float != c10::Half) — torch.compile — by Anoneeeemus (closed: 2026-03-02 04:14 (UTC+8))
- #35414 [Bug]: 4*2080ti 22g deploy Qwen3.5-35B-A3B fail:2080 Ti does not support bfloat16 — bug — by chuanSir123 (closed: 2026-03-01 11:27 (UTC+8))
New PRs
- #35703 [WIP][Hybrid] Map more FullAttn layers to one page — v1 — by peakcrosser7 (created: 2026-03-02 11:19 (UTC+8))
- #35690 [Bigfix] Fix padding in FULL_DECODE path when MTP is enabled for DP case — v1 — by zyongye (created: 2026-03-02 08:58 (UTC+8))
- #35701 feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) — documentation,frontend — by will-deines (created: 2026-03-02 11:10 (UTC+8))
- #35699 Support multimodal speculative decoding in non-parallel draft_model mode — speculative-decoding,v1 — by EanWang211123 (created: 2026-03-02 10:56 (UTC+8))
- #35696 [Model] Optional FP8 lm_head compression for Llama and Mistral — llama — by lucaspirola (created: 2026-03-02 10:15 (UTC+8))
- #35698 [XPU]Design collect_env.py include the xpu platform specific implementations — no labels — by 1643661061leo (created: 2026-03-02 10:47 (UTC+8))
- #35697 v1 — needs-rebase,cpu — by yintong-lu (created: 2026-03-02 10:36 (UTC+8))
- #35667 [Bugfix] Fix tool call ID collision in Anthropic endpoint — bug,frontend — by umut-polat (created: 2026-03-02 00:49 (UTC+8))
- #35695 [Attention] Enable FP8 KV cache for Triton MLA backend — documentation,v1 — by lucaspirola (created: 2026-03-02 10:13 (UTC+8))
- #35694 [Core] Support FP8 weight storage in unquantized linear and embedding layers — no labels — by lucaspirola (created: 2026-03-02 10:12 (UTC+8))
- #35693 [Bugfix] Fix uninitialized NVFP4 global scale causing inf overflow — bug — by lucaspirola (created: 2026-03-02 10:09 (UTC+8))
- #35660 [Quantization] Support NVFP4-quantized lm_head and embed_tokens via modelopt — no labels — by lucaspirola (created: 2026-03-01 23:33 (UTC+8))
- #35687 [Bugfix] Treat as implicit reasoning end in Qwen3 parser — bug,ready,qwen — by qmx (created: 2026-03-02 07:30 (UTC+8))
- #35692 [Bug] Fix HIP build in Docker: filter offload-arch stderr from compiler flags — bug,ci/build — by infektyd (created: 2026-03-02 09:52 (UTC+8))
- #35672 [Core] Move test utility to test file — ready,gpt-oss — by wjabbour (created: 2026-03-02 03:09 (UTC+8))
- #35656 [Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE — bug,qwen — by oneraghavan (created: 2026-03-01 22:37 (UTC+8))
- #35661 interleave mm strings via request — frontend — by netanel-haber (created: 2026-03-01 23:38 (UTC+8))
- #35684 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,frontend,v1 — by sergey-zinchenko (created: 2026-03-02 06:23 (UTC+8))
- #35683 [MISC] Removed unused function find_all_indices() from tool_parsers/utils.py — ready — by taneem-ibrahim (created: 2026-03-02 06:23 (UTC+8))
- #35691 [XPU] fix mxfp4 activation type — ready — by jikunshang (created: 2026-03-02 09:02 (UTC+8))
- #35689 [CI/Build] ensure local CMake cache isnt included in docker build context — no labels — by wjabbour (created: 2026-03-02 08:45 (UTC+8))
- #35676 [Bugfix] VllmRunner should clear memory on exit in single process mode — bug,v1 — by zou3519 (created: 2026-03-02 04:15 (UTC+8))
- #35674 [Hardware][Power] Add IBM POWER8 (ppc64le) CPU backend support — documentation,speculative-decoding,ci/build,v1,cpu — by Scottcjn (created: 2026-03-02 03:45 (UTC+8))
- #35685 Fix: Suppress spurious cuDeviceGetAttribute errors in cumem_allocator — no labels — by infektyd (created: 2026-03-02 07:12 (UTC+8))
- #35681 Fix unresolved-import errors when using Astral’s ty by removing src.root — no labels — by tlrmchlsmth (created: 2026-03-02 05:16 (UTC+8))
- #35640 [MISC] fixed tool_parser mypy errors — ready — by taneem-ibrahim (created: 2026-03-01 14:02 (UTC+8))
- #35670 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by sergey-zinchenko (created: 2026-03-02 01:45 (UTC+8))
- #35682 feat(spec_decode): expose token provenance and stats in output (#19993) — v1 — by luca-akka (created: 2026-03-02 05:34 (UTC+8))
- #35680 [Bugfix] Fix slow ModelOpt Llama-4 checkpoint loading via tensor contiguity — bug,llama — by universeplayer (created: 2026-03-02 05:03 (UTC+8))
- #35654 [cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties — v1 — by kkt-cohere (created: 2026-03-01 22:20 (UTC+8))
- #35677 [Bugfix] Fix ModelOpt Llama-4 slow checkpoint loading via tensor contiguity — bug,documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by universeplayer (created: 2026-03-02 04:40 (UTC+8))
- #35678 [UX] Remove NoOpOffloader log — no labels — by robertgshaw2-redhat (created: 2026-03-02 04:46 (UTC+8))
- #35675 [Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch — bug,qwen — by nguyen599 (created: 2026-03-02 04:13 (UTC+8))
- #35673 [Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility — performance,v1,cpu — by atalman (created: 2026-03-02 03:09 (UTC+8))
- #35671 [Model Runner V2] Use block table apis for capture inputs — v1,nvidia — by WoosukKwon (created: 2026-03-02 02:15 (UTC+8))
- #35669 Feature/offloading manager stats — v1,kv-connector — by Srinivasoo7 (created: 2026-03-02 01:30 (UTC+8))
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models — rocm,ci/build — by hongxiayang (created: 2026-03-01 23:08 (UTC+8))
- #35666 [Misc] Use VLLMValidationError consistently in completion and chat completion protocol — frontend — by umut-polat (created: 2026-03-02 00:33 (UTC+8))
- #35664 [Misc] Use VLLMValidationError consistently in SamplingParams._verify_args — no labels — by umut-polat (created: 2026-03-02 00:16 (UTC+8))
- #35662 [Bugfix] Fix misleading cancel error message and noisy CancelledError log — bug,frontend — by umut-polat (created: 2026-03-01 23:44 (UTC+8))
- #35657 [Model] Nano Nemotron VL - fast media preprocessing — no labels — by nvnbagrov (created: 2026-03-01 22:38 (UTC+8))
- #35655 [Model] Nemotron Nano VL faster preprocessing — ci/build,multi-modality,gpt-oss,nvidia — by nvnbagrov (created: 2026-03-01 22:36 (UTC+8))
- #35650 [Bugfix] Suppress UserWarning in binary2tensor for read-only numpy arrays — bug — by lin-shh (created: 2026-03-01 15:43 (UTC+8))
- #35645 Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults — ready — by lin-shh (created: 2026-03-01 15:13 (UTC+8))
- #35643 Replace bare Exception with ValueError for invalid argument values — no labels — by lin-shh (created: 2026-03-01 15:13 (UTC+8))
- #35647 Replace bare Exception with specific exception types (v1, entrypoints) — frontend,v1 — by lin-shh (created: 2026-03-01 15:23 (UTC+8))
- #35649 [Misc] Replace bare AssertionError with specific exception types — ready,v1 — by lin-shh (created: 2026-03-01 15:38 (UTC+8))
- #35648 [Misc] Fix typos in comments: explict→explicit, paramaters→parameters — ready — by lin-shh (created: 2026-03-01 15:38 (UTC+8))
- #35653 Use MMEncoderAttention (=use FlashAttention) instead torch.sdpa in radio.py — no labels — by netanel-haber (created: 2026-03-01 21:32 (UTC+8))
- #35652 Add SwapConnector for per-layer KV cache swapping and reorganize entrypoints — v1,kv-connector — by ZehaoLu98 (created: 2026-03-01 17:26 (UTC+8))
- #35646 Fix typo: implictly -> implicitly in isaac.py docstring — no labels — by lin-shh (created: 2026-03-01 15:22 (UTC+8))
- #35636 fix: correct max_loras grid size in fused_moe_lora kernels — no labels — by N3u0ns (created: 2026-03-01 12:19 (UTC+8))
Merged PRs
- #35256 [Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile — bug,ready — by haosdent (merged: 2026-03-02 04:14 (UTC+8))
- #35327 Fix deprecated v1 config tests — ready — by jcaip (merged: 2026-03-02 09:32 (UTC+8))
- #32974 [Attention] FA4 integration — documentation,ready,ci/build,v1,nvidia — by LucasWilkinson (merged: 2026-03-02 07:44 (UTC+8))
- #34832 Revert “[Bugfix] Disable TRTLLM attention with KV transfer enabled (#33192)” — bug,ready,v1,nvidia — by ZhanqiuHu (merged: 2026-03-02 06:32 (UTC+8))
- #35475 [torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 — documentation,ready,gpt-oss — by zou3519 (merged: 2026-03-02 05:44 (UTC+8))
- #35671 [Model Runner V2] Use block table apis for capture inputs — v1,nvidia — by WoosukKwon (merged: 2026-03-02 02:31 (UTC+8))
- #35382 fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend — ready — by yoonsnowdev (merged: 2026-03-01 22:59 (UTC+8))
- #35630 [MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE — ready — by taneem-ibrahim (merged: 2026-03-01 22:26 (UTC+8))
- #34798 [Mamba1] - Kernel Level Chunk Alignment for Prefix Caching — ready,v1 — by Josephasafg (merged: 2026-03-01 20:40 (UTC+8))
- #35628 [Model Runner V2] Minor refactoring for EncoderRunner — v1 — by WoosukKwon (merged: 2026-03-01 16:15 (UTC+8))
- #34931 [AMD][CI] Support Triton attention with ExampleConnector — rocm,ready,v1,kv-connector — by rjrock (merged: 2026-03-01 15:58 (UTC+8))
- #35646 Fix typo: implictly -> implicitly in isaac.py docstring — no labels — by lin-shh (merged: 2026-03-01 15:34 (UTC+8))
- #35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring –dtype flag on older GPUs — bug,ready,qwen — by lailoo (merged: 2026-03-01 11:27 (UTC+8))
PRs Closed Without Merging
- #35701 feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) — documentation,frontend — by will-deines (closed: 2026-03-02 11:11 (UTC+8))
- #35009 update-pillow-version to mitigate CVE-2026-25990 — ci/build — by to-curiosity (closed: 2026-03-02 10:22 (UTC+8))
- #35346 Cpu dispatcher — documentation,performance,rocm,needs-rebase,ci/build,v1,cpu,nvidia — by majian4work (closed: 2026-03-02 09:04 (UTC+8))
- #26371 [Kernel] FA4 Integration. — needs-rebase,stale,v1,nvidia — by zyongye (closed: 2026-03-02 08:10 (UTC+8))
- #35670 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by sergey-zinchenko (closed: 2026-03-02 06:09 (UTC+8))
- #35677 [Bugfix] Fix ModelOpt Llama-4 slow checkpoint loading via tensor contiguity — bug,documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by universeplayer (closed: 2026-03-02 04:51 (UTC+8))
- #31512 [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8 — documentation,speculative-decoding,needs-rebase,ci/build,v1,cpu — by Scottcjn (closed: 2026-03-02 03:45 (UTC+8))
- #34955 [Bugfix] kimi k2.5 tool call tokens leaking into reasoning_content — bug,frontend,deepseek — by felixmr1 (closed: 2026-03-02 00:41 (UTC+8))
- #35655 [Model] Nemotron Nano VL faster preprocessing — ci/build,multi-modality,gpt-oss,nvidia — by nvnbagrov (closed: 2026-03-01 22:37 (UTC+8))
- #35607 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP — qwen — by cwazai (closed: 2026-03-01 20:28 (UTC+8))
- #35652 Add SwapConnector for per-layer KV cache swapping and reorganize entrypoints — v1,kv-connector — by ZehaoLu98 (closed: 2026-03-01 17:26 (UTC+8))
- #24918 [Attention] remove decode wrapper of flashinfer — needs-rebase,stale,v1,nvidia — by happierpig (closed: 2026-03-01 17:24 (UTC+8))