vLLM Development Activity Report - 2026-03-01
Time window: 2026-03-01 11:26 (UTC+8) ~ 2026-03-02 11:26 (UTC+8)
Stats: 15 new Issues | 10 closed Issues | 52 new PRs | 13 merged PRs | 12 PRs closed without merging
📊 Daily Development Summary
During this cycle (March 1-2, 2026), the vLLM project maintained very high development activity, adding 52 PRs and 15 Issues. Development focused on multi-hardware platform support (notably bug fixes for AMD ROCm and Intel XPU), improvements to the performance and evaluation toolchain (such as the RFC for a unified evaluation framework), and model and quantization support (e.g. NVFP4 and MXFP4). The community held lively discussions on several design questions (command naming, the build process), reflecting the project's attention to engineering quality amid rapid evolution.
🎯 AMD/ROCm Ecosystem Updates
AMD/ROCm ecosystem activity was very high this cycle, centered on quantized-model support issues on the new MI355X hardware and fixes to the HIP build process.
- Issues:
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error
  - Description: User functionstackx reports that on an MI355X GPU, using AMD's official MXFP4 quantized checkpoint (amd/MiniMax-M2.1-MXFP4) with AITER kernels enabled (VLLM_ROCM_USE_AITER=1) hits a runtime error under TP4, while TP2 works. This directly affects the performance story of MXFP4, a flagship MI355X feature.
  - Impact: Exposes a compatibility problem in AMD's AITER fused-MoE kernels under a specific combination of quantization format (MXFP4) and high parallelism (TP4), which may affect users' assessment of quantized performance on the new hardware (MI355X).
- #35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8)
  - Description: Also reported by functionstackx: running the Kimi K2.5 MXFP4 checkpoint on MI355X, AITER's MLA (multi-head latent attention) kernel errors out because it does not support num_head=8; only 16 or 128 heads are currently supported.
  - Impact: Prevents Kimi K2.5 from reaching full performance on the ROCm platform, indicating that AMD's AITER kernels still need broader coverage of model architectures.
- #35644 [Feature]: AMD MXFP4 MiniMax M2.5 Checkpoint
  - Description: User functionstackx asks the AMD team to publish an MXFP4 quantized checkpoint for the more popular MiniMax M2.5 model, replacing the existing M2.1 checkpoint.
  - Impact: Reflects strong community interest in seeing AMD's latest quantization technology (MXFP4) applied to mainstream models.
- #35642 [Bug]: HIP build in Docker: offload-arch stderr contaminates compiler flags
  - Description: User npathak13 filed a very detailed technical analysis showing that during a HIP build inside Docker (no GPU passthrough), the error output of the failing offload-arch tool contaminates the compiler flags via two separate paths, breaking the build.
  - Impact: A fundamental build problem affecting every containerized ROCm CI/CD pipeline.
- PRs:
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models
  - Description: Submitted by AMD engineer hongxiayang to fix a dependency issue (#35633): adds the amd-quark package to the ROCm requirements files, so that built Docker images and wheels support AMD quantized models directly.
  - Status: Open; pre-commit checks pending.
- #35672 [Core] Move test utility to test file
  - Description: Removes an unused weight-shuffling function from production code, noting that all production paths now use AITER for weight shuffling. This indirectly underscores AITER's central role in the AMD ecosystem.
  - Status: Open, labeled ready.
- #35692 [Bug] Fix HIP build in Docker: filter offload-arch stderr from compiler flags
  - Description: Fix for Issue #35642: filters the warnings emitted by offload-arch in cmake/utils.cmake and CMakeLists.txt so they cannot contaminate the compiler flags.
  - Status: Open; technical discussion and revisions in progress.
- #34931 [AMD][CI] Support Triton attention with ExampleConnector
  - Description: A merged PR that resolves a KV-cache layout incompatibility between ROCm's default Triton attention backend and the ExampleConnector, improving ROCm's functional coverage for KV offloading scenarios.
  - Status: Merged.
Summary: AMD ecosystem activity this cycle concentrated on removing software-stack bottlenecks on the new-generation hardware (MI355X) and shoring up the basic build and deployment flow. Community users actively reported problems, and the AMD team (via -amd-suffixed accounts and hongxiayang) followed up with fixes, showing healthy collaboration.
💬 High-Engagement Discussions
- Issue #35639: [RFC]: vllm bench eval for Unified Accuracy + Performance Evaluation
  - Core topic: Proposes a new unified command that runs accuracy (lm_eval) and performance (Prometheus metrics) evaluation in a single run, to detect performance regressions and correlate accuracy with performance data.
  - Points of contention:
    - Proposer (AndreasKaratzas): Argues that existing CI only evaluates accuracy and cannot catch performance regressions or hardware-specific problems; unified evaluation would save GPU compute and provide more complete insight.
    - Objection/suggestion (DarkLight1337): vllm bench has traditionally referred specifically to performance benchmarking; to avoid confusion, suggests a new command, vllm eval.
  - Development: The proposer then humorously offered vllm omnigres (omni + regression) as an alternative name, easing the tone and signaling openness on naming.
  - Current status: Discussion ongoing; the core functional design has broad support, but the exact name is undecided.
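The RFC's core idea, collecting accuracy and performance data in a single pass rather than in separate CI jobs, can be sketched roughly as follows. This is a minimal illustration of the concept only; the function names, report fields, and regression rule are all hypothetical and do not reflect the actual proposal's design.

```python
import time
from statistics import mean, quantiles

def unified_eval(model_fn, dataset, baseline_p95_ms=None, tolerance=1.10):
    """Run accuracy and performance evaluation in one pass.

    model_fn: callable(prompt) -> answer (stand-in for an inference endpoint).
    dataset: list of (prompt, expected_answer) pairs.
    Returns a single report correlating accuracy with latency, flagging a
    performance regression when p95 latency exceeds baseline * tolerance.
    """
    correct, latencies_ms = 0, []
    for prompt, expected in dataset:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += (answer == expected)

    p95 = quantiles(latencies_ms, n=20)[-1]  # ~95th percentile cut point
    return {
        "accuracy": correct / len(dataset),
        "mean_latency_ms": mean(latencies_ms),
        "p95_latency_ms": p95,
        "perf_regression": (baseline_p95_ms is not None
                            and p95 > baseline_p95_ms * tolerance),
    }

# Toy usage: a lookup-table "model" scored against a 4-item dataset.
data = [("2+2", "4"), ("3+3", "6"), ("1+1", "2"), ("5+5", "10")]
model = dict(data)
report = unified_eval(model.get, data, baseline_p95_ms=1000.0)
print(report["accuracy"])  # 1.0
```

The point of the single-pass design is that each request contributes to both metrics at once, so accuracy and latency are measured on the same traffic and the GPU time is not spent twice.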
- Issue #35642 & PR #35692: HIP Docker build failure
  - Core topic: As noted above, a deep technical problem. The Issue includes an extremely detailed root-cause analysis and a workaround from the reporter.
  - Discussion character: The heat here shows in the depth of analysis rather than comment count. User npathak13 independently diagnosed the problem and proposed a fix, reflecting the community's high technical skill. In PR #35692, submitter infektyd then iterated with an AI code assistant (gemini-code-assist) on the CMake fix details, an example of modern collaborative development.
  - Current status: Root cause identified; the fix PR is submitted and being refined.
- PR #35670 & #35684: fixing the multimodal render endpoint
  - Core topic: Fixes serialization failures of internal data structures when the /v1/chat/completions/render endpoint handles multimodal requests.
  - Discussion character: The same submitter (sergey-zinchenko) closed the original PR (#35670) over a DCO sign-off problem and resubmitted as #35684. Not a functional dispute, but it reflects the project's strict contribution standards (signed commits), part of how open-source quality is maintained.
  - Current status: Original PR closed; the new PR awaits review.
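The class of bug behind #35665 is common to any endpoint that returns internal objects: json serialization fails on non-primitive fields unless a conversion hook is supplied. A generic illustration of that failure class and its fix; the MultiModalPart structure below is hypothetical and is not vLLM's actual type:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass

@dataclass
class MultiModalPart:   # hypothetical internal structure
    kind: str           # e.g. "text" or "image_url"
    payload: bytes      # raw data that json cannot encode directly

def render_response(parts):
    """Serialize internal parts into a JSON response body.

    A plain json.dumps(parts) raises TypeError (dataclass instances and
    bytes are not JSON-serializable) -- the same failure class the render
    endpoint hit. A default= hook converts each part into JSON-safe
    primitives before encoding.
    """
    def encode(obj):
        if is_dataclass(obj):
            d = asdict(obj)
            d["payload"] = d["payload"].decode("latin-1")  # illustrative
            return d
        raise TypeError(f"not serializable: {type(obj)!r}")
    return json.dumps(parts, default=encode)

parts = [MultiModalPart(kind="text", payload=b"hello")]
failed = False
try:
    json.dumps(parts)          # the buggy path: raises TypeError
except TypeError:
    failed = True
body = render_response(parts)  # the fixed path
print(body)
```

Text-only requests never trip this because their payloads are already primitives, which is consistent with the bug surfacing only for multimodal requests.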
🔥 Hot Topics and Trends
- Hardware diversity and problem-solving: Beyond AMD MI355X, this cycle also saw hardware-specific bug reports for Intel XPU (Issues #35651, #35638) and NVIDIA Blackwell (Issue #35659). This reflects the compatibility challenges vLLM faces as it expands its hardware ecosystem, an inevitable stage for a mainstream inference engine. The community held many concrete discussions on driver versions, kernel precision, and best practices for each platform.
- Engineering of evaluation and monitoring: The RFC in Issue #35639 marks the community starting to think systematically about integrating accuracy evaluation and performance monitoring deep into CI/CD to catch performance regressions early, an important signal of the project maturing and prioritizing production stability.
- Quantization going deep, with growing pains: Discussion around newer quantization formats such as MXFP4 (AMD) and NVFP4 (NVIDIA) was dense (Issues #35637, #35641, #35644, #35659; PR #35660). Topics span checkpoint availability, kernel support, memory savings, and the resulting accuracy/stability problems (e.g. #35693, where an uninitialized scale produced inf). Quantization is the current front line of performance optimization, but it also introduces extra complexity.
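The scale-initialization hazard behind #35693 is easy to reproduce in miniature: quantization divides values by a per-tensor scale, so a scale left at a zero-initialized default overflows to inf on the first real weight. The toy below is illustrative only; the scale formula and names do not correspond to vLLM's NVFP4 code, and the helper mimics IEEE-754 division (x/0 -> ±inf) as tensor libraries behave, rather than Python's ZeroDivisionError:

```python
import math

def ieee_div(x, y):
    """Division with IEEE-754 semantics (x/0 -> +/-inf, 0/0 -> nan),
    mimicking tensor-library behavior instead of raising ZeroDivisionError."""
    if y == 0.0:
        return math.copysign(math.inf, x) if x != 0 else math.nan
    return x / y

def quantize(values, scale):
    """Toy symmetric quantization: q = value / scale."""
    return [ieee_div(v, scale) for v in values]

weights = [0.5, -1.25, 2.0]

# Bug: the "global scale" was left at its zero-initialized default,
# so every nonzero weight quantizes to +/-inf and poisons later math.
bad = quantize(weights, 0.0)
assert all(math.isinf(q) for q in bad)

# Fix: initialize the scale from the data before quantizing
# (here: map the max-magnitude weight to ~6, an arbitrary toy choice).
good_scale = max(abs(w) for w in weights) / 6.0
good = quantize(weights, good_scale)
print(bad[0], good[0])
```

Once an inf enters a weight tensor it propagates through matmuls into nan logits, which is why this class of bug tends to surface as garbage output rather than a crash at load time.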
🛠️ Key Technical Changes
- PR #35658: [ROCm] add amd-quark package in requirements
  - Technical reading: Promotes the amd-quark quantization toolkit from "install manually" to a formal dependency of the ROCm backend. An important usability and deployment improvement: users of the official ROCm image or wheel can run AMD quantized models out of the box.
  - Impact: Lowers the barrier to using AMD quantized models and strengthens vLLM's native support for the AMD quantization ecosystem.
- PR #35672: [Core] Move test utility to test file
  - Technical reading: Cleans a leftover weight-shuffling function from production code that had been superseded by the AITER kernel implementation. A code-health improvement.
  - Impact: Simplifies the codebase, makes AITER the single production path for weight shuffling on the AMD platform, and reduces maintenance burden and potential confusion.
- PR #35692: [Bug] Fix HIP build in Docker
  - Technical reading: Resolves an inherent defect of HIP in containerized build environments by filtering the error output of a compiler invocation. A key infrastructure-stability fix.
  - Impact: Should significantly improve the reliability and success rate of Docker-based ROCm CI/CD pipelines, which is critical to developer experience in the AMD ecosystem.
📈 Development Activity
- Contributors: Deep participation this cycle from hardware vendors (e.g. AMD's functionstackx, Intel's jikunshang) and cloud/research organizations (e.g. Red Hat, Cohere); they are not just users but also the ones finding and fixing problems.
- Review and merging: 13 PRs merged, including the long-running FA4 attention integration PR (#32974), showing large feature branches steadily landing on main. Many new PRs were quickly labeled ready, indicating efficient review by the core maintainers.
- Issue management: 10 older Issues closed (some as stale), with new reports actively triaged; issue flow remains healthy.
💡 Issues to Watch
- Software-stack maturity on AMD MI355X: Issues #35637 and #35641 show that on the new flagship GPU, support for the advanced quantization format (MXFP4) and high-performance kernels (AITER) is still incomplete for certain model architectures. AMD needs to prioritize these to deliver on the hardware's performance promise.
- Landing the unified evaluation framework: The RFC in Issue #35639 has high engineering value. Its concrete design, implementation path, and integration with the existing CI system are worth following; it could become a core quality-assurance component of future vLLM releases.
- Stability of the Qwen3.5 model family: Several Issues (#35638, #35700, #35702) and recently closed ones (#35238, #35414) point to assorted problems with Qwen3.5 under specific configurations (FP8, structured output, older GPUs): crashes, dtype errors, nonconforming output. A systematic review of the Qwen3.5 model implementation may be warranted.
- Weight-loading performance regression: Issue #35663 reports that nightly builds take more than 3x longer to load model weights, a serious user-experience and deployment-efficiency problem whose root cause needs prompt investigation.
📋 Appendix: Detailed Data
New Issues
- #35702 [Bug]: Qwen3.5-FP8 Crashes VLLM — bug — by darsh12 (created: 2026-03-02 11:17 (UTC+8))
- #35700 [Bug]: Qwen3.5 structured output doesn’t work — bug — by 315930399 (created: 2026-03-02 11:06 (UTC+8))
- #35663 [Bug]: Nightly takes >3x time to load weights compared to v0.16.0 and earlier nightlies — bug — by ehfd (created: 2026-03-01 23:49 (UTC+8))
- #35686 [Bug][UX]: Unclean shutdown from ctrl-c with AR Fusion — bug,help wanted,good first issue — by robertgshaw2-redhat (created: 2026-03-02 07:28 (UTC+8))
- #35651 [Bug][XPU]: Inference generates garbage with flash attention — bug — by andswitch (created: 2026-03-01 16:31 (UTC+8))
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error — bug,rocm — by functionstackx (created: 2026-03-01 12:32 (UTC+8))
- #35688 [CI Failure][Tool Calling]: — ci-failure — by robertgshaw2-redhat (created: 2026-03-02 08:12 (UTC+8))
- #35679 [CI] Chat template missing for PowerMoE-3b model — ci-failure — by LucasWilkinson (created: 2026-03-02 04:58 (UTC+8))
- #35665 [Bug]: Multimodal Requests Fail on /v1/chat/completions/render Endpoint — bug — by sergey-zinchenko (created: 2026-03-02 00:21 (UTC+8))
- #35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8) — bug,performance,rocm — by functionstackx (created: 2026-03-01 14:49 (UTC+8))
- #35659 [Bug]: cudaErrorIllegalAddress under sustained parallel load with CUDA Graphs on Blackwell SM120 (NVFP4 MoE) — no labels — by munakaya (created: 2026-03-01 23:11 (UTC+8))
- #35638 [Question][XPU]: Best practices and optimized arguments for running 30B+ models on Intel Arc B580 (Dual GPU) via vLLM-XPU — usage — by jiekechoo (created: 2026-03-01 13:09 (UTC+8))
- #35644 [Feature]: AMD MXFP4 MiniMax M2.5 Checkpoint — feature request,rocm — by functionstackx (created: 2026-03-01 15:13 (UTC+8))
- #35642 [Bug]: HIP build in Docker: offload-arch stderr contaminates compiler flags via cmake/utils.cmake and CMAKE_HIP_FLAGS — bug,rocm — by npathak13 (created: 2026-03-01 15:01 (UTC+8))
- #35639 [RFC]: vllm bench eval for Unified Accuracy + Performance Evaluation — rocm,RFC — by AndreasKaratzas (created: 2026-03-01 13:36 (UTC+8))
Closed Issues
- #35593 [Bug]: CUDA Error 803 (system has unsupported display driver / cuda driver combination) when host driver is 590.48.01 due to cuda-compat conflict — bug — by kkkzhonghaiwei (closed: 2026-03-02 10:49 (UTC+8))
- #20611 [Bug]: can not get function call result in tool_calls fileds when using qwen3 withing stream is true and enable_thinking is false — bug,stale — by zhaoyangmushiyi (closed: 2026-03-02 10:16 (UTC+8))
- #26360 [Bug]: Qwen3-VL-235b-thinking-awq suddenly stops in docker. — bug,stale — by kulievvitaly (closed: 2026-03-02 10:16 (UTC+8))
- #27911 [Bug]: Potential Integer Overflow and Out-of-bounds in selective_scan_fwd.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #27912 [Usage]: How should I use the CPU to deploy QWEN3 VL 30B-A3B? — usage,stale — by maxgameone (closed: 2026-03-02 10:15 (UTC+8))
- #27915 [Bug]: Potential Out-of-bounds in moe_wna16.cu and marlin_moe_wna16/ops.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #27917 [Bug]: Potential Out-of-bounds in moe_align_sum_kernels.cu — bug,stale — by molly-ting (closed: 2026-03-02 10:15 (UTC+8))
- #35679 [CI] Chat template missing for PowerMoE-3b model — ci-failure — by LucasWilkinson (closed: 2026-03-02 07:45 (UTC+8))
- #35238 Qwen3.5-27B dtype mismatch in DeltaNet layers during torch.compile (float != c10::Half) — torch.compile — by Anoneeeemus (closed: 2026-03-02 04:14 (UTC+8))
- #35414 [Bug]: 4*2080ti 22g deploy Qwen3.5-35B-A3B fail:2080 Ti does not support bfloat16 — bug — by chuanSir123 (closed: 2026-03-01 11:27 (UTC+8))
New PRs
- #35703 [WIP][Hybrid] Map more FullAttn layers to one page — v1 — by peakcrosser7 (created: 2026-03-02 11:19 (UTC+8))
- #35690 [Bigfix] Fix padding in FULL_DECODE path when MTP is enabled for DP case — v1 — by zyongye (created: 2026-03-02 08:58 (UTC+8))
- #35701 feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) — documentation,frontend — by will-deines (created: 2026-03-02 11:10 (UTC+8))
- #35699 Support multimodal speculative decoding in non-parallel draft_model mode — speculative-decoding,v1 — by EanWang211123 (created: 2026-03-02 10:56 (UTC+8))
- #35696 [Model] Optional FP8 lm_head compression for Llama and Mistral — llama — by lucaspirola (created: 2026-03-02 10:15 (UTC+8))
- #35698 [XPU]Design collect_env.py include the xpu platform specific implementations — no labels — by 1643661061leo (created: 2026-03-02 10:47 (UTC+8))
- #35697 v1 — needs-rebase,cpu — by yintong-lu (created: 2026-03-02 10:36 (UTC+8))
- #35667 [Bugfix] Fix tool call ID collision in Anthropic endpoint — bug,frontend — by umut-polat (created: 2026-03-02 00:49 (UTC+8))
- #35695 [Attention] Enable FP8 KV cache for Triton MLA backend — documentation,v1 — by lucaspirola (created: 2026-03-02 10:13 (UTC+8))
- #35694 [Core] Support FP8 weight storage in unquantized linear and embedding layers — no labels — by lucaspirola (created: 2026-03-02 10:12 (UTC+8))
- #35693 [Bugfix] Fix uninitialized NVFP4 global scale causing inf overflow — bug — by lucaspirola (created: 2026-03-02 10:09 (UTC+8))
- #35660 [Quantization] Support NVFP4-quantized lm_head and embed_tokens via modelopt — no labels — by lucaspirola (created: 2026-03-01 23:33 (UTC+8))
- #35687 [Bugfix] Treat as implicit reasoning end in Qwen3 parser — bug,ready,qwen — by qmx (created: 2026-03-02 07:30 (UTC+8))
- #35692 [Bug] Fix HIP build in Docker: filter offload-arch stderr from compiler flags — bug,ci/build — by infektyd (created: 2026-03-02 09:52 (UTC+8))
- #35672 [Core] Move test utility to test file — ready,gpt-oss — by wjabbour (created: 2026-03-02 03:09 (UTC+8))
- #35656 [Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE — bug,qwen — by oneraghavan (created: 2026-03-01 22:37 (UTC+8))
- #35661 interleave mm strings via request — frontend — by netanel-haber (created: 2026-03-01 23:38 (UTC+8))
- #35684 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,frontend,v1 — by sergey-zinchenko (created: 2026-03-02 06:23 (UTC+8))
- #35683 [MISC] Removed unused function find_all_indices() from tool_parsers/utils.py — ready — by taneem-ibrahim (created: 2026-03-02 06:23 (UTC+8))
- #35691 [XPU] fix mxfp4 activation type — ready — by jikunshang (created: 2026-03-02 09:02 (UTC+8))
- #35689 [CI/Build] ensure local CMake cache isnt included in docker build context — no labels — by wjabbour (created: 2026-03-02 08:45 (UTC+8))
- #35676 [Bugfix] VllmRunner should clear memory on exit in single process mode — bug,v1 — by zou3519 (created: 2026-03-02 04:15 (UTC+8))
- #35674 [Hardware][Power] Add IBM POWER8 (ppc64le) CPU backend support — documentation,speculative-decoding,ci/build,v1,cpu — by Scottcjn (created: 2026-03-02 03:45 (UTC+8))
- #35685 Fix: Suppress spurious cuDeviceGetAttribute errors in cumem_allocator — no labels — by infektyd (created: 2026-03-02 07:12 (UTC+8))
- #35681 Fix unresolved-import errors when using Astral’s ty by removing src.root — no labels — by tlrmchlsmth (created: 2026-03-02 05:16 (UTC+8))
- #35640 [MISC] fixed tool_parser mypy errors — ready — by taneem-ibrahim (created: 2026-03-01 14:02 (UTC+8))
- #35670 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by sergey-zinchenko (created: 2026-03-02 01:45 (UTC+8))
- #35682 feat(spec_decode): expose token provenance and stats in output (#19993) — v1 — by luca-akka (created: 2026-03-02 05:34 (UTC+8))
- #35680 [Bugfix] Fix slow ModelOpt Llama-4 checkpoint loading via tensor contiguity — bug,llama — by universeplayer (created: 2026-03-02 05:03 (UTC+8))
- #35654 [cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties — v1 — by kkt-cohere (created: 2026-03-01 22:20 (UTC+8))
- #35677 [Bugfix] Fix ModelOpt Llama-4 slow checkpoint loading via tensor contiguity — bug,documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by universeplayer (created: 2026-03-02 04:40 (UTC+8))
- #35678 [UX] Remove NoOpOffloader log — no labels — by robertgshaw2-redhat (created: 2026-03-02 04:46 (UTC+8))
- #35675 [Bug Fix] Qwen3.5-nvfp4 MTP Speculative Decoding Weight Shape Mismatch — bug,qwen — by nguyen599 (created: 2026-03-02 04:13 (UTC+8))
- #35673 [Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility — performance,v1,cpu — by atalman (created: 2026-03-02 03:09 (UTC+8))
- #35671 [Model Runner V2] Use block table apis for capture inputs — v1,nvidia — by WoosukKwon (created: 2026-03-02 02:15 (UTC+8))
- #35669 Feature/offloading manager stats — v1,kv-connector — by Srinivasoo7 (created: 2026-03-02 01:30 (UTC+8))
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models — rocm,ci/build — by hongxiayang (created: 2026-03-01 23:08 (UTC+8))
- #35666 [Misc] Use VLLMValidationError consistently in completion and chat completion protocol — frontend — by umut-polat (created: 2026-03-02 00:33 (UTC+8))
- #35664 [Misc] Use VLLMValidationError consistently in SamplingParams._verify_args — no labels — by umut-polat (created: 2026-03-02 00:16 (UTC+8))
- #35662 [Bugfix] Fix misleading cancel error message and noisy CancelledError log — bug,frontend — by umut-polat (created: 2026-03-01 23:44 (UTC+8))
- #35657 [Model] Nano Nemotron VL - fast media preprocessing — no labels — by nvnbagrov (created: 2026-03-01 22:38 (UTC+8))
- #35655 [Model] Nemotron Nano VL faster preprocessing — ci/build,multi-modality,gpt-oss,nvidia — by nvnbagrov (created: 2026-03-01 22:36 (UTC+8))
- #35650 [Bugfix] Suppress UserWarning in binary2tensor for read-only numpy arrays — bug — by lin-shh (created: 2026-03-01 15:43 (UTC+8))
- #35645 Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults — ready — by lin-shh (created: 2026-03-01 15:13 (UTC+8))
- #35643 Replace bare Exception with ValueError for invalid argument values — no labels — by lin-shh (created: 2026-03-01 15:13 (UTC+8))
- #35647 Replace bare Exception with specific exception types (v1, entrypoints) — frontend,v1 — by lin-shh (created: 2026-03-01 15:23 (UTC+8))
- #35649 [Misc] Replace bare AssertionError with specific exception types — ready,v1 — by lin-shh (created: 2026-03-01 15:38 (UTC+8))
- #35648 [Misc] Fix typos in comments: explict→explicit, paramaters→parameters — ready — by lin-shh (created: 2026-03-01 15:38 (UTC+8))
- #35653 Use MMEncoderAttention (=use FlashAttention) instead torch.sdpa in radio.py — no labels — by netanel-haber (created: 2026-03-01 21:32 (UTC+8))
- #35652 Add SwapConnector for per-layer KV cache swapping and reorganize entrypoints — v1,kv-connector — by ZehaoLu98 (created: 2026-03-01 17:26 (UTC+8))
- #35646 Fix typo: implictly -> implicitly in isaac.py docstring — no labels — by lin-shh (created: 2026-03-01 15:22 (UTC+8))
- #35636 fix: correct max_loras grid size in fused_moe_lora kernels — no labels — by N3u0ns (created: 2026-03-01 12:19 (UTC+8))
Merged PRs
- #35256 [Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile — bug,ready — by haosdent (merged: 2026-03-02 04:14 (UTC+8))
- #35327 Fix deprecated v1 config tests — ready — by jcaip (merged: 2026-03-02 09:32 (UTC+8))
- #32974 [Attention] FA4 integration — documentation,ready,ci/build,v1,nvidia — by LucasWilkinson (merged: 2026-03-02 07:44 (UTC+8))
- #34832 Revert “[Bugfix] Disable TRTLLM attention with KV transfer enabled (#33192)” — bug,ready,v1,nvidia — by ZhanqiuHu (merged: 2026-03-02 06:32 (UTC+8))
- #35475 [torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 — documentation,ready,gpt-oss — by zou3519 (merged: 2026-03-02 05:44 (UTC+8))
- #35671 [Model Runner V2] Use block table apis for capture inputs — v1,nvidia — by WoosukKwon (merged: 2026-03-02 02:31 (UTC+8))
- #35382 fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend — ready — by yoonsnowdev (merged: 2026-03-01 22:59 (UTC+8))
- #35630 [MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE — ready — by taneem-ibrahim (merged: 2026-03-01 22:26 (UTC+8))
- #34798 [Mamba1] - Kernel Level Chunk Alignment for Prefix Caching — ready,v1 — by Josephasafg (merged: 2026-03-01 20:40 (UTC+8))
- #35628 [Model Runner V2] Minor refactoring for EncoderRunner — v1 — by WoosukKwon (merged: 2026-03-01 16:15 (UTC+8))
- #34931 [AMD][CI] Support Triton attention with ExampleConnector — rocm,ready,v1,kv-connector — by rjrock (merged: 2026-03-01 15:58 (UTC+8))
- #35646 Fix typo: implictly -> implicitly in isaac.py docstring — no labels — by lin-shh (merged: 2026-03-01 15:34 (UTC+8))
- #35617 [Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring –dtype flag on older GPUs — bug,ready,qwen — by lailoo (merged: 2026-03-01 11:27 (UTC+8))
PRs Closed Without Merging
- #35701 feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) — documentation,frontend — by will-deines (closed: 2026-03-02 11:11 (UTC+8))
- #35009 update-pillow-version to mitigate CVE-2026-25990 — ci/build — by to-curiosity (closed: 2026-03-02 10:22 (UTC+8))
- #35346 Cpu dispatcher — documentation,performance,rocm,needs-rebase,ci/build,v1,cpu,nvidia — by majian4work (closed: 2026-03-02 09:04 (UTC+8))
- #26371 [Kernel] FA4 Integration. — needs-rebase,stale,v1,nvidia — by zyongye (closed: 2026-03-02 08:10 (UTC+8))
- #35670 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by sergey-zinchenko (closed: 2026-03-02 06:09 (UTC+8))
- #35677 [Bugfix] Fix ModelOpt Llama-4 slow checkpoint loading via tensor contiguity — bug,documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by universeplayer (closed: 2026-03-02 04:51 (UTC+8))
- #31512 [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8 — documentation,speculative-decoding,needs-rebase,ci/build,v1,cpu — by Scottcjn (closed: 2026-03-02 03:45 (UTC+8))
- #34955 [Bugfix] kimi k2.5 tool call tokens leaking into reasoning_content — bug,frontend,deepseek — by felixmr1 (closed: 2026-03-02 00:41 (UTC+8))
- #35655 [Model] Nemotron Nano VL faster preprocessing — ci/build,multi-modality,gpt-oss,nvidia — by nvnbagrov (closed: 2026-03-01 22:37 (UTC+8))
- #35607 fix(qwen3.5-mtp): propagate spec_step_idx to enable multi-layer MTP — qwen — by cwazai (closed: 2026-03-01 20:28 (UTC+8))
- #35652 Add SwapConnector for per-layer KV cache swapping and reorganize entrypoints — v1,kv-connector — by ZehaoLu98 (closed: 2026-03-01 17:26 (UTC+8))
- #24918 [Attention] remove decode wrapper of flashinfer — needs-rebase,stale,v1,nvidia — by happierpig (closed: 2026-03-01 17:24 (UTC+8))