vLLM Development Activity Report - 2026-02-19

Time window: 2026-02-19 11:27 (UTC+8) ~ 2026-02-20 11:27 (UTC+8)
Statistics: New Issues 11 | Closed Issues 11 | New PRs 60 | Merged PRs 22 | PRs Closed Without Merging 6

📊 Daily Development Summary

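In the 24 hours from February 19 to 20, 2026, the vLLM project was highly active: 22 PRs were merged, 22 Issues were handled (11 opened and 11 closed), and 60 new PRs were created. Development focused on compatibility and performance fixes for large-scale model deployment (such as Qwen3.5-397B), quantization accuracy problems (especially FP8 combined with MoE), and better cross-platform support (particularly the AMD ROCm ecosystem). Community discussion was lively, and several Issues involving core functionality (such as embedding generation and NVLink support) prompted in-depth technical exchanges.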

🎯 AMD/ROCm Ecosystem Updates

This period saw explicit AMD/ROCm-related development activity, most notably a PR that fixes a compatibility problem on the ROCm platform:

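PR #34931 ([AMD][CI] Support Triton attention with ExampleConnector):
- Submitter: rjrock (no typical "-amd" username suffix, but the title and labels clearly point to AMD and ROCm).
- Content analysis: The PR addresses an incompatibility on ROCm between ExampleConnector (used for KV transfer) and the Triton attention backend. The root cause is that ROCm defaults to Triton attention while CUDA defaults to Flash attention, and the two use different KV cache layouts. The existing ExampleConnector implementation supported only the Flash attention layout, so it failed on ROCm.
- Technical impact: The fix matters to AMD GPU users who rely on KV transfer (P/D disaggregation). It removes a compatibility barrier caused by differences in low-level kernels across hardware platforms, and is an important step toward feature completeness for vLLM in the AMD ecosystem. The PR is currently a Draft because its check for TritonAttentionMetadata was flagged as model-specific and may need to be generalized.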
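
To make the layout difference concrete, here is a minimal sketch of how the flat offset of one block's K or V data changes between two cache layouts. The two shapes used, (2, num_blocks, ...) for a Flash-style layout and (num_blocks, 2, ...) for a Triton-style layout, are assumptions for illustration and are not confirmed by the PR; the function name block_offset is hypothetical.

```python
def block_offset(layout: str, kv: int, block: int,
                 num_blocks: int, block_elems: int) -> int:
    """Flat element offset of the K (kv=0) or V (kv=1) data of one block.

    block_elems is the number of elements stored per block for K or V alone.
    Both layouts below are assumed shapes, used only for illustration.
    """
    if kv not in (0, 1):
        raise ValueError("kv must be 0 (K) or 1 (V)")
    if not 0 <= block < num_blocks:
        raise ValueError("block index out of range")
    if layout == "kv_first":
        # Assumed shape: (2, num_blocks, ...). All K blocks, then all V blocks.
        return (kv * num_blocks + block) * block_elems
    if layout == "block_first":
        # Assumed shape: (num_blocks, 2, ...). K and V interleaved per block.
        return (block * 2 + kv) * block_elems
    raise ValueError(f"unknown layout: {layout}")
```

A connector that hard-codes the first formula would read the wrong region of memory when handed a cache in the second layout, which is the kind of mismatch described above.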

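Summary: AMD-related work in this period focused on removing a blocking problem in cross-platform KV transfer, reflecting continued progress in AMD hardware support from "it runs" to "feature complete". Although nothing touched new features such as Quark quantization or MI300, adapting existing core functionality (attention mechanisms, distributed communication) is the foundation of ecosystem building.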

💬 High-Traffic Discussion Analysis

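Issue #34910: "Embeddings generated by vLLM do not match Hugging Face sentence-transformer"
- Core topic: With the same model (Qwen3-Embedding-0.6B), a user found sign differences between the embedding vectors produced by vLLM and by the Hugging Face Sentence Transformers library, and questioned the correctness of vLLM's implementation.
- Views and disagreement:
  - User (ehsankf): Provided detailed comparison data showing that the sign differences are persistent and remain even with trust_remote_code=True.
  - Contributor (darshjme-codes): Offered an "expert analysis" arguing that the root cause is an inconsistent preprocessing pipeline (vLLM handles the embedding model as a generative model, while SentenceTransformers has a dedicated flow), and introduced the notion of "pipeline drift". Suggested implementing a dedicated embedding preprocessing module in vLLM.
  - Maintainer (haosdent): Raised a more direct possibility: inherent numerical differences between attention implementations (FlashAttention/PagedAttention vs SDPA/eager). Asked whether the differences lead to bad results in actual use.
- Point of contention: Whether this is a "bug" (an implementation error) or "expected behavior" (a reasonable difference between optimized backends). The contributor's long analysis leans toward the former, the maintainer's question toward the latter.
- Current status: The issue is still open, waiting for the user's answer to haosdent's question to decide whether to investigate further or to clarify the documentation.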
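
One way to answer the maintainer's question is to compare the two embeddings by cosine similarity instead of element by element. The following is a minimal sketch in plain Python; the function names and the 0.99 threshold are illustrative assumptions, not values taken from the issue.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_embeddings(a, b, threshold=0.99):
    """Summarize how far apart two embeddings are.

    threshold is a hypothetical cutoff; pick one that suits the application.
    """
    cosine = cosine_similarity(a, b)
    max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
    return {"cosine": cosine, "max_abs_diff": max_abs_diff,
            "similar": cosine >= threshold}
```

Small numerical noise from different attention kernels would typically leave the cosine close to 1, whereas a vector whose components are all sign-flipped gives a cosine of -1, so this check helps separate the two explanations discussed above.
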
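Issue #34891: "RuntimeError: [SymmDeviceMemory] Device does not support multicasting."
- Core topic: When loading large models on multiple H200/H100 GPUs (with NVLink configured), users hit an error saying that "symmetric device memory" does not support multicasting, and the server fails to start.
- Views and disagreement:
  - Several users (vitush93, robogast, RocketRider): Confirmed the same problem, affecting models including Qwen3.5 and Llama 3.3 70B, and noted that v0.15.1 works correctly.
  - User (vitush93): Using AI-assisted analysis, traced the problem to two commits: one enabled the fuse_allreduce_rms optimization by default, and the other switched to a new FlashInfer API that internally uses SymmDeviceMemory, which requires NVLink/multicast support. The optimization path does not check hardware compatibility.
  - Maintainer (haosdent): Provided a temporary workaround: set the environment variable VLLM_ALLREDUCE_USE_SYMM_MEM=0 to disable this optimization path.
- Point of contention: No substantive disagreement; mostly users collaborating on diagnosis and sharing workarounds.
- Current status: The issue is open. The maintainer has provided an effective workaround. The root cause is clear (a new optimization enabled by default without hardware capability detection), and a follow-up PR is expected to add a conditional check or a more graceful fallback.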
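
The workaround can be applied from a launcher script by setting the variable in the child process environment. The sketch below uses only the variable name quoted in the issue; the helper name env_without_symm_mem is hypothetical, and the launch command in the comment uses a placeholder model name.

```python
import os


def env_without_symm_mem(base=None):
    """Return a copy of the environment with the symmetric-memory
    all-reduce path disabled, as suggested in the issue."""
    env = dict(os.environ if base is None else base)
    env["VLLM_ALLREDUCE_USE_SYMM_MEM"] = "0"
    return env


# Example launch (not executed here; <model> is a placeholder):
#   import subprocess
#   subprocess.run(["vllm", "serve", "<model>"], env=env_without_symm_mem())
```

Returning a copy keeps the parent process environment unchanged, so the setting applies only to the server process that is launched with it.
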
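Issue #34893: "Qwen3.5-397B-A17B-FP8 fails with TP=4"
- Core topic: Loading the Qwen3.5 FP8 model with 4-way tensor parallelism fails with a dimension-mismatch runtime error while sharding a fused linear layer.
- Views and disagreement:
  - User (UmutAlihan): Provided full reproduction steps and the stack trace, pointing to a specific weight-loading function.
  - Maintainer (Isotr0py): First suspected that the user's nightly build did not yet contain the fix for FP8 support on this model, and suggested upgrading past a specific commit.
  - User (UmutAlihan), follow-up: Found that the problem persists even with a prebuilt Docker image that should contain the fix. Noted that the image was built from the correct branch but may not include the key fix.
  - Maintainer (haosdent): Confirmed the suspicion. After inspecting the image files, found that the relevant code had not been updated, which confirmed a problem with the prebuilt image.
- Point of contention: None. The discussion moved from debugging a code bug to a CI/CD and release process problem: the prebuilt Docker image did not pick up the latest code fix.
- Current status: The issue is open. The root cause has been narrowed to a synchronization problem in the build/release process, not the model code itself.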

🔥 Hot Topics and Trend Analysis

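"Sharp edge" problems in large-scale model deployment: With Qwen3.5-397B-FP8 as the leading example, several Issues (#34893, #34891, #34892) exposed how fragile the software stack can be in weight sharding, memory optimization, and multi-GPU communication under extreme configurations (high TP, FP8 quantization, MoE). This reflects the ongoing challenge of pursuing peak performance while supporting the newest models.
Complexity of quantization and accuracy problems: The spread of low-precision formats such as FP8 adds a new class of problems. Issue #34892 and its fix, PR #34914, revealed a subtle failure in which extreme (near-zero) quantization scale factors interact with a specific compute kernel (CUTLASS) to produce NaN. Developers need finer control over numerical stability.
Performance optimization and kernel selection: Several PRs (#34917, #34924, #34880) focus on replacing default operations (such as torch.cat), enabling new kernel paths (DeepGEMM swapAB), or extending the reach of existing optimizations (CUDA Graph for speculative decoding). The community continues to invest in squeezing out baseline performance.
User experience and robustness: Issue #34932 (a concurrency problem in the Hermes tool parser), PR #34930 (detecting missing quantized weight files), and PR #34862 (guarding the Voxtral realtime API against empty input) show that stability, observability, and fault tolerance in production have become an important development direction.
Documentation and collaboration: A dedicated Issue (#34901) requests better documentation for newer features such as speculative decoding, and PR #34934 fixes typos across a wide range of areas. This reflects the attention the project pays to knowledge management and collaboration efficiency while it grows quickly.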
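
As an illustration of the kind of check PR #34930 describes, the sketch below lists shard files that a sharded safetensors checkpoint references but that are absent on disk. It assumes the Hugging Face index file convention (a JSON file with a "weight_map" from tensor name to shard file name); whether the PR works this way is not stated in the report, and the function name find_missing_shards is hypothetical.

```python
import json
from pathlib import Path


def find_missing_shards(index_path):
    """Return the shard file names referenced by the index but not on disk.

    index_path points at an index file such as model.safetensors.index.json;
    shard files are expected to sit in the same directory.
    """
    index_path = Path(index_path)
    with index_path.open(encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    # Many tensors map to the same shard, so deduplicate before checking.
    shard_names = sorted(set(weight_map.values()))
    return [name for name in shard_names
            if not (index_path.parent / name).is_file()]
```

Running a check like this before loading weights turns a late, hard-to-read loading failure into an early error that names the missing files.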

🛠️ Key Technical Changes

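PR #34914 (Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales): Merged. This fix addresses NaN outputs from the Qwen3.5 FP8 model on the Hopper architecture (such as H200) when using the FlashInfer CUTLASS MoE kernel, caused by extremely small weight scale factors (~1e-23) in some experts. Clamping the scale factors restores numerical stability. Impact: Fixes a critical defect on the current main branch that produced garbled model output, and is essential for the usability of FP8 MoE models.
PR #34880 (Enables Eagle drafter support for FULL CUDA Graph mode): In progress. This PR enables full CUDA Graph mode for the Eagle speculative decoding draft model, which substantially reduces kernel launch overhead and improves speculative decoding performance. Impact: Expected to significantly improve throughput and latency when using Eagle-family draft models for speculative decoding.
PR #34818 (Fix Basic Models Test): Merged. This PR fixes a series of model initialization test failures caused by the CUDA initialization logic change introduced in PR #33600. It moves the platform-specific (for example, MLA) backend block size update logic into the platform module, which avoids initializing CUDA at test import time. Impact: Restores CI test stability and is an important fix for the robustness of the underlying framework.
Issue #34910 (embeddings from vllm does not match hugging face): Open discussion. Although unresolved, the expert analysis in the thread points to a potential systemic design problem: vLLM may lack a dedicated preprocessing pipeline for embedding models. Impact: The discussion may push vLLM toward more accurate, dedicated handling for embedding tasks instead of treating them as a variant of text generation.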
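
The scale clamping described for PR #34914 above can be sketched in a few lines. This is an illustration of the general technique in plain Python, not the PR's code: the function name clamp_scales and the lower bound of 1e-10 are hypothetical, and the report does not state the value the PR uses.

```python
def clamp_scales(scales, min_scale=1e-10):
    """Raise every scale factor to at least min_scale.

    Scales are assumed to be non-negative quantization scale factors;
    min_scale is a hypothetical lower bound chosen for illustration.
    """
    if min_scale <= 0:
        raise ValueError("min_scale must be positive")
    return [max(s, min_scale) for s in scales]
```

With this bound, a scale of about 1e-23 becomes 1e-10 while ordinary scales pass through unchanged, so downstream arithmetic never sees the near-zero values that triggered the NaN outputs.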

📈 Development Activity Observations

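Contributor activity: 60 new PRs in 24 hours indicate very high community contribution and iteration speed. Core maintainers (such as haosdent, ywang96, mgoin) were active in reviewing, fixing, and discussing many PRs and Issues.
Review and merge efficiency: 22 PRs were merged, which indicates an efficient review and integration process. Many PRs that fix critical problems (such as #34914, #34876) were merged within a day.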
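AMD ecosystem participation: This report singles out one AMD/ROCm PR (#34931), submitted by a contributor without the usual AMD-style username. The appendix also lists other rocm-labeled PRs in the window, mostly CI and test work (#34878, #34879, #34918, #34922, #34923) plus a P/D fix (#34907). This may indicate that AMD-related contributions in this window centered on CI infrastructure, internal testing, and issue reproduction more than on new features.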

💡 Issues Worth Watching

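Issue #34893 (Qwen3.5-397B-A17B-FP8 fails with TP=4): This issue exposes the risk that prebuilt artifacts (such as Docker images) lag behind or fall out of sync with the code repository. For production users who depend on these artifacts this is a potential pain point, so both the resolution and any follow-up process improvements deserve attention.
Issue #34935 (TypeError: '>' not supported between instances of 'str' and 'int'): Although this is a usage question, it reflects the ripple effect that default behavior changes in upstream libraries (such as transformers) have on vLLM compatibility. The community needs to keep tracking such dependency changes and consider how to give clearer guidance in the API or documentation.
PR #34931 ([AMD][CI] Support Triton attention with ExampleConnector): As the active AMD-related fix in this window, its progress and eventual merge are worth close attention from AMD platform users. It directly affects the availability of advanced features such as KV transfer on ROCm.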

📋 Appendix: Detailed Data Lists

New Issues

#34893 [Bug]: Qwen3.5-397B-A17B-FP8 fails with TP=4 - fused linear layer sharding incompatibility | no labels | by UmutAlihan (created: 2026-02-19 20:56 (UTC+8))
#34935 [Usage]: TypeError: '>' not supported between instances of 'str' and 'int' | usage | by lottopotato (created: 2026-02-20 10:25 (UTC+8))
#34910 [Bug]: the embeddings from vllm does not match hugging face sentence-transformer | bug | by ehsankf (created: 2026-02-20 03:27 (UTC+8))
#34932 RuntimeError: Already borrowed in Hermes tool parser under concurrent load | no labels | by eligotts (created: 2026-02-20 09:48 (UTC+8))
#34891 [Bug]: RuntimeError: [SymmDeviceMemory] Device does not support multicasting. | bug | by RocketRider (created: 2026-02-19 20:38 (UTC+8))
#34929 [Feature]: Get latency and tpot per request in vLLM benchmark | feature request | by snova-rodrigom (created: 2026-02-20 08:32 (UTC+8))
#34892 [Bug]: Qwen3.5 FP8 accuracy degradation with FlashInfer CUTLASS MoE backend | bug | by Isotr0py (created: 2026-02-19 20:44 (UTC+8))
#34886 [Bug]: #32618 performance issue | bug | by gengchaogit (created: 2026-02-19 18:38 (UTC+8))
#34901 [Doc]: Speculators Docs Follow-ups | documentation | by kylesayrs (created: 2026-02-20 01:16 (UTC+8))
#34869 [Bug]: Deepseek V3.1 NVFP4 Weight Loading Fails | bug | by wzhao18 (created: 2026-02-19 12:21 (UTC+8))
#34877 [Bug]: OOM during mxfp4 weight swizzle when --enable-sleep-mode is enabled on single GPU | no labels | by mmahmed (created: 2026-02-19 14:33 (UTC+8))

Closed Issues

#34403 [Bug]: error when use vllm on deepseek-vl2 | bug | by Pissohappy (closed: 2026-02-20 10:33 (UTC+8))
#34892 [Bug]: Qwen3.5 FP8 accuracy degradation with FlashInfer CUTLASS MoE backend | bug | by Isotr0py (closed: 2026-02-20 08:15 (UTC+8))
#34622 [CI Failure]: B200 Kernels | ci-failure | by NickLucche (closed: 2026-02-20 07:14 (UTC+8))
#34806 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[EagleMiniCPMForCausalLM] | ci-failure | by ilmarkov (closed: 2026-02-20 06:49 (UTC+8))
#34800 [CI Failure]: entrypoints/weight_transfer/test_weight_transfer_llm.py | ci-failure | by ilmarkov (closed: 2026-02-19 23:01 (UTC+8))
#15483 [Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made | bug,stale | by erdaltoprak (closed: 2026-02-19 22:45 (UTC+8))
#34345 [Bug]: Since version >=0.14.0, the key-value cache of heterogeneous graphics cards cannot be allocated correctly. | bug | by gengchaogit (closed: 2026-02-19 17:17 (UTC+8))
#34532 [Bug]: Realtime API crashes when client terminates connection “incorrectly” | bug | by nullquery (closed: 2026-02-19 15:21 (UTC+8))
#34869 [Bug]: Deepseek V3.1 NVFP4 Weight Loading Fails | bug | by wzhao18 (closed: 2026-02-19 15:20 (UTC+8))
#33512 Responses API reasoning_tokens always zero for text-based reasoning parsers | no labels | by anencore94 (closed: 2026-02-19 15:16 (UTC+8))
#25448 [New Model]: Issac 0.1 | new-model | by ywang96 (closed: 2026-02-19 12:57 (UTC+8))

New PRs

#34934 [Bugfix][CI] fix typos | bug,documentation,performance,rocm,structured-output,frontend,ci/build,v1,multi-modality,qwen | by 1195343015 (created: 2026-02-20 10:17 (UTC+8))
#34871 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches | bug,v1 | by haosdent (created: 2026-02-19 12:22 (UTC+8))
#34916 [Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend | ready | by frankwang28 (created: 2026-02-20 04:07 (UTC+8))
#34937 [Bugfix]: Qwen3 reasoning parser now handles in prompt prefix | bug,qwen | by babyplutokurt (created: 2026-02-20 10:59 (UTC+8))
#34936 [UX] Add --performance-mode {balanced,latency,throughput} | no labels | by mgoin (created: 2026-02-20 10:58 (UTC+8))
#34933 [FIX] fused moe with lora shared expert dual stream (1.07x otps) | no labels | by jhaotingc (created: 2026-02-20 09:53 (UTC+8))
#34930 [Bugfix] Detect missing shard files in quantized checkpoints | bug,v1 | by thillai-c (created: 2026-02-20 09:43 (UTC+8))
#34931 [AMD][CI] Support Triton attention with ExampleConnector | rocm,v1,kv-connector | by rjrock (created: 2026-02-20 09:46 (UTC+8))
#34874 [WIP][Bugfix] Fix prefix caching for Mamba ‘all’ mode (Nemotron models) | bug,v1 | by haosdent (created: 2026-02-19 12:31 (UTC+8))
#34898 [Bug] Refactor max_num_batched_tokens to account for drafting | bug,speculative-decoding,ready,v1 | by benchislett (created: 2026-02-20 01:04 (UTC+8))
#34900 [Model Bash][DSR1] Add selective dynamic shape marking for CustomOp | performance,deepseek,nvidia | by vadiklyutiy (created: 2026-02-20 01:07 (UTC+8))
#34899 Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion | ready,ci/build,deepseek,nvidia,ready-run-all-tests | by wzhao18 (created: 2026-02-20 01:07 (UTC+8))
#34883 [Core] Add All-to-All communication backend for DCP | v1,nvidia | by sungsooha (created: 2026-02-19 16:18 (UTC+8))
#34875 [Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 | bug,v1 | by haosdent (created: 2026-02-19 12:40 (UTC+8))
#34909 [Refactor] Extract Harmony streaming SSE event builders into streaming_events.py | frontend,ready,gpt-oss | by sfeng33 (created: 2026-02-20 03:20 (UTC+8))
#34870 [perf] Avoid dtype promotion sync in mamba_get_block_table_tensor | ready,v1 | by hl475 (created: 2026-02-19 12:21 (UTC+8))
#34923 [ROCm][CI] Added MI325 mirrors | rocm,ci/build | by AndreasKaratzas (created: 2026-02-20 07:18 (UTC+8))
#34902 [Spec Decode] Track per request level spec decode acceptance metric | v1,meta-exported,fb-exported | by zixi-qi (created: 2026-02-20 01:18 (UTC+8))
#34928 [Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names | ready | by gmagogsfm (created: 2026-02-20 07:59 (UTC+8))
#34872 [WIP][Bugfix] Fix GPTQ quantized Qwen3-Next MoE expert weight loading | bug,qwen | by haosdent (created: 2026-02-19 12:29 (UTC+8))
#34868 [WIP][Bugfix] Fix torch.compile startup time logging inaccuracy | bug | by haosdent (created: 2026-02-19 12:17 (UTC+8))
#34925 [Bug] Fix pooling unit test run_mteb_rerank | bug,ready | by yewentao256 (created: 2026-02-20 07:38 (UTC+8))
#34927 [Bugfix][Kernel] Fix activation_kernels.cu build failure on Blackwell… | bug | by FloatingVertex (created: 2026-02-20 07:53 (UTC+8))
#34921 [Models] LFM2: Support LoRA | no labels | by tianshu-Michael-yu (created: 2026-02-20 05:00 (UTC+8))
#34926 [Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle | v1,kv-connector | by ZhanqiuHu (created: 2026-02-20 07:40 (UTC+8))
#34895 [Draft] Gate cuBLAS kernel fallback | needs-rebase,ci/build,deepseek | by roikoren755 (created: 2026-02-19 22:50 (UTC+8))
#34917 [Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat | v1,nvidia | by LopezCastroRoberto (created: 2026-02-20 04:18 (UTC+8))
#34924 [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default | performance,ready,ready-run-all-tests | by mgoin (created: 2026-02-20 07:35 (UTC+8))
#34908 [UX] More descriptive reasons in is_supported_config for MoE | ready,nvidia | by mgoin (created: 2026-02-20 03:14 (UTC+8))
#34907 [ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector so moriio_connector to restore P/D functionality | bug,documentation,rocm,kv-connector | by rasmith (created: 2026-02-20 03:12 (UTC+8))
#34906 [Quantization] Support FP8 MoE bias for models like GPT-OSS | gpt-oss | by jasperjiaguo (created: 2026-02-20 02:57 (UTC+8))
#34922 [ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout | rocm | by micah-wil (created: 2026-02-20 06:03 (UTC+8))
#34914 [Bugfix] Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales | bug,ready,qwen,nvidia | by ywang96 (created: 2026-02-20 03:52 (UTC+8))
#34918 Change targets for AMD build in the “CI” pipeline | rocm,ready,ci/build | by Alexei-V-Ivanov-AMD (created: 2026-02-20 04:18 (UTC+8))
#34912 [compile] Fix torch.compile time discrepancy in logging. | ready | by zhxchen17 (created: 2026-02-20 03:51 (UTC+8))
#34920 Enable Tunable Control for naive_block_assignment | no labels | by RunkaiTao (created: 2026-02-20 04:52 (UTC+8))
#34913 [CI] Fix custome offline ci issue, V1 Others Test bug | bug,ready,v1 | by yewentao256 (created: 2026-02-20 03:51 (UTC+8))
#34919 [Bugfix][Model] Qwen3-Coder and Qwen3-Coder-Next: Fix qwen3_coder tool parser to handle </parameter> in parameter content | bug,qwen | by yaro-tal (created: 2026-02-20 04:41 (UTC+8))
#34887 [BugFix] anthropic/serving_messages: fix tool call arguments streaming | bug,frontend,ready | by dtrifiro (created: 2026-02-19 19:53 (UTC+8))
#34915 [ci] Use the right tag for CPU arm64 image | ci/build | by khluu (created: 2026-02-20 04:01 (UTC+8))
#34911 [Bugfix] Fix Qwen3.5 FP8 on Hopper | bug,frontend,qwen,nvidia | by ywang96 (created: 2026-02-20 03:47 (UTC+8))
#34905 Fix GLM4 parser tests | no labels | by RNabel (created: 2026-02-20 02:34 (UTC+8))
#34904 Support prompt_embeds for pooling requests in output processor | v1 | by laviier (created: 2026-02-20 01:50 (UTC+8))
#34903 [Bug] Fix illegal memory access issue for model runner v2 | bug,ready,v1,nvidia | by yewentao256 (created: 2026-02-20 01:33 (UTC+8))
#34894 [DOC] Add INT8 W4A8 docs and Arm’s supported quantization schemes | documentation | by fadara01 (created: 2026-02-19 21:46 (UTC+8))
#34897 WIP: Add Kuberenetes DRA examples | documentation | by harche (created: 2026-02-19 23:44 (UTC+8))
#34880 [Spec Decode][CUDA Graphs] Enables Eagle drafter support for FULL CUDA Graph mode | speculative-decoding,v1,nvidia | by yiz-liu (created: 2026-02-19 15:39 (UTC+8))
#34890 [CPU][Feat] Enable KleidiAI grouprise INT8_W4A8 with BF16 inputs | no labels | by fadara01 (created: 2026-02-19 20:29 (UTC+8))
#34867 [Build] Add cuda 12.8 release wheel | documentation,ready,ci/build,nvidia | by LucasWilkinson (created: 2026-02-19 11:55 (UTC+8))
#34896 [PD] Change kv_load_failure_policy Default from “recompute” to “fail” | documentation,ready,kv-connector | by NickLucche (created: 2026-02-19 23:08 (UTC+8))
#34881 [Bugfix] fix qwen 3.5 dp+ep assertion error | bug,qwen | by ZJY0516 (created: 2026-02-19 15:46 (UTC+8))
#34889 [Misc] Minor tests/v1/logits_processors/test_custom_offline.py changes | v1 | by NickLucche (created: 2026-02-19 20:09 (UTC+8))
#34888 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 | no labels | by SteadfastAsArt (created: 2026-02-19 19:55 (UTC+8))
#34882 docs: update CPU Docker images to reference Docker Hub instead of AWS ECR | documentation,cpu | by cluster2600 (created: 2026-02-19 15:55 (UTC+8))
#34885 [CI/Build] Try to make beam search test less flaky | ready | by DarkLight1337 (created: 2026-02-19 17:46 (UTC+8))
#34884 [Bugfix] Fix edge case in UUID data parsing | bug,ready | by DarkLight1337 (created: 2026-02-19 16:20 (UTC+8))
#34878 [ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-02-19 14:43 (UTC+8))
#34879 [ROCm][CI] Removing all blocking labels from MI355 until stable infra | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-02-19 15:26 (UTC+8))
#34876 [Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix | bug,ready,deepseek | by wzhao18 (created: 2026-02-19 12:50 (UTC+8))
#34873 [WIP][Feature] Support GGUF models with qwen3vl architecture | qwen | by haosdent (created: 2026-02-19 12:30 (UTC+8))

Merged PRs

#34840 [Core] Fix state names in pause_scheduler() | v1 | by markmc (merged: 2026-02-20 09:21 (UTC+8))
#34359 [CI] Add GPT-OSS Eval job for H100 | ready,ci/build,gpt-oss | by mgoin (merged: 2026-02-20 09:14 (UTC+8))
#34856 [Model Runner V2] Minor CPU optimizations | v1 | by njhill (merged: 2026-02-20 08:05 (UTC+8))
#34665 [Bugfix] Fix benchmark_fused_collective crash on CustomOp init | bug,performance,ready | by mayank-ketkar-sf (merged: 2026-02-20 08:01 (UTC+8))
#34908 [UX] More descriptive reasons in is_supported_config for MoE | ready,nvidia | by mgoin (merged: 2026-02-20 07:20 (UTC+8))
#34818 [Bugfix] Fix Basic Models Test | bug,speculative-decoding,ready,v1,multi-modality,ci-failure,nvidia | by MatthewBonanni (merged: 2026-02-20 06:49 (UTC+8))
#34914 [Bugfix] Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales | bug,ready,qwen,nvidia | by ywang96 (merged: 2026-02-20 05:34 (UTC+8))
#34918 Change targets for AMD build in the “CI” pipeline | rocm,ready,ci/build | by Alexei-V-Ivanov-AMD (merged: 2026-02-20 05:26 (UTC+8))
#34263 [Refactor] Deprecate head_first for chunk_gated_delta_rule | ready,qwen | by yewentao256 (merged: 2026-02-20 02:23 (UTC+8))
#34808 Revert “[NemotronH] Do not force router to run in fp32 (#34582)” | bug,ready,nvidia | by roikoren755 (merged: 2026-02-20 01:47 (UTC+8))
#34344 [Model Bash][DeepSeekR1] Remove Shared Expert Clone | performance,ready,deepseek | by robertgshaw2-redhat (merged: 2026-02-19 23:56 (UTC+8))
#34471 [Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers | ready,llama | by eldarkurtic (merged: 2026-02-19 23:55 (UTC+8))
#34719 [Bugfix] Qwen3.5 kv-scale weight remapping | bug,ready,qwen | by Linda-Stadter (merged: 2026-02-19 20:13 (UTC+8))
#34885 [CI/Build] Try to make beam search test less flaky | ready | by DarkLight1337 (merged: 2026-02-19 19:16 (UTC+8))
#34884 [Bugfix] Fix edge case in UUID data parsing | bug,ready | by DarkLight1337 (merged: 2026-02-19 18:24 (UTC+8))
#34878 [ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-02-19 16:25 (UTC+8))
#34879 [ROCm][CI] Removing all blocking labels from MI355 until stable infra | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-02-19 15:53 (UTC+8))
#34862 [Voxtral Realtime] Fix engine crash on empty multimodal embeddings | ready | by talnirnx (merged: 2026-02-19 15:21 (UTC+8))
#34876 [Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix | bug,ready,deepseek | by wzhao18 (merged: 2026-02-19 15:20 (UTC+8))
#34847 [Bugfix] Add Quant Config to Llava Next Projector | bug | by alex-jw-brooks (merged: 2026-02-19 15:18 (UTC+8))
#34836 fix(docs): fix typos in comments and docstrings | gpt-oss | by machov (merged: 2026-02-19 15:17 (UTC+8))
#33513 [Frontend] Fix reasoning_tokens for text-based parsers in Responses API | frontend,ready | by anencore94 (merged: 2026-02-19 15:16 (UTC+8))

PRs Closed Without Merging

#33382 [BugFix] Fix DeepGEMM Warmup Logic | bug | by robertgshaw2-redhat (closed: 2026-02-20 10:21 (UTC+8))
#34550 [CI] [Kernel] Rename Helion config key to match actual H100 SXM5 device name used in CI | no labels | by gmagogsfm (closed: 2026-02-20 08:55 (UTC+8))
#34868 [WIP][Bugfix] Fix torch.compile startup time logging inaccuracy | bug | by haosdent (closed: 2026-02-20 08:08 (UTC+8))
#34913 [CI] Fix custome offline ci issue, V1 Others Test bug | bug,ready,v1 | by yewentao256 (closed: 2026-02-20 04:47 (UTC+8))
#34911 [Bugfix] Fix Qwen3.5 FP8 on Hopper | bug,frontend,qwen,nvidia | by ywang96 (closed: 2026-02-20 03:50 (UTC+8))
#34159 [Docs] MTP Docs | documentation,needs-rebase,v1 | by kylesayrs (closed: 2026-02-20 01:11 (UTC+8))