vLLM Development Activity Report - 2026-02-19
Time window: 2026-02-19 11:27 (UTC+8) to 2026-02-20 11:27 (UTC+8). Statistics: 11 new issues | 11 closed issues | 60 new PRs | 22 merged PRs | 6 PRs closed without merging
📊 Daily Development Summary
In the 24 hours from February 19 to 20, 2026, the vLLM project showed very high development activity: 22 PRs were merged and 22 issue events were processed (11 opened and 11 closed), while 60 new PRs were created. Development focused on compatibility and performance fixes for large-scale model deployment (e.g., Qwen3.5-397B), quantization accuracy problems (especially FP8 combined with MoE), and improved cross-platform support (notably the AMD ROCm ecosystem). Community discussion was lively, with several issues touching core functionality (embedding generation, NVLink support) generating deep technical exchanges.
🎯 AMD/ROCm Ecosystem Activity
This cycle included clearly AMD/ROCm-related development activity, centered on a PR fixing ROCm platform compatibility:
- PR #34931 ([AMD][CI] Support Triton attention with ExampleConnector):
  - Submitter: rjrock (no typical "-amd" suffix, but the title and labels clearly target AMD and ROCm).
  - Analysis: the PR resolves a compatibility problem on ROCm between ExampleConnector (used for KV transfer) and the Triton attention backend. The root cause is that ROCm defaults to Triton attention while CUDA defaults to Flash attention, and the two use different KV-cache layouts; the existing ExampleConnector implementation only supported the Flash attention layout, so it failed on ROCm.
  - Impact: this fix is essential for AMD GPU users of KV transfer (prefill/decode disaggregation). It directly removes a compatibility barrier caused by kernel-level differences across hardware platforms and is an important step toward feature completeness for vLLM on AMD. The PR is currently in Draft state because its check on TritonAttentionMetadata is flagged as model-specific and may need to be generalized.
Summary: this cycle's AMD-related activity focused on unblocking cross-platform KV transfer, reflecting continued progress from "runs on AMD" toward full feature parity. Although no new capabilities such as Quark quantization or MI300-specific features were involved, adapting existing core functionality (attention mechanisms, distributed communication) is foundational ecosystem work.
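To make the compatibility barrier concrete, here is a minimal illustrative sketch. The dimension orderings below are assumptions chosen for illustration, not vLLM's actual KV-cache tensor shapes: the point is only that a connector copying KV blocks verbatim works solely when both attention backends agree on the layout.

```python
# Illustrative only: these layout tuples are assumed for the sketch and are
# NOT vLLM's actual KV-cache shapes.
FLASH_KV_LAYOUT = ("num_blocks", "block_size", "num_kv_heads", "head_size")
TRITON_KV_LAYOUT = ("num_blocks", "num_kv_heads", "block_size", "head_size")

def layouts_compatible(src_layout, dst_layout):
    """A byte-for-byte KV block copy is only valid when the dimension
    ordering matches on both sides of the transfer."""
    return src_layout == dst_layout

# A connector written against the Flash layout breaks on a Triton backend:
print(layouts_compatible(FLASH_KV_LAYOUT, TRITON_KV_LAYOUT))  # False
```

This is why the PR either has to transpose blocks in flight or teach the connector about the second layout.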
💬 High-Engagement Discussions
- Issue #34910: "embeddings from vLLM do not match Hugging Face sentence-transformers"
  - Core issue: a user found that with the same model (Qwen3-Embedding-0.6B), vLLM and the Hugging Face Sentence Transformers library produce embedding vectors with sign differences, and questioned the correctness of vLLM's implementation.
  - Views and debate:
    - User (ehsankf): provided detailed comparison data showing the sign differences persist even with trust_remote_code=True.
    - Contributor (darshjme-codes): offered an "expert analysis" attributing the root cause to inconsistent preprocessing pipelines (vLLM treats embedding models as generative models, while SentenceTransformers has a dedicated flow), coined the term "pipeline drift", and suggested a dedicated embedding preprocessing module in vLLM.
    - Maintainer (haosdent): raised a more direct possibility: inherent numerical differences between attention implementations (FlashAttention/PagedAttention vs. SDPA/eager), and asked whether the differences actually cause bad downstream results.
  - Point of contention: whether this is a bug (implementation error) or expected behavior (reasonable divergence across optimized backends). The contributor's long analysis leans toward the former; the maintainer's question leans toward the latter.
  - Status: still open, awaiting the user's response to haosdent's question to decide between a deeper investigation and a documentation clarification.
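For readers reproducing this comparison, a minimal sketch of the kind of check involved, assuming two embedding vectors have already been obtained from the respective stacks (the sample vectors here are made up; in practice they would have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def sign_mismatch_rate(a, b):
    # Fraction of dimensions whose sign differs between the two embeddings.
    return sum((x > 0) != (y > 0) for x, y in zip(a, b)) / len(a)

# Hypothetical embeddings for the same input text from the two stacks.
vllm_emb = [0.12, -0.03, 0.40, -0.25]
hf_emb   = [0.11,  0.02, 0.41, -0.26]

print(round(cosine_similarity(vllm_emb, hf_emb), 3))  # 0.994
print(sign_mismatch_rate(vllm_emb, hf_emb))           # 0.25
```

A high cosine similarity with sign flips concentrated in near-zero dimensions would support the "backend numerical noise" reading; low similarity would support the "preprocessing bug" reading.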
- Issue #34891: "RuntimeError: [SymmDeviceMemory] Device does not support multicasting."
  - Core issue: when loading large models across multiple H200/H100 GPUs connected with NVLink, users hit a "symmetric device memory" error about unsupported multicasting, preventing server startup.
  - Views and debate:
    - Multiple users (vitush93, robogast, RocketRider): confirmed the same problem, affecting models including Qwen3.5 and Llama 3.3 70B, and noted that v0.15.1 worked fine.
    - User (vitush93): via AI-assisted analysis, traced the root cause to two commits: one enabled the fuse_allreduce_rms optimization by default, and the other switched to a FlashInfer API whose implementation uses SymmDeviceMemory, which requires NVLink/multicast support; the optimization path never checks for that hardware capability.
    - Maintainer (haosdent): offered a temporary workaround: set the environment variable VLLM_ALLREDUCE_USE_SYMM_MEM=0 to disable the optimization path.
  - Point of contention: none of substance; mostly collaborative debugging and sharing of solutions.
  - Status: open. The maintainer's workaround is effective, and the root cause is understood (a new optimization enabled by default without hardware-capability detection); a follow-up PR adding a conditional check or a graceful fallback is expected.
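In practice the workaround looks like the following. The VLLM_ALLREDUCE_USE_SYMM_MEM variable comes from the maintainer's comment in the issue; the commented serve command is illustrative only (model name and flags depend on your deployment).

```shell
# Temporary workaround from the maintainer: disable the symmetric-memory
# all-reduce path before starting the server.
export VLLM_ALLREDUCE_USE_SYMM_MEM=0

# Illustrative launch (model and flags are placeholders, not from the issue):
# vllm serve <model> --tensor-parallel-size 4
```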
- Issue #34893: "Qwen3.5-397B-A17B-FP8 fails with TP=4"
  - Core issue: loading the Qwen3.5 FP8 model with 4-way tensor parallelism hits a dimension-mismatch runtime error while sharding a fused linear layer.
  - Views and debate:
    - User (UmutAlihan): provided complete reproduction steps and the error stack, locating the failure in a specific weight-loading function.
    - Maintainer (Isotr0py): initially suspected the user's nightly build predated the FP8 support fix for this model and suggested upgrading past a specific commit.
    - User (UmutAlihan), following up: found the problem persisted even with a prebuilt Docker image that should contain the fix, and noted that although the image was built from the correct branch, it might not actually include the critical commit.
    - Maintainer (haosdent): confirmed the suspicion after inspecting the image's files: the relevant code had indeed not been updated, verifying the prebuilt-image problem.
  - Point of contention: none. The discussion shifted from debugging a code bug to a CI/CD and release-process problem: prebuilt Docker images lagging behind the latest code fixes.
  - Status: open. The root cause is pinned to a build/release synchronization problem, not the model code itself.
🔥 Hot Topics and Trends
- Edge-case failures in large-scale deployment: led by Qwen3.5-397B-FP8, several issues (#34893, #34891, #34892) exposed fragility in weight sharding, memory optimization, and multi-GPU communication under extreme configurations (high TP, FP8 quantization, MoE). This reflects the ongoing challenge of chasing peak performance while supporting the newest frontier models.
- The complexity of quantization and precision: the spread of low-precision formats like FP8 adds a new failure dimension. Issue #34892 and its fix, PR #34914, revealed a subtle NaN bug arising from the interaction of extreme (near-zero) quantization scale factors with a specific compute kernel (CUTLASS), demanding much finer control over numerical stability.
- Performance optimization and kernel selection: several PRs (#34900, #34924, #34880) focus on replacing default operations (such as torch.cat), enabling new kernel paths (DeepGEMM swapAB), or extending optimizations to new areas (CUDA Graphs for speculative decoding), showing continued community investment in squeezing out baseline performance.
- User experience and robustness: Issue #34932 (Hermes tool-parser concurrency), PR #34930 (detecting missing quantized weight shards), and PR #34862 (guarding the Voxtral realtime API against empty input) show that production-grade stability, observability, and fault tolerance have become major development directions.
- Documentation and collaboration: a dedicated issue (#34901) requests better documentation for new features such as speculative decoding, while PR #34934 performs a large-scale typo sweep, reflecting attention to knowledge management and collaboration efficiency amid rapid growth.
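The torch.cat-replacement idea in these performance PRs can be illustrated in a dependency-free sketch (pure Python stand-ins, not the actual CUDA kernel): instead of allocating a fresh output for every concatenation, write both inputs into a preallocated buffer.

```python
def concat_alloc(a, b):
    # Allocates a new container on every call, analogous to torch.cat.
    return a + b

def concat_into(out, a, b):
    # Writes both inputs into a preallocated buffer in place, avoiding
    # the per-call allocation (the idea behind the vectorized-kernel PRs).
    out[: len(a)] = a
    out[len(a) : len(a) + len(b)] = b
    return out

out = [0] * 6
print(concat_into(out, [1, 2], [3, 4, 5, 6]))  # [1, 2, 3, 4, 5, 6]
```

On GPU the win is larger still, since a fused kernel also avoids an extra launch and a temporary tensor.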
🛠️ Key Technical Changes
- PR #34914 (Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales): merged. Fixes NaN output from Qwen3.5 FP8 models using the FlashInfer CUTLASS MoE kernel on the Hopper architecture (e.g., H200), caused by some experts having extremely small weight scale factors (~1e-23). Clamping the scale factors restores numerical stability. Impact: fixes a critical mainline defect that produced garbled model output; essential for FP8 MoE usability.
- PR #34880 (Enables Eagle drafter support for FULL CUDA Graph mode): in progress. Enables full CUDA Graph mode for Eagle speculative-decoding draft models, greatly reducing kernel-launch overhead. Impact: should significantly improve throughput and latency when using Eagle-family draft models for speculative decoding.
- PR #34818 (Fix Basic Models Test): merged. Fixes a series of model-initialization test failures caused by the CUDA-initialization change introduced in PR #33600, by moving the platform-specific (e.g., MLA) backend block-size update logic into the platform module so that CUDA is not initialized at test import time. Impact: restores CI stability; an important robustness fix in the core framework.
- Issue #34910 (embeddings from vllm does not match hugging face): open discussion. Unresolved, but the expert analysis within it points to a potential systemic design gap: vLLM may lack a dedicated preprocessing pipeline for embedding models. Impact: the discussion may push vLLM toward more accurate, dedicated handling of embedding tasks rather than treating them as a variant of text generation.
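The clamping idea behind PR #34914 can be sketched in a few lines. The floor value below is an assumption for illustration (the smallest normal float16 magnitude, 2**-14); the actual threshold chosen in the PR may differ.

```python
FP16_MAX = 65504.0        # largest finite float16 value
SCALE_FLOOR = 2.0 ** -14  # assumed floor (smallest normal fp16); illustrative

# Hypothetical per-expert FP8 block scales; one expert's scale is near zero,
# like the ~1e-23 values reported in issue #34892.
scales = [3.2e-2, 1.5e-2, 1e-23, 2.7e-2]
clamped = [max(s, SCALE_FLOOR) for s in scales]

# Before clamping, 1 / 1e-23 is far outside the fp16 range, so any kernel
# step that works with the reciprocal scale overflows to inf and propagates
# NaN downstream; after clamping every reciprocal stays representable.
print(all(1.0 / s <= FP16_MAX for s in clamped))  # True
```

The trade-off is a tiny quantization error on experts whose scales were degenerate anyway, in exchange for finite outputs.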
📈 Development Activity Observations
- Contributor activity: 60 new PRs in 24 hours reflects a very high rate of community contribution and iteration. Core maintainers (haosdent, ywang96, mgoin) were active in reviewing, fixing, and discussing many PRs and issues.
- Review and merge efficiency: 22 PRs were merged, indicating an efficient review and integration pipeline. Several PRs fixing critical problems (e.g., #34914, #34876) were merged within a day.
- AMD ecosystem participation: the cycle's one feature-level AMD/ROCm fix was PR #34931, submitted by a contributor without the typical "-amd" suffix; the other rocm-labeled PRs in the appendix (e.g., #34918, #34923, #34922, #34878, #34879) are CI and infrastructure work. This suggests AMD-side activity in this window centered on test infrastructure and problem reproduction rather than new features.
💡 Issues Worth Watching
- Issue #34893 (Qwen3.5-397B-A17B-FP8 fails with TP=4): exposes the risk that prebuilt artifacts (such as Docker images) can lag behind or diverge from the code repository. For production users who depend on these artifacts this is a real pain point; watch for its resolution and any follow-up release-process improvements.
- Issue #34935 (TypeError: '>' not supported between instances of 'str' and 'int'): nominally a usage question, but it reflects the ripple effects of default-behavior changes in upstream libraries (such as transformers) on vLLM compatibility. The community should keep tracking such dependency changes and consider clearer guidance in the API or documentation.
- PR #34931 ([AMD][CI] Support Triton attention with ExampleConnector): as the currently active AMD-related fix, its progress and eventual merge deserve close attention from AMD platform users, since it directly gates the availability of advanced features such as KV transfer on ROCm.
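The str-vs-int failure mode from issue #34935 is easy to reproduce and to guard against. The function and parameter names below are hypothetical, chosen only to illustrate the pattern of coercing string-typed config values before comparison:

```python
def check_max_len(max_len):
    # Hypothetical guard: an upstream config change can deliver a value as a
    # string (e.g. "8192"); comparing it to an int raises
    # TypeError: '>' not supported between instances of 'str' and 'int'.
    if isinstance(max_len, str):
        max_len = int(max_len)
    return max_len > 4096

# The failing pattern from the issue:
try:
    "8192" > 4096
except TypeError as e:
    print(type(e).__name__)  # TypeError

print(check_max_len("8192"))  # True
```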
📋 Appendix: Detailed Data
New Issues
- #34893 [Bug]: Qwen3.5-397B-A17B-FP8 fails with TP=4 - fused linear layer sharding incompatibility — no labels — by UmutAlihan (created: 2026-02-19 20:56 (UTC+8))
- #34935 [Usage]: TypeError: '>' not supported between instances of 'str' and 'int' — usage — by lottopotato (created: 2026-02-20 10:25 (UTC+8))
- #34910 [Bug]: the embeddings from vllm does not match hugging face sentence-transformer — bug — by ehsankf (created: 2026-02-20 03:27 (UTC+8))
- #34932 RuntimeError: Already borrowed in Hermes tool parser under concurrent load — no labels — by eligotts (created: 2026-02-20 09:48 (UTC+8))
- #34891 [Bug]: RuntimeError: [SymmDeviceMemory] Device does not support multicasting. — bug — by RocketRider (created: 2026-02-19 20:38 (UTC+8))
- #34929 [Feature]: Get latency and tpot per request in vLLM benchmark — feature request — by snova-rodrigom (created: 2026-02-20 08:32 (UTC+8))
- #34892 [Bug]: Qwen3.5 FP8 accuracy degradation with FlashInfer CUTLASS MoE backend — bug — by Isotr0py (created: 2026-02-19 20:44 (UTC+8))
- #34886 [Bug]: #32618 performance issue — bug — by gengchaogit (created: 2026-02-19 18:38 (UTC+8))
- #34901 [Doc]: Speculators Docs Follow-ups — documentation — by kylesayrs (created: 2026-02-20 01:16 (UTC+8))
- #34869 [Bug]: Deepseek V3.1 NVFP4 Weight Loading Fails — bug — by wzhao18 (created: 2026-02-19 12:21 (UTC+8))
- #34877 [Bug]: OOM during mxfp4 weight swizzle when --enable-sleep-mode is enabled on single GPU — no labels — by mmahmed (created: 2026-02-19 14:33 (UTC+8))
Closed Issues
- #34403 [Bug]: error when use vllm on deepseek-vl2 — bug — by Pissohappy (closed: 2026-02-20 10:33 (UTC+8))
- #34892 [Bug]: Qwen3.5 FP8 accuracy degradation with FlashInfer CUTLASS MoE backend — bug — by Isotr0py (closed: 2026-02-20 08:15 (UTC+8))
- #34622 [CI Failure]: B200 Kernels — ci-failure — by NickLucche (closed: 2026-02-20 07:14 (UTC+8))
- #34806 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[EagleMiniCPMForCausalLM] — ci-failure — by ilmarkov (closed: 2026-02-20 06:49 (UTC+8))
- #34800 [CI Failure]: entrypoints/weight_transfer/test_weight_transfer_llm.py — ci-failure — by ilmarkov (closed: 2026-02-19 23:01 (UTC+8))
- #15483 [Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made — bug,stale — by erdaltoprak (closed: 2026-02-19 22:45 (UTC+8))
- #34345 [Bug]: Since version >=0.14.0, the key-value cache of heterogeneous graphics cards cannot be allocated correctly. — bug — by gengchaogit (closed: 2026-02-19 17:17 (UTC+8))
- #34532 [Bug]: Realtime API crashes when client terminates connection "incorrectly" — bug — by nullquery (closed: 2026-02-19 15:21 (UTC+8))
- #34869 [Bug]: Deepseek V3.1 NVFP4 Weight Loading Fails — bug — by wzhao18 (closed: 2026-02-19 15:20 (UTC+8))
- #33512 Responses API reasoning_tokens always zero for text-based reasoning parsers — no labels — by anencore94 (closed: 2026-02-19 15:16 (UTC+8))
- #25448 [New Model]: Issac 0.1 — new-model — by ywang96 (closed: 2026-02-19 12:57 (UTC+8))
New PRs
- #34934 [Bugfix][CI] fix typos — bug,documentation,performance,rocm,structured-output,frontend,ci/build,v1,multi-modality,qwen — by 1195343015 (created: 2026-02-20 10:17 (UTC+8))
- #34871 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches — bug,v1 — by haosdent (created: 2026-02-19 12:22 (UTC+8))
- #34916 [Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend — ready — by frankwang28 (created: 2026-02-20 04:07 (UTC+8))
- #34937 [Bugfix]: Qwen3 reasoning parser now handles in prompt prefix — bug,qwen — by babyplutokurt (created: 2026-02-20 10:59 (UTC+8))
- #34936 [UX] Add --performance-mode {balanced,latency,throughput} — no labels — by mgoin (created: 2026-02-20 10:58 (UTC+8))
- #34933 [FIX] fused moe with lora shared expert dual stream (1.07x otps) — no labels — by jhaotingc (created: 2026-02-20 09:53 (UTC+8))
- #34930 [Bugfix] Detect missing shard files in quantized checkpoints — bug,v1 — by thillai-c (created: 2026-02-20 09:43 (UTC+8))
- #34931 [AMD][CI] Support Triton attention with ExampleConnector — rocm,v1,kv-connector — by rjrock (created: 2026-02-20 09:46 (UTC+8))
- #34874 [WIP][Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) — bug,v1 — by haosdent (created: 2026-02-19 12:31 (UTC+8))
- #34898 [Bug] Refactor max_num_batched_tokens to account for drafting — bug,speculative-decoding,ready,v1 — by benchislett (created: 2026-02-20 01:04 (UTC+8))
- #34900 [Model Bash][DSR1] Add selective dynamic shape marking for CustomOp — performance,deepseek,nvidia — by vadiklyutiy (created: 2026-02-20 01:07 (UTC+8))
- #34899 Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion — ready,ci/build,deepseek,nvidia,ready-run-all-tests — by wzhao18 (created: 2026-02-20 01:07 (UTC+8))
- #34883 [Core] Add All-to-All communication backend for DCP — v1,nvidia — by sungsooha (created: 2026-02-19 16:18 (UTC+8))
- #34875 [Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 — bug,v1 — by haosdent (created: 2026-02-19 12:40 (UTC+8))
- #34909 [Refactor] Extract Harmony streaming SSE event builders into streaming_events.py — frontend,ready,gpt-oss — by sfeng33 (created: 2026-02-20 03:20 (UTC+8))
- #34870 [perf] Avoid dtype promotion sync in mamba_get_block_table_tensor — ready,v1 — by hl475 (created: 2026-02-19 12:21 (UTC+8))
- #34923 [ROCm][CI] Added MI325 mirrors — rocm,ci/build — by AndreasKaratzas (created: 2026-02-20 07:18 (UTC+8))
- #34902 [Spec Decode] Track per request level spec decode acceptance metric — v1,meta-exported,fb-exported — by zixi-qi (created: 2026-02-20 01:18 (UTC+8))
- #34928 [Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names — ready — by gmagogsfm (created: 2026-02-20 07:59 (UTC+8))
- #34872 [WIP][Bugfix] Fix GPTQ quantized Qwen3-Next MoE expert weight loading — bug,qwen — by haosdent (created: 2026-02-19 12:29 (UTC+8))
- #34868 [WIP][Bugfix] Fix torch.compile startup time logging inaccuracy — bug — by haosdent (created: 2026-02-19 12:17 (UTC+8))
- #34925 [Bug] Fix pooling unit test run_mteb_rerank — bug,ready — by yewentao256 (created: 2026-02-20 07:38 (UTC+8))
- #34927 [Bugfix][Kernel] Fix activation_kernels.cu build failure on Blackwell… — bug — by FloatingVertex (created: 2026-02-20 07:53 (UTC+8))
- #34921 [Models] LFM2: Support LoRA — no labels — by tianshu-Michael-yu (created: 2026-02-20 05:00 (UTC+8))
- #34926 [Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle — v1,kv-connector — by ZhanqiuHu (created: 2026-02-20 07:40 (UTC+8))
- #34895 [Draft] Gate cuBLAS kernel fallback — needs-rebase,ci/build,deepseek — by roikoren755 (created: 2026-02-19 22:50 (UTC+8))
- #34917 [Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat — v1,nvidia — by LopezCastroRoberto (created: 2026-02-20 04:18 (UTC+8))
- #34924 [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default — performance,ready,ready-run-all-tests — by mgoin (created: 2026-02-20 07:35 (UTC+8))
- #34908 [UX] More descriptive reasons in is_supported_config for MoE — ready,nvidia — by mgoin (created: 2026-02-20 03:14 (UTC+8))
- #34907 [ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector so moriio_connector to restore P/D functionality — bug,documentation,rocm,kv-connector — by rasmith (created: 2026-02-20 03:12 (UTC+8))
- #34906 [Quantization] Support FP8 MoE bias for models like GPT-OSS — gpt-oss — by jasperjiaguo (created: 2026-02-20 02:57 (UTC+8))
- #34922 [ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout — rocm — by micah-wil (created: 2026-02-20 06:03 (UTC+8))
- #34914 [Bugfix] Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales — bug,ready,qwen,nvidia — by ywang96 (created: 2026-02-20 03:52 (UTC+8))
- #34918 Change targets for AMD build in the "CI" pipeline — rocm,ready,ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-20 04:18 (UTC+8))
- #34912 [compile] Fix torch.compile time discrepancy in logging. — ready — by zhxchen17 (created: 2026-02-20 03:51 (UTC+8))
- #34920 Enable Tunable Control for naive_block_assignment — no labels — by RunkaiTao (created: 2026-02-20 04:52 (UTC+8))
- #34913 [CI] Fix custome offline ci issue, V1 Others Test bug — bug,ready,v1 — by yewentao256 (created: 2026-02-20 03:51 (UTC+8))
- #34919 [Bugfix][Model] Qwen3-Coder and Qwen3-Coder-Next: Fix qwen3_coder tool parser to handle </parameter> in parameter content — bug,qwen — by yaro-tal (created: 2026-02-20 04:41 (UTC+8))
- #34887 [BugFix] anthropic/serving_messages: fix tool call arguments streaming — bug,frontend,ready — by dtrifiro (created: 2026-02-19 19:53 (UTC+8))
- #34915 [ci] Use the right tag for CPU arm64 image — ci/build — by khluu (created: 2026-02-20 04:01 (UTC+8))
- #34911 [Bugfix] Fix Qwen3.5 FP8 on Hopper — bug,frontend,qwen,nvidia — by ywang96 (created: 2026-02-20 03:47 (UTC+8))
- #34905 Fix GLM4 parser tests — no labels — by RNabel (created: 2026-02-20 02:34 (UTC+8))
- #34904 Support prompt_embeds for pooling requests in output processor — v1 — by laviier (created: 2026-02-20 01:50 (UTC+8))
- #34903 [Bug] Fix illegal memory access issue for model runner v2 — bug,ready,v1,nvidia — by yewentao256 (created: 2026-02-20 01:33 (UTC+8))
- #34894 [DOC] Add INT8 W4A8 docs and Arm's supported quantization schemes — documentation — by fadara01 (created: 2026-02-19 21:46 (UTC+8))
- #34897 WIP: Add Kuberenetes DRA examples — documentation — by harche (created: 2026-02-19 23:44 (UTC+8))
- #34880 [Spec Decode][CUDA Graphs] Enables Eagle drafter support for FULL CUDA Graph mode — speculative-decoding,v1,nvidia — by yiz-liu (created: 2026-02-19 15:39 (UTC+8))
- #34890 [CPU][Feat] Enable KleidiAI grouprise INT8_W4A8 with BF16 inputs — no labels — by fadara01 (created: 2026-02-19 20:29 (UTC+8))
- #34867 [Build] Add cuda 12.8 release wheel — documentation,ready,ci/build,nvidia — by LucasWilkinson (created: 2026-02-19 11:55 (UTC+8))
- #34896 [PD] Change kv_load_failure_policy Default from "recompute" to "fail" — documentation,ready,kv-connector — by NickLucche (created: 2026-02-19 23:08 (UTC+8))
- #34881 [Bugfix] fix qwen 3.5 dp+ep assertion error — bug,qwen — by ZJY0516 (created: 2026-02-19 15:46 (UTC+8))
- #34889 [Misc] Minor tests/v1/logits_processors/test_custom_offline.py changes — v1 — by NickLucche (created: 2026-02-19 20:09 (UTC+8))
- #34888 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 — no labels — by SteadfastAsArt (created: 2026-02-19 19:55 (UTC+8))
- #34882 docs: update CPU Docker images to reference Docker Hub instead of AWS ECR — documentation,cpu — by cluster2600 (created: 2026-02-19 15:55 (UTC+8))
- #34885 [CI/Build] Try to make beam search test less flaky — ready — by DarkLight1337 (created: 2026-02-19 17:46 (UTC+8))
- #34884 [Bugfix] Fix edge case in UUID data parsing — bug,ready — by DarkLight1337 (created: 2026-02-19 16:20 (UTC+8))
- #34878 [ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-19 14:43 (UTC+8))
- #34879 [ROCm][CI] Removing all blocking labels from MI355 until stable infra — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-19 15:26 (UTC+8))
- #34876 [Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix — bug,ready,deepseek — by wzhao18 (created: 2026-02-19 12:50 (UTC+8))
- #34873 [WIP][Feature] Support GGUF models with qwen3vl architecture — qwen — by haosdent (created: 2026-02-19 12:30 (UTC+8))
Merged PRs
- #34840 [Core] Fix state names in pause_scheduler() — v1 — by markmc (merged: 2026-02-20 09:21 (UTC+8))
- #34359 [CI] Add GPT-OSS Eval job for H100 — ready,ci/build,gpt-oss — by mgoin (merged: 2026-02-20 09:14 (UTC+8))
- #34856 [Model Runner V2] Minor CPU optimizations — v1 — by njhill (merged: 2026-02-20 08:05 (UTC+8))
- #34665 [Bugfix] Fix benchmark_fused_collective crash on CustomOp init — bug,performance,ready — by mayank-ketkar-sf (merged: 2026-02-20 08:01 (UTC+8))
- #34908 [UX] More descriptive reasons in is_supported_config for MoE — ready,nvidia — by mgoin (merged: 2026-02-20 07:20 (UTC+8))
- #34818 [Bugfix] Fix Basic Models Test — bug,speculative-decoding,ready,v1,multi-modality,ci-failure,nvidia — by MatthewBonanni (merged: 2026-02-20 06:49 (UTC+8))
- #34914 [Bugfix] Fix Qwen3.5 Cutlass fp8 kernel on hopper - clamp block scales — bug,ready,qwen,nvidia — by ywang96 (merged: 2026-02-20 05:34 (UTC+8))
- #34918 Change targets for AMD build in the "CI" pipeline — rocm,ready,ci/build — by Alexei-V-Ivanov-AMD (merged: 2026-02-20 05:26 (UTC+8))
- #34263 [Refactor] Deprecate head_first for chunk_gated_delta_rule — ready,qwen — by yewentao256 (merged: 2026-02-20 02:23 (UTC+8))
- #34808 Revert "[NemotronH] Do not force router to run in fp32 (#34582)" — bug,ready,nvidia — by roikoren755 (merged: 2026-02-20 01:47 (UTC+8))
- #34344 [Model Bash][DeepSeekR1] Remove Shared Expert Clone — performance,ready,deepseek — by robertgshaw2-redhat (merged: 2026-02-19 23:56 (UTC+8))
- #34471 [Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers — ready,llama — by eldarkurtic (merged: 2026-02-19 23:55 (UTC+8))
- #34719 [Bugfix] Qwen3.5 kv-scale weight remapping — bug,ready,qwen — by Linda-Stadter (merged: 2026-02-19 20:13 (UTC+8))
- #34885 [CI/Build] Try to make beam search test less flaky — ready — by DarkLight1337 (merged: 2026-02-19 19:16 (UTC+8))
- #34884 [Bugfix] Fix edge case in UUID data parsing — bug,ready — by DarkLight1337 (merged: 2026-02-19 18:24 (UTC+8))
- #34878 [ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-19 16:25 (UTC+8))
- #34879 [ROCm][CI] Removing all blocking labels from MI355 until stable infra — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-19 15:53 (UTC+8))
- #34862 [Voxtral Realtime] Fix engine crash on empty multimodal embeddings — ready — by talnirnx (merged: 2026-02-19 15:21 (UTC+8))
- #34876 [Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix — bug,ready,deepseek — by wzhao18 (merged: 2026-02-19 15:20 (UTC+8))
- #34847 [Bugfix] Add Quant Config to Llava Next Projector — bug — by alex-jw-brooks (merged: 2026-02-19 15:18 (UTC+8))
- #34836 fix(docs): fix typos in comments and docstrings — gpt-oss — by machov (merged: 2026-02-19 15:17 (UTC+8))
- #33513 [Frontend] Fix reasoning_tokens for text-based parsers in Responses API — frontend,ready — by anencore94 (merged: 2026-02-19 15:16 (UTC+8))
Closed (Unmerged) PRs
- #33382 [BugFix] Fix DeepGEMM Warmup Logic — bug — by robertgshaw2-redhat (closed: 2026-02-20 10:21 (UTC+8))
- #34550 [CI] [Kernel] Rename Helion config key to match actual H100 SXM5 device name used in CI — no labels — by gmagogsfm (closed: 2026-02-20 08:55 (UTC+8))
- #34868 [WIP][Bugfix] Fix torch.compile startup time logging inaccuracy — bug — by haosdent (closed: 2026-02-20 08:08 (UTC+8))
- #34913 [CI] Fix custome offline ci issue, V1 Others Test bug — bug,ready,v1 — by yewentao256 (closed: 2026-02-20 04:47 (UTC+8))
- #34911 [Bugfix] Fix Qwen3.5 FP8 on Hopper — bug,frontend,qwen,nvidia — by ywang96 (closed: 2026-02-20 03:50 (UTC+8))
- #34159 [Docs] MTP Docs — documentation,needs-rebase,v1 — by kylesayrs (closed: 2026-02-20 01:11 (UTC+8))