vLLM Development Activity Report - 2026-02-12

Time window: 2026-02-12 11:36 (UTC+8) ~ 2026-02-13 11:36 (UTC+8)
Statistics: New issues 24 | Closed issues 7 | New PRs 63 | Merged PRs 38 | PRs closed without merging 10

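📊 Daily Development Summary

Over the past 24 hours, vLLM stayed highly active: 63 PRs were opened and 38 were merged. Work centered on evolving and optimizing the parallelism architecture (for example, decoupled attention/FFN parallelism), an RFC discussion on restructuring the serving architecture (decoupling the frontend from the engine), and support and fixes for multimodal and quantized models. Notably, several CI tests failed in distributed (DP/TP/EP) and heterogeneous-hardware (ARM CPU, Ascend, Intel HPU) environments, which suggests that rapid iteration keeps putting pressure on infrastructure compatibility.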

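🎯 AMD/ROCm Ecosystem Updates

AMD-related contributions were very active this period, concentrated on performance optimization, build fixes, and quantization support.

New PRs:

PR #34481 ([Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation): submitted by c0de128. Targets the poor performance of running MXFP4/MXFP6/NVFP4 models in emulation on AMD hardware without native MX-format support, such as CDNA3. It introduces the environment variable VLLM_EMULATION_DEQUANT_WEIGHTS_AOT=1, which dequantizes weights ahead of time at model load, trading memory for speed and considerably speeding up inference under emulated quantization.
PR #34469 ([Bugfix][Hardware][AMD] Fix string literal comparison in DISPATCH_BY_KV_CACHE_DTYPE macro): submitted by c0de128. Fixes a compile error with Clang 22+ (ROCm 6.3+) caused by comparing the addresses of string literals inside the macro (deprecated in C++20). Wrapping the operands in std::string() forces a value comparison, so the code builds on newer ROCm toolchains.
PR #34410 (small adjustment to wvSplitKrc): submitted by AMD employee amd-hhashemi and merged. A small AITER kernel tweak for GPT-OSS models that resolves single-prompt request failures in some vLLM Docker environments.
PR #34447 ([ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility): submitted by AndreasKaratzas and merged. Pins the TorchCodec dependency to v0.10.0 for compatibility with ROCm PyTorch builds (which lack the stable ABI headers), fixing source-build failures.
PR #34431 ([ROCm][quantization] improve OCP weight quant parser robust): submitted by AMD employee xuebwang-amd and merged. Makes the Quark OCP weight-quantization parser more robust, fixing model load errors caused by failing is_mxfp4_quant calls under certain conditions.
PR #34456 ([CI/Build] Add .deps to .dockerignore ...): submitted by tlrmchlsmth. Adds the .deps directory to .dockerignore so that CUDA dependency state from the host is not pulled into ROCm image builds, keeping the build environment clean.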
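
As a rough illustration of the trade-off behind VLLM_EMULATION_DEQUANT_WEIGHTS_AOT (PR #34481), the toy Python sketch below contrasts dequantizing block-scaled weights on every forward call with dequantizing them once at load time. It is not vLLM code: EmulatedLinear, dequantize, aot_enabled, the block layout, and the way the environment variable is parsed are all assumptions made for illustration.

```python
import os

BLOCK = 4  # hypothetical block size: one scale per 4 integer codes


def dequantize(codes, scales):
    """Expand integer codes into floats using one scale per BLOCK-sized group."""
    return [c * scales[i // BLOCK] for i, c in enumerate(codes)]


class EmulatedLinear:
    """Toy layer (not vLLM code) that emulates a quantized dot product."""

    def __init__(self, codes, scales, dequant_aot):
        self.codes, self.scales = codes, scales
        self.dequant_calls = 0
        # AOT mode: pay the dequantization cost once at load time and keep
        # the expanded copy in memory.
        self.dense = self._dequant() if dequant_aot else None

    def _dequant(self):
        self.dequant_calls += 1
        return dequantize(self.codes, self.scales)

    def forward(self, x):
        # Lazy mode dequantizes again on every call.
        weights = self.dense if self.dense is not None else self._dequant()
        return sum(w * xi for w, xi in zip(weights, x))


def aot_enabled(env=None):
    """Assumed parsing: the flag is on only when the variable equals '1'."""
    env = os.environ if env is None else env
    return env.get("VLLM_EMULATION_DEQUANT_WEIGHTS_AOT", "0") == "1"


codes = [1, -2, 3, 0, 4, 4, -1, 2]
scales = [0.5, 0.25]
x = [1.0] * 8

lazy = EmulatedLinear(codes, scales, dequant_aot=aot_enabled({}))
eager = EmulatedLinear(
    codes, scales,
    dequant_aot=aot_enabled({"VLLM_EMULATION_DEQUANT_WEIGHTS_AOT": "1"}),
)
for _ in range(3):
    lazy_out = lazy.forward(x)
    eager_out = eager.forward(x)

print(lazy.dequant_calls, eager.dequant_calls)
```

In this toy model the lazy layer dequantizes on each of the three calls while the ahead-of-time layer dequantizes once, at the cost of holding the expanded weights in memory.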

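Closed issues:

Issue #34162 ([CI Failure]: ROCm GPT-OSS Eval): the ROCm GPT-OSS evaluation test failed; fixed by PR #34371 and closed.
Issue #34365 ([CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU)): a validation failure caused by the deprecated environment variable VLLM_TEST_GROUP_NAME; fixed by PR #34350 and closed.

Summary: contributions to the AMD stack stood out this period, spanning kernel performance, quantization support, build-system fixes, and CI stability, and reflect sustained investment in the vLLM experience on AMD hardware.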

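💬 High-Traffic Discussions

Issue #34444 ([RFC]: Decoupled Attention/FFN Parallelism):
- Core topic: proposes extending decode context parallelism (DCP) with decoupled attention/FFN parallel degrees and an A2A communication backend, to address KV cache replication for GQA/MLA models under large TP.
- Views and points of contention:
  - Maintainer LucasWilkinson suggested restructuring the RFC to focus on the new -tpa parameter itself and its impact on parallel_state.py, moving the DCP/Helix details into a use-case section, and moving the Q-projection replication discussion back to the related issue.
  - Author sungsooha responded promptly, clarified design questions such as whether TPA should be an explicit parameter or derived internally and how TPA, DCP, and KVP relate, and restructured the RFC as suggested.
- Current status: the RFC has been restructured as advised; discussion now focuses on infrastructure design and remains at the level of high-level architecture.
Issue #34401 ([CI Failure]: Distributed Tests (8 GPUs)(H100)):
- Core topic: with TP=2, DP=4, and EP enabled, the CI test fails because duplicate GPUs are detected and because of a FlashInfer allreduce fusion problem.
- Views and debugging:
  - User ZJY0516 suspected a CUDA_DEVICE_INDEX setting problem, but noted that a similar configuration runs locally.
  - haosdent pointed out that the root cause is that FlashInfer allreduce fusion does not pass the TP process group, and proposed a code patch.
  - ProExpertProg attempted a fix that did not fully resolve the failure, and linked the new Issue #34458 (AR+rms broken for TP=2 DP=2).
- Current status: the root cause is tentatively identified (allreduce fusion mismatched with the process group), but the fix exposed a deeper problem that is still under investigation.
PR #34419 (pass raw request to io_process_plugin):
- Core topic: after the offline interface and the online API were unified on parse_data, IO processor plugins can no longer access extra parameters from the raw request (such as truncate_prompt_tokens).
- Views and discussion:
  - Author staugust proposed passing the full request object to meet plugin needs.
  - christian-pinto recalled the original design intent (a unified interface) and agreed that passing the full request could be restored, provided a default implementation keeps backward compatibility.
  - DarkLight1337 and noooop both agreed and mentioned related PRs and future alignment plans.
- Current status: consensus leans toward giving plugins more context; the community is looking for a solution that meets this need while keeping the API simple and backward compatible.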
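
The KV cache replication that motivates the decoupled attention/FFN parallelism RFC (Issue #34444) comes down to simple arithmetic. The sketch below assumes plain tensor parallelism shards attention by KV head and replicates heads once the TP size exceeds the head count; kv_heads_per_rank is a hypothetical helper, and treating MLA's compressed cache as a single KV head is a simplification, not a statement about the RFC's design.

```python
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (KV heads stored per rank, number of ranks holding each head)."""
    if tp_size <= num_kv_heads:
        if num_kv_heads % tp_size:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        return num_kv_heads // tp_size, 1
    if tp_size % num_kv_heads:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    # More ranks than heads: every head is copied to several ranks.
    return 1, tp_size // num_kv_heads


# GQA model with 8 KV heads (assumed numbers).
gqa_tp16 = kv_heads_per_rank(num_kv_heads=8, tp_size=16)  # each head stored twice
gqa_tp8 = kv_heads_per_rank(num_kv_heads=8, tp_size=8)    # no replication

# MLA modelled as one shared KV "head": the cache is copied to every rank.
mla_tp8 = kv_heads_per_rank(num_kv_heads=1, tp_size=8)

print(gqa_tp16, gqa_tp8, mla_tp8)
```

With a separate, smaller attention TP size (8 in the GQA case here) while the FFN keeps TP=16, the attention side would hold each KV head once; how the RFC wires this up is not shown here.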

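🔥 Hot Topics and Trends

Architecture evolution and decoupling:
- Frontend decoupling: Issue #34407 proposes an RFC for a two-phase decoupling of the frontend (online/frontend) from the engine (engine/renderer), aiming at GPU-less deployment and a more flexible architecture.
- Parallelism optimization: the RFC in Issue #34444 on decoupled attention/FFN parallelism, and the AR+rms problem under complex parallel configurations exposed in Issue #34458, reflect the community's continued push for performance and resource utilization.
Multimodal model support:
- New model: PR #34426 adds support for the Ovis2.6 multimodal model.
- Fixes and bug reports: Issue #34442 reports that the Kimi K2 tool parser's 8k limit on argument length is too strict; PR #34483 fixes encoder cache underestimation for single images in GLM-4V/GLM-OCR; Issue #34403 reports a DeepSeek-VL2 model load failure.
Quantization and hardware optimization:
- AMD quantization: as noted above, several PRs improve emulated-quantization performance and parser robustness on AMD platforms.
- NVIDIA quantization fix: PR #34476 fixes the accuracy regression of the Nemotron-3-Nano NVFP4 model at TP>1, which it traces to MergedColumnParallelLinear being wrongly replaced with ColumnParallelLinear, misaligning the weight quantization scales.
- New quantization support: PR #34478 adds NVFP4 MoE support for the Step3.5-Flash model.
CI and test infrastructure challenges:
- The many CI failure issues (#34465, #34464, #34463, #34460, #34459, #34401) show that as feature complexity grows (DBO, DP, EP, multiple hardware backends), the stability and coverage of the test environment are under heavy pressure.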
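
To see why swapping MergedColumnParallelLinear for ColumnParallelLinear can go wrong only at TP>1 (PR #34476), consider a projection whose weight rows are two logical matrices stacked together. The sketch below is a simplified model using labelled rows, not the vLLM implementation; the gate/up layout, naive_shard, and merged_shard are assumptions, and the actual regression involved quantization scales rather than these toy labels.

```python
def naive_shard(rows, tp_size, rank):
    """Split the stacked tensor as if it were one plain column-parallel weight."""
    per_rank = len(rows) // tp_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


def merged_shard(rows, part_sizes, tp_size, rank):
    """Shard each logical sub-matrix separately, then concatenate the slices."""
    out, start = [], 0
    for size in part_sizes:
        per_rank = size // tp_size
        lo = start + rank * per_rank
        out.extend(rows[lo:lo + per_rank])
        start += size
    return out


# Two logical matrices (4 rows each) stacked into one weight.
rows = ["gate0", "gate1", "gate2", "gate3", "up0", "up1", "up2", "up3"]

naive_rank0 = naive_shard(rows, tp_size=2, rank=0)
merged_rank0 = merged_shard(rows, [4, 4], tp_size=2, rank=0)
merged_rank1 = merged_shard(rows, [4, 4], tp_size=2, rank=1)

print(naive_rank0)   # rank 0 receives only "gate" rows
print(merged_rank0)  # rank 0 receives a slice of both "gate" and "up"
```

At TP=1 both functions return the full tensor, which is consistent with the regression appearing only when TP>1.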

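🛠️ Key Technical Changes

PR #34485 ([Refactor] Pass full VllmConfig to Renderer): a key preparatory refactor that passes the full VllmConfig, rather than only ModelConfig, to the Renderer, paving the way for later architecture work on multimodal (MM) processing.
Issue #34444 ([RFC]: Decoupled Attention/FFN Parallelism): proposes an important evolution of the parallelism architecture. By introducing an independent attention tensor-parallel size (TPA), it aims for finer control over compute and memory overhead and, particularly for long contexts and GQA/MLA models, avoids unnecessary KV cache replication.
Issue #34407 ([RFC]: Disaggregated Frontend): proposes a far-reaching change to the serving architecture. The plan splits the online serving layer into an independent frontend (tokenization, MM inputs, and so on) and a renderer (detokenization, tool-call parsing, and so on), decoupled from the GPU inference engine. This would improve deployment flexibility and support CPU-only frontends and custom integrations.
PR #34476 ([BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1): a key fix. By reverting a commit that wrongly unified code paths, it restores the correct use of MergedColumnParallelLinear and resolves the accuracy collapse at TP>1 caused by misaligned weight quantization scales, restoring reliable distributed inference for the quantized model.
PR #34445 ([BugFix] Add block_size validation for mamba cache align mode): for Mamba models in cache align mode, a block_size larger than max_num_batched_tokens could lead to a scheduling deadlock; the PR adds startup-time parameter validation to catch this configuration early.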
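
A startup check of the kind PR #34445 describes might look like the following sketch. The function names, the cache_mode argument, and the error message are hypothetical; only the condition (block_size must not exceed max_num_batched_tokens in align mode) comes from the summary above.

```python
def validate_mamba_align_block_size(
    block_size: int, max_num_batched_tokens: int, cache_mode: str
) -> None:
    """Reject configurations that the report says can deadlock the scheduler."""
    if cache_mode == "align" and block_size > max_num_batched_tokens:
        raise ValueError(
            f"block_size ({block_size}) must not exceed "
            f"max_num_batched_tokens ({max_num_batched_tokens}) "
            "in mamba cache align mode"
        )


def is_valid(block_size: int, max_num_batched_tokens: int, cache_mode: str) -> bool:
    """Convenience wrapper that turns the check into a boolean."""
    try:
        validate_mamba_align_block_size(block_size, max_num_batched_tokens, cache_mode)
    except ValueError:
        return False
    return True


print(is_valid(512, 1024, "align"))   # block fits within the per-step token budget
print(is_valid(2048, 1024, "align"))  # rejected at startup
```

Failing fast at startup turns a hang at serving time into an immediate configuration error.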

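📈 Development Activity Observations

Contributor activity: contributions this period came from many organizations and individuals, including AMD (accounts with the -amd suffix), NVIDIA, Intel, and numerous community developers, reflecting vLLM's broad ecosystem reach.
Code review and merging: 63 PRs were opened and 38 were merged during the window; the merged set includes PRs opened before the window. Several PRs (such as #34427, #34435, and #34443) were merged within a day of being opened, which shows the core team handling bug fixes and clear-cut improvements quickly.
Depth of discussion: discussions around the architecture RFCs (such as #34444 and #34407) show careful thinking about the project's long-term technical direction and a rigorous design review process.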

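💡 Issues Worth Watching

Architecture decisions: Issue #34407 (frontend decoupling) and Issue #34444 (decoupled attention parallelism) are two major architecture RFCs. Their final design and implementation will shape vLLM's scalability, deployment modes, and performance ceiling, so they merit continued attention and participation from the community.
Stability under complex parallel configurations: Issue #34401, Issue #34458, and several CI failures show that when DP, TP, EP, DCP, and other parallel strategies are combined, especially together with advanced optimizations such as FlashInfer, subtle process-group management, memory access, and communication problems appear easily. This will be a key challenge for keeping vLLM stable in production.
Maturing multimodal and tool-calling support: Issue #34442 (tool argument length limit), Issue #34403 (DeepSeek-VL2 load failure), and the bug report on malformed GLM-5-FP8 tool calls (Issue #34449) show that support for multimodal and complex agent use cases still needs polishing. As AI applications move toward multimodal and agentic workloads, stability and feature completeness in this area become essential.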
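
The process-group mismatch behind Issue #34401 can be illustrated without GPUs. The sketch below simulates an all-reduce over 8 ranks with TP=2 and DP=4, assuming TP ranks are contiguous (an assumption about rank layout); tp_groups and all_reduce are hypothetical helpers, and the example shows only the generic effect of reducing over the wrong group, not the FlashInfer code path.

```python
def tp_groups(world_size, tp_size):
    """Partition ranks into contiguous tensor-parallel groups."""
    return [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]


def all_reduce(values, group):
    """Simulated sum all-reduce: every rank in the group gets the group total."""
    total = sum(values[r] for r in group)
    return {r: total for r in group}


world_size, tp_size = 8, 2  # TP=2, DP=4
groups = tp_groups(world_size, tp_size)
values = {r: float(r + 1) for r in range(world_size)}  # one partial sum per rank

# Intended behaviour: reduce only within each TP group.
correct = {}
for group in groups:
    correct.update(all_reduce(values, group))

# Mismatch: reducing over the whole world mixes in other DP replicas' data.
wrong = all_reduce(values, list(range(world_size)))

print(groups)
print(correct[0], wrong[0])
```

Each DP replica should see only the sum from its own TP pair; a reduction over all ranks silently produces different numbers, which is the kind of corruption that is hard to spot outside multi-dimensional parallel tests.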

📋 Appendix: Detailed Data Lists

New Issues

#34479 [Usage]: Streaming Prefill for 1M-Token Prompts in vLLM — usage — by Wesley-Jzy (created: 2026-02-13 09:48 (UTC+8))
#34484 [Bug]: kv_load_failure_recovery example fails when using sync KV load mode — bug — by sykzhong (created: 2026-02-13 11:12 (UTC+8))
#34477 [Bug]: TVM_FFI_ICHECK(args->n_group != 0) << "n_group should not be zero for DeepSeekV3 routing" — bug — by jeejeelee (created: 2026-02-13 09:18 (UTC+8))
#34407 [RFC]: Disaggregated Frontend — Separating Online Serving from Engine — RFC — by hyeongyun0916 (created: 2026-02-12 16:27 (UTC+8))
#34413 [Bug]: Qwen coder next performance after d7982da commit. — bug — by stavinsky (created: 2026-02-12 17:38 (UTC+8))
#34399 [Bug]: Nemotron 3 (all quants) take a LONG time to load — bug — by jiangwu300 (created: 2026-02-12 11:44 (UTC+8))
#34470 [Bug]: NVIDIA Jetson Thor: Value ‘sm_110a’ is not defined for option ‘gpu-name’ — bug — by Kaweees (created: 2026-02-13 06:08 (UTC+8))
#34444 [RFC]: Decoupled Attention/FFN Parallelism — RFC — by sungsooha (created: 2026-02-13 01:07 (UTC+8))
#34465 [CI Failure]: Intel HPU Test — ci-failure — by rajkiranjoshi (created: 2026-02-13 05:14 (UTC+8))
#34464 [CI Failure]: Ascend NPU Test — ci-failure — by rajkiranjoshi (created: 2026-02-13 05:01 (UTC+8))
#34463 [CI Failure]: ARM CPU Test (test_moe.py) — ci-failure,cpu — by rajkiranjoshi (created: 2026-02-13 04:52 (UTC+8))
#34460 [CI Failure]: Distributed Tests (2 GPUs)(H100) test_dbo.py — ci-failure — by rajkiranjoshi (created: 2026-02-13 04:45 (UTC+8))
#34459 [CI Failure]: Entrypoints Integration (Responses API) (test_parsable_context.py) — ci-failure — by rajkiranjoshi (created: 2026-02-13 04:35 (UTC+8))
#34458 [Bug]: AR+rms broken for TP=2 DP=2 — bug — by ProExpertProg (created: 2026-02-13 03:55 (UTC+8))
#34401 [CI Failure]: Distributed Tests (8 GPUs)(H100) — ci-failure — by wzhao18 (created: 2026-02-12 14:06 (UTC+8))
#34452 [Usage]: Issue Running Nemotron 3 Nano NVFP4 on sm12x (RTX 5090 / Blackwell Pro 6000) — usage — by shahizat (created: 2026-02-13 02:43 (UTC+8))
#34449 [Bug]: GLM-5-FP8 malformed tool calls — bug — by TALLEC-Scott (created: 2026-02-13 02:20 (UTC+8))
#34442 [Bug]: kimi k2 tool parser has a 8k buffer limit for args — bug — by bigBrain1901 (created: 2026-02-13 00:56 (UTC+8))
#34437 [Bug]: Qwen3 Next with heterogeneous GPU (FP8 overflow?) — bug — by Nepherpitou (created: 2026-02-12 22:29 (UTC+8))
#34423 [Feature]: GPT-OSS: Unify harmony workflows with other model type workflows — feature request — by jonoillar (created: 2026-02-12 18:37 (UTC+8))
#34417 [Feature]: logprobs for gpt-oss harmony — feature request — by cmunley1 (created: 2026-02-12 17:53 (UTC+8))
#34406 [Bug]: Instruction following capability is deteriorating: Output introduces parameter defined in functioncall incorrectly — bug — by Patrickpan9 (created: 2026-02-12 16:09 (UTC+8))
#34403 [Bug]: error when use vllm on deepseek-vl2 — bug — by Pissohappy (created: 2026-02-12 15:23 (UTC+8))
#34405 [Bug]: Qwen3-Coder-Next fails to load the model's bundled qwen3coder_tool_parser_vllm.py with the error No module named ‘vllm.entrypoints.openai.protocol’ — bug — by MaoJianwei (created: 2026-02-12 15:40 (UTC+8))

Closed Issues

#11173 [Installation]: XPU dependencies are missing — installation,intel-gpu,stale — by pepijndevos (closed: 2026-02-13 10:18 (UTC+8))
#26619 [Bug]: --gpu-memory-utilization 0.5 Not Honored When Running Multiple Instances — bug,stale — by p936422648 (closed: 2026-02-13 10:16 (UTC+8))
#34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (closed: 2026-02-13 04:52 (UTC+8))
#34162 [CI Failure]: mi325_1: ROCm GPT-OSS Eval — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-13 04:51 (UTC+8))
#33515 [Bug]: Error in inspecting model architecture ‘NemotronHForCausalLM’ — bug — by shahizat (closed: 2026-02-13 01:24 (UTC+8))
#34252 [Bug]: Multi-GPU NCCL initialization fails with Cuda failure 700 ‘an illegal memory access was encountered’ on NVIDIA B200 GPUs — bug — by venishpatidar (closed: 2026-02-12 15:16 (UTC+8))
#34357 [Bug]: MoE models with shared experts do not support DP+EP — bug — by jeejeelee (closed: 2026-02-12 14:44 (UTC+8))

New PRs

#34426 [New Model] support new model ovis2.6 — documentation,new-model,ready,multi-modality — by myselvess (created: 2026-02-12 19:09 (UTC+8))
#34430 [Bugfix] FIx TorchAO config bugs — bug,documentation — by jwpark33 (created: 2026-02-12 20:43 (UTC+8))
#34485 [Refactor] Pass full VllmConfig to Renderer — ready,v1 — by DarkLight1337 (created: 2026-02-13 11:27 (UTC+8))
#34418 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode — bug,v1 — by haosdent (created: 2026-02-12 17:54 (UTC+8))
#34483 [Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image — bug,ready — by haosdent (created: 2026-02-13 10:54 (UTC+8))
#34476 [BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 — bug,ready,nvidia — by vadiklyutiy (created: 2026-02-13 08:27 (UTC+8))
#34482 Support CUDAGraph on XPU Platform — v1,nvidia — by xinyu-intel (created: 2026-02-13 10:36 (UTC+8))
#34481 [Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation — bug,rocm — by c0de128 (created: 2026-02-13 10:35 (UTC+8))
#34427 [Bugfix] Delete unused redundant code in Kimi-K2.5 — bug,ready — by LoganJane (created: 2026-02-12 19:24 (UTC+8))
#34435 [Refactor] Simplify BOS/EOS token handling — structured-output,ready,v1,kv-connector — by DarkLight1337 (created: 2026-02-12 22:19 (UTC+8))
#34443 [Bugfix] Remove assert that’s no longer valid — bug,ready — by bnellnm (created: 2026-02-13 01:07 (UTC+8))
#34445 [BugFix] Add block_size validation for mamba cache align mode — bug,ready — by peakcrosser7 (created: 2026-02-13 01:17 (UTC+8))
#34451 Fix num_logprobs parameter description in sampler.py — ready,v1 — by zhuohan123 (created: 2026-02-13 02:41 (UTC+8))
#34446 [CI/Build] Update video URLs for testing — ready — by DarkLight1337 (created: 2026-02-13 01:37 (UTC+8))
#34480 [AsyncTP] Set return_A=False in AllGatherGEMMPattern to unlock multim… — needs-rebase — by tianrengao (created: 2026-02-13 10:13 (UTC+8))
#34440 [BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec — bug,ready,v1 — by peakcrosser7 (created: 2026-02-13 00:40 (UTC+8))
#34472 [Bugfix] Fix Pydantic BaseModel usage by replacing dataclasses.replace with setattr — bug,frontend — by yashnib (created: 2026-02-13 06:39 (UTC+8))
#34478 [Model] Add NVFP4 quantization support for Step3.5-Flash — nvidia — by tacos8me (created: 2026-02-13 09:44 (UTC+8))
#34453 [Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues — bug,ready — by bnellnm (created: 2026-02-13 03:10 (UTC+8))
#34424 [Perf] Optimize FP8 gemm of sm120. — nvidia — by wenshuai-xiaomi (created: 2026-02-12 19:03 (UTC+8))
#34475 [Benchmark] Support recursive trace discovery and JAX trace formats — performance,frontend,v1,cpu — by YJYJLee (created: 2026-02-13 08:26 (UTC+8))
#34466 [CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) — ready,ci/build — by vladmihailescu (created: 2026-02-13 05:20 (UTC+8))
#34433 [Bugfix] Remove broken raw url GGUF model loading support — bug,ready — by Isotr0py (created: 2026-02-12 22:04 (UTC+8))
#34474 [Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding — bug,ready,v1 — by LucasWilkinson (created: 2026-02-13 07:27 (UTC+8))
#34415 [KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround — ready,v1,kv-connector — by eicherseiji (created: 2026-02-12 17:42 (UTC+8))
#34473 [Test] Add FP8 KV Cache Testing for MLA Backends — v1 — by wzhao18 (created: 2026-02-13 06:49 (UTC+8))
#34471 [Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers — llama — by eldarkurtic (created: 2026-02-13 06:31 (UTC+8))
#34469 [Bugfix][Hardware][AMD] Fix string literal comparison in DISPATCH_BY_KV_CACHE_DTYPE macro — bug,rocm — by c0de128 (created: 2026-02-13 05:55 (UTC+8))
#34456 [CI/Build] Add .deps to .dockerignore to prevent state from being pulled into ROCm build image — rocm — by tlrmchlsmth (created: 2026-02-13 03:40 (UTC+8))
#34467 [Bugfix] Replace c10::optional with std::optional in topk kernel — bug — by FloatingVertex (created: 2026-02-13 05:27 (UTC+8))
#34468 [CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 — frontend — by AndreasKaratzas (created: 2026-02-13 05:35 (UTC+8))
#34457 [Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs — bug,performance,v1,nvidia — by MatthewBonanni (created: 2026-02-13 03:50 (UTC+8))
#34462 [Perf] CFS-aware torch thread count in set_default_torch_num_threads — cpu — by Meshkati (created: 2026-02-13 04:49 (UTC+8))
#34455 [Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported — bug,rocm — by JadenMathias (created: 2026-02-13 03:35 (UTC+8))
#34461 fix(openai): correct timestamp drift for long audio chunking (#32588) — frontend — by kelvinvelasquez-SDE (created: 2026-02-13 04:46 (UTC+8))
#34410 small adjustment to wvSplitKrc — rocm,ready — by amd-hhashemi (created: 2026-02-12 16:58 (UTC+8))
#34454 [Bugfix]: Fix structured output in multi-turn gpt-oss — bug,gpt-oss — by bbrowning (created: 2026-02-13 03:25 (UTC+8))
#34447 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility — rocm,ready — by AndreasKaratzas (created: 2026-02-13 01:43 (UTC+8))
#34441 [Feature] Add FlashInfer cuDNN backend for ViT attention — v1,qwen,nvidia — by dorhuri123 (created: 2026-02-13 00:48 (UTC+8))
#34450 [CI] Make DBO tests non-optional — ready,ci/build — by LucasWilkinson (created: 2026-02-13 02:21 (UTC+8))
#34448 [Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels — ci/build — by EdalatiAli (created: 2026-02-13 02:05 (UTC+8))
#34436 Fix MoE for the Transformers modelling backend — ready — by hmellor (created: 2026-02-12 22:28 (UTC+8))
#34428 [Voxtral Realtime] Refactor & Improve buffering logic — ready,ci/build,multi-modality — by patrickvonplaten (created: 2026-02-12 19:31 (UTC+8))
#34431 [ROCm][quantization] improve OCP weight quant parser robust — rocm,ready — by xuebwang-amd (created: 2026-02-12 21:11 (UTC+8))
#34425 [Bug Fix] Re-add the DCP/PCP compatibility check for CUDA platform — bug,needs-rebase,nvidia — by Siritao (created: 2026-02-12 19:09 (UTC+8))
#34439 [Docs] Spec decoding docs warning removal — documentation — by NickLucche (created: 2026-02-13 00:26 (UTC+8))
#34438 Add explicit validation error for tool calls. — no labels — by juliendenize (created: 2026-02-12 23:02 (UTC+8))
#34416 [Misc] Update tests and examples for Prithvi/Terratorch models — documentation,ready,ci/build,multi-modality — by christian-pinto (created: 2026-02-12 17:53 (UTC+8))
#34422 [Bug Fix] Fix --served-model-name greedily consuming positional model_tag — bug — by haosdent (created: 2026-02-12 18:19 (UTC+8))
#34432 [WIP][Kernel] Add Helion kernel for rms_norm_dynamic_per_token_quant — no labels — by xiaohongchen1991 (created: 2026-02-12 21:34 (UTC+8))
#34414 [Bugfix] Add quant_config in ViT of Kimi-K2.5 — bug — by LoganJane (created: 2026-02-12 17:40 (UTC+8))
#34434 [CPU][Perf] Accelerate Attention head for s390x using vector intrinsics — v1,cpu — by R3hankhan123 (created: 2026-02-12 22:07 (UTC+8))
#34411 Add config file for fused MoE for Nemotron (TP4, B200) — ready — by danisereb (created: 2026-02-12 17:18 (UTC+8))
#34429 DCP fix for GLM5 — v1 — by prakhar-prakash-juspay (created: 2026-02-12 20:30 (UTC+8))
#34400 [V0 Deprecation] Remove code related to per-request logits processors — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-12 13:20 (UTC+8))
#34402 [Misc] Update Qwen3-VL models dummy image size calculation — qwen — by Isotr0py (created: 2026-02-12 14:46 (UTC+8))
#34421 Add support for LoRA with NVFP4 MoE models — no labels — by danisereb (created: 2026-02-12 18:08 (UTC+8))
#34419 pass raw request to io_process_plugin — frontend — by staugust (created: 2026-02-12 18:06 (UTC+8))
#34420 Fix MoE models in EP mode on Ascend — no labels — by ylyhyqsl (created: 2026-02-12 18:08 (UTC+8))
#34412 [DO NOT MERGE ] Evidence for FlashInfer allreduce_fusion one-shot (kARResidualRMSNorm) causes deterministic NaN corruption and GSM8K collapse — no labels — by baonudesifeizhai (created: 2026-02-12 17:30 (UTC+8))
#34404 [Bug Fix] Enable non-gated MoE support in Triton backends (#34356) — bug,nvidia — by haosdent (created: 2026-02-12 15:31 (UTC+8))
#34409 [CPU Offloading] Add offloading connector scheduler load delay metric — v1,kv-connector — by pougetat (created: 2026-02-12 16:52 (UTC+8))
#34408 fix(exaone4): pad MLP intermediate size for group quant TP alignment — no labels — by junstar92 (created: 2026-02-12 16:50 (UTC+8))

Merged PRs

#34047 [ROCm][CI] Fix serving tokens test failures — rocm,ready — by AndreasKaratzas (merged: 2026-02-13 11:27 (UTC+8))
#34309 Add new sections to CODEOWNERS — ci/build — by DarkLight1337 (merged: 2026-02-13 10:39 (UTC+8))
#33706 [Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode — ready,v1 — by peakcrosser7 (merged: 2026-02-13 10:21 (UTC+8))
#33907 [Bugfix] Fix Random Dataset Prefix Length Inaccuracy — bug,performance,ready — by frankwang28 (merged: 2026-02-13 10:21 (UTC+8))
#34025 [Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure — ready — by gmagogsfm (merged: 2026-02-13 10:21 (UTC+8))
#34427 [Bugfix] Delete unused redundant code in Kimi-K2.5 — bug,ready — by LoganJane (merged: 2026-02-13 10:18 (UTC+8))
#34435 [Refactor] Simplify BOS/EOS token handling — structured-output,ready,v1,kv-connector — by DarkLight1337 (merged: 2026-02-13 10:18 (UTC+8))
#34443 [Bugfix] Remove assert that’s no longer valid — bug,ready — by bnellnm (merged: 2026-02-13 10:18 (UTC+8))
#34445 [BugFix] Add block_size validation for mamba cache align mode — bug,ready — by peakcrosser7 (merged: 2026-02-13 10:18 (UTC+8))
#34451 Fix num_logprobs parameter description in sampler.py — ready,v1 — by zhuohan123 (merged: 2026-02-13 10:18 (UTC+8))
#34446 [CI/Build] Update video URLs for testing — ready — by DarkLight1337 (merged: 2026-02-13 10:15 (UTC+8))
#33373 [Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel — ready — by gmagogsfm (merged: 2026-02-13 10:13 (UTC+8))
#33198 [Core] Profiler improvements and lazy initialization — frontend,ready,v1,cpu — by jaewonlee-fb (merged: 2026-02-13 08:16 (UTC+8))
#33195 [Core] Add sleep level 0 mode with enqueue/wait pattern — frontend,ready,v1 — by jaewonlee-fb (merged: 2026-02-13 08:16 (UTC+8))
#33709 [Frontend] Enable generic structured_outputs for responses API — frontend,ready — by alecsolder (merged: 2026-02-13 08:15 (UTC+8))
#34378 Use paged_attention_v1 for sliding window decode in rocm_aiter_fa — rocm,ready,v1,meta-exported,fb-exported — by iseeyuan (merged: 2026-02-13 08:14 (UTC+8))
#34433 [Bugfix] Remove broken raw url GGUF model loading support — bug,ready — by Isotr0py (merged: 2026-02-13 01:40 (UTC+8))
#33506 [Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 — performance,ready,nvidia — by amitz-nv (merged: 2026-02-13 05:06 (UTC+8))
#34410 small adjustment to wvSplitKrc — rocm,ready — by amd-hhashemi (merged: 2026-02-13 04:26 (UTC+8))
#34374 [Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA — bug,ready,nvidia — by mgoin (merged: 2026-02-13 04:08 (UTC+8))
#34447 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility — rocm,ready — by AndreasKaratzas (merged: 2026-02-13 02:47 (UTC+8))
#34436 Fix MoE for the Transformers modelling backend — ready — by hmellor (merged: 2026-02-13 01:29 (UTC+8))
#34428 [Voxtral Realtime] Refactor & Improve buffering logic — ready,ci/build,multi-modality — by patrickvonplaten (merged: 2026-02-13 01:46 (UTC+8))
#33803 [Voxstral Realtime] Enable tests — ready,multi-modality — by patrickvonplaten (merged: 2026-02-13 01:43 (UTC+8))
#34431 [ROCm][quantization] improve OCP weight quant parser robust — rocm,ready — by xuebwang-amd (merged: 2026-02-13 01:40 (UTC+8))
#33451 [Attention] Add FlashInfer Sparse MLA backend — documentation,performance,rocm,ready,ci/build,v1,cpu,nvidia — by MatthewBonanni (merged: 2026-02-13 01:21 (UTC+8))
#34439 [Docs] Spec decoding docs warning removal — documentation — by NickLucche (merged: 2026-02-13 01:01 (UTC+8))
#34382 [BUG] Reset running requests when clearing cache for pause/resume — bug,ready,v1 — by hao-aaron (merged: 2026-02-13 00:19 (UTC+8))
#34411 Add config file for fused MoE for Nemotron (TP4, B200) — ready — by danisereb (merged: 2026-02-12 22:09 (UTC+8))
#34192 [ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter — rocm,ready,ci/build — by dllehr-amd (merged: 2026-02-12 21:06 (UTC+8))
#34400 [V0 Deprecation] Remove code related to per-request logits processors — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-12 20:44 (UTC+8))
#34104 Fix Mistral config remap to accept compressed-tensors quantization #34028 — bug,ready — by baonudesifeizhai (merged: 2026-02-12 16:22 (UTC+8))
#34128 Vllm CPU benchmark suite improvement — documentation,performance,ready,ci/build,cpu — by louie-tsai (merged: 2026-02-12 16:04 (UTC+8))
#34397 [bugfix] refactor FunASR’s _get_data_parser — bug,ready — by AllenDou (merged: 2026-02-12 15:26 (UTC+8))
#33446 [Bugfix] Fix Sparse24 Compressed Tensors models — bug,ready,nvidia — by kylesayrs (merged: 2026-02-12 15:15 (UTC+8))
#34085 Fix DeepSeek-OCR tensor validation for all size variants — ready,deepseek — by yichuan-w (merged: 2026-02-12 14:50 (UTC+8))
#34379 [BugFix] Fix DP chunking — bug,ready — by LucasWilkinson (merged: 2026-02-12 14:44 (UTC+8))
#34329 [Refactor] Pass Renderer to Input Processor — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-12 11:38 (UTC+8))

PRs Closed Without Merging

#25801 [Core] Bookkeeping optimization: Batchify updates 1D numpy arrays (e.g. num_tokens, num_tokens_no_spec) — performance,stale,v1 — by Jialin (closed: 2026-02-13 10:17 (UTC+8))
#34472 [Bugfix] Fix Pydantic BaseModel usage by replacing dataclasses.replace with setattr — bug,frontend — by yashnib (closed: 2026-02-13 10:06 (UTC+8))
#34151 Revert “[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257)” — no labels — by amitz-nv (closed: 2026-02-13 00:28 (UTC+8))
#32497 [Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor — bug,rocm,v1 — by c0de128 (closed: 2026-02-13 05:36 (UTC+8))
#34381 [Kernel] adopt mxfp8 grouped_gemm and grouped_quant kernel — ci/build — by EdalatiAli (closed: 2026-02-13 01:53 (UTC+8))
#34261 [Frontend] Realtime API uses Renderer — frontend,ready,needs-rebase,v1 — by DarkLight1337 (closed: 2026-02-13 00:35 (UTC+8))
#34367 [BugFix] Skip null blocks when adding cached blocks in current step — bug,v1 — by peakcrosser7 (closed: 2026-02-13 00:19 (UTC+8))
#34278 [Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd — bug,kv-connector — by shwgao (closed: 2026-02-12 23:33 (UTC+8))
#34422 [Bug Fix] Fix --served-model-name greedily consuming positional model_tag — bug — by haosdent (closed: 2026-02-12 22:56 (UTC+8))
#34402 [Misc] Update Qwen3-VL models dummy image size calculation — qwen — by Isotr0py (closed: 2026-02-12 19:58 (UTC+8))