vLLM Development Activity Report - 2026-02-12
Time window: 2026-02-12 11:36 (UTC+8) ~ 2026-02-13 11:36 (UTC+8) | Data: 24 new issues | 7 closed issues | 63 new PRs | 38 merged PRs | 10 PRs closed without merging
📊 Daily Development Status Summary
Over the past 24 hours the vLLM project maintained very high development activity, with both new and merged PR counts at elevated levels. Work centered on the evolution and optimization of the parallel-computing architecture (e.g., decoupled attention/FFN parallelism), a refactoring discussion for the serving stack (disaggregating the frontend from the engine), and support and bug fixes for multimodal and quantized models. Notably, several CI tests failed in distributed (DP/TP/EP) and heterogeneous-hardware (ARM CPU, Ascend, Intel HPU) environments, showing that rapid iteration keeps putting pressure on infrastructure compatibility.
🎯 AMD/ROCm Ecosystem Activity
AMD-ecosystem contributions were very active this cycle, concentrated on performance optimization, build fixes, and quantization support.
New PRs:
- PR #34481 ([Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation): submitted by c0de128. Targets slow inference when emulating MXFP4/MXFP6/NVFP4 models on AMD hardware without native MX-format support (e.g., CDNA3). Introduces the environment variable VLLM_EMULATION_DEQUANT_WEIGHTS_AOT=1, which dequantizes weights ahead of time at model load, trading memory for a large speedup under emulated quantization.
- PR #34469 ([Bugfix][Hardware][AMD] Fix string literal comparison in DISPATCH_BY_KV_CACHE_DTYPE macro): submitted by c0de128. Fixes a Clang 22+ (ROCm 6.3+) compile error caused by comparing string-literal addresses inside the macro (deprecated in C++20). Wrapping with std::string() ensures value comparison, restoring clean builds on newer ROCm toolchains.
- PR #34410 (small adjustment to wvSplitKrc): submitted by AMD engineer amd-hhashemi, already merged. A small AITER kernel tweak for GPT-OSS models that fixes single-prompt request failures in some vLLM Docker environments.
- PR #34447 ([ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility): submitted by AndreasKaratzas, already merged. Pins the TorchCodec dependency to v0.10.0 for compatibility with ROCm PyTorch builds (which lack the stable-ABI headers), fixing source-build failures.
- PR #34431 ([ROCm][quantization] improve OCP weight quant parser robust): submitted by AMD engineer xuebwang-amd, already merged. Hardens the Quark OCP weight-quantization parser, fixing model-load errors caused by is_mxfp4_quant call failures under certain conditions.
- PR #34456 ([CI/Build] Add .deps to .dockerignore ...): submitted by tlrmchlsmth. Adds the .deps directory to .dockerignore so that host-side CUDA dependency state cannot leak into ROCm image builds, keeping the build environment clean.
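The ahead-of-time path introduced by PR #34481 is a space-for-time trade: pay the dequantization cost once at load instead of on every forward pass. A minimal sketch of the idea in plain Python (hypothetical class and helper names, not vLLM's actual implementation, assuming simple per-tensor symmetric scaling):

```python
import os

def dequantize(q_row, scale):
    """Recover approximate float weights from integer codes and a scale."""
    return [q * scale for q in q_row]

class EmulatedQuantLinear:
    """Hypothetical linear layer emulating a quant format in software."""

    def __init__(self, q_weight, scale):
        # q_weight: one row of integer codes per output channel.
        self.q_weight = q_weight
        self.scale = scale
        # VLLM_EMULATION_DEQUANT_WEIGHTS_AOT=1: dequantize once at load time
        # (extra memory) instead of on every forward pass (extra latency).
        self.aot = os.environ.get("VLLM_EMULATION_DEQUANT_WEIGHTS_AOT") == "1"
        self.weight = [dequantize(r, scale) for r in q_weight] if self.aot else None

    def forward(self, x):
        w = self.weight if self.aot else [dequantize(r, self.scale) for r in self.q_weight]
        # y[j] = sum_k x[k] * w[j][k]
        return [sum(xk * wk for xk, wk in zip(x, row)) for row in w]
```

Under emulation the matmul runs in floating point either way; AOT mode only removes the repeated integer-to-float conversion, at the cost of holding the dequantized copy in memory.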
Closed issues:
- Issue #34162 ([CI Failure]: ROCm GPT-OSS Eval): the ROCm GPT-OSS eval test failure; fixed by PR #34371 and closed.
- Issue #34365 ([CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU)): a validation failure caused by the deprecated environment variable VLLM_TEST_GROUP_NAME; fixed by PR #34350 and closed.
Summary: The AMD team stood out this cycle, with contributions spanning kernel performance optimization, quantization support, build-system fixes, and CI stability, reflecting sustained investment in the vLLM user experience on AMD hardware.
💬 High-Engagement Discussion Analysis
- Issue #34444 ([RFC]: Decoupled Attention/FFN Parallelism):
  - Core topic: Proposes extending decode context parallelism (DCP) with decoupled attention/FFN parallel degrees and an A2A communication backend, addressing KV-cache replication for GQA/MLA models at large TP sizes.
  - Viewpoints and debate:
    - Maintainer LucasWilkinson suggested restructuring the RFC to focus on the new -tpa argument itself and its impact on parallel_state.py, moving the DCP/Helix details to a use-case section and the Q-projection replication discussion back to its related issue.
    - Author sungsooha responded actively, clarifying design questions (whether TPA should be an explicit argument or derived internally; how TPA, DCP, and KVP relate) and quickly restructuring the RFC as suggested.
  - Current status: The RFC has been restructured per the guidance; discussion now centers on the infrastructure design. This is a high-level architecture-evolution discussion.
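The KV-cache replication the RFC targets can be seen with a small calculation: under head-wise tensor parallelism, once TP exceeds the model's KV-head count, each KV head's cache must be duplicated across ranks. An illustrative helper (not vLLM code; assumes even head-wise sharding):

```python
def kv_replication_factor(num_kv_heads: int, tp_size: int) -> int:
    """How many copies of each KV head's cache exist across TP ranks.

    With head-wise sharding there are only min(num_kv_heads, tp_size)
    distinct KV shards, so once tp_size > num_kv_heads each KV head
    (and its cache) is held by tp_size / num_kv_heads ranks.
    """
    return max(1, tp_size // num_kv_heads)

# e.g. a GQA model with 8 KV heads at TP=64 stores 8 copies of its KV cache;
# a decoupled attention TP (TPA) of 8 would keep the factor at 1.
```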
- Issue #34401 ([CI Failure]: Distributed Tests (8 GPUs)(H100)):
  - Core topic: With TP=2, DP=4, and EP enabled, the CI test fails with duplicate-GPU detection and a FlashInfer allreduce-fusion problem.
  - Viewpoints and debugging:
    - User ZJY0516 suspected a CUDA_DEVICE_INDEX setting issue but noted that a similar local configuration runs fine.
    - User haosdent identified the root cause as the FlashInfer allreduce fusion not being passed the TP process group, and suggested a code patch. ProExpertProg attempted a fix that did not fully resolve it and linked the new Issue #34458 (AR+rms broken at TP=2 DP=2).
  - Current status: The root cause is tentatively located (allreduce fusion mismatched with its process group), but the fix surfaced a deeper problem; investigation continues.
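The class of bug identified here, a fused collective running over the wrong process group, is easy to reproduce in miniature. In this toy simulation (plain Python, no real collectives), four ranks form two TP groups of two; reducing over the correct TP subgroup gives a different result from accidentally reducing over the world group, which mixes activations across data-parallel replicas:

```python
def all_reduce_sum(values, group):
    """Simulate an allreduce: every rank in `group` receives the group sum."""
    s = sum(values[r] for r in group)
    return {r: s for r in group}

# TP=2, DP=2: world = 4 ranks; TP groups pair the ranks sharing model shards.
values = {0: 1.0, 1: 2.0, 2: 10.0, 3: 20.0}
tp_groups = [[0, 1], [2, 3]]

# Correct: reduce within each TP group.
correct = {}
for g in tp_groups:
    correct.update(all_reduce_sum(values, g))

# Bug: the fused op falls back to the default (world) group,
# summing activations across data-parallel replicas as well.
buggy = all_reduce_sum(values, [0, 1, 2, 3])
```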
- PR #34419 (pass raw request to io_process_plugin):
  - Core topic: After the offline interface and online API were unified on parse_data, IO-processor plugins can no longer access extra parameters from the raw request (e.g., truncate_prompt_tokens).
  - Viewpoints and discussion:
    - Author staugust proposed passing the full request object to meet plugin needs. christian-pinto recalled the original design intent (a unified interface) and agreed the full request could be passed again, provided a default implementation preserves backward compatibility. DarkLight1337 and noooop both agreed, mentioning related PRs and future alignment plans.
  - Current status: Community consensus favors giving plugins more context; a solution is being sought that meets the need while keeping the API simple and backward compatible.
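The backward-compatible direction the thread converges on, passing more context while keeping old plugins working, is commonly done with a defaulted hook. A hypothetical sketch (all names invented; not the actual io_process_plugin interface):

```python
from typing import Any, Optional

class IOProcessorPlugin:
    """Hypothetical plugin base; existing subclasses override parse_data."""

    def parse_data(self, data: Any) -> Any:
        return data

    # New hook with a default implementation: old plugins that only know
    # parse_data keep working unchanged; new plugins may inspect the raw
    # request (e.g. truncate_prompt_tokens) by overriding this method.
    def parse_request(self, data: Any, raw_request: Optional[dict] = None) -> Any:
        return self.parse_data(data)

class TruncatingPlugin(IOProcessorPlugin):
    def parse_request(self, data, raw_request=None):
        limit = (raw_request or {}).get("truncate_prompt_tokens")
        return data[:limit] if limit else data
```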
🔥 Hot Topics and Trend Analysis
- Architecture evolution and decoupling:
  - Frontend disaggregation: Issue #34407 proposes a two-phase RFC to decouple the frontend (online/frontend) from the engine (engine/renderer), enabling GPU-less deployment and a more flexible architecture.
  - Parallelism optimization: Issue #34444's RFC on decoupled attention/FFN parallelism, together with the AR+rms breakage under complex parallel configurations exposed in Issue #34458, reflects the community's relentless pursuit of peak performance and resource utilization.
- Multimodal model support:
  - New models: PR #34426 adds support for the Ovis2.6 multimodal model.
  - Bug fixes: Issue #34442 reports that the Kimi K2 tool parser's 8k argument-length limit is too strict; PR #34483 fixes an encoder-cache underestimation bug for single images in GLM-4V/GLM-OCR; Issue #34403 reports a DeepSeek-VL2 model-load failure.
- Quantization and hardware optimization:
  - AMD quantization: As noted above, several PRs improve quantization-emulation performance and parser robustness on AMD platforms.
  - NVIDIA quantization fix: PR #34476 fixes an accuracy regression for the Nemotron-3-Nano NVFP4 model at TP>1, tracing the bug to MergedColumnParallelLinear being wrongly replaced with ColumnParallelLinear, which misaligned the weight-quantization scales.
  - New quantization support: PR #34478 adds NVFP4 MoE support for the Step3.5-Flash model.
- CI and test-infrastructure strain:
  - The volume of CI-failure issues (#34465, #34464, #34463, #34460, #34459, #34401) shows that as feature complexity grows (DBO, DP, EP, multiple hardware backends), test-environment stability and coverage are under heavy pressure.
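The Nemotron regression above illustrates why the merged layer matters for quantized checkpoints: when two projections are fused into one matrix, their per-output-channel scales must be concatenated in exactly the same order as the weight rows, or dequantization silently pairs rows with the wrong scales. A toy illustration (not vLLM code):

```python
def dequant_rows(q_rows, scales):
    """Per-output-channel dequantization: row i uses scales[i]."""
    assert len(q_rows) == len(scales)
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# Two projections quantized separately, each with its own channel scales.
gate_q, gate_scales = [[4], [8]], [0.25, 0.5]
up_q, up_scales = [[2], [6]], [1.0, 2.0]

# Merged layout: weight rows concatenated, so the scales must follow
# the same order.
merged_q = gate_q + up_q
good = dequant_rows(merged_q, gate_scales + up_scales)

# Misaligned scales (e.g. after a wrong code-path unification) dequantize
# the same integer codes into very different weights, with no error raised.
bad = dequant_rows(merged_q, up_scales + gate_scales)
```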
🛠️ Key Technical Changes
- PR #34485 ([Refactor] Pass full VllmConfig to Renderer): A key preparatory refactor that passes the full VllmConfig, rather than just ModelConfig, to the Renderer, paving the way for multimodal (MM) processing support in the architecture.
- PR #34444 ([RFC]: Decoupled Attention/FFN Parallelism): Proposes a major evolution of the parallel-computing architecture. Introducing an independent attention tensor-parallel size (TPA) allows finer control over compute and memory costs and avoids unnecessary KV-cache replication, especially for long-context and GQA/MLA models.
- Issue #34407 ([RFC]: Disaggregated Frontend): Proposes a far-reaching serving-architecture change: splitting the online serving layer into an independent frontend (tokenization, MM inputs, etc.) and renderer (detokenization, tool-call parsing, etc.), both decoupled from the GPU inference engine. This improves deployment flexibility and enables CPU-only frontends and custom integrations.
- PR #34476 ([BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1): A critical fix. Reverting a commit that wrongly unified code paths restores the correct use of MergedColumnParallelLinear, resolving the accuracy collapse at TP>1 caused by misaligned weight-quantization scales and restoring reliable distributed inference for quantized models.
- PR #34445 ([BugFix] Add block_size validation for mamba cache align mode): Adds startup validation for Mamba models in cache-align mode, where a block_size larger than max_num_batched_tokens could deadlock the scheduler; failing fast at startup improves robustness.
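The guard described in PR #34445 amounts to a fail-fast configuration check at startup. A schematic version (hypothetical function name; the rationale in the docstring is inferred from the report):

```python
def validate_mamba_align_config(block_size: int, max_num_batched_tokens: int) -> None:
    """Reject configs where mamba cache align mode could deadlock the scheduler.

    Per the report, align mode with block_size > max_num_batched_tokens can
    deadlock scheduling, so the configuration is rejected at startup rather
    than stalling silently at runtime.
    """
    if block_size > max_num_batched_tokens:
        raise ValueError(
            f"mamba cache align mode requires block_size ({block_size}) "
            f"<= max_num_batched_tokens ({max_num_batched_tokens})"
        )
```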
📈 Development Activity Observations
- Contributor activity: This cycle drew contributions from many organizations and individuals, including AMD (contributors with the -amd suffix), NVIDIA, Intel, and numerous community developers, reflecting vLLM's broad ecosystem reach.
- Review and merging: Of the 63 new PRs, 38 were merged, a high merge rate. Many PRs (e.g., #34427, #34435, #34443) were merged the same day they were opened, showing the core team's rapid handling of bug fixes and clear-cut improvements.
- Discussion depth: The architecture-RFC discussions (e.g., #34444, #34407) show deep community thinking about the project's long-term technical direction and a rigorous design-review process.
💡 Issues Worth Watching
- Architecture decisions: Issue #34407 (disaggregated frontend) and Issue #34444 (decoupled attention parallelism) are two major architecture RFCs. Their final designs and implementations will deeply shape vLLM's future scalability, deployment models, and performance ceiling; community members should keep following and joining these discussions.
- Stability under complex parallelism: Issue #34401, Issue #34458, and several CI failures show that when DP, TP, EP, and DCP interleave, especially combined with advanced optimizations such as FlashInfer, subtle process-group management, memory-access, and communication bugs surface. This is a key challenge for keeping vLLM stable in production.
- Maturing the multimodal and tool-calling ecosystem: Issue #34442 (tool-argument length limit), Issue #34403 (DeepSeek-VL2 load failure), and the GLM-5-FP8 malformed tool-call report show that multimodal and complex agent use cases still need polish. As AI applications move toward multimodality and agents, stability and feature completeness here are critical.
📋 Appendix: Detailed Data
New Issues
- #34479 [Usage]: Streaming Prefill for 1M-Token Prompts in vLLM — usage — by Wesley-Jzy (created: 2026-02-13 09:48 (UTC+8))
- #34484 [Bug]: kv_load_failure_recovery example fails when using sync KV load mode — bug — by sykzhong (created: 2026-02-13 11:12 (UTC+8))
- #34477 [Bug]: TVM_FFI_ICHECK(args->n_group != 0) << "n_group should not be zero for DeepSeekV3 routing" — bug — by jeejeelee (created: 2026-02-13 09:18 (UTC+8))
- #34407 [RFC]: Disaggregated Frontend — Separating Online Serving from Engine — RFC — by hyeongyun0916 (created: 2026-02-12 16:27 (UTC+8))
- #34413 [Bug]: Qwen coder next performance after d7982da commit. — bug — by stavinsky (created: 2026-02-12 17:38 (UTC+8))
- #34399 [Bug]: Nemotron 3 (all quants) take a LONG time to load — bug — by jiangwu300 (created: 2026-02-12 11:44 (UTC+8))
- #34470 [Bug]: NVIDIA Jetson Thor: Value 'sm_110a' is not defined for option 'gpu-name' — bug — by Kaweees (created: 2026-02-13 06:08 (UTC+8))
- #34444 [RFC]: Decoupled Attention/FFN Parallelism — RFC — by sungsooha (created: 2026-02-13 01:07 (UTC+8))
- #34465 [CI Failure]: Intel HPU Test — ci-failure — by rajkiranjoshi (created: 2026-02-13 05:14 (UTC+8))
- #34464 [CI Failure]: Ascend NPU Test — ci-failure — by rajkiranjoshi (created: 2026-02-13 05:01 (UTC+8))
- #34463 [CI Failure]: ARM CPU Test (test_moe.py) — ci-failure,cpu — by rajkiranjoshi (created: 2026-02-13 04:52 (UTC+8))
- #34460 [CI Failure]: Distributed Tests (2 GPUs)(H100) test_dbo.py — ci-failure — by rajkiranjoshi (created: 2026-02-13 04:45 (UTC+8))
- #34459 [CI Failure]: Entrypoints Integration (Responses API) (test_parsable_context.py) — ci-failure — by rajkiranjoshi (created: 2026-02-13 04:35 (UTC+8))
- #34458 [Bug]: AR+rms broken for TP=2 DP=2 — bug — by ProExpertProg (created: 2026-02-13 03:55 (UTC+8))
- #34401 [CI Failure]: Distributed Tests (8 GPUs)(H100) — ci-failure — by wzhao18 (created: 2026-02-12 14:06 (UTC+8))
- #34452 [Usage]: Issue Running Nemotron 3 Nano NVFP4 on sm12x (RTX 5090 / Blackwell Pro 6000) — usage — by shahizat (created: 2026-02-13 02:43 (UTC+8))
- #34449 [Bug]: GLM-5-FP8 malformed tool calls — bug — by TALLEC-Scott (created: 2026-02-13 02:20 (UTC+8))
- #34442 [Bug]: kimi k2 tool parser has a 8k buffer limit for args — bug — by bigBrain1901 (created: 2026-02-13 00:56 (UTC+8))
- #34437 [Bug]: Qwen3 Next with heterogeneous GPU (FP8 overflow?) — bug — by Nepherpitou (created: 2026-02-12 22:29 (UTC+8))
- #34423 [Feature]: GPT-OSS: Unify harmony workflows with other model type workflows — feature request — by jonoillar (created: 2026-02-12 18:37 (UTC+8))
- #34417 [Feature]: logprobs for gpt-oss harmony — feature request — by cmunley1 (created: 2026-02-12 17:53 (UTC+8))
- #34406 [Bug]: Instruction following capability is deteriorating: Output introduces parameter defined in function call incorrectly — bug — by Patrickpan9 (created: 2026-02-12 16:09 (UTC+8))
- #34403 [Bug]: error when use vllm on deepseek-vl2 — bug — by Pissohappy (created: 2026-02-12 15:23 (UTC+8))
- #34405 [Bug]: Qwen3-Coder-Next loading the model's bundled qwen3coder_tool_parser_vllm.py fails with No module named 'vllm.entrypoints.openai.protocol' — bug — by MaoJianwei (created: 2026-02-12 15:40 (UTC+8))
Closed Issues
- #11173 [Installation]: XPU dependencies are missing — installation,intel-gpu,stale — by pepijndevos (closed: 2026-02-13 10:18 (UTC+8))
- #26619 [Bug]: --gpu-memory-utilization 0.5 Not Honored When Running Multiple Instances — bug,stale — by p936422648 (closed: 2026-02-13 10:16 (UTC+8))
- #34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (closed: 2026-02-13 04:52 (UTC+8))
- #34162 [CI Failure]: mi325_1: ROCm GPT-OSS Eval — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-13 04:51 (UTC+8))
- #33515 [Bug]: Error in inspecting model architecture 'NemotronHForCausalLM' — bug — by shahizat (closed: 2026-02-13 01:24 (UTC+8))
- #34252 [Bug]: Multi-GPU NCCL initialization fails with Cuda failure 700 'an illegal memory access was encountered' on NVIDIA B200 GPUs — bug — by venishpatidar (closed: 2026-02-12 15:16 (UTC+8))
- #34357 [Bug]: MoE models with shared experts do not support DP+EP — bug — by jeejeelee (closed: 2026-02-12 14:44 (UTC+8))
New PRs
- #34426 [New Model] support new model ovis2.6 — documentation,new-model,ready,multi-modality — by myselvess (created: 2026-02-12 19:09 (UTC+8))
- #34430 [Bugfix] Fix TorchAO config bugs — bug,documentation — by jwpark33 (created: 2026-02-12 20:43 (UTC+8))
- #34485 [Refactor] Pass full VllmConfig to Renderer — ready,v1 — by DarkLight1337 (created: 2026-02-13 11:27 (UTC+8))
- #34418 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode — bug,v1 — by haosdent (created: 2026-02-12 17:54 (UTC+8))
- #34483 [Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image — bug,ready — by haosdent (created: 2026-02-13 10:54 (UTC+8))
- #34476 [BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 — bug,ready,nvidia — by vadiklyutiy (created: 2026-02-13 08:27 (UTC+8))
- #34482 Support CUDAGraph on XPU Platform — v1,nvidia — by xinyu-intel (created: 2026-02-13 10:36 (UTC+8))
- #34481 [Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation — bug,rocm — by c0de128 (created: 2026-02-13 10:35 (UTC+8))
- #34427 [Bugfix] Delete unused redundant code in Kimi-K2.5 — bug,ready — by LoganJane (created: 2026-02-12 19:24 (UTC+8))
- #34435 [Refactor] Simplify BOS/EOS token handling — structured-output,ready,v1,kv-connector — by DarkLight1337 (created: 2026-02-12 22:19 (UTC+8))
- #34443 [Bugfix] Remove assert that's no longer valid — bug,ready — by bnellnm (created: 2026-02-13 01:07 (UTC+8))
- #34445 [BugFix] Add block_size validation for mamba cache align mode — bug,ready — by peakcrosser7 (created: 2026-02-13 01:17 (UTC+8))
- #34451 Fix num_logprobs parameter description in sampler.py — ready,v1 — by zhuohan123 (created: 2026-02-13 02:41 (UTC+8))
- #34446 [CI/Build] Update video URLs for testing — ready — by DarkLight1337 (created: 2026-02-13 01:37 (UTC+8))
- #34480 [AsyncTP] Set return_A=False in AllGatherGEMMPattern to unlock multim… — needs-rebase — by tianrengao (created: 2026-02-13 10:13 (UTC+8))
- #34440 [BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec — bug,ready,v1 — by peakcrosser7 (created: 2026-02-13 00:40 (UTC+8))
- #34472 [Bugfix] Fix Pydantic BaseModel usage by replacing dataclasses.replace with setattr — bug,frontend — by yashnib (created: 2026-02-13 06:39 (UTC+8))
- #34478 [Model] Add NVFP4 quantization support for Step3.5-Flash — nvidia — by tacos8me (created: 2026-02-13 09:44 (UTC+8))
- #34453 [Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues — bug,ready — by bnellnm (created: 2026-02-13 03:10 (UTC+8))
- #34424 [Perf] Optimize FP8 gemm of sm120. — nvidia — by wenshuai-xiaomi (created: 2026-02-12 19:03 (UTC+8))
- #34475 [Benchmark] Support recursive trace discovery and JAX trace formats — performance,frontend,v1,cpu — by YJYJLee (created: 2026-02-13 08:26 (UTC+8))
- #34466 [CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) — ready,ci/build — by vladmihailescu (created: 2026-02-13 05:20 (UTC+8))
- #34433 [Bugfix] Remove broken raw url GGUF model loading support — bug,ready — by Isotr0py (created: 2026-02-12 22:04 (UTC+8))
- #34474 [Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding — bug,ready,v1 — by LucasWilkinson (created: 2026-02-13 07:27 (UTC+8))
- #34415 [KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround — ready,v1,kv-connector — by eicherseiji (created: 2026-02-12 17:42 (UTC+8))
- #34473 [Test] Add FP8 KV Cache Testing for MLA Backends — v1 — by wzhao18 (created: 2026-02-13 06:49 (UTC+8))
- #34471 [Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers — llama — by eldarkurtic (created: 2026-02-13 06:31 (UTC+8))
- #34469 [Bugfix][Hardware][AMD] Fix string literal comparison in DISPATCH_BY_KV_CACHE_DTYPE macro — bug,rocm — by c0de128 (created: 2026-02-13 05:55 (UTC+8))
- #34456 [CI/Build] Add .deps to .dockerignore to prevent state from being pulled into ROCm build image — rocm — by tlrmchlsmth (created: 2026-02-13 03:40 (UTC+8))
- #34467 [Bugfix] Replace c10::optional with std::optional in topk kernel — bug — by FloatingVertex (created: 2026-02-13 05:27 (UTC+8))
- #34468 [CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 — frontend — by AndreasKaratzas (created: 2026-02-13 05:35 (UTC+8))
- #34457 [Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs — bug,performance,v1,nvidia — by MatthewBonanni (created: 2026-02-13 03:50 (UTC+8))
- #34462 [Perf] CFS-aware torch thread count in set_default_torch_num_threads — cpu — by Meshkati (created: 2026-02-13 04:49 (UTC+8))
- #34455 [Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported — bug,rocm — by JadenMathias (created: 2026-02-13 03:35 (UTC+8))
- #34461 fix(openai): correct timestamp drift for long audio chunking (#32588) — frontend — by kelvinvelasquez-SDE (created: 2026-02-13 04:46 (UTC+8))
- #34410 small adjustment to wvSplitKrc — rocm,ready — by amd-hhashemi (created: 2026-02-12 16:58 (UTC+8))
- #34454 [Bugfix]: Fix structured output in multi-turn gpt-oss — bug,gpt-oss — by bbrowning (created: 2026-02-13 03:25 (UTC+8))
- #34447 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility — rocm,ready — by AndreasKaratzas (created: 2026-02-13 01:43 (UTC+8))
- #34441 [Feature] Add FlashInfer cuDNN backend for ViT attention — v1,qwen,nvidia — by dorhuri123 (created: 2026-02-13 00:48 (UTC+8))
- #34450 [CI] Make DBO tests non-optional — ready,ci/build — by LucasWilkinson (created: 2026-02-13 02:21 (UTC+8))
- #34448 [Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels — ci/build — by EdalatiAli (created: 2026-02-13 02:05 (UTC+8))
- #34436 Fix MoE for the Transformers modelling backend — ready — by hmellor (created: 2026-02-12 22:28 (UTC+8))
- #34428 [Voxtral Realtime] Refactor & Improve buffering logic — ready,ci/build,multi-modality — by patrickvonplaten (created: 2026-02-12 19:31 (UTC+8))
- #34431 [ROCm][quantization] improve OCP weight quant parser robust — rocm,ready — by xuebwang-amd (created: 2026-02-12 21:11 (UTC+8))
- #34425 [Bug Fix] Re-add the DCP/PCP compatibility check for CUDA platform — bug,needs-rebase,nvidia — by Siritao (created: 2026-02-12 19:09 (UTC+8))
- #34439 [Docs] Spec decoding docs warning removal — documentation — by NickLucche (created: 2026-02-13 00:26 (UTC+8))
- #34438 Add explicit validation error for tool calls. — no label — by juliendenize (created: 2026-02-12 23:02 (UTC+8))
- #34416 [Misc] Update tests and examples for Prithvi/Terratorch models — documentation,ready,ci/build,multi-modality — by christian-pinto (created: 2026-02-12 17:53 (UTC+8))
- #34422 [Bug Fix] Fix --served-model-name greedily consuming positional model_tag — bug — by haosdent (created: 2026-02-12 18:19 (UTC+8))
- #34432 [WIP][Kernel] Add Helion kernel for rms_norm_dynamic_per_token_quant — no label — by xiaohongchen1991 (created: 2026-02-12 21:34 (UTC+8))
- #34414 [Bugfix] Add quant_config in ViT of Kimi-K2.5 — bug — by LoganJane (created: 2026-02-12 17:40 (UTC+8))
- #34434 [CPU][Perf] Accelerate Attention head for s390x using vector intrinsics — v1,cpu — by R3hankhan123 (created: 2026-02-12 22:07 (UTC+8))
- #34411 Add config file for fused MoE for Nemotron (TP4, B200) — ready — by danisereb (created: 2026-02-12 17:18 (UTC+8))
- #34429 DCP fix for GLM5 — v1 — by prakhar-prakash-juspay (created: 2026-02-12 20:30 (UTC+8))
- #34400 [V0 Deprecation] Remove code related to per-request logits processors — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-12 13:20 (UTC+8))
- #34402 [Misc] Update Qwen3-VL models dummy image size calculation — qwen — by Isotr0py (created: 2026-02-12 14:46 (UTC+8))
- #34421 Add support for LoRA with NVFP4 MoE models — no label — by danisereb (created: 2026-02-12 18:08 (UTC+8))
- #34419 pass raw request to io_process_plugin — frontend — by staugust (created: 2026-02-12 18:06 (UTC+8))
- #34420 Fix MoE models in EP mode on Ascend — no label — by ylyhyqsl (created: 2026-02-12 18:08 (UTC+8))
- #34412 [DO NOT MERGE] Evidence for FlashInfer allreduce_fusion one-shot (kARResidualRMSNorm) causes deterministic NaN corruption and GSM8K collapse — no label — by baonudesifeizhai (created: 2026-02-12 17:30 (UTC+8))
- #34404 [Bug Fix] Enable non-gated MoE support in Triton backends (#34356) — bug,nvidia — by haosdent (created: 2026-02-12 15:31 (UTC+8))
- #34409 [CPU Offloading] Add offloading connector scheduler load delay metric — v1,kv-connector — by pougetat (created: 2026-02-12 16:52 (UTC+8))
- #34408 fix(exaone4): pad MLP intermediate size for group quant TP alignment — no label — by junstar92 (created: 2026-02-12 16:50 (UTC+8))
Merged PRs
- #34047 [ROCm][CI] Fix serving tokens test failures — rocm,ready — by AndreasKaratzas (merged: 2026-02-13 11:27 (UTC+8))
- #34309 Add new sections to CODEOWNERS — ci/build — by DarkLight1337 (merged: 2026-02-13 10:39 (UTC+8))
- #33706 [Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode — ready,v1 — by peakcrosser7 (merged: 2026-02-13 10:21 (UTC+8))
- #33907 [Bugfix] Fix Random Dataset Prefix Length Inaccuracy — bug,performance,ready — by frankwang28 (merged: 2026-02-13 10:21 (UTC+8))
- #34025 [Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure — ready — by gmagogsfm (merged: 2026-02-13 10:21 (UTC+8))
- #34427 [Bugfix] Delete unused redundant code in Kimi-K2.5 — bug,ready — by LoganJane (merged: 2026-02-13 10:18 (UTC+8))
- #34435 [Refactor] Simplify BOS/EOS token handling — structured-output,ready,v1,kv-connector — by DarkLight1337 (merged: 2026-02-13 10:18 (UTC+8))
- #34443 [Bugfix] Remove assert that's no longer valid — bug,ready — by bnellnm (merged: 2026-02-13 10:18 (UTC+8))
- #34445 [BugFix] Add block_size validation for mamba cache align mode — bug,ready — by peakcrosser7 (merged: 2026-02-13 10:18 (UTC+8))
- #34451 Fix num_logprobs parameter description in sampler.py — ready,v1 — by zhuohan123 (merged: 2026-02-13 10:18 (UTC+8))
- #34446 [CI/Build] Update video URLs for testing — ready — by DarkLight1337 (merged: 2026-02-13 10:15 (UTC+8))
- #33373 [Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel — ready — by gmagogsfm (merged: 2026-02-13 10:13 (UTC+8))
- #33198 [Core] Profiler improvements and lazy initialization — frontend,ready,v1,cpu — by jaewonlee-fb (merged: 2026-02-13 08:16 (UTC+8))
- #33195 [Core] Add sleep level 0 mode with enqueue/wait pattern — frontend,ready,v1 — by jaewonlee-fb (merged: 2026-02-13 08:16 (UTC+8))
- #33709 [Frontend] Enable generic structured_outputs for responses API — frontend,ready — by alecsolder (merged: 2026-02-13 08:15 (UTC+8))
- #34378 Use paged_attention_v1 for sliding window decode in rocm_aiter_fa — rocm,ready,v1,meta-exported,fb-exported — by iseeyuan (merged: 2026-02-13 08:14 (UTC+8))
- #34433 [Bugfix] Remove broken raw url GGUF model loading support — bug,ready — by Isotr0py (merged: 2026-02-13 01:40 (UTC+8))
- #33506 [Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 — performance,ready,nvidia — by amitz-nv (merged: 2026-02-13 05:06 (UTC+8))
- #34410 small adjustment to wvSplitKrc — rocm,ready — by amd-hhashemi (merged: 2026-02-13 04:26 (UTC+8))
- #34374 [Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA — bug,ready,nvidia — by mgoin (merged: 2026-02-13 04:08 (UTC+8))
- #34447 [ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility — rocm,ready — by AndreasKaratzas (merged: 2026-02-13 02:47 (UTC+8))
- #34436 Fix MoE for the Transformers modelling backend — ready — by hmellor (merged: 2026-02-13 01:29 (UTC+8))
- #34428 [Voxtral Realtime] Refactor & Improve buffering logic — ready,ci/build,multi-modality — by patrickvonplaten (merged: 2026-02-13 01:46 (UTC+8))
- #33803 [Voxstral Realtime] Enable tests — ready,multi-modality — by patrickvonplaten (merged: 2026-02-13 01:43 (UTC+8))
- #34431 [ROCm][quantization] improve OCP weight quant parser robust — rocm,ready — by xuebwang-amd (merged: 2026-02-13 01:40 (UTC+8))
- #33451 [Attention] Add FlashInfer Sparse MLA backend — documentation,performance,rocm,ready,ci/build,v1,cpu,nvidia — by MatthewBonanni (merged: 2026-02-13 01:21 (UTC+8))
- #34439 [Docs] Spec decoding docs warning removal — documentation — by NickLucche (merged: 2026-02-13 01:01 (UTC+8))
- #34382 [BUG] Reset running requests when clearing cache for pause/resume — bug,ready,v1 — by hao-aaron (merged: 2026-02-13 00:19 (UTC+8))
- #34411 Add config file for fused MoE for Nemotron (TP4, B200) — ready — by danisereb (merged: 2026-02-12 22:09 (UTC+8))
- #34192 [ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter — rocm,ready,ci/build — by dllehr-amd (merged: 2026-02-12 21:06 (UTC+8))
- #34400 [V0 Deprecation] Remove code related to per-request logits processors — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-12 20:44 (UTC+8))
- #34104 Fix Mistral config remap to accept compressed-tensors quantization #34028 — bug,ready — by baonudesifeizhai (merged: 2026-02-12 16:22 (UTC+8))
- #34128 Vllm CPU benchmark suite improvement — documentation,performance,ready,ci/build,cpu — by louie-tsai (merged: 2026-02-12 16:04 (UTC+8))
- #34397 [bugfix] refactor FunASR's _get_data_parser — bug,ready — by AllenDou (merged: 2026-02-12 15:26 (UTC+8))
- #33446 [Bugfix] Fix Sparse24 Compressed Tensors models — bug,ready,nvidia — by kylesayrs (merged: 2026-02-12 15:15 (UTC+8))
- #34085 Fix DeepSeek-OCR tensor validation for all size variants — ready,deepseek — by yichuan-w (merged: 2026-02-12 14:50 (UTC+8))
- #34379 [BugFix] Fix DP chunking — bug,ready — by LucasWilkinson (merged: 2026-02-12 14:44 (UTC+8))
- #34329 [Refactor] Pass Renderer to Input Processor — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-12 11:38 (UTC+8))
PRs Closed Without Merging
- #25801 [Core] Bookkeeping optimization: Batchify updates 1D numpy arrays (e.g. num_tokens, num_tokens_no_spec) — performance,stale,v1 — by Jialin (closed: 2026-02-13 10:17 (UTC+8))
- #34472 [Bugfix] Fix Pydantic BaseModel usage by replacing dataclasses.replace with setattr — bug,frontend — by yashnib (closed: 2026-02-13 10:06 (UTC+8))
- #34151 Revert "[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257)" — no label — by amitz-nv (closed: 2026-02-13 00:28 (UTC+8))
- #32497 [Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor — bug,rocm,v1 — by c0de128 (closed: 2026-02-13 05:36 (UTC+8))
- #34381 [Kernel] adopt mxfp8 grouped_gemm and grouped_quant kernel — ci/build — by EdalatiAli (closed: 2026-02-13 01:53 (UTC+8))
- #34261 [Frontend] Realtime API uses Renderer — frontend,ready,needs-rebase,v1 — by DarkLight1337 (closed: 2026-02-13 00:35 (UTC+8))
- #34367 [BugFix] Skip null blocks when adding cached blocks in current step — bug,v1 — by peakcrosser7 (closed: 2026-02-13 00:19 (UTC+8))
- #34278 [Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd — bug,kv-connector — by shwgao (closed: 2026-02-12 23:33 (UTC+8))
- #34422 [Bug Fix] Fix --served-model-name greedily consuming positional model_tag — bug — by haosdent (closed: 2026-02-12 22:56 (UTC+8))
- #34402 [Misc] Update Qwen3-VL models dummy image size calculation — qwen — by Isotr0py (closed: 2026-02-12 19:58 (UTC+8))