vLLM Development Activity Report - 2025-12-15

Time window: 2025-12-15 10:48 (UTC+8) ~ 2025-12-16 10:48 (UTC+8)
Statistics: New Issues 15 | Closed Issues 20 | New PRs 58 | Merged PRs 29 | PRs Closed Without Merging 19

📊 Daily Development Summary

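During the window from December 15 to 16, vLLM development remained active: 58 PRs were opened and 29 were merged. That is a merged-to-opened ratio of about 50%. Because some of the merged PRs were opened before the window, the figure reflects steady review and integration throughput rather than a same-day merge rate. Community discussion centered on performance optimization (e.g., FP8 token padding, MoE kernel caching), model support (e.g., Qwen3-VL, TokenClassification), and cross-hardware compatibility (e.g., migration to the torch.accelerator API). Several PRs from AMD and Intel engineers also show continued investment from hardware vendors.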

🎯 AMD/ROCm Ecosystem Updates

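In this period, AMD-related contributions focused on bug fixes and performance optimization. None added new features or initial support for new hardware.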

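PR #30719 ([ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition)
- Contributor: AndreasKaratzas
- Technical details: Fixes a race condition in the GPTQ GEMM kernel on ROCm. When input_size > 128, the kernel launches multiple thread blocks along the Z dimension, and these blocks accumulate partial results with atomicAdd. The original code zeroed the output inside the block with blockIdx.z == 0, but __syncthreads() only synchronizes threads within a single block. Blocks at other Z indices could therefore write their results before the zeroing finished, and those contributions were then wiped out, producing numerical errors.
- Impact: Addresses the root cause of incorrect outputs that GPTQ-quantized models could produce on AMD GPUs under these conditions, improving the numerical correctness and reliability of the ROCm backend.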
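
The ordering problem described for PR #30719 can be replayed deterministically in plain Python. This is only a sketch: `run_schedule` and its arguments are hypothetical names, a single float stands in for the output buffer, and `+=` stands in for atomicAdd. The report does not say how the PR fixes the race; zeroing the output before any block runs is one common remedy and is assumed here.

```python
def run_schedule(partials, schedule, zero_inside_kernel):
    """Replay one interleaving of Z-blocks accumulating into one output cell.

    partials[z] is the partial sum computed by block z.
    schedule is the order in which blocks reach their write phase.
    """
    # An output buffer is not guaranteed to be zero on entry; use a junk value.
    out = 123.0
    if not zero_inside_kernel:
        # Assumed remedy: clear the output before any block runs.
        out = 0.0
    for z in schedule:
        if zero_inside_kernel and z == 0:
            # Original pattern: block 0 clears the output, then accumulates.
            out = 0.0
        # Stand-in for atomicAdd(&out, partials[z]).
        out += partials[z]
    return out


partials = [1.0, 2.0, 4.0]  # three Z-blocks, correct total is 7.0
lucky = run_schedule(partials, [0, 1, 2], zero_inside_kernel=True)
racy = run_schedule(partials, [1, 0, 2], zero_inside_kernel=True)
fixed = run_schedule(partials, [1, 0, 2], zero_inside_kernel=False)
print(lucky, racy, fixed)  # 7.0 5.0 7.0
```

When block 1 writes before block 0 clears the cell, block 1's contribution (2.0) is lost. With the zeroing moved ahead of all blocks, every ordering gives the same total.
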
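PR #30730 ([ROCm][Bugfix] Switch auto fallback backend from guidance to outlines)
- Contributor: AndreasKaratzas
- Technical details: Fixes the fallback logic for structured output (backend="auto") when xgrammar cannot handle a request. When a schema uses features that xgrammar does not support (such as multipleOf), the original logic fell back to the guidance backend, which then produced invalid JSON in subsequent generate() calls. This PR switches the fallback backend to outlines so that the output is valid.
- Impact: Improves the reliability and user experience of structured output on ROCm.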
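
The fallback decision described for PR #30730 can be sketched as a schema scan followed by a backend choice. Everything below is illustrative: `UNSUPPORTED_BY_PRIMARY`, `uses_unsupported_feature`, and `choose_backend` are hypothetical names, not vLLM's code, and the keyword set contains only the one example the report mentions.

```python
# Hypothetical: schema keywords the primary backend is assumed not to handle.
UNSUPPORTED_BY_PRIMARY = {"multipleOf"}


def uses_unsupported_feature(schema):
    """Recursively look for any keyword listed in UNSUPPORTED_BY_PRIMARY.

    Simplification: a property that is literally named "multipleOf" would
    also match; a real check would need to be aware of schema structure.
    """
    if isinstance(schema, dict):
        if any(key in UNSUPPORTED_BY_PRIMARY for key in schema):
            return True
        return any(uses_unsupported_feature(v) for v in schema.values())
    if isinstance(schema, list):
        return any(uses_unsupported_feature(v) for v in schema)
    return False


def choose_backend(schema, primary="xgrammar", fallback="outlines"):
    """Pick the primary backend unless the schema needs the fallback."""
    return fallback if uses_unsupported_feature(schema) else primary


price_schema = {
    "type": "object",
    "properties": {"price": {"type": "number", "multipleOf": 0.01}},
}
print(choose_backend(price_schema))        # outlines
print(choose_backend({"type": "string"}))  # xgrammar
```
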
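PR #30714 ([ROCm] [MXFP4] Deepseek Fp4 projection gemms dynamically quantized)
- Contributor: dllehr-amd (AMD employee)
- Technical details: Enables dynamically quantized MXFP4 for the projection-layer GEMMs in DeepSeek models. This is a performance optimization that targets a specific model family and quantization format.
- Impact: Further improves DeepSeek performance on AMD GPUs when MXFP4 quantization is used.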
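PR #30703 ([Bugfix] Fix ViT with FlashAttention on ROCm)
- Contributor: MatthewBonanni
- Technical details: Fixes a ROCm regression introduced by PR #30575. That PR added an fa_version argument to the ViT FlashAttention call, but AMD's flash_attn_varlen_func does not accept this argument, so the call failed.
- Impact: Keeps the ViT encoder in multimodal models working on ROCm.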
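
The failure mode in PR #30703 is a keyword argument passed to a function that does not declare it. One generic way to guard against this, shown below as a sketch, is to filter keyword arguments against the callee's signature. The report does not say that the PR uses this approach; `call_with_supported_kwargs` and the two stand-in attention functions are hypothetical.

```python
import inspect


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments that its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_var_kw:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)


# Hypothetical stand-ins for two platform-specific attention entry points.
def attn_with_version(q, k, v, fa_version=2):
    return ("versioned", fa_version)


def attn_without_version(q, k, v):
    return ("plain", None)


print(call_with_supported_kwargs(attn_with_version, 1, 2, 3, fa_version=3))
print(call_with_supported_kwargs(attn_without_version, 1, 2, 3, fa_version=3))
```

Silently dropping an argument is a trade-off: it avoids the crash, but an explicit platform branch makes the difference in behavior easier to see.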

💬 Notable Discussions

Discussion activity was fairly spread out in this period. The following threads are representative:

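Issue #30696 [RFC]: Per-instance EPLB metrics
- Core topic: Proposes exposing Prometheus metrics for the Expert Parallelism Load Balancer (EPLB), so that a load balancer can see each instance's expert load-balance state and make better routing decisions.
- Viewpoints:
  - Proposer (markmc): The EPLB balance state is currently only written to logs, where load balancers cannot read it. Metrics such as avg_tokens_per_rank and max_tokens_per_rank should be exported to reflect instance health.
  - Participant (robertgshaw2-redhat): Asked whether the metrics should be Counters, with averages computed in PromQL.
  - Proposer's reply: Was unsure how the notion of "balancedness" maps onto a Counter and asked for a clearer intuition.
- Point of contention: Which metric type (Gauge vs. Counter) represents a load-balance state that changes over time both accurately and intuitively.
- Current status: Open; further input on the metric design is requested.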
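
To make the Gauge vs. Counter question in Issue #30696 more concrete, here is one possible reading, sketched in plain Python without a metrics library. It assumes balancedness is defined as average tokens per rank divided by maximum tokens per rank, and it accumulates both quantities in monotonic counters so that a scraper can compute the ratio over any interval from two samples. The class and function names are hypothetical, and this is not the design proposed in the thread.

```python
def step_stats(tokens_per_rank):
    """Return (average, maximum) tokens per rank for one step."""
    avg = sum(tokens_per_rank) / len(tokens_per_rank)
    return avg, max(tokens_per_rank)


class EplbCounters:
    """Monotonic counters; a scraper would take deltas between two samples."""

    def __init__(self):
        self.avg_tokens_total = 0.0
        self.max_tokens_total = 0.0

    def observe(self, tokens_per_rank):
        avg, peak = step_stats(tokens_per_rank)
        self.avg_tokens_total += avg
        self.max_tokens_total += peak

    def sample(self):
        return (self.avg_tokens_total, self.max_tokens_total)


def balancedness_between(before, after):
    """Ratio of the counter deltas over an interval (1.0 means balanced)."""
    d_avg = after[0] - before[0]
    d_max = after[1] - before[1]
    return d_avg / d_max if d_max else 1.0


counters = EplbCounters()
start = counters.sample()
counters.observe([100, 100, 100, 100])  # balanced step: avg 100, max 100
counters.observe([40, 40, 40, 280])     # skewed step: avg 100, max 280
end = counters.sample()
print(round(balancedness_between(start, end), 3))  # 0.526
```

A Gauge would instead report only the ratio of the most recent step (about 0.357 for the skewed step here), which is simpler to read but depends on when the scrape happens.
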
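Issue #30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM
- Core topic: The Intel team proposes consolidating Intel quantization support, which is currently spread across three files (inc.py, auto_round.py, ipex_quant.py), to reduce maintenance cost and provide a single entry point.
- Discussion: Activity is low (only 2 comments), but the RFC comes from an important partner.
- Viewpoints:
  - Participant (robertgshaw2-redhat): Replied "SGTM" (Sounds Good To Me), indicating initial agreement.
- Current status: Open; the proposal has received initial positive feedback and awaits further detail and implementation.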

🔥 Hot Topics and Trends

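Quantization support and fixes: Demand for newer quantization formats (such as NVFP4 and MXFP4) is strong. Issue #30694 requests NVFP4A16 support for MoE models, while closed issues such as #29715 reflect the compatibility and performance problems encountered when running NVFP4 models on newer GPUs such as Blackwell.
MoE performance optimization: MoE models are the current focus of performance work. PR #30718 caches DeepGEMM MoE kernels and reports a 2.1% throughput and 2.2% TTFT improvement in its title; Issue #30727 discusses how to generalize this caching to the MoE modular kernels.
Hardware compatibility and abstraction: This is an important trend. Issue #30679 is an ambitious RFC that proposes replacing hard-coded torch.cuda calls with PyTorch's newer torch.accelerator API across the codebase, to better support non-CUDA devices (such as AMD, Intel, and custom accelerators). One contributor has voiced support.
New models and features: The community continues to push for support of new model architectures (e.g., Olmo2 in #30691) and new task types (e.g., TokenClassification in #30107). Deeper optimization of existing models (e.g., Qwen3-VL and the LLaMa 4 vision model) is also ongoing.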
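
The migration proposed in Issue #30679 amounts to asking PyTorch which accelerator is present instead of assuming CUDA. A minimal sketch, assuming a PyTorch build that may or may not ship the torch.accelerator module (the helper name `pick_device_type` is hypothetical, and the sketch stays importable even when PyTorch is not installed):

```python
try:
    import torch
except ImportError:  # keep the sketch usable without PyTorch installed
    torch = None


def pick_device_type():
    """Return a device type string without hard-coding "cuda"."""
    if torch is None:
        return "cpu"
    accel = getattr(torch, "accelerator", None)
    if accel is not None and accel.is_available():
        # current_accelerator() returns a torch.device such as cuda or xpu.
        return accel.current_accelerator().type
    # Older PyTorch without torch.accelerator: fall back to the CUDA check.
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device_type())
```

The output depends on the machine, so no expected value is shown.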

🛠️ Key Technical Changes

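PR #30705 ([BUILD] use sm_100f when compiling flashmla to fix support on sm103): Fixes compilation support for the FlashMLA kernels on Blackwell B300 (SM103). It replaces the older 10.0a target with the sm_100f family-specific target introduced in CUDA 12.9, which covers all 10.x compute capabilities. Impact: Removes a key blocker to enabling the FlashMLA backend on the newest Blackwell GPUs.
PR #30708 ([Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args): Fixes silent failure in CompilationConfig argument validation. Previously, invalid arguments (such as the old use_inductor) were ignored; they now raise an explicit error. Impact: Makes the API more robust and prevents developers from silently getting unexpected behavior because an argument was misspelled or has been removed.
PR #30670 ([Bugfix] Fix multimodal configuration for Qwen3VL MOE model): Fixes a startup failure in the Qwen3-VL MoE model caused by a missing is_multimodal_pruning_enabled config field. Impact: A small fix, but it restores normal loading of a large multimodal MoE model, which matters a great deal to its users.
PR #30693 ([Refactor] [3/N] Move tool parser tests and run on CPU): As part of the tool parser refactor, moves the related tests so that they run on CPU. Impact: Lowers CI cost and reflects ongoing work on the development workflow.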
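
The behavior change in PR #30708, rejecting unknown arguments instead of ignoring them, can be illustrated with a small dataclass. This is a generic sketch rather than vLLM's CompilationConfig: `ExampleConfig` and its fields `mode` and `backend` are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ExampleConfig:
    """Hypothetical config class used only to illustrate strict validation."""

    mode: int = 0
    backend: str = "inductor"

    @classmethod
    def from_dict(cls, raw):
        """Build a config, raising on any key that is not a declared field."""
        known = {f.name for f in fields(cls)}
        unknown = sorted(set(raw) - known)
        if unknown:
            raise ValueError(f"Unknown config keys: {unknown}")
        return cls(**raw)


print(ExampleConfig.from_dict({"mode": 3}))
try:
    ExampleConfig.from_dict({"use_inductor": True})
except ValueError as err:
    print(err)  # Unknown config keys: ['use_inductor']
```
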

📈 Development Activity Observations

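Active contributors: Roughly 37 distinct contributors were identified among the 58 new PRs (some PRs did not show a user in the data provided). They include AMD (AndreasKaratzas, dllehr-amd), contributors working on NVIDIA GPU support (e.g., Harry-Chen, Isotr0py), Red Hat (robertgshaw2-redhat), and many independent developers, which indicates broad community participation.
Merge throughput: 29 PRs were merged during the 24-hour window, which suggests an efficient review and merge process among the core maintainers. The merged PRs span bug fixes, the build system, performance optimization, documentation, and more.
Cross-team collaboration: Engineers working on AMD, Intel, and NVIDIA hardware are visibly investing in their respective areas (ROCm fixes, quantization consolidation, support for new GPUs), which reflects vLLM's collaborative, multi-hardware character.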

💡 Issues Worth Watching

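Hardware abstraction migration (Issue #30679): The proposal to replace hard-coded torch.cuda calls with the torch.accelerator API is a far-reaching architectural change. Its implementation effort and backward compatibility need careful evaluation by the community, but in the long run it would help vLLM become a genuinely hardware-agnostic inference engine.
FP8 performance optimization strategy (Issue #30717): This RFC on token padding to speed up FP8 GEMMs shows up to a 26.2% throughput improvement on B200. It involves a trade-off between inference latency and compute efficiency, so its final design and implementation are worth following.
Community feature requests: User requests such as NVFP4A16 support for MoE models (#30694) and finer-grained FP8 KV-cache scaling factors (#30685) reflect pain points in real deployments and point to important directions for the project.
Audio model warmup (PR #30706): Proposes warming up preprocessing to address the high first-request latency of speech recognition models such as Whisper. This matters for end-to-end user experience, and the same idea may carry over to other multimodal models.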
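
The token padding idea in Issue #30717 comes down to rounding the token dimension up to a multiple that the GEMM kernel handles efficiently, at the cost of computing on padded rows. A minimal sketch follows; the helper names are hypothetical and the multiple of 4 is only an example value, since the report does not state which alignment the RFC uses.

```python
def pad_token_count(num_tokens, multiple):
    """Round num_tokens up to the next multiple (ceiling division)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-num_tokens // multiple) * multiple


def padding_overhead(num_tokens, multiple):
    """Fraction of the padded batch that is padding rather than real tokens."""
    padded = pad_token_count(num_tokens, multiple)
    return (padded - num_tokens) / padded if padded else 0.0


print(pad_token_count(5, 4))   # 8
print(pad_token_count(8, 4))   # 8
print(padding_overhead(5, 4))  # 0.375
```

The overhead figure is the wasted share of compute, which is the quantity to weigh against any kernel speedup.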

📋 Appendix: Detailed Data

New Issues

#30734 [Bug]: qwen3-1.7b output text truncate when serving with vllm docker。在默认参数下，vllm推理结果异常截断 | bug | by jjxyai (created: 2025-12-16 10:12 (UTC+8))
#30727 [Performance] Optimization through caching the moe modular kernels | no labels | by yewentao256 (created: 2025-12-16 07:45 (UTC+8))
#30721 [Bug]: AttributeError: 'Qwen3VLConfig' object has no attribute 'intermediate_size' | bug | by mobicham (created: 2025-12-16 05:18 (UTC+8))
#30726 [Feature]: Suggestion: make lazy_imports.py faster | feature request | by zou3519 (created: 2025-12-16 07:44 (UTC+8))
#30679 [RFC]: Replace torch.cuda API with torch.accelerator for better hardware compatiblity. | RFC | by jikunshang (created: 2025-12-15 16:23 (UTC+8))
#30722 [Bug]: llama4_pythonic tool parser fails with SyntaxError on nested list parameters | bug | by mphilippnv (created: 2025-12-16 05:26 (UTC+8))
#30717 [RFC]: Token Padding Strategy for FP8 GEMM Performance Optimization | RFC | by 0xjunhao (created: 2025-12-16 04:23 (UTC+8))
#30707 [Bug]: RTX 5080 (SM120) + NVFP4 model fails pre-flight memory check despite model fitting in VRAM | no labels | by Platano78 (created: 2025-12-16 01:52 (UTC+8))
#30696 [RFC]: Per-instance EPLB metrics | feature request | by markmc (created: 2025-12-15 22:27 (UTC+8))
#30694 [Feature]: CompressedTensors: NVFP4A16 not supported for MoE models | feature request | by zhangyimi (created: 2025-12-15 21:29 (UTC+8))
#30691 [Bug]: GGUF model with architecture olmo2 is not supported yet | bug | by lkhml (created: 2025-12-15 19:59 (UTC+8))
#30682 [Bug]: No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). | bug | by shahizat (created: 2025-12-15 16:33 (UTC+8))
#30690 [Bug]: alueError: Call to add_lora method failed: Loading lora ["table_qc failed: No adapter found for /data/home/zhizhehui/dev/train/table_ocr/Reward_model_train/output/qwen_lora_output/checkpoint-57"] | bug | by dgzxx-2000 (created: 2025-12-15 19:08 (UTC+8))
#30685 [Feature]: fp8 kv cache for finer-grained scaling factors (e.g., per channel). | feature request | by zx-ai (created: 2025-12-15 17:32 (UTC+8))
#30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM | RFC | by yiliu30 (created: 2025-12-15 10:58 (UTC+8))

Closed Issues

#24297 [Bug]: Crash on --otlp-traces-endpoint=${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT} when CPU mode | bug,stale | by codefromthecrypt (closed: 2025-12-16 10:42 (UTC+8))
#21505 [Feature]: Full cudagraph support for MLA attention backend with DeepSeek MTP(Speculative decode) | feature request,stale | by qiaoning (closed: 2025-12-16 10:38 (UTC+8))
#21702 [Feature]: Support for returning a value when using wait_for_save in v1 | feature request,stale | by flesher0813 (closed: 2025-12-16 10:38 (UTC+8))
#22007 [Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' | usage,stale | by yokingma (closed: 2025-12-16 10:38 (UTC+8))
#22879 [Feature]: Support GPT-OSS with Function calls for Chat Completion API | feature request,stale | by hetaoBackend (closed: 2025-12-16 10:38 (UTC+8))
#22987 [Usage]: How to suppress the 10 % output drift between speculative and non-speculative modes? | usage,stale | by kiexu (closed: 2025-12-16 10:38 (UTC+8))
#22997 [Bug]: v1 doesn't raise an error if request priority is set while model server not using priority scheduling | bug,stale | by MusaTalluzi-cohere (closed: 2025-12-16 10:38 (UTC+8))
#23007 [Bug]: GPT OSS 120B token usage is 0 on response API, even though it responded back | bug,stale | by xinyu-dev (closed: 2025-12-16 10:38 (UTC+8))
#23025 [Bug]: KeyError: 'layers.1.mlp.experts.w13_weight' for serving GLM 4.5 air | bug,stale | by ChaosAIVision (closed: 2025-12-16 10:38 (UTC+8))
#23037 [Usage]: cpu infernce with kv-cache on gpu | usage,stale | by devops724 (closed: 2025-12-16 10:38 (UTC+8))
#23050 [Bug]: torch.OutOfMemoryError: CUDA out of memory. | bug,stale | by ZJUzengkun (closed: 2025-12-16 10:38 (UTC+8))
#23052 [Feature]: Localhost-by-default, API keys/mTLS, and inference-only tokens for vLLM | feature request,stale | by Cristliu (closed: 2025-12-16 10:38 (UTC+8))
#30734 [Bug]: qwen3-1.7b output text truncate when serving with vllm docker。在默认参数下，vllm推理结果异常截断 | bug | by jjxyai (closed: 2025-12-16 10:32 (UTC+8))
#24689 [Bug]: Llama 3.3 70B hangs with full cuda graph for decode-only | bug,stale | by alexm-redhat (closed: 2025-12-16 07:21 (UTC+8))
#18851 [Bug]: Strange error AssertionError: failed to get the hash of the compiled graph when running Qwen/Qwen3-8B via LLM class | bug,torch.compile,stale,actionable | by vadimkantorov (closed: 2025-12-16 01:45 (UTC+8))
#29294 [CPU Backend] [Doc]: Update Installation Docs for Arm CPUs | documentation,cpu | by fadara01 (closed: 2025-12-16 03:46 (UTC+8))
#29349 [Bug]: Please use the new API settings to control TF32 behavior… | bug | by wasertech (closed: 2025-12-15 20:56 (UTC+8))
#30107 [Feature]: Model Support: Qwen3 Token Classification | feature request | by bd2lcco (closed: 2025-12-15 16:13 (UTC+8))
#26585 [Usage]: use vllm embedding to extract last token hidden states? | usage | by rxqy (closed: 2025-12-15 14:54 (UTC+8))
#29715 [Bug]: Qwen3-VL fails during multimodal encoder profiling (expected 3 dims, got 2) on Blackwell + NVFP4 (FlashInfer) even after CUDA header fix | bug | by Firworksyt (closed: 2025-12-15 14:31 (UTC+8))

New PRs

#30733 improve lazy import test | ready | by BoyuanFeng (created: 2025-12-16 10:05 (UTC+8))
#30735 add Qwen3OmniMoeAudioEncoder and support torch compile | qwen | by XiaobingSuper (created: 2025-12-16 10:13 (UTC+8))
#30716 fused_moe_lora PDL improvements | ready | by gnovack (created: 2025-12-16 04:18 (UTC+8))
#30732 [CI/Build] Allow user to configure NVSHMEM version via ENV or command line | ci/build | by eicherseiji (created: 2025-12-16 09:56 (UTC+8))
#30731 [Bugfix] Fix broken ViT attention selection for Blackwell device | ready,nvidia | by Isotr0py (created: 2025-12-16 09:37 (UTC+8))
#30729 [Perf] re-enable flashinfer rotary_embedding custom ops | no labels | by jiahanc (created: 2025-12-16 08:13 (UTC+8))
#30730 [ROCm][Bugfix] fix(structured_output): Switch auto fallback backend from guidance to outlines | rocm,structured-output,v1 | by AndreasKaratzas (created: 2025-12-16 08:58 (UTC+8))
#30705 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 | ready,ci/build | by Harry-Chen (created: 2025-12-16 00:59 (UTC+8))
#30728 [BugFix] skip aot_compile for a unit test | no labels | by BoyuanFeng (created: 2025-12-16 07:51 (UTC+8))
#30688 chores: adjust the attn register param order | ready | by ILikeIneine (created: 2025-12-15 19:03 (UTC+8))
#30718 [Perf] Cache for deepgemm moe, 2.1% Throuput improvement, 2.2% TTFT improvement | ready | by yewentao256 (created: 2025-12-16 04:49 (UTC+8))
#30725 light version of prefix caching for hybrid models gdn attention | v1,qwen | by joennlae (created: 2025-12-16 07:22 (UTC+8))
#30724 [WIP] Fix edge case Mistral tool parser | no labels | by joa-stdn (created: 2025-12-16 06:44 (UTC+8))
#30713 [TRTLLM] Remove the MoE GEMM weight name change | bug,ready,nvidia | by minosfuture (created: 2025-12-16 03:34 (UTC+8))
#30709 [Misc][LLaMa4] Compile LLaMa Vision Encoder | llama | by Lucaskabela (created: 2025-12-16 02:34 (UTC+8))
#30710 [UX][Attention] Add attention_config argument to LLM() | frontend,ready,ci/build | by MatthewBonanni (created: 2025-12-16 02:57 (UTC+8))
#30719 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition | rocm | by AndreasKaratzas (created: 2025-12-16 05:01 (UTC+8))
#30723 [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test | ready,ci/build,qwen,nvidia | by mgoin (created: 2025-12-16 05:43 (UTC+8))
#30711 Update note comment for flashinfer attention warmup | ready | by mgoin (created: 2025-12-16 03:23 (UTC+8))
#30720 [Feature]nvfp4 universal fallback emulation | no labels | by Rob-P-Smith (created: 2025-12-16 05:16 (UTC+8))
#30673 Enable GDC for regular Triton MoE by calling mm_k from Lora | no labels | by RunkaiTao (created: 2025-12-15 15:19 (UTC+8))
#30699 [Bugfix] Skip missing parameters during GGUF Gemma2 weight loading | no labels | by kitaekatt (created: 2025-12-15 23:45 (UTC+8))
#30702 [Bugfix] Handle missing config.json in speculator probe for GGUF models | no labels | by kitaekatt (created: 2025-12-15 23:59 (UTC+8))
#30704 Update batch invariant to use attention config | ready,v1 | by MatthewBonanni (created: 2025-12-16 00:54 (UTC+8))
#30708 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args | bug,performance,ready,torch.compile | by mgoin (created: 2025-12-16 01:58 (UTC+8))
#30715 [Do not merge][Test] Revert "[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2" | ready,v1,deepseek,gpt-oss,nvidia | by LucasWilkinson (created: 2025-12-16 03:59 (UTC+8))
#30714 [ROCm] [MXFP4] Deepseek Fp4 projection gemms dynamically quantized | rocm,v1,deepseek | by dllehr-amd (created: 2025-12-16 03:46 (UTC+8))
#30703 [Bugfix] Fix ViT with FlashAttention on ROCm | rocm,ready | by MatthewBonanni (created: 2025-12-16 00:18 (UTC+8))
#30680 [Model] Add video input support for transformers modeling backend | documentation,v1,multi-modality | by ch3nku1 (created: 2025-12-15 16:24 (UTC+8))
#30712 [Mamba] Removed disable cascade attn in MambaModelConfig | no labels | by Josephasafg (created: 2025-12-16 03:32 (UTC+8))
#30670 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model | ready,qwen,nvidia | by maxyanghu (created: 2025-12-15 14:51 (UTC+8))
#30678 [CPU] Add action to automatically label CPU related PRs | ci/build | by fadara01 (created: 2025-12-15 16:22 (UTC+8))
#30697 [Refactor] EPLB rebalance algo to NumPy | no labels | by ilmarkov (created: 2025-12-15 22:33 (UTC+8))
#30695 Remove SkipValidation from ModelConfig | ready | by hmellor (created: 2025-12-15 22:07 (UTC+8))
#30706 fix: add warmup for audio preprocessing | frontend | by TheCodeWrangler (created: 2025-12-16 01:36 (UTC+8))
#30700 feat(api): Eager chat template warmup to eliminate first-request latency | frontend | by TheCodeWrangler (created: 2025-12-15 23:47 (UTC+8))
#30671 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition | frontend,ready | by mondaylord (created: 2025-12-15 15:08 (UTC+8))
#30701 Dynres | no labels | by netanel-haber (created: 2025-12-15 23:50 (UTC+8))
#30698 feat(api): Eager chat template warmup to eliminate first-request latency | frontend | by TheCodeWrangler (created: 2025-12-15 23:18 (UTC+8))
#30693 [Refactor] [3/N] Move tool parser tests and run on CPU | ready,ci/build,qwen,deepseek | by DarkLight1337 (created: 2025-12-15 21:07 (UTC+8))
#30684 [MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface | tpu,v1,llama | by Isotr0py (created: 2025-12-15 16:39 (UTC+8))
#30692 OffloadingConnector: Support kernel_block_size != block_size | v1 | by orozery (created: 2025-12-15 20:55 (UTC+8))
#30681 [Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache | documentation,v1,nvidia | by jikunshang (created: 2025-12-15 16:24 (UTC+8))
#30675 [Refactor] [2/N] Move tool parsers into the vLLM main directory | documentation,frontend,ready,tool-calling,llama,qwen,deepseek,gpt-oss | by chaunceyjiang (created: 2025-12-15 16:04 (UTC+8))
#30687 Triton Attention: Support cross-layers blocks | v1 | by orozery (created: 2025-12-15 18:19 (UTC+8))
#30686 [WIP] Fix docker build cache | ready,ci/build | by wzshiming (created: 2025-12-15 17:36 (UTC+8))
#30689 [Feature]: Prometheus Metrics Abstraction | speculative-decoding,needs-rebase,v1,kv-connector | by mladjan-gadzic (created: 2025-12-15 19:05 (UTC+8))
#30674 [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model | ready,qwen | by iwzbi (created: 2025-12-15 15:35 (UTC+8))
#30672 [Model][Last/N] Improve all pooling task: Generate runner supports using embed and token_embed tasks. | frontend,v1 | by noooop (created: 2025-12-15 15:19 (UTC+8))
#30683 [UT][PCP&DCP] UT for block_table.py | v1 | by pisceskkk (created: 2025-12-15 16:38 (UTC+8))
#30676 Sihao issue586 fix | v1 | by 1643661061leo (created: 2025-12-15 16:06 (UTC+8))
#30677 [Docs] Update design/multiprocessing.md | documentation | by windsonsea (created: 2025-12-15 16:06 (UTC+8))
#30666 [Model] Automatic conversion of TokenClassification model | ready | by noooop (created: 2025-12-15 12:30 (UTC+8))
#30665 增加动态端口 | v1 | by 1643661061leo (created: 2025-12-15 11:43 (UTC+8))
#30669 [Doc] Add AI Badgr framework integration documentation | documentation | by miguelmanlyx (created: 2025-12-15 14:07 (UTC+8))
#30668 [DSv32] Short sequence prefill using MHA | v1 | by qianlihuang (created: 2025-12-15 13:33 (UTC+8))
#30667 Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing | v1 | by storyicon (created: 2025-12-15 12:38 (UTC+8))
#30664 Phase 3 hybrid attention | documentation,performance,new-model,speculative-decoding,v1,llama,qwen | by RGBmarya (created: 2025-12-15 11:37 (UTC+8))

Merged PRs

#30705 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 | ready,ci/build | by Harry-Chen (merged: 2025-12-16 06:48 (UTC+8))
#30237 [Bugfix] fix streaming final output for non harmony | frontend,ready,gpt-oss | by penfree (merged: 2025-12-16 09:03 (UTC+8))
#30212 [Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform | rocm,tpu,ready,nvidia | by Isotr0py (merged: 2025-12-16 01:36 (UTC+8))
#30657 [Log] Skip piecewise cudagraph warn when using full cudagraph | ready,nvidia | by BoyuanFeng (merged: 2025-12-15 10:49 (UTC+8))
#30710 [UX][Attention] Add attention_config argument to LLM() | frontend,ready,ci/build | by MatthewBonanni (merged: 2025-12-16 06:29 (UTC+8))
#30529 [Benchmarks] auto_tune.sh: Use hostname variable for server requests | performance,ready | by KevinMusgrave (merged: 2025-12-16 06:00 (UTC+8))
#30704 Update batch invariant to use attention config | ready,v1 | by MatthewBonanni (merged: 2025-12-16 04:24 (UTC+8))
#30708 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args | bug,performance,ready,torch.compile | by mgoin (merged: 2025-12-16 04:18 (UTC+8))
#30703 [Bugfix] Fix ViT with FlashAttention on ROCm | rocm,ready | by MatthewBonanni (merged: 2025-12-16 03:45 (UTC+8))
#30594 [docs][fix] Update Arm CPU vLLM wheel installation docs | documentation,ready | by fadara01 (merged: 2025-12-16 03:46 (UTC+8))
#30670 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model | ready,qwen,nvidia | by maxyanghu (merged: 2025-12-15 22:06 (UTC+8))
#30695 Remove SkipValidation from ModelConfig | ready | by hmellor (merged: 2025-12-16 01:34 (UTC+8))
#30671 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition | frontend,ready | by mondaylord (merged: 2025-12-16 00:13 (UTC+8))
#30040 [Frontend] add tools for dsv32 developer role | frontend,ready | by yjc9696 (merged: 2025-12-15 23:08 (UTC+8))
#30627 [MoE][Refactor 1/N] Separate Online Quantization | ready | by robertgshaw2-redhat (merged: 2025-12-15 22:54 (UTC+8))
#30693 [Refactor] [3/N] Move tool parser tests and run on CPU | ready,ci/build,qwen,deepseek | by DarkLight1337 (merged: 2025-12-15 21:45 (UTC+8))
#30675 [Refactor] [2/N] Move tool parsers into the vLLM main directory | documentation,frontend,ready,tool-calling,llama,qwen,deepseek,gpt-oss | by chaunceyjiang (merged: 2025-12-15 20:54 (UTC+8))
#29805 [Misc][Hybrid allocator + kv connector] Optionally enable hybrid allocator + KV cache connector | ready,v1,kv-connector | by NickLucche (merged: 2025-12-15 19:17 (UTC+8))
#30674 [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model | ready,qwen | by iwzbi (merged: 2025-12-15 18:38 (UTC+8))
#30648 [Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template | frontend,ready | by seokhyunan (merged: 2025-12-15 12:21 (UTC+8))
#30666 [Model] Automatic conversion of TokenClassification model | ready | by noooop (merged: 2025-12-15 16:13 (UTC+8))
#30552 typing: Add type hints to TurnMetrics class in context.py | frontend,ready,gpt-oss | by yurekami (merged: 2025-12-15 15:00 (UTC+8))
#28439 [New Model] BAGEL support (AR only) | documentation,new-model,ready,qwen | by princepride (merged: 2025-12-15 14:58 (UTC+8))
#30125 [CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. | rocm,tpu,ready,multi-modality,qwen,nvidia | by shen-shanshan (merged: 2025-12-15 11:13 (UTC+8))
#30662 [XPU] fix Dockerfile.xpu, avoid wheel conflicts | ready,ci/build | by jikunshang (merged: 2025-12-15 13:32 (UTC+8))
#30658 [Bugfix] Fix deepseek_v32 tokenizer_mode | structured-output,frontend,ready,v1,deepseek | by jeejeelee (merged: 2025-12-15 12:20 (UTC+8))
#30282 [Feat] Refactor for parallel_config in FusedMoEModularKernel | ready,nvidia | by yewentao256 (merged: 2025-12-15 12:21 (UTC+8))
#30649 additional protection for CVE-2025-62164 | frontend,ready,multi-modality | by wenqiglantz (merged: 2025-12-15 11:07 (UTC+8))
#30547 [CustomOp] Support object-level enable for CustomOp | ready | by shen-shanshan (merged: 2025-12-15 11:02 (UTC+8))

PRs Closed Without Merging

#24936 [WIP][performance] DP for ViT in Intern S1 model | stale | by hsliuustc0106 (closed: 2025-12-16 10:45 (UTC+8))
#30588 Fix edge case Mistral tool parser | documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 | by joa-stdn (closed: 2025-12-16 06:44 (UTC+8))
#30720 [Feature]nvfp4 universal fallback emulation | no labels | by Rob-P-Smith (closed: 2025-12-16 05:25 (UTC+8))
#30561 feat(serve): add warmup support for consistent first-request performance | documentation,frontend | by TheCodeWrangler (closed: 2025-12-15 23:52 (UTC+8))
#27900 [Misc][LLaMa4] Compile LLaMa Vision Encoder layers | documentation,performance,new-model,rocm,structured-output,frontend,tpu,ready,ci/build,v1 | by Lucaskabela (closed: 2025-12-16 00:31 (UTC+8))
#30500 feat(gguf): Extract HF config from GGUF metadata for repos without config.json | needs-rebase | by kitaekatt (closed: 2025-12-16 00:49 (UTC+8))
#30497 fix(gguf): GGUF model support fixes for Blackwell GPUs | structured-output,needs-rebase,v1 | by kitaekatt (closed: 2025-12-15 23:56 (UTC+8))
#30421 fix(gemma2): Skip missing parameters during GGUF weight loading | structured-output,v1 | by kitaekatt (closed: 2025-12-15 23:45 (UTC+8))
#30698 feat(api): Eager chat template warmup to eliminate first-request latency | frontend | by TheCodeWrangler (closed: 2025-12-15 23:33 (UTC+8))
#30567 [Bug] Fix AttributeError: 'Qwen3VLMoeConfig' object has no attribute 'intermediate_size' | ready,needs-rebase,qwen | by yewentao256 (closed: 2025-12-15 23:18 (UTC+8))
#30556 feat: batched shared encoder for whisper beam search | documentation,performance,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality | by TheCodeWrangler (closed: 2025-12-15 22:54 (UTC+8))
#19095 fix: cuda 12.6 installation | ready,needs-rebase,ci/build,stale | by mickaelseznec (closed: 2025-12-15 18:31 (UTC+8))
#26180 [Benchmark] Add probability density function based sampling in RandomDataset | performance | by mickaelseznec (closed: 2025-12-15 18:30 (UTC+8))
#30665 增加动态端口 | v1 | by 1643661061leo (closed: 2025-12-15 15:51 (UTC+8))
#26622 [WIP][CustomOp] Make deepseek indexer custom op. | deepseek | by whx-sjtu (closed: 2025-12-15 15:48 (UTC+8))
#23722 [Misc] Moved override for allreduce fusion thresholds from env var to config | torch.compile,stale | by nvjullin (closed: 2025-12-15 15:28 (UTC+8))
#27147 [MM Encoder]: Wrap mm encoder attention interface as CustomOps | tpu,needs-rebase,v1,llama | by Isotr0py (closed: 2025-12-15 15:12 (UTC+8))
#30668 [DSv32] Short sequence prefill using MHA | v1 | by qianlihuang (closed: 2025-12-15 14:05 (UTC+8))
#30572 Add additional protection for CVE-2025-62164 | frontend,ready,multi-modality | by russellb (closed: 2025-12-15 10:55 (UTC+8))