vLLM 开发动态报告 - 2025-12-15
时间窗口: 2025-12-15 10:48 (UTC+8) ~ 2025-12-16 10:48 (UTC+8)
数据统计: 新 Issue 15 | 关闭 Issue 20 | 新 PR 58 | 合并 PR 29 | 关闭未合并 PR 19
📊 每日开发状态摘要
在12月15日至16日的时间窗口内,vLLM项目开发活动保持活跃,共新增58个PR,合并29个,合并率约50%,表明代码审查与集成效率较高。社区讨论集中在性能优化(如FP8填充、MoE内核缓存)、模型支持(如Qwen3-VL、TokenClassification)以及跨硬件兼容性(如torch.accelerator API迁移)等主题上。同时,多个由AMD及Intel工程师提交的PR显示了硬件生态厂商对项目支持的持续投入。
🎯 AMD/ROCm 生态相关动态
本周期内,AMD生态相关贡献集中在Bug修复和性能优化,未涉及新功能或新硬件的初始支持。
- PR #30719 ([ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition)
- 贡献者: AndreasKaratzas
- 技术细节: 修复了ROCm平台上GPTQ GEMM内核中的竞态条件。当`input_size > 128`时,内核会启动多个Z维度的线程块,通过`atomicAdd`累加部分结果。原代码在`blockIdx.z == 0`的块内初始化输出为0,但`__syncthreads()`仅同步块内线程,导致其他Z维度的块可能在清零完成前写入结果,造成数值错误。
- 影响: 从根本上解决了特定条件下GPTQ量化模型在AMD GPU上可能产生错误输出的问题,提升了ROCm后端的数值稳定性和可靠性。
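上述清零竞态可以用一个纯Python的线程类比来示意(仅为概念演示,并非vLLM/HIP内核的真实实现,函数名均为假设):输出缓冲区在"启动"所有Z块之前就清零,这样任何块的累加都不可能先于清零发生。

```python
import threading

def gemm_z_reduce(partials):
    # 模拟多个 Z 维度线程块通过 atomicAdd 累加部分和。
    # 修复思路:在"启动内核"之前先把输出清零(原来的写法只让
    # blockIdx.z == 0 的块清零,而 __syncthreads() 无法跨块同步,
    # 其他块可能在清零完成前就写入了结果)。
    out = [0.0]                 # 输出缓冲区,提前清零
    lock = threading.Lock()     # 用锁近似 atomicAdd 的原子性

    def z_block(rank):
        with lock:
            out[0] += partials[rank]

    threads = [threading.Thread(target=z_block, args=(r,))
               for r in range(len(partials))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]
```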
- PR #30730 ([ROCm][Bugfix] Switch auto fallback backend from guidance to outlines)
- 贡献者: AndreasKaratzas
- 技术细节: 修复了结构化输出(`backend="auto"`)在从xgrammar回退时的逻辑。当遇到xgrammar不支持的schema特性(如`multipleOf`)时,原逻辑会回退到guidance后端,但该后端在后续`generate()`调用中会产生无效JSON。本PR将回退后端改为outlines,确保了正确性。
- 影响: 提升了ROCm平台上使用结构化输出功能的用户体验和可靠性。
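修复后的回退逻辑大致相当于下面的选择函数(示意性质,常量与函数名为假设,并非vLLM真实代码):

```python
# xgrammar 不支持的 JSON Schema 特性(示例集合,非完整列表)
XGRAMMAR_UNSUPPORTED = {"multipleOf"}

def choose_structured_output_backend(schema_features, backend="auto"):
    # 用户显式指定后端时直接使用
    if backend != "auto":
        return backend
    if schema_features & XGRAMMAR_UNSUPPORTED:
        # 旧逻辑回退到 guidance,但其 generate() 会产出无效 JSON;
        # 本 PR 将回退目标改为 outlines
        return "outlines"
    return "xgrammar"
```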
- PR #30714 ([ROCm] [MXFP4] Deepseek Fp4 projection gemms dynamically quantized)
- 贡献者: dllehr-amd(AMD员工)
- 技术细节: 为DeepSeek模型的投影层GEMM操作启用了动态量化的MXFP4支持。这是一个针对特定模型和量化格式的性能优化PR。
- 影响: 进一步优化了DeepSeek模型在AMD GPU上使用MXFP4量化时的性能。
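MXFP4"动态量化"的核心是在运行时按块计算2的幂次共享缩放因子,再把元素舍入到FP4(E2M1)的离散格点。下面是一个与硬件无关的纯Python示意(块划分与就近舍入策略为简化假设,并非该PR的实际内核实现):

```python
import math

# FP4 E2M1 可表示的非负幅值格点
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """对一个块做动态量化:返回 (2的幂缩放因子, 量化后的值列表)。"""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # 选取 2 的幂缩放,使块内最大幅值落到 FP4 最大格点 6.0 附近
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    q = [math.copysign(min(FP4_GRID, key=lambda g: abs(g - abs(x) / scale)), x)
         for x in block]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]
```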
- PR #30703 ([Bugfix] Fix ViT with FlashAttention on ROCm)
- 贡献者: MatthewBonanni
- 技术细节: 修复了PR #30575引入的、在ROCm平台上的一个回归问题。该PR为ViT的FlashAttention调用添加了`fa_version`参数,但AMD的`flash_attn_varlen_func`并不接受此参数,导致失败。
- 影响: 确保了多模态模型中的ViT编码器在ROCm平台上能够继续正常工作。
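对这类"不同后端函数签名不一致"的问题,一个通用的防御性写法是只传目标函数真正接受的参数(示意代码;该PR的实际修复更可能是按平台条件传参):

```python
import inspect

def call_with_supported_kwargs(fn, *args, **kwargs):
    # 丢弃目标函数不接受的关键字参数(例如 ROCm 的
    # flash_attn_varlen_func 不接受 fa_version),避免 TypeError
    params = inspect.signature(fn).parameters
    has_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                     for p in params.values())
    if not has_var_kw:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)
```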
💬 高热度讨论分析
本期讨论热度相对分散,以下是几个有代表性的讨论:
- Issue #30696 [RFC]: Per-instance EPLB metrics
- 核心议题: 提议为专家并行负载均衡(EPLB)暴露Prometheus指标,以便负载均衡器能感知实例的专家负载均衡状态,从而做出更智能的路由决策。
- 不同观点:
- 提出者 (markmc): 当前EPLB的平衡状态仅输出到日志,负载均衡器无法获取。需要导出`avg_tokens_per_rank`、`max_tokens_per_rank`等指标来反映实例健康度。
- 参与者 (robertgshaw2-redhat): 询问是否应将指标设为`Counter`类型,再通过PromQL计算平均值。
- 提出者回应: 对如何将"平衡性"概念转化为`Counter`类型感到困惑,寻求更清晰的直觉解释。
- 争议焦点: 如何设计既准确又直观的指标类型(Gauge vs Counter)来表征动态的负载平衡状态。
- 当前状态: 讨论开放,寻求关于指标设计的进一步建议。
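以RFC中提到的指标为例,"平衡性"更自然地落在Gauge(瞬时快照)而非Counter(单调累加)上。下面的纯Python示意计算这几个候选指标(指标命名与聚合窗口为假设,仅用于说明Gauge语义):

```python
def eplb_balance_metrics(tokens_per_rank):
    # tokens_per_rank: 每个 EP rank 在最近窗口内处理的 token 数
    n = len(tokens_per_rank)
    avg = sum(tokens_per_rank) / n
    mx = max(tokens_per_rank)
    return {
        "avg_tokens_per_rank": avg,
        "max_tokens_per_rank": mx,
        # 1.0 表示完全均衡;越大说明越存在热点 rank。
        # 该比值随时间上下波动,是当前状态的快照,
        # 因此更适合 Gauge 而非单调递增的 Counter
        "imbalance_ratio": mx / avg if avg else 0.0,
    }
```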
- Issue #30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM
- 核心议题: Intel团队提出RFC,旨在整合当前分散在三个文件(`inc.py`、`auto_round.py`、`ipex_quant.py`)中的Intel量化支持,以降低维护成本并提供统一入口。
- 讨论情况: 讨论热度不高(仅2条评论),但涉及重要合作方。
- 观点:
- 参与者 (robertgshaw2-redhat): 简单回复“SGTM”(Sounds Good To Me),表示初步赞同。
- 当前状态: RFC开放,提案获得了初步积极反馈,等待进一步细化和实施。
🔥 热门话题与趋势分析
- 量化支持扩展与问题修复:社区对新型量化格式(如NVFP4, MXFP4)的支持需求旺盛。相关Issue(#30694)请求为MoE模型添加NVFP4A16支持,而多个已关闭的Issue(如#29715)则反映了在Blackwell等新GPU上运行NVFP4模型时遇到的各种兼容性和性能问题。
- MoE模型性能优化:MoE模型是当前性能优化的焦点。PR #30718 通过缓存DeepGEMM MoE内核获得了吞吐量提升;Issue #30727 则围绕如何将MoE模块化内核的缓存优化方案通用化展开了讨论。
- 硬件兼容性与抽象化:这是一个重要趋势。Issue #30679 提出了一个雄心勃勃的RFC,建议用PyTorch新的`torch.accelerator` API 全面替换硬编码的`torch.cuda`调用,以提升对非CUDA设备(如AMD、Intel、自定义加速器)的友好性,获得了一位贡献者的支持。
- 新模型与特性支持:社区不断推动对新模型架构(如Olmo2 in #30691)和新任务类型(如TokenClassification in #30107)的支持。同时,对现有模型(如Qwen3-VL、LLaMa 4视觉模型)的深度优化也在持续进行。
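上文提到的 PR #30718 的缓存思路,可以抽象为按问题形状记忆化内核选择/构建结果(以下为概念示意,函数名为假设,真实实现位于 DeepGEMM MoE 路径中):

```python
import functools

def _select_moe_kernel(m, n, k, dtype):
    # 假设的耗时操作:内核选择 / JIT 构建(此处仅返回描述字符串)
    return f"deepgemm_moe_{m}x{n}x{k}_{dtype}"

@functools.lru_cache(maxsize=None)
def get_moe_kernel(m, n, k, dtype):
    # 以 (m, n, k, dtype) 为键缓存结果:解码阶段的形状高度重复,
    # 命中缓存即可跳过重复的内核选择开销
    return _select_moe_kernel(m, n, k, dtype)
```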
🛠️ 重点技术变更
- PR #30705 ([BUILD] use sm_100f when compiling flashmla to fix support on sm103):此PR修复了FlashMLA内核在Blackwell B300(SM103)上的编译支持。通过使用CUDA 12.9引入的`sm_100f`系列指定符替代旧的`10.0a`,确保了对所有10.x计算能力的兼容。影响:直接解决了在新一代Blackwell架构GPU上启用FlashMLA后端的关键阻碍。
- PR #30708 ([Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args):此PR修复了编译配置(`CompilationConfig`)参数验证的静默失败问题。此前,传入无效参数(如旧的`use_inductor`)会被忽略,现在会明确报错。影响:提升了API的健壮性和用户体验,避免开发者因参数拼写错误或使用已废弃参数而得不到预期行为。
- PR #30670 ([Bugfix] Fix multimodal configuration for Qwen3VL MOE model):此PR修复了Qwen3-VL MoE模型因缺少`is_multimodal_pruning_enabled`配置字段而启动失败的问题。影响:虽然是一个小修复,但确保了大型多模态MoE模型的正常加载,对用户至关重要。
- PR #30693 ([Refactor] [3/N] Move tool parser tests and run on CPU):作为工具解析器代码重构的一部分,此PR将相关测试移至CPU运行。影响:降低了CI测试成本,是项目持续优化开发流程的体现。
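PR #30708 的"显式报错而非静默忽略"可以用如下极简示意表达(类与字段为简化假设,并非vLLM中`CompilationConfig`的真实定义):

```python
from dataclasses import dataclass, fields

@dataclass
class CompilationConfig:
    level: int = 0
    backend: str = "inductor"

    @classmethod
    def from_dict(cls, d):
        valid = {f.name for f in fields(cls)}
        unknown = set(d) - valid
        if unknown:
            # 修复后的行为:未知/已废弃参数直接报错,而不是被忽略
            raise ValueError(f"invalid CompilationConfig args: {sorted(unknown)}")
        return cls(**d)
```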
📈 开发活跃度观察
- 贡献者活跃:在新增的58个PR中,识别到约37位不同的贡献者(部分PR未在提供数据中显示用户),包括来自AMD(`AndreasKaratzas`、`dllehr-amd`)、NVIDIA(如`Harry-Chen`、`Isotr0py`)、Red Hat(`robertgshaw2-redhat`)等团队及众多独立开发者,社区参与度高。
- 合并效率:29个PR在24小时内被合并,表明核心维护团队审查与合并流程高效。合并的PR涵盖了Bug修复、构建系统、性能优化、文档等多个方面。
- 跨团队协作:明显可见来自AMD、Intel、NVIDIA等硬件厂商的工程师在各自擅长的领域(如ROCm修复、量化整合、新GPU支持)进行深度投入,体现了vLLM作为多硬件平台项目的协作特性。
💡 值得关注的问题
- 硬件抽象化迁移 (Issue #30679):用`torch.accelerator` API替换`torch.cuda`硬编码的提议,是一个影响深远的架构性变更。其实施难度和向后兼容性需要社区仔细评估,但长期看有利于vLLM成为真正的硬件无关推理引擎。
- FP8性能优化策略 (Issue #30717):关于通过令牌填充(Padding)来优化FP8 GEMM性能的RFC,在B200上展示了高达26.2%的吞吐量提升。这涉及在推理延迟和计算效率之间的权衡,其最终设计和实现方式值得关注。
- 社区功能需求:一些由用户提出的功能请求,如为MoE模型增加NVFP4A16支持(#30694)、支持更细粒度的FP8 KV-Cache缩放因子(#30685)等,反映了实际部署中的痛点,是项目未来发展的重要方向。
- 音频模型预热优化 (PR #30706):针对Whisper等语音识别模型首次请求延迟过高的问题,提出了预处理预热方案。这对于提升端到端用户体验有重要意义,其设计思路可能被应用到其他多模态模型。
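以 #30717 的令牌填充为例,其基本操作只是把 token 数向上对齐到某个粒度,以命中 FP8 GEMM 更高效的 tile 配置(对齐粒度64为示意值,实际取值取决于内核配置):

```python
def padded_num_tokens(num_tokens: int, align: int = 64) -> int:
    # 向上取整到 align 的整数倍
    return -(-num_tokens // align) * align

def padding_overhead(num_tokens: int, align: int = 64) -> float:
    # 被填充(浪费)的计算比例:权衡点在于
    # tile 利用率带来的收益是否大于这部分额外计算
    padded = padded_num_tokens(num_tokens, align)
    return (padded - num_tokens) / padded
```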
📋 附录:详细数据列表
新增 Issue
- #30734 [Bug]: qwen3-1.7b output text truncate when serving with vllm docker。在默认参数下,vllm推理结果异常截断 — bug — by jjxyai (创建于: 2025-12-16 10:12 (UTC+8))
- #30727 [Performance] Optimization through caching the moe modular kernels — 无标签 — by yewentao256 (创建于: 2025-12-16 07:45 (UTC+8))
- #30721 [Bug]: AttributeError: 'Qwen3VLConfig' object has no attribute 'intermediate_size' — bug — by mobicham (创建于: 2025-12-16 05:18 (UTC+8))
- #30726 [Feature]: Suggestion: make lazy_imports.py faster — feature request — by zou3519 (创建于: 2025-12-16 07:44 (UTC+8))
- #30679 [RFC]: Replace `torch.cuda` API with `torch.accelerator` for better hardware compatiblity. — RFC — by jikunshang (创建于: 2025-12-15 16:23 (UTC+8))
- #30722 [Bug]: llama4_pythonic tool parser fails with SyntaxError on nested list parameters — bug — by mphilippnv (创建于: 2025-12-16 05:26 (UTC+8))
- #30717 [RFC]: Token Padding Strategy for FP8 GEMM Performance Optimization — RFC — by 0xjunhao (创建于: 2025-12-16 04:23 (UTC+8))
- #30707 [Bug]: RTX 5080 (SM120) + NVFP4 model fails pre-flight memory check despite model fitting in VRAM — 无标签 — by Platano78 (创建于: 2025-12-16 01:52 (UTC+8))
- #30696 [RFC]: Per-instance EPLB metrics — feature request — by markmc (创建于: 2025-12-15 22:27 (UTC+8))
- #30694 [Feature]: CompressedTensors: NVFP4A16 not supported for MoE models — feature request — by zhangyimi (创建于: 2025-12-15 21:29 (UTC+8))
- #30691 [Bug]: GGUF model with architecture olmo2 is not supported yet — bug — by lkhml (创建于: 2025-12-15 19:59 (UTC+8))
- #30682 [Bug]: No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). — bug — by shahizat (创建于: 2025-12-15 16:33 (UTC+8))
- #30690 [Bug]: ValueError: Call to add_lora method failed: Loading lora ["table_qc failed: No adapter found for /data/home/zhizhehui/dev/train/table_ocr/Reward_model_train/output/qwen_lora_output/checkpoint-57"] — bug — by dgzxx-2000 (创建于: 2025-12-15 19:08 (UTC+8))
- #30685 [Feature]: fp8 kv cache for finer-grained scaling factors (e.g., per channel). — feature request — by zx-ai (创建于: 2025-12-15 17:32 (UTC+8))
- #30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM — RFC — by yiliu30 (创建于: 2025-12-15 10:58 (UTC+8))
已关闭 Issue
- #24297 [Bug]: Crash on --otlp-traces-endpoint=${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT} when CPU mode — bug,stale — by codefromthecrypt (关闭于: 2025-12-16 10:42 (UTC+8))
- #21505 [Feature]: Full cudagraph support for MLA attention backend with DeepSeek MTP(Speculative decode) — feature request,stale — by qiaoning (关闭于: 2025-12-16 10:38 (UTC+8))
- #21702 [Feature]: Support for returning a value when using wait_for_save in v1 — feature request,stale — by flesher0813 (关闭于: 2025-12-16 10:38 (UTC+8))
- #22007 [Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' — usage,stale — by yokingma (关闭于: 2025-12-16 10:38 (UTC+8))
- #22879 [Feature]: Support GPT-OSS with Function calls for Chat Completion API — feature request,stale — by hetaoBackend (关闭于: 2025-12-16 10:38 (UTC+8))
- #22987 [Usage]: How to suppress the 10 % output drift between speculative and non-speculative modes? — usage,stale — by kiexu (关闭于: 2025-12-16 10:38 (UTC+8))
- #22997 [Bug]: v1 doesn't raise an error if request priority is set while model server not using priority scheduling — bug,stale — by MusaTalluzi-cohere (关闭于: 2025-12-16 10:38 (UTC+8))
- #23007 [Bug]: GPT OSS 120B token usage is 0 on response API, even though it responded back — bug,stale — by xinyu-dev (关闭于: 2025-12-16 10:38 (UTC+8))
- #23025 [Bug]: KeyError: 'layers.1.mlp.experts.w13_weight' for serving GLM 4.5 air — bug,stale — by ChaosAIVision (关闭于: 2025-12-16 10:38 (UTC+8))
- #23037 [Usage]: cpu infernce with kv-cache on gpu — usage,stale — by devops724 (关闭于: 2025-12-16 10:38 (UTC+8))
- #23050 [Bug]: torch.OutOfMemoryError: CUDA out of memory. — bug,stale — by ZJUzengkun (关闭于: 2025-12-16 10:38 (UTC+8))
- #23052 [Feature]: Localhost-by-default, API keys/mTLS, and inference-only tokens for vLLM — feature request,stale — by Cristliu (关闭于: 2025-12-16 10:38 (UTC+8))
- #30734 [Bug]: qwen3-1.7b output text truncate when serving with vllm docker。在默认参数下,vllm推理结果异常截断 — bug — by jjxyai (关闭于: 2025-12-16 10:32 (UTC+8))
- #24689 [Bug]: Llama 3.3 70B hangs with full cuda graph for decode-only — bug,stale — by alexm-redhat (关闭于: 2025-12-16 07:21 (UTC+8))
- #18851 [Bug]: Strange error `AssertionError: failed to get the hash of the compiled graph` when running `Qwen/Qwen3-8B` via `LLM` class — bug,torch.compile,stale,actionable — by vadimkantorov (关闭于: 2025-12-16 01:45 (UTC+8))
- #29294 [CPU Backend] [Doc]: Update Installation Docs for Arm CPUs — documentation,cpu — by fadara01 (关闭于: 2025-12-16 03:46 (UTC+8))
- #29349 [Bug]: Please use the new API settings to control TF32 behavior… — bug — by wasertech (关闭于: 2025-12-15 20:56 (UTC+8))
- #30107 [Feature]: Model Support: Qwen3 Token Classification — feature request — by bd2lcco (关闭于: 2025-12-15 16:13 (UTC+8))
- #26585 [Usage]: use vllm embedding to extract last token hidden states? — usage — by rxqy (关闭于: 2025-12-15 14:54 (UTC+8))
- #29715 [Bug]: Qwen3-VL fails during multimodal encoder profiling (expected 3 dims, got 2) on Blackwell + NVFP4 (FlashInfer) even after CUDA header fix — bug — by Firworksyt (关闭于: 2025-12-15 14:31 (UTC+8))
新增 PR
- #30733 improve lazy import test — ready — by BoyuanFeng (创建于: 2025-12-16 10:05 (UTC+8))
- #30735 add Qwen3OmniMoeAudioEncoder and support torch compile — qwen — by XiaobingSuper (创建于: 2025-12-16 10:13 (UTC+8))
- #30716 fused_moe_lora PDL improvements — ready — by gnovack (创建于: 2025-12-16 04:18 (UTC+8))
- #30732 [CI/Build] Allow user to configure NVSHMEM version via ENV or command line — ci/build — by eicherseiji (创建于: 2025-12-16 09:56 (UTC+8))
- #30731 [Bugfix] Fix broken ViT attention selection for Blackwell device — ready,nvidia — by Isotr0py (创建于: 2025-12-16 09:37 (UTC+8))
- #30729 [Perf] re-enable flashinfer rotary_embedding custom ops — 无标签 — by jiahanc (创建于: 2025-12-16 08:13 (UTC+8))
- #30730 [ROCm][Bugfix] fix(structured_output): Switch auto fallback backend from guidance to outlines — rocm,structured-output,v1 — by AndreasKaratzas (创建于: 2025-12-16 08:58 (UTC+8))
- #30705 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 — ready,ci/build — by Harry-Chen (创建于: 2025-12-16 00:59 (UTC+8))
- #30728 [BugFix] skip aot_compile for a unit test — 无标签 — by BoyuanFeng (创建于: 2025-12-16 07:51 (UTC+8))
- #30688 chores: adjust the attn register param order — ready — by ILikeIneine (创建于: 2025-12-15 19:03 (UTC+8))
- #30718 [Perf] Cache for deepgemm moe, 2.1% Throuput improvement, 2.2% TTFT improvement — ready — by yewentao256 (创建于: 2025-12-16 04:49 (UTC+8))
- #30725 light version of prefix caching for hybrid models gdn attention — v1,qwen — by joennlae (创建于: 2025-12-16 07:22 (UTC+8))
- #30724 [WIP] Fix edge case Mistral tool parser — 无标签 — by joa-stdn (创建于: 2025-12-16 06:44 (UTC+8))
- #30713 [TRTLLM] Remove the MoE GEMM weight name change — bug,ready,nvidia — by minosfuture (创建于: 2025-12-16 03:34 (UTC+8))
- #30709 [Misc][LLaMa4] Compile LLaMa Vision Encoder — llama — by Lucaskabela (创建于: 2025-12-16 02:34 (UTC+8))
- #30710 [UX][Attention] Add `attention_config` argument to `LLM()` — frontend,ready,ci/build — by MatthewBonanni (创建于: 2025-12-16 02:57 (UTC+8))
- #30719 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition — rocm — by AndreasKaratzas (创建于: 2025-12-16 05:01 (UTC+8))
- #30723 [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test — ready,ci/build,qwen,nvidia — by mgoin (创建于: 2025-12-16 05:43 (UTC+8))
- #30711 Update note comment for flashinfer attention warmup — ready — by mgoin (创建于: 2025-12-16 03:23 (UTC+8))
- #30720 [Feature]nvfp4 universal fallback emulation — 无标签 — by Rob-P-Smith (创建于: 2025-12-16 05:16 (UTC+8))
- #30673 Enable GDC for regular Triton MoE by calling `mm_k` from Lora — 无标签 — by RunkaiTao (创建于: 2025-12-15 15:19 (UTC+8))
- #30699 [Bugfix] Skip missing parameters during GGUF Gemma2 weight loading — 无标签 — by kitaekatt (创建于: 2025-12-15 23:45 (UTC+8))
- #30702 [Bugfix] Handle missing config.json in speculator probe for GGUF models — 无标签 — by kitaekatt (创建于: 2025-12-15 23:59 (UTC+8))
- #30704 Update batch invariant to use attention config — ready,v1 — by MatthewBonanni (创建于: 2025-12-16 00:54 (UTC+8))
- #30708 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args — bug,performance,ready,torch.compile — by mgoin (创建于: 2025-12-16 01:58 (UTC+8))
- #30715 [Do not merge][Test] Revert “[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2” — ready,v1,deepseek,gpt-oss,nvidia — by LucasWilkinson (创建于: 2025-12-16 03:59 (UTC+8))
- #30714 [ROCm] [MXFP4] Deepseek Fp4 projection gemms dynamically quantized — rocm,v1,deepseek — by dllehr-amd (创建于: 2025-12-16 03:46 (UTC+8))
- #30703 [Bugfix] Fix ViT with FlashAttention on ROCm — rocm,ready — by MatthewBonanni (创建于: 2025-12-16 00:18 (UTC+8))
- #30680 [Model] Add video input support for transformers modeling backend — documentation,v1,multi-modality — by ch3nku1 (创建于: 2025-12-15 16:24 (UTC+8))
- #30712 [Mamba] Removed disable cascade attn in MambaModelConfig — 无标签 — by Josephasafg (创建于: 2025-12-16 03:32 (UTC+8))
- #30670 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model — ready,qwen,nvidia — by maxyanghu (创建于: 2025-12-15 14:51 (UTC+8))
- #30678 [CPU] Add action to automatically label CPU related PRs — ci/build — by fadara01 (创建于: 2025-12-15 16:22 (UTC+8))
- #30697 [Refactor] EPLB rebalance algo to NumPy — 无标签 — by ilmarkov (创建于: 2025-12-15 22:33 (UTC+8))
- #30695 Remove `SkipValidation` from `ModelConfig` — ready — by hmellor (创建于: 2025-12-15 22:07 (UTC+8))
- #30706 fix: add warmup for audio preprocessing — frontend — by TheCodeWrangler (创建于: 2025-12-16 01:36 (UTC+8))
- #30700 feat(api): Eager chat template warmup to eliminate first-request latency — frontend — by TheCodeWrangler (创建于: 2025-12-15 23:47 (UTC+8))
- #30671 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition — frontend,ready — by mondaylord (创建于: 2025-12-15 15:08 (UTC+8))
- #30701 Dynres — 无标签 — by netanel-haber (创建于: 2025-12-15 23:50 (UTC+8))
- #30698 feat(api): Eager chat template warmup to eliminate first-request latency — frontend — by TheCodeWrangler (创建于: 2025-12-15 23:18 (UTC+8))
- #30693 [Refactor] [3/N] Move tool parser tests and run on CPU — ready,ci/build,qwen,deepseek — by DarkLight1337 (创建于: 2025-12-15 21:07 (UTC+8))
- #30684 [MM Encoder]: Migrate legacy ViT `MultiHeadAttention` to new `MMEncoderAttention` interface — tpu,v1,llama — by Isotr0py (创建于: 2025-12-15 16:39 (UTC+8))
- #30692 OffloadingConnector: Support kernel_block_size != block_size — v1 — by orozery (创建于: 2025-12-15 20:55 (UTC+8))
- #30681 [Hardware] Replace `torch.cuda.empty_cache` with `torch.accelerator.empty_cache` — documentation,v1,nvidia — by jikunshang (创建于: 2025-12-15 16:24 (UTC+8))
- #30675 [Refactor] [2/N] Move tool parsers into the vLLM main directory — documentation,frontend,ready,tool-calling,llama,qwen,deepseek,gpt-oss — by chaunceyjiang (创建于: 2025-12-15 16:04 (UTC+8))
- #30687 Triton Attention: Support cross-layers blocks — v1 — by orozery (创建于: 2025-12-15 18:19 (UTC+8))
- #30686 [WIP] Fix docker build cache — ready,ci/build — by wzshiming (创建于: 2025-12-15 17:36 (UTC+8))
- #30689 [Feature]: Prometheus Metrics Abstraction — speculative-decoding,needs-rebase,v1,kv-connector — by mladjan-gadzic (创建于: 2025-12-15 19:05 (UTC+8))
- #30674 [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model — ready,qwen — by iwzbi (创建于: 2025-12-15 15:35 (UTC+8))
- #30672 [Model][Last/N] Improve all pooling task Generate runner supports using embed and token_embed tasks. — frontend,v1 — by noooop (创建于: 2025-12-15 15:19 (UTC+8))
- #30683 [UT][PCP&DCP] UT for block_table.py — v1 — by pisceskkk (创建于: 2025-12-15 16:38 (UTC+8))
- #30676 Sihao issue586 fix — v1 — by 1643661061leo (创建于: 2025-12-15 16:06 (UTC+8))
- #30677 [Docs] Update design/multiprocessing.md — documentation — by windsonsea (创建于: 2025-12-15 16:06 (UTC+8))
- #30666 [Model] Automatic conversion of TokenClassification model — ready — by noooop (创建于: 2025-12-15 12:30 (UTC+8))
- #30665 增加动态端口 — v1 — by 1643661061leo (创建于: 2025-12-15 11:43 (UTC+8))
- #30669 [Doc] Add AI Badgr framework integration documentation — documentation — by miguelmanlyx (创建于: 2025-12-15 14:07 (UTC+8))
- #30668 [DSv32] Short sequence prefill using MHA — v1 — by qianlihuang (创建于: 2025-12-15 13:33 (UTC+8))
- #30667 Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing — v1 — by storyicon (创建于: 2025-12-15 12:38 (UTC+8))
- #30664 Phase 3 hybrid attention — documentation,performance,new-model,speculative-decoding,v1,llama,qwen — by RGBmarya (创建于: 2025-12-15 11:37 (UTC+8))
已合并 PR
- #30705 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 — ready,ci/build — by Harry-Chen (合并于: 2025-12-16 06:48 (UTC+8))
- #30237 [Bugfix] fix streaming final output for non harmony — frontend,ready,gpt-oss — by penfree (合并于: 2025-12-16 09:03 (UTC+8))
- #30212 [Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform — rocm,tpu,ready,nvidia — by Isotr0py (合并于: 2025-12-16 01:36 (UTC+8))
- #30657 [Log] Skip piecewise cudagraph warn when using full cudagraph — ready,nvidia — by BoyuanFeng (合并于: 2025-12-15 10:49 (UTC+8))
- #30710 [UX][Attention] Add `attention_config` argument to `LLM()` — frontend,ready,ci/build — by MatthewBonanni (合并于: 2025-12-16 06:29 (UTC+8))
- #30529 [Benchmarks] `auto_tune.sh`: Use hostname variable for server requests — performance,ready — by KevinMusgrave (合并于: 2025-12-16 06:00 (UTC+8))
- #30704 Update batch invariant to use attention config — ready,v1 — by MatthewBonanni (合并于: 2025-12-16 04:24 (UTC+8))
- #30708 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args — bug,performance,ready,torch.compile — by mgoin (合并于: 2025-12-16 04:18 (UTC+8))
- #30703 [Bugfix] Fix ViT with FlashAttention on ROCm — rocm,ready — by MatthewBonanni (合并于: 2025-12-16 03:45 (UTC+8))
- #30594 [docs][fix] Update Arm CPU vLLM wheel installation docs — documentation,ready — by fadara01 (合并于: 2025-12-16 03:46 (UTC+8))
- #30670 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model — ready,qwen,nvidia — by maxyanghu (合并于: 2025-12-15 22:06 (UTC+8))
- #30695 Remove `SkipValidation` from `ModelConfig` — ready — by hmellor (合并于: 2025-12-16 01:34 (UTC+8))
- #30671 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition — frontend,ready — by mondaylord (合并于: 2025-12-16 00:13 (UTC+8))
- #30040 [Frontend] add tools for dsv32 developer role — frontend,ready — by yjc9696 (合并于: 2025-12-15 23:08 (UTC+8))
- #30627 [MoE][Refactor 1/N] Separate Online Quantization — ready — by robertgshaw2-redhat (合并于: 2025-12-15 22:54 (UTC+8))
- #30693 [Refactor] [3/N] Move tool parser tests and run on CPU — ready,ci/build,qwen,deepseek — by DarkLight1337 (合并于: 2025-12-15 21:45 (UTC+8))
- #30675 [Refactor] [2/N] Move tool parsers into the vLLM main directory — documentation,frontend,ready,tool-calling,llama,qwen,deepseek,gpt-oss — by chaunceyjiang (合并于: 2025-12-15 20:54 (UTC+8))
- #29805 [Misc][Hybrid allocator + kv connector] Optionally enable hybrid allocator + KV cache connector — ready,v1,kv-connector — by NickLucche (合并于: 2025-12-15 19:17 (UTC+8))
- #30674 [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model — ready,qwen — by iwzbi (合并于: 2025-12-15 18:38 (UTC+8))
- #30648 [Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template — frontend,ready — by seokhyunan (合并于: 2025-12-15 12:21 (UTC+8))
- #30666 [Model] Automatic conversion of TokenClassification model — ready — by noooop (合并于: 2025-12-15 16:13 (UTC+8))
- #30552 typing: Add type hints to TurnMetrics class in context.py — frontend,ready,gpt-oss — by yurekami (合并于: 2025-12-15 15:00 (UTC+8))
- #28439 [New Model] BAGEL support (AR only) — documentation,new-model,ready,qwen — by princepride (合并于: 2025-12-15 14:58 (UTC+8))
- #30125 [CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. — rocm,tpu,ready,multi-modality,qwen,nvidia — by shen-shanshan (合并于: 2025-12-15 11:13 (UTC+8))
- #30662 [XPU] fix Dockerfile.xpu, avoid wheel conflicts — ready,ci/build — by jikunshang (合并于: 2025-12-15 13:32 (UTC+8))
- #30658 [Bugfix] Fix deepseek_v32 tokenizer_mode — structured-output,frontend,ready,v1,deepseek — by jeejeelee (合并于: 2025-12-15 12:20 (UTC+8))
- #30282 [Feat] Refactor for `parallel_config` in `FusedMoEModularKernel` — ready,nvidia — by yewentao256 (合并于: 2025-12-15 12:21 (UTC+8))
- #30649 additional protection for CVE-2025-62164 — frontend,ready,multi-modality — by wenqiglantz (合并于: 2025-12-15 11:07 (UTC+8))
- #30547 [CustomOp] Support object-level enable for CustomOp — ready — by shen-shanshan (合并于: 2025-12-15 11:02 (UTC+8))
关闭但未合并的 PR
- #24936 [WIP][performance] DP for ViT in Intern S1 model — stale — by hsliuustc0106 (关闭于: 2025-12-16 10:45 (UTC+8))
- #30588 Fix edge case Mistral tool parser — documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 — by joa-stdn (关闭于: 2025-12-16 06:44 (UTC+8))
- #30720 [Feature]nvfp4 universal fallback emulation — 无标签 — by Rob-P-Smith (关闭于: 2025-12-16 05:25 (UTC+8))
- #30561 feat(serve): add warmup support for consistent first-request performance — documentation,frontend — by TheCodeWrangler (关闭于: 2025-12-15 23:52 (UTC+8))
- #27900 [Misc][LLaMa4] Compile LLaMa Vision Encoder layers — documentation,performance,new-model,rocm,structured-output,frontend,tpu,ready,ci/build,v1 — by Lucaskabela (关闭于: 2025-12-16 00:31 (UTC+8))
- #30500 feat(gguf): Extract HF config from GGUF metadata for repos without config.json — needs-rebase — by kitaekatt (关闭于: 2025-12-16 00:49 (UTC+8))
- #30497 fix(gguf): GGUF model support fixes for Blackwell GPUs — structured-output,needs-rebase,v1 — by kitaekatt (关闭于: 2025-12-15 23:56 (UTC+8))
- #30421 fix(gemma2): Skip missing parameters during GGUF weight loading — structured-output,v1 — by kitaekatt (关闭于: 2025-12-15 23:45 (UTC+8))
- #30698 feat(api): Eager chat template warmup to eliminate first-request latency — frontend — by TheCodeWrangler (关闭于: 2025-12-15 23:33 (UTC+8))
- #30567 [Bug] Fix AttributeError: ‘Qwen3VLMoeConfig’ object has no attribute ‘intermediate_size’ — ready,needs-rebase,qwen — by yewentao256 (关闭于: 2025-12-15 23:18 (UTC+8))
- #30556 feat: batched shared encoder for whisper beam search — documentation,performance,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by TheCodeWrangler (关闭于: 2025-12-15 22:54 (UTC+8))
- #19095 fix: cuda 12.6 installation — ready,needs-rebase,ci/build,stale — by mickaelseznec (关闭于: 2025-12-15 18:31 (UTC+8))
- #26180 [Benchmark] Add probability density function based sampling in `RandomDataset` — performance — by mickaelseznec (关闭于: 2025-12-15 18:30 (UTC+8))
- #30665 增加动态端口 — v1 — by 1643661061leo (关闭于: 2025-12-15 15:51 (UTC+8))
- #26622 [WIP][CustomOp] Make deepseek indexer custom op. — deepseek — by whx-sjtu (关闭于: 2025-12-15 15:48 (UTC+8))
- #23722 [Misc] Moved override for allreduce fusion thresholds from env var to config — torch.compile,stale — by nvjullin (关闭于: 2025-12-15 15:28 (UTC+8))
- #27147 [MM Encoder]: Wrap mm encoder attention interface as `CustomOps` — tpu,needs-rebase,v1,llama — by Isotr0py (关闭于: 2025-12-15 15:12 (UTC+8))
- #30668 [DSv32] Short sequence prefill using MHA — v1 — by qianlihuang (关闭于: 2025-12-15 14:05 (UTC+8))
- #30572 Add additional protection for CVE-2025-62164 — frontend,ready,multi-modality — by russellb (关闭于: 2025-12-15 10:55 (UTC+8))