vLLM Development Activity Report - 2025-12-27

Time window: 2025-12-27 11:01 (UTC+8) ~ 2025-12-28 11:01 (UTC+8)
Statistics: New issues 9 | Closed issues 33 | New PRs 35 | Merged PRs 5 | PRs closed without merging 11

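📊 Daily Development Status Summary

During this period, vLLM development was active, with code cleanup and user-experience improvements as the main themes. Many PRs were opened (35) but few were merged (5), which indicates that a large share of the feature fixes and code-quality work is still under review. The project also closed 33 long-inactive stale issues, reflecting ongoing maintenance of the issue backlog. Notably, new bug reports involve a new hardware architecture (Blackwell) and complex feature combinations (such as Tensor Parallelism with Prefix Caching), showing the challenges the community faces as it pushes vLLM into broader and more extreme scenarios.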

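🎯 AMD/ROCm Ecosystem Updates

Direct updates to the AMD ecosystem were few in this period, but one important ongoing effort made progress.

Continuation of the MoE FP8 weight-processing refactor:
- PR #31169 ([MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading), submitted by robertgshaw2-redhat and merged.
- Analysis: This PR is part 10 of the MoE refactor series. Its core is to clean up and unify the post-load weight processing logic for FP8-quantized models (both block-wise and tensor-wise). Although the title does not mention AMD, the test plan in its description explicitly includes AITER block/tensor - rocm test cases. This indicates that FP8 MoE kernel support for the AMD ROCm platform is advancing in step with the core refactor, so that AMD platforms can reuse the unified, optimized weight-processing path and build on it for later performance work.
CI dependency fix:
- PR #31441 ([ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test), submitted by AndreasKaratzas, labeled rocm.
- Analysis: This PR adds the missing perceptron library dependency for the tests of the multimodal model Isaac. It aims to fix a dependency problem in the ROCm CI test environment so that multimodal model tests run correctly on AMD platforms. This is infrastructure maintenance.

Summary: Visible AMD ecosystem activity was limited in this period, but the merged PR #31169 shows that deep integration support for AMD platforms (FP8 MoE in particular) is progressing steadily within the core refactor.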

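💬 Analysis of High-Activity Discussions

Issues opened in this period generally drew few comments, but some recently closed issues had seen lively discussion.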

Issue #31422: [Bug]: Model returns junk when using tensor_parallel == 2 with gpt-oss
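- Core topic: On Blackwell GPUs (B200), when running the openai/gpt-oss-120b model with Tensor Parallelism (TP=2) and prefix caching enabled, every request after the first returns empty content.
- Viewpoints:
  - Reporter (vgoklani): Detailed testing narrowed the problem down to the combination of TP=2 and prefix caching. Disabling prefix caching or using TP=1 resolves it.
  - Potential cause: The reporter's analysis suggests a compatibility problem between prefix caching and tensor parallelism on the Blackwell architecture.
- Current status: The issue is newly opened and has no official reply yet. It is a complex bug that combines new hardware, a large model, and advanced optimization features, and it needs attention from core developers.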
Issue #31437 & PR #31438: [Bug]: Streaming tool calls missing id/type/name in finish chunk
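- Core topic: When tool calls are streamed, the last data chunk (the finish chunk) loses key fields such as id, type, and function.name, which breaks OpenAI API compatibility.
- Viewpoints:
  - Reporter and fixer (amittell): Pinpointed the cause as code that creates a new DeltaMessage without preserving the original fields, and immediately submitted a fix PR (#31438).
  - Community response: No objections. The PR passed the initial format checks shortly after it was created, an example of an API consistency and compatibility problem being fixed efficiently.
- Current status: PR #31438 has been submitted and is awaiting code review and CI tests.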
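Issue #16732 (closed): [Feature]: return graceful inference text input validation errors
- Core topic: When one request in a batch fails validation (for example, because the input is too long), vLLM raises an exception and the whole batch is interrupted. Users want such errors handled gracefully so that bad samples can be skipped.
- Viewpoints:
  - Requesters (vadimkantorov and others): For offline batch inference, the error for each prompt should be returned as part of the result object rather than raised as an exception that interrupts the whole process.
  - Maintainer (njhill): Suggested using the truncate_prompt_tokens parameter, later found that it might not work as expected, and said a separate issue should be opened to fix it.
  - Other users: Several reported the same problem and noted that older versions handled it more gracefully (a warning plus empty output).
- Final outcome: The issue was closed automatically after long inactivity. The underlying problem, graceful handling of partial failures in a batch of requests, is a classic engineering problem that may need an architecture-level design.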
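The behavior requested in Issue #16732 can be illustrated with a minimal sketch. Everything here is hypothetical and is not vLLM's API: `PromptResult`, `run_batch`, and the `generate` callable are illustrative names, and character length stands in for the token count that a real validator would check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PromptResult:
    """Hypothetical per-prompt result that carries either an output or an error."""
    prompt: str
    output: Optional[str] = None
    error: Optional[str] = None


def run_batch(prompts: List[str],
              generate: Callable[[str], str],
              max_len: int) -> List[PromptResult]:
    """Process every prompt; record validation failures instead of raising."""
    results: List[PromptResult] = []
    for prompt in prompts:
        # Assumption: character length is a stand-in for the token count.
        if len(prompt) > max_len:
            results.append(PromptResult(
                prompt, error=f"prompt length {len(prompt)} exceeds {max_len}"))
            continue
        results.append(PromptResult(prompt, output=generate(prompt)))
    return results


# The over-long middle prompt is reported as an error; the others still run.
results = run_batch(["short", "x" * 50, "ok"], generate=str.upper, max_len=10)
for r in results:
    print(r.output if r.error is None else f"skipped: {r.error}")
```

With this shape, the caller decides after the batch completes whether to retry, truncate, or drop the failed prompts.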
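Issue #22444 (closed): [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings
- Core topic: When Qwen2-VL is used to generate embeddings and the input contains images, GPU utilization is very low and throughput is far below that of text-only input.
- Viewpoints:
  - Reporter (fabiozappo): Provided detailed performance comparison data.
  - Contributor (ZJY0516): Tried to reproduce and analyze the problem, and pointed out that image encoding yields far more tokens than text, which may partly explain the throughput gap. Also found that increasing the number of API worker processes (--api-server-count) can reduce performance or cause errors.
  - Maintainer (DarkLight1337): Recommended profiling to locate the bottleneck, and noted that the benefit of multiple processes depends on the number of CPU cores.
- Final outcome: The issue was closed for inactivity without a clear root-cause fix. It shows that multimodal preprocessing (such as image encoding) can become a performance bottleneck and that its interaction with the vLLM serving architecture still needs optimization.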

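🔥 Hot Topics and Trends

API and user-experience improvements: Several PRs (such as #31427, which adds a log-filtering option; #31426, which improves error messages; and #31423, which improves the DBO error message) focus on user experience and clearer debugging information. This suggests that, with functionality stabilizing, the project is paying more attention to developer experience and ease of use.
Continued expansion of multimodal model support:
- New models: Issue #31410 requests support for T5Gemma, and PR #31436 adds support for the GLM-ASR audio-language model.
- Performance and fixes: PR #31402 removes a redundant all_reduce operation in ViTs to improve performance, and PR #31403 fixes a shape mismatch in Hunyuan-VL.
- Infrastructure: PR #31401 relaxes the strict transformers version limit for multimodal model tests to increase test coverage.
Kernel and execution-engine optimization:
- Execution strategy: PR #31412 tries to improve I/O efficiency by sorting the block IDs used for KV cache transfers.
- Kernel configuration: PR #31407 adds tuned fused MoE kernel configurations for specific GLM models on RTX Pro 6000.
- Compilation and caching: PR #31376 fixes a key bug in the caching of the compilation config.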
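Why sorting block IDs can help I/O is easier to see with a small sketch. This is an illustration only, not the code in PR #31412: `contiguous_runs` is a hypothetical helper, and merging neighbours into runs is an assumption about how a transfer layer might exploit sorted IDs.

```python
from typing import Iterable, List, Tuple


def contiguous_runs(block_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Sort block IDs and merge adjacent IDs into (start, length) runs."""
    runs: List[Tuple[int, int]] = []
    # Assumption: duplicate IDs are not meaningful, so they are dropped.
    for block_id in sorted(set(block_ids)):
        if runs and runs[-1][0] + runs[-1][1] == block_id:
            # The ID directly follows the previous run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((block_id, 1))
    return runs


# Six scattered block IDs collapse into two contiguous ranges.
runs = contiguous_runs([7, 3, 4, 9, 5, 8])
print(runs)  # [(3, 3), (7, 3)]
```

In unsorted order the same six blocks would be touched as six separate accesses; after sorting, a transfer layer could in principle issue two larger ones.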

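🛠️ Key Technical Changes

PR #31169: [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading (merged)
- Technical summary: A major refactor of the post-load weight processing logic for FP8-quantized MoE models. It removes duplicated code paths for block-wise and tensor-wise quantization and consolidates them into reusable helper functions.
- Impact: Improves the modularity and maintainability of the FP8 MoE code and gives all backends (including NVIDIA CUTLASS/TRTLLM and AMD AITER) a common weight-processing foundation. It is an important step toward mature FP8 quantization support on heterogeneous hardware.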
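PR #31376: [BugFix] Fix cache issue in compilation_config (merged)
- Technical summary: Fixes a bug in which the cache of get_cached_compilation_config(), a function decorated with @lru_cache, was not cleared promptly when the context was switched via set_current_vllm_config. The original code cleared the cache only on exiting the context, so a stale config could be used inside the context.
- Impact: Ensures that dynamic config switching behaves correctly for features that depend on the compilation config, such as torch.compile, and avoids subtle latent errors.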
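The stale-cache pattern described for PR #31376 can be reproduced in a few lines. This is a simplified sketch and not vLLM's code: `get_cached_config` and the two context managers are hypothetical stand-ins, and clearing the cache on entry is one plausible fix, assumed here for illustration.

```python
from contextlib import contextmanager
from functools import lru_cache

_current_config = {"level": 0}  # hypothetical stand-in for the active config


@lru_cache(maxsize=1)
def get_cached_config():
    # Caches whichever config was active at the time of the first call.
    return _current_config


@contextmanager
def set_config_clear_on_exit_only(new_config):
    """Problematic pattern: the cache is cleared only when the context exits."""
    global _current_config
    old = _current_config
    _current_config = new_config
    try:
        yield
    finally:
        _current_config = old
        get_cached_config.cache_clear()


@contextmanager
def set_config_clear_on_enter_and_exit(new_config):
    """Assumed fix: the cache is also cleared when the context is entered."""
    global _current_config
    old = _current_config
    _current_config = new_config
    get_cached_config.cache_clear()
    try:
        yield
    finally:
        _current_config = old
        get_cached_config.cache_clear()


get_cached_config()  # warm the cache with the default config
with set_config_clear_on_exit_only({"level": 3}):
    stale = get_cached_config()  # still the default: the cache was not reset

get_cached_config()  # warm the cache again so the comparison is fair
with set_config_clear_on_enter_and_exit({"level": 3}):
    fresh = get_cached_config()  # the config set by the context

print(stale, fresh)
```

The first context sees the old value because `lru_cache` has no way to know that the module-level config changed; only an explicit `cache_clear()` at the switch point invalidates it.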
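PR #31438: [Bugfix] Preserve tool call id/type/name in streaming finish chunk (in progress)
- Technical summary: Fixes the loss of key fields in the last data chunk of a streamed tool-call response. It changes the logic in serving_chat.py so that the original fields obtained from the tool parser are preserved when the finish chunk is built.
- Impact: Improves vLLM's compatibility with the OpenAI streaming tool-call API and ensures that clients receive complete tool-call information.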
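The field loss addressed by PR #31438 follows a common pattern, sketched below. The `ToolCallDelta` dataclass and both helper functions are hypothetical simplifications for illustration; they are not the types or logic in serving_chat.py.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToolCallDelta:
    """Hypothetical, simplified stand-in for one streamed tool-call delta."""
    index: int
    id: Optional[str] = None
    type: Optional[str] = None
    name: Optional[str] = None
    arguments: str = ""


def finish_chunk_lossy(parsed: ToolCallDelta, remaining_args: str) -> ToolCallDelta:
    # Builds a brand-new delta, so id/type/name fall back to None.
    return ToolCallDelta(index=parsed.index, arguments=remaining_args)


def finish_chunk_preserving(parsed: ToolCallDelta, remaining_args: str) -> ToolCallDelta:
    # Copies the parsed delta and overrides only the arguments field.
    return replace(parsed, arguments=remaining_args)


parsed = ToolCallDelta(index=0, id="call_1", type="function",
                       name="get_weather", arguments='{"city": "Par')
lossy = finish_chunk_lossy(parsed, 'is"}')
kept = finish_chunk_preserving(parsed, 'is"}')

print(lossy.id, lossy.name)  # None None
print(kept.id, kept.name)    # call_1 get_weather
```

Copying the parsed object and overriding one field keeps any attribute added later from being dropped by accident, which is the main design argument for `replace` over rebuilding the object field by field.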

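📈 Development Activity Observations

Contributor activity: The user yurekami was extremely active in this period, submitting more than 10 PRs focused on code quality (replacing generic exceptions, adding type annotations, turning magic numbers into named constants) and on user experience (richer error messages). This shows that the community keeps up codebase housekeeping while new features are being developed.
Code review and merging: Far fewer PRs were merged (5) than opened (35) in this period. The merged PRs include important bug fixes (#31376, #30912) and cleanup work (#31169, #31405). Many new PRs remain open, which indicates a full review queue and strict merge standards.
Issue management: 33 issues were closed, the vast majority of them automatically after being marked stale. The project uses an automated process to clear out old issues that have had no follow-up, which keeps the issue list manageable.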

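💡 Issues Worth Watching

Blackwell architecture compatibility (Issue #31422): On NVIDIA's latest Blackwell GPUs, the combination of tensor parallelism and prefix caching exposes a new compatibility bug. Core developers will need to adapt to and test against the new hardware.
MM cache AssertionError crashes the engine (Issue #31404): An assertion error in the multimodal cache crashes the whole inference engine and aborts requests. This is a serious stability problem that affects production availability.
Outdated NVIDIA NGC container version (Issue #31424): The vLLM version bundled in the official NGC container (0.11.1) is far behind the current stable release (0.13.x). This causes trouble for users who depend on the container and may hurt the reputation of the official image.
Continued progress on AMD ecosystem support: Direct activity was limited in this period, but the merge of PR #31169 shows that deep optimization for the ROCm platform (FP8 MoE in particular) is advancing steadily on the main branch and is worth following over the long term.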

📋 Appendix: Detailed Data Lists

New issues

#31440 [Feature]: Allow usage of continue_final_message in /embeddings endpoint | feature request | by kevin-pw (created: 2025-12-28 09:14 (UTC+8))
#31439 [Bug]: aot_compile disables inductor graph partition | bug | by BoyuanFeng (created: 2025-12-28 08:04 (UTC+8))
#31437 [Bug]: Streaming tool calls missing id/type/name in finish chunk | no labels | by amittell (created: 2025-12-28 07:54 (UTC+8))
#31424 [Bug]: NVIDIA NGC container nvcr.io/nvidia/vllm:25.12-py3 ships outdated vLLM 0.11.1 instead of 0.13.x | bug | by vgoklani (created: 2025-12-28 05:27 (UTC+8))
#31422 [Bug]: Model returns junk when using tensor_parallel == 2 with gpt-oss, | bug | by vgoklani (created: 2025-12-28 05:06 (UTC+8))
#31414 [Feature][Cleanup]: Unify vllm.utils.flashinfer and vllm.model_executor.layers.quantization.utils.flashinfer_utils | help wanted,good first issue,feature request | by robertgshaw2-redhat (created: 2025-12-28 02:27 (UTC+8))
#31410 [New Model]: Support T5Gemma | new-model | by npuichigo (created: 2025-12-27 21:55 (UTC+8))
#31404 [Bug]: MM cache AssertionError crashes engine, causes request aborts | bug | by shirejoni (created: 2025-12-27 16:59 (UTC+8))
#31398 [Doc]: Eagle3 with tensor parallelism | documentation | by JSYRD (created: 2025-12-27 11:10 (UTC+8))

Closed issues

#16732 [Feature]: return graceful inference text input validation errors as part of output (without throwing an exception) - to enable skipping / handling bad examples after the processing of good ones | feature request,stale | by vadimkantorov (closed: 2025-12-28 10:17 (UTC+8))
#19739 [Feature]: Logging details about incorrect requests | feature request,stale | by ErykCh (closed: 2025-12-28 10:16 (UTC+8))
#20218 [Feature]: Split and shorten long CI jobs (e.g. entrypoints, spec decodes, kernels, etc.) | feature request,ci/build,stale | by kaori-seasons (closed: 2025-12-28 10:16 (UTC+8))
#21122 [Bug]: Issue running mistralai/Magistral-Small-2506 on NVIDIA hardware | bug,stale | by jerin-scalers-ai (closed: 2025-12-28 10:16 (UTC+8))
#21899 [Bug]: Processor mismatch between what is provided by OpenGVLab and VLLM for InternVL leading to outputs of the processor being too large to be decoded for the tokenizer | bug,stale | by SStoica12 (closed: 2025-12-28 10:16 (UTC+8))
#22444 [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings | bug,stale | by fabiozappo (closed: 2025-12-28 10:16 (UTC+8))
#23163 [Bug]: GPU memory allocate problem | bug,stale | by lsk-gith (closed: 2025-12-28 10:16 (UTC+8))
#23365 [Feature]: Add support for /v1/responses API benchmarking backend | feature request,stale | by alew3 (closed: 2025-12-28 10:15 (UTC+8))
#23626 [Feature]: Support for Multi-Vision Tower Models (e.g., Valley-Eagle-7B) | feature request,stale,multi-modality | by oswardlx (closed: 2025-12-28 10:15 (UTC+8))
#23661 [Bug]: Get the issue No platform detected, vLLM is running on UnspecifiedPlatform when deploy on kubernetes using GPU time slicing | bug,stale | by barbarian23 (closed: 2025-12-28 10:15 (UTC+8))
#23667 [CI]: Entrypoints tests cleanup | ci/build,stale | by njhill (closed: 2025-12-28 10:15 (UTC+8))
#23697 [Bug]: Recycling request IDs can lead to requests leaking in the engine due to race condition in the cancellation logic | bug,stale | by Adolfo-Karim (closed: 2025-12-28 10:15 (UTC+8))
#23707 [Bug]: Flex Attention compiled block mask + multiprocessing | bug,stale | by drisspg (closed: 2025-12-28 10:15 (UTC+8))
#23719 [Usage]: CLS Fine-Tuned Model Fails to Launch with Multi-LoRA | usage,stale | by rfhshfhs (closed: 2025-12-28 10:15 (UTC+8))
#23780 [Feature][P/D]: Optimize NIXL Connector xfer Launch | help wanted,good first issue,feature request,stale | by robertgshaw2-redhat (closed: 2025-12-28 10:15 (UTC+8))
#23794 [Bug]: Getting different result in each run even after using temperature=0 in instruction finetuned Gemma-3-270m-it | bug,stale | by sujit0892 (closed: 2025-12-28 10:15 (UTC+8))
#23797 [Usage]: InputProcessingError: Failed to prepare inputs for sequence group with request id: 0, | usage,stale | by Datta0 (closed: 2025-12-28 10:15 (UTC+8))
#23821 [Bug]: Gemma 3n does not work in vllm/vllm-openai:v0.10.1.1 | bug,stale | by DaGeRe (closed: 2025-12-28 10:15 (UTC+8))
#23873 [Renderer] Merge OpenAIServing._preprocess_* with Processor.process_inputs | feature request,stale | by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
#23874 [Renderer] Renderer Prototyping with Qwen2.5VL | feature request,stale | by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
#23880 [MM Encoder] ViT attention performance and consolidation | help wanted,good first issue,feature request,stale | by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
#23881 [RFC]: Optimize grammar_bitmask Overhead by Moving Logic to Worker_runner | RFC,stale | by GGKOP (closed: 2025-12-28 10:15 (UTC+8))
#23884 [MM Encoder] General encoder performance improvement | help wanted,good first issue,feature request,stale | by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
#23885 [Usage]: how to improve toks/s in the single-concurrent test use the evalscope tool? deploy qwen3-30-a3b model use 4 cards. | usage,stale | by funny000 (closed: 2025-12-28 10:15 (UTC+8))
#23886 bad words implementation issue | stale | by da03 (closed: 2025-12-28 10:15 (UTC+8))
#23893 [Bug]: Unable to capture kernel information through nsys ro torch profiler on RTX5090 | bug,stale | by m404notfound (closed: 2025-12-28 10:15 (UTC+8))
#23911 [Bug]: The compilation of v0.9.2 succeeded with MacheteLinearKernel enabled for the RTX 4090 (CUDA architecture 8.9), but a runtime error was encountered. | bug,stale | by zzningxp (closed: 2025-12-28 10:15 (UTC+8))
#23921 [Bug]: 5090 Qwen3-30B-A3B-FP8 fails when TP=2！ | bug,stale | by tensorflowt (closed: 2025-12-28 10:15 (UTC+8))
#23969 [Bug]: v1 xformers + sliding window not working | bug,stale | by Muennighoff (closed: 2025-12-28 10:15 (UTC+8))
#31253 [Bug]: VLLM_USE_FLASHINFER_MOE_FP16=1 generate different logprob for the same prompt in different run | bug | by zyongye (closed: 2025-12-28 03:59 (UTC+8))
#31325 [Bug]: Qwen3-32B Quality Degeneration on Offline Generate | bug | by thavens (closed: 2025-12-28 03:14 (UTC+8))
#30905 [Bug]: vLLM 0.11.2 AutoTune Missed Import In TorchInductor w/ Compile Sizes > 1 | bug,torch.compile | by rouchenzi (closed: 2025-12-27 12:08 (UTC+8))
#31361 [Usage]: Question about the dummy run。It seems the dummy run use different precision? | usage | by Dingjifeng (closed: 2025-12-27 11:41 (UTC+8))

New PRs

#31441 [ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test | rocm,ci/build | by AndreasKaratzas (created: 2025-12-28 09:22 (UTC+8))
#31438 [Bugfix] Preserve tool call id/type/name in streaming finish chunk | frontend | by amittell (created: 2025-12-28 07:55 (UTC+8))
#31436 Add GLM-ASR multimodal support | new-model,multi-modality | by baonudesifeizhai (created: 2025-12-28 07:29 (UTC+8))
#31427 [Feature] Add --disable-metrics-access-log to filter monitoring logs | frontend | by yurekami (created: 2025-12-28 05:57 (UTC+8))
#31426 [UX] Improve DCP/PCP/MTP error messages with backend suggestions | v1 | by yurekami (created: 2025-12-28 05:51 (UTC+8))
#31428 [Bug] Fix GLM4 tool parser TypeError with empty arguments | no labels | by yurekami (created: 2025-12-28 06:00 (UTC+8))
#31429 Add descriptive error messages to bare asserts in forward_context.py | no labels | by yurekami (created: 2025-12-28 06:08 (UTC+8))
#31430 Consolidate duplicate exception handling in ray/lazy_utils.py | no labels | by yurekami (created: 2025-12-28 06:22 (UTC+8))
#31431 Add INT32_BITS constant to replace magic number in quant_utils.py | no labels | by yurekami (created: 2025-12-28 06:24 (UTC+8))
#31432 Add named constant for continuous usage report interval | no labels | by yurekami (created: 2025-12-28 06:27 (UTC+8))
#31433 Add explicit warning categories to warnings.warn() calls | no labels | by yurekami (created: 2025-12-28 06:34 (UTC+8))
#31434 Add return type annotations to post_init methods | no labels | by yurekami (created: 2025-12-28 06:37 (UTC+8))
#31435 Add descriptive error messages to IPEX ops assertions | no labels | by yurekami (created: 2025-12-28 06:44 (UTC+8))
#31425 [Cleanup] Replace generic Exception with specific types | frontend,v1 | by yurekami (created: 2025-12-28 05:38 (UTC+8))
#31423 [UX] Improve DBO/microbatching error message for unsupported backends | no labels | by yurekami (created: 2025-12-28 05:20 (UTC+8))
#31415 Apply refactor to ct | nvidia | by robertgshaw2-redhat (created: 2025-12-28 02:45 (UTC+8))
#31407 Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 | performance,ready | by mratsim (created: 2025-12-27 18:00 (UTC+8))
#31418 [v1] Improve GPU blocks error messages with actionable suggestions | v1 | by yurekami (created: 2025-12-28 04:22 (UTC+8))
#31421 [Cleanup] Add descriptive messages to empty exceptions | no labels | by yurekami (created: 2025-12-28 05:04 (UTC+8))
#31420 [Cleanup] Replace generic Exception with specific types (part 2) | v1 | by yurekami (created: 2025-12-28 04:46 (UTC+8))
#31419 [Cleanup] Replace generic Exception with ValueError in quant utils | no labels | by yurekami (created: 2025-12-28 04:40 (UTC+8))
#31413 poc of removing ModularKernelMethod and maybe_init_modular_kernel | no labels | by robertgshaw2-redhat (created: 2025-12-28 01:33 (UTC+8))
#31417 [Reasoning] Add GLM-4.7 reasoning parser for template-injected tag | no labels | by yurekami (created: 2025-12-28 04:14 (UTC+8))
#31416 [Feature][Cleanup] Unify flashinfer utils into package structure | nvidia | by yurekami (created: 2025-12-28 03:55 (UTC+8))
#31412 [Core] Sort block IDs at I/O layer for contiguous memory access | v1 | by majiayu000 (created: 2025-12-28 00:16 (UTC+8))
#31411 [Core] Sort blocks by block_id in FreeKVCacheBlockQueue.append_n | v1 | by majiayu000 (created: 2025-12-27 23:51 (UTC+8))
#31408 Add Loraconfig parameter to get_punica_wrapper function | no labels | by ZT-AIA (created: 2025-12-27 18:59 (UTC+8))
#31405 [Chore]: Remove HF format Phi4-MM examples | documentation,ready | by Isotr0py (created: 2025-12-27 17:00 (UTC+8))
#31401 [CI/Build] Ignore max transformers version for more common tests | ready,multi-modality | by Isotr0py (created: 2025-12-27 14:50 (UTC+8))
#31409 [Model] LLAMA MTP | llama | by sun-chao-hahaha (created: 2025-12-27 21:18 (UTC+8))
#31406 [v1] Add encoder-only/cross attention support to Triton Attention backend | v1 | by Isotr0py (created: 2025-12-27 17:09 (UTC+8))
#31403 [Model] Fix hunyuan-vl shape mismatch | no labels | by Potabk (created: 2025-12-27 16:45 (UTC+8))
#31402 [Misc] Remove redundant all reduce in qkv split for ViTs | no labels | by Isotr0py (created: 2025-12-27 15:18 (UTC+8))
#31400 [wip] custom allreduce and custom unquantized_gemm | no labels | by wxsIcey (created: 2025-12-27 14:20 (UTC+8))
#31399 [Doc] Clarify tensor parallelism support for EAGLE-based draft models | documentation | by JSYRD (created: 2025-12-27 11:38 (UTC+8))

Merged PRs

#31169 [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading | rocm,ready,nvidia | by robertgshaw2-redhat (merged: 2025-12-28 04:22 (UTC+8))
#31376 [BugFix] Fix cache issue in compilation_config | ready | by BoyuanFeng (merged: 2025-12-27 22:30 (UTC+8))
#31405 [Chore]: Remove HF format Phi4-MM examples | documentation,ready | by Isotr0py (merged: 2025-12-27 21:42 (UTC+8))
#31401 [CI/Build] Ignore max transformers version for more common tests | ready,multi-modality | by Isotr0py (merged: 2025-12-27 21:06 (UTC+8))
#30912 Fix/get raw stream patch #30905 | ready,torch.compile | by baonudesifeizhai (merged: 2025-12-27 12:08 (UTC+8))

PRs closed without merging

#23250 AITER MHA backend update | rocm,stale,v1 | by fsx950223 (closed: 2025-12-28 10:33 (UTC+8))
#18901 [V1][Metrics] Add time_per_prefill_token | stale,v1 | by sahelib25 (closed: 2025-12-28 10:17 (UTC+8))
#20677 [Bugfix] Fix grafana's model_name list showing other values | documentation,stale | by kebe7jun (closed: 2025-12-28 10:16 (UTC+8))
#22029 Draft: Qwen2.5 VL eagle3 | speculative-decoding,stale,v1,qwen | by IzzyPutterman (closed: 2025-12-28 10:16 (UTC+8))
#22362 Improved defaulting of chunked prefill and prefix caching in V1 | stale | by hypdeb (closed: 2025-12-28 10:16 (UTC+8))
#23889 fix bad words issue when adding leading space changes token id count | stale | by da03 (closed: 2025-12-28 10:15 (UTC+8))
#23930 Add actionable solutions to top 3 error messages | stale | by mtr7x (closed: 2025-12-28 10:15 (UTC+8))
#31418 [v1] Improve GPU blocks error messages with actionable suggestions | v1 | by yurekami (closed: 2025-12-28 05:07 (UTC+8))
#31411 [Core] Sort blocks by block_id in FreeKVCacheBlockQueue.append_n | v1 | by majiayu000 (closed: 2025-12-28 00:02 (UTC+8))
#31409 [Model] LLAMA MTP | llama | by sun-chao-hahaha (closed: 2025-12-27 21:21 (UTC+8))
#31403 [Model] Fix hunyuan-vl shape mismatch | no labels | by Potabk (closed: 2025-12-27 18:07 (UTC+8))