vLLM Development Activity Report - 2025-12-27
Time window: 2025-12-27 11:01 (UTC+8) ~ 2025-12-28 11:01 (UTC+8). Statistics: 9 new issues | 33 closed issues | 35 new PRs | 5 merged PRs | 11 PRs closed without merging
📊 Daily Development Status Summary
Development on vLLM was active this cycle, with code cleanup and user-experience improvements as the main themes. Many PRs were opened (35) but few were merged (5), indicating a large volume of fixes and code-quality work sitting in review. The project also closed 33 long-inactive issues, reflecting ongoing maintenance of the issue backlog. Notably, bug reports appeared involving new hardware (Blackwell) and complex feature combinations (e.g., Tensor Parallelism with Prefix Caching), showing the challenges the community encounters as vLLM is pushed into broader and more extreme scenarios.
🎯 AMD/ROCm Ecosystem Updates
Direct AMD-related updates were few this cycle, but one important ongoing effort made progress.
- Continued MoE FP8 weight-processing refactor:
  - PR #31169 ([MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading), submitted by robertgshaw2-redhat and already merged.
  - Analysis: This PR is the tenth installment of the MoE refactor series. Its core goal is to clean up and unify the post-load weight-processing logic for FP8-quantized models (both block-wise and tensor-wise). Although the title does not mention AMD directly, the test plan in the description explicitly includes the `AITER block/tensor - rocm` test cases. This indicates that FP8 MoE kernel support for the AMD ROCm platform is advancing in lockstep with the core code refactor, ensuring AMD platforms can reuse the unified, optimized weight-processing flow and laying the groundwork for future performance work.
- CI dependency fix:
  - PR #31441 ([ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test), submitted by AndreasKaratzas with the rocm label.
  - Analysis: This PR adds the missing `perceptron` library dependency for tests of the multimodal model Isaac. It fixes a dependency problem in the ROCm CI environment so that multimodal model tests can run normally on AMD platforms; this falls under infrastructure maintenance.
Summary: Visible AMD-ecosystem activity was limited this cycle, but the merged PR #31169 shows that deep integration work for AMD platforms (especially FP8 MoE) is progressing steadily within the framework of the core code refactor.
💬 High-Engagement Discussions
New issue comments were generally sparse this cycle, but several recently closed issues had seen lively discussion.
- Issue #31422: [Bug]: Model returns junk when using tensor_parallel == 2 with gpt-oss
  - Core issue: On a Blackwell GPU (B200), with the `openai/gpt-oss-120b` model, Tensor Parallelism (TP=2), and prefix caching enabled, every request after the first returns empty content.
  - Discussion:
    - Reporter (vgoklani): Through detailed testing, isolated the problem to the combination of TP=2 and prefix caching; disabling prefix caching or using TP=1 resolves it.
    - Likely cause: The reporter's analysis points to a compatibility problem between prefix caching and tensor parallelism on the Blackwell architecture.
  - Current status: Newly opened, with no official response yet. This is a complex bug involving new hardware, a large model, and an advanced optimization feature in combination, and it needs core developers' attention.
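Based on the reporter's findings, affected deployments can fall back to TP=1 or keep TP=2 but disable prefix caching. A sketch of the latter workaround via the vLLM CLI (flag names should be verified against your installed vLLM version):

```shell
# Workaround sketch: serve the model with TP=2 but prefix caching disabled,
# per the reporter's observation that the bug needs both TP=2 and prefix caching.
vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 2 \
    --no-enable-prefix-caching
```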
- Issue #31437 & PR #31438: [Bug]: Streaming tool calls missing id/type/name in finish chunk
  - Core issue: When streaming tool calls, the final data chunk (finish chunk) drops key fields such as `id`, `type`, and `function.name`, breaking OpenAI API compatibility.
  - Discussion:
    - Reporter/fixer (amittell): Pinpointed that the code fails to preserve the original fields when constructing a new DeltaMessage, and immediately submitted a fix PR (#31438).
    - Community response: No objections. The PR quickly passed initial format checks, reflecting fast turnaround on API consistency and compatibility problems.
  - Current status: PR #31438 has been submitted and is awaiting code review and CI testing.
- Issue #16732 (closed): [Feature]: return graceful inference text input validation errors
  - Core issue: When one request in a batch fails validation (e.g., input too long), vLLM throws an exception that aborts the entire batch. Users want such errors handled gracefully so bad samples can be skipped.
  - Discussion:
    - Requesters (vadimkantorov and others): For offline batch inference, per-prompt errors should be returned as part of the result objects instead of raised as exceptions that abort the whole run.
    - Maintainer (njhill): Initially suggested the `truncate_prompt_tokens` parameter, but later found it may not work as expected and said a separate issue should be opened to fix it.
    - Other users: Several reported the same problem and noted that older versions handled it more gracefully (a warning plus empty output).
  - Outcome: The issue was auto-closed for inactivity. The underlying need, graceful handling of partial failures within a batch, is a classic engineering problem that may require an architecture-level design.
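The pattern the requesters describe can be sketched as a thin wrapper that converts per-prompt failures into result objects instead of letting one bad prompt abort the batch. Names such as `PromptResult` and `generate_batch` are illustrative, not vLLM API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PromptResult:
    prompt: str
    output: Optional[str] = None   # filled on success
    error: Optional[str] = None    # filled when validation/generation fails


def generate_batch(prompts: List[str], max_len: int,
                   generate_fn: Callable[[str], str]) -> List[PromptResult]:
    """Run a batch, converting per-prompt failures into result objects
    so that one bad sample does not abort the whole run."""
    results = []
    for prompt in prompts:
        if len(prompt) > max_len:  # stand-in for real input validation
            results.append(PromptResult(prompt, error=f"prompt exceeds {max_len} chars"))
            continue
        try:
            results.append(PromptResult(prompt, output=generate_fn(prompt)))
        except Exception as exc:   # keep going past bad samples
            results.append(PromptResult(prompt, error=str(exc)))
    return results
```

With this shape, callers iterate the results once and decide per item whether to retry, log, or drop the failed prompts.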
- Issue #22444 (closed): [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings
  - Core issue: When generating embeddings with Qwen2-VL, GPU utilization is very low for inputs containing images, and throughput is far below that of text-only inputs.
  - Discussion:
    - Reporter (fabiozappo): Provided detailed performance comparison data.
    - Contributor (ZJY0516): Attempted to reproduce and analyze the issue, noting that image encoding produces far more tokens than text, which may partly explain the throughput gap. Also found that increasing the number of API worker processes (--api-server-count) can actually degrade performance or cause errors.
    - Maintainer (DarkLight1337): Recommended profiling to locate the bottleneck and noted that the effectiveness of multiprocessing depends on the number of CPU cores.
  - Outcome: The issue was closed for inactivity without a definitive root-cause fix. It highlights that multimodal preprocessing (e.g., image encoding) can become a performance bottleneck, and that its interaction with vLLM's serving architecture needs further optimization.
🔥 Hot Topics and Trends
- API and user-experience improvements: Several PRs (e.g., #31427 adding a log-filtering option, #31426 improving error messages, #31423 refining DBO error hints) focus on clearer debugging information and better usability, showing that as features stabilize, the project is paying growing attention to developer experience.
- Multimodal model support keeps expanding:
  - New models: Issue #31410 requests T5Gemma support; PR #31436 adds support for the GLM-ASR audio-language model.
  - Performance and fixes: PR #31402 removes a redundant all_reduce in ViTs to improve performance; PR #31403 fixes a shape mismatch in Hunyuan-VL.
  - Infrastructure: PR #31401 relaxes the strict transformers version cap for multimodal model tests to improve test coverage.
- Kernel and execution-engine optimization:
  - Execution strategy: PR #31412 attempts to improve I/O efficiency by sorting the block IDs used in KV cache transfers.
  - Kernel configs: PR #31407 adds tuned fused-MoE kernel configurations for specific GLM models on RTX Pro 6000.
  - Compilation and caching: PR #31376 fixes a critical bug in compilation-config caching.
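The idea behind PR #31412 can be illustrated with a small sketch (not vLLM's actual code): once block IDs are sorted, adjacent IDs can be merged into contiguous (start, count) ranges, so a transfer layer can issue fewer, larger copies instead of one scattered copy per block. The function name `coalesce_block_ids` and the deduplication step are illustrative assumptions:

```python
from typing import List, Tuple


def coalesce_block_ids(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Sort block IDs and merge adjacent IDs into (start, count) ranges,
    so a KV-cache transfer layer can issue contiguous bulk copies."""
    if not block_ids:
        return []
    ids = sorted(set(block_ids))       # sorted, duplicates dropped (assumption)
    ranges = [[ids[0], 1]]
    for bid in ids[1:]:
        start, count = ranges[-1]
        if bid == start + count:       # extends the current contiguous run
            ranges[-1][1] += 1
        else:                          # gap: start a new range
            ranges.append([bid, 1])
    return [tuple(r) for r in ranges]
```

For example, the scattered IDs [7, 3, 4, 5, 9] collapse into three ranges, one of which covers three contiguous blocks in a single transfer.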
🛠️ Key Technical Changes
- PR #31169: [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading (merged)
  - Technical summary: A major refactor of the post-load weight-processing logic for FP8-quantized MoE models. Duplicated code paths for block-wise and tensor-wise quantization were cleaned up and unified into reusable helper functions.
  - Impact: Improves the modularity and maintainability of the FP8 MoE code and gives all backends (including NVIDIA CUTLASS/TRTLLM and AMD AITER) a common weight-processing foundation, an important step toward mature FP8 quantization support on heterogeneous hardware.
- PR #31376: [BugFix] Fix cache issue in compilation_config (merged)
  - Technical summary: Fixes a bug where the `@lru_cache`-decorated `get_cached_compilation_config()` did not clear its cache promptly on context switches (`set_current_vllm_config`). The old code cleared the cache only on context exit, so stale configuration could be read inside the context.
  - Impact: Ensures correctness when dynamically switching configurations for features that depend on the compilation config, such as `torch.compile`, avoiding subtle latent errors.
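This class of bug can be reproduced in miniature. The names below (`get_cached_config_mode`, `set_current_config`) are illustrative stand-ins for vLLM's `get_cached_compilation_config()` and `set_current_vllm_config`: an `@lru_cache`-decorated getter over mutable config state must be invalidated when the context is entered, not only when it exits:

```python
from contextlib import contextmanager
from functools import lru_cache

_current_config = {"mode": "default"}


@lru_cache(maxsize=None)
def get_cached_config_mode() -> str:
    # Cached view of mutable global state; the cache must be
    # invalidated whenever _current_config changes.
    return _current_config["mode"]


@contextmanager
def set_current_config(mode: str):
    """Switch config for the duration of the block. The fix is the
    cache_clear() on *entry*; clearing only on exit (the original bug)
    would let reads inside the block return the stale pre-switch value."""
    old = _current_config["mode"]
    _current_config["mode"] = mode
    get_cached_config_mode.cache_clear()   # fix: invalidate before any reads
    try:
        yield
    finally:
        _current_config["mode"] = old
        get_cached_config_mode.cache_clear()
```

Without the entry-time `cache_clear()`, a read cached before the context switch would leak into it, exactly the "stale configuration inside the context" failure mode described above.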
- PR #31438: [Bugfix] Preserve tool call id/type/name in streaming finish chunk (in progress)
  - Technical summary: Fixes the final chunk of a streamed tool-call response dropping key fields. The logic in `serving_chat.py` is changed to preserve the original fields obtained from the tool parser when building the finish chunk.
  - Impact: Improves vLLM's compatibility with the OpenAI streaming tool-call API, ensuring clients receive complete tool-call information.
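A minimal sketch of the fix's shape (illustrative types, not vLLM's actual `serving_chat.py` code): the finish-chunk delta must carry over the identifying fields from the parsed tool call rather than emitting only the trailing argument fragment:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCallDelta:
    """Simplified stand-in for a streaming tool-call delta entry."""
    index: int
    id: Optional[str] = None
    type: Optional[str] = None
    name: Optional[str] = None
    arguments: Optional[str] = None


def build_finish_chunk_delta(parsed_tool_call: dict,
                             remaining_args: str) -> ToolCallDelta:
    """Build the final streaming delta for a tool call, carrying over the
    identifying fields (id/type/name) from the parser's result instead of
    emitting a bare arguments-only delta -- the omission the fix addresses."""
    return ToolCallDelta(
        index=parsed_tool_call["index"],
        id=parsed_tool_call["id"],        # preserved, not dropped
        type=parsed_tool_call["type"],    # e.g. "function"
        name=parsed_tool_call["name"],
        arguments=remaining_args,
    )
```

Clients that key accumulated tool calls on `id` or `function.name` depend on those fields being present in every chunk that introduces or finalizes a call, which is why dropping them breaks OpenAI compatibility.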
📈 Development Activity Observations
- Contributor activity: User yurekami was extremely active this cycle, submitting more than 10 PRs focused on code quality (replacing generic exceptions with specific types, adding type annotations, turning magic numbers into named constants) and user experience (better error messages). This shows the community is doing ongoing housekeeping on the codebase alongside new feature work.
- Review and merge throughput: Far fewer PRs were merged this cycle (5) than opened (35). The merged PRs include important bug fixes (#31376, #30912) and cleanup work (#31169, #31405). The large number of open PRs indicates a full review queue and strict merge standards.
- Issue management: 33 issues were closed, the vast majority auto-closed after being marked `stale`. The project uses automation to periodically clear out stale, unattended issues and keep the issue list manageable.
💡 Issues Worth Watching
- Blackwell architecture compatibility (Issue #31422): On NVIDIA's latest Blackwell GPUs, the combination of tensor parallelism and prefix caching exposes a new compatibility bug. Core developers will need to adapt to and test the new hardware's characteristics.
- MM cache AssertionError crashing the engine (Issue #31404): An assertion error in the multimodal cache crashes the entire inference engine and aborts in-flight requests. This is a serious stability problem affecting production availability.
- Outdated NVIDIA NGC container (Issue #31424): The vLLM version bundled in the official NGC container (0.11.1) lags far behind the current stable release (0.13.x), frustrating users who depend on the container and potentially harming the official image's reputation.
- Continued AMD ecosystem progress: Direct AMD activity was light this cycle, but the merge of PR #31169 shows that deep ROCm optimization (especially FP8 MoE) is steadily landing in mainline and deserves long-term attention.
📋 Appendix: Detailed Data
New Issues
- #31440 [Feature]: Allow usage of continue_final_message in /embeddings endpoint — feature request — by kevin-pw (created: 2025-12-28 09:14 (UTC+8))
- #31439 [Bug]: aot_compile disables inductor graph partition — bug — by BoyuanFeng (created: 2025-12-28 08:04 (UTC+8))
- #31437 [Bug]: Streaming tool calls missing id/type/name in finish chunk — no labels — by amittell (created: 2025-12-28 07:54 (UTC+8))
- #31424 [Bug]: NVIDIA NGC container nvcr.io/nvidia/vllm:25.12-py3 ships outdated vLLM 0.11.1 instead of 0.13.x — bug — by vgoklani (created: 2025-12-28 05:27 (UTC+8))
- #31422 [Bug]: Model returns junk when using tensor_parallel == 2 with gpt-oss — bug — by vgoklani (created: 2025-12-28 05:06 (UTC+8))
- #31414 [Feature][Cleanup]: Unify `vllm.utils.flashinfer` and `vllm.model_executor.layers.quantization.utils.flashinfer_utils` — help wanted, good first issue, feature request — by robertgshaw2-redhat (created: 2025-12-28 02:27 (UTC+8))
- #31410 [New Model]: Support T5Gemma — new-model — by npuichigo (created: 2025-12-27 21:55 (UTC+8))
- #31404 [Bug]: MM cache AssertionError crashes engine, causes request aborts — bug — by shirejoni (created: 2025-12-27 16:59 (UTC+8))
- #31398 [Doc]: Eagle3 with tensor parallelism — documentation — by JSYRD (created: 2025-12-27 11:10 (UTC+8))
Closed Issues
- #16732 [Feature]: return graceful inference text input validation errors as part of output (without throwing an exception) - to enable skipping / handling bad examples after the processing of good ones — feature request, stale — by vadimkantorov (closed: 2025-12-28 10:17 (UTC+8))
- #19739 [Feature]: Logging details about incorrect requests — feature request, stale — by ErykCh (closed: 2025-12-28 10:16 (UTC+8))
- #20218 [Feature]: Split and shorten long CI jobs (e.g. entrypoints, spec decodes, kernels, etc.) — feature request, ci/build, stale — by kaori-seasons (closed: 2025-12-28 10:16 (UTC+8))
- #21122 [Bug]: Issue running mistralai/Magistral-Small-2506 on NVIDIA hardware — bug, stale — by jerin-scalers-ai (closed: 2025-12-28 10:16 (UTC+8))
- #21899 [Bug]: Processor mismatch between what is provided by OpenGVLab and VLLM for InternVL leading to outputs of the processor being too large to be decoded for the tokenizer — bug, stale — by SStoica12 (closed: 2025-12-28 10:16 (UTC+8))
- #22444 [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings — bug, stale — by fabiozappo (closed: 2025-12-28 10:16 (UTC+8))
- #23163 [Bug]: GPU memory allocate problem — bug, stale — by lsk-gith (closed: 2025-12-28 10:16 (UTC+8))
- #23365 [Feature]: Add support for /v1/responses API benchmarking backend — feature request, stale — by alew3 (closed: 2025-12-28 10:15 (UTC+8))
- #23626 [Feature]: Support for Multi-Vision Tower Models (e.g., Valley-Eagle-7B) — feature request, stale, multi-modality — by oswardlx (closed: 2025-12-28 10:15 (UTC+8))
- #23661 [Bug]: Get the issue No platform detected, vLLM is running on UnspecifiedPlatform when deploy on kubernetes using GPU time slicing — bug, stale — by barbarian23 (closed: 2025-12-28 10:15 (UTC+8))
- #23667 [CI]: Entrypoints tests cleanup — ci/build, stale — by njhill (closed: 2025-12-28 10:15 (UTC+8))
- #23697 [Bug]: Recycling request IDs can lead to requests leaking in the engine due to race condition in the cancellation logic — bug, stale — by Adolfo-Karim (closed: 2025-12-28 10:15 (UTC+8))
- #23707 [Bug]: Flex Attention compiled block mask + multiprocessing — bug, stale — by drisspg (closed: 2025-12-28 10:15 (UTC+8))
- #23719 [Usage]: CLS Fine-Tuned Model Fails to Launch with Multi-LoRA — usage, stale — by rfhshfhs (closed: 2025-12-28 10:15 (UTC+8))
- #23780 [Feature][P/D]: Optimize NIXL Connector xfer Launch — help wanted, good first issue, feature request, stale — by robertgshaw2-redhat (closed: 2025-12-28 10:15 (UTC+8))
- #23794 [Bug]: Getting different result in each run even after using temperature=0 in instruction finetuned Gemma-3-270m-it — bug, stale — by sujit0892 (closed: 2025-12-28 10:15 (UTC+8))
- #23797 [Usage]: InputProcessingError: Failed to prepare inputs for sequence group with request id: 0, — usage, stale — by Datta0 (closed: 2025-12-28 10:15 (UTC+8))
- #23821 [Bug]: Gemma 3n does not work in vllm/vllm-openai:v0.10.1.1 — bug, stale — by DaGeRe (closed: 2025-12-28 10:15 (UTC+8))
- #23873 [Renderer] Merge `OpenAIServing._preprocess_*` with `Processor.process_inputs` — feature request, stale — by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
- #23874 [Renderer] `Renderer` Prototyping with Qwen2.5VL — feature request, stale — by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
- #23880 [MM Encoder] ViT attention performance and consolidation — help wanted, good first issue, feature request, stale — by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
- #23881 [RFC]: Optimize grammar_bitmask Overhead by Moving Logic to Worker_runner — RFC, stale — by GGKOP (closed: 2025-12-28 10:15 (UTC+8))
- #23884 [MM Encoder] General encoder performance improvement — help wanted, good first issue, feature request, stale — by ywang96 (closed: 2025-12-28 10:15 (UTC+8))
- #23885 [Usage]: how to improve toks/s in the single-concurrent test use the evalscope tool? deploy qwen3-30-a3b model use 4 cards. — usage, stale — by funny000 (closed: 2025-12-28 10:15 (UTC+8))
- #23886 bad words implementation issue — stale — by da03 (closed: 2025-12-28 10:15 (UTC+8))
- #23893 [Bug]: Unable to capture kernel information through nsys ro torch profiler on RTX5090 — bug, stale — by m404notfound (closed: 2025-12-28 10:15 (UTC+8))
- #23911 [Bug]: The compilation of v0.9.2 succeeded with MacheteLinearKernel enabled for the RTX 4090 (CUDA architecture 8.9), but a runtime error was encountered. — bug, stale — by zzningxp (closed: 2025-12-28 10:15 (UTC+8))
- #23921 [Bug]: 5090 Qwen3-30B-A3B-FP8 fails when TP=2! — bug, stale — by tensorflowt (closed: 2025-12-28 10:15 (UTC+8))
- #23969 [Bug]: v1 xformers + sliding window not working — bug, stale — by Muennighoff (closed: 2025-12-28 10:15 (UTC+8))
- #31253 [Bug]: `VLLM_USE_FLASHINFER_MOE_FP16=1` generate different logprob for the same prompt in different run — bug — by zyongye (closed: 2025-12-28 03:59 (UTC+8))
- #31325 [Bug]: Qwen3-32B Quality Degeneration on Offline Generate — bug — by thavens (closed: 2025-12-28 03:14 (UTC+8))
- #30905 [Bug]: vLLM 0.11.2 AutoTune Missed Import In TorchInductor w/ Compile Sizes > 1 — bug, torch.compile — by rouchenzi (closed: 2025-12-27 12:08 (UTC+8))
- #31361 [Usage]: Question about the dummy run. It seems the dummy run use different precision? — usage — by Dingjifeng (closed: 2025-12-27 11:41 (UTC+8))
New PRs
- #31441 [ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test — rocm, ci/build — by AndreasKaratzas (created: 2025-12-28 09:22 (UTC+8))
- #31438 [Bugfix] Preserve tool call id/type/name in streaming finish chunk — frontend — by amittell (created: 2025-12-28 07:55 (UTC+8))
- #31436 Add GLM-ASR multimodal support — new-model, multi-modality — by baonudesifeizhai (created: 2025-12-28 07:29 (UTC+8))
- #31427 [Feature] Add --disable-metrics-access-log to filter monitoring logs — frontend — by yurekami (created: 2025-12-28 05:57 (UTC+8))
- #31426 [UX] Improve DCP/PCP/MTP error messages with backend suggestions — v1 — by yurekami (created: 2025-12-28 05:51 (UTC+8))
- #31428 [Bug] Fix GLM4 tool parser TypeError with empty arguments — no labels — by yurekami (created: 2025-12-28 06:00 (UTC+8))
- #31429 Add descriptive error messages to bare asserts in forward_context.py — no labels — by yurekami (created: 2025-12-28 06:08 (UTC+8))
- #31430 Consolidate duplicate exception handling in ray/lazy_utils.py — no labels — by yurekami (created: 2025-12-28 06:22 (UTC+8))
- #31431 Add INT32_BITS constant to replace magic number in quant_utils.py — no labels — by yurekami (created: 2025-12-28 06:24 (UTC+8))
- #31432 Add named constant for continuous usage report interval — no labels — by yurekami (created: 2025-12-28 06:27 (UTC+8))
- #31433 Add explicit warning categories to warnings.warn() calls — no labels — by yurekami (created: 2025-12-28 06:34 (UTC+8))
- #31434 Add return type annotations to post_init methods — no labels — by yurekami (created: 2025-12-28 06:37 (UTC+8))
- #31435 Add descriptive error messages to IPEX ops assertions — no labels — by yurekami (created: 2025-12-28 06:44 (UTC+8))
- #31425 [Cleanup] Replace generic Exception with specific types — frontend, v1 — by yurekami (created: 2025-12-28 05:38 (UTC+8))
- #31423 [UX] Improve DBO/microbatching error message for unsupported backends — no labels — by yurekami (created: 2025-12-28 05:20 (UTC+8))
- #31415 Apply refactor to ct — nvidia — by robertgshaw2-redhat (created: 2025-12-28 02:45 (UTC+8))
- #31407 Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 — performance, ready — by mratsim (created: 2025-12-27 18:00 (UTC+8))
- #31418 [v1] Improve GPU blocks error messages with actionable suggestions — v1 — by yurekami (created: 2025-12-28 04:22 (UTC+8))
- #31421 [Cleanup] Add descriptive messages to empty exceptions — no labels — by yurekami (created: 2025-12-28 05:04 (UTC+8))
- #31420 [Cleanup] Replace generic Exception with specific types (part 2) — v1 — by yurekami (created: 2025-12-28 04:46 (UTC+8))
- #31419 [Cleanup] Replace generic Exception with ValueError in quant utils — no labels — by yurekami (created: 2025-12-28 04:40 (UTC+8))
- #31413 poc of removing ModularKernelMethod and maybe_init_modular_kernel — no labels — by robertgshaw2-redhat (created: 2025-12-28 01:33 (UTC+8))
- #31417 [Reasoning] Add GLM-4.7 reasoning parser for template-injected tag — no labels — by yurekami (created: 2025-12-28 04:14 (UTC+8))
- #31416 [Feature][Cleanup] Unify flashinfer utils into package structure — nvidia — by yurekami (created: 2025-12-28 03:55 (UTC+8))
- #31412 [Core] Sort block IDs at I/O layer for contiguous memory access — v1 — by majiayu000 (created: 2025-12-28 00:16 (UTC+8))
- #31411 [Core] Sort blocks by block_id in FreeKVCacheBlockQueue.append_n — v1 — by majiayu000 (created: 2025-12-27 23:51 (UTC+8))
- #31408 Add Loraconfig parameter to get_punica_wrapper function — no labels — by ZT-AIA (created: 2025-12-27 18:59 (UTC+8))
- #31405 [Chore]: Remove HF format Phi4-MM examples — documentation, ready — by Isotr0py (created: 2025-12-27 17:00 (UTC+8))
- #31401 [CI/Build] Ignore max transformers version for more common tests — ready, multi-modality — by Isotr0py (created: 2025-12-27 14:50 (UTC+8))
- #31409 [Model] LLAMA MTP — llama — by sun-chao-hahaha (created: 2025-12-27 21:18 (UTC+8))
- #31406 [v1] Add encoder-only/cross attention support to Triton Attention backend — v1 — by Isotr0py (created: 2025-12-27 17:09 (UTC+8))
- #31403 [Model] Fix hunyuan-vl shape mismatch — no labels — by Potabk (created: 2025-12-27 16:45 (UTC+8))
- #31402 [Misc] Remove redundant all reduce in qkv split for ViTs — no labels — by Isotr0py (created: 2025-12-27 15:18 (UTC+8))
- #31400 [wip] custom allreduce and custom unquantized_gemm — no labels — by wxsIcey (created: 2025-12-27 14:20 (UTC+8))
- #31399 [Doc] Clarify tensor parallelism support for EAGLE-based draft models — documentation — by JSYRD (created: 2025-12-27 11:38 (UTC+8))
Merged PRs
- #31169 [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading — rocm, ready, nvidia — by robertgshaw2-redhat (merged: 2025-12-28 04:22 (UTC+8))
- #31376 [BugFix] Fix cache issue in compilation_config — ready — by BoyuanFeng (merged: 2025-12-27 22:30 (UTC+8))
- #31405 [Chore]: Remove HF format Phi4-MM examples — documentation, ready — by Isotr0py (merged: 2025-12-27 21:42 (UTC+8))
- #31401 [CI/Build] Ignore max transformers version for more common tests — ready, multi-modality — by Isotr0py (merged: 2025-12-27 21:06 (UTC+8))
- #30912 Fix/get raw stream patch #30905 — ready, torch.compile — by baonudesifeizhai (merged: 2025-12-27 12:08 (UTC+8))
PRs Closed Without Merging
- #23250 AITER MHA backend update — rocm, stale, v1 — by fsx950223 (closed: 2025-12-28 10:33 (UTC+8))
- #18901 [V1][Metrics] Add time_per_prefill_token — stale, v1 — by sahelib25 (closed: 2025-12-28 10:17 (UTC+8))
- #20677 [Bugfix] Fix grafana's model_name list showing other values — documentation, stale — by kebe7jun (closed: 2025-12-28 10:16 (UTC+8))
- #22029 Draft: Qwen2.5 VL eagle3 — speculative-decoding, stale, v1, qwen — by IzzyPutterman (closed: 2025-12-28 10:16 (UTC+8))
- #22362 Improved defaulting of chunked prefill and prefix caching in V1 — stale — by hypdeb (closed: 2025-12-28 10:16 (UTC+8))
- #23889 fix bad words issue when adding leading space changes token id count — stale — by da03 (closed: 2025-12-28 10:15 (UTC+8))
- #23930 Add actionable solutions to top 3 error messages — stale — by mtr7x (closed: 2025-12-28 10:15 (UTC+8))
- #31418 [v1] Improve GPU blocks error messages with actionable suggestions — v1 — by yurekami (closed: 2025-12-28 05:07 (UTC+8))
- #31411 [Core] Sort blocks by block_id in FreeKVCacheBlockQueue.append_n — v1 — by majiayu000 (closed: 2025-12-28 00:02 (UTC+8))
- #31409 [Model] LLAMA MTP — llama — by sun-chao-hahaha (closed: 2025-12-27 21:21 (UTC+8))
- #31403 [Model] Fix hunyuan-vl shape mismatch — no labels — by Potabk (closed: 2025-12-27 18:07 (UTC+8))