vLLM Development Activity Report - 2026-02-08
Time window: 2026-02-08 11:37 (UTC+8) ~ 2026-02-09 11:37 (UTC+8). Stats: 8 new issues | 25 closed issues | 29 new PRs | 10 merged PRs | 12 PRs closed without merging
📊 Daily Development Summary
Between February 8 and 9, 2026, the vLLM project showed a high level of development and maintenance activity, closing 25 backlog issues and opening 29 new PRs. Work focused on three areas: compatibility preparation for the latest technology stack (Python 3.14, PyTorch 2.10); support and adaptation for new model architectures (such as Qwen3.5 and GLM-4.7-Flash); and ongoing performance optimization and bug fixing, particularly around multimodality, speculative decoding, and the AMD ecosystem.
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity was relatively high this period, centered on core performance optimization, bug fixes, and CI resource management.
- Performance optimization: preparing for next-generation architectures
  - PR #34100 (Open): submitted by AMD engineer amd-hhashemi. This PR converts the wvSplitKQ kernel's compute instructions to 16x16 MFMA (Matrix Fused Multiply-Add), in preparation for the upcoming mi4xx architecture (presumably the MI400 series). The change also reduces register pressure, improving the existing wvSplitKQ kernel's performance on current architectures such as MI300.
- Bug fix: resolving a torch.compile compatibility issue
  - PR #34108 (Open): fixes a critical bug introduced by PR #33941. On ROCm, using torch.compile crashed the model forward pass because the FFI calls into amdsmi could not be traced by Dynamo. The fix hoists the GCN architecture detection logic to module import time and stores the result as a constant, sidestepping Dynamo's tracing limitation and keeping compilation working.
- Feature cleanup: removing redundant logic from the MoE kernel
  - PR #34086 (Open): removes the chunking mechanism from the FusedMoE kernel. Since chunked prefill now guarantees that a single forward pass never exceeds ~65K tokens, the kernel-level chunking loop has become unnecessary overhead. The change simplifies the code, deleting roughly 300 lines of complex logic. Although not AMD-specific, the PR's rocm label indicates it applies to ROCm as well, making it a general performance improvement.
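The removed pattern is essentially a chunked loop wrapped around the kernel body. A toy sketch (hypothetical names; the real FusedMoE kernel operates on GPU tensors) of such a loop versus the single-pass version that replaces it:

```python
# Toy illustration of kernel-level chunking vs. a single pass.
# Lists stand in for GPU tensors; expert() stands in for the fused
# MoE computation on a batch of tokens.

CHUNK = 4  # stand-in for the old per-chunk token budget

def expert(tokens):
    return [t * 2 for t in tokens]

def forward_chunked(tokens):
    # Old style: loop over fixed-size chunks inside the kernel wrapper.
    out = []
    for i in range(0, len(tokens), CHUNK):
        out.extend(expert(tokens[i:i + CHUNK]))
    return out

def forward_single(tokens):
    # New style: chunked prefill bounds len(tokens) upstream (~65K),
    # so one pass suffices and the inner loop is pure overhead.
    return expert(tokens)

assert forward_chunked(list(range(10))) == forward_single(list(range(10)))
```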
- CI / resource management
  - PR #34059 (Merged): adjusted AMD CI resource allocation, reducing GPU usage for the "Benchmarks" and "Benchmarks CLI Test" groups to free resources for other test jobs, reflecting ongoing attention to AMD CI pipeline efficiency.
- Ecosystem sync: upgrading to PyTorch 2.10
  - PR #30525 (Merged): a project-wide upgrade that is nonetheless critical for the AMD ecosystem. It bumps the project's PyTorch dependency to 2.10.0 (including the ROCm build), the foundation for vLLM support of the latest AMD hardware and software stack.
💬 High-Traffic Discussions
- Issue #29595: Qwen3-VL grounding accuracy regression on vLLM v0.11.1+ (closed, 30 comments)
  - Core issue: users reported that Qwen3-VL's object-detection (grounding) capability degraded severely in vLLM v0.11.1 and later, specifically on Hopper-architecture GPUs (H20/H100), while A100 behaved normally.
  - Viewpoints and debate:
    - Users: provided detailed reproduction steps and comparison screenshots, noting that performance and accuracy regressed together, and suspected the Flash Attention backend's Hopper-specific implementation.
    - Maintainers: responded quickly, first asking users to confirm the attention backend, then hypothesizing that the problem stemmed from internal PyTorch 2.9 changes on Hopper devices. The issue was ultimately resolved by merging PR #30525 (the PyTorch 2.10 upgrade), suggesting the root cause was a PyTorch 2.9 compatibility problem on Hopper.
  - Current status: closed following the PyTorch 2.10 upgrade, underscoring how sensitive vLLM is to its upstream PyTorch version.
- Issue #22308: custom tool-call support for GPT-OSS models (closed, 19 comments)
  - Core issue: a user asked how to configure --tool-call-parser for GPT-OSS models to enable custom tool calls.
  - Discussion:
    - The initial reply incorrectly pointed to reasoning-parser.
    - A maintainer then clarified that Harmony-format tool calls are already supported via --tool-call-parser openai, and the documentation has been updated accordingly.
    - Another user reported that the openai parser worked poorly in streaming mode (stream=True): tool calls were not parsed correctly.
    - Core point of contention: tool-call handling paths and prompt formats differ between streaming and non-streaming requests, producing inconsistent model output.
  - Conclusion: maintainers confirmed a streaming tool-call bug (since fixed by PR #24768), caused by the commentary channel not being activated correctly in the system prompt. The discussion highlights the importance of keeping streaming and non-streaming tool-call paths consistent.
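The streaming/non-streaming consistency concern can be shown with a minimal sketch: a streaming client receives a tool call's arguments as JSON fragments and must buffer them before parsing, whereas a non-streaming response arrives whole. The payloads below are illustrative, not vLLM's actual wire format:

```python
import json

# Non-streaming: the full tool-call arguments arrive as one string.
full_args = '{"city": "Paris", "unit": "celsius"}'

# Streaming: the same arguments arrive as incremental deltas; parsing
# any single delta alone would raise, so the client must accumulate.
deltas = ['{"city": "Pa', 'ris", "unit"', ': "celsius"}']

buffer = ""
for delta in deltas:
    buffer += delta  # a robust parser buffers until the JSON is complete

# Both paths must yield the same parsed tool call -- the invariant
# that the bugs discussed above violated.
assert json.loads(buffer) == json.loads(full_args)
```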
- Issue #34096: support vLLM on Python 3.14 (Open, 3 comments)
  - Core issue: the author proposes that vLLM begin supporting Python 3.14, now that PyTorch provides official support.
  - Viewpoints and progress:
    - Author: tested dependency compatibility in detail, identified ray[cgraph] as the main blocker, and provided both a workaround that makes it an optional dependency and a patch enabling a successful source install.
    - Maintainer: immediately took ownership ("I will take this :)").
    - PyTorch team member: added that torch.compile supports Python 3.14 as of PyTorch 2.10.
  - Current status: in progress. This reflects the vLLM community's forward planning to stay compatible with the latest Python ecosystem.
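Making a blocking dependency optional typically means moving it into an extras table. A hypothetical pyproject.toml sketch of the approach the issue author describes (not vLLM's actual packaging configuration):

```toml
[project]
name = "example-project"
dependencies = [
    # core dependencies that already ship Python 3.14 wheels go here
]

[project.optional-dependencies]
# ray[cgraph] does not yet ship a Python 3.14 wheel, so gate it
# behind an extra instead of requiring it unconditionally:
ray = ["ray[cgraph]"]
```

Users on older Python versions would then opt in with `pip install "example-project[ray]"`, while Python 3.14 users install the core package without it.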
🔥 Hot Topics & Trends
- Cutting-edge environment adaptation: Python 3.14 support and the PyTorch 2.10 upgrade were this period's focus, showing the project keeping pace with its underlying technology stack.
- Model-support battles: several issues reflect the challenges posed by rapidly iterating model architectures:
    - GLM-4.7-Flash fails to load because it uses the glm4_moe_lite architecture, which exists only on the transformers main branch.
    - Ministral-3 models fine-tuned with Unsloth lack configuration files vLLM expects (such as params.json), complicating deployment.
    - New PRs have begun adding official support for the Qwen3.5 model family.
- Tool calls meet streaming complexity: Issue #22308 and PR #34101 (fixing Hermes tool-call stream truncation) both show that tool-call robustness remains an ongoing challenge wherever streaming output, stop sequences, and complex parsing logic interact.
- Multimodality and memory optimization: the merged PR #32493 introduces an enable_mm_embeds flag, letting users disable the vision encoder (saving memory) while still feeding in precomputed image embeddings, a more flexible configuration for multimodal serving.
- Installation and dependency issues: in Issue #34090 a user's install failed due to a missing libmpi_cxx.so.40, likely an indirect dependency pulled in by a specific PyTorch wheel, pointing to the user-support burden of complex dependency environments.
🛠️ Key Technical Changes
- Python 3.14 support outlook (Issue #34096): although still early, the community has begun systematically assessing and resolving dependency issues, a key step toward vLLM's future compatibility.
- Qwen3.5 model support (PR #34110): adds a support scaffold ahead of the upcoming Qwen3.5 dense and MoE releases, reflecting close collaboration with leading model vendors.
- PyTorch 2.10 upgrade landed (PR #30525): a large PR nearly two months in the making finally merged. Beyond fixing the Hopper performance/accuracy issues, it lays the groundwork for PyTorch 2.10 features such as torch.compile support for Python 3.14.
- Enhanced "text-only" multimodal mode (PR #32493): combining --limit-mm-per-prompt '{"image": 0}' with --enable-mm-embeds enables a "lightweight" multimodal serving mode that saves encoder memory while keeping embedding inputs available, improving deployment flexibility.
- AMD architecture optimization (PR #34100): migrating the kernel instruction set to 16x16 MFMA is a low-level optimization targeting AMD's next-generation GPU architecture, aimed at strengthening its competitiveness in AI compute.
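Combining the two flags from PR #32493 might look like the following server invocation. The model name is a placeholder, and flag spellings follow the report above, so verify both against your installed vLLM version:

```shell
# Serve a multimodal model with the vision encoder disabled (no raw
# images accepted) while still accepting precomputed image embeddings.
vllm serve <your-multimodal-model> \
  --limit-mm-per-prompt '{"image": 0}' \
  --enable-mm-embeds
```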
📈 Development Activity Observations
- Active AMD contributors: amd-hhashemi and AndreasKaratzas submitted significant kernel-optimization and bug-fix PRs, showing AMD's investment in vLLM reaching down to the core performance and framework-compatibility layers.
- High community engagement: on model support (Qwen3.5, GLM) and troubleshooting (the Qwen3-VL accuracy issue), developers from model vendors or affiliated teams (e.g. JJJYmmm, chaunceyjiang) participated and collaborated in depth.
- Efficient issue triage: 25 historical issues were closed within 24 hours. Most were auto-closed as stale, but the batch also includes core problems resolved by code fixes (such as the Qwen3-VL accuracy issue), indicating the maintenance team is actively clearing the backlog.
💡 Issues Worth Watching
- Python 3.14 dependency hell (Issue #34096): key dependencies such as ray[cgraph] do not yet ship Python 3.14 wheels, the biggest obstacle to supporting the new Python version. The community may need to push upstream dependencies or define a transition plan.
- Rapidly evolving model architectures: models like GLM-4.7-Flash depend on the transformers main branch, conflicting with vLLM's policy of depending on stable PyPI releases. Balancing fast support for new models against dependency stability remains an ongoing challenge.
- AMD ecosystem integration entering deep water: PR #34108 exposed a compatibility problem between an AMD-specific library (amdsmi) and PyTorch graph compilation (torch.compile). As compilation becomes ubiquitous, more such low-level integration issues are likely to surface.
- Streaming consistency for tool calls (Issue #22308 & PR #34101): behavioral differences between the streaming and non-streaming APIs in tool-call parsing easily confuse developers. Keeping parsing logic identical across the two modes is critical for a stable developer experience.
📋 Appendix: Detailed Data
New Issues
- #34096 [Tracking Feature]: Support vLLM with Python 3.14 — feature request — by mgoin (created: 2026-02-09 01:12 (UTC+8))
- #34106 [Bug]: MLP Speculator AttributeError: ‘MLPSpeculatorConfig’ object has no attribute ‘num_attention_heads’ — bug — by kylesayrs (created: 2026-02-09 08:53 (UTC+8))
- #34090 what am I doing wrong ? libmpi_cxx.so.40: cannot open shared object file: No such file or directory — usage — by peter247 (created: 2026-02-08 23:24 (UTC+8))
- #34099 [Usage]: VLLM with finetuned unsloth ministral 3 — usage — by LeDuySon (created: 2026-02-09 02:38 (UTC+8))
- #34098 [Bug]: GLM-4.7-Flash requires transformers from git (glm4_moe_lite but Transformers does not recognize this architecture) — bug — by brokedba (created: 2026-02-09 02:13 (UTC+8))
- #34095 [Usage]: Option to set system_fingerprint? — usage — by vrdn-23 (created: 2026-02-09 00:49 (UTC+8))
- #34094 [Bug]: assert num_cache_lines >= batch — bug — by vitalik (created: 2026-02-09 00:45 (UTC+8))
- #34076 [Bug]: KV Cache Memory Bottleneck Calculation in Pipeline Parallel (_check_enough_kv_cache_memory in get_kv_cache_configs) — bug — by chizhiwei (created: 2026-02-08 12:02 (UTC+8))
Closed Issues
- #22308 [Feature]: If I want gpt-oss to be able to call custom tools, how should I set the --tool-call-parser parameter during deployment? — feature request,stale,gpt-oss — by LZAndy (closed: 2026-02-09 10:18 (UTC+8))
- #23033 [RFC]: Support mmmu benchmark — RFC,stale — by tanruixiang (closed: 2026-02-09 10:18 (UTC+8))
- #23107 [Feature]: consider all env vars in compilation hash with some opt-out — good first issue,feature request,stale — by youkaichao (closed: 2026-02-09 10:18 (UTC+8))
- #25443 [Usage]: Why is OpenCV used for video and image preprocessing? Especially when it comes to processing videos, the efficiency is too low. I want to modify this part to torchvision parallel computing. Where should I start? — usage,stale — by HAOYON-666 (closed: 2026-02-09 10:18 (UTC+8))
- #25561 [Bug]: output is not deterministic — bug,stale — by zhink (closed: 2026-02-09 10:18 (UTC+8))
- #25587 [Feature Request] Support q_lora_rank=None for MiniCPM3 — feature request,stale — by bzantium (closed: 2026-02-09 10:18 (UTC+8))
- #26282 [Usage]: silencing ALL vLLM-enabled logging — usage,stale — by BramVanroy (closed: 2026-02-09 10:17 (UTC+8))
- #26393 [Bug]: vllm-v0.11.0 run Qwen3-Next-80B-A3B-Instruct-FP8 fail — bug,stale — by exceedzhang (closed: 2026-02-09 10:17 (UTC+8))
- #26409 [Bug]: Performance issue with v1 engine — bug,stale — by rse173 (closed: 2026-02-09 10:17 (UTC+8))
- #26433 [RFC]: Batching speculation — RFC,stale — by pkuwangh (closed: 2026-02-09 10:17 (UTC+8))
- #26479 [Bug]: 0.10.1 offline stop profiler error Can’t disable Kineto profiler when it’s not running — bug,stale — by Xerxes-cn (closed: 2026-02-09 10:17 (UTC+8))
- #26491 [Bug]: vllm/vllm-openai:nightly container crashes with --otlp-traces-endpoint due to missing opentelemetry package — bug,stale — by Aymendje (closed: 2026-02-09 10:17 (UTC+8))
- #26493 [Feature]: expose model revisions in OpenAI v1/models endpoint — feature request,stale — by mikix (closed: 2026-02-09 10:17 (UTC+8))
- #26511 [Performance]: Deepseek3.2 Compile Time Regression on B200 with Pytorch 2.9 — performance,torch.compile,stale — by Lucaskabela (closed: 2026-02-09 10:17 (UTC+8))
- #26544 [Bug]: TranscriptionRequest is missing request_id — bug,stale — by eicherseiji (closed: 2026-02-09 10:17 (UTC+8))
- #26555 [Bug]: Highly concurrent calls to the vllm service can cause graphics card crashes — bug,stale — by shuoshuo0 (closed: 2026-02-09 10:17 (UTC+8))
- #26561 [Bug]: Qwen3-Coder encountered a large number of errors when using the calling capabilities of vllm-0.11.0. — bug,stale — by Jeremy-J-J (closed: 2026-02-09 10:17 (UTC+8))
- #26573 [Bug]: When stream=True, the seed-oss model, the tool_calls = None — bug,stale,tool-calling — by CallmeZhangChenchen (closed: 2026-02-09 10:17 (UTC+8))
- #26576 [Bug]: Hardcoded token_ids size causes OOV error for small tokenizer — bug,stale — by grzegorz-k-karch (closed: 2026-02-09 10:17 (UTC+8))
- #26589 [Bug]: Ovis2.5 9B FP8 quantisation bug — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-09 10:17 (UTC+8))
- #26598 [Bug]: Pipeline parallelism skips GPUs and hangs during model load — bug,stale — by devnen (closed: 2026-02-09 10:17 (UTC+8))
- #33888 [Feature]: when support torch 2.10.0 — feature request — by chamwen (closed: 2026-02-09 05:51 (UTC+8))
- #29595 [Bug]: Qwen3-VL-235B-A22B-Instruct Grounding Accuracy Issue in vLLM (>= v0.11.1) — bug,torch.compile — by 420516460 (closed: 2026-02-09 05:51 (UTC+8))
- #33974 [Feature]: We propose the official development and maintenance of a VLLM integration or plugin within Dify. — feature request — by ooodwbooo (closed: 2026-02-09 01:04 (UTC+8))
- #21943 [Feature] Skip modules for disabled modalities — good first issue,feature request,multi-modality — by DarkLight1337 (closed: 2026-02-08 20:57 (UTC+8))
New PRs
- #34110 [MODEL] Adding Support for Qwen3.5 Models — new-model,speculative-decoding,v1,qwen — by JJJYmmm (created: 2026-02-09 11:16 (UTC+8))
- #34111 [DRAFT][XPU] clean up existing ipex — documentation,v1 — by jikunshang (created: 2026-02-09 11:17 (UTC+8))
- #34091 refactor_code_for_repeated_functions — cpu,nvidia — by tom-zju (created: 2026-02-09 00:05 (UTC+8))
- #34075 Support MP backend for elastic EP scale-down — frontend,v1 — by jianzs (created: 2026-02-08 11:41 (UTC+8))
- #34086 [Feature]: Remove Chunking From FusedMoE — documentation,rocm,gpt-oss,nvidia — by SouthWest7 (created: 2026-02-08 19:39 (UTC+8))
- #34109 [Kernel] Refactor FlashInfer allreduce for mnnvl backend — nvidia — by hjjq (created: 2026-02-09 10:47 (UTC+8))
- #34102 [DP] Only use DP padding when cudagraphs are actually used — speculative-decoding,ready,v1,nvidia — by LucasWilkinson (created: 2026-02-09 03:46 (UTC+8))
- #34092 [torch.compile] Disable recursive pre_grad_passes — no labels — by zou3519 (created: 2026-02-09 00:19 (UTC+8))
- #34108 [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection — bug,rocm — by AndreasKaratzas (created: 2026-02-09 10:05 (UTC+8))
- #34077 [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP — bug,ready,v1,qwen — by vadiklyutiy (created: 2026-02-08 12:31 (UTC+8))
- #34103 [Tiny] Rename encoder budget file to more specific name — ready,v1,multi-modality — by reaganjlee (created: 2026-02-09 05:20 (UTC+8))
- #34097 [LoRA] Support LoRA for Qwen3OmniMoeThinkerForConditionalGeneration — documentation,v1,qwen — by linitra24 (created: 2026-02-09 01:28 (UTC+8))
- #34107 [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr — ready,multi-modality — by AndreasKaratzas (created: 2026-02-09 09:30 (UTC+8))
- #34087 [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) — bug,nvidia — by TomerBN-Nvidia (created: 2026-02-08 20:38 (UTC+8))
- #34105 Move spec decode offline script from example/ to inside vllm for reusability — documentation,speculative-decoding,v1 — by ekagra-ranjan (created: 2026-02-09 08:00 (UTC+8))
- #34104 Fix Mistral config remap to accept compressed-tensors quantization #34028 — no labels — by baonudesifeizhai (created: 2026-02-09 06:04 (UTC+8))
- #34100 Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. — rocm — by amd-hhashemi (created: 2026-02-09 03:00 (UTC+8))
- #34101 Fix hermes tool call stream truncation when stop sequences are used — frontend — by maxdebayser (created: 2026-02-09 03:08 (UTC+8))
- #34085 Fix DeepSeek-OCR tensor validation for all size variants — ready,deepseek — by yichuan-w (created: 2026-02-08 18:09 (UTC+8))
- #34088 [BugFix] Change support no act and mul for marlin — bug,ready,nvidia — by TomerBN-Nvidia (created: 2026-02-08 21:57 (UTC+8))
- #34079 [Bugfix] Fix negative local_cache_hit in P/D disaggregation metrics — bug,documentation,v1,kv-connector — by Prowindy (created: 2026-02-08 13:58 (UTC+8))
- #34093 [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter — no labels — by zou3519 (created: 2026-02-09 00:23 (UTC+8))
- #34084 Convert online APIs to use Renderer — frontend — by reaganjlee (created: 2026-02-08 17:31 (UTC+8))
- #34089 Constrain numpy/opencv for compatibility — ci/build — by aissam-out (created: 2026-02-08 23:06 (UTC+8))
- #34083 [Docs] Update docs to include mm processor + encoder benchmarks — documentation,frontend,v1,multi-modality — by reaganjlee (created: 2026-02-08 17:07 (UTC+8))
- #34082 [Frontend, grpc] add support for Embed RPC in grpc_server — frontend — by santiramos27 (created: 2026-02-08 15:59 (UTC+8))
- #34081 custom behavior — frontend — by dsingal0 (created: 2026-02-08 15:39 (UTC+8))
- #34080 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation — bug,v1 — by chizhiwei (created: 2026-02-08 14:27 (UTC+8))
- #34078 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation (#34076) — bug,v1 — by chizhiwei (created: 2026-02-08 13:44 (UTC+8))
Merged PRs
- #34027 [bug-fix] supported_tasks is breaking backward compatibility at init_app_state — bug,frontend,ready — by kouroshHakha (merged: 2026-02-09 09:46 (UTC+8))
- #30525 [Release 2.10] Update to Torch 2.10 - final release — documentation,rocm,ci/build,v1,cpu,gpt-oss,nvidia,ready-run-all-tests — by atalman (merged: 2026-02-09 05:51 (UTC+8))
- #33786 Add support for ModelOpt MXFP8 dense models — documentation,ready,nvidia — by danisereb (merged: 2026-02-09 03:16 (UTC+8))
- #32958 glm 4.6 fused tuned inference config for B200 — ready — by navmarri14 (merged: 2026-02-09 02:55 (UTC+8))
- #33735 [torch.compile] Add an option to force-enable the MOE cold start optimization — ready — by zou3519 (merged: 2026-02-09 02:42 (UTC+8))
- #34088 [BugFix] Change support no act and mul for marlin — bug,ready,nvidia — by TomerBN-Nvidia (merged: 2026-02-09 01:18 (UTC+8))
- #33771 [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate — ready,v1,nvidia — by aabbccddwasd (merged: 2026-02-09 00:13 (UTC+8))
- #32493 Add embedding input functionality for disabled modalities [remake] — documentation,frontend,ready,v1,multi-modality — by reaganjlee (merged: 2026-02-08 20:57 (UTC+8))
- #34059 [ROCm] [CI] Reduce Resource of two test groups — rocm,ready,ci/build — by tjtanaa (merged: 2026-02-08 15:17 (UTC+8))
- #33855 [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used — structured-output,ready,v1,deepseek — by njhill (merged: 2026-02-08 15:16 (UTC+8))
PRs Closed Without Merging
- #31458 Vox streaming on top — frontend,tpu,needs-rebase,v1 — by patrickvonplaten (closed: 2026-02-09 10:59 (UTC+8))
- #19710 [Model] Activated LoRA — documentation,frontend,needs-rebase,stale,v1,tool-calling — by tdoublep (closed: 2026-02-09 10:19 (UTC+8))
- #22236 [Perf][Feat][Core] Workload-Aware KVCache Eviction Policy — documentation,performance,needs-rebase,ci/build,stale,v1 — by Chasingdreams6 (closed: 2026-02-09 10:19 (UTC+8))
- #25722 [RFC][Core] propagate the error message up to the frontend process — stale,v1 — by 842974287 (closed: 2026-02-09 10:18 (UTC+8))
- #25742 feat(minicpm3): support q_lora_rank is None — speculative-decoding,stale,v1,deepseek — by bzantium (closed: 2026-02-09 10:18 (UTC+8))
- #26557 [Qwen3-Next] MoE configs for 4090 TP=1/PP — stale,qwen — by ReinForce-II (closed: 2026-02-09 10:17 (UTC+8))
- #33469 Add Kimi-Audio-7B model support — documentation,new-model,multi-modality — by tunglinwood (closed: 2026-02-09 10:12 (UTC+8))
- #33854 [BugFix] Potential bug fix for test_async_tp_pass_correctness — bug,ready,v1 — by LucasWilkinson (closed: 2026-02-09 07:25 (UTC+8))
- #29726 [Frontend] Add streaming tool-call support to Responses API (non-Harmony) — frontend,gpt-oss — by sumitaryal (closed: 2026-02-09 01:25 (UTC+8))
- #34081 custom behavior — frontend — by dsingal0 (closed: 2026-02-08 16:55 (UTC+8))
- #33023 refactor: unify FlashInfer utils into vllm.utils.flashinfer — nvidia — by puranikyashaswin (closed: 2026-02-08 14:35 (UTC+8))
- #34078 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation (#34076) — bug,v1 — by chizhiwei (closed: 2026-02-08 13:59 (UTC+8))