vLLM Development Activity Report - 2026-03-14
Time window: 2026-03-14 11:46 (UTC+8) ~ 2026-03-15 11:46 (UTC+8) | Stats: 11 new Issues | 10 closed Issues | 51 new PRs | 14 merged PRs | 8 PRs closed without merging
📊 Daily Development Summary
vLLM development was highly active during this 24-hour window, with a large volume of PRs opened and merged (51 opened / 14 merged). Core work concentrated on performance optimization, kernel integration, stability fixes, and code-architecture cleanup. Community attention has clearly shifted toward maturing the V1 engine, multimodal support, and deep optimization of the inference pipeline, while several bug reports involving memory safety and concurrency stability received prompt responses.
🎯 AMD/ROCm Ecosystem Updates
No direct updates to AMD hardware / ROCm driver-level core code landed this window. There was only one indirectly related discussion:
- RFC #37075 (Opt-in Media URL Cache for `MediaConnector`): This proposal adds caching for media URLs to address repeated downloads in network-restricted environments (such as the AMD CI build and other behind-firewall deployments). It references an existing PR #36951, which added a persistent cache mount for AMD CI to eliminate redundant downloads between test runs. This shows the AMD team improving vLLM's stability and efficiency in its CI environment through infrastructure work, without touching ROCm or AMD GPU kernel code.
💬 High-Engagement Discussion Analysis
- Closed Issue #23814: “[Bug]: illegal memory access when there are multiple concurrent request”
  - Core topic: Users hit CUDA illegal memory access errors under high-concurrency production load, suspected to involve memory leaks or improper GPU memory management.
  - Viewpoints:
    - Reporter (@seabnavin19 and others): The error reproduces reliably when using forced tool calling under high concurrent load.
    - Maintainer (@njhill): Initially suspected pipeline parallelism (PP), but later confirmed the error also occurs without PP, and speculated it may be related to guided decoding (xgrammar).
    - Another user (@ggamsso): Tried switching the guided-decoding backend from xgrammar to guidance as a workaround, without fully resolving the issue.
  - Points of contention: None; the discussion focused on confirming symptoms and narrowing down possible causes.
  - Outcome: The issue was marked “stale” and auto-closed after a long period without progress. The root cause may have been fixed elsewhere in later releases, but was never explicitly confirmed in this thread.
- Closed Issue #21278: “[Performance]: Speculative decoding doesn’t seem to speed up inference?”
  - Core topic: A user tried speculative decoding with a small draft model but observed no speedup; inference was actually slower.
  - Viewpoints:
    - Reporter (@la1ty): Tested multiple draft models and `num_speculative_tokens` combinations; none produced a speedup.
    - Another user (@gerayking): Suggested possible causes, including a low acceptance rate on the (Chinese) dataset and lack of speculative-decoding support in the v1 engine at the time.
    - Reporter follow-up: Suspected that the time spent transferring draft-model sampling results to the target model might be the bottleneck.
  - Points of contention: None; mostly analysis of why performance fell short of expectations and sharing of experience.
  - Outcome: Closed as “stale”. The discussion highlights the practical complexity of speculative decoding: its speedup depends on draft-model fit, data distribution, and hardware transfer overhead, and is not guaranteed.
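Why speculative decoding can be slower is easy to see with the idealized cost model from the speculative-decoding literature: under the simplifying assumption that each drafted token is accepted independently with probability alpha, the expected tokens produced per target-model verification step is (1 - alpha^(k+1)) / (1 - alpha), and the net speedup also depends on the draft model's relative cost. The function names below are illustrative:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification step,
    assuming each of k drafted tokens is accepted i.i.d. with
    probability alpha (idealized model; includes the bonus token)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup vs. plain autoregressive decoding, where draft_cost is
    the cost of one draft forward pass relative to one target pass."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1)
```

For example, with a well-matched draft model (alpha = 0.8, k = 5, draft at 10% of target cost) the model predicts a >2x speedup, but at alpha = 0.3 (plausible for a mismatched draft model or out-of-distribution Chinese text) the same setup comes out slower than plain decoding, which is consistent with the behavior reported in this issue.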
- New Issue #37080: “[Bug]: qwen 3.5 memory requirement of int4 model is higher than fp8”
  - Core topic: A user hit out-of-memory (OOM) errors running the INT4-quantized Qwen3.5-27B while the FP8 version ran fine, contradicting what quantization theory would predict.
  - Viewpoints:
    - Reporter (@ErykCh): Provided detailed error logs and comparison commands showing abnormal memory requirements for the INT4 model.
    - Contributor (@David-Wen2025): Pointed out that the user likely passed the wrong quantization argument (`--quantization moe_wna16` is meant for MoE models); for Qwen3.5-27B (a dense model), `gptq` or `gptq_marlin` should be used instead, and provided a corrected launch command.
  - Points of contention: None. A typical user-configuration problem, quickly diagnosed through community help.
  - Current status: The issue remains open, but a community member has already provided a clear solution. This suggests room for improvement in user documentation and error messages.
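The fix amounts to selecting a quantization backend that matches the model architecture. A hypothetical before/after launch command (the model path is illustrative and not copied from the issue; exact flags should be checked against the vLLM docs):

```shell
# Misconfigured: moe_wna16 is an MoE-specific quantization backend,
# so applying it to a dense model leads to abnormal memory behavior.
vllm serve Qwen/Qwen3.5-27B-GPTQ-Int4 --quantization moe_wna16

# Corrected: a dense INT4 checkpoint should use a GPTQ backend instead.
vllm serve Qwen/Qwen3.5-27B-GPTQ-Int4 --quantization gptq_marlin
```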
🔥 Hot Topics and Trends
- Performance-optimization wave: Several PRs push for peak performance, e.g. #37066 proposes a new bloom-filter-based scheme for distributed KV-cache discovery, #37063 vectorizes the stores in FP8-quantized merged attention states, and #37041 integrates FlashInfer's fused RoPE + KV-cache append operation.
- Data-structure and API cleanup: In line with the Q1 roadmap, #37078 introduces `TokenArray` to replace `list[int]` and optimize the V1 engine's memory layout, while #37085 begins standardizing the weight-loading API to reduce code duplication and ease support for new formats.
- Tool-calling and reasoning-parser fixes: A series of fix PRs (e.g. #37081, #37044, #37071, #37070, #37072) landed around tool calling and reasoning parsing for different model families (Mistral, GLM4 MoE, SeedOSS, Harmony), showing this functionality still needs continued polish in complex usage scenarios.
- System stability and memory management: Several issues report deep system-level problems, such as the potential use-after-free in the KV block allocator under eviction pressure (#37076) and the GPU VMM race condition during sleep/wakeup across multiple engine instances fixed by #37065. This reflects strong community focus on production stability.
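This digest does not spell out #37066's internals, but the general idea of bloom-filter-based cache discovery can be sketched: each node advertises a compact bit array over the hashes of the KV blocks it holds, and peers test membership locally before issuing any network lookup. The `BlockBloomFilter` class and hashing scheme below are illustrative, not vLLM code:

```python
import hashlib


class BlockBloomFilter:
    """Minimal bloom filter over KV-cache block hashes (illustrative).

    False positives are possible (a peer may be asked for a block it
    does not have), but false negatives are not, so no reusable cache
    block is ever missed by the discovery step.
    """

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, block_hash: str):
        # Derive k independent bit positions by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{block_hash}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, block_hash: str) -> None:
        for pos in self._positions(block_hash):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, block_hash: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(block_hash))
```

The appeal for distributed KV-cache discovery is that the filter is a few kilobytes regardless of how many blocks a node caches, so it can be gossiped cheaply, and a negative answer definitively rules out a remote fetch.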
🛠️ Key Technical Changes
- PR #36817 (merged): “[Model Runner V2] Add Support for XD-RoPE”
  - Technical summary: Adds support for the XD-RoPE (extended-dimension RoPE) positional encoding to the V2 model runner, enabling vLLM to run multimodal models such as `tencent/HunyuanOCR` that need to handle 3D or 4D positional information.
  - Impact: Extends vLLM's coverage of cutting-edge multimodal architectures and keeps it competitive on advanced positional-encoding techniques.
- PR #36139 (merged): “[Feature] Add InstantTensor weight loader”
  - Technical summary: Introduces a new weight-loading format, `instanttensor`, designed to exploit the bandwidth of high-speed storage (e.g. 400 Gbps network storage) to accelerate loading of large models.
  - Impact: Offers a potential load-speed optimization for very large models and frequent model-switching scenarios; an infrastructure improvement for user experience and resource utilization.
- PR #36903 (merged): “[Misc] Clean up Kimi-audio whisper encoder loading”
  - Technical summary: Adds a `subfolder` field to `DefaultModelLoader.Source` and refactors the Whisper-encoder loading logic of the Kimi-audio model, solving the problem of loading components of complex multimodal models from subdirectories.
  - Impact: Improves the flexibility and clarity of multimodal model loading, paving the way for integrating more complex model structures.
- Issue #37076 (new): “[Bug]: Potential use-after-free in KV block allocator under eviction pressure”
  - Technical summary: Discovered via fuzz testing: with prefix caching enabled and under high memory pressure, the KV block allocator can return incorrect block boundaries, letting token data from one request “leak” into another and causing non-deterministic output. This is a serious memory-safety problem.
  - Impact: Exposes a latent defect in the core memory-management module under extreme concurrency and eviction; warrants priority investigation by core developers, as it affects the determinism and correctness of inference.
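The failure mode described (a block recycled under eviction pressure while a live request still references it, so two requests end up reading the same physical block) can be illustrated with a deliberately buggy toy allocator. This is an illustration of the bug class, not vLLM's actual allocator code:

```python
class ToyBlockAllocator:
    """Deliberately buggy allocator illustrating a use-after-free:
    under memory pressure, eviction recycles a block without checking
    whether a live request still holds a reference to it."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_list = list(range(num_blocks))
        self.blocks = [[0] * block_size for _ in range(num_blocks)]

    def allocate(self) -> int:
        if self.free_list:
            return self.free_list.pop()
        # BUG: "evict" block 0 with no reference counting, so a
        # request that still holds block 0 now shares it silently.
        return 0

    def write(self, block_id: int, tokens: list[int]) -> None:
        self.blocks[block_id][: len(tokens)] = tokens

    def read(self, block_id: int) -> list[int]:
        return self.blocks[block_id]
```

Because the symptom is another request's tokens appearing in the output rather than a crash, bugs of this shape produce exactly the non-deterministic "token leakage" the issue reports, which is why fuzzing under eviction pressure is an effective way to find them.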
- PR #37062 (merged): “[Frontend] Reduce chat template warmup logging levels”
  - Technical summary: Lowers the log level of the `ChatTemplateResolutionError` emitted during chat-template warmup (when a model has no valid template) from ERROR to INFO, and removes the stack trace.
  - Impact: Improves user experience; harmless, expected messages are no longer mistaken for startup errors, making logs clearer.
📈 Development Activity Observations
- Active contribution: 51 new PRs in a single day show very high community engagement and a fast iteration pace, with contributors ranging from core maintainers to first-time developers.
- Efficient review and merging: 14 PRs were merged the same day, including several feature and fix PRs (e.g. #36817, #36139, #36903), indicating an efficient core review process that absorbs community contributions promptly.
- AMD participation: Although no core code landed this window, the mention in RFC #37075 shows the AMD team continuing to invest in CI/test infrastructure to safeguard platform compatibility.
💡 Issues Worth Watching
- KV-cache memory safety (Issue #37076): The reported potential use-after-free is a high-priority hazard; watch the investigation and fix closely.
- Speculative-decoding stability and performance (Issues #37035, #21278): Illegal memory accesses and performance regressions when using speculative decoding (especially with FlashInfer and specific models) show that this advanced feature's robustness under complex configurations still needs hardening.
- Media-cache standardization (RFC #37075): The proposal addresses network dependencies in testing and production; whether and how its design is adopted matters broadly for developer experience and deployment robustness.
📋 Appendix: Detailed Data Lists
New Issues
- #37086 [Feature]: Expose stable request completion hook in streaming serving paths — feature request — by cbdeane (created: 2026-03-15 09:32 (UTC+8))
- #37080 [Bug]: qwen 3.5 memory requirement of int4 model is higher than fp8 — bug — by ErykCh (created: 2026-03-15 07:29 (UTC+8))
- #37077 [Feature][Cleanup]: Optimize token data structures (list[int] to numpy arrays) — no labels — by akh64bit (created: 2026-03-15 06:32 (UTC+8))
- #37084 Core Engine: Weight Loading Cleanup - Standardize load_weights API — no labels — by akh64bit (created: 2026-03-15 08:55 (UTC+8))
- #37079 [Bug]: Native Tool Calling collapses reasoning quality when any Tool/Function/Filter is enabled (vLLM auto-tool-choice backend) — bug — by Tyrannius (created: 2026-03-15 07:26 (UTC+8))
- #37076 [Bug]: Potential use-after-free in KV block allocator under eviction pressure — bug — by Yunzez (created: 2026-03-15 05:54 (UTC+8))
- #37075 [RFC]: Opt-in Media URL Cache for `MediaConnector` — RFC — by AndreasKaratzas (created: 2026-03-15 05:41 (UTC+8))
- #37073 [Feature]: — feature request — by BromigoTools (created: 2026-03-15 05:35 (UTC+8))
- #37035 [Bug]: cudaErrorIllegalAddress in gdn_attn.py:237 when using qwen3_next_mtp with num_speculative_tokens=5 under load — bug — by Quentin-M (created: 2026-03-14 12:10 (UTC+8))
- #37060 [Bug]: sm110: torch.AcceleratorError: CUDA error: an illegal instruction was encountered — bug — by shahizat (created: 2026-03-15 00:46 (UTC+8))
- #37037 [Bug]: FlatLogprobs empty slice crashes with IndexError and delta-mode logprobs[-0:] returns stale data — no labels — by mango766 (created: 2026-03-14 12:15 (UTC+8))
Closed Issues
- #21278 [Performance]: Speculative decoding doesn’t seem to speed up inference? — performance,stale — by la1ty (closed: 2026-03-15 10:18 (UTC+8))
- #23814 [Bug]: illegal memory access when there are multiple concurrent request — bug,stale — by seabnavin19 (closed: 2026-03-15 10:18 (UTC+8))
- #28654 [Feature][P1]: Bake Base Docker Images into EC2 AMI with Packer — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-15 10:17 (UTC+8))
- #28703 [Bug]: run_batch hang on fresh start — bug,stale — by cmpute (closed: 2026-03-15 10:17 (UTC+8))
- #28745 [Bug]: LongCat-Flash fails to start with TorchDynamo due to “Data-dependent assertion failed” in FusedMoE — bug,stale — by baonudesifeizhai (closed: 2026-03-15 10:17 (UTC+8))
- #28767 [Performance]: Prefill TTFT and latency both increased — performance,stale — by Shaoting-Feng (closed: 2026-03-15 10:17 (UTC+8))
- #37079 [Bug]: Native Tool Calling collapses reasoning quality when any Tool/Function/Filter is enabled (vLLM auto-tool-choice backend) — bug — by Tyrannius (closed: 2026-03-15 07:47 (UTC+8))
- #31901 [Bug]: Guided decoding with xgrammar failed on macOS CPU inference — bug — by Inokinoki (closed: 2026-03-15 01:29 (UTC+8))
- #26808 [Bug]: [OpenAI server][Transcriptions] M4A uploads fail while MP3 succeeds (Whisper) — bug,stale — by ikaadil (closed: 2026-03-14 23:44 (UTC+8))
- #35665 [Bug]: Multimodal Requests Fail on /v1/chat/completions/render Endpoint — bug — by sergey-zinchenko (closed: 2026-03-14 22:10 (UTC+8))
New PRs
- #37094 [Bugfix] Align routed_experts with streamed and final output token spans — bug,v1 — by HareshKarnan (created: 2026-03-15 11:21 (UTC+8))
- #37089 [WIP] [Model] Support Qwen1 use_logn_attn and use_dynamic_ntk — qwen — by haosdent (created: 2026-03-15 11:06 (UTC+8))
- #37088 [WIP] [Bugfix] Move GDN warmup after KV cache allocation to fix memory leak (#36973) — bug,qwen — by haosdent (created: 2026-03-15 10:43 (UTC+8))
- #37090 [WIP][Bugfix] Disable cross-layer KV cache for MLA attention backends — bug,v1 — by haosdent (created: 2026-03-15 11:08 (UTC+8))
- #37091 [WIP][Bugfix] Add autotuning guard to all unprotected FlashInfer MoE kernels — bug,nvidia — by haosdent (created: 2026-03-15 11:09 (UTC+8))
- #37092 [WIP][Bugfix] Clamp -1 async placeholders to fix CUDA assert in multimodal+EAGLE3 — bug,v1,nvidia — by haosdent (created: 2026-03-15 11:10 (UTC+8))
- #37093 [Frontend][Misc] Remove unused log in `/is_sleeping` — frontend — by esmeetu (created: 2026-03-15 11:12 (UTC+8))
- #37061 [Frontend] Remove `torchcodec` from audio dependency — frontend,ready,ci/build — by Isotr0py (created: 2026-03-15 00:57 (UTC+8))
- #37040 [Frontend] Avoid startup error log for models without chat template — ready — by DarkLight1337 (created: 2026-03-14 12:28 (UTC+8))
- #37069 Yzong rh/nixl ssm rebase — v1,kv-connector — by yzong-rh (created: 2026-03-15 04:02 (UTC+8))
- #37087 [Bugfix] Cache token IDs in Hermes2ProToolParser to fix “Already borr… — bug — by stonelazy (created: 2026-03-15 10:04 (UTC+8))
- #37085 feat: Standardize load_weights API via AutoWeightsLoader — documentation,llama — by akh64bit (created: 2026-03-15 08:55 (UTC+8))
- #37083 [Misc] Add unit tests for min_p Triton sampling kernel — v1 — by gkuwanto (created: 2026-03-15 08:45 (UTC+8))
- #37078 [Feature][Cleanup]: Optimize token data structures (list[int] to TokenArray) — structured-output,tpu,v1,kv-connector — by akh64bit (created: 2026-03-15 06:50 (UTC+8))
- #37081 Add Mistral Guidance — structured-output,frontend,v1 — by juliendenize (created: 2026-03-15 07:53 (UTC+8))
- #37082 [V1, V2] Add temperature for prompt logprobs — frontend,v1 — by JacobHelwig (created: 2026-03-15 08:14 (UTC+8))
- #37072 [Bugfix] Fix harmony parser crash on terminal tokens after end-of-message — bug,frontend,gpt-oss — by Pradyun92 (created: 2026-03-15 05:31 (UTC+8))
- #37070 [Bugfix] Fix harmony streaming tool call crash and argument splitting — bug,frontend,gpt-oss — by Pradyun92 (created: 2026-03-15 05:31 (UTC+8))
- #37074 [Feature][Frontend] add support for Cohere Embed v2 API — documentation,frontend — by walterbm (created: 2026-03-15 05:36 (UTC+8))
- #37059 fix(moe): detect and handle unquantized MoE weights in NVFP4 checkpoints — no labels — by Th0rgal (created: 2026-03-15 00:35 (UTC+8))
- #37071 [Bugfix] Fix Responses API harmony streaming: token splitting, missing done events, nested sequence_number — bug,frontend,gpt-oss — by Pradyun92 (created: 2026-03-15 05:31 (UTC+8))
- #37065 [Bugfix][sleepmode] Serialize GPU VMM operations to fix sleep/wakeup race condition — bug,v1 — by Flink-ddd (created: 2026-03-15 02:19 (UTC+8))
- #37068 [Bugfix] Add Qwen3.5 MoE support to benchmark_moe.py — bug,performance,qwen — by iphands (created: 2026-03-15 03:55 (UTC+8))
- #37062 [Frontend] Reduce chat template warmup logging levels — ready — by njhill (created: 2026-03-15 00:59 (UTC+8))
- #37051 Fix priority preemption regression test in scheduler — v1 — by ezylopx5 (created: 2026-03-14 20:36 (UTC+8))
- #37041 Add FlashInfer fused RoPE + paged KV cache append integration in vLLM #24678 — rocm,v1,nvidia — by baonudesifeizhai (created: 2026-03-14 14:24 (UTC+8))
- #37067 Improve CPU platform detection fallback for source checkouts — no labels — by ezylopx5 (created: 2026-03-15 02:54 (UTC+8))
- #37066 [Distributed] Add OfflineState bloom-filter cooperative caching KV connector — documentation,performance,kv-connector — by Ilank1 (created: 2026-03-15 02:31 (UTC+8))
- #37063 Fuse fp8 quant merge attn states vec128 — performance,v1 — by carlyou (created: 2026-03-15 01:03 (UTC+8))
- #37064 [Doc] Clarify schema enforcement behavior for tool_choice modes — documentation,tool-calling — by cemigo114 (created: 2026-03-15 01:35 (UTC+8))
- #37045 [Kernel] Porting the TRTLLM minimax_allreduce_rms kernels — ci/build — by jeejeelee (created: 2026-03-14 17:32 (UTC+8))
- #37055 Fix text only inputs for MRoPE models with the Transformers modelling backend — no labels — by hmellor (created: 2026-03-14 23:12 (UTC+8))
- #37054 [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype “auto” leading to gibberish — bug,documentation,v1,nvidia — by andylolu2 (created: 2026-03-14 23:00 (UTC+8))
- #37058 [Frontend] Remove librosa from audio dependency — ci/build,multi-modality — by Isotr0py (created: 2026-03-14 23:49 (UTC+8))
- #37057 Fix pipeline parallel with multimodal models with the Transformers modelling backend — no labels — by hmellor (created: 2026-03-14 23:19 (UTC+8))
- #37056 Update FP8 MoE backend selection for B200 (Blackwell) — no labels — by faradawn (created: 2026-03-14 23:17 (UTC+8))
- #37052 fix(gdn_attn): prevent CUDA illegal memory access with >4 speculative tokens — v1,nvidia — by machov (created: 2026-03-14 22:07 (UTC+8))
- #37053 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (created: 2026-03-14 22:09 (UTC+8))
- #37049 chore: clean up non-core lint issues — documentation,performance,ready — by whyiug (created: 2026-03-14 19:31 (UTC+8))
- #37044 [Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions — bug — by he-yufeng (created: 2026-03-14 17:29 (UTC+8))
- #37043 [Perf] Skip unnecessary int64 round-trip in sampler — v1 — by reidliu41 (created: 2026-03-14 15:41 (UTC+8))
- #37050 [Bugfix] Fix crash when using PP in multi-node (Issue #37001) — bug,documentation,v1 — by siewcapital (created: 2026-03-14 20:21 (UTC+8))
- #37042 fix: sync delta_token_ids with delta_text during stop-sequence buffering — v1 — by gambletan (created: 2026-03-14 15:08 (UTC+8))
- #37048 examples: bugfix disagg EPD proxy argparse — bug,documentation,kv-connector — by whyiug (created: 2026-03-14 19:26 (UTC+8))
- #37047 [Doc][Attention] Fix MLA top-of-file comments — no labels — by WineChord (created: 2026-03-14 19:14 (UTC+8))
- #37046 docs: fix disaggregated encoder example README — documentation,kv-connector — by whyiug (created: 2026-03-14 18:59 (UTC+8))
- #37034 fix: resolve kv_cache_dtype=’auto’ from checkpoint kv_cache_quant_algo — no labels — by alvinttang (created: 2026-03-14 12:06 (UTC+8))
- #37038 [Bugfix] Fix FlatLogprobs empty slice crash and delta-mode logprobs stale data — bug,v1 — by mango766 (created: 2026-03-14 12:17 (UTC+8))
- #37039 fix: sync delta_token_ids with delta_text during stop-sequence buffering — documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by gambletan (created: 2026-03-14 12:21 (UTC+8))
- #37033 fix: correct FP8 error message to reference compute capability — no labels — by alvinttang (created: 2026-03-14 11:59 (UTC+8))
- #37036 fix: correct misleading FP8 error message about CUDA version vs compute capability — nvidia — by gambletan (created: 2026-03-14 12:14 (UTC+8))
Merged PRs
- #36168 [Build] Upgrade xgrammar to get a security fix — ready,ci/build — by russellb (merged: 2026-03-15 11:13 (UTC+8))
- #37040 [Frontend] Avoid startup error log for models without chat template — ready — by DarkLight1337 (merged: 2026-03-15 00:36 (UTC+8))
- #37062 [Frontend] Reduce chat template warmup logging levels — ready — by njhill (merged: 2026-03-15 04:48 (UTC+8))
- #32384 [Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference — bug,structured-output,ready,v1 — by karanb192 (merged: 2026-03-15 01:29 (UTC+8))
- #36139 [Feature] Add InstantTensor weight loader — documentation,ready,ci/build,cpu,nvidia — by arlo-aisys (merged: 2026-03-15 01:05 (UTC+8))
- #35109 [Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats — bug,rocm,frontend,ready,ci/build,cpu,nvidia — by seanmamasde (merged: 2026-03-14 23:44 (UTC+8))
- #36817 [Model Runner V2] Add Support for XD-RoPE — ready,v1,nvidia — by santiramos27 (merged: 2026-03-15 00:27 (UTC+8))
- #36903 [Misc] Clean up Kimi-audio whisper encoder loading — ready — by Isotr0py (merged: 2026-03-14 23:37 (UTC+8))
- #36971 Mistral common v10 — rocm,ready,ci/build — by juliendenize (merged: 2026-03-14 22:26 (UTC+8))
- #35684 [Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests (https://github.com/vllm-project/vllm/issues/35665) — bug,frontend,ready,v1,cpu — by sergey-zinchenko (merged: 2026-03-14 22:10 (UTC+8))
- #37014 [CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs — ready,ci/build — by khluu (merged: 2026-03-14 16:26 (UTC+8))
- #36997 Enable loading of fused expert weights in the Transformers modelling backend — ready — by hmellor (merged: 2026-03-14 15:01 (UTC+8))
- #36919 [Refactor] Relocate chat completion and anthropic tests — ready,ci/build,tool-calling — by sfeng33 (merged: 2026-03-14 12:16 (UTC+8))
- #37015 [CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs — ready,ci/build — by khluu (merged: 2026-03-14 12:21 (UTC+8))
PRs Closed Without Merging
- #24954 [Bug] Fix gpt-oss missing tool content — bug,frontend,ready,stale,gpt-oss — by levunet (closed: 2026-03-15 10:18 (UTC+8))
- #26872 [Attention][DSA] Abstrat DSA topk-indices-buffer to AttentionBackend — needs-rebase,stale,v1,deepseek — by MengqingCao (closed: 2026-03-15 10:18 (UTC+8))
- #27613 add a warmup mode in CUDAGraphMode — needs-rebase,stale,v1,nvidia — by bangshengtang (closed: 2026-03-15 10:17 (UTC+8))
- #28018 [Core][WIP] Add back waterline for scheduler — needs-rebase,stale,v1 — by whx-sjtu (closed: 2026-03-15 10:17 (UTC+8))
- #37053 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (closed: 2026-03-14 22:50 (UTC+8))
- #37039 fix: sync delta_token_ids with delta_text during stop-sequence buffering — documentation,performance,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by gambletan (closed: 2026-03-14 15:02 (UTC+8))
- #36810 [ROCm][Perf] Fused GEMM + static FP8 output quantization — rocm — by andyluo7 (closed: 2026-03-14 14:02 (UTC+8))
- #37036 fix: correct misleading FP8 error message about CUDA version vs compute capability — nvidia — by gambletan (closed: 2026-03-14 12:15 (UTC+8))