vLLM Development Activity Report - 2026-03-29
Time window: 2026-03-29 11:55 (UTC+8) to 2026-03-30 11:55 (UTC+8). Stats: 10 new issues | 16 closed issues | 39 new PRs | 5 merged PRs | 17 PRs closed unmerged
📊 Daily Development Status Summary
In the 24 hours spanning March 29-30, 2026, the vLLM project maintained a high level of development activity, with 10 new issues and 39 new PRs. Development focused on performance and memory management (notably CUDA graphs and the KV cache), the AMD ecosystem (ROCm test coverage and device compatibility), and bug fixes for multimodal models and API features. Alongside triaging new issues, the community closed 16 long-inactive issues, reflecting an active maintenance posture.
🎯 AMD/ROCm Ecosystem Updates
The AMD/ROCm ecosystem saw notable progress this cycle, centered on expanding test coverage, adding device support, and fixing core functionality.
- New Issues (bug reports):
  - #38498: [Bug][ROCm]: Step3.5 Flash MTP init error
    - Details: User starwang1024 hit an initialization error on ROCm when running the Step-3.5-Flash-FP8 model with MTP speculative decoding enabled. The bot automatically pinged the ROCm maintainers (@hongxiayang, @tjtanaa) and requested full environment information for reproduction.
    - Impact: This shows that running the latest complex inference setups (combining FP8 quantization, expert parallelism, speculative decoding, etc.) on ROCm still poses compatibility challenges that the AMD team and the community need to triage together.
- New PRs (features and fixes):
  - #38454: [ROCm][Test] Add hybrid block size and RDNA4 backend selection tests
    - Author: dondetir (likely an AMD contributor, judging by naming conventions).
    - Content: Adds seven new tests covering two key paths: 1) verifying that realistic block sizes for hybrid models (e.g. Qwen3.5) are not rejected during ROCm attention backend selection (guarding against a regression of #36994); 2) simulating attention backend selection on RDNA3/RDNA4 (gfx115x) devices to ensure backends such as ROCM_AITER_FA are correctly rejected on non-MI300 hardware.
    - Significance: Strengthens ROCm test coverage for complex model configurations and new hardware architectures, improving robustness.
  - #38455: [ROCm] Add RDNA 3.5/4 device IDs (gfx1150, gfx1151, gfx1201)
    - Author: dondetir.
    - Content: Adds three RDNA 3.5/4 APU and discrete GPU IDs (e.g. AMD Radeon 890M, 8060S, RX 9070 XT) to the device ID mapping table, fixing downstream logic issues that could arise when these devices fell back to a generic name.
    - Significance: Better identification and support for the newest AMD consumer and mobile GPUs, widening vLLM's potential reach on AMD hardware.
  - #38457: [ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support
    - Author: tjtanaa (no "-amd" suffix in the username, but frequently handles ROCm PRs and is very likely an AMD employee).
    - Content: Updates the official installation docs to cover ROCm nightly wheels, noting upgrades to ROCm 7.2.1, Torch 2.10, and Triton 3.6.
    - Significance: Lowers the barrier for developers trying the latest vLLM features on AMD hardware by providing an easier installation path.
- Merged PRs (bug fixes):
  - #38450: [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (merged)
    - Author: AndreasKaratzas.
    - Content: Fixes incorrect cross-attention computation for encoder-decoder models (e.g. Whisper) on the ROCM_ATTN and ROCM_AITER_FA backends on ROCm. The fix removes the ENCODER_DECODER type from those backends' support lists, forcing the dispatcher to select a backend that works correctly (such as ROCM_AITER_UNIFIED_ATTN or TRITON_ATTN).
    - Significance: Resolves the root cause of incorrect beam-search results for Whisper-class models on ROCm, improving inference accuracy for a key model family.
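The mechanism of this dispatch-layer fix can be sketched with a toy backend selector. The backend names mirror the report, but the support sets and selection logic below are invented for illustration and are not vLLM's actual tables:

```python
# Toy attention-backend selector: removing ENCODER_DECODER from a backend's
# support set makes selection fall through to a backend that is known to
# compute cross-attention correctly on ROCm.

SUPPORTED = {
    "ROCM_AITER_FA": {"DECODER"},   # ENCODER_DECODER removed by the fix
    "ROCM_ATTN": {"DECODER"},       # ENCODER_DECODER removed by the fix
    "TRITON_ATTN": {"DECODER", "ENCODER_DECODER"},
}
PRIORITY = ["ROCM_AITER_FA", "ROCM_ATTN", "TRITON_ATTN"]

def select_backend(attn_type: str) -> str:
    """Pick the highest-priority backend whose support set contains attn_type."""
    for name in PRIORITY:
        if attn_type in SUPPORTED[name]:
            return name
    raise ValueError(f"no backend supports {attn_type}")

print(select_backend("DECODER"))          # unchanged fast path
print(select_backend("ENCODER_DECODER"))  # now routed to a correct backend
```

The appeal of this style of fix is that it changes only declarative support metadata, leaving the buggy kernels untouched until they can be repaired properly.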
💬 High-Activity Discussion Analysis
- Issue #38486: [Bug]: cuda graph takes too much memory for qwen 3.5
  - Core issue: User ErykCh reports that enabling CUDA graphs with a Qwen3.5 35B FP8 model causes out-of-memory (OOM) errors even when GPU memory appears sufficient.
  - Discussion:
    - Reporter (ErykCh): Through a detailed series of tests, traced the problem to a gap between the CUDA graph memory estimate (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS) and actual usage; OOM persists even with more conservative settings (max-model-len 512), so the reporter considers it a bug.
    - Log hints: The logs note that CUDA graph memory profiling will be enabled by default in v0.19 and suggest adjusting gpu-memory-utilization.
  - Points of contention: No directly opposing views; this reads as a deep bug triage, with extensive diagnostic data supplied by the reporter.
  - Current status: Open. No maintainer has replied yet, but the detailed logs provide valuable diagnostics. The issue highlights the challenges of CUDA graph memory management for complex models and long contexts.
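As background on why CUDA graph capture can consume significant memory, each captured batch size pins its own fixed-shape input and workspace buffers for the lifetime of the graph. A back-of-envelope sketch (all numbers hypothetical, not measurements from this issue):

```python
# Illustrative estimate of extra memory held by CUDA graph captures.
# Every captured batch size keeps its own pinned buffers, so memory grows
# with both the number and the magnitude of the capture sizes.

def cudagraph_extra_bytes(capture_sizes, per_token_workspace_bytes):
    """Sum of per-capture buffer footprints (a simplification)."""
    return sum(bs * per_token_workspace_bytes for bs in capture_sizes)

default_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # hypothetical capture list
trimmed_sizes = [1, 2, 4, 8]                         # more conservative list

per_token = 2 * 1024 * 1024  # assume ~2 MiB pinned per captured token slot

print(cudagraph_extra_bytes(default_sizes, per_token) / 2**30)  # GiB, large
print(cudagraph_extra_bytes(trimmed_sizes, per_token) / 2**30)  # GiB, small
```

This is exactly why a profiler's estimate diverging from actual capture cost, as the reporter observed, can push an otherwise-fitting configuration into OOM.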
- Issue #38488: [Bug]: reasoning_content silently dropped on incoming assistant messages
  - Core issue: User delta9000 found that when vLLM parses incoming messages in multi-turn conversations, it reads only the reasoning field and drops the legacy reasoning_content field, losing data for clients that depend on reasoning_content (e.g. the Vercel AI SDK).
  - Discussion:
    - Reporter (delta9000): Points out that PR #33635 fixed only output compatibility, not input compatibility, degrading tool-call chain quality for models such as MiniMax-M2.
    - Maintainer (chaunceyjiang): Replied clearly that only the reasoning field is supported now; reasoning_content is no longer supported.
  - Point of contention: The documented backward-compatibility promise (that reasoning_content would keep working) contradicts the implementation.
  - Current conclusion: The maintainer confirmed the incompatibility. The reporter plans to work around it client-side (in Opencode). The issue will likely be closed as "won't fix" or resolved by updating the docs.
- RFC Issue #38474: [RFC]: Add Mooncake Store Connector for Shared KV Cache Reuse
  - Core issue: User LCAIZJ proposes integrating the Mooncake distributed KV cache store into vLLM, enabling KV cache block reuse across instances and sessions to eliminate redundant prefill computation.
  - Discussion: No comments yet, but the proposal is technically detailed, referencing an architecture already implemented in the vllm-ascend project and laying out a clear integration path and API design.
  - Why it matters: The proposal targets a core efficiency and cost problem in LLM serving. If adopted, it would substantially improve performance for multi-instance deployments and cold-start scenarios. The ensuing discussion is worth following closely.
🔥 Hot Topics and Trends
- Sustained focus on performance and memory optimization:
  - CUDA graph memory management (#38486) and KV cache compression (#38479, TurboQuant) are the two focal points. The community keeps probing the balance between peak performance and limited GPU memory.
  - PR #38139 (merged) achieved a 48.9% throughput gain on pooling tasks by eliminating redundant CPU-GPU data copies, showing the outsized returns of low-level optimization.
- Systematic strengthening of AMD platform support:
  - From new device IDs (#38455) and expanded tests (#38454) to fixes for specific model architectures (#38450), ROCm support is moving from "works" toward "stable and polished". The documentation update (#38457) also signals attention to user experience.
- Deepening multimodal and complex-model support:
  - A cluster of issues appeared around multimodal encoder cache configuration (#38459, #38465), Transformers v5 compatibility (#38461), and audio-video model test stability (#38492). As the supported model zoo expands, integration and stabilization work grows heavier.
🛠️ Key Technical Changes
- PR #38139 (merged): [Perf] Remove redundant device copies for CPU-only pooling token IDs
  - Technical read: Optimizes the redundant CPU→GPU→CPU round trip for token IDs in pooling tasks (e.g. embedding models). By having PoolingMetadata expose a get_prompt_token_ids_cpu() method directly, it avoids unnecessary GPU memory allocation and transfers.
  - Impact: Delivers a 48.9% end-to-end throughput gain on the relevant benchmark, a model case of low-level communication optimization paying off. (Note: a revert, #38490, was opened and then closed within this window, and #38495 fixes a SPLADE pooler test broken by the change.)
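The pattern behind the fix can be shown with a minimal stand-in class (invented here; the real vLLM PoolingMetadata differs): when a consumer only ever needs token IDs on the CPU, serve them from CPU-side metadata directly instead of round-tripping through device memory.

```python
# Toy comparison of the old round-trip path vs. the new direct-CPU path.

class PoolingMetadataSketch:
    def __init__(self, prompt_token_ids_cpu: list[list[int]]):
        self._cpu_ids = prompt_token_ids_cpu
        self.device_copies = 0  # counts simulated host<->device transfers

    def _to_device_and_back(self) -> list[list[int]]:
        self.device_copies += 2  # old path: H2D copy + D2H copy
        return self._cpu_ids

    def get_prompt_token_ids_cpu(self) -> list[list[int]]:
        return self._cpu_ids     # new path: zero transfers

meta = PoolingMetadataSketch([[101, 7592], [101, 2088]])
old = meta._to_device_and_back()
new = meta.get_prompt_token_ids_cpu()
print(old == new, meta.device_copies)  # same data; old path cost 2 copies
```

The 48.9% figure underlines how much host-device synchronization can dominate latency in pooling workloads where the GPU does comparatively little work.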
- PR #38450 (merged): [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models
  - Technical read: Rather than fixing the buggy backend kernels directly, the PR edits the backends' support lists so encoder-decoder cross-attention is routed away from them on ROCm. This pragmatic "dispatch-layer fix" restores correctness while buying time for a proper kernel fix.
  - Impact: Restores beam-search accuracy for Whisper-class models on AMD GPUs, which is critical for speech recognition workloads.
- PR #38479 (open): [Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity
  - Technical read: Introduces a learned 2- to 4-bit KV cache compression scheme (TurboQuant) built on random rotations, quantization, and sketching, claiming a 2-4x increase in KV cache capacity. The PR adds new kernels and decode paths plus new --kv-cache-dtype tq3/tq4 options.
  - Impact: If validated in practice, this would greatly relieve memory pressure in long-context scenarios, an important technique for high-density serving. The change is large, though, and needs careful review.
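To make the capacity math concrete, here is a toy sketch of low-bit quantization (per-group absmax scaling to four signed levels, i.e. 2 bits). This illustrates only the storage arithmetic; TurboQuant's actual scheme with random rotations, sketches, and learned components is far more involved:

```python
# Toy 2-bit KV quantization: map floats to 4 levels scaled by the group absmax.

def quantize_2bit(values: list[float]) -> tuple[list[int], float]:
    """Map each float to a signed 2-bit code in {-2, -1, 0, 1}."""
    scale = max(abs(v) for v in values) or 1.0
    codes = [max(-2, min(1, round(v / scale))) for v in values]
    return codes, scale

def dequantize_2bit(codes: list[int], scale: float) -> list[float]:
    return [c * scale for c in codes]

kv = [0.9, -0.4, 0.05, -1.0]
codes, scale = quantize_2bit(kv)
approx = dequantize_2bit(codes, scale)
# 2 bits/value vs. an fp8 cache's 8 bits/value -> roughly 4x more KV blocks
# fit in the same memory (ignoring scale-factor overhead), matching the
# "4x capacity" framing in the PR title.
print(codes, [round(a, 2) for a in approx])
```

The obvious cost is reconstruction error, which is why the PR's large kernel and decode-path changes (and their accuracy impact) warrant strict review.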
📈 Development Activity Observations
- High community engagement: 49 new items (issues + PRs) in 24 hours indicate a very active community. Users such as ErykCh and NJX-njx provided extremely detailed debugging information in issues, reflecting deep involvement from power users.
- Active AMD contributors: tjtanaa, dondetir, AndreasKaratzas, and other contributors (some identifiably AMD-affiliated) submitted or merged multiple core ROCm PRs this cycle, suggesting the AMD team is advancing platform support systematically and at a steady cadence.
- Review and merge cadence: 5 PRs merged and 16 stale issues closed show the core maintainers actively working down the backlog and keeping the project healthy.
💡 Issues Worth Watching
- Issue #38489: NemotronH Super 120B crashes on AGX Thor unified memory: Reports a crash when two vLLM instances run concurrently on a unified-memory architecture (NVIDIA AGX Thor). It points to latent concurrent-memory-management risks on non-traditional, non-discrete-GPU architectures, with implications for edge computing and certain server deployments.
- RFC Issue #38474: Add Mooncake Store Connector: This distributed shared-KV-cache proposal could steer vLLM's architecture toward disaggregation, with potentially transformative cost-efficiency for large-scale deployments; community feedback and the eventual decision are worth tracking.
- Issue #38472 & PR #38475: P2pNcclConnector KV cache accumulation OOM: In disaggregated (P/D) inference, the decode-side consumer OOMs because received KV cache blocks are not freed promptly; PR #38475 provides a fix. A reminder that advanced distributed features demand attention to resource-management details.
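The leak pattern behind #38472 and the shape of the fix in #38475 can be sketched as follows (class and method names invented for illustration, not the actual connector code):

```python
# Toy receive-side store for transferred KV blocks: the bug is keeping entries
# after consumption; the fix is popping them as soon as they are consumed.

class RecvStore:
    def __init__(self):
        self.entries: dict[str, bytes] = {}

    def on_recv(self, request_id: str, kv_blocks: bytes) -> None:
        self.entries[request_id] = kv_blocks

    def consume_leaky(self, request_id: str) -> bytes:
        return self.entries[request_id]      # bug: entry lingers -> OOM over time

    def consume_fixed(self, request_id: str) -> bytes:
        return self.entries.pop(request_id)  # fix: free the entry immediately

store = RecvStore()
for i in range(3):
    store.on_recv(f"req-{i}", b"kv")
    store.consume_fixed(f"req-{i}")
print(len(store.entries))  # no accumulation on the decode instance
```

The general lesson: in disaggregated serving, every transferred resource needs an explicit ownership-and-release story, or steady-state traffic becomes a slow leak.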
📋 Appendix: Detailed Data Lists
New Issues
- #38488 [Bug]: reasoning_content silently dropped on incoming assistant messages — no labels — by delta9000 (created: 2026-03-30 07:18 (UTC+8))
- #38499 [Bug]: "stop" parameter will stop the reasoning_content — bug — by miaolanxinlisa-netizen (created: 2026-03-30 11:37 (UTC+8))
- #38498 [Bug][ROCm]: Step3.5 Flash MTP init error — bug, rocm — by starwang1024 (created: 2026-03-30 11:14 (UTC+8))
- #38474 [RFC]: Add Mooncake Store Connector for Shared KV Cache Reuse — RFC — by LCAIZJ (created: 2026-03-29 23:05 (UTC+8))
- #38489 [Bug]: NemotronH Super 120B crashes mid-decode when two vLLM instances run concurrently on AGX Thor unified memory — bug — by mrjbj (created: 2026-03-30 07:28 (UTC+8))
- #38486 [Bug]: cuda graph takes too much memory for qwen 3.5 — bug — by ErykCh (created: 2026-03-30 03:01 (UTC+8))
- #38485 [Bug]: There should be flag/setting to collect environment — bug — by ErykCh (created: 2026-03-30 02:18 (UTC+8))
- #38472 [Bug]: [xPyD] Potential OOM when using v1 P2pNcclConnector as KV cache transport: KV cache accumulation on decode instance. — bug — by cxywhite (created: 2026-03-29 21:37 (UTC+8))
- #38470 [Bug]: When using the Sonnet dataset for benchmark testing, if the input length is too small, the CPU usage becomes abnormally high with no error logs, making it impossible to run the benchmark properly. — bug — by frankie-ys (created: 2026-03-29 21:08 (UTC+8))
- #38459 [Bug]: limit_mm_per_prompt is ineffective for Qwen3-VL — bug — by Disapole-Xiao (created: 2026-03-29 16:51 (UTC+8))
Closed Issues
- #14739 [New Model]: Can you support the VLA series models? For example, openVLA. — new-model, stale — by psong123 (closed: 2026-03-30 10:18 (UTC+8))
- #17720 [RFC]: Enabling Arm Neoverse CI Runners — RFC, stale — by nikhil-arm (closed: 2026-03-30 10:18 (UTC+8))
- #23335 [Bug]: Missing tokens while streaming when using gpt-oss-20b — bug, stale — by Blaze-DSP (closed: 2026-03-30 10:18 (UTC+8))
- #23816 [Bug]: multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown — bug, stale — by gkm0120 (closed: 2026-03-30 10:17 (UTC+8))
- #26646 [Bug]: KV cache can't be quantized for Qwen3-Next — bug, stale — by drrros (closed: 2026-03-30 10:17 (UTC+8))
- #28724 [New Model]: add unsloth/MiniMax-M2-GGUF support — stale — by kaikuo (closed: 2026-03-30 10:17 (UTC+8))
- #28963 [Bug]: MiniMax tool parsing errors — bug, stale — by wizche (closed: 2026-03-30 10:16 (UTC+8))
- #29286 [Performance]: cache system prompt token ids — performance, stale — by Eviannn (closed: 2026-03-30 10:16 (UTC+8))
- #29573 [Feature]: Programmatic Tool Calling (PTC) for code based tool execution — feature request, stale — by harche (closed: 2026-03-30 10:16 (UTC+8))
- #29662 [Bug]: DSR1 fp4/fp8 MTP with spec num 3 has perf drop when enable async-scheduling — bug, stale — by shyeh25 (closed: 2026-03-30 10:16 (UTC+8))
- #29670 [Bug]: single node 8 h20,ebo failed! — bug, stale — by bleedingfight (closed: 2026-03-30 10:16 (UTC+8))
- #29676 [Bug]: First call to llama model takes too much time compared to subsequent ones — bug, stale — by ozancaglayan (closed: 2026-03-30 10:16 (UTC+8))
- #29678 [Feature]: could develop vllm java client sdk with langchain4j — feature request, stale — by mullerhai (closed: 2026-03-30 10:16 (UTC+8))
- #29725 [Bug]: Responses API: Streaming returns ResponseTextDeltaEvent instead of ResponseFunctionCallArgumentsDeltaEvent for tool calls while using non-harmony models — bug, stale — by sumitaryal (closed: 2026-03-30 10:16 (UTC+8))
- #29737 [Installation]: RM has detected an NVML/RM version mismatch — installation, stale — by helxsz (closed: 2026-03-30 10:16 (UTC+8))
- #38382 [Transformers v5] Mistral multimodal models — help wanted, good first issue — by hmellor (closed: 2026-03-29 17:59 (UTC+8))
New PRs
- #38491 [XPU] Fix spec-decode UTs under tests/v1/spec_decode — intel-gpu, speculative-decoding, v1 — by yma11 (created: 2026-03-30 09:08 (UTC+8))
- #38500 fix: pin 1 unpinned action — ci/build — by dagecko (created: 2026-03-30 11:37 (UTC+8))
- #38492 [CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests — rocm, ready — by AndreasKaratzas (created: 2026-03-30 09:45 (UTC+8))
- #38487 [Misc] Always use forward_mulmat for Conv3d on newer versions of torch. — ready — by ywang96 (created: 2026-03-30 06:41 (UTC+8))
- #38497 Add @ZJY0516 to CODEOWNERS — ready, ci/build — by ZJY0516 (created: 2026-03-30 10:52 (UTC+8))
- #38494 [Bugfix] Fix Step-3.5-Flash MTP acceptance rate by detecting own shared_head weights — bug, speculative-decoding, v1 — by elinx (created: 2026-03-30 10:04 (UTC+8))
- #38468 Add platform manual_seed_all API — performance, rocm, intel-gpu, speculative-decoding, v1, nvidia — by yma11 (created: 2026-03-29 19:25 (UTC+8))
- #38496 [Model Runner V2] Fuse probabilistic rejection sample kernels — v1 — by TheEpicDolphin (created: 2026-03-30 10:34 (UTC+8))
- #38464 [Logging] Improve DCP error message to suggest VLLM_ATTENTION_BACKEND — v1 — by WJYuuuu (created: 2026-03-29 18:58 (UTC+8))
- #38490 Revert "[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement" (#38139) — v1 — by zhewenl (created: 2026-03-30 09:03 (UTC+8))
- #38495 [CI] Fix SPLADE pooler test broken by #38139 — ready — by haosdent (created: 2026-03-30 10:14 (UTC+8))
- #38477 [Frontend] Add unified usage behavior control for entrypoint — frontend — by Csrayz (created: 2026-03-29 23:25 (UTC+8))
- #38493 [Bugfix] Fix Step-3.5-Flash MTP acceptance rate by preserving trained shared_head weights — bug, speculative-decoding, v1 — by elinx (created: 2026-03-30 09:54 (UTC+8))
- #38462 [Logging] Add JIT compilation progress log for FlashInfer — v1, nvidia — by WJYuuuu (created: 2026-03-29 18:30 (UTC+8))
- #38456 [CI] Fix online FP8 quantization materializing tensors on CPU — bug, needs-rebase — by haosdent (created: 2026-03-29 15:33 (UTC+8))
- #38481 Fix potential infinite loop in SonnetDataset.sample when using short input-len — performance — by frankie-ys (created: 2026-03-30 00:09 (UTC+8))
- #38455 [ROCm] Add RDNA 3.5/4 device IDs (gfx1150, gfx1151, gfx1201) — rocm — by dondetir (created: 2026-03-29 14:21 (UTC+8))
- #38484 [Build] Add SM121 (DGX Spark / GB10) to published build targets — ci/build, nvidia — by JCorners68 (created: 2026-03-30 02:15 (UTC+8))
- #38478 [Bug fix][Quantization] Fix dummy weight loading — bug, needs-rebase — by Josephasafg (created: 2026-03-29 23:35 (UTC+8))
- #38463 [Quantization] Consolidate experts_int8 with fp8 online quantization — needs-rebase — by Josephasafg (created: 2026-03-29 18:40 (UTC+8))
- #38458 [Docs] Add vLLM CI overview documentation for contributors — documentation — by khluu (created: 2026-03-29 16:13 (UTC+8))
- #38475 fix(p2p_nccl): free KV recv_store entries immediately to prevent OOM (#38472) — v1, kv-connector — by saifmb0 (created: 2026-03-29 23:13 (UTC+8))
- #38479 [Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity — v1, nvidia — by vibhavagarwal5 (created: 2026-03-29 23:36 (UTC+8))
- #38476 [WIP] Add TRITON_MLA_SPARSE backend for SM80 sparse MLA support — documentation, rocm, v1, nvidia — by haosdent (created: 2026-03-29 23:23 (UTC+8))
- #38483 fix(v1): Handle max_model_len overflow gracefully instead of crashing — v1 — by machov (created: 2026-03-30 01:30 (UTC+8))
- #38482 (security) Fix SSRF in batch runner download_bytes_from_url — documentation, frontend — by jperezdealgaba (created: 2026-03-30 00:24 (UTC+8))
- #38480 Add DEFAULT_MAX_SAMPLE_ATTEMPTS constant to prevent infinite loop when randomly selected poem lines exceed input_len. If generation fails after max attempts, raise RuntimeError with helpful message. — performance — by frankie-ys (created: 2026-03-29 23:52 (UTC+8))
- #38471 Fix potential infinite loop in SonnetDataset.sample — performance — by frankie-ys (created: 2026-03-29 21:11 (UTC+8))
- #38473 Dev fixFix potential infinite loop in SonnetDataset.sample — performance — by frankie-ys (created: 2026-03-29 22:49 (UTC+8))
- #38469 fix: Add apply_with_spec_decode() method to LogitBiasLogitsProcessor — v1 — by ranger2571 (created: 2026-03-29 20:01 (UTC+8))
- #38467 [Feature] Add apply_with_spec_decode() to LogitBiasLogitsProcessor — v1 — by NJX-njx (created: 2026-03-29 19:13 (UTC+8))
- #38466 [Bugfix] Add CPU profiler summary equivalent to CUDA summary — bug, cpu, nvidia — by NJX-njx (created: 2026-03-29 19:11 (UTC+8))
- #38465 [Bugfix] Fix limit_mm_per_prompt being ignored for encoder cache profiling — bug, multi-modality — by NJX-njx (created: 2026-03-29 19:09 (UTC+8))
- #38460 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync — ready, v1 — by Etelis (created: 2026-03-29 17:58 (UTC+8))
- #38457 [ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support — documentation, rocm — by tjtanaa (created: 2026-03-29 15:41 (UTC+8))
- #38461 Fixed issues — multi-modality — by rpathade (created: 2026-03-29 18:08 (UTC+8))
- #38454 [ROCm][Test] Add hybrid block size and RDNA4 backend selection tests — rocm, v1 — by dondetir (created: 2026-03-29 14:21 (UTC+8))
- #38453 [kv_offload+HMA][8/N]: Support multi-group worker transfer — v1 — by orozery (created: 2026-03-29 13:25 (UTC+8))
- #38452 fix(metrics): capture num_preemptions before _reset() clears it in log() — v1 — by 610lyn (created: 2026-03-29 13:00 (UTC+8))
Merged PRs
- #38317 [ROCm][CI] Enable hybrid chunked prefill test — rocm, ready, ci/build, v1 — by AndreasKaratzas (merged: 2026-03-30 10:30 (UTC+8))
- #38442 [QeRL] Fix online quantized reloading — bug, ready, ci/build, v1, quantization — by kylesayrs (merged: 2026-03-30 04:56 (UTC+8))
- #38139 [Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement — ready, v1 — by yewentao256 (merged: 2026-03-30 02:12 (UTC+8))
- #38410 [Transformers v5] fix missing pixtral/voxtral multimodal dispatch — ready — by allgather (merged: 2026-03-29 17:59 (UTC+8))
- #38450 [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models — documentation, rocm, ready, v1 — by AndreasKaratzas (merged: 2026-03-29 13:08 (UTC+8))
PRs Closed Without Merging
- #29072 Fix Dynamo trace errors for CPU offloading without UVA. — needs-rebase — by chaojun-zhang (closed: 2026-03-30 11:37 (UTC+8))
- #29479 [Kernels] Unify CPU-to-device tensor view API as get_device_view_from_cpu_tensor — nvidia — by chaojun-zhang (closed: 2026-03-30 11:37 (UTC+8))
- #38490 Revert "[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement" (#38139) — v1 — by zhewenl (closed: 2026-03-30 10:25 (UTC+8))
- #29079 fix: Add PIECEWISE cudagraph mode config for prefill server to avoid startup errors — documentation, stale, kv-connector, nvidia — by xbfs (closed: 2026-03-30 10:16 (UTC+8))
- #29201 [CI/Build] Replace COPY scripts with bind mounts to reduce layers — needs-rebase, ci/build, stale — by mirzaim (closed: 2026-03-30 10:16 (UTC+8))
- #38493 [Bugfix] Fix Step-3.5-Flash MTP acceptance rate by preserving trained shared_head weights — bug, speculative-decoding, v1 — by elinx (closed: 2026-03-30 10:04 (UTC+8))
- #38456 [CI] Fix online FP8 quantization materializing tensors on CPU — bug, needs-rebase — by haosdent (closed: 2026-03-30 09:40 (UTC+8))
- #37112 [Feature] Add reasoning_budget to cap thinking tokens via existing reasoning parser — frontend, needs-rebase, v1 — by abhinand5 (closed: 2026-03-30 03:40 (UTC+8))
- #38036 Add 501 response to STT endpoint OpenAPI spec — frontend — by joaquinhuigomez (closed: 2026-03-30 03:36 (UTC+8))
- #38480 Add DEFAULT_MAX_SAMPLE_ATTEMPTS constant to prevent infinite loop when randomly selected poem lines exceed input_len. If generation fails after max attempts, raise RuntimeError with helpful message. — performance — by frankie-ys (closed: 2026-03-30 00:09 (UTC+8))
- #38471 Fix potential infinite loop in SonnetDataset.sample — performance — by frankie-ys (closed: 2026-03-29 23:50 (UTC+8))
- #38473 Dev fixFix potential infinite loop in SonnetDataset.sample — performance — by frankie-ys (closed: 2026-03-29 23:15 (UTC+8))
- #38341 [XPU] fix lora uts — intel-gpu, gpt-oss — by chaojun-zhang (closed: 2026-03-29 22:09 (UTC+8))
- #33178 [Quantization] - Consolidate experts_int8 with FP8 Modular Kernels — documentation, performance, new-model, rocm, structured-output, frontend, speculative-decoding, ci/build, v1, multi-modality — by Josephasafg (closed: 2026-03-29 18:41 (UTC+8))
- #37361 fix(compilation): fix piecewise CUDA graph bugs with splitting_ops — nvidia — by Complexity-ML (closed: 2026-03-29 18:14 (UTC+8))
- #38216 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync — performance, rocm, ready, ci/build, v1, kv-connector, nvidia — by Etelis (closed: 2026-03-29 17:59 (UTC+8))
- #38321 [ROCm] Fix ROCM_ATTN cross-attention for beam search encoder-decoder models — rocm, intel-gpu, ready, v1 — by AndreasKaratzas (closed: 2026-03-29 13:09 (UTC+8))