[vLLM GitHub Development Digest] 2025-12-17
[Overview]
- Time window: 2025-12-17 10:45 (UTC+8) ~ 2025-12-18 10:45 (UTC+8)
- New issues: 25 (label distribution: bug:13, torch.compile:3, feature request:3, ci-failure:2, usage:2)
- Closed issues: 30
- New PRs: 62 (label distribution: ready:26, v1:11, documentation:8, ci/build:8, qwen:7)
- Merged PRs: 38
- PRs closed without merging: 18
[New issues]
-
#30859 [Bug]: set_current_vllm_config() is only done during the initialization stage but not the runtime stage — bug — by nvpohanh (created: 2025-12-17 16:59 (UTC+8)) [💬6] ### Your current environment
Any env
### 🐛 Describe the bug
# Issue Statement
Currently, set_current_vllm_config() is only done during the initialization stage but not the runtime stage. If the code tries to call get_current_vllm_config(), vLLM prints a warning "Current vLLM config is not set." and returns a default config.…
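For readers unfamiliar with the pattern behind this report: vLLM sets the active config via a context manager during initialization, so lookups outside that context fall back to a default. A minimal, self-contained sketch of this kind of context-scoped config (illustrative names only, not vLLM's actual implementation):

```python
import contextvars
from contextlib import contextmanager

# Hypothetical stand-ins for vLLM's config plumbing.
_current_config = contextvars.ContextVar("current_config", default=None)

@contextmanager
def set_current_config(config):
    """Make `config` visible to get_current_config() inside the with-block."""
    token = _current_config.set(config)
    try:
        yield
    finally:
        _current_config.reset(token)

def get_current_config():
    config = _current_config.get()
    if config is None:
        # Mirrors the reported behavior: warn and fall back to a default.
        print("Current config is not set; returning a default.")
        return {"compilation_level": 0}
    return config

with set_current_config({"compilation_level": 3}):
    assert get_current_config()["compilation_level"] == 3  # init stage: config visible

# Outside the context (the "runtime stage"), only the default is visible.
assert get_current_config()["compilation_level"] == 0
```

This is exactly why runtime-stage callers in the issue see the warning: the context manager has already exited by then.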
-
#30905 [Bug]: vLLM 0.11.2 AutoTune Missed Import In TorchInductor w/ Compile Sizes > 1 — bug — by rouchenzi (created: 2025-12-18 05:49 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#30893 [Bug]: V1 Engine splits offline embedding batches unexpectedly when multiprocessing is enabled — bug — by dmadicTT (created: 2025-12-18 01:24 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 22.04.4 LTS (x86_64) …
-
#30904 [Bug]: Empty content on NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 — bug — by puppetm4st3r (created: 2025-12-18 05:21 (UTC+8)) ### Your current environment
System: can't run collect_env.py, but it's an Ubuntu 24 machine with 1 Blackwell RTX 6000 PRO and 128 GB RAM; everything runs inside the official vLLM image.
### 🐛 Describe the bug
Docker run command:
``` …
-
#30854 [CI Failure]: Entrypoints Integration Test (API Server) — ci-failure — by SungMinCho (created: 2025-12-17 15:55 (UTC+8)) [💬1] ### Name of failing test
buildkite/ci/pr
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
…
-
#30889 [CI Failure]: distributed-tests-h200 — torch.compile,ci-failure — by zou3519 (created: 2025-12-18 00:59 (UTC+8)) [💬2] ### Name of failing test
python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
…
-
#30896 [Feature]: fused GEMM + collectives Helion kernel — feature request — by LironKesem (created: 2025-12-18 03:06 (UTC+8)) ### 🚀 The feature, motivation and pitch
Breaking down the task:
- Hopper:
  - making sure it is running (all-gather + GEMM)
  - integrated into vLLM
  - add quantization support
  - benchmarking
  - testing on other hardware
…
-
#30872 [Bug]: Triton kernel failing for LoRA on SM100 — bug — by josefdra (created: 2025-12-17 22:18 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 24.04.3 LTS (x86_64) …
-
#30880 [Feature]: more robust way to detect if compile_sizes/compile_ranges applies to the current compilation — feature request,torch.compile — by zou3519 (created: 2025-12-17 23:47 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
In https://github.com/vllm-project/vllm/pull/30489 we add a global variable to control if something is marked as being an encoder. In the future I think we’ll want to move past this: we’ll likely compile things that are more than just {encoders, decoders}.
The compile_ranges and compile_sizes work operates under one assumption: that the captured graph has one dynamic size, and that dynamic size is the number of tokens.
We should be able to infer this f…
-
#30879 [Doc]: Add some documentation about encoder compilation — documentation,torch.compile — by zou3519 (created: 2025-12-17 23:44 (UTC+8)) [💬1] ### 📚 The doc issue
I want something like a design doc for encoder compilation. For example:
- It uses support_torch_compile and set_model_tag to avoid cache collisions
- it supports or doesn’t support the following features that VllmBackend does: cudagraphs, compile_ranges, and a high-level explanation for how these are turned off or on.
- it inherits from compilation_config (or maybe it doesn’t)
- here’s how to turn it on/off
I’m having a difficult time thinking through the edge cases in htt…
-
#30882 [Bug]: Marlin Fp8 Block Quant Failure — bug,help wanted,good first issue — by robertgshaw2-redhat (created: 2025-12-17 23:55 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`: (left as the issue-template placeholder)…
-
#30834 [Bug]: vllm 0.12.0 H100 PTX was compiled with an unsupported toolchain — bug — by Qingyuncookie (created: 2025-12-17 11:30 (UTC+8)) [💬3] ### Your current environment
System Info …
-
#30861 [Bug]: TypeError: unhashable type: 'dict' when serving deepseek32 — bug — by jeejeelee (created: 2025-12-17 17:27 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... uv is set, System Info …
-
#30866 [Bug]: Deploying DeepSeek-V3.1-Terminus with vLLM v0.11.2, streaming requests that include tool calls suffer from a missing first token issue. — bug — by controlRun (created: 2025-12-17 19:05 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: OS: CentOS Linux 7 (Core) (x86_64) …
-
#30865 [Usage]: Tools GLM4.6v with vLLM — usage — by qBrabus (created: 2025-12-17 18:51 (UTC+8)) ### Your current environment
Hello,
I am running tests on this model, which I find excellent. However, I am encountering a few issues and would like to know whether it is possible to fix them or if I am simply asking for the impossible.
First of all, here is my vLLM configuration:
`docker run -d \ --name vllm-llm \ --gpus '"device=4,5,6,7"' \ -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \ -e VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME="${SHM_NAME}" \ -v /raid/workspace/qladane/vllm/hf-cache:/root/….
-
#30847 [Bug]: Qwen 3VL via Efficient Video Sampling (EVS) to trim video embeddings and found that the number of tokens after timestamp in the Prompt was not aligned with the actual number of tokens after pruning? — bug — by xshqhua (created: 2025-12-17 14:46 (UTC+8)) [💬1] ### Your current environment
vllm serve Qwen3-VL-8B --video-pruning-rate=0.75 messages=[ { "role": "user", "content": [ # {"type": "text", "text": "What's in this video?"}, ...
-
#30843 [Bug]: vllm serve GLM-4.6V-Flash error — bug — by F0undLinks (created: 2025-12-17 14:23 (UTC+8)) [💬4] ### Your current environment
docker image: vllm/vllm-openai:nightly(sha256:b40b770900bfb2b4a66bc04e888141830e20fd732c79e07ab3e3d6186d0ed437)
vllm version: 0.13.0rc2.dev118+g29f7d9771 transformers version: 4.57.3
### 🐛 Describe the bug
…
-
#30851 [New Model]: FunAudioLLM/Fun-ASR-Nano-2512 — new-model,multi-modality — by verigle (created: 2025-12-17 15:07 (UTC+8)) ### The model to consider.
Support new model: FunAudioLLM/Fun-ASR-Nano-2512
### The closest model vllm already supports.
No response
### What’s your difficulty of supporting the model you want?
…
-
#30848 [Bug]: Fail to run Qwen3-Next model. — bug — by MaoJianwei (created: 2025-12-17 14:54 (UTC+8)) [💬2] ### Your current environment
vLLM is built from source (main branch). The output of `python collect_env.py`: Python 3.12.9 …
-
#30853 [Performance]: DeepSeek V3.2 Benchmarking: Significant performance discrepancy between initial and subsequent runs — performance — by zkf331 (created: 2025-12-17 15:14 (UTC+8)) [💬2] ### Proposal to improve performance
Description: I followed the benchmarking instructions for DeepSeek V3.2 provided in the documentation: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#benchmarking
### Report of performance regression
…
-
#30857 [Bug]: Inference with vLLM hangs — bug — by Bonjir (created: 2025-12-17 16:49 (UTC+8)) ### Your current environment
System Info: OS: Ubuntu 22.04.5 LTS (x86_64), GCC 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04), Clang: could not collect, CMake 4.0.3, glibc 2.35 …
-
#30855 [Usage]: Qwen3-30B-A3B-NVFP4 fails on Dell Pro Max GB10 with "no kernel image is available for execution on the device" — usage — by nanbogong (created: 2025-12-17 16:44 (UTC+8)) ### Your current environment
``` Hardware: Dell Pro Max GB10 OS: Ubuntu 24 CUDA: cuda_13.0.r13.0 Cuda compilation tools, release 13.0, V13.0.88; vllm: V0.12.0 torch_version: 2.9.0+cu128 model: RedHatAI/Qwen3-30B-A3B-NVFP4 or nvidia/Qwen3-30B-A3B-NVFP4 or nvidia/Qwen3-30B-A3B-FP4 …
-
#30832 [Performance]: DeepSeek-V3.2 on 8xH20 30 decode tokens/sec — performance — by lisp2025 (created: 2025-12-17 11:08 (UTC+8)) ### Proposal to improve performance
My env: vllm 0.13.0rc2.dev178+g676db55ee, deep_gemm 2.1.1+c9f8b34, CUDA 12.9, Python 3.10.18
The command is the same as: vllm serve mypath/DeepSeek-V3.2
…
-
#30839 [RFC]: Enabling Zero-Copy Video with PyNvVideoCodec and IPC — RFC — by brandonpelfrey (created: 2025-12-17 12:46 (UTC+8)) ### Motivation.
This RFC proposes a set of changes for enabling HW-accelerated video decoding and zero-copy transfer to CoreEngine for improving VLM latency/throughput and reducing CPU pressure for (eventual) multi-GPU scaling. These changes are aimed at improving the basic DP=1 case first with future improvements to support more complex DP>1 or TP cases.
# Benchmark Data
A comparison of throughput for video captioning tasks consistently showed a 2-3% improvement in overall throughput. This …
-
#30835 [Feature]: support glm-46v — feature request — by Busboy3129 (created: 2025-12-17 11:57 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
https://huggingface.co/collections/zai-org/glm-46v
### Alternatives
No response
### Additional context …
[Closed issues]
-
#19184 [Bug]: Image Fails to Initialize (Undetected Platform) because of LD_LIBRARY_PATH, PATH environment error with vllm >= 0.9.0 — bug,stale — by junam2 (closed: 2025-12-18 10:26 (UTC+8)) [💬7] ### Your current environment
The output of `python collect_env.py`: INFO 06-04 19:48:17 [__init__.py:247] No platform detected, vLLM is running on UnspecifiedPlatform. Collecting environment information... System Info …
-
#19439 [Performance]: Higher latency after using tensor parallelism on whisper turbo — performance,stale — by ash-11 (closed: 2025-12-18 10:26 (UTC+8)) [💬3] ### Report of performance regression
I am observing higher latency after using tensor parallelism. I am running the script below on an A100x2 GPU.
For tensor_parallel_size=1 the latency is 92 ms, and for tensor_parallel_size=2 it is 112 ms, which is counterintuitive.
```python …
-
#21176 [Bug]: ValueError: There is no module or parameter named 'mm_whisper_embeddings' in LlamaForCausalLM — bug,stale — by chuber11 (closed: 2025-12-18 10:26 (UTC+8)) [💬8] ### Your current environment
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
RUN apt update && apt install -y git screen
RUN pip install flashinfer-python RUN pip install -U "vllm[audio]" --extra-index-url https://wheels.vllm.ai/nightly
ENTRYPOINT ["sleep", "999999999"] # For now: exec into the container and run vllm serve from there …
-
#21695 [Bug]: can not support InternVL3-78B-AWQ — bug,stale — by yangpeng-space (closed: 2025-12-18 10:26 (UTC+8)) [💬6] ### Your current environment
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
### 🐛 Describe the bug
code: CUDA_VISIBLE_DEVICES=3,4 vllm serve ./InternVL3-78B-AWQ --gpu-memory-utilization 0.95 --max-model-len 32768 --quantization awq --enforce-eager --host 0.0.0.0 --port 11441 --tensor-parallel-size 2 --dtype half --trust-remote-code
error: …
-
#21746 [Feature]: Add LoRA adapter support for Gemma3nForConditionalGeneration models — feature request,stale — by sydarb (closed: 2025-12-18 10:26 (UTC+8)) [💬6] ### 🚀 The feature, motivation and pitch
vLLM currently throws the following error when attempting to use LoRA adapters with Gemma3n models:
ERROR: ValueError: Gemma3nForConditionalGeneration does not support LoRA yet. Minimum Viable Implementation: For now, LoRA support could be implemented for text-only inference streams, even if multimodal components (vision/audio) are being added or don't initially support LoRA.
### Alternatives …
-
#22071 [Usage]: Kubernetes Offline Model Usage — usage,stale — by coffee-overflow (closed: 2025-12-18 10:25 (UTC+8)) [💬4] ### Your current environment
Chart version: 0.1.6
### How would you like to use vllm
Hello,
I have installed vLLM in a Rancher Kubernetes environment using a Helm chart.
…
-
#22079 [Bug]: repeat response or error response with confused code — bug,stale — by laibruce (closed: 2025-12-18 10:25 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`: (left as the issue-template placeholder)…
-
#22111 [Bug]: The class UnquantizedLinearMethod must implement the 'embedding' method — bug,stale — by zchen-gitch (closed: 2025-12-18 10:25 (UTC+8)) [💬9] ### Your current environment
My environment:
vllm 0.10.0 torch 2.7.1 torchaudio 2.7.1 torchvision 0.22.1…
-
#22206 [Feature]: consider offering a linux/arm64 build on Docker Hub — feature request,stale — by mikix (closed: 2025-12-18 10:25 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Right now, it looks like linux/amd64 is the only build offered. It would be convenient if there were a linux/arm64 build too.
### Alternatives
No response
### Additional context
…
-
#22376 [Bug]: Model architectures ['GptOssForCausalLM'] failed to be inspected. — bug,stale — by QichangZheng (closed: 2025-12-18 10:25 (UTC+8)) [💬16] ### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 22.04.4 LTS (x86_64) …
-
#22656 [Bug]: Caching is incompatible with gradient checkpointing in Qwen3DecoderLayer. Setting past_key_value=None. — bug,stale — by Randomdude11 (closed: 2025-12-18 10:25 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#23086 [Bug][NIXL]: AssertionError assert self.dst_num_blocks[engine_id] == nixl_agent_meta.num_blocks — bug,stale — by wzmayus (closed: 2025-12-18 10:25 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#23098 [Bug]: Deepseek v3 function call error — bug,stale — by deepblacksky (closed: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environment
vllm 0.9.0, deepseek-v3-0324, H20-141*8
vllm docker:
…
-
#23102 [Usage]: How to run a benchmark with dataset FreedomIntelligence/ALLaVA-4V? — usage,stale — by LJH-LBJ (closed: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environment
I don’t know how to integrate it with vllm. https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V
### How would you like to use vllm
I don’t know how to integrate it with vllm. https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V
…
-
#23115 [Bug]: Model load in GKE on TPU v6e is extremely slow when HF cache is backed by GCSFuse — bug,stale — by kiratp (closed: 2025-12-18 10:25 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#23128 [Usage]: When starting vllm/vllm-openai:gptoss it says gpt_oss is not installed — usage,stale — by alew3 (closed: 2025-12-18 10:25 (UTC+8)) [💬3] ### Your current environment
When starting docker to run openai/gpt-oss-20b as below, I get the following warnings and can’t use websearch or code interpreter
(APIServer pid=1) WARNING 08-18 11:42:46 [tool.py:38] gpt_oss is not installed, browsing is disabled (APIServer pid=1) WARNING 08-18 11:42:46 [tool.py:69] gpt_oss is not installed, code interpreter is disabled. I inspected the docker and can see it is installed: ``` $ pip list | grep oss …
-
#23138 [Bug]: Non-deterministic outputs despite temperature=0.0 when handling concurrent requests — bug,stale — by banne2266 (closed: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#23152 [Bug]: TypeError: expected str, bytes or os.PathLike object, not NoneType — bug,stale — by hyqf98 (closed: 2025-12-18 10:25 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`: (left as the issue-template placeholder)…
-
#23153 [Bug]: 5090 LapTop Notebook will kill the vLLM in Docker — bug,stale — by emptystack1024 (closed: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environment
============================== System Info ============================== OS : Could not collect GCC version : Could not collect Clang version : Could not collect CMake version : Could not collect Libc version : N/A …
-
#23157 [RFC]: Integrate MetaX GPU Backend into vLLM Plugin — RFC,stale — by ILikeIneine (closed: 2025-12-18 10:25 (UTC+8)) [💬2] # Integrate MetaX GPU into vLLM Plugin
<img src="https://github.com/user-attachments/assets/6d3cd958-c31c-42aa-ac17-431f0da8cee2" width=30%>
## Background
MetaX is dedicated to delivering full-stack GPU chips and solutions for heterogeneous computing, which are widely applicable in cutting-edge domains such as Intelligent Computing, Cloud Computing, Autonomous Vehicles. These solutions provide robust computational support for the advancement of the digital economy.
MetaX develops full-stack …
-
#23196 [Bug]: unable to use GPTOss built-in browser tool — bug,stale — by Hannibal046 (closed: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info …
-
#23215 [Bug]: Grammar backend fails to process empty
guided_grammarstring — bug,stale — by ultmaster (关闭于: 2025-12-18 10:25 (UTC+8)) [💬2] ### Your current environmentv0.10.1 / v0.10.2
### 🐛 Describe the bug
The grammar backend (
vllm/v1/structured_output/backend_xgrammar.py) crashes when given an empty string ("") asguided_grammar. This results in a runtime error during compilation:``` …
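The failure mode is an unvalidated empty string reaching the grammar compiler. A hedged sketch of the kind of input guard that avoids it (hypothetical helper, not vLLM's actual code):

```python
def validate_guided_grammar(grammar: str) -> str:
    """Reject empty or whitespace-only grammars before they reach the
    structured-output backend (illustrative guard, not vLLM's code)."""
    if not isinstance(grammar, str) or not grammar.strip():
        raise ValueError("guided_grammar must be a non-empty grammar string")
    return grammar

# A well-formed GBNF-style grammar passes through unchanged.
assert validate_guided_grammar('root ::= "yes" | "no"') == 'root ::= "yes" | "no"'

# The reported crash input ("") is rejected with a clear error instead.
try:
    validate_guided_grammar("")
except ValueError as err:
    print(err)
```

Rejecting the request at validation time turns an opaque backend crash into a 400-style user error.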
-
#28609 [CI][Feature]: Add workflow to make sure PR branch is regularly synced with main — feature request,ci — by khluu (closed: 2025-12-18 08:46 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
To remind PR authors to sync with main, preventing out of sync PRs causing issues when merged into main. Ideally always synced within 1 day of latest main.
### Alternatives
No response
### Additional context
…
-
#30893 [Bug]: V1 Engine splits offline embedding batches unexpectedly when multiprocessing is enabled — bug — by dmadicTT (closed: 2025-12-18 07:07 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 22.04.4 LTS (x86_64) …
-
#30854 [CI Failure]: Entrypoints Integration Test (API Server) — ci-failure — by SungMinCho (closed: 2025-12-18 05:15 (UTC+8)) [💬1] ### Name of failing test
buildkite/ci/pr
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
…
-
#30882 [Bug]: Marlin Fp8 Block Quant Failure — bug,help wanted,good first issue — by robertgshaw2-redhat (closed: 2025-12-18 00:02 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`: (left as the issue-template placeholder)…
-
#27152 [Perf][Qwen3-next]: torch.compile GDN attn — performance — by vadiklyutiy (closed: 2025-12-17 22:04 (UTC+8)) [💬3] ### Proposal to improve performance
Right now GDN attn (Qwen3NextGatedDeltaNet, used in Qwen3-next) isn't covered by torch.compile. GDN, unlike full attn, contains a lot of operators, including many elementwise ones. Below is an illustration. Right now GDN attn is implemented as a custom op and torch.compile doesn't go inside. I wrote the following script that calls v…
-
#30554 [Bug]: When using minimax m2 model tool call, throw IndexError in serving_chat.py — bug — by WangErXiao (closed: 2025-12-17 16:37 (UTC+8)) ### Your current environment
When using the MiniMax M2 model tool call, an IndexError ("list index out of range") is raised at line 1181 in serving_chat.py: actual_call = tool_parser.streamed_args_for_tool[index]. It's caused by minimax_m2_tool_parser.py's streamed_args_for_tool field, which is always an empty list.
### 🐛 Describe the bug
When using the MiniMax M2 model tool call, an IndexError ("list index out of range") is raised at line 1181 in serving_chat.py. It's caused by minimax_m2…
-
#24284 [RFC]: Support reporting tool output tokens in OutputTokensDetails — RFC,stale — by yeqcharlotte (closed: 2025-12-17 14:01 (UTC+8)) [💬2] ### Motivation.
When we leverage vllm to drive tool call execution loop, it’s important to understand the overall throughput produced by the system. The overall useful token generated by the system is the sum of decoding tokens and tool output tokens.
Reasoning and final output tokens will be captured as “output tokens” but response from tool environment is directly added to prefill.
### Proposed Change.
Tool tokens can be simply calculated as follows: …
-
#30835 [Feature]: support glm-46v — feature request — by Busboy3129 (closed: 2025-12-17 12:18 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
https://huggingface.co/collections/zai-org/glm-46v
### Alternatives
No response
### Additional context …
[New PRs]
- #30881 [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs — no labels — by dsikka (created: 2025-12-17 23:51 (UTC+8)) [+49/-44, 3 files | commented:5 | 📝 draft]
## Purpose
- Clean-up the conditions used to select the Nvfp4 schemes
- Enable running Nvfp4A16 MoEs through the marlin pathway
## Test Plan
- Smoke testing Qwen3 MoE NVFP4 and NVFP4A16 models
- Test on b200 and lower
## Test Result
…
-
#30907 [Bug] Fix batch invariant in torch 2.10 — ready — by yewentao256 (created: 2025-12-18 06:20 (UTC+8)) [💬2 | +20/-24, 1 files | commented:2 approved:2] ## Purpose
Fixes https://github.com/pytorch/pytorch/issues/170490
## Test
Originally:
```bash … …
- #30867 [Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models — frontend,gpt-oss — by HaloWorld (created: 2025-12-17 20:52 (UTC+8)) [💬2 | +87/-4, 2 files | commented:5]
GPT-OSS models using the harmony format were ignoring the tool_choice="none" parameter and could still trigger tool calls when tools were provided in the request. This issue arose because the _make_request_with_harmony method only checked for the existence of request.tools, without accounting for the tool_choice setting or the exclude_tools_when_tool_choice_none flag. This fix ensures that harmony models respect the exclude_tools_when_tool_choice_none flag, aligning their behavior wi…
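The fix described above boils down to filtering the tool list on tool_choice before prompt construction. A minimal sketch of that decision logic (function name is illustrative; the flag name is taken from the summary, simplified):

```python
def effective_tools(tools, tool_choice, exclude_tools_when_tool_choice_none=True):
    """Return the tool list the model should actually see.

    Illustrative simplification of the fix: with tool_choice="none" (and the
    exclusion flag on), tools are dropped before prompt construction instead
    of being forwarded just because request.tools is non-empty.
    """
    if not tools:
        return []
    if tool_choice == "none" and exclude_tools_when_tool_choice_none:
        return []
    return tools

tools = [{"name": "get_weather"}]
assert effective_tools(tools, "auto") == tools   # tools still offered normally
assert effective_tools(tools, "none") == []      # the path that previously leaked through
```

Checking the choice, not just the presence of tools, is what stops the model from emitting tool calls the client explicitly disabled.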
- #30860 add new triton op fused_sigmoid_gating_delta_rule_update — qwen — by momo609 (created: 2025-12-17 17:07 (UTC+8)) [💬2 | +346/-86, 3 files | commented:1] ## Purpose qwen3_next adds a fused_sigmoid_gating_delta_rule_update op, which fuses fused_gdn_gating + fused_recurrent_gated_delta_rule
## Test Plan
## Test Result
...
-
#30887 [Bugfix] [Kernel] 3D Triton kernel: mask out V blocks that fall outside sliding window — no labels — by tdoublep (created: 2025-12-18 00:46 (UTC+8)) [💬2 | +6/-0, 1 files | commented:1] ## Purpose
There is currently a bug in the 3D attention kernel where we don't correctly mask out V blocks that fall outside the sliding window. On main, we can be reading garbage blocks that contain NaN, which will corrupt the output. This PR resolves it.
Note: this does not affect the 2D kernel, because there we prune out tiles that fall outside the sliding window. I propose a follow-up PR to prune the tiles in the 3D kernel as well, but I think we should get this fix in now.
Potentially fixes:
- https…
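For intuition, the sliding-window masking condition can be sketched at the block level in plain Python (this illustrates the invariant the fix restores, not the Triton kernel itself):

```python
def blocks_in_window(query_pos, window, block_size, num_blocks):
    """Block indices a causal query at `query_pos` may attend to under a
    sliding window of `window` tokens (block-level illustration only)."""
    lo = max(0, query_pos - window + 1)       # oldest visible token
    hi = query_pos                            # newest visible token (causal)
    first = lo // block_size
    last = min(hi // block_size, num_blocks - 1)
    return list(range(first, last + 1))

# block_size=4, window=6: a query at position 10 sees tokens 5..10,
# i.e. blocks 1 and 2. Block 0 holds only out-of-window tokens and must
# be masked out, or its (possibly garbage/NaN) V values leak into the output.
assert blocks_in_window(10, 6, 4, 4) == [1, 2]
```

Any kernel that loads a V block outside this range without masking can pick up uninitialized memory, which is exactly the NaN corruption the PR describes.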
-
#30917 [Feature] Support using FlexKV as another KV Cache Offloading option — documentation,needs-rebase,v1,kv-connector — by axxx03 (created: 2025-12-18 10:35 (UTC+8)) [💬3 | +367/-2, 4 files | commented:1] Description: FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we achieve the following improvements:
- ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QP…
-
#30895 [BugFix] Handle errors when preprocessing added requests — bug,ready,v1 — by njhill (created: 2025-12-18 02:24 (UTC+8)) [💬4 | +88/-2, 2 files | commented:2 approved:1]
preprocess_add_request() runs in the engine core input socket processing thread. It's not expected to raise exceptions, but if it does, they aren't caught or logged: the input processing thread exits and the engine hangs silently. This PR catches and logs request-scoped preprocessing errors, and returns an output to fail the request in question.
I'll open another PR to deal with any other unexpected exceptions occurring in the input socket processing thread (which should be considered fata…
-
#30916 [BugFix] Fix spec decode + structured outputs + preemption edge case — bug,ready,v1 — by njhill (created: 2025-12-18 09:58 (UTC+8)) [+5/-1, 1 files | commented:1] Fix an edge case that can be triggered when using spec decode with structured outputs.
There is a sequence of preemption and drafting being skipped that can result in the scheduler's request.spec_token_ids being stale, which can then fail bitmask generation because they will be out of sync with the grammar.
This fixes that case by ensuring that request.spec_token_ids is cleared when drafting is skipped for a given step, rather than potentially left set to a prior step's draft tokens. It…
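The invariant in the fix: speculative tokens must never outlive the step that drafted them. A toy sketch (illustrative names, not the scheduler's real code):

```python
class Request:
    """Toy request carrying speculative (draft) token ids between steps."""
    def __init__(self):
        self.spec_token_ids = []

def schedule_step(request, draft_tokens):
    """One scheduler step: record this step's draft tokens, or clear them
    when drafting was skipped, so a later step never sees stale drafts."""
    if draft_tokens:
        request.spec_token_ids = list(draft_tokens)
    else:
        request.spec_token_ids = []   # the fix: no stale tokens survive a skipped step

req = Request()
schedule_step(req, [5, 7, 9])
assert req.spec_token_ids == [5, 7, 9]   # drafts recorded for this step
schedule_step(req, [])                   # drafting skipped (e.g. after preemption)
assert req.spec_token_ids == []          # cleared, so bitmask stays in sync with the grammar
```

Without the else-branch, the second step would still carry [5, 7, 9], which is the stale-state bug the PR closes.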
-
#30911 [AMD][CI] fix lm eval ci arg — rocm,ready,ci/build — by divakar-amd (created: 2025-12-18 07:38 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] Fixing args for the AMD-CI run in accordance with https://github.com/vllm-project/vllm/pull/30723
This fixes the following tests for AMD-CI:
LM Eval Small Models, LM Eval Small Models (1 Card)
-
#30910 [BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) — ready — by LucasWilkinson (created: 2025-12-18 07:28 (UTC+8)) [💬2 | +14/-74, 2 files | commented:1 approved:1] Partially revert https://github.com/vllm-project/vllm/pull/29558, as it broke H200 tests
https://buildkite.com/vllm/ci/builds/43863#019b29e9-5c1b-4eff-83f7-c8304f774aa7
i.e.
VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048…
-
#30906 [Feature]: support serving nvfp4 W4A16 moe models using Marlin — no labels — by EdalatiAli (created: 2025-12-18 05:59 (UTC+8)) [💬3 | +220/-12, 2 files | commented:2]
## Purpose This PR enables serving nvfp4 W4A16 compressed-tensors quantized MoE models by adding CompressedTensorsW4A16Nvfp4MoeMethod. Weight-only nvfp4 quantization improves quality at the expense of higher latency at large concurrencies. As the provided results show, the nvfp4 W4A16 quantized version of Qwen/Qwen3-30B-A3B significantly outperforms the nvfp4 W4A4 variant. ## Test Plan …
-
#30836 [Model] Add MiMo-V2-Flash support — documentation,new-model — by Abatom (created: 2025-12-17 12:08 (UTC+8)) [💬11 | +793/-13, 6 files | commented:3]
## Purpose Add support for MiMo-V2-Flash. ## Test Plan
``` vllm serve --model XiaomiMiMo/MiMo-V2-Flash
--host 0.0.0.0
--port 9001
…
- #30869 [Bugfix] fix the alias bug of AttentionBackendEnum when registering a CUSTOM attention backend to vllm — no labels — by zejunchen-zejun (created: 2025-12-17 21:19 (UTC+8)) [💬6 | +4/-2, 1 files | commented:1] The bug: TORCH_SDPA is an alias of the CUSTOM backend in AttentionBackendEnum, because neither of them has a value. When you register your attention backend as below, it will be unexpectedly registered to TORCH_SDPA: ``` import torch from vllm.attention.backends.registry import register_backend, AttentionBackendEnum from vllm.attention.backends.abstract import AttentionType from vllm.attention.selector import get_attn_backend from vllm.attention.backends.abstract import AttentionBack…
-
#30914 [Bug] Fix torch inductor issue — ready,qwen — by yewentao256 (created: 2025-12-18 08:35 (UTC+8)) [+5/-3, 1 files | commented:1] ## Purpose
Context: https://vllm-dev.slack.com/archives/C08U97ZRC0J/p1765934670913979
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
(EngineCore_DP0 pid=2074384) ERROR 12-18 00:32:02 [core.py:866] File "…
-
#30915 [Fix][FlexAttention] return max logical block index to handle reused blocks — v1 — by ivanium (created: 2025-12-18 08:54 (UTC+8)) [+42/-4, 2 files | commented:1]
## Purpose
For FlexAttention, we need to build physical_to_logical_mapping to reversely map physical block ids to logical block ids. This process previously assumed logical block ids are always unique, which is not true for some attention types such as sliding window attention, where some blocks may be released and reused later, causing the same physical block id to appear multiple times in a row of block_table at different logical block indices. As a result, …
-
#30875 [CI][Feature] Adds auto-rebase PR rule — ready,ci/build — by rafvasq (created: 2025-12-17 23:10 (UTC+8)) [💬3 | +12/-0, 1 files | commented:4 approved:2] ## Purpose
PRs >= 10 commits behind main are automatically rebased to stay reasonably in sync. Conflicts or PRs with the needs-rebase label would still defer to the author. Closes https://github.com/vllm-project/vllm/issues/28609
CC @khluu
-
#30849 [Misc][Benchmark] use pid tracking, replaced pkill commands — performance — by RuixiangMa (created: 2025-12-17 14:59 (UTC+8)) [💬3 | +29/-5, 1 files | commented:1] ## Purpose
Problem: pkill commands could accidentally kill other vllm services. Solution: pid-based process tracking and safe termination, which poses zero risk to other services.
## Test Result
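The PID-tracking idea can be sketched in a few lines of Python: terminate only the processes we spawned ourselves, and escalate to SIGKILL on timeout (an illustrative sketch, not the benchmark script's actual code):

```python
import subprocess

# Track the processes we launch, so shutdown only ever touches them,
# unlike `pkill -f vllm`, which matches any process with "vllm" in its
# command line, including unrelated services on the same host.
procs = []

def launch(cmd):
    p = subprocess.Popen(cmd)
    procs.append(p)
    return p

def shutdown(timeout=5):
    for p in procs:
        p.terminate()              # SIGTERM only the processes we own
    for p in procs:
        try:
            p.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            p.kill()               # escalate if a process ignores SIGTERM
            p.wait()
    procs.clear()

child = launch(["sleep", "60"])
shutdown()
assert child.returncode is not None  # our child was reaped; nothing else touched
```

Holding on to the Popen handles (rather than pattern-matching process names later) is what makes the termination safe by construction.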
-
#30913 [docker] install cuda13 version of lmcache and nixl — ci/build,kv-connector,nvidia — by soodoshll (created: 2025-12-18 07:47 (UTC+8)) [💬1 | +18/-1, 1 files | commented:2] Fixes #30628.
## Purpose
This PR modifies the Dockerfile to install nixl-cuda13 when using CUDA 13. The nixl package looks for the two sub-packages nixl-cuda13 and nixl-cuda12 in order, and uses the first one found. lmcache triggers compilation from source when installed on arm64, therefore some dependency packages need to be installed. cc @wangshangsam
…
-
#30912 Fix/get raw stream patch #30905 — no labels — by baonudesifeizhai (created: 2025-12-18 07:39 (UTC+8)) [+17/-0, 1 files | commented:2]
## Purpose
Fix TorchInductor autotune error when using compile_sizes > 1 in the compilation config. The error occurs because TorchInductor's autotune code generation uses the get_raw_stream() function without defining it, causing RuntimeError: name 'get_raw_stream' is not defined.
This PR adds a workaround that patches get_raw_stream into the builtins module by importing it from torch._C._cuda_getCurrentRawStream.
Related issue: https://github.com/vllm-project/v…
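The workaround is a general Python technique: inject a missing name into the builtins module so generated code can resolve it without an import. A self-contained sketch with a stand-in function (the real PR imports torch._C._cuda_getCurrentRawStream; the stub below is hypothetical):

```python
import builtins

def _raw_stream_stub(device_index):
    """Hypothetical stand-in for torch._C._cuda_getCurrentRawStream."""
    return f"raw-stream-for-device-{device_index}"

# Inject the missing name into builtins so any module, including
# dynamically generated code, can resolve it as a bare name.
if not hasattr(builtins, "get_raw_stream"):
    builtins.get_raw_stream = _raw_stream_stub

# Code compiled elsewhere (like Inductor's autotune harness) can now
# call get_raw_stream without ever importing it.
ns = {}
exec(compile("result = get_raw_stream(0)", "<generated>", "exec"), ns)
assert ns["result"] == "raw-stream-for-device-0"
```

Name resolution for a bare identifier falls back from module globals to builtins, which is why patching builtins fixes code you don't control; the trade-off is that the name becomes globally visible to every module in the process.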
- #30909 [ROCm][Bugfix] Fix fa_version argument error in flash_attn_maxseqlen_wrapper for ROCm without aiter — rocm — by AndreasKaratzas (created: 2025-12-18 06:58 (UTC+8)) [+5/-4, 1 files | commented:2] ### Problem On ROCm platforms with AITER either uninstalled or disabled, flash_attn_varlen_func fails with: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'fa_version'
This occurs because is_rocm_aiter is False when aiter is disabled, causing the code to fall through to the else branch, which unconditionally passes fa_version to flash_attn_varlen_func. However, the ROCm version of Flash Attention (via vllm.attention.utils.fa_utils) does not support…
-
#30908 Migrate activation kernels to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (创建于: 2025-12-18 06:49 (UTC+8)) [+223/-130, 12 files | commented:3]
## Purpose
This change requires PyTorch 2.10 to build.
First step towards migration of the CUDA wheel to the libtorch stable ABI; see https://github.com/vllm-project/vllm/issues/26946
`activation_kernels.cu` is migrated to use the stable ABI/API …
-
#30885 [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding — nvidia — by LopezCastroRoberto (创建于: 2025-12-18 00:33 (UTC+8)) [💬1 | +169/-22, 7 files | commented:4] ## Summary This PR adds an opt-in NVFP4 backend variant that uses smaller scaling-factor tiling (8x4 SF layout). The change targets small-concurrency decode workloads and delivers ~25–35% higher output token throughput compared to the current best NVFP4 backend at small batch sizes.
Note: this backend is not recommended for medium or large batch sizes. Benefits are limited to the small-batch decode regime - see below.
Enable backend by explicitly setting:
```bash export VLLM_NVFP4GEMM…
- #30899 [BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM — ready — by varun-sundar-rabindranath (创建于: 2025-12-18 03:29 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1]
## Purpose
Running server command,
`VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve Qwen/Qwen3-30B-A3B-FP8 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010` with the benchmark command,
`vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 256 --random-input-len 4096 --random-output-len 1 --ignore-eos --port 9010 --backend vllm` trips the assert [here](https://github.com/vllm-project/vllm/blob/e3a0f21e6ce7…
-
#30902 [compile] Fix CI for test_gpt2_cache_hit — ready — by zhxchen17 (创建于: 2025-12-18 04:10 (UTC+8)) [💬1 | +15/-6, 2 files | commented:1 approved:1] Signed-off-by: zhxchen17 zhxchen17@fb.com
## Purpose
Fixing torch 2.10 release CI issues from https://github.com/pytorch/pytorch/issues/170549
## Test Plan
pytest tests/compile/test_aot_compile.py …
-
#30903 [UX] Reduce DeepGEMM warmup log output to single progress bar — ready,startup-ux — by MatthewBonanni (创建于: 2025-12-18 05:17 (UTC+8)) [+99/-42, 1 files | commented:2 approved:1] ## Purpose Showing a progress bar for each shape during DeepGEMM warmup is unnecessary. In the interest of reducing log clutter, this PR reduces the output to a single progress bar, only shown on rank 0.
## Test Plan
`vllm serve deepseek-ai/DeepSeek-R1 -dp 8 --enable-expert-parallel` ## Test Result …
-
#30897 [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size — performance,quantization,ready,nvidia — by mgoin (创建于: 2025-12-18 03:09 (UTC+8)) [💬1 | +243/-97, 5 files | commented:2] ## Purpose
We discovered that the nvfp4 input quant operation was severely bottlenecking performance for dense models at small batch sizes. While investigating the flashinfer impl, I noticed that there is logic to increase the grid size for small M, based on the swizzled layout to help out with the padding ops. Adding that “effective row” calculation greatly …
- #30833 [Quant] Make static quant support all group shapes — 无标签 — by LucasWilkinson (创建于: 2025-12-17 11:19 (UTC+8)) [💬1 | +409/-45, 7 files | commented:2] Preparatory pr for https://github.com/vllm-project/vllm/pull/30141 (per-head quant)
-
#30898 [Refactor] Refactor for `DeepGemmQuantScaleFMT` using cache — ready — by yewentao256 (创建于: 2025-12-18 03:15 (UTC+8)) [💬1 | +29/-9, 2 files | commented:2] ## Purpose A follow-up PR for https://github.com/vllm-project/vllm/pull/30336#issuecomment-3632960722
- We should use `DeepGemmQuantScaleFMT` instead of `if self.use_deep_gemm_e8m0 and self.is_blackwell:`
- Update the `DeepGemmQuantScaleFMT` using cache so that we won't hit `torch._dynamo.exc.Unsupported: can't handle functions not implemented in python` in the future
-
#30900 fix fp8 online quantization streaming with tp > 1 — ready — by vkuzo (创建于: 2025-12-18 03:41 (UTC+8)) [💬4 | +32/-8, 1 files | commented:1 approved:2] Summary:
Fix for https://github.com/vllm-project/vllm/issues/30830
When we added online fp8 quant with streaming weight post-processing in https://github.com/vllm-project/vllm/pull/29196, a bug was introduced where TP>1 case was not always handled correctly. Specifically:
- https://github.com/vllm-project/vllm/pull/29196 assumed that `weight_loader` copies `loaded_weight` to `param` directly - this is not always true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_wei…
-
#30842 [Kernels][FI] Skip trtllm attention when num_kv_heads=1 — ready,nvidia — by yeqcharlotte (创建于: 2025-12-17 13:41 (UTC+8)) [💬7 | +56/-1, 2 files | commented:1 approved:1]
## Purpose We got the following error when running a small model on blackwell ``` [WORKER]: File “/redacted/path/executor/abstract.py”, line 116, in initialize_from_config [WORKER]: self.collective_rpc(“compile_or_warm_up_model”) [WORKER]: File “/redacted/path/executor/uniproc_executor.py”, line 75, in collective_rpc [WORKER]: result = run_method(self.driver_worker, method, args, kwargs) [WORKER]: File “/redacted/path/serial_utils.py”, line 460, in run_me…
-
#30894 Improve DCP error message with actionable guidance — v1 — by Dhruv-80 (创建于: 2025-12-18 01:39 (UTC+8)) [💬3 | +64/-26, 4 files | commented:1] ## Purpose
Improve the error message shown when Decode Context Parallelism (DCP) is enabled together with an attention backend that does not support returning softmax LSE values during decode.
Previously, this case raised a generic assertion error that gave users no actionable guidance. This change replaces the assertion with a user-facing exception that clearly explains the limitation and suggests how to resolve it (e.g., by selecting a different attention backend or disabling DCP). …
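The assert-to-exception pattern might look like this; the helper name and message below are illustrative, not the PR's exact code:

```python
# Illustrative version of the change in #30894: instead of a bare assertion,
# raise an error that names the backend and tells the user how to proceed.
# check_dcp_support is a hypothetical helper, not vLLM's actual API.
def check_dcp_support(backend_name: str, supports_decode_lse: bool) -> None:
    if not supports_decode_lse:
        raise ValueError(
            f"Attention backend {backend_name!r} does not support returning "
            "softmax LSE during decode, which Decode Context Parallelism "
            "(DCP) requires. Select a different attention backend or "
            "disable DCP."
        )

try:
    check_dcp_support("SOME_BACKEND", supports_decode_lse=False)
except ValueError as e:
    message = str(e)
assert "DCP" in message
```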
-
#30864 [Model] Nemotron Parse 1.1 Support — new-model,ci/build,multi-modality — by amitz-nv (创建于: 2025-12-17 18:50 (UTC+8)) [💬3 | +1102/-29, 9 files | commented:6] ## Purpose
- Add support for NVIDIA Nemotron Parse 1.1 (HF name: `nvidia/NVIDIA-Nemotron-Parse-v1.1`)
- Adapted from https://github.com/amalad/vllm/blob/nemotron_parse/vllm/model_executor/models/nemotron_parse.py that's based on https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1/blob/main/hf_nemotron_parse_modeling.py
- Bart classes based on old vLLM codebase: https://github.com/vllm-project/vllm/blob/v0.10.2/vllm/model_executor/models/bart.py
- Updated RADIO model
- Changed its `f…
-
#30892 [Feature] Add /v1/kv_cache/status API endpoint — frontend,v1 — by maggie26375 (创建于: 2025-12-18 01:16 (UTC+8)) [💬4 | +233/-0, 5 files | commented:1] This PR adds a new REST API endpoint `/v1/kv_cache/status` that returns detailed KV cache status information in JSON format. ## Changes
- Add `/v1/kv_cache/status` endpoint to api_server.py
- Add `num_total_blocks` and `num_free_blocks` fields to SchedulerStats
- Add new Prometheus gauges for block metrics
- Update Scheduler.make_stats() to collect block information
- Add unit tests for the new endpoint
…
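A client-side sketch of consuming such an endpoint; the payload shape below is inferred from the fields listed above (`num_total_blocks`, `num_free_blocks`) and is not the PR's exact schema:

```python
# Hypothetical consumer of a /v1/kv_cache/status response per #30892.
# The dict mimics a JSON body with the fields the PR adds to SchedulerStats.
def kv_cache_utilization(status: dict) -> float:
    # fraction of KV cache blocks currently in use
    total = status["num_total_blocks"]
    free = status["num_free_blocks"]
    return (total - free) / total if total else 0.0

status = {"num_total_blocks": 1000, "num_free_blocks": 250}
assert kv_cache_utilization(status) == 0.75
```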
-
#30891 [Kernel] Configure BlockReduce block size in RMSNorm kernel — 无标签 — by xyang16 (创建于: 2025-12-18 01:05 (UTC+8)) [💬1 | +82/-46, 2 files | commented:1] ## Purpose
This PR configures the `BlockReduce` block size template parameter in the RMSNorm kernel.
- The max_block_size used to launch the kernel can be 256 or 1024 (see here). If max_block_size=256, `BlockReduce` only needs to perform the reduction across up to 256 threads.
- So this PR passes BLOCK_SIZE as a template parameter and uses it for `BlockReduce`.
## Test Plan
``` …
-
#30850 kvcache dump analysis — v1 — by nitingupta910 (创建于: 2025-12-17 15:05 (UTC+8)) [💬3 | +2169/-0, 14 files | commented:1] -
#30890 [UX] Add `-ep` shorthand for `--enable-expert-parallel` — 无标签 — by mgoin (创建于: 2025-12-18 01:00 (UTC+8)) [💬1 | +3/-1, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
... - #30901 custom build backend — rocm,needs-rebase,ci/build — by dtrifiro (创建于: 2025-12-18 03:47 (UTC+8)) [💬1 | +218/-46, 5 files | commented:1 | 📝草稿] Followup to https://github.com/vllm-project/vllm/pull/13480
-
#30888 [Perf] Move eplb rebalance algo to async thread — 无标签 — by ilmarkov (创建于: 2025-12-18 00:52 (UTC+8)) [+909/-387, 8 files | commented:1 | 📝草稿] Review commits: https://github.com/vllm-project/vllm/pull/30888/files/6014dc26d35db160f4e577d9f239175ff5dfe689..67ff54997d383b8c88db652c3df39b4eb9d6a8ea
## Purpose Follow-up of #30697
Move rebalance algo to async thread in async EPLB and move CPU on GPU dependency to async thread. In EPLB we need to get rebalancing results on CPU in order to perform routing logic. It means that at some point we need to wait for gpu computation on CPU creating CPU bubble. In this PR we avoid this bubble in the …
-
#30844 [docs]: add ecosystem projects sr in docs/governance — documentation,ready — by Xunzhuo (创建于: 2025-12-17 14:24 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
This PR added ecosystem projects sr in docs/governance
Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The test plan, such as providing test command. ... -
#30877 [V1][Hybrid][WIP] Mamba Prefix Caching with align mode — documentation,needs-rebase,v1,qwen — by peakcrosser7 (创建于: 2025-12-17 23:26 (UTC+8)) [💬2 | +2046/-96, 32 files | commented:1 | 📝草稿] The cleaned-up version of #29272
## Purpose
## Test Plan
## Test Result
…
-
#30886 [Doc][WIP] Add developer guide for CustomOp — documentation — by shen-shanshan (创建于: 2025-12-18 00:36 (UTC+8)) [💬2 | +39/-0, 1 files | commented:1 | 📝草稿] ## Purpose
Work in progress…
Feel free to leave any feedback.
## Test Plan
## Test Result
…
- #30878 [CI][Bugfix] Fix flaky `tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio` — ready — by NickLucche (创建于: 2025-12-17 23:36 (UTC+8)) [💬3 | +3/-1, 1 files | commented:1 approved:1] As per slack discussion @DarkLight1337
- #30856 Fix changes the behavior from overriding kv_connector_extra_config to updating — ready — by akakakakakaa (创建于: 2025-12-17 16:47 (UTC+8)) [💬3 | +2/-2, 1 files | commented:1 approved:1] When using pooling models together with LMCache, the system crashes due to a missing guard in the LMCache implementation: it does not check whether sampling_params is None. Pooling models do not use sampling parameters, so this leads to a runtime error. (LMCache implementation, [vLLM native implementation (maybe works correctly)](https://github.com/vllm-project/vl…
- #30874 [CI] Fix mypy for `vllm/lora` — 无标签 — by hmellor (创建于: 2025-12-17 23:01 (UTC+8)) [💬1 | +105/-70, 10 files | commented:6] This change brings us closer to not needing a custom mypy check. Part of #26533 -
#30873 [Fix] pass chat_template_kwargs to get_system_message in gpt-oss — frontend,gpt-oss — by seunggil1 (创建于: 2025-12-17 22:32 (UTC+8)) [💬5 | +18/-4, 1 files | commented:1] ## Purpose
This PR addresses the issue where `chat_template_kwargs` (specifically `model_identity`) is ignored when using gpt-oss models. Currently, gpt-oss models use a custom request handling method `_make_request_with_harmony` instead of the standard `apply_hf_chat_template`. Inside this method, the `get_system_message` function is called without passing `chat_template_kwargs`. As a result, the `model_identity` parameter defaults to a hardcoded value (“You are ChatGPT…”), preventing u… -
#30884 [BugFix] Fix offline inference of Qwen3 omni with use_audio_in_video=True — qwen — by Li-dongyang (创建于: 2025-12-17 23:57 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1] ## Purpose
During offline inference with Qwen3-omni using vLLM 0.12, enabling use_audio_in_video leads to nonsensical generations:
- garbled outputs
- repetitive tokens
Investigation suggests the cause might be related to the cache logic of processor. A temporary workaround has been proposed here.
For background information regarding bug reproducibility, please refer to [this related issue](https://github.com/vllm-…
-
#30883 [Chore] Remove v0 dead code for Qwen2.5-omni — ready,qwen — by Isotr0py (创建于: 2025-12-17 23:55 (UTC+8)) [💬1 | +0/-22, 1 files | commented:1 approved:1]
## Purpose
- Just found we missed Qwen2.5-omni’s `embed_multimodal_v0` when removing `embed_multimodal_v0` from models.
## Test Plan
## Test Result
…
- #30876 GLM-4.7 Tool Parser and Doc Update — documentation,tool-calling — by zRzRzRzRzRzRzR (创建于: 2025-12-17 23:20 (UTC+8)) [💬3 | +215/-4, 5 files | commented:1] Added support for GLM-4.7’s Tool Parser and improved documentation.
-
#30868 [Chore] Factor out logic for requesting initial memory — ready,v1 — by DarkLight1337 (创建于: 2025-12-17 20:55 (UTC+8)) [💬3 | +56/-21, 3 files | commented:4 approved:1] ## Purpose
I saw that this logic is duplicated by vllm-omni:
- https://github.com/vllm-project/vllm-omni/blob/acf66ee2cc3799f22e01ec5eeddfe86f39a14388/vllm_omni/worker/gpu_ar_worker.py#L61
- https://github.com/vllm-project/vllm-omni/blob/acf66ee2cc3799f22e01ec5eeddfe86f39a14388/vllm_omni/worker/gpu_generation_worker.py#L7
Let’s factor this out into a separate function. cc @hsliuustc0106 @tzhouam
I have also improved the error message to show exactly which device has insufficient memory (more…
-
#30852 Adapt the old parameter enable_thinking in chat_template_kwargs — ready,deepseek — by SongDI911 (创建于: 2025-12-17 15:10 (UTC+8)) [💬5 | +4/-0, 2 files | commented:2 approved:1] ## Purpose Adapt the old parameter enable_thinking in chat_template_kwargs ## Test Plan
`vllm serve DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32 --tool-call-parser deepseek_v32 --enable-auto-tool-choice --reasoning-parser deepseek_v3` ```python import openai …
-
#30871 Attempt to fix ppc64le build — cpu — by npanpaliya (创建于: 2025-12-17 22:08 (UTC+8)) [💬1 | +2/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#30870 Remove obsolete token chunking in fused MoE kernel — 无标签 — by KonstGolfi (创建于: 2025-12-17 21:55 (UTC+8)) [💬3 | +40/-51, 1 files | commented:1] ### Fixes #30620 Removing obsolete token chunking in fused MoE kernel
- The chunking logic (~65k tokens) in fused_experts_impl was introduced to avoid IMA issues prior to chunked prefill. With chunked prefill now enforced upstream, this code path is no longer reachable. Removing it eliminates dead code and simplifies the fused MoE execution path without changing behavior.
- Test plan commands; 1) vllm bench throughput 2) pytest -s -v vllm/model_executor/layers/fused_moe/fused_moe.py…
-
#30863 add triton ops fused_qkvzba_split_reshape_cat for qwen3_next — qwen — by ZT-AIA (创建于: 2025-12-17 18:48 (UTC+8)) [💬4 | +145/-7, 3 files | commented:1] ## Purpose add triton ops fused_qkvzba_split_reshape_cat for qwen3_next ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... - #30858 Fix lazy import — structured-output,ready,v1 — by hmellor (创建于: 2025-12-17 16:56 (UTC+8)) [💬1 | +6/-4, 1 files | commented:1] Will hopefully fix timeout in V1 entrypoints test
-
#30841 [Bugfix] deepseek-V3.2 self.weights_proj has no bias — ready,deepseek — by baoqian426 (创建于: 2025-12-17 13:29 (UTC+8)) [💬6 | +5/-1, 1 files | commented:1 approved:1]
## Purpose `self.weights_proj` has no bias; on some other hardware the bias may not be initialized to 0, which may be incorrect.
H20: bias initialized to 0
kunlun: bias not initialized to 0
…
- #30862 [ci] Sync test areas yaml file with test-pipeline — ci/build — by khluu (创建于: 2025-12-17 18:29 (UTC+8)) [💬1 | +10/-21, 5 files | approved:1 commented:1] Sync up to https://github.com/vllm-project/vllm/commit/10ee1c64cfa7c0b7f68e9ee793435c9cafbf821a
-
#30840 [Doc][ResponsesAPI] add documentation — documentation,frontend,ready — by qandrew (创建于: 2025-12-17 13:20 (UTC+8)) [💬3 | +41/-4, 2 files | commented:1 approved:1] ## Purpose Add to https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30831 [XPU] fix broken fp8 online quantization for XPU platform — ready — by yma11 (创建于: 2025-12-17 10:54 (UTC+8)) [💬2 | +35/-0, 1 files | commented:3 approved:1]
## Purpose XPU online FP8 quantization is broken by the recently introduced streaming quantization feature. Since streaming quantization has an accuracy issue (#30830), here we just provide a quick fix for the breakage without supporting this feature. We will enable it once the accuracy issue is resolved.
## Test Plan ``` VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dt…
-
#30846 [BugFix] Fix logprobs with spec decode and modified logits — bug,ready,v1 — by njhill (创建于: 2025-12-17 14:39 (UTC+8)) [💬1 | +30/-11, 2 files | commented:1] “raw” logprobs/logits mode does not currently work properly with spec decoding.
The logits need to be cloned in this case before being modified in-place by sampling parameters, but this wasn’t being done in the rejection sampler.
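The clone-before-mutate pattern at issue can be illustrated without torch; plain lists stand in for logits tensors:

```python
# Illustration of the bug class fixed in #30846: if sampling parameters
# modify logits in place, "raw" logprobs must be taken from a clone made
# beforehand. Python lists stand in for tensors here.
logits = [2.0, 1.0, 0.5]

raw_logits = list(logits)     # clone BEFORE any in-place modification
logits[1] = float("-inf")     # e.g. a sampling-param ban applied in place

# the raw view is unaffected by the in-place edit
assert raw_logits[1] == 1.0
assert logits[1] == float("-inf")
```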
-
#30845 [ROCm] Group quant RMS norm fusion patterns — rocm,needs-rebase — by vllmellm (创建于: 2025-12-17 14:38 (UTC+8)) [💬1 | +161/-62, 5 files | commented:1 | 📝草稿] ## Purpose #30828 Guards creation of group quant patterns with current_platform.is_cuda() as a quick patch for ROCm platform. This PR re-enables group quant pattern search for the triton fallback kernel used in
per_token_group_quant_fp8. Perf improvement is summarised below:Metric Before After Successful requests 1000 1000 … -
#30838 Idea for git bisect automation with Buildkite — ci/build — by evandekieft (创建于: 2025-12-17 12:42 (UTC+8)) [💬2 | +1061/-0, 5 files | commented:1] Implement automated git bisect using Buildkite CI to efficiently find the first bad commit. The system consists of two pipelines:
- Driver pipeline: Orchestrates bisect process by querying failed jobs and triggering validation builds at each bisect step
- Validator pipeline: Dynamically runs only specific failing test jobs to speed up bisection
Key features:
- Query failed jobs from a build and run only those during bisection
- Can be run locally or on CI
- Supports specifying build number or …
-
#30837 [Doc] Show that `use_audio_in_video` is supported in docs — documentation,ready,qwen — by DarkLight1337 (创建于: 2025-12-17 12:23 (UTC+8)) [💬2 | +0/-8, 4 files | commented:2] ## Purpose
Address https://github.com/vllm-project/vllm/issues/30779#issuecomment-3661887610 now that #27721 has been merged
## Test Plan
## Test Result
…
[已合并 PR]
- #30811 [ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 — rocm,ready,v1 — by micah-wil (合并于: 2025-12-18 10:27 (UTC+8)) [💬6 | +7/-11, 2 files | commented:7 approved:1]
We’ve seen some flakiness in `v1/e2e/test_async_scheduling.py::test_without_spec_decoding` ever since the test was initially fixed in https://github.com/vllm-project/vllm/pull/29358. Ultimately we believe it was due to switching to fp16 for the test. Before, the test failed about 10%-20% of the time with the following error: ``` =========================== short test summary info ============================ FAILED v1/e2e/test_async_scheduling.py::test_without_spec_… -
#30738 [Metrics] Model FLOPs Utilization estimation — ready,v1 — by SungMinCho (合并于: 2025-12-18 09:40 (UTC+8)) [💬16 | +2186/-2, 8 files | commented:2 approved:2 changes:1] Signed-off-by: SungMinCho tjdals4565@gmail.com
## TL;DR
This PR implements optional “MFU stats logging”, which appends “per-GPU flops/bandwidth” information to the existing periodic logs, reporting the average compute/memory performance of GPUs in each Engine for the duration of that log interval. These stats are calculated with minimal overhead by feeding in SchedulerOutput to the analytic config-based perf calculator at every iteration.
## How to use
Set `--enable-mfu-metrics` when launc… -
#30875 [CI][Feature] Adds auto-rebase PR rule — ready,ci/build — by rafvasq (合并于: 2025-12-18 08:46 (UTC+8)) [💬3 | +12/-0, 1 files | commented:4 approved:2] ## Purpose
PRs >= 10 commits behind `main` are automatically rebased to stay reasonably in sync. Conflicts or PRs with `needs-rebase` labels would still defer to the author. Closes https://github.com/vllm-project/vllm/issues/28609
CC @khluu
- #30386 [v1] Add PrefixLM support to TritonAttention backend — rocm,ready,v1,multi-modality — by Isotr0py (合并于: 2025-12-18 08:05 (UTC+8)) [💬6 | +281/-124, 4 files | commented:1 approved:1]
## Purpose
- Some model vendors have reported that FlexAttention’s PrefixLM implementation (image BiDi attention) is a bit slow because of block mask recomputation.
- This PR also adds image-BiDi attention support to the existing Triton attention backend, which should be faster than FlexAttention.
## Test Plan
## Test Result
pytest -s -v tests/models/multimodal/generation/test_common.py -k gemma3-test…
-
#30700 feat(api): Eager chat template warmup to eliminate first-request latency — documentation,performance,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality,tool-calling — by TheCodeWrangler (合并于: 2025-12-18 08:01 (UTC+8)) [💬5 | +52/-0, 2 files | commented:4 approved:1] ## Summary Adds automatic warmup of chat template processing during server initialization to eliminate the significant latency penalty on the first chat completion request.
## Problem When a vLLM server starts, the first chat completion request experiences much higher latency than subsequent requests. This is caused by lazy initialization of:
- Chat template content format auto-detection — Analyzes the Jinja2 template structure
- Jinja2 template compilation — Compiles the chat templa…
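The warmup idea can be sketched in a dependency-free way; `compile_chat_template` below is a toy stand-in for the Jinja2 compilation and content-format detection steps, not vLLM's actual code:

```python
# Sketch of the eager-warmup idea in #30700: pay the one-time chat-template
# cost at server init instead of on the first request. The "compilation"
# here is a trivial stand-in for Jinja2 template compilation.
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_chat_template(template: str):
    # expensive one-time work happens only on the first call per template
    return template.replace("{message}", "{}").format

def warmup(template: str) -> None:
    # render a dummy message once during startup so the first real request
    # does not pay the compilation cost
    compile_chat_template(template)("warmup")

template = "<|user|> {message} <|assistant|>"
warmup(template)
out = compile_chat_template(template)("hello")
assert out == "<|user|> hello <|assistant|>"
```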
- #30899 [BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM — ready — by varun-sundar-rabindranath (合并于: 2025-12-18 07:00 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1]
## Purpose
Running server command,
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve Qwen/Qwen3-30B-A3B-FP8 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010with the benchmark command,
vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 256 --random-input-len 4096 --random-output-len 1 --ignore-eos --port 9010 --backend vllmTrips the assert [here](https://github.com/vllm-project/vllm/blob/e3a0f21e6ce7…
-
#30842 [Kernels][FI] Skip trtllm attention when num_kv_heads=1 — ready,nvidia — by yeqcharlotte (合并于: 2025-12-17 17:54 (UTC+8)) [💬7 | +56/-1, 2 files | commented:1 approved:1]
## Purpose We got the following error when running a small model on blackwell ``` [WORKER]: File “/redacted/path/executor/abstract.py”, line 116, in initialize_from_config [WORKER]: self.collective_rpc(“compile_or_warm_up_model”) [WORKER]: File “/redacted/path/executor/uniproc_executor.py”, line 75, in collective_rpc [WORKER]: result = run_method(self.driver_worker, method, args, kwargs) [WORKER]: File “/redacted/path/serial_utils.py”, line 460, in run_me…
-
#28495 2.9.1 PyTorch release update — rocm,ci/build,nvidia,ready-run-all-tests — by atalman (合并于: 2025-12-18 04:20 (UTC+8)) [💬7 | +21/-21, 10 files | commented:8 approved:2] Update to 2.9.1 PyTorch release and triton 3.5.1
CI Test: https://buildkite.com/vllm/ci/builds/42137
-
#30844 [docs]: add ecosystem projects sr in docs/governance — documentation,ready — by Xunzhuo (合并于: 2025-12-18 02:45 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
This PR added ecosystem projects sr in docs/governance
Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The test plan, such as providing test command. ... -
#30563 [Attention] Update tests to remove deprecated env vars — rocm,speculative-decoding,ready,ci/build,v1,multi-modality,kv-connector,nvidia — by MatthewBonanni (合并于: 2025-12-18 01:49 (UTC+8)) [💬7 | +582/-449, 34 files | commented:7 approved:1]
## Purpose Attention-related environment variables have been deprecated by #26315. This PR updates the tests to remove all usage of these environment variables.
## Test Plan CI (use `ready-run-all-tests`) ## Test Result
…
- #30878 [CI][Bugfix] Fix flaky `tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio` — ready — by NickLucche (合并于: 2025-12-18 01:49 (UTC+8)) [💬3 | +3/-1, 1 files | commented:1 approved:1] As per slack discussion @DarkLight1337 -
#30747 [Fix] uniform decode batch check — bug,ready,v1 — by Jialin (合并于: 2025-12-17 19:58 (UTC+8)) [💬2 | +121/-8, 2 files | commented:1 changes:1 approved:1] ## Purpose Correctly tag batch with the right uniform_decode value. Before #28579, unpadded scheduled tokens are used to check uniform_decode.
In this PR, we simply revert it back to the original behaviour and move the logic to a dedicated function so we could add unit test easily.
Before
Current <img width=”598” height=”179” alt…
-
#30868 [Chore] Factor out logic for requesting initial memory — ready,v1 — by DarkLight1337 (合并于: 2025-12-17 23:32 (UTC+8)) [💬3 | +56/-21, 3 files | commented:4 approved:1] ## Purpose
I saw that this logic is duplicated by vllm-omni:
- https://github.com/vllm-project/vllm-omni/blob/acf66ee2cc3799f22e01ec5eeddfe86f39a14388/vllm_omni/worker/gpu_ar_worker.py#L61
- https://github.com/vllm-project/vllm-omni/blob/acf66ee2cc3799f22e01ec5eeddfe86f39a14388/vllm_omni/worker/gpu_generation_worker.py#L7
Let’s factor this out into a separate function. cc @hsliuustc0106 @tzhouam
I have also improved the error message to show exactly which device has insufficient memory (more…
- #30827 [Model] Gemma3: Support untied word embeddings — ready — by www-spam (合并于: 2025-12-17 23:11 (UTC+8)) [💬2 | +15/-4, 1 files | commented:3 approved:1]
## Summary
- Add support for `tie_word_embeddings=False` in Gemma3ForCausalLM
- This enables loading merged LoRA models where embeddings are untied after merging
## Changes
- Add `ColumnParallelLinear` import
- Remove assertion that enforced `tie_word_embeddings=True`
- Create separate `lm_head` layer when embeddings are untied
- Update `compute_logits` to use `lm_head` when available
- Use `getattr` for safe config access in `load_weights`…
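The tied-vs-untied distinction can be illustrated with a toy model; the class and fields below are illustrative stand-ins, not Gemma3's actual implementation:

```python
# Toy illustration of the tie_word_embeddings logic described in #30827.
# ToyModel and its fields are hypothetical; real code uses torch modules.
class ToyModel:
    def __init__(self, tie_word_embeddings: bool):
        self.embed_tokens = [[0.1, 0.2], [0.3, 0.4]]  # (vocab, hidden)
        if tie_word_embeddings:
            # tied: logits are computed against the embedding matrix itself
            self.lm_head = self.embed_tokens
        else:
            # untied: a separate lm_head is created and loaded independently,
            # enabling merged-LoRA checkpoints whose embeddings diverged
            self.lm_head = [[0.0, 0.0], [0.0, 0.0]]

tied = ToyModel(tie_word_embeddings=True)
untied = ToyModel(tie_word_embeddings=False)
assert tied.lm_head is tied.embed_tokens
assert untied.lm_head is not untied.embed_tokens
```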
-
#30852 Adapt the old parameter enable_thinking in chat_template_kwargs — ready,deepseek — by SongDI911 (合并于: 2025-12-17 23:10 (UTC+8)) [💬5 | +4/-0, 2 files | commented:2 approved:1] ## Purpose Adapt the old parameter enable_thinking in chat_template_kwargs ## Test Plan
`vllm serve DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32 --tool-call-parser deepseek_v32 --enable-auto-tool-choice --reasoning-parser deepseek_v3` ```python import openai …
-
#30748 [Docs] fix function name — documentation,ready — by lengrongfu (合并于: 2025-12-17 20:14 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:1] ## Purpose
Currently this function name is `get_kv_cache_spec`, so it needs to be fixed. ## Test Plan
## Test Result
... -
#30688 chores: adjust the attn register param order — ready — by ILikeIneine (合并于: 2025-12-17 19:58 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Purpose This is a follow-up to #26487.
It accidentally changed the order of the attention registry’s default parameters, which caused failures in normal registration and crashed the whole workflow of our plugin tests (vllm-metax).
To avoid adding `class_path =` everywhere in the registry code in non-mamba cases (and I think that’s the most common situation), the order of the default parameters needs to be swapped. ## Test Plan By plugin ## T…
- #30858 Fix lazy import — structured-output,ready,v1 — by hmellor (合并于: 2025-12-17 19:33 (UTC+8)) [💬1 | +6/-4, 1 files | commented:1] Will hopefully fix timeout in V1 entrypoints test
-
#30841 [Bugfix] deepseek-V3.2 self.weights_proj has no bias — ready,deepseek — by baoqian426 (合并于: 2025-12-17 19:32 (UTC+8)) [💬6 | +5/-1, 1 files | commented:1 approved:1]
## Purpose `self.weights_proj` has no bias; on some other hardware the bias may not be initialized to 0, which may be incorrect.
H20: bias initialized to 0
kunlun: bias not initialized to 0
…
- #30862 [ci] Sync test areas yaml file with test-pipeline — ci/build — by khluu (合并于: 2025-12-17 18:30 (UTC+8)) [💬1 | +10/-21, 5 files | approved:1 commented:1] Sync up to https://github.com/vllm-project/vllm/commit/10ee1c64cfa7c0b7f68e9ee793435c9cafbf821a
-
#30749 [Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory — frontend,ready,ci/build — by chaunceyjiang (合并于: 2025-12-17 18:27 (UTC+8)) [💬6 | +259/-151, 19 files | commented:9] ## Purpose follow up https://github.com/vllm-project/vllm/pull/28040
## Test Plan ``` # export VLLM_SERVER_DEV_MODE=1 # vllm serve /home/jovyan/qwen3-8b --enable-auto-tool-choice --reasoning-parser qwen3 --tool-call-parser hermes --max-model-len 1024 --gpu-memory-utilization 0.5 …. (APIServer pid=84002) INFO 12-16 06:49:33 [api_server.py:1333] Starting vLLM API server 0 on http://0.0.0.0:8000 (APIServer pid=84002) INFO 12-16 06:49:33 [launcher.py:38] Available routes are: …
-
#29569 [NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by Somoku (合并于: 2025-12-17 17:52 (UTC+8)) [💬9 | +10/-0, 1 files | commented:8 approved:2] ## Purpose
During the implementation of my NIXL-based weight transfer engine for RL, I found that the NIXL memory registration process will fail if I turn on the `enable_sleep_mode` option. Specifically, it reports `ibv_reg_mr() failed: Bad address` inside UCX. As mentioned in kvcache-ai/Mooncake#351, the root cause is that the CuMemAllocator for model weights and KVCache provides virtual add… -
#30823 [Bug] Fix AttributeError: ‘ColumnParallelLinear’ object has no attribute `weight_scale_inv` — ready — by yewentao256 (合并于: 2025-12-17 18:00 (UTC+8)) [💬1 | +5/-2, 1 files | commented:1 approved:1] ## Purpose Fix `AttributeError: 'ColumnParallelLinear' object has no attribute 'weight_scale_inv'`
`export MODEL="RedHatAI/Kimi-K2-Thinking-FP8-Block"`
`vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --enforce_eager --trust_remote_code --gpu_memory_utilization 0.94 --max_model_len 4096`
```bash (Worker_TP0_EP0 pid=1089072) INFO 12-16 23:53:39 [default_loader.py:308] Loading weights took 36.23 seconds …
-
#30743 [compile] Recompile graph module during Dynamo cache loading. — ready — by zhxchen17 (合并于: 2025-12-17 18:00 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:2] Recompile the graph module after loading state.
TODO: If https://github.com/vllm-project/vllm/pull/30728 is landed, address that as well. ## Purpose Fixing the torch 2.10 release-blocking test described in https://github.com/vllm-project/vllm/pull/30728.
## Test Plan
…
-
#30785 [Fix] Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) — ready — by danielafrimi (合并于: 2025-12-17 17:56 (UTC+8)) [💬2 | +83/-1, 2 files | commented:1 approved:1] The HF checkpoint’s hf_quant_config.json (modelopt) file specifies “kv_cache_quant_algo”: “FP8”, but vLLM loads the unquantized version of the kv-cache.
Setting the vLLM server argument explicitly (`--kv-cache-dtype fp8`) makes it work.
This PR fetches kv_cache_quant_algo from the config itself without specifying it as an arg (when hf_quant_config.json is present).
The key fix was doing this conversion early when creating `CacheConfig` in `arg_utils.py`, so the resolved value is seen everyw… -
#30809 [compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors — ready — by zhxchen17 (merged: 2025-12-17 17:56 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1] Signed-off-by: zhxchen17 zhxchen17@fb.com
## Purpose Fixes https://github.com/pytorch/pytorch/issues/170549
## Test Plan pytest tests/compile/test_aot_compile.py::test_shape_env
## Test Result …
-
#30810 [compile] Disable aot when eager backend is used. — ready — by zhxchen17 (merged: 2025-12-17 17:55 (UTC+8)) [💬2 | +9/-2, 1 files | commented:1 approved:1] Signed-off-by: zhxchen17 zhxchen17@fb.com
## Purpose Fixes https://github.com/pytorch/pytorch/issues/170492 and https://github.com/pytorch/pytorch/issues/170488
## Test Plan compile/fullgraph/test_basic_correctness.py
## Test Result …
-
#30816 [UX] Make `vllm bench serve` discover the model by default and use --input-len — performance,perf-benchmarks,ready — by mgoin (merged: 2025-12-17 17:55 (UTC+8)) [💬1 | +79/-13, 2 files | commented:1 approved:2] ## Purpose
Implemented changes to make `vllm bench serve` work without requiring `--model` or dataset-specific input/output length arguments.
```
vllm serve Qwen/Qwen-0.6B
# Before (required model and dataset-specific args): …
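The default model discovery described in this entry can work by asking the server which model it is serving via the OpenAI-compatible `/v1/models` endpoint; a minimal sketch of parsing that response (the helper is hypothetical, not the PR's implementation):

```python
def discover_served_model(models_payload: dict) -> str:
    """Pick the first model id from an OpenAI-compatible /v1/models
    response, as a benchmark client might when --model is omitted.
    Payload shape follows the OpenAI API; helper name is illustrative."""
    data = models_payload.get("data", [])
    if not data:
        raise ValueError("server reported no models; pass --model explicitly")
    return data[0]["id"]
```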
-
#30840 [Doc][ResponsesAPI] add documentation — documentation,frontend,ready — by qandrew (merged: 2025-12-17 17:53 (UTC+8)) [💬3 | +41/-4, 2 files | commented:1 approved:1] ## Purpose Add to https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#29575 CustomOp: grouped topk — ready — by xinyu-intel (merged: 2025-12-17 17:43 (UTC+8)) [💬8 | +75/-14, 4 files | commented:3 approved:1] ## Purpose
Plugins can register a grouped_topk op for deepseek_v2/3 workloads.
-
#30712 [Mamba] Removed disable cascade attn in MambaModelConfig — ready — by Josephasafg (merged: 2025-12-17 16:48 (UTC+8)) [💬1 | +0/-6, 1 files | commented:1 approved:1] ## Purpose Since this is already protected by
`MambaManager`, there is no reason to disable cascade attention from here. ## Test Plan ## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30555 [Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction — frontend,ready,tool-calling — by WangErXiao (merged: 2025-12-17 16:37 (UTC+8)) [💬7 | +137/-4, 2 files | commented:2 approved:1] Fix https://github.com/vllm-project/vllm/issues/30554.
Prevent IndexError in MiniMax M2 tool parser during streaming extraction. Add the args string to the streamed_args_for_tool field.
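A guarded index access of the kind this fix implies can be sketched as follows; the names are illustrative, not the MiniMax parser's real API:

```python
def safe_streamed_args(prev_args: list, tool_index: int) -> str:
    """During streaming extraction the tool index may run ahead of the
    accumulated per-tool args list; return an empty string for a
    not-yet-seen index instead of raising IndexError."""
    if 0 <= tool_index < len(prev_args):
        return prev_args[tool_index]
    return ""
```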
-
#30831 [XPU] fix broken fp8 online quantization for XPU platform — ready — by yma11 (merged: 2025-12-17 16:28 (UTC+8)) [💬2 | +35/-0, 1 files | commented:3 approved:1]
## Purpose XPU online FP8 quantization was broken by the recently introduced streaming quantization feature. Since streaming quantization has an accuracy issue (#30830), this is just a quick fix for the breakage without supporting that feature; it will be enabled once the accuracy issue is resolved.
## Test Plan ``` VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dt…
-
#30829 [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models — ready — by bigPYJ1151 (merged: 2025-12-17 15:25 (UTC+8)) [💬1 | +9/-0, 1 files | commented:1 approved:1]
## Purpose
#29873 changed the ROPE dispatch of VL models, causing some test cases to fail.
Fix it by using the torch native implementation temporarily.
## Test Plan
…
-
#30711 Update note comment for flashinfer attention warmup — ready — by mgoin (merged: 2025-12-17 13:29 (UTC+8)) [💬1 | +3/-4, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
...
- #30799 bump up compressed tensors version to 0.13.0 — quantization,ready,ci/build — by shanjiaz (merged: 2025-12-17 13:01 (UTC+8)) [💬2 | +1/-1, 1 files]
## Purpose
Bump up compressed-tensors version.
- quantization_config now includes scale_dtype and zp_dtype
- additional strategies have been added to support attention and kv_cache quantization, notably `QuantizationStrategy.ATTN_HEAD`. This is required for Eldar's kv cache expansion work in vLLM.
- All old CT models remain compatible with this version.
## Test Plan
## Test Result
…
-
#30787 [CI/Build] Fix compatibility between #30244 and #30396 — ready — by DarkLight1337 (merged: 2025-12-17 12:21 (UTC+8)) [💬3 | +3/-1, 1 files | approved:2 commented:2]
## Purpose
Fix failing Blackwell Fusion E2E tests: https://buildkite.com/vllm/ci/builds/43739/steps/canvas?jid=019b271b-324e-4335-b7ff-907af32e4217#019b271b-324e-4335-b7ff-907af32e4217
## Test Plan
## Test Result
…
-
#30678 [CPU] Add action to automatically label CPU related PRs — ready,ci/build — by fadara01 (merged: 2025-12-17 12:21 (UTC+8)) [💬3 | +14/-0, 1 files | commented:10] ## Purpose
Add an action to automatically label CPU-related PRs. This helps us stay on top of changes in the CPU backend, and simulates code ownership for people working on the CPU backend who are not maintainers.
## Test Plan
N/A
## Test Result …
[Closed unmerged PRs]
- #21558 [Feat][Scheduler] Implement shortest prefill first scheduling — needs-rebase,stale,v1 — by KuntaiDu (closed: 2025-12-18 10:26 (UTC+8)) [💬11 | +193/-4, 3 files | commented:1]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose
This PR provides initial implementation for #20962 t…
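A minimal sketch of the shortest-prefill-first policy this PR proposes, assuming the scheduler is free to reorder its waiting queue (class and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    num_prompt_tokens: int

def shortest_prefill_first(waiting: list) -> list:
    """Order the waiting queue by prompt length so short prefills run
    first, reducing head-of-line blocking behind long prompts. Python's
    stable sort keeps FCFS order among equal-length requests."""
    return sorted(waiting, key=lambda r: r.num_prompt_tokens)
```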
-
#21952 [Bugfix] Ignore the stop str while in reasoning stage — stale,v1 — by hustxiayang (closed: 2025-12-18 10:25 (UTC+8)) [💬5 | +118/-27, 10 files | commented:4] When evaluating reasoning models, we found that generation would stop on stop strings during the reasoning stage, which is unexpected.
Thus, this PR ignores stop strings while in the reasoning stage.
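The intended behavior can be sketched as a stop check that is suppressed while the reasoning stage is active (illustrative helper, not vLLM's actual stop-checking code):

```python
def check_stop(text: str, stop: list, in_reasoning: bool):
    """Return the matched stop string, or None. Stop strings are only
    honored once the reasoning stage is over, so reasoning sections are
    not truncated early."""
    if in_reasoning:
        return None
    for s in stop:
        if s in text:
            return s
    return None
```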
-
#23092 Propagate exception message to EngineGeneratorError — stale,v1 — by fgebhart (closed: 2025-12-18 10:25 (UTC+8)) [💬5 | +2/-2, 1 files | commented:1]
## Purpose
Proposes propagating the exception message to EngineGeneratorError to ease debugging for vLLM users.
## Test Plan
## Test Result
…
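A minimal sketch of the proposed propagation, with a stand-in exception class:

```python
class EngineGeneratorError(Exception):
    """Stand-in for vLLM's EngineGeneratorError, for illustration only."""

def wrap_engine_error(exc: Exception) -> EngineGeneratorError:
    """Carry the original exception's message and chain the cause, so
    users see the root failure instead of a bare error."""
    err = EngineGeneratorError(f"Engine generator failed: {exc}")
    err.__cause__ = exc
    return err
```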
-
#30625 fix: prevent reasoning output when enable_thinking is false — frontend,needs-rebase — by llsj14 (closed: 2025-12-18 09:26 (UTC+8)) [💬7 | +73/-62, 1 files | commented:4]
## Purpose
- When `enable_thinking: false` is set, `<think></think>` tokens are injected into the prompt to disable reasoning. However, the current implementation doesn't detect this and incorrectly outputs model responses as "reasoning" instead of "content".
- This PR fixes the issue by checking for reasoning end tokens in the prompt before calling reasoning extraction, ensuring all output is returned as "content" when reasoning is disabled.
## Solution
- Added …
-
#30850 kvcache dump analysis — v1 — by nitingupta910 (closed: 2025-12-18 05:01 (UTC+8)) [💬3 | +2169/-0, 14 files | commented:1] -
#30576 [ROCm][CI] Add retry logic and xfail handling for flaky ROCm test in test_async_scheduling — documentation,performance,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality,tool-calling — by AndreasKaratzas (closed: 2025-12-18 04:27 (UTC+8)) [💬10 | +1/-0, 1 files | commented:1 approved:1] Adds flaky test handling for `test_without_spec_decoding`, which intermittently fails on ROCm due to floating-point non-determinism in logprob comparisons.
## Changes
- Added `@pytest.mark.flaky(reruns=3, reruns_delay=2)` to retry on failure
- Added conditional `@pytest.mark.xfail` for the ROCm platform to prevent CI blockage after retries are exhausted
## Root Cause
Different async scheduling and executor configurations produce slightly different floating-point reduction orderings on ROCm, causing l…
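The rerun behavior that `@pytest.mark.flaky` (provided by the pytest-rerunfailures plugin) gives a test can be illustrated with a minimal stand-alone retry decorator:

```python
import functools

def flaky(reruns: int = 3):
    """Minimal stand-in for what @pytest.mark.flaky(reruns=3) does:
    re-run a failing test a few times before reporting failure."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(reruns + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError:
                    # Last allowed attempt failed: surface the failure.
                    if attempt == reruns:
                        raise
        return wrapper
    return deco
```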
-
#30154 Integration for Ray LLM with load_format=runai_streamer — ready — by jiangwu300 (closed: 2025-12-18 02:18 (UTC+8)) [💬12 | +3/-1, 1 files | commented:2 approved:2] ## Purpose This PR ensures that when a model lives in S3 (or another cloud storage location) and runai_streamer is turned on, we do not overwrite self.model with a local directory, which leads to engine initialization failure since the local dir will not contain the safetensors.
## Test Plan Run a simple Ray job with Ray Data LLM, passing in an S3 model path with runai_streamer turned on.
## Test Result Runs as expected, no more failure.
…
-
#25198 Ignore invalid READY EngineCoreRequestType — v1 — by minosvasilias (closed: 2025-12-18 01:31 (UTC+8)) [💬4 | +15/-2, 1 files | commented:1]
## Purpose When utilizing `data_parallel_size=2` in an `AsyncLLMEngine`, vLLM failed to start with the following error: ``` (EngineCore_1 pid=309) Exception in thread Thread-3 (process_input_sockets): (EngineCore_1 pid=309) Traceback (most recent call last): (EngineCore_1 pid=309) File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner (EngineCore_1 pid=309) self.run() (EngineCore_1 pid=309) File "/usr/lib/python3.10/threading.py", line 9… -
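The fix amounts to tolerating unknown frames on the engine-core input socket instead of crashing the input thread; a sketch with an illustrative enum subset (not vLLM's actual request-type values):

```python
from enum import Enum

class EngineCoreRequestType(bytes, Enum):
    """Illustrative subset of request-type frames on the input socket."""
    ADD = b"\x00"
    ABORT = b"\x01"

def decode_request_type(frame: bytes):
    """Treat an unknown frame (e.g. a stray READY handshake byte) as
    ignorable rather than letting ValueError kill the socket thread."""
    try:
        return EngineCoreRequestType(frame)
    except ValueError:
        return None  # caller logs and skips the frame
```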
#24018 Allow loading of cpatonn/InternVL3_5-14B-AWQ-4bit — needs-rebase,unstale — by sjuxax (closed: 2025-12-18 01:11 (UTC+8)) [💬4 | +2/-2, 1 files | commented:3 approved:1] ## Purpose
Allow loading of https://huggingface.co/cpatonn/InternVL3_5-14B-AWQ-4bit. Without this patch, we get a Pydantic validation error like this:
```text (APIServer pid=2029616) Traceback (most recent call last): (APIServer pid=2029616) File "
", line 198, in _run_module_as_main (APIServer pid=2029616) File " ", line 88, in _run_code (APIServer pid=2029616) File "/home/jeff/local_clones/github.com/vllm-project/vllm/vllm/entrypoints/openai/api_server...
- #30715 [Do not merge][Test] Revert "[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2" — ready,v1,deepseek,gpt-oss,nvidia — by LucasWilkinson (closed: 2025-12-18 00:02 (UTC+8)) [💬2 | +256/-1372, 30 files | commented:1] Reverts vllm-project/vllm#27532
-
#27506 [Multimodal] Move profiling info out of processing info — multi-modality,llama,qwen,deepseek — by DarkLight1337 (closed: 2025-12-17 23:40 (UTC+8)) [💬4 | +1208/-1143, 40 files | commented:10 | 📝draft] ## Purpose
- Separate the code that is only needed for profiling / dummy data from the main processing code.
- Prepare to remove `MultiModalRegistry` by assigning the factory classes to the model class directly
TODOs
- Update the rest of the models
- Update the docs
…
- #29629 vLLM with MPS Support — documentation,frontend,ci/build,v1,llama,cpu — by ericcurtin (closed: 2025-12-17 23:08 (UTC+8)) [💬27 | +1283/-13, 12 files | commented:10] For macOS/Apple/Metal
-
#30162 [Deepseek] Fix OOM during DeepSeek R1 startup — v1,deepseek — by MatthewBonanni (closed: 2025-12-17 23:03 (UTC+8)) [💬2 | +28/-12, 3 files | commented:5]
## Purpose Starting up DeepSeek R1 DP8/EP on 8xH200 currently OOMs at the default `--gpu-memory-utilization` (0.9). This PR prevents an unnecessary 4 GiB allocation during the post-CG-capture dummy run, allowing startup without having to reduce `--gpu-memory-utilization`. It still OOMs when prompted, though, so while this improves the situation and reflects the original intent of the code, it doesn't solve the problem. ## Test Plan ``` vllm serve deepseek-a…
-
#30725 light version of prefix caching for hybrid models gdn attention — needs-rebase,v1,qwen — by joennlae (closed: 2025-12-17 21:10 (UTC+8)) [💬3 | +496/-36, 14 files | commented:1 | 📝draft] Copied and rebased from https://github.com/vllm-project/vllm/pull/28176
Thanks to @peakcrosser7 and @minminsun
-
#26540 [CI] Fix mypy for
`vllm/engine` and `vllm/utils` — ready,needs-rebase,v1,multi-modality — by wwl2755 (closed: 2025-12-17 20:25 (UTC+8)) [💬15 | +115/-70, 14 files | commented:7 changes:1 approved:2] Part of https://github.com/vllm-project/vllm/issues/26533. Fix mypy pre-commit issues under `vllm/engine` and `vllm/utils` and move them to `FILES`.
CC: @hmellor @yewentao256
Original: ``` Run mypy for local Python installation...Failed
- hook id: mypy-local …
-
#30838 Idea for git bisect automation with Buildkite — ci/build — by evandekieft (closed: 2025-12-17 12:44 (UTC+8)) [💬2 | +1061/-0, 5 files | commented:1] Implement automated git bisect using Buildkite CI to efficiently find the first bad commit. The system consists of two pipelines:
- Driver pipeline: Orchestrates the bisect process by querying failed jobs and triggering validation builds at each bisect step
- Validator pipeline: Dynamically runs only the specific failing test jobs to speed up bisection
Key features:
- Query failed jobs from a build and run only those during bisection
- Can be run locally or on CI
- Supports specifying build number or …
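The driver pipeline's core logic is an ordinary binary search over the commit range; a sketch where `is_bad` stands in for triggering a Buildkite validation build (names are illustrative, not the PR's code):

```python
def first_bad_commit(commits: list, is_bad) -> str:
    """Binary-search an ordered commit list (oldest first, last commit
    known bad) for the first commit where validation fails, mirroring
    what `git bisect` does. `is_bad` stands in for running the failing
    test jobs against that commit."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[hi]
```

Each `is_bad` probe costs one CI build, so n commits need about log2(n) validation builds.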
-
#30452 [Core] Optimize encoder cache with mask-based storage — documentation,performance,structured-output,frontend,ready,v1,multi-modality,tool-calling,llama,qwen — by sunYtokki (closed: 2025-12-17 12:02 (UTC+8)) [💬9 | +547/-70, 13 files | commented:8 approved:1 changes:1] ## Purpose Optimizes encoder cache memory usage by storing only actual embeddings instead of full placeholder tensors, using `is_embed` masks. This addresses the memory inefficiency identified in https://github.com/vllm-project/vllm/issues/25903
Key improvements:
- Stores only embedding positions marked in the `is_embed` mask (not full scattered tensors)
- Adds `PlaceholderRange.get_num_embeds()` and `extract_embeds_range()` methods
- Simplifies encoder output gathering logic in `gpu_model_runne…
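The mask-based storage idea can be sketched in pure Python (a stand-in for the tensor indexing the PR performs; function names are illustrative):

```python
def compact_embeddings(outputs: list, is_embed: list):
    """Keep only the rows the is_embed mask marks as real embeddings,
    plus the mask itself, instead of caching the full placeholder-padded
    sequence."""
    kept = [row for row, keep in zip(outputs, is_embed) if keep]
    return kept, is_embed

def scatter_embeddings(kept: list, is_embed: list, fill=None):
    """Reconstruct the full-length sequence on demand by scattering the
    stored rows back into their masked positions."""
    it = iter(kept)
    return [next(it) if keep else fill for keep in is_embed]
```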
- #18409 opmtize — needs-rebase,unstale — by momo609 (closed: 2025-12-17 11:05 (UTC+8)) [💬5 | changes:1]