[vLLM GitHub Development Digest] 2025-12-18
[Overview]
- Time window: 2025-12-18 10:48 (UTC+8) – 2025-12-19 10:48 (UTC+8)
- New issues: 30 (label distribution: bug: 16, feature request: 4, ci-failure: 4, RFC: 2, installation: 1)
- Closed issues: 36
- New PRs: 51 (label distribution: v1: 17, documentation: 13, ready: 10, frontend: 7, nvidia: 7)
- Merged PRs: 50
- PRs closed without merging: 10
[New issues]
- #30999 [Feature] GLM45 tool parser: Stream tool name before full arguments — no label — by so2liu (created: 2025-12-19 10:35 (UTC+8)) ## Feature Request
### Problem Description
When using GLM-4.5 with tool calling (`--tool-call-parser glm45`), the current parser waits for the complete tool call structure before returning anything to the client. For long-running tool calls (e.g., generating articles with a `write_article` tool), this creates a poor user experience:
- Users see nothing during generation
- Users don't know what the model is doing
- No way to provide early feedback about which tool was selected …
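The requested behavior can be pictured with a minimal sketch (illustrative only, not vLLM's parser code; `stream_tool_call` and its input format are invented for the example): emit an OpenAI-style delta carrying the tool name as soon as it is parsed, then stream argument fragments as they arrive.

```python
# Illustrative sketch: stream the tool name first, then argument fragments.
import json

def stream_tool_call(chunks):
    """Yield OpenAI-style tool-call deltas from an incremental parser.

    `chunks` is assumed to be an iterator of (kind, value) pairs:
    ("name", str) once, followed by ("args", str) fragments.
    """
    for kind, value in chunks:
        if kind == "name":
            # First delta carries only the function name -> early feedback.
            yield {"function": {"name": value, "arguments": ""}}
        else:
            yield {"function": {"arguments": value}}

deltas = list(stream_tool_call([
    ("name", "write_article"),
    ("args", '{"topic": '),
    ("args", '"vLLM"}'),
]))
assert deltas[0]["function"]["name"] == "write_article"
assert json.loads("".join(d["function"]["arguments"] for d in deltas)) == {"topic": "vLLM"}
```

The client can display the tool name from the first delta while arguments are still being generated, which is exactly the early feedback the issue asks for.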
- #30996 [Bug]: vLLM 0.12.0 + LMCache outputs ERROR: PrometheusLogger instance already created with different metadata. — bug — by keivenchang (created: 2025-12-19 09:59 (UTC+8)) ### Your current environment
`System Info: OS : Ubuntu 24.04.2 LTS (x86_64); GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0; Clang version : 18.1.3 (1ubuntu1); CMake version : version 4.2.0` …
- #30956 [Feature]: could output the given format logger ? — feature request — by ucas010 (created: 2025-12-18 17:35 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
Hi, dear, I have defined the logger in a py script, e.g. logger_utils.py; could I run the command from the shell with that logger, such as `vllm serve qwen3-embedding-0.6b --logger_file logger_utils.py`? thx …
- #30995 [Bug]: Fused MoE errors without safe serialization — bug — by ojh31 (created: 2025-12-19 09:35 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... uv is set ... System Info ...` (truncated)
- #30959 [Installation]: AssertionError: CUDA_HOME is not set — installation — by shahasim (created: 2025-12-18 18:54 (UTC+8)) [💬1] ### Your current environment
Hi,
I am trying to build a Docker image for `vllm==0.12.0` and I am unable to build because the setup fails with `AssertionError: CUDA_HOME is not set`; there is a numpy error in the logs as well. I tried the chatbot and downgraded numpy, but it does not work. The same Dockerfile works if I install `v0.11.2`. Could I get some help with this issue? `STEP 14/19: RUN pip install "numpy<2.0.0" Collecting numpy<2.0.0 Downloading https://www.artifactreposit…`
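For context on this failure mode, here is a minimal sketch of how a build step typically resolves `CUDA_HOME`, plus the usual workaround of exporting it explicitly. This is not vLLM's actual build code; `find_cuda_home` is a hypothetical helper, and `/usr/local/cuda` is the conventional toolkit path assumed for the example.

```python
# Illustrative sketch: resolve CUDA_HOME from the environment, else infer it
# from the location of `nvcc` on PATH; the usual Dockerfile fix is to set
# ENV CUDA_HOME=/usr/local/cuda explicitly.
import os
import shutil

def find_cuda_home():
    """Return CUDA_HOME from the environment, else infer it from nvcc's path."""
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home:
        return cuda_home
    nvcc = shutil.which("nvcc")
    if nvcc:
        # e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda
        return os.path.dirname(os.path.dirname(nvcc))
    return None  # build aborts here with "CUDA_HOME is not set"

os.environ["CUDA_HOME"] = "/usr/local/cuda"  # the explicit-export workaround
assert find_cuda_home() == "/usr/local/cuda"
```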
- #30985 [RFC]: DRY Dependencies for Better Hardware-Specific vLLM Dependency Management — RFC — by wjhrdy (created: 2025-12-19 04:06 (UTC+8)) [💬2] ### Motivation.
vLLM currently maintains 16 separate `requirements/*.txt` files (cuda.txt, rocm.txt, tpu.txt, xpu.txt, cpu.txt, plus build/test variants) with no centralized specification for common dependencies. This creates three critical issues:
#### 1. No Single Source of Truth Creates CI Fragility
- Each hardware target has a divergent torch version:
  - CUDA: `torch==2.9.0`
  - XPU: `torch==2.9.0+xpu` with custom index URLs
  - TPU: completely different dependency …
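The "single source of truth" idea in this RFC can be sketched in a few lines: derive the dependency set shared by all hardware targets so only the deltas live in per-target files. The file names mirror vLLM's `requirements/` layout, but the pinned contents here are made up for the example.

```python
# Illustrative sketch: compute the common dependency set across per-hardware
# requirement lists, leaving only target-specific pins as overrides.
reqs = {
    "cuda.txt": {"torch==2.9.0", "numpy", "transformers"},
    "rocm.txt": {"torch==2.9.0+rocm", "numpy", "transformers"},
    "xpu.txt":  {"torch==2.9.0+xpu", "numpy", "transformers"},
}

common = set.intersection(*reqs.values())                 # -> shared file
deltas = {f: pins - common for f, pins in reqs.items()}   # -> per-target files

assert common == {"numpy", "transformers"}
assert deltas["cuda.txt"] == {"torch==2.9.0"}
```

With this split, bumping a shared dependency touches one file instead of sixteen, which is the CI-fragility point the RFC makes.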
- #30929 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:32 (UTC+8)) [💬2] ### Name of failing test
`bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30926 [CI Failure]: mi325_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:11 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s v1/entrypoints`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30928 [CI Failure]: mi325_1: LM Eval Small Models (1 Card) — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:28 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30927 [CI Failure]: mi325_1: LM Eval Small Models — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:20 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30989 [Bug]: CUTLASS BLOCK SCALE FP8 IMA on B200 — bug — by robertgshaw2-redhat (created: 2025-12-19 05:49 (UTC+8)) ### Your current environment
B200
### 🐛 Describe the bug
Currently, CUTLASS BLOCK SCALE FP8 is broken on main. There are two problems.
- A) it is impossible to use cutlass block scale fp8
…
- #30933 [Usage]: What is the latest instruction to run DeepSeek V3.2? — usage — by IKACE (created: 2025-12-18 14:18 (UTC+8)) [💬1] ### Your current environment
vLLM 0.12.0
### How would you like to use vllm
I am following the guidelines at https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html for running DeepSeek V3.2. Following the instructions, I installed vLLM 0.12.0 on my H200 node. However, when I try to run it with `vllm serve deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32`, it gives an error: `(APIServer pid=816209) ValueError: No tokenizer regis…`
- #30970 [Bug]: MPClient sends queue message but EngineCoreProc does not receive, causing timeout. — bug — by YuYun329 (created: 2025-12-18 22:54 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30969 [Bug]: SmolLM3-3B FP8 models fail to load in v0.11.1 with "Unable to find matching target" error in compressed-tensors config — bug — by GauthierRoy (created: 2025-12-18 22:36 (UTC+8)) ### Your current environment
Running in the official Docker image vllm/vllm-openai:v0.11.1; GPU: NVIDIA L4 (GCP g2-standard-8); `| NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.9 |`; vLLM version: 0.11.1 (truncated `python collect_env.py` output)
- #30964 [Bug]: semaphore leak for Qwen3-Next-80B-A3B-Instruct on long data inference — bug — by AIR-hl (created: 2025-12-18 20:50 (UTC+8)) ### Your current environment
`System Info: OS : Ubuntu 24.04.2 LTS (x86_64); GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0; Clang version : could not collect; CMake version : version 4.2.0` …
- #30939 [Bug]: The same input produces outputs of different lengths in the same batch. — bug — by sorenwu (created: 2025-12-18 15:29 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30963 [Bug]: AssertionError: Failed to apply prompt replacement for mm_items['image'][0] — bug — by liaomingg (created: 2025-12-18 20:45 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30958 [Bug]: Vllm Server stuck and automatically shutdown. — bug — by tzjtatata (created: 2025-12-18 18:34 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30934 [Bug]: Qwen3-VL-235B-A22B-Thinking-FP8 not divisible by weight quantization block_n = 128. — bug — by serser (created: 2025-12-18 14:39 (UTC+8)) [💬2] ### Your current environment
I am trying to run the FP8 version on 8*A100 GPUs but I encountered `ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.` `Collecting environment information...` (truncated)
- #30942 [Bug]: Scale Elastic EP is Pending on EngineCore — bug — by xeonliu (created: 2025-12-18 15:41 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30943 [Bug]: memory leak (nvidia v100, mineru, dp*8) — bug — by ChrisKimZHT (created: 2025-12-18 15:44 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30941 [Performance]: Why Does Latency Remain Unchanged in vLLM 0.11.0 When Input Token Count Decreases for qwen3-vl-30b-a3b? — performance — by Hormoney (created: 2025-12-18 15:40 (UTC+8)) ### Proposal to improve performance
No response
### Report of performance regression
No response
### Misc discussion on performance
…
- #30940 [Feature]: The issue of being unable to achieve streaming output in function call inference — feature request — by cluo991 (created: 2025-12-18 15:35 (UTC+8)) ### 🚀 The feature, motivation and pitch
The issue of being unable to achieve streaming output in function call inference
### Alternatives
No response
### Additional context
…
- #30938 [Feature]: Support NVIDIA ModelOpt FP8 PTQ variants (FP8_PER_CHANNEL_PER_TOKEN / FP8_PB_WO) in vLLM — feature request — by CedricHwong (created: 2025-12-18 15:11 (UTC+8)) ### 🚀 The feature, motivation and pitch
### Current behavior
vLLM's built-in modelopt quantization support only recognizes a limited set of ModelOpt checkpoint formats (e.g., quant_algo: "FP8" and NVFP4). When loading newer ModelOpt FP8 PTQ exports such as:
- FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale + per-token dynamic activation quantization)
- fp8_pb_wo / FP8_PB_WO (block-scaled FP8 weight-only with 4D block scales)
vLLM fails early during config parsing / quantization...
- #30931 [Bug]: Prefix Cache Corruption with LoRA with the same name but different id — bug — by Butanium (created: 2025-12-18 14:01 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... uv is set ... System Info ...` (truncated)
- #30930 [Draft] [RFC]: vLLM + UCC Backend — RFC — by lengrongfu (created: 2025-12-18 13:45 (UTC+8)) ### Motivation.
Integrating UCC (Unified Collective Communication, https://github.com/openucx/ucc) into vLLM as an optional communication backend.
### Strengthening vLLM's Backend Abstraction
- UCC supports a wide range of hardware backends,
- It decouples vLLM's distributed engine from any single communication API.
- It allows developers to register new communication backends through UCC's standardized API.
- This modularity supports future frameworks, accelerators, or transport technologies w…
- #30922 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (created: 2025-12-18 12:08 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30923 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (created: 2025-12-18 12:14 (UTC+8)) ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30921 [Feature]: Models with different quantization for different layers/blocks — feature request — by Datta0 (created: 2025-12-18 11:37 (UTC+8)) ### 🚀 The feature, motivation and pitch
We at Unsloth have dynamically quantized bnb 4-bit models, which we do to preserve accuracy, keeping the important layers in mind: we leave them in 16-bit while quantizing the rest to 4-bit.
For example, if you see unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit, all of the vision layers are left unquantized, while having only a [few language la…
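The per-layer scheme described above can be sketched as a pattern-to-scheme map (the keys, the `"bnb-4bit"` label, and the `scheme_for` helper are hypothetical, not vLLM's or Unsloth's config schema): sensitive modules map to no quantization, everything else to 4-bit.

```python
# Illustrative sketch: route each layer name to a quantization scheme via
# glob patterns; None means "leave in 16-bit".
import fnmatch

layer_quant = {
    "visual.*": None,              # vision tower stays unquantized
    "model.layers.*": "bnb-4bit",  # language layers quantized to 4-bit
}

def scheme_for(layer_name):
    """Return the first matching scheme, or None (unquantized) by default."""
    for pattern, scheme in layer_quant.items():
        if fnmatch.fnmatch(layer_name, pattern):
            return scheme
    return None

assert scheme_for("visual.blocks.0.attn") is None
assert scheme_for("model.layers.10.mlp") == "bnb-4bit"
```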
- #30919 [Bug]: GPT-OSS-120B NotImplementedError: FlashInfer backend currently does not support attention sinks — bug — by shyeh25 (created: 2025-12-18 11:16 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
[Closed issues]
- #3583 Supporting RWKV models — feature request, stale — by RonanKMcGovern (closed: 2025-12-19 10:14 (UTC+8)) [💬13] ### 🚀 The feature, motivation and pitch
Linear attention allows for longer context and faster inference. The eagle model has a 2T checkpoint soon.
### Alternatives
NA
### Additional context
…
- #11655 [Feature]: Support Inflight quantization: load as 8bit quantization. — feature request, stale — by ShelterWFF (closed: 2025-12-19 10:14 (UTC+8)) [💬24] ### 🚀 The feature, motivation and pitch
vLLM supports 4-bit in-flight quantization but does not support 8-bit; 8-bit is faster than 4-bit, so please add support for it.
### Alternatives
No response
### Additional context …
- #13047 [Bug]: undefined symbol: `_ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100 — bug, stale — by imangohari1 (closed: 2025-12-19 10:14 (UTC+8)) [💬15] ### Your current environment
The output of `python collect_env.py`: `INFO 02-10 17:07:03 __init__.py:190] Automatically detected platform cuda. Collecting environment information... PyTorch version: 2.5.1+cu124 Is debug build: False ...` (truncated)
- #14193 [Bug]: when I use disaggregated_prefill, if I don't input anything, KV receiving thread will report time out — bug, stale — by mengli0 (closed: 2025-12-19 10:14 (UTC+8)) [💬9] ### Your current environment
PyTorch version: 2.5.1+cu124; Is debug build: False; CUDA used to build PyTorch: 12.4; ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64); GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0; Clang version: could not collect …
- #16467 [Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist — bug, stale — by ItzAmirreza (closed: 2025-12-19 10:14 (UTC+8)) [💬8] ### Your current environment
The output of `python collect_env.py`: `root@c2069cb9db81:/vllm-workspace# python3 collect_env.py INFO 04-11 01:50:21 [__init__.py:239] Automatically detected platform cuda. Collecting environment information... PyTorch version: 2.6.0+cu124 ...` (truncated)
- #17618 [Bug]: Engine Core initialization failed. See root cause above — bug, stale — by darkness8i8 (closed: 2025-12-19 10:14 (UTC+8)) [💬44] ### Your current environment
Colab notebooks, A100
### 🐛 Describe the bug
I have no idea what's wrong. This model works with a normal pipeline but fails with vLLM. It was saved to 16-bit and built off Unsloth/Llama3.1 8b Instruct.
This works -> from transformers import pipeline …
- #17813 [Bug]: 0.8.5 post1 cuda error — bug, stale — by gengchaogit (closed: 2025-12-19 10:14 (UTC+8)) [💬18] ### Your current environment
env: 5070ti 16G + 5080TI 16G; model: Qwen3-32B-AWQ; command: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve xunlei_model/Qwen3-32B-AWQ --port 8099 --max-model-len 25000 --served-model-name vllm --enforce_eager --tensor-parallel-size 2 --gpu_memory_utilization 0.95`
A few days ago I installed a relatively new vLLM, which allowed me to run tensor parallelism across two graphics cards. A lot of concurrent tests were done at that time …
- #18231 [Bug]: SamplingParams() use_beam_search error — bug, stale — by michellewheatleyxsite (closed: 2025-12-19 10:14 (UTC+8)) [💬7] ### Your current environment
(collect_env.py output not provided) I am using the following packages/versions: ...
- #18489 [Bug]: SharedStorageConnector does not load KV cache — bug, stale — by KoyamaSohei (closed: 2025-12-19 10:14 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py`: `System Info: OS : Ubuntu 24.04.2 LTS (aarch64) ...` (truncated)
- #18572 [Bug]: ValueError: Attempted to assign 25 + 25 + 25 = 75 multimodal tokens to 147 placeholders — bug, stale — by Mypainismorethanyours (closed: 2025-12-19 10:14 (UTC+8)) [💬10] ### Your current environment
vllm==0.7.3 transformers==4.49.0
### 🐛 Describe the bug
`import os; os.environ["HF_HOME"] = "/gz-data/.cache/huggingface"` …
- #18725 [Performance]: Unexpected: B200 GPU Performance Similar to H200 for Qwen/QwQ-32B, Expected B200 to be Significantly Faster — performance, stale — by awterman (closed: 2025-12-19 10:14 (UTC+8)) [💬7] ### Proposal to improve performance
We are observing that the B200 GPU performs similarly to the H200 GPU when running inference with the `Qwen/QwQ-32B` model using vLLM. We expect the B200 to have significantly better performance.
Hardware Information:
- CPU: 192 x vCPU
- Memory: 1585 GB
Benchmark Script: …
- #18730 [Bug]: RuntimeError: Engine core initialization failed. — bug, stale — by YasinFu (closed: 2025-12-19 10:14 (UTC+8)) [💬8] ### Your current environment
The output of `python collect_env.py`: `(vllm) llm@aitt:/data_a/llm$ python collect_env.py INFO 05-23 10:07:58 [__init__.py:239] Automatically detected platform cuda. Collecting environment information... ...` (truncated)
- #23072 [Bug]: eagle3 draft model len > 2048 will be broken — bug, stale — by fan-niu (closed: 2025-12-19 10:13 (UTC+8)) [💬2] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #23074 [Bug]: Can the thinking mode of qwen3 be used simultaneously with the guided_json of vllm — bug, stale — by ZYJ-3721 (closed: 2025-12-19 10:13 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #25287 [Bug]: CrashLoopBackOff on AMD GPU (MI300x) with rocm/vllm Container on DigitalOcean Kubernetes — bug, rocm — by kev-pebble (closed: 2025-12-19 09:39 (UTC+8)) [💬6] Attempting to deploy the rocm/vllm container on a DigitalOcean Kubernetes (DOKS) cluster with AMD MI300x GPUs and encountering a persistent CrashLoopBackOff error. We are currently unable to get application logs due to a platform-level issue, but we have isolated the problem to the vllm serve command itself.
Environment: * Platform: DigitalOcean Kubernetes (DOKS) * Kubernetes Version: v1.33.1 * GPU Node: DigitalOcean gpu-mi300x1-192gb-devcloud (AMD MI300x) * Container Image: doc…
- #30843 [Bug]: vllm serve GLM-4.6V-Flash error — bug — by F0undLinks (closed: 2025-12-19 09:29 (UTC+8)) [💬7] ### Your current environment
docker image: vllm/vllm-openai:nightly (sha256:b40b770900bfb2b4a66bc04e888141830e20fd732c79e07ab3e3d6186d0ed437)
vllm version: 0.13.0rc2.dev118+g29f7d9771; transformers version: 4.57.3
### 🐛 Describe the bug
…
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:04 (UTC+8)) [💬6] ### Name of failing test
`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pip freeze | grep -E 'torch' && pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing && cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug i…
- #30929 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:03 (UTC+8)) [💬2] ### Name of failing test
`bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29519 [CI Failure]: mi325_1: Examples Test — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:02 (UTC+8)) [💬2] ### Name of failing test
`pip install tensorizer && python3 offline_inference/basic/generate.py --model facebook/opt-125m && python3 offline_inference/basic/generate.py --model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10 && python3 offline_inference/basic/chat.py && python3 offline_inference/prefix_caching.py && python3 offline_inference/llm_engine_example.py && python3 offline_inference/audio_language.py --seed 0 && python3 offline_inference/vision_language.py --seed 0 && python3 offlin…
- #29538 [CI Failure]: mi325_8: Kernels Quantization Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:01 (UTC+8)) [💬1] ### Name of failing test
`pytest -v -s kernels/quantization --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29468 [CI Failure]: mi325_1: Basic Models Tests (Other) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:00 (UTC+8)) [💬4] ### Name of failing test
`pytest -v -s models/test_transformers.py models/test_registry.py`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30926 [CI Failure]: mi325_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:00 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s v1/entrypoints`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29526 [CI Failure]: mi325_1: Entrypoints Integration Test (Pooling) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬5] ### Name of failing test
`export VLLM_WORKER_MULTIPROC_METHOD=spawn && pytest -v -s entrypoints/pooling`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬3] ### Name of failing test
`pytest -v -s lora --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py --ignore=lora/test_olmoe_tp.py --ignore=lora/test_deepseekv2_tp.py --ignore=lora/test_gptoss_tp.py --ignore=lora/test_qwen3moe_tp.py`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30928 [CI Failure]: mi325_1: LM Eval Small Models (1 Card) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30927 [CI Failure]: mi325_1: LM Eval Small Models — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:58 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30939 [Bug]: The same input produces outputs of different lengths in the same batch. — bug — by sorenwu (closed: 2025-12-18 21:14 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #25718 [Bug][ROCm]: large max_num_seqs hurts performance on AMD — bug, rocm — by draftbk (closed: 2025-12-18 16:54 (UTC+8)) [💬7] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #22245 [Bug]: VLLM_ROCM_USE_AITER=1 hit device_gemm with the specified compilation parameters does not support this GEMM problem for Qwen3-235B-A22B — bug, rocm, stale — by billishyahao (closed: 2025-12-18 16:27 (UTC+8)) [💬8] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30934 [Bug]: Qwen3-VL-235B-A22B-Thinking-FP8 not divisible by weight quantization block_n = 128. — bug — by serser (closed: 2025-12-18 16:15 (UTC+8)) [💬2] ### Your current environment
I am trying to run the FP8 version on 8*A100 GPUs but I encountered `ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.` `Collecting environment information...` (truncated)
- #21504 [RFC] [ROCm] [AITER]: Propose a `_aiter_ops` class like `_custom_ops` and `_ipex_ops` — rocm, RFC, unstale — by tjtanaa (closed: 2025-12-18 14:55 (UTC+8)) [💬5] ### Motivation.
This RFC proposes the creation of a new module, `vllm/_aiter_ops.py`, to centralize the management, conditional loading, and registration of AITER kernels for ROCm. This new module will be analogous to the existing `_custom_ops.py` and `_ipex_ops.py`, providing a single, authoritative source for all AITER-related operations. This change will improve code organization, prevent circular dependencies, simplify the developer experience, and streamline testing. As the integration of…
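The centralization pattern #21504 describes can be sketched in miniature (this is illustrative, not the actual vLLM module; `register_op`, `get_op`, and `AITER_AVAILABLE` are invented names): ops register themselves conditionally, and call sites resolve kernels through one lookup with a generic fallback.

```python
# Illustrative sketch: one registry for conditionally available backend ops,
# so call sites never import the backend (here, AITER) directly.
_REGISTRY = {}

def register_op(name, available):
    """Register the decorated fn under `name` only if the backend is usable."""
    def deco(fn):
        if available:
            _REGISTRY[name] = fn
        return fn
    return deco

AITER_AVAILABLE = False  # in real code: probe for a working ROCm/AITER install

@register_op("rms_norm", available=AITER_AVAILABLE)
def _aiter_rms_norm(x):
    raise NotImplementedError  # would dispatch to the AITER kernel

def get_op(name, fallback):
    """Resolve the registered kernel, falling back to the generic one."""
    return _REGISTRY.get(name, fallback)

generic = lambda x: x  # stand-in for the default implementation
assert get_op("rms_norm", generic) is generic  # AITER unavailable -> fallback
```

Keeping registration in one module is what prevents the circular imports and scattered `if aiter:` checks the RFC complains about.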
- #25102 [Bug]: Crash when using CUTLASS_MLA on B200 with flashinfer 0.3.1: `UnboundLocalError: cannot access local variable 'returned_lse' where it is not associated with a value` — bug, stale — by smarterclayton (closed: 2025-12-18 13:56 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` from the decode worker: `Collecting environment information... System Info ...` (truncated)
- #29522 [CI Failure]: mi325_1: PyTorch Fullgraph Test — ci-failure — by AndreasKaratzas (closed: 2025-12-18 13:35 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile' && pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29443 [CI Failure]: mi325_1: Python-only Installation Test — ci-failure — by AndreasKaratzas (closed: 2025-12-18 13:25 (UTC+8)) [💬5] ### Name of failing test
`bash standalone_tests/python_only_compile.sh`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30923 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (closed: 2025-12-18 12:25 (UTC+8)) ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30834 [Bug]: vllm0.12.0 h100 PTX was compiled with an unsupported toolchain — bug — by Qingyuncookie (closed: 2025-12-18 11:54 (UTC+8)) [💬4] ### Your current environment
System Info …
[New PRs]
- #30944 [XPU] enable fp8 online streaming quantization — no label — by yma11 (created: 2025-12-18 15:44 (UTC+8)) [💬1 | +32/-107, 2 files | commented: 4] ## Purpose
This PR enables fp8 online streaming quantization on the XPU path for other linear and MoE layers.
## Test Plan
`VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=float16 --max-model-len=4096 --quantization=fp8 -tp=4`
## Test Result
`Processed prompts: 100%|██████████████████| 4/4 [00:00<00:00, 4.21it/s, est. speed input: 23.18 toks/s, output: 67.43 toks/s]` …
- #30998 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm — by yuttian1 (created: 2025-12-19 10:27 (UTC+8)) [💬1 | +206/-356, 3 files | commented: 2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- fused_moe_kernel_gptq_awq (optimized and tuned by cuichang@amd.com)
- awq_dequant
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
- #30997 Add Molmo2 multimodal model support — documentation, new-model, multi-modality — by sangho-vision (created: 2025-12-19 10:22 (UTC+8)) [💬2 | +3044/-1, 7 files | commented: 2]
## Purpose
This PR adds support for Molmo2 to vLLM. Molmo2 is a family of open vision-language models developed by the Allen Institute for AI (Ai2) that support image, video and multi-image understanding and grounding.
This change:
- Introduces a new Molmo2 model implementation for vLLM
- Registers Molmo2 in the multimodal model registry …
- #30965 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm, needs-rebase — by yuttian1 (created: 2025-12-18 21:05 (UTC+8)) [💬1 | +206/-356, 3 files | commented: 2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- fused_moe_kernel_gptq_awq (optimized and tuned by cuichang@amd.com)
- awq_dequant
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
- #30957 [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM — documentation, frontend, nvidia — by CedricHwong (created: 2025-12-18 17:46 (UTC+8)) [💬3 | +435/-16, 6 files | commented: 4]
## Purpose
This PR adds support for two NVIDIA ModelOpt-exported FP8 HuggingFace checkpoint variants that currently cannot be loaded by vLLM:
- quant_algo=FP8_PER_CHANNEL_PER_TOKEN (aka hf_fp8_pc_pt)
- quant_algo=FP8_PB_WO (ModelOpt may emit fp8_pb_wo, handled case-insensitively)
It keeps strict quant_algo matching (only known ModelOpt algos map to quantization="modelopt") to avoid ...
- #30967 [Misc] Remove unused custom ops `copy_blocks` and `copy_blocks_mla` — kernel, ready — by lengrongfu (created: 2025-12-18 22:02 (UTC+8)) [+0/-286, 6 files | commented: 1] ## Purpose
Found that some defined ops are no longer in use, so they should be removed.
## Test Plan
## Test Result
...
- #30991 [ROCm][CI/Build] Update ROCm dockerfiles — rocm, ci/build — by gshtras (created: 2025-12-19 06:50 (UTC+8)) [+11/-5, 2 files | commented: 1, approved: 1]
Dockerfile.rocm: override the torch profiler setting to allow full traces; write extra info to versions.txt
Dockerfile.rocm_base: revert to ROCm 7.0 due to a performance regression in 7.1 that is being investigated internally; update the torch commit to the ROCm 2.9.0 release; update the triton commit to match the new torch commit; update AITER to a more recent commit
- #30953 Hotfix cpu container build — documentation, ci/build, cpu — by maryamtahhan (created: 2025-12-18 17:03 (UTC+8)) [💬3 | +125/-11, 3 files | commented: 3] ## Purpose
This PR improves the CPU Docker build process for better compatibility with Podman and adds build configuration tracking through OCI labels.
Changes:
- Podman Compatibility: Replace bind mounts with COPY for requirements files in `docker/Dockerfile.cpu` to improve compatibility with Podman
- OCI Labels: Add standard OCI image labels and custom build configuration labels to track CPU instruction set flags (AVX2, AVX512, etc.) in built images
- **Cross-platform Build Sup…
- #30974 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention — ready — by Isotr0py (created: 2025-12-19 01:07 (UTC+8)) [+10/-4, 1 file | commented: 3]
## Purpose
- Actually, #30386's implementation is still incorrect, because the tiles are still selected casually, so the hidden_states is not fully converged.
- This PR fixes the issue to make sure all valid tiles are correctly selected for computation
## Test Plan
`python examples/offline_inference/vision_language.py -m gemma3 --num-prompts 1` …
-
#30994 [WIP] Improve cpu Benchmark Suite tests for 0.12.0 — performance,ci/build,cpu — by louie-tsai (created: 2025-12-19 08:25 (UTC+8)) [+76/-21, 1 files | commented:1] ## Purpose
Improve CPU benchmark tests for 0.12.0
## Test Plan
## Test Result
... -
#30987 Eagle 3 fix sometimes when you don't set architectures you get "Model architectures ['EagleLlamaModel'] are not supported for now." — new-model,llama — by aidando73 (created: 2025-12-19 04:11 (UTC+8)) [💬2 | +2/-0, 1 files | commented:2 changes:1]
## Purpose
If you don't set the model architecture in the config, I think transformers pulls out Eagle3LlamaModel or something similar, which isn't mapped in vLLM.
## Test Plan
You can repro with this config:
…
-
#30993 Bump Flashinfer to v0.6.0rc1 — ci/build,nvidia — by elvischenv (created: 2025-12-19 07:50 (UTC+8)) [+6/-67, 9 files | commented:1]
## Purpose Bump Flashinfer to v0.6.0rc1. API change: argument `tile_tokens_dim` has been removed from all TRTLLM MoE kernels.
## Test Plan
## Test Result
…
-
#30992 [Misc] Remove deprecated metric vllm:time_per_output_token_seconds for v0.13 release — v1 — by jliu9515 (created: 2025-12-19 07:18 (UTC+8)) [+0/-39, 1 files | commented:2] This metric was deprecated in v0.11 and renamed to vllm:inter_token_latency_seconds. The TODO comment indicated it should be removed in v0.13.0.
Follows up on #30396 which removed other deprecated items for v0.13.
## Purpose
- This metric was deprecated in v0.11 and renamed to `vllm:inter_token_latency_seconds`
- The TODO comment indicated it should be removed in v0.13.0
- Follows up on #30396 which removed other deprecated items for v0.13
##…
-
#30973 [Bugfix] Remove `tile_size=64` for mm_prefix triton attention — ready — by Isotr0py (created: 2025-12-19 00:18 (UTC+8)) [+0/-7, 1 files | commented:1 approved:2]
## Purpose
- Fix https://github.com/vllm-project/vllm/pull/30386#discussion_r2631541204
cc @lucianommartins
## Test Plan
## Test Result …
- #30990 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) — ready,nvidia — by robertgshaw2-redhat (created: 2025-12-19 06:31 (UTC+8)) [💬3 | +3/-685, 7 files | commented:2 approved:2]
## Purpose
- per https://github.com/vllm-project/vllm/issues/30989, CUTLASS Block quant FP8 does not run on main. When modifying vLLM to force it to run, it gets an IMA (illegal memory access) immediately
- per discussion with NVIDIA, FlashInfer kernels are better for TP DSR1 anyways
- rather than fixing the IMA, we will just remove this kernel
- remove to simplify code
## Test Plan
- ci to ensure nothing broke
## Test Result …
-
#30979 [MoE Refactor] Add mk for cutlass fp8 block — v1,nvidia — by robertgshaw2-redhat (created: 2025-12-19 02:15 (UTC+8)) [+191/-32, 5 files | commented:1 | draft] NOTE: #30989 [Bug]: CUTLASS BLOCK SCALE FP8 IMA on B200
The kernel is broken on main
-
#30983 Check for truthy `rope_parameters` not the existence of it — bug,ready — by hmellor (created: 2025-12-19 03:21 (UTC+8)) [+5/-5, 1 files | commented:1 approved:1] Fixes the following situation:
- Model explicitly sets `rope_parameters=None` (new name, Transformers v5) or `rope_scaling=None` (old name, Transformers v4)
- User is using Transformers v4 with vLLM
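The difference between the old existence check and the new truthiness check can be sketched as follows (hypothetical minimal config object for illustration; not vLLM's actual code):

```python
class Config:
    # Model explicitly sets the new-name key to None (Transformers v5 style)
    rope_parameters = None

config = Config()

# Existence check: True even though the value is None, so the wrong branch runs
exists = hasattr(config, "rope_parameters")

# Truthiness check: correctly skips rope setup when the value is None/empty
truthy = bool(getattr(config, "rope_parameters", None))

print(exists, truthy)  # True False
```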
-
#30984 Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora — v1,nvidia — by yugong333 (created: 2025-12-19 03:53 (UTC+8)) [+201/-41, 9 files | commented:2]
## Purpose
Reduce overhead of idle kernel launch due to max-loras in grid construction
## Test Plan
## Test Result
…
-
#30988 [P/D] Add utility to slow down server for decode benchmarking — frontend,v1,kv-connector — by hjjq (created: 2025-12-19 05:10 (UTC+8)) [+82/-0, 10 files | commented:1 | draft] https://github.com/vllm-project/vllm/pull/25986 works but does not support real data. This feature allows benchmarking with real data by slowing down the decode node to wait until enough requests are completed and sent from the prefill node.
Usage:
`VLLM_SERVER_DEV_MODE=1 vllm serve ... --slowdown-threshold N` where N is the number of requests at which slowdown will be turned off. If the number of scheduled requests is less than N, the scheduler will sleep 10s for each step. This threshold can… -
#30986 Eagle 3 fix sometimes when you don't set architectures you get "Model architectures ['EagleLlamaModel'] are not supported for now." — documentation,performance,new-model,rocm,frontend,tpu,ci/build,v1,tool-calling,llama — by aidando73 (created: 2025-12-19 04:08 (UTC+8)) [💬3 | +20070/-1312, 118 files | commented:2]
## Purpose
As per title
## Test Plan
## Test Result
…
-
#30980 [Do not merge][Async] Asynchronous DP coordination — v1 — by MatthewBonanni (created: 2025-12-19 02:40 (UTC+8)) [💬1 | +253/-42, 2 files | commented:2 | draft] ## Purpose When using async scheduling with MTP and DP, `_get_valid_sampled_token_count` causes a sync point. Following this sync point, DP coordination causes a GPU bubble. This PR creates and employs an asynchronous replacement for `coordinate_batch_across_dp`, allowing this communication to overlap with execution.
## Test Plan
```
vllm serve deepseek-ai/DeepSeek-R1
-dp 8
--enable-expert-parallel
--no-enable-prefix-caching
--trust-remote-code
…
```
- #30982 Update PyTorch version update docs — documentation — by atalman (created: 2025-12-19 03:18 (UTC+8)) [💬1 | +9/-14, 1 files | commented:1] This updates the PyTorch runbook
- #30981 blackwell — frontend,v1 — by dtunikov (created: 2025-12-19 02:49 (UTC+8)) [💬2 | +80/-0, 6 files | commented:2] enforce
-
#30935 [XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU — no labels — by faaany (created: 2025-12-18 14:56 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Currently, the worker class on XPU is hard-coded to "vllm.v1.worker.xpu_worker.XPUWorker". This PR improves the logic to only override if `worker_cls` is still the default "auto". This allows custom workers (e.g. vllm-omni workers) to be used on XPU.
See CUDA as a reference: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "F... -
#30977 Docs: add OpenAI SDK example for Qwen2.5-VL classification — documentation,qwen — by Dhruv-80 (created: 2025-12-19 01:55 (UTC+8)) [💬2 | +39/-0, 1 files | commented:3] ## Purpose
Add a usage example demonstrating how to call Qwen2.5-VL classification models served by vLLM using the OpenAI-compatible SDK.
This clarifies the required structured multimodal input format and helps users avoid
400 Bad Requesterrors caused by passing raw text or media URLs directly in requests.Fixes #27413 …
-
#30978 Add positional embedding and kv_cache fusion for llama and gpt-oss — v1,llama,gpt-oss — by dllehr-amd (created: 2025-12-19 01:57 (UTC+8)) [+255/-70, 6 files | commented:1 | draft]
## Purpose
## Test Plan
## Test Result
... -
#30976 Use aiter triton fused_add_rmsnorm_pad for gpt-oss — gpt-oss — by Rohan138 (created: 2025-12-19 01:39 (UTC+8)) [+39/-2, 3 files | commented:1 | draft]
## Purpose
Adds fused padding op before router GEMM on ROCm, eliminating this unfused pad after the GEMM before the fused_moe: https://github.com/ROCm/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#1603
Before:
After: <img width="1462" height="34" alt="image" src="https://github.com/user-attachments/assets/e5979f88-03c0-…
-
#30975 [Misc] Disable default `--ready-check-timeout-sec` extra call in vllm bench — performance — by NickLucche (created: 2025-12-19 01:37 (UTC+8)) [+2/-3, 1 files | commented:1] As per brief offline discussion with @mgoin, the current `vllm bench serve` implementation will default to sending (at least) one extra request to probe for server up status. I suppose this is due to a non-uniform backend/healthcheck API, so we've been defaulting to sending the same test request https://github.com/vllm-project/vllm/blob/686cbaac643c3412036728dd5e6bc29d6cce1a9f/vllm/benchmarks/serve.py#L596
I believe this is largely misleading for most use-cases as it results in an ambiguous "wa…
-
#30971 GB200 DeepSeek R1-NVFP4 Updates — needs-rebase,v1,deepseek,nvidia — by elvircrn (created: 2025-12-18 23:23 (UTC+8)) [💬1 | +379/-60, 14 files | commented:2 | draft] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... - #30924 [BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 — ready,v1,deepseek — by LucasWilkinson (created: 2025-12-18 12:16 (UTC+8)) [💬1 | +11/-6, 1 files | commented:5] FIX: https://github.com/vllm-project/vllm/issues/30861
-
#30972 feat(metrics): Add Prometheus exemplars support for request-level met… — documentation,frontend,v1 — by TheCodeWrangler (created: 2025-12-18 23:30 (UTC+8)) [💬2 | +205/-29, 10 files | commented:1] ## Purpose
Add opt-in support for Prometheus exemplars on per-request histogram metrics. Exemplars attach request IDs to metric observations, enabling correlation between metrics and individual requests for debugging and tracing.
This is useful for:
- Debugging slow requests by correlating latency spikes with specific request IDs
- Tracing individual requests through distributed systems
- Investigating outliers in histogram distributions
## Key Implementation Notes …
-
#30949 [Doc] Add Sophgo TPU Support — documentation,ready — by wzyrrr (created: 2025-12-18 16:30 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] vllm-tpu based on the Sophgo TPU is open-sourced, and we contributed documentation support to the vLLM community.
## Purpose We implemented a vllm plugin based on Sophgo TPU according to the RFC https://github.com/vllm-project/vllm/issues/11162 (Hardware Out-Of-Tree Plugin) standard, and this plugin can achieve high-efficiency inference of LLMs on Sophgo TPUs using the vllm framework.
Users can download the source code and install this plugin via pip install -e . …
- #30962 [Quantization] support logical_widths for fp8 marlin — no labels — by jinzhen-lin (created: 2025-12-18 20:36 (UTC+8)) [💬3 | +10/-0, 1 files | commented:6] fix https://github.com/vllm-project/vllm/issues/30750
-
#30968 Add hidden dimension validation for multimodal embedding inputs — multi-modality — by wenqiglantz (created: 2025-12-18 22:08 (UTC+8)) [💬1 | +836/-10, 4 files | commented:4] The original fix for https://github.com/advisories/GHSA-pmqf-x6x8-p7qw added options that required opting in to image and text embeds inputs. https://github.com/vllm-project/vllm/pull/27204
This PR adds hidden dimension validation for multimodal embedding inputs when the feature is turned back on.
Already reviewed via a private security report by @DarkLight1337 @Isotr0py
-
#30966 Migrate some old models to Transformers modelling backend — documentation,new-model — by hmellor (created: 2025-12-18 21:51 (UTC+8)) [💬1 | +8/-1455, 6 files | commented:2 | draft] Migrate support of some older models to use the Transformers modeling backend instead of having duplicated implementations in vLLM.
These architectures are ~2 years old and do not feature in the top 100 most used architectures in vLLM (with the exception of GPT2).
This migration:
- Reduces the number of model implementations maintained in vLLM
- Adds support for the following features that the Transformers modelling backend base class supports (these were not all added to the original model i…
-
#30960 Migrate from `mypy` to `ty` — ci/build — by hmellor (created: 2025-12-18 20:04 (UTC+8)) [💬1 | +20/-207, 5 files | commented:3 | draft] Depends on:
- vllm-project/vllm/issues/26533
- Replace all `mypy` references with `ty`
`pre-commit` setup:
- Switch to only running on the lower bound Python version instead of testing all supported versions …
-
#30954 [Kernel][AWQ] Optimize awq_gemm: K-group reuse for scales/zeros, increase pipeline stages, and simplify dequant math — needs-rebase — by yuttian1 (created: 2025-12-18 17:32 (UTC+8)) [💬1 | +201/-245, 3 files | commented:2] What this PR changes: 1) K-group optimization (reuse scales/zeros within AWQ group)
AWQ group size is 128 while the kernel K tile is 32.
Within a group, scales and zero-points are identical across multiple K tiles, so we reuse the same scales/zeros for the 4 consecutive K tiles inside the group.
This reduces global memory reads for scales and zero-points.
2) More pipeline stages …
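The reuse factor follows directly from the sizes quoted above; a quick sketch of the tile-to-group mapping (illustrative helper, not the kernel code):

```python
GROUP_SIZE = 128  # AWQ quantization group size along K
K_TILE = 32       # kernel K tile size

def group_of_tile(tile_idx: int) -> int:
    # Each K tile falls entirely inside one quantization group, so its
    # scales/zero-points are determined by integer division.
    return (tile_idx * K_TILE) // GROUP_SIZE

# 128 // 32 = 4 consecutive K tiles share one set of scales/zeros,
# so they can be loaded from global memory once and reused.
tiles_per_group = GROUP_SIZE // K_TILE
print(tiles_per_group)                       # 4
print([group_of_tile(t) for t in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```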
-
#30961 [Quantization] fix marlin w8a8 check — no labels — by jinzhen-lin (created: 2025-12-18 20:21 (UTC+8)) [💬1 | +3/-6, 1 files | commented:1] Marlin W8A8 is currently disabled; we should raise an error for this case.
fix https://github.com/vllm-project/vllm/issues/30904
-
#30955 [Bugfix][CPU] Fix Mac CPU build — cpu — by bigPYJ1151 (created: 2025-12-18 17:32 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1] ## Purpose
Fix mac CPU build issue after #30531
## Test Plan
Mac smoke test
## Test Result
…
-
#30952 [ROCm] Serving Fails on Radeon Due to AITER Dtype Import — rocm,ready — by vllmellm (created: 2025-12-18 16:48 (UTC+8)) [+18/-12, 1 files | commented:2 approved:1] ## Purpose The AITER method get_gfx_custom_op_core doesn't handle the gfx1201 architecture yet, causing issues when running vLLM even when AITER is not enabled. Currently AITER is not yet supported on gfx1201, so we need to prevent its usage until proper support is added to the AITER library.
## Test Plan Run test_basic_correctness.py on gfx1201 hardware (skipping meta-llama/Llama-3.2-1B-Instruct tests as they are time consuming). ## Test Result Before ``` ERROR test_basic_correctness.py - Runtim…
-
#30950 [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) — documentation,ready,v1 — by markmc (created: 2025-12-18 16:41 (UTC+8)) [💬2 | +109/-1, 5 files | commented:2] See #30738 - this is a follow-on to export these metrics via Prometheus in addition to the console logging
The metrics are only calculated and available with `--enable-mfu-metrics` -
#30936 [v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance — v1 — by yurekami (created: 2025-12-18 14:59 (UTC+8)) [💬3 | +49/-15, 1 files | commented:1] ## Summary
This PR improves error messages when context parallel features (DCP/PCP/MTP) are used with incompatible attention backends.
Problem (Fixes #28407):
- Current error messages use raw `AssertionError` with technical jargon
- No explanation of what DCP (Decode Context Parallel), PCP (Prefill Context Parallel), or MTP are
- No list of compatible backends
- No actionable fix suggestions
…
-
#30925 [Multimodal] Add FIPS 140-3 compliant hashing support — multi-modality — by yurekami (created: 2025-12-18 13:07 (UTC+8)) [💬2 | +220/-5, 2 files | commented:2] ## Summary
This PR adds FIPS 140-3 compliant SHA-256 hashing as an alternative to blake3 for multimodal content hashing in vLLM. This enables vLLM deployment in regulated environments (government, healthcare, finance) that require FIPS-approved cryptographic algorithms.
### Changes
- Add `_Sha256Hasher` wrapper class providing FIPS-compliant SHA-256 hashing
- Add `_Blake3Hasher` wrapper class maintaining consistent interface
- Add `_create_hasher()` factory function for automatic hasher select…
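The factory pattern described can be sketched with the standard library; `create_hasher` and the env toggle below are illustrative stand-ins, not vLLM's actual internals:

```python
import hashlib
import os

def create_hasher(fips_mode=None):
    """Pick a hash for content hashing: SHA-256 when FIPS compliance is
    required, blake3 otherwise (sketch of the factory the PR describes)."""
    if fips_mode is None:
        # Hypothetical env toggle, for illustration only
        fips_mode = os.environ.get("VLLM_USE_FIPS_HASH", "0") == "1"
    if fips_mode:
        return hashlib.sha256()
    try:
        import blake3  # third-party; faster, but not FIPS-approved
        return blake3.blake3()
    except ImportError:
        return hashlib.sha256()  # safe fallback

h = create_hasher(fips_mode=True)
h.update(b"multimodal content bytes")
digest = h.hexdigest()
print(len(digest))  # 64 hex chars for SHA-256
```

Both hashers expose the same `update()`/`hexdigest()` interface, which is what lets the rest of the code stay backend-agnostic.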
-
#30946 [Core] Improve DP synchronization error messages — v1 — by yurekami (created: 2025-12-18 16:10 (UTC+8)) [+26/-3, 1 files | commented:1] ## Summary
Replace bare `assert` statements with informative `RuntimeError` exceptions in `dp_utils.py` to improve the debugging experience for Data Parallel users. Changes:
- Replace `assert num_tokens_padded >= num_tokens_unpadded` with descriptive error showing actual values
- Replace `assert should_attempt_dp_padding == should_dp_pad` with error explaining DP rank synchronization mismatch
- Replace `assert uniform_decode is not None` with error explaining microbatching parameter requiremen…
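The general pattern (a bare assert replaced by a `RuntimeError` carrying the actual values) looks roughly like this sketch; the function name and message wording are illustrative, not the PR's exact code:

```python
def check_padding(num_tokens_padded: int, num_tokens_unpadded: int) -> None:
    # Instead of: assert num_tokens_padded >= num_tokens_unpadded
    if num_tokens_padded < num_tokens_unpadded:
        raise RuntimeError(
            f"DP padding invariant violated: num_tokens_padded "
            f"({num_tokens_padded}) must be >= num_tokens_unpadded "
            f"({num_tokens_unpadded}). This usually indicates a batch "
            f"construction bug across DP ranks."
        )

check_padding(16, 12)  # ok, returns None
try:
    check_padding(8, 12)
except RuntimeError as e:
    print("caught:", e)  # message includes both offending values
```

Unlike `assert`, this check also survives `python -O`, which strips assert statements entirely.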
-
#30945 [Feature] adapt step3 with eager mode — documentation,frontend,v1,deepseek — by InhabitancyCocoon (created: 2025-12-18 16:06 (UTC+8)) [💬1 | +2981/-118, 31 files | commented:2]
## Purpose
## Test Plan
## Test Result
... -
#30951 [Misc] Improve worker error messages for better debugging — v1,cpu — by yurekami (created: 2025-12-18 16:42 (UTC+8)) [+36/-6, 3 files | commented:1] ## Summary
Replace bare `assert` statements with informative `RuntimeError`/`TypeError` exceptions in v1 worker modules. This provides users with actionable error messages instead of cryptic `AssertionError` exceptions.
## Changes
### `vllm/v1/worker/gpu/dp_utils.py`
- Replace `assert dp_size > 1` with RuntimeError explaining Data Parallel configuration requirements
### `vllm/v1/worker/cpu_model_runner.py` …
-
#30947 [RFC][docs] Add lightweight AI assisted contribution policy — documentation — by markmc (created: 2025-12-18 16:22 (UTC+8)) [💬2 | +27/-0, 2 files | commented:3 | draft] Adds AI assisted contribution sections to `governance/process.md` and `contributing/README.md`, establishing clear guidelines for using AI tools while maintaining code quality and attribution standards. 🤖 Generated with Claude Code
-
#30948 fix: Suppress torch.frombuffer UserWarning for non-writable buffers — no labels — by yurekami (created: 2025-12-18 16:25 (UTC+8)) [💬1 | +26/-2, 3 files | commented:1] ## Summary Fixes #26781 - Suppress UserWarning from `torch.frombuffer()` on non-writable buffers
## Problem
`torch.frombuffer()` emits a `UserWarning` when given a non-writable buffer (e.g., from `pickle.dumps()` or shared memory): UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor...…
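The standard-library mechanism for this kind of targeted suppression is a scoped warnings filter; a sketch (torch is stood in by a helper emitting the same warning category, and the PR's exact change may differ):

```python
import warnings

def frombuffer_standin():
    # Stand-in for torch.frombuffer() on a read-only buffer: emits the
    # same warning category with a similar message (torch not imported here).
    warnings.warn("The given buffer is not writable", UserWarning)

# Targeted suppression: only this message pattern, only inside the context
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore", message=".*buffer is not writable.*", category=UserWarning
    )
    frombuffer_standin()  # silenced

# Without the filter the warning fires as usual
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    frombuffer_standin()
print(len(caught))  # 1
```

Scoping the filter to the call site avoids hiding unrelated `UserWarning`s elsewhere in the process.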
-
#30937 [Feature] Add --ssl-ciphers CLI argument for TLS cipher control — frontend — by rickychen-infinirc (created: 2025-12-18 15:03 (UTC+8)) [+4/-0, 2 files | commented:1] ## Summary This PR adds support for the `--ssl-ciphers` CLI argument, enabling users to specify allowed SSL/TLS cipher suites for fine-grained security control. Fixes #30569
## Changes
- `vllm/entrypoints/openai/cli_args.py`: Added `ssl_ciphers` parameter to `FrontendArgs` dataclass
- `vllm/entrypoints/openai/api_server.py`: Pass `ssl_ciphers` argument to `serve_http()` function
## Motivation …
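What such a flag ultimately configures is an OpenSSL cipher string applied to the server's `ssl.SSLContext`; a standard-library sketch (the plumbing through `serve_http()` is simplified away, and the cipher string is just an example value a user might pass):

```python
import ssl

# Example restrictive OpenSSL cipher string, like one passed via --ssl-ciphers
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.set_ciphers("ECDHE+AESGCM:!aNULL:!MD5")  # restricts the TLS <= 1.2 suites

# A valid string leaves a non-empty enabled-cipher list...
print(len(ctx.get_ciphers()) > 0)

# ...while a bogus string is rejected up front with ssl.SSLError
try:
    ctx.set_ciphers("NO-SUCH-CIPHER")
except ssl.SSLError:
    print("invalid cipher string rejected")
```

Note that on OpenSSL 1.1.1+ the TLS 1.3 suites are controlled separately and are unaffected by `set_ciphers()`.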
-
#30920 [Bugfix] Fix Unicode issues in GLM-4 tool calling — ready — by chaunceyjiang (created: 2025-12-18 11:26 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1] ## Purpose
## Test Plan
main:
`[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a206a0a9cae7c285', function=Function(arguments='{"city": "\\u5317\\u4eac", "date": "2024-06-27"}', name='get_weather'), type='function'), ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-9b1f3dc7e2f41093', function=Function(arguments='{"city": "\\u4e0a\\u6d77", "date": "2024-06-27"}', name='get_weather'), type='function')]`
this PR: …
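The escaped `\u5317\u4eac` in the "main" output above is ordinary JSON ASCII escaping; the usual mechanism behind fixing it (a sketch of the pattern, not necessarily the PR's exact change) is serializing with `ensure_ascii=False`:

```python
import json

args = {"city": "北京", "date": "2024-06-27"}

# Default: non-ASCII characters are escaped, matching the "main" output above
print(json.dumps(args))  # {"city": "\u5317\u4eac", "date": "2024-06-27"}

# ensure_ascii=False keeps the characters readable; both forms decode
# back to the same dict, so this is purely a presentation change.
readable = json.dumps(args, ensure_ascii=False)
print(readable)  # {"city": "北京", "date": "2024-06-27"}
```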
- #30932 enable vllm ut for Intel GPU — documentation,rocm,speculative-decoding,v1,nvidia — by wincent8 (created: 2025-12-18 14:06 (UTC+8)) [💬2 | +156/-59, 16 files | commented:1 | draft] enable vllm unit tests for Intel GPU
[Merged PRs]
- #30270 [ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements — rocm,ready,ci/build,multi-modality,qwen — by AndreasKaratzas (merged: 2025-12-19 10:17 (UTC+8)) [💬5 | +109/-47, 7 files | commented:10]
This PR addresses several ROCm-specific issues with multi-modal/vision-language models and improves attention backend dispatching for encoder-only self-attention models. It renders green the following test groups on ROCm:
- Multi-Modal Models Test (Standard)
- Multi-Modal Models Test (Extended) 1
- Multi-Modal Models Test (Extended) 2
- Multi-Modal Models Test (Extended) 3
#### Key Changes
Attention Backend Selection (`vllm/platforms/rocm.py`):
- Added dtype validation (fp16/bf16 o…
-
#30895 [BugFix] Handle errors when preprocessing added requests — bug,ready,ci/build,v1 — by njhill (merged: 2025-12-19 09:29 (UTC+8)) [💬4 | +93/-3, 3 files | commented:2 approved:1]
`preprocess_add_request()` runs in the engine core input socket processing thread. It's not expected to raise exceptions, but if it does, they aren't caught or logged - the input processing thread exits and the engine hangs silently :( This PR catches and logs request-scoped preprocessing errors, and returns an output to fail the request in question.
I'll open another PR to deal with any other unexpected exceptions occurring in the input socket processing thread (which should be considered fata…
- #30867 [Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models — frontend,ready,gpt-oss — by HaloWorld (merged: 2025-12-19 09:34 (UTC+8)) [💬5 | +86/-4, 2 files | commented:6 approved:1]
GPT-OSS models using the harmony format were ignoring the `tool_choice="none"` parameter and could still trigger tool calls when tools were provided in the request. This issue arose because the `_make_request_with_harmony` method only checked for the existence of `request.tools`, without accounting for the `tool_choice` setting or the `exclude_tools_when_tool_choice_none` flag. This fix ensures that harmony models respect the `exclude_tools_when_tool_choice_none` flag, aligning their behavior wi… -
#30684 [MM Encoder]: Migrate legacy ViT `MultiHeadAttention` to new `MMEncoderAttention` interface — tpu,ready,v1,llama — by Isotr0py (merged: 2025-12-19 02:04 (UTC+8)) [💬6 | +182/-266, 20 files | commented:2 approved:1]
## Purpose
- Following PR for #30125
- Migrate `MultiHeadAttention` usage to new `MMEncoderAttention`
## Test Plan
`pytest -s -v tests/kernels/attention/test_attention.py`…
-
#30973 [Bugfix] Remove `tile_size=64` for mm_prefix triton attention — ready — by Isotr0py (merged: 2025-12-19 03:42 (UTC+8)) [+0/-7, 1 files | commented:1 approved:2]
## Purpose
- Fix https://github.com/vllm-project/vllm/pull/30386#discussion_r2631541204
cc @lucianommartins
## Test Plan
## Test Result …
-
#29128 [Cleanup] Refactor FlashInferMetadataBuilder — ready,v1,nvidia — by benchislett (merged: 2025-12-19 06:45 (UTC+8)) [💬1 | +403/-269, 1 files | commented:9 approved:1] ## Purpose
Reorganize the FlashInferMetadata into clear `prefill` and `decode` sections that either belong to FlashInfer or TRTLLM execution pathways. This separation is desirable because it allows us to make explicit which metadata needs to be prepared for each backend, and therefore which computations can be omitted when a certain backend is not used. As such, I refactor (and make skip-able) some computations related to the paged kv indices, see `_compute_paged_kv_indices`.
## Test Plan
I …
-
#30907 [Bug] Fix batch invariant in torch 2.10 — ready — by yewentao256 (merged: 2025-12-18 23:27 (UTC+8)) [💬3 | +20/-24, 1 files | commented:5 approved:3] ## Purpose
Fixes https://github.com/pytorch/pytorch/issues/170490
## Test
Originally:
```bash … …
-
#30419 [NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA — ready,v1,kv-connector,nvidia — by xuechendi (merged: 2025-12-19 06:10 (UTC+8)) [💬13 | +53/-36, 3 files | commented:9 dismissed:1] ## Purpose
This PR builds on https://github.com/vllm-project/vllm/pull/30420; that one should be merged first.
Two issues detected and resolved in this PR:
- Fix a bug after #29665 for running PD with cpu host buffer
- Fix accuracy issue for running PD with cpu host buffer, described in https://github.com/vllm-project/vllm/issues/30358
…
-
#30983 Check for truthy `rope_parameters` not the existence of it — bug,ready — by hmellor (merged: 2025-12-19 05:59 (UTC+8)) [+5/-5, 1 files | commented:1 approved:1] Fixes the following situation:
- Model explicitly sets `rope_parameters=None` (new name, Transformers v5) or `rope_scaling=None` (old name, Transformers v4)
- User is using Transformers v4 with vLLM
-
#30916 [BugFix] Fix spec decode + structured outputs + preemption edge case — bug,ready,v1 — by njhill (merged: 2025-12-19 04:59 (UTC+8)) [+5/-1, 1 files | commented:1 approved:1] Fix an edge case that can be triggered when using spec decode with structured outputs.
There is a sequence of preemption and drafting being skipped that can result in the scheduler's `request.spec_token_ids` being stale, which can then fail bitmask generation because they will be out of sync with the grammar. When triggered, vLLM crashes with an error like this: ``` (EngineCore_DP0 pid=1549249) File "/home/nickhill/workspace/vllm2/vllm/vllm/v1/engine/core.py", line 920, in _process_engine_ste…
-
#30652 Strengthen input validation and tests for 'parse_raw_prompts'. — ready — by mivehk (merged: 2025-12-19 03:51 (UTC+8)) [💬3 | +11/-10, 1 files | commented:9 approved:1]
## Purpose 'parse_raw_prompts' can benefit from stricter input validation in the array-of-arrays input path: when the input is a list of lists, we now verify that every nested inner list is a non-null list of integers, not just the first one.
## Test Plan Added pytest coverage for invalid and edge-case inputs of mixed-type nested lists for parse_raw_prompts. Updated the array-of-arrays input condition of parse_raw_prompts to strict…
-
#30319 [BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 — ready,v1 — by heheda12345 (merged: 2025-12-19 03:47 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1 approved:1]
## Purpose We currently forget to pass the draft tokens from the model runner to the scheduler. This PR fixes it.
## Test Plan
`VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/spec_decode.py --test`
## Test Result …
-
#30363 Add removal version for all2all backend env var — documentation,ready,ci/build,v1,nvidia — by elizabetht (merged: 2025-12-19 03:46 (UTC+8)) [💬8 | +40/-43, 12 files | commented:4 changes:1 approved:2] ## Purpose Convert VLLM_ALL2ALL_BACKEND envvar to config value
## Test Plan CI run should be sufficient
## Test Result
... -
#30820 [Bug] Fix compressed tensor not using deepgemm — ready — by yewentao256 (merged: 2025-12-19 03:45 (UTC+8)) [💬2 | +10/-1, 2 files | commented:2 approved:1] ## Purpose
A fix for https://github.com/vllm-project/vllm/pull/30718#discussion_r2624920272
CT (compressed-tensors) should use deepgemm when available; this PR fixes that.
## Test
export MODEL="RedHatAI/Qwen3-30B-A3B-FP8-block"…
-
#30629 tuned fused configs for B300 — no labels — by navmarri14 (merged: 2025-12-19 03:41 (UTC+8)) [💬4 | +441/-0, 3 files | commented:1 approved:1]
## Purpose This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs using FP8 quantization.
Specifically, it targets the configuration:
Experts (E): 160; sharded size N=192 for TP=8, N=384 for TP=4, N=768 for TP=2 …
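The three sharded sizes above are consistent with a single per-expert dimension of 1536 split evenly across tensor-parallel ranks; 1536 is inferred here from N×TP being constant across the listed configs (a sketch, not taken from the PR):

```python
# 192*8 == 384*4 == 768*2 == 1536, so the unsharded dimension is inferred
FULL_N = 1536  # assumption derived from the listed (N, TP) pairs

def sharded_n(tp: int) -> int:
    # Each TP rank holds an equal slice of the per-expert dimension
    assert FULL_N % tp == 0, "dimension must split evenly across ranks"
    return FULL_N // tp

print([sharded_n(tp) for tp in (8, 4, 2)])  # [192, 384, 768]
```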
- #30729 [Perf] enable flashinfer rotary_embedding custom ops in DeepSeek rotary — ready,deepseek — by jiahanc (merged: 2025-12-19 03:31 (UTC+8)) [💬6 | +24/-2, 2 files | commented:7 approved:1]
## Purpose
- Enable flashinfer rotary_embedding custom ops in DeepSeek
To use the custom op, add `--compilation_config.custom_ops+=+rotary_embedding` in vllm config
## Test Plan
```
VLLM_USE_FLASHINFER_MOE_FP4=1 python3 -m vllm.entrypoints.openai.api_server --model nvidia/DeepSeek-R1-0528-FP4-v2 --tokenizer nvidia/DeepSeek-R1-0528-FP4-v2 --dtype auto --kv-cache-dtype fp8 --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --swap-space 16 --max-…
```
-
#30745 [BugFix] Reclaim resources to prevent memory leaks when using LMCacheMPConnector — ready,kv-connector — by wz1qqx (merged: 2025-12-19 03:09 (UTC+8)) [💬2 | +24/-0, 2 files | commented:7 approved:1] ## Purpose This is from the Novita.AI Team
Clean up to prevent memory leaks ## Test Plan
## Test Result
... -
#30935 [XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU — no labels — by faaany (merged: 2025-12-19 02:16 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Currently, the worker class on XPU is hard-coded to "vllm.v1.worker.xpu_worker.XPUWorker". This PR improves the logic to only override if `worker_cls` is still the default "auto". This allows custom workers (e.g. vllm-omni workers) to be used on XPU.
See CUDA as a reference: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "F... -
#29476 [BugFix] Add sleep to fix tight loop and release GIL — ready,v1 — by alec-flowers (merged: 2025-12-19 01:52 (UTC+8)) [💬3 | +7/-0, 1 files | commented:2 approved:1] Potential fix for https://github.com/vllm-project/vllm/issues/29369
While not very elegant, it does do the job of releasing the GIL.
## Purpose
## Test Plan
## Test Result
…
-
#29002 [Bugfix] fix DP-aware routing in OpenAI API requests — frontend,ready,ci/build,v1,nvidia — by inkcherry (merged: 2025-12-19 01:50 (UTC+8)) [💬2 | +68/-0, 7 files | commented:5 dismissed:1 approved:1] ## Purpose fix https://github.com/vllm-project/vllm/pull/24945
In `add_request`, duplicate initialization is skipped, but during the previous `self.processor.process_inputs`, `data_parallel_rank` is not initialized. Using `-H 'X-data-parallel-rank'` to specify the data parallel rank would be invalid in this case. cc @njhill
https://github.com/vllm-project/vllm/blob/d69062c67af46a2e624be92162e9db585eef329b/vllm/v1/engine/async_llm.py#L283-L302
## Test Plan
## Test Result
…
- #30218 [Cleanup] Remove unused ModelRunner V1 `InputBatch.num_tokens` field — tpu,ready,v1 — by njhill (merged: 2025-12-19 01:17 (UTC+8)) [+12/-36, 4 files | commented:4 approved:1] Even though we'll be transitioning to GPU ModelRunner V2, it makes sense to remove this in the meantime.
- #30909 [ROCm][Bugfix] Fix `fa_version` argument error in `flash_attn_maxseqlen_wrapper` for ROCm without aiter — rocm,ready — by AndreasKaratzas (merged: 2025-12-18 16:45 (UTC+8)) [+5/-4, 1 files | commented:2 approved:1] ### Problem On ROCm platforms with AITER either uninstalled or disabled, `flash_attn_varlen_func` fails with: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'fa_version'
This occurs because `is_rocm_aiter` is False when aiter is disabled, causing the code to fall through to the else branch which unconditionally passes `fa_version` to `flash_attn_varlen_func`. However, the ROCm version of Flash Attention (via `vllm.attention.utils.fa_utils`) does not support… -
#30730 [ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties — rocm,structured-output,ready,v1 — by AndreasKaratzas (merged: 2025-12-18 15:04 (UTC+8)) [💬7 | +45/-2, 2 files | commented:2 approved:1] ## Summary
This PR fixes a structured output generation failure when using `backend="auto"` with JSON schemas containing `patternProperties`. The fix adds detection for `patternProperties` in the auto-mode fallback logic and routes such schemas to the `outlines` backend instead of `guidance`.
## Problem
When using `structured_outputs` with `backend="auto"` and a JSON schema containing `patternProperties`, the `llguidance` library produces malformed output consisting of `{` followed by endless… -
#30900 fix fp8 online quantization streaming with tp > 1 — ready — by vkuzo (merged: 2025-12-19 00:45 (UTC+8)) [💬4 | +33/-8, 1 files | commented:1 approved:2] Summary:
Fix for https://github.com/vllm-project/vllm/issues/30830
When we added online fp8 quant with streaming weight post-processing in https://github.com/vllm-project/vllm/pull/29196, a bug was introduced where the TP>1 case was not always handled correctly. Specifically:
- https://github.com/vllm-project/vllm/pull/29196 assumed that `weight_loader` copies `loaded_weight` to `param` directly
- this is not always true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_wei…
-
#30598 [LoRA] Set default MXFP4 LoRA backend to Marlin — ready — by xyang16 (merged: 2025-12-19 00:42 (UTC+8)) [💬4 | +5/-5, 1 files | commented:2 approved:1] ## Purpose
This PR sets the default MXFP4 LoRA backend to Marlin because Triton has accuracy issues and Marlin has slightly better performance.
- Use Triton only if Marlin is disabled (set `VLLM_MXFP4_USE_MARLIN=0` explicitly) and `triton_kernels` is supported.
- Use Marlin by default:
  - if `VLLM_MXFP4_USE_MARLIN` is not set
  - if `VLLM_MXFP4_USE_MARLIN=1`
  - if `triton_kernels` is not supported
## Benchmarking …
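The selection precedence described can be sketched as a small decision function; the helper name is illustrative (not vLLM's actual function), while the env var name comes from the PR:

```python
import os

def choose_mxfp4_lora_backend(triton_kernels_supported: bool) -> str:
    """Sketch of the backend-selection precedence described above."""
    env = os.environ.get("VLLM_MXFP4_USE_MARLIN")  # env var name per the PR
    if env == "0" and triton_kernels_supported:
        # Only path to Triton: explicit opt-out AND triton_kernels available
        return "triton"
    # Default in every other case: unset, "1", or triton_kernels unsupported
    return "marlin"

os.environ.pop("VLLM_MXFP4_USE_MARLIN", None)  # start from an unset state
print(choose_mxfp4_lora_backend(triton_kernels_supported=True))   # marlin
os.environ["VLLM_MXFP4_USE_MARLIN"] = "0"
print(choose_mxfp4_lora_backend(triton_kernels_supported=True))   # triton
```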
-
#30949 [Doc] Add Sophgo TPU Support — documentation,ready — by wzyrrr (merged: 2025-12-19 00:29 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] vllm-tpu based on the Sophgo TPU is open-sourced, and we contributed documentation support to the vLLM community.
## Purpose We implemented a vllm plugin based on Sophgo TPU according to the RFC https://github.com/vllm-project/vllm/issues/11162 (Hardware Out-Of-Tree Plugin) standard, and this plugin can achieve high-efficiency inference of LLMs on Sophgo TPUs using the vllm framework.
Users can download the source code and install this plugin via pip install -e . …
-
#30822 [Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC — ready,qwen — by Lucaskabela (合并于: 2025-12-19 00:23 (UTC+8)) [💬1 | +13/-7, 3 files | commented:4 approved:1]
## Purpose See https://github.com/pytorch/pytorch/issues/170568
In torch 2.10, we will rely more heavily on aot_precompile; part of this upgrade leads to a successful load and re-instantiation of the `VLLMBackend` in caching.py. However, for multimodal encoders that previously relied on the `set_model_tag` context manager to forward `is_encoder` information to compile ranges, this field is no longer set (since aot loading happens at the callsite). Therefore, we need to …
-
#30188 [Model] adds jais 2 support — documentation,new-model,ready — by sarathc-cerebras (合并于: 2025-12-18 23:46 (UTC+8)) [💬8 | +534/-0, 4 files | commented:10] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30915 [Fix][FlexAttention] return max logical block index to handle reused blocks — ready,v1 — by ivanium (合并于: 2025-12-18 14:42 (UTC+8)) [+42/-4, 2 files | commented:2 approved:1]
## Purpose
For FlexAttention, we need to build `physical_to_logical_mapping` to reversely map physical block ids to logical block ids. This process previously assumed logical block ids are always unique, which is not true for some attention types such as sliding window attention, where some blocks may be released and reused later, causing the same physical block id to appear multiple times in a row of `block_table` at different logical block indices. As a result, …
-
#29776 [Misc] support nsys profile for bench latency — performance,ready — by izhuhaoran (合并于: 2025-12-18 22:52 (UTC+8)) [💬7 | +14/-12, 1 files | commented:4 approved:1] ## Purpose As titled, this PR supports nsys profiling for bench latency.
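The reverse block mapping for FlexAttention described above can be sketched in a few lines: when a physical block id repeats within a `block_table` row (a released and reused block), keep the maximum, i.e. latest, logical index. This is a simplified illustration, not the actual vLLM implementation.

```python
def physical_to_logical(block_table_row):
    """Map each physical block id to its max logical index in the row."""
    mapping = {}
    for logical_idx, physical_id in enumerate(block_table_row):
        # later occurrences overwrite earlier ones -> max logical index wins
        mapping[physical_id] = logical_idx
    return mapping

row = [7, 3, 7, 5]                # physical block 7 was released and reused
print(physical_to_logical(row))   # → {7: 2, 3: 1, 5: 3}
```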
-
#30537 Filter safetensors files to download if .safetensors.index.json exists — ready — by mgoin (合并于: 2025-12-18 22:51 (UTC+8)) [💬1 | +29/-6, 1 files | commented:1 changes:1 approved:1] ## Purpose
I noticed when running `vllm serve openai/gpt-oss-20b` on a new machine that it not only downloaded the base model files, but also included the `original/` directory, which I assume was picked up because it has files ending in `*.safetensors`: model-00000-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████| 4.79G/4.79G [06:24<00:00, 12.5MB/s] model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████…
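The filtering idea can be sketched as follows: when a `.safetensors.index.json` is present, only download the shard files its `weight_map` actually references, skipping stray `*.safetensors` elsewhere in the repo (a simplified illustration; the function name is hypothetical).

```python
import json

def files_to_download(index_json_text, repo_files):
    """Keep only *.safetensors shards referenced by the index's weight_map."""
    index = json.loads(index_json_text)
    needed = set(index["weight_map"].values())  # shard filenames actually used
    return [f for f in repo_files if f in needed]

index_text = json.dumps({"weight_map": {
    "layer.0.weight": "model-00000-of-00002.safetensors",
    "layer.1.weight": "model-00001-of-00002.safetensors",
}})
repo = ["model-00000-of-00002.safetensors",
        "model-00001-of-00002.safetensors",
        "original/model.safetensors"]        # extra copy we want to skip
print(files_to_download(index_text, repo))   # only the two indexed shards
```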
-
#30910 [BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) — ready — by LucasWilkinson (合并于: 2025-12-18 15:50 (UTC+8)) [💬2 | +14/-74, 2 files | commented:1 approved:2] Partially revert https://github.com/vllm-project/vllm/pull/29558 as this broke H200 tests
https://buildkite.com/vllm/ci/builds/43863#019b29e9-5c1b-4eff-83f7-c8304f774aa7
i.e. `VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048`…
-
#30955 [Bugfix][CPU] Fix Mac CPU build — cpu — by bigPYJ1151 (合并于: 2025-12-18 17:38 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1] ## Purpose
Fix mac CPU build issue after #30531
## Test Plan
Mac smoke test
## Test Result
…
-
#30952 [ROCm] Serving Fails on Radeon Due to AITER Dtype Import — rocm,ready — by vllmellm (合并于: 2025-12-18 19:47 (UTC+8)) [+18/-12, 1 files | commented:2 approved:1] ## Purpose The AITER method `get_gfx_custom_op_core` doesn’t handle the gfx1201 architecture yet, causing issues when running vLLM even when AITER is not enabled. Currently AITER is not yet supported on gfx1201, so we need to prevent its usage until proper support is added to the AITER library.
## Test Plan Run test_basic_correctness.py on gfx1201 hardware (skipping meta-llama/Llama-3.2-1B-Instruct tests as they are time consuming). ## Test Result Before ``` ERROR test_basic_correctness.py - Runtim…
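The guard described above amounts to gating AITER usage on the GPU architecture so unsupported targets such as gfx1201 fall back to non-AITER paths. A minimal sketch, with a purely illustrative allow-list (not the PR's actual supported-arch set):

```python
# Illustrative allow-list only; the real supported set lives in vLLM/AITER.
AITER_SUPPORTED_GFX = {"gfx942", "gfx950"}

def aiter_enabled(gfx_arch: str, user_requested: bool) -> bool:
    """AITER is used only when requested AND the arch is known to support it."""
    return user_requested and gfx_arch in AITER_SUPPORTED_GFX

print(aiter_enabled("gfx942", True))    # → True
print(aiter_enabled("gfx1201", True))   # → False: not yet supported
print(aiter_enabled("gfx1201", False))  # → False
```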
-
#29935 [moe] Use enable_chunking func (to support disabling chunking) — ready — by minosfuture (合并于: 2025-12-18 17:02 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1 approved:1]
## Purpose
`VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING` wasn’t used. This PR uses it and allows disabling activation chunking, which is good for throughput performance.
## Test Plan
tested VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING
…
-
#30883 [Chore] Remove v0 dead code for Qwen2.5-omni — ready,qwen — by Isotr0py (合并于: 2025-12-18 11:54 (UTC+8)) [💬1 | +0/-22, 1 files | commented:1 approved:1]
## Purpose
- Just found we missed Qwen2.5-omni’s `embed_multimodal_v0` when removing `embed_multimodal_v0` from models.
## Test Plan
## Test Result
…
-
#30920 [Bugfix] Fix Unicode issues in GLM-4 tool calling — ready — by chaunceyjiang (合并于: 2025-12-18 15:12 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1] ## Purpose
## Test Plan main
[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a206a0a9cae7c285', function=Function(arguments='{"city": "\\u5317\\u4eac", "date": "2024-06-27"}', name='get_weather'), type='function'), ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-9b1f3dc7e2f41093', function=Function(arguments='{"city": "\\u4e0a\\u6d77", "date": "2024-06-27"}', name='get_weather'), type='function')]
this PR …
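The `\uXXXX`-escaped arguments shown above are what `json.dumps` produces by default; serializing with `ensure_ascii=False` keeps CJK characters readable. A sketch of that Unicode-handling difference (not the PR's actual code):

```python
import json

args = {"city": "北京", "date": "2024-06-27"}
print(json.dumps(args))                      # escaped: {"city": "\u5317\u4eac", ...}
print(json.dumps(args, ensure_ascii=False))  # readable: {"city": "北京", ...}
```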
-
#30531 [CPU] Refactor CPU fused MOE — ready,ci/build,cpu — by bigPYJ1151 (合并于: 2025-12-18 14:36 (UTC+8)) [💬4 | +1394/-206, 23 files | commented:8 changes:2]
## Purpose
Refactor CPU fused MOE by optimizing the tile schedule and enabling torch.compile.
Part of #29580
main: ```log …
-
#30225 [Platform] Let EPD work with non-cuda platform — ready,nvidia — by wangxiyuan (合并于: 2025-12-18 14:45 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Simple change to make EPD work with non-CUDA platforms; removes the CUDA hardcode logic. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30706 fix: add warmup for audio preprocessing — frontend,ready — by TheCodeWrangler (合并于: 2025-12-18 14:12 (UTC+8)) [💬4 | +126/-1, 1 files | commented:6 approved:1] # Audio Preprocessing Warmup for Whisper Models
## Purpose
Fixes first-request latency issue for Whisper transcription models where the first request can take significantly longer (10-20x) than subsequent requests.
Problem: The first Whisper transcription request experiences substantial latency overhead due to lazy initialization, causing poor user experience and potential timeout issues in production.
Root Cause: Investigation revealed two lazy initialization bottlenecks:
- **Libro…
-
#30903 [UX] Reduce DeepGEMM warmup log output to single progress bar — ready,startup-ux — by MatthewBonanni (合并于: 2025-12-18 12:21 (UTC+8)) [+99/-42, 1 files | commented:2 approved:1] ## Purpose Showing a progress bar for each shape during DeepGEMM warmup is unnecessary. In the interest of reducing log clutter, this PR reduces the output to a single progress bar, only shown on rank 0.
## Test Plan
`vllm serve deepseek-ai/DeepSeek-R1 -dp 8 --enable-expert-parallel`
## Test Result …
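The logging pattern described above — one aggregate progress indicator over all warmup shapes, emitted only on rank 0 — can be sketched as (names hypothetical, plain callback instead of an actual progress-bar library):

```python
def warmup_progress(shapes, rank, report):
    """Run warmup over all shapes; only rank 0 reports aggregate progress."""
    total = len(shapes)
    for i, shape in enumerate(shapes, 1):
        # ... warm up `shape` here ...
        if rank == 0:                      # non-zero ranks stay silent
            report(f"DeepGEMM warmup {i}/{total}")

lines = []
warmup_progress([(128, 256), (256, 256), (512, 1024)], rank=0,
                report=lines.append)
print(lines)       # → three aggregate progress lines, e.g. 'DeepGEMM warmup 1/3'
warmup_progress([(128, 256)], rank=1, report=lines.append)
print(len(lines))  # → 3: rank 1 reported nothing
```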
-
#30814 [KV connector][LMCache] Only record the cuda event when there are request to store/load — ready,kv-connector,nvidia — by ApostaC (合并于: 2025-12-18 13:31 (UTC+8)) [💬2 | +40/-17, 2 files | commented:1 approved:1]
## Purpose
This is a small refactor/optimization so that we only record CUDA events when there are requests to store or load.
## Test Plan N/A
## Test Result …
-
#30911 [AMD][CI] fix lm eval ci arg — rocm,ready,ci/build — by divakar-amd (合并于: 2025-12-18 13:18 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] Fixing args for AMD-CI run in accordance with https://github.com/vllm-project/vllm/pull/30723
This fixes the following tests for AMD-CI:
LM Eval Small Models LM Eval Small Models (1 Card)
- #29553 [PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. `chmod -x *MI308X.json` — ready,qwen,nvidia — by vadiklyutiy (合并于: 2025-12-18 13:16 (UTC+8)) [💬3 | +882/-0, 7 files | commented:1 approved:1] ## Purpose
- For Qwen3-next-FP8, add tuned MoE configs for Blackwell.
- Remove execution permission for `E=128,N=768,device_name=AMD_Instinct_MI308X.json`.
-
#30765 [Doc][CPU] Update CPU doc — documentation,ready,ci/build,cpu — by bigPYJ1151 (合并于: 2025-12-18 12:59 (UTC+8)) [💬2 | +106/-16, 5 files | commented:9 approved:1]
## Purpose
Update CPU doc for wheel installation
## Test Plan
## Test Result
…
-
#30788 [refactor] Add prefix support to embed_tokens in DeepSeek MTP — ready,deepseek — by zzhx1 (合并于: 2025-12-18 12:45 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1]
## Purpose This PR adds proper weight prefixing to the embed_tokens embedding layer in DeepSeekMultiTokenPredictor using `maybe_prefix(prefix, "embed_tokens")`.
## Test Plan None ## Test Result None …
-
#30902 [compile] Fix CI for test_gpt2_cache_hit — ready — by zhxchen17 (合并于: 2025-12-18 12:22 (UTC+8)) [💬1 | +15/-6, 2 files | commented:1 approved:1] Signed-off-by: zhxchen17 zhxchen17@fb.com
## Purpose
Fixing torch 2.10 release CI issues from https://github.com/pytorch/pytorch/issues/170549
## Test Plan
pytest tests/compile/test_aot_compile.py …
-
#30071 [Quantization] Support Quark int4-fp8 w4a8 for MoE — rocm,ready — by BowenBao (合并于: 2025-12-18 12:20 (UTC+8)) [💬4 | +201/-2, 2 files | commented:6 approved:1] This PR extends support of Quark quantized model for int4-fp8 w4a8 quantization spec.
Specifically:
- Weights are double quantized: first at INT4 per channel, then further at FP8 per tensor.
- Activations are quantized dynamically in FP8 per tensor.
We observe large performance uplift and near full accuracy with Kimi-K2-Thinking quantized by Quark with int4-fp8 w4a8 on MI300x.
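The double-quantization idea above can be illustrated with a toy per-channel symmetric INT4 step (values clamped to [-8, 7]); this pure-Python sketch is only an illustration of the scheme, not Quark's implementation, and omits the subsequent FP8 per-tensor step.

```python
def quant_int4_per_channel(rows):
    """Quantize each channel (row) to symmetric INT4 with its own scale."""
    out = []
    for row in rows:
        scale = max(abs(v) for v in row) / 7.0   # per-channel scale (nonzero rows)
        q = [max(-8, min(7, round(v / scale))) for v in row]
        out.append((q, scale))
    return out

def dequant(q_rows):
    return [[q * s for q in row] for row, s in q_rows]

w = [[0.5, -1.0, 0.25], [2.0, 4.0, -3.5]]
qw = quant_int4_per_channel(w)
print(dequant(qw))   # approximate reconstruction of w
```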
-
#30716 fused_moe_lora PDL improvements — ready — by gnovack (合并于: 2025-12-18 11:55 (UTC+8)) [💬1 | +18/-12, 1 files | commented:3 approved:1] ## Purpose
The existing implementation of PDL for the `fused_moe_lora` kernel has a couple of issues which don’t allow us to realize the full gains of PDL:
### 1. `torch.zeros` between shrink and expand calls
The call to `torch.zeros` to initialize the intermediate cache occurs between the shrink and expand calls to `fused_moe_lora`. This means we cannot overlap the two `fused_moe_lora` calls via PDL.
-
[💬10 | +556/-212, 4 files | commented:8 approved:2] ## Overview
This PR addresses the following case, P tensor-parallel-size > D tensor-parallel-size.
I think it helps to differentiate two main cases
### MLA
For MLA model, the workflow is easier: each D worker reads from some other single P worker (fan-out reads to avoid all reading from same remote), as MLA cache is duplicated. Some P workers will not be read from at all.
…
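The MLA fan-out described above — each D worker reads from a single P worker, spread out so they do not all hit the same remote, with some P workers serving no reader — can be sketched with a simple round-robin policy (the policy and names are illustrative, not the PR's actual assignment logic):

```python
def assign_p_for_d(d_rank: int, num_p: int) -> int:
    """Pick the single P worker this D worker reads its MLA cache from."""
    return d_rank % num_p   # round-robin fan-out avoids one hot remote

assignments = [assign_p_for_d(d, num_p=8) for d in range(4)]
print(assignments)   # → [0, 1, 2, 3]: P workers 4..7 are never read from
```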
[关闭未合并 PR]
-
#18365 [Hardware][Intel-Gaudi] enable text embedding for Intel-Gaudi backend — needs-rebase,stale — by libinta (关闭于: 2025-12-19 10:14 (UTC+8)) [💬5 | +558/-215, 6 files] This PR adds the initial text embedding implementation on the Intel-Gaudi backend.
-
#18373 Adding basic dynasor functionality to vllm v1 scheduler — needs-rebase,stale,v1 — by PratishthaGaur (关闭于: 2025-12-19 10:14 (UTC+8)) [💬5 | +2404/-7, 4 files] This PR introduces initial support for Dynasor-style early exit in reasoning models within the vLLM v1 scheduler. It enables the system to dynamically insert probe requests that encourage the model to finalize its reasoning and produce an answer when it appears confident.
Key Behavior When the model has generated a multiple of 128 tokens, a probe request is created. The probe request appends a token to the current context, prompting the model to safely exit and produce a final answer if it is re…
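The probe trigger described above — a probe whenever the generated-token count hits a multiple of 128 — can be sketched as (helper name hypothetical):

```python
PROBE_INTERVAL = 128  # per the PR description: probe every 128 generated tokens

def should_probe(num_generated_tokens: int) -> bool:
    """True when a Dynasor-style probe request should be inserted."""
    return num_generated_tokens > 0 and num_generated_tokens % PROBE_INTERVAL == 0

print([n for n in (64, 128, 200, 256, 384) if should_probe(n)])  # → [128, 256, 384]
```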
-
#30965 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm,needs-rebase — by yuttian1 (关闭于: 2025-12-19 10:08 (UTC+8)) [💬1 | +206/-356, 3 files | commented:2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- `fused_moe_kernel_gptq_awq` (optimized and tuned by cuichang@amd.com)
- `awq_dequant`
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
-
#30986 Eagle 3 fix sometimes when you don’t set architectures you get “Model architectures [‘EagleLlamaModel’] are not supported for now. “ — documentation,performance,new-model,rocm,frontend,tpu,ci/build,v1,tool-calling,llama — by aidando73 (关闭于: 2025-12-19 04:08 (UTC+8)) [💬3 | +20070/-1312, 118 files | commented:2]
## Purpose
As per title
## Test Plan
## Test Result
…
-
#30894 Improve DCP error message with actionable guidance — documentation,v1 — by Dhruv-80 (关闭于: 2025-12-19 02:05 (UTC+8)) [💬6 | +107/-27, 5 files | commented:1] ## Purpose
Improve the error message shown when Decode Context Parallelism (DCP) is enabled with an attention backend that does not support returning softmax LSE values during decode.
Previously, this case raised a generic assertion error that did not provide actionable guidance to users. This change replaces the assertion with a user-facing exception that clearly explains the limitation and suggests how to resolve it (e.g., by selecting a different attention backend or disabling DCP). …
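The change described above — replacing a bare assertion with an exception that names the limitation and suggests remedies — follows a common pattern; a sketch with my own wording and a hypothetical function name:

```python
def check_dcp_support(backend_name: str, supports_decode_lse: bool) -> None:
    """Raise an actionable error instead of a bare assert (sketch)."""
    if not supports_decode_lse:
        raise ValueError(
            f"Decode Context Parallelism requires an attention backend that "
            f"returns softmax LSE during decode, but {backend_name!r} does "
            f"not. Select a different attention backend or disable DCP.")

try:
    check_dcp_support("SOME_BACKEND", supports_decode_lse=False)
except ValueError as e:
    print(e)   # user sees the limitation and two ways to resolve it
```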
-
#29755 Fix KV cache sync issue during CUDA graph replay — v1,kv-connector,nvidia — by yashwantbezawada (关闭于: 2025-12-18 21:22 (UTC+8)) [💬8 | +13/-0, 2 files | commented:8 changes:1 approved:1] Fixes #29608
I ran into this while looking at the async KV transfer code. When using connectors like LMCache with CUDA graphs, there’s a race condition during graph replay.
The issue is that `wait_for_layer_load()` calls happen inside the `@maybe_transfer_kv_layer` decorator on attention functions. During graph capture, these calls execute normally. But during replay, only the GPU operations are replayed - the Python decorator code gets skipped entirely. So the async KV loads might not be comp… -
#30954 [Kernel][AWQ] Optimize awq_gemm: K-group reuse for scales/zeros, increase pipeline stages, and simplify dequant math — needs-rebase — by yuttian1 (关闭于: 2025-12-18 20:43 (UTC+8)) [💬1 | +201/-245, 3 files | commented:2] What this PR changes 1) K-group optimization (reuse scales/zeros within AWQ group)
AWQ group size is 128 while the kernel K tile is 32.
Within a group, scales and zero-points are identical across multiple K tiles, so we reuse the same scales/zeros for the 4 consecutive K tiles inside the group.
This reduces global memory reads for scales and zero-points.
2) More pipeline stages …
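The reuse arithmetic above is simple: with an AWQ group size of 128 and a kernel K tile of 32, the 4 consecutive K tiles inside one group share the same scales/zero-points, so they can be loaded once per group.

```python
GROUP_SIZE, K_TILE = 128, 32
tiles_per_group = GROUP_SIZE // K_TILE
print(tiles_per_group)                       # → 4 K tiles share one scale/zero load

def group_of_tile(k_tile_idx: int) -> int:
    """AWQ group index that a given K tile falls into."""
    return (k_tile_idx * K_TILE) // GROUP_SIZE

print([group_of_tile(t) for t in range(8)])  # → [0, 0, 0, 0, 1, 1, 1, 1]
```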
-
#30796 [BugFix][Async] Clear spec tokens for requests re-entering input batch — v1 — by izhuhaoran (关闭于: 2025-12-18 17:59 (UTC+8)) [💬5 | +16/-4, 1 files | commented:1 | 📝草稿] ## Purpose
In async scheduling + spec, requests (re-entering the input batch) do not have pre-step draft tokens (since they were not running in the previous step). Therefore, any `scheduled_spec_decode_tokens` assigned by the scheduler for these requests are essentially invalid placeholders. Retaining them leads to unnecessary computation and potential unexpected behavior. -
#30945 [Feature] adpat step3 with eager mode — documentation,frontend,v1,deepseek — by InhabitancyCocoon (关闭于: 2025-12-18 16:07 (UTC+8)) [💬1 | +2981/-118, 31 files | commented:2]
## Purpose
## Test Plan
## Test Result
... -
#30643 Auto-rebase PRs older than 40 commits compared to main — ci/build — by khluu (关闭于: 2025-12-18 12:03 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1] Closes https://github.com/vllm-project/vllm/issues/28609