#35321 [Feature]: Encoder self-attention for RocmAttentionImpl — feature request,rocm — by micah-wil (created: 2026-02-26 02:44 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Encoder self-attention is not implemented for ROCM_ATTN. For example, when running the following test with ROCM_ATTN enabled instead of TRITON_ATTN:
pytest -v -s entrypoints/openai/test_run_batch.py::test_empty_file
this error is raised:
NotImplementedError: Encoder self-attention is not implemented for RocmAttentionImpl
### Alternatives
…
#35344 [Bug]: Value error, Model architectures ['Qwen3_5MoeForConditionalGeneration'] are not supported for now — bug — by Dong09 (created: 2026-02-26 09:19 (UTC+8)) [💬6]
### Your current environment
The output of Collecting environment information...
uv is set
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
...
#35349 [Bug]: MiniMax-2.1: Missing <think> tag in the LLM response after tool calls succeed. — bug — by stingoChen (created: 2026-02-26 10:39 (UTC+8))
### Your current environment
The output of python collect_env.py
```text
(vllm-test) root@d9bc88d37413:/opt/vllm-test/vllm# python collect_env.py
Collecting environment information...
uv is set
==============================
...
#35337 [CI] AttributeError in SMControlContextManager: torch.cuda.current_device() returns int, not device object — ci-failure — by LucasWilkinson (created: 2026-02-26 06:19 (UTC+8))
## Name of failing test
#35300 [Feature]: Add ISA-level smoke tests using Intel SDE to catch instruction set mismatches — no labels — by wjhrdy (created: 2026-02-25 22:47 (UTC+8)) [💬2]
### Motivation
vLLM’s CPU backend compiles ISA-specific C kernels (AVX2, AVX512, AVX512BF16, AMX, NEON, VSX) selected at build time via environment variables. The default CI image is built with AVX512BF16 + AVX512VNNI + AMXBF16 enabled (see .buildkite/image_build/image_build_cpu.sh).
No test currently verifies the binary works on CPUs with different ISA levels. Running the resulting bina…
#35310 [Feature]: Qwen-ASR Forced Aligner — feature request — by jppm99 (created: 2026-02-26 01:27 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Will vLLM support the Qwen3-ASR Forced-Aligner, which would allow timestamping audio transcription segments?
### Alternatives
No response
### Additional context
…
#35329 [Bug]: quantization="mxfp4" produces incorrect results / hangs for MoE models at tensor_parallel_size=1 — bug — by zeryx (created: 2026-02-26 03:53 (UTC+8)) [💬1]
### Your current environment
- vLLM: 0.15.1 (`vllm/vllm-openai:latest`)
- Model: `Qwen/Qwen3-30B-A3B` (128 experts, all-MoE layers)
- GPU: 2x NVIDIA RTX 5880 Ada (SM89, 49 GB each)
- Also tested with a faithful tiny Qwen3-MoE replica (~190M params, same 128-expert architecture)
### 🐛 Describe the bug
## Bug Description ...
#35336 [Refactor]: Make SSM backends use the null block (0) for padded requests instead of -1 — feature request — by LucasWilkinson (created: 2026-02-26 06:15 (UTC+8))
Currently, the GPU model runner fills the block table with -1 for padded requests on Mamba/SSM backends.
This matches the mamba kernels, which use PAD_SLOT_ID to indicate an unused block in the block table (for padded requests).
h…
#35335 [Feature]: Performance Optimization for Model Runner V2 — feature request — by yewentao256 (created: 2026-02-26 06:01 (UTC+8))
### 🚀 The feature, motivation and pitch
Tasks:
https://github.com/vllm-project/vllm/pull/35333
https://github.com/vllm-project/vllm/pull/35214
sorted + fromiter optimization
More perf optimization exploration
### Alternatives
…
#35332 [Feature]: Profiler num_steps — feature request — by Oseltamivir (created: 2026-02-26 05:20 (UTC+8))
### 🚀 The feature, motivation and pitch
/start_profile and /stop_profile are simple toggles with no parameters; some extra switches (num_steps, merge_profiles, etc.) would be useful for profiling a running server.
### Alternatives
No response
### Additional context
…
#35324 [Bug]: FusedMoE.weight_loader MXFP4 branch crashes on standard HuggingFace MoE checkpoints (per-expert 2-D weights) — bug — by zeryx (created: 2026-02-26 02:57 (UTC+8))
### Your current environment
vLLM: 0.15.1 (`vllm/vllm-openai:latest` as of 2025-02-25)
Model: `Qwen/Qwen3-30B-A3B` (128 experts, per-expert weight layout)
GPU: NVIDIA RTX 5880 Ada (SM89), also reproducible on any GPU
### 🐛 Describe the bug
…
#35287 [Bug]: An error occurred when Qwen3.5 started the model using --quantization fp8 — bug — by Alibaba-HZY (created: 2026-02-25 19:44 (UTC+8)) [💬1]
### Your current environment
The output of python collect_env.py
```text
Your output of `python collect_env.py` here
```
…
#35319 [Bug]: Multi-Node inference with PP > 1 crashes after processing completions request with non-None logprobs parameter. — bug — by JeanPaulShapo (created: 2026-02-26 02:31 (UTC+8)) [💬1]
### Your current environment
The output of python collect_env.py on the first node
```text
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
...
#35313 [Bug]: vLLM startup memory check mis-detects available VRAM (reclaimable OS memory) on UMA Systems — bug — by sunil1511 (created: 2026-02-26 01:51 (UTC+8))
### 🐛 Describe the bug
## Issue Observed on UMA GPUs (DGX Spark and GH200)
I have observed this bug on both DGX Spark and GH200 systems.
### Problem Description
vLLM defaults to requesting 90% GPU memory utilization (--gpu-memory-utilization=0.9). While it’s typically easier to achieve 90% free memory on non-UMA systems, on UMA systems the memory can be occupied by page cache and buffers, which the OS can reclaim.
…
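To illustrate the distinction raised in #35313, here is a minimal sketch that parses `/proc/meminfo`-style text and compares `MemFree` with `MemAvailable`, the kernel's estimate that also counts reclaimable page cache. This is not vLLM's startup check; the helper name `parse_meminfo` and the sample figures are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


# Hypothetical snapshot of a UMA host whose page cache holds most of the RAM.
SAMPLE = """\
MemTotal:       131072000 kB
MemFree:         20000000 kB
MemAvailable:   110000000 kB
Buffers:          2000000 kB
Cached:          90000000 kB
"""

info = parse_meminfo(SAMPLE)
# Fraction that looks free if reclaimable memory is ignored.
free_fraction = info["MemFree"] / info["MemTotal"]
# Fraction the kernel estimates could be handed out after reclaim.
available_fraction = info["MemAvailable"] / info["MemTotal"]
print(f"free={free_fraction:.3f} available={available_fraction:.3f}")
```

On a real Linux host the same function can be fed the contents of `/proc/meminfo`; a check based only on free memory would reject a system like this one even though most of the occupied memory is reclaimable.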
#35293 [Bug] GLM-5-FP8 Crash: CUDAGraph Replay Segmentation Fault — bug — by janssen-llm (created: 2026-02-25 21:00 (UTC+8)) [💬3]
### Your current environment
#35268 [Bug]: Qwen3-VL-Rerank model’s rerank API does not support query with mixed image and text inputs. — bug — by Yimi81 (created: 2026-02-25 15:05 (UTC+8)) [💬2]
### Your current environment
#35303 [Bug] CompressedTensorsWNA16MarlinMoEMethod registers g_idx params unconditionally, crashes with actorder=null AWQ MoE models — no labels — by jhsmith409 (created: 2026-02-25 23:47 (UTC+8))
## Your current environment
vLLM version: 0.16.0rc2.dev472+gee59a7c61 (nightly wheel from wheels.vllm.ai)
PyTorch version: 2.10.0+cu130
GPU: NVIDIA GeForce RTX 5090 (Blackwell, sm_120)
NVIDIA driver: 590.48
CUDA (container): 13.0
OS: Ubuntu (Docker container based on vllm/vllm-openai:latest-cu130)
## Model
…
#35285 [Bug]: Large Video Request cause vLLM Progress Core Dump — bug — by rhluo (created: 2026-02-25 19:12 (UTC+8)) [💬3]
### Your current environment
vllm 0.15.1
### 🐛 Describe the bug
...
#35295 [Bugfix Proposal]: Support MIG UUIDs in CUDA_VISIBLE_DEVICES (int() conversion and NVML handle resolution) — bug — by GanyBaba (created: 2026-02-25 21:25 (UTC+8))
### Your current environment
5.14.0-611.27.1.el9_7.x86_64 / RHEL 9.7 (Plow)
Python 3.9.25
### 🐛 Describe the bug
Hi team,
…
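As a rough sketch of the `int()` conversion problem named in the title of #35295: `CUDA_VISIBLE_DEVICES` may list integer ordinals or UUID-style identifiers (prefixed `GPU-` or `MIG-`), so calling `int()` on every entry raises `ValueError` for the latter. The helper `parse_visible_devices` and the sample identifier below are hypothetical and are not taken from the proposal.

```python
from typing import List, Union


def parse_visible_devices(value: str) -> List[Union[int, str]]:
    """Split a CUDA_VISIBLE_DEVICES string, keeping UUID entries as strings."""
    entries: List[Union[int, str]] = []
    for raw in value.split(","):
        item = raw.strip()
        if not item:
            continue
        # Plain ordinals become ints; anything else (for example a
        # "GPU-..." or "MIG-..." identifier) is kept verbatim.
        entries.append(int(item) if item.isdigit() else item)
    return entries


# Made-up identifier, for illustration only.
print(parse_visible_devices("MIG-1234abcd,1"))
```

Resolving a UUID entry to an NVML handle is a separate step that this sketch does not cover.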
#35286 [Bug]: Qwen3.5-MoE failed with enable_lora — bug — by hjh0119 (created: 2026-02-25 19:42 (UTC+8))
### Your current environment
The output of python collect_env.py
```text
Collecting environment information...
==============================
System Info
==============================
...
#35288 [Bug]: MTP speculative decoding produces corrupted output at concurrency >= 4 (V1 engine) — bug — by dkremez (created: 2026-02-25 19:52 (UTC+8))
### Your current environment
The output of python collect_env.py
```text
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
...
#35272 [Feature]: Make the Qwen3-ASR use request_prompt from user input — feature request — by minh-nguyenhoang (created: 2026-02-25 16:22 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
The current Qwen3-ASR implementation completely ignores the user prompt. Supporting it could be done simply by appending the request_prompt (perhaps with validation that the request_prompt follows the Qwen3-ASR format, e.g. language {lang}<asr_text>{text}, or, if the language is already provided (via to_lang), stripping the language part from…
#35276 [Bug]: OpenAI transcribe prompt parameter with whisper return hallucinated transcription — bug — by Matht-j (created: 2026-02-25 17:19 (UTC+8)) [💬1]
### Your current environment
vllm 0.13.0
### 🐛 Describe the bug
When I prompt Whisper with a short list (~10 elements) of cities to improve its spelling, I get a transcription full of hallucinations (repeated words or syllables). I followed the prompting guidelines but still got no good results. I tried several kinds of prompts (with and without <startofstranscript, , etc).
[Transcription without prompt.txt](ht…
#35266 [Bug]: Missing opening brace for Qwen3.5 streaming tool calls — bug — by AsterisMono (created: 2026-02-25 14:35 (UTC+8)) [💬2]
### Your current environment
The output of python collect_env.py
```text
(vllm) root@oem:~# python collect_env.py
Collecting environment information...
uv is set
==============================
...
#35267 [Feature]: Integrate RadixMLP into vLLM — feature request — by Zyyeric (created: 2026-02-25 14:45 (UTC+8))
### 🚀 The feature, motivation and pitch
I’m wondering whether vLLM would consider supporting RadixMLP-style intra-batch deduplication for the prefill path.
The high-level idea of RadixMLP is that it deduplicates position-wise computations (e.g., norms / linear projections / MLP-related activations) for tokens that have identical prefix, while running attention on the original expanded layout.
This should be complementary to vLLM’s existing prefix caching / KV-cache mechanisms.
That said, a…
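The deduplication idea described in #35267 can be pictured with a toy example. This is a sketch under strong simplifications, not the RadixMLP implementation: token sequences are plain lists of ints, the position-wise computation is a stand-in function of the prefix, and the names `dedup_positionwise` and `fn` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def dedup_positionwise(
    batch: List[List[int]],
    fn: Callable[[Tuple[int, ...]], int],
) -> Tuple[List[List[int]], int]:
    """Apply fn once per unique prefix and scatter results back per sequence.

    Returns the per-position outputs in the original (expanded) layout and
    the number of times fn was actually evaluated.
    """
    cache: Dict[Tuple[int, ...], int] = {}
    calls = 0
    outputs: List[List[int]] = []
    for seq in batch:
        row: List[int] = []
        for i in range(len(seq)):
            prefix = tuple(seq[: i + 1])
            if prefix not in cache:
                cache[prefix] = fn(prefix)  # computed once per unique prefix
                calls += 1
            row.append(cache[prefix])
        outputs.append(row)
    return outputs, calls


# Two sequences sharing the prefix [1, 2]; sum() stands in for the real
# position-wise work (norms, projections, MLP activations).
outputs, calls = dedup_positionwise([[1, 2, 3], [1, 2, 4]], fn=sum)
print(outputs, calls)
```

Six positions need only four evaluations here, because the two shared-prefix positions are computed once and reused.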
[Closed issues]
#33369 [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine — bug — by sininspira2 (closed: 2026-02-26 10:15 (UTC+8)) [💬17]
### Your current environment
The output of python collect_env.py
I don't have all of the python libraries to run on the host, and running the script inside the docker container isn't working.
I'm running vllm/vllm-openai:latest-cu130
The container has access to 2x RTX 6000 Pro Server Edition (Blackwell)
...
#32373 [Bug]: Fail to load vLLM on new NVIDIA driver — bug — by huydhn (closed: 2026-02-26 10:15 (UTC+8)) [💬40]
### Your current environment
I can’t run collect_env.py because importing PyTorch itself is failing.
### 🐛 Describe the bug
After https://github.com/vllm-project/vllm/pull/30784, I start seeing vLLM failing to load on its benchmark job. I suspect that the change doesn’t work with newer NVIDIA driver that the job is using to be compatible with CUDA 13.0
```
+—————————————————————————————–+
…
#29098 [Installation]: Failed building wheel for vllm — installation,stale — by aizyler (closed: 2026-02-26 10:46 (UTC+8)) [💬2]
### Your current environment
CMake Generate step failed. Build files cannot be regenerated correctly.
Traceback (most recent call last):
File "/root/anaconda3/envs/vllm/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 389, in <module>
main()
File "/root/anaconda3/envs/vllm/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 373, in main
json_out["return_val"] = hook(**hook_input["kwargs"])
...
#25538 [Bug]: performance regression caused by frequently preempting and resuming a request — bug,stale — by ronny1996 (closed: 2026-02-26 10:16 (UTC+8)) [💬3]
### Your current environment
The output of python collect_env.py
```text
Your output of `python collect_env.py` here
```
…
#26584 [Bug]: gpt-oss-20B output quality issues (I suspect a bug) — bug,stale — by br3no (closed: 2026-02-26 10:16 (UTC+8)) [💬7]
### Your current environment
The output of python collect_env.py
```text
/home/xxxxx/src/py/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from vllm.version import __version__ as VLLM_VERSION
Collecting environment information...
...
#27634 [Usage]: how to use --quantization option of vllm serve? — usage,stale — by Septemberlemon (closed: 2026-02-26 10:16 (UTC+8)) [💬5]
### Your current environment
```text
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version : Could not collect
CMake version : Could not collect
…
#27636 [Usage]: How does vllm preserve the special tokens in qwen3-vl — usage,stale — by qfs666 (closed: 2026-02-26 10:15 (UTC+8)) [💬2]
### Your current environment
I want to run inference of a specific model. I don’t know how to integrate it with vllm.
…
#27646 [Usage]: How to use vllm bench serve to bench remote deployed vllm models (can’t bench when ep enabled!!!) — usage,stale — by Valerianding (closed: 2026-02-26 10:15 (UTC+8)) [💬5]
### Your current environment
#27656 [Bug]: Potential out-of-bounds access in paged_attention_v1.cu and paged_attention_v2.cu — bug,stale — by molly-ting (closed: 2026-02-26 10:15 (UTC+8)) [💬3]
### Your current environment
The output of python collect_env.py
```text
Your output of `python collect_env.py` here
```
…
#27662 [Bug]: Decode ITL performance issue with DBO at batch size ~200 — bug,stale — by dagrayvid (closed: 2026-02-26 10:15 (UTC+8)) [💬2]
### Your current environment
The output of python collect_env.py
```text
/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
Collecting environment information...
===========================...
#27667 [Usage]: DeepseekOCR on CPU missing implementation for fused_topk — usage,stale — by brainlag (closed: 2026-02-26 10:15 (UTC+8)) [💬2]
### Your current environment
Trying to test whether DeepseekOCR can run on CPU using the current git main branch.
Fails because there is no implementation of fused_topk for CPU.
```
INFO 10-28 15:41:18 [v1/worker/cpu_model_runner.py:77] Warming up model for the compilation…
ERROR: Traceback (most recent call last):
File "/opt/venv/lib/python3.12/site-packages/starlette/routing.py", line 677, in lifespan
…
#27692 it run on rtx 5060 ti 16 gb — usage,stale — by bokkob556644-coder (closed: 2026-02-26 10:15 (UTC+8)) [💬3]
### Your current environment
#27696 [Bug]: OpenTelemetry Error on V1 — bug,stale — by chandlj (closed: 2026-02-26 10:15 (UTC+8)) [💬2]
### Your current environment
The output of python collect_env.py
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version : Could not collect
...
#35337 [CI] AttributeError in SMControlContextManager: torch.cuda.current_device() returns int, not device object — ci-failure — by LucasWilkinson (closed: 2026-02-26 10:01 (UTC+8))
## Name of failing test
#34395 [Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4 — bug,torch.compile,nvidia — by ProExpertProg (closed: 2026-02-26 09:59 (UTC+8)) [💬6]
### Your current environment
The output of python collect_env.py
```text
Collecting environment information...
uv is set
==============================
System Info
...
#34259 [Bug]: gpt-oss triton_kernels crashes during startup with EP enabled (sm90 default) — bug — by mgoin (closed: 2026-02-26 05:33 (UTC+8)) [💬2]
### Your current environment
The output of python collect_env.py
```text
Collecting environment information...
uv is set
==============================
System Info
...
#34526 [Bug]: accuracy issue when using multiconnector(Nixl+cpu offloading) — bug,kv-connector — by hsubramony (closed: 2026-02-25 17:26 (UTC+8)) [💬4]
### Your current environment
The output of python collect_env.py
```text
Your output of `python collect_env.py` here
```
…
#34891 [Bug]: RuntimeError: [SymmDeviceMemory] Device does not support multicasting. — bug — by RocketRider (closed: 2026-02-25 23:31 (UTC+8)) [💬15]
### Trying to load Qwen3.5-397B-A17B-FP8 causes “RuntimeError: [SymmDeviceMemory] Device does not support multicasting” on 4xH200 GPUs
#35140 [CI] Ultravox audio model HuggingFace reference produces invalid output with NaN logprobs — ci-failure — by LucasWilkinson (closed: 2026-02-25 20:06 (UTC+8)) [💬4]
## Name of failing test
#27821 [RFC]: Reorganizing ViT Abstraction and Attention Selection Logic — RFC,stale — by tjtanaa (closed: 2026-02-25 14:38 (UTC+8)) [💬3]
### Motivation.
This RFC aims to address the following issues:
The ViT is still fairly tightly coupled with the text backbone attention. This RFC continues the effort to decouple the ViT from the text backbone attention.
Another pain point is that ViT logic overrides are scattered all over the place. We should avoid overriding ViT logic in model definition classes. The platform class should define the logic of what ViT is supported and how it should be overwritten…
#34941 [CI Failure]: Intel GPU Test : examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (closed: 2026-02-25 14:22 (UTC+8)) [💬4]
### Name of failing test
Caused by external libraries (e.g. bug in transformers)
…
#35252 [Installation]: No module named 'vllm.entrypoints.anthropic.serving_messages' — installation — by timtimyim (closed: 2026-02-25 12:15 (UTC+8)) [💬1]
### Your current environment
```
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.35
…
#34296 [Usage]: vllm/vllm-openai:v0.15.1 No CUDA GPUs are available,0.10.1.1 is ok — usage — by ooodwbooo (closed: 2026-02-25 11:39 (UTC+8)) [💬11]
### Your current environment
#35350 [Model Runner V2] Add model states [1/N] — v1,nvidia — by WoosukKwon (created: 2026-02-26 10:47 (UTC+8)) [💬2 | +124/-101, 4 files | commented:1]
#35352 [Bug] Fix missing tag after tool call in MiniMax 2.1 — bug — by stingoChen (created: 2026-02-26 11:19 (UTC+8)) [💬1 | +6/-1, 1 files | commented:1]
Fix: #35349
## Purpose
This PR improves the reasoning end detection logic in MiniMaxM2AppendThinkReasoningParser.is_reasoning_end.
After a successful tool call, the following line incorrectly evaluates to `True`, which causes the subsequent `<think>` tag to be missing.
## Test Plan
```shell
CUDA_VISIBLE_DEVICES=3,4,5,6 python3 -m vllm.entrypoints.openai.a…
#35346 Cpu dispatcher — documentation,performance,rocm,needs-rebase,ci/build,v1,cpu,nvidia — by majian4work (created: 2026-02-26 10:32 (UTC+8)) [💬5 | +2700/-516, 37 files | commented:1]
Rebase https://github.com/dtrifiro/vllm/tree/cpu-build-dispatcher-cleanup
#35274 Remove bc-lint — ready,ci/build,v1 — by hmellor (created: 2026-02-25 16:53 (UTC+8)) [+0/-118, 5 files | commented:1 approved:2]
The bc-lint.yml GitHub Actions workflow was added to vLLM as a trial. It was only ever added to the following classes in vLLM:
NewRequestData
CachedRequestData
SchedulerOutput
Therefore, its coverage is very low. Also, it is consuming significant GitHub Actions resources. If we look at https://github.com/vllm-project/vllm/actions/metrics/usage we see that it is consuming:
~35% of all workflow runs
~15% of all workflow minutes
…
#35351 [XPU] special handle for pooler models w8a16 gemm — no labels — by yma11 (created: 2026-02-26 11:05 (UTC+8)) [+4/-0, 1 files | commented:1]
## Purpose
For pooler models, pooled_data may be converted to head_dtype (e.g. float), which fp8_gemm_w8a16 doesn’t support, so this PR forces a conversion.
## Test Plan
## Test Result
...
#35302 [Bugfix][Hardware][AMD] Support all MoE activations in WNA16 quantization on ROCm — bug,rocm — by c0de128 (created: 2026-02-25 23:36 (UTC+8)) [💬2 | +1/-5, 1 files | commented:1]
## Summary
Removes the SiLU-only assertion in MoeWNA16Method.apply() and passes layer.activation through to fused_experts(), enabling GPTQ INT4 MoE models with non-SiLU activations (e.g., SWIGLUSTEP) to run on ROCm.
On ROCm, Marlin MoE is disabled (check_moe_marlin_supports_layer returns False), so all GPTQ INT4 MoE models go through the WNA16 path. The WNA16 path had a hardcoded assert layer.activation == MoEActivation.SILU that blocked models using other activations, even though …
During compile caching, pickling a graph module fails when it contains a
node like op(node, slice(0, symint_0), node), because pytree is used
internally to tree-map nodes to _NodePickleData and slice() is treated as a
leaf node. This patches vLLM to handle slice() as a non-leaf until it is
fixed in torch 2.12.
#35348 Fix stale self._positions during piecewise CUDA graph replay in MLA — frontend,v1,nvidia — by tlrmchlsmth (created: 2026-02-26 10:37 (UTC+8)) [+380/-45, 14 files | commented:1 | 📝draft]
## Summary
Fixes a shape mismatch crash in rotary embedding (RuntimeError: The size of tensor a (8192) must match the size of tensor b (475)) when running DeepSeek-R1 with FP8 KV cache, piecewise CUDA graphs, and the fused RoPE+quant path
Root cause: self._positions = positions in MLAAttention.forward() is a Python-level side effect inside a piecewise-compiled graph. During CUDA graph replay, this assignment is not re-executed, leaving self._positions pointing to a stale view from a…
## Purpose
Fix the Qwen 3.5 tool-calling problem, where the JSON is reported as malformed.
## Test Plan
For example, run
vllm serve Qwen/Qwen3.5-27B --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --enable-auto-tool-choice --tool-call-parser qwen35_coder
and use tool calling
…
#35339 [Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py — frontend,ready,gpt-oss — by sfeng33 (created: 2026-02-26 06:59 (UTC+8)) [💬2 | +1040/-990, 6 files | commented:1 approved:1]
## Purpose
parser/harmony_utils.py had grown into a mixed bag — it contained both shared Chat Completion parsing logic and Responses API-specific conversion code (input→harmony, harmony→response output items). This meant the parser/ module had direct dependencies on Responses API types (ResponseFunctionToolCall,
ResponseOutputMessage, McpCall, etc.) that don’t belong there.
### This PR
Extracts Responses API-specific Harmony conversion logic from parser/harmony_utils.py into a d…
When a Qwen3-VL prompt ends with punctuation (e.g., . or :) immediately before a video placeholder, the HF processor expands the placeholder into timestamp markers like <0.3 seconds><|vision_start|>.... The < in the timestamp text merges with the preceding punctuation (e.g., .< becomes a single token), breaking exact token matching in _find_mm_placeholders.
### Approach
Add a _find_mm_placeholders override in Qwen3VLMultiModalProcessor following the s…
#35271 [Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — v1,nvidia — by chaunceyjiang (created: 2026-02-25 16:19 (UTC+8)) [💬1 | +175/-28, 3 files | commented:4]
## Purpose
#35260 [Frontend] Prepare diarized transcription using OpenAI contract with diarized_json and stream events — frontend — by lucaschr21 (created: 2026-02-25 11:57 (UTC+8)) [💬1 | +483/-20, 6 files | commented:1]
## Purpose
This PR prepares vLLM’s OpenAI-compatible speech-to-text frontend for diarized transcription outputs by adding support for response_format="diarized_json" and its related streaming event contract.
The goal is to make diarized transcription behavior ready and standardized so that models that can transcribe and diarize can plug into this path and return responses/events aligned with OpenAI’s contract.
For example, models such as vibevoice-asr (which support transcription +…
Fixes AttributeError: 'int' object has no attribute 'index' in SMControlContextManager.__init__() when running with data parallel and expert parallel enabled (DBO tests).
## Root Cause
PR #35042 introduced a bug where torch.cuda.current_device().index is called, but torch.cuda.current_device() returns an int directly, not a device object with an .index attribute.
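The type mismatch described above can be guarded against with a small normalizing helper. The sketch below is a hypothetical illustration, not the change made in the PR: `device_index` is an invented name, and `SimpleNamespace` stands in for a device-like object so that the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def device_index(device) -> int:
    """Return the device ordinal from a plain int or an object with `.index`."""
    if isinstance(device, int):
        # torch.cuda.current_device() already returns the ordinal as an int.
        return device
    return device.index


# Stand-ins for the two shapes a caller might pass in.
as_int = 0
as_object = SimpleNamespace(index=1)
print(device_index(as_int), device_index(as_object))
```

Calling `.index` directly on the int, as in the root cause above, raises the `AttributeError` quoted in the PR description.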
Fix outdated links in source code, replacing them with correct ones
#35345 Fix MoE models in EP mode on Ascend — no labels — by ylyhyqsl (created: 2026-02-26 09:39 (UTC+8)) [+10/-6, 2 files | commented:1]
Fix MoE models using bare tensor_model_parallel_all_reduce produce garbage output in EP mode on Ascend#6710
## Purpose
MoE models that use FusedMoE with reduce_results=False and then manually call tensor_model_parallel_all_reduce() in their MoE forward method produce garbage output when running with --enable-expert-parallel on Ascend NPU.
## Suggested fix
The fix should be applied in the upstream vLLM model files. Replace:
This PR https://github.com/vllm-project/vllm/pull/29287 does not include https://github.com/vllm-project/vllm/pull/32175, so, it will cause batch_size_next_n == batch_size * next_n
cc @ganyi1996ppo
## Test Plan
```
# Prefill
…
#35343 feat(kv-offload): Strategy B — AdaptiveOffloadingPolicy + non-blocking loads — v1,kv-connector — by Srinivasoo7 (created: 2026-02-26 08:59 (UTC+8)) [💬2 | +482/-15, 4 files | commented:1]
Reduces TTFT regression from CPU KV cache loading via two mechanisms:
AdaptiveOffloadingPolicy (scheduler-side):
Measures TTFT during a warm-up phase (no offloading) to capture a clean baseline.
After warm-up, monitors rolling P50 TTFT over a sliding window.
Auto-pauses offloading if TTFT regresses > overhead_threshold_pct (default 5%) above baseline; auto-resumes when it clears.
Propagates effective_load_mode ('blocking' | 'async_with_fallback') to the worker each step via…
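The pause/resume behaviour listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's AdaptiveOffloadingPolicy: the class name `AdaptivePolicySketch`, the `record_ttft` method, the sample-count warm-up, and the window size are all hypothetical, and the baseline is taken as the median of the warm-up samples.

```python
from collections import deque
from statistics import median
from typing import Deque, List, Optional


class AdaptivePolicySketch:
    """Pause offloading when the rolling median TTFT exceeds the baseline."""

    def __init__(self, warmup_samples: int = 4, window: int = 4,
                 overhead_threshold_pct: float = 5.0) -> None:
        self.warmup_samples = warmup_samples
        self.overhead_threshold_pct = overhead_threshold_pct
        self.baseline: Optional[float] = None
        self.paused = False
        self._warmup: List[float] = []
        self._window: Deque[float] = deque(maxlen=window)

    def record_ttft(self, ttft_ms: float) -> None:
        if self.baseline is None:
            # Warm-up phase: offloading stays off while the baseline is measured.
            self._warmup.append(ttft_ms)
            if len(self._warmup) >= self.warmup_samples:
                self.baseline = median(self._warmup)
            return
        # Steady state: compare the rolling P50 with the allowed overhead.
        self._window.append(ttft_ms)
        limit = self.baseline * (1 + self.overhead_threshold_pct / 100)
        self.paused = median(self._window) > limit

    @property
    def offloading_enabled(self) -> bool:
        return self.baseline is not None and not self.paused


policy = AdaptivePolicySketch()
for sample in (100.0, 100.0, 100.0, 100.0):
    policy.record_ttft(sample)
policy.record_ttft(120.0)  # a regression well above the 5% threshold
print(policy.offloading_enabled)
```

Using the median rather than the mean keeps a single slow request from toggling the policy once the window has filled.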
#35342 feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency — v1 — by Srinivasoo7 (created: 2026-02-26 08:58 (UTC+8)) [💬2 | +630/-2, 6 files | commented:1]
Adds a decorator OffloadingManager that transparently wraps any backing manager (LRU, ARC) and suppresses GPU->CPU stores for blocks whose hash has not been seen at least store_threshold (default: 2) times.
vllm/v1/kv_offload/cpu.py – get_manager() wraps backing with
StoreR…
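The gating idea in #35342 can be sketched as a thin wrapper that counts how often each block hash appears and forwards a store only once the count reaches the threshold. This is an illustrative sketch, not the PR's StoreReusedOffloadingManager: the class `ReuseGatedStore`, its `store` method, and the choice to count sightings at store time are assumptions.

```python
from collections import Counter
from typing import Callable, Iterable, List


class ReuseGatedStore:
    """Forward a block to the backing store only after enough sightings."""

    def __init__(self, backing_store: Callable[[List[str]], None],
                 store_threshold: int = 2) -> None:
        self._backing_store = backing_store
        self._store_threshold = store_threshold
        self._seen: Counter = Counter()

    def store(self, block_hashes: Iterable[str]) -> List[str]:
        eligible: List[str] = []
        for block_hash in block_hashes:
            self._seen[block_hash] += 1
            # Blocks seen fewer than store_threshold times are suppressed.
            if self._seen[block_hash] >= self._store_threshold:
                eligible.append(block_hash)
        if eligible:
            self._backing_store(eligible)
        return eligible


stored: List[str] = []
gate = ReuseGatedStore(stored.extend, store_threshold=2)
first = gate.store(["a", "b"])   # first sighting of both hashes
second = gate.store(["a"])       # second sighting of "a"
print(first, second, stored)
```

With one-shot traffic every hash is seen once, so nothing reaches the backing store and no transfer is issued.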
#35261 [MIX] More accurate processing of quantization of mlp.gate — qwen — by vadiklyutiy (created: 2026-02-25 12:55 (UTC+8)) [💬4 | +15/-1, 2 files | commented:1]
Addon to #35156.
Mark mlp.gate as non-quant for ModelOpt models only.
Reduce CPU KV cache offloading overhead to near-zero on workloads where blocks are not reused, while preserving full offload benefits for prefix-cache-friendly workloads.
The existing OffloadingConnector unconditionally stores every evicted GPU block to CPU and blocks the scheduling step waiting for PCIe transfers to complete. On one-shot (non-reuse) traffic this wastes PCIe bandwidth with zero benefit. This PR introduces three coordinated strategies:
### Strategy A (P0) — Skip U…
#35270 [XPU][NIXL] Add GPUDirect RDMA support for XPU — ci/build,kv-connector — by zhenwei-intel (created: 2026-02-25 15:52 (UTC+8)) [💬3 | +18/-7, 4 files | commented:8]
## Purpose
Add GPUDirect RDMA support for XPU in NIXL connector.
Requirements:
UCX must include the fix from https://github.com/openucx/ucx/pull/11187.
Limitations:
…
#35341 [Bug] FA2 is not supported for NVIDIA Blackwell architecture — bug,v1,nvidia — by olka (created: 2026-02-26 07:53 (UTC+8)) [💬1 | +3/-0, 1 files | commented:1]
## Purpose
At the moment FA2 support in Blackwell is broken:
```
Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2
Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2
Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2
Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported …
#35278 [Bugfix] Fix broken online quantization due to weights loading tracking — bug,ready,needs-rebase — by Isotr0py (created: 2026-02-25 17:28 (UTC+8)) [💬1 | +7/-3, 1 files | commented:1 approved:1]
## Purpose
#35340 [Feature] Add num_steps and delay_steps parameters to /start_profile endpoint — frontend,v1,cpu — by sarathbrp (created: 2026-02-26 07:14 (UTC+8)) [💬2 | +225/-23, 10 files | commented:1]
## Purpose
Adds runtime-configurable num_steps and delay_steps parameters to the /start_profile API endpoint, allowing users to override the server’s max_iterations and delay_iterations profiler config on a per-session basis without restarting the server.
#35322 [ROCm][CI] Amending deletion of AMD mirror — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-26 02:47 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1]
This mirror was deleted accidentally from:
https://github.com/vllm-project/vllm/pull/34923
This PR reverts that.
#35265 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (created: 2026-02-25 14:19 (UTC+8)) [💬1 | +314/-150, 4 files | commented:4 approved:1]
This PR fixes the attention backend handling in test_eagle_correctness for ROCm platforms:
Previously, the test would skip entirely when running Llama-4-Scout with FLASH_ATTN on ROCm. Instead of skipping, we now fall back to the FLEX_ATTENTION backend explicitly, allowing the test to actually run on ROCm.
Also previously, the test skipped DeepSeek models when using ROCM_AITER_FA on ROCm. Now we enable the AITER path (VLLM_ROCM_USE_AITER=1), unset VLLM_MLA_DISABLE, and switch to the…
Part of https://github.com/vllm-project/vllm/issues/35335
Pre-allocate the GPU/CPU memory
Optimize copy logic (the temporary pin in async_copy_to_gpu is expensive)
## Test
```bash
…
#35334 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends — documentation,rocm,v1 — by gshtras (created: 2026-02-26 05:59 (UTC+8)) [💬1 | +106/-7, 3 files | commented:1]
Another prerequisite to #33271
With this, unit tests that use encoder attention, such as test_run_batch.py, and encoder-decoder models, such as examples/offline_inference/audio_language.py -m whisper, now work with ROCM_ATTN and ROCM_AITER_UNIFIED_ATTN
#35328 [Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper — ready,v1 — by njhill (created: 2026-02-26 03:45 (UTC+8)) [+24/-29, 2 files | commented:1]
Noticed this while working on other items.
In `prepare_nvfp4_moe_layer_for_fi_or_cutlass` (`flashinfer_fp4_moe.py:514-515`), input scales are broadcast via `expand()`:
```python
a13_scale = a13_scale.max().to(torch.float32).expand(num_experts) ...
Optimize maxsim scores computation for pooling models
Originally it was computed on the CPU; it is now computed on the GPU using a batched version, which improves performance substantially.
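For reference, a MaxSim score in the usual late-interaction sense sums, over the query token vectors, the maximum dot product against any document token vector. The pure-Python sketch below shows only that definition, assuming it is the quantity the PR computes; it is not the GPU implementation, and the names `maxsim` and `maxsim_batched` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def maxsim_batched(query: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[float]:
    """Score one query against several documents."""
    return [maxsim(query, doc) for doc in docs]


query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[1.0, 0.0], [0.5, 0.5]],  # two document tokens
    [[0.0, 1.0]],              # one document token
]
scores = maxsim_batched(query, docs)
print(scores)
```

A GPU version would typically replace the nested loops with a single batched matrix multiply followed by a max and a sum reduction.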
## Test
### Acc
…
#35275 [Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic — bug,ready — by Josephasafg (created: 2026-02-25 16:54 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
Fixed SSM state corruption for Mamba models under high concurrency.
SSMParamsBase::index_t was uint32_t, causing cache_index * ssm_states_batch_stride to overflow for large cache indices.
This made the kernel write SSM state to wrong memory addresses, leaving the intended slots zeroed. Subsequent decode steps read stale/zero state, producing garbage/incorrect output.
The fix changes index_t from uint32_t to size_t (64-bit).
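The wraparound described above can be reproduced with plain integer arithmetic. The sketch below emulates 32-bit and 64-bit unsigned multiplication by masking; the stride and cache index are hypothetical values chosen so that the product exceeds 2^32, not figures from the kernel.

```python
UINT32_MASK = 0xFFFFFFFF
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def offset_uint32(cache_index: int, stride: int) -> int:
    """Offset as computed when the index type is a 32-bit unsigned integer."""
    return (cache_index * stride) & UINT32_MASK


def offset_uint64(cache_index: int, stride: int) -> int:
    """Offset as computed when the index type is a 64-bit unsigned integer."""
    return (cache_index * stride) & UINT64_MASK


cache_index = 5000   # hypothetical cache slot
stride = 1_000_000   # hypothetical ssm_states_batch_stride, in elements

wrapped = offset_uint32(cache_index, stride)
correct = offset_uint64(cache_index, stride)
print(wrapped, correct)
```

The 32-bit result lands far below the intended offset, which is how a write can hit the wrong slot while the intended one stays zeroed.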
## Test Plan
## Test Result
…
#35298 [BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 — bug — by ofirzaf (created: 2026-02-25 21:36 (UTC+8)) [💬2 | +0/-3, 1 files | commented:1 approved:1]
## Purpose
The environment variable setting IGC_ForceOCLSIMDWidth=16 was introduced in #30538 to reduce memory usage when compiling sample_recovered_tokens_kernel.
With the introduction of copy_and_expand_eagle_inputs_kernel in #32887 speculative decoding with draft model stopped working on XPU.
Furthermore, speculative decoding with sampling (T>0) didn’t work with either Eagle/draft model, and Qwen3 + SD didn’t work due to the same issue of OOM in scratchpad.
It turns out that there i…
#35327 Fix deprecated v1 config tests — no labels — by jcaip (created: 2026-02-26 03:41 (UTC+8)) [+1/-17, 1 files | commented:1 approved:1]
Updated the model name in test_pre_quantized_model and removed the test_opt_125m_int4wo_model_per_module_quant function.
## Purpose
We recently deprecated the v1 configs for Int4WeightOnlyConfig and Float8DynamicActivationFloat8WeightConfig, but there are two model checkpoints still using the v1 configs in the tests:
drisspg/fp8-opt-125m
jerryzh168/opt-125m-int4wo-per-module
This PR changes drisspg/fp8-opt-125m to use [torchao-testing/opt-125m-Float8WeightOnlyConfig-v2-0.15.0](https:…
#35317 Add @BoyuanFeng to CODEOWNERS — ci/build — by BoyuanFeng (created: 2026-02-26 02:19 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:3]
I am happy to maintain vLLM-torch.compile integration and review PRs.
Thank you all for the opportunity! Special thanks to @zou3519 @ProExpertProg @youkaichao @LucasWilkinson @robertgshaw2-redhat @yewentao256 @mgoin @hmellor @njhill @tlrmchlsmth @bigPYJ1151 @heheda12345 @houseroad @ywang96, and many more for their collaboration and support!
#35320 [CI] Add nsys profiling support with NVTX tracing for Buildkite CI — needs-rebase,ci/build — by ojhaanshika (created: 2026-02-26 02:40 (UTC+8)) [💬1 | +102/-0, 4 files | commented:1 | 📝draft]
## Purpose
Add nsys profiling support for Buildkite CI, allowing developers to profile specific test steps with NVIDIA Nsight Systems.
This introduces:
VLLM_ENABLE_LAYERWISE_NVTX_TRACING environment variable in vllm/envs.py to enable layerwise NVTX tracing without requiring the --enable-layerwise-nvtx-tracing CLI flag
A model validator in ObservabilityConfig that automatically applies the env var override
.buildkite/scripts/run-with-profiling.sh — a wrapper script that conditio…
Temporary hack to be removed: Force MoE to use un-quantized path instead of Quark at this location.
#35264 [KV Connector]: Support KV push from Prefill to Decode node using Nixl KV Connector — v1,kv-connector — by snadampal (created: 2026-02-25 13:44 (UTC+8)) [💬1 | +830/-13, 4 files | commented:1]
Implemented a KV push feature in which the Prefill node pushes its KV blocks to the Decode node as soon as the model executor completes the forward pass and finishes the request.
The implementation covers both scenarios:
Scenario 1: D registers blocks with P before P finishes generating KV
Scenario 2: P has the KV ready before D registers
## Purpose
To improve TTFT for P-D disaggregated inference deployment.
…
#35326 [MoE Refactor] Initial split of DefaultMoERunner class — nvidia — by bnellnm (created: 2026-02-26 03:39 (UTC+8)) [+1305/-823, 33 files | commented:1 | 📝draft]
## Purpose
Split DefaultMoERunner into two more classes:
MoERunnerBase that contains common code for all runners.
ChunkedMoERunner a runner that handles DP chunking. Inherits from MoERunnerBase
DefaultMoERunner inherits from MoERunnerBase and only handles the non-chunked/naive execution path.
Based off https://github.com/vllm-project/vllm/pull/35153
Add FSM (Finite State Machine) speculative decoding support to vLLM v1. This enables fast-forwarding through deterministic token sequences by using a state machine to propose draft tokens, reducing latency for structured outputs like JSON or templated text.
Key features:
FSM-based draft token proposal with no rejection for deterministic paths
Support for branching (multiple valid next tokens) and wildcards (any token allowed)
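The idea of proposing draft tokens only along deterministic paths can be sketched as follows. This is a minimal illustration, not the PR's implementation: the transition-table layout, the `WILDCARD` marker, and the function name `propose_draft` are all hypothetical.

```python
from typing import Dict, List, Optional

WILDCARD = -1  # hypothetical marker meaning "any token is allowed here"

# Hypothetical FSM layout: state -> {token_id: next_state}.
Transitions = Dict[int, Dict[int, int]]


def propose_draft(fsm: Transitions, state: int, max_tokens: int) -> List[int]:
    """Follow the FSM while exactly one concrete next token is possible."""
    draft: List[int] = []
    current: Optional[int] = state
    while current is not None and len(draft) < max_tokens:
        edges = fsm.get(current, {})
        # Stop at branches, wildcards, and terminal states: the next token
        # is not deterministic there, so nothing can be fast-forwarded.
        if len(edges) != 1 or WILDCARD in edges:
            break
        token, current = next(iter(edges.items()))
        draft.append(token)
    return draft


# Three forced tokens (10, 11, 12) followed by a wildcard position.
fsm = {0: {10: 1}, 1: {11: 2}, 2: {12: 3}, 3: {WILDCARD: 4}}
print(propose_draft(fsm, 0, max_tokens=8))
```

In this sketch the drafter simply stops at a branch or wildcard; how the actual PR handles those cases is not shown in the excerpt above.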
PR #35074 introduced track_weights_loading() to validate all registered parameters are loaded from checkpoint. This broke loading of GPTQ/AWQ MoE models using the Marlin backend with the following error:
```
ValueError: Following weights were not initialized from checkpoint:
{'model.layers.0.mlp.experts.w13_g_idx_sort_indices',
'model.layers.0.mlp.experts.w2_g_idx_sort_indices',
'model.layers.0.mlp.experts.w13_weight_g_idx',
'model.layers.0.mlp.experts.w2_weight_g_idx', …}
…
```
Based on the logic of https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575, we can greatly simplify the logic of SLA scan. Instead of trying to match each SLA threshold directly, we can just scan through the space of latency vs. throughput uniformly and let users plot the figures themselves to get the SLA thresholds for whatever metrics they want.
Removed --sla-params and replaced it with --sla-iters to control how to sweep through the spac…
#35323 Add Qwen3.5-MoE (qwen3_5_moe) support via Qwen3-Next backend — documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector — by blake-snc (created: 2026-02-26 02:56 (UTC+8)) [💬4 | +2597/-363, 38 files | commented:1]
## Summary
Adds support for Qwen3.5-MoE models (model_type: qwen3_5_moe, Qwen3_5MoeForConditionalGeneration) which use the same hybrid GDN (GatedDeltaNet) architecture as Qwen3-Next but ship with a different HuggingFace config layout and checkpoint weight format.
Key changes:
Config: Qwen3_5MoeConfig that flattens the outer text_config wrapper and delegates to Qwen3NextConfig
Registry: Register qwen3_5_moe model type and both ForConditionalGeneration / `ForCausal…
#35318 [Refactor] Remove dead or duplicate func utils or variables — performance,ready,v1,nvidia — by yewentao256 (created: 2026-02-26 02:21 (UTC+8)) [+0/-199, 9 files | commented:1]
## Purpose
#35312 [Core] Fix gpu_worker.py pre-commit errors — ready,v1 — by njhill (created: 2026-02-26 01:41 (UTC+8)) [+6/-2, 1 files | commented:1 approved:1]
Errors appeared when making some unrelated modifications to the file.
#35309 Revert “[Misc] Enable weights loading tracking for quantized models” — ready — by LucasWilkinson (created: 2026-02-26 01:16 (UTC+8)) [+4/-15, 1 files | commented:1 approved:1]
Reverts vllm-project/vllm#35074, this PR broke a ton of quantized models weight loading in the nightly https://buildkite.com/vllm/ci/builds/53087/ .
We should re-land this with ready-run-all-tests
cc @Isotr0py @DarkLight1337
#35297 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model — documentation,new-model,multi-modality,llama,nvidia — by jzakrzew (created: 2026-02-25 21:34 (UTC+8)) [💬4 | +545/-31, 8 files | commented:4]
## Purpose
Add support for the nvidia/llama-nemotron-embed-vl-1b-v2 embedding model. The model is quite similar to already implemented LlamaNemotronVLChatModel, but not exactly compatible. I tried to reuse as much of the existing code as possible, modifying the code in nemotron_vl.py to facilitate that.
## Test Plan
```
pytest tests/models/multimodal/pooling/test_llama_nemotron_vl_embed.py tests/mod…
```
#35307 [CI][HPU] Pin vllm to last-good-commit compatible with vllm-gaudi — ci/build — by PatrykWo (created: 2026-02-26 00:21 (UTC+8)) [💬3 | +13/-1, 1 files | commented:1]
The HPU CI test installs the vllm-gaudi plugin on top of the vllm upstream checkout. Recent upstream changes (e.g. kernel abstraction reorganisation in #34055) can break the plugin before it has been updated to match.
Fix: fetch VLLM_STABLE_COMMIT from the vllm-gaudi repository’s tracking branch and check out that commit inside the Docker image. This pins the vllm baseline to a version vllm-gaudi is known to work with, while the CI commit still reflects the triggered build.
Fix CompressedTensorsWNA16MarlinMoEMethod.create_weights() to only register g_idx parameters when self.actorder is set, preventing ValueError from track_weights_loading() when loading compressed-tensors AWQ MoE models with actorder=null
Fix process_weights_after_loading() to derive num_experts/device from w13_weight_packed i…
#35304 [Bugfix][Hardware][AMD] Fix startup hang on ROCm gfx1151 in MinTokensLogitsProcessor — bug,rocm,v1 — by c0de128 (created: 2026-02-25 23:49 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1]
## Summary
Fixes an indefinite hang during vLLM startup on AMD Ryzen AI MAX+ 395 (gfx1151) by creating neg_inf_tensor on CPU first and transferring to device with non_blocking=True.
In MinTokensLogitsProcessor.__init__(), the neg_inf_tensor scalar is created directly on the ROCm device via torch.tensor(..., device=self.device). On gfx1151/gfx1150, this triggers HIP kernel compilation that hangs indefinitely — confirmed by py-spy traces showing 100% CPU time stuck at this line.
When max_completion_tokens cuts output before </think> is generated, extract_reasoning() assumes thinking is disabled and returns all tokens as content. This is incorrect — the model was reasoning, it just got truncated.
The fix reads enable_thinking from chat_template_kwargs (mirroring the approach used by DeepSeekV3ReasoningParser) and uses it to determine behavior when </think> is absent:
Fix Qwen3ReasoningParser.extract_reasoning() returning truncated reasoning tokens as content instead of reasoning when max_completion_tokens cuts off the output before </think> is generated.
When thinking is enabled and the output is truncated, the current code sees no </think> and assumes thinking was disabled — returning (None, model_output). This means the API response has content filled with raw chain-of-thought and empty reasoning_content, which is wrong.
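The intended behavior can be sketched roughly as below. This is a simplified stand-in, not the parser's real code: the function signature, the handling of the closing tag, and the `(model_output, None)` return value for the truncated case are assumptions made for illustration, and any opening `<think>` tag handling is omitted.

```python
from typing import Optional, Tuple

END_TAG = "</think>"


def extract_reasoning(
    model_output: str, enable_thinking: bool
) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, content) for a possibly truncated output."""
    if END_TAG in model_output:
        reasoning, _, content = model_output.partition(END_TAG)
        return reasoning or None, content or None
    if enable_thinking:
        # No closing tag but thinking was on: the output was cut off
        # mid-reasoning, so treat everything as reasoning.
        return model_output, None
    # Thinking disabled: the whole output is plain content.
    return None, model_output


print(extract_reasoning("step 1, step 2", enable_thinking=True))
print(extract_reasoning("step 1</think>answer", enable_thinking=True))
```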
…
#35294 [Model Runner V2] support dp & ep for spec decoding — v1,nvidia — by izhuhaoran (created: 2026-02-25 21:03 (UTC+8)) [+119/-57, 4 files | commented:1]
Further improve based on #35248
The key change is to trigger speculator.propose() in _dummy_run (called by execute_dummy_batch on idle DP ranks). Without this, when one DP rank has requests and enters speculative decoding, the idle rank only runs the target model forward in its dummy batch — the speculator’s EP communication has no counterpart, causing hang.
Additionally, num_tokens_across_dp in the speculator must be properly synced across DP ranks via all-reduce. When the target mode…
Update example mm_processor_kwargs to use "size", which is preferred.
Add back-compatibility for passing min_pixels and max_pixels instead of size.shortest_edge and size.longest_edge.
CLOSE https://github.com/vllm-project/vllm/pull/35269#pullrequestreview-3852541054
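The back-compatibility mapping described above could look roughly like this. It is a sketch under assumptions: the helper name `normalize_size_kwargs` is hypothetical, and letting an explicit `size` entry take precedence over the legacy keys is a design choice made here, not something the PR text confirms.

```python
from typing import Any, Dict


def normalize_size_kwargs(mm_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Map legacy min_pixels/max_pixels onto size.shortest_edge/longest_edge."""
    out = dict(mm_kwargs)
    size = dict(out.get("size") or {})
    # Legacy keys are always removed; explicit "size" entries win (assumed).
    if "min_pixels" in out:
        size.setdefault("shortest_edge", out.pop("min_pixels"))
    if "max_pixels" in out:
        size.setdefault("longest_edge", out.pop("max_pixels"))
    if size:
        out["size"] = size
    return out


print(normalize_size_kwargs({"min_pixels": 3136, "max_pixels": 1003520}))
```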
## Test Plan
## Test Result
…
#35291 [Bugfix] Enable stats logging with multiple API servers (DP>1) — bug,v1 — by LZZhang-HW (created: 2026-02-25 20:43 (UTC+8)) [💬2 | +2/-5, 1 files | commented:1]
## Summary
When `data_parallel_size > 1`, `api_server_count` defaults to `data_parallel_size`, causing `client_count > 1`. Previously this disabled **all** stats logging with a warning:
> "AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats."
This means prefix cache hit rate, speculative decoding metrics (acceptance rate, mean acceptance length), and throughput stats are completely invisible in DP>1 deployments.
## Fix ...
#35283 [Test] Add tests for n parameter in chat completions API — no labels — by KrxGu (created: 2026-02-25 18:43 (UTC+8)) [💬2 | +228/-0, 1 files | commented:8]
## Summary
This PR adds comprehensive tests for the n parameter in the chat completions API to verify that multiple completion choices are correctly returned (#34305).
## Details
Integration tests (test_chat_n_parameter.py): Full end-to-end tests with server
Non-streaming with n=3
Streaming with n=2
Different seeds producing diverse outputs
…
#35269 [PaddleOCR-VL] improve image_processor size compatibility for Transformers v5.0 — no labels — by zhang-prog (created: 2026-02-25 15:35 (UTC+8)) [💬4 | +2/-2, 1 files | commented:2]
In Transformers v5.0, the keys in PaddleOCRVLImageProcessor.size have been updated from min_pixels/max_pixels to shortest_edge/longest_edge.
This PR enhances the compatibility of the size parameter to ensure a smooth transition to v5.0 while maintaining backward compatibility.
We can see that the first link in the table here is broken :)
Also applied a suggestion by the gemini bot.
## Test Plan
…
#35282 [Core] Add chunking for audio over 30s on offline inference, — frontend,needs-rebase — by sangbumlikeagod (created: 2026-02-25 18:00 (UTC+8)) [💬1 | +130/-4, 2 files | commented:1 | 📝draft]
## Purpose
Currently, vLLM supports ASR (Automatic Speech Recognition) inference for audio segments exceeding 30 seconds, but this is not yet supported for offline inference.
I have been working on a PR to address this (see my comment here: [link]), but in the meantime, another PR (#34628) covering similar ground was merged.
…
## Purpose
Add base64 image support for custom datasets.
## Test Plan
## Test Result
pass
#35279 [Perf][Benchmark] Add uniform top-k scalar fast path and profiling co… — performance,v1 — by MatteoFari (created: 2026-02-25 17:28 (UTC+8)) [+154/-27, 6 files | commented:1]
## Purpose
Part of #32455
This PR speeds up top-k sampling for a common case and adds better benchmark reporting. It uses a new top_k_scalar path when top-p is off and batch top-k is uniform, while keeping outputs the same.
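The eligibility check for such a fast path might look like the sketch below. The function name `uniform_top_k` is hypothetical, and treating `top_p == 1.0` as "top-p is off" is an assumed convention rather than something stated in the PR.

```python
from typing import List, Optional


def uniform_top_k(top_k: List[int], top_p: List[float]) -> Optional[int]:
    """Return the shared k if a scalar fast path applies, else None."""
    if not top_k:
        return None
    # top-p counts as "off" when every request uses p == 1.0 (assumed).
    if any(p < 1.0 for p in top_p):
        return None
    first = top_k[0]
    # The fast path needs the same k for every request in the batch.
    return first if all(k == first for k in top_k) else None


print(uniform_top_k([50, 50, 50], [1.0, 1.0, 1.0]))
print(uniform_top_k([50, 20, 50], [1.0, 1.0, 1.0]))
```

When the function returns `None`, the caller would use the general per-request path, so outputs stay the same either way.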
## Test Plan
On Intel XPU TBX (Translation Based eXecution) simulated devices, torch.xpu.mem_get_info() raises a RuntimeError because the simulated device does not support querying available free memory:
RuntimeError: The device (Intel(R) Graphics [0x674c]) doesn't support querying the available free memory.
This causes vLLM to crash during worker initialization when MemorySnapshot calls current_platform.mem_get_info(device).
…
#35263 fix(docs): update broken perf.vllm.ai link to PyTorch HUD — performance,ci/build — by machov (created: 2026-02-25 13:31 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
Fixes #31710
The URL referenced in is no longer accessible and returns connection errors.
## Changes
Updated broken link to point to the correct vLLM Performance Dashboard:
Made the link text consistent with other documentation in the repository that already uses the correct PyTorch HUD URL
## Testing
Verified that the new URL is accessible and shows the vLLM performance benchmarks. This URL format is already used in other documentation files throughout the repository (docs/benchmarki…
## Purpose
Flashinfer has a new allreduce API: https://github.com/flashinfer-ai/flashinfer/pull/2130
and a mnnvl backend for allreduce optimized for multi-node NVLink cases (https://github.com/flashinfer-ai/flashinfer/pull/1213).
The mnnvl backend performs as well as or better than the old trtllm backend that vLLM currently uses.
The new …
#33992 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs — bug,documentation,ready,ci/build,nvidia — by ehfd (merged: 2026-02-26 10:15 (UTC+8)) [💬36 | +334/-5, 6 files | commented:6 approved:2]
## Purpose
Fix #32373
Fix #33369
Closes #34226
Fixes the core problem in https://github.com/vllm-project/vllm/issues/32373 and https://github.com/vllm-project/vllm/issues/33369, introduced from https://github.com/vllm-project/vllm/pull/30784 and https://github.com/vllm-project/vllm/pull/33116.
The issue with the CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs:
(1) Consumer NVIDIA GPUs must NOT have the CUDA compatibility libraries inside LD_LIBRARY_PATH.
…
#34134 openpangu-vl support video input — ready,multi-modality — by hujiaxin0 (merged: 2026-02-26 11:08 (UTC+8)) [💬8 | +87/-0, 1 files | commented:9 approved:1]
## Purpose
This is a method for calculating the Pangu video sampling frame, which is different from the existing method. total_frames_num indicates the total number of frames. It is more reasonable to start the calculation from frame 0 corresponding to 0 seconds.
Assume that total_frames_num = 10 and fps = 1.
The timestamp of frame 0 is 0.0.
The timestamp of frame 1 is 1.0.
The timestamp of frame 9 (the last frame) should be 9.0, that is, (total_frames_num - 1) / o…
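The worked example can be reproduced with a short sketch. It assumes the timestamp of frame `i` is `i / fps`, which matches the three values listed above; the function name `frame_timestamps` is hypothetical and the PR's actual sampling code is not shown here.

```python
from typing import List


def frame_timestamps(total_frames_num: int, fps: float) -> List[float]:
    """Timestamp in seconds of each frame, with frame 0 at 0.0 s."""
    return [i / fps for i in range(total_frames_num)]


ts = frame_timestamps(total_frames_num=10, fps=1.0)
print(ts[0], ts[1], ts[-1])
```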
#35210 [BugFix] Fix fp4 quant kernel on CUDA 12.8 — bug,ready,nvidia,quantization — by LopezCastroRoberto (merged: 2026-02-26 10:32 (UTC+8)) [+11/-7, 2 files | commented:1 approved:1]
## Summary
Fixes a bug in the non-swizzled FP4 quantization path (cvt_fp16_to_fp4_sf_major) and silu_mul_cvt_fp16_to_fp4 when CVT_FP4_ELTS_PER_THREAD == 8, leading to failures in test_quantize_to_fp4_padded_no_sf_swizzled and test_silu_mul_nvfp4_quant.py (e.g. on sm120 / CUDA 12.8).
This patch introduces no additional performance overhead for the CUDA 12.8 code path. The CUDA 12.9+ path remains unaffected by this bug fix.
## Root cause
The kernel and host launch used sf_n_unpadded …
Fixes AttributeError: 'int' object has no attribute 'index' in SMControlContextManager.__init__() when running with data parallel and expert parallel enabled (DBO tests).
## Root Cause
PR #35042 introduced a bug where torch.cuda.current_device().index is called, but torch.cuda.current_device() returns an int directly, not a device object with an .index attribute.
Adds --moe-backend argument for explicit MoE kernel selection, allowing users to override the automatic backend selection logic (e.g., --moe-backend triton, --moe-backend marlin, --moe-backend flashinfer_trtllm)
Supports all three oracle paths currently implemented: unquantized, FP8, and NVFP4
If MoEBackend is specified by the user and isn’t valid for the given quantization format, it will error. Currently it doesn’t include CPU, XPU, etc…
Adds a new flag to specify user intent for workload, --performance-mode {balanced,interactivity,throughput}.
Default is balanced which preserves existing behavior.
To start and just demonstrate the feature, setting interactivity mode captures fine-grained CUDA graphs at small batch sizes (step-1 up to 32) and throughput mode doubles default batch size limits.
In the future we can use this flag to:
change scheduler behavior to specialize batch size or smoothness of batch
…
This PR adds CPU weight offloader that hides weight onloading latency by prefetching weights. This saves the performance cost of zero-copy UVA access. This technique was first developed in SGLang for GB200: https://lmsys.org/blog/2025-09-25-gb200-part-2/, and now adapted to support torch.compile and CUDA graph within vLLM in this PR.
Also refactors the offloading to be extensible.
Demonstrated in the trace:
H2D is for prefetching next offloaded we…
#35042 [Platform] Add current_platform.num_compute_units interface — performance,rocm,ready,v1,nvidia — by jikunshang (merged: 2026-02-25 14:22 (UTC+8)) [💬14 | +72/-52, 24 files | commented:10]
## Purpose
torch.cuda.get_device_properties().multi_processor_count is used in several places across the vLLM code base. We can unify these into a current_platform.num_compute_units interface to keep the code clean and extensible to non-CUDA hardware such as XPU and NPU.
## Test Plan
## Test Result
#35322 [ROCm][CI] Amending deletion of AMD mirror — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-26 06:17 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1]
This mirror was deleted accidentally from:
https://github.com/vllm-project/vllm/pull/34923
This PR reverts that.
#35265 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (merged: 2026-02-26 06:16 (UTC+8)) [💬1 | +314/-150, 4 files | commented:4 approved:1]
This PR fixes the attention backend handling in test_eagle_correctness for ROCm platforms:
Previously, the test would skip entirely when running Llama-4-Scout with FLASH_ATTN on ROCm. Instead of skipping, we now fall back to the FLEX_ATTENTION backend explicitly, allowing the test to actually run on ROCm.
Also previously, the test skipped DeepSeek models when using ROCM_AITER_FA on ROCm. Now we enable the AITER path (VLLM_ROCM_USE_AITER=1), unset VLLM_MLA_DISABLE, and switch to the…
#34270 fix(mxfp4): Disable monolithic path for TRITON backend with EP — ready,gpt-oss — by elizabetht (merged: 2026-02-26 05:33 (UTC+8)) [💬4 | +225/-5, 2 files | commented:9 approved:1]
## Problem
When running MXFP4 models with the TRITON backend and expert parallelism (EP) enabled (e.g. vllm serve openai/gpt-oss-20b -tp=2 -ep), the server crashes with an illegal memory access.
triton_kernel_moe_forward passes expert_map through to triton_kernel_fused_experts but never actually applies it. As a result, legacy_routing produces routing data using global expert IDs that don’t correspond to the local weight indices on each EP rank.
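The mismatch between global expert IDs and local weight indices can be sketched as follows. The layout here (contiguous, evenly divided experts per rank, with -1 for experts held by other ranks) is an assumption for illustration, and both function names are hypothetical.

```python
from typing import List


def build_expert_map(global_num_experts: int, ep_size: int, ep_rank: int) -> List[int]:
    """Map each global expert ID to a local index, or -1 if not on this rank."""
    per_rank = global_num_experts // ep_size  # assumes an even split
    start = ep_rank * per_rank
    return [
        g - start if start <= g < start + per_rank else -1
        for g in range(global_num_experts)
    ]


def to_local_ids(global_ids: List[int], expert_map: List[int]) -> List[int]:
    """Translate routed global expert IDs into local weight indices."""
    return [expert_map[g] for g in global_ids]


expert_map = build_expert_map(global_num_experts=8, ep_size=2, ep_rank=1)
print(expert_map)
print(to_local_ids([5, 1], expert_map))
```

Indexing local weights with the untranslated global ID 5 on a rank that holds only four experts is the kind of out-of-range access that would be consistent with the reported crash.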
## Fix
When expert_map is …
#34985 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor — bug,rocm,ready,nvidia — by rasmith (merged: 2026-02-26 03:18 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1]
## Purpose
The tests in test_punica_ops can fail if a previous test ran and changed the device that kernels execute in, in particular Triton kernels, by using torch.cuda.set_device, e.g. lora/test_layers.py. This can cause subsequent tests to also fail, since the system seems to go into a bad state once this happens.
## Test Plan
```
pytest -v -s lora \
    --ignore=lora/test_chatglm3_tp.py \
    --ignore=lora/test_llama_tp.py \
    --ignore=lora/test_llm_with_mult…
```
#35049 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-02-26 00:48 (UTC+8)) [💬1 | +18/-0, 1 files | commented:2 approved:1]
This PR disables VLLM_ROCM_USE_SKINNY_GEMM for ROCm multimodal tests by setting it to 0 in pytest_configure, which runs before test collection and module imports.
## Purpose
The wvSplitKrc skinny GEMM kernel introduced in #33493 uses atomicAdd for cross-wave reduction, which produces non-deterministic results across runs due to floating-point non-associativity. This causes test failures in accuracy-sensitive tests (e.g., test_logprobs.py, test_serving_tokens.py).
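Floating-point non-associativity is easy to reproduce on the CPU. The sketch below only demonstrates the arithmetic effect with ordinary Python floats; it does not model the GPU kernel or atomicAdd itself.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one accumulation order
right = a + (b + c)  # another accumulation order

print(left == right)
print(left, right)
```

When the accumulation order depends on which wave reaches the atomic first, differences of this size can vary from run to run, which is enough to trip accuracy-sensitive comparisons.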
See if we can also enable weights loading tracking for quantized models
## Test Plan
## Test Result
…
#35309 Revert “[Misc] Enable weights loading tracking for quantized models” — ready — by LucasWilkinson (merged: 2026-02-26 01:20 (UTC+8)) [+4/-15, 1 files | commented:1 approved:1]
Reverts vllm-project/vllm#35074, this PR broke a ton of quantized models weight loading in the nightly https://buildkite.com/vllm/ci/builds/53087/ .
We should re-land this with ready-run-all-tests
cc @Isotr0py @DarkLight1337
#32114 [Bugfix] Fix Harmony preamble visibility in Responses API — bug,frontend,ready,gpt-oss — by thepushkarp (merged: 2026-02-26 00:08 (UTC+8)) [💬11 | +341/-63, 10 files | commented:3 approved:1]
## Summary
Per the Harmony spec, preambles (commentary channel messages with no recipient) are “intended to be shown to end-users”. vLLM was incorrectly treating them as hidden reasoning across multiple code paths.
This PR fixes preamble visibility in 6 code paths across 3 files:
Parser (harmony_utils.py):
parse_output_message(): route preambles to ResponseOutputMessage instead of ResponseReasoningItem
`parse_…
#34528 [Core] Cleanup engine pause/sleep logic — frontend,ready,v1 — by njhill (merged: 2026-02-25 11:33 (UTC+8)) [💬9 | +302/-198, 10 files | commented:9 approved:1]
This is a follow-on from https://github.com/vllm-project/vllm/pull/33195 which I didn’t get a chance to review, and the subsequent merging with https://github.com/vllm-project/vllm/pull/34125
Always pause scheduler prior to sleeping (regardless of level)
Centralize cache resetting
Also support sleep/pause/resume with inline-engine cases (VLLM_ENABLE_V1_MULTIPROCESSING=0 / external launcher mode) - i.e. move from EngineCoreProc to EngineCore
Deduplicate the added LLM.enqueue logi…
#35085 [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support — bug,ready — by haosdent (merged: 2026-02-25 23:31 (UTC+8)) [+20/-8, 1 files | commented:9 approved:1]
## Purpose
Fixes #34891: On GPUs without NVSwitch (e.g., H200/H100 with NVLink bridge-only or PCIe topologies), AllReduceFusionPass.__init__() crashes with RuntimeError: [SymmDeviceMemory] Device does not support multicasting when flashinfer_comm.create_allreduce_fusion_workspace() tries to create a SymmDeviceMemory that requires multicast support.
This PR wraps the workspace creation call in a try/except RuntimeError that logs a warning and returns early, leaving the pass in its def…
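The warn-and-return-early pattern described above can be sketched generically. The class `FusionPassSketch`, its `disabled` flag, and the injected `create_workspace` callable are hypothetical stand-ins; the real pass and the flashinfer call are not reproduced here.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class FusionPassSketch:
    """Hypothetical stand-in for a pass that needs an optional workspace."""

    def __init__(self, create_workspace: Callable[[], object]) -> None:
        self.workspace: Optional[object] = None
        self.disabled = True  # default state until setup succeeds
        try:
            self.workspace = create_workspace()
        except RuntimeError as exc:
            # Hardware lacks the required feature: warn and stay disabled.
            logger.warning("Workspace creation failed, pass disabled: %s", exc)
            return
        self.disabled = False


def failing_factory() -> object:
    """Simulates the error quoted in the PR description."""
    raise RuntimeError("[SymmDeviceMemory] Device does not support multicasting")


print(FusionPassSketch(failing_factory).disabled)
print(FusionPassSketch(lambda: object()).disabled)
```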
#35249 [XPU]Fix for Qwen-OMNI crash — ready,qwen — by xuechendi (merged: 2026-02-25 23:31 (UTC+8)) [+2/-1, 1 files | commented:7 approved:2]
## Purpose
The error was detected in the vLLM-omni project when running v0.16.0 on XPU:
https://github.com/vllm-project/vllm-omni/pull/1416
Two issues were detected:
Because XPU does not support CUDA graphs yet, the vLLM main repo sets max_cudagraph_capture_size=None.
This leads to a crash in
…
#34773 [Misc][LoRA] Increase max vocab size limit to 258048 in logits processor — rocm,ready — by bhoomit (merged: 2026-02-25 23:30 (UTC+8)) [💬8 | +34/-9, 3 files | commented:5 approved:3]
Increases the LoRA logits processor vocab size upper bound from 257024 to 258048 to support models with larger vocabularies (e.g. 258048 vocab size). Adds test coverage for the new limit.
#34211 [Bugfix] Fix step3p5 reasoning with interleaved thinking — bug,ready — by mariohong128 (merged: 2026-02-25 23:13 (UTC+8)) [💬8 | +387/-14, 2 files | commented:1 approved:1]
## Purpose
When there are multiple rounds of conversation, the prompt contains </think> from the previous round, and the step3p5 reasoning parser failed to correctly determine the end of reasoning.
## Test Plan
Add tests: tests/reasoning/test_step3p5_reasoning_parser.py
Existing testing for speculative decoding is ad-hoc and unreliable. Specifically, when we select which sample prompts to use for the test case we will select from a few ‘kinds’ of prompt, with the ‘sentence’ kind of prompt having a lot of ambiguity and leading to flaky CI tests.
This PR replaces the ‘sentence’ type with some prompts selected from GSM8k. In addition, we now al…
We can see that the first link in the table here is broken :)
Also applied a suggestion by the gemini bot.
## Test Plan
…
#35107 Fix custom processors that use deleted behaviour for Transformers v5 — ready — by hmellor (merged: 2026-02-25 18:36 (UTC+8)) [💬3 | +32/-0, 1 files | commented:3 approved:1]
Some remote code processors may define optional_attributes in their ProcessorMixin subclass, and then pass these arbitrary attributes directly to ProcessorMixin.__init__, which is no longer allowed in Transformers v5. For backward compatibility, we intercept these optional attributes and set them on the processor instance before calling the original ProcessorMixin.__init__.
In vLLM CI, one architecture which does this is Molmo2ForConditionalGeneration. If we upstream this architecture…
#34677 [Bugfix][CPU] Fix basic unit tests failing in CPU platforms — bug,ready,nvidia — by jasonyanwenl (merged: 2026-02-25 16:36 (UTC+8)) [💬6 | +15/-6, 1 files | commented:4 changes:1 approved:1]
## Purpose
This PR is to fix issues reported in: https://github.com/vllm-project/vllm/issues/32730 (see this comment)
Specifically, this fixes the following CPU test failures:
test_vllm_config_defaults: make optimization-level cudagraph mode defaults platform-aware via _platform_aware_cudagraph(), so non-GPU platforms (e.g., CPU) correctly get CUDAGraphMode.NONE instead of GPU-specific modes like PIECEWISE / `FULL_AND_P…
#33069 [Doc] Suggest “--managed-python” flag when installing python using uv — documentation,ready,cpu — by jasonyanwenl (merged: 2026-02-25 16:19 (UTC+8)) [💬10 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
Building vLLM from source in CPU-only mode failed:
Full error message:
```text
(vllm) ubuntu@ip-172-31-4-43:~/vllm$ VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
Resolved 136 packages in 2.94s
× Failed to build `vllm @ file:///home/ubuntu/vllm`
...
```
#34513 [DOC][BugFix] Specfiy build dependency installation — bug,documentation,ready — by jonoillar (merged: 2026-02-25 16:04 (UTC+8)) [💬5 | +7/-1, 1 files | commented:2 approved:2]
## Purpose
When setting up the development environment with Python and CUDA/C++ code, the build dependencies (cmake, setuptools, etc.) must be installed manually.
That’s why I added instructions for installing those build dependencies.
This updates the documented steps: when installing vLLM without setting up a new environment (--no-build-isolation flag), you need to install the build dependencies explicitly.
## Test Plan
This PR fixes several documentation formatting issues in docs/design/arch_overview.md:
Fixed code block markers - Removed the ??? prefix from code block annotations (??? code → code) to ensure proper rendering of code examples in the LLM class usage and model constructor sections.
Simplified image embedding - Replaced the verbose HTML tag with standard Markdown image syntax () for better compatibility and cleaner markup.
These changes improve the read…
#35225 docs: document committer proposal process in governance — documentation,ready — by simon-mo (merged: 2026-02-25 15:58 (UTC+8)) [💬1 | +7/-5, 1 files | commented:4 approved:4]
Documents the committer proposal process in docs/governance/process.md:
Nominate: Committer emails the committer group to nominate a candidate, highlighting contributions
Discuss and vote: Group discusses, votes, voices concerns; lead maintainers decide
Feedback period: After 2-week feedback, if moving forward, nominator invites candidate to open PR to update code ownership (CODEOWNERS, committers list)
Permissions and onboarding: Lead maintainers assign GitHub permiss…
#35203 Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM — documentation,ready,v1,deepseek,kv-connector — by hmellor (merged: 2026-02-25 14:00 (UTC+8)) [💬3 | +9/-56, 5 files | commented:1 approved:1]
Since we implement the config class ourselves, there’s no need to enforce that users pass architectures themselves. We can just add it if it’s missing from the checkpoint.
## Purpose
Add an opt-in integration path for external SM100 (Blackwell) RMSNorm kernels (CuTeDSL / Oink) without introducing a hard dependency in vLLM.
When enabled, vLLM routes eligible CUDA RMSNorm calls to externally-registered torch.ops.oink.* kernels (and otherwise falls back to the existing vLLM CUDA implementation). This is primarily intended for torch.compile + CUDA graph execution.
## What This PR Changes
…
#34424 [Perf] Optimize FP8 gemm of sm120. — performance,ready,nvidia — by wenshuai-xiaomi (merged: 2026-02-25 14:25 (UTC+8)) [💬2 | +133/-1, 1 files | commented:3 approved:2]
## Purpose
Optimize the FP8 GEMM on sm120 for any input shape.
## Test Plan
benchmarks/kernels# python3 bench_fp8_gemm.py
Run bench_fp8_gemm.py and compare the results with and without the patch.
## Test Result
The current result without the patch:
```
…
```
Enable CUDAGraph functionality support on the XPU platform leveraging features available in the nightly PyTorch XPU build.
Validated with torch==2.11.0.dev20260216+xpu from nightly channel and torch==2.11.0+xpu from test channel
distributed is not supported
For FLASH_ATTN backend, only PIECEWISE mode is supported
For TRITON_ATTN backend, all modes are supported
## Test Plan
…
#35011 remove cuda check in top_k_top_p_triton kernel — ready,v1,nvidia — by jikunshang (merged: 2026-02-25 14:22 (UTC+8)) [💬8 | +3/-4, 2 files | commented:2 approved:2]
## Purpose
fix #34941
https://github.com/vllm-project/vllm/pull/33538 introduced a Triton-based top_k_top_p kernel that contains some hard-coded CUDA logic in the Triton kernel code. This may break non-CUDA devices such as XPU and NPU.
cc @Yikun @wangxiyuan
## Test Plan
CI
## Test Result
Update shard_id validation to avoid unexpected shard_id for MergedColumnLinear
Update annotations for linear utility functions
## Test Plan
## Test Result
…
#35231 [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors — frontend,ready — by AndreasKaratzas (merged: 2026-02-25 13:52 (UTC+8)) [+27/-1, 2 files | commented:2 approved:1]
Adds server-side validation to reject negative token IDs in POST /v1/completions, which previously caused an unhandled 500 Internal Server Error.
## Problem
When the prompt field contains negative token IDs (e.g. {"prompt": [[-1]]}), the server crashes with a 500 instead of returning a proper 400 validation error. This was discovered via schemathesis fuzz testing in test_openapi_stateless.
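A validation step of this kind can be sketched in plain Python. The helper names are hypothetical and the sketch is framework-agnostic; it only shows how a nested prompt such as `[[-1]]` could be rejected with a `ValueError` that a request handler might translate into a 400 response.

```python
from typing import Any


def has_negative_token_id(prompt: Any) -> bool:
    """Return True if a (possibly nested) token-ID prompt has a negative ID."""
    if isinstance(prompt, bool):
        return False  # bools are ints in Python but are not token IDs
    if isinstance(prompt, int):
        return prompt < 0
    if isinstance(prompt, (list, tuple)):
        return any(has_negative_token_id(item) for item in prompt)
    return False  # strings and other prompt types are not token IDs


def validate_prompt(prompt: Any) -> None:
    """Raise ValueError so the caller can answer with a 400, not a 500."""
    if has_negative_token_id(prompt):
        raise ValueError("prompt token IDs must be non-negative")


print(has_negative_token_id([[-1]]), has_negative_token_id([[1, 2]]))
```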
When an AOT load fails, we simply fall back to compiling again.
Instead of mentioning “AOT compile” in the message, change the wording
to tell the user that we are simply compiling again. Add a common path for the case where the
cache file is corrupted with EOFError, so that it is clearer what the compiler is doing.
Test Plan:
Reviewers:
…
#35157 [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache — bug,ready,v1 — by heheda12345 (merged: 2026-02-25 14:00 (UTC+8)) [💬1 | +74/-1, 2 files | commented:1 approved:1]
## Purpose
We need to clear mamba_state_idx for resumed requests. When requests are force-preempted (e.g., during reset_prefix_cache / KV cache flush), they appear in resumed_req_ids without a corresponding entry in preempted_req_ids, leaving stale mamba_state_idx entries that can point to block indices beyond the new (smaller) block allocation.
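The cleanup step can be pictured with a small sketch. It assumes `mamba_state_idx` behaves like a mapping from request ID to a block index; that shape, and the helper name `clear_stale_mamba_state`, are assumptions made for illustration and not the PR's actual code.

```python
def clear_stale_mamba_state(
    mamba_state_idx: dict[str, int], resumed_req_ids: set[str]
) -> None:
    """Drop cached state indices for requests that are being resumed."""
    for req_id in resumed_req_ids:
        # pop() with a default is a no-op when the request has no entry.
        mamba_state_idx.pop(req_id, None)


# "req-a" was force-preempted and resumed, so its old index must not survive.
state = {"req-a": 7, "req-b": 3}
clear_stale_mamba_state(state, {"req-a", "req-c"})
```

Keying the cleanup on `resumed_req_ids` alone covers requests that never appeared in `preempted_req_ids`.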
## Test Plan
See the new unit test
## Test Result
…
#35237 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest — bug,frontend,ready — by pks (merged: 2026-02-25 14:00 (UTC+8)) [+18/-2, 2 files | commented:1 approved:1]
## Summary
The `check_structured_outputs_count` model validator on `CompletionRequest` and `ChatCompletionRequest` uses `mode="before"` and calls `.get()` on the `structured_outputs` value, assuming it is always a dict. However, the field is typed as `StructuredOutputsParams | None`, so passing a dataclass instance is a valid usage but crashes with:
```
AttributeError: 'StructuredOutputsParams' object has no attribute 'get'
```
This PR handles both dict and `StructuredOutputsParams` input...
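One way to accept both input shapes is sketched below. `StructuredOutputsParamsStandIn` and `get_field` are hypothetical names used only for illustration; the PR's real validator may be written differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StructuredOutputsParamsStandIn:
    """Stand-in for illustration only; not vLLM's StructuredOutputsParams."""

    json: Any = None
    regex: Any = None


def get_field(structured_outputs: Any, name: str) -> Any:
    """Read a field from either a dict or a dataclass-style object."""
    if structured_outputs is None:
        return None
    if isinstance(structured_outputs, dict):
        return structured_outputs.get(name)
    # Dataclass instances have attributes, not a .get() method.
    return getattr(structured_outputs, name, None)


from_dict = get_field({"regex": "a+"}, "regex")
from_obj = get_field(StructuredOutputsParamsStandIn(regex="a+"), "regex")
```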
#34933 [FIX] fused moe with lora shared expert dual stream (1.07x otps) — performance,ready — by jhaotingc (merged: 2026-02-25 12:40 (UTC+8)) [💬6 | +6/-2, 1 files | commented:1 approved:2]
## Purpose
MoE models with a shared expert get dual-stream overlap under TP when there is no LoRA, but the overlap disappears once a LoRA module is added.
#35180 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE — rocm,ready — by Rohan138 (merged: 2026-02-25 12:36 (UTC+8)) [💬3 | +139/-67, 9 files | commented:3 approved:1]
## Purpose
Follow-up to #25135 and #33443 to fix and enable the AITER RoPE custom op by default, and enable RoPE+KVCache fusion. During prefill (q.shape[0] > 256), we thus use the AITER unfused RoPE kernel instead of the vllm native custom op, which gives another 1% uplift on gpt-oss.
This is also a mild prerequisite for MLA RoPE+KVCache fusion on ROCm, since there currently isn’t a vllm native custom op for DeepseekScalingRotaryEmbedding.
## Test Plan
Before:
```
============ Serving Benc…
```
Refactor SSE streaming event helpers to decouple shared event-building logic from Harmony-specific context
Fix a bug so that the built-in Python tool generates code interpreter events instead of MCP events (according to https://community.openai.com/t/responses-api-streaming-the-simple-guide-to-events/1363122)
Enhance the unit tests to validate streaming events
### Architecture: Two-layer design in streaming_events.py
The core refactor splits streaming_events.py into two layers: dispatcher…
#32967 [Frontend] Use init_app_state and FrontendArgs in run_batch — frontend,ready — by pooyadavoodi (merged: 2026-02-25 11:40 (UTC+8)) [💬6 | +632/-330, 4 files | commented:8 approved:1]
## Purpose
Adding support for more features such as tool calling to run_batch.
This is achieved by using init_app_state from vllm/entrypoints/openai/api_server.py
run_batch now uses a subset of frontend args from vllm/entrypoints/openai/cli_args.py (this is achieved by adding a new base class for FrontendArgs and letting BatchFrontendArgs derive from that).
The approach taken here also removes some code duplication between api_server and run_batch.
#34226 [build] fix priority of cuda-compat libraries in ld loading — ci/build,nvidia — by youkaichao (closed: 2026-02-26 10:15 (UTC+8)) [💬3 | +4/-4, 1 files | commented:1]
## Purpose
People still report Error 803: system has unsupported display driver / cuda driver combination after https://github.com/vllm-project/vllm/pull/33116. This is because some Docker images specify the driver .so file in nvidia.conf, which gets suppressed by cuda-compat.conf.
This PR adds zzz- to cuda-compat.conf so that it is ordered last.
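The ordering effect can be checked with a short sketch. It assumes the loader configuration directory is read through a sorted `*.conf` glob, that plain code-point sorting matches that order for these ASCII names, and that the renamed file is `zzz-cuda-compat.conf`; the exact resulting file name is not given in the PR text.

```python
def load_order(conf_files: list[str]) -> list[str]:
    """Return config file names in the sorted order a `*.conf` glob yields."""
    return sorted(conf_files)


# Before the rename, cuda-compat.conf sorts ahead of nvidia.conf.
before = load_order(["nvidia.conf", "cuda-compat.conf"])
# After the rename (assumed file name), nvidia.conf is read first.
after = load_order(["nvidia.conf", "zzz-cuda-compat.conf"])
```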
#35348 Fix stale self._positions during piecewise CUDA graph replay in MLA — frontend,v1,nvidia — by tlrmchlsmth (closed: 2026-02-26 10:37 (UTC+8)) [+380/-45, 14 files | commented:1 | 📝draft]
## Summary
Fixes a shape mismatch crash in rotary embedding (RuntimeError: The size of tensor a (8192) must match the size of tensor b (475)) when running DeepSeek-R1 with FP8 KV cache, piecewise CUDA graphs, and the fused RoPE+quant path
Root cause: self._positions = positions in MLAAttention.forward() is a Python-level side effect inside a piecewise-compiled graph. During CUDA graph replay, this assignment is not re-executed, leaving self._positions pointing to a stale view from a…
This PR reduces overhead from the pure PyTorch implementation of the SSD initial state extraction and state caching by using two additional Triton kernels.
It is based on https://github.com/vllm-project/vllm/pull/26222
cc @tdoublep @s3woz
## Test Plan
The behavior is transparent to the user and thus the existing tests can be reused.
…
#27617 [Model] make the inv_freq in Qwen2_5_VisionRotaryEmbedding device-agnostic — stale,qwen — by yangulei (closed: 2026-02-26 10:16 (UTC+8)) [💬5 | +1/-3, 1 files | commented:4]
## Purpose
Currently the inv_freq in Qwen2_5_VisionRotaryEmbedding is set to ‘cpu’ explicitly.
https://github.com/vllm-project/vllm/blob/a8d2e326ecb70876aa73dce70dbe2434c64b710a/vllm/model_executor/models/qwen2_5_vl.py#L610-L612
This PR makes it device-agnostic.
#34425 [Bug Fix] Re-add the DCP/PCP compatibility check for CUDA platform — bug,needs-rebase,nvidia — by Siritao (closed: 2026-02-26 09:53 (UTC+8)) [💬2 | +21/-0, 1 files | commented:1]
Background:
PR #29952 moved DCP/PCP + full CUDA graphs compatibility check to platform layer
PR #29558 inadvertently removed this check during conflict resolution (see commit diff)
ROCm platform still has the check, but CUDA is missing it
Impact: Users with DCP/PCP enabled cras…
#34420 Fix MoE models in EP mode on Ascend — no labels — by ylyhyqsl (closed: 2026-02-26 09:41 (UTC+8)) [💬7 | +11/-4, 2 files | commented:1]
Fix: MoE models that use a bare tensor_model_parallel_all_reduce produce garbage output in EP mode on Ascend #6710
## Purpose
MoE models that use FusedMoE with reduce_results=False and then manually call tensor_model_parallel_all_reduce() in their MoE forward method produce garbage output when running with --enable-expert-parallel on Ascend NPU.
## Suggested fix
The fix should be applied in the upstream vLLM model files. Replace:
The hope is that this will reduce the flaky failures of “V1 e2e + engine”, since I identified these tests as frequent culprits. They really should be reworked, but when I ran them locally the results varied considerably from run to run.
#34774 [Build] Add BUILDER_CUDA_VERSION to decouple build and runtime/PyTorch CUDA versions — ready,ci/build,nvidia — by tlrmchlsmth (closed: 2026-02-26 09:27 (UTC+8)) [💬1 | +39/-26, 5 files | commented:1]
Addresses the CUDA 12.8/12.9 compatibility issue affecting v0.16.0: the official PyTorch PyPI wheel is built on CUDA 12.8, but vLLM wheels have been built on CUDA 12.9. When users pip install vllm, PyTorch pins nvidia-cublas-cu12 to 12.8, which causes runtime failures (pytorch/pytorch#174949).
By decoupling the versions, we can build the vLLM wheel against CUDA 12.8 PyTorch (matching what users get from PyPI) while still compiling CUDA kernels with a 12.9, allowing FlashMLA to still be u…
#34852 [Build] Downgrade to CUDA 12.8 but use nvcc 12.9 to build csrc — ci/build,nvidia — by tlrmchlsmth (closed: 2026-02-26 09:27 (UTC+8)) [💬2 | +89/-20, 4 files | commented:2]
Take 2 of https://github.com/vllm-project/vllm/pull/34774
Addresses the CUDA 12.8/12.9 compatibility issue affecting v0.16.0: the official PyTorch PyPI wheel is built on CUDA 12.8, but vLLM wheels have been built on CUDA 12.9. When users pip install vllm, PyTorch pins nvidia-cublas-cu12 to 12.8, which causes runtime failures (https://github.com/pytorch/pytorch/issues/174949).
By decoupling the versions, we can build the vLLM wheel against CUDA 12.8 PyTorch (matching what users get from P…
#34517 [Bugfix] Fix mypy errors on StructuredOutputsParams in OpenAI entrypoints — bug — by junuxyz (closed: 2026-02-26 09:20 (UTC+8)) [💬2 | +7/-2, 1 files | commented:1]
## Purpose
This PR fixes mypy errors in OpenAI entrypoint code paths using StructuredOutputsParams.
Root cause: StructuredOutputsParams uses pydantic.dataclasses.dataclass, and mypy does not always treat it like a stdlib dataclass in these paths (replace(...), StructuredOutputsParams(json=...)).
Fix: use stdlib dataclass under TYPE_CHECKING for mypy, while keeping pydantic.dataclasses.dataclass at runtime.
```
…
```
#27996 Add the NV-ModelOPT FP8 & FP4 quantization E2E accuracy test case — no labels — by jingyu-ml (closed: 2026-02-26 09:18 (UTC+8)) [💬5 | +132/-1, 2 files | commented:2 approved:1]
## Purpose
Add GSM8K accuracy tests for ModelOpt FP8 quantized models to ensure quantization quality is maintained. This PR adds automated testing for:
nvidia/Llama-3.1-8B-Instruct-FP8
nvidia/Qwen3-8B-FP8
The tests verify that ModelOpt FP8 quantization achieves expected accuracy benchmarks on the GSM8K mathematical reasoning task.
## Test Plan
…
#34540 [Kernel] [Helion] [8/N] Remove fake_impl usage and inference — rocm,ready — by gmagogsfm (closed: 2026-02-26 05:56 (UTC+8)) [💬13 | +292/-197, 5 files | commented:4 approved:1]
NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: Replace fake_impl with helion’s infer_output_spec in HelionKernelWrapper
Since we now use HOP to represent Helion kernel calls, and Helion has internal support for computing output tensor shapes, we no longer need to infer fake_impl. This PR removes all logic related to fake_impl.
In `prepare_nvfp4_moe_layer_for_fi_or_cutlass` (`flashinfer_fp4_moe.py:514-515`), input scales are broadcast via `expand()`:
```python
a13_scale = a13_scale.max().to(torch.float32).expand(num_experts) ...
```
PR #35074 introduced track_weights_loading() to validate all registered parameters are loaded from checkpoint. This broke loading of GPTQ/AWQ MoE models using the Marlin backend with the following error:
```
ValueError: Following weights were not initialized from checkpoint:
{'model.layers.0.mlp.experts.w13_g_idx_sort_indices',
'model.layers.0.mlp.experts.w2_g_idx_sort_indices',
'model.layers.0.mlp.experts.w13_weight_g_idx',
'model.layers.0.mlp.experts.w2_weight_g_idx', …}
…
```
#35323 Add Qwen3.5-MoE (qwen3_5_moe) support via Qwen3-Next backend — documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector — by blake-snc (closed: 2026-02-26 03:04 (UTC+8)) [💬4 | +2597/-363, 38 files | commented:1]
## Summary
Adds support for Qwen3.5-MoE models (model_type: qwen3_5_moe, Qwen3_5MoeForConditionalGeneration) which use the same hybrid GDN (GatedDeltaNet) architecture as Qwen3-Next but ship with a different HuggingFace config layout and checkpoint weight format.
Key changes:
Config: Qwen3_5MoeConfig that flattens the outer text_config wrapper and delegates to Qwen3NextConfig
Registry: Register qwen3_5_moe model type and both ForConditionalGeneration / `ForCausal…
Fix Qwen3ReasoningParser.extract_reasoning() returning truncated reasoning tokens as content instead of reasoning when max_completion_tokens cuts off the output before </think> is generated.
When thinking is enabled and the output is truncated, the current code sees no </think> and assumes thinking was disabled — returning (None, model_output). This means the API response has content filled with raw chain-of-thought and empty reasoning_content, which is wrong.
…
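The intended behavior for truncated output can be sketched as follows. `split_reasoning` and its `thinking_enabled` parameter are hypothetical and only illustrate the decision; they are not the signature of `Qwen3ReasoningParser.extract_reasoning()`.

```python
THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(model_output: str, thinking_enabled: bool):
    """Return (reasoning, content) for one completed model output."""
    text = model_output.removeprefix(THINK_START)
    if THINK_END in text:
        reasoning, _, content = text.partition(THINK_END)
        return reasoning or None, content or None
    if thinking_enabled:
        # Truncated before the closing tag: everything so far is reasoning.
        return text or None, None
    # Thinking disabled: the whole output is ordinary content.
    return None, model_output or None


truncated = split_reasoning("step one, step two", thinking_enabled=True)
complete = split_reasoning("<think>plan</think>answer", thinking_enabled=True)
```

The key design point is that a missing closing tag is only treated as "thinking disabled" when thinking really was disabled for the request.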
#34787 [Core] Num External Cached Tokens — frontend,v1 — by aeon-x (closed: 2026-02-25 20:09 (UTC+8)) [💬3 | +39/-4, 11 files | commented:1]
## Purpose
Add Num External Cached Tokens to responses to help developers of external cache systems such as LMCache track external cache usage in each request.
## Test Plan
Install both vllm and lmcache from source code
Run vllm with lmcache
Test Completion, Chat Completion and Response API
## Test Result
num_external_cached_tokens works as expected and only shows the number of tokens cached by LMCache.
…
#35269 [PaddleOCR-VL] improve image_processor size compatibility for Transformers v5.0 — no labels — by zhang-prog (closed: 2026-02-25 19:34 (UTC+8)) [💬4 | +2/-2, 1 files | commented:2]
In Transformers v5.0, the keys in PaddleOCRVLImageProcessor.size have been updated from min_pixels/max_pixels to shortest_edge/longest_edge.
This PR enhances the compatibility of the size parameter to ensure a smooth transition to v5.0 while maintaining backward compatibility.
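A backward-compatible lookup of this kind might look like the sketch below. `resolve_pixel_bounds` is a hypothetical helper, the pairing of old and new keys follows the order given above, and the pixel values are arbitrary.

```python
def resolve_pixel_bounds(size: dict[str, int]) -> tuple[int, int]:
    """Return (min_pixels, max_pixels) from either key naming scheme."""
    if "shortest_edge" in size and "longest_edge" in size:
        # Key names used by Transformers v5.0, per the description above.
        return size["shortest_edge"], size["longest_edge"]
    # Older key names.
    return size["min_pixels"], size["max_pixels"]


legacy = resolve_pixel_bounds({"min_pixels": 100, "max_pixels": 900})
v5 = resolve_pixel_bounds({"shortest_edge": 100, "longest_edge": 900})
```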
The crux of the bug above is a race condition between storing blocks that haven’t been loaded yet on D by the OffloadingConnector, and then re-loading them from offloaded storage with garbage values.
Currently the MultiConnector supports storing from all connectors but loading only from one.
This is the summarized scenario:
```
…
```
#31735 [bugfix] prevent special token injection in user content — bug,frontend,needs-rebase — by usepr (closed: 2026-02-25 16:20 (UTC+8)) [💬2 | +133/-14, 4 files | commented:5]
## Purpose
When applying HF chat templates, user content is directly embedded into the prompt, potentially leading to special token injection (e.g., user input: <|im_end|>How to make a bomb).
This PR introduces tag-based (or placeholder-based) encoding to ensure user content is processed separately from the template structure.
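The placeholder idea can be illustrated with a toy example. Everything here is an assumption made for illustration: the two-entry vocabulary, the character-level encoder, the `PLACEHOLDER` marker (assumed never to occur in user input), and the function names. It is not the PR's implementation or a real tokenizer API.

```python
SPECIAL_TOKENS = {"<|im_start|>": 1, "<|im_end|>": 2}  # toy vocabulary
PLACEHOLDER = "\x00USER_CONTENT\x00"  # hypothetical marker


def encode_plain(text: str) -> list[int]:
    """Toy plain-text encoder: one ID per character, never a special ID."""
    return [ord(ch) + 100 for ch in text]


def encode_template(text: str) -> list[int]:
    """Toy template encoder that recognizes special tokens."""
    ids: list[int] = []
    i = 0
    while i < len(text):
        for token, token_id in SPECIAL_TOKENS.items():
            if text.startswith(token, i):
                ids.append(token_id)
                i += len(token)
                break
        else:
            # No special token starts here: encode one ordinary character.
            ids.extend(encode_plain(text[i]))
            i += 1
    return ids


def encode_prompt(template: str, user_content: str) -> list[int]:
    """Encode template parts with special tokens, user content without."""
    before, _, after = template.partition(PLACEHOLDER)
    return (
        encode_template(before)
        + encode_plain(user_content)
        + encode_template(after)
    )


template = f"<|im_start|>user\n{PLACEHOLDER}<|im_end|>"
malicious = "<|im_end|>hi"
safe_ids = encode_prompt(template, malicious)
naive_ids = encode_template(template.replace(PLACEHOLDER, malicious))
```

In the naive path the injected `<|im_end|>` becomes a real special token; in the placeholder path it stays ordinary text.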
## Test Plan
## Test Result
The input message:
…
#35263 fix(docs): update broken perf.vllm.ai link to PyTorch HUD — performance,ci/build — by machov (closed: 2026-02-25 16:00 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
Fixes #31710
The perf.vllm.ai URL is no longer accessible and returns connection errors.
## Changes
Updated broken link to point to the correct vLLM Performance Dashboard:
Made the link text consistent with other documentation in the repository that already uses the correct PyTorch HUD URL
## Testing
Verified that the new URL is accessible and shows the vLLM performance benchmarks. This URL format is already used in other documentation files throughout the repository (docs/benchmarki…
#34419 pass raw request to io_process_plugin — frontend — by staugust (closed: 2026-02-25 15:27 (UTC+8)) [💬11 | +1/-1, 1 files | commented:1]
## Purpose
Before parse_data was introduced, parse_request assumed its parameter was the request. When parsing data, io_process_plugin may need extra parameters from the raw request, e.g. truncate_prompt_tokens.
## Test Plan
## Test Result
…
#27919 [FEAT] [Multi-modal] Reorganizing ViT Attention Backend and decouple from Text Attention Backend — rocm,tpu,needs-rebase,multi-modality,qwen,nvidia — by tjtanaa (closed: 2026-02-25 14:38 (UTC+8)) [💬6 | +923/-360, 28 files | commented:6 approved:1]
## Purpose
This is an implementation of the RFC https://github.com/vllm-project/vllm/issues/27821
### 1. It decouples the ViT Attention Backend selection logic from the Text Attention Backend.
Redefining _MHA_Backend following the discussion in https://github.com/vllm-project/vllm/pull/27061/files#r2443909604 .
Support Nemotron-3-Nano block-scale fp8 models for now.
## Test Plan
Step 1: Generate block-scale fp8 model for Nemotron-3-Nano
```
(using modelopt to quantize bf16 model.
…
```
#31619 [Bugfix] Disallow sleep call if there are unfinished requests — bug,frontend,v1 — by danielhumanmod (closed: 2026-02-25 11:33 (UTC+8)) [💬2 | +205/-0, 4 files | commented:2]
## Purpose
Fix #31618
This PR prevents the engine from entering sleep mode while there are active/unfinished requests.
Calling sleep() during in-flight generation can put the engine into an invalid state (e.g., weights offloaded / caches reset while requests are still being processed), which can lead to failures or confusing behavior. To make the API safer and more predictable, this PR explicitly blocks sleep() when there are unfinished requests.
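The guard can be sketched as below. `EngineStandIn` is a hypothetical stand-in, and raising `RuntimeError` is an assumption; the PR text does not say how the real engine reports the rejected call.

```python
class EngineStandIn:
    """Minimal stand-in for illustration; not vLLM's engine class."""

    def __init__(self) -> None:
        self.unfinished_requests: set[str] = set()
        self.sleeping = False

    def sleep(self) -> None:
        """Enter sleep mode only when no requests are in flight."""
        if self.unfinished_requests:
            count = len(self.unfinished_requests)
            raise RuntimeError(f"cannot sleep: {count} unfinished request(s)")
        self.sleeping = True


engine = EngineStandIn()
engine.unfinished_requests.add("req-1")
try:
    engine.sleep()
except RuntimeError as exc:
    rejection = str(exc)
```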