[vLLM GitHub Development Digest] 2026-01-23
[Overview]
- Time window: 2026-01-23 10:50 (UTC+8) ~ 2026-01-24 10:50 (UTC+8)
- New issues: 15 (label distribution: bug:10, feature request:2, performance:1, ci-failure:1, cpu:1)
- Closed issues: 29
- New PRs: 75 (label distribution: ready:21, v1:18, bug:17, documentation:16, ci/build:13)
- Merged PRs: 40
- PRs closed without merging: 16
[New issues]
-
#32962 [Performance]: Custom Helion Kernels — performance — by xiaohongchen1991 (created: 2026-01-24 03:40 (UTC+8)) [💬1] ### Proposal to improve performance
This is a sub-issue of the vLLM Helion Integration Project https://github.com/vllm-project/vllm/issues/32219 proposed by @gmagogsfm. This issue lists vLLM kernels that are good candidates to explore, implement, and benchmark with Helion. Progress and results will be tracked here.
### Scope ###
- This Issue focuses on kernels integrated into vLLM via the CustomOp framework (potentially getting r…
-
#32972 [CI Failure]: mi325_8: Kernels Attention Test %N — ci-failure — by AndreasKaratzas (创建于: 2026-01-24 06:06 (UTC+8)) [💬1] ### Name of failing test
pytest -v -s kernels/attention### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32925 [Bug]: tensorize_vllm_model double gpu — bug — by lyc1995452 (created: 2026-01-23 17:18 (UTC+8)) [💬1] ### Your current environment
v0.14 or v0.9.2
### 🐛 Describe the bug
docker run --gpus '"device=3"' --shm-size=4g -p 38954:38954 --name tensorize-deploy -v /storage/leo/workspace/common-ie/asset_script/serialize_model/vllm/model/checkpoint-1350-merged:/storage/model -v /storage/leo/workspace/train_project/train_somefile/chat_template:/storage/chat_template vllm-openai-tensorizer:v0.9.2 --served-model-name K-GPT-V --model /storage/model --port 38954 --load-for…
-
#32959 [Bug]: KeyError: `merging_layer.weight` when loading Mistral/vision-enabled checkpoints after PR #32780 refactor — bug — by ms1design (created: 2026-01-24 03:16 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text Collecting environment information... uv is set ============================== System Info ... ```
-
#32932 [Bug][cpu][arm]: Failure case for BF16 dispatch on non-bf16 supported arm HW — bug,cpu — by gassan-arm (创建于: 2026-01-23 19:00 (UTC+8)) [💬2] ### Your current environment
The output of
```text ============================== Versions of relevant libraries ============================== [pip3] flashinfer-python==0.5.3 ...python collect_env.py -
#32919 [Bug]: Memory fault access when serving DeepSeek-R1-0528 with mori-ep + concurrency of 128 / 256 — bug,rocm — by junkang1991 (创建于: 2026-01-23 15:30 (UTC+8)) [💬1] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32926 [Feature]: Add dedicated tool parser for Qwen2.5-Coder models — feature request — by hanXen (created: 2026-01-23 18:01 (UTC+8)) ### 🚀 The feature, motivation and pitch Qwen2.5-Coder models have no working tool call parser in vLLM. The vLLM documentation recommends
`--tool-call-parser hermes` for Qwen2.5, but Qwen2.5-Coder does not follow the hermes format — it outputs `json` code blocks or plain JSON instead of `<tool_call>` tags. Output format analysis (Qwen2.5-Coder-7B-Instruct,
`temperature=0`): System prompt strategy / Model output format / vLLM …
-
#32938 [Bug]: illegal memory access while running Qwen3-30B-A3B-Instruct-2507 on multi node with DeepEP backend — bug — by llsj14 (创建于: 2026-01-23 21:27 (UTC+8)) ### Your current environment
- H200 hardware
- vLLM v0.13.0 (upstream image)
### 🐛 Describe the bug
Server Script ``` # Node 1 (Primary - handles incoming requests) …
-
#32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (创建于: 2026-01-23 10:52 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
hello, Why is the “thinking” parameter inconsistent with the OpenAI API? Is there any consideration behind this? Thank you
### Alternatives
No response
### Additional context …
-
#32920 [Bug]: disagg_proxy_demo.py: Method logic error in ‘remove_instance_endpoint’ — bug — by ChenqianCao (创建于: 2026-01-23 15:40 (UTC+8)) ### Your current environment
The output of
```text Your output of `python collect_env.py` here ```python collect_env.py…
-
#32911 [Bug]: “ValueError: No tokenizer file found in directory” when serving Qwen3-Omni — bug — by CHNtentes (创建于: 2026-01-23 13:50 (UTC+8)) [💬3] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32921 [Bug]: gpt-oss-20b streaming last reasoning content part into content — bug — by DeoLeung (创建于: 2026-01-23 15:57 (UTC+8)) ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32915 [Bug]: Enable Lora cause OOM — bug — by bi1101 (创建于: 2026-01-23 14:55 (UTC+8)) ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32910 [Doc]: Having default of “None” in docs for serve params is not helpful, because there is actually a default — documentation — by jcowles (创建于: 2026-01-23 13:49 (UTC+8)) [💬1] ### 📚 The doc issue
What is the actual default? It’s not ‘None’.
### Suggest a potential alternative/fix
No response
…
-
#32903 [Bug]: FlashInfer error when running vLLM throughput bench on B200 — bug — by baonudesifeizhai (创建于: 2026-01-23 11:47 (UTC+8)) ### Your current environment
### 🐛 Describe the bug
running ``` VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_USE_FLASHINFER_MOE_FP8=1
VLLM_FLASHINFER_MOE_BACKEND=throughput
…
[Closed issues]
-
#14040 [Bug]: RuntimeError: (‘Quantization scheme is not supported for ‘, ‘the current GPU. Min capability: 80. ‘, ‘Current capability: 75.’) — bug,stale — by Bakht-Ullah (关闭于: 2026-01-24 10:16 (UTC+8)) [💬9] ### Your current environment
I am using google colab T4 GPU
### 🐛 Describe the bug
from vllm.assets.audio import AudioAsset from vllm import LLM, SamplingParams
# Load your local audio file …
-
#18582 [Usage]: How to use the appropriate –gpu-memory-utilization — usage,stale — by ChEnYoNgB (关闭于: 2026-01-24 10:15 (UTC+8)) [💬9] ### Your current environment
```text
============================== System Info ============================== OS : Ubuntu 22.04.4 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
-
#28312 [Bug]: DeepSeek MTP IMA (with `num_spec_tokens=2`) — bug,speculative-decoding — by robertgshaw2-redhat (closed: 2026-01-24 02:16 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py`:
```text H200 ```…
-
#29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-24 05:58 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s models/language -m 'core_model and (not slow_test)'### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-24 05:58 (UTC+8)) [💬3] ### Name of failing test
pytest -v -s tests/entrypoints/openai/responses### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32896 [Installation]: How to use v0.10.x with pytorch2.9? — installation — by Wesley-Jzy (关闭于: 2026-01-24 02:42 (UTC+8)) [💬1] ### Your current environment
As the title describes, how can I get a vllm 0.10.x for pytorch 2.9 and cuda 12.8 environment?
### How you are installing vllm
pip install -vvv vllm…
-
#30654 [Feature][Attention][UX]: Incorporate Features into Attention Selection — help wanted,good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:34 (UTC+8)) [💬12] ### 🚀 The feature, motivation and pitch
SUMMARY:
- we have default attention backends by priority and a notion of which backend supports what hw
- however, certain features are not considered in this (e.g. fp8 kv cache, e.g. attention sinks)
Recent example, we had test failures because we updated the logic to load kv cache quantization from the model config. But since CUTLASS_MLA is the default backend on B200, we started seeing test failures (since CUTLASS MLA does not support fp8 kv cache) b…
-
#27668 [Feature]: WideEP CI Testing — feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:30 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
We have seen many regressions related to WideEP in
`llm-d` due to frequent changes in vLLM, which is creating significant issues in developer velocity. We need to get some automated testing in place that can run on a nightly basis, including the following:
- DP/EP with deepep ll
- DP/EP with deepep ht
- EPLB
- DBO
- Async Scheduling …
-
#22992 [RFC]: Refactor CI/CD — help wanted,good first issue,RFC — by robertgshaw2-redhat (关闭于: 2026-01-24 02:29 (UTC+8)) [💬17] ### Motivation.
vLLM’s CI/CD has grown in a less than ideal way as it has built up over the years.
We have the following problems:
- CI takes very long, especially on a per commit cycle
- CI has failures that cannot be reproduced on every machine due to numerics
- CI has failures on models that are not the 80-20 of our usage — which runs per commit
- CI failures in early tests often lead to vLLM not cleaning up properly — which creates failures across many tests that makes it hard to ident…
-
#22918 [Feature][Tools]: Complete Redesign of Tool Calling — help wanted,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:29 (UTC+8)) [💬26] ### 🚀 The feature, motivation and pitch
- We currently have a patchwork of support for tools in vLLM
- We currently have regexes in tools which can cause noisy-neighbor issues in vLLM
- We would welcome a contributor to redesign the system, improve our CI coverage, and work together with the ecosystem to ensure vLLM’s support for tool calling is elite
### Alternatives
No response
…
-
#22294 [Feature]: Tune Triton Configs for Qwen3-30B-A3-Fp8 and Bf16 — good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:28 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch
Hardware Cases:
- H100, H200, B200
Configurations:
- TP=1
- TP=2
- TP=4
- TP=8 …
-
#14530 [Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1` — good first issue,feature request,unstale — by robertgshaw2-redhat (closed: 2026-01-24 02:28 (UTC+8)) [💬6] ### 🚀 The feature, motivation and pitch
Many of the examples leverage V0 internals.
We should:
- raise `NotImplementedError` if `envs.VLLM_USE_V1` with these
- convert them to use V1 if we can
### Alternatives
…
-
#23963 Model Performance Bash! — help wanted,performance,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:28 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Performance continues to be a top priority in the vLLM project.
We have recently seen an explosion of models launched in the last few months. We are seeking help to ensure vLLM runs these models as efficiently as possible.
To achieve this goal, we are creating a Model Performance Bash! The goal of this bash is to analyze the performance of vLLM’s model execution in various configurations to identify opportunities to improve the model side execution. We …
-
#28152 [Feature]: Factor out `zero_expert_num` from `FusedMoE` — help wanted,feature request — by robertgshaw2-redhat (closed: 2026-01-24 02:21 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We have many special cases in `FusedMoE` for `zero_expert_num`. This parameter is used exclusively for
LongCatFlash. We should factor this out of `FusedMoE` and put the complexity into the model file.
### Alternatives
No response
…
-
#28470 [CI Failure]: Entrypoints Test (API Server) — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬2] ### Name of failing test
tests/entrypoints/openai/test_response_api_with_harmony.py::test_function_call_with_previous_input_messages### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28402 [CI Failure]: Nightly MM Models Extended (3) — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬1] ### Name of failing test
[2025-11-09T06:15:40Z] FAILED models/multimodal/generation/test_common.py::test_custom_inputs_models[llava_onevision-multiple-images-test_case5] - AssertionError: Test0:
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28401 [CI Failure]: Nightly B200 LM Eval Failure — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬1] ### Name of failing test
FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1] - AssertionError: Accuracy too low: 0.000 < 0.450 - 0.080
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28423 [Feature][Kernels]: Integrate FlashInfer MoE Fused Finalize — feature request,torch.compile — by robertgshaw2-redhat (关闭于: 2026-01-24 01:18 (UTC+8)) ### 🚀 The feature, motivation and pitch
We are working on fusing all small ops in DSR1 and other popular models.
There are open work streams on a couple of these:
- RMSNorm + BlockFP8
- ROPE+KV Insert
- All Reduce + RMSNorm
One other one that is possible is fusing the MoE finalize reduction. Here is an op in FlashInfer: …
-
#31039 [Feature]: Integrate Sonic MoE — help wanted,good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 01:18 (UTC+8)) [💬6] ### 🚀 The feature, motivation and pitch
https://x.com/wentaoguo7/status/2001773245318541324?s=46&t=jLcDgQXDbYe6HgFmTNYgpg https://github.com/Dao-AILab/sonic-moe
Curious to see benchmarks!
### Alternatives
No response …
-
#28460 [Bug]: rocm crash AMD Ryzen AI 9 HX PRO 370 w/ Radeon 890M - docker/podman — bug,feature request,rocm,gpt-oss — by hargathor (关闭于: 2026-01-23 23:00 (UTC+8)) [💬20] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#28403 [CI Failure]: Nightly H200 Distributed Test Failure — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 00:06 (UTC+8)) ### Name of failing test
tests/compile/test_sequence_parallelism.py### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28427 [Feature][Kernel]: DeepSeek-R1 KV Proj is Too Slow for TP — feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 00:06 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch TRT-LLM has a SwapAB kernel for KV proj for DSR1. We should integrate this by collaborating with the FlashInfer team
Current situation: we run CUTLASS Block Fp8 for KV proj because DeepGEMM upstream does not support it
- the CUTLASS Block Fp8 kernels are ~1/2 the speed of DeepGEMM for other Linear layers
- the CUTLASS Block Fp8 kernels have padding overhead
Example Trace: <img width=”1303” height=”352” alt=”Image” src=”https://github.com/user-attachments…
-
#32848 [Usage]: Structured output — usage — by danielwit-lb (关闭于: 2026-01-23 22:25 (UTC+8)) ### Your current environment
```text Collecting environment information… uv is set ============================== System Info ============================== OS : Amazon Linux 2023.9.20251208 (x86_64) GCC version : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) …
-
#32870 [Bug][CPU Backend] [Arm]: FAILED tests/kernels/moe/test_moe.py::test_cpu_fused_moe_basic[silu-False-dtype0-2-8-128-128-1] — bug,cpu — by focusunsink (关闭于: 2026-01-23 20:06 (UTC+8)) [💬5] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...python collect_env.py -
#32874 [Bug]: DeepSeek V3.2 return Internal Server Errors instead of Bad Request in reasoning mode — bug — by artvl (关闭于: 2026-01-23 11:44 (UTC+8)) [💬2] ### Your current environment
The output of
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32368 [Bug]: _CPU_MOE_ACT in cpu_fused_moe_torch cause AssertionError: Current vLLM config is not set — bug — by kzwrime (关闭于: 2026-01-23 16:22 (UTC+8)) ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-23 14:55 (UTC+8)) [💬13] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32718 [Bug]: `reload_weights` and `_get_weights_iterator` return cached/stale weights instead of re-reading from disk — bug,rl — by RobotSail (closed: 2026-01-23 13:50 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
```text ============================== System Info ============================== OS : CentOS Stream 9 (x86_64) ... ```
-
#32059 [Feature]: Use platform op if inside a custom op — feature request — by robertgshaw2-redhat (关闭于: 2026-01-23 11:52 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We have the `CustomOp` for `SiluAndMul`:
``` @staticmethod def forward_native(x: torch.Tensor) -> torch.Tensor: """PyTorch-native implementation equivalent to forward().""" d = x.shape[-1] // 2 return F.silu(x[..., :d]) * x[..., d:] …
[New PRs]
-
#32953 [UX] Deduplicate sampling parameter startup logs — frontend,ready — by DarkLight1337 (创建于: 2026-01-24 01:13 (UTC+8)) [+14/-34, 4 files | commented:3 approved:1]
## Purpose
Since multiple
`OpenAIServing*` classes are instantiated by `vllm serve` (even for a single process), the startup log gets spammed like so: ``` (EngineCore_DP0 pid=2378269) INFO 01-23 15:32:21 [core.py:272] init engine (profile, create kv cache, warmup model) took 83.04 seconds (EngineCore_DP0 pid=2378269) INFO 01-23 15:32:21 [vllm.py:618] Asynchronous scheduling is enabled. (APIServer pid=2377843) INFO 01-23 15:32:22 [api_server.py:664] Supported task…
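For illustration, a minimal sketch of one way to deduplicate repeated startup messages across several serving classes (this is not the PR's actual implementation; names here are made up):

```python
# Hypothetical sketch: emit a given startup log line only once per process,
# no matter how many OpenAIServing* instances initialize.
import logging

logger = logging.getLogger("vllm.entrypoints")
_logged_once: set[str] = set()

def log_startup_once(message: str) -> None:
    """Log `message` at INFO level only the first time it is seen."""
    if message in _logged_once:
        return
    _logged_once.add(message)
    logger.info(message)

# Each serving class could call this during __init__ without spamming the log:
# log_startup_once(f"Default sampling params: {default_params}")
```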
-
#32990 Indicate compile mode in the benchmark results — performance — by huydhn (创建于: 2026-01-24 09:47 (UTC+8)) [+13/-0, 1 files | commented:1 approved:1] ## Purpose
After https://github.com/pytorch/pytorch-integration-testing/pull/135, the PyTorch team is now running the vLLM benchmark in both eager and compile mode to compare their performance. In eager mode, vLLM is run with
`compilation_config.mode` set to 0. This PR checks for that config and several other conditions like `enforce_eager` or the output filenames to denote whether the benchmark is run with compile (default) or not. The field [extra_info.use_compile](https://github.com/pytorch/test-infra/…
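A rough sketch of the kind of check described, purely for illustration (field names and the helper are assumptions, not the benchmark script's exact code):

```python
# Illustrative only: decide whether a benchmark run used torch.compile,
# based on the engine arguments described above.
def infer_use_compile(compilation_mode: int | None, enforce_eager: bool) -> bool:
    """Return True when the run is expected to go through torch.compile."""
    if enforce_eager:
        return False
    # compilation_config.mode == 0 corresponds to eager / no compilation.
    if compilation_mode == 0:
        return False
    return True

# Example: record the flag alongside the benchmark results.
extra_info = {"use_compile": infer_use_compile(compilation_mode=0, enforce_eager=False)}
```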
-
#32931 [Feature] Add `qwen2_5_coder` tool parser for Qwen2.5-Coder models — documentation,tool-calling,qwen — by hanXen (created: 2026-01-23 18:59 (UTC+8)) [💬3 | +386/-0, 4 files | commented:2] ## Purpose
Add a dedicated
`qwen2_5_coder` tool call parser for Qwen2.5-Coder models. These models do not follow the hermes `<tool_call>` format (they output code blocks instead), but follow the `<tools>` tag format with 100% compliance when prompted with few-shot examples. Fixes #32926
Usage: ```bash …
-
#32944 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — rocm,ready — by monajafi-amd (创建于: 2026-01-23 23:34 (UTC+8)) [💬3 | +34/-1, 1 files | commented:1 approved:1] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32965 [Core][Bugfix] allow graceful worker termination — bug,ready,v1 — by joerunde (创建于: 2026-01-24 04:28 (UTC+8)) [+18/-5, 1 files | commented:1 approved:1] ## Purpose
This PR fixes a small bug where workers are immediately sent a SIGTERM on shutdown.
(This was extracted from #28053)
When using the MultiprocExecutor, the parent process signals the local worker processes to shut down by closing a pipe (`self.death_pipe`). The worker processes each have a thread waiting for this pipe to close, and will gracefully shut themselves down when it does. However, the parent process immediately sends all workers a `SIGTERM` right after closing its death p…
-
#32902 fix[ROCm]: Remove unconditional aiter import — rocm,speculative-decoding,v1 — by rabi (创建于: 2026-01-23 11:25 (UTC+8)) [💬2 | +18/-7, 3 files | commented:10] ## Purpose AITER modules were being imported unconditionally at module load time, triggering JIT compilation and warnings even when VLLM_ROCM_USE_AITER=0 (the default).
[aiter] WARNING: NUMA balancing is enabled... [aiter] import [module_aiter_enum] under .../aiter/jit/...This occurred because several code paths imported aiter without first checking the VLLM_ROCM_USE_AITER environment variable.
## Test Plan Tested locally …
-
#32989 [Misc]Consolidate RoPE-related parsing into ModelArchitectureConfig — 无标签 — by charlotte12l (创建于: 2026-01-24 08:49 (UTC+8)) [+142/-53, 3 files | commented:1 | 📝草稿]
## Purpose
We are introducing model_arch_config (https://github.com/vllm-project/vllm/pull/28454), which explicitly defines what information the vLLM engine needs from the Hugging Face config / user-defined config, so we can avoid passing hf_config / getattr(hf_config, xxx) around in the engine.
Before this PR,
`_get_and_verify_max_len` in model.py directly accessed multiple HuggingFace config fields:
- rope_parameters
- original_max_position_embeddings …
-
#32988 Auth_token added in documentation as it is required — documentation — by ruizcrp (创建于: 2026-01-24 08:26 (UTC+8)) [💬2 | +2/-0, 1 files | commented:1 approved:1]
## Purpose The documentation for Claude Code is currently not working as described. After some trying, I found out that ANTHROPIC_AUTH_TOKEN is required. At least in my case, this was what finally made it work. This PR simply adds this information.
## Test Plan Does not affect code.
## Test Result Does not affect code …
-
#32908 [Bugfix][TPU] Return a Default fp8 MoE Backend — bug — by vanbasten23 (创建于: 2026-01-23 12:35 (UTC+8)) [💬5 | +1/-3, 1 files | commented:2] ## Purpose
A commit caused TPU to fail at
`def select_fp8_moe_backend` (error). This PR intends to fix this error for TPU. ## Test Plan
TPU CI: ``` USE_MOE_EP_KERNEL=1 MODEL_IMPL_TYPE=vllm vllm serve --seed=42 --model=BCCard/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic --max-model-len=10240 --max-num-batched-tokens=8192 --max-num-seqs=512 --no-enable-prefix-caching --disab…
-
#32983 [Perf] Cache xpu_get_mem_info() result to avoid duplicate calls — v1 — by sjhddh (created: 2026-01-24 07:18 (UTC+8)) [💬2 | +2/-1, 1 files | commented:1] ## Summary Cache the result of
`xpu_get_mem_info()` in a local variable instead of calling it twice in the same expression. Before:
`total_allocated_bytes = self.xpu_get_mem_info()[1] - self.xpu_get_mem_info()[0]` After: ```python …
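The change is the classic "call once, reuse" pattern; a hedged sketch (the real method lives on the XPU worker, and the stand-in below is only for illustration):

```python
# Sketch of the caching pattern described above. get_mem_info() is a stand-in
# for the worker's xpu_get_mem_info(), returning placeholder values.
def get_mem_info() -> tuple[int, int]:
    return (1 << 30, 4 << 30)  # placeholder values for the sketch

# Before: get_mem_info()[1] - get_mem_info()[0] queries the device twice.
# After: cache the tuple in a local variable and index it.
mem_info = get_mem_info()
total_allocated_bytes = mem_info[1] - mem_info[0]
print(total_allocated_bytes)
```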
-
#32982 [Tests] Standardize RNG seed utility across test files — v1 — by sjhddh (created: 2026-01-24 07:18 (UTC+8)) [💬3 | +13/-20, 4 files | commented:2] ## Summary Use the centralized
`set_random_seed` from `vllm.utils.torch_utils` instead of scattered custom seed functions:
- Remove the custom `set_seed()` function from `tests/kernels/test_flex_attention.py`
- Replace `random.seed()` with `set_random_seed()` in `tests/v1/logits_processors/test_custom_offline.py`
- Re-export `set_random_seed` from `tests/utils.py` for convenience
## Motivation
- Ensures consistent seeding of `random`, `numpy`, and `torch` RNGs
- Reduces code duplication …
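For reference, a generic "seed everything" helper of the kind the PR centralizes (a sketch covering the three RNGs mentioned above, not vLLM's exact set_random_seed):

```python
# Hedged sketch of a centralized seeding helper; vLLM's own implementation
# may differ in details.
import random

import numpy as np
import torch

def set_random_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_random_seed(42)
```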
-
#32951 [WIP][Async][Spec Decoding] Zero-bubble async scheduling + spec decoding — speculative-decoding,v1 — by MatthewBonanni (创建于: 2026-01-24 01:01 (UTC+8)) [💬1 | +300/-71, 6 files | commented:1 | 📝草稿] co-authored by @izhuhaoran
## Purpose This is a refactor of #29957
## Test Plan On 8x H200: ``` vllm serve deepseek-ai/DeepSeek-R1
-tp 8 -ep
… -
#32981 [Tests] Clarify pytest skip reasons with actionable context — v1 — by sjhddh (创建于: 2026-01-24 07:16 (UTC+8)) [💬3 | +7/-3, 2 files | commented:1] ## Summary Replace vague FIXME comments in test skip markers with more descriptive reasons:
`tests/v1/sample/test_topk_topp_sampler.py`: Changed from "FIXME: This test is failing right now" to explain the FlashInfer top-k/top-p renorm comparison issue. `tests/samplers/test_beam_search.py`: Changed from "FIXME: This fails on V1 right now" to clarify that the V1 engine does not yet support beam search
## Test Plan
- Run `pytest --collect-only` on the affected test files to verify no syntax errors…
-
#32985 [Fix] Include list index in multimodal validation error messages — frontend — by sjhddh (创建于: 2026-01-24 07:19 (UTC+8)) [+12/-18, 1 files | commented:1] ## Summary Include the list index in multimodal validation error messages to help users identify which specific item has the issue.
Before:
Multi-modal data for image is None but UUID is not providedAfter: ``` …
- #32987 [Docs] Update README with uv recommendation and Python version requirements — documentation — by sjhddh (创建于: 2026-01-24 07:20 (UTC+8)) [💬1 | +7/-1, 1 files | commented:1]
## Summary
Update the README.md “Getting Started” section to:
- Add Python version requirements (3.10 - 3.13) prominently
- Recommend `uv` as the preferred installation method
- Keep `pip` as an alternative for users who prefer it
- Update installation guide link to the general installation page
## Motivation
- `uv` is significantly faster than `pip` for installing vLLM
- The docs already recommend `uv`, so the README should be consistent …
-
#32986 [Tests] Replace flaky sleep with polling in test_background_cancel — v1 — by sjhddh (created: 2026-01-24 07:20 (UTC+8)) [+19/-6, 1 files | commented:1] ## Summary Replace the fixed 0.5s sleep (marked with
`# FIXME: This test can be flaky`) with a proper polling loop that waits for the response status to change before attempting to cancel. ## Changes
- Poll with 0.1s intervals until status changes from “queued” (max 5s)
- Use proper assertions in the post-cancel verification loop
- Remove FIXME comment as the flakiness is addressed
## Motivation
- Fixed sleeps are unreliable: too short on slow machines, wasteful on fast ones …
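The polling idea in miniature, as a sketch with a hypothetical status source rather than the test's actual code:

```python
# Sketch of replacing a fixed sleep with bounded polling. get_status() stands
# in for whatever call the test uses to read the response state.
import time

def wait_until_not_queued(get_status, timeout_s: float = 5.0, interval_s: float = 0.1) -> str:
    """Poll until the status changes from "queued" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == "queued" and time.monotonic() < deadline:
        time.sleep(interval_s)
        status = get_status()
    return status

# Example with a trivial stand-in status source:
assert wait_until_not_queued(lambda: "in_progress") == "in_progress"
```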
- #32980 [CI] Add pip and pre-commit caching to pre-commit workflow — ci/build — by sjhddh (创建于: 2026-01-24 07:16 (UTC+8)) [💬1 | +7/-0, 1 files | commented:1]
## Summary
- Add `cache: 'pip'` to speed up pip dependency installation
- Add `actions/cache` for `~/.cache/pre-commit` to cache pre-commit hook environments
- Cache key is based on the `.pre-commit-config.yaml` hash for proper invalidation
## Test Plan
- Run the pre-commit workflow on a test PR
- Verify cache is created on first run (check logs for “Cache saved”)
- Verify cache is restored on subsequent runs (check logs for “Cache restored”)
- Confirm workflow time decreases by ~30-60 seconds on c…
-
#32945 [Bugfix] Fix empty response under concurrent requests — bug,frontend,v1 — by vasia123 (created: 2026-01-23 23:48 (UTC+8)) [+341/-1, 5 files | commented:2] The
`with_cancellation` decorator returned None when the disconnect listener completed before the handler, causing FastAPI to send HTTP 200 with an empty body. It now raises asyncio.CancelledError instead, so the ASGI framework properly handles the disconnected client. Also adds defensive
`ready.set()` calls in the RequestOutputCollector.put() merge/replace branches to prevent latent lost-wakeup race conditions. Related to: https://github.com/vllm-project/vllm/issues/3209
## Purpose
Fix intermitten…
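A simplified sketch of the behavior change described (illustrative only, not vLLM's actual with_cancellation; the disconnect watcher is a placeholder): when the disconnect side of the race wins, raise asyncio.CancelledError instead of returning None so the framework never serializes an empty 200 response.

```python
# Sketch: race the handler against a disconnect watcher and propagate
# cancellation instead of silently returning None.
import asyncio
import functools

async def watch_disconnect(request) -> None:
    # Placeholder: in the real server this resolves once the client disconnects.
    await asyncio.Event().wait()

def with_cancellation(handler):
    @functools.wraps(handler)
    async def wrapper(request, *args, **kwargs):
        handler_task = asyncio.create_task(handler(request, *args, **kwargs))
        disconnect_task = asyncio.create_task(watch_disconnect(request))
        done, _ = await asyncio.wait(
            {handler_task, disconnect_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if handler_task in done:
            disconnect_task.cancel()
            return handler_task.result()
        handler_task.cancel()
        # Client went away: raise instead of returning None so the ASGI
        # framework handles the disconnect rather than sending an empty 200.
        raise asyncio.CancelledError()
    return wrapper
```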
-
#32976 [AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 — bug,rocm — by rasmith (创建于: 2026-01-24 06:51 (UTC+8)) [+9/-7, 1 files | commented:2] ## Purpose This PR updates
`concat_and_cache_ds_mla_kernel` to use the correct scale divisor when running on `gfx942` architectures on ROCm. The `tile_size` divisor was `448.0`, which does not work on AMD platforms with arch `gfx942`, e.g. MI300, MI325. This PR updates the divisor to `224.0` on `gfx942`. Additionally, I consolidated the scale divisor into a constexpr float. ## Test Plan Use lm_eval to check accuracy on the
`DeepSeek-R1` model. Command to run: `…
- #32954 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — nvidia — by Linda-Stadter (创建于: 2026-01-24 01:15 (UTC+8)) [💬3 | +216/-14, 5 files | commented:2]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- This is a rebased version of PR 28238 by @jiahanc and includes adaptation to the latest moe refactoring changes. I have further verified that the accuracy issues discussed in 28238 are solved.
## Test Plan
``VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct …
-
#32966 Add TUI Monitor: Real-time Terminal Dashboard for vLLM Metrics — documentation — by sjhddh (创建于: 2026-01-24 04:48 (UTC+8)) [💬4 | +448/-0, 1 files | commented:2] ## Summary This PR introduces
vllm_tui_monitor.py, a standalone Terminal User Interface (TUI) tool for monitoring vLLM instances in real-time.Why this is useful: Debugging performance issues or monitoring long-running serving instances often requires setting up a full Prometheus+Grafana stack. This tool provides an immediate, lightweight visualization of key metrics (KV cache usage, throughput, queue depth) directly in the terminal over SSH.
Features:
- **“Reactor Core” Visualizat…
-
#32984 [Perf] Cache exc.errors() result in validation exception handler — frontend — by sjhddh (创建于: 2026-01-24 07:19 (UTC+8)) [+4/-3, 1 files | commented:1] ## Summary Cache the result of
`exc.errors()` to avoid calling it multiple times in the `validation_exception_handler`. Before:
`exc.errors()` called 3 times. After: result cached in an `errors` variable, used 3 times. ## Motivation
- Avoids redundant method calls
- Improves code readability
- Error handling paths should be efficient …
- #32979 [CI] Add pip caching to cleanup_pr_body workflow — ci/build — by sjhddh (创建于: 2026-01-24 07:15 (UTC+8)) [💬1 | +1/-0, 1 files]
## Summary
- Add `cache: 'pip'` to the `actions/setup-python` step in the cleanup_pr_body workflow
- Reduces pip dependency installation time on cache hits
## Test Plan
- Open/edit a PR to trigger the workflow
- Check workflow logs for “Cache restored” message on subsequent runs
- Verify workflow completes successfully
## Risk …
- #32978 [Docs] Sync quantization list between README files — documentation — by sjhddh (创建于: 2026-01-24 07:15 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
## Summary
- Add AutoRound to the quantization list in `docs/README.md`
- Ensures consistency with the root `README.md`, which already lists AutoRound
## Test Plan
- Build documentation with `mkdocs build` to verify no rendering issues
- Visual inspection of the docs landing page
## Risk
- Very low risk: documentation-only change, adding one line
- #32977 [Docs] Fix Apple silicon include path in CPU installation docs — documentation,cpu — by sjhddh (创建于: 2026-01-24 07:14 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1]
## Summary
- Fix incorrect include path in the CPU installation documentation
- The "Build image from source" section for Apple silicon was referencing `cpu.arm.inc.md` instead of `cpu.apple.inc.md`
## Test Plan
- Build the documentation locally with `mkdocs build` or `mkdocs serve`
- Verify the Apple silicon section renders correctly without include errors
## Risk
- Very low risk: documentation-only change …
- #32912 [fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors — ready — by dolpm (创建于: 2026-01-23 14:12 (UTC+8)) [💬2 | +38/-0, 2 files | commented:1 approved:3]
random env var…
"VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME": get_env_or_set_default( "VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME", lambda: f"VLLM_OBJECT_STORAGE_SHM_BUFFER_{uuid.uuid4().hex}", ),will cache miss every time :) —
... -
#32964 [Kernel] [Helion] Helion kernel wrapper — 无标签 — by gmagogsfm (创建于: 2026-01-24 03:45 (UTC+8)) [💬2 | +2347/-0, 8 files | commented:3] THIS IS A MANUALLY STACKED PR, PLEASE ONLY REVIEW TOP COMMIT, lower commits are being reviewed separately in an earlier PR.
This PR adds two basic cases to help Helion kernels with compilation and runtime dispatching.
- HelionKernelWrapper would be constructed by vllm.helion.register() in following PRs. It is responsible for adding Helion kernel to registry and partially specify Helion kernel according to GPU platform and model config. As …
- #32974 [Attention][WIP] FA4 integration — ci/build,v1,nvidia — by LucasWilkinson (创建于: 2026-01-24 06:30 (UTC+8)) [+611/-37, 7 files | commented:2] Integrate upstream FA4
-
#32973 [Perf] Overlap workspace_buffer.fill_(0) with compute in MLA attention — 无标签 — by robertgshaw2-redhat (创建于: 2026-01-24 06:19 (UTC+8)) [💬1 | +167/-1, 2 files | commented:2] Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe
The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute opera…
-
#32975 [Perf] Optimize detokenizer python logic — ready,v1 — by yewentao256 (创建于: 2026-01-24 06:45 (UTC+8)) [+26/-9, 2 files | commented:1] ## Purpose
- adding a `num_output_tokens` function, to avoid `len(self.output_token_ids)`, which might do a slice in SlowIncrementalDetokenizer
```py class SlowIncrementalDetokenizer(BaseIncrementalDetokenizer): @property def output_token_ids(self) -> list[int]: return ( self.token_ids …
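The optimization in outline, as a hedged sketch (assuming the property builds a new list on every access, as the snippet above suggests; this is not the PR's actual class):

```python
# Sketch: len(self.output_token_ids) would build a fresh list via the property
# below, while num_output_tokens just computes a count.
class SlowIncrementalDetokenizerSketch:
    def __init__(self) -> None:
        self.token_ids: list[int] = []
        self._prompt_len = 0

    @property
    def output_token_ids(self) -> list[int]:
        # Constructs a slice (a new list) on every access.
        return self.token_ids[self._prompt_len:]

    @property
    def num_output_tokens(self) -> int:
        # Cheap count, no intermediate list.
        return len(self.token_ids) - self._prompt_len
```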
- #32950 [MLA] Fuse cat and qaunt for fp8 kv-cache — documentation,performance,ready,deepseek — by LucasWilkinson (创建于: 2026-01-24 00:48 (UTC+8)) [💬7 | +41/-20, 1 files | commented:2 approved:3]
Main:
PR:
-
#32971 [CI] fix version comparsion and exclusion patterns in upload-release-wheels.sh — ready,ci/build — by Harry-Chen (创建于: 2026-01-24 05:39 (UTC+8)) [+6/-5, 1 files | commented:4 approved:1] ## Purpose
@khluu found some issues when trying this script. This PR fixes them.
## Test Plan
Normal CI will not trigger this step. Needs to be tested on the next release.
## Test Result
…
-
#32935 [Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig — bug,ready — by joninco (创建于: 2026-01-23 21:10 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix missing
`is_layer_skipped` check for FusedMoE layers in `AWQConfig.get_quant_method`. This caused models with
`modules_to_not_convert` (e.g., MTP layers stored as bfloat16) to incorrectly receive quantized methods, leading to weight loading failures when the code expected quantized weights but found bfloat16. Two bugs fixed:
- MoeWNA16 fallback config was missing `modules_to_not_convert`, so `MoeWNA16Config` couldn't properly skip layers
- `AWQMarlinMoEMethod` was return…
-
#32968 [Misc] Add run one batch script that supports profiling — documentation — by LucasWilkinson (创建于: 2026-01-24 05:28 (UTC+8)) [💬2 | +112/-0, 1 files | commented:1] add a script that runs one batch and can optionally profile it; e.g.
python examples/offline_inference/run_one_batch.py --model deepseek-ai/DeepSeek-V2-Lite --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.9 --kv-cache-dtype fp8 --trust-remote-code --profile both --profile-dir ./profiles/main -
#32952 [Refactor] Rename `gptq_marlin` to `marlin` to match MoE — performance,ready,ci/build — by mgoin (created: 2026-01-24 01:12 (UTC+8)) [💬1 | +40/-40, 24 files | commented:2 approved:1] ## Purpose
- Moved `csrc/quantization/gptq_marlin/` to `csrc/quantization/marlin/`
- Renamed `gptq_marlin.cu` to `marlin.cu`
- Renamed `gptq_marlin_gemm` to `marlin_gemm`
## Test Plan
## Test Result
…
-
#32969 [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling — bug,qwen — by AndreasKaratzas (创建于: 2026-01-24 05:33 (UTC+8)) [💬1 | +29/-12, 1 files | commented:3] Fixes
`RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1 (input tensor's size at dimension 0), but got split_sizes=[4900, 4900]` when running Qwen2.5-VL with the transformers backend. ## Root Cause
During memory profiling, vLLM creates dummy encoder outputs with minimal size (shape
`[1, hidden_dim]`). The original `embed_multimodal` code called `unsqueeze(0)` on 2D tensors, then attempted `torch.split()` with sizes derived from `num_image_patches` (e.g., `[4900, 4900]`), c…
-
#32913 [fix] CPUDNNLGEMMHandler pointer baked into inductor artifact — ready,cpu — by dolpm (创建于: 2026-01-23 14:23 (UTC+8)) [💬1 | +38/-23, 4 files | commented:3 changes:1 approved:2] fixes https://github.com/vllm-project/vllm/issues/32033
pointers shouldn’t be passed around as integers. either you should maintain some lookup table on the cpp side and deal with deterministic keys in python-land, or throw them into a tensor that will cooperate through the graph. i’ve opted for the latter. they will be treated as constants, and will be folded as such. not fun.
e.g.,
`torch.ops._C.onednn_mm.default(buf14, buf11, None, 251002016)` ## Purpose …
-
#32970 [release] Fix upload release wheel script — ci/build — by khluu (创建于: 2026-01-24 05:37 (UTC+8)) [+2/-2, 1 files commented:1] - Cut out the extra
`v` when comparing the tagged version and the expected version
- Use `rc[0-9]` instead of `rc` so as not to accidentally exclude `aarch64` wheels
- #32967 [Frontend] Use init_app_state and FrontendArgs from api_server in run_batch — frontend — by pooyadavoodi (创建于: 2026-01-24 05:20 (UTC+8)) [+186/-132, 3 files | commented:4]
## Purpose
- Adding support for more features such as tool calling to run_batch.
- This is achieved by using `init_app_state` and `FrontendArgs` from `vllm/entrypoints/openai/api_server.py`.
- The approach taken here also removes some code duplication between api_server and run_batch.
- Due to an args conflict over
`--port` between FrontendArgs and the existing run_batch options, we rename the options from `--port` and `--url` to `--metrics-port` and `--metrics-url` and provide a backward c…
-
#32949 [Bug] Fix benchmark script `moe_permute_unpermute` — bug,performance,ready — by yewentao256 (created: 2026-01-24 00:32 (UTC+8)) [+3/-5, 1 files | commented:2 approved:1] ## Purpose
`python benchmarks/kernels/benchmark_moe_permute_unpermute.py`. Two bugs fixed:
```bash 93, ip=10.243.64.20, actor_id=4b4963ca6aa94bfc7fb40cf101000000, repr=<benchmark_moe_permute_unpermute.BenchmarkWorker object at 0x7f18ef275790>) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …
-
#32957 [WIP][Feat] Add RMSNorm NvFp4 Quant Operator (#32612) — ci/build — by sparkecho (created: 2026-01-24 02:53 (UTC+8)) [💬4 | +392/-2, 8 files | commented:5] ## Purpose This commit implements rmsnorm + fp4 quant fusion and integrates it into the rmsnorm + quant fusion pass, fixing #32612
## Test Plan In progress.
## Test Result TODO
-
#32961 [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing — bug,documentation,performance,nvidia — by aviralgarg05 (创建于: 2026-01-24 03:32 (UTC+8)) [💬3 | +66/-48, 23 files | commented:1] # [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing
Fixes #32009
## Purpose
This PR addresses feedback regarding the naming and logic of MoE preparation strategies, specifically for NVFP4/FlashInfer backends.
### 1. Rename NoEP to NoDPEP As pointed out in PR feedback,
`MoEPrepareAndFinalizeNoEP` was a misleading name because it does support Expert Parallelism (EP) via redundant computation and AllReduce. It specifically does not support the Data Parallelism + Expert P…
-
#32946 [cudagraphs] Refactor cudagraph capture loop — ready,v1,nvidia — by LucasWilkinson (created: 2026-01-23 23:55 (UTC+8)) [+117/-59, 3 files | commented:2 approved:1] Refactor cudagraph capture loop to pave the way for different PIECEWISE and FULL sizes and dynamic spec-decode sizes
-
#32963 Update CPU doc according to feedback — documentation,cpu — by louie-tsai (创建于: 2026-01-24 03:41 (UTC+8)) [💬1 | +4/-4, 3 files | commented:1] ## Purpose
Change "supported model" to "recommended model", following the TPU doc and feedback. Overall, we may functionally support all officially supported vLLM models, but we test the recommended models for functionality and performance.
## Test Plan NA ## Test Result NA — …
-
#32960 [compile][cuda_graph]Add sym_size handling by folding them to constant — needs-rebase,nvidia — by fxdawnn (创建于: 2026-01-24 03:23 (UTC+8)) [💬1 | +169/-3, 4 files | commented:2 | 📝草稿]
## Purpose Fix #31043
## Test Plan
- Compare against the repro of the issue after reverting the previous temporary fix
- Adding local test
## Test Result …
-
#32958 glm 4.6 fused tuned inference config for B200 — 无标签 — by navmarri14 (创建于: 2026-01-24 03:02 (UTC+8)) [+163/-0, 1 files | commented:1] This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B200 GPUs using FP8 quantization.
Specifically, it targets the configuration:
Experts (E): 160 Sharded size N=384 for TP=4 Device: NVIDIA B200 Dtype: fp8_w8a8 Previously, vLLM lacked a static configuration for these shapes on B200, causing it to fallback to heuristics or require JIT tuning during startup. This config improves startup time and ensures optimal kernel param…
-
#32948 [Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during install — documentation,ready,ci/build,nvidia — by mgoin (创建于: 2026-01-24 00:30 (UTC+8)) [💬1 | +51/-6, 2 files | commented:2 approved:4]
## Purpose
Inspired by the
`--torch-backend=auto` we use from uv to get the right CUDA 12.x or 13.x binary for dev. We first respect VLLM_MAIN_CUDA_VERSION if set, then check
`torch.version.cuda` if a user already has torch installed for some reason (we assume it works for them), and last we try nvidia-smi to extract the version. Then we map that to the two wheel variants we have on wheels.vllm.ai, cu129 or cu130, based on 12.x or 13.x. On my dev system with `CUDA…
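A hedged sketch of the detection order described above (environment override, then an installed torch, then nvidia-smi), mapping a 12.x/13.x major version to the cu129/cu130 wheel variants; the helper names and regex are assumptions, not the installer's actual code:

```python
# Illustrative sketch of the detection cascade described above.
import os
import re
import subprocess

def detect_cuda_major() -> str | None:
    # 1. Explicit override wins.
    override = os.environ.get("VLLM_MAIN_CUDA_VERSION")
    if override:
        return override.split(".")[0]
    # 2. Ask an already-installed torch, if any.
    try:
        import torch
        if torch.version.cuda:
            return torch.version.cuda.split(".")[0]
    except ImportError:
        pass
    # 3. Fall back to parsing nvidia-smi output.
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
        match = re.search(r"CUDA Version:\s*(\d+)\.", out)
        if match:
            return match.group(1)
    except (OSError, subprocess.CalledProcessError):
        pass
    return None

def wheel_variant(cuda_major: str | None) -> str:
    # Map 12.x -> cu129 wheels, 13.x -> cu130 wheels (per the PR description).
    return "cu130" if cuda_major == "13" else "cu129"
```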
-
#32956 [Bugfix][CI] Fix pre-commit — bug,v1 — by MatthewBonanni (创建于: 2026-01-24 02:19 (UTC+8)) [+1/-2, 1 files | commented:1 approved:1]
## Purpose Fixes a pre-commit issue caused by #30877. The issue shouldn't break CI because
`check_triton_import` only checks the diff, but it shows up when merging main into PR branches. ## Test Plan
## Test Result
…
-
#32955 [Refactor] Use data parser for matching data items to multi-modal UUIDs — frontend,v1 — by DarkLight1337 (创建于: 2026-01-24 01:29 (UTC+8)) [💬2 | +50/-96, 3 files | commented:1 | 📝草稿] ## Purpose
Move the code related to UUID validation from the
`LLM` class to `InputProcessor` to improve readability. Also fixed a mismatch in item counts between `_validate_multi_modal_uuids` and `_maybe_build_mm_uuids` by using the data parser to get the correct number of items in both cases. ## Test Plan
## Test Result
... -
#32934 [BugFix] Fix tool parser crash with parentheses in descriptions — bug,tool-calling — by majiayu000 (创建于: 2026-01-23 19:50 (UTC+8)) [💬3 | +244/-27, 2 files | commented:2] ## Purpose
Fixes #32827
When using OpenAI-compatible function calling, if a parameter's
`description` contains example commands like `(e.g. ls -la, ssh user@host, cat file.txt)`, the vLLM server crashes with "Connection error". This happens because the MiniMax-M2 model sometimes echoes the description back as additional attributes in the XML tags, causing the regex parsing to fail. ## Test Plan
```bash pytest tests/tool_use/test_minimax_m2_tool_parser.py -v …
-
#32924 Create 52012 — 无标签 — by AKB0700 (创建于: 2026-01-23 16:49 (UTC+8)) [💬2 | +1/-0, 1 files | commented:3]
## Purpose
## Test Plan
## Test Result
...
-
#32940 [torch.compile][CI] Add back attn fusion on hopper/ada — ready — by ProExpertProg (created: 2026-01-23 22:58 (UTC+8)) [+4/-5, 1 files | commented:1 approved:2] We were not running attention quant fusion because the unit tests were using llama4 for the model name, which is too big for a single GPU (but the model wasn't used in the test apart from the name). Not technically a problem on B200, but we still don't want to be skipping
-
#32947 [Bugfix] Fix weights offloading for sleep mode — bug,v1 — by jseppanen (创建于: 2026-01-24 00:20 (UTC+8)) [💬1 | +4/-3, 1 files | commented:1] ## Purpose
Fix weights offloading bug.
Before:
[gpu_worker.py:137] Sleep mode freed 24.08 GiB memory, 33.62 GiB memory is still in use.…
-
#32904 [Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test — bug,rocm,ready — by mawong-amd (创建于: 2026-01-23 11:50 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose This PR fixes a failing test (
`kernels/attention/test_cache.py::test_reshape_and_cache_flash`) caused by https://github.com/vllm-project/vllm/pull/30141. The reference dequantization implementation used in the test assumes the FP8 data format is `e4m3fn`, but on AMD `gfx942`-series cards, the FP8 data type used is `e4m3fnuz` instead. ## Test Plan Run
`pytest -sv tests/kernels/attention/test_cache.py -k test_reshape_and_cache_flash` on a MI325X as part of…
-
#32939 [ROCm][PD] Remove unused moriio connector proxy code — documentation,rocm,ready,kv-connector — by markmc (创建于: 2026-01-23 22:16 (UTC+8)) [💬1 +0/-22, 1 files commented:1 approved:1] - send_request_to_decode() is not called anywhere
- extract_ip_port_fast() for P is called twice
-
#32943 Adding optional speculator tests for larger models — speculative-decoding,ci/build,v1 — by shanjiaz (创建于: 2026-01-23 23:15 (UTC+8)) [+26/-1, 2 files | commented:3 | 📝草稿] ## Purpose We just enabled speculative decoding for qwen3 moe vision language pathway. Adding a test for vision language speculator model.
- Add support for testing Qwen3-30B-MOE-VL-Eagle3 speculator model
- Create separate optional CI job for large speculator model tests on A100 GPUs …
- #32942 [Bugfix]: Fix display errors in TORCH_CHECK messages — bug,cpu — by lingebeng (创建于: 2026-01-23 23:14 (UTC+8)) [+8/-8, 6 files | commented:1] Fix display errors in TORCH_CHECK messages.
-
#32941 draft: removed a priori unneeded fills, use optimized copy kernel for DSV3 — 无标签 — by hypdeb (创建于: 2026-01-23 23:11 (UTC+8)) [+17/-6, 1 files | commented:2 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32936 [Model Runner V2] support cudagraph check based on attn backend — v1,nvidia — by izhuhaoran (创建于: 2026-01-23 21:14 (UTC+8)) [💬1 | +177/-113, 6 files] ## Purpose
A follow-up PR of #32771 and #32820.
After #32820 we can select any attention backend, but some of them have limitations for CUDA-graph. This PR, like Model-Runner V1, adds a CUDA-graph check that adjusts the cudagraph mode & capture_sizes according to the attention backend. For example, with FLASHINFER + spec decode, a user-specified FULL_AND_PIECEWISE is automatically resolved to PIECEWISE.
``` [WARNING 01-23 20:32:46 [compilation.py:1148] CUDAGraphMode.FULL_AND_PIECEWISE is not…
-
#32906 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by monajafi-amd (创建于: 2026-01-23 12:07 (UTC+8)) [💬2 | +16257/-10390, 438 files | commented:1 | 📝草稿] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32916 [Bugfix] Disable tma_aligned_scales in test_fusions_e2e — bug,ready — by xyang16 (创建于: 2026-01-23 15:12 (UTC+8)) [+9/-1, 3 files | commented:1 approved:1]
## Purpose
This PR is to disable tma_aligned_scales in test_fusions_e2e to unblock CI.
## Test Plan
``` pytest -s -v tests/compile/distributed/test_fusions_e2e.py …
-
#32933 [Bugfix] Fix getting vision features in Transformer Multimodal backend — bug,ready — by zucchini-nlp (创建于: 2026-01-23 19:16 (UTC+8)) [💬1 | +9/-0, 1 files | commented:2 approved:1] Makes sure that transformers multimodal backend keeps working after v5 release.
PR https://github.com/huggingface/transformers/pull/42564 changed the output of
`self.model.get_image_features` to a `tuple | dict` format. Previously we expected the output to always be a single tensor or a list of tensors for non-homogeneous image sizes. A simple check if the output is a `tuple`. The default output format currently depends on
`model.config.return_dict`, so I added both formats. cc @hmellor
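A small sketch of handling both return formats (illustrative only; the key and tuple position chosen here are assumptions, and the backend's actual unpacking may differ):

```python
# Sketch: normalize the possible return types described above into a tensor
# (or list of tensors).
import torch

def unpack_image_features(output):
    if isinstance(output, dict):
        # Assume the features live under a known key in the dict format.
        return output.get("image_features", next(iter(output.values())))
    if isinstance(output, tuple):
        # Assume the first element carries the features in the tuple format.
        return output[0]
    return output  # already a tensor or list of tensors

print(unpack_image_features((torch.zeros(2, 4), None)).shape)
```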
-
#32937 [DO NOT MERGE][ROCM][PD] Fix MoRIIO connector with transfer_id for P/D coordination — documentation,rocm,kv-connector — by markmc (创建于: 2026-01-23 21:20 (UTC+8)) [💬1 | +514/-72, 5 files | commented:2 | 📝草稿] After #27987 introduced random suffixes to internal request_ids, the MoRIIO connector broke because Prefill and Decode instances now have different internal request_ids (e.g., “cmpl-uuid-abc” vs “cmpl-uuid-def”). The MoRIIO connector’s parallel dispatch model requires a stable, shared identifier for P/D coordination.
This commit introduces a
transfer_id(format: “xfer-{uuid}”) that is generated by the proxy and shared between Prefill and Decode for KV transfer coordination. The connector work… -
#32905 [Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest — documentation,frontend,ready — by noooop (创建于: 2026-01-23 11:54 (UTC+8)) [💬2 | +326/-261, 11 files | commented:2 approved:1]
## Purpose
Split out the following RequestMixin
- EncodingRequestMixin
- EmbedRequestMixin
- ClassifyRequestMixin
address https://github.com/vllm-project/vllm/pull/31784#discussion_r2716425132 …
-
#32927 [Benchmark][Bugfix] Fix race condtion when starting server for sweep benchmark — bug,performance,ready — by Isotr0py (创建于: 2026-01-23 18:07 (UTC+8)) [+37/-0, 2 files | commented:3 approved:1]
## Purpose
- Currently, there is a race condition where the bench command is executed too early, before the server becomes ready, for
`vllm bench sweep serve`: ``` [BEGIN SERVER] Server overrides: {'mm_encoder_attn_backend': 'FLASH_ATTN'} Server command: ['vllm', 'serve', '/home/mozf/LLM/Qwen3-VL-4B-Instruct/', '--enforce-eager', '--max-model-len', '32768', '--mm-processor-cache-gb=0', '--media-io-kwargs.video.num_frames', '-1', '--mm-encoder-attn-backend', 'FLASH_ATTN'] [BEGI…
-
#32923 [StepVL] support close img patch — 无标签 — by ltd0924 (创建于: 2026-01-23 16:32 (UTC+8)) [💬3 | +8/-2, 2 files | commented:3]
## Purpose The default behavior of the StepVL model involves patch operations for image computation. A new environment variable,
`VLLM_ENABLE_STEP_VL_IMG_PATCH`, has been introduced to allow disabling patch computation, thus supporting image-only computation without patch operations (i.e., calculating embeddings for the image alone). ## Test Plan
## Test Result
…
-
#32930 [MoE] Add FusedMoE configs for RTX PRO 6000 Blackwell Max-Q — 无标签 — by ErikDeBruijn (创建于: 2026-01-23 18:43 (UTC+8)) [💬2 | +400/-0, 8 files | commented:2] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case | Latency (M=16) |
|---|---|----------|----------------|
| 8 | 14336 | Mixtral 8x7B/8x22B | 1.58ms |
| 8 | 7168 | Large 8-expert | 0.92ms |
…
-
#32907 [CI/Build][CPU] Fix failed pooling tests and macos smoke test — ready,cpu — by bigPYJ1151 (创建于: 2026-01-23 12:29 (UTC+8)) [💬1 | +7/-1, 2 files | commented:6 approved:1]
## Purpose
- Skip post weight processing for missing layers
- Exclude unsupported op for macos
## Test Plan
CI tests …
-
#32928 [MoE] Add tuned config for RTX PRO 6000 Blackwell Max-Q (E=64, N=1024) — 无标签 — by ErikDeBruijn (创建于: 2026-01-23 18:27 (UTC+8)) [💬2 | +720/-0, 8 files | commented:1] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering the most popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case |
|---|---|----------|
| 8 | 14336 | Mixtral 8x7B/8x22B |
| 8 | 7168 | Large 8-expert models |
…
-
#32929 [Draft][FP8]add FP8 WoQ kernel abstraction. — nvidia — by jikunshang (创建于: 2026-01-23 18:27 (UTC+8)) [💬1 | +329/-42, 6 files | commented:2]
## Purpose add abstraction for Wfp8A16 kernel.
## Test Plan
## Test Result
…
-
#32922 [Bugfix] Fix ‘remove_instance_endpoint’ method logic in disagg_proxy_demo — bug,documentation,kv-connector — by ChenqianCao (创建于: 2026-01-23 15:58 (UTC+8)) [💬2 | +2/-2, 1 files | commented:1]
## Purpose
Link to the issue: Fixes #32920
Fixed a simple logic bug in the
`remove_instance_endpoint` (vllm\examples\online_serving\disaggregated_serving\disagg_proxy_demo.py) method where the logic for removing prefill instances was checking against the wrong instance list and cycling the wrong iterator. ## Problem
…
-
#32914 [ROCm][perf] Shuffle KV cache to use paged_attention_common — rocm,ci/build,v1 — by samutamm (创建于: 2026-01-23 14:37 (UTC+8)) [💬4 | +40/-5, 2 files | commented:1] ## Purpose For Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, currently
`VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1` performs worse at small concurrencies compared to `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=0`. This PR fixes the issue using `paged_attention_common` from aiter (see https://github.com/ROCm/aiter/pull/1821). ## Test Plan For input and output lengths of 1k and 8k and concurrencies of 8, 18, 32, 64, 128, compare the current main branch with and without VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT (_vllm_ma…
-
#32917 Sm120 130 — ci/build,nvidia — by pacoxu (创建于: 2026-01-23 15:29 (UTC+8)) [+186/-3, 3 files | commented:1 | 📝草稿]
https://github.com/vllm-project/vllm/pull/19757 https://github.com/vllm-project/vllm/pull/32237/
## Purpose
## Test Plan
## Test Result …
-
#32918 [Fix] Update CUTLASS_REVISION to v4.3.5 — ci/build,nvidia — by pacoxu (创建于: 2026-01-23 15:29 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
https://github.com/NVIDIA/cutlass?tab=readme-ov-file#whats-new-in-cutlass-43 https://docs.nvidia.com/cutlass/latest/CHANGELOG.html
CUTLASS 4.3.5 vs 4.2.1: it includes many Blackwell SM100/SM103/SM120/SM130 support features and bugfixes.
- https://github.com/vllm-project/vllm/pull/12702 is an old PR to update CUTLASS (that PR updated it to 3.8) and CUDA_SUPPORTED_ARCHS (the CUDA arches have already been updated on main). #12702 is outdated and should be closed.
This may benefit PR…
-
#32909 [CI][Pooling] Stabilize ModernBERT test — no labels — by AndreasKaratzas (created: 2026-01-23 13:05 (UTC+8)) [💬2 | +27/-0, 1 files | commented:10] This PR attempts to stabilize the
Language Models Test (Extended Pooling). It marks it as flaky, enables rerunning the test, and logs an informative message regarding the possibly inaccurate results, based on the conversation here: https://github.com/vllm-project/vllm/pull/32403#pullrequestreview-3695341925 ## Related
- Addresses flaky test: `test_modernbert_models[float-disham993/electrical-ner-ModernBERT-base]`
- Related to #32403 which increased test tolerance from `atol=1.2e-2` to `atol=3…
[Merged PRs]
-
#32944 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — rocm,ready — by monajafi-amd (合并于: 2026-01-24 10:03 (UTC+8)) [💬3 | +34/-1, 1 files | commented:1 approved:1] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32279 [Bugfix] Fix FusedMoE LoRA kernel offs_token out of bound value — bug,ready — by xyang16 (合并于: 2026-01-24 09:41 (UTC+8)) [💬6 | +4/-2, 1 files | commented:2 approved:1] ## Purpose
This PR is a fix to correct the out of bound value in FusedMoE LoRA kernel.
The `offs_token` out-of-bound value should be the `num_valid_tokens` sentinel instead of 0. This avoids reading token 0 for the out-of-bound range. Results show an accuracy improvement and a slight performance gain.
- And a minor fix of the `moe_weight` out-of-bound value to 0.0, because it is an fp type.
## Test Plan
`pytest -s -v tests/lora/test_fused_moe_lora_kernel.py` …
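To illustrate the sentinel idea from #32279 outside of Triton, here is a small PyTorch sketch (hypothetical helper, not the kernel itself): out-of-bound offsets are redirected to a dedicated padding row at index `num_valid_tokens` instead of falling back to token 0.
```python
# Out-of-range token offsets point at an appended padding row rather than at
# token 0, so masked lanes never read real token data.
import torch

def gather_tokens_with_sentinel(tokens: torch.Tensor,
                                offs_token: torch.Tensor,
                                num_valid_tokens: int) -> torch.Tensor:
    """tokens: [num_valid_tokens, hidden]; offs_token may contain padding slots."""
    hidden = tokens.shape[1]
    # Append one zero "sentinel" row at index num_valid_tokens.
    padded = torch.cat([tokens, tokens.new_zeros(1, hidden)], dim=0)
    # Redirect out-of-bound offsets to the sentinel instead of clamping to 0.
    safe_offs = torch.where(offs_token < num_valid_tokens,
                            offs_token,
                            torch.full_like(offs_token, num_valid_tokens))
    return padded[safe_offs]
```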
-
#32965 [Core][Bugfix] allow graceful worker termination — bug,ready,v1 — by joerunde (合并于: 2026-01-24 09:28 (UTC+8)) [+18/-5, 1 files | commented:1 approved:1] ## Purpose
This PR fixes a small bug where workers are immediately sent a SIGTERM on shutdown.
(This was extracted from #28053)
When using the MultiprocExecutor, the parent process signals the local worker processes to shut down by closing a pipe (`self.death_pipe`). The worker processes each have a thread waiting for this pipe to close, and will gracefully shut themselves down when it does. However, the parent process immediately sends all workers a `SIGTERM` right after closing its death p…
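A simplified, self-contained sketch of the death-pipe pattern #32965 describes (names and timings are illustrative, not the MultiprocExecutor code): the parent closes the pipe, waits for a graceful exit, and only escalates to SIGTERM if the worker does not stop in time.
```python
# The parent closes its write end to request shutdown, waits a grace period,
# and only then escalates to SIGTERM. "spawn" is used so the child does not
# inherit the parent's write end of the pipe.
import multiprocessing as mp
import os
import signal
import threading
import time

def worker(death_pipe_r):
    stop = threading.Event()

    def watch():
        try:
            death_pipe_r.recv()   # raises EOFError once the parent closes the write end
        except EOFError:
            pass
        stop.set()

    threading.Thread(target=watch, daemon=True).start()
    while not stop.is_set():
        time.sleep(0.1)           # ... real work would happen here ...
    print(f"worker {os.getpid()} exiting gracefully")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    r, w = ctx.Pipe(duplex=False)
    p = ctx.Process(target=worker, args=(r,))
    p.start()
    r.close()                     # parent keeps only the write end
    time.sleep(1)
    w.close()                     # signal shutdown by closing the pipe ...
    p.join(timeout=5)             # ... and give the worker time to exit on its own
    if p.is_alive():
        os.kill(p.pid, signal.SIGTERM)  # escalate only if graceful shutdown timed out
```
-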
#25954 [Performance] Split FlashAttn attention and cache update — documentation,speculative-decoding,ready,v1,qwen,kv-connector,nvidia,ready-run-all-tests — by ElizaWszola (合并于: 2026-01-24 09:28 (UTC+8)) [💬36 | +458/-68, 21 files | commented:10] This PR creates codepaths for separating KV Cache update and Attention forward op. It also implements this split for FlashAttn backend. This separation facilitates future unwrapping.
#### E2E tests: ran inference on a Blackwell machine with
`llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)` and both `VLLM_MLA_DISABLE=1` (to test the split) and `VLLM_MLA_DISABLE=0` (to test that this PR does not affect backends that don’t do the split). Some lm_eval results (flash infer…
- #32742 [Model Runner V2] Add KV Connector support — ready,v1,kv-connector — by njhill (合并于: 2026-01-24 02:49 (UTC+8)) [💬1 | +159/-16, 4 files | commented:4 approved:1] Tested with cpu offloading and NIXL P/D.
- #32912 [fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors — ready — by dolpm (合并于: 2026-01-24 06:53 (UTC+8)) [💬2 | +38/-0, 2 files | commented:1 approved:3]
random env var…
"VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME": get_env_or_set_default( "VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME", lambda: f"VLLM_OBJECT_STORAGE_SHM_BUFFER_{uuid.uuid4().hex}", ),will cache miss every time :) —
... -
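A toy illustration (not vLLM code) of why a uuid-based default in a compile factor causes a miss on every run: any factor value that changes per process changes the cache key.
```python
# The factor value changes on every process start, so the hash never matches a
# previous run; a fixed value keeps the key stable.
import hashlib
import json
import uuid

def cache_key(factors: dict) -> str:
    return hashlib.sha256(json.dumps(factors, sort_keys=True).encode()).hexdigest()

stable = cache_key({"SHM_BUFFER_NAME": "fixed-name"})
random_a = cache_key({"SHM_BUFFER_NAME": f"buf_{uuid.uuid4().hex}"})
random_b = cache_key({"SHM_BUFFER_NAME": f"buf_{uuid.uuid4().hex}"})

assert cache_key({"SHM_BUFFER_NAME": "fixed-name"}) == stable   # stable value -> cache hit
assert random_a != random_b                                     # random default -> miss every time
```
-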
#32971 [CI] fix version comparison and exclusion patterns in upload-release-wheels.sh — ready,ci/build — by Harry-Chen (合并于: 2026-01-24 06:21 (UTC+8)) [+6/-5, 1 files | commented:4 approved:1] ## Purpose
@khluu found some issues when trying this script. This PR fixes them.
## Test Plan
Normal CI will not trigger this step. Needs to be tested on the next release.
## Test Result
…
-
#32935 [Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig — bug,ready — by joninco (合并于: 2026-01-24 06:19 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix missing
`is_layer_skipped` check for FusedMoE layers in `AWQConfig.get_quant_method`. This caused models with `modules_to_not_convert` (e.g., MTP layers stored as bfloat16) to incorrectly receive quantized methods, leading to weight loading failures when the code expected quantized weights but found bfloat16.
Two bugs fixed:
- MoeWNA16 fallback config was missing `modules_to_not_convert`, so `MoeWNA16Config` couldn’t properly skip layers
- `AWQMarlinMoEMethod` was return…
-
#32692 [Refactor] Clean up unused variables & func — rocm,frontend,ready — by yewentao256 (合并于: 2026-01-24 06:04 (UTC+8)) [+0/-30, 5 files | commented:1 approved:1] ## Purpose
Clean up unused variables & func
-
#32869 [CPU][Feat] Update PyTorch to v2.10 for CPU Backend — ready,ci/build,cpu — by fadara01 (合并于: 2026-01-23 21:13 (UTC+8)) [💬6 | +5/-6, 3 files | commented:2 approved:1] ## Purpose
Update PyTorch to v2.10 for CPU Backend.
A lot of nice features, improvements and bug fixes have gone into PyTorch v2.10 for Arm CPUs and we should capitalize on that in vLLM!
Here’s a non-exhaustive list:
- Enable mimalloc on AArch64: Switched the c10 system allocator to mimalloc on AArch64 to improve large-allocation behavior and overall performance. This delivers broad wins across dtypes, including up to 60% faster DenseNet121 (FP32) and up to 40% faster GPT2-Large (BF16) [#pu…
-
#32952 [Refactor] Rename
`gptq_marlin` to `marlin` to match MoE — performance,ready,ci/build — by mgoin (合并于: 2026-01-24 05:48 (UTC+8)) [💬1 | +40/-40, 24 files | commented:2 approved:1] ## Purpose
- Moved `csrc/quantization/gptq_marlin/` to `csrc/quantization/marlin/`
- Renamed `gptq_marlin.cu` to `marlin.cu`
- Renamed `gptq_marlin_gemm` to `marlin_gemm`
## Test Plan
## Test Result
…
-
#32831 [CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel — bug,rocm,ready — by rasmith (合并于: 2026-01-24 05:35 (UTC+8)) [+23/-0, 2 files | commented:8 approved:1] ## Purpose The
`wvSplitKQ_hf_sml` kernel is not able to properly handle non-contiguous tensors. This is causing failures in AMD CI, and possibly other failures or inaccuracies. In the long term, it might be better to update the kernel to handle non-contiguous tensors properly, since the current approach trades some performance for correctness. We can then work on a fix to the kernels in skinny_gems.cu to handle non-contiguous tensors. In addition, all of the kernels in `ski…
-
#32949 [Bug] Fix benchmark script
`moe_permute_unpermute` — bug,performance,ready — by yewentao256 (合并于: 2026-01-24 05:18 (UTC+8)) [+3/-5, 1 files | commented:2 approved:1] ## Purpose
`python benchmarks/kernels/benchmark_moe_permute_unpermute.py`
Two bugs fixed
```bash 93, ip=10.243.64.20, actor_id=4b4963ca6aa94bfc7fb40cf101000000, repr=<benchmark_moe_permute_unpermute.BenchmarkWorker object at 0x7f18ef275790>) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …
-
#32614 fix: Add glm4_moe_lite to MLA detection — ready,v1,nvidia — by marksverdhei (合并于: 2026-01-24 04:38 (UTC+8)) [💬3 | +70/-10, 4 files | commented:4 approved:4 changes:1] ## Summary
- Add `glm4_moe_lite` and `glm4_moe_lite_mtp` to the `is_deepseek_mla()` check in `model_arch_config_convertor.py`
GLM-4.7-Flash (`glm4_moe_lite`) uses Multi-head Latent Attention (MLA) via `Glm4MoeLiteMLAAttention` (which inherits from `DeepseekV2MLAAttention`) but was missing from the MLA detection. Without this fix, vLLM falls back to standard KV caching instead of efficient MLA caching, resulting in ~4x higher KV cache memory usage.
Note: `glm4_moe` is intentionally NOT included a…
- #32946 [cudagraphs] Refactor cudagraph capture loop — ready,v1,nvidia — by LucasWilkinson (合并于: 2026-01-24 04:22 (UTC+8)) [+117/-59, 3 files | commented:2 approved:1] Refactor cudagraph capture loop to pave the way for different PIECEWISE and FULL sizes and dynamic spec-decode sizes
-
#32956 [Bugfix][CI] Fix pre-commit — bug,v1 — by MatthewBonanni (合并于: 2026-01-24 02:26 (UTC+8)) [+1/-2, 1 files | commented:1 approved:1]
## Purpose Fixes a pre-commit issue caused by #30877. The issue shouldn’t break CI because `check_triton_import` only checks the diff, but it shows up when merging main into PR branches.
## Test Plan
## Test Result
…
-
#30877 [V1][Hybrid] Mamba Prefix Caching with align mode — documentation,ready,v1,qwen — by peakcrosser7 (合并于: 2026-01-24 01:56 (UTC+8)) [💬22 | +1775/-129, 42 files | commented:10] The cleaned-up version of #29272
## Purpose
This PR enhances the design of #28176, adopting the same memory layout as FullAttention while adding support for decode caching and speculative decoding.
The core idea of this Mamba Prefix-Caching implementation (referred to as LPC) is to directly cache Mamba states through block-aligned scheduling. This approach enables rapid support for Prefix-caching in Mamba models without modifications to the underlying kernel code. Furthermore, it ma…
-
#30443 [CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests — documentation,ready,ci/build,ready-run-all-tests — by orionr (合并于: 2026-01-24 02:22 (UTC+8)) [💬19 | +203/-32, 4 files | commented:9 approved:1] Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.
Moving this from https://github.com/vllm-project/ci-infra/pull/239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo
Tests to confirm:
- Baseline (my `vllm` fork matching HEAD, no `ci-infra` changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. …
-
#32885 [CI][Models] Add VLM Support for Sequence Classification Conversion — ready,v1 — by AndreasKaratzas (合并于: 2026-01-23 16:22 (UTC+8)) [💬4 | +155/-39, 3 files | commented:2 approved:1] This PR enables Vision-Language Models (VLMs) like Gemma 3 to be converted to sequence classifiers using the
`no_post_processing` and `from_2_way_softmax` methods. Additionally, it fixes two PyTorch compiler warnings that were causing noise in the logs.
## Changes
### 1. `vllm/model_executor/models/adapters.py`
- Added `_get_language_model_for_seq_cls()` helper function to correctly retrieve the inner language model component from VLMs
- Updated `load_weights_no_post_processing()` and `load_w…
-
#32397 [Model] Enable LoRA support for internvl2 — ready — by MatteoFari (合并于: 2026-01-24 01:39 (UTC+8)) [💬3 | +16/-3, 1 files | commented:1 approved:1]
## Purpose Enable dynamic LoRA adapters on InternVL2 vision tower + connector (part of #31479) by implementing:
- `get_num_mm_encoder_tokens()`
- `get_num_mm_connector_tokens()`
## Technical Details
InternVL2 has a CLS token and pixel-shuffle downsampling (downsample_ratio), so LM tokens (post-downsample) don’t match tower tokens (pre-downsample). These helpers map token counts correctly between LM and tower/connector. ...
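A back-of-the-envelope sketch of the token-count mapping that #32397 describes, assuming an InternVL-style 448-pixel image, patch size 14, a single CLS token, and downsample_ratio=0.5 (values are illustrative):
```python
# Tower tokens are counted pre-downsample (including CLS); LM tokens are what
# remains after pixel-shuffle downsampling drops the CLS and merges patches.
def tower_tokens(image_size: int, patch_size: int) -> int:
    side = image_size // patch_size
    return side * side + 1                       # patch tokens + 1 CLS token

def lm_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    side = image_size // patch_size
    side = int(side * downsample_ratio)          # pixel shuffle at ratio 0.5 merges 2x2 patches
    return side * side                           # CLS token is dropped before the connector

print(tower_tokens(448, 14))                     # 1025 tokens in the vision tower
print(lm_tokens(448, 14, 0.5))                   # 256 tokens fed to the LM
```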
-
#32796 [Misc] Log vLLM logo when starting server — frontend,ready — by njhill (合并于: 2026-01-23 11:15 (UTC+8)) [💬5 | +50/-4, 5 files | commented:3 approved:1] TBD if folks like this :)
- #32940 [torch.compile][CI] Add back attn fusion on hopper/ada — ready — by ProExpertProg (合并于: 2026-01-24 00:49 (UTC+8)) [+4/-5, 1 files | commented:1 approved:2] We were not running attention quant fusion because the unit tests were using llama4 as the model name, which is too big for a single GPU (but the model wasn’t actually used in the test apart from the name). Not technically a problem on B200, but we still don’t want to be skipping
-
#31059 [Frontend] add logprob, compression_rate to ‘verbose_json’ features — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by sangbumlikeagod (合并于: 2026-01-24 00:35 (UTC+8)) [💬4 | +32/-14, 4 files | commented:6 approved:1] ## Purpose
In an earlier PR I added verbose_json options to Whisper inference, but left out some fields such as avg_logprob and compression_ratio. This PR adds those fields to verbose_json segments!
https://github.com/vllm-project/vllm/pull/24209
## Test Plan
curl --location 'http://domain:port/v1/audio/transcriptions'
--header 'Authorization: ••••••'
… -
#32904 [Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test — bug,rocm,ready — by mawong-amd (合并于: 2026-01-24 00:24 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose This PR fixes a failing test (
`kernels/attention/test_cache.py::test_reshape_and_cache_flash`) caused by https://github.com/vllm-project/vllm/pull/30141. The reference dequantization implementation used in the test assumes the FP8 data format is `e4m3fn`, but on AMD `gfx942`-series cards, the FP8 data type used is `e4m3fnuz` instead.
## Test Plan
Run `pytest -sv tests/kernels/attention/test_cache.py -k test_reshape_and_cache_flash` on a MI325X as part of…
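A minimal sketch of the platform-dependent FP8 format choice behind #32904 (the arch-prefix check below is an assumption for illustration, not vLLM's actual dispatch code):
```python
# AMD gfx94x-series devices use the "fnuz" FP8 variant; most other targets use
# the OCP e4m3fn format, which is what the test's reference path assumed.
import torch

def fp8_dtype_for(arch_name: str) -> torch.dtype:
    # e.g. torch.cuda.get_device_properties(0).gcnArchName -> "gfx942"
    if arch_name.startswith("gfx94"):
        return torch.float8_e4m3fnuz
    return torch.float8_e4m3fn
```
-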
#32939 [ROCm][PD] Remove unused moriio connector proxy code — documentation,rocm,ready,kv-connector — by markmc (合并于: 2026-01-23 23:59 (UTC+8)) [💬1 +0/-22, 1 files commented:1 approved:1] - send_request_to_decode() is not called anywhere
- extract_ip_port_fast() for P is called twice
- #32886 [Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 — bug,ready,llama — by baonudesifeizhai (合并于: 2026-01-23 23:31 (UTC+8)) [💬1 | +21/-1, 1 files | approved:1 commented:3]
## Purpose
#32862
Add a version-guarded fallback in Llama4 MoE weight loading to avoid CPU FP8 indexing on older PyTorch releases. For torch < 2.11, weights are temporarily cast to FP16 for indexing and then cast back, preventing index_cpu errors while leaving newer versions unchanged.
## Test Plan
```
VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel
VLLM_USE_FLASHINFER_MOE_FP8=0
vllm bench throughput --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 -tp=2 --enable-expert-parallel …
```
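A hedged sketch of the version-guarded fallback described in #32886 (illustrative, not the PR's exact code): on PyTorch older than 2.11, round-trip through FP16 just for the indexing step.
```python
# Older torch can fail when indexing FP8 weights on CPU, so cast to FP16 for
# the gather and cast back; newer torch indexes FP8 directly.
import torch

def index_fp8_weight(weight: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    major, minor = (int(v) for v in torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 11):
        return weight[idx]
    return weight.to(torch.float16)[idx].to(weight.dtype)
```
-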
#32876 [CPU Backend][BugFix] Fix failing CPU MoE test — bug,ready — by fadara01 (合并于: 2026-01-23 20:06 (UTC+8)) [💬3 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
Fixes: #32870
## Test Plan
Ran reproducer in #32870 locally
## Test Result
…
- #32867 [Misc] Postpone torch_profiler deprecation — ready — by NickLucche (合并于: 2026-01-23 22:39 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] I think we can safely postpone this deprecation schedule. Updating warning accordingly.
-
#32916 [Bugfix] Disable tma_aligned_scales in test_fusions_e2e — bug,ready — by xyang16 (合并于: 2026-01-23 22:34 (UTC+8)) [+9/-1, 3 files | commented:1 approved:1]
## Purpose
This PR is to disable tma_aligned_scales in test_fusions_e2e to unblock CI.
## Test Plan
`pytest -s -v tests/compile/distributed/test_fusions_e2e.py` …
-
#32933 [Bugfix] Fix getting vision features in Transformer Multimodal backend — bug,ready — by zucchini-nlp (合并于: 2026-01-23 21:34 (UTC+8)) [💬1 | +9/-0, 1 files | commented:2 approved:1] Makes sure that transformers multimodal backend keeps working after v5 release.
PR https://github.com/huggingface/transformers/pull/42564 changed the output of `self.model.get_image_features` to a `tuple | dict` format. Previously we expected the output to always be a single tensor, or a list of tensors for non-homogeneous image sizes. A simple check for whether the output is a `tuple` handles this.
The default output format currently depends on `model.config.return_dict`, so I added handling for both formats. cc @hmellor
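A minimal compatibility shim in the spirit of #32933 (the dict handling below is an assumption for illustration; the PR adds a similar container-type check):
```python
# Accept the old single-tensor/list return as well as the new tuple/dict return.
def extract_image_features(out):
    if isinstance(out, dict):
        return next(iter(out.values()))  # assumed: features stored as the first value
    if isinstance(out, tuple):
        return out[0]                    # new tuple format: first element holds features
    return out                           # old behavior: tensor or list of tensors
```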
-
#32815 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream — ready,v1 — by tianshu-Michael-yu (合并于: 2026-01-23 21:20 (UTC+8)) [+260/-158, 5 files | commented:10] Purpose This PR removes all CUDA Device-to-Host (DtoH) memcpy/syncs observed during lfm2_vl preprocess on the default compute stream.
Main changes:
- Run LFM2-VL’s SigLIP2 vision path fully packed (unpadded) end-to-end so we don’t trigger tiny syncs from CUDA nonzero / padding logic.
- Avoid host sync in the vision attention path by keeping max_seqlen on CPU (prevent .item()-style sync in the FA wrapper).
- Remove remaining preprocess DtoH in the ShortConv/Mamba metadata path by using CPU-side query_sta…
-
#32905 [Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest — documentation,frontend,ready — by noooop (合并于: 2026-01-23 20:03 (UTC+8)) [💬2 | +326/-261, 11 files | commented:2 approved:1]
## Purpose
Split out the following RequestMixin
- EncodingRequestMixin
- EmbedRequestMixin
- ClassifyRequestMixin
address https://github.com/vllm-project/vllm/pull/31784#discussion_r2716425132 …
-
#32927 [Benchmark][Bugfix] Fix race condition when starting server for sweep benchmark — bug,performance,ready — by Isotr0py (合并于: 2026-01-23 20:11 (UTC+8)) [+37/-0, 2 files | commented:3 approved:1]
## Purpose
- Currently, there is a race condition where the bench command is executed too early, before the server becomes ready, for `vllm bench sweep serve`:
[BEGIN SERVER] Server overrides: {'mm_encoder_attn_backend': 'FLASH_ATTN'} Server command: ['vllm', 'serve', '/home/mozf/LLM/Qwen3-VL-4B-Instruct/', '--enforce-eager', '--max-model-len', '32768', '--mm-processor-cache-gb=0', '--media-io-kwargs.video.num_frames', '-1', '--mm-encoder-attn-backend', 'FLASH_ATTN'] [BEGI…
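One way to close this kind of race is to poll the server's health endpoint before launching the bench command; the helper below is a hypothetical sketch, not the benchmark's actual implementation:
```python
# Poll until the health endpoint responds, then return; raise if the server
# never becomes ready within the timeout.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout_s: float = 300.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(2)
    raise TimeoutError(f"server at {url} not ready after {timeout_s}s")
```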
-
#32861 [Voxtral] Add new streaming arch — ready,multi-modality,llama — by patrickvonplaten (合并于: 2026-01-23 19:41 (UTC+8)) [💬1 +768/-314, 9 files commented:9 approved:1] - Moves causal whisper logic into own file
- Adapt voxtral streaming arch to improved architecture
- Add tests that are skipped for now
-
#32907 [CI/Build][CPU] Fix failed pooling tests and macos smoke test — ready,cpu — by bigPYJ1151 (合并于: 2026-01-23 18:48 (UTC+8)) [💬1 | +7/-1, 2 files | commented:6 approved:1]
## Purpose
- Skip post weight processing for missing layers
- Exclude unsupported op for macos
## Test Plan
CI tests …
-
#32698 [Misc] Add
`get_name` to missing AttentionBackends — ready,v1 — by NickLucche (合并于: 2026-01-23 18:35 (UTC+8)) [💬5 | +28/-1, 6 files | commented:2 approved:1] There are a few attention backends that are missing this meta-method, most notably the Mamba ones. Although the `get_name()` descriptor isn’t a functional one, I would expect every backend to define a simple unique tag for identification, adhering to the new structure of AttentionBackends. I don’t have a strong opinion on how this should be carried out (I am also fine with the base class providing a default), but I tried adding a basic decorator util to make the process less verbose in terms of lines of…
- #32777 [Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set — bug,ready,cpu — by karanb192 (合并于: 2026-01-23 16:22 (UTC+8)) [+26/-26, 2 files | commented:1 approved:1]
## Summary
- Fixes the `AssertionError: Current vLLM config is not set` in `cpu_fused_moe_torch` caused by `_LazyActivationDict` triggering `CustomOp.__init__()` during the forward pass
- Replaces `_LazyActivationDict` with direct function references (`_CPU_MOE_ACT_FN`) to avoid instantiating `CustomOp` classes
- Uses `SiluAndMul.forward_native` directly (it’s a `@staticmethod`) and a standalone `_swigluoai_forward_native` function for `SwigluOAIAndMul`
Fixes #32368
## Problem The `_LazyActivation…
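A simplified sketch of the idea behind #32777: map activation names straight to plain functions instead of instantiating `CustomOp` subclasses during the forward pass (the silu-and-mul math below mirrors `SiluAndMul.forward_native`, but the snippet is illustrative):
```python
# Plain functions can be looked up and called without constructing CustomOp
# instances, so no vLLM config context is needed at call time.
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half: silu(gate) * up
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

_CPU_MOE_ACT_FN = {
    "silu": silu_and_mul,   # direct function reference, no CustomOp instantiation
}
```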
-
#32722 [CI] Fix mypy for
`vllm/v1/structured_output` — rocm,structured-output,ready,v1,deepseek — by yewentao256 (合并于: 2026-01-23 11:55 (UTC+8)) [💬1 | +32/-25, 18 files | commented:3 approved:1] ## Purpose
Part of https://github.com/vllm-project/vllm/issues/26533
## Test
At first:
`(yewentao256) [yewentao256@nma-h200-isolated-0-preserve vllm-source]$ pre-commit run --hook-stage manual mypy-3.10 -a` …
-
#32806 [torch.compile] Compile
`CustomOp.forward_native` for `SiluAndMul` and `QuantFP8` to avoid raw torch ops inside opaque custom ops — ready,torch.compile,ready-run-all-tests — by ProExpertProg (合并于: 2026-01-23 11:52 (UTC+8)) [💬2 | +52/-13, 7 files | commented:5 approved:2] ## Purpose
When `CustomOp.forward_native` is invoked from within another opaque torch custom op (e.g. `fused_moe`, `unified_attention`), it is hidden from the model-level torch.compile. In this case, executing `forward_native` directly executes eager PyTorch ops, which can significantly hurt performance.
In this PR, we compile forward_native manually to avoid this scenario. Existing `CustomOp` invocations visible to model-level compilation are not affected as `torch.compile` is ignored when nes…
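A toy sketch of the approach in #32806, assuming a plain function standing in for `forward_native`: compile it once so callers inside opaque custom ops don't fall back to raw eager ops.
```python
# Compile the native path separately, since the model-level torch.compile
# cannot see code executed inside an opaque custom op.
import torch

def silu_and_mul_native(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]

# Compiled once; callers inside custom ops use this instead of the eager function.
silu_and_mul_compiled = torch.compile(silu_and_mul_native)
```
-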
#32884 [BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions — bug,ready,deepseek — by RishabhSaini (合并于: 2026-01-23 11:44 (UTC+8)) [💬2 | +39/-28, 1 files | commented:1 approved:1] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
[关闭未合并 PR]
-
#31514 FP8 KV cache append optimized with precomputed inverse scale — rocm — by tfpre (关闭于: 2026-01-24 10:38 (UTC+8)) [💬4 | +193/-152, 3 files | commented:1]
## Purpose
FP8 KV cache append (`reshape_and_cache_flash`) runs every layer, every decode step. The FP8 quantization path was doing per-element division to convert FP16→FP8:
`// Before: division per element (expensive) __nv_cvt_float_to_fp8(half_to_float(a) / scale, ...)` …
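The optimization idea in #31514, sketched at the tensor level for illustration (the actual change is inside the CUDA kernel): precompute `1/scale` once and multiply per element instead of dividing.
```python
# Division per element vs. one reciprocal plus cheaper multiplies.
import torch

def quantize_fp8_div(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x / scale).to(torch.float8_e4m3fn)          # divide every element

def quantize_fp8_inv(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    inv_scale = 1.0 / scale                              # computed once per layer
    return (x * inv_scale).to(torch.float8_e4m3fn)       # cheaper multiply per element
```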
-
#25824 [Kernel] Triton-based Top-k and Top-p sampler kernels — performance,needs-rebase,v1 — by cakeng (关闭于: 2026-01-24 08:58 (UTC+8)) [💬4 | +674/-1, 2 files | commented:5]
## Purpose The current method for top-k and top-p sampling utilizes a sorting-based approach on PyTorch operations, which is suboptimal.
This PR introduces pivot-search-based top-k and top-p sampling kernels using Triton, which achieves a speedup of up to 10 times compared to the native method.
Set the env variable “VLLM_USE_TRITON_SAMPLER=1” to enable the Triton-based top-k and top-p sampler.
## Test Plan …
-
#32970 [release] Fix upload release wheel script — ci/build — by khluu (关闭于: 2026-01-24 05:40 (UTC+8)) [+2/-2, 1 files commented:1]
- Cut out the extra `v` when comparing tagged version and expected version
- Use `rc[0-9]` instead of `rc` to not exclude `aaRCh64` wheels accidentally
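A quick Python illustration of the exclusion-pattern fix in #32970: a bare `rc` substring check also matches `aarch64` (…a-a-r-c-h…), while requiring a trailing digit only matches real release-candidate wheels.
```python
# Over-broad substring match vs. a pattern that requires a digit after "rc".
import re

wheels = [
    "vllm-0.14.0-cp312-cp312-manylinux_aarch64.whl",
    "vllm-0.14.0rc1-cp312-cp312-manylinux_x86_64.whl",
]

too_broad = [w for w in wheels if "rc" in w]                  # also drops the aarch64 wheel
precise = [w for w in wheels if re.search(r"rc[0-9]", w)]     # only the rc1 wheel matches

print(too_broad)  # both wheels match
print(precise)    # only the release-candidate wheel matches
```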
-
#31938 moe-offload-fixed — documentation,needs-rebase,ci/build,v1 — by wangyxbh (关闭于: 2026-01-24 02:40 (UTC+8)) [💬5 | +3275/-97, 28 files | commented:2] # PR: CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM
## Summary
This PR proposes a CPU Offload Module for MoE inference in vLLM, enabling a large portion of expert weights and computation to be dynamically offloaded to the CPU while keeping only a small, hot subset of experts cached on GPU.
The design supports:
- Hybrid GPU–CPU execution
- Pinned-memory–based weight streaming
- Asynchronous GPU ↔ CPU interaction via callback …
-
#32147 Add descriptive error message for missing tools. — frontend,needs-rebase,gpt-oss,meta-exported,fb-exported — by madongfly (关闭于: 2026-01-24 02:13 (UTC+8)) [💬3 | +85/-8, 2 files | commented:1] Summary: Add descriptive error message for missing tools.
Test Plan: Unit test
Differential Revision: D90477073
[!NOTE] …
-
#32893 Fix/glm4 moe mla detection — v1,nvidia — by mgoin (关闭于: 2026-01-24 01:56 (UTC+8)) [💬2 | +75/-11, 4 files | commented:1 approved:1 | 📝草稿]
## Purpose
Validated that glm-4.7-flash works by default on B200
## Test Plan
## Test Result
…
-
#32759 [Bugfix] Add glm4_moe_lite to MLA model list to fix excessive KV cache memory usage — bug — by yhfgyyf (关闭于: 2026-01-24 01:50 (UTC+8)) [💬2 | +1/-0, 1 files | commented:2] ## Purpose
Add
`glm4_moe_lite` architecture to the MLA supported model list to enable MLA attention for GLM-4.7-Flash.
GLM-4.7-Flash uses the `glm4_moe_lite` architecture with MLA attention parameters (`kv_lora_rank=512`, `q_lora_rank=768`), but was not recognized by `is_deepseek_mla()`, preventing MLA attention from being enabled and causing excessive KV cache memory usage.
#27603 [ROCm][Docs] Update ROCm installation docs for ROCm 7.0 — documentation,rocm,needs-rebase — by Rohan138 (关闭于: 2026-01-24 01:43 (UTC+8)) [💬7 | +56/-55, 1 files | commented:10]
## Purpose
- Update docs in https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#amd-rocm for ROCm 7.0
- Add and document MI355 support
- Update standalone install instructions for PyTorch, Triton, Flash Attention, and AITER
## Test Plan
…
-
#27200 [Feature] Added Prefix Repetition with Random Lengths dataset — performance,stale — by susiejojo (关闭于: 2026-01-24 01:26 (UTC+8)) [💬4 | +169/-0, 1 files | commented:1] ## Purpose
Fixes issue: #27129
vLLM’s bench serve currently supports a PrefixRepetitionRandomDataset for generating prompts with fixed-length prefix and suffix lengths. It also has support for a RandomDataset which samples input and expected output lengths fr…
-
#32924 Create 52012 — 无标签 — by AKB0700 (关闭于: 2026-01-24 01:06 (UTC+8)) [💬2 | +1/-0, 1 files | commented:3]
## Purpose
## Test Plan
## Test Result
... -
#32748 [Misc] Add index url for torch==2.9.1+cpu — ready,needs-rebase,ci/build,cpu — by lk-chen (关闭于: 2026-01-24 00:39 (UTC+8)) [💬3 | +1/-1, 1 files | commented:2] Added extra index URL for CPU version of PyTorch.
Otherwise I hit “not found” error https://vllm-dev.slack.com/archives/C07QP347J4D/p1768948741022109
## Purpose
## Test Plan
…
-
#32906 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by monajafi-amd (关闭于: 2026-01-23 23:04 (UTC+8)) [💬2 | +16257/-10390, 438 files | commented:1 | 📝草稿] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
- …
-
#32928 [MoE] Add tuned config for RTX PRO 6000 Blackwell Max-Q (E=64, N=1024) — 无标签 — by ErikDeBruijn (关闭于: 2026-01-23 18:39 (UTC+8)) [💬2 | +720/-0, 8 files | commented:1] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering the most popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case |
|---|---|----------|
| 8 | 14336 | Mixtral 8x7B/8x22B |
| 8 | 7168 | Large 8-expert models |
…
-
#31478 Optimize Top-K Sigmoid Routing and QKNorm for MiniMax-M2/M2.1 — needs-rebase,ci/build — by rogeryoungh (关闭于: 2026-01-23 16:40 (UTC+8)) [💬1 | +906/-70, 12 files | commented:2]
## Purpose
This PR introduces two performance optimizations targeting MiniMax-M2 / M2.1 inference:
- Adds a `topk_sigmoid` CUDA kernel to support sigmoid-based expert routing used by MiniMax-M2. Previously, this was implemented via a workaround using grouped_topk with group_size=1, which incurs unnecessary overhead.
- Fuses QKNorm in tensor-parallel…
-
#32909 [CI][Pooling] Stabilize ModernBERT test — 无标签 — by AndreasKaratzas (关闭于: 2026-01-23 14:53 (UTC+8)) [💬2 | +27/-0, 1 files | commented:10] This PR attempts to stabilize the
`Language Models Test (Extended Pooling)`. It marks the test as flaky, enables rerunning it, and logs an informative message about the possibly inaccurate results, based on the conversation here: https://github.com/vllm-project/vllm/pull/32403#pullrequestreview-3695341925
## Related
- Addresses flaky test: `test_modernbert_models[float-disham993/electrical-ner-ModernBERT-base]`
- Related to #32403 which increased test tolerance from `atol=1.2e-2` to `atol=3…
-
#29046 [Feature]: Disable logging /metrics — frontend — by WorldExplored (关闭于: 2026-01-23 12:43 (UTC+8)) [💬5 +76/-0, 3 files commented:1]
- This PR adds a metrics-specific logging control to stop `/metrics` access-log spam while preserving normal request logs, as requested in #29023.
- It introduces a new frontend flag `disable_uvicorn_metrics_access_log` (CLI: `--disable-uvicorn-metrics-access-log`) alongside the existing `--disable-uvicorn-access-log`.
- When the new flag is `False`, behavior is unchanged and Uvicorn access logs are controlled by current global options.
…
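For reference, one way to suppress `/metrics` access-log lines is a plain logging filter; this is a hypothetical sketch, while the PR itself adds a dedicated CLI flag instead:
```python
# Drop uvicorn access-log records whose formatted message mentions /metrics,
# leaving all other request logs untouched.
import logging

class DropMetricsAccessLog(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return "/metrics" not in record.getMessage()

logging.getLogger("uvicorn.access").addFilter(DropMetricsAccessLog())
```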