[vLLM GitHub Development Digest] 2026-01-29
[Overview]
- Time window: 2026-01-29 11:23 (UTC+8) ~ 2026-01-30 11:23 (UTC+8)
- New issues: 24 (label distribution: bug:11, feature request:7, RFC:2, usage:2)
- Closed issues: 33
- New PRs: 62 (label distribution: ready:24, v1:22, bug:15, documentation:11, ci/build:9)
- Merged PRs: 41
- PRs closed without merging: 13
[New issues]
- #33386 [Feature]: Add support for local image file paths in vision reranker API — feature request — by RC-Qiao (created: 2026-01-30 11:18 (UTC+8))
### 🚀 The feature, motivation and pitch
Currently, the /rerank API (examples/pooling/score/vision_rerank_api_online.py) only supports:
- Text documents
- Image URLs
However, there is no way to directly provide a local file path (e.g., /Users/xxx/Desktop/photo.jpg) to the API.
### Alternatives
No response …
- #33384 [Bug]: vLLM serve crashes after processing requests with FlashMLA + DP on DeepSeek — bug — by Mandaluoren (created: 2026-01-30 10:56 (UTC+8))
### Your current environment
Environment
- GPU: NVIDIA H20
- Model: DeepSeek-R1
- Backend: FlashMLA
- TP: 4
- DP: 2
…
- #33361 [Bug]: Different embeddings produced by LLM and AsyncLLM — bug — by kabell (created: 2026-01-30 01:52 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #33319 [Bug]: Requests Stuck in Waiting Queue with Zero Running — bug — by yptheangel (created: 2026-01-29 16:01 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information... ============================...
```
- #33369 [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine — bug — by sininspira2 (created: 2026-01-30 05:42 (UTC+8)) [💬3]
### Your current environment
I don't have all of the python libraries to run on the host, and running the script inside the docker container isn't working. I'm running vllm/vllm-openai:latest-cu130. The container has access to 2x RTX 6000 Pro Server Edition (Blackwell) ...
- #33381 [RFC]: Align with the openresponses spec. — RFC — by chaunceyjiang (created: 2026-01-30 09:24 (UTC+8))
### Motivation.
Open Responses is an open-source specification for multi-provider, interoperable LLM interfaces inspired by the OpenAI Responses API. It defines a shared request/response model, streaming semantics, and tool invocation patterns so clients and providers can exchange structured inputs and outputs in a consistent shape.
ref: https://www.openresponses.org
### Proposed Change.
…
- #33333 [Bug]: sm_120 - NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device. — bug — by shahizat (created: 2026-01-29 19:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... uv is set ============================== System Info ============================== ...
```
- #33356 [Feature]: Pipeline Parallel Features & Performance Optimizations — feature request — by yewentao256 (created: 2026-01-30 01:03 (UTC+8))
### 🚀 The feature, motivation and pitch
We can further optimize pipeline parallelism in vLLM for better performance.
Tasks:
- Enable async scheduling with Pipeline Parallel
  - https://github.com/vllm-project/vllm/pull/32359
  - Under review: https://github.com/vllm-project/vllm/pull/32618
- Use isend/irecv for async communication: https://github.com/vllm-project/vllm/pull/33368
- Chunked Pipeline Prefill for long context prefill (idea introduced by https://lmsys.org…
- #33367 [New Model]: Bart — no labels — by JingHHj (created: 2026-01-30 04:58 (UTC+8))
### The model to consider.
I am working on supporting Florence-2 (https://github.com/vllm-project/vllm/issues/31819) on the newest version; it stopped being supported after v0.10.0. However, its language model fully uses Bart (https://huggingface.co/facebook/bart-large), which has also stopped being supported in current versions. So I am planning to start with supporting Bart first, before I do Florence-2.
### The closest model vllm already supports.
No response
…
- #33364 [Bug]: Qwen3-Next speculative decoding (DeepSeek MTP) fails for num_speculative_tokens>=3 on H100 with FlashInfer + FP8 KV cache (GMMA operator JIT compile error) — bug — by mingi001025 (created: 2026-01-30 02:43 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
--2026-01-29 10:35:20-- https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ... ...
```
- #33363 [TESTS] Unit tests for GDN attn — no labels — by vadiklyutiy (created: 2026-01-30 02:43 (UTC+8)) [💬1]
In #33326 a serious error was found in the GDN attention kernel. @benchislett proposed: "Please add a unit test for this case. Otherwise, lgtm"
Originally posted by @benchislett in https://github.com/vllm-project/vllm/pull/33326#pullrequestreview-3723804936
- #33341 [Feature][Performance][Speculative Decoding]: Support Full CUDA Graph for the drafter — feature request — by tomasruizt (created: 2026-01-29 20:25 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
Currently, the drafter code in eagle.py only supports piecewise CUDA graphs. Integrating full CUDA graphs would speed up drafter execution and reduce the overall runtime, e.g. of speculative decoding with method=draft_model. This should be possible in practice because we already run these models in full CUDA graph mode when they run standalone (without speculative decoding). I made some measurements to forecast how much faster SD would run with full CUDA…
- #33347 Bad issue, please ignore — feature request — by stisiTT (created: 2026-01-29 23:45 (UTC+8)) [💬1]
Error
- #33348 [Bug] GLM-4.7 uses wrong reasoning parser (should use deepseek_r1 instead of glm45) — no labels — by QwertyJack (created: 2026-01-29 23:47 (UTC+8))
## Summary
GLM-4.7 models have a different chat template than GLM-4.5/4.6 models. The key difference is:
- GLM-4.5/4.6: the <think> token is NOT included in the generation prompt. The model generates <think> as part of its output.
- GLM-4.7: the <think> token IS included in the generation prompt. The model starts generating reasoning content directly without outputting <think> first.
This means GLM-4.7 should use the deepseek_r1 reasoning parser (which handles the case where `<t…
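The parser mismatch described above can be sketched in a few lines. This is an illustrative stand-in, not vLLM's actual parser classes (deepseek_r1 and glm45 name real vLLM parsers, but the logic here is deliberately simplified):

```python
# Illustrative sketch: why <think> placement changes which reasoning
# parser GLM-4.7 needs. These functions are simplified stand-ins, not
# vLLM's real parser implementations.

def parse_glm45_style(output: str) -> tuple[str, str]:
    """GLM-4.5/4.6 style: the model emits <think> itself, so the parser
    looks for a complete <think>...</think> block in the output."""
    start = output.find("<think>")
    end = output.find("</think>")
    if start != -1 and end != -1:
        reasoning = output[start + len("<think>"):end]
        return reasoning, output[end + len("</think>"):]
    return "", output  # no reasoning block found

def parse_deepseek_r1_style(output: str) -> tuple[str, str]:
    """GLM-4.7 / DeepSeek-R1 style: <think> is already in the prompt, so
    the output starts mid-reasoning and contains only the closing tag."""
    end = output.find("</think>")
    if end != -1:
        return output[:end], output[end + len("</think>"):]
    return output, ""

# A GLM-4.7 output never repeats <think>, so a glm45-style parser finds
# no reasoning block and misclassifies the whole output as content.
glm47_output = "The user asks about X.</think>The answer is Y."
```

With this output, the deepseek_r1-style parser recovers ("The user asks about X.", "The answer is Y."), while the glm45-style parser returns no reasoning at all, which is the bug being reported.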
- #33340 [Feature]: start DPEngineCoreProc asynchronously — feature request — by zhaozx-cn (created: 2026-01-29 20:08 (UTC+8))
### 🚀 The feature, motivation and pitch
### Motivation
Currently, vLLM launches each DPEngineCoreProc synchronously. This issue proposes starting DPEngineCoreProc asynchronously, which would improve service startup speed.
### Proposed Implementation Details
utils.py:
- Add start_async to execute all DPEngineCoreProc process starts
- Add _enginecore_bootstrap to set the device id and call EngineCoreProc.run_engine_core
- Use ThreadPoolExecutor to run start_async …
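The proposed startup flow can be sketched with the stdlib. start_async and _enginecore_bootstrap are the names from the issue, but the bodies below are stand-ins for the real device setup and EngineCoreProc.run_engine_core call:

```python
# Minimal sketch of the proposal: start all DP engine-core bootstraps
# concurrently so startup cost is roughly max(t_i) instead of sum(t_i).
# Function names follow the issue; the bodies are illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor

def _enginecore_bootstrap(dp_rank: int) -> int:
    # Stand-in for: set the device id for this rank, then call
    # EngineCoreProc.run_engine_core (which would block until ready).
    return dp_rank

def start_async(dp_size: int) -> list[int]:
    # Launch every DPEngineCoreProc bootstrap in parallel instead of
    # one after another; map() preserves rank order in the results.
    with ThreadPoolExecutor(max_workers=dp_size) as pool:
        return list(pool.map(_enginecore_bootstrap, range(dp_size)))
```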
- #33338 [Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 — bug — by Lrcx (created: 2026-01-29 19:54 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
```
- #33336 [Bug]: AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens' — bug — by Dineshkumar-Anandan-ZS0367 (created: 2026-01-29 19:32 (UTC+8))
### Your current environment
HF link: https://huggingface.co/stepfun-ai/Step3-VL-10B
Issue: AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
### 🐛 Describe the bug
Discussions: …
- #33330 [Usage]: Why does the acceptance rate of the second token drop drastically when the Eagle3 draft model has two layers? — usage — by xinanjiao (created: 2026-01-29 18:13 (UTC+8))
### Your current environment
```text
Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.3 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
```
- #33329 [Bug]: Eagle3 not support Qwen3-Next-80B-A3B-Instruct-FP8 — bug — by Martion-z (created: 2026-01-29 17:53 (UTC+8))
### Your current environment
The output of `python collect_env.py` …
- #33314 [RFC] [Feature]: Apply RFC #8913 to Quantized Linear Methods (Decouple Kernels from Checkpoint Layout) — feature request,RFC — by BadrBasowid (created: 2026-01-29 15:01 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
### Context
RFC #8913 proposes decoupling checkpoint formats from kernel integrations by standardizing how weights are represented at runtime and allowing kernel-specific repacking/preprocessing as a separate concern. This RFC scopes that same idea specifically to quantized linear methods (e.g., GPTQ/AWQ/FP8/FP4 linear integrations) where kernel selection is currently constrained by checkpoint packing/layout assumptions.
### Problem
Many quantized line…
- #33311 [Feature]: support pixel_values_videos input for VLM — feature request — by wuxibin89 (created: 2026-01-29 14:05 (UTC+8))
### 🚀 The feature, motivation and pitch
In RL training, to avoid retokenization drift, RL frameworks choose token-in-token-out for generation. While this works perfectly for text/image, there is still some mismatch in video processing. In verl, we apply_chat_template to messages with video: https://github.com/verl-project/verl/blob/main/verl/experimental/agent_loop/agent_loop.py#L286-L293 Then pass the prompt_token_ids and **raw vide…
- #33313 [Bug]: cuda 13 docker image error "ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'" — bug — by openzeka-birol-kuyumcu (created: 2026-01-29 14:10 (UTC+8))
For the docker image vllm/vllm-openai:v0.14.1-cu130 on DGX Spark, Triton raises "ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'", which matches the closed bug https://github.com/triton-lang/triton/issues/8539
- #33307 [Bug]: Latency spikes at input_len=1024 with batch_size=16 (TP1 & TP2) — bug — by mingi001025 (created: 2026-01-29 13:40 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ...
```
- #33302 [Usage]: What would happen if the offloading KV cache size is larger than config the max_local_cpu_size ? — usage — by fantasy520 (created: 2026-01-29 11:30 (UTC+8))
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm. …
[Closed issues]
- #17812 [Benchmark][V1][Spec Decode][EAGLE] Tracking benchmark for V1 EAGLE — performance,stale — by ekagra-ranjan (closed: 2026-01-30 10:17 (UTC+8)) [💬6]
We have been doing perf bench on MTBench so that e2e speedup and AL are comparable with other setups and academic papers. Thanks to @luyuzhe111 and others for the discussion and helping with measuring the gaps!
## llama 3 8b
During model weight loading:
- https://github.com/vllm-project/vllm/pull/16035#issuecomment-2790985075
- SGL correction: https://github.com/vllm-project/vllm/pull/16035#issuecomment-2803265232
- SGL setup: https://docs.google.com/document/d/18ETJLsnxR88Qq3VDk5Mq-Hb7vuE…
- #17931 invalid conversion from 'int' to 'CUresult' {aka 'cudaError_enum'} — installation,stale — by xxm1668 (closed: 2026-01-30 10:17 (UTC+8)) [💬7]
### Your current environment
```text
'DevicePtrInfo getPointer(PyObject*, int)': /tmp/tmpp8zry39f/main.c:118:20: error: invalid conversion from 'int' to 'CUresult' {aka 'cudaError_enum'} [-fpermissive] 118 | CUDA_CHECK(status); // Catch any other cuda API errors | ^ | | | int /tmp/tmpp8zry39f/main.c:24:38: note: in definition of macro 'CUDA_CHECK' 24 | #define CUDA_CHECK(ans) { gpuAssert((ans), FILE, LINE);…
```
- #25927 [Feature]: Support EAGLE3 for Qwen3-30B-A3B — feature request,stale — by lionsheep0724 (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
As far as I know, there's no support for the Qwen3 MoE model. When I tried to load a draft model trained via Specforge, the error below is returned:
RuntimeError: Model does not support EAGLE3 interface but aux_hidden_state_outputs was requested
Similar issue: https://github.com/vllm-project/vllm/issues/25134
### Alternatives
No response …
- #18372 [Bug]: Dynamic loading LoRA is not working properly — bug,stale — by DonghoKang (closed: 2026-01-30 10:17 (UTC+8)) [💬6]
### Your current environment
The output of:
```shell
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
python collect_env.py
```
### 🐛 Describe the bug …
- #19942 [Usage]: how to invoke KVCache save in one node deployment development enviroment — usage,stale — by gaowayne (closed: 2026-01-30 10:17 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I would like to have the method below run more in a single-node deployment development environment …
- #22098 [Bug]: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 198. — bug,stale — by JakubCerven (closed: 2026-01-30 10:17 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` …
- #23431 [RFC]: Continuous profiling and regression prevention for vLLM — RFC,stale — by linzebing (closed: 2026-01-30 10:17 (UTC+8)) [💬3]
### Motivation.
While we continue to push the limit of vLLM performance, an important aspect is to make sure that these wins will stay. It’s quite common in practice that commits may just silently regress engine performance. For instance, we observed a downward trend of engine performance in the past month:
[vLLM Llama 8b shareGPT benchmark dashboard](https://hud…
- #24039 [Usage]: how to disable thinking for different model — usage,stale — by jiangix-paper (closed: 2026-01-30 10:17 (UTC+8)) [💬4]
### Your current environment
vllm 0.10.1
### How would you like to use vllm
I want to call the model without thinking using the v1/chat/completions interface. For the glm-4.5 model, the body is: { "model": "glm-4.5", …
- #24807 [Installation]: Errors when compiling VLLM from source | pytorch-cuda12.6-cudnn9 — installation,stale — by Lyken17 (closed: 2026-01-30 10:16 (UTC+8)) [💬6]
### Your current environment
Click to expand sys environments
```text
Collecting environment information... ============================== System Info ...
```
- #24869 [Installation]: Assertion Error while serving vllm due to this error : assert param_data.shape == loaded_weight.shape — installation,stale — by VarunUpadhyayShorthillsAI (closed: 2026-01-30 10:16 (UTC+8)) [💬4]
### Your current environment
```text
python3 collect_env.py Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
```
- #25333 [Bug]: Accuracy Discrepancy in Qwen 4B Embeddings: vLLM vs. Transformers — bug,stale — by Alireza3242 (closed: 2026-01-30 10:16 (UTC+8)) [💬11]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #25462 [Usage]: How Do I use Ray to deploy a multi node TP Disaggregated Prefilling ? — usage,ray,stale — by yangqinghao-cmss (closed: 2026-01-30 10:16 (UTC+8)) [💬7]
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference for a 1P1D setup, where both P and D utilize only Tensor Parallelism (TP) across multiple nodes. However, I'm unsure how to integrate this with vllm. I've tried using Ray with p2pNcclConnector and the lmcache+nixl approach, but neither works. …
- #25797 [Bug]: Qwen3-Next-80B-A3B-Instruct-AWQ-4bit KeyError: 'layers.45.mlp.shared_expert.down_proj.weight' — bug,stale — by xibaoning (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` …
- #25815 [Performance]: PD Disaggregate deployment performance issue — performance,stale — by wangxiyuan (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Issue
In the Prefill/Decode disaggregated case, on the decode node, if a request's KV cache fills HBM, the scheduler triggers recalculation, puts the request back into the waiting queue, and releases its KV cache. This causes prefill computation to be performed in the next iteration on the decode node, resulting in serious performance degradation.
### Solution
Currently, we have 3 possible ways:
- Port the V0 scheduler swap queue to V1, though it may break the design of the V1 schedule…
- #25836 [Usage]: how to deploy my own multimodal model — usage,stale — by AdvancedBot (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.3 LTS (x86_64) …
```
- #25865 [Bug]: vLLM recognize the bge-m3-korean (embedding model) max length, 512 tokens. — bug,stale — by bnuazz15 (closed: 2026-01-30 10:16 (UTC+8)) [💬9]
### Your current environment
vLLM 0.10.2
### 🐛 Describe the bug
bge-m3-korean config.json:
```json
{ "architectures": [ …
```
- #25898 [Usage]: How to disable the use of chat_template in vllm serve? — usage,stale — by noforit (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
```text
Collecting environment information… ============================== System Info ============================== OS : Rocky Linux 8.5 (Green Obsidian) (x86_64) GCC version : (GCC) 9.4.0 Clang version : Could not collect …
```
- #25907 [Bug]: qwen3-next CUDA illegal memory access — bug,stale — by felixzhu555 (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
#25915 [Performance]: Limited improvements when compared Qwen3 / Qwen3-FP8 — performance,stale — by Lyken17 (关闭于: 2026-01-30 10:16 (UTC+8)) [💬2] ### Proposal to improve performance
On the latest
vllm/vllm-openai:v0.8.5with h100 hardware### Report of performance regression
vllm bench throughput --model Qwen/Qwen3-8B --input-len 1024 --output-len 2048bf16 …
- #25919 [Feature]: Upstream some unsloth patches if possible — feature request,stale — by vadimkantorov (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Unsloth adds vLLM patches in https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/vllm_utils.py, which looks extremely fragile w.r.t. vLLM updates.
Here are some of the things in there (mostly related to 4-bit / LoRA, quite pertinent in light of https://thinkingmachines.ai/blog/lora/, but also standby mode, which is quite useful):
```text
patch_vllm_enable_sleep_mode()
patch_vllm_graph_capture()
patch_vllm_set_inductor_config()
…
```
- #25920 [Bug]: Running qwen3-235b in full graph mode with dp4tp4 and disable chunked_prefill will cause inference hang — bug,stale — by wjx-xin (closed: 2026-01-30 10:16 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`:
```text
vllm 0.10.2 vllm-ascend (recent versions, higher than 0.10.2rc1, used unmerged PR)
```
- #25934 [Feature]: Jet nemotron models — feature request,stale — by arunpatala (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Hi,
I was wondering if Jet Nemotron models can be supported in vLLM. They have a new attention implementation called JetBlock, which is linear/more efficient. Is this planned to be supported, or is it not in scope? I am just wondering, because I would like to train my custom Jet Nemotron models and I am using vLLM currently.
https://github.com/NVlabs/Jet-Nemotron
Thanks
…
- #25939 [RFC][P/D]: Support hybrid and flexible KV Cache transfer timing & pathway at request-level — RFC,stale — by Shirley1Huang (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Motivation.
In current prefill-decode (PD) disaggregated architectures, a static strategy determining KV Cache transfer timing and method is adopted, which is uniformly applied to all requests. This inflexible approach fails to account for key trade-offs: between latency and throughput when choosing when to start KV Cache transfer (during versus after prefill), and between latency and scalability when choosing how to transfer it (via P2P communication or through a centralized sto…
- #25940 [Feature]: image2text supports SVG image — feature request,stale — by jfcherng (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
I am using an InternVL3 image2text model. The model card says it supports SVG images, but vLLM failed to recognize an SVG image.
The following is my test payload.
```json
{ "model": "OpenGVLab/InternVL3_5-1B", "messages": [ …
```
- #25957 [Feature]: Upgrade as many vLLM built-in logits processors to the new logits processors programming model as possible — feature request,stale — by afeldman-nm (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
vLLM v1 is intended to handle logits processors differently than vLLM v0. Logits processors should subclass LogitsProcessor; built-in logits processors should be implemented in vllm/v1/sample/logits_processor/builtin.py and added to the BUILTIN_LOGITS_PROCESSORS list in vllm/v1/sample/logits_processor/__init__.py. However, in actuality only a few of the vLLM built-in logits processors (Min-P, logits bias, min tokens) have been converted to work …
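As an illustration of the programming model described above, here is a minimal "min tokens"-style processor. This is a sketch with plain Python lists standing in for tensors; the real base-class interface in vllm/v1/sample/logits_processor differs:

```python
# Illustrative sketch of the v1 programming model: subclass a
# LogitsProcessor base and mutate logits in apply(). Plain lists stand
# in for tensors; this is NOT vLLM's actual interface.

class LogitsProcessor:
    def apply(self, logits: list[float]) -> list[float]:
        raise NotImplementedError

class MinTokensLogitsProcessor(LogitsProcessor):
    """Block EOS until at least min_tokens tokens have been generated,
    in the spirit of the built-in 'min tokens' processor."""

    def __init__(self, eos_token_id: int, min_tokens: int):
        self.eos_token_id = eos_token_id
        self.min_tokens = min_tokens
        self.num_generated = 0

    def apply(self, logits: list[float]) -> list[float]:
        if self.num_generated < self.min_tokens:
            logits = list(logits)
            logits[self.eos_token_id] = float("-inf")  # EOS cannot be sampled
        self.num_generated += 1
        return logits
```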
- #25979 [Bug]: EAGLE3 gpt-oss eagle3 failed on high concurrencies — bug,stale — by jiahanc (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Your current environment
Followed standard installation steps on main branch https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-images
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== ...
```
- #25985 [Bug]: IMA with the prefix_prefill::context_attention_fwd kernel when capturing full cudagraphs — bug,rocm,stale — by SageMoore (closed: 2026-01-30 10:16 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #33364 [Bug]: Qwen3-Next speculative decoding (DeepSeek MTP) fails for num_speculative_tokens>=3 on H100 with FlashInfer + FP8 KV cache (GMMA operator JIT compile error) — bug — by mingi001025 (closed: 2026-01-30 02:48 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
--2026-01-29 10:35:20-- https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ... ...
```
- #31186 [Bug]: Qwen3-Next MTP Crash — bug — by benchislett (closed: 2026-01-30 02:40 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #28129 [Feature]: DGX Spark sm121a support — feature request — by swtb3-ryder (closed: 2026-01-29 23:54 (UTC+8)) [💬6]
### 🚀 The feature, motivation and pitch
Does vLLM support the DGX Spark hardware? I know NVIDIA provides a vLLM container with the support packaged, but I believe it uses vLLM version 0.10.2
### Alternatives
No response
### Additional context
…
- #32701 [Feature]: Async Scheduling + Pipeline Parallel Support — feature request — by yewentao256 (closed: 2026-01-30 00:54 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Supporting async scheduling + pp
Tasks
- https://github.com/vllm-project/vllm/pull/32359
- Under review: https://github.com/vllm-project/vllm/pull/32618
### Alternatives
…
- #33347 Bad issue, please ignore — feature request — by stisiTT (closed: 2026-01-30 00:17 (UTC+8)) [💬1]
Error
- #31658 [Bug] Audio batching fails with variable-length inputs in Ultravox model — no labels — by AndriiPasternak31 (closed: 2026-01-29 18:43 (UTC+8))
### Your current environment
- vLLM version: 0.13.0
- GPU: NVIDIA RTX 3090 (24GB)
- PyTorch: 2.9.0+cu129
- Model: fixie-ai/ultravox-v0_6-llama-3_1-8b
### 🐛 Describe the bug
When sending concurrent audio transcription requests to the vLLM server with audio samples of different durations, the engine crashes with: …
[New PRs]
- #33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (created: 2026-01-30 11:18 (UTC+8)) [+442/-21, 5 files | commented:1]
## Purpose
This PR enhances multimodal support in both the /v1/embeddings and /rerank endpoints by:
- Adding an optional instruction field to the RerankRequest schema to enable instruction-aware reranking.
- Supporting flexible multimodal input formats for embedding and reranking, including:
  - Mixed lists of strings and image URLs or Data URLs (e.g., ["text", "https://...", "data:image/..."])
  - Structured objects with explicit {"text": ...} or `{"i…
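A hypothetical request body illustrating the input shapes listed above. Apart from the instruction field and the document formats quoted from the PR description, all field values and the model name below are assumptions, not the merged schema:

```python
# Hypothetical /rerank payload sketch based on the formats the PR
# description lists. The model name and document contents are made up;
# only "instruction" and the mixed document formats come from the PR text.
payload = {
    "model": "some-vision-reranker",  # placeholder model name
    "instruction": "Rank by relevance to the product photo.",
    "query": "red running shoes",
    "documents": [
        "A pair of red sneakers on a white background",  # plain text
        "https://example.com/shoe.jpg",                  # image URL
        "data:image/png;base64,iVBORw0KGgo...",          # Data URL
        {"text": "Catalog entry for trail boots"},       # structured object
    ],
}
```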
- #33385 [Bugfix] Fix inconsistent embeddings between AsyncLLM and LLM — bug,v1 — by MengAiDev (created: 2026-01-30 11:00 (UTC+8)) [+12/-0, 1 files | commented:1]
The pooling_params.verify() method was only called in the LLM class path, not in the AsyncLLM (v1) path. This caused default values like use_activation=True to not be set, resulting in different embeddings.
Call verify() in InputProcessor.process_inputs() to ensure consistent behavior across both code paths.
Fixes #33361
- #33332 [Misc] Replace Optional[X] with X | None syntax — rocm,frontend,needs-rebase,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (created: 2026-01-29 19:08 (UTC+8)) [💬3 | +129/-135, 56 files | commented:1 approved:1]
## Purpose
Modernize type annotations by replacing Optional[X] with the more readable X | None union type syntax (PEP 604), which has been available since Python 3.10.
Changes:
- Replace Optional[X] with X | None across 55 files
- Remove unused Optional imports from typing
## Test Plan
…
- #33379 [Draft][XPU] deprecate ipex and switch to vllm-xpu-kernels — ci/build,v1 — by jikunshang (created: 2026-01-30 08:45 (UTC+8)) [+139/-455, 12 files | commented:1]
## Purpose
[1/N] https://github.com/vllm-project/vllm/issues/33214 Starting from this PR, we will switch to a vllm-xpu-kernels based kernel impl for Intel GPU. This PR also upgrades the dependencies to oneAPI 2025.3 and PyTorch 2.10 for the XPU platform.
## Test Plan
## Test Result …
- #33326 [Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel — bug,ready — by CarstyYou (created: 2026-01-29 17:26 (UTC+8)) [💬5 | +18/-15, 1 files | commented:1 approved:3 changes:1]
## Purpose
Fix illegal memory access (cudaErrorIllegalAddress) when running Qwen3-Next models with MTP speculative decoding and CUDA Graphs enabled. Fixes #31186. cc @vadiklyutiy
### Root Cause
The Triton kernel fused_recurrent_gated_delta_rule_fwd_kernel in GDN attention did not check for PAD_SLOT_ID = -1 values in ssm_state_indices. When CUDA Graph pads unused slots with -1, the kernel computes negative memory offsets, causing illegal memory access. …
- #33305 [CI][AMD] Skip 4 GPUs testgroup ray tests — documentation,rocm,ready,ci/build,v1 — by rjrock (created: 2026-01-29 12:27 (UTC+8)) [💬4 | +8/-0, 2 files | commented:1 approved:1]
## Purpose
To fix the below tests for AMD CI.
- offline_inference/rlhf_colocate.py
- distributed/test_utils.py
## Test Plan
python rlhf_colocate.py
pytest distributed/test_utils.py
## Test Result …
- #33378 support return prompt token ids in responses — frontend — by cmunley1 (created: 2026-01-30 08:26 (UTC+8)) [💬5 | +44/-0, 3 files | commented:2]
## Purpose
Support returning prompt token ids in the Responses API.
## Test Plan
## Test Result
…
- #33372 Explicitly set return_dict for apply_chat_template — documentation,ready — by hmellor (created: 2026-01-30 06:17 (UTC+8)) [💬1 | +22/-10, 11 files | commented:1 approved:1]
In Transformers v5 the default value of return_dict for apply_chat_template was changed to True to match other tokenizer methods. This PR explicitly sets:
- return_dict=True outside the vllm/ dir so that users get used to the new/expected behaviour of Transformers
- return_dict=False inside the vllm/ dir so that the internal interfaces of vLLM do not need to change to adapt to the new default
- #33376 fix: allow LFM2 MoE prefix caching (align) — ready — by tianshu-Michael-yu (created: 2026-01-30 06:57 (UTC+8)) [+11/-3, 2 files | commented:1 approved:1]
## Summary
- allow LFM2 MoE prefix caching in align mode
- add LFM2-VL Mamba state copy-func for prefix cache
- keep mamba-cache-mode=all restriction
## Testing
```shell
./run_lfm_vllm_benchmark.sh --wait 150 --prewarm-bench --work-dir eval_outputs/outputs_run_1769727159 \
  --model LiquidAI/LFM2-VL-8B-A1B-3096689 --served-model lfm2-vl-8b-a1b \
  --vllm-arg --enable-prefix-caching --vllm-arg --mamba-cache-mode --vllm-arg align \
  --vllm-arg --mm-processor-cache-type --vllm-arg shm --vllm-arg --a…
```
- #33383 [Bugfix] Disable torch.compile for batch invariance on Blackwell to ensure determinism — bug — by ZhanqiuHu (created: 2026-01-30 10:34 (UTC+8)) [+31/-0, 2 files | commented:1]
## Purpose
Fix batch invariance failing on Blackwell (B200) GPUs when torch.compile is enabled. Fixes #32992
### Observation
- Batch invariance works on Blackwell with enforce_eager=True
- Batch invariance fails on Blackwell with torch.compile enabled …
- #33357 [BugFix] Fix cold start compilation time — bug,ready — by zou3519 (created: 2026-01-30 01:23 (UTC+8)) [💬5 | +48/-2, 3 files | commented:3 approved:1 | 📝 draft]
https://github.com/vllm-project/vllm/pull/25954 caused a regression in cold start compilation times in all models (MoE and dense). This PR fixes it. The regression is that #25954 causes strings to appear in the graph and messes with the number of total unique subgraphs that need to be compiled. In general a model has some number of layers and should have 3-5 unique subgraphs; the change made it so that there is approximately one subgraph per layer.
Benchmarks:
```text
# vLLM 0.14.1 gpt-oss-120b
# torch.co…
```
- #33349 [Reasoning] Add glm47 reasoning parser for GLM-4.7 models — documentation — by QwertyJack (created: 2026-01-29 23:49 (UTC+8)) [💬2 | +306/-1, 3 files | commented:3]
## Summary
GLM-4.7 models have a different chat template than GLM-4.5/4.6 models:
- GLM-4.5/4.6: <think> is NOT included in the generation prompt
- GLM-4.7: <think> IS included in the generation prompt
This means GLM-4.7 should use DeepSeekR1ReasoningParser instead of the DeepSeekV3ReasoningWithThinkingParser used by GLM-4.5/4.6.
## Changes
…
- #33382 [BugFix] Fix DeepGEMM Warmup Logic — bug — by robertgshaw2-redhat (created: 2026-01-30 09:39 (UTC+8)) [+5/-2, 1 files | commented:1]
## Purpose
- a recent PR (not in 0.15) broke the logic for DeepGEMM by checking the wrong attr
- this fixes it
Essential Elements of an Effective PR Description Checklist
...
- #33359 Fix tie_word_embeddings for multimodal models in Transformers v5 — ready — by hmellor (created: 2026-01-30 01:50 (UTC+8)) [+24/-0, 1 files | commented:2 approved:1]
In Transformers v5, tie_word_embeddings belongs to the config of the class that can see both layers to be tied. For example:
```py
SomeVLModel:
    self.language_model = SomeLanguageModel()
    self.vision_model = SomeVisionModel()
SomeVLModelForMultimodalLM:
    self.model = SomeVLModel()
    self.lm_head = nn.Linear()
…
```
-
#33380 [Feat][SM90] Add cutlass int4 x bf16 MoE — ci/build,nvidia — by IwakuraRein (创建于: 2026-01-30 09:20 (UTC+8)) [+1635/-36, 10 files | commented:1 | 📝草稿]
## Purpose
- Add cutlass int4 x bf16 MoE by referencing #29691.
## Test Plan
- `pytest tests/kernels/quantization/test_cutlass_w4a16_moe.py`
- `pytest tests/test_cutlass_w4a16_moe_mm.py`
…
- #33366 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 — bug,rocm,ready — by gshtras (创建于: 2026-01-30 04:57 (UTC+8)) [💬4 | +18/-18, 4 files | commented:1]
Fixing 2 issues from the previous PR
- When the wvsplitk kernel is not applicable, the fallback should be hipblaslt through torch._scaled_mm, and not cloning and modifying the input tensor. This fixes the +100% performance regression on amd/LLama-3.1-*-FP8-KV models when running on smaller batch sizes
- Weights can be non-contiguous, and often are in the FP8 case where we explicitly pad them to a multiple of 256. So the condition needs to only be applied to the activation tensor
- cc @rasmith
-
#33375 [Moe Refactor] Provide inplace flag at FusedMoEModularKernel construction time. — ready,nvidia — by bnellnm (创建于: 2026-01-30 06:56 (UTC+8)) [+131/-109, 37 files | commented:5] ## Purpose Pass the inplace flag to the FusedMoEModularKernel constructor. Knowing the inplace (or being able to disable) behavior ahead of time will help simplify some of the runtime logic in layer.py related to running shared experts and making clones of the
`hidden_states`.
## Test Plan
CI + MoE refactoring tests
## Test Result
cc @robertgshaw2-redhat , @ProExpertProg
…
-
#33358 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker — bug,v1 — by xuechendi (创建于: 2026-01-30 01:50 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
Fix a bug as below: ``` File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 333, in determine_available_memory (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] assert self.init_snapshot.free_memory > free_gpu_memory, ( (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] Assertion…
- #33374 [ModelRunner V2] Support spec decode with structured outputs — v1 — by njhill (创建于: 2026-01-30 06:35 (UTC+8)) [+59/-1, 3 files | commented:1] Tested for correctness, works as expected.
-
#33377 [Bugfix][Async][LMCache] avoid vllm-side double free during async scheduling + LMCache — bug,kv-connector — by KuntaiDu (创建于: 2026-01-30 07:58 (UTC+8)) [💬1 | +27/-4, 1 files | commented:1]
## Purpose
This PR fixes a vLLM-side double-request-free bug in async scheduling.
Key reason: in async scheduling, when a request is aborted, the `get_finished` API might be called twice, resulting in vLLM freeing the same request twice. This change makes sure that even when `get_finished` is called twice for one request, LMCache only returns the request once and ignores the second call.
Design graph:
…
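The dedup behavior described above can be sketched roughly like this (class and method names are hypothetical, not LMCache's actual API): track which request IDs have already been surfaced, so a second `get_finished` report of the same request is ignored.

```python
# Hypothetical sketch of "return each finished request at most once".
class FinishedDeduper:
    def __init__(self) -> None:
        self._already_returned: set[str] = set()

    def get_finished(self, finished_ids: list[str]) -> list[str]:
        # Surface only the request IDs not seen before; remember them
        # so a duplicate call (e.g. after an abort) returns nothing.
        fresh = [rid for rid in finished_ids if rid not in self._already_returned]
        self._already_returned.update(fresh)
        return fresh
```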
-
#33371 [UX] Use gguf `repo_id:quant_type` syntax for examples and docs — documentation,ready,quantization — by mgoin (创建于: 2026-01-30 05:52 (UTC+8)) [💬1 | +79/-28, 4 files | commented:1 approved:1] ## Purpose
- Fixed the `repo_id:quant_type` syntax to support extended naming conventions
- These are recognized by checking if the base type (e.g., Q4_K) is a valid GGMLQuantizationType and the suffix is one of _M, _S, _L, _XL, _XS, _XXS
- Updated docs, examples, tests
## Test Plan
…
-
#33351 [Dependency] Remove mandatory ray installation — documentation,rocm,ci/build,v1,nvidia — by yewentao256 (创建于: 2026-01-30 00:31 (UTC+8)) [💬1 | +5/-3, 5 files | commented:8 changes:1] ## Purpose
- vLLM v1 PP can run via the multiprocessing backend; Ray is only required when users explicitly choose the Ray executor backend. Keeping Ray as a default dependency on CUDA/ROCm causes confusion and unnecessarily pulls Ray into environments that don’t use it.
So this PR makes Ray an optional dependency.
-
#33373 [Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel — 无标签 — by gmagogsfm (创建于: 2026-01-30 06:24 (UTC+8)) [💬1 | +2369/-0, 11 files | commented:4] NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: Migrate silu_mul_fp8 kernel to new registration system
This PR implements silu_mul with fp8 quantization Helion op. This is the first OP built with new vLLM+Helion integration framework.
- Added configs for H100, H200 GPUs
-
#33370 [Models]: lfm2_siglip2 return intermediate encoder layers — 无标签 — by lalo (创建于: 2026-01-30 05:43 (UTC+8)) [💬2 | +52/-4, 1 files | commented:1] ## Purpose
Adds select_layer parameter to Siglip2Model, Siglip2VisionTransformer, and Siglip2Encoder to allow returning hidden states from any encoder layer. Useful for VLM architectures that use intermediate vision encoder outputs instead of the final layer.
- select_layer=-1: last layer with post_layernorm (default)
- select_layer=-2: second-to-last without post_layernorm
- Positive indices also supported for direct layer selection
No breaking change - sane defaults.
…
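A minimal sketch of the `select_layer` semantics described above, assuming list-style indexing over collected hidden states (pure-Python stand-ins, not the actual Siglip2 code):

```python
# Illustrative encoder that keeps every layer's hidden state and returns
# the one chosen by select_layer: -1 = last layer (with post-norm),
# -2 = second-to-last (raw), positive indices select a layer directly.
def encoder_output(x, layers, select_layer=-1, post_layernorm=None):
    hidden_states = [x]
    for layer in layers:
        x = layer(x)
        hidden_states.append(x)
    if select_layer == -1 and post_layernorm is not None:
        # Default: final layer output with post_layernorm applied.
        return post_layernorm(hidden_states[-1])
    # Intermediate layers are returned raw, without post_layernorm.
    return hidden_states[select_layer]

layers = [lambda v: v + 1, lambda v: v * 2]  # toy "encoder layers"
print(encoder_output(0, layers, select_layer=-2))  # second-to-last hidden state
```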
-
#33368 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement — ready,v1 — by yewentao256 (创建于: 2026-01-30 05:41 (UTC+8)) [+332/-13, 3 files | commented:1] ## Purpose
Part of the https://github.com/vllm-project/vllm/issues/33356
Enable async
`send/recv` for better performance
## Test
export MODEL="Qwen/Qwen3-30B-A3B-Thinking-2507-FP8"…
-
#33327 [Quant] Fix FP8 block quantization for non-aligned dimensions — documentation — by Etelis (创建于: 2026-01-29 17:32 (UTC+8)) [💬1 | +70/-12, 2 files | commented:1] ## Summary
Fixes FP8 block quantization for models with dimensions not divisible by block size (128).
Tested with
`bdellabe/DeepSeek-V2-Lite-FP8-BLOCK` where `intermediate_size=10944`.
## Changes
- Validation fix - Skip partition alignment check for merged weights when TP=1
- Scale shard fix - Compute scale boundaries correctly using
`ceil(offset+size) - ceil(offset)`…
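The boundary arithmetic mentioned above can be sketched as follows. This is only an illustration of the quoted `ceil(offset+size) - ceil(offset)` formula, not the PR's actual code; it assumes offsets are measured in elements with one scale per 128-element block.

```python
import math

BLOCK = 128  # FP8 block-quantization block size

def scale_shard(offset: int, size: int, block: int = BLOCK) -> tuple[int, int]:
    """Return (start, count) of scale entries covering the weight shard
    [offset, offset+size). Ceiling both boundaries handles dimensions
    that are not multiples of the block size (e.g. 10944)."""
    start = math.ceil(offset / block)
    count = math.ceil((offset + size) / block) - start
    return start, count

print(scale_shard(0, 10944))  # 10944 / 128 = 85.5, so 86 scale entries
```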
-
#33360 [BugFix] Fix whisper FA2 + full cudagraphs — bug,ready,v1,nvidia — by LucasWilkinson (创建于: 2026-01-30 01:51 (UTC+8)) [💬1 | +12/-12, 2 files | commented:1 approved:2] Fix: https://github.com/vllm-project/vllm/issues/33091
`CrossAttentionBuilder.build()` overrides `max_seq_len` with encoder_seq_lens (`new_metadata.max_seq_len = max(encoder_seq_lens_cpu)`); this leads to a CG capture with `max_seq_len == 0` and an incorrect graph -
#33362 [Deprecation] Deprecate
`seed_everything` and `scatter_mm_placeholders` in v0.15 — ready,v1 — by yewentao256 (创建于: 2026-01-30 02:35 (UTC+8)) [+0/-75, 3 files | commented:1] ## Purpose
Deprecated as scheduled
-
#33345 Feat/add nemotron nano v3 tests — ready,ci/build — by shaharmor98 (创建于: 2026-01-29 23:06 (UTC+8)) [+54/-0, 6 files | commented:1 approved:1] This PR adds support and testing configurations for the NVIDIA Nemotron-3 Nano 30B models (both BF16 and FP8 variants) to the
`lm-eval-harness` CI pipeline.
## Test Plan
Verified locally using `pytest` with the `lm-eval-harness` test suite.
Essential Elements of an Effective PR Description Checklist
... -
#33365 move spec decode slow test to test_areas.yaml — speculative-decoding,ready,ci/build,v1 — by shanjiaz (创建于: 2026-01-30 02:53 (UTC+8)) [+17/-3, 2 files | commented:1 approved:1] ## Purpose Moved spec decode slow test to `test_areas.yaml`. The added test was previously approved in a former PR. ## Test Plan Tested the command locally.
`pytest tests/v1/spec_decode/test_acceptance_length.py -m 'slow_test'`
## Test Result ``` ============================================================ test session starts ============================================================= …
-
#33352 [BugFix] Disable async scheduling for Mamba prefix caching — bug,ready — by peakcrosser7 (创建于: 2026-01-30 00:34 (UTC+8)) [+12/-0, 1 files | commented:2 approved:2]
## Purpose Async scheduling has been enabled by default since v0.14.0. Currently, Mamba prefix caching ("all" and "align" modes) does not support it. Therefore, this PR disables async scheduling when prefix-caching is enabled for Mamba models.
## Test Plan
## Test Result
…
-
#33318 Drafter Supports Multiple KVCache Groups — speculative-decoding,v1 — by tomasruizt (创建于: 2026-01-29 15:33 (UTC+8)) [💬1 | +311/-146, 6 files | commented:2 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33355 [torch.compile] Avoid graph fragmentation in unified_kv_cache_update — 无标签 — by algebra-MCX (创建于: 2026-01-30 00:48 (UTC+8)) [💬3 | +255/-5, 3 files | commented:1] ## Purpose
Fixes #33267 Motivation:
`unified_kv_cache_update` appears in piecewise cudagraph regions. Previously, each layer used a unique name, forcing each to be compiled separately. This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.
Changes:
This PR optimizes `unified_kv_cache_update` to use a generic name ("from_forward_context") during compilation. The actual layer is resolved dynamically at runtime using `ForwardContex… -
#33342 start dp engine core asynchronously — v1 — by zhaozx-cn (创建于: 2026-01-29 21:46 (UTC+8)) [💬3 | +55/-23, 1 files | commented:1]
## Purpose start dp engine core asynchronously ## Test Plan
## Test Result
... -
#33354 [Feature] Add VLLM_TRITON_AUTOTUNE environment variable — 无标签 — by monajafi-amd (创建于: 2026-01-30 00:40 (UTC+8)) [+6/-0, 1 files | commented:1 | 📝草稿] Implements #33279
Add environment variable `VLLM_TRITON_AUTOTUNE` to control Triton kernel autotuning behavior.
- Default: Disabled (`0`) for deterministic behavior and faster startup
- Enabled (`1`): for benchmarking or calibrating kernel configurations
## Motivation Triton kernel parameters are not one-size-fits-all across hardware platforms. Currently, autotuning behavior is implicit and not user-controllable, making it difficult…
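The described default/opt-in behavior would read roughly like this (a hypothetical sketch, not necessarily how vLLM actually parses the flag):

```python
import os

def triton_autotune_enabled() -> bool:
    """Sketch of the proposed flag: disabled ("0") by default for
    deterministic behavior and faster startup; set VLLM_TRITON_AUTOTUNE=1
    to enable autotuning for benchmarking or calibration runs."""
    return os.environ.get("VLLM_TRITON_AUTOTUNE", "0") == "1"
```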
-
#33324 [Chore] Move
`MediaConnector` to `vllm.multimodal.media` — frontend,ready,multi-modality — by DarkLight1337 (创建于: 2026-01-29 17:17 (UTC+8)) [💬1 | +381/-350, 8 files | commented:2 approved:1] ## Purpose
Missed this in #32406; further resolves circular import issues.
Also fix missing `AttributeError` in `vllm.transformers_utils.tokenizer.__getattr__`.
## Test Plan
## Test Result
…
- #33353 [KVConnector] Allow connector to protect GPU blocks from eviction — v1,kv-connector — by orozery (创建于: 2026-01-30 00:36 (UTC+8)) [+336/-44, 10 files | commented:1] This PR introduces a new connector API allowing the scheduler-side connector to increase GPU blocks ref-counts, in order to prevent them from evicting. In particular, this is necessary for the case of async offloading of sliding window KV data, as it is automatically freed by the KV cache manager as the window progresses.
-
#33350 [Bugfix] Fix broken GLM-OCR initialization — bug — by Isotr0py (创建于: 2026-01-29 23:51 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
- The GLM-OCR model is actually broken, because `GlmOcrVisionConfig` is only imported at type-checking time https://github.com/vllm-project/vllm/blob/c6e7404cc5713a926e8b6c187b5f197a5436e9ff/vllm/model_executor/models/glm_ocr.py#L36-L39
NameError: name 'GlmOcrVisionConfig' is not defined. Did you mean: 'GlmOcrVisionMLP'?
## Test Plan …
-
#33346 [Models] Refactor Kimi-K2.5 weight loading — ready — by Isotr0py (创建于: 2026-01-29 23:39 (UTC+8)) [+40/-176, 2 files | commented:1 approved:1]
## Purpose
- Refactor Kimi-K2.5 model interface usage to catch up with previous refactoring
## Test Plan
`vllm serve moonshotai/Kimi-K2.5 --enforce-eager -tp 4 --trust-remote-code --mm-encoder-tp-mode data` …
- #33315 feat: add max tokens per doc in rerank request — frontend — by hustxiayang (创建于: 2026-01-29 15:12 (UTC+8)) [+240/-7, 4 files | commented:1 | 📝草稿] ## Purpose add max_tokens_per_doc in rerank request parameter, which is in cohere rerank api https://docs.cohere.com/reference/rerank and jina rerank api https://jina.ai/reranker/
-
#33343 Mamba multistream — 无标签 — by TomerBN-Nvidia (创建于: 2026-01-29 21:56 (UTC+8)) [💬1 | +539/-54, 10 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33344 Fix: Validate placeholder tokens don’t exceed batch length in chunked… — v1 — by GOavi101 (创建于: 2026-01-29 22:30 (UTC+8)) [+29/-3, 1 files | commented:1] ## Summary
This PR implements validation to ensure multimodal placeholder tokens are not truncated by chunked prefill. Placeholder tokens must be processed in a single chunk to ensure proper embedding replacement.
## Problem
When chunked prefill is enabled with
`--max-num-batched-tokens`, multimodal placeholder tokens could be split across chunks. This breaks embedding replacement because placeholder tokens need to be processed atomically - if they're split, the embeddings cannot be properly … -
#33331 [Multimodal] Simplify MM input definitions — ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-01-29 18:39 (UTC+8)) [💬1 | +142/-164, 17 files | commented:2 approved:1] ## Purpose
Remove unnecessary fields in
`MultiModalFieldElem`; instead, always pass dictionaries to construct `MultiModalKwargsItem` and `MultiModalKwargsItems` to make their roles clearer. As `group_mm_kwargs_by_modality` still requires modality information, it has been updated to receive a list of `(modality, MultiModalKwargsItem)` instead of just a list of `MultiModalKwargsItem`.
Also, improve related docstrings.
## Test Plan
## Test Result
…
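The `(modality, item)` pairing described above lends itself to a standard consecutive-run groupby; a minimal sketch with plain values standing in for the actual `MultiModalKwargsItem` objects:

```python
from itertools import groupby

def group_by_modality(pairs):
    """pairs: iterable of (modality, item). Yields (modality, [items])
    for each consecutive run sharing the same modality. Note that
    itertools.groupby only merges adjacent entries."""
    for modality, grp in groupby(pairs, key=lambda p: p[0]):
        yield modality, [item for _, item in grp]
```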
-
#33339 Enable Cross layers KV cache layout at NIXL Connector V2 — documentation,v1,kv-connector — by liranschour (创建于: 2026-01-29 20:06 (UTC+8)) [💬1 | +324/-88, 5 files | commented:1] ## Purpose Enable NIXL Connector to use the new continuous cross-layer KV cache layout described in the RFC and implemented in #27743
Demonstrates a performance improvement of more than 2x in tok/sec and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
branch num reqs input len TTFT ITL tok/s Desc/transfer – – -… -
#33335 Parakeet avlm — needs-rebase — by netanel-haber (创建于: 2026-01-29 19:26 (UTC+8)) [💬1 | +376/-23, 2 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33310 [Chore] Remove
`use_data_parallel` kwargs from ViT implementation — ready,llama — by Isotr0py (创建于: 2026-01-29 13:59 (UTC+8)) [💬1 | +36/-89, 9 files | commented:2 approved:1] ## Purpose
- Clean up `use_data_parallel` from ViT model implementation
## Test Plan
## Test Result
…
-
#33337 [BUGFIX] Make HF input preprocessing async to prevent frontend event loop blocking — bug,v1 — by yzhu802 (创建于: 2026-01-29 19:42 (UTC+8)) [💬3 | +52/-3, 1 files | commented:1]
## Purpose
Mitigate the issue where the frontend process's event loop becomes blocked under medium-to-high concurrency with multimodal requests, due to preprocessing in the Hugging Face (HF) processor. This improves the performance of the vLLM framework in multimodal scenarios.
In addition, this change is critical for concurrent testing in the downstream framework vllm-omni. In multi-stage orchestration, the frontend process is responsible for forwa…
-
#33320 [Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… — ready,nvidia — by flyrae (创建于: 2026-01-29 16:30 (UTC+8)) [💬4 | +2/-1, 1 files | commented:1 approved:2] commit msg: Replaced the hardcoded
`torch.cuda.current_device()` with `current_platform.current_device()` in the `KimiK25ForConditionalGeneration` initialization. This change enhances platform compatibility, allowing the model to run on non-CUDA devices (e.g., NPU, ROCm) supported by vLLM. ## Purpose
Update `KimiK25ForConditionalGeneration` to use `current_platform.current_device()` instead of the hardcoded `torch.cuda.current_device()`.
-
#33334 [Feature] Add rich request snapshot stream for step-level observability (PR #5) — documentation,frontend,needs-rebase,ci/build,v1 — by sriumcp (创建于: 2026-01-29 19:18 (UTC+8)) [💬3 | +13599/-1130, 56 files | commented:1] ## Summary
This PR implements PR #5: Rich Request Snapshot Stream from the step-level tracing plan, adding per-request observability events within sampled scheduler steps.
Part of the step-level tracing series:
- PR #3: Step-level batch summary tracing ✅ (merged)
- PR #4: KV cache metrics utilities ✅ (merged)
- PR #5: Rich request snapshot stream ⬅️ this PR
## What This PR Adds …
-
#33322 [Bugfix] Fix SP compilation shape mismatch errors for multimodal models and prompt embeds — bug — by wangxingran222 (创建于: 2026-01-29 17:10 (UTC+8)) [💬1 | +131/-3, 2 files | commented:1]
## Purpose
When deploying multimodal models on the current vLLM main branch, or deploying text models with the
`--enable-prompt-embeds` option enabled, Sequence Parallelism (SP) cannot be enabled due to torch compile errors. Specifically, two issues occur:
- Shape mismatch in `aten.copy_` operations: When torch compile detects that the forward function modifies input parameters (e.g., `inputs_embeds`), the compiler inserts `aten.copy_` nodes to write modifie…
-
#33312 [Models] Qwen3-ASR — documentation,new-model,ready,qwen — by ywang96 (创建于: 2026-01-29 14:08 (UTC+8)) [💬1 | +1269/-0, 9 files | approved:2 commented:4]
## Purpose
Add support for Qwen3-ASR model series - see recipe at https://github.com/vllm-project/recipes/blob/main/Qwen/Qwen3-ASR.md
## Test Plan
## Test Result
…
-
#33317 [Bugfix][CPU] Fix thread num for shared memory communication — bug,ready,cpu — by bigPYJ1151 (创建于: 2026-01-29 15:30 (UTC+8)) [+25/-10, 3 files | commented:1 approved:1]
## Purpose
Some CPUs have different thread counts across memory nodes, which leads to errors in the CPU shared memory communicator. This PR fixes the issue.
## Test Plan
Locally verified.
…
-
#33308 [rocm] unset RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES in ROCM images — rocm,ready,ci/build — by kouroshHakha (创建于: 2026-01-29 13:53 (UTC+8)) [💬2 | +0/-2, 1 files | commented:1 | 📝草稿] vLLM's Ray integration relies on automatically setting the visible devices when scheduling model executor Ray workers.
### Testing:
The command below would fail on rocm/v0.14.0 before this pr, but works after this pr:
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic -dp 8 --load-format dummy --distributed-executor-backend ray -cc '{"cudagraph_mode": "FULL_DECODE_ONLY"}' -
#33328 [Quant] Skip CUTLASS block FP8 on Blackwell GPUs — nvidia — by Etelis (创建于: 2026-01-29 17:36 (UTC+8)) [+50/-13, 2 files | commented:1] ## Summary
Skip CUTLASS block FP8 kernels on Blackwell GPUs (sm_120+) and fall back to Triton.
CUTLASS block FP8 is not stable on Blackwell, causing “Invalid status” runtime errors.
## Changes
Added compute capability check in
`is_cutlass_block_fp8_supported()` to exclude sm_120+.
…
- #33321 [ROCm] make rocm_aiter_fa support qwen3-next, remove multiple 16 block size support — documentation,rocm,v1,qwen — by ganyi1996ppo (创建于: 2026-01-29 17:05 (UTC+8)) [💬1 | +2/-3, 2 files | commented:2]
## Purpose
Reset
`get_supported_kernel_block_sizes` from `multiple(16)` to `[1, 16, 32]`. `multiple(16)` might cause an unsupported block size like 144 when serving qwen3-next. ## Test Plan gsm8k ## Test Result ```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8597|± |0.0096|
| | |strict-match | 5|exact_match|↑ |0.8400|± |… -
#33306 [Frontend] Add tool_choice=required support for GPT-OSS Harmony models — frontend,tool-calling,gpt-oss — by gkswns0531 (创建于: 2026-01-29 13:22 (UTC+8)) [💬2 | +242/-14, 3 files | commented:1] ## Summary
This PR adds `tool_choice="required"` support for GPT-OSS Harmony models. When `tool_choice="required"` is set, the model is forced to generate tool calls instead of plain text responses.### Problem
GPT-OSS models use the Harmony chat format, which differs from standard models in its response generation behavior. Even when
`tool_choice="required"` is set, these models tend to generate direct text responses instead of tool calls, resulting in only a 91% success rate for tool call g… -
#33325 [ut] enhance structured output ut — structured-output,v1 — by andyxning (创建于: 2026-01-29 17:23 (UTC+8)) [+2/-2, 1 files | commented:1]
## Purpose Enhance structured output ut. ## Test Plan NA ## Test Result NA —
... -
#33323 [Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal — 无标签 — by carlory (创建于: 2026-01-29 17:16 (UTC+8)) [+1/-5, 1 files | commented:1] ## Purpose
Clean up the
`HIDDEN_DEPRECATED_METRICS` list in the metrics test file after the deprecated metrics were removed in #29330. The following metrics were removed in #29330:
`vllm:gpu_cache_usage_perc`, `vllm:gpu_prefix_cache_queries`, `vllm:gpu_prefix_cache_hits`
Since these metrics are now fully removed, they no longer need to be tracked in `HIDDEN_DEPRECATED_METRICS`. … -
#33316 [Release] [ROCm] Remove old build step — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-29 15:23 (UTC+8)) [+2/-19, 1 files | commented:1 approved:1]
## Purpose
While rebasing this PR https://github.com/vllm-project/vllm/pull/33156 , I missed removing the old ROCm Image Build step.
The current pipeline graph will look like this https://buildkite.com/vllm/release-pipeline-shadow/builds/1693/steps/canvas
## Test Plan …
-
#33309 [Bugfix] Make tensorizer encrypted model output tests deterministic — bug — by Zaire404 (创建于: 2026-01-29 13:58 (UTC+8)) [+2/-0, 1 files | commented:1] ## Purpose Fix nondeterministic failures in
`tests/model_executor/model_loader/tensorizer_loader/test_tensorizer.py::test_deserialized_encrypted_vllm_model_with_tp_has_same_outputs` by making generation deterministic. This test compares generated outputs between the original model and the deserialized encrypted model; without explicitly setting sampling parameters, generation can be nondeterministic and lead to intermittent mismatches.
Add deterministic sampling params: `sampling_params = Sampl…
- #33304 [feat] Add per-block extra_keys to KV events — documentation,v1 — by zhongdaor-nv (创建于: 2026-01-29 12:15 (UTC+8)) [💬4 | +37/-3, 3 files | commented:1]
## Summary
- Add
`extra_keys` field to `BlockStored` events that exposes the extra hash keys (MM identifiers, LoRA name, cache_salt, prompt embeddings, etc.) used in block hash computation
- Each element in the list corresponds to one block in `block_hashes`, enabling external KV cache consumers to reconstruct block hashes
## Changes
- Add
`extra_keys` field to `BlockStored` in `kv_events.py`
- Generate extra_keys for each block individually during `cache_full_blocks` in `block_pool.py`
- Updat…
-
#33303 Support Pipeline and Data Parallelism for MiniMax-M2 — 无标签 — by rogeryoungh (创建于: 2026-01-29 11:49 (UTC+8)) [+21/-6, 1 files | commented:1]
## Purpose
Adds Pipeline Parallelism (PP) and Data Parallelism (DP) for
`minimax_m2`. Currently, enabling both PP+DP simultaneously results in character encoding issues. However, both PP and DP work correctly when used individually.
## Test Plan
## Test Result
…
[已合并 PR]
-
#33326 [Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel — bug,ready — by CarstyYou (合并于: 2026-01-30 02:40 (UTC+8)) [💬5 | +18/-15, 1 files | commented:1 approved:3 changes:1] ## Purpose
Fix illegal memory access (
`cudaErrorIllegalAddress`) when running Qwen3-Next models with MTP speculative decoding and CUDA Graphs enabled.
Fixes #31186
cc @vadiklyutiy ### Root Cause
The Triton kernel
`fused_recurrent_gated_delta_rule_fwd_kernel` in GDN attention did not check for `PAD_SLOT_ID = -1` values in `ssm_state_indices`. When CUDA Graph pads unused slots with `-1`, the kernel computes negative memory offsets, causing illegal memory access. …
- #32696 [Model][Multimodal] Add explicit MusicFlamingo adapter — documentation,new-model,ready — by WangHaoyuuu (合并于: 2026-01-30 11:01 (UTC+8)) [💬5 | +115/-2, 6 files | commented:6 approved:1]
## Summary
- Add an explicit MusicFlamingo adapter that reuses AudioFlamingo3 while accepting MusicFlamingo config/processor with a safe fallback.
- Register MusicFlamingoForConditionalGeneration in the multimodal registry.
- Fix AudioFlamingo3 encoder compatibility for MusicFlamingo checkpoints and Qwen2Audio layer signature.
## Motivation MusicFlamingo shares the AudioFlamingo3 architecture but uses
`model_type=musicflamingo` and `MusicFlamingoProcessor`. Without an explicit adapter, type che…
#33358 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker — bug,v1 — by xuechendi (合并于: 2026-01-30 01:56 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
Fix a bug as below: ``` File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 333, in determine_available_memory (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] assert self.init_snapshot.free_memory > free_gpu_memory, ( (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] Assertion…
- #32849 [Docs] Adding links and intro to Speculators and LLM Compressor — documentation,ready — by aireilly (合并于: 2026-01-30 06:12 (UTC+8)) [💬7 | +73/-7, 5 files | commented:4 changes:1 approved:2] Small docs update for the Features section that adds overview docs for LLM Compressor and Speculators projects.
-
#33300 [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic — bug,ready — by mgoin (合并于: 2026-01-30 04:15 (UTC+8)) [+3/-0, 2 files | commented:1 approved:1] ## Purpose
Without this change, when running an MoE model with FP8 per-tensor dynamic scales, the Triton MoE backend will not be selected and Marlin would be chosen in the end.
(EngineCore_DP0 pid=15286) DEBUG 01-28 21:25:48 [model_executor/.../oracle/fp8.py:332] FP8 MoE backend TRITON does not support the deployment configuration since kernel does not support quantization scheme.
This is simply a support registration mistake.
## Test Plan
…
-
#33205 Make
`mypy` opt-out instead of opt-in — ready,v1 — by hmellor (合并于: 2026-01-29 17:12 (UTC+8)) [+35/-57, 11 files | commented:3 approved:1] The opt-in nature of the `mypy.py` script meant that as new directories were added, they were not automatically tested with `mypy`. This PR:
- Removes the requirement to add new directories/files to the `FILES` list
- Adds the currently untested directories/files to the `EXCLUDE` list (this is what they effectively are currently)
- Performs some simple fixes for some of the previously untested files
-
#33129 [release] Minor fixes to release annotation and wheel upload — ready,ci/build — by khluu (合并于: 2026-01-30 04:09 (UTC+8)) [💬2 +51/-63, 3 files commented:10] - Add commands for CUDA 13.0 Docker images
- Use
`twine upload <args>` instead of `twine <args> upload`. The latter caused errors.
- Use `vllm-openai-rocm` repo instead of `vllm-openai:rocm` tag
- Remove the GitHub release part from the wheel upload script. This is currently done manually, and if it needs to be automated, we should separate it into a separate job/script.
- Rename release wheel script to be PyPI only.
-
#33298 [Bugfix] Fix Qwen3-VL-Reranker load. — bug,documentation,ready,qwen — by noooop (合并于: 2026-01-29 16:42 (UTC+8)) [💬4 | +234/-112, 6 files | commented:2 approved:1]
## Purpose
Fix Qwen3-VL-Reranker load.
## Test Plan
tests/entrypoints/pooling/score/test_online_score_vision.py
…
-
#32804 Add Triton fused MoE config for B200 (Nemotron Nano) — ready — by danisereb (合并于: 2026-01-30 03:21 (UTC+8)) [💬2 | +139/-0, 1 files | commented:1 approved:1] ## Purpose
When running Nemotron Nano on B200 the following warning appears:
Using default MoE config. Performance might be sub-optimal! Config file not found at .../vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_B200.json
I used the `benchmark_moe.py` to create a JSON file for this use-case: ``` …
- #32954 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — ready,nvidia — by Linda-Stadter (合并于: 2026-01-30 02:00 (UTC+8)) [💬5 | +290/-17, 5 files | commented:10]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- This is a rebased version of PR 28238 by @jiahanc and includes adaptation to the latest moe refactoring changes. I have further verified that the accuracy issues discussed in 28238 are solved.
## Test Plan
`VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct …
-
#33324 [Chore] Move
`MediaConnector` to `vllm.multimodal.media` — frontend,ready,multi-modality — by DarkLight1337 (合并于: 2026-01-30 00:54 (UTC+8)) [💬1 | +381/-350, 8 files | commented:2 approved:1] ## Purpose
Missed this in #32406; further resolves circular import issues.
Also fix missing `AttributeError` in `vllm.transformers_utils.tokenizer.__getattr__`.
## Test Plan
## Test Result
…
- #33287 [ez] Delete torch25_custom_graph_pass — ready — by angelayi (合并于: 2026-01-30 00:47 (UTC+8)) [+0/-44, 1 files | commented:1 approved:1] Addresses https://github.com/vllm-project/vllm/pull/33209/changes/BASE..8ed781422f12b84be4ea004a4039b75839795ab0#diff-f19ed1cccb2fec69f4510e30750fb43450207d3d6c15ba4e0d6aad194ee70fed
-
#33350 [Bugfix] Fix broken GLM-OCR initialization — bug — by Isotr0py (合并于: 2026-01-29 23:56 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
- The GLM-OCR model is actually broken, because `GlmOcrVisionConfig` is only imported at type-checking time https://github.com/vllm-project/vllm/blob/c6e7404cc5713a926e8b6c187b5f197a5436e9ff/vllm/model_executor/models/glm_ocr.py#L36-L39
NameError: name 'GlmOcrVisionConfig' is not defined. Did you mean: 'GlmOcrVisionMLP'?
## Test Plan …
-
#33157 [Misc] Cleanup Kimi-K2.5’s vision chunk modality entrypoints — documentation,frontend,ready,multi-modality — by Isotr0py (合并于: 2026-01-29 17:46 (UTC+8)) [💬1 | +733/-204, 7 files | commented:3 approved:1]
## Purpose Kimi-K2.5's vision chunk implementation is a bit messy since it was a bit urgent to catch up with day-0 support.
This PR does some remained cleanup for it:
- Add missing tests for vision chunk modality
- Move uuids reconstruction functions to render.
## Test Plan …
-
#33331 [Multimodal] Simplify MM input definitions — ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-01-29 21:32 (UTC+8)) [💬1 | +142/-164, 17 files | commented:2 approved:1] ## Purpose
Remove unnecessary fields in
`MultiModalFieldElem`; instead, always pass dictionaries to construct `MultiModalKwargsItem` and `MultiModalKwargsItems` to make their roles clearer. As `group_mm_kwargs_by_modality` still requires modality information, it has been updated to receive a list of `(modality, MultiModalKwargsItem)` instead of just a list of `MultiModalKwargsItem`.
Also, improve related docstrings.
## Test Plan
## Test Result
…
-
#33310 [Chore] Remove
`use_data_parallel` kwargs from ViT implementation — ready,llama — by Isotr0py (合并于: 2026-01-29 18:20 (UTC+8)) [💬1 | +36/-89, 9 files | commented:2 approved:1] ## Purpose
- Clean up `use_data_parallel` from ViT model implementation
## Test Plan
## Test Result
…
-
#33320 [Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… — ready,nvidia — by flyrae (合并于: 2026-01-29 20:29 (UTC+8)) [💬4 | +2/-1, 1 files | commented:1 approved:2] commit msg: Replaced the hardcoded
`torch.cuda.current_device()` with `current_platform.current_device()` in the `KimiK25ForConditionalGeneration` initialization. This change enhances platform compatibility, allowing the model to run on non-CUDA devices (e.g., NPU, ROCm) supported by vLLM. ## Purpose
Update `KimiK25ForConditionalGeneration` to use `current_platform.current_device()` instead of the hardcoded `torch.cuda.current_device()`.
-
#32894 [Intel GPU] refine xpu worker — ready,ci/build,v1 — by jikunshang (合并于: 2026-01-29 20:26 (UTC+8)) [+27/-90, 2 files | commented:2 approved:2]
## Purpose There is some legacy code in the XPU platform (xpu_worker.py, XPU platform init); this PR refines these files to align with the GPU worker. Also disables some failed UTs for now; will further check UT status later.
## Test Plan CI
## Test Result …
-
#33312 [Models] Qwen3-ASR — documentation,new-model,ready,qwen — by ywang96 (合并于: 2026-01-29 19:27 (UTC+8)) [💬1 | +1269/-0, 9 files | approved:2 commented:4]
## Purpose
Add support for Qwen3-ASR model series - see recipe at https://github.com/vllm-project/recipes/blob/main/Qwen/Qwen3-ASR.md
## Test Plan
## Test Result
…
- #33317 [Bugfix][CPU] Fix thread num for shared memory communication — bug,ready,cpu — by bigPYJ1151 (merged: 2026-01-29 19:26 (UTC+8)) [+25/-10, 3 files | commented:1 approved:1]
## Purpose
Some CPUs have different thread counts across memory nodes, which leads to an error in the CPU shared-memory communicator. This PR fixes the issue.
## Test Plan
Locally verified.
…
- #33042 [Voxtral] Streaming example — documentation,ready,ci/build,multi-modality — by patrickvonplaten (merged: 2026-01-29 19:22 (UTC+8)) [💬8 | +207/-46, 5 files | commented:7 approved:1] This PR adds a test for the new streaming generator API (https://github.com/vllm-project/vllm/pull/28973), which works nicely!
- #33130 [Quantization][Refactor] use platform dict to choose kernel — ready — by zufangzhu (merged: 2026-01-29 18:44 (UTC+8)) [💬2 | +23/-13, 1 file | commented:6 approved:2]
- #31751 [Bug Fix] Handle variable-length tensors in MultiModalFlatField batching — bug,documentation,new-model,ready,ci/build,v1,multi-modality,llama,qwen — by AndriiPasternak31 (merged: 2026-01-29 18:43 (UTC+8)) [💬9 | +86/-0, 2 files | commented:5 approved:2] ## Summary
Fix audio batching crash with variable-length inputs in the Ultravox model.
Fixes #31658
## Problem
When batching audio tensors with different time dimensions (e.g., `[80, 325]` vs `[80, 666]`), `MultiModalFlatField._reduce_data()` incorrectly flattens tensors into individual rows instead of returning them as a list.
…
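The rule the fix enforces can be sketched as follows (a simplified, hypothetical `reduce_batch` using nested lists in place of tensors; not vLLM's actual `_reduce_data` implementation):

```python
# Minimal sketch of the batching rule: stack only when every tensor shares
# the same shape, and fall back to returning a per-item list when shapes
# differ (e.g. variable-length audio features).

def reduce_batch(tensors):
    """tensors: list of nested lists standing in for 2-D tensors."""
    def shape(t):
        return (len(t), len(t[0]) if t else 0)

    first = shape(tensors[0])
    if all(shape(t) == first for t in tensors):
        # Uniform shapes: safe to "stack" into one batched array.
        return {"stacked": True, "batch": tensors}
    # Variable lengths ([80, 325] vs [80, 666]): keep per-item tensors so
    # nothing is flattened into individual rows.
    return {"stacked": False, "batch": list(tensors)}

a = [[0.0] * 325 for _ in range(80)]   # shape (80, 325)
b = [[0.0] * 666 for _ in range(80)]   # shape (80, 666)
print(reduce_batch([a, b])["stacked"])  # variable lengths -> False
print(reduce_batch([a, a])["stacked"])  # uniform shapes   -> True
```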
- #32881 [BugFix] Async EPLB: fix potential race condition — bug,ready — by ilmarkov (merged: 2026-01-29 18:31 (UTC+8)) [+18/-0, 2 files | commented:2 approved:2]
## Purpose Fixes a potential race condition in async EPLB. The scenario is as follows: EPLB successfully transfers layer 1, saves the weights in intermediate buffers, and signals the main thread that the buffer is ready. The main thread schedules the weight copy on the main stream once communication on the EPLB stream is done, and notifies the async thread that the buffer is ready. In this scenario it might happen that copying the weights takes too long and the async thread will start a weight t…
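This class of buffer-reuse race, and the extra handshake that closes it, can be sketched with `threading.Event` (a generic, hypothetical illustration; the names and control flow are assumptions, not vLLM's code):

```python
# Sketch of a producer/consumer handshake over a shared staging buffer:
# the async thread must not reuse the buffer until the main thread has
# finished copying out of it — otherwise the next transfer can overwrite
# weights that are still being read.
import threading

buffer_ready = threading.Event()   # async thread -> main thread
copy_done = threading.Event()      # main thread -> async thread

staged = []      # intermediate buffer
committed = []   # main thread's copy

def async_transfer_thread():
    for layer in range(3):
        staged.append(f"weights-{layer}")  # "transfer" one layer
        buffer_ready.set()
        copy_done.wait()   # the fix: wait before reusing the buffer
        copy_done.clear()
        staged.clear()

def main_thread():
    for _ in range(3):
        buffer_ready.wait()
        buffer_ready.clear()
        committed.append(staged[0])  # copy weights out of the buffer
        copy_done.set()

t = threading.Thread(target=async_transfer_thread)
t.start()
main_thread()
t.join()
print(committed)  # ['weights-0', 'weights-1', 'weights-2']
```

Without the `copy_done` wait, the async thread could clear and refill `staged` while the main thread is still reading it — the overwrite described in the PR.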
- #32769 [fix] test mcp_tool_calling_streaming with a more complex math question — ready — by daniel-salib (merged: 2026-01-29 18:25 (UTC+8)) [+1/-1, 1 file | approved:2] ## Purpose
The python tool is not always triggered if the math question is too simple. When asked to compute larger numbers, the python tool triggers consistently.
## Test Plan pytest tests/entrypoints/openai/responses/test_mcp_tools.py::test_mcp_tool_calling_streaming_types
## Test Result
passes consistently
- #32669 Bugfix: Pass router logits dtype in nemotron shared experts — bug,ready — by amirkl94 (merged: 2026-01-29 17:36 (UTC+8)) [💬1 | +3/-1, 1 file | commented:1 approved:2] ## Purpose A change introduced in an earlier PR requires passing `router_logits_dtype` to the MoE layer.
When running with `dp > 1` and the FlashInfer cutlass MoE kernel in NVFP4, the following error happens: ``` assert self.batched_router_logits.dtype == full_router_logits.dtype, ( ERROR 01-19 05:53:49 [multiproc_executor.py:839] Assert… ```
- #33116 [CI/Build][BugFix] fix cuda/compat loading order issue in docker build — bug,ready,ci/build,nvidia — by wpc (merged: 2026-01-29 16:19 (UTC+8)) [💬1 | +2/-2, 1 file | commented:2 approved:1] ## Purpose In PR #30784 we persisted the CUDA compat lib path with ldconfig. This caused a CUDA init issue when the docker image runs with a new driver that does not require the compatibility libs. ``` $ nvidia-smi Mon Jan 26 13:09:50 2026 | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | … ```
- #33283 [Misc] Remove missed `pad_for_cudagraph` — ready,nvidia — by LucasWilkinson (merged: 2026-01-29 17:12 (UTC+8)) [+0/-7, 1 file | commented:1 approved:2] Remove the `pad_for_cudagraph` missed in https://github.com/vllm-project/vllm/pull/30143
- #32660 [Doc] Update outdated link to Ray documentation — documentation,ready — by graftim (merged: 2026-01-29 16:56 (UTC+8)) [💬2 | +1/-1, 1 file | approved:1] The link to the Ray Serve documentation had become outdated, so I updated it to point to the same page in the current Ray documentation.
## Purpose
## Test Plan
## Test Result
…
- #32943 Adding optional speculator tests for larger models — speculative-decoding,ready,ci/build,v1 — by shanjiaz (merged: 2026-01-29 16:54 (UTC+8)) [💬1 | +45/-4, 2 files | commented:7 approved:1] ## Purpose We just enabled speculative decoding for the Qwen3 MoE vision-language pathway. This adds a test for a vision-language speculator model.
- Add support for testing the Qwen3-30B-MOE-VL-Eagle3 speculator model
- Create a separate optional CI job for large speculator model tests on A100 GPUs …
- #33152 [PluggableLayer][2/N] Apply PluggableLayer to linear layers — ready,ready-run-all-tests — by whx-sjtu (merged: 2026-01-29 16:53 (UTC+8)) [+5/-5, 1 file | commented:1 approved:1] ## Purpose As a task in https://github.com/vllm-project/vllm/issues/32676, this PR applies `PluggableLayer` to linear layers. ## Test Plan All tests should pass. ## Test Result All tests should pass.
- #33212 support returning token ids in responses api — frontend,ready — by cmunley1 (merged: 2026-01-29 16:52 (UTC+8)) [+50/-9, 3 files | commented:10]
## Purpose Support returning token IDs from the Responses API
## Test Plan
## Test Result
…
- #33262 [BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend — bug,ready — by ilmarkov (merged: 2026-01-29 16:52 (UTC+8)) [+10/-2, 1 file | commented:2 approved:2]
## Purpose This PR fixes a crash of `nvidia/DeepSeek-R1-0528-FP4-v2` when EPLB is enabled and FlashInfer MoE FP4 is disabled. In the Marlin MoE FP4 backend we don't use activation scales, so we replace them with None via the `replace_parameter` util. But that doesn't set the parameter to None; it creates an empty parameter on CPU, which triggers an EPLB error: ``` (EngineCore_DP1 pid=3244402) File "/home/sagemoore/git/nm-vllm/vllm/distributed/eplb/eplb_state.py", line 551, … ```
- #33256 [Doc]: fixing multiple typos in diverse files — documentation,frontend,ready,ci/build,v1,cpu — by didier-durand (merged: 2026-01-29 16:52 (UTC+8)) [💬3 | +20/-20, 14 files | commented:1 approved:1] ## Purpose
Improve quality of the repo by fixing typos
## Test Plan
No tests changed or added
## Test Result
…
- #33316 [Release] [ROCm] Remove old build step — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-29 15:35 (UTC+8)) [+2/-19, 1 file | commented:1 approved:1]
## Purpose
While rebasing PR https://github.com/vllm-project/vllm/pull/33156, I missed removing the old ROCm image build step.
The current pipeline graph looks like this: https://buildkite.com/vllm/release-pipeline-shadow/builds/1693/steps/canvas
## Test Plan …
- #33189 [Misc][Build] Lazy load cv2 in nemotron_parse.py — bug,ready — by kiersten-stokes (merged: 2026-01-29 14:55 (UTC+8)) [💬1 | +4/-1, 1 file | commented:1 approved:1] ## Purpose Lazily load the cv2 module to avoid import errors in certain environments where it is not needed.
Error trace:
``` Traceback (most recent call last): File "", line 1, in File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main ... ```
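The lazy-import pattern itself can be sketched generically (a hypothetical `load_image_gray` helper, not the PR's actual diff):

```python
# Lazy-import pattern: defer `import cv2` into the function that needs it,
# so merely importing this module never fails in environments without OpenCV.

def load_image_gray(path):
    import cv2  # only evaluated when this code path actually runs
    return cv2.imread(path, cv2.IMREAD_GRAYSCALE)

# Importing the enclosing module (and defining the function) is now safe
# even when OpenCV is absent; an ImportError can only surface inside the call.
```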
- #33156 [Release] [CI] Optim release pipeline — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-29 14:45 (UTC+8)) [💬1 | +399/-24, 5 files | commented:8 approved:2]
## Purpose
### Changes
- The Docker image release pipeline is now unified with the vLLM ROCm wheel release pipeline; it uses the content cache and sccache in both cases.
- sccache is always removed at the end of the Docker image creation, so that the image is readily usable as a local development environment.
- The test I carried out is as follows: docker pull the image to a local machine and reinstall a fresh vLL…
- #33141 Fix tool call indexing double-counting — frontend,ready — by wangln19 (merged: 2026-01-29 13:57 (UTC+8)) [+4/-3, 1 file | commented:2 approved:1] ## Purpose Fixes a bug in vllm/entrypoints/openai/chat_completion/serving.py where tool call IDs were generated with non-sequential indices due to double-counting. Related to PR "fix: preserve native tool call ID in multi-turn tool calling" #32768.
In the original code, `history_tool_call_cnt` was incremented inside the loop, but the index calculation also used `idx` from `enumerate`: ```python # Old logic: idx = history_tool_call_cnt + idx # Double counting effect history_tool_ca… ```
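The double-counting effect and the corresponding fix can be illustrated with hypothetical helpers (`assign_indices_buggy` and `assign_indices_fixed` are not vLLM's code):

```python
# When assigning sequential tool-call indices, combining a counter that is
# incremented inside the loop with enumerate()'s own index advances the
# result twice per item.

def assign_indices_buggy(history_tool_call_cnt, new_calls):
    indices = []
    for idx, _call in enumerate(new_calls):
        indices.append(history_tool_call_cnt + idx)
        history_tool_call_cnt += 1  # bug: idx already advances per item
    return indices

def assign_indices_fixed(history_tool_call_cnt, new_calls):
    # Offset by the history count once; enumerate() supplies the step.
    return [history_tool_call_cnt + idx for idx, _ in enumerate(new_calls)]

print(assign_indices_buggy(2, ["a", "b", "c"]))  # [2, 4, 6] -- gaps
print(assign_indices_fixed(2, ["a", "b", "c"]))  # [2, 3, 4] -- sequential
```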
- #33260 [Refactor] Define MM data parser in processing info instead of processor itself — ready,multi-modality,llama,qwen — by DarkLight1337 (merged: 2026-01-29 13:55 (UTC+8)) [+399/-347, 34 files | commented:2 approved:1] ## Purpose
Gradually separate the data-parsing logic out of the processor class. Also fix some instances where `_get_expected_hidden_size` is not used for data parsing.
## Test Plan
## Test Result
...
- #33288 [ez] Delete more torch version checks <= 2.8 — ready — by angelayi (merged: 2026-01-29 13:28 (UTC+8)) [+22/-70, 2 files | commented:1 approved:1]
I missed the cases where it uses `torch.__version__` instead of `is_torch_equal_xx` 😅
- #33227 [Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) — ready,ci/build — by orozery (merged: 2026-01-29 12:24 (UTC+8)) [+8/-6, 1 file | commented:1 approved:1]
[PRs closed without merging]
- #19507 deps: Update torch and deps to 2.7.1 — rocm,needs-rebase,ci/build,stale — by seemethere (closed: 2026-01-30 10:17 (UTC+8)) [💬8 | +28/-28, 10 files | commented:2] ## Purpose
Updates PyTorch and all of its downstream dependencies to their respective 2.7.1 versions
## Test Plan
CI
## Test Result
…
- #20250 [Bugfix] enable logging in v1 offline inference — needs-rebase,stale,v1 — by MingzhenHan (closed: 2026-01-30 10:17 (UTC+8)) [💬5 | +51/-9, 1 file | commented:2] ## Purpose Fixes the issue of metrics not being logged when running offline inference with vLLM v1.
Related issue: https://github.com/vllm-project/vllm/issues/17382
## Test Result
-
#25828 Fix num tokens when sp moe enabled — ready,stale — by wuxun-zhang (closed: 2026-01-30 10:16 (UTC+8)) [💬6 | +1/-1, 1 file | commented:1 approved:1]
For some backends like Gaudi, the input shape of the first MoE layer can be 3-D (bs, seqlen, hdim). So when SP MoE is enabled (https://github.com/vllm-project/vllm/pull/24982), `num_tokens` may get a wrong value.
@tlrmchlsmth
## Purpose
## Test Plan …
- #25930 [Misc] Allow toggling prefix cache during tuning — performance,stale — by wdhongtw (closed: 2026-01-30 10:16 (UTC+8)) [💬2 | +15/-1, 1 file | commented:1] Using the prefix cache introduces more non-determinism during tuning, as the prefill/decode ratio may change according to the prefix-cache hit rate.
Add a switch that allows turning off the prefix cache during tuning, while keeping the default behavior as before (prefix cache enabled).
## Purpose
Depending on the dataset, input length, and random sampling, the prefix-cache hit rate varies across invocations. This PR provides a switch to optionally turn off the prefix cache, for more stable experime…
- #19434 [WIP][FP8] ScaledMM refactor — needs-rebase,nvidia — by ProExpertProg (closed: 2026-01-30 09:29 (UTC+8)) [💬3 | +184/-107, 7 files | commented:2 | 📝 draft]
## Purpose This PR will refactor the fp8 scaled mm kernels to u…
- #26738 [DO NOT MERGE] 2.9, Inductor partition, standalone compile, monkeypatch fix(es) — documentation,rocm,ready,ci/build,llama — by ProExpertProg (closed: 2026-01-30 09:29 (UTC+8)) [💬11 | +1/-1, 1 file | commented:2] In-progress PR to test inductor partitioning in CI.
Includes:
- turning on inductor partitioning by default
Past fixes in this PR now in main:
- #26956 AOT caching issue fix
- #26735 monkeypatch
- #26878 monkeypatch for memory plan output naming …
- #32115 Fix select_gemm_impl in ModelOptFp8MoEMethod — ready,needs-rebase — by danisereb (closed: 2026-01-29 22:48 (UTC+8)) [💬3 | +15/-4, 1 file | commented:3] ## Purpose Add support for Triton as MoE backend, required for LoRA support.
In the init of `ModelOptFp8MoEMethod` the function `select_fp8_moe_backend` is called. If it returns `Fp8MoeBackend.TRITON`, handling for this case is missing in `select_gemm_impl`. The code in this PR is similar to the if-else block in `Fp8MoEMethod.select_gemm_impl` (but with only Triton and Cutlass as options).
This PR is blocked by another PR: …
- #30618 [BugFix][Hybrid] Fix prefill chunk incorrectly including draft tokens — bug,ready,needs-rebase,v1 — by peakcrosser7 (closed: 2026-01-29 22:16 (UTC+8)) [💬22 | +24/-5, 1 file | commented:4]
## Purpose For the Hybrid model, the tokens scheduled during the prefill phase must not include draft tokens. If draft tokens are included, Mamba will incorrectly save a state with a length of `prompt_len + draft_tokens` instead of the correct length of `prompt_len`, leading to wrong output. ## Test Plan test script: ```python from vllm import LLM, SamplingParams … ```
- #30553 [Bug][CPU Backend]: Improve L2 cache size detection and usage on aarch64 — bug,needs-rebase,cpu — by Radu2k (closed: 2026-01-29 21:46 (UTC+8)) [💬4 | +34/-0, 1 file | commented:6 changes:1]
## Purpose
- Add /sys/devices/system/cpu/cpu0/cache/index2/size parsing for the L2 cache on aarch64
- Fall back to 1 MB with a warning if the sysfs file cannot be opened
- Use 70% of the detected L2 cache for attention scheduling, to leave enough room for system processes
## Test Plan
`l2_cache_size` gets the right value.
## Test Result
`l2_cache_size` is set to 2048 * 1024 * 0.7 bytes on the c8g instance type, which is the correct value given that it has 2 MB of L2 cache per core and we want to use 70%. …
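The detection-and-budget logic described in this entry can be sketched as follows (a hypothetical `detect_l2_budget` helper, not the PR's code; the "2048K" sysfs value format is an assumption):

```python
# Sketch of aarch64 L2 detection: parse the sysfs cache size, fall back
# to 1 MB with a warning if unreadable, then budget 70% of it for
# attention scheduling.
import warnings

def detect_l2_budget(sysfs_path="/sys/devices/system/cpu/cpu0/cache/index2/size",
                     fallback_bytes=1024 * 1024, fraction=0.7):
    try:
        with open(sysfs_path) as f:
            text = f.read().strip()          # e.g. "2048K"
        if text.endswith("K"):
            l2_bytes = int(text[:-1]) * 1024
        elif text.endswith("M"):
            l2_bytes = int(text[:-1]) * 1024 * 1024
        else:
            l2_bytes = int(text)
    except OSError:
        warnings.warn("could not read L2 cache size; assuming 1 MB")
        l2_bytes = fallback_bytes
    return int(l2_bytes * fraction)

# With a 2 MB per-core L2 (as on c8g), the 70% budget is
# int(2048 * 1024 * 0.7) bytes; with the 1 MB fallback it is
# int(1024 * 1024 * 0.7) bytes.
```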
- #33334 [Feature] Add rich request snapshot stream for step-level observability (PR #5) — documentation,frontend,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-29 19:30 (UTC+8)) [💬3 | +13599/-1130, 56 files | commented:1] ## Summary
This PR implements PR #5, the rich request snapshot stream from the step-level tracing plan, adding per-request observability events within sampled scheduler steps.
Part of the step-level tracing series:
- PR #3: Step-level batch summary tracing ✅ (merged)
- PR #4: KV cache metrics utilities ✅ (merged)
- PR #5: Rich request snapshot stream ⬅️ this PR
## What This PR Adds …
- #32552 Ignore: try to implement parakeet — needs-rebase — by netanel-haber (closed: 2026-01-29 19:25 (UTC+8)) [💬1 | +370/-21, 2 files | commented:1 | 📝 draft] ## Purpose
## Test Plan
## Test Result
...
-
#32620 [CI/Build][AMD] Fix distributed/test_utils.py — rocm,ci/build — by rjrock (closed: 2026-01-29 12:29 (UTC+8)) [💬4 | +10/-1, 2 files | commented:1 changes:1] ## Purpose To improve the health of AMD's Distributed Tests (4 GPUs) test group.
Ray uses `HIP_VISIBLE_DEVICES` for GPU allocation when on a ROCm platform.
## Test Plan `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="" pytest -v -s distributed/test_utils.py` ## Test Result Before: …
- #31189 [CI/Build] Fix rlhf_colocate.py for AMD — rocm,ci/build — by rjrock (closed: 2026-01-29 12:28 (UTC+8)) [💬12 | +15/-2, 2 files | commented:7 changes:1] ## Purpose To get rlhf_colocate.py to pass on AMD's CI.
## Test Plan
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="" VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py` ## Test Result Before: an import error on `from vllm.platforms import current_platform`, due to Ray spawning an Actor with no GPUs allocated. After: exit code of 0. …