[vLLM GitHub Development Digest] 2026-02-03
[Overview]
- Time window: 2026-02-03 11:27 (UTC+8) ~ 2026-02-04 11:27 (UTC+8)
- New issues: 26 (label distribution: bug:14, feature request:4, rocm:3, RFC:3, cpu:2)
- Closed issues: 19
- New PRs: 91 (label distribution: ready:30, v1:24, bug:21, documentation:14, frontend:10)
- Merged PRs: 44
- PRs closed without merging: 16
[New issues]
-
#33669 [Usability]: Unify cache gathering kernels (gather_and_maybe_dequant_cache, cp_gather_cache, cp_gather_and_upconvert_fp8_kv_cache) — feature request — by pavanimajety (created: 2026-02-03 16:41 (UTC+8)) [💬1] Currently, MLA attention uses multiple separate kernel calls for gathering cache data based on the datatype of output kv buffers and stored kv cache:
- ops.gather_and_maybe_dequant_cache: gathers cache with optional dequantization
- ops.cp_gather_cache: gathers cache without dequantization (FP8 path)
- ops.cp_gather_and_upconvert_fp8_kv_cache: gathers and upconverts FP8 cache
Example from _compute_prefill_context:
``` if not use_fp8_prefill: …
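A minimal sketch of the unification this issue asks for, as a single dispatch entry point over the three kernels quoted above. The dispatch conditions here are illustrative assumptions, not vLLM's actual logic:

```python
# Hypothetical sketch: one entry point that picks among the three existing
# gather kernels based on the stored-cache and output dtypes.
def gather_kv_cache(cache_dtype: str, output_dtype: str) -> str:
    if cache_dtype == "fp8" and output_dtype == "fp8":
        return "cp_gather_cache"                       # no dequantization (FP8 path)
    if cache_dtype == "fp8":
        return "cp_gather_and_upconvert_fp8_kv_cache"  # upconvert FP8 cache
    return "gather_and_maybe_dequant_cache"            # optional dequantization
```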
-
#33678 [Bug]: [ROCm] Kimi-K2.5 produces incorrect results on AMD MI308X — bug,rocm — by hiahiawei (创建于: 2026-02-03 17:41 (UTC+8)) [💬1] ### Your current environment
## Environment
- GPU: AMD Instinct MI308X
- vLLM version: v0.15.0
- Model: moonshotai/Kimi-K2.5 (compressed-tensors INT4 quantization)
### 🐛 Describe the bug
## Issue …
-
#33748 [Bug][Infrastructure]: Inconsistent Docker Image Versioning and Missing Tags on Docker Hub — bug — by abeltre1 (创建于: 2026-02-04 09:42 (UTC+8)) [💬1] ### 🐛 Describe the bug
There is NO comprehensive issue tracking all Docker versioning inconsistencies together. The existing issues are fragmented and focus on specific missing versions rather than the systemic problem.
The current Docker image release process creates systematic reproducibility barriers that undermine vLLM’s adoption in production and research environments. Missing version tags (e.g., v0.6.6.post1), stale latest tags lagging weeks behind releases, and absent CUDA 13+ images fo…
-
#33675 [Bug]: [CPU Backend] AttributeError: ‘_OpNamespace’ ‘_C_utils’ object has no attribute ‘init_cpu_threads_env’ — bug,cpu — by HervorTao (创建于: 2026-02-03 17:18 (UTC+8)) [💬3] ### Your current environment
The output of
Collecting environment information... ============================== System Info ============================== OS : macOS 14.5 (arm64) ...python collect_env.py -
#33741 Optimize --help performance: Avoid torch import during help display — no labels — by AbhiOnGithub (created: 2026-02-04 07:06 (UTC+8)) ## Problem
Currently, running vllm --help or vllm serve --help imports torch even though it's not needed for displaying help text. This adds 1-3 seconds of overhead from torch's import time and disk I/O for loading hundreds of MB of libraries.
## Root Cause
The import chain during help display: ``` vllm/entrypoints/cli/main.py → vllm/entrypoints/cli/benchmark/latency.py
… -
#33666 [Bug]: RotaryEmbedding + KVCache ops unable to pattern match for ROCmAiterTritonRopeReshapeKVCacheFusionPass — bug,rocm — by Rohan138 (created: 2026-02-03 15:58 (UTC+8)) [💬8] ### Your current environment
``` Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e) …
-
#33733 [Tracking issue]: Integrate flashinfer optimizations (DeepSeek) — feature request — by hjjq (创建于: 2026-02-04 05:25 (UTC+8)) ### 🚀 The feature, motivation and pitch
Non-exhaustive list of new flashinfer features that may be integrated into vLLM.
- https://github.com/flashinfer-ai/flashinfer/pull/2019
- https://github.com/flashinfer-ai/flashinfer/pull/2099
- https://github.com/flashinfer-ai/flashinfer/pull/2233 - wip: https://github.com/vllm-project/vllm/pull/32957
- https://github.com/flashinfer-ai/flashinfer/pull/2037
- https://github.com/flashinfer-ai/flashinfer/pull/2398
- https://github.co…
-
#33708 [Doc]: Update CPU image docs to use Docker Hub after 0.16.0 release — documentation — by nathan-weinberg (创建于: 2026-02-04 00:27 (UTC+8)) [💬1] ### 📚 The doc issue
Follow up from https://github.com/vllm-project/vllm/pull/32032
Once we have CPU images published on Docker Hub, we should update the CPU image documentation to suggest using these base images over the AWS ECR images currently recommended
cc @bigPYJ1151 @fadara01
### Suggest a potential alternative/fix
…
-
#33645 [RFC]: Improve Dynamo compile times 5-10x via unsafe assumptions — RFC,torch.compile — by zou3519 (创建于: 2026-02-03 12:44 (UTC+8)) [💬6] ### Motivation.
Dynamo graph capture is around half of the total compilation time (for llama3.1 70B and glm-4.7-fp8 after https://github.com/vllm-project/vllm/pull/33641). I think this proposal can make the Dynamo graph capture time 5-10x faster (or more).
### Proposed Change.
- Most vLLM decoder forward passes are the same Transformer block repeated N times.
- Today, Dynamo traces through the N Transformer blocks to capture a forward graph.
- We add a new decorator (maybe needs PyTorch chang…
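A toy pure-Python illustration of the RFC's core idea (the decorator name and mechanics are made up, not the proposed PyTorch API): when all N decoder blocks run the same code, "trace" one block once and reuse the captured result N times instead of tracing through every block.

```python
trace_count = 0
_captured = {}

def capture_once(fn):
    # Stand-in for the proposed decorator: perform the expensive
    # Dynamo-style trace only the first time the block is seen.
    def wrapper(x):
        global trace_count
        if fn not in _captured:
            trace_count += 1   # expensive "trace" happens exactly once
            _captured[fn] = fn # stand-in for a captured graph
        return _captured[fn](x)
    return wrapper

@capture_once
def transformer_block(x):
    return x * 2 + 1

h = 1
for _ in range(32):            # 32 repeated blocks, traced only once
    h = transformer_block(h)
```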
-
#33704 [Bug]: LoRA dtype mismatch in XPU and CPU punica wrappers — no labels — by plugyawn (created: 2026-02-03 23:18 (UTC+8)) ## Your current environment
- vLLM version: main branch
- Platform: GPU / XPU / CPU
## Description
The punica wrappers and LoRA layer implementations hardcode torch.float32 for intermediate buffers in add_lora_linear and add_lora_logits, regardless of the input tensor dtype. This causes dtype mismatches when using bfloat16 models with LoRA.
## Root Cause …
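A sketch of the kind of fix the report implies, using NumPy half-precision as a stand-in for torch bfloat16 (the function name and shapes are hypothetical): allocate the intermediate "shrink" buffer in the input dtype rather than a hardcoded float32.

```python
import numpy as np

def add_lora_delta(x, lora_a, lora_b):
    # Allocate the intermediate buffer in the INPUT dtype, not float32,
    # so half-precision LoRA paths stay dtype-consistent end to end.
    buffer = np.zeros((x.shape[0], lora_a.shape[1]), dtype=x.dtype)
    np.matmul(x, lora_a, out=buffer)  # shrink through low-rank A
    return buffer @ lora_b            # expand through B

x = np.random.randn(4, 16).astype(np.float16)
a = np.random.randn(16, 8).astype(np.float16)
b = np.random.randn(8, 16).astype(np.float16)
delta = add_lora_delta(x, a, b)
```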
-
#33702 [Roadmap]: PD Disaggregation with NixlConnector Roadmap — feature request — by NickLucche (created: 2026-02-03 22:55 (UTC+8)) ### 🚀 The feature, motivation and pitch
## Description
This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library. Currently Supported Features Core Infrastructure…
-
#33685 [Bug]: Step3p5ForCausalLM does not support Pipeline parallelism — bug — by RodriMora (创建于: 2026-02-03 18:59 (UTC+8)) [💬2] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#33697 [Bug]: Model architectures [‘CostWiseGemmaForCausalLM’] are not supported for now. — bug — by CAPOJ (创建于: 2026-02-03 21:28 (UTC+8)) ### Your current environment
Trying to
ValueError: Model architectures ['CostWiseGemmaForCausalLM'] are not supported for now. Supported architectures I updated the vLLM to use BAAI/bge-reranker-v2.5-gemma2-lightweight models by pip install --upgrade typing_extensions pip install -U vllm However, I found that it raised errors. ... -
#33696 [Bug]: [Cpu Backend] Whisper W8A8 failure — bug,cpu — by aditew01 (创建于: 2026-02-03 21:16 (UTC+8)) [💬2] ### 🐛 Describe the bug
Running W8A8 Quantized whisper model - RedHatAI/whisper-large-v3-quantized.w8a8 results in the failure.
```(EngineCore_DP0 pid=35539) output = self.model_runner.execute_model( (EngineCore_DP0 pid=35539) File “/home/aditew01/envs/tvllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 3452, in execute_model (EngineCore_DP0 pid=35539) ) = self._preprocess( (EngineCore_DP0 pid=…
-
#33672 [Bug]: HTTP API multimodal embedding causes image_pad token duplication, producing incorrect results — bug — by ojipadeson (创建于: 2026-02-03 17:08 (UTC+8)) [💬1] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#33692 [Bug]: Flash Attn MLA Backend can't support headdim!=576 — bug — by Yuxin1999 (created: 2026-02-03 20:02 (UTC+8)) ### Your current environment
.
### 🐛 Describe the bug
I am running a custom model that uses MLA with headdim=64; it fails with raise ValueError(f"Head dimension 64 is not supported by MLA"). My attention backend is FlashAttention-2, and I found this restriction is added by MLACommonBackend, which only supports headdim=576. I think this restriction should be removed, or more head dimensions should be supported for different models.
### Before submitting a new issue… …
-
#33670 [Feature]: Add get model info interface — feature request — by Malowking (创建于: 2026-02-03 17:00 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
I need an API endpoint to retrieve the startup parameters used for vLLM, such as --max-model-len. Specifically, I want to be able to query the model's maximum context length through this interface.
### Alternatives
No response
### Additional context
…
- #33689 [RFC]: KV Offloading Roadmap — RFC,keep-open — by orozery (创建于: 2026-02-03 19:22 (UTC+8))
Currently supported features:
- Pluggable out-of-tree offloading backend
- CPU-GPU Offloading (NVIDIA+AMD)
- Custom offloading block size
- Fully asynchronous (cross-engine-steps) offloading / loading
- Immediate Offloading (as opposed to spill-over)
- LRU Eviction
- Cross layer blocks
- Metrics
- Onloading preempted requests …
-
#33643 [Bug]: serve qwen3-asr-1.7b error — bug — by jesse996 (创建于: 2026-02-03 12:42 (UTC+8)) ### Your current environment
The output of
```text Environment Variables ============================== NVIDIA_VISIBLE_DEVICES=GPU-f073b17f-3d74-f069-6db7-176b3a19bbda NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiar...python collect_env.py -
#33684 [Usage]: Mistral-Nemo-12B function calls returned in content instead of tool_calls — usage — by jackformtsai (创建于: 2026-02-03 18:41 (UTC+8)) ### Your current environment
INFO 02-03 10:41:10 [init.py:235] Automatically detected platform cuda. Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 …
-
#33667 Does the latest vLLM require CUDA 12.9 now? — usage — by shiwanghua (created: 2026-02-03 16:08 (UTC+8)) ### Your current environment
https://wheels.vllm.ai/4061dcf4c51ae33aee4bc73096f82da29b580c57
shows 129 and 130
Installing from source, dependency resolution keeps failing:
Installing a near-latest commit id directly falls back to 0.15.0: …
-
#33654 [Bug]: The content of response from Kimi-K2.5 is empty. — bug — by WangTuoxytt (创建于: 2026-02-03 14:23 (UTC+8)) [💬1] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ...python collect_env.py -
#33664 [Bug]: DeepSeek-V3.1 with fp8 KV Cache causes illegal memory access at concurrency ≥ 5 in serve_benchmark — bug — by lyg95 (created: 2026-02-03 15:45 (UTC+8)) ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#33663 [Bug]: RotaryEmbedding CustomOp does not work with gpt-oss — bug,rocm — by Rohan138 (created: 2026-02-03 15:28 (UTC+8)) [💬1] ### Your current environment
``` Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e) …
-
#33661 [Bug]: Hadamard Transform R4 weight length assertion — bug — by songsm921 (创建于: 2026-02-03 15:26 (UTC+8)) ### Your current environment
The output of
```text ### Environment Information ### Operating System: `Linux-6.11.0-1016-nvidia-aarch64-with-glibc2.39` Python Version: `3.12.3 (main, Jan 8 2026, 11:30:50) [GCC 13.3.0]` llm-compressor Version: `None` # Used 0.8.1 version on other machine and import checkpoint ...python collect_env.py -
#33640 [RFC]: Enable reproducible benchmarking in benchmark_serving_multi_turn with API-usage token counts — RFC — by hyunnnchoi (创建于: 2026-02-03 12:23 (UTC+8)) ### Motivation.
The token counting in benchmark_serving_multi_turn.py doesn't match actual API behavior, making benchmark results hard to reproduce. (benchmarks/multi_turn/benchmark_serving_multi_turn.py)
#### Token count mismatch due to missing chat template
Even when using --limit-min-tokens 0 --limit-max-tokens 0 to follow dataset token counts, the benchmark calculates tokens differently from the API:
```python # Current implementation - tokenizes each message independently …
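A toy illustration of the mismatch described above, using a whitespace "tokenizer" and made-up template markers (nothing here is the real benchmark or tokenizer): counting each message's tokens independently misses the chat-template wrapper tokens the server actually processes.

```python
def toy_tokenize(text):
    # Whitespace split stands in for a real tokenizer.
    return text.split()

def toy_chat_template(messages):
    # Made-up role/end markers stand in for a model's chat template.
    return " ".join(f"<|{m['role']}|> {m['content']} <|end|>" for m in messages)

messages = [{"role": "user", "content": "hello there"}]
per_message = sum(len(toy_tokenize(m["content"])) for m in messages)
templated = len(toy_tokenize(toy_chat_template(messages)))
```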
[Closed issues]
-
#16621 [Feature]: Add support for AMD Strix/Strix Halo APU (gfx1150/gfx1151 RDNA 3.5) — feature request,stale — by deific (关闭于: 2026-02-04 10:18 (UTC+8)) [💬8] ### 🚀 The feature, motivation and pitch
Add support for AMD Strix/Strix Halo APU (gfx1150/gfx1151 RDNA 3.5)
### Alternatives
Add support for AMD Strix/Strix Halo APU (gfx1150/gfx1151 RDNA 3.5)
### Additional context
…
-
#19575 [Bug]: Extremely Low Throughput for google/gemma-3-1b-it with vLLM (Only ~80 tokens/sec) — bug,stale — by aryan-metrum-ai (关闭于: 2026-02-04 10:18 (UTC+8)) [💬7] ### Your current environment
# Low Throughput Issue with google/gemma-3-1b-it on vLLM
## Issue Description
We're observing significantly low throughput when serving the google/gemma-3-1b-it model using vLLM with the OpenAI-compatible API interface. Despite using capable hardware and optimal settings, the model underperforms with throughput far below expectations.
## Environment …
-
#20520 [Bug]: Requests that do not return results within 15 minutes are directly aborted, and then the request will be added by vllm again… — bug,stale — by XZhang00 (关闭于: 2026-02-04 10:17 (UTC+8)) [💬7] ### Your current environment
The output of
```text ============================== System Info ============================== OS : TencentOS Server 4.4 (x86_64) ...python collect_env.py -
#32378 [Usage]: How to add mixed text and image modal inputs to documents for qwen3vl-rerank model vllm inference? — usage — by wade0604 (关闭于: 2026-02-04 09:48 (UTC+8)) [💬22] ### Your current environment
``` ============================== Versions of relevant libraries ============================== [pip3] flashinfer-python==0.5.3 [pip3] numpy==2.2.6 [pip3] nvidia-cublas-cu12==12.8.4.1 [pip3] nvidia-cuda-cupti-cu12==12.8.90 …
-
#33675 [Bug]: [CPU Backend] AttributeError: ‘_OpNamespace’ ‘_C_utils’ object has no attribute ‘init_cpu_threads_env’ — bug,cpu — by HervorTao (关闭于: 2026-02-04 09:43 (UTC+8)) [💬3] ### Your current environment
The output of
Collecting environment information... ============================== System Info ============================== OS : macOS 14.5 (arm64) ...python collect_env.py -
#26372 [Bug]: vLLM engine fails with error: ‘ValueError: Counters can only be incremented by non-negative amounts.’ — bug — by Blaze-DSP (关闭于: 2026-02-04 07:34 (UTC+8)) [💬11] ### 🐛 Describe the bug
When serving an LLM with speculative decoding, engine breaks due to duplicate request id
(Docker Image: vllm/vllm-openai:v0.11.0)
Serve Command: ``` command: - vllm …
-
#33546 [Bug][DeepSeekV32]: AttributeError: ‘FlashMLASparseMetadata’ object has no attribute ‘num_decodes’ — bug — by chaunceyjiang (关闭于: 2026-02-04 07:29 (UTC+8)) ### Your current environment
The output of
```text Your output of `python collect_env.py` here ```python collect_env.py…
-
#33543 [Bug]: Some FP8 MoE models fail assertions on GB200 — bug — by jeejeelee (关闭于: 2026-02-04 05:26 (UTC+8)) [💬6] ### Your current environment
The output of
```text Collecting environment information... uv is set ============================== System Info ...python collect_env.py -
#33685 [Bug]: Step3p5ForCausalLM does not support Pipeline parallelism — bug — by RodriMora (关闭于: 2026-02-03 22:49 (UTC+8)) [💬2] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#33670 [Feature]: Add get model info interface — feature request — by Malowking (关闭于: 2026-02-03 19:53 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
I need an API endpoint to retrieve the startup parameters used for vLLM, such as --max-model-len. Specifically, I want to be able to query the model's maximum context length through this interface.
### Alternatives
No response
### Additional context
…
-
#33643 [Bug]: serve qwen3-asr-1.7b error — bug — by jesse996 (关闭于: 2026-02-03 19:33 (UTC+8)) ### Your current environment
The output of
```text Environment Variables ============================== NVIDIA_VISIBLE_DEVICES=GPU-f073b17f-3d74-f069-6db7-176b3a19bbda NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiar...python collect_env.py -
#33532 [CI Failure]: DeepSeek V2 Lite FP8 0% Accuracy [NIGHTLY] — ci-failure — by robertgshaw2-redhat (关闭于: 2026-02-03 18:37 (UTC+8)) [💬4] ### Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#33533 [CI Failure]: Distributed Tests (4 GPUs) [Nightly] — ci-failure — by robertgshaw2-redhat (关闭于: 2026-02-03 15:49 (UTC+8)) ### Name of failing test
torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#33616 [Bug]: reasoning_content to reasoning rename caused regression in chat templates expecting reasoning_content — bug — by koush (关闭于: 2026-02-03 14:46 (UTC+8)) ### Your current environment
The output of
```text Your output of `python collect_env.py` here ```python collect_env.py…
-
#33472 [Feature]: default timezone in container should be UTC — feature request — by lee-b (关闭于: 2026-02-03 14:32 (UTC+8)) ### 🚀 The feature, motivation and pitch
The default timezone in the vllm(nightly) container is America/Los_Angeles. This is a personal developer setting, which the developer should set on his own machine with TZ env var or by mounting /etc/localtime and /etc/timezone into the container. For production releases, UTC would be a much better default within the image.
### Alternatives
The alternative is for EVERY user outside of the LA timezone to have to configure their system just to get a sane …
-
#17660 [Feature]: Support LoRA adapters to vision/merge modules — feature request,stale — by boris-lok-pentadoc (关闭于: 2026-02-03 14:27 (UTC+8)) [💬10] ### 🚀 The feature, motivation and pitch
For many VLM use cases (such as object detection enabled by Qwen 2.5 VL and others) fine-tuning the vision modules is essential, so support for the full application of LoRA adapters would be really nice. The only way to simulate this behaviour with vLLM currently seems to be launching multiple different instances and switching between them with
/sleep and /wake_up commands, which is extremely difficult to manage
### Alternatives
No response
### Ad…
-
#32782 [Bug]: KV cache 0.14 regression vs 0.11 and 0.13 — bug — by dthp-git (关闭于: 2026-02-03 13:49 (UTC+8)) [💬13] ### Your current environment
The output of
```Collecting environment information... /home/anthony/vllm/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: Found GPU4 Quadro M4000 which is of cuda capability 5.2. Minimum and Maximum cuda capability supported by this version of PyTorch is (7.0) - (12.0) ...python collect_env.py -
#33630 [CI Failure]: Test Models (Qwen OMni) — ci-failure — by robertgshaw2-redhat (关闭于: 2026-02-03 12:56 (UTC+8)) ### Name of failing test
FAILED models/test_initialization.py::test_can_initialize_large_subset
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#33497 [Bug]: Low acceptance rate for DeepSeek-V3.2 with deepseek_mtp speculative method in v0.15.0 — bug — by qianlihuang (closed: 2026-02-03 12:40 (UTC+8)) [💬17] ### Your current environment
The output of
```text Image vllm/vllm-openai:v0.15.0 ```python collect_env.py…
[New PRs]
-
#33679 [XPU][4/N] add mxfp4 moe model support — no labels — by jikunshang (created: 2026-02-03 17:43 (UTC+8)) [+53/-31, 1 files | commented:1] ## Purpose [4/N] of https://github.com/vllm-project/vllm/issues/33214: add mxfp4 moe support. We can also refactor the XPU part once mxfp4 adopts the kernel abstraction.
## Test Plan
python3 examples/offline_inference/basic/generate.py --model openai/gpt-oss-20b --temperature 0 --enforce-eager
## Test Result
…
-
#33755 [Model] Enable Step3p5ForCausalLM testing — ready — by jeejeelee (创建于: 2026-02-04 10:58 (UTC+8)) [+5/-5, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... -
#33726 [Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support — new-model,v1 — by benchislett (创建于: 2026-02-04 04:32 (UTC+8)) [+823/-116, 14 files | commented:7] ## Purpose
This PR adds support for MTP for the Nemotron-H model family, which will be introduced with Nemotron V3 Super.
To facilitate this, we also implement speculative decoding support for the Mamba attention backends. Previously, mamba-style speculative decoding support was limited to Qwen3-Next. This PR attempts to implement the attention metadata in a simple and unified manner that does not introduce too much complexity to the backend.
Co-authored wi…
-
#33729 [BugFix][Spec Decoding] Fix negative accepted tokens metric crash — bug,ready,v1 — by njhill (创建于: 2026-02-04 04:47 (UTC+8)) [💬3 | +61/-1, 2 files | commented:1 approved:1] In some cases it’s possible for no tokens to be generated in a step for a request which had spec decode tokens.
It’s actually not clear cut whether we should record 0 accepted tokens in this case or not record at all, it depends on the specific purpose / definition of acceptance rate. For now I am just not recording it, this case should happen relatively infrequently anyhow.
Fixes https://github.com/vllm-project/vllm/issues/26372.
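A sketch of the fix as described (function name and bookkeeping are hypothetical, not vLLM's metrics code): when a spec-decode step produced no tokens for a request, skip the sample instead of recording a negative accepted-token count.

```python
def record_accepted_tokens(samples, num_generated, num_draft):
    # In spec decode, tokens generated in a step = accepted drafts + 1 bonus
    # token; num_generated == 0 would yield a negative "accepted" count.
    if num_generated == 0:
        return  # no tokens this step: do not record a sample at all
    samples.append((num_draft, num_generated - 1))

samples = []
record_accepted_tokens(samples, num_generated=3, num_draft=4)
record_accepted_tokens(samples, num_generated=0, num_draft=4)
```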
-
#33686 feat: Add ColBERT late interaction model support — documentation,new-model,frontend,needs-rebase — by ieBoytsov (创建于: 2026-02-03 19:01 (UTC+8)) [💬11 | +937/-3, 13 files | commented:6] ## Purpose
Add support for ColBERT late interaction models in vLLM. ColBERT is a retrieval model that uses per-token embeddings with MaxSim scoring, offering a balance between the accuracy of cross-encoders and the efficiency of bi-encoders.
Resolves #13827
## Test Plan
Offline tests (embedding correctness, MaxSim scoring, HF comparison)
- pytest tests/models/language/pooling/test_colbert.py -v …
-
#33659 [XPU][2/N] add support unquantized moe support for xpu — ready,ci/build — by jikunshang (创建于: 2026-02-03 15:04 (UTC+8)) [+139/-34, 6 files | commented:1]
## Purpose This PR is [2/N] of https://github.com/vllm-project/vllm/issues/33214 we make xpu moe also modularized and remove monolithic_xpu path.
## Test Plan ci, add a new moe model test for xpu.
## Test Result …
-
#33754 > fix — documentation — by kebe7jun (创建于: 2026-02-04 10:48 (UTC+8)) [💬2 | +38/-2, 3 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33714 [EC Connector] SHMConnector: Share Memory based EC Connector — documentation,frontend,ci/build,v1,multi-modality,llama,qwen,kv-connector — by PiratePai (创建于: 2026-02-04 01:07 (UTC+8)) [💬4 | +557/-0, 8 files | commented:1] ## Purpose
This PR introduces SHMConnector, a new ECConnector implementation that leverages Shared Memory (SHM) and PyTorch RPC to enable low-latency transfer of encoder caches between ECConnector Producer and ECConnector Consumer.
Key Changes:
- Shared Memory Transport: Uses torch.multiprocessing.reductions.reduce_tensor to create shared memory handles for encoder cache tensors, enabling zero-copy-like inter-process transfer without raw data duplication.
- **PyTorch RPC Control Plan…
-
#33753 [CustomOp][MM] Add CustomOp for RelPosAttention — documentation — by shen-shanshan (创建于: 2026-02-04 10:47 (UTC+8)) [💬1 | +200/-161, 4 files | commented:1] ## Purpose
Add CustomOp for RelPosAttention. In this case, this kind of attention can be replaced with custom OOT operations.
## Test Plan
## Test Result
... -
#33752 [Bugfix] Fix kernel benchmark — bug,performance — by jeejeelee (created: 2026-02-04 10:43 (UTC+8)) [+18/-0, 4 files | commented:1 | 📝 draft]
## Purpose
## Test Plan
## Test Result
... -
#33746 [Bugfix] Fix Qwen3VL video frame padding for temporal_patch_size — bug,multi-modality,qwen — by jeremyteboul (创建于: 2026-02-04 09:14 (UTC+8)) [💬1 | +454/-0, 3 files | commented:1] Qwen3VL uses temporal_patch_size=2, meaning video frames must come in pairs for the 3D patch embedding to work correctly. When using pre-processed videos from external sources (do_sample_frames=False), the frame count may be odd and requires padding.
This fix:
- Checks if video frame count is not a multiple of temporal_patch_size
- Pads by duplicating the last frame to reach the required alignment
- Updates metadata (frames_indices, total_num_frames) to reflect padding
- Only applies padding wh…
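The padding step above can be sketched as follows (function name hypothetical; frames shown as strings for brevity): duplicate the last frame until the count is a multiple of temporal_patch_size.

```python
def pad_video_frames(frames, temporal_patch_size=2):
    # If the frame count is not a multiple of temporal_patch_size,
    # duplicate the last frame to reach the required alignment.
    remainder = len(frames) % temporal_patch_size
    if remainder:
        frames = frames + [frames[-1]] * (temporal_patch_size - remainder)
    return frames

padded = pad_video_frames(["f0", "f1", "f2"])
```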
- #33701 [Bugfix] Fix torchrun PP broadcast deadlock with async scheduling — bug,ready,v1 — by Isotr0py (创建于: 2026-02-03 22:27 (UTC+8)) [💬1 | +4/-7, 3 files | commented:4 approved:2]
## Purpose
- Following PR for #33650
- Fix root issue of https://github.com/vllm-project/vllm/issues/33533
torchrun with PP has already broadcast PP outputs during logits computation, so there is no need to broadcast sampled tokens again. https://github.com/vllm-project/vllm/blob/4bc913aeeca39a304e4ace51febf55f142c8c86e/vllm/v1/worker/gpu_model_runner.py#L3551-L3599
## Test Plan
VLLM_LOGGING_LEVEL=DEBUG PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py…
-
#33751 Use Kernel Abstraction for FP4ScaledMM — nvidia — by elizabetht (创建于: 2026-02-04 10:12 (UTC+8)) [+893/-14, 7 files | commented:1]
## Purpose FP4ScaledMM
## Test Plan
## Test Result
…
-
#33652 [Core] Don’t schedule spec tokens with prefill chunks — ready,v1 — by njhill (创建于: 2026-02-03 14:04 (UTC+8)) [💬1 | +129/-29, 5 files | commented:5 approved:1] As well as avoiding some overhead of unnecessarily scheduled tokens, this is needed for linear attention prefix caching.
Also fixes ModelRunner V2 spec decode with non-async scheduling.
Unit test was written by claude code.
-
#33706 [Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode — v1 — by peakcrosser7 (创建于: 2026-02-04 00:00 (UTC+8)) [+38/-16, 1 files | commented:2]
## Purpose
Currently, block-aligned splitting is only executed during the prefill phase of new requests (num_output_tokens == 0). It does not account for resumed requests. This leads to resumed requests being scheduled without block alignment, causing Mamba states to be stored in a non-aligned fashion. Consequently, incorrect states are retrieved during subsequent cache hits.
Changes in this PR:
- Ensure that block-aligned splitting is performed during the …
-
#33677 [Misc] Add deprecated environment variable utilities — no labels — by carlory (created: 2026-02-03 17:37 (UTC+8)) [💬1 | +65/-0, 1 files | commented:2] ## Purpose
Add general-purpose utilities for handling deprecated environment variables with deprecation warnings. These functions can be reused across the codebase when deprecating environment variables in favor of CLI arguments or config options.
This addresses the suggestion from @hmellor in PR #33536 to add general versions of the removed _get_from_env_if_set and _set_from_env_if_set methods to utils.py for reuse in future deprecations.
**New functions added to `vllm/utils/system_uti…
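A minimal sketch of what such a helper could look like (the function signature and the VLLM_OLD_FLAG variable are hypothetical, not the PR's actual code): read a deprecated environment variable if set, emitting a DeprecationWarning that names the replacement.

```python
import os
import warnings

def get_from_env_if_set(name: str, replacement: str, default=None):
    # Read a deprecated env var; warn and point users at the replacement.
    value = os.environ.get(name)
    if value is not None:
        warnings.warn(
            f"{name} is deprecated and will be removed; use {replacement} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return value
    return default

os.environ["VLLM_OLD_FLAG"] = "1"  # hypothetical deprecated variable
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = get_from_env_if_set("VLLM_OLD_FLAG", "--new-flag")
```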
-
#33750 [MM] Align the prefix of MMEncoderAttention with Attention — ready,llama,qwen — by shen-shanshan (创建于: 2026-02-04 10:00 (UTC+8)) [+17/-15, 17 files | commented:1 approved:1] ## Purpose
Align the prefix of MMEncoderAttention with Attention.
Find more details at https://github.com/vllm-project/vllm/pull/33674#discussion_r2758936186.
## Test Plan
## Test Result
…
-
#33749 [ROCm][AITER] Fix AITER import regression for explicit backend selection — rocm,speculative-decoding,v1 — by AndreasKaratzas (created: 2026-02-04 10:00 (UTC+8)) [💬1 | +156/-35, 5 files | commented:1] A regression was introduced that broke explicit AITER backend selection on ROCm when VLLM_ROCM_USE_AITER=0 (or unset). Users could not explicitly select the AITER backend via attention_config={"backend": "ROCM_AITER_FA"} even though the backend was available.
Error observed:
AttributeError: 'builtin_function_or_method' object has no attribute 'flash_attn_varlen_func'
This occurred because:
- `is_aiter_found_and_supported…
-
#33674 [MM] Pass prefix parameter to MMEncoderAttention — ready,llama,qwen — by shen-shanshan (created: 2026-02-03 17:10 (UTC+8)) [+58/-11, 15 files | commented:5 approved:1] ## Purpose
Pass the prefix parameter to MMEncoderAttention. In this case, we can get layer_index from prefix in OOT MMEncoderAttention ops.
This is needed by the PR https://github.com/vllm-project/vllm-ascend/pull/6448 (for performance optimization).
## Test Plan
Just passing the prefix parameter to MMEncoderAttention has no effect on the current code. …
-
#33747 feat(grpc): expose kv_connector and kv_role in GetServerInfoResponse — frontend — by slin1237 (创建于: 2026-02-04 09:29 (UTC+8)) [💬2 | +72/-2, 2 files | commented:1] ## Summary
Add KV transfer config fields to GetServerInfoResponse to allow routers to discover the KV transfer mechanism (Mooncake vs NIXL) at runtime.
This enables intelligent routing decisions for PD disaggregation:
- Mooncake: router injects kv_transfer_params into decode requests
- NIXL: uses automatic prefix matching, no explicit params needed
### Changes
…
-
#33688 [Feature] Enable TRITON_ATTN for Batch Invariance — documentation,ready,v1 — by frankwang28 (created: 2026-02-03 19:17 (UTC+8)) [💬1 | +13/-4, 4 files | commented:1 approved:1] ## Purpose
This PR adds TRITON_ATTN support for batch invariance.
Related / parent issue: #27433
## Test Plan
Run tests with and without the or is_batch_invariant check in triton_unified_attention's unified_attention method.
## Test Result
Tests are run on a B200 (do not have access to a Hopper GPU to validate there 🙁) …
-
#33724 [WideEP] Deprecate pplx all2all backend — nvidia — by tlrmchlsmth (创建于: 2026-02-04 04:20 (UTC+8)) [+6/-0, 1 files | commented:1 approved:1] This PR adds a deprecation warning if the PPLX all2all kernels are used.
The pplx-kernels all2all backend uses https://github.com/perplexityai/pplx-kernels, but this has been replaced by https://github.com/perplexityai/pplx-garden.
pplx-garden does not have the same interface as pplx-kernels so isn’t an easy drop-in replacement, and deprecating and then removing the pplx kernels will reduce the complexity of the MoE layer overall.
-
#33681 [ROCm] [aiter] Split KV cache update for AiterFlashAttention — rocm,v1 — by kliuae (创建于: 2026-02-03 18:11 (UTC+8)) [💬1 | +68/-40, 1 files | commented:2]
## Purpose Supporting #32335, this PR extracts KV cache update from the attention forward pass in AiterFlashAttention.
ROCM_AITER_FA supports both flash and shuffled KV cache layouts. This PR covers both of them and uses the same flag to control the respective KV cache layouts.
## Test Plan Accuracy test with lm_eval
…
-
#33734 [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op — bug,rocm,ready — by charlifu (created: 2026-02-04 05:33 (UTC+8)) [+6/-1, 1 files | commented:1 approved:1] The Aiter op per_1x32_f4_quant_hip returns the tensor as the torch.float4_e2m1fn_x2 type, which causes the gemm_a4w4 op to raise an error because the weight type is still torch.uint8.
This PR fixes this issue.
-
#33731 [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) — ready,torch.compile,ci/build,ready-run-all-tests — by ProExpertProg (创建于: 2026-02-04 05:09 (UTC+8)) [💬1 | +1788/-1435, 58 files | commented:1] ## Purpose
Pure code movement and renaming to prepare for vLLM IR, which will add its own passes and tests. Will wait for #33293 before landing.
## Test Plan CI
## Test Result TBD
…
-
#33745 [Disagg] Fix NIXL handshake failures not honoring kv_load_failure_policy — kv-connector — by rajachan (创建于: 2026-02-04 08:27 (UTC+8)) [💬1 | +31/-9, 1 files | commented:1 | 📝草稿]
When the NIXL handshake fails (e.g., due to a compatibility hash mismatch between prefill and decode instances), requests fail with an "engine dead" error instead of gracefully falling back to local recomputation as configured by `kv_load_failure_policy='recompute'`.
In the NixlConnectorWorker.get_finished() method, failed requests (from handshake or transfer failures) were merged into done_recving and processed identically to successful transfers. The processing loop would then:
- Try to sync KV…
-
#33738 [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend — ready,nvidia — by tlrmchlsmth (创建于: 2026-02-04 06:45 (UTC+8)) [💬2 | +6/-2, 1 files | commented:1 approved:2] ## Purpose
On current main,
`nvidia/DeepSeek-R1-NVFP4` crashes when used with the DeepEP high throughput all2all backend:
```
NotImplementedError
File "vllm/distributed/device_communicators/all2all.py", line 355, in dispatch_router_logits
    raise NotImplementedError
```
…
-
#33735 [torch.compile] Add an option to force-enable the MOE cold start optimization — 无标签 — by zou3519 (创建于: 2026-02-04 05:43 (UTC+8)) [+12/-9, 2 files | commented:5] Previously when speculative decoding was enabled, we would force disable the MOE cold start optimization. I have some spec decoding models where I’m sure the optimization applies, so I wanted a flag to force-enable this.
This does not change the default behavior, which is still "on for most models but off for models with spec decoding".
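A tri-state flag like the one described (default heuristic unless explicitly overridden) is commonly resolved as below. This is a minimal sketch; the function and parameter names are hypothetical, not vLLM's actual config surface:

```python
def moe_cold_start_enabled(force_enable, uses_spec_decode):
    """Resolve a tri-state flag: None keeps the default heuristic
    (optimization off under spec decoding), while an explicit
    True/False overrides the heuristic entirely."""
    if force_enable is not None:
        return force_enable          # user override wins
    return not uses_spec_decode      # default: off for spec decoding
```

Keeping `None` as the default preserves existing behavior while letting users who know the optimization is safe opt in.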
-
#33737 [Bugfix] Define router_logits_dtype for remaining MoE models — bug,ready — by mgoin (创建于: 2026-02-04 06:32 (UTC+8)) [+9/-4, 6 files | commented:1 approved:1]
## Purpose
After adding a kernel support check for
`router_logits_dtype` in https://github.com/vllm-project/vllm/pull/33613, we still needed to audit the remaining model definitions to make sure `router_logits_dtype=torch.float32` was set when needed. In this PR, I systematically went through that. Here are my justifications: AfmoeMoE: `self.gate` is float32 and the forward pass has `router_logits = self.gate(hidden_states.to(dtype=torch.float32))` BailingMoE: …
-
#33743 [Core] Reduce median/average TTFT by up to ~37% with Inter-Prefill-Budget — v1 — by NadavShmayo (创建于: 2026-02-04 07:42 (UTC+8)) [+234/-0, 5 files | commented:1] ## Summary Introduces
`--inter-prefill-budget`, a new scheduling constraint that prevents Head-of-Line (HoL) blocking during the prefill stage. This optimization reduces median TTFT by up to 37% on large models (Gemma-3-27B) by preventing compute-saturated batches from being overloaded with redundant prefill requests. ## Purpose When configuring
`max_num_batched_tokens` for chunked prefill, to optimize for high throughput and low TTFT we want to set a big number to avoid the overhead i…
-
#33744 Resolve compressed-tensors support for non-divisible group_sizes — 无标签 — by dsikka (创建于: 2026-02-04 07:43 (UTC+8)) [+15/-2, 2 files | commented:1 | 📝草稿] WIP
-
#33709 [Frontend] Enable generic structured_outputs for responses API — frontend — by alecsolder (创建于: 2026-02-04 00:35 (UTC+8)) [+58/-7, 2 files | commented:2] ## Purpose The current ResponsesAPI implementation only supports setting an output text format using json_schema; however, for more complicated use cases like grammars, regexes, choices, etc., you need to be able to pass in the full structured_outputs object
## Test Plan
```
vllm serve openai/gpt-oss-20b --enforce-eager --max-model-len=65536 \
  --tool-call-parser=openai --enable-auto-tool-choice --reasoning-parser=openai_gptoss
curl -X POST http://localhost:8000/v1/responses
```
… -
#33742 Remove custom ops for TRTLLM block/tensor FP8 and BF16 MoEs — nvidia — by mgoin (创建于: 2026-02-04 07:23 (UTC+8)) [💬1 | +19/-103, 5 files | commented:1]
## Purpose
There were comments that mentioned it was not understood why these needed to be wrapped in custom ops, so I tried unwrapping them.
## Test Plan
## Test Result
…
-
#33713 [Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching — bug,rocm,ready — by micah-wil (创建于: 2026-02-04 00:51 (UTC+8)) [+1/-1, 1 files | commented:4 approved:1] https://github.com/vllm-project/vllm/pull/33030 fixed dtypes in the Pynccl wrapper, but omitted the case for float8_e4m3fnuz which is used on MI300/325.
Repro command:
`vllm serve QWen/Qwen3-30B-A3B-FP8 --enforce-eager --enable-eplb --all2all-backend allgather_reducescatter --eplb-config '{"window_size":10, "step_interval":100, "num_redundant_experts":0, "log_balancedness":true}' --tensor-parallel-size 2 --data-parallel-size 2 --enable-expert-parallel` When running this on main, I am s…
-
#33740 [WIP] Decompose all gather so clone is exposed to graph — 无标签 — by eellison (创建于: 2026-02-04 06:59 (UTC+8)) [+298/-10, 7 files | commented:1 | 📝草稿] Decomposes remaining all_gather ops to expose reshape + clone for Inductor fusion.
As shown in this benchmark, inductor can codegen this more performantly than eager, as well as potentially fuse into the rest of the graph.
On b200, the benchmark for a reshape + clone is 2.5x faster with inductor than using eager (using cudagraphs to normalize overhead).
Tests added in both tests/compile/distributed/test_all_gather_decompos…
-
#33739 [CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass — bug,rocm,v1,kv-connector — by rasmith (创建于: 2026-02-04 06:45 (UTC+8)) [+122/-117, 1 files | commented:3] ## Purpose
`test_moriio_handshake_returns_metadata` and `test_register_kv_caches` do not make use of the `set_vllm_config` fixture and both result in this error when run:
```
def get_current_vllm_config() -> VllmConfig:
    if _current_vllm_config is None:
        raise AssertionError(
            "Current vLLM config is not set. This typically means "
            "get_current_vllm_config() was called outside of a "
            ...
```
-
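The error above comes from a module-level "current config" that must be set via a context (or fixture) before anything reads it. A minimal sketch of that pattern, with hypothetical names (not vLLM's actual implementation):

```python
from contextlib import contextmanager

_current_config = None  # set only inside set_current_config below

@contextmanager
def set_current_config(cfg):
    """Install cfg as the ambient config for the duration of the block,
    restoring the previous value on exit (so contexts can nest)."""
    global _current_config
    prev, _current_config = _current_config, cfg
    try:
        yield cfg
    finally:
        _current_config = prev

def get_current_config():
    """Fail loudly when called outside a set_current_config block,
    mirroring the AssertionError in the traceback above."""
    if _current_config is None:
        raise AssertionError(
            "current config is not set; wrap the call in set_current_config(...)"
        )
    return _current_config
```

Tests that exercise code calling `get_current_config()` therefore need the fixture that enters this context, which is exactly what the PR adds.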
#33718 [Refactor] Remove unused dead code — ready — by yewentao256 (创建于: 2026-02-04 01:51 (UTC+8)) [💬2 | +0/-77, 3 files | commented:1 approved:1] ## Purpose
Remove unused dead code
-
#33736 [Spec Decode] Add hidden states extraction system — documentation,new-model,speculative-decoding,v1,kv-connector — by fynnsu (创建于: 2026-02-04 06:18 (UTC+8)) [💬1 | +1270/-4, 9 files | commented:1 | 📝草稿] ## Purpose WORK IN PROGRESS In-tree implementation of hidden states extraction system #33118.
Implements a new spec decode method “extract_hidden_states”:
- ExtractHiddenStatesProposer (based off EagleProposer)
- Changes to gpu_model_runner are just the changes required to add another proposer/spec method
- Includes config handling logic in vllm/config/speculative.py and vllm/transformers_utils/configs/extract_hidden_states.py (similar to any other spec met…
-
#33717 Add AgentMarket - Real Energy Data for AI Agents — documentation — by stromfee (创建于: 2026-02-04 01:31 (UTC+8)) [💬3 | +2/-0, 1 files | commented:1] AgentMarket.cloud - B2A marketplace with 28M+ real energy records.
No electricity, no AI. ⚡
https://agentmarket.cloud
-
#33730 [DNM] testing docker build cache regression — ci/build — by dougbtv (创建于: 2026-02-04 04:54 (UTC+8)) [+1/-1, 1 files | commented:1]
## Purpose
for testing only at the moment.
-
#33732 Implement zero-copy GQA for multimodal and CPU — v1,cpu — by voidbag (创建于: 2026-02-04 05:11 (UTC+8)) [💬6 | +18/-32, 4 files | commented:3]
## Purpose - Zero-copy GQA via SDPA for the multimodal and CPU backends. - The existing version copies heads to match num_heads. - This patch removes the copies and uses the `enable_gqa` option.
## Test Plan
## Test Result …
-
#33720 Onboard voyage-4-nano — new-model,qwen — by chengchengpei (创建于: 2026-02-04 03:06 (UTC+8)) [💬2 | +141/-2, 2 files | commented:1] ## Purpose
Onboard voyage-4-nano
## Test Plan
Run ``` from vllm import LLM from vllm.config import PoolerConfig …
-
#33712 [compile] Remove runner type from ignored caching factor list. — ready — by zhxchen17 (创建于: 2026-02-04 00:48 (UTC+8)) [+2/-3, 1 files | commented:1 approved:3] Summary:
The draft runner should be compiled separately. Currently it's ignored from the caching factors, which results in draft models reusing the same dynamo artifacts and causes issues in https://github.com/pytorch/pytorch/issues/173546
Test Plan:
pytest tests/v1/e2e/test_spec_decode.py::test_draft_model_correctness -v -k “False-args0”
-
#33725 Added latency and throughput benchmark for beam search — performance — by access2rohit (创建于: 2026-02-04 04:24 (UTC+8)) [+541/-0, 1 files | commented:8 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... - #33716 [Voxtral Realtime] Change name — documentation,ready,multi-modality — by patrickvonplaten (创建于: 2026-02-04 01:18 (UTC+8)) [💬1 | +6/-6, 4 files | commented:1 approved:1] Sorry naming changed a tiny bit
-
#33722 [Deprecation] Deprecate profiling envs — documentation,ready — by yewentao256 (创建于: 2026-02-04 04:12 (UTC+8)) [💬2 | +2/-49, 2 files | commented:1] ## Purpose
Deprecate profiling envs as scheduled
-
#33728 [WideEP] Remove naive all2all. Use allgather_reducescatter instead — cpu,nvidia — by tlrmchlsmth (创建于: 2026-02-04 04:44 (UTC+8)) [+18/-140, 4 files | commented:1 | 📝草稿] Remove the naive broadcast-based all2all implementation and use the
`allgather_reducescatter` instead if `naive` is specified. `allgather_reducescatter` is already the default and should be more efficient. Overall, this simplifies the codebase by removing a seldom-used backend -
#33695 enable skipping of SW attention layers when using FP8 KV cache — 无标签 — by jmkuebler (创建于: 2026-02-03 20:55 (UTC+8)) [💬3 | +16/-0, 3 files | commented:1]
## Purpose This PR enables us to keep Sliding Window Attention layers in BF16 whilst quantizing the Full Attention layers. The idea is that there is not much latency / memory to be saved in SW layers, but the quantization overheads are paid nonetheless. Furthermore, from an accuracy perspective, skipping the SW layers is a more conservative approach and minimizes the risk of accuracy degradation. We thus expect slight ITL gains.
It’s the last optimization to…
-
#33727 [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs — bug,cpu — by fadara01 (创建于: 2026-02-04 04:34 (UTC+8)) [💬1 | +3/-1, 1 files | commented:1] ## Purpose
This enables us to run RedHatAI/whisper-large-v3-quantized.w8a8 on CPU.
Fixes: #33696
## Test Plan
Reproducer #33696 Will enable this in CI once we accelerate the w8a8 GEMMs in oneDNN …
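Supporting 3D activations on a 2D-only GEMM backend typically means flattening the leading dimensions, running the matmul, and restoring the shape. A minimal sketch of that pattern (numpy stand-in; the function names are illustrative, not the oneDNN API):

```python
import numpy as np

def matmul_2d_only(a2d, w):
    """Stand-in for a backend GEMM that only accepts 2D activations,
    like the w8a8 oneDNN path described above."""
    assert a2d.ndim == 2, "backend kernel requires 2D input"
    return a2d @ w

def matmul_any(a, w):
    """Accept N-D activations: flatten leading dims to one batch dim,
    call the 2D kernel, then restore the original leading shape."""
    lead = a.shape[:-1]
    out = matmul_2d_only(a.reshape(-1, a.shape[-1]), w)
    return out.reshape(*lead, w.shape[-1])
```

Since the flatten is a view (no copy) for contiguous inputs, this wrapper adds essentially no overhead.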
-
#33646 [Bugfix] Handle case when kimi ends reasoning with a tool call — bug — by koush (创建于: 2026-02-03 12:48 (UTC+8)) [💬2 | +226/-2, 2 files | commented:2] ## Purpose
The Kimi reasoning parser is currently based off DeepSeek's. However, Kimi may start a tool call without emitting a think-end token. When Kimi ends with a tool call inside reasoning, this is registered simply as an end of message and no tool call is parsed, because the reasoning end token was never output and the tool call itself is considered part of that reasoning. The tool call is thus not executed either.
My initial approach was to try to maintain inheritance with deepseek, but I found t…
-
#33723 [wip] fp8 online quant with blockwise scaling — 无标签 — by vkuzo (创建于: 2026-02-04 04:17 (UTC+8)) [+229/-48, 2 files | commented:1 | 📝草稿] Summary:
not ready for review yet
Test Plan:
…
-
#33721 merge — frontend — by NilsHellwig (创建于: 2026-02-04 03:24 (UTC+8)) [+121/-30, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33707 Onboard voyage-4-nano — new-model,qwen — by chengchengpei (创建于: 2026-02-04 00:04 (UTC+8)) [💬5 | +140/-0, 2 files | commented:1] ## Purpose
Onboard voyage-4-nano to vllm
## Test Plan
Run ``` from vllm import LLM from vllm.config import PoolerConfig …
-
#33710 [Core] Improve KV cache block reuse for prefix caching — v1 — by jaewonlee-fb (创建于: 2026-02-04 00:42 (UTC+8)) [+62/-4, 2 files | commented:3] ## Summary Optimize KV cache block allocation to preserve cached blocks longer, improving prefix cache hit rates.
Changes:
- Add `prepend_n()` method to `FreeKVCacheBlockQueue` for inserting blocks at the front of the free list
- Modify `free_blocks()` to separate fresh vs cached blocks:
  - Fresh blocks (no hash) → front of queue (reused first)
  - Cached blocks (has hash) → back of queue (preserved longer)
…
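The fresh-vs-cached policy above amounts to a double-ended free list: allocation pops from the front, so fresh blocks go to the front (recycled first) and cached blocks go to the back (evicted last). A minimal sketch with illustrative names (not vLLM's actual classes):

```python
from collections import deque

class FreeBlockQueue:
    """Sketch of a free list that preserves cached blocks longer.
    Blocks are plain dicts here; a None "hash" marks a fresh block."""

    def __init__(self):
        self._q = deque()

    def free_blocks(self, blocks):
        for block in blocks:
            if block.get("hash") is None:
                self._q.appendleft(block)   # fresh: reuse first
            else:
                self._q.append(block)       # cached: preserve longer

    def allocate(self):
        return self._q.popleft()            # always take from the front
```

Because cached blocks sit at the back, they survive until every fresh block has been recycled, which raises the chance a future request with the same prefix still finds them.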
-
#33719 [Core] Begin refactor of structured output backend batching logic [1/n] — structured-output,v1 — by alecsolder (创建于: 2026-02-04 02:26 (UTC+8)) [+992/-19, 8 files | commented:1 | 📝草稿] ## Purpose First PR in a stack for structured output refactoring to support backend specific batching functionality
## Test Plan Unit tests for this PR
## Test Result Passed
-
#33715 [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE — nvidia — by Linda-Stadter (创建于: 2026-02-04 01:13 (UTC+8)) [+281/-0, 4 files | commented:1]
## Purpose Adding tests for trtllm bf16 moe backend added in PR [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #32954
- unit and integration test for the new moe backend
- unit tests for utility functions ## Test Plan
## Test Result …
-
#33680 [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention — deepseek — by LopezCastroRoberto (创建于: 2026-02-03 18:04 (UTC+8)) [💬3 | +506/-13, 6 files | commented:1] ## Summary This PR adds an optimized top-k per-row decode kernel for DeepSeek-V3.2 sparse attention (K = 2048). The new kernel replaces vLLM’s native
`top_k_per_row_decode` in the decode path for long-context inference. It uses a 5-pass radix selection algorithm optimized for large-K workloads and long sequence lengths. The implementation is adapted from TileLang / SGLang (PR by DarkSharpness: https://github.com/sgl-project/sglang/pull/11194). ## Performance characteristics Not intended to repl…
- #33703 [Bugfix] Support multi-type params parsing for DeepSeek v3.2 — bug,deepseek — by kizill (创建于: 2026-02-03 23:10 (UTC+8)) [💬4 | +217/-19, 2 files | commented:1]
## Purpose
Kilo code uses multi-typed params for some reason, and such calls fail to render with the exception
`'list' object has no attribute 'lowercase'` when Kilo code passes `type=['str', 'null']`. ## Test Plan I added a tests/tool_parsers/test_deepseekv32_tool_parser.py test suite to test different type-value casts. ## Test Result The suggested tests for DSV3.2 work ok.
-
#33641 [torch.compile] Significantly speed up cold start times — ready — by zou3519 (创建于: 2026-02-03 12:24 (UTC+8)) [+41/-21, 3 files | commented:3 approved:1]
Previously, the vLLM-compile compilation:
- does a full dynamo trace to get a full graph
- splits the graph on attention ops into N subgraphs
- compiles each of the N subgraphs via standalone_compile.
- If any of the subgraphs are the same, we rely on aotautograd and inductor to cache-hit, but we still end up producing N artifacts total. Usually there are 3 unique graphs in vLLM models.
- vLLM stores a mapping of N subgraph indices to N compiled artifacts.
Taking control of this back into …
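The dedup opportunity described above (N subgraphs, only ~3 unique) can be sketched by compiling each structurally-identical subgraph once and mapping every slot to the shared artifact. An illustrative sketch, not vLLM's actual compilation code:

```python
def compile_unique(subgraphs, compile_fn):
    """Compile each unique subgraph once; map all N slots onto the
    deduped artifacts. `hash` stands in for a structural graph hash."""
    artifacts = {}
    mapping = []
    for graph in subgraphs:
        key = hash(graph)
        if key not in artifacts:
            artifacts[key] = compile_fn(graph)  # expensive: only on miss
        mapping.append(artifacts[key])
    return mapping, len(artifacts)
```

With 3 unique graphs among N, cold-start compilation cost drops from N compiles to 3, without relying on a downstream cache to deduplicate.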
-
#33711 [Core] SWA-specific KV cache optimizations for prefix caching — v1 — by jaewonlee-fb (创建于: 2026-02-04 00:42 (UTC+8)) [+126/-5, 4 files | commented:1] ## Summary Optimize sliding window attention (SWA) layers for better prefix cache efficiency.
Changes:
- Immediate SWA block eviction after prefill
- Free out-of-window blocks right when prefill completes
- Faster block pool capacity reclaim for concurrent requests
- Evict hashes from prefix cache (they won’t be looked up again)
- Modified `get_num_skipped_tokens()` for prefix cache compatibility …
-
#33691 [EC Connector] SHMConnector: Share Memory based EC Connector for EPD — documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,llama — by PiratePai (创建于: 2026-02-03 19:48 (UTC+8)) [💬6 | +6308/-3302, 187 files | commented:2] ## Purpose
This PR introduces SHMConnector, a new implementation of the ECConnector that leverages Shared Memory (SHM) and ZMQ to facilitate low-latency transfer of encoder caches between Producer and Consumer processes.
Key Changes:
- Shared Memory Transport: Uses torch.multiprocessing.reductions.reduce_tensor to create shared memory handles, allowing zero-copy-like transfer of tensors between processes.
- ZMQ Control Plane: Implements a REQ/ROUTER pattern via ZMQ to reliably broadcast meta…
-
#33660 [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. — ready,ready-run-all-tests — by whx-sjtu (创建于: 2026-02-03 15:22 (UTC+8)) [+13/-31, 3 files | commented:1 approved:1] ## Purpose As a task in https://github.com/vllm-project/vllm/issues/32676, this PR applies PluggableLayer to mamba layers, including mamba_mixer, mamba_mixer2 and plamo2_mamba_mixer.
## Test Plan All ci tests should pass.
## Test Result All ci tests should pass.
…
-
#33673 Fix Gemma3n audio encoder for Transformers v5 — ready — by hmellor (创建于: 2026-02-03 17:09 (UTC+8)) [+9/-4, 1 files | commented:1 approved:1] https://github.com/huggingface/transformers/pull/42564 updated all the multimodal feature getting methods to return
`BaseModelOutputWithPooling` (which is a `dict`) by default. This is a welcome change as it standardises these methods across all models in Transformers. However, this caused issues for Gemma3n in vLLM because it instantiates the
`audio_tower` using `from_config` (therefore using the `forward` method from Transformers) and expects it to return a tuple. This PR handles the output …
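Handling both return conventions (pre-v5 tuple vs v5 dict-like `ModelOutput`) typically comes down to a small adapter like the one below. This is a hedged sketch of the pattern, not the PR's actual code; a plain dict stands in for `BaseModelOutputWithPooling`:

```python
def extract_audio_features(output):
    """Accept either the legacy tuple return (features first) or the
    Transformers v5 dict-like output keyed by 'last_hidden_state'."""
    if isinstance(output, tuple):
        return output[0]
    return output["last_hidden_state"]
```

Call sites then stay agnostic to which Transformers version produced the output.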
-
#33705 [Hybrid] Enable spec decoding in mamba cache align mode — v1 — by peakcrosser7 (创建于: 2026-02-04 00:00 (UTC+8)) [+1/-8, 2 files | commented:1 | 📝草稿] ## Purpose Re-enabled spec decoding: Previously disabled in https://github.com/vllm-project/vllm/pull/30877, speculative decoding is now re-enabled as the related issues have been confirmed as fixed.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33699 [Bugfix] Fix startup hang for Granite Speech — bug,ready,multi-modality — by DarkLight1337 (创建于: 2026-02-03 22:01 (UTC+8)) [+8/-8, 1 files | commented:1 approved:2] ## Purpose
Fix the hanging models initialization test. Turns out even the processor initialization needs to be inside
`set_default_torch_num_threads` for certain models. ## Test Plan
pytest tests/models/test_initialization.py::test_can_initialize_large_subset[GraniteSpeechForConditionalGeneration]…
-
#33647 [Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. — bug,frontend,ready — by noooop (创建于: 2026-02-03 12:55 (UTC+8)) [+4/-1, 1 files | commented:1 approved:2]
## Purpose
Do not add extra \n for image-only cases when constructing multimodal text prompts.
address https://github.com/vllm-project/vllm/pull/33060#discussion_r2757045468
## Test Plan …
-
#33671 [Bugfix][Feature] Fix Anthropic Message Stream Converter and Support Thinking Block — bug,frontend — by mariohong128 (创建于: 2026-02-03 17:02 (UTC+8)) [💬2 | +228/-83, 2 files | commented:1] ## Purpose Issue: The
`message_stream_converter` function was unable to properly handle chunks that simultaneously contain reasoning, text, and multiple tool calls. This PR fixes bugs in the Anthropic Message Stream Converter (Anthropic /v1/messages REST API endpoint) and supports thinking blocks.
- Improved tool call processing logic to handle multiple tool calls in a single delta chunk, allowing the function to process all content types (reasoning, text, and tool calls) within a single chunk
- Suppo…
- #33690 [Bugfix] Fix step3p5 parser when using mtp — bug — by mariohong128 (创建于: 2026-02-03 19:45 (UTC+8)) [💬1 | +1455/-5, 2 files | commented:1]
## Purpose
Fix step3.5 parser when using mtp.
If the model outputs
`</tool_call><tool_call><` (using MTP greatly increases the possibility of this), the parser will incorrectly start a new empty tool call. ``` {"error":{"message":"1 validation error for ValidatorIterator\\n1.function.name\\n Field required [type=missing, input_value={'arguments': '{}'}, input_type=dict]\\n For further information visit https://errors.pydantic.dev/2.11/v/missing None","type":"BadRequestError… -
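A common guard against this class of bug is to drop empty or whitespace-only tool-call bodies during parsing, so adjacent end/start tokens never yield a phantom call. A toy regex-based sketch of that guard (illustrative; not the step3.5 parser's actual code):

```python
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract tool-call bodies, skipping empty ones so fragments like
    '</tool_call><tool_call><' cannot produce a bogus empty tool call."""
    return [body.strip() for body in TOOL_CALL.findall(text) if body.strip()]
```

A truncated trailing `<tool_call><` fragment also parses cleanly, since the regex only matches complete open/close pairs.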
#33700 Bugfix/issue 30128 — bug,v1,qwen — by Hivenet-Igor (创建于: 2026-02-03 22:04 (UTC+8)) [💬1 | +23/-11, 4 files | commented:2] ## Purpose Fix an issue in the Ray distributed executor where setting --pipeline_parallel_size > 3 caused worker initialization failures due to incorrect rank assignment. https://github.com/vllm-project/vllm/issues/30128
When running vLLM on a Ray cluster with larger pipeline parallel configurations, the initial rpc_rank assigned by Ray did not match the final logical rank layout expected by the pipeline parallel executor. This mismatch led to incorrect layer placement and runtime errors during…
-
#33698 [Core][BugFix] Fix PP KV cache sharding memory validation — bug,v1 — by junuxyz (创建于: 2026-02-03 21:40 (UTC+8)) [+183/-13, 2 files | commented:1] Fixes #32105. Refs #32782. Regression introduced by #29431 (commit 8ee90c8).
This PR fixes incorrect KV cache sharding memory validation under Pipeline Parallelism, which could cause false OOM errors.
## Purpose
According to the docs, since v0.14.0,
`get_kv_cache_configs` works as follows:
-
#33693 [Draft] Move Harmony encoding to renderer layer and add gRPC render server — frontend,needs-rebase,gpt-oss — by hyeongyun0916 (创建于: 2026-02-03 20:33 (UTC+8)) [💬1 | +1399/-324, 18 files | commented:3 | 📝草稿] Background I’ve concluded that integrating Harmony into the Renderer layer is essential for a complete and functional render implementation. I’ve opened this Draft PR to verify if this proposed final architecture aligns with the project’s direction.
If the direction is confirmed, I plan to break this down into smaller, manageable PRs.
Changes
- Added uses_harmony property (9e6a89e) - Introduced the uses_harmony pro…
- #33694 [Bugfix] Fix ubatch wrapper num_tokens calculate — bug,v1 — by jiangkuaixue123 (创建于: 2026-02-03 20:47 (UTC+8)) [💬2 | +1/-3, 1 files | commented:1] ## Purpose Based on PR #30120 , we added support for variable ubatch-size; however, the calculation of num_tokens here was incorrect, and this PR fixes the issue.
-
#33683 Fix Gemma3 GGUF for Transformers v5 — ready — by hmellor (创建于: 2026-02-03 18:39 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] https://github.com/huggingface/transformers/pull/41314 removed
`from_text_vision_configs` as it's a long-deprecated method which serves no purpose (you can just use `__init__`). This PR updates the
`Gemma3Config` instantiation for GGUF models to use `__init__` so that they may continue to work with Transformers v5. -
#33682 Fix offline test for Transformers v5 — ready — by hmellor (创建于: 2026-02-03 18:20 (UTC+8)) [+7/-0, 1 files | commented:1 approved:1] As described in https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md#remote-code-incompatibility,
`tokenization_utils` and `tokenization_utils_fast` have been removed and are aliased for backward compatibility. The re-import logic in the offline tests is incompatible with these aliases, so we ensure that they are deleted during reloading so that they will be re-aliased on the next import.
-
#33644 [Bugfix] fix qwen3-asr response error — bug,ready,multi-modality,qwen — by jesse996 (创建于: 2026-02-03 12:44 (UTC+8)) [💬1 | +7/-6, 1 files | commented:5 approved:1]
## Purpose fix https://github.com/vllm-project/vllm/issues/33643 ## Test Plan
## Test Result
... -
#33687 [Refactor] Clean up input preprocessing — ready,multi-modality — by DarkLight1337 (创建于: 2026-02-03 19:17 (UTC+8)) [💬1 | +91/-204, 4 files | commented:1]
## Purpose
Remove code specific to text-only encoder-decoder models, since we now implement them as MM models to simplify the code.
## Test Plan
## Test Result
…
-
#33656 [Misc] Update default image format of `encode_base64` — ready,multi-modality — by DarkLight1337 (创建于: 2026-02-03 14:39 (UTC+8)) [💬1 | +4/-18, 2 files | commented:1 approved:2] ## Purpose
Resolve the announced backwards-incompatible change.
Also clean up related code.
## Test Plan
…
-
#33648 [Feature] Support manually enabling the cumem allocator — documentation,v1 — by kebe7jun (创建于: 2026-02-03 12:58 (UTC+8)) [💬1 | +32/-10, 4 files | commented:1] ## Purpose
Follow up: https://github.com/vllm-project/vllm/pull/33540#issuecomment-3834388477
Adding the
`--enable-cumem-allocator` parameter allows users to manually enable the cumem allocator without sleep mode, which is important when using Nixl for PD isolation on the GB series. I have run tests on GB200 and H200, and both are working properly. ## Test Plan
GB200: …
-
#33665 [Refactor] Clean up pooling serial utils — documentation,frontend,ready — by DarkLight1337 (创建于: 2026-02-03 15:50 (UTC+8)) [💬1 | +407/-322, 9 files | approved:1 commented:2]
## Purpose
- Use `pybase64`, which is faster than `base64`
- Consolidate the different attributes into a dataclass so we can have a single dictionary.
- Moved entrypoint-specific utils under entrypoints directory
## Test Plan
…
-
#33676 [CI/Bug]Fix multimodal encoder budget KeyError for shared audio placeholders — bug,multi-modality — by pacoxu (创建于: 2026-02-03 17:20 (UTC+8)) [💬2 | +2/-2, 1 files | commented:1] ## Purpose
Fix a `KeyError` in CI https://buildkite.com/vllm/ci/builds/49647/steps/canvas?jid=019c205f-dc84-45ec-949c-ea4fc62d9a54
- the CI is a main branch CI failure.
```
[2026-02-02T22:40:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/input_processor.py:79: in init – [2026-02-02T22:40:50Z] mm_budget = MultiModalBudget(vllm_config, mm_registry) …
-
#33668 [Compile] Add autotune_batch_hint config for dynamic shapes — 无标签 — by NikhilAPatel (创建于: 2026-02-03 16:13 (UTC+8)) [💬2 | +49/-3, 3 files | commented:1] ## Summary
Add a new config option
`autotune_batch_hint` to `DynamicShapesConfig` that allows specifying a hint for the batch dimension size during autotuning. - When set, this value is passed as `hint_override` to `torch.mark_dynamic()` or `torch._dynamo.decorators.mark_unbacked()` for dimension 0 (batch)
- Works with both BACKED and UNBACKED dynamic shapes types
- Default is `None`, which …
-
#33642 [Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token — bug,ready,deepseek — by l4b4r4b4b4 (创建于: 2026-02-03 12:35 (UTC+8)) [💬1 | +5/-4, 1 files | commented:4 approved:1] ## Summary
This PR fixes an issue where DeepSeek-OCR-2 models return empty responses when using the
`/v1/chat/completions` endpoint. ## Root Cause
DeepSeek-OCR-2 models have
`model_type="deepseek_vl_v2"` but require a BOS token for proper inference. The existing mapping used `template_deepseek_vl2.jinja`, which does not include the BOS token, causing the model to produce no output. ## Fix
…
-
#33657 [XPU] Support Qwen3 next — v1,qwen — by yma11 (创建于: 2026-02-03 14:53 (UTC+8)) [💬3 | +38/-4, 6 files | commented:1]
## Purpose This PR enables Qwen3-next support for XPU path.
## Test Plan
`VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-Next-80B-A3B-Instruct --enforce-eager --quantization fp8 -tp 8` ## Test Result …
-
#33653 [Docs] Do not build the document if there are no docs changes — 无标签 — by kebe7jun (创建于: 2026-02-03 14:18 (UTC+8)) [💬3 | +13/-1, 1 files | commented:1 | 📝草稿]
## Purpose
Right now,
the `docs/readthedocs.org:vllm` job runs for every PR, which causes Read the Docs to frequently report "Maximum concurrency limit reached". Most PRs don't touch documentation, so this ends up blocking the CI builds for PRs that do modify docs, delaying feedback and wasting time. This PR adds a mechanism to skip Read the Docs builds when there are no documentation changes, avoiding unnecessary builds and improving overall CI efficiency.
See …
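The skip mechanism described above boils down to a path filter over the changed files. A minimal sketch using stdlib `fnmatch`; the patterns below are illustrative, not the PR's actual configuration:

```python
from fnmatch import fnmatch

# Hypothetical patterns marking files that affect the docs build.
DOC_PATTERNS = ("docs/*", "*.md", "mkdocs.yaml")

def needs_docs_build(changed_files):
    """Return True only when at least one changed file matches a
    documentation pattern, so doc-free PRs skip the build entirely."""
    return any(
        fnmatch(path, pat) for path in changed_files for pat in DOC_PATTERNS
    )
```

Wiring this into CI means the Read the Docs job consumes concurrency slots only for PRs that can actually change the rendered docs.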
-
#33650 [CI/Build] Investigate torchrun distributed tests hanging issue — ready — by Isotr0py (创建于: 2026-02-03 13:15 (UTC+8)) [💬1 | +6/-0, 2 files | commented:1 approved:1]
## Purpose
- Fix #33533
- When enabling async scheduling,
`get_pp_group().send_tensor_dict(...)` will get stuck when enabling PP: https://github.com/vllm-project/vllm/blob/b95cc5014dc7b260e5c70ae33d1b30c54d11306d/vllm/v1/worker/gpu_model_runner.py#L3582-L3586
## Test Plan ``` PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py …
-
#33658 feat(mla): add default do_kv_cache_update for MLA — v1 — by dw2761 (创建于: 2026-02-03 14:53 (UTC+8)) [+50/-9, 2 files | commented:2] ## Purpose This PR is part of #32335
It extracts the MLA KV-cache update op from the MLA attention layer into a default
`MLAAttentionImpl.do_kv_cache_update` implementation. ## Test Plan Run v1 latency benchmark with dummy weights on both main and this PR branch, explicitly selecting the MLA backend
`--attention-backend FLASH_ATTN_MLA`. ## Test Result
…
-
#33662 [XPU][3/N] add int4 gemm support for xpu(awq/gptq/CT w4a16) — 无标签 — by jikunshang (创建于: 2026-02-03 15:26 (UTC+8)) [+119/-66, 3 files | commented:1]
## Purpose This PR is [3/N] of https://github.com/vllm-project/vllm/issues/33214, adding int4 GEMM support for XPU (AWQ/GPTQ/CT w4a16)
## Test Plan
## Test Result
…
-
#33655 [Model] Add support for Solar Open — documentation,new-model — by oesni (创建于: 2026-02-03 14:34 (UTC+8)) [💬1 | +707/-9, 4 files | commented:1]
## Purpose This PR adds support for the Solar Open model from Upstage.
## Test Plan
- Tested on a local machine with Solar-Open-100B
- Verified that the
`/chat/completions` API works correctly with the model weights.
## Test Result …
-
#33651 [WIP][Kernel] Add Helion kernel for scaled_mm — 无标签 — by xiaohongchen1991 (创建于: 2026-02-03 13:43 (UTC+8)) [💬2 | +204/-0, 2 files | commented:1] ## Purpose This PR is to add Helion kernel for
`scaled_mm` operation. It follows the implementation from the Triton version. This is a subtask for https://github.com/vllm-project/vllm/issues/32962.
## Test Plan
- Test correctness
- Added unit test to cover its correctness
- Benchmark performance at kernel level …
-
#33649 [Benchmark] Enable reproducible benchmarking with API-usage token counts — performance — by hyunnnchoi (创建于: 2026-02-03 13:05 (UTC+8)) [💬1 | +74/-14, 1 files | commented:1 | 📝草稿]
## Purpose
This PR addresses token counting inconsistencies between
`benchmark_serving_multi_turn.py` and actual API behavior, as described in RFC #33640. Resolves #33640. ### Problem
- Token count mismatch: The benchmark tokenizes each message content independently, missing special tokens (
<|im_start|>,<|im_end|>, role markers) that chat templates add - Missing
`output_tokens` field: Re-tokenizing assistant content on each run causes variance ac…
-
#33639 [DRAFT] Enable torch.compile on Gemma3n_mm Multimodal Embedding Layer — documentation,needs-rebase — by Lucaskabela (创建于: 2026-02-03 12:20 (UTC+8)) [💬2 | +31/-12, 2 files | commented:1] See title - this PR enables torch.compile decorator on the Gemma3n multimodal embedding layer
## Purpose
## Test Plan
vllm serve google/gemma-3n-E2B-it --compilation-config='{"compile_mm_encoder":"true"}'…
[已合并 PR]
-
#33729 [BugFix][Spec Decoding] Fix negative accepted tokens metric crash — bug,ready,v1 — by njhill (合并于: 2026-02-04 07:34 (UTC+8)) [💬3 | +61/-1, 2 files | commented:1 approved:1] In some cases it’s possible for no tokens to be generated in a step for a request which had spec decode tokens.
It's actually not clear-cut whether we should record 0 accepted tokens in this case or not record at all; it depends on the specific purpose/definition of acceptance rate. For now I am just not recording it; this case should happen relatively infrequently anyhow.
Fixes https://github.com/vllm-project/vllm/issues/26372.
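The skip-instead-of-crash choice described above can be sketched as a small metric helper. This is an illustrative sketch of the semantics, not vLLM's actual metrics code; it assumes one bonus token is always emitted alongside accepted draft tokens:

```python
def accepted_token_count(num_generated, num_draft):
    """Return the accepted-draft-token count for one step, or None to
    mean 'do not record'. When no tokens were generated for a request
    that had draft tokens, generated - 1 would go negative, so we skip
    recording rather than emit a negative metric."""
    if num_generated == 0:
        return None
    # num_generated includes the bonus token, so accepted = generated - 1,
    # capped by the number of draft tokens proposed.
    return min(num_generated - 1, num_draft)
```

Returning `None` (rather than clamping to 0) leaves the acceptance-rate denominator untouched for these rare steps, matching the PR's "just don't record it" decision.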
-
#32712 [1/N] Initial Implementation of Parser for ResponsesAPI — frontend,ready — by qandrew (合并于: 2026-02-04 10:59 (UTC+8)) [💬12 | +772/-61, 10 files | commented:9 approved:1] ## Purpose
Initial Implementation of https://github.com/vllm-project/vllm/issues/32713. This PR introduces
- the concept of a Parser class to consolidate reasoning / tool calling, plus a `ParserManager`; the first example is the MinimaxM2 parser
- hooking the Parser into the Responses API
- parser manager lazy loads, similar to #28092
- No functional changes expected, just refactors.
Future goals
- hook up GPT-OSS to use parser …
- #33701 [Bugfix] Fix torchrun PP broadcast deadlock with async scheduling — bug,ready,v1 — by Isotr0py (合并于: 2026-02-04 10:17 (UTC+8)) [💬1 | +4/-7, 3 files | commented:4 approved:2]
## Purpose
- Following PR for #33650
- Fix root issue of https://github.com/vllm-project/vllm/issues/33533
`torchrun` with PP has already broadcast PP outputs during logits computation, so there is no need to broadcast sampled tokens again. https://github.com/vllm-project/vllm/blob/4bc913aeeca39a304e4ace51febf55f142c8c86e/vllm/v1/worker/gpu_model_runner.py#L3551-L3599
## Test Plan
VLLM_LOGGING_LEVEL=DEBUG PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py…
-
#33536 [Misc] Remove deprecated profiler environment variables — ready — by carlory (合并于: 2026-02-03 14:58 (UTC+8)) [+0/-95, 2 files | commented:2 approved:1] ## Purpose
Remove deprecated environment variables for profiler configuration that were scheduled for removal in v0.15.0. The deprecation warning stated:
“Using %s environment variable is deprecated and will be removed in v0.15.0 or v1.0.0, whichever is soonest.”
Since v0.15.0 has been released, it’s time to clean up these deprecated environment variables.
## Changes
…
-
#33060 [Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest — documentation,frontend,ready — by noooop (合并于: 2026-02-04 09:48 (UTC+8)) [💬7 | +432/-205, 8 files | commented:3 approved:1] ## Purpose
- let the rerank API and score API support `list[ScoreMultiModalParam]` as inputs
FIX #32378
## Test Plan
vllm/entrypoints/pooling/score/
…
-
#33674 [MM] Pass `prefix` parameter to `MMEncoderAttention` — ready,llama,qwen — by shen-shanshan (合并于: 2026-02-03 22:47 (UTC+8)) [+58/-11, 15 files | commented:5 approved:1] ## Purpose
Pass the `prefix` parameter to `MMEncoderAttention`. With this, we can get `layer_index` from `prefix` in OOT `MMEncoderAttention` ops. This is needed by the PR https://github.com/vllm-project/vllm-ascend/pull/6448 (for performance optimization).
## Test Plan
Just passing the `prefix` parameter to `MMEncoderAttention` has no effect on the current code. …
-
#33579 [Bugfix] Fix sparse MLA metadata building — bug,ready — by MatthewBonanni (合并于: 2026-02-04 07:29 (UTC+8)) [💬2 | +22/-31, 1 files | commented:10] ## Purpose Fix https://github.com/vllm-project/vllm/issues/33546 https://github.com/vllm-project/vllm/pull/33284 broke sparse MLA by moving logic from the backend to the layer without properly accounting for sparse backends.
## Test Plan
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 -ep
## Test Result …
-
#33351 [Dependency] Remove comments of ray in dependency files — documentation,rocm,ready,ci/build,v1,nvidia — by yewentao256 (合并于: 2026-02-04 07:30 (UTC+8)) [💬3 | +2/-2, 2 files | commented:9 changes:1] ## Purpose
~- vLLM v1 PP can run via the multiprocessing backend; Ray is only required when users explicitly choose the Ray executor backend. Keeping Ray as a default dependency on CUDA/ROCm causes confusion and unnecessarily pulls Ray into environments that don’t use it.~
~So this PR makes ray a optional dependency.~
vLLM v1 PP can run via the multiprocessing backend; Ray is only required when users explicitly choose the Ray executor backend. This PR removes the confusing comment in the depe…
-
#33613 [Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 — bug,ready,deepseek,nvidia — by mgoin (合并于: 2026-02-04 05:26 (UTC+8)) [💬3 | +43/-33, 5 files | approved:2 commented:2] ## Purpose
FIX https://github.com/vllm-project/vllm/issues/33543
Also dedupe shared logic for routing_logits and routing_bias casting
Opened a Flashinfer issue to track if the kernel ends up adding support for this https://github.com/flashinfer-ai/flashinfer/issues/2469
## Test Plan
…
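The dispatch condition in the PR title can be sketched as a predicate (hypothetical helper name; the real check lives in vLLM's FlashInfer/TRTLLM MoE path):

```python
def use_trtllm_fp8_moe(router_logits_dtype: str, routing_method: str) -> bool:
    # Fall back to the non-TRTLLM path for the unsupported combination
    # described in the PR title.
    if router_logits_dtype == "float32" and routing_method != "DeepSeekV3":
        return False
    return True
```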
- #33716 [Voxtral Realtime] Change name — documentation,ready,multi-modality — by patrickvonplaten (合并于: 2026-02-04 05:03 (UTC+8)) [💬1 | +6/-6, 4 files | commented:1 approved:1] Sorry naming changed a tiny bit
-
#33257 [MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 — bug,ready — by vadiklyutiy (合并于: 2026-02-04 04:10 (UTC+8)) [💬5 | +84/-118, 1 files | commented:8 approved:1] ### Summary
Enable tensor parallelism (TP > 1) for quantized hybrid Mamba models (e.g., Falcon-H1R-7B with FP8) when
`n_groups=1`.
### Root Cause
Custom weight loaders for group replication were only implemented for non-quantized layers. This PR extends support to quantized layers by leveraging the
`weight_loader` property on `ModelWeightParameter` (extends `BasevLLMParameter`).
### Test …
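The replicate-vs-shard decision the fix extends to quantized layers can be sketched with a toy list-based loader (hypothetical helper, assumes even divisibility; not the actual weight-loader code):

```python
def load_group_weight(weight_rows, tp_rank, tp_size, n_groups):
    if n_groups == 1:
        # A single group cannot be split across ranks: replicate the full
        # group weight on every TP rank.
        return weight_rows
    # Otherwise shard the rows evenly across ranks.
    shard = len(weight_rows) // tp_size
    return weight_rows[tp_rank * shard:(tp_rank + 1) * shard]
```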
-
#33377 [Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer — bug,ready,v1,kv-connector — by KuntaiDu (合并于: 2026-02-03 21:50 (UTC+8)) [💬9 | +6/-2, 1 files | commented:1 approved:1] ## Purpose
This PR fixes vLLM-side double-request-free bug in async scheduling + KV cache transfer + abort request.
Bug description: when the request is aborted, the same request may enter
`get_finished` twice and enter `_free_blocks` twice, resulting in a double free. Reason: existing logic handles request abort by setting the request to
`None` and then skipping `None` requests. However, in async KV cache transfer, the request won’t be set to `None` because it is still finalizing … -
#31541 Turn `@config` into a `dataclass_transform` — structured-output,frontend,speculative-decoding,ready,v1 — by hmellor (合并于: 2026-02-04 01:41 (UTC+8)) [💬3 | +155/-193, 32 files | commented:2 approved:1] Main changes:
- Converts the `config` decorator into a `dataclass_transform`
- Now we only have to decorate new configs with `config` instead of `config` and `dataclass`
- It also sets the default `ConfigDict` for this transformed dataclass to forbid extra fields (cc @mgoin who added this to some configs in `/vllm/config/compilation.py`)
Other changes:
- Update tests to use `config` alone
- `mypy` pre-commit hook upd…
- #33641 [torch.compile] Significantly speed up cold start times — ready — by zou3519 (合并于: 2026-02-04 01:12 (UTC+8)) [+41/-21, 3 files | commented:3 approved:1]
Previously, the vLLM-compile compilation:
- does a full dynamo trace to get a full graph
- splits the graph on attention ops into N subgraphs
- compiles each of the N subgraphs via standalone_compile.
- If any of the subgraphs are the same, we rely on aotautograd and inductor to cache-hit, but we still end up producing N artifacts total. Usually there are 3 unique graphs in vLLM models.
- vLLM stores a mapping of N subgraph indices to N compiled artifacts.
Taking control of this back into …
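The dedup-by-structure idea can be sketched with toy stand-ins for subgraphs and the compiler (`compile_subgraphs`, `key_fn`, and `compile_fn` are hypothetical names):

```python
def compile_subgraphs(subgraphs, compile_fn, key_fn):
    # Compile each structurally-unique subgraph once and reuse the artifact
    # for every duplicate, instead of producing N artifacts and relying on
    # downstream caches to deduplicate.
    artifacts = {}
    index_to_artifact = []
    for sg in subgraphs:
        key = key_fn(sg)
        if key not in artifacts:
            artifacts[key] = compile_fn(sg)
        index_to_artifact.append(artifacts[key])
    return index_to_artifact, len(artifacts)

# Toy stand-ins: 5 subgraphs but only 2 unique shapes, so only 2 "compiles"
# actually happen (the entry notes ~3 unique graphs in real vLLM models).
mapping, unique = compile_subgraphs(
    ["attn", "mlp", "attn", "mlp", "attn"], str.upper, lambda s: s
)
```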
-
#33673 Fix Gemma3n audio encoder for Transformers v5 — ready — by hmellor (合并于: 2026-02-03 21:49 (UTC+8)) [+9/-4, 1 files | commented:1 approved:1] https://github.com/huggingface/transformers/pull/42564 updated all the multimodal feature getting methods to return
`BaseModelOutputWithPooling` (which is a `dict`) by default. This is a welcome change as it standardises these methods across all models in Transformers.
However, this caused issues for Gemma3n in vLLM because it instantiates the `audio_tower` using `from_config` (therefore using the `forward` method from Transformers) and expects it to return a tuple. This PR handles the output …
-
#23465 [Attention][FA3] Update FA3 to include new swizzle optimization — documentation,ready,ci/build,unstale,v1 — by LucasWilkinson (合并于: 2026-02-04 00:08 (UTC+8)) [💬7 | +13/-3, 3 files | commented:1 approved:1] vLLM side of https://github.com/vllm-project/flash-attention/pull/82
meta-llama/Meta-Llama-3-8B-Instruct, 1xH100, 4k and 2k out
```
branch  rate  num_prompts  req/s  median TTFT (ms)  std TTFT  p99 TTFT  median TPOT (ms)  std TPOT  p99 TPOT
MAIN    1.00  120          0.88   141.95            29.83     287.21    11.07             …
```
-
#31034 [P/D] rework mooncake connector and introduce its bootstrap server — documentation,ready,v1,kv-connector — by dtcccc (合并于: 2026-02-04 00:08 (UTC+8)) [💬6 | +1324/-203, 9 files | commented:9 approved:1]
## Purpose
Rework the mooncake connector to achieve better performance and prepare for future features. Introduce a central bootstrap server on P.
Init phase: all P workers register their info (dp/tp/pp rank, zmq worker addr) with the bootstrap server. After all P workers have finished registering, proxy and D workers can query when they me…
-
#33699 [Bugfix] Fix startup hang for Granite Speech — bug,ready,multi-modality — by DarkLight1337 (合并于: 2026-02-03 23:57 (UTC+8)) [+8/-8, 1 files | commented:1 approved:2] ## Purpose
Fix the hanging models initialization test. Turns out even the processor initialization needs to be inside
`set_default_torch_num_threads` for certain models.
## Test Plan
pytest tests/models/test_initialization.py::test_can_initialize_large_subset[GraniteSpeechForConditionalGeneration]…
- #33597 [Minor] Some code simplification in `scheduler.py` — ready,v1 — by njhill (合并于: 2026-02-03 15:00 (UTC+8)) [+19/-29, 1 files | commented:1 approved:1] Noticed while working on other items. -
#33576 [Voxtral models] Skip warm-up to skip confusing error message in warm-up — frontend,ready — by patrickvonplaten (合并于: 2026-02-03 23:22 (UTC+8)) [+10/-3, 3 files | commented:1 approved:2] It seems that Transformers’ `cached_get_processor` has recently started throwing an error for MistralTokenizer backends.
Let’s skip the warm-up for now to not confuse the user.
``` (APIServer pid=1222019) INFO 02-02 18:08:45 [speech_to_text.py:144] Warming up audio preprocessing libraries… (APIServer pid=1222019) ERROR 02-02 18:08:46 [speech_to_text.py:184] Audio preprocessing warmup failed (non-fatal): %s. First request may experience higher latency. (APIServer pid=1222019) ERROR 02-02 18:08:46 [speech_t…
-
#33647 [Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. — bug,frontend,ready — by noooop (合并于: 2026-02-03 22:43 (UTC+8)) [+4/-1, 1 files | commented:1 approved:2]
## Purpose
Do not add extra \n for image-only cases when constructing multimodal text prompts.
address https://github.com/vllm-project/vllm/pull/33060#discussion_r2757045468
## Test Plan …
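The image-only rule can be sketched as follows (hypothetical helper and placeholder token; the actual prompt construction lives in vLLM's multimodal frontend):

```python
def build_mm_text_prompt(text: str, num_images: int,
                         image_token: str = "<image>") -> str:
    placeholders = image_token * num_images
    if not text:
        # Image-only case: return the placeholders without appending a
        # trailing "\n" separator.
        return placeholders
    return placeholders + "\n" + text
```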
-
#33345 Feat/add nemotron nano v3 tests — ready,ci/build — by shaharmor98 (合并于: 2026-02-03 21:52 (UTC+8)) [+54/-0, 6 files | commented:1 approved:1] This PR adds support and testing configurations for the NVIDIA Nemotron-3 Nano 30B models (both BF16 and FP8 variants) to the
`lm-eval-harness` CI pipeline.
## Test Plan
Verified locally using `pytest` with the `lm-eval-harness` test suite.
-
#33552 Document NixlConnector backend selection via kv_connector_extra_config — documentation,ready,kv-connector — by KrxGu (合并于: 2026-02-03 21:49 (UTC+8)) [💬3 | +29/-0, 1 files | commented:2 approved:1] ## Summary
Adds documentation for selecting NIXL transport backends through the existing
`--kv-transfer-config` parameter.
## Changes
- Added subsection “Selecting a NIXL transport backend (plugin)” to `docs/features/nixl_connector_usage.md`
- Included examples for both JSON and CLI dotted-key syntax
- Documented the `kv_connector_extra_config.backends` parameter with a LIBFABRIC example
- Added a note about backend availability depending on NIXL build configuration …
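Based on the parameters the PR documents, an invocation might look like this (the model name, `kv_role`, and backend value are illustrative assumptions):

```shell
# Select a NIXL transport backend via kv_connector_extra_config (JSON form).
vllm serve Qwen/Qwen3-8B \
  --kv-transfer-config \
  '{"kv_connector": "NixlConnector", "kv_role": "kv_both", "kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
```

Per the added note, whether a given backend is usable depends on how NIXL itself was built.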
-
#33636 [Models] Intern-S1-Pro — documentation,new-model,ready,qwen — by CUHKSZzxy (合并于: 2026-02-03 21:49 (UTC+8)) [💬3 | +942/-11, 11 files | commented:9 approved:1] ## Purpose Intern-S1-Pro model support.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33683 Fix Gemma3 GGUF for Transformers v5 — ready — by hmellor (合并于: 2026-02-03 20:36 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] https://github.com/huggingface/transformers/pull/41314 removed
`from_text_vision_configs` as it’s a long-deprecated method which serves no purpose (you can just use `__init__`). This PR updates the
`Gemma3Config` instantiation for GGUF models to use `__init__` so that they may continue to work with Transformers v5. -
#33682 Fix offline test for Transformers v5 — ready — by hmellor (合并于: 2026-02-03 20:07 (UTC+8)) [+7/-0, 1 files | commented:1 approved:1] As described in https://github.com/huggingface/transformers/blob/main/MIGRATION_GUIDE_V5.md#remote-code-incompatibility,
`tokenization_utils` and `tokenization_utils_fast` have been removed and are aliased for backward compatibility. The re-import logic in the offline tests is incompatible with these aliases, so we ensure that they are deleted during reloading so that they will be re-aliased on the next import.
-
#33644 [Bugfix] fix qwen3-asr response error — bug,ready,multi-modality,qwen — by jesse996 (合并于: 2026-02-03 19:33 (UTC+8)) [💬1 | +7/-6, 1 files | commented:5 approved:1]
## Purpose fix https://github.com/vllm-project/vllm/issues/33643 ## Test Plan
## Test Result
... -
#33656 [Misc] Update default image format of `encode_base64` — ready,multi-modality — by DarkLight1337 (合并于: 2026-02-03 19:13 (UTC+8)) [💬1 | +4/-18, 2 files | commented:1 approved:2]
## Purpose
Resolve the announcement of backwards-incompatible change.
Also clean up related code.
## Test Plan
…
-
#33620 [Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] for TRTLLM per-tensor FP8 MoE — bug,ready,nvidia — by mgoin (合并于: 2026-02-03 18:37 (UTC+8)) [+4/-2, 1 files | commented:1 approved:1] ## Purpose
FIX https://github.com/vllm-project/vllm/issues/33532
## Test Plan
## Test Result
Tested on B200
…
-
#33665 [Refactor] Clean up pooling serial utils — documentation,frontend,ready — by DarkLight1337 (合并于: 2026-02-03 18:29 (UTC+8)) [💬1 | +407/-322, 9 files | approved:1 commented:2]
## Purpose
- Use `pybase64`, which is faster than `base64`
- Consolidate the different attributes into a dataclass so we can have a single dictionary.
- Moved entrypoint-specific utils under entrypoints directory
## Test Plan
…
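The swap is a drop-in change at the call sites; this sketch uses the stdlib `base64` module so it stays dependency-free (`pybase64` exposes the same `b64encode` API):

```python
import base64

def encode_embedding_base64(raw: bytes) -> str:
    # In vLLM this call site would use pybase64.b64encode, which is a
    # faster drop-in replacement; the stdlib call is identical in shape.
    return base64.b64encode(raw).decode("ascii")
```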
-
#32609 [Frontend] Add sampling parameters to Responses API — frontend,ready — by DanielMe (合并于: 2026-02-03 13:51 (UTC+8)) [💬10 | +165/-2, 3 files | commented:3 approved:2] ## Purpose
Port essential sampling parameters from
`/v1/chat/completions` to the `/v1/responses` API to provide basic generation control for users of the Responses API. Added parameters:
- repetition control: `presence_penalty`, `frequency_penalty`, `repetition_penalty`
- sampling control: `min_p`, `seed`, `stop`, `ignore_eos`, `min_tokens`
- output control: `prompt_logprobs`, `spaces_between_special_tokens`, `include_stop_str_in_output`, `truncate_prompt_tokens` …
-
#33642 [Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token — bug,ready,deepseek — by l4b4r4b4b4 (合并于: 2026-02-03 16:35 (UTC+8)) [💬1 | +5/-4, 1 files | commented:4 approved:1] ## Summary
This PR fixes an issue where DeepSeek-OCR-2 models return empty responses when using the
`/v1/chat/completions` endpoint.
## Root Cause
DeepSeek-OCR-2 models have `model_type="deepseek_vl_v2"` but require a BOS token for proper inference. The existing mapping used `template_deepseek_vl2.jinja`, which does not include the BOS token, causing the model to produce no output.
## Fix
…
-
#33535 [Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable — ready,ci/build — by carlory (合并于: 2026-02-03 15:01 (UTC+8)) [+2/-47, 4 files | commented:1 approved:1] ## Purpose
Remove the deprecated
`VLLM_ALL2ALL_BACKEND` environment variable that was scheduled for removal in v0.15.0. The environment variable was deprecated in favor of the
`--all2all-backend` command-line argument. This PR completes the removal by:
- Removing the environment variable definition from `vllm/envs.py`
- Removing the deprecation warning and env var handling from `ParallelConfig`
- Updating the integration test script to use the `--all2all-backend` CLI argument instead
- Simplifyin…
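For users the migration is mechanical (the backend value shown is illustrative):

```shell
# Before (environment variable, now removed):
#   VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve <model>
# After (CLI argument):
vllm serve <model> --all2all-backend deepep_low_latency
```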
- #33619 Patch Protobuf for CVE 2026-0994 — ci/build — by zaristei2 (合并于: 2026-02-03 16:03 (UTC+8)) [💬1 | +2/-2, 2 files | commented:1] ## Purpose Fixes PB for CVE 2026-0994 for 0.15.1
-
#33621 Patch aiohttp for CVE-2025-69223 — rocm,ready,ci/build — by zaristei2 (合并于: 2026-02-03 16:02 (UTC+8)) [💬1 | +3/-3, 3 files | commented:1 approved:1] ## Purpose upgrade aiohttp to bypass CVE-2025-69223 for the 0.15.1 release. ## Test Plan
## Test Result
-
#33650 [CI/Build] Investigate torchrun distributed tests hanging issue — ready — by Isotr0py (合并于: 2026-02-03 15:49 (UTC+8)) [💬1 | +6/-0, 2 files | commented:1 approved:1]
## Purpose
- Fix #33533
- When enabling async scheduling,
`get_pp_group().send_tensor_dict(...)` will get stuck when PP is enabled: https://github.com/vllm-project/vllm/blob/b95cc5014dc7b260e5c70ae33d1b30c54d11306d/vllm/v1/worker/gpu_model_runner.py#L3582-L3586
## Test Plan ``` PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py …
-
#33571 [torch.compile] Document the workaround to standalone_compile failing — documentation,ready — by zou3519 (合并于: 2026-02-03 15:16 (UTC+8)) [💬2 | +28/-0, 2 files | commented:2 approved:1] ## Purpose InductorAdaptor will not fail if something is unserializable in the graph. Instead, it will save something to disk, and upon loading the artifact, there will be undefined behavior.
InductorStandaloneAdaptor is safer in that it will error out if something in the graph is unserializable. If you want the InductorAdaptor behavior (in a safer way), you can just disable the vLLM compilation cache.
## Test Plan & Test Result I tried a model that runs into the standalone_compile error and v…
-
#33635 [Bugfix] Interleaved thinking keeps compatibility with reasoning_content — bug,frontend,ready — by chaunceyjiang (合并于: 2026-02-03 14:46 (UTC+8)) [💬1 | +3/-0, 1 files | commented:2 approved:2] ## Purpose
FIX https://github.com/vllm-project/vllm/issues/33616
introduced by https://github.com/vllm-project/vllm/pull/33402 ## Test Plan
## Test Result
…
-
#33553 [CI/Build] Remove hardcoded America/Los_Angeles timezone from Dockerfiles — ready,ci/build — by carlory (合并于: 2026-02-03 14:32 (UTC+8)) [💬1 | +5/-15, 2 files | commented:1 approved:1] ## Purpose
Remove the
`debconf-set-selections` commands that hardcode the timezone to America/Los_Angeles in the Docker images. This was originally fixed in #12888 but was accidentally reintroduced in #15377. The container should use UTC by default, which is the standard for production environments. Users who need a specific timezone can set it via the
`TZ` environment variable or by mounting `/etc/localtime` and `/etc/timezone`. Fixes #33472
## Test Plan
…
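With the hardcoded default gone, users opt into a timezone at run time, for example (image tag and model flag are illustrative):

```shell
# The image now defaults to UTC; set a timezone explicitly if needed.
docker run -e TZ=America/Los_Angeles vllm/vllm-openai:latest --model <model>
```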
-
#33379 [XPU][1/N] Deprecate ipex and switch to vllm-xpu-kernels for xpu platform — ready,ci/build,v1 — by jikunshang (合并于: 2026-02-03 14:46 (UTC+8)) [💬2 | +144/-921, 18 files | commented:10] ## Purpose [1/N] https://github.com/vllm-project/vllm/issues/33214 Starting from this PR, we will switch to vllm-xpu-kernels-based kernel implementations for Intel GPU. This PR also upgrades dependencies to oneAPI 2025.3 and PyTorch 2.10 for the XPU platform.
## Test Plan
## Test Result
…
-
#32728 Fix quantized Falcon-H1 model loading issues — ready — by shengliangxu (合并于: 2026-02-03 14:31 (UTC+8)) [💬2 | +14/-2, 1 files | commented:2 approved:1] ## Purpose
This change fixes the loading of quantized Falcon-H1 model. The change is originally targeting the loading of Nvidia Model Optimizer quantized models but the issues are universal to quantized model from other quantization libraries.
Specifically:
- pass the quant_config to sub-modules
- allow remapping of the kv-cache scaling factors. For ModelOpt, we put the k_scale and v_scale under k_proj and v_proj, but the vLLM library seeks them under attention
…
-
#33634 [Bugfix] Fix mm budget setting for Qwen Omni models — bug,ready,multi-modality,qwen — by ywang96 (合并于: 2026-02-03 12:56 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1] ## Purpose If a model has hybrid-modality setting (e.g.,
`use_audio_in_video=True` for the Qwen Omni model series), multiple modalities can share the same placeholder in `MultiModalInputs`. This PR makes the modality checking more robust and adds a comment to indicate so. Fixes https://github.com/vllm-project/vllm/issues/33630
## Test Plan
## Test Result
…
-
#30329 [Feature][CPU Backend]: Optimize ARM vectorization backend — ready,cpu — by Radu2k (合并于: 2026-02-03 12:17 (UTC+8)) [💬20 | +552/-597, 5 files | commented:9 changes:1] ## Purpose Implement a vectorised backend based on `at::vec::Vectorized` for ARM, introduce high-level FP32/FP16/BF16 vector wrappers and add a new shared base class `VectorizedRegWrapper` capable of scaling with dtypes.
- Implement ARM vectorisation based on `at::vec::Vectorized<T>` for `float`, `c10::Half` and `c10::BFloat16`, with potential to extend to integer data types.
- Introduce `NxVectorizedTVecReg<N, T>` and `VectorizedRegWrapper<DerivedClassT, N, T>` to model N×`Vectorized` register gr...
-
#33624 [torch.compile] Don’t do the fast moe cold start optimization if there is speculative decoding — bug,ready — by zou3519 (合并于: 2026-02-03 11:38 (UTC+8)) [💬2 | +32/-1, 2 files | commented:1 approved:2] I’m also down to turn this optimization off by default too, just let me know.
I don’t have a machine to run deepseek v3.2 right now, so someone please test this
[Closed unmerged PRs]
-
#33754 > fix — documentation — by kebe7jun (关闭于: 2026-02-04 10:54 (UTC+8)) [💬2 | +38/-2, 3 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33742 Remove custom ops for TRTLLM block/tensor FP8 and BF16 MoEs — nvidia — by mgoin (关闭于: 2026-02-04 07:31 (UTC+8)) [💬1 | +19/-103, 5 files | commented:1]
## Purpose
There were comments that mentioned it was not understood why these needed to be wrapped in custom ops, so I tried unwrapping them.
## Test Plan
## Test Result
…
-
#33717 Add AgentMarket - Real Energy Data for AI Agents — documentation — by stromfee (关闭于: 2026-02-04 06:03 (UTC+8)) [💬3 | +2/-0, 1 files | commented:1] AgentMarket.cloud - B2A marketplace with 28M+ real energy records.
No electricity, no AI. ⚡
https://agentmarket.cloud
-
#31946 Feat/feat/support nemotron h mtp functional — new-model,rocm,speculative-decoding,needs-rebase,v1,cpu,nvidia — by shaharmor98 (关闭于: 2026-02-04 05:59 (UTC+8)) [💬12 | +1525/-139, 27 files | commented:9 changes:1]
## Purpose
This PR adds Multi-Token Prediction (MTP) speculative decoding support for NemotronH hybrid Mamba2/Attention models.
### Key Changes #### New NemotronH MTP Model (nemotron_h_mtp.py)
- Introduced `NemotronHMTP` model class for MTP-based speculative decoding …
-
#32205 [CI/Build] Fix ffmpeg CVEs — ci/build — by junpuf (关闭于: 2026-02-04 03:55 (UTC+8)) [💬2 | +23/-1, 1 files | commented:7 approved:1] ## Purpose
The current vLLM Docker images contain 28 security vulnerabilities in FFmpeg and related libraries, including 3 Critical and 11 High severity CVEs. These vulnerabilities stem from using Ubuntu’s universe repository version of FFmpeg (4.4.2), which requires Ubuntu Pro subscription for security patches.
The vulnerabilities exist because:
- Ubuntu Universe Repository: vLLM uses FFmpeg from Ubuntu’s universe repository
- No Free Security Updates: Ubuntu universe p…
-
#33721 merge — frontend — by NilsHellwig (关闭于: 2026-02-04 03:24 (UTC+8)) [+121/-30, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33707 Onboard voyage-4-nano — new-model,qwen — by chengchengpei (关闭于: 2026-02-04 02:58 (UTC+8)) [💬5 | +140/-0, 2 files | commented:1] ## Purpose
Onboard voyage-4-nano to vllm
## Test Plan
Run ``` from vllm import LLM from vllm.config import PoolerConfig …
-
#33125 [Core] Optimize SWA KV cache management for prefix caching — v1 — by jaewonlee-fb (关闭于: 2026-02-04 00:42 (UTC+8)) [💬4 | +128/-6, 5 files | commented:1] ## Summary
Optimize sliding window attention KV cache management to improve prefix cache hit rates and memory efficiency.
… -
#33691 [EC Connector] SHMConnector: Share Memory based EC Connector for EPD — documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,llama — by PiratePai (关闭于: 2026-02-04 00:36 (UTC+8)) [💬6 | +6308/-3302, 187 files | commented:2] ## Purpose
This PR introduces SHMConnector, a new implementation of the ECConnector that leverages Shared Memory (SHM) and ZMQ to facilitate low-latency transfer of encoder caches between Producer and Consumer processes.
Key Changes:
- Shared Memory Transport: Uses torch.multiprocessing.reductions.reduce_tensor to create shared memory handles, allowing zero-copy-like transfer of tensors between processes.
- ZMQ Control Plane: Implements a REQ/ROUTER pattern via ZMQ to reliably broadcast meta…
-
#33265 [Feat] SHMConnector: Share Memory based EC Connector for EPD — documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by PiratePai (关闭于: 2026-02-03 22:47 (UTC+8)) [💬20 | +36748/-16606, 727 files | commented:2] ## Purpose
This PR introduces SHMConnector, a new implementation of the ECConnector that leverages Shared Memory (SHM) and ZMQ to facilitate low-latency transfer of encoder caches between Producer and Consumer processes.
Key Changes:
- Shared Memory Transport: Uses torch.multiprocessing.reductions.reduce_tensor to create shared memory handles, allowing zero-copy-like transfer of tensors between processes.
- ZMQ Control Plane: Implements a REQ/ROUTER pattern via ZMQ to reliably broadcast meta…
-
#31603 Add support for ModelOpt MXFP8 models — documentation,needs-rebase — by danisereb (关闭于: 2026-02-03 22:03 (UTC+8)) [💬5 | +391/-8, 6 files | commented:2 | 📝草稿] ## Purpose
Add support for ModelOpt MXFP8 models.
## Test Plan
Test a model that was converted to MXFP8 using ModelOpt. https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B
## Test Result …
-
#24912 [HMA] Enabling Skipping SWA for Fp8 Quant — needs-rebase,v1,gpt-oss — by jmkuebler (关闭于: 2026-02-03 21:04 (UTC+8)) [💬7 | +308/-24, 7 files | commented:1 | 📝草稿] [WIP] see https://github.com/vllm-project/vllm/issues/24916 for context
### Known limitations
- ~Currently not compatible w/ prefix caching because we use two different block sizes for SW and Full Attention layers.~
- [Update 09/17] We will probably defer most of the KV cache modifications to https://github.com/vllm-project/vllm/pull/24949 which also enables prefix caching.
## Purpose
This PR adds a flag to skip the sliding window layers when quantizing the attention. The idea is that we can…
-
#33676 [CI/Bug]Fix multimodal encoder budget KeyError for shared audio placeholders — bug,multi-modality — by pacoxu (关闭于: 2026-02-03 17:45 (UTC+8)) [💬2 | +2/-2, 1 files | commented:1] ## Purpose
fix a keyerror in CI https://buildkite.com/vllm/ci/builds/49647/steps/canvas?jid=019c205f-dc84-45ec-949c-ea4fc62d9a54
- the CI is a main branch CI failure.
```
[2026-02-02T22:40:50Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/input_processor.py:79: in init – [2026-02-02T22:40:50Z] mm_budget = MultiModalBudget(vllm_config, mm_registry) …
-
#33036 [Model] Fix Qwen3-VL load_weights to skip lm_head when tie_word_embeddings is True — ready,qwen — by jeremyteboul (关闭于: 2026-02-03 13:19 (UTC+8)) [💬19 | +11/-1, 1 files | commented:1 approved:1] Purpose Fix weight loading issue for Qwen3-VL models that have tie_word_embeddings=True in their configuration.
When tie_word_embeddings is enabled, the lm_head weights are tied to embed_tokens and should not be loaded separately from the checkpoint. Without this fix, the AutoWeightsLoader would attempt to load weights for language_model.lm_head.* that don’t exist in the checkpoint, causing loading failures.
Test Plan Pre-commit validation:
pre-commit run --files vllm/model_executor/models/qw…
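The skip rule can be sketched as a name filter (hypothetical helper; in vLLM this is done by passing skip prefixes to `AutoWeightsLoader`):

```python
def filter_checkpoint_weights(names, tie_word_embeddings):
    # When embeddings are tied, lm_head shares its weights with
    # embed_tokens and has no separate checkpoint entry, so skip it
    # during loading instead of failing on the missing tensor.
    if not tie_word_embeddings:
        return list(names)
    return [n for n in names if not n.startswith("language_model.lm_head.")]
```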
-
#33615 [Bugfix] ensure reasoning_content exists if the chat template needs it — bug — by koush (关闭于: 2026-02-03 12:43 (UTC+8)) [💬4 | +289/-2, 2 files | commented:3] ## Purpose
Fix regression introduced by https://github.com/vllm-project/vllm/pull/33402
The removal of reasoning_content was intended for the chat completion endpoint, but it had the downstream effect of also breaking chat templates that expect reasoning_content (like GLM and Kimi K2).
This bug affects models like Kimi K2.5 and GLM in that their reasoning content becomes unavailable during subsequent messages resulting in reduced model performance.
GLM has chat template kwargs that p…
-
#33569 OffloadingConnector: Optimize the redundant block touches — kv-connector — by Zhao233 (关闭于: 2026-02-03 12:39 (UTC+8)) [💬4 | +9/-2, 1 files | commented:1]
## Purpose During the iterations in
`_get_reqs_to_store`, duplicate blocks may exist across various requests, and it is unnecessary to repeatedly “touch” these blocks. This PR tracks these blocks, sorts them by the number of duplicates, and then touches all of them at once. It also brings some benefits for the LRU Manager.
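The dedup-and-sort step can be sketched with a `Counter` (hypothetical helper name, not the connector's actual code):

```python
from collections import Counter

def blocks_to_touch(requests_blocks):
    # Count how often each block appears across all requests, then touch
    # each unique block exactly once, most-referenced first (which also
    # gives the LRU manager a sensible recency ordering).
    counts = Counter(b for blocks in requests_blocks for b in blocks)
    return [block for block, _ in counts.most_common()]
```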