[vLLM GitHub Development Digest] 2026-02-10
[Overview]
- Time window: 2026-02-10 11:42 (UTC+8) ~ 2026-02-11 11:42 (UTC+8)
- New issues: 33 (label distribution: bug:14, feature request:7, model-bash:7, installation:2, usage:2)
- Closed issues: 30
- New PRs: 69 (label distribution: ready:27, v1:15, ci/build:12, bug:11, frontend:8)
- Merged PRs: 57
- PRs closed without merging: 18
[New issues]
- #34305 [Bug]: — bug — by dhineshkumar-r (created: 2026-02-11 11:42 (UTC+8)) — body is the unfilled issue template.
- #34229 [Bug]: [CPU Backend] Whisper W8A8 CPU utilization very low on Arm CPU — bug,cpu — by fadara01 (created: 2026-02-10 18:43 (UTC+8)) [💬5]
- #34224 [Installation]: vllm 0.15.1 pip install still relies on torch 2.9.1 — installation — by chamwen (created: 2026-02-10 17:21 (UTC+8)) [💬3] — environment: torch 2.9.1, torchaudio 2.9.1, torchvision 0.24.1, tqdm 4.67.3, transformers 4.57.6, triton 3.5.1, typer 0.21.1 …
- #34286 [Model Bash]: DeepSeek R1 NVFP4 Low Latency (B=1) — feature request,model-bash — by robertgshaw2-redhat (created: 2026-02-11 07:37 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
  Based on profiling by @alexm-redhat, there are a few optimizations for low-latency serving that we could use:
  - Turn on AR fusion by default (@ProExpertProg is working on this)
  - PDL for AR+RMS + the first attention GEMM (as done by sgl)
  - PDL for AR+RMS + the router GEMM (as done by sgl)
  …
- #34303 [RFC]: CUDA Checkpoint/Restore for Near-Zero Cold Starts — RFC — by elizabetht (created: 2026-02-11 10:56 (UTC+8)) ### Motivation
  Issue: vllm-project/vllm#33930 Scope: v1 engine only Status: Draft
  Cold start latency is a major barrier for several important use cases:
  …
- #34300 [Model Bash][DeepSeek]: Remove Logits Casts in DSR1 NVFP4 TRTLLM — model-bash — by robertgshaw2-redhat (created: 2026-02-11 10:08 (UTC+8)) — see main issue.
- #34288 [Model Bash][DeepSeek]: Remove Bias Casts in DSR1 NVFP4 TRTLLM — feature request,model-bash — by robertgshaw2-redhat (created: 2026-02-11 07:43 (UTC+8)) [💬1] — see main issue for more details.
- #34297 [Feature]: User-specified expert→EP-rank placement file for MoE expert parallelism — feature request — by clusteroptimizerengine (created: 2026-02-11 09:24 (UTC+8)) ### 🚀 The feature, motivation and pitch
  I'm working with MoE expert-parallel (EP) deployments and would like to propose a small, optional extension that lets users supply a static expert→EP-rank placement.
  Motivation: for some EP deployments, especially topology-aware or locality-sensitive setups, it's useful to control which experts are assigned to which EP ranks in a stable, user-defined way (e.g. to reduce unnecessary cross-rank communication or variance). Today, placemen…
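A hypothetical sketch of what such a static placement file and its validation could look like. The JSON format and the `load_placement` helper are illustrative assumptions for this digest, not a proposed vLLM API:

```python
import json

def load_placement(path: str, num_experts: int, ep_size: int) -> list[int]:
    """Return placement[expert_id] = EP rank, validated for this deployment.

    Assumed file format: a JSON object mapping expert id (as string) to EP rank,
    e.g. {"0": 0, "1": 0, "2": 1, ...} (an illustrative convention, not vLLM's).
    """
    with open(path) as f:
        table = json.load(f)
    # Raises KeyError if any expert is left unmapped in the file.
    ranks = [table[str(e)] for e in range(num_experts)]
    if not all(0 <= r < ep_size for r in ranks):
        raise ValueError("placement assigns an expert to an out-of-range EP rank")
    return ranks
```

A static table like this would replace the default (e.g. round-robin) expert assignment only when explicitly supplied, keeping the feature opt-in.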
- #34296 [Usage]: vllm/vllm-openai:v0.15.1 reports "No CUDA GPUs are available"; 0.10.1.1 is OK — usage — by ooodwbooo (created: 2026-02-11 09:09 (UTC+8)) — WSL nvidia-smi output: NVIDIA-SMI 575.51.02, Driver Version 576.02, CUDA Version 12.9 …
- #34295 [Bug]: — bug — by dpankros (created: 2026-02-11 08:49 (UTC+8)) — environment: Ubuntu 24.04.3 LTS (aarch64) …
- #34283 [CI Failure]: Multi Modal Models (Extended) 1 — ci-failure — by varun-sundar-rabindranath (created: 2026-02-11 07:23 (UTC+8)) [💬1] ### Name of failing test
  models/multimodal/generation/test_voxtral.py::test_online_serving
  ### Basic information
  - Flaky test
  - Can reproduce locally
  - Caused by external libraries (e.g. bug in transformers)
  …
- #34257 [Bug]: Grok 2 model not working — bug — by jiangwu300 (created: 2026-02-11 01:08 (UTC+8)) [💬2] — environment: vllm 0.14.0. Serving the Grok 2 model breaks under vLLM 0.14.0+. Running with `vllm serve xai-org/grok-2 --tensor_parallel_size 8`…
- #34290 [Model Bash][DeepSeek]: Remove Clone From Shared Expert Stream — model-bash — by robertgshaw2-redhat (created: 2026-02-11 07:44 (UTC+8)) — investigate why the clone is needed and remove it when possible.
- #34289 [Model Bash][DeepSeek]: PDL for AR+RMS and First Attention GEMM — model-bash — by robertgshaw2-redhat (created: 2026-02-11 07:44 (UTC+8)) — see main issue for more details.
- #34287 [Model Bash][DeepSeek]: PDL for ARRMS + RouterGEMM — feature request,model-bash — by robertgshaw2-redhat (created: 2026-02-11 07:42 (UTC+8)) — see main issue for more details.
- #34284 [CI Failure]: Metrics - Tracing Test — ci-failure — by varun-sundar-rabindranath (created: 2026-02-11 07:28 (UTC+8)) ### Name of failing test
  pytest -v -s v1/tracing
  ### Basic information
  - Flaky test
  - Can reproduce locally
  - Caused by external libraries (e.g. bug in transformers)
  …
- #34277 [Bug]: P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd architecture due to assign_request_id() random suffix — bug — by shwgao (created: 2026-02-11 05:42 (UTC+8)) [💬1]
- #34252 [Bug]: Multi-GPU NCCL initialization fails with Cuda failure 700 'an illegal memory access was encountered' on NVIDIA B200 GPUs — bug — by venishpatidar (created: 2026-02-11 00:28 (UTC+8)) [💬2] — environment: 8x NVIDIA B200, vLLM latest (pip-installed), Python 3.12, CUDA Toolkit 13.1, NVIDIA driver 590, guest OS Ubuntu 24.04, host OS Ubuntu 25.10, running inside a VM following the NVIDIA TDX Deployment Guide (https://docs.nvidia.com/cc-deployment-guide-tdx.pdf) …
- #34256 [Model Performance SIG]: Improve MoE Oracle — feature request,rocm,model-bash — by robertgshaw2-redhat (created: 2026-02-11 00:49 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
  We recently created a programmatic interface for selecting NVFP4 and FP8 MoE kernels.
  Currently there is a single "ordering" of kernels for all hardware, device configs, etc. We should make this oracle smarter by having different defaults for different devices / models.
  We should do this for:
  - DeepSeekV3
  - Qwen3MoE
  - GLM …
- #34249 [Bug]: [Fp8] [MoE] 'FLASHINFER_CUTLASS' is auto-selected as MoE backend instead of 'DEEPGEMM' on Hopper — bug — by cjackal (created: 2026-02-10 23:27 (UTC+8)) [💬4]
- #34259 [Bug]: gpt-oss triton_kernels crashes during startup with EP enabled (sm90 default) — bug — by mgoin (created: 2026-02-11 01:21 (UTC+8)) [💬1]
- #34250 [Bug]: using vllm on Qwen3-Omni-30B-A3B-Instruct: Failed to apply prompt replacement for mm_items['audio'][0] — bug — by katie312 (created: 2026-02-10 23:47 (UTC+8)) [💬1] — environment: Linux, Python 3.10.0, CUDA 12.8, accelerate==1.12.0 …
- #34258 [Installation]: cannot install deepep — installation — by ZJY0516 (created: 2026-02-11 01:10 (UTC+8)) — environment: Alibaba Cloud Linux release 3 (Soaring Falcon) (x86_64), GCC 10.2.1 20200825 (Alibaba 10.2.1-3.8 2.32), Clang 15.0.7 (15.0.7-1.0.3.al8) …
- #34242 [Bug]: Failed to apply Glm46VProcessor on data — bug — by regmibijay (created: 2026-02-10 21:30 (UTC+8)) [💬2]
- #34234 [Feature]: Enable CUDA graph capture for Eagle speculator prefill — feature request — by HamzaElshafie (created: 2026-02-10 19:30 (UTC+8)) [💬1] ## 🚀 The feature, motivation and pitch
  Eagle's speculator currently runs prefill in eager mode, while decode can use CUDA graphs.
  There is an explicit TODO in the codebase to support CUDA graphs for the prefill path. I'm interested in working on this as my first contribution and wanted to check alignment before starting any implementation.
  Currently, the Eagle speculator contains the following TODO:
  - Location: vllm/v1/worker/gpu/spec_decode/eagle.py (around line 229) - **Current state…
- #34237 [Bug]: Responses API with Harmony is broken for structured content inputs — bug — by Kimahriman (created: 2026-02-10 19:33 (UTC+8)) — environment: "Can't copy from my environment right now" …
- #34232 [Feature]: Qwen3 realtime asr — feature request — by oktrained (created: 2026-02-10 19:04 (UTC+8)) ### 🚀 The feature, motivation and pitch
  Will Qwen3 ASR 0.6B/1.7B get support for a realtime endpoint, similar to Mistral's Voxtral?
- #34231 [Performance]: Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance — by RocMarshal (created: 2026-02-10 19:04 (UTC+8))
- #34225 [Bug]: `maybe_serialize_tool_calls()` fails to verify the `tool_calls` type — bug — by elklepo (created: 2026-02-10 17:37 (UTC+8))
- #34212 [Performance]: W4Afp8 is slower than FP8-W8A8 — performance — by liudl85 (created: 2026-02-10 15:22 (UTC+8)) ### Report of performance regression
  We recently tested the new W4Afp8 quantization method on a Qwen-VL model and observed a performance regression compared to FP8-W8A8 dynamic quantization.
  Specifically, W4Afp8 delivers about 10% slower end-to-end inference latency with fixed-length input and output and the same batch size (input tokens + output tokens < 4k).
  …
- #34210 [Bug]: Enable DBO on Qwen3-VL-235B-A22B raises TypeError: 'NoneType' object is not subscriptable — bug — by Ethan-yt (created: 2026-02-10 14:49 (UTC+8))
- #34209 [Usage]: How to decrease Encoder cache budget from 16384 tokens to something lower? — usage — by DmitryFX (created: 2026-02-10 13:36 (UTC+8)) [💬7] — environment: Ubuntu 22.04.5 LTS (x86_64), GCC 11.4.0 …
- #34205 [Bug]: Setting env ROCP_TOOL_ATTACH=1 causes the vllm server to stop — bug,rocm — by BigFaceBoy (created: 2026-02-10 11:50 (UTC+8)) [💬1]
[Closed issues]
- #33883 [Bug]: DeepSeek-V3.2 NVFP4 with fp8 kvcache reports `src_cache must be uint8` — bug — by kebe7jun (closed: 2026-02-11 11:31 (UTC+8)) — environment: GB200, main branch.
- #17650 [Bug]: the throughput of qwen3moe is low for prompts above 2000 tokens — bug,stale — by aabbccddwasd (closed: 2026-02-11 10:18 (UTC+8)) [💬20] — environment: platform cuda, PyTorch 2.6.0+cu124 …
- #18357 [Bug]: TypeError When Processing Canceled Requests with Tracing Enabled — bug,stale — by Hotckiss (closed: 2026-02-11 10:18 (UTC+8)) [💬6] — environment: PyTorch 2.6.0+cu124, CUDA 12.4 …
- #19865 [Bug]: NCCL issues when running vllm v0.9.1 for the Deepseek-R1 model [B200 GPU] — bug,stale — by haic0 (closed: 2026-02-11 10:18 (UTC+8)) [💬4] — environment: NVIDIA B200, Docker image vllm/vllm-openai:v0.9.1, model deepseek-ai/DeepSeek-R1, CUDA 12.8, Driver Version 570.133.20. Command: `VLLM_USE_V1=1 vllm serve /models/DeepSeek-R1 --tensor-parallel-size 8 --disable-log-requests --trust-remote-code`…
- #20567 [Usage]: --speculative-config is not compatible with "response_format":{"type": "json_object"} — usage,stale — by baishiliushu (closed: 2026-02-11 10:18 (UTC+8)) [💬7] — environment: platform cuda, vLLM API server version 0.8.5.post1.
  SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2000, min_tokens=0, logprobs=None, prompt_logprobs=Non…
- #21529 [Bug]: Incorrect Generation for Qwen2.5-VL-7B-Instruct in Batch Mode — bug,stale — by movchan74 (closed: 2026-02-11 10:17 (UTC+8)) [💬5]
- #22616 [Bug]: Qwen3 Moe is extremely slow on high context — bug,stale — by aabbccddwasd (closed: 2026-02-11 10:17 (UTC+8)) [💬10] — environment: Ubuntu 24.04.2 LTS (x86_64) …
- #23452 [CI]: Abort entire CI job if pre-commit fails — ci/build,stale — by njhill (closed: 2026-02-11 10:17 (UTC+8)) [💬9] Tests will need to be re-run regardless in this case, so it's better to fail and have the PR author fix linting issues than to proceed to run the whole job.
  It would be good to still run pre-commit in parallel so the total running time isn't impacted, just abort other tests (if possible).
- #23456 [CI]: Replace use of models with smaller models where possible — ci/build,stale — by njhill (closed: 2026-02-11 10:17 (UTC+8)) [💬8]
  - Many tests use an arbitrary model; use a small one such as opt-125m (or smaller) in these cases.
  - If an arbitrary MoE model is required, ibm-research/PowerMoE-3b can be used (or maybe we can identify an even smaller one).
  - Ensure `--enforce-eager` is included in server startup args for short-running tests that aren't testing cuda graphs specifically.
  This can reduce the overall running time significantly.
- #24682 [CI]: Investigate/fix cases of CI tests not running on latest code from PR branch — bug,ci/build,stale — by njhill (closed: 2026-02-11 10:17 (UTC+8)) [💬7] We've seen a couple of recent cases where PR changes fail on main after being merged, but the same test that ran on the PR's CI prior to merge passed. It appears that the code the passing test ran on wasn't the latest code from the PR branch.
  This is separate from the issue of test scoping; in these cases the test in question did run and passed.
  It also shouldn't be a consequence of the use of `USE_PRECOMPILED_WHEELS`, because the code differences in question were just changes to python fil…
- #24748 [New Model]: KOSMOS2.5 IMPLEMENTATION — new-model,stale — by contactsingulary (closed: 2026-02-11 10:17 (UTC+8)) [💬7] ### 🚀 The feature, motivation and pitch
  KOSMOS-2.5 has recently been implemented in the transformers library. It would be great to have support for it in vLLM as well. This would allow leveraging vLLM's optimized inference capabilities for KOSMOS-2.5, especially for multimodal tasks where performance and efficiency are crucial.
  Adding this integration would enable the community to use KOSMOS-2.5 at scale with lower latency and memory efficiency, similar to other models already supported by v…
- #26298 [RFC]: Per-request metrics for the offline API — RFC,stale — by maxdebayser (closed: 2026-02-11 10:16 (UTC+8)) [💬9] ### Motivation
  In V0, when calling `LLM.generate()`, the `metrics` field of the `RequestOutput` object was set to a `RequestMetrics` object:
  ```python
  @dataclass
  class RequestMetrics:
      """Metrics associated with a request.

      Attributes: ...
  ```
- #26387 [Feature]: vllm-flash-attn cutlass support for blackwell — feature request,stale — by tongjin0521 (closed: 2026-02-11 10:16 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
  The main flash-attention repository recently introduced a CUTLASS-based implementation optimized for NVIDIA B200 GPUs (see issue #1741 and the interface in flash_attn/cute/interface.py).
  From my experience, the CUTLASS attention kernel runs about 3.6× faster than the standard implementation on B200 f…
- #26399 [Bug]: basic reasoning parser has bug — bug,stale — by heyzude (closed: 2026-02-11 10:16 (UTC+8)) [💬5] — environment: Ubuntu 22.04 docker container with latest vllm. "This line is a bug. Currently it's …"
- #26679 [Feature]: Ensure output consistency when using LoRA with Eagle3 Speculative Decoding — feature request,stale — by hukongyi (closed: 2026-02-11 10:16 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
  Currently, when I enable both LoRA and Eagle3 in vLLM, the model runs without crashing, but the generated output is inconsistent with the output from using only the LoRA-adapted model. My setup: a LoRA adapter fine-tuned on the base model, and an Eagle3 drafter model trained on the base model + LoRA combination.
  When I run inference with --enable-lora and --speculative_config for Eagle3, the generated text differs from the output when …
- #26692 [Usage]: How to release KVCache? — usage,stale — by shenxf1205 (closed: 2026-02-11 10:16 (UTC+8)) [💬2] — environment: Ubuntu 22.04 LTS (x86_64), GCC 11.4.0 …
- #26716 [CI Failure]: vllm/tests/compile/test_basic_correctness.py::test_compile_correctness[test_setting5] — stale,ci-failure — by morrison-turnansky (closed: 2026-02-11 10:16 (UTC+8)) [💬2] ### Name of failing test
  vllm/tests/compile/test_basic_correctness.py::test_compile_correctness[test_setting5]
  ### Basic information
  - Flaky test
  - Can reproduce locally
  - Caused by external libraries (e.g. bug in transformers)
  …
- #34076 [Bug]: KV Cache Memory Bottleneck Calculation in Pipeline Parallel (_check_enough_kv_cache_memory in get_kv_cache_configs) — bug — by chizhiwei (closed: 2026-02-11 08:23 (UTC+8)) [💬3] — environment: GPU 0: NVIDIA GeForce RTX 5070 Ti (16303 MiB), GPU 1: NVIDIA GeForce RTX 5090 D (32607 MiB), vllm-0.15.2rc1.dev93+g11a4c9d30.cu128-cp312-cp312-linux_x86_64.whl. I am trying to deploy a vLLM service locally for my OpenClaw backend, so I need a sufficiently long context length. …
- #27194 [Bug]: vLLM crashes (EngineDeadError) during high-concurrency benchmark — bug — by wizche (closed: 2026-02-11 06:57 (UTC+8)) [💬5] — running docker image vllm-openai:v0.10.2 on 8xB200 with `VLLM_USE_DEEP_GEMM=1` and non-default args: model_tag 's3://Qwen3-Coder-480B-A35B-Instruct-FP8', host '0.0.0.0', enable_auto_tool_choice True, tool_call_parser 'qwen3_coder', max_model_len 131072, served_model_name ['Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8'], load_format 'runai…
- #27441 [Bug]: vllm/v1/core/sched/scheduler.py: Unintended reordering of requests during scheduling — bug — by dongha-yoon (closed: 2026-02-11 05:55 (UTC+8)) [💬2] — this error is independent of the environment. …
- #32057 [Feature][DSR1 NVFP4 Model Bash]: FlashInfer Quantize Op — feature request — by robertgshaw2-redhat (closed: 2026-02-11 02:57 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
  We currently run a flashinfer quantize op (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py#L290-L294), which is much slower than the vllm native op.
  …
- #20711 [Feature]: Use `QuantFp8` `CustomOp`-abstraction for MoE layers — good first issue,feature request,torch.compile,keep-open — by ProExpertProg (closed: 2026-02-11 01:14 (UTC+8)) [💬15] ### 🚀 The feature, motivation and pitch
  #19830 added `QuantFp8`, which uses the `CustomOp` abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The use to be…
- #25877 Tracking Issue: DeepSeek V3.2 support — feature request,stale — by youkaichao (closed: 2026-02-11 00:57 (UTC+8)) [💬30] The main PR https://github.com/vllm-project/vllm/pull/25896 has been merged, and the nightly wheel can be used.
  Following the recipe, you should be able to run it. Recipe: https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html
- #32105 [Bug]: Pipeline parallelism cannot share kvcache across different GPUs — bug — by zhcn000000 (closed: 2026-02-10 23:46 (UTC+8)) [💬13] — environment: vllm commit 2a4dbe24eadcb8e0354e47f608b53399aec52c4, NVIDIA RTX 4090. Is this a bug? …
- #34242 [Bug]: Failed to apply Glm46VProcessor on data — bug — by regmibijay (closed: 2026-02-10 23:06 (UTC+8)) [💬2]
- #32469 [Bug]: Error occurs when using Eagle3: Encoder cache miss for {mm_hash} — bug — by Ericoool9614 (closed: 2026-02-10 21:05 (UTC+8)) [💬1]
- #34180 [Bug]: fastsafetensors distributed loading crashes due to unsorted file list — bug — by jaim12005 (closed: 2026-02-10 15:06 (UTC+8)) [💬4] — environment: vLLM 0.15.2rc1.dev135+g285bab475, PyTorch with CUDA 13.1, NCCL 2.28.9+cuda13.0, 2x NVIDIA DGX Spark (GB10, aarch64), TP=2 across nodes via InfiniBand, DGX OS 7.4.0, kernel 6.17.0-1008-nvidia. …
- #33450 [Bug]: Attention Assertion — bug — by robertgshaw2-redhat (closed: 2026-02-10 13:29 (UTC+8)) — environment: B200. Worker error: `assert num_decode_tokens % num_decodes == 0` — AssertionError: TRTLLM decode requires uniform query lengths per request. …
- #32626 [Bug]: TRTLLM Attention Failure with DP/EP — bug — by robertgshaw2-redhat (closed: 2026-02-10 13:29 (UTC+8)) [💬2]
- #28726 [Bug]: Unbounded CPU Memory Growth When Using Prefix Caching — bug — by StMarou (closed: 2026-02-10 13:03 (UTC+8)) [💬14]
[New PRs]
- #34302 [MoE] Add DeepSeek V3 router GEMM kernel from sglang — ci/build,deepseek — by robertgshaw2-redhat (created: 2026-02-11 10:23 (UTC+8)) [💬1 | +886/-3, 9 files | commented:1] Port the optimized router GEMM kernel from sglang's sgl-kernel for DeepSeek V3 MoE models. The kernel is specifically optimized for the small batch sizes (1-16 tokens) common in the decode phase.
  Key features:
  - Computes output = mat_a @ mat_b.T for MoE routing
  - Supports bfloat16 input with float32 or bfloat16 output
  - Optimized for DSV3 dimensions: hidden_dim=7168, num_experts={256,384}
  - Requires SM90+ (Hopper) GPUs and CUDA 12.0+
  - Supports Programmatic Dependent Launch (PDL) via TRTLLM_ENABLE_PDL…
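For reference, the computation the fused kernel performs is just output = mat_a @ mat_b.T over the router weight. A minimal NumPy sketch with DSV3-shaped inputs (NumPy here stands in for the hand-written bfloat16 CUDA kernel, so this shows only the math, not the optimization):

```python
import numpy as np

def router_gemm_reference(mat_a: np.ndarray, mat_b: np.ndarray) -> np.ndarray:
    """mat_a: [num_tokens, hidden_dim] activations,
    mat_b: [num_experts, hidden_dim] router weight.
    Returns router logits of shape [num_tokens, num_experts]."""
    return mat_a @ mat_b.T

# DSV3-shaped example: a 4-token decode batch, hidden_dim=7168, 256 routed experts
logits = router_gemm_reference(
    np.random.rand(4, 7168).astype(np.float32),
    np.random.rand(256, 7168).astype(np.float32),
)
```

The per-token logits are what MoE top-k routing then selects experts from; the kernel's small-batch specialization targets exactly this tall-and-skinny GEMM shape.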
- #34260 fix: Prefer DeepGEMM over FlashInfer for FP8 MoE on Hopper GPUs — needs-rebase — by elizabetht (created: 2026-02-11 01:35 (UTC+8)) [💬3 | +320/-12, 2 files | commented:3] On Hopper (SM90), FLASHINFER_CUTLASS was auto-selected as the FP8 MoE backend over DEEPGEMM because it has higher priority in the selection order. FlashInfer kernels are primarily optimized for Blackwell and do not work well with features like chunked prefill on Hopper.
  This change makes the backend priority order architecture-aware:
  - Blackwell (SM100+): FlashInfer backends preferred over DeepGEMM
  - Hopper and older: DeepGEMM preferred over FlashInfer backends
  …
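The architecture-aware ordering described above can be sketched as a tiny priority function. The backend names come from the PR text; the function itself and the bare SM-major check are illustrative assumptions, not vLLM's actual selection code:

```python
def fp8_moe_backend_priority(sm_major: int) -> list[str]:
    """Ordered FP8 MoE backend preference for a given SM major version
    (illustrative sketch of the #34260 behavior, not vLLM's implementation)."""
    if sm_major >= 10:  # Blackwell (SM100+): FlashInfer kernels preferred
        return ["FLASHINFER_CUTLASS", "DEEPGEMM"]
    return ["DEEPGEMM", "FLASHINFER_CUTLASS"]  # Hopper (SM90) and older
```

The first available backend in the returned list would be selected, so Hopper deployments fall back to DeepGEMM without users having to override the backend manually.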
- #34304 Improvements to wvSplitKrc skinny GEMM solution — rocm — by amd-hhashemi (created: 2026-02-11 11:25 (UTC+8)) [+55/-39, 3 files | commented:1] This change adjusts wvSplitKrc to use a static reduce buffer that each kernel call cleans back to zero after use, avoiding the need for a .zero_() call and improving performance. Additionally, the header has been adjusted to match torch.linear() and avoid the need for reshape() calls, activation padding support is introduced, and test coverage is added.
  Performance: gpt-oss-120b, in/out=1024, max_conc=128, on MI355 …
- #34214 wip: add io_process_plugin for sparse embedding — frontend — by staugust (created: 2026-02-10 15:56 (UTC+8)) [💬5 | +218/-3, 7 files | commented:8] ## Purpose Update sparse embedding output for `bge-m3`, which contains both `token_id` and corresponding `weights`; a draft PR for #33882. ## Test Plan Run vllm in `offline` and `online serving` mode, and call `llm.encode` and the `/pooling` API to fetch the sparse embedding result. ## Test Result
  ### online mode
  …
- #34276 [chore] Update FA — ci/build — by LucasWilkinson (created: 2026-02-11 05:33 (UTC+8)) [💬1 | +1/-1, 1 file | commented:1] Update FA to pick up https://github.com/vllm-project/flash-attention/pull/116. https://github.com/vllm-project/vllm/pull/34043 must land first.
- #34251 [Redo] Add `--trust-remote-code` to dataset bench args — performance,ready — by DarkLight1337 (created: 2026-02-11 00:13 (UTC+8)) [💬2 | +5/-5, 2 files | commented:1 approved:1] Redo of #34208. ## Purpose
  @eugr can you check that this PR doesn't break `vllm serve`? After failing to repro the issue with only this PR, I suspect that the underlying issue is actually a conflict between #34188 and #34208.
  …
- #34211 [Bugfix] Fix step3p5 reasoning with interleaved thinking — bug — by mariohong128 (created: 2026-02-10 15:20 (UTC+8)) [💬5 | +387/-14, 2 files | commented:1 | 📝 draft] ## Purpose When there are multiple rounds of conversation, the prompt contains `</think>` from the previous round, and the step3p5 reasoning parser failed to correctly determine the end of reasoning. ## Test Plan Add tests: tests/reasoning/test_step3p5_reasoning_parser.py. ## Test Result All passed. …
- #34244 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops — rocm,needs-rebase,ci/build — by tjtanaa (created: 2026-02-10 21:43 (UTC+8)) [💬1 | +407/-126, 9 files | commented:1 | 📝 draft]
  ## Purpose
  To help the effort of vLLM IR Ops and ensure that the fusion pass is also validated on ROCm, this PR enables some of the fusion-pass tests. Because enabling these tests is non-trivial, more unit tests will be rolled out across multiple PRs.
  ## Test Plan
  After ensuring the tests pass locally, we will use the AMD CI to validate them.
  …
- #34230 [vLLM/v1][sample] Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance,v1 — by RocMarshal (created: 2026-02-10 18:51 (UTC+8)) [💬3 | +153/-6, 2 files | commented:1]
  Closes #34231
  ## Purpose
  - Refactor the exponential_ method calling to improve performance of TopKTopPSampler
  ## Test Plan
  - benchmark_topk_topp_sampler.py …
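For context on what the `exponential_` call does in a top-k/top-p sampler: it implements the exponential-race trick, where taking the argmax of probs divided by independent Exp(1) noise yields a draw distributed according to probs. A minimal NumPy sketch of the trick (illustrative only; the PR itself refactors how the PyTorch `exponential_` tensor method is invoked, not this math):

```python
import numpy as np

def exponential_race_sample(probs: np.ndarray, rng: np.random.Generator) -> int:
    """Draw index i with probability probs[i]: argmin(E_i / p_i) over iid
    Exp(1) noise E_i is Exp(p_i)-distributed, so the smallest wins with
    probability proportional to p_i; argmax(p_i / E_i) is equivalent."""
    q = rng.exponential(scale=1.0, size=probs.shape)  # Exp(1) noise
    return int(np.argmax(probs / q))

rng = np.random.default_rng(0)
draws = [exponential_race_sample(np.array([0.1, 0.9]), rng) for _ in range(2000)]
```

In the sampler, the same idea applies after the top-k/top-p mask has zeroed out excluded tokens, so one fused argmax replaces an explicit multinomial draw.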
- #34299 [torch.compile] Enable AR+rms fusion by default for `-O2` — performance,ready,torch.compile — by ProExpertProg (created: 2026-02-11 09:59 (UTC+8)) [+16/-8, 2 files | commented:2 approved:1] ## Purpose It turns out AR+RMS fusion and compiling for a second range only marginally increase warm/cold-start compile time, and the benefits are 10-15% (see #24252). This PR enables this fusion by default. ## Test Plan Startup, bench, eval, CI.
  ## Test Result
  ### Startup …
- #34268 Responses harmony system message structured — frontend,gpt-oss — by Kimahriman (created: 2026-02-11 02:52 (UTC+8)) [💬1 | +43/-6, 2 files | commented:5] ## Purpose Resolves https://github.com/vllm-project/vllm/issues/34237
  Fix an issue where the Responses API fails if structured system message content is passed.
  ## Test Plan New test verifying the behavior. ## Test Result Not able to get tests to run locally. …
- #34298 [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast — performance,ready,deepseek,nvidia — by robertgshaw2-redhat (created: 2026-02-11 09:52 (UTC+8)) [+4/-7, 1 file | commented:4 approved:2]
  ## Purpose
  - Avoid the bf16 bias conversion on the hot path.
  ## Test Plan
  - lm eval
  ## Test Result: lm-eval results table (Tasks / Version / Filter / n-shot / Metric / Value / Stderr) truncated. …
- #34301 [ROCm][Quantization] Add Composable Kernel (CK) backend support for MXFP4 MoE — rocm — by dllehr-amd (created: 2026-02-11 10:09 (UTC+8)) [+279/-35, 2 files | commented:1]
  This commit introduces support for the Composable Kernel (CK) backend for MXFP4-quantized Mixture-of-Experts (MoE) models on ROCm via the Aiter library. This provides an optimized inference path for A16W4 (FP16 activations, FP4 weights) MoE workloads on AMD GPUs.
  Key changes in vllm/_aiter_ops.py:
  - Add hidden_pad, intermediate_pad, bias1, and bias2 parameters to _rocm_aiter_fused_moe_impl for handling padded dimensions and bias terms
  - Implement fused_topk operation (rocm_aiter_fused…
- #34292 [CI] Enable mypy coverage for individual excluded files — v1,nvidia — by Lucaskabela (created: 2026-02-11 08:05 (UTC+8)) [💬1 | +52/-26, 11 files | commented:1 | 📝 draft]
  ## Purpose
  This change brings us closer to not needing a custom mypy check. Part of https://github.com/vllm-project/vllm/issues/26533
  Changes:
  - vllm/engine/arg_utils.py: type fixes for Field return, config annotations, and defensive asserts
  - vllm/v1/cudagraph_dispatcher.py: type fixes, defensive asserts …
- #34255 [Docs] Reduce time spent generating API docs — rocm,ready,needs-rebase,v1,multi-modality,cpu,gpt-oss,nvidia — by hmellor (created: 2026-02-11 00:35 (UTC+8)) [💬2 | +51/-2, 26 files | commented:1 approved:1] All time reductions are measured in my local environment, which takes 520s to build the docs normally.
  Changes:
  - Update `git-revision-date-localized.exclude`: `argparse/*` -> `generated/*` (these were moved in the past); add `api/*` (these are always generated at build time, so the revision dates always fall back to the build date, which is meaningless). This saves ~20s.
  - Remove `show_if_no_docstring: true`: "if at least one of its direct or indirect members (lower in the tree) ha…
  - Update …
- #34280 [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline — rocm,ready,ci/build — by micah-wil (创建于: 2026-02-11 06:11 (UTC+8)) [+1/-1, 1 files | commented:2 approved:2]
https://github.com/vllm-project/vllm/pull/33731 moved `tests/distributed/test_sequence_parallel.py` to `tests/compile/correctness_e2e/test_sequence_parallel.py` and updated the associated locations in the CI pipelines, but missed one spot in AMD CI for Distributed Tests (2 GPUs). This PR fixes that. -
#34294 [CI] Heavy refactoring of Voxtral multimodal audio model tests — multi-modality — by AndreasKaratzas (创建于: 2026-02-11 08:48 (UTC+8)) [💬2 | +182/-47, 4 files | commented:1] Adds three-layer accuracy tests for the Voxtral multimodal audio model (`mistralai/Voxtral-Mini-3B-2507`):
- Offline vLLM greedy inference — validates mistral-format weight loading and audio preprocessing.
- HF Transformers greedy inference — independent ground truth via `AutoProcessor` + `VoxtralForConditionalGeneration`.
- Online OpenAI-compatible API serving — validates the serving path (chat template, audio encoding, tokenization) matches offline.
### HfRunner patch (`…
-
#34285 [Refactor] Move FusedMoE hidden_size roundup to quant_method — needs-rebase — by BowenBao (创建于: 2026-02-11 07:32 (UTC+8)) [💬3 | +121/-82, 5 files | commented:3 | 📝草稿] -
#34293 Fix KeyError loading GLM-4.7-NVFP4: handle k_scale/v_scale remap early — 无标签 — by Kh4L (创建于: 2026-02-11 08:44 (UTC+8)) [+5/-0, 1 files | commented:1] ## Summary This PR fixes a weight-loading crash in glm4_moe when using ModelOpt/NVFP4-style checkpoints that contain FP8
`k_scale` / `v_scale` (and related scale/zero_point) tensors. ## Problem Some GLM4-MoE checkpoints include per-projection scale keys such as:
- layers.….self_attn.k_proj.k_scale
- layers.….self_attn.v_proj.v_scale
In glm4_moe.load_weights(), stacked parameter mapping can run before kv-scale remapping, rewriting keys like k_proj.k_scale into qkv_proj.k_scale. The runtime mode…
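The ordering problem described above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not vLLM's actual `load_weights` code; the helper names, the `attn` target module, and the mapping table are all hypothetical.

```python
# Illustrative sketch: why kv-scale remapping must run before
# stacked-parameter mapping. All names here are hypothetical.

STACKED = {"q_proj": "qkv_proj", "k_proj": "qkv_proj", "v_proj": "qkv_proj"}

def remap_kv_scale(name: str) -> str:
    # Route per-projection kv-cache scales to the attention module first,
    # so the stacked-parameter mapping no longer matches them.
    for proj, scale in (("k_proj", "k_scale"), ("v_proj", "v_scale")):
        if name.endswith(f"{proj}.{scale}"):
            return name.replace(f"{proj}.{scale}", f"attn.{scale}")
    return name

def apply_stacking(name: str) -> str:
    # Rewrite per-projection weight names onto the fused qkv_proj parameter.
    for src, dst in STACKED.items():
        if src in name:
            return name.replace(src, dst)
    return name

name = "model.layers.0.self_attn.k_proj.k_scale"
# Wrong order: stacking first yields "qkv_proj.k_scale", which no longer
# matches any registered parameter, hence the KeyError at load time.
wrong = apply_stacking(name)
# Right order: remapping first keeps the key resolvable.
right = apply_stacking(remap_kv_scale(name))
```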
-
#34253 Patch protobuf for CVE-2026-0994 — ready,ci/build — by eicherseiji (创建于: 2026-02-11 00:31 (UTC+8)) [+2/-2, 2 files | commented:5 approved:1] ## Purpose
Pin protobuf to exclude versions vulnerable to CVE-2026-0994.
#33619 patched this for the v0.15.1 release branch but main is still unpinned.
## Changes
- `requirements/build.txt`: pin `protobuf >= 5.29.6` and exclude vulnerable 6.30.0–6.33.4
- `requirements/common.txt`: same …
-
#34291 [Bugfix] Avoid double-sharding pre-sharded FusedMoE expert weights in TP loaders — bug — by dangoldbj (创建于: 2026-02-11 07:46 (UTC+8)) [+118/-10, 2 files | commented:1 | 📝草稿] ## Purpose
Fixes #34257
Fix a `FusedMoE` weight-loading bug that can occur when the weight loader already returns tensor-parallel (TP) rank-local expert weights. Reported in #34257 when serving `xai-org/grok-2` with `--tensor_parallel_size 8`, with a failure like: `RuntimeError: start (2048) + length (2048) exceeds dimension size (2048)`
#### Change …
-
#34271 [Misc] Add pre-commit hook to catch boolean ops in with-statements — ready — by tlrmchlsmth (创建于: 2026-02-11 03:36 (UTC+8)) [💬1 | +75/-0, 2 files | commented:1 approved:2] ## Summary
- Adds an AST-based pre-commit hook that catches `with ctx_a() and ctx_b():` — a subtle bug where Python's `and` operator causes only the second context manager to be entered, silently skipping the first
- This exact pattern caused a sleep mode weight offloading regression in 6cdf015c3c (#31747), where the memory pool context manager was never entered, preventing 30+ GiB of weights from being offloaded (fixed in #32947)
- No existing linter (ruff, pylint, flake8) has a rule for this …
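Such a check can be written in a few lines with the standard `ast` module. This is a minimal sketch of the idea, not the actual hook added by the PR; the function name is made up.

```python
# Minimal sketch of an AST-based check for boolean ops in with-items:
# `with ctx_a() and ctx_b():` parses as a single withitem whose context
# expression is an ast.BoolOp, so it is easy to detect statically.
import ast

def find_boolop_with_items(source: str) -> list[int]:
    """Return line numbers of `with` statements like `with a() and b():`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                if isinstance(item.context_expr, ast.BoolOp):
                    hits.append(node.lineno)
    return hits

bad = "with ctx_a() and ctx_b():\n    pass\n"
good = "with ctx_a(), ctx_b():\n    pass\n"
```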
-
#34273 [Misc] Bump `fastsafetensors` version for latest fixes — ready,ci/build — by njhill (创建于: 2026-02-11 04:05 (UTC+8)) [+4/-5, 4 files | commented:1 approved:1] This includes in particular https://github.com/foundation-model-stack/fastsafetensors/pull/46 which is needed to avoid unnecessary memory overhead on rank 0 in multi-GPU deployments. See https://github.com/vllm-project/vllm/pull/34070.
cc @takeshi-yoshimura
-
#34278 [Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd — bug,kv-connector — by shwgao (创建于: 2026-02-11 05:48 (UTC+8)) [💬4 | +44/-11, 1 files | commented:1] ## Purpose
Fix NCCL send/recv key mismatch in `P2pNcclConnector` that causes disaggregated prefill P2P NCCL XpYd KV cache transfer to hang indefinitely. In the disaggregated prefill XpYd architecture, the proxy generates a single `request_id` (containing both prefill and decode addresses) and sends it to both vLLM instances. However, `InputProcessor.assign_request_id()` inside each vLLM instance independently appends a different random 8-char hex suffix to this ID: ```python # vllm/v1/engine/…
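The mismatch can be demonstrated in isolation. This is an illustrative sketch only (the function names and key format are hypothetical, not the actual vLLM code): two instances appending independent random suffixes to the same proxy ID produce different transfer keys, while a suffix derived from the shared ID would stay consistent.

```python
# Illustrative sketch of the key mismatch; names are hypothetical.
import hashlib
import random

def assign_request_id_random(request_id: str, rng: random.Random) -> str:
    # What each instance effectively does: its own random 8-char hex suffix.
    return f"{request_id}-{rng.getrandbits(32):08x}"

def assign_request_id_deterministic(request_id: str) -> str:
    # One possible consistency scheme: derive the suffix from the shared
    # proxy-generated ID so both instances compute identical keys.
    return f"{request_id}-{hashlib.sha256(request_id.encode()).hexdigest()[:8]}"

base = "req___prefill_addr___decode_addr"
prefill_key = assign_request_id_random(base, random.Random(1))
decode_key = assign_request_id_random(base, random.Random(2))
```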
-
#34281 [Attn,KV-cache] Use per-head scales in the attention selector — v1 — by eldarkurtic (创建于: 2026-02-11 06:11 (UTC+8)) [+26/-15, 5 files | commented:1 approved:1] As requested by @MatthewBonanni and @LucasWilkinson in https://github.com/vllm-project/vllm/pull/30141 attention backends should be filtered during backend selection based on whether they support per-head attention quantization scales.
This enables early failure when a user attempts to load a model that requires per-head scales but no compatible attention backend is available.
-
#34248 [EPLB] Add communication config autotuning — needs-rebase,ci/build — by ilmarkov (创建于: 2026-02-10 23:21 (UTC+8)) [💬1 | +1066/-98, 8 files | commented:1] ## Purpose Follow-up #33592
Enable autotuning of the group size and batch size EPLB communication config parameters. Add a config `max_num_experts_transfers`. If this parameter is provided, vLLM adjusts the group size and, if needed, the batch size so that the number of transfers in one batch of communication on each rank is less than or equal to `max_num_experts_transfers`. ## Test Plan Added test cases to `test_eplb_execute.py` ## Test Result ``` …
-
#34270 fix(mxfp4): Disable monolithic path for TRITON backend with EP — gpt-oss — by elizabetht (创建于: 2026-02-11 03:14 (UTC+8)) [💬2 | +223/-5, 2 files | commented:1] ## Problem
When running MXFP4 models with the TRITON backend and expert parallelism (EP) enabled (e.g. `vllm serve openai/gpt-oss-20b -tp=2 -ep`), the server crashes with an illegal memory access. `triton_kernel_moe_forward` passes `expert_map` through to `triton_kernel_fused_experts` but never actually applies it. As a result, `legacy_routing` produces routing data using global expert IDs that don't correspond to the local weight indices on each EP rank. ## Fix
When `expert_map` is … -
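What "applying `expert_map`" means can be shown with a pure-Python sketch (illustrative only, not the Triton kernel code; the helper names are made up): global expert IDs from routing must be translated to rank-local weight indices, with -1 marking experts owned by other ranks.

```python
# Illustrative sketch of expert-map translation under expert parallelism.

def build_expert_map(num_experts: int, ep_rank: int, ep_size: int) -> list[int]:
    # expert_map[g] = local index of global expert g on this rank, or -1.
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return [g - start if start <= g < start + per_rank else -1
            for g in range(num_experts)]

def to_local_ids(expert_map: list[int], topk_ids: list[int]) -> list[int]:
    # Without this translation, global IDs index past the local weight
    # tensors, which is one way an illegal memory access can arise.
    return [expert_map[g] for g in topk_ids]

expert_map = build_expert_map(num_experts=8, ep_rank=1, ep_size=2)
```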
#34279 [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides — bug,ready — by tlrmchlsmth (创建于: 2026-02-11 05:56 (UTC+8)) [+27/-27, 1 files | approved:2 commented:1] ## Summary
- Annotate all stride parameters in `fused_moe_kernel` and `fused_moe_kernel_gptq_awq` as `tl.int64` to prevent int32 overflow in pointer arithmetic for large tensors
- This follows the same pattern already used in `fused_batched_moe.py`
- Without this fix, large problem sizes cause `illegal memory access` crashes when chunking is disabled, because the C tensor offset exceeds int32 max (~2.1 billion)
- This is a prerequisite for removing the chunking workaround (`VLLM_FUSED_MOE_CHUNK…
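The overflow itself is plain arithmetic and can be illustrated without a GPU. The numbers below are hypothetical, chosen only to exceed int32 range; the point is that a 32-bit offset silently wraps.

```python
# Hypothetical arithmetic illustration of the int32 overflow: with int32
# strides, the C-tensor offset `rows * stride` can exceed int32 max and
# wrap around, corrupting the pointer math inside the kernel.

INT32_MAX = 2**31 - 1

def wrap_int32(value: int) -> int:
    # Emulate two's-complement int32 wraparound.
    return ((value + 2**31) % 2**32) - 2**31

rows, stride = 300_000, 14_336     # e.g. many tokens x a large hidden dim
exact_offset = rows * stride       # 4_300_800_000, past INT32_MAX
wrapped_offset = wrap_int32(exact_offset)
```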
-
#34282 Anthropic report cached tokens — frontend — by elizabetht (创建于: 2026-02-11 06:19 (UTC+8)) [+16/-0, 1 files | commented:1]
## Purpose Report cached tokens, fixes https://github.com/vllm-project/vllm/issues/33923
## Test Plan
## Test Result
…
-
#34262 Make Qwen3VL compatible with Transformers v5 — qwen — by hmellor (创建于: 2026-02-11 02:06 (UTC+8)) [💬3 | +13/-13, 1 files | commented:2] The location of `tie_word_embeddings` has moved in Transformers v5 for multimodal models to a more sensible location (please see https://github.com/vllm-project/vllm/pull/33359 for the full explanation). This PR updates the Qwen3VL implementation to leverage the fix made in https://github.com/vllm-project/vllm/pull/33359 and moves the `vision_config` specific check that was taking place in `Qwen3LLMModel` to `Qwen3VLForConditionalGeneration`, where the vision config is available. -
#34274 [CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI — ready,ci/build,nvidia — by mgoin (创建于: 2026-02-11 04:47 (UTC+8)) [+7/-9, 3 files | commented:1] Signed-off-by: mgoin mgoin64@gmail.com
## Purpose
We were only running the deepgemm tests in `tests/kernels/quantization/test_block_fp8.py` since we had no H100 runner running the rest of the file. I also fixed some outdated test cases after kernel updates like https://github.com/vllm-project/vllm/pull/28431 ## Test Plan
…
-
#34243 [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) — bug,ready,llama — by eldarkurtic (创建于: 2026-02-10 21:33 (UTC+8)) [💬1 | +29/-5, 1 files | commented:5] Llama-4 weights of `q/k_proj` are permuted during model loading to prepare the model for interleaved/gpt-neox rope. The same permutation needs to be applied to quantization weight scales as well. ## Purpose So far, all quantized Llama-4 models used to skip attention quantization as the model's accuracy would collapse. This was mistakenly interpreted as high sensitivity of the attention layers under quantization. Luckily for quantization, this is not the case and the accuracy collapse was …
-
#34272 fix(mistral): Validate tool_calls type before iterating — frontend,v1 — by elizabetht (创建于: 2026-02-11 03:37 (UTC+8)) [+142/-13, 5 files | commented:1] maybe_serialize_tool_calls() calls next() on the raw tool_calls value without verifying it is an iterable. When tool_calls is a string or other non-iterable type, the pydantic ValidatorIterator tries to validate each character as a dict, producing a confusing error.
Add an explicit type check before iterating: if tool_calls is not a list, tuple, or iterator, raise a clear ValueError. Also skip the message entirely when tool_calls is None instead of creating an empty tuple iterator.
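The guard described above can be sketched as follows. This is an illustrative helper under assumed behavior, not the actual vLLM serialization code; the function name is made up.

```python
# Minimal sketch of the described guard: accept list/tuple/iterator,
# reject strings and other scalars with a clear error, and skip None
# instead of producing an empty tuple iterator.
from collections.abc import Iterator

def normalize_tool_calls(tool_calls):
    if tool_calls is None:
        return None  # skip the message entirely
    if isinstance(tool_calls, (list, tuple, Iterator)):
        return list(tool_calls)
    # A plain string is iterable but would be validated char-by-char
    # downstream, so it is rejected explicitly here.
    raise ValueError(
        f"tool_calls must be a list, tuple, or iterator, "
        f"got {type(tool_calls).__name__}")
```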
-
#34269 [Benchmarks] Fix attention benchmark smoke test — ready,ci/build — by MatthewBonanni (创建于: 2026-02-11 03:02 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1] ## Purpose Fixes CI failure, thanks @varun-sundar-rabindranath for catching this
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34275 [ROCm] Add RDNA3 tile-size heuristic for `triton_scaled_mm` kernel — rocm — by monajafi-amd (创建于: 2026-02-11 04:50 (UTC+8)) [+34/-12, 2 files | commented:1]
## Purpose
Add RDNA3-optimised tile-size heuristic for the `triton_scaled_mm` kernel to improve INT8 (W8A8) inference performance on `gfx11xx` GPUs. ## Implementation
Tile-size selection is extracted into a new helper, `_get_tile_config()`, which … -
#34217 [Frontend] Exploit tokenizers “new stream” in FastIncrementalDetokenizer — ready,v1 — by njhill (创建于: 2026-02-10 16:03 (UTC+8)) [+17/-31, 1 files | commented:1] Finally remembered to revisit https://github.com/vllm-project/vllm/pull/18840.
Faster incremental detokenizer initialization.
Includes some unrelated minor code simplifications in `detokenizer.py`. -
#34263 [Refactor] Deprecate `head_first` for `chunk_gated_delta_rule` — ready,qwen — by yewentao256 (创建于: 2026-02-11 02:12 (UTC+8)) [+8/-36, 3 files | commented:1] ## Purpose We began warning 5 months ago in https://github.com/vllm-project/vllm/pull/24518; it is time to remove the param now
-
#34206 [Kernel] Optimize grouped topk kernel — 无标签 — by xyang16 (创建于: 2026-02-10 13:07 (UTC+8)) [+632/-96, 3 files | commented:2] ## Purpose
This PR adds a grouped topk kernel optimized for small expert counts (num_experts<=512 for a single group, or num_experts<=256 for multiple groups), which covers DeepSeek (num_experts=256, n_group=8), Kimi-K2.5 (num_experts=384, n_group=1), Nemotron (num_experts=512, n_group=1) etc, synced from the latest TRTLLM https://github.com/NVIDIA/TensorRT-LLM/blob/v1.3.0rc2/cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
For large expert counts, fall back to the existing `grouped_topk_fused_kernel` (TRTLLM fall… -
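For readers unfamiliar with grouped top-k routing, a pure-Python reference helps: experts are partitioned into groups, the best `topk_group` groups are kept by their max score, and the top-`topk` experts are then chosen within those groups. This is an illustrative reference only, not the CUDA kernel; details like the group score (here simply the group max) vary between models.

```python
# Pure-Python reference of grouped top-k routing (illustrative only).

def grouped_topk(scores, n_group, topk_group, topk):
    group_size = len(scores) // n_group
    # Score each group by its best expert.
    group_best = [max(scores[g * group_size:(g + 1) * group_size])
                  for g in range(n_group)]
    # Keep the topk_group best groups.
    keep = sorted(range(n_group), key=lambda g: group_best[g],
                  reverse=True)[:topk_group]
    # Select the global top-k experts among the surviving groups.
    candidates = [i for g in keep
                  for i in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:topk]
```

With e.g. DeepSeek-style settings this would run with num_experts=256, n_group=8.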
#34265 [Perf] Add FlashInfer top-k support to large context decode path - DeepSeek-V3.2 sparse attention — rocm,v1,deepseek — by LopezCastroRoberto (创建于: 2026-02-11 02:47 (UTC+8)) [+127/-7, 3 files | commented:1 | 📝草稿] ## Summary
This PR integrates FlashInfer’s radix-based top-k kernel as an alternative implementation for the large context top-k operation in the sparse attention indexer, specifically for DeepSeek-V3.2 models.
## Motivation - Microbenchmark results (NVIDIA B200)
## E2E results (NVIDIA B200) …
-
#34266 fix(qwen3-omni): Fix use_audio_in_video detection in prompt updates — qwen — by elizabetht (创建于: 2026-02-11 02:50 (UTC+8)) [+117/-5, 3 files | commented:1] The Qwen3-Omni _maybe_apply_prompt_updates had two bugs in the use_audio_in_video detection logic:
- Used `item['use_audio_in_video']` directly instead of `item.get()`, which raises KeyError when the field is missing from mm_kwargs
- Did not break after finding True, causing later items to overwrite the flag back to False
This caused 'Failed to apply prompt replacement for mm_items[audio][0]' errors when processing videos with use_audio_in_video=True, because the audio prompt replacement was not…
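The fixed detection logic amounts to a short loop. This is a minimal sketch under the description above (the helper name is hypothetical, not the actual Qwen3-Omni code): `.get()` makes a missing key harmless, and returning at the first True means later items cannot reset the flag.

```python
# Minimal sketch of the fixed use_audio_in_video detection.

def detect_use_audio_in_video(mm_items: list[dict]) -> bool:
    for item in mm_items:
        # .get() instead of item[...] so a missing key cannot raise.
        if item.get("use_audio_in_video", False):
            return True  # stop at the first True; later items cannot reset it
    return False
```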
-
#34267 [Draft] Testing, do not merge — ready,ci/build — by jstawinski (创建于: 2026-02-11 02:51 (UTC+8)) [💬2 | +2/-1, 2 files]
## Purpose Authorized testing with @khluu, will close this PR once complete
Essential Elements of an Effective PR Description Checklist
... -
#34264 Make JAIS compatible with Transformers v5 — 无标签 — by hmellor (创建于: 2026-02-11 02:13 (UTC+8)) [+0/-1, 1 files | commented:1]
`add_cross_attention` was an attribute of `PreTrainedConfig` that got removed in https://github.com/huggingface/transformers/pull/41541. This attribute was never explicitly in the JAIS config and the checkpoints we use for testing in CI do not set it.
-
#34261 [Frontend] Realtime API uses Renderer — frontend,ready,v1 — by DarkLight1337 (创建于: 2026-02-11 02:01 (UTC+8)) [+177/-84, 9 files | commented:1] ## Purpose
- Update Realtime API to use Renderer before passing inputs to engine client
- Factor out common code for constructing default `TokenizeParams` into Renderer
- Move `StreamingInputs` into the `EngineClient` protocol file since it's only used by the protocol and its subclasses
## Test Plan
## Test Result
…
- #34247 Minor cleanup for Voxtral — ready — by andylolu2 (创建于: 2026-02-10 23:12 (UTC+8)) [+3/-1, 1 files | commented:1 approved:2] Writing better PyTorch
- #34254 [kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies — 无标签 — by eldarkurtic (创建于: 2026-02-11 00:33 (UTC+8)) [+10/-8, 1 files | commented:3] At the time of KV-cache quant PR, compressed-tensors version in vLLM was a bit outdated and we had to hardcode values for quant strategies. Now we can delegate this to compressed-tensors to have a single source of truth.
- #34222 [Model Runner V2] Use pinned memory for write_contents — v1 — by WoosukKwon (创建于: 2026-02-10 17:03 (UTC+8)) [+9/-20, 1 files | commented:2 approved:1]
It is much simpler to use pinned memory for handling `write_contents`. We could also consider using a separate stream for the CPU->GPU copy, but I didn't implement it in this PR. -
#34213 [ModelRunner V2] Simplify penalties implementation — v1 — by njhill (创建于: 2026-02-10 15:45 (UTC+8)) [+11/-33, 4 files | commented:6] Avoid separate `expanded_local_pos` input batch state. cc @izhuhaoran
-
#34240 [Docs] Speed up build environment set-up — ready — by hmellor (创建于: 2026-02-10 20:44 (UTC+8)) [+7/-6, 1 files | commented:2 approved:2]
- Update `post_checkout`'s fetch command (used only by `git-revision-date-localized` for the "last updated" info) to be as fast as possible:
  - Explicitly only fetch `origin main` - other branches are not used
  - Add `--no-tags` - tags are not used
  - Add `--filter=blob:none` - defers downloading file blobs until they're actually needed (since we only care about the graph, I believe they will never be needed)
  - This reduces the fetch time from ~10-20s to ~3s
- Update Python environment manag…
-
#34245 [Bugfix] fix the import path in moe test utils.py — bug — by michalowski-arm (创建于: 2026-02-10 21:43 (UTC+8)) [💬2 | +1/-1, 1 files | commented:2] Tiny fix to the failing Arm CPU test seen in https://buildkite.com/vllm/ci/builds/50374/steps/canvas?sid=019c3333-d97b-47c3-872a-75936f81fab6&tab=output. The issue appears to have been an incorrect import path, recently changed in #33375.
CC @fadara01
-
#34227 Feat/support nemotron new file format — 无标签 — by shaunkotek (创建于: 2026-02-10 18:00 (UTC+8)) [💬1 | +77/-2, 1 files | commented:5 | 📝草稿]
## Purpose To add support for Nemotron's updated weight format coming with Nemotron 3 Super
## Test Plan
- Load a checkpoint in the old format and a checkpoint in the new format with the same weights and compare results.
- Make sure models load with both the old and new formats
- Make sure models generate the same tokens for the same prompt when temperature is 0.
…
-
#34246 [Core] Simplify multimodal masking — v1 — by lgeiger (创建于: 2026-02-10 22:21 (UTC+8)) [💬2 | +36/-34, 4 files | commented:1 approved:1] ## Purpose
Since PyTorch 2.9.0 (https://github.com/pytorch/pytorch/pull/156384) `target[mask] = src` doesn't cause `cudaStreamSynchronize` anymore in cases where `mask` is a CPU tensor. This PR simplifies `_merge_multimodal_embeddings` by removing the need for `masked_scatter_` without re-introducing a CPU/GPU sync. This also simplifies the model runner since the mask doesn't need to be transferred to the GPU anymore. ## Test Plan
I verified that for Qwen3VL no `cudaStreamSynchronize` ops ar… -
#34236 [Plugin] Simplify IO Processor Plugin interface — documentation,frontend,ready — by DarkLight1337 (创建于: 2026-02-10 19:31 (UTC+8)) [💬8 | +167/-151, 9 files | commented:6]
## Purpose
- Rename `parse_request` -> `parse_data` and pass `request.data` to it so there is no need to distinguish between offline and online APIs.
- For the same reason, deprecate `output_to_response`, since we can construct the response automatically from the output data and provided request ID.
- Split up `validate_or_generate_params` into `merge_sampling_params` and `merge_pooling_params` to make type annotation easier.
All changes are back-compatible until v…
-
#34241 [Bugfix] Grammar ignored when reasoning ends within speculated tokens — bug,structured-output,v1 — by sfbemerk (创建于: 2026-02-10 20:48 (UTC+8)) [+365/-15, 3 files | commented:1 | 📝草稿] ## Purpose
This PR attempts to fix a bug (#31858) when Speculative Decoding (such as MTP), Reasoning, and Structured Output / Grammar are used in combination: typically, grammar is not enabled during reasoning but only for the final answer. However, when the reasoning end token is generated, any subsequent draft tokens are not validated against the grammar, leading to an invalid final answer.
## Test Plan
In general, the bug seems to be independent of the specific SpecDecode method; origina…
- #34233 Bump `mamba-ssm` version in CI for Transformers v5 compatibility — ready,ci/build — by hmellor (创建于: 2026-02-10 19:17 (UTC+8)) [+4/-4, 2 files | commented:1 approved:1] So that it includes https://github.com/state-spaces/mamba/commit/35e927b20fd674f0b30a799a6408b7aac6ffe642 which updates imports which were deleted in Transformers v5. -
#34220 [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input — bug,ready,v1 — by KrxGu (创建于: 2026-02-10 16:12 (UTC+8)) [💬2 | +75/-1, 2 files | commented:9 approved:1] Fixes #32469
### Description When using EAGLE3 speculative decoding with multimodal inputs and `disable_chunked_mm_input=True`, the scheduler's rollback calculation didn't account for `shift_computed_tokens`, causing it to schedule tokens overlapping the multimodal range without scheduling encoder inputs. This triggered "Encoder cache miss" assertions. ### Changes
- Fixed rollback calculation in `_try_schedule_encoder_inputs` to account for `shift_computed_tokens`
- Added regression test for …
-
#34239 fix: correct verbose timestamp offsets for chunked transcription (#32588) — frontend — by Mercury0226 (创建于: 2026-02-10 20:12 (UTC+8)) [💬2 | +89/-14, 2 files | commented:1] ## Problem For chunked transcription with `verbose_json`, segment timestamps used a synthetic chunk start (idx * max_audio_clip_s) instead of the true split boundary chosen by low-energy search. On long audio this causes cumulative drift. ## Reproduction When chunking is enabled, `_split_audio` may split earlier than the nominal chunk length inside the overlap window. Existing timestamp offset logic in `_create_speech_to_text` ignored that and advanced by fixed chunk size, so segment start/en…
- #34235 Stop testing for slow tokenizers as they will not exist soon — ready — by hmellor (创建于: 2026-02-10 19:30 (UTC+8)) [+0/-5, 1 files | commented:1 approved:1] Slow tokenizers do not exist in Transformers v5. This PR pre-emptively removes the test that checks that slow tokenizers load slowly.
-
#34238 fix(v1-worker): improve dcp assertion with backend fallback hint (#28407) — v1 — by Mercury0226 (创建于: 2026-02-10 19:34 (UTC+8)) [💬1 | +41/-1, 2 files | commented:1] ## Problem The current DCP compatibility assertion does not provide actionable recovery guidance when an attention backend does not return decode LSE.
## Reproduction Configure DCP (`decode_context_parallel_size > 1`) with a backend implementation that has `need_to_return_lse_for_decode=False`. ## Solution
- Extended the assertion message in `check_attention_cp_compatibility` with explicit fallback guidance:
  - try another backend using `--attention-backend` or `VLLM_ATTENTION_BACKEND`.
- Add…
-
#34215 [Misc] allow specify is_mm_prefix_lm in hf_config — ready — by lkhphuc (创建于: 2026-02-10 15:56 (UTC+8)) [💬2 | +3/-0, 1 files | commented:2 approved:1] ## Purpose Currently whether to use PrefixLM is hardcoded into a list MM_PREFIX_LM_MODELS. This PR enables model’s hf_config to also specify this behaviour with a flag of the same name. If no flag is given, fallbacks to the hardcoded list.
## Test Plan Locally extend some models' `config.json` with a field `is_mm_prefix_lm` and print the `image_doc_range` in `gpu_model_runner.py` to confirm prefixLM is turned on and off correctly. ```python if self.is_mm_p…
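The lookup order described in the PR can be sketched as follows. This is an illustrative sketch, not the actual vLLM implementation; the entries in the hardcoded list are placeholders.

```python
# Sketch of the described fallback behavior: an explicit
# `is_mm_prefix_lm` flag in hf_config wins; with no flag present,
# fall back to the hardcoded list. List entries are hypothetical.

MM_PREFIX_LM_MODELS = {"model-a", "model-b"}

def is_mm_prefix_lm(hf_config: dict, model_type: str) -> bool:
    flag = hf_config.get("is_mm_prefix_lm")
    if flag is not None:
        return bool(flag)
    return model_type in MM_PREFIX_LM_MODELS
```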
-
#34228 Add unit tests for fp8 output fusion of triton_attn — 无标签 — by bringlein (创建于: 2026-02-10 18:39 (UTC+8)) [+127/-1, 1 files | commented:1]
## Purpose
Adding unit tests to cover fused quantization inside the `triton_unified_attention` kernels (and to help debug issues like #31785). ## Test Plan
``` pytest tests/kernels/attention/test_triton_unified_attention.py …
-
#34219 [Bugfix] Fix FI kernel `chunk_gated_delta_rule` output shape for Qwen3.5 — bug,ready,qwen — by ywang96 (创建于: 2026-02-10 16:09 (UTC+8)) [+3/-1, 1 files | commented:1 approved:2] ## Purpose ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34226 [build] fix priority of cuda-compat libraries in ld loading — ci/build,nvidia — by youkaichao (创建于: 2026-02-10 17:46 (UTC+8)) [💬1 | +4/-4, 1 files | commented:1]
## Purpose
People still report `Error 803: system has unsupported display driver / cuda driver combination` after https://github.com/vllm-project/vllm/pull/33116. It is because some docker images specify the driver .so file in `nvidia.conf`, which gets suppressed by `cuda-compat.conf`. This PR adds `zzz-` to `cuda-compat.conf` so that it's ordered last. <img width="970" height="594" alt="a119df1a223395c6b8d6145b8913cd2c" src="https://github.com/user-attachment…
- #34223 [Kernels] Make GGUF linear method allow 3d inputs — 无标签 — by Isotr0py (创建于: 2026-02-10 17:17 (UTC+8)) [+19/-4, 2 files | commented:1 | 📝草稿]
## Purpose
- vLLM-Omni is planning to support FP8 and GGUF quantization: https://github.com/vllm-project/vllm-omni/issues/1057
- This PR extends the GGUF linear method to accept X inputs in shape of `(batch_size, seq_len, hidden_size)`.
## Test Plan TODO
## Test Result TODO
…
-
#34218 [Bugfix] Fix `--trust-remote-code` conflict — bug,documentation,performance,speculative-decoding,ready — by DarkLight1337 (创建于: 2026-02-10 16:03 (UTC+8)) [💬2 | +2/-16, 2 files | commented:2] ## Purpose
Remove `--trust-remote-code` from https://github.com/vllm-project/vllm/pull/34188 as well… Remove import fallback from spec decode example to avoid silently incorrect CLI parsing.
## Test Plan
…
-
#34216 Revert #34208 — performance — by DarkLight1337 (创建于: 2026-02-10 15:58 (UTC+8)) [💬2 | +5/-5, 2 files | commented:1] This reverts commit f69b903b4c70716224b3936cb8503e562e25388e.
## Purpose
The fix is incorrect and actually broke `vllm serve` ## Test Plan
…
-
#34221 add log of raw request when crashing — frontend,needs-rebase,v1 — by qrshan (创建于: 2026-02-10 16:12 (UTC+8)) [💬1 | +181/-130, 8 files | commented:1]
## Purpose This commit was added to make it easier to identify request parameters that cause crashes in the production environment.
## Test Plan
## Test Result
…
-
#34208 [Bugfix] Add `--trust-remote-code` to dataset bench args — bug,performance,ready — by DarkLight1337 (创建于: 2026-02-10 13:23 (UTC+8)) [💬4 | +5/-5, 2 files | commented:1 approved:1] ## Purpose
FIX example tests failing https://buildkite.com/vllm/ci/builds/50767/steps/canvas?jid=019c446c-a9ee-4d8f-a856-a222915b4438&tab=output
#32300 required `args.trust_remote_code` when loading the dataset, which isn't passed by spec decode examples. To ensure that this argument is passed, I have moved the `--trust-remote-code` flag to `add_dataset_parser`. ## Test Plan
…
-
#34204 [CI/Build] Relax `test_mcp_tool_call` — ready — by DarkLight1337 (创建于: 2026-02-10 11:45 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] ## Purpose
FIX flaky https://buildkite.com/vllm/ci/builds/50601/steps/canvas?jid=019c3df1-35a8-4acd-a443-57f3e4449d1e&tab=output, the output is still reasonable.
## Test Plan
## Test Result
…
-
#34207 Feat/silu and mul out variant — documentation,needs-rebase — by tianrengao (创建于: 2026-02-10 13:08 (UTC+8)) [💬2 | +109/-42, 14 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
...
[已合并 PR]
- #34202 [XPU][7/N] enable xpu fp8 moe — ready,ci/build — by zufangzhu (合并于: 2026-02-11 11:33 (UTC+8)) [💬1 | +52/-5, 4 files | commented:1 approved:1] Enable moe fp8 online path for xpu
- #33022 [Kernel] Apply 256bit LDG/STG To Activation Kernels — ready — by AstroVoyager7 (合并于: 2026-02-11 11:31 (UTC+8)) [💬19 | +401/-123, 1 files | commented:10]
## Purpose
NVIDIA Blackwell GPUs support 256-bit ld/st instructions. Therefore, we modified the activation kernels to use this feature.
## Test Plan
pytest ./tests/kernels/core/test_activation.py## Test Result Hopper: ``` ================================================================================================================ test session starts ================================================================================================================ platform linux – Python 3.10.12, p… -
#33884 [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast — bug,ready,deepseek — by kebe7jun (合并于: 2026-02-11 11:31 (UTC+8)) [💬1 | +16/-4, 1 files | commented:1 approved:1]
## Purpose
Fix https://github.com/vllm-project/vllm/issues/33883
cc @LucasWilkinson
## Test Plan
…
-
#34144 [Misc] Clean up validation logic in input processor — ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-02-11 11:29 (UTC+8)) [+72/-86, 3 files | commented:5 approved:1 changes:1]
## Purpose
Simplify the validation for prompt length since encoder-decoder models are always MM models now.
## Test Plan
## Test Result
…
-
#33738 [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend — ready,nvidia — by tlrmchlsmth (合并于: 2026-02-11 11:15 (UTC+8)) [💬2 | +6/-2, 1 files | commented:1 approved:4] ## Purpose
On current main, `nvidia/DeepSeek-R1-NVFP4` crashes when used with the DeepEP high throughput all2all backend. `NotImplementedError File "vllm/distributed/device_communicators/all2all.py", line 355, in dispatch_router_logits raise NotImplementedError` …
-
#34093 [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter — ready,ready-run-all-tests — by zou3519 (合并于: 2026-02-11 11:15 (UTC+8)) [+41/-4, 2 files | commented:3 approved:2] ## Purpose
This piece of code is essentially running FakeTensors through the entire model's forward pass. This is unnecessary; the FakeTensors already exist on the model. I did not add an envvar for this because I don't want to toggle this.
## Test Plan & Test Result
This saves ~2s on llama-3.1-70B cold start time.
Also, run local tests (tests/model)
-
#34251 [Redo] Add `--trust-remote-code` to dataset bench args — performance,ready — by DarkLight1337 (合并于: 2026-02-11 11:10 (UTC+8)) [💬2 | +5/-5, 2 files | commented:1 approved:1] Redo #34208 ## Purpose
@eugr can you check that this PR doesn't break `vllm serve`? After failing to repro the issue with only this PR, I suspect that the underlying issue is actually a conflict between #34188 and #34208. ## Test Plan
## Test Result
…
-
#34021 [Bugfix] Fix Worker.load_model context-manager composition for sleep mode — bug,ready,v1 — by tianshu-Michael-yu (合并于: 2026-02-11 11:07 (UTC+8)) [💬2 | +4/-3, 1 files | commented:1 approved:2] ## What this PR does
Fixes a context-manager composition bug in `vllm/v1/worker/gpu_worker.py`:
- Before: `with A and B:`
- After: `with A, B:`
In Python, `with A and B:` evaluates to a single context manager (`B` when `A` is truthy), so `A` is never entered. Here that means `_maybe_get_memory_pool_context(tag="weights")` is skipped, so weight allocations are not tracked by cuMem sleep mode. ## Why this matters …
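The pitfall is easy to reproduce with `contextlib.contextmanager`: the generator body only runs on `__enter__`, so the first context manager in the `and` expression is created but never entered.

```python
# Small runnable demonstration of `with A and B:` vs `with A, B:`.
from contextlib import contextmanager

log = []

@contextmanager
def ctx(name):
    log.append(f"enter {name}")
    yield
    log.append(f"exit {name}")

with ctx("A") and ctx("B"):   # buggy: the expression evaluates to ctx("B"),
    pass                      # so "A" is never entered
buggy = list(log)

log.clear()
with ctx("A"), ctx("B"):      # correct: both are entered and exited
    pass
correct = list(log)
```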
-
#32968 [Misc] Add run one batch script that supports profiling — documentation,ready — by LucasWilkinson (合并于: 2026-02-11 10:29 (UTC+8)) [💬2 | +112/-0, 1 files | commented:1 approved:1] add a script that runs one batch and can optionally profile it; e.g. `python examples/offline_inference/run_one_batch.py --model deepseek-ai/DeepSeek-V2-Lite --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.9 --kv-cache-dtype fp8 --trust-remote-code --profile both --profile-dir ./profiles/main`
- #34280 [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline — rocm,ready,ci/build — by micah-wil (合并于: 2026-02-11 09:08 (UTC+8)) [+1/-1, 1 files | commented:2 approved:2]
https://github.com/vllm-project/vllm/pull/33731 moved `tests/distributed/test_sequence_parallel.py` to `tests/compile/correctness_e2e/test_sequence_parallel.py` and updated the associated locations in the CI pipelines, but missed one spot in AMD CI for Distributed Tests (2 GPUs). This PR fixes that. -
#32344 [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner — documentation,ready,v1,nvidia — by bnellnm (合并于: 2026-02-11 08:51 (UTC+8)) [💬11 | +913/-753, 25 files | commented:10] ## Purpose
The execution logic in `FusedMoE` is rather convoluted due to the number of inter-dependent features, e.g. shared experts, gating, modular vs. non-modular kernels, logic for parallel execution, input chunking, etc. In order to try and simplify the `FusedMoE` layer we introduce the `MoERunner` abstraction. The eventual goal is to have different subclasses of runner to handle the different major combinations of features. This first PR introduces the `MoERunner` and moves the existin…
## Summary
- Add `cache: 'pip'` to the `actions/setup-python` step in cleanup_pr_body workflow
- Reduces pip dependency installation time on cache hits
## Test Plan
- Open/edit a PR to trigger the workflow
- Check workflow logs for “Cache restored” message on subsequent runs
- Verify workflow completes successfully
## Risk …
-
#34271 [Misc] Add pre-commit hook to catch boolean ops in with-statements — ready — by tlrmchlsmth (合并于: 2026-02-11 07:13 (UTC+8)) [💬1 | +75/-0, 2 files | commented:1 approved:2] ## Summary
- Adds an AST-based pre-commit hook that catches `with ctx_a() and ctx_b():` — a subtle bug where Python's `and` operator causes only the second context manager to be entered, silently skipping the first
- This exact pattern caused a sleep mode weight offloading regression in 6cdf015c3c (#31747), where the memory pool context manager was never entered, preventing 30+ GiB of weights from being offloaded (fixed in #32947)
- No existing linter (ruff, pylint, flake8) has a rule for this …
- Adds an AST-based pre-commit hook that catches
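The check itself is straightforward to sketch with Python's `ast` module (an illustrative reimplementation, not the merged hook's actual code): a `with` statement whose context expression is a `BoolOp` is the buggy pattern.

```python
import ast

def find_boolop_with_items(source: str) -> list[int]:
    """Return line numbers of `with` statements whose context expression
    is a boolean op like `with a() and b():` (only `b()` would be entered)."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                if isinstance(item.context_expr, ast.BoolOp):
                    hits.append(node.lineno)
    return hits

buggy = "with open('a') and open('b'):\n    pass\n"
ok = "with open('a'), open('b'):\n    pass\n"
print(find_boolop_with_items(buggy))  # → [1]
print(find_boolop_with_items(ok))     # → []
```

Running a check like this over a repository's sources as a pre-commit step flags each offending `with` line without executing any code.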
- #34092 [torch.compile] Disable recursive pre_grad_passes — ready — by zou3519 (合并于: 2026-02-11 07:02 (UTC+8)) [💬5 | +24/-1, 2 files | commented:2 approved:1]
## Purpose
Inductor’s pre-grad passes don’t do anything for vLLM, yet they run even on cache hit and slow vLLM cold compile times by O(1s). This can be removed after the following issue is fixed: https://github.com/pytorch/pytorch/issues/174502
I added an envvar to toggle this so there’s an easy way to turn this back on if something goes wrong or if Inductor ends up adding some useful pre-grad optimizations before the pytorch issue is fixed. It’s an envvar and not a conf…
- #34022 [Misc][Spec Decode] support different load config for draft model — speculative-decoding,ready,v1 — by ZhengkaiZ (合并于: 2026-02-11 06:52 (UTC+8)) [💬2 | +8/-1, 3 files | commented:1 approved:1]
Summary:
Sometimes, to achieve better draft model performance or to use a different checkpoint format, we customize these parts; adding the freedom to specify a different load config lets our privately trained draft model work.
Test Plan: Command:
``` …
- #32860 [BugFix] Fix async EPLB hang with DeepEP LL all2all backend — bug,ready,v1 — by ilmarkov (合并于: 2026-02-11 06:34 (UTC+8)) [💬1 | +56/-0, 2 files | commented:6 approved:2]
Fix hang when async EPLB is used with DeepEP LL.
The hang happens because of the cooperative launch in DeepEP LL. When async EPLB, in its async thread, launches NCCL kernels for weight exchange with a large number of SMs, it blocks the DeepEP kernels. When these kernels are launched on different ranks out of order, we hit a deadlock.
See https://github.com/deepseek-ai/DeepEP/issues/496.
We resolve it by limiting the number of CTAs in NCCL. In DP/EP mode NCCL is not used in the hot path, so it doesn’t directly affec…
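One way a deployment can bound NCCL's SM footprint is through the `NCCL_MAX_CTAS` environment variable (supported in recent NCCL releases). This is a hedged sketch of that idea only; the PR itself may configure the limit through a different mechanism:

```python
import os

def cap_nccl_ctas(max_ctas: int = 8) -> None:
    """Cap the CTAs NCCL kernels may occupy, leaving SMs free for
    cooperative DeepEP LL kernels. Must run before NCCL communicators
    are created; setdefault keeps any value the operator already chose."""
    os.environ.setdefault("NCCL_MAX_CTAS", str(max_ctas))

cap_nccl_ctas(8)
```

The cap trades some collective bandwidth for the guarantee that weight-exchange kernels cannot starve concurrently launched cooperative kernels of SMs.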
- #30888 [Perf] Move eplb rebalance algo to async thread — ready — by ilmarkov (合并于: 2026-02-11 06:19 (UTC+8)) [💬5 | +251/-103, 5 files | commented:3 approved:2]
## Purpose
Follow-up of #30697.
Move the rebalance algorithm to the async thread in async EPLB, and move the CPU-on-GPU dependency to the async thread as well. In EPLB we need the rebalancing results on the CPU in order to perform routing logic, which means that at some point the CPU must wait for GPU computation, creating a CPU bubble. In this PR we avoid that bubble in the execution thread by moving the dependency to the async EPLB thread.
## Test Plan gsm8k test server: …
- #33581 [Feature] Warn about unrecognized environment variables — ready,v1 — by gshtras (合并于: 2026-02-11 05:45 (UTC+8)) [💬2 | +45/-0, 3 files | commented:4 approved:1]
Implements #33096.
Upon creating the LLM engine, go over the entries of os.environ that start with VLLM_ and validate that they are known to vLLM.
At first this only issues a warning, with the potential to become a hard failure in future releases.
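This kind of validation can be sketched in a few lines (hypothetical names throughout; vLLM's real registry of known variables lives elsewhere, e.g. in its envs module):

```python
import os
import warnings

# Hypothetical registry of recognized variables, for illustration only.
KNOWN_VLLM_ENVS = {"VLLM_USE_V1", "VLLM_LOGGING_LEVEL", "VLLM_ATTENTION_BACKEND"}

def warn_unknown_vllm_envs(environ=os.environ) -> list[str]:
    """Warn about VLLM_-prefixed variables that are not recognized."""
    unknown = sorted(
        name for name in environ
        if name.startswith("VLLM_") and name not in KNOWN_VLLM_ENVS
    )
    for name in unknown:
        warnings.warn(f"Unrecognized environment variable: {name}")
    return unknown

print(warn_unknown_vllm_envs(
    {"VLLM_TYPO_VAR": "1", "VLLM_USE_V1": "1", "PATH": "/bin"}))
# → ['VLLM_TYPO_VAR']
```

Catching `VLLM_`-prefixed typos at engine startup is cheap and surfaces misconfiguration that would otherwise fail silently.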
- #31195 [SM100] Resubmit FMHA FP8 prefill for MLA — ready,v1,nvidia — by pavanimajety (合并于: 2026-02-11 05:18 (UTC+8)) [💬11 | +145/-23, 3 files | commented:10]
## Purpose
Resubmit the FP8 FMHA path for MLA prefill and clean up how kernel backends opt into FP8 prefill. Use `--attention-config.use_prefill_query_quantization=true` to enable FP8 prefill. This is currently guarded because it shows slightly lower perf than BF16 prefill due to extra casts and quantizations, although kernel-level performance is about ~1.5x better.
## Test Plan
CI tests and running DeepSeekR1-FP4 manually.
## Test Result Comparison of vLLM serving performance across different a…
- #34200 [Bugfix] Fix mamba cache dtype for Qwen3.5 — bug,ready,qwen — by ywang96 (合并于: 2026-02-11 05:12 (UTC+8)) [💬1 | +2/-1, 1 files | approved:1 commented:1]
## Purpose
Qwen3.5 uses float32 for the mamba cache dtype, and it’s rather inconvenient to ask users to pass `--mamba-cache-dtype float32` every single time.
Since it’s not part of the model config, we simply hardcode it here.
## Test Plan
## Test Result
…
- #34269 [Benchmarks] Fix attention benchmark smoke test — ready,ci/build — by MatthewBonanni (合并于: 2026-02-11 05:04 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1]
## Purpose
Fixes CI failure; thanks @varun-sundar-rabindranath for catching this.
## Test Plan
## Test Result
...
- #32947 [Bugfix] Fix weights offloading for sleep mode — bug,ready,v1 — by jseppanen (合并于: 2026-02-11 04:38 (UTC+8)) [💬1 | +4/-3, 1 files | commented:1 approved:3]
## Purpose
Fix the weights offloading bug.
Before:
[gpu_worker.py:137] Sleep mode freed 24.08 GiB memory, 33.62 GiB memory is still in use.…
- #34084 Convert online APIs to use Renderer — frontend,ready — by reaganjlee (合并于: 2026-02-11 03:44 (UTC+8)) [💬1 | +33/-9, 2 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
...
- #34182 [Misc] Introduce ec_both role EC (encoder cache) connector — ready,v1 — by furionw (合并于: 2026-02-11 02:55 (UTC+8)) [💬2 | +10/-4, 3 files | commented:1 approved:1]
## Why
This PR is to support ECConnector on an aggregated EPD node.
## Detail
Depending on ECConnector’s role, we try to either:
- load the embedding into GPU memory, when it’s a `consumer` role
- store the embedding in ECConnector (e.g. disk), when it’s a `producer` role
…
- #34152 [UX nit] Fix non-default api_server_count message — frontend,ready — by mgoin (合并于: 2026-02-11 02:35 (UTC+8)) [+1/-0, 1 files | commented:1 approved:1]
## Purpose
Since https://github.com/vllm-project/vllm/pull/32525 landed, api_server_count always appeared in the startup message for non-default args, even when not specified by the user:
`(APIServer pid=2847065) INFO 02-09 15:29:34 [utils.py:223] non-default args: {'model_tag': 'openai/gpt-oss-20b', 'api_server_count': 1, 'model': 'openai/gpt-oss-20b'}`
This PR fixes that by resetting the default when only one server is used.
## Test Plan
…
- #34188 [responsesAPI] fix simpleContext streaming output_messages — performance,frontend,ready,gpt-oss — by qandrew (合并于: 2026-02-10 14:53 (UTC+8)) [💬3 | +265/-5, 3 files | commented:1 approved:1]
## Purpose
Before this diff, when we stream in SimpleContext we get an array of output_messages. This makes debugging harder and behaves differently from non-streaming. This PR fixes the issue below:
``` … { "message": "hello", "tokens": [ 42 …
- #34247 Minor cleanup for Voxtral — ready — by andylolu2 (合并于: 2026-02-11 02:18 (UTC+8)) [+3/-1, 1 files | commented:1 approved:2]
Writing better pytorch
- #34108 [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection — bug,rocm,ready — by AndreasKaratzas (合并于: 2026-02-10 12:50 (UTC+8)) [💬4 | +27/-35, 1 files | commented:1 approved:2]
Fixes a `torch._dynamo.exc.Unsupported` crash when running models that invoke `on_gfx9()` (or similar arch detection functions) inside `torch.compile` regions on ROCm. This was introduced by #33941.
## Problem
PR #33941 replaced `torch.cuda.get_device_properties()` calls in `on_gfx9()`/`on_gfx1x()`/etc. with `amdsmi` queries to avoid early CUDA initialization (which broke Ray worker GPU assignment). However, when these functions are called inside compiled model forward passes (e.g., via `fused…
- #34222 [Model Runner V2] Use pinned memory for write_contents — v1 — by WoosukKwon (合并于: 2026-02-11 00:55 (UTC+8)) [+9/-20, 1 files | commented:2 approved:1]
It is much simpler to use pinned memory for handling `write_contents`.
We could also consider using a separate stream for the CPU->GPU copy, but I didn’t implement it in this PR.
- #34240 [Docs] Speed up build environment set-up — ready — by hmellor (合并于: 2026-02-11 00:34 (UTC+8)) [+7/-6, 1 files | commented:2 approved:2]
- Update `post_checkout`’s fetch command (used only by `git-revision-date-localized` for the “last updated” info) to be as fast as possible:
  - Explicitly only fetch `origin main` - other branches are not used
  - Add `--no-tags` - tags are not used
  - Add `--filter=blob:none` - defers downloading file blobs until they’re actually needed (since we only care about the graph, I believe they will never be needed)
  - This reduces the fetch time from ~10-20s to ~3s
- Update Python environment manag…
- #33698 [Core][BugFix] Fix PP KV cache sharding memory validation — bug,ready,v1 — by junuxyz (合并于: 2026-02-10 23:46 (UTC+8)) [💬4 | +168/-39, 2 files | commented:6 approved:1]
Fixes #32105. Refs #32782. Regression introduced by #29431 (commit 8ee90c8).
This PR fixes incorrect KV cache sharding memory validation under Pipeline Parallelism, which could cause false OOM errors.
## Purpose
According to the docs, since v0.14.0,
`get_kv_cache_configs` works as follows:
- #34077 [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP — bug,ready,v1,qwen — by vadiklyutiy (合并于: 2026-02-10 23:57 (UTC+8)) [💬14 | +18/-4, 1 files | commented:4 approved:1]
Qwen3-Next MTP+CG had accuracy loss. The loss was hidden: if you run `lm_eval` with a high `num_concurrent`, accuracy looks fine because CG isn’t used at high concurrency.
All the problems were in the handling of the padded part of the batch:
- a zero-length request was counted as prefill
- state_indices didn’t take padded elements into account
- the copy to the persistent location had a shape mismatch
## Tests
``` vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 -tp 4 --no-enable-prefix-caching --speculative_config.method qw…
- #33680 [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention — ready,ci/build,v1,deepseek — by LopezCastroRoberto (合并于: 2026-02-10 23:29 (UTC+8)) [💬8 | +554/-12, 8 files | commented:3 approved:1]
## Summary
This PR adds an optimized top-k per-row decode kernel for DeepSeek-V3.2 sparse attention (K = 2048). The new kernel replaces vLLM’s native `top_k_per_row_decode` in the decode path for long-context inference. It uses a 5-pass radix selection algorithm optimized for large-K workloads and long sequence lengths. The implementation is adapted from TileLang / SGLang (PR by DarkSharpness: https://github.com/sgl-project/sglang/pull/11194).
## Performance characteristics
Not intended to repl…
- #34155 [compile] Enable AOT compile with 2.10 in trunk. — ready — by zhxchen17 (合并于: 2026-02-10 23:24 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
Following up on RFC https://github.com/vllm-project/vllm/issues/33804, we want to turn this off by default in the next vLLM release.
For now we will keep it on with 2.10 to make sure CI can catch regressions. Will follow up again to turn it off when the next release approaches.
## Purpose
## Test Plan
…
- #29008 [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations — rocm,ready,gpt-oss — by xuebwang-amd (合并于: 2026-02-10 23:08 (UTC+8)) [💬41 | +1094/-213, 13 files | commented:10]
## Purpose
This PR aims for:
- quark model loading, combined with the `mxfp4` loading function for original openai/gpt-oss-20b & openai/gpt-oss-120b
- OCPMX_W4A16 and OCPMX_W4AFP8 MoE schemes with emulation forward, unified into class `QuarkOCP_MX_MoEMethod`
## Test Plan
- Models:
- GPT_OSS_20B …
- #33922 Support benchmarking of Geospatial models — performance,ready,multi-modality — by mgazz (合并于: 2026-02-10 23:04 (UTC+8)) [💬5 | +110/-56, 3 files | commented:10]
This PR adds the following features necessary for Geospatial models like Prithvi:
- ability to run benchmarks when the tokenizer is not initialised, via a new argument `--skip-tokenizer-init`; in this case the benchmark does not print token-related metrics
- a new benchmark backend that supports sending requests to the /pooling endpoint
The benchmark can be run as follows:
Step 1: start the server
``` …
- #34026 add --insecure arg to the vllm bench to skip TLS — performance,ready — by fanyang-real (合并于: 2026-02-10 22:23 (UTC+8)) [💬4 | +139/-5, 2 files | commented:4 approved:2]
## Purpose
Add an `--insecure` flag to `vllm bench serve` to skip TLS certificate verification when connecting to servers with self-signed certificates.
This is useful when benchmarking vLLM servers that use self-signed SSL certificates, where the default strict certificate verification would fail.
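Skipping verification usually boils down to handing the HTTP client an unverified SSL context. A sketch of what such a flag might toggle, using the standard `ssl` module (an assumption for illustration, not necessarily the benchmark's exact code):

```python
import ssl

def make_ssl_context(insecure: bool = False) -> ssl.SSLContext:
    """Return a TLS context; with insecure=True, certificate and hostname
    checks are disabled (only for benchmarking against self-signed certs)."""
    ctx = ssl.create_default_context()
    if insecure:
        # check_hostname must be cleared before dropping verify_mode,
        # otherwise ssl raises ValueError.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

ctx = make_ssl_context(insecure=True)
print(ctx.verify_mode == ssl.CERT_NONE)  # → True
```

Such a context can then be passed to the client session used by the benchmark; it should never be used outside benchmarking, since it disables all TLS identity checks.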
## Test Plan
```bash pytest tests/benchmarks/test_serve_cli.py -v -m benchmark …
- #34233 Bump `mamba-ssm` version in CI for Transformers v5 compatibility — ready,ci/build — by hmellor (合并于: 2026-02-10 21:46 (UTC+8)) [+4/-4, 2 files | commented:1 approved:1]
So that it includes https://github.com/state-spaces/mamba/commit/35e927b20fd674f0b30a799a6408b7aac6ffe642, which updates imports that were deleted in Transformers v5.
- #34220 [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input — bug,ready,v1 — by KrxGu (合并于: 2026-02-10 21:05 (UTC+8)) [💬2 | +75/-1, 2 files | commented:9 approved:1]
Fixes #32469
### Description
When using EAGLE3 speculative decoding with multimodal inputs and `disable_chunked_mm_input=True`, the scheduler’s rollback calculation didn’t account for `shift_computed_tokens`, causing it to schedule tokens overlapping the multimodal range without scheduling encoder inputs. This triggered “Encoder cache miss” assertions.
### Changes
- Fixed rollback calculation in `_try_schedule_encoder_inputs` to account for `shift_computed_tokens`
- Added regression test for …
- #34235 Stop testing for slow tokenizers as they will not exist soon — ready — by hmellor (合并于: 2026-02-10 20:08 (UTC+8)) [+0/-5, 1 files | commented:1 approved:1] Slow tokenizers do not exist in Transformers v5. This PR pre-emptively removes the test that checks that slow tokenizers load slowly.
- #34126 Add flagos in MiniCPM-o — 无标签 — by tc-mb (合并于: 2026-02-10 18:51 (UTC+8)) [💬2 | +42/-0, 1 files | commented:4 approved:1] Add flagos in MiniCPM-o. I invited members of flagos to participate in the discussion of merging the operator library. See if there is an elegant way to merge.
- #34215 [Misc] allow specify is_mm_prefix_lm in hf_config — ready — by lkhphuc (合并于: 2026-02-10 19:16 (UTC+8)) [💬2 | +3/-0, 1 files | commented:2 approved:1]
## Purpose
Currently, whether to use PrefixLM is hardcoded into the list MM_PREFIX_LM_MODELS. This PR enables a model’s hf_config to also specify this behaviour with a flag of the same name. If no flag is given, it falls back to the hardcoded list.
## Test Plan
Locally extend some models’ `config.json` with a field `is_mm_prefix_lm` and print the `image_doc_range` in `gpu_model_runner.py` to confirm prefixLM is turned on and off correctly.
```python if self.is_mm_p…
- #34219 [Bugfix] Fix FI kernel `chunk_gated_delta_rule` output shape for Qwen3.5 — bug,ready,qwen — by ywang96 (合并于: 2026-02-10 18:41 (UTC+8)) [+3/-1, 1 files | commented:1 approved:2]
## Purpose
## Test Plan
## Test Result
...
- #34137 [Docs] Fix format error in KV load failure recovery doc — documentation — by zzaebok (合并于: 2026-02-10 18:16 (UTC+8)) [💬3 | +1/-0, 1 files | commented:1 approved:2]
## Purpose
In the `KV Load Failure Recovery Test` document, a code fence was missing its closing backticks. Fixed it by adding the closing fence.
https://docs.vllm.ai/en/latest/examples/offline_inference/kv_load_failure_recovery/#files
…
- #34218 [Bugfix] Fix `--trust-remote-code` conflict — bug,documentation,performance,speculative-decoding,ready — by DarkLight1337 (合并于: 2026-02-10 16:29 (UTC+8)) [💬2 | +2/-16, 2 files | commented:2]
## Purpose
Remove `--trust-remote-code` from https://github.com/vllm-project/vllm/pull/34188 as well…
Remove the import fallback from the spec decode example to avoid silently incorrect CLI parsing.
## Test Plan
…
- #32022 [Bugfix] Fix memory inconsistency in cross-process shared memory — bug,ready — by slippersss (合并于: 2026-02-10 16:22 (UTC+8)) [💬9 | +6/-0, 1 files | commented:1 approved:2]
## Purpose
This PR aims to fix a potential memory inconsistency in the current cross-process shared memory; the related issue is described in #27858. It is probably due to the weak memory ordering of some CPU architectures: readers may read `metadata_buffer[0] = 1` before `buf` is fully written. We use `memory_fence()` to ensure the order of the buffer (i.e. `buf`) and flag (i.e. `metadata_buffer[0]`) writes.
## Test Plan
This issue occurs intermittently, so we need long-running high-concurrency te…
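The write-then-flag protocol at issue can be sketched as follows (a simplified single-process illustration with hypothetical helper names; real cross-process code needs a hardware memory fence between the two writes on weakly ordered CPUs, which pure Python cannot express):

```python
from multiprocessing import shared_memory

PAYLOAD_SIZE = 64

def publish(buf, data: bytes) -> None:
    """Writer side: fill the payload first, then set the ready flag.
    On weakly ordered CPUs a memory fence belongs between the two writes."""
    buf[1:1 + PAYLOAD_SIZE] = data.ljust(PAYLOAD_SIZE, b"\x00")
    # memory_fence() would go here
    buf[0] = 1

def consume(buf):
    """Reader side: only touch the payload once the flag is visible."""
    if buf[0] != 1:
        return None
    return bytes(buf[1:1 + PAYLOAD_SIZE]).rstrip(b"\x00")

shm = shared_memory.SharedMemory(create=True, size=1 + PAYLOAD_SIZE)
try:
    assert consume(shm.buf) is None  # flag not yet set
    publish(shm.buf, b"hello")
    print(consume(shm.buf))
finally:
    shm.close()
    shm.unlink()
```

Without the fence, a reader on another core may observe `buf[0] == 1` while the payload bytes are still stale, which is exactly the intermittent inconsistency the PR fixes.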
- #34216 Revert #34208 — performance — by DarkLight1337 (合并于: 2026-02-10 15:59 (UTC+8)) [💬2 | +5/-5, 2 files | commented:1]
This reverts commit f69b903b4c70716224b3936cb8503e562e25388e.
## Purpose
The fix is incorrect and actually broke `vllm serve`.
## Test Plan
…
- #34208 [Bugfix] Add `--trust-remote-code` to dataset bench args — bug,performance,ready — by DarkLight1337 (合并于: 2026-02-10 14:37 (UTC+8)) [💬4 | +5/-5, 2 files | commented:1 approved:1]
## Purpose
FIX example tests failing https://buildkite.com/vllm/ci/builds/50767/steps/canvas?jid=019c446c-a9ee-4d8f-a856-a222915b4438&tab=output
#32300 required `args.trust_remote_code` when loading the dataset, which isn’t passed by the spec decode examples. To ensure that this argument is passed, I have moved the `--trust-remote-code` flag to `add_dataset_parser`.
## Test Plan
…
- #32975 [Perf] Optimize detokenizer python logic — ready,v1 — by yewentao256 (合并于: 2026-02-10 15:54 (UTC+8)) [💬6 | +10/-6, 2 files | commented:2 approved:1]
## Purpose
- adding a `num_output_tokens` function, to avoid `len(self.output_token_ids)`, which might do a slice in SlowIncrementalDetokenizer
```py class SlowIncrementalDetokenizer(BaseIncrementalDetokenizer): @property def output_token_ids(self) -> list[int]: return ( self.token_ids …
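The pattern being optimized can be illustrated in isolation (a hypothetical minimal class, not vLLM's actual detokenizer): a property that slices on each access makes `len(...)` allocate a copy every call, while a dedicated counter stays O(1).

```python
class SlowDetokenizer:
    """Sketch: output_token_ids builds a new list on every access,
    so len(d.output_token_ids) costs O(n) per call."""
    def __init__(self, prompt_len: int, token_ids: list):
        self.prompt_len = prompt_len
        self.token_ids = token_ids

    @property
    def output_token_ids(self) -> list:
        return self.token_ids[self.prompt_len:]  # slice allocates a copy

    def num_output_tokens(self) -> int:
        return len(self.token_ids) - self.prompt_len  # O(1), no copy

d = SlowDetokenizer(prompt_len=3, token_ids=[1, 2, 3, 10, 11])
print(len(d.output_token_ids), d.num_output_tokens())  # → 2 2
```

In a per-token hot loop like detokenization, replacing `len(property)` with the arithmetic counter removes one list allocation per generated token.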
- #29387 [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers — bug,documentation,ready,v1 — by heheda12345 (合并于: 2026-02-10 15:41 (UTC+8)) [💬8 | +78/-3, 6 files | commented:8 approved:1]
## Purpose
Mamba layers shouldn’t treat blocks computed in the current step as a prefix cache hit:
Now suppose block_size = 1. We have two requests:
- req1: tokens = [A, B, C, D]
- req2: tokens = [A, B, C, E]
If these two requests are scheduled in the same step, then:
- The cache-hit length of the first request is 0, which is fine.
- The cache-hit length of the second request is 3, meaning that in this scheduling step, it is treated as if the three tokens A, B, C have already been computed, and…
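The scenario above can be sketched as a cache-hit computation that must exclude blocks written only in the current scheduling step (hypothetical data structures, not vLLM's scheduler):

```python
def cache_hit_length(tokens, cached_blocks, current_step_blocks, block_size=1):
    """Count leading tokens whose blocks are cached, refusing hits on
    blocks that were only computed in the current scheduling step."""
    hits = 0
    for i in range(0, len(tokens), block_size):
        block_key = tuple(tokens[:i + block_size])
        if block_key in cached_blocks and block_key not in current_step_blocks:
            hits += block_size
        else:
            break
    return hits

# req1 [A,B,C,D] is scheduled this step and writes blocks for A, AB, ABC, ABCD.
current = {("A",), ("A", "B"), ("A", "B", "C"), ("A", "B", "C", "D")}
cached = set(current)
# req2 [A,B,C,E] in the same step must NOT treat A,B,C as already computed:
print(cache_hit_length(["A", "B", "C", "E"], cached, current))  # → 0
# In a later step, the same blocks are legitimate hits:
print(cache_hit_length(["A", "B", "C", "E"], cached, set()))    # → 3
```

For mamba-style layers this exclusion matters because the recurrent state for those blocks does not exist yet when the "hit" is claimed within the same step.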
- #34123 [Frontend][CI] Consolidate instrumentator entrypoints — frontend,ready,ci/build — by noooop (合并于: 2026-02-10 15:30 (UTC+8)) [💬2 | +64/-74, 16 files | commented:9 approved:1]
## Purpose
- mv vllm/entrypoints/openai/basic/ -> vllm/entrypoints/serve/instrumentator/basic.py
- For generative models only, attach disagg_router and rlhf_router and elastic_ep route.
- Aggregate instrumentator-related tests into entrypoints/instrumentator
address https://github.com/vllm-project/vllm/pull/33158#discussion_r2734744964
vllm serve BAAI/bge-base-en …
- #34190 [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 — bug,ready — by jaim12005 (合并于: 2026-02-10 15:06 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1]
## Summary
PR #33491 added sorting of `hf_weights_files` in `safetensors_weights_iterator`, but the same fix was not applied to `fastsafetensors_weights_iterator`. This PR applies the identical fix to the fastsafetensors path, preventing NCCL crashes on multi-node tensor parallel deployments.
Fixes #34180
## Context
PR #33491 (already merged) sorted `hf_weights_files` using `_natural_sort_key` in `safetensors_weights_iterator` to ensure a deterministic loading order. The `fastsafetensors_weigh…
- #34187 [Bugfix] Fix DP Attention Padding in Dummy Run — bug,ready,v1 — by LucasWilkinson (合并于: 2026-02-10 13:29 (UTC+8)) [+1/-0, 1 files | approved:1 commented:1]
Mirror of #34009 by @benchislett — “Maintainers are allowed to edit this pull request.” was not enabled on the original PR, so pushing review fixes was not possible.
All credit to @benchislett for the original fix.
## Purpose
FIX #32626 FIX #33450
Problem: TRTLLM attention requires that num_decode_tokens be divisible by num_requests. However, during DP we sometimes do a dummy run on one of the workers so they don’t get out of sync; in such cases, we pad for attention but the attention metad…
- #34204 [CI/Build] Relax `test_mcp_tool_call` — ready — by DarkLight1337 (合并于: 2026-02-10 13:18 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1]
## Purpose
FIX flaky https://buildkite.com/vllm/ci/builds/50601/steps/canvas?jid=019c3df1-35a8-4acd-a443-57f3e4449d1e&tab=output; the output is still reasonable.
## Test Plan
## Test Result
…
- #34148 [Doc] Update usage of `--limit-mm-per-prompt` — documentation,ready — by DarkLight1337 (合并于: 2026-02-10 13:12 (UTC+8)) [💬2 | +10/-10, 6 files | commented:2 approved:2]
## Purpose
Update `--limit-mm-per-prompt` usage in docs to use the shorthand format.
## Test Plan
## Test Result …
- #34183 [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching — bug,ready,v1 — by ywang96 (合并于: 2026-02-10 13:03 (UTC+8)) [💬3 | +14/-12, 3 files | commented:6 approved:1]
## Purpose
FIXES #28726
#24964 and #27896 introduced a change so that each request cycle creates fewer GC-tracked objects, making gen-2 collections less frequent overall.
While this is okay for text-only inference where each `Request` holds very little data, for multimodal models it results in unbounded memory growth, since each `Request` can contain `mm_features` that are sometimes tens of MBs.
## Test Plan
The original issue was reproducible with this test script…
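The retention mechanism can be demonstrated in isolation (hypothetical stand-ins for `Request` and the cached block, not vLLM's classes): objects caught in a reference cycle are reclaimed only by the cyclic collector, so any large payload they hold lingers until a (now rarer) gen-2 collection runs.

```python
import gc

class Request:
    def __init__(self, mm_features):
        self.mm_features = mm_features   # large multimodal payload
        self.block = None

class Block:
    def __init__(self):
        self.request = None

gc.collect()                      # start from a clean collector state
req = Request(bytearray(1_000_000))
blk = Block()
req.block = blk
blk.request = req                 # reference cycle: Request <-> Block
del req, blk                      # refcounts never reach zero...
freed = gc.collect()              # ...only the cycle collector frees them
print(freed > 0)  # → True
```

Breaking the cycle explicitly (as the PR does) lets plain reference counting free the payload immediately instead of waiting for a gen-2 collection.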
- #34198 [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 — bug,ready,qwen — by ywang96 (合并于: 2026-02-10 11:47 (UTC+8)) [💬1 | +3/-0, 1 files | commented:1 approved:1]
## Purpose
`ChunkGatedDeltaRule` was recently added in #32846. Qwen3.5 missed it in its initialization, which causes an error since its `_forward_core` inherits from Qwen3-Next.
## Test Plan
## Test Result
…
[关闭未合并 PR]
- #18904 [V1][Metrics] Add total_tokens_in_queue (prefill + decode) — needs-rebase,stale,v1 — by sahelib25 (关闭于: 2026-02-11 10:18 (UTC+8)) [💬8 | +29/-0, 3 files | commented:2]
This PR will add the below metric to V1:

| Metric Name | Type | Unit |
| --- | --- | --- |
| total_tokens_in_queue (prefill + decode) | Gauge | Tokens |
- #24158 [GPT-OSS] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,stale,v1 — by JasonZhu1313 (关闭于: 2026-02-11 10:17 (UTC+8)) [💬11 | +159/-0, 2 files | commented:8 approved:1 changes:1]
## Purpose
When using GPT OSS with tool calling, the vLLM API requires the messages input to be a list containing a mix of message types:
- Regular user/assistant messages in dictionary format:
``` {"role": "user", "content": "What is the weather in Tokyo?"} …
- #24483 [Bugfix] add masking logits using vocab size to prevent OverflowError while detokenization — needs-rebase,stale,v1 — by ashgold (关闭于: 2026-02-11 10:17 (UTC+8)) [💬7 | +40/-7, 1 files | commented:1]
## Purpose This PR is to fix the issue: https://github.com/vllm-project/vllm/issues/24211
This error occurs when the generated token ID is not present in the vocabulary. To investigate why this issue occurs, I checked the lm_head dimension and vocabulary size of DeepSeek-R1. In config.json, vocab_size is set to 129280, but in tokenizer_config.json, the largest token ID is 118814. I know the model was trained only on datasets with token IDs ≤ 118814, and IDs ≥ 1188…
- #25303 [Misc][Metrics] expose requests preemptions in logger — documentation,new-model,rocm,frontend,speculative-decoding,ready,stale,v1,multi-modality,tool-calling — by kingsmad (关闭于: 2026-02-11 10:17 (UTC+8)) [💬4 | +53/-1, 2 files | commented:6 changes:2 approved:1]
### Summary
The purpose of this PR is to store the number of preemptions per step locally in the logger class, so it can be leveraged in child classes.
Currently, when no new blocks are available in a step, we already record this as request events and send it back to the engine client via EngineCoreResponses, which is later [aggregated](https://github.com/vllm-project/vllm/blob/6d0…
- #26635 [Feature] Add abort request endpoint to handle request cancellations — frontend,stale,v1,kv-connector — by jianzs (关闭于: 2026-02-11 10:16 (UTC+8)) [💬5 | +31/-4, 4 files | commented:2]
## Purpose
In scenarios with disaggregated prefill, if an error happens during decoding or the user manually cancels the process before the KV cache is retrieved, the KV cache stored on the prefill node can’t be released. This occurs because the HTTP stream between the prefill node and the user has already closed. To solve this issue, this pull request adds an active abort interface that proactively frees the KV cache for `delay_free` requests.
## Test Plan
## Test …
- #26685 [Frontend] Add request arrival log — frontend,stale — by ApsarasX (关闭于: 2026-02-11 10:16 (UTC+8)) [💬5 | +30/-8, 10 files | commented:2]
## Purpose
Currently, logs stating `Received request ...` are only output after the request has been encoded by the tokenizer. However, the timing of this log does not reflect the actual arrival time of the request.
Therefore, I have added an additional log to record the moment the request arrives.
## Test Plan
## Test Result
…
- #26694 [Frontend] Refactor error handling for multiple exceptions in preprocessing — frontend,stale — by ApsarasX (关闭于: 2026-02-11 10:16 (UTC+8)) [💬2 | +6/-15, 5 files | commented:1]
## Purpose
- PR #15650 forgot to modify `serving_completion.py`
- Following PR #17664, change `str(e)` to `f"{e} {e.__cause__}"` in the error handling for preprocessing prompt inputs.
## Test Plan
## Test Result
- #33777 [Bugfix]Fix gdn_attn in CUDA graph padding — bug,needs-rebase,v1,nvidia — by QilaiZhang (关闭于: 2026-02-11 08:49 (UTC+8)) [💬5 | +13/-10, 1 files | commented:1]
## Purpose
Fixes https://github.com/vllm-project/vllm/issues/31649
Description: A bug was identified where the value of `num_prefill` within the GDN attention metadata was being calculated incorrectly. This miscalculation caused the system logic to skip the crucial “prepare tensors for cudagraph” code path, ultimately leading to a process crash.
Root Cause: The bug occurred because `num_prefill` was incorrectly calculated by directly using `~spec_sequence_masks` to index `non_spec_…
- #34080 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation — bug,needs-rebase,v1 — by chizhiwei (关闭于: 2026-02-11 08:32 (UTC+8)) [💬2 | +33/-15, 1 files | commented:1]
In pipeline parallel, the bottleneck worker for KV cache allocation is determined by the minimum memory-per-layer ratio, not just the minimum available memory, because different workers may handle different numbers of layers.
This fixes an issue where max_model_len was incorrectly calculated when workers had different layer allocations.
This patch relaxes the validation check for the max_model_len parameter in pipeline parallelism scenarios. Later in the code, the GPU KV cache size and the number of b…
- #34037 [ROCm][AITER] Add fused RoPE+KVCache pass with MultiOutputPattern fix — rocm,needs-rebase,v1,gpt-oss — by spaparaju (关闭于: 2026-02-11 08:26 (UTC+8)) [💬4 | +731/-156, 16 files | commented:1]
## Summary
- Add `ROCmAiterTritonRopeReshapeKVCacheFusionPass` that fuses rotary embedding + reshape + KV cache update into a single AITER Triton kernel (`fused_qk_rope_reshape_and_cache`)
- Fix for #33666: PyTorch’s `MultiOutputPattern` only traces the first output’s subgraph for anchor node discovery. By returning `(dummy, q, k, v)` instead of `(q, k, v, dummy)`, the `unified_kv_cache_update` node (most connected subgraph) is traced first, ensuring all anchor nodes are discovered and patt…
- #28473 [Core][Spec Decode] prevent negative num_accepted tokens — stale,v1 — by linzebing (关闭于: 2026-02-11 06:52 (UTC+8)) [💬6 | +1/-1, 1 files | commented:3]
## Purpose
In the case of request preemption or abortion, `generated_token_ids` will be cleared but `scheduled_spec_token_ids` can still be non-empty. `num_accepted` will become -1 in this case.
As a result, we will crash with the following exception:
``` ERROR 11-10 13:09:12 [async_llm.py:522] AsyncLLM output_handler failed. ERROR 11-10 13:09:12 [async_llm.py:522] Traceback (most recent call last): ERROR 11-10 13:09:12 [async_llm.py:522] File “/packages/smart.inference_platform_sp.llm_predic…
- #33653 [Docs] Do not build the document if there are no docs changes — documentation,needs-rebase — by kebe7jun (关闭于: 2026-02-11 04:03 (UTC+8)) [💬8 | +152/-2, 4 files | commented:4]
## Purpose
Right now, the `docs/readthedocs.org:vllm` job runs for every PR, which causes Read the Docs to frequently report `Maximum concurrency limit reached`. Most PRs don’t touch documentation, so this ends up blocking the CI builds for PRs that do modify docs, delaying feedback and wasting time.
This PR adds a mechanism to skip Read the Docs builds when there are no documentation changes, avoiding unnecessary builds and improving overall CI efficiency.
See …
- #34267 [Draft] Testing, do not merge — ready,ci/build — by jstawinski (关闭于: 2026-02-11 03:25 (UTC+8)) [💬2 | +2/-1, 2 files]
## Purpose Authorized testing with @khluu, will close this PR once complete
...
- #30870 Remove obsolete token chunking in fused MoE kernel — 无标签 — by KonstGolfi (关闭于: 2026-02-10 20:52 (UTC+8)) [💬3 | +40/-51, 1 files | commented:3 changes:1]
### Fixes #30620
Removing obsolete token chunking in fused MoE kernel
- The chunking logic (~65k tokens) in fused_experts_impl was introduced to avoid IMA issues prior to chunked prefill. With chunked prefill now enforced upstream, this code path is no longer reachable. Removing it eliminates dead code and simplifies the fused MoE execution path without changing behavior.
- Test plan commands: 1) vllm bench throughput 2) pytest -s -v vllm/model_executor/layers/fused_moe/fused_moe.py…
- #28546 [Draft] Resumable requests for streaming — v1 — by patrickvonplaten (关闭于: 2026-02-10 20:13 (UTC+8)) [💬2 | +249/-6, 11 files | commented:2 | 📝草稿] Draft for resumable streaming by @andylolu2
- #32514 [Performance] Split FlashInfer attention and cache update — v1,nvidia — by Etelis (关闭于: 2026-02-10 20:12 (UTC+8)) [💬3 | +87/-26, 2 files | commented:1]
This PR separates the KV-cache update from the attention forward pass for the FlashInfer backend, following the pattern established in #25954 for FlashAttention.
Addresses #32335.
This change maintains backward compatibility:
- Backends with `forward_includes_kv_cache = True` (default) continue to work unchanged
- The attention layer must call `do_kv_cache_update()` before `forward()` for backends with `forward_includes_kv_cache = False`
## Related PRs
…
- #33983 [CPU][PPC64] Fix bf16 path in mla_decode.cpp — cpu — by Akashcodes732 (关闭于: 2026-02-10 13:43 (UTC+8)) [💬8 | +19/-2, 1 files | commented:4 approved:2]
## Purpose
Add BF16 kernel types for ppc64 in `mla_decode.cpp`. Similar issue seen in #33788, #30329
- #32160 [feat] Add ATOM model impl backend — new-model,rocm,needs-rebase,ci/build,v1 — by zejunchen-zejun (关闭于: 2026-02-10 13:18 (UTC+8)) [💬4 | +269/-10, 7 files | commented:7 changes:1]
This PR uses vLLM’s standard model registry mechanism and introduces ATOM as a new model impl backend for AMD GPU users. For more details, please refer to the RFC below: https://github.com/vllm-project/vllm/issues/33478