[vLLM GitHub Development Digest] 2026-02-11
[Overview]
- Time window: 2026-02-11 11:37 (UTC+8) ~ 2026-02-12 11:37 (UTC+8)
- New issues: 23 (label distribution: bug:13, ci-failure:3, feature request:3, torch.compile:2, rocm:2)
- Closed issues: 36
- New PRs: 71 (label distribution: ready:26, bug:23, v1:16, ci/build:9, documentation:8)
- Merged PRs: 44
- PRs closed without merging: 10
[New issues]
- #34395 [Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4 — bug,torch.compile,nvidia — by ProExpertProg (created: 2026-02-12 10:39 UTC+8) [💬2]
- #34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (created: 2026-02-12 02:16 UTC+8) [💬2]
  Failing test: `pytest -s -v tests/config/test_config_generation.py::test_unrecognized_env` …
- #34391 [Performance]: qknorm+rope fusion slower than unfused on H100 — help wanted,performance,torch.compile — by ProExpertProg (created: 2026-02-12 08:56 UTC+8)
  Running `vllm bench sweep serve` for `-cc.pass_config.enable_qknorm_rope_fusion` in {True, False} gives the following results … (no performance-regression report attached)
- #34356 [Bug]: FP8 MoE Backend Regression: Nemotron-3 Models Fail in vLLM 0.15.0/0.15.1 — bug — by jordidelatorre (created: 2026-02-12 00:40 UTC+8) [💬1]
  Nemotron-3 models with FP8 quantization fail to load in vLLM 0.15.0 and 0.15.1 but work correctly in vLLM 0.14.1. Environment: Rocky Linux 8.10 (Green Obsidian), kernel 4.18.0-553.51.1.el8_10.x86_64 …
- #34315 [CI Failure]: Flashinfer Kernel tests missing argument: 'num_logical_experts' — ci-failure — by wzhao18 (created: 2026-02-11 13:34 UTC+8)
  Failing test: `tests/kernels/moe/test_flashinfer.py` …
- #34380 [Bug]: GLM-5 MTP crashes with num_speculative_tokens > 1 — bug — by mgoin (created: 2026-02-12 05:55 UTC+8)
- #34357 [Bug]: MoE models with shared experts do not support DP+EP — bug — by jeejeelee (created: 2026-02-12 00:49 UTC+8) [💬2]
- #34376 [Bug]: #33731 is causing Triton build fails with gptoss — bug,rocm,gpt-oss — by amd-hhashemi (created: 2026-02-12 04:46 UTC+8) [💬1]
  Environment: Ubuntu 22.04.5 LTS (x86_64), GCC 11.4.0, Clang 20.0.0git (ROCm llvm-project roc-7.0.0) …
- #34377 [Bug]: FP8 speed regression in version 0.16.0rc2.dev87+g0b20469c6 (latest nightly) — bug — by laterbreh (created: 2026-02-12 04:50 UTC+8)
  Image: vllm/vllm-openai:nightly (commit-tagged alias: nightly-0b20469c627e94060d1015170b186d19de1db583). Substantial drop in speed on FP8 images on: …
- #34370 [Feature]: Make Prometheus histogram buckets configurable — feature request — by npache (created: 2026-02-12 03:18 UTC+8)
  Allow configuring Prometheus histogram buckets for vLLM metrics (e.g., `vllm:e2e_request_latency_seconds`) via a CLI flag or environment variable; the buckets are currently statically defined in the source code. …
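  The request amounts to turning hard-coded bucket boundaries into configuration. A minimal sketch of the idea (illustrative only: the variable name `VLLM_LATENCY_BUCKETS` and the default boundaries below are hypothetical, not vLLM's):

```python
import os

# Hypothetical defaults, in seconds; vLLM's real bucket boundaries differ.
DEFAULT_BUCKETS = (0.3, 0.5, 0.8, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0)

def latency_buckets(env_var: str = "VLLM_LATENCY_BUCKETS") -> tuple:
    """Parse histogram bucket boundaries from a comma-separated env var,
    falling back to the defaults. Boundaries are de-duplicated and sorted,
    since Prometheus requires increasing bucket bounds."""
    raw = os.environ.get(env_var)
    if not raw:
        return DEFAULT_BUCKETS
    return tuple(sorted({float(p) for p in raw.split(",") if p.strip()}))
```

  The resulting tuple would then be passed as the `buckets=` argument when constructing a `prometheus_client` Histogram.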
- #34355 [Feature]: Detect langue for Whisper — feature request — by warichet (created: 2026-02-12 00:40 UTC+8)
  Adding language detection to the Whisper model would let it automatically identify the language of the audio input before transcription. This is especially useful in multilingual environments where the input language is not always known in advance; integrating detection directly into the transcription pipeline would streamline the process and make it more efficient for…
- #34361 [Bug]: Qwen Coder Next prefix caching — bug — by Nepherpitou (created: 2026-02-12 01:42 UTC+8)
- #34351 [Installation]: MAC M1 installation fails because of bits-and-bytes — installation — by jonoillar (created: 2026-02-11 23:49 UTC+8)
  Environment: macOS 15.6.1 (arm64), Clang 17.0.0 (clang-1700.0.13.5) …
- #34345 [Bug]: Since version >=0.14.0, the key-value cache of heterogeneous graphics cards cannot be allocated correctly. — bug — by gengchaogit (created: 2026-02-11 22:59 UTC+8) [💬5]
- #34340 [Bug]: request larger than max_tokens should return http code 413 instead of 400 — bug — by qin-nz (created: 2026-02-11 22:07 UTC+8) [💬3]
  When a user sends an overly large request, vLLM currently returns HTTP 400: {"detail":"Upstream[decode] error, b'{\"error\":{\"message\":\"'max_tokens' or 'max_completion_tokens' is too large: 65536. This model's maximum context length is 262144 tokens and your request has 203462 input tokens (65536 > 262144 - 203462). None\",\"type\":\"BadRequestError\",\"param\":null,\"code\":400}}'"} …
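  The distinction being requested is between "malformed request" (400) and "request entity too large" (413). A sketch of the proposed behaviour (generic code, not vLLM's actual validation path; the numbers come from the error message above):

```python
def check_token_budget(input_tokens: int, max_tokens: int, context_len: int):
    """Return (http_status, message) for a completion request.

    Returns 413 when the request exceeds the model's context window,
    as the issue proposes, instead of the generic 400 used today."""
    budget = context_len - input_tokens
    if max_tokens > budget:
        return (
            413,
            f"'max_tokens' is too large: {max_tokens}. This model's maximum "
            f"context length is {context_len} tokens and your request has "
            f"{input_tokens} input tokens ({max_tokens} > {budget}).",
        )
    return 200, "ok"
```

  With the figures from the report, `check_token_budget(203462, 65536, 262144)` yields status 413, since 65536 > 262144 - 203462.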
- #34323 [CI Failure]: Spawned tests can fail silently — ci-failure — by almayne (created: 2026-02-11 16:50 UTC+8) [💬4]
  Failing test: `tests/models/multimodal/generation/test_whisper.py::test_models` …
- #34331 [RFC]: Ahead of time dequantization of weights for quantization emulation (OCP MX, NVFP4) on unsupported devices — rocm,RFC — by fxmarty-amd (created: 2026-02-11 18:26 UTC+8) [💬1]
  vLLM currently has some support for quantization simulation of MXFP4/MXFP6/NVFP4 models on devices that do not support these dtypes: Quark OCP MX (dense & MoE) at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/quark/schemes/quark_ocp_mx.py, and compressed-tensors NVFP4 (e.g. https://huggingface.co/RedHatAI/Qwen3-8B-NVFP4; no MoE emulation support at the moment). This is useful, for example, for resear…
- #34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default — feature request,model-bash — by robertgshaw2-redhat (created: 2026-02-11 13:02 UTC+8) [💬1]
  See previous issue in the series. …
- #34333 [RFC]: Automated baselining and degradation detection — RFC — by jmkuebler (created: 2026-02-11 19:22 UTC+8)
  TL;DR: the authors published a method to detect accuracy degradations with statistical guarantees; this RFC is a platform to discuss whether and how it could help vLLM's model-onboarding and CI processes. When onboarding a new model or optimizing performance, an end-to-end accuracy check against a baseline is very useful for catching degradations; such a baseline could be the implementation in `transformers` (in case of…
- #34305 [Bug]: gpt-oss-120b Multiple generations not working — bug — by dhineshkumar-r (created: 2026-02-11 11:42 UTC+8) [💬1]
  Generating multiple assistant completions with tool calls for the same user message and system prompt: with `n` = 2 and 3, only one completion is returned. Model: `gpt-oss-120b`, deployed in NVIDIA Dynamo with the vLLM backend (runtime image: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=0.7.0). `curl -X POST http://localhost:${SVC_PORT}/v1/chat/completions` ...
- #34326 [Bug]: --served-model-name causes model detection issues — bug — by ssendev (created: 2026-02-11 17:21 UTC+8)
  Environment: `FROM vllm/vllm-openai:nightly` with `uv pip install --system --no-cache-dir mistral-common[soundfile]` and `uv pip install --system 'vllm[audio]'` …
- #34322 [Bug]: Qwen3-Coder-Next with the qwen3_coder tool parser raises IndexError: list index out of range — bug — by MaoJianwei (created: 2026-02-11 16:42 UTC+8) [💬1]
  On v0.15.0: `vllm serve /llm_data2/Qwen3-Coder-Next/ --tensor-parallel-size 4 --served-model-name Qwen3-Coder-Next --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len auto --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' --gpu-memory-utilization 0.8` …
- #34317 [Bug]: can not install nightly wheel — bug — by JaheimLee (created: 2026-02-11 14:22 UTC+8)
[Closed issues]
- #33242 [Bug]: Qwen3-VL-30B (MoE): CUDA error 'illegal memory access' when max_model_len > 128k — bug — by Austin-Liang (closed: 2026-02-12 11:20 UTC+8) [💬3]
- #9964 [Installation]: build on arm64 meet a error — installation,stale — by gongchangsui (closed: 2026-02-12 10:19 UTC+8) [💬15]
  `collect_env.py` warns: Failed to import from vllm._C with ModuleNotFoundError ("No module named 'vllm._C'"); Triton not installed or not compatible. PyTorch 2.6.0.dev20241101+cu124 …
- #23160 [Usage]: How to get the generated_tokens before current_token in the new logits processor. — usage,stale — by johnking0099 (closed: 2026-02-12 10:18 UTC+8) [💬4]
  In version 0.10.1 (just released) the V1 engine supports logits processors, but there is no apparent way to access all tokens generated so far; is it possible? …
- #26558 CUDA illegal memory access in fused_marlin_moe with GLM-4.6-GPTQ MoE model (V1 engine, TP=4) — stale — by chriswritescode-dev (closed: 2026-02-12 10:17 UTC+8) [💬6]
  CUDA illegal memory access during model initialization with the GLM-4.6-GPTQ-Int4-Int8Mix MoE model when using the V1 engine with tensor parallelism. vLLM 0.11.0rc2.dev371+gaafb99a4d.d20251010; CUDA 13 (also tried 12.8, 12.9); 4x RTX 6000 Pro; Python 3.12 …
- #26762 [Usage]: about curl http://ip:8000/metrics — usage,stale — by Renoshen (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Running the command returns the standard Python GC metrics (e.g. `python_gc_objects_collected_total{generation="0"} 12286.0`, `python_gc_objects_uncollectable_tot…`)
- #26774 [Usage]: how to use vllm on CUDA 12.9 — usage,stale — by Mrpingdan (closed: 2026-02-12 10:17 UTC+8) [💬3]
  `collect_env.py` itself fails with a traceback in `get_pretty_env_info()` (/vllm-workspace/collect_env.py, line 799) ...
- #26797 [Bug]: Token id 5279552648203111001 is out of vocabulary — bug,stale — by xuelei123-xgb (closed: 2026-02-12 10:17 UTC+8) [💬3]
- #26812 [Bug]: Inconsistent Results with Seed Setting in MambaMixer2 — bug,stale — by XuanofXXX (closed: 2026-02-12 10:17 UTC+8) [💬2]
- #26825 [Performance]: QuantFP8 `forward_native()` is on par or slower than `forward_cuda()` on a B200 GPU — bug,stale — by ElizaWszola (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Environment: Ubuntu 24.04.3 LTS (x86_64) …
- #26843 [Bug]: Undefined symbol cutlass_moe_mm_sm100 on SM120 CUDA builds (macro enabled, grouped_mm_c3x_sm100.cu not compiled) — bug,stale — by Jonahcb (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Environment: Ubuntu 24.04.3 LTS (x86_64), GCC 13.3.0 …
- #26850 [Feature]: Add new stats metrics for available_kv_cache_memory — feature request,stale — by MML-coder (closed: 2026-02-12 10:17 UTC+8) [💬2]
  vLLM logs "Available KV cache memory: XX.XX GiB" during engine initialization, but this value is not exposed as a Prometheus metric. The log-only approach gives point-in-time visibility at startup and does not enable continuous monitoring, which makes it difficult to do capacity planning when users want to figure out the actual GPU…
- #32391 [CI Failure]: shellcheck hook fails with find syntax error and causes silent failures — bug — by junuxyz (closed: 2026-02-12 08:56 UTC+8) [💬1]
  Not environment-specific: the `shellcheck` script fails because of incorrect usage of `.git` in the `find` command syntax (`pre-commit run --all-files`) …
- #34315 [CI Failure]: Flashinfer Kernel tests missing argument: 'num_logical_experts' — ci-failure — by wzhao18 (closed: 2026-02-12 06:17 UTC+8)
  Failing test: `tests/kernels/moe/test_flashinfer.py` …
- #33962 [Bug]: Realtime transcription completes but server never sends transcription.done when final commit is received during slow streaming — bug — by pjs102793 (closed: 2026-02-12 05:01 UTC+8)
  Environment: Ubuntu 24.04.1 LTS (x86_64), GCC 13.3.0 …
- #33252 [Feature]: Add Deepseek OCR 2 support to the latest version — feature request — by ItzAmirreza (closed: 2026-02-12 04:51 UTC+8) [💬2]
  https://huggingface.co/deepseek-ai/DeepSeek-OCR-2 …
- #30933 [Usage]: What is the latest instruction to run DeepSeek V3.2? — usage — by IKACE (closed: 2026-02-12 04:51 UTC+8) [💬2]
  Following https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html with vLLM 0.12.0 installed on an H200 node, `vllm serve deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32` fails with `(APIServer pid=816209) ValueError: No tokenizer regis…`
- #27075 [Bug]: gpt-oss poor performance — bug,stale — by seindum (closed: 2026-02-12 04:30 UTC+8) [💬2]
- #33895 [Bug]: /tokenize hangs after large request — bug — by JakubCerven (closed: 2026-02-12 02:36 UTC+8) [💬4]
  Running `vllm/vllm-openai:v0.15.0` with QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ: after a large number of `/tokenize` requests, the endpoint becomes unresponsive and hangs forever …
- #34164 [CI Failure]: mi325_4: LM Eval Large Models — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:17 UTC+8) [💬1]
  Failing test: `pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4` …
- #29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:16 UTC+8) [💬11]
  Failing test: `TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py && TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py && DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py && pytest -v -s entrypoints/llm/test_collective_rpc.py && pytest -v -s ./compile/fullgraph/test_basic_correctness.py && pytest -v -s ./compile/test_wrapper.py && VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same n…`
- #29460 [CI Failure]: mi325_1: Language Models Test (MTEB) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬4]
  Failing test: `pytest -v -s models/language/pooling_mteb_test` …
- #29509 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 3 — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬10]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'` …
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬13]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git; pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'` …
- #29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬19]
  Failing test: `pytest -v -s models/language/pooling -m 'not core_model'` …
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬10]
  Failing test: `pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py` …
- #34160 [CI Failure]: mi325_1: Entrypoints Unit Tests — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬1]
  Failing test: `pytest -s -v tests/entrypoints/openai/tool_parsers/test_openai_tool_parser.py` …
- #33809 [CI Failure]: Kernels MoE Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬3]
  Failing test: `pytest -v -s kernels/moe` …
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬10]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pip freeze | grep -E 'torch' && pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing && cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model` …
- #32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:07 UTC+8) [💬5]
  Failing test: `pytest -v -s tests/entrypoints/openai/responses` …
- #34355 [Feature]: Detect langue for Whisper — feature request — by warichet (closed: 2026-02-12 02:00 UTC+8)
  Same request as the new-issue entry above: add automatic language detection to the Whisper transcription pipeline. …
- #34156 [Bug]: AttributeError: 'Glm46VImageProcessorFast' object has no attribute 'get_number_of_image_patches' — bug — by regmibijay (closed: 2026-02-11 22:09 UTC+8) [💬1]
- #34237 [Bug]: Responses API with Harmony is broken for structured content inputs — bug — by Kimahriman (closed: 2026-02-11 21:14 UTC+8)
- #34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default — feature request,model-bash — by robertgshaw2-redhat (closed: 2026-02-11 19:59 UTC+8) [💬1]
- #34317 [Bug]: can not install nightly wheel — bug — by JaheimLee (closed: 2026-02-11 14:28 UTC+8)
- #34288 [Model Bash][DeepSeek]: Remove Bias Casts in DSR1 NVFP4 TRTLLM — feature request,model-bash — by robertgshaw2-redhat (closed: 2026-02-11 13:01 UTC+8) [💬1]
  See the main issue for more details. …
- #28278 [Bug]: vLLM v0.11.0 container in k8s pod Fails to Load GLM-4.6-FP8 Model During CUDA Graph Capture, But v0.10.2 is OK. — bug,stale — by AndrewTsao (closed: 2026-02-11 12:49 UTC+8) [💬2]
[New PRs]
- #34397 [bugfix] refactor FunASR's _get_data_parser — bug,ready — by AllenDou (created: 2026-02-12 10:47 UTC+8) [+7/-11, 1 file | commented:1 approved:1]
  Moves `_get_data_parser` from `FunASRMultiModalProcessor` to `FunASRProcessingInfo`, because `BaseMultiModalProcessor._get_data_parser` has been moved to `BaseProcessingInfo.build_data_parser` in v0.16. Also deletes `skip_prompt_length_check` (which returned True), since it is better to check prompt_len. @DarkLight1337 please take a look, thanks.
- #34362 [Refactor] Move validation to params definitions — ready,v1 — by DarkLight1337 (created: 2026-02-12 01:50 UTC+8) [💬2 | +264/-245, 3 files | commented:2 approved:1]
  Moves `SamplingParams` validation from `InputProcessor` to `SamplingParams`, similar to how `PoolingParams` does it.
- #34398 [new model] add COLQwen3 code & Inference — documentation,new-model,multi-modality,qwen — by craftsangjae (created: 2026-02-12 10:57 UTC+8) [💬3 | +864/-0, 9 files | commented:2]
  Adds native support for ColQwen3 multi-modal late-interaction models in vLLM. ColPali (arXiv:2407.01449) introduced a ColBERT-style multi-vector retrieval approach for vision-language models; ColQwen3 extends Qwen3-VL with a linear projection head that produces per-token L2-normalized embeddings, enabling MaxSim late-interaction scoring for document retrieval and reranking across both text and image inputs. …
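  MaxSim late-interaction scoring, which this model family relies on, sums each query token's best match over all document tokens. A dependency-free sketch (illustrative only; it assumes embeddings are already L2-normalized, as the PR description states):

```python
def maxsim_score(query_emb, doc_emb):
    """ColBERT-style MaxSim: for each query token vector, take the maximum
    dot product over all document token vectors, then sum over query tokens.
    Both arguments are lists of equal-length float vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)

query = [[1.0, 0.0], [0.0, 1.0]]
relevant = [[1.0, 0.0], [0.0, 1.0]]     # matches both query tokens
unrelated = [[-1.0, 0.0], [0.0, -1.0]]  # matches neither
```

  `maxsim_score(query, relevant)` scores strictly higher than `maxsim_score(query, unrelated)`, which is what makes the per-token multi-vector representation useful for retrieval and reranking.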
- #34353 [Bugfix] fix default is_neox_style to be True for deepseekv3.2 — bug,ready,deepseek — by xyDong0223 (created: 2026-02-12 00:10 UTC+8) [+1/-1, 1 file | commented:1 approved:2]
  Resolves the default style issue with index rope: `is_neox_style` should be True when not present in the config.
- #34307 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops — rocm,ready,ci/build — by tjtanaa (created: 2026-02-11 12:17 UTC+8) [💬5 | +312/-105, 11 files | commented:2]
  Split off from https://github.com/vllm-project/vllm/pull/34244 to focus on the fusion pass only; tests were triggered on AMD CI. …
- #34354 [Bugfix] Fix step3p5 tool parser and unnecessary unstreamed tool args in serving. — bug,frontend — by mariohong128 (created: 2026-02-12 00:26 UTC+8) [💬1 | +1177/-1425, 5 files | commented:1 | draft]
  Reworks the step3.5 tool parser and its test cases for a more stable parser. In `vllm/entrypoints/openai/chat_completion/serving.py`, some parsers do not require checking unstreamed tool arguments: `qwen3coder_tool_parser` does not maintain the `streamed_args_for_tool` variables and may cause an index-out-of-range error; `qwen3xml_tool_parser` has bugs that cause duplicate parameter sending. **step3…
- #34327 TRITON_ATTN backend support int8 kvcache dtype — documentation,v1 — by EricccYang (created: 2026-02-11 17:22 UTC+8) [💬5 | +111/-38, 6 files | commented:4]
  Test plan (A100-PCIE-40G, Qwen3-4B): `vllm bench latency --model ../Qwen3-4B/ --batch-size BATCH_SIZE --input-len 512 --output-len OUT_LEN --max_model_len 2000 --num-iters 1 --gpu-memory-utilization 0.90 --num-iters-warmup 1 --attention-backend TRITON_ATTN --kv-cache-dtype KV_CACHE_TYPE` …
- #34385 [Bugfix] Fix MTP accuracy for GLM-5 — bug,speculative-decoding,ready,v1 — by mgoin (created: 2026-02-12 07:30 UTC+8) [💬1 | +18/-0, 1 file | commented:1 approved:1]
  Fixes MTP producing NaN logits for models (e.g. GLM-5) whose checkpoints don't store a duplicate `shared_head.head` weight in the MTP layer (like DeepSeek V3.2 does). The existing `_maybe_share_lm_head` only sets `self.model.lm_head`, but MTP's `compute_logits` uses `shared_head.head` inside each MTP layer, which was left uninitialized when the checkpoint omits it. This patch explicitly shares the target model's `lm_head` with each MTP layer's `shared_head.head`, matching what DeepSeek-V3.2 …
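  The fix pattern described above — alias the target model's weight wherever a checkpoint omits its own copy — can be sketched generically (plain-Python illustration; the key names follow the PR text, everything else is hypothetical):

```python
def share_missing_heads(lm_head_weight, mtp_layers):
    """For each MTP layer (modeled here as a dict), install the target
    model's lm_head weight under 'shared_head.head' when the checkpoint
    did not provide one; layers with their own copy are left untouched."""
    for layer in mtp_layers:
        if layer.get("shared_head.head") is None:
            # Alias the tensor rather than copying, mirroring how the PR
            # shares the target model's lm_head with each MTP layer.
            layer["shared_head.head"] = lm_head_weight
    return mtp_layers

layers = [{"shared_head.head": None}, {"shared_head.head": "own_copy"}]
share_missing_heads("lm_head_w", layers)
```

  Without such a fallback, the uninitialized head projects hidden states through garbage weights, which is exactly the NaN-logits symptom the PR fixes.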
- #34392 [torch.compile] Disable ar-rms fusion for ds3-fp4 — ready — by ProExpertProg (created: 2026-02-12 09:50 UTC+8) [💬1 | +42/-3, 3 files | commented:2]
  #34299 enabled AR+rms fusion by default, which causes accuracy issues on DS3-fp4; other fp4 and non-fp4 DeepSeek models appear unaffected. This disables the fusion by default for those cases. Tested in CI and locally. …
- #34396 [BugFix] Align fused MoE-LoRA kernel config with actual weight shapes — bug — by RunkaiTao (created: 2026-02-12 10:44 UTC+8) [+5/-1, 1 file | commented:1]
  The config key is derived from the LoRA weight shapes; for the GPT-OSS gate-up kernel, a mismatch occurs because the w1 and w3 expand weights are concatenated. Before this change, vLLM used the key `max_loras-1-m-rank-3072-2994` for gate-up kernels; after this change, the correct key becomes: …
- #34393 Unify MLA cache gather into gather_cache and remove split ops — v1 — by yashnib (created: 2026-02-12 09:51 UTC+8) [💬5 | +400/-281, 8 files | commented:1]
  Closes #33669. Unifies `gather_and_maybe_dequant_cache` and `cp_gather_cache` into a single kernel and torch op `gather_cache`, as requested in the feature discussion. The unified implementation handles both token-major (with optional FP8 dequant) and batch-major (CP copy-only) traversal internally, removing Python-level branching and eliminating duplicate kernel implementations.
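  The gather being unified here walks a paged KV cache and copies the selected tokens' entries into a contiguous buffer. A dependency-free sketch of a token-major gather (illustrative only; the block size and cache layout are made up, and the real op is a CUDA kernel with optional FP8 dequantization):

```python
def gather_cache(paged_cache, block_table, seq_len, block_size=4):
    """Collect the first seq_len token entries of one sequence from a
    paged cache into a contiguous list, walking token by token.

    paged_cache: list of blocks, each holding block_size entries.
    block_table: physical block index for each logical block."""
    out = []
    for token in range(seq_len):
        physical = block_table[token // block_size]
        out.append(paged_cache[physical][token % block_size])
    return out

# A sequence whose logical blocks live at physical blocks 2 then 0.
cache = [["b0_0", "b0_1", "b0_2", "b0_3"], None,
         ["b2_0", "b2_1", "b2_2", "b2_3"]]
```

  `gather_cache(cache, [2, 0], 6)` returns the six entries in logical order even though they sit in non-adjacent physical blocks, which is the core of what the fused kernel does per sequence.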
- #34394 [torch.compile] Remove duplicated split_with_sizes after RoPE — no labels — by elvischenv (created: 2026-02-12 10:10 UTC+8) [+70/-3, 2 files | commented:2]
  Consider the following ops around RoPE:
  ```
  qkv = self.qkv_proj(hidden_states)
  q, k, v = qkv.split([self.hidden_size, self.hidden_size, self.hidden_size], dim=-1)
  q, k = self.rotary_emb(positions, q, k)
  ```
  …
- #34390 [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching — no labels — by gmagogsfm (created: 2026-02-12 08:43 UTC+8) [💬1 | +1860/-121, 10 files | commented:2]
  NOTE: a manually stacked PR; each commit is reviewed separately, and only the top commit ("Add Helion autotuning infra") should be reviewed here. Uses a PyTorch HigherOrderOp `helion_kernel_wrapper_mutation` to represent a Helion kernel call without specializing to a particular config yet, which is useful for pattern matching and graph-based optimizations. In the future, this op can be further lowered into another HOP after specializing to a particular config; the new HOP …
- #34337 [GPT-OSS] Remove unnecessary contiguous — ready,gpt-oss — by elvischenv (created: 2026-02-11 21:14 UTC+8) [+0/-1, 1 file | commented:1 approved:1]
  Removes an unnecessary `contiguous` for GPT-OSS. Test result: `[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-medium_temp1.0_20260211_121731', 'metric': 0.7266414141414141}]` …
-
#34379 [BugFix] Fix DP chunking — bug,ready — by LucasWilkinson (创建于: 2026-02-12 05:23 (UTC+8)) [💬2 | +12/-2, 1 files | commented:9 approved:1] https://github.com/vllm-project/vllm/pull/32790 broke DP chunking for models with shared experts by not chunking the shared expert input
causing:
(EngineCore_DP0 pid=2405) RuntimeError: The size of tensor a (256) must match the size of tensor b (4096) at non-singleton dimension 0in
tests/v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_low_latency][Distributed Tests (2 GPUs)(H100) ](https://buildkite.com/vllm/ci/builds/51065/steps/canvas?sid=019c4b81-18e7-4893-bfec-cb087afbc… - #34386 [Quark] Fix MoE fp8 activation scale handling on mi300 — 无标签 — by BowenBao (创建于: 2026-02-12 07:38 (UTC+8)) [💬1 | +3/-3, 1 files | commented:2] Small fix follow-up after #29008 for running gpt-oss on mi300. Also ensure the fp8 scale conversion only runs when activation is fp8 quantized.
-
#34382 [BUG] Reset running requests when clearing cache for pause/resume — bug,ready,v1 — by hao-aaron (创建于: 2026-02-12 06:35 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose Addressing https://github.com/vllm-project/vllm/pull/32351#issuecomment-3870210338: in-progress requests block clearing the prefix cache in keep mode.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34389 [Custom Ops] Add functional + out variant for scaled_fp4_quant — 无标签 — by tianrengao (创建于: 2026-02-12 08:30 (UTC+8)) [+160/-77, 6 files | commented:1 | 📝草稿] Add PyTorch standard functional + out-variant pair for scaled_fp4_quant, following the same pattern as silu_and_mul.
Old schema: `scaled_fp4_quant(Tensor! output, Tensor input, Tensor! output_scale, Tensor input_scale, bool is_sf_swizzled_layout) -> ()`
New schemas: `scaled_fp4_quant(Tensor input, Tensor input_scale, bool is_sf_swizzled_layout) -> Tensor[]` and `scaled_fp4_quant.out(Tensor input, Tensor input_scale, …
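The schema change above follows the common functional/out-variant pairing for custom ops (as with `silu_and_mul`). A toy sketch of the pattern, using plain Python lists in place of tensors and illustrative names rather than vLLM's actual API:

```python
# Hypothetical sketch of a functional / out-variant op pair.
# The functional form allocates its output; the .out form writes into a
# caller-provided buffer, which lets graph passes avoid extra allocations.

def scaled_quant(values: list[float], scale: float) -> list[int]:
    """Functional variant: allocates and returns a new output."""
    out = [0] * len(values)
    scaled_quant_out(values, scale, out)
    return out

def scaled_quant_out(values: list[float], scale: float, out: list[int]) -> None:
    """Out variant: writes into a caller-provided buffer."""
    for i, v in enumerate(values):
        out[i] = round(v / scale)
```

Defining the functional form in terms of the out variant keeps the two behaviorally identical, which is what schema-level rewrites rely on.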
-
#34374 [Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA — bug,ready,nvidia — by mgoin (创建于: 2026-02-12 04:23 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1]
## Purpose
If you start dsv3.2 or glm-5 on Blackwell, DeepGEMM is not required for the linear/MoE kernels, but it is required for
`fp8_paged_mqa_logits`, so the server will crash at first inference if DeepGEMM is not installed. We should enforce this at init time.
## Test Plan
## Test Result
…
-
#34335 [Bugfix] Fix MoE quant_config not initialized under torch.compile — bug — by mgehre-amd (创建于: 2026-02-11 20:07 (UTC+8)) [💬1 | +30/-0, 2 files | commented:2] ## Summary
After the MoE Refactor (#32344), w4a16 models fail with
`AssertionError: Hidden size mismatch 2048 != 1024` under torch.compile. This is because `ensure_moe_quant_config_init()` is called in `FusedMoE.forward_native()`. When torch.compile is active, `forward_native` is traced by Dynamo, but the side effect of setting `self.quant_method.moe_quant_config` (an attribute mutation) is not replayed at runtime. This causes `moe_quant_config` to remain `None` when `DefaultMoERunner.forward_i…
-
#34388 [WIP][Spec Decode] P/D disaggregation: transfer hidden states for EAGLE warm-up — documentation,speculative-decoding,v1,kv-connector — by ZhanqiuHu (创建于: 2026-02-12 07:47 (UTC+8)) [💬1 | +1705/-16, 7 files | commented:1 | 📝草稿] ## Summary
Transfer full-prefix hidden states from the prefill instance to the decode instance via RDMA (NixlConnectorWithAux) so that EAGLE’s draft model can “prefill” its own KV cache on the first decode step, matching single-card acceptance rates.
Changes:
- NixlConnectorWithAux: subclass of NixlConnector that manages auxiliary dense tensor (hidden states) transfer alongside KV cache transfer, with slab-based GPU buffer management and backpressure.
- gpu_model_runner.py (v1): `_…
- #34384 [ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI — rocm,ready,ci/build — by micah-wil (创建于: 2026-02-12 07:29 (UTC+8)) [💬3 | +7/-7, 1 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33731 mistakenly added some test groups to the 8 GPU agent pool that only require 1 GPU in AMD CI. This PR reverts those changes.
- #34387 [ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release — rocm,ci/build — by SageMoore (创建于: 2026-02-12 07:42 (UTC+8)) [+2/-2, 1 files | commented:1]
## Purpose
After this change I’m able to build vLLM on an MI300X machine running ROCm 7.1 with the following commands.
When choosing a new amdsmi version, I just picked the most recent release on PyPI (https://pypi.org/project/amdsmi/#history)
`uv pip install -r requirements/rocm-build.txt --index-strategy unsafe-best-match` then `python setup.py develop`
## Test Plan This change should only impact developers who are building locally. I ran a simple lm_eval check with `deepseek-ai/DeepSeek-V2-Li…
-
#34383 [Ray] Propagate third-party env vars to Ray workers via prefix matching — ready,ci/build,v1 — by kouroshHakha (创建于: 2026-02-12 06:49 (UTC+8)) [💬1 | +279/-36, 5 files | commented:2] ## Problem
`get_env_vars_to_copy()` only forwarded env vars registered in `vllm.envs.environment_variables` (VLLM_* vars) plus a small hardcoded set (`HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`). Third-party env vars needed by KV connector integrations (`LMCACHE_*`, `NCCL_*`, `UCX_*`) and `PYTHONHASHSEED` were silently dropped when propagating from the EngineCore to Ray GPU workers. This caused subtle failures — e.g., LMCache workers using a different hash algorithm than the scheduler, resulting in 0…
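The prefix-matching idea can be sketched in a few lines; the prefix tuple and function name here are assumptions for illustration, not vLLM's actual implementation:

```python
# Select env vars to forward from the EngineCore to Ray workers, either by
# prefix (third-party integrations) or by an explicit allowlist.
THIRD_PARTY_PREFIXES = ("VLLM_", "LMCACHE_", "NCCL_", "UCX_")
EXTRA_VARS = {"HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "PYTHONHASHSEED"}

def env_vars_to_copy(environ: dict[str, str]) -> dict[str, str]:
    """Return the subset of environ that should be propagated to workers."""
    return {
        k: v
        for k, v in environ.items()
        if k.startswith(THIRD_PARTY_PREFIXES) or k in EXTRA_VARS
    }
```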
-
#34371 [Bugfix] Fix some issues with MoERunner PR #32344 — bug,ready — by bnellnm (创建于: 2026-02-12 03:24 (UTC+8)) [💬1 | +6/-3, 2 files | commented:2 approved:2] ## Purpose
Move the `ensure_moe_quant_config` call from `FusedMoE.forward_native` into `_moe_forward` and `_moe_forward_shared`. This is closer to how it was before, when it was hidden inside the custom op. It should avoid torch.compile issues.
Fix handling of `gate`. The `use_overlapped` flag should have been checked before returning `_gate`.
Possible fix for https://github.com/vllm-project/vllm/issues/34357
## Test Plan
…
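The core of the fix, moving lazy initialization out of the Dynamo-traced method into an eager wrapper, can be illustrated with a simplified stand-in (class and method names mirror the PR description, but the bodies are invented):

```python
# Simplified analogue: attribute mutations inside a torch.compile-traced method
# may only run at trace time, so the lazy init is done in the eager wrapper.
class FusedMoE:
    def __init__(self):
        self.moe_quant_config = None  # populated lazily on first forward

    def _ensure_moe_quant_config(self):
        if self.moe_quant_config is None:
            self.moe_quant_config = {"hidden_size": 2048}  # placeholder config

    def forward_native(self, x):
        # Traced by Dynamo under torch.compile; kept free of side effects.
        return [v * 2 for v in x]

    def _moe_forward(self, x):
        # Eager wrapper: a safe place for the attribute mutation.
        self._ensure_moe_quant_config()
        return self.forward_native(x)
```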
-
#34375 Fix #32588 — frontend — by mainnebula (创建于: 2026-02-12 04:30 (UTC+8)) [💬2 | +218/-17, 2 files | commented:1 | 📝草稿] ## What Addresses https://github.com/vllm-project/vllm/issues/32588
## Context This contribution was created by Token Steward — an initiative that redirects unused AI compute (from Claude Max subscription plans) toward open-source contributions before weekly token limits reset.
The code was generated by Claude (Anthropic) and should be reviewed like any other contribution. AI-assisted, human-reviewed.
## How to get involved
- If you maintain an o…
-
#34316 Fix CI failure - Flashinfer Kernel tests — ready,nvidia — by wzhao18 (创建于: 2026-02-11 13:36 (UTC+8)) [💬4 | +3/-0, 3 files | commented:1 approved:4]
## Purpose Fix #34315
## Test Plan Tested locally with
pytest -v -s tests/kernels/moe/test_flashinfer.py…
-
#34381 [Kernel] adopt mxfp8 grouped_gemm and grouped_quant kernel — ci/build — by EdalatiAli (创建于: 2026-02-12 06:09 (UTC+8)) [💬1 | +1283/-0, 10 files | commented:1 | 📝草稿]
## Purpose Integrate SGLang’s SM100+ expert-specialization MXFP8 blockscaled grouped kernels into vLLM so they are built, registered, importable, and test-covered in the vLLM codebase. Here is the source PR for the adopted kernels.
This PR:
- Adds `es_sm100_mxfp8_blockscaled_grouped_mm` and `es_sm100_mxfp8_blockscaled_grouped_quant` kernel sources into vLLM’s `_C` build path (CUDA-gated for SM100-compatible targets)….
-
#34378 Use paged_attention_v1 for sliding window decode in rocm_aiter_fa — rocm,v1,meta-exported,fb-exported — by iseeyuan (创建于: 2026-02-12 05:23 (UTC+8)) [+2/-29, 1 files | commented:1] Summary: Replace unified_attention (Triton) with paged_attention_v1 for the sliding window decode path in AiterFlashAttentionImpl. paged_attention_v1 already supports sliding window natively via its sliding_window parameter, so this unifies the NHD decode path for both sliding window and non-sliding window cases. The sliding window value is recovered from the flash-attn convention (self.sliding_window[0] + 1), which yields 0 (disabled) when no sliding window is configured.
Test Plan: Requires R…
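The window-size recovery mentioned above is a one-liner; a hedged sketch of the convention (the helper name is illustrative):

```python
# flash-attn stores the window as (left, right) with -1 meaning "disabled";
# paged_attention_v1 takes a single size where 0 means "no sliding window".
def recover_sliding_window(sliding_window: tuple[int, int]) -> int:
    """Convert a flash-attn style (left, right) pair to a kernel window size."""
    return sliding_window[0] + 1  # -1 (disabled) maps to 0
```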
-
#34334 [Bugfix] Fix more multimodal tests for transformers V5 — bug,ready,multi-modality,qwen — by zucchini-nlp (创建于: 2026-02-11 19:53 (UTC+8)) [💬1 | +18/-11, 5 files | commented:2 approved:2] There’s also an idefics test failing:
`tests/models/multimodal/processing/test_idefics3.py::test_processor_override[False-1-mm_processor_kwargs1-845-HuggingFaceM4/Idefics3-8B-Llama3]` I see the root issue in
`Idefics3Processor._get_prompt_updates`, which creates a processor but doesn’t use the provided `kwargs`. The kwargs get filtered out due to `cached_get_processor_without_dynamic_kwargs` and the number of placeholders is calculated incorrectly. It can be fixed by reverting https://github.com/vllm-pr… - #34349 [Benchmarks] Reduce ready checker log verbosity — performance,ready — by tomasruizt (创建于: 2026-02-11 23:43 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1]
## Summary
- The `wait_for_endpoint()` ready checker used to log the full traceback of the error on every retry (every 5 seconds), making the output very noisy and hard to read.
- Now it logs only the last line of the error (the actual exception message), which is much more compact and informative.
Before (repeated every 5s): ``` WARNING [ready_checker.py:69] Endpoint is not ready. Error=’Traceback (most recent call last): File “…/aiohttp/connector.py”, line 1268, in _wrap_creat…
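Extracting only the exception message from a traceback string, as the PR describes, can be done with a small helper (name illustrative):

```python
def compact_error(traceback_text: str) -> str:
    """Return the last non-empty line of a traceback (the exception message)."""
    lines = [ln for ln in traceback_text.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```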
-
#34373 [Feature] Enable uniform KV cache allocation for multi-group HMA models — v1,kv-connector — by Etelis (创建于: 2026-02-12 04:06 (UTC+8)) [+206/-25, 2 files | commented:1]
`use_uniform_kv_cache()` currently rejects any model with more than one KV cache group, which means hybrid-attention models (alternating full + sliding-window layers) cannot use the contiguous cross-layer layout for efficient KV transfers. This PR relaxes the single-group gate: instead of requiring exactly one group, we loop over all groups and check that they share the same backend shape and stride order.
## Test Plan
Unit tests (
tests/v1/kv_connector/unit/test_uniform_kv_cache.py) — … -
#34372 [ROCm] [CI] fix test_unrecognized_env (#34350) — rocm — by vagabond2522 (创建于: 2026-02-12 03:41 (UTC+8)) [+7/-0, 1 files | commented:1] Signed-off-by: tjtanaa tunjian.tan@embeddedllm.com
## Purpose
## Test Plan
## Test Result
…
-
#34369 Add devcontainer configuration file — 无标签 — by vagabond2522 (创建于: 2026-02-12 02:59 (UTC+8)) [+7/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#34332 fp8.py online quant: reuse layerwise reloading infra, take 3 — 无标签 — by vkuzo (创建于: 2026-02-11 19:07 (UTC+8)) [💬1 | +77/-154, 5 files | commented:1 | 📝草稿] Summary:
Copy of https://github.com/vllm-project/vllm/pull/34184
TODO write me up
Test Plan: TODO
…
-
#34366 Add language detection feature to Whisper — 无标签 — by warichet (创建于: 2026-02-12 02:35 (UTC+8)) [💬2 | +125/-20, 1 files | commented:1]
## Purpose This PR introduces an automatic language detection feature for the Whisper model. When the
`language` field is not specified in the request, the model will automatically detect the language of the audio input before transcription begins. This is achieved by predicting the most likely language token using the `<|startoftranscript|>` token in the decoder prompt. If language detection fails, the model defaults to English ("en"). The purpo…
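The detection-and-fallback flow described above can be sketched as follows; the logits dict stands in for a real decoder forward pass, and the function name is illustrative:

```python
def detect_language(lang_token_logits: dict[str, float], default: str = "en") -> str:
    """Pick the language token with the highest logit, falling back to English."""
    if not lang_token_logits:
        return default  # detection failed -> default to "en"
    return max(lang_token_logits, key=lang_token_logits.get)
```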
-
#34368 Add initial devcontainer configuration — 无标签 — by vagabond2522 (创建于: 2026-02-12 02:51 (UTC+8)) [+4/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#34367 [BugFix] Skip null blocks when adding cached blocks in current step — bug,v1 — by peakcrosser7 (创建于: 2026-02-12 02:37 (UTC+8)) [+2/-0, 1 files | commented:1 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34350 [ROCm] [CI] fix test_unrecognized_env — rocm,ready — by tjtanaa (创建于: 2026-02-11 23:46 (UTC+8)) [+10/-3, 1 files | commented:1 approved:1]
## Purpose
Fix https://buildkite.com/vllm/amd-ci/builds/4609/steps/canvas?sid=019c4d1a-4084-4c15-b40f-c0ef00b0ffd7&tab=output
```
... -
#34359 [CI] Add GPT-OSS Eval job for H100 — ready,ci/build,gpt-oss — by mgoin (创建于: 2026-02-12 01:15 (UTC+8)) [+13/-0, 1 files | commented:2]
## Purpose
Previously we only evaled gpt-oss on B200, and we’ve missed issues with triton_kernels since it isn’t the default for Blackwell. Running on Hopper will trigger the default case of triton_kernels.
## Test Plan
## Test Result
…
-
#34358 [Bugfix] Standardize getting number of image patches/tokens — bug,ready,multi-modality — by DarkLight1337 (创建于: 2026-02-12 00:58 (UTC+8)) [💬1 | +164/-226, 17 files | commented:3 approved:1] ## Purpose
- Consider `mm_kwargs` when determining the number of image tokens.
- Disallow passing `processor=None` to simplify the code.
- Fix Idefics3 and SmolVLM tests not passing `mm_kwargs` to the reference processor call.
FIX Idefics3 test in #34334
## Test Plan
…
-
#34342 [Frontend] Add automatic language detection for Whisper transcription — frontend,multi-modality — by spacecheck (创建于: 2026-02-11 22:24 (UTC+8)) [💬4 | +237/-11, 6 files | commented:2] adds features from #14174, #25750
## Purpose Add automatic language detection for Whisper transcription when no `language` parameter is specified.
- Whisper auto-detects the spoken language by running a single-token generation with `<|startoftranscript|>` as the decoder prompt and parsing the predicted... - #34364 [Fix] Fix tracing test race condition by adding server readiness check — 无标签 — by emricksini-h (创建于: 2026-02-12 02:02 (UTC+8)) [+23/-0, 1 files | commented:1] Trying to fix #34284
-
#34348 [Docs] Fix typo (“defult”) and double spacing — ready — by SorenDreano (创建于: 2026-02-11 23:23 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose Update vLLM documentation to fix two small typos (`defult` -> `default` and `-02` -> `-O2`) and a double space. vLLM will not start with `-02`.
## Test Plan Documentation-only change.
## Test Result N/A (documentation-only change; no runtime behavior affected). …
-
#34363 fix: return HTTP 413 when request exceeds max context length — frontend — by Chase-Xuu (创建于: 2026-02-12 01:58 (UTC+8)) [💬2 | +10/-2, 2 files | commented:1] ## Summary
Fixes #34340
Returns HTTP 413 (Request Entity Too Large) instead of 400 (Bad Request) when a request exceeds the model’s maximum context length.
## Changes
- Add optional `http_status` field to the `VLLMValidationError` exception class
- Return 413 when input tokens exceed `max_model_len`
…
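The status-code plumbing can be sketched as follows; the class and field names follow the PR description, but the bodies are simplified assumptions:

```python
# A validation error that carries the HTTP status the handler should return.
class VLLMValidationError(ValueError):
    def __init__(self, message: str, http_status: int = 400):
        super().__init__(message)
        self.http_status = http_status

def validate_prompt(num_tokens: int, max_model_len: int) -> None:
    """Raise 413 (Request Entity Too Large) for over-length prompts."""
    if num_tokens > max_model_len:
        raise VLLMValidationError(
            f"Prompt has {num_tokens} tokens but max_model_len is {max_model_len}",
            http_status=413,
        )
```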
-
#34347 [DRAFT] Async Allgather — ci/build,v1 — by elvircrn (创建于: 2026-02-11 23:22 (UTC+8)) [+422/-58, 3 files | commented:1 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34346 Add group quantization support to fused FP8 RMSNorm quant kernels — 无标签 — by Bias92 (创建于: 2026-02-11 23:15 (UTC+8)) [💬2 | +276/-26, 4 files | commented:1] Add group quantization to fused FP8 RMSNorm quant kernels
Closes #24629
- Add `group_size` (default `0`) to fused FP8 RMSNorm quant ops. `group_size > 0`: per-group scaling along the last dimension; `group_size == 0`: unchanged per-tensor path.
- Add 144 tests; all passed.
Tests:
tests/kernels/core/test_rms_norm_group_quant.py…
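The `group_size` semantics can be illustrated with a pure-Python toy (max-abs FP8-style scaling is an assumption here; the real kernel fuses this with RMSNorm):

```python
FP8_MAX = 448.0  # max representable magnitude of float8_e4m3

def compute_scales(row: list[float], group_size: int) -> list[float]:
    """group_size == 0: one per-tensor scale; > 0: one scale per group."""
    if group_size == 0:
        return [max(abs(v) for v in row) / FP8_MAX]
    assert len(row) % group_size == 0, "row length must divide evenly"
    return [
        max(abs(v) for v in row[i : i + group_size]) / FP8_MAX
        for i in range(0, len(row), group_size)
    ]
```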
- #34360 Revert “[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner” — documentation,v1 — by robertgshaw2-redhat (创建于: 2026-02-12 01:16 (UTC+8)) [💬1 | +753/-913, 25 files | commented:1] Reverts vllm-project/vllm#32344
- #34344 [Model Bash][DeepSeekR1] Remove Shared Expert Clone — performance,ready,deepseek — by robertgshaw2-redhat (创建于: 2026-02-11 22:59 (UTC+8)) [+4/-8, 2 files | commented:1]
## Purpose
- previously, we required cloning the shared expert input because some models used inplace. For instance, in a previous PR fixing a similar issue (https://github.com/vllm-project/vllm/pull/28942/changes), we had seen certain methods applying inplace unilaterally (https://github.com/zhyajie/vllm/blob/2a16631527f17bb550bc2137ae189ae3ddb74283/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L404-L416)
- we now have systematically disabled inplace if there is a shared …
-
#34330 [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` — ready,multi-modality,llama,qwen,deepseek — by Isotr0py (创建于: 2026-02-11 18:01 (UTC+8)) [💬3 | +131/-27, 72 files | commented:4 approved:1] ## Purpose
- Currently, to enable/disable long video comprehension in Qwen3-VL-style models, users have to increase/decrease `longest_edge` for the hf_processor with `mm_processor_kwargs`.
- However, these fields won’t be reflected in dummy inputs during profiling, which can cause inaccurate profiling results.
- This PR exposes `mm_processor_kwargs` for `DummyInputsBuilder` to reflect the config.
## Test Plan ``` vllm serve Qwen/Qwen3-VL-4B-Instruct --enforce-eager --mm-pro…
-
#34312 kv connector: Add mooncake store connector — kv-connector — by snadampal (创建于: 2026-02-11 13:04 (UTC+8)) [💬1 | +2217/-0, 11 files | commented:1 | 📝草稿]
This PR adds a mooncake store connector to vLLM’s KV connectors. The store connector acts as an extension to the local cache and co-exists with nixl for p2p. This has been ported from the vllm-ascend repo and the connector has been made generic for any mooncake transfer engine transport protocol.
ported from: https://github.com/vllm-project/vllm-ascend/tree/main/vllm_ascend/distributed/kv_transfer/kv_pool
## Purpose To support KV cache offloading from vllm node to external c…
-
#34339 [Bugfix] Fix kv_load_failure_recovery example in sync mode — bug,documentation,kv-connector — by sykzhong (创建于: 2026-02-11 21:57 (UTC+8)) [💬3 | +1/-0, 1 files | commented:1]
## Purpose
Fix the kv_load_failure_recovery example to work correctly in sync mode. When async_scheduling is not explicitly set, vLLM automatically enables async scheduling by default (unless incompatible configurations are detected). However, async scheduling is incompatible with the sync mode of kv_load_failure_recovery, causing the example to fail.
### Changes
Explicitly set async_scheduling=False in sync mode to prevent automatic async scheduling activation….
-
#34336 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (创建于: 2026-02-11 20:52 (UTC+8)) [+2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#34352 Don’t try and run GLM-ASR with remote code — 无标签 — by hmellor (创建于: 2026-02-11 23:54 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] GLM-ASR has been upstreamed to Transformers in https://github.com/huggingface/transformers/commit/a7f29523361b2cc12e51c1f5133d95f122f6f45c which was first released in
`v5.0.0rc2`. Therefore we should remove this kwarg from our test. (The test will not actually run in this PR because it’s still gated on Transformers v5.)
-
#34329 [Refactor] Pass Renderer to Input Processor — frontend,ready,v1 — by DarkLight1337 (创建于: 2026-02-11 17:48 (UTC+8)) [💬2 | +107/-106, 20 files | commented:3 approved:1] ## Purpose
Prepare for next Renderer refactor.
Also improve consistency between how attributes are being accessed between offline and online APIs.
- Importantly, this removes `max_model_len` from online serving; it should be accessed via `model_config.max_model_len`. I think this makes the code a bit clearer.
## Test Plan
## Test Result …
- #34343 Lazy imports for model loader classes in model_loader/init.py — meta-exported,fb-exported — by klintqinami (创建于: 2026-02-11 22:26 (UTC+8)) [+128/-29, 1 files | commented:1]
Summary:
`model_loader/__init__.py` eagerly imports all 7 loader classes (`BitsAndBytesModelLoader`, `DefaultModelLoader`, `DummyModelLoader`, `GGUFModelLoader`, `RunaiModelStreamerLoader`, `ShardedStateLoader`, `TensorizerLoader`) at module scope. These are only needed when `get_model_loader()` is called with a matching format, but the eager imports mean that any code importing from the `model_loader` package (e.g. `from vllm.model_executor.model_loader.weight_utils import default_weight_loader…
-
#34341 Make has_triton_kernels() handle import failures gracefully — meta-exported,fb-exported — by klintqinami (创建于: 2026-02-11 22:08 (UTC+8)) [+4/-1, 1 files | commented:1] Summary:
`has_triton_kernels()` uses `find_spec()` to check if the `triton_kernels` package exists, then eagerly calls `import_triton_kernels()`. On MTIA, the package is findable (it exists in the build) but fails to import because it depends on GPU-specific Triton modules like `triton.language.target_info` that don’t exist in MTIA Triton. This causes a crash instead of the function returning `False`. The fix wraps the
`import_triton_kernels()` call in a try/except so the function correctly rep…
-
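The described fix amounts to treating "findable but not importable" the same as "absent"; a self-contained sketch (the import helper is stubbed here, the real one imports the package):

```python
from importlib.util import find_spec

def import_triton_kernels():
    import triton_kernels  # may raise on platforms like MTIA
    return triton_kernels

def has_triton_kernels() -> bool:
    """Feature-detect triton_kernels without crashing on broken installs."""
    if find_spec("triton_kernels") is None:
        return False
    try:
        import_triton_kernels()
    except Exception:
        return False  # package present but not importable on this platform
    return True
```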
#34309 Add new sections to CODEOWNERS — ci/build — by DarkLight1337 (创建于: 2026-02-11 12:40 (UTC+8)) [💬1 | +27/-11, 1 files | commented:1] ## Purpose
- Create new section in CODEOWNERS for API-related and IO Processing-related code
- Split up `entrypoints/` and add various people that I think are suitable
- Add `tokenizers/` and `renderers/` to CODEOWNERS, assigning myself and @njhill
- Add `sampling_params.py` to CODEOWNERS to be consistent with `pooling_params.py`, and assigned @njhill
- Fixed some outdated directories in CODEOWNERS
## Test Plan
…
- #34338 [Bugfix][CI] Fix spawned tests — bug — by NickLucche (创建于: 2026-02-11 21:45 (UTC+8)) [+21/-4, 1 files | commented:1] Attempt to fix https://github.com/vllm-project/vllm/issues/34323 more generally. Do not merge until a full CI run
-
#34324 Fixed whisper CPU test that does not spawn properly. — multi-modality — by almayne (创建于: 2026-02-11 17:00 (UTC+8)) [💬5 | +0/-1, 1 files | commented:1 approved:1]
Partially addresses #34323
## Purpose
Fix a test that isn’t being run properly. This issue describes the problem with using the annotation
`@create_new_process_for_each_test("spawn")`. This change will prevent spawning or forking of the test.
## Test Plan
…
-
#34325 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (创建于: 2026-02-11 17:08 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#34321 [Bugfix][CPU] Fix llama4 inference on CPU — bug,ready,v1,llama,cpu — by bigPYJ1151 (创建于: 2026-02-11 16:30 (UTC+8)) [💬3 | +60/-18, 6 files | commented:1 approved:1]
## Purpose
- Fix llama4 inference on CPU
## Test Plan
## Test Result
…
-
#34328 [KV Connector] Support using FlexKV as KV Cache Offloading option. — documentation,kv-connector — by feiqiangs (创建于: 2026-02-11 17:38 (UTC+8)) [💬2 | +445/-0, 4 files | commented:2] FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud’s TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching and a distributed KVCache pool to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we can achieve the following improvement:
ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by…
-
#34306 [CI][Bugfix] add regression test for GDN fused_recurrent kernel — bug,v1,qwen — by CarstyYou (创建于: 2026-02-11 11:59 (UTC+8)) [💬2 | +542/-8, 3 files | commented:1]
## Purpose
Fix illegal memory access in the
`fused_recurrent` kernel when handling `PAD_SLOT_ID (-1)` for padded sequences (Issue #31186). Root Cause: The
`fused_recurrent_gated_delta_rule_fwd_kernel` was not properly handling `PAD_SLOT_ID (-1)` when storing final states in continuous batching mode. When processing padded sequences in CUDA Graph scenarios (especially with Qwen3-Next MTP), the kernel would attem… - #34319 [Doc] Update Marlin support matrix for Turing — documentation,ready — by iori2333 (创建于: 2026-02-11 16:16 (UTC+8)) [💬1 | +5/-4, 2 files | commented:1 approved:2] #29901 adds Marlin support for SM75 devices. This PR updates the support matrix of Marlin documentation.
-
#34308 [Chore] Move `BaseRenderer` to `base.py` — ready — by DarkLight1337 (创建于: 2026-02-11 12:24 (UTC+8)) [+7/-7, 8 files | commented:1 approved:1] ## Purpose
It’s not a protocol anymore.
## Test Plan
## Test Result
…
-
#34320 [Bugfix] Fix Dynamo unexpected keyword argument — bug — by samutamm (创建于: 2026-02-11 16:27 (UTC+8)) [+3/-3, 1 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when the CustomOp
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. Current main branch raises error:
``` (EngineCore_DP0 pid=565) File “/app/vllm/vllm/v1/executor/multiproc_executor.py”, line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#34318 [bugfix] Fix Dynamo unexpected keyword argument — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by samutamm (创建于: 2026-02-11 15:42 (UTC+8)) [💬3 | +6279/-1686, 114 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when the CustomOp
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. Current main branch raises error:
``` (EngineCore_DP0 pid=565) File “/app/vllm/vllm/v1/executor/multiproc_executor.py”, line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#34313 [Bugfix] Fix weight naming in Qwen3.5 — bug,qwen — by ywang96 (创建于: 2026-02-11 13:04 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... -
#34314 [Bugfix] Fix Fp8OnlineMoEMethod crash when weights are on CPU — bug — by aarkharov (创建于: 2026-02-11 13:15 (UTC+8)) [💬3 | +138/-9, 2 files | commented:1] ## Summary
When an external plugin (e.g.
`online_fp8`) creates MoE weight tensors on CPU, the streaming quantization path in `Fp8OnlineMoEMethod.patched_weight_loader` crashes because it calls `ops.scaled_fp8_quant` — a CUDA kernel — on CPU tensors. Root cause:
`patched_weight_loader` unconditionally calls `self.process_weights_after_loading(layer)` when all weight chunks are loaded, and sets `_already_called_process_weights_after_loading = True`. When this happens: Large MoE model ma…
-
#34310 fix: correct chunk start times for verbose transcription timestamps — frontend — by saikrishnarallabandi (创建于: 2026-02-11 12:42 (UTC+8)) [💬2 | +220/-14, 2 files | commented:1] ## Summary
Fix timestamp drift in chunked audio transcription by using actual chunk boundary times instead of nominal fixed offsets.
Fixes #29350
## Problem
When transcribing long audio (>30s), segment timestamps progressively drift because:
…
[已合并 PR]
-
#34362 [Refactor] Move validation to params definitions — ready,v1 — by DarkLight1337 (合并于: 2026-02-12 11:33 (UTC+8)) [💬2 | +264/-245, 3 files | commented:2 approved:1] ## Purpose
Move
`SamplingParams` validation from `InputProcessor` to `SamplingParams`, similar to how `PoolingParams` does it.
## Test Plan
## Test Result
... -
#33848 [Bug Fix] Fix
`naive_block_assignment` always defaulting to False due to arg misalignment — bug,ready — by RunkaiTao (合并于: 2026-02-12 11:30 (UTC+8)) [💬1 | +2/-1, 2 files | commented:1 approved:2] ## Purpose Fix a bug where `naive_block_assignment` always defaulted to False due to arg misalignment.
## Test Result gpt-oss 120b max_loras=8, concurrency=1
before ``` ============ Serving Benchmark Result ============ Successful requests: 40
… -
#34353 [Bugfix] fix default is_neox_style to be True for deepseekv3.2 — bug,ready,deepseek — by xyDong0223 (合并于: 2026-02-12 02:20 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose To resolve the default style issue with index rope, neox should be True when it’s not present in config. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34385 [Bugfix] Fix MTP accuracy for GLM-5 — bug,speculative-decoding,ready,v1 — by mgoin (合并于: 2026-02-12 11:08 (UTC+8)) [💬1 | +18/-0, 1 files | commented:1 approved:1] ## Purpose
Fix MTP producing NaN logits for models (e.g. GLM-5) whose checkpoints don’t store a duplicate
`shared_head.head` weight in the MTP layer (like DeepSeek V3.2). The existing `_maybe_share_lm_head` only sets `self.model.lm_head`, but MTP’s `compute_logits` uses `shared_head.head` inside each MTP layer. This left it uninitialized when the checkpoint omits it. This patch explicitly shares the target model’s `lm_head` with each MTP layer’s `shared_head.head`, matching what DeepSeek-V3.2 …
#33963 [Bugfix] send None sentinel on final commit so server properly sends transcription.done — bug,frontend,ready — by pjs102793 (合并于: 2026-02-12 05:01 (UTC+8)) [💬6 | +2/-8, 2 files | commented:1 approved:2 changes:1] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/33962
Fix server not sending
`transcription.done` when the client sends `input_audio_buffer.commit` with `final: true` during real-time audio streaming. Transcription works correctly —
`transcription.delta` events are sent as expected. However, the `final` commit handler sets `_is_input_finished = True` but does not send the `None` sentinel to `audio_queue`, leaving `audio_stream_generator` blocked on `queue.get()` forever. This deadloc…
#34337 [GPT-OSS] Remove unnecessary contiguous — ready,gpt-oss — by elvischenv (合并于: 2026-02-12 04:29 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
## Purpose Remove unnecessary contiguous for GPT-OSS.
Test Result PR:
[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-medium_temp1.0_20260211_121731', 'metric': 0.7266414141414141}]…
-
#33843 [Refactor] Replace
`activation: str` with `MoEActivation` enum — performance,rocm,ready,cpu,gpt-oss,nvidia,ready-run-all-tests — by mgoin (合并于: 2026-02-12 09:29 (UTC+8)) [💬5 | +474/-282, 48 files | commented:10] ## Purpose
We have had `activation` defined, validated, and passed around as a raw string forever. Now that we have popular models that don’t just use the `silu` default and also need to support non-gated MoEs and their activation functions, we need to have a single source of truth for all the activation functions that exist in vLLM and which MoE kernels support which functions. I want to start with MoE since that is where we have the most divergence in support due to fused kernels. This PR in…
-
#33626 [ci] Integrate AMD tests into CI — rocm,ready,ci/build — by khluu (合并于: 2026-02-12 08:54 (UTC+8)) [💬3 | +25/-10, 6 files | commented:2 approved:2] - Enable 6 AMD test jobs into the CI pipeline as the starting point.
- Define the config interface for AMD mirror in test definition.
- #34384 [ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI — rocm,ready,ci/build — by micah-wil (合并于: 2026-02-12 07:52 (UTC+8)) [💬3 | +7/-7, 1 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33731 mistakenly added some test groups to the 8 GPU agent pool that only require 1 GPU in AMD CI. This PR reverts those changes.
-
#34371 [Bugfix] Fix some issues with MoERunner PR #32344 — bug,ready — by bnellnm (合并于: 2026-02-12 06:33 (UTC+8)) [💬1 | +6/-3, 2 files | commented:2 approved:2] ## Purpose
Move the `ensure_moe_quant_config` call from `FusedMoE.forward_native` into `_moe_forward` and `_moe_forward_shared`. This is closer to how it was before, when it was hidden inside the custom op. It should avoid torch.compile issues.
Fix handling of `gate`. The `use_overlapped` flag should have been checked before returning `_gate`.
Possible fix for https://github.com/vllm-project/vllm/issues/34357
## Test Plan
…
-
#34316 Fix CI failure - Flashinfer Kernel tests — ready,nvidia — by wzhao18 (合并于: 2026-02-12 06:17 (UTC+8)) [💬4 | +3/-0, 3 files | commented:1 approved:4]
## Purpose Fix #34315
## Test Plan Tested locally with
pytest -v -s tests/kernels/moe/test_flashinfer.py…
-
#34299 [torch.compile] Enable AR+rms fusion by default available for
`-O2` — performance,ready,torch.compile — by ProExpertProg (合并于: 2026-02-11 16:30 (UTC+8)) [💬1 | +16/-8, 2 files | commented:4 approved:1] ## Purpose Turns out AR+RMS fusion and compiling for a second range only marginally increases warm/cold start compile time, and the benefits are 5-22% (see #24252). This PR enables this fusion by default.
## Test Plan Startup, bench, eval, CI
## Test Result
### Startup ``` …
-
#34334 [Bugfix] Fix more multimodal tests for transformers V5 — bug,ready,multi-modality,qwen — by zucchini-nlp (合并于: 2026-02-12 05:02 (UTC+8)) [💬1 | +18/-11, 5 files | commented:2 approved:2] There’s also an idefics test failing:
`tests/models/multimodal/processing/test_idefics3.py::test_processor_override[False-1-mm_processor_kwargs1-845-HuggingFaceM4/Idefics3-8B-Llama3]` I see the root issue in
`Idefics3Processor._get_prompt_updates`, which creates a processor but doesn’t use the provided `kwargs`. The kwargs get filtered out due to `cached_get_processor_without_dynamic_kwargs` and the number of placeholders is calculated incorrectly. It can be fixed by reverting https://github.com/vllm-pr… - #34349 [Benchmarks] Reduce ready checker log verbosity — performance,ready — by tomasruizt (合并于: 2026-02-12 04:57 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1]
## Summary
- The `wait_for_endpoint()` ready checker used to log the full traceback of the error on every retry (every 5 seconds), making the output very noisy and hard to read.
- Now it logs only the last line of the error (the actual exception message), which is much more compact and informative.
Before (repeated every 5s): ``` WARNING [ready_checker.py:69] Endpoint is not ready. Error=’Traceback (most recent call last): File “…/aiohttp/connector.py”, line 1268, in _wrap_creat…
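The compaction described above (keeping only a traceback's final line) can be sketched as follows; the helper name is illustrative, not vLLM's actual ready-checker code.

```python
import traceback

def last_error_line(exc: BaseException) -> str:
    # A formatted traceback ends with "ExceptionType: message";
    # keep only that final line for compact retry logging.
    formatted = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return formatted.strip().splitlines()[-1]

try:
    raise ConnectionError("Cannot connect to host localhost:8000")
except ConnectionError as e:
    print(last_error_line(e))  # ConnectionError: Cannot connect to host localhost:8000
```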
-
#34279 [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides — bug,ready — by tlrmchlsmth (合并于: 2026-02-11 13:15 (UTC+8)) [💬1 | +27/-27, 1 files | approved:2 commented:1] ## Summary
- Annotate all stride parameters in `fused_moe_kernel` and `fused_moe_kernel_gptq_awq` as `tl.int64` to prevent int32 overflow in pointer arithmetic for large tensors
- This follows the same pattern already used in `fused_batched_moe.py`
- Without this fix, large problem sizes cause `illegal memory access` crashes when chunking is disabled, because the C tensor offset exceeds int32 max (~2.1 billion)
- This is a prerequisite for removing the chunking workaround (`VLLM_FUSED_MOE_CHUNK…
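To see why int64 strides matter, here is a back-of-the-envelope check (the shapes are hypothetical, not taken from the PR): a plausible fused-MoE intermediate already pushes element offsets past the int32 limit.

```python
INT32_MAX = 2**31 - 1  # ~2.1 billion

# Hypothetical fused-MoE output block: tokens x top-k experts x hidden size.
num_tokens, topk, hidden = 65_536, 8, 7_168
last_offset = num_tokens * topk * hidden  # offset just past the final element

print(last_offset)              # 3758096384
print(last_offset > INT32_MAX)  # True -> int32 pointer arithmetic would wrap
```

With int32 strides the computed pointer wraps around, producing the illegal-memory-access crashes described above; widening the stride arguments to int64 keeps the arithmetic exact.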
-
#34350 [ROCm] [CI] fix test_unrecognized_env — rocm,ready — by tjtanaa (合并于: 2026-02-12 02:50 (UTC+8)) [+10/-3, 1 files | commented:1 approved:1]
## Purpose
Fix https://buildkite.com/vllm/amd-ci/builds/4609/steps/canvas?sid=019c4d1a-4084-4c15-b40f-c0ef00b0ffd7&tab=output
```
... -
#34243 [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) — bug,ready,llama — by eldarkurtic (合并于: 2026-02-12 02:24 (UTC+8)) [💬1 | +29/-5, 1 files | commented:5 approved:1] Llama-4 weights of
`q/k_proj` are permuted during model loading to prepare the model for interleaved/gpt-neox rope. The same permutation needs to be applied to the quantization weight scales as well.
## Purpose So far, all quantized Llama-4 models used to skip attention quantization because the model's accuracy would collapse. This was mistakenly interpreted as high sensitivity of the attention layers to quantization. Luckily, this is not the case, and the accuracy collapse was …
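The interleaved-to-gpt-neox reordering, and the point that per-output-channel scales must follow the same row permutation as the weights, can be sketched like this (a simplified stand-in, not vLLM's loader code):

```python
def interleaved_to_neox_perm(n_rows: int, head_dim: int) -> list[int]:
    # Reorder (x0, y0, x1, y1, ...) rope channels within each head into
    # the two contiguous halves (x0, x1, ..., y0, y1, ...) expected by
    # gpt-neox style rotary embeddings.
    perm = []
    for head_start in range(0, n_rows, head_dim):
        perm += [head_start + 2 * i for i in range(head_dim // 2)]      # even rows
        perm += [head_start + 2 * i + 1 for i in range(head_dim // 2)]  # odd rows
    return perm

perm = interleaved_to_neox_perm(8, 4)  # two heads with head_dim 4
weight_rows = [f"w{i}" for i in range(8)]
scale_rows = [f"s{i}" for i in range(8)]
# The fix: apply the SAME permutation to per-channel scales as to weights,
# so each scale stays paired with the channel it was calibrated for.
permuted_w = [weight_rows[i] for i in perm]
permuted_s = [scale_rows[i] for i in perm]
print(perm)  # [0, 2, 1, 3, 4, 6, 5, 7]
```

If only the weights are permuted, every scale multiplies the wrong channel, which matches the accuracy collapse the PR describes.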
-
#34348 [Docs] Fix typo (“defult”) and double spacing — ready — by SorenDreano (合并于: 2026-02-12 01:02 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose Update vLLM documentation to fix two small typos (
`defult` -> `default` and `-02` -> `-O2`) and a double space. vLLM will not start with `-02`.
## Test Plan Documentation-only change.
## Test Result N/A (documentation-only change; no runtime behavior affected). …
-
#34330 [Multimodal] Expose
`mm_processor_kwargs` for `DummyInputsBuilder` — ready,multi-modality,llama,qwen,deepseek — by Isotr0py (合并于: 2026-02-12 01:37 (UTC+8)) [💬3 | +131/-27, 72 files | commented:4 approved:1] ## Purpose
- Currently, to enable/disable long video comprehension in Qwen3-VL-style models, users have to increase/decrease `longest_edge` for the hf_processor via `mm_processor_kwargs`.
- However, these fields are not reflected in the dummy inputs used during profiling, which can cause inaccurate profiling results.
- This PR exposes `mm_processor_kwargs` to `DummyInputsBuilder` so that the dummy inputs reflect the config.
## Test Plan ``` vllm serve Qwen/Qwen3-VL-4B-Instruct --enforce-eager --mm-pro…
-
#33217 [Model Runner V2] Init cuda graph pool when necessary — v1,nvidia — by xinyu-intel (合并于: 2026-02-12 01:12 (UTC+8)) [+6/-2, 2 files | commented:3 approved:1]
## Purpose
Init the CUDA graph pool only when cudagraph_mode is not NONE. This helps on platforms where torch.cuda.graph_pool_handle() is not available.
## Test Plan
## Test Result
…
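A minimal sketch of the described guard (the enum and function names are illustrative, not the exact vLLM API):

```python
import enum

class CUDAGraphMode(enum.Enum):
    NONE = "none"
    PIECEWISE = "piecewise"
    FULL = "full"

def maybe_init_graph_pool(mode: CUDAGraphMode, create_pool):
    # Only touch the CUDA-graph pool API when graphs are actually used,
    # so platforms lacking torch.cuda.graph_pool_handle() still start up.
    if mode is CUDAGraphMode.NONE:
        return None
    return create_pool()

print(maybe_init_graph_pool(CUDAGraphMode.NONE, lambda: "pool"))  # None
print(maybe_init_graph_pool(CUDAGraphMode.FULL, lambda: "pool"))  # pool
```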
-
#32458 [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… — bug,ready — by junuxyz (合并于: 2026-02-12 01:03 (UTC+8)) [💬3 | +126/-2, 2 files | commented:4 approved:1] related issue: https://github.com/vllm-project/vllm/issues/32391
## Purpose CI was not reliably running shellcheck due to an invalid `find` invocation, so issues could slip through unnoticed. This initially started as a small one-line fix, but after correcting the `find` command, CI began failing due to pre-existing shellcheck errors in the affected scripts. This PR restores a working signal by fixing the CI invocation and introducing a baseline, so CI only fails on newly introduced warnings.
## Summary
- …
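The baseline idea (fail only on newly introduced warnings) reduces to a set difference; a Python sketch of the concept, since the actual hook is a shell script:

```python
def newly_introduced(current: set[str], baseline: set[str]) -> set[str]:
    # CI fails only on warnings absent from the recorded baseline,
    # so pre-existing issues don't block unrelated changes.
    return current - baseline

baseline = {"script_a.sh:SC2086", "script_b.sh:SC2046"}
current = {"script_a.sh:SC2086", "script_c.sh:SC2164"}
print(sorted(newly_introduced(current, baseline)))  # ['script_c.sh:SC2164']
```

The trade-off of a baseline is that pre-existing warnings stay unfixed until someone burns them down deliberately, but the CI signal for new code is immediately trustworthy.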
-
#34273 [Misc] Bump
`fastsafetensors` version for latest fixes — ready,ci/build — by njhill (合并于: 2026-02-11 16:30 (UTC+8)) [+4/-5, 4 files | commented:1 approved:1] This includes in particular https://github.com/foundation-model-stack/fastsafetensors/pull/46, which is needed to avoid unnecessary memory overhead on rank 0 in multi-GPU deployments. See https://github.com/vllm-project/vllm/pull/34070.
cc @takeshi-yoshimura
-
#34217 [Frontend] Exploit tokenizers “new stream” in FastIncrementalDetokenizer — ready,v1 — by njhill (合并于: 2026-02-11 18:03 (UTC+8)) [💬1 | +17/-31, 1 files | commented:2 approved:1] Finally remembered to revisit https://github.com/vllm-project/vllm/pull/18840.
Faster incremental detokenizer initialization.
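For context, the classic incremental-detokenization approach diffs a full decode against the previously emitted prefix; the sketch below shows that baseline technique (not the tokenizers stream API this PR adopts):

```python
class PrefixDiffDetokenizer:
    # Baseline incremental detokenization: re-decode all ids and emit
    # only the new suffix. Stream-based decoders avoid this re-decode,
    # which is what makes them faster to initialize and run.
    def __init__(self, decode_fn):
        self.decode_fn = decode_fn
        self.ids: list[int] = []
        self.emitted = ""

    def step(self, token_id: int) -> str:
        self.ids.append(token_id)
        text = self.decode_fn(self.ids)
        new_text = text[len(self.emitted):]
        self.emitted = text
        return new_text

vocab = {0: "Hel", 1: "lo", 2: " world"}
detok = PrefixDiffDetokenizer(lambda ids: "".join(vocab[i] for i in ids))
print([detok.step(i) for i in (0, 1, 2)])  # ['Hel', 'lo', ' world']
```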
Includes some unrelated minor code simplifications in `detokenizer.py`.
-
#33681 [ROCm] [aiter] Split KV cache update for AiterFlashAttention — rocm,ready,v1 — by kliuae (合并于: 2026-02-12 00:26 (UTC+8)) [💬1 | +68/-40, 1 files | commented:3 approved:1]
## Purpose Supporting #32335, this PR extracts KV cache update from the attention forward pass in AiterFlashAttention.
ROCM_AITER_FA supports both flash and shuffled KV cache layouts. This PR covers both of them and uses the same flag to control the respective KV cache layouts.
## Test Plan Accuracy test with lm_eval
…
-
#33948 [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache — bug,rocm,ready — by Rohan138 (合并于: 2026-02-12 00:12 (UTC+8)) [💬7 | +27/-52, 1 files | commented:2 approved:3]
Reuse
`AttentionBackend` utils to initialize the dummy KV cache with the required shape and stride. Also, it turns out this UT was being skipped on ROCm because of a typo/missing rename, so `BACKENDS` -> `BACKENDS_FP8`.
## Purpose
## Test Plan
## Test Result
…
-
#34352 Don’t try and run GLM-ASR with remote code — 无标签 — by hmellor (合并于: 2026-02-12 00:09 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] GLM-ASR has been upstreamed to Transformers in https://github.com/huggingface/transformers/commit/a7f29523361b2cc12e51c1f5133d95f122f6f45c which was first released in
v5.0.0rc2. Therefore we should remove this kwarg from our test. (The test will not actually run in this PR because it's still gated on Transformers v5.)
- #34043 Reapply [Attention][FA3] Update FA3 to include new swizzle optimization — performance,ready,ci/build,v1,nvidia — by LucasWilkinson (合并于: 2026-02-11 23:07 (UTC+8)) [💬1 | +60/-44, 6 files | commented:1 approved:1] Reapply https://github.com/vllm-project/vllm/pull/23465 after revert in https://github.com/vllm-project/vllm/pull/33841 but with correct metadata sizes
-
#34264 Make JAIS compatible with Transformers v5 — ready — by hmellor (合并于: 2026-02-11 20:30 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
`add_cross_attention` was an attribute of `PreTrainedConfig` that was removed in https://github.com/huggingface/transformers/pull/41541. This attribute was never explicitly in the JAIS config, and the checkpoints we use for testing in CI do not set it.
-
#34268 Responses harmony system message structured — frontend,ready,gpt-oss — by Kimahriman (合并于: 2026-02-11 21:14 (UTC+8)) [💬1 | +43/-6, 2 files | commented:5 approved:2] ## Purpose Resolves https://github.com/vllm-project/vllm/issues/34237
Fix an issue where the Responses API fails when structured system message content is passed.
## Test Plan New test verifying the behavior.
## Test Result Not able to get tests to run locally. …
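Structured message content is a list of typed parts rather than a plain string; a hedged sketch of normalizing both forms (the helper name is illustrative, not vLLM's actual code):

```python
def system_content_to_text(content) -> str:
    # OpenAI-style message content may be a plain string or a list of
    # structured parts like {"type": "text", "text": "..."}. Code that
    # assumes a plain string fails on the structured form.
    if isinstance(content, str):
        return content
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

print(system_content_to_text("be brief"))                            # be brief
print(system_content_to_text([{"type": "text", "text": "be brief"}]))  # be brief
```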
-
#33715 [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE — ready,nvidia — by Linda-Stadter (合并于: 2026-02-11 20:38 (UTC+8)) [+296/-1, 7 files | commented:5 approved:2 changes:1] ## Purpose Adding tests for trtllm bf16 moe backend added in PR [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #32954
## Test Plan
- unit and integration test for the new moe backend
- unit tests for utility functions
- Change E2E tests to use fi cutlass, because the E2E tests triggered an intermittent issue with flashinfer
## Test Result
…
-
#34262 Make Qwen3VL compatible with Transformers v5 — ready,qwen — by hmellor (合并于: 2026-02-11 20:13 (UTC+8)) [💬4 | +25/-37, 2 files | commented:2 approved:1] The location of
`tie_word_embeddings` has moved in Transformers v5 for multimodal models to a more sensible place (see https://github.com/vllm-project/vllm/pull/33359 for the full explanation). This PR updates the Qwen3VL implementation to leverage the fix made in https://github.com/vllm-project/vllm/pull/33359 and moves the `vision_config`-specific check that was taking place in `Qwen3LLMModel` to `Qwen3VLForConditionalGeneration`, where the vision config is available.
-
#34321 [Bugfix][CPU] Fix llama4 inference on CPU — bug,ready,v1,llama,cpu — by bigPYJ1151 (合并于: 2026-02-11 19:07 (UTC+8)) [💬3 | +60/-18, 6 files | commented:1 approved:1]
## Purpose
- Fix llama4 inference on CPU
## Test Plan
## Test Result
…
-
#34255 [Docs] Reduce time spent generating API docs — rocm,ready,v1,multi-modality,cpu,gpt-oss,nvidia — by hmellor (合并于: 2026-02-11 18:56 (UTC+8)) [💬3 | +50/-2, 25 files | commented:1 approved:1] All time reductions are measured in my local environment which takes 520s to build the docs normally.
Changes:
- Update `git-revision-date-localized.exclude`: `argparse/*` -> `generated/*` (these were moved in the past)
  - Add `api/*` - these are always generated at build time, so the revision dates always fall back to the build date, which is meaningless
  - This saves ~20s
- Remove `show_if_no_docstring: true`:
  - "if at least one of its direct or indirect members (lower in the tree) ha…
-
#34253 Patch protobuf for CVE-2026-0994 — ready,ci/build — by eicherseiji (合并于: 2026-02-11 18:25 (UTC+8)) [+2/-2, 2 files | commented:5 approved:2] ## Purpose
Pin protobuf to exclude versions vulnerable to CVE-2026-0994.
#33619 patched this for the v0.15.1 release branch but main is still unpinned.
## Changes
- `requirements/build.txt`: pin `protobuf >= 5.29.6` and exclude vulnerable 6.30.0–6.33.4
- `requirements/common.txt`: same …
- #34319 [Doc] Update Marlin support matrix for Turing — documentation,ready — by iori2333 (合并于: 2026-02-11 17:03 (UTC+8)) [💬1 | +5/-4, 2 files | commented:1 approved:2] #29901 adds Marlin support for SM75 devices. This PR updates the support matrix of Marlin documentation.
-
#34308 [Chore] Move
`BaseRenderer` to `base.py` — ready — by DarkLight1337 (合并于: 2026-02-11 16:29 (UTC+8)) [+7/-7, 8 files | commented:1 approved:1] ## Purpose
It’s not a protocol anymore.
## Test Plan
## Test Result
…
-
#34111 [XPU][9/N] clean up existing ipex code/doc — documentation,ready,ci/build,v1,cpu — by jikunshang (合并于: 2026-02-11 16:27 (UTC+8)) [💬2 | +16/-50, 10 files | commented:1 approved:1] ## Purpose Part of https://github.com/vllm-project/vllm/issues/33214: clean up ipex and use xpu_ops instead. Also update the XPU documents.
## Test Plan CI
## Test Result
…
-
#33247 [model] support FunASR model — documentation,new-model,ready,qwen — by AllenDou (合并于: 2026-02-11 15:37 (UTC+8)) [💬9 | +1585/-3, 7 files | commented:10] Hi @WoosukKwon @DarkLight1337 , this PR adds support for the FunASR model. Could you please take a look?
server: `vllm serve allendou/Fun-ASR-Nano-2512-vllm -tp=2 --dtype=float32` (use `--dtype=float32` to achieve the highest accuracy)
client: `python3 openai_transcription_client.py --repetition_penalty=1.0` …
-
#34116 [CPU] Enable FP16 (Half dtype) support for s390x — ready,cpu — by R3hankhan123 (合并于: 2026-02-11 14:41 (UTC+8)) [💬1 | +244/-9, 3 files | commented:1 approved:1] ## Purpose Adds FP16 model inference support for the s390x (IBM Z) architecture using vectorized bit manipulation. FP16 was previously disabled on s390x, limiting users to BF16 or FP32.
## Test Plan
- Build docker image and run the server with
dtype=half - Send an inference request to the server ## Test Result
``` [root@b314lp81 vllm]# docker run --rm -p 8000:8000 local:test ibm-granite/granite-4.0-micro --port=8000 --dtype=half …
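The half-precision support rests on bit-level fp16-to-fp32 conversion; here is a scalar Python sketch of the idea (normal numbers only; the real kernels are vectorized s390x code):

```python
import struct

def fp16_bits_to_float(h: int) -> float:
    # Decode IEEE-754 binary16 fields and rebuild a binary32 value.
    # Covers normal numbers only, for brevity (no subnormals/inf/nan).
    sign = (h >> 15) & 0x1
    exponent = (h >> 10) & 0x1F
    mantissa = h & 0x3FF
    # Re-bias the exponent (15 -> 127) and widen the mantissa (10 -> 23 bits).
    f32_bits = (sign << 31) | ((exponent - 15 + 127) << 23) | (mantissa << 13)
    return struct.unpack("<f", struct.pack("<I", f32_bits))[0]

print(fp16_bits_to_float(0x3C00))  # 1.0
print(fp16_bits_to_float(0xC000))  # -2.0
```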
-
#34313 [Bugfix] Fix weight naming in Qwen3.5 — bug,qwen — by ywang96 (合并于: 2026-02-11 13:37 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... - #34298 [ModelBash][DSR1 NVFp4] Avoid Bf16 Bias Cast — performance,ready,deepseek,nvidia — by robertgshaw2-redhat (合并于: 2026-02-11 13:00 (UTC+8)) [+4/-7, 1 files | commented:4 approved:2]
## Purpose
- avoid bf16 bias conversion on the hotpath
## Test Plan
- lm eval
## Test Result ``` |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |—–|——:|—————-|—–:|———–|—|—-:|—|—–:| …
-
#34013 Threshold fix wvSplitk for occasional CI fails — rocm,ready,ci/build — by amd-hhashemi (合并于: 2026-02-11 11:59 (UTC+8)) [💬2 | +5/-2, 1 files | commented:2 approved:2]
## Purpose
## Test Plan
## Test Result
... -
#34149 [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 — bug,performance,ready — by mgehre-amd (合并于: 2026-02-11 11:58 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix
`benchmarks/kernels/benchmark_moe.py` crashing with an `AssertionError` on torch >= 2.9.
After #33375 moved the `disable_inplace()` guard from inside `fused_experts_impl`/`dispatch_fused_experts_func` to an assertion at the `fused_experts` entry point, the benchmark was not updated and still unconditionally passes `inplace=True`. This triggers: `assert not inplace or not disable_inplace()` -> `AssertionError` …
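The interplay can be sketched as follows; `disable_inplace` here is a hardcoded stand-in for vLLM's guard, and conditioning `inplace` on it is one way to honor the assertion, not necessarily the PR's exact change:

```python
def disable_inplace() -> bool:
    # Stand-in for the guard described above (active on torch >= 2.9
    # in the real code); hardcoded here for illustration.
    return True

def fused_experts(x: list, inplace: bool = False) -> list:
    # The entry-point assertion from the PR description: in-place
    # execution may only be requested when the guard allows it.
    assert not inplace or not disable_inplace()
    return [v * 2 for v in x]

# Before the fix, the benchmark unconditionally passed inplace=True,
# tripping the assertion. Honoring the guard avoids it:
out = fused_experts([1, 2], inplace=not disable_inplace())
print(out)  # [2, 4]
```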
-
#34236 [Plugin] Simplify IO Processor Plugin interface — documentation,frontend,ready — by DarkLight1337 (合并于: 2026-02-11 11:47 (UTC+8)) [💬8 | +167/-151, 9 files | commented:6 approved:1]
## Purpose
- Rename `parse_request` -> `parse_data` and pass `request.data` to it so there is no need to distinguish between offline and online APIs.
- For the same reason, deprecate `output_to_response`, since we can construct the response automatically from the output data and the provided request ID.
- Split up `validate_or_generate_params` into `merge_sampling_params` and `merge_pooling_params` to make type annotation easier.
All changes are back-compatible until v…
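A rename that stays back-compatible typically follows the standard deprecation-shim pattern; a hedged sketch (class and field names are illustrative, not the actual plugin interface):

```python
import warnings
from dataclasses import dataclass

@dataclass
class Request:
    data: dict

class IOProcessor:
    def parse_data(self, data: dict) -> dict:
        # New entry point: takes the data directly, so offline and
        # online callers no longer need separate code paths.
        return {"parsed": data}

    def parse_request(self, request: Request) -> dict:
        # Deprecated shim kept for back-compatibility until removal.
        warnings.warn(
            "parse_request is deprecated; use parse_data", DeprecationWarning
        )
        return self.parse_data(request.data)

proc = IOProcessor()
print(proc.parse_data({"x": 1}))  # {'parsed': {'x': 1}}
```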
[关闭未合并 PR]
-
#26805 Fix seed reproducibility issue by adding output.copy_(out) — stale — by XuanofXXX (关闭于: 2026-02-12 10:17 (UTC+8)) [💬4 | +1/-0, 1 files | commented:1]
## Purpose This PR fixes a seed reproducibility issue in vLLM's MambaMixer2. Despite setting the seed in the script, the results were not consistent across multiple runs. The issue was traced to `mamba_mixer2.py`, where the missing `output.copy_(out)` line prevented the correct output from being written to the buffer, causing unstable results.
## Test Plan Added the line `output.copy_(out)` to the V1 configuration section in `mamba_mixer2.py` to ensure stable se…
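The failure mode (results never written into the caller-provided buffer) can be illustrated in plain Python; `output.copy_(out)` is the torch analogue of the in-place write shown here:

```python
def mixer_forward(output: list, compute) -> None:
    out = compute()
    # The fix: write results into the caller-provided buffer in place.
    # Without this line, `output` keeps whatever stale values it held,
    # which is the source of the nondeterminism described above.
    output[:] = out  # analogous to output.copy_(out) on torch tensors

buffer = [0.0, 0.0, 0.0]  # caller-provided, possibly uninitialized
mixer_forward(buffer, lambda: [1.0, 2.0, 3.0])
print(buffer)  # [1.0, 2.0, 3.0]
```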
-
#33518 [Bugfix] Fix NVFP4 MoE weight shapes for non-gated MLPs (Nemotron-Nano) — bug,ready,nvidia — by Code4me2 (关闭于: 2026-02-12 03:01 (UTC+8)) [💬10 | +249/-18, 5 files | commented:5 approved:1] ## Summary
Fix weight shape calculation in
`prepare_static_weights_for_trtllm_fp4_moe()` for models using non-gated MLPs like Nemotron-Nano.
## Problem
Loading
`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4` fails with: `RuntimeError: shape '[64, 8192, 2048]' is invalid for input of size 536870912` …
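The size mismatch in that error is exactly a factor of two, which is consistent with a gated-MLP shape (fused gate+up, i.e. 2x the intermediate rows) being applied to a non-gated model; a quick check, with the numbers taken from the error message and the interpretation hedged:

```python
# Expected view shape from the error: [64 experts, 8192, 2048].
gated_numel = 64 * 8192 * 2048   # assumes fused gate+up (2 * intermediate)
actual_numel = 536_870_912       # the tensor size the loader actually saw

print(gated_numel)                  # 1073741824
print(gated_numel // actual_numel)  # 2 -> non-gated MLP has half the rows
```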
-
#34368 Add initial devcontainer configuration — 无标签 — by vagabond2522 (关闭于: 2026-02-12 02:57 (UTC+8)) [+4/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... - #34360 Revert “[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner” — documentation,v1 — by robertgshaw2-redhat (关闭于: 2026-02-12 01:51 (UTC+8)) [💬1 | +753/-913, 25 files | commented:1] Reverts vllm-project/vllm#32344
- #34184 [Online Quantization] Support memory-efficient online quantization via layerwise loading — 无标签 — by kylesayrs (关闭于: 2026-02-12 00:40 (UTC+8)) [+77/-185, 5 files | commented:1 | 📝草稿]
## Purpose ##
- Support online quantization in a more maintainable way by integrating with existing layerwise processing functionality
## Changes ##
- Change layerwise logic to only copy and re-place into kernel tensors if reloading
## Weights with initialized values ## Handling weights which require values placed at init time is a little tricky. One example is rotary embeddings, whose values are created at init time and are not loaded from disk. In order to avoid overwriting these values wi…
-
#33061 [KV MultiConnector]: Add out-of-band handshake metadata get/set functions — kv-connector — by snadampal (关闭于: 2026-02-12 00:13 (UTC+8)) [💬5 | +20/-0, 1 files | commented:1] ## Purpose A MultiConnector with NixlConnector as one of the sub-connectors was failing to complete handshake between two nodes due to (1) missing xfer metadata for the NixlConnector and (2) Nixl handshake listener thread not started. This commit adds those missing get/set xfer metadata functions for kv multiconnector to propagate the calls to its sub-connectors.
## Test Plan Tested vllm P-D disaggregated inference with KV multiconnector with Nixl and LMCache sub-connectors
## Test Result P-D …
-
#34325 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (关闭于: 2026-02-11 20:38 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#30917 [Feature] Support using FlexKV as another KV Cache Offloading option — documentation,v1,kv-connector — by axxx03 (关闭于: 2026-02-11 17:49 (UTC+8)) [💬8 | +383/-0, 3 files | commented:9 changes:1] # Description FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we can achieve the following improvements:
- ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QPM i…
-
#34318 [bugfix] Fix Dynamo unexpected keyword argument — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by samutamm (关闭于: 2026-02-11 16:15 (UTC+8)) [💬3 | +6279/-1686, 114 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when CustomOP
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. The current main branch raises an error:
``` (EngineCore_DP0 pid=565) File "/app/vllm/vllm/v1/executor/multiproc_executor.py", line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#22599 [Feature] Improve logging for error messages — documentation,needs-rebase,unstale,v1 — by elizabetht (关闭于: 2026-02-11 12:00 (UTC+8)) [💬5 | +169/-15, 1 files | commented:4] # Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose
…