[vLLM GitHub Development Digest] 2026-01-12
[Overview]
- Time window: 2026-01-12 10:52 (UTC+8) to 2026-01-13 10:52 (UTC+8)
- New issues: 23 (label distribution: bug: 13, feature request: 4, RFC: 3, ci-failure: 3, rocm: 2)
- Closed issues: 34
- New PRs: 61 (label distribution: ready: 18, v1: 16, nvidia: 8, documentation: 8, qwen: 6)
- Merged PRs: 42
- PRs closed without merging: 18
[New issues]
-
#32227 [Feature]: per head or per channel fp8 kvcache support? — feature request — by zx-ai (created: 2026-01-13 09:55 (UTC+8)) ### 🚀 The feature, motivation and pitch
Hello, I’d like to ask about the current FP8 KV cache quantization, which uses per-tensor scaling. I heard you’re developing finer-grained quantization schemes, such as per-channel or per-head quantization. Could you share more details on how this would be implemented?
Specifically:
Will the scale factors be packed together with the KV cache tensors (e.g., appended after the KV data), similar to what DeepSpeed v3.2 does? Will the RoPE (Rotary Position E…
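The trade-off between the granularities can be illustrated with a small sketch (hypothetical, not vLLM's kernels; `FP8_E4M3_MAX = 448.0` is the float8-e4m3 representable max):

```python
# Hypothetical sketch (not vLLM's implementation): contrast per-tensor
# FP8 KV-cache scaling with the per-head granularity the issue asks about.
FP8_E4M3_MAX = 448.0  # max representable magnitude in float8 e4m3

def per_tensor_scale(kv: list[list[float]]) -> float:
    # One scale for the whole tensor: an outlier in any head inflates it,
    # wasting dynamic range for every other head.
    return max(abs(v) for head in kv for v in head) / FP8_E4M3_MAX

def per_head_scale(kv: list[list[float]]) -> list[float]:
    # kv[h] holds the flattened K/V values of head h; one scale per head
    # keeps the quantization error of an outlier local to that head.
    return [max(abs(v) for v in head) / FP8_E4M3_MAX for head in kv]

# Two well-behaved heads and one outlier head.
kv = [[0.5, -1.0], [0.8, 0.9], [50.0, -2.0]]
print(per_tensor_scale(kv))   # driven entirely by the outlier head: 50/448
print(per_head_scale(kv))     # only head 2's scale is large
```

Per-head (or per-channel) scales would also need a storage layout decision, which is what the question about packing scales alongside the KV data is getting at.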
-
#32219 [RFC]: Add Helion integration in vLLM — RFC — by gmagogsfm (created: 2026-01-13 06:41 (UTC+8)) [💬1] ### Motivation.
(with significant inputs from @zou3519, @ProExpertProg, @mgoin, @xiaohongchen1991)
Helion is PyTorch’s latest innovation in authoring custom kernels, featuring simple and familiar syntax, good developer experience, and superior performance.
This RFC proposes a developer-friendly framework for integrating Helion kernels into vLLM, making custom ops in vLLM more efficient, enjoyable to write, and performant in production.
The proposed integration is [prototyped here](https://gi…
-
#32225 [Bug]: `common_attn_metadata.max_seq_len` not incremented properly in Eagle implementation — bug — by ofirzaf (created: 2026-01-13 08:51 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32220 [Feature]: NVFP4 KV Cache Support — feature request — by yewentao256 (created: 2026-01-13 07:08 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We haven’t supported nvfp4 kv cache yet; let’s discuss the plans and possible actions in this issue.
```py
class TritonAttentionBackend(AttentionBackend):
    accept_output_buffer: bool = True
    supported_dtypes: ClassVar[list[torch.dtype]] = [
        torch.float16,
        torch.bfloat16,
        …
```
-
#32223 [CI Failure]: Kernels Core Operation Test — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:49 (UTC+8)) [💬1] ### Name of failing test
pytest -v -s kernels/core kernels/test_top_k_per_row.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32222 [CI Failure]: DP EP NixlConnector PD accuracy tests (Distributed) — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:31 (UTC+8)) [💬1] ### Name of failing test
uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt && VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:29 (UTC+8)) [💬1] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh 0.8 1319 8040
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32218 [RFC]: Consolidate Multimodal Related Info — RFC — by charlotte12l (created: 2026-01-13 06:20 (UTC+8)) [💬1] ### Background + Motivation
We are introducing `model_arch_config` (https://github.com/vllm-project/vllm/pull/28454), which defines explicitly what information the vLLM engine needs from the Hugging Face config / user-defined config, so we can avoid `hf_config`/`getattr(hf_config, xxx)` getting passed around in the engine. ### Previous Discussion
On whether `model_arch_config` should contain multimodal-related info, https://github.com/vllm-project/vllm/issues/24384#issuecomment-3489703720 suggest…
-
#32193 [Bug]: vLLM engine crash under burst load despite expected request queuing (72 concurrent API calls) — bug — by Dineshkumar-Anandan-ZS0367 (created: 2026-01-13 01:12 (UTC+8)) [💬5] ### Your current environment
### Description
We observed that the vLLM engine was killed when a burst of API requests hit the server (approximately 72 API requests from two servers simultaneously).
Based on vLLM’s design, we expected these requests to be queued and throttled internally, rather than causing the engine to terminate. This behavior is unexpected and impacts stability during performance and E2E testing.
…
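Until the server-side behavior is clarified, a common workaround is to throttle on the client; a minimal sketch (hypothetical, not a vLLM API) that caps in-flight requests with an `asyncio.Semaphore` so a burst of 72 calls reaches the server gradually:

```python
import asyncio

# Hypothetical client-side throttle: cap concurrent in-flight requests.
MAX_IN_FLIGHT = 16

async def send_request(payload: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the real HTTP call
    return f"ok:{payload}"

async def throttled(sem: asyncio.Semaphore, payload: str) -> str:
    async with sem:  # blocks while MAX_IN_FLIGHT requests are outstanding
        return await send_request(payload)

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(throttled(sem, str(i)) for i in range(72)))

results = asyncio.run(main())
print(len(results))  # 72
```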
-
#32214 [Feature]: Extend startup time collection script to work with sweep — feature request — by desertfire (created: 2026-01-13 05:36 (UTC+8)) ### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/pull/29919 added a script to collect startup time metrics, such as cold compilation time and warm compilation time. As suggested in https://github.com/vllm-project/vllm/pull/29919#pullrequestreview-3532714953, it would be nice to make it work with the existing `vllm bench sweep`. cc @ProExpertProg
### Alternatives
No response
…
-
#32213 [Bug]: GLM4 tool parser crashes with TypeError when parsing tools with no arguments — bug — by AnasMaar (created: 2026-01-13 05:25 (UTC+8)) ### 🐛 Describe the bug
When using the GLM4 tool parser (`--tool-call-parser glm47`) with tools that have no required or optional arguments, the parser crashes with a `TypeError`. Error traceback:
```
ERROR [glm4_moe_tool_parser.py:123] Failed to extract tool call spec
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/glm4_moe_tool_parser.py", line 104, in extract_tool_calls
    pairs = self.func_arg_regex.findall(tc_args)
…
```
-
#32190 [Bug]: Deepseek-R1 with DEP deployment returns gibberish outputs — bug — by ptarasiewiczNV (created: 2026-01-13 00:13 (UTC+8)) [💬2] ### Your current environment
Environment: `vllm/vllm-openai:v0.13.0` docker image. …
-
#32200 [RFC]: Change `kv_load_failure_policy` default from "recompute" to "fail" — RFC — by NickLucche (created: 2026-01-13 02:28 (UTC+8)) [💬2] ### Motivation.
In disaggregated prefill setups, the `kv_load_failure_policy` controls how the system handles failures when the decode instance loads KV cache blocks from the prefill instance. Currently, the default is `"recompute"`, which recomputes failed blocks on the decode instance. This may lead to a drop in performance:
- Wrong instance, wrong configuration: Decode instances use low-latency optimizations (e.g., DeepGemm low-latency mode). Recomputing prefill work on decode instanc…
-
#32187 [Bug]: AttributeError: module 'vllm.distributed.parallel_state' has no attribute 'get_tensor_model_parallel_group'. Did you mean: 'get_tensor_model_parallel_rank'? — bug — by cqray1990 (created: 2026-01-12 22:48 (UTC+8)) ### Your current environment
vllm 0.12.0
### 🐛 Describe the bug
AttributeError: module 'vllm.distributed.parallel_state' has no attribute 'get_tensor_model_parallel_group'. Did you mean: 'get_tensor_model_parallel_rank'?
### Before submitting a new issue…
…
-
#32186 [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector — bug,rocm — by junkang1991 (created: 2026-01-12 22:37 (UTC+8)) [💬2] ### Your current environment
OS: Ubuntu 22.04.5 LTS (x86_64) …
-
#32180 [Bug]: Performance Bottlenecks and V1 Engine Instability on AMD gfx1151 (Strix Halo) — bug,rocm — by kgundbrain (created: 2026-01-12 22:03 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32176 [Bug]: deepseekv3.2 core dumped with cpu_offload_gb — bug — by mengniwang95 (created: 2026-01-12 20:43 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32178 [Bug]: DSR1 NVFP4 DEP cannot run and it fails when _initialize_kv_caches — bug — by shyeh25 (created: 2026-01-12 21:30 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32177 [Feature]: Enable automatic trace context propagation for offline inference — feature request — by minimAluminiumalism (created: 2026-01-12 21:26 (UTC+8)) ### 🚀 The feature, motivation and pitch
Enable automatic trace context propagation for offline inference APIs (`LLM.generate()`, etc.) by extracting trace headers from the current OpenTelemetry context. Currently, vLLM has two usage modes with different tracing support:
- Online serving (entry point: http://localhost:8000/v1/completions) can extract trace context from HTTP headers
- Offline inference (`LLM.generate()`): there is no way to pass or auto-detect trace context
The low-level `Inp…
-
#32152 [Bug]: hermes_tool_parser drops final JSON brace during streaming when tokens are chunked (e.g., MTP enabled) — bug — by heyzude (created: 2026-01-12 14:46 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32172 [Bug]: DeepSeek V3.2 MTP + PD report two errors — bug — by kebe7jun (created: 2026-01-12 20:11 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32151 [Bug]: jina-reranker-m0 infer error — bug — by chen03191108-lab (created: 2026-01-12 14:35 (UTC+8)) [💬3] ### Your current environment
============================== System Info ============================== OS : mtos 22.03 (LTS-SP1) (x86_64) GCC version : (GCC) 10.3.1 Clang version : Could not collect CMake version : version 3.31.8 Libc version : glibc-2.34 …
-
#32154 [Bug]: v0.13.0 docker image run Whisper model return error: "PlaceholderModule should not be used when the original module can be imported" — bug — by shengkui (created: 2026-01-12 15:17 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)…
[Closed issues]
-
#32006 [Bug]: vLLM hangs on specific request each time (qwen-coder-480b-fp8) — bug — by Yanpas (closed: 2026-01-13 10:33 (UTC+8)) [💬10] ### Your current environment
Can't do it. It's RH OpenShift, GPU H200, CUDA 12.9. Model: Qwen Coder 480b FP8. vllm params: --enable-expert-parallel --data-parallel-size=8 ...
-
#26854 [Bug]: Use vllm.envs.ENV_VARIABLE instead of ENV_VARIABLE — bug,stale — by Jialin (closed: 2026-01-13 10:22 (UTC+8)) [💬3] ### Your current environment
N/A
### 🐛 Describe the bug
Avoid importing environment variables directly in the code (i.e., `from vllm.envs import FOO_BAR`); use `envs.FOO_BAR` instead, which goes through the module's `__getattr__` and has two benefits:
- no extra overhead, as environment variable caching was added in https://github.com/vllm-project/vllm/pull/26146
- going through `__getattr__` ensures environment variable updates are picked up during service startup.
…
-
#22493 [RFC]: Should the gpt-oss reasoning parser use harmony directly? — RFC,stale,gpt-oss — by sethkimmel3 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### Motivation.
Currently, the GPT-OSS Reasoning Parser can’t really be used directly to extract reasoning content. The comment on L22 says:
The GptOss model uses harmony to extract reasoning content and this parser is only used for detecting the end of the reasoning content. However, to unify behavior with other Reasoning Parsers, it may be worth using …
-
#22809 [Bug]: How to set reasoning_effort for gpt-oss model to "high" in vllm — bug,stale,gpt-oss — by sravan500 (closed: 2026-01-13 10:15 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#22929 [Bug]: GLM-4.5V-FP8 output quality issue — bug,stale — by zixi-qi (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#23108 [Usage]: how to use built-in python tool of gpt-oss-20b after starting vllm serve --tool-server demo? — usage,stale,gpt-oss — by luppx (closed: 2026-01-13 10:15 (UTC+8)) [💬22] ### Your current environment
Hi, I’m trying to test gpt-oss with vllm on an H800 GPU. I have successfully installed vllm and relevant dependencies following the gpt-oss vllm installation guide:
pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128
Following the [gpt-oss vllm usage guide](https://docs.vllm.ai/pro…
-
#23220 [Feature][Responses API] Browsing Cursor -> Citation — feature request,stale,gpt-oss — by simon-mo (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Currently, we do not translate the cursor (indexed by the browsing tool) into Citation format properly. That means users won’t be able to access the URL being accessed.
Non-streaming mode: https://github.com/vllm-project/vllm/blob/c32e6ad1f63631fd8033f0cca3a35d5e48ccfc7f/vllm/entrypoints/harmony_utils.py#L182-L199
Streaming mode: https://github.com/vllm-project/vllm/blob/c32e6ad1f63631fd8033f0cca3a35d5e48ccfc7f/vllm/entrypoints/openai/serving_responses…
-
#23293 [Feature][Chat Completion] Support tool_choice other than "auto" with harmony format — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
Similar to https://github.com/vllm-project/vllm/issues/23227, but for the chat completion API. Needs to wait until https://github.com/vllm-project/vllm/pull/22386 is merged to start work, and may need to coordinate with https://github.com/vllm-project/vllm/issues/23227 to avoid duplicated work.
### Alternatives
No response
### Additional context …
-
#23430 [Bug]: Reasoning parser enabled by default and cannot be disabled — bug,stale,gpt-oss — by AyRickk (closed: 2026-01-13 10:15 (UTC+8)) [💬8] I run vllm in an offline environment with `vllm serve`, and since the new 0.10.1 release, the reasoning parser is enabled and I can’t disable it. I tried only with the gpt-oss model. Maybe add a parameter to disable it, like `--no-reasoning-parser` or `--reasoning-parser=false`.
-
#23632 [Feature]: AttributeError: Model GptOssForCausalLM does not support BitsAndBytes quantization yet. No 'packed_modules_mapping' found. Support GptOssForCausalLM of BitsAndBytes quantization? — bug,stale,gpt-oss — by xiaotianns (closed: 2026-01-13 10:15 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#23694 [Bug]: gpt-oss model output issue — bug,stale,gpt-oss — by woshizouguo (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### Your current environment
vLLM 1.0.1 python 3.12
### 🐛 Describe the bug
I fine-tuned the model without using any reasoning data. When generating, the model produces output like:
<|channel|>final<|message|>{"issues":[]}<|return|>…
-
#24062 20-series GPUs do not support the `sinks` parameter; attempting to access it directly raises an error. Could you fix this — stale,gpt-oss — by wang824892540 (closed: 2026-01-13 10:15 (UTC+8)) [💬2] https://github.com/vllm-project/vllm/blob/a344a5aa0a58cc1758d9721e848ce1f5ca4b6c7f/vllm/attention/layer.py#L148
PS C:\Users\y> docker run --runtime nvidia --gpus all -v C:/Users/y/model/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model unsloth/gpt-oss-20b-unsloth-bnb-4bit
INFO 09-01 19:03:21 [init.py:241] Automatically detected platform cuda. (APIServer pid=1) INFO 09-01 19:03:23 [api_server.py:1805] vLLM API server version 0.10.1.1 (APISer…
-
#24067 [Bug]: how to get purely deterministic output for gpt-oss-120b? — bug,stale,gpt-oss — by tonyaw (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#24076 [Bug]: While serving GPT-OSS, streaming function calls output only reasoning_text, without function tool call — bug,stale,gpt-oss — by gsu2017 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### Your current environment
Python version: 3.12.11 …
-
#24148 [Usage]: Add toy example for gpt-oss container tools — usage,stale,gpt-oss — by lacora (closed: 2026-01-13 10:15 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (not provided) ### How would you like to use vllm
Followup on PR https://github.com/vllm-project/vllm/pull/23386 …
-
#24201 [Feature][gpt-oss] Responses API test enhancement — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
The current gpt-oss test only ensures that the workflow doesn’t crash; it has no correctness check. Help wanted on making the test more strict to avoid regressions.
https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_response_api_with_harmony.py
When implementing it, please make sure the tests are not flaky.
### Alternatives
…
-
#24283 [Bug]: GPT-OSS more robust way to handle messages in commentary channel — bug,stale,gpt-oss — by JasonZhu1313 (closed: 2026-01-13 10:15 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#24292 [Usage]: Adjusting reasoning efforts for GPT-OSS in direct sampling — usage,stale,gpt-oss — by ff1Zzd (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (not provided) ### How would you like to use vllm
Hi team, I understand that we can currently adjust the "reasoning_effort" for GPT-OSS when serving the model via vLLM's Chat Completions-compatible API. However, there seems to be no equivalent parameter when initialising the model via `LLM` and doing direct sampling (via `llm.generate`). …
-
#24797 [Bug]: Failing Qwen MoE EP Test in CI — bug,stale — by minosfuture (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#24803 [Bug]: AttributeError: 'FusedMoE' object has no attribute 'moe' when trying to run Qwen3-Next — bug,stale — by drrros (closed: 2026-01-13 10:14 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#24815 [Bug]: openbmb/MiniCPM-o-2_6 RuntimeError: CUDA error: an illegal memory access was encountered — bug,stale — by jiangtaozh (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
RTX5090 tp size 2 to deploy openbmb/MiniCPM-o-2_6 ```text (EngineCore_0 pid=271) INFO 09-14 03:09:35 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev459+g099c04646.d20250808) with config: model='/app/models/openbmb/MiniCPM-o-2_6', speculative_config=None, tokenizer='/prod/models/openbmb/MiniCPM-o-2_6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torc... -
#24816 [Usage]: should we explicitly modify `tokenizer.padding_side=left` when using vLLM to do batch inference? — usage,stale — by SpeeeedLee (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
N/A
### How would you like to use vllm
As the title says, I’m confused about whether vLLM handles this internally or not.
### Before submitting a new issue…
…
-
#24843 [Bug]: During the warmup phase, skipping attention (skipatten) caused an OOM (Out of Memory) error later. — bug,stale — by wangxiaoteng888 (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
vllm version v0.9.1 …
-
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29515 [CI Failure]: mi325_1: V1 Test attention (H100) — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s v1/attention
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#31631 [CI Failure]: mi325_1: V1 Test others — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬3] ### Name of failing test
pytest -s -v -m 'not cpu_test' v1/kv_connector/unit
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29525 [CI Failure]: mi325_1: Quantization Test — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:22 (UTC+8)) [💬4] ### Name of failing test
uv pip install --system torchao==0.13.0 && VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#30801 [Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output — bug,rocm — by qli88 (closed: 2026-01-12 23:35 (UTC+8)) [💬3] ### Your current environment
Enabling graph_mode for model distilbert/distilgpt2 on the ROCm platform generates garbage output.
### 🐛 Describe the bug
For model distilbert/distilgpt2, if graph_mode (enforce_eager=False) is enabled, vLLM will generate garbage output like “!!!!!!!!!!!!!!!” no matter what the prompt is.
### Before submitting a new issue…
…
-
#31721 [Feature]: Fused MoE Micro Benchmark for CPU Backend — feature request,cpu — by andikarachman (closed: 2026-01-12 18:03 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
A unified script to benchmark Fused MoE performance on CPU, instead of having each contributor write their own benchmarks (and then throw them away). We need something similar to benchmark_paged_attention.py for CUDA. It’s easier (and less noisy) to have a micro benchmark (that runs in seconds) to iterate on instead of re-running end-to-end benchmarks whi…
- #32069 [Usage]: Qwen3-VL-Embedding Accuracy — usage — by xiazi-yu (closed: 2026-01-12 15:54 (UTC+8)) [💬10] Comparison testing between Qwen3-VL-Embedding-2B on vLLM 13.0 and Qwen3VLEmbedder revealed inconsistencies in accuracy.
-
#32154 [Bug]: v0.13.0 docker image run Whisper model return error: "PlaceholderModule should not be used when the original module can be imported" — bug — by shengkui (closed: 2026-01-12 15:37 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32023 [Bug]: Fail to get response from a Qwen2-audio-7B-Instruct Service — bug — by moonlightian (closed: 2026-01-12 14:06 (UTC+8)) [💬2] ### Name of failing test
None
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#30064 [Performance]: Degradation on rocm from v0.11.1 to v0.12.0 — performance,rocm — by IdoAsraff (closed: 2026-01-12 13:26 (UTC+8)) [💬5] ### Proposal to improve performance
No response
### Report of performance regression
TL;DR: on v0.11.1 I get much better performance (specifically TTFT) with deepseek on rocm than on v0.12.0.
# What I did for v0.12.0
…
-
#32011 [Bug]: "Re-running CMake" loop in pip install — bug — by zhaohaixu (closed: 2026-01-12 12:57 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)…
[New PRs]
-
#32167 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (created: 2026-01-12 18:44 (UTC+8)) [+442/-27, 1 file | commented: 1, approved: 1] ## Purpose Re-implement the Qwen3-Omni Audio Encoder with vLLM primitives, with some vectorization improvements: roughly 10% speedup at high batch sizes according to a profiling run with TP=1.
main
(EngineCore_DP0 pid=2620374) INFO 01-12 23:47:30 [gpu_model_runner.py:4696] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 21 audio items of the maximum feature size. (EngineCore_DP0 pid=2620374) Audio processing time: 0.3811814785003662 secondsThis PR …
- #32165 Specify the kv_cache dtype of the draft model — speculative-decoding,llama — by PHOEBEMOON0802 (created: 2026-01-12 18:19 (UTC+8)) [💬8 | +73/-2, 2 files | commented: 3]
Submit this PR to allow specifying the dtype of the kv_cache for the draft model via “kv_cache_dtype” in the speculative_config.
The specific usage is as follows:
When you need to specify the kv_cache dtype of the draft model as the required type and set the kv_cache dtype of the main model to another type, add the following to the startup command (take “auto” as an example):
```
python -m vllm.entrypoints.openai.api_server \
  --model /path-to-the-main-model \
  --tensor-parallel-size 1 \
  --kv…
```
-
#32197 [Misc] Allow enabling NCCL for DP sync when async scheduling — ready — by njhill (created: 2026-01-13 02:04 (UTC+8)) [💬2 | +26/-16, 4 files | commented: 4, approved: 1] It’s currently disabled by default in this case, but we should allow manual override either way.
See https://github.com/vllm-project/vllm/issues/32140
[!NOTE] Sets `ParallelConfig.disable_nccl_for_dp_synchronization` to accept `None` and defers the defaulting logic to `VllmConfig`. `disable_nccl_for_dp_synchronization` changed to `bool | None` in `ParallelConfig` and CLI args; doc clarifies default behavior …
-
#32204 [Core][KVConnector] Support HMA+NixlConnector — needs-rebase,v1,kv-connector — by NickLucche (created: 2026-01-13 03:00 (UTC+8)) [💬1 | +449/-185, 6 files | commented: 4] ## Overview Currently connectors cannot take full advantage of models that employ hybrid attention (FA+SWA) and treat all layers as FA, as the Hybrid KV Cache Manager is disabled.
This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved with a xfer for SWA+FA models, while laying the ground for state-based ones (mamba etc.). Example of the former: ``` # NON-HMA (current master) (EngineCore_DP0 pid=521538) get_block_descs_ids…
- #32215 [6/N][Attention] Move utils to more appropriate locations — ready,v1,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-13 05:53 (UTC+8)) [💬1 | +168/-178, 14 files | commented: 1, approved: 1]
## Purpose
Step 6 of #31919:
- Move `_make_metadata_with_slice`, `slice_query_start_locs`, `split_attn_metadata` to `vllm/v1/worker/ubatch_utils.py`
- Move `subclass_attention_backend`, `subclass_attention_backend_with_overrides` to `vllm/v1/attention/backend.py`
## Test Plan CI
## Test Result
…
-
#32205 [CI/Build] Fix ffmpeg CVEs — ci/build — by junpuf (created: 2026-01-13 03:14 (UTC+8)) [+22/-1, 1 file | commented: 3] ## Purpose
The current vLLM Docker images contain 31 security vulnerabilities in FFmpeg and related libraries, including 3 Critical and 12 High severity CVEs. These vulnerabilities stem from using Ubuntu’s universe repository version of FFmpeg (4.4.2), which requires Ubuntu Pro subscription for security patches.
The vulnerabilities exist because:
- Ubuntu Universe Repository: vLLM uses FFmpeg from Ubuntu’s universe repository
- No Free Security Updates: Ubuntu universe p…
-
#32194 [Model] Handle `trust_remote_code` for transformers backend — new-model,ready — by DarkLight1337 (created: 2026-01-13 01:18 (UTC+8)) [+14/-1, 2 files | commented: 2, approved: 1] ## Purpose
Add a `trust_remote_code` argument to `try_get_class_from_dynamic_module` and handle it accordingly. ## Test Plan
## Test Result
...
-
#32226 [Misc] improve warning/assert messages — ready,v1 — by cjackal (created: 2026-01-13 08:52 (UTC+8)) [+19/-19, 6 files | commented: 1, approved: 1] ## Purpose
Fix missing whitespaces and typos and polish sentences in various warning/assert messages.
## Test Plan
CI green
## Test Result
…
-
#32191 [AOT compilation] cached inductor artifacts benchmark #32043 — performance — by dolpm (created: 2026-01-13 01:04 (UTC+8)) [💬1 | +662/-0, 1 file | commented: 2] Benchmark for https://github.com/vllm-project/vllm/pull/25205 ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32224 fix cutlass_3x_gemm_fp8_blockwise on sm103a — nvidia — by IwakuraRein (created: 2026-01-13 08:45 (UTC+8)) [+36/-6, 2 files | commented: 1 | draft]
## Purpose
When compiling with sm103a, the output of `cutlass_3x_gemm_fp8_blockwise` is garbage. Updated the helpers in `csrc/cutlass_extensions/common.hpp` to include sm103a. ## Test Plan
…
-
#32216 [Frontend] Add dedicated KimiK2ReasoningParser for tool call handling — no labels — by daniel-salib (created: 2026-01-13 06:09 (UTC+8)) [+386/-2, 3 files | commented: 4] ## Purpose
Add a dedicated `KimiK2ReasoningParser` to handle Kimi K2’s behavior of sometimes outputting tool calls without a proper `</think>` delimiter. Kimi K2 uses the same `<think>...</think>` tokens as DeepSeek R1 for reasoning content. However, when making tool calls, the model sometimes omits the `</think>` token, causing tool call markers to be absorbed into the reasoning content instead of being passed to the tool parser. This PR adds a specialized reasoning parser that:
- Extends `D…
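The fallback behavior described above can be sketched as follows (simplified; this is not the actual `KimiK2ReasoningParser` logic):

```python
# Simplified sketch: when the close-think token is missing, fall back to
# splitting at the first tool-call marker instead of losing the tool call.
THINK_END = "</think>"
TOOL_CALL_BEGIN = "<|tool_call_begin|>"

def split_reasoning(text: str) -> tuple[str, str]:
    if THINK_END in text:
        reasoning, rest = text.split(THINK_END, 1)
        return reasoning, rest
    if TOOL_CALL_BEGIN in text:
        # Model omitted </think>: everything before the tool marker is
        # reasoning; the marker onwards goes to the tool parser.
        idx = text.index(TOOL_CALL_BEGIN)
        return text[:idx], text[idx:]
    return text, ""

r, rest = split_reasoning("plan steps<|tool_call_begin|>functions.get_weather:0")
print(r)     # plan steps
print(rest)  # <|tool_call_begin|>functions.get_weather:0
```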
-
#32211 [Perf] Optimize requests abort — ready,v1 — by yewentao256 (created: 2026-01-13 05:00 (UTC+8)) [💬2 | +4/-3, 1 file | commented: 1] ## Purpose
Batch `reqs_to_abort` and abort once at the end instead of calling each time, yielding a small perf improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k…
-
#32146 [Frontend] Support OpenAI-style tool call IDs in Kimi K2 parser — documentation — by daniel-salib (created: 2026-01-12 13:14 (UTC+8)) [💬2 | +233/-9, 2 files | commented: 5] ## Purpose
Kimi K2 sometimes generates OpenAI-style tool call IDs (e.g., `call_abc123def456`) instead of the native format (`functions.get_weather:0`). This occurs when the model is prompted with OpenAI-style tool definitions or when the chat template uses OpenAI conventions: the model adapts its output format to match the input format. Native Kimi K2 format:
`<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city": "Tokyo"}<|tool_call_end|>` - Function name is e…
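Distinguishing the two ID formats might look like this (the regexes are illustrative guesses, not the parser's actual patterns):

```python
import re

# Illustrative classifiers for the two ID shapes described in the PR.
OPENAI_STYLE = re.compile(r"^call_[0-9a-zA-Z]+$")
KIMI_NATIVE = re.compile(r"^(?P<name>[\w.]+):(?P<index>\d+)$")

def classify_tool_call_id(tool_id: str) -> str:
    if OPENAI_STYLE.match(tool_id):
        return "openai"
    if KIMI_NATIVE.match(tool_id):
        return "kimi-native"
    return "unknown"

print(classify_tool_call_id("call_abc123def456"))        # openai
print(classify_tool_call_id("functions.get_weather:0"))  # kimi-native
```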
-
#32217 [Draft][Kernel] Add new flashinfer A2A kernel — nvidia — by hjjq (created: 2026-01-13 06:13 (UTC+8)) [+736/-59, 15 files | commented: 1 | draft] ## Purpose Add the latest TRT-LLM gen A2A kernel from flashinfer’s MoE-A2A API (https://github.com/flashinfer-ai/flashinfer/pull/2102) (a.k.a. one-sided all-to-all). This should perform better than the older A2A kernel (#21003) (`flashinfer_all2allv`) at large batch sizes. Enable with `--all2all-backend flashinfer_moe_a2a`. Still WIP. Depends on #31770.
## Test Plan
## Test Result
…
-
#32199 [BUGFIX] Add missed remaping of the names of fp8 kv-scale — ready,qwen — by vadiklyutiy (创建于: 2026-01-13 02:22 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1 approved:1] Qwen3-Next-NVFP4 checkpoint produced a lot of the following warnings
$ vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 -tp 2
...
Parameter layers.3.self_attn.qkqkv_proj.k_scale not found in params_dict, skip loading
Parameter layers.3.self_attn.qkv_proj.v_scale not found in params_dict, skip loading
...…
-
#32202 Initial structural_tag support for tool calling — tool-calling — by mgoin (创建于: 2026-01-13 02:57 (UTC+8)) [+232/-2, 4 files | commented:1]
## Purpose
An initial attempt at making progress on https://github.com/vllm-project/vllm/issues/32142
Adds structural tag support for tool calling so that `tool_choice="required"` constrains only the JSON arguments region, not the entire output. This allows models to generate their native tool tokens (e.g., `<tool_call>`) and any reasoning before the constrained region. Implemented for the Hermes parser as a reference; other parsers can override `get_structure_info()`. … -
#32201 [CI][AMD][Quantization] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol — rocm — by rasmith (创建于: 2026-01-13 02:47 (UTC+8)) [+4/-3, 2 files | commented:2] After a recent PR, the
`test_fp8_quant.::test_static_fp8_quant_group_2d` test was failing on ROCm with the following error:
E Greatest absolute difference: 0.0009765625 at index (361, 1414) (up to 0.0 allowed)
E Greatest relative difference: 0.1428571492433548 at index (361, 1414) (up to 0.12 allowed)
The value passed to `scaled_quantize` is not correct for ROCm, and the scaling used in `quant_utils` in `scaled_quantize` is … -
#32212 Fix various typos found in `docs` — documentation,structured-output,cpu — by potatosalad (创建于: 2026-01-13 05:10 (UTC+8)) [💬2 | +26/-26, 21 files | commented:1] Fix various typos found in `docs`:
- `docs/contributing/deprecation_policy.md`: "2.Deprecated" → "2. Deprecated"
- `docs/contributing/model/basic.md`: "interleave sliding windows" → "interleaved sliding windows"
- `docs/deployment/frameworks/cerebrium.md`: missing space before backtick → added space
- `docs/deployment/frameworks/hf_inference_endpoints.md`: "Click to Deploy" → "Click the Deploy"
- `docs/deployment/integrati… -
#32210 [6/n] Migrate paged attention, merge_attn_states, convert_vertical_slash_indexes — needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (创建于: 2026-01-13 04:28 (UTC+8)) [💬1 | +1941/-1423, 37 files | commented:1 | 📝草稿]
## Purpose Stacked on https://github.com/vllm-project/vllm/pull/31945
## Test Plan
pytest tests/kernels/attention/test_attention.py -v
CC=gcc pytest tests/kernels/attention/test_merge_attn_states.py -v
## Test Result …
-
#32209 [Model Runner V2] Minor refactor for logit_bias — v1 — by WoosukKwon (创建于: 2026-01-13 04:11 (UTC+8)) [+54/-26, 1 files | approved:1 commented:1] Minor code reorganization for better testing and code reuse
[!NOTE] Separates kernel launch logic from state, improving testability and reuse.
- Adds `apply_logit_bias(...)` helper that computes `BLOCK_SIZE` from the `allowed_token_ids`, `logit_bias_token_ids`, and `stop_token_ids` shapes and invokes `_bias_kernel`
- Updates `LogitBiasState.apply_logit_bias` to call the new helper, passing `.gpu` tensors instead of handling strides/size inline
- Keeps `_bias_kernel` logic unchange…
- #32203 [Rocm][CI] Merge the 2-node test commands into one — rocm,ci/build — by charlifu (创建于: 2026-01-13 02:57 (UTC+8)) [+6/-6, 1 files | commented:2]
[!NOTE] Consolidates the per-node setup in the AMD CI 2-node test step by chaining the three startup checks into a single command per node.
- In `test-amd.yaml` under `2 Node Tests (4 GPUs in total)`, merges the three per-node commands (`distributed/test_same_node….
-
#32207 [MoE Refactor] Remove CustomOp from UnquantizedFusedMoEMethod — 无标签 — by bnellnm (创建于: 2026-01-13 04:01 (UTC+8)) [💬2 | +20/-44, 2 files | commented:1] ## Purpose Remove CustomOp from UnquantizedFusedMoEMethod to make the MoE runner refactoring simpler.
## Test Plan CI MoE Refactor integration tests
## Test Result
cc @robertgshaw2-redhat , @mgoin , @ProExpertProg , @WoosukKwon …
-
#32195 Add TMA support to fused_moe_lora kernel — 无标签 — by gnovack (创建于: 2026-01-13 01:23 (UTC+8)) [+138/-34, 2 files | commented:4] ## Purpose
Adds support for loading A and B matrices from TMA descriptors in the existing fused_moe_lora kernel.
#### Implementation Details
- Adds a `supports_tma` function to conditionally enable TMA based on compute capability
- Reads LoRA A and B weights using TMA in MoE LoRA shrink and expand, respectively.
- Since LoRA A and B weights can be represented as a list of tensors (i.e. `num_slices > 1`), we only pre-allocate the TMA descriptor on-host when `num_slices == 1`; otherwise, we c…
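As a rough illustration of the capability gate named above: TMA is a Hopper-generation (SM 9.0+) hardware feature, so a `supports_tma`-style check reduces to comparing compute capabilities. This sketch is illustrative only; the PR's actual function may take different inputs:

```python
def supports_tma(compute_capability: tuple[int, int]) -> bool:
    """Illustrative gate only: TMA is available from Hopper (SM 9.0) onward."""
    return compute_capability >= (9, 0)

print(supports_tma((9, 0)))  # True  (Hopper)
print(supports_tma((8, 0)))  # False (Ampere falls back to the non-TMA path)
```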
- #32208 [P/D] Add decode-side check for prefill block expiry — kv-connector — by njhill (创建于: 2026-01-13 04:11 (UTC+8)) [+87/-4, 1 files | commented:1 | 📝草稿] Claude code was used to help with this.
-
#32198 [Docs] Nixl Usage recommend `fail` `kv_load_failure_policy` — documentation,kv-connector — by NickLucche (创建于: 2026-01-13 02:08 (UTC+8)) [💬2 | +16/-6, 1 files | commented:1 approved:1] Small update to the suggested setup in the docs; cc @wseaton.
[!NOTE] Emphasizes preferred failure handling and clarifies behavior for KV cache transfer.
- Adds `"kv_load_failure_policy":"fail"` to all NixlConnector example `--kv-transfer-config` commands (single-host, multi-host prefiller/decoder)
- New "KV Load Failure Policy" section describing `fail` vs `recompute` (default) with a warning about the performance impact of `recompute`
- No code changes; documentation only …
- #32163 [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens — v1 — by WoosukKwon (创建于: 2026-01-12 18:08 (UTC+8)) [💬2 | +253/-5, 2 files | commented:8]
[!NOTE] Implements GPU-side sampling constraints with a dedicated state and Triton kernel.
- Adds `LogitBiasState` to stage per-request `allowed_token_ids`, `logit_bias`, and `min_tokens`/`stop_token_ids` using `UvaBackedTensor` and `StagedWriteTensor`
- Introduces `_bias_kernel` (Triton) to: mask logits to only `allowed_token_ids`, apply per-token `logit_bias`, and enforce `min_tokens` by suppressing `stop_token_ids` until `min_len`
- Provides `add_request`, `apply_staged_writes`, …
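A NumPy sketch of the three constraints the summary says the kernel fuses. This is illustrative only; the actual `_bias_kernel` is a Triton kernel operating on GPU tensors, and the helper name here is invented:

```python
import numpy as np

def apply_sampling_constraints(logits, allowed_ids, logit_bias,
                               stop_ids, num_generated, min_tokens):
    """Toy CPU version: allowed-ids mask, per-token bias, min-tokens guard."""
    out = np.array(logits, dtype=np.float32)
    if allowed_ids is not None:
        # Mask logits so only allowed token ids remain sampleable.
        mask = np.full_like(out, -np.inf)
        mask[allowed_ids] = 0.0
        out = out + mask
    # Apply per-token logit bias.
    for tok, bias in logit_bias.items():
        out[tok] += bias
    # Suppress stop tokens until min_tokens have been generated.
    if num_generated < min_tokens:
        out[stop_ids] = -np.inf
    return out

logits = np.zeros(5, dtype=np.float32)
out = apply_sampling_constraints(logits, [1, 2, 3], {2: 1.5}, [3], 0, 4)
print(out.tolist())  # [-inf, 0.0, 1.5, -inf, -inf]
```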
-
#32189 fp8 online quant: split out Fp8OnlineLinearMethod — 无标签 — by vkuzo (创建于: 2026-01-12 23:26 (UTC+8)) [💬3 | +164/-116, 2 files | commented:5] Summary:
Split out `Fp8OnlineLinearMethod` from `Fp8LinearMethod` to more clearly separate online quant from offline quant logic, following a similar PR recently landed for `Fp8OnlineMoEMethod`.
In the same PR, beef up testing for online quant in integration tests a bit so we can depend on tests for testing future functionality for online quant. Specifically, extend the online fp8 quant test to also include a small MoE model, and extend it to run inference with a couple of tokens.
Test …
-
#32206 [WIP][Spec Decode] DFlash — new-model,speculative-decoding,v1,qwen — by benchislett (创建于: 2026-01-13 03:19 (UTC+8)) [+660/-13, 7 files | commented:1 | 📝草稿] ## Purpose
Add support to vLLM for DFlash speculative decoding.
Currently it's hacked into the EAGLE path and only works with batch size 1 and enforce-eager. There are no blockers here; the code just needs cleanup to support the remaining cases.
## Test Plan
Check AR on MTBench and compare to https://github.com/sgl-project/sglang/pull/16818. With BS=1 and enforce-eager, we match AL with both reasoning and non-reasoning. …
-
#32192 [Misc] Change log level for batch queue log — ready,v1 — by NickLucche (创建于: 2026-01-13 01:08 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] As per https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1768224082041559
[!NOTE] …
-
#32173 [BugFix] scheduler: Fix ordering preserving of skipped requests — ready,v1 — by orozery (创建于: 2026-01-12 20:14 (UTC+8)) [💬1 | +31/-15, 2 files | commented:1 approved:1] This PR fixes order preservation for requests skipped by the scheduler. A unit test is added to verify the fix.
I came across this bug while testing #29087, and the test there requires this fix.
…
-
#32181 nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak — kv-connector — by hasB4K (创建于: 2026-01-12 22:05 (UTC+8)) [💬1 | +13/-0, 1 files | commented:3] Implementation of the fix described here: https://github.com/vllm-project/vllm/issues/24264#issuecomment-3710305796
[!NOTE]
Mitigates a UCX memory leak when using NIXL by configuring UCX mmap hooks before loading NIXL bindings.
- If `UCX_MEM_MMAP_HOOK_MODE` is unset, set it to `none` prior to importing `nixl`/`rixl`; log a warning if `nixl` or `rixl` was already imported
- Adds `os`/`sys` imports; retains ROCm vs CUDA import paths and existing availability logging in `nixl_conne…
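A minimal standalone sketch of the approach described above (the function name is mine; the actual change lives inside vLLM's NIXL connector module): the environment variable must be set before the bindings are imported, so an already-imported module can only be warned about.

```python
import os
import sys

def configure_ucx_mmap_hooks() -> None:
    """Hypothetical standalone version of the workaround described above."""
    for mod in ("nixl", "rixl"):
        if mod in sys.modules:
            # Too late for this process: UCX read the env at import time.
            print(f"warning: {mod} already imported; "
                  "UCX_MEM_MMAP_HOOK_MODE may not take effect")
    # Respect a user-provided value; otherwise disable the mmap hooks.
    os.environ.setdefault("UCX_MEM_MMAP_HOOK_MODE", "none")

configure_ucx_mmap_hooks()
```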
-
#32155 [Frontend] Normalize Responses API input for client compatibility — frontend — by daniel-salib (创建于: 2026-01-12 15:33 (UTC+8)) [💬1 | +85/-5, 2 files | commented:3] ## Purpose
This PR improves client compatibility for the Responses API by normalizing input items before validation. This specifically addresses compatibility with Codex and other OpenAI SDK clients that serialize requests with varying conventions that can cause validation failures.
Change:
- Strip `None` values from input items: some clients (including Codex) include explicit `None` for optional fields (e.g., `name: null`, `status: null`). These are stripped to prevent validation error…
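The stripping step reduces to a one-line dict filter; a minimal sketch with an invented helper name, not the PR's actual code:

```python
def strip_none_fields(item: dict) -> dict:
    """Drop explicit None values so validation of optional fields passes."""
    return {k: v for k, v in item.items() if v is not None}

print(strip_none_fields({"type": "message", "name": None, "status": None}))
# {'type': 'message'}
```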
-
#32196 [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE — 无标签 — by lkm2835 (创建于: 2026-01-13 01:49 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1] ## Purpose Added the missing changes. Related to #31104.
## Test Plan
## Test Result
... -
#32182 [Update] Bump FlashInfer version to 0.6.0 — ci/build,v1,nvidia — by askliar (创建于: 2026-01-12 22:28 (UTC+8)) [💬3 | +52/-50, 3 files | commented:3 | 📝草稿] The latest FlashInfer (0.6.0) changes the plan API, which breaks `fast_plan_decode` (again). The main culprit is a change in the `plan` function in FlashInfer that started accepting an extra `o_data_type` argument. Since the current backend was mostly using positional arguments, these broke when new arguments were added. This PR updates FlashInfer to the latest version and fixes the API breakage by:
- Adding an extra `o_data_type` arg to `fast_plan_decode`
- Replacing positional arguments with keyword argumen…
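The failure mode is generic Python, not FlashInfer-specific: inserting a parameter into a signature silently re-binds positional call sites. A toy illustration with invented function names:

```python
def plan_old(indptr, indices, data_type):
    """Invented stand-in for a pre-change signature."""
    return {"data_type": data_type}

def plan_new(indptr, indices, o_data_type, data_type=None):
    """Invented stand-in for the new signature with an extra argument."""
    return {"o_data_type": o_data_type, "data_type": data_type}

# A positional call written against the old signature silently re-binds:
print(plan_old([0], [1], "f16"))  # {'data_type': 'f16'}
print(plan_new([0], [1], "f16"))  # {'o_data_type': 'f16', 'data_type': None}

# Keyword arguments keep call sites correct across signature changes:
print(plan_new([0], [1], o_data_type="f32", data_type="f16"))
```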
-
#32184 [Benchmark] Share data between SLA runs — performance,ready — by DarkLight1337 (创建于: 2026-01-12 22:31 (UTC+8)) [+108/-25, 2 files | commented:6 approved:1] ## Purpose
Further optimize the SLA script to share past run data coming from the same serve/bench combination. This is useful when tuning for multiple SLA targets (e.g. E2EL <= 200 ms, 500 ms, 1000 ms, …) while keeping the serve/bench combination unchanged.
## Test Plan
## Test Result
... -
#32148 [Model] Standardize pooling heads — ready — by DarkLight1337 (创建于: 2026-01-12 13:58 (UTC+8)) [+182/-149, 9 files | approved:1 commented:9] ## Purpose
Follow-up to #32119, make pooling params actually take effect for custom models.
## Test Plan
## Test Result
... -
#32169 [BugFix] [KVConnector] Fix KV events for LMCache connector — kv-connector — by hickeyma (创建于: 2026-01-12 19:19 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
A small fix for the LMCache connector, as LMCache does not yet support `lora_name` as an events property. Bug introduced in #27577.
The error occurs when trying to prompt a model with the LMCache connector and KV events enabled:
``` […] (EngineCore_DP0 pid=223048) [2026-01-12 11:02:52,233] LMCache INFO: Reqid: cmpl-a310c95f98e4f404-0-8bf3d3d5, Total tokens 32, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1602:lmcach…
- #32157 feat: add requires_token_ids interface for sampling params — v1 — by llsj14 (创建于: 2026-01-12 16:24 (UTC+8)) [+23/-0, 2 files | commented:3]
## Purpose
- Async scheduling currently blocks the delivery of prompt/output token IDs to sampling parameters, which means that, for example, prompt or output token IDs are set to -1 and only updated asynchronously.
- Certain logits processors (such as a thinking budget) require access to prompt and output token IDs, and therefore an explicit interface is needed to propagate this information to sampling parameters.
- Pooling parameters already expo…
-
#32188 doc: Update model references in supported_models.md — documentation,ready — by andyzhangx (创建于: 2026-01-12 22:52 (UTC+8)) [💬1 | +2/-2, 1 files | commented:4 approved:1] ## Purpose Update model references in supported_models.md. We use a tool to parse all model names listed on this page, but some items lack an org name, which makes them difficult to parse automatically; this PR fixes that.
## Test Plan
## Test Result
... -
#32179 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-12 21:40 (UTC+8)) [+21/-13, 1 files | commented:1 approved:1] ## Purpose
Fix the following bug:
$ DOCKER_BUILDKIT=1 docker build \
    -f docker/Dockerfile.rocm_base \
    -t rocm/vllm-dev:base-debug .…
-
#32185 doc: Update model name for Qwen3-Coder in documentation — documentation,ready,tool-calling,qwen — by andyzhangx (创建于: 2026-01-12 22:37 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1 commented:1] ## Purpose
Update the model name for Qwen3-Coder in the documentation; the original name is incorrect. The correct name can be found here: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
## Test Plan
## Test Result
... -
#32183 [MM Encoder] Add Triton ViT attention backend — v1,nvidia — by Isotr0py (创建于: 2026-01-12 22:29 (UTC+8)) [+152/-9, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32170 docs: fix typo in docker_run_bm.sh — tpu,ci/build — by darshan-stack (创建于: 2026-01-12 19:41 (UTC+8)) [💬1 | +11/-20, 1 files | commented:4] Fixes a small typo ‘useually’ -> ‘usually’ in the error message of docker_run_bm.sh.
## Purpose
## Test Plan
## Test Result
…
-
#32156 [Frontend] Fix missing channel assignment for multi-turn message parsing — frontend,gpt-oss — by daniel-salib (创建于: 2026-01-12 15:45 (UTC+8)) [+115/-2, 2 files | commented:2] ## Purpose
This PR fixes a bug where messages parsed via
`parse_response_input` and `parse_input_to_harmony_message` were missing required channel assignments, causing `parse_output_message` to raise `ValueError: Unknown channel: None`.
This is critical for Codex compatibility in multi-turn conversations where reasoning items and assistant messages from previous responses are passed back as input.
Changes:
`parse_response_input`: Assign `"analysis"` channel to reasoning items (match…
-
#32175 [Bugfix] [Core] fix sparse_attn_indexer padding — deepseek — by kebe7jun (创建于: 2026-01-12 20:20 (UTC+8)) [💬1 | +9/-1, 1 files | commented:2] ## Purpose
Fix https://github.com/vllm-project/vllm/issues/32172
## Test Plan
``` # Prefill vllm serve /gpfs/rd/models/DeepSeek-V3.2 -tp=2 -dp 4 --trust-remote-code --enable-expert-parallel --all2all-backend=deepep_high_throughput --gpu_memory_utilization=0.9 --max-model-len 102400 --tokenizer-mode=deepseek_v32 --enable-eplb --eplb-config '{"window_size":"32","step_interval":"32","num_redundant_experts":"8", "async": "True"}' --kv-transfer-config '{"kv_connector":"NixlConnect…
-
#32162 [Frontend][Tracing] Add support for tracing aborted requests — v1 — by zhanghaotong (创建于: 2026-01-12 17:41 (UTC+8)) [💬3 | +88/-19, 4 files | commented:5] ## Purpose Currently, requests that are aborted or terminated due to errors do not emit their corresponding tracing spans, making it challenging to diagnose and troubleshoot issues in production environments.
This PR enhances the tracing functionality by ensuring aborted requests generate trace spans and including the error that caused the abortion in the trace span.
## Test Plan
- Start vLLM with tracing enabled: ``` export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://0.0.0.0:4317 export OTEL_S…
-
#32171 Optimize FlashInfer TRTLLM FP4 MoE quantization #32057 — nvidia — by baonudesifeizhai (创建于: 2026-01-12 20:05 (UTC+8)) [+54/-16, 2 files | commented:1] ## Purpose This PR optimizes the FlashInfer TRTLLM FP4 MoE quantization path by replacing
`flashinfer.fp4_quantize()` with vLLM's native `scaled_fp4_quant()` operation, which is faster and better optimized. #32057
## Test Plan ``` export VLLM_DISABLE_COMPILE_CACHE=1 python -m vllm.entrypoints.openai.api_server
--model nvidia/DeepSeek-R1-NVFP4
--tensor-parallel-size 4 … -
#32174 Optimize fused MoE LoRA intermediate buffers and Triton indexing — 无标签 — by cwazai (创建于: 2026-01-12 20:17 (UTC+8)) [💬1 | +117/-59, 1 files | commented:2] Summary
This PR optimizes the fused MoE LoRA Triton op in two ways:
- Reuse intermediate workspaces instead of allocating and zeroing temporary tensors on every call.
- Reduce integer arithmetic cost in the Triton kernel by:
  - removing the `% N` modulus in `offs_bn` and replacing it with a mask
  - using int32 index arithmetic where possible. …
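The mask-instead-of-modulus idea, in NumPy terms as a stand-in for the Triton change (Triton's masked loads, `tl.load(..., mask=..., other=...)`, behave analogously); the values here are invented for illustration:

```python
import numpy as np

N = 6        # number of valid columns
BLOCK = 4    # tile width
pid = 1      # this tile extends past the end of the data
offs = pid * BLOCK + np.arange(BLOCK)   # [4, 5, 6, 7]
mask = offs < N                          # [True, True, False, False]
data = np.arange(N) * 10
# Masked "load": clamp indices for safety, then zero out masked lanes,
# instead of wrapping out-of-range offsets with `% N`.
safe = np.where(mask, data[np.minimum(offs, N - 1)], 0)
print(safe.tolist())  # [40, 50, 0, 0]
```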
- #32149 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation — documentation,ready,v1 — by Isotr0py (创建于: 2026-01-12 14:05 (UTC+8)) [💬1 | +13/-27, 4 files | commented:3 approved:1]
## Purpose
- A small fix for Triton encoder-only attention; otherwise `TritonAttentionImpl.scale` won't actually take effect.
## Test Plan
## Test Result
...
-
#32168 scheduler: Cache also the last block after KV recving — v1 — by orozery (创建于: 2026-01-12 18:56 (UTC+8)) [+3/-2, 1 files | commented:1] This PR fixes the scheduler to commit the last full block of KV data that was async received.
@robertgshaw2-redhat this is modifying code you introduced in #17751. I think it’s safe to cache that last block as well, but not sure. cc @njhill
BTW, do we really have to re-compute the last token, or can we somehow re-use the KV data that we saved for it?
…
-
#32166 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (创建于: 2026-01-12 18:40 (UTC+8)) [+425/-17, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32159 [Doc] Improve LoRA docs — documentation,ready — by jeejeelee (创建于: 2026-01-12 16:54 (UTC+8)) [💬2 | +12/-15, 1 files | commented:2 approved:1] ## Purpose vLLM has removed the LoRA-based extended vocab size feature, so vLLM doesn’t support LoRA adapters like
`yard1/llama-2-7b-sql-lora-test` anymore. We also need to update the related documentation.
## Test Plan
## Test Result
... -
#32164 [Bugfix] fix apiserver crash exist then engine core close — frontend,v1 — by lengrongfu (创建于: 2026-01-12 18:15 (UTC+8)) [+16/-0, 2 files | commented:1] ## Purpose
Fix: https://github.com/vllm-project/vllm/issues/32116
## Test Plan
We can set a very small ready timeout value to reproduce it.
$ VLLM_ENGINE_READY_TIMEOUT_S=10 CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/jovyan/qwen3-30b-a3b --gpu_memory_utilization=0.9 --enable-expert-parallel --data-parallel-size=2 --api-server-count 2 --data-parallel-address 127.0.0.1 --data-parallel-size-local 2 --data-parallel-start-rank 0…
-
#32158 [doc] fix broken links — documentation,ready — by minimAluminiumalism (创建于: 2026-01-12 16:53 (UTC+8)) [💬2 | +7/-21, 1 files | approved:1] ## Purpose
Images were using HTML `<img>` tags with relative paths. MkDocs doesn't process relative paths in HTML tags, causing incorrect URLs.
## Test Plan
NA …
-
#32161 [CPU] Split attention dispatch by head_dim alignment — cpu — by R3hankhan123 (创建于: 2026-01-12 17:37 (UTC+8)) [+79/-54, 3 files | commented:2] ## Purpose Separate 32-aligned (AMX/NEON/VEC/VEC16) and 16-only (VEC16 only) head_dim dispatch paths to avoid redundant template instantiations and fix NEON and AMX compilation errors for head_dim 80/112 when static assertions are enabled
Head dims divisible by 32 route through all ISA implementations, while head dims divisible by 16 but not 32 are restricted to VEC16 only, preventing unsupported ISA combinations from being instantiated during compilation.
## Test Plan Build Docker image and t…
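The dispatch rule described above can be sketched as a toy lookup (names are illustrative; the real change is C++ template dispatch, not Python):

```python
def attention_isa_candidates(head_dim: int) -> list[str]:
    """Toy version of the head_dim dispatch rule from the PR summary."""
    if head_dim % 32 == 0:
        # 32-aligned head dims route through all ISA implementations.
        return ["AMX", "NEON", "VEC", "VEC16"]
    if head_dim % 16 == 0:
        # 16-but-not-32-aligned (e.g. 80, 112) restricted to VEC16 only.
        return ["VEC16"]
    raise ValueError(f"unsupported head_dim {head_dim}")

print(attention_isa_candidates(128))  # ['AMX', 'NEON', 'VEC', 'VEC16']
print(attention_isa_candidates(80))   # ['VEC16']
```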
-
#32153 [Frontend] Fix Flaky MCP Streaming Test — ready — by daniel-salib (创建于: 2026-01-12 14:46 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose test_mcp_code_interpreter_streaming is occasionally failing due to inconsistent behavior of whether it decides to trigger a tool call. Since the math question is too simple, a tool call isn’t always triggered. A more complicated math problem like “123 * 456” consistently triggers the MCP tool call
## Test Plan
pytest entrypoints/openai/responses/test_harmony.py::test_mcp_code_interpreter_streaming
## Test Result
…
-
#32160 [feat] Add ATOM model impl backend — new-model,rocm — by zejunchen-zejun (创建于: 2026-01-12 17:23 (UTC+8)) [+158/-9, 5 files | commented:1 | 📝草稿] -
#32150 [Model] Remove incorrect `SupportsPP` from MTP models — ready,qwen,deepseek — by DarkLight1337 (创建于: 2026-01-12 14:25 (UTC+8)) [+6/-15, 6 files | commented:1 approved:2] ## Purpose
MTP models don't actually support PP.
## Test Plan
## Test Result
... -
#32147 Add descriptive error message for missing tools. — frontend,gpt-oss,meta-exported,fb-exported — by madongfly (创建于: 2026-01-12 13:28 (UTC+8)) [💬2 | +85/-8, 2 files | commented:1] Summary: Add descriptive error message for missing tools.
Test Plan: Unit test
Differential Revision: D90477073
[!NOTE] …
-
#32145 update cutlass_moe_mm error check message — nvidia — by XiaobingSuper (创建于: 2026-01-12 12:10 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
Fixed a typo in the `cutlass_moe_mm` arch check message.
## Test Plan
## Test Result
... - #32144 [code clean] remove useless parameters — frontend — by andyxning (创建于: 2026-01-12 10:56 (UTC+8)) [+0/-2, 1 files | commented:2]
## Purpose
Code cleanup: remove an unused parameter.
## Test Plan
NA
## Test Result
NA
...
[已合并 PR]
-
#32127 [BugFix] Fix engine crash caused by chat tools + response_format — ready,v1,tool-calling — by njhill (合并于: 2026-01-13 10:33 (UTC+8)) [💬2 | +47/-0, 3 files | commented:3 approved:1] Our input/request validation logic overall is a mess. This is a minimal fix for now but some overhaul is needed imo.
Fixes https://github.com/vllm-project/vllm/issues/32006
…
-
#32197 [Misc] Allow enabling NCCL for DP sync when async scheduling — ready — by njhill (合并于: 2026-01-13 10:03 (UTC+8)) [💬2 | +26/-16, 4 files | commented:4 approved:1] It’s currently disabled by default in this case but we should allow manual override either way.
See https://github.com/vllm-project/vllm/issues/32140
[!NOTE] Sets `ParallelConfig.disable_nccl_for_dp_synchronization` to accept `None` and defers defaulting logic to `VllmConfig`; `disable_nccl_for_dp_synchronization` changed to `bool | None` in `ParallelConfig` and CLI args; doc clarifies default behavior …
-
#32194 [Model] Handle `trust_remote_code` for transformers backend — new-model,ready — by DarkLight1337 (合并于: 2026-01-13 09:30 (UTC+8)) [+14/-1, 2 files | commented:2 approved:1] ## Purpose
Add `trust_remote_code` argument to `try_get_class_from_dynamic_module` and handle it accordingly.
## Test Plan
## Test Result
... -
#32036 [responsesAPI] add unit test for optional function tool call id — ready — by qandrew (合并于: 2026-01-13 08:14 (UTC+8)) [💬1 | +118/-0, 1 files | commented:3 approved:1] ## Purpose
## Test Plan follow up from #31999
## Test Result
unit tests pass
…
-
#32040 [ROCm][CI] Handle pytest status code 5 when a shard isn’t allocated any tests — rocm,ready,ci/build — by divakar-amd (合并于: 2026-01-13 06:35 (UTC+8)) [💬1 | +11/-2, 2 files | commented:2 approved:1] This PR ensures that test shards exiting with status 5 (indicating "no tests collected") are handled gracefully in AMD CI. In a multi-GPU setting, when tests are sharded among GPUs, a particular shard may not be allocated any tests at all. In such cases, pytest returns status 5. More details on pytest exit codes - LINK
```
Shard 0: collected 312 items / 37 deselected / 275 selected Shard 0: Running …
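The graceful handling can be sketched as a thin wrapper that remaps pytest's "no tests collected" exit code (the wrapper name is hypothetical; the actual change is in the CI scripts):

```python
import subprocess
import sys

NO_TESTS_COLLECTED = 5  # pytest's exit code when zero tests are collected

def run_shard(cmd: list[str]) -> int:
    """Run a shard's test command, treating 'no tests collected' as success."""
    status = subprocess.run(cmd).returncode
    if status == NO_TESTS_COLLECTED:
        print("shard collected no tests; treating as success")
        return 0
    return status

# Simulate a shard whose pytest invocation collects nothing.
print(run_shard([sys.executable, "-c", "raise SystemExit(5)"]))  # 0
```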
-
#31962 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow — bug,ready — by xuebwang-amd (合并于: 2026-01-13 06:17 (UTC+8)) [💬8 | +24/-9, 1 files | commented:3 approved:1] ## Purpose
# Background: This PR is a continuous work on two previous PRs:
- PR #31676
  - key contribution: move bias adding after dequantization
  - computation order: to(compute_type) -> HAS_BIAS (bias adding) -> MUL_ROUTED_WEIGHT; this is the closest to the right order, except that the MoE weight multiplication is not in float32.
- PR #31931
  - key contribution: preserving router weight s…
-
#32199 [BUGFIX] Add missed remaping of the names of fp8 kv-scale — ready,qwen — by vadiklyutiy (合并于: 2026-01-13 04:42 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1 approved:1] Qwen3-Next-NVFP4 checkpoint produced a lot of the following warnings
$ vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 -tp 2
...
Parameter layers.3.self_attn.qkqkv_proj.k_scale not found in params_dict, skip loading
Parameter layers.3.self_attn.qkv_proj.v_scale not found in params_dict, skip loading
...…
- #32143 [Model Runner V2] Add support for M-RoPE — v1,nvidia — by WoosukKwon (合并于: 2026-01-13 05:37 (UTC+8)) [💬1 | +203/-7, 5 files | commented:10]
[!NOTE] Adds end-to-end M-RoPE support to the v1 GPU model runner, wiring 3D position IDs through batching and execution.
- Introduces `MRopeState` with UVA-backed prefill position storage and a Triton kernel to materialize `mrope_positions` per batch (`mm/mrope_utils….
-
#32209 [Model Runner V2] Minor refactor for logit_bias — v1 — by WoosukKwon (合并于: 2026-01-13 05:08 (UTC+8)) [+54/-26, 1 files | approved:1 commented:1] Minor code reorganization for better testing and code reuse
[!NOTE] Separates kernel launch logic from state, improving testability and reuse.
- Adds `apply_logit_bias(...)` helper that computes `BLOCK_SIZE` from the `allowed_token_ids`, `logit_bias_token_ids`, and `stop_token_ids` shapes and invokes `_bias_kernel`
- Updates `LogitBiasState.apply_logit_bias` to call the new helper, passing `.gpu` tensors instead of handling strides/size inline
- Keeps `_bias_kernel` logic unchange…
-
#32031 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure — ready,v1,kv-connector — by NickLucche (合并于: 2026-01-13 04:38 (UTC+8)) [💬2 | +234/-28, 2 files | commented:6 approved:1] ## Overview
This PR is aimed at improving logging to more easily identify failures during runs. It does so by providing a shared util function with more comprehensive context of the failure.
Change result: ``` # before ERROR 01-09 12:04:06 [nixl_connector.py:2281] NIXL transfer setup/initiation failed for request test_transfer_setup_failed_req. Marking blocks as invalid.
…
- #32163 [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens — v1 — by WoosukKwon (合并于: 2026-01-13 03:31 (UTC+8)) [💬2 | +253/-5, 2 files | commented:8]
[!NOTE] Implements GPU-side sampling constraints with a dedicated state and Triton kernel.
- Adds `LogitBiasState` to stage per-request `allowed_token_ids`, `logit_bias`, and `min_tokens`/`stop_token_ids` using `UvaBackedTensor` and `StagedWriteTensor`
- Introduces `_bias_kernel` (Triton) to: mask logits to only `allowed_token_ids`, apply per-token `logit_bias`, and enforce `min_tokens` by suppressing `stop_token_ids` until `min_len`
- Provides `add_request`, `apply_staged_writes`, …
-
#31748 [Misc][BE] Type coverage for vllm/compilation [3/3] — ready,nvidia — by Lucaskabela (合并于: 2026-01-13 03:24 (UTC+8)) [💬2 | +333/-280, 11 files | commented:3 approved:2] ## Purpose We want to provide better type hint coverage in vllm/compilation to improve maintainability, readability, and reduce silent errors
This PR should be applied on top of #31744
## Test Plan
`mypy vllm/compilation`
## Test Result ``` …
-
#32192 [Misc] Change log level for batch queue log — ready,v1 — by NickLucche (合并于: 2026-01-13 02:59 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] As per https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1768224082041559
[!NOTE] …
-
#32173 [BugFix] scheduler: Fix ordering preserving of skipped requests — ready,v1 — by orozery (合并于: 2026-01-13 02:39 (UTC+8)) [💬1 | +31/-15, 2 files | commented:1 approved:1] This PR fixes order preservation for requests skipped by the scheduler. A unit test is added to verify the fix.
I came across this bug while testing #29087, and the test there requires this fix.
…
-
#31879 [Misc] Set default torch num threads for input processing — ready,v1 — by ywang96 (合并于: 2026-01-13 02:28 (UTC+8)) [💬2 | +12/-16, 2 files | approved:1 commented:1] ## Purpose There have been numerous reports (e.g., #29869, #29078) of CPU contention from multimodal input processing where users do not set `OMP_NUM_THREADS` and run multiple vLLM instances on the same physical machine. This PR wraps the preprocess call with torch num threads set to the same value as `OMP_NUM_THREADS` if it's set, otherwise 1.
## Test Plan
## Test Result
... -
#30697 [Refactor] EPLB rebalance algo to NumPy — ready — by ilmarkov (合并于: 2026-01-13 02:13 (UTC+8)) [💬2 | +128/-130, 3 files | commented:1 approved:2] ## Purpose
First PR in a series of EPLB refactorings and optimizations. In this PR we unify the rebalance algorithm implementation, converting it to a CPU-only NumPy-based implementation.
Prerequisite for the PR(s) in which we will move the rebalancing algorithm to an async EPLB thread, removing the points where the CPU has to wait for GPU results in the main thread.
## Validation
`tests/distributed/test_eplb_algo.py` passed … -
#32196 [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE — 无标签 — by lkm2835 (合并于: 2026-01-13 02:00 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1] ## Purpose Added the missing changes. Related to #31104.
## Test Plan
## Test Result
... - #32054 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py — documentation,rocm,speculative-decoding,ready,v1,cpu,nvidia,ready-run-all-tests — by MatthewBonanni (合并于: 2026-01-13 01:13 (UTC+8)) [💬2 | +374/-370, 37 files | commented:4 approved:1]
## Purpose
Step 3 of #31919. Moves chunks of code from utils.py to backend.py (unchanged) and updates imports accordingly. The following objects are moved:
`CommonAttentionMetadata`, `AttentionMetadataBuilder`, `AttentionCGSupport`
## Test Plan CI (run all tests)
## Test Result …
-
#32184 [Benchmark] Share data between SLA runs — performance,ready — by DarkLight1337 (合并于: 2026-01-13 01:12 (UTC+8)) [+108/-25, 2 files | commented:6 approved:1] ## Purpose
Further optimize the SLA script to share past run data coming from the same serve/bench combination. This is useful when tuning for multiple SLA targets (e.g. E2EL <= 200 ms, 500 ms, 1000 ms, …) while keeping the serve/bench combination unchanged.
## Test Plan
## Test Result
... -
#32148 [Model] Standardize pooling heads — ready — by DarkLight1337 (合并于: 2026-01-13 01:01 (UTC+8)) [+182/-149, 9 files | approved:1 commented:9] ## Purpose
Follow-up to #32119, make pooling params actually take effect for custom models.
## Test Plan
## Test Result
... - #31988 [Misc][PD] Fix
`get_attn_backend` usage in transfer connectors — ready,v1,kv-connector — by NickLucche (merged: 2026-01-13 01:10 (UTC+8)) [💬3 | +39/-20, 4 files | commented:5 approved:1] This PR fixes the use of `get_attn_backend` in the context discussed here https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1767810349936949. In particular, it turns out that a modification to the interface of this shared function can cause unintended backend retrieval (as a partial configuration was passed in), leading to cases such as ``` VLLM_LOGGING_LEVEL=DEBUG vllm serve google/gemma-3-4b-it --port 8004 --enforce-eager --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'… -
#32118 [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode — ready,v1 — by Josephasafg (merged: 2026-01-13 01:02 (UTC+8)) [💬3 | +24/-3, 2 files | commented:1 approved:1] ## Purpose Fix stale SSM state corruption when new Mamba requests are scheduled with only 1 token due to token budget exhaustion.
## Problem When the scheduler's token budget is nearly exhausted, new requests may be allocated only 1 token. A request with 1 token can be classified as decode rather than prefill. This causes a prompt to first be decoded against a stale SSM state (i.e. one holding leftover values; this happens when most of the GPU blocks are in use and a block gets reused)…
-
#31528 [FIX] Add NO_MUL activation support for modular kernel path — rocm,ready,gpt-oss,nvidia — by danielafrimi (merged: 2026-01-13 00:55 (UTC+8)) [💬5 | +368/-71, 17 files | commented:10] This PR adds support for
`*_no_mul` activations (e.g., `relu2_no_mul`) in the modular kernel MoE path (`TritonExperts`). ### Problem The modular kernel path assumed all activations use gate/up multiplication (like SiLU, GELU), where the output size is
`N/2`. For `*_no_mul` activations, which apply the activation directly without gating, the output size should equal the input size (`N`). This caused assertion failures and buffer size mismatches.…
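The gated-vs-`*_no_mul` size rule can be captured in a few lines. This is a minimal sketch; the helper name `moe_workspace_out_size` is hypothetical, not vLLM's API:

```python
def moe_workspace_out_size(n: int, activation: str) -> int:
    # Gated activations (e.g. silu, gelu) split the hidden dim into
    # gate/up halves and multiply them, so the output is N/2.
    # *_no_mul activations apply the nonlinearity directly: output stays N.
    return n if activation.endswith("_no_mul") else n // 2

print(moe_workspace_out_size(256, "silu"))          # 128: gated path
print(moe_workspace_out_size(256, "relu2_no_mul"))  # 256: no-mul path
```

Sizing the output buffer with the gated formula for a no-mul activation is exactly the mismatch the PR describes.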
-
#29384 [MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct — documentation,new-model,ready — by kakao-steve-ai (merged: 2026-01-13 00:39 (UTC+8)) [💬4 | +790/-0, 5 files | commented:10] ## Purpose Add new model support for kakaocorp/kanana-1.5-v-3b-instruct
## Test Plan Verify that the performance scores are correctly produced using lmms-eval.
``` export OPENAI_API_BASE="http://localhost:8000/v1" export OPENAI_API_KEY='EMPTY' lmms-eval --model async_openai --model_args model_version=kakaocorp/kanana-1.5-v-3b-instruct,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY --tasks m…
-
#31621 Add K-EXAONE-236B-A23B — documentation,new-model,ready — by lkm2835 (merged: 2026-01-13 00:30 (UTC+8)) [💬2 | +856/-0, 7 files | commented:10] ## Purpose This PR adds support for K-EXAONE-236B-A23B
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32083 [Model Runner V2] Remove async barrier — v1,nvidia — by WoosukKwon (merged: 2026-01-12 12:24 (UTC+8)) [💬3 | +590/-462, 13 files | commented:7] Currently, the model runner V2 prevents race conditions by introducing
`async_barrier`, which is difficult to reason about and error-prone. This PR eliminates the need for `async_barrier` by avoiding the race condition by design. The core idea is double buffering: allocate two copies of the state, where one is used by the CPU and the other by the GPU. A more detailed design doc is in progress.
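The double-buffering idea can be sketched as follows. All names here are hypothetical; this only illustrates the design, not the actual model runner code:

```python
class DoubleBuffer:
    """Two copies of state: each step, the CPU writes one copy while the
    GPU reads the other; swap() hands the freshly written copy over.
    Neither side ever touches the copy the other is using within a step,
    so no async barrier is needed."""

    def __init__(self, make_state):
        self._bufs = [make_state(), make_state()]
        self._step = 0

    @property
    def cpu_state(self):
        return self._bufs[self._step % 2]

    @property
    def gpu_state(self):
        return self._bufs[(self._step + 1) % 2]

    def swap(self):
        self._step += 1

buf = DoubleBuffer(dict)
buf.cpu_state["input_ids"] = [1, 2, 3]  # CPU prepares step N
buf.swap()                              # hand off to the GPU side
print(buf.gpu_state)                    # the copy the GPU now reads
```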
-
#32188 doc: Update model references in supported_models.md — documentation,ready — by andyzhangx (merged: 2026-01-13 00:15 (UTC+8)) [💬1 | +2/-2, 1 files | commented:4 approved:1] ## Purpose Update model references in supported_models.md. We use a tool to parse all model names listed on this page, but some items lack an org name, which makes automatic parsing difficult; this PR fixes that.
## Test Plan
## Test Result
... -
#32179 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-12 23:33 (UTC+8)) [+21/-13, 1 files | commented:1 approved:1] ## Purpose
Fix the following bug:
$ DOCKER_BUILDKIT=1 docker build \ -f docker/Dockerfile.rocm_base \ -t rocm/vllm-dev:base-debug .…
-
#32185 doc: Update model name for Qwen3-Coder in documentation — documentation,ready,tool-calling,qwen — by andyzhangx (merged: 2026-01-12 23:10 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1 commented:1] ## Purpose
Update model name for Qwen3-Coder in documentation; the original name is incorrect, and the correct name can be found here: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
## Test Plan
## Test Result
... -
#24498 OffloadingConnector: Add cpu_bytes_to_use configuration — documentation,ready,v1,kv-connector — by orozery (merged: 2026-01-12 23:00 (UTC+8)) [💬26 | +49/-24, 9 files | dismissed:2 commented:7 approved:1] This PR changes the OffloadingConnector size configuration from num_cpu_blocks to cpu_bytes_to_use, allowing a more intuitive space allocation (per vLLM instance, across workers).
[!NOTE] Modernizes KV offloading configuration and wiring.
- Replace `num_cpu_blocks`/`kv_bytes_per_rank` with instance-wide `cpu_bytes_to_use` for `OffloadingConnector` (docs and configs)
- `CPUOffloadingSpec` now derives `num_blocks` from `KVCacheConfig` (page size, tensors, world size) and `…
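Assuming `num_blocks` is derived roughly as bytes-per-worker divided by page size, the conversion can be sketched as below. The exact formula in `CPUOffloadingSpec` may differ; this is only an illustration of the arithmetic:

```python
def derive_num_blocks(cpu_bytes_to_use: int, page_size_bytes: int,
                      world_size: int) -> int:
    # Split the instance-wide byte budget evenly across workers, then
    # count how many whole KV-cache pages fit per worker.
    bytes_per_worker = cpu_bytes_to_use // world_size
    return bytes_per_worker // page_size_bytes

# 1 GiB budget, 2 MiB pages, 4 workers -> 128 blocks per worker
print(derive_num_blocks(1 << 30, 2 << 20, 4))
```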
-
#28284 [Feature] Support recording expert indices for rollout router replay — performance,frontend,ready,v1,rl — by xhx1022 (merged: 2026-01-12 22:23 (UTC+8)) [💬68 | +463/-3, 11 files | commented:9 approved:1] ## Purpose
This PR introduces Rollout Router Replay (R3) support into vLLM runtime.
Inspired by the recent research in reinforcement learning alignment for MoE-based LLMs (arXiv:2510.11370, arXiv:2507.18071), this implementation allows recording the expert routing decisions for every token at every layer during model inference. The recorded routing traces can be used for replaying the expert routing process du… -
#31573 [P/D] Refactor mooncake connector sender thread using async coroutines — ready,kv-connector — by dtcccc (merged: 2026-01-12 20:35 (UTC+8)) [+173/-174, 1 files | commented:3 approved:1] ## Purpose This is a separate PR split out of https://github.com/vllm-project/vllm/pull/31034 to ease review. It refactors the sender thread using async coroutines. All related data lives in the same thread, so the locks can be dropped. This makes the sender thread simple and easy to maintain.
## Test Plan
## Test Result
... - #32149 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation — documentation,ready,v1 — by Isotr0py (merged: 2026-01-12 19:13 (UTC+8)) [💬1 | +13/-27, 4 files | commented:3 approved:1]
## Purpose
- A small fix for Triton encoder-only attention; otherwise `TritonAttentionImpl.scale` won't take effect.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#32134 [Doc] Add documentation for offline API docs feature — documentation,ready — by ricky-chaoju (merged: 2026-01-12 18:33 (UTC+8)) [💬1 | +8/-0, 1 files | commented:1 approved:1] ## Summary Add documentation for the offline API docs feature; related: https://github.com/vllm-project/vllm/pull/30184
…
-
#32159 [Doc] Improve LoRA docs — documentation,ready — by jeejeelee (merged: 2026-01-12 18:19 (UTC+8)) [💬2 | +12/-15, 1 files | commented:2 approved:1] ## Purpose vLLM has removed the LoRA-based extended vocab size feature, so vLLM no longer supports LoRA adapters like
`yard1/llama-2-7b-sql-lora-test`. We also need to update the related documentation. ## Test Plan ## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32158 [doc] fix broken links — documentation,ready — by minimAluminiumalism (merged: 2026-01-12 18:18 (UTC+8)) [💬2 | +7/-21, 1 files | approved:1] ## Purpose
Images were using HTML
`<img>` tags with relative paths. MkDocs doesn't process relative paths in HTML tags, causing incorrect URLs. ## Test Plan
NA …
-
#32153 [Frontend] Fix Flaky MCP Streaming Test — ready — by daniel-salib (merged: 2026-01-12 18:03 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose test_mcp_code_interpreter_streaming occasionally fails because the model inconsistently decides whether to trigger a tool call. Since the math question is too simple, a tool call isn't always triggered; a more complicated problem like "123 * 456" consistently triggers the MCP tool call.
## Test Plan
pytest entrypoints/openai/responses/test_harmony.py::test_mcp_code_interpreter_streaming
## Test Result
…
-
#32092 [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend — performance,ready,cpu — by andikarachman (merged: 2026-01-12 18:03 (UTC+8)) [💬2 | +175/-0, 1 files | commented:3 changes:1 approved:2] ## Purpose
Add Fused MoE Micro Benchmark for CPU Backend Fixes: #31721
## Test Plan Tested using this instance:
## Test Result …
-
#30975 [Misc] Disable default
`--ready-check-timeout-sec` extra call in vllm bench — performance — by NickLucche (merged: 2026-01-12 17:58 (UTC+8)) [+2/-3, 1 files | commented:1 approved:1] As per a brief offline discussion with @mgoin, the current `vllm bench serve` implementation defaults to sending (at least) one extra request to probe whether the server is up. I suppose this is due to a non-uniform backend/healthcheck API, so we've been defaulting to sending the same test request https://github.com/vllm-project/vllm/blob/686cbaac643c3412036728dd5e6bc29d6cce1a9f/vllm/benchmarks/serve.py#L596 I believe this is largely misleading for most use-cases as it results in an ambiguous "wa…
-
#32150 [Model] Remove incorrect
`SupportsPP` from MTP models — ready,qwen,deepseek — by DarkLight1337 (merged: 2026-01-12 17:19 (UTC+8)) [+6/-15, 6 files | commented:1 approved:2] ## Purpose MTP models don't actually support PP.
## Test Plan
## Test Result
... -
#32085 [Model] Improve multimodal pooling examples — documentation,ready — by noooop (merged: 2026-01-12 15:54 (UTC+8)) [💬2 | +381/-69, 10 files | commented:6 approved:1] ## Purpose
FIX #32069
## Test Plan
Qwen/Qwen3-VL-Embedding-2B …
-
#32119 [Model] Avoid hardcoding pooling type — ready — by DarkLight1337 (merged: 2026-01-12 13:28 (UTC+8)) [+47/-22, 6 files | commented:4 approved:2] ## Purpose
Gracefully handle user-specified pooling types where possible, instead of silently ignoring them.
In the next PR, we will work on passing
`EmbeddingPoolerHead` to the `head` argument so that pooling params are applied correctly. ## Test Plan
## Test Result
…
[Closed unmerged PRs]
-
#31729 [Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬5 | +10/-8, 3 files | commented:1] ## Summary
Fix hardcoded
`device="cuda"` in AITER MLA sparse attention and Fused MOE initialization code. This ensures tensors are created on the correct device in multi-GPU setups and improves ROCm compatibility. ## Changes
### 1. `vllm/attention/ops/rocm_aiter_mla_sparse.py`
Replace 5 instances of `device="cuda"` with `device=q.device` or `device=q_fp8.device`:
- Lines 46, 49: Use `q.device` in `fp8_mqa_logits_torch()`
- Lines 127, 135: Use `q.device` in `fp8_paged_mqa_logits_torch()`…
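The pattern behind the fix can be sketched with a hypothetical helper (requires PyTorch; `make_scratch` is illustrative, not a vLLM function):

```python
import torch

def make_scratch(q: torch.Tensor) -> torch.Tensor:
    # Before: torch.empty(..., device="cuda") breaks on non-default GPUs
    # and on CPU. After: follow the input tensor's device.
    return torch.empty(q.shape[0], dtype=torch.float32, device=q.device)

q = torch.randn(4, 8)   # CPU tensor here; the same code works on any GPU
scratch = make_scratch(q)
print(scratch.device)
```

The key point is that allocations derive their placement from an input tensor, so a process pinned to `cuda:1` never accidentally allocates on `cuda:0`.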
-
#31638 [Bugfix][Hardware][AMD] Narrow broad exception in AITER scaled MM import — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬4 | +1/-1, 1 files | commented:1] ## Summary
Narrows the exception handling in the AITER scaled MM kernel's `is_supported()` method from catching all `Exception` types to catching only `ImportError`. ## Problem
The current code uses `except Exception:`, which is too broad and can mask programming errors like `AttributeError`, `TypeError`, or other unexpected exceptions that should propagate for debugging. ## Solution
…
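A minimal sketch of the narrowed pattern; `aiter_scaled_mm_supported` is a hypothetical stand-in for the actual `is_supported()` method, with `aiter` treated as an optional dependency:

```python
def aiter_scaled_mm_supported() -> bool:
    # Catch only ImportError: a missing optional dependency means
    # "unsupported", while real bugs (AttributeError, TypeError, ...)
    # still propagate and surface during debugging.
    try:
        import aiter  # noqa: F401  (optional ROCm dependency)
    except ImportError:
        return False
    return True

print(aiter_scaled_mm_supported())
```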
-
#31552 [Bugfix][Hardware][AMD] Fix device parameter and exception handling — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬3 | +9/-9, 2 files | commented:1] ## Summary
Fix two ROCm-related issues:
### 1. Fusion Helper Functions (`vllm/compilation/fusion.py`)
Bug: Hardcoded `device="cuda"` in helper functions prevents explicit device selection. ```python # Before (hardcoded): …
- #31587 [Bugfix][Hardware][ROCm] Narrow broad exception in PyNCCL library loading — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬2 | +3/-3, 1 files | commented:1]
## Summary
- Replace `except Exception:` with `except (OSError, ValueError, RuntimeError):` when loading the NCCL/RCCL library
- This prevents masking unexpected errors during library loading, which is especially important for ROCm/RCCL debugging
## Details The current broad exception handler can hide the actual cause of NCCL/RCCL loading failures. Narrowing to specific exceptions helps debug issues on both CUDA and ROCm platforms.
`OSError`: Library file not found or can't be loaded (from `ctypes.CD…
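The narrowed loading pattern might look like this sketch; the function name and error message are illustrative, not vLLM's actual code:

```python
import ctypes

def load_comm_library(path: str) -> ctypes.CDLL:
    # Catch only what ctypes.CDLL is known to raise for a missing or
    # malformed library; anything else propagates for debugging.
    try:
        return ctypes.CDLL(path)
    except (OSError, ValueError, RuntimeError) as e:
        raise RuntimeError(
            f"failed to load NCCL/RCCL from {path!r}: {e}") from e
```

With the broad `except Exception:`, a typo inside the loader itself would be reported as "library failed to load"; with the narrowed tuple it raises normally.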
-
#31119 [Bugfix][Hardware][AMD] Fix tensor slice assignment in MLA — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬10 | +1/-1, 1 files | commented:1] ## Summary Fix inconsistent tensor assignment pattern in
`rocm_aiter_mla.py`. Bug: Line 160 used direct assignment (`=`) to set values in a tensor slice, while lines 142, 148, and 154 correctly use `.fill_()` for the same operation pattern. ```python # Lines 142, 148, 154 - correct pattern: self.paged_kv_indices[num_actual_pages:].fill_(-1) self.paged_kv_indptr[1 + num_reqs :].fill_(paged_kv_indptr[-1]) self.paged_kv_last_page_len[num_reqs:].fill_(1) …
-
#31121 [Bugfix][Hardware][AMD] Fix list aliasing in fused MoE initialization — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬14 | +116/-4, 2 files | commented:1] ## Summary Fix a critical Python list aliasing bug in the ROCm fused MoE implementation.
Bug: The code used
`[[value] * n] * m` pattern, which creates `m` references to the same inner list, not `m` independent lists. ```python # Before (buggy) - all indices point to same list: s_topk_ids_list = [[fake_expertid] * (n_shared_experts + is_EP)] * max_num_tokens
# After (fixed) - each index has independent list: …
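The aliasing pitfall is easy to demonstrate in isolation:

```python
fake_expertid, width, num_tokens = -1, 3, 4

# Buggy: the outer * makes num_tokens references to ONE inner list.
aliased = [[fake_expertid] * width] * num_tokens
aliased[0][0] = 7
print(aliased[1][0])   # 7 — the "other" row changed too

# Fixed: a comprehension builds independent inner lists.
independent = [[fake_expertid] * width for _ in range(num_tokens)]
independent[0][0] = 7
print(independent[1][0])  # -1 — other rows untouched
```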
-
#31176 [Bugfix][Hardware][AMD] Fix hardcoded device in MLA sparse attention — rocm,v1,nvidia — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬19 | +5/-5, 1 files | commented:1] ## Summary
Replace hardcoded `device="cuda"` with the input tensor's device (`q.device` or `q_fp8.device`) in `rocm_aiter_mla_sparse.py` for consistency and to avoid potential device mismatch errors. ## Changes
Fixed 5 instances of hardcoded `device="cuda"`:
| Location | Function | Fix |
|----------|----------|-----|
…
-
#31178 [Bugfix][Hardware][AMD] Fix device parameter in AITER topK metadata — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬19 | +6/-4, 2 files | commented:1] ## Summary
Add an explicit `device` parameter to `init_aiter_topK_meta_data()` instead of hardcoding `"cuda"`. This improves multi-GPU support and makes device handling explicit and consistent with other ROCm functions. ## Changes
File 1: `vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`
- Add a `device: int | str = "cuda"` parameter to the function signature
- Replace 3 instances of hardcoded `device="cuda"` with `device=device`
…
-
#31251 [Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu for ROCm namespace alias — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬16 | +11/-9, 3 files | commented:1] ## Summary
Replace direct
`cub/cub.cuh` includes with `cub_helpers.h` in multiple CUDA kernel files. This provides the `namespace cub = hipcub;` alias needed for ROCm builds. ## Files Fixed
| File | CUB Usage |
|------|-----------|
| `csrc/sampler.cu` | `cub::BlockScan`, `cub::BlockRadixSort` |
| `csrc/moe/moe_align_sum_kernels.cu` | `cub::BlockScan` |
… -
#31293 [Bugfix][Hardware][AMD] Fix uninitialized Qlocal registers in ROCm attention kernel — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬12 | +14/-0, 1 files | commented:1] ## Summary
Fixes uninitialized `Qlocal` register values in the ROCm PagedAttention wmma kernel, which can cause numerical accuracy issues on AMD GPUs. ## The Bug
In
`paged_attention_ll4_kv_kernel`, when `GQA_RATIO == 1` (non-GQA models like Llama-2), only lane 0 loads valid Q data into `Qlocal` registers. Lanes 1-15 retain garbage values from previous GPU cycles. These uninitialized values then propagate into the
`gcn_wmma16x16x16_instr` (Wave Matrix Multiply-Accumulate) instruction, conta… -
#32084 [ROCm][Bugfix] Fix AITER speculative decoding accuracy issue — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬6 | +68/-36, 1 files | commented:3] ## Summary
- Fix speculative decoding (query_len > 1) for ROCM_AITER_FA backend
- Fall back to
`context_attention_fwd` when decode has multi-token queries - Resolves the 0% accuracy issue reported in #31625
## Problem
The AITER
`paged_attention_v1` kernel only supports single-token queries (query_len = 1). When speculative decoding is enabled, the decode path receives multi-token queries, causing the kernel to produce incorrect results (0% accuracy on the GSM8K benchmark).…
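The fallback amounts to a query-length dispatch; a hedged sketch (the selector function is hypothetical, not the PR's actual code):

```python
def pick_decode_kernel(query_len: int) -> str:
    # paged_attention_v1 handles only single-token queries; speculative
    # decoding produces query_len > 1, so fall back to the prefill kernel.
    return "paged_attention_v1" if query_len == 1 else "context_attention_fwd"

print(pick_decode_kernel(1))  # normal decode step
print(pick_decode_kernel(4))  # spec-decode step verifying draft tokens
```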
-
#31802 [Core][NIXL] Support HMA+NixlConnector — documentation,performance,structured-output,frontend,tpu,needs-rebase,ci/build,v1,multi-modality,tool-calling — by NickLucche (closed: 2026-01-13 03:04 (UTC+8)) [💬3 | +3247/-1277, 135 files | commented:1 | 📝 draft] This PR is based on an early version of https://github.com/vllm-project/vllm/pull/30166, so the diff is a mess. I will clean it up, rebase it ASAP, and provide a more accurate description of the PR then.
UPDATE: check out https://github.com/vllm-project/vllm/pull/32204 for the updated PR
## Overview Currently connectors are not able to take full advantage of models that employ hybrid attention (FA+SWA) and treat all layers as FA, as the Hybrid Kv Cache Manager is disabled.
This PR enables Ni…
-
#32043 [AOT compilation] cached inductor artifacts benchmark — performance — by dolpm (closed: 2026-01-13 01:04 (UTC+8)) [+682/-0, 1 files | commented:3] ## Purpose
split out of https://github.com/vllm-project/vllm/pull/25205
## Test Plan
## Test Result
... -
#32166 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (closed: 2026-01-12 18:41 (UTC+8)) [+425/-17, 1 files | commented:1 | 📝 draft]
## Purpose
## Test Plan
## Test Result
... -
#31900 Ignore: dynres attempt — needs-rebase — by netanel-haber (closed: 2026-01-12 17:34 (UTC+8)) [💬1 | +858/-127, 3 files | commented:1 | 📝 draft] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#27511 Add standalone multimodal encoder benchmark — performance,frontend — by alhridoy (closed: 2026-01-12 16:59 (UTC+8)) [💬3 | +434/-0, 4 files | commented:4] Fixes #25450
## Summary - add a standalone multimodal encoder benchmark (
`vllm/benchmarks/encoder.py`) that loads dummy-weight models, builds HF-processor inputs, and measures `get_multimodal_embeddings` latency across configurable batch/image sizes - wire the runner into the CLI as `vllm bench encoder` and keep a legacy shim at `benchmarks/benchmark_encoder.py`… -
#24091 [EPLB]: Optimize
`export_load_view` update — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by dragondream-chen (closed: 2026-01-12 12:01 (UTC+8)) [💬20 | +68/-5, 8 files | commented:2] ## Purpose As part of the EPLB (Expert Load Balancing) feature, this PR optimizes how expert load is updated during each forward pass. The current approach uses the `scatter_add_` method on `topk_ids` results. When using DeepEP Low-Latency or PPLX on the CUDA platform, expert loads can be obtained directly from `expert_tokens_meta.expert_num_tokens`, which removes redundant expert-load computation. ## Test Plan
- Test the expert load update. Since the use of kernels such as DeepEP…
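The redundancy being removed can be illustrated with a plain-Python histogram of routing decisions (a sketch; names are illustrative):

```python
def expert_load_from_topk(topk_ids, num_experts):
    # scatter_add-style histogram over per-token routing decisions.
    load = [0] * num_experts
    for token_experts in topk_ids:
        for e in token_experts:
            load[e] += 1
    return load

# With DeepEP Low-Latency or PPLX, the dispatch kernel already exposes
# this count (expert_tokens_meta.expert_num_tokens), making the
# histogram pass redundant.
print(expert_load_from_topk([[0, 2], [2, 3]], 4))
```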
- #32144 [code clean] remove useless parameters — frontend — by andyxning (closed: 2026-01-12 11:03 (UTC+8)) [+0/-2, 1 files | commented:2]
## Purpose
Code cleanup: remove unused parameters.
## Test Plan
NA
## Test Result
NA
—
Essential Elements of an Effective PR Description Checklist
...