[vLLM GitHub Development Digest] 2026-01-12
[Overview]
- Time window: 2026-01-12 10:52 (UTC+8) to 2026-01-13 10:52 (UTC+8)
- New issues: 23 (label distribution: bug: 13, feature request: 4, RFC: 3, ci-failure: 3, rocm: 2)
- Closed issues: 34
- New PRs: 61 (label distribution: ready: 18, v1: 16, nvidia: 8, documentation: 8, qwen: 6)
- Merged PRs: 42
- PRs closed without merging: 18
[New issues]
-
#32227 [Feature]: per head or per channel fp8 kvcache support? — feature request — by zx-ai (created: 2026-01-13 09:55 (UTC+8)) ### 🚀 The feature, motivation and pitch
Hello, I’d like to ask about the current FP8 KV cache quantization, which uses per-tensor scaling. I heard you’re developing finer-grained quantization schemes, such as per-channel or per-head quantization. Could you share more details on how this would be implemented?
Specifically:
Will the scale factors be packed together with the KV cache tensors (e.g., appended after the KV data), similar to what DeepSpeed v3.2 does? Will the RoPE (Rotary Position E…
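The trade-off between the granularities can be illustrated with a small sketch (hypothetical, not vLLM's kernels; `FP8_E4M3_MAX = 448.0` is the float8-e4m3 representable max):

```python
# Hypothetical sketch (not vLLM's implementation): contrast per-tensor
# FP8 KV-cache scaling with the per-head granularity the issue asks about.
FP8_E4M3_MAX = 448.0  # max representable magnitude in float8 e4m3

def per_tensor_scale(kv: list[list[float]]) -> float:
    # One scale for the whole tensor: an outlier in any head inflates it,
    # wasting dynamic range for every other head.
    return max(abs(v) for head in kv for v in head) / FP8_E4M3_MAX

def per_head_scale(kv: list[list[float]]) -> list[float]:
    # kv[h] holds the flattened K/V values of head h; one scale per head
    # keeps the quantization error of an outlier local to that head.
    return [max(abs(v) for v in head) / FP8_E4M3_MAX for head in kv]

# Two well-behaved heads and one outlier head.
kv = [[0.5, -1.0], [0.8, 0.9], [50.0, -2.0]]
print(per_tensor_scale(kv))   # driven entirely by the outlier head: 50/448
print(per_head_scale(kv))     # only head 2's scale is large
```

Per-head (or per-channel) scales would also need a storage layout decision, which is what the question about packing scales alongside the KV data is getting at.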
-
#32219 [RFC]: Add Helion integration in vLLM — RFC — by gmagogsfm (created: 2026-01-13 06:41 (UTC+8)) [💬1] ### Motivation.
(with significant inputs from @zou3519, @ProExpertProg, @mgoin, @xiaohongchen1991)
Helion is PyTorch’s latest innovation in authoring custom kernels, featuring simple and familiar syntax, good developer experience, and superior performance.
This RFC proposes a developer-friendly framework for integrating Helion kernels into vLLM, making custom ops in vLLM more efficient, enjoyable to write, and performant in production.
The proposed integration is [prototyped here](https://gi…
-
#32225 [Bug]: `common_attn_metadata.max_seq_len` not incremented properly in Eagle implementation — bug — by ofirzaf (created: 2026-01-13 08:51 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32220 [Feature]: NVFP4 KV Cache Support — feature request — by yewentao256 (created: 2026-01-13 07:08 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We haven’t supported nvfp4 kv cache yet; let’s discuss the plans and possible actions in this issue.
```py
class TritonAttentionBackend(AttentionBackend):
    accept_output_buffer: bool = True
    supported_dtypes: ClassVar[list[torch.dtype]] = [
        torch.float16,
        torch.bfloat16,
        …
```
-
#32223 [CI Failure]: Kernels Core Operation Test — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:49 (UTC+8)) [💬1] ### Name of failing test
pytest -v -s kernels/core kernels/test_top_k_per_row.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32222 [CI Failure]: DP EP NixlConnector PD accuracy tests (Distributed) — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:31 (UTC+8)) [💬1] ### Name of failing test
uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt && VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (created: 2026-01-13 07:29 (UTC+8)) [💬1] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh 0.8 1319 8040
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32218 [RFC]: Consolidate Multimodal Related Info — RFC — by charlotte12l (created: 2026-01-13 06:20 (UTC+8)) [💬1] ### Background + Motivation
We are introducing `model_arch_config` (https://github.com/vllm-project/vllm/pull/28454), which defines explicitly what information the vLLM engine needs from the Hugging Face config / user-defined config, so we can avoid `hf_config`/`getattr(hf_config, xxx)` getting passed around in the engine. ### Previous Discussion
On whether `model_arch_config` should contain multimodal-related info, https://github.com/vllm-project/vllm/issues/24384#issuecomment-3489703720 suggest…
-
#32193 [Bug]: vLLM engine crash under burst load despite expected request queuing (72 concurrent API calls) — bug — by Dineshkumar-Anandan-ZS0367 (created: 2026-01-13 01:12 (UTC+8)) [💬5] ### Your current environment
### Description
We observed that the vLLM engine was killed when a burst of API requests hit the server (approximately 72 API requests from two servers simultaneously).
Based on vLLM’s design, we expected these requests to be queued and throttled internally, rather than causing the engine to terminate. This behavior is unexpected and impacts stability during performance and E2E testing.
…
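Until the server-side behavior is clarified, a common workaround is to throttle on the client; a minimal sketch (hypothetical, not a vLLM API) that caps in-flight requests with an `asyncio.Semaphore` so a burst of 72 calls reaches the server gradually:

```python
import asyncio

# Hypothetical client-side throttle: cap concurrent in-flight requests.
MAX_IN_FLIGHT = 16

async def send_request(payload: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the real HTTP call
    return f"ok:{payload}"

async def throttled(sem: asyncio.Semaphore, payload: str) -> str:
    async with sem:  # blocks while MAX_IN_FLIGHT requests are outstanding
        return await send_request(payload)

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(throttled(sem, str(i)) for i in range(72)))

results = asyncio.run(main())
print(len(results))  # 72
```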
-
#32214 [Feature]: Extend startup time collection script to work with sweep — feature request — by desertfire (created: 2026-01-13 05:36 (UTC+8)) ### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/pull/29919 added a script to collect startup time metrics, such as cold compilation time and warm compilation time. As suggested in https://github.com/vllm-project/vllm/pull/29919#pullrequestreview-3532714953, it would be nice to make it work with the existing `vllm bench sweep`. cc @ProExpertProg
### Alternatives
No response
…
-
#32213 [Bug]: GLM4 tool parser crashes with TypeError when parsing tools with no arguments — bug — by AnasMaar (created: 2026-01-13 05:25 (UTC+8)) ### 🐛 Describe the bug
When using the GLM4 tool parser (`--tool-call-parser glm47`) with tools that have no required or optional arguments, the parser crashes with a `TypeError`. Error traceback:
```
ERROR [glm4_moe_tool_parser.py:123] Failed to extract tool call spec
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/glm4_moe_tool_parser.py", line 104, in extract_tool_calls
    pairs = self.func_arg_regex.findall(tc_args)
…
```
-
#32190 [Bug]: Deepseek-R1 with DEP deployment returns gibberish outputs — bug — by ptarasiewiczNV (created: 2026-01-13 00:13 (UTC+8)) [💬2] ### Your current environment
Environment: `vllm/vllm-openai:v0.13.0` docker image. …
-
#32200 [RFC]: Change `kv_load_failure_policy` default from "recompute" to "fail" — RFC — by NickLucche (created: 2026-01-13 02:28 (UTC+8)) [💬2] ### Motivation.
In disaggregated prefill setups, the `kv_load_failure_policy` controls how the system handles failures when the decode instance loads KV cache blocks from the prefill instance. Currently, the default is `"recompute"`, which recomputes failed blocks on the decode instance. This may lead to a drop in performance:
- Wrong instance, wrong configuration: Decode instances use low-latency optimizations (e.g., DeepGemm low-latency mode). Recomputing prefill work on decode instanc…
-
#32187 [Bug]: AttributeError: module 'vllm.distributed.parallel_state' has no attribute 'get_tensor_model_parallel_group'. Did you mean: 'get_tensor_model_parallel_rank'? — bug — by cqray1990 (created: 2026-01-12 22:48 (UTC+8)) ### Your current environment
vllm 0.12.0
### 🐛 Describe the bug
AttributeError: module 'vllm.distributed.parallel_state' has no attribute 'get_tensor_model_parallel_group'. Did you mean: 'get_tensor_model_parallel_rank'?
### Before submitting a new issue…
…
-
#32186 [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector — bug,rocm — by junkang1991 (created: 2026-01-12 22:37 (UTC+8)) [💬2] ### Your current environment
OS: Ubuntu 22.04.5 LTS (x86_64) …
-
#32180 [Bug]: Performance Bottlenecks and V1 Engine Instability on AMD gfx1151 (Strix Halo) — bug,rocm — by kgundbrain (created: 2026-01-12 22:03 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32176 [Bug]: deepseekv3.2 core dumped with cpu_offload_gb — bug — by mengniwang95 (created: 2026-01-12 20:43 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32178 [Bug]: DSR1 NVFP4 DEP cannot run and it fails when _initialize_kv_caches — bug — by shyeh25 (created: 2026-01-12 21:30 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#32177 [Feature]: Enable automatic trace context propagation for offline inference — feature request — by minimAluminiumalism (created: 2026-01-12 21:26 (UTC+8)) ### 🚀 The feature, motivation and pitch
Enable automatic trace context propagation for offline inference APIs (`LLM.generate()`, etc.) by extracting trace headers from the current OpenTelemetry context. Currently, vLLM has two usage modes with different tracing support:
- Online serving (entry point: http://localhost:8000/v1/completions) can extract trace context from HTTP headers
- Offline inference (`LLM.generate()`): there is no way to pass or auto-detect trace context
The low-level `Inp…
-
#32152 [Bug]: hermes_tool_parser drops final JSON brace during streaming when tokens are chunked (e.g., MTP enabled) — bug — by heyzude (created: 2026-01-12 14:46 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32172 [Bug]: DeepSeek V3.2 MTP + PD report two errors — bug — by kebe7jun (created: 2026-01-12 20:11 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32151 [Bug]: jina-reranker-m0 infer error — bug — by chen03191108-lab (created: 2026-01-12 14:35 (UTC+8)) [💬3] ### Your current environment
============================== System Info ============================== OS : mtos 22.03 (LTS-SP1) (x86_64) GCC version : (GCC) 10.3.1 Clang version : Could not collect CMake version : version 3.31.8 Libc version : glibc-2.34 …
-
#32154 [Bug]: v0.13.0 docker image run Whisper model return error: "PlaceholderModule should not be used when the original module can be imported" — bug — by shengkui (created: 2026-01-12 15:17 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)…
[Closed issues]
-
#32006 [Bug]: vLLM hangs on specific request each time (qwen-coder-480b-fp8) — bug — by Yanpas (closed: 2026-01-13 10:33 (UTC+8)) [💬10] ### Your current environment
Can't do it. It's RH OpenShift, GPU H200, CUDA 12.9. Model: Qwen Coder 480b FP8. vllm params: --enable-expert-parallel --data-parallel-size=8 ...
-
#26854 [Bug]: Use vllm.envs.ENV_VARIABLE instead of ENV_VARIABLE — bug,stale — by Jialin (closed: 2026-01-13 10:22 (UTC+8)) [💬3] ### Your current environment
N/A
### 🐛 Describe the bug
Avoid importing environment variables directly in the code (i.e., `from vllm.envs import FOO_BAR`); use `envs.FOO_BAR` instead, which goes through the module's `__getattr__` and has two benefits:
- no extra overhead, as environment variable caching was added in https://github.com/vllm-project/vllm/pull/26146
- going through `__getattr__` ensures environment variable updates are picked up during service startup.
…
-
#22493 [RFC]: Should the gpt-oss reasoning parser use harmony directly? — RFC,stale,gpt-oss — by sethkimmel3 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### Motivation.
Currently, the GPT-OSS Reasoning Parser can’t really be used directly to extract reasoning content. The comment on L22 says:
The GptOss model uses harmony to extract reasoning content and this parser is only used for detecting the end of the reasoning content. However, to unify behavior with other Reasoning Parsers, it may be worth using …
-
#22809 [Bug]: How to set reasoning_effort for gpt-oss model to "high" in vllm — bug,stale,gpt-oss — by sravan500 (closed: 2026-01-13 10:15 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#22929 [Bug]: GLM-4.5V-FP8 output quality issue — bug,stale — by zixi-qi (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#23108 [Usage]: how to use built-in python tool of gpt-oss-20b after starting vllm serve --tool-server demo? — usage,stale,gpt-oss — by luppx (closed: 2026-01-13 10:15 (UTC+8)) [💬22] ### Your current environment
Hi, I’m trying to test gpt-oss with vllm on an H800 GPU. I have successfully installed vllm and relevant dependencies following the gpt-oss vllm installation guide:
pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128
Following the [gpt-oss vllm usage guide](https://docs.vllm.ai/pro…
-
#23220 [Feature][Responses API] Browsing Cursor -> Citation — feature request,stale,gpt-oss — by simon-mo (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Currently, we do not translate the cursor (indexed by the browsing tool) into Citation format properly. That means users won’t be able to access the URL being accessed.
Non-streaming mode: https://github.com/vllm-project/vllm/blob/c32e6ad1f63631fd8033f0cca3a35d5e48ccfc7f/vllm/entrypoints/harmony_utils.py#L182-L199
Streaming mode: https://github.com/vllm-project/vllm/blob/c32e6ad1f63631fd8033f0cca3a35d5e48ccfc7f/vllm/entrypoints/openai/serving_responses…
-
#23293 [Feature][Chat Completion] Support tool_choice other than "auto" with harmony format — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
Similar to https://github.com/vllm-project/vllm/issues/23227, but for the chat completion API. Needs to wait until https://github.com/vllm-project/vllm/pull/22386 is merged to start work, and may need to coordinate with https://github.com/vllm-project/vllm/issues/23227 to avoid duplicated work.
### Alternatives
No response
### Additional context …
-
#23430 [Bug]: Reasoning parser enabled by default and cannot be disabled — bug,stale,gpt-oss — by AyRickk (closed: 2026-01-13 10:15 (UTC+8)) [💬8] I run vllm in an offline environment with `vllm serve`, and since the new 0.10.1 release, the reasoning parser is enabled and I can’t disable it. I tried only with the gpt-oss model. Maybe add a parameter to disable it, like `--no-reasoning-parser` or `--reasoning-parser=false`.
-
#23632 [Feature]: AttributeError: Model GptOssForCausalLM does not support BitsAndBytes quantization yet. No 'packed_modules_mapping' found. Support GptOssForCausalLM of BitsAndBytes quantization? — bug,stale,gpt-oss — by xiaotianns (closed: 2026-01-13 10:15 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#23694 [Bug]: gpt-oss model output issue — bug,stale,gpt-oss — by woshizouguo (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### Your current environment
vLLM 1.0.1 python 3.12
### 🐛 Describe the bug
I fine-tuned the model without using any reasoning data. When generating, the model produces output like:
<|channel|>final<|message|>{"issues":[]}<|return|>…
-
#24062 20-series GPUs do not support the `sinks` parameter; attempting to access it directly raises an error. Could you fix this — stale,gpt-oss — by wang824892540 (closed: 2026-01-13 10:15 (UTC+8)) [💬2] https://github.com/vllm-project/vllm/blob/a344a5aa0a58cc1758d9721e848ce1f5ca4b6c7f/vllm/attention/layer.py#L148
PS C:\Users\y> docker run --runtime nvidia --gpus all -v C:/Users/y/model/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model unsloth/gpt-oss-20b-unsloth-bnb-4bit
INFO 09-01 19:03:21 [init.py:241] Automatically detected platform cuda. (APIServer pid=1) INFO 09-01 19:03:23 [api_server.py:1805] vLLM API server version 0.10.1.1 (APISer…
-
#24067 [Bug]: how to get purely deterministic output for gpt-oss-120b? — bug,stale,gpt-oss — by tonyaw (closed: 2026-01-13 10:15 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#24076 [Bug]: While serving GPT-OSS, streaming function calls output only reasoning_text, without function tool call — bug,stale,gpt-oss — by gsu2017 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### Your current environment
Python version: 3.12.11 …
-
#24148 [Usage]: Add toy example for gpt-oss container tools — usage,stale,gpt-oss — by lacora (closed: 2026-01-13 10:15 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (not provided) ### How would you like to use vllm
Followup on PR https://github.com/vllm-project/vllm/pull/23386 …
-
#24201 [Feature][gpt-oss] Responses API test enhancement — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-13 10:15 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
The current gpt-oss test only ensures that the workflow doesn’t crash; it has no correctness check. Help wanted on making the test more strict to avoid regressions.
https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_response_api_with_harmony.py
When implementing it, please make sure the tests are not flaky.
### Alternatives
…
-
#24283 [Bug]: GPT-OSS more robust way to handle messages in commentary channel — bug,stale,gpt-oss — by JasonZhu1313 (closed: 2026-01-13 10:15 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (not provided)…
-
#24292 [Usage]: Adjusting reasoning efforts for GPT-OSS in direct sampling — usage,stale,gpt-oss — by ff1Zzd (closed: 2026-01-13 10:15 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (not provided) ### How would you like to use vllm
Hi team, I understand that we can currently adjust the "reasoning_effort" for GPT-OSS when serving the model via vLLM's Chat Completions-compatible API. However, there seems to be no equivalent parameter when initialising the model via `LLM` and doing direct sampling (via `llm.generate`). …
-
#24797 [Bug]: Failing Qwen MoE EP Test in CI — bug,stale — by minosfuture (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#24803 [Bug]: AttributeError: 'FusedMoE' object has no attribute 'moe' when trying to run Qwen3-Next — bug,stale — by drrros (closed: 2026-01-13 10:14 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#24815 [Bug]: openbmb/MiniCPM-o-2_6 RuntimeError: CUDA error: an illegal memory access was encountered — bug,stale — by jiangtaozh (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
RTX5090 tp size 2 to deploy openbmb/MiniCPM-o-2_6 ```text (EngineCore_0 pid=271) INFO 09-14 03:09:35 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev459+g099c04646.d20250808) with config: model='/app/models/openbmb/MiniCPM-o-2_6', speculative_config=None, tokenizer='/prod/models/openbmb/MiniCPM-o-2_6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torc... -
#24816 [Usage]: should we explicitly modify `tokenizer.padding_side=left` when using vLLM to do batch inference? — usage,stale — by SpeeeedLee (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
N/A
### How would you like to use vllm
As the title says, I’m confused about whether vLLM handles this internally or not.
### Before submitting a new issue…
…
-
#24843 [Bug]: During the warmup phase, skipping attention (skipatten) caused an OOM (Out of Memory) error later. — bug,stale — by wangxiaoteng888 (closed: 2026-01-13 10:14 (UTC+8)) [💬2] ### Your current environment
vllm version v0.9.1 …
-
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29515 [CI Failure]: mi325_1: V1 Test attention (H100) — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s v1/attention
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#31631 [CI Failure]: mi325_1: V1 Test others — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:23 (UTC+8)) [💬3] ### Name of failing test
pytest -s -v -m 'not cpu_test' v1/kv_connector/unit
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29525 [CI Failure]: mi325_1: Quantization Test — ci-failure — by AndreasKaratzas (closed: 2026-01-13 07:22 (UTC+8)) [💬4] ### Name of failing test
uv pip install --system torchao==0.13.0 && VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#30801 [Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output — bug,rocm — by qli88 (closed: 2026-01-12 23:35 (UTC+8)) [💬3] ### Your current environment
Enabling graph_mode for model distilbert/distilgpt2 on the ROCm platform generates garbage output.
### 🐛 Describe the bug
For model distilbert/distilgpt2, if graph_mode (enforce_eager=False) is enabled, vLLM will generate garbage output like “!!!!!!!!!!!!!!!” no matter what the prompt is.
### Before submitting a new issue…
…
-
#31721 [Feature]: Fused MoE Micro Benchmark for CPU Backend — feature request,cpu — by andikarachman (closed: 2026-01-12 18:03 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
A unified script to benchmark Fused MoE performance on CPU, instead of having each contributor write their own benchmarks (and then throw them away). We need something similar to benchmark_paged_attention.py for CUDA. It’s easier (and less noisy) to have a micro benchmark (that runs in seconds) to iterate on instead of re-running end-to-end benchmarks whi…
- #32069 [Usage]: Qwen3-VL-Embedding Accuracy — usage — by xiazi-yu (closed: 2026-01-12 15:54 (UTC+8)) [💬10] Comparison testing between Qwen3-VL-Embedding-2B on vLLM 13.0 and Qwen3VLEmbedder revealed inconsistencies in accuracy.
-
#32154 [Bug]: v0.13.0 docker image run Whisper model return error: "PlaceholderModule should not be used when the original module can be imported" — bug — by shengkui (closed: 2026-01-12 15:37 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)…
-
#32023 [Bug]: Fail to get response from a Qwen2-audio-7B-Instruct Service — bug — by moonlightian (closed: 2026-01-12 14:06 (UTC+8)) [💬2] ### Name of failing test
None
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#30064 [Performance]: Degradation on rocm from v0.11.1 to v0.12.0 — performance,rocm — by IdoAsraff (closed: 2026-01-12 13:26 (UTC+8)) [💬5] ### Proposal to improve performance
No response
### Report of performance regression
TL;DR: on v0.11.1 I get much better performance (specifically TTFT) with deepseek on rocm than on v0.12.0.
# What I did for v0.12.0
…
-
#32011 [Bug]: "Re-running CMake" loop in pip install — bug — by zhaohaixu (closed: 2026-01-12 12:57 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)…
[New PRs]
-
#32167 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (created: 2026-01-12 18:44 (UTC+8)) [+442/-27, 1 file | commented: 1, approved: 1] ## Purpose Re-implement the Qwen3-Omni Audio Encoder with vLLM primitives, with some vectorization improvements: roughly 10% speedup at high batch sizes according to a profiling run with TP=1.
main
(EngineCore_DP0 pid=2620374) INFO 01-12 23:47:30 [gpu_model_runner.py:4696] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 21 audio items of the maximum feature size. (EngineCore_DP0 pid=2620374) Audio processing time: 0.3811814785003662 secondsThis PR …
- #32165 Specify the kv_cache dtype of the draft model — speculative-decoding,llama — by PHOEBEMOON0802 (created: 2026-01-12 18:19 (UTC+8)) [💬8 | +73/-2, 2 files | commented: 3]
Submit this PR to allow specifying the dtype of the kv_cache for the draft model via “kv_cache_dtype” in the speculative_config.
The specific usage is as follows:
When you need to specify the kv_cache dtype of the draft model as the required type and set the kv_cache dtype of the main model to another type, add the following to the startup command (take “auto” as an example):
```
python -m vllm.entrypoints.openai.api_server \
  --model /path-to-the-main-model \
  --tensor-parallel-size 1 \
  --kv…
```
-
#32197 [Misc] Allow enabling NCCL for DP sync when async scheduling — ready — by njhill (created: 2026-01-13 02:04 (UTC+8)) [💬2 | +26/-16, 4 files | commented: 4, approved: 1] It’s currently disabled by default in this case, but we should allow manual override either way.
See https://github.com/vllm-project/vllm/issues/32140
[!NOTE] Sets `ParallelConfig.disable_nccl_for_dp_synchronization` to accept `None` and defers the defaulting logic to `VllmConfig`. `disable_nccl_for_dp_synchronization` changed to `bool | None` in `ParallelConfig` and CLI args; doc clarifies default behavior …
-
#32204 [Core][KVConnector] Support HMA+NixlConnector — needs-rebase,v1,kv-connector — by NickLucche (created: 2026-01-13 03:00 (UTC+8)) [💬1 | +449/-185, 6 files | commented: 4] ## Overview Currently connectors cannot take full advantage of models that employ hybrid attention (FA+SWA) and treat all layers as FA, as the Hybrid KV Cache Manager is disabled.
This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved with a xfer for SWA+FA models, while laying the ground for state-based ones (mamba etc.). Example of the former: ``` # NON-HMA (current master) (EngineCore_DP0 pid=521538) get_block_descs_ids…
- #32215 [6/N][Attention] Move utils to more appropriate locations — ready,v1,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-13 05:53 (UTC+8)) [💬1 | +168/-178, 14 files | commented: 1, approved: 1]
## Purpose
Step 6 of #31919:
- Move `_make_metadata_with_slice`, `slice_query_start_locs`, `split_attn_metadata` to `vllm/v1/worker/ubatch_utils.py`
- Move `subclass_attention_backend`, `subclass_attention_backend_with_overrides` to `vllm/v1/attention/backend.py`
## Test Plan CI
## Test Result
…
-
#32205 [CI/Build] Fix ffmpeg CVEs — ci/build — by junpuf (created: 2026-01-13 03:14 (UTC+8)) [+22/-1, 1 file | commented: 3] ## Purpose
The current vLLM Docker images contain 31 security vulnerabilities in FFmpeg and related libraries, including 3 Critical and 12 High severity CVEs. These vulnerabilities stem from using Ubuntu’s universe repository version of FFmpeg (4.4.2), which requires Ubuntu Pro subscription for security patches.
The vulnerabilities exist because:
- Ubuntu Universe Repository: vLLM uses FFmpeg from Ubuntu’s universe repository
- No Free Security Updates: Ubuntu universe p…
-
#32194 [Model] Handle `trust_remote_code` for transformers backend — new-model,ready — by DarkLight1337 (created: 2026-01-13 01:18 (UTC+8)) [+14/-1, 2 files | commented: 2, approved: 1] ## Purpose
Add a `trust_remote_code` argument to `try_get_class_from_dynamic_module` and handle it accordingly. ## Test Plan
## Test Result
...
-
#32226 [Misc] improve warning/assert messages — ready,v1 — by cjackal (created: 2026-01-13 08:52 (UTC+8)) [+19/-19, 6 files | commented: 1, approved: 1] ## Purpose
Fix missing whitespaces and typos and polish sentences in various warning/assert messages.
## Test Plan
CI green
## Test Result
…
-
#32191 [AOT compilation] cached inductor artifacts benchmark #32043 — performance — by dolpm (created: 2026-01-13 01:04 (UTC+8)) [💬1 | +662/-0, 1 file | commented: 2] Benchmark for https://github.com/vllm-project/vllm/pull/25205 ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32224 fix cutlass_3x_gemm_fp8_blockwise on sm103a — nvidia — by IwakuraRein (created: 2026-01-13 08:45 (UTC+8)) [+36/-6, 2 files | commented: 1 | draft]
## Purpose
When compiling with sm103a, the output of `cutlass_3x_gemm_fp8_blockwise` is garbage. Updated the helpers in `csrc/cutlass_extensions/common.hpp` to include sm103a. ## Test Plan
…
-
#32216 [Frontend] Add dedicated KimiK2ReasoningParser for tool call handling — no labels — by daniel-salib (created: 2026-01-13 06:09 (UTC+8)) [+386/-2, 3 files | commented: 4] ## Purpose
Add a dedicated `KimiK2ReasoningParser` to handle Kimi K2’s behavior of sometimes outputting tool calls without a proper `</think>` delimiter. Kimi K2 uses the same `<think>...</think>` tokens as DeepSeek R1 for reasoning content. However, when making tool calls, the model sometimes omits the `</think>` token, causing tool call markers to be absorbed into the reasoning content instead of being passed to the tool parser. This PR adds a specialized reasoning parser that:
- Extends `D…
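The fallback behavior described above can be sketched as follows (simplified; this is not the actual `KimiK2ReasoningParser` logic):

```python
# Simplified sketch: when the close-think token is missing, fall back to
# splitting at the first tool-call marker instead of losing the tool call.
THINK_END = "</think>"
TOOL_CALL_BEGIN = "<|tool_call_begin|>"

def split_reasoning(text: str) -> tuple[str, str]:
    if THINK_END in text:
        reasoning, rest = text.split(THINK_END, 1)
        return reasoning, rest
    if TOOL_CALL_BEGIN in text:
        # Model omitted </think>: everything before the tool marker is
        # reasoning; the marker onwards goes to the tool parser.
        idx = text.index(TOOL_CALL_BEGIN)
        return text[:idx], text[idx:]
    return text, ""

r, rest = split_reasoning("plan steps<|tool_call_begin|>functions.get_weather:0")
print(r)     # plan steps
print(rest)  # <|tool_call_begin|>functions.get_weather:0
```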
-
#32211 [Perf] Optimize requests abort — ready,v1 — by yewentao256 (created: 2026-01-13 05:00 (UTC+8)) [💬2 | +4/-3, 1 file | commented: 1] ## Purpose
Batch `reqs_to_abort` and abort once at the end instead of calling each time, yielding a small perf improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k…
-
#32146 [Frontend] Support OpenAI-style tool call IDs in Kimi K2 parser — documentation — by daniel-salib (created: 2026-01-12 13:14 (UTC+8)) [💬2 | +233/-9, 2 files | commented: 5] ## Purpose
Kimi K2 sometimes generates OpenAI-style tool call IDs (e.g., `call_abc123def456`) instead of the native format (`functions.get_weather:0`). This occurs when the model is prompted with OpenAI-style tool definitions or when the chat template uses OpenAI conventions: the model adapts its output format to match the input format. Native Kimi K2 format:
`<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city": "Tokyo"}<|tool_call_end|>` - Function name is e…
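Distinguishing the two ID formats might look like this (the regexes are illustrative guesses, not the parser's actual patterns):

```python
import re

# Illustrative classifiers for the two ID shapes described in the PR.
OPENAI_STYLE = re.compile(r"^call_[0-9a-zA-Z]+$")
KIMI_NATIVE = re.compile(r"^(?P<name>[\w.]+):(?P<index>\d+)$")

def classify_tool_call_id(tool_id: str) -> str:
    if OPENAI_STYLE.match(tool_id):
        return "openai"
    if KIMI_NATIVE.match(tool_id):
        return "kimi-native"
    return "unknown"

print(classify_tool_call_id("call_abc123def456"))        # openai
print(classify_tool_call_id("functions.get_weather:0"))  # kimi-native
```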
-
#32217 [Draft][Kernel] Add new flashinfer A2A kernel — nvidia — by hjjq (created: 2026-01-13 06:13 (UTC+8)) [+736/-59, 15 files | commented: 1 | draft] ## Purpose Add the latest TRT-LLM gen A2A kernel from flashinfer’s MoE-A2A API (https://github.com/flashinfer-ai/flashinfer/pull/2102) (a.k.a. one-sided all-to-all). This should perform better than the older A2A kernel (#21003) (`flashinfer_all2allv`) at large batch sizes. Enable with `--all2all-backend flashinfer_moe_a2a`. Still WIP. Depends on #31770.
## Test Plan
## Test Result
…
-
#32199 [BUGFIX] Add missed remaping of the names of fp8 kv-scale — ready,qwen — by vadiklyutiy (创建于: 2026-01-13 02:22 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1 approved:1] Qwen3-Next-NVFP4 checkpoint produced a lot of the following warnings
$ vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 -tp 2
...
Parameter layers.3.self_attn.qkqkv_proj.k_scale not found in params_dict, skip loading
Parameter layers.3.self_attn.qkv_proj.v_scale not found in params_dict, skip loading
...…
-
#32202 Initial structural_tag support for tool calling — tool-calling — by mgoin (创建于: 2026-01-13 02:57 (UTC+8)) [+232/-2, 4 files | commented:1]
## Purpose
An initial attempt at making progress on https://github.com/vllm-project/vllm/issues/32142
Adds structural tag support for tool calling so that `tool_choice="required"` constrains only the JSON arguments region, not the entire output. This allows models to generate their native tool tokens (e.g., `<tool_call>`) and any reasoning before the constrained region. Implemented for the Hermes parser as a reference; other parsers can override `get_structure_info()`. … -
#32201 [CI][AMD][Quantization] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol — rocm — by rasmith (创建于: 2026-01-13 02:47 (UTC+8)) [+4/-3, 2 files | commented:2] After a recent PR, the
`test_fp8_quant.::test_static_fp8_quant_group_2d` test was failing on ROCm with the following error:
E Greatest absolute difference: 0.0009765625 at index (361, 1414) (up to 0.0 allowed)
E Greatest relative difference: 0.1428571492433548 at index (361, 1414) (up to 0.12 allowed)
The value passed to `scaled_quantize` is not correct for ROCm, and the scaling used in `quant_utils` in `scaled_quantize` is … -
#32212 Fix various typos found in `docs` — documentation,structured-output,cpu — by potatosalad (创建于: 2026-01-13 05:10 (UTC+8)) [💬2 | +26/-26, 21 files | commented:1] Fix various typos found in `docs`:
- `docs/contributing/deprecation_policy.md`: "2.Deprecated" → "2. Deprecated"
- `docs/contributing/model/basic.md`: "interleave sliding windows" → "interleaved sliding windows"
- `docs/deployment/frameworks/cerebrium.md`: missing space before backtick → added space
- `docs/deployment/frameworks/hf_inference_endpoints.md`: "Click to Deploy" → "Click the Deploy"
- `docs/deployment/integrati… -
#32210 [6/n] Migrate paged attention, merge_attn_states, convert_vertical_slash_indexes — needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (创建于: 2026-01-13 04:28 (UTC+8)) [💬1 | +1941/-1423, 37 files | commented:1 | 📝草稿]
## Purpose Stacked on https://github.com/vllm-project/vllm/pull/31945
## Test Plan
pytest tests/kernels/attention/test_attention.py -v
CC=gcc pytest tests/kernels/attention/test_merge_attn_states.py -v
## Test Result …
-
#32209 [Model Runner V2] Minor refactor for logit_bias — v1 — by WoosukKwon (创建于: 2026-01-13 04:11 (UTC+8)) [+54/-26, 1 files | approved:1 commented:1] Minor code reorganization for better testing and code reuse
[!NOTE] Separates kernel launch logic from state, improving testability and reuse.
- Adds `apply_logit_bias(...)` helper that computes `BLOCK_SIZE` from the `allowed_token_ids`, `logit_bias_token_ids`, and `stop_token_ids` shapes and invokes `_bias_kernel`
- Updates `LogitBiasState.apply_logit_bias` to call the new helper, passing `.gpu` tensors instead of handling strides/size inline
- Keeps `_bias_kernel` logic unchange…
- #32203 [Rocm][CI] Merge the 2-node test commands into one — rocm,ci/build — by charlifu (创建于: 2026-01-13 02:57 (UTC+8)) [+6/-6, 1 files | commented:2]
[!NOTE] Consolidates the per-node setup in the AMD CI 2-node test step by chaining the three startup checks into a single command per node.
- In `test-amd.yaml` under `2 Node Tests (4 GPUs in total)`, merges the three per-node commands (`distributed/test_same_node….
-
#32207 [MoE Refactor] Remove CustomOp from UnquantizedFusedMoEMethod — 无标签 — by bnellnm (创建于: 2026-01-13 04:01 (UTC+8)) [💬2 | +20/-44, 2 files | commented:1] ## Purpose Remove CustomOp from UnquantizedFusedMoEMethod to make the MoE runner refactoring simpler.
## Test Plan CI MoE Refactor integration tests
## Test Result
cc @robertgshaw2-redhat , @mgoin , @ProExpertProg , @WoosukKwon …
-
#32195 Add TMA support to fused_moe_lora kernel — 无标签 — by gnovack (创建于: 2026-01-13 01:23 (UTC+8)) [+138/-34, 2 files | commented:4] ## Purpose
Adds support for loading A and B matrices from TMA descriptors in the existing fused_moe_lora kernel.
#### Implementation Details
- Adds a `supports_tma` function to conditionally enable TMA based on compute capability
- Reads LoRA A and B weights using TMA in MoE LoRA shrink and expand, respectively.
- Since LoRA A and B weights can be represented as a list of tensors (i.e. `num_slices > 1`), we only pre-allocate the TMA descriptor on-host when `num_slices == 1`; otherwise, we c…
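As a rough illustration of the capability gate named above: TMA is a Hopper-generation (SM 9.0+) hardware feature, so a `supports_tma`-style check reduces to comparing compute capabilities. This sketch is illustrative only; the PR's actual function may take different inputs:

```python
def supports_tma(compute_capability: tuple[int, int]) -> bool:
    """Illustrative gate only: TMA is available from Hopper (SM 9.0) onward."""
    return compute_capability >= (9, 0)

print(supports_tma((9, 0)))  # True  (Hopper)
print(supports_tma((8, 0)))  # False (Ampere falls back to the non-TMA path)
```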
- #32208 [P/D] Add decode-side check for prefill block expiry — kv-connector — by njhill (创建于: 2026-01-13 04:11 (UTC+8)) [+87/-4, 1 files | commented:1 | 📝草稿] Claude code was used to help with this.
-
#32198 [Docs] Nixl Usage recommend `fail` `kv_load_failure_policy` — documentation,kv-connector — by NickLucche (创建于: 2026-01-13 02:08 (UTC+8)) [💬2 | +16/-6, 1 files | commented:1 approved:1] Small update to the suggested setup in the docs; cc @wseaton.
[!NOTE] Emphasizes preferred failure handling and clarifies behavior for KV cache transfer.
- Adds `"kv_load_failure_policy":"fail"` to all NixlConnector example `--kv-transfer-config` commands (single-host, multi-host prefiller/decoder)
- New "KV Load Failure Policy" section describing `fail` vs `recompute` (default) with a warning about the performance impact of `recompute`
- No code changes; documentation only …
- #32163 [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens — v1 — by WoosukKwon (创建于: 2026-01-12 18:08 (UTC+8)) [💬2 | +253/-5, 2 files | commented:8]
[!NOTE] Implements GPU-side sampling constraints with a dedicated state and Triton kernel.
- Adds `LogitBiasState` to stage per-request `allowed_token_ids`, `logit_bias`, and `min_tokens`/`stop_token_ids` using `UvaBackedTensor` and `StagedWriteTensor`
- Introduces `_bias_kernel` (Triton) to: mask logits to only `allowed_token_ids`, apply per-token `logit_bias`, and enforce `min_tokens` by suppressing `stop_token_ids` until `min_len`
- Provides `add_request`, `apply_staged_writes`, …
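A NumPy sketch of the three constraints the summary says the kernel fuses. This is illustrative only; the actual `_bias_kernel` is a Triton kernel operating on GPU tensors, and the helper name here is invented:

```python
import numpy as np

def apply_sampling_constraints(logits, allowed_ids, logit_bias,
                               stop_ids, num_generated, min_tokens):
    """Toy CPU version: allowed-ids mask, per-token bias, min-tokens guard."""
    out = np.array(logits, dtype=np.float32)
    if allowed_ids is not None:
        # Mask logits so only allowed token ids remain sampleable.
        mask = np.full_like(out, -np.inf)
        mask[allowed_ids] = 0.0
        out = out + mask
    # Apply per-token logit bias.
    for tok, bias in logit_bias.items():
        out[tok] += bias
    # Suppress stop tokens until min_tokens have been generated.
    if num_generated < min_tokens:
        out[stop_ids] = -np.inf
    return out

logits = np.zeros(5, dtype=np.float32)
out = apply_sampling_constraints(logits, [1, 2, 3], {2: 1.5}, [3], 0, 4)
print(out.tolist())  # [-inf, 0.0, 1.5, -inf, -inf]
```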
-
#32189 fp8 online quant: split out Fp8OnlineLinearMethod — 无标签 — by vkuzo (创建于: 2026-01-12 23:26 (UTC+8)) [💬3 | +164/-116, 2 files | commented:5] Summary:
Split out `Fp8OnlineLinearMethod` from `Fp8LinearMethod` to more clearly separate online quant from offline quant logic, following a similar PR recently landed for `Fp8OnlineMoEMethod`.
In the same PR, beef up testing for online quant in integration tests a bit so we can depend on tests for testing future functionality for online quant. Specifically, extend the online fp8 quant test to also include a small MoE model, and extend it to run inference with a couple of tokens.
Test …
-
#32206 [WIP][Spec Decode] DFlash — new-model,speculative-decoding,v1,qwen — by benchislett (创建于: 2026-01-13 03:19 (UTC+8)) [+660/-13, 7 files | commented:1 | 📝草稿] ## Purpose
Add support to vLLM for DFlash speculative decoding.
Currently it's hacked into the EAGLE path and only works with batch size 1 and enforce-eager. There are no blockers here; the code just needs cleanup to support the remaining cases.
## Test Plan
Check AR on MTBench and compare to https://github.com/sgl-project/sglang/pull/16818. With BS=1 and enforce-eager, we match AL with both reasoning and non-reasoning. …
-
#32192 [Misc] Change log level for batch queue log — ready,v1 — by NickLucche (创建于: 2026-01-13 01:08 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] As per https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1768224082041559
[!NOTE] …
-
#32173 [BugFix] scheduler: Fix ordering preserving of skipped requests — ready,v1 — by orozery (创建于: 2026-01-12 20:14 (UTC+8)) [💬1 | +31/-15, 2 files | commented:1 approved:1] This PR fixes order preservation for requests skipped by the scheduler. A unit test is added to verify the fix.
I came across this bug while testing #29087, and the test there requires this fix.
…
-
#32181 nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak — kv-connector — by hasB4K (创建于: 2026-01-12 22:05 (UTC+8)) [💬1 | +13/-0, 1 files | commented:3] Implementation of the fix described here: https://github.com/vllm-project/vllm/issues/24264#issuecomment-3710305796
[!NOTE]
Mitigates a UCX memory leak when using NIXL by configuring UCX mmap hooks before loading NIXL bindings.
- If `UCX_MEM_MMAP_HOOK_MODE` is unset, set it to `none` prior to importing `nixl`/`rixl`; log a warning if `nixl` or `rixl` was already imported
- Adds `os`/`sys` imports; retains ROCm vs CUDA import paths and existing availability logging in `nixl_conne…
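A minimal standalone sketch of the approach described above (the function name is mine; the actual change lives inside vLLM's NIXL connector module): the environment variable must be set before the bindings are imported, so an already-imported module can only be warned about.

```python
import os
import sys

def configure_ucx_mmap_hooks() -> None:
    """Hypothetical standalone version of the workaround described above."""
    for mod in ("nixl", "rixl"):
        if mod in sys.modules:
            # Too late for this process: UCX read the env at import time.
            print(f"warning: {mod} already imported; "
                  "UCX_MEM_MMAP_HOOK_MODE may not take effect")
    # Respect a user-provided value; otherwise disable the mmap hooks.
    os.environ.setdefault("UCX_MEM_MMAP_HOOK_MODE", "none")

configure_ucx_mmap_hooks()
```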
-
#32155 [Frontend] Normalize Responses API input for client compatibility — frontend — by daniel-salib (创建于: 2026-01-12 15:33 (UTC+8)) [💬1 | +85/-5, 2 files | commented:3] ## Purpose
This PR improves client compatibility for the Responses API by normalizing input items before validation. This specifically addresses compatibility with Codex and other OpenAI SDK clients that serialize requests with varying conventions that can cause validation failures.
Change:
- Strip `None` values from input items: some clients (including Codex) include explicit `None` for optional fields (e.g., `name: null`, `status: null`). These are stripped to prevent validation error…
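The stripping step reduces to a one-line dict filter; a minimal sketch with an invented helper name, not the PR's actual code:

```python
def strip_none_fields(item: dict) -> dict:
    """Drop explicit None values so validation of optional fields passes."""
    return {k: v for k, v in item.items() if v is not None}

print(strip_none_fields({"type": "message", "name": None, "status": None}))
# {'type': 'message'}
```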
-
#32196 [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE — 无标签 — by lkm2835 (创建于: 2026-01-13 01:49 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1] ## Purpose Added the missing changes. Related to #31104.
## Test Plan
## Test Result
... -
#32182 [Update] Bump FlashInfer version to 0.6.0 — ci/build,v1,nvidia — by askliar (创建于: 2026-01-12 22:28 (UTC+8)) [💬3 | +52/-50, 3 files | commented:3 | 📝草稿] The latest FlashInfer (0.6.0) changes the plan API, which breaks `fast_plan_decode` (again). The main culprit is a change in the `plan` function in FlashInfer that started accepting an extra `o_data_type` argument. Since the current backend was mostly using positional arguments, these broke when new arguments were added. This PR updates FlashInfer to the latest version and fixes the API breakage by:
- Adding an extra `o_data_type` arg to `fast_plan_decode`
- Replacing positional arguments with keyword argumen…
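The failure mode is generic Python, not FlashInfer-specific: inserting a parameter into a signature silently re-binds positional call sites. A toy illustration with invented function names:

```python
def plan_old(indptr, indices, data_type):
    """Invented stand-in for a pre-change signature."""
    return {"data_type": data_type}

def plan_new(indptr, indices, o_data_type, data_type=None):
    """Invented stand-in for the new signature with an extra argument."""
    return {"o_data_type": o_data_type, "data_type": data_type}

# A positional call written against the old signature silently re-binds:
print(plan_old([0], [1], "f16"))  # {'data_type': 'f16'}
print(plan_new([0], [1], "f16"))  # {'o_data_type': 'f16', 'data_type': None}

# Keyword arguments keep call sites correct across signature changes:
print(plan_new([0], [1], o_data_type="f32", data_type="f16"))
```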
-
#32184 [Benchmark] Share data between SLA runs — performance,ready — by DarkLight1337 (创建于: 2026-01-12 22:31 (UTC+8)) [+108/-25, 2 files | commented:6 approved:1] ## Purpose
Further optimize the SLA script to share past run data coming from the same serve/bench combination. This is useful when tuning for multiple SLA targets (e.g. E2EL <= 200 ms, 500 ms, 1000 ms, …) while keeping the serve/bench combination unchanged.
## Test Plan
## Test Result
... -
#32148 [Model] Standardize pooling heads — ready — by DarkLight1337 (创建于: 2026-01-12 13:58 (UTC+8)) [+182/-149, 9 files | approved:1 commented:9] ## Purpose
Follow-up to #32119, make pooling params actually take effect for custom models.
## Test Plan
## Test Result
... -
#32169 [BugFix] [KVConnector] Fix KV events for LMCache connector — kv-connector — by hickeyma (创建于: 2026-01-12 19:19 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
A small fix for the LMCache connector, as LMCache does not yet support `lora_name` as an events property. Bug introduced in #27577.
The error occurs when trying to prompt a model with the LMCache connector and KV events enabled:
``` […] (EngineCore_DP0 pid=223048) [2026-01-12 11:02:52,233] LMCache INFO: Reqid: cmpl-a310c95f98e4f404-0-8bf3d3d5, Total tokens 32, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1602:lmcach…
- #32157 feat: add requires_token_ids interface for sampling params — v1 — by llsj14 (创建于: 2026-01-12 16:24 (UTC+8)) [+23/-0, 2 files | commented:3]
## Purpose
- Async scheduling currently blocks the delivery of prompt/output token IDs to sampling parameters, which means that, for example, prompt or output token IDs are set to -1 and only updated asynchronously.
- Certain logits processors (such as a thinking budget) require access to prompt and output token IDs, and therefore an explicit interface is needed to propagate this information to sampling parameters.
- Pooling parameters already expo…
-
#32188 doc: Update model references in supported_models.md — documentation,ready — by andyzhangx (创建于: 2026-01-12 22:52 (UTC+8)) [💬1 | +2/-2, 1 files | commented:4 approved:1] ## Purpose Update model references in supported_models.md. We use a tool to parse all model names listed on this page, but some items lack an org name, which makes them difficult to parse automatically; this PR fixes that.
## Test Plan
## Test Result
... -
#32179 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-12 21:40 (UTC+8)) [+21/-13, 1 files | commented:1 approved:1] ## Purpose
Fix the following bug:
$ DOCKER_BUILDKIT=1 docker build \
    -f docker/Dockerfile.rocm_base \
    -t rocm/vllm-dev:base-debug .…
-
#32185 doc: Update model name for Qwen3-Coder in documentation — documentation,ready,tool-calling,qwen — by andyzhangx (创建于: 2026-01-12 22:37 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1 commented:1] ## Purpose
Update the model name for Qwen3-Coder in the documentation; the original name is incorrect. The correct name can be found here: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
## Test Plan
## Test Result
... -
#32183 [MM Encoder] Add Triton ViT attention backend — v1,nvidia — by Isotr0py (创建于: 2026-01-12 22:29 (UTC+8)) [+152/-9, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32170 docs: fix typo in docker_run_bm.sh — tpu,ci/build — by darshan-stack (创建于: 2026-01-12 19:41 (UTC+8)) [💬1 | +11/-20, 1 files | commented:4] Fixes a small typo ‘useually’ -> ‘usually’ in the error message of docker_run_bm.sh.
## Purpose
## Test Plan
## Test Result
…
-
#32156 [Frontend] Fix missing channel assignment for multi-turn message parsing — frontend,gpt-oss — by daniel-salib (创建于: 2026-01-12 15:45 (UTC+8)) [+115/-2, 2 files | commented:2] ## Purpose
This PR fixes a bug where messages parsed via
`parse_response_input` and `parse_input_to_harmony_message` were missing required channel assignments, causing `parse_output_message` to raise `ValueError: Unknown channel: None`.
This is critical for Codex compatibility in multi-turn conversations where reasoning items and assistant messages from previous responses are passed back as input.
Changes:
`parse_response_input`: Assign `"analysis"` channel to reasoning items (match…
-
#32175 [Bugfix] [Core] fix sparse_attn_indexer padding — deepseek — by kebe7jun (创建于: 2026-01-12 20:20 (UTC+8)) [💬1 | +9/-1, 1 files | commented:2] ## Purpose
Fix https://github.com/vllm-project/vllm/issues/32172
## Test Plan
``` # Prefill vllm serve /gpfs/rd/models/DeepSeek-V3.2 -tp=2 -dp 4 --trust-remote-code --enable-expert-parallel --all2all-backend=deepep_high_throughput --gpu_memory_utilization=0.9 --max-model-len 102400 --tokenizer-mode=deepseek_v32 --enable-eplb --eplb-config '{"window_size":"32","step_interval":"32","num_redundant_experts":"8", "async": "True"}' --kv-transfer-config '{"kv_connector":"NixlConnect…
-
#32162 [Frontend][Tracing] Add support for tracing aborted requests — v1 — by zhanghaotong (创建于: 2026-01-12 17:41 (UTC+8)) [💬3 | +88/-19, 4 files | commented:5] ## Purpose Currently, requests that are aborted or terminated due to errors do not emit their corresponding tracing spans, making it challenging to diagnose and troubleshoot issues in production environments.
This PR enhances the tracing functionality by ensuring aborted requests generate trace spans and including the error that caused the abortion in the trace span.
## Test Plan
- Start vLLM with tracing enabled: ``` export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://0.0.0.0:4317 export OTEL_S…
-
#32171 Optimize FlashInfer TRTLLM FP4 MoE quantization #32057 — nvidia — by baonudesifeizhai (创建于: 2026-01-12 20:05 (UTC+8)) [+54/-16, 2 files | commented:1] ## Purpose This PR optimizes the FlashInfer TRTLLM FP4 MoE quantization path by replacing
`flashinfer.fp4_quantize()` with vLLM's native `scaled_fp4_quant()` operation, which is faster and better optimized. #32057
## Test Plan ``` export VLLM_DISABLE_COMPILE_CACHE=1 python -m vllm.entrypoints.openai.api_server
--model nvidia/DeepSeek-R1-NVFP4
--tensor-parallel-size 4 … -
#32174 Optimize fused MoE LoRA intermediate buffers and Triton indexing — 无标签 — by cwazai (创建于: 2026-01-12 20:17 (UTC+8)) [💬1 | +117/-59, 1 files | commented:2] Summary
This PR optimizes the fused MoE LoRA Triton op in two ways:
- Reuse intermediate workspaces instead of allocating and zeroing temporary tensors on every call.
- Reduce integer arithmetic cost in the Triton kernel by:
  - removing the `% N` modulus in `offs_bn` and replacing it with a mask
  - using int32 index arithmetic where possible. …
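The mask-instead-of-modulus idea, in NumPy terms as a stand-in for the Triton change (Triton's masked loads, `tl.load(..., mask=..., other=...)`, behave analogously); the values here are invented for illustration:

```python
import numpy as np

N = 6        # number of valid columns
BLOCK = 4    # tile width
pid = 1      # this tile extends past the end of the data
offs = pid * BLOCK + np.arange(BLOCK)   # [4, 5, 6, 7]
mask = offs < N                          # [True, True, False, False]
data = np.arange(N) * 10
# Masked "load": clamp indices for safety, then zero out masked lanes,
# instead of wrapping out-of-range offsets with `% N`.
safe = np.where(mask, data[np.minimum(offs, N - 1)], 0)
print(safe.tolist())  # [40, 50, 0, 0]
```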
- #32149 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation — documentation,ready,v1 — by Isotr0py (创建于: 2026-01-12 14:05 (UTC+8)) [💬1 | +13/-27, 4 files | commented:3 approved:1]
## Purpose
- A small fix for Triton encoder-only attention; otherwise `TritonAttentionImpl.scale` won't actually take effect.
## Test Plan
## Test Result
...
-
#32168 scheduler: Cache also the last block after KV recving — v1 — by orozery (创建于: 2026-01-12 18:56 (UTC+8)) [+3/-2, 1 files | commented:1] This PR fixes the scheduler to commit the last full block of KV data that was async received.
@robertgshaw2-redhat this is modifying code you introduced in #17751. I think it’s safe to cache that last block as well, but not sure. cc @njhill
BTW, do we really have to re-compute the last token, or can we somehow re-use the KV data that we saved for it?
…
-
#32166 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (创建于: 2026-01-12 18:40 (UTC+8)) [+425/-17, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32159 [Doc] Improve LoRA docs — documentation,ready — by jeejeelee (创建于: 2026-01-12 16:54 (UTC+8)) [💬2 | +12/-15, 1 files | commented:2 approved:1] ## Purpose vLLM has removed the LoRA-based extended vocab size feature, so vLLM doesn’t support LoRA adapters like
`yard1/llama-2-7b-sql-lora-test` anymore. We also need to update the related documentation.
## Test Plan
## Test Result
... -
#32164 [Bugfix] fix apiserver crash exist then engine core close — frontend,v1 — by lengrongfu (创建于: 2026-01-12 18:15 (UTC+8)) [+16/-0, 2 files | commented:1] ## Purpose
Fix: https://github.com/vllm-project/vllm/issues/32116
## Test Plan
We can set a very small ready timeout value to reproduce it.
$ VLLM_ENGINE_READY_TIMEOUT_S=10 CUDA_VISIBLE_DEVICES=0,1 vllm serve /home/jovyan/qwen3-30b-a3b --gpu_memory_utilization=0.9 --enable-expert-parallel --data-parallel-size=2 --api-server-count 2 --data-parallel-address 127.0.0.1 --data-parallel-size-local 2 --data-parallel-start-rank 0…
-
#32158 [doc] fix broken links — documentation,ready — by minimAluminiumalism (创建于: 2026-01-12 16:53 (UTC+8)) [💬2 | +7/-21, 1 files | approved:1] ## Purpose
Images were using HTML `<img>` tags with relative paths. MkDocs doesn't process relative paths in HTML tags, causing incorrect URLs.
## Test Plan
NA …
-
#32161 [CPU] Split attention dispatch by head_dim alignment — cpu — by R3hankhan123 (创建于: 2026-01-12 17:37 (UTC+8)) [+79/-54, 3 files | commented:2] ## Purpose Separate 32-aligned (AMX/NEON/VEC/VEC16) and 16-only (VEC16 only) head_dim dispatch paths to avoid redundant template instantiations and fix NEON and AMX compilation errors for head_dim 80/112 when static assertions are enabled
Head dims divisible by 32 route through all ISA implementations, while head dims divisible by 16 but not 32 are restricted to VEC16 only, preventing unsupported ISA combinations from being instantiated during compilation.
## Test Plan Build Docker image and t…
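The dispatch rule described above can be sketched as a toy lookup (names are illustrative; the real change is C++ template dispatch, not Python):

```python
def attention_isa_candidates(head_dim: int) -> list[str]:
    """Toy version of the head_dim dispatch rule from the PR summary."""
    if head_dim % 32 == 0:
        # 32-aligned head dims route through all ISA implementations.
        return ["AMX", "NEON", "VEC", "VEC16"]
    if head_dim % 16 == 0:
        # 16-but-not-32-aligned (e.g. 80, 112) restricted to VEC16 only.
        return ["VEC16"]
    raise ValueError(f"unsupported head_dim {head_dim}")

print(attention_isa_candidates(128))  # ['AMX', 'NEON', 'VEC', 'VEC16']
print(attention_isa_candidates(80))   # ['VEC16']
```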
-
#32153 [Frontend] Fix Flaky MCP Streaming Test — ready — by daniel-salib (创建于: 2026-01-12 14:46 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose test_mcp_code_interpreter_streaming is occasionally failing due to inconsistent behavior of whether it decides to trigger a tool call. Since the math question is too simple, a tool call isn’t always triggered. A more complicated math problem like “123 * 456” consistently triggers the MCP tool call
## Test Plan
pytest entrypoints/openai/responses/test_harmony.py::test_mcp_code_interpreter_streaming
## Test Result
…
-
#32160 [feat] Add ATOM model impl backend — new-model,rocm — by zejunchen-zejun (创建于: 2026-01-12 17:23 (UTC+8)) [+158/-9, 5 files | commented:1 | 📝草稿] -
#32150 [Model] Remove incorrect `SupportsPP` from MTP models — ready,qwen,deepseek — by DarkLight1337 (创建于: 2026-01-12 14:25 (UTC+8)) [+6/-15, 6 files | commented:1 approved:2] ## Purpose
MTP models don't actually support PP.
## Test Plan
## Test Result
... -
#32147 Add descriptive error message for missing tools. — frontend,gpt-oss,meta-exported,fb-exported — by madongfly (创建于: 2026-01-12 13:28 (UTC+8)) [💬2 | +85/-8, 2 files | commented:1] Summary: Add descriptive error message for missing tools.
Test Plan: Unit test
Differential Revision: D90477073
[!NOTE] …
-
#32145 update cutlass_moe_mm error check message — nvidia — by XiaobingSuper (创建于: 2026-01-12 12:10 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
Fixed a typo in the `cutlass_moe_mm` arch check message.
## Test Plan
## Test Result
... - #32144 [code clean] remove useless parameters — frontend — by andyxning (创建于: 2026-01-12 10:56 (UTC+8)) [+0/-2, 1 files | commented:2]
## Purpose
Code cleanup: remove an unused parameter.
## Test Plan
NA
## Test Result
NA
...
[已合并 PR]
-
#32127 [BugFix] Fix engine crash caused by chat tools + response_format — ready,v1,tool-calling — by njhill (合并于: 2026-01-13 10:33 (UTC+8)) [💬2 | +47/-0, 3 files | commented:3 approved:1] Our input/request validation logic overall is a mess. This is a minimal fix for now but some overhaul is needed imo.
Fixes https://github.com/vllm-project/vllm/issues/32006
…
-
#32197 [Misc] Allow enabling NCCL for DP sync when async scheduling — ready — by njhill (合并于: 2026-01-13 10:03 (UTC+8)) [💬2 | +26/-16, 4 files | commented:4 approved:1] It’s currently disabled by default in this case but we should allow manual override either way.
See https://github.com/vllm-project/vllm/issues/32140
[!NOTE] Sets `ParallelConfig.disable_nccl_for_dp_synchronization` to accept `None` and defers defaulting logic to `VllmConfig`; `disable_nccl_for_dp_synchronization` changed to `bool | None` in `ParallelConfig` and CLI args; doc clarifies default behavior …
-
#32194 [Model] Handle `trust_remote_code` for transformers backend — new-model,ready — by DarkLight1337 (合并于: 2026-01-13 09:30 (UTC+8)) [+14/-1, 2 files | commented:2 approved:1] ## Purpose
Add `trust_remote_code` argument to `try_get_class_from_dynamic_module` and handle it accordingly.
## Test Plan
## Test Result
... -
#32036 [responsesAPI] add unit test for optional function tool call id — ready — by qandrew (合并于: 2026-01-13 08:14 (UTC+8)) [💬1 | +118/-0, 1 files | commented:3 approved:1] ## Purpose
## Test Plan follow up from #31999
## Test Result
unit tests pass
…
-
#32040 [ROCm][CI] Handle pytest status code 5 when a shard isn’t allocated any tests — rocm,ready,ci/build — by divakar-amd (合并于: 2026-01-13 06:35 (UTC+8)) [💬1 | +11/-2, 2 files | commented:2 approved:1] This PR ensures that test shards exiting with status 5 (indicating "no tests collected") are handled gracefully in AMD CI. In a multi-GPU setting, when tests are sharded among GPUs, a particular shard may not be allocated any tests at all. In such cases, pytest returns status 5. More details on pytest exit codes - LINK
```
Shard 0: collected 312 items / 37 deselected / 275 selected Shard 0: Running …
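The graceful handling can be sketched as a thin wrapper that remaps pytest's "no tests collected" exit code (the wrapper name is hypothetical; the actual change is in the CI scripts):

```python
import subprocess
import sys

NO_TESTS_COLLECTED = 5  # pytest's exit code when zero tests are collected

def run_shard(cmd: list[str]) -> int:
    """Run a shard's test command, treating 'no tests collected' as success."""
    status = subprocess.run(cmd).returncode
    if status == NO_TESTS_COLLECTED:
        print("shard collected no tests; treating as success")
        return 0
    return status

# Simulate a shard whose pytest invocation collects nothing.
print(run_shard([sys.executable, "-c", "raise SystemExit(5)"]))  # 0
```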
-
#31962 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow — bug,ready — by xuebwang-amd (合并于: 2026-01-13 06:17 (UTC+8)) [💬8 | +24/-9, 1 files | commented:3 approved:1] ## Purpose
# Background: This PR is a continuous work on two previous PRs:
- PR #31676
  - key contribution: move bias adding after dequantization
  - computation order: to(compute_type) -> HAS_BIAS (bias adding) -> MUL_ROUTED_WEIGHT; this is the closest to the right order, except that the MoE weight multiplication is not in float32.
- PR #31931
  - key contribution: preserving router weight s…
-
#32199 [BUGFIX] Add missed remaping of the names of fp8 kv-scale — ready,qwen — by vadiklyutiy (合并于: 2026-01-13 04:42 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1 approved:1] Qwen3-Next-NVFP4 checkpoint produced a lot of the following warnings
$ vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 -tp 2
...
Parameter layers.3.self_attn.qkqkv_proj.k_scale not found in params_dict, skip loading
Parameter layers.3.self_attn.qkv_proj.v_scale not found in params_dict, skip loading
...…
- #32143 [Model Runner V2] Add support for M-RoPE — v1,nvidia — by WoosukKwon (合并于: 2026-01-13 05:37 (UTC+8)) [💬1 | +203/-7, 5 files | commented:10]
[!NOTE] Adds end-to-end M-RoPE support to the v1 GPU model runner, wiring 3D position IDs through batching and execution.
- Introduces `MRopeState` with UVA-backed prefill position storage and a Triton kernel to materialize `mrope_positions` per batch (`mm/mrope_utils….
-
#32209 [Model Runner V2] Minor refactor for logit_bias — v1 — by WoosukKwon (合并于: 2026-01-13 05:08 (UTC+8)) [+54/-26, 1 files | approved:1 commented:1] Minor code reorganization for better testing and code reuse
[!NOTE] Separates kernel launch logic from state, improving testability and reuse.
- Adds `apply_logit_bias(...)` helper that computes `BLOCK_SIZE` from the `allowed_token_ids`, `logit_bias_token_ids`, and `stop_token_ids` shapes and invokes `_bias_kernel`
- Updates `LogitBiasState.apply_logit_bias` to call the new helper, passing `.gpu` tensors instead of handling strides/size inline
- Keeps `_bias_kernel` logic unchange…
-
#32031 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure — ready,v1,kv-connector — by NickLucche (合并于: 2026-01-13 04:38 (UTC+8)) [💬2 | +234/-28, 2 files | commented:6 approved:1] ## Overview
This PR is aimed at improving logging to more easily identify failures during runs. It does so by providing a shared util function with more comprehensive context of the failure.
Change result: ``` # before ERROR 01-09 12:04:06 [nixl_connector.py:2281] NIXL transfer setup/initiation failed for request test_transfer_setup_failed_req. Marking blocks as invalid.
…
- #32163 [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens — v1 — by WoosukKwon (合并于: 2026-01-13 03:31 (UTC+8)) [💬2 | +253/-5, 2 files | commented:8]
[!NOTE] Implements GPU-side sampling constraints with a dedicated state and Triton kernel.
- Adds `LogitBiasState` to stage per-request `allowed_token_ids`, `logit_bias`, and `min_tokens`/`stop_token_ids` using `UvaBackedTensor` and `StagedWriteTensor`
- Introduces `_bias_kernel` (Triton) to: mask logits to only `allowed_token_ids`, apply per-token `logit_bias`, and enforce `min_tokens` by suppressing `stop_token_ids` until `min_len`
- Provides `add_request`, `apply_staged_writes`, …
-
#31748 [Misc][BE] Type coverage for vllm/compilation [3/3] — ready,nvidia — by Lucaskabela (合并于: 2026-01-13 03:24 (UTC+8)) [💬2 | +333/-280, 11 files | commented:3 approved:2] ## Purpose We want to provide better type hint coverage in vllm/compilation to improve maintainability, readability, and reduce silent errors
This PR should be applied on top of #31744
## Test Plan
`mypy vllm/compilation`
## Test Result ``` …
-
#32192 [Misc] Change log level for batch queue log — ready,v1 — by NickLucche (合并于: 2026-01-13 02:59 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] As per https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1768224082041559
[!NOTE] …
-
#32173 [BugFix] scheduler: Fix ordering preserving of skipped requests — ready,v1 — by orozery (合并于: 2026-01-13 02:39 (UTC+8)) [💬1 | +31/-15, 2 files | commented:1 approved:1] This PR fixes order preservation for requests skipped by the scheduler. A unit test is added to verify the fix.
I came across this bug while testing #29087, and the test there requires this fix.
…
-
#31879 [Misc] Set default torch num threads for input processing — ready,v1 — by ywang96 (合并于: 2026-01-13 02:28 (UTC+8)) [💬2 | +12/-16, 2 files | approved:1 commented:1] ## Purpose There have been numerous reports (e.g., #29869, #29078) of CPU contention from multimodal input processing where users do not set `OMP_NUM_THREADS` and run multiple vLLM instances on the same physical machine. This PR wraps the preprocess call with torch num threads set to the same value as `OMP_NUM_THREADS` if it's set, otherwise 1.
## Test Plan
## Test Result
... -
#30697 [Refactor] EPLB rebalance algo to NumPy — ready — by ilmarkov (合并于: 2026-01-13 02:13 (UTC+8)) [💬2 | +128/-130, 3 files | commented:1 approved:2] ## Purpose
First PR in a series of EPLB refactorings and optimizations. In this PR we unify the rebalance algorithm implementation, converting it to a CPU-only NumPy-based implementation.
Prerequisite for the PR(s) in which we will move the rebalancing algorithm to an async EPLB thread, removing the points where the CPU has to wait for GPU results in the main thread.
## Validation
`tests/distributed/test_eplb_algo.py` passed … -
#32196 [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE — 无标签 — by lkm2835 (合并于: 2026-01-13 02:00 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1] ## Purpose Added the missing changes. Related to #31104.
## Test Plan
## Test Result
... - #32054 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py — documentation,rocm,speculative-decoding,ready,v1,cpu,nvidia,ready-run-all-tests — by MatthewBonanni (合并于: 2026-01-13 01:13 (UTC+8)) [💬2 | +374/-370, 37 files | commented:4 approved:1]
## Purpose
Step 3 of #31919. Moves chunks of code from utils.py to backend.py (unchanged) and updates imports accordingly. The following objects are moved:
`CommonAttentionMetadata`, `AttentionMetadataBuilder`, `AttentionCGSupport`
## Test Plan CI (run all tests)
## Test Result …
-
#32184 [Benchmark] Share data between SLA runs — performance,ready — by DarkLight1337 (合并于: 2026-01-13 01:12 (UTC+8)) [+108/-25, 2 files | commented:6 approved:1] ## Purpose
Further optimize the SLA script to share past run data coming from the same serve/bench combination. This is useful when tuning for multiple SLA targets (e.g. E2EL <= 200 ms, 500 ms, 1000 ms, …) while keeping the serve/bench combination unchanged.
## Test Plan
## Test Result
... -
#32148 [Model] Standardize pooling heads — ready — by DarkLight1337 (合并于: 2026-01-13 01:01 (UTC+8)) [+182/-149, 9 files | approved:1 commented:9] ## Purpose
Follow-up to #32119, make pooling params actually take effect for custom models.
## Test Plan
## Test Result
... - #31988 [Misc][PD] Fix
`get_attn_backend` usage in transfer connectors — ready,v1,kv-connector — by NickLucche (merged: 2026-01-13 01:10 (UTC+8)) [💬3 | +39/-20, 4 files | commented:5 approved:1] This PR fixes the use of `get_attn_backend` in the context discussed here https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1767810349936949. In particular, it turns out that a modification to the interface of this shared function can cause unintended backend retrieval (as a partial configuration was passed in), leading to cases such as ``` VLLM_LOGGING_LEVEL=DEBUG vllm serve google/gemma-3-4b-it --port 8004 --enforce-eager --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'… -
#32118 [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode — ready,v1 — by Josephasafg (merged: 2026-01-13 01:02 (UTC+8)) [💬3 | +24/-3, 2 files | commented:1 approved:1] ## Purpose Fix stale SSM state corruption when new Mamba requests are scheduled with only 1 token due to token budget exhaustion.
## Problem When the scheduler's token budget is nearly exhausted, new requests may be allocated only 1 token. A request with 1 token can be classified as decode rather than prefill. This causes a prompt to first be decoded against a stale SSM state (i.e. one holding leftover values; this happens when most of the GPU blocks are in use and a block gets reused)…
-
#31528 [FIX] Add NO_MUL activation support for modular kernel path — rocm,ready,gpt-oss,nvidia — by danielafrimi (merged: 2026-01-13 00:55 (UTC+8)) [💬5 | +368/-71, 17 files | commented:10] This PR adds support for
`*_no_mul` activations (e.g., `relu2_no_mul`) in the modular kernel MoE path (`TritonExperts`). ### Problem The modular kernel path assumed all activations use gate/up multiplication (like SiLU, GELU), where the output size is
`N/2`. For `*_no_mul` activations, which apply the activation directly without gating, the output size should equal the input size (`N`). This caused assertion failures and buffer size mismatches.…
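The gated-vs-`*_no_mul` size rule can be captured in a few lines. This is a minimal sketch; the helper name `moe_workspace_out_size` is hypothetical, not vLLM's API:

```python
def moe_workspace_out_size(n: int, activation: str) -> int:
    # Gated activations (e.g. silu, gelu) split the hidden dim into
    # gate/up halves and multiply them, so the output is N/2.
    # *_no_mul activations apply the nonlinearity directly: output stays N.
    return n if activation.endswith("_no_mul") else n // 2

print(moe_workspace_out_size(256, "silu"))          # 128: gated path
print(moe_workspace_out_size(256, "relu2_no_mul"))  # 256: no-mul path
```

Sizing the output buffer with the gated formula for a no-mul activation is exactly the mismatch the PR describes.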
-
#29384 [MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct — documentation,new-model,ready — by kakao-steve-ai (merged: 2026-01-13 00:39 (UTC+8)) [💬4 | +790/-0, 5 files | commented:10] ## Purpose Add new model support for kakaocorp/kanana-1.5-v-3b-instruct
## Test Plan Verify that the performance scores are correctly produced using lmms-eval.
``` export OPENAI_API_BASE="http://localhost:8000/v1" export OPENAI_API_KEY='EMPTY' lmms-eval --model async_openai --model_args model_version=kakaocorp/kanana-1.5-v-3b-instruct,base_url=$OPENAI_API_BASE,api_key=$OPENAI_API_KEY --tasks m…
-
#31621 Add K-EXAONE-236B-A23B — documentation,new-model,ready — by lkm2835 (merged: 2026-01-13 00:30 (UTC+8)) [💬2 | +856/-0, 7 files | commented:10] ## Purpose This PR adds support for K-EXAONE-236B-A23B
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32083 [Model Runner V2] Remove async barrier — v1,nvidia — by WoosukKwon (merged: 2026-01-12 12:24 (UTC+8)) [💬3 | +590/-462, 13 files | commented:7] Currently, the model runner V2 prevents race conditions by introducing
`async_barrier`, which is difficult to reason about and error-prone. This PR eliminates the need for `async_barrier` by avoiding the race condition by design. The core idea is double buffering: allocate two copies of the state, where one is used by the CPU and the other by the GPU. A more detailed design doc is in progress.
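The double-buffering idea can be sketched as follows. All names here are hypothetical; this only illustrates the design, not the actual model runner code:

```python
class DoubleBuffer:
    """Two copies of state: each step, the CPU writes one copy while the
    GPU reads the other; swap() hands the freshly written copy over.
    Neither side ever touches the copy the other is using within a step,
    so no async barrier is needed."""

    def __init__(self, make_state):
        self._bufs = [make_state(), make_state()]
        self._step = 0

    @property
    def cpu_state(self):
        return self._bufs[self._step % 2]

    @property
    def gpu_state(self):
        return self._bufs[(self._step + 1) % 2]

    def swap(self):
        self._step += 1

buf = DoubleBuffer(dict)
buf.cpu_state["input_ids"] = [1, 2, 3]  # CPU prepares step N
buf.swap()                              # hand off to the GPU side
print(buf.gpu_state)                    # the copy the GPU now reads
```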
-
#32188 doc: Update model references in supported_models.md — documentation,ready — by andyzhangx (merged: 2026-01-13 00:15 (UTC+8)) [💬1 | +2/-2, 1 files | commented:4 approved:1] ## Purpose Update model references in supported_models.md. We use a tool to parse all model names listed on this page, but some items lack an org name, which makes automatic parsing difficult; this PR fixes that.
## Test Plan
## Test Result
... -
#32179 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-12 23:33 (UTC+8)) [+21/-13, 1 files | commented:1 approved:1] ## Purpose
Fix the following bug:
$ DOCKER_BUILDKIT=1 docker build \ -f docker/Dockerfile.rocm_base \ -t rocm/vllm-dev:base-debug .…
-
#32185 doc: Update model name for Qwen3-Coder in documentation — documentation,ready,tool-calling,qwen — by andyzhangx (merged: 2026-01-12 23:10 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1 commented:1] ## Purpose
Update model name for Qwen3-Coder in documentation; the original name is incorrect, and the correct name can be found here: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
## Test Plan
## Test Result
... -
#24498 OffloadingConnector: Add cpu_bytes_to_use configuration — documentation,ready,v1,kv-connector — by orozery (merged: 2026-01-12 23:00 (UTC+8)) [💬26 | +49/-24, 9 files | dismissed:2 commented:7 approved:1] This PR changes the OffloadingConnector size configuration from num_cpu_blocks to cpu_bytes_to_use, allowing a more intuitive space allocation (per vLLM instance, across workers).
[!NOTE] Modernizes KV offloading configuration and wiring.
- Replace `num_cpu_blocks`/`kv_bytes_per_rank` with instance-wide `cpu_bytes_to_use` for `OffloadingConnector` (docs and configs)
- `CPUOffloadingSpec` now derives `num_blocks` from `KVCacheConfig` (page size, tensors, world size) and `…
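Assuming `num_blocks` is derived roughly as bytes-per-worker divided by page size, the conversion can be sketched as below. The exact formula in `CPUOffloadingSpec` may differ; this is only an illustration of the arithmetic:

```python
def derive_num_blocks(cpu_bytes_to_use: int, page_size_bytes: int,
                      world_size: int) -> int:
    # Split the instance-wide byte budget evenly across workers, then
    # count how many whole KV-cache pages fit per worker.
    bytes_per_worker = cpu_bytes_to_use // world_size
    return bytes_per_worker // page_size_bytes

# 1 GiB budget, 2 MiB pages, 4 workers -> 128 blocks per worker
print(derive_num_blocks(1 << 30, 2 << 20, 4))
```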
-
#28284 [Feature] Support recording expert indices for rollout router replay — performance,frontend,ready,v1,rl — by xhx1022 (merged: 2026-01-12 22:23 (UTC+8)) [💬68 | +463/-3, 11 files | commented:9 approved:1] ## Purpose
This PR introduces Rollout Router Replay (R3) support into vLLM runtime.
Inspired by the recent research in reinforcement learning alignment for MoE-based LLMs (arXiv:2510.11370, arXiv:2507.18071), this implementation allows recording the expert routing decisions for every token at every layer during model inference. The recorded routing traces can be used for replaying the expert routing process du… -
#31573 [P/D] Refactor mooncake connector sender thread using async coroutines — ready,kv-connector — by dtcccc (merged: 2026-01-12 20:35 (UTC+8)) [+173/-174, 1 files | commented:3 approved:1] ## Purpose This is a separate PR split out of https://github.com/vllm-project/vllm/pull/31034 to ease review. It refactors the sender thread using async coroutines. All related data lives in the same thread, so the locks can be dropped. This makes the sender thread simple and easy to maintain.
## Test Plan
## Test Result
... - #32149 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation — documentation,ready,v1 — by Isotr0py (merged: 2026-01-12 19:13 (UTC+8)) [💬1 | +13/-27, 4 files | commented:3 approved:1]
## Purpose
- A small fix for Triton encoder-only attention; otherwise `TritonAttentionImpl.scale` won't take effect.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#32134 [Doc] Add documentation for offline API docs feature — documentation,ready — by ricky-chaoju (merged: 2026-01-12 18:33 (UTC+8)) [💬1 | +8/-0, 1 files | commented:1 approved:1] ## Summary Add documentation for the offline API docs feature; related: https://github.com/vllm-project/vllm/pull/30184
…
-
#32159 [Doc] Improve LoRA docs — documentation,ready — by jeejeelee (merged: 2026-01-12 18:19 (UTC+8)) [💬2 | +12/-15, 1 files | commented:2 approved:1] ## Purpose vLLM has removed the LoRA-based extended vocab size feature, so vLLM no longer supports LoRA adapters like
`yard1/llama-2-7b-sql-lora-test`. We also need to update the related documentation. ## Test Plan ## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32158 [doc] fix broken links — documentation,ready — by minimAluminiumalism (merged: 2026-01-12 18:18 (UTC+8)) [💬2 | +7/-21, 1 files | approved:1] ## Purpose
Images were using HTML
`<img>` tags with relative paths. MkDocs doesn't process relative paths in HTML tags, causing incorrect URLs. ## Test Plan
NA …
-
#32153 [Frontend] Fix Flaky MCP Streaming Test — ready — by daniel-salib (merged: 2026-01-12 18:03 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose test_mcp_code_interpreter_streaming occasionally fails because the model inconsistently decides whether to trigger a tool call. Since the math question is too simple, a tool call isn't always triggered; a more complicated problem like "123 * 456" consistently triggers the MCP tool call.
## Test Plan
pytest entrypoints/openai/responses/test_harmony.py::test_mcp_code_interpreter_streaming
## Test Result
…
-
#32092 [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend — performance,ready,cpu — by andikarachman (merged: 2026-01-12 18:03 (UTC+8)) [💬2 | +175/-0, 1 files | commented:3 changes:1 approved:2] ## Purpose
Add Fused MoE Micro Benchmark for CPU Backend Fixes: #31721
## Test Plan Tested using this instance:
## Test Result …
-
#30975 [Misc] Disable default
`--ready-check-timeout-sec` extra call in vllm bench — performance — by NickLucche (merged: 2026-01-12 17:58 (UTC+8)) [+2/-3, 1 files | commented:1 approved:1] As per a brief offline discussion with @mgoin, the current `vllm bench serve` implementation defaults to sending (at least) one extra request to probe whether the server is up. I suppose this is due to a non-uniform backend/healthcheck API, so we've been defaulting to sending the same test request https://github.com/vllm-project/vllm/blob/686cbaac643c3412036728dd5e6bc29d6cce1a9f/vllm/benchmarks/serve.py#L596 I believe this is largely misleading for most use-cases as it results in an ambiguous "wa…
-
#32150 [Model] Remove incorrect
`SupportsPP` from MTP models — ready,qwen,deepseek — by DarkLight1337 (merged: 2026-01-12 17:19 (UTC+8)) [+6/-15, 6 files | commented:1 approved:2] ## Purpose MTP models don't actually support PP.
## Test Plan
## Test Result
... -
#32085 [Model] Improve multimodal pooling examples — documentation,ready — by noooop (merged: 2026-01-12 15:54 (UTC+8)) [💬2 | +381/-69, 10 files | commented:6 approved:1] ## Purpose
FIX #32069
## Test Plan
Qwen/Qwen3-VL-Embedding-2B …
-
#32119 [Model] Avoid hardcoding pooling type — ready — by DarkLight1337 (merged: 2026-01-12 13:28 (UTC+8)) [+47/-22, 6 files | commented:4 approved:2] ## Purpose
Gracefully handle user-specified pooling types where possible, instead of silently ignoring them.
In the next PR, we will work on passing
`EmbeddingPoolerHead` to the `head` argument so that pooling params are applied correctly. ## Test Plan
## Test Result
…
[Closed unmerged PRs]
-
#31729 [Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬5 | +10/-8, 3 files | commented:1] ## Summary
Fix hardcoded
`device="cuda"` in AITER MLA sparse attention and Fused MOE initialization code. This ensures tensors are created on the correct device in multi-GPU setups and improves ROCm compatibility. ## Changes
### 1. `vllm/attention/ops/rocm_aiter_mla_sparse.py`
Replace 5 instances of `device="cuda"` with `device=q.device` or `device=q_fp8.device`:
- Lines 46, 49: Use `q.device` in `fp8_mqa_logits_torch()`
- Lines 127, 135: Use `q.device` in `fp8_paged_mqa_logits_torch()`…
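The pattern behind the fix can be sketched with a hypothetical helper (requires PyTorch; `make_scratch` is illustrative, not a vLLM function):

```python
import torch

def make_scratch(q: torch.Tensor) -> torch.Tensor:
    # Before: torch.empty(..., device="cuda") breaks on non-default GPUs
    # and on CPU. After: follow the input tensor's device.
    return torch.empty(q.shape[0], dtype=torch.float32, device=q.device)

q = torch.randn(4, 8)   # CPU tensor here; the same code works on any GPU
scratch = make_scratch(q)
print(scratch.device)
```

The key point is that allocations derive their placement from an input tensor, so a process pinned to `cuda:1` never accidentally allocates on `cuda:0`.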
-
#31638 [Bugfix][Hardware][AMD] Narrow broad exception in AITER scaled MM import — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬4 | +1/-1, 1 files | commented:1] ## Summary
Narrows the exception handling in the AITER scaled MM kernel's `is_supported()` method from catching all `Exception` types to catching only `ImportError`. ## Problem
The current code uses `except Exception:`, which is too broad and can mask programming errors like `AttributeError`, `TypeError`, or other unexpected exceptions that should propagate for debugging. ## Solution
…
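A minimal sketch of the narrowed pattern; `aiter_scaled_mm_supported` is a hypothetical stand-in for the actual `is_supported()` method, with `aiter` treated as an optional dependency:

```python
def aiter_scaled_mm_supported() -> bool:
    # Catch only ImportError: a missing optional dependency means
    # "unsupported", while real bugs (AttributeError, TypeError, ...)
    # still propagate and surface during debugging.
    try:
        import aiter  # noqa: F401  (optional ROCm dependency)
    except ImportError:
        return False
    return True

print(aiter_scaled_mm_supported())
```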
-
#31552 [Bugfix][Hardware][AMD] Fix device parameter and exception handling — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬3 | +9/-9, 2 files | commented:1] ## Summary
Fix two ROCm-related issues:
### 1. Fusion Helper Functions (`vllm/compilation/fusion.py`)
Bug: Hardcoded `device="cuda"` in helper functions prevents explicit device selection. ```python # Before (hardcoded): …
- #31587 [Bugfix][Hardware][ROCm] Narrow broad exception in PyNCCL library loading — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬2 | +3/-3, 1 files | commented:1]
## Summary
- Replace `except Exception:` with `except (OSError, ValueError, RuntimeError):` when loading the NCCL/RCCL library
- This prevents masking unexpected errors during library loading, which is especially important for ROCm/RCCL debugging
## Details The current broad exception handler can hide the actual cause of NCCL/RCCL loading failures. Narrowing to specific exceptions helps debug issues on both CUDA and ROCm platforms.
`OSError`: Library file not found or can't be loaded (from `ctypes.CD…
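The narrowed loading pattern might look like this sketch; the function name and error message are illustrative, not vLLM's actual code:

```python
import ctypes

def load_comm_library(path: str) -> ctypes.CDLL:
    # Catch only what ctypes.CDLL is known to raise for a missing or
    # malformed library; anything else propagates for debugging.
    try:
        return ctypes.CDLL(path)
    except (OSError, ValueError, RuntimeError) as e:
        raise RuntimeError(
            f"failed to load NCCL/RCCL from {path!r}: {e}") from e
```

With the broad `except Exception:`, a typo inside the loader itself would be reported as "library failed to load"; with the narrowed tuple it raises normally.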
-
#31119 [Bugfix][Hardware][AMD] Fix tensor slice assignment in MLA — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬10 | +1/-1, 1 files | commented:1] ## Summary Fix inconsistent tensor assignment pattern in
`rocm_aiter_mla.py`. Bug: Line 160 used direct assignment (`=`) to set values in a tensor slice, while lines 142, 148, and 154 correctly use `.fill_()` for the same operation pattern. ```python # Lines 142, 148, 154 - correct pattern: self.paged_kv_indices[num_actual_pages:].fill_(-1) self.paged_kv_indptr[1 + num_reqs :].fill_(paged_kv_indptr[-1]) self.paged_kv_last_page_len[num_reqs:].fill_(1) …
-
#31121 [Bugfix][Hardware][AMD] Fix list aliasing in fused MoE initialization — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬14 | +116/-4, 2 files | commented:1] ## Summary Fix a critical Python list aliasing bug in the ROCm fused MoE implementation.
Bug: The code used
`[[value] * n] * m` pattern, which creates `m` references to the same inner list, not `m` independent lists. ```python # Before (buggy) - all indices point to same list: s_topk_ids_list = [[fake_expertid] * (n_shared_experts + is_EP)] * max_num_tokens
# After (fixed) - each index has independent list: …
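The aliasing pitfall is easy to demonstrate in isolation:

```python
fake_expertid, width, num_tokens = -1, 3, 4

# Buggy: the outer * makes num_tokens references to ONE inner list.
aliased = [[fake_expertid] * width] * num_tokens
aliased[0][0] = 7
print(aliased[1][0])   # 7 — the "other" row changed too

# Fixed: a comprehension builds independent inner lists.
independent = [[fake_expertid] * width for _ in range(num_tokens)]
independent[0][0] = 7
print(independent[1][0])  # -1 — other rows untouched
```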
-
#31176 [Bugfix][Hardware][AMD] Fix hardcoded device in MLA sparse attention — rocm,v1,nvidia — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬19 | +5/-5, 1 files | commented:1] ## Summary
Replace hardcoded `device="cuda"` with the input tensor's device (`q.device` or `q_fp8.device`) in `rocm_aiter_mla_sparse.py` for consistency and to avoid potential device mismatch errors. ## Changes
Fixed 5 instances of hardcoded `device="cuda"`:
| Location | Function | Fix |
|----------|----------|-----|
…
-
#31178 [Bugfix][Hardware][AMD] Fix device parameter in AITER topK metadata — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬19 | +6/-4, 2 files | commented:1] ## Summary
Add an explicit `device` parameter to `init_aiter_topK_meta_data()` instead of hardcoding `"cuda"`. This improves multi-GPU support and makes device handling explicit and consistent with other ROCm functions. ## Changes
File 1: `vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`
- Add a `device: int | str = "cuda"` parameter to the function signature
- Replace 3 instances of hardcoded `device="cuda"` with `device=device`
…
-
#31251 [Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu for ROCm namespace alias — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬16 | +11/-9, 3 files | commented:1] ## Summary
Replace direct
`cub/cub.cuh` includes with `cub_helpers.h` in multiple CUDA kernel files. This provides the `namespace cub = hipcub;` alias needed for ROCm builds. ## Files Fixed
| File | CUB Usage |
|------|-----------|
| `csrc/sampler.cu` | `cub::BlockScan`, `cub::BlockRadixSort` |
| `csrc/moe/moe_align_sum_kernels.cu` | `cub::BlockScan` |
… -
#31293 [Bugfix][Hardware][AMD] Fix uninitialized Qlocal registers in ROCm attention kernel — rocm — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬12 | +14/-0, 1 files | commented:1] ## Summary
Fixes uninitialized `Qlocal` register values in the ROCm PagedAttention wmma kernel, which can cause numerical accuracy issues on AMD GPUs. ## The Bug
In
`paged_attention_ll4_kv_kernel`, when `GQA_RATIO == 1` (non-GQA models like Llama-2), only lane 0 loads valid Q data into `Qlocal` registers. Lanes 1-15 retain garbage values from previous GPU cycles. These uninitialized values then propagate into the
`gcn_wmma16x16x16_instr` (Wave Matrix Multiply-Accumulate) instruction, conta… -
#32084 [ROCm][Bugfix] Fix AITER speculative decoding accuracy issue — rocm,v1 — by c0de128 (closed: 2026-01-13 07:27 (UTC+8)) [💬6 | +68/-36, 1 files | commented:3] ## Summary
- Fix speculative decoding (query_len > 1) for ROCM_AITER_FA backend
- Fall back to
`context_attention_fwd` when decode has multi-token queries - Resolves the 0% accuracy issue reported in #31625
## Problem
The AITER
`paged_attention_v1` kernel only supports single-token queries (query_len = 1). When speculative decoding is enabled, the decode path receives multi-token queries, causing the kernel to produce incorrect results (0% accuracy on the GSM8K benchmark).…
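The fallback amounts to a query-length dispatch; a hedged sketch (the selector function is hypothetical, not the PR's actual code):

```python
def pick_decode_kernel(query_len: int) -> str:
    # paged_attention_v1 handles only single-token queries; speculative
    # decoding produces query_len > 1, so fall back to the prefill kernel.
    return "paged_attention_v1" if query_len == 1 else "context_attention_fwd"

print(pick_decode_kernel(1))  # normal decode step
print(pick_decode_kernel(4))  # spec-decode step verifying draft tokens
```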
-
#31802 [Core][NIXL] Support HMA+NixlConnector — documentation,performance,structured-output,frontend,tpu,needs-rebase,ci/build,v1,multi-modality,tool-calling — by NickLucche (closed: 2026-01-13 03:04 (UTC+8)) [💬3 | +3247/-1277, 135 files | commented:1 | 📝 draft] This PR is based on an early version of https://github.com/vllm-project/vllm/pull/30166, so the diff is a mess. I will clean it up, rebase it ASAP, and provide a more accurate description of the PR then.
UPDATE: check out https://github.com/vllm-project/vllm/pull/32204 for the updated PR
## Overview Currently connectors are not able to take full advantage of models that employ hybrid attention (FA+SWA) and treat all layers as FA, as the Hybrid Kv Cache Manager is disabled.
This PR enables Ni…
-
#32043 [AOT compilation] cached inductor artifacts benchmark — performance — by dolpm (closed: 2026-01-13 01:04 (UTC+8)) [+682/-0, 1 files | commented:3] ## Purpose
split out of https://github.com/vllm-project/vllm/pull/25205
## Test Plan
## Test Result
... -
#32166 [Model] Re-implement Qwen3Omni Audio Encoder — qwen — by ywang96 (closed: 2026-01-12 18:41 (UTC+8)) [+425/-17, 1 files | commented:1 | 📝 draft]
## Purpose
## Test Plan
## Test Result
... -
#31900 Ignore: dynres attempt — needs-rebase — by netanel-haber (closed: 2026-01-12 17:34 (UTC+8)) [💬1 | +858/-127, 3 files | commented:1 | 📝 draft] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#27511 Add standalone multimodal encoder benchmark — performance,frontend — by alhridoy (closed: 2026-01-12 16:59 (UTC+8)) [💬3 | +434/-0, 4 files | commented:4] Fixes #25450
## Summary - add a standalone multimodal encoder benchmark (
`vllm/benchmarks/encoder.py`) that loads dummy-weight models, builds HF-processor inputs, and measures `get_multimodal_embeddings` latency across configurable batch/image sizes - wire the runner into the CLI as `vllm bench encoder` and keep a legacy shim at `benchmarks/benchmark_encoder.py`… -
#24091 [EPLB]: Optimize
`export_load_view` update — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by dragondream-chen (closed: 2026-01-12 12:01 (UTC+8)) [💬20 | +68/-5, 8 files | commented:2] ## Purpose As part of the EPLB (Expert Load Balancing) feature, this PR optimizes how expert load is updated during each forward pass. The current approach uses the `scatter_add_` method on `topk_ids` results. When using DeepEP Low-Latency or PPLX on the CUDA platform, expert loads can be obtained directly from `expert_tokens_meta.expert_num_tokens`, which removes redundant expert-load computation. ## Test Plan
- Test the expert load update. Since the use of kernels such as DeepEP…
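The redundancy being removed can be illustrated with a plain-Python histogram of routing decisions (a sketch; names are illustrative):

```python
def expert_load_from_topk(topk_ids, num_experts):
    # scatter_add-style histogram over per-token routing decisions.
    load = [0] * num_experts
    for token_experts in topk_ids:
        for e in token_experts:
            load[e] += 1
    return load

# With DeepEP Low-Latency or PPLX, the dispatch kernel already exposes
# this count (expert_tokens_meta.expert_num_tokens), making the
# histogram pass redundant.
print(expert_load_from_topk([[0, 2], [2, 3]], 4))
```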
- #32144 [code clean] remove useless parameters — frontend — by andyxning (closed: 2026-01-12 11:03 (UTC+8)) [+0/-2, 1 files | commented:2]
## Purpose
Code cleanup: remove unused parameters.
## Test Plan
NA
## Test Result
NA
—
Essential Elements of an Effective PR Description Checklist
...