[vLLM GitHub Development Digest] 2026-01-08
[Overview]
- Time window: 2026-01-08 10:53 (UTC+8) ~ 2026-01-09 10:53 (UTC+8)
- New issues: 13 (label distribution: bug:6, feature request:3, torch.compile:1, performance:1, usage:1)
- Closed issues: 37
- New PRs: 57 (label distribution: ready:23, v1:20, rocm:8, nvidia:6, ci/build:6)
- Merged PRs: 58
- PRs closed without merging: 24
[New issues]
-
#32006 [Bug]: vLLM hangs on specific request each time (qwen-coder-480b-fp8) — bug — by Yanpas (created: 2026-01-09 09:52 (UTC+8)) ### Your current environment
Can't do it. It's RH OpenShift, GPU H200, CUDA 12.9. Model: Qwen Coder 480B FP8. vLLM params: --enable-expert-parallel --data-parallel-size=8 ... -
#32007 [Feature]: Speculating with a draft model — feature request — by liudl85 (created: 2026-01-09 09:55 (UTC+8)) ### 🚀 The feature, motivation and pitch
In docs, I found this Speculating with a draft model
While in codes, I also found this https://github.com/vllm-project/vllm/blob/main/vllm/config/speculative.py#L379-L384
I’m really confused about whether vLLM supports speculation with a d…
-
#32004 [Bug]: When running the model on an RTX 5090 GPU, the model starts successfully, but the CPU usage remains at 100%. — bug — by ljwps (created: 2026-01-09 09:21 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#31963 [Bug]: Inference results of Qwen3 VL 32B deployed with vLLM are inconsistent with transformers — bug — by yy19941128 (created: 2026-01-08 18:54 (UTC+8)) [💬4] ### Your current environment
Both the offline and online vLLM deployments serve a fine-tuned Qwen3 VL 32B model, but the inference results differ from those produced by transformers. After investigation, the prompts after chat-template alignment are identical, and the token_ids after tokenization are also identical. Could the vision encoder behave differently? The vLLM version is 0.11.0 and the transformers version is 4.57.0. The VL task is OCR; for many images one or two characters of the OCR result differ, with transformers mostly correct and both the offline and online vLLM versions wrong.
### 🐛 Describe the bug
(same description as above) …
-
#31966 [Feature]: Proposal: Optional Admin Control Plane API + CLI for Safe Production Operations — feature request — by khajamoddin (created: 2026-01-08 20:33 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
Introduce a minimal, optional admin control plane for vLLM, consisting of:
A small HTTP Admin API for inspecting runtime state and performing safe lifecycle operations
A thin CLI reference client (vllm admin …) built on top of this API
This control plane is strictly for production operations, not inference, and is:
…
-
#31990 [Bug]: [H200] Qwen3-Next-80B-A3B-Instruct-FP8 TP1 DP4 EP4 CUDA illegal memory error — bug — by jhaotingc (created: 2026-01-09 03:34 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#31985 [Feature]: Unwrap FusedMoE custom op — feature request,torch.compile — by ProExpertProg (created: 2026-01-09 01:26 (UTC+8)) ### 🚀 The feature, motivation and pitch
Currently, the whole `FusedMoE` layer is wrapped in the `fused_moe`/`fused_moe_shared` custom ops. This is mostly due to chunking, but it prevents torch.compile-based optimizations, increases CPU overhead, and makes it more likely for inefficiencies (copies, sequences of torch ops, etc.) to pop up unexpectedly. It also increases complexity in many cases, as we have to write custom kernels for code that could otherwise be implemented in native torch and be effi… -
#31975 [New Model]: Qwen3Guard Stream — no label — by Wildshire (created: 2026-01-08 22:33 (UTC+8)) ### The model to consider.
Hello
I was wondering if the Qwen3Guard stream family is going to be supported by vLLM. Closest issue / PR related to this topic I found in the repo is this PR #25463
HF models:
-
#31957 [Bug]: Nemotron Nano V3 FP8 - Expected 'silu' activation but got relu2_no_mul — bug — by danisereb (created: 2026-01-08 17:30 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (uv is set; environment dump truncated). -
#31961 [Performance]: EPD Disaggregation Performance Testing Scripts — performance — by Adenialzz (created: 2026-01-08 18:25 (UTC+8)) ### Proposal to improve performance
Hi, I noticed that epd disaggregation is now ready in vLLM. And the blog shows encouraging results. But we failed to get the expected performance improvement and want to troubleshoot the potential errors. Could you please provide complete testing scripts for reproduction? Thanks in advance.
### Report of performance regression
No response
### Misc discussion on performance
…
- #31959 [Usage]: Cannot Run vLLM Docker Container as Non-Root User — usage — by tahmid-therap (created: 2026-01-08 17:54 (UTC+8)) I am trying to build a docker image from the vllm/vllm-openai:v0.12.0 image which will allow me to run the vllm server as a defined non-root user and write the log file under that user's ownership. However, I keep getting errors during startup regarding library file access inside the container. I have attempted a brute-force approach by changing ownership and permissions of directories like /usr or /root, but I still get similar errors. The following is my raw dockerfile, apo…
-
#31954 [Bug] OLMo3 reasoning parser fails to detect </think> end tag, preventing GCD activation — no label — by ivnle (created: 2026-01-08 16:11 (UTC+8)) [💬1] ### Description
The OLMo3 reasoning parser (`olmo3_reasoning_parser.py`) fails to correctly detect when thinking ends, causing all generated tokens to be classified as thinking tokens. This prevents grammar-constrained decoding (GCD) from activating for the post-thinking structured output.
### Evidence
From experiments with OLMo3-7B-Think + GCD + thinking=high on the math500 task:
Question 0 (Polar coordinates conversion): (truncated JSON output) …
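The failure mode described above (no end-tag match, so every token counts as thinking) can be sketched with a minimal hypothetical helper; this is illustrative only, not the actual `olmo3_reasoning_parser.py` code:

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split model output into (reasoning, answer) at the end tag.

    If the parser never matches end_tag, everything stays classified as
    reasoning, which is the failure the issue describes: GCD for the
    post-thinking structured output then never activates.
    """
    if end_tag in text:
        reasoning, _, answer = text.partition(end_tag)
        return reasoning, answer
    # No end tag seen: all tokens are still treated as "thinking"
    return text, ""
```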
-
#31940 [Bug]: — bug — by EgoistaCercis (created: 2026-01-08 11:22 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
[Closed issues]
-
#20170 [Bug]: RuntimeError: CUDA error: an illegal memory access was encountered — bug,stale — by xiaocode337317439 (closed: 2026-01-09 10:27 (UTC+8)) [💬7] ### Your current environment
vllm/vllm-openai:v0.9.1
The output of `python collect_env.py` (environment dump truncated). -
#21047 [RFC]: Migrate vLLM Docs back to Sphinx for Localization — RFC,stale — by hwhsu1231 (closed: 2026-01-09 10:27 (UTC+8)) [💬9] ### Motivation.
To localize the vLLM Documentation.
Recently, I officially published my 1st localization project, `cmake-docs-l10n`, on GitHub:
- Preview: https://localizethedocs.github.io/cmake-docs-l10n
- Crowdin: https://localizethedocs.crowdin.com/cmake-docs-l10n
- GitHub: https://github.com/localizethedocs/cmake-docs-l10n
…
-
#21466 [Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8) — bug,stale — by marcin-brzezanski (closed: 2026-01-09 10:27 (UTC+8)) [💬12] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 22.04.5 LTS, x86_64). -
#21481 [Feature]: Multiple models one server — feature request,stale — by botirk38 (closed: 2026-01-09 10:27 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Hey, I'm working on a project and want to run my own inference infra for small models. I just want one inference server running multiple models; to do this I had to set up my own model management, loading and unloading models into memory when appropriate. It would be cool if vLLM had this; I would be happy to make a PR myself if you are OK with it.
### Alternatives
No response
### Additional context
…
-
#22093 [Bug]: update the kv_connector from v0 to v1 in example — bug,stale — by 1mujue (closed: 2026-01-09 10:27 (UTC+8)) [💬2] ### Your current environment
============================== System Info ============================== OS : Ubuntu 24.04.1 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version : 18.1.3 (1ubuntu1) CMake version : version 4.0.3 Libc version : glibc-2.39 …
-
#23009 [Feature]: Plugin framework for out-of-tree custom checkpoint loading — feature request,stale,rl — by 22quinn (closed: 2026-01-09 10:26 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
### Motivation Currently, vLLM primarily supports HuggingFace-format checkpoints along with a few built-in custom loaders such as mistral. However, some proprietary reinforcement learning (RL) systems use custom checkpoint and weight formats along with unique loading mechanisms, which creates challenges for adopting vLLM in these environments. Support is required in two key areas (they share some common path):
- Initial custom weight loading during rollo…
-
#23227 [Feature][Responses API] Support tool_choice other than "auto" — feature request,stale — by heheda12345 (closed: 2026-01-09 10:26 (UTC+8)) [💬13] ### 🚀 The feature, motivation and pitch
The current responses API only supports tool_choice="auto". For other tool_choice values, e.g. tool_choice="required", we need to properly enable structured output. See the OpenAI API doc for tool_choice here: https://platform.openai.com/docs/guides/function-calling#tool-choice
### Alternatives
No response
### Additional context
…
-
#23582 [Bug]: vLLM server timeout due to multiprocessing communication error — bug,stale — by shaamil101-etched (closed: 2026-01-09 10:26 (UTC+8)) [💬11] ### Your current environment
I ran
sudo docker pull vllm/vllm-openai:latest
``` sudo docker run -d
--gpus all
--name vllm-8b-bf16-b200
… -
#24142 [Feature]: support Hunyuan-MT-Chimera-7B and HunYuanDenseV1ForCausalLM — feature request,stale — by devops724 (closed: 2026-01-09 10:26 (UTC+8)) [💬8] ### 🚀 The feature, motivation and pitch
Hi, Tencent recently released https://github.com/Tencent-Hunyuan/Hunyuan-MT; it would be great if vLLM supported this model.
### Alternatives
No response …
-
#24237 [Bug]: Xformers is not available, falling back, even though I have Xformers installed — bug,stale — by Magmanat (closed: 2026-01-09 10:26 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py` (truncated; Python 3.10.12, accelerate 1.10.0, …). -
#24272 [Bug]: Multi-node DeepSeek-V3-0324 errors out with CUDA Illegal Memory Access — bug,stale — by amanshanbhag (closed: 2026-01-09 10:26 (UTC+8)) [💬5] ### Your current environment
Since this is a Slurm environment, I ran: `srun -N 1 --container-image /fsx/ubuntu/vLLM-testing/vllm-ep.sqsh bash -c "wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py && python3 collect_env.py"` (output truncated). -
#24304 [Bug]: inference with Qwen 2.5 Omni error: vllm.worker.model_runner_base.InputProcessingError: Failed to prepare inputs for sequence group with request id: 0, Error: index 0 is out of bounds for dimension 0 with size 0 — bug,stale — by qq31415926 (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 20.04.6 LTS, x86_64). -
#24436 [Bug]: V1 backend crash when using Qwen3 series model — bug,stale — by nkh0472 (closed: 2026-01-09 10:25 (UTC+8)) [💬10] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 20.04.6 LTS, x86_64). -
#24439 [Feature]: Support using their own max_num_seqs in prefill and decode stages in pd hybrid scenario — feature request,stale — by Liccol (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
vLLM currently uses a single max_num_seqs parameter to control the batch size for both prefill and decode stages. Maybe support for PD hybrid scenarios because:
- Prefill stage: Requires smaller batches due to higher computational intensity and memory usage per request
- Decode stage: Can handle larger batches because of lower computational requirements per request
- Resource utilization: Fixed batch size leads to either underutilization (decode) or overl…
-
#24449 [Usage]: How to quantize a huggingface model to mxfp4 format and infer via vllm — usage,stale — by charyang-ai (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (not provided).
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm. …
-
#24459 [Bug]: Vllm Deepseek-R1 Reasoning Parser error — bug,stale — by access2rohit (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (uv is set; environment dump truncated). -
#24475 [Bug]: Some benchmark results for model whisper-large-v3-turbo are zero — bug,stale — by hnt2601 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
-
#24479 [Feature]: responses API conversation feature — feature request,stale — by questcollector (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
When you call the responses API with the store feature enabled, the msg content is stored in msg_store. Currently msg_store stores content items based on response id key, but it seems that if we manage items by generating conversation id, we can maintain the same context for multiple responses API requests.
- Create a conversation ID and add it to the response body
- Change the key of msg_store to the conversation ID from response_id.
### Alternatives
…
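The proposed re-keying could be sketched roughly as follows. This is a hypothetical in-memory illustration only; the names `create_conversation` and `append_response` are invented here, and the real `msg_store` lives inside vLLM's responses API implementation:

```python
import uuid

# Hypothetical store keyed by conversation id instead of response id
msg_store: dict[str, list[dict]] = {}

def create_conversation() -> str:
    """Create a conversation id (proposed to be returned in the response body)."""
    conv_id = f"conv_{uuid.uuid4().hex}"
    msg_store[conv_id] = []
    return conv_id

def append_response(conv_id: str, items: list[dict]) -> None:
    # Successive responses API calls with the same conversation id
    # would then share the same stored context.
    msg_store[conv_id].extend(items)
```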
-
#24494 [Feature]: Integration of enable_thinking parameter for Async LLM Generation for dynamic generation — feature request,stale — by Dammerzone (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
Hi guys,
I found the possible use of the extra_body arg "enable_thinking" in V1 with the llm.chat() method.
However, this feature doesn't seem to be supported by the generate() method of the async generation API (according to the bot). For now it needs to be passed through a tokenizer parameter at engine initialization.
It would be a great improvement to support dynamic thinking based on a request parameter instead of engine initialization.
…
-
#24507 [Bug]: vllm/vllm-openai image has a stdlib with many CVEs, update stdlib — bug,stale — by rubenporras (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
vllm/vllm-openai image has a stdlib with many CVEs.
Please upgrade stdlib.
### 🐛 Describe the bug …
-
#24508 [Bug]: vllm/vllm-openai image has a linux-libc-dev with many CVEs, update linux-libc-dev — bug,stale — by rubenporras (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
vllm/vllm-openai image has a linux-libc-dev with many CVEs;
please update linux-libc-dev…
-
#24550 [Bug]: the LongCat-Flash model is still not supported — bug,stale — by QiyaoHuang (closed: 2026-01-09 10:25 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
-
#24555 [Feature]: Add New Goodput Metrics in benchmark_serving.py — feature request,stale — by xiaoqun2011 (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Suggested Content:
Currently, the output only includes Request Goodput (requests per second), which limits the ability to evaluate Service-Level Objective (SLO) performance on specific GPU systems. To enhance the granularity of performance analysis, we propose adding the following metrics:
- Output Good Throughput: measures the number of good tokens generated per second.
- Total Good Throughput: measures the total number of good tok…
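The proposed metrics could be computed along these lines. This is a hedged sketch: the field names, the SLO definition (a request is "good" if it met both latency targets), and the function itself are assumptions for illustration, not benchmark_serving.py's actual code:

```python
def goodput_metrics(requests, ttft_slo_s, tpot_slo_s, duration_s):
    """Compute request goodput plus the proposed token-level goodput metrics.

    A request counts as "good" if it met both latency SLOs (assumed definition).
    """
    good = [r for r in requests
            if r["ttft"] <= ttft_slo_s and r["tpot"] <= tpot_slo_s]
    return {
        "request_goodput": len(good) / duration_s,  # existing metric (req/s)
        # Proposed: good output tokens per second
        "output_good_throughput": sum(r["output_tokens"] for r in good) / duration_s,
        # Proposed: total good tokens (input + output) per second
        "total_good_throughput": sum(r["input_tokens"] + r["output_tokens"]
                                     for r in good) / duration_s,
    }
```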
-
#24557 [Bug]: Intermittent Service Downtime Issue with Magistral-Small-2506 Model on GPU VM — bug,stale — by Jimmy888-333 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#24558 [Bug]: Unexpected service down of Llama-4-Scout-17B-16E-Instruct model on GPU-enabled VM — bug,stale — by Jimmy888-333 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#24587 [Bug]: using disagg_example_p2p_nccl_xpyd.sh, only one of the disaggregated P and D instances starts successfully each time; the other fails with the error below — bug,stale — by ldh127 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
@KuntaiDu please take a look. [Bug]: using disagg_example_p2p_nccl_xpyd.sh, only one of the disaggregated P and D instances starts successfully each time; the other fails with the error below.
INFO 09-10 21:20:09 [init.py:1152] Found nccl from library libnccl.so.2 Traceback (most recent call last): File "/opt/ac2/bin/vllm", line 8, in
sys.exit(main()) ^^^^^^ ... -
#24590 [CI Failure]: CUTLASS MLA decode is flaky — stale,ci-failure — by MatthewBonanni (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Name of failing test
tests/kernels/test_cutlass_mla_decode.py::test_cutlass_mla_decode[torch_dtype1-False-True-64-512-576-1-16-4096-1-128]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#24591 [Bug]: `vllm bench serve --random-input-length=1` generates an empty prompt — bug,stale — by smarterclayton (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### Your current environment
Since at least the last few weeks (mid August), but probably much further back.
### 🐛 Describe the bug
While benchmarking P/D I realized that `--random-input-len=1` results in an empty prompt for some models, specifically DeepSeek-R1. This is due to this chunk in the random dataset: …
-
#30859 [Bug]: set_current_vllm_config() is only done during the initialization stage but not the runtime stage — bug — by nvpohanh (closed: 2026-01-09 07:20 (UTC+8)) [💬7] ### Your current environment
Any env
### 🐛 Describe the bug
# Issue Statement
Currently, `set_current_vllm_config()` is only done during the initialization stage but not the runtime stage. If the code tries to call `get_current_vllm_config()`, vLLM prints a warning "Current vLLM config is not set." and returns a default config.…
-
#26110 [RFC]: CI metrics dashboard improvements — RFC,stale — by rzabarazesh (closed: 2026-01-09 04:29 (UTC+8)) [💬2] ### Motivation.
Our CI metrics are currently fragmented and we utilize multiple dashboards. We also lack several key metrics such as % of force merges or PR cycle times. We currently have a dashboard tracking some important CI metrics as well as buildkite UI. We seek to improve upon that by integrating vLLM with the comprehensive set of features offered by [pytorch HUD](https://hud.pytorch.org…
-
#31912 MoE LoRA Kernel Correctness Verification Report — no label — by qywu (closed: 2026-01-09 02:52 (UTC+8)) ## Summary
This issue documents the comprehensive correctness verification of the fused MoE LoRA Triton kernel (`vllm/lora/ops/triton_ops/fused_moe_lora_op.py`).
## Test Methodology
### 1. Kernel-Level Testing Compared the fused MoE LoRA kernel output against a reference implementation that computes: ```python # For each token with lora_idx and expert_id: …
-
#31876 [CI Failure]: backend_xgrammar.py: Failed to advance FSM for request — ci-failure — by BlankRH (closed: 2026-01-09 00:40 (UTC+8)) [💬4] ### Name of failing test
v1/e2e/test_async_scheduling.py::test_with_spec_decoding
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#23557 [New Model]: Grok 2 — good first issue,new-model — by mgoin (closed: 2026-01-08 23:57 (UTC+8)) [💬18] ### The model to consider.
https://huggingface.co/xai-org/grok-2
### The closest model vllm already supports.
Grok 1 has already been supported (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/grok1.py) and it seems Grok 2 is a few modifications on top of it - still listing its architecture as
`Grok1ForCausalLM` in the config.json. You can reference this sglang PR that implements the changes to the Grok1 d…
-
#31859 [Bug]: Intermittent 500 Internal Server Error When Stress-Testing Qwen3-VL-2B-Instruct with vLLM on H20 — bug — by AlpacaKnight (closed: 2026-01-08 21:28 (UTC+8)) [💬7] ### Your current environment
The service is started with the following command:
``` vllm serve --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --enable-log-requests --enable-log-outputs --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --served-model-name model ``` ... -
#31439 [Bug]: aot_compile disables inductor graph partition — bug — by BoyuanFeng (closed: 2026-01-08 17:27 (UTC+8)) ### Your current environment
Is CUDA available : True CUDA runtime version : 12.9.86 CUDA_MODULE_LOADING set to : GPU models and configuration : GPU 0: NVIDIA B200 GPU 1: NVIDIA B200 GPU 2: NVIDIA B200 GPU 3: NVIDIA B200 …
-
#31954 [Bug] OLMo3 reasoning parser fails to detect </think> end tag, preventing GCD activation — no label — by ivnle (closed: 2026-01-08 17:18 (UTC+8)) [💬1] ### Description
The OLMo3 reasoning parser (`olmo3_reasoning_parser.py`) fails to correctly detect when thinking ends, causing all generated tokens to be classified as thinking tokens. This prevents grammar-constrained decoding (GCD) from activating for the post-thinking structured output.
### Evidence
From experiments with OLMo3-7B-Think + GCD + thinking=high on the math500 task:
Question 0 (Polar coordinates conversion): (truncated JSON output) …
-
#29574 [Performance]: Using vLLM to accelerate VLM models, does the vision encoding part currently support parallel processing, or is it still being processed serially? — performance — by NewZxy (closed: 2026-01-08 14:21 (UTC+8)) [💬2] ### Proposal to improve performance
I found that currently, images of different sizes are processed sequentially, which significantly slows down the processing speed. How can we adapt to parallel processing? Should we resize or pad all images to the same size for batch processing, or can we run multiple encoder models in parallel? Thank you.
### Report of performance regression
No response
### Misc discussion on performance
…
[New PRs]
-
#31968 [CPU] Add head sizes 80 and 112 with vec16 fallback — v1,cpu — by R3hankhan123 (created: 2026-01-08 20:50 (UTC+8)) [+33/-4, 2 files | commented:2]
## Purpose Reintroduce support for head dimensions 80 and 112 in the CPU attention backend, which were previously removed in #27954. These head dimensions are commonly used by Granite models deployed on Z archs, but they are not friendly to the Intel AMX instruction set, so the implementation now falls back to vec16. ## Test Plan Build the Docker image and test using the
`ibm-granite/granite-3b-code-base-2k` model, which has a head size of 80. ## Test Result Server Logs ```… -
#31998 [Misc] Enable async scheduling by default with spec decoding — ready — by njhill (created: 2026-01-09 07:31 (UTC+8)) [+20/-19, 1 files | commented:1] Now that all of the gaps have been addressed in async scheduling + spec decoding support, we can enable it by default in this case too.
It will still be disabled implicitly for non-EAGLE/MTP types or when padded drafter batch is disabled.
This should only be merged once https://github.com/vllm-project/vllm/pull/30495 is merged.
[!NOTE] Enables async scheduling by default when using compatible speculative decoding, with clearer gating and messaging. …
-
#31993 [ROCm][CI] Fix test_token_classification.py::test_bert_models — rocm — by divakar-amd (created: 2026-01-09 04:53 (UTC+8)) [💬1 | +12/-4, 1 files | commented:1 approved:2] This PR fixes mi325_1: Language Models Test (Extended Pooling):
models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]
The issues were caused by reasons similar to those mentioned in https://github.com/vllm-project/vllm/pull/31612
[!NOTE] Addresses ROCm-specific correctness in token classification tests.
- For `test_bert_models` and `test_modernbert_models`, pass `model_kwargs={'attn_implementation': 'eager'}` to HF `AutoModelForToken…
- For
-
#31992 [Bugfix] Fix typo in FusedMoE LoRA reshape comment — no label — by xyang16 (created: 2026-01-09 04:36 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR is a minor typo fix in the lora_b reshape comment: the rank and num_experts positions need to be switched.
## Test Plan
## Test Result
... -
#31987 [Frontend] Improve error message — frontend,ready — by DarkLight1337 (created: 2026-01-09 01:50 (UTC+8)) [+20/-4, 3 files | commented:1 approved:1]
## Purpose
Improve the error message that is returned to the client
## Test Plan
## Test Result
…
-
#32008 [MISC] Add strict contiguity check for FlashInfer attention tensors — v1,nvidia — by vadiklyutiy (created: 2026-01-09 10:21 (UTC+8)) [+41/-10, 2 files | commented:2] Early check of a potential error as in #30842. See also #31617, https://github.com/flashinfer-ai/flashinfer/issues/2232
Updates the FlashInfer attention path to use a stricter contiguity check, preventing potential CUDA kernel memory access issues. Introduces an
`is_strictly_contiguous()` utility to detect tensors with degenerate strides that PyTorch's `is_contiguous()` reports as contiguous.
[!NOTE] Strengthens memory layout validation to prevent degenerate-stride tensors from reaching CUDA k…
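The degenerate-stride idea can be illustrated in pure Python on (shape, strides) pairs. This is only a sketch of the concept under the assumption that "strict" means strides must exactly match the canonical row-major layout; the PR's actual `is_strictly_contiguous()` operates on torch tensors:

```python
def canonical_strides(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Row-major strides a truly contiguous tensor of this shape would have."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def is_strictly_contiguous(shape, strides) -> bool:
    # Stricter than PyTorch's is_contiguous(), which ignores the strides of
    # size-1 dimensions and can therefore accept degenerate layouts
    # (e.g. stride 0 from an expanded/broadcast dimension).
    return tuple(strides) == canonical_strides(tuple(shape))
```

For example, shape `(1, 4)` with strides `(0, 1)` passes PyTorch's ordinary contiguity check but fails the strict one.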
-
#31994 fix lora moe sharding when rank < max_lora_rank — ready,gpt-oss — by gnovack (created: 2026-01-09 05:41 (UTC+8)) [💬1 | +6/-8, 2 files | commented:1 approved:1] ## Purpose This PR fixes a bug in the implementation of `fully_sharded` for MoE LoRA adapters. Currently, when a LoRA adapter is loaded whose rank is less than `max_lora_rank`, the LoRA A W13 weights are split into shards of size `current_lora_rank // tp_degree`.
This causes problems when we allgather the LoRA A output along the rank dim, since the resulting tensor will be interspersed with 0s, as opposed to right-hand-padded with 0s (like the LoRA B W13 weights).
e.g. (truncated example: LoRA A output w…)
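A toy pure-Python illustration of the layout problem described (numbers and padding scheme are made up for illustration; the real code shards torch tensors across TP ranks):

```python
def allgather(shards):
    # Concatenate per-rank shards along the rank dimension
    return [row for shard in shards for row in shard]

# Two TP ranks, adapter rank 2, max_lora_rank 4; each rank holds one valid row.
valid_rows = [[1.0], [2.0]]

# Problematic layout: each rank pads its own shard up to max_lora_rank // tp,
# so after allgather the zeros end up interspersed between valid rows.
interspersed = allgather([[valid_rows[0], [0.0]], [valid_rows[1], [0.0]]])

# Intended layout: gather only the valid rows, then pad once on the right,
# matching how the LoRA B W13 weights are laid out.
right_padded = allgather([[valid_rows[0]], [valid_rows[1]]]) + [[0.0], [0.0]]
```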
-
#31970 [CI] [ROCm] Fix `tests/entrypoints/test_grpc_server.py` on ROCm — rocm,ready,ci/build — by tjtanaa (created: 2026-01-08 21:08 (UTC+8)) [+18/-4, 5 files | commented:3 approved:1] ## Purpose
Fix the `tests/entrypoints/test_grpc_server.py` CI issue https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?sid=019b9d04-9972-4807-9712-b517956d17b3 ``` ==================================== ERRORS ==================================== -- ____ ERROR collecting tests/entrypoints/test_grpc_server.py ____ ImportError while importing test module '/vllm-workspace/tests/entrypoints/test_grpc_server.py'. …
-
#32000 [ROCm][CI][V1] Fix `nixl_connector` test failure and achieve CUDA parity in `test_async_scheduling` — rocm,speculative-decoding,v1,kv-connector,nvidia — by AndreasKaratzas (created: 2026-01-09 08:03 (UTC+8)) [💬3 | +28/-39, 3 files | commented:1 approved:1] This PR adds FlexAttention backend support for ROCm in the EAGLE speculative decoding proposer, removing platform-specific attention backend restrictions and merging with the CUDA data flow of this test. It also fixes `test_abort_timeout_on_prefiller[ray]` failing on ROCm platforms.
## Changes
- Added `FlexAttentionMetadata` to the allowed attention types for ROCm in `eagle.py`
- Removed ROCm-specific backend overrides that were workarounds for missing FlexAttention support
- Modify `_make_fak…
-
#31949 [Bugfix] Fix FusedMoE LoRA w2_output_size — ready — by xyang16 (created: 2026-01-08 14:35 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR fixes FusedMoE LoRA `w2_output_size`. `w2_output_size` is currently set to `w2_lora_a_stacked[0].shape[-2]` incorrectly, making it become the rank.
## Test Plan
pytest -s -v tests/lora/test_gptoss_tp.py
## Test Result
…
- #31948 fix: remove duplicate engine_id check in nixl_connector — ready,kv-connector — by xbfs (created: 2026-01-08 14:28 (UTC+8)) [+0/-7, 1 files | commented:1 approved:1]
[!NOTE] Removes duplicate engine ID validation in `nixl_connector.py` during `_nixl_handshake`, keeping a single check after decoding `NixlAgentMetadata`.
- Simplifies control flow; no functional change expected aside from eliminating redundant error raising
-
#31943 Weight transfer — documentation,frontend,v1 — by ahao-anyscale (created: 2026-01-08 12:37 (UTC+8)) [💬2 | +2483/-10, 26 files | commented:2 | 📝draft] ## Purpose
This PR introduces native weight syncing APIs for vLLM to support reinforcement learning post-training workflows (RLHF, PPO, etc.).
Currently, open-source projects like SkyRL, VeRL, and TRL must implement their own weight syncing infrastructure to use vLLM as an inference server during training. This leads to duplicated effort and requires users to version-lock to specific implementations. See RFC #31848 for full motivation. …
-
#32005 Reduce kernel overhead when the number of active LoRAs is smaller than max LoRAs; multiple CUDA graphs are captured per number of active LoRAs — v1,nvidia — by yugong333 (created: 2026-01-09 09:41 (UTC+8)) [+212/-42, 12 files | commented:2] ## Purpose
## Test Plan Using llmperf to benchmark concurrency = 1, 2, 4, 8 when max-loras = 4 ## Test Result
... - #32003 [Cleanup] Remove obsolete spec decoding compatibility logic — performance,speculative-decoding,ready,v1 — by njhill (created: 2026-01-09 09:19 (UTC+8)) [+45/-75, 8 files | commented:1]
This is logic which was included with the original V1 spec decoding impl (ngram only), prior to various other changes which have made it obsolete including:
- Adding other spec decoding methods
- Adding support for logprobs and penalty parameters with spec decoding
Note that there are still some parameters which don’t work with spec decoding (min_p, min_tokens, logit_bias). Requests with these params will now fail when spec decoding is enabled (see https://github.com/vllm-project/vllm/pull/3198…
-
#31956 [Frontend] Add `reasoning_effort` to `OpenAIServing._preprocess_chat()` — frontend — by sanghoon-yn (created: 2026-01-08 17:04 (UTC+8)) [+8/-1, 3 files | commented:4] ## Purpose I suggest adding `reasoning_effort` to `OpenAIServing._preprocess_chat()` so that it can be used in the chat template.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31952 [Frontend] Add `reasoning_effort` and `parallel_tool_calls` to `extra_args` of `SamplingParams` — frontend,gpt-oss — by sanghoon-yn (created: 2026-01-08 15:25 (UTC+8)) [💬5 | +14/-2, 2 files | commented:7] ## Purpose
I suggest adding two fields, `reasoning_effort` and `parallel_tool_calls`, to `extra_args` of `SamplingParams` to enable logits processors to support the OpenAI-compatible parameters by controlling token generation.
## Test Plan
## Test Result
…
-
#31999 Fix type error — frontend,ready,meta-exported,fb-exported — by Adolfo-Karim (created: 2026-01-09 07:36 (UTC+8)) [💬1 | +3/-1, 1 files | commented:3 approved:2] Summary: id can be None, so we need to check it.
Test Plan: Updated local instance, the error is gone.
Differential Revision: D90353429
[!NOTE] …
-
#32002 fused_moe_kernel - cast accumulator after applying router weights — no label — by gnovack (created: 2026-01-09 08:54 (UTC+8)) [+5/-8, 1 files | commented:1] ## Purpose
The `tests/lora/test_olmoe_tp.py test_olmoe_lora_mixed` test case has been failing since https://github.com/vllm-project/vllm/pull/31676 was merged. Previously the application of `moe_weight` (router weights) to the `accumulator` was performed in `float32`, but after https://github.com/vllm-project/vllm/pull/31676, this computation is done after the `accumulator` has been cast to `compute_type`. This change in behavior caused the above test case to begin failing, and introduced a sl…
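The numerical effect of the ordering change can be sketched with a toy low-precision cast. This is purely illustrative: `cast_low` is an invented stand-in for the fp32-to-`compute_type` cast, and the numbers are made up:

```python
def cast_low(x: float, bits: int = 8) -> float:
    # Toy stand-in for casting a value to a lower-precision dtype
    scale = float(2 ** bits)
    return round(x * scale) / scale

acc, router_w = 0.12345678, 0.3333  # hypothetical accumulator value and router weight

# Order restored by the fix: apply the router weight in high precision,
# cast the accumulator only afterwards.
weighted_then_cast = cast_low(acc * router_w)

# Order introduced by the regression: cast first, then apply the weight,
# losing precision before the multiply.
cast_then_weighted = cast_low(acc) * router_w
```

The two orderings generally produce slightly different results, which is the kind of drift that made the test start failing.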
- #32001 [fix] add cutedsl to global sf — nvidia — by jiahanc (创建于: 2026-01-09 08:39 (UTC+8)) [+1/-0, 1 files | commented:1 approved:1] ## Purpose Add flashinfer cutedsl to global sf list, fixes https://github.com/vllm-project/vllm/issues/31918 ## Test Plan ``` VLLM_DEEPEPLL_NVFP4_DISPATCH=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_STANDALONE_COMPILE=0 VLLM_FLASHINFER_MOE_BACKEND="masked_gemm" VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ALL2ALL_BACKEND="deepep_low_latency" lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-0528-FP4-v2,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=Tr…
- #31977 [Bugfix] Fix Typo from NVFP4 Refactor — ready,nvidia — by robertgshaw2-redhat (创建于: 2026-01-08 23:36 (UTC+8)) [+30/-3, 6 files | commented:2 approved:2]
## Purpose
- broke cutedsl with typo
- this fixes it and adds testing in the ci/cd
## Test Plan
- ci, running cutedsl
## Test Result
…
-
#31982 [BugFix] Add spec-decode-incompatible request param validation — ready,v1 — by njhill (创建于: 2026-01-09 01:08 (UTC+8)) [+12/-3, 2 files | commented:2 approved:1] Some sampling parameters are not yet supported with spec decoding. We log a warning during startup but then effectively silently ignore the parameters if included in requests at runtime.
This adds request-scoped validation so that such requests fail rather than being silently ignored.
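The fail-fast behavior can be sketched as follows (the names are illustrative, not vLLM's actual code; the parameter set is the one noted earlier in this digest as unsupported with spec decoding):

```python
# Hypothetical stand-ins for sampling parameters that are incompatible
# with speculative decoding.
UNSUPPORTED_WITH_SPEC_DECODE = {"min_p", "min_tokens", "logit_bias"}

def validate_sampling_params(params: dict, spec_decode_enabled: bool) -> None:
    """Reject a request up front instead of silently ignoring parameters."""
    if not spec_decode_enabled:
        return
    # Only treat a parameter as "used" when it has a truthy (non-default) value.
    bad = sorted(k for k, v in params.items()
                 if k in UNSUPPORTED_WITH_SPEC_DECODE and v)
    if bad:
        raise ValueError(
            f"Sampling parameters {bad} are not supported "
            "with speculative decoding")
```

Rejecting at request scope gives the client an actionable error instead of results that quietly ignore the requested sampling behavior.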
-
#31996 [MoE Refactor] Move `select_experts` from `FusedMoEQuantMethod` -> `FusedMoE` — needs-rebase — by bnellnm (创建于: 2026-01-09 06:10 (UTC+8)) [💬2 | +1180/-574, 30 files | commented:1 | 📝草稿] ## Purpose ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31980 Add mergify label job for “bug” match — ci/build — by mgoin (创建于: 2026-01-09 00:59 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#31986 [Feature][#29390]: Add timeout support to MultiprocExecutor.collective_rpc and FutureWrapper — v1 — by SandishKumarHN (创建于: 2026-01-09 01:31 (UTC+8)) [💬3 | +33/-9, 1 files | commented:1] ## Purpose
- Add support for timeouts when calling vllm.v1.executor.multiproc_executor.MultiprocExecutor.collective_rpc and when awaiting futures returned by non-blocking RPCs.
- Treat a provided timeout as an overall deadline across response MessageQueues when collecting responses.
- Add FutureWrapper.result(timeout=…) and propagate the timeout into FutureWrapper.wait_for_response(get_response, timeout=…) so user code can wait on the Future with an overall timeout.
- This change makes debu…
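The overall-deadline behavior described above can be sketched like this (a simplified stand-in using `queue.Queue`; the real executor collects from its response `MessageQueue`s, and all names here are hypothetical):

```python
import queue
import time

def collect_responses(queues, timeout: float):
    """Collect one response per queue, treating `timeout` as a single
    overall deadline rather than a fresh timeout for each queue."""
    deadline = time.monotonic() + timeout
    responses = []
    for q in queues:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("RPC deadline exceeded")
        try:
            responses.append(q.get(timeout=remaining))
        except queue.Empty:
            raise TimeoutError("RPC deadline exceeded") from None
    return responses
```

Recomputing `remaining` before each blocking wait is what makes the timeout a true deadline: time spent waiting on earlier queues is charged against later ones.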
-
#31997 [CI/Build][Hardware][AMD] Fix test_forward_error — rocm,v1 — by rjrock (创建于: 2026-01-09 07:03 (UTC+8)) [+15/-1, 1 files | commented:1 | 📝草稿] ## Purpose To get `CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown` to pass on AMD CI. Since ROCm uses fork as the subprocess init method, we need to use `sitecustomize.py` to propagate the `evil_forward` function. ## Test Plan `CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown`. ## Test Result Before: Failed to raise error …
-
#31989 Update VOLKANURALTR.md — documentation — by volkTRhacV (创建于: 2026-01-09 02:53 (UTC+8)) [💬4 | +4/-4, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31991 [wip] expose online quantization for linear for compressed-tensors path — 无标签 — by vkuzo (创建于: 2026-01-09 03:52 (UTC+8)) [💬1 | +126/-4, 2 files | commented:2] Summary:
WIP and not ready for review yet
High level, I’d like to work towards enabling online quantization for all the quantization recipes in compressed-tensors. Ideally we work through a POC for a single recipe and get alignment from vllm team, and then we can parallelize further work if needed.
For now, tested e2e on a tiny dense model (facebook/opt-125m) and float8 rowwise scaling.
tl;dr;
- enables `CompressedTensorsConfig.from_config_file`, and provides an example config file for flo…
-
#31941 LoRA Per Request Loading Pipelining Support — v1 — by kfhfar (创建于: 2026-01-08 11:27 (UTC+8)) [+301/-108, 11 files | commented:10 | 📝草稿] ## Purpose Currently LoRA Loading for Per Request needs to first sequentially load all the LoRA layers before beginning Pre-Fill. This adds significant overhead in TTFT.
However, Pre-Fill is a layer-by-layer process, and for large enough inputs (> 1000 tokens) it takes longer than LoRA loading. If we run **Lo…
-
#31995 [ROCM] Add ROCm image build to release pipeline — rocm,ci/build — by dllehr-amd (创建于: 2026-01-09 05:56 (UTC+8)) [+22/-2, 3 files | commented:1] - Added build-release-image-rocm step to build ROCm release images
- Builds Dockerfile.rocm_base first, then Dockerfile.rocm with vllm-openai target
- Tags and publishes image as vllm-release-repo:$BUILDKITE_COMMIT-rocm
- Updated annotate-release.sh script to include ROCm image pull/tag/push instructions
## Purpose
## Test Plan …
- #31978 Revert “feat(moe): Add is_act_and_mul=False support for Triton MoE kernels” — rocm,ready — by mgoin (创建于: 2026-01-09 00:17 (UTC+8)) [+9/-191, 7 files | commented:1 approved:1] Reverts vllm-project/vllm#31645, my reasoning is here https://github.com/vllm-project/vllm/pull/31645#issuecomment-3724590758
- #31988 [Misc][PD] Fix `get_attn_backend` usage in transfer connectors — kv-connector — by NickLucche (创建于: 2026-01-09 02:20 (UTC+8)) [💬1 | +17/-20, 2 files | commented:4 approved:1] This PR fixes the use of `get_attn_backend` in the context discussed here https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1767810349936949. In particular, it turns out that modifications to the interface of this shared function can cause unintended backend retrieval (as a partial configuration was passed in), leading to cases such as ``` VLLM_LOGGING_LEVEL=DEBUG vllm serve google/gemma-3-4b-it --port 8004 --enforce-eager --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'… -
#31965 [Model Runner V2] Simplify BlockTables with UVA — v1 — by WoosukKwon (创建于: 2026-01-08 19:16 (UTC+8)) [+52/-154, 2 files] This PR simplifies the `append_block_ids` method in `BlockTables` by using a UVA buffer to store the block tables. Previously, the block table was stored on the GPU, and we sent the "diff" to the GPU every step. While efficient, this approach complicated the code quite a bit, since the "diff"s had to be packed into contiguous GPU tensors and async-copied.
Using the UVA tensor eliminates the need for such packing and async copies, at the cost of transferring the block tables from CPU to GPU every step.
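A toy model of the simplified scheme (plain Python lists standing in for the shared buffer; names are illustrative, not vLLM's actual code): with a single buffer that both CPU and GPU address, appending block IDs is just an in-place write, with no diff packing or async copy step:

```python
class BlockTables:
    """Toy model of block tables kept in one shared (UVA-like) buffer."""

    def __init__(self, max_num_reqs: int, max_blocks_per_req: int):
        # With a UVA buffer, the CPU writes here and the GPU reads the
        # same memory on the next step; there is no separate GPU copy
        # that must be kept in sync via packed diffs.
        self.table = [[0] * max_blocks_per_req for _ in range(max_num_reqs)]
        self.num_blocks = [0] * max_num_reqs

    def append_block_ids(self, req_idx: int, block_ids: list[int]) -> None:
        # Just write in place; the whole table is (re)read by the GPU
        # each step, trading bandwidth for much simpler bookkeeping.
        start = self.num_blocks[req_idx]
        self.table[req_idx][start:start + len(block_ids)] = block_ids
        self.num_blocks[req_idx] += len(block_ids)
```

The tradeoff mirrors the PR description: the diff-based design minimized CPU-to-GPU traffic but required packing logic; the shared-buffer design transfers more per step but removes that code entirely.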
-
#31962 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow — 无标签 — by xuebwang-amd (创建于: 2026-01-08 18:44 (UTC+8)) [💬2 | +33/-18, 1 files | commented:1] ## Purpose
# Background: This PR is continued work on two previous PRs:
- PR #31676
  - key contribution: move bias adding after dequantization
  - computation order: to(compute_type) -> HAS_BIAS (bias adding) -> MUL_ROUTED_WEIGHT; this is the closest to the right order, except that the MoE weight multiplication is not in float32.
- PR #31931
  - key contribution: preserving router weight s…
- #31983 [Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) — 无标签 — by danisereb (创建于: 2026-01-09 01:09 (UTC+8)) [💬1 | +7/-0, 3 files | commented:1]
## Purpose
vLLM serve command that fails:
export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/ export VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve $MODEL_PATH --served-model-name my_model \ --trust-remote-code --async-scheduling --kv-cache-dtype fp8 --tensor-parallel-size 1…
- #31944 [BugFix] Fix spec decoding edge case bugs — ready,v1 — by njhill (创建于: 2026-01-08 12:53 (UTC+8)) [+49/-46, 3 files | commented:2 approved:1]
There are a couple of edge case issues:
- Since https://github.com/vllm-project/vllm/pull/29821, it's again possible to hit the issue described in https://github.com/vllm-project/vllm/pull/30916. Our test does cover it, but it's not guaranteed to trigger, so it was missed and is now manifesting as a flake. The fix here is to clear the request's `spec_token_ids` in the scheduler when it is preempted.
- There's also an issue related to a case where the request can be excluded from the batch temporarily…
-
#31984 [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel — 无标签 — by jvlunteren (创建于: 2026-01-09 01:18 (UTC+8)) [+26/-3, 1 files | commented:1] ## Purpose
This pull request improves the efficiency of sliding window attention in the 3D kernel by applying the same optimization used in the 2D kernel. It ensures that only tiles within the defined window are processed, resulting in better performance and reduced computational overhead.
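The tile-skipping idea can be illustrated with the index arithmetic alone (a simplified sketch; tile sizes and names are illustrative, not the kernel's actual parameters):

```python
def key_tile_range(q_tile: int, tile_size: int, window: int) -> tuple[int, int]:
    """Half-open range [lo, hi) of key tiles a query tile must visit
    under causal attention with a sliding window of `window` positions."""
    q_first = q_tile * tile_size        # first query position in the tile
    q_last = q_first + tile_size - 1    # last query position in the tile
    # Earliest key any query in this tile can see is q_first - window + 1;
    # the causal bound caps visible keys at q_last.
    lo = max(0, q_first - window + 1) // tile_size
    hi = q_last // tile_size + 1
    return lo, hi
```

For example, with 64-wide tiles and a 128-token window, query tile 4 (positions 256-319) only needs key tiles 2 through 4, whereas plain causal attention would visit tiles 0 through 4; the skipped tiles are where the savings come from.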
## Performance The following results were obtained for `openai/gpt-oss-20b` on an NVIDIA H100 GPU, by running `$ VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm bench latency \` ... -
#31979 [Doc] Improve MM models LoRA notes — documentation,ready — by jeejeelee (创建于: 2026-01-09 00:21 (UTC+8)) [💬1 | +1/-23, 1 files | approved:1 commented:1]
## Purpose
## Test Plan
## Test Result
... -
#31981 [Misc] Clean up world_size > avail_gpu warning for ray — v1 — by ruisearch42 (创建于: 2026-01-09 01:05 (UTC+8)) [+0/-17, 1 files | commented:1] ## Purpose
The same check is in `ParallelConfig.__post_init__()` and should be called before `initialize_ray_cluster()`: https://github.com/vllm-project/vllm/blob/b8112c1d85442b060e37df90e9db343cbc7b000c/vllm/config/parallel.py#L592
FIX #31005
## Test Plan N/A
…
-
#31973 [Model] Reorganize pooling layers — documentation,ready,v1,qwen — by DarkLight1337 (创建于: 2026-01-08 21:27 (UTC+8)) [💬2 | +1241/-1130, 32 files | commented:9 approved:1] ## Purpose `pooler.py` is getting really bloated, so let's split everything up:
- Pooler activations (`vllm.model_executor.layers.pooler.activations`)
- Common code (`vllm.model_executor.layers.pooler.common`)
- Abstract pooler (`vllm.model_executor.layers.pooler.abstract`)
- Special poolers (`vllm.model_executor.layers.pooler.special`)
- Poolers that apply to the whole batch of requests at once (`vllm.model_executor.layers.pooler.batch`)
- Poolers that apply to one request at a time (`vllm.mo…
-
#31955 [BugFix] Clear spec_token_ids for preempted req to prevent grammar conflicts on resumption — v1 — by izhuhaoran (创建于: 2026-01-08 16:51 (UTC+8)) [💬4 | +9/-0, 1 files | commented:1] ### Purpose Fixes #31876
The unit test `v1/e2e/test_async_scheduling.py::test_with_spec_decoding` fails intermittently. The failure occurs with the following configuration: ```python test_sampling_params = [ dict(structured_outputs=struct_outputs), ]
…
-
#31960 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 — ready — by danisereb (创建于: 2026-01-08 17:54 (UTC+8)) [+4/-3, 1 files | commented:6 approved:1] ## Purpose Fix bug: https://github.com/vllm-project/vllm/issues/31957
The fix in file `fp8_utils.py` is required to fix this error (raised from `flashinfer_cutlass_fused_moe`): `RuntimeError: Check failed: fc1_dequant.ndim() == 1 (0 vs. 1) : fc1_dequant must be a 1D tensor` ## Test Plan …
- #31953 [CI] fix ROCM_HOME: unbound variable — rocm,ready,ci/build — by LucasWilkinson (创建于: 2026-01-08 15:59 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
https://github.com/vllm-project/vllm/pull/31922 broke the “2 Node Tests (4 GPUs in total)” with `[2026-01-08T07:32:12Z] ./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` https://buildkite.com/organizations/vllm/pipelines/ci/builds/46074/jobs/019b9c75-6389-40dc-a23b-5ecd3683d951/log
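The failure mode is standard bash `set -u` behavior: referencing an unset variable aborts the script. The usual guard (illustrative, not necessarily the exact patch in the PR) is a default-value expansion:

```shell
#!/usr/bin/env bash
set -u  # strict mode: an unset variable becomes a hard error

unset ROCM_HOME          # simulate a node where ROCm is not configured
# A bare "$ROCM_HOME" here would abort with "ROCM_HOME: unbound variable";
# the :- expansion substitutes an empty default instead of failing.
ROCM_HOME="${ROCM_HOME:-}"

if [ -n "$ROCM_HOME" ]; then
    echo "using ROCM_HOME=$ROCM_HOME"
else
    echo "ROCM_HOME not set; skipping ROCm-specific setup"
fi
```

`${VAR:-default}` is safe under `set -u` even when `VAR` has never been assigned, which is why it is the standard fix for CI scripts that must run on both ROCm and non-ROCm nodes.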
-
#31976 [WIP] Add: EAGLE3 Acceptance Length Regression Tests — speculative-decoding,v1 — by rahul-tuli (创建于: 2026-01-08 23:21 (UTC+8)) [+208/-0, 1 files | commented:1 | 📝草稿] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter |
|----------|---------| …
-
#31969 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming — ready — by chaunceyjiang (创建于: 2026-01-08 20:53 (UTC+8)) [+6/-0, 1 files | commented:1 approved:1] ## Purpose Follow up https://github.com/vllm-project/vllm/issues/30056
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31971 [Docs]: update claude code url — documentation,ready — by chaunceyjiang (创建于: 2026-01-08 21:10 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Follow up #31188 ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31974 Fix bug: `hf_token` argument to LLM in Python SDK ignored in `vllm.transformer_utils.config` — 无标签 — by benglewis (创建于: 2026-01-08 21:54 (UTC+8)) [💬1 | +36/-2, 3 files | commented:1 | 📝草稿] ## Purpose Fix the `hf_token` argument not being passed through to `transformers`'s `AutoConfig` correctly. Fixes #31894 ## Test Plan Try passing an `hf_token` and loading a gated model which requires an `hf_token` via the vLLM Python code. ## Test Result WIP
…
-
#31967 [CI] [Bugfix] Fix unbounded variable in `run-multi-node-test.sh` — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-08 20:36 (UTC+8)) [+2/-1, 2 files | commented:1 approved:1] ## Purpose Fix the issue `./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` after PR https://github.com/vllm-project/vllm/pull/31922. https://buildkite.com/vllm/ci/builds/46115/steps/canvas?sid=019b9d2c-bfc3-4c47-b214-3bf7def2f86a
Add `.buildkite/scripts/run-multi-node-test.sh` into `test-pipeline.yml` to trigger during CI. ## Test Plan
…
-
#31972 [Models]: Make Multimodal config implicit in ViT implementation — qwen — by Isotr0py (创建于: 2026-01-08 21:20 (UTC+8)) [+41/-52, 13 files | commented:1 | 📝草稿]
## Purpose
- Currently, we are passing `MultimodalConfig` through the whole MMEncoder only to enable data parallel and attention backend overrides
- This PR makes it implicit through `get_current_vllm_config`.
## Test Plan
## Test Result
…
-
#31946 Feat/feat/support nemotron h mtp functional — new-model,speculative-decoding,v1 — by shaharmor98 (创建于: 2026-01-08 14:05 (UTC+8)) [💬2 | +1452/-127, 12 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31964 [KVConnector] Support worker -> scheduler metadata — v1,kv-connector — by orozery (创建于: 2026-01-08 19:04 (UTC+8)) [💬1 | +270/-27, 6 files | commented:1] This PR introduces a new `build_worker_connector_meta` KV connector API, allowing workers to send arbitrary metadata back to the scheduler-side connector. This aligns with the already existing `build_connector_metadata` API, which allows for the same in the opposite direction (scheduler -> worker).
In particular, this API is needed for the OffloadingConnector to be able to notify the scheduler-side on offloaded blocks, even before a request is finished.
-
#31950 [MM Encoder]: Make `MMEncoderAttention`'s `scale` take effect properly — ready,qwen — by Isotr0py (创建于: 2026-01-08 14:47 (UTC+8)) [+34/-8, 13 files | approved:1 commented:1] ## Purpose
- `scale` in `MMEncoderAttention` doesn't actually take effect, because we previously didn't pass it to the ViT wrapper ops.
- This PR fixes it, and also standardizes Qwen-VL-style `MMEncoderAttention` usage to pass `scale` even when it is `scale=head_dim**-0.5`
## Test Plan
## Test Result
…
-
#31947 [Model] Standardize common vision encoders — ready,deepseek — by DarkLight1337 (创建于: 2026-01-08 14:26 (UTC+8)) [💬1 | +254/-174, 19 files | commented:3 approved:1]
## Purpose
- Accept a `multimodal_config` argument in the common vision encoders used by `init_vision_tower_for_llava` (CLIP, SigLIP, PixtralHF), and forward it to `MMEncoderAttention`. This also enables handling of `mm_encoder_tp_mode`.
- Update `SiglipVisionEmbeddings` according to current Transformers code.
- Merge `Phi3ImageEmbeddingBase` into `Phi3HDImageEmbedding` since it's the only subclass.
- Fix some instances where `self.scale` failed to be passed to `M…
-
#31951 [Chore] Further cleanup pooler — ready,v1 — by DarkLight1337 (创建于: 2026-01-08 14:55 (UTC+8)) [+47/-62, 7 files | commented:1 approved:1] ## Purpose
- Rename `Tokens*` -> `Tokenwise*` to be clearer
- Remove redundant `PoolingType` and replace it with string literals.
- Avoid redundant operations in `AllPool.forward`.
## Test Plan
## Test Result
…
- #31958 [Bugfix] Keep all tensors to be on the same device — v1 — by wjunLu (创建于: 2026-01-08 17:35 (UTC+8)) [💬1 | +1/-1, 1 files] When running on Ascend NPU, I met the following error ``` (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2988, in _torch_cuda_wrapper (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] yield (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", lin…
-
#31942 [platform] add dp_metadata arg to set_additional_forward_context — ready — by Ronald1995 (创建于: 2026-01-08 11:34 (UTC+8)) [💬1 | +1/-0, 1 files | commented:3 approved:3]
## Purpose #31674 added `set_additional_forward_context` in `set_forward_context` so that other platforms can add extra fields to `forward_context`, but `set_additional_forward_context` lacks the `dp_metadata` argument. Other platforms will use `num_tokens_across_dp`, but if there is no `dp_metadata`, they will call `coordinate_batch_across_dp` again, which causes some extra communication. ## Test Plan ## Test Result
…
-
#31945 [5/n] Migrate non-cutlass part of csrc/quantization/w8a8 to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (创建于: 2026-01-08 12:57 (UTC+8)) [+1526/-1044, 30 files | commented:1 | 📝草稿] Stacked on https://github.com/vllm-project/vllm/pull/31842/files
## Purpose
## Test Plan pytest tests/kernels/quantization/test_fp8_quant.py -v pytest tests/kernels/quantization/test_fp8_quant_group.py -v pytest tests/kernels/quantization/test_int8_quant.py -v pytest tests/kernels/quantization/test_int8_kernel.py -v …
- #31939 [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward — 无标签 — by maang-h (创建于: 2026-01-08 10:57 (UTC+8)) [+4/-10, 1 files | commented:1]
## Purpose
Remove unused `output_sizes` parameter and optimize `output_bias` computation in `ColumnParallelLinear`. ### Changes
- Remove unused `output_sizes` parameter from `ColumnParallelLinear.__init__()`:
  - The subclass `MergedColumnParallelLinear` sets `self.output_sizes` as an instance attribute before calling `super().__init__()`, making the parameter unnecessary.
  - Also removed dead code (lines 492-493) and updated the docstring.
- Optimize `output_bias` computation in `forward()`…
[已合并 PR]
-
#30495 [Async][Feat] support apply penalty or bad_words for async + spec — ready,v1 — by izhuhaoran (合并于: 2026-01-09 10:31 (UTC+8)) [💬16 | +70/-34, 4 files | commented:8 approved:1] ## Purpose
A follow-up to PR #30122
This PR enables support for sampling penalties (frequency, presence, repetition) and `bad_words` when using async scheduling combined with speculative decoding.
Key Changes:
`GPUModelRunner`: Implemented a mechanism to asynchronously copy draft token IDs from GPU to CPU (`_copy_draft_token_ids`, `_get_draft_token_ids_cpu`). `InputBatch`: Added `update_async_spec_token_ids` to populate `sampling_metadata.spec_token_ids` with the retrieve…
-
#31992 [Bugfix] Fix typo in FusedMoE LoRA reshape comment — 无标签 — by xyang16 (合并于: 2026-01-09 10:46 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR is a minor fix of a typo in the `lora_b` reshape comment: the rank and num_experts positions need to be switched.
## Test Plan
## Test Result
... -
#31987 [Frontend] Improve error message — frontend,ready — by DarkLight1337 (合并于: 2026-01-09 04:07 (UTC+8)) [+20/-4, 3 files | commented:1 approved:1]
## Purpose
Improve the error message that is returned to the client
## Test Plan
## Test Result
…
-
#31761 [Frontend] Add MCP tool streaming support to Responses API — frontend,ready,gpt-oss — by daniel-salib (合并于: 2026-01-09 09:19 (UTC+8)) [💬9 | +1387/-629, 3 files | commented:2 approved:2] ## Purpose This change enables streaming support for MCP tools when using GPT OSS. It extends the harmony utilities and response serving infrastructure to handle tool streaming, allowing tool calls and their results to be incrementally streamed back to clients rather than returned as a single batch.
This PR is taken over from #30301
## Test Plan
``` VLLM_ENABLE_RESPONSES_API_STORE=1 VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,conta…
- #31977 [Bugfix] Fix Typo from NVFP4 Refactor — ready,nvidia — by robertgshaw2-redhat (合并于: 2026-01-09 08:18 (UTC+8)) [+30/-3, 6 files | commented:2 approved:2]
## Purpose
- broke cutedsl with typo
- this fixes it and adds testing in the ci/cd
## Test Plan
- ci, running cutedsl
## Test Result
…
- #31193 [Feature] Add iteration level logging and enhance nvtx marker — ready,v1,nvidia — by maxyanghu (合并于: 2026-01-09 08:13 (UTC+8)) [💬5 | +138/-10, 6 files | commented:10]
## Purpose
This PR adds iteration-level logging for each scheduled iteration. It computes iteration details like the number of context/generation requests and the number of context/generation tokens. The definition of a context request is any request that is still being processed in the prefill phase and for which no output token has been generated. It also logs the elapsed time of iterations and keeps an index of the current iteration. As the index of the iteration is recorded per `EngineCore`, it is kept … -
#31982 [BugFix] Add spec-decode-incompatible request param validation — ready,v1 — by njhill (合并于: 2026-01-09 08:08 (UTC+8)) [+12/-3, 2 files | commented:2 approved:1] Some sampling parameters are not yet supported with spec decoding. We log a warning during startup but then effectively silently ignore the parameters if included in requests at runtime.
This adds request-scoped validation so that such requests fail rather than being silently ignored.
-
#31688 [Quantization] Deprecate Long Tail of Schemes — ready — by robertgshaw2-redhat (合并于: 2026-01-09 08:07 (UTC+8)) [💬1 | +61/-5, 8 files | commented:4 approved:1]
### SUMMARY
- start deprecation process for long tail of quantization schemes
- we will have 1 release with the ability to enable the deprecated stuff, then remove completely
Essential Elements of an Effective PR Description Checklist
... - #31752 [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection — rocm,ready — by zyongye (合并于: 2026-01-09 08:01 (UTC+8)) [💬4 | +148/-4, 3 files | commented:3 approved:2]
Refactor the WNA16 Triton kernel into the modular kernel format.
Previously, we factored out WNA16 from `TritonExperts`, which caused the WNA16 LoRA module to fail to select the correct kernel (see issue). This PR fixes the problem. - #31747 [Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future — ready,v1,multi-modality,cpu,kv-connector,nvidia,ready-run-all-tests — by LucasWilkinson (合并于: 2026-01-09 07:20 (UTC+8)) [💬7 | +380/-240, 48 files | commented:7 approved:3] https://github.com/vllm-project/vllm/pull/30531 and https://github.com/vllm-project/vllm/pull/29575 introduced accesses to `get_current_vllm_config` on boot that are outside of `set_current_vllm_config` contexts, leading to repeated logs ``` (EngineCore_DP0 pid=1647618) WARNING 01-05 16:41:49 [vllm.py:1447] Current vLLM config is not set. (EngineCore_DP0 pid=1647618) INFO 01-05 16:41:49 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048. (EngineCore_DP0 pid=1647618) I… - #30881 [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs — ready — by dsikka (合并于: 2026-01-09 06:45 (UTC+8)) [💬2 | +42/-42, 2 files | commented:8 approved:1]
## Purpose
- Clean-up the conditions used to select the Nvfp4 schemes
- Enable running Nvfp4A16 MoEs through the marlin pathway so that we can support NVFP4A16 MoEs
## Test Plan
- Smoke testing Qwen3 MoE NVFP4 and NVFP4A16 models
- Validate on SM100 and SM90
## Test Result
- Both models generate coherent outputs on both platforms …
-
#30519 [Misc][Refactor] Add FusedMoERouter object — ready — by bnellnm (合并于: 2026-01-09 04:52 (UTC+8)) [💬12 | +165/-36, 20 files | commented:4 approved:1] ## Purpose
Add an abstract `FusedMoERouter` class that provides `select_experts`. There's a concrete subclass that wraps `FusedMoE._select_experts` and is passed to all the quantization methods instead of calling `FusedMoE._select_experts` directly. This is a step toward teasing out the routing logic from `FusedMoE.apply`. See https://github.com/vllm-project/vllm/issues/28408
## Test Plan Existing CI
## Test Result …
-
#31627 [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders — documentation,ready — by Lucaskabela (合并于: 2026-01-09 03:33 (UTC+8)) [💬5 | +111/-0, 1 files | commented:1 approved:2] ## Purpose See title - this PR adds documentation describing torch.compile for multimodal encoders, including how to use and apply to new modules
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... - #31978 Revert “feat(moe): Add is_act_and_mul=False support for Triton MoE kernels” — rocm,ready — by mgoin (合并于: 2026-01-09 03:31 (UTC+8)) [+9/-191, 7 files | commented:1 approved:1] Reverts vllm-project/vllm#31645, my reasoning is here https://github.com/vllm-project/vllm/pull/31645#issuecomment-3724590758
-
#31965 [Model Runner V2] Simplify BlockTables with UVA — v1 — by WoosukKwon (合并于: 2026-01-09 02:24 (UTC+8)) [+52/-154, 2 files] This PR simplifies the `append_block_ids` method in `BlockTables` by using a UVA buffer to store the block tables. Previously, the block table was stored on the GPU, and we sent the "diff" to the GPU every step. While efficient, this approach complicated the code quite a bit, since the "diff"s had to be packed into contiguous GPU tensors and async-copied.
Using the UVA tensor eliminates the need for such packing and async copies, at the cost of transferring the block tables from CPU to GPU every step.
- #31728 [CI][ROCm] Fix NIXL tests on ROCm — rocm,ready,ci/build,kv-connector — by NickLucche (合并于: 2026-01-09 01:34 (UTC+8)) [💬4 | +20/-6, 3 files | commented:1 approved:1] Follow up to https://github.com/vllm-project/vllm/pull/31491 for AMD mirror pipeline + timeout update from latest nightly run.
- #31944 [BugFix] Fix spec decoding edge case bugs — ready,v1 — by njhill (合并于: 2026-01-08 15:31 (UTC+8)) [+49/-46, 3 files | commented:2 approved:1]
There are a couple of edge case issues:
- Since https://github.com/vllm-project/vllm/pull/29821, it's again possible to hit the issue described in https://github.com/vllm-project/vllm/pull/30916. Our test does cover it, but it's not guaranteed to trigger, so it was missed and is now manifesting as a flake. The fix here is to clear the request's `spec_token_ids` in the scheduler when it is preempted.
- There's also an issue related to a case where the request can be excluded from the batch temporarily…
-
#31702 Fix ijson build for Power. — ready,ci/build — by npanpaliya (合并于: 2026-01-09 01:12 (UTC+8)) [💬2 | +5/-5, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... -
#31979 [Doc] Improve MM models LoRA notes — documentation,ready — by jeejeelee (合并于: 2026-01-09 00:55 (UTC+8)) [💬1 | +1/-23, 1 files | approved:1 commented:1]
## Purpose
## Test Plan
## Test Result
... - #31591 [Misc] Tidy up some spec decode logic in GPUModelRunner — ready,v1 — by njhill (合并于: 2026-01-09 01:10 (UTC+8)) [💬3 | +51/-50, 3 files | commented:4 approved:1]
Simplify messy top-level logic in `GPUModelRunner.sample_tokens`, avoid computing `effective_drafter_max_model_len` every step, and only execute this spec-decoding-specific logic when spec decoding is actually enabled.
#31928 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-01-08 12:35 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:1] Fixes intermittent test failures in `test_attention_backends.py` caused by uninitialized memory in the KV cache. ## Problem
The `create_and_prepopulate_kv_cache` function uses `torch.empty()` to allocate the KV cache, which leaves memory uninitialized. Only the blocks actually used by test sequences are populated with valid data. This manifests as intermittent failures, particularly on ROCm, where uninitialized GPU memory is less likely to contain “safe” values compared to CUDA.
## Fix
…
- #31833 [ROCm][CI] v1 cpu offloading attention backend fix — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-01-08 14:37 (UTC+8)) [💬6 | +4/-2, 1 files | commented:3 approved:1]
This PR fixes a regression caused by https://github.com/vllm-project/vllm/pull/30687 on ROCm. The underlying cause is that `FLASH_ATTN` is not supported on ROCm.
#31960 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 — ready — by danisereb (合并于: 2026-01-09 00:08 (UTC+8)) [+4/-3, 1 files | commented:6 approved:1] ## Purpose Fix bug: https://github.com/vllm-project/vllm/issues/31957
The fix in file `fp8_utils.py` is required to fix this error (raised from `flashinfer_cutlass_fused_moe`): `RuntimeError: Check failed: fc1_dequant.ndim() == 1 (0 vs. 1) : fc1_dequant must be a 1D tensor` ## Test Plan …
-
#31969 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming — ready — by chaunceyjiang (合并于: 2026-01-08 23:28 (UTC+8)) [+6/-0, 1 files | commented:1 approved:1] ## Purpose Follow up https://github.com/vllm-project/vllm/issues/30056
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31758 [Model] Add LFM2-VL model support — documentation,new-model,ready,v1 — by tianshu-Michael-yu (合并于: 2026-01-08 21:00 (UTC+8)) [💬4 | +1266/-1, 6 files | commented:10] ## Purpose
Add support for the LFM2-VL (Liquid Foundation Model 2 Vision-Language) model family from LiquidAI.
Changes:
- Add `Lfm2VLForConditionalGeneration` model implementation with full multimodal processing pipeline
- Add SigLIP2 vision encoder (`siglip2.py`) used by LFM2-VL
- Register model in the multimodal model registry
- Fix `max_seqlen` type hint in `MMEncoderAttention` to accept `int | torch.Tensor`
- Add double-buffering for `is_mm_embed` in GPU model runner to avoid race cond…
-
#31575 [Model] Support IQuestCoder model — documentation,new-model,ready — by yxing-bj (合并于: 2026-01-08 22:42 (UTC+8)) [💬15 | +605/-0, 4 files | commented:10] ## Purpose IQuest-Coder-V1 is a new family of code large language models (LLMs) designed to advance autonomous software engineering and code intelligence. We built a repo about IQuestCoder.
We have uploaded these models to Hugging Face, including IQuestCoder and IQuestLoopCoder. To make them easier for everyone to use, we support these models on the vLLM platform.
## Test Plan
Firstly, we start to launch a vLLM …
-
#31971 [Docs]: update claude code url — documentation,ready — by chaunceyjiang (合并于: 2026-01-08 22:04 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Follow up #31188 ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31967 [CI] [Bugfix] Fix unbounded variable in `run-multi-node-test.sh` — rocm,ready,ci/build — by tjtanaa (合并于: 2026-01-08 21:42 (UTC+8)) [+2/-1, 2 files | commented:1 approved:1] ## Purpose Fix the issue `./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` after PR https://github.com/vllm-project/vllm/pull/31922. https://buildkite.com/vllm/ci/builds/46115/steps/canvas?sid=019b9d2c-bfc3-4c47-b214-3bf7def2f86a
Add `.buildkite/scripts/run-multi-node-test.sh` into `test-pipeline.yml` to trigger during CI. ## Test Plan
…
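The `ROCM_HOME: unbound variable` failure above is the classic interaction of bash's `set -u` with an optional environment variable. A minimal sketch of the standard fix, the `${VAR:-default}` expansion, follows; the function name is illustrative, not the actual CI script:

```shell
# set -u aborts on any reference to an unset variable; this is what
# produced "ROCM_HOME: unbound variable" in the CI script.
set -u

# Fix: ${VAR:-default} expands safely even when VAR was never exported
# (e.g. on a CUDA-only node). Hypothetical helper for illustration.
detect_platform() {
  if [ -n "${ROCM_HOME:-}" ]; then
    echo rocm
  else
    echo cuda
  fi
}

detect_platform   # prints "cuda" unless ROCM_HOME is exported
```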
-
#31610 [OpenAI] Fix tool_choice=required streaming when output has trailing extra data — frontend,ready,tool-calling — by maylikenoother (合并于: 2026-01-08 21:01 (UTC+8)) [💬1 | +41/-2, 2 files | commented:1 approved:1] ### Problem: When using `tool_choice="required"`, `OpenAIServingChat.extract_tool_call_required_streaming` parses the partial stream as JSON. If the model emits a valid JSON tool-call array and then continues with trailing tokens (e.g. "\nDONE"), parsing can hit a JSON "Extra data" case and stop/behave incorrectly.
### Fix:
- Parse required-tool streaming output using partial_json_loads(…, Allow.ALL) so we can safely parse the first JSON value while tolerating trailing extra data.
- Continue to …
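The tolerant-parsing idea behind this fix can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which returns the first complete JSON value plus an end offset instead of raising the "Extra data" error that plain `json.loads` hits on trailing tokens. vLLM's actual helper is `partial_json_loads`; this stdlib version is only an analogue:

```python
import json

def parse_first_json(buffer: str):
    """Parse the leading JSON value from a stream buffer, tolerating
    trailing extra data such as a model emitting DONE afterwards."""
    decoder = json.JSONDecoder()
    stripped = buffer.lstrip()
    # raw_decode returns (value, end_index) rather than raising
    # "Extra data" the way json.loads does on trailing tokens.
    value, end = decoder.raw_decode(stripped)
    return value, stripped[end:]

calls, trailing = parse_first_json('[{"name": "get_weather", "arguments": {}}]\nDONE')
```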
-
#31724 [Model] Enable LoRA support for Pixtral — documentation,ready — by A1c0r-Z (合并于: 2026-01-08 21:00 (UTC+8)) [💬4 | +30/-3, 2 files | commented:1 approved:1] ## Purpose Enable LoRA adapters on the vision tower and connector components of Pixtral models (Part of #31479).
## Technical Details
- Implement `get_mm_mapping()`, `get_num_mm_encoder_tokens()`, and `get_num_mm_connector_tokens()`.
- Add `SupportsLoRA` to `PixtralForConditionalGeneration` inheritance.
- Patch Merge Logic: Account for the $S \times S$ (typically $2 \times 2$) patch merging in Pixtral. `get_num_mm_encoder_tokens` scales up the LLM budget by $S^2$ to match the Vision …
-
#31847 [Model] Add Grok-2 — documentation,new-model,rocm,frontend,ready,v1,nvidia — by dangoldbj (合并于: 2026-01-08 20:59 (UTC+8)) [💬12 | +777/-20, 8 files | commented:10] ## Purpose Implements https://github.com/vllm-project/vllm/issues/23557
- Add the Grok2 tokenizer that understands `.tok.json`, ties into the tokenizer registry, and supports Grok-2 chat templates.
- Grok1/Grok2 model machinery (MoE routing, residual path, tokenizer docs, and test coverage) so Grok-2 configs load correctly.
- Document the new support entry in `docs/models/supported_models.md`.
- Preserve Grok-1 behavior while adding a Grok-2 renormalization override and documenting tokenizer r…
-
#31652 [Model] Enable LoRA support for tower and connector in GLM4-V — ready — by Zyyeric (合并于: 2026-01-08 15:45 (UTC+8)) [+14/-0, 1 files | commented:1 approved:1]
## Purpose Issue: #31479 Add GLM4-V token-count helpers so LoRA can size tower/connector inputs correctly across the spatial merge.
## Technical Detail GLM4-V’s vision tower produces patch-level tokens that are spatially merged by spatial_merge_size before visual.merger runs.
`get_num_mm_encoder_tokens` expands LM-side merged tokens back to pre-merge patch count (× merge_size²), while `get_num_mm_connector_tokens` converts tower (pre-merge) tokens to the connector…
-
#31931 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling — rocm,ready — by AndreasKaratzas (合并于: 2026-01-08 12:17 (UTC+8)) [💬9 | +8/-4, 1 files | commented:1 approved:1] Fixes MoE accuracy regression on ROCm introduced in #31676.
## Problem
PR #31676 reordered post-accumulation operations to add bias after dequantization. However, this also moved `MUL_ROUTED_WEIGHT` after the `.to(compute_type)` precision conversion, causing router weight multiplication to occur in bf16/fp16 instead of float32. This precision loss causes different outputs on ROCm due to differences in mixed-precision handling between ROCm and CUDA Triton backends. The issue manifests as non-d…
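Why operation order matters here can be shown with a toy example. Emulating low precision via IEEE half floats from the stdlib (fp16 standing in for bf16; the values and helper are illustrative, not the kernel's code), multiplying after the downcast gives a measurably worse result than multiplying first and downcasting once:

```python
import struct

def fp16(x: float) -> float:
    """Round-trip through IEEE half precision to emulate low-precision math."""
    return struct.unpack('e', struct.pack('e', x))[0]

routed_weight = 0.1   # router weight (toy value)
expert_out = 3.0      # dequantized expert output (toy value)

# Correct order: multiply in high precision, downcast once at the end.
good = fp16(routed_weight * expert_out)

# Regressed order: downcast first, then multiply in low precision.
bad = fp16(fp16(routed_weight) * fp16(expert_out))
```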
-
#30190 [grpc] Support gRPC server entrypoint — frontend,ready,ci/build — by CatherineSue (合并于: 2026-01-08 15:24 (UTC+8)) [💬11 | +1357/-6, 12 files | commented:10]
## Purpose
Add gRPC server support to vLLM, enabling the community to integrate vLLM via gRPC protocol for any upstream application or routing layer.
Key Benefits:
- Native gRPC Protocol Support
- Enables upstream applications to connect via gRPC/Protobuf instead of HTTP/JSON …
-
#31388 [Voxtral] Fix speech transcription api — frontend,ready — by patrickvonplaten (合并于: 2026-01-08 18:34 (UTC+8)) [💬4 | +114/-27, 5 files | commented:5 approved:1] Make sure the speech transcription API works.
Allows standard `vllm serve` + the speech-to-text transcription API.
-
#31950 [MM Encoder]: Make MMEncoderAttention's `scale` take effect properly — ready,qwen — by Isotr0py (合并于: 2026-01-08 18:33 (UTC+8)) [+34/-8, 13 files | approved:1 commented:1] ## Purpose
- `scale` in `MMEncoderAttention` doesn't actually take effect, because it was not passed to the ViT wrapper ops before.
- This PR fixes that, and also standardizes Qwen-VL-style MMEncoderAttention usage to pass `scale` even when it equals `head_dim**-0.5`.
## Test Plan
## Test Result
…
-
#31947 [Model] Standardize common vision encoders — ready,deepseek — by DarkLight1337 (合并于: 2026-01-08 18:33 (UTC+8)) [💬1 | +254/-174, 19 files | commented:3 approved:1]
## Purpose
- Accept `multimodal_config` argument to the common vision encoders used by `init_vision_tower_for_llava` (CLIP, SigLIP, PixtralHF), and forward it to `MMEncoderAttention`. This also enables handling of `mm_encoder_tp_mode`.
- Update `SiglipVisionEmbeddings` according to current Transformers code.
- Merge `Phi3ImageEmbeddingBase` into `Phi3HDImageEmbedding` since it's the only subclass.
- Fix some instances where `self.scale` failed to be passed to `M…
-
#31951 [Chore] Further cleanup pooler — ready,v1 — by DarkLight1337 (合并于: 2026-01-08 18:16 (UTC+8)) [+47/-62, 7 files | commented:1 approved:1] ## Purpose
- Rename `Tokens*` -> `Tokenwise*` to be clearer
- Remove redundant `PoolingType` and replace it with string literals.
- Avoid redundant operations in `AllPool.forward`.
## Test Plan
## Test Result
…
-
#30803 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation — ready — by omer-dayan (合并于: 2026-01-08 18:00 (UTC+8)) [💬4 | +15/-4, 4 files | commented:2 approved:2] Continuing the discussion from this PR: https://github.com/vllm-project/vllm/pull/30154
We suggest here a solution that works for both RayLLM + vllm serve.
The root cause of the bug is that the object store path is cloned to a local dir, and `self.model` is changed to that local dir while `self.model_weights` keeps the object store path. But if you run `create_engine_config` once and then re-run it, `self.model` now equals the local dir and `self.model_weights` is not preserved. The solutio…
-
#31890 [Models] Allow converting Qwen3-VL into Reranker model — documentation,frontend,ready,qwen — by Isotr0py (合并于: 2026-01-08 16:10 (UTC+8)) [💬1 | +287/-13, 8 files | commented:3 approved:3]
## Purpose
- Enable reranker support for Qwen3-VL
## Test Plan
python examples/pooling/score/vision_language_reranker.py -m qwen3_vl_reranker…
- #31719 [Misc] Support qwen3-next lora — ready,qwen — by BJWang-ant (合并于: 2026-01-08 17:27 (UTC+8)) [💬1 | +7/-1, 1 files | commented:3 approved:1] Replace the torch.nn.Linear of shared_expert_gate in the qwen3-next model with ReplicatedLinear, so that lora can correctly identify shared_expert_gate.
-
#31536 fix(compile): apply partition wrapper when loading AOT cached functions — ready — by devbyteai (合并于: 2026-01-08 17:27 (UTC+8)) [💬4 | +103/-3, 2 files | commented:4 approved:2] ## Summary Fixes #31439 where VLLM_USE_AOT_COMPILE=1 causes 2x latency regression when use_inductor_graph_partition=True and cudagraph_mode=PIECEWISE.
## Root Cause The bug is in vllm/compilation/decorators.py. When loading AOT-compiled functions from cache, the code bypasses maybe_use_cudagraph_partition_wrapper at two locations: 1. Line 374-376: Early return when aot_compiled_fn already set (subsequent calls) 2. Line 431-438: Early return after loading from AOT cache (first call a... -
#31834 [CI/Build] Enable test_kv_cache_events_dp for AMD — rocm,ready,v1 — by rjrock (合并于: 2026-01-08 17:00 (UTC+8)) [💬1 | +5/-2, 1 files | commented:1 approved:1] ## Purpose To have the command
`pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp` pass on AMD CI.
## Test Plan
`pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp`
## Test Result Before: ```text ERROR: found no collectors for vllm/tests/v1/engine/test_engine_core_client.py::test_kv_cache_events_dp …
-
#31635 Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. — ready,v1,kv-connector — by Lumosis (合并于: 2026-01-08 17:00 (UTC+8)) [💬4 | +75/-20, 6 files | commented:10] ## Purpose
RFC: https://github.com/vllm-project/vllm/issues/31634
This change refactors AttentionSpec in vLLM V1 to allow explicit setting of page_size_bytes. This is required for backends like TPU that use Ragged Paged Attention (RPA), where physical memory padding is necessary for alignment. The current reliance on num_gpu_blocks_override is not only a design “hack” but also breaks multi-host inference using the Ray executor.
-
#31905 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch — rocm,ready — by ZhiweiYan-96 (合并于: 2026-01-08 15:47 (UTC+8)) [💬2 | +6/-0, 1 files | commented:2 approved:1] ## Purpose
Skip UT `test_torchao.py::test_pre_quantized_model` on CDNA3 arch. On CDNA3, only fp8_e4m3_fnuz is supported, but the tested checkpoints use fp8_e4m3.
## Test Plan
`pytest -sv tests/quantization/test_torchao.py::test_pre_quantized_model` …
-
#31775 [docker] A follow-up patch to fix #30913:
`[docker] install cuda13 version of lmcache and nixl` — ready,ci/build,kv-connector,nvidia — by wangshangsam (合并于: 2026-01-08 15:47 (UTC+8)) [💬1 | +2/-0, 1 files | commented:3 approved:1] This PR fixes an issue in #30913: `lmcache` needs `TORCH_CUDA_ARCH_LIST`, which was removed in #30913.
## Purpose
Currently, when you try to build a cuda 13 image on `main` (I was doing it on an arm64 machine, but I'd presume the same error would happen for x86 too), you would encounter this error: https://github.com/vllm-project/vllm/pull/30913#discussion_r2663626084 . This PR fixes it so that the build succeeds.
## Test Plan
On a GH200 node (or any sort of arm64 machines): ```bash …
-
#31712 fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN — documentation,rocm,ready,v1 — by rabi (合并于: 2026-01-08 15:46 (UTC+8)) [💬6 | +8/-0, 1 files | commented:1 approved:1] ## Purpose
ROCM_ATTN kernel only supports block sizes 16 and 32. Previously, the backend used the default which claims any block size is supported, causing select_common_block_size() to return large framework block sizes (e.g., 256 for Mamba alignment) that the kernel cannot handle.
This fix enables the kernel block size mechanism to correctly select 32 as the kernel block size for hybrid models like Nemotron-H.
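The selection mechanism can be sketched as intersecting framework-proposed block sizes with the kernel's supported set. This is a simplified stand-in for `select_common_block_size()`; the fallback policy here is illustrative:

```python
def select_common_block_size(framework_candidates, kernel_supported):
    """Pick the largest framework-proposed block size the kernel supports.
    Raises if no candidate is compatible."""
    for size in sorted(framework_candidates, reverse=True):
        if size in kernel_supported:
            return size
    raise ValueError("no compatible block size")

# ROCM_ATTN only supports block sizes 16 and 32 (per the PR); a hybrid
# model proposing 256 (Mamba alignment) must fall back to 32 rather than
# handing the kernel a block size it cannot execute.
chosen = select_common_block_size([256, 32, 16], {16, 32})
```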
-
#31745 [Bugfix] Remove the num_hidden_layers override for glm4_moe — ready — by andyl98 (合并于: 2026-01-08 15:45 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] ## Purpose Fixes an issue where we override
`num_hidden_layers: 0` for GLM4-MOE MTP models.
The Issue:
- The override set `num_hidden_layers` to 0 for `Glm4MoeMTPModel`, whereas the actual checkpoint weights for GLM4-MOE MTP are indexed after the base model's last layer (e.g., `model.layers.92...` for a 92-layer model).
-
#31927 [Fix] Enable mm_processor_cache with vision LoRA — ready,v1,multi-modality — by prashanth058 (合并于: 2026-01-08 15:31 (UTC+8)) [+52/-16, 5 files | commented:2 approved:1] ## Purpose Enables multi-modal processor cache when
`enable_tower_connector_lora` is active. #26674 prefixes `identifier` with the LoRA hash to avoid incorrect encoder cache hits (https://github.com/vllm-project/vllm/pull/26674#discussion_r2636937770), but the processor cache should be shared across LoRAs.
Solution:
- Add `mm_hash` field to `MultiModalFeatureSpec` to store the base hash (without LoRA prefix)
- Processor cache uses `mm_hash` for cache lookups (shared across LoRAs)
- Encoder cac…
-
#30460 [chore] Update FA commit — ready,ci/build — by LucasWilkinson (合并于: 2026-01-08 15:24 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Update vLLM-FA commit to include:
https://github.com/vllm-project/flash-attention/pull/110 https://github.com/vllm-project/flash-attention/pull/112
-
#31942 [platform] add dp_metadata arg to set_additional_forward_context — ready — by Ronald1995 (合并于: 2026-01-08 14:56 (UTC+8)) [💬1 | +1/-0, 1 files | commented:3 approved:3]
## Purpose #31674 added `set_additional_forward_context` in `set_forward_context`, so other platforms can add extra fields to `forward_context`. However, `set_additional_forward_context` lacks the `dp_metadata` argument; other platforms then fall back to `num_tokens_across_dp`, and if there is no `dp_metadata` they call `coordinate_batch_across_dp` again, which incurs extra communication.
## Test Plan
## Test Result
…
- #31825 [Model] Enable LoRA support for tower and connector in DotsOCR — documentation,ready — by ShaanveerS (合并于: 2026-01-08 14:50 (UTC+8)) [💬4 | +9/-1, 2 files | commented:1 approved:1]
## Purpose
Enable dynamic LoRA adapters on the multimodal vision tower and connector for
`DotsOCRForCausalLM` by implementing `get_num_mm_encoder_tokens()` and `get_num_mm_connector_tokens()`.
## Technical Details DotsOCR's vision path uses spatial merging (`spatial_merge_size`) via `vision_tower.merger`, which changes the effective token count between:
- LM-side "image tokens" (post-merge)
- vision tower tokens before the merger (pre-merge) …
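The pre-merge/post-merge conversion these helpers perform (here for DotsOCR, and analogously for GLM4-V and Pixtral) reduces to scaling by `merge_size**2`. A toy sketch with hypothetical function names:

```python
def encoder_tokens_sketch(num_lm_image_tokens: int, merge_size: int) -> int:
    """LM-side (post-merge) image tokens -> vision tower (pre-merge) patch count."""
    return num_lm_image_tokens * merge_size ** 2

def connector_tokens_sketch(num_tower_tokens: int, merge_size: int) -> int:
    """Vision tower (pre-merge) patch tokens -> connector (post-merge) tokens."""
    return num_tower_tokens // merge_size ** 2

# With spatial_merge_size = 2, one LM-side image token corresponds to
# a 2x2 block of 4 pre-merge patches.
patches = encoder_tokens_sketch(64, 2)
```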
-
#31188 [Doc] Add Claude code usage example — documentation,ready — by mgoin (合并于: 2026-01-08 13:50 (UTC+8)) [💬2 | +75/-0, 2 files | commented:4 approved:1]
## Purpose
A simple example of how to use gpt-oss-120b with Claude Code
## Test Plan
## Test Result
…
-
#31873 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 — rocm,ready,ci/build — by rasmith (合并于: 2026-01-08 13:06 (UTC+8)) [💬2 | +1/-2, 1 files | commented:4 approved:1] There are some tests in tests/samplers/test_beam_search.py that are marked @pytest.mark.skip_v1 and are meant to be skipped in v1 due to unreliable behavior from the beam search. However, they are not skipped, so I added a `-m "not skip_v1"` flag which causes the tests to actually be skipped.
These tests do not seem to run in the main CI either, but it might be a configuration in the main CI. They do run right now on AMD CI and fail intermittently.
We can try to find a permanent fix in the fut…
- #31922 [ROCm][CI] Add rocm support for run-multi-node-test.sh — rocm,ready,ci/build — by charlifu (合并于: 2026-01-08 12:36 (UTC+8)) [💬1 | +21/-3, 1 files | commented:1 approved:1]
Currently run-multi-node-test.sh only works for CUDA; this PR adds ROCm support by:
- checking whether we are on ROCm or not
- using different flags for ROCm to control the GPU devices used inside the CI image.
-
#31915 [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 — ready,v1 — by zou3519 (合并于: 2026-01-08 12:04 (UTC+8)) [+3/-1, 1 files | approved:2 commented:1] ## Purpose
Fix flakiness in this test for PyTorch 2.10. The test can fail in PyTorch 2.9 too, see https://github.com/vllm-project/vllm/issues/31913 for explanation.
## Test Plan
Ran `TP_SIZE=2 DP_SIZE=2 pytest tests/v1/distributed/test_eagle_dp.py::test_run_eagle_dp` on 2x L4 and verified it passed.
## Test Result
…
-
#31692 [MoE Refactor][16/N] Apply Refactor to NVFP4 — documentation,performance,ready,llama,nvidia — by robertgshaw2-redhat (合并于: 2026-01-08 11:46 (UTC+8)) [💬5 | +780/-684, 15 files | commented:5 approved:3] ## Purpose
Apply refactor to nvfp4 integrations, key steps:
- support nvfp4 in `MarlinExperts` for NVFP4
- use mks for all kernels except for `trtllm` kernels
- create oracle for centralized kernel selection
- factor out process_weights_after_loading for sharing between ct and modelopt
- create kernel in process_weights_after_loading + call modular kernel in apply
## Test Plan …
-
#31932 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency — ready,multi-modality,qwen — by AndreasKaratzas (合并于: 2026-01-08 10:58 (UTC+8)) [💬1 | +5/-0, 1 files | commented:1 approved:1] The Qwen-VL and Qwen-VL-Chat models are being skipped in
`test_processing_correctness` due to intermittent CI failures.
The tokenizer for these models attempts to download a font file from Alibaba's China servers (`qianwen-res.oss-cn-beijing.aliyuncs.com`), which sometimes times out or refuses connections in CI environments: `urllib3.exceptions.ConnectTimeoutError: Connection to qianwen-res.oss-cn-beijing.aliyuncs.com timed out`
Discussed with @ywang96 who confirmed these models can be …
[Closed unmerged PRs]
-
#19983 [Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper — performance,needs-rebase,ci/build,stale — by ElizaWszola (关闭于: 2026-01-09 10:27 (UTC+8)) [💬22 | +3386/-400, 30 files | commented:10] Add support for blocked fp8 CUTLASS MoE for SM90.
# Testing: Single grouped multiply unit tests:
`pytest tests/kernels/quantization/test_cutlass_scaled_mm.py -k test_cutlass_fp8_group_gemm`
Fused experts op unit tests: ``` pytest tests/kernels/moe/test_cutlass_moe.py -k test_blocked_cutlass_moe_8 …
-
#21930 Fix #21840. Fix tool_call parsing edge cases — frontend,stale,tool-calling — by okamiRvS (关闭于: 2026-01-09 10:27 (UTC+8)) [💬5 | +4/-3, 1 files | commented:3]
## Purpose
…
-
#22003 [Bugfix]: Fix Possible Output Corruption in Cascade Attention Caused by Non-Contiguous LSE Tensor — stale,v1 — by griii (关闭于: 2026-01-09 10:27 (UTC+8)) [💬8 | +3/-0, 1 files | commented:1] ## Purpose Fix an output corruption issue when using Cascade Attention. The
`flash_attn_varlen_func` operator with flash-attn2 may return a non-contiguous LSE tensor (especially `suffix_lse`) in some cases. Passing a non-contiguous LSE tensor to `merge_attn_states` can cause incorrect outputs. This PR fixes the issue by making sure the LSE tensor is contiguous before further processing.
## Test Plan This issue can be consistently reproduced by serving the Qwen2.5-32B-Instruct model (or any large …
-
#23799 valley-eagle-7b (not finished yet) — new-model,needs-rebase,stale — by oswardlx (关闭于: 2026-01-09 10:26 (UTC+8)) [💬18 | +3504/-255, 3 files | commented:1]
## Purpose
This PR adds vLLM support for the Valley-Eagle-7B multi-modal large language model. Valley is a multi-vision tower model that combines SigLIP and Qwen2VL vision encoders with Qwen2 architecture for enhanced visual understanding capabilities.
Key Features Added:
- Complete vLLM adaptation for Valley-Eagle-7B model
- Support for dual vision towers (SigLIP + Qwen2VL)
- Multi-modal input processing with image and text integration …
-
#23949 fit the qwen3 moe’s awq quantization for 2080Ti. — stale,qwen — by Readon (关闭于: 2026-01-09 10:26 (UTC+8)) [💬9 | +36/-8, 2 files | commented:1]
## Purpose
Run Qwen3 A30B and Coder-A30B for awq quantization on 2080Ti.
## Test Plan
run by: `serve billy800/Qwen3-30B-A3B-Instruct-2507-AWQ --max-model-len 8192 --tensor-parallel-size=4 --gpu-memory-utilization=0.6 --quantization=awq --host 0.0.0.0 --port 40030 --served-model-name qwen3-coder-30b` …
-
#23968 [rocm] update pytorch rocm from 6.3 to 6.4 — rocm,ci/build,stale — by draftbk (关闭于: 2026-01-09 10:26 (UTC+8)) [💬7 | +1/-1, 1 files | commented:7 approved:1] ## Purpose Update pytorch rocm from 6.3 to 6.4
Initial purpose for myself: trying to test TorchAO FP8 on MI300X, and from the doc, it is tested with ROCm 6.4.
## Test Plan & Test Result ``` # MI300X lm_eval
--model vllm
… -
#24224 [bugfix] fix returned chunk too large bug — performance,stale — by qibaoyuan (关闭于: 2026-01-09 10:26 (UTC+8)) [💬3 | +20/-10, 1 files | commented:1] ## Purpose
When the returned chunk is too large, the benchmark fails. This patch solves the problem by using `iter_any` to read the response stream instead.
after applying this patch, everything goes well:
…
-
#24225 [Benchmarks]Accelerate random dataset generation — performance,stale — by wuhang2014 (关闭于: 2026-01-09 10:26 (UTC+8)) [💬5 | +86/-11, 2 files | commented:4] ## Purpose
Fix #24058 using backend_tokenizer’s
`decode_batch` and `encode_batch_fast` methods to run without the GIL as much as possible.
## Test Plan
`--num-prompts` is in range of 1, 10, 100, 1000, 10000, 100000
``` vllm bench serve --base-url http://127.0.0.1:8000 --model /home/models/Qwen3-0.6B --dataset-name random --random-input-len 2000 --random-output-len 1000 --max-concurrency 10000 --num-prompts 100000 --seed $(date +%s) --ignore-eos …
-
#24445 [Do not merge] Test pytest-cov — ready,needs-rebase,ci/build,stale — by rzabarazesh (关闭于: 2026-01-09 10:25 (UTC+8)) [💬3 | +295/-155, 4 files] This is just to test the CI with a simple pytest-cov configuration. This will not be merged.
## Purpose
## Test Plan
## Test Result
…
-
#31236 Construting grid using num of active lora in lora kernels — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by yugong333 (关闭于: 2026-01-09 09:13 (UTC+8)) [💬7 | +24694/-14818, 711 files | commented:4] ## Purpose
Reduce the kernel overhead when the number of active LoRAs is smaller than max LoRAs. Multiple CUDA graphs are captured, one for each number of active LoRAs.
## Test Plan
Using llmperf to benchmark concurrency = 1, 2, 4, 8 when max-loras = 4
## Test Result
…
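The dispatch idea, capture one CUDA graph per active-LoRA count and route each batch to the smallest captured bucket that fits, can be sketched as follows (the function name and bucketing policy are illustrative, not the PR's code):

```python
import bisect

def pick_captured_graph(num_active_loras: int, captured_sizes: list[int]) -> int:
    """Return the smallest captured graph bucket that can serve the batch.
    Hypothetical helper illustrating per-active-LoRA graph dispatch."""
    captured_sizes = sorted(captured_sizes)
    i = bisect.bisect_left(captured_sizes, num_active_loras)
    if i == len(captured_sizes):
        raise ValueError("more active LoRAs than the largest captured graph")
    return captured_sizes[i]

# Graphs captured for 1, 2, and 4 active LoRAs (max-loras = 4, as in the
# benchmark); a request batch with 3 active LoRAs runs on the size-4 graph.
bucket = pick_captured_graph(3, [1, 2, 4])
```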
-
#25427 [Perf] Optimize torch native input group quant — performance,stale — by mgoin (关闭于: 2026-01-09 06:42 (UTC+8)) [💬1 | +16/-12, 1 files | commented:3]
## Purpose
Benchmark command
python benchmarks/kernels/bench_per_token_quant_fp8.py --dtype bfloat16 --group-sizes 128 \ --hidden-sizes 896 1024 2048 4096 7168 \ --batch-sizes 1 16 128 512 1024…
-
#31989 Update VOLKANURALTR.md — documentation — by volkTRhacV (关闭于: 2026-01-09 06:39 (UTC+8)) [💬4 | +4/-4, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31991 [wip] expose online quantization for linear for compressed-tensors path — 无标签 — by vkuzo (关闭于: 2026-01-09 06:31 (UTC+8)) [💬1 | +126/-4, 2 files | commented:2] Summary:
WIP and not ready for review yet
High level, I’d like to work towards enabling online quantization for all the quantization recipes in compressed-tensors. Ideally we work through a POC for a single recipe and get alignment from vllm team, and then we can parallelize further work if needed.
For now, tested e2e on a tiny dense model (facebook/opt-125m) and float8 rowwise scaling.
tl;dr;
- enables `CompressedTensorsConfig.from_config_file`, and provides an example config file for flo…
-
#31934 [CI] Remove torch nightly checks — ci/build — by simon-mo (关闭于: 2026-01-09 01:41 (UTC+8)) [💬8 | +1/-490, 10 files | commented:1 changes:1] I would like to remove the torch nightly tests, as they have been failing for a while (2+ months) and we don't have a good way to track them. They are unmaintained and didn't help us shorten the time to the latest torch release.
Given this, we can both reduce our codebase complexity and testing cost.
In the future, if we do want to bring it back, I think we should create…
-
#12919 [CI/Build][v1] vLLM v1 automatic benchmarking — performance,ready,needs-rebase,ci/build,unstale,kv-connector — by Shaoting-Feng (关闭于: 2026-01-09 01:35 (UTC+8)) [💬9 | +87/-24, 3 files] This PR extends the performance benchmark to include both v0 and v1. The latency, throughput, and fixed-QPS serving tests will first run with v0 and then with v1. The results for v1 will be recorded and processed in the same way as v0. The filenames will have
`_v1` appended as a suffix. The file structure will look like this:
```bash results/ |-- benchmark_results.json |-- benchmark_results.md |-- benchmark_results_v1.json |-- benchmark_results_v1.md …
-
#29733 [Feature][#29390]: Add timeout support to MultiprocExecutor.collective_rpc and FutureWrapper — v1 — by SandishKumarHN (关闭于: 2026-01-09 01:19 (UTC+8)) [💬3 | commented:1] ## Purpose
- Add support for timeouts when calling vllm.v1.executor.multiproc_executor.MultiprocExecutor.collective_rpc and when awaiting futures returned by non-blocking RPCs.
- Treat a provided timeout as an overall deadline across response MessageQueues when collecting responses.
- Add FutureWrapper.result(timeout=…) and propagate the timeout into FutureWrapper.wait_for_response(get_response, timeout=…) so user code can wait on the Future with an overall timeout.
- This change makes debu…
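Treating the timeout as an overall deadline means recomputing the remaining budget before each queue wait, rather than granting every queue the full timeout. A stdlib sketch of that pattern (not the actual `MultiprocExecutor` code):

```python
import queue
import time

def collect_with_deadline(queues, timeout: float):
    """Collect one response from each queue, treating `timeout` as an
    overall deadline across all waits rather than a per-queue budget."""
    deadline = time.monotonic() + timeout
    results = []
    for q in queues:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        # q.get raises queue.Empty if this wait exhausts the remaining budget.
        results.append(q.get(timeout=remaining))
    return results

q1, q2 = queue.Queue(), queue.Queue()
q1.put("worker-0")
q2.put("worker-1")
out = collect_with_deadline([q1, q2], timeout=1.0)
```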
-
#30301 [ResponsesAPI] Add GPTOSS MCP tool streaming — frontend,ready,needs-rebase,gpt-oss — by qandrew (关闭于: 2026-01-09 00:39 (UTC+8)) [💬3 | +1110/-431, 3 files | commented:3] ## Purpose This change enables streaming support for MCP tools when using GPT OSS. It extends the harmony utilities and response serving infrastructure to handle tool streaming, allowing tool calls and their results to be incrementally streamed back to clients rather than returned as a single batch.
taken over from #30192, builds on top of https://github.com/vllm-project/vllm/pull/30054
## Test Plan
``` VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,container,code_interpreter CUDA_VIS…
-
#31955 [BugFix] Clear spec_token_ids for preempted req to prevent grammar conflicts on resumption — v1 — by izhuhaoran (关闭于: 2026-01-09 00:20 (UTC+8)) [💬4 | +9/-0, 1 files | commented:1] ### Purpose Fixes #31876
The unit test
`v1/e2e/test_async_scheduling.py::test_with_spec_decoding` fails intermittently. The failure occurs with the following configuration:
```python test_sampling_params = [ dict(structured_outputs=struct_outputs), ]
…
- #31953 [CI] fix ROCM_HOME: unbound variable — rocm,ready,ci/build — by LucasWilkinson (关闭于: 2026-01-08 23:44 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
https://github.com/vllm-project/vllm/pull/31922 broke the “2 Node Tests (4 GPUs in total)” with
`[2026-01-08T07:32:12Z] ./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable`
https://buildkite.com/organizations/vllm/pipelines/ci/builds/46074/jobs/019b9c75-6389-40dc-a23b-5ecd3683d951/log
-
#31976 [WIP] Add: EAGLE3 Acceptance Length Regression Tests — speculative-decoding,v1 — by rahul-tuli (关闭于: 2026-01-08 23:22 (UTC+8)) [+208/-0, 1 files | commented:1 | 📝草稿] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter |
|----------|---------|
…
-
#31814 [Bugfix] Inject JSON schema descriptions into prompt for structured outputs — frontend — by ricky-chaoju (关闭于: 2026-01-08 21:29 (UTC+8)) [💬2 | +156/-1, 2 files | commented:1] Fixes #31804
## Problem When using `response_format` with `json_schema`, field descriptions (e.g., `"Country in UPPERCASE"`) were ignored because they were only passed to the grammar constraint (xgrammar) but not to the model itself.
## Solution Inject the JSON schema into the prompt as a system message so the model can understand and follow field-level constraints specified in descriptions.
## Changes
- Added `_build_json_schema_prompt()` helper function in `serving_chat.py` ...
-
#31727 Fix versions for vulnerable packages — ci/build — by adobrzyn (关闭于: 2026-01-08 15:44 (UTC+8)) [+2/-2, 2 files | commented:1]
## Purpose Fix packages versions to resolve vulnerabilities found.
- cbor2:
- CVE-2025-68131
- CBORDecoder reuse can leak shareable values across decode calls
- Patched versions 5.8.0 …
-
#29844 [KV Transfer] Add optional int8 quantization for KV cache transfer to P2pNcclEngine — kv-connector — by xbfs (关闭于: 2026-01-08 14:01 (UTC+8)) [💬4 | +43/-2, 1 files | commented:3] ## Summary
This PR adds optional int8 quantization support for KV cache transfer in `P2pNcclEngine` to reduce network bandwidth usage and improve transfer throughput, especially in bandwidth-constrained scenarios.
## Motivation
When transferring KV cache across nodes in distributed inference setups, network bandwidth can become a bottleneck. By quantizing tensors from float16/float32 to int8 during transfer, we can reduce the data size by approximately 50-75%, significantly improving transfer…
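A generic symmetric int8 round-trip illustrates the bandwidth/precision trade-off (a toy sketch, not the PR's transfer path): fp16 payloads shrink by half and fp32 payloads by three quarters, consistent with the ~50-75% reduction cited, at the cost of quantization error bounded by the scale.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Restore approximate floats on the receiving side."""
    return [v * scale for v in q]

# Toy stand-in for a KV cache tile being sent over the wire.
kv = [0.5, -1.25, 3.0, 0.0]
q, scale = quantize_int8(kv)
restored = dequantize_int8(q, scale)
```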
-
#30646 fix: unsatisfiable testing dependencies caused by a version conflict — needs-rebase,ci/build — by leejianwoo-collab (关闭于: 2026-01-08 13:40 (UTC+8)) [💬2 | +2/-2, 1 files | commented:2] # Fix: Resolve dependency conflict with model-hosting-container-standards
## Problem Issue #30595 reported unsatisfiable testing dependencies caused by a version conflict between:
- `model-hosting-container-standards >= 0.1.9` (requires `starlette >= 0.49.1`)
- `fastapi[standard] >= 0.115.0` (some versions require `starlette < 0.48.0`)
This conflict prevented successful dependency resolution during testing and development.
## Root Cause Analysis …