[vLLM GitHub Development Digest] 2026-01-09
[Overview]
- Time window: 2026-01-09 10:47 (UTC+8) ~ 2026-01-10 10:47 (UTC+8)
- New issues: 20 (label distribution: bug:9, feature request:5, RFC:3, help wanted:2, usage:1)
- Closed issues: 21
- New PRs: 43 (label distribution: ready:13, v1:13, rocm:8, documentation:5, speculative-decoding:5)
- Merged PRs: 44
- PRs closed without merging: 10
[New issues]
- #32055 [Feature]: Copy Ops in FP8 MLA — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:11 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/blob/a4ec0c559521c055519eeabddd8279c83eb4e936/vllm/v1/attention/backends/mla/common.py#L2117
These copy ops add ~20us out of ~300us per layer.
### Alternatives
We can address this by unwrapping MLA so that torch.compile will handle it.
…
- #32069 [Usage]: — usage — by xiazi-yu (created: 2026-01-10 09:02 (UTC+8)) Comparison testing between Qwen3-VL-Embedding-2B on vLLM 13.0 and Qwen3VLEmbedder revealed inconsistencies in accuracy.
- #32068 [Bug]: Recompile in LLama model — bug — by Lucaskabela (created: 2026-01-10 08:43 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32067 [Bug]: Cannot Run on B200s — bug — by yzeng58 (created: 2026-01-10 07:55 (UTC+8)) ### 🐛 Describe the bug
I try to run vllm on B200 but it gives me a weird error.
The following is the minimal code to reproduce the error.
```python
# test.py
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
```
…
- #32066 [RFC]: Expose Sharding Metadata from vLLM Model Layers — RFC,rl — by chunzhao (created: 2026-01-10 07:35 (UTC+8)) [💬3] ### Motivation.
Currently, downstream systems that need to understand how vLLM model weights are sharded must infer sharding patterns from tensor names using heuristics:
```python
# Current approach: Pattern matching on tensor names
def infer_sharding(tensor_name: str) -> ShardingType:
    if "gate_proj" in tensor_name or "up_proj" in tensor_name:
        return COLUMN_PARALLEL  # [tp_size, 1]
```
…
- #32053 [Model Bash]: DeepSeek-R1-NVFP4 B200 — feature request — by robertgshaw2-redhat (created: 2026-01-10 05:20 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
# DeepSeek-R1 NVFP4 Model Bash
```
launch_trtllm_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode_mpt3: VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="latency" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve -tp --port
--attention-config.use_trtllm_ragge…
```
- #32059 [Feature]: Use platform op if inside a custom op — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:19 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
We have the `CustomOp` for `SiluAndMul`:
```python
@staticmethod
def forward_native(x: torch.Tensor) -> torch.Tensor:
    """PyTorch-native implementation equivalent to forward()."""
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]
```
…
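The gated-activation pattern above can be checked against a tiny pure-Python sketch (stdlib only; `silu_and_mul` is an illustrative name, not vLLM's actual API): split the last dimension in half, apply SiLU to the first half, and multiply elementwise with the second.

```python
import math

def silu(v: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return v / (1.0 + math.exp(-v))

def silu_and_mul(x: list) -> list:
    # First half is SiLU-gated and multiplied elementwise with the second
    # half, mirroring F.silu(x[..., :d]) * x[..., d:] with d = len(x) // 2.
    d = len(x) // 2
    return [silu(a) * b for a, b in zip(x[:d], x[d:])]
```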
- #32062 [Bug]: Dynamic shapes config applied in decorators not set for conditional range compilation — bug — by laithsakka (created: 2026-01-10 06:56 (UTC+8)) ### Your current environment
NA
### 🐛 Describe the bug
Namely: ``` dynamo_config_patches = {} try: …
- #32057 [Feature][DSR1 NVFP4 Model Bash]: FlashInfer Quantize Op — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:15 (UTC+8)) ### 🚀 The feature, motivation and pitch
We currently run a flashinfer quantize op
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py#L290-L294
This is much slower than the vllm native op
### Alternatives
…
- #32046 Google MedASR — no labels — by Jeevi10 (created: 2026-01-10 04:17 (UTC+8)) ### The model to consider.
https://huggingface.co/google/medasr
### The closest model vllm already supports.
Whisper
### What’s your difficulty of supporting the model you want?
…
- #32023 [Bug]: Fail to get response from a Qwen2-audio-7B-Instruct Service — bug — by moonlightian (created: 2026-01-09 18:20 (UTC+8)) [💬1] ### Name of failing test
None
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #32033 [Bug]: [CPU Backend]: Engine crashes on model re-run with torch=2.10.0 on CPU — bug,cpu — by fadara01 (created: 2026-01-10 00:37 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py`:
```text
Your output of `python collect_env.py` here
```
…
- #32038 KV-cache / long-context boundary request (minimal repro + metric) — 7-day receipts eval — performance — by StanByriukov02 (created: 2026-01-10 01:27 (UTC+8)) ### Proposal to improve performance
I’m looking for the smallest vLLM boundary where KV-cache / long-context becomes the bottleneck (memory traffic + tail latency). I want a runnable repro + the metric you care about, then I can return a receipts-backed before/after within 7 days.
The candidate improvement class is topology: replace scan-like access patterns with structured traversal / layout on a declared boundary (with SHA256-indexed receipts), but I won’t speculate until we lock the minimal…
- #32028 [RFC]: EPLB Implementation Refactoring — RFC — by ilmarkov (created: 2026-01-09 22:12 (UTC+8)) [💬1] ### Motivation.
### Current Pain Points
The existing EPLB implementation, while functionally correct, suffers from several issues that impact development velocity and system reliability:
- Coordination Complexity: The async EPLB uses multiple variables (`rebalanced`, `ep_buffer_ready`, `pending_global_ready_check`, `layer_to_transfer`) to coordinate between threads, making it difficult to reason about system state and leading to debugging challenges.
- Testing Challenges: Race cond…
- #32029 [RFC]: Online Quantization Roadmap — RFC — by vkuzo (created: 2026-01-09 22:32 (UTC+8)) [💬1] ### Motivation.
Online quantization (weights passed to vLLM in high precision, quantization of weights done inside of vLLM) is emerging as an important use case for quick experimentation and RL. Today vLLM supports online quantization with a single recipe (float8 per-tensor scaling). I wanted to align on the high level future direction of the specifics of how online quantization is implemented in vLLM as we extend this to more recipes / more hardware types / more use cases.
### Proposed Chan…
- #32017 [Bug]: [lm_eval crashed] lm eval accuracy test crashed using VLLM MAIN branch but v0.14.0rc0 works — bug,rocm — by danielhua23 (created: 2026-01-09 15:32 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32012 [Bug]: Qwen3-VL-MoE fails loading when using bitsandbytes quantization — bug — by Datta0 (created: 2026-01-09 13:08 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32009 [Bug]: 0% Acceptance rate with FI Cutlass DeepSeekR1 NVFP4 with mtp ep — bug,help wanted — by robertgshaw2-redhat (created: 2026-01-09 12:09 (UTC+8)) ### Your current environment
b200
### 🐛 Describe the bug
```
launch_cutlass_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode: VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve -tp --port
--attention-config.use_trtllm_ragged_deepseek_prefill=True --attention-backend FLASHINFER_MLA
```
…
- #32011 [Bug]: "Re-running CMake" loop in pip install — bug — by zhaohaixu (created: 2026-01-09 12:45 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32010 [Feature]: Support DCP with FP8 KV Cache — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-09 12:31 (UTC+8)) ### 🚀 The feature, motivation and pitch
DCP is great. FP8 KV cache is great. We should support both at the same time:
(EngineCore_DP0 pid=2734260) ERROR 01-08 23:29:29 [core.py:900] RuntimeError: Worker failed with error 'DCP not support fp8 kvcache now.', please check the stack trace above for the root cause…
[Closed issues]
- #11416 [Feature]: QTIP Quantization — feature request,stale — by ehtom (closed: 2026-01-10 10:14 (UTC+8)) [💬7] ### 🚀 The feature, motivation and pitch
Over the last year there have been several exciting new low-bit quantization algorithms proposed. These include AQLM (which is now in VLLM) and QuIP (which is in aphrodite engine, a vllm relative). QTIP is a new algorithm which has almost lossless performance even at 2-bits. There is code available implementing the technique here:
https://github.com/Cornell-RelaxML/qtip
I have run this code and experimented with the 8B Llama 3.1 model quantized dow…
- #17759 [Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5 — bug,stale — by YoctoHan (closed: 2026-01-10 10:13 (UTC+8)) [💬15] ### Your current environment
The output of `python collect_env.py`:
```text
INFO 05-06 19:57:58 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
...
```
- #18660 [Doc]: CPU Docker Image Requires AVX512 — documentation,stale — by zsxsoft (closed: 2026-01-10 10:13 (UTC+8)) [💬4] ### 📚 The doc issue
I'm looking for a way to run vLLM on CPU without compilation. Following the documentation at https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#troubleshooting, I used the Docker image
`public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1`. However, when running vLLM directly, it throws a SIGILL error. After debugging with GDB, I confirmed there is an AVX512 instruction,
`vinserti64x4`. But the documentation only mentions "AVX512 (optional, recommended)", the …
- #21583 [Bug]: [P/D] P/D is incompatible with spec decoding — bug,stale — by robertgshaw2-redhat (closed: 2026-01-10 10:13 (UTC+8)) [💬5] ### Your current environment
unsure, user report
### 🐛 Describe the bug
For dev, we have had asserts in various places. We have one here that is breaking spec decoding
@NickLucche - could you take a look?
…
- #23641 [RFC]: Frequency and Cost Aware Eviction Policy for Prefix Caching — feature request,stale — by luolun (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
Key Idea:
For a prefix cache entry of size `size` and access frequency `freq`, define the Retention Benefit of a prefix (over time T) as `T * freq * compute_cost`, where `compute_cost = cost_factor * cost_func(size)`…
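The proposed scoring can be sketched in a few lines (illustrative only; the RFC leaves `cost_func` abstract, so a linear stand-in is used here):

```python
def retention_benefit(T, freq, size, cost_factor=1.0, cost_func=lambda s: s):
    # Retention Benefit over time T: T * freq * compute_cost,
    # where compute_cost = cost_factor * cost_func(size).
    compute_cost = cost_factor * cost_func(size)
    return T * freq * compute_cost

# A frequently hit, expensive-to-recompute prefix scores higher than a
# cold, cheap one, so under this policy it would be evicted later.
hot = retention_benefit(T=10, freq=5, size=4)
cold = retention_benefit(T=10, freq=1, size=2)
```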
- #24394 [Performance]: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact — performance,stale — by CLFutureX (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Proposal to improve performance
## Optimization Proposal: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact
### Current Background Currently, vLLM uses prefix caching for matching. The earlier a request is processed, the higher its hit rate. Therefore, when evicting and releasing blocks, they are reversed before being added to the free list, implementing eviction from the tail to improve block reuse rate.
However, between multiple adjacent requests, the overall block distributio…
- #24398 [Bug]: Compatibility Issue with Compressed-Tensors Sparse-Only Models (Llama4 MoE) — bug,stale — by ibrahimnd2000 (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
```text
============================== System Info ==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
...
```
- #24399 [Usage]: — usage,stale — by parth0774 (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I have been running a llama model on my machine using vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8081 --tensor-parallel-size 4 --enable-auto-tool-choice --tool-call-parser llama3_json …
- #24408 [Usage]: Does the qwen3 model extrapolate automatically, or do I need to set the rope_scaling parameter? — usage,stale — by yuanjie-ai (closed: 2026-01-10 10:13 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference of a specific model. I don’t know how to integrate it with vllm. …
- #24429 [Feature]: Does vLLM support Seed OSS-36B Instruct quantized in int4/int8 format? — feature request,stale — by xiaotianns (closed: 2026-01-10 10:13 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Does vLLM support Seed OSS-36B Instruct quantized in int4/int8 format?
### Alternatives
No response
### Additional context
…
- #26379 [Bug][torch.compile]: piecewise graphs not cached for LLaMa-4-Scout — bug,torch.compile,stale — by ProExpertProg (closed: 2026-01-10 08:36 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
uv is set
============================== System Info ==============================
...
```
- #31918 [Bug]: nvidia/DeepSeek-R1-NVFP4-v2 accuracy issue with NVFP4 dispatch (CUTEDSL MoE + DeepEP LL) — bug — by minosfuture (closed: 2026-01-10 04:03 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`:
```text
Your output of `python collect_env.py` here
```
…
- #30879 [Doc]: Add some documentation about encoder compilation — documentation,torch.compile — by zou3519 (closed: 2026-01-10 03:48 (UTC+8)) [💬2] ### 📚 The doc issue
I want something like a design doc for encoder compilation. For example:
- It uses `support_torch_compile` and `set_model_tag` to avoid cache collisions
- It supports or doesn't support the following features that VllmBackend does: cudagraphs, compile_ranges, and a high-level explanation for how these are turned off or on.
- It inherits from compilation_config (or maybe it doesn't)
- here’s how to turn it on/off
I’m having a difficult time thinking through the edge cases in htt…
- #26473 [Bug][ROCm]: Failed to send request to Qwen3-Next — bug,feature request,rocm — by zhentaocc (closed: 2026-01-09 19:28 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
...
```
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:05 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py
### Basic information
- Fla…
- #29543 [CI Failure]: mi325_2: Distributed Model Tests (2 GPUs) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:03 (UTC+8)) [💬3] ### Name of failing test
`TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)' && CUDA_VISIBLE_DEVICES=0,1 pytest -v -s model_executor/model_loader/test_sharded_state_loader.py && pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)' && pytest models/language -v -s -m 'distributed(num_gpus=2)' && pytest models/multimodal -v -s -m 'distributed(num_gpus=2)' --ignore models/multimodal/generation/test_whisper.py && VLLM_WORKER_MULTIPROC_METHOD=spawn pyte…
- #29516 [CI Failure]: mi325_4: Distributed Tests (A100) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:03 (UTC+8)) [💬5] ### Name of failing test
pytest -v -s distributed/test_custom_all_reduce.py && torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py && TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)' && pytest -v -s -x lora/test_mixtral.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29521 [CI Failure]: mi325_1: Samplers Test — ci-failure — by AndreasKaratzas (closed: 2026-01-09 13:58 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s samplers && VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30099 [Bug]: Tokens are missing in streaming mode. — bug — by Ri0S (closed: 2026-01-09 12:07 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #28635 [Bug]: Streaming=True causes missing or scrambled tokens with GPT-OSS 120B on vLLM v0.11.0 — bug — by cxz1418 (closed: 2026-01-09 11:59 (UTC+8)) [💬26] ### Your current environment
NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 - vLLM versions: v0.10.2 and v0.11.0
- Model: GPT-OSS 120B
- Streaming: True
- --enforce-eager: Tested
### 🐛 Describe the bug …
- #26458 [Bug][Deepseek-v32-exp]: import `DeepseekV32IndexerCache` from `model registry` in gpu_model_runner — bug,stale — by ILikeIneine (closed: 2026-01-09 11:05 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
vllm-metax plugin backend
```
…
[New PRs]
- #32019 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank — no labels — by xyang16 (created: 2026-01-09 17:52 (UTC+8)) [💬2 | +14/-2, 1 files | commented:5] ## Purpose
This PR improves the performance of FusedMoE LoRA for smaller ranks.
- #32041 [Doc] Update NVIDIA PyTorch Docker image in gpu.cuda.inc.md since it seems outdated — documentation,nvidia — by rasmith (created: 2026-01-10 03:10 (UTC+8)) [💬1 | +1/-1, 1 files | commented:2] The NVIDIA PyTorch Docker image link in
`gpu.cuda.inc.md` seems outdated: using it together with `pip install -e .` doesn't work and results in a `setuptools` error, but when switching to the more recent version, everything works.
[!NOTE] Updates the recommended NVIDIA PyTorch Docker image in `gpu.cuda.inc.md` to a newer tag for build troubleshooting.
- Changes the `docker run` example to use `nvcr.io/nvidia/pytorch:25.12-py3` instead of `23.10-py3`
Written by [C…
- #32065 [CI] Allow Deprecated Quantization For LM Eval Tests — ready,ci/build,ci-failure — by micah-wil (created: 2026-01-10 07:07 (UTC+8)) [+1/-0, 1 files | commented:2 approved:1]
Ever since https://github.com/vllm-project/vllm/pull/31688 began the process of deprecating some quantization schemes, the `LM Eval Large Models` test group has been failing on main, as seen here for example: https://buildkite.com/vllm/ci/builds/46314/steps/canvas?sid=019ba18f-0a1a-4482-82d8-b8c102fd92c0. The PR updated other relevant tests to allow deprecated quantization types, but skipped `test_lm_eval_correctness.py`, so the `Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform` test case is failing with t…
- #32070 [RFC] Improve environment variable declaration and handling (#31249) — documentation — by nliu365 (created: 2026-01-10 09:06 (UTC+8)) [💬2 | +1400/-0, 5 files | commented:2 | 📝 draft] This PR implements the refactoring proposed in issue #31249 to improve environment variable handling in vLLM by consolidating declarations into a single source of truth with automatic type conversion.
## Summary
Replaces the duplicated type definitions and getter dictionary pattern with a unified system that:
- Declares each variable once with type annotations and defaults
- Automatically converts types based on type hints
- Supports custom parsing, lazy defaults, and validated choices
- Provi…
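A single-source-of-truth declaration of this kind could look roughly like the following stdlib-only sketch (names like `EnvVar` and `REGISTRY` are illustrative, not the PR's actual API):

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class EnvVar:
    # One declaration per variable: name, target type, default,
    # and an optional custom parser.
    name: str
    type_: type
    default: Any
    parser: Optional[Callable[[str], Any]] = None

    def get(self) -> Any:
        raw = os.environ.get(self.name)
        if raw is None:
            return self.default
        if self.parser is not None:
            return self.parser(raw)
        if self.type_ is bool:
            # bools need explicit handling: bool("0") would be True
            return raw.lower() in ("1", "true", "yes")
        return self.type_(raw)

# Type conversion follows the declared type automatically.
REGISTRY = {
    "MYAPP_PORT": EnvVar("MYAPP_PORT", int, 8000),
    "MYAPP_DEBUG": EnvVar("MYAPP_DEBUG", bool, False),
}
```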
- #32071 Use inference_mode() for torchao weight quantization — ci/build — by jerryzh168 (created: 2026-01-10 09:16 (UTC+8)) [💬2 | +27/-63, 3 files | commented:1] Summary: Trying to resolve issues: https://github.com/pytorch/pytorch/issues/170419, https://github.com/pytorch/pytorch/issues/164872, https://github.com/pytorch/ao/issues/3487
The main thing is that in vllm we didn't create the model under inference_mode(), but inference runs in inference mode, which causes a runtime error:
`RuntimeError: Cannot set version_counter for inference tensor`. And there is no easy workaround for the use case: https://github.com/pytorch/pytorch/issues/170419#issuecomment-3656315408
so …
- #32058 [Perf] Optimize grouped topk kernel, 1.2% E2E Throughput improvement — ready — by yewentao256 (created: 2026-01-10 06:18 (UTC+8)) [💬1 | +186/-283, 2 files | commented:3] ## Purpose
Part of https://github.com/vllm-project/vllm/issues/31755
Here we combine the two kernels, reducing global writes and reads.
## Test
…
- #32056 [Perf] Optimize async scheduling placeholder using empty — ready,v1 — by yewentao256 (created: 2026-01-10 06:13 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose
For a placeholder tensor there is no need to use `rand`; `empty` is good enough.
[!NOTE] Cursor Bugbot is generating a summary for commit 3462afd4a5afeda39c35c85d50e394211106a38f. Configure here.
…
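The rationale, as a minimal sketch (assumes PyTorch is installed; shapes are arbitrary): `torch.empty` only allocates, skipping the random-number generation that `torch.rand` performs, which is all a placeholder that will be overwritten needs.

```python
import torch

# A placeholder whose initial contents are never read can use
# uninitialized memory from torch.empty, avoiding the RNG cost
# of torch.rand.
placeholder = torch.empty(128, 8, dtype=torch.int32)
placeholder.fill_(0)  # stand-in for the real write that follows
```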
- #32049 [CI/Build][Hardware][AMD] Fix test_engine_core_client.py — rocm,v1 — by rjrock (created: 2026-01-10 04:42 (UTC+8)) [+23/-1, 1 files | commented:1 | 📝 draft] ## Purpose
To get ROCm to pass the tests in test_engine_core_client.py, as run in AMD's V1 Test e2e + engine test group.
## Test Plan
pytest -v -s v1/engine/test_engine_core_client.py
## Test Result
Still some failures, draft atm.
Essential Elements of an Effective PR Description Checklist
...
- #32045 [Core] Use weights_only=True with torch.load — ready — by russellb (created: 2026-01-10 04:08 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] This code neglected to use
`weights_only=True` with `torch.load()`. This is unsafe if used on potentially untrusted model data. We should always be using `weights_only=True` unless necessary. It turns out that the default changed to
`True` as of PyTorch 2.6, so this isn't a security vulnerability, but we explicitly set it to `True` everywhere else, so change it here for consistency. This also prevents it from getting flagged by security scanners, which is how I became aware of it via a private re…
- #32052 [2/N][Attention] Fix pre-commit errors — ready,v1 — by MatthewBonanni (created: 2026-01-10 05:19 (UTC+8)) [+6/-16, 3 files | commented:1 approved:1] ## Purpose
Another step in the attention restructuring, #31919. This PR fixes pre-commit errors which arose when moving files into regions of higher mypy scrutiny.
## Test Plan Pre-commit should pass
## Test Result
…
- #32044 make assume_32_bit_indexing configurable — no labels — by laithsakka (created: 2026-01-10 04:01 (UTC+8)) [+9/-1, 2 files | commented:8 approved:1] Some Meta-internal models require 64-bit indexing for compilation; this allows them to override the default config. Also addresses the compilation config hashing issue reported by AI.
[!NOTE] Cursor Bugbot is generating a summary for commit 106dbf229af2a5584e9968aaa238e0b3a8f7e636. Configure here.
…
- #32054 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py — rocm,speculative-decoding,ready,v1,cpu,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-10 06:03 (UTC+8)) [+371/-370, 36 files | commented:3 approved:1]
## Purpose
Step 3 of #31919. Moves chunks of code from utils.py to backend.py (unchanged) and updates imports accordingly. The following objects are moved:
`CommonAttentionMetadata`, `AttentionMetadataBuilder`, `AttentionCGSupport`
## Test Plan CI (run all tests)
## Test Result …
- #32060 [4/N][Attention] Move MLA common to model_executor — rocm,speculative-decoding,ready,v1,kv-connector,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-10 06:45 (UTC+8)) [+48/-42, 14 files | commented:1 approved:1] ## Purpose
Step 4 of #31919. Moves `vllm/v1/attention/backends/mla/common.py` to `vllm/model_executor/layers/attention/mla_attention.py` and updates imports accordingly.
## Test Plan
CI
## Test Result
...
- #32061 [ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing — rocm,v1 — by AndreasKaratzas (created: 2026-01-10 06:50 (UTC+8)) [+181/-70, 1 files | commented:2] Follow-up from: https://github.com/vllm-project/vllm/pull/32049. Fixes `test_engine_core_client.py` utility method tests failing on ROCm.
## Problem
ROCm requires `spawn` as the multiprocessing start method (NVIDIA uses `fork`). With `spawn`, child processes start a fresh Python interpreter and re-import all modules, which means monkey-patches applied in the parent process are not inherited. This caused utility method tests to fail with: ``` AttributeError: 'EngineCoreProc' object has no attr…
- #32064 [5/N][Attention] Finish eliminating `vllm/attention` folder — no labels — by MatthewBonanni (created: 2026-01-10 07:03 (UTC+8)) [+777/-733, 4 files | commented:1 | 📝 draft] ## Purpose
Step 5 of #31919: This PR finishes eliminating the `vllm/attention` folder by doing the following:
- Split `vllm/attention/layer.py` into `vllm/model_executor/layers/attention/mla_attention.py` (`MLAAttention`, `unified_mla_attention`) and `vllm/model_executor/layers/attention/attention.py` (`Attention`, `unified_attention`)
- Move `vllm/attention/utils/kv_sharing_utils.py` content into `vllm/model_executor/layers/attention/attention.py`
- Move `vllm/att…
- #32063 [ROCm][CI] Fix flaky `test_function_calling_with_stream` and reduce schema test examples — rocm,gpt-oss — by AndreasKaratzas (created: 2026-01-10 06:56 (UTC+8)) [+14/-10, 2 files | commented:1] This PR fixes two test failures in the Harmony and OpenAPI schema test suites:
- `test_function_calling_with_stream`: Fixed an `UnboundLocalError` caused by accessing an uninitialized variable when the model doesn't return the expected function call.
- `test_openapi_stateless`: Reduced test examples to avoid timeout failures caused by schemathesis generating malformed requests that the server processes slowly.
## Changes
### 1. `test_function_calling_with_stream` fix…
- #32036 [responsesAPI] add unit test for optional function tool call id — no labels — by qandrew (created: 2026-01-10 01:06 (UTC+8)) [💬1 | +118/-0, 1 files | commented:3] ## Purpose
## Test Plan
Follow-up from #31999
## Test Result
unit tests pass
…
- #32048 Added qwen3 vision language moe support for speculative decoding — speculative-decoding,v1,qwen — by shanjiaz (created: 2026-01-10 04:41 (UTC+8)) [+17/-1, 2 files | commented:1 | 📝 draft]
## Purpose
## Test Plan
## Test Result
...
- #32051 separate draft and target model loading config — speculative-decoding,v1,meta-exported,fb-exported — by ZhengkaiZ (created: 2026-01-10 05:17 (UTC+8)) [💬1 | +14/-3, 3 files | commented:2] Summary: separate draft and target model loading config
Test Plan: run on top of D86736723: sh vllm/fb/scripts/llama4/start_l4_maverick_spec_decoding_h200.sh; server log: P2108209888 I0105 12:21:20.634000 2686673 time_request_logger.py:217] Successfully completed request d990167cb3574d75a0e1b44f26178180 after 0.805 seconds INFO 01-05 12:21:23 [loggers.py:257] 1 Engines Aggregated: Avg prompt throughput: 9.2 tokens/s, Avg generation throughput: 34.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV…
- #32040 [ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests — rocm,ci/build — by divakar-amd (created: 2026-01-10 02:24 (UTC+8)) [+11/-2, 2 files | commented:2] This PR ensures that test shards exiting with status 5 (indicating "no tests collected") are handled gracefully in AMD CI. In a multi-GPU setting, when tests are sharded among GPUs, it is possible that a particular shard is not allocated any tests at all. In such cases, pytest returns status 5.
>> Shard 0: collected 312 items / 37 deselected / 275 selected
>> Shard 0: Running 0 items in this shard
- Updated `run-amd-test.sh` to treat shard exit status 5 ("no tests collecte…
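The handling can be sketched as a small wrapper (illustrative, not the actual `run-amd-test.sh` change): pytest exit status 5 means "no tests collected", which is expected for an empty shard and should not fail CI.

```python
import subprocess
import sys

def run_shard(cmd):
    # Run one shard's test command; pytest exit code 5 means
    # "no tests collected", which is expected for an empty shard,
    # so it is remapped to success.
    status = subprocess.run(cmd).returncode
    if status == 5:
        print("shard collected no tests; treating as success")
        status = 0
    return status

# Simulate an empty shard: a process exiting with status 5.
empty = run_shard([sys.executable, "-c", "raise SystemExit(5)"])
```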
- #32050 [EPLB][Cleanup] Remove `is_async_enabled` from `EplbModelState` — no labels — by SageMoore (created: 2026-01-10 04:43 (UTC+8)) [+12/-20, 2 files | commented:1 | 📝 draft] ## Purpose
See above. This PR just removes the `is_async_enabled` instance variable from `EplbModelState` and refactors the code to use `is_async` in `EplbState`. Having two variables to determine whether we are running async EPLB invites the question: "When can these two variables be different?" They really shouldn't be different. This is a bit of unnecessary complexity and opens us up to bugs further down the line, where we update one and not the other. Bonus: the linter was complaining about th…
- #32047 [Refactor] Remove unused cutlass moe problem size function — nvidia — by yewentao256 (created: 2026-01-10 04:24 (UTC+8)) [+0/-75, 4 files | commented:1] ## Purpose
Remove the unused `get_cutlass_moe_mm_problem_sizes` function. No test needed.
[!NOTE] Cursor Bugbot is generating a summary for commit 6d738cb321a6d1f197e1bc05064a7b2f8ca012a6. Configure here. …
- #32043 [AOT compilation] cached inductor artifacts benchmark — performance — by dolpm (created: 2026-01-10 04:01 (UTC+8)) [+682/-0, 1 files | commented:3] ## Purpose
split out of https://github.com/vllm-project/vllm/pull/25205
## Test Plan
## Test Result
...
- #32042 [AMD][QWEN3-NEXT] FP8 Tunings — performance,rocm,qwen — by draftbk (created: 2026-01-10 03:50 (UTC+8)) [+586/-2, 5 files | commented:1] ## Purpose
Add pre-tuned FP8 block-scaled matmul kernel configs for Qwen3-Next-80B on 1 MI300X.
**Configs added:**
- `N=12288,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=2048,K=4096,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=1024,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=9216,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
Tu…
- #32034 [Refactor] Remove numpy split in async scheduling — ready,v1 — by yewentao256 (created: 2026-01-10 00:39 (UTC+8)) [💬1 | +6/-12, 1 files | commented:2 approved:1] ## Purpose
Remove the numpy split in async scheduling for a small performance improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128…
- #32035 [Docs] Add docs about OOT Quantization Plugins — documentation,ready — by mgoin (created: 2026-01-10 00:48 (UTC+8)) [💬1 | +158/-0, 2 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
- #32039 AMD CI Test - unskip moe_sum test and moe_align_block_size tests — rocm — by hongxiayang (created: 2026-01-10 01:32 (UTC+8)) [💬1 | +11/-4, 3 files | commented:1] ## Purpose
To test AMD CI by enabling one skipped test.
Summary of Changes
- Fixed test_moe.py and test_moe_align_block_size.py to unskip them on ROCm: removed the unnecessary `@pytest.mark.skipif(current_platform.is_rocm())` from test_moe_sum, test_moe_align_block_size, and test_moe_align_block_size_with_expert_map
…
- #32026 [Refactor] Separate sequence and token pooling types — documentation,new-model,frontend,ready,qwen — by DarkLight1337 (created: 2026-01-09 19:57 (UTC+8)) [💬5 | +324/-204, 42 files | approved:1 commented:7] ## Purpose
This refactor is necessary to cleanly enable both sequence and token pooling tasks at the same time, allowing us to support CLS/LAST/MEAN pooling together with tokenwise pooling methods other than ALL.
## Test Plan
## Test Result
...
- #32037 [CPU][BugFix] Disable AOT Compile for CPU — no labels — by fadara01 (created: 2026-01-10 01:19 (UTC+8)) [💬1 | +6/-1, 1 files | commented:1 approved:1] ## Purpose
Disable AOT Compile for CPU. Fixes: #32033
## Test Plan
Reproducer in #32033
## Test Result …
- #32022 [Bugfix] fix memory inconsistency in cross-process shared memory — no labels — by slippersss (created: 2026-01-09 18:18 (UTC+8)) [💬2 | +6/-0, 1 files | commented:1 | 📝 draft] ## Purpose
This PR aims to fix a potential memory inconsistency in the current cross-process shared memory. The related issue is described in #27858. It is probably due to the weak memory ordering of some CPU architectures: readers may read
`metadata_buffer[0] = 1` before `buf` is fully written. We use `memory_fence()` to enforce the order of the buffer (i.e. `buf`) and flag (i.e. `metadata_buffer[0]`) writes.
## Test Plan
This issue occurs intermittently, so we need long-running high-concurrency te…
- #32031 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure — v1,kv-connector — by NickLucche (created: 2026-01-09 23:12 (UTC+8)) [💬2 | +234/-28, 2 files | commented:6] ## Overview
This PR is aimed at improving logging to more easily identify failures during runs. It does so by providing a shared util function with more comprehensive context of the failure.
Change result: ``` # before ERROR 01-09 12:04:06 [nixl_connector.py:2281] NIXL transfer setup/initiation failed for request test_transfer_setup_failed_req. Marking blocks as invalid.
…
-
#32032 [CI/Build] Publish CPU images for each release — ci/build — by nathan-weinberg (创建于: 2026-01-09 23:24 (UTC+8)) [+101/-14, 2 files | commented:2] ## Purpose It was suggested to me in https://github.com/vllm-project/vllm/pull/31749 to extend any CPU images off a base, but currently no such base exists as far as I can tell. The only published CPU images I’ve been able to find are ECR images with a rather old vLLM version (more details in https://github.com/vllm-project/vllm/issues/23681)
This PR updates the Buildkite automation to build both x86_64 and arm64 CPU images and publish them to Docker Hub. Since this is not a “first-class” useca…
-
#32020 Rename `--exclude-log-deltas` to `--enable-log-deltas` — frontend,ready — by Catacomba (创建于: 2026-01-09 17:55 (UTC+8)) [+9/-10, 3 files | commented:1 approved:2] ## Purpose
Rename
`--exclude-log-deltas`, implemented in #30322, to the standard `--enable`/`--no-enable` model. Remove the validation check for `--enable-log-deltas`/`--no-enable-log-deltas`, since strictly speaking, changing the flag does not require `--enable-log-outputs`. ## Test Plan
-
Run vllm serve with any model, with
`--enable-log-requests` and `--enable-log-outputs` enabled: `vllm serve model --enable-log-requests --enable-log-outputs`…
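The standard `--enable`/`--no-enable` flag pair can be illustrated with argparse's built-in `BooleanOptionalAction` (a sketch only; vLLM's actual CLI wiring is more involved):

```python
# Sketch of the --enable-X / --no-enable-X flag pattern: BooleanOptionalAction
# automatically generates the paired --no-enable-log-deltas negation flag.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-log-deltas",
    action=argparse.BooleanOptionalAction,  # adds --no-enable-log-deltas too
    default=False,
)

args = parser.parse_args(["--enable-log-deltas"])
print(args.enable_log_deltas)  # True
```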
-
-
#32030 Add: acceptance length tests — speculative-decoding,v1 — by rahul-tuli (创建于: 2026-01-09 22:44 (UTC+8)) [💬1 | +266/-0, 2 files | commented:2] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter | Mean AL | Per-Position AL |
|----------|---------|---------|-----------------|
…
- #32027 [Doc] Remove hardcoded Whisper in example openai translation client — documentation,ready — by Isotr0py (创建于: 2026-01-09 21:42 (UTC+8)) [💬1 | +12/-6, 1 files | commented:1 approved:1]
## Purpose
- Avoid hardcoded
`openai/whisper-large-v3` in `openai_translation_client.py`, so that we can test other models' translation APIs easily.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#32016 [Model] Remove redundant None check in DeepSeekOCR image input processing — deepseek — by maang-h (创建于: 2026-01-09 14:02 (UTC+8)) [+10/-13, 1 files | commented:1 approved:1] ## Purpose Remove redundant
`if pixel_values is not None` check and unreachable assertion in `DeepseekOCRImagePixelInputs._parse_image_pixel_values`. Lines 437-438 already handle the
`pixel_values is None` case with an early return. This cleanup improves code readability and removes unnecessary branching.
## Test Plan
…
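The redundancy pattern being removed can be shown with a toy parser (hypothetical code, not the DeepSeek-OCR implementation):

```python
# Toy illustration: once the None case returns early, any later
# `if pixel_values is not None:` branch can never be False, so it is redundant.
def parse_pixel_values(pixel_values):
    if pixel_values is None:
        return None  # early return handles the None case (cf. lines 437-438)
    # a subsequent `is not None` check here would be unreachable in its
    # False branch, hence dead code
    return [v * 2 for v in pixel_values]  # stand-in for the real processing
```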
-
#32025 [Misc] Hash keys only when value is
`None` in kwargs — ready,multi-modality — by ywang96 (创建于: 2026-01-09 18:56 (UTC+8)) [+3/-0, 1 files | approved:1 commented:5] ## Purpose It's not guaranteed that multimodal kwargs always have non-`None` values. When this happens, every request produces a serialization warning, which isn't helpful. This PR modifies the hasher to hash only the keys for key-value pairs whose value is `None`. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
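A sketch of the hashing behavior described for #32025 (assumed logic, not vLLM's actual hasher code):

```python
# Sketch: when a kwarg's value is None, fold only the key into the hash
# instead of attempting to serialize the None value (which triggers the
# unhelpful serialization warning the PR describes).
import hashlib
import pickle

def hash_mm_kwargs(kwargs: dict) -> str:
    h = hashlib.sha256()
    for key, value in sorted(kwargs.items()):
        h.update(key.encode())
        if value is not None:
            h.update(pickle.dumps(value))  # serialize only real values
    return h.hexdigest()
```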
#32024 [Frontend] Try multiple prompts in benchmark initial test run — performance — by MatteoFari (创建于: 2026-01-09 18:31 (UTC+8)) [💬3 | +99/-36, 2 files | commented:1 approved:1] ## Purpose Added num_test_prompts argument (defaulting to 5) to improve robustness of benchmark, fixes #31902 ## Test Plan Added a new test case test_bench_serve_num_test_prompts in tests/benchmarks/test_serve_cli.py to ensure the new CLI argument is correctly parsed and passed to the benchmark function. ## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32021 [WIP] Optimize greedy sample. — tpu,v1 — by whx-sjtu (创建于: 2026-01-09 18:09 (UTC+8)) [+61/-6, 3 files | commented:1 | 📝草稿] ## Purpose Implement https://github.com/vllm-project/vllm/issues/31216
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32018 [Frontend]
`finish_reason` must be `tool_call` whenever a tool is called — frontend — by sanghoon-yn (创建于: 2026-01-09 16:55 (UTC+8)) [+8/-8, 1 files | commented:2] ## Purpose Currently, when `tool_choice` is a function name, `OpenAIServingChat` responds with `finish_reason: "stop"`. The OpenAI API now emits an explicit
`tool_call` instead. This PR updates
`OpenAIServingChat`'s chat serving logic to align with the latest OpenAI API behavior for tool invocation. ## Test Plan
## Test Result …
-
#32013 [Fix] Qwen3-VL-MoE bitsandbytes 4 bit quant — qwen — by Datta0 (创建于: 2026-01-09 13:09 (UTC+8)) [💬2 | +257/-43, 2 files | commented:3] ## Purpose Starting the vLLM server with Qwen/Qwen3-VL-30B-A3B-Instruct using the bitsandbytes quantization format fails while loading the weights due to quantization issues.
FIX https://github.com/vllm-project/vllm/issues/32012 ## Test Plan
## Test Result
... -
#32015 test — 无标签 — by 1643661061leo (创建于: 2026-01-09 13:20 (UTC+8)) [💬1 | +199/-250, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32014 [MLA] Support DCP + FP8 — v1 — by LucasWilkinson (创建于: 2026-01-09 13:19 (UTC+8)) [+11/-7, 1 files | commented:1 | 📝草稿] FIX https://github.com/vllm-project/vllm/issues/32010
Signed-off-by: Lucas Wilkinson lwilkins@redhat.com
[已合并 PR]
- #31348 resolve pydantic error in startup benchmark — performance,ready — by andyxning (合并于: 2026-01-10 10:41 (UTC+8)) [💬4 | +21/-7, 2 files | commented:6 approved:1]
## Purpose
Avoid pydantic error about CompilationConfig in
`vllm bench startup`: ``` RuntimeError: Subprocess failed: 7 validation errors for CompilationConfig local_cache_dir Unexpected keyword argument [type=unexpected_keyword_argument, input_value=None, input_type=NoneType] For further information visit https://errors.pydantic.dev/2.12/v/unexpected_keyword_argument bs_to_padded_graph_size Unexpected keyword argument [type=unexpected_keyword_argument, input_value=None, input_type=NoneTy… -
#31616 [Bugfix] Narrow broad exceptions in compilation backends — ready — by c0de128 (合并于: 2026-01-10 10:39 (UTC+8)) [💬15 | +1/-1, 1 files | commented:2 approved:2] ## Summary
Narrows broad
`except Exception:` handlers in the compilation cache initialization to specific exception types for better debuggability. ## Changes
### 1. File reading exception (line 609) ```python # Before: except Exception: …
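The pattern in question, sketched below; the exact exception types chosen by the PR may differ from the ones assumed here:

```python
# Sketch: narrow a blanket `except Exception:` around a cache-file read to the
# specific failures such a read can legitimately raise (types assumed).
def read_cache_file(path: str):
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except (OSError, UnicodeDecodeError):
        # expected failures only: missing file, permissions, corrupt encoding;
        # genuine bugs (TypeError, AttributeError, ...) now propagate loudly
        return None
```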
- #32065 [CI] Allow Deprecated Quantization For LM Eval Tests — ready,ci/build,ci-failure — by micah-wil (合并于: 2026-01-10 10:10 (UTC+8)) [+1/-0, 1 files | commented:2 approved:1]
Ever since https://github.com/vllm-project/vllm/pull/31688 began the process of deprecating some quantization schemes, the
`LM Eval Large Models` test group is failing on main, as seen here for example: https://buildkite.com/vllm/ci/builds/46314/steps/canvas?sid=019ba18f-0a1a-4482-82d8-b8c102fd92c0. The PR updated other relevant tests to allow deprecated quantization types, but skipped `test_lm_eval_correctness.py`, so the `Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform` test case is failing with t… -
#32056 [Perf] Optimize async scheduling placeholder using empty — ready,v1 — by yewentao256 (合并于: 2026-01-10 08:46 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose
For a placeholder tensor, there is no need to use
`rand`; `empty` is good enough.
…
-
#32045 [Core] Use weights_only=True with torch.load — ready — by russellb (合并于: 2026-01-10 08:28 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] This code neglected to use
`weights_only=True` with `torch.load()`. This is unsafe if used on potentially untrusted model data. We should always use `weights_only=True` unless necessary. It turns out that the default changed to
`True` as of PyTorch 2.6, so this isn't a security vulnerability, but we explicitly set it to `True` everywhere else, so change it here for consistency. This also prevents it from getting flagged by security scanners, which is how I became aware of it via a private re…
#32052 [2/N][Attention] Fix pre-commit errors — ready,v1 — by MatthewBonanni (合并于: 2026-01-10 08:27 (UTC+8)) [+6/-16, 3 files | commented:1 approved:1] ## Purpose Another step in the attention restructuring, #31919 This PR fixes pre-commit errors which arose when moving files into regions of higher mypy scrutiny
## Test Plan Pre-commit should pass
## Test Result
…
-
#31744 [Misc][BE] Type coverage for vllm/compilation [2/3] — ready,nvidia — by Lucaskabela (合并于: 2026-01-10 07:30 (UTC+8)) [+161/-91, 12 files | commented:3 approved:2] ## Purpose We want to provide better type hint coverage in vllm/compilation to improve maintainability, readability, and reduce silent errors
This PR should be applied on top of #31554
## Test Plan mypy vllm/compilation
## Test Result
…
-
#31998 [Misc] Enable async scheduling by default with spec decoding — ready — by njhill (合并于: 2026-01-10 07:09 (UTC+8)) [💬2 | +27/-22, 1 files | commented:2 approved:3] Now that all of the gaps have been addressed in async scheduling + spec decoding support, we can enable it by default in this case too.
It will still be disabled implicitly for non-EAGLE/MTP types or when padded drafter batch is disabled.
This should only be merged once https://github.com/vllm-project/vllm/pull/30495 is merged.
[!NOTE] Enables async scheduling by default when using compatible speculative decoding, with clearer gating and messaging. …
-
#31336 [perf][async] support non cpu sync get logprob tensors for spec — ready,v1 — by izhuhaoran (合并于: 2026-01-10 05:24 (UTC+8)) [💬8 | +48/-29, 3 files | commented:10] ## Description:
This PR eliminates implicit CPU-GPU synchronization points in the
RejectionSamplerwhen computing logprobs during speculative decoding, specifically for the async scheduling path.Problem Previously, the rejection sampling logic used boolean masking to identify accepted tokens: https://github.com/vllm-project/vllm/blob/ba25a6599212f01e6b877e22549612333e95b2c1/vllm/v1/sample/rejection_sampler.py#L191-L194 https://github.com/vllm-project/vllm/blob/ba25a6599212f01e6b877e2254…
-
#30275 [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout — ready,v1,kv-connector — by xuechendi (合并于: 2026-01-10 05:22 (UTC+8)) [💬7 | +139/-87, 2 files | commented:10] ## Purpose
We already support heterogeneous BlockSize and kv_layout via separate post-process methods. This PR cleans them up so that a single method handles post-processing for all cases.
## What is changed in this PR:
I removed
`permute_device_kv` and `blocksize_post_process`, and moved the logic into `post_process_device_kv_on_receive` as a single post-process function with 3 options: ``` …
-
#31916 [1/N][Attention] Restructure attention: move files — documentation,performance,rocm,tpu,speculative-decoding,ready,ci/build,v1,multi-modality,llama — by MatthewBonanni (合并于: 2026-01-10 05:10 (UTC+8)) [💬12 | +425/-395, 195 files | commented:1 approved:3] ## Purpose Implement step 1 of #31919. This PR consists solely of file renaming and movement, and the necessary updates to imports.
- Move vllm/attention/layers to vllm/model_executor/layers/attention
- Move vllm/attention/backends/abstract.py to vllm/v1/attention/backend.py
- Move vllm/attention/backends/registry.py to vllm/v1/attention/backends/registry.py
- Eliminate vllm/attention/backends folder
- Move vllm/attention/utils/fa_utils.py to vllm/v1/attention/backends/fa_utils.py
- Move vll…
-
#31830 [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement — ready,nvidia — by yewentao256 (合并于: 2026-01-10 03:13 (UTC+8)) [💬3 | +172/-63, 6 files | commented:9 approved:1] ## Purpose
Part of the https://github.com/vllm-project/vllm/issues/31755
Here we add a kernel for faster calculation of problem size
## Test
export MODEL="zai-org/GLM-4.7-FP8"…
-
#31836 [responsesAPI] fix incomplete_messages for simple/parsable context — frontend,ready — by qandrew (合并于: 2026-01-10 05:00 (UTC+8)) [💬1 | +44/-0, 4 files | commented:1 approved:1] ## Purpose
Similar to https://github.com/vllm-project/vllm/pull/24561 (which was done only for GPT-OSS), we want to output the incomplete reason for non-gpt-oss models.
We make sure
`assert response.status == "incomplete"` and `assert response.incomplete_details.reason == "max_output_tokens"`…
-
#30833 [Quant] Make static quant support all group shapes — ready — by LucasWilkinson (合并于: 2026-01-10 04:49 (UTC+8)) [💬6 | +339/-47, 7 files | commented:5 approved:2] Preparatory pr for https://github.com/vllm-project/vllm/pull/30141 (per-head quant)
- #32001 [fix] add cutedsl to global sf — ready,nvidia — by jiahanc (合并于: 2026-01-10 04:03 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose Add flashinfer cutedsl to global sf list, fixes https://github.com/vllm-project/vllm/issues/31918 ## Test Plan ``` VLLM_DEEPEPLL_NVFP4_DISPATCH=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_STANDALONE_COMPILE=0 VLLM_FLASHINFER_MOE_BACKEND="masked_gemm" VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ALL2ALL_BACKEND="deepep_low_latency" lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-0528-FP4-v2,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=Tr…
-
#29354 Add unpermute-aware fused MoE path and small-batch fallback — performance,ready — by RunkaiTao (合并于: 2026-01-10 03:58 (UTC+8)) [💬9 | +176/-21, 2 files | commented:10] ## Purpose This PR enhances the fused MoE implementation by adding support for the unpermute execution path. It introduces a small-batch fallback path designed for models with large numbers of experts, where the regular kernel becomes inefficient. For these regimes, the new path provides a speed-up by reducing unnecessary data movement and improving kernel efficiency for low-token workloads.
## Test Result
### Benchmark: Llama-4-Maverick-17B-128E (TP = 8)
Command: ```bash python3 benchmarks/k…
-
#31595 [Fix] Introduce audio channels spec — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by jeremyteboul (合并于: 2026-01-10 03:34 (UTC+8)) [💬10 | +717/-189, 9 files | commented:8 approved:1] Add generic AudioSpec framework for audio channel normalization This PR introduces an extensible AudioSpec-based framework for normalizing audio channels in vLLM multimodal models. The framework addresses the stereo-to-mono conversion issue where torchaudio returns [channels, time] format but Whisper-based models expect [time] (mono).
Key changes:
- Add AudioSpec dataclass and ChannelReduction enum for flexible audio format specification
- Add get_audio_spec() to detect model requirements fro…
-
#31707 [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator — ready,v1 — by ivanium (合并于: 2026-01-10 02:53 (UTC+8)) [💬4 | +324/-127, 2 files | commented:7 approved:1] ## Purpose
The current hybrid KV cache coordinator supports at most two attention types (full attention + another sliding-window/mamba attention). However, emerging models may need more flexible support. For example, full attention + sliding window attn with various window sizes, etc., as required in #31592 and #30263.
Since prefix caching for sliding window and mamba does not have the monotonic prefix cache hit property, viz., a cache hit at position i does not imply a cache hit at position j…
-
#32034 [Refactor] Remove numpy split in async scheduling — ready,v1 — by yewentao256 (合并于: 2026-01-10 03:09 (UTC+8)) [💬1 | +6/-12, 1 files | commented:2 approved:1] ## Purpose
Remove the numpy split in async scheduling, yielding a small performance improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128…
-
#31737 [Frontend][gpt-oss] Allow system message to overwrite model identity — frontend,ready,gpt-oss — by qandrew (合并于: 2026-01-10 03:03 (UTC+8)) [+111/-9, 3 files | commented:2 approved:1] ## Purpose
taking over from https://github.com/vllm-project/vllm/pull/27310 We have use cases where we want to override the system prompt (and not have it fed to developer prompt).
## Test Plan
CUDA_VISIBLE_DEVICES=2,3 with-proxy vllm serve "openai/gpt-oss-120b" -tp 2 --trust-remote-code…
-
#30585 [Bugfix] Fix Triton FusedMoE LoRA — ready,v1,gpt-oss — by xyang16 (合并于: 2026-01-09 19:46 (UTC+8)) [💬11 | +53/-37, 3 files | commented:3 approved:2] ## Purpose
This PR is to fix Triton fused_moe_lora.
- Reorder the rows of
`intermediate_cache1` to make sure LoRA weights are added to the correct rows here. - This will fix
`test_gptoss_tp.py` for the Triton backend.
## Test Plan
``` pytest -s -v tests/lora/test_gptoss_tp.py …
-
#31921 [Bugfix] Fix OpenAPI schema test failures — frontend,ready — by AndreasKaratzas (合并于: 2026-01-09 18:56 (UTC+8)) [+11/-3, 2 files | commented:2 approved:1] This PR fixes issues causing OpenAPI schema test failures:
## Changes
### 1. Add upper bound validation to
`truncate_prompt_tokens` (protocol.py). The
`truncate_prompt_tokens` field only had a lower bound (ge=-1) but no upper bound. When schemathesis generates extremely large integers, the server crashed instead of returning a proper 400 validation error. Added a
`le=_LONG_INFO.max` constraint to match the `seed` field's validation.…
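The bound fix can be sketched with pydantic. This is simplified: the real field lives in vLLM's protocol.py and is optional, and `_LONG_INFO.max` is modeled here as a 64-bit signed max.

```python
# Sketch of adding an upper bound so absurdly large integers fail validation
# (a 400 from the server) instead of crashing downstream code.
from pydantic import BaseModel, Field, ValidationError

LONG_MAX = 2**63 - 1  # stand-in for _LONG_INFO.max

class SamplingParamsSketch(BaseModel):
    # ge=-1 was already present; le=LONG_MAX is the new upper bound
    truncate_prompt_tokens: int = Field(default=-1, ge=-1, le=LONG_MAX)

try:
    SamplingParamsSketch(truncate_prompt_tokens=2**80)  # schemathesis-style input
    rejected = False
except ValidationError:
    rejected = True
```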
-
#32000 [ROCm][CI][V1] Fix
`nixl_connector` test failure and achieve CUDA parity in `test_async_scheduling` — rocm,speculative-decoding,ready,v1,kv-connector,nvidia — by AndreasKaratzas (合并于: 2026-01-09 20:48 (UTC+8)) [💬5 | +28/-39, 3 files | commented:1 approved:1] This PR adds FlexAttention backend support for ROCm in the EAGLE speculative decoding proposer, removing platform-specific attention backend restrictions and merging with the CUDA data flow of this test. It also fixes `test_abort_timeout_on_prefiller[ray]` failing on ROCm platforms. ## Changes
- Added
`FlexAttentionMetadata` to the allowed attention types for ROCm in `eagle.py` - Removed ROCm-specific backend overrides that were workarounds for missing FlexAttention support
- Modify `_make_fak…
-
#29450 [UX] Add vLLM model inspection view — frontend,ready,v1 — by mgoin (合并于: 2026-01-10 01:12 (UTC+8)) [💬1 | +180/-1, 6 files | commented:6 approved:2] ## Purpose
Initial start to the kernels view request in https://github.com/vllm-project/vllm/issues/28085 by making a general modules view where we can see attention backends and quant_methods used at first.
You can see the model inspection if you add
`VLLM_LOG_MODEL_INSPECTION=1` during engine creation, or by simply printing the `LLM` object. ## Test Result
#### LLM class
…
-
#30886 [Doc] Add developer guide for CustomOp — documentation,rocm,ready — by shen-shanshan (合并于: 2026-01-10 00:21 (UTC+8)) [💬10 | +441/-5, 24 files | commented:10] ## Purpose
Currently, more and more `CustomOp`s are defined both in vLLM and in out-of-tree (OOT) device plugins. Following https://github.com/vllm-project/vllm/pull/30125#issuecomment-3648916991, I have written a doc about the principles and usage of `CustomOp`. ## Test Plan
## Test Result
... -
#32020 Rename `--exclude-log-deltas` to `--enable-log-deltas` — frontend,ready — by Catacomba (合并于: 2026-01-09 23:30 (UTC+8)) [+9/-10, 3 files | commented:1 approved:2] ## Purpose
Rename
`--exclude-log-deltas`, implemented in #30322, to the standard `--enable`/`--no-enable` model. Remove the validation check for `--enable-log-deltas`/`--no-enable-log-deltas`, since strictly speaking, changing the flag does not require `--enable-log-outputs`. ## Test Plan
-
Run vllm serve with any model, with
`--enable-log-requests` and `--enable-log-outputs` enabled: `vllm serve model --enable-log-requests --enable-log-outputs`…
-
- #32027 [Doc] Remove hardcoded Whisper in example openai translation client — documentation,ready — by Isotr0py (合并于: 2026-01-09 22:44 (UTC+8)) [💬1 | +12/-6, 1 files | commented:1 approved:1]
## Purpose
- Avoid hardcoded
`openai/whisper-large-v3` in `openai_translation_client.py`, so that we can test other models' translation APIs easily.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#31832 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe — performance,ready,nvidia — by mgoin (合并于: 2026-01-09 22:40 (UTC+8)) [+278/-95, 8 files | commented:5 approved:1] ## Purpose
We can easily fuse the silu_and_mul operation into the nvfp4 quantization operation for moe, which we already do for dense nvfp4 as of https://github.com/vllm-project/vllm/pull/23671. This PR just generalizes some utils to expose the same functionality for expert quantization. This seems to improve latency by ~2% and throughput by ~4%
Before: <img width="1631" height="659" alt="Screenshot 2026-01-06 at 4 00 44 PM" src="https://github.com/user-attachments/assets/bd52135a-21aa-4438-99…
- #31968 [CPU] Add head sizes 80 and 112 with vec16 fallback — ready,v1,cpu — by R3hankhan123 (合并于: 2026-01-09 22:14 (UTC+8)) [💬2 | +12/-5, 4 files | commented:5 approved:1]
## Purpose
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend, which were previously removed in #27954; these head dimensions are commonly used by Granite models deployed on Z architectures. Since these head sizes are not friendly to the Intel AMX instruction set, the implementation now falls back to vec16.
## Test Plan
Build Docker image and test using
the `ibm-granite/granite-3b-code-base-2k` model, which has a head size of 80. ## Test Result Server Logs ``` docker run --rm -it -p 8000:8… - #32003 [Cleanup] Remove obsolete spec decoding compatibility logic — performance,speculative-decoding,ready,v1 — by njhill (合并于: 2026-01-09 13:44 (UTC+8)) [+45/-75, 8 files | commented:1 approved:1]
This is logic which was included with the original V1 spec decoding impl (ngram only), prior to various other changes which have made it obsolete including:
- Adding other spec decoding methods
- Adding support for logprobs and penalty parameters with spec decoding
Note that there are still some parameters which don’t work with spec decoding (min_p, min_tokens, logit_bias). Requests with these params will now fail when spec decoding is enabled (see https://github.com/vllm-project/vllm/pull/3198…
-
#32016 [Model] Remove redundant None check in DeepSeekOCR image input processing — deepseek — by maang-h (合并于: 2026-01-09 22:12 (UTC+8)) [+10/-13, 1 files | commented:1 approved:1] ## Purpose Remove redundant
`if pixel_values is not None` check and unreachable assertion in `DeepseekOCRImagePixelInputs._parse_image_pixel_values`. Lines 437-438 already handle the
`pixel_values is None` case with an early return. This cleanup improves code readability and removes unnecessary branching.
## Test Plan
…
-
#31999 Fix type error — frontend,ready,meta-exported,fb-exported — by Adolfo-Karim (合并于: 2026-01-09 22:03 (UTC+8)) [💬1 | +3/-1, 1 files | commented:3 approved:2] Summary: id can be None, so we need to check it.
Test Plan: Updated local instance, the error is gone.
Differential Revision: D90353429
-
#29304 [ROCm][PD] add moriio kv connector. — documentation,rocm,frontend,ready,ci/build,v1,kv-connector — by inkcherry (合并于: 2026-01-09 22:01 (UTC+8)) [💬10 | +3369/-3, 10 files | commented:10] ## Purpose
This PR introduces the mori-io KV connector for AMD devices. Built on top of the MORI project, the mori-io connector supports both PULL and PUSH modes for KV Cache transfer. Key features include:
- Mori backend integration.
- Mori-related components (buffer merge & session cache management & batch IO).
- PULL mode (Serial interaction of prefill and decode).
- PUSH mode (Parallel interaction of prefill and decode, with non-blocking layer-wise transfer…
-
#32025 [Misc] Hash keys only when value is
`None` in kwargs — ready,multi-modality — by ywang96 (合并于: 2026-01-09 21:20 (UTC+8)) [+3/-0, 1 files | approved:1 commented:5] ## Purpose It's not guaranteed that multimodal kwargs always have non-`None` values. When this happens, every request produces a serialization warning, which isn't helpful. This PR modifies the hasher to hash only the keys for key-value pairs whose value is `None`. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31881 [Feature][Benchmarks] Custom dataset: read output length from dataset — performance,ready — by sducouedic (合并于: 2026-01-09 20:40 (UTC+8)) [+26/-5, 1 files | commented:1 approved:1] For custom datasets only – these changes allow reading the output length from the custom dataset through the optional field
`"output_tokens"` in the jsonl file. The value stored in the
`custom-output-len` argument is still used by default and overrides whatever is loaded from the dataset. The value from the dataset is used only if `custom-output-len` is `None` or set to -1.
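The described precedence can be sketched as a small helper (the function name is hypothetical; the rule comes from the PR text):

```python
# Sketch: the custom-output-len CLI value wins by default; the per-sample
# "output_tokens" value from the jsonl dataset is used only when the CLI
# value is None or -1.
def resolve_output_len(cli_value, dataset_value):
    if cli_value is None or cli_value == -1:
        return dataset_value  # fall back to the dataset's output_tokens
    return cli_value          # CLI argument overrides the dataset
```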
-
#31380 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten — rocm,ready,v1,qwen — by vllmellm (合并于: 2026-01-09 19:28 (UTC+8)) [💬8 | +283/-84, 5 files | commented:10] ## Purpose
Fixes #26473
This PR refactors the rocm_attn backend kernels to support models with non-power-of-2 block sizes, specifically the Qwen/Qwen3-Next-80B-A3B-Thinking model.
The core of this update is a Dynamic Dispatching Mechanism:
- Standard Path ($2^n$): For models with power-of-2 block sizes (16, 32, 64, 128, etc.), the kernel retains the legacy bitwise-optimization logic to ensure maximum performance and zero regression.
- Generalized Path (non-$2^n$): For non-standard models (e.g., Qwe…
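The dispatch criterion can be sketched in a few lines (helper names assumed; only the power-of-2 rule comes from the PR):

```python
# Sketch: power-of-2 block sizes take the legacy bitwise-optimized path,
# while non-standard sizes such as Qwen3-Next's 544 take the generalized path.
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0  # a power of 2 has exactly one set bit

def pick_kernel_path(block_size: int) -> str:
    return "standard" if is_power_of_two(block_size) else "generalized"
```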
- #31948 fix: remove duplicate engine_id check in nixl_connector — ready,kv-connector — by xbfs (合并于: 2026-01-09 20:13 (UTC+8)) [+0/-7, 1 files | commented:1 approved:1]
[!NOTE] Eliminates a duplicate
`engine_id` mismatch check in `_nixl_handshake` within `nixl_connector.py`, keeping a single validation after decoding `NixlAgentMetadata`. - Simplifies handshake logic; no functional behavior change expected
-
#31973 [Model] Reorganize pooling layers — documentation,ready,ci/build,v1,qwen — by DarkLight1337 (合并于: 2026-01-09 19:02 (UTC+8)) [💬4 | +1290/-1143, 34 files | commented:9 approved:1] ## Purpose
`pooler.py` is getting really bloated, so let's split everything up:
- Pooler activations (`vllm.model_executor.layers.pooler.activations`)
- Common code (`vllm.model_executor.layers.pooler.common`)
- Abstract pooler (`vllm.model_executor.layers.pooler.abstract`)
- Special poolers (`vllm.model_executor.layers.pooler.special`)
- Poolers that aggregate the whole sequence (`vllm.model_executor.layers.pooler.seqwise`)
- Poolers that apply to each token in the sequence (`vllm.model_execu…
- #31906 [Bugfix] Fix Var Length Batched Padding in Granite Speech — ready — by alex-jw-brooks (合并于: 2026-01-09 18:28 (UTC+8)) [+8/-3, 1 files | commented:2 approved:1]
## Purpose
Fixes a bug in granite speech padding - the features are variable length, so we pad tensors to be
`[bsz, longest_feature, 160]`, but when the multimodal inputs are batched, they are provided as a list of dim `[feat_len, 160]`, which breaks the pad call expecting a 3D tensor. ``` (EngineCore_DP0 pid=2881328) File "/u/abrooks9944/vllm/vllm/model_executor/models/granite_speech.py", line 675, in parse_and_validate_audio_input (EngineCore_DP0 pid=2881328) input_features = self._pad… -
#31994 fix lora moe sharding when rank < max_lora_rank — ready,gpt-oss — by gnovack (合并于: 2026-01-09 14:43 (UTC+8)) [💬1 | +6/-8, 2 files | commented:1 approved:1] ## Purpose This PR fixes a bug in the implementation of
`fully_sharded` for MoE LoRA adapters. Currently, when a LoRA adapter is loaded whose rank is less than `max_lora_rank`, the LoRA A W13 weights are split into shards of size `current_lora_rank // tp_degree`. This causes problems when we allgather the LoRA A output along the rank dim, since the resulting tensor will be interspersed with 0s, as opposed to right-hand-padded with 0s (like the LoRA B W13 weights).
e.g. ``` LoRA A output (w…
-
#31949 [Bugfix] Fix FusedMoE LoRA w2_output_size — ready — by xyang16 (合并于: 2026-01-09 13:54 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR fixes FusedMoE LoRA
`w2_output_size`. `w2_output_size` is currently incorrectly set to `w2_lora_a_stacked[0].shape[-2]`, making it become the rank. ## Test Plan
`pytest -s -v tests/lora/test_gptoss_tp.py` ## Test Result
…
-
#31970 [CI] [ROCm] Fix
`tests/entrypoints/test_grpc_server.py` on ROCm — rocm,ready,ci/build — by tjtanaa (合并于: 2026-01-09 12:54 (UTC+8)) [💬1 | +18/-4, 5 files | commented:4 approved:1] ## Purpose
Fix the `tests/entrypoints/test_grpc_server.py` CI issue https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?sid=019b9d04-9972-4807-9712-b517956d17b3 ``` ==================================== ERRORS ==================================== – ____ ERROR collecting tests/entrypoints/test_grpc_server.py ____ ImportError while importing test module '/vllm-workspace/tests/entrypoints/test_grpc_server.py'. …
-
#31993 [ROCm][CI] Fix test_token_classification.py::test_bert_models — rocm,ready — by divakar-amd (合并于: 2026-01-09 12:04 (UTC+8)) [💬1 | +12/-4, 1 files | commented:1 approved:2] This PR fixes mi325_1: Language Models Test (Extended Pooling):
`models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]`. The issues were caused by reasons similar to those mentioned in https://github.com/vllm-project/vllm/pull/31612
[!NOTE] Addresses ROCm-specific correctness in token classification tests.
- For
`test_bert_models` and `test_modernbert_models`, pass `model_kwargs={'attn_implementation': 'eager'}` to HF `AutoModelForToken…
-
#30437 [Bugfix] missing tokens occur in harmony streaming — frontend,ready,gpt-oss — by Ri0S (合并于: 2026-01-09 11:59 (UTC+8)) [💬8 | +13/-7, 2 files | commented:1 approved:2] ## Purpose Fixed an issue where in harmony streaming mode, when the engine yields more than one token at a time, only the last token is used.
FIX #28635 #30099
## Test Plan
uv run api_server.py --model openai/gpt-oss-120b --gpu-memory-utilization 0.95 --port 8000 --served-model-name gptoss120b --disable-log-request --tool-call-parser openai --enable-auto-tool-choice```python …
[关闭未合并 PR]
- #19469 [V1][Metrics] Add instance_id (hostname) label for prometheus metrics — needs-rebase,stale,v1 — by reidliu41 (关闭于: 2026-01-10 10:13 (UTC+8)) [💬7 | +10/-3, 2 files]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating
`supported_models.md` and `examples` for a new model.
## Purpose
fixes #17029 …
-
#32049 [CI/Build][Hardware][AMD] Fix test_engine_core_client.py — rocm,v1 — by rjrock (关闭于: 2026-01-10 08:29 (UTC+8)) [+23/-1, 1 files | commented:1 | 📝草稿] ## Purpose To get ROCm to pass on tests in test_engine_core_client.py, as run in AMD’s V1 Test e2e + engine testgroup. ## Test Plan
`pytest -v -s v1/engine/test_engine_core_client.py` ## Test Result Still some failures; draft at the moment.
Essential Elements of an Effective PR Description Checklist
...
- #30573 [Misc][Refactor] Separate router from FusedMoE class — needs-rebase — by bnellnm (closed at: 2026-01-10 08:17 (UTC+8)) [💬4 | +428/-248, 22 files | commented:1 | 📝 draft] ## Purpose Separate the existing router logic into a standalone class, `DefaultFusedMoERouter`. Needs https://github.com/vllm-project/vllm/pull/30519
## Test Plan
## Test Result
…
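What separating the router might look like in outline (a pure-Python sketch of a softmax top-k gate; the class and method names are illustrative, not the PR's actual `DefaultFusedMoERouter` API):

```python
import math

class RouterSketch:
    """Standalone top-k softmax router, decoupled from the fused MoE kernel."""

    def __init__(self, top_k):
        self.top_k = top_k

    def route(self, gate_logits):
        # numerically stable softmax over expert logits
        m = max(gate_logits)
        exps = [math.exp(x - m) for x in gate_logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # pick the top_k experts and renormalize their weights
        top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]
```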
- #31896 [Frontend] extract tool calls before reasoning to prevent marker abso… — frontend — by daniel-salib (closed at: 2026-01-10 07:26 (UTC+8)) [+156/-9, 2 files | commented:3] ## Purpose
Fix sporadic issue with Kimi K2 model where tool calls are lost and tool markers appear in reasoning content.
Kimi K2 can output in two modes: 1. With
tags: reasoning <|tool_calls_section_begin|>... 2. Without tags: reasoning text <|tool_calls_section_begin|>... …
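The extraction-before-reasoning idea can be sketched as follows (the marker string comes from the PR description; the function itself is illustrative):

```python
TOOL_MARKER = "<|tool_calls_section_begin|>"

# Split the raw model output at the tool-call marker *before* reasoning
# parsing, so the marker can never be absorbed into reasoning content.
def split_tool_calls(text):
    idx = text.find(TOOL_MARKER)
    if idx == -1:
        return text, None
    return text[:idx], text[idx + len(TOOL_MARKER):]
```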
- #31929 [ROCm][CI] Fix test script to respect Buildkite parallelism settings — rocm,ci/build — by AndreasKaratzas (closed at: 2026-01-10 03:28 (UTC+8)) [💬1 | +64/-37, 1 file | commented:2] Fixes the ROCm test runner script to respect Buildkite's native parallelism configuration instead of always overriding shard settings.
## Problem
When `parallelism: N` is set in the pipeline YAML, Buildkite spawns N separate jobs and sets `BUILDKITE_PARALLEL_JOB_COUNT=N` and `BUILDKITE_PARALLEL_JOB=0..N-1`.
The test commands use `$$BUILDKITE_PARALLEL_JOB_COUNT` and `$$BUILDKITE_PARALLEL_JOB`, which get substituted correctly. However, the `run-rocm-test.sh` script was: …
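Reading Buildkite's native sharding variables instead of overriding them can be sketched as (the env var names are Buildkite's; the returned keys are illustrative):

```python
import os

def shard_settings(env=os.environ):
    # Buildkite sets these when `parallelism: N` is configured; default to
    # a single unsharded job when they are absent.
    count = int(env.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    index = int(env.get("BUILDKITE_PARALLEL_JOB", "0"))
    return {"num_shards": count, "shard_id": index}
```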
- #31537 [Bugfix] Record metrics for aborted requests — needs-rebase,v1 — by jayhemnani9910 (closed at: 2026-01-10 02:23 (UTC+8)) [💬4 | +100/-26, 6 files | commented:7] When requests are aborted, the abort path bypasses the normal metrics recording flow. This causes `vllm:request_success_total{finished_reason="abort"}` to always show 0.
Fix by collecting abort statistics in OutputProcessor.abort_requests() and having LLMEngine/AsyncLLM record them via logger_manager.record().
Changes:
- Add AbortedRequestStats dataclass to capture abort stats
- Modify OutputProcessor.abort_requests() to optionally collect stats
- Add _record_abort_metrics() helper to LLMEngine …
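The change list above might look roughly like this (a sketch; `AbortedRequestStats` is named in the PR, the rest is illustrative):

```python
from dataclasses import dataclass

@dataclass
class AbortedRequestStats:
    """Stats captured on the abort path, which bypasses normal recording."""
    num_aborted: int = 0

def abort_requests(request_ids, collect_stats=True):
    # optionally collect stats so the caller can forward them to the
    # metrics logger, mirroring the described OutputProcessor change
    return AbortedRequestStats(num_aborted=len(request_ids)) if collect_stats else None
```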
- #30914 [Bug] Fix torch inductor issue (shape passing through sub-graphs) — ready,qwen,nvidia — by yewentao256 (closed at: 2026-01-09 23:26 (UTC+8)) [💬7 | +22/-12, 4 files | commented:5 changes:2] ## Purpose
Context: https://vllm-dev.slack.com/archives/C08U97ZRC0J/p1765934670913979
`VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL="debug" python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2`
There are two issues we found:
- the shape passed through different sub-graphs, which is described in an attached image …
- #31910 [Feature][Benchmarks] Test run: try different prompts until success — performance — by sducouedic (closed at: 2026-01-09 19:06 (UTC+8)) [💬1 | +55/-30, 2 files | commented:2] When running benchmarks with a test request, the request may fail for various reasons related to the request itself (for example, if the prompt is too long and does not fit the maximum context length). These changes allow retrying with other prompts from the dataset before failing definitively.
Closes #31881
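The retry behavior can be sketched as follows (names are illustrative, not the benchmark code's actual API):

```python
def first_successful_prompt(prompts, send):
    # try prompts in order; only fail once every candidate has failed
    last_err = None
    for prompt in prompts:
        try:
            return prompt, send(prompt)
        except Exception as err:  # e.g. prompt exceeds the max context length
            last_err = err
    raise RuntimeError("all test prompts failed") from last_err
```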
- #32015 test — no labels — by 1643661061leo (closed at: 2026-01-09 13:37 (UTC+8)) [💬1 | +199/-250, 1 file | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
- #31563 [Model] Support SentenceTransformers V6 reranker config — documentation,frontend,needs-rebase — by noooop (closed at: 2026-01-09 11:19 (UTC+8)) [💬2 | commented:1 | 📝 draft] ## Purpose
Following #30550 #31335
Users can use the latest powerful rerank models without manually setting any hf_overrides or loading any templates.
e.g. - BAAI/bge-reranker-v2-gemma - Qwen/Qwen3-Reranker-0.6B …