[vLLM GitHub Development Activity] 2026-03-21
[Overview]
- Window: 2026-03-21 11:34 (UTC+8) to 2026-03-22 11:34 (UTC+8)
- New issues: 12 (labels: bug: 5, feature request: 3, help wanted: 2, rocm: 2, good first issue: 1)
- Closed issues: 10
- New PRs: 43 (labels: ready: 15, rocm: 13, ci/build: 9, v1: 6, frontend: 6)
- Merged PRs: 18
- PRs closed without merging: 10
[New issues]
-
#37749 [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 — bug — by Pinockel (created: 2026-03-21 22:44 (UTC+8)) [💬1]
### Your current environment
Model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 + QuantTrio/Qwen3.5-27B-AWQ. Inference framework: vLLM 0.18.0. GPU hardware: multiple A100 40GB (one model per card). Deployment mode: vLLM as Docker.
Parameters: `--gpu-memory-utilization 0.90 --reasoning-parser qwen3` …
-
#37765 [Feature]: Consolidate GPTQ Quantization — help wanted, feature request — by robertgshaw2-redhat (created: 2026-03-22 04:45 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
We currently have two files, `gptq.py` and `gptq_marlin.py`. These are not both needed: we have now decoupled the quantization format integration from the kernels. We should delete `gptq.py` and consolidate everything into `gptq_marlin.py` …
-
#37777 [Bug]: [OOM] DeepSeek-R1 Out of Memory — bug — by hjjq (created: 2026-03-22 06:30 (UTC+8)) [💬1]
### Your current environment
4x GB200, vLLM nightly, cu130
### 🐛 Describe the bug
Likely caused by https://github.com/vllm-project/vllm/pull/37442#issuecomment-4104485267. Creating an issue in case someone else is also searching for this. …
-
#37773 [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 — no labels — by ccgibson (created: 2026-03-22 05:50 (UTC+8)) [💬1]
### Your current environment
- vLLM v0.18.0 (release, not nightly)
- 8x NVIDIA B200 (141 GB HBM each)
- Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
- CUDA 13.0, Driver 580.126.09
### 🐛 Describe the bug
EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation with `--max-model-len 262144`. This causes the model to produce repetitive/degenerate output until `max_tokens` is hit. The bug is 100% reproducible and occurs withi…
-
#37758 [Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 BF16 DP/EP — bug — by robertgshaw2-redhat (created: 2026-03-22 03:19 (UTC+8)) [💬1]
### Your current environment
B200, main
### 🐛 Describe the bug
Both of these fail:
`VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt` …
-
#37754 [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model — no labels — by TrevorS (created: 2026-03-22 01:44 (UTC+8)) [💬1]
## Summary
FlashInfer attention backend + MTP speculative decoding (`num_speculative_tokens=2`) crashes with "illegal memory access" on NVIDIA GB10 (SM121 / DGX Spark) when serving Nemotron-3-Super-120B-A12B-NVFP4 (GQA ratio = 16). The Triton attention backend works correctly.
## Reproduction
```bash
# Works (Triton attention):
vllm serve /models/nemotron-3-super --attention-backend triton_attn
```
…
-
#37753 [Feature]: Unify MoE "Oracles" with Class Structure — help wanted, good first issue, feature request — by robertgshaw2-redhat (created: 2026-03-22 01:29 (UTC+8)) [💬5]
### 🚀 The feature, motivation and pitch
We currently have the following MoE "oracles", which select the right MoE kernel for each model, in model_executor/layers/fused_moe/oracle:
- fp8
- nvfp4
- mxfp8
- unquantized …
-
#37736 [CI Failure]: Gemma3 OOMs with transformers backend — rocm, ci-failure — by AndreasKaratzas (created: 2026-03-21 13:32 (UTC+8)) [💬2]
### Test group
mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma
### Describe the failing test
This is not exactly a test failure, but it has been recommended to investigate further the OOM event for Gemma3, which is a 4B model. The intuition is that the fake tensor used for profiling is large enough to exceed the 64 GB of the MI250 GPUs; however, it has been suggested that this is still odd.
### 📝 History of failing test
…
-
#37748 [Feature]: Is there docker image support vllm and rocm 7.1+ — feature request, rocm — by BigFaceBoy (created: 2026-03-21 22:08 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
As the title says.
### Alternatives
No response
### Additional context
…
-
#37746 [Bug] prompt_logprobs causes livelock with IsHybrid models (Qwen3.5) in DP mode — no labels — by dengoswei (created: 2026-03-21 21:01 (UTC+8))
## Bug: `prompt_logprobs` causes livelock with `IsHybrid` models (Qwen3.5) in DP mode
### Your current environment
- vLLM version: 0.17.0 (V1 engine, default)
- GPU: H100-SXM-80GB × 8
- Python: 3.11.2
- OS: Linux 5.15
- CUDA: 12.x
…
-
#37745 [Bug]: Mooncake Connector: Decode nodes stuck in WAITING_FOR_REMOTE_KVS after Prefill node restart — bug — by SSmallMonster (created: 2026-03-21 20:21 (UTC+8))
### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 22.04.5 LTS (x86_64) ...
-
#37737 [Bug]: Missing logprobs for `<tool_call>` in streaming chat completions — bug — by sdpkjc (created: 2026-03-21 14:44 (UTC+8))
### Your current environment
- vLLM version: v0.15.0, v0.17.1
- Serving mode: OpenAI-compatible API server
- Model: Qwen/Qwen3-VL-8B-Thinking
- Request mode: streaming chat completions with logprobs=true
### 🐛 Describe the bug
…
[Closed issues]
-
#37773 [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 — no labels — by ccgibson (closed: 2026-03-22 05:52 (UTC+8)) [💬1]
### Your current environment
- vLLM v0.18.0 (release, not nightly)
- 8x NVIDIA B200 (141 GB HBM each)
- Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
- CUDA 13.0, Driver 580.126.09
### 🐛 Describe the bug
EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation with `--max-model-len 262144`. This causes the model to produce repetitive/degenerate output until `max_tokens` is hit. The bug is 100% reproducible and occurs withi…
-
#31689 [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization — help wanted, feature request — by robertgshaw2-redhat (closed: 2026-03-22 04:41 (UTC+8)) [💬9]
### 🚀 The feature, motivation and pitch
We are in the process of cleaning up the quantization integrations in vLLM (see the FusedMoE refactor PRs I am working on).
In general, this means we are trying to separate the concerns of the quantization INTEGRATION (on-disk format, responsible for weight loading) from the quantization KERNEL (runtime format, responsible for executing at runtime).
For GPTQ/AWQ, we have tech debt in that we have different quantization integrations (`gptq.py`, `gptq_marlin`…
-
#34256 [Model Performance SIG]: Improve MoE Oracle — feature request, rocm, model-bash — by robertgshaw2-redhat (closed: 2026-03-22 04:39 (UTC+8)) [💬5]
### 🚀 The feature, motivation and pitch
We recently created a programmatic interface for selecting NVFP4 and FP8 MoE kernels.
Currently, we have a single "ordering" of kernels for all hardware, device configs, etc. We should make this oracle smarter by having different defaults for different devices/models.
We should do this for:
- DeepSeekV3
- Qwen3MoE
- GLM …
-
#35688 [CI Failure][Tool Calling]: — ci-failure — by robertgshaw2-redhat (closed: 2026-03-22 04:39 (UTC+8))
### Name of failing test
entrypoints/openai/test_completion_with_function_calling.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. a bug in transformers)
…
-
#37618 [Bug]: FP8 KV cache on B200 with Qwen3.5 has degraded accuracy — bug — by robertgshaw2-redhat (closed: 2026-03-22 04:38 (UTC+8)) [💬10]
### Your current environment
See the run here:
- https://buildkite.com/vllm/ci/builds/57133#019d0808-caa0-41a2-88d2-9a4bd389efdf
Removing the FP8 KV cache allows this to pass.
### 🐛 Describe the bug
…
-
#37701 [Bug]: Segfault in FlashInfer autotuning for NVFP4 latency backend on Qwen3-30B-A3B-NVFP4 — bug — by baonudesifeizhai (closed: 2026-03-22 04:01 (UTC+8)) [💬2]
### Your current environment
sm100
### 🐛 Describe the bug
### Description …
-
#37730 Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM) — no labels — by glaziermag (closed: 2026-03-22 01:41 (UTC+8))
### Problem Description
This issue documents a benchmark comparing SGLang (RadixAttention) and vLLM (PagedAttention) to observe the "Scaling Zero-Sum" trade-off. SGLang's radix tree optimizes prefix sharing across requests but uses Python-based routing, which can be vulnerable to Python Global Interpreter Lock (GIL) contention under high concurrency. vLLM mitigates the Python GIL bottleneck by offloading its PagedAttention im…
-
#36094 [Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy — bug, qwen, nvidia — by pavanimajety (closed: 2026-03-21 22:14 (UTC+8)) [💬27]
### Your current environment
The output of `python collect_env.py` …
-
#35286 [Bug]: Qwen3.5-MoE failed with enable_lora — bug — by hjh0119 (closed: 2026-03-21 19:30 (UTC+8)) [💬7]
### Your current environment
The output of `python collect_env.py`: Collecting environment information...
-
#37422 [Feature]: Add kv_transfer_params to Responses API for PD disaggregation — rocm — by bongwoobak (closed: 2026-03-21 13:48 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
The Chat Completions API (`/v1/chat/completions`) supports `kv_transfer_params` on both request and response, enabling PD disaggregation. However, the Responses API (`/v1/responses`) does not have this field, so PD disaggregation cannot be used through it. Since the Responses API is becoming more widely adopted, it would be valuable to add `kv_transfer_params` support for feature parity with Chat Completions. The change is straightforward: add the field…
[New PRs]
-
#37775 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing — bug, rocm, ready, v1 — by AndreasKaratzas (created: 2026-03-22 06:25 (UTC+8)) [💬6 | +63/-1, 2 files | commented: 2, approved: 1]
- PR #37303 changed `num_prompt_tokens` in `InputBatch` from a plain `np.zeros()` array to a pinned-memory-backed numpy view (`torch.zeros(..., pin_memory=True).numpy()`). `get_pooling_metadata()` calls `torch.from_numpy(self.num_prompt_tokens[:self.num_reqs])`, which creates a tensor that shares the underlying pinned buffer rather than copying the data.
- Because pinned memory is used for async GPU transfers, the shared buffer can be modified between the time `prompt_lens` is created and wh…
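The aliasing failure mode described above can be illustrated without torch at all: a view into a reused buffer observes later writes, while an explicit copy does not. A minimal stand-in sketch (bytearray/memoryview in place of the pinned torch buffer; names and values are illustrative):

```python
# A reused buffer holding per-request prompt lengths (stand-in for the
# pinned-memory array in InputBatch; values are illustrative).
buffer = bytearray([3, 5, 2, 7, 0, 0, 0, 0])

view = memoryview(buffer)[:4]   # shares storage, like torch.from_numpy(...)
snapshot = bytes(buffer[:4])    # defensive copy: stable regardless of reuse

buffer[0] = 99                  # buffer gets overwritten for a later batch

print(view[0], snapshot[0])     # the view sees 99; the copy still holds 3
```

The fix's effect is the `snapshot` behavior: metadata derived from the buffer no longer changes under it mid-request.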
-
#37787 [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 — bug, rocm, ready, gpt-oss — by AndreasKaratzas (created: 2026-03-22 10:52 (UTC+8)) [💬1 | +83/-13, 8 files | commented: 1]
Fixes several issues introduced by #37128 that broke gpt-oss on ROCm and LoRA on NVIDIA.
- Restore the gfx950 gate for CK mxfp4 backend selection. The old code only picked CK on gfx950 via on_gfx950(); the refactor dropped this and let CK get selected on gfx942, where it crashes.
- Restore the CK_MXFP4_MOE_DIM_ALIGNMENT (256) check. Models with intermediate_size not aligned to 256 (like gpt-oss-20b at 2880) hit a reshape error in aiter shuffle_scale_a16w4. Added is_supported_config to AiterExperts so…
-
#37786 Revert "[MoE Refactor] Mxfp4 oracle rebased" (#37128) — documentation, rocm, gpt-oss, nvidia — by zhewenl (created: 2026-03-22 09:06 (UTC+8)) [💬2 | +1369/-1695, 18 files | commented: 1 | 📝 draft]
## Revert of #37128
This reverts commit 87bd91892f8c63ed9aeb2ad2e701472ead8be84c (PR https://github.com/vllm-project/vllm/pull/37128).
### Reason
CI build #57431 detected 1 new failure linked to this PR:
- LoRA TP (Distributed): `test_gpt_oss_lora_tp2[True-False]` fails with the LoRA adapter producing garbage output (`!!!...`) instead of the expected SQL query. This PR changed `gpt_oss_triton_kernels_moe.py` and MoE layer code which is exercised by thi…
-
#37782 [Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback — bug, rocm, ready, multi-modality — by AndreasKaratzas (created: 2026-03-22 08:02 (UTC+8)) [+5/-1, 1 file | commented: 2]
Fixes a regression after https://github.com/vllm-project/vllm/pull/37058. When multiple threads concurrently fail sf_open_virtual() on unsupported formats (e.g. MP4 bytes passed as audio), sf_error(NULL) may return 0 instead of the actual error code due to a global-state race in libsndfile. This caused a code=0 LibsndfileError ("Garbled error message from libsndfile") to bypass the pyav fallback in load_audio(), crashing requests that send video bytes as audio_url (e.g. audio-in-video). In this P…
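The shape of the guard can be sketched with stub decoders standing in for soundfile/pyav (the `LibsndfileError` class and `decode_*` helpers below are illustrative, not vLLM's actual code):

```python
class LibsndfileError(Exception):
    def __init__(self, code):
        super().__init__(f"libsndfile error {code}")
        self.code = code

def decode_with_soundfile(data: bytes):
    # Simulate the race: a concurrent failure zeroed libsndfile's global
    # error state, so the raised error carries code 0 ("garbled message").
    raise LibsndfileError(0)

def decode_with_pyav(data: bytes):
    return "decoded-by-pyav"

def load_audio(data: bytes):
    try:
        return decode_with_soundfile(data)
    except LibsndfileError:
        # Fixed behavior: fall back for ANY soundfile failure, including the
        # untrustworthy code == 0 produced by the global-state race.
        return decode_with_pyav(data)

print(load_audio(b"mp4-bytes"))
```

The key design point is that by the time the handler runs, sf_open_virtual() has already failed, so even a "no error" code cannot be trusted as a reason to re-raise.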
-
#37781 [CI] Skip ISAAC multimodal tests due to broken upstream HF model weights — rocm, ready, multi-modality — by AndreasKaratzas (created: 2026-03-22 07:29 (UTC+8)) [+5/-1, 1 file | commented: 2]
- Skip isaac multimodal generation tests.
- The upstream HF repo consolidated weights from 3 shards into a single `model.safetensors` but left a stale `model.safetensors.index.json` referencing the old shard filenames, causing weight loading to fail.
- This was introduced by upstream repo commits on 2026-03-20 ("5.0 structure fixes"), not a vLLM code change.
## Test plan
pytest -s -v tests/models/multimodal/generation/test_common.py::test_single_image_models[isaac-test_case35]
cc @ke…
- #37780 [ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 07:16 (UTC+8)) [+14/-0, 1 file | commented: 1] The labeled tests stay mandatory for MI250. We are making them optional for MI325 and MI355 since they are already covered on MI250; this takes some unnecessary workload off AMD CI.
-
#37778 [ROCm][CI] Added missing resampy dependency for MM audio tests — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 06:34 (UTC+8)) [+2/-0, 1 file | commented: 2]
Follow-up on: https://github.com/vllm-project/vllm/pull/37058
Addresses: Multi-Modal Models (Standard) 3: llava + qwen2_vl
Related: https://buildkite.com/vllm/amd-ci/builds/6757/steps/canvas?sid=019d0efa-d61c-4621-915a-c06a2f968717&tab=output
cc @kenroche
-
#37774 [ROCm][CI] Close missing quote in kernels/moe block in run-amd-test.sh — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 06:19 (UTC+8)) [+2/-2, 1 file | commented: 1, approved: 1]
Addresses a regression from #32700. Fixes a syntax error in run-amd-test.sh where the kernels/moe ignore block inside apply_rocm_test_overrides was missing the closing `"` on the cmds string assignment.
cc @kenroche
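This class of bug is easy to reproduce: an unterminated double quote in a string assignment is a shell syntax error. A hypothetical reconstruction of the fixed pattern (the real variable contents and flags live in run-amd-test.sh):

```shell
# Append an ignore flag to a command string; the regression dropped the
# closing quote on an assignment like the second one, breaking the script.
cmds="pytest -x"
cmds="${cmds} --ignore=kernels/moe"
echo "${cmds}"
```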
- #37783 [do not merge][release] Move agent queue to Release cluster queues — ci/build — by khluu (创建于: 2026-03-22 09:00 (UTC+8)) [💬1 | +24/-24, 1 files | commented:1] for better isolation & secret protection
-
#37785 Revert "[Frontend] Remove librosa from audio dependency" (#37058) — performance, frontend, ci/build, multi-modality — by zhewenl (created: 2026-03-22 09:05 (UTC+8)) [+188/-247, 18 files | commented: 1 | 📝 draft]
## Revert of #37058
This reverts commit c7f98b4d0a63b32ed939e2b6dfaa8a626e9b46c4 (PR https://github.com/vllm-project/vllm/pull/37058).
### Reason
CI build #57431 detected 1 new failure linked to this PR:
- Entrypoints Integration (API Server 1): `test_online_audio_in_video_interleaved` fails with `Error opening <_io.BytesIO object>: (Garbled error message from libsndfile)` after librosa was removed from the audio dependencies.
### Auto-generated T…
-
#37776 [MoE] Unify MoE oracles with class structure — needs-rebase — by Zijun9 (created: 2026-03-22 06:26 (UTC+8)) [💬5 | +2217/-1547, 8 files | commented: 1]
## Purpose
Resolves #37753. Introduces `MoEKernelOracle(ABC, Generic[BackendT])` as a base class for all MoE kernel selection oracles. Each oracle (FP8, NvFP4, MXFP4, MXFP8, Unquantized) now inherits from this base class, standardizing the 4 core operations:
- `select_backend`: choose the best kernel backend
- `convert_to_kernel_format`: shuffle weights for a backend
- `make_quant_config`: build a `FusedMoEQuantConfig`
- `make_kernel`: construct the `FusedMoEKernel` …
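A minimal sketch of that base-class shape, assuming only the four method names from the summary (signatures, return types, and the toy FP8 subclass are illustrative, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

BackendT = TypeVar("BackendT")

class MoEKernelOracle(ABC, Generic[BackendT]):
    """Base class standardizing the 4 core oracle operations."""
    @abstractmethod
    def select_backend(self) -> BackendT: ...
    @abstractmethod
    def convert_to_kernel_format(self, weights): ...
    @abstractmethod
    def make_quant_config(self): ...
    @abstractmethod
    def make_kernel(self): ...

class Fp8Oracle(MoEKernelOracle[str]):
    """Toy subclass: backend is just a string, config just a dict."""
    def select_backend(self) -> str:
        return "triton"                  # stand-in backend choice
    def convert_to_kernel_format(self, weights):
        return weights                   # no-op shuffle in this sketch
    def make_quant_config(self):
        return {"dtype": "fp8"}
    def make_kernel(self):
        return (self.select_backend(), self.make_quant_config())

print(Fp8Oracle().make_kernel())
```

The benefit of the ABC is that every per-format oracle exposes the same four entry points, so call sites no longer special-case each quantization format.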
-
#37784 [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle — no labels — by jikunshang (created: 2026-03-22 09:04 (UTC+8)) [💬1 | +54/-101, 3 files | commented: 1]
## Purpose
Follow-up to https://github.com/vllm-project/vllm/pull/37128; moves xpu mxfp4 support into the oracle as well.
## Test Plan
## Test Result
...
-
#37759 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ — ready, nvidia — by robertgshaw2-redhat (created: 2026-03-22 03:43 (UTC+8)) [+2/-2, 3 files | commented: 1, approved: 1]
## Summary
- Moves `flashinfer_cutedsl_moe.py` from the `fused_moe/` root into the `fused_moe/experts/` subdirectory, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates the import in `oracle/nvfp4.py` to point to the new location
- Updates the import in `tests/kernels/moe/test_cutedsl_moe.py` to point to the new location
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model…
-
#37779 [Perf] Optimize glm4.xv VIT — no labels — by KKSK-DON (created: 2026-03-22 06:49 (UTC+8)) [+1/-2, 1 file | commented: 1]
`Glm4vVisionTransformer.forward()` has 24 unnecessary CUDA D2H synchronizations caused by `max_seqlen.item()` being called on a …
-
#37757 [UX] Logging - Improve Startup Error Logs — ready, v1 — by Waknis (created: 2026-03-22 03:14 (UTC+8)) [💬4 | +721/-83, 6 files | commented: 5]
Fixes #31683. Also addresses #37714.
## Summary
This PR scopes #31683 to startup failures only, matching the guidance in the issue thread.
- Propagate structured startup failures from the worker ready pipe through the engine startup handshake.
- Preserve the innermost child exception type, message, source, and traceback in the surfaced startup error.
- Improve failed-process summaries during startup and add focused regression coverage for the worker and engine startup paths.
## Why this…
-
#37770 [Quant] Consolidate GPTQ: remove gptq.py, rename gptq_marlin.py to gptq.py — rocm, v1, ready-run-all-tests — by robertgshaw2-redhat (created: 2026-03-22 05:35 (UTC+8)) [💬1 | +58/-557, 17 files | commented: 1]
## Why this is not a duplicate
Checked open PRs for GPTQ consolidation — none found: `gh pr list --repo vllm-project/vllm --state open --search "gptq consolidate"`
## Summary
The Marlin kernels (gptq_marlin.py) support all practical GPTQ use cases (4-bit and 8-bit symmetric quantization), making the legacy exllama-based `gptq.py` redundant. This PR consolidates them into a single file. …
-
#37772 Consolidate AWQ quantization into single awq_marlin.py file — no labels — by robertgshaw2-redhat (created: 2026-03-22 05:49 (UTC+8)) [+253/-279, 3 files | commented: 1 | 📝 draft]
## Summary
- Merges `awq.py` and `awq_marlin.py` into a single file, eliminating the circular import between them.
- `awq.py` becomes a backward-compat shim that re-exports `AWQConfig` and `AWQLinearMethod` from `awq_marlin.py`.
- `__init__.py` is updated to import `AWQConfig` directly from `awq_marlin`.
- Follows the same consolidation structure as `gptq_marlin.py`.
Why this is not a duplicate: the previous attempt (#37768) was reverted due to an accidental push, not a correctness issue. This PR re-…
-
#37766 [CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI — ci/build — by yurun00 (created: 2026-03-22 05:01 (UTC+8)) [💬1 | +28/-7, 2 files | commented: 2]
## Purpose
This PR resolves a dependency deadlock when installing the test dependencies used in CI (CUDA only). It downgrades `opentelemetry-proto` from 1.36.0 to 1.35.0 and upgrades `protobuf` from 6.33.2 to 6.33.6 in `requirements/test.txt`.
Reasoning: currently, `uv pip install -r requirements/common.txt -r requirements/dev.txt --torch-backend=auto` fails because `protobuf` is pinned to 6.33.2 and `opentelemetry-sdk` is pinned to 1.35.0, which is incompatible…
-
#37771 [Quant] Consolidate AWQ quantization into a single file — documentation, performance, rocm, needs-rebase, ci/build, gpt-oss, nvidia — by robertgshaw2-redhat (created: 2026-03-22 05:38 (UTC+8)) [💬2 | +3132/-2511, 83 files | commented: 1 | 📝 draft]
## Summary
- Merges `awq.py` and `awq_marlin.py` into a single consolidated module (`awq_marlin.py`), following the same structure as `gptq_marlin.py`.
- Adds `AWQMarlinLinearKernel` in `vllm/model_executor/kernels/linear/mixed_precision/awq_marlin.py`, so `AWQMarlinLinearMethod` now delegates weight repacking and inference through the kernel abstraction instead of calling marlin utils directly.
- Extracts shared AWQ weight-parameter creation (`_create_awq_weight_params`) used by both the legacy a…
- Merges
-
#37769 [Quantization] Merge gptq_marlin into gptq, remove slow GEMM kernel — performance, rocm — by robertgshaw2-redhat (created: 2026-03-22 05:22 (UTC+8)) [+753/-1185, 25 files | commented: 1]
## Summary
- `gptq.py` is replaced by the former `gptq_marlin.py` (Marlin kernel). `GPTQConfig.get_name()` now returns `"gptq"`, and `override_quantization_method` is removed (no longer needed since gptq IS Marlin). `gptq_marlin.py` is reduced to a thin backward-compat re-export shim. `--quantization gptq_marlin` auto-converts to `gptq` with a deprecation warning in `_verify_quantization`.
- The old exllama/GEMM-based `GPTQConfig`/`GPTQLinearMethod` are removed; backward-compa…
- #37768 Revert "Consolidate AWQ quantization into single awq_marlin.py file" — no labels — by robertgshaw2-redhat (created: 2026-03-22 05:15 (UTC+8)) [+279/-253, 3 files | commented: 1] Revert of an accidental push.
-
#37767 [Quantization] Merge gptq_marlin into gptq, deprecate gptq_marlin — performance — by robertgshaw2-redhat (created: 2026-03-22 05:05 (UTC+8)) [+730/-1176, 21 files | commented: 1]
## Summary
- `gptq_marlin.py` is a strict superset of `gptq.py` — the marlin kernel handles all GPTQ-compatible checkpoints on sm75+, and the 4-bit gptq_gemm fallback was already marked buggy in the old code. This PR merges them.
- Deletes `gptq_marlin.py` and rewrites `gptq.py` with the marlin-based implementation. `GPTQMarlinConfig` → `GPTQConfig` (`get_name()` returns `"gptq"`); backward-compat aliases are preserved for any external callers (`GPTQMarlinConfig`, `GPTQMarlinLinearMethod`, `GPT…
-
#37760 [MoE] Move GPT OSS Triton kernel experts into fused_moe/experts/ — ready, gpt-oss — by robertgshaw2-redhat (created: 2026-03-22 03:57 (UTC+8)) [+12/-12, 7 files | commented: 1]
## Summary
- Moves `gpt_oss_triton_kernels_moe.py` from the `fused_moe/` root into `fused_moe/experts/`, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates all import sites: `oracle/mxfp4.py` (TRITON and TRITON_UNFUSED backends), `vllm/lora/layers/fused_moe.py`, `vllm/model_executor/layers/quantization/quark/quark_moe.py`, `tests/kernels/moe/test_modular_oai_triton_moe.py`, `tests/kernels/mo…
- Moves
-
#37761 [MoE] Move DEEP_GEMM into experts/ subdirectory — documentation, performance, ready — by robertgshaw2-redhat (created: 2026-03-22 04:01 (UTC+8)) [💬1 | +24/-20, 14 files | commented: 1]
## Summary
Part of the ongoing migration to consolidate MoE kernel implementations under `vllm/model_executor/layers/fused_moe/experts/`.
- Moves `fused_moe/deep_gemm_moe.py` → `fused_moe/experts/deep_gemm_moe.py`
- Moves `fused_moe/batched_deep_gemm_moe.py` → `fused_moe/experts/batched_deep_gemm_moe.py`
- Updates all import sites in `vllm/`, `benchmarks/`, and `tests/`
This is not a duplicate of any existing PR: #35927 moves prepare/finalize methods; this PR moves the DeepGemm expert kernel …
- Moves
-
#37764 [ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 — rocm, ready — by AndreasKaratzas (created: 2026-03-22 04:40 (UTC+8)) [+3/-3, 1 file | commented: 1]
- Fixes 3 failing tests in `test_rocm_unquantized_gemm.py` introduced in #34709: `get_cu_count` was renamed to `num_compute_units` in #35042, but the test PR was authored against the old name.
cc @kenroche
- Fix 3 failing tests in
-
#37763 [ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support — rocm, ready — by AndreasKaratzas (created: 2026-03-22 04:18 (UTC+8)) [+5/-1, 1 file | commented: 1]
When `VLLM_USE_MEGA_AOT_ARTIFACT=1` is set on PyTorch < 2.10.0 (e.g. ROCm), `standalone_compile` correctly falls back to non-AOT mode, but `set_functorch_config()` still leaves `bundled_autograd_cache` enabled because it only checks the env var, not the PyTorch version. This causes `test_moe_startup[1]` to fail: `assert counters["aot_autograd"]["total"] == 0` (got 30). In this PR, we gate on both `VLLM_USE_MEGA_AOT_ARTIFACT` and `is_torch_equal_or_newer("2.10.0")` befor…
-
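The combined gate can be sketched as a pure function (names mirror `VLLM_USE_MEGA_AOT_ARTIFACT` and `is_torch_equal_or_newer`, but this is an illustrative stand-in, not the PR's code):

```python
def mega_aot_enabled(env_value: str, torch_version: str) -> bool:
    """Enable the AOT cache only when BOTH conditions hold."""
    # Crude major.minor comparison standing in for is_torch_equal_or_newer.
    at_least_2_10 = tuple(int(p) for p in torch_version.split(".")[:2]) >= (2, 10)
    return env_value == "1" and at_least_2_10

print(mega_aot_enabled("1", "2.10.0"))  # True
print(mega_aot_enabled("1", "2.9.1"))   # False: torch < 2.10 falls back
print(mega_aot_enabled("0", "2.10.0"))  # False: env var unset
```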
#37762 [MoE] Move nixl_ep and mori prepare/finalize into fused_moe/prepare_finalize/ — kv-connector — by robertgshaw2-redhat (created: 2026-03-22 04:02 (UTC+8)) [+3/-3, 4 files | commented: 1]
## Summary
- Moves `nixl_ep_prepare_finalize.py` and `mori_prepare_finalize.py` from the `fused_moe/` root into `fused_moe/prepare_finalize/`, alongside the existing `naive_dp_ep.py` and `no_dp_ep.py`
- Updates all import sites: `fused_moe/all2all_utils.py` (relative imports updated to `.prepare_finalize.*`), `tests/kernels/moe/modular_kernel_tools/mk_objects.py`
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model_executor/layers/fused_moe/` into s…
-
#37756 [Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning — ready — by mmangkad (created: 2026-03-22 01:48 (UTC+8)) [💬1 | +23/-0, 3 files | commented: 1, approved: 2]
## Summary
Adds benchmarked optimal all-reduce communicator config values for SM 10.3 (B300/GB300). Depends on #37755 for allreduce fusion to be auto-enabled on SM 10.3.
## Config Values
| Config | ws=2 | ws=4 | ws=6 | ws=8 |
|--------|------|------|------|------|
…
-
#37755 [Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) — ready — by mmangkad (created: 2026-03-22 01:46 (UTC+8)) [💬1 | +1/-1, 1 file | commented: 1, approved: 1]
## Summary
Enables allreduce fusion by default for SM 10.3 (B300/GB300) by using `is_device_capability_family(100)` instead of `is_device_capability(100)` in `enable_allreduce_rms_fusion`.
## Accuracy Test
```
vllm serve nvidia/DeepSeek-V3.2-NVFP4 --tensor-parallel-size 2 --tokenizer-mode deepseek_v32
```
…
-
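The difference between the exact check and the family check can be illustrated with a toy helper (this is a stand-in for vLLM's implementation, assuming only that "family 100" means all SM 10.x compute capabilities):

```python
def is_capability_family(major: int, minor: int, family: int) -> bool:
    """Toy family check: SM major.minor belongs to family `family`
    when the major version matches family // 10 (any minor revision)."""
    return major == family // 10

print(is_capability_family(10, 0, 100))  # B200, SM 10.0 -> True
print(is_capability_family(10, 3, 100))  # B300/GB300, SM 10.3 -> True
print(is_capability_family(9, 0, 100))   # H100, SM 9.0 -> False
```

An exact `is_device_capability(100)`-style check would only match SM 10.0, which is why B300/GB300 (SM 10.3) previously missed the fusion default.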
#37752 [Test] Add unit tests for `make_copy_and_call` — no labels — by SoluMilken (created: 2026-03-22 01:25 (UTC+8)) [+136/-11, 2 files | commented: 2]
## Purpose
- Add unit tests for `make_copy_and_call` in `tests/compile/test_backends.py`, covering:
  - sym tensor is copied into the pre-allocated buffer; callable_fn receives a buffer view
  - oversize input raises before callable_fn is reached
  - a second call overwrites the buffer; out-of-range rows are left untouched
  - non-contiguous indices with a static passthrough arg
- Refactor `make_copy_and_call`: remove lazy buffer initialization — callers always pass a pre-allocated buf…
-
#37751 [Quantization] Route SM120 GPUs to CUTLASS MXFP4 MoE backend — needs-rebase, nvidia — by Tib-Gridello (created: 2026-03-21 23:43 (UTC+8)) [💬2 | +3/-2, 1 file | commented: 1]
## Summary
SM120 GPUs (RTX PRO 6000 Blackwell, RTX 5090) fall back to the slow Marlin backend for MXFP4 MoE models because `get_mxfp4_backend()` only checks `is_device_capability_family(100)`. SM120 is family 120, not 100, so it is excluded. FlashInfer already ships SM120 CUTLASS kernel infrastructure (`gen_cutlass_fused_moe_sm120_module`, `CutlassTileConfigSM120`, `COMPILE_BLACKWELL_SM120_TMA_GROUPED_GEMMS`). This 1-line change routes SM120 to that existing path.
## Change
Add `is_device_cap…
-
#37750 fix: specify device for torch.Event to prevent multi-GPU issues — v1 — by xueliangyang-oeuler (created: 2026-03-21 22:50 (UTC+8)) [+11/-9, 2 files | commented: 1]
When torch.Event() is used without a device parameter, the event may be created on a different device than intended in multi-GPU scenarios. This can lead to deadlocks in CUDA graph execution paths, especially with async scheduling and prefix caching. This fix adds the device parameter to all torch.Event() calls in:
- gpu_model_runner.py: prepare_inputs_event, transfer_event, draft_token_ids_event, num_accepted_tokens_event, valid_sampled_token_count_event, asyn…
-
#37735 [Feature]: IndexCache support for DSA models — deepseek — by chaunceyjiang (created: 2026-03-21 13:09 (UTC+8)) [+73/-11, 4 files | commented: 3]
## Purpose
Fixes https://github.com/vllm-project/vllm/issues/37684
## Test
```
export VLLM_INDEXCACHE_ENABLE=1
vllm serve /mnt/data4/models/deepseek-ai/DeepSeek-V3___2 -tp=8 --tokenizer-mode deepseek_v32 --enable-auto-tool-choice --tool-call-parser deepseek_v32 --reasoning-parser deepseek_v3 --hf-overrides '{"index_topk_freq": 4}'
```
…
-
#37747 fix: preserve logprobs for control tokens in streaming tool calls — frontend — by xueliangyang-oeuler (created: 2026-03-21 21:30 (UTC+8)) [+1/-3, 1 file | commented: 1]
When streaming tool_calls chunks with control tokens (e.g. `<tool_call>`), the delta_message may be None but logprobs should still be preserved. Previously, chunks with a None delta_message were skipped entirely when return_token_ids was False, causing logprobs to be lost. Now we also check whether logprobs were requested before skipping the chunk, ensuring control-token logprobs are included in the streaming response.
Fixes #37737
## Purpose …
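The skip decision can be sketched as a small predicate (function name and signature are illustrative, not vLLM's actual streaming code):

```python
def should_skip_chunk(delta_message, logprobs, return_token_ids=False):
    """Decide whether a streaming chunk can be dropped entirely."""
    if delta_message is not None:
        return False          # there is visible delta content to emit
    if return_token_ids:
        return False          # token ids must reach the client
    # Fixed behavior: a None delta is only skippable when it carries
    # no logprobs either; previously it was dropped unconditionally.
    return logprobs is None

print(should_skip_chunk(None, logprobs={"<tool_call>": -0.1}))  # False: kept
print(should_skip_chunk(None, logprobs=None))                   # True: dropped
```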
-
#37741 [Responses API] Fix ParsableContext: defer parsing to end of generation — frontend — by esmeetu (created: 2026-03-21 16:01 (UTC+8)) [💬3 | +233/-5, 3 files | commented: 2]
## Summary
- Bug: `ParsableContext.append_output()` called `parser.process()` on every streaming delta, but `ResponsesParser.process()` internally calls `extract_reasoning()` and `extract_tool_calls()`, which expect complete output text. This caused incorrect parsing when reasoning content (`<think>...</think>`) was split across multiple deltas.
- Fix: Accumulate text across `append_output()` calls and run a single full parse via `_ensure_final_parse()` when results are first accessed. …
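The accumulate-then-parse pattern behind the fix can be sketched as follows (class and method names are illustrative; the toy "parser" just splits out a think block):

```python
class DeferredParser:
    """Accumulate streaming deltas; parse once when results are read."""
    def __init__(self):
        self._text = []
        self._parsed = None

    def append_output(self, delta: str):
        self._text.append(delta)      # no per-delta parser.process() call
        self._parsed = None           # invalidate any earlier parse

    def result(self):
        if self._parsed is None:      # lazy, _ensure_final_parse-style
            full = "".join(self._text)
            think, _, rest = full.partition("</think>")
            self._parsed = (think.removeprefix("<think>"), rest)
        return self._parsed

p = DeferredParser()
# The <think> tag arrives split across deltas, which would confuse a
# per-delta parser; the deferred parse sees the complete text.
for delta in ["<th", "ink>plan", "</think>", "answer"]:
    p.append_output(delta)
print(p.result())
```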
-
#37740 [Responses API] Fix tool_choice=required: WebSearch crash, parallel tool merge, JSON fallback — frontend, tool-calling — by esmeetu (created: 2026-03-21 16:01 (UTC+8)) [💬3 | +556/-2, 6 files | commented: 2]
## Summary
Three independent bug fixes for `tool_choice=required` in the Responses API:
- WebSearchTool crash (`tool_parsers/utils.py`): `_get_json_schema_from_tools` crashes when the tools list includes non-function types (e.g. `WebSearchTool`) because they lack `.name`/`.parameters`. Now filters to only `FunctionTool`/`ChatCompletionToolsParam` before schema generation.
- Consecutive function_call merge (`responses/utils.py`): consecutive `ResponseFunctionToolCall` items in the input (pa…
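The first fix is a type filter before schema generation; a minimal sketch with dataclass stand-ins for the real request types (names mirror the summary but the shapes are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class FunctionTool:
    name: str
    parameters: dict = field(default_factory=dict)

@dataclass
class WebSearchTool:
    # Deliberately has no .name / .parameters, like the crashing case.
    max_results: int = 5

def function_tools_only(tools):
    """Keep only function-style tools before building a JSON schema."""
    return [t for t in tools if isinstance(t, FunctionTool)]

tools = [FunctionTool("lookup"), WebSearchTool(), FunctionTool("calc")]
print([t.name for t in function_tools_only(tools)])
```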
-
#37744 [Bugfix] Fix PyTorch stable ABI compatibility for permute_cols — bug, ci/build — by kilork (created: 2026-03-21 16:53 (UTC+8)) [💬1 | +6/-4, 3 files | commented: 1]
## Purpose
Fixes a build failure when combining PR #37491 (CUTLASS upgrade to v4.4.2) with commit 8b10e4fb (migrate permute_cols to the libtorch stable ABI). The build fails with:
`static_assert(std::is_trivially_copyable_v<T>); error: non-static data member 'torch::stable::detail::ToImpl<const torch::stable::Tensor&>::call(...)::Result::t' in a union may not have reference type`
## Root Cause …
-
#37742 [Responses API] Add ToolChoiceRequiredLogitsProcessor for thinking models — v1, tool-calling — by esmeetu (created: 2026-03-21 16:02 (UTC+8)) [💬3 | +617/-8, 5 files | commented: 2]
## Summary
- Problem: `tool_choice=required` with thinking models (QwQ, Kimi K2, DeepSeek R1, etc.) fails because guided generation corrupts `<think>` reasoning tokens. The model either produces garbled reasoning or terminates before generating any tool calls.
- Solution: Add a `ToolChoiceRequiredLogitsProcessor` that suppresses all stop/EOS tokens from the start of generation, forcing the model to produce tool calls. Activated automatically when `ToolParser.adjust_request()` detects a th…
-
#37743 [CI] replace shellcheck script with shellcheck-py hook — ci/build, v1, kv-connector — by SoluMilken (created: 2026-03-21 16:50 (UTC+8)) [+57/-79, 6 files | commented: 2]
## Purpose
Related to #36977 (no activity since 2026-03-13). Replaces the custom `tools/pre_commit/shellcheck.sh` with the `shellcheck-py` pre-commit hook. The old script only worked on Linux x86_64; `shellcheck-py` ships shellcheck as a Python wheel and is cross-platform. Also fixes real issues surfaced by the new hook: SC2089/SC2090 in `run-multi-node-test.sh` (string-quoted GPU args → bash array), SC2048 in `spec_decode_acceptance_test…
-
#37738 Fix: preserve streaming logprobs — frontend — by sdpkjc (created: 2026-03-21 14:50 (UTC+8)) [+15/-7, 1 files | commented:5] ## Purpose
Fixes #37737.
When a streaming tool parser suppresses a control-token chunk by returning `None`, the chat streaming path can skip the whole chunk even if it still carries logprobs. This change keeps emitting an empty delta chunk when per-token metadata must be preserved.
## Test Plan
Run the streaming chat completion flow with `logprobs=true` and reproduce the tool-call path with `Qwen/Qwen3-VL-8B-Thinking`. Confirm that `<tool_call>` logprobs are preserved even when the p… -
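A minimal sketch of the preserved-chunk behavior; the chunk shape is a hypothetical simplification of the chat streaming payload, not vLLM's actual types:

```python
# Sketch of the described behavior: when the tool parser suppresses a chunk's
# text by returning None, still emit a chunk with an empty delta if the token
# carried logprobs, instead of dropping the chunk (and its metadata) entirely.
def build_stream_chunk(parsed_text, logprobs):
    if parsed_text is None and logprobs is None:
        return None  # truly nothing to emit, skip the chunk
    return {
        "delta": {"content": parsed_text or ""},
        "logprobs": logprobs,  # per-token metadata survives suppression
    }


# A suppressed control token (e.g. <tool_call>) with logprobs is not dropped:
chunk = build_stream_chunk(parsed_text=None, logprobs={"<tool_call>": -0.02})
assert chunk == {"delta": {"content": ""}, "logprobs": {"<tool_call>": -0.02}}
```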
#37739 [Frontend] Fix default_chat_template_kwargs handling in Responses API — documentation,frontend — by sidsaha-ai (created: 2026-03-21 15:39 (UTC+8)) [💬2 | +313/-15, 10 files | commented:2] ## Summary
`--default-chat-template-kwargs` was already available in the shared render stack, but the `/v1/responses` serving path still dropped those defaults when building prompts and when instantiating the reasoning parser used to post-process non-streaming responses. This meant Responses API requests could still behave as if Qwen3 thinking was enabled even when the server was started with `--default-chat-template-kwargs '{"enable_thinking": false}'`, which in turn could leave `output_text`… -
#37734 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by simpx (created: 2026-03-21 12:59 (UTC+8)) [💬2 | +1248/-2, 3 files | commented:1] ## What does this PR do?
This PR adds LoRA fine-tuning support for the Qwen3-ASR model by implementing the `SupportsLoRA` interface.
## Why is this change needed?
Currently, attempting to serve Qwen/Qwen3-ASR with LoRA enabled fails because the model does not declare LoRA support. This change enables users to fine-tune Qwen3-ASR models with LoRA adapters.
## How was it tested?
…
-
#37733 Revert “[compile] Initialize passes at VllmBackend init” — no labels — by simon-mo (created: 2026-03-21 12:35 (UTC+8)) [+5/-19, 3 files | commented:2] Reverts vllm-project/vllm#35216
see #37732
[Merged PRs]
-
#37775 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing — bug,rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-22 11:22 (UTC+8)) [💬6 | +63/-1, 2 files | commented:2 approved:1]
- PR #37303 changed `num_prompt_tokens` in `InputBatch` from a plain `np.zeros()` array to a pinned-memory-backed numpy view (`torch.zeros(..., pin_memory=True).numpy()`). `get_pooling_metadata()` calls `torch.from_numpy(self.num_prompt_tokens[:self.num_reqs])`, which creates a tensor that shares the underlying pinned buffer rather than copying the data.
- Because pinned memory is used for async GPU transfers, the shared buffer can be modified between the time `prompt_lens` is created and wh…
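The aliasing hazard is easy to reproduce with plain numpy, since `torch.from_numpy` shares memory the same way a numpy slice does; the buffer name below is illustrative, not vLLM's:

```python
# View-vs-copy aliasing, the core of the bug described above: a view of a
# reused buffer changes under you, while an explicit copy stays stable.
import numpy as np

buf = np.zeros(4, dtype=np.int64)  # stand-in for the pinned staging buffer
buf[:2] = [7, 9]                   # prompt lengths for the current batch

view = buf[:2]                     # shares the underlying buffer (the bug)
snapshot = buf[:2].copy()          # defensive copy (the shape of the fix)

buf[0] = 123                       # buffer reused, e.g. for the next batch

assert view[0] == 123              # the view silently changed
assert snapshot[0] == 7            # the copy stays deterministic
```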
-
#37128 [MoE Refactor] Mxfp4 oracle rebased — documentation,rocm,ready,gpt-oss,nvidia — by zyongye (merged: 2026-03-21 11:37 (UTC+8)) [💬5 | +1695/-1369, 18 files | commented:9 approved:1] ## Purpose
Rebased and improved version of #34983. Ongoing MXFP4 MoE refactor:
- Refactor MXFP4 MoE from a monolithic 1299-line `Mxfp4MoEMethod` class to the oracle pattern used by FP8 and NvFP4
- Create `oracle/mxfp4.py` with backend selection, weight conversion, quant config, and kernel assembly
- Create `TrtLlmMxfp4ExpertsMonolithic` wrapping `trtllm_fp4_block_scale_moe()` (both BF16 and MXFP8 input modes)
- Create `OAITritonMxfp4ExpertsMonolithic` wrapping `triton_kernel_moe_forward()`…
-
#37774 [ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-22 09:42 (UTC+8)) [+2/-2, 1 files | commented:1 approved:1] Addresses regression from: #32700
Fix a syntax error in `run-amd-test.sh` where the `kernels/moe` ignore block inside `apply_rocm_test_overrides` was missing the closing `"` on the `cmds` string assignment.
cc @kenroche
-
#37759 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ — ready,nvidia — by robertgshaw2-redhat (merged: 2026-03-22 07:15 (UTC+8)) [+2/-2, 3 files | commented:1 approved:1] ## Summary
- Moves `flashinfer_cutedsl_moe.py` from the `fused_moe/` root into the `fused_moe/experts/` subdirectory, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates the import in `oracle/nvfp4.py` to point to the new location
- Updates the import in `tests/kernels/moe/test_cutedsl_moe.py` to point to the new location
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model…
-
#32700 [Quantization][Deprecation] Remove PTPC FP8 — rocm,ready,ci/build — by robertgshaw2-redhat (merged: 2026-03-22 06:10 (UTC+8)) [💬3 | +1/-196, 5 files | commented:2 approved:2]
## Purpose
- Now that 0.14 is out with the deprecation notice, remove PTPC FP8 completely in 0.15
-
#37768 Revert “Consolidate AWQ quantization into single awq_marlin.py file” — no labels — by robertgshaw2-redhat (merged: 2026-03-22 05:20 (UTC+8)) [+279/-253, 3 files | commented:1] REVERT: accidental push
-
#32104 Add tensor IPC transfer mechanism for multimodal data — performance,frontend,ready,v1,multi-modality — by brandonpelfrey (merged: 2026-03-22 04:10 (UTC+8)) [💬34 | +1430/-25, 13 files | commented:10] Introduce Multimodal Content Tensor IPC/SHMEM Data Path
Following on from a request to break down the RFC/PR in https://github.com/vllm-project/vllm/pull/31925, this PR introduces an IPC/SHMEM pathway for sending multimodal content from API Server -> CoreEngine processes via multiprocessing Queues. Part of the intention of this change is to reduce the number of changes in the original PR and introduce easier-to-review components which are required for the complete solution.
Note that this pat…
-
#37756 [Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning — ready — by mmangkad (merged: 2026-03-22 03:43 (UTC+8)) [💬1 | +23/-0, 3 files | commented:1 approved:2] ## Summary
Add benchmarked optimal all-reduce communicator config values for SM 10.3 (B300/GB300).
Depends on #37755 for allreduce fusion to be auto-enabled on SM 10.3.
## Config Values
| Config | ws=2 | ws=4 | ws=6 | ws=8 |
|--------|------|------|------|------|
…
-
#37755 [Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) — ready — by mmangkad (merged: 2026-03-22 03:40 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Summary
Enable allreduce fusion by default for SM 10.3 (B300/GB300) by using `is_device_capability_family(100)` instead of `is_device_capability(100)` in `enable_allreduce_rms_fusion`.
## Accuracy Test
```
vllm serve nvidia/DeepSeek-V3.2-NVFP4
  --tensor-parallel-size 2
  --tokenizer-mode deepseek_v32
  …
```
-
#37721 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-21 15:27 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
Follow-up for:
- #34839
Updated the GSM8K correctness eval in the AMD CI pipeline to use the `models-mi3xx-fp8-and-mixed.txt` config list instead of `models-mi3xx-fp8.txt`. Addresses failure in `mi355_2: LM Eval Small Models (B200-MI325)`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-713e-4e07-bcb9-9b38689611a0&tab=output
cc @kenroche
-
#37722 quick fix for 37665 — no labels — by xuechendi (merged: 2026-03-21 21:08 (UTC+8)) [+6/-4, 1 files | commented:1 approved:1] ## Purpose
Missing `set_current_vllm_config`, so we catch the error below:
```
File "/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__ s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig Assertion failed, Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_co…
```
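A minimal sketch of the current-config pattern involved, with names modeled on (but not identical to) vLLM's `set_current_vllm_config`/`get_current_vllm_config`; the failure mode is exactly the assertion quoted in the traceback above:

```python
# Hedged sketch of the "current config" context pattern: code that needs the
# config must run inside the context manager; calling the getter outside it
# trips an assertion, which is the error this PR fixed by adding the
# missing set_current_vllm_config wrapper.
from contextlib import contextmanager

_current_config = None


@contextmanager
def set_current_config(cfg):
    global _current_config
    prev, _current_config = _current_config, cfg
    try:
        yield
    finally:
        _current_config = prev  # restore on exit, supports nesting


def get_current_config():
    assert _current_config is not None, "Current config is not set"
    return _current_config


with set_current_config({"model": "demo"}):
    assert get_current_config()["model"] == "demo"

# Outside the context, the getter raises, mirroring the reported error:
failed = False
try:
    get_current_config()
except AssertionError:
    failed = True
assert failed
```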
-
#37318 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() — ready,v1 — by fuscof-ibm (merged: 2026-03-21 17:29 (UTC+8)) [+11/-5, 2 files | commented:1 approved:1] ## Purpose
`get_mamba_groups()` is now called only once during `MambaCopyBuffers.create()`, and the result is reused in both `preprocess_mamba()` and `postprocess_mamba()` rather than being recomputed on every batch.
## Test Plan
python -m pytest tests/v1/worker/test_mamba_utils.py -v
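The compute-once pattern can be sketched generically; the class and function names below are stand-ins shaped after the PR description, not vLLM's implementation:

```python
# Sketch of the optimization: the expensive lookup runs once in create(),
# and both the pre- and post-processing steps reuse the stored result
# instead of recomputing it per batch.
class MambaCopyBuffers:
    def __init__(self, groups):
        self.groups = groups  # cached once, reused below

    @classmethod
    def create(cls, get_groups):
        return cls(groups=get_groups())  # the single call site

    def preprocess(self):
        return len(self.groups)  # reuses the cached result

    def postprocess(self):
        return len(self.groups)  # reuses the cached result


calls = []


def get_mamba_groups():  # stand-in for the real (expensive) lookup
    calls.append(1)
    return ["g0", "g1"]


bufs = MambaCopyBuffers.create(get_mamba_groups)
for _ in range(3):  # many batches...
    bufs.preprocess()
    bufs.postprocess()
assert len(calls) == 1  # ...but only one lookup ever happens
```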
-
#34692 [ROCm] Enable DeepEP ROCm as all2all backend for AMD GPUs — documentation,rocm,ready,ci/build — by lcskrishna (merged: 2026-03-21 15:32 (UTC+8)) [💬26 | +68/-29, 7 files | commented:4 approved:2 changes:2] This PR integrates the changes required to run DeepEP as an all2all backend on AMD GPUs.
Co-authored by @itej89.
The following changes are performed:
- The current codebase is modified to run DeepEP backend OOB on AMD GPUs without any issues.
- FusedBatchedMoE kernels can run on AMD GPUs with float8_e4m3fnuz format added.
Related steps: …
-
#37424 [Responses API] Add kv_transfer_params for PD disaggregation — frontend,ready,kv-connector — by bongwoobak (merged: 2026-03-21 13:48 (UTC+8)) [💬4 | +29/-2, 3 files | commented:4 approved:1] ## Purpose
Add `kv_transfer_params` to the Responses API (`/v1/responses`) for PD disaggregation support. The Chat Completions API already supports this, but the Responses API does not, preventing PD disaggregation through it.
Follows the same pattern as `ChatCompletionRequest`/`ChatCompletionResponse`:
- `ResponsesRequest`: add `kv_transfer_params` field, inject into `SamplingParams.extra_args`
- `ResponsesResponse`: add `kv_transfer_params` field, populate from engine output
- All 4 context …
- #37610 [ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 — rocm,ready,ci/build,multi-modality — by AndreasKaratzas (merged: 2026-03-21 12:57 (UTC+8)) [💬6 | +19/-15, 2 files | commented:2 approved:1]
Follow-up for:
- #34839
Fixes OOM in `mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1a19-4174-b4a1-c9bbfff0c164&tab=output
@kenroche
-
#37733 Revert “[compile] Initialize passes at VllmBackend init” — no labels — by simon-mo (merged: 2026-03-21 12:35 (UTC+8)) [+5/-19, 3 files | commented:2] Reverts vllm-project/vllm#35216
see #37732
- #37617 [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-03-21 11:58 (UTC+8)) [💬2 | +35/-3, 1 files | commented:1 approved:1]
Follow-up for:
- #34839
Fixes test collection that unconditionally imports `CudaPlatform` and `RocmPlatform` at module level. Addresses failure in `mi355_1: Kernels (B200-MI355)`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1ac3-4614-993f-cc80ab9b2c57&tab=output
cc @kenroche
- #37058 [Frontend] Remove librosa from audio dependency — performance,rocm,frontend,ready,ci/build,multi-modality — by Isotr0py (merged: 2026-03-21 11:36 (UTC+8)) [💬4 | +247/-188, 18 files | commented:3 approved:1]
## Purpose
Related issues:
- https://github.com/vllm-project/vllm-omni/issues/1013
- https://github.com/vllm-project/vllm-omni/issues/1725
Remove `librosa` from the audio dependency, so downstream projects that need audio support, like `vllm-omni`, can avoid being affected by the LGPL.
## Test Plan ``` pytest -s -v tests/multimodal/test_audio.py pytest -s -v tests/multimodal/meida/test_audio.py …
[Closed unmerged PRs]
-
#37669 WIP: [openapi] enable scaling ep only when api_server_count is 1 — frontend — by andyxning (closed: 2026-03-22 10:58 (UTC+8)) [💬4 | +21/-12, 1 files | commented:1] ## Purpose
Scaling EP cannot be enabled when the API server count is above 1, because there is an in-memory status `_scaling_elastic_ep` per API server process. This is similar to dynamic LoRA updating, which also only works with an API server count of 1. https://github.com/vllm-project/vllm/blob/b4c1aef21c1a4cb252e7a440b3f9b0baebefbbef/vllm/entrypoints/cli/serve.py#L273-L276
Suppose an API server count of 2: if `/scale_elastic_ep` is served by API server process 1, then when the `/is_scali…
- #28813 build: align CUDA 12.1 xformers wheel pin — ci/build,stale,nvidia — by m0nk111 (closed: 2026-03-22 10:14 (UTC+8)) [💬6 | +2/-1, 1 files | commented:3]
## Summary
- Bump the `xformers` requirement on CUDA 12.1 builds to the internally built `0.0.33+5d4b92a5.d20251106` wheel
- Add a comment pointing to the upstream release tag we use for the build
This isolates the build-only change that was previously bundled inside #28241.
## Testing
- not run (dependency pin update only)
-
#37726 Revert “[Model] Deprecate the score task (this will not affect users).” (#37537) — documentation,frontend,v1 — by zhewenl (closed: 2026-03-22 10:08 (UTC+8)) [💬4 | +163/-184, 22 files | commented:1 | 📝 draft] ## Revert of #37537
This reverts https://github.com/vllm-project/vllm/pull/37537 (merge commit ed359c497a728f08b5b41456c07a688ccd510fbc).
Reason: This PR is linked to 1 new CI failure in build #57332:
- Language Models Test (MTEB) — `nvidia/llama-nemotron-rerank-1b-v2` rerank MTEB score dropped marginally (diff=0.0023 vs atol=0.002), causing `test_rerank_models_mteb[model_info0]` to fail.
The PR changed pooler heads, activations, and scori…
-
#37771 [Quant] Consolidate AWQ quantization into a single file — documentation,performance,rocm,needs-rebase,ci/build,gpt-oss,nvidia — by robertgshaw2-redhat (closed: 2026-03-22 05:45 (UTC+8)) [💬2 | +3132/-2511, 83 files | commented:1 | 📝 draft] ## Summary
- Merges `awq.py` and `awq_marlin.py` into a single consolidated module (`awq_marlin.py`), following the same structure as `gptq_marlin.py`
- Adds `AWQMarlinLinearKernel` in `vllm/model_executor/kernels/linear/mixed_precision/awq_marlin.py`, so `AWQMarlinLinearMethod` now delegates weight-repacking and inference through the kernel abstraction instead of calling marlin utils directly
- Extracts shared AWQ weight-parameter creation (`_create_awq_weight_params`) used by both the legacy a…
-
#37769 [Quantization] Merge gptq_marlin into gptq, remove slow GEMM kernel — performance,rocm — by robertgshaw2-redhat (closed: 2026-03-22 05:26 (UTC+8)) [+753/-1185, 25 files | commented:1] ## Summary
- `gptq.py` is replaced by the former `gptq_marlin.py` (Marlin kernel). `GPTQConfig.get_name()` now returns `"gptq"`, and `override_quantization_method` is removed (no longer needed since gptq IS Marlin). `gptq_marlin.py` is reduced to a thin backward-compat re-export shim. `--quantization gptq_marlin` auto-converts to `gptq` with a deprecation warning in `_verify_quantization`.
- The old exllama/GEMM-based `GPTQConfig`/`GPTQLinearMethod` are removed; backward-compa…
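The "thin backward-compat re-export shim" pattern described above can be sketched generically; the module layout and names here are illustrative, not vLLM's actual files:

```python
# Generic sketch of a deprecation shim: the old entry point warns and
# forwards to the consolidated implementation, so external callers keep
# working while being nudged to the new name.
import warnings


# --- consolidated module (standing in for the merged gptq implementation) ---
class GPTQConfig:
    @staticmethod
    def get_name():
        return "gptq"


# --- shim standing in for the old module's entry point ---
def load_legacy_config():
    warnings.warn(
        "'gptq_marlin' is deprecated; use 'gptq' instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return GPTQConfig  # re-export of the consolidated class


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cfg = load_legacy_config()

assert cfg.get_name() == "gptq"                          # same behavior
assert issubclass(caught[0].category, DeprecationWarning)  # but with a warning
```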
-
#37767 [Quantization] Merge gptq_marlin into gptq, deprecate gptq_marlin — performance — by robertgshaw2-redhat (closed: 2026-03-22 05:06 (UTC+8)) [+730/-1176, 21 files | commented:1] ## Summary
- `gptq_marlin.py` is a strict superset of `gptq.py` — the marlin kernel handles all GPTQ-compatible checkpoints on sm75+, and the 4-bit gptq_gemm fallback was already marked buggy in the old code. This PR merges them.
- Delete `gptq_marlin.py`; rewrite `gptq.py` with the marlin-based implementation. `GPTQMarlinConfig` → `GPTQConfig` (`get_name()` returns `"gptq"`); backward-compat aliases preserved for any external callers (`GPTQMarlinConfig`, `GPTQMarlinLinearMethod`, `GPT…
-
#34019 [Quantization][Refactor] Clean up GPTQ + AWQ quantization — no labels — by mu-hashmi (closed: 2026-03-22 04:40 (UTC+8)) [💬1 | +40/-36, 3 files | commented:1 | 📝 draft] ## Purpose
Addresses #31689. Building on `gptq_marlin.py` per @robertgshaw2-redhat's guidance to consolidate GPTQ/AWQ quantization and remove the legacy code paths.
So far: widened `GPTQMarlinConfig` to handle all GPTQ bit-widths and symmetry configs through the `MPLinearKernel` abstraction, added GPTQv2 checkpoint format support, and enabled 2/3-bit types in `ExllamaLinearKernel`.
Still TODO: validate all kernel backends end-to-end, remove `gptq.py`, convert `moe_wna16.py` into a modular kernel…
#37751 [Quantization] Route SM120 GPUs to CUTLASS MXFP4 MoE backend — needs-rebase,nvidia — by Tib-Gridello (closed: 2026-03-21 23:49 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1] ## Summary
SM120 GPUs (RTX PRO 6000 Blackwell, RTX 5090) fall back to the slow Marlin backend for MXFP4 MoE models because `get_mxfp4_backend()` only checks `is_device_capability_family(100)`. SM120 is family 120, not 100, so it's excluded.
FlashInfer already ships SM120 CUTLASS kernel infrastructure (`gen_cutlass_fused_moe_sm120_module`, `CutlassTileConfigSM120`, `COMPILE_BLACKWELL_SM120_TMA_GROUPED_GEMMS`). This 1-line change routes SM120 to that existing path.
## Change
Add `is_device_cap…
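The capability-family logic at issue can be sketched as follows, assuming compute capability is encoded as major*10+minor (an assumption for illustration, not vLLM's exact code):

```python
# Sketch of why the family check matters: a family-100 check matches
# SM 10.0/10.3 (B200/B300) but not SM 12.0, so SM120 GPUs need their own
# family added to the backend-selection condition.
def capability_family(capability: int) -> int:
    # 103 -> 100, 120 -> 120: the family is the major version * 10.
    return (capability // 10) * 10


def is_device_capability_family(capability: int, family: int) -> bool:
    return capability_family(capability) == family


def get_mxfp4_backend(capability: int) -> str:
    if is_device_capability_family(capability, 100) or \
       is_device_capability_family(capability, 120):  # the 1-line addition
        return "cutlass"
    return "marlin"  # slow fallback for everything else


assert get_mxfp4_backend(103) == "cutlass"  # SM 10.3 (GB300) family
assert get_mxfp4_backend(120) == "cutlass"  # SM 12.0 now routed to CUTLASS
assert get_mxfp4_backend(90) == "marlin"    # Hopper keeps the fallback
```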
-
#37377 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by karanb192 (closed: 2026-03-21 14:01 (UTC+8)) [💬2 | +20/-1, 2 files | commented:1] ## Summary
- Adds the `SupportsLoRA` mixin to `Qwen3ASRForConditionalGeneration`, enabling `--enable-lora` for `Qwen/Qwen3-ASR-0.6B` and similar models
- Adds `packed_modules_mapping` and `embedding_modules` class attributes, following the same pattern as `Qwen3ForCausalLM`, `Qwen2_5OmniThinkerForConditionalGeneration`, and `Qwen3VLForConditionalGeneration`
- The model already has `get_mm_mapping()`, which correctly identifies language model and tower model prefixes for LoRA weight routing
- Marks L…
-
#37734 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by simpx (closed: 2026-03-21 14:01 (UTC+8)) [💬2 | +1248/-2, 3 files | commented:1] ## What does this PR do?
This PR adds LoRA fine-tuning support for the Qwen3-ASR model by implementing the `SupportsLoRA` interface.
## Why is this change needed?
Currently, attempting to serve Qwen/Qwen3-ASR with LoRA enabled fails because the model does not declare LoRA support. This change enables users to fine-tune Qwen3-ASR models with LoRA adapters.
## How was it tested?
…