Qwen/Qwen3.5-35B-A3B

Compact Qwen3.5 multimodal MoE (35B total / 3B active) with gated delta networks, 256 experts, and 262K context

Compact Qwen3.5 MoE: FP8 serving on a single GPU, BF16 serving on 2x GPUs or 2x Xeon 6 NUMA nodes

moe | 35B / 3B | 262,144 ctx | vLLM 0.17.0+ | multimodal | text

Overview

Qwen3.5-35B-A3B is the smallest MoE model in the Qwen3.5 family. It shares the family's gated delta network architecture and has 35B total parameters, 3B of which are activated per token, and 256 experts. With FP8 weights, it fits on a single 80 GB GPU and supports the full 262K (262,144-token) context.

Prerequisites

  • vLLM version: >= 0.17.0
  • Hardware (BF16): 1x H200, 2x H100, or 2x Xeon 6 NUMA nodes
  • Hardware (FP8): 1x H100 or 1x H200
  • Hardware (Int4): 1x 24 GB GPU
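
The hardware tiers above follow from simple weight-size arithmetic. The sketch below is a rough, weight-only estimate: it assumes 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for Int4, and it ignores the KV cache, activations, and runtime overhead. Quantized checkpoints often keep some layers in higher precision, so real footprints tend to be somewhat larger. The helper name `weight_gb` is hypothetical.

```python
# Rough weight-only memory estimate. Assumed bytes per parameter for each
# precision; real checkpoints may deviate (mixed-precision layers, scales).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(total_params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes) for a given precision."""
    return total_params_billion * BYTES_PER_PARAM[dtype]


# All 35B parameters must be resident in memory, even though only 3B are
# activated per token, so the total count is what drives the estimate.
for dtype in ("bf16", "fp8", "int4"):
    print(f"{dtype}: ~{weight_gb(35, dtype):.1f} GB of weights")
```

Under these assumptions the weights take about 70 GB in BF16, 35 GB in FP8, and 17.5 GB in Int4. The remaining memory on each listed device goes to the KV cache and runtime overhead.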

Pip Install

NVIDIA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

CPU

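For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions. The commands below look up the latest vLLM release tag through the GitHub API (this needs curl and jq) and install the matching CPU wheel.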

uv venv
source .venv/bin/activate
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu

Launching the Server

Single-GPU FP8

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

BF16 on 2xH200 (TP2)

vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for Qwen/Qwen3.5-35B-A3B:

docker run -itd --name qwen35b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model Qwen/Qwen3.5-35B-A3B \
    --host 0.0.0.0 \
    --port 8000
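
Because the container starts detached, the server may still be loading weights when the command returns. A minimal readiness check, assuming the server is reachable at http://localhost:8000 and exposes vLLM's `/health` endpoint, could look like the sketch below. The function names `is_ready` and `wait_until_ready` and the retry settings are illustrative choices, not part of vLLM.

```python
import time
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all derive
        # from OSError; treat any of them as "not ready yet".
        return False


def wait_until_ready(base_url, attempts=60, delay=5.0, probe=is_ready, sleep=time.sleep):
    """Poll the server until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:8000")` before sending the first request avoids connection errors during model load. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.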

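MTP speculative decoding

MTP (multi-token prediction) speculative decoding is enabled through --speculative-config; the setting below drafts one speculative token per step.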

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3

Client Usage

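from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    # Must match the served model name, e.g. "Qwen/Qwen3.5-35B-A3B-FP8" for the FP8 launch.
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
    max_tokens=512,
)
# Print the final answer text from the first choice.
print(resp.choices[0].message.content)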
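
When the server runs with --reasoning-parser qwen3, the reasoning trace is returned separately from the final answer. The sketch below shows one way to read both. It assumes the trace arrives as an extra message attribute named either `reasoning` or `reasoning_content`; which one is used depends on the vLLM version, so the code checks both. The helper `split_reasoning` is hypothetical, and the demo uses a stand-in object instead of a live response.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    The attribute name for the reasoning trace is an assumption that varies
    across vLLM versions, so both candidates are tried.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, message.content


# Stand-in for resp.choices[0].message, so the sketch runs without a server.
demo = SimpleNamespace(content="Final answer.", reasoning_content="Step-by-step thoughts.")
thoughts, answer = split_reasoning(demo)
print("reasoning:", thoughts)
print("answer:", answer)
```

With a live server, pass `resp.choices[0].message` in place of `demo`. If reasoning is disabled, the first element is simply `None`.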

Troubleshooting

  • CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
  • Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}'.
  • Prefix Caching (Mamba): currently experimental in "align" mode.
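
The --default-chat-template-kwargs flag changes the default for every request. As a hedged alternative, vLLM's OpenAI-compatible server also accepts `chat_template_kwargs` per request, which the OpenAI Python client can send through `extra_body`. The sketch below only builds the request arguments; the helper name `build_chat_kwargs` and its defaults are hypothetical.

```python
def build_chat_kwargs(prompt, model="Qwen/Qwen3.5-35B-A3B", thinking=True, max_tokens=512):
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if not thinking:
        # extra_body fields are forwarded as-is to the server; vLLM passes
        # chat_template_kwargs on to the model's chat template.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs


# Example: one request without the reasoning phase.
request = build_chat_kwargs("Summarize MoE routing in two sentences.", thinking=False)
print(request["extra_body"])
```

A request is then sent with `client.chat.completions.create(**request)`. This keeps reasoning enabled by default while letting latency-sensitive calls opt out.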

References