Qwen/Qwen3.5-35B-A3B

Compact Qwen3.5 multimodal MoE (35B total / 3B active) with gated delta networks, 256 experts, and 262K context

Compact Qwen3.5 MoE: FP8 serving on a single GPU, BF16 serving on 2x GPU or 2x Xeon 6 NUMA nodes

moe | 35B / 3B | 262,144 ctx | vLLM 0.17.0+ | multimodal | text

Guide

Overview

Qwen3.5-35B-A3B is the smallest MoE in the Qwen3.5 family. It shares the family's gated delta networks architecture and has 35B total parameters spread over 256 experts, of which 3B are activated per token. With FP8 weights it fits on a single 80 GB GPU and supports the full 262K context.

Prerequisites

vLLM version: >= 0.17.0
Hardware (BF16): 1x H200, 2x H100, or 2x Xeon 6 NUMA nodes
Hardware (FP8): single H100/H200
Hardware (Int4): single 24 GB GPU
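
The hardware tiers above follow roughly from weight size alone. The sketch below is a back-of-the-envelope estimate, assuming 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for Int4; it ignores the KV cache, activations, and runtime overhead, and real quantized checkpoints usually keep some layers at higher precision, so actual usage will be higher.

```python
# Rough weight-only footprint for a 35B-parameter model.
# Assumption: every parameter is stored at the stated precision.
TOTAL_PARAMS = 35e9  # total parameters, taken from the model name (35B)

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(precision: str, params: float = TOTAL_PARAMS) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for name in BYTES_PER_PARAM:
    print(f"{name}: ~{weight_gb(name):.1f} GB of weights")
```

At about 70 GB, BF16 weights leave little room for the KV cache on one 80 GB GPU, which is consistent with the 1x H200 or 2x H100 recommendation. About 35 GB in FP8 leaves headroom on a single 80 GB GPU, and about 17.5 GB in Int4 sits under 24 GB.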

Pip Install

NVIDIA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

CPU

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

uv venv
source .venv/bin/activate
# Resolve the latest release tag (requires curl and jq) and strip the leading "v".
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
# Install the matching pre-built CPU wheel.
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu

Launching the Server

Single-GPU FP8

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

BF16 on 2x H200 (TP2)

vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for Qwen/Qwen3.5-35B-A3B:

docker run -itd --name qwen35b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model Qwen/Qwen3.5-35B-A3B \
    --host 0.0.0.0 \
    --port 8000

MTP Speculative Decoding

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
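
Loading the checkpoint can take a while, so a client script may want to wait for the server before sending requests. The following is a minimal sketch, assuming the server listens on the default port 8000 and exposes the `/health` endpoint of vLLM's OpenAI-compatible server; `wait_until_ready`, `is_healthy`, and their parameters are hypothetical helper names, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    attempts: int = 60,
    delay: float = 5.0,
    probe=is_healthy,
    sleep=time.sleep,
) -> bool:
    """Poll the server until it reports healthy or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready()` before creating the client blocks for at most `attempts * delay` seconds. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.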

Client Usage

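from openai import OpenAI

# The key is a placeholder; vLLM only checks it when started with --api-key.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    # Must match the served checkpoint, e.g. "Qwen/Qwen3.5-35B-A3B-FP8" for the FP8 launch.
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)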
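
When the server runs with `--reasoning-parser qwen3`, the model's reasoning is returned separately from the final answer. The sketch below shows one way to read both. The attribute name is an assumption: depending on the vLLM version it may be `reasoning` or `reasoning_content`, so the hypothetical helper `split_reasoning` checks both. A stand-in object replaces `resp.choices[0].message` so the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning text is exposed as `reasoning` or
    `reasoning_content`; if neither is present, reasoning is None.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, message.content


# Stand-in for resp.choices[0].message from a real response.
demo = SimpleNamespace(content="Final answer.", reasoning_content="Step-by-step notes.")
thoughts, answer = split_reasoning(demo)
print("reasoning:", thoughts)
print("answer:", answer)
```

With a live server, pass `resp.choices[0].message` instead of `demo`.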

Troubleshooting

CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}'.
Prefix Caching (Mamba): currently experimental in "align" mode.
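
Reasoning can also be switched off for a single request instead of server-wide. The following is a sketch, assuming the server accepts `chat_template_kwargs` in the request body (a vLLM extension to the OpenAI API) and that the chat template honors `enable_thinking` as the server flag above suggests. `build_chat_request` and `send` are hypothetical helper names.

```python
def build_chat_request(prompt, thinking=True, model="Qwen/Qwen3.5-35B-A3B", max_tokens=512):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # extra_body is merged into the JSON request body by the OpenAI client.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


def send(client, prompt, thinking=True):
    """Send one request using an existing openai.OpenAI client."""
    resp = client.chat.completions.create(**build_chat_request(prompt, thinking))
    return resp.choices[0].message.content


request = build_chat_request("Summarize MoE routing in two sentences.", thinking=False)
print(request["extra_body"])
```

With the client from the Client Usage section, `send(client, "...", thinking=False)` issues the request. A per-request setting is useful when only some calls need the shorter, non-reasoning response.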