moonshotai/Kimi-K2.5

Open-source native multimodal agentic MoE model with vision-language understanding, tool calling, and thinking modes

Multimodal agentic MoE model with DeepSeek-V3 backbone and MLA attention

moe · 1T / 32B · 262,144 ctx · vLLM 0.19.1+ · multimodal · text

Overview

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It combines vision and language understanding with agentic capabilities, supports both instant and thinking modes, and works in conversational as well as agentic settings.

Prerequisites

  • vLLM version: >= 0.15.0 (speculative decoding with Eagle3 requires >= 0.18.0)
  • Hardware (BF16): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
  • Hardware (NVFP4): 4x Blackwell GPUs (e.g. GB200)
  • AMD support: 8x MI300X / MI325X / MI355X with ROCm 7.2.1 and Python 3.12

Install vLLM

Pip (NVIDIA):

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto

Pip (AMD ROCm):

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Docker (NVIDIA):

docker pull vllm/vllm-openai:latest

AMD MI300X/MI325X

On 8x MI300X or MI325X (gfx942), use the standard W4A16 MoE path with AITER and INT4 QuickReduce.

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

vllm serve moonshotai/Kimi-K2.5 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --mm-encoder-tp-mode data
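
Loading a model of this size can take a long time, so it may help to wait for the server before sending requests. The following is a minimal sketch, assuming the server started by the command above listens on localhost:8000 and exposes vLLM's `/health` endpoint; `http_probe` and `wait_until_ready` are hypothetical helper names, and the timeout values are placeholders rather than measured load times.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=1800.0,
                     interval_s=10.0, probe=http_probe, sleep=time.sleep):
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` in a launch script before the first request returns `True` once the health check passes and `False` if the deadline is reached first.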

AMD MI350X/MI355X

On 8x MI350X or MI355X (gfx950), add --moe-backend flydsl to use the optimized FlyDSL W4A16 MoE kernel. Keep LoRA disabled for this path.

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --moe-backend flydsl \
  --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'

Notes:

  • The FlyDSL INT4 MoE path does not support expert parallelism; do not add --enable-expert-parallel.
  • Keep --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'; it is required for this FlyDSL path on MI350X / MI355X.
  • vLLM has tuned MI350X/MI355X FlyDSL configs for this Kimi shape at TP=8 and TP=4.
  • Keep vLLM's default block size unless you are tuning long-context throughput; --block-size 64 is safe to try.
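
When the server is launched from a script, the nested quoting of `--compilation-config` is easy to get wrong. One possible approach, sketched here under the assumption that the flags are exactly those of the MI350X / MI355X command above, is to build the argument list in Python and let `json.dumps` produce the JSON value. `build_flydsl_command` and `FLYDSL_ENV` are hypothetical names introduced for illustration.

```python
import json
import shlex

# Environment variables taken from the recipe above.
FLYDSL_ENV = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
}


def build_flydsl_command(tensor_parallel_size: int = 8) -> list[str]:
    """Return the `vllm serve` argv for the FlyDSL W4A16 MoE path.

    Expert parallelism is deliberately left out, since the notes above
    say the FlyDSL INT4 MoE path does not support it.
    """
    compilation_config = {"pass_config": {"fuse_allreduce_rms": False}}
    return [
        "vllm", "serve", "moonshotai/Kimi-K2.5",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--mm-encoder-tp-mode", "data",
        "--moe-backend", "flydsl",
        # json.dumps emits lowercase `false`, as the CLI expects.
        "--compilation-config", json.dumps(compilation_config),
    ]


cmd = build_flydsl_command()
# Print a shell-safe rendering for inspection or copy-paste.
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, env={**os.environ, **FLYDSL_ENV})`, which avoids shell quoting entirely.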

Client Usage

Once the vLLM server is running, consume it via the OpenAI-compatible API:

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
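
If the server was started with `--enable-auto-tool-choice` and `--tool-call-parser kimi_k2`, as in the MI300X command above, the same endpoint should also accept OpenAI-style tool definitions. The sketch below uses only the standard library and assumes the server is at localhost:8000; the `get_weather` tool and the helper names are hypothetical and exist only for illustration.

```python
import json
import urllib.request


def build_tool_request(question: str) -> dict:
    """Build a chat-completions payload that offers one tool to the model."""
    return {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                # Hypothetical tool, not part of the recipe.
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to the chat-completions endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=3600) as resp:
        return json.load(resp)
```

With a live server, `extract_tool_calls(post_chat(build_tool_request("What is the weather in Paris?")))` would return whatever calls the model chose to make; an empty list means it answered directly.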

Troubleshooting

  • OOM errors: Lower --gpu-memory-utilization or adjust TP/EP to match your GPU count.
  • Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
  • Unique multimodal inputs: If inputs are rarely repeated, pass --mm-processor-cache-gb 0 to disable the multimodal processor cache and avoid its overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
  • MoE kernel tuning: Use the benchmark_moe script from vLLM to tune Triton kernels for your specific hardware.
  • Async scheduling: Enabled by default for better throughput. If you encounter issues, disable it and file a bug report with the vLLM project.
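
The multimodal cache advice above can be encoded as a small launch-time switch. This is only a sketch: `mm_cache_flags` is a hypothetical helper, and whether a workload counts as having repeated inputs is a judgment the operator has to make.

```python
def mm_cache_flags(inputs_repeat: bool) -> list[str]:
    """Pick multimodal processor cache flags based on the workload.

    Repeated inputs: keep the cache and place it in host shared memory.
    Unique inputs: disable the cache to avoid its overhead.
    """
    if inputs_repeat:
        return ["--mm-processor-cache-type", "shm"]
    return ["--mm-processor-cache-gb", "0"]


# Example: extend a base command for a workload of mostly unique images.
base_cmd = ["vllm", "serve", "moonshotai/Kimi-K2.5", "--mm-encoder-tp-mode", "data"]
full_cmd = base_cmd + mm_cache_flags(inputs_repeat=False)
print(" ".join(full_cmd))
```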

References