
MooncakeConnector Usage Guide

About Mooncake

Mooncake aims to improve the inference efficiency of large language models (LLMs), especially in environments with slow object storage, by building a multi-level caching pool on DRAM/SSD resources connected by high-speed interconnects. Unlike traditional caching systems, Mooncake uses (GPUDirect) RDMA to transfer data directly in a zero-copy manner while making full use of the multiple NICs on a single machine.

For more details about Mooncake, refer to the Mooncake project and the Mooncake documentation.

Prerequisites

Installation

Install mooncake through pip: uv pip install mooncake-transfer-engine.

Refer to the official Mooncake repository for more installation instructions.

Usage

Prefiller Node (192.168.0.2)

vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

Decoder Node (192.168.0.3)

vllm serve Qwen/Qwen2.5-7B-Instruct --port 8020 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Proxy

python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-host 192.168.0.2 --prefiller-port 8010 --decoder-host 192.168.0.3 --decoder-port 8020

NOTE: MooncakeConnector currently reuses the toy proxy from the nixl_integration tests. It will be replaced with a dedicated proxy in the future.

Now you can send requests to the proxy server through port 8000.
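As a minimal sketch of such a request, the snippet below builds an OpenAI-style completion request and sends it to the proxy. The proxy address `http://localhost:8000`, the `/v1/completions` path, and the helper names `build_completion_request` and `send` are assumptions for illustration, not taken from this guide; adjust them to your deployment.

```python
import json
import urllib.request

# Assumed proxy address: the guide only states that the proxy listens on port 8000.
PROXY_URL = "http://localhost:8000"


def build_completion_request(prompt, model="Qwen/Qwen2.5-7B-Instruct", max_tokens=32):
    """Build (but do not send) a POST request for the proxy.

    The /v1/completions path is an assumption based on the OpenAI-compatible API.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{PROXY_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the prefiller, decoder, and proxy to be running.
    print(send(build_completion_request("The capital of France is")))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any instance is up.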

Environment Variables

  • VLLM_MOONCAKE_BOOTSTRAP_PORT: Port for Mooncake bootstrap server

    • Default: 8998
    • Required only for prefiller instances
    • Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
    • For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank
    • Used by the decoder to notify the prefiller
  • VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)

    • Default: 480
    • If a request is aborted and the decoder has not yet notified the prefiller, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
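The per-worker port formula for VLLM_MOONCAKE_BOOTSTRAP_PORT can be checked with a short sketch. The helper names `bootstrap_port` and `node_ports` are hypothetical, and the sketch assumes `dp_rank` and `tp_rank` are the ranks counted on the node in question; it only reproduces the arithmetic stated above and is not the connector's own code.

```python
import os

DEFAULT_BOOTSTRAP_PORT = 8998  # documented default of VLLM_MOONCAKE_BOOTSTRAP_PORT


def bootstrap_port(base_port, dp_rank, tp_size, tp_rank):
    """Return base_port + dp_rank * tp_size + tp_rank for one worker."""
    if dp_rank < 0:
        raise ValueError("dp_rank must be non-negative")
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    return base_port + dp_rank * tp_size + tp_rank


def node_ports(base_port, dp_size, tp_size):
    """Map every (dp_rank, tp_rank) pair on a node to its bootstrap port."""
    return {
        (dp_rank, tp_rank): bootstrap_port(base_port, dp_rank, tp_size, tp_rank)
        for dp_rank in range(dp_size)
        for tp_rank in range(tp_size)
    }


if __name__ == "__main__":
    base = int(os.environ.get("VLLM_MOONCAKE_BOOTSTRAP_PORT", DEFAULT_BOOTSTRAP_PORT))
    for (dp_rank, tp_rank), port in node_ports(base, dp_size=2, tp_size=2).items():
        print(f"dp_rank={dp_rank} tp_rank={tp_rank} -> port {port}")
```

With the default base port and two DP ranks of two TP workers each, the workers occupy ports 8998 through 9001, so that whole range must be free on the prefiller host.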

KV Role Options

  • kv_producer: For prefiller instances that generate KV caches
  • kv_consumer: For decoder instances that consume KV caches from the prefiller
  • kv_both: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.
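To avoid typos when switching roles, the `--kv-transfer-config` value can be generated rather than written by hand. The following is a minimal sketch; `kv_transfer_config` and `serve_command` are hypothetical helpers, not part of vLLM, and they only assemble the same command lines shown in the Usage section.

```python
import json
import shlex

# The three roles listed in this guide.
VALID_KV_ROLES = ("kv_producer", "kv_consumer", "kv_both")


def kv_transfer_config(kv_role):
    """Return the compact JSON string passed to --kv-transfer-config."""
    if kv_role not in VALID_KV_ROLES:
        raise ValueError(f"unknown kv_role: {kv_role!r}")
    config = {"kv_connector": "MooncakeConnector", "kv_role": kv_role}
    return json.dumps(config, separators=(",", ":"))


def serve_command(model, port, kv_role):
    """Assemble a shell-quoted `vllm serve` command line for the given role."""
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", kv_transfer_config(kv_role),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(serve_command("Qwen/Qwen2.5-7B-Instruct", 8010, "kv_producer"))
    print(serve_command("Qwen/Qwen2.5-7B-Instruct", 8020, "kv_consumer"))
```

Validating the role up front surfaces a misspelled value before the server starts, instead of at connector initialization.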