Skip to content

Latest commit

 

History

History
100 lines (75 loc) · 2.95 KB

File metadata and controls

100 lines (75 loc) · 2.95 KB

LongCat-Flash Deployment Guide

We have implemented basic adaptations in both SGLang (PR) and vLLM (PR) to support the deployment of LongCat-Flash.

Due to its size of 560 billion parameters (560B), LongCat-Flash requires at least one node (e.g., 8xH20-141G) to host the model weights in FP8 format, and at least two nodes (e.g., 16xH800-80G) for BF16 weights. Detailed launch configurations are provided below.

SGLang

Single-Node Deployment

The model can be served on a single node using a combination of Tensor Parallelism and Expert Parallelism.

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Chat-FP8 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --enable-ep-moe \
    --tp 8

Multi-Node Deployment

In a multi-node setup, Tensor Parallelism and Expert Parallelism are employed, with additional parallel strategies planned for future implementation.

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Chat \
    --trust-remote-code \
    --attention-backend flashinfer \
    --enable-ep-moe \
    --tp 16 \
    --nnodes 2 \
    --node-rank $NODE_RANK \
    --dist-init-addr $MASTER_IP:5000

Note

Replace $NODE_RANK and $MASTER_IP with the specific values for your cluster.

Enabling Multi-Token Prediction (MTP)

To enable MTP with SGLang, add the following arguments to your launch command:

    --speculative-draft-model-path meituan-longcat/LongCat-Flash-Chat \
    --speculative-algorithm NEXTN \
    --speculative-num-draft-tokens 2 \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1

vLLM

While vLLM supports parallel strategies similar to SGLang, they use different syntax and parameter names for their launch commands.

Single-Node Deployment

vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
    --trust-remote-code \
    --enable-expert-parallel \
    --tensor-parallel-size 8

Multi-Node Deployment

# Node 0
vllm serve meituan-longcat/LongCat-Flash-Chat \
   --trust-remote-code \
   --tensor-parallel-size 8 \
   --data-parallel-size 2 \
   --data-parallel-size-local 1 \
   --data-parallel-address $MASTER_IP \
   --data-parallel-rpc-port 13345

# Node 1
vllm serve meituan-longcat/LongCat-Flash-Chat \
   --trust-remote-code \
   --tensor-parallel-size 8 \
   --headless \
   --data-parallel-size 2 \
   --data-parallel-size-local 1 \
   --data-parallel-start-rank 1 \
   --data-parallel-address $MASTER_IP \
   --data-parallel-rpc-port 13345

Note

Only worker nodes (Node 1, 2, etc.) require the --data-parallel-start-rank flag.

Enabling Multi-Token Prediction (MTP)

To enable MTP with vLLM, add the following arguments to your launch command:

    --speculative_config '{"model": "meituan-longcat/LongCat-Flash-Chat", "num_speculative_tokens": 1, "method":"longcat_flash_mtp"}'