We have implemented basic adaptations in both SGLang (PR) and vLLM (PR) to support the deployment of LongCat-Flash.
Due to its size of 560 billion parameters (560B), LongCat-Flash requires at least one node (e.g., 8xH20-141G) to host the model weights in FP8 format, and at least two nodes (e.g., 16xH800-80G) for BF16 weights. Detailed launch configurations are provided below.
The model can be served on a single node using a combination of Tensor Parallelism and Expert Parallelism.
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 8In a multi-node setup, Tensor Parallelism and Expert Parallelism are employed, with additional parallel strategies planned for future implementation.
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 16 \
--nnodes 2 \
--node-rank $NODE_RANK \
--dist-init-addr $MASTER_IP:5000Note
Replace $NODE_RANK and $MASTER_IP with the specific values for your cluster.
To enable MTP with SGLang, add the following arguments to your launch command:
--speculative-draft-model-path meituan-longcat/LongCat-Flash-Chat \
--speculative-algorithm NEXTN \
--speculative-num-draft-tokens 2 \
--speculative-num-steps 1 \
--speculative-eagle-topk 1While vLLM supports parallel strategies similar to SGLang, they use different syntax and parameter names for their launch commands.
vllm serve meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--enable-expert-parallel \
--tensor-parallel-size 8# Node 0
vllm serve meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port 13345
# Node 1
vllm serve meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--tensor-parallel-size 8 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port 13345Note
Only worker nodes (Node 1, 2, etc.) require the --data-parallel-start-rank flag.
To enable MTP with vLLM, add the following arguments to your launch command:
--speculative_config '{"model": "meituan-longcat/LongCat-Flash-Chat", "num_speculative_tokens": 1, "method":"longcat_flash_mtp"}'